HD Voice: crystal clear audio powered by AI

By Ferdinando Olivieri, 5 min read

At WebexOne23 in Anaheim, California, we announced HD Voice, an AI-powered technology that removes background noises and converts narrowband audio to wideband. The result is better audio quality and improved speech intelligibility when receiving calls over the Public Switched Telephone Network (PSTN). In this blog post, we delve into the details of the problem HD Voice solves, what it does, and the benefits to our customers.

The problem with narrowband audio

When we call our local pizza shop or our friends and family with our smartphones, chances are we are making a PSTN call. PSTN calls are largely based on narrowband audio: speech that has been band-limited so that only the lower portion of the frequency spectrum (i.e., up to 4 kHz) is preserved. The upper portion (i.e., 4 – 20 kHz), which contains cues that help us differentiate certain consonants (link), is lost. As a result, narrowband audio, especially when combined with background noise, may lead to phoneme confusion and therefore to lower speech intelligibility and audio quality. Wideband audio, on the other hand, preserves the speech spectrum up to 8 kHz and, in general, provides better speech intelligibility (link1, link2) and better accessibility for people with hearing loss than narrowband audio, among other benefits.
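To see where the 4 kHz and 8 kHz limits come from, here is a minimal sketch based on the Nyquist limit: a sampled signal can only represent frequencies up to half its sampling rate. The sampling rates below (8 kHz for G.711, 16 kHz for G.722) come from the codec standards rather than from this post, and the function names and the "fullband" label are our own, chosen for illustration only.

```python
def max_audio_frequency_hz(sample_rate_hz: float) -> float:
    """Nyquist limit: the highest frequency a sampled signal can represent."""
    return sample_rate_hz / 2


# Assumed sampling rates: G.711 (narrowband) samples at 8 kHz and
# G.722 (wideband) at 16 kHz. 48 kHz is the rate of the unprocessed
# speech clip discussed in this post. "fullband" is an illustrative label.
SAMPLE_RATES_HZ = {
    "narrowband": 8_000,
    "wideband": 16_000,
    "fullband": 48_000,
}


def bandwidth_table(rates=SAMPLE_RATES_HZ):
    """Map each audio category to the top of its representable spectrum."""
    return {name: max_audio_frequency_hz(rate) for name, rate in rates.items()}


def lost_band_hz(source_rate_hz: float, channel_rate_hz: float):
    """Return the (low, high) band the channel cannot carry, or None."""
    low = max_audio_frequency_hz(channel_rate_hz)
    high = max_audio_frequency_hz(source_rate_hz)
    return (low, high) if high > low else None


if __name__ == "__main__":
    for name, top_hz in bandwidth_table().items():
        print(f"{name}: up to {top_hz / 1000:g} kHz")
    print("Lost on a narrowband call:", lost_band_hz(48_000, 8_000))
```

In practice, codecs preserve slightly less than the Nyquist limit because of their anti-aliasing filters, so these figures are upper bounds.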

Let us dive deeper into this issue by looking into what happens when speech goes through a narrowband and wideband communication system. For simplicity, we will not consider the effects of background noise – which makes things worse in both the narrowband and wideband audio cases.

Figure 1 shows the spectrogram of an unprocessed speech snippet sampled at 48 kHz. Let us assume this is the speech signal that goes through the communication system. Note how the speech snippet shows frequency components across the whole spectrum.

When the speech goes through a narrowband codec (e.g., a PSTN call based on the G.711 codec), the resulting signal (Figure 2) loses everything above roughly 4 kHz, which is most of the original frequency spectrum.

On the other hand, when the signal in Figure 1 is processed with a wideband system, the resulting spectrum might look like the one in Figure 3. Note how the wideband signal is closer to the original in Figure 1 than the narrowband one in Figure 2.

If you listened to the three clips (original vs. processed ones), you would notice that the resulting audio quality in the wideband clip is much better than the narrowband one.
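For readers who want to reproduce the effect without the figures, the following is a rough sketch, not the processing used in any real codec. It assumes a synthetic "speech-like" signal made of two tones (1 kHz and 6 kHz) sampled at 48 kHz, and it models the narrowband and wideband systems as simple low-pass filters at 4 kHz and 8 kHz. Quantization, companding, and resampling are deliberately ignored, and all function names are our own.

```python
import math

SAMPLE_RATE_HZ = 48_000  # same rate as the unprocessed clip in this post


def tone_mix(freqs_hz, n_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sum of unit-amplitude sine tones: a crude stand-in for real speech."""
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate_hz) for f in freqs_hz)
        for n in range(n_samples)
    ]


def lowpass_fir(cutoff_hz, num_taps=101, sample_rate_hz=SAMPLE_RATE_HZ):
    """Hamming-windowed sinc low-pass filter taps (num_taps must be odd)."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for k in range(num_taps):
        x = k - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz


def convolve_valid(signal, taps):
    """Filter the signal, keeping only outputs where the taps fully overlap."""
    m = len(taps)
    return [
        sum(taps[j] * signal[i + m - 1 - j] for j in range(m))
        for i in range(len(signal) - m + 1)
    ]


def tone_amplitude(signal, freq_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Estimate one tone's amplitude by correlating with cosine and sine."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(s * math.cos(w * i) for i, s in enumerate(signal))
    im = sum(s * math.sin(w * i) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


def simulate(cutoff_hz):
    """Return the surviving amplitude of the 1 kHz and 6 kHz tones."""
    # 4900 input samples leave 4800 output samples after a 101-tap filter,
    # which is a whole number of cycles for both tones.
    original = tone_mix([1_000, 6_000], 4_900)
    filtered = convolve_valid(original, lowpass_fir(cutoff_hz))
    return {f: tone_amplitude(filtered, f) for f in (1_000, 6_000)}


if __name__ == "__main__":
    print("narrowband (4 kHz cutoff):", simulate(4_000))
    print("wideband   (8 kHz cutoff):", simulate(8_000))
```

Under these assumptions, the 6 kHz tone all but disappears in the narrowband case and survives almost untouched in the wideband case, while the 1 kHz tone passes through both. This is the same pattern the spectrograms in Figures 2 and 3 describe.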

Narrowband vs. Wideband audio: a “visual” analogy

To understand the relation between narrowband and wideband audio, let us use an analogy based on a “visual” example.

Picture this: it is 1958, and you are watching TV. NBC President Robert Sarnoff is introducing President Eisenhower, who is set to address the nation shortly after. The TV program is shown in black and white (B&W). Suddenly, Sarnoff pushes a button and, just like that, you see the picture in color.

You just witnessed one of the most defining moments in TV broadcasting: the switch from B&W to color TV. Since then, similar moments have happened all over the world – for example, Australia in 1975 (video), India in 1982 (video), and Norway in 1972 (video), to name a few.

Let us look at the implications of this switch.

First, in terms of conveying information, broadcast B&W TV served its purpose. For example, you can see the presenter and their surroundings, and you can make sense of the visual scene. However, you cannot see the color of the presenter’s jacket or the color of the background. On the other hand, the video in color is more realistic and relatable.

Now the question: would you go back to B&W after seeing the video in color? Likely not.

Sure, B&W TV has its charm. But when it comes to live TV, color is better. Scientific studies seem to confirm that this is indeed the case – e.g., by showing that images in color (compared to B&W ones) have a positive impact on memory and brand associations (link). Today, all broadcasters produce in color.

Just like in our visual analogy, narrowband audio may be sufficient to convey information (like B&W TV), but the “cut” in the frequency spectrum makes the processed speech harder to understand. Wideband audio (similar to color TV), on the other hand, preserves a larger portion of the spectrum. This makes the sound “closer” to what we would experience in real life and, hence, makes it easier to understand what is being said.

Cisco’s HD Voice – AI brings back “color” to narrowband communications

In the case of audio communications, there was no one-time switch from narrowband to wideband audio, even after the G.722 wideband codec was standardized by the International Telecommunication Union in the late 1980s. Today, a significant portion of PSTN calls is still based on narrowband audio. We know these calls do not sound great, and there has been little that could be done about it. Until now.

To improve the quality of narrowband calls, Cisco developed HD Voice, an AI-based speech-processing technology that simultaneously:

  1. removes background noises, and
  2. converts narrowband audio to wideband.

The result is better audio quality and speech intelligibility when receiving PSTN calls.

In a way, HD Voice can be seen as the (metaphorical) “switch” that converts narrowband to wideband audio, just like the one NBC President Robert Sarnoff used to switch from B&W to color TV.

HD Voice takes narrowband speech (e.g., G.711; see Spectrogram 1 in Figure 5) as input and uses AI to remove background noises and reconstruct the high-frequency portion of the speech spectrum that was lost due to the narrowband processing.

Figure 5: Narrowband signal converted to clear wideband audio with HD Voice

Hence, the output from HD Voice is a noise-free wideband signal that includes the reconstructed spectrum (see Spectrogram 2 in Figure 5). Compared to the narrowband input (Spectrogram 1), the HD Voice audio (Spectrogram 2) sounds better.

By relying on psychoacoustic principles and the power of AI, HD Voice’s objective is to blindly reconstruct the high-frequency portion that is missing from the narrowband speech, that is, without any side information about the original signal. HD Voice can operate locally on the user’s device or in the cloud. We do not use user data for the training of the HD Voice neural network. For this reason, HD Voice does not reconstruct the high frequencies of a specific person’s voice. As an AI-based innovation, HD Voice will always adhere to Cisco’s Responsible AI Principles, which include Transparency, Fairness, Accountability, Privacy, Security, and Reliability.

Benefits for Webex customers

Cisco is committed to bringing the best audio quality and user experience to Webex users. After bringing Noise Suppression to a wide range of Webex products and endpoints, HD Voice is set to make the audio user experience for Webex customers even better.

By removing distracting noises and converting narrowband audio to wideband, Cisco HD Voice will significantly improve the experience of users who rely on narrowband communications (e.g., PSTN) to perform their work. HD Voice users will experience clearer, noise-free, high-quality audio, allowing them to focus on the conversation.

HD Voice will be available for all Webex Suite customers. Webex Calling customers will be the first to have access to HD Voice through the beta version of the Webex App starting November 2023. We will publish more details on the availability of HD Voice for the other Webex Suite customers in 2024.

About The Author

Ferdinando Olivieri, Product Manager, Cisco
Ferdinando is a Product Manager for Webex AI Audio Innovations, with the objective of advancing audio AI technologies to significantly enhance the Webex audio experience for customers.