BabbleLabs – AI audio wizardry for Cisco Webex Meetings
When it comes to having the best video conferencing experience, people often get excited about visuals. Cisco Webex Meetings is certainly an industry leader in this regard, providing users with progressive features like customizable views, gestures and reactions, advanced video layouts and immersive share. I myself love a fun background — and love getting a thumbs up emoji from meeting participants. But today, I want to talk about the importance of achieving excellent audio in video conferencing through clear, noise-free speech.
According to Gartner, by 2024 only 25 percent of meetings will take place in person. With the majority of meetings happening through conferencing solutions, intelligible speech is not simply nice to have – it’s crucial. The performance of an organization and its ability to support a diverse, dispersed workforce depend on how well people can understand each other. And speaking and hearing are core to understanding.
Challenges developing effective speech enhancement technology
Understanding meeting participants while video conferencing can be challenging enough in an office. It’s even worse when working from a noisy home, on a laptop with a limited network connection, while the system is trying to push audio streams through complex global networks to hundreds of colleagues.
In the past, algorithms struggled to extract useful information from speech in a way that delivers clear video conferencing audio while easing both the cognitive load on the listener and the computing load on the electronics. They struggled with the amount of noise, the degree of reverberation, the number of talkers, and bandwidth and latency limitations. They also had to cope with packet loss and the effects of audio compression, all while respecting the privacy and data security of users.
Additionally, noises found in speech are so diverse that algorithm developers have struggled to know what audio should be separated from speech – what is speech and what is not speech. So they focused on suppressing stationary noises that are constant in amplitude and frequency over time, like fans and motors. But the most annoying noise is transitory – barking dogs, beeping horns, banging keyboards and the babble of background noises. Plus many environments, especially home offices not designed for acoustic perfection, are highly reverberant or “echo-y”.
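To make the stationary-versus-transitory distinction concrete, here is a toy sketch of the older style of noise suppression. It is not our algorithm and not anything shipping in Webex. It tracks a noise floor as the minimum frame energy over recent frames and subtracts that floor from each frame's energy, a single-band simplification of spectral subtraction. The frame size, the history length and the synthetic "fan" and "bark" signals are all assumptions chosen for illustration.

```python
import random

FRAME = 160  # samples per frame (10 ms at an assumed 16 kHz sample rate)

def frame_energies(signal, frame_len=FRAME):
    """Mean energy of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def suppression_gains(energies, history=20):
    """Per-frame power gain derived from a tracked noise floor."""
    gains = []
    for i, energy in enumerate(energies):
        # Noise floor estimate: the quietest frame seen recently.
        floor = min(energies[max(0, i - history):i + 1])
        # Subtract the estimated noise energy; the result stays in [0, 1].
        gains.append(max(0.0, 1.0 - floor / energy) if energy > 0 else 0.0)
    return gains

rng = random.Random(0)
# 50 frames of steady "fan" noise ...
signal = [rng.gauss(0.0, 0.1) for _ in range(50 * FRAME)]
# ... plus a loud two-frame burst (a stand-in for a dog bark) at frames 30-31.
for n in range(30 * FRAME, 32 * FRAME):
    signal[n] += rng.gauss(0.0, 1.0)

gains = suppression_gains(frame_energies(signal))
fan_gain = sum(gains[:30]) / 30
bark_gain = min(gains[30:32])
print(f"average gain on fan-only frames: {fan_gain:.2f}")
print(f"gain on burst frames:            {bark_gain:.2f}")
```

The steady fan is attenuated because it looks like the tracked floor, while the burst passes almost untouched: an energy-based estimator has no way to tell a bark from a talker. That gap is the one learned models are meant to close.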
Achieving better understanding with Cisco Webex Meetings
Today, neural network speech methods are starting to make a big difference in how effectively we can solve these problems. As the founder and CEO of BabbleLabs – which was acquired by Cisco in October 2020 – my team and I have been working on developing best-in-class speech enhancement. We’re now implementing our AI audio wizardry as part of the Webex Voice Technology team. How do we do it? In the simplest terms, we take neural network structures, collect hundreds of thousands of hours of speech and noise and tens of thousands of hours of room acoustics, and create precisely tuned models to transform speech. And we do it with a latency of only 10 milliseconds.
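For a sense of scale, a quick back-of-the-envelope calculation shows how little audio arrives in 10 milliseconds. The sample rates below are common choices for speech and full-band audio and are assumptions on my part here, not a statement of what Webex uses internally.

```python
def samples_per_frame(sample_rate_hz: int, latency_ms: float) -> int:
    """Number of audio samples that arrive during latency_ms."""
    return round(sample_rate_hz * latency_ms / 1000)

# Sample rates are illustrative assumptions, not Webex specifications.
for rate in (16_000, 48_000):
    print(f"{rate} Hz: {samples_per_frame(rate, 10)} samples in 10 ms")
```

Whatever the rate, a real-time system has to finish processing each such slice before the next one arrives.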
Speech enhancement has recently gone mainstream in video conferencing. Everyone has some version of it, but not everyone is achieving the same results. Our systematic testing shows that Cisco Webex Meetings’ speech enhancement algorithm is the most effective one available for widespread commercial use. We measured most of the available video conferencing systems with the same quality tool, ITU-T Recommendation P.862, Perceptual Evaluation of Speech Quality (PESQ), across three large suites of typical noisy and reverberant streams: one developed by Cisco and two from Microsoft. On all the tests, Webex removed more noise and reverberation and scored significantly higher than recent Zoom (5.4.1) and Microsoft Teams (1.4.00.4167) releases.
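The shape of that kind of comparison (same scorer, same clips, every system) can be sketched in a few lines. This is a simplified illustration, not our test harness. The names `evaluate`, `snr_db`, `passthrough` and `remove_dc` are hypothetical, the data is a synthetic toy, and the scorer is a plain signal-to-noise ratio standing in for PESQ, which is a far more elaborate perceptual model.

```python
import math

def snr_db(reference, processed):
    """Stand-in quality score: signal-to-noise ratio in dB (NOT PESQ)."""
    signal = sum(r * r for r in reference)
    error = sum((r - p) ** 2 for r, p in zip(reference, processed))
    return float("inf") if error == 0 else 10.0 * math.log10(signal / error)

def evaluate(systems, suites, score_fn=snr_db):
    """Mean score per system per suite.

    systems: name -> function mapping a noisy clip to an enhanced clip
    suites:  name -> list of (clean_reference, noisy_input) pairs
    """
    results = {}
    for sys_name, enhance in systems.items():
        results[sys_name] = {}
        for suite_name, clips in suites.items():
            scores = [score_fn(clean, enhance(noisy)) for clean, noisy in clips]
            results[sys_name][suite_name] = sum(scores) / len(scores)
    return results

# Toy data: a sine tone with a constant offset added as the "noise".
clean = [math.sin(2 * math.pi * k / 40) for k in range(400)]
noisy = [c + 0.5 for c in clean]
suites = {"toy": [(clean, noisy)]}

def passthrough(clip):
    return list(clip)

def remove_dc(clip):
    offset = sum(clip) / len(clip)
    return [x - offset for x in clip]

results = evaluate({"passthrough": passthrough, "remove_dc": remove_dc}, suites)
for name, per_suite in results.items():
    print(name, per_suite)
```

Holding the scorer and the clips fixed is what makes scores comparable across systems. Swapping `score_fn` for a P.862 implementation would change the numbers but not the structure.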
Since the first public release of this speech enhancement technology two years ago, and with Cisco’s accelerating commitment of resources, we have improved speech quality more than 2X and cut the computing requirements so that these models run 400 times faster.
What’s next for Cisco Webex Meetings voice technology?
We continue to push the envelope for higher levels of performance and to further reduce the computing load to achieve ubiquitous and painless implementation. We can understand who speakers are and where they are and remove distracting background noise from their environment while amplifying their speech.
AI is giving us some potent new tools to extract more insights and communicate with less effort. Soon, we will release smart new features that will make an even greater difference to understanding, including:
- Speech enhancement that can distinguish intelligible speakers in conference rooms: Precise extraction of talkers who are near the microphone versus those who are far, so we have the ability to suppress or boost speech as needed.
- New speech enhancement capabilities for smart devices: New implementations and features to leverage the power of leading-edge laptops, devices, and phones.
- Command recognition using unique speech enhancement algorithms: To complement Webex’s large vocabulary voice assistant and transcription technologies and bring efficient edge execution, high accuracy, and easy configuration to new commands.
We live in a noisy world, but you don’t need to let it stop productivity. Speech enhancement has been shipping in volume deployments in Cisco Webex Meetings products for more than seven months. And it does a whole lot more than remove noise – it enhances speech and understanding while maintaining the fundamental Cisco commitment to privacy, security, and fairness.
Want to hear our speech enhancement technology in action and learn more about Cisco Webex Meetings’ speech enhancement algorithms?
Watch my Cisco Live talk, “BabbleLabs – AI Audio Wizardry,” now available to Cisco Live All Access pass holders and, in early summer, to members of the general public who register for a Cisco Live account.
Sep 27, 2022 — Geoffrey Huang