A look at enterprise video quality
The “hybrid” mode of work is quickly becoming the new mode of operation as the world’s employees emerge from their homes and head back to the office. Hybrid working requires people to leverage the latest innovations in conferencing technologies to remain connected and collaborate wherever they are. The pandemic made video communication essential, with enormous increases in home and mobile usage. These environments are notorious for presenting challenges to delivering high-quality media, such as low or variable network bandwidth, poor lighting and cameras, and background noise. Improving user experience starts with measuring it. This article explores how Cisco approaches this multi-dimensional problem for video quality.
Measuring video quality is a multifaceted and complex endeavor
Why is measuring video quality hard? Partly because defining it is hard. We know bad quality when we see it, but video can be bad in many different ways: soft, or blocky, or noisy; frames freezing or corrupted, or out of sync.
Video conferencing systems are also highly adaptive. Networks are unreliable, and CPU usage and video content change. In response, applications such as Webex adapt by changing resolution, adjusting frame rates, and collaborating with end-user clients to negotiate optimal network strategies. This makes what is being experienced a moving target.
Measuring video quality is both a top-down and a bottom-up process. Top-down because we want to measure the totality of what users experience. Bottom-up because we want to measure how every component is performing and what its contribution is.
Quality and network loss
An important part of user experience is how a client behaves when there are poor network conditions. Because video streams contain data that is predicted from previous frames, loss of data causes receiver errors. Different strategies can be adopted. At the data layer, errors can be minimized by using Forward Error Correction or retransmission. Video streams can be restarted with a new key frame. Any errors that remain will need to be concealed by some mixture of temporal or spatial concealment: spatial concealment borrows information from surrounding pixels to rebuild lost data; temporal concealment borrows data from nearby video frames to fill in lost frames. Finally, the data rate can be reduced by using lower bitrates and smaller video resolutions. Each of these techniques has costs as well as benefits.
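As a rough illustration of the two concealment styles described above, the following Python sketch fills a lost square block either from the co-located block of the previous frame (temporal) or from the average of the pixels bordering it (spatial). Frames are plain lists of rows of luma values. The function names and the block-based loss model are assumptions made for illustration, not a description of how Webex or any particular codec conceals errors.

```python
from typing import List

Frame = List[List[int]]  # rows of 8-bit luma samples


def conceal_temporal(cur: Frame, prev: Frame, top: int, left: int, size: int) -> Frame:
    """Fill a lost size x size block by copying the co-located block of the previous frame."""
    out = [row[:] for row in cur]
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = prev[y][x]
    return out


def conceal_spatial(cur: Frame, top: int, left: int, size: int) -> Frame:
    """Fill a lost size x size block with the mean of the valid pixels bordering it."""
    h, w = len(cur), len(cur[0])
    border = []
    for x in range(left, left + size):
        if top - 1 >= 0:
            border.append(cur[top - 1][x])       # row just above the block
        if top + size < h:
            border.append(cur[top + size][x])    # row just below the block
    for y in range(top, top + size):
        if left - 1 >= 0:
            border.append(cur[y][left - 1])      # column just left of the block
        if left + size < w:
            border.append(cur[y][left + size])   # column just right of the block
    # Fall back to mid-grey if the block covers the whole frame.
    fill = round(sum(border) / len(border)) if border else 128
    out = [row[:] for row in cur]
    for y in range(top, top + size):
        for x in range(left, left + size):
            out[y][x] = fill
    return out
```

Temporal copying tends to work well for static scenes and poorly under motion, while spatial filling avoids stale content at the cost of a soft patch, which is one small example of the cost and benefit trade-off noted above.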
From a metrics point of view, this is extremely challenging. The video that is received is not the same as the video that is transmitted. Each vendor adopts a different package of optimization techniques, so the video each one displays will differ, for example by favoring motion over sharpness or vice versa, or by increasing latency to allow retransmission. Vendors do not share their methods of quality optimization because these are proprietary implementations and make up their “secret sauce.” Regardless of whether the optimization happens on the sender side, during transmission, or on the receiver side, the resulting video has diverged from its source.
Even when there is no loss, there is still client adaptation: denoising, super-resolution, and pre- and post-filtering, all of which also differ between vendors. All these factors make apples-to-apples comparisons extremely difficult.
Full-reference versus no-reference metrics
How, then, can quality be measured at all in such circumstances? To understand that, we need to understand the difference between Full-Reference and No-Reference metrics.
An FR metric is one where it is necessary to compare the video with an original. It requires a pixel-by-pixel correspondence: same resolution, same framerate, each input frame matched with an output frame. It is most useful when a single process may introduce some loss to a well-defined input, where the aim is to minimize that loss.
VMAF – full-reference testing
There are various FR metrics such as PSNR, SSIM, MS-SSIM, but a very popular metric, often considered state-of-the-art, is Video Multimethod Assessment Fusion or VMAF. This method of FR testing was designed specifically by Netflix to conduct perceptual video quality assessments for their video streaming service.
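PSNR, the simplest of the FR metrics listed above, makes the pixel-by-pixel requirement concrete. The sketch below is a minimal pure-Python version for 8-bit single-channel frames; the function name and the list-of-rows frame representation are assumptions for illustration.

```python
import math
from typing import List

Frame = List[List[int]]  # rows of 8-bit samples


def psnr(reference: Frame, received: Frame, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two frames of identical size."""
    if len(reference) != len(received) or any(
        len(r) != len(d) for r, d in zip(reference, received)
    ):
        raise ValueError("full-reference metrics need frames of identical dimensions")
    n = sum(len(r) for r in reference)
    sq_err = sum(
        (a - b) ** 2
        for r, d in zip(reference, received)
        for a, b in zip(r, d)
    )
    if sq_err == 0:
        return math.inf  # identical frames
    mse = sq_err / n
    return 10.0 * math.log10(peak * peak / mse)
```

The function refuses frames of different sizes, which is exactly the situation a conferencing receiver produces whenever the sender has adapted its resolution.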
Using VMAF – or any FR metric – to measure quality is very challenging. Since the received and source video can be quite different, it is necessary to scale, crop, and synchronize the output so that it can be compared pixel-by-pixel with (some of) what is transmitted.
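To give a sense of the synchronization step alone, here is a minimal sketch that searches for the frame offset at which a received clip best matches the source. It assumes the received frames have already been scaled and cropped to the source dimensions and that no frames were dropped, both of which rarely hold in practice. The function names are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[int]]  # rows of 8-bit samples


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two frames of the same size."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    count = sum(len(r) for r in a)
    return total / count


def best_offset(source: Sequence[Frame], received: Sequence[Frame], max_offset: int) -> int:
    """Return k in 0..max_offset such that source[i + k] best matches received[i]."""
    best, best_cost = 0, float("inf")
    for k in range(max_offset + 1):
        pairs = list(zip(source[k:], received))
        if not pairs:
            break  # offset runs past the end of the source
        cost = sum(mean_abs_diff(s, r) for s, r in pairs) / len(pairs)
        if cost < best_cost:
            best, best_cost = k, cost
    return best
```

Even this simplified search becomes ambiguous when the content is static or when frames are dropped or repeated, which hints at why the full alignment procedure is fragile.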
In our experience, although this approach has been attempted in vendor comparisons, the manipulations required are highly error-prone.
Moreover, although VMAF captures differences well, it does not capture absolute quality. In conferencing, we do not have expensively produced movie videos. User experience is influenced by the quality of the captured video itself, not just by how different the received video is from what is captured.
Finally, VMAF is a spatial-only metric: it does not capture temporal effects, and the score is just an average of frame scores.
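A small arithmetic sketch shows how averaging per-frame scores can hide a short-lived problem. The scores below are made-up values on a 0 to 100 scale, not real VMAF output, and the function name is hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def pool_scores(frame_scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-frame scores by their mean and by their worst value."""
    return {"mean": mean(frame_scores), "min": min(frame_scores)}


# One second of 30 fps video: 29 good frames and one badly corrupted frame.
summary = pool_scores([90.0] * 29 + [20.0])
```

Here the mean stays close to 88 while the worst frame scores 20, so a viewer who noticed the glitch would rate the clip quite differently from the averaged score.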
NIQE – No-reference testing
Alternatively, there has been much research over recent years into No-Reference video quality metrics that attempt to measure the absolute quality level without comparison to a reference. If a suitable, reliable NR metric can be found, it would be ideal for conferencing applications because of the adaptations and losses that video streams experience.
A popular NR quality metric is the Natural Image Quality Evaluator, or NIQE. NIQE fits a statistical model to an image to see how closely its statistics match those of a corpus of natural images. NIQE can score the end user’s video quality in any situation, regardless of the source image quality and any losses or processing along the video pipeline.
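The statistics NIQE models are derived from locally normalized luminance values, often called MSCN (mean-subtracted, contrast-normalized) coefficients. The sketch below computes only that normalization step, and uses a plain box window where the published method uses a Gaussian-weighted one. It is not a NIQE implementation, and the function name and parameters are assumptions for illustration.

```python
import math
from typing import List

Frame = List[List[float]]  # rows of luma samples


def mscn(frame: Frame, radius: int = 1, c: float = 1.0) -> Frame:
    """Subtract the local mean and divide by the local standard deviation plus c."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Box window around (x, y), clipped at the frame borders.
            window = [
                frame[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            mu = sum(window) / len(window)
            var = sum((v - mu) ** 2 for v in window) / len(window)
            out[y][x] = (frame[y][x] - mu) / (math.sqrt(var) + c)
    return out
```

For natural, undistorted images these coefficients tend to follow a regular distribution, and distortions such as noise or compression change its shape, which is the kind of deviation a model fitted on natural images can detect.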
An NR metric like NIQE can be used to evaluate both source and destination video independently. Quality losses are then captured in the difference of scores, allowing loss recovery and video optimization techniques to be assessed. Since captured video can be of poor quality, video optimization can even improve it.
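The comparison of independent source and destination scores can be sketched as follows. The per-frame scores are assumed to come from some NR metric computed elsewhere; the function name is hypothetical. Note that NIQE is a lower-is-better score, so the sign of the difference has to be read accordingly.

```python
from statistics import mean
from typing import Sequence


def nr_score_change(source_scores: Sequence[float], received_scores: Sequence[float]) -> float:
    """Mean NR score of the received video minus the mean NR score of the source.

    For a lower-is-better metric such as NIQE, a positive result means the
    pipeline lost quality and a negative result means it improved on the capture.
    """
    return mean(received_scores) - mean(source_scores)
```

Because each side is scored on its own, the two sequences do not need the same length, resolution, or frame rate, which is what makes this approach workable when the stream has adapted in transit.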
Our experience is that NIQE is quite reliable but is still missing some important features. In particular, it still does not address temporal quality.
Augmenting NIQE: additional metrics
One limitation of NIQE is that you could potentially get a very good score by allocating all the bitrate to one frame and never sending another. This problem is not specific to NR metrics: as mentioned, VMAF has the same weakness, since you can only compare the frames that are actually received with the source frames they correspond to.
The first additional metric we consider, therefore, is the Drop Frame Metric or DFM. This metric calculates the number of dropped frames in a sequence and the occurrence of keyframes being used as a method of error recovery. In some instances, the use of keyframes can give false-positive results in the NIQE score. Therefore, this temporal measurement allows one to distinguish between accurate results and any false positives.
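The exact definition of DFM is not given here, so the following is only a sketch of the two quantities described above: an estimate of dropped frames from gaps in display timestamps, and a count of keyframes after the first frame. The names, the timestamp-based estimate, and the decision to treat every later keyframe as error recovery are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DropFrameStats:
    dropped: int             # estimated frames missing between displayed frames
    recovery_keyframes: int  # keyframes seen after the first frame


def drop_frame_stats(
    timestamps: Sequence[float], is_key: Sequence[bool], fps: float
) -> DropFrameStats:
    """Estimate dropped frames from timestamp gaps and count later keyframes."""
    interval = 1.0 / fps
    dropped = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        # A gap of about k nominal frame intervals implies k - 1 missing frames.
        steps = round((cur - prev) / interval)
        dropped += max(steps - 1, 0)
    # Assumption: any keyframe after the first was sent for error recovery.
    recovery = sum(1 for k in is_key[1:] if k)
    return DropFrameStats(dropped, recovery)
```

A stream that sends periodic keyframes by design would be over-counted by this rule, so a practical version would need to know the sender's keyframe policy.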
Although NIQE captures many aspects of image quality, it does not capture some compression artifacts. So, we also include a measurement of blocking and a measurement of blurring. Both FR metrics and NIQE can miss these artifacts, which are common in encoded video.
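The specific blocking and blurring measurements are not described here, so the sketch below uses two common, simple stand-ins: a ratio of pixel steps across assumed 8-pixel block boundaries to steps elsewhere, and the variance of a Laplacian response as a sharpness indicator. Both functions, their names, and the 8-pixel block size are assumptions for illustration.

```python
from statistics import mean, pvariance
from typing import List

Frame = List[List[int]]  # rows of 8-bit luma samples


def blockiness(frame: Frame, block: int = 8) -> float:
    """Mean horizontal step across block boundaries divided by the mean step elsewhere.

    Values well above 1 suggest visible vertical block edges. Only vertical
    boundaries are checked, to keep the sketch short.
    """
    at_edge, elsewhere = [], []
    for row in frame:
        for x in range(1, len(row)):
            step = abs(row[x] - row[x - 1])
            (at_edge if x % block == 0 else elsewhere).append(step)
    if not at_edge or not elsewhere:
        raise ValueError("frame must be wider than one block")
    # Small constant avoids division by zero on perfectly flat content.
    return mean(at_edge) / (mean(elsewhere) + 1e-6)


def laplacian_variance(frame: Frame) -> float:
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Lower values suggest a blurrier image. Requires a frame of at least 3 x 3.
    """
    h, w = len(frame), len(frame[0])
    responses = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)
```

Neither number means much in isolation; they are most useful when tracked alongside the NR score, for example to notice that a stream has become soft or blocky while its overall score has barely moved.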
Cisco’s preference for no-reference testing
FR metrics have their place, particularly in assessing individual pipeline elements. But Cisco believes that to accurately assess end-to-end video quality, No-Reference metrics best capture user experience. Human perception is extraordinarily complex, and designing comprehensive metrics is hard, but the following four metrics together capture a significant part of the quality experience:
No-Reference (NIQE) | DFM | Blocking | Blurring
Together, these provide a concise measurement of subjective video quality across several different dimensions.
These metrics can capture both the quality lost end-to-end and the impact of the source quality itself. Conferencing systems must often accept low-quality source content and use various methods to improve or maintain video quality from end to end. As such, quality assessments should not rely on FR measurement methodologies but should account for these factors when considering the end-to-end experience. Therefore, Cisco focuses on using No-Reference metrics for end-to-end quality, as they more accurately reflect the end-user experience.
Cisco’s continuous focus on video quality & quality overall
The metrics we have discussed are not perfect. There are some limitations, for example in judging the quality of graphics/synthetic content, and we are continuously developing our approach. But although it is the harder road to take, we are convinced that No-Reference metrics are the best framework for evaluating experienced video quality.
In the latest release of the Webex App, significant improvements to all media quality metrics have been achieved. These improvements include video quality, audio quality, background noise suppression, CPU utilization, as well as innovations designed for the world of hybrid work.
In response to the pandemic, we have seen the media quality of all vendors’ solutions improve significantly this year. Our continual testing shows that the Webex App provides video quality that meets or exceeds that of any other vendor. The market remains incredibly competitive, and quality and performance continue to be a primary focus for Cisco.
Co-author: Thomas Davies, Principal Engineer
Thomas Davies is a Principal Engineer in Cisco’s Collaboration Technology Group (CTG). Thomas has worked in satellite networking, RF communications, and broadcasting, but has spent most of his 20+ year career on video processing and video compression (codecs). He has been with Cisco for over 10 years, creating the next generation of collaboration experiences. He has contributed to video compression standards such as HEVC (H.265) and AV1, and has been instrumental in implementing those standards in real products such as Cisco Webex.
Sep 16, 2021 — Aruna Ravichandran