Latency Resolved with a WebRTC Media Stack

The last GS Lab blog, titled Exemplifying the streaming protocol – WebRTC, discussed different WebRTC topologies, their advantages and disadvantages, and how the selective forwarding unit (SFU) architecture is more scalable and cost-effective than a multipoint control unit (MCU) or a full mesh.

We also discussed a hybrid topology: a combination of the SFU and MCU architectures in which the application can use either one, as defined by business logic, through a topology selector exposed to the application programmer.

Having compared WebRTC with other streaming protocols and examined the various topologies, let's continue the series and explore WebRTC technology in greater detail, with a particular focus on the WebRTC media stack.

Decrypting the WebRTC Media Stack

Our previous discussion established the importance of selecting the proper topology, which helps us achieve scalable, cost-effective, and quality applications depending on the user device and available bandwidth. On a similar note, understanding the media stack beneath the WebRTC framework helps enhance the quality of experience (QoE) of conference applications as per participants’ devices and available bandwidth.

The media stack manages the media plane in WebRTC-based communication. It comprises a media engine and codecs. Codecs define the compression and decompression standards for the media streams. For a successful media exchange, the participants in a conference need a common set of codecs to agree upon for the session. The codecs are negotiated between the participants via the Session Description Protocol (SDP).
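As a rough illustration, codec agreement boils down to finding a codec that both sides support. The sketch below is a deliberately simplified model of the offer/answer outcome (the codec lists are hypothetical, and real negotiation involves full SDP offers and answers):

```python
# Simplified model of codec agreement: the offerer lists codecs in
# preference order, and the session uses the first one the answerer
# also supports. Real SDP offer/answer carries much more detail.

def negotiate_codec(offer_codecs, answer_codecs):
    """Return the first mutually supported codec, or None if there is none."""
    supported = set(answer_codecs)
    for codec in offer_codecs:
        if codec in supported:
            return codec
    return None

print(negotiate_codec(["VP9", "VP8", "H264"], ["H264", "VP8"]))  # VP8
print(negotiate_codec(["VP9"], ["H264"]))                        # None
```

If the intersection is empty, the media session cannot be established, which is why a baseline codec supported everywhere (such as VP8 for video or Opus for audio) matters in practice.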

A Closer Look at the Media Stack

Let's take a look at the media stack's flow in a VoIP system:

WebRTC Media Stack

At the browser end (user1 and user2), the WebRTC standard provides APIs for accessing cameras and microphones connected to the computer or smartphone. With the help of the getUserMedia API, the browser gets access to the system hardware and captures both audio and video streams without any plug-ins or custom user-mode drivers. However, the captured raw audio (16-bit PCM) and video (YUV 4:2:0) streams are not sufficient on their own. The audio filters in the audio engine enhance the quality, and the audio encoder compresses the audio stream. Similarly, the video encoder in the video engine compresses the raw video stream, and the video jitter buffer conceals the effects of jitter and packet loss on overall video quality.

Thus, audio and video engine processing helps us enhance stream quality and achieve the desired output bit rate, which is adjusted continuously for the available bandwidth and latency.
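The jitter buffer mentioned above can be sketched in a few lines. This is a minimal illustration, not how WebRTC's adaptive buffer is actually implemented: it only shows the core idea of holding out-of-order packets and releasing them in sequence before decoding.

```python
import heapq

# Minimal jitter-buffer sketch: packets arriving out of order are held
# in a min-heap keyed by sequence number and released in order, smoothing
# network jitter before decoding. Real WebRTC buffers are adaptive and
# also perform loss concealment; this sketch only reorders.

class JitterBuffer:
    def __init__(self):
        self._heap = []          # (sequence number, payload) pairs
        self._next_seq = 0       # next packet we expect to release

    def push(self, seq, payload):
        heapq.heappush(self._heap, (seq, payload))

    def pop_ready(self):
        """Release all consecutive packets starting at the expected sequence."""
        out = []
        while self._heap and self._heap[0][0] == self._next_seq:
            _, payload = heapq.heappop(self._heap)
            out.append(payload)
            self._next_seq += 1
        return out

buf = JitterBuffer()
buf.push(1, "B")                 # arrives out of order
print(buf.pop_ready())           # [] -- packet 0 has not arrived yet
buf.push(0, "A")
print(buf.pop_ready())           # ['A', 'B']
```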

At the receiving end, the process is reversed: with the help of the audio and video decoders, the streams are decoded in real time and adjusted for network jitter and latency delays. In short, stream capturing, pre-processing, encoding, decoding, and post-processing are complex problems. The good news, however, is that the WebRTC media stack brings fully featured audio and video engines to the browser, taking the signal-processing burden off the application.

WebRTC uses container-less media, so the media pipeline does not include muxer or demuxer plugins to containerize the elementary audio and video streams; it includes codecs only. For users to have a successful exchange of media, both need a common set of codecs to agree upon for the session. During the signaling process, besides IP- and port-related information, metadata (including codec and media type) also gets exchanged via the Session Description Protocol (SDP). The audio and video codecs exchanged this way allow the multimedia communication session between the users to start.
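To make the SDP exchange concrete, the sketch below parses the a=rtpmap lines that carry codec metadata. The SDP fragment is a hypothetical, heavily trimmed example; a real offer or answer contains many more attributes (ICE candidates, DTLS fingerprints, codec parameters, and so on):

```python
import re

# Toy SDP fragment (hypothetical payload types and values) showing how
# codec metadata rides in a=rtpmap lines of an offer or answer.
SDP = """\
m=audio 9 UDP/TLS/RTP/SAVPF 111 0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""

def list_codecs(sdp):
    """Extract (payload type, codec name, clock rate) from a=rtpmap lines."""
    pattern = re.compile(r"a=rtpmap:(\d+) ([^/]+)/(\d+)")
    return [(int(pt), name, int(rate)) for pt, name, rate in pattern.findall(sdp)]

print(list_codecs(SDP))
# [(111, 'opus', 48000), (0, 'PCMU', 8000), (96, 'VP8', 90000)]
```

Each peer announces codecs like these; the session then proceeds with the payload types both sides understand.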

Here is a quick analysis of the audio and video codec support across the different WebRTC browsers and license details:

WebRTC browsers support

As per the above infographic, VP8 was developed by On2 Technologies and then acquired by Google, which released it as open source. It now has a clean license from Google allowing royalty-free commercial use and is ubiquitous across all commonly used browsers. It has no limit on frame rate and supports a maximum resolution of 16384×16384 pixels.
VP9, VP8's successor, achieves a bit-rate reduction of up to 50% compared to VP8 and H.264/AVC while maintaining the same video quality. At the same time, VP9 is comparable to HEVC (MPEG High-Efficiency Video Coding) in terms of bit rate. VP9 is yet to be integrated into Safari.

AOMedia Video 1 (AV1) is an open video format designed by the Alliance for Open Media. It is royalty-free and designed specifically for internet video delivered via the HTML video element and WebRTC. AV1 achieves higher data compression than VP9 and H.265/HEVC, and it also supports HDR and variable frame rates. AV1 is yet to be integrated into Mozilla Firefox and Safari.
Opus is a royalty-free audio codec defined by IETF RFC 6716. It supports constant and variable bit-rate encoding from 6 kbit/s to 510 kbit/s, frame sizes from 2.5 ms to 60 ms, and various sampling rates from 8 kHz (with 4 kHz bandwidth) to 48 kHz (with 20 kHz bandwidth, where the entire hearing range of the human auditory system can be reproduced). Opus is ubiquitous across all commonly used browsers.
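Those bit-rate and frame-size ranges translate directly into packet payload sizes. A back-of-the-envelope sketch (using a 20 ms frame, a common choice, and ignoring RTP/UDP/IP overhead):

```python
# Bytes per Opus frame = bitrate (bit/s) * frame duration (s) / 8.
# The rates below are the extremes and a typical VoIP value from the
# ranges quoted above; overheads from RTP/UDP/IP headers are ignored.

def opus_frame_bytes(bitrate_bps, frame_ms):
    return bitrate_bps * (frame_ms / 1000) / 8

print(opus_frame_bytes(6_000, 20))    # 15.0 bytes  -- lowest Opus rate
print(opus_frame_bytes(32_000, 20))   # 80.0 bytes  -- typical VoIP rate
print(opus_frame_bytes(510_000, 20))  # 1275.0 bytes -- maximum Opus rate
```

The spread (15 bytes to over a kilobyte per frame) is what lets Opus serve everything from constrained mobile links to full-band stereo music.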

G.711 (A-law and µ-law) is an ITU-T Pulse Code Modulation (PCM) standard with either µ-law or A-law encoding. It is vital for interfacing with the standard telecom network and carriers. G.711 PCM (A-law) is known as PCMA, and G.711 PCM (µ-law) is known as PCMU. G.711 is ubiquitous across all commonly used browsers.

In line with the above information, we have four to five video codec options. Most of them are promising: VP8, VP9, AV1, and even HEVC. However, they come with trade-offs: processing power, availability across browsers, available bandwidth, expected video quality and latency, and the number of participants present.

Enter the world of Scalable Video Coding (SVC)

How often have we been on a conference call with colleagues and business associates and noticed someone struggling with a weak connection? Not only does this "slow user" situation cause unwarranted delays and disrupted conversations, but it can be irksome and embarrassing too. We have three options that can help ensure smooth, high-quality delivery:

  • Use a scalable video codec such as VP9 or AV1.
  • Lower the bit rate of everyone’s streams so as not to overwhelm the slow user (i.e., cater to the lowest common denominator).
  • Send each participant a separate stream tailored to their available bandwidth.

Both VP9 and AV1 have built-in scalability. These video codecs allow video transmissions to scale and deliver content without degradation between various endpoints, for example in a conference between a laptop and a low-compute mobile device. The two end devices differ in compute capability, screen size, and even available bandwidth.

Here, SVC codecs adapt to the network connection by dropping bitstream subsets (packets) to reduce the frame rate, frame resolution, or SNR (signal-to-noise ratio), providing temporal, spatial, and quality (fidelity) scalability, respectively. For example, a mobile phone would receive only the base layer (a low-resolution bitstream), while a laptop would receive both the base layer and an enhancement layer (a high-definition stream).
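The layered structure can be sketched as follows. Unlike the separate-streams option, SVC layers are cumulative: each enhancement layer builds on the ones below it, so a forwarder can simply truncate the bitstream from the top. The layer table here (bit rates, resolutions, frame rates) is hypothetical:

```python
# Illustrative SVC layer table (hypothetical values): each enhancement
# layer adds resolution or frame rate on top of the layers below it,
# and a forwarder can drop top layers without re-encoding anything.

SVC_LAYERS = [
    # (name, cumulative_kbps, resolution, fps)
    ("base (L0)",     200, "320x180",  15),
    ("spatial (L1)",  700, "640x360",  30),
    ("spatial (L2)", 1800, "1280x720", 30),
]

def layers_for(available_kbps):
    """Keep layers from the base up while the cumulative bit rate fits."""
    kept = [SVC_LAYERS[0]]  # the base layer is always sent
    for layer in SVC_LAYERS[1:]:
        if layer[1] <= available_kbps:
            kept.append(layer)
        else:
            break               # layers above a dropped one are useless
    return [name for name, *_ in kept]

print(layers_for(2500))  # laptop on a good link: all three layers
print(layers_for(300))   # constrained mobile device: base layer only
```

Because a layer is only decodable together with everything below it, the selection stops at the first layer that does not fit, rather than cherry-picking.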

Moreover, SVC is backward compatible, so an SVC-capable codec can communicate with an H.264 codec that is not SVC-capable. Several video conferencing equipment manufacturers embrace SVC encoding, including Avaya, Lifesize, Polycom, and Vidyo.

Partnering for Multimedia Expertise

Today, organizations need integrated multimedia solutions to deliver more business value than ever before across the spectrum of analytics, improved security, and business applications. The pandemic has changed many aspects of our lives: although tools like video conferencing were already popular, they were rendered a “must-have” overnight.

For the last 18 years, GS Lab has developed multiple tools across conferencing and collaboration, media streaming and video surveillance, telephony, WebRTC, and the open-source cross-platform telephony stack FreeSWITCH. We have also developed solutions that address specific business use cases on specific platforms. Our recently developed Jitsi Meet-based video conferencing solution, Greendot, is a free, open-source, multiplatform offering for VoIP, video conferencing, and IM applications and is compatible with WebRTC. Greendot has exceptionally flexible and scalable features for easy application development and integration, answering your business needs and your end users’ latency issues. Learn more about Greendot today.

>> Learn more about GS Lab at

>> Know more about our expertise in WebRTC technology and how we can help you accelerate your business transformation

Swapnil Warkar

Swapnil Warkar is a Software Architect at GS Lab with more than 16 years of experience in IT. He leads the Multimedia practice at GS Lab on both the technology and business fronts. He also leads research in the voice processing domain. Swapnil focuses on translating early-stage business vision into innovative IT solutions.