In previous blog entries, we covered SDP, and how it is used in negotiations via the Offer/Answer exchange. In this blog, we examine how H.264 is negotiated in SDP.
Every audio and video codec used in real-time conferencing has its own specification for how it is negotiated in SDP, and it is outside the scope of this series to cover all of them. However, H.264 video is both ubiquitous and has a complex set of codec parameters that are not at all intuitive. Beyond that, there are parts of the specification that are not always followed in practice. RFC 6184 (which obsoletes RFC 3984) is the specification for this, though certain aspects of it require access to the ITU-T H.264 specification for encoding and decoding.
The simple bit of negotiating H.264 is the rtpmap: a dynamic payload type with “H264/90000” will do the trick. Unsurprisingly, it is the payload-specific parameters in fmtp that are where the complexity comes in. An example fmtp line might look something like this:
a=fmtp:126 profile-level-id=42e016;packetization-mode=1;max-mbps=244800;max-fs=8160
The rest of this blog will cover what these elements mean.
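As a starting point, the fmtp line above can be broken into its payload type and a dictionary of parameters. The following is a minimal sketch rather than a full SDP parser; the function name `parse_fmtp` is an assumption made for illustration, and it assumes parameters are separated by semicolons as in the example.

```python
def parse_fmtp(line: str) -> tuple[int, dict[str, str]]:
    """Split an 'a=fmtp:' line into its payload type and a dict of parameters."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute line")
    # The payload type is separated from the parameter list by a space.
    payload_type, _, param_text = line[len(prefix):].strip().partition(" ")
    params = {}
    for item in param_text.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        # Parameter names are lower-cased so lookups are case-insensitive.
        params[name.strip().lower()] = value.strip()
    return int(payload_type), params


pt, params = parse_fmtp(
    "a=fmtp:126 profile-level-id=42e016;packetization-mode=1;"
    "max-mbps=244800;max-fs=8160"
)
print(pt, params)
```

Values are kept as strings here, since different parameters need different interpretation (hexadecimal for profile-level-id, decimal for the limits).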
Profile and Level
H.264 encodes a number of key limitations within a 6-character string called the profile-level-id, where each pair of characters represents a hexadecimal number.
The first value is the profile_idc, which defines the sub-profile: essentially the set of tools allowable when encoding H.264. The second is the profile_iop, which defines a set of constraint flags. The first six bits of the profile_iop define the 6 constraint flags (the remaining two are unused):
Flag | Hex Mask | Binary Mask |
---|---|---|
CONSTRAINT_SET_0 | 0x80 | 10000000 |
CONSTRAINT_SET_1 | 0x40 | 01000000 |
CONSTRAINT_SET_2 | 0x20 | 00100000 |
CONSTRAINT_SET_3 | 0x10 | 00010000 |
CONSTRAINT_SET_4 | 0x08 | 00001000 |
CONSTRAINT_SET_5 | 0x04 | 00000100 |
For instance, “0C” corresponds to CONSTRAINT_SET_4 and CONSTRAINT_SET_5.
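The flag table above translates directly into a set of bit masks. This is a small sketch of how a profile_iop byte might be decoded into flag names; the names `CONSTRAINT_FLAGS` and `decode_profile_iop` are illustrative and not taken from any specification or library.

```python
# Bit masks for the six constraint flags, taken from the table above.
CONSTRAINT_FLAGS = {
    "CONSTRAINT_SET_0": 0x80,
    "CONSTRAINT_SET_1": 0x40,
    "CONSTRAINT_SET_2": 0x20,
    "CONSTRAINT_SET_3": 0x10,
    "CONSTRAINT_SET_4": 0x08,
    "CONSTRAINT_SET_5": 0x04,
}


def decode_profile_iop(profile_iop: int) -> list[str]:
    """Return the names of the constraint flags set in a profile_iop byte."""
    return [name for name, mask in CONSTRAINT_FLAGS.items() if profile_iop & mask]


print(decode_profile_iop(0x0C))  # ['CONSTRAINT_SET_4', 'CONSTRAINT_SET_5']
```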
The sub-profile and constraint flags combine to produce the actual profile. There is a wide range of profiles available, but the most commonly supported in video conferencing are the Constrained Baseline and Constrained High profiles. It is recommended that any implementation interested in interoperability advertises support for Constrained Baseline as one supported option. The following table shows how these and some other commonly seen profiles are derived from the profile_idc and constraints:
Profile | profile_idc | Constraints |
---|---|---|
Constrained Baseline Profile | 66 | CONSTRAINT_SET_1 |
Baseline Profile | 66 | |
Main Profile | 77 | |
Extended Profile | 88 | |
Constrained High Profile | 100 | CONSTRAINT_SET_4, CONSTRAINT_SET_5 |
High Profile | 100 | |
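The table above can be expressed as a lookup that checks the constrained variant of each profile first. This is a simplified sketch: it only tests the flags named in the table and ignores any others that happen to be set, and the names `PROFILES` and `profile_name` are assumptions made for illustration. RFC 6184 lists further equivalent combinations that this sketch does not attempt to cover.

```python
SET_1, SET_4, SET_5 = 0x40, 0x08, 0x04

# (profile_idc, required constraint mask, profile name).
# Constrained variants come first so they win over the unconstrained entry.
PROFILES = [
    (66, SET_1, "Constrained Baseline"),
    (66, 0, "Baseline"),
    (77, 0, "Main"),
    (88, 0, "Extended"),
    (100, SET_4 | SET_5, "Constrained High"),
    (100, 0, "High"),
]


def profile_name(profile_idc: int, profile_iop: int):
    """Return the profile name for the given values, or None if unrecognised."""
    for idc, required, name in PROFILES:
        if idc == profile_idc and (profile_iop & required) == required:
            return name
    return None


print(profile_name(66, 0x40))   # Constrained Baseline
print(profile_name(100, 0x0C))  # Constrained High
print(profile_name(100, 0x00))  # High
```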
The remaining byte in the profile-level-id covers the level, which provides a choice of values corresponding to a set of combined limits on a range of factors. These are defined in Table A-1 of the ITU H.264 specification, which is also included here for ease of reference:
A given level number corresponds to a set of 8 different maxima; advertising support for a particular level number means supporting H.264 streams that comply with these 8 maxima. The most important of these values are covered below.
So, to summarise, consider the example profile-level-id=424016. The first two characters represent the profile_idc: 0x42 corresponds to 66 in decimal. The second two represent the profile_iop: 0x40 is 0b01000000, corresponding to CONSTRAINT_SET_1. That means the overall profile is Constrained Baseline. Finally, the last two characters represent the level: 0x16 is 22 in decimal, meaning a level of 2.2, which corresponds to a max-mbps of 20250 macroblocks per second, a max-fs of 1620 macroblocks, and a max-br of 4 megabits/second.
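The same decomposition can be done in a few lines of code. The sketch below splits the profile-level-id from the fmtp example at the start of this post (42e016) into its three bytes; the function name `split_profile_level_id` is illustrative. Dividing the level byte by 10 gives the usual level number, though this simple conversion does not handle the special case of level 1b.

```python
def split_profile_level_id(value: str) -> tuple[int, int, int]:
    """Split a 6-character profile-level-id into (profile_idc, profile_iop, level_idc)."""
    if len(value) != 6:
        raise ValueError("profile-level-id must be exactly 6 hex characters")
    raw = bytes.fromhex(value)  # raises ValueError on non-hex input
    return raw[0], raw[1], raw[2]


profile_idc, profile_iop, level_idc = split_profile_level_id("42e016")
print(profile_idc)             # 66
print(f"{profile_iop:#010b}")  # 0b11100000
print(level_idc / 10)          # 2.2
```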
Adjusting Level Limits
However, levels only allow for so much flexibility, so the fmtp also allows for each of these limits to be raised. Of these, the three key ones are:
- max-fs: The max frame size, in macroblocks. A macroblock in H.264 is a 16×16 block of pixels, so this is equivalent to the maximum number of pixels divided by 256 (it must be an integer value). A max-fs of 3600 corresponds to 720p, while 8160 or 8040 corresponds to 1080p. Because 1080 is not a multiple of 16, 1080p does not correspond to a whole number of macroblock rows, so in practice endpoints negotiate support for either 1920×1088 or 1920×1072. It’s recommended to advertise the former but be ready to send the latter if that is what the far end specifies is its maximum.
- max-mbps: The maximum number of macroblocks per second. If the device writing the SDP supports 30fps then this is commonly the max-fs multiplied by 30: 108000 corresponds to 720p30 while 244800 corresponds to 1080p30. In content video, however, where high resolutions are desirable to preserve text and images while framerate is less important, it is not uncommon for devices to advertise a max-mbps that corresponds to a framerate well below 30fps at the maximum resolution allowed by the max-fs.
- max-br: The maximum bitrate, in units that can generally be treated as 1000 bits per second.
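The arithmetic behind the max-fs and max-mbps figures quoted above can be checked with a short sketch. The helper names are illustrative; dimensions that are not a multiple of 16 are rounded up to the next whole macroblock, which is an assumption about how an encoder would pad the frame.

```python
def frame_size_mbs(width: int, height: int) -> int:
    """Frame size in 16x16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def macroblocks_per_second(width: int, height: int, fps: int) -> int:
    """Macroblock rate for a given resolution and framerate."""
    return frame_size_mbs(width, height) * fps


print(frame_size_mbs(1280, 720))               # 3600
print(frame_size_mbs(1920, 1088))              # 8160
print(frame_size_mbs(1920, 1072))              # 8040
print(macroblocks_per_second(1280, 720, 30))   # 108000
print(macroblocks_per_second(1920, 1088, 30))  # 244800
```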
So when advertising H.264, implementations should pick the largest level number for which they can support every defined limit (with the potential exception of bandwidth), and then use additional parameters to raise individual limits to match. Bandwidth is often excluded because the bandwidth limits in levels tend to be very high relative to the other parameters, and because bandwidth can also be limited with “b=” lines. As a result, many implementations do not consider bandwidth limits when picking the largest level possible.
In practice most H.264 fmtp lines in SDP will include a max-fs and a max-mbps parameter. When parsing H.264 fmtp lines, implementations should read the level number and use a lookup table to get the various maximum values it defines. They should then read through the fmtp for any parameters that raise these values, and use those in place of the values derived from the level number.
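The parsing approach described above might be sketched as follows. The lookup table here is deliberately partial: only the Level 2.2 row is quoted in this post, and the other rows are reproduced from Table A-1 and should be checked against the specification before use. The names `LEVEL_LIMITS` and `effective_limits` are assumptions for illustration, and the input is assumed to be a dictionary of fmtp parameters keyed by name.

```python
# Partial level table keyed by the level byte (22 means level 2.2).
# max-br is in the units used by Table A-1. Verify against the spec before use.
LEVEL_LIMITS = {
    22: {"max-mbps": 20250, "max-fs": 1620, "max-br": 4000},
    30: {"max-mbps": 40500, "max-fs": 1620, "max-br": 10000},
    31: {"max-mbps": 108000, "max-fs": 3600, "max-br": 14000},
    40: {"max-mbps": 245760, "max-fs": 8192, "max-br": 20000},
}


def effective_limits(params: dict[str, str]) -> dict[str, int]:
    """Start from the level's limits, then apply any fmtp overrides."""
    level_idc = int(params["profile-level-id"][4:6], 16)
    limits = dict(LEVEL_LIMITS[level_idc])  # KeyError for levels not in the table
    for name in limits:
        if name in params:
            # These parameters may only raise a limit, so never go below the level.
            limits[name] = max(limits[name], int(params[name]))
    return limits


params = {"profile-level-id": "42e016", "max-mbps": "244800", "max-fs": "8160"}
print(effective_limits(params))
# {'max-mbps': 244800, 'max-fs': 8160, 'max-br': 4000}
```

Taking the larger of the two values is one design choice; it means a far end that mistakenly advertises a value below its level is treated as supporting the level's limit.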
Note that unlike some older codecs such as H.263, where support for specific resolutions can be negotiated, standard H.264 only negotiates the frame size as a single value and places no constraints on the aspect ratio. As such, while a max-fs of 3600 signals support for 720p, it also signals support for 1920×480, as that is also 3600 macroblocks. Technically H.264 requires support for any aspect ratio up to 8:1 or 1:8, though if interoperability is a concern it is best to stick to well-supported aspect ratios no matter what the specification requires.
Similarly, there is no way in standard H.264 negotiation to specify a maximum framerate – while 108000 corresponds to 720p at 30 frames per second it also allows for a resolution with half as many pixels at up to 60fps, and lower resolutions at even higher framerates. Much like the aspect ratio, while this is mandated by the specification, implementations should be very wary of sending framerates higher than 30fps unless they have some way to know that the far end actually supports it.
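To make the last two points concrete, the sketch below checks whether a given resolution and framerate fit within a negotiated max-fs and max-mbps. The function name `stream_fits` is illustrative, and it deliberately checks only these two limits, ignoring the aspect-ratio bounds and every other level constraint.

```python
def stream_fits(width: int, height: int, fps: int, max_fs: int, max_mbps: int) -> bool:
    """True if the stream is within the frame-size and macroblock-rate limits."""
    mbs = ((width + 15) // 16) * ((height + 15) // 16)
    return mbs <= max_fs and mbs * fps <= max_mbps


# Limits corresponding to 720p30.
MAX_FS, MAX_MBPS = 3600, 108000

print(stream_fits(1280, 720, 30, MAX_FS, MAX_MBPS))  # True: the expected case
print(stream_fits(1920, 480, 30, MAX_FS, MAX_MBPS))  # True: same macroblock count
print(stream_fits(960, 480, 60, MAX_FS, MAX_MBPS))   # True: half the pixels at 60fps
print(stream_fits(1280, 720, 60, MAX_FS, MAX_MBPS))  # False: exceeds max-mbps
```

A sender concerned with interoperability would likely add its own, stricter policy on top of a check like this, such as capping the framerate at 30fps.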
Other Important H.264 parameters
Beyond profile and level (and the individual parameters associated with the level) there are a wide range of other H.264 parameters that can be configured. Some of the more widely used are mentioned below.
One key parameter is the packetization-mode, which along with the profile-level-id should generally be present in every H.264 fmtp line. H.264 supports three different packetization modes, of which the most commonly used in video-conferencing is probably 1 (non-interleaved), followed by 0 (single NALU), while 2 (interleaved) is the least common.
The sprop-parameter-sets parameter, while rarely seen in more modern implementations, used to be quite common. ‘sprop’ is a prefix that is commonly, though not always, used to denote that an SDP parameter is a sender property (as SDP parameters normally define receiver capabilities). In this case, the parameter contains data that is required by the decoder when receiving H.264 frames (normally in the form of a pair of comma-separated base64-encoded strings representing the sequence and picture parameter sets). The ITU spec has full details on how these should be handled, but any implementation concerned with interoperability will need to handle these values and pass them to the H.264 decoder when it is created. In most modern implementations these sequence and picture parameter sets are embedded in-band within the H.264 payloads themselves, and hence do not need to be sent in the SDP. There are some other sprop- parameters, but they are much less commonly seen.
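Splitting and decoding a sprop-parameter-sets value is straightforward, as this sketch shows. The function names are illustrative, and the example bytes are short made-up stand-ins rather than complete parameter sets; only their first byte (the NAL unit header) is meaningful here. NAL unit type 7 is a sequence parameter set and type 8 is a picture parameter set.

```python
import base64


def decode_sprop_parameter_sets(value: str) -> list[bytes]:
    """Decode the comma-separated base64 entries into raw NAL units."""
    return [base64.b64decode(item) for item in value.split(",") if item]


def nal_unit_type(nal: bytes) -> int:
    """The NAL unit type is the low 5 bits of the first byte."""
    return nal[0] & 0x1F


# Illustrative, truncated NAL units (NOT valid, complete parameter sets).
fake_sps = bytes([0x67, 0x42, 0xE0, 0x16])
fake_pps = bytes([0x68, 0xCE, 0x3C, 0x80])
value = ",".join(base64.b64encode(n).decode("ascii") for n in (fake_sps, fake_pps))

for nal in decode_sprop_parameter_sets(value):
    print(nal_unit_type(nal), nal.hex())
```

The decoded bytes would then be handed to the decoder before any in-band video data; how that is done depends on the decoder API in use.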
The level-asymmetry-allowed parameter signals, per the name, whether the SDP Answer may use a different level value than that included in the Offer. According to the specification, it must be present in the Offer and set to ‘1’ for the Answer (which must also include it) to allow a different level number. In practice, it is not uncommon for implementations to use level asymmetry without including the parameter in their own SDP or reading the parameter from the far end. As such implementations should include it in their own Offers and Answers (not doing so can cause problems with devices such as Chrome browsers) but avoid enforcing the remote SDP’s usage of it.
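The advice above amounts to being strict in what is sent and lenient in what is accepted. A minimal sketch of both halves follows; the function names are assumptions for illustration, and the inputs are assumed to be dictionaries of fmtp parameters keyed by name.

```python
def advertise_asymmetry(params: dict[str, str]) -> dict[str, str]:
    """Return a copy of our own fmtp parameters with level asymmetry advertised."""
    return {**params, "level-asymmetry-allowed": "1"}


def asymmetry_signalled(offer: dict[str, str], answer: dict[str, str]) -> bool:
    """Strict reading of the specification: both sides must carry the value '1'."""
    return all(p.get("level-asymmetry-allowed") == "1" for p in (offer, answer))


local = advertise_asymmetry({"profile-level-id": "42e01f", "packetization-mode": "1"})
remote = {"profile-level-id": "42e016", "packetization-mode": "1"}

print(local["level-asymmetry-allowed"])    # 1
print(asymmetry_signalled(local, remote))  # False: the far end omitted it
```

Following the recommendation above, a lenient implementation might treat a False result as something to log rather than as grounds for rejecting a stream whose level differs from the one it offered.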
There are many other parameters defined in RFC6184, but implementation of the above should be sufficient for good interoperability and performance with a wide range of devices. Be wary of using other fields with third-party devices, as support for them may be lacking, irrespective of what the specification defines as mandatory behavior.
When advertising H.264 support, bear in mind that a given rtpmap/fmtp pair advertises support for a single set of capabilities, but for interoperability reasons, it is often useful to advertise more than one. The two parameters that are generally varied between entries are the profile portion of the profile-level-id and the packetization-mode. For reasons of practicality (and because some endpoints and middle-boxes have limits on the number of codec entries or SDP size they can handle), it may not be feasible to advertise every possible combination of values an implementation can support. For those who care about interoperability it is recommended to include something like:
- The most favored configuration of H.264 (which usually corresponds to the highest set of supported capabilities)
- An instance with Constrained High profile and packetization-mode 1 (if supported)
- An instance with Constrained Baseline profile and packetization-mode 1 (if supported)
- An instance with Constrained Baseline profile and packetization-mode 0 (if supported)
This maximizes interoperability at the most common profiles (Constrained High and Constrained Baseline) and packetization modes while also offering the ‘best’ configuration available if that differs from the above.
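As an illustration, the three fixed entries in the list above could be generated as follows. The payload types (96 to 98) and the choice of level 3.1 are hypothetical values picked for the example, and the function names are not from any library. The constraint flags follow the profile table earlier in this post.

```python
def profile_level_id(profile_idc: int, profile_iop: int, level_idc: int) -> str:
    """Build the 6-character hex string from its three bytes."""
    return f"{profile_idc:02x}{profile_iop:02x}{level_idc:02x}"


def h264_lines(payload_type: int, profile_idc: int, profile_iop: int,
               level_idc: int, packetization_mode: int) -> list[str]:
    """Return the rtpmap and fmtp lines for one H.264 entry."""
    plid = profile_level_id(profile_idc, profile_iop, level_idc)
    return [
        f"a=rtpmap:{payload_type} H264/90000",
        f"a=fmtp:{payload_type} profile-level-id={plid};"
        f"packetization-mode={packetization_mode};level-asymmetry-allowed=1",
    ]


LEVEL_3_1 = 31  # hypothetical choice of level for this example
entries = [
    (96, 100, 0x0C, LEVEL_3_1, 1),  # Constrained High, packetization-mode 1
    (97, 66, 0x40, LEVEL_3_1, 1),   # Constrained Baseline, packetization-mode 1
    (98, 66, 0x40, LEVEL_3_1, 0),   # Constrained Baseline, packetization-mode 0
]
for entry in entries:
    print("\n".join(h264_lines(*entry)))
```

A real offer would typically also carry max-fs and max-mbps on each fmtp line, and would list the most favored configuration first if it differs from these three.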