In this new blog series, we will talk about how audio and video data is actually sent from sender to receiver across real-world networks. This initial blog entry will focus on the protocol and the format of the packets, while the next two entries will cover how these packets are sent, and then how they are handled on being received.
RTP, the Real-time Transport Protocol, is the IP-based protocol used by almost all video-conferencing systems to send and receive media. It was standardized in the mid-90s in RFC1889 and received some minor tweaks in 2003 in RFC3550, which is the current canonical specification. RTP is designed to work with a companion protocol, RTCP (RTP Control Protocol), which is defined in the same specification but will be covered in the next blog series.
The RTP Packet Format
An RTP packet has the following structure, shown in 32-bit (4-byte) blocks:
Standard RTP Headers
- Version (V): 2 bits representing the version. Always 2.
- Padding (P): 1 bit indicating additional bytes of padding at the end of the packet. If ‘1’ then there are one or more bytes of padding after the payload; the final byte is a count of how many padding bytes there are (including itself).
- Extension (X): 1 bit indicating the use of extension headers. Explained in more detail in a later section.
- CSRC Count (CC): 4 bits representing the number of CSRCs in the RTP header (0-15).
- Marker (M): A ‘marker’ bit, the meaning of which is payload-dependent. In video, the marker bit is generally set on frame boundaries (the last packet of a video frame), while in audio some implementations set it to indicate the first packet of non-silence audio after a period of silence.
- Payload Type (PT): 7 bits to define the payload type. This allows the receiver to match the packets to a codec configuration negotiated in SDP, and hence know how to decode them. See the previous blog series for more details.
- Sequence Number: A 16-bit sequence number field that increments by one with each packet and allows the receiver to detect loss and re-ordering; after reaching 65,535 (2^16 - 1) it rolls over back to 0. The initial sequence number should be random, though it is best to set the top bit to 0 to ensure it does not roll over too quickly (this becomes important in media encryption, which will be covered in a later series).
- Timestamp: A 32-bit field of the timestamp of the RTP payload (technically the first byte of the payload, as it is possible to include multiple frames within a single packet that may have been generated at different times). Starting with a random initial value, for each subsequent frame the timestamp will increment monotonically by a fixed amount as defined by the payload, wrapping around once the value exceeds 2^32 - 1. Packets that are part of the same frame should have the same timestamp. Timestamp values are purely local to the stream, but can be converted to real-world time by combining them with information from RTCP; this will be covered in the RTCP series. Note that the timestamp stream should be linear and continuous, even if the SSRC changes or the underlying media jumps forward or backward.
- Synchronization Source Identifier (SSRC): A 32-bit number that uniquely identifies the source of the packet stream within the session, which is often needed when demultiplexing, decrypting, or performing other operations. SSRCs should be generated randomly and each stream sent by a device should use a unique SSRC. Devices can change SSRCs at any time and receivers need to be able to cope (a device changing its SSRC is meant to send an RTCP BYE, covered in the later RTCP series, but not all do so). Devices are also meant to change their SSRCs if they see that the far end is using the same SSRC, though not all implementations will do so; how important this is will depend on what SSRCs are being used for.
- Contributing Source Identifier (CSRC): Optionally, up to 15 different 32-bit values (signaled in the CC field) that identify the SSRCs of sources that have contributed to the stream. For instance, a mixed audio stream might use this to identify the SSRCs of the individual audio streams that were included in the mix. Generally only useful if the far end has access to some metadata that includes additional information about those source SSRCs (such as a roster list of participants that include this information).
- Extension: An optional number of bytes of header extensions that are present if the extension (X) bit is set. The format and size of extensions are covered in the subsequent section. This section will always be a multiple of 32 bits.
The RTP payload itself is then appended to the packet. Finally, padding bytes are appended, if the P (Padding) bit was set. The final byte of the padding is equal to the number of padding bytes total, including itself. So if there were three bytes of padding, the final byte of the packet should be ‘3’.
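To make the layout above concrete, here is a minimal parsing sketch in Python. The names `RtpPacket`, `parse_rtp` and `seq_is_newer` are purely illustrative and not part of any library. The wrap-aware sequence comparison uses a half-range heuristic, which is an assumption of this sketch rather than something the specification mandates.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RtpPacket:
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: List[int] = field(default_factory=list)
    extension_profile: Optional[int] = None  # 16-bit "defined by profile" value
    extension_body: bytes = b""              # raw extension words, unparsed
    payload: bytes = b""                     # payload with padding removed


def parse_rtp(data: bytes) -> RtpPacket:
    """Split a raw RTP packet into header fields and payload."""
    if len(data) < 12:
        raise ValueError("shorter than the 12-byte fixed header")
    # Byte 0: V(2) P(1) X(1) CC(4); byte 1: M(1) PT(7); then seq, timestamp, SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", data)
    if b0 >> 6 != 2:
        raise ValueError("RTP version is not 2")
    has_padding = bool(b0 & 0x20)
    has_extension = bool(b0 & 0x10)
    csrc_count = b0 & 0x0F

    offset = 12
    if len(data) < offset + 4 * csrc_count:
        raise ValueError("truncated CSRC list")
    csrcs = [int.from_bytes(data[offset + 4 * i: offset + 4 * i + 4], "big")
             for i in range(csrc_count)]
    offset += 4 * csrc_count

    profile, ext_body = None, b""
    if has_extension:
        if len(data) < offset + 4:
            raise ValueError("truncated extension header")
        profile, words = struct.unpack_from("!HH", data, offset)
        offset += 4
        if len(data) < offset + 4 * words:
            raise ValueError("truncated extension body")
        ext_body = data[offset: offset + 4 * words]
        offset += 4 * words

    end = len(data)
    if has_padding:
        # The final byte counts the padding bytes, including itself.
        pad = data[-1] if end > offset else 0
        if pad == 0 or pad > end - offset:
            raise ValueError("invalid padding count")
        end -= pad

    return RtpPacket(bool(b1 & 0x80), b1 & 0x7F, seq, ts, ssrc,
                     csrcs, profile, ext_body, data[offset:end])


def seq_is_newer(a: int, b: int) -> bool:
    """True if 16-bit sequence number a is ahead of b, allowing for roll-over.

    Assumption: the two values are less than half the number space apart.
    """
    diff = (a - b) & 0xFFFF
    return 0 < diff < 0x8000
```

With this helper, a packet with sequence number 0 is treated as newer than one with 65,535, which is the roll-over case described for the sequence number field.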
RTP Header Extensions
The fields above are the standard header fields, but it is also possible to add an arbitrary number of mutually-agreed header extensions. RFC3550 has very rudimentary header extension support, but this is bootstrapped into a usable form by a separate specification, RFC8285 (originally RFC5285). Header extensions can have one-byte or two-byte forms, but this document will only cover the more common one-byte form. Since header extensions are generally proprietary and only used between a small set of implementations there is less concern with interoperability, and many implementations only ever use and support one-byte extensions.
Before any header extension can be used it must be negotiated in SDP to establish that both sides support it. This is done using a new attribute, extmap, which has the following format:
a=extmap:<ID>["/"<direction>] <URI> <extensionattributes>
- ID: An integer value that represents the ID to be used in the extension header. 0 and 15 are reserved; one-byte extensions must use values between 1 and 14 inclusive. IDs must be unique within a media block, or within the entire SDP if used at the session level.
- Direction: An optional direction string of the same form as an SDP direction attribute (sendrecv, recvonly, sendonly or inactive). If not present defaults to sendrecv, as with the direction attribute.
- URI: A URI that defines the extension. The specification recommends using a real, resolvable URI that documents the extension, but in practice, many implementations use this just as a unique identifier string that happens to be formatted like a URI.
- Extensionattributes: Additional optional parameters specific to the extension in question.
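As a rough illustration of the attribute format above, the following Python sketch parses a single extmap line. The names `ExtMap`, `parse_extmap` and `is_one_byte_id` are hypothetical, and the regular expression is a simplification that does not attempt full SDP grammar validation. The audio level URI in the usage note is the one registered for the client-to-mixer audio level extension and is used here only as an example.

```python
import re
from typing import NamedTuple, Optional


class ExtMap(NamedTuple):
    id: int
    direction: str              # defaults to "sendrecv" when absent
    uri: str
    attributes: Optional[str]   # extension-specific parameters, if any


_EXTMAP_RE = re.compile(
    r"^a=extmap:(\d+)"
    r"(?:/(sendrecv|sendonly|recvonly|inactive))?"
    r"[ \t]+(\S+)"
    r"(?:[ \t]+(.*))?$"
)


def parse_extmap(line: str) -> ExtMap:
    """Parse one 'a=extmap:' SDP line into its four components."""
    m = _EXTMAP_RE.match(line.strip())
    if m is None:
        raise ValueError("not a valid extmap line: %r" % line)
    return ExtMap(int(m.group(1)), m.group(2) or "sendrecv",
                  m.group(3), m.group(4) or None)


def is_one_byte_id(ext_id: int) -> bool:
    """One-byte header extensions can only use IDs 1 to 14 inclusive."""
    return 1 <= ext_id <= 14
```

For example, `parse_extmap("a=extmap:1/sendonly urn:ietf:params:rtp-hdrext:ssrc-audio-level vad=on")` yields ID 1, direction `sendonly`, the URI, and `vad=on` as the extension attributes.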
An SDP Offer may include any number of extmap attributes, which can be present at the session or media level, each of which may have any direction. An SDP Answer is more constrained. The Answer cannot include extmap lines that were not present in the Offer, with the exception of moving a session-level extension to the individual media blocks. If the answerer wishes to negotiate support for an extension, it should include the same extmap with the same ID and a direction compatible with the one offered, following the same rules as the SDP direction attribute: any direction if the Offer was sendrecv; recvonly or inactive if the Offer was sendonly; sendonly or inactive if the Offer was recvonly; and inactive if the Offer was inactive.
Note that an inactive extmap line behaves differently from the inactive SDP direction attribute. Marking media as inactive is operationally different from not offering or not accepting an SDP “m=” line, whereas marking an extmap line as inactive in either the Offer or the Answer is operationally equivalent to not negotiating it at all. It does, however, imply that the implementation understands the extension in question but is choosing not to use it currently.
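The direction rules for the Answer can be captured in a small lookup table. This is a sketch only; `ALLOWED_ANSWER`, `answer_direction_ok` and `extension_active` are names invented for this example.

```python
# Answer directions permitted for each offered extmap direction.
ALLOWED_ANSWER = {
    "sendrecv": {"sendrecv", "sendonly", "recvonly", "inactive"},
    "sendonly": {"recvonly", "inactive"},
    "recvonly": {"sendonly", "inactive"},
    "inactive": {"inactive"},
}


def answer_direction_ok(offer: str, answer: str) -> bool:
    """True if the answered extmap direction is compatible with the offer."""
    return answer in ALLOWED_ANSWER[offer]


def extension_active(offer: str, answer: str) -> bool:
    """True if the extension was actually negotiated for use.

    An inactive extmap on either side is treated the same as the
    extension not having been negotiated at all.
    """
    return (answer_direction_ok(offer, answer)
            and offer != "inactive" and answer != "inactive")
```

A receiver of an Answer could run each returned extmap through `answer_direction_ok` and discard any extension for which `extension_active` is false.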
If an RTP packet is to include one or more header extensions the X header bit should be set to indicate their use. A new set of bytes should now be included in the optional ‘RTP Extension’ portion of the header (diagram above) with the following format:
The initial 16 bits, marked “defined by profile” in RFC3550, should be set to the unique identifier value 0xBEDE (48862) to indicate the use of one-byte header extensions (because, believe it or not, the first draft of the header extension specification was written on the feast day of the Venerable Bede, the English monk who lived in the 7th and 8th centuries).
The length field is an additional 16 bits that give the number of additional 32-bit words following this extension header (e.g., 2 corresponds to an additional 64 bits following this length field). These additional words hold the actual header extensions, of which there can be any number. Note when parsing that a length of 0 is valid, though when writing an RTP packet you should simply set the X bit to 0 and have no RTP extension section at all if you do not want to include any header extensions.
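As a small sketch of how a receiver might handle these first four bytes, the function below checks for the 0xBEDE identifier and uses the length field to cut out the extension body. The name `split_extension_block` is hypothetical, and the input is assumed to start at the first byte of the RTP extension section.

```python
import struct

ONE_BYTE_PROFILE = 0xBEDE  # identifier for one-byte header extensions


def split_extension_block(block: bytes) -> bytes:
    """Return the extension body that follows the 4-byte extension header.

    `block` is assumed to begin at the 'defined by profile' field and may
    run on into the RTP payload; only the extension body is returned.
    """
    if len(block) < 4:
        raise ValueError("truncated extension header")
    profile, words = struct.unpack_from("!HH", block)
    if profile != ONE_BYTE_PROFILE:
        raise ValueError("not a one-byte header extension block")
    body = block[4:4 + 4 * words]   # the length counts 32-bit words
    if len(body) != 4 * words:
        raise ValueError("truncated extension body")
    return body
```

A length of 0 simply produces an empty body, matching the note above that such a block is valid when parsing.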
The format of each header extension is as follows:
- ID: 4 bits long, this corresponds to the value between 1 and 14 negotiated in SDP.
- Length: 4 bits, this corresponds to the length in bytes of the following header data minus 1. As such, a length of 0 corresponds to 1 byte, a length of 1 to 2 bytes, and a length of 15 to 16 bytes (the maximum length for this format of header extension).
- Data: The actual header extension data for this extension, of a length of between 1 and 16 bytes as defined by the length field above.
Implementations may also include bytes with value 0 between any extensions for padding, as well as at the end to pad the header extension section to a whole number of 32-bit words. When parsing header extensions, if the byte where an extension would start is 0, skip it as padding and move on to the next. When writing header extensions, padding should not be included between extensions outside of very specific use cases such as aligning extension data to a particular point for easy packet inspection; otherwise, just add 0-3 bytes of padding at the end to align the extension portion to a 32-bit word.
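The element format and padding rules can be sketched as a parser and a matching writer. Both function names are illustrative only. The handling of the reserved ID 15 (stop parsing and keep what was read so far) reflects our reading of RFC8285 and should be treated as an assumption of this sketch.

```python
import struct
from typing import List, Tuple


def parse_one_byte_extensions(body: bytes) -> List[Tuple[int, bytes]]:
    """Parse the extension body (after the 0xBEDE/length header)."""
    elements = []
    i = 0
    while i < len(body):
        first = body[i]
        if first == 0:
            i += 1                      # padding byte, skip it
            continue
        ext_id = first >> 4
        length = (first & 0x0F) + 1     # field stores length minus 1
        if ext_id == 15:
            break                       # reserved ID: stop parsing (assumption)
        i += 1
        if i + length > len(body):
            raise ValueError("extension data runs past the end of the block")
        elements.append((ext_id, body[i:i + length]))
        i += length
    return elements


def build_one_byte_extensions(elements: List[Tuple[int, bytes]]) -> bytes:
    """Serialize elements into a full extension section, header included."""
    body = bytearray()
    for ext_id, data in elements:
        if not 1 <= ext_id <= 14:
            raise ValueError("one-byte extension IDs must be 1 to 14")
        if not 1 <= len(data) <= 16:
            raise ValueError("one-byte extension data must be 1 to 16 bytes")
        body.append((ext_id << 4) | (len(data) - 1))
        body += data
    body += bytes(-len(body) % 4)       # 0-3 bytes of trailing padding
    return struct.pack("!HH", 0xBEDE, len(body) // 4) + bytes(body)
```

The writer only pads at the end, in line with the recommendation above, while the parser tolerates padding anywhere between elements.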
Note that, even when negotiated, there is no requirement to include a header extension in every packet. In some cases this may be necessary (such as an audio energy extension in audio packets, which signals the volume of the audio so that servers can make decisions based on it without actually decrypting or decoding the audio), but for video in particular an extension may only be needed on frame boundaries, or even less frequently. Header extensions add overhead directly and, because of the MTU size (see later), also reduce the maximum size of the RTP payload, which adds further overhead whenever a frame has to be split across additional packets. Care should therefore be taken to minimize their use in any situation where bandwidth efficiency has value.
When designing a way to convey proprietary information, there is often a range of options. RTP header extensions are best suited to relatively small amounts of data that is either directly relevant to the payload of that RTP packet, or must be available as the media is being processed. If neither of those requirements holds it may be better to convey the information in RTCP (see the later series on RTCP covering feedback messages), in the signaling channel, or in a separate data channel such as SCTP. For some video codecs there is also the option to embed extension data within the video frame itself (via techniques such as SEI messages in H.264).