Modern Video-Conferencing Systems: Understanding SDP Offer/Answer Negotiation
Tags: Real-Time Media Series
In the previous blog entries in this series we introduced SDP (Session Description Protocol) and its attributes. SDP was originally created as a declarative, machine-readable way for systems to advertise media that would be broadcast at a particular time or set of times. However, it has since been adopted as a way for two devices to negotiate a mutually supported set of media streams and their properties, and this negotiation brings its own set of requirements and constraints. These are defined in RFC 3264, which specifies what it calls the Offer/Answer Model for allowing the mutual negotiation of capabilities using SDP.
The way Offer/Answer works is that one side of the exchange constructs an initial SDP Offer and sends it to the other party. Once the other party receives it, they parse it and then construct an SDP Answer response that combines their own capabilities with the capabilities expressed in the Offer to produce a mutually acceptable set of capabilities. The originator of the Offer receives this SDP Answer, at which point both sides have mutually agreed on the media streams and their properties to be used in the call.
How and when the SDPs are sent is dependent on the signaling protocol in use, whether that is SIP, WebRTC or a proprietary format. One thing to note is that in many cases the initial SDP Offer is included in the initial call creation message of the exchange. This minimizes the number of round-trips before media can flow and hence minimizes the call set-up time, but it does introduce the potential for interoperability issues: the initial offer is being composed with no awareness of the device that will be answering the call or through which middle-boxes it will travel, meaning that in some cases steps may need to be taken to ensure maximum interoperability and prevent call failures with older equipment.
Because of this issue, some protocols such as SIP include an alternative, colloquially often called “Late Offer” or “Delayed Offer”, where the initial INVITE message does not include an SDP, meaning that the callee generates the initial Offer. This can help with interoperability (since the callee has some information about the caller’s device type) at the cost of delaying the call start. In practice, even in protocols such as SIP where Late Offer is available, the most common usage is to send the SDP in the initial message (“Early Offer”). Having two modes of operation like this is something of a long-running tradition in video conferencing protocols, with the old H.323 signaling protocol similarly having “fast start” and “slow start” modes.
The previous blog entries on SDP covered the generation of an SDP Offer. Generating an SDP Answer follows the same rules but with a number of additional constraints. Some are strict requirements, while others are recommended by the specification and should be followed as a matter of best practices.
One key requirement is the matching of “m=” lines and the codecs therein. Per the “m=” line section, the SDP Answer must not remove or re-order any of the “m=” lines in the SDP. Furthermore, an SDP Answer must not add any new “m=” lines that were not in the SDP Offer; new “m=” lines can only ever be introduced in SDP Offers. If the Answerer does not support a given “m=” line from the Offer, it should reject that line by setting its port to 0 while preserving its position in the SDP and its media type.
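The “m=” line rule can be sketched in a few lines of Python. This is only an illustration, not a full SDP implementation: `answer_m_lines` and the `choose_port` callback are hypothetical names, and the format list is copied from the Offer unchanged (codec filtering is a separate step).

```python
def answer_m_lines(offer_m_lines, choose_port):
    """Build the Answer's m= lines from the Offer's m= lines.

    offer_m_lines: list of strings such as "m=audio 49170 RTP/AVP 0 8".
    choose_port:   hypothetical callback (index, media_type) -> local port,
                   returning 0 for an m= line this endpoint does not support.
    """
    answer = []
    for index, line in enumerate(offer_m_lines):
        media, _offer_port, proto, *formats = line[len("m="):].split()
        port = choose_port(index, media)  # 0 rejects the line
        # Same position, same media type and protocol as the Offer;
        # nothing is removed, re-ordered, or added.
        answer.append(f"m={media} {port} {proto} {' '.join(formats)}")
    return answer


offer = [
    "m=audio 49170 RTP/AVP 0 8",
    "m=video 51372 RTP/AVP 96",
    "m=application 5000 UDP/BFCP *",
]
local_ports = {0: 40000, 1: 40002}  # this endpoint has no use for the third line
answer = answer_m_lines(offer, lambda index, media: local_ports.get(index, 0))
```

Keying the decision on the line's index rather than its media type matters once an Offer carries more than one “m=” line of the same type, such as main video and slides video.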
Inside each media block, the Answer must only include codec configurations that match those that appear in the Offer. To match, the rtpmap values (i.e., the codec type) must match; how much of the fmtp must match is defined by the specification for each codec. For instance, when negotiating the audio codec AAC, the fmtp parameters in the Answer that define the bitrate, type, etc. must match those of the Offer, while in contrast, when negotiating the Opus codec, the fmtp parameters in the Answer can vary widely from those in the Offer. Codecs or configurations that appear in the Offer that the Answerer does not support should not appear in the Answer. If there are no mutually supported codec configurations the “m=” line should be disabled, and if no “m=” lines can be negotiated the call should fail rather than sending an SDP Answer with no enabled “m=” lines.
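A minimal sketch of that codec filtering is shown here. It assumes each media block has already been parsed into (payload type, rtpmap encoding) pairs, and it ignores fmtp matching entirely because that is codec-specific. `negotiate_codecs` and `NoCommonMedia` are hypothetical names.

```python
class NoCommonMedia(Exception):
    """Raised when no m= line ends up with a mutually supported codec."""


def negotiate_codecs(offer_blocks, supported):
    """Filter the offered codecs down to those this endpoint supports.

    offer_blocks: one list per m= line, in Offer order, of
                  (payload_type, rtpmap_encoding) pairs.
    supported:    set of lower-cased rtpmap encodings we can handle.
    Returns one list per m= line; an empty list means that m= line
    should be disabled (port 0) in the Answer.
    """
    answer_blocks = [
        # Encoding names are compared case-insensitively.
        [(pt, enc) for pt, enc in block if enc.lower() in supported]
        for block in offer_blocks
    ]
    if not any(answer_blocks):
        # Nothing negotiable on any m= line: fail the call instead of
        # answering with every line disabled.
        raise NoCommonMedia("no mutually supported codec on any m= line")
    return answer_blocks


offer = [
    [(0, "PCMU/8000"), (96, "opus/48000/2")],    # audio
    [(97, "H264/90000"), (98, "VP8/90000")],     # video
]
audio_video = negotiate_codecs(offer, {"opus/48000/2", "vp8/90000"})
audio_only = negotiate_codecs(offer, {"pcmu/8000"})
```

Because the filter walks the Offer's list, the surviving codecs keep both the Offer's ordering and its payload type numbers.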
Beyond matching the codecs there are a number of actions RFC 3264 recommends but which are not mandatory. When picking dynamic payload types for rtpmap/fmtp the Answer SDP should use the same payload type as was used for the corresponding codec configuration in the Offer (unless the Answerer has previously assigned that payload type to something else, as these may never change once chosen). Further, the ordering of the codecs in the media block in the Answer should match the ordering from the Offer. Neither of these is a hard requirement, though, so implementations must be ready to cope with receiving Answers that do not do either.
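One way an Offerer can cope with such Answers is to pair codecs up by their rtpmap encoding rather than by payload type number or position. The sketch below makes several simplifying assumptions: it only looks at a=rtpmap lines (so static payload types without an rtpmap are ignored), it does not distinguish multiple fmtp configurations of the same codec, and the function names are hypothetical. Per RFC 3264, the Offerer then sends using the payload type numbers from the Answer.

```python
def payload_map(media_block):
    """Return {payload_type: encoding} from the a=rtpmap lines of one media block."""
    mapping = {}
    for line in media_block.splitlines():
        line = line.strip()
        if line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            mapping[int(pt)] = encoding.strip().lower()
    return mapping


def negotiated_payload_types(offer_block, answer_block):
    """List (encoding, receive_pt, send_pt) for each codec in the Answer.

    receive_pt is the number we advertised in our Offer; send_pt is the
    number the Answerer advertised, which may differ from ours.
    """
    offered = {enc: pt for pt, enc in payload_map(offer_block).items()}
    result = []
    # Walk the Answer in its own order, which may differ from the Offer's.
    for send_pt, enc in payload_map(answer_block).items():
        if enc in offered:
            result.append((enc, offered[enc], send_pt))
    return result


offer_audio = (
    "m=audio 49170 RTP/AVP 96 97\n"
    "a=rtpmap:96 opus/48000/2\n"
    "a=rtpmap:97 G722/8000\n"
)
answer_audio = (
    "m=audio 40000 RTP/AVP 111\n"
    "a=rtpmap:111 OPUS/48000/2\n"
)
pairs = negotiated_payload_types(offer_audio, answer_audio)
```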
There is no requirement for “b=” or “c=” fields in Answers to match those of Offers, either in value or location within the SDP. Similarly, note that the SDP version value in the “o=” line of the Answer is independent of the Offer.
When it comes to other “a=” attributes, whether their appearance in an Offer affects them depends on the specific attribute. Some attributes are declarative, meaning that they are essentially independent and can appear (or not) in the SDP Answer irrespective of the contents of the SDP Offer. Others are negotiated, meaning that their presence and value in the Answer must in some way correspond to matching fields in the originating Offer.
As a rule, declarative attributes and fields in SDP generally describe the SDP creator’s receive capabilities, defining what they are able to receive, rather than what they can transmit. For instance, the “b=” field defines the maximum bitrate the SDP creator is able to receive on the session or “m=” line, not the maximum they can transmit. Declarative attributes that define a transmit capability often begin with ‘sprop-’ (short for sender property) to help clarify, but this is not guaranteed. For instance, the crypto attribute (covered in a later blog entry on media encryption) defines the transmit encryption key, not the receiver’s.
According to its specification (RFC 4796) the content attribute is declarative: it can appear in media blocks in an Answer even if there was no content attribute in the corresponding media block in the Offer, and if one did appear in the Offer, the value of the content attribute in the corresponding media block in the Answer does not have to match. However, while this is valid according to the specification, an SDP Answer that negotiates a slides video media block against an Offer that advertised main video is unlikely to lead to any kind of successful user experience, so in practice the value of the content attribute in each “m=” line should always match between Offer and Answer.
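The media direction attributes are negotiated fields with a set of rules that essentially correspond to the fact that the SDP Answer can never negotiate a direction that was not enabled in the Offer, but can restrict the directions further:
- If the Offer is sendrecv, the Answer can be sendrecv, sendonly, recvonly or inactive.
- If the Offer is sendonly, the Answer can be recvonly or inactive.
- If the Offer is recvonly, the Answer can be sendonly or inactive.
- If the Offer is inactive, the Answer must be inactive.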
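The direction rules of RFC 3264 can be sketched as a small function that combines the Offer's direction with what the Answerer itself wants to do. The name `answer_direction` and the `want_send`/`want_recv` flags are assumptions made for illustration.

```python
def answer_direction(offer_direction, want_send, want_recv):
    """Pick the Answer's direction attribute for one m= line.

    offer_direction: "sendrecv", "sendonly", "recvonly" or "inactive".
    want_send / want_recv: what the Answerer's own configuration allows.
    """
    offerer_sends = offer_direction in ("sendrecv", "sendonly")
    offerer_receives = offer_direction in ("sendrecv", "recvonly")

    # We may only send if the Offerer is willing to receive, and only
    # receive if the Offerer is willing to send.
    send = want_send and offerer_receives
    recv = want_recv and offerer_sends

    if send and recv:
        return "sendrecv"
    if send:
        return "sendonly"
    if recv:
        return "recvonly"
    return "inactive"
```

For example, an Answerer that wants bidirectional video but receives a sendonly Offer ends up with `answer_direction("sendonly", True, True)`, which is `"recvonly"`.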
rtcp-fb attributes are negotiated fields: while an SDP Offer should contain all of the attributes the sender supports, a corresponding SDP Answer should only contain the subset of rtcp-fb attributes that the Answerer supports and that also appeared in the SDP Offer. The Answer should never contain rtcp-fb attributes that were not present in the Offer.
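As a rough sketch, the rtcp-fb subset rule is a simple filter over what the Offer contained. It assumes the a=rtcp-fb lines have already been parsed into (payload type, feedback value) pairs; `answer_rtcp_fb` is a hypothetical helper name.

```python
def answer_rtcp_fb(offered, supported):
    """Return the rtcp-fb entries to place in the Answer.

    offered:   list of (payload_type, feedback) pairs from the Offer,
               e.g. ("96", "nack pli"); payload_type may also be "*".
    supported: set of feedback values this endpoint implements.
    """
    # Only entries that were offered AND are locally supported survive,
    # so the Answer can never introduce a feedback type of its own.
    return [(pt, fb) for pt, fb in offered if fb in supported]


offered_fb = [("96", "nack"), ("96", "nack pli"), ("96", "ccm fir"), ("97", "nack")]
answer_fb = answer_rtcp_fb(offered_fb, {"nack", "nack pli", "transport-cc"})
```

Note that "transport-cc" is in the local set here but not in the Offer, so it does not appear in the result.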
Once the initial SDP exchange has completed, either side can add, change, or remove capabilities by sending a subsequent SDP Offer. As previously stated, any “m=” line previously sent in an SDP in the session must be preserved in order, but new “m=” lines can now be added. Remember that the payload types of any codec configurations already used in “m=” lines must also be retained.
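One way to guarantee that payload types are retained across renegotiations is to keep a small per-“m=”-line registry for the lifetime of the session. The class below is a sketch under that assumption; `PayloadTypeRegistry` is a hypothetical name, and it only hands out numbers from the dynamic range 96 to 127.

```python
DYNAMIC_PAYLOAD_TYPES = range(96, 128)


class PayloadTypeRegistry:
    """Remembers every payload type assigned on one m= line for the session."""

    def __init__(self):
        self._by_codec = {}

    def assign(self, codec, preferred=None):
        """Return the payload type for a codec configuration.

        codec:     a string identifying the configuration, e.g. "opus/48000/2".
        preferred: the number the far end used for it, if any.
        """
        key = codec.lower()
        if key in self._by_codec:
            # Already assigned earlier in the session: never change it.
            return self._by_codec[key]
        used = set(self._by_codec.values())
        candidates = [] if preferred is None else [preferred]
        candidates += list(DYNAMIC_PAYLOAD_TYPES)
        for pt in candidates:
            if pt not in used:
                self._by_codec[key] = pt
                return pt
        raise RuntimeError("no free dynamic payload type left")


registry = PayloadTypeRegistry()
opus_pt = registry.assign("opus/48000/2", preferred=111)
dtmf_pt = registry.assign("telephone-event/8000", preferred=111)  # 111 is taken
opus_again = registry.assign("OPUS/48000/2", preferred=100)       # keeps 111
```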
As a rule, a mid-call SDP Offer should include all of the endpoint’s capabilities, even if they were not mutually accepted in the previous SDP Offer. This may seem unnecessary, since if a particular codec was not previously negotiated why bother to include it again, but things may have happened behind the scenes such as a blind transfer (where the far end has changed the terminating endpoint without any signaling update beyond an SDP exchange) and so it is possible that capabilities that were previously not enabled may now be available.
Negotiated attributes should follow the rules for Offers, not Answers. For example, Alice’s endpoint wants to both send and receive video, but at the start of the call she receives an SDP Offer with the video block set to a=sendonly, so she replies a=recvonly in her Answer. If Alice subsequently sends an Offer and her configuration has not changed, she should send a=sendrecv, not a=recvonly. This is quite important: there have been cases in the past where endpoint A mutes itself mid-call by changing its audio direction to recvonly, and the call then ‘locks down’ to unidirectional even after the user unmutes, because one or both endpoints do not send an SDP Offer with all of their capabilities.
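Alice's case can be sketched as follows. The point of the example is that the direction placed in a new Offer is derived from the endpoint's own configuration and deliberately ignores what was negotiated last time. `MediaConfig` and `direction_for_offer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MediaConfig:
    want_send: bool = True
    want_recv: bool = True
    # Result of the previous exchange; kept for information only.
    last_negotiated: str = "inactive"


def direction_for_offer(config):
    """Direction to advertise in a new Offer: local intent only."""
    # config.last_negotiated is intentionally not consulted here, so a
    # restrictive earlier Answer cannot lock the call down.
    if config.want_send and config.want_recv:
        return "sendrecv"
    if config.want_send:
        return "sendonly"
    if config.want_recv:
        return "recvonly"
    return "inactive"


# Alice answered a sendonly Offer with recvonly, but still wants both directions.
alice_video = MediaConfig(want_send=True, want_recv=True, last_negotiated="recvonly")
new_offer_direction = direction_for_offer(alice_video)

# A muted endpoint advertises recvonly; once unmuted, the next Offer recovers.
muted = MediaConfig(want_send=False, want_recv=True, last_negotiated="sendrecv")
unmuted = MediaConfig(want_send=True, want_recv=True, last_negotiated="recvonly")
```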
One thing to note for mid-call Offer/Answer exchanges is that it’s important that only one takes place at a time, or else it can be unclear to both sides what was mutually agreed. There is a condition called glare where both sides generate a new SDP Offer at a similar time and they cross on the wire in transit. How this is handled is up to the signaling protocol (SIP, WebRTC, etc) and will be specified in that protocol. Normally the protocol either includes a tiebreaker field so that both sides understand which of the two SDPs should be accepted, or both SDPs are invalidated and both sides back off and try again (with some randomized delay to try and avoid a second glare clash).
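As one concrete instance of the back-off approach, SIP (RFC 3261, section 14.1) rejects a crossing re-INVITE with 491 Request Pending, and each side retries after a random delay whose range depends on whether it owns the Call-ID: 2.1 to 4 seconds for the owner, 0 to 2 seconds otherwise, in 10 ms steps. The helper below is a sketch of just that timer choice; the function name and the injectable `rng` parameter are assumptions for illustration.

```python
import random


def sip_glare_retry_delay(owns_call_id, rng=random):
    """Seconds to wait before retrying an Offer after a 491 response.

    owns_call_id: True if this endpoint generated the dialog's Call-ID.
    rng:          object with a randrange() method (defaults to the
                  random module; pass random.Random(seed) in tests).
    """
    if owns_call_id:
        hundredths = rng.randrange(210, 401)   # 2.10 s .. 4.00 s
    else:
        hundredths = rng.randrange(0, 201)     # 0.00 s .. 2.00 s
    return hundredths / 100.0


rng = random.Random(1234)
owner_delays = [sip_glare_retry_delay(True, rng) for _ in range(1000)]
other_delays = [sip_glare_retry_delay(False, rng) for _ in range(1000)]
```

Because the two ranges do not overlap, the two sides cannot pick the same delay, which is what prevents an immediate second clash.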
Glare-handling code may seem like something that can be skipped when developing a feature, as an edge case that doesn’t come up in practice. While it is true that glare is rare in most call scenarios, when it does occur it can be very hard to debug, so effort should be spent to implement this functionality as soon as is practicable. The time glare is most likely to occur is at the start of the call, as there are a number of reasons why endpoints may choose to send a second SDP Offer as soon as the initial exchange completes, and if both sides of the call are doing so then glare is very likely.
One reason for these start-of-call SDP renegotiations is that some endpoints attempt to work around interoperability concerns by waiting for the initial SDP exchange to complete, so that more information about the far end is available, and then enabling more advanced functionality with a subsequent SDP (particularly if Early Offer is in use). This can be valuable in cases where some far ends may struggle to handle a particular piece of unexpected SDP syntax (such as extra “m=” audio and video lines for functionality such as content audio).
Another reason for immediate renegotiation of SDPs is endpoints that are working around the requirement to identify incoming media by its payload type and dynamically configure their decoder. This is mandatory if they have advertised multiple codecs in their initial SDP Answer, so they avoid it by “locking down” to a single negotiated codec per “m=” line. If they are sending the Answer then they can simply do this as part of the initial SDP exchange, but if they are sending the initial Offer and still want to interoperate with endpoints with a range of supported codecs then they need to send an initial Offer listing all their supported codecs, wait for the Answer and then immediately send a new Offer including just a single codec option per “m=” line chosen from those mutually negotiated in the initial SDP exchange. This can significantly increase the time taken for media to start (sometimes referred to as the “media cut-through time”), as well as being fragile in scenarios such as hidden transfers, so this technique is best avoided.
More from Rob’s Real-Time Media and Modern Video-Conferencing Systems series: