Video on Demand: RTSP vs HTTP

HTTP was designed to convey documents. It is by far the most used application-layer protocol on the Internet (leaving DNS aside), and it has been (very) widely extended and abused to do lots of other things. It started with dynamic content, which is at the heart of most modern websites (not Remlab.net though!), especially the commercial ones, but HTTP can also convey Remote Procedure Calls (with SOAP), synchronize emails, carry some web-based online games, stream radio channels, distribute video on demand, etc.

The IETF has specified another similar-looking protocol, RTSP, to transmit real-time multimedia content. RTSP supports delivery of both live and on-demand content. Unfortunately, RTSP/1.0 has suffered from extremely poor interoperability and serious specification errors.

As a developer of VLC media player, I often get to see people wondering whether they should use RTSP or HTTP to stream their audio or audio/video content. Even though I like open standards (which the IETF champions), I cannot say I am a big fan of RTSP. In fact, I think people should use HTTP for video-on-demand. Here's why:

Interoperability

HTTP is widely interoperable. It has countless implementations both on the server and client sides. It has Apache as a de facto reference implementation that every client can interoperate with.

RTSP is an interoperability disaster. None of the three big commercial implementations (Microsoft Windows Media, Apple/Darwin and RealMedia) seems to abide by the standard, and writing a client stack that properly handles all of them, plus the open-source servers, is... tough to say the least.

RTSP/2.0 will surely fix some specification errors and add some needed features, but it remains to be seen whether it will make interoperability better or worse.

Play/pause

Sure, RTSP does have built-in PLAY and PAUSE commands, which respectively play and pause the media. Some people seem to believe that HTTP cannot do that.

There are several ways you can pause a stream with HTTP:

stop dequeuing data from the TCP session, and let the TCP stacks handle the pause through TCP flow control until you want to resume,
shut down or reset the TCP session to pause, and start a new one to resume (it only takes one extra network round-trip compared to RTSP),
keep receiving and cache the data locally (timeshifting) if you have lots of free storage space.
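The second option can be sketched in a few lines of Python. This is only an illustration, not how any particular player does it: the class name PausableHttpStream and the URL are made up, and it assumes the server honours byte-range requests (otherwise resuming would restart from the beginning).

```python
import urllib.request


class PausableHttpStream:
    """Hypothetical helper: pause by dropping the connection, resume with Range."""

    def __init__(self, url):
        self.url = url          # media URL (placeholder in the example below)
        self.offset = 0         # bytes already handed over to the decoder
        self.response = None    # open HTTP response, or None while paused

    def build_request(self):
        # "Range: bytes=N-" asks for everything from byte N to the end.
        headers = {"Range": f"bytes={self.offset}-"} if self.offset else {}
        return urllib.request.Request(self.url, headers=headers)

    def play(self):
        # (Re)connecting costs one extra round-trip compared to an RTSP PLAY.
        if self.response is None:
            self.response = urllib.request.urlopen(self.build_request())

    def read(self, size=65536):
        # Assumes play() was called first; remembers how far we have read.
        data = self.response.read(size)
        self.offset += len(data)
        return data

    def pause(self):
        # Closing the connection is the whole "PAUSE" operation.
        if self.response is not None:
            self.response.close()
            self.response = None
```

A player would call play(), loop on read(), call pause() when the user hits the button, and call play() again later; the saved offset makes the new connection pick up where the old one stopped.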

Some clever folks may point out that a normal HTTP server will not let a client hold a TCP session open forever. Indeed, a client can probably only pause for a few tens of seconds or so before the server resets the session.
So what? Do you really think an RTSP server will not do the exact same thing? Of course, it will. Normal RTSP servers close inactive control connections. Well-behaved servers will still keep RTSP sessions for a while longer, but not too long either. (Note: some non-standard ones assume sessions are tied to connections and will destroy the session if/when the control connection is closed.)

Seeking

RTSP supports seeking through a comprehensive Range header, and PLAY request pipelining. It can start from some time offset, stop at another time offset, and it can even combine different time chunks sequentially. In real life, this combining does more harm than good: certain servers don't implement it for the sake of simplicity, so clients cannot really assume it is supported. Handling pipelined PLAY requests is rather difficult, especially in a non-threaded server implementation (requests must then be processed and answered immediately).

And in any case, I would think only engineers could ever use a client-side user interface capable of leveraging such a feature. In practice, multimedia client software will issue one time range per PLAY request. If it ever needed to play discontinuous time ranges of a single stream, it could issue multiple PLAY requests anyway. The extra bandwidth cost is virtually nil, and largely offset by the implementation simplicity.

What about HTTP?
Nastily, HTTP can only seek byte ranges, not time offsets... Or is it that bad? Of course, human beings care about and think in terms of durations rather than byte lengths. Still, operating systems, digital storage media (hard drives, memory sticks and laser discs alike) and file system formats only know of byte offsets.

In other words, there is always a conversion from time to byte offset in any case. With RTSP, the server takes care of it, while with HTTP, the client will probably be the one doing it (though one can also use HTTP GET request parameters to specify the start and/or end times).
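As a rough illustration of the client-side conversion, suppose the client has built a seek index of (time, byte offset) pairs, for instance from the container's seek table. The index layout and the function names below are assumptions made for the example; real containers differ in how they expose this information.

```python
import bisect


def byte_offset_for_time(index, seconds):
    """Return the byte offset of the last indexed point at or before `seconds`.

    `index` is a list of (time_in_seconds, byte_offset) pairs sorted by time.
    """
    times = [t for t, _ in index]
    # bisect_right finds the first entry strictly after `seconds`;
    # the entry just before it is the one to start decoding from.
    i = bisect.bisect_right(times, seconds) - 1
    # Clamp so that a target before the first entry maps to that first entry.
    return index[max(i, 0)][1]


def range_header_for_time(index, seconds):
    # Turn the time offset into the byte range that HTTP understands.
    return {"Range": f"bytes={byte_offset_for_time(index, seconds)}-"}


# Made-up index: one seek point every two seconds.
seek_index = [(0.0, 0), (2.0, 180_000), (4.0, 410_000), (6.0, 590_000)]
headers = range_header_for_time(seek_index, 5.0)
```

Here a seek to 5 seconds lands on the seek point at 4 seconds, and the client discards the extra second after decoding. This is the same lookup an RTSP server has to do internally; it merely runs on the other end of the connection.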

Congestion

In case of network congestion or packet loss, RTSP will cause part of the stream to be lost. That's how RTP (the underlying conveyance protocol) behaves, as it uses UDP. This is a design choice. RTP is the de facto standard protocol to carry real-time, delay-sensitive payloads such as voice over IP. With voice calls, people definitely prefer to lose some audio frames than to have to wait a few seconds for the network to recover and retransmit.

However, with video on demand, viewers would probably rather not miss some critical part of the movie's plot... They may well prefer to wait a while for the audio/video buffers to refill. This is all the more true as people are getting used to YouTube-ish transient pauses of video playback, instead of the fatal loss of information which occurs when the local terrestrial TV transmitter breaks down. Waiting for the buffers to refill is precisely what happens with HTTP, as it relies on TCP rather than UDP to carry the multimedia data.

Whereas RTSP only carries the signaling reliably, HTTP conveys both signaling and payload reliably.

NAT traversal

What do I need to say? HTTP just works. RTSP just does not work without ugly TCP encapsulation hacks that make it look like HTTP anyway.

Adding support for Interactive Connection Establishment (ICE) to RTSP has gathered interest. Unfortunately, ICE is meant to be used with SIP, not RTSP, so the marriage of RTSP and ICE will not be trivial (more interoperability issues to be expected).

Also, ICE is horribly complicated. Don't get me wrong. ICE is a well-designed scheme; it's not complicated for the sake of it. It attempts to solve a very intricate problem, so it has to be intricate. Namely, it aims at establishing communication channels between two NATted hosts.

For client-server protocols such as HTTP and RTSP, these complications are not needed at all. There is just one weird scenario where ICE helps: when the RTSP server itself is behind a NAT. It seems like a very bad complexity-to-usefulness tradeoff to me.

Firewall traversal

RTSP has theoretical firewall traversal capability, as it is designed to make it easy for a firewall to learn the IP address and port number tuples that need to be authorized. Not so many firewalls handle or allow RTSP, however.

HTTP goes through almost any firewall. Even from tightly restricted Intranets, you can always find an outgoing HTTP proxy to use.

Software commodity

Some people noted that RTSP reuses RTP, RTCP and SDP. That would make it easy/cheap to add to devices which support these protocols already. In practice, that means SIP or XMPP voice over IP clients. And these clients are typically not media players, so it is not that obvious whether the benefit is that great.

Also, SIP and XMPP clients need symmetric full-duplex RTP stacks, offer/answer-centric SDP stacks, and very small receive jitter buffers. By contrast, RTSP clients and servers normally have unidirectional RTP stacks, declarative SDP handling, and typically large receive buffers (a few seconds of delay is not an issue for video on demand, as it would be for a phone call).

N.B.: this article is concerned with video on demand. For live streaming, there may be more reasons to prefer RTSP over HTTP.
