Voice over IP Packet Structure

This note provides a brief refresher on the VoIP packet structure and points to other notes on VoIP-related topics. The information here applies to IPv4. Because migration to IPv6 is slow, IPv4 and IPv6 hosts will coexist for several years to come. Planning for a seamless migration from IPv4 to IPv6 has therefore led to the development of real-time translators, as described in Ref. [1].

The VoIP packet structure largely reflects the layered structure of the OSI model (cf. Ref. [2]). A VoIP packet is composed of the IP header, followed by the UDP header, the RTP header, and finally the payload (see Figure 1).

Figure 1: Structure of the VoIP packet (as in IPv4)

Adding up the minimum sizes of the individual headers (20 bytes for IPv4, 8 bytes for UDP and 12 bytes for RTP), the combined IP/UDP/RTP header is at least 40 bytes, which is a tangible overhead. For example, a 20 ms VoIP packet with G.711 PCM carries 160 bytes of payload, so the headers add 25% on top of the payload (20% of the total packet), which is not negligible from the viewpoint of network traffic. The RTP header includes several fields that are closely related to the nature of the VoIP packet's payload, so it is worth examining more closely. Its structure is shown in Figure 2 (cf. Ref. [3]).
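The overhead arithmetic above can be reproduced with a short sketch. It assumes an IPv4 header without options and an RTP header without CSRC entries or extension; the function name `voip_overhead` and its parameters are illustrative, not taken from any library.

```python
# Minimum header sizes in bytes (assumptions: no IP options, no CSRC list, no RTP extension).
IP_HEADER_BYTES = 20
UDP_HEADER_BYTES = 8
RTP_HEADER_BYTES = 12


def voip_overhead(bitrate_bps: int, frame_ms: int) -> dict:
    """Return header/payload sizes and overhead ratios for one VoIP packet."""
    header = IP_HEADER_BYTES + UDP_HEADER_BYTES + RTP_HEADER_BYTES
    payload = bitrate_bps * frame_ms // (8 * 1000)  # bits per frame -> bytes
    total = header + payload
    return {
        "header_bytes": header,
        "payload_bytes": payload,
        "overhead_vs_payload": header / payload,  # header relative to payload
        "overhead_vs_packet": header / total,     # header share of the whole packet
    }


# G.711 runs at 64 kbit/s; 20 ms is the packet duration used in the example above.
g711 = voip_overhead(64_000, 20)
print(g711)
```

Reporting both ratios makes explicit which denominator a quoted overhead figure refers to.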

Figure 2: Structure of the RTP header, according to RFC 3550

The first 4 bytes include:

Ver: version – 2 bits (set to 2 for RTP as defined in RFC 3550);
P: padding – 1 bit;
X: extension – 1 bit;
CC: CSRC count – 4 bits (thus, it allows up to 15 items on the CSRC list);
M: marker – 1 bit. For VoIP, M=1 indicates that the packet is the first speech packet after a silence period. Note: some playout buffer implementations use it to detect the start of a talkspurt (cf. Ref. [4]);
PT: payload type – 7 bits. The value identifies the payload format and differs between codecs (examples: PT=0 for G.711 μ-law speech, PT=8 for G.711 A-law speech, and PT=13 for comfort noise);
Sequence number – 16 bits. It increments by one for each RTP packet sent, and its initial value is selected at random. The receiver may use the sequence number to detect missing packets and to restore the original packet order.
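As an illustration of how a receiver might use the sequence number to detect missing packets, the following minimal sketch counts the packets skipped between two consecutively received sequence numbers, taking the 16-bit wraparound into account. The function name `count_missing` and the half-range rule for treating a packet as late are assumptions made for this example, not something prescribed by this note.

```python
def count_missing(prev_seq: int, curr_seq: int) -> int:
    """Count packets skipped between two consecutively received sequence numbers."""
    gap = (curr_seq - prev_seq) & 0xFFFF  # difference modulo 2**16
    if gap == 0 or gap >= 0x8000:
        # Duplicate, or a late (reordered) packet: nothing new is missing.
        return 0
    return gap - 1


print(count_missing(10, 11))     # consecutive packets
print(count_missing(10, 13))     # packets 11 and 12 were not received
print(count_missing(65535, 1))   # wraparound: packet 0 was not received
```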

Then the following fields:

Timestamp – 32 bits. The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant is derived from a clock that increments monotonically and linearly in time, which allows synchronization and jitter calculations;
SSRC: synchronization source – 32 bits. It identifies the synchronization source. Examples of synchronization sources include the sender of a stream of packets derived from a signal source such as a microphone or a camera, or an RTP mixer (see below);
CSRC: contributing sources – a list of 0 to 15 items of 32 bits each (the number of items is given by CC), identifying the contributing sources for the payload contained in the packet. An example application is audio conferencing, where a mixer indicates all the talkers whose speech was combined to produce the outgoing packet.
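The field layout described above can be made concrete with a small parser. This is a minimal sketch using only Python's standard `struct` module; the `RtpHeader` class and `parse_rtp_header` function are hypothetical names chosen for this example, and the sketch does not interpret the optional header extension signalled by the X bit.

```python
import struct
from dataclasses import dataclass


@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrc: list


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header and the CSRC list that may follow it."""
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    # Network byte order: two single bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    csrc = list(struct.unpack(f"!{cc}I", packet[12:end]))
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrc=csrc,
    )


# Hand-built sample: V=2, no padding/extension/CSRC, M=1, PT=8 (A-law),
# followed by 160 bytes of dummy payload.
sample = struct.pack("!BBHII", 0x80, 0x88, 1000, 160, 0x12345678) + bytes(160)
hdr = parse_rtp_header(sample)
print(hdr)
```

The payload starts at offset `12 + 4 * csrc_count` when the X bit is clear; a packet with X set would need the extension header skipped as well.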

The last field in the VoIP packet structure is the payload, which carries the encoded voice data. Its size in bytes follows from the codec bit rate and the pre-defined packet duration. Typically, VoIP packets are 10 ms, 20 ms or 40 ms long (the duration in milliseconds refers to the payload only). Other packet durations are permissible, although they are not frequently used.
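To relate the packet duration to the payload, the sketch below computes the number of samples and payload bytes per packet for the typical durations. It assumes G.711 (8000 samples per second, 8 bits per sample); the function name `packet_sizes` is illustrative. With these assumptions, the number of samples per packet is also the amount by which the RTP timestamp advances from one packet to the next during a talkspurt.

```python
SAMPLE_RATE_HZ = 8000   # G.711 sampling rate (assumed codec)
BITS_PER_SAMPLE = 8     # G.711 encodes each sample in 8 bits


def packet_sizes(frame_ms: int) -> tuple:
    """Return (samples per packet, payload bytes per packet) for a given duration."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    payload_bytes = samples * BITS_PER_SAMPLE // 8
    return samples, payload_bytes


for ms in (10, 20, 40):
    samples, payload = packet_sizes(ms)
    print(f"{ms} ms: {samples} samples, {payload} payload bytes")
```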

References

[1] IPv4-IPv6 translator for VoIP and video conferencing, Davis, A.K. et al., 2011 International Conference on Communications and Signal Processing, 10-12 Feb. 2011, pp. 367-369.
[2] Internetworking with TCP/IP, Vol. 1: Principles, Protocols, and Architecture, Douglas E. Comer, Prentice Hall, 3rd Edition.
[3] RFC 3550: RTP: A Transport Protocol for Real-Time Applications.
[4] Playout Buffering for Conversational Voice over IP, Gong, Q., McGill Univ., 2012.