VoIP, or Voice over Internet Protocol, has been widely heralded as the telecommunications paradigm of the 21st century. VoIP carries voice traffic that would traditionally travel over the analog Public Switched Telephone Network (PSTN) across an IP network instead, through the use of an Analog Telephone Adapter (ATA). The signal is broken up into frames, and the information in each frame is placed in digital packets that are sent over the network. Each packet carries a header that tells the receiving end how to reconstruct the signal. This header is essential because each packet traverses the network independently and may encounter different transmission conditions.
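The framing and reconstruction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real VoIP stack: the `Packet` class and its `seq` header field are hypothetical stand-ins for whatever sequencing information an actual protocol header carries.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    seq: int              # hypothetical header field: position of the frame in the stream
    payload: List[float]  # the samples of one frame

def packetize(signal, frame_len):
    """Split a sampled signal into fixed-length frames, one frame per packet."""
    return [Packet(seq=i // frame_len, payload=list(signal[i:i + frame_len]))
            for i in range(0, len(signal), frame_len)]

def reassemble(packets):
    """Use the sequence number in each header to restore the original order."""
    out = []
    for packet in sorted(packets, key=lambda p: p.seq):
        out.extend(packet.payload)
    return out

signal = [float(n) for n in range(10)]
packets = packetize(signal, frame_len=4)
random.Random(0).shuffle(packets)   # emulate packets arriving out of order
restored = reassemble(packets)
```

Without the sequence number, the receiver would have no way of knowing that the shuffled packets belong in a different order.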
Echo and Delay in VoIP Networks
Echo in VoIP networks originates from impedance mismatches in the PSTN that are carried over into the digital network. VoIP echo paths therefore resemble line echo paths in their sparseness, but differ in their much longer echo tails and delays. The extra length comes from the many processing steps the speech must undergo: the continuous-time, real-valued speech signal has to be sampled, transmitted, and reconstructed, each of which adds delay and length to the echo path.
Jitter in VoIP Networks
Jitter effects are typically due to either clock slippage or network delay [1]. Clock slippage occurs when the receiving and transmitting sides run at different clock rates, which can cause either lost packets or duplicate samples due to buffer read errors. If the buffer is persistent and circular and the receiver samples faster than the transmitter, values will be duplicated; similarly, a slower receiver will miss values.
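The duplicate and missed samples caused by clock slippage can be seen in a toy model. The sketch below assumes that at each receiver tick the receiver reads whichever sample the transmitter wrote most recently; the function names and the integer rates are illustrative assumptions, not taken from the cited thesis.

```python
def received_indices(tx_rate, rx_rate, n_reads):
    """Index of the transmitted sample seen at each receiver tick.

    Toy model: at its k-th tick the receiver reads the slot most recently
    written by the transmitter, i.e. sample floor(k * tx_rate / rx_rate).
    """
    return [(k * tx_rate) // rx_rate for k in range(n_reads)]

def duplicates_and_gaps(indices):
    """Return the samples that were read twice and the samples never read."""
    pairs = list(zip(indices, indices[1:]))
    duplicated = [b for a, b in pairs if b == a]
    missed = [m for a, b in pairs for m in range(a + 1, b)]
    return duplicated, missed

fast = received_indices(tx_rate=4, rx_rate=5, n_reads=10)  # receiver clock too fast
slow = received_indices(tx_rate=5, rx_rate=4, n_reads=8)   # receiver clock too slow

fast_dup, fast_missed = duplicates_and_gaps(fast)
slow_dup, slow_missed = duplicates_and_gaps(slow)
```

In this model the fast receiver reads samples 0 and 4 twice, while the slow receiver never sees sample 4.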
Network delay shifts the body of the impulse response along the echo path. Each packet may experience different levels of network traffic while en route, and thus packets may arrive out of sequence. Because the receiver must wait for and reorder packets before it can reconstruct the signal, additional delay results.
Vocoder Distortion and Packet Loss in VoIP Networks
Conversion of a speech frame into a packet is typically done with a low bit rate vocoder for efficiency and ease of transmission. A vocoder essentially attempts to represent the speech frame by a smaller set of parameters that will excite a speech production model on the receiving end. Distortion is introduced by an inaccurate representation, by pre- or post-filtering, and by parameter quantization. This introduces non-linearity into the echo path, which degrades the performance of a linear echo canceller [2].
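To see why quantization alone is enough to break linearity, consider a uniform scalar quantizer. This is only a stand-in for the parameter quantization inside a vocoder (it is not a vocoder), and the step size and input values are arbitrary choices for illustration.

```python
def quantize(value, step):
    """Uniform scalar quantizer: snap the value to the nearest multiple of step."""
    return step * round(value / step)

a, b, step = 0.2, 0.2, 0.5

# A linear system would give the same result either way (superposition).
sum_then_quantize = quantize(a + b, step)
quantize_then_sum = quantize(a, step) + quantize(b, step)

superposition_holds = (sum_then_quantize == quantize_then_sum)
```

Here the two small inputs each quantize to zero, but their sum does not, so superposition fails. A linear echo canceller cannot model this kind of behaviour exactly.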
The effect packet loss has on the performance of an echo canceller depends on how the lost packet is recovered in each instance [2]. One such method is packet concealment, in which the lost packet is replaced on the receiving end. Typical replacements are silence, noise, or a repeat of the previous packet. Alternatively, the lost packet can be extrapolated from the previous one.
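The three simple replacement strategies can be sketched as follows. A lost frame is represented by `None`; the function name, the `noise_level` parameter, and the choice of uniform noise are assumptions made for illustration only.

```python
import random

def conceal(frames, frame_len, method="repeat", noise_level=0.01, seed=0):
    """Rebuild a sample stream, replacing lost frames (None) by a simple rule."""
    rng = random.Random(seed)
    out = []
    last = [0.0] * frame_len          # used by "repeat" before any frame arrives
    for frame in frames:
        if frame is None:
            if method == "silence":
                frame = [0.0] * frame_len
            elif method == "noise":
                frame = [rng.uniform(-noise_level, noise_level)
                         for _ in range(frame_len)]
            elif method == "repeat":
                frame = list(last)    # play the previous frame again
            else:
                raise ValueError("unknown concealment method: " + method)
        out.extend(frame)
        last = frame
    return out

frames = [[1.0, 2.0], None, [5.0, 6.0]]   # the middle frame was lost
repeated = conceal(frames, frame_len=2, method="repeat")
silenced = conceal(frames, frame_len=2, method="silence")
noisy = conceal(frames, frame_len=2, method="noise")
```

An extrapolation-based method would replace the `repeat` branch with a prediction computed from the previous frame.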
The SPMMax-MDF Algorithm
To deal effectively with all of these issues, a highly adaptive algorithm is needed. One example is the SPMMax-MDF algorithm [3], which produces a low-delay sparse echo path model. With proper control, it can serve as an effective VoIP network echo canceller. The algorithm consists of two parts: selective tap adaptation and multiple delay filtering.
Selective tap adaptation is done using the sparse partial update normalized least mean square algorithm (SPNLMS). The model updating equation is given by:
Where the input is the far-end speech, e is the error at the current iteration, and Q is the tap-selection matrix:
Whose elements are selected by two criteria controlled by parameters M1 and M2. The first selection is made according to:
While a second update is given by:
Thus the algorithm takes the entire echo path, updates only the portion of the speech signal that is active, and then further filters that selection to only the most troublesome taps as seen at the end of the echo path. In this way sparsity is handled effectively, and non-linearities are dealt with as well as a linear echo canceller can manage.
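The general idea of selective tap adaptation can be illustrated with a generic partial-update NLMS. The sketch below is not the SPNLMS update itself: it assumes a single MMax-style criterion (update only the M taps with the largest input magnitude, normalized by the full input energy) in place of the two criteria controlled by M1 and M2, and every name and parameter value is chosen for illustration.

```python
import random

def mmax_nlms(x, d, L, M, mu=0.5, delta=1e-6):
    """Partial-update NLMS: adapt only M of the L taps at each iteration."""
    h = [0.0] * L          # echo path estimate
    buf = [0.0] * L        # buf[i] holds the far-end sample x(n - i)
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        e = dn - sum(hi * xi for hi, xi in zip(h, buf))
        # Diagonal of the tap-selection matrix Q: 1 for the M largest |x(n - i)|.
        selected = sorted(range(L), key=lambda i: abs(buf[i]), reverse=True)[:M]
        norm = sum(xi * xi for xi in buf) + delta
        for i in selected:
            h[i] += mu * e * buf[i] / norm
        errors.append(e)
    return h, errors

# Sparse toy echo path: two active taps out of eight.
rng = random.Random(1)
true_h = [0.0, 0.0, 0.0, 0.8, 0.0, 0.0, -0.3, 0.0]
x = [rng.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(true_h[i] * x[n - i] for i in range(len(true_h)) if n - i >= 0)
     for n in range(len(x))]

h_est, errors = mmax_nlms(x, d, L=8, M=4)
```

Only half of the taps are touched per iteration, yet with a white, noise-free input like this one the estimate is expected to approach the true echo path.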
The multiple delay filter (MDF) is implemented by breaking up the frequency domain representation of the full echo path into various blocks of smaller lengths. Assuming F is the Fourier transform matrix of the input, we define a diagonal Fourier matrix D by:
Where the far-end speech vector is defined as:
Where N is the block length, and k is the block index. We then define G to be the block tap-selection matrix G = FWF⁻¹, where W does the block selection. Elements of G are the far-end speech vectors defined as above. The selected block is further filtered by Q similarly to the above.
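The block structure of the MDF can be illustrated by its filtering step alone. The sketch below assumes the standard overlap-save arrangement: the echo path is split into K blocks of length N, each block is applied in the frequency domain to a 2N-sample window of far-end speech, and the last N output samples are kept. Adaptation and tap selection are left out, a naive DFT is used to avoid external libraries, and all names are illustrative.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for tiny blocks)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
               for t in range(n)) for f in range(n)]
    return [c / n for c in out] if inverse else out

def mdf_filter(x, h, N):
    """Filter x with h using K = len(h) // N frequency-domain blocks.

    Assumes len(h) and len(x) are multiples of the block length N.
    """
    K = len(h) // N
    H = [dft(h[k * N:(k + 1) * N] + [0.0] * N) for k in range(K)]
    blocks = [x[m * N:(m + 1) * N] for m in range(len(x) // N)]
    zero = [0.0] * N

    def block(m):
        return blocks[m] if m >= 0 else zero

    y = []
    for m in range(len(blocks)):
        acc = [0j] * (2 * N)
        for k in range(K):
            # Diagonal of D for the delayed window; a real MDF would cache this.
            D = dft(block(m - k - 1) + block(m - k))
            acc = [a + dk * hk for a, dk, hk in zip(acc, D, H[k])]
        y.extend(c.real for c in dft(acc, inverse=True)[N:])  # overlap-save
    return y

def direct_filter(x, h):
    """Reference time-domain convolution, truncated to len(x) samples."""
    return [sum(h[i] * x[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(x))]

x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.0, 2.0, 1.5]
h = [0.5, 0.0, -0.25, 0.1]
y_mdf = mdf_filter(x, h, N=2)
y_ref = direct_filter(x, h)
```

Because each block needs only N new input samples before it can produce output, the processing delay is set by the block length N rather than by the full echo path length.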
The SPMMax-MDF algorithm must be applied after the speech is decoded by the vocoder. There are many algorithms available for sampled speech echo cancellation, as generously expanded upon on this site. An alternative is to use an LPC-type filter to filter the packets directly before they reach the vocoder. While this runs into much the same issues, the filters involved will at least be smaller.
Delay effects can instead be mitigated by attempting to predict the delay caused by the network. A Hidden Markov Model may be most appropriate for this task [4]. Clock slippage can be reduced by resampling, or by using identical or good-quality hardware. Vocoder distortion can be modeled as a non-linear processing block and thus dealt with explicitly. Packet loss effects can be made negligible by the proper selection of a recovery technique. The best recovery techniques are the ones that introduce an approximation to the lost packet, as this throws the system distance off the least.
Whichever way is chosen, it is clear that VoIP will be increasingly important in this century due to our growing reliance on digital networks, and will undoubtedly be introduced into applications not readily apparent at the moment. Thus, it is important that we monitor how the IP network evolves and assess how such evolution will change the typical characteristics of the VoIP echo path. With due diligence, VoIP quality can be assured.
[1] X. Xu, "Effects of Delay Jitter and Clock Slippage on Network Echo Cancellation," Master's Thesis, Carleton University, Ottawa, Ontario, 2000.
[2] Y. Huang, "Effects of Vocoder Distortion and Packet Loss on Network Echo Cancellation," Master's Thesis, Carleton University, Ottawa, Ontario, 2000.
[3] X. Lin et al., "Frequency-Domain Adaptive Algorithm for Network Echo Cancellation in VoIP," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2008.
[4] T. Yensen et al., "HMM Delay and Prediction Technique for VoIP," IEEE Transactions on Multimedia, vol. 5, no. 3, pp. 444-456, September 2003.