Voice Encoding Methods for Digital Wireless Communication


Despite the evolution of fourth-generation (4G) systems, which promise comprehensive IP solutions, broadband access network standards, and "anywhere, anytime" connectivity, the most important mobile radio services are still based on voice communication. A very important aspect of this coursework is therefore voice encoding in digital wireless communications. Communication over wireless systems has never been simple, especially voice communication, which has its own limitations. One of the main concerns is bandwidth: some of the encoding methods discussed below are not feasible over wireless communication channels for this reason.

Furthermore, in order to transmit voice, it has to be encoded so as to conserve bandwidth without compromising quality. In practice this is a trade-off rather than a fully solvable problem. To achieve voice encoding, the properties of speech have to be understood as well; many of the techniques discussed here exploit the particular structure of speech signals, which we will explore further.

There are several techniques, ranging from waveform encoding to source encoding. Waveform encoding takes an analogue signal and converts it into a digital format; the numerical coding of the result is called Pulse Code Modulation (PCM), and the signal is transmitted through a digital medium. PCM encoders were the first encoders to emerge when voice transmission entered the digital era. Modern encoders are discussed in detail below: their operation and a comparison of their performance. These are commonly derived from source coders (parametric coders), which encode certain parameters of speech rather than the waveform itself. A third type of encoder, the hybrid coder, combines waveform and source coding and is commonly used as a speech codec.

Finally, this coursework describes test methods for determining voice quality, which is another concern in wireless communication. When wireless technology first came to market, users sacrificed voice quality for mobility. Today, with significant improvements in wireless technology, users refuse to compromise, and voice quality is now the most important customer satisfaction factor in wireless communications.


Voice encoding is used in digital voice communication systems to digitize and compress speech signals to minimize transmitted bit rate. Bandwidth is a precious commodity in wireless communication systems, since service providers must accommodate many users with a limited allocated bandwidth. Vocoders allow voice to be transmitted efficiently over circuit-switched or packet-switched digital networks [1].

Today, voice encoders have become essential components in telecommunications and in the multimedia infrastructure. Commercial systems that rely on efficient voice encoding include cellular communication, voice over internet protocol (VOIP), videoconferencing, electronic toys, archiving, and digital simultaneous voice and data (DSVD), as well as numerous PC-based games and multimedia applications [2].

Vocoders also make spectrum-efficient wireless voice communications possible, and they allow for the digitized voice stream to be encrypted [1].

The goal of voice encoding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application. Speech coding is performed using numerous steps or operations specified as an algorithm. Interest in voice encoding is motivated by the evolution to digital communications and the requirement to minimize bit rate, and hence, conserve bandwidth. There is always a trade-off between lowering the bit rate and maintaining the delivered voice quality and intelligibility; however, depending on the application, many other constraints also must be considered, such as complexity, delay, and performance with bit errors or packet losses. [3]

In order to perform encoding, it is also necessary to study the properties of speech. Speech is an acoustic waveform that conveys information from a speaker to a listener. Given the importance of this form of communication, it is no surprise that many applications of signal processing have been developed to manipulate speech signals. Much of the research in speech compression has been motivated by the need to conserve bandwidth in communication systems. For example, speech coding is used to reduce the bit rate in digital cellular systems [4], which is discussed further in this coursework.

The following briefly explains the types of coding schemes.

2.1. Coding techniques

Speech coding has an important role in modern voice-enabled technology, particularly for digital voice communication, where quality and complexity have a direct impact on the marketability and cost of the underlying products or services. There are many speech coding standards designed to suit the need of a given application [5]. Speech coders differ widely in their approaches to achieving signal compression.

Waveform coders perform compression techniques that exploit the redundant characteristics of the waveform itself [6].

The most common example is the PCM (Pulse Code Modulation) coder and its derivatives the DPCM (Differential PCM) and ADPCM (Adaptive DPCM). The DPCM and ADPCM achieve compression by quantizing the difference between consecutive samples instead of the samples themselves [7].
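To make the difference-coding idea concrete, here is a minimal Python sketch with a fixed uniform quantizer step; a real ADPCM coder also adapts the step size and the predictor from sample to sample, which is omitted here.

```python
import numpy as np

def dpcm_encode(samples, step=0.05):
    """Minimal DPCM: quantize the difference between each sample and the
    decoder's reconstruction of the previous one (fixed step, no adaptation)."""
    codes, prediction = [], 0.0
    for x in samples:
        q = int(round((x - prediction) / step))  # quantized difference index
        codes.append(q)
        prediction += q * step                   # mirror the decoder's state
    return codes

def dpcm_decode(codes, step=0.05):
    out, prediction = [], 0.0
    for q in codes:
        prediction += q * step
        out.append(prediction)
    return np.array(out)

# A 200 Hz tone sampled at 8 kHz: neighbouring samples are close together,
# so the difference signal needs far fewer quantizer levels than the samples.
t = np.arange(80) / 8000.0
signal = np.sin(2 * np.pi * 200 * t)
codes = dpcm_encode(signal)
print("max reconstruction error:", np.max(np.abs(signal - dpcm_decode(codes))))
```

Because the encoder tracks the decoder's reconstruction rather than the raw previous sample, quantization errors do not accumulate over time.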

In addition to waveform coders, there are source coders that compress speech by sending only simplified parametric information about voice transmission; these coders require less bandwidth [7]. During encoding, parameters of the model are estimated from the input speech signal, with the parameters transmitted as the encoded bit-stream. This type of coder makes no attempt to preserve the original shape of the waveform. Example coders of this class include Linear Prediction Coding (LPC) and Mixed Excitation Linear Prediction (MELP) [5].

Lastly, there are hybrid coders, which are a combination of waveform and source coders [6]. Examples of hybrid coders are the Multi-Pulse Excited (MPE) coder, the Regular-Pulse Excited (RPE) coder, and the Code-Excited Linear Predictive (CELP) coder.

2.2. Application of Voice Encoding

Transmission of voice is the major application of voice encoding. Voice transmission systems can be divided into two broad categories: terrestrial and satellite. Voice storage applications also employ speech compression schemes. Many applications of voice encoding, such as packet-switched cellular telephony and answering machines, do not require a fixed bit rate. As a result, significant effort has been dedicated in recent years to the development of variable bit rate voice encoders [8].

2.2.1. Terrestrial voice communication systems

The terrestrial voice communication systems include Public Switched Telephone Networks (PSTN), Integrated Services Digital Networks (ISDN), and cellular mobile radio systems [8].

2.2.2. Satellite communication systems

The use of satellite communication systems is primarily for long-distance communication, due to their wide coverage area and their point-to-point and point-to-multipoint connection capability. There are three main types of satellite services defined by the International Telecommunication Union (ITU): Fixed Satellite Services (FSS), Mobile Satellite Services (MSS), and Broadcast Satellite Services (BSS) [8].


Speech consists of acoustic pressure waves created by the voluntary movements of anatomical structures in the human speech production system, shown in Figure 1. As the diaphragm forces air through the system, these structures are able to generate and shape a wide variety of waveforms. These waveforms can be broadly categorized into voiced and unvoiced speech [4].

Voiced sounds, vowels for example, are produced by forcing air through the larynx, with the tension of the vocal cords adjusted so that they vibrate in a relaxed oscillation. This produces quasi-periodic pulses of air which are acoustically filtered as they propagate through the vocal tract, and possibly through the nasal cavity. The shape of the cavities that comprise the vocal tract, known as the area function, determines the natural frequencies, or formants, which are emphasized in the speech waveform. The period of the excitation, known as the pitch period, is generally small with respect to the rate at which the vocal tract changes shape. Therefore, a segment of voiced speech covering several pitch periods will appear somewhat periodic. Average values for the pitch period are around 8 ms for male speakers, and 4 ms for female speakers [4].

In contrast, unvoiced speech has more of a noise-like quality. Unvoiced sounds are usually much smaller in amplitude, and oscillate much faster than voiced speech. These sounds are generally produced by turbulence, as air is forced through a constriction at some point in the vocal tract. For example, an h sound comes from a constriction at the vocal cords, and an f is generated by a constriction at the lips [4].

An illustrative example of the voiced and unvoiced sounds contained in the word “erase” is shown in Figure 2. The original utterance is shown in (a). The voiced segment in (b) is a time magnification of the “a” portion of the word; notice the highly periodic nature of this segment. The fundamental period of this waveform, about 8.5 ms here, is called the pitch period. The unvoiced segment in (c) comes from the “s” sound at the end of the word. This waveform is much more noise-like than the voiced segment, and is much smaller in magnitude [4].


Waveform coding, as the name implies, attempts to copy the actual shape of the waveform produced by the microphone and its associated analogue circuits. It can be carried out in either the time or the frequency domain. If the bandwidth is limited, the sampling theorem shows that it is theoretically possible to reconstruct the waveform exactly from the amplitudes of regularly spaced samples taken at a frequency of at least twice the signal bandwidth [10].

Pulse Code Modulation (PCM) is an example of waveform coding. It is the simplest waveform coding algorithm, requiring 64,000 bits of information to be transmitted every second (64 kb/s) for faithful reproduction of the speech waveform at the receiver. However, PCM makes no assumptions about the nature of the waveform to be coded; hence it also works well for non-speech signals [12]. Figure 3 shows an illustration of the PCM process.

PCM Encoder

The invention of PCM in 1938 by Alec H. Reeves was, in fact, the beginning of digital speech communication. Unlike analogue systems, PCM systems allow perfect signal reconstruction at the repeaters of the communication system, which compensate for attenuation provided that the channel noise level is insufficient to corrupt the transmitted bit stream. Additional advantages of PCM over analogue transmission include the availability of sophisticated digital hardware for various other processing tasks, such as error correction, encryption, multiplexing, switching, and compensation [12].

PCM operates at an 8 kHz sample rate, with 8 bits per sample. According to the Nyquist theorem, a signal must be sampled at twice its highest frequency component; 8,000 samples/s at 8 bits each gives 64 kbps. Since the compressed bit rate is also 64 kbps, the bandwidth saving is not significant [13]. This process is not normally used in its simplest form for transmission or for bulk storage of speech, because the digit rate required for acceptable quality is too high. Simple PCM does not exploit any of the special properties of speech production or auditory perception except their limited bandwidth [10].
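As a sketch of the arithmetic above, the following Python fragment implements uniform 8-bit PCM and prints the resulting bit rate; note that telephone-grade PCM (ITU-T G.711) actually uses logarithmic mu-law or A-law companding rather than the uniform quantizer assumed here.

```python
import numpy as np

SAMPLE_RATE = 8000       # Hz: twice the ~4 kHz telephone bandwidth (Nyquist)
BITS_PER_SAMPLE = 8
print("bit rate:", SAMPLE_RATE * BITS_PER_SAMPLE, "bps")   # 64000 bps

def pcm_quantize(x, bits=BITS_PER_SAMPLE):
    """Map samples in [-1, 1) to integer codes 0 .. 2**bits - 1."""
    levels = 2 ** bits
    return np.clip(((x + 1.0) / 2.0 * levels).astype(int), 0, levels - 1)

def pcm_dequantize(codes, bits=BITS_PER_SAMPLE):
    levels = 2 ** bits
    return (codes + 0.5) / levels * 2.0 - 1.0   # mid-point of each level

t = np.arange(160) / SAMPLE_RATE                # one 20 ms frame
x = 0.8 * np.sin(2 * np.pi * 440 * t)
x_hat = pcm_dequantize(pcm_quantize(x))
print("max quantization error:", np.max(np.abs(x - x_hat)))  # about 1/256
```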

The main disadvantage of PCM is that transmission bandwidth is greater than that required by the original analogue signal. That is not desirable when using expensive and bandwidth-restricted channels such as satellite and cellular mobile radio systems. This has promoted extensive research into the area of voice encoding during the last two to three decades and as a result of this intense activity many strategies and approaches have been developed for voice encoding [12].

The major drawback is relatively high bandwidth consumption. The higher bandwidth requirement of waveform coding is a serious problem in many wireless applications, where bandwidth limitations can be severe, and this is generally why it is not commonly used over wireless communication channels. For cable transmission, however, the higher bandwidth requirement is not such a serious problem. In cables, degradations such as crosstalk and noise increase with frequency, placing a limit on the useful bandwidth; but waveform coding (specifically, PCM) is more tolerant of these degradations than analogue transmission, and can thus use higher frequencies that would not have been usable for analogue transmission [15].


A rate of 64 kbps is suitable for wire-line telecommunications, where capacity is constrained in the short term only by the amount of wire or fibre buried under the surface. In contrast, wireless communication is accessed through the air, and the available spectrum is fixed and has always been exceedingly limited. Allocating 64 kbps of expensive spectrum to each channel is without doubt uneconomical and impractical. Consequently, the technology for supporting heavy traffic over a wireless access channel turned to coder designs that traded expensive bandwidth for sophisticated algorithms requiring computing power and improved digital signal-processing technology. Wireless coder technology exploits perceptual irrelevancy in the speech signal through more efficient quantisation algorithms and intelligent adaptive linear prediction schemes. These take advantage of the high short-term correlation between consecutive speech samples, a consequence of vocal-tract anatomy, and of the long-term correlation between speech frames, together with the fact that the human range of audibility is relatively narrow, limited to about 16 kHz [16].

Wireless coders bring into play a class referred to as Analysis-by-Synthesis (AbS) coders, which merge a linear prediction scheme that models the properties of the vocal tract with an adaptive excitation signal chosen by an algorithm that minimises the error (using the least-squares criterion) between the input speech and the reconstructed version [16]. Figure 4 shows a simplified block diagram of linear prediction.

The linear-prediction part constructs the next speech sample from a linear combination of the preceding speech samples. The mechanism involves splitting the input speech into small frames. Each frame is then subjected to an excitation signal chosen according to the vocal tract properties, fine-tuned by employing coefficients that provide the best fit to the original waveform. This approach is known as Linear Predictive Coding. The AbS encoding procedure is illustrated below in Figure 5.
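The prediction step can be sketched as follows, assuming the autocorrelation method for an order-10 predictor; standardised coders solve the same normal equations more efficiently with the Levinson-Durbin recursion, and the synthetic frame below merely stands in for a speech segment.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Fit a_1..a_p so that x[n] is approximated by sum_k a_k * x[n-k],
    using the autocorrelation (Toeplitz normal-equations) method."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 0.05 * np.arange(240)) + 0.01 * rng.standard_normal(240)
a = lpc_coefficients(frame)

# The residual is the part of the frame the predictor cannot explain;
# for speech-like signals it carries far less energy than the frame itself.
pred = np.array([np.dot(a, frame[n - 1::-1][:len(a)])
                 for n in range(len(a), len(frame))])
residual = frame[len(a):] - pred
print("residual/frame energy:", np.sum(residual ** 2) / np.sum(frame[len(a):] ** 2))
```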

Figure 6 shows a timeline of various encoders and the applications in which they have been used. The sections that follow discuss the operation of three commonly used modern voice encoders, namely LPC, RPE, and CELP, and compare their performance.


4.1. Linear Predictive Coding (LPC)

Linear Prediction Coders were introduced in the 1960s. Linear prediction (LP) based vocoders are modelled after the human speech production mechanism. They are low bit-rate encoders, providing rates between 1.2 kb/s and 4 kb/s. The sound that is generated is very synthetic.

Linear Predictive Coding (LPC) is the most popular technique for low bit-rate speech coding and has become a very important tool in speech analysis. The popularity of LPC derives from its compact yet precise representation of the speech spectral magnitude, as well as its relative simplicity of computation. LPC analysis decomposes the speech into two highly independent components: the vocal tract parameters (LPC coefficients) and the glottal excitation (LP excitation). Basic LPC is mostly used where bit rate really matters (e.g. in military applications), and most modern voice codecs (e.g. in GSM) are based on enhanced LPC encoders. A good example is the 2400 bps LPC-10 vocoder used as a U.S. government standard for secure (i.e. encrypted) telephony [18].

An LPC vocoder digitises the signal and splits it into segments (typically 20 ms). For each segment it determines the pitch of the signal (i.e. the fundamental frequency), the loudness of the signal, whether the sound is voiced or unvoiced, and the vocal tract excitation parameters (LPC coefficients) [20].

The vocal tract is modelled by a linear prediction filter. The glottal pulses and the turbulent air flow at the glottis are modelled by periodic pulses and Gaussian noise respectively, which form the excitation signal of the linear prediction filter. The LP coefficients, the signal power, the binary voicing decision (i.e. periodic-pulse or noise excitation), and the pitch period of voiced segments are estimated and transmitted to the decoder [21]. Figure 7 shows the processes that take place in an LPC model.
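A toy version of this per-frame analysis is sketched below: frame energy stands in for loudness, a normalised autocorrelation peak drives the voiced/unvoiced decision, and the peak lag gives the pitch period. The 0.5 threshold and the 50-400 Hz pitch search range are illustrative assumptions, not values from any standard.

```python
import numpy as np

def analyze_frame(frame, fs=8000):
    """Toy LPC-vocoder frame analysis: energy, voiced/unvoiced flag,
    and a pitch estimate from the autocorrelation peak."""
    energy = np.sum(frame ** 2) / len(frame)
    # search pitch periods between 2.5 ms and 20 ms (about 50-400 Hz)
    lags = np.arange(int(0.0025 * fs), int(0.020 * fs))
    ac = np.array([np.dot(frame[:-lag], frame[lag:]) for lag in lags])
    periodicity = ac.max() / (np.dot(frame, frame) + 1e-12)
    voiced = periodicity > 0.5                 # illustrative threshold
    pitch_hz = fs / lags[np.argmax(ac)] if voiced else None
    return energy, voiced, pitch_hz

fs = 8000
t = np.arange(int(0.020 * fs)) / fs            # one 20 ms frame
print(analyze_frame(np.sin(2 * np.pi * 125 * t), fs))   # voiced, ~125 Hz
rng = np.random.default_rng(1)
print(analyze_frame(rng.standard_normal(len(t)), fs))   # noise-like: unvoiced
```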

LPC is not without drawbacks, however. To minimize analysis complexity the LPC signal is usually assumed to come from an all-pole source; i.e., the assumption is that its spectrum has no zeros. Since the actual speech spectrum has zeros due to the glottal source as well as zeros from the vocal tract response in nasals and unvoiced sounds, such a model is a simplification. The all-pole assumption does not cause major difficulties in speech coders.

The main weakness of LP based vocoders is the binary voicing decision of the excitation, which fails to model mixed signal types, with both periodic and noisy components. By employing frequency domain voicing decision techniques, the performance of LP based vocoders can be improved [16].


4.2. Regular Pulse Excited (RPE) coding

A key distinction among the various wireless coders is the method used for the pulse excitation. The computational effort associated with passing each and every non-zero pulse of every excitation frame through the synthesis filter is considerable. Accordingly, excitation procedures incorporate a variety of intelligent inferences along with a reduced number of pulses per millisecond.

One of the first to be introduced, in GSM (phase 1), was the Regular Pulse-Excited (RPE) coder. RPE uses uniform spacing between pulses, as shown in Figure 8. The uniformity eliminates the need for the encoder to transmit the position of any pulse beyond the first one. RPE distinguishes between voiced and unvoiced signals: when the signal is classified as unvoiced, the coder ceases generating periodic pulses and its pulsing becomes random, corresponding to the nature of the unvoiced signal [16].
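The grid-selection idea can be sketched as follows, assuming a pulse spacing of 3, similar in spirit to the GSM full-rate codec; the real codec first low-pass filters the short-term residual and separately quantises the retained pulse amplitudes, steps omitted here.

```python
import numpy as np

def rpe_select_grid(residual, spacing=3):
    """Pick the regularly spaced pulse grid (offset 0..spacing-1) that
    captures the most residual energy. Only the chosen offset and the
    pulse amplitudes on that grid need to be transmitted."""
    energies = [np.sum(residual[offset::spacing] ** 2)
                for offset in range(spacing)]
    best = int(np.argmax(energies))
    return best, residual[best::spacing]

rng = np.random.default_rng(2)
subframe = rng.standard_normal(39)             # toy short-term residual sub-frame
offset, pulses = rpe_select_grid(subframe)
print(f"grid offset {offset}: keep {len(pulses)} of {len(subframe)} samples")
```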

RPE with long-term prediction (RPE-LTP) is the basis of the GSM full-rate (FR) codec, as illustrated in Figure 9. It was first implemented on digital signal processors (DSPs) in the early 1990s. At that time, DSP technology limited the implementation to a computationally efficient method that achieved reasonable voice quality with practical computational effort. Even so, the codec algorithmic delay of RPE-LTP is about 40 ms [16].

The RPE-LTP approach is not suitable for codec rates below 10 kbps.

4.3. Code-Excited Linear Prediction (CELP) coding

In the mid-1980s, Schroeder and Atal proposed a coder concept that employs a codebook for generating the pulse excitation. The class of codec algorithms that resulted from this proposal is referred to as Code-Excited Linear Prediction (CELP) coding [16].

Speech is generated by exciting the LPC filter and the pitch filter with a proper excitation signal. In the analysis-by-synthesis technique, the speech is reconstructed at the encoder itself, and the excitation signal is determined by minimising the perceptually weighted error between the original and the synthesised speech. A CELP model is shown in Figure 10. CELP coders are also called hybrid coders because they combine the features of traditional vocoders with the waveform-matching features of waveform coders. Although the first paper on CELP addressed the feasibility of vector excitation coding, follow-up work demonstrated that CELP coders were capable of producing medium-rate and low-rate speech adequate for communication applications. Real-time implementation of hybrid coders became feasible with the development of highly structured codebooks [24].

We will now discuss the basic CELP model. In CELP, the excitation sequence is selected from a codebook of zero-mean Gaussian sequences. The block diagram of the CELP coder (Figure 10) consists of a cascade of two all-pole filters, with coefficients that are updated periodically. The first is a long-delay pitch filter used to generate the pitch periodicity in voiced speech; its parameters can be determined by minimising the prediction error energy, after pitch estimation, over a frame duration of 5 milliseconds. The second filter is a short-delay all-pole (vocal-tract) filter used to generate the spectral envelope (formants) of the speech signal. This filter usually has 10-12 coefficients, determined periodically using LP analysis as discussed earlier.

A stored sequence from a Gaussian excitation codebook is scaled and used to excite the cascade of the pitch synthesis filter and the LPC synthesis filter (computed over the current frame). The synthesised speech is compared with the original speech, and the difference is passed through a perceptual weighting filter. This perceptually weighted error is squared and summed over a sub-frame block to give the error energy. By performing an exhaustive search through the codebook, we find the excitation sequence that minimises the error energy. The gain factor for scaling the excitation sequence is determined for each codeword in the codebook by minimising the error energy over the block of samples [24].
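A drastically simplified search is sketched below: the cascade of pitch and LPC synthesis filters is reduced to a single first-order all-pole filter and perceptual weighting is omitted, so only the exhaustive search with closed-form optimal gain is shown; the codebook size and filter coefficient are arbitrary assumptions for illustration.

```python
import numpy as np

def synthesis_filter(x, a=0.9):
    """Toy first-order all-pole synthesis filter: y[n] = x[n] + a*y[n-1].
    A real CELP coder cascades a long-delay pitch filter with a 10th-12th
    order LPC filter and applies perceptual weighting to the error."""
    y, prev = np.zeros(len(x)), 0.0
    for n, v in enumerate(x):
        prev = v + a * prev
        y[n] = prev
    return y

def search_codebook(target, codebook):
    """Exhaustive search: filter each codeword, compute the optimal gain in
    closed form, and keep the entry that minimises the error energy."""
    best = (None, 0.0, np.inf)
    for index, codeword in enumerate(codebook):
        synth = synthesis_filter(codeword)
        gain = np.dot(target, synth) / np.dot(synth, synth)
        err = np.sum((target - gain * synth) ** 2)
        if err < best[2]:
            best = (index, gain, err)
    return best

rng = np.random.default_rng(3)
codebook = rng.standard_normal((128, 40))          # zero-mean Gaussian codewords
target = synthesis_filter(rng.standard_normal(40)) # stand-in for a speech sub-frame
index, gain, err = search_codebook(target, codebook)
print(f"best codeword {index}, gain {gain:.2f}, error energy {err:.2f}")
```

Only the winning codebook index and gain are transmitted, which is how CELP achieves its low bit rate.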

4.4. Comparison of performances

Voice encoders attempt to minimize the bit rate for transmission or storage of the signal while maintaining required levels of speech quality, communication delay, and complexity of implementation (power consumption). We will now provide brief descriptions of the above parameters of performance, with particular reference to speech [14].

4.4.1. Speech Quality

Speech quality is usually evaluated on a five-point scale, known as the mean opinion score (MOS) scale; in speech quality testing the score is averaged over a large number of speech samples, speakers, and listeners. The five points of quality are: bad, poor, fair, good, and excellent. Quality scores of 3.5 or higher generally imply high levels of intelligibility, speaker recognition, and naturalness [14].

4.4.2. Bit Rate

The coding efficiency is expressed in bits per second (bps) [14].

4.4.3. Communication Delay

Speech coders often process speech in blocks, and such processing introduces communication delay. Depending on the application, the permissible total delay could be as low as 1 ms, as in network telephony, or as high as 500 ms, as in video telephony. Communication delay is irrelevant for one-way communication, such as voice mail [14].

4.4.4. Complexity

The complexity of a coding algorithm is the processing effort required to implement the algorithm, and it is typically measured in terms of arithmetic capability and memory requirement, or equivalently in terms of cost. A large complexity can result in high power consumption in the hardware.

For rates of 16 kbps and lower, high speech quality is achieved by using more complex adaptive prediction, such as linear predictive coding (LPC) and pitch prediction, and by exploiting auditory masking and the underlying perceptual limitations of the ear. Important examples of such coders are multi-pulse excitation, regular-pulse excitation, and code-excited linear prediction (CELP) coders. The CELP algorithm combines the high quality potential of waveform coding with the compression efficiency of model-based vocoders [14].

At 8 kbps, the bit rate chosen for first-generation digital cellular telephony in North America, speech quality is good, although significantly lower than that of 64 kbps PCM speech. Both the North American and Japanese first-generation digital standards are based on the CELP technique. The first European digital cellular standard is based on a regular-pulse excitation algorithm at 13.2 kbps [14].

The rate of 4.8 kbps is important because it can be transmitted over most local telephone lines in the United States. A version of CELP operating at 4.8 kbps has been chosen as a United States standard for secure voice communication; the other such standard uses an LPC vocoder operating at 2.4 kbps, which produces intelligible but unnatural-sounding speech [14]. In a CELP encoder, bit rates of 4 kbps or lower give synthetic, mechanical-sounding speech; most modern CELP variants operate at somewhat higher bit rates and produce good quality speech [25].

However, in this context, we will refer to intelligibility and quality of the speech produced by each coder. The term intelligibility usually refers to whether the output speech is easily understandable, while the term quality is an indicator of how natural the speech sounds. It is possible for a coder to produce highly intelligible speech that is low quality in that the speech may sound very machine-like and the speaker is not identifiable. On the other hand, it is unlikely that unintelligible speech would be called high quality, but there are situations in which perceptually pleasing speech does not have high intelligibility. We briefly discuss here the most common measures of intelligibility and quality used in formal tests of speech coders also referring to Table 1 below. We also highlight some newer performance indicators that attempt to incorporate the effects of the network on speech coder performance in particular applications [26].

Table 1: Voice encoder performance comparisons [27]

The perception of “good quality” speech is a highly individual and subjective matter. As such, no single performance measure has gained wide acceptance as an indicator of the quality and intelligibility of speech produced by a coder. Further, there is no substitute for subjective listening tests under the actual environmental conditions expected in a particular application. As a rough guide to the performance of some of the coders discussed here, we present DRT, DAM, and MOS values in the table; these tests are discussed in detail in the following section. The table is adapted from [Spanias, 1994; Jayant, 1990]. From the table it is evident that at 8 kbit/s and above performance is quite good, and that the 4.8 kbit/s CELP has substantially better performance than LPC-10e (10 coefficients) [27].

5. Voice Quality

Generally speaking, the term "voice quality" refers to a subjective measurement, a judgment by the human ear, of the intelligibility of speech and of how good the sound in a phone conversation is. The efficiency of verbal exchanges over the telephone relies heavily upon it [28].

Service providers and equipment vendors are seeking reliable, cost-effective tools with which they can evaluate voice quality on next-generation networks. Such tools are designed to measure objectively not only the voice-quality levels but also the factors that affect voice quality [29].

5.1. Factors Affecting Voice Quality

There are certain factors which affect voice quality. A number of challenges, resulting from the extended delay on the voice path as well as from the scarcity of wireless bandwidth, make delivering superior voice quality in wireless networks difficult. Echo and background noise also degrade voice quality: in wireless networks, echo is a constant threat, and double-talk and background noise make delivering good voice quality more challenging still [28]. Moreover, the analogue bandwidth supported by a vocoder also directly affects its speech quality.

Each of these factors is briefly explained below.

5.1.1. Echo

Echo is heard on a telephone line when sound waves are reflected and the speaker hears himself after a delay. In a telephone network, whether wire-line or wireless, echo is always present; however, it does not always hinder communication. The delay and the volume of the echo determine the extent to which it is noticeable and detrimental. The distance, the method of transmission, and the type of network all affect the delay. An echo is perceived, and disrupts the normal course of conversation, if the delay between the original transmission of the voice and the receipt of its echo exceeds 20 milliseconds. The two kinds of echo present in telecommunication systems are electrical echo and acoustic echo [28].

Electrical Echo

Electrical echo is caused by a mismatch of impedances in the analogue local loop: the speech signal is reflected at the 2-to-4-wire hybrid circuitry [28].

Acoustic Echo

Acoustic echo is created by coupling, i.e. sound from the phone's loudspeaker being picked up by the nearby microphone [28].

Double-Talk

When both parties speak at the same time, their voices and echoes are superposed, an occurrence known as double-talk [28].

Background Noise

The noises present in the environment surrounding the speakers make up the background noise [28].

5.1.2. Delay

Delay or latency is the amount of time a voice signal takes to travel from one caller to another in a telephone conversation. Another form of delay is jitter, which refers to variations in delay caused by fluctuating signal strength. Obviously it is impossible to eliminate delay completely but, if kept to a minimum, it is not detectable by either party in a telephone conversation. However, if delay exceeds acceptable levels, one party may hear unnaturally long periods of silence and try to talk, thereby inevitably interrupting the other party. Similarly, too much jitter can make it sound like both parties are talking at the same time [29].

5.1.3. Clarity

Finally, clarity, or the fidelity of the signal itself, also affects voice quality, and numerous factors contribute to clarity or lack thereof. For example, the "network" across which a voice signal travels actually consists of multiple networks operated by multiple service providers. The quality of the circuit, which obviously affects the clarity of that voice signal, may vary from one service provider to the next [29].

Another factor that may impair clarity is transcoding, which involves the use of multiple compression algorithms on a voice signal. For example, a call that originates on one type of network and is to be terminated on another type of network likely goes through a media gateway. That call is subject to one type of voice compression on the originating network and, in its compressed form, goes through the media gateway which, in turn, uses a different type of compression to process it. Compression of an already-compressed voice signal can degrade voice clarity [29].

5.2. Testing for Voice Quality

Subjective voice quality testing has become increasingly important because of the popularity of wireless telephony and the development of new speech codecs (vocoders) offering better speech compression, and hence better utilisation of the crowded bands of the spectrum [1].

As noted above, voice quality is a subjective measurement, a judgment by the talker and listener of the calibre of the telephone call. In other words, voice quality basically is an opinion, because it depends a great deal on individual perceptions of what is an acceptable quality level and what is an unacceptable quality level [29].

There are industry-standard measures for vocoder quality: the Mean Opinion Score (MOS), the Diagnostic Acceptability Measure (DAM), and the Diagnostic Rhyme Test (DRT). These measures are most useful when all algorithms are tested on the same hardware platform with a wide variety of speech data at the same time.

5.2.1. DAM (Diagnostic Acceptability Measure)

The Diagnostic Acceptability Measure (DAM), developed by Dynastat [Voiers, 1977], is an attempt to make the measurement of speech quality more systematic. For the DAM, it is crucial that the listener crews be highly trained and repeatedly calibrated in order to get meaningful results. The listeners are each presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences, such as “Cats and dogs each hate the other” and “The pipe began to rust while new”. The listener is asked to assign a number between 0 and 100 to characteristics in three classifications: signal qualities, background qualities, and total effect. The ratings of each characteristic are weighted and used in a multiple nonlinear regression. Finally, adjustments are made to compensate for listener performance. A typical DAM score is 45 to 55%, with 50% corresponding to a good system [Papamichalis, 1987] [27].

5.2.2. DRT (Diagnostic Rhyme Test)

The Diagnostic Rhyme Test (DRT) was devised by Voiers (1977) to test the intelligibility of coders known to produce speech of lower quality. Rhyme tests are so named because the listener must determine which consonant was spoken when presented with a pair of rhyming words; that is, the listener is asked to distinguish between word pairs such as meat-beat, pool-tool, saw-thaw, and caught-taught. Each pair of words differs in only one of six phonemic attributes: voicing, nasality, sustention, sibilation, graveness, and compactness. Specifically, the listener is presented with one spoken word from the pair and asked to decide which word was spoken. The final DRT score is the percentage of correct responses, adjusted for guessing, computed as P = ((R - W)/T) x 100, where R is the number of correct choices, W is the number of incorrect choices, and T is the total number of word pairs tested. For example, with T = 100 pairs, R = 95 and W = 5, the score is P = 90. Usually 75 ≤ DRT ≤ 95, with a good score being about 90 [Papamichalis, 1987] [27].

5.2.3. MOS (Mean Opinion Score)

In order to overcome the problems arising from the subjective measurement, the ITU-T recommendations standardised different algorithms within the bandwidth of 300-3400 Hz, and defined the listening quality scale. The standardised algorithms are based on the comparison of the samples of original unprocessed signal with the samples of the degraded version. The results of these algorithms are indexes that can be mapped into the listening quality scale. In particular, the indexes can be mapped into the mean opinion score (MOS) [30] as shown in Table 2.

MOS testing usually involves 12 to 24 listeners who are instructed to rate phonetically balanced records according to a 5-level quality scale [33].


In MOS tests listeners are "calibrated" in the sense that they are familiarised with the listening conditions and the range of speech quality they will encounter. Ratings are obtained by averaging numerical scores over several hundred speech records. The MOS range relates to speech quality as follows: a MOS of 4-4.5 implies network quality, scores between 3.5 and 4 imply communications quality, and a MOS between 2.5 and 3.5 implies synthetic quality. Note that MOS ratings may differ significantly from test to test, and hence they are not absolute measures for the comparison of different coders.

A MOS of between 3.6 and 4.2 is widely accepted as being a good voice quality score for a network, and is generally considered to be “toll quality” voice.
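As a trivial illustration, these bands can be written as a lookup; the thresholds below simply restate the figures quoted in the two preceding paragraphs.

```python
def mos_band(mos):
    """Map a mean opinion score to the quality bands described above."""
    if mos >= 4.0:
        return "network (toll) quality"
    if mos >= 3.5:
        return "communications quality"
    if mos >= 2.5:
        return "synthetic quality"
    return "poor"

for score in (4.2, 3.8, 3.0, 2.0):
    print(score, "->", mos_band(score))
```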

Furthermore, four measurement algorithms have been proposed in connection with ITU recommendations, according to the operating conditions of the telecommunication network: PAMS (Perceptual Analysis/Measurement System), PSQM (Perceptual Speech Quality Measurement), PESQ (Perceptual Evaluation of Speech Quality), and MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor).

Obviously, such test methods are not only subjective but also expensive and time-consuming. However, the commonly accepted opinion is that voice quality is a subjective parameter and the determination of voice quality should be performed using the results of subjective testing and applying the concept of MOS.

Figure 11 illustrates the commonly known encoders and their MOS with respect to data rate in kbps.

MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor)

MUSHRA is defined by ITU-R Recommendation BS.1534-1. The MUSHRA methodology is recommended for assessing "intermediate audio quality"; for very small audio impairments, Recommendation ITU-R BS.1116-1 (ABC/HR) is recommended instead. The recommendation specifies that one anchor must be a 3.5 kHz low-pass version of the reference. The purpose of the anchor(s) is to bring the scale closer to an "absolute scale", ensuring that minor artifacts are not rated as having very bad quality.

The main advantage over the Mean Opinion Score (MOS) methodology (which serves a similar purpose) is that MUSHRA requires fewer participants to obtain statistically significant results [9].

PAMS (Perceptual Analysis/Measurement System)

PAMS predicts overall subjective listening quality. Originally developed by British Telecom in the United Kingdom, PAMS measures the voice signal in terms of both listening effort and listening quality. It compares two signals: an original, unprocessed signal and a version of that signal which has been degraded by passing it through a distorting system. PAMS analyzes the amounts of the different types of errors found in the degraded version and predicts a Mean Opinion Score (MOS). The degraded signal receives a MOS of 1-5, with a score of 5 denoting the best possible voice quality [29].

PSQM (Perceptual Speech Quality Measurement)

PSQM predicts the subjective quality of speech codecs without requiring subjective testing. It provides a relative score corresponding to how a statistically large number of human listeners would react. PSQM+ is a modified version of PSQM that can be used in cases of considerable signal distortion, such as packet loss and time clipping [30].

PESQ (Perceptual Evaluation of Speech Quality)

PESQ provides an objective measurement of subjective listening tests on telephony systems in the case of codec distortion, transmission error, packet loss, multiple transcoding, environmental noise, and variable delay [30].

There are three other algorithms that can be compared to those above [30].

TOSQA (Telecommunication Objective Speech Quality Assessment)

TOSQA is based on a similarity measurement between the reference and the impaired signal, computed from modified short-term loudness spectra; it also reduces the influence of signal parts with low loudness [30].

MNB (Measuring Normalizing Blocks)

MNB operates a multi-resolution analysis in the frequency domain. After evaluating the difference between the reference and the impaired signal over a broad frequency band, the difference is removed and the analysis is repeated with narrower frequency bands. Speech quality is essentially determined from a linear combination of the various frequency band differences [30].

PACE

The key element of comparison-based schemes is how they implement the comparison between the reference and the impaired sample. The comparison method of PACE is based on the assumption that the signal parts with high energy are more important for the perceived speech quality [30].

5.3. Finding the Right Voice Quality Testing Solution

A key question facing telecom operators of all network topologies is “how is quality defined?” The answer is not purely technical, as quality is the customer's perception of a service or product. The most accurate method of arriving at a voice quality measurement would be to actually ask the callers. Ideally, callers would be continually interrupted during their phone calls and asked what they thought of the quality of their connection. For obvious reasons, this isn't a practical or scalable voice quality test solution.

The widely accepted and applied test method is the PESQ. The result of comparing the reference and degraded signals is a quality score. The PESQ scores are calibrated using a large database of subjective tests. PESQ incorporates many new developments that distinguish it from earlier models for assessing codecs. These innovations allow PESQ to be used with confidence to assess end-to-end speech quality as well as the effect of individual elements, such as codecs. PESQ is specifically designed for active (intrusive) testing applications and complies with the ITU-T standard voice quality measurement techniques and algorithms found in P-series (P.800, P.862, and P.562) and G-series (G.107) recommendations [32].


Knowing the quality or “customer perception” of wireless network performance has significant benefits for wireless carriers as spending on infrastructure and maintenance can now be linked directly to customer satisfaction [32].

As traffic over current networks increases, carriers will be tempted to switch dynamically to alternative codecs and compression schemes (e.g. AMR). These allow more simultaneous calls, though with technically constrained performance. Voice quality measurement allows carriers to understand the consequences of these decisions and to use quality intelligence to decide whether a network upgrade is necessary [32].

Wireless carriers are presented with arguments to purchase echo cancellers and voice quality enhancement (VQE) equipment to improve the "quality" delivered to their customers. With voice quality measurement, carriers now have a meaningful way to judge the merit of these upgrades; a simple contract term may be defined which states by how much the voice quality must be improved [32].

Drive testing is the most popular method of assessing the "quality" of wireless networks. It involves either manual calls made by a roaming technician or a more automated technique in which vehicles are equipped with black boxes that make test calls to a central location. A test signal representing human voice is introduced at one end and captured at the far end. The far end runs a voice quality measurement algorithm, amongst other things, that compares the received signal to a local copy of the original. Poor over-the-air voice quality test results are often mistaken for a coverage issue, and the course of action undertaken is the addition of a costly cell site to improve RF reception; the poor quality may actually be due to the transmission network connecting the cell site with the switching office, and not an RF issue at all [32].


The introduction of GPRS and UMTS data-centric multimedia service (MMS) applications (e.g. real-time news, quotes, pictures, slow-motion video) and PDA applications (for email) brings IP complexities beyond managing the radio frequency and mobility environment. IP is initially being introduced to the core network backbone, but with UMTS it will eventually find its way all the way out to the wireless device [32].

Leading service providers and equipment vendors recognize their long-term competitive success depends to a large extent on their ability to provide the same high-quality voice service on next-generation networks that customers take for granted on the PSTN. Consequently, they want cost-effective, easy-to-use testing solutions that can help them deliver that quality [32].

There are many advantages to transmitting voice digitally: considerable bandwidth can be saved simply by compressing the signal and reducing the bit rate. The first encoder designed to perform such a transformation was PCM, which digitised the signal and compressed it for transmission. Then, as the demand for efficiency grew, a new method called linear prediction was developed, in which future discrete samples of the signal are estimated as a linear function of previous samples; from it emerged the other, more advanced encoders with better performance.

It is convenient to compare the performance of encoders with reference to voice quality, complexity, intelligibility, bit rate, and so on. There are techniques to measure quality, the most commonly used being the MOS. The CELP encoder has been shown to have better performance than the other encoders discussed, offering a better trade-off between MOS score, bit rate, and complexity.

In the end, it has never been easy to transmit voice over a wireless system. Since bandwidth is limited, signals have to be compressed accordingly. Furthermore, there are a few other factors which affect the transmission, most notably echo.

Wireless communications is a rapidly growing segment of the communications industry, with the potential to provide high-speed high-quality information exchange between portable devices located anywhere in the world. Potential applications enabled by this technology include multimedia Internet-enabled cell phones, smart homes and appliances, automated highway systems, video teleconferencing and distance learning, and autonomous sensor networks, to name just a few. However, supporting these applications using wireless techniques poses a significant technical challenge.


[1]: NTIA Report 01-386 - Voice Quality Assessment of Vocoders in Tandem Configuration, Christopher Redding, Nicholas DeMinco, and Jeanne Lindner, U.S. DEPARTMENT OF COMMERCE, April 2001

[2]: Speech Coding: Fundamentals and Applications, by Mark Hasegawa-Johnson (University of Illinois) and Abeer Alwan (University of California at Los Angeles)

[3]: J. D. Gibson, T. Berger, T. Lookabaugh, D. Lindbergh, and R. L. Baker, Digital Compression for Multimedia: Principles & Standards, Morgan-Kaufmann, 1998

[4]: Purdue University: ECE438 - Digital Signal Processing with Applications - Speech Processing - Labs - Prof. Charles A. Bouman

[5]: Speech Coding Algorithms: Foundation and Evolution of Standardized Coders, by Wai C. Chu, Mobile Media Laboratory, DoCoMo USA Labs, San Jose, California

[6]: Cisco 1751 Router Software Configuration Guide - Voice over IP Overview

[7]: Wireless digital signal processors, Ingrid Verbauwhede and Mihran Touriguian, UCLA, ATMEL Corporation

[8]: Multimode Speech Coding Below 6 kbps, by Nilantha N. Katugampala, Thesis for Doctor of Philosophy, University of Surrey, 2001

[9]: The MUSHRA Audio Subjective Test Method, A. J. Mason, BBC Research & Development White Paper WHP 038, September 2002

[10]: Speech Synthesis and Recognition by J. N. Holmes and Wendy Holmes 2nd edition

[11]: Lecture 4, Spectral Analysis: Modulation and Multiplexing II: Wire Technologies I, February 25, 2004 http://people.seas.harvard.edu/~jones/cscie129/nu_lectures/lecture4/lecture_4.html

[12]: Digital Speech, By Ahmed M. Kondoz, Second Edition

[13]: Lucent Technologies, Technology Description, http://standards.lucentssg.com/description.html

[14]: AT&T Bell Laboratories, Murray Hill, New Jersey, USA, by Bishnu S. Atal & Nikil S. Jayant, http://cslu.cse.ogi.edu/HLTsurvey/ch10node4.html

[15]: Hoth, D.F., The T1 Carrier System, Bell Laboratories Record, Bell Telephone Laboratories, 1962

[16]: Fundamentals of Voice-Quality Engineering in Wireless Networks by Avi Perry

[17]: Introduction to Digital Speech Processing by Lawrence R. Rabiner, Rutgers University and University of California, Santa Barbara, USA, and Ronald W. Schafer, Hewlett-Packard Laboratories, Palo Alto, CA, USA

[18]: DVSI Voice Coding Overview, http://www.dvsinc.com/papers/vc_over.htm (Digital Voice Systems, Inc.)

[19]: Data Compression by David Salomon, 3rd edition, Springer

[20]: Lecturer Mark Handley, Chapter 4, Speech Compression, http://www.cs.ucl.ac.uk/teaching/Z24/

[21]: I. Atkinson, S. Yeldener, and A. Kondoz, “High quality split-band LPC vocoder operating at low bit rates,” in Proc. Int. Conf. on Acoust. Speech, Signal Processing, May 1997

[22]: S. Keagy, Integrating Voice and Data Networks, Cisco Press, 2000

[23]: Digital Transmission Systems By David Russell Smith, 3rd edition

[24]: Speech Analysis & Synthesis Methods Developed at ECL in NTT, by N. Sugamura and F. Itakura, 1986

[25]: http://www.egr.msu.edu/~kambohaw/index_files/ (Michigan State University, www.egr.msu.edu)

[26]: Speech Coding Methods, Standards, and Applications by Jerry D. Gibson - Department of Electrical & Computer Engineering, University of California, Santa Barbara

[27]: The Electrical Engineering Handbook by Richard C. Dorf, 2nd edition

[28]: Voice Quality in Wireless Networks, White Paper - Octasic Semiconductors, www.octasic.com

[29]: Voice Quality on Next-Generation Networks Demands Next-Generation Testing Tools, A White Paper Prepared by The Staff of GL Communications, Inc., http://www.gl.com

[30]: Voice Quality Measurement in Telecommunication Networks by Optimized Multi-Sine Signals, by Domenico Luca Carnì and Domenico Grimaldi, Department of Electronics, Computer and System Sciences, University of Calabria, 87036 Rende (CS), Italy, Measurement 41 (2008) 266-273, available at www.sciencedirect.com and www.elsevier.com/locate/measurement

[31]: Introduction to Telecommunications Network Engineering, by Tarmo Anttalainen

[32]: Network Assurance System for Voice Quality Testing, QOVOX Corporation, www.qovox.com, Rev: 2/24/06

[33]: http://www.diracdelta.co.uk/science/source/s/p/speech%20coding/source.html

[34]: Integrating Voice and Data Networks by Scott Keagy, www.ciscopress.com

