
Opticom


At the time PSQM was standardized as P.861, the scope of the standard was to assess speech codecs used primarily for mobile transmission, such as GSM. VoIP was not yet a topic. The requirements for measurement equipment have changed dramatically since then. As a consequence, the ITU set up a working group to revise the P.861 standard and to cope with the new demands arising from next-generation networks like VoIP. Within these networks, the measurement algorithm has to deal with much higher distortions than with GSM codecs, but perhaps the most important factor is that the delay between the reference and the test signal is no longer constant.

> From P.861 over PSQM+ and PSQM/IP to PESQ

A first approach to overcoming these problems was the development of PSQM+. It handled the larger distortions caused by burst errors well, but still had significant problems compensating for the varying delay. OPTICOM added an advanced delay tracking feature to OPERA™, called "PSQM/IP", which solves the varying delay issue in most cases without losing the option of real-time operation. Although this feature may fail for some signals, it still makes it possible to obtain PSQM results for the speech quality of VoIP networks.

With the new ITU standard P.862 (PESQ) this problem is finally eliminated. PESQ combines the excellent psycho-acoustic and cognitive model of PSQM+ with a time alignment algorithm adopted from PAMS that handles varying delays perfectly. PESQ is not designed for streaming applications, which is its only drawback and the reason why it cannot fully replace PSQM+. With PSQM and PESQ there are now two standards that together cover the entire problem of measuring speech quality. An overview of the structure of the PESQ algorithm is shown in the block diagram, which also indicates the new blocks that have been added to the PSQM algorithm.

> PESQ Block Diagram


> Explanation of the Measured Values

The most important result of PESQ is the MOS, which directly expresses the voice quality. The PESQ MOS as defined by ITU recommendation P.862 ranges from 1.0 (worst) up to 4.5 (best). This may be surprising at first glance, since the ITU scale ranges up to 5.0, but the explanation is simple: PESQ simulates a listening test and is optimized to reproduce the average result of all listeners (remember, MOS stands for Mean Opinion Score). Statistics show, however, that the best average result one can generally expect from a listening test is not 5.0 but approximately 4.5. It appears that subjects are always cautious about giving a score of 5, meaning "excellent", even if there is no degradation at all. OPERA can determine the MOS for the entire signal, for the active speech parts only, or for the silent parts only. In the two latter cases, active speech is detected by the VAD (voice activity detection) that is part of the PESQ time alignment. Knowing the individual MOS scores is especially useful for optimizing, for example, comfort noise generation or noise reduction systems.

> P.800 MOS and PESQ-LQ

Listening tests are very difficult to repeat and will never give identical results. Moreover, it is generally necessary to apply at least a linear transformation to the results of one test if they are to match the results of a second test (with identical test material, but performed at a different place or at a different time). The same holds for the correspondence between a listening test and the PESQ MOS. If the highest correlation is required, a linear mapping of the PESQ MOS to the scale that was actually used by the test subjects must be applied. The PESQ MOS according to P.862 was derived by optimizing a third-order polynomial to give the highest correlation on a very large set of data. Although this is generally the best approach, it is of course possible to achieve higher correlations on a smaller set of data by applying a second polynomial. One such approach is the PESQ-LQ value. It uses the following formula to transform the PESQ score (x) into the PESQ-LQ value (y):

$$
y = \begin{cases}
1.0, & x \le 1.7 \\
-0.157268\,x^{3} + 1.386609\,x^{2} - 2.504699\,x + 2.023345, & x > 1.7
\end{cases}
$$
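
As a minimal sketch, the piecewise formula above can be written as a small Python function. The function name `pesq_lq` is chosen here for illustration only; the coefficients and the 1.7 breakpoint are taken directly from the formula.

```python
def pesq_lq(x: float) -> float:
    """Map a PESQ score x to the PESQ-LQ value y using the piecewise formula above."""
    if x <= 1.7:
        # Lower branch: everything at or below 1.7 is clamped to 1.0.
        return 1.0
    # Upper branch: third-order polynomial in x.
    return -0.157268 * x**3 + 1.386609 * x**2 - 2.504699 * x + 2.023345


# Print the mapping for a few scores across the PESQ range.
for score in (1.0, 1.7, 2.5, 3.5, 4.5):
    print(f"PESQ score {score:.1f} -> PESQ-LQ {pesq_lq(score):.3f}")
```

Evaluating the polynomial just above x = 1.7 gives a value very close to 1.0, so the two branches join without a visible step.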

This mapping was submitted to the ITU-T SG12 with the intention of extending P.862 by an annex or appendix. SG12 however clearly rejected this proposal, and OPTICOM supports this rejection for the following reasons:

•	The PESQ MOS has the best overall performance. Users who need to map the PESQ score to another listening test have to perform their own mapping in any case. PESQ-LQ will be as wrong as any other parameter in this case.
•	Having a second MOS-like parameter is confusing.
•	Appending a second third-order polynomial to the PESQ MOS, which is already mapped by a third-order polynomial, doubles the mathematical degrees of freedom. This will increase the correlation on the data that were used for the parameter fitting ("training"), but it also increases the risk of complete failure on other data.

The situation is similar for a MOS mapped to the full P.800 scale. Although OPERA uses only a linear mapping in this case, we do not recommend using this value. OPERA supplies both parameters solely because customers wanted to see them. In our opinion it is scientifically wrong to use them.
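
The per-test linear mapping discussed in this section (matching PESQ MOS values to the scale actually used in one particular listening test) can be sketched as an ordinary least-squares fit of y = a·x + b. This is an illustration under stated assumptions, not OPERA's procedure: the function name `fit_linear_mapping` is hypothetical, and the score pairs in the usage example are made-up placeholder values, not measured data.

```python
def fit_linear_mapping(pesq_mos, subjective_mos):
    """Least-squares fit of subjective_mos ~ a * pesq_mos + b; returns (a, b)."""
    n = len(pesq_mos)
    if n < 2 or n != len(subjective_mos):
        raise ValueError("need two equally long lists with at least two entries")
    mean_x = sum(pesq_mos) / n
    mean_y = sum(subjective_mos) / n
    sxx = sum((x - mean_x) ** 2 for x in pesq_mos)
    if sxx == 0:
        raise ValueError("all PESQ MOS values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(pesq_mos, subjective_mos))
    a = sxy / sxx               # slope
    b = mean_y - a * mean_x     # offset
    return a, b


# Made-up placeholder values: PESQ MOS per condition and the mean score
# that a hypothetical listening panel gave to the same conditions.
pesq_scores = [1.8, 2.4, 3.1, 3.7, 4.2]
panel_scores = [1.6, 2.3, 3.2, 3.9, 4.4]

a, b = fit_linear_mapping(pesq_scores, panel_scores)
mapped = [a * x + b for x in pesq_scores]
print(f"slope a = {a:.3f}, offset b = {b:.3f}")
```

A mapping fitted this way is only valid for the listening test it was fitted to, which is the point made in the first bullet above.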

> G.107 Rating, R Factor according to the e-Model

The ETSI e-model as defined in ITU-T G.107 is a planning tool that assigns an equipment impairment factor Ie to each piece of equipment in the transmission chain. These Ie values are then summed up and combined with several other parameters to give the final R factor or R rating. This R rating is an estimate of the quality that can be expected if the network is realized the way it is planned. Although the e-model is an excellent planning tool, it can never replace real measurements on the final network, since it has to make some very wide-ranging assumptions. R ranges from 0 for terrible up to 100 for perfect voice quality, and there is a well-defined relation between R and the MOS score. To allow a comparison between the estimates from the network planning phase and the QoS of the live network, PESQ implementations such as the one in OPERA provide the R factor as well. It is derived directly from the overall MOS as calculated by PESQ. It takes neither delay nor echo nor attenuation into account and, strictly speaking, should be considered to correspond more closely to the G.107 Ie value than to the R factor (which is a conversational measure rather than a listening quality index).
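
The relation between R and MOS mentioned above is given in ITU-T G.107 as MOS = 1 + 0.035·R + R·(R − 60)·(100 − R)·7·10⁻⁶ for 0 < R < 100, with MOS = 1 for R ≤ 0 and MOS = 4.5 for R ≥ 100. The sketch below implements that formula and inverts it numerically by bisection to obtain an R value from a MOS. How OPERA itself derives R from the PESQ MOS is not stated in this text, so the inversion is an assumption for illustration, and the function names are hypothetical.

```python
def mos_from_r(r: float) -> float:
    """G.107 conversion from an R rating to an estimated MOS."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7.0e-6


def r_from_mos(mos: float, tol: float = 1e-9) -> float:
    """Numerically invert mos_from_r by bisection (illustrative assumption).

    The G.107 curve dips slightly below 1.0 for very small R and is
    increasing from about R = 6.5 upwards, so the search is limited to
    the interval [6.5, 100].
    """
    if not 1.0 <= mos <= 4.5:
        raise ValueError("MOS must lie between 1.0 and 4.5")
    lo, hi = 6.5, 100.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mos_from_r(mid) < mos:
            lo = mid   # target lies in the upper half
        else:
            hi = mid   # target lies in the lower half
    return (lo + hi) / 2.0


# Convert a few MOS values to R and back again.
for mos in (2.0, 3.0, 4.0):
    r = r_from_mos(mos)
    print(f"MOS {mos:.1f} -> R {r:.1f} -> MOS {mos_from_r(r):.2f}")
```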

> PESQ Results generated by OPERA

As an example, here are some of the result diagrams available for PESQ in OPTICOM's OPERA implementation. Our PESQ implementation offers much more than just voice quality testing. As a side effect, it provides a detailed analysis of VAD behavior, jitter buffer adaptation, and AGC/ALC behavior.

The "Waveforms and VAD Parameters" display of PESQ shows the entire waveforms in one diagram instead of the frame-wise displays of the other algorithms. Within this diagram you can plot the following graphs:

•	Waveform of the reference signal
•	Waveform of the test signal
•	Front End Clipping region as a red shaded area
•	Hold Over Time region as a yellow shaded area
•	Drop outs as an orange shaded area

Additional information is shown on the right side of the diagram. This information includes the time and date of the measurement, as well as general information on the input data. The delay shown is the average over the entire measurement period, given in milliseconds as well as in samples.

> Waveform and VAD parameters diagram. During this measurement severe distortions were detected


> PESQ Summary Results

The screen below shows important results of the PESQ algorithm. It chiefly shows the MOS score, level measurement results, and some additional information on the delay variation. The PESQ MOS as defined by ITU recommendation P.862 ranges from 1.0 (worst) up to 4.5 (best). Below the PESQ MOS the same value is given mapped to:

•	The full P.800 scale (1..5)
•	The PESQ-LQ scale

The resulting R factor according to the e-model is also shown here. Other values given are the minimum, average, and maximum delay in milliseconds, as well as the delay jitter in milliseconds.

> PESQ Result Diagram


> Delay Jitter

Using PESQ on OPERA, you can even analyze the behavior of adaptive jitter buffers. Of course, a PESQ measurement cannot look into the gateways, but the result of the jitter buffer adaptation can be observed as a delay jitter of the audio signal. The length of the jitter buffer adds linearly to the delay of the speech signal. This means that a delay jump of 100 ms corresponds directly to a jitter buffer adaptation of this amount (assuming that all the other latencies in the network are constant). You may also observe that adaptations occur during active speech, which results in a worse MOS value. Jitter measurement may give you valuable information on how to optimize your network.

The delay jitter as measured by OPERA is defined as the maximum and minimum deviation of the delay from the average delay in ms, as shown in the Final Results window (second bar graph from left). For a more detailed analysis of delay variations, two further diagrams are available: the Delay-versus-Time diagram and the Delay Histogram.
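
The definition above can be sketched in a few lines of Python: given a list of delay values over the measurement, the jitter is the minimum and maximum deviation from the average delay. This is an illustration only. The function name `delay_statistics`, the per-frame input list, and the 8000 Hz sample rate used for the conversion to samples are assumptions, not part of OPERA's documented interface.

```python
def delay_statistics(delays_ms, sample_rate_hz=8000):
    """Summarize delay values (in ms) as min/avg/max and jitter around the average.

    sample_rate_hz is an assumed narrowband telephony rate, used only to
    express the average delay in samples as well as in milliseconds.
    """
    if not delays_ms:
        raise ValueError("at least one delay value is required")
    avg = sum(delays_ms) / len(delays_ms)
    lowest, highest = min(delays_ms), max(delays_ms)
    return {
        "min_ms": lowest,
        "avg_ms": avg,
        "max_ms": highest,
        "jitter_minus_ms": lowest - avg,   # minimum deviation from the average
        "jitter_plus_ms": highest - avg,   # maximum deviation from the average
        "avg_samples": round(avg * sample_rate_hz / 1000.0),
    }


# Hypothetical delays: constant 100 ms, then a jump of 100 ms that would
# correspond to a jitter buffer adaptation of the same size.
stats = delay_statistics([100, 100, 100, 200, 200])
print(stats)
```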

Information on the measured signal levels is also given in the Final Results window. The tables on the right side of the diagram list the signal levels in dBov as well as the loudness in sone.

The next diagram shows MOS vs. Time. It indicates the perceived voice quality as measured by PESQ on a frame-by-frame basis and may be used to analyze sequences that have spurious audible distortions. For instance, you may look for the peak in the MOS vs. Time diagram and then analyze the signals around this time stamp using the other diagrams provided.

> PESQ MOS as a function of time


> Use PESQ's gain measurement to analyze AGC

The behaviour of devices such as AGC can best be measured using the Gain Variation diagram shown below. The y-axis of this diagram indicates the variable part of the gain in dB, which must be seen relative to the overall attenuation shown in the Final Results diagram. The x-axis is the time in ms. Please note that both signals must exceed the threshold in quiet by approximately 7 dB. All frames that do not meet this criterion are set to 0 dB, and a red line is plotted for these invalid periods.
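
The validity rule described above can be illustrated with a short sketch. It assumes per-frame levels in dB for the reference and the test signal and a single threshold value; the function name `gain_variation`, the use of the mean gain over valid frames as the "overall" gain, and the example numbers are all assumptions made for illustration and do not describe OPERA's internal computation.

```python
def gain_variation(ref_db, test_db, threshold_db, margin_db=7.0):
    """Per-frame gain variation relative to the overall gain.

    A frame is valid only if both signals exceed threshold_db by at least
    margin_db. Invalid frames are reported as 0 dB, mirroring the rule in
    the text. Returns (overall_gain_db, variation_db, valid_flags).
    """
    limit = threshold_db + margin_db
    valid = [r >= limit and t >= limit for r, t in zip(ref_db, test_db)]
    gains = [t - r for r, t in zip(ref_db, test_db)]

    valid_gains = [g for g, ok in zip(gains, valid) if ok]
    # Assumption: the overall gain is the mean gain over the valid frames.
    overall = sum(valid_gains) / len(valid_gains) if valid_gains else 0.0

    variation = [g - overall if ok else 0.0 for g, ok in zip(gains, valid)]
    return overall, variation, valid


# Hypothetical frame levels in dB; the last frame is below the validity limit.
overall, variation, valid = gain_variation(
    ref_db=[-20, -20, -60],
    test_db=[-26, -24, -70],
    threshold_db=-50,
)
print(overall, variation, valid)
```

In a plot, the frames flagged as invalid would be the ones marked by the red line.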


> Download more detailed Technical Specifications

->	Download PESQ – Perceptual Evaluation of Speech Quality [as PDF / 156 kb ]

> Further Details...

For more details on PESQ measurements, please refer to our Literature section, or take a look at OPERA in our Products section.
