Ogg Vorbis Audio Compression
The present paper is a study of Ogg Vorbis audio compression, presented as coursework for the module CIM241, Multimedia Communications. The coursework comprises two parts. The first part describes the principles behind audio compression in general and some of the encoding and decoding principles used in the Vorbis codec. The second part evaluates Vorbis compression in the Matlab environment. It compares pieces of music of various genres and bitrates, in .ogg and .mp3 format, against their uncompressed .wav form.
The subject of the present coursework is a tutorial study of Ogg Vorbis audio compression, as described by the non-profit Xiph.Org Foundation, founded in 1994 by the project leader Christopher Montgomery. The term Ogg refers to a project, still in the development phase, for a completely free and open standard container format. Ogg is a file format which multiplexes an unlimited number of tracks encoded by a set of open and patent-free codecs for speech, audio, video, text and metadata. Container formats similar to Ogg are:
* The Matroska (mkv), which is an open free standard developed by Matroska.org.
* The well known Audio Video Interleave (avi), developed by Microsoft in 1992.
* The newer Advanced Systems Format (asf), also from Microsoft.
* The MPEG-4 Part 14 (mp4), part of the MPEG-4 specification.
Some of the compression codecs supported by Ogg and now developed by the Xiph.Org Foundation are:
* The Vorbis (ogg or oga), which is a lossy audio compression, started by C. Montgomery in 1993 and stable since 2000.
* The Speex (spx), which is a lossy audio compression designed for speech. It can be used in VoIP applications or in podcasts.
* The Theora (ogv), which is a lossy video compression. It is based on the DCT like other video codecs and has variable bitrate.
* The Free Lossless Audio Codec (flac), which is a lossless audio compression aimed at the audiophile public.
One of the major reasons that encouraged the development of Vorbis compression was the heavily patent-protected mp3 format. In 1998 the Fraunhofer Institute announced that it would charge licensing fees for mp3, which accelerated the development of Vorbis. The Vorbis codec is completely patent free and is accompanied by an open source (BSD) license.
The dominant mp3 format also has some serious drawbacks, as its psychoacoustic compression model is somewhat dated. Advances in audio research have shown that serious improvements can be made, and Vorbis continues to improve through successive tests by developers from the public community, such as Aoyumi's Tuned Vorbis (aoTuV). The mp3 format supports only two channels (left and right), in contrast to Ogg, which supports up to 255 distinct channels. The latest versions of Vorbis encoders are extremely flexible, able to produce bitrates from 32 up to 500 kbps (aoTuV), either variable (VBR) or constant (CBR), compared with the maximum bitrate of 320 kbps that mp3 offers (LAME encoder). The sampling rate can vary from 1 up to 200 kHz, compared with 8 to 48 kHz for mp3.
II. Technical Details
Lossy compression codecs such as Vorbis exist because of the storage needs of music available online and of portable music players with limited capacity. An audio compact disc (CD) can store up to 650 MB or 74 minutes of digitized music in PCM format according to the Red Book specification. Audio CD quality refers to analog music sampled at 44100 Hz, in two-channel stereo, with 16 bits per sample. So one second of music holds:
$$44100\ \text{samples/s} \times 16\ \text{bits} \times 2\ \text{channels} = 1\,411\,200\ \text{bits} = 1411.2\ \text{kbit}$$
When digitally extracted, it can be stored as uncompressed audio with a bitrate of 1411.2 kbps in the Waveform Audio File Format (wav) supported by Microsoft. With an average song duration of 3.5 min, we get a wav file of about 35.5 MB. The typical mp3 compression at 192 kbps gives 1.4 MB per minute of music, so for 3.5 min we have 4.9 MB per song. So the compression ratio that mp3 offers at 192 kbps is:
$$CR = \frac{35.5\ \text{MB}}{4.9\ \text{MB}} \approx 7.2$$
which agrees, up to rounding of the file sizes, with the ratio of the bitrates, 1411.2 / 192 = 7.35.
The typical data CD can store up to 130 mp3 songs at 192 kbps, that is 455 min or 7.6 hours of music. This is a considerable achievement. Contemporary lossless audio codecs like FLAC or others (Musepack, WavPack) offer a compression ratio of up to 2, which is also impressive. Improved lossy codecs, such as Ogg Vorbis or the MPEG-4 Advanced Audio Coding (AAC), offer better audio quality than mp3 for the same file size or the same bitrate.
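The storage arithmetic above can be reproduced with a few lines of Matlab. This is only a sketch: the variable names are made up for illustration, and 1 MB is assumed to mean 2^20 bytes, so the results differ slightly from the rounded figures quoted in the text.

```matlab
% CD audio parameters taken from the text
Fs       = 44100;   % samples per second
bits     = 16;      % bits per sample
channels = 2;       % stereo

pcm_bps  = Fs * bits * channels;    % bits per second of uncompressed PCM (1411200)
mp3_bps  = 192 * 1000;              % typical mp3 bitrate in bits per second
ratio    = pcm_bps / mp3_bps;       % compression ratio from the bitrates (7.35)

song_s   = 3.5 * 60;                % average song duration in seconds
MiB      = 2^20;                    % assumption: 1 MB = 2^20 bytes
wav_MB   = pcm_bps * song_s / 8 / MiB;   % size of the uncompressed song
mp3_MB   = mp3_bps * song_s / 8 / MiB;   % size of the mp3 song

fprintf('PCM bitrate: %.1f kbps\n', pcm_bps / 1000);
fprintf('wav: %.1f MB, mp3: %.1f MB, ratio: %.2f\n', wav_MB, mp3_MB, ratio);
```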
There are two types of encoding, constant bitrate (CBR) and variable bitrate (VBR). Constant means a fixed bitrate for the entire length of the selection, so the size of the compressed audio is a linear function of the duration of the original. Variable means that the bitrate is not fixed; instead, the encoder changes it according to the complexity of the original data. During sections of audio which are silent or poor in spectral content, the bitrate is reduced to a minimum value (e.g. 128 kbps). It is restored to its nominal value (e.g. 192 kbps), or raised to a maximum value (e.g. 256 kbps), for sections loaded with multiple instruments and voices.
Modern encoders support both modes, CBR and VBR, although the VBR mode is now widely used because it gives better audio quality while keeping the file size small. Many well documented blind listening tests have been performed and the results are available on the internet. Briefly, the LC-AAC codec is now dominant among high quality codecs for online music and Vorbis comes second, but the latter continues to improve as several independent developers perform continuous optimizations. The mp3 format has reached its limits, although very good encoders are still in use (LAME).
The Vorbis specification describes only a generic decoder frame. The encoder is left free for implementation by the public community. Vorbis is based on the same principles as mp3 although it adds new functions and more flexible features.
The block diagram depicted in Figure 1 describes an abstract Vorbis encoder. The encoding begins with the allocation of the digitized audio information into frames. The Vorbis encoder is able to handle frames of different sizes, so it can dynamically adjust the frequency resolution during signal processing if necessary. With the mp3 encoder it is possible to enforce a fixed sampling frequency and bitrate, in contrast to Vorbis, where variable frequency and bitrate are considered its normal mode of operation.
Vorbis, like the rest of the lossy audio codecs, performs transformations on the uncompressed data. The psychoacoustic analysis is performed at an early stage of the transformation in order to reduce the data volume while keeping the loss of audio quality below a predefined level. The psychoacoustic model is not strictly defined by the specification and also depends on the developer of the encoder.
The Modified Discrete Cosine Transform (MDCT) transforms the input signal from the time to the frequency domain, where the energy concentration is critical; the Inverse MDCT (IMDCT) performs the reverse step in the decoder. The spectrum of a frame is then divided into two parts: the floor, which is a rough approximation, and the residue, which is the remainder. The Vorbis encoder is open to different interpretations for representing and encoding the floor and the residue parts. A distinct encoding mode is assigned to each interpretation. When the user chooses a specific mode, the encoder uses the respective part of the code.
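The MDCT and its inverse can be sketched in a few lines of Matlab. This is an illustration under stated assumptions, not the code of any actual Vorbis encoder: the half-block size N, the test signal and all variable names are hypothetical, the transform is written as a direct matrix product for clarity rather than speed, and the window is the power-complementary window given in the Vorbis I specification. It shows that overlapping frames, windowed before the MDCT and after the IMDCT, add up to the original samples.

```matlab
N = 8;                                   % half block size (assumption for the demo)
n = (0:2*N-1)';                          % time index inside one block of 2N samples
k = 0:N-1;                               % frequency index, N coefficients per block
C = cos(pi/N * (n + 0.5 + N/2) * (k + 0.5));   % 2N-by-N MDCT basis matrix
w = sin(pi/2 * sin(pi*(n + 0.5)/(2*N)).^2);    % Vorbis window, w(n)^2 + w(n+N)^2 = 1

x = sin(2*pi*(0:4*N-1)'/10);             % hypothetical test signal, 4N samples
y = zeros(size(x));                      % overlap-add output buffer

for b = 0:2                              % three blocks with 50% overlap
    idx = b*N + (1:2*N);                 % samples covered by this block
    X = C' * (w .* x(idx));              % windowing followed by the forward MDCT
    y(idx) = y(idx) + w .* (C * X) / N;  % IMDCT, windowing and overlap-add
end

% Only samples covered by two blocks are reconstructed (aliasing cancels there)
mid = N+1 : 3*N;
err = max(abs(y(mid) - x(mid)));
fprintf('Maximum reconstruction error: %g\n', err);
```

Each block of 2N input samples produces only N coefficients, which is why the overlap costs no extra data; the time-domain aliasing introduced by one block is cancelled by its neighbours.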
Thereafter, the data included in the floor and residue parts are compressed in the entropy encoding block. Vorbis uses the Huffman algorithm, but more effectively than the mp3 encoding method. A dynamic probability model is chosen instead of the static probability model used by the mp3 specification. The encoder uses vector quantization and creates custom codebooks for any data stream, and these can be different for the floor and residue parts and for individual frames.
The final part of the Vorbis encoder is where the encoded frame is bit-packed into a logical frame. A header frame always precedes a series of audio packets and includes all the information the decoder needs. The header includes the complete set of the codebooks used, the precise methods which the encoder has used to encode the floor and the residue parts, and the modes and mappings for the multichannel information. It also carries metadata including the bitrate, the sampling rate, the names of the album, song and artists, etc.
As already mentioned, Vorbis is encapsulated in the form of logical packets in a specific container format, called Ogg, which forms the transport stream. The transport format provides the methods needed for multimedia streaming over the internet, such as framing, positioning, synchronization and error correction.
The block diagram in Figure 2 describes an abstract Vorbis decoder.
The process begins when the header decode block receives the first ogg packets via a multimedia stream or from an ogg container file. The Vorbis packets could be one of the following types:
* Identification packets. They identify the streaming data as valid Vorbis packets and specify values such as the encoder version, audio features, sampling rate and number of channels. They can also carry the maximum, minimum and nominal bitrate.
* Comment packets. They contain most of the metadata, such as artist, song title and album info.
* Setup packets. They contain the configuration data for the decoder, such as the codebooks for the inverse vector quantization and Huffman decoding.
* Audio packets. They carry the audio information.
There is a specific order in which Vorbis packets should be received. Every identification packet should be followed by a comment packet or another identification packet. A comment packet should be followed by a setup packet or an identification packet. A setup packet should be followed by an audio packet or an identification packet. When an identification packet appears after the other three packet types, it resets the configuration or starts a new stream of data.
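The ordering rules above amount to a small transition table, which can be sketched in Matlab as follows. The numeric labels for the packet types and the variable names are hypothetical and chosen only for this illustration (they are not the type codes used in the bitstream), and the row for audio packets (an audio packet may be followed by another audio packet or by an identification packet) is an assumption inferred from the text.

```matlab
% Hypothetical labels for the four packet types
ID = 1; COMMENT = 2; SETUP = 3; AUDIO = 4;

% allowed(a,b) is true when a packet of type b may follow a packet of type a
allowed = logical([1 1 0 0;    % after identification: identification or comment
                   1 0 1 0;    % after comment: identification or setup
                   1 0 0 1;    % after setup: identification or audio
                   1 0 0 1]);  % after audio: identification or audio (assumption)

% Two example sequences: the first follows the rules, the second skips the comment
seqs  = {[ID COMMENT SETUP AUDIO AUDIO ID COMMENT SETUP AUDIO], [ID SETUP AUDIO]};
valid = true(1, numel(seqs));

for s = 1:numel(seqs)
    seq = seqs{s};
    valid(s) = (seq(1) == ID);               % a stream must start with identification
    for i = 2:numel(seq)
        valid(s) = valid(s) && allowed(seq(i-1), seq(i));
    end
end

disp(valid);
```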
After packet identification the decoder selects the proper decoding format by interpreting information such as the frame size, transform type, window type and mapping number. The mapping number specifies the floor and residue restore functions. The floor curve restoration is implemented in two stages. Initially the curve amplitude and the filter coefficients are entropy decoded from the bit stream, and thereafter the floor curve is constructed from the frequency response of the Line Spectral Pair (LSP) filter.
The residues are decoded in the residue unpacking block by using the extracted vector quantization codebooks. These are extracted together with the floor codebooks, and some of them are shared. The residues are then added to the restored floor curve and the spectral curve is finally reconstructed completely. The IMDCT block converts the delivered audio spectrum to the time domain. The final windowing block reverses the windowing which was applied to the signal in the encoder before the MDCT in order to reduce the effects of block artifacts.
The decoder delivers decoded PCM samples which can be converted to an uncompressed audio format like Microsoft's wav or simply played back on an audio system, such as the sound card of a PC, through its D/A converter.
III. Matlab Simulation
In the second part of the coursework, I analyze the Ogg Vorbis compression experimentally in Academic Matlab 2008b and compare it with the widely known mp3 compression. The major part of the code is influenced by the work of A.H. Poonawalla. I have used his methods of plotting the audio spectra through FFT functions, although I have made some major improvements, which I explain further down. I have also used two groups of functions developed by the Matlab community. The first group, made by D. Ellis, includes the mp3read() function, which converts an mp3 file to a raw PCM audio vector in the Matlab environment, and mp3write(), which converts a raw PCM audio vector to an mp3 file.
The second group, made by A. Fernandez, includes the oggread() function, which converts a Vorbis file to a raw PCM audio vector in the Matlab environment, and oggwrite(), which converts a raw PCM audio vector to a Vorbis file. Matlab also includes two built-in functions, wavread() and wavwrite(), which do the same for uncompressed wav files. The main point to prove is that Ogg Vorbis audio compression is better than mp3 in terms of audio quality, and that this holds for various bitrates and various music styles. Another subject is the investigation of the spectrum of the decoded audio for both ogg and mp3.
Another program which I have used is Cool Edit Pro 2.1 (now known as Adobe Audition), which is available as a trial. I have used the program to extract 30 seconds of digital audio from three audio CDs (Classic, Rock and Jazz music) and to store it on the hard disk in uncompressed wav format (44.1 kHz stereo, bitrate 1411 kbps, size 5,169 KB), mp3 format (44.1 kHz stereo, bitrates 128, 192, 256 kbps, sizes 470, 705, 940 KB) and ogg format (44.1 kHz stereo, bitrates 128, 192, 256 kbps, sizes 480, 716, 953 KB). It is important to mention that the ogg plug-in of Cool Edit has the option to use CBR for encoding, although Vorbis natively uses VBR.
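As a sanity check on the file sizes listed above, the expected size of a 30 second clip at a constant bitrate can be computed directly. This is a rough sketch with hypothetical variable names; it assumes 1 KB = 1024 bytes and ignores file headers, tags and padding, which is why the measured sizes are slightly larger.

```matlab
duration = 30;                         % clip length in seconds
kbps     = [128 192 256];              % constant bitrates used for mp3 and ogg

% size = bitrate * duration / 8 bytes, converted to KB of 1024 bytes (assumption)
cbr_KB = kbps * 1000 * duration / 8 / 1024;

% Uncompressed PCM: 44100 samples/s, 2 channels, 16 bits per sample
wav_KB = 44100 * 2 * 16 * duration / 8 / 1024;

fprintf('CBR sizes: %.1f %.1f %.1f KB\n', cbr_KB);
fprintf('wav size:  %.1f KB\n', wav_KB);
```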
It was important to use constant bitrate for both codecs, because the variable bitrate induces a variable sampling frequency, which shifts the information content of each frequency channel relative to the input by non-integer values. The direct subtraction of the spectra would then be erroneous, as the Matlab read functions would extract raw audio of different lengths. This was a problem that A.H. Poonawalla could not solve in his original work.
A. Code explanation
The code starts by loading the three pieces of music from the Music directory. As I have mentioned, the built-in function and the two custom functions are used. Fs is the extracted sampling frequency, which is 44100 for all formats:
[wav,Fs] = wavread('Music\Rock-1411.wav');
mp3 = mp3read('Music\Rock-192.mp3');
ogg = oggread('Music\Rock-192.ogg');
The variables wav and ogg are matrices with 1323000 rows and two columns, one column per channel. The number of rows results from multiplying the 30 sec of music by the 44100 samples per second. The mp3 matrix is a little longer, at 1323695x2 elements. The mp3 format includes 1800 additional samples, called padding, which are near-zero, low-noise samples. The first 1105 are placed at the beginning of the file and the last 695 at the end. The mp3read function removes the starting padding but cannot remove the ending padding unless the true length of the audio is known. Here it is known (1323000), so I use the following code to remove the ending padding. The resulting matrix contains 1323000 samples of pure audio per channel, in one to one correspondence with the original.
mp3 = mp3(1:length(wav),:);
The following code calculates the SNR and the PSNR of the mp3 encoded audio sample compared with the uncompressed wav sample. I have used the following formulas:
$$SNR = 10\log_{10}\frac{\sigma_x^2}{\sigma_d^2}\ \text{dB}, \qquad PSNR = 10\log_{10}\frac{x_{peak}^2}{\sigma_d^2}\ \text{dB}$$
where $\sigma_x^2$ is the average square value of the original data sequence, $\sigma_d^2$ is the mean square error (MSE) and $x_{peak}^2$ is the peak square value. The error is calculated by subtracting the elements of the mp3 vector from the elements of the wav vector.
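peak = max(abs(wav(:))); % peak absolute value of the original samples
diff = wav - mp3; % sample by sample error, both channels
disp(['SNR of mp3 is ',num2str(10*log10(sum(wav(:).^2)/sum(diff(:).^2))),' dB']);
% numel(wav) counts the samples of both channels, so that sum(diff(:).^2)/numel(wav) is the MSE
disp(['PSNR of mp3 is ',num2str(10*log10(numel(wav)*peak^2/sum(diff(:).^2))),' dB']);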
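The two measures can be checked on synthetic data where the answer is known in advance. The sketch below is self-contained and uses hypothetical stand-ins for the original and the decoded audio (a two-channel sinusoid and a copy with a constant error of 0.01); none of the values come from the actual music samples.

```matlab
n = (0:999)';                              % 1000 samples, 20 full periods
x = [sin(2*pi*n/50), cos(2*pi*n/50)];      % hypothetical two-channel "original"
y = x - 0.01;                              % hypothetical "decoded" audio, constant error

err  = x - y;                              % sample by sample error
mse  = sum(err(:).^2) / numel(err);        % mean square error over both channels
sig  = sum(x(:).^2) / numel(x);            % average square value of the original (0.5)
peak = max(abs(x(:)));                     % peak absolute value (1)

snr_dB  = 10*log10(sig / mse);             % about 37 dB
psnr_dB = 10*log10(peak^2 / mse);          % 40 dB

fprintf('SNR = %.2f dB, PSNR = %.2f dB\n', snr_dB, psnr_dB);
```

With a mean square error of 1e-4, a signal power of 0.5 and a peak of 1, the expected values are 10*log10(5000) dB for the SNR and 40 dB for the PSNR.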
Similar code calculates the SNR and PSNR for the Ogg Vorbis encoding. The following code performs the fast Fourier transform (FFT) on the sample matrices:
wav_fft = fftshift(abs(fft(wav(:,2),Fs)));
The function fft() returns a complex vector of length Fs, containing the Fs-point discrete Fourier transform of the right channel audio data of the matrix wav. Because the channel holds more than Fs samples, fft() truncates it to the first Fs samples, that is, the first second of audio. The function abs() returns a vector containing the complex modulus (or magnitude) of the complex vector. The fftshift() function shifts the dc component to the center of the array and the result is stored in the vector wav_fft. I use the same method for the mp3 and ogg vectors.
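The behaviour of fft(x,n) and fftshift() can be seen on a synthetic tone. In this sketch the sampling rate, the tone frequency and the variable names are assumptions chosen to keep the example small; the frequency axis is built in the same way as the freq vector of the script.

```matlab
Fs = 8000;                          % hypothetical sampling rate for the demo
t  = (0:2*Fs-1)' / Fs;              % two seconds of samples
x  = sin(2*pi*1000*t);              % 1 kHz test tone

X    = fft(x, Fs);                  % Fs-point FFT: only x(1:Fs) is used
Xref = fft(x(1:Fs));                % explicit truncation to the first second

spec = fftshift(abs(X));            % magnitude with the dc component at the center
freq = (0:Fs-1) - Fs/2;             % frequency axis from -Fs/2 to Fs/2-1 Hz

[peak_val, peak_idx] = max(spec);   % location of the strongest component
peak_freq = abs(freq(peak_idx));    % 1000 Hz (the peak appears at +/-1000 Hz)

fprintf('Peak at %d Hz, dc bin at index %d\n', peak_freq, find(freq == 0));
```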
wav_left_phase = unwrap(angle(fftshift(fft(wav(:,1),Fs))));
The function angle() returns a vector of length Fs, containing the phase angles (in radians) of the complex vector of the shifted FFT of the left audio channel. The unwrap() function corrects phase angles to produce smoother phase plots and the result is stored in the vector wav_left_phase. The same is done in wav_right_phase for the right audio channel, and for the left and right audio channels of mp3 and ogg. In general, if z is a complex vector, the magnitude R and the phase angle theta are given by:
$$R = |z| = \sqrt{\operatorname{Re}(z)^2 + \operatorname{Im}(z)^2}, \qquad \theta = \operatorname{atan2}\big(\operatorname{Im}(z), \operatorname{Re}(z)\big), \qquad z = R\,e^{i\theta}$$
Because I have performed an Fs = 44100 point FFT of audio sampled at 44100 Hz, it is convenient to depict the centered spectrum with the X axis (frequency) ranging from -22050 Hz to +22049 Hz in steps of 1 Hz. This is done by creating the vector freq:
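What abs(), angle() and unwrap() do can be illustrated on a complex vector whose magnitude and phase are known. The sketch below uses a hypothetical vector with constant magnitude 2 and a phase that grows linearly from 0 to 6*pi; the names are chosen for this example only.

```matlab
theta = linspace(0, 6*pi, 200)';    % true phase, a smooth ramp over three turns
z     = 2 * exp(1i * theta);        % complex vector with magnitude 2

R         = abs(z);                 % magnitude, equal to 2 everywhere
wrapped   = angle(z);               % phase folded into the interval [-pi, pi]
recovered = unwrap(wrapped);        % jumps of 2*pi removed, ramp restored

fprintf('Largest wrapped phase:   %.3f rad\n', max(abs(wrapped)));
fprintf('Largest unwrapping error: %g rad\n', max(abs(recovered - theta)));
```

The wrapped phase never exceeds pi in magnitude, while the unwrapped phase follows the original ramp, which is what makes the phase plots of the three signals comparable.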
freq = (0:Fs-1)-(Fs/2);
Now it is time to plot the right audio spectrum. The vertical axis of amplitude is in logarithmic scale. On the frequency axis there is the typical negative frequency redundancy of the Fourier transformation.
As long as the audio vectors have the same length, there is no scaling on the horizontal axis and the information matches perfectly. The subtraction of their logarithmic spectra gives a very good approximation of the spectral differences between the original and the ogg encoded audio. Since I am interested in differences and not absolute values, scaling on the vertical axis is irrelevant.
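The reason why vertical scaling drops out can be written in one line. Here $W(f)$ and $O(f)$ denote the spectra of the original and of the ogg decoded audio, and $c > 0$ is an arbitrary common scale factor; these symbols are introduced only for this note.

$$\log\big(c\,|W(f)|\big) - \log\big(c\,|O(f)|\big) = \log\frac{|W(f)|}{|O(f)|}$$

The difference of the logarithmic spectra is therefore the logarithm of the ratio of the magnitudes, and any gain common to both signals cancels.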
Similarly for the difference between the original audio and the mp3 encoded audio:
Finally I plot a double phase diagram which depicts the phase shift that the mp3 and the ogg encoding induces on the original phase, for the left and right audio channels:
I have assigned a different color to every vector, so that the phase shift which each encoder induces is clearly observable. I have done the same for the right audio channel.
B. Analysis Results
For a complete study of a compression codec it is necessary to perform blind listening tests, well organized and documented, with various sound samples, including songs, speech and music from several genres. The encoding quality has to cover a wide range, starting from a low rate of 64 kbps, through middle rates of 128 and 192 kbps, up to high rates of 320 kbps, for both encoding types, CBR and VBR. The tests have to be performed on a specific audio system, preferably a home theater PC (HTPC) connected to a Hi-Fi stereo amplifier and loudspeakers.
The simulation has to perform a full analysis of each sample, and the results have to be included in the study in order to be fully objective. For the present coursework this is unnecessary, and only some samples are included, representative of three major music genres: Classic music encoded at 128 kbps, Rock music at 192 kbps and Jazz music at 256 kbps. The following table includes the signal to noise ratio (SNR) and the peak SNR for the six samples.
It is clear that, in every test, Ogg Vorbis compresses audio samples with less noise than the mp3 encoder. Although the tests are few, they suggest that higher encoding rates give better SNR for both encoders. The reader can listen to every sample privately by pressing Ctrl plus Left Click on the underlined titles. Winamp is a recommended audio player, especially for the ogg format. Samples encoded at 192 kbps are considered transparent for the ordinary listener, but only a VBR of 256 kbps or more is considered satisfactory, especially for mp3. I am going to show why.
The range of the original sample, extracted from the uncompressed wav file, extends up to 22 kHz. It is clear that the mp3 encoder filters out frequencies above 16 kHz, something that is observable in all the samples and is also known from other experimental studies. Ogg Vorbis preserves more of the original dynamic range in the upper bands, exceeding 20 kHz.
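One way to put a number on such a cutoff is to look for the highest frequency at which the magnitude spectrum is still above some threshold relative to its maximum. The sketch below is an assumption-laden illustration: the spectrum is synthetic (flat up to 16 kHz and 80 dB lower above it), and the -40 dB threshold and all names are hypothetical. With real data, spec would be replaced by a measured spectrum such as mp3_fft.

```matlab
Fs   = 44100;
freq = (0:Fs-1) - Fs/2;                   % centered frequency axis in Hz

% Synthetic magnitude spectrum standing in for a decoded mp3 (hypothetical)
spec = ones(1, Fs);                       % flat pass band
spec(abs(freq) > 16000) = 1e-4;           % content above 16 kHz attenuated by 80 dB

thresh_dB = -40;                          % threshold relative to the maximum (assumption)
level_dB  = 20*log10(spec / max(spec));   % spectrum in dB relative to its peak
cutoff_Hz = max(abs(freq(level_dB > thresh_dB)));   % highest frequency above threshold

fprintf('Estimated cutoff: %d Hz\n', cutoff_Hz);
```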
The following diagram depicts the spectral differences between the original spectrum and the encoded spectrum, for the mp3 and the ogg encoder respectively. Both encoders have in common that they preserve the low frequencies better, from the dc coefficient up to 8 kHz. The mp3 encoder diverges more than ogg, as expected from the SNR and PSNR measurements. In general both codecs focus on low frequencies, where the bulk of the information is gathered. The spectrum difference of mp3 above 16 kHz is not meaningful, whereas ogg keeps an almost fixed spectral error above 12 kHz. I have similar results from the other samples as well.
The phase shift diagram in Figure 6 reveals many of the drawbacks of the mp3 encoder compared with Ogg Vorbis. On each audio channel, the mp3 encoder induces a larger shift from the phase of the original sample than the ogg encoder does. Apart from the phase distortion itself, it also causes a different phase shift on each channel. The latter is a major factor distorting the stereo image of music reproduced on a Hi-Fi stereo system. Another disturbing finding is that mp3 induces more phase distortion as the bitrate increases, something which is absent from the OGG and AAC encoders. It therefore undermines any claim that the mp3 format in general can meet high fidelity demands.
Ogg Vorbis is a comparatively new lossy audio compression codec. It is based on mp3 technology but has brought in many improvements, which the public community has suggested after at least a decade of using and analyzing mp3 compression. Ogg Vorbis offers improved dynamic range compared with mp3 and a higher signal to noise ratio. The flexible Vorbis encoder is able to produce anything from streaming audio accompanying streaming video in the Ogg container format, up to high quality compressed audio files aimed at the audiophile listening audience. The most important feature that Vorbis and the rest of Xiph's codecs have brought, however, is that they are patent free and open standards, which is of considerable practical value for independent developers and the open community.
% Matlab code by Ioannis Lianakis for the coursework of module CIM241
% The script begins by loading the three pieces of music. The user should
% change manually filenames for Rock, Jazz and Classic music, respectively.
[wav,Fs] = wavread('Music\Rock-1411.wav'); % wav read function
mp3 = mp3read('Music\Rock-192.mp3'); % mp3 read function
ogg = oggread('Music\Rock-192.ogg'); % Vorbis read function
% Calculation of the SNR and PSNR for the mp3 and the ogg coding
mp3 = mp3(1:length(wav),:); % It truncates the end padding
peak = max(abs(wav(:))); % The peak absolute uncompressed value
diff = wav - mp3; % matrix containing the one to one subtraction
disp(['SNR of mp3 is ',num2str(10*log10(sum(wav(:).^2)/sum(diff(:).^2))),' dB']); % The SNR formula
% numel(wav) counts the samples of both channels, so sum(diff(:).^2)/numel(wav) is the MSE
disp(['PSNR of mp3 is ',num2str(10*log10(numel(wav)*peak^2/sum(diff(:).^2))),' dB']); % The PSNR formula
disp(' '); % Prints a new line
diff = wav - ogg; % Similar as above
disp(['SNR of ogg is ',num2str(10*log10(sum(wav(:).^2)/sum(diff(:).^2))),' dB']);
disp(['PSNR of ogg is ',num2str(10*log10(numel(wav)*peak^2/sum(diff(:).^2))),' dB']);
% Performs the Fast Fourier Transformation
freq = (0:Fs-1)-(Fs/2); % creates the frequency scaling for the X axis
wav_fft = fftshift(abs(fft(wav(:,2),Fs))); % Extracts the complex modulus for Right audio channel
mp3_fft = fftshift(abs(fft(mp3(:,2),Fs))); % and shifts the zero frequency component
ogg_fft = fftshift(abs(fft(ogg(:,2),Fs))); % to the center of the array
wav_left_phase = unwrap(angle(fftshift(fft(wav(:,1),Fs)))); % extracts the phase angle (in radians)
mp3_left_phase = unwrap(angle(fftshift(fft(mp3(:,1),Fs)))); % from Left audio channel complex array
ogg_left_phase = unwrap(angle(fftshift(fft(ogg(:,1),Fs)))); % and corrects phases for smoother plots
wav_right_phase = unwrap(angle(fftshift(fft(wav(:,2),Fs)))); % Similar as above for Right audio channel
mp3_right_phase = unwrap(angle(fftshift(fft(mp3(:,2),Fs))));
ogg_right_phase = unwrap(angle(fftshift(fft(ogg(:,2),Fs))));
figure(1); % Creates the Right Channel spectrum plots
clf(); % clears the contents of the last figure
subplot(3,1,1); % Creates 3 vertical plots and draws the first
plot(freq,log(wav_fft(:))); % Draws the FFT(freq) on a logarithmic scale
ylabel('Original wav'); % of the uncompressed right channel
axis([ -2.5e4 2.5e4 -8 8 ]);
subplot(3,1,2); % Draws the second of the plots
plot(freq,log(ogg_fft(:))); % of the ogg coded right channel
ylabel('Wav from ogg');
axis([ -2.5e4 2.5e4 -8 8 ]);
subplot(3,1,3); % Draws the third of the plots
plot(freq,log(mp3_fft(:))); % of the mp3 coded right channel
ylabel('Wav from mp3');
axis([ -2.5e4 2.5e4 -8 8 ]);
figure(2); % Creates the spectrum difference plots
subplot(2,1,1); % Creates 2 vertical plots and draws the first
plot(freq,log(wav_fft(:))-log(ogg_fft(:))); % Differences between the
ylabel('Diff. bt ogg-wav'); % original and the ogg spectrum
axis([ -2.5e4 2.5e4 -5 5 ]);
subplot(2,1,2); % Draws the second of the plots
plot(freq,log(wav_fft(:))-log(mp3_fft(:))); % Differences between the
ylabel('Diff. bt mp3-wav'); % original and the mp3 spectrum
axis([ -2.5e4 2.5e4 -5 5 ]);
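figure(3); % Creates the phase shift plots
clf(); % clears the figure, so that reruns do not pile up traces
subplot(2,1,1); % Creates the first of the two vertical plots,
hold on; % for the left audio channel
plot(freq,wav_left_phase(:),'b'); % Three phase shift plots are overlay drawn
plot(freq,ogg_left_phase(:),'r'); % For the original, the ogg and the mp3 spectrum
plot(freq,mp3_left_phase(:),'g'); % mp3 trace; the color 'g' is an assumption
title('Left Phase Shift');
subplot(2,1,2); % Creates the second of the two plots,
hold on; % for the right audio channel
plot(freq,wav_right_phase(:),'b'); % similarly as above
plot(freq,ogg_right_phase(:),'r');
plot(freq,mp3_right_phase(:),'g'); % mp3 trace; the color 'g' is an assumption
title('Right Phase Shift');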