Joint Stereo vs Stereo - Which is better?
Joint Stereo is a neat mathematical technique which is used to enhance the quality of compressed digital audio. It generally tends to be associated with the most popular format of audio compression - mp3, but it has also been incorporated into several other formats, for example: in the other "layers" of MPEG audio (mp3 = MPEG1 layer III) and in Advanced Audio Coding (AAC). The Ogg developers have (wisely, perhaps) avoided using the expression "Joint Stereo", but their "channel interleaving" and "lossless stereo image coupling" is essentially a kind of Joint Stereo Plus GT. JS is also used by some lossless audio compression techniques, for example: Monkey's Audio (ape) and Lossless Predictive Audio Compression (LPAC). (The mere fact that Joint Stereo is used in lossless compression ought to be enough to destroy - in one stroke - the myth that JS "destroys stereo separation"). My particular interest is in the mp3 format, but the basic principles behind Joint Stereo - what it is, and how it works - are universally true, regardless of the particular application. For a good introduction to the basic principles of MPEG Audio compression, I recommend the MPEG Audio FAQ at the Hannover University site.
The vast majority of mp3 enthusiasts - the millions of people who transfer mp3 files around the Internet via Napster-style file sharing software, or via IRC, or Usenet, or whatever - are probably blissfully unaware of all the arguments which take place concerning LAME v Fraunhofer, 128k v 192k, CBR v VBR, stereo v joint stereo - and good luck to them! However, once people start to acquire a basic level of knowledge concerning the mechanics of audio compression, they almost inevitably gravitate towards the opinion that "Joint Stereo Is BAD". Here is a selection of some of my "favourite" quotes about JS which I've seen posted onto Usenet during recent years:
[JS] is evil IMHO. It's a waste of good music, d/l time, and upload time too. Sure the files are smaller but what's the use if it's not true stereo!
Why on earth would you want to alter what the engineer had in mind just to save a few dozen KB in file size? It just doesn't make sense to me.
FUCK ALL YOU ASSHOLES WHO RIP IN JOINT STEREO, YOU'RE JUST DESTROYING GOOD MUSIC
I have always thought that joint stereo was the worst option, and only used to save space or by those that did not know any better.
if its so much better quality why dont record companies use it or something similar on commercialy released cd's your ears are broke fag....go see a fucking doctor
uhm ... yes, well, thanks for all the well-reasoned expert advice, folks ... and if you were to seek a more learned opinion by searching the Web, well, some of the seemingly "authoritative" advice out there is also pretty garbled, to say the least:
Probably the worst "explanation" I've ever seen is at cdrinfo.com:
"I'll try to explain Joint Stereo very briefly. Basically it looks for signals that are identical in the left and right channel and if it finds any they are encoded as mono. This means that 50% bits are saved for the mono encoded signals and these bits are used to improve the encoding accuracy (very simplified explanation)"
Very SIMPLIFIED explanation? Very INCORRECT explanation, more like! And it's somewhat ironic that one of the best explanations of JS that I've come across is in the documentation for the Xing encoder (Xing Guide in .pdf format). Ironic? Well, should you be unaware, the Xing mp3 encoder is widely considered to be the worst available, but clearly, they have a good understanding of what they're trying to do!
"Don't confuse Joint Stereo (Stereo mode 1) with the Joint Stereo coding used for MPEG layer 2 encoding - it is not the same. Joint Stereo (Stereo mode 1) encoding for MPEG layer3 allows the XingMP3 Encoder to use additional methods of encoding, specifically - MS Stereo (Middle/Side Stereo), and for lower bitrates only, Intensity Stereo, in addition to the Independent Channel coding used for Stereo mode 0. MS Stereo uses one channel to encode information that is identical on the left and right channels and the other channel to encode the differences between the two channels. Intensity Stereo encodes only bits that are perceived to be important to the stereophonic image. The XingMP3 Encoder uses Intensity Stereo only in low bitrate files, (96kbps or less) where file size is critical to the user. In Joint Stereo (Stereo mode 1), the encoder dynamically (frame by frame) chooses the method of encoding that produces the best quality for each individual frame. Dynamic encoding improves compression efficiency which results in a higher quality file using less bits. Stereo mode 0 encodes the left and right channels independently. The total bitrate remains constant, but the split between the channels can vary. The XingMP3 Encoder uses this flexibility to improve quality by allocating more bits to the channel with the more dynamic signal. For MPEG layer 3 encoding, Stereo mode 0 limits the encoder to only one method of encoding - Independent Channels. Because Stereo mode 0 is limited to one method of encoding, Joint Stereo (Stereo mode 1) in most cases produces higher quality. In the exceptions, the Stereo mode 0 quality will be essentially equivalent to Joint Stereo (Stereo mode 1)."
Even the Fraunhofer Gesellschaft (FhG) - the people who originally devised mp3 compression - provide a definition of Joint Stereo which, I think, does more to confuse than explain:
"Joint stereo coding takes advantage of the fact that both channels of a stereo channel pair contain far the same information. These stereophonic irrelevancies and redundancies are exploited to reduce the total bitrate. Joint stereo is used in cases where only low bitrates are available but stereo signals are desired."
Now, whilst I wouldn't like to imply in any way that Fraunhofer don't know what they're talking about(!), I think that their emphasis on "reducing the bitrate" does tend to confuse people. Of course, FhG are looking at ... to use a cliché I loathe ... the bigger picture. The aim of all audio compression techniques is to reduce bitrates whilst minimising any deterioration in sound quality. The use of Joint Stereo enhances sound quality such that smaller bitrates are more likely to be considered acceptable.
But, the idea that JS simply discards stereo information in an attempt to trim a few kBytes off the file size (as our chums above seem to think) is completely erroneous, and in this case, it is one myth which we ought to (easily) destroy before we go any further: Whenever you encode a .wav file at a specified Constant Bit Rate, this bitrate defines the size of the compressed mp3, regardless of what stereo mode you choose, and in fact, regardless of which encoder you select. For example, if you encode at 128k (that is, 128 kilobits per second), then each second of music will be represented by 128 000 bits. And for the benefit of anyone unfamiliar with bits, Bytes, MegaBytes, etc.:
1 second = 128 000 bits
= 16 000 Bytes (8 bits = 1 Byte)
= 15.625 kBytes (1024 Bytes = 1 kByte)
1 minute = 0.9155 MBytes (1024 kBytes = 1 MByte)
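The same arithmetic can be checked with a few lines of Python. This is only a sketch: the function name cbr_size_bytes is made up for the illustration, and it ignores any tags or headers which a real mp3 file would carry.

```python
def cbr_size_bytes(bitrate_bps: int, seconds: float) -> float:
    """Size in Bytes of a constant-bit-rate stream (tags and headers ignored)."""
    return bitrate_bps * seconds / 8  # 8 bits = 1 Byte

one_second = cbr_size_bytes(128_000, 1)
one_minute = cbr_size_bytes(128_000, 60)

print(one_second)                        # Bytes per second
print(one_second / 1024)                 # kBytes per second
print(round(one_minute / 1024**2, 4))    # MBytes per minute
```

Note that the stereo mode does not appear anywhere in the calculation: at a Constant Bit Rate, only the bitrate and the duration determine the size.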
If you are in any doubt whatsoever about this, please do a simple test for yourself, with any .wav file, any encoder, and any bitrate. Joint Stereo does not "discard" a single bit; it simply makes better use of the number of bits which you have specified. Before I go any further in describing how it does this, let's remind ourselves of the basic principles of mp3 compression.
To begin with the obvious: the whole point of compressed audio is ... to compress audio! i.e., to reduce the size of a raw audio wav file as much as possible. Repeat after me: Smaller is Better ... Smaller is Better ... Smaller is Better ...
The "best" size for an mp3 file is the smallest size which does not noticeably degrade the quality of the audio signal. Obviously, different listeners will have different opinions as to what level of degradation can be tolerated, but the basic principle is ... Smaller is Better.
mp3 began life as one of several "layers" of audio compression developed by the Moving Picture Experts Group (MPEG). mp3 was the third of the 3 layers developed - the most complex, and the most compressive. The original aim of the MPEG committee was to reduce the size of audio files to around 10% of their uncompressed size. And in fact, since their target bit rate was 128k (as opposed to 1411k for a raw signal), this actually works out as barely 9% of the original. So, how is it possible to discard 91% of your original data without causing any noticeable deterioration in the audio quality?
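As a quick check on those figures, here is a small Python sketch. It assumes that the "raw signal" is standard CD audio (44 100 samples per second, 16 bits per sample, 2 channels); the text above quotes only the rounded 1411k figure.

```python
SAMPLE_RATE_HZ = 44_100   # assumed: CD audio sample rate
BITS_PER_SAMPLE = 16      # assumed: CD audio sample size
CHANNELS = 2              # stereo

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
target_bps = 128_000
ratio = target_bps / raw_bps

print(raw_bps)           # bits per second for the raw signal
print(f"{ratio:.1%}")    # compressed size as a share of the original
```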
1. Perceptual Coding
Perceptual Coding is by far the most important (and the most technically demanding) element of mp3 compression, but it is a concept which a lot of people have difficulty with when they come across terms like "Minimal Audition Threshold", "Psychoacoustic Modelling" or "Auditory Masking".
The basic principle behind Perceptual Coding is that the compressed audio can safely ignore all those sounds which the human ear does not perceive - either because we are physically incapable of hearing them, or because they are masked by other sounds.
One example which might help to explain this concept is to think in terms of vinyl recordings. During quieter passages of music, we are often very aware of all the deficiencies of vinyl - clicks, crackle and pops. However, during the louder sections we no longer hear them. They haven't gone away - it's just that they become "masked" by the stronger waveforms.
Alternatively, think of a walk in the hills. All is at peace with the world - we hear the larks singing, a nearby stream bubbling over the rocks, the wind blowing through the gullies, a distant sheep bleating - then, a couple of RAF jets on manoeuvres suddenly appear from nowhere (yes, I'm thinking the English Lake District here!). All of a sudden you hear nothing but the deafening roar of jet engines. As before, the other sounds haven't disappeared, they have just been masked.
Now, if you had an audio recording of this scenario and you wanted to compress the signal losslessly, you would need to include all of these masked sounds. In "lossy" compression techniques, we ignore them. The great "art" of mp3 compression (although also backed by a lot of scientific study) is in identifying and eliminating those elements of the sound waves which can be safely ignored.
2. Huffman Coding
In complete contrast to Perceptual Coding, which is the main "lossy" element of mp3 compression, Huffman Coding is a lossless technique, similar to that used in standard zip compression and the like. The basic principle is to give the values which occur most frequently in the data the shortest code words, and the rarer values longer ones, so that the total number of bits needed is reduced. A figure of 20% is regularly quoted as being the typical level of compression which can be achieved by Huffman Coding alone.
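To make the idea concrete, here is a minimal Huffman coder in Python. It is a textbook sketch rather than the mp3 procedure itself (mp3 chooses from predefined code tables instead of building one from the data), and the input values are made up for the illustration.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Return a {symbol: bitstring} prefix code built from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent groups
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def decode(bits, code):
    """Invert the prefix code, reading the bitstring left to right."""
    reverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return out

data = [0, 0, 0, 0, 1, 1, -1, 2]   # made-up values; 0 is the most common
code = huffman_code(data)
encoded = "".join(code[s] for s in data)
print(code)
print(len(encoded), "bits, versus", 2 * len(data), "with a fixed 2-bit code")
```

The most common value ends up with a 1-bit code word and the two rare values with 3-bit code words, and decoding recovers the input exactly, which is what makes this stage lossless.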
3. The Bit Reservoir
You regularly come across people in mp3 circles who are terrified by the idea of Variable Bit Rates: the possibility that some section of their music might be encoded at a rate below what they perceive to be the minimum acceptable, regardless of whether there's actually any significant audio content, or not. In fact, all mp3s are - to some extent - encoded at Variable Bit Rate; it's just that the variability is considerably lower with Constant Bit Rate encodings.
One of the techniques which mp3 employs to maximise quality at low bitrates is the use of a "bit reservoir". Basically, spare bits are stored in a kind of "savings bank" during simpler passages of music, such that they may be subsequently used during the more complex passages, thereby reducing the deterioration in audio quality which would otherwise occur. The bit reservoir has to be built up from zero before it can be used - the system would soon collapse if bits were borrowed in advance, with some kind of vague promise to pay them back before the encoder got to the end of the piece of music. But, since many pieces start with a short section of silence, or very quiet audio, this is a good opportunity to build up a bit of credit in the bit reservoir.
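The "savings bank" idea can be sketched as a toy model in Python. Everything here is hypothetical: the function name, the per-frame budget, the reservoir cap and the demand figures are invented for the illustration and are not taken from the mp3 specification or from any encoder.

```python
def allocate_with_reservoir(demands, frame_budget, reservoir_cap):
    """Toy model of a bit reservoir.

    Each frame earns frame_budget bits. Unused bits are banked (up to
    reservoir_cap) and can be spent by later frames. Bits are never
    borrowed from the future.
    """
    reservoir = 0
    granted = []
    for demand in demands:
        available = frame_budget + reservoir
        used = min(demand, available)
        reservoir = min(available - used, reservoir_cap)
        granted.append(used)
    return granted

demands = [1000, 1500, 6000, 3000]   # hypothetical bits wanted per frame
with_bank = allocate_with_reservoir(demands, frame_budget=3000, reservoir_cap=4000)
no_bank = allocate_with_reservoir(demands, frame_budget=3000, reservoir_cap=0)
print(with_bank)
print(no_bank)
```

In this example the two quiet opening frames build up enough credit for the complex third frame to receive all 6000 bits it asks for; without the reservoir it would be capped at 3000.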
And now we come to the final technique used to enhance mp3 compression ... Joint Stereo.
Whoever was responsible for incorporating Joint Stereo into mp3 made two big mistakes (IMHO):
The first mistake was in naming the technique "Joint Stereo" - the very name suggests to people that the Left and Right stereo channels are "joined" together.
The second mistake was to give this name to two completely different compression enhancement techniques: Intensity Stereo and Mid/Side Stereo. The two techniques have nothing in common other than the general principle of (slightly paraphrasing the FhG quote) "exploiting stereophonic irrelevancies and redundancies".
The name "Joint Stereo" could reasonably be applied to "Intensity Stereo", which does, in fact, combine Left and Right audio channels at higher frequencies. The reasoning behind this is that the human ear is very insensitive to the direction of sounds as we approach the upper limit of our hearing range. Upon decoding the compressed signal, the stereo image is partially restored by applying different scaling factors to the Left and Right channels. However, this technique is far from transparent.
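The general idea (one combined signal plus a scaling factor per channel) can be sketched in Python as follows. This is a toy model under stated assumptions, not the actual mp3 bitstream mechanism: the function names are hypothetical, the gains are chosen simply to preserve each channel's energy, and the sample values are made up.

```python
import math

def intensity_encode(left, right):
    """Collapse one band to a single combined signal plus a gain per channel."""
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    e_mono = sum(m * m for m in mono)
    if e_mono == 0:
        return mono, 0.0, 0.0
    gain_l = math.sqrt(sum(l * l for l in left) / e_mono)
    gain_r = math.sqrt(sum(r * r for r in right) / e_mono)
    return mono, gain_l, gain_r

def intensity_decode(mono, gain_l, gain_r):
    """Rebuild two channels which differ only in loudness."""
    return [m * gain_l for m in mono], [m * gain_r for m in mono]

left = [0.5, -0.2, 0.4, -0.6]     # made-up high-frequency band samples
right = [0.1, -0.3, 0.2, -0.1]
mono, gain_l, gain_r = intensity_encode(left, right)
dec_l, dec_r = intensity_decode(mono, gain_l, gain_r)
print(dec_l)
print(dec_r)
```

The decoded channels have the right loudness, but they are scaled copies of one another, so any difference in waveform shape between the original Left and Right is gone. That is the sense in which this technique genuinely loses stereo information.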
The main point to appreciate about Intensity Stereo is that it was only ever intended to be used in situations where - for whatever reason - the bitrate needs to be restricted to 96k or less, but where some form of stereo signal is desired. In these circumstances, it is generally preferable to accept some loss of stereo definition than to attempt to encode in full stereo with insufficient bits. Most of the music which circulates around the Internet these days tends to be encoded at bitrates of 128k or higher. Above 96k, all the current generation of encoders will default to Mid/Side Joint Stereo, and hence, most music fans will rarely encounter Intensity Stereo.
The only exception to this is the old Xing encoder, which used Intensity Stereo at higher bitrates. This is one of the reasons why Xing has gained such a bad reputation.
And one final point: to a lesser extent, the human ear is also relatively insensitive to direction at very low frequencies. Anyone who uses a speaker system with a centralised sub-woofer is utilising a form of Intensity Stereo, and demonstrating the fact that full stereo is not really necessary at certain frequencies.
The form of "Joint Stereo" which you will invariably encounter in any recent audio encodings is Mid/Side Stereo. As mentioned above, M/S Stereo has been poorly served by being labelled as "Joint" Stereo; the Left and Right channels are not "joined" in any meaningful sense of the word. The numerical data which defines the left and right audio signals is simply rearranged mathematically and stored in a matrix form - but, in such a way that the original channels can be easily and losslessly reconstituted. Matrix Stereo would be a more accurate description. The basic theory behind Mid/Side Stereo is quite simple to understand - if you are not afraid of a little schoolboy algebra!
Mid/Side Stereo looks at the same 2-channel data from a different perspective. Instead of storing the audio data in Left and Right channels, we can just as well store the same series of numbers in terms of the Average (of Left and Right) and the Difference (between Left and Right). This is the basic principle of Mid/Side (Joint) Stereo: the Average signal is normally referred to as "Middle", and the Difference as "Side".
Now, if you can handle a bit of schoolboy algebra, let's demonstrate that these two alternative formats of storing stereo data are completely interchangeable. To begin with, the simplest way of storing 2-channel audio data is in terms of the Left (L) and Right (R) values at any particular time:
Left = L
Right = R
In Mid/Side format, we store the Average and Difference values instead:
Middle = (L+R)/2
Side = (L-R)/2
Note that the sign of the Side value is very important. The usual convention is that positive Side indicates that the Left signal is bigger than the Right; a negative Side indicates that the Right is bigger.
Should we wish to return from Mid/Side format back to Left/Right, then we can recreate the Left channel by summing the Average plus the (Left-Right) Difference, and the Right channel by taking the Average minus the Difference, as follows:
Left = Middle + Side
Right = Middle - Side
Substituting the earlier expressions for Middle and Side:
Left = (L+R)/2 + (L-R)/2
Right = (L+R)/2 - (L-R)/2
Or, in other words:
Left = L
Right = R
Back where we started from!
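For anyone who prefers to see it run rather than follow the algebra, here is the same round trip as a short Python sketch. The function names are made up for the illustration, and Fraction is used so that the halving stays exact even when L+R is an odd number.

```python
from fractions import Fraction

def to_mid_side(left, right):
    """Middle = (L+R)/2, Side = (L-R)/2, kept as exact fractions."""
    return Fraction(left + right, 2), Fraction(left - right, 2)

def to_left_right(mid, side):
    """Left = Middle + Side, Right = Middle - Side."""
    return mid + side, mid - side

samples = [(20000, 18000), (-3, 4), (32767, -32768)]
for l, r in samples:
    mid, side = to_mid_side(l, r)
    assert to_left_right(mid, side) == (l, r)   # exactly the original pair
print("round trip is exact for all test samples")
```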
One myth which regularly crops up in discussions about Joint Stereo is:
If Joint Stereo was any good, then surely everybody would record CDs in Joint Stereo.
This comment is difficult to deal with simply because the person making it clearly has zero understanding of what JS is all about. The technique of manipulating stereo audio signals in Mid/Side format was devised solely for the purpose of enhancing audio compression. As far as uncompressed audio is concerned, it offers no benefits whatsoever. On the other hand, neither does it have any drawbacks - other than the fact that it is slightly more complex to handle than simple stereo.
If you were to write a modified CD burning program which stored the audio data in Middle and Side channels, and if you were to add an extra circuit in your CD player to decode Middle and Side back to Left and Right, then (as we have shown above) you would be back where you started from. The decoded Left and Right signals would be no better and no worse than the original Left and Right signals. They would be exactly the same. CDs are recorded in Simple (Left/Right) Stereo because it is simpler - obviously!
Well, the M/S Joint Stereo technique takes advantage of the fact that for most recorded music, there is comparatively little difference between the audio signals for the Left and Right channels. Re-arranging the data into Middle and Side channels will usually result in a situation where the Middle channel is much bigger than the Side channel. In which case, the smaller Side channel can then be accurately encoded using fewer bits - freeing up resources which can then be employed more usefully on the larger Middle channel. When the Middle and Side channels are subsequently reformatted back to Left and Right, the net result will be a more accurate representation of the original Left and Right input channels.
A simplified demonstration:
Although this simple exercise is in no way intended to illustrate the actual mechanics of mp3 encoding, I think it is useful to be able to demonstrate the basic principle that focusing more resources on the larger signal and less on the smaller can yield overall benefits. Let's take a very small segment of a fictitious audio wave, just 6 samples, as follows:
LEFT: 20 000 21 000 22 000 23 000 22 000 22 000
RIGHT: 18 000 19 000 20 000 19 000 18 000 19 000
Then, suppose we do a lossy audio compression, and when we re-expand the compressed signal, we get an error of +/- 2%:
LEFT: 20 400 20 580 22 440 22 540 22 440 21 560
RIGHT: 18 360 18 620 20 400 18 620 18 360 18 620
Now, for the alternative Joint Stereo approach, we first need to transform the (original) Left and Right channels into Middle and Side, as follows:
MIDDLE: 19 000 20 000 21 000 21 000 20 000 20 500
SIDE: +1 000 +1 000 +1 000 +2 000 +2 000 +1 500
Because of the relatively small size of the Side channel, we are now able to encode this using fewer resources. In percentage terms, the accuracy of the encoding may decrease to say, +/- 4%, but in absolute terms, the error will be smaller. However, because we can now divert more resources to the Middle channel, we can expect the accuracy of the encoding to increase to say, +/- 1%. So, using these figures, the expanded Mid/Side compression would look like this:
MIDDLE: 19 190 19 800 21 210 20 790 20 200 20 295
SIDE: +1 040 + 960 +1 040 +1 920 +2 080 +1 440
Now, transforming the Middle and Side channels back to Left and Right, the decompressed waveform looks like this:
LEFT: 20 230 20 760 22 250 22 710 22 280 21 735
RIGHT: 18 150 18 840 20 170 18 870 18 120 18 855
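The figures above can be reproduced with a short Python sketch. The only assumption is that the error alternates in sign from sample to sample (+, -, +, -, +, -), which is the pattern the tables above follow; the helper names are made up for the illustration.

```python
LEFT = [20000, 21000, 22000, 23000, 22000, 22000]
RIGHT = [18000, 19000, 20000, 19000, 18000, 19000]
SIGNS = [+1, -1, +1, -1, +1, -1]   # assumed alternating error direction

def with_error(channel, percent):
    """Apply an alternating +/- percent error to each sample."""
    return [x * (100 + s * percent) // 100 for x, s in zip(channel, SIGNS)]

def abs_errors(decoded, original):
    return [abs(d - o) for d, o in zip(decoded, original)]

# Path 1: code Left and Right directly, each with a 2% error.
lr_left = with_error(LEFT, 2)
lr_right = with_error(RIGHT, 2)

# Path 2: convert to Mid/Side, code Mid at 1% and Side at 4%, convert back.
mid = [(l + r) // 2 for l, r in zip(LEFT, RIGHT)]
side = [(l - r) // 2 for l, r in zip(LEFT, RIGHT)]
mid_c = with_error(mid, 1)
side_c = with_error(side, 4)
ms_left = [m + s for m, s in zip(mid_c, side_c)]
ms_right = [m - s for m, s in zip(mid_c, side_c)]

print(abs_errors(lr_left, LEFT))     # [400, 420, 440, 460, 440, 440]
print(abs_errors(ms_left, LEFT))     # [230, 240, 250, 290, 280, 265]
print(abs_errors(lr_right, RIGHT))   # [360, 380, 400, 380, 360, 380]
print(abs_errors(ms_right, RIGHT))   # [150, 160, 170, 130, 120, 145]
```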
Note that in every case, the Mid/Side analysis gives a better match to the original signal than the equivalent Left/Right results. But, before we go on, two points:
Yes, I "fiddled" the results by selecting a sample in which the Left and Right channels were pretty similar. Had I selected a sample in which the Left and Right channels were completely out of phase, then the Mid/Side compression would in fact, turn out to be worse than Left/Right. But, it is a fact that a large proportion of music recorded in stereo has relatively little content which is distinctly different between the two channels. Anyone who has carried out analysis of audio waveforms will be well aware of this, but I will provide a few illustrative samples in the next section.
Also, I think I should acknowledge the fact that lossy audio compression is judged entirely on the basis of its acoustic match with the original data, rather than its numerical accuracy. Whilst it's obviously no bad thing to be capable of achieving a better numeric match with the original waves, it cannot necessarily be assumed that this will automatically result in a better quality encoding.
However, the aim of every encoder which uses Joint Stereo is to correctly identify those sections of music which will benefit from being compressed in Mid/Side format and those which will not. If the differences between the Left and Right channels are such that Mid/Side compression would actually give worse results than Left/Right, then the encoder will switch to Simple Stereo on these frames. In theory, Mid/Side Joint Stereo will always give results which are at least as good as a Left/Right encoding. It cannot be worse, since Mid/Side will only be used on those sections of music which will benefit. Please note my use of the words in theory. The theory assumes that Joint Stereo will be implemented faultlessly, and it has to be said that many implementations of JS have been considerably less than faultless!
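A frame-by-frame decision of this kind could be sketched in Python as below. This is purely illustrative: the energy comparison and the threshold value are assumptions made for the example, and real encoders use their own (usually psychoacoustic) criteria.

```python
def choose_stereo_mode(left, right, threshold=0.5):
    """Pick 'MS' when the Side signal carries much less energy than Middle.

    The threshold is an arbitrary illustrative value, not one taken
    from any real encoder.
    """
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    e_mid = sum(m * m for m in mid)
    e_side = sum(s * s for s in side)
    return "MS" if e_side < threshold * e_mid else "LR"

# A frame with similar channels, and one with the channels out of phase.
similar = choose_stereo_mode([20000, 21000, 22000], [18000, 19000, 20000])
opposed = choose_stereo_mode([20000, -21000, 22000], [-20000, 21000, -22000])
print(similar, opposed)
```

With similar channels the Side signal is tiny and the sketch picks Mid/Side; with out-of-phase channels the Middle signal vanishes and it falls back to Left/Right for that frame.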
Now, before I go on to conclude this article, I'd like to support some of the points raised by means of a few audio samples and screenshots.
Up until this point, I have tried to keep this article entirely factual. But, as to how the myths about Joint Stereo originated, why they have persisted for so long, and why they are so widely believed, I can only guess.
My first guess is simply that the name "Joint Stereo" is so misleading. It clearly implies that the two stereo channels are "joined" together in some way. As I have demonstrated, nothing is joined together in the sense that information is lost: MS stereo temporarily stores the numerical data representing the Left and Right channels in a slightly different format. Matrix Stereo would be a better definition, or simply Mid/Side Stereo.
Another aspect of JS which certainly causes a lot of confusion is the fact that there are two different types of Joint Stereo - "good" JS (Mid/Side Stereo) and "bad" JS (Intensity Stereo). Perhaps it's a little unfair to brand Intensity Stereo as "bad": It was devised for a specific purpose - to provide a minimum of stereo information, whilst working with as low a bitrate as possible. Nevertheless, it is a lossy manipulation, it can cause easily detectable corruption of the stereo signal, and it can be hard work trying to convince people that Mid/Side is a completely different type of Joint Stereo which is not lossy.
However, perhaps the biggest public relations problem which JS has is the fact that whilst it always works perfectly in theory, the application of that theory has, at times, been very imperfect. When it goes wrong, it can go spectacularly wrong. I have first hand experience of this, since I used to do all my mp3 encoding using the popular Fraunhofer 1.263 encoder (widely circulating around the Internet in the form of a Radium hacked version). Despite encoding many hundreds of files in Joint Stereo without any obvious defects, one day I produced an mp3 which sounded badly corrupted, despite the fact that there seemed to be nothing in the least unusual about the source material. I have saved a 30 second sound clip of this song for future reference: it's from an FM radio recording of J Mascis & The Fog performing "Same Day". Here's the clip encoded with the Fh 1.263 at 160k (simple) Stereo; and here it is at 160k Joint Stereo. I had "discovered" for myself the well-known bug in the Fh 1.263 encoder! (details can be found on ff123's site). And just to prove that a well-implemented Joint Stereo encoder can deal OK with this unusual waveform, here it is again encoded with LAME v3.91 160k JS.