ReplayGain 1.0 specification

From Hydrogenaudio Knowledgebase

Revision as of 22:29, 11 December 2010

The Problem

Not all CDs sound equally loud. The perceived loudness of mp3s is even more variable. Whilst different musical moods require that some tracks should sound louder than others, the loudness of a given CD has more to do with the year of issue or the whim of the producer than the intended emotional effect. If we add to this chaos the inconsistent quality of mp3 encoding, it's no wonder that a random play through your music collection can have you leaping for the volume control every other track.

The solution

There is a remarkably simple solution to this annoyance, and that is to store the required replay gain for each track within the track. This concept is called "MetaData" – data about data. It's already possible to store the title, artist, and CD track number within an mp3 file using the ID3 standard. The later ID3v2 standard also incorporates the ability to store a track relative volume adjustment, which can be used to "fix" quiet or loud sounding mp3s.

However, there is no consistent standard by which to define the appropriate replay gain which mp3 encoders and players agree on, and no automatic way to set the volume adjustment for each track – until now.

The Replay Gain proposal sets out a simple way of calculating and representing the ideal replay gain for every track and album.

Calculation

Equal Loudness Filter

The human ear does not perceive sounds of all frequencies as having equal loudness. For example, a full scale sine wave at 1kHz sounds much louder than a full scale sine wave at 10kHz, even though the two have identical energy. To account for this, the signal is filtered by an inverted approximation to the equal loudness curves (sometimes referred to as Fletcher-Munson curves).

Equal loudness curves

Figure 1: Equal loudness contours

Figure 1 shows the Equal Loudness Contours, as measured by Robinson and Dadson, 1956. The original measurements were carried out by Fletcher and Munson in 1933, and the curve often carries their name.

The lines represent the sound pressure required for a test tone of any frequency to sound as loud as a test tone of 1 kHz. Take the line marked "60" - at 1 kHz ("1" on the x axis), the line marked "60" is at 60dB (on the y axis). If you follow the "60" line down to 0.5 kHz (500 Hz), and look across to the y axis, the value is about 55 dB. What this means is that a 500 Hz tone at 55 dB SPL sounds as loud to a human listener as a 1 kHz tone at 60 dB SPL.

If every frequency sounded equally loud, then this graph would just be a series of horizontal lines. As it isn't, a filter is required to simulate this characteristic.


Required equal loudness filter

Figure 2: Loudness contours inverse response

Where the lines curve upwards, this means that we are less sensitive to sounds of that frequency. Hence, the filter must attenuate (reduce) sounds of that frequency. The ideal filter will be the inverse of the above graphs. As we don't know the replay level yet, and don't want to use a different filter for sounds of differing loudness, a representative average of the above curves is chosen as the target filter:


Design of the equal loudness filter

Figure 3: Target response (blue) and "yulewalk" filter response (magenta)
Figure 4: Target response (blue), high-pass response (green) and composite response (red)

MATLAB offers several functions to design FIR and IIR filters to match arbitrary amplitude responses. Feeding the target response into yulewalk.m, and requesting a 2x10 coefficient IIR filter gives the following response:

At higher frequencies, this filter is an excellent approximation to our target. However, at lower frequencies, it doesn't even come close. Increasing the number of coefficients does not cause the yulewalk function to perform significantly better.

One solution is to cascade the yulewalk filter with a 2nd order Butterworth high pass filter, with a high pass frequency of 150 Hz. The resulting combined response (Figure 4) is close to our target response, and is used by Replay Level.


RMS Energy Calculation

Next, the energy during each moment of the signal is determined by calculating the Root Mean Square of the waveform every 50ms.

It's easy to calculate the RMS energy over an entire audio file. For example, Cool Edit Pro (from Syntrillium) does this in its Analise:statistics box. Unfortunately, this value doesn't give a good indication of the perceived loudness of a signal. It's closer than that given by the peak amplitude, but it's still not good enough. For this reason, we have to calculate the RMS energy on a moment by moment basis (as described on this page), then do something useful with all that data.

General concept

The signal is chopped into 50ms long blocks. Then, for each block:

  1. Every sample value is squared (multiplied by itself).
  2. The mean average is taken.
  3. The square root of the average is calculated.

If you read those steps backwards, it's obvious why it's called Root Mean Square (RMS) averaging. Basically, that's all we have to do.
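
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function name `block_rms` is made up for this example, the input is assumed to be a single channel that has already passed through the equal loudness filter, and dropping a trailing partial block is an assumption that the text does not address.

```python
import math

def block_rms(samples, sample_rate, block_seconds=0.05):
    """Return one RMS value per complete block of a single-channel signal."""
    block_len = int(round(sample_rate * block_seconds))
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    values = []
    # Step through the signal one whole block at a time.
    # Assumption: a trailing partial block is ignored.
    for start in range(0, len(samples) - block_len + 1, block_len):
        block = samples[start:start + block_len]
        mean_square = sum(x * x for x in block) / block_len  # steps 1 and 2
        values.append(math.sqrt(mean_square))                # step 3
    return values
```

At a 44.1 kHz sample rate, each 50ms block holds 2205 samples.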

Averaging time

The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.

Stereo files

The only difficulty lies in what to do with stereo files. We could sum them to mono before calculating the RMS energy, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how we perceive them, so it's not a good solution.

The alternative is to calculate two RMS values (one for each channel) and then add them. Unfortunately, a linear addition still doesn't give the same effect as our ears. To demonstrate this, consider a mono (single channel) audio track. We replay it over 1 loudspeaker, and remember how loud it sounds. If we now replay it over 2 loudspeakers, how large should the signal to each speaker be such that, overall, the sound is still as loud as before? You'd think the answer would be half as large (since we have two speakers - that's what a linear addition would suggest) but if you try it, you'll find that the answer is about 3/4.

We get the right answer if we add the means of the channel-signals before calculating the square root. In mixing pan-pot terms, we're using "equal power" rather than "equal voltage". If we also assume that any mono (single channel) signal will always be replayed over two loudspeakers, we can treat a mono signal as a pair of identical stereo signals. Hence a mono signal gives (a+a)/2 (i.e. a), while a stereo signal gives (a+b)/2, where a and b are the mean squared values for each channel. After this, we carry out the square root and conversion to dB.
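
As a rough sketch of the channel combination just described (the function names are hypothetical, and expressing the result in dB relative to a full scale amplitude of 1.0 is an assumption of this example):

```python
import math

def mean_square(block):
    """Mean of the squared sample values in one block."""
    return sum(x * x for x in block) / len(block)

def stereo_block_level_db(left, right=None):
    """Combine per-channel mean squares as (a+b)/2, then convert to dB.

    A mono block (right is None) is treated as two identical channels,
    so it yields (a+a)/2 = a.
    """
    a = mean_square(left)
    b = a if right is None else mean_square(right)
    combined = (a + b) / 2.0
    if combined == 0.0:
        return float("-inf")  # digital silence
    # 10*log10 of a mean square equals 20*log10 of the corresponding RMS.
    return 10.0 * math.log10(combined)
```

Because the channels are squared before they are combined, an out-of-phase stereo pair gives the same level as the equivalent mono signal instead of cancelling to silence.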

Statistical Processing

Figure 5: Histogram of classical music
Figure 6: Histogram of classical music
Figure 7: Histogram of classical music

Where the average energy level of a signal varies with time, the louder moments contribute most to our perception of overall loudness. For example, in human speech, over half the time is silence, but this does not affect the perceived loudness of the talker at all! For this reason, the RMS values are sorted into numerical order, and the value 5% down the list is chosen to represent the overall perceived loudness of the signal.

Having calculated RMS signal levels every 50ms through the file, a single value must be calculated to represent the perceived loudness of the entire file. The above histograms show how many times each RMS value occurred in each file.

The most common RMS value in the speech track was -45dB (background noise) - so the most common RMS value is clearly NOT a good indicator of perceived loudness! The average RMS value is similarly misleading with the speech sample, and also with classical music.

A good method to determine the overall perceived loudness is to sort the RMS energy values into numerical order, and then pick a value near the top of the list.

Choosing one representative value

How far down the sorted list should we look for a representative value? I tried values from 70% to 95%. For highly compressed pop music (e.g. the middle graph above, where there are many values near the top), the choice makes little difference. For speech and classical music, the choice makes a huge difference. The value which most accurately matches human perception of loudness is around 95%, so this value is used by Replay Level.
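
A minimal sketch of this selection step follows. The function name and the exact index rounding are assumptions made for illustration; the text only specifies "around 95%".

```python
def representative_level(rms_values_db, fraction=0.95):
    """Pick the value found `fraction` of the way up the sorted list."""
    if not rms_values_db:
        raise ValueError("need at least one RMS value")
    ordered = sorted(rms_values_db)  # quietest first, loudest last
    # Assumption: round the position down to the nearest list index.
    index = int(fraction * (len(ordered) - 1))
    return ordered[index]
```

For a speech-like list in which 60% of the blocks sit at -45dB (background noise) and 40% at -20dB, this returns -20dB, whereas the most common value would be -45dB.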


Calibration with reference level

A suitable average replay level is 83dB SPL. A calibration relating the energy of a digital signal to the real world replay level has been defined by the SMPTE. Using this calibration, we subtract the current signal from the desired (calibrated) level to give the difference. We store this difference in the audio file.

Finding a standard

Having calculated a representative RMS energy value for the audio file, we now need to reference this to a real world sound pressure level. The audio industry doesn't have any standard for listening level, but the movie industry has worked to an 83dB standard for years.[1]

What the standard actually states is that a single channel pink noise signal, with an RMS energy level of -20 dB relative to a full scale sinusoid should be reproduced at 83 dB SPL (measured using a C-weighted, slow averaging SPL meter). In simple terms, this means that everyone can set their volume control to the same (known, calibrated) gain.

An ideal world...

NOW (are you still with me?) if the mastering engineer set the levels on a CD using that calibrated volume control setting, that CD will sound best at that volume. If all CDs were mastered in such a way, they'd all sound best at that volume. If you (as a listener) didn't want to listen at that particular volume setting, you could always turn it down, but all CDs would still sound equally "turned down" at your preferred setting. You wouldn't have to change the volume setting between discs.

Reality check! We know CDs aren't made like this. There is NO audio standard replay level. So, here's the clever bit - here's the whole point of this website...

Fixing a non-ideal world

We know the level should average around 83dB SPL, and we know a -20dB pink noise signal will give 83dB SPL in a calibrated system. So, we send the pink noise signal through the ReplayGain program, and store the result (let's call it ref_Vrms). For every CD we process, the difference between the calculated value for that CD and ref_Vrms tells you how much you need to scale the signal in order to make it average 83dB.
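
In code, the comparison against the reference is a single subtraction. The sketch below assumes both levels are in dB and were produced by the same analysis chain (filter, block RMS, 95% selection); the function name and the numbers in the usage note are purely illustrative, not measured values.

```python
def replay_gain_adjustment_db(track_level_db, ref_level_db):
    """Gain (in dB) that brings the track's level to the reference level.

    ref_level_db is the result of analysing the -20dB pink noise signal
    (ref_Vrms in the text, expressed in dB); track_level_db is the result
    of analysing the music in exactly the same way.
    """
    return ref_level_db - track_level_db
```

For instance, if the pink noise reference analysed to -25.5dB and a loud pop track analysed to -8.0dB, the adjustment would be -17.5dB, i.e. the track should be turned down by 17.5dB.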

The actual process is quicker to do than to say!

One complication

The system calibration uses a single channel of pink noise (reproduced through a single loudspeaker). You then play music through both loudspeakers. So, though we use 1 channel of pink noise to calibrate the system gain, the ideal level of the music is actually the loudness when both speakers are in use. So, in ReplayGain, we calibrate to 2 channels of pink noise, because that's how loud we'd like the music to sound. In reality, we just have a monophonic pink noise wavefile, and ReplayGain automatically assumes you're playing it through both speakers, as it would any monophonic file.

Storing the Replay Gain

Replay gain data format

The calibration level of 83dB can be added to the difference from the previous calculation, to yield the actual Replay Gain. NOTE: we store the differential, NOT the actual Replay Gain.

What to store

Three values must be stored.

  1. Peak signal amplitude
  2. "Radio" = Replay Gain adjustment required to make all tracks equal loudness
  3. "Audiophile" = Replay Gain adjustment required to give ideal listening loudness

If calculated on a track-by-track basis, ReplayGain yields (2). If calculated on a disc-by-disc basis, ReplayGain will usually yield (3), though this value may be more accurately determined by a human listener if required.

To allow for future expansion: If more than three values are stored, players should ignore those they do not recognise, but process those that they do. If additional Replay Gain adjustments other than "Radio" and "Audiophile" are stored, they should come after "Radio" and "Audiophile". The Peak Amplitude must always occupy the first 4 bytes of the Replay Gain header frame. The three values listed above (or at least fields to hold the three values, should the values themselves be unknown) are required in all Replay Gain headers.

Range

The replay gain adjustment must be between -51.0dB and +51.0dB. Values outside this range must be limited to be within the range, though they are certainly in error, and should probably be re-calculated, or stored as "not set". For example, trying to cause a silent 24-bit file to play at 83dB will yield a replay gain adjustment of +57dB.

In practice, adjustment values from -23dB to +17dB are the likely extremes, and values from -18dB to +2dB are more usual.

Bit format

Each Replay Gain value should be stored in a Replay Gain Adjustment field consisting of two bytes (16 bits). Here are two example Replay Gain Adjustment fields:

Radio gain adjustment

0 0 1 0 1 1 1 0 0 1 1 1 1 1 0 1
\___/ \___/ | \_______________/
  |     |   |         |        
name    |  sign       |        
code    |  bit        |        
        |             |        
   originator         |        
      code            |        
                 Replay Gain   
                  Adjustment   

Audiophile gain adjustment

0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0
\___/ \___/ | \_______________/
  |     |   |         |
name    |  sign       |
code    |  bit        |
        |             |
   originator         |
      code            |
                 Replay Gain
                  Adjustment

In the above example, the Radio Gain Adjustment is -12.5dB, and was calculated automatically. The Audiophile Gain Adjustment is +2.0dB, and was set by the user.

Name code
000 = not set
001 = Radio Gain Adjustment
010 = Audiophile Gain Adjustment
other = reserved for future use

If space has been reserved for the Replay Gain in the file header, but no replay gain calculation has been carried out, then all bits (including the Name code) may be zero.

For each Replay Gain Adjustment field, if the name code = 000 (not set), then players should ignore the rest of that individual field.

For each Replay Gain Adjustment field, if the name code is an unrecognised value (i.e. not 001-Radio or 010-Audiophile), then players should ignore the rest of that individual field.

If no valid Replay Gain Adjustment fields are found (i.e. all name codes are either 000 or unknown), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).

Originator code
000 = Replay Gain unspecified
001 = Replay Gain pre-set by artist/producer/mastering engineer
010 = Replay Gain set by user
011 = Replay Gain determined automatically, as described on this site
other = reserved for future use

For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is 000 (Replay Gain unspecified), then the player should ignore that Replay Gain adjustment field.

For each Replay Gain Adjustment field, if the name code is valid, but the Originator code is unknown, then the player should still use the information within that Replay Gain Adjustment field. This is because, even if we are unsure as to how the adjustment was determined, any valid Replay Gain adjustment is more useful than none at all.

If no valid Replay Gain Adjustment fields are found (i.e. all originator codes are 000), then the player should proceed as if the file contained no Replay Gain Adjustment information (see player requirements).

Sign bit
0 = +
1 = -
Replay Gain Adjustment

The value, multiplied by ten, stripped of its sign (since the + or - is stored in the "sign" bit), is represented in 9 bits. e.g. -3.1dB becomes 31 = 000011111.

Default Value

$00 $00 (0000000000000000) should be used where no Replay Gain has been calculated or set. This value will be interpreted by players in the same manner as a file without a Replay Gain field in the header (see player requirements).

The values of xxxyyy0000000000 (where xxx is any name code, and yyy is any originator code) are all valid, but indicate that the Replay Gain is to be left at 83dB (0dB Replay Gain Adjustment). These are not default values, and should only be used where appropriate (e.g. where the user, producer, or Replay Gain calculation has indicated that the correct Replay Gain is 83dB).

Illegal Values

The values xxxyyy1000000000 are all illegal. You cannot have negative zero! These values may be used to convey other information in the future. They must not be used at present. If encountered, players should treat them in the same manner as $00 $00 (the default value).

The value $xx $ff is not illegal, but it would give a false synch value within an mp3 file. The problems this may cause should be investigated, and a solution (e.g. unsynchronisation) sought. Maybe this is a use for negative zero?
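
The bit layout and the rules for name codes, originator codes, default and illegal values can be summarised in a short sketch. The function and constant names are invented for this example, the 16-bit field is handled as a plain integer (how the two bytes are read from a file is outside the sketch), and returning `None` for a field the player should ignore is a choice made here, not part of the proposal.

```python
NAME_RADIO = 0b001
NAME_AUDIOPHILE = 0b010

def encode_gain_field(name_code, originator_code, gain_db):
    """Pack a Replay Gain Adjustment into a 16-bit integer."""
    gain_db = max(-51.0, min(51.0, gain_db))   # limit to the allowed range
    tenths = int(round(abs(gain_db) * 10))     # value times ten, sign removed
    sign = 1 if (gain_db < 0 and tenths != 0) else 0  # never emit negative zero
    return (name_code << 13) | (originator_code << 10) | (sign << 9) | tenths

def decode_gain_field(field):
    """Return (name_code, originator_code, gain_db), or None to ignore the field."""
    name_code = (field >> 13) & 0b111
    originator_code = (field >> 10) & 0b111
    sign = (field >> 9) & 0b1
    tenths = field & 0x1FF                     # lowest 9 bits
    if name_code not in (NAME_RADIO, NAME_AUDIOPHILE):
        return None                            # not set, or unrecognised name
    if originator_code == 0b000:
        return None                            # Replay Gain unspecified
    if sign == 1 and tenths == 0:
        return None                            # negative zero: treat as default
    gain_db = tenths / 10.0                    # remember to divide by ten
    return name_code, originator_code, (-gain_db if sign else gain_db)
```

Applied to the two example fields above, `decode_gain_field(0b0010111001111101)` gives a Radio adjustment of -12.5dB with originator code 011, and `decode_gain_field(0b0100100000010100)` gives an Audiophile adjustment of +2.0dB with originator code 010. An unknown originator code is deliberately still accepted, as the text requires.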

Peak amplitude data format

Scanning the file for the peak amplitude can be a time-consuming process. Therefore, it's helpful if this single value is stored within the file header. This can be used to check if the required replay gain adjustment will cause the file to clip.

Data Format

The maximum peak amplitude (a single value) should be stored as a 32-bit floating point number, where 1=digital full scale.

Uncompressed Files

Simply store the maximum absolute sample value held in the file (on any channel). The single sample value should be converted to a 32-bit float, such that digital full scale is equivalent to a value of 1.

Compressed files

Compressed audio does not exist as a waveform until it is decoded. Unfortunately, psychoacoustic coding of a heavily limited file can lead to sample values larger than digital full scale upon decoding. However, it is likely that such values will be brought back within range after scaling by the replay level. Even so, it is necessary to store the peak value of a compressed file as a 32-bit floating-point representation, where +/-1 represent digital full scale, and values outside this range would usually clip.

Implementation

For uncompressed files, the maximum values must be found and stored. For compressed files, the files must be decoded using a fully compliant decoder that allows peak overflows (i.e. has headroom), and the maximum value stored.
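
A possible sketch of the peak scan is shown below. The function name is hypothetical, and treating 32768 as digital full scale for 16-bit integer samples is an assumption of this example; for decoded compressed audio held as floats, `full_scale=1.0` leaves values above 1 intact so that overshoots are recorded rather than clipped.

```python
def peak_amplitude(channels, full_scale=32768.0):
    """Largest absolute sample value on any channel, with full scale mapped to 1.0."""
    peak = 0.0
    for channel in channels:
        for sample in channel:
            peak = max(peak, abs(sample))
    return peak / full_scale
```

The result would then be written as a 32-bit float; the byte layout depends on the container format and is not covered here.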

Replay Gain File Format

Three values must be stored.

  1. Peak signal amplitude
  2. "Radio" = Replay Gain adjustment required to make all tracks equal loudness
  3. "Audiophile" = Replay Gain adjustment required to give ideal listening loudness

Each audio file format represents a unique situation. All audio files would benefit from the inclusion of Replay Gain information. In the following list, the links take you to a suggested format for storing the 3 values within the file. Where there is no link, I'm awaiting suggestions!

  • .ape
  • .mp3 - ID3v2, LAME VBR proposed tag specification
  • .mpc
  • .ogg
  • .wav

Player requirements

Figure 8: Possible Replay Gain control panel

In practice, scaling and pre-amp can be carried out in a single step, where each sample is multiplied by a fixed amount. The clipping prevention need only be carried out if, after the first two adjustments, the peak signal amplitude is above digital full scale.

The three steps are appropriate to software players operating on the digital signal in order to scale it. However, it is possible to send the digital signal to the DAC without level correction, and to place an attenuator in the analogue signal path. The attenuator can then be driven by the Replay Gain value. Thus maximum signal to noise ratio is maintained in the digital signal and DAC process.

Scale audio to match Replay Gain

The Player reads the Replay Gain value, and scales the audio data as appropriate.

Reading the Replay Gain

First, the player needs to determine if the user requires "Radio" style level equalisation (all tracks same loudness), or "Audiophile" style level equalisation (all tracks "ideal" loudness). This option should be selectable in the Replay Gain control panel, and should default to "Radio".

Then the player reads the appropriate Replay Gain adjustment value from the file header, and converts it back to its original dB value. See the Replay Gain Data Format for more details. Please remember to divide it by ten!

The player also needs to read (or calculate) the Peak amplitude. This is required for Clipping prevention.

Scaling by the Replay Gain adjustment

Changing the level of an audio signal simply means multiplying each sample value by a constant value. This constant is given by:

scale=10.^(replay_gain/20);

Or, in words: ten to the power of (the replay gain divided by 20).

After any such operation, it's a good idea to dither the result. If this calculation and the pre-amp are implemented separately, then dither should only be added to the final result, just before the result is truncated back to 16 bits, or 24, or 8, as limited by the soundcard - not the file (i.e. after Replay Gain adjustment, an 8-bit file should be sent to a 16-bit soundcard at 16-bits).
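
For readers not using MATLAB, the same scaling can be written in Python roughly as follows. The function names are illustrative, samples are assumed to be floats, and dithering and truncation to the output bit depth are left out of the sketch.

```python
def gain_db_to_scale(gain_db):
    """Convert a gain in dB to a linear multiplier: 10^(gain/20)."""
    return 10.0 ** (gain_db / 20.0)

def scale_samples(samples, gain_db):
    """Multiply every sample by the constant that corresponds to gain_db."""
    factor = gain_db_to_scale(gain_db)
    return [x * factor for x in samples]
```

A Replay Gain adjustment of -20dB therefore multiplies every sample by 0.1, and 0dB leaves the signal unchanged.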

If the Replay Gain information is absent...

Simply disabling Replay Gain control for tracks without Replay Gain information would cause these tracks to be louder than the others, so bringing back the original problem!

If neither ("Radio" or "Audiophile") Gain adjustment is set, or if the track does not contain Replay Gain information, then the player should use an average of the previous 10(?) Replay Gains. This represents the typical loudness of tracks in the user's music collection, and is a much better estimate of the likely Replay Gain than 0dB, or no adjustment at all.

If the file only contains one of the Replay Gain adjustments (e.g. Audiophile) but the user has requested the other (Radio), then the player should use the one that is available (in this case, Audiophile).
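
One way a player might implement these fallback rules is sketched below. The class name, the use of `None` for a missing adjustment, and the 0dB result when no history exists yet are all assumptions of this example; the history length of 10 follows the tentative figure in the text.

```python
from collections import deque

class GainSelector:
    """Chooses the gain for each track and remembers recent real values."""

    def __init__(self, history_size=10):
        self.recent = deque(maxlen=history_size)

    def select(self, radio_db, audiophile_db, prefer="radio"):
        """Return the gain in dB to apply; None means 'not stored in the file'."""
        if prefer == "radio":
            preferred, other = radio_db, audiophile_db
        else:
            preferred, other = audiophile_db, radio_db
        # Use the requested adjustment, else whichever one is available.
        gain = preferred if preferred is not None else other
        if gain is not None:
            self.recent.append(gain)
            return gain
        # No Replay Gain information: average the most recent stored gains.
        if self.recent:
            return sum(self.recent) / len(self.recent)
        return 0.0  # assumption: nothing to average yet, so no adjustment
```

Only gains actually read from files enter the history, so a run of untagged tracks keeps reusing the same estimate.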

Pre-amp

Most users who only play pop music will find that the level has been reduced too far for them. An optional boost of 6dB-12dB should be included by default, otherwise users will think the player sucks! Knowledgeable users, or those playing classical music, will disable this. Some may even choose to decrease the level. For user friendliness, this part should be referred to as the "pre-amp".

Whilst the SMPTE calibration level we're using suggests that the average level of an audio track should be 20dB below full scale (to leave room for peaks - where the emotion of the music lives), some pop music is dynamically compressed to peak at 0dB and average around -3dB. This means that, when the Replay Gain is correctly set, the level of such tracks will be reduced by 17dB! If users are listening to a mixture of highly compressed and not compressed tracks, then Replay Gain will make the listening experience more pleasurable, by bringing the level of the compressed tracks down into line with that of the others. However, if users are only listening to highly compressed music, then they are likely to complain that all their files are now too quiet.

To solve this problem, a Pre-amp should be incorporated into the player. This is basically just an adjustment to the scale factor we calculated in the previous section. It should default to a +6dB boost (though some manufacturers may choose +9, +12 or +15dB). This means that casual users will find little change to the loudness of their compressed pop music (except that the occasional "problem" quiet track will now be as loud as the rest), while power users and audiophiles can reduce the Pre-amp gain to enjoy all their music.

If the Pre-amp gain is left high for classical music (or nicely produced pop music), this means that the peaks will be compressed (see Avoiding Clipping). However, this is exactly what radio stations do all the time, and many listeners like this sound.

Implementation

If enabled, simply read the user selected pre-amp gain, and scale the audio signal by the appropriate amount. For example, a +6dB gain requires a scale of 10.^(6/20), which is approximately 2. The Replay Gain and Pre-amp scale factors can be multiplied together for simplicity and ease of processing.
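
Since multiplying two scale factors is the same as adding the two gains in dB, the combined factor can be computed in one step. A small sketch, with a hypothetical function name and the +6dB default taken from the text:

```python
def combined_scale(replay_gain_db, preamp_db=6.0):
    """Single multiplier covering both the Replay Gain and the pre-amp."""
    # 10^(a/20) * 10^(b/20) == 10^((a+b)/20)
    return 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
```

With a Replay Gain of -17dB and the default pre-amp, the samples are multiplied by the factor for a net -11dB.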

Clipping Prevention

The player should, by default, apply hard limiting (NOT CLIPPING) to any signal peaks that would go over full scale after the above two operations. This should be user defeatable, so that audiophile users can choose to decrease the overall level to avoid clipping, rather than limiting the signal.

Why might the signal clip?

There are 3 reasons:

  1. In coded audio (e.g. mp3 files) a file that was hard-limited to digital full scale before encoding will often be pushed over the limit by the psychoacoustic compression. A decoder with headroom can recover the over full scale signal by reducing the gain. MAD does this. Typical decoders just clip.
  2. Replay Gain will make loud dynamically compressed tracks quieter, and quiet dynamically uncompressed tracks louder. The average levels will then be similar, but the quiet tracks will actually have louder peaks. If the user pushes the pre-amp gain to maximum (which would take highly compressed pop music back to its original level), then the peaks of the (originally) quieter tracks will be pushed well over full scale.
  3. If a track has a very wide dynamic range, then even without turning up the pre-amp, the replay gain itself may instruct the player to turn the track up such that it would clip, simply because the average energy is so low, but the peak amplitude is very high. If anyone does find a recording which causes this with the pre-amp gain set at 0, please let me know!

What can we do about it?

The simple option is to let it clip! However, this isn't a good idea, as it'll sound awful. There are two solutions:

In situation 2 above, the user clearly wants all the music to sound very loud. To give them their wish, any signal which would peak above digital full scale should be hard limited at just below digital full scale. This is also useful at lower pre-amp gains, where it allows the average level of classical music to be raised to that of pop music, without distorting. This could be useful for making tapes for the car. The exact type of limiting/compression is up to the player, but something like the Hard Limiter found in Cool Edit Pro (Syntrillium) would be appropriate (for pop music at least).

The audiophile user will not want any compression or limiting on the signal. In this case the only option is to reduce the pre-amp gain (so that the scaling of the digital signal is lower than that suggested by the replay level). In order to maintain the consistency of level between tracks, the pre-amp gain should remain at this reduced level for subsequent tracks.

Implementation

If the Peak Level is stored in the header of the file, it is trivial to calculate if (following the Replay Gain adjustment and Pre-amp gain) the signal will clip at some point. If it won't, then no further action is necessary. If it will, then either the hard limiter should be enabled, or the pre-amp gain should be reduced accordingly before playing the track.
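
The check, and the second option of reducing the pre-amp gain, might look like the sketch below. The function names are illustrative, the peak is the stored value with 1.0 as digital full scale, and the hard limiter option is not shown.

```python
import math

def will_clip(peak, replay_gain_db, preamp_db):
    """True if the scaled peak would exceed digital full scale (1.0)."""
    scale = 10.0 ** ((replay_gain_db + preamp_db) / 20.0)
    return peak * scale > 1.0

def preamp_without_clipping(peak, replay_gain_db, preamp_db):
    """Return the pre-amp gain in dB, lowered just enough to avoid clipping."""
    if peak <= 0.0 or not will_clip(peak, replay_gain_db, preamp_db):
        return preamp_db                       # no further action necessary
    headroom_db = -20.0 * math.log10(peak)     # largest total gain that fits
    return headroom_db - replay_gain_db
```

A player following the audiophile approach would keep the reduced pre-amp value for subsequent tracks, as described above, so that the level stays consistent between tracks.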

Notes

  1. This number (83dB SPL) wasn't picked at random. It represents a comfortable average listening level, determined by professionals from years of listening. That reference level of -20dB pink noise isn't random either. It causes the calibrated average level to be 20dB less than the peak level. In other words, it leaves 20dB of headroom for louder than average signals. So, if CDs were mastered this way, the average level would be around -20dB FS, leaving lots of room for the dramatic peaks which make music exciting.