ABX is a listening-test method for determining whether two WAV files are audibly different from each other. It is most useful for potential differences near the threshold of audibility. A key feature of the method is that the tests are performed "blind", that is, without the listener knowing which file is under test. Another key feature is that the influence of chance on the results can be reduced by performing multiple tests (trials).

The tester assigns one file to the "A" button and the other file to the "B" button, and the ABX program then randomly assigns either the first or the second file to the "X" button. The listener can listen to A, B, or X in any order, as many times as they wish, and then decides whether X is the same as A or the same as B. That's one trial.
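The trial procedure described above can be sketched in a few lines of Python. This is only an illustrative model, not the code of any of the programs listed on this page: the function names, the `listener` callback, and the use of plain file names to stand in for audio are all assumptions made for the example.

```python
import random

def run_trial(file_a, file_b, listener, rng=random):
    """Run one ABX trial; return True if the listener identified X correctly."""
    # The program, not the tester, decides which file hides behind X.
    x_is_a = rng.random() < 0.5
    file_x = file_a if x_is_a else file_b
    # The listener may audition A, B and X freely, then answers "A" or "B".
    answer = listener(file_a, file_b, file_x)
    return answer == ("A" if x_is_a else "B")

def perfect_listener(a, b, x):
    """Hypothetical listener who always hears the difference."""
    return "A" if x == a else "B"

def guessing_listener(a, b, x):
    """Hypothetical listener who hears no difference and guesses."""
    return random.choice(["A", "B"])
```

A listener who genuinely hears the difference is right on every trial, while a guessing listener is right about half the time, which is why a single trial proves little.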

An ABX session consists of multiple trials to reduce the probability that a particular result is due to chance rather than to the listener actually hearing a difference. For example, if a listener correctly identifies X in a single trial, the probability of that occurring by chance is 50%. However, if the listener correctly identifies X in five out of five trials, the probability of that occurring by chance is 0.5⁵ ≈ 0.03 (about 3%).
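The arithmetic above generalizes to any number of trials. As a small sketch (the function name is chosen here for illustration), the chance of guessing every trial correctly halves with each additional trial:

```python
def chance_all_correct(trials):
    """Probability of guessing X correctly in every one of `trials` trials."""
    return 0.5 ** trials

# Show how quickly a perfect score by pure guessing becomes unlikely.
for n in (1, 5, 10, 16):
    print(n, chance_all_correct(n))
```

For five trials this gives 0.03125, the roughly 3% figure quoted above.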

ff123 has a page (see References) which calculates, by simulation, the type I error, or the probability of concluding that a perceptible difference exists when one does not.

A difference is concluded to be heard when at least 13 correct identifications out of 16 trials are achieved, using a critical type I error of about 0.01.
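To illustrate what estimating the type I error by simulation can look like, here is a minimal Monte Carlo sketch. It is not ff123's actual procedure, which is not reproduced here; the function name, parameters, and defaults are assumptions for the example. It simulates many sessions by a listener who only guesses and counts how often the 13-of-16 criterion is met anyway.

```python
import random

def simulate_type_i_error(trials=16, needed=13, sessions=100_000, seed=0):
    """Estimate how often a pure guesser reaches `needed` correct answers."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    passed = 0
    for _ in range(sessions):
        # Each trial is a coin flip for a listener who hears no difference.
        correct = sum(rng.random() < 0.5 for _ in range(trials))
        if correct >= needed:
            passed += 1
    return passed / sessions

if __name__ == "__main__":
    print(simulate_type_i_error())
```

With enough simulated sessions, the estimate should settle close to 0.01, in line with the criterion stated above.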

In Sensory Evaluation Techniques, 3rd ed. [1], by Meilgaard, Civille, and Carr, this type of test is referred to as a "duo-trio" test, but the procedure is the inverse of ABX, and multiple subjects are used instead of multiple trials.

"Present to each subject an identified reference sample, followed by two coded samples, one of which matches the reference sample. Ask subjects to indicate which coded sample matches the reference. Count the number of correct replies and refer to table t10 for interpretation."

Meilgaard et al. go on to state that:

"As a general rule, the minimum is 16 subjects, but for less than 28, the beta error is high. Discrimination is much improved if 32, 40, or a larger number can be employed."

Note: The beta error is the type II error, or the probability of concluding that no perceptible difference exists when one does.
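The type II error can be made concrete with a short calculation. The sketch below assumes, purely for illustration, a listener who identifies X correctly with a fixed probability `p_correct` on every trial; that model and the function name are not taken from Meilgaard et al.

```python
from math import comb

def type_ii_error(trials, needed, p_correct):
    """Probability that a listener with hit rate `p_correct` scores
    fewer than `needed` correct, so a real difference goes undetected."""
    return sum(
        comb(trials, k) * p_correct**k * (1 - p_correct) ** (trials - k)
        for k in range(needed)
    )

# Compare a listener who barely hears the difference with one who hears it well.
for p in (0.6, 0.7, 0.8, 0.9):
    print(p, round(type_ii_error(16, 13, p), 3))
```

Under this model the beta error falls as the listener's hit rate rises, and it also falls as more trials (or subjects) are used, which is the point of the recommendation quoted above.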

Since ABX tests typically involve only a single listener performing multiple trials, the number of trials should be kept reasonably small to prevent listener fatigue from affecting the results. Sixteen trials is generally agreed to provide a good balance, and it corresponds to Meilgaard's recommended minimum of 16 subjects in a duo-trio test, at least as far as minimizing type I errors is concerned. Typically, a critical type I error of about 0.01 is used, which means that a difference is concluded to be heard when at least 13 correct identifications out of 16 are achieved (guessing alone reaches that score with a probability of about 0.0106).
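The 13-of-16 criterion can be checked exactly with the binomial distribution. The following sketch (function names are chosen here for illustration) computes the probability of reaching a given score by guessing, and the smallest score that stays at or below a chosen critical level:

```python
from math import comb

def p_value(correct, trials):
    """Chance of at least `correct` right answers out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def min_correct(trials, alpha):
    """Smallest score whose chance probability does not exceed `alpha`,
    or None if even a perfect score is not rare enough."""
    for correct in range(trials + 1):
        if p_value(correct, trials) <= alpha:
            return correct
    return None

print(p_value(13, 16))        # 697/65536, about 0.0106
print(min_correct(16, 0.05))  # 12
```

Note that 13 of 16 corresponds to about 0.0106, marginally above 0.01, so it meets the 0.01 level only approximately; applying 0.01 as a strict cutoff would require 14 of 16.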

Programs for Performing Blind Listening Tests

Lacinato ABX/Shootouter for Windows/Mac/Linux
ABX Comparator (component for foobar2000)
ABX-comparator for Linux/GNOME (no longer maintained)
python-abx (abx-comparator clone)
~~WinABX/WinABA~~
~~Blind Audio Comparison and Rating~~
~~PC ABX and ABX info~~

Notes

1. Sensory Evaluation Techniques, pp. 69-71.

References

Meilgaard, Morten, Gail Vance Civille, and B. Thomas Carr. Sensory Evaluation Techniques. CRC. June 1999. Edition: 3rd. ISBN: 9780849302763
ff123. ABX, Probability of Experiment Being the Same as Random Guesses. Discussion Of Audio Compression. July 23rd, 2002. (Accessed June 5, 2008)