Introduction
In some fields of scientific investigation and product
R&D, it is important to measure the sensitivity of human perception
to a change in stimulus. ABX methods are a powerful and
scientific way of carrying this out. We believe these methods
have many advantages over the commonly used subjective scoring studies
because:

Panel scoring tests are often carried out
without applying a strictly correct statistical design or analysis

Judgement of the size of differences is
subjective and relative

The practical relevance of classical confidence
intervals, significance levels and the power of tests is fraught with
confusion, especially for non-statisticians
In ABX tests, by contrast:
 No decision as to the relative quality of the stimulus is
made consciously
 The statistical analysis is far more straightforward
(though see below)
 The measure of detection is on a fixed objective scale
 Different results can be combined or compared, within
sensible limits
 Subjects are arguably focusing more of their cognitive
abilities on detection
Most of our work has been in the audio field testing a variety
of electronic and electrical components and digital processing
algorithms. The aim here is to characterise their audibility by the
statistical distribution of h, the parameter
that specifies 'detection probability'. We will therefore use listening
tests for examples, but h is a scientific and
generalised measure that can be fairly compared between completely
different types of change in the stimulus, and the methods can be used
wherever a stimulus can be changed quickly by pressing a button.
In the limit of an infinite number of ABX runs, ‘h=0’
means that the difference cannot be detected and ‘h=1’
means that it will always be detected, with values in between
indicating how obvious the difference is. As the correct analysis
of ABX data is based on the Binomial sampling distribution, a point
estimate for h will not tell the whole story
as sampling error will produce uncertainty about the actual
value. Some may be surprised by the amount of uncertainty
involved when relatively small ABX samples are used. See the examples below.
Many users of ABX methods are unfortunately not aware of
the correct statistical analysis of ABX data that takes into account
the inevitable presence of random guessing. This makes
a large difference to the distribution of h
when sample sizes are modest and detection is difficult, often the
most interesting cases. See equations (3, 4) and references [1, 2] for more detail.
ABX methods allow the building of a
scientific database of human perception. One obvious example is in
audio coding technologies where we can e.g. calibrate different
psychoacoustic compression methods according to their h
results. See figure 1 below for an example showing point
estimates and 75% HDRs for h with three codecs
and five different amounts of compression.
Using h as a metric allows you to:

Compare different types of artefact in terms
of their detectability (e.g. how much THD is equally audible to how
much compression)

Build a database of the perceived importance
of different measured problems and create a coherent structure for
correlating measured and subjective changes. A multidimensional
psychoacoustic model that identifies interactions between artefacts
then becomes feasible

Judge the validity of a given change in
detection produced by a technical change and assess accurately the
risks of various decisions

Calibrate listeners/transducers/programme
material/etc. for their sensitivity
By extending the statistical theory, we can also reason about
the distribution of ∂h (−1 ≤ ∂h
≤ 1). This is the 'change in detection probability' between two
different sets of results.
The ABX test method
The ABX test allows a truly double-blind comparison of two
signals. An audio test will be described
here but the stimulus could just as easily be e.g. visual.
The listener has a box with three buttons: A, B and X. For each
programme item tested, the two different signals are guaranteed to
appear in the A and B channels, whilst the X channel is randomly
assigned to be identical to either the A or B signal for the length of
each item. The listener can switch between A, B and X as often as
s/he likes and is asked to decide (by the end of the item) whether the
X channel contains the A or B signal. This decision is then
automatically classified as correct or incorrect depending upon whether
the X channel is correctly identified. A succession of such
listening runs constitutes the data that are analysed. The double-blind
nature of the test ensures that the results are truly independent
and unbiased, within the frame of reference of the test conditions.
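The run-scoring logic above can be sketched as a small simulation. This is illustrative code of our own (the function name and model are assumptions, not taken from the ABX literature): on each run the listener genuinely hears the difference with probability h, and otherwise guesses.

```python
import random

def simulate_abx_runs(h, n_runs, seed=0):
    """Simulate n_runs ABX runs for a listener whose probability of
    actually hearing the difference is h.  A heard difference is
    always identified correctly; otherwise the listener guesses and
    is right half the time.  Returns the number of correct answers."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_runs):
        if rng.random() < h:            # difference genuinely detected
            correct += 1
        elif rng.random() < 0.5:        # pure guess, right half the time
            correct += 1
    return correct

# With h = 0.3 over 60 runs we expect about (0.3 + 0.7/2) * 60 = 39
# correct identifications on average.
print(simulate_abx_runs(0.3, 60))
```

Note that even a listener with h = 0 scores about 50% correct, which is why the raw proportion of correct answers overstates detectability.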
The analysis of audibility from the ABX test is widely applicable and
will work with any comparison of two signals. Psychoacousticians
(and psychologists in general) who aim at quantifying the human
perception of differences between similar signals should find this
approach valuable as a scientific benchmark to compare any differences
that may be of interest. The Bayesian approach also makes it easy
to combine information from different sources.
In psychoacoustics, the analysis can be used to build a database of
detection for signal perturbations of different types. This
should be of particular interest to researchers and manufacturers who
seek to understand, measure and compare the audibility of digital
signal processing and data-compression methods where so many tuning
parameters and trade-offs are possible. See figure 1 above.
Statistical basis of the analysis
The following is a brief description of some of the
statistical aspects of analysing ABX data. The reader is referred
to [1, 2] for a more thorough discussion of the underlying probability
theory. For ease of reference, the terminology used is that
defined in [1].
A sequence of independent trials, each with an outcome of either ‘success’ or
‘failure’, is sometimes called a sequence of Bernoulli trials. Assuming that
there is a fixed probability of success, p,
and there are T trials with C
successes, then C is distributed as a Binomial
variate thus:

C ~ B( T, p )    (1)

so that, for example, the probability of C
correct answers given T trials and a success
probability of p is:

P( C | T, p ) = [ T! / ( C! (T − C)! ) ] p^C (1 − p)^(T − C)    (2)
In the ABX case, p is the probability of correctly identifying the X
channel.
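The Binomial probability of a given number of correct answers can be evaluated directly; a minimal Python sketch (our own, using only the standard library):

```python
from math import comb

def binom_pmf(C, T, p):
    """P(C | T, p): probability of exactly C correct answers in T
    independent trials, each correct with probability p."""
    return comb(T, C) * p**C * (1 - p)**(T - C)

# Probability of exactly 30 correct out of 60 under pure guessing
# (p = 0.5) -- about 0.103:
print(binom_pmf(30, 60, 0.5))
```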
An important point about p is made by
Srednicki in [1]; because of the ‘multiple choice’ nature of the ABX
box, guesses or decisions that are made randomly with no knowledge of
the difference in signals will make p converge
on the value of 0.5, i.e. 50% correct answers. The probability
that we are interested in is h, the
probability of actually hearing the difference. The
probability that any one answer will be correct, p,
is therefore h plus one half of the fraction
of guesses:

p = h + (1 − h) / 2    (3)

which rearranges to:

h = 2p − 1    (4)

Eq(4) allows us to transform any inference about p into inference about h.
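This transformation and its inverse are one-liners; a small sketch (the function names are ours):

```python
def p_from_h(h):
    """Eq(4)-style relation: overall probability of a correct answer
    when the difference is heard with probability h and the other
    (1 - h) of answers are coin-flip guesses."""
    return h + (1 - h) / 2

def h_from_p(p):
    """Inverse relation: recover the detection probability h from p."""
    return 2 * p - 1

print(p_from_h(0.0))   # pure guessing gives p = 0.5
print(h_from_p(0.75))  # 75% correct answers implies h = 0.5
```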
At this point, Classical and Bayesian statisticians might part company
and we will follow a Bayesian approach. This is due to the
availability of real prior information about the problem which will
produce more precise results, as well as some other more philosophical
arguments about the 'meaning' of probability. With larger sample
sizes and/or large h the results are
effectively identical for both approaches. See [2] for a
description of the Classical approach.
From the above argument, we know that p will
lie between 0.5 and 1, but within this range we have no prior reason to
believe that any value of p is more likely than any other. We
will describe this prior information as the prior distribution for p: a uniform distribution with constant probability
density between 0.5 and 1, and zero density outside that range.
By Eq(4) it can be seen that this is equivalent to giving h an uninformative constant prior distribution over
the range of interest, 0 to 1.
Once we have some data we can generate the posterior
distribution of p via Bayes’
Theorem:

P( p | T, C ) ∝ P( C | T, p ) P( p )

This combines our prior knowledge with the data from the
experiment.
The more data there is, the more information is contained in
P( p | T, C ) and the more precise
the estimation of p. We now have
everything that we need to reason about p and,
by the use of Eq(4), h. The quantitative
results produced by the analysis are given below.

P( h | T, C ): the posterior distribution
for h, from which all the inference is taken

E( h ): the expected
value or mean of P( h | T, C ). This is the most widely
accepted single number estimate of h; it is
the ‘centre of gravity’ of the posterior distribution

Ranges over which we can predict h to lie (Highest Density Regions or HDRs) at chosen
levels of certainty
As a guide to the probability statements which are attached to
the HDRs, the odds are 3:1 (or equivalently there is a 75% probability)
that h lies within the 75% HDR.
Intuitively, many are satisfied with this level of certainty, but when
safety-critical decisions are made, the interested parties usually
require much higher certainty (>99.9%). There are no fixed
rules to using these summary statistics: one accepts a higher risk of
being incorrect to gain a more precise limit on the parameter.
From a purely personal point of view, we usually use the 75% HDR
for the initial assessment of a parameter.
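As a concrete sketch of how these quantities can be computed (our own grid-based code, not the method of [1], though it implements the same model), the posterior for h is evaluated on a grid using the uniform prior on p over [0.5, 1], and the HDR is found by accumulating grid points in order of decreasing density, which is valid for unimodal posteriors:

```python
import numpy as np

def h_posterior(C, T, n_grid=20001):
    """Posterior density of h on a grid: uniform prior on p over
    [0.5, 1] plus a Binomial likelihood with C successes in T trials,
    with p = (1 + h) / 2 as in Eq(4)."""
    h = np.linspace(0.0, 1.0, n_grid)
    p = (1.0 + h) / 2.0
    post = p**C * (1.0 - p)**(T - C)      # likelihood x flat prior
    post /= post.sum() * (h[1] - h[0])    # normalise to a density in h
    return h, post

def summarise(C, T, mass=0.75):
    """Posterior mean E(h) and an approximate HDR holding `mass`."""
    h, post = h_posterior(C, T)
    dh = h[1] - h[0]
    mean = (h * post).sum() * dh
    order = np.argsort(post)[::-1]        # grid points, densest first
    cum = np.cumsum(post[order]) * dh
    inside = order[: np.searchsorted(cum, mass) + 1]
    return mean, (h[inside].min(), h[inside].max())

# Result B from the examples below: 38 correct out of 60
mean, (lo, hi) = summarise(38, 60)
print(round(mean, 3), round(lo, 2), round(hi, 2))
```

For 38/60 this gives E(h) ≈ 0.264 with a 75% HDR of roughly (0.13, 0.39), in line with table 1 below.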
Examples
Figure 2 contains the posterior
distribution for h from three different 60 run
listening tests with differing numbers of correct identifications.
Summary statistics from these results are given in table
1. For example, from result B (38/60) our best single number
estimate for h is 0.264 and we know that h lies between 0.13 and 0.395 with a probability of
75%, or alternatively at odds of 3:1. If we want to be more
certain, e.g. 95% or odds of 19:1, then we need to accept that the range
over which it lies is wider: from 0.04 to 0.475.
Result    E( h )   75% HDR          95% HDR
A 30/60   0.099    ( 0.00, 0.14 )   ( 0.00, 0.24 )
B 38/60   0.264    ( 0.13, 0.39 )   ( 0.04, 0.48 )
C 46/60   0.516    ( 0.40, 0.65 )   ( 0.30, 0.72 )

Table 1: Summary statistics from ABX example
Imagine you have made a change to a device being tested and
now get 38/60 instead of the original 30/60 (as in results A and B from
figure 2). The distribution of ∂h from
this change is given in figure 3 below.
Summary statistics for this change are given in table 2.
Result           E( ∂h )   P( ∂h ≥ 0 )   P( ∂h < 0 )
38/60 vs 30/60   0.165     91.3%         8.7%

Table 2: Summary statistics from ABX difference example
This shows an expected increase in h
of 0.165. It also gives the important information that h has increased with a probability of 91%, or
alternatively with odds of about 10:1.
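The distribution of ∂h can be computed by treating the two test results as independent and combining their posteriors; a sketch in Python (our own code and naming, making that independence assumption explicit):

```python
import numpy as np

def h_posterior(C, T, n_grid=1001):
    """Posterior probabilities for h on a grid (uniform prior on p
    over [0.5, 1], Binomial likelihood, p = (1 + h) / 2)."""
    h = np.linspace(0.0, 1.0, n_grid)
    p = (1.0 + h) / 2.0
    post = p**C * (1.0 - p)**(T - C)
    return h, post / post.sum()            # normalised to probabilities

def dh_summary(C2, T2, C1, T1):
    """E(dh) and P(dh >= 0) for dh = h2 - h1, assuming the two sets
    of results are independent."""
    h1, w1 = h_posterior(C1, T1)
    h2, w2 = h_posterior(C2, T2)
    e_dh = (h2 * w2).sum() - (h1 * w1).sum()
    joint = np.outer(w2, w1)               # joint[i, j] = P(h2_i) P(h1_j)
    p_ge = joint[h2[:, None] >= h1[None, :]].sum()
    return e_dh, p_ge

# 38/60 (after the change) versus 30/60 (before)
e_dh, p_ge = dh_summary(38, 60, 30, 60)
print(round(e_dh, 3), round(p_ge, 3))
```

This should land close to the E(∂h) = 0.165 and P(∂h ≥ 0) ≈ 91% of table 2; small numerical differences from the exact analysis are possible.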
References
[1] Srednicki (1988), "A Bayesian Analysis of A-B Listening
Tests", J. AES, Vol. 36, No. 3, pp. 143-145
[2] Burstein (1989), "Transformed Binomial Confidence Limits
for Listening Tests", J. AES, Vol. 37, No. 5, pp. 363-367
