Measuring human perception with ABX methods




Introduction

In some fields of scientific investigation and product R&D, it is important to measure the sensitivity of human perception to a change in stimulus.  ABX methods are a powerful and scientific way of carrying this out.  We believe these methods have many advantages over commonly used subjective scoring studies because:

  • Panel scoring tests are often carried out without applying a strictly correct statistical design or analysis

  • Judgement of the size of differences is subjective and relative

  • The practical relevance of classical confidence intervals, significance levels and the power of tests is fraught with confusion, especially for non-statisticians

In ABX tests, by contrast:

  • No decision as to the relative quality of the stimulus is made consciously
  • The statistical analysis is far more straightforward (though see below)
  • The measure of detection is on a fixed objective scale
  • Different results can be combined or compared, within sensible limits
  • Subjects are arguably focusing more of their cognitive abilities on detection

Most of our work has been in the audio field, testing a variety of electronic and electrical components and digital processing algorithms.  The aim is to characterise their audibility by the statistical distribution of h, the parameter that specifies 'detection probability'.  We will therefore use listening tests for examples, but h is a scientific, generalised measure that can be fairly compared between completely different types of change in the stimulus, and the methods can be used wherever a stimulus can be changed quickly by pressing a button.

In the limit of an infinite number of ABX runs, ‘h=0’ means that the difference cannot be detected and ‘h=1’ means that it will always be detected, with values in between indicating how obvious the difference is.  Because the correct analysis of ABX data is based on the Binomial sampling distribution, a point estimate for h will not tell the whole story: sampling error produces uncertainty about the actual value.  Some may be surprised by the amount of uncertainty involved when relatively small ABX samples are used.  See the examples below.

Many users of ABX methods are unfortunately not aware of the correct statistical analysis of ABX data, which must take into account the inevitable presence of random guessing.  This makes a large difference to the distribution of h when sample sizes are modest and detection is difficult - often the most interesting cases.  See equations (3, 4) and references [1, 2] for more detail.

ABX methods allow the building of a scientific database of human perception. One obvious example is in audio coding technologies where we can e.g. calibrate different psycho-acoustic compression methods according to their h results.  See figure 1 below for an example showing point estimates and 75% HDRs for h with three codecs and five different amounts of compression.

[Figure 1: point estimates and 75% HDRs for h for three codecs at five different amounts of compression]

Using h as a metric allows you to:

  • Compare different types of artefact in terms of their detectability (e.g. what level of THD is as audible as a given amount of compression)

  • Build a database of the perceived importance of different measured problems and create a coherent structure for correlating measured and subjective changes.  A multi-dimensional psychoacoustic model that identifies interactions between artefacts then becomes feasible

  • Judge the validity of a given change in detection produced by a technical change and assess accurately the risks of various decisions

  • Calibrate listeners/transducers/programme material/etc. for their sensitivity

By extending the statistical theory, we can also reason about the distribution of ∂h (-1 ≤ ∂h ≤ 1).  This is the 'change in detection probability' between two different sets of results.

The ABX test method

The ABX test allows a truly double-blind comparison of two signals.  An audio test will be described here, but the stimulus could just as easily be e.g. visual.

The listener has a box with three buttons: A, B and X.  For each programme item tested, the two different signals are guaranteed to appear in the A and B channels, whilst the X channel is randomly assigned to be identical to either the A or B signal for the length of each item.  The listener can switch between A, B and X as often as s/he likes and is asked to decide (by the end of the item) whether the X channel contains the A or B signal.  This decision is then automatically classified as correct or incorrect depending upon whether the X channel is correctly identified.  A succession of such listening runs constitutes the data that are analysed.  The double-blind nature of the test ensures that the results are truly independent and unbiased, within the frame of reference of the test conditions.
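
For illustration, the scoring logic of a single run is only a few lines.  A minimal Python sketch, where ask_listener is a hypothetical callback standing in for the listener's whole audition of A, B and X:

    import random

    def run_abx_trial(ask_listener):
        """One ABX run: X is secretly assigned to A or B, the listener
        compares A, B and X as often as s/he likes, then answers 'A' or 'B'.
        The run is scored correct if the hidden assignment was identified."""
        x_assignment = random.choice(["A", "B"])   # hidden from the listener
        answer = ask_listener()                    # returns 'A' or 'B'
        return answer == x_assignment

    # A test of T runs yields C correct answers:
    # C = sum(run_abx_trial(ask_listener) for _ in range(T))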

The analysis of audibility from the ABX test is widely applicable and will work with any comparison of two signals.  Psychoacousticians (and psychologists in general) who aim at quantifying the human perception of differences between similar signals should find this approach valuable as a scientific benchmark to compare any differences that may be of interest.  The Bayesian approach also makes it easy to combine information from different sources.

In psychoacoustics, the analysis can be used to build a database of detection for signal perturbations of different types.  This should be of particular interest to researchers and manufacturers who seek to understand, measure and compare the audibility of digital signal processing and data-compression methods where so many tuning parameters and trade-offs are possible.  See figure 1 above.

Statistical basis of the analysis

The following is a brief description of some of the statistical aspects of analysing ABX data.  The reader is referred to [1, 2] for a more thorough discussion of the underlying probability theory.  For ease of reference, the terminology used is that defined in [1].

Independent trials, each with an outcome of either ‘success’ or ‘failure’, are sometimes called Bernoulli trials.  If there is a fixed probability of success, p, then the number of successes C in T trials is distributed as a Binomial variate thus:

    $$C \sim B(T,\, p) \tag{1}$$

so that, for example, the probability of C correct answers given T trials and a success probability of p is:

    $$P(C \mid T, p) = \binom{T}{C}\, p^{C} (1-p)^{T-C} \tag{2}$$

In the ABX case, p is the probability of correctly identifying the X channel.
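
For concreteness, Eq(2) can be evaluated directly.  A minimal Python sketch (the numbers are illustrative):

    from math import comb

    def binom_pmf(C, T, p):
        """Eq(2): probability of exactly C correct answers in T trials
        when each trial succeeds with probability p."""
        return comb(T, C) * p**C * (1 - p)**(T - C)

    # e.g. the chance of exactly 38/60 correct answers if p were 0.6:
    print(binom_pmf(38, 60, 0.6))   # roughly 0.09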

An important point about p is made by Srednicki in [1]: because of the ‘multiple choice’ nature of the ABX box, guesses - decisions made randomly with no knowledge of the difference in signals - will push p towards the value of 0.5, i.e. 50% correct answers.  The probability that we are interested in is h, the probability of actually hearing the difference.  The probability p that any one answer is correct is therefore h plus one half of the guessing fraction (1 - h):

    $$p = h + \tfrac{1}{2}(1-h) = \tfrac{1+h}{2} \tag{3}$$

    $$h = 2p - 1 \tag{4}$$

Eq(4) allows us to transform any inference about p into inference about h.
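
In code, Eq(3) and Eq(4) are one line each (a continuation of the sketch above):

    def p_from_h(h):
        """Eq(3): a listener who hears the difference (probability h) is
        always right; the remaining fraction (1 - h) guesses at 50%."""
        return h + 0.5 * (1 - h)     # equivalently (1 + h) / 2

    def h_from_p(p):
        """Eq(4): invert Eq(3) to recover the detection probability."""
        return 2 * p - 1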

At this point, Classical and Bayesian statisticians might part company; we will follow the Bayesian approach.  This is due to the availability of real prior information about the problem, which produces more precise results, as well as to more philosophical arguments about the 'meaning' of probability.  With larger sample sizes and/or large h, the results of the two approaches are effectively identical.  See [2] for a description of the Classical approach.

From the above argument, we know that p will lie between 0.5 and 1, but within this range we have no prior reason to believe that any value of p is more likely than any other.  We will describe this prior information as the prior distribution for p: a uniform distribution with density c between 0.5 and 1 and density 0 outside that range (normalisation over [0.5, 1] gives c = 2).

    $$P(p) = \begin{cases} c, & 0.5 \le p \le 1 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

By Eq(4) it can be seen that this is equivalent to giving h an uninformative constant prior distribution over the range of interest, 0 to 1. 
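
A convenient way to handle these distributions numerically is on a discrete grid of p values.  A minimal sketch of the prior of Eq(5), assuming NumPy is available:

    import numpy as np

    p_grid = np.linspace(0.5, 1.0, 2001)   # support of the prior, Eq(5)
    dp = p_grid[1] - p_grid[0]
    prior = np.full_like(p_grid, 2.0)      # uniform density c = 2 on [0.5, 1]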

Once we have some data, we can generate the posterior distribution of p via Bayes’ Theorem:

    $$P(p \mid T, C) = \frac{P(C \mid T, p)\, P(p)}{\int_{0.5}^{1} P(C \mid T, p')\, P(p')\, dp'} \tag{6}$$

This combines our prior knowledge with the data from the experiment. 
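
On the grid above, Eq(6) becomes a multiply-and-renormalise step; because the prior is flat, the posterior is simply the renormalised likelihood.  A sketch using binom_pmf from earlier, for a hypothetical result of 38 correct in 60 trials:

    T, C = 60, 38
    likelihood = np.array([binom_pmf(C, T, p) for p in p_grid])
    posterior_p = likelihood * prior
    posterior_p /= np.sum(posterior_p) * dp   # normalise to a density in p

    # Transform to h via Eq(4): h = 2p - 1, so the grid spacing doubles
    # and the density shrinks by the Jacobian |dp/dh| = 1/2.
    h_grid = 2.0 * p_grid - 1.0
    posterior_h = posterior_p / 2.0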

The more data we collect, the more information is contained in P( p | T, C ) and the more precise the estimation of p becomes.  We now have everything that we need to reason about p and, by the use of Eq(4), h.  The quantitative results produced by the analysis are given below.

  • P( h | T, C ) - the posterior distribution for h, from which all the inference is taken

  • E( h ) - the expected value or mean of P( h | T, C ).  This is the most widely accepted single-number estimate of h; it is the ‘centre of gravity’ of the posterior distribution

  • Ranges over which we can predict h to lie (Highest Density Regions or HDRs) at chosen levels of certainty

As a guide to the probability statements attached to the HDRs, the odds are 3:1 (or equivalently there is a 75% probability) that h lies within the 75% HDR.  Intuitively, many are satisfied with this level of certainty, but when safety-critical decisions are made, the interested parties usually require much higher certainty (>99.9%).  There are no fixed rules for using these summary statistics - one accepts a higher risk of being incorrect to gain a more precise limit on the parameter.  From a purely personal point of view, we usually use the 75% HDR for the initial assessment of a parameter.
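
Both E( h ) and the HDRs fall straight out of the gridded posterior.  For a unimodal density, one simple recipe is to accumulate grid points in decreasing order of density until the chosen probability mass is reached.  A sketch, continuing the example above:

    def summarise(h_grid, density, level=0.75):
        """Return E(h) and the `level` HDR of a unimodal gridded density."""
        dh = h_grid[1] - h_grid[0]
        mass = density * dh                    # probability per grid point
        mean = np.sum(h_grid * mass)           # E(h), the 'centre of gravity'
        order = np.argsort(density)[::-1]      # highest density first
        keep = order[np.cumsum(mass[order]) <= level]
        return mean, (h_grid[keep].min(), h_grid[keep].max())

    print(summarise(h_grid, posterior_h))      # ~ 0.264, (0.13, 0.39) for 38/60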

Examples

Figure 2 contains the posterior distributions for h from three different 60-run listening tests with differing numbers of correct identifications.

[Figure 2: posterior distributions of h for 30/60, 38/60 and 46/60 correct identifications]

Summary statistics from these results are given in table 1.  For example, from result B (38/60) our best single-number estimate for h is 0.264, and we know that h lies between 0.13 and 0.395 with a probability of 75%, or alternatively at odds of 3:1.  If we want to be more certain, e.g. 95% or odds of 19:1, then we must accept that the range over which it lies is wider - from 0.04 to 0.475.


Result   C/T     E( h )   75% HDR           95% HDR
A        30/60   0.099    ( 0.00, 0.14 )    ( 0.00, 0.24 )
B        38/60   0.264    ( 0.13, 0.39 )    ( 0.04, 0.48 )
C        46/60   0.516    ( 0.40, 0.65 )    ( 0.30, 0.72 )

Table 1: Summary statistics from ABX example
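
As a cross-check, wrapping the grid steps above in a small helper and running it for all three results should reproduce table 1 to within grid resolution:

    def posterior_h_for(C, T=60, n=2001):
        """Gridded posterior density of h for C correct out of T trials
        (the flat prior of Eq(5) cancels on normalisation)."""
        p = np.linspace(0.5, 1.0, n)
        dens = np.array([binom_pmf(C, T, q) for q in p])
        dens /= np.sum(dens) * (p[1] - p[0])
        return 2.0 * p - 1.0, dens / 2.0       # (h grid, density in h)

    for C in (30, 38, 46):
        h, dens = posterior_h_for(C)
        print(C, summarise(h, dens))           # rows A, B and C of table 1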


Imagine you have made a change to a device being tested and now get 38/60 instead of the original 30/60 (as in results A and B from figure 2).  The distribution of ∂h from this change is given in figure 3 below.

[Figure 3: posterior distribution of ∂h for 38/60 vs 30/60]

Summary statistics for this change are given in table 2.

Result            E( ∂h )   P( ∂h >= 0 )   P( ∂h < 0 )
38/60 vs 30/60    0.165     91.3%          8.7%

Table 2: Summary statistics from ABX difference example


This shows an expected increase in h of 0.165.  It also gives the important information that h has increased with a probability of 91% - or alternatively at odds of about 10:1.
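
Numerically, the distribution of ∂h can be explored by drawing samples from the two posteriors and differencing them.  A Monte Carlo sketch built on the helpers above:

    rng = np.random.default_rng(1)

    def sample_h(C, T=60, n=200_000):
        """Draw samples of h from the gridded posterior for C out of T."""
        h, dens = posterior_h_for(C, T)
        return rng.choice(h, size=n, p=dens * (h[1] - h[0]))

    dh_samples = sample_h(38) - sample_h(30)   # ∂h for 38/60 vs 30/60
    print(dh_samples.mean())                   # ~ 0.165, E(∂h)
    print((dh_samples >= 0).mean())            # ~ 0.913, P(∂h >= 0)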

References

[1] Srednicki (1988), "A Bayesian Analysis of A-B Listening Tests", J. AES, Vol. 36, No. 3, pp. 143-145.

[2] Burstein (1989), "Transformed Binomial Confidence Limits for Listening Tests", J. AES, Vol. 37, No. 5, pp. 363-367.



© Hopkins Research Ltd 2011   All rights reserved