ERP Boot Camp Tip: Why mean amplitude is usually superior to peak amplitude

Traditionally, ERP amplitudes were quantified (scored) by finding the maximum voltage (or the minimum voltage for a negative component) within some time period. Why? Mainly because this was easy to do with a ruler and a pencil when your EEG system did not include a general-purpose computer and just gave you a printout of the waveform. When computers became available and could easily quantify components in more sophisticated ways, many researchers nonetheless continued to use peaks.

However, other researchers began scoring component amplitudes as the mean voltage within a particular time range. This approach is still far from perfect, but over time it has become clear that mean amplitude has many advantages over peak amplitude, and there has been a clear shift toward the use of mean amplitude instead of peak amplitude. However, peak amplitude is still used more than it should be. The goal of this blog post is to describe some of the reasons why mean amplitude is usually preferable to peak amplitude, so that researchers will make an informed choice rather than just following tradition. A more detailed discussion is provided in Chapter 9 of An Introduction to the Event-Related Potential Technique, 2nd Edition (MIT Press). That chapter also discusses why peak latency is a poor measure of timing and describes some better alternatives.

Reason 1: Peaks and components are not the same thing. Generally speaking, there's nothing special about the time at which the voltage reaches its maximum amplitude. Given that multiple components are almost always overlapping at any given moment in time, the time and amplitude of the peak voltage will often not be the same as the time and amplitude of the peak of the component of interest. Moreover, computational models of cognitive and neural processes rarely have much to say about when a process "peaks." Instead, they focus on when a process begins, ends, etc. So, peaks aren't particularly meaningful theoretically, and they can encourage an overly simplistic view of the relationship between the underlying components and the observed waveform.
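To make this concrete, here is a minimal simulation sketch (not real data; the Gaussian "components" and all of their parameters are arbitrary assumptions chosen for illustration) showing that the peak of a summed waveform can differ substantially in both latency and amplitude from the peak of the component of interest:

```python
import numpy as np

t = np.arange(0.0, 600.0)                    # time in ms, 1-ms steps

def gaussian(t, peak_ms, width_ms, amp):
    """A Gaussian pulse standing in for an ERP component (an assumption)."""
    return amp * np.exp(-0.5 * ((t - peak_ms) / width_ms) ** 2)

early = gaussian(t, peak_ms=250, width_ms=60, amp=2.0)   # earlier overlapping component
late = gaussian(t, peak_ms=350, width_ms=80, amp=3.0)    # "component of interest"
observed = early + late                                  # observed waveform = sum

print("True component peak:", t[np.argmax(late)], "ms,", late.max(), "uV")
print("Observed waveform peak:", t[np.argmax(observed)], "ms,",
      round(observed.max(), 2), "uV")
# The observed peak lands ~50 ms earlier and ~0.9 uV higher than the true
# component peak, purely because of the overlapping earlier component.
```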

Reason 2: Peak amplitude is typically less reliable than mean amplitude.  Peak amplitude is easily influenced by noise, whereas mean amplitude essentially filters out noise at high and intermediate frequencies. Here's a nice study showing that mean amplitude provides more robust results than peak amplitude: Clayson, P. E., Baldwin, S. A., & Larson, M. J. (2013). How does noise affect amplitude and latency measurement of event-related potentials (ERPs)? A methodological critique and simulation study. Psychophysiology, 50, 174-186.
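Here's a rough sketch of the reliability point, assuming white noise added to a fixed Gaussian signal (both are simplifying assumptions; real EEG noise is temporally correlated). The same true waveform yields far more variable peak scores than mean scores across simulated averages:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0.0, 600.0)                              # time in ms
signal = 3.0 * np.exp(-0.5 * ((t - 300) / 80) ** 2)    # same true waveform every time
window = (t >= 200) & (t <= 400)                       # measurement window

peaks, means = [], []
for _ in range(1000):                                  # 1000 simulated averages
    wave = signal + rng.normal(0, 1.0, t.size)         # add white noise
    peaks.append(wave[window].max())                   # peak amplitude in window
    means.append(wave[window].mean())                  # mean amplitude in window

print("SD of peak scores:", np.std(peaks).round(3))   # large spread
print("SD of mean scores:", np.std(means).round(3))   # noise mostly averages out
```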

Reason 3: The peak occurs at different times at different electrode sites.  An ERP component in the brain will have the same timing at every electrode site, but the timing of the peak voltage may differ considerably from site to site (because of other overlapping components). Consequently, when you measure the peak at multiple electrode sites, you're measuring the underlying component at different time points at each site, which is just a weird thing to do. More formally, it's not legitimate to look at the scalp distribution of a peak amplitude measurement (unless you find the peak at one electrode site and then measure all electrode sites at that time point).
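If you do want a scalp distribution from a peak-based analysis, the legitimate approach mentioned above can be sketched as follows (the data, channel layout, and parameters here are entirely hypothetical):

```python
import numpy as np

def scalp_at_peak(data, times, ref_channel, t_min, t_max):
    """Find the peak latency at one reference channel within [t_min, t_max],
    then return every channel's voltage at that single time point.
    (Use argmin instead of argmax for a negative component.)"""
    idx = np.where((times >= t_min) & (times <= t_max))[0]
    peak_idx = idx[np.argmax(data[ref_channel, idx])]
    return data[:, peak_idx], times[peak_idx]

# Tiny fake example: 3 channels x 600 time points of hypothetical averaged data
rng = np.random.default_rng(0)
times = np.arange(600.0)
data = rng.normal(0, 0.1, (3, times.size))
data[1] += 3.0 * np.exp(-0.5 * ((times - 300) / 80) ** 2)  # component at channel 1

values, latency = scalp_at_peak(data, times, ref_channel=1, t_min=200, t_max=400)
print("Peak latency at reference site:", latency, "ms")
print("All channels measured at that latency:", values.round(2))
```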

Reason 4: Peak amplitude is biased by the noise level and number of trials, but mean amplitude is not.  The noisier the data, the bigger the peak (all else being equal). As a result, it's not legitimate to compare peak amplitudes from groups or conditions that differ in noise level (usually as a result of differences in the number of trials). However, mean amplitude is unbiased (i.e., the variance will increase as the noise level increases, but the score is not pushed to a consistently higher value). If you are measuring mean amplitude, it's perfectly legitimate to compare groups or conditions with different noise levels or different numbers of trials. For more details, see this previous blog post.
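Here's a minimal simulation of the bias, again assuming white noise on a fixed Gaussian signal: as the noise level grows, the average peak score grows with it, even though the true waveform never changes, while the average mean-amplitude score stays put.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(0.0, 600.0)
signal = 3.0 * np.exp(-0.5 * ((t - 300) / 80) ** 2)    # true waveform never changes
window = (t >= 200) & (t <= 400)

for noise_sd in (0.5, 1.0, 2.0):                       # e.g., fewer trials -> more noise
    waves = signal + rng.normal(0, noise_sd, (1000, t.size))
    avg_peak = waves[:, window].max(axis=1).mean()     # average peak score: grows
    avg_mean = waves[:, window].mean(axis=1).mean()    # average mean score: stable
    print(f"noise SD {noise_sd}: avg peak {avg_peak:.2f}, avg mean {avg_mean:.2f}")
```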

Reason 5: Peak is a nonlinear measure whereas mean is linear. Linear operations have many advantages. One is that the order of operations does not matter for linear operations [e.g., (A + B) + C = A + (B + C)]. If you measure the mean amplitude on each individual trial and then average these values together, you get exactly the same thing as if you average the single-trial waveforms together and then measure mean amplitude. Similarly, if you measure the mean amplitude from each subject's averaged ERP waveform and then average these values together, the result will be identical to measuring the mean amplitude from the grand average. By contrast, you may get a very different value if you measure the peak on the single trials (or single subjects) rather than on the averaged ERP waveform (or the grand average).
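Here's a small demonstration of the linearity point, using simulated single trials (white noise on a Gaussian signal, both assumptions). Exactly the same logic applies at the level of subjects and the grand average:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(0.0, 600.0)
signal = 3.0 * np.exp(-0.5 * ((t - 300) / 80) ** 2)
trials = signal + rng.normal(0, 2.0, (100, t.size))    # 100 noisy single trials
window = (t >= 200) & (t <= 400)

# Mean amplitude: measure-then-average equals average-then-measure
mean_then_avg = trials[:, window].mean(axis=1).mean()
avg_then_mean = trials.mean(axis=0)[window].mean()
print(np.isclose(mean_then_avg, avg_then_mean))        # True: identical

# Peak amplitude: the two orders give very different answers
peak_then_avg = trials[:, window].max(axis=1).mean()
avg_then_peak = trials.mean(axis=0)[window].max()
print(round(peak_then_avg, 2), round(avg_then_peak, 2))
```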

Reason 6: Peak amplitude is strongly impacted by trial-to-trial latency variability, but mean amplitude is completely insensitive to it. If the single-trial amplitude of a component is the same in two groups or conditions, but there is more latency variability in one group/condition than in the other, the peak amplitude in the averaged ERP waveform will be lower in the group/condition with greater latency variability. For example, it may appear that a patient group has a lower amplitude than a control group if the patient group simply has more variability in the timing of their brain activity. However, mean amplitude is completely unaffected by latency variability (assuming the measurement window is wide enough), so a difference in latency variability cannot artificially produce a difference in mean amplitude. If you see a difference between a patient group and a control group in mean amplitude (with a sufficiently broad measurement window), you know it reflects a bona fide difference in the single-trial amplitudes.
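A sketch of the jitter effect, using hypothetical Gaussian single trials whose amplitude never changes and only the latency varies (the jitter values and component shape are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(0.0, 800.0)

def single_trial(latency_ms):
    """One trial: same amplitude every time, only the latency varies."""
    return 3.0 * np.exp(-0.5 * ((t - latency_ms) / 50.0) ** 2)

window = (t >= 100) & (t <= 700)                       # wide enough to span all trials
for jitter_sd in (10.0, 60.0):                         # low vs. high latency variability
    trials = [single_trial(400 + rng.normal(0, jitter_sd)) for _ in range(500)]
    avg = np.mean(trials, axis=0)                      # averaged ERP waveform
    print(f"jitter SD {jitter_sd} ms: peak {avg[window].max():.2f}, "
          f"mean {avg[window].mean():.2f}")
# Peak amplitude drops sharply with more jitter; mean amplitude stays the same.
```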

A caveat: Choosing the time window. The biggest challenge in using mean amplitude is deciding on the measurement window. If you use the observed data to choose the time window, you can dramatically increase the probability that noise in the data leads to a statistically significant (but completely bogus and unreplicable) effect. We will discuss solutions to this problem in a future blog post. In the meantime, see this article: Luck, S. J., & Gaspelin, N. (2017). How to Get Statistically Significant Effects in Any ERP Experiment (and Why You Shouldn’t). Psychophysiology, 54, 146-157.
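To see how serious this problem can be, here is a rough simulation under strong simplifying assumptions (pure-noise condition differences, uncorrelated samples, a 50-sample sliding window, SciPy available): even with no true effect at all, picking the window where the observed difference looks biggest produces "significant" results far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_subj, n_times, win = 20, 300, 50
false_pos = 0
for _ in range(1000):                                  # 1000 null experiments
    diff = rng.normal(0, 1, (n_subj, n_times))         # condition difference = pure noise
    grand = diff.mean(axis=0)                          # grand-average difference wave
    # Slide a 50-sample window and pick where the grand-average difference is biggest
    sums = np.convolve(grand, np.ones(win), mode="valid")
    start = int(np.argmax(np.abs(sums)))
    scores = diff[:, start:start + win].mean(axis=1)   # per-subject mean amplitude
    if stats.ttest_1samp(scores, 0).pvalue < .05:
        false_pos += 1
print("False-positive rate:", false_pos / 1000)        # far above the nominal .05
```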