ERP Boot Camp Tip: Why mean amplitude is usually superior to peak amplitude

Traditionally, ERP amplitudes were quantified (scored) by finding the maximum voltage (or minimum voltage for a negative component) within some time period.  Why? Mainly because this was easy to do with a ruler and a pencil when your EEG system did not include a general-purpose computer and just gave you a printout of the waveform. When computers became available, and could easily quantify components in more sophisticated ways, many researchers continued to use peaks.

However, other researchers began scoring component amplitudes using the mean voltage within a particular time range. This is still far from perfect, but over time it has become clear that this mean amplitude approach has many advantages over peak amplitude, and there has been a clear shift toward the use of mean amplitude instead of peak amplitude.  However, peak amplitude is still used more than it should be.  The goal of this blog post is to describe some of the reasons why mean amplitude is usually preferable to peak amplitude so that researchers will make an informed choice and not just follow a tradition. A more detailed discussion is provided in Chapter 9 of An Introduction to the Event-Related Potential Technique, 2nd Edition (MIT Press). That chapter also discusses why peak latency is a poor measure of timing and describes some better alternatives.

Reason 1: Peaks and components are not the same thing.  Generally speaking, there's nothing special about the time at which the voltage reaches a maximum amplitude.  Given that multiple components are almost always overlapping at any given moment in time, the time and amplitude of the peak voltage will often not be the same as the time and amplitude of the peak of the component of interest.  Moreover, computational models of cognitive and neural processes rarely have much to say about when a process "peaks."  Instead, they focus on when a process begins, ends, etc.  So, peaks aren't particularly meaningful theoretically, and they can encourage an overly simplistic view of the relationship between the underlying components and the observed waveform.

Reason 2: Peak amplitude is typically less reliable than mean amplitude.  Peak amplitude is easily influenced by noise, whereas mean amplitude essentially filters out noise at high and intermediate frequencies. Here's a nice study showing that mean amplitude provides more robust results than peak amplitude: Clayson, P. E., Baldwin, S. A., & Larson, M. J. (2013). How does noise affect amplitude and latency measurement of event-related potentials (ERPs)? A methodological critique and simulation study. Psychophysiology, 50, 174-186.

Reason 3: The peak occurs at different times at different electrode sites.  An ERP component in the brain will have the same timing at every electrode site, but the timing of the peak voltage may differ considerably from site to site (because of other overlapping components). Consequently, when you measure the peak at multiple electrode sites, you're measuring the underlying component at different time points at each site, which is just a weird thing to do. More formally, it's not legitimate to look at the scalp distribution of a peak amplitude measurement (unless you find the peak at one electrode site and then measure all electrode sites at that time point).
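The exception mentioned in Reason 3 (find the peak at one electrode site, then measure every site at that same time point) can be sketched in a few lines of NumPy. This is only an illustration: the function name, the array layout (channels x time points), and the three-channel toy data are assumptions made for the example, not part of any toolbox.

```python
import numpy as np

def scalp_distribution_at_reference_peak(erp, times, ref_channel, t_start, t_end):
    """Find the positive peak at one reference channel, then measure all
    channels at that single latency.

    erp   : array of shape (n_channels, n_times), voltages in microvolts
    times : array of shape (n_times,), latencies in ms
    """
    in_window = np.flatnonzero((times >= t_start) & (times <= t_end))
    # Index (within the whole epoch) of the peak at the reference channel.
    # For a negative component, np.argmin would be used instead.
    peak_idx = in_window[np.argmax(erp[ref_channel, in_window])]
    return times[peak_idx], erp[:, peak_idx]

# Toy data: three hypothetical channels whose observed peaks fall at
# different latencies, as happens when other components overlap.
times = np.arange(0, 600, 4.0)                      # ms, 250 Hz sampling
peak_latencies = np.array([300.0, 340.0, 380.0])
erp = np.exp(-0.5 * ((times[None, :] - peak_latencies[:, None]) / 50.0) ** 2)

latency, amplitudes = scalp_distribution_at_reference_peak(erp, times, 1, 200, 500)
print(latency)      # latency of the peak at channel 1
print(amplitudes)   # every channel measured at that one latency
```

Because every channel is read out at the same latency, the resulting values describe a single moment in time and can be plotted as a scalp distribution.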

Reason 4: Peak amplitude is biased by the noise level and number of trials, but mean amplitude is not.  The noisier the data, the bigger the peak (all else being equal). As a result, it's not legitimate to compare peak amplitudes from groups or conditions that differ in noise level (usually as a result of differences in the number of trials). However, mean amplitude is unbiased (i.e., the variance will increase as the noise level increases, but the score is not pushed to a consistently higher value). If you are measuring mean amplitude, it's perfectly legitimate to compare groups or conditions with different noise levels or different numbers of trials. For more details, see this previous blog post.
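A quick simulation makes the bias in Reason 4 easy to see. The sketch below is built on assumptions chosen for convenience: a Gaussian-shaped "component" with a true peak of 5 microvolts, white Gaussian noise (real EEG noise is not white), and a 300-500 ms measurement window. The numbers it prints are only illustrative, but the pattern should hold: the average peak score climbs as the noise increases, whereas the average mean-amplitude score stays put.

```python
import numpy as np

rng = np.random.default_rng(0)
times = np.arange(0, 800, 4.0)                              # ms, 250 Hz sampling
signal = 5.0 * np.exp(-0.5 * ((times - 400) / 60.0) ** 2)   # true peak = 5 microvolts
window = (times >= 300) & (times <= 500)
true_mean = signal[window].mean()

def simulate(noise_sd, n_sims=2000):
    """Average peak and mean-amplitude scores over many noisy waveforms."""
    noise = rng.normal(0.0, noise_sd, size=(n_sims, times.size))
    waves = signal + noise
    avg_peak = waves[:, window].max(axis=1).mean()
    avg_mean = waves[:, window].mean(axis=1).mean()
    return avg_peak, avg_mean

for sd in (0.0, 1.0, 2.0):
    avg_peak, avg_mean = simulate(sd)
    print(f"noise SD {sd}: peak = {avg_peak:.2f}, mean amplitude = {avg_mean:.2f}")
```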

Reason 5: Peak is a nonlinear measure, whereas mean is linear. Linear operations have many advantages. One is that the order of operations does not matter for linear operations [e.g., (A + B) + C = A + (B + C)]. If you measure the mean amplitude on each individual trial and then average these values together, you get exactly the same thing as if you average the single-trial waveforms together and then measure mean amplitude. Similarly, if you measure the mean amplitude from each subject's averaged ERP waveform and then average these values together, the result will be identical to measuring the mean amplitude from the grand average.  By contrast, you may get a very different value if you measure the peak on the single trials (or single subjects) rather than on the averaged ERP waveform (or the grand average).
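The order-of-operations point in Reason 5 can be checked directly. The sketch below uses simulated single trials (an assumed Gaussian-shaped component plus white noise) and the same measurement window on every trial; the variable names are just for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
times = np.arange(0, 800, 4.0)                              # ms, 250 Hz sampling
signal = 5.0 * np.exp(-0.5 * ((times - 400) / 60.0) ** 2)
trials = signal + rng.normal(0.0, 10.0, size=(100, times.size))  # 100 noisy trials
window = (times >= 300) & (times <= 500)

# Mean amplitude: measure-then-average equals average-then-measure.
mean_of_trial_means = trials[:, window].mean(axis=1).mean()
mean_of_average = trials.mean(axis=0)[window].mean()

# Peak amplitude: the two orders give very different answers.
mean_of_trial_peaks = trials[:, window].max(axis=1).mean()
peak_of_average = trials.mean(axis=0)[window].max()

print(mean_of_trial_means, mean_of_average)    # equal up to rounding error
print(mean_of_trial_peaks, peak_of_average)    # single-trial peaks are much larger
```

The same logic applies one level up, with subjects in place of trials and the grand average in place of the averaged ERP waveform.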

Reason 6: Peak amplitude is strongly impacted by trial-to-trial latency variability, but mean amplitude is insensitive to it (given a sufficiently wide measurement window).  If the single-trial amplitude of a component is the same in two groups or conditions, but there is more latency variability in one group/condition than in the other, the peak amplitude in the averaged ERP waveform will be lower in the group/condition with greater latency variability. For example, it may appear that a patient group has a lower amplitude than a control group if the patient group has more variability in the timing of their brain activity.  However, mean amplitude is completely unaffected by latency variability (assuming the measurement window is wide enough), so a difference in latency variability cannot artificially produce a difference in mean amplitude. If you see a difference between a patient group and a control group in mean amplitude (with a sufficiently broad measurement window), you know it reflects a bona fide difference in the single-trial amplitudes.
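Here is a small simulation of the latency-variability effect in Reason 6. Everything in it is an assumption made for illustration: noise-free single trials, a Gaussian-shaped component with the same 5-microvolt amplitude on every trial, normally distributed latencies, and a deliberately broad 100-900 ms window.

```python
import numpy as np

rng = np.random.default_rng(2)
times = np.arange(0, 1000, 4.0)                 # ms, 250 Hz sampling
window = (times >= 100) & (times <= 900)        # broad measurement window

def averaged_erp(latency_sd, n_trials=500):
    """Average of single trials that differ only in component latency."""
    latencies = rng.normal(500.0, latency_sd, size=n_trials)
    trials = 5.0 * np.exp(-0.5 * ((times[None, :] - latencies[:, None]) / 40.0) ** 2)
    return trials.mean(axis=0)

low_jitter = averaged_erp(latency_sd=10.0)      # e.g., a "control" group
high_jitter = averaged_erp(latency_sd=60.0)     # e.g., a "patient" group

print("peak:", low_jitter[window].max(), high_jitter[window].max())
print("mean:", low_jitter[window].mean(), high_jitter[window].mean())
```

The peak of the averaged waveform is clearly smaller with more jitter even though every single trial has the same amplitude, whereas the two mean amplitudes are essentially identical. If the window were narrowed so that jittered trials spilled outside it, the mean amplitude would start to be affected too.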

A caveat: Choosing the time window.  The biggest challenge in using mean amplitude is deciding on the measurement window. If you use the observed data to choose the time window, you can dramatically increase the possibility that noise in the data leads to a statistically significant (but completely bogus and unreplicable) effect.  We will discuss solutions to this problem in a future blog post. In the meantime, see this article: Luck, S. J., & Gaspelin, N. (2017). How to Get Statistically Significant Effects in Any ERP Experiment (and Why You Shouldn’t). Psychophysiology, 54, 146-157.

ERP Boot Camp Tip: Comparing conditions with different numbers of trials

A common question in ERP research is whether it is legitimate to compare conditions in which different numbers of trials were averaged together (e.g., error trials versus correct trials in an ERN study; oddballs versus standards in an oddball or MMN study).  It turns out that the answer depends on how you're measuring the ERP components.  In a nutshell: if you're measuring mean amplitude, then it's not a problem to compare conditions with different numbers of trials; if you are measuring peak amplitude, then it is a problem.

An extended discussion of this issue can be found in this document. Here, we provide a brief summary.

The figure below shows a clean ERP waveform and the same ERP waveform with noise added. Note that the peak amplitude is higher in the noisy waveform.  This exemplifies a general principle: All else being equal, the peak voltage will be greater in a noisier waveform than in a cleaner waveform.  This is why it is not legitimate to compare waveforms with different numbers of trials (and therefore different noise levels) when using peak amplitude.  The usual solution to this problem is to create an averaged ERP waveform using a subsample of trials from the condition with more trials, equating the number of trials in the averages.  However, it is almost always better to stop using peak amplitude and instead use mean amplitude to quantify the amplitude of the component (see Chapter 9 in An Introduction to the Event-Related Potential Technique for a list of reasons why mean amplitude is almost always superior to peak amplitude).

[Figure: a clean ERP waveform and the same ERP waveform with noise added]

Mean amplitude (e.g., the average voltage between 300 and 500 ms) is not biased by the noise level.  That is, the mean amplitude will be more variable if the data are noisier, but it is not consistently pushed toward a larger value.  So, you might have more subject-to-subject variability in a condition with fewer trials, but most statistical techniques are robust to modest differences in variance, and this variability will not induce an artificial difference in means between your groups.  There is no need to subsample from the condition with more trials when you are using mean amplitude.  You are just throwing away statistical power if you do this.
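To see why subsampling just throws away power when you are using mean amplitude, here is a rough simulation. It assumes a Gaussian-shaped component, white single-trial noise, 200 simulated subjects, and a condition with 400 trials that is cut down to 40 to "match" a rarer condition; all of these numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
times = np.arange(0, 800, 4.0)                              # ms, 250 Hz sampling
signal = 5.0 * np.exp(-0.5 * ((times - 400) / 60.0) ** 2)
window = (times >= 300) & (times <= 500)                    # 300-500 ms
true_mean = signal[window].mean()

def simulate_subject(n_trials=400, n_keep=40, noise_sd=20.0):
    """Mean amplitude from all trials and from a random subsample of them."""
    trials = signal + rng.normal(0.0, noise_sd, size=(n_trials, times.size))
    keep = rng.choice(n_trials, size=n_keep, replace=False)
    full = trials.mean(axis=0)[window].mean()
    sub = trials[keep].mean(axis=0)[window].mean()
    return full, sub

scores = np.array([simulate_subject() for _ in range(200)])
full_scores, sub_scores = scores[:, 0], scores[:, 1]

print("true value:      ", true_mean)
print("all trials:  mean", full_scores.mean(), "SD", full_scores.std())
print("subsampled:  mean", sub_scores.mean(), "SD", sub_scores.std())
```

Both versions are centered on the true value, so neither is biased. The subsampled scores are simply more variable from subject to subject, which is the lost power.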

Bottom line: In almost every case, the best way to deal with the "problem" of different numbers of trials per condition is to do nothing at all, except make sure you're using mean amplitude to quantify the amplitude.

Hints for ICA-based artifact correction

ICA is a great tool for correcting artifacts, especially eye blinks. Here we provide some practical hints for ICA-based artifact correction.

One of the most difficult parts for new users is knowing which independent components (ICs) reflect artifacts and should be removed.  There's a really great overview here, which describes how to use the combination of scalp distribution, time course, and frequency content to distinguish among artifacts.

A fundamental assumption of ICA is that the artifact has a perfectly consistent scalp distribution.  This is true for some artifacts, such as blinks and EKG, but it may not be true for others (e.g., EMG, eye movements).  ICA works reasonably well with EMG and eye movements under some conditions (especially if all the eye movements are along a single plane), but we recommend caution with these artifacts.

A key aspect of ICA is that the number of ICs must necessarily be equal to the number of channels. You obviously aren't changing the number of underlying brain components by changing the number of electrodes, so the fact that the number of ICs changes as you vary the number of electrodes should make it clear that ICA is an imperfect approach that will not work for every kind of artifact.  Also, no matter how many electrodes you have, the number of brain components is likely to be greater than the number of ICs.  So, ICA will inevitably blend multiple brain components into a single IC, and a single brain component may be split across multiple ICs. As a result, ICA cannot be expected to figure out all the true underlying components.  In practice, ICA works best for components that are relatively large and relatively common.  For this reason, our labs mainly use ICA for blinks, which are both very large (in all participants) and fairly common (in most participants). 

Given that the number of ICs is fixed, you don't want to "waste" ICs on huge but infrequent artifacts (especially if you have a relatively small number of channels).  For example, your participants may have periods of "crazy" EEG during breaks (as a result of stretching, movement, etc.), and these periods may eat up a lot of ICs.  You may therefore want to delete sections of "crazy data" before performing ICA.  ERPLAB Toolbox has two routines that are designed to help with this.  One can automatically delete periods of data between trial blocks.  Another can detect and delete periods of crazy data.  But don't use this approach to delete ordinary artifacts -- this should be for periods in which you are seeing huge and irregular voltage deflections.

Similarly, it's important to eliminate slow drifts prior to ICA.  We generally recommend a high-pass filter with a half-amplitude cutoff of 0.1 Hz and a slope of 12 dB/octave.  You can also apply a low-pass filter if your data have a lot of high-frequency noise.  This filtering should be done on the continuous data (prior to epoching and prior to deleting segments of crazy data) and should be consistent for all participants (e.g., don't low-pass filter for some participants but not others, or use different cutoffs for different participants).

Another important bit of practical advice is that ICA involves training a neural network, and you need enough "trials" (time points) to train the network. A general heuristic is that the # of time points must be greater than 20 x (# channels)^2.  The key is that the number of channels is squared.  So, with 64 channels, you would need 81,920 points (which would be about 5.5 minutes of data with a 250 Hz sampling rate).  However, with 128 channels, you would need 4 times as many points, and with 256 channels, you would need 16 times as many points.
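The arithmetic behind this heuristic is easy to script. The helper names below are made up for this example; the factor of 20 and the 250 Hz sampling rate come from the text above.

```python
def min_points_for_ica(n_channels, k=20):
    """Heuristic minimum number of time points: k x (number of channels)^2."""
    return k * n_channels ** 2

def min_minutes_for_ica(n_channels, sampling_rate_hz, k=20):
    """The same minimum expressed as minutes of continuous data."""
    return min_points_for_ica(n_channels, k) / sampling_rate_hz / 60.0

for n_channels in (64, 128, 256):
    points = min_points_for_ica(n_channels)
    minutes = round(min_minutes_for_ica(n_channels, 250), 1)
    print(n_channels, points, minutes)
# 64 81920 5.5
# 128 327680 21.8
# 256 1310720 87.4
```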

ERP Boot Camp Tip: General Hints for Processing Data

EEG/ERP data are noisy and complicated, and it's easy to make mistakes or miss problems. Here are some hints for avoiding common problems that arise in EEG/ERP data collection and processing.

Start by running one subject and then doing a fairly complete analysis of that subject's data.  You will likely find some kind of problem (e.g., a problem with the event codes) that you need to fix before you run any more subjects.  Check the number of trials in each condition to make sure that it exactly matches what you expect.  Also, make sure you check the behavioral data, and not just the ERPs.  If you collect data from multiple subjects before doing a complete analysis, there's about a 50% chance that you will find a problem that requires that you throw out all of the data that you've collected, which will make you very sad. Do not skip this step!
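The trial-count check is simple to automate once you have the list of event codes from the first subject. The sketch below uses only the Python standard library; the function name, the event codes (11, 12, 99), and the expected counts are hypothetical and would need to be replaced with the values from your own design.

```python
from collections import Counter

def check_trial_counts(event_codes, expected_counts):
    """Return {code: (expected, observed)} for every code whose count is off."""
    observed = Counter(event_codes)
    codes = set(expected_counts) | set(observed)
    return {
        code: (expected_counts.get(code, 0), observed.get(code, 0))
        for code in codes
        if expected_counts.get(code, 0) != observed.get(code, 0)
    }

# Hypothetical oddball design: code 11 = standard (160 trials),
# code 12 = oddball (40 trials).
expected = {11: 160, 12: 40}
recorded = [11] * 160 + [12] * 39 + [99]    # one oddball missing, one stray code
problems = check_trial_counts(recorded, expected)
print(problems)   # an empty dict would mean every count matched exactly
```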

Once you verify that everything in your task, data collection procedures, and analysis scripts is working correctly, you can start collecting data from multiple additional subjects.  However, you should do a preliminary analysis of each subject's data within 48 hours of collecting the data (i.e., up to and including the point of plotting the averaged ERP waveforms).  This allows you to detect a problem (e.g., a malfunctioning electrode) before you collect data from a bunch of subjects with the same problem. This is especially important if you are not the one collecting the data and are therefore not present to notice problems during the actual recording session. 

The first time you process the data from a given subject, don't do it with a script!  Instead, process the data "by hand" (using a GUI) so that you can make sure that everything is OK with the subject's data.  There are many things that can go wrong, and this is your chance to find problems.  The most important things to look at are: the raw EEG before any processing, the EEG data after artifact detection, the time course and scalp distribution of any ICA components being excluded, the number of trials rejected in each condition, and the averaged ERP waveforms.  We recommend that you set artifact rejection parameters individually for each subject, because different people can have very different artifacts.  One size does not fit all.  (In a between-subjects design, the person setting the parameters should be blind to group membership to avoid biasing the results.)  These parameters can then be saved in an Excel file for future use and for reporting in journal articles.

If you need to re-analyze your data (e.g., with a different epoch length), it's much faster to do this with a script.  Your script can read in any subject-specific parameters from the Excel file.  Also, it's easy to make a mistake when you do the initial analysis "by hand," so re-analyzing everyone with a script prior to statistical analysis is a good idea. However, it is easy to make mistakes in scripting as well, so it's important to check the results of every step of processing in your script for accuracy.  It can also be helpful, especially if you are new to scripting, to have another researcher look through your data processing procedures to check for accuracy. 

Bottom line: Scripts are extremely useful for reanalyzing data, but they should not be used for the initial analysis.  Also, don't just borrow someone else's script and apply it to your data.  If you don't fully understand every step of a script (including the rationale for the parameters), don't use the script.

[Figure: Hints for Processing Data]

ERP Boot Camp Tip: What does the polarity of an ERP component mean?

We are often asked whether it means anything that a component is positive (e.g., P2 and P3) or negative (e.g., N1, N400, error-related negativity).  The answer, for the most part, is "no".

First, every ERP component will be positive on one side of the head and negative on the other side.  We often don't "see" the other side of a component (e.g., the negative side of the P3) because (a) the opposite-polarity side is in a place without any electrodes (e.g., the bottom of the skull), (b) the opposite-polarity side is obscured by other components, or (c) the opposite-polarity side is spatially diffuse (low amplitude and broadly distributed).  But it's there!

As the figure below shows, there are 4 factors that determine the polarity of an ERP component.  If we knew 3, we could in principle determine the 4th by knowing the polarity.  In practice, we never know 3 so we cannot determine the 4th.  In particular, although the polarity depends on whether the ERP arises from excitatory or inhibitory neurotransmission, we cannot ordinarily determine whether a component represents excitation or inhibition from its polarity.


Mainly, polarity is used to help identify a given component.  For example, if our active electrodes are near Pz and our reference electrodes are near the mastoids, we can be sure that the P3 will be a positive voltage.  Beyond that, polarity doesn't tell us much.

ERP Boot Camp Tip: Stimulus Duration

Here's a really simple tip regarding the duration of visual stimuli in ERP experiments: In most cases, the duration of a visual stimulus should be either (a) between 100 and 200 ms or (b) longer than the time period that you will be showing in your ERP waveforms.  


Here's the rationale: If your stimulus is shorter than ~100 ms, it is effectively the same as a 100 ms stimulus with lower contrast (look into Bloch's Law if you're interested in the reason for this).  For example, if you present a stimulus for 1 ms, the visual system will see this as being essentially identical to a very dim 100-ms stimulus.  As a result, there is usually no point in presenting a visual stimulus for less than 100 ms (unless you are using masking).

Once a stimulus duration exceeds ~100 ms, it produces an offset response as well as an onset response.  For example, the image shown here illustrates what happens with a 500-ms stimulus duration: There is a positive bump at 600 ms that is the P1 elicited by the offset of the stimulus.  This isn't necessarily a problem, but it makes your waveforms look weird.  You don't want to waste words explaining this in a paper.  So, if you need a long duration, make it long enough that the offset response is after the end of the time period you'll be showing in your waveforms.

The offset response gets gradually larger as the duration exceeds 100 ms.  With a 200-ms duration, the offset response is negligible.  So, if you want to give your participants a little extra time for perceiving the stimulus (but you don't want a very long duration), 200 ms is fine.  We usually use 200 ms in our schizophrenia studies.

Also, if the stimulus is <=200 ms, there is not much opportunity for eye movements (unless the stimulus is lateralized).  If you present a complex stimulus for >200 ms, you will likely get eye movements.  This may or may not be a significant problem, depending on the nature of your study.
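The duration rule of thumb from this tip can be written as a small check to run on an experiment's parameters. The function names are made up for this example, the 100 and 200 ms boundaries are treated as inclusive (an assumption), and the offset-response helper assumes a P1 latency of about 100 ms, which is what the 500-ms example above implies.

```python
def duration_follows_rule(duration_ms, plotted_epoch_end_ms):
    """True if the duration is 100-200 ms or outlasts the plotted epoch."""
    return 100 <= duration_ms <= 200 or duration_ms > plotted_epoch_end_ms

def expected_offset_p1_ms(duration_ms, p1_latency_ms=100):
    """Approximate latency of the offset-elicited P1, relative to stimulus onset."""
    return duration_ms + p1_latency_ms

# Waveforms plotted out to 800 ms after stimulus onset
for duration in (50, 200, 500, 1000):
    print(duration, duration_follows_rule(duration, 800), expected_offset_p1_ms(duration))
```

A 500-ms stimulus fails the check when the waveforms are plotted out to 800 ms, because its offset response lands at about 600 ms, inside the plotted period.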