We need to talk about significance tests

Posted on Thu 24 October 2019 in misc

At ACL 2019, “We Need to Talk about Standard Splits” by Kyle Gorman and Steven Bedrick was honored as one of five outstanding papers. The authors perform a replication study on PoS taggers to evaluate whether the reported accuracies can be reproduced and whether those accuracies hinge on using the standard split versus other splits or slightly differently annotated data. To compare taggers, the authors propose to perform random splits of the test and training data and to evaluate a natural language processor on these random splits instead of the standardized one. In this blog post, I will describe why I find that approach to be of limited merit and in some cases even problematic.

As the rest of this blog post mainly criticizes the statistical aspects of the G&B paper, I want to clearly state that in my opinion the replication performed in the paper is important: the random splits, the re-evaluation both on the same and on the OntoNotes data, and the error analysis provide a real benefit. My gripe is only with the statistical evaluation of the results.

Let us first get some statistical theory out of the way. Please bear with me.

What are statistical tests anyway?

Most statistical tests observed in publications are frequentist tests resulting in p-values. p-values have been widely criticized for being hard to understand and for misrepresenting results. Generally, p-values smaller than 0.05 are seen as significant, and significant results are seen as proof of an effect, whereas high p-values are seen as a sign of no effect. This is a misinterpretation of p-values!

Most tests work with two complementary hypotheses, usually called \(h_0\) and \(h_1\). If we (as in the G&B paper) want to compare two different NLP systems A and B, \(h_0\) would be that the accuracy of A and the accuracy of B are the same, i.e. the underlying accuracy distribution of both systems is the same. \(h_1\) is the alternative hypothesis, i.e. that the accuracies of A and B need to be modeled with two different distributions. Note that \(h_0\) is not that the accuracies are similar (maybe modulo some small ε) but actually identical. If we can reject \(h_0\) (both have the same accuracy), then we have proved that the alternative to \(h_0\) needs to hold true (\(h_1\), either A or B is a better system than the other one). And already the problems start: we cannot prove that \(h_0\) is wrong; we can at most find that, given the observation we made, it is less likely that \(h_0\) is true than it was before the observation (but more on that problem in “Problems with p-values” below).

I said “probability distribution” above, but did not specify where the distribution comes from. In general, we have to make a hypothesis about the shape of the underlying distribution and model the experiments in such a way that they can be seen as repeated draws from that distribution. G&B do this by modeling PoS tagging as a Bernoulli distribution B(X), which simply states that each PoS tagger gets each tag right with a probability of X. Tagging a single word can then be seen as drawing from B(X) to see whether the tag was correct or not.

With the Bernoulli distribution above, \(h_0\) is the hypothesis that the observations of tags being correct or incorrect are drawn from the same Bernoulli distribution. The p-value obtained by the significance test is the probability that the observed difference between the taggers (or a more extreme difference between the taggers) is generated by a single Bernoulli distribution as assumed by \(h_0\). This is not the probability of \(h_0\) being true (see “problems with p-values” below).

significant ≠ big

If someone reports highly significant results, it sounds like they made a big leap forwards. However, a p-value can also become smaller by simply testing more often. Let us rummage in my wallet and take two unfair coins: one lands on “heads” with a probability of only 0.2, the other one with a probability of 0.45. In this scenario, \(h_0\) would be that the coins are fair, i.e. the probability of heads is assumed to be 0.5 under \(h_0\). Surely the first one will yield more significant results?

Let us flip the first coin ten times. We observe “heads” twice (what a nice coincidence!). However, this does not tell us “significantly” that the coin is not fair even though we have observed “heads” only 20% of the time (a binomial test gives a p-value of about 0.11).

However, flipping the second coin a thousand times and observing “heads” 450 times (what a coincidence again!) yields a p-value of 0.002, a “highly significant” result even though the relative difference between heads and tails is quite small.
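
The two coin experiments above can be reproduced with R's exact binomial test, `binom.test`. This is only a minimal sketch; the variable names are mine and not part of the original experiments.

```r
# Exact two-sided binomial tests against the null hypothesis of a fair coin (p = 0.5).
few_flips  <- binom.test(x = 2,   n = 10,   p = 0.5)  # 20% heads in 10 flips
many_flips <- binom.test(x = 450, n = 1000, p = 0.5)  # 45% heads in 1000 flips

# The large deviation on few flips is not "significant",
# the small deviation on many flips is.
print(few_flips$p.value)   # about 0.11
print(many_flips$p.value)  # about 0.002

# The observed deviation from 0.5 (the effect size) points the other way.
print(abs(2 / 10 - 0.5))      # 0.3
print(abs(450 / 1000 - 0.5))  # 0.05
```

The p-value shrinks with the number of flips, while the deviation from a fair coin is six times larger for the first coin.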

Correcting for multiple tests: The Bonferroni correction

Let’s say instead of making one experiment, we perform 25. What is the probability that we observe at least one “significant” result even though \(h_0\) is true for all experiments? If we use the standard cutoff for significance of 0.05, each experiment whose \(h_0\) is true will still yield a result we view as significant with a probability of 0.05. The probability of one experiment yielding no significant result is therefore 0.95. The probability of obtaining no significant result in all 25 (independent) experiments is then \(0.95^{25}\), and the probability of obtaining at least one significant result is \(1-(1-0.05)^{25} \approx 0.72\), i.e. the probability that we incorrectly report at least one significant result is pretty high. Bonferroni correction to the rescue: simply divide the significance level by the number of experiments and you get back the old probability of 0.05 for incorrectly reporting at least one of the experiments as significant: \(1-(1-0.05/25)^{25} \approx 0.049\). This limits the probability of incorrectly reporting a significant result. Bonferroni correction does so, however, at the expense of interpretability: try to understand what two Bonferroni-corrected significant results out of a set of ten experiments mean without resorting to hand waving: I can’t. This problem of interpretability is even worse in the G&B setting as they test the same \(h_0\) in each experiment, but the Bonferroni correction is usually performed under the assumption that each test evaluates a different \(h_0\), see this xkcd for an example.
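
The arithmetic above is easy to check in R. The sketch below assumes, as the calculation does, that the 25 experiments are independent; the variable names are mine.

```r
alpha   <- 0.05  # per-test significance level
n_tests <- 25    # number of independent experiments, all with a true h_0

# Probability of at least one false positive without any correction.
fwer_uncorrected <- 1 - (1 - alpha)^n_tests

# Bonferroni: test each experiment at alpha / n_tests instead.
alpha_bonferroni <- alpha / n_tests
fwer_bonferroni  <- 1 - (1 - alpha_bonferroni)^n_tests

print(round(fwer_uncorrected, 3))  # 0.723
print(alpha_bonferroni)            # 0.002
print(round(fwer_bonferroni, 3))   # 0.049
```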

See also “What is Type I error?” for a (short) discussion of problems when correcting for multiple hypothesis testing.

Digging into the G&B paper

With the statistical basics covered, I will now try to explain my criticism on the paper.

Repeated tests on the same dataset limit the variance in data. G&B argue that we should use repeated random splits to evaluate a system. This means that due to the cost of evaluation, fewer datasets will be used, at least in all the cases where computation time and cost can be a limiting factor. If one has the computational resources to perform tests on twenty datasets, I would rather use twenty different datasets than twenty random splits of a single dataset: using different datasets means we know that a system works well in a variety of settings and languages, whereas performing repeated splits results in us being super sure how the system works on the WSJ part of the PTB (in the case of this paper).

Yes, the approach can be applied to other languages and other corpora as well (in those cases we would again only learn about the performance on that corpus), and there is a good reason G&B chose the WSJ corpus: previous evaluations for the taggers they inspect are mainly available for this corpus and some taggers use features geared towards that corpus; G&B make this reasoning explicit and did not choose English as an implicit default. Nonetheless (and this is not a criticism of the G&B paper), I fear that a reduction in variety leads to English being the language picked, as it is easy to pick as a default language; that this happens can be seen e.g. from the Bender rule mentions on Twitter (the Bender rule states that you should name the languages you work on; it is often violated, and in those cases the language being used is mostly English).

The underlying assumptions for significance testing are not made explicit. The testing assumes a Bernoulli distribution for making errors, i.e. the probability of making an error is assumed to be the same for every token. This obviously does not hold true for PoS tagging. The word “to”, for example, has its own PoS tag and it is very unlikely that a tagger will make an error on this word. In contrast, the error rate for OOV tokens is much higher. A mismatch between the assumed and the real distribution of errors results in incorrect computations for p-values.

Repeated testing with Bonferroni correction is not what it seems. G&B propose to perform repeated significance testing on random splits of test/train data, perform Bonferroni correction on each of the tests and then count the number of significant results (i.e. number of tests that had p<0.05). One of the main findings of the paper is that with this method they find that two PoS taggers found to be significantly different on the standard split are not actually significantly different. Let us dig into the math behind that.

The McNemar test used by G&B counts 1) how many times the PoS tagger X was correct and Y was incorrect (called B in the McNemar test) and 2) how many times Y was correct but X was incorrect (called C in the McNemar test). Significance is computed by a binomial test on B against B+C with an assumed probability of 0.5. Using the coin example above, we flip a coin B+C times and it lands on “heads” B times. Obviously, the p-value obtained only depends on B and C.

Now we are comparing two PoS taggers with similar accuracies, say X has a (real) accuracy of 97% and Y has an accuracy of 96.9%. Will we be able to detect this difference? If we are lucky and the observed accuracy is the real underlying accuracy, X makes 1500 errors on a test set of 50k words and Y makes 1550 errors. We don't know how many errors are shared between both; let us just assume that the number is quite high, with 1300 tokens incorrectly labeled by both X and Y. Using our significance test, we obtain a p-value of 0.02; a significant result! If you want to try this yourself, here is the R code:

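# errors made by tagger X and by tagger Y on the 50k-token test set
X <- 1500
Y <- 1550
# errors shared by both taggers
both <- 1300
# errors made by only one of the two taggers
onlyX <- X-both
onlyY <- Y-both
# McNemar test as described above: binomial test of B against B+C with p=0.5
binom.test(onlyY, onlyX+onlyY, p=0.5)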

But maybe we are not yet convinced. So let us use the method proposed by G&B: we re-run the experiment 20 times to be extra sure. And, what luck, the PoS taggers behave exactly as they should, perfectly reproducing their true underlying accuracies each time. That is, in each and every one of the 20 tests, X performs better than Y. But following the approach by G&B, we don’t have a significant difference between X and Y anymore! How could that happen? It is due to the Bonferroni correction!

Remember, we did 20 tests and each time we obtained a p-value of 0.02. But because G&B perform Bonferroni correction, the critical value to call a p-value significant is not 0.05 but 0.05/20 = 0.0025. And since 0.02 > 0.0025, none of the twenty results are significant anymore, even though we can see that X consistently outperforms Y.
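
The effect of the correction on this idealized scenario can be sketched in R. The error counts are the hypothetical ones from the example above; treating all twenty splits as yielding exactly the same p-value is the same simplification as in the text.

```r
# Hypothetical error counts from the example (50k test tokens).
only_x <- 1500 - 1300  # tokens only X gets wrong
only_y <- 1550 - 1300  # tokens only Y gets wrong

# Exact McNemar test: binomial test on the discordant tokens.
p_single <- binom.test(only_y, only_x + only_y, p = 0.5)$p.value

# Twenty idealized splits that all yield the same p-value.
n_splits <- 20
p_values <- rep(p_single, n_splits)

# Bonferroni either lowers the threshold or, equivalently, inflates the p-values.
threshold  <- 0.05 / n_splits
p_adjusted <- p.adjust(p_values, method = "bonferroni")

print(round(p_single, 2))         # 0.02
print(threshold)                  # 0.0025
print(sum(p_values < threshold))  # 0 significant splits
print(sum(p_adjusted < 0.05))     # 0 as well
```

Both views give the same answer: none of the twenty splits counts as significant, although X beats Y by the same margin on every one of them.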

The takeaway is that the combination of repeated measurement, Bonferroni correction and counting the number of significant outcomes basically removes the possibility of detecting small differences in accuracy. It does not increase but actually decreases the sensitivity of the measurement.

The main claim of “We Need to Talk about Standard Splits”. In the Discussion, G&B note (emphasis mine):

First, we find that a system judged to be significantly better than another on the basis of performance on the standard split, *does not outperform that system* on re-annotated data or randomly generated splits, suggesting that it is “overfit to the standard split” and *does not represent a genuine improvement in performance*.

Here, the system judged better is the Stanford tagger and the system it is compared to is the LAPOS tagger. This is the “absence of evidence is not evidence of absence” trap: the paper did not report evidence that the Stanford tagger is in fact not a better-performing tagger than LAPOS; the experiments merely failed to produce any evidence at all, maybe due to the Bonferroni correction. It is not possible to draw any conclusion from the data presented in the paper other than that the difference in accuracy between both taggers is not extremely big.

Problems with p-values

Up to now, computing p-values was just treated as a given. While I am at it, I want to give some pointers to why I think that their use is problematic.

p<0.05 is a weak signal. A p-value is (an approximation of) the probability of the distribution assumed by \(h_0\) generating either the observed data point or a more extreme one. For the experiments in the G&B paper, this means that a p-value is the probability of two taggers A and B showing at least the difference in accuracy observed if both A and B adhere to the same underlying distribution. In other words, a p-value is the probability of the data given that our hypothesis \(h_0\) is true (\(\sum_{d'\geq d} P(d' | h_0)\)). What we are interested in, however, is the probability of the distribution being the correct distribution given the data we observed, \(P(h_0 | d)\). Trick question: what is the probability of \(h_0\) being true if we obtained a p-value of 0.05 (i.e. a significant result)?

We don't know, because we need Bayes' rule to obtain that probability (\(P(h_0 | d)\)), and for this we need the a priori probabilities of \(h_0\), \(h_1\), and \(d\)! Assuming an a priori probability for \(h_0\) and \(h_1\) of 0.5 each, \(P(h_1 | d)\) is not, as one might think, 0.95, but can at most be about 0.7 (see What does p<0.05 mean anyway?).
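
For readers wondering where a figure of roughly 0.7 can come from: one well-known calibration, the lower bound \(-e \, p \ln p\) on the Bayes factor in favour of \(h_0\) (Sellke, Bayarri and Berger), yields this number for p = 0.05 and equal priors. That this particular bound is the one behind the linked post is my assumption; the sketch below only reproduces the arithmetic.

```r
p        <- 0.05  # observed p-value
prior_h0 <- 0.5   # a priori probability of h_0 (and thus 0.5 for h_1)

# Lower bound on the Bayes factor in favour of h_0; valid for p < 1/e.
bayes_factor_min <- -exp(1) * p * log(p)

# Bayes' rule in odds form: posterior odds = Bayes factor * prior odds.
prior_odds_h0     <- prior_h0 / (1 - prior_h0)
posterior_odds_h0 <- bayes_factor_min * prior_odds_h0
posterior_h0_min  <- posterior_odds_h0 / (1 + posterior_odds_h0)

print(round(bayes_factor_min, 2))      # 0.41
print(round(posterior_h0_min, 2))      # 0.29
print(round(1 - posterior_h0_min, 2))  # 0.71, upper bound on P(h_1 | d)
```

Changing `prior_h0` shows how strongly the result depends on the prior.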

I would argue that the a priori probability of two systems sharing neither code nor computational approach actually behaving the same is near zero and most certainly not 50%. In this case, a small p-value gives nearly no additional evidence.

Significance cut-offs lead to p-hacking. People dislike SOTA hunting; should we now also start with significance hunting? I cannot go into details, but I have seen the negative impact of being incentivized to obtain significant results, i.e. p<0.05, because it is seen as the barrier to publication. Also, using a significance cut-off to dichotomize the continuous p-value and then reasoning about the resulting (non-)significance is problematic, as e.g. described in The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant.

p-values are often misunderstood. Look around and you will often find phrases such as “as p<0.05, we can reject \(h_0\)” or “because p>0.05, there is no significant difference” (→ “absence of evidence is not evidence of absence”). See also this post on how to correctly report p-value results; it is not easy to do it correctly! p-values are (incorrectly, see “p<0.05 is a weak signal” above) described in the G&B paper as follows (emphasis mine):

While the distribution of [the difference in test accuracy] is not obvious, the probability that there is no population-level difference in system performance (i.e., δ=0) can be computed indirectly using McNemar’s test […] *the (one-sided) probability of the null hypothesis is the probability of sampling \(n_{1>2}\) from this distribution.*

If p-values are misunderstood and misrepresented that often, they might just not be a suitable tool to communicate results.

Conclusions and alternatives

I am not opposed to random splits and multiple testing if the computational resources are available; in fact, I performed those for evaluating PoS taggers in my Bachelor’s thesis. However, performing significance tests with a (historically randomly chosen) cutoff of p<0.05 and counting the number of significant p-values makes the results hard to interpret and (as I have hopefully shown) easy to misinterpret.

Also, the approach basically discards the size of the effect. However, effect size is more important than significance. If a new system is slower than the current one, I would switch to it if the benefit is big and maybe not if the increase in accuracy is small. I would therefore not base the decision on p-values.

When comparing two NLP systems, instead of computing p-values over multiple datasets, a statistic for the distribution of accuracy differences would be more helpful. Take the accuracy differences and report min, max, and percentiles (or plot a histogram). This way, the reader can understand the size of the differences and see how much they vary by split.
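
Such a report takes only a few lines of R. The accuracy differences in this sketch are invented purely for illustration; they are not taken from the G&B paper or from any real tagger.

```r
# Hypothetical accuracy differences (system A minus system B, in percentage
# points) on ten random splits; the numbers are made up for illustration.
acc_diff <- c(0.12, 0.08, 0.15, -0.02, 0.10, 0.05, 0.11, 0.09, 0.14, 0.03)

# Smallest and largest observed difference.
diff_range <- range(acc_diff)
print(diff_range)

# Selected percentiles of the differences.
diff_percentiles <- quantile(acc_diff, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
print(diff_percentiles)
```

Calling `hist(acc_diff)` afterwards would show the same information as a histogram.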

When using different datasets instead of multiple splits of the same data (which I would recommend due to the larger coverage), you can generate a “true” difference estimation by performing bootstrapping on the accuracy differences. This will give you confidence estimates to report instead of simple point estimates. Most importantly, make the underlying evaluation data available as a resource so that others can compute different statistics based on their need.
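
A minimal sketch of such a bootstrap in R follows. It assumes a simple percentile bootstrap over per-dataset accuracy differences, which is only one of several bootstrap variants, and the data are again invented for illustration.

```r
set.seed(42)  # make the resampling reproducible

# Hypothetical accuracy differences (A minus B, in percentage points),
# one per dataset; the numbers are made up for illustration.
acc_diff <- c(0.31, 0.12, -0.05, 0.44, 0.20, 0.08, 0.27, 0.15, -0.10, 0.36)

# Percentile bootstrap: resample the datasets with replacement and
# recompute the mean difference each time.
n_boot     <- 10000
boot_means <- replicate(n_boot, mean(sample(acc_diff, replace = TRUE)))

# Point estimate plus a 95% percentile interval.
point_estimate <- mean(acc_diff)
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(point_estimate)
print(ci)
```

The interval conveys both the size of the difference and how much it varies across datasets, which a bare p-value does not.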


  • statistical significance is less important than effect size.
  • p-values are problematic.
  • if you perform multiple testing, do it on different datasets, not different splits of the same one.
  • Bonferroni correction reduces the sensitivity and is hard to interpret.
  • counting the ratio of significant results is not a measure of significance; it relies on the arbitrary 0.05 p-value cutoff, which has no theoretical grounding.
  • most importantly, we should never ever make certain p-values such as 0.05 a hurdle for publication.

Please comment on this blog post either here or via mail or Twitter; I will update this post to improve it.

Thanks to Kyle Gorman and Steven Bedrick for comments and helpful discussions on a draft version of this post. Thanks to Antoine Venant for asking the right questions to (hopefully) improve this post. Thanks to Christine Köhn for thorough feedback and comments. I would marry you if we were not already married.