We need to talk about significance tests

Posted on Thu 24 October 2019 in misc • Tagged with nlp

At ACL 2019, We Need to Talk about Standard Splits by Kyle Gorman and Steven Bedrick was gilded as one of five outstanding papers. The authors perform a replication study on PoS taggers to evaluate whether the reported accuracies can be reproduced and whether those accuracies hinge on using the standard split versus other splits or slightly differently annotated data. To perform the evaluation between taggers, the authors propose to perform random splits of the test- and training data and evaluate a natural language processor on these random splits instead of the standardized one. In this blog post, I will describe why I find that approach to be of limited merit and in some cases even problematic.

As the rest of this blog post mainly criticizes the statistics aspects of the G&B paper, I want to clearly state that in my opinion the replication performed in the paper is important, the random splits and re-evaluation both on the same and on the OntoNotes data as well as the error analysis provide a real benefit. My gripe is only with the statistical evaluation of the results.

Let us first get some statistical theory out of the way, please bear with me:

What are statistical tests anyway?

Most statistical tests observed in publications are frequentist tests resulting in p-values. p-values have been widely criticized for being hard to understand and for misrepresenting results. Generally, p-values smaller than 0.05 are seen as significant and significant results are seen as proof for an effect, whereas high p-values are seen as a sign of no effect. This is a misinterpretation of p-values!

Most tests work with two complementary hypotheses, usually called \(h_0\) and \(h_1\). If we (as in the G&B paper) want to compare two different NLP systems A and B, \(h_0\) would be that the accuracy of A and the accuracy of B are the same, i.e. the underlying accuracy distribution of both systems is the same. \(h_1\) is the alternative hypothesis, i.e. that the accuracies of A and B need to be modeled with two different distributions. Note that \(h_0\) is not that the accuracies are similar (maybe modulo some small ε) but actually identical. If we can reject \(h_0\) (both have the same accuracy) then we have proved that the alternative to \(h_0\) needs to hold true (\(h_1\), either A or B is a better system than the other one). And already the problems start: We cannot prove that \(h_0\) is wrong, we can at most find that it is less likely that \(h_0\) is true given the observation that we made than before the observation (but more on that problem in “Problems with p-values” below).

I said “probability distribution” above, but did not specify where the distribution comes from. In general, we have to make a hypothesis about the shape of the underlying distribution and model the experiments in a way that they can be seen as repeated draws from that distribution. G&B do this by modeling PoS tagging as a Bernoulli distribution B(X), which simply states that each PoS tagger gets each tag right with a probability of X. Tagging a single word can then be seen as drawing from B(X) to see whether the tag was correct or not.

With the Bernoulli distribution above, \(h_0\) is the hypothesis that the observations of tags being correct or incorrect are drawn from the same Bernoulli distribution. The p-value obtained by the significance test is the probability that the observed difference between the taggers (or a more extreme difference between the taggers) is generated by a single Bernoulli distribution as assumed by \(h_0\). This is not the probability of \(h_0\) being true (see “problems with p-values” below).

significant ≠ big

If someone reports highly significant results, it sounds like they made a big leap forwards. However, a p-value can also become smaller by simply testing more often. Let us rummage in my wallet and take two unfair coins: one lands on “heads” with a probability of only 0.2, the other one with a probability of 0.45. In this scenario, \(h_0\) would be that the coins are fair, i.e. the probability of heads is assumed to be 0.5 under \(h_0\). Surely the first one will yield more significant results?

Let us flip the first coin ten times. We observe “heads” twice (what a nice coincidence!). However, this does not tell us “significantly” that the coin is not fair even though we have observed “heads” only 20% of the time (a binomial test gives a p-value of 0.1).

However, flipping the second coin a thousand times and observing “heads” 450 times (what a coincidence again!) yields a p-value of 0.002, a “highly significant” result even though the relative difference between heads and tails is quite small.
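To check the two coin results yourself, here is a minimal Python sketch of an exact two-sided test against a fair coin. The function name `fair_coin_p_value` is made up for this illustration, and the sketch only covers the special case where \(h_0\) assumes a probability of 0.5, so that “more extreme” simply means “at least as far away from half heads”.

```python
from math import comb


def fair_coin_p_value(heads, flips):
    """Exact two-sided p-value for observing `heads` in `flips` tosses of a fair coin."""
    distance = abs(heads - flips / 2)
    # Count all outcomes at least as far from the expected flips/2 as the observation.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme / 2**flips


print(fair_coin_p_value(2, 10))      # first coin: 2 heads in 10 flips
print(fair_coin_p_value(450, 1000))  # second coin: 450 heads in 1000 flips
```

The first call gives roughly 0.1 and the second roughly 0.002, although the first coin is much further from being fair than the second one.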

Correcting for multiple tests: The Bonferroni correction

Let’s say instead of making one experiment, we perform 25. What is the probability that we observe at least one “significant” result even though \(h_0\) is true for all experiments? If we use the standard cutoff for significance of 0.05, each experiment whose \(h_0\) is true will still yield a result we view as significant with a probability of 0.05. The probability of one experiment yielding no significant result is therefore 0.95. The probability of obtaining no significant result in all 25 experiments is then \(0.95^{25}\), and the probability of obtaining at least one significant result is \(1-(1-0.05)^{25} \approx 0.72\), i.e. the probability that we incorrectly report at least one significant result is pretty high.

Bonferroni correction to the rescue: Simply divide the significance level by the number of experiments and you get the old probability of 0.05 for incorrectly reporting at least one of the experiments as significant back: \(1-(1-0.05/25)^{25} \approx 0.049\). This limits the probability of incorrectly reporting a significant result. Bonferroni correction does so, however, at the expense of interpretability: Try to understand what two Bonferroni-corrected significant results out of a set of ten experiments mean without resorting to hand waving: I can’t. This problem of interpretability is even worse in the G&B setting as they test the same \(h_0\) in each experiment, but the Bonferroni correction is usually performed under the assumption that each test evaluates a different \(h_0\), see this xkcd for an example.
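The arithmetic is easy to reproduce. The following Python sketch assumes, as the formulas above do, that the experiments are independent; `familywise_error` is a name chosen for this illustration only.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive in n independent tests with true h_0."""
    return 1 - (1 - alpha) ** n_tests


n_tests = 25
print(round(familywise_error(0.05, n_tests), 3))            # uncorrected → 0.723
print(round(familywise_error(0.05 / n_tests, n_tests), 3))  # Bonferroni-corrected → 0.049
```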

See also “What is Type I error?” for a (short) discussion of problems when correcting for multiple hypothesis testing.

Digging into the G&B paper

With the statistical basics covered, I will now try to explain my criticism on the paper.

Repeated tests on the same dataset limit the variance in data. G&B argue that we should use repeated random splits to evaluate a system. This means that due to the cost of evaluation, fewer datasets will be used, at least in all the cases where computation time and cost can be a limiting factor. If one has the computational resources to perform tests on twenty datasets, I would rather use twenty different datasets than twenty random splits of a single dataset: Using different datasets means we know that a system works well in a variety of settings and languages, whereas performing repeated splits results in us being super sure how the system works on the WSJ part of the PTB (in the case of this paper).

Yes, the approach can be applied to other languages and other corpora as well (in those cases we would again only learn about the performance on that corpus) and there is a good reason G&B chose the WSJ corpus: previous evaluations for the taggers they inspect are mainly available for this corpus and some taggers use features geared towards that corpus; G&B make this reasoning explicit and did not choose English as an implicit default. Nonetheless (and this is not a criticism of the G&B paper) I fear that a reduction in variety leads to English being the language picked as it is easy to pick as a default language; that this happens can be seen e.g. from the Bender rule mentions on twitter (the Bender rule states that you should name the languages you work on; it is often violated and in those cases, the language being used is mostly English).

The underlying assumptions for significance testing are not made explicit. The testing assumes a Bernoulli distribution for making errors, i.e. the probability of making an error is assumed to be the same for every token. This obviously does not hold true for PoS tagging. The word “to”, for example, has its own PoS tag and it is very unlikely that a tagger will make an error on these words. In contrast, the number of errors for OOV tokens is much higher. A mismatch between the assumed and the real distribution of errors results in incorrect computations for p-values.

Repeated tests with Bonferroni corrections are not what they seem. G&B propose to perform repeated significance testing on random splits of test/train data, perform Bonferroni correction on each of the tests and then count the number of significant results (i.e. number of tests that had p<0.05). One of the main findings of the paper is that with this method they find that two PoS taggers found to be significantly different on the standard split are not actually significantly different. Let us dig into the math behind that.

The McNemar test used by G&B counts 1) how many times the PoS tagger X was correct and Y was incorrect (called B in the McNemar test) and 2) how many times Y was correct but X was incorrect (called C in the McNemar test). Significance is computed by a binomial test on B against B+C with an assumed probability of 0.5. Using the coin example above, we flip a coin B+C times and it lands on “heads” B times. Obviously, the p-value obtained only depends on B and C.

Now we are comparing two PoS taggers with similar accuracies, say X has a (real) accuracy of 97% and Y has an accuracy of 96.9%. Will we be able to detect this difference? If we are lucky and the observed accuracy is the real underlying accuracy, X makes 1500 errors on a test set of 50k words and Y makes 1550 errors. We don't know how many errors are shared between both, let us just assume that the number is quite high with 1300 tokens incorrectly labeled by both X and Y. Using our significance test, we obtain a p-value of 0.02; a significant result! If you want to try this yourself, here is the R code:

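X <- 1500
Y <- 1550
both <- 1300
onlyX <- X-both
onlyY <- Y-both
# binomial test on the disagreements with an assumed probability of 0.5
binom.test(onlyX, onlyX+onlyY, p=0.5)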

But maybe we are not yet convinced. So let us use the method proposed by G&B: we re-run the experiment 20 times to be extra sure. And, what luck, the PoS taggers behave exactly as they should, perfectly reproducing their true underlying accuracies each time. That is, in each and every one of the 20 tests, X performs better than Y. But following the approach by G&B, we don’t have a significant difference between X and Y anymore! How could that happen? It is due to the Bonferroni correction!

Remember, we did 20 tests and each time we obtained a p-value of 0.02. But because G&B perform Bonferroni correction, the critical value to call a p-value significant is not 0.05 but 0.05/20 = 0.0025. And since 0.02 > 0.0025, none of the twenty results are significant anymore, even though we can see that X consistently outperforms Y.
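The same computation as a small Python sketch, for readers without R at hand. It implements the binomial test on the disagreement counts described above for the special case of an assumed probability of 0.5; the function name `mcnemar_exact_p` is made up for this illustration and this is not the code used by G&B.

```python
from math import comb


def mcnemar_exact_p(only_x_wrong, only_y_wrong):
    """Two-sided exact McNemar test: a fair-coin binomial test on the disagreements."""
    n = only_x_wrong + only_y_wrong
    distance = abs(only_x_wrong - n / 2)
    # Sum the probabilities of all outcomes at least as lopsided as the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance)
    return extreme / 2**n


# 1500 and 1550 errors in total, 1300 of them shared by both taggers.
p = mcnemar_exact_p(1500 - 1300, 1550 - 1300)
n_splits = 20
print(p < 0.05)             # significant on a single split → True
print(p < 0.05 / n_splits)  # significant after Bonferroni correction → False
```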

The takeaway is that the combination of repeated measurements, Bonferroni correction and counting the number of significant outcomes basically removes the possibility to detect small differences in accuracy. It does not increase but actually decreases the sensitivity of the measurement.

The main claim of “We Need to Talk about Standard Splits”. In the Discussion, G&B note (emphasis mine):

First, we find that a system judged to be significantly better than another on the basis of performance on the standard split, does not outperform that system on re-annotated data or randomly generated splits, suggesting that it is “overfit to the standard split” and does not represent a genuine improvement in performance.

Here, the system judged better is the Stanford tagger and the tagger it is compared to is the LAPOS tagger. This is the “absence of evidence is not evidence of absence” trap: The paper did not report evidence that LAPOS is in fact not a better-performing tagger than the Stanford one, the experiments merely failed to produce any evidence at all, maybe due to the Bonferroni correction. It is not possible to draw any conclusion from the data presented in the paper other than that the difference in accuracy between both taggers is not extremely big.

Problems with p-values

Up to now, computing p-values was just treated as a given. While I am at it, I want to give some pointers to why I think that their use is problematic.

p<0.05 is a weak signal. A p-value is (an approximation of) the probability of the distribution assumed by \(h_0\) generating either the observed data point or a more extreme one. For the experiments in the G&B paper, this means that a p-value is the probability of two taggers A and B showing at least the difference in accuracy observed if both A and B adhere to the same underlying distribution. In other words, a p-value is the probability of the data given that our hypothesis \(h_0\) is true (\(\sum_{d'\geq d} P(d' |h_0)\)). What we are interested in is, however, the probability of the distribution being the correct distribution given the data we observed, \(P(h_0| d)\). Trick question: What is the probability of \(h_0\) being true if we obtained a p-value of 0.05 (i.e. a significant result)?

We don't know because we need Bayes' rule to obtain that probability (\(P(h_0| d)\)) and for this, we need the a priori probabilities of \(h_0\), \(h_1\), and \(d\)! Assuming an a priori probability for \(h_0\) and \(h_1\) of 0.5 each, \(P(h_1| d)\) is not – as one might think – 0.95, but can at most be about 0.7 (see What does p<0.05 mean anyway?).
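One way to arrive at a number of this size is the lower bound \(-e \cdot p \cdot \ln p\) on the Bayes factor in favour of \(h_0\) given by Sellke, Bayarri and Berger. Whether the linked post uses exactly this calibration is an assumption on my side, so take the following Python sketch as an illustration and not as the derivation; the function name `max_posterior_h1` is made up.

```python
from math import e, log


def max_posterior_h1(p_value, prior_h0=0.5):
    """Upper bound on P(h_1 | d), using -e * p * ln(p) as minimum Bayes factor for h_0."""
    if not 0 < p_value < 1 / e:
        raise ValueError("the bound only holds for 0 < p < 1/e")
    min_bayes_factor = -e * p_value * log(p_value)
    prior_odds_h0 = prior_h0 / (1 - prior_h0)
    # Posterior odds = Bayes factor * prior odds (Bayes' rule in odds form).
    posterior_odds_h0 = min_bayes_factor * prior_odds_h0
    min_posterior_h0 = posterior_odds_h0 / (1 + posterior_odds_h0)
    return 1 - min_posterior_h0


print(round(max_posterior_h1(0.05), 2))  # → 0.71
```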

I would argue that the a priori probability of two systems sharing neither code nor computational approach actually behaving the same is near zero and most certainly not 50%. In this case, a small p-value gives nearly no additional evidence.

Significance cut-offs lead to p-hacking. People dislike SOTA hunting, should we now also start with significance hunting? I cannot go into details, but I have seen the negative impact of being incentivized to obtain significant results, i.e. p<0.05 because it is seen as the barrier to publication. Also, using a significance cut-off to dichotomize the continuous p-value and then reasoning about the resulting (non-)significance is problematic, as e.g. described in The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant.

p-values are often misunderstood. Look around and you will often find phrases such as “as p<0.05, we can reject \(h_0\)” or “because p>0.05, there is no significant difference” (→ “absence of evidence is not evidence of absence”). See also this post on how to correctly report p-value results; it is not easy to do it correctly! p-values are (incorrectly, see “p<0.05 is a weak signal” above) described in the G&B paper as follows (emphasis mine):

While the distribution of [the difference in test accuracy] is not obvious, the probability that there is no population-level difference in system performance (i.e., δ=0) can be computed indirectly using McNemar’s test […] the (one-sided) probability of the null hypothesis is the probability of sampling \(n_{1>2}\) from this distribution.

If p-values are misunderstood and misrepresented that often, they might just not be a suitable tool to communicate results.

Conclusions and alternatives

I am not opposed to random splits and multiple testing if the computational resources are available; in fact, I performed those for evaluating PoS taggers in my Bachelor’s thesis. However, performing significance tests with a (historically randomly chosen) cutoff of p<0.05 and counting the number of significant p-values makes the results hard to interpret and (as I have hopefully shown) easy to mis-interpret.

Also, the approach basically discards the size of the effect. However, effect size is more important than significance. If a new system is slower than the current one, I would switch to it if the benefit is big and maybe not if the increase in accuracy is small. I would therefore not base the decision on p-values.

When comparing two NLP systems, instead of computing p-values over multiple datasets, a statistic for the distribution of accuracy differences would be more helpful. Take the accuracy differences and report min, max, and percentiles (or plot a histogram). This way, the reader can understand the size of the differences and see how much they vary by split.
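Such a summary takes only a few lines. Here is a minimal Python sketch using the standard library; the function name and the accuracy values are made up for illustration and are not results from any paper.

```python
import statistics


def summarize_differences(acc_a, acc_b):
    """Min, quartiles and max of per-split accuracy differences (A minus B)."""
    diffs = sorted(a - b for a, b in zip(acc_a, acc_b))
    q1, median, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return {"min": diffs[0], "q1": q1, "median": median, "q3": q3, "max": diffs[-1]}


# Made-up accuracies (in percent) on five splits, for illustration only.
tagger_a = [97.1, 96.8, 97.3, 96.9, 97.0]
tagger_b = [96.9, 96.9, 97.0, 96.7, 96.8]
print(summarize_differences(tagger_a, tagger_b))
```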

When using different datasets instead of multiple splits of the same data (which I would recommend due to the larger coverage), you can generate a “true” difference estimation by performing bootstrapping on the accuracy differences. This will give you confidence estimates to report instead of simple point estimates. Most importantly, make the underlying evaluation data available as a resource so that others can compute different statistics based on their need.
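A percentile bootstrap is one simple way to do this. The Python sketch below resamples the per-dataset accuracy differences with replacement and reports an interval for their mean; the function name, the number of resamples and the data are assumptions made for this illustration.

```python
import random
import statistics


def bootstrap_mean_ci(diffs, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean accuracy difference."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((1 - confidence) / 2 * n_resamples)]
    upper = means[int((1 + confidence) / 2 * n_resamples) - 1]
    return lower, upper


# Made-up accuracy differences (percentage points) on eight datasets.
diffs = [0.2, -0.1, 0.3, 0.2, 0.4, 0.0, 0.1, 0.3]
print(bootstrap_mean_ci(diffs))
```

With only a handful of datasets the interval will be wide, which is in itself useful information for the reader.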


  • statistical significance is less important than effect size.
  • p-values are problematic.
  • if you perform multiple testing, do it on different datasets, not different splits of the same one.
  • Bonferroni correction reduces the sensitivity and is hard to interpret.
  • counting the ratio of significant results is not a measure of significance, it relies on the arbitrary 0.05 p-value cutoff which has no theoretical grounding.
  • most importantly, we should never ever make certain p-values such as 0.05 a hurdle for publication.

Please comment on this blog post either here or via mail or twitter; I will update this post to improve it.

Thanks to Kyle Gorman and Steven Bedrick for comments and helpful discussions on a draft version of this post. Thanks to Antoine Venant for asking the right questions to (hopefully) improve this post. Thanks to Christine Köhn for thorough feedback and comments. I would marry you if we were not already married.


Some tips on writing software for research

Posted on Wed 17 July 2019 in misc • Tagged with programming, nlp

These are my notes for a presentation in our group at Saarland University. The presentation was mainly about software written as part of experiments in NLP, but most of the tips do not focus on NLP but rather on writing code for reproducible experiments that involve processing data sets. This kind of software is often only used by a small group of people (down to groups of one). I neither claim that this is the ultimate guide nor that I actually follow all my own advice (but I try!).

Why you should care

Our success is not measured by how nice our software looks, the important aspects are scientific insights and publications. So why should you spend additional time on your code, which is just a means to an end? There are several good reasons:

Others might want to build upon your results. This often means running your software. If that is not easy, they might not do it. I once spent a week trying to get a previously published system to run and failed. That wasted my time and reduced the impact of that publication as it is impossible to extend and cite that research.

You might want or need to extend your previous research. Maybe you did some experiments at the start of your PhD but need to change some parameters to make it compatible with other experiments. Do not assume that future-you will still know all intricacies you currently know!

Good software increases your confidence in results. You will publish a paper with your name on it. The more hacky the process of obtaining results, the less sure you can be that the results are actually correct. Manual steps are a source of error, minimize them!

Less work in the long run. Made an error somewhere? Just start over! If your experiments run with minimal manual work, restart is easy and CPU time is cheaper than your time. Once you run your experiment pipeline several times, additional work into automation pays off.

You have an ethical obligation to document your experiments. A paper is usually not enough to understand all details! The code used to run the experiments can be seen as the definition of your experiments. Automation is documentation!

Good software enables teamwork. Automated experiments mean that all your collaborators will also be able to do the experiments. You don't want to be the one person knowing how to run the experiments (you will have to do all the work) and don't want to be the one not knowing it (unable to do the work). Worst case: You need knowledge from several people to run the experiment and no one can reproduce the results on their own.

The motto of this blog post
Do it twice or more
Write a script that works for you
Enjoy the summer

Some basics

Hopefully, I have convinced you by now. So let's start with some basics. No need to be perfect, small steps help.

Keep source data, in progress data, and results separate. Having different directories enables you to just rm -rf in_progress results and start afresh. If you save intermediate results for long-running computations: make sure you can always distinguish between finished and in-progress results. E.g., you might have a power outage and otherwise not know which computations are complete and which have to be started again.
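One common way to keep finished and in-progress results apart is to write under a temporary name and rename the file once it is complete. A minimal Python sketch, assuming a POSIX file system where the rename is atomic; the `.in_progress` suffix and the function names are a made-up convention.

```python
import os


def write_result(path, text):
    """Write to a temporary name first, so `path` only ever holds finished output."""
    tmp_path = path + ".in_progress"  # hypothetical naming convention
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    # The final name appears only after the file has been written completely.
    os.replace(tmp_path, path)


def is_finished(path):
    """A result counts as finished as soon as it exists under its final name."""
    return os.path.exists(path)
```

After a crash, everything that still carries the temporary suffix can be deleted and recomputed.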

Add a shebang to your scripts and make them executable. Bash: #! /bin/bash, Python 3: #! /usr/bin/env python3. This shows everyone which files are intended to be run and which are not. Also, the shebang makes sure that the right interpreter is used. You do not want a bash script run by /bin/sh and get weird results.

Add a comment header with author, license, documentation. Others know who to contact, know how they are allowed to use and distribute your code and what it does. An example from my code:

#!/usr/bin/env python3
# author: Arne Köhn <arne@chark.eu>
# License: Apache 2.0

# reads an swc file given from the command line and prints the tokens
# one on each line, and with empty lines for sentence boundaries.

This gives a potential user at least a first idea of what to expect. Add a README to your project to give a high-level overview.

Use logging frameworks. Do not simply print to stdout or stderr, use a logging framework to distinguish between debug and serious log messages. Do not debug with print statements, use a logger for that! That way, you can simply keep the logging code but disable the specific sub-logger. Java: use log4j2, Python: logging.log. Log to meaningfully named log files (e.g. Y-M-D-experimentname.log) so you can find the logs if you find problems with an experiment later on.
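For Python, this could look like the following minimal sketch. The logger names `experiment.tokenizer` and `experiment.parser`, the function name and the `logs` directory are made up for this example.

```python
import logging
import os
from datetime import date


def configure_logging(experiment_name, log_dir="logs"):
    """Log everything to a dated, meaningfully named file."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, f"{date.today().isoformat()}-{experiment_name}.log")
    logging.basicConfig(
        filename=log_path,
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace handlers from earlier calls (Python 3.8+)
    )
    return log_path


# Hypothetical sub-loggers for two parts of an experiment.
tokenizer_log = logging.getLogger("experiment.tokenizer")
parser_log = logging.getLogger("experiment.parser")

# Keep the debug calls in the code, but silence one noisy component.
tokenizer_log.setLevel(logging.WARNING)
```

After calling `configure_logging("myexperiment")`, a call such as `parser_log.debug("...")` ends up in the log file, while debug messages of `tokenizer_log` are dropped without touching the code that emits them.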

Version control

Use a version control system! Commit early, commit often. Git commits: first line 50 chars, second line empty, detailed log message afterwards. Example:

Add script that handles installer building

That script will automatically fetch the correct versions of files
from the build.gradle file and run install4j if it is in the path.

Also don't use master-SNAPSHOT but a fixed version of each
dependency.  Otherwise we end up with non-reproducible installer

This way, tools will be able to display your commit message correctly.

Rewrite history. Committing often means you might want to commit incomplete code or you find errors later on and fix those. Merge those commits into one and reword the commit history to make it more understandable by others. This also lets you commit with nonsense commit messages locally without embarrassment. Simply perform a git rebase -i to merge the commits later on. Note: don't rewrite history that you already pushed and that others might rely on.

Use rebase when pulling. git pull --rebase will move your local commits on top of the ones already in the upstream repository. A sequential history without merge commits is much easier to read.

Tag important commits. git tag -a [tagname] creates a new tag to find this specific revision. Ran experiments with a specific version? git tag -a acl2019 and people (including you!) will be able to find it.

Use GitHub issues and wiki. You can subscribe to a repository and will get mails for all issues, essentially using it as a mailing list with automatic archival. The wiki can be cloned with git and used as centralized offline documentation. This also works with other hosting systems such as gitlab.


Use a build system. Eclipse projects do not count! Gradle is easy to understand and use. It might get more complicated if your project gets complicated, but it will definitely save you time.

A gradle file for a standard project looks like this, not much work for the time you save:

plugins { id 'pl.allegro.tech.build.axion-release' version '1.10.1' }
repositories { maven { url 'https://jitpack.io' } }
dependencies { api 'com.github.coli-saar:basics:2909d4b20d9a9cb47ef' }

group = 'com.github.coli-saar'
version = scmVersion.version
description = 'alto'

The plugin even takes the correct version number from your git tag when creating artifacts.

Make libraries available as a maven dependency. This is trivial with jitpack: It will package libraries for you to use on demand from git repositories. This is already shown in the example above: once jitpack is added as a repository, you can add dependencies from any git repository by simply depending on com.github.USER:REPO:VERSION with VERSION being a git revision or a git tag. jitpack will take care of the rest.

With the plugin described above, a release can be done by simply tagging a version. Only with this ease of use will you actually create releases more than once in a blue moon. In old projects, it sometimes took me half a day to do a proper release because I couldn’t remember all the steps involved. If it is hard to do a release, it will just be dropped because it is not a priority for your research.


Don't write top-level code. Use functions to subdivide your code, run your main function using this at the end:

if __name__ == "__main__":
    main()

This way, you can load your code into an interactive python shell (e.g. using ipython) and play with it. It also becomes possible to load the code as a library into another program.

Parse your arguments with argparse. This makes your program more robust and provides automatic documentation for people wanting to run your code. Alternative: docopt
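A minimal sketch that combines both tips, argparse and the main guard. The options `--splits` and `--seed` are invented for this example; replace them with whatever your experiment needs.

```python
import argparse


def parse_args(argv=None):
    """Parse command line options; argv=None means "use sys.argv"."""
    parser = argparse.ArgumentParser(description="Run a (hypothetical) tagging experiment.")
    parser.add_argument("--splits", type=int, default=20, help="number of random splits")
    parser.add_argument("--seed", type=int, default=0, help="random seed")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(f"running {args.splits} splits with seed {args.seed}")


if __name__ == "__main__":
    main()
```

Running the script with `--help` prints the generated documentation, and passing an explicit list to `parse_args` makes the parsing easy to try out in an interactive shell.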

Declare the packages your program requires. Write them in requirements.txt so people can run pip3 install -r requirements.txt to obtain all libraries.

Use proper indentation and comments. Indent with 2 or 4 spaces, NO TABS! (This is not just my opinion, it’s in the official guidelines.) Use docstrings for comments:

def frobnicate():
    """Computes the inner square sum of an infinite derived integral."""
    return 0

Your co-authors and all tools will thank you.

Instruct others how to use your libraries. Similar to jitpack for Java, you can use git URLs with pip: pip3 install git+https://example.com/usr/foo.git@TAG. Write this in your README so others can use your code (and use it to include foreign code).


Perform proper error checking. Exit on error and force all variables to be initialized. Add these two lines to the top of your bash scripts:

# exit on error
set -e
# exit on use of an uninitialized variable
set -u

Otherwise, the script will continue even when a command fails and interpret uninitialized variables as simply being empty. Not good if you have a typo in a variable name!

Avoid absolute directories. Want to run a program from your bash script? Either change the working directory to the directory your bash script is in or construct the path to the other program by hand:

# obtain the directory the bash script is stored in
DIR=$(cd "$(dirname "$0")" && pwd)

# Now either run the command by constructing the path
# (other_program is a placeholder name):
"$DIR/other_program"
# or change the directory if that is more convenient:
cd "$DIR"

Create output directories if they don't exist. Running mkdir foo will fail if foo already exists, creating the directories by hand is burdensome. Use mkdir -p foo to create the directory, it will not fail if foo exists. Additional bonus: You can create several directories at once: mkdir -p foo/bar/baz will create all nested directories at once.

Use [[ instead of [ for if statements. [[ is built-in bash syntax, whereas [ is treated like an ordinary command (it also exists as an external program), which can behave weirdly: if [ $foo == "" ] can never become true because an empty variable $foo would lead to a syntax error(!) – simply use [[.

Data formats

Use a common data format. This saves you from writing your own parser for your data format. CSV is supported everywhere. XML can be read and generated by every programming language. json is well defined. “Excel” is not a good data format.

Keep data as rich as (reasonably) possible. If you throw away information you don’t need at the moment, you (or someone else) might regret that because this additional information would have enabled interesting research. For example, the Spoken Wikipedia Corpora contain information about sub-word alignments, normalization, section structure (headings, nesting) etc. None of the research based on the SWC needs all of it and it was a lot of work for us, but it enabled research we did not think of¹ when creating the resource. What does this have to do with data formats? Your data format has to handle all your annotations. Plain text with timestamps just was not powerful enough for what we wanted to annotate, even though it is standard for speech recognition, for example.

XML is actually pretty good. Young people these days² like to dislike XML due to its verbosity. Yes, it is verbose. But it has redeeming qualities:

Continuous integration

Continuous integration automatically builds and tests your software whenever you push new commits or create a pull request. It makes sure that the software actually still works in a clean environment and not just on your laptop. Integration with GitHub or Gitlab is pretty easy; I will show an example for GitHub with TravisCI, the de-facto default for CI with GitHub.

All you have to do is to add a .travis.yml file into the root of your project which contains the necessary information and enable CI by logging into travis using your GitHub credentials and selecting the repositories you want to enable CI for.

The configuration is quite simple: If you use a standard build system (see above), you only need to state the language. For example, language: java is enough configuration to get going with a gradle project.

However, you can do more: Travis can create releases based on tags for you:

language: java
deploy:
  provider: releases
  api_key:
    secure: [omitted]
  skip_cleanup: true
  file_glob: true
  file: build/libs/alto-*-all.jar
  on:
    tags: true

will create a new release for each tag (last lines) by building the project as always, performing no cleanup (important!) and then pushing a specific jar into a new GitHub release (provider: releases means GitHub release). We build alto that way and all its releases are automatically created.

The only slightly non-trivial part is the api key. This is created by the travis command line tool.

Automate experiment evaluation

Automate table creation. Use python, sed, awk, bash (or whatever else) to automatically gather your results. Every manual step is a possibility for copy-paste errors.
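As an illustration, here is a minimal Python sketch that turns a CSV file with results into a LaTeX table. The column names, the values and the file name mentioned in the comment are made up; the point is only that no number is ever copied by hand.

```python
import csv
import io


def csv_to_latex(csv_file, columns):
    """Turn selected columns of a CSV results file into a LaTeX tabular."""
    rows = list(csv.DictReader(csv_file))
    lines = [
        "\\begin{tabular}{l" + "r" * (len(columns) - 1) + "}",
        " & ".join(columns) + " \\\\",
        "\\hline",
    ]
    for row in rows:
        lines.append(" & ".join(row[column] for column in columns) + " \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)


# Made-up results; in a real pipeline this would be open("results/accuracy.csv").
results = io.StringIO("system,accuracy\nA,97.0\nB,96.9\n")
print(csv_to_latex(results, ["system", "accuracy"]))
```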

Make your output compatible with a plotting toolkit. Same as above: If your output can be read by gnuplot (or matplotlib or ggplot), you can automate your pipeline. Export to png and tikz so you can have a look at your data (png) but also have high quality vector graphics in your paper (tikz). tikz is also great because it enables manual post-processing, e.g. if you want to do something not obtainable through your plotting toolkit such as forcing a numeral font for the y-axis.

When you are done

Clean up your pipeline. Often, an experiment grows and changes over time. Once it is done, make it automatically download needed data and automate leftover manual steps. Remember: automation is documentation – people can read the code to see what you did.

Make your data and code available. Remember: Bad code is better than no code – do not wait because you want to clean up your code some time in the future.

That’s it

Thanks for reading! I hope some points were of interest to you; if there is something not described clearly enough (or plain wrong) or I missed something important: Do let me know! I will update this post if new points come up.


  • 2019-07-18: based on my best critic Christine.
  1. Unfortunately, also an example of papers not citing the resources they use
  2. We have students who were born after Without me was released!

Why we chose XML for the SWC annotations

Posted on Wed 29 November 2017 in misc • Tagged with corpora

I was asked why we use XML instead of json for the Spoken Wikipedia Corpora:

As mentioned, we actually started with json. The first version of the SWC was actually annotated using json and I converted that to XML.

The original json more or less looked like this:

{ "sentence_starts": [0,10,46,72],
  "words": [
      {"token" : "hello", "start": 50, "end": 370},
      ["more tokens here"]
  ]
}

To obtain the second sentence, you needed to get sentence_starts[1] and sentence_starts[2], then obtain the sub-list of words defined by those bounds. You can notice the downside of data normalization.
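To make the extra bookkeeping concrete, here is a small sketch in Python. It assumes that the entries of `sentence_starts` are indices into the `words` list; the miniature document and the helper `sentence` are made up for illustration:

```python
import json

# Assumed layout, following the description above: "sentence_starts" holds
# indices into "words", each marking the first word of a sentence.
doc = json.loads("""
{ "sentence_starts": [0, 2, 3],
  "words": [
      {"token": "hello", "start": 50, "end": 370},
      {"token": "world", "start": 380, "end": 700},
      {"token": "bye", "start": 900, "end": 1200},
      {"token": "again", "start": 1300, "end": 1600}
  ]
}
""")


def sentence(doc, index):
    """Return the words of the sentence with the given zero-based index."""
    starts = doc["sentence_starts"]
    begin = starts[index]
    # The last sentence has no following start, so it runs to the end.
    end = starts[index + 1] if index + 1 < len(starts) else len(doc["words"])
    return doc["words"][begin:end]


second = sentence(doc, 1)
print([w["token"] for w in second])
```

Every consumer of the data has to repeat this index arithmetic, including the special case for the last sentence.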

The XML looked like this:

<sentence>
  <token start="50" end="370">hello</token>
  [more tokens here]
</sentence>

You can see that it is much more succinct. To obtain the second sentence, just do an xpath query: sentence[2] (XPath counts from 1; more about using xpath at the bottom of this post).

But now we have much more structure, as you can see in our RelaxNG definition (have a look, it's easy to read!). We have

  • sections which can be nested
  • parts which were ignored during the alignment
  • sentences containing tokens, containing normalizations, containing phonemes

All in all, the annotation is a fairly elaborate typed tree. json is actually less succinct if you want to represent such data because there are no types. Try to represent <s><t>foo</t> <t>bar</t></s> in json:

{ "type": "s",
  "elems": [{"type": "t", "elems": ["foo"]},
            " ",
            {"type": "t", "elems": ["bar"]}]
}

The distinction between data and annotation is not clear in json: in XML, everything that is not an XML tag is character data. To get the original text, just strip all XML tags. In json you would somehow have to externally define what the original data is and what the annotation is. This is important because we keep character-by-character correspondence to the original data. This is a very cool feature because you can cross-reference our annotations with the html markup to e.g. look at the impact of <b> tags on pronunciation.
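Stripping the tags really is a one-liner. The sketch below uses Python's standard library on the tiny example from above rather than on a real SWC file:

```python
import xml.etree.ElementTree as ET

annotated = "<s><t>foo</t> <t>bar</t></s>"

# itertext() yields all character data in document order, which is
# the annotated text with every tag stripped.
original_text = "".join(ET.fromstring(annotated).itertext())
print(original_text)
```

The whitespace between the two tokens survives because it is character data like any other, which is what makes the character-by-character correspondence possible.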

Validating your annotations

Last but not least, XML is much easier to validate (and given the complexity of our annotation, that was necessary!). The RelaxNG definition is human readable (so people can learn about the schema) and used for validation at the same time. Having a schema definition helped quite a bit because it was a central document where we could collaborate about the annotation schema. The automatic validation helped to catch malformed output – which happened more than once and was usually based on some edge cases. Without the validation, we wouldn’t have caught (and corrected) them. To my knowledge, there are no good json validators that check the structure and not just whether it is valid json.

Update: I had a look at json-schema and will give you a short comparison. In our annotation, a section has a title and content. The title is a list of tokens, some of which might be ignored. The content of a section can contain sentences, paragraphs, subsections or ignored elements.

This is what the RelaxNG definition for that part looks like:

## A section contains a title and content. Sections are nested,
## e.g. h3 sections are stored in the content of the parent h2
## section.
Section = element section {
    attribute level {xsd:positiveInteger},
    element sectiontitle { MAUSINFO?, (T | element ignored {(T)*})* },
    element sectioncontent { (S|P|Section|Ignored)* }
}

I think it is fairly easy to read if you are acquainted with standard EBNF notation – | is an or, * denotes repetition and so on.

Compare my attempt at using json-schema:

{ "section":
  { "type": "object",
    "required": ["elname", "elems"],
    "properties":
      { "elname": {"type": "string",
                   "pattern": "^section$"},
        "elems": {"type" : "array",
                  "items": ["and all the interesting parts are still missing"]}}}}

That part only defines that I want to have a dictionary with elname=section and it needs to have an array for the subelements. I just gave up after a few minutes :-)

Working with XML annotations

Say you want to work with an XML annotated corpus. The easiest way to do that is XPath.

You don't care about our fancy structure annotation and just want to work on the sentences in SWC? Use this XPath selector: //s. // means descendant-or-self and s is just the element type you are interested in, i.e. you select all sentence structures that are somewhere under the root node. To give you an example in python:

import lxml.etree as ET
root = ET.parse("aligned.swc")
sentences = root.xpath("//s")

You can attach predicates in square brackets. count(t)>10 only selects sentences that have more than ten tokens:

sentences_longer_ten = root.xpath("//s[count(t)>10]")

You are only interested in long sections? Let's get the sections with more than 1k tokens! Note the .//, the leading dot means “start descending from the current node”, with just an //, you would count from the root node and not from each section.

long_sections = len(root.xpath("//section[count(.//t)>1000]"))

You want to get the number of words (i.e. tokens that have a normalization) which were not aligned? It’s easy: select all tokens with an n element as child but without an n element that has a start attribute:

number_unaligned_words = root.xpath('count(//t[n][not(n[@start])])')

Note that we used count() to get a number instead of a list of elements. The aligned words have n subnodes but no n without a start attribute (there is no universal quantifier in xpath, so you have to use the equivalent not-exists formulation):

aligned_words = root.xpath('//t[n][not(n[not(@start)])]')
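If lxml is not at hand, the same two selections can be approximated with Python's standard library. Its ElementTree only supports a subset of XPath (no count() and no not()), so the negation moves into Python. The miniature document below, including the pronunciation attribute, is made up for illustration and is much simpler than a real SWC file:

```python
import xml.etree.ElementTree as ET

# Made-up miniature document: t = token, n = normalization.
root = ET.fromstring("""
<article>
  <s>
    <t>hello<n pronunciation="hello" start="50" end="370"/></t>
    <t>,</t>
    <t>world<n pronunciation="world"/></t>
  </s>
</article>
""")

# Words are tokens with at least one n child.
words = [t for t in root.iter("t") if t.find("n") is not None]

# Unaligned: no n child carries a start attribute.
unaligned_words = [t for t in words if t.find("n[@start]") is None]

# Aligned: every n child carries a start attribute.
aligned_words = [t for t in words
                 if all("start" in n.attrib for n in t.findall("n"))]

print(len(unaligned_words), [t.text for t in aligned_words])
```

The all() in the last selection is the universal quantifier that xpath lacks; in xpath it has to be spelled as a double negation.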

You want to know the difference between start times for phoneme-based and word-based alignments? Here you are!

phon_diffs = [n.xpath("sum(./ph[1]/@start)")
              - int(n.attrib["start"])
              for n in root.xpath("//n[ph and @start]")]

We first obtain the normalizations that have word- and phoneme-based alignments (//n[ph and @start]) and then use a list comprehension to compute the differences between the word-based alignments (n.attrib["start"]) and the start of the first phoneme (n.xpath("sum(./ph[1]/@start)")) – the sum() is just a hack to obtain a number (lxml returns a float) instead of a string…

And that’s it! In my opinion, it’s easier than working with deeply nested json data structures. Questions, comments? Send me a mail.


GamersGlobal Comment Corpus released

Posted on Sat 18 November 2017 in nlp • Tagged with corpus

Today I'm releasing the GamersGlobal comment corpus. GamersGlobal is a German computer gaming site (and my favorite one!) with a fairly active comment section below each article. This corpus contains all comments by the 20 most active users up to November 2016.

I use this corpus for teaching, mainly author attribution using Bayes classifiers and language modeling. It's just more fun to use interesting comments than some news text from years ago. This is also the reason for the lack of additional meta data such as threading information: It was easier to obtain this way and I'm not doing research on it.

GamersGlobal has all user-generated content licensed under a Creative Commons share-alike license, making it ideal for corpus creation.

The corpus archive contains:

  • the original csv table with timestamps and author information
  • comments sorted by author (untokenized)
  • comments sorted by author (tokenized)
  • a script to create a train / test set with the author names in the test set hidden (this is what I hand out to my students)

You can download it here: ggcc-1.0.tar.xz (40mb, md5sum: b4adb108bc5385ee9a2caefdf8db018e).

Some statistics:

  • 202,561 comments
  • 10,376,599 characters
  • more statistics are left as an exercise to the reader :-)

If you are interested in corpora, be sure to also check out the Hamburg Dependency Treebank and the Spoken Wikipedia Corpora!


abgaben.el: assignment correction with emacs

Posted on Mon 13 November 2017 in software • Tagged with emacs, teaching

Part of my job at the university is teaching and that entails correcting assignments. In the old days, I would receive the assignments by email, print them, write comments in the margins, give points for the assignments and hand them back. This approach has two downsides:

  • assignments are done by groups of 2-3 students but only one would have my commented version
  • I wouldn't have my own comments afterwards.

Therefore I switched to digital comments on the pdf. I would then send the annotated pdf to the students. Because it took a lot of time (~30min every week) to find the correct email, send the emails etc, I wrote a small package to help with that: abgaben.el

I assume that you use mu4e for your emails. I usually have several classes every semester – this semester I have one on Monday (“montag”) and one on Wednesday (“mittwoch”).

My workflow is as follows:

When I get an email, I save the assignment using the attachment action provided by abgaben.el. It asks for the group (montag/mittwoch in my case) and the week (01 in this example). Both questions remember your answer and will use it as a default for the next invocation. It then saves the attachment to the correct directory (abgaben-root-folder/montag/01/) and will create a new entry in your org mode file (abgaben-org-file, which needs to have a generic heading as well as your group headings in place), linking the assignment and the email:

You get the attachment action by adding something like this:

(add-to-list 'mu4e-view-attachment-actions
    '("gsave assignment" . abgaben-capture-submission) t)

The first character of the string is the shortcut. In this case, you need to press A for mu4e attachment actions and then g to invoke abgaben-capture-submission.

Then you can annotate the assignment with pdf-tools or whatever program you like. You could also sync the files to your tablet and annotate them there. Afterwards, call abgaben-export-pdf-annot-to-org to export your annotations into the org file. That command will also check for points and create a new subheading listing all points as well as a sum. (Because I batch process the assignments, I usually only have to press M-x M-p <RET>…)

You can then send the annotated pdf to your students by calling abgaben-prepare-reply. The function will store a reply with the exported annotations, the points overview and the annotated pdf as attachment in your kill ring and open the original email by your students. Press R to reply, C-y to insert your reply, modify if needed, and send the email. You are done!

(For some reason, I re-exported the annotations in this video, but it is a really cool feature worth seeing twice!)

Now you have an org file with all your annotations exported (and ready to reuse if several groups make the same mistake…), the points neatly summarized and all relevant data linked.

You can customize the relevant aspects of abgaben.el by M-x customize-group <RET> abgaben <RET>.

The package is available via melpa.

If you end up using this package or parts of it, drop me an email!

Update Dec 2017: The package now also supports archives, e.g. for source code submission. These are extracted into a new folder and that folder is linked in the org file.


ESSLLI Course on Incremental NLP

Posted on Mon 03 October 2016 in nlp

Timo and I held a course on incremental processing at ESSLLI 2016. If you have a look at (most of) our publications, you will see that Timo works on incremental speech processing and I on incremental text processing. The course was about incremental NLP in general and I hope we were successful in generating interest in incremental processing.

The slides are online (scroll to the bottom), but may not be sufficient to understand everything we actually said (this is as slides should be, in my opinion).

I stayed in Terlan which is fifteen minutes by train from Bozen and is quite lovely. Much quieter than Bolzano and one can go on hikes directly. Terlan has lots of great wine. Nearly all Terlan wine is produced by a co-operative founded in 1893.


GPS track visualization for videos

Posted on Mon 03 October 2016 in misc

We recently went for a ride at the very nice Alsterquellgebiet just north of Hamburg. We had a camera mounted and from time to time, I shot a short video.

Back home I wanted to visualize where we were for each video to make a short clip using kdenlive. The result is a small python program which will create images like these:

Track visualization on OSM

Given a gpx file and a set of other files, it downloads an OSM map for the region, draws the track, and for every file determines where it was shot (based on the time stamp as my files sadly have no usable meta-data). It then produces an image as above for each file.
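The timestamp matching can be sketched as follows. This is not the code from trackviz.py but a minimal illustration of the idea, assuming the track has already been read into a time-sorted list of (time, lat, lon) tuples; the function name, the dates and the coordinates are made up:

```python
from bisect import bisect_left
from datetime import datetime


def position_at(track, when):
    """Linearly interpolate (lat, lon) at time `when`.

    `track` is a list of (time, lat, lon) tuples sorted by time.
    Times outside the track are clamped to its first or last point.
    """
    times = [t for t, _, _ in track]
    i = bisect_left(times, when)
    if i == 0:
        return track[0][1:]
    if i == len(track):
        return track[-1][1:]
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    # Fraction of the way from the previous to the next track point.
    frac = (when - t0) / (t1 - t0)
    return (lat0 + frac * (lat1 - lat0), lon0 + frac * (lon1 - lon0))


# Two made-up track points ten minutes apart.
track = [
    (datetime(2016, 9, 25, 14, 0, 0), 53.70, 10.00),
    (datetime(2016, 9, 25, 14, 10, 0), 53.80, 10.20),
]
print(position_at(track, datetime(2016, 9, 25, 14, 5, 0)))
```

For a video, `when` would be the file's time stamp, possibly shifted if the camera clock and the GPS clock disagree.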

You can download the script here: trackviz.py

Make sure to properly attribute OpenStreetMap if you distribute these images! Since they are downloaded directly from osm.org, they are licensed under a Creative Commons Attribution-ShareAlike 2.0 license.


Evaluating Embeddings using Syntax-based Classification Tasks as a Proxy for Parser Performance

Posted on Sun 19 June 2016 in Publications

My paper about the correlation between syneval and parsing performance has been accepted at RepEval 2016. You can find code, data etc. here. Looking forward to Berlin (which is a 1:30h train ride from Hamburg).


Mining the Spoken Wikipedia for Speech Data and Beyond

Posted on Mon 30 May 2016 in Publications • Tagged with corpus

Our paper Mining the Spoken Wikipedia for Speech Data and Beyond has been accepted at LREC. Timo presented it and the reception seemed to be rather good. You can find our paper about hours and hours of time-aligned speech data generated from the Spoken Wikipedia at the Spoken Wikipedia Corpora website. There are about 200 hours of aligned data for German alone!

Of course, all data is available under CC BY-SA-3.0. :-)


What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation

Posted on Tue 15 September 2015 in Publications

I'm presenting my paper What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation at EMNLP. You can have a look at the data, code, and examples.

Hopefully, the EMNLP video recordings will be online at some point. As of now (2016-04), they are not.