Why we chose XML for the SWC annotations

Posted on Wed 29 November 2017 in misc • Tagged with corpora

I was asked why we use XML instead of json for the Spoken Wikipedia Corpora:

As mentioned, we actually started with json. The first version of the SWC was actually annotated using json and I converted that to XML.

The original json more or less looked like this:

{ "sentences_starts": [0,10,46,72],
  "words": [
      {"token" : "hello", "start": 50, "end": 370},
      ["more tokens here"]

To obtain the second sentence, you needed to get sentence_starts[1] and sentence_starts[2], then obtain the sub-list of words defined by those bounds. You can notice the downside of data normalization.

The XML looked like this:

  <token start="50" end="370">hello</token>
  [more tokens here]

You can see that it is much more succinct. To obtain the second sentence, just do an xpath query: sentence[1] (more about using xpath at the bottom of this post).

But now we have much more structure, as you can see in our RelaxNG definition (have a look, it's easy to read!). We have

  • sections which can be nested
  • parts which were ignored during the alignment
  • sentences containing tokens, containting normalizations, containing phonemes

All in all, the annotation is a fairly elaborate typed tree. json is actually less succinct if you want to represent such data because there are no types. Try to represent <s><t>foo</t> <t>bar</t></s> in json:

{ "type": "s"
  "elems": [{"type": "t", "elems": ["foo"]},
           " ",
           {"type": "t", "elems": ["bar"]}

The distinction between data and annotation is not clear in json: in XML, everything that is not an XML tag is character data. To get the original text, just strip all XML tags. In json you would somehow have to externally define what the original data is and what the annotation is. This is important because we keep character-by-character correspondence to the original data. This is a very cool feature because you can cross-reference our annotations with the html markup to e.g. look at the impact of <b> tags on pronunciation.

validating your annotations

Last but not least, XML is much easier to validate (and given the complexity of our annotation, that was necessary!). The RelaxNG definition is human readable (so people can learn about the schema) and used for validation at the same time. Having a schema definition helped quite a bit because it was a central document where we could collaborate about the annotation schema. The automatic validation helped to catch malformed output – which happened more than once and was usually based on some edge cases. Without the validation, we wouldn’t have caught (and corrected) them. To my knowledge, there are no good json validators that check the structure and not just whether it is valid json. Update:

I had a look at json-schema and will give you a short comparison. In our annotation, a section has a title and content. The title is a list of tokens, some of which might be ignored. The content of a section can contain sentences, paragraphs, subsections or ignored elements.

This i s how the RelaxNG definition for that part looks like:

## A section contains a title and content. Sections are nested,
## e.g. h3 sections are stored in the content of the parent h2
## section.
Section = element section {
    attribute level {xsd:positiveInteger},
    element sectiontitle { MAUSINFO?, (T | element ignored {(T)*})* },
    element sectioncontent { (S|P|Section|Ignored)* }

I think it is fairly easy to read if you are acquainted with standard EBNF notation – | is an or, * denotes repetition and so on.

Compare my attempt at using json-schema:

{ "section": 
  { "type": "object",
    "required": ["elname", "elems"]
      { "elname": {"type": "string",
                   "pattern": "^section$"}
          "elems": {"type" : "array"
                     ["and all the interesting parts are still missing"]

That part only defines that I want to have a dictionary with elname=section and it needs to have an array for the subelements. I just gave up after a few minutes :-)

Working with XML annotations

Say you want to work with an XML annotated corpus. The easiest way to do that is XPath.

You don't care about our fancy structure annotation and just want to work on the sentences in SWC? Use this XPath selector: //s. // means descendant-or-self and s is just the element type you are interested in, i.e. you select all sentence structures that are somewhere under the root node. To give you an example in python:

import lxml.etree as ET
root = ET.parse("aligned.swc")
sentences = root.xpath("//s")

You can attach predicates in square brackets. count(t)>10 only selects sentences that have more than ten tokens:

sentences_longer_ten = root.xpath("//s[count(t)>10]")

You are only interested in long sections? Let's get the sections with more than 1k tokens! Note the .//, the leading dot means “start descending from the current node”, with just an //, you would count from the root node and not from each section.

long_sections = len(root.xpath("//section[count(.//t)>1000]"))

You want to get the number of words (i.e. tokens that have a normalization) which were not aligned? It’s easy: select all tokens with an n element as child but without an n element that has a start tag:

number_unaligned_words = root.xpath('count(//t[n][not(n[@start])])')

Note that we used count() to get a number instead of a list of elements. The aligned words have n subnodes but no n without a start attribute (there is no universal quantifier in xpath, you have to the equivalent not-exist):

aligned_words = root.xpath('//t[n][not(n[not(@start)])]')

You want to know the difference between start times for phoneme-based and word-based alignments? Here you are!

phon-diffs = [n.xpath("sum(./ph[1]/@start)")
              - int(n.attrib["start"]) 
              for n in root.xpath("//n[ph and @start]")]

We first obtain the normalizations that have word- and phoneme-based alignments (//n[ph and @start]) and then use list comprehension to compute the differences between the word-based alignments (n.attrib["start"]) and the start of the first phoneme (n.xpath("sum(./ph[1]/@start)")) – the sum() is just a hack to obtain an int instead of a string…

And that’s it! In my opinion, it’s easier than working with deeply nested json data structures. Questions, comments? send me a mail.

GamersGlobal Comment Corpus released

Posted on Sat 18 November 2017 in nlp • Tagged with corpus

Today I'm releasing the GamersGlobal comment corpus. GamersGlobal is a German computer gaming site (and my favorite one!) with a fairly active comment section below each article. This corpus contains all comments by the 20 most active users up to November 2016.

I use this corpus for teaching, mainly author attribution using bayes classifiers and language modeling. It's just more fun to use interesting comments than some news text from years ago. This is also the reason for the lack of additional meta data such as threading information: It was easier to obtain this way and I'm not doing research on it.

GamersGlobal has all user-generated content licensed under a Creative Commons share-alike license, making it ideal for corpus creation.

The corpus archive contains:

  • the original csv table with timestamps and author information
  • comments sorted by author (untokenized)
  • comments sorted by author (tokenized)
  • a script to create a train / test set with the author names in the test set hidden (this is what I hand out to my students)

You can download it here: ggcc-1.0.tar.xz (40mb, md5sum: b4adb108bc5385ee9a2caefdf8db018e).

Some statistics: - 202,561 comments - 10,376,599 characters - more statistics are left as an exercise to the reader :-)

If you are interested in corpora, be sure to also check out the Hamburg Dependency Treebank and the Spoken Wikipedia Corpora!

abgaben.el: assignment correction with emacs

Posted on Mon 13 November 2017 in software • Tagged with emacs, teaching

Part of my job at the university is teaching and that entails correcting assignments. In the old days, I would receive the assignments by email, print them, write comments in the margins, give points for the assignments and hand them back. This approach has two downsides:

  • assignments are done by groups of 2-3 students but only one would have my commented version
  • I wouldn't have my own comments afterwards.

Therefore I switched to digital comments on the pdf. I would then send the annotated pdf to the students. Because it took a lot of time (~30min every week) to find the correct email, send the emails etc, I wrote a small package to help with that: abgaben.el

I assume that you use mu4e for your emails. I ususally have several classes every semester – this semester I have one on monday (“montag”) and one on wednesday (“mittwoch”).

My workflow is as follows:

When I get an email, I save the assignment using the attachment action provided by abgaben.el. It asks for the group (montag/mittwoch in my case) and the week (01 in this example). Both questions remember your answer and will use it as a default for the next invocation. It then saves the attachment to the correct directory (abgaben-root-folder/montag/01/) and will create a new entry in your org mode file (abgaben-org-file, which needs to have a generic heading as well as your group headings in place), linking the assignment and the email:

You get the attachment action by adding something like this:

(add-to-list 'mu4e-view-attachment-actions
    '("gsave assignment" . abgaben-capture-submission) t)

The first character of the string is the shortcut. In this case, you need to press A gA for mu4e attachment actions and then g to invoke abgaben-save-abgabe.

Then you can annotate the assignment with pdf-tools or whatever program you like. You could also sync the files to your tablet and annotate them there. Afterwards, call abgaben-export-pdf-annot-to-org to export your annotations into the org file. That command will also check for points and create a new subheading listing all points as well as a sum. (Because I batch process the assignments, I usually only have to press M-x M-p <RET>…)

You can then send the annotated pdf to your students by calling abgaben-prepare-reply. The function will store a reply with the exported annotations, the points overview and the annotated pdf as attachment in your kill ring and open the original email by your students. Press R to reply, C-y to insert your reply, modify if needed, and send the email. You are done!

(For some reason, I re-exported the annotations in this video, but it is a really cool feature worth to be seen twice!)

Now you have an org file with all your annotations exported (and ready to reuse if several groups make the same mistake…), the points neatly summarized and all relevant data linked.

You can customize the relevant aspects of abgaben.el by M-x customize-group <RET> abgaben <RET>.

The package might soon be available via melpa, but for now you'll have to download it and install it via package-install-file. is availble via melpa.

If you end up using this package or parts of it, drop me an email!

Update Dec 2017: The package now also supports archives, e.g. for source code submission. These are extracted into a new folder and that folder is linked in the org file.

ESSLLI Course on Incremental NLP

Posted on Mon 03 October 2016 in nlp

Timo and I held a course on incremental processing at ESSLLI 2016. If you have a look at (most of) our publications, you will see that Timo works on incremental speech processing and I on incremental Text processing. The course was about incremental NLP in general and I hope we were successful in generating interest in incremental processing.

The slides are online (scroll to the bottom), but may not be sufficient to understand everything we actually said (this is as slides should be, in my opinion).

I stayed in Terlan which is fifteen minutes by train from Bozen and is quite lovely. Much quieter than Bolzano and one can go on hikes directly. Terlan has lots of great wine. Nearly all Terlan wine is produced by a co-operative founded in 1893.

GPS track visualization for videos

Posted on Mon 03 October 2016 in misc

We recently went for a ride at the very nice Alsterquellgebiet just north of Hamburg. We had a camera mounted and from time to time, I shot a short video.

Back home I wanted to visualize where we were for each video to make a short clip using kdenlive. The result is a small python program which will create images like these:

Track visualization on OSM

Given a gpx file and a set of other files, it downloads an OSM map for the region, draws the track, and for every file determines where it was shot (based on the time stamp as my files sadly have no usable meta-data). It then produces an image as above for each file.

You can download the script here: trackviz.py

Make sure to properly attribute OpenStreetMap if you distribute these images! Since they are downloaded directly from osm.org, they are licensed under a Creative Commons Attribution-ShareAlike 2.0 license.

Evaluating Embeddings using Syntax-based Classification Tasks as a Proxy for Parser Performance

Posted on Sun 19 June 2016 in Publications

My paper about the correlation between syneval and parsing performance has been accepted at RepEval 2016. You can find code, data etc. here. Looking forward to Berlin (which is a 1:30h train ride from Hamburg).

Mining the Spoken Wikipedia for Speech Data and Beyond

Posted on Mon 30 May 2016 in Publications • Tagged with corpus

Our paper Mining the Spoken Wikipedia for Speech Data and Beyond has been accepted at LREC. Timo presented it and the reception seemed to be rather good. You can find our paper about hours and hours of time-aligned speech data generated from the Spoken Wikipedia at the Spoken Wikipedia Corpora website. There is about 200 hours of aligned data for German alone!

Of course, all data is available under CC BY-SA-3.0. :-)

What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation

Posted on Tue 15 September 2015 in Publications

I'm presenting my paper What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation at EMNLP. You can have a look at the data, code, and examples.

Hopefully, the EMNLP video recordings will be online at some point. As of now (2016-04), they are not.

My Bachelor thesis

Posted on Thu 31 December 2009 in Publications

My bachelor thesis is in German, you can find it at the open access repository of our department: Inkrementelle Part-of-Speech Tagger

I also made an overview in English with the relevant results.