GamersGlobal Comment Corpus released

Posted on Sat 18 November 2017 in nlp • Tagged with corpus

Today I'm releasing the GamersGlobal comment corpus. GamersGlobal is a German computer gaming site (and my favorite one!) with a fairly active comment section below each article. This corpus contains all comments by the 20 most active users up to November 2016.

I use this corpus for teaching, mainly author attribution using bayes classifiers and language modeling. It's just more fun to use interesting comments than some news text from years ago. This is also the reason for the lack of additional meta data such as threading information: It was easier to obtain this way and I'm not doing research on it.

GamersGlobal has all user-generated content licensed under a Creative Commons share-alike license, making it ideal for corpus creation.

The corpus archive contains:

  • the original csv table with timestamps and author information
  • comments sorted by author (untokenized)
  • comments sorted by author (tokenized)
  • a script to create a train / test set with the author names in the test set hidden (this is what I hand out to my students)

You can download it here: ggcc-1.0.tar.xz (40mb, md5sum: b4adb108bc5385ee9a2caefdf8db018e).

Some statistics: - 202,561 comments - 10,376,599 characters - more statistics are left as an exercise to the reader :-)

If you are interested in corpora, be sure to also check out the Hamburg Dependency Treebank and the Spoken Wikipedia Corpora!

Continue reading

ESSLLI Course on Incremental NLP

Posted on Mon 03 October 2016 in nlp

Timo and I held a course on incremental processing at ESSLLI 2016. If you have a look at (most of) our publications, you will see that Timo works on incremental speech processing and I on incremental Text processing. The course was about incremental NLP in general and I hope we were successful in generating interest in incremental processing.

The slides are online (scroll to the bottom), but may not be sufficient to understand everything we actually said (this is as slides should be, in my opinion).

I stayed in Terlan which is fifteen minutes by train from Bozen and is quite lovely. Much quieter than Bolzano and one can go on hikes directly. Terlan has lots of great wine. Nearly all Terlan wine is produced by a co-operative founded in 1893.

Continue reading