GamersGlobal Comment Corpus released

Posted on Sat 18 November 2017 in nlp • Tagged with corpus

Today I'm releasing the GamersGlobal comment corpus. GamersGlobal is a German computer gaming site (and my favorite one!) with a fairly active comment section below each article. This corpus contains all comments by the 20 most active users up to November 2016.

I use this corpus for teaching, mainly author attribution using bayes classifiers and language modeling. It's just more fun to use interesting comments than some news text from years ago. This is also the reason for the lack of additional meta data such as threading information: It was easier to obtain this way and I'm not doing research on it.

GamersGlobal has all user-generated content licensed under a Creative Commons share-alike license, making it ideal for corpus creation.

The corpus archive contains:

  • the original csv table with timestamps and author information
  • comments sorted by author (untokenized)
  • comments sorted by author (tokenized)
  • a script to create a train / test set with the author names in the test set hidden (this is what I hand out to my students)

You can download it here: ggcc-1.0.tar.xz (40mb, md5sum: b4adb108bc5385ee9a2caefdf8db018e).

Some statistics: - 202,561 comments - 10,376,599 characters - more statistics are left as an exercise to the reader :-)

If you are interested in corpora, be sure to also check out the Hamburg Dependency Treebank and the Spoken Wikipedia Corpora!

Continue reading

Mining the Spoken Wikipedia for Speech Data and Beyond

Posted on Mon 30 May 2016 in Publications • Tagged with corpus

Our paper Mining the Spoken Wikipedia for Speech Data and Beyond has been accepted at LREC. Timo presented it and the reception seemed to be rather good. You can find our paper about hours and hours of time-aligned speech data generated from the Spoken Wikipedia at the Spoken Wikipedia Corpora website. There is about 200 hours of aligned data for German alone!

Of course, all data is available under CC BY-SA-3.0. :-)

Continue reading