A Corpus of German Reddit Exchanges (GeRedE)

Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, Thomas Proisl


Abstract
GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.
Anthology ID:
2020.lrec-1.774
Volume:
Proceedings of The 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
6310–6316
URL:
https://www.aclweb.org/anthology/2020.lrec-1.774
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://www.aclweb.org/anthology/2020.lrec-1.774.pdf

You can write comments here (and agree to place them under CC-by). They are not guaranteed to stay and there is no e-mail functionality.