Özlem Çetinoğlu
2020
Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus
Mohamed Balabel | Injy Hamed | Slim Abdennadher | Ngoc Thang Vu | Özlem Çetinoğlu
Proceedings of The 12th Language Resources and Evaluation Conference
Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic-English code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.
Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies
Manuela Sanguinetti | Cristina Bosco | Lauren Cassidy | Özlem Çetinoğlu | Alessandra Teresa Cignarella | Teresa Lynn | Ines Rehbein | Josef Ruppenhofer | Djamé Seddah | Amir Zeldes
Proceedings of The 12th Language Resources and Evaluation Conference
The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.