2020
pdf
bib
abs
Shallow Discourse Parsing for Under-Resourced Languages: Combining Machine Translation and Annotation Projection
Henny Sluyter-Gäthje
|
Peter Bourgonje
|
Manfred Stede
Proceedings of The 12th Language Resources and Evaluation Conference
Shallow Discourse Parsing (SDP), the identification of coherence relations between text spans, relies on large amounts of training data, which so far exists only for English - any other language is in this respect an under-resourced one. For those languages where machine translation from English is available with reasonable quality, MT in conjunction with annotation projection can be an option for producing an SDP resource. In our study, we translate the English Penn Discourse TreeBank into German and experiment with various methods of annotation projection to arrive at the German counterpart of the PDTB. We describe the key characteristics of the corpus as well as some typical sources of errors encountered during its creation. Then we evaluate the GermanPDTB by training components for selected sub-tasks of discourse parsing on this silver data and compare performance to the same components when trained on the gold, original PDTB corpus.
pdf
bib
abs
The Potsdam Commentary Corpus 2.2: Extending Annotations for Shallow Discourse Parsing
Peter Bourgonje
|
Manfred Stede
Proceedings of The 12th Language Resources and Evaluation Conference
We present the Potsdam Commentary Corpus 2.2, a German corpus of news editorials annotated on several different levels. New in the 2.2 version of the corpus are two additional annotation layers for coherence relations following the Penn Discourse TreeBank framework. Specifically, we add relation senses to an already existing layer of discourse connectives and their arguments, and we introduce a new layer with additional coherence relation types, resulting in a German corpus that mirrors the PDTB. The aim of this is to increase usability of the corpus for the task of shallow discourse parsing. In this paper, we provide inter-annotator agreement figures for the new annotations and compare corpus statistics based on the new annotations to the equivalent statistics extracted from the PDTB.
pdf
bib
abs
DiMLex-Bangla: A Lexicon of Bangla Discourse Connectives
Debopam Das
|
Manfred Stede
|
Soumya Sankar Ghosh
|
Lahari Chatterjee
Proceedings of The 12th Language Resources and Evaluation Conference
We present DiMLex-Bangla, a newly developed lexicon of discourse connectives in Bangla. The lexicon, upon completion of its first version, contains 123 Bangla connective entries, which are primarily compiled from the linguistic literature and translation of English discourse connectives. The lexicon compilation is later augmented by adding more connectives from a currently developed corpus, called the Bangla RST Discourse Treebank (Das and Stede, 2018). DiMLex-Bangla provides information on syntactic categories of Bangla connectives, their discourse semantics and non-connective uses (if any). It uses the format of the German connective lexicon DiMLex (Stede and Umbach, 1998), which provides a cross-linguistically applicable XML schema. The resource is the first of its kind in Bangla, and is freely available for use in studies on discourse structure and computational applications.
pdf
bib
abs
Semi-Supervised Tri-Training for Explicit Discourse Argument Expansion
René Knaebel
|
Manfred Stede
Proceedings of The 12th Language Resources and Evaluation Conference
This paper describes a novel application of semi-supervision for shallow discourse parsing. We use a neural approach for sequence tagging and focus on the extraction of explicit discourse arguments. First, additional unlabeled data is prepared for semi-supervised learning. From this data, weak annotations are generated in a first setting and later used in another setting to study performance differences. In our studies, we show an increase in the performance of our models that ranges between 2-10% F1 score. Further, we give some insights to the generated discourse annotations and compare the developed additional relations with the training relations. We release this new dataset of explicit discourse arguments to enable the training of large statistical models.