Bartłomiej Nitoń
2020
The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of The 12th Language Resources and Evaluation Conference
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes seven monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
New Developments in the Polish Parliamentary Corpus
Maciej Ogrodniczuk | Bartłomiej Nitoń
Proceedings of the Second ParlaCLARIN Workshop
This short paper presents the current (as of February 2020) state of preparation of the Polish Parliamentary Corpus (PPC), an extensive collection of transcripts of Polish parliamentary proceedings dating from 1919 to the present. The most evident developments compared to the 2018 version are the harmonization of metadata, the standardization of document identifiers, the uploading of the contents of all documents and metadata to the database (to enable easier modification, maintenance and future development of the corpus), the linking of utterances to the political ontology, the linking of corpus texts to source data, and the processing of historical documents.