Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing
Doaa Samy, Jerónimo Arenas-García, David Pérez-Fernández
Abstract
Legal-ES is an open-source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and several kinds of language models, including word embeddings and topic models. The corpus contains over 1,000 million words and covers a collection of open-access legislative and administrative documents in Spanish, drawn from sources that represent international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was run on the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models, which provide a convenient tool for understanding the main topics in the corpus and for navigating through the documents by exploiting the semantic similarity among them. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of the Spanish jurisdiction that occurred over the analysed time frame.
- Anthology ID: 2020.lt4gov-1.6
- Volume: Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
- Month: May
- Year: 2020
- Address: Marseille, France
- Venues: LREC | LT4Gov | WS
- Publisher: European Language Resources Association
- Pages: 32–36
- URL: https://www.aclweb.org/anthology/2020.lt4gov-1.6
- PDF: https://www.aclweb.org/anthology/2020.lt4gov-1.6.pdf
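
The abstract mentions navigating the corpus by exploiting semantic similarity among documents. As a rough illustration of that idea, and not the authors' implementation, the sketch below ranks documents by how close their topic proportions are. It assumes each document has already been reduced to a vector of topic proportions by some topic model. The Hellinger distance, the function names `hellinger` and `most_similar`, and the toy values in `doc_topics` are all assumptions made for this example; none of them come from Legal-ES.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping mass.
    """
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def most_similar(doc_id, doc_topics, top_n=3):
    """Return the ids of the top_n documents closest to doc_id."""
    query = doc_topics[doc_id]
    ranked = sorted(
        (hellinger(query, vec), other)
        for other, vec in doc_topics.items()
        if other != doc_id
    )
    return [other for _, other in ranked[:top_n]]


# Hypothetical topic proportions over three topics (illustrative only).
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.75, 0.20, 0.05],
    "doc_c": [0.10, 0.10, 0.80],
    "doc_d": [0.30, 0.60, 0.10],
}

print(most_similar("doc_a", doc_topics, top_n=2))
```

Any other distance between probability vectors (for example Jensen-Shannon divergence) could be swapped in for `hellinger` without changing the ranking logic.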