Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing

Authors: Doaa Samy, Jerónimo Arenas-García, David Pérez-Fernández
Date: May 2020
Language: English
Published in: Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov), pages 32-36
Publisher: European Language Resources Association
Place: Marseille, France
Type: conference publication
ISBN: 979-10-95546-62-7
Anthology ID: samy-etal-2020-legal
URL: https://www.aclweb.org/anthology/2020.lt4gov-1.6

Abstract: Legal-ES is an open source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and several kinds of language models, including word embeddings and topic models. The corpus includes over 1000 million words covering a collection of legislative and administrative open-access documents in Spanish from different sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was applied to the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models to obtain a convenient tool for understanding the main topics in the corpus and for navigating the documents by exploiting the semantic similarity among them. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have occurred over the analysed time frame.
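
The navigation by semantic similarity mentioned in the abstract can be illustrated with a minimal sketch. It assumes each document is already described by a vector of topic proportions, as a topic model would produce. The document identifiers, the three-topic vectors, and the choice of Hellinger distance as the similarity measure are all illustrative assumptions, not details taken from the paper.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def most_similar(query_id, doc_topics, top_n=2):
    """Rank all other documents by increasing distance to the query document."""
    query = doc_topics[query_id]
    scored = [
        (doc_id, hellinger(query, vec))
        for doc_id, vec in doc_topics.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# Hypothetical topic proportions for three documents (each row sums to 1).
DOC_TOPICS = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.75, 0.20, 0.05],
    "doc_c": [0.05, 0.10, 0.85],
}

if __name__ == "__main__":
    for doc_id, distance in most_similar("doc_a", DOC_TOPICS):
        print(doc_id, round(distance, 3))
```

Under these assumptions, documents whose topic mixtures are close end up near the top of the ranking, which is one simple way to browse a corpus by topical similarity. Other measures, such as cosine similarity or Jensen-Shannon divergence, would fit the same structure.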