2020
Developing a Twi (Asante) Dictionary from Akan Interlinear Glossed Texts
Dorothee Beermann | Lars Hellan | Pavel Mihaylov | Anna Struck
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Traditionally, a lexicographer identifies the lexical items to be added to a dictionary. Here we present a corpus-based approach to dictionary compilation and describe a procedure that derives a Twi dictionary from a TypeCraft corpus of Interlinear Glossed Texts. We first extracted a list of unique words. We excluded words belonging to other dialects of Akan (mostly Fante and Abron). We corrected misspellings and distinguished English loan words to be integrated into our dictionary from instances of code-switching. Besides the dictionary itself, our work also yields a lexicographical model for Akan, which represents the lexical resource itself and the extended morphological and word class inventories that provide information to be aggregated. We also represent external resources such as the corpus that serves as the source and word-level audio files. The Twi dictionary currently consists of 1367 words; it will be available online and from an open mobile app.
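The extraction steps named in this abstract (unique words, dialect exclusion, spelling correction, separating code-switching from loan words) can be pictured with a minimal Python sketch. It is not the authors' TypeCraft pipeline: the function name `build_wordlist`, the `(word, dialect)` token format, the correction table, and the sample tokens are all assumptions made for illustration, and the sample strings are placeholders rather than real Twi words.

```python
from collections import Counter

def build_wordlist(tokens, excluded_dialects, corrections, code_switches):
    """Return a sorted list of unique headword candidates.

    tokens: iterable of (word, dialect) pairs (hypothetical input format).
    excluded_dialects: dialect labels whose words are dropped.
    corrections: maps a known misspelling to its corrected form.
    code_switches: words judged to be code-switching rather than loans.
    """
    counts = Counter()
    for word, dialect in tokens:
        if dialect in excluded_dialects:
            continue  # drop words from other dialects
        word = word.lower()
        word = corrections.get(word, word)  # normalize known misspellings
        if word in code_switches:
            continue  # code-switched material is not a dictionary entry
        counts[word] += 1
    return sorted(counts)

# Placeholder tokens, not real Twi data.
sample = [
    ("alpha", "asante"),
    ("alpa", "asante"),   # misspelling of "alpha"
    ("beta", "fante"),    # other dialect, excluded
    ("gamma", "asante"),
    ("the", "asante"),    # treated as code-switching
]
wordlist = build_wordlist(
    sample,
    excluded_dialects={"fante", "abron"},
    corrections={"alpa": "alpha"},
    code_switches={"the"},
)
print(wordlist)
```

In this sketch the decisions about which words count as misspellings or code-switching are supplied as hand-built tables, which mirrors the manual judgment the abstract describes rather than automating it.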
Typical Sentences as a Resource for Valence
Uwe Quasthoff | Lars Hellan | Erik Körner | Thomas Eckart | Dirk Goldhahn | Dorothee Beermann
Proceedings of The 12th Language Resources and Evaluation Conference
Verb valence information can be derived from corpora by using subcorpora of typical sentences that are constructed in a language-independent manner based on frequent POS structures. Inspecting typical sentences with a fixed verb in a certain position can show the valence information directly. Using verb fingerprints, which consist of the most typical sentence patterns a verb appears in, we are able to identify standard valence patterns and compare them against a language’s valence profile. With a very limited amount of training data per language, valence information for other verbs can be derived as well. Based on the Norwegian valence patterns, we are able to find comparable patterns in German where typical sentences express the same situation in an equivalent way, and can thus construct verb valence pairs for a bilingual PolyVal dictionary. This contribution discusses this application with a focus on the Norwegian valence dictionary NorVal.
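One way to picture a verb fingerprint as described here is the set of most frequent POS patterns among the sentences containing a given verb. The Python sketch below makes several assumptions: sentences arrive already POS-tagged as `(token, tag)` pairs, the tag `VERB` marks verbs, verbs are keyed by lowercased surface form rather than lemma, and the function name `verb_fingerprints` is hypothetical. The paper's own selection of typical sentences and its pattern inventory are not reproduced.

```python
from collections import Counter, defaultdict

def verb_fingerprints(sentences, top_n=2):
    """Map each verb to its top_n most frequent sentence-level POS patterns.

    sentences: list of sentences, each a list of (token, pos) pairs
    (assumed input format; the tag "VERB" is an assumption as well).
    """
    patterns = defaultdict(Counter)
    for sent in sentences:
        # The POS pattern of the whole sentence, e.g. "PRON VERB NOUN".
        pattern = " ".join(pos for _, pos in sent)
        for token, pos in sent:
            if pos == "VERB":
                patterns[token.lower()][pattern] += 1
    return {
        verb: [p for p, _ in counter.most_common(top_n)]
        for verb, counter in patterns.items()
    }

# Toy English sentences, used only to illustrate the data flow.
tagged = [
    [("She", "PRON"), ("gives", "VERB"), ("him", "PRON"), ("books", "NOUN")],
    [("He", "PRON"), ("gives", "VERB"), ("her", "PRON"), ("flowers", "NOUN")],
    [("She", "PRON"), ("gives", "VERB"), ("advice", "NOUN")],
    [("He", "PRON"), ("sleeps", "VERB")],
]
fingerprints = verb_fingerprints(tagged)
print(fingerprints["gives"])
```

In this toy data the ditransitive pattern dominates for "gives" while "sleeps" only shows an intransitive pattern, which is the kind of contrast a fingerprint is meant to expose.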
A Computational Grammar of Ga
Lars Hellan
Proceedings of the first workshop on Resources for African Indigenous Languages
The paper describes aspects of an HPSG-style computational grammar of the West African language Ga (a Kwa language spoken in the Accra area of Ghana). As a Volta Basin Kwa language, Ga features many types of multiverb expressions and other particular constructional patterns in the verbal and nominal domain. The paper highlights theoretical and formal features of the grammar motivated by these phenomena, some of them possibly innovative to the formal framework. As a so-called deep grammar of the language, it hosts a rich lexical structure, and we describe ways in which the grammar builds on previously available lexical resources. We outline an environment of current resources of which the grammar is a part, and lines of research and development in which it and its environment can be used.
Interoperable Semantic Annotation
Lars Hellan
Proceedings of the 16th Joint ACL-ISO Workshop on Interoperable Semantic Annotation
The paper presents an annotation schema with the following characteristics: it is formally compact; it systematically and compositionally expands into full-fledged analytic representations, exploiting simple algorithms of typed feature structures; its representation of various dimensions of semantic content is systematically integrated with morpho-syntactic and lexical representation; and it is integrated with a ‘deep’ parsing grammar. Its compactness allows for efficient handling of large amounts of structures and data, and it is interoperable in covering multiple aspects of grammar and meaning. The code and its analytic expansions represent a cross-linguistically wide range of phenomena of languages and language structures. This paper presents its syntactic-semantic interoperability first from a theoretical point of view and then as applied in linguistic description.