2020
Orthographic Codes and the Neighborhood Effect: Lessons from Information Theory
Stéphan Tulkens | Dominiek Sandra | Walter Daelemans
Proceedings of The 12th Language Resources and Evaluation Conference
We consider the orthographic neighborhood effect: the effect that words with more orthographic similarity to other words are read faster. The neighborhood effect serves as an important control variable in psycholinguistic studies of word reading, and explains variance in addition to word length and word frequency. Following previous work, we model the neighborhood effect as the average distance to neighbors in feature space for three feature sets: slots, character ngrams, and skipgrams. We optimize each of these feature sets and find evidence for language-independent optima across five megastudy corpora from five alphabetic languages. Additionally, we show that weighting features by the inverse of mutual information (MI) significantly improves the neighborhood effect for all languages. We analyze the inverse feature weighting and show that, across languages, grammatical morphemes receive the lowest weights. Finally, we perform the same experiments on Korean Hangul, a non-alphabetic writing system, where we find the opposite results: slower responses as a function of denser neighborhoods, and a negative effect of inverse feature weighting. This raises the question of whether this is a cognitive effect or an effect of the way we represent Hangul orthography, and indicates that more research is needed.
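The neighborhood measure described in the abstract (the average distance to neighbors in a feature space built from slots, character ngrams, or skipgrams) can be illustrated with a short sketch. The snippet below is one possible reading of that measure, using character bigram counts and cosine distance; the function name, the choice of k, and the distance metric are illustrative assumptions, not the paper's implementation.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

def avg_neighbor_distance(words, k=20):
    """Return, for each word, the mean cosine distance to its k nearest neighbors."""
    # Character bigrams as features; the paper also explores slot and skipgram features.
    vec = CountVectorizer(analyzer="char", ngram_range=(2, 2))
    X = vec.fit_transform(words).toarray().astype(float)

    dist = pairwise_distances(X, metric="cosine")
    np.fill_diagonal(dist, np.inf)           # a word is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]   # k smallest distances per word
    return nearest.mean(axis=1)

# Toy example: denser neighborhoods (e.g. "cat") get smaller average distances.
words = ["cat", "cap", "can", "car", "bat", "rat", "dog", "den", "dot", "bar"]
for w, d in zip(words, avg_neighbor_distance(words, k=3)):
    print(f"{w}\t{d:.3f}")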
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of The 12th Language Resources and Evaluation Conference
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties, including full language equality. However, language barriers impacting business as well as cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities and synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We also give a brief overview of the main LT-related activities at the EU level in the last ten years and develop strategic guidance along four key dimensions.
Streaming Language-Specific Twitter Data with Optimal Keywords
Tim Kreutz | Walter Daelemans
Proceedings of the 12th Web as Corpus Workshop
The Twitter Streaming API has been used to create language-specific corpora with varying degrees of success. Selecting a filter of frequent yet distinct keywords for German resulted in a near-complete collection of German tweets. This method is promising because it stays within the Twitter endpoint limitations and could be applied to languages other than German, but so far no research has compared methods for selecting optimal keywords for this task. This paper proposes a method for finding optimal key phrases based on a greedy solution to the maximum coverage problem. We generate candidate key phrases for the 50 most frequent languages on Twitter. Candidates are then iteratively selected based on a variety of scoring functions applied to their coverage of target tweets. Selecting candidates with the scoring function that exponentiates the precision of a key phrase and weights it by recall achieved the best results overall. Some target languages yield lower results than their prevalence on Twitter would suggest. Upon analyzing the errors, we find that these are languages that are very close to more prevalent languages: key phrases are selected so as to avoid matching the competing language, and overall recall on the target language decreases as well. We publish the resulting optimized lists for each language as a resource, and also supply the code to generate lists for other research objectives.
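The greedy maximum-coverage selection with a score that exponentiates precision and weights it by recall can be sketched as below. This is a hypothetical reconstruction: the exponent alpha, the budget, and the plain substring matching of phrases against tweets are illustrative choices, not the paper's actual pipeline.

def greedy_keyphrase_selection(candidates, target_tweets, other_tweets,
                               budget=400, alpha=2.0):
    """Greedily pick key phrases that cover target-language tweets.

    candidates    : iterable of candidate key phrases
    target_tweets : list of tweets in the target language
    other_tweets  : list of tweets in other languages
    """
    selected = []
    covered = set()  # indices of target tweets already matched by a selected phrase

    for _ in range(budget):
        best, best_score = None, 0.0
        for phrase in candidates:
            if phrase in selected:
                continue
            # Marginal coverage: only target tweets not matched yet count.
            new_hits = {i for i, t in enumerate(target_tweets)
                        if phrase in t and i not in covered}
            other_hits = sum(phrase in t for t in other_tweets)
            total_hits = len(new_hits) + other_hits
            if total_hits == 0:
                continue
            precision = len(new_hits) / total_hits
            recall = len(new_hits) / len(target_tweets)
            score = (precision ** alpha) * recall  # exponentiated precision, weighted by recall
            if score > best_score:
                best, best_score = phrase, score
        if best is None:
            break  # no candidate adds coverage
        selected.append(best)
        covered |= {i for i, t in enumerate(target_tweets) if best in t}
    return selected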
Sarcasm Detection Using an Ensemble Approach
Jens Lemmens | Ben Burtenshaw | Ehsan Lotfi | Ilia Markov | Walter Daelemans
Proceedings of the Second Workshop on Figurative Language Processing
We present an ensemble approach for the detection of sarcasm in Reddit and Twitter responses in the context of the Second Workshop on Figurative Language Processing, held in conjunction with ACL 2020. The ensemble is trained on the predicted sarcasm probabilities of four component models and on additional features, such as the sentiment of the comment, its length, and its source (Reddit or Twitter), in order to learn which of the component models is the most reliable for which input. The component models consist of an LSTM with hashtag and emoji representations; a CNN-LSTM with casing, stop word, punctuation, and sentiment representations; an MLP based on InferSent embeddings; and an SVM trained on stylometric and emotion-based features. All component models use the two conversational turns preceding the response as context, except for the SVM, which only uses features extracted from the response. The ensemble itself consists of an AdaBoost classifier with the decision tree algorithm as base estimator and yields F1-scores of 67% and 74% on the Reddit and Twitter test data, respectively.
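The ensembling step described in the abstract (an AdaBoost classifier with a decision-tree base estimator trained on the component models' probabilities plus extra features) can be sketched as below. The feature values, column ordering, and hyperparameters are illustrative assumptions, not the authors' configuration.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Meta-features per response, one row each.
# Columns: LSTM prob, CNN-LSTM prob, MLP prob, SVM prob, sentiment, length, source (0=Reddit, 1=Twitter)
X_meta = np.array([
    [0.81, 0.74, 0.66, 0.70, -0.4, 18, 0],
    [0.22, 0.31, 0.28, 0.35,  0.6, 12, 1],
    [0.67, 0.58, 0.71, 0.49, -0.1, 25, 1],
    [0.15, 0.20, 0.12, 0.30,  0.3,  8, 0],
])
y = np.array([1, 0, 1, 0])  # 1 = sarcastic

# AdaBoost over shallow decision trees; the parameter is called base_estimator
# in scikit-learn versions before 1.2.
ensemble = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
)
ensemble.fit(X_meta, y)
print(ensemble.predict_proba(X_meta))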