Malvina Nissim
2020
MAGPIE: A Large Corpus of Potentially Idiomatic Expressions
Hessel Haagsma | Johan Bos | Malvina Nissim
Proceedings of The 12th Language Resources and Evaluation Conference
Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of magnitude larger than previous resources. Crucial ingredients of the crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks the task down into small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata, and is made publicly available.
Invisible to People but not to Machines: Evaluation of Style-aware Headline Generation in Absence of Reliable Human Judgment
Lorenzo De Mattei | Michele Cafagna | Felice Dell’Orletta | Malvina Nissim
Proceedings of The 12th Language Resources and Evaluation Conference
We automatically generate headlines that are expected to comply with the specific styles of two different Italian newspapers. Through a data alignment strategy and different training/testing settings, we aim at decoupling content from style and preserving the latter in generation. In order to evaluate the generated headlines’ quality in terms of their specific newspaper-compliance, we devise a fine-grained evaluation strategy based on automatic classification. We observe that our models do indeed learn newspaper-specific style. Importantly, we also observe that humans are not reliable judges for this task: although familiar with the newspapers, they are unable to discern their specific styles even in the original human-written headlines. The utility of automatic evaluation therefore goes beyond saving the costs and hurdles of manual annotation, and deserves particular care in its design.
Lower Bias, Higher Density Abusive Language Datasets: A Recipe
Juliet van Rosendaal | Tommaso Caselli | Malvina Nissim
Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language
Datasets to train models for abusive language detection are at the same time necessary and still scarce. One of the reasons for their limited availability is the cost of their creation. Not only is manual annotation expensive; the phenomenon is also sparse, forcing human annotators to go through a large number of irrelevant examples in order to obtain some significant data. Strategies used until now to increase the density of abusive language and obtain more meaningful data overall include filtering the data on the basis of pre-selected keywords and drawing on hate-rich sources. We suggest a recipe that can provide meaningful data with a possibly higher density of abusive language while also reducing the top-down biases imposed by corpus creators in the selection of the data to annotate. More specifically, we exploit the controversy channel on Reddit to obtain keywords that are used to filter a Twitter dataset. While the method needs further validation and refinement, our preliminary experiments show a higher density of abusive tweets in the filtered vs. unfiltered dataset, and a more meaningful topic distribution after filtering.
Co-authors
- Hessel Haagsma 1
- Johan Bos 1
- Lorenzo De Mattei 1
- Michele Cafagna 1
- Felice Dell’Orletta 1