2020
pdf
bib
abs
Predicting Degrees of Technicality in Automatic Terminology Extraction
Anna Hätty
|
Dominik Schlechtweg
|
Michael Dorna
|
Sabine Schulte im Walde
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
While automatic term extraction is a well-researched area, computational approaches to distinguish between degrees of technicality are still understudied. We semi-automatically create a German gold standard of technicality across four domains, and illustrate the impact of a web-crawled general-language corpus on technicality prediction. When defining a classification approach that combines general-language and domain-specific word embeddings, we go beyond previous work and align vector spaces to gain comparative embeddings. We suggest two novel models to exploit general- vs. domain-specific comparisons: a simple neural network model with pre-computed comparative-embedding information as input, and a multi-channel model computing the comparison internally. Both models outperform previous approaches, with the multi-channel model performing best.
pdf
bib
abs
A Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds in the Domains DIY, Cooking and Automotive
Julia Bettinger
|
Anna Hätty
|
Michael Dorna
|
Sabine Schulte im Walde
Proceedings of The 12th Language Resources and Evaluation Conference
We present a dataset with difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts for do-it-ourself (DIY), cooking and automotive. The dataset includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive. The compounds were identified in text using the Simple Compound Splitter (Weller-Di Marco, 2017); a subset was filtered and balanced for frequency and productivity criteria as basis for manual annotation and fine-grained interpretation. This study presents the creation, the final dataset with ratings from 20 annotators and statistics over the dataset, to provide insight into the perception of domain-specific term difficulty. It is particularly striking that annotators agree on a coarse, binary distinction between easy vs. difficult domain-specific compounds but that a more fine grained distinction of difficulty is not meaningful. We finally discuss the challenges of an annotation for difficulty, which includes both the task description as well as the selection of the data basis.
pdf
bib
abs
Variants of Vector Space Reductions for Predicting the Compositionality of English Noun Compounds
Pegah Alipoor
|
Sabine Schulte im Walde
Proceedings of The 12th Language Resources and Evaluation Conference
Predicting the degree of compositionality of noun compounds such as “snowball” and “butterfly” is a crucial ingredient for lexicography and Natural Language Processing applications, to know whether the compound should be treated as a whole, or through its constituents, and what it means. Computational approaches for an automatic prediction typically represent and compare compounds and their constituents within a vector space and use distributional similarity as a proxy to predict the semantic relatedness between the compounds and their constituents as the compound’s degree of compositionality. This paper provides a systematic evaluation of vector-space reduction variants across kinds, exploring reductions based on part-of-speech next to and also in combination with Principal Components Analysis using Singular Value and word2vec embeddings. We show that word2vec and nouns only dimensionality reductions are the most successful and stable vector space variants for our task.
pdf
bib
abs
Varying Vector Representations and Integrating Meaning Shifts into a PageRank Model for Automatic Term Extraction
Anurag Nigam
|
Anna Hätty
|
Sabine Schulte im Walde
Proceedings of The 12th Language Resources and Evaluation Conference
We perform a comparative study for automatic term extraction from domain-specific language using a PageRank model with different edge-weighting methods. We vary vector space representations within the PageRank graph algorithm, and we go beyond standard co-occurrence and investigate the influence of measures of association strength and first- vs. second-order co-occurrence. In addition, we incorporate meaning shifts from general to domain-specific language as personalized vectors, in order to distinguish between termhood strengths of ambiguous words across word senses. Our study is performed for two domain-specific English corpora: ACL and do-it-yourself (DIY); and a domain-specific German corpus: cooking. The models are assessed by applying average precision and the roc score as evaluation metrices.
pdf
bib
abs
CCOHA: Clean Corpus of Historical American English
Reem Alatrash
|
Dominik Schlechtweg
|
Jonas Kuhn
|
Sabine Schulte im Walde
Proceedings of The 12th Language Resources and Evaluation Conference
Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus CCOHA contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.
bib
abs
Variants of Vector Space Reductions for Predicting the Compositionality of English Noun Compounds
Pegah Alipoormolabashi
|
Sabine Schulte im Walde
Proceedings of the The Fourth Widening Natural Language Processing Workshop
Predicting the degree of compositionality of noun compounds is a crucial ingredient for lexicography and NLP applications, to know whether the compound should be treated as a whole, or through its constituents. Computational approaches for an automatic prediction typically represent compounds and their constituents within a vector space to have a numeric relatedness measure for the words. This paper provides a systematic evaluation of using different vector-space reduction variants for the prediction. We demonstrate that Word2vec and nouns-only dimensionality reductions are the most successful and stable vector space reduction variants for our task.