Rebecca Sharp
2020
MathAlign: Linking Formula Identifiers to their Contextual Natural Language Descriptions
Maria Alexeeva | Rebecca Sharp | Marco A. Valenzuela-Escárcega | Jennifer Kadowaki | Adarsh Pyarelal | Clayton Morrison
Proceedings of The 12th Language Resources and Evaluation Conference
Extending machine reading approaches to extract mathematical concepts and their descriptions is useful for a variety of tasks, ranging from mathematical information retrieval to increasing accessibility of scientific documents for the visually impaired. This entails segmenting mathematical formulae into identifiers and linking them to their natural language descriptions. We propose a rule-based approach for this task, which extracts LaTeX representations of formula identifiers and links them to their in-text descriptions, given only the original PDF and the location of the formula of interest. We also present a novel evaluation dataset for this task, as well as the tool used to create it.
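The rule-based pipeline the abstract describes (segment a formula into identifiers, then link each to an in-text description) can be illustrated with a minimal sketch. The regex patterns, the tiny list of Greek commands, and the "X is/denotes the ..." description rule below are illustrative assumptions, not the paper's actual grammar:

```python
import re

def extract_identifiers(latex_formula):
    """Pull single-letter and (a few) Greek-command identifiers out of a LaTeX formula.

    The Greek-command list is a toy subset for illustration only.
    """
    greek = re.findall(r"\\(?:alpha|beta|gamma|lambda|mu|sigma|theta)\b", latex_formula)
    # Single letters not preceded by a backslash and not part of a longer word.
    letters = re.findall(r"(?<![\\a-zA-Z])[a-zA-Z](?![a-zA-Z])", latex_formula)
    return sorted(set(greek + letters))

def link_descriptions(identifiers, sentence):
    """Link identifiers to descriptions with one simple 'X is/denotes the ...' rule.

    A real system would use many such rules over syntactic structure; this is
    a single surface-pattern stand-in.
    """
    links = {}
    for ident in identifiers:
        plain = ident.lstrip("\\")
        m = re.search(
            rf"\b{re.escape(plain)}\b (?:is|denotes) (?:the )?([\w ]+?)(?:,|\.|$)",
            sentence,
        )
        if m:
            links[ident] = m.group(1).strip()
    return links

ids = extract_identifiers(r"v = \lambda f")
links = link_descriptions(
    ids, "where v is the velocity, f is the frequency, and lambda is the wavelength."
)
print(ids)    # ['\\lambda', 'f', 'v']
print(links)  # {'\\lambda': 'wavelength', 'f': 'frequency', 'v': 'velocity'}
```

In practice the paper operates on PDF output rather than clean LaTeX source, so the extraction step is considerably harder than this sketch suggests.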
Towards the Necessity for Debiasing Natural Language Inference Datasets
Mithun Paul Panenghat | Sandeep Suntwal | Faiz Rafique | Rebecca Sharp | Mihai Surdeanu
Proceedings of The 12th Language Resources and Evaluation Conference
Modeling natural language inference is a challenging task. With large annotated datasets available, it has now become feasible to train complex neural-network-based inference methods that achieve state-of-the-art performance. However, it has been shown that these models also learn from the subtle biases inherent in these datasets (CITATION). In this work we explore two techniques for delexicalization that modify the datasets in such a way that we can control the importance that neural-network-based methods place on lexical entities. We demonstrate that the proposed methods not only maintain in-domain performance but also improve performance in some out-of-domain settings. For example, when using the delexicalized version of the FEVER dataset, the in-domain performance of a state-of-the-art neural network method dropped by only 1.12%, while its out-of-domain performance on the FNC dataset improved by 4.63%. We release the delexicalized versions of three common datasets used in natural language inference. These datasets are delexicalized using two methods: one which replaces the lexical entities in an overlap-aware manner, and a second which additionally incorporates semantic lifting of nouns and verbs to their WordNet hypernym synsets.
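The overlap-aware replacement the abstract mentions can be sketched as follows: entities occurring in both the claim and the evidence receive a shared placeholder, while entities on only one side receive side-specific placeholders, so the model can still see lexical overlap without the specific words. The placeholder names (`overlap-N`, `claimonly-N`, `evidenceonly-N`), the token-level matching, and the example sentences below are all illustrative assumptions, not the released datasets' exact tags:

```python
def delexicalize(claim_tokens, evidence_tokens, entities):
    """Replace entity tokens with indexed placeholders, shared across claim and
    evidence when the entity occurs in both (overlap-aware)."""
    mapping = {}
    counter = {"overlap": 0, "claimonly": 0, "evidenceonly": 0}
    claim_set, evidence_set = set(claim_tokens), set(evidence_tokens)
    for ent in entities:
        if ent in claim_set and ent in evidence_set:
            kind = "overlap"
        elif ent in claim_set:
            kind = "claimonly"
        else:
            kind = "evidenceonly"
        counter[kind] += 1
        mapping[ent] = f"{kind}-{counter[kind]}"
    substitute = lambda tokens: [mapping.get(t, t) for t in tokens]
    return substitute(claim_tokens), substitute(evidence_tokens)

claim, evidence = delexicalize(
    "Obama was born in Hawaii".split(),
    "Obama was raised in Kenya".split(),
    ["Obama", "Hawaii", "Kenya"],
)
print(claim)     # ['overlap-1', 'was', 'born', 'in', 'claimonly-1']
print(evidence)  # ['overlap-1', 'was', 'raised', 'in', 'evidenceonly-1']
```

The second released variant would additionally map content nouns and verbs to WordNet hypernym synsets before substitution; that step is omitted here to keep the sketch dependency-free.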