Maxine Eskenazi
2020
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Shikib Mehri | Maxine Eskenazi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog, which trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces an interpretable measure for each of these qualities.
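A minimal sketch of the masked-language-model scoring idea underlying USR's fluency-style sub-metrics, assuming an off-the-shelf roberta-base checkpoint from HuggingFace transformers (USR fine-tunes its models on dialog data and combines several sub-metrics via regression; this illustrates only the pseudo-log-likelihood component):

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

def mlm_score(text: str) -> float:
    # Mask each token in turn and average its log-probability under the MLM;
    # higher scores suggest more natural, understandable text.
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    log_probs = []
    for i in range(1, len(ids) - 1):  # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return sum(log_probs) / max(len(log_probs), 1)

print(mlm_score("That sounds like a fun trip. Where are you headed?"))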
“None of the Above”: Measure Uncertainty in Dialog Response Retrieval
Yulan Feng | Shikib Mehri | Maxine Eskenazi | Tiancheng Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
This paper discusses the importance of uncovering uncertainty in end-to-end dialog tasks and presents experimental results on uncertainty classification using a processed version of the Ubuntu Dialog Corpus. We show that, instead of retraining models for this specific purpose, we can capture the original retrieval model’s underlying confidence in its best prediction with trivial additional computation.
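A hedged sketch of the confidence idea: reuse the retrieval model’s distribution over candidate responses and abstain (“none of the above”) when confidence in the top-ranked candidate is low. The threshold and abstention rule here are illustrative assumptions, not the paper’s exact procedure:

import torch

def predict_or_abstain(candidate_scores: torch.Tensor, threshold: float = 0.5):
    # candidate_scores: the retrieval model's unnormalized match scores,
    # one per candidate response. Softmax them, then abstain when the
    # model's confidence in its top-ranked candidate falls below threshold.
    probs = torch.softmax(candidate_scores, dim=-1)
    confidence, best = probs.max(dim=-1)
    if confidence.item() < threshold:
        return None, confidence.item()  # "none of the above"
    return best.item(), confidence.item()

choice, conf = predict_or_abstain(torch.tensor([2.1, 0.3, -0.5, 1.9]))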
Unsupervised Evaluation of Interactive Dialog with DialoGPT
Shikib Mehri | Maxine Eskenazi
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric that uses DialoGPT without any fine-tuning or supervision. It also introduces the FED dataset, which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data, and (3) measures fine-grained dialog qualities at both the turn and whole-dialog levels. FED attains moderate to strong correlation with human judgment at both levels.
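FED scores a dialog quality by how likely DialoGPT is to reply with a hand-written positive versus negative follow-up utterance. The sketch below illustrates that likelihood computation with the microsoft/DialoGPT-medium checkpoint; the follow-up strings are illustrative, not the paper’s exact set:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
lm.eval()

def followup_loglik(context: str, followup: str) -> float:
    # Log-likelihood DialoGPT assigns to `followup` given the dialog context.
    ctx = tok.encode(context + tok.eos_token, return_tensors="pt")
    full = tok.encode(context + tok.eos_token + followup + tok.eos_token,
                      return_tensors="pt")
    with torch.no_grad():
        logits = lm(full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full[0, 1:]
    start = ctx.shape[1] - 1  # sum log-probs over the follow-up tokens only
    return log_probs[start:].gather(1, targets[start:].unsqueeze(1)).sum().item()

context = "I just watched a great documentary about deep sea creatures."
interestingness = followup_loglik(context, "Wow, that sounds really interesting!") \
                - followup_loglik(context, "That was boring.")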