Unsupervised Evaluation of Interactive Dialog with DialoGPT
Abstract
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset, which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
- Anthology ID:
- 2020.sigdial-1.28
- Volume:
- Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue
- Month:
- July
- Year:
- 2020
- Address:
- 1st virtual meeting
- Venue:
- SIGDIAL
- SIG:
- SIGDIAL
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 225–235
- URL:
- https://www.aclweb.org/anthology/2020.sigdial-1.28
- DOI:
- PDF:
- https://www.aclweb.org/anthology/2020.sigdial-1.28.pdf
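The abstract notes that FED uses DialoGPT with no fine-tuning or supervision; in the paper this is done by comparing DialoGPT's likelihood of hand-written positive versus negative follow-up utterances given the dialog context. Below is a minimal, runnable sketch of that comparison. The `log_likelihood` function is a toy word-overlap stand-in for DialoGPT (so the example runs offline), and the follow-up utterances are illustrative, not the paper's exact prompts:

```python
import string

# Toy stand-in for DialoGPT's log-likelihood of `follow_up` given `context`.
# The real FED metric queries DialoGPT; this crude word-overlap proxy only
# keeps the sketch self-contained and is purely illustrative.
def log_likelihood(context: str, follow_up: str) -> float:
    strip = str.maketrans("", "", string.punctuation)
    ctx_words = set(context.lower().translate(strip).split())
    fu_words = follow_up.lower().translate(strip).split()
    overlap = sum(w in ctx_words for w in fu_words)
    return overlap / max(len(fu_words), 1) - 1.0  # pseudo log-prob in [-1, 0]

def fed_quality_score(context, positive_followups, negative_followups):
    """Score one dialog quality as the mean likelihood of the positive
    follow-ups minus the mean likelihood of the negative ones; a higher
    score means the quality is judged more present in the context."""
    pos = sum(log_likelihood(context, u) for u in positive_followups)
    neg = sum(log_likelihood(context, u) for u in negative_followups)
    return pos / len(positive_followups) - neg / len(negative_followups)

# Hypothetical follow-up utterances for an "interesting" turn-level quality.
POS = ["Wow, that is really interesting.", "That is really cool."]
NEG = ["That is really boring.", "You are boring."]

print(fed_quality_score("That is really interesting, I love learning cool facts.",
                        POS, NEG))
```

Because no reference response and no training data enter the computation, this structure matches the metric's stated properties (1) and (2); swapping the toy scorer for DialoGPT's actual token log-probabilities would give the unsupervised setup the abstract describes.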