Yugo Murawaki


2020

Building a Japanese Typo Dataset from Wikipedia’s Revision History
Yu Tanaka | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

User-generated texts contain many typos that must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
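
As a rough illustration of what character-based extraction from a revision pair could look like, the sketch below diffs two versions of a sentence at the character level and keeps only short, local edits as candidate typo–correction pairs. It is not the authors' pipeline: the function name `extract_char_edits`, the `max_len` threshold, and the use of Python's `difflib` are assumptions made for this example, and the reading-based rules and filtering steps mentioned in the abstract are not covered.

```python
from difflib import SequenceMatcher


def extract_char_edits(before: str, after: str, max_len: int = 3):
    """Return (typo, correction) character spans where two revisions differ.

    Hypothetical helper for illustration only; max_len is an assumed threshold.
    """
    pairs = []
    # autojunk=False disables difflib's heuristic for frequent elements in long inputs.
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = before[i1:i2], after[j1:j2]
        # Keep only small, local edits; large rewrites are unlikely to be typo fixes.
        if len(old) <= max_len and len(new) <= max_len:
            pairs.append((old, new))
    return pairs


if __name__ == "__main__":
    # One-character substitution between two revisions of the same sentence.
    print(extract_char_edits("私は学校に行きた", "私は学校に行った"))
```

Because the comparison works on characters rather than words, it needs no prior segmentation, which is the property the abstract highlights for Japanese text.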

Adapting BERT to Implicit Discourse Relation Classification with a Focus on Discourse Connectives
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of The 12th Language Resources and Evaluation Conference

BERT, a neural network-based language model pre-trained on large corpora, is a breakthrough in natural language processing, significantly outperforming previous state-of-the-art models in numerous tasks. However, there have been few reports on its application to implicit discourse relation classification, and it is not clear how BERT is best adapted to the task. In this paper, we test three methods of adaptation. (1) We perform additional pre-training on text tailored to discourse classification. (2) In expectation of knowledge transfer from explicit discourse relations to implicit discourse relations, we add a task named explicit connective prediction at the additional pre-training step. (3) To exploit implicit connectives given by treebank annotators, we add a task named implicit connective prediction at the fine-tuning step. We demonstrate that these three techniques can be combined straightforwardly in a single training pipeline. Through comprehensive experiments, we find that the first and second techniques provide additional gains while the last one does not.