Jonathan Wright
2020
A Progress Report on Activities at the Linguistic Data Consortium Benefitting the LREC Community
Christopher Cieri | James Fiumara | Stephanie Strassel | Jonathan Wright | Denise DiPersio | Mark Liberman
Proceedings of The 12th Language Resources and Evaluation Conference
This latest in a series of Linguistic Data Consortium (LDC) progress reports to the LREC community does not describe any single language resource, evaluation campaign or technology but sketches the activities, since the last report, of a data center devoted to supporting the work of LREC attendees among other research communities. Specifically, we describe 96 new corpora released in 2018-2020 to date, a new technology evaluation campaign, ongoing activities to support multiple common task human language technology programs, and innovations to advance the methodology of language data collection and annotation.
Call My Net 2: A New Resource for Speaker Recognition
Karen Jones | Stephanie Strassel | Kevin Walker | Jonathan Wright
Proceedings of The 12th Language Resources and Evaluation Conference
We introduce the Call My Net 2 (CMN2) Corpus, a new resource for speaker recognition featuring Tunisian Arabic conversations between friends and family, incorporating both traditional telephony and VoIP data. The corpus contains data from over 400 Tunisian Arabic speakers collected via a custom-built platform deployed in Tunis, with each speaker making 10 or more calls each lasting up to 10 minutes. Calls include speech in various realistic and natural acoustic settings, both noisy and non-noisy. Speakers used a variety of handsets, including landline and mobile devices, and made VoIP calls from tablets or computers. All calls were subject to a series of manual and automatic quality checks, including speech duration, audio quality, language identity and speaker identity. The CMN2 corpus has been used in two NIST Speaker Recognition Evaluations (SRE18 and SRE19), and the SRE test sets as well as the full CMN2 corpus will be published in the Linguistic Data Consortium Catalog. We describe CMN2 corpus requirements, the telephone collection platform, and procedures for call collection. We review properties of the CMN2 dataset and discuss features of the corpus that distinguish it from prior SRE collection efforts, including some of the technical challenges encountered with collecting VoIP data.
LanguageARC: Developing Language Resources Through Citizen Linguistics
James Fiumara | Christopher Cieri | Jonathan Wright | Mark Liberman
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"
This paper introduces the citizen science platform, LanguageARC, developed within the NIEUW (Novel Incentives and Workflows) project supported by the National Science Foundation under Grant No. 1730377. LanguageARC is a community-oriented online platform bringing together researchers and “citizen linguists” with the shared goal of contributing to linguistic research and language technology development. Like other Citizen Science platforms and projects, LanguageARC harnesses the power and efforts of volunteers who are motivated by the incentives of contributing to science, learning and discovery, and belonging to a community dedicated to social improvement. Citizen linguists contribute language data and judgments by participating in research tasks such as classifying regional accents from audio clips, recording audio of picture descriptions and answering personality questionnaires to create baseline data for NLP research into autism and neurodegenerative conditions. Researchers can create projects on LanguageARC using our Project Builder Toolkit, with no coding or HTML required.
Co-authors
- Christopher Cieri 2
- James Fiumara 2
- Stephanie Strassel 2
- Mark Liberman 2
- Denise DiPersio 1