A Framework for Shared Agreement of Language Tags beyond ISO 639

Frances Gillis-Webber; Sabine Tittel

Abstract

The identification and annotation of languages in an unambiguous and standardized way is essential for the description of linguistic data. It is the prerequisite for machine-based interpretation, aggregation, and re-use of the data with respect to different languages, which makes it a key aspect especially for Linked Data and the multilingual Semantic Web. The standard for language tags is defined by IETF’s BCP 47, and ISO 639 provides the language codes that are the tags’ main constituents. However, the ISO 639 codes are insufficient for identifying lesser-known languages, endangered languages, regional varieties, or historical stages of a language. The optional language sub-tags of BCP 47 are likewise not fine-grained enough to represent linguistic variation. We propose a versatile pattern that extends the BCP 47 sub-tag ‘privateuse’ and can thus overcome the limits of BCP 47 and ISO 639. We demonstrate that the pattern provides sufficient coverage with the use case of linguistic Linked Data for the endangered Gascon language. We show how to use a URI shortcode for the extended sub-tag, which keeps its length compliant with BCP 47. We achieve this with a web application and API developed to encode and decode the language tag.
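
The length limit mentioned in the abstract comes from the BCP 47 grammar (RFC 5646), in which a private-use sequence starts with the singleton `x` and each following sub-tag is at most eight alphanumeric characters. The sketch below only checks that generic constraint; it does not reproduce the authors' pattern, shortcode scheme, or API. The function names and the example tags are hypothetical and chosen purely for illustration.

```python
import re

# BCP 47 (RFC 5646) grammar: privateuse = "x" 1*("-" (1*8alphanum))
PRIVATEUSE_RE = re.compile(r"x(-[A-Za-z0-9]{1,8})+", re.IGNORECASE)


def split_privateuse(tag: str):
    """Split a tag at the first 'x' singleton into (prefix, privateuse).

    Returns (tag, None) when the tag has no private-use part.
    Hypothetical helper for illustration, not part of the paper's tooling.
    """
    subtags = tag.split("-")
    for i, sub in enumerate(subtags):
        if sub.lower() == "x":
            return "-".join(subtags[:i]), "-".join(subtags[i:])
    return tag, None


def has_valid_privateuse(tag: str) -> bool:
    """True if the tag has a private-use part that fits the BCP 47 grammar."""
    _, private = split_privateuse(tag)
    return private is not None and PRIVATEUSE_RE.fullmatch(private) is not None
```

Under this check, a short private-use sub-tag such as the made-up `oc-x-abc123` passes, while a descriptive sub-tag longer than eight characters, such as `oc-x-averylongvarietyname`, does not. That restriction is the reason a compact code standing in for a longer identifier is useful.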

Anthology ID: 2020.lrec-1.408
Volume: Proceedings of The 12th Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 3333–3339
URL: https://www.aclweb.org/anthology/2020.lrec-1.408
PDF: https://www.aclweb.org/anthology/2020.lrec-1.408.pdf
