Why we chose XML for the SWC annotations
Posted on Wed 29 November 2017 in misc • Tagged with corpora
I was asked why we use XML instead of json for the Spoken Wikipedia Corpora:
Hey, could you elaborate more about xml != json ? What's missing? xml ids (like for crossing branches in TBs) ?
— Djamé (@zehavoc) 29. November 2017
As mentioned, we actually started with json: the first version of the SWC was annotated using json, and I converted that to XML.
The original json more or less looked like this:
{ "sentence_starts": [0, 10, 46, 72],
  "words": [
    {"token": "hello", "start": 50, "end": 370},
    ["more tokens here"]
  ]
}
To obtain the second sentence, you needed to get sentence_starts[1] and sentence_starts[2], then obtain the sub-list of words defined by those bounds. This is the downside of normalizing the data: the sentence structure is only implicit in the indices.
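As a sketch (with made-up tokens and offsets, not real SWC data), extracting a sentence from that normalized json takes an explicit slice:

```python
import json

# Toy annotation in the original normalized json layout (made-up data).
annotation = json.loads("""
{ "sentence_starts": [0, 2, 4],
  "words": [
    {"token": "hello",    "start": 50,   "end": 370},
    {"token": "world",    "start": 380,  "end": 700},
    {"token": "second",   "start": 800,  "end": 1100},
    {"token": "sentence", "start": 1150, "end": 1600},
    {"token": "bye",      "start": 1700, "end": 1900}
  ]
}
""")

starts = annotation["sentence_starts"]
# Second sentence: words between starts[1] (inclusive) and starts[2] (exclusive).
second = annotation["words"][starts[1]:starts[2]]
tokens = [w["token"] for w in second]
# tokens == ["second", "sentence"]
```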
The XML looked like this:
<sentence>
<token start="50" end="370">hello</token>
[more tokens here]
</sentence>
You can see that it is much more succinct. To obtain the second sentence, just do an XPath query: sentence[2] (XPath indices are 1-based; more about using XPath at the bottom of this post).
But now we have much more structure, as you can see in our RelaxNG definition (have a look, it's easy to read!). We have
- sections which can be nested
- parts which were ignored during the alignment
- sentences containing tokens, containing normalizations, containing phonemes
All in all, the annotation is a fairly elaborate typed tree.
json is actually less succinct if you want to represent such data because it has no built-in notion of node types.
Try to represent <s><t>foo</t> <t>bar</t></s>
in json:
{ "type": "s",
  "elems": [{"type": "t", "elems": ["foo"]},
            " ",
            {"type": "t", "elems": ["bar"]}
           ]
}
The distinction between data and annotation is not clear in json: in
XML, everything that is not an XML tag is character data. To get the
original text, just strip all XML tags. In json you would somehow
have to externally define what the original data is and what the
annotation is. This is important because we keep
character-by-character correspondence to the original data. This is a
very cool feature because you can cross-reference our annotations with
the html markup, e.g. to look at the impact of <b> tags on pronunciation.
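That "strip all XML tags" step really is a one-liner; here is a minimal sketch using only the standard library, with a made-up toy fragment in the SWC-style markup shown above:

```python
import xml.etree.ElementTree as ET

# Toy sentence in SWC-style markup (made-up timing values).
doc = ET.fromstring(
    '<sentence><token start="50" end="370">hello</token> '
    '<token start="380" end="700">world</token></sentence>'
)

# Everything that is not a tag is character data: itertext() walks
# exactly those text nodes, so joining them strips all markup.
original_text = "".join(doc.itertext())
# original_text == "hello world"
```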
Validating your annotations
Last but not least, XML is much easier to validate (and given the
complexity of our annotation, that was necessary!). The RelaxNG
definition is human readable (so
people can learn about the schema) and used for validation at the same
time. Having a schema definition helped quite a bit because it was a
central document where we could collaborate about the annotation
schema. The automatic validation helped to catch malformed output –
which happened more than once and was usually based on some edge
cases. Without the validation, we wouldn’t have caught (and
corrected) them. To my knowledge, there are no good json validators
that check the structure and not just whether it is valid json.
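To give an idea of what that validation step looks like in practice, here is a sketch with lxml; the schema below is a toy one for illustration, not the real SWC RelaxNG definition:

```python
import lxml.etree as ET

# Toy RelaxNG schema: a sentence is a list of tokens, each of which
# must carry start and end attributes.
schema = ET.RelaxNG(ET.fromstring("""
<element name="sentence" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="token">
      <attribute name="start"/>
      <attribute name="end"/>
      <text/>
    </element>
  </zeroOrMore>
</element>
"""))

good = ET.fromstring('<sentence><token start="50" end="370">hello</token></sentence>')
bad = ET.fromstring('<sentence><token>hello</token></sentence>')  # attributes missing

schema.validate(good)  # True
schema.validate(bad)   # False, and schema.error_log says why
```

This is exactly the kind of malformed output the validation caught for us: structurally plausible XML that silently violates the schema.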
Update:
Nice post! By the way some JSON structure validators exist, such as https://t.co/yIUeqHGdy7
— Raphaël (@rbournho) 30. November 2017
I had a look at json-schema and will give you a short comparison. In our annotation, a section has a title and content. The title is a list of tokens, some of which might be ignored. The content of a section can contain sentences, paragraphs, subsections or ignored elements.
This is what the RelaxNG definition for that part looks like:
## A section contains a title and content. Sections are nested,
## e.g. h3 sections are stored in the content of the parent h2
## section.
Section = element section {
attribute level {xsd:positiveInteger},
element sectiontitle { MAUSINFO?, (T | element ignored {(T)*})* },
element sectioncontent { (S|P|Section|Ignored)* }
}
I think it is fairly easy to read if you are acquainted with standard EBNF notation – | is an or, * denotes repetition and so on.
Compare my attempt at using json-schema:
{ "section":
  { "type": "object",
    "required": ["elname", "elems"],
    "properties":
    { "elname": {"type": "string",
                 "pattern": "^section$"},
      "elems": {"type": "array",
                ["and all the interesting parts are still missing"]
      }
    }
  }
}
That part only defines that I want to have a dictionary with elname=section and it needs to have an array for the subelements. I just gave up after a few minutes :-)
Working with XML annotations
Say you want to work with an XML annotated corpus. The easiest way to do that is XPath.
You don't care about our fancy structure annotation and just want to work on the sentences in SWC? Use this XPath selector: //s. The // means descendant-or-self and s is just the element type you are interested in, i.e. you select all sentence elements that are somewhere under the root node. To give you an example in python:
import lxml.etree as ET
root = ET.parse("aligned.swc")
sentences = root.xpath("//s")
You can attach predicates in square brackets. count(t)>10 only selects sentences that have more than ten tokens:
sentences_longer_ten = root.xpath("//s[count(t)>10]")
You are only interested in long sections? Let's get the sections with more than 1k tokens! Note the .//: the leading dot means “start descending from the current node”; with just //, you would count from the root node and not from each section.
long_sections = len(root.xpath("//section[count(.//t)>1000]"))
You want to get the number of words (i.e. tokens that have a normalization) which were not aligned? It’s easy: select all tokens with an n element as child but without an n element that has a start attribute:
number_unaligned_words = root.xpath('count(//t[n][not(n[@start])])')
Note that we used count() to get a number instead of a list of elements.
The aligned words have n subnodes but no n without a start attribute (there is no universal quantifier in XPath, so you have to use the equivalent not-exists construction):
aligned_words = root.xpath('//t[n][not(n[not(@start)])]')
You want to know the difference between start times for phoneme-based and word-based alignments? Here you are!
phon_diffs = [n.xpath("sum(./ph[1]/@start)")
              - int(n.attrib["start"])
              for n in root.xpath("//n[ph and @start]")]
We first obtain the normalizations that have word- and phoneme-based alignments (//n[ph and @start]) and then use a list comprehension to compute the differences between the word-based alignments (n.attrib["start"]) and the start of the first phoneme (n.xpath("sum(./ph[1]/@start)")) – the sum() is just a hack to obtain a number instead of a string…
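If the sum() trick feels too magical, you can also read the attribute and convert it explicitly; here is a minimal sketch using only the standard library, on a made-up toy fragment with the SWC element names:

```python
import xml.etree.ElementTree as ET

# Toy SWC-style fragment (made-up timing values): a normalization n
# with a word-based start and two phoneme children.
doc = ET.fromstring(
    '<s><t><n start="100"><ph start="105"/><ph start="150"/></n></t></s>'
)

# Same computation as above, converting the attribute strings directly.
phon_diffs = [
    int(n.find("ph").get("start")) - int(n.get("start"))
    for n in doc.iter("n")
    if n.get("start") is not None and n.find("ph") is not None
]
# phon_diffs == [5]
```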
And that’s it! In my opinion, it’s easier than working with deeply nested json data structures. Questions, comments? Send me a mail.