The need for text encoding standards for language resources (LRs) is widely
acknowledged: within the International Organization for Standardization (ISO) Technical
Committee 37 / Subcommittee 4 (TC 37 / SC 4), work in this area has been going on
since the early 2000s, and working groups devoted to this issue have been set up in
two current pan-European projects, CLARIN (
One such project is the National Corpus of Polish (Pol. Narodowy
Korpus Języka Polskiego; NKJP; et al. 2008,
2009) involving four Polish institutions and carried out in 2008–2010. The project aims
at the creation of a 1-billion-word automatically annotated corpus of Polish, with a
1-million-word subcorpus annotated manually. The following levels of linguistic
annotation are distinguished in the project: 1) segmentation into sentences, 2)
segmentation into fine-grained word-level tokens, 3) morphosyntactic analysis, 4)
coarse-grained syntactic words (e.g., analytical forms, constructions involving
bound words, etc.), 5) named entities, 6) syntactic groups, 7) word senses (for a
limited number of ambiguous lexemes).
Any standards adopted for these levels should allow for stand-off annotation, as is now common practice and as is virtually indispensable in the case of many levels of annotation, possibly involving conflicting hierarchies.
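To make the stand-off requirement concrete, the following is a minimal sketch in Python, not the NKJP format itself. It assumes annotations are stored apart from the base text as character ranges; the `Span` class, the layer names and the example sentence are hypothetical. It shows two layers (sentences and a quotation) whose spans overlap without nesting, which is exactly the kind of conflicting hierarchy that inline markup cannot express in a single tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stand-off annotation: a label attached to a character range of the base text."""
    layer: str  # hypothetical layer name, e.g. "sentence" or "quote"
    start: int  # inclusive character offset into the base text
    end: int    # exclusive character offset into the base text


def text_of(base: str, span: Span) -> str:
    """Resolve a span against the (unmodified) base text."""
    return base[span.start:span.end]


def overlaps_without_nesting(a: Span, b: Span) -> bool:
    """True if the two spans intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


# The base text is never edited; every layer only points into it.
BASE = "He said: I left. She stayed."

SENT1 = Span("sentence", 0, 16)   # "He said: I left."
SENT2 = Span("sentence", 17, 28)  # "She stayed."
QUOTE = Span("quote", 9, 28)      # "I left. She stayed."

if __name__ == "__main__":
    for span in (SENT1, SENT2, QUOTE):
        print(span.layer, repr(text_of(BASE, span)))
    # The quotation starts inside the first sentence and ends after it.
    print(overlaps_without_nesting(SENT1, QUOTE))
```

Because each layer refers to the base text independently, adding or revising one level of annotation does not require touching any other level.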
Two additional, non-linguistic levels of annotation required for each document are text structure (e.g., division into chapters, sections and paragraphs, appropriate marking of front matter, etc.) and metadata. The standard adopted for these levels should be sufficiently flexible to allow for representing diverse types of texts, including books, articles, blogs and transcripts of spoken data.
NKJP is committed to following current standards and best practices in corpus
development and text encoding. However, because of the current proliferation of
official, de facto and purported standards, it is far from
clear what standards a new corpus project should adopt. The aim of this paper is to
attempt to answer this question.
The three text encoding standards and best practices listed in a recent CLARIN
short guide (CLARIN:STE, 2009) are the ISO TC 37 / SC 4 standards, TEI and XCES
(et al. 2009; et al. 2000). Apart from these, there are other de facto
standards and best practices, e.g., TIGER-XML (Mengel and Lezius, 2000) for the
encoding of syntactic information, or the more general PAULA (Dipper, 2005)
encoding schema used in various projects in Germany.
The original version of XCES inherits from TEI an exhaustive approach to metadata representation. It makes specific recommendations for the representation of morphosyntactic information and for the alignment of parallel corpora. In the early 2000s, it was probably the most popular corpus encoding standard.
Currently, the claim of XCES to being such a standard is much weaker. A new,
more abstract version of XCES was introduced around 2003, in which the concrete
morphosyntactic schema was replaced by a general feature structure
mechanism, different from the ISO Feature Structure Representation (FSR)
standard (ISO 24610-1). In our view, this is a step backwards, as adopting a more
abstract representation requires more work on the part of corpus developers.
Moreover, XCES has no specific recommendations for other levels of
linguistic knowledge, and no mechanisms for representing discontinuity and
alternatives, all of which need to be represented in NKJP. Taking also into
account the lack of documentation and the potential confusion concerning its
versioning,
There is a family of ISO standards developed by ISO TC 37 SC 4 for modelling
and representing different types of linguistic information. The two
published standards concern the representation of feature structures (ISO
24610-1) and the encoding of dictionaries (ISO 24613). Other proposed
standards are at varying levels of maturity and abstractness. While
these standards may eventually reach the stability and specificity required by
practical applications, this is currently not the case.
TIGER-XML and a schema which may be considered its generalisation, PAULA, are specific, relatively well-documented and widely employed best practices for describing linguistic objects occurring in texts (so-called "markables") and relations between them (in the case of TIGER-XML, the constituency relation). They do not contain specifications for metadata or structural annotation.
For metadata and structural annotation levels there is no real alternative to TEI. Moreover, TEI P5 implements the FSR standard ISO 24610-1, which can be used for the representation of any linguistic content, along the lines of XCES (although the feature structure representations used in XCES do not comply with this standard), PAULA and the proposed ISO standard, Linguistic Annotation Framework (ISO 24612). TEI P5 is stable, has rich documentation and an active user base, and for these reasons alone it should be preferred to XCES and (the current versions of) the ISO standards. Moreover, any TIGER-XML and PAULA annotation may be expressed in TEI in an isomorphic way, thanks to the linking mechanisms of TEI P5.
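As an illustration of what a feature structure in the TEI P5 style looks like, the sketch below builds one with Python's standard `xml.etree.ElementTree`. The element names `fs`, `f`, `symbol` and `string` are those of the TEI feature structure module; the helper function, the feature names (`orth`, `base`, `ctag`, `case`) and the values are assumptions chosen for the example and are not taken from the NKJP schema.

```python
import xml.etree.ElementTree as ET


def feature_structure(fs_type: str, features: dict[str, tuple[str, str]]) -> ET.Element:
    """Build an <fs> element from {feature name: (value kind, value)} pairs.

    The value kind is either "symbol" (a value from a closed set, stored in
    the value attribute) or "string" (free text, stored as element content).
    """
    fs = ET.Element("fs", {"type": fs_type})
    for name, (kind, value) in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        if kind == "symbol":
            ET.SubElement(f, "symbol", {"value": value})
        elif kind == "string":
            ET.SubElement(f, "string").text = value
        else:
            raise ValueError(f"unsupported value kind: {kind!r}")
    return fs


# Hypothetical morphosyntactic description of the Polish form "domu"
# (feature names and tag values are illustrative only).
morph = feature_structure(
    "morph",
    {
        "orth": ("string", "domu"),
        "base": ("string", "dom"),
        "ctag": ("symbol", "subst"),
        "case": ("symbol", "gen"),
    },
)

if __name__ == "__main__":
    print(ET.tostring(morph, encoding="unicode"))
```

The same generic mechanism can carry any level of linguistic description, since only the feature names and values change from one level to another.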
However, TEI is a very rich toolbox, proposing multitudinous mechanisms for
representing multifarious aspects of text encoding, and this richness, as well
as the sheer size of TEI P5 documentation (1350–1400 pages), are often perceived
by corpus developers as prohibitive. For this reason, within NKJP, a specific
set of recommendations for particular levels of annotation has been developed,
aiming at maximal compatibility (understood as ease of translation between
formats) with other proposed and de facto standards.
For example, TEI P5 offers, among others, the following ways to represent
syntactic constituency: et al., 2009);
While the first of these representations is the most direct, and the second the most general, it is the third representation that directly mirrors TIGER-XML, PAULA and SynAF, and for this reason it has been adopted in NKJP.
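The general idea shared by TIGER-XML-style representations can be sketched as follows: syntactic nodes do not enclose their words but point to tokens or to other nodes by identifier. The Python sketch below is only an assumption-laden illustration of that pointer-based model; the `Node` class, the identifiers, the category labels and the English example are hypothetical, and it does not reproduce the TEI elements used in NKJP. It also shows why pointers make discontinuous constituents unproblematic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A syntactic node that refers to its children by identifier."""
    node_id: str
    category: str
    children: list[str] = field(default_factory=list)  # token ids or node ids


# Word-level tokens in text order (hypothetical ids).
TOKEN_ORDER = ["t1", "t2", "t3", "t4"]
TOKENS = dict(zip(TOKEN_ORDER, ["look", "the", "word", "up"]))

NODES = {
    n.node_id: n
    for n in [
        Node("n1", "NP", ["t2", "t3"]),  # "the word"
        Node("n2", "V", ["t1", "t4"]),   # "look ... up" (discontinuous)
        Node("n3", "VP", ["n2", "n1"]),  # the whole phrase
    ]
}


def leaves(ref: str) -> list[str]:
    """Return the token ids dominated by a token or node, in text order."""
    if ref in TOKENS:
        return [ref]
    found = [leaf for child in NODES[ref].children for leaf in leaves(child)]
    return sorted(found, key=TOKEN_ORDER.index)


def is_discontinuous(ref: str) -> bool:
    """True if the dominated tokens do not form one contiguous stretch of text."""
    positions = [TOKEN_ORDER.index(t) for t in leaves(ref)]
    return positions != list(range(positions[0], positions[0] + len(positions)))


if __name__ == "__main__":
    for node_id, node in NODES.items():
        words = [TOKENS[t] for t in leaves(node_id)]
        print(node_id, node.category, words, is_discontinuous(node_id))
```

In a nested, inline representation the node `n2` could not be expressed as a single element, whereas with pointers it is simply a node whose children happen not to be adjacent.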