Bański, Piotr
University of Warsaw
pkbanski@uw.edu.pl
Przepiórkowski, Adam
Institute of Computer Science, Polish Academy of Sciences
adamp@ipipan.waw.pl
The need for text encoding standards for language resources (LRs) is widely acknowledged: within the International Organization for Standardization (ISO) Technical Committee 37 / Subcommittee 4 (TC 37 / SC 4), work in this area has been going on since the early 2000s, and working groups devoted to this issue have been set up in two current pan-European projects, CLARIN (http://www.clarin.eu/) and FLaReNet (http://www.flarenet.eu/). It is obvious that standards are necessary for the interoperability of tools and for the exchange of data between projects, but they are also needed within projects, especially where multiple partners and multiple levels of linguistic data are involved.
One such project is the National Corpus of Polish (Pol. Narodowy Korpus Języka Polskiego; NKJP; http://nkjp.pl/; Przepiórkowski et al. 2008, 2009), carried out in 2008–2010 by four Polish institutions. The project aims to create a 1-billion-word automatically annotated corpus of Polish, with a 1-million-word subcorpus annotated manually. The following levels of linguistic annotation are distinguished in the project: 1) segmentation into sentences, 2) segmentation into fine-grained word-level tokens, 3) morphosyntactic analysis, 4) coarse-grained syntactic words (e.g., analytical forms, constructions involving bound words, etc.), 5) named entities, 6) syntactic groups, 7) word senses (for a limited number of ambiguous lexemes).
Any standards adopted for these levels should allow for stand-off annotation, which is now common practice and virtually indispensable where many levels of annotation, possibly involving conflicting hierarchies, refer to the same text.
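As a minimal sketch of such a stand-off architecture (the file and element names below are illustrative, not the actual NKJP schema), an annotation layer may point at token segments defined in a separate segmentation layer, which in turn points at character spans of the primary text:

    <!-- text.xml: the primary text -->
    <p xml:id="p1">Ala ma kota.</p>

    <!-- ann_segmentation.xml: tokens as stand-off pointers into text.xml -->
    <seg xml:id="seg1" corresp="text.xml#string-range(p1,0,3)"/>  <!-- "Ala" -->
    <seg xml:id="seg2" corresp="text.xml#string-range(p1,4,2)"/>  <!-- "ma" -->

    <!-- ann_morphosyntax.xml: analyses pointing at the segmentation layer -->
    <seg corresp="ann_segmentation.xml#seg1">
      <fs type="morph">
        <f name="base"><string>Ala</string></f>
      </fs>
    </seg>

Because each layer resides in a separate file and refers to lower layers only by pointers, conflicting hierarchies (e.g., syntactic groups and named entities spanning different token ranges) can coexist without violating XML well-formedness.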
Two additional, non-linguistic levels of annotation required for each document are text structure (e.g., division into chapters, sections and paragraphs, appropriate marking of front matter, etc.) and metadata. The standard adopted for these levels should be sufficiently flexible to allow for representing diverse types of texts, including books, articles, blogs and transcripts of spoken data.
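For concreteness, a minimal TEI skeleton combining both non-linguistic levels might look as follows (a sketch only; actual corpus headers are considerably richer):

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt><title>Sample text</title></titleStmt>
          <publicationStmt><p>Publication metadata</p></publicationStmt>
          <sourceDesc><p>Bibliographic description of the source</p></sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <front><div type="preface"><p>Front matter</p></div></front>
        <body>
          <div type="chapter">
            <div type="section">
              <p>A paragraph of running text.</p>
            </div>
          </div>
        </body>
      </text>
    </TEI>

Diverse text types are then accommodated by varying the division structure and the metadata recorded in the header.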
NKJP is committed to following current standards and best practices in corpus development and text encoding. However, because of the current proliferation of official, de facto and purported standards, it is far from clear what standards a new corpus project should adopt. The aim of this paper is to attempt to answer this question.
The three text encoding standards and best practices listed in a recent CLARIN short guide (CLARIN:STE, 2009) are: standards developed within ISO TC 37 SC 4, the Text Encoding Initiative (TEI; Burnard and Bauman 2008) guidelines and the XML version of the Corpus Encoding Standard (XCES; Ide et al. 2000). Apart from these, there are other de facto standards and best practices, e.g., TIGER-XML (Mengel and Lezius, 2000) for the encoding of syntactic information, or the more general PAULA (Dipper, 2005) encoding schema used in various projects in Germany.
The original version of XCES inherits from TEI an exhaustive approach to metadata representation. It makes specific recommendations for the representation of morphosyntactic information and for the alignment of parallel corpora. In the early 2000s, it was probably the most popular corpus encoding standard.
Currently, the claim of XCES to being such a standard is much weaker. A new, more abstract version of XCES was introduced around 2003, in which the concrete morphosyntactic schema was replaced by a general feature structure mechanism, different from the ISO Feature Structure Representation (FSR) standard (ISO 24610-1). In our view, this is a step back, as adopting a more abstract representation requires more work on the part of corpus developers. Moreover, XCES makes no specific recommendations for other levels of linguistic knowledge and provides no mechanisms for representing discontinuity and alternatives, all of which need to be represented in NKJP. Taking into account also the lack of documentation and the potential confusion concerning its versioning, XCES turns out to be unsuitable for the purposes of NKJP.
A family of ISO standards for modelling and representing different types of linguistic information is being developed by ISO TC 37 SC 4. The two published standards concern the representation of feature structures (ISO 24610-1) and the encoding of dictionaries (ISO 24613). Other proposed standards are at varying levels of maturity and abstractness. While these standards may eventually reach the stability and specificity required by practical applications, this is currently not the case.
TIGER-XML and a schema which may be considered its generalisation, PAULA, are specific, relatively well-documented and widely employed best practices for describing linguistic objects occurring in texts (so-called "markables") and relations between them (in the case of TIGER-XML, the constituency relation). They do not, however, contain specifications for metadata or structural annotation.
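To illustrate the general idea (the attribute values below are invented for this example), a TIGER-XML sentence lists terminal markables and encodes the constituency relation as edges from nonterminal nodes:

    <s id="s1">
      <graph root="s1_n1">
        <terminals>
          <t id="s1_t1" word="Ala" pos="subst"/>
          <t id="s1_t2" word="ma" pos="fin"/>
        </terminals>
        <nonterminals>
          <nt id="s1_n1" cat="S">
            <edge idref="s1_t1" label="subj"/>
            <edge idref="s1_t2" label="head"/>
          </nt>
        </nonterminals>
      </graph>
    </s>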
For metadata and structural annotation levels there is no real alternative to TEI. Moreover, TEI P5 implements the FSR standard ISO 24610-1, which can be used for the representation of any linguistic content, along the lines of XCES (although the feature structure representations used in XCES do not comply with this standard), PAULA and the proposed ISO standard, Linguistic Annotation Framework (ISO 24612). TEI P5 is stable, has rich documentation and an active user base, and for these reasons alone it should be preferred to XCES and (the current versions of) the ISO standards. Moreover, any TIGER-XML and PAULA annotation may be expressed in TEI in an isomorphic way, thanks to the linking mechanisms of TEI P5.
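For instance, a single morphosyntactic interpretation can be encoded with FSR-compliant TEI feature structures as follows (the feature names are illustrative, not the actual NKJP tagset):

    <fs type="morph">
      <f name="base"><string>kot</string></f>
      <f name="pos"><symbol value="subst"/></f>
      <f name="number"><symbol value="sg"/></f>
      <f name="case"><symbol value="acc"/></f>
    </fs>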
However, TEI is a very rich toolbox, offering a multitude of mechanisms for representing various aspects of text encoding, and this richness, as well as the sheer size of the TEI P5 documentation (1350–1400 pages), is often perceived by corpus developers as prohibitive. For this reason, within NKJP, a specific set of recommendations for particular levels of annotation has been developed, aiming at maximal compatibility (understood as ease of translation between formats) with other proposed and de facto standards.
For example, TEI P5 offers, among others, the following ways to represent syntactic constituency: direct nesting of constituent elements within the sentence element, encoding of the constituency relation by means of embedded feature structures, and a stand-off representation in which constituents point at their daughters via explicit links. While the first of these representations is the most direct, and the second the most general, it is the third that directly mirrors TIGER-XML, PAULA and SynAF (the proposed ISO standard for syntactic annotation, ISO 24615), and for this reason it has been adopted in NKJP.
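A sketch of this third representation, rendering the TIGER-XML example above in TEI P5 (element choice and attribute values are ours, for illustration only): a nonterminal is a <seg> whose outgoing edges are <ptr> elements pointing at its daughters, with the syntactic category encoded as a feature structure:

    <seg xml:id="s1_n1" type="nonterminal">
      <fs type="group">
        <f name="cat"><symbol value="S"/></f>
      </fs>
      <ptr type="subj" target="ann_segmentation.xml#s1_t1"/>
      <ptr type="head" target="ann_segmentation.xml#s1_t2"/>
    </seg>

Since <ptr> targets may point across files, this representation is fully stand-off and isomorphic to the TIGER-XML graph.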