Knowledge in textual form is always presented as visually and hierarchically
structured units of text, which is particularly true in the case of academic
texts. One research hypothesis of the ongoing project Knowledge ordering in texts: text structure and structure visualisations as
sources of natural ontologies (LOEWE-Schwerpunkt Kulturtechniken und ihre
Medialisierung) is that the structuring of a text, as reflected for example in its table of contents (toc),
represents a compromise between requirements of the text type and the
methodological and conceptual structure of its subject matter. The aim of the
project is to examine how visual-hierarchical structuring systems are
constructed, how knowledge structures are encoded in them, and how they can be
exploited to automatically derive ontological knowledge for navigation,
archiving, or search tasks. The idea of extracting domain concepts and semantic
relations mainly from the structural and linguistic information gathered from
tables of contents represents a novel approach to ontology learning.
In the present phase, we examine German academic textbooks; in later phases,
dissertations, research articles and historical scientific texts will also be
taken into account. A corpus of digital versions of 32 textbooks from 12
different academic disciplines has been compiled. (We thank the publishers
Facultas, Haupt, Narr/Francke/Attempto, Springer, UTB, Vandenhoeck
& Ruprecht, and Wissenschaftliche Buchgesellschaft for kindly making
digital versions of textbooks available to us.)
Presently, all available annotation layers are stored in an eXist native XML
database.
The corpus infrastructure is used to explore the documents by applying the method of toc fragment analysis described in the following section, and to implement functions for concept extraction and semantic relation analysis.
Our method of analysing toc fragments consists of the following steps:
Consider the section of the generated table of contents of the textbook Einführung Pädagogik by Raithel et al. (2007) shown
in Figure 1. By choosing the heading 5. Ausgewählte
Subdisziplinen und Fachbereiche together with its immediately superordinate
heading (in this case the title of the book) and its immediately
subordinate headings, we arrive at the toc fragment (or “window”) shown in
Figure 2. The toc fragment contains four terms from the domain: Pädagogik, Erlebnispädagogik,
Erwachsenenbildung, and Gesundheitspädagogik. These are distinguished from other terms
(e.g. Einführung, Subdisziplin and Fachrichtung)
and from terms denoting text-type structural categories of academic texts
(such as Literatur).
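The window selection described above can be illustrated with a minimal Python sketch. It is not the project's implementation (the corpus is held in an eXist XML database); the `Heading` class, the `toc_fragment` function and the toy data are assumptions introduced purely for illustration, and the subordinate headings are simplified to the three domain terms mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Heading:
    """One toc entry (hypothetical structure, not the project's data model)."""
    title: str
    children: List["Heading"] = field(default_factory=list)
    # The parent link is excluded from repr/eq to avoid infinite recursion.
    parent: Optional["Heading"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Heading") -> "Heading":
        """Attach `child` as an immediately subordinate heading and return it."""
        child.parent = self
        self.children.append(child)
        return child


def toc_fragment(heading: Heading) -> dict:
    """Return the window around `heading`: its immediately superordinate
    heading, the heading itself, and its immediately subordinate headings."""
    return {
        "superordinate": heading.parent.title if heading.parent else None,
        "heading": heading.title,
        "subordinate": [child.title for child in heading.children],
    }


# Toy data loosely modelled on the example in the text (simplified, assumed).
book = Heading("Einführung Pädagogik")
chapter = book.add(Heading("5. Ausgewählte Subdisziplinen und Fachbereiche"))
for title in ("Erlebnispädagogik", "Erwachsenenbildung", "Gesundheitspädagogik"):
    chapter.add(Heading(title))

fragment = toc_fragment(chapter)
print(fragment)
```

Under these assumptions, the fragment pairs the book title with the chosen heading and its three subordinate headings, which is the unit that the subsequent semantic analysis operates on.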
We employ the semantic network approach Multilayered
Extended Semantic Networks (acronym: MultiNet) by Helbig (2006) to
represent the domain concepts and the semantic relations between them
that are expressed in a toc fragment. MultiNet is a fully fledged
semantic theory that provides a rich and consistent inventory of semantic
entity types, features, relations and functions, and it has previously been
employed in the syntactic-semantic analysis components of QA systems
(Hartrumpf 2005). Using the graphical MWR editor for designing
MultiNets,
In the semantic network in Figure 3, the concepts
On account of this analysis, the following hypothesis is formed:
Given a potential structuring schema, consisting of an initial expression N,
and an expression N-1 related to N by a heading_of
relation on the document structure level, and an expression N+1 to which N
is related by the heading_of relation on the document
structure level (cf. Figure 4), if
The hypothesis is tested by formulating the potential structuring schema
as a query to the corpus in the XQuery query language. The query result
document then contains a set of toc fragments, which can be inspected as
to whether their semantics conform to the hypothesis, yielding
simple statistics on the validity of the hypothesis. Sometimes the
inspection also leads to a modification of the original query. In the
first result fragment in Figure 6, for instance, the superordinate concept
Wahrnehmung is not contained in N-1 but occurs as the
compound modifier of Bereich (a synonym of Subdisziplin).
In this example it becomes clear that analyses on the morphological and lexical-semantic level interact with the analyses of the structuring information in that both levels provide conditions or constraints when building the semantic analysis of a toc fragment. Our corpus infrastructure is designed such that information from multiple linguistic and structural levels can be taken into account.
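The query-and-inspect procedure described above can be sketched in Python as follows. The project itself formulates the schema as an XQuery query against the eXist database; the Python version here is a stand-in, and the nested-tuple toc layout, the function names `match_schema` and `conformance_rate`, the regular expression and the toy corpus are all assumptions made for illustration. The conformance judgements are entered by hand, mirroring the manual inspection step.

```python
import re
from typing import Iterator, List, Optional, Tuple

# A toc node is (title, [child nodes]); this layout is an assumption.
TocNode = Tuple[str, list]


def match_schema(node: TocNode, n_pattern: str,
                 parent_title: Optional[str] = None) -> Iterator[dict]:
    """Yield every window (N-1, N, N+1) whose middle heading N matches
    `n_pattern` and which has both a superordinate heading and
    at least one subordinate heading."""
    title, children = node
    if parent_title is not None and children and re.search(n_pattern, title):
        yield {"N-1": parent_title, "N": title,
               "N+1": [child[0] for child in children]}
    for child in children:
        yield from match_schema(child, n_pattern, title)


def conformance_rate(judgements: List[bool]) -> float:
    """Share of inspected fragments judged to conform to the hypothesis."""
    return sum(judgements) / len(judgements) if judgements else 0.0


# Toy corpus with a single, simplified toc (assumed data).
corpus = [
    ("Einführung Pädagogik", [
        ("5. Ausgewählte Subdisziplinen und Fachbereiche", [
            ("Erlebnispädagogik", []),
            ("Erwachsenenbildung", []),
            ("Gesundheitspädagogik", []),
        ]),
        ("Literatur", []),
    ]),
]

# Hypothetical query: N mentions "Subdisziplin" or its synonym "Bereich".
hits = [hit for toc in corpus
        for hit in match_schema(toc, r"Subdisziplin|Bereich")]

# Judgements come from manual inspection of each hit (values assumed here).
judgements = [True]
rate = conformance_rate(judgements)
```

A failed or unexpected match, such as a superordinate concept appearing only as a compound modifier, would in this sketch be handled by refining the pattern or adding a morphological check before re-running the query.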
We are currently compiling an inventory of sets of complex conditions that connect a structuring
schema with a MultiNet schema, treating them as constructions in the
sense of Construction Grammar (CxG). Construction Grammar (Kay 1995, Östman
& Fried 2004) is a theory of grammar that is based not on phrase
structure rules operating on lexical elements but on combinations of
constructions in which form schemata are associated with meaning schemata;
it is therefore appropriate for the description task at hand. The inventory
of constructions will then be employed in ontology learning, particularly
for the task of automatically extracting domain concepts and semantic
relations between them. Constructions describing document structuring
schemata as described above play a role similar to the lexico-syntactic
“Hearst Patterns” described in Hearst (1992), which have been employed for
extracting semantic relations from running text.