Digital Humanities


King's College London, 3rd - 6th July 2010


Text Mining in the Digital Humanities


Heyer, Gerhard
eAQUA Project, Natural Language Processing Group, Institute of Mathematics and Computer Science, University of Leipzig, Germany

Büchler, Marco
eAQUA Project, Natural Language Processing Group, Institute of Mathematics and Computer Science, University of Leipzig, Germany

Eckart, Thomas
eAQUA Project, Natural Language Processing Group, Institute of Mathematics and Computer Science, University of Leipzig, Germany

Schubert, Charlotte
eAQUA Project, Ancient History Group, Department of History, Faculty for History, Art, and Oriental Studies, University of Leipzig, Germany

When thinking about text mining and its scope in the Digital Humanities, a comparison between the theory-based work of the Humanities and the model-driven approaches of Computer Science highlights the decisive differences. Classicists primarily rely on manual work, e.g. using a search engine that finds only what is requested and skips non-requested but nonetheless interesting results, whereas an objective model can be applied to the whole text and comes closer to completeness. Yet even though such a result does not depend on the individual researcher, its quality is typically worse than that of manual work. The workshop therefore combines the quality of manual work with the objectivity of a model.

The workshop consists of four 90-minute sessions, plus one hour for lunch (not provided) and two half-hour breaks (eight hours in all). Every session is divided into three parts:

  1. Theoretical background (30 minutes): This part provides the participants with the necessary background knowledge for the workshop, including a gentle overview of the algorithms working behind the user interfaces.
  2. Introduction of the user interface (15 to 30 minutes): Instead of reading a manual, participants receive a short introduction to the user interface, which every participant can follow along on their own machine. When a problem occurs, the presenters who are not currently presenting will help the respective participant.
  3. Hands-on section (30-45 minutes): After receiving the text mining background and a short introduction to the user interface, the participants have up to half a session to work on their own laptops. Detailed questions can be put to all presenters.

Based on work within the eAQUA project over the last years, the modules Explorative Search, Text Completion, Difference Analysis, and Citation Detection were chosen to highlight the benefits of computer-based models. In detail this means:

  • Explorative Search: By using Google in daily life, almost everything can be found. The basic idea is: if one web page doesn't contain the information sought, another will. The differences in searching humanities texts can be grouped into two main clusters: a) the text corpus is closed and relatively small compared to the Internet; b) compared with everyday Google queries, fully formed requests are quite uncommon in the humanities, since the relevant set of words is often unknown in advance. For this reason a graph-based approach is used to find, starting from a single word such as a city or a person name, interesting associated words that one would typically not have in mind. At the end of this session it will be discussed briefly how such an approach can be integrated into teaching, since a search like this can be especially useful for students exploring and learning a domain.
  • Text Completion: Because papyri and inscriptions are often highly fragmentary, a dedicated session on completing texts is on the agenda. In this session, well-established approaches from spell checking are combined with dedicated techniques addressing the properties of ancient texts.
  • Difference Analysis: In this session a web-based tool is introduced to compare the word lists of, e.g., two authors, works, or literary classifications. The result is divided into five categories: two containing words used in only one of the two text sets, two representing words used significantly more often in one of the two sets, and finally a class of words with similar frequency in both. Based on this separation, differences can be identified faster than by manual reading.
  • Citation Detection: The citation detection session covers three different aspects: a) How can citations be detected? b) How can detected citations be accessed as efficiently as possible by Ancient Greek philologists (micro view on citations)? c) How can more global associations be found, such as dependencies between centuries and particular passages of works (macro view on citations)? The main focus of this session is not the algorithms for finding citations but the user interfaces for different research groups.
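The graph-based association search behind Explorative Search can be illustrated with a minimal sketch: collect words that co-occur with a seed word in the same sentence and rank them by count. This is only an illustration of the idea; the function name is hypothetical, and the eAQUA system uses statistical significance measures rather than raw counts.

```python
from collections import Counter

def cooccurrence_neighbours(sentences, seed, top=5):
    """Return up to `top` words that co-occur with `seed` in the
    same sentence, ranked by raw co-occurrence count.
    Illustrative sketch only; `seed` is assumed lowercase."""
    counts = Counter()
    for sent in sentences:
        tokens = set(sent.lower().split())
        if seed in tokens:
            counts.update(tokens - {seed})
    return [word for word, _ in counts.most_common(top)]
```

Starting from a seed such as a city name, the returned neighbours suggest associated words a researcher might not have had in mind.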
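The Text Completion idea can be sketched as matching a fragmentary word against a lexicon, with lost letters marked by a wildcard. This is a minimal sketch under simplifying assumptions (one '.' per lost character, a plain word list as lexicon); the actual session combines spell-checking techniques with ranking adapted to ancient texts.

```python
import re

def complete(fragment, lexicon):
    """Suggest lexicon entries matching a fragment in which each
    lost character is marked '.', as on a damaged papyrus line.
    Sketch only; a real system would also rank candidates by
    corpus frequency and context."""
    rx = re.compile(fragment.replace(".", r"\w"))
    return [word for word in lexicon if rx.fullmatch(word)]
```

For example, the fragment "a..ra" (two lost letters) matches "agora" but not "arena" or longer words.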
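The five-way split of the Difference Analysis can be sketched directly from its description. The ratio threshold below is an illustrative stand-in for a proper significance test, and the function and category names are hypothetical, not the eAQUA tool's interface.

```python
from collections import Counter

def difference_analysis(words_a, words_b, ratio=2.0):
    """Split the vocabularies of two word lists into the five
    categories: only in A, only in B, significantly more frequent
    in A, significantly more frequent in B, similar frequency."""
    fa, fb = Counter(words_a), Counter(words_b)
    na, nb = sum(fa.values()), sum(fb.values())
    cats = {"only_a": [], "only_b": [], "more_a": [], "more_b": [], "similar": []}
    for w in set(fa) | set(fb):
        if w not in fb:
            cats["only_a"].append(w)
        elif w not in fa:
            cats["only_b"].append(w)
        else:
            ra, rb = fa[w] / na, fb[w] / nb   # relative frequencies
            if ra / rb >= ratio:
                cats["more_a"].append(w)
            elif rb / ra >= ratio:
                cats["more_b"].append(w)
            else:
                cats["similar"].append(w)
    return cats
```

Words appearing in only one set or markedly more often in one set surface immediately, which is what makes the comparison faster than manual reading.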
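One common way to detect citations, sketched here as an assumption rather than as the eAQUA algorithm, is n-gram overlap: any n-token sequence shared between a source and a target text is reported as a candidate citation.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def find_citations(source, target, n=4):
    """Return (position, text) pairs for n-grams in `target` that
    also occur in `source`. Crude sketch of overlap-based citation
    detection; real systems handle inflection and word order."""
    src = ngrams(source, n)
    hits = []
    for i in range(len(target) - n + 1):
        gram = tuple(target[i:i + n])
        if gram in src:
            hits.append((i, " ".join(gram)))
    return hits
```

The micro view of the session corresponds to inspecting such individual hits; the macro view aggregates them, e.g. per century or per work.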

Full day workshop: Monday, 5 July.

© 2010 Centre for Computing in the Humanities

Last Updated: 30-06-2010