Text Mining in the Digital Humanities
**Registration for this workshop is now closed**
Gerhard Heyer, Marco Büchler, Thomas Eckart (University of Leipzig), Charlotte Schubert (University of Leipzig)
Full day workshop: Monday, 5 July
When considering text mining and its scope in the Digital Humanities, a comparison between the theory-based work of the Humanities and the model-driven approaches of Computer Science highlights the decisive differences. Classicists rely primarily on manual work, e.g. using a search engine that finds only what is requested and skips unrequested but nonetheless interesting results, whereas an objective model can be applied to the whole text and comes closer to completeness. Although the results of such a model do not depend on the individual researcher, their quality is typically lower than that of manual work. The workshop therefore combines the quality of manual work with the objectivity of a model.
The workshop consists of four sessions of 90 minutes each, plus one hour for lunch (not provided) and two half-hour breaks (8 hours in total). Each session is divided into three parts:
- Theoretical background (30 minutes): This part gives participants the necessary background knowledge. It includes an informal brainstorming on the algorithms running behind the user interfaces.
- Introduction of the user interface (15 to 30 minutes): To spare participants from reading a manual, the presenter gives a short introduction to the user interface, which every participant can follow on their own machine. If a problem occurs, the two presenters who are not currently presenting will help the participant concerned.
- Hands-on section (30 to 45 minutes): After receiving the text mining background and a short introduction to the user interface, participants have up to half of a session to work on their own laptops. Detailed questions can be put to any of the presenters.
Based on work carried out in the eAQUA project over recent years, the modules Explorative Search, Text Completion, Difference Analysis and Citation Detection have been chosen to highlight the benefits of computer-based models. In detail:
- Explorative Search: In daily life, almost anything can be found with Google. The basic idea is that if one web page does not contain the information sought, another one will. Searching Humanities texts differs in two main respects: a) The text corpus is closed and small relative to the Internet. b) Unlike everyday Google queries, which consist of a known set of words, complete queries in Humanities research are often unclear because the relevant set of words is unknown. This session therefore uses a graph-based approach that starts from a single word, such as a city or a person, and finds interesting associated words that would typically not come to mind directly. At the end of the session, there will be a brief discussion of how such an approach can be integrated into teaching, since a search of this kind can be especially useful for students exploring and learning a domain.
- Text Completion: Because papyri and inscriptions are often fragmentary, a dedicated session on completing texts is on the agenda. In this session, well-established spell-checking approaches will be combined with dedicated techniques that address the properties of ancient texts.
- Difference Analysis: Comparing different entities is a well-known research methodology. This session therefore introduces a web-based tool for comparing word lists of, e.g., two authors, works or literary classifications. The result is divided into five categories: two categories contain words used in only one of the two sets; two categories contain words used significantly more often in one of the two sets than in the other; and a final category contains words of similar frequency in both sets. Based on this separation, differences can be identified faster than by manual reading.
- Citation Detection: The session on citation detection covers three aspects: a) How can citations be detected? b) How can detected citations be accessed as efficiently as possible by Ancient Greek philologists? (micro view on citations) c) How can more global associations be found, such as dependencies between centuries and particular passages of works? (macro view on citations) The main focus of this session is not on the algorithms for finding citations but on the two user interfaces mentioned, which serve different research groups.
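
To make the five categories of the Difference Analysis module more concrete, the following is a minimal sketch in Python. It is not the eAQUA tool itself: the function name `difference_analysis`, the category labels and the `ratio` threshold are assumptions made for illustration. In particular, a simple ratio of relative frequencies stands in here for whatever significance test the actual tool applies.

```python
from collections import Counter


def difference_analysis(tokens_a, tokens_b, ratio=2.0):
    """Sort the joint vocabulary of two token lists into five categories.

    `ratio` is a hypothetical threshold: a word counts as "more frequent"
    in one set if its relative frequency there is at least `ratio` times
    its relative frequency in the other set. A real tool would more likely
    use a statistical significance test instead.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())

    result = {
        "only_a": [],     # words used only in set A
        "only_b": [],     # words used only in set B
        "more_in_a": [],  # words clearly more frequent in A
        "more_in_b": [],  # words clearly more frequent in B
        "similar": [],    # words of similar frequency in both sets
    }

    for word in sorted(set(freq_a) | set(freq_b)):
        if word not in freq_b:
            result["only_a"].append(word)
        elif word not in freq_a:
            result["only_b"].append(word)
        else:
            # Compare relative frequencies so that sets of different
            # sizes can be compared fairly.
            rel_a = freq_a[word] / total_a
            rel_b = freq_b[word] / total_b
            if rel_a >= ratio * rel_b:
                result["more_in_a"].append(word)
            elif rel_b >= ratio * rel_a:
                result["more_in_b"].append(word)
            else:
                result["similar"].append(word)
    return result


# Toy English word lists; real input would be tokenized source texts.
tokens_a = "the king the king the king sails to troy".split()
tokens_b = "the queen stays in the city of troy and the king waits".split()

result = difference_analysis(tokens_a, tokens_b)
for category, words in result.items():
    print(f"{category}: {words}")
```

In this toy example, "king" falls into the category of words used more often in the first set, while "the" and "troy" are of similar frequency in both. Working with relative rather than absolute frequencies keeps the comparison meaningful when the two text sets differ in size.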