In text linguistics, we study, among other things, how texts are structured and whether certain text types are structured in a typical way. However, even though TEI-encoded texts have been around for quite some time, studies in historical German linguistics that exploit TEI encoding for text type analysis are rare.

In my doctoral thesis (Haaf, forthcoming), I studied typical language patterns in certain devotional text types, namely Protestant funeral sermons and devotional prose texts of the 17th century. By comparing corpora, I extracted statistically relevant textual patterns of these text types and examined their respective functions in the text. My work also exploited TEI encoding, both alone and in combination with potential textual patterns, and I plan to report on this in my talk.

The thesis combined qualitative and quantitative methods: potentially significant features were gathered from previous qualitative research and extracted from corpora using (semi-)automatic methods. The results provided insights not only into relevant textual features but also, based on these features, into the specifics of 17th-century devotional culture.

Although devotional literature was highly relevant in its time to people of all social ranks, it has only rarely been considered in linguistic, literary, or theological research (exceptions: esp. Pfefferkorn 2005; Kemper 2015). Moreover, despite vast qualitative research on the characteristics of text types in Germanic linguistics alone (overviews e.g. Heinemann 2000; Schuster 2017; bibliography: Adamzik 1995), the approach of exploiting large corpora in this context is relatively new.

Corpus-linguistic work on text types has focused on contemporary texts and the differentiation of widely disparate text types (e.g. Biber 1988), or on corpus-driven methods (e.g. Scharloth 2018), one exception being Bubenhofer and Spieß 2012. Digital literary studies typically base text type differentiation on stylometry or topic modelling rather than on linguistic features (e.g. Schöch 2017; Hettinger et al. 2015; cf. Viehhauser 2017). Such studies almost never exploit TEI encoding in the corpora used.

Method & Corpora

The current study was based on three TEI corpora (table 1) taken from the DTA (2007–2023) collections. These texts are encoded according to the DTA Base Format (DTABf), a TEI P5 dialect intended to allow for homogeneous annotation and interoperable output of historical texts (DTABf since 2011; Haaf, Geyken, Wiegand 2014). Thus, information on text structures, but also on linguistic features of tokens, was available for data analysis. The latter is obtained by automatic procedures within the digitization and publication workflow of the DTA (Jurish 2012), and one output format includes tei:w elements carrying this token-level information based on the TEI att.linguistic class (Bański, Haaf, Mueller 2017). Feature extraction in the current study was based on this resulting format (DTABf + att.linguistic) and was done using XSLT and Python.

Table 1
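As an illustration of the token-level extraction described above, the following is a minimal sketch in Python, not the code used in the study. It assumes that each tei:w element carries the att.linguistic attributes lemma and pos; the sample document and the function name extract_tokens are invented for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature sample; real DTA files are far larger and more richly encoded.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div>
      <p><w lemma="der" pos="ART">Der</w> <w lemma="Herr" pos="NN">HErr</w>
         <w lemma="sein" pos="VAFIN">ist</w> <w lemma="mein" pos="PPOSAT">mein</w>
         <w lemma="Hirte" pos="NN">Hirt</w></p>
    </div>
  </body></text>
</TEI>"""


def extract_tokens(xml_string):
    """Return one dict per tei:w element with surface form, lemma and POS tag."""
    root = ET.fromstring(xml_string)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        tokens.append({
            "form": "".join(w.itertext()),  # text content, including nested markup
            "lemma": w.get("lemma"),        # None if the attribute is absent
            "pos": w.get("pos"),
        })
    return tokens


tokens = extract_tokens(SAMPLE)
for token in tokens:
    print(token["form"], token["lemma"], token["pos"], sep="\t")
```

Because the tokens are read from within the TEI tree, the same loop can be restricted to particular structures (for instance, only tei:w elements inside tei:note or tei:list) by changing the path expression.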

The study included nineteen features (see examples in table 2) from different textual layers (word, phrase, sentence, text). Their significance was estimated by computing and evaluating measures of descriptive and analytical statistics (on frequency, distribution, and variance).

Table 2
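The frequency-based measures mentioned above can be illustrated with a small sketch. It only covers the descriptive side (frequency normalized per 1 million tokens, mean, and variance across documents); the counts are hypothetical, the function name per_million is invented, and the statistical tests actually applied in the study are not reproduced here.

```python
from statistics import mean, pvariance


def per_million(count, n_tokens):
    """Normalize a raw feature count to a frequency per 1 million tokens."""
    return count * 1_000_000 / n_tokens


# Hypothetical (feature_count, token_count) pairs, one per document.
docs = [(12, 40_000), (30, 60_000), (5, 50_000)]

rates = [per_million(count, n_tokens) for count, n_tokens in docs]

summary = {
    "mean": mean(rates),
    "variance": pvariance(rates),  # population variance across documents
    "min": min(rates),
    "max": max(rates),
}
print(rates)
print(summary)
```

Normalizing per document before aggregating keeps long texts from dominating the corpus-level figure, which matters when comparing corpora of unequal size.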

Results to present

The results show that essential information on patterns of text types can be conveyed by TEI text structuring. This concerns layout specifics as well as typical combinations of textual structures and ways of phrasing.

Thus, results from earlier qualitative analyses could now be specified by factoring in TEI encoding. For example, the finding that citations are essential to devotional literature could now be supplemented with information about how and where bibliographic citations are usually realized in a text (Img. 1).

Img. 1: Margins in funeral sermons carrying (typographic characteristics of) bibliographic citations, freq. per 1 million tokens

Furthermore, characteristic places of lexical repetition (Img. 2) could be specified, along with the relevance of location for the emotionalizing effect of repetition (repetitions in lists vs. paragraphs).

Img. 2: Repetitions of trigrams in paragraphs in the devotional prose (left box), funeral sermons (middle box) and the reference corpus (right box), freq. per 1 million tokens
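To make the combination of structural markup and lexical patterns more concrete, the following sketch counts trigrams that recur within the same tei:p element and normalizes the count per 1 million tokens. The operationalization is an assumption made for illustration (lowercased word forms, every occurrence after the first counted as a repetition) and need not match the one used in the study; the sample text and function names are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample with one repetitive and one non-repetitive paragraph.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><w>ach</w> <w>mein</w> <w>Gott</w> <w>ach</w> <w>mein</w> <w>Gott</w> <w>hilf</w></p>
<p><w>er</w> <w>ist</w> <w>mein</w> <w>Hirt</w></p>
</body></text></TEI>"""


def repeated_trigrams(forms):
    """Count trigram occurrences beyond the first within one token sequence."""
    trigrams = Counter(zip(forms, forms[1:], forms[2:]))
    return sum(n - 1 for n in trigrams.values() if n > 1)


def repetition_rate(xml_string):
    """Return (repetitions, tokens, repetitions per 1 million tokens) over all tei:p."""
    root = ET.fromstring(xml_string)
    repeats = total = 0
    # Note: nested tei:p elements would be visited twice; the sample has none.
    for p in root.iterfind(".//tei:p", TEI_NS):
        forms = ["".join(w.itertext()).lower() for w in p.iterfind(".//tei:w", TEI_NS)]
        total += len(forms)
        repeats += repeated_trigrams(forms)
    rate = repeats * 1_000_000 / total if total else 0.0
    return repeats, total, rate


repeats, total, rate = repetition_rate(SAMPLE)
print(repeats, total, round(rate, 2))
```

Replacing tei:p in the path expression with another structural element, such as tei:list, gives the corresponding figure for that structure, which is the kind of comparison by location described above.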

These are only examples from the range of results obtained by considering TEI encoding that I would like to present. The talk will also address limitations of the approach (i.e. limited annotation depth or lack of interoperability, as briefly listed in table 2), ways to stretch these limits, and requirements for markup depth in linguistic research data.

Finally, a model derived from the study’s results will be presented that shows the intended effects of devotional literature and its significant textual and structural patterns, and thus allows for insights into German devotional culture of the 17th century.


Susanne Haaf holds a degree in German Philology and Computational Linguistics (M. A.) from the University of Heidelberg. She currently works as a research associate at the Berlin-Brandenburg Academy of Sciences and Humanities, where she has been engaged in the projects DTA, CLARIN-D, t.evo and (to the present) ZDL, all of which involved the preparation and maintenance of TEI corpora. In 2022, she completed and defended her PhD thesis at the University of Paderborn; it deals with the computational analysis of patterns that differentiate historical devotional text types.


