Managing Different (Encoding) Cultures when Building a TEI Corpus of Historical German Newspapers

Authors: Haaf, Susanne

Date: Thursday, 7 September 2023, 4:15pm to 5:45pm

Location: Main Campus, L 1



Long before the 20th century, reading the newspaper was already a very popular activity, and the newspaper was established as a mass medium with a considerable impact in various areas, such as language usage and change. Nevertheless, the Deutsches Textarchiv project (DTA, since 2007) did not have newspapers on its primary list when creating a TEI corpus to document the development of the New High German language between 1600 and 1900. The project’s focus was on printed books from various domains (Geyken, Haaf 2018), and the ambitious schedule did not allow for newspapers to be considered at the time. Newspapers became relevant later, when the DTA started to include digital data from other sources. Then, with the support of many partners, a substantial historical TEI newspaper corpus was assembled, which is presented in this poster.

Aggregated Corpora

The DTA now comprises 2,039 issues of different historical newspapers spanning more than 300 years, from 1609 to 1929. The most voluminous corpora are the Mannheim Corpus of Historical Newspapers (Haaf, Schulz 2014; Fiechter et al. 2019), the Neue Rheinische Zeitung (New Rhenish Newspaper), the Hamburgischer Correspondent (Hamburg Correspondent; Schuster, Wille 2017), the Aviso of 1609, and a corpus of issues of the Allgemeine Zeitung from two different sources. In addition, several journal corpora have been aggregated by the DTA.

Coding Cultures

The motto of Coding Cultures suits the historical newspaper corpora of the DTA in different ways:

  1. As with all data curation for the DTA (Geyken et al. 2018), one challenge was to convert diverse source formats (following different coding cultures) into one common TEI format, the DTA Base Format (DTABf, since 2011; Haaf, Geyken, Wiegand 2014).
  2. Where newspapers were newly digitized (Georgi, Haaf: forthcoming; Haaf, Schulz 2014; Schuster, Wille 2017), annotations had to suit individual project rules and interests while preserving the homogeneity of the corpus.
  3. Over the centuries, newspapers developed increasingly complex layouts following different traditions, which makes their annotation more complex.

Hence, the challenge was nothing less than to ensure homogeneous TEI annotation in a newspaper corpus that spans several centuries and gathers material from various sources. This was only possible because of a community effort (including willingness to share, help, compromise, and follow standards) in order to create a research resource for everyone. The poster will present the corpus and the challenges of its creation, also raising detailed questions about TEI newspaper markup itself.


DTA (since 2007): Deutsches Textarchiv. Grundlage für ein Referenzkorpus der neuhochdeutschen Sprache. Prepared by Matthias Boenig, Alexander Geyken, Susanne Haaf, Bryan Jurish, Christian Thomas, and Frank Wiegand. Ed. by Berlin-Brandenburg Academy of Sciences and Humanities. Berlin. Online: (accessed 2023-07-31).

DTABf (since 2011): Deutsches Textarchiv – Basisformat, ed. by Deutsches Textarchiv (DTA) and DTABf Steering Committee (Susanne Haaf, Matthias Boenig, Alexander Geyken, Christian Thomas, Frank Wiegand, Daniel Burkhardt, Stefan Dumont & Martina Gödel). Berlin. Online: (accessed 2023-06-22).

Fiechter, Benjamin, Susanne Haaf, Amelie Meister, and Oliver Pfefferkorn (2019): Presseschau um die Jahrhundertwende: Neue historische Zeitungen im DTA. In Im Zentrum Sprache. Untersuchungen zur deutschen Sprache, 6 February 2019. Online (accessed 2023-08-01).

Geyken, Alexander, Matthias Boenig, Susanne Haaf, Bryan Jurish, Christian Thomas, and Frank Wiegand (2018): Das Deutsche Textarchiv als Forschungsplattform für historische Daten in CLARIN. In Digitale Infrastrukturen für die germanistische Forschung, ed. by Henning Lobin, Roman Schneider, and Andreas Witt (Germanistische Sprachwissenschaft um 2020 6). Berlin/Boston, pp. 219–248. DOI: 10.1515/9783110538663-011.

Geyken, Alexander and Susanne Haaf (2018): Integration heterogener historischer Textkorpora in das Deutsche Textarchiv. Strategien der Anlagerung und Perspektiven der Nachnutzung. In Korpuslinguistik. Ed. by Joachim Gessinger, Angelika Redder and Ulrich Schmitz, with support of Wilfried Stölting (Osnabrücker Beiträge zur Sprachtheorie 92), Duisburg, pp. 175–192.

Georgi, Christopher and Susanne Haaf (forthcoming): Die Volltextdigitalisierung der „Allgemeinen Zeitung“ (1830–1929). Historischer Hintergrund, Workflow und Forschungsperspektiven. In Historische Textmuster im Wandel. Neue Wege zu ihrer Erschließung. Ed. by Susanne Haaf and Britt-Marie Schuster, with support of Frauke Thielert (RGL). Berlin/Boston.

Haaf, Susanne, Alexander Geyken, and Frank Wiegand (2014): The DTA “Base Format”. A TEI Subset for the Compilation of a Large Reference Corpus of Printed Text from Multiple Sources. Journal of the Text Encoding Initiative (jTEI) 8. DOI: 10.4000/jtei.1114.

Haaf, Susanne and Matthias Schulz (2014): Historical Newspapers & Journals for the DTA. In Language Resources and Technologies for Processing and Linking Historical Documents and Archives – Deploying Linked Open Data in Cultural Heritage – LRT4HDA. Proceedings of the workshop, held at the Ninth International Conference on Language Resources and Evaluation (LREC’14), May 26–31, 2014, Reykjavik (Iceland), pp. 50–54.

Schuster, Britt-Marie and Manuel Wille (2017): Die Volltextdigitalisierung der „Staats- und Gelehrten Zeitung des Hamburgischen Unpartheyischen Correspondenten“ und ihrer Vorgänger (1712–1848) und ihr Nutzen. Befunde zur Genese und zum Wandel von Textmustern. In Die Zeitung als das Medium der neueren Sprachgeschichte? Korpora, Analyse und Wirkung, ed. by Oliver Pfefferkorn, Jörg Riecke, and Britt-Marie Schuster. Berlin/Boston, pp. 99–119.

About the author

Susanne Haaf holds a degree in German Philology and Computational Linguistics (M.A.) from the University of Heidelberg. She currently works as a research associate at the Berlin-Brandenburg Academy of Sciences and Humanities, where she has been engaged in the projects DTA, CLARIN-D, t.evo, and (to date) ZDL, all of which involved the preparation and maintenance of TEI corpora. In 2022, she completed and defended her PhD thesis at the University of Paderborn; it includes work on the computational analysis of patterns that differentiate historical devotional text types.

