Managing Different (Encoding) Cultures when Building a TEI Corpus of Historical German Newspapers

Authors: Haaf, Susanne

Date: Thursday, 7 September 2023, 4:15pm to 5:45pm

Location: Main Campus, L 1



Long before the 20th century, reading the newspaper was already a very popular activity, and the newspaper was established as a mass medium with a huge impact in various areas, such as language usage and change. Nevertheless, the Deutsches Textarchiv project (DTA, since 2007) did not have newspapers on its primary list when creating a TEI corpus to document the development of the New High German language between 1600 and 1900. While the project’s focus was on printed books from various domains (Geyken, Haaf 2018), the ambitious schedule did not allow for newspapers to be considered for the time being. Newspapers became important only later, when the DTA started to include digital data from other sources. Then, with the support of many partners, a sizeable historical TEI newspaper corpus was assembled, which will be presented here as a poster.

Aggregated Corpora

The DTA by now comprises 2,039 issues of different historical newspapers spanning more than 300 years, from 1609 to 1929. The most voluminous corpora are the Mannheim Corpus of Historical Newspapers (Haaf, Schulz 2014; Fiechter et al. 2019), the Neue Rheinische Zeitung (New Rhenish Newspaper), the Hamburgischer Correspondent (Hamburg Correspondent; Schuster, Wille 2017), the Aviso of 1609, and a corpus of issues of the Allgemeine Zeitung from two different sources. In addition, several journal corpora have been aggregated by the DTA.

Coding Cultures

The motto of Coding Cultures suits the historical newspaper corpora of the DTA in different ways:

  1. Similar to all data curation for the DTA (Geyken et al. 2018), one challenge was to convert from diverse source formats (following different coding cultures) into one common TEI format, the DTA Base Format (DTABf, since 2011; Haaf, Geyken, Wiegand 2014).
  2. Where newspapers were newly digitized (Georgi, Haaf: forthcoming; Haaf, Schulz 2014; Schuster, Wille 2017), annotations had to suit individual project rules and interests while preserving homogeneity of the corpus.
  3. Over the centuries, newspapers developed increasingly complex layouts, following different traditions and making annotation more complex.

Hence, the challenge was nothing less than to ensure homogeneous TEI annotation in a newspaper corpus that spans several centuries and gathers material from various sources. This was only possible because of a community effort (including willingness to share, help, compromise, and follow standards) in order to create a research resource for everyone. The poster will present the corpus and the challenges of its creation, also raising detailed questions about TEI newspaper markup itself.
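
As a rough illustration of the first challenge, converting a partner's source markup into one common TEI structure can be thought of as a mapping between element vocabularies. The following is only a minimal sketch: the source element names (`issue`, `article`, `headline`, `para`), the mapping table, the fallback to `seg`, and the sample text are all hypothetical and do not reproduce the DTABf rules or any actual DTA conversion routine. The result is a TEI-namespaced fragment, not a complete or validated TEI document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical mapping from one partner's element names to TEI element
# names plus attributes. Illustrative only; these are not the DTABf rules.
SOURCE_TO_TEI = {
    "issue": ("body", {}),
    "article": ("div", {"type": "article"}),
    "headline": ("head", {}),
    "para": ("p", {}),
}


def to_tei(source_elem):
    """Recursively rebuild a source element as a TEI-namespaced element."""
    # Unknown source elements fall back to <seg type="..."> (an assumption
    # made for this sketch) so that no text content is silently dropped.
    name, attrs = SOURCE_TO_TEI.get(
        source_elem.tag, ("seg", {"type": source_elem.tag})
    )
    tei_elem = ET.Element(f"{{{TEI_NS}}}{name}", dict(attrs))
    tei_elem.text = source_elem.text
    tei_elem.tail = source_elem.tail
    for child in source_elem:
        tei_elem.append(to_tei(child))
    return tei_elem


# Invented sample input standing in for one source format.
source = ET.fromstring(
    "<issue><article><headline>Sample headline</headline>"
    "<para>Sample paragraph text.</para></article></issue>"
)
tei = to_tei(source)

# Serialize with TEI as the default namespace.
ET.register_namespace("", TEI_NS)
print(ET.tostring(tei, encoding="unicode"))
```

In practice each source format would need its own mapping, and layout-specific phenomena of the kind named in the third point would call for more than a one-to-one renaming of elements.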


About the author

Susanne Haaf holds a degree in German Philology and Computational Linguistics (M. A.) from the University of Heidelberg. Currently, she works as a research associate at the Berlin-Brandenburg Academy of Sciences and Humanities, where she has been engaged in the projects DTA, CLARIN-D, t.evo and (to date) ZDL, all of which involved the preparation and maintenance of TEI corpora. In 2022, she completed and defended her PhD thesis at the University of Paderborn; it deals with the computational analysis of patterns that differentiate historical devotional text types.


