FAIR Derived Data in TEI for Copyrighted Texts

Authors: Calvo Tello, José / Du, Keli / Göbel, Mathias / Rißler-Pipka, Nanette

Date: Wednesday, 6 September 2023, 9:15am to 10:45am

Location: Main Campus, L 1.202


In this proposal, we present several options for encoding data derived from copyrighted texts in TEI. We argue for the benefits that TEI can bring to the FAIR status of the different types of data (text, textual structure, metadata, annotation), and present each option in five corpora which cover different languages, genres and periods.


Text and data mining with copyrighted texts faces restrictions on publication and re-use. One solution to this problem is to transform the original files into derived data (see e.g. Sánchez Sánchez and Domínguez Cintas 2007, Lin et al. 2012, Bhattacharyya et al. 2015, Jett et al. 2020, Schöch et al. 2020). By removing the copyright-relevant features from the original documents, this derived data can be published.

Until now, derived data has been expressed using formats and vocabularies chosen according to each project’s preferences. The transformation to derived data has mainly considered the textual information, giving less attention to other kinds of data (textual structure, metadata, or annotation). Many formats struggle to model these kinds of data, which weakens their FAIR status (Wilkinson et al. 2016).

The German consortium Text+ is currently exploring derived data. We argue for TEI as a format for derived data from texts, combining different kinds of data in a single file. In contrast to other formats, which are used only in specific disciplines, TEI is known across many communities. While conversion from TEI to these formats is possible, the inverse workflow is in many cases impossible. TEI fulfils the FAIR criteria better than other formats, because it:

  • offers elements to identify (F1) and describe (F2) the document,
  • is an open format (A1.1) that keeps data and metadata in a single file (A2),
  • offers a rich vocabulary (I1) which itself follows FAIR criteria (I2), and
  • allows documenting changes (I3), decisions (R1), licenses (R1.1), and the origin of the data (R1.2).
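A minimal sketch of how a TEI header can carry these FAIR-relevant elements; the identifier, licence URL, and dates are placeholders, and the mapping of elements to FAIR principles reflects our reading, not a fixed schema:

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- F2: rich metadata describing the document -->
      <title>Derived data from a copyrighted novel</title>
    </titleStmt>
    <publicationStmt>
      <!-- F1: a persistent identifier for the file (placeholder DOI) -->
      <idno type="DOI">10.xxxx/example</idno>
      <!-- R1.1: an explicit licence for the derived data -->
      <availability>
        <licence target="https://creativecommons.org/publicdomain/zero/1.0/"/>
      </availability>
    </publicationStmt>
    <sourceDesc>
      <!-- R1.2: provenance, i.e. the copyrighted original -->
      <bibl>Original printed edition (not republished)</bibl>
    </sourceDesc>
  </fileDesc>
  <revisionDesc>
    <!-- I3: documentation of changes made during derivation -->
    <change when="2023-09-06">Tokens shuffled within paragraphs</change>
  </revisionDesc>
</teiHeader>
```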

To show our implementation, we use documents in five languages from already existing corpora:

  • Gutenberg.de, the German clone of Project Gutenberg
  • American drama on CD-ROM, a corpus of English plays (18th-20th centuries)
  • The Spanish CoNSSA corpus (Calvo Tello 2021)
  • A French corpus with 320 copyright protected novels
  • A Chinese corpus with 158 texts by Lu Xun

In the talk, we will show several options for how researchers could model data derived from the original copyrighted texts in TEI files. First, bag-of-words models can be expressed as measure elements, which can relate to different textual levels, such as volumes, chapters, paragraphs or even sentences. Second, the TEI elements within the body element can themselves be understood as a kind of derived data that researchers can use. The third option combines token frequencies and textual structure expressed through TEI elements by shuffling the tokens randomly within a given TEI element, such as a paragraph. However, this randomization spoils almost any calculation of n-grams or collocations, which are at the core of many current distributional NLP methods. Our final solution is therefore to create n-grams that maintain the original order of the tokens and place them into containers (seg elements), which are then randomly shuffled. Such TEI files can be used with standard tools and still contain text (unreadable for humans), metadata, annotation, textual structure and, partly, the original distribution of the tokens.
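Two of these options might be sketched in TEI as follows; the attribute choices (@unit, @quantity), the 3-gram size, and the paragraph-level granularity are illustrative assumptions, not a fixed encoding:

```xml
<p>
  <!-- Option 1: bag-of-words frequencies as measure elements -->
  <measure unit="tokens" quantity="3">haus</measure>
  <measure unit="tokens" quantity="1">garten</measure>
</p>
<p>
  <!-- Final option: 3-grams keep the original token order inside
       each seg, while the seg elements themselves are shuffled -->
  <seg>stand ein kleines</seg>
  <seg>am ende des</seg>
  <seg>kleines haus im</seg>
</p>
```

Because the shuffling happens only between seg containers, token-level statistics and within-container n-grams survive, while the running text can no longer be reconstructed.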


Bhattacharyya, Sayan, Peter Organisciak, and J. Stephen Downie. 2015. ‘A Fragmentizing Interface to a Large Corpus of Digitized Text: (Post)Humanism and Non-Consumptive Reading via Features’. Interdisciplinary Science Reviews 40 (1): 61–77. https://doi.org/10.1179/0308018814Z.000000000105.

Calvo Tello, José. 2021. The Novel in the Spanish Silver Age: A Digital Analysis of Genre Using Machine Learning. Digital Humanities Research 4. Bielefeld: transcript. https://www.transcript-verlag.de/978-3-8376-5925-2/the-novel-in-the-spanish-silver-age/?c=331025282.

Lin, Yuri, Jean-Baptiste Michel, Erez Aiden Lieberman, Jon Orwant, Will Brockman, and Slav Petrov. 2012. ‘Syntactic Annotations for the Google Books NGram Corpus’. In Proceedings of the ACL 2012 System Demonstrations, 169–74. Jeju Island, Korea: Association for Computational Linguistics. https://aclanthology.org/P12-3029.

Sánchez Sánchez, Mercedes, and Carlos Domínguez Cintas. 2007. ‘El banco de datos de la RAE: CREA y CORDE’. Per Abbat: boletín filológico de actualización académica y didáctica, no. 2: 137–48.

Schöch, Christof, Frédéric Döhl, Achim Rettinger, Evelyn Gius, Peer Trilcke, Peter Leinen, Fotis Jannidis, Maria Hinzmann, and Jörg Röpke. 2020. ‘Abgeleitete Textformate: Text und Data Mining mit urheberrechtlich geschützten Textbeständen’. Zeitschrift für digitale Geisteswissenschaften. https://doi.org/10.17175/2020_006.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. ‘The FAIR Guiding Principles for Scientific Data Management and Stewardship’. Scientific Data 3 (March). https://doi.org/10.1038/sdata.2016.18.

About the authors

José Calvo Tello works as a researcher and subject librarian at the Göttingen State and University Library. His research focuses on the application and development of computational and statistical methods to Romance literature and library records.

Keli Du is a researcher on the Zeta and Company and Text+ projects at the Trier Centre for Digital Humanities. His work focuses on computational literary studies. He is particularly interested in modelling and operationalising humanities research questions as data analysis problems.

Mathias Göbel is a Data Analyst at the Göttingen State and University Library, one of the largest, and definitely the greatest, research libraries in Germany. Mathias prepared the technical part of several TEI-based editions, using the infrastructure provided by DARIAH-DE and TextGrid.

Nanette Rißler-Pipka is a digital humanist and literary scholar specializing in French and Spanish literature. She is the National Coordinator of Germany for DARIAH-ERIC and works at the central office of the Max Weber Foundation-German Humanities Institutes Abroad.
