Thinking about collation

Authors: Beshero-Bondar, Elisa Eileen / Viglianti, Raffaele / Cayless, Hugh / Roeder, Torsten

Date: Tuesday, 5 September 2023, 2:15pm to 5:45pm

Location: HNI, Room F0.231


Collation projects, that is, projects that attempt to compare and model versions, pose distinct challenges for the scholarly conceptualization of a text. The critical apparatus tagsets available in the TEI and the MEI offer structures for modeling variation in multiple ways, but their explanation could be improved, and those engaged in collation projects may wish to improve the examples and modeling possibilities offered in our community guidelines. The leaders of this workshop have each written about, modeled, processed, and otherwise experimented with critical apparatus and collation methods. Each brings a distinct perspective on problems and possibilities for computational modeling of alignment and variation, as well as strategies to process collations for analysis and scholarly publication.

While the workshop leaders are primarily knowledgeable about the TEI, the workshop is directed toward members of the TEI and MEC communities and, depending on the participants, may explore differences in the collation of music notation vs. text-based documents. We are best prepared to discuss textual collation, but we invite music notation experts to participate, both because the two communities share related data models for the critical apparatus and because we work with multimodal documents that combine music notation and text. The workshop leaders will share a diverse array of experiences with collation problems, with emphasis on finding solutions in the conceptual modeling of texts and variation. Participants will be expected to bring their own laptops, and we will need a projector in the room.

We hope that this workshop will help to formulate new ideas about how to model variations. While tools and methods for collation are by necessity distinct for text and music notation, the results share notable commonalities, as demonstrated by the similarity of critical apparatus tags in TEI and MEI. Moreover, the workshop will aim to identify which areas of our Guidelines (both TEI and MEI) can be improved on this subject. For example, in Chapter 12 of the TEI Guidelines, the discussion of the location-referenced, double-end-point-attachment, and parallel-segmentation methods may unnecessarily complicate the question of whether variants can be encoded in-line or externally, and whether variations can be represented as overlapping or not. Further, the TEI’s examples prioritize the highlighting of differences among similar witnesses. Perhaps we should consider an alternative: modeling similarities in closely related but heavily variant texts.

The TEI Guidelines, moreover, merely suggest, without explicating, how computational processes can work with the critical apparatus to reconstruct witnesses from the encoding, and why that may be a desirable goal. Our workshop will explore, identify, and articulate the gaps between the data model and the presumptions of the tooling we have applied.
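To make the idea of reconstructing witnesses concrete, here is a minimal sketch in Python. It assumes an in-line, parallel-segmentation apparatus in which every reading carries a wit attribute. The sample line, the sigla #A, #B, and #C, and the function name reconstruct are invented for illustration. They are not taken from the Guidelines or from any project discussed here.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Invented sample: one line of verse with two points of variation,
# encoded with the parallel-segmentation method.
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0">The '
    '<app><rdg wit="#A">brown</rdg><rdg wit="#B #C">red</rdg></app>'
    ' fox '
    '<app><rdg wit="#A #B">jumps</rdg><rdg wit="#C">leaps</rdg></app>'
    '.</l>'
)


def reconstruct(element, siglum):
    """Return the text of `element` as witnessed by `siglum` (e.g. "#A").

    At each <app>, only the first <lem> or <rdg> whose wit attribute
    lists the siglum is kept; all other content is copied through.
    """
    parts = [element.text or ""]
    for child in element:
        if child.tag == TEI_NS + "app":
            for reading in child:
                is_reading = reading.tag in (TEI_NS + "lem", TEI_NS + "rdg")
                if is_reading and siglum in reading.get("wit", "").split():
                    parts.append(reconstruct(reading, siglum))
                    break
        else:
            parts.append(reconstruct(child, siglum))
        # Text following the child belongs to the enclosing element.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
for siglum in ("#A", "#B", "#C"):
    print(siglum, reconstruct(line, siglum))
```

This sketch silently drops an apparatus entry when a witness has no reading there, and it ignores external or overlapping apparatus entirely. These are the kinds of presumptions that tooling makes about the data model.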

While the word “collation” does not appear in the current version of Chapter 12 of the TEI Guidelines, “collation” is now more frequently paired with “machine-assisted” or applied in the context of automation. Given the importance of computational tooling to the workflow of collation, particularly when it comes to text or music processing, we seek to explore the special challenges involved in trying to generate good output: what fine-tuning methods can we apply, and what strategies are brittle or problematic? The MusicDiff tool developed for Beethoven’s Werkstatt aligns and visualizes music scores marked in a simple form of MEI, and as discussed at the 2020 MEI conference, its developers anticipate that it could be generalized to a wider range of encoding.[1] For text collation, software such as CollateX is often used to help prepare the foundation of a TEI critical apparatus, but the application of such software, and how we prepare its inputs and outputs, requires careful consideration. The TEI Guidelines do not recommend a serialization for automated collation, but perhaps they ought to. The Gothenburg Model, first formulated in 2009 for text collation, helps to establish a method for thinking about the various procedures involved: tokenization, normalization, alignment, analysis/feedback, and visualization.[2] Significantly, these do not proceed in a linear way. Visualization exposes problems, so we have to revisit our paradigms for tokenization and normalization based on analysis of the results, and especially challenging collation projects may seem to require endless adjustments. Workshop participants new to collation will gain familiarity with the Gothenburg Model and consider how effectively its notion of alignment accords with our notions of segmentation.
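The stages of the Gothenburg Model can be sketched in a few lines of Python. This is an illustration only: it does not use CollateX, and it aligns just two witnesses with difflib.SequenceMatcher from the Python standard library, a swapped-in pairwise technique rather than a multi-witness collation algorithm. The two witness strings, the tokenization rule, and the case-folding normalization are all assumptions chosen for brevity.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Tokenization: split a witness into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(token):
    """Normalization: an assumed rule that ignores case differences only."""
    return token.lower()


def align(tokens_a, tokens_b):
    """Alignment: compare normalized tokens, but report the original ones."""
    norm_a = [normalize(t) for t in tokens_a]
    norm_b = [normalize(t) for t in tokens_b]
    matcher = SequenceMatcher(a=norm_a, b=norm_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def visualize(rows):
    """Visualization: a plain-text table; '*' marks a point of variation."""
    lines = []
    for tag, a, b in rows:
        marker = " " if tag == "equal" else "*"
        left = " ".join(a) or "-"
        right = " ".join(b) or "-"
        lines.append(f"{marker} {left:<20} | {right}")
    return "\n".join(lines)


# Invented witnesses for illustration.
witness_a = "The quick brown fox jumps."
witness_b = "the quick red fox leaps."

rows = align(tokenize(witness_a), tokenize(witness_b))
print(visualize(rows))
```

Because alignment runs on normalized tokens, “The” and “the” are treated as agreeing. Changing normalize or tokenize and rerunning the comparison is a small version of the non-linear feedback loop described above.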

Outline of the workshop

  1. Collation in theory: an introduction

  2. Survey of established methods:

    a. Machine-assisted collation and the Gothenburg model for textual data

    b. Critical apparatus modeling

    c. TEI vs. MEI approaches

  3. What can we do with collation data that is not well represented in our current community Guidelines in TEI and MEI?

  4. Problems we experience with collation: Workshop leaders and participants share representative examples of conceptual and computational challenges in identifying alignments and variations in textual data.

  5. Experimental/Open Discussion: Can AI training support collation?

  6. Sharing and workshopping collation use-cases and plans.

About the authors

Elisa Beshero-Bondar is Professor of Digital Humanities and Program Chair of Digital Media, Arts, and Technology at Penn State Erie, The Behrend College. Her work on the Frankenstein Variorum project has led her into some interesting challenges with machine-assisted collation and the TEI critical apparatus. She is chair of the TEI Technical Council, on which she has served as an elected member since 2016.

Raffaele (Raff) Viglianti is a Senior Research Software Developer at the Maryland Institute for Technology in the Humanities, University of Maryland. His research is grounded in digital humanities and textual scholarship, where “text” includes musical notation. He researches new and efficient practices to model and publish textual sources as innovative and sustainable digital scholarly resources. He is currently an elected member of the TEI Technical Council and the Technical Editor of the Scholarly Editing journal.

Hugh Cayless is a Senior Digital Humanities Developer at Duke University Libraries. His focus in the digital critical edition space has been on improving the TEI Guidelines’ treatment of textual variation issues and on developing interactive visualizations of critical apparatus. His work concentrates mainly on ancient texts, including papyri and inscriptions. He has re-edited chapter 12 (Critical Apparatus) of the Guidelines extensively, but still finds problems with it every time he looks at it.

Torsten Roeder is a Senior Digital Humanities Project Manager at the Centre for Philology and Digitality at the University of Würzburg. His main area is digital scholarly editions, and he has worked on various projects that focus on textual genetics, variance, and comparison. His team is developing a generic interface framework for digital resources, while his own research project deals with early born-digital heritage and the semantics of digitally represented text.


  1. Kristin Herold, Johannes Kepper, Ran Mo, and Agnes Seipelt, “MusicDiff – A Diff Tool for MEI,” Music Encoding Conference Proceedings 2020, eds. E. De Luca and J. Flanders, pp. 59-66. Humanities Commons.

  2. For a detailed explanation of the Gothenburg Model, see Interedition Development Group, The Gothenburg Model, 2010-2019. On the summit and workshop of collation software developers in 2009 that formulated the Gothenburg Model, see Ronald Haentjens Dekker, Dirk van Hulle, Gregor Middell, Vincent Neyt, and Joris van Zundert, “Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project,” Digital Scholarship in the Humanities 30:3 (December 2014), pp. 3-4.
