TEI Processing and Computational Analysis on Encoding Judgment Text

Authors: Shao, Hsuan-Lei / Huang, Sieh-Chien

Date: Thursday, 7 September 2023, 2:15pm to 3:45pm

Location: Main Campus, L 2.202


Research Background

Legal documents and judgments serve as crucial sources of legal knowledge and social background information. Analyzing these texts using natural language processing (NLP) techniques holds significant value. However, the complex structure and unique formats of legal documents pose challenges to such analysis. In light of this, our research proposes a specialized form and process for the Text Encoding Initiative (TEI) applied to legal judgments. This process leverages domain-specific legal knowledge to add specific markup for legal concepts and terminology as metadata.

In this article, we incorporate TEI as an integral part of the research design for legal analytics. Our study uses a set of homicide judgments as research material and devises a process to convert raw text into a computable dataframe. This process involves specialized data structures for text encoding, methodologies for encoding via XML platforms, structural editing, meta description, and computational analysis.
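As a rough illustration of the raw-text-to-dataframe step, the sketch below flattens a TEI-encoded judgment into one row per case using only the Python standard library. The markup conventions are assumptions made for this example and are not taken from the authors' schema: the `sentencingFactor` and `sentence` type values, the `subtype` labels, the case identifier in `n`, and the helper name `judgments_to_rows` are all hypothetical, and the sample judgment is invented.

```python
import xml.etree.ElementTree as ET

# Official TEI namespace; the element usage below is an assumed convention.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: one judgment with two annotated factors and a sentence term.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="judgment" n="case-001">
        <p>The defendant acted out of
          <seg type="sentencingFactor" subtype="motive">financial dispute</seg>
          and later
          <seg type="sentencingFactor" subtype="attitude">confessed</seg>.</p>
        <p>Sentence:
          <measure type="sentence" unit="years" quantity="12">twelve years</measure>.</p>
      </div>
    </body>
  </text>
</TEI>"""


def judgments_to_rows(xml_text):
    """Return one dict per judgment: case id, factor values, sentence term."""
    root = ET.fromstring(xml_text)
    rows = []
    for div in root.iterfind(".//tei:div[@type='judgment']", TEI_NS):
        row = {"case_id": div.get("n")}
        # Each annotated factor becomes a column named after its subtype.
        for seg in div.iterfind(".//tei:seg[@type='sentencingFactor']", TEI_NS):
            row[seg.get("subtype")] = "".join(seg.itertext()).strip()
        # The sentence term is read from the machine-readable quantity attribute.
        measure = div.find(".//tei:measure[@type='sentence']", TEI_NS)
        row["sentence_years"] = (
            float(measure.get("quantity")) if measure is not None else None
        )
        rows.append(row)
    return rows


rows = judgments_to_rows(SAMPLE)
print(rows)
```

A list of dictionaries in this shape can be passed directly to a dataframe constructor in a tabular library, so each annotated factor becomes a column.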

Literature Review

Initially, the TEI process was used mainly in literature, history, and philosophy research [1,2], with a focus on text data. Over time, however, the scope of TEI has expanded significantly. It is now applied to various other types of data, such as religious classics [3,4], parliamentary records [5], and even “Voice Data” [6], which is another topic of discussion at this conference. Despite this broadening range of applications, TEI implementation and annotation remain scarce in the legal text domain. Our research team seeks to address this gap.

As TEI technology advances, scholars are increasingly reevaluating the ontologies of their respective research fields [7]. They are striving to redefine metadata and (re)build ontologies to suit the digital methods and tools available. In legal studies, which has a long-standing tradition of mature methodologies, our contribution lies in creating a new ontology tailored to legal research that integrates digital methods and practices. We also aim to enhance TEI engineering by developing and rethinking a “digital ontology” of the legal domain.

Research Design

In this study, our primary research objective is to predict the “sentence term of a homicide case.” To achieve this, we have designed a “sentencing factors” ontology derived from the criminal law code. Using our TEI processing and computational analysis approach, we extract information from legal judgments and use it to predict potential sentencing outcomes.
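To make the link between a sentencing-factor ontology and a prediction concrete, the sketch below one-hot encodes factor values and estimates a sentence term from the nearest training cases by Hamming distance. This is only an illustrative baseline: the abstract does not name the prediction model, and the factor categories, their values, the function names, and the toy cases are all invented for this example.

```python
from statistics import mean

# Hypothetical ontology: factor category -> allowed values (not the authors').
ONTOLOGY = {
    "motive": ["dispute", "revenge", "gain"],
    "attitude": ["confessed", "denied"],
}


def encode(case):
    """One-hot encode a case's factor values in fixed ontology order."""
    vec = []
    for category, values in ONTOLOGY.items():
        vec.extend(1 if case.get(category) == v else 0 for v in values)
    return vec


def predict_nearest(train, query):
    """Average the sentence terms of the training cases closest to the query."""
    q = encode(query)
    scored = [
        (sum(a != b for a, b in zip(encode(c), q)), c["sentence_years"])
        for c in train
    ]
    best = min(dist for dist, _ in scored)
    return mean(years for dist, years in scored if dist == best)


# Invented toy cases, for illustration only.
TRAIN = [
    {"motive": "dispute", "attitude": "confessed", "sentence_years": 12},
    {"motive": "revenge", "attitude": "denied", "sentence_years": 18},
    {"motive": "gain", "attitude": "denied", "sentence_years": 20},
]

query = {"motive": "revenge", "attitude": "confessed"}
prediction = predict_nearest(TRAIN, query)
print(prediction)
```

Keeping the ontology as an explicit mapping fixes the column order of the feature vector, so cases encoded at different times remain comparable.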

Our research demonstrates the efficacy of our TEI method for legal text analysis. The adoption of specialized data structures, encoding methodologies, and meta descriptions supports accurate computational analysis and yields meaningful results for sentencing prediction. Applying TEI in the legal domain shows its potential as a tool for enhancing legal text analysis and supporting decision-making processes.

Furthermore, our research extends beyond the legal domain. By offering insights into specific corpora, formal ontologies, and best practices in TEI, our approach has implications for researchers in other fields. At the conference, our presentation will include a description of our TEI processing approach and a demonstration of TEI in action within legal studies. Overall, our research contributes to the broader exploration of TEI’s potential in advancing legal studies and computational analysis.


[1] Vertan, C., & Reimers, S. (2012). A TEI-based Application for Editing Manuscript Descriptions. Journal of the Text Encoding Initiative, (2).

[2] Soualah, M. O., & Hassoun, M. (2012). A TEI P5 Manuscript Description Adaptation for cataloguing digitized Arabic manuscripts. Journal of the Text Encoding Initiative, (2).

[3] McAllister, P. (2020). Quotes, Paraphrases, and Allusions: Text Reuse in Sanskrit Commentaries and How to Encode It. Journal of the Text Encoding Initiative, (13).

[4] Wittern, C. (2020). Digital Texts in Practice. Journal of the Text Encoding Initiative, (13).

[5] Wissik, T. (2021). Encoding Interruptions in Parliamentary Data: From Applause to Interjections and Laughter. Journal of the Text Encoding Initiative, (14).

[6] Emsley, I., & De Roure, D. (2016). “It will discourse most eloquent music”: Sonifying Variants of Hamlet. Journal of the Text Encoding Initiative, (10).

[7] Bowers, J., & Romary, L. (2016). Deep encoding of etymological information in TEI. Journal of the Text Encoding Initiative, (10).

About the authors

Hsuan-Lei Shao (National Taiwan Normal University), https://orcid.org/0000-0002-7101-5272

Sieh-Chien Huang (National Taiwan University), schhuang@ntu.edu.tw
