TEI Lex-0: Recent Developments and New Directions

Authors: Tasovac, Toma / Bański, Piotr / Herold, Axel / Lehečka, Boris / Romary, Laurent / Salgado, Ana

Date: Wednesday, 6 September 2023, 2:15pm to 3:45pm

Location: Main Campus, L 1

Abstract

TEI Lex-0 is a strict customization of the TEI P5 Guidelines for marking up dictionaries which establishes a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources (Tasovac et al. 2018). TEI Lex-0 has received strong community support and uptake: for instance, the format was chosen, together with Ontolex-Lemon (McCrae et al. 2017), as one of the default data models by the European Lexicographic Infrastructure (Tiberius et al. 2022), while the DARIAH Working Group on Lexical Resources, which maintains the TEI Lex-0 Guidelines, received the 2020 Rahtz Prize for TEI Ingenuity.¹

This panel will report on recent developments in TEI Lex-0 and discuss possible new directions. We will do this by delivering three papers: one on dictionary metadata; one on our user-friendly tools for editing and publishing TEI Lex-0 dictionaries; and one on the challenges of encoding descriptive grammars. We plan to leave sufficient time for community feedback and interaction with the audience. The panel will be of interest not only to TEI experts in lexicography and linguistics, but also to metadata specialists, tool developers and the wider audience of TEI enthusiasts who are interested in the social and technical aspects of the application of TEI in various domains.

Encoding Metadata for Dictionaries in TEI Lex-0

This paper highlights the need for domain-specific metadata standards and demonstrates the usefulness of strict TEI Lex-0 Guidelines for encoding dictionary metadata. While the FAIR principles of findability, accessibility, interoperability and reusability (see Wilkinson et al. 2016) have received wide political support and are well-established in science, technology and innovation domains (see Tóth-Czifra 2020), their generic and discipline-agnostic nature leaves significant room for improvement. Lexicographers and researchers need specific information that goes beyond generic metadata formats and focuses on the specific questions of linguistic scope, structural model, markup granularity etc. FAIR Data on its own is not sufficient for such queries.

The paper will review the current state of metadata in lexicographic resources, with references to both LexBib and LexMeta (see Lindemann et al. 2018; Kosem and Lindemann 2021; and Lindemann et al. 2022), explain the rationale for the strict encoding choices made in TEI Lex-0, and contextualize the work on standardizing TEI Lex-0 metadata in the teiHeader within the broader standardization landscape including: a) the ongoing work on ISO 24612 (Lexical Markup Framework) (see Romary et al. 2019); and b) the Lexicographic Data Seal of Compliance, a community-based certification system for lexicographic resources that adhere to best scholarly practices (Tasovac et al. 2021).

Building an Infrastructure for TEI Lex-0

This paper describes two user-friendly tools for working on and with TEI Lex-0 encoded dictionaries: a customized TEI Lex-0 framework for oXygen XML Editor², and a customization of the TEI Publisher³ for lexicographic resources.

The TEI Lex-0 framework, available as an add-on for oXygen XML Editor, offers access to frequently used pieces of XML code (e.g. namespaces or basic parts of the entry structure) with associated keyboard shortcuts. The Author Mode offers intuitive interactive elements for editing metadata as well as dictionary content. Rules defined using Schematron with a Quick Fix extension are used for validation, advanced checking and editing. For data analysis (e.g. extracting headword lists, distributions of morphological categories etc.), XSLT transformations are bundled and used in combination with XQuery functions.
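The kind of data analysis mentioned above (headword lists, distributions of morphological categories) can be illustrated with a minimal sketch. It is not the framework's bundled XSLT or XQuery code: it is a standalone Python approximation using only the standard library, and the function names `headwords` and `pos_distribution` as well as the `SAMPLE` data are hypothetical. It assumes entries that mark the headword as `<form type="lemma"><orth>` and the part of speech as `<gram type="pos">`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace; the "tei" prefix is only a local alias for the XPath queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample data, invented for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry xml:id="cat" xml:lang="en">
      <form type="lemma"><orth>cat</orth></form>
      <gramGrp><gram type="pos">noun</gram></gramGrp>
      <sense xml:id="cat.1"><def>a small domesticated feline</def></sense>
    </entry>
    <entry xml:id="run" xml:lang="en">
      <form type="lemma"><orth>run</orth></form>
      <gramGrp><gram type="pos">verb</gram></gramGrp>
      <sense xml:id="run.1"><def>to move quickly on foot</def></sense>
    </entry>
    <entry xml:id="dog" xml:lang="en">
      <form type="lemma"><orth>dog</orth></form>
      <gramGrp><gram type="pos">noun</gram></gramGrp>
      <sense xml:id="dog.1"><def>a domesticated canine</def></sense>
    </entry>
  </body></text>
</TEI>"""


def headwords(root):
    """Return the lemma spellings of all entries, in document order."""
    path = ".//tei:entry/tei:form[@type='lemma']/tei:orth"
    return [o.text.strip() for o in root.iterfind(path, TEI_NS) if o.text]


def pos_distribution(root):
    """Count part-of-speech values given directly in each entry's gramGrp."""
    path = ".//tei:entry/tei:gramGrp/tei:gram[@type='pos']"
    return Counter(g.text.strip() for g in root.iterfind(path, TEI_NS) if g.text)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(headwords(root))
    print(pos_distribution(root))
```

Because a strict baseline encoding fixes where the lemma and the part of speech are recorded, the same two path expressions should work across dictionaries that follow it; with heterogeneous encodings, each resource would need its own extraction logic.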

TEI Lex-0 Publisher extends the main functionalities of TEI Publisher using XQuery, Web Components and JavaScript to offer basic and advanced features for working with monolingual (and, in a future iteration, multilingual) dictionaries. These include, for instance, browsing, simple and advanced search (combining multiple parameters), facets, facsimile display and a REST API for working with dictionary data.

Encoding Descriptive Grammars in TEI Lex-0

Descriptive grammars are one of the cornerstones of the study of language: while dictionaries, broadly speaking, describe the meaning of words, grammar books describe how the words are constructed and how they are put together to form meaningful sentences. And just like dictionaries, grammars created in the past are still of interest to humanities scholars: they document not only the past epochs of particular languages, but also the evolving thought about language as such and its central role in society.

In this paper, we will analyze the differences between encoding grammars and dictionaries while paying special attention to the interplay of structured elements (ranging from individual morphosyntactically tagged forms to full-blown declensions and conjugations) within the largely narrative text of grammar books. We will argue that the encoding of grammar books can be made semantically more precise and infrastructurally more interoperable through the reuse of structural patterns or “crystals” (Romary and Wegstein 2012) recommended by TEI Lex-0.

The paper will conclude with suggestions on how the encoding of grammatical information can be improved across the several parts of the TEI Guidelines.

Bibliography

Kosem, I. and D. Lindemann (2021): ‘New developments in Elexifinder, a discovery portal for lexicographic literature’. In: Gavriilidou, Z., L. Mitits and S. Kiosses (eds.): Lexicography for Inclusion: Proceedings of the 19th EURALEX International Congress, 7–11 September 2021, Alexandroupolis, Vol. 2. Alexandroupolis, pp. 759–766. https://euralex2020.gr/proceedings-volume-2.

Lindemann, D., F. Kliche and U. Heid (2018): ‘LexBib: a corpus and bibliography of metalexicographical publications’. In: Proceedings of EURALEX 2018. Ljubljana, pp. 699–712. http://euralex.org/publications/lexbib-a-corpus-and-bibliography-of-metalexicographical-publications/

Lindemann, D., P. Labropoulou and C. Klaes (2022). ‘Introducing LexMeta: A Metadata Model for Lexical Resources’. In: XX EURALEX Conference, Mannheim, Germany. https://doi.org/10.5281/zenodo.6897062

McCrae, J. P., Bosque-Gil, J., Gracia, J., Buitelaar, P. and Cimiano, P. (2017). ‘The Ontolex-Lemon model: development and applications’. In: Proceedings of eLex 2017 conference (pp. 19-21). http://john.mccr.ae/papers/mccrae2017ontolex.pdf

Romary, L. and W. Wegstein (2012). ‘Consistent modelling of heterogeneous lexical structures’. In: Journal of the Text Encoding Initiative, Issue 3, November 2012. URL: http://jtei.revues.org/540; DOI: 10.4000/jtei.540

Romary, L., M. Khemakhem, F. Khan, J. Bowers, N. Calzolari et al. (2019). ‘LMF Reloaded’. In: AsiaLex 2019: Past, Present and Future, June 2019, Istanbul, Turkey. ⟨hal-02118319⟩.

Tasovac, T., Romary, L. et al. (2018). TEI Lex-0: A baseline encoding for lexicographic data. Version 0.9.1. DARIAH Working Group on Lexical Resources. https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html.

Tasovac, T., Romary, L., Tóth-Czifra, E., Marinski, I. (2021). Lexicographic Data Seal of Compliance. Research Report. ELEXIS; DARIAH. ⟨hal-03344267⟩.

TEI Consortium. (2022). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 4.6.0. Last updated on 4th April 2023, revision f18deffba. TEI Consortium. https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html.

Tiberius, C., S. Krek, M. Mechura, J. McCrae and T. Tasovac (2022). D1.5 Best practices for lexicography - final report. European Lexicographic Infrastructure (ELEXIS). https://elex.is/wp-content/uploads/ELEXIS_D1_5_Best_practices_for_lexicography.pdf

Tóth-Czifra, E. (2020). ‘The Risk of Losing the Thick Description: Data Management Challenges Faced by the Arts and Humanities in the Evolving FAIR Data Ecosystem’. In: J. Edmond (ed.), Digital Technology and the Practices of Humanities Research. Cambridge, UK: Open Book Publishers. https://doi.org/10.11647/OBP.0192

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. (2016). ‘The FAIR Guiding Principles for scientific data management and stewardship’. In: Sci Data 3, 160018. DOI: 10.1038/sdata.2016.18

Notes
