Introducing standards for automating manuscript transcription
During the Middle Ages most European languages - with the exception of Latin - were still in development. Standard spelling didn’t exist, certain letters were only just emerging and writing introduced all manner of notes and symbols, including the ampersand (&).
The result? “When it came to transcribing medieval manuscripts, individual specialists went about it in their own way”, explains Thibault Clérice, a researcher in computational humanities with the ALMANACH project team at the Inria Paris Centre. “But automating manuscript transcription requires machine learning, and for this you need standards.” In 2022 Ariane Pinche, a CNRS research fellow in medieval studies and digital humanities, launched a project called CATMuS to tackle this challenge.
Training generative AI through the use of a standardised corpus
Recruits to Ariane Pinche and Thibault Clérice’s team included Alix Chagué, then a PhD student at the Inria Paris Centre; Malamatenia Vlachou-Efsthatiou, a PhD student in Latin palaeography at the École Nationale des Ponts et Chaussées; and Simon Gabay, a researcher in digital humanities at the University of Geneva in Switzerland. Their initial objective was to create a massive, uniform database. For this the researchers gathered 300 medieval manuscripts that had already been either fully or partly transcribed (200,000 lines in total), with well-established standards for spelling and abbreviations.
“The documents in question ranged from the 8th to the 16th centuries and were written in a dozen or so different languages - mostly in Old French and Latin, but also in Spanish languages, Italian, Venetian, Dutch, and so on”, explains Thibault Clérice.
This standardisation meant the corpus could then be used to train a model based on artificial intelligence. For this they employed transcription tools developed at the École Pratique des Hautes Études (EPHE) - PSL by researchers including Benjamin Kiessling (now at Inria): eScriptorium and Kraken. Not only is this approach energy-efficient, but it focuses more on image recognition than on understanding language, helping to avoid excessive extrapolation.
More than 32,000 manuscripts transcribed in the space of a few months
This brought an end to CATMuS - but there was a logical next step. “After spending more than two years collecting and transcribing manuscripts and then training the model, all we wanted to do was to put it to use ourselves!”, recalls Thibault Clérice. And so in 2024 the team expanded to launch a second project: CoMMA.
Thibault Clérice remained in charge of modelling and processing, now with assistance from Benoît Sagot, head of the ALMANACH project team, while Hassen Aguili, an engineer also from ALMANACH, was brought in for his expertise in interfacing.
But before they could get the model up and running, they needed documents to transcribe. For this they turned to the EquipEx+ Biblissima+, which boasts a catalogue containing links to the digital versions of more than 260,000 manuscripts stored by institutions such as the National Library of France, as well as the associated metadata (dates, language, name, etc.).
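Digitised manuscripts of the kind linked from the Biblissima+ catalogue are typically published as IIIF manifests, which list every digitised page and its image URL. As a minimal sketch (the manifest below is a hypothetical, heavily trimmed example, not real Biblissima+ data), extracting the page images from a IIIF Presentation 2.x manifest looks like this:

```python
# Sketch: pulling page-image URLs out of a IIIF Presentation 2.x manifest,
# the format commonly used by the libraries that Biblissima+ links to.
# The manifest below is a hypothetical, heavily trimmed example.

manifest = {
    "@type": "sc:Manifest",
    "label": "Example manuscript (hypothetical)",
    "metadata": [{"label": "Language", "value": "Old French"}],
    "sequences": [{
        "canvases": [
            {"label": "f. 1r",
             "images": [{"resource": {"@id": "https://example.org/iiif/f1r/full/full/0/default.jpg"}}]},
            {"label": "f. 1v",
             "images": [{"resource": {"@id": "https://example.org/iiif/f1v/full/full/0/default.jpg"}}]},
        ]
    }],
}

def image_urls(manifest: dict) -> list[str]:
    """Collect one image URL per canvas (page), in reading order."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                urls.append(image["resource"]["@id"])
    return urls

print(image_urls(manifest))
```

Each URL can then be fed to the segmentation and transcription pipeline, page by page.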
“We received a total of 32,763 manuscripts, mostly in Old French and Latin, which we transcribed in four months”, explains Thibault Clérice - infinitely quicker than it would have taken to complete such a task manually.
Interdisciplinarity - the key to success
The model used is centred around two algorithms: one responsible for recognising the different elements on the page (main text, notes, illustrations, etc.), and the other, developed as part of CATMuS, used to transcribe the text. “Ariane Pinche and Malamatenia Vlachou-Efsthatiou manually checked three consecutive lines in 670 manuscripts and the error rate with our model was found to be only 9.7%, which is really low.”
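An error rate of this kind is conventionally computed by comparing the model’s output against a manually checked ground truth, character by character. A minimal sketch of the standard character error rate (CER) metric, based on edit distance, is below; the team’s exact metric and the sample lines are assumptions for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

# Hypothetical ground-truth line vs. model output (invented examples):
truth = "en icel tens que rois artus regnoit"
model = "en icel tens que rois artvs reignoit"
print(f"CER: {cer(truth, model):.1%}")
```

A figure such as 9.7% therefore means that, on average, fewer than one character in ten needed correcting.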
Some errors were linked to the fact that the manuscripts were older than those used to train the model, while others resulted from difficulties recognising the text, particularly when it was written in cursive.
A paper outlining the process and its limitations is on the way and the team hasn’t ruled out further reducing the model’s error rate. “As long as it’s worthwhile”, says Thibault Clérice. “Doubling the processing time just to reduce the error rate by 1% wouldn’t really be worth it.”
For Clérice, what this result underlines - aside from how good their model is - is the power of interdisciplinarity:
With purely digital expertise we would have been unable to understand the manuscripts we were handling and the processes that had to be applied to them.
A range of possible applications
The interdisciplinary aspect also relates to all of the applications possible using this unique corpus, which is now free to access. Simon Gabay has already explored a few of these, studying the evolution in formatting and abbreviations over time in the manuscripts that have been transcribed. But CoMMA is a massive, one-of-a-kind corpus, and a whole host of other applications are possible.
“Prior to now, the biggest corpus of manuscripts written in Old French contained 11 million pseudowords - groups of characters - whereas CoMMA has 516 million”, says Thibault Clérice. “Meanwhile, for Latin, we have gone from 226 million words to 2.7 billion.”
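The “pseudowords” being counted here are simply whitespace-separated groups of characters in the transcriptions; since medieval scribes merged words and used heavy abbreviation, these groups do not always map one-to-one onto modern words. As a minimal sketch (the sample lines are invented), such counts are tallied like this:

```python
# Counting "pseudowords" - whitespace-separated character groups - in
# transcribed lines. The sample lines below are invented for illustration.

lines = [
    "Ci comence li romanz de la rose",
    "ou lart damors est tote enclose",
]

def count_pseudowords(lines: list[str]) -> int:
    """Total number of whitespace-separated character groups."""
    return sum(len(line.split()) for line in lines)

print(count_pseudowords(lines))  # 13
```

Run over the full set of transcribed lines, the same tally yields corpus-scale figures like the 516 million quoted above.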
Elena Pierazzo, professor of digital humanities at the University of Tours, is also excited about the possibilities on offer: “This corpus will change how we process textual data: having such a vast quantity of data that respects original spelling and abbreviations opens up all sorts of avenues for studying writing habits. CoMMA will help us to understand the evolution of languages, including dialects, through the use of statistical data. This corpus also shines a spotlight on texts previously overlooked by researchers, which are now easy to access by searching either by period or by theme.”
A cross-disciplinary tool for the humanities
From a digital perspective, the corpus can now be used to train AI customised for the analysis of ancient texts, something which was previously impossible owing to insufficient data. Plus, as Elena Pierazzo is keen to emphasise: “CoMMA will reshape the borders between disciplines within the humanities. Specialists in the history of art, medicine or philosophy whose paths would previously never have crossed will now be able to work together using this cross-disciplinary tool, which covers practically all of the knowledge there is available on the Middle Ages in Old French and Latin.”
And the researchers are keen to press forwards. Plans are in place to open the corpus up to other languages by obtaining new texts from Biblissima+. “There is no reason why Spanish or Italian languages, and the researchers studying them, can’t take advantage of transcriptions from our model”, concludes Thibault Clérice. Clearly there is much still to be explored.