The Cultural Heritage AI Cookbook (2025 Edition)
Gethin Rees King’s College London
Arno Bosse KNAW HuC
Rossana Damiano University of Turin
Leif Isaksen University of Exeter
Tariq Yousef University of Southern Denmark
Elton Barker The Open University
Khalid Al Khatib Rijksuniversiteit Groningen
Anne Chen Bard College
Enrico Daga The Open University
Stephen Gadd University of Pittsburgh
William Mattingly Yale
Diana Maynard University of Sheffield
Chiara Palladino Durham University
Sebastiaan Peeters University of Twente
Nina Claudia Rastinger Austrian Academy of Sciences
Mia Ridge British Library
Matteo Romanello Odoma
Robert Sanderson Yale
Marco Antonio Stranisci University of Turin
William Thorne University of Sheffield
Erik Tjong Kim Sang Netherlands eScience Center
Leon van Wissen University of Amsterdam
Mónica Marrero Europeana
Margherita Fantoli Catholic University of Leuven
What, Who, How
The depth and diversity of Cultural Heritage (CH) collections are widely recognised: they enrich lives, foster social and cultural cohesion, and act as a significant economic resource. Yet making full use of those collections and the individual records within them remains hampered by a series of interrelated problems: 1. digital catalogue metadata tend to exist for only a small proportion of CH collections; 2. where they exist, they are often sparse, unstructured and contain varying forms of bias; 3. where structured, they are often not aligned with external authorities.
This means that it is currently difficult to discover individual items and almost impossible to link them to other records within the same collection, let alone between different resources.
To address these issues, guidelines have been produced to improve the Findability, Accessibility, Interoperability and Reusability of digital assets through machine-actionable methods. Based on these FAIR principles, Linked Open Data (LOD) has proven an effective mechanism for identifying, disambiguating and linking key entities, such as places, people, objects and events, but implementing LOD tends to require a massive investment of time, resources and expertise. More recently, transformer-based Large Language Models (LLMs) have demonstrated a remarkable capacity to interpret and contextualise natural language. However, while LLMs are far more intuitive to use, their probabilistic and variable outputs make data enrichment unstable and unpredictable: they can return simply too many errors to make their use worthwhile for data curation. The particular scenario set out here uses a combination of LOD and LLM technologies to enable digital assets to be enriched through the processes of Named Entity Recognition, Named Entity Disambiguation, and Relationship Extraction.
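To make the three processes concrete, here is a deliberately toy sketch of what an enrichment pipeline produces at each stage. It is not one of the cookbook's recipes: the rule-based recogniser, the tiny hand-made gazetteer, and its QID-style identifiers are all invented stand-ins for an LLM/NER model and a LOD authority such as Wikidata.

```python
# Toy sketch of the enrichment pipeline: Named Entity Recognition (NER),
# Named Entity Disambiguation (NED) and Relationship Extraction (RE).
# The gazetteer and its identifiers are illustrative stand-ins; a real
# recipe would call an NER model or LLM and query a LOD authority.

GAZETTEER = {
    "Rome": "Q220",      # illustrative QID-style identifiers
    "Tiber": "Q13712",
}

def recognise(text):
    """NER stand-in: treat gazetteer words found in the text as entities."""
    return [tok.strip(".,") for tok in text.split()
            if tok.strip(".,") in GAZETTEER]

def disambiguate(entities):
    """NED stand-in: link each surface form to a gazetteer identifier."""
    return {e: GAZETTEER[e] for e in entities}

def extract_relations(entities):
    """RE stand-in: relate entity pairs that co-occur in one record."""
    return [(a, "co-occurs_with", b)
            for i, a in enumerate(entities)
            for b in entities[i + 1:]]

record = "The Tiber flows through Rome."
ents = recognise(record)          # ["Tiber", "Rome"]
links = disambiguate(ents)        # {"Tiber": "Q13712", "Rome": "Q220"}
rels = extract_relations(ents)    # [("Tiber", "co-occurs_with", "Rome")]
```

The point of the sketch is the division of labour: recognition finds surface forms, disambiguation anchors them to external identifiers, and relation extraction links the anchored entities to one another.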
The following cookbook provides different recipes, derived from LOD and LLM technologies, for enabling CH institutions to enrich their metadata at scale. We envisage two user profiles for the cookbook. One user will be a collections manager who is interested in making use of digital technologies for enriching their objects, but won’t necessarily have the technical expertise to do this for themselves. The second user, who has more technical proficiency, will be able to use our recipes as inspiration or a basis for their own work.
The cookbook has the following structure. It has notebooks for:
1. Data preparation and processes — in which we set out: (i) how to get the data into a format that can be used in these processes; and (ii) the different ways of identifying named entities and then disambiguating them.
2. Evaluation — in which we set out how to assess the results of the data processing according to standard metrics.
3. Applications — in which we set out example use cases for what you’ll be able to do with the processed data.
There is also a Glossary of concepts and an About page.
A final note: this work is very much of the moment: September 2025. Given the rapid pace of technological change, particularly in LLMs, we anticipate that the specific tools and methods that we outline here will not be so cutting-edge in a year. In other words, the recipes should not be considered a maintained service or future-proofed best practice, nor, indeed, a ready-to-go implementation. That said, we believe that these simple-to-follow recipes can be easily adapted to different scenarios, updated with new technologies, and extended for greater coverage. If you have any comments or suggestions, please do raise a GitHub ticket on this repo or email officers@pelagios.org.