**Jonathan Bushell is Head of Curation and Library Collection at the Institute of Chartered Accountants in England and Wales (ICAEW)**
When we launched the ICAEW Digital Archive in early 2020, we did what many small teams do with a new system and limited staff: we prioritised preservation over description. We pushed thousands of digital objects into Preservica (our chosen digital preservation system) with only minimal metadata – and sometimes none at all. Our assumption was that most users would arrive via our traditional library catalogue, and would simply follow links through to the digital objects.
A few years on, it was clear that this approach had reached its limits, and so we began exploring how we could use AI as an assistive tool in metadata creation – both for tackling our metadata backlog, and for future ingests.
“Good enough” metadata wasn’t good enough
As our collections grew, our initial “minimal metadata” approach threw up a number of issues, including:
Discovery friction: Expecting users to navigate the repository through a separate library catalogue created unnecessary barriers to discovery.
Uncatalogued formats: We had ingested assets not covered by the library catalogue (such as audio-visual recordings and web archives) that had no descriptive records elsewhere.
Insufficient detail: Our assets needed asset-level descriptions beyond what the library catalogue provided (which was often only collection-level metadata).
In response, we tried to apply more detailed metadata manually, but unfortunately a lack of strict guidelines meant that different people described things in different ways and our practices changed over time. We also faced capacity constraints. Some assets – especially long webinars – would have required hours of watching, pausing, summarising and typing to describe properly. We just didn’t have that kind of time.
The result was a two-fold discovery problem: many assets had minimal or missing metadata, and others had metadata that was inconsistently or incorrectly applied. In both cases, items were effectively undiscoverable.
Clearly, the repository’s metadata needed to be reviewed and corrected, but a completely manual audit was impractical due to the volume of assets. This pushed us to explore how AI – specifically large language models (LLMs) – could help.
Engineering an effective prompt
We knew what needed to be built: a system that would provide individual assets (or their most informative portions) to an LLM, along with highly detailed instructions on how to extract metadata in a specific style, and return a structured response.
As we set about developing such a system, our primary focus was on engineering an effective prompt – essentially a cataloguing manual turned into instructions for the model. Here, we began with simple experiments using OpenAI’s GPT models, before progressively expanding the prompt with more fields and rules while testing on real assets. We used Cursor IDE during development, which significantly sped up our prototyping and allowed us to evolve the workflow organically based on what worked in practice.
Ultimately, we produced a detailed prompt which incorporated the following key components:
Explicit task definitions: “Your task is to analyse uploaded assets and extract structured metadata following ICAEW-specific conventions…”.
Role establishment: “You are a metadata archivist for the ICAEW Digital Archive…”.
Dublin Core schema: listing every metadata field we expect, along with guidelines for each.
ICAEW-specific requirements.
Our topic taxonomy: providing the model with our full list of authorised subject terms, and instructing the AI to use these terms when assigning the Subject field.
Output format specification: explicitly requiring JSON output with specific keys, and providing detailed examples.
Few-shot examples: concrete examples to demonstrate the expected output format and style.
**Validation rules:** for example, “If you are unsure or the information is not present, leave the field blank or say ‘N/A’”.
For more information on our prompt, see the full prompt configuration in our GitHub repository.
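To give a flavour of how these components might be wired together, here is a heavily simplified sketch using the OpenAI Python SDK. The field names, taxonomy terms and prompt wording below are illustrative stand-ins, not our production prompt, which is published in full in the repository.

```python
# Illustrative sketch only: the real prompt (and full field list) lives in the
# GitHub repository; these names and terms are simplified stand-ins.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """\
You are a metadata archivist for the ICAEW Digital Archive.
Your task is to analyse the uploaded asset and extract structured metadata
following ICAEW-specific conventions.

Return JSON with these keys: title, creator, date, description, subject, type.
Rules:
- Pick "subject" terms only from the authorised taxonomy list supplied below.
- If you are unsure or the information is not present, use "N/A".

Authorised subject terms: Audit; Taxation; Financial reporting; ...
"""

def extract_metadata(asset_text: str) -> str:
    """Send the asset's text to the model and return its JSON response."""
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # ask for a JSON object back
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": asset_text},
        ],
    )
    return response.choices[0].message.content
```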
Ensuring consistency and managing costs
To streamline the project, we made some pragmatic design choices – most notably:
PDF as a standard input format. We decided to normalise all assets to PDF format before processing. This simplified our pipeline and ensured the AI model consistently received both text and layout information.
Cost management via page limits. Using OpenAI’s API for hundreds of pages could get expensive, so we implemented a simple but effective strategy: only sending the most relevant portions of each asset (for longer documents, often the first 5-6 pages and the last 5-6 pages) to the model. The front matter and back matter usually contain the key metadata we need: titles, authors, publication dates, abstracts, etc.
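As an illustration, the page-limit step could be implemented along the following lines with the pypdf library. The thresholds and the `trim_pdf` name are examples for this sketch, not the exact code in our repository.

```python
# A minimal sketch of the page-limit strategy using pypdf; thresholds and the
# function name are illustrative rather than our exact implementation.
from pypdf import PdfReader, PdfWriter

def trim_pdf(src_path, dst_path, head=6, tail=6, threshold=12):
    """Keep only the first and last pages of long PDFs to cap API costs."""
    reader = PdfReader(src_path)
    pages = reader.pages
    if len(pages) <= threshold:
        return src_path  # short documents are sent to the model in full

    writer = PdfWriter()
    for page in list(pages[:head]) + list(pages[-tail:]):
        writer.add_page(page)
    with open(dst_path, "wb") as f:
        writer.write(f)
    return dst_path
```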
Implementing an AI-assisted workflow
With our prompt in place and these pragmatic decisions made, we implemented a new AI-assisted workflow and began processing our asset backlog.
The workflow consists of the following steps:
1. **Format check and normalisation** (asset converted to PDF if not already in that format).
2. Page count check, and **creation of a temporary PDF** containing the first and last 5-6 pages, if the asset is over 10-12 pages in length.
3. **AI metadata extraction** via LLM, facilitated by our prompt.
4. **Resulting JSON written to a CSV file** keyed to the asset’s Preservica ID.
5. CSV reviewed by archivist, with corrections being made if necessary.
6. Metadata imported into Preservica in bulk.
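Put together, steps 1-4 can be sketched roughly as follows. Here `convert_to_pdf()` and `read_text()` are placeholders for whatever conversion and text-extraction tools are used, `trim_pdf()` and `extract_metadata()` refer to the earlier sketches, and keying rows on the filename as the Preservica ID is purely an illustrative assumption.

```python
# Rough end-to-end sketch of steps 1-4; convert_to_pdf() and read_text() are
# placeholders, and using the filename as the Preservica ID is an assumption.
import csv
import json
from pathlib import Path

FIELDS = ["preservica_id", "title", "creator", "date", "description", "subject"]

def process_backlog(asset_dir: str, output_csv: str) -> None:
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for asset in sorted(Path(asset_dir).iterdir()):
            pdf = convert_to_pdf(asset)                    # step 1: normalise format
            trimmed = trim_pdf(pdf, f"{pdf}.trimmed.pdf")  # step 2: page limit
            metadata = json.loads(extract_metadata(read_text(trimmed)))  # step 3
            metadata["preservica_id"] = asset.stem         # step 4: key the row
            writer.writerow({k: metadata.get(k, "") for k in FIELDS})
    # Steps 5-6: an archivist reviews the CSV before it is bulk-imported
    # into Preservica.
```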
This has transformed metadata creation from a time-intensive manual process to an efficient background task, enabling us to clear backlogs and make collections accessible faster.
Importantly, the AI solution serves as an assistant, not as a replacement for professional judgement. However, AI is now doing the heavy lifting – proposing detailed descriptions, names, dates and subjects – while humans focus on checking and approving.
Discoverability improved
Having implemented the AI-assisted workflow outlined above, we have seen assets which were previously almost invisible become discoverable to users.
Take a typical webinar recording. Before, a lot of our webinars had a basic title, an ingest date and not much else. You could find them if you already knew the title, but they were unlikely to surface in broader searches. Now, having first used WhisperX to transcribe their content, we are able to use AI to generate rich, structured metadata for these assets, including descriptions and subject terms from our taxonomy.
We still add “(AI generated description)” to the end of the description field to make its provenance explicit, and we often adjust the title for AV material where the AI has had to guess. But the difference in discoverability is huge. A webinar that was previously almost unfindable (except by title) may now appear in topic-based searches.
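For audio-visual assets like these webinars, the transcription step follows WhisperX’s standard usage pattern, roughly as sketched below; the model size and device are illustrative choices rather than a statement of our exact configuration.

```python
# Transcription step following WhisperX's standard usage pattern; model size
# and device are illustrative choices, not necessarily our configuration.
import whisperx

def transcribe_webinar(audio_path: str, device: str = "cuda") -> str:
    """Return a plain-text transcript that can then be sent to the LLM."""
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=16)
    # Join the timestamped segments into a single block of text.
    return " ".join(seg["text"].strip() for seg in result["segments"])
```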
What we’ve learned
First, given the right conditions, AI can excel in metadata creation at scale, saving many hours of human labour. We have seen it rapidly generate summaries, extract names and dates, and apply keywords consistently across thousands of records.
Second, humans still matter. AI doesn’t understand context or nuance like humans do, it can hallucinate plausible-sounding but incorrect details, and it struggles when a document is poorly scanned or when key information only appears in an image. For that reason, we have deliberately kept a human-in-the-loop model: archivists review AI-generated metadata, correct titles, add context, and occasionally delete fields entirely when the model has gone off-piste.
Third, there are practical challenges. PDF quality matters (AI can struggle with poorly scanned image-heavy documents); processing large collections could become expensive; and relying on external APIs brings questions about pricing, rate limits and long-term sustainability.
For anyone curious about the nuts and bolts, the open-source workflow, complete prompt configuration, and documentation are available at github.com/icaew-digital-archive/metadata-extraction.
This blog post is based on a longer, more detailed article illustrated with flowcharts, demos, examples and screenshots, which is available on the author’s website.