This article is a follow-up to Building an Overengineered Retrieval System. That one was about building the entire system. This one is about doing the evals for it.
In the previous article, I went through different parts of a RAG pipeline: chunking the data properly, query optimization, retrieval (semantic, BM25, or hybrid search), re-ranking, expanding chunks to neighbors, building the context, and then generation with an LLM.
One of the questions I got was: does expanding chunks to neighbors actually improve answers, or does it just add noise and make it harder for the model to stay grounded?
So that’s what we’ll test here. We’ll run some basic evaluations and look at metrics like faithfulness, answer relevancy, context relevance, and hallucination rate, and compare results across different models and datasets.
I’ve collected most of the results here and here already, but we’ll go through them too.
As a note, I’m planning to compare this kind of “advanced” pipeline to a more naive baseline later. But this article is mainly about evaluating the pipeline as it is.
I always go through some intro sections before I dig in, but if you’re new-new, I’d first read up on how to build a basic RAG system, how embeddings work, and an actual intro to evals/metrics. Then you can also read how to build the over-engineered pipeline I introduced above, or at least skim it.
If none of this is new, then skip to the results part.
Why we perform evals
Evals are about pressure-testing the system on a bigger (and more targeted) set of questions than your favorite ten, and making sure that whatever changes you push don’t degrade the quality of the system.
Changes in data, prompts, or models can very much affect performance without you seeing it.
You may also need to show your team the general performance of the system you’ve built before being allowed to test it on real users.
But before you do this, you need to decide what to test.
What does a successful system look like to you? If you care about multi-hop, you need questions that actually require multi-hop. If you care about Q&A and proper citations, you test for that. Otherwise, you end up evaluating the wrong thing.
This is a bit like doing investigative work: you test something, you try to understand the results, and then you build better tests.
To do this well, you should try to build a golden set (often from user logs) to test with.
This isn’t always possible, so in situations like this we build synthetic datasets. This may not be the best way to do it, as it will clearly be biased and won’t reflect what your users will actually ask. Nevertheless, you may need to start somewhere.
For this article, I’ve created three different datasets so we can compare them: one generated from the ingested corpus, one with messy user-style questions generated from the same corpus, and one with random questions about RAG that weren’t generated from the corpus at all.
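To make this concrete, here’s a rough sketch of how the first, corpus-based dataset could be generated. The prompt, the `articles` structure, and the generator model are placeholders I’m making up for illustration, not the exact code behind the results.

```python
# Hypothetical sketch: generate two questions per ingested article with an LLM.
from openai import OpenAI

client = OpenAI()

articles = [
    {"id": "paper_001", "text": "Full text of a RAG paper ..."},
    # ... the rest of the ingested corpus
]

def generate_questions(article_text: str, n: int = 2) -> list[str]:
    prompt = (
        f"Write {n} short questions that can be answered using ONLY the text below. "
        "Return one question per line, nothing else.\n\n"
        f"TEXT:\n{article_text[:6000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # generator model is an assumption here
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip() for q in lines if q.strip()][:n]

dataset = [
    {"question": q, "source_id": a["id"]}
    for a in articles
    for q in generate_questions(a["text"])
]
```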
You’ll be able to see how these datasets give us different results on the metrics, but that they all mean different things.
What to think about
I’m not going to go through everything there is to think about here, because doing evals well is pretty difficult (although also fun if you like statistics and data).
But there are a few things you need to keep in mind: LLM judges are biased, cherry-picking questions is a problem, gold answers are best if you have them, and using a larger dataset with tags helps you break down where and how the system is failing.

If you’ve read the eval metrics article, you’ve already seen the idea of LLM-as-a-judge. It can be useful, but it’s not inherently reliable because it has baked-in preferences and blind spots.
There are things that will make you go mad, like a judge punishing an answer that’s based on the corpus but not explicitly stated in the retrieved chunks (summaries / small inferences), or judging the same answer differently depending on how the question is phrased.
You’ll notice this later when you dig into the failing questions to understand why they score low.
Another thing to keep in mind is to make sure not to “cherry-pick” questions, even if you feel the urge to.
You obviously have to start somewhere, but the goal is to get close to what your users are actually asking, find the issues, and to update the dataset continuously based on what the system seems to fail in. It’s easy to get nice numbers if you mostly test “easy” questions, but then the eval becomes less useful.
The best thing is to have not just real user questions but also gold answers.
So even if you can “bypass” having references by using an LLM judge, having the correct answers for those questions is best. That’s when you can use the LLM to judge whether the output matches the gold answer, instead of asking it to judge the answer on its own.
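As a quick sketch of what that can look like (the prompt wording and judge model here are just placeholders, not a prescribed setup):

```python
# Hypothetical sketch: grade the pipeline's answer against a gold answer with an
# LLM judge, instead of asking the judge to grade the answer on its own.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a RAG answer against a gold reference.

Question: {question}
Gold answer: {gold}
Model answer: {answer}

Score from 0 to 1 how well the model answer matches the gold answer on facts
(ignore wording and style). Reply with only the number."""

def grade_against_gold(question: str, gold: str, answer: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())
```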
Sample size matters too. Too small and the results may not be reliable. Too big and it becomes easy to miss the smaller problems.
If you have enough data, you can tag questions into topics, different wordings (pessimistic / typical phrasing), and different types (short / long / messy) so you can see what breaks where.
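A minimal sketch of that breakdown, assuming each result record carries the tags you assigned when building the dataset (the field names are made up):

```python
# Hypothetical sketch: slice eval scores per tag to see what breaks where.
import pandas as pd

results = [
    {"question_id": 1, "topic": "retrieval", "phrasing": "messy",   "faithfulness": 0.91},
    {"question_id": 2, "topic": "chunking",  "phrasing": "typical", "faithfulness": 0.64},
    # ... one record per evaluated question
]

df = pd.DataFrame(results)
print(df.groupby(["topic", "phrasing"])["faithfulness"].agg(["mean", "count"]))
```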
I’ve heard recommendations that start with something like 200–1,000 real queries with gold answers if you want this to be a real evaluation setup.
Since this entire exercise is hypothetical, and the system has only ingested documents to demo the idea of expanding to neighbors, the evals use synthetically generated datasets and are therefore less reliable. But there are still learnings we can take from them.
Deciding on metrics & datasets
This section is about two things: which metrics I’m using to evaluate the pipeline, and how I’m using them across datasets to see if neighbor expansion seems to help.
First, if you haven’t read about evals for LLM systems at all, go read this article. It gives you a taxonomy of the different metrics out there (RAG included).
Since I’m being lazy here, I need reference-free metrics, which also limits what we can actually test: the judge can only look at the question, the retrieved context, and the generated answer.
A few metrics that can help here are faithfulness (is the answer grounded in the provided context), answer relevancy (does it actually answer the question), context relevance (how much of the context is just noise), and hallucination (how many of the answer’s claims are actually backed up by the provided context).
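As a rough idea of what running the standard ones looks like, here’s a minimal Ragas sketch on a single made-up sample. This assumes the older Dataset-based Ragas API; newer versions wrap samples in their own dataset classes.

```python
# Minimal sketch of reference-free RAGAS metrics on one sample.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What does neighbor expansion add to retrieval?"],
    "answer":   ["It pulls in the chunks around each seed chunk so the model sees the section."],
    "contexts": [["Seed chunk text ...", "Neighboring chunk text ..."]],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)
```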

Since we want to figure out if seed expansion is useful, and without building two different pipelines, we can do one simple comparison: ask the judge to look at the seed chunks vs. the final expanded context and score how much of the answer comes from each for the faithfulness metric.
If grounding improves when the judge sees the expanded context, that’s at least evidence that the model is using the expanded chunks and it’s not just noise. We would need more testing, though, to say for sure which is the winner.
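Here’s a sketch of that comparison with a hand-rolled judge prompt. The prompt wording and judge model are assumptions on my part; in practice this is the kind of custom check you’d wire into your eval framework.

```python
# Hypothetical sketch of the seed-vs-expanded grounding check: score the same
# answer twice, once against only the seed chunks and once against the full
# expanded context, then compare.
from openai import OpenAI

client = OpenAI()

FAITHFULNESS_PROMPT = """Given the context and an answer, estimate the fraction
of claims in the answer that are supported by the context (0 to 1).
Reply with only the number.

CONTEXT:
{context}

ANSWER:
{answer}"""

def faithfulness_score(context: str, answer: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[{"role": "user", "content": FAITHFULNESS_PROMPT.format(
            context=context, answer=answer)}],
    )
    return float(resp.choices[0].message.content.strip())

def grounding_gap(seed_chunks: list[str], full_context: str, answer: str) -> dict:
    seed = faithfulness_score("\n\n".join(seed_chunks), answer)
    full = faithfulness_score(full_context, answer)
    return {"seed_faithfulness": seed, "full_faithfulness": full, "gap": full - seed}
```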
Finally, the datasets matter as much as the metrics.
If you’ve read the first article, you know that all the docs that have been ingested are scientific articles that mention RAG. So all the questions that we create here need to be about RAG.
I have generated three different datasets, each with a different RAG flavor.
The first is based on the ingested corpus, going through each scientific article and writing two questions per article that it can answer.
The second is doing the same but providing messy questions like, “how does k2 btw rag improve answer fetching compared to naive rag, like what’s the similarity scores in terms of q3?”

This messy-user-questions dataset would also be good for testing the query optimizer from the first article (though I don’t have those results for you here). Here it will tell us whether phrasing things differently skews the results.
The third dataset is based on 66 random RAG questions found online. This means that these questions may not have answers in the corpus (the ingested RAG articles are just from September to October, so we don’t know exactly what they contain).
So the first two will evaluate how well the pipeline behaves, whether it can answer questions on the documents it has, and the third one tells us what it is missing and how it behaves on questions that it might not be able to answer.
Though this is a bit simplified, as the first set of questions may be structured around sections while the random ones may be better answered by seed chunks alone.
Running the evals
To run the evals, you first need to run the pipeline on every question, for every model, and store the results.
If you don’t store everything you need, you can’t debug later. You want to be able to go from a low score back to the exact answer, the exact retrieved context, and the exact model settings.
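Something as simple as one JSONL record per question-and-model run is enough. The field names and settings below are illustrative, not the exact schema I used:

```python
# Hypothetical sketch: append one traceable record per pipeline run.
import json
import time

def log_run(path: str, record: dict) -> None:
    record["timestamp"] = time.time()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_run("eval_runs.jsonl", {
    "question_id": 17,
    "model": "gpt-5-mini",           # generation model for this run
    "question": "...",
    "answer": "...",
    "seed_chunks": ["..."],
    "full_context": "...",
    "settings": {"top_k": 10, "rerank": True, "neighbor_expansion": True},  # illustrative
})
```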
I also wanted to compare models, because people assume “bigger model = better answers,” and that’s not always true, especially for easier tasks. So I’m running the same pipeline across GPT-5-mini, GPT-5.1, and GPT-5.2, for several datasets.
Once that’s done, I build the eval layer on top of those stored outputs.
I used RAGAS for the standard metrics and DeepEval for the custom ones. You can obviously build it manually, but it’s much easier this way. I love how seamless DeepEval is, though it’s harder to debug if you find issues with the judge later.
A few specifics: the pipeline runs with no context cap, the judge model is gpt-4o-mini, and we use n=3 for RAGAS and n=1 for the custom judges.
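For the custom metrics, a DeepEval `GEval` judge looks roughly like this. The criteria text and parameters are mine, made up to illustrate the shape of it, not the exact metric definitions behind the results:

```python
# Hypothetical sketch of a custom context-relevance judge via DeepEval's GEval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

context_relevance = GEval(
    name="Context Relevance",
    criteria=(
        "How much of the retrieved context is actually needed to answer the "
        "question? Penalize context that is unrelated noise."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    model="gpt-4o-mini",  # judge model
)

test_case = LLMTestCase(
    input="What does neighbor expansion add to retrieval?",
    actual_output="It pulls in the chunks around each seed chunk ...",
    retrieval_context=["Seed chunk text ...", "Neighboring chunk text ..."],
)

context_relevance.measure(test_case)
print(context_relevance.score, context_relevance.reason)
```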
Since neighbor expansion is the whole point of this pipeline, remember we also run this check: for faithfulness, we score grounding against the seed chunks and against the full expanded context, to see if there’s a difference.
Eval results of datasets & models
Let’s run the evals for the different datasets, metrics, and models to see how the pipeline is doing and how we can interpret the results. Remember you can find the full results here and here (especially if you dislike my childish sketches).
We can start with the results from the dataset generated by the corpus.
Remember that you’ll find the real table here.
The table above shows the first RAGAS metrics. Faithfulness (does it stay grounded in the context provided) and answer relevancy (does it answer the question) are very high.
This is to be expected, as we’re basically giving it questions that it should be able to answer with the documents. If these showed low numbers, there would be something severely off in the pipeline.
It also gives us back seed faithfulness, where the judge is estimating how grounded the answer is to the seed chunks. This one is overall a lot lower than the full context faithfulness, 12–18 points across the different models.
In fewer words: we can say that the LLM is using some of the full context, not just the seed chunks, when generating its answer.
What we can’t judge, though, is whether the seed-only answer would have been just as good. That would require running two pipelines and comparing the same metrics and datasets for each.
Now let’s look at these next metrics (for the same dataset).
Remember that you’ll find the real table here.
I would have expected context relevance to decrease here, as the judge looks at the full context, which pulls in up to 10 neighboring chunks per section, but it holds up better than I expected.
A reason for this may be that the questions generated are based on sections, which means that added context helps to answer them.
Structure citations (i.e. does it cite its claims correctly) looks alright, and the hallucination score is high, which is good here (1 means no made-up claims in the answer).
You’ll also notice that the different models show very little difference in performance.
Yes, this is quite an easy Q&A task. But it does demonstrate that the additional size of the model may not be needed for everything, and the added context expansion may be able to act as a buffer for the smaller models.
Now let’s look at the results if we change the dataset to those messy user questions instead.
Remember that you’ll find the real table here.
We see the scores drop by a few points, but they still stay high; without isolating the outliers here we can’t say exactly why. Faithfulness does look lower when judging against only the seed chunks for the messy user questions, which is interesting.
Let’s now turn to the third dataset, which will be able to tell us a lot more.
Remember that you’ll find the real table here.
We see worse numbers all around, which is of course expected: the ingested corpus probably can’t answer all of these questions well. This helps point us to where information is missing.
Faithfulness still stays high for the full-context runs, though. Here the difference from the seed-only runs is a lot larger, which means the added expansion is being used more in the answers.
Something that was strange here was how GPT-5.2 consistently did worse for answer relevance across two different runs. This can be a metric thing, or it can be a model thing where it answers more cautiously than before, thus getting a lower score.
This also tells you why it’s so important to test these new models on your own pipelines before adding them in.
Let’s continue with the other metrics for the random dataset.
Remember that you’ll find the real table here.
Context relevance is very low, so the judge thinks there is a lot of noise in the context that does not directly answer the question for this dataset.
The context relevance metric rewards retrieving a high proportion of directly relevant chunks, but seed+expand intentionally adds neighbor chunks (definitions, adjacent sections, captions) that should improve grounding. Unfortunately, there may be some noise that comes with that.
Both structure citations and hallucination score worse here, probably because it’s harder to stay grounded if you’re not provided any relevant information to use.
Now this was a first build for this pipeline. Based on these results, you can improve it until you drive those numbers up. Maybe change the metrics to be more custom to your pipeline.
Both datasets show different things: the corpus-based one shows the pipeline works, the random one shows there is missing information and that there is a lot more noise to navigate around.
What to examine
So what do you do now that you have all these scores? You can look at regressions, worst offenders, and do some random sampling.
Like I said before, this whole exercise is a bit like investigative work, which you’ll love if you love working with data. If you do have enough data and you’ve tagged it, then by all means break it down and look at the individual numbers and see what you find. But there are three checks you should always do.
First is regressions: check the things that used to pass and now fail after changes.
Second is checking the worst offenders (the bottom ~5–10 data points) to see what is happening. It could be the metric, the judge, the retrieval, or the model.
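A small sketch of pulling those worst offenders out of stored results (assuming you logged runs as JSONL, like the earlier sketch):

```python
# Hypothetical sketch: pull the lowest-scoring records for a given metric so
# you can read them one by one.
import json

def worst_offenders(path: str, metric: str, n: int = 10) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    scored = [r for r in records if metric in r]
    return sorted(scored, key=lambda r: r[metric])[:n]

for r in worst_offenders("eval_runs.jsonl", "full_faithfulness", n=5):
    print(r["question"], "->", r["full_faithfulness"])
```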
Below is one of the worst ones for full faithfulness and answer relevancy.
"question": "how often does entity linking mess up acronyms and lookalikes, and how do i test that",
"answer": "The documents do not report a direct number for "how often" entity linking fails specifically on acronyms or look‑alike names.\n\nWhat we do see is that:\n- Even when structure is correct, LLMs often link to the wrong property or entity (semantic inaccuracies) in SPARQL generation over a scholarly KG [2]. This is the same failure mode you'd see when an acronym (e.g., NMF, ACE) is mapped to the wrong node.\n- In medical KGs, entity linking relies on exact keywords and then semantic similarity of embeddings when no exact match exists [3]. This last step explicitly allows linking to a merely "semantically related" concept rather than the exact one, which is where lookalikes and acronym collisions can occur [3].\n- In conversational entity‑centric systems, the largest single error category is coreference/entity confusion (27% of failures) [4], showing that confusing similar mentions is a major practical problem even outside strict EL.\n\nHow to test this in your setting, using patterns from the papers:\n1) Build a targeted test set\n - Create a small benchmark that stresses acronyms and visually/lexically similar entities, analogous to how MATE is used to stress cross‑modal entity linking [1].\n - Include positive pairs (correct acronym→entity) and hard negatives (different entities sharing acronym or near‑identical names).\n\n2) Run your full EL pipeline\n - Use the same identifier / keyword / embedding‑based linking strategy described in MedKGent [3].\n\n3) Manually and/or LLM‑assisted evaluation\n - Have humans or an LLM score whether each link is correct vs. incorrect, similar to the triple‑validity rubric used in MedKGent [3] and the SPARQL error audit in FIRESPARQL [2].\n\n4) Analyze error types\n - Break down errors like the semantic vs. structural analysis in [2] and the error table in [4], but with categories "acronym collision," "lookalike surface form," etc.\n\nThis will give you an empirical error rate for your own stack, focused specifically on acronyms and lookalikes.",
"full_faithfulness": 0.24,
"seed_faithfulness": 0.8260869565000001,
"answer_relevancy": 0.0,
"context_relevance": 0.208549739206933,
"context_relevance_reason": "The context provided does not directly address the user's question about how often entity linking messes up acronyms and lookalikes, nor does it offer methods for testing that. While it discusses entity linking and its evolution, it lacks specific information on the issues related to acronyms and lookalikes, which is the core of the user's inquiry.",
"hallucination_score": 0.6572611409640697,
"hallucination_reason": "The response accurately identifies that the documents do not provide a specific frequency for how often entity linking fails with acronyms or lookalikes, which aligns with the input query. It also discusses relevant issues such as semantic inaccuracies and coreference confusion, which are pertinent to the topic. However, it lacks direct references to specific claims made in the context, such as the limitations of traditional EL methods or the role of exact keywords in medical KGs, which could have strengthened the response further.",
"full_contexts": ["Entity Linking\n\nEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "However, MEL does not establish cross-modal relations beyond these auxiliary associations, thereby limiting genuine cross-modal interaction.", "CMEL goes further by treating visual content as entities-aligning visual entities with their textual counterparts-to construct MMKGs and facilitate explicit crossmodal inference (Yao et al. 2023). Research on CMEL remains in its early stages, lacking a unified theoretical framework and robust evaluation protocols. The MATE benchmark is introduced to assess CMEL performance, but its synthetic 3D scenes fall short in capturing the complexity and diversity of real-world images (Alonso et al. 2025). To bridge this gap, we construct a CMEL dataset featuring greater real-world complexity and propose a spectral clustering-based method for candidate entity generation to drive further advances in CMEL research.", "3 Error type analysis on generated SPARQL queries\n\nDespite the improvements of LLMs on QA over SKGs, LLMs face limitations when handling KG-specific parsing. The experimental results conducted by Sören Auer et al.[2] showed that only 63 out of 100 handcrafted questions could be answered by ChatGPT, of which only 14 answers were correct. To better understand why LLMs fail to generate the correct SPARQL query to a NLQ, we conduct a pilot experiment on using ChatGPT(GPT-4) with a random one-shot example to generate SPARQL queries for 30 handcrafted questions in the SciQA benchmark datasets.", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "of questions, but often struggle with linking the correct properties and entities because LLMs do not learn the content of the underlying KG. Structural inconsistencies arise due to LLMs' lack of ontological schema of the underlying KG, leading to errors in query structure, such as missing or abundant links (triples), despite correctly linking to the mentioned entities or properties.", "Figure 1 shows the example of semantic inaccuracies and structural inconsistencies problem with the generated SPARQL queries in our pilot study. In the example of the semantic inaccuracies problem, ChatGPT failed to link the correct property orkgp:P15687; instead, it linked to a wrong property orkgp:P7101. In the example of the structural inconsistencies problem, the SPARQL query generated by ChatGPT directly links Contribution to Metrics, fails to detect the correct schema of the ORKG where Contribution and Metric are connected via Evaluation.", "Fig. 
1: Examples of semantic inaccuracies and structural inconsistencies problem with the generted SPARQL queries\n\nSemantic inaccuracies Problem\n\nFail to link the correct properties and entities in ORKG\n\nWhat is the maximum sample size?\n\nContribution Evaluation Metric P34 P2006 P7046\n\nStructural inconsistencies Problem\n\nMake errors in query structure, such as missing or abundant links (triples)\n\nWhat are the metrics used by paper \"Using NMF-based text summarization\n\nto improve supervised and unsupervised classification?\n\norkgp:P15687 rdfs:label Sample size (n)\n\norkgp:P7101 rdfs:label has elements", "2 Results\n2.1 Method overview\n\nas its confidence score. For instance, if the triple (NPPA, Negative Correlate, Water) appears in 90% of the outputs, its confidence score is 0.9. Low-confidence triples (score < 0.6) are filtered out, and only high-confidence triples are retained for downstream graph construction. Each triple is also annotated with the PubMed ID of the source abstract and a timestamp, ensuring traceability and source attribution. For example, (NPPA, Negative Correlate, Water) would have a PubMed ID of 10494624 and a timestamp of 2000-01-01.", "As shown in Figure 1 c , for each retained triple, such as (NPPA, Negative Correlate, Water), the Constructor Agent checks its presence in the current KG. If absent ( i.e. , either the head or tail entities are missing), it is inserted; if present, its confidence score is updated according to Equation (1). The associated PubMed ID is appended, and the timestamp is updated to reflect the latest publication. For example, if an existing triple (NPPA, Negative Correlate, Water) has a confidence score of 0.7, PubMed ID 10691132, and timestamp 1999-12-31, and a new occurrence with a confidence score of 0.9, PubMed ID 10494624, and timestamp 2000-01-01 is encountered, the updated triple will have a confidence score of 0.97, PubMed IDs [10691132, 10494624], and a timestamp of 2000-01-01. If the head and tail entities are present but the relation differs, such as existing (NPPA, Associate, Water) vs. incoming (NPPA, Negative Correlate, Water), only the most appropriate relation is maintained. The Constructor Agent invokes the LLM to resolve the conflict by selecting the more suitable relation, considering both the existing and incoming triple's confidence scores and timestamps. If the LLM selects the new triple, the existing one is replaced; otherwise, no changes are made. The prompt design for relation conflict resolution is shown in Extended Data Figure 2 c . Together, the two agents extract structured medical facts and integrate them into a dynamic, time-aware KG. See more details in the Section 4.", "2.2 Structural Characterization of the Knowledge Graph\n\nIn this section, we detail the structural characteristics of the medical KG we constructed, with an emphasis on the distribution of node types, relationship types, and the confidence scores of relationship triples. We also present a visualization of a subgraph centered on COVID-19 to illustrate the graph's structure.", "Using the MedKGent framework, we extracted knowledge triples from the abstracts of 10,014,314 medical papers, with 3,472,524 abstracts (34.68%) yielding extractable triples. 
The relatively low extraction rate can be attributed to several factors: first, some abstracts lacked sufficient structured information for triple extraction; second, only triples with a confidence score exceeding 0.6 were retained, excluding those with lower confidence; and third, some triples extracted by LLMs contained formatting issues, such as extraneous or irrelevant characters, which were discarded. In total, our Extractor Agent identified 8,922,152 valid triples from the abstracts. However, the extracted triples contained a significant number of duplicates and conflicts. To resolve this, our Constructor Agent integrates the triples in chronological order. During this process, duplicates are merged, with the confidence score for each triple increasing in proportion to its frequency, reflecting greater certainty. For conflicting triples, where the same entity pair is associated with multiple relations, the Constructor Agent retains the most appropriate relationship. Following this consolidation, the final KG comprises 2,971,384 distinct triples.", "We conducted a comprehensive statistical analysis of the final constructed KG, which comprises 156,275 nodes. As shown in Figure 2 a , the node distribution is predominantly dominated by Gene and Chemical nodes, with smaller proportions of other entities such as Disease, Variant, Species, and CellLine. The KG includes 2,971,384 relationship triples (edges), representing a range of interactions between entities, as illustrated in Figure 2 b . The most common relationship type is 'Associate', followed by 'Negative Correlate' and 'Positive Correlate', indicating strong associations between medical entities. Less frequent relationships, such as 'Interact', 'Prevent', and 'Drug Interact', provide additional insights into the complexities of medical interactions. The distribution of confidence scores for these relationship triples, shown in Figure 2 c , with confidence values discretized to the nearest smaller 0.05 increment (rounding down to the closest multiple of 0.05), reveals a clear dominance of high-confidence triples. A significant proportion of triples exhibit confidence scores of 0.95, reflecting the cumulative increase in confidence resulting from the repetition of triples during the graph construction process. This high-confidence distribution reinforces the reliability and robustness of the KG.", "We visualized a local subgraph of the constructed KG with COVID-19 as the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterized by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity in the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking within a query by matching entities to their corresponding nodes in the KG. When the Identifier of an entity in the query is available, entity linking can be efficiently performed using this unique reference. 
In the absence of an Identifier, precise matching", "Figure 2: A comprehensive statistical analysis and visualization of the constructed KG, consisting of 156,275 nodes and 2,971,384 relationship edges. a . Node distribution within the KG, with Gene and Chemical nodes predominating, and smaller proportions of Disease, Variant, Species, and CellLine. b . Relationship type distribution within the KG, highlighting the prevalence of 'Associate' relationships, followed by 'Negative Correlate' and 'Positive Correlate', with less common interactions such as 'Interact', 'Prevent', and 'Drug Interact'. c . The distribution of confidence scores for relationship triples, discretized to the nearest smaller 0.05 increment, ensures values are rounded down to the closest multiple of 0.05. This distribution reveals a clear dominance of high-confidence triples, particularly those with scores of 0.95, underscoring the robustness of the KG. d . Local subgraph visualization centered on COVID-19, displaying five surrounding relationship triples. Each node is characterized by key attributes, including Identifier, Entity Type, Terminology, Page Link, Exact Keywords, and Semantic Embedding, facilitating efficient entity linking through exact or similarity matching. The relationships in the KG are further enriched by attributes such as Confidence, PubMed IDs, and Timestamp, enhancing traceability, accuracy, and temporal relevance.\n\nCOVID -19 ACE2 Pneu- monia Lung Disea -ses MAD00 04J08 tociliz- umab Treat Identifier : MESH:C000718219 Entity Type : Chemical Terminology : NCBI MeSH Page Link", ": meshb.nlm.nih.gov/record/ui?ui=C000718219\n\nExact Keywords : [mad0004j08] Semantic Embedding : [- 0.12, …, 0.10 ] : MESH:D000086382\n\nEntity Type:\n\nDisease\n\n: meshb.nlm.nih.gov/record/ui?ui=D000086382\n\n: [ncp, covid-19]\n0.25, …, 0.09\n\nIdentifier:\n\nMESH:C502936\nChemical\n\nTerminology:\n\nNCBI MeSH\n\n: meshb.nlm.nih.gov/record/ui?ui=C502936\n\n: [mra, tocilizumab] 0.12, …, 0.13 Associate 59272 Gene\n\n:\n\nNCBI Gene\n\n: www.ncbi.nlm.nih.gov/gene/59272\n\n: [ace2, ace2p]\n0.22, …, 0.09]\nMESH:D011014\n\n: meshb.nlm.nih.gov/record/ui?ui=D011014\n\n: [pneumonia]\n0.18, …, 0.01\nMESH:D008171\n\n: meshb.nlm.nih.gov/record/ui?ui=D008171\n\n: [lung diseases,lung damage]\n\n: [ 0.06, …, 0.11 d a b Drug_Interact (0.1%) 0.70 0.65 'Prevent (0.79 0.75 7.89) (7.5%) 0.60 (8.1%) (5.4% (47.7%) 0.80 CellLine Positive (8.9%) (0.5%) Correlate 0.85 (19.9%) (10.3%) Variant (1.49) (5.9%) Cause (1.4% 0.90 (33.6%) Inhibit (1.2% Negative_Correlate Stimulate (0.5%) (13.7%) Species Compare (26.1%) Cotreat (1.0%)", "Figure 3: Comprehensive evaluation of extraction quality for relationship triples generated by the Extractor Agent. Systematic assessment of extraction accuracy using both automated evaluations by LLMs and independent manual expert review. a . Proportion of valid relationship triples (score ≥ 2.0) across relation types, as assessed by GPT4.1 on a randomly selected subset of 34,725 abstracts (83,438 triples). b . Proportion of valid relationship triples across relation types, as assessed by DeepSeek-v3 on the same subset. c . Validity rates from independent manual evaluation by three domain experts on a subset of 400 abstracts (1,060 triples), demonstrating high inter-expert consistency. d-f . Performance of GPT-4.1 and DeepSeek-v3 compared to three expert evaluations on the shared evaluation subset, reporting precision, recall, and F1 score. g . 
Pairwise inter-rater agreement between experts and LLMs quantified by Cohen's kappa coefficients, demonstrating substantial consistency across all evaluators.\n\nGPT-4.\n\nAutomated Evaluation\n\nDeepSeek-v3 Automated Evaluation\n\nManual Evaluation 0936| 0.0307 0,8875 0,8880 0 8700 0.7160 0.4\n\nExpert1's Evaluation as Reference\n\nExpert2's Evaluation as Reference\n\nExpert3's Evaluation as Reference\n\nPairvise Cohen's 0 9761 09761 0 0602 00760 0.9502 00537 0,9503 0 9440 0.5663 08143 0,8818 0 5446 0.6762 0,8853 0.5446 0.6906 06818 0.6008 0 6560 GPT-4,1 DeepSeek-v3 GPT-4.1 Correlale Corelate Cause Inhon Irhon Cotcat Inlatact Colrcat Kappa ison", "is achieved by checking whether the entity appears in the Exact Keywords list of a specific node. Alternatively, semantic vectors of the query entities can be compared with those in the KG to identify the most similar entities, enabling semantic similarity matching. This approach is particularly beneficial for entities with multiple names, ensuring accurate linking even when not all aliases are captured in the Exact Keywords list.", "The relationships between entities are characterized by three key attributes. Confidence reflects the reliability of the relationship, with higher values indicating greater certainty based on its frequency across multiple sources. The PubMed IDs attribute lists the PubMed identifiers of the papers from which the relationship is derived, enabling easy access to the original publications via the PubMed website 2 . If the relationship appears in multiple papers, all relevant PubMed IDs are included, further increasing the confidence score. Finally, Timestamp denotes the most recent occurrence of the relationship, specifically the publication date of the latest paper. Notably, while Timestamp captures only the latest appearance, the full temporal span of the relationship-including its earliest mention-can be readily retrieved through the associated PubMed IDs via the PubMed website. These attributes collectively enhance the traceability, accuracy, and temporal relevance of the relationships within the KG.", "4 Methods\n4.2.2 Constructor Agent\n\nA chemical/drug treats a disease. The Treat relationship typically occurs between Chemical and Disease.\n\nMeSH (Medical Subject Headings)\n\ndbSNP, otherwise HGNV format\n\nNCBI Taxonomy\nCell Line\nCellosaurus\n\nYour task is to select the most appropriate relation\n\nbetween two medical entities to form more\n\nreasonable knowledge triple.\n\nThere is an and Now, a new between e1 and\n\ne2 is proposed.\n\nPlease decide which relation should be retained\n\nbetween e1 and e2.\n\nIf r1 should be kept, respond with \"Y\".\n\nIf r2 should replace it, respond with \"N\".\n\nYou may consider the following two factors to assist\n\nyour decision:\n\n(1) The\n, and\nthat of\n\n;\n\n(2) The\nfor\n\nIn general, relations with\n\nhigher confidence scores or more recent timestamps are likely\n\nretained.\n\nYour output should contain only \"Y\" or \"N\". Do not\n\nprovide any explanations.\n\nOutput:\n\nc", "Extended Data Figure 2: a . Prompt template for relation extraction. Given a biomedical abstract and its extracted entities, the Extractor Agent prompts the LLM to infer semantic relations between entity pairs using a predefined relation set and textual descriptions. b . Reference terminologies for entity normalization. 
Each biomedical entity type is mapped to a standard terminology: Gene (NCBI Gene), Disease and Chemical (MeSH), Variant (dbSNP or HGNV), Species (NCBI Taxonomy), and Cell Line (Cellosaurus). c . Prompt design for relation conflict resolution. When conflicting relations exist between the same entity pair, the Constructor Agent prompts the LLM to select the most appropriate one based on confidence scores and timestamps. d . Schema for predefined relation types. The 12 core relation types-seven bidirectional and five unidirectional-are listed alongside their directionality, descriptions, and allowed entity-type combinations.", "4.3 Quality Assessment\n\nWe assessed the quality of relational triples extracted by the Extractor Agent through both automated and manual evaluations, leveraging two state-of-the-art LLMs-GPT-4.1 [74] and DeepSeek-v3 [75]-as well as three PhD students with interdisciplinary expertise in medicine and computer science. For each medical abstract and its corresponding set of extracted triples, individual triples were evaluated using a standardized four-level scoring rubric: 3.0 (Correct), 2.0 (Likely Correct), 1.0 (Likely Incorrect), and 0.0 (Incorrect). The evaluation prompt provided to both LLMs and human annotators is illustrated in Extended Data Figure 3 a .", "A relational triple was defined as valid if it received a score of ≥ 2 . 0 . The validity rate was calculated as:\n\nTo assess the reliability of automatic evaluation, we compared LLM-based assessments with human annotations on a shared evaluation subset, treating human judgments as ground truth. The precision, recall, and F 1 -score of the automatic evaluations were computed as:\n\nwhere TP, FP, and FN represent true positives, false positives, and false negatives, respectively. To further quantify inter-rater agreement, we calculated Cohen's Kappa coefficient [82] for each pair of evaluators, including both LLMs and human annotators, resulting in 10 pairwise comparisons across the five raters. The Kappa coefficient was computed as:\n\nwhere p 0 represents the observed agreement and p e denotes the expected agreement by chance. This analysis provides a quantitative measure of rating consistency across evaluators.", "4.4 Retrieval-Augmented Generation\n\nThe constructed KG serves as a reliable external source for information retrieval and can be integrated into LLMs via a RAG framework. By providing structured biomedical context, the KG enhances LLM performance across a range of medical QA benchmarks.", "Given a user query q , we first extract the set of medical entities present in the question, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a unique identifier. This allows for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } within the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of each node in the KG. A successful match enables linkage of the entity to the corresponding graph node. In both approaches, if an entity cannot be linked via its identifier or if its surface form does not appear in any node's Exact Keywords list, we apply a semantic similarity strategy to complete the entity linking process. 
Specifically, the embedding of the query entity is computed using the same model employed for generating node-level semantic representations ( i.e. , BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes in the KG. The entity is then linked to the node with the highest semantic similarity score, which may correspond to either the exact concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and flexible integration of KG-derived knowledge into downstream QA tasks.", "Following entity linking, we construct evidence subgraphs using a neighbor-based exploration strategy [86] to enhance the reasoning capabilities of LLMs. For each entity-linked node in the query-specific set N q , we retrieve its one-hop neighbors within the KG. Specifically, for each node n q i ∈ N q , all adjacent nodes n q ′ i are identified, and the corresponding triples ( n q i , r, n q ′ i ) are appended to form a localized subgraph G q i . This expansion captures the immediate relational context surrounding the query entities, which is essential for enabling fine-grained medical reasoning. The complete evidence set for a given query is then defined as the union of these localized subgraphs: G q = { G q 1 , G q 2 , · · · } . The resulting subgraph G q may contain a large number of relational triples, including redundant or irrelevant information, which can adversely impact LLM reasoning [87]. To address this, we leverage the LLM's inherent ranking capability to selectively filter high-value knowledge [88]. Given the question q and", "You are tasked with evaluating the validity of the knowledge triples extracted from the abstract of a medical paper.\n\nGiven the abstract (\n\n) of a medical paper and the extracted triples\n\n) from this abstract.\n\nEach triple is represented in the format:\n\n\"Head Entity Name (Alias1, Alias2) | Relationship Name | Tail Entity Name (Alias1, Alias2)\"\n\n,\n\nwith triples separated by ' $ '.", "Some entities may have no aliases or multiple aliases, which are separated by ', ' within the '()'.\n\nYour task is to evaluate the validity of each triple, with a particular focus on the\n\nrelationship\n\nit describes, based on the information provided in the abstract. Consider whether the stated relationship accurately\n\nreflects the connection between the head and tail entities as presented or implied in the text.", "For each triple, evaluate its validity using the following scoring scale and assign a confidence score:\n\n•\n\nCorrect (3.0):\n\nThe relationship logically and accurately describes the relation between the head and tail entities as\n\nexplicitly mentioned or directly and strongly supported\n\nby the abstract. The\n\nrelationship type is\nprecise\n\nand the connection is\n\nundeniable\n\nbased on the text, requiring minimal inference.\n\nLikely Correct (2.0):\n\nThe relationship is\n\ngenerally acceptable and directionally correct\n\n. 
The core connection between the entities is\n\nvalid and supported by the text (explicitly, implicitly, or via\n\nreasonable inference)\n\n, even if the relationship type has\n\nminor inaccuracies or lacks ideal precision\n\n.\n\nLikely Incorrect (1.0):\n\nsubstantially inaccurate or misleading\n\nsignificantly misrepresenting\n\nthe connection described in the abstract, even if the entities are mentioned together.\n\nIncorrect (0.0):\n\nnot supported by the abstract whatsoever\n\n, is\n\nclearly and undeniably contradicted\n\nby the text, or involves a\n\nfundamental misunderstanding\n\nof the entities or their\n\nconnection as presented.\n\nOutput the evaluation in a fixed format:\n\nFirst line: 'Analysis: ' followed by the analysis of all triples, separated by '; '. Each triple's analysis should explain\n\nwhy\n\nthe specific confidence score (3.0, 2.0, 1.0, or 0.0) was assigned based on the criteria\n\nabove and the abstract's content.", "Second line: Only the numerical confidence scores for all triples, separated by ' $ ', in the same order as the input triples (e.g., 3.0 $ 2.0 $ 1.0 $ 0.0). This line must contain only numbers (formatted to one\n\ndecimal places like 3.0, 2.0, 1.0, 0.0), decimal points, and ' $ ' as separator, with no additional text or English letters.", "5 Results\n5.1 Main Results\n\n| | Model | FR (%) | DC (%) | UCS (/5) |\n|---:|:-------------------|:-----------|:-----------|:-----------|\n| 0 | Stateless LLM | 54.1 (0.4) | 48.3 (0.5) | 2.1 (0.1) |\n| 1 | Vector RAG | 71.6 (0.6) | 66.4 (0.7) | 3.4 (0.1) |\n| 2 | Entity-RAG | 75.9 (0.5) | 72.2 (0.6) | 3.7 (0.1) |\n| 3 | Semantic Anchoring | 83.5 (0.3) | 80.8 (0.4) | 4.3 (0.1) |\n\nTable 1: Overall performance on MultiWOZ-Long. Semantic Anchoring outperforms all baselines across metrics. Improvements in FR and DC are statistically significant at p < 0 . 01 ; UCS gains are significant at p < 0 . 05 . Values are mean ± stdev over three runs.", "Figure 2 analyzes how performance varies with session depth. While all models degrade as dialogue span increases, Semantic Anchoring sustains over 75% recall at 10 sessions, indicating stronger long-range tracking.", "5.2 Per-Dataset Breakdown\n\nTo test generality, we evaluate on DialogRE-L , which emphasizes relation extraction across sessions. Results in Table 2 show consistent improvements, though broader domains are needed to claim robustness.", "Figure 2: Factual Recall by session depth on MultiWOZ-Long. Semantic Anchoring exhibits the slowest degradation, maintaining > 75% recall at 10-session distance. Error bars denote standard deviation across three runs.\n\nFactual Recall vs. Session Depth (MultiWOZ-Long)\n\nStateless LLM Vector RAG Entity-RAG Semantic Anchoring Session Depth\n\n|---:|:-------------------|---------:|---------:|-----------:|\n| 0 | Stateless LLM | 49.8 | 44.1 | 2 |\n| 1 | Vector RAG | 68.7 | 62.5 | 3.2 |\n| 2 | Entity-RAG | 72.1 | 68.3 | 3.6 |\n| 3 | Semantic Anchoring | 81.4 | 77.9 | 4.2 |\n\nTable 2: Performance on DialogRE-L. Semantic Anchoring achieves consistent gains across metrics, suggesting effectiveness in relation extraction tasks that require long-range entity tracking.", "5.3 Ablation Studies\n\nTable 3 examines the role of linguistic components. Removing discourse tagging reduces FR by 4.7 points, while excluding coreference resolution reduces DC by 6.2 points. Eliminating all symbolic features collapses performance to Vector RAG levels. 
These results align with observed error patterns (§5.6), underscoring the value of symbolic features.", "5.4 Qualitative Examples\n\nIn MultiWOZ-Long, when the user later asks 'Did he confirm the time for the taxi?' , Semantic Anchoring retrieves:\n\n[Entity: John Smith][CorefID: E17] confirmed the taxi is booked for 9 AM.", "By contrast, Vector RAG surfaces unrelated mentions of 'taxi.' Additional examples, including cases where Semantic Anchoring fails, are shown in Appendix C.", "| | Variant | FR (%) | DC (%) | UCS (/5) |\n|---:|:-------------------------|---------:|---------:|-----------:|\n| 0 | Full Model | 83.5 | 80.8 | 4.3 |\n| 1 | - Discourse Tagging | 78.8 | 75.6 | 4 |\n| 2 | - Coreference Resolution | 80.1 | 74.6 | 4.1 |\n| 3 | - Dependency Parsing | 81.2 | 78.5 | 4.1 |\n| 4 | Dense-only (Vector RAG) | 71.6 | 66.4 | 3.4 |\n\nTable 3: Ablation results on MultiWOZ-Long. Removing discourse or coreference modules significantly reduces FR and DC, respectively. Without all symbolic features, performance falls to the dense-only baseline.", "5.5 Human Evaluation\n\nFive trained annotators rated 50 randomly sampled conversations for User Continuity Sati