Image created by the Author using Midjourney V7
Introduction: The Problem Hiding in Plain Sight
Imagine a medical resident studying from a textbook published in 2019. In most fields, five-year-old information would be perfectly adequate — the laws of physics haven’t changed much in the past few years, historical events remain fairly constant, and mathematical theorems are eternal. But medicine is different. Radically different.
In the five years since 2019, thousands of clinical trials have been completed, hundreds of new drugs have been approved, scores of treatment guidelines have been revised, and our understanding of diseases from COVID-19 to cancer has evolved dramatically. That 2019 textbook, while perfectly accurate when published, now contains outdated treatment recommendations, superseded diagnostic criteria, and clinical “facts” that subsequent research has challenged or refuted.
Now imagine that instead of a single medical resident, millions of AI systems are learning medicine from that 2019 textbook. And imagine that these AI systems are being deployed in hospitals, clinics, and healthcare apps around the world, making recommendations to real patients based on outdated medical knowledge. This isn’t a hypothetical scenario — it’s happening right now.
This is the story of medical AI’s dirty secret: the training data crisis. **It’s the story of how the very datasets used to teach AI systems about medicine are quietly going stale, and how the problem compounds every time one AI system teaches another.** But it’s also the story of how we can fix it — how automated fact-checking against current medical literature can keep AI systems up-to-date at a scale that human experts alone could never achieve.
Part I: How Medical Knowledge Evolves (Faster Than You Think)
The Relentless Pace of Medical Discovery
Every single day, approximately 4,000 new articles are added to PubMed, the primary database of biomedical literature. That’s nearly 1.5 million publications per year — a pace that has been accelerating for decades. If you tried to read every new medical publication as it came out, you would need to read nearly three articles per minute, twenty-four hours a day, with no breaks for sleep, meals, or anything else. And because the pace keeps accelerating, you’d still fall further behind every day.
This isn’t just volume for the sake of volume. These publications include:
- Randomized controlled trials that change treatment standards
- Systematic reviews that synthesize evidence and shift clinical consensus
- Observational studies that identify new disease patterns or drug side effects
- Basic science discoveries that revise our understanding of disease mechanisms
- Clinical case reports that expand our knowledge of rare conditions
The result is that medical “facts” exist in a state of constant flux. What we knew about treating diabetes in 2019 has been refined by dozens of major trials since then. What we understood about COVID-19 in March 2020 was dramatically different from our understanding by December 2020 — and different still today. Even in seemingly stable fields, evidence accumulates that nuances, refines, or sometimes completely overturns previous understanding.
When “Facts” Change: Examples from Recent Medical History
Consider a few real examples of how medical knowledge evolves:
Hormone Replacement Therapy (HRT): For decades, HRT was widely recommended for postmenopausal women based on observational studies suggesting cardiovascular benefits. Then the Women’s Health Initiative trial in 2002 found that HRT actually increased cardiovascular risk in certain populations. Treatment recommendations reversed almost overnight. An AI system trained on pre-2002 data would confidently recommend a therapy that is now considered potentially harmful for many patients.
Aspirin for Primary Prevention: For years, daily low-dose aspirin was recommended for cardiovascular disease prevention in healthy individuals. Recent trials have shown that for many people, the bleeding risks outweigh the benefits. Guidelines have shifted toward much more selective use. Yet countless educational materials and datasets still reflect the old “aspirin for everyone” paradigm.
Vitamin D Supplementation: Early observational studies suggested broad health benefits from vitamin D supplementation. Subsequent randomized trials have been largely disappointing, showing minimal benefit for most outcomes in people without deficiency. The evidence base has shifted from “probably helpful” to “unlikely to help in most cases.”
COVID-19 Treatments: In the span of just two years (2020–2022), we went from having no proven treatments to having multiple somewhat effective therapies (remdesivir, dexamethasone, monoclonal antibodies, new antivirals), while other initially promising treatments (hydroxychloroquine, ivermectin) were found to be ineffective in rigorous trials. A dataset frozen in mid-2020 would be dangerously wrong by 2021.
These aren’t obscure edge cases — they’re major clinical topics affecting millions of patients. And they illustrate a fundamental truth: in medicine, what’s true today may not be true tomorrow. Or more precisely, what we believe to be true today based on available evidence may be refined, nuanced, or overturned by better evidence tomorrow.
The Problem with Static Datasets
Now here’s where this becomes a crisis for AI. Training datasets are, by nature, static. Someone creates a dataset at a particular point in time, based on the medical knowledge available at that moment. That dataset is then used to train AI models — sometimes for years afterward.
Take PubMedQA, one of the most popular datasets for training and evaluating biomedical AI systems. Created in 2019, it contains questions derived from PubMed articles paired with yes/no/maybe answers. It was brilliantly designed for its original purpose: testing whether AI models can reason correctly about medical information. But here’s what its creators couldn’t have anticipated: researchers would extensively repurpose it for training AI models, not just evaluating them.
When you train an AI model on PubMedQA in 2024, you’re teaching it medical facts from 2019 and earlier. For rapidly evolving fields, that’s like teaching current medical residents from five-year-old textbooks. Some of what they learn will be outdated, some will be incomplete, and some will be wrong based on evidence that emerged after the dataset was created.
The frightening part? Our analysis of PubMedQA using current medical literature found less than 50% concordance. That means more than half of the “facts” in the dataset would be evaluated differently today than they were when the dataset was created. More than half.
Think about that for a moment. If you trained a medical AI system on PubMedQA today, it would be learning from a dataset where the majority of information is outdated, incomplete, or contradicted by current evidence. And if that AI system is deployed in healthcare settings, it could be making recommendations to real patients based on five-year-old medical knowledge that has since been superseded.
Part II: How the Problem Compounds (The Echo Chamber Effect)
Synthetic Data: When AI Teaches AI
The training data staleness problem would be bad enough if it stopped with one generation of models. But it doesn’t. Modern AI development increasingly relies on a process called “synthetic data generation,” where AI models create training examples for other AI models.
Here’s how it works: You train a large AI model on Dataset A (let’s say, PubMedQA from 2019). Then you use that model to generate thousands or millions of new training examples — maybe question-answer pairs about medical topics, or clinical scenarios with recommended treatments. These synthetic examples become Dataset B, which is used to train the next generation of AI models.
See the problem? If Dataset A contained outdated medical information, the AI model trained on it learns that outdated information. When it generates Dataset B, it creates new examples based on that outdated understanding. Models trained on Dataset B don’t just learn the original outdated information — they learn a derivative version of it, potentially with the errors amplified or compounded by the generation process.
This is like that childhood game of telephone, where a message gets subtly distorted each time it’s passed along. Except instead of children’s whispers, we’re talking about medical facts that could influence patient care. And instead of one round of telephone, we’re talking about multiple generations of AI models, each potentially amplifying and entrenching the errors from previous generations.
Knowledge Distillation: The Student Becomes the Teacher
A related problem occurs with “knowledge distillation,” where a smaller AI model is trained to mimic a larger “teacher” model. This is done for practical reasons — smaller models are faster and cheaper to run, making them more practical for deployment. But if the teacher model was trained on outdated data, the student model inherits all those outdated facts.
Again, we see the compounding effect. The original training data is outdated. The teacher model learns those outdated facts. The student model learns to reproduce the teacher’s outputs, including the outdated information. And if that student model is then used as a teacher for an even smaller model, or if it generates synthetic data for training other models, the outdated information propagates through generation after generation of AI systems.
Dataset Pollution: The Ecosystem-Wide Problem
The real nightmare scenario is “dataset pollution” — when errors, outdated information, or biases from one dataset spread throughout the AI research ecosystem. Here’s how it happens:
1. Dataset A (containing some outdated medical facts) is widely used because it’s publicly available and well-known
2. Many models are trained on Dataset A, learning its outdated facts
3. Those models generate synthetic datasets B, C, D, etc., all carrying the same outdated information
4. Researchers create new datasets E, F, G by combining or augmenting existing datasets, mixing in the already-polluted data
5. New models trained on any of these datasets inherit the outdated information
6. The cycle repeats
The frightening part is that this pollution can be invisible. Researchers using Dataset E might have no idea that it contains outdated information ultimately traced back to Dataset A from five years ago. They see a large, comprehensive dataset and assume it’s accurate. They train a model on it, the model performs well on benchmarks (which may themselves be outdated), and it gets deployed.
Only later — perhaps when the AI system makes a dangerous recommendation to a patient — does anyone realize that the model was trained on outdated medical information that has since been contradicted by clinical trials.
This isn’t theoretical. We’re seeing it happen in real-time. Models are being trained on older datasets, generating synthetic data based on outdated knowledge, and teaching new models that same outdated information. Each generation makes the problem worse, and the original source of the error gets further obscured.
Image created by the author using ChatGPT-4.5
Part III: Why Human Review Doesn’t Scale (The Economic Reality)
The Expert Review Bottleneck
The traditional answer to ensuring data quality is human expert review. Have physicians, researchers, and clinical specialists review every statement in a training dataset to verify it’s accurate. Simple, right?
Not even close. Let’s do the math.
Modern large language models are trained on datasets containing millions of individual facts and claims. Even a relatively small biomedical training dataset might contain 100,000 statements that need verification. Let’s say you want a domain expert to review each statement:
1. Literature search to find relevant current evidence: 30 minutes per statement (optimistic — assuming a skilled researcher and a narrow topic with limited literature)
2. Evidence evaluation to determine what current research actually says: 1+ hours (again optimistic)
3. Judgment and documentation of whether the statement is accurate: 10 minutes after the literature research has concluded
That’s about 100 minutes per statement, and we’re being optimistic. At this rate, reviewing 100,000 statements would take:
- ≈ 166,667 person-hours, or about 80 FTE-years. You would have to pay 80 skilled researchers for a full year of full-time work each.
Even if we skip the full literature review and settle for a skilled but cursory fact check in a domain we are familiar with, such a superficial check still cannot realistically be done in under ten minutes per statement. That would still leave us paying eight full-time researchers for a full year.
And remember, that’s for a relatively small dataset. Some biomedical training corpora contain millions of statements. Reviewing those at 10 minutes per statement would take centuries of human expert time.
The Economics of Expert Time
Let’s talk cost. A qualified physician or medical researcher capable of evaluating biomedical statements might bill at $200–400 per hour for consulting work. Using the conservative estimate of $200/hour:
- Reviewing 100,000 statements at 10 minutes each = 16,667 hours
At $200/hour, that’s more than $3.3 million. For a single dataset. Which will be outdated within a few years, if not sooner, and will need re-validation.
Even if you could find enough qualified experts willing to do this work (you can’t — they have patients to see and research to conduct), the cost is prohibitive for most organizations. And remember, this is just for initial validation. As new evidence emerges, you’d need to periodically re-review statements to catch knowledge drift. That multi-million-dollar review becomes a recurring expense.
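For readers who want to check the arithmetic, here is a minimal back-of-envelope sketch in Python; the per-statement times, the hours in an FTE year, and the hourly rate are simply the assumptions quoted above, not measured data.

```python
# Back-of-envelope check of the expert-review estimates above.
# All inputs are the assumptions stated in the text, not measurements.

STATEMENTS = 100_000
FTE_HOURS_PER_YEAR = 2_080      # roughly 40 h/week * 52 weeks
HOURLY_RATE_USD = 200           # conservative consulting rate from the text

def review_estimate(minutes_per_statement: float) -> tuple[float, float, float]:
    """Return (total hours, FTE-years, cost in USD) for one full review pass."""
    hours = STATEMENTS * minutes_per_statement / 60
    return hours, hours / FTE_HOURS_PER_YEAR, hours * HOURLY_RATE_USD

for label, minutes in [("full literature review", 100), ("cursory fact check", 10)]:
    hours, fte_years, cost = review_estimate(minutes)
    print(f"{label:>22}: {hours:>9,.0f} h ≈ {fte_years:4.1f} FTE-years ≈ ${cost:,.0f}")
```

Run as written, this reproduces the roughly 167,000 hours (about 80 FTE-years) for a full review pass and the roughly 16,700 hours (over $3.3 million at $200/hour) for even a cursory check.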
The Scarcity of Expertise
There’s another problem: medical expertise is a scarce resource. The United States is facing a projected physician shortage of up to 124,000 doctors by 2034. AI may eventually mitigate that, but not for the foreseeable future: being a physician or surgeon involves far more than passing a few benchmarks built from case studies. Globally, many countries have severe shortages of healthcare workers. These experts are desperately needed for patient care, medical education, and clinical research.
Asking them to spend thousands of hours reviewing AI training datasets is not just expensive — it’s an inefficient use of a scarce and valuable resource. Every hour a cardiologist spends reviewing training data statements about heart disease is an hour they’re not seeing patients, training residents, or conducting research that might advance cardiac care.
This creates an impossible situation: we need expert review to ensure AI training data quality, but we can’t afford the time or cost of expert review at the scale required, and we can’t divert experts from higher-value work even if we could afford it. And we won’t be able to replace those experts with AI (or even assist them meaningfully)… because we haven’t got the training data to train the AI.
The Time Dimension
There’s one more problem with human expert review: time lag. Even if you somehow assembled a team of experts and had unlimited budget, the review process would take months or years for large datasets. By the time review is complete, new evidence has emerged that might change the evaluation of some statements.
This is especially problematic for rapidly evolving fields. During the COVID-19 pandemic, clinical understanding shifted every few months. A review conducted in early 2020 would be outdated by mid-2020. A review that ran from January to June might reach contradictory conclusions in its earlier and later parts, simply because medical knowledge evolved during the review period.
Human expert review is essential for many tasks, but as the sole mechanism for validating large-scale AI training data, it simply doesn’t work. It’s too slow, too expensive, and requires too much of a scarce resource. We need something better.
Part IV: The Solution — Automated Fact-Checking at Scale
Image created by the author using ChatGPT-4
What Makes Automated Fact-Checking Possible
For automated fact-checking of biomedical statements to work, several pieces need to be in place:
1. A Comprehensive, Current Knowledge Base
You need access to essentially all biomedical literature, kept continuously up-to-date. Not a snapshot from three years ago, but a living repository that captures new publications as they appear.
This is now achievable. PubMed contains over 37 million article citations and is updated daily. Preprint servers like medRxiv provide early access to research before formal publication. With modern databases and storage, maintaining a comprehensive biomedical literature repository is technically and economically feasible. Semantic search, knowledge graphs, re-ranking strategies, and other techniques make reliable literature discovery possible even at scale.
2. Intelligent Query Processing
A biomedical statement needs to be converted into a database query that finds relevant literature. This is harder than it sounds. The statement “Metformin is first-line therapy for type 2 diabetes” needs to retrieve articles about metformin AND type 2 diabetes AND treatment guidelines AND comparative effectiveness AND alternative treatments — not just articles that happen to mention these terms.
Modern natural language processing, particularly large language models trained on medical text, can now do this reliably. They understand medical terminology, can expand queries to include synonyms and related concepts, and can construct database searches that find genuinely relevant literature rather than just keyword matches.
3. Relevance Assessment
Not every article retrieved by a search is actually relevant to the statement being checked. You need a way to score documents for relevance and focus on the most pertinent ones.
Again, modern AI makes this possible. Language models can read an article abstract and determine whether it actually addresses the specific question at hand. They can score relevance on a scale (1–5, for example) and filter out noise.
4. Evidence Extraction and Synthesis
The hardest part: you need to extract specific passages from relevant articles, determine whether they support or contradict the statement, and synthesize potentially contradictory evidence into an overall judgment.
This requires sophisticated natural language understanding — the kind that, until recently, was thought to require human intelligence. But modern large language models have demonstrated surprising capabilities in this area. They can identify relevant passages, understand their stance toward a claim, and reason about conflicting evidence.
5. Confidence Assessment
Not all fact-checking judgments are equally reliable. A statement supported by ten high-quality randomized trials should be rated more confidently than one supported by a single observational study. The system needs to quantify its own uncertainty.
This is achievable through analysis of evidence quantity, quality, and consistency. If all retrieved studies point in the same direction, confidence is high. If evidence is mixed or limited, confidence is low. These signals can be used to calibrate uncertainty estimates.
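As a rough illustration of what such a calibration heuristic could look like, here is a minimal sketch; the thresholds and labels are hypothetical and chosen only to show how evidence quantity and consistency can be combined.

```python
# Hypothetical confidence heuristic: more evidence pointing consistently in one
# direction yields higher confidence. Thresholds are illustrative only.

def confidence(n_support: int, n_contradict: int) -> str:
    total = n_support + n_contradict
    if total == 0:
        return "insufficient evidence"
    consistency = max(n_support, n_contradict) / total   # 1.0 means unanimous
    if total >= 5 and consistency >= 0.9:
        return "high"
    if total >= 3 and consistency >= 0.7:
        return "medium"
    return "low"

print(confidence(n_support=0, n_contradict=8))   # "high": consistent contradiction
print(confidence(n_support=2, n_contradict=3))   # "low": mixed evidence
```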
The BMLibrarian Fact Checker: A Working Solution
The BMLibrarian Fact Checker demonstrates (as a fully working proof of concept, not a production-ready, stress-tested application) that all these pieces can be integrated into a working system. Here’s how it works:
Step 1: Statement Input
A biomedical statement is submitted for fact-checking. For example: “All cases of childhood ulcerative colitis require colectomy.”
Step 2: Query Generation
A QueryAgent (specialized AI component) analyzes the statement and generates a database query to search the literature. It identifies key concepts (childhood, ulcerative colitis, colectomy, surgical treatment vs. medical management) and constructs a search that retrieves relevant articles.
The system searches a PostgreSQL database containing 37+ million PubMed articles, updated daily, plus medRxiv preprints. This ensures fact-checking against current evidence, not historical snapshots.
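To make the idea concrete, here is a hypothetical sketch of how a statement could be turned into a PostgreSQL full-text search expression by a locally served model. The prompt wording, the model name, and the Ollama-style endpoint are assumptions for illustration, not BMLibrarian’s actual QueryAgent implementation.

```python
# Illustrative only: convert a biomedical statement into a to_tsquery expression
# by asking a locally served LLM (here an Ollama-style endpoint on localhost).
import json
import urllib.request

PROMPT = (
    "Extract the key biomedical concepts from the statement below, including "
    "common synonyms, and return a single PostgreSQL to_tsquery expression "
    "(terms joined with & and |), nothing else.\n\nStatement: {statement}"
)

def generate_tsquery(statement: str, model: str = "llama3.1") -> str:
    """Ask a local model for a full-text search expression for this statement."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT.format(statement=statement),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()

# generate_tsquery("All cases of childhood ulcerative colitis require colectomy")
# might return something like:
#   (child | pediatric) & ulcerative & colitis & (colectomy | surgery)
```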
Step 3: Document Scoring
Retrieved articles are scored for relevance by a DocumentScoringAgent. Each article gets a score from 1 (not relevant) to 5 (highly relevant). Only high-scoring documents proceed to the next stage, reducing computational overhead and focusing on the best evidence.
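A minimal sketch of this scoring step, assuming some `ask_llm` callable that sends a prompt to a local model and returns its reply as text; the prompt wording and the threshold of 4 are illustrative assumptions, not the project’s actual implementation.

```python
# Illustrative relevance scoring: ask a model for a 1-5 score per abstract and
# keep only documents at or above a threshold.
import re

SCORING_PROMPT = (
    "On a scale of 1 (not relevant) to 5 (highly relevant), how relevant is the "
    "following abstract to the statement?\n"
    "Statement: {statement}\nAbstract: {abstract}\nAnswer with a single digit."
)

def parse_score(reply: str) -> int:
    """Pull the first digit 1-5 out of a model reply; default to 1 if none found."""
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1

def relevant_documents(statement, documents, ask_llm, threshold=4):
    """Keep only documents whose model-assigned relevance score meets the threshold."""
    kept = []
    for doc in documents:
        prompt = SCORING_PROMPT.format(statement=statement, abstract=doc["abstract"])
        score = parse_score(ask_llm(prompt))
        if score >= threshold:
            kept.append({**doc, "score": score})
    return kept
```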
Step 4: Citation Extraction
A CitationFinderAgent processes high-scoring documents to extract specific passages that support or contradict the statement. It identifies relevant sentences or paragraphs, labels them as “supports,” “contradicts,” or “neutral,” and records the exact text with citation information (PMID, DOI, etc.).
For the example statement, it might extract citations like:
- CONTRADICTS: “Most pediatric ulcerative colitis cases achieve remission with medical therapy; colectomy is reserved for medically refractory disease” (PMID: 12345678)
- CONTRADICTS: “In our cohort of 200 children with UC, only 15% required colectomy within 10 years” (PMID: 87654321)
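A small data model along these lines might look as follows; the field names are illustrative, not BMLibrarian’s actual schema.

```python
# Minimal container for an extracted citation: stance, quoted passage, provenance.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Citation:
    stance: Literal["supports", "contradicts", "neutral"]
    passage: str                # exact quoted text from the source abstract
    pmid: str                   # PubMed identifier, used for provenance checks
    doi: Optional[str] = None

example = Citation(
    stance="contradicts",
    passage="Most pediatric ulcerative colitis cases achieve remission with medical therapy",
    pmid="12345678",
)
```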
Step 5: Evidence Synthesis
A FactCheckerAgent synthesizes all extracted citations to render a judgment:
- Evaluation: “no” (the statement is contradicted by evidence)
- Reasoning: “Current literature shows that most childhood ulcerative colitis cases respond to medical management. Colectomy is typically reserved for severe or medically refractory disease, not required in all cases.”
- Confidence: “high” (based on multiple contradicting citations, no supporting evidence, and consistent message across studies)
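The actual synthesis relies on a language model to weigh the extracted passages against each other; the toy counting rule below only sketches the shape of the final decision, reusing the hypothetical `Citation` objects from the previous step.

```python
# Illustrative decision rule: map the balance of extracted citations onto a
# yes / no / uncertain verdict (the real system reasons over the passages).
def synthesize(citations) -> str:
    supports = sum(c.stance == "supports" for c in citations)
    contradicts = sum(c.stance == "contradicts" for c in citations)
    if supports == 0 and contradicts == 0:
        return "uncertain"          # no usable evidence either way
    if contradicts > supports:
        return "no"                 # the statement is contradicted
    if supports > contradicts:
        return "yes"                # the statement is supported
    return "uncertain"              # evidence is evenly split
```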
Step 6: Result Storage and Validation
The complete fact-check result is stored in a database with full metadata: the statement, evaluation, reasoning, confidence level, all citations with their provenance, and processing details. This enables:
- Verification that all citations reference real documents (no hallucination)
- Comparison with expected answers (if available) to measure accuracy
- Human review of uncertain cases
- Retrospective analysis and system improvement
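One possible way to persist such results in PostgreSQL is sketched below with psycopg; the table layout and column names are illustrative assumptions, not the project’s actual schema.

```python
# Illustrative storage layout for fact-check results and their citations.
import psycopg                           # psycopg 3: pip install "psycopg[binary]"
from psycopg.types.json import Jsonb

SCHEMA = """
CREATE TABLE IF NOT EXISTS fact_checks (
    id          BIGSERIAL PRIMARY KEY,
    statement   TEXT NOT NULL,
    evaluation  TEXT NOT NULL,           -- yes / no / uncertain
    reasoning   TEXT,
    confidence  TEXT,                    -- high / medium / low
    citations   JSONB,                   -- list of {stance, passage, pmid, doi}
    checked_at  TIMESTAMPTZ DEFAULT now()
);
"""

def store_result(conninfo: str, result: dict) -> None:
    """Persist one fact-check result with its citations for later auditing."""
    with psycopg.connect(conninfo) as conn, conn.cursor() as cur:
        cur.execute(SCHEMA)
        cur.execute(
            "INSERT INTO fact_checks "
            "(statement, evaluation, reasoning, confidence, citations) "
            "VALUES (%s, %s, %s, %s, %s)",
            (result["statement"], result["evaluation"], result["reasoning"],
             result["confidence"], Jsonb(result["citations"])),
        )
```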
The Multi-Agent Architecture: Why It Works
The system uses multiple specialized AI agents rather than a single monolithic model. This design mirrors how human experts actually fact-check claims:
1. Human Expert Process:
- Understand the claim and formulate search strategies
- Search literature databases
- Screen results for relevance
- Read relevant articles and extract pertinent information
- Synthesize evidence into an overall judgment
- Assess confidence based on evidence strength
2. BMLibrarian Multi-Agent Process:
- QueryAgent formulates database searches
- Database system retrieves articles
- DocumentScoringAgent screens for relevance
- CitationFinderAgent extracts pertinent passages
- FactCheckerAgent synthesizes evidence into judgment
- System calculates confidence based on evidence metrics
Each agent specializes in one part of the process, similar to how expert fact-checking involves multiple cognitive steps. This division of labor makes the system more reliable and maintainable than a single model trying to do everything at once.
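Expressed as plain function composition, the pipeline looks roughly like the sketch below; the agent interfaces are hypothetical and only mirror the steps described above.

```python
# Hypothetical end-to-end orchestration of the multi-agent pipeline.
def fact_check(statement, query_agent, db, scoring_agent, citation_agent, checker_agent):
    """Run one statement through the pipeline and return a structured result."""
    query     = query_agent.build_query(statement)            # formulate the search
    documents = db.search(query)                               # retrieve candidate articles
    relevant  = scoring_agent.filter(statement, documents)     # keep the best evidence
    citations = citation_agent.extract(statement, relevant)    # quote supporting/contradicting text
    verdict   = checker_agent.judge(statement, citations)      # evaluation, reasoning, confidence
    return {"statement": statement, **verdict, "citations": citations}
```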
How Fast Is It?
On a single MacBook Pro M3 Max with 128 GB RAM, the system processes approximately 25 statements per hour, completely autonomously and without requiring any internet connection. The laptop isn’t even dedicated to the task: I use it to write the software, and even to write this article, while the fact checker chugs away in a background process. That means:
- 100,000 statements: ~4,000 hours of processing time, roughly 167 days if run around the clock
- one million biomedical statements/hypotheses: ~40,000 hours, still on a single MacBook Pro running proof-of-concept prototype software without any optimisation. With the processing power available at AI data centres, the same workload could be done in a few hours at most, perhaps even in minutes.
Compare this to human expert review (recall: > 16,000 hours for 100,000 statements). This makes large-scale continuous validation actually achievable.
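A quick sanity check of those throughput figures, using the measured rate of roughly 25 statements per hour:

```python
# Throughput arithmetic for the figures quoted above.
RATE_PER_HOUR = 25
for n in (100_000, 1_000_000):
    hours = n / RATE_PER_HOUR
    print(f"{n:>9,} statements: {hours:>6,.0f} h ≈ {hours / 24:,.0f} days on one laptop")
```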
And the cost?
Running the system requires:
- Database server (can use existing infrastructure)
- Compute for AI inference (local models, no API costs)
- Storage for results
- power to run the hardware (my MacBook is rather frugal compared to typical NVIDIA setups)
- at some stage, a human expert for human-in-the-loop decision support when the system faces uncertainty (not very common, but it happens, and those cases can be batch-processed at any time)
The total infrastructure cost is orders of magnitude less than hiring medical experts for months or years of review work.
In practice, it costs less than a single expert would cost for a week of work, while completing over a few months the same task that one expert could not finish in decades. And computing costs scale well: in a data centre with powerful hardware, the job could be done in minutes to hours, at a cost of less than a few expert working days.
Does It Actually Work?
This is the crucial question. **An automated fact-checking system that’s fast and cheap but unreliable is worse than useless** — it could give false confidence in bad data. Stay tuned for our preliminary evaluation results…
Catch-22: we too depend on human experts to evaluate our system’s performance. Machine analysis of the 1,000-statement “gold standard” test dataset has concluded, but human review is time-consuming.
Spoiler alert: after a few hundred rows evaluated by three human experts (blinded to the machine evaluation), with a stable trend in the statistical analysis, we can already say:
- agreement between the fact checker and the analysed training dataset is low (mid-50-percent ballpark, mainly due to the outdated knowledge bases used to build the training data; trend so far stable)
- agreement between the human experts and the fact checker is very high (high-ninety-percent ballpark; trend so far stable)
Stay tuned for when we publish our validation data soon(ish)…
PS: Join the BMLibrarian Project
Help us become the #1 biomedical research Swiss Army knife
We’re actively seeking contributors to help advance BMLibrarian’s fact-checking capabilities. Whether your expertise lies in software development, biomedical research, technical writing, or documentation, there are meaningful ways to contribute.
How You Can Contribute:
- Design and Development: Help us build and refine the codebase (available on GitHub)
- Code Review: Provide feedback and ensure code quality
- Documentation: Write or proofread technical documentation and maintain our Wiki
- Performance Evaluation: Participate in evaluating the fact-checking module’s accuracy
For Biomedical Researchers:
If you have a bachelor’s degree or higher in a biomedical field and some research experience, we can provide a single-click installer for our human evaluation application. This cross-platform tool (Mac, Linux, Windows) requires no special hardware and allows you to:
- Review AI-generated training statements
- Assess their validity based on current evidence (yes/no/uncertain)
- Add optional comments and citations
- Export your evaluations for analysis
Getting Started as a Developer:
We’ve made onboarding straightforward with a detailed programmer’s quickstart manual featuring step-by-step instructions for writing plugins or agents. Our comprehensive wiki serves both end users and developers — you won’t need to navigate 70,000+ lines of code to begin. Simply use one of our templates and customize as needed.
Acknowledgment and Open Science:
Contributors will be credited as evaluators/collaborators in project documentation. While we currently operate as a volunteer effort with modest support (including a generous $1,000 in credits from Anthropic for Claude Code use), we’re committed to open science principles: all data, code, and evaluation results are freely available to the research community.
Interested? Visit our GitHub repository to explore the project and get involved.