MetaHarmonizer: robust biomedical metadata harmonization and a contamination control for inflated LLM performance on public benchmarks (opens in new tab)
Public biomedical repositories hold substantial reuse potential, but inconsistent metadata routinely blocks integration across studies. Recent LLM-based harmonization approaches address scale but suffer from non-determinism, hallucinated ontology terms, and, in their highest-accuracy configurations, dependence on proprietary APIs or labeled fine-tuning data. A more fundamental concern is that LLM accuracies on widely-used public benchmarks may substantially inflate transferable capability: un...
Read the original article