doceval — eval harness for LLM document extraction pipelines
I kept seeing the same gap: people ship LLM-based document extractors (invoices, receipts, forms) with no systematic way to know how accurate they actually are. So I built doceval — point it at your extractor function + a labeled dataset and get back field-level accuracy, a failure taxonomy (missed_field / hallucination / wrong_format / wrong_value), and optional per-document cost tracking. Works with any extractor (Claude, GPT, regex, rules) and any document schema. One JSON label file per d...
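To make the idea concrete, here is a rough sketch of what a field-level evaluation with that failure taxonomy could look like. It is not doceval's actual API: the function names (`normalize`, `classify_field`, `evaluate`), the exact definition of each failure category, and the toy invoice data are all assumptions made for illustration.

```python
import re
from collections import Counter


def normalize(value):
    """Loose canonical form (assumed): lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", str(value).lower())


def classify_field(expected, predicted):
    """Return 'correct' or one of the four failure categories for one field.

    Category definitions here are assumptions, not doceval's own rules.
    """
    if expected is None and predicted is None:
        return "correct"
    if predicted is None:
        return "missed_field"      # label has a value, extractor returned nothing
    if expected is None:
        return "hallucination"     # extractor produced a field the label lacks
    if expected == predicted:
        return "correct"
    if normalize(expected) == normalize(predicted):
        return "wrong_format"      # same content, different surface form
    return "wrong_value"


def evaluate(extractor, dataset):
    """Score an extractor (document -> dict) on (document, label_dict) pairs."""
    outcomes = Counter()
    for document, label in dataset:
        prediction = extractor(document)
        # Score every field seen in either the label or the prediction.
        for field in set(label) | set(prediction):
            outcomes[classify_field(label.get(field), prediction.get(field))] += 1
    total = sum(outcomes.values())
    accuracy = outcomes["correct"] / total if total else 0.0
    return accuracy, outcomes


# Hypothetical labeled example and a toy extractor standing in for an LLM call.
dataset = [
    ("Invoice 17, total $1,200.00",
     {"invoice_id": "17", "total": "1200.00", "vendor": "Acme"}),
]


def toy_extractor(document):
    return {"invoice_id": "17", "total": "$1,200.00", "po_number": "99"}


accuracy, outcomes = evaluate(toy_extractor, dataset)
print(f"field accuracy: {accuracy:.2f}")
print(dict(outcomes))
```

In this toy run only `invoice_id` matches exactly; `total` counts as wrong_format, `vendor` as missed_field, and `po_number` as hallucination. Because the extractor is just a callable, the same loop would apply whether the fields come from an LLM, a regex, or hand-written rules.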