On-Device Summaries: CI Evals Without Fake Confidence

A schema-valid NoteSummary only proves the app received a bindable shape. It does not prove the actionItem captured the decision the note actually made. A useful CI eval should keep the claim smaller: run curated note fixtures through the same summary path the screen uses, score only crisp screen-specific failure modes as boolean rules, and label every green run by model version, fixture set, and scoring-rule version. Green does not mean summary quality. It means these known fixtures still pa...
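A minimal sketch of that narrower claim, in Python for brevity rather than in the app's own language. Only `NoteSummary` and `actionItem` come from the text above; the `Fixture` type, the two rules, the `run_eval` helper, the stub summarizer, and all version strings are hypothetical stand-ins. In a real setup, `summarize` would be the same summary path the screen calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NoteSummary:
    title: str
    actionItem: str  # field name from the article; other fields are assumed


@dataclass(frozen=True)
class Fixture:
    name: str
    note: str
    must_mention: str  # hypothetical: keyword the action item has to carry


# Each rule is a crisp boolean check, not a quality score.
def action_item_not_empty(fx: Fixture, s: NoteSummary) -> bool:
    return bool(s.actionItem.strip())


def action_item_mentions_decision(fx: Fixture, s: NoteSummary) -> bool:
    return fx.must_mention.lower() in s.actionItem.lower()


RULES: dict[str, Callable[[Fixture, NoteSummary], bool]] = {
    "action_item_not_empty": action_item_not_empty,
    "action_item_mentions_decision": action_item_mentions_decision,
}


def run_eval(fixtures, summarize, *, model_version, fixture_set, rules_version):
    """Run every fixture through `summarize` and record which rules failed."""
    failures = []
    for fx in fixtures:
        summary = summarize(fx.note)
        for rule_name, rule in RULES.items():
            if not rule(fx, summary):
                failures.append((fx.name, rule_name))
    # The labels travel with the result so "green" is never read unqualified.
    return {
        "model_version": model_version,
        "fixture_set": fixture_set,
        "rules_version": rules_version,
        "fixtures_run": len(fixtures),
        "failures": failures,
        "green": not failures,
    }


# Hypothetical fixture and a stub standing in for the on-device model.
FIXTURES = [
    Fixture(
        name="renewal",
        note="Discussed vendors. Decision: renew the Acme contract.",
        must_mention="acme",
    ),
]


def stub_summarize(note: str) -> NoteSummary:
    # Stub only: treats the last sentence of the note as the action item.
    return NoteSummary(title=note[:20], actionItem=note.split(". ")[-1])


report = run_eval(
    FIXTURES,
    stub_summarize,
    model_version="stub-0",
    fixture_set="notes-v1",
    rules_version="rules-v1",
)
print(report)
```

The report deliberately carries no aggregate score. A CI job could fail the build when `green` is false and print the three labels alongside the failing `(fixture, rule)` pairs, which keeps the claim scoped to the fixtures and rules that actually ran.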
