Multicalibration for LLM-based Code Generation
arxiv.org·2h


Abstract: As AI-based code generation becomes widespread, researchers are investigating the calibration of code LLMs, that is, ensuring that their confidence scores faithfully represent the true likelihood of code correctness. To this end, we investigate multicalibration, which can capture additional factors about a coding problem, such as its complexity, code length, or the programming language used. We study four multicalibration approaches on three function synthesis benchmarks, using latest-generation code LLMs (Qwen3 Coder, GPT-OSS, DeepSeek-R1-Distill). Our results demonstrate that multicalibration can yield distinct improvements over both …
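To make the distinction between calibration and multicalibration concrete, the following is a minimal Python sketch, not the paper's method. It assumes a standard binned expected calibration error (ECE) as the diagnostic and uses made-up toy data; the function names `ece` and `groupwise_ece` and the group labels are hypothetical. It shows a confidence score that looks perfectly calibrated overall but is miscalibrated within each programming-language group, which is the kind of gap multicalibration is meant to address.

```python
from collections import defaultdict


def ece(confidences, outcomes, n_bins=10):
    """Binned expected calibration error with equal-width confidence bins."""
    bins = defaultdict(list)
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    err = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        err += len(items) / total * abs(mean_conf - accuracy)
    return err


def groupwise_ece(confidences, outcomes, groups, n_bins=10):
    """ECE computed separately for each group label (hypothetical helper)."""
    by_group = defaultdict(lambda: ([], []))
    for c, y, g in zip(confidences, outcomes, groups):
        by_group[g][0].append(c)
        by_group[g][1].append(y)
    return {g: ece(cs, ys, n_bins) for g, (cs, ys) in by_group.items()}


# Toy data (made up): the model reports 0.5 confidence on every sample.
# outcome 1 = generated code passed its tests, 0 = it failed.
confidences = [0.5] * 8
outcomes = [1, 1, 1, 1, 0, 0, 0, 0]
groups = ["python"] * 4 + ["rust"] * 4

overall = ece(confidences, outcomes)
per_group = groupwise_ece(confidences, outcomes, groups)
print(overall)    # 0.0: calibrated on average
print(per_group)  # {'python': 0.5, 'rust': 0.5}: miscalibrated in each group
```

In this toy setting the overall error is zero because half of all samples are correct, yet the same score is far too low for one group and far too high for the other. Grouping by language is only one choice; complexity or code length buckets could be substituted for `groups` in the same way.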
