Many teams struggle to take GenAI projects from pilot to production. They are blocked by quality requirements they can neither measure nor meet, but which are necessary for customer satisfaction. Teams that do reach production often struggle to iterate safely, facing regressions and unpredictable changes in output quality.
Databricks enables customers to build systematic evaluation infrastructure through solutions like Judge Builder, addressing these challenges while creating strategic value that compounds over time. The evaluation methodology we discuss in this post reflects the same research-driven approach that underpins Agent Bricks, forming the foundation for how we measure, monitor, and continuously improve agent quality across the Databricks AI platform.
Operationally, evaluations enable faster deployment by quantifying performance changes and informing deployment decisions against specific business requirements. Robust, continuous evaluations provide the confidence needed to scale AI applications across the enterprise while consistently meeting performance, safety, and compliance standards.
Strategically, the evaluation data generated—human feedback, model judgments, agent traces—becomes a reusable asset. Customers can leverage this data to train future models, validate evolving agentic workflows, and adapt to whatever emerges next in the rapidly advancing AI landscape. Evaluations codify organizational expertise into persistent competitive advantage.
Building robust AI system evaluations and corresponding reliable AI systems is a significant cross-functional organizational challenge that demands a clear strategic approach across each of the following dimensions:
- Getting Organizations to Design and Prioritize a Judge Portfolio: enabling a variety of stakeholders to agree on what dimensions of quality are worth measuring
- Codifying Expertise Accurately and Reliably: capturing and encoding subject matter expertise from a limited set of experts with minimal effort and noise
- Technical Execution: building tooling that allows for rapid iteration on and deployment of judges, as well as robust measurement, auditability, and governance at scale
In the remainder of this blog post, we’ll explore each of these three dimensions and how we help customers work through them. **Judge Builder** is built on these experiences and streamlines this process, enabling teams to rapidly develop, test, and deploy judges at scale. **If you are interested in working with us to develop custom LLM judges, please contact your account team.**
**Getting Organizations to Design and Prioritize a Judge Portfolio**
**LLM judges do more than just measure.** An LLM judge serves as a product specification and, because teams optimize against it, fundamentally shapes model behavior in practice. Thus, a miscalibrated or faulty judge can collect signals that are irrelevant to the quality of your application and drive optimization toward the wrong thing entirely.
Therefore, before creating and calibrating a judge, teams must first determine what judge(s) they need. These discussions should involve a variety of stakeholders: some judges can be shared across teams, so investing in high-quality judges can pay dividends and accelerate GenAI development across an organization.
We advise our customers to define their judges to focus on precise dimensions of quality; for example, a judge that assesses whether “the response is relevant, factual, and concise” should be decomposed into three separate judges.
Single comprehensive judges obscure root causes. A failing ‘overall quality’ score tells you something’s wrong but not what to fix. Decomposed judges turn evaluation into actionable debugging: when quality drops, you immediately know whether it’s a factuality issue, a relevance issue, or a formatting issue, and can direct improvement efforts accordingly. However, decomposition can also quickly lead to a proliferation of judges, so prioritization becomes critical.
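To make this concrete, here is a minimal, purely illustrative sketch of what decomposition can look like; the field names and structure below are hypothetical, not a Judge Builder or MLflow schema.

```python
# Purely illustrative: a compound quality criterion decomposed into three
# narrowly scoped judges. Field names are hypothetical, not a specific API.
compound_judge = {
    "name": "overall_quality",
    "statement": "The response is relevant, factual, and concise.",
    "scale": ["pass", "fail"],  # a failing score doesn't say what to fix
}

decomposed_judges = [
    {
        "name": "relevance",
        "statement": "The response directly addresses the user's question.",
        "scale": ["pass", "fail"],
    },
    {
        "name": "factuality",
        "statement": "Every claim in the response is supported by the retrieved documents.",
        "scale": ["pass", "fail"],
    },
    {
        "name": "conciseness",
        "statement": "The response contains no repeated or unnecessary content.",
        "scale": ["pass", "fail"],
    },
]
```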
Judge selection should combine both top-down and bottom-up considerations:
Top-down: These are quality requirements known to be relevant from the outset.
- Judges derived from these considerations often form the core of continuous monitoring efforts.
- Some examples might be: making sure not to provide prescriptive medical advice, refusing to answer questions unrelated to the business, and adhering to formatting guidelines.
- They may be informed by a variety of inputs including, but not limited to: regulatory requirements, business stakeholder perspectives, corporate style requirements, etc.
Bottom-up: This approach identifies actual failure modes observed in your model’s outputs.
- These judges are crucial for pinpointing specific issues that need resolution and are heavily used in offline evaluations to drive performance improvements.
- Some examples might be: looping outputs, incorrect usage of delimiters, misunderstanding of business jargon, etc.
- Techniques for surfacing these failure modes include open and axial coding [1] and algorithmic clustering to discover and define distinct categories of failure (a minimal clustering sketch follows this list).
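As one way to do the clustering step, here is a minimal sketch, assuming the sentence-transformers and scikit-learn libraries (assumptions on our part, not tooling required by this workflow), that groups failing outputs into candidate failure categories for SMEs to review and name:

```python
# A minimal sketch of algorithmic clustering over failing outputs to surface
# candidate failure categories. Assumes the sentence-transformers and
# scikit-learn libraries; any embedding model and clustering method works.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

failing_outputs = [
    "Response repeats the closing paragraph three times ...",
    "Response uses '|' instead of the required ';' delimiter ...",
    "Response misreads 'book' as a noun rather than the trading term ...",
    # ... more outputs flagged as failures during error analysis
]

# Embed each failing output and group similar failures together.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(failing_outputs)
cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(embeddings)

# Review a few examples per cluster and name the failure mode; each named
# cluster (e.g., "looping", "delimiter misuse") is a candidate bottom-up judge.
by_cluster = defaultdict(list)
for output, cluster_id in zip(failing_outputs, cluster_ids):
    by_cluster[cluster_id].append(output)
for cluster_id, examples in sorted(by_cluster.items()):
    print(f"cluster {cluster_id}: {len(examples)} examples, e.g. {examples[0][:60]!r}")
```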
For example, one customer built a top-down judge for correctness—ensuring their AI cited accurate information. Through bottom-up analysis of agent traces, they discovered a pattern: correct responses almost always cited the top two retrieval results. This insight became a new judge that could proxy for correctness in production, where ground-truth labels weren’t available. The combination of both approaches created a more robust evaluation system than either alone and enabled real-time monitoring at scale.
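To illustrate what such a proxy can look like, here is a minimal sketch of a deterministic check inspired by that pattern; the trace fields are hypothetical, and a real proxy judge should still be calibrated against SME labels before being trusted in production.

```python
# A minimal sketch of a deterministic proxy judge inspired by the pattern
# above: correct responses tended to cite the top-two retrieval results.
# The trace fields (cited_doc_ids, retrieval_ranking) are hypothetical.
def top_k_citation_proxy(cited_doc_ids: list[str],
                         retrieval_ranking: list[str],
                         k: int = 2) -> bool:
    """Pass if the response cites at least one document and every citation
    appears among the top-k retrieved documents."""
    top_k = set(retrieval_ranking[:k])
    return bool(cited_doc_ids) and all(doc in top_k for doc in cited_doc_ids)

# Example: citations doc_7 and doc_2 both fall within the top-2 results, so
# the proxy passes even though no ground-truth label exists for this trace.
print(top_k_citation_proxy(["doc_7", "doc_2"], ["doc_7", "doc_2", "doc_9"]))  # True
```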
The goal is to find a minimal set of judges that accurately reflect the most high-priority dimensions of quality for your application. The most effective approach combines both top-down and bottom-up insights, ensuring judges are both comprehensive and relevant to your application’s actual performance.
This is an iterative process. As your application’s quality and requirements evolve, so too will your judge portfolio. Treat judges as living artifacts that grow with your system.
**Codifying Expertise Accurately and Reliably**
It is extremely challenging to define what “good” actually means for domain-specific tasks where outputs are subjective, context-dependent, and require specialized expertise to evaluate.
Many enterprise organizations have only a handful of subject matter experts (SMEs) who are sufficiently knowledgeable to assess the quality of the application they are building. These SMEs are few in number, highly valuable, and difficult to scale. Ensuring any judge you build has properly encoded their knowledge enables evaluating applications at unprecedented scales.
This requires careful construction and calibration. Otherwise, judges might flag acceptable outputs, miss real issues, or interpret requirements differently than SMEs would.
Discovery: Focus on What Matters
Collect examples that matter and conduct rigorous error analysis using qualitative methods (e.g. open and axial coding; a minimal sketch of such coding appears after this list). Focus on:
- Edge cases where pass/fail decisions are borderline
- Clear boundaries that define what good looks like versus common failure modes
- Disagreement points where stakeholders might reasonably differ
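As a purely illustrative sketch of what the output of this error analysis might look like (the record structure is hypothetical, not any specific tool’s schema), open codes are assigned per trace and then grouped into axial categories that become candidate judges:

```python
# Illustrative error-analysis records: free-form open codes are assigned per
# trace, then grouped into axial categories. The structure is hypothetical.
from collections import Counter

error_analysis = [
    {"trace_id": "trace_0007",
     "open_codes": ["repeats final paragraph", "answer buried at the end"],
     "axial_category": "looping / verbosity"},
    {"trace_id": "trace_0019",
     "open_codes": ["uses '|' delimiter", "section headers out of order"],
     "axial_category": "formatting"},
    {"trace_id": "trace_0031",
     "open_codes": ["treats 'book' as a noun", "wrong product line"],
     "axial_category": "business jargon"},
]

# Counting examples per axial category shows which failure modes are common
# enough to warrant their own judge first.
print(Counter(record["axial_category"] for record in error_analysis))
```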
Judge Annotation Guidelines: Translate Insights into Clear Criteria
Based on your error analysis findings and any a priori judges that are necessary, create annotation guidelines. This involves:
- Identifying relevant dimensions of quality for which you want to create a judge.
- Formulating a non-leading, neutral statement that participants can respond to on a categorical scale (often binary, or a 5-point Likert scale). A poorly worded guideline might ask ‘Is the response high quality?’ A well-designed guideline states: ‘The response correctly cites at least one source document (yes/no)’ with explicit clarification that formatting errors should be ignored for this judge.
- Providing SMEs with clear and detailed guidelines for how they should provide labels. Be very thorough here. For example: if the judge is formatting-related and the format is correct, they should treat that as a success even if the answer is wrong overall.
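Putting these pieces together, an annotation guideline and the labels collected against it might be recorded as follows; this is an illustrative sketch with hypothetical field names, not a Judge Builder schema.

```python
# Illustrative annotation guideline and one collected label for a single
# judge; field names are hypothetical, not a Judge Builder or MLflow schema.
citation_guideline = {
    "judge": "source_citation",
    "statement": "The response correctly cites at least one source document.",
    "scale": ["yes", "no"],  # binary; a 5-point Likert scale also works
    "clarifications": [
        "Ignore formatting errors; a separate formatting judge covers those.",
        "A citation to a document that does not support the claim counts as 'no'.",
    ],
}

# One SME label collected against that guideline.
label_record = {
    "judge": "source_citation",
    "example_id": "trace_0042",
    "rater": "sme_a",
    "label": "yes",
    "rationale": "Cites doc_12, which contains the quoted figure.",
}
```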
Iterative Annotation: Batch, Check, Align
Let SMEs annotate several examples at a time. We recommend doing annotations in batches because:
- Most organizations are unfamiliar with this type of labeling process and may need ramp-up time
- This gives us the opportunity to check inter-rater reliability scores and understand how aligned the SMEs are in their understanding of the task (a minimal agreement check is sketched below)
If there is low agreement, have a discussion to understand whether the judge definition needs refinement or if slight clarifications are sufficient.
Disagreement is normal and expected. In one case, three SMEs showed extreme disagreement on an initial example: ratings of 1, 5, and neutral for the same output. The post-batch discussion revealed the problem—some SMEs were penalizing formatting issues even though the judge was explicitly focused only on factual accuracy, while others were rewarding a non-answer. Once the scope was clarified, all three SMEs converged on a neutral score. Without batched annotation and reliability checks, this confusion would have contaminated the entire dataset, producing a judge that penalized the wrong things.
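As one way to run the reliability check, here is a minimal sketch, assuming scikit-learn and using made-up labels, that computes the average pairwise Cohen’s kappa across SMEs for a batch:

```python
# A minimal sketch of an inter-rater reliability check for one batch of SME
# labels, assuming scikit-learn. Fleiss' kappa or Krippendorff's alpha are
# common alternatives when there are more than two raters.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

# One list of labels per SME, aligned by example (a 10-example batch here).
labels_by_rater = {
    "sme_a": ["yes", "no", "yes", "yes", "no", "yes", "no", "yes", "yes", "no"],
    "sme_b": ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "no"],
    "sme_c": ["yes", "yes", "yes", "no", "no", "yes", "no", "no", "yes", "no"],
}

# Average pairwise Cohen's kappa across raters.
kappas = [
    cohen_kappa_score(labels_by_rater[a], labels_by_rater[b])
    for a, b in combinations(labels_by_rater, 2)
]
print(f"mean pairwise kappa: {sum(kappas) / len(kappas):.2f}")
```

A low average, or a low score for one particular pair of raters, is the cue to pause and hold the alignment discussion described above before labeling the next batch.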
Fortunately, effective judge creation and calibration doesn’t require thousands of examples. Even 20-30 well-chosen examples can yield a robust judge. The challenge isn’t volume—it’s having a systematic process to capture edge cases and decision boundaries from domain experts within your organization. A three-hour session with SMEs can provide sufficient calibration for many judges.
**From Expert Feedback to Production Judges**
Once you’ve defined your evaluation criteria and collected domain expert feedback, the next challenge is turning these insights into production-ready judges.
There are additional technical challenges here: translating SME feedback into reliable judge behavior, version control to track judge performance as requirements evolve, deployment infrastructure to run judges at scale, and orchestration systems to manage multiple judges simultaneously.
In practice, this means optimizing prompts against your labeled dataset so that judges generalize reliably, versioning judges as requirements evolve, deploying them to production, and orchestrating multiple judges across your dimensions of quality.
When it comes to prompt optimization, there are manual and automated approaches. Manual judge tuning gives you full transparency and control, but can be tedious and inefficient. Automatic optimization tools (like DSPy or MLflow’s prompt optimizers) can systematically test different prompts against your examples, and intelligently iterate to quickly reach a strong baseline. The tradeoff is control – some teams prefer manual approaches for this reason, although we generally recommend using an auto-prompt optimizer.
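For teams that want to try the automated route, here is a minimal sketch assuming DSPy (one of the optimizers mentioned above); it illustrates the general technique of optimizing a judge prompt against SME labels, not the Judge Builder or MLflow implementation.

```python
# A minimal sketch of automated judge optimization with DSPy, using SME labels
# as the supervision signal. This illustrates the general technique, not the
# Judge Builder or MLflow implementation.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model endpoint

class CitationJudge(dspy.Signature):
    """Judge whether the response correctly cites at least one source document."""
    question: str = dspy.InputField()
    response: str = dspy.InputField()
    verdict: str = dspy.OutputField(desc="'yes' or 'no'")

judge = dspy.ChainOfThought(CitationJudge)

# 20-30 SME-labeled examples (question, response, SME verdict); truncated here.
sme_labeled_examples = [
    ("What is our refund window?", "Per doc_3, refunds are accepted within 30 days.", "yes"),
    ("What is our refund window?", "Refunds are accepted within 30 days.", "no"),
    # ...
]
trainset = [
    dspy.Example(question=q, response=r, verdict=v).with_inputs("question", "response")
    for q, r, v in sme_labeled_examples
]

# Agreement with the SME label is the metric the optimizer maximizes.
def agrees_with_sme(example, prediction, trace=None):
    return example.verdict.strip().lower() == prediction.verdict.strip().lower()

optimizer = dspy.BootstrapFewShot(metric=agrees_with_sme)
calibrated_judge = optimizer.compile(judge, trainset=trainset)
```

Whichever optimizer you use, hold out a slice of the SME labels to verify that the optimized judge still agrees with your experts on unseen examples.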
Regardless of approach, the technical solution needs to make iteration fast and governance robust. Teams should be able to update judges in hours, not weeks. When deploying to production, they need confidence that judges are calibrated reliably and will behave consistently.
We’re excited to announce Judge Builder, which streamlines this process: it provides an intuitive interface to create judges from your evaluation criteria and continuously align them with human feedback.

See the MLflow product updates blog for more details.
**Takeaways**
The teams successfully moving from pilot to production recognize that judges are living artifacts. They evolve as models change, as new failure modes emerge, and as business requirements shift. Instead of building judges once during the pilot phase, these teams create lightweight processes for continuously discovering, calibrating, and deploying judges as their systems evolve. This requires three practical commitments:
Focus on high-impact judges first. Identify your most critical regulatory requirement and one observed failure mode. These become your first judges. You can expand your portfolio as new patterns emerge in production.
Create lightweight SME workflows. A few hours with domain experts to review 20-30 edge cases provides sufficient calibration for most judges. Be rigorous about aligning your SMEs with each other to denoise your data, and use automatic prompt optimizers such as MLflow's to refine prompts against the examples collected.
Treat judges as living artifacts. Schedule regular judge reviews using production data to surface new failure modes and update your portfolio as requirements shift.
The most common pitfall we observe is treating judge development as purely technical. The teams that succeed are those that build bridges between technical implementation and domain expertise, creating lightweight but systematic processes for capturing and maintaining SME knowledge.
At Databricks, our Judge Builder streamlines this workflow, enabling rapid judge development and deployment at scale. The investment is minimal—often just a few hours of SME time and basic tooling. The return is substantial: systematic evaluation you can trust and a clear path from pilot to production. Whether teams build custom agents or use our Agent Bricks solutions, the path to production requires the same thing: making quality measurable, trackable, and improvable.
Authors: Pallavi Koppol, Alkis Polyzotis, Samraj Moorjani, Wesley Pasfield, Forrest Murray, Vivian Xie, Jonathan Frankle