This post was authored by Tyler Alexander, Director of AI Reliability, and Heather Nodler, Lead CoCounsel AI Reliability Manager.
**Introduction**
At Thomson Reuters, we are redefining what it means to deliver professional-grade AI for the legal industry. More than 20,000 law firms, corporations, nonprofits, and government agencies worldwide rely on CoCounsel, our GenAI assistant, which transforms how legal professionals work by automating complex document review, contract analysis, drafting, and other time-intensive tasks with unprecedented speed and accuracy. That trust is earned through a comprehensive evaluation methodology that encompasses dataset rotation, automated testing, expert assessment, continuous monitoring, and strategic partnerships. This post focuses on one critical component of our broader testing framework: how our teams combine attorney expertise with large-scale automated testing through Scorecard, a proprietary evaluation platform originally developed by the engineers behind Waymo’s self-driving car testing infrastructure. While Scorecard represents just one pillar of our multi-layered approach, it exemplifies our commitment to proactive system optimization and continuous improvement.
**Testing and Benchmarking with Scorecard**
Our teams of attorney subject matter experts (SMEs), machine learning experts, and engineers rely on a robust array of testing tools and methodologies, including human legal expertise, specialized testing software, expert prompt engineering, and continuous monitoring of test results. Rather than waiting for performance issues to emerge, we proactively identify and address potential challenges through systematic testing and optimization. When issues arise that may affect CoCounsel’s performance, these teams are equipped to mobilize a collaborative, rapid response effort, locating and remedying performance issues before they affect our customers.
A key tool is Scorecard, a specialized application that quantitatively evaluates CoCounsel responses against ideal responses created by our attorney SMEs. Scorecard provides evaluation infrastructure for AI agents in legaltech, fintech, and compliance, enabling us to supplement our manual testing with large-scale, automated testing against our internal benchmarks. Built by the team behind Waymo’s self-driving evaluation infrastructure, it runs millions of agent simulations to help teams evaluate, optimize, and ship reliable AI agents faster.
Performance issues typically arise from two distinct factors: (1) the quality of user inputs, such as user prompts or queries, and documents; and, (2) system limitations.
We address the first factor by providing customers with high-quality training, support, and tools—including CoCounsel-created prompts, guided expert workflows, and agentic systems. In contrast, addressing the second factor requires recalibrating the system itself.
Each CoCounsel skill is a precisely engineered legal tool, tailored on the backend to perform a specific legal task. Because we calibrate each skill to reliably extract information by leveraging the unique strengths of its underlying AI model, migrating a skill from one model to another often introduces performance issues that require recalibration. Such migrations may occur, for example, when a third party releases a new AI model with enhanced capabilities. To safeguard our customers, we conduct all migration and recalibration work within testing and staging environments before deploying any changes.
**Case Study: AI Model Migration of Review Documents Skill**
Large-Scale Testing Using Realistic Scenarios & Manual and Automated Review
Jessica, an attorney SME on the CoCounsel AI Reliability Team (also known as the Trust Team), oversees the evaluation of CoCounsel’s Review Documents skill. In just minutes, the Review Documents skill can closely review and analyze large troves of legal information that would ordinarily require hours or even days of manual attorney review.
Jessica proactively monitors the upcoming migration of the Review Documents skill to a new AI model. This migration promises significant improvements in CoCounsel’s speed and accuracy. Working in a CoCounsel testing environment, Jessica manually reviews and evaluates the skill’s responses on the new model using a carefully curated “testset” of sample “testcases” that reflect real-world legal practice scenarios. Jessica checks CoCounsel’s response to each testcase user query against an ideal or “gold-standard” response that she has personally crafted using knowledge and expertise gained from years of experience as a real-world attorney.
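For illustration, each testcase in such a testset can be thought of as a user query, the documents it runs against, and the attorney-authored gold-standard response. The sketch below is a minimal, hypothetical representation of that structure; the `TestCase` and `TestSet` classes, field names, and file path are our own illustration, not CoCounsel’s or Scorecard’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One realistic scenario: a user query, the documents it runs against,
    and the attorney-authored gold-standard response."""
    case_id: str
    user_query: str
    document_paths: list[str]
    ideal_response: str  # written by an attorney SME

@dataclass
class TestSet:
    """A curated collection of testcases for one skill (e.g., Review Documents)."""
    skill_name: str
    cases: list[TestCase] = field(default_factory=list)

# Hypothetical example based on the scenario discussed later in this post;
# the case ID and file path are invented for illustration.
review_documents_testset = TestSet(
    skill_name="Review Documents",
    cases=[
        TestCase(
            case_id="meds-001",
            user_query=(
                "What medications is the patient currently taking? "
                "Please be specific with prescription names and dosages."
            ),
            document_paths=["records/patient_chart.pdf"],
            ideal_response="Aspirin 81MG EC TAB; Aspirin 325MG EC TAB",
        )
    ],
)
```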
Because each testset can contain several hundred testcases or more, reviewing each result would ordinarily be prohibitively time-consuming. However, Scorecard enables Jessica to supplement and scale the impact of her manual review by providing an extra layer of automated review.
Scorecard works by evaluating each response produced by CoCounsel and the AI model against the corresponding ideal response, then assigning the testcase a passing or failing numerical score using several criteria, such as the model’s ability to recall information, its precision, and its accuracy.
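Conceptually, that scoring step measures the model response against the gold-standard answer on several criteria and rolls the result up into a 1-to-5 score with a pass/fail cut-off. The sketch below is a simplified illustration of that idea, not Scorecard’s actual scoring code; the metric definitions and thresholds are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    recall: float     # did the response capture every point in the ideal answer?
    precision: float  # did it avoid irrelevant or outdated material?
    accuracy: float   # are the statements it makes factually correct?

def score_testcase(result: EvalResult, pass_threshold: int = 4) -> tuple[int, bool]:
    """Roll the individual criteria up into a 1-5 score and a pass/fail flag.

    Thresholds are illustrative: a 4 means the response captures all substantive
    points of the ideal answer even if style or extra detail differs; a 5 means
    it matches closely on every criterion.
    """
    worst = min(result.recall, result.precision, result.accuracy)
    if worst >= 0.95:
        score = 5
    elif worst >= 0.85:
        score = 4
    elif worst >= 0.70:
        score = 3
    elif worst >= 0.50:
        score = 2
    else:
        score = 1
    return score, score >= pass_threshold

# A response that misses current medications scores low on recall and fails.
print(score_testcase(EvalResult(recall=0.4, precision=0.45, accuracy=0.9)))  # (1, False)
```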
Reviewing the Scorecard results enables Jessica to compare the full testset’s scores on both models for the Review Documents skill. This means she can evaluate CoCounsel’s performance at scale much more efficiently.
*Fig 1: Attorney SME manual review workflow.*
*Fig 2: Scorecard automated review workflow.*
Reviewing the Scorecard data, Jessica quickly observes that on the new model, Scorecard consistently assigns failing scores to one specific testcase, rating it a 1 out of 5 on all metrics. She identifies underperformance in other testcases as well, but those still yield higher scores than the problem testcase. Recognizing that the stakes are high, Jessica immediately begins troubleshooting the performance issue.
Troubleshooting
Jessica and her team of SMEs begin to troubleshoot by homing in on the problem testcase that Scorecard identified.
The testcase user query asks:
What medications is the patient currently taking? Please be specific with prescription names and dosages.
Analyzing CoCounsel’s outputs for the testcase, Jessica determines that on the new model, the Review Documents skill fails to consistently identify all of the patient’s current medications, creating a clear discrepancy with the ideal response. The new model occasionally includes all the relevant medications, but such inconsistent behavior does not meet the required standard.
*Fig. 3: Scorecard screenshots of the AI model’s failing answer.* As can be seen in the expanded “model response” window above, the model included medications that were no longer currently active and failed to identify the only two current, active medications (Aspirin 81MG EC TAB and Aspirin 325MG EC TAB).
By digging deeper and examining the problem testcase response, as well as some of the other underperforming testcase responses, Jessica pinpoints the core issue: the AI model is not providing a sufficiently comprehensive level of detail. Since the model sometimes does output a complete response, she also notes a secondary concern: the model struggles to produce consistent results.
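For a list-extraction question like this one, the failure mode can be made concrete by computing recall and precision directly over the extracted items. The sketch below is hypothetical: the gold-standard list uses the two current medications named in Fig. 3, while the failing response uses placeholder names standing in for the discontinued medications the model returned.

```python
def recall_precision(gold: set[str], predicted: set[str]) -> tuple[float, float]:
    """Recall: fraction of gold-standard items the model found.
    Precision: fraction of the model's items that belong in the answer."""
    hits = gold & predicted
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(predicted) if predicted else 0.0
    return recall, precision

# Gold standard: the only two current, active medications (per Fig. 3).
gold = {"Aspirin 81MG EC TAB", "Aspirin 325MG EC TAB"}

# Hypothetical failing response: lists discontinued medications and misses both
# current ones (placeholder names, invented for illustration).
failing = {"Discontinued Med A 10MG TAB", "Discontinued Med B 20MG CAP"}

print(recall_precision(gold, failing))  # (0.0, 0.0) -> scored 1 out of 5
print(recall_precision(gold, gold))     # (1.0, 1.0) -> a passing response
```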
Iterative Resolution & Continuous Improvement
Having identified the core issues, Jessica brings them to the CoCounsel engineering team for resolution. She describes the parameters of an ideal response and explains how the new model’s response fails to meet the target metrics. This gives the engineers concrete goals they can use to modify the backend AI prompts. After each prompt change, Jessica evaluates a portion of the continuously updated testset, complemented by independent attorney reviews. Jessica and the engineering team execute multiple rounds of prompt changes and use Scorecard to evaluate the results until the issue has been completely resolved and the new model is performing as expected.

Scorecard now assigns the problem testcase a 4 out of 5 on all metrics, a good score: it reflects that the model has produced a valid response that captures all relevant substantive data points contained in the ideal response but may differ in more subtle ways, such as writing style or level of additional detail. Resolving this core issue also resolves the secondary issue of inconsistent performance. Jessica further conducts manual reviews of CoCounsel’s performance on the problem testcase.
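In rough pseudocode, that calibration loop amounts to: change the backend prompt, re-run the evaluation, and stop only when the problem testcase and the rest of the testset meet the target. The sketch below is a simplified illustration; `run_evaluation` and `revise_prompt` are hypothetical stand-ins for running the testset through CoCounsel and Scorecard and for the engineers’ prompt changes, not real APIs.

```python
from typing import Callable, Dict

def calibrate_skill(
    prompt: str,
    run_evaluation: Callable[[str], Dict[str, int]],      # hypothetical: returns {case_id: 1-5 score}
    revise_prompt: Callable[[str, Dict[str, int]], str],  # hypothetical: engineers' prompt revision
    target_score: int = 4,
    max_rounds: int = 10,
) -> str:
    """Iteratively revise the backend prompt until every testcase meets the target score."""
    for round_number in range(1, max_rounds + 1):
        scores = run_evaluation(prompt)
        failing = {cid: s for cid, s in scores.items() if s < target_score}
        if not failing:
            print(f"Round {round_number}: all testcases at or above {target_score}/5")
            return prompt
        print(f"Round {round_number}: {len(failing)} testcase(s) below target, revising prompt")
        prompt = revise_prompt(prompt, failing)
    raise RuntimeError("Target not reached after max rounds; escalate beyond prompt changes")
```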
These adjustments have cascading positive effects. When the problem testcase begins passing 99-100% of the time, the other testcases that had experienced the same issues (albeit less frequently) begin passing 100% of the time.
*Fig 4: Scorecard screenshots of the AI model’s passing answer.* This result was achieved after multiple rounds of testing and prompt changes, confirming that the engineers were able to pinpoint and fix the issue. As shown in the expanded “model response” window above, the model began answering this testcase correctly, as did a few other testcases that had been failing, albeit less frequently, due to the same issue.
Once the model consistently returns results that meet TR’s expectations and are suitable for legal work, Jessica feels secure in the knowledge that the Review Documents skill meets necessary standards and can be released to customers.
Even after the skill is released on the new model, Jessica continues to run Scorecard tests multiple times daily to ensure consistency.
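That ongoing monitoring is effectively a scheduled regression check: re-run the testset against the live skill and flag it for review if the pass rate dips. A minimal sketch, assuming the per-testcase pass/fail results come from a scheduled Scorecard run (the result data and testcase IDs here are invented for illustration):

```python
def regression_check(results: dict[str, bool], min_pass_rate: float = 0.99) -> bool:
    """Return True if the skill still meets the bar; flag it for review otherwise."""
    if not results:
        raise ValueError("No testcase results to check")
    pass_rate = sum(results.values()) / len(results)
    if pass_rate < min_pass_rate:
        print(f"Regression: pass rate {pass_rate:.1%} is below {min_pass_rate:.0%}, review needed")
        return False
    print(f"Pass rate {pass_rate:.1%}: skill still meets the bar")
    return True

# Hypothetical results from one scheduled run.
regression_check({"meds-001": True, "contracts-014": True, "leases-007": False})
```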
*Fig 5: Continuous improvement process between attorney SME and engineers.*
Observations
CoCounsel’s proactive, continuous, and iterative improvement process is painstaking but necessary. The problem testcase Jessica identified using Scorecard provided a useful benchmark for improvement because it failed more consistently than the other testcases. This “least common denominator” testcase became the measuring stick against which we could assess the other testcases.
Using Scorecard allowed Jessica to extrapolate improvements from the single problem testcase to all other testcases, dramatically increasing the efficiency and speed with which she could iterate and improve CoCounsel’s performance across the board.
Conclusion
Innovation in AI is never “one and done.” Models evolve, new risks emerge, and customer needs grow more complex. While this post has focused on Scorecard as one essential component of our testing infrastructure, it represents just one element of our comprehensive evaluation methodology. Our broader approach integrates dataset rotation, automated testing at scale, expert assessment from legal professionals like Jessica, continuous monitoring of live performance, and strategic partnerships with leading AI providers.
This multi-layered framework is what sets CoCounsel’s approach apart. By combining deep legal expertise with world-class technology infrastructure, we’re not only raising the standard for AI in professional fields, we’re defining it. Through proactive system optimization and evaluation approaches, CoCounsel continues to deliver the transformative professional-grade legal AI capabilities that tens of thousands of legal professionals depend on.
---
About the Authors
Tyler Alexander is the Director of AI Reliability at Thomson Reuters, where he leads a team of attorneys to ensure CoCounsel delivers trustworthy, professional-grade performance. He specializes in large-scale testing and benchmarking of AI systems for legal professionals.
Heather Nodler is a Lead CoCounsel AI Reliability Manager at Thomson Reuters. With years of experience practicing law, they now apply their expertise to evaluating, calibrating, and continuously improving CoCounsel’s legal AI skills. Heather works closely with product and engineering teams to ensure that every CoCounsel feature meets the high standards required for real-world legal practice.