In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, meet Angie Howard, Senior Manager of Software Engineering, who leads the team behind the Flash Reasoning Engine powering Agentforce Voice. This engine delivers natural, human-fast responses, targeting sub-second Time to First Audio (TTFA) across a real-time voice pipeline.
Explore how her team used AI-driven synthetic customer testing to uncover misleading 70-second latency readings, **shaved hundreds of milliseconds** from critical microservices across the real-time voice pipeline, and engineered semantic end-pointing algorithms that maintain accuracy and conversational fluidity while meeting aggressive sub-second performance targets.
What is your team’s mission building the Flash Reasoning Engine for Agentforce Voice?
We build a reasoning engine prioritizing extreme speed and consistent accuracy, recognizing that voice interactions demand both. While multi-second delays are tolerable in text interfaces, even a brief pause in voice breaks conversational flow. Our focus is on delivering responses quickly enough to feel natural while ensuring retrieved information and actions remain contextually correct. A fast but inaccurate system breeds mistrust, and an accurate but slow system becomes unusable.
To achieve this, we design for the smallest possible TTFA while maintaining correctness across retrieval, reasoning, and downstream integrations. We also optimize turn-taking signals, making the handoff between automatic speech recognition (ASR), the reasoning engine, and text-to-speech (TTS) feel human, not robotic. Our goal is to create an effortless agent interaction: fluid, unforced, and responsive.
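To make that streaming handoff concrete, here is a minimal sketch of the pattern (not Flash's actual implementation): synthesis starts on the first natural boundary in the reasoning output rather than after the full response, which is what keeps TTFA small. The `reasoning_tokens` and `synthesize_chunk` functions are hypothetical placeholders.

```python
import time
from typing import Iterator

def reasoning_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming reasoning engine."""
    for token in ["Sure,", " your", " order", " shipped", " yesterday."]:
        time.sleep(0.05)  # simulated per-token generation latency
        yield token

def synthesize_chunk(text: str) -> bytes:
    """Hypothetical stand-in for a streaming TTS call."""
    return text.encode("utf-8")

def respond(prompt: str) -> None:
    start = time.monotonic()
    first_audio_at = None
    buffer = ""
    for token in reasoning_tokens(prompt):
        buffer += token
        # Hand partial text to TTS at the first natural boundary instead of
        # waiting for the full response to finish generating.
        if buffer.endswith((",", ".", "?")):
            synthesize_chunk(buffer)
            if first_audio_at is None:
                first_audio_at = time.monotonic() - start  # Time to First Audio
            buffer = ""
    if buffer:
        synthesize_chunk(buffer)
    print(f"TTFA: {first_audio_at * 1000:.0f} ms")

respond("Where is my order?")
```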
Ultimately, our mission is straightforward: build a system fast enough to go unnoticed and smart enough to be unquestioned.
Angie shares why engineers should join Salesforce.
What cross-system latency bottlenecks did you encounter building Flash, and how did you overcome them?
The most significant latency challenges arose from the need for every component (ASR, Flash, TTS, routing layers, and downstream systems) to operate far faster than any previous internal build. Text experiences tolerate multi-second delays, but voice does not. If responses take 1.5–2 seconds, callers lose context or assume a system malfunction. We consistently uncovered hundreds of milliseconds hidden within microservices, synchronous calls, and serialization paths never optimized for a sub-second budget.
To address this, we first aligned all partner teams on the strict performance requirements of real-time voice interactions. Once everyone understood that even 300 milliseconds of extra work could break the entire experience, teams prioritized architectural changes, streamlined operations, and eliminated unnecessary computation. Our strategy was clear: reduce 100 milliseconds per service across the board, knowing cumulative improvements would meaningfully reduce total latency.
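As a simplified illustration of that budgeting exercise (the service names and numbers below are hypothetical, not Flash's actual figures), trimming roughly 100 milliseconds from each hop is what pulls the cumulative path under a sub-second target:

```python
# Hypothetical per-service latencies in milliseconds, before and after a ~100 ms trim each.
before = {
    "ASR finalization": 250,
    "routing": 180,
    "reasoning": 320,
    "retrieval": 210,
    "TTS first chunk": 190,
}
after = {service: ms - 100 for service, ms in before.items()}

print(f"total before: {sum(before.values())} ms")  # 1150 ms -- feels broken on a live call
print(f"total after:  {sum(after.values())} ms")   # 650 ms  -- within a sub-second budget
```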
This coordinated effort produced substantial gains across the entire pipeline, enabling Flash to consistently meet conversational expectations.
What challenges did you face designing semantic end-pointing for real-time voice interactions?
Semantic end-pointing presented one of our most complex challenges. The system must accurately detect when a user finishes speaking, without premature interruption or excessive delay. Silence-based detection fails in real-world usage. If someone pauses while providing a credit card number, the system must wait. However, if they say “My name is Ja’Marr,” the response should begin immediately. The difficulty lies in distinguishing meaningful pauses from true utterance completion while maintaining conversationally acceptable timing.
We built algorithms evaluating the semantic content of partial utterances to predict whether the speaker intends to continue speaking. This required calibrating confidence thresholds so the model responded quickly without prematurely interrupting the user.
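A minimal sketch of the idea, not the production algorithm: score how complete a partial transcript looks, and only end-point when that score clears a tuned threshold or the pause grows long. The heuristics and thresholds below are illustrative assumptions standing in for the semantic model.

```python
import re

def completion_score(partial: str) -> float:
    """Crude stand-in for a semantic model scoring utterance completeness."""
    text = partial.strip().lower()
    score = 0.5
    # Trailing connectives or an in-progress number suggest the caller will continue.
    if re.search(r"\b(and|but|because|my number is)\s*$", text):
        score -= 0.4
    if re.search(r"\d[\d\s-]*$", text):  # mid-way through digits, e.g. a card number
        score -= 0.3
    # A self-contained statement like "my name is X" suggests completion.
    if re.search(r"\bmy name is [\w']+$", text):
        score += 0.4
    return max(0.0, min(1.0, score))

def should_endpoint(partial: str, silence_ms: int, threshold: float = 0.7) -> bool:
    # Respond quickly on semantically complete turns; otherwise wait out the pause.
    return completion_score(partial) >= threshold or silence_ms > 1500

print(should_endpoint("My name is Ja'Marr", silence_ms=300))         # True  -> respond now
print(should_endpoint("My card number is 4242 42", silence_ms=600))  # False -> keep listening
```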
The end result is an end-pointing system that intervenes only when appropriate, enabling smooth, human-like conversational pacing without sacrificing accuracy.

Agentforce Voice powers seamless speech-to-text conversational AI experiences.
What challenges did you face building end-to-end QA and performance pipelines for voice?
Developing voice QA necessitated entirely new pipelines, as existing text-based frameworks within Salesforce proved unsuitable. Early performance tests revealed significantly inflated latency values, reaching up to 70 seconds per turn. This required a thorough investigation to identify the root cause, spanning Flash, the test harness, and the audio subsystem. We ultimately discovered that the system was measuring the completion of the entire spoken output, rather than the TTFA. This misrepresentation made the product appear far slower than its actual performance, highlighting the need to re-evaluate foundational assumptions even in measurement definitions.
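The distinction is easy to illustrate with a hypothetical streaming client: the two timestamps below can differ by orders of magnitude on long spoken responses, which is exactly what produced the misleading readings.

```python
import time
from typing import Iterable, Tuple

def measure(audio_chunks: Iterable[bytes]) -> Tuple[float, float]:
    """Return (time to first audio, time to full playout) in seconds."""
    start = time.monotonic()
    ttfa = None
    for chunk in audio_chunks:
        if ttfa is None:
            ttfa = time.monotonic() - start  # first audible byte: what callers perceive
        _ = chunk                            # consume the rest of the spoken response
    total = time.monotonic() - start         # completion of the entire utterance
    return ttfa, total

def fake_stream():
    # Hypothetical stream: the first chunk arrives fast, the rest takes a while to speak.
    yield b"chunk-0"
    for _ in range(3):
        time.sleep(0.3)
        yield b"chunk-n"

ttfa, total = measure(fake_stream())
print(f"TTFA {ttfa * 1000:.0f} ms vs full playout {total * 1000:.0f} ms")
```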
To address this, we collaborated closely with the performance team. We developed synthetic customers capable of generating diverse utterances through TTS. By feeding consistent, controlled audio inputs into Flash, we created large and repeatable data sets across ASR, Flash, and TTS boundaries. These pipelines effectively eliminated noise from unpredictable inputs, delivering clean and trustworthy metrics.
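A rough sketch of what such a synthetic-customer loop can look like; the `tts` and `run_turn` functions and the utterance list are hypothetical placeholders, not the team's actual harness.

```python
import statistics

def tts(text: str) -> bytes:
    """Hypothetical TTS call producing a controlled, repeatable audio input."""
    return text.encode("utf-8")

def run_turn(audio: bytes) -> dict:
    """Hypothetical call into the voice pipeline, returning per-stage timings in ms."""
    return {"asr_ms": 120, "flash_ms": 310, "tts_first_chunk_ms": 90}

utterances = [
    "Where is my order?",
    "I need to update my billing address.",
    "Cancel my subscription, please.",
]

# Feed the same controlled audio every run so regressions show up as metric shifts,
# not as noise from unpredictable live input.
ttfa_samples = []
for text in utterances:
    timings = run_turn(tts(text))
    ttfa_samples.append(sum(timings.values()))

print(f"median synthetic TTFA: {statistics.median(ttfa_samples)} ms")
```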
The result is a robust testing framework that accurately reflects real-time behavior and now serves as the foundation for validating latency, correctness, and interaction quality across new features.
Angie explains what keeps her at Salesforce.
What difficulties arose defining production metrics for a real-time voice agent, and how did you apply them?
As Agentforce Voice represented a new frontier, historical benchmarks were absent. We lacked standard definitions for availability, established escalation baselines, and a consistent granularity for measuring audio usage. Even fundamental decisions, such as reporting audio in minutes versus seconds, demanded careful consideration. Reporting “hundreds of thousands of minutes of audio” offered little actionable insight, while reporting in seconds quickly became unreadable. A crucial distinction also involved differentiating intentional human handoffs from genuine agent failures to prevent misclassification.
We addressed these challenges by defining metrics directly correlated with conversational outcomes. TTFA emerged as our primary metric due to its significant influence on perceived responsiveness. End-to-end latency was tracked independently to capture system-level delays. Escalations were categorized to reflect whether they were intentional transfers or failure-driven events. Furthermore, availability was defined by end-to-end pipeline health, rather than relying on isolated service metrics.
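Sketched below is one way those definitions could be encoded so dashboards and alerts share a single vocabulary; the field names and values are illustrative assumptions rather than the team's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Escalation(Enum):
    NONE = "none"
    INTENTIONAL_HANDOFF = "intentional_handoff"  # planned transfer to a human
    FAILURE_DRIVEN = "failure_driven"            # the agent could not complete the task

@dataclass
class TurnMetrics:
    ttfa_ms: int            # primary responsiveness metric
    end_to_end_ms: int      # captures system-level delays beyond first audio
    audio_seconds: float    # stored in seconds, reported in minutes for readability
    escalation: Escalation
    pipeline_healthy: bool  # availability defined end to end, not per service

turn = TurnMetrics(ttfa_ms=640, end_to_end_ms=2100, audio_seconds=41.5,
                   escalation=Escalation.INTENTIONAL_HANDOFF, pipeline_healthy=True)
print(f"report {turn.audio_seconds / 60:.1f} min of audio, TTFA {turn.ttfa_ms} ms")
```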
These precisely defined measurements now guide engineering decisions. They illuminate points of friction, identify where latency accumulates, and pinpoint areas where targeted optimizations will yield the greatest impact.
Angie highlights what makes Salesforce Engineering’s culture unique.
What engineering challenges did you face preparing Flash for multilingual support at global scale?
Preparing Flash for multilingual readiness presented both technical and organizational hurdles. While our models were fine-tuned in English, expanding to languages such as Japanese necessitated validating translation accuracy, TTS naturalness, and conversational tone. The core team lacked the firsthand expertise to evaluate these critical elements. Early failures proved difficult to diagnose. For instance, English TTS unintentionally processed Japanese text, generating distorted outputs that masked the actual root cause of the issue. We recognized the need for native speakers to assess phrasing, timing, and tonal expectations.
To address this, we partnered with colleagues in Japan as well as contractors and labelers. This team reviewed synthetic audio across thousands of utterances. Using non-customer synthetic data, we developed evaluation workflows to assess TTS quality, semantic end-pointing behavior, and translation correctness. This established consistent, scalable evaluation loops, even when the required language expertise was not present within the core engineering team.
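One failure mode described above, English TTS receiving Japanese text, can be caught with a simple script check before synthesis. This is a minimal sketch with a hypothetical voice registry, not the team's evaluation pipeline:

```python
import unicodedata

# Hypothetical mapping of language tags to TTS voices.
VOICES = {"en": "english-voice-1", "ja": "japanese-voice-1"}

def looks_japanese(text: str) -> bool:
    """Rough script check: any Hiragana, Katakana, or CJK character."""
    return any("CJK" in unicodedata.name(ch, "") or
               "HIRAGANA" in unicodedata.name(ch, "") or
               "KATAKANA" in unicodedata.name(ch, "")
               for ch in text)

def pick_voice(text: str, requested_lang: str) -> str:
    detected = "ja" if looks_japanese(text) else "en"
    if detected != requested_lang:
        # Flag the mismatch instead of producing distorted audio that hides the root cause.
        raise ValueError(f"text looks like '{detected}' but voice '{requested_lang}' was requested")
    return VOICES[requested_lang]

print(pick_voice("Where is my order?", "en"))
print(pick_voice("ご注文はどこですか", "ja"))
```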
With these systems now in place, Flash is positioned to deliver natural, culturally aligned voice experiences as we expand into new markets.
Learn more
- Stay connected — join our Talent Community!
- Check out our Technology and Product teams to learn how you can get involved.