Large language models are powerful, but they can be slow, especially when generating thousands of tokens per request. For real-time applications like Hiring Assistant, LinkedIn’s first AI agent for recruiters, latency is critical for both performance and user experience. Recruiters expect conversational responses in seconds, not minutes. That’s challenging when the agent is processing a large amount of information, such as long job descriptions and candidate profiles.
In this blog, we will share one of the techniques that we’ve applied to address latency challenges and improve the responsiveness of the Hiring Assistant experience for recruiters - speculative decoding. It’s a technique that accelerates text generation without sacrificing quality, making it a key part of how we serve large language models at scale.
What is speculative decoding?
Large language model (LLM) inference consists of two stages: prefill and generation. The prefill stage encodes the input sequence in parallel; it is compute-heavy but achieves high token throughput. The generation stage is memory-bound and slow because it produces tokens one at a time, each requiring a full forward pass.
Speculative decoding changes this dynamic in the generation stage by drafting multiple tokens ahead and verifying them in parallel. Verification is significantly cheaper than generation, so we save time proportional to the number of tokens accepted. If the guesses are wrong, the system falls back gracefully: it discards the incorrect draft tokens and resumes generation from the last verified position, ensuring the final output remains consistent with what the base model would have produced.
This approach guarantees that the final output is identical to what the base model would have produced. It works because the verification step uses the target model’s probabilities to accept or reject proposed tokens, preserving the original distribution. In other words, speculative decoding does not compromise quality. The speedup it delivers depends on the drafting cost and the acceptance rate: the higher the acceptance rate and the lower the drafting overhead, the greater the throughput gain.
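To make the draft-and-verify loop concrete, here is a minimal sketch of a single speculative step under greedy decoding. The propose_draft and target_model functions are hypothetical stand-ins rather than a real API; the key idea is that one batched forward pass scores every drafted position, and we keep tokens up to the first mismatch.

```python
# Minimal sketch of one speculative decoding step under greedy decoding.
# `propose_draft` and `target_model` are hypothetical stand-ins, not a real API.

def speculative_step(tokens, propose_draft, target_model, k=5):
    """Draft up to k tokens, verify them in one pass, and keep every verified token."""
    draft = propose_draft(tokens, k)            # cheap guesses, e.g. from an n-gram match
    verified = target_model(tokens + draft)     # verified[i] = greedy token after tokens + draft[:i]
    accepted = []
    for i, guess in enumerate(draft):
        if guess == verified[i]:
            accepted.append(guess)              # draft matches the target model: keep it
        else:
            accepted.append(verified[i])        # mismatch: take the target's token and stop
            break
    else:
        accepted.append(verified[len(draft)])   # every draft accepted: we also gain one bonus token
    return tokens + accepted
```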
In practice, there are two main approaches to speculative decoding:
- N‑gram speculative decoding: First introduced as prompt lookup decoding, this is a model-agnostic, purely statistical approach that uses patterns from the existing input to predict the next few tokens (see the sketch after this list). Its drafting cost is low, and it works best when outputs contain rephrasings or structured text.
- Draft-model speculation: A smaller “draft” model proposes tokens, and the main model verifies them. This can accelerate less repetitive text but adds complexity because you’re serving two models.
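Here is the sketch referenced above: a minimal, illustrative version of prompt lookup drafting that searches the existing token sequence for the most recent occurrence of the current suffix n-gram and proposes the tokens that followed it. The function name, scanning order, and defaults are ours for illustration, not vLLM’s internal implementation.

```python
def propose_ngram_draft(tokens, max_ngram=4, min_ngram=1, k=5):
    """Propose up to k draft tokens by prompt lookup: find an earlier occurrence
    of the current suffix n-gram and copy the tokens that followed it."""
    for n in range(max_ngram, min_ngram - 1, -1):        # prefer longer, more reliable matches
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # scan earlier positions (excluding the suffix itself), most recent first
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + k]
                if continuation:
                    return continuation              # draft = what followed the match last time
    return []                                        # no match: fall back to regular decoding


# Toy example with characters as "tokens": the draft reuses text already seen in the prompt.
tokens = list("the skill is python and the skill is ")
print("".join(propose_ngram_draft(tokens, max_ngram=4, k=6)))  # prints "python"
```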
For our use case—long, structured outputs with recurring I/O schema—n‑gram speculation was the perfect fit.
How we applied it to Hiring Assistant
Hiring Assistant is an AI agent that helps hirers with their most time-consuming tasks, giving them more time to focus on the most impactful, people-centric parts of their jobs. In production, Hiring Assistant routinely ingests thousands of input tokens (such as job descriptions and candidate profiles) and generates 1,000+ tokens of structured analysis. Given a job and a profile, Hiring Assistant classifies the strength of the match (e.g., skills, seniority, domain experience) and explains its reasoning with grounded evidence. To deliver an experience that scales to a growing number of customers while remaining robust and fast for recruiters, we focused on optimizing end-to-end latency, particularly Time Per Output Token (TPOT), and boosting realized throughput (QPS), all while preserving response quality.
For Hiring Assistant, n‑gram speculation (without an auxiliary “draft” model) is an ideal fit because the agent’s workload exhibits several distinctive characteristics. Hiring Assistant produces structured outputs in the form of rubric‑style summaries that include ratings, evidence, and rationale, which creates stable phrasing patterns that n‑gram matchers can leverage effectively. The generated text also has high lexical overlap with the prompt, frequently quoting relevant components of the job and candidate profile verbatim, such as skill names, titles, tools, certifications, and locations, which leads to strong acceptance rates for speculative n‑grams.
Consistency and transparency are non‑negotiable for recruiters who need explanations they can trust, and lossless verification guarantees that the final text is identical to the base model’s output, aligning with Hiring Assistant’s standards for traceability and policy compliance. Generally, long prompts with recurring schema, such as multi-turn conversations, pre‑screening Q&A, and ATS-connected workflows, benefit from longer n‑gram lookups.
We enabled n‑gram speculative decoding in our vLLM serving stack with the following configuration parameters (a simplified example follows the list):
- num_speculative_tokens parameter controls how many tokens the system attempts to draft in one go before verification. Increasing this value can lead to significant speed-ups, as more tokens are accepted in a single step. However, the trade-off lies in the risk of mismatches. If any token in the proposed batch is incorrect, the system must discard all tokens after the mismatch, which results in wasted compute for those mismatched predictions.
- prompt_lookup_max sets the maximum length of pattern matches (n-grams) the system searches for in the prompt history. A higher value allows the system to capture long repetitive sequences, such as structured templates or boilerplate text, which can lead to substantial gains when matches occur. Because these lookups are lightweight, setting this parameter generously introduces minimal overhead. For Hiring Assistant, which often reuses structured phrasing, this setting occasionally produced speed-ups by accelerating large chunks of text generation.
- prompt_lookup_min defines the minimum match length required to trigger speculation. A higher value makes the system more conservative, only speculating on strong matches and achieving a higher acceptance rate with fewer wasted attempts.
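As a reference point, here is a simplified sketch of how these parameters can be passed to vLLM’s offline LLM API. The model name and values below are placeholders rather than our production settings, and the exact configuration surface may differ across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Illustrative only: placeholder model and parameter values, not production settings.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model name
    speculative_config={
        "method": "ngram",               # draft via prompt lookup, no auxiliary draft model
        "num_speculative_tokens": 5,     # tokens drafted per step before verification
        "prompt_lookup_max": 4,          # longest n-gram to match against the prompt/history
        "prompt_lookup_min": 2,          # shortest n-gram allowed to trigger speculation
    },
)

outputs = llm.generate(
    ["Evaluate this candidate profile against the job description ..."],
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```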
N‑gram speculation offers operational simplicity at scale by delivering performance gains without introducing a second model, avoiding additional latency tail risks, infrastructure costs, and orchestration complexity, all of which is critical for global deployment. It enabled Hiring Assistant to serve its complex workload globally at very low cost and with minimal tuning effort.
The results
We observed nearly **4× higher throughput** at the same QPS and SLA ceiling, along with an average 66% reduction in P90 end-to-end latency, all without any quality degradation, as verified by our internal evaluation pipelines.
In practical terms, this means Hiring Assistant can handle more concurrent recruiter conversations while staying within strict latency budgets.
We noticed that verification is cheap because scoring multiple tokens in parallel costs about the same as scoring one. The approach is also lossless by design: thanks to the acceptance/rejection sampling mechanism, inspired by the Metropolis-Hastings algorithm, the final output distribution matches what the base model would have produced.
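For readers curious about the underlying rule: in the sampling-based formulation, a token x drafted with probability q(x) is accepted with probability min(1, p(x)/q(x)) under the target model’s distribution p, and on rejection a replacement is sampled from the residual distribution proportional to max(0, p - q), which is what keeps the output distribution exactly p. The snippet below sketches that rule for a single position; it is a textbook illustration rather than vLLM’s implementation, and with greedy decoding or deterministic n‑gram drafts it reduces to an exact token match.

```python
import random

def verify_one_position(draft_token, p, q):
    """Lossless acceptance rule for a single drafted position (illustrative sketch, not vLLM's code).
    p and q map token -> probability under the target and draft distributions."""
    accept_prob = min(1.0, p.get(draft_token, 0.0) / max(q.get(draft_token, 0.0), 1e-12))
    if random.random() < accept_prob:
        return draft_token            # accepted: the emitted token still follows p overall
    # rejected: resample from the residual distribution proportional to max(0, p - q),
    # which compensates for the probability mass the draft over-proposed
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    tokens, weights = zip(*residual.items())
    return random.choices(tokens, weights=weights, k=1)[0]
```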
When to use n‑gram speculation
N‑gram speculative decoding shines in workloads where outputs naturally repeat phrases or follow structured patterns. Summarization, document question answering, code editing, and multi-turn conversations are prime examples—these tasks often reuse context, making them ideal for this technique. It’s particularly effective when you need faster generation without the added complexity of running a separate draft model. Workloads with long prompts and predictable structures benefit the most because the likelihood of finding matching sequences—and therefore achieving high acceptance rates—is significantly higher in such scenarios.
On the other hand, if your text is highly variable, creative, or less structured, n‑gram speculation may deliver smaller gains. In those cases, a draft-model approach might be more suitable.
Final thoughts
Speculative decoding is a low-risk, high-reward optimization for many LLM workloads. For structured outputs like those in Hiring Assistant, n‑gram speculation offers an elegant and highly effective solution. When combined with other techniques, such as careful model choice, a robust fine-tuning pipeline, an agentic architecture, continuous batching, and prefix caching, we can deliver fast, scalable, and cost-efficient GenAI experiences without compromising quality.
Acknowledgements
We’d like to thank our leaders - Vivek Hariharan, Dave Burgess, Raghu Hiremagalur - and other cross-functional partners for their support and collaboration. We are also grateful to the vLLM community for their contributions to the optimizations that speed up LLM inference with speculative decoding.