How We Transformed Klaviyo’s Event Processing System, Which Handles Billions of Daily Events
6 min read · Sep 22, 2025
Coauthored by Rohit Pathak, Tanya Fesenko, Collin Crowell, and Dmitry Mamyrin.
Moving from architectural principles to production reality means solving distributed systems problems that don’t have textbook solutions. When you’re processing 170,000 events per second with strict SLO requirements ranging from 5 seconds to 20 minutes, every implementation decision becomes critical.
In our previous post, we covered the strategic decisions behind migrating Klaviyo’s event processing infrastructure from RabbitMQ to Kafka. This time, we’re diving into some of the engineering details — how do you design a system with events that have different SLO requirements? How do you design consumer architectures that enable parallel processing within partitions? How do you efficiently distribute the events to the processors? How do you create observability systems that monitor the billions of events in your system?
This post shares the hard-won lessons from creating innovative solutions at scale and the techniques that emerged through iterative engineering.
Implementation Challenges & Solutions
Moving from design principles to production reality meant solving several complex distributed systems problems that don’t have textbook solutions.
Challenge: Non-Blocking Parallelization
Traditional Kafka consumers process messages sequentially within partitions, but when you’re handling database lock contention and transient failures at scale, delays create head-of-line blocking that can result in SLO violations.
Solution: Separation of Kafka Consumers and Event Processors for Out-of-Order Processing
We architected a clear separation between Kafka consumers and event processors by adding a consumer proxy layer. The consumer proxy layer maintains traditional offset management and partition assignment, but it does not process events directly; instead, it delegates that to the event processors, which handle per-record acknowledgment and rejection (ACK/NACK).
This decoupling enables parallel processing within partitions while supporting non-blocking retries. The processors can acknowledge individual messages rather than requiring sequential offset commits, dramatically improving throughput during partial failure or delay scenarios. The consumer proxy layer commits only the offsets it safely can: everything up to the first record that has not yet been acknowledged.
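To make the safe-commit rule concrete, here is a minimal sketch of the offset bookkeeping in Python. It assumes per-partition ACKs flowing back from processors; the class and method names are illustrative rather than Klaviyo’s actual implementation, and NACKed records are assumed to be retried and ACKed only once they eventually succeed.

```python
from collections import defaultdict

class OffsetTracker:
    """Tracks per-record ACKs so the proxy commits only contiguous offsets."""

    def __init__(self):
        # In practice the starting point would be seeded from the committed
        # offset at partition assignment; -1 keeps the sketch simple.
        self._acked = defaultdict(set)              # partition -> ACKed offsets
        self._committed = defaultdict(lambda: -1)   # partition -> last committed offset

    def ack(self, partition: int, offset: int) -> None:
        """Record that a processor finished this offset, in any order."""
        self._acked[partition].add(offset)

    def committable(self, partition: int):
        """Return the highest contiguous ACKed offset, or None if nothing new.

        Records ACKed out of order stay buffered until the gap below them
        closes, so an in-flight or retrying record is never skipped past.
        """
        next_offset = self._committed[partition] + 1
        acked = self._acked[partition]
        while next_offset in acked:
            acked.discard(next_offset)
            next_offset += 1
        if next_offset - 1 > self._committed[partition]:
            self._committed[partition] = next_offset - 1
            return self._committed[partition]
        return None
```

For example, if the proxy has committed through offset 9 and offsets 10 and 12 are ACKed while 11 is still in flight, `committable()` stops at 10; once 11 lands it advances to 12, and the proxy can commit 13 as the next offset to consume.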
We introduced a layer between Kafka and event processing
Challenge: Multi-Priority Processing Efficiency
Previously, we ran separate processing infrastructure for each priority level (P10, P20, P30), leading to resource inefficiencies (e.g., idle processor capacity) and operational complexity. With our scale projections, this approach wouldn’t be sustainable.
Solution: Unified Priority-Aware Processing
The new architecture uses the same processor fleet to handle all priority levels, with intelligent routing and priority-aware scheduling. This optimizes resource utilization while maintaining strict SLO compliance — high-priority events can preempt lower-priority processing, but the system ensures fairness to prevent starvation of background workloads.
Each priority class lives on its own Kafka topic (P10, P20, P30). The consumer proxy polls them in parallel and coalesces records into a single in-memory priority queue keyed by per-event SLO deadlines — an earliest-deadline-first (EDF) scheduler.
Now, the processing tier doesn’t think about priority at all — it focuses on work, while the proxy layer handles prioritization upstream.
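As a rough illustration, the in-memory EDF queue can be as simple as a heap keyed by deadline. The topic names and per-lane SLO budgets below are assumptions for the sketch (the post only states that SLOs range from 5 seconds to 20 minutes):

```python
import heapq

# Hypothetical SLO budgets per priority lane, in seconds (illustrative values).
SLO_SECONDS = {"events.p10": 5, "events.p20": 300, "events.p30": 1200}

class EDFQueue:
    """In-memory earliest-deadline-first queue fed from multiple Kafka topics."""

    def __init__(self):
        self._heap = []   # entries are (deadline, sequence, record)
        self._seq = 0     # tie-breaker so records themselves are never compared

    def push(self, topic: str, record, event_ts: float) -> None:
        # Deadline = when the event entered the pipeline plus its lane's SLO budget.
        deadline = event_ts + SLO_SECONDS[topic]
        heapq.heappush(self._heap, (deadline, self._seq, record))
        self._seq += 1

    def pop_next(self):
        """Return the record with the earliest SLO deadline, or None if empty."""
        if not self._heap:
            return None
        _, _, record = heapq.heappop(self._heap)
        return record

    def __len__(self) -> int:
        return len(self._heap)
```

Because every lane is reduced to a deadline, a P30 event that has waited long enough naturally rises above fresh P10 traffic, which is what keeps background workloads from starving.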
We created an earliest-deadline-first scheduler
Challenge: Efficient Distribution to Processing Tier
The system needs to distribute events from the proxy to the processing tier in such a way that slow handlers don’t create head-of-line blocking, while keeping the proxy independent of downstream backpressure mechanics and processor capacity.
Solution: A Dual-Loop, Pull-Based Design
The proxy runs two concurrent cycles:
- Intake loop: Poll Kafka, populate an in-memory priority queue ordered by per-event SLO (EDF), and commit safe offsets.
- Serve loop: Expose a non-blocking “give me next” endpoint; processors pull when ready, run the processing, and send ACKs/NACKs.
Because distribution is pull-based, slow processors self-throttle; others continue to drain the queue. The proxy doesn’t need to push, track per-worker buffers, or tune backpressure heuristics — the contract is simple and decoupled.
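The processor side of that contract stays thin. Below is a hypothetical worker loop; the proxy address, endpoint paths, and payload shapes are assumptions, since the post does not specify the transport:

```python
import time
import requests

PROXY_URL = "http://event-proxy.internal:8080"   # hypothetical address

def handle(event: dict) -> None:
    """Application-specific event processing (stubbed for the sketch)."""
    ...

def processor_loop() -> None:
    # Pull-based backpressure: a slow processor simply asks for work less
    # often, so the proxy needs no per-worker buffers or push throttling.
    while True:
        resp = requests.get(f"{PROXY_URL}/next", params={"max": 10}, timeout=5)
        batch = resp.json().get("events", [])
        if not batch:
            time.sleep(0.1)   # nothing ready yet; back off briefly
            continue
        for event in batch:
            try:
                handle(event)
                requests.post(f"{PROXY_URL}/ack", json={"id": event["id"]}, timeout=5)
            except Exception:
                requests.post(f"{PROXY_URL}/nack", json={"id": event["id"]}, timeout=5)
```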
To stay memory-safe, the proxy keeps only a configurable window of events in memory. Hitting the high watermark automatically pauses Kafka intake; consumption resumes once we drop below the low watermark.
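On the intake side, the watermark behavior maps naturally onto Kafka’s pause and resume APIs. Here is a minimal sketch with confluent-kafka-python, feeding an EDF queue like the one above; the thresholds, topic names, and broker address are placeholders, and offset commits are omitted for brevity:

```python
from confluent_kafka import Consumer

HIGH_WATERMARK = 50_000   # pause intake above this many buffered events (illustrative)
LOW_WATERMARK = 10_000    # resume once the queue drains below this (illustrative)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # placeholder
    "group.id": "event-proxy",
    "enable.auto.commit": False,         # the proxy commits offsets itself
})
consumer.subscribe(["events.p10", "events.p20", "events.p30"])

def intake_loop(queue) -> None:
    """Poll Kafka into the in-memory EDF queue, pausing when memory is tight."""
    paused = False
    while True:
        if not paused and len(queue) >= HIGH_WATERMARK:
            consumer.pause(consumer.assignment())    # stop fetching, keep partitions
            paused = True
        elif paused and len(queue) <= LOW_WATERMARK:
            consumer.resume(consumer.assignment())
            paused = False

        msg = consumer.poll(timeout=0.1)
        if msg is None or msg.error():
            continue
        _, event_ts_ms = msg.timestamp()             # (timestamp type, ms since epoch)
        queue.push(msg.topic(), msg.value(), event_ts_ms / 1000.0)
```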
Pull-based distribution of events
We surveyed existing Kafka consumer proxies, but none paired a pull-based design with native priority scheduling. We built a proxy that does both.
These solutions required careful coordination between multiple teams and extensive testing under production-like loads, but the result is infrastructure that scales efficiently while maintaining the reliability guarantees our business depends on.
Operational Excellence
Building a system that processes hundreds of thousands of events per second is one challenge; operating it reliably in production is another entirely. From day one, we designed the system with operational excellence as a core requirement.
What We Measure
Intake loop (Kafka → proxy)
- Message staleness: time from a record’s timestamp to “now.”
- Poll latency & throughput: how long polls take; bytes/records per poll.
- Deserialize time: per-record decode duration.
- Pause windows: how often/long partitions are paused and why.
Serve loop (proxy → processors)
- Request latency: processor → proxy round-trip.
- Dequeue time: time to pop from the in-memory EDF queue.
- Batch shape: batch sizes and assembly time (if batching).
Priority scheduler (in-memory EDF)
- Queue depth by lane: P10/P20/P30 items currently held in the in-memory state.
- Age/EDF skew: how far items are from their deadlines.
- Memory consumption: total size of our state.
Outcomes (SLOs & delivery)
- On-time rate: % within SLO; distribution of “how early.”
- Misses: % over SLO; “by how much” and by lane/tenant.
- Result types: ACK/NACK percentages.
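A sketch of how a few of these signals might be emitted with the Prometheus Python client follows; the metric names, labels, and lane/outcome taxonomy are illustrative rather than Klaviyo’s actual schema:

```python
import time
from prometheus_client import Counter, Gauge, Histogram

MESSAGE_STALENESS = Histogram(
    "proxy_message_staleness_seconds",
    "Time from a record's event timestamp to when the proxy polled it",
    ["topic"],
)
QUEUE_DEPTH = Gauge(
    "proxy_edf_queue_depth",
    "Events currently buffered in the in-memory EDF queue",
    ["lane"],
)
RESULTS = Counter(
    "proxy_processing_results_total",
    "ACK/NACK outcomes reported by processors",
    ["result"],
)
SLO_OUTCOMES = Counter(
    "proxy_slo_outcomes_total",
    "Events that finished within or past their SLO deadline",
    ["lane", "outcome"],   # outcome: on_time | missed
)

def record_intake(topic: str, event_ts: float) -> None:
    MESSAGE_STALENESS.labels(topic=topic).observe(time.time() - event_ts)

def record_queue_depth(lane: str, depth: int) -> None:
    QUEUE_DEPTH.labels(lane=lane).set(depth)

def record_completion(lane: str, result: str, deadline: float) -> None:
    RESULTS.labels(result=result).inc()
    outcome = "on_time" if time.time() <= deadline else "missed"
    SLO_OUTCOMES.labels(lane=lane, outcome=outcome).inc()
```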
The best distributed systems remove mystery. With detailed, per-loop telemetry, the proxy tells us what it’s doing and why — under load, during incidents, and on the quiet days in between.
Organizing Observability
We built our monitoring around a hierarchical dashboard approach, starting with a high-level “Mission Control” view that provides immediate system health visibility. This top-level dashboard focuses on business-critical metrics — overall SLO compliance across priority levels, throughput trends, and system-wide error rates.
From this overview, operators can drill down into component-specific dashboards that reveal detailed operational metrics: consumer lag by partition, retry pattern analysis, memory utilization, and processing latency breakdowns. Each dashboard provides contextual links to even more granular views, allowing rapid problem isolation without overwhelming operators with unnecessary detail during normal operations.
The key insight was designing dashboards that mirror our incident response process — start with “is the system healthy?” and progressively narrow to “which component needs attention?” This approach dramatically reduces mean time to resolution during outages.
Conclusion
Building resilient event processing at scale requires more than just choosing the right technologies — it demands careful attention to the implementation details that emerge when distributed systems meet real-world constraints.
The key engineering lessons we learned:
Separate concerns aggressively. The separation between Kafka consumers and event processors felt over-engineered initially, but it’s what enables the parallel processing necessary to achieve scale.
Observability drives reliability. The detailed subcomponent monitoring and hierarchical dashboard approach doesn’t just help during incidents — it guides how we think about system health and provides early warning signals for capacity planning.
Innovate where it matters. A custom Kafka proxy with built-in prioritization and a pull-based design cut infrastructure costs by 30% and improved SLO adherence by 20%. Decoupling the durable buffer (Kafka), the prioritization layer, and the processing tier made the system both more reliable and scalable.
For teams building similar systems, remember that the most elegant solutions often emerge through iterative engineering rather than upfront design. Build for the problems you understand today, but create extension points for the challenges you’ll discover tomorrow.
The patterns we’ve shared here — separated consumer architectures, prioritization, and hierarchical monitoring — aren’t specific to just Klaviyo. They’re responses to fundamental distributed systems challenges that any team operating at scale will encounter. We hope sharing our implementation experience helps other engineers navigate similar challenges more efficiently.