“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replication costs when running Kafka across multiple cloud availability zones.” — Luke Chen, The Path Forward for Saving Cross-AZ Replication Costs KIPs
At the time of writing the Kafka project finds itself at a fork in the road where choosing the right path forward for implementing S3 topics has implications for the long-term success of the project. Not just the next couple of years, but the next decade. Open-source projects live and die by these big decisions and as a community, we need to make sure we take the right one.
This post explains the competing KIPs, but goes further and asks bigger questions about the future direction of Kafka.
Deciding What We Want Future-Kafka to Be
Before comparing proposals, we should step back and ask what kind of system we want Kafka to become.
Kafka now faces two almost opposing forces. One force is stabilizing: the on-prem deployments and the low latency workloads that depend on local disks and replication. Kafka must continue to serve those use cases. The other force is disrupting: the elastic, cloud-native workloads that favor stateless compute and shared object storage.
Relaxed-latency workloads such as analytics have seen a shift in system design with durability increasingly delegated to shared object storage, freeing the compute layer to be stateless, elastic, and disposable. Many systems now scale by adding stateless workers rather than rebalancing stateful nodes. In a stateless compute design, the bottleneck shifts from data replication to metadata coordination. Once durability moves to shared storage, sequencing and metadata consistency become the new limits of scalability.
That brings us to the current moment, with three competing KIPs defining how to integrate object storage directly into Kafka. While we evaluate these KIPs, it’s important to consider the motivations for building direct-to-S3 topics. Cross-AZ charges are typically what’s on people’s minds, but it’s a mistake to think of S3 simply as a cheaper disk or a networking cheat. The shift is also architectural, giving us an opportunity to achieve operational benefits such as elastic, stateless compute.
The devil is in the details: how each KIP enables Kafka to leverage object storage while also retaining Kafka’s soul and what made it successful in the first place.
Many KIPs, but only two paths
With that in mind, while three KIPs have been submitted, it comes down to two different paths:
Revolutionary: Choose a direct-to-S3 topic design that maximizes the benefits of an object-storage architecture, with greater elasticity and lower operational complexity. However, in doing so, we may increase the implementation cost and possibly the long-term code maintenance too by maintaining two very different topic-models in the same project (leader-based replication and direct-to-S3).
Evolutionary: Shoot for an evolutionary design that makes use of existing components to reduce the need for large refactoring or duplication of logic. However, by coupling to the existing architecture, we forfeit the extra benefits of object storage, focusing primarily on networking cost savings (in AWS and GCP). Through this coupling, we also run the risk of achieving the opposite: harder to maintain code by bending and contorting a second workload into an architecture optimized for something else.
In this post I will explain the two paths in this forked road, how the various KIPs map onto those paths, and invite the whole community to think through what they want for Apache Kafka for the next decade.
*Note that I do not include KIP-1183 as it looks dead in the water and is not a serious contender. The KIP proposes AutoMQ’s storage abstractions without the accompanying implementation, which, perhaps cynically, seems to benefit AutoMQ were it ever adopted, leaving the community to rewrite the entire storage subsystem again. If you want a quick summary of the three KIPs (including KIP-1183), you can read Luke Chen’s The Path Forward for Saving Cross-AZ Replication Costs KIPs or Anton Borisov’s summary of the three KIPs.*
This post is structured as follows:
1. The term “Diskless” vs “Direct-to-S3”
2. The Common Parts. Some approaches are shared across multiple implementations and proposals.
3. Revolutionary: KIPs and real-world implementations
4. Evolutionary: Slack’s KIP-1176
5. The Hybrid: balancing revolution with evolution
6. Deciding Kafka’s future
1. The term “Diskless” vs “Direct-to-S3”
I used the term “diskless” in the title as that is the current hype word. But it is clear that not all designs are actually diskless in the same spirit as “serverless”. Serverless implies that users no longer need to consider or manage servers at all, not that there are no servers.
In the world of open-source, where you run stuff yourself, diskless would have to mean literally “no disks”, else you will be configuring disks as part of your deployment. But all the KIPs (in their current state) depend on disks to some extent, even KIP-1150 which was proposed as diskless. In most cases, disk behavior continues to influence performance and therefore correct disk provisioning will be important.
So I’m not a fan of “diskless”, I prefer “direct-to-S3”, which encompasses all designs that treat S3 (and other object stores) as the only source of durability.
2. The Common Parts
Combined Objects
The main commonality between all Direct-to-S3 Kafka implementations and design proposals is the uploading of objects that combine the data of multiple topics. The reasons are two-fold:
1. Avoiding the small file problem. Most designs are leaderless for producer traffic, allowing any server to receive writes to any topic. To avoid uploading a multitude of tiny files, servers accumulate batches in a buffer until ready for upload. Before upload, the buffer is sorted by topic id and partition, to make compaction and some reads more efficient by ensuring that data of the same topic and same partition are in contiguous byte ranges (a sketch of this buffering follows the list).
2. Pricing. The pricing of many (but not all) cloud object storage services penalizes excessive requests, so it can be cheaper to roll up whatever data has been received in the last X milliseconds and upload it with a single request.
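To make the buffering concrete, here is a minimal, hypothetical sketch of a combined-object buffer: it accumulates batches from many topic partitions, then sorts by topic id and partition on flush so that each partition’s data lands in a contiguous byte range of a single uploadable object. Class and record names are illustrative, not from any KIP or vendor implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of the combined-object upload strategy (not KIP or vendor code).
public class CombinedObjectBuffer {

    record BufferedBatch(UUID topicId, int partition, byte[] payload) {}
    record ByteRange(UUID topicId, int partition, long byteStart, long byteLength) {}
    record CombinedObject(byte[] data, List<ByteRange> index) {}

    private final List<BufferedBatch> buffer = new ArrayList<>();

    public synchronized void append(UUID topicId, int partition, byte[] payload) {
        buffer.add(new BufferedBatch(topicId, partition, payload));
    }

    // Sort by (topicId, partition) so each partition occupies a contiguous byte range,
    // then concatenate everything into one object, ready to upload with a single PUT.
    public synchronized CombinedObject flush() {
        buffer.sort(Comparator
                .comparing(BufferedBatch::topicId)
                .thenComparingInt(BufferedBatch::partition));

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<ByteRange> index = new ArrayList<>();
        for (BufferedBatch b : buffer) {
            index.add(new ByteRange(b.topicId(), b.partition(), out.size(), b.payload().length));
            out.writeBytes(b.payload());
        }
        buffer.clear();
        return new CombinedObject(out.toByteArray(), index);
    }
}
```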
Sequencing and metadata storage
In the leader-based model, the leader determines the order of batches in a topic partition. But in the leaderless model, multiple brokers could be simultaneously receiving produce batches of the same topic partition, so how do we order those batches? We need a way of establishing a single order for each partition and we typically use the word “sequencing” to describe that process. Usually there is a central component that does the sequencing and metadata storage, but some designs manage the sequencing in other ways.
Zone-aligned producers via metadata manipulation
WarpStream was the first to demonstrate that you could hack the metadata step of initiating a producer to provide it with broker information that would align the producer with a zone-local broker for the topics it is interested in. The Kafka client is leader-oriented, so we just pass it a zone-local broker and tell the client “this is the leader”. This is how all the leaderless designs ensure producers write to zone-local brokers. It’s not pretty, and we should make a future KIP to avoid the need for this kind of hack.
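As a rough illustration of the idea (and only that), the sketch below builds a metadata view in which a zone-local broker is advertised as the “leader” of every partition the client asks about. The types here are hypothetical placeholders, not Kafka’s internal request classes.

```java
import java.util.List;
import java.util.stream.IntStream;

// Illustrative sketch of zone-aligned producer routing via metadata manipulation.
public class ZoneAlignedMetadata {

    record Broker(int id, String host, int port, String zone) {}
    record PartitionMetadata(String topic, int partition, int advertisedLeaderId) {}

    // Advertise any live broker in the client's zone as the "leader" of every partition,
    // so the leader-oriented client sends its produce traffic to a zone-local broker.
    public static List<PartitionMetadata> buildResponse(String clientZone,
                                                        List<Broker> liveBrokers,
                                                        List<String> topics,
                                                        int partitionsPerTopic) {
        Broker zoneLocal = liveBrokers.stream()
                .filter(b -> b.zone().equals(clientZone))
                .findFirst()
                .orElse(liveBrokers.get(0)); // fall back to any broker if the zone has none

        return topics.stream()
                .flatMap(topic -> IntStream.range(0, partitionsPerTopic)
                        .mapToObj(p -> new PartitionMetadata(topic, p, zoneLocal.id())))
                .toList();
    }
}
```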
Zone-aligned consumers
Consumer zone-alignment heavily depends on the particular design, but two broad approaches exist:
Leaderless: The same way that producer alignment works via metadata manipulation or using KIP-392 (fetch from follower) which can be used in a leaderless context.
Leader-based:
Zone-aware consumer group assignment as detailed in KIP-881: Rack-aware Partition Assignment for Kafka Consumers. The idea is to use consumer-to-partition assignment to ensure consumers are only assigned zone-local partitions (where the partition leader is located).
KIP-392 (fetch-from-follower), which is effective for designs that have followers (which isn’t always the case).
Object Compaction
Given almost all designs upload combined objects, we need a way to make those mixed objects more read optimized. This is typically done through compaction, where combined objects are ultimately separated into per-topic or even per-partition objects. Compaction could be one-shot or go through multiple rounds.
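A minimal sketch of the one-shot variant, under the assumption that each combined object carries an index of its per-partition byte ranges: group the extracted slices by partition, order them by base offset, and rewrite them as per-partition objects. Names and the in-memory representation are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

// Illustrative sketch of one-shot compaction of combined objects into per-partition objects.
public class CompactionSketch {

    record TopicPartition(UUID topicId, int partition) {}
    // One partition's slice inside a combined object, plus its base offset for ordering.
    record Slice(TopicPartition tp, long baseOffset, byte[] bytes) {}

    // Input: slices extracted from many combined objects. Output: one object payload per partition.
    public static Map<TopicPartition, byte[]> compact(List<Slice> slices) {
        Map<TopicPartition, List<Slice>> byPartition = new HashMap<>();
        for (Slice s : slices) {
            byPartition.computeIfAbsent(s.tp(), k -> new ArrayList<>()).add(s);
        }
        Map<TopicPartition, byte[]> perPartitionObjects = new HashMap<>();
        byPartition.forEach((tp, list) -> {
            list.sort(Comparator.comparingLong(Slice::baseOffset)); // keep the partition's offset order
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            list.forEach(s -> out.writeBytes(s.bytes()));
            perPartitionObjects.put(tp, out.toByteArray());
        });
        return perPartitionObjects;
    }
}
```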
3. Revolutionary: The KIPs and Real-world Implementations
The “revolutionary” path draws a new boundary inside Kafka by separating what can be stateless from what must remain stateful. Direct-to-S3 traffic is handled by a lightweight, elastic layer of brokers that simply serve producers and consumers. The direct-to-S3 coordination (sequencing/metadata) is incorporated into the stateful side of regular brokers where coordinators, classic topics and KRaft live.
I cover three designs in the “revolutionary” section:
WarpStream (as a reference, a kind of yardstick to compare against)
KIP-1150 revision 1
Aiven Inkless (a Kafka-fork)
3.1. WarpStream (as a yardstick)
Before we look at the KIPs that describe possible futures for Apache Kafka, let’s look at a system that was designed from scratch with both cross-AZ cost savings and elasticity (from object storage) as its core design principles. WarpStream was unconstrained by an existing stateful architecture, and with this freedom, it divided itself into:
Leaderless, stateless and diskless agents that handle Kafka clients, as well as compaction/cleaning work.
Coordination layer: A central metadata store for sequencing, metadata storage and housekeeping coordination.
As per the Common Parts section, the zone alignment and sorted combined object upload strategies are employed.
On the consumer side, which again is leaderless and zone-local, a per-zone shared cache is implemented (which WarpStream dubbed distributed mmap). Within a zone, this shared cache assigns each agent a portion of the partition-space. When a consumer fetch arrives, an agent will download the object byte range itself if it is responsible for that partition, else it will ask the responsible agent to do that on its behalf. That way, we ensure that multiple agents per zone are not independently downloading the same data, thus reducing S3 costs.
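Below is a minimal sketch of the per-zone shared cache routing, assuming a simple hash-based ownership function over the zone’s agents (WarpStream’s actual assignment mechanism may differ; all names here are illustrative):

```java
import java.util.List;
import java.util.Objects;

// Sketch of a per-zone shared cache: only the "owning" agent downloads a partition's bytes.
public class ZoneSharedCacheRouter {

    record TopicPartition(String topic, int partition) {}

    private final List<String> zoneAgents; // agent ids in this zone, same order on every agent
    private final String selfId;

    public ZoneSharedCacheRouter(List<String> zoneAgents, String selfId) {
        this.zoneAgents = List.copyOf(zoneAgents);
        this.selfId = selfId;
    }

    // Every agent in the zone computes the same owner for a given partition.
    private String ownerOf(TopicPartition tp) {
        int idx = Math.floorMod(Objects.hash(tp.topic(), tp.partition()), zoneAgents.size());
        return zoneAgents.get(idx);
    }

    // Either serve the byte range locally (owner hits S3 and caches), or ask the owning agent.
    public byte[] fetch(TopicPartition tp, String objectKey, long start, long length) {
        if (ownerOf(tp).equals(selfId)) {
            return downloadFromObjectStore(objectKey, start, length);
        }
        return forwardToAgent(ownerOf(tp), objectKey, start, length); // in-zone forward, no extra S3 read
    }

    private byte[] downloadFromObjectStore(String key, long start, long length) { return new byte[0]; }
    private byte[] forwardToAgent(String agentId, String key, long start, long length) { return new byte[0]; }
}
```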
WarpStream implements agent roles (proxy, job) and agent groups to separate the work of handling producer and consumer traffic from background jobs such as compaction, allowing for independent scaling of each workload. The proxy role (producer/consumer Kafka client traffic) can be further divided into proxy-producer and proxy-consumer.
Agents can be deployed as dedicated agent groups, which allows for further separation of workloads. This is useful for avoiding noisy neighbor issues, running different groups in different VPCs and scaling different workloads that hit the same topics. For example, you could use one proxy group for microservices, and a separate proxy-consumer group for an analytics workload.
Being a pure Direct-to-S3 system allowed WarpStream to choose a design that clearly separates traffic serving work into stateless agents and the coordination logic into one central metadata service. The traffic serving layer is highly elastic, with relatively simple agents that require no disks at all. Different workloads benefit from independent and flexible scaling via the agent roles and groups. The point of contention is the metadata service, which needs to be carefully managed and scaled to handle the read/write metadata volume of the stateless agents.
Confluent Freight Clusters follow a largely similar design principle of splitting stateless brokers from central coordination. I will write about the Freight design in a future post.
Apache Kafka is a stateful distributed system, but next we’ll see how KIP-1150 could fulfill much of the same capabilities as WarpStream.
3.2. KIP-1150 – Revision 1
KIP-1150 has continued to evolve since it was first proposed, undergoing two subsequent major revisions between the KIP itself and mailing list discussion. This section describes the first version of the KIP, created in April 2025.
KIP-1150 revision 1 uses a leaderless architecture where any broker can serve Kafka producers and consumers of any topic partition. Batch Coordinators (replicated state machines like the Group and Transaction Coordinators) handle coordination (object sequencing and metadata storage).
Brokers accumulate Kafka batches and upload shared objects to S3 (known as Shared Log Segment Objects, or SLSO). Metadata that maps these blocks of Kafka batches to SLSOs (known as Batch Coordinates) is then sent to a Batch Coordinator (BC) that sequences the batches (assigning offsets) to provide a global order of those batches per partition. The BC acts as sequencer and batch coordinate database for later lookups, allowing for the read path to retrieve batches from SLSOs. The BCs will also apply idempotent and transactional producer logic (not yet finalized).
Batch Coordinators
The Batch Coordinator (BC) is the stateful component akin to WarpStream’s metadata service. Kafka has other coordinators such as the Group Coordinator (for the consumer group protocol), the Transaction Coordinator (for reliable 2PC) and Share Coordinator (for queues aka share groups). The coordinator concept, for centralizing some kind of coordination, has a long history in Kafka.
The BC is the source of truth about uploaded objects, Direct-to-S3 topic partitions, and committed batches. This is the component that contains the most complexity, with challenges around scaling, failovers, reliability as well as logic for idempotency, transactions, object compaction coordination and so on.
A Batch Coordinator has the following roles (sketched as a hypothetical interface after this list):
1. Sequencing. Chooses the total ordering for writes, assigning offsets without gaps or duplicates.
2. Metadata storage. Stores all metadata that maps partition offset ranges to S3 object byte ranges.
3. Serving lookup requests.
Serving requests for log offsets.
Serving requests for batch coordinates (S3 object metadata).
4. Partition CRUD operations. Serving requests for atomic operations (creating partitions, deleting topics, records, etc.)
5. Data expiration.
Managing data expiry and soft deletion.
Coordinating physical object deletion (performed by brokers).
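KIP-1150 does not define a Batch Coordinator API; the interface below is purely a hypothetical sketch of how the roles above might map onto methods, with invented names and shapes:

```java
import java.util.List;
import java.util.UUID;

// Hypothetical interface sketch of the Batch Coordinator roles listed above.
public interface BatchCoordinator {

    record TopicIdPartition(UUID topicId, int partition) {}

    // Where a block of batches for one partition lives inside a shared log segment object.
    record BatchCoordinates(TopicIdPartition partition, String objectKey,
                            long byteStart, long byteLength,
                            long baseOffset, long lastOffset) {}

    // Sequencing and metadata storage: assign offsets to the uploaded batches and durably
    // record which object and byte range holds them. Returns the sequenced coordinates.
    List<BatchCoordinates> commitObject(String objectKey, List<BatchCoordinates> unsequenced);

    // Lookups: which objects/byte ranges hold data at or after fetchOffset for this partition?
    List<BatchCoordinates> findBatches(TopicIdPartition partition, long fetchOffset, int maxBytes);

    // Log offset queries (earliest, latest, by timestamp, and so on).
    long highWatermark(TopicIdPartition partition);
    long logStartOffset(TopicIdPartition partition);

    // Partition CRUD operations.
    void createPartitions(UUID topicId, int partitionCount);
    void deleteTopic(UUID topicId);

    // Data expiration: soft-delete expired data and track which objects can be physically removed.
    void deleteRecordsBefore(TopicIdPartition partition, long offset);
    List<String> objectsEligibleForDeletion();
}
```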
The Group and Transaction Coordinators use internal Kafka topics to durably store their state and rely on the KRaft Controller for leader election (coordinators are highly available with failovers). This KIP does not specify whether the Batch Coordinator will be backed by a topic, or use some other option such as a Raft state machine (based on KRaft state machine code). It also proposes that it could be pluggable. For my part, I would prefer all future coordinators to be KRaft state machines, as the implementation is rock solid and can be used like a library to build arbitrary state machines inside Kafka brokers.
Produce path
1. When a producer starts, it sends a Metadata request to learn which brokers are the leaders of each topic partition it cares about. The Metadata response contains a zone-local broker.
2. The producer sends all Produce requests to this broker.
3. The receiving broker accumulates batches of all partitions and uploads a shared log segment object (SLSO) to S3.
4. The broker commits the SLSO by sending the metadata of the SLSO (the metadata is known as the batch coordinates) to the Batch Coordinator. The coordinates include the S3 object metadata and the byte ranges of each partition in the object.
5. The Batch Coordinator assigns offsets to the written batches, persists the batch coordinates, and responds to the broker.
6. The broker sends all associated acknowledgements to the producers.
Brokers can parallelize the uploading of SLSOs to S3, but commit them serially to the Batch Coordinator.
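A small sketch of that ordering rule, assuming invented interfaces for the object store and coordinator client: uploads fan out to a thread pool, while commits are chained one after another so they reach the Batch Coordinator serially and in submission order.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: parallel SLSO uploads, serialized commits to the Batch Coordinator.
public class SlsoUploader {

    interface ObjectStore { void put(String key, byte[] data); }
    interface BatchCoordinatorClient { void commit(String objectKey, byte[] batchCoordinates); }

    private final ExecutorService uploadPool = Executors.newFixedThreadPool(4);
    private final ObjectStore store;
    private final BatchCoordinatorClient coordinator;

    // Each new commit is chained after the previous one, which serializes commits even
    // though the uploads they depend on may complete in any order.
    private CompletableFuture<Void> commitChain = CompletableFuture.completedFuture(null);

    public SlsoUploader(ObjectStore store, BatchCoordinatorClient coordinator) {
        this.store = store;
        this.coordinator = coordinator;
    }

    public synchronized CompletableFuture<Void> uploadAndCommit(String objectKey,
                                                                byte[] slso,
                                                                byte[] batchCoordinates) {
        CompletableFuture<Void> uploaded =
                CompletableFuture.runAsync(() -> store.put(objectKey, slso), uploadPool);

        // Wait for both the upload and all previously submitted commits before committing.
        commitChain = CompletableFuture.allOf(commitChain, uploaded)
                .thenRun(() -> coordinator.commit(objectKey, batchCoordinates));
        return commitChain;
    }
}
```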
Consume path
When batches are first written, a broker does not assign offsets as it would for a regular topic; offsets are assigned after the data is written, when it is sequenced and indexed by the Batch Coordinator. In other words, the batches stored in S3 have no offsets stored within their payloads, as would normally be the case. On consumption, the broker must inject the offsets into the Kafka batches, using metadata from the Batch Coordinator (a small offset-injection sketch follows the steps below).
1. The consumer sends a Fetch request to the broker.
2. The broker checks its cache and, on a miss, queries the Batch Coordinator for the relevant batch coordinates.
3. The broker downloads the data from object storage.
4. The broker injects the computed offsets and timestamps into the batches (the offsets being part of the batch coordinates).
5. The broker constructs and sends the Fetch response to the consumer.
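A sketch of the offset-injection step: in the v2 record batch format the base offset occupies the first 8 bytes of the batch header (and is excluded from the batch CRC), so the broker can stamp the coordinator-assigned offset into the bytes it downloaded. A real implementation also has to handle timestamps and validation; this is illustrative only.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of stamping a coordinator-assigned base offset into a v2 record batch.
public class OffsetInjector {

    // rawBatch: batch bytes as stored in the shared object (offsets not yet assigned).
    // assignedBaseOffset: the base offset assigned by the Batch Coordinator at commit time.
    public static ByteBuffer injectBaseOffset(ByteBuffer rawBatch, long assignedBaseOffset) {
        ByteBuffer stamped = ByteBuffer.allocate(rawBatch.remaining());
        stamped.put(rawBatch.duplicate()); // copy so the cached source bytes are not mutated
        stamped.flip();
        stamped.putLong(0, assignedBaseOffset); // baseOffset is the first field of the batch header
        return stamped;
    }
}
```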
Broker roles, the key to elasticity (future work)
The KIP notes that broker roles could be supported in the future. I believe that KIP-1150 revision 1 starts becoming really powerful with roles akin to WarpStream. That way we can separate out direct-to-S3 topic serving traffic and object compaction work on proxy brokers, which become the elastic serving and background job layer. Batch Coordinators would remain hosted on standard brokers, which are already stateful. With broker roles, we can see how Kafka could implement its own WarpStream-like architecture, which makes full use of disaggregated storage to enable better elasticity.
To be decided…
Things like supporting idempotency, transactions, caching and object compaction are left to be decided later. While taxing to work out, they look doable within the basic framework of the design. But as I already mentioned in this post, this will be costly in development effort and may also come with a long-term code maintenance overhead if complex parts such as transactions are maintained twice. It may also be possible to refactor rather than do wholesale rewrites.
3.3. Aiven Inkless
Inkless is Aiven’s direct-to-S3 fork of Apache Kafka. The Inkless design shares the combined object upload part and metadata manipulation that all designs across the board are using. It is also leaderless for direct-to-S3 topics. Inkless is firmly in the revolutionary camp. If it were to implement broker roles, it would make this Kafka-fork much closer to a WarpStream-like implementation (albeit with some issues concerning the coordination component as we’ll see further down).
While Inkless is described as a KIP-1150 implementation, we’ll see that it actually diverges significantly from KIP-1150, especially the later revisions (covered later).
Postgres as coordinator
Inkless eschewed the Batch Coordinator of KIP-1150 in favor of a Postgres instance, with coordination being executed through Table Valued Functions (TVF) and row locking where needed.
On the read side, each broker must discover what batches exist by querying Postgres (again using a TVF), which returns the next set of batch coordinates as well as the high watermark. Now that the broker knows where the next batches are located, it requests those batches via a read-through cache. On a cache miss, it fetches the byte ranges from the relevant objects in S3.
Inkless bears some resemblance to KIP-1150 revision 1, except the difficult coordination bits are delegated to Postgres. Postgres does all the sequencing, metadata storage, as well as coordination for compaction and file cleanup.
For example, compaction is coordinated via Postgres, with a TVF that is periodically called by each broker which finds a set of files which together exceed a size threshold and places them into a merge work order (tables file_merge_work_items and file_merge_work_item_files) that the broker claims. Once carried out, the original files are marked for deletion (which is another job that can be claimed by a broker).
Inkless doesn’t implement transactions, and I don’t think Postgres could take on the role of transaction coordinator, as the coordinator does more than sequencing and storage. Inkless will likely have to implement some kind of coordinator for that.
The Postgres data model is based on the following tables:
logs. Stores the Log Start Offset and High Watermark of each topic partition, with the primary key of topic_id and partition.
files. Lists all the objects that host the topic partition data.
batches. Maps Kafka batches to byte ranges in Files.
producer_state. All the producer state needed for idempotency.
Some other tables for housekeeping, and the merge work items.
The Commit File TVF, which sequences and stores the batch coordinates, works as follows (a rough JDBC sketch of the same pattern follows the steps):
1. A broker opens a transaction and submits a table as an argument, containing the batch coordinates of the multiple topics uploaded in the combined file.
2. The TVF logic creates a temporary table (logs_tmp) and fills it via a SELECT on the logs table with the FOR UPDATE clause, which obtains a row lock on each topic partition row in the logs table that matches the list of partitions being submitted. This ensures that other brokers competing to add batches to the same partition(s) queue up behind this transaction. This is a critical barrier that avoids inconsistency. These locks are held until the transaction commits or aborts.
3. Next, inside a loop, partition-by-partition, the TVF:
Updates the producer state.
Updates the high watermark of the partition (a row in the logs table).
Inserts the batch coordinates into the batches table (sequencing and storing them).
4. Commits the transaction.
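Below is a rough JDBC sketch of that commit pattern, not Inkless’s actual TVF: lock the affected logs rows in a deterministic order, then sequence and insert the batch coordinates while the locks are held. The column names (high_watermark, base_offset, and so on) are assumptions for illustration, and locking partition-by-partition rather than all up front via logs_tmp is a simplification.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Rough JDBC sketch of the Inkless-style commit pattern (schema and logic are illustrative).
public class CommitFileSketch {

    record PartitionBatch(UUID topicId, int partition, String objectKey,
                          long byteStart, long byteLength, int recordCount) {}

    public void commitFile(Connection conn, List<PartitionBatch> batches) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // Lock rows in a deterministic (topicId, partition) order to avoid deadlocks,
            // mirroring the ORDER BY mentioned in the post.
            List<PartitionBatch> ordered = batches.stream()
                    .sorted(Comparator.comparing(PartitionBatch::topicId)
                            .thenComparingInt(PartitionBatch::partition))
                    .toList();

            for (PartitionBatch b : ordered) {
                long highWatermark;
                // Row lock on the partition's logs row; competing commits queue up behind it.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT high_watermark FROM logs WHERE topic_id = ? AND partition = ? FOR UPDATE")) {
                    ps.setObject(1, b.topicId());
                    ps.setInt(2, b.partition());
                    try (ResultSet rs = ps.executeQuery()) {
                        rs.next();
                        highWatermark = rs.getLong(1);
                    }
                }
                // Sequence the batch: it starts at the current high watermark.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO batches (topic_id, partition, base_offset, last_offset, "
                                + "object_key, byte_start, byte_length) VALUES (?, ?, ?, ?, ?, ?, ?)")) {
                    ps.setObject(1, b.topicId());
                    ps.setInt(2, b.partition());
                    ps.setLong(3, highWatermark);
                    ps.setLong(4, highWatermark + b.recordCount() - 1);
                    ps.setString(5, b.objectKey());
                    ps.setLong(6, b.byteStart());
                    ps.setLong(7, b.byteLength());
                    ps.executeUpdate();
                }
                // Advance the high watermark while still holding the row lock.
                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE logs SET high_watermark = ? WHERE topic_id = ? AND partition = ?")) {
                    ps.setLong(1, highWatermark + b.recordCount());
                    ps.setObject(2, b.topicId());
                    ps.setInt(3, b.partition());
                    ps.executeUpdate();
                }
            }
            conn.commit(); // all row locks are released here
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```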
Postgres concerns
Apache Kafka would not accept a Postgres dependency of course, and KIP-1150 has not proposed centralizing coordination in Postgres either. But the KIP has suggested that the Batch Coordinator be pluggable, which might leave it open for using Postgres as a backing implementation.
As a former database performance specialist, the Postgres locking does concern me a bit. It blocks on the logs table rows scoped to the topic id and partition. An ORDER BY prevents deadlocks, but given the row locks are maintained until the transaction commits, I imagine that given enough contention, it could cause a convoy effect of blocking. This blocking is fair, that is to say, First Come First Serve (FCFS) for each individual row.
For example, with 3 transactions: T1 locks rows 11-15; T2 wants to lock rows 6-11, but only manages 6-10 as it blocks on row 11. Meanwhile T3 wants to lock rows 1-6, but only manages 1-5 as it blocks on row 6. We now have a dependency chain where T1 blocks T2 and T2 blocks T3. Once T1 commits, the others get unblocked, but under sustained load this kind of locking and blocking can quickly cascade, such that once contention starts, it rapidly expands. This contention is sensitive to the number of concurrent transactions and the number of partitions per commit. A common pattern with this kind of locking is that up until a certain transaction throughput everything is fine, but at the first hint of contention, the whole thing slows to a crawl. Contention breeds more contention.
I would therefore caution against the use of Postgres as a Batch Coordinator implementation.
4. Evolutionary: Slack’s KIP-1176
The following is a very high-level look at Slack’s KIP-1176, in the interests of keeping this post from getting too detailed.
There are three key points to this KIP’s design:
Maintain leader-based topic partitions (producers continue to write to leaders), but replace the Kafka replication protocol with a per-broker S3-based write-ahead log (WAL).
Try to preserve existing partition replica code for idempotency and transactions.
Reuse existing tiered storage for long-term S3 data management.
The basic idea is to preserve the leader-based architecture of Kafka, with each leader replica continuing to write to an active local log segment file, which it rotates periodically. A per-broker write-ahead-log (WAL) replaces replication. A WAL Combiner component in the Kafka broker progressively (and aggressively) tiers portions of the local active log segment files (without closing them), combining them into multi-topic objects uploaded to S3. Once a Kafka batch has been written to the WAL, the broker can send an acknowledgment to its producer.
This active log segment tiering does not change how log segments are rotated. Once an active log segment is rotated out (by closing it and creating a new active log segment file), it can be tiered by the existing tiered storage component, for the long-term.
The WAL acts as write-optimized S3 storage and the existing tiered storage uploads closed log segment files for long-term storage. Once all data of a given WAL object has been tiered, it can be deleted. The WAL only becomes necessary during topic partition leader-failovers, where the new leader replica bootstraps itself from the WAL. Alternatively, each topic partition can have one or more followers which actively reconstruct local log segments from the WAL, providing a faster failover.
The general principle is to keep as much of Kafka unchanged as possible, only changing from the Kafka replication protocol to an S3 per-broker WAL. The priority is to avoid the need for heavy rework or reimplementation of logic such as idempotency, transactions and share groups integration. But it gives up elasticity and the additional architectural benefits that come from building on disaggregated storage.
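To make the WAL combiner idea concrete, here is a minimal, hypothetical sketch (the names are not from KIP-1176): on a timer, the not-yet-uploaded tails of the active local segments are bundled into one multi-topic WAL object, and producer acknowledgements complete only once that object is written.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a per-broker WAL combiner.
public class WalCombinerSketch {

    interface ObjectStore { void put(String key, byte[] data); }

    // The not-yet-uploaded tail of one active segment, plus the pending producer acks for it.
    record SegmentTail(String topicPartition, byte[] newBytes, CompletableFuture<Void> ack) {}

    private final ObjectStore store;
    private long walObjectCounter = 0;

    public WalCombinerSketch(ObjectStore store) { this.store = store; }

    // Called periodically: combine the new tail bytes of every active segment into one object.
    public void combineAndUpload(List<SegmentTail> tails) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (SegmentTail tail : tails) {
            out.writeBytes(tail.newBytes()); // a real design would also write an index of byte ranges
        }
        store.put("wal-" + (walObjectCounter++), out.toByteArray());
        // The WAL upload is the durability point: producers are acknowledged now, without
        // waiting for segment rotation or long-term tiering.
        tails.forEach(t -> t.ack().complete(null));
    }
}
```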
Having said all of the above, there are a lot of missing or hacky details that currently detract from the evolutionary goal, and a lot of hand-waving when it comes to correctness. It is not clear that this KIP will be able to deliver a low-disruption evolutionary design that is also correct, highly available and durable. Discussion on the mailing list is ongoing.
Luke Chen remarked: “the current availability story is weak… It’s not clear if the effort is still small once details on correctness, cost, cleanness are figured out.”, and I have to agree.
5. The Hybrids – Balancing revolution with evolution
5.1. KIP-1150 – Revision 2 (Tiered storage for long-term management)
The second revision of KIP-1150 replaces the future object compaction logic by delegating long-term storage management to the existing tiered storage abstraction (like Slack’s KIP-1176).
The idea is to:
Remove Batch Coordinators from the read path.
Avoid separate object compaction logic by delegating long-term storage management to tiered storage (which already exists).
Rebuild per-partition log segments from combined objects in order to:
Submit them for long-term tiering (works as a form of object compaction too).
Serve consumer fetch requests.
The write path
The entire write path becomes a three stage process:
1. Stage 1 – Produce path, synchronous. Uploads multi-topic WAL Segments to S3 and sequences the batches, acknowledging to producers once committed. This is unchanged except SLSOs are now called WAL Segments.
2. Stage 2 – Per-partition log segment file construction, asynchronous. Each broker is assigned a subset of topic partitions. The brokers download the WAL segment byte ranges that host these assigned partitions and append to on-disk per-partition log segment files.
3. Stage 3 – Tiered storage, asynchronous. The tiered storage architecture tiers the locally cached topic partition log segment files as normal.
Stage 1 – The produce path
The produce path is the same, but SLSOs are now called WAL Segments.
Stage 2 – Local per-partition segment caching
The second stage is preparation for both:
Stage 3, segment tiering (tiered storage).
The read pathway for tailing consumers
Each broker is assigned a subset of topic partitions and polls the BC to learn of new WAL segments. Any WAL segment that hosts one of the broker’s assigned topic partitions will be downloaded (at least the byte range of its assigned partitions). Once the download completes, the broker injects record offsets as determined by the Batch Coordinator and appends the finalized batches to a local (per topic partition) log segment on disk.
At this point, a log segment file looks like a classic topic partition segment file. The difference is that they are not a source of durability, only a source for tiering and consumption. WAL segments remain in object storage until all batches of a segment have been tiered via tiered storage. Then WAL segments can be deleted.
Stage 3 – Tiered storage
Tiered storage continues to work as it does today (KIP-405), based on local log segments. It hopefully knows nothing of the Direct-to-S3 components and logic. Tiered segment metadata is stored in KRaft which allows for WAL segment deletion to be handled outside of the scope of tiered storage also.
The read path
Data is consumed from S3 topics from either:
The local segments on-disk, populated from stage 2.
Tiered log segments (traditional tiered storage read pathway)
End-to-end latency of any given batch is therefore based on:
1. Produce batch added to buffer.
2. WAL Segment containing that batch written to S3.
3. Batch coordinates submitted to the Batch Coordinator for sequencing.
4. Producer request acknowledged.
Tail, untiered (fast path for tailing consumers) ->
Replica downloads WAL Segment slice.
Replica appends the batch to a local (per topic partition) log segment.
Replica serves a consumer fetch request from the local log segment.
Tiered (slow path for lagging consumers) ->
Remote Log Manager downloads tiered log segment
Replica serves a consumer fetch request from the downloaded log segment.
Partition-to-broker assignment
Avoiding excessive reads to S3 will be important in stage 2 (when a broker is downloading WAL segment files for its assigned topic partitions). This KIP should standardize how topic partitions are laid out inside every WAL segment and perform partition-broker assignments based on that same order:
1. Pick a single global topic partition order (based on a permutation of a topic partition id).
2. Partition that order into contiguous slices, giving one slice per broker as its assignment (a broker may get multiple topic partitions, but they must be adjacent in the global order).
3. Lay out every WAL segment in that same global order.
That way, each broker’s partition assignment will occupy one contiguous block per WAL Segment, so each read broker needs only one byte-range read per WAL segment object (possibly empty if none of its partitions appear in that object). This reduces the number of range reads per broker when reconstructing local log segments.
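A sketch of how that could look, under an assumed (topicId, partition) global ordering and equal-sized slices (the KIP does not prescribe either): because every WAL segment is laid out in the same order, a broker’s contiguous slice collapses to at most one byte range per object.

```java
import java.util.*;

// Illustrative sketch of contiguous-slice partition-to-broker assignment over WAL segments.
public class ContiguousSliceAssignment {

    record TopicPartition(UUID topicId, int partition) {}

    // The single global order every broker and every WAL segment layout agrees on.
    public static List<TopicPartition> globalOrder(Collection<TopicPartition> all) {
        return all.stream()
                .sorted(Comparator
                        .comparing(TopicPartition::topicId)
                        .thenComparingInt(TopicPartition::partition))
                .toList();
    }

    // Broker i gets the i-th contiguous slice of the global order.
    public static List<TopicPartition> sliceForBroker(List<TopicPartition> order,
                                                      int brokerIndex, int brokerCount) {
        int sliceSize = (order.size() + brokerCount - 1) / brokerCount; // ceiling division
        int from = Math.min(brokerIndex * sliceSize, order.size());
        int to = Math.min(from + sliceSize, order.size());
        return order.subList(from, to);
    }

    // Because the WAL segment is laid out in the same global order, the broker's slice maps
    // to (at most) one [start, end) byte range in the object: from the first of its partitions
    // present in the segment to the end of the last one present.
    public static Optional<long[]> byteRangeForBroker(List<TopicPartition> brokerSlice,
                                                      Map<TopicPartition, long[]> segmentIndex) {
        long start = Long.MAX_VALUE, end = Long.MIN_VALUE;
        for (TopicPartition tp : brokerSlice) {
            long[] range = segmentIndex.get(tp); // {byteStart, byteLength} within the WAL segment
            if (range == null) continue;         // partition absent from this segment
            start = Math.min(start, range[0]);
            end = Math.max(end, range[0] + range[1]);
        }
        return start == Long.MAX_VALUE ? Optional.empty() : Optional.of(new long[]{start, end});
    }
}
```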
KIP-1150 revision 2 summary
By integrating with tiered storage and reconstructing log segments, revision 2 moves to a more diskful design, where disks form a step in the write path to long-term storage. It is also more stateful and sticky than revision 1, given that each topic partition is assigned a specific broker for log segment reconstruction, tiering and consumer serving. Revision 2 remains leaderless for producers, but leaderful for consumers. Therefore, to avoid cross-AZ consumer traffic, it will rely on KIP-881: Rack-aware Partition Assignment for Kafka Consumers to ensure zone-local consumer assignment.
This makes revision 2 a hybrid. By delegating responsibility to tiered storage, more of the direct-to-S3 workload must be handled by stateful brokers. It is less able to benefit from the elasticity of disaggregated storage. But it reuses more of existing Kafka.
5.2. KIP-1150 – Revision 3 (No more Batch Coordinators)
The third revision, which at the time of writing is a loose proposal in the mailing list, ditches the Batch Coordinator (BC) altogether. Most of the complexity of KIP-1150 centers around Batch Coordinator efficiency, failovers, scaling as well as idempotency, transactions and share groups logic.
Revision 3 proposes to replace the BCs with “classic” topic partitions. The classic topic partition leader replicas will do the work of sequencing and storing the batch coordinates of their own data. The data itself would live in SLSOs (rev1)/WAL Segments (rev2) and ultimately, as tiered log segments (tiered storage).
To make this clear, if as a user you create the topic Orders with one partition, then an actual topic Orders will be created with one partition. However, this will only be used for the sequencing of the Orders data and the storage of its metadata. The benefit of this approach is that all the idempotency and transaction logic can be reused in these “classic-ish” topic partitions. There will be code changes but less than Batch Coordinators. All the existing tooling of moving partitions around, failovers etc works the same way as classic topics.
So replication continues to exist, but only for the metadata.
One wrinkle this adds is that there is no central place to manage the clean-up of WAL Segments. Therefore a WAL File Manager component would have responsibility for background cleanup of those WAL segment files. It would periodically check the status of tiering to discover when a WAL Segment can get deleted.
The motivation behind this change to remove Batch Coordinators is to simplify implementation by reusing existing Kafka code paths (for idempotence, transactions, etc.).
However, it also opens up a whole new set of challenges which must be discussed and debated, and it is not clear this third revision solves the complexity. One new complication with revision 3 is that we replace a single WAL Segment commit to a Batch Coordinator, with writing multiple records to different topic partitions. One combined object maps to multiple leader replicas. The Transaction Coordinator does this type of multi-topic-partition write to commit or abort transactions, but it is a durable state machine that can ensure the multiple writes either all complete or abort despite crashes and failovers. So we’d have to implement some kind of durable retry in revision 3.
Revision 3 now depends on the existing classic topics, with leader-follower replication. It moves a little further again towards the evolutionary path. It is curious to see Aiven productionizing its Kafka fork “Inkless”, which falls under the “revolutionary” umbrella, while pushing towards a more evolutionary stateful/sticky design in these later revisions of KIP-1150.
6. Deciding the future of Kafka
Apache Kafka is approaching a decision point with long-term implications for its architecture and identity. The ongoing discussions around KIP-1150 revisions 1-3 and KIP-1176 are nominally framed around replication cost reduction, but the underlying issue is broader: how should Kafka evolve in a world increasingly shaped by disaggregated storage and elastic compute?
At its core, the choice comes down to two paths. The evolutionary path seeks to fit S3 topics into Kafka’s existing leader-follower framework, reusing current abstractions such as tiered storage to minimize disruption to the codebase. The revolutionary path instead prioritizes the benefits of building directly on object storage. By delegating durability to shared object storage, Kafka can support an S3 topic serving layer that is stateless, elastic, and disposable, with scaling coming from adding and removing stateless workers rather than rebalancing stateful nodes, while Kafka’s existing workloads continue to be served by classic leader-follower topics.
The dangers of retrofitting
While the intentions and goals of the KIPs clearly fall on a continuum of revolutionary to evolutionary, the reality in the mailing list discussions makes everything much less clear.
The devil is in the details, and as the discussion advances, the arguments of “simplicity through reuse” start to strain. The reuse strategy is a retrofitting strategy which, ironically, could make the codebase harder to maintain in the long term. Kafka’s existing model is deeply rooted in leader-follower replication, with much of its core logic built around that assumption. Retrofitting direct-to-S3 into this model forces “unnatural” design choices, choices that would not be made if designing a cloud-native solution from scratch.
My opinion
My own view aligns with the more revolutionary path in the form of KIP-1150 revision 1. It doesn’t simply reduce cross-AZ costs, but fully embraces the architectural benefits of building on object storage. With additional broker roles and groups, Kafka could ultimately achieve a similar elasticity to WarpStream (and Confluent Freight Clusters).
The approach demands more upfront engineering effort and may increase long-term maintenance complexity, but it avoids tight coupling to the existing leader-follower architecture. Much depends on what kind of refactoring is possible to avoid the duplication of idempotency, transaction and share group logic. I believe the benefits justify the upfront cost and will help keep Kafka relevant in the decade ahead.
The community decides
In theory, both directions are defensible; ultimately it comes down to the specifics of each KIP. The details really matter. Goals define direction, but it’s the engineering details that determine the system’s actual properties.
We know the revolutionary path involves big changes, but the evolutionary path comes with equally large challenges, where retrofitting may ultimately be more costly while simultaneously delivering less. The committers who maintain Kafka are cautious about large refactorings and code duplication, but are equally wary of hacks and complex code serving two needs. We need to let the discussions play out.
My aim with this post has been to take a step back from “how to implement direct-to-S3 topics in Kafka”, and think more about what we want Kafka to be. The KIPs represent the how, the engineering choices. Framed that way, I believe it is easier for the wider community to understand the KIPs, the stakes and the eventual decision of the committers, whichever way they ultimately decide to go.