This incident started on December 5th, and is one of the longest in Honeycomb history, having been actively worked on and closed only on December 17th. Due to its impact and duration, we wanted to offer a partial and preliminary report to explain, at a high level, what happened.
On December 5th, at 20:23 UTC, our Kafka cluster suffered a critical loss of redundancy. Our Kafka cluster contains multiple topics, including all telemetry events submitted by Honeycomb users, the rematerialization of state changes into the activity log, and multiple metadata topics used by Kafka to manage its own workloads. This left multiple partitions leaderless, and by 20:35 UTC, we were getting alerts that roughly a quarter of our usual event topic partitions were unable to accept writes.
For most Honeycomb customers, this does not result in an ingest outage, because traffic gets redirected to the other partitions their environments can be assigned to. However, we noted some teams whose assigned partitions all fell within the impacted set, and for whom all ingest was down (0.23% of all datasets were impacted). By 21:30 UTC, we had identified the impacted teams, were working on a traffic reassignment script, and believed other features (such as SLO evaluation) were unimpacted.
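To make the blast-radius question concrete, the check involved here amounts to a subset test: a dataset only goes fully dark if every partition it can be routed to is in the impacted set. The sketch below is purely illustrative; the names and the mapping are invented, and Honeycomb's actual partition-assignment logic is not described in this report.

```python
# Hypothetical sketch: find datasets with no healthy partition left to write to.
# `assignments` maps a dataset to the Kafka partitions its traffic may be routed to;
# both the mapping and the names are assumptions, not Honeycomb's implementation.

def fully_impacted_datasets(assignments: dict[str, set[int]],
                            leaderless: set[int]) -> list[str]:
    """Return datasets whose assigned partitions are all leaderless."""
    return [
        dataset
        for dataset, partitions in assignments.items()
        if partitions and partitions <= leaderless  # subset check: nowhere left to go
    ]

if __name__ == "__main__":
    assignments = {"api-prod": {1, 4, 7}, "checkout": {2, 3}, "batch-jobs": {3, 9}}
    leaderless = {2, 3, 5}
    print(fully_impacted_datasets(assignments, leaderless))  # ['checkout']
```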
However, at 22:08 UTC, we noticed, through the noise of the many alerts that were firing, that our ingestion fleet had been in overload protection since 20:38 UTC, impacting everyone more broadly by returning errors on api.eu1.honeycomb.io that would have led to dropped events. Stabilization work was done, and 6 minutes later (22:14 UTC) ingestion was back to normal, which amounts to roughly 1 hour and 25 minutes of increased ingest error rates on api.eu1.honeycomb.io for everyone.
At 01:30 UTC on Saturday, December 6th, our responders had managed to force new leader elections on all impacted ingestion partitions, which re-established working traffic internally on all but one of them; that last partition remained shuttered in read-only mode. By then, our brokers were severely imbalanced, so we tweaked retention settings and monitored disk usage, planning to come back during the day to repair the remaining broken partitions.
By 11:00 UTC, disk usage on our Kafka brokers had reached a threshold where it became necessary to devise and perform emergency operations. We noticed that our metadata partitions, some of which track Kafka's offloading of storage to S3 and its autobalancing features, were still leaderless, and we thought this could be the problem.
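Leaderless partitions of this kind show up directly in cluster metadata. As a rough illustration (not the tooling we used, and with a placeholder broker address), a check with the confluent-kafka Python client could look like this:

```python
# Illustrative sketch: list partitions that currently have no elected leader.
# Assumes the confluent-kafka Python client; the broker address is a placeholder.
from confluent_kafka.admin import AdminClient

def leaderless_partitions(bootstrap_servers: str) -> list[tuple[str, int]]:
    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    metadata = admin.list_topics(timeout=10)  # full cluster metadata snapshot
    broken = []
    for topic_name, topic in metadata.topics.items():
        for partition_id, partition in topic.partitions.items():
            # leader == -1 means the partition currently has no leader at all
            if partition.leader == -1:
                broken.append((topic_name, partition_id))
    return broken

if __name__ == "__main__":
    for topic, partition in leaderless_partitions("kafka.example.internal:9092"):
        print(f"{topic}[{partition}] has no leader")
```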
In fixing them, we also repaired consumer group metadata topics, which revealed, at 14:27 UTC, that our SLO product had been partially stuck since the start of the outage. Because some consumer group topic partitions were broken, our event consumers for SLOs had been working fine on some partitions but were fully stalled on others, despite reporting healthy: they were idling as “online” but not seeing that they were late. Once repaired, they caught up, and by 19:15 UTC they were backfilled and back to normal.
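This “online but stalled” failure mode is easy to miss when health checks only ask whether the consumer process is alive. A hedged sketch of the kind of lag check that catches it, comparing committed offsets against partition high watermarks (using the confluent-kafka Python client and placeholder names, not our actual SLO pipeline):

```python
# Illustrative sketch: measure consumer group lag per partition instead of
# trusting liveness alone. Group and topic names are placeholders.
from confluent_kafka import Consumer, TopicPartition

def partition_lag(bootstrap_servers: str, group_id: str, topic: str) -> dict[int, int]:
    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": False,
    })
    try:
        metadata = consumer.list_topics(topic, timeout=10)
        partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]
        committed = consumer.committed(partitions, timeout=10)
        lag = {}
        for tp in committed:
            # High watermark = newest offset on the broker; committed = where we are.
            low, high = consumer.get_watermark_offsets(tp, timeout=10)
            current = tp.offset if tp.offset >= 0 else low  # no commit yet: assume low
            lag[tp.partition] = max(high - current, 0)
        return lag
    finally:
        consumer.close()

if __name__ == "__main__":
    for partition, behind in partition_lag("kafka.example.internal:9092",
                                           "slo-consumers", "telemetry-events").items():
        print(f"partition {partition}: {behind} events behind")
```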
However, before then, the issue with Kafka disk storage got worse. We feared that with full disks, the entire Kafka cluster would reach an irreparable state (where the only way to free space is to perform dangerous, untested operations on storage data), so at 15:09 UTC, with less than 5% of disk space left, we instead chose to turn off ingest entirely. Ingest would have turned itself off anyway once the disks filled, so we elected to keep recovery simpler by cutting traffic off a few minutes ahead of time.
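As a rough illustration of that “cut off ahead of the cliff” decision (the threshold, path, and automation here are invented for the example, not our actual tooling):

```python
# Illustrative sketch: stop accepting ingest while there is still headroom,
# rather than letting brokers hit a full disk. Mount point and threshold are
# assumptions for the example.
import shutil

DISK_PATH = "/var/lib/kafka"      # hypothetical broker data volume
MIN_FREE_FRACTION = 0.05          # stop ingest below 5% free space

def should_stop_ingest(path: str = DISK_PATH) -> bool:
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) < MIN_FREE_FRACTION
```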
We quickly scrambled to try and add new Kafka brokers and partitions, which wouldn’t be full, so we could shift traffic onto them. But before we were done, we thought of disabling our tiered storage. Our Kafka brokers store only a few hours of data locally (“hotset” data), and tier out a longer retention period (2-3 days) to S3. All data written to Kafka is quickly replicated to our storage engine, and the longer retention is only kept for disaster recovery. We had tried repairing topics and reducing the size of our hotset for many hours before, but nothing had the desired effect. Our theory was that tiering data was broken anyway, and that we were stuck waiting for an offload of data that would never happen.
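For readers unfamiliar with Kafka tiered storage, the hotset/total-retention split looks roughly like this in open-source Kafka terms (KIP-405 topic configs). This is only an illustration: the topic name and values are invented, and our actual tiered-storage implementation and settings are not covered in this report.

```python
# Illustrative sketch of the hotset/tiering split using open-source Kafka's
# KIP-405 topic-level configs. Values and topic name are assumptions.
from confluent_kafka.admin import AdminClient, ConfigResource

HOTSET_CONFIG = {
    "remote.storage.enable": "true",                  # tier older segments out to object storage
    "local.retention.ms": str(6 * 60 * 60 * 1000),    # keep ~6h on broker disks (the "hotset")
    "retention.ms": str(3 * 24 * 60 * 60 * 1000),     # ~3 days total, including tiered data
}

def apply_topic_config(bootstrap_servers: str, topic: str, config: dict[str, str]) -> None:
    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    # Note: alter_configs is non-incremental, so the dict should be the full
    # desired set of dynamic configs for the topic.
    resource = ConfigResource(ConfigResource.Type.TOPIC, topic, set_config=config)
    for _, future in admin.alter_configs([resource]).items():
        future.result()  # raise if the broker rejected the change
```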
At 21:32 UTC, we disabled tiered storage altogether, and most of the disk space was recovered. We abandoned the cluster expansion, which was minutes from completion, and at 21:49 UTC, ingestion of customer data was turned back on. In total, we were unable to accept traffic for 6 hours and 23 minutes.
Our responders stayed online trying to clean up further issues until roughly 5 am UTC on December 7, then disbanded to come back during the day. Sunday was mostly spent stabilizing the emergency configuration changes that had been made, and the response team, who had worked around the clock since Friday night, was able to get some rest while our EU cluster was mostly healthy again, aside from the activity log and one of our storage partitions that was still unavailable for querying.
On Monday December 8, employees who weren’t on call over the weekend took over stabilization work. Efforts went into restoring querying for the roughly 1/40th of our storage data that was still impacted, which succeeded. No significant progress was made on the activity log feature’s storage; instead, the team increased the retention of its changesets to 7 days (the maximum allowed in our storage engine). Our thinking was that once the Kafka cluster was stable and these partitions fixed, we could simply replay events and insert everything back.
Tuesday December 9 was spent trying to further stabilize our Kafka cluster and turn some features back on, but we made little progress in salvaging partitions. We started corrective work while our Kafka experts tried to see what they could do with the many still-damaged metadata and internal topics, which were less critical to staying up but still very important.
On December 10, knowing we only had a few days to salvage Activity Log data, we decided to stop trying to save its Kafka topics and to instead recreate them. At 19:25 UTC, we found that deletion operations failed, and that our Kafka cluster’s control plane no longer let us do any manipulation whatsoever aside from listing topics: we could not delete, create, describe, or mutate any of them. The brokers were fine accepting and transmitting events, but we were essentially unable to administer the cluster anymore.
What we realized at that point in time is that our initial cluster outage had severely damaged multiple internal partitions used to manage the cluster itself; as we cut off tiered storage (and as the remaining data aged out), the damage became more or less irreversible.
We then considered our chances of salvaging the cluster to be rather low, and feared that most small changes (such as assigning a new controller) could collapse the cluster and result in a large outage with data loss. Internal details of some of our applications are relevant here: our storage engine is tightly coupled to Kafka’s own event offsets. To prevent data consistency issues, it will refuse to “roll back” to an older offset, since that could indicate a misconfiguration or some other problem that could lead to re-reading, duplicating, or losing data. As such, just “resetting” the cluster was not doable at this point without adding a lot of infrastructure or writing emergency fixes that would significantly change our storage engine’s boot sequence and safety checks.
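To make that coupling concrete, here is a hedged sketch of the kind of safety check described above. The names, on-disk state format, and paths are invented; our storage engine’s actual checks are not shown in this report. The idea is simply that on startup, the consumer compares its durably recorded position against what the broker reports, and refuses to proceed if offsets appear to have moved backwards.

```python
# Illustrative sketch: refuse to start consuming if the broker's offsets look
# "older" than what we have already durably applied, since that usually means
# we are pointed at the wrong cluster/partition or data was rewound.
# The state file format and names are assumptions for the example.
import json
from pathlib import Path

from confluent_kafka import Consumer, TopicPartition

STATE_FILE = Path("/var/lib/storage-engine/last_applied_offsets.json")  # hypothetical

def check_offsets_safe(bootstrap_servers: str, topic: str) -> None:
    last_applied: dict[str, int] = json.loads(STATE_FILE.read_text())
    consumer = Consumer({"bootstrap.servers": bootstrap_servers,
                         "group.id": "storage-engine-safety-check"})
    try:
        for partition_str, applied_offset in last_applied.items():
            tp = TopicPartition(topic, int(partition_str))
            low, high = consumer.get_watermark_offsets(tp, timeout=10)
            if high < applied_offset:
                # The broker holds fewer events than we already consumed: rolling
                # back could re-read, duplicate, or lose data, so refuse to boot.
                raise RuntimeError(
                    f"{topic}[{partition_str}]: broker high watermark {high} is behind "
                    f"last applied offset {applied_offset}; refusing to start"
                )
    finally:
        consumer.close()
```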
Later, still on December 10, we decided to split efforts into a) keeping up attempts to save our cluster, to the extent that doing so didn’t risk its stability, and b) starting an emergency migration project, which required figuring out how to modify our storage engine, infrastructure, and multiple services to tolerate moving from the damaged Kafka cluster to a new one. We also prepared for multiple contingencies in case the current cluster were to die before the migration was ready to run.
December 11 and December 12 were spent working on this migration at high priority. Meanwhile, our low-risk efforts to salvage the Kafka cluster yielded no great results. As we neared the end of Friday December 12, our retention of activity for the Activity Log also came to an end. In deciding between causing a major outage to rush an evacuation of our Kafka cluster before we were ready, or losing days of Activity Log data while waiting for a safer migration path, we chose the latter.
On Monday December 15, we managed to boot a new Kafka cluster with new infrastructure, and had all the fixes required for all services to do a migration, along with runbooks from every team involved. We decided to do a “dress rehearsal” in a pre-production environment, where we tried a full evacuation from a functional Kafka cluster to a brand new one, finding what the tricky parts were and making sure that if we were to damage or lose data, it would be internal telemetry, and not customer data.
Fortunately, everything went well, and on Tuesday December 16, we ran the full emergency evacuation. Running it required switching all ingest over from the old, damaged Kafka cluster to the new one. We started by switching the components that write data into the cluster (producers), letting the consumers catch up on all topics and partitions of the old cluster. We then migrated our consuming services, starting with the query engine and then the other consumers. This order of operations ensured that we would not corrupt, damage, or miss any data, but forced us to accept delays on alerting and on the freshness of query data. This happened without major issues.
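A hedged sketch of that producers-first ordering, reduced to its essentials (the orchestration below is invented for illustration; the service names and the reconfigure helpers are placeholders, and the real migration involved many services and runbooks):

```python
# Illustrative sketch of a producers-first cutover: point producers at the new
# cluster, wait for each consumer group to fully drain the old one, then move it.
import time

from confluent_kafka import Consumer, TopicPartition

def reconfigure_producers(bootstrap_servers: str) -> None:
    """Hypothetical deploy step: point all producing services at the new cluster."""
    print(f"producers -> {bootstrap_servers}")

def reconfigure_consumer_group(group: str, bootstrap_servers: str) -> None:
    """Hypothetical deploy step: restart one consuming service against the new cluster."""
    print(f"{group} -> {bootstrap_servers}")

def group_fully_drained(bootstrap_servers: str, group_id: str, topic: str) -> bool:
    """True once the group's committed offsets have reached the high watermarks."""
    consumer = Consumer({"bootstrap.servers": bootstrap_servers, "group.id": group_id,
                         "enable.auto.commit": False})
    try:
        metadata = consumer.list_topics(topic, timeout=10)
        partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]
        for tp in consumer.committed(partitions, timeout=10):
            _, high = consumer.get_watermark_offsets(tp, timeout=10)
            if tp.offset < high:      # still unread events on the old cluster
                return False
        return True
    finally:
        consumer.close()

def cut_over(old_cluster: str, new_cluster: str, topic: str, groups: list[str]) -> None:
    reconfigure_producers(new_cluster)
    for group in groups:              # e.g. ["query-engine", "slo-consumers"]
        while not group_fully_drained(old_cluster, group, topic):
            time.sleep(30)            # alerting and query freshness lag during this wait
        reconfigure_consumer_group(group, new_cluster)
```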
During the migration, we believe our trigger data was stale and triggers ran on incomplete data for roughly 30 minutes, if we average out most partitions. SLOs were delayed for a bit more than one hour, and service maps will have flat-out skipped that hour as well. Activity Log data was lost between December 5 at 20:23 UTC and December 9 at 23:45 UTC.
These are added to the 10 days during which Activity Log data was unavailable, the ingestion issues of December 5, the 6 hours and 23 minutes of full ingest outage on December 6, and the 18 hours of delays in SLO processing between December 5 and 6.
At this point in time, we have fully mitigated this incident and developed new processes that promise better readiness to respond if similar outages were to happen again. A more in-depth review will be published in a few weeks, in January. This has been a significant outage, and we will need some time to do a proper analysis of it.