This incident started on December 5th, and is one of the longest in Honeycomb history, having been actively worked on and closed only on December 17th. Due to its impact and duration, we wanted to offer a partial and preliminary report to explain, at a high level, what happened.

On December 5th, at 20:23 UTC, our Kafka cluster suffered a critical loss of redundancy. Our Kafka cluster contains multiple topics, including all telemetry events submitted by Honeycomb users, the rematerialization of state changes into the activity log, and multiple metadata topics used by Kafka to manage its own workloads. This led multiple partitions leaderless and by 20:35 UTC, we were getting alerts that roughly a quarter of our usual event topic partitions were unable to accept writes.

For most Honeyc…

Similar Posts

Loading similar posts...

Keyboard Shortcuts

Navigation
Next / previous item
j/k
Open post
oorEnter
Preview post
v
Post Actions
Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Recommendations
Add interest / feed
Enter
Not interested
x
Go to
Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Browse
gb
Search
/
General
Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help