This post is a deep dive into how we improved the P95 latency of an API endpoint from 5s to 0.3s using a niche little computer science trick called a bloom filter.
We’ll cover why the endpoint was slow, the options we considered to make it fast and how we decided between them, and how it all works under the hood.
Intro
A core concept of our On-call product is Alerts. An alert is a message we receive from a customer’s monitoring systems (think Alertmanager, Datadog, etc.), telling us that something about their product might be misbehaving. Our job is to figure out who we should page to investigate the issue.
We store every alert we receive in a big ol’ database table. As a customer, having a complete history of every alert you’ve ever sent us is useful for spotting trends, debugging complex incidents, and understanding system health. It’s also helping us build the next generation of AI SRE tooling.
We surface alert history in the dashboard. Here’s an example of ours today:
Alert history in the dashboard
This is cool, but for large organisations with lots of alerts, this view isn’t very helpful. They want to be able to dig into a subset of this data. The little “Filter” button in the UI lets you do exactly that.
You can filter by things like source and priority. Source is the monitoring system that sent the alert, and priority is configured on a per-customer basis and provided in the alert message we receive.
In our setup, we have these “Team” and “Features” columns in the UI too. Teams are a first-class citizen in incident.io, but a “Feature” is a concept that we’ve defined ourselves. Both are powered by Catalog. Any incident.io customer can model anything they want in Catalog, and then define rules to tell us how to match an Alert to a specific entry.
This is now really powerful. For example, I can ask for “alerts with a priority of urgent or critical, assigned to the On-call team, affecting the alerts feature”:
Filtering alerts in the dashboard
These filters are great for our customers, and, as we found out, a potential performance nightmare for us. As we’ve onboarded larger customers with millions of alerts, our slick, powerful filtering started to feel… less slick.
Some customers reported waiting 12 seconds for results. Our metrics put the P95 response time for large organisations at 5s. Every time they loaded this page, updated a filter, or used the infinite scrolling to fetch another page, they looked at a loading spinner for far too long.
How filtering works (and why it was slow)
Let’s start with how we store alerts in our Postgres database, and the algorithm we use to fetch a filtered result set.
Here’s a simplified representation of our alerts table:
+----------+------------------+----------------------+------------------+--------------------------------------------------+
| id | organisation_id | created_at | priority_id | attribute_values |
+----------+------------------+----------------------+------------------+--------------------------------------------------+
| $ALERT1 | $ORG1 | 2025-10-31 00:00:00 | $lowPriorityID | { |
| | | | | "$myTeamAttributeID": "$onCallTeamID", |
| | | | | "$myFeatureAttributeID": "$alertFeatureID" |
| | | | | } |
+----------+------------------+----------------------+------------------+--------------------------------------------------+
| $ALERT2 | $ORG2 | 2025-10-31 00:01:00 | $highPriorityID | { |
| | | | | "$myTeamAttributeID": "$responseTeamID" |
| | | | | } |
+----------+------------------+----------------------+------------------+--------------------------------------------------+
- id is a ULID: a unique identifier that’s also lexicographically sortable. We use them to implement pagination. $ALERT{N} isn’t a valid ULID, but it is easy to read in a blog post.
- organisation_id is another ULID, identifying the organisation/customer.
- created_at is a timestamp that does what it says on the tin.
- priority_id is a foreign key that references your organisation-specific priorities. Each incident.io customer can define the priorities that work for them.
When you set up an alert source like Alertmanager, you specify the “attributes” we should expect to find in the metadata we receive - in this example: priority, team and feature. First-class dimensions like priority have their own columns in the database, but everything else that’s custom to you is stored as attribute_values in JSONB.
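To make that shape concrete, here’s a minimal sketch of how an alert row and its custom attributes might look in application code. It’s written in Go and uses illustrative field names purely for the example; it isn’t our actual schema binding.

```go
package main

import "fmt"

// Alert mirrors the simplified table above: a few first-class columns,
// plus a JSONB bag of custom attribute values keyed by attribute ID.
type Alert struct {
	ID              string            // ULID, e.g. $ALERT1
	OrganisationID  string            // ULID identifying the customer
	CreatedAt       string            // timestamp (a string here, for brevity)
	PriorityID      string            // foreign key to the org's priorities
	AttributeValues map[string]string // attribute ID -> catalog entry ID
}

func main() {
	alert := Alert{
		ID:             "$ALERT1",
		OrganisationID: "$ORG1",
		CreatedAt:      "2025-10-31 00:00:00",
		PriorityID:     "$lowPriorityID",
		AttributeValues: map[string]string{
			"$myTeamAttributeID":    "$onCallTeamID",
			"$myFeatureAttributeID": "$alertFeatureID",
		},
	}

	// The team this alert was matched to, via the customer's Catalog rules.
	fmt.Println(alert.AttributeValues["$myTeamAttributeID"]) // $onCallTeamID
}
```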
Extracting attributes from alert payloads
The algorithm to compute a filtered result set works like this:
- Construct a SQL query including filters that can be applied in-database
- Fetch rows from the database in batches of 500, and apply in-memory filters
- Continue fetching batches until we’ve found enough matching alerts to fill a page
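Here’s a hedged Go sketch of that loop, to make the moving parts concrete. The Filter shape, the fetchBatch helper, and the SQL in its comment are assumptions for the purpose of illustration, not our real implementation.

```go
package alerts

// Alert is trimmed down to the fields this sketch needs.
type Alert struct {
	ID              string            // ULID, also used as the pagination cursor
	AttributeValues map[string]string // attribute ID -> catalog entry ID
}

// Filter captures what the user asked for. Priority can be pushed into SQL
// because it has its own column; custom attributes have to be checked in
// memory against the JSONB attribute_values.
type Filter struct {
	PriorityIDs     []string          // applied in-database
	AttributeValues map[string]string // attribute ID -> required value, applied in memory
}

// fetchBatch stands in for a query along the lines of:
//
//	SELECT ... FROM alerts
//	WHERE organisation_id = $1 AND priority_id = ANY($2) AND id < $3
//	ORDER BY id DESC LIMIT 500
//
// An empty beforeID means "start from the newest alert".
func fetchBatch(orgID string, f Filter, beforeID string, limit int) []Alert {
	// Database access elided: this is a sketch, not a working data layer.
	return nil
}

// matchesInMemory applies the filters we couldn't express in SQL.
func matchesInMemory(a Alert, f Filter) bool {
	for attrID, want := range f.AttributeValues {
		if a.AttributeValues[attrID] != want {
			return false
		}
	}
	return true
}

// listAlerts keeps fetching batches of 500 until it has filled a page.
func listAlerts(orgID string, f Filter, pageSize int) []Alert {
	const batchSize = 500
	var page []Alert
	cursor := ""

	for {
		batch := fetchBatch(orgID, f, cursor, batchSize)
		if len(batch) == 0 {
			return page // ran out of alerts before filling the page
		}
		for _, a := range batch {
			if matchesInMemory(a, f) {
				page = append(page, a)
				if len(page) == pageSize {
					return page
				}
			}
		}
		cursor = batch[len(batch)-1].ID // oldest alert seen so far
	}
}
```

The expensive part is that final loop: when the in-memory filters are selective, most of each 500-row batch gets thrown away, and we need many round trips to fill a single page.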
Let’s work that through using the same example as above. The infinite scrolling UI hides pagination parameters, but they’re always defined behind the scenes. Let’s ask for “50 alerts with a priority of urgent or critical, assigned to the On-call team, affecting the alerts feature”.
Priority has its own column in the DB, so we can get Postgres to do that filtering for us. The SQL to get the first batch of alerts looks something like the query in the sketch above: filter by organisation and priority, order by ID, LIMIT 500. … Postgres reads all 500K to sort them, only to throw away 495.5K for our LIMIT 500 clause.
The bloom filter’s query plan looked like:
- Use idx_alerts_pagination to “stream” alert tuples sorted by ID
- Filter the “stream” using efficient bitwise logic
- Continue until 500 matching tuples are found
This is really fast when 500K alerts match the filters (out of, say, a million in total), because we have a 50% chance that any one alert matches, so we end up reading only a small part of the index. When only 500 alerts match, we have a 0.05% chance that each alert matches, and we end up reading much more of the index.
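If bloom filters are new to you, here’s a deliberately tiny sketch of the idea in Go. It isn’t our production code: a real implementation would size the filter and choose hash functions based on expected attribute counts and an acceptable false-positive rate.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a tiny 64-bit bloom filter. Imagine each alert storing one of
// these, with bits set for every attribute value attached to the alert.
type bloom uint64

// positions derives two bit positions from a single FNV hash of the value.
// That's plenty for a sketch; production code would choose more carefully.
func positions(value string) [2]uint {
	h := fnv.New64a()
	h.Write([]byte(value))
	sum := h.Sum64()
	return [2]uint{uint(sum % 64), uint((sum >> 32) % 64)}
}

// add sets the bits for a value.
func (b bloom) add(value string) bloom {
	for _, p := range positions(value) {
		b |= 1 << p
	}
	return b
}

// mightContain reports whether the filter could contain the value.
// False positives are possible; false negatives are not.
func (b bloom) mightContain(value string) bool {
	for _, p := range positions(value) {
		if b&(1<<p) == 0 {
			return false
		}
	}
	return true
}

func main() {
	var b bloom
	b = b.add("$onCallTeamID")
	b = b.add("$alertFeatureID")

	// Filtering a "stream" of alerts becomes a couple of bitwise ANDs per row.
	fmt.Println(b.mightContain("$onCallTeamID"))   // true
	fmt.Println(b.mightContain("$responseTeamID")) // false (or, rarely, a false positive)
}
```

The key property is that a “no” is definitive, while a “yes” might be a false positive, so rows that pass the bitwise check still need verifying against the real attribute values.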
The latencies are well within bounds for a much improved user experience, and we couldn’t think of a good reason why one of the frequent or infrequent alert scenarios was preferable to the other, so we called this a draw from a performance perspective. However, we’d illustrated a critical issue with both options - they scale with the number of organisation alerts. We keep a complete history of customers’ alerts, so whilst either would deliver what we wanted now, performance would degrade over time.
Time makes fools of us all
Whilst we were drunk on maths and computer science, something really simple had been staring us in the face.
Our pagination sorts alerts by the time we receive them, and we show customers their most recent alerts first in the UI. Most of the time they’re interested in recent history, and yet our queries can get expensive because we’re searching all the way back to the first alert they ever sent us. Why? If we can partition our data by time, we can use this very legitimate recency bias to unlock a lot of performance.
The available filters in the dashboard include one for the created_at column we have in the alerts table. We made this mandatory, and set 30 days as a default value. We even had an idx_alerts_created_at index on (organisation_id, created_at) ready to go!
The GIN index query plan now uses two indexes:
- Intersect indexes to find all recent alerts that match the filters
  - idx_alerts_attribute_values applies the filters
  - idx_alerts_created_at finds alerts in the last 30 days
- Read all of the alerts
- Sort and take the top 500
The bloom filter query plan doesn’t need to change at all, thanks to a very useful detail we touched on at the start. Our alert IDs are ULIDs - unique IDs that are lexicographically sortable. ULIDs have two components: a timestamp and some “randomness”. We can use “30 days ago” as a timestamp, stick some randomness on the end, and use it in a range query with idx_alerts_pagination.
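Here’s a hedged sketch of that boundary trick in Go, using the oklog/ulid package purely as an example library (any ULID implementation with a timestamp constructor would do). It uses zero random bytes rather than actual randomness, which gives the lowest possible bound for that timestamp.

```go
package main

import (
	"fmt"
	"time"

	"github.com/oklog/ulid/v2"
)

func main() {
	// The smallest possible ULID whose timestamp component is "30 days ago".
	// Leaving the random bytes as zero means every alert created after that
	// instant has an ID that sorts above this bound.
	cutoff := time.Now().Add(-30 * 24 * time.Hour)

	var lowerBound ulid.ULID
	if err := lowerBound.SetTime(ulid.Timestamp(cutoff)); err != nil {
		panic(err)
	}

	fmt.Println(lowerBound.String())

	// The pagination query can then add a simple range predicate, e.g.
	//   WHERE organisation_id = $1 AND id > $2
	// which idx_alerts_pagination can satisfy without visiting older rows.
}
```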
Both query plans still technically scale with the number of organisation alerts, but the amortised cost is now much better. Customers can select large time ranges to analyse, which might take some time to process, but we’re trading that off against the performance we gain for much more common use cases. We’re going to have to onboard customers with biblical alert volumes before the default 30-day window will cause UX issues. That’ll be a nice problem to have, and one we don’t feel the need to design for now.
Now that we’ve solved the scaling issue, we’re back at our stalemate: how do we pick one to implement?
The debate that followed was what you might call “robust”. Respectful, but rigorous.
GIN indexes can be large, on disk and in memory, and have high write overhead. We were concerned that as the index bloated, it could take up too much space in Postgres’s shared buffers, potentially harming performance in other parts of the platform. We didn’t have high confidence that we wouldn’t be back in a few months trying to fix a new, more subtle problem.
Bloom filters are a pretty niche topic: we thought the code would be hard to understand and change if you weren’t involved in the original project, and we felt uncomfortable about essentially implementing our own indexing mechanism - that’s what databases are for.
In the end we were very resistant to the idea of re-work. We bet on the thing we were confident we’d only have to build once, even if it was a bit more complex - the bloom filter.
Conclusion
Combining the mandatory 30-day time bound with the bloom filtering has had a massive impact: the P95 response time for large organisations has improved from 5s to 0.3s. That’s a ~16x improvement! The bloom filter reduces the number of rows in each batch, and the mandatory time bound reduces the maximum number of batches.
Our powerful and slick filtering is slick again, and it should stay slick as we continue to onboard large customers. As of September 2025 we ingest more than a million alerts per week, and that number will only increase!
This is a great example of how technical and product thinking can go hand in hand. However you go about it, this was a hard technical problem to solve, and I’m lucky to work alongside incredibly talented engineers. But understanding our customers and how they use our product led us to a vital piece of the puzzle. We needed both to really nail this.