I attended a presentation by Cloudflare about how they helped X (formerly Twitter) improve their reliability and network performance.
To my knowledge, you will not find this information anywhere else online.
It was presented by Alex Krivit (PM), **David “Tubes” Huber** (Director, Product Management), and Sachin Fernandes (Engineering Manager) at the Cloudflare Connect 2025 partner and customer summit in Las Vegas last week.
Here are my detailed notes from the session for anyone who might find it helpful.
What X Uses Cloudflare For
X deployed several Cloudflare products as part of their infrastructure modernization:
- CDN and content delivery as the primary focus
- Argo for dynamic performance optimization and congestion solutions (with custom tuning)
- PNI and CNI (Private Network Interconnect and Cloudflare Network Interconnect) for direct connectivity
- Extensive peering work to improve global connectivity
- Security tools, though these were mentioned as being outside the scope of this particular presentation and possibly under NDA
The engagement discussed in the presentation focused heavily on both network-level products and CDN delivery optimization.
Alex Krivit on stage talking about X infrastructure
X’s Infrastructure
To understand the challenge, it’s important to understand X’s existing architecture.
X operates a globally distributed infrastructure with:
- POPs (Points of Presence) distributed globally where users connect to the closest one
- Router and load balancer layers
- An Envoy proxy layer, which plays a role similar to Cloudflare’s edge in many ways
- A private backbone that transits traffic to core data centers, often in North America
- Backend servers that populate the front-end application – either the Grok interface or the X interface
This architecture required significant operational overhead and complexity to maintain globally.
The Three Migration Options
When X approached Cloudflare, they evaluated three potential approaches for integrating Cloudflare into their infrastructure:
Option 1: Front-End Proxy Only
Keep everything in the backend the same and put Cloudflare in front so that Cloudflare’s POPs would talk to X’s POPs.
The benefits of this approach were:
- Fastest onboarding path
- Offload as many requests as possible from X’s infrastructure
- Allow X to scale back and wind down some infrastructure
- Reduce operational costs
- Fastest possible response times for cached content
Option 2: Partial Infrastructure Replacement
Keep X’s POP and global backbone, but establish strong peering with Cloudflare in all locations.
This would remove some of the middle boxes, with Cloudflare acting as the POPs and talking directly to X’s core data centers.
Option 3: Full Decommission
Decommission everything and rely on Cloudflare’s smart routing to handle the movement of traffic back to X’s core data centers.
X went with Option 1 initially, then moved towards a combination of Options 2 and 3 over time (hence the peering work and infrastructure simplification discussed later). See the “How We Fixed” slide below.
Performance Criteria and Success Metrics
X came to Cloudflare with very specific criteria for how they would evaluate success:
- No degradation of P99 round trip time for key operations. This included how users access their timeline, post tweets, view their profile, and other core interactions.
- Statistical improvement in core markets. Specifically the US and Japan, which are core markets for X’s user base.
- Infrastructure complexity reduction. Cloudflare needed to help reduce the operational complexity of X’s infrastructure.
These criteria would drive all technical decisions and validation approaches.
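For concreteness, here is a minimal sketch of how a “no P99 degradation” check might look, using Python’s standard library and made-up latency samples. This is my own illustration, not X’s actual validation harness.

```python
import statistics

def p99(samples_ms):
    """99th-percentile latency from a list of samples in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(samples_ms, n=100)[98]

def p99_degraded(baseline_ms, candidate_ms, tolerance=0.0):
    """True if the candidate's P99 is worse than the baseline's (plus a tolerance)."""
    return p99(candidate_ms) > p99(baseline_ms) * (1.0 + tolerance)

# Made-up round-trip times (ms) for a key operation, before and after a change.
before = [112, 118, 120, 125, 130, 140, 150, 210, 480, 495]
after = [98, 101, 105, 110, 115, 118, 130, 170, 460, 470]
print(p99(before), p99(after), p99_degraded(before, after))
```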
Time to First Tweet: Key Performance Metric
X uses “time to first tweet” as their primary performance metric. They had probes running from actual user devices that would measure this metric in real-world conditions.
The challenge for Cloudflare was that this wasn’t a standard, replicable test.
You can’t simply run “time to first tweet” from an external system – or can you?
Cloudflare sat down with X and identified the specific calls and APIs used to implement “time to first tweet.” By understanding these underlying API calls, Cloudflare was able to replicate the exact same test from their infrastructure.
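The talk didn’t name the underlying endpoints, so here is a hedged sketch of what replicating a metric like “time to first tweet” from the outside could look like: time a fixed sequence of API calls standing in for the requests a client makes before the first tweet renders. The URLs and the `requests` dependency are placeholders and assumptions, not the real test.

```python
import time
import requests  # third-party: pip install requests

# Placeholder URLs standing in for the real API calls behind "time to first tweet";
# the actual endpoints were not disclosed in the talk.
TIMELINE_CALLS = [
    "https://probe-target.example/api/session",
    "https://probe-target.example/api/home-timeline",
]

def time_to_first_tweet_ms():
    """Wall-clock time to complete the whole call sequence, in milliseconds."""
    start = time.perf_counter()
    with requests.Session() as session:  # reuse one connection, like a real client app
        for url in TIMELINE_CALLS:
            session.get(url, timeout=10).raise_for_status()
    return (time.perf_counter() - start) * 1000.0
```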
The “Wait Time” Problem
When analyzing web performance, most people are familiar with standard metrics from browser development tools:
- DNS resolution time
- Connection time
- TLS handshake time
But there’s often a really long bar at the end of waterfall graphs labeled “wait time.”
This wait time represents something happening upstream on a server that needs to finish before the response starts arriving.
Wait time is particularly challenging to debug because it could be many different things happening.
Waterfall image showing “wait time,” later broken down into “Cloudflare time”
The Cloudflare team’s approach was to break down “wait time” into specific “Cloudflare time,” looking at each individual step that a request flows through to get a response, and then finding places to improve efficiency.
For every single request, they wanted to examine each Cloudflare step and understand how to optimize it better. This granular approach allowed them to move from a vague “something is slow” to specific, actionable problems and solutions.
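As an illustration of that kind of breakdown, here is a small sketch that splits one request’s timing into the standard waterfall phases and isolates the opaque “wait” portion. It uses the third-party pycurl package and mirrors browser dev-tools phases, not Cloudflare’s internal “Cloudflare time” steps.

```python
from io import BytesIO
import pycurl  # third-party: pip install pycurl

def phase_breakdown(url):
    """Split one request's total time into waterfall phases (seconds)."""
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.perform()
    dns = c.getinfo(pycurl.NAMELOOKUP_TIME)        # DNS resolution
    connect = c.getinfo(pycurl.CONNECT_TIME)       # TCP connect (cumulative)
    tls = c.getinfo(pycurl.APPCONNECT_TIME)        # TLS handshake (cumulative)
    ttfb = c.getinfo(pycurl.STARTTRANSFER_TIME)    # first response byte (cumulative)
    total = c.getinfo(pycurl.TOTAL_TIME)
    c.close()
    return {
        "dns": dns,
        "tcp": connect - dns,
        "tls": tls - connect,
        "wait": ttfb - tls,       # the long, opaque "wait time" bar in the waterfall
        "download": total - ttfb,
    }
```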
Synthetic Probes at Global Scale
To achieve this level of visibility, Cloudflare deployed synthetic probes as a heartbeat across all Cloudflare POPs. This includes more than 350 locations across the entire world.
These probes would:
- Test from every Cloudflare server back to X’s origins
- Return all the different “Cloudflare times,” which were the breakdown of wait time into specific steps
- Identify which specific server, region, or POP was slow
- Provide data on why specific components were slow
- Allow for aggregations to understand whether a particular piece of Cloudflare infrastructure could be improved
Detailed image of the synthetic probe testing setup
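Here is a rough sketch of what that aggregation step could look like: group probe results by POP, compute a per-POP P99 of the wait time, and flag POPs that sit well above the global median. The 1.5x threshold and the data shape are my own assumptions.

```python
from collections import defaultdict
from statistics import median, quantiles

def slow_pops(samples, threshold=1.5):
    """Flag POPs whose P99 wait time sits well above the global median P99.

    `samples` is an iterable of (pop, wait_ms) tuples, one per probe run.
    """
    by_pop = defaultdict(list)
    for pop, wait_ms in samples:
        by_pop[pop].append(wait_ms)
    p99_by_pop = {
        pop: quantiles(waits, n=100)[98]
        for pop, waits in by_pop.items()
        if len(waits) >= 2  # quantiles() needs at least two samples
    }
    global_median = median(p99_by_pop.values())
    return {pop: p for pop, p in p99_by_pop.items() if p > threshold * global_median}
```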
Performance Bottlenecks Discovered
When Cloudflare ran their synthetic probes replicating X’s time to first tweet metric, they discovered performance issues in four specific locations:
- India
- Lithuania
- Turkmenistan
- Bangladesh
The team presented this data to X, showing them the replicated “time to first tweet” results.
4 countries showing outlier results from testing
After X validated that Cloudflare’s synthetic testing matched their real-world measurements, they agreed to use this methodology to drive performance improvements.
At this point I was a little disappointed that the tests and the presentation only showed what was done to improve those four countries. But I think X uses Cloudflare for the whole system, and this was the only part outside the NDA that they could talk about publicly.
Tactical Fixes
The solutions were remarkably targeted:
India: X had backbone infrastructure in India. Cloudflare turned up peering sessions at the Singapore Internet exchange, and within a day latency in India improved significantly.
Turkmenistan and Lithuania: Peering issues were identified and fixed, bringing down latencies.
Bangladesh: Working with the local partner, Cloudflare was able to improve performance.
Fixes that were implemented
Flamingo: Cloudflare’s Secret Weapon
The presentation revealed a previously internal-only Cloudflare service called Flamingo.
At this point in the talk, the Cloudflare team shifted to presenting their new Flamingo tool and gathering feedback on it. So if you’re only reading this article for details on the X infrastructure, you can stop here.
What is Flamingo?
Flamingo is a service that Cloudflare’s internal teams use to monitor the health and latency performance of their services. Long before X asked about network health and performance, Cloudflare was already using Flamingo internally to monitor their own infrastructure.
When customers like X started asking “what does it look like to monitor the health and performance of services at the edge?”, Cloudflare realized they already had the answer. This became an “aha moment” where Flamingo could be pointed at customer infrastructure.
Overview image of Cloudflare’s Flamingo
Flamingo’s Architecture
Flamingo leverages Cloudflare’s “every service can run everywhere” philosophy:
- Deployed to every Cloudflare data center. Just like any other Cloudflare service.
- Runs from all 350+ cities globally. You can set up probes from these Flamingos pointing to your origin.
- Emits vast amounts of health data. Data collected from origins flows back to a command and control server at Cloudflare core.
- Analyzes service behavior. The system analyzes this data to understand how the tested service is performing.
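To make that data flow concrete, here is a hypothetical sketch of a probe emitting a single health data point back to a central collector. The field names and collector URL are invented; the real Flamingo schema and command-and-control endpoint are internal to Cloudflare.

```python
import json
import time
import urllib.request

def emit_health_point(colo, origin, phases_ms,
                      collector="https://collector.example/ingest"):
    """POST one probe result to a central collector (all names are hypothetical)."""
    point = {
        "ts": time.time(),
        "colo": colo,            # which data center the probe ran from
        "origin": origin,        # the customer origin that was probed
        "phases_ms": phases_ms,  # e.g. the phase_breakdown() output from earlier
    }
    req = urllib.request.Request(
        collector,
        data=json.dumps(point).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```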
Flamingo by the Numbers
The scale of Flamingo is huge:
- 1 billion+ data points emitted on health for upstream services.
- 40,000+ metals (Cloudflare’s term for its physical servers), meaning that every single Cloudflare server has a Flamingo monitoring it.
- Infinite scale. As Cloudflare adds more data centers and servers, new Flamingos deploy automatically.
Real-World Impact: The Rust Stack Migration
One of Flamingo’s early achievements came during Cloudflare’s replacement of their old front-line proxy with a modernized Rust stack. This new stack handles trillions of CDN requests every day.
When Cloudflare swapped out this critical service, no customers noticed any issues. But Flamingo noticed.
Before any deployment could break customers, Flamingo would alert teams: “Hey, you’re trying to deploy these things but, just FYI, it’s kind of broken.” Teams rely heavily on these health metrics because they can gradually roll out changes based on what Flamingo detects, preventing customer impact.
Beyond Simple Probing
Flamingo is not a traditional prober that just checks for HTTP 200 responses.
Because of the complexity of Cloudflare’s feature stack, teams test roughly 1,000 features across the Flamingos:
- DNS team ensures DNS resolves correctly using Flamingo
- The TCP and Magic teams confirm that Magic packets flow and Magic Firewall blocks packets as expected
- Bot management can be tested to ensure APIs are properly protected
- Header validation can verify expected headers are present
- UDP support is being added to understand how UDP packets are routed
- Parsing capabilities allow Flamingo to parse responses and validate complex behaviors up and down the stack
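As a flavor of what “more than a 200 check” can mean, here is a hedged sketch of a probe check that also validates a response header and parses the body. The header, field names, and the `requests` dependency are illustrative assumptions, not the actual checks Cloudflare’s teams run.

```python
import requests  # third-party: pip install requests

def check_feature(url, expected_header, expected_value):
    """Probe check that asserts more than an HTTP 200: a header and a parsed field."""
    resp = requests.get(url, timeout=10)
    failures = []
    if resp.status_code != 200:
        failures.append(f"unexpected status {resp.status_code}")
    if resp.headers.get(expected_header) != expected_value:
        failures.append(f"header {expected_header!r} missing or wrong")
    try:
        body = resp.json()
        # Illustrative data-localization style assertion on a parsed field.
        if body.get("processing_region") not in (None, "US"):
            failures.append("request was processed outside the US")
    except ValueError:
        failures.append("response body was not valid JSON")
    return failures  # empty list means the check passed
```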
Targeting Capabilities
Since Flamingo deploys to every server, teams can target different geographies, server types, and network tiers. This enables testing of:
- Data localization services – “I want all my data processed in the US. Did that happen or did we leak something?”
- Regional WAF rules – “Did this WAF rule do what I wanted in France?”
- Specific network tiers – Test behavior on different classes of infrastructure
Flamingo v2: The Future
Currently, Flamingo is only available internally at Cloudflare. However, the team is working on Flamingo v2 which will be a public version available to all Cloudflare customers.
The vision is described as “Catchpoint, but beefed up to Cloudflare scale”: customers would be able to monitor their applications from everywhere, all the time, from inside Cloudflare’s infrastructure, whenever they make a change. The team is actively gathering feedback from customers to understand what exposing this capability would look like, potentially integrated into the Cloudflare dashboard and API.
Comparing Synthetic Data to RUM
One question from the audience addressed how synthetic testing relates to real user monitoring data, aka RUM data.
X actually started with RUM data using measurements built from their actual client applications. The success criterion for using Flamingo’s synthetic probes was that the RUM data and synthetic data needed to be “directionally correct.”
Importantly, they didn’t need to show the same numbers. And if I understood it correctly, that was actually the point. RUM data includes everything: TCP connect, DNS resolution, and all the client-side overhead. But as noted earlier, what X was actually concerned about was the wait time.
The beauty of Flamingo is that it bypasses all the front-end variability and allows focus only on the wait time. Cloudflare worked hard to ensure that X’s RUM-like data matched directionally with what their synthetics showed, enabling them to dig into the wait time systematically.
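The talk didn’t spell out how “directionally correct” was measured, but one simple reading is that when RUM latency moves, the synthetic latency should move the same way. Here is a toy sketch of that check, with made-up numbers.

```python
def directionally_consistent(rum, synth):
    """Fraction of consecutive intervals where RUM and synthetic latency move
    in the same direction. A crude reading of "directionally correct"; the
    post doesn't say how Cloudflare and X actually validated this."""
    pairs = list(zip(zip(rum, rum[1:]), zip(synth, synth[1:])))
    same = sum(1 for (r1, r2), (s1, s2) in pairs if (r2 - r1) * (s2 - s1) > 0)
    return same / len(pairs)

# Example: weekly P99s from RUM vs the synthetic probes (made-up numbers).
rum_p99 = [310, 295, 280, 290, 260, 255]
synth_p99 = [120, 112, 105, 110, 98, 97]
print(directionally_consistent(rum_p99, synth_p99))  # 1.0: same trend, different scale
```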
Conclusion
This presentation offered a cool glimpse into how Cloudflare approaches performance optimization at massive scale for a gold key customer. I love Cloudflare (I am their biggest fan!!), and I’m a daily user of X.
I believe there was a lot more that Cloudflare worked with X on that wasn’t discussed. But this was all that they were approved to share in a public format based on privacy agreements.
Kudos to the team at Cloudflare and X for making services faster for customers around the world.
If you want to try Flamingo for your application, email me and I can try to put you in touch with someone on the Cloudflare team. Or reach out directly to one of these guys at Cloudflare: Alex Krivit (PM), David “Tubes” Huber (Director, Product Management), or Sachin Fernandes (Engineering Manager).
Sachin, Alex, and David from Cloudflare