TL;DR:
XDP (eXpress Data Path) is the fastest packet processing framework in linux - but it only works for incoming (ingress) traffic. We discovered how to use it for outgoing (egress) traffic by exploiting a loophole in how the linux kernel determines packet direction. Our technique delivers 10x better performance than current solutions, works with existing Docker/Kubernetes containers, and requires zero kernel modifications.
This post not only expands on the overall implementation but also outlines how existing container and VM workloads can immediately take advantage with minimal effort and zero infrastructure changes.
At Loophole Labs, we live migrate everything - containers, VMs, and even network connections.
During a migration every single packet for a workload needs to be intercepted, modified, encapsulated, encrypted, and rerouted to its new destination - all without the application noticing. Our scale requires us to be able to move workloads across clouds at hundreds of gigabits per second - and with that sort of performance requirement, every single CPU cycle matters.
All of this is to say, we need to be able to process packets at line-rate (however much the underlying network can support, whether that’s 20Gbps or 200Gbps), and there’s really only one approach that lets us do that:
[Figure: Linux Packet Processing Performance Comparison]
In Linux, the gold standard for high-performance packet processing is XDP (eXpress Data Path). By intercepting packets as soon as they arrive at the network driver (before reaching the kernel) XDP is able to achieve line-rate speeds in most environments.
Our own benchmarks above show how easily we were able to reach line-rate with XDP - and major companies like Meta, Cloudflare, and GCore have already been using it in production for more than five years now to handle tens of millions of packets per second.
XDP’s Main Limitation
Unfortunately XDP has one fundamental flaw that everyone accepts as fact: it only works for ingress (incoming) traffic. This isn’t a bug or an oversight - it’s the entire identity of XDP, one of the main characteristics that define it. XDP only processes packets on ingress. Period.
For routers and load balancers, this limitation is perfectly fine: every packet they handle arrives from an external interface, making it all ingress from the kernel’s perspective.
Our network plane, on the other hand, has to run on the same compute nodes as the workloads that we’re live migrating. And when these workloads generate packets - initiating connections, sending responses, etc. - that’s considered egress traffic by the host kernel. XDP simply does not work in this scenario.
A popular method for handling egress packets is Traffic Control (TC), another eBPF-based mechanism that allows for packet processing at both ingress and egress. TC is already commonly used for traffic shaping, queuing, filtering, and policing outbound traffic. In fact, it’s the de facto standard in the Kubernetes ecosystem - CNIs like Cilium and Calico all rely on TC for egress control because, until now, XDP for egress simply wasn’t possible.
Given all of this, TC might seem like an obvious choice for our use case as well, but it has a fundamental flaw of its own:
Performance.
We haven’t been able to process more than 21Gbps with TC on egress (or more than 23Gbps on ingress), which makes it a non-starter for our needs. The reason TC suffers from this performance bottleneck comes down to how (and, more importantly, when) the linux kernel runs the TC program:
[Figure: TC Program Flow Diagram]
As shown in the diagram above, TC programs operate quite late in the networking stack, after packets have already spent some time travelling through the linux kernel. By the time a packet reaches the TC hook, the kernel has already processed it through various subsystems for routing, firewalling, and even connection tracking. This means that we’ve wasted quite a few CPU cycles before our TC program even runs.
Another major limitation of TC is that it works on socket buffers (called struct sk_buff in the linux kernel) which are allocated per-packet. This structure - while necessary for linux’s own packet handling - comes with a significant performance hit due to the allocations themselves as well as the additional memory copies required to populate it. This all becomes doubly problematic when you’re trying to process millions of packets every second.
XDP, on the other hand, not only operates directly on the raw packet memory (because it runs directly in the network drivers before the packet even reaches the linux kernel) but does so at the earliest point in the packet’s lifecycle, meaning almost no CPU cycles have been spent by the time our XDP program starts running. All this results in zero-copy packet processing, meaning packets can be inspected, modified, and redirected with the absolute minimum overhead possible.
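To make this concrete, here’s a minimal, illustrative XDP program (not our production code) showing what operating directly on the raw packet memory looks like: the data and data_end pointers hand you the packet bytes in the driver’s RX ring, and the verifier only asks for a bounds check before you read or rewrite headers in place.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_inspect(struct xdp_md *ctx) {
    // Direct, zero-copy view of the packet as it sits in the RX ring.
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // The verifier requires an explicit bounds check before any access.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    // At this point the ethernet header can be inspected or rewritten in place.
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";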
For us XDP is a hard requirement, and while the industry seems to have accepted that this is impossible, we haven’t.
One of our core beliefs at Loophole Labs is that every so-called “limitation” imposed by modern infrastructure is really just a problem we haven’t solved yet. In the spirit of this, we decided to go digging through the linux kernel source in an attempt to understand exactly why and how the kernel decides to classify a packet as “ingress” in the first place.
As it turns out, linux doesn’t actually classify the packet at all. When a packet arrives at a physical network interface, the network card writes the contents into an RX ring buffer - a memory region allocated by the device driver that the network card can write to directly via DMA (Direct Memory Access).
Next, the network card uses an interrupt to signal the device driver that there’s a packet available for processing. The device driver then copies the packet from the ring buffer into its RX queue. And this is exactly when the XDP program runs: directly on the packet in the RX queue. This process is illustrated in the diagram below:
[Figure: XDP Program Flow Diagram]
If this entire process makes one thing clear, it’s that there is very little work being done between the packet arriving at the physical interface and it being ready for the XDP program to run. The RX queue is the trigger that tells the linux kernel to “classify” the packet as ingress and to run the XDP hook.
As we saw in this diagram, the RX queue is not used at all for egress packets, and this simple limitation is the cause of all our headaches.
Now that we know all this, how can we get around it? As it turns out, we don’t have to.
We were reading through the various linux interface docs, hoping to find some little insight into our predicament, when an interesting virtual interface caught our eye: Virtual Ethernet.
A Virtual Ethernet (veth) device is a pair of network interfaces that act as a direct tunnel between each other. When a packet is transmitted from the TX queue of one side of the veth pair, a pointer to the packet’s memory is simply moved to the RX queue of the other interface. This makes the packet appear as if it had been received by a physical network interface, with very low overhead.
Yep, you read that right - veth interfaces have an RX queue that’s used when receiving a packet from the other side.
To illustrate this better, let’s take an example setup like the one below. We have two applications running in their own network namespaces, with two veth pairs (veth0-A/veth1-A and veth0-B/veth1-B) being used to route traffic out of the namespaces.
[Figure: XDP for Egress Traffic Flow Diagram]
The key insight here is that if we send outgoing traffic through one end of the veth pairs (veth0-A in the diagram above), then from the perspective of the second interface (veth1-A), the packet arrives at the RX queue of an interface, and is now considered ingress traffic. And, since XDP programs can be attached to any interface’s RX queue, our XDP hook will automatically run on that egress packet.
Furthermore, if we run our XDP programs in native mode like in the diagram above, packets can be processed with zero-copy and will bypass the linux kernel entirely when we use XDP_REDIRECT to route directly to the TX queue of the eth0 interface.
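As a rough sketch of that redirect step (the map and program names here are ours, and a real program would first rewrite headers as described later in this post), an XDP program attached to the host-side veth peer can hand egress frames directly to the physical interface using a device map:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Slot 0 is assumed to hold the ifindex of the physical interface (eth0),
// populated from userspace when the program is loaded.
struct {
    __uint(type, BPF_MAP_TYPE_DEVMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u32);
} tx_port SEC(".maps");

SEC("xdp")
int xdp_egress_redirect(struct xdp_md *ctx) {
    __u32 key = 0;

    // Bypass the kernel's networking stack entirely: queue the frame on the
    // physical interface's TX path. Returns XDP_REDIRECT on success.
    return bpf_redirect_map(&tx_port, key, 0);
}

char LICENSE[] SEC("license") = "GPL";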
What makes this discovery even more powerful is that modern container runtimes - Docker, Kubernetes, containerd - already use veth pairs and network namespaces for container networking. Every container you’re running right now is already connected through veth interfaces, and it looks exactly like the diagram above.
That’s right - not only can we use XDP for egress traffic in any of these environments, but we can do it without having to change them in any way.
Unfortunately, while implementing this technique seemed straightforward at first, we quickly hit a snag while benchmarking. Our packets kept getting dropped after our XDP program ran, and at first we couldn’t figure out why.
We decided to run tcpdump on the receiving host and realized the packets weren’t even making it over the network. Next, we decided to run tcpdump on the switch handling the packets, and that’s when we realized what we’d missed.
As it turns out, when you bypass the kernel’s networking stack, you inherit its responsibilities.
Normally, when a packet is sent out via the linux kernel, it handles the routing, checksumming, and ARP resolution for us. But we of course have bypassed the kernel’s networking stack entirely, meaning now we have to take full responsibility for ensuring packets are properly formed and can actually reach their next hop.
Our network plane already handles proper routing for us, but we’d missed both checksumming and ARP resolution.
Checksum Calculations in XDP
Unfortunately, XDP programs are not provided with the same checksum helpers that TC programs get. For NAT (Network Address Translation) or any other packet header modification, you need to recalculate checksums manually - and when performance matters, the trick is to use incremental checksum updates rather than full recalculations:
static __always_inline __u16 csum16_add(__u16 csum, __u16 addend) {
    csum += addend;
    return csum + (csum < addend);
}

// Remove the old IP from the checksum
csum = csum16_add(~tcp->check, ~old_ip_high);
csum = csum16_add(csum, ~old_ip_low);

// Add the new IP to the checksum
csum = csum16_add(csum, new_ip_high);
csum = csum16_add(csum, new_ip_low);

tcp->check = ~csum;
ARP Resolution
The linux kernel normally handles ARP to resolve IP addresses to MAC addresses and automatically sets the destination MAC address in the ethernet layer of the outgoing packet. With XDP however, we need to maintain our own ARP table and pass in the destination MAC ourselves:
struct arp_entry {
    __u8 mac[ETH_ALEN];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, __be32); // IP address
    __type(value, struct arp_entry);
    __uint(max_entries, 65535);
} arp_table SEC(".maps");

// In your XDP program, look up the destination MAC
struct arp_entry *entry = bpf_map_lookup_elem(&arp_table, &dest_ip);
if (entry) {
    memcpy(eth->h_dest, entry->mac, ETH_ALEN);
}
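The arp_table entries themselves have to come from a control plane in userspace. Here’s a minimal sketch of how that could look with libbpf, assuming the map has been pinned at /sys/fs/bpf/arp_table (the pin path and function name are placeholders, not part of our actual implementation):

#include <string.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <bpf/bpf.h>

struct arp_entry {
    __u8 mac[ETH_ALEN]; // must match the BPF-side definition
};

int update_arp_entry(const char *ip_str, const __u8 mac[ETH_ALEN]) {
    // Grab a file descriptor for the pinned map.
    int map_fd = bpf_obj_get("/sys/fs/bpf/arp_table");
    if (map_fd < 0)
        return -1;

    __be32 ip;
    if (inet_pton(AF_INET, ip_str, &ip) != 1)
        return -1;

    struct arp_entry entry;
    memcpy(entry.mac, mac, ETH_ALEN);

    // BPF_ANY: create the entry if it's new, update it otherwise.
    return bpf_map_update_elem(map_fd, &ip, &entry, BPF_ANY);
}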
To validate our overall approach, we set up iPerf3 containers between two 200Gbps-capable EC2 instances in the same AWS VPC. We purposely reduced the MTU to 1500 since traffic to the public internet generally can’t use jumbo frames in the first place.
We used the exact same container networking setup for all three tests - the same standard network namespaces with veth pairs that Docker and every other container runtime uses by default. The only thing we changed was how packets were routed from the container’s veth interface to the host’s physical interface:
- iptables: The default that everyone uses today - PREROUTING chains to move traffic out of the namespace
- Traffic Control: Using a TC egress program on the veth interfaces
- XDP: Our technique - Using an XDP program attached to the veth interfaces
We also decided to benchmark both the generic and native XDP drivers implemented for veth interfaces.
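For reference, switching between the two modes is just an attach-time flag. A rough libbpf sketch (the object file name and program name are placeholders for whatever your build produces) looks like this:

#include <net/if.h>        // if_nametoindex
#include <linux/if_link.h> // XDP_FLAGS_DRV_MODE, XDP_FLAGS_SKB_MODE
#include <bpf/libbpf.h>

int attach_xdp(const char *ifname, int native) {
    int ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;

    struct bpf_object *obj = bpf_object__open_file("xdp_prog.o", NULL);
    if (!obj || bpf_object__load(obj))
        return -1;

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "xdp_egress_redirect");
    if (!prog)
        return -1;

    // The only difference between the native and generic benchmark runs.
    __u32 flags = native ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;

    return bpf_xdp_attach(ifindex, bpf_program__fd(prog), flags, NULL);
}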
We’ll let the results speak for themselves:
[Figure: iPerf3 Benchmark With Various Routing Strategies]
The first two results are exactly what we expect - iptables introduces the most overhead because it routes through the linux kernel, and traffic control performs better but still operates post-socket-buffer and can’t come close to achieving line-rate.
With the generic XDP driver, however, we see something surprising: worse performance than our TC program. After a little digging we realized this actually makes sense. The generic XDP driver does not run on the RX queue and instead, like TC, runs after the socket buffer has been allocated. The worse performance is the result of running the XDP program in the same place as TC but without any of the optimizations that TC benefits from.
With the native XDP driver (which is available in linux 4.19+) we finally see the results we’ve been looking for - we’re routing just shy of line-rate at about 194Gbps, 12.4x the throughput of iptables and about 9.2x the throughput of TC.
One final thing to note here is the error bars, which were significantly smaller with native XDP. This makes sense since the bulk of our performance improvements come from bypassing the linux kernel and doing less work. iPerf3, iptables, and the linux kernel are all constantly fighting for the CPU, which results in inconsistent throughput.
One of the most exciting aspects of this discovery is how immediately applicable it is. We set up our benchmarks to replicate how containers already use network namespaces and veth pairs. This means we can dramatically accelerate container networking without changing how containers work or how they’re orchestrated.
Consider what happens every time a containerized application sends a packet today: it traverses through iptables rules, gets NAT’d, maybe goes through connection tracking, and finally makes it out to the network. All of this happens in the kernel, consuming precious CPU cycles that could be used by actual applications.
With XDP on veth interfaces, we can bypass all of that overhead. The packet goes straight from the container’s namespace through our XDP program to the physical interface. No iptables. No conntrack. Just pure, line-rate packet routing.
While our primary use case at Loophole Labs is live migration - where this technique enables us to transparently reroute connections at line rates during migrations - we recognize the broader impact this can have on container networking as a whole.
That’s why we’re working on a Docker network plugin that implements this technique. It’ll be a drop-in replacement for Docker’s default bridge network driver, except it uses XDP instead of iptables for packet routing.
For simpler container deployments that don’t need the full complexity of Kubernetes networking (microservices, development environments, or edge computing nodes), this could mean:
- Doubling network throughput without any hardware upgrades
- Dramatically reducing CPU usage for network-heavy workloads
- Eliminating iptables as a bottleneck in container-to-container communication
We plan to open source this plugin soon, but the beauty of this technique is that you don’t need to wait for us. Everything you need to implement this yourself is described in this post. The veth pairs are already there, all that’s left is writing the XDP programs to route your packets.
Loophole Labs was built on a very simple premise: Better Building Blocks = Better Applications.
This discovery - that XDP can process egress traffic by taking advantage of veth interfaces - is the best representation of just that, a better building block that results in significantly better applications.
While we’ll be open-sourcing the Docker network plugin for those who want to take advantage of XDP’s egress performance for themselves, this discovery also powers something much bigger: Architect, our live migration platform.
Architect uses this XDP technique (as well as other breakthrough implementations for disk & memory checkpointing) to seamlessly live migrate your containers, VMs, and even active network connections between any clouds or regions - all without your users noticing.
If you’re interested in diving deeper into the technical details or implementing XDP egress in your own infrastructure, join our Discord where our engineering team hangs out and answers questions from the community. Trust me, we love talking about this stuff.
Ready to Use Live Migration?
Join our waitlist to be among the first to dramatically reduce your infrastructure costs while improving reliability.
Going to KubeCon NA 2025?
If you are in Atlanta for KubeCon NA 2025 (November 10-13), stop by Booth #1752 to see live demos of workloads migrating between clouds. We’ll show you exactly how this XDP technique combines with our other innovations to make the impossible, possible.