🚀 The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed It)

Why I Wrote This Article — The Incident That Sparked Everything

This article didn’t come from curiosity.

It came from pain.

One morning, I received a message from the DevOps team:

“Some services are failing to resolve hostnames again —

we’re getting Temporary failure in name resolution.”

And this wasn’t the first time.

It had happened before — randomly, unpredictably, quietly causing latency and connection failures.

As the new Cloud Architect, the responsibility landed on my desk:

“We need this fixed forever — no more band-aids.”

So I started investigating.

(NOTE: I’ll publish a second article soon with the full debugging journey.)

Nothing looked broken at first.

Pods healthy. Cluster stable. CoreDNS replicas are running.

No crashes. No alerts.

But something felt off — so I went deep into metrics.

And there it was:

CoreDNS wasn’t resolving —

it was drowning in NXDOMAIN.

Thousands per second.

It wasn’t an outage.

It was a storm, a silent performance killer: around 80% of our DNS queries were coming back with response code NXDOMAIN.

And the storm had one surprising source…

🕵️ The Real Breakthrough — It Was One Hostname

When I traced DNS traffic volume by hostname,

The data made me stop.

It wasn’t many hostnames.

It wasn’t dozens.

It was one. Roughly 80% to 90% of all DNS queries were for a single host.

Our RabbitMQ endpoint — the heart of our event-driven system — contained only four dots:

rabbitmq.eu-west-1.aws.company.production

And with Kubernetes default ndots=5,

This meant the resolver didn’t treat it as a fully qualified domain.

Instead, the resolver inside the pod tried it against every search domain that Kubernetes had configured before trying the name as written:

rabbitmq.eu-west-1.aws.company.production.default.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.eu-west-1.compute.internal ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production ✅ finally correct

For each attempt:

A lookup ❌
AAAA lookup ❌

🟡 4 to 8 extra DNS queries for every single valid lookup
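The expansion order can be sketched in a few lines of Python. This is a simplified model of glibc-style behaviour, not the resolver's actual code: the function name `candidate_names` is made up for illustration, the search list is the one from the pod's resolver config, and other resolver implementations (musl, for example) may behave differently.

```python
def candidate_names(host: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a glibc-style resolver would try, in order (simplified)."""
    if host.endswith("."):
        # Trailing dot: the name is already absolute, the search list is skipped.
        return [host]
    absolute = host + "."
    expanded = [f"{host}.{domain}." for domain in search]
    if host.count(".") >= ndots:
        # Enough dots: try the name as written first, then fall back to search.
        return [absolute] + expanded
    # Too few dots: walk the search list first, the real name comes last.
    return expanded + [absolute]


SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]
HOST = "rabbitmq.eu-west-1.aws.company.production"

names = candidate_names(HOST, SEARCH)
wasted = len(names) - 1  # every candidate tried before the real name
for name in names:
    print(name)
print(f"wasted lookups: {wasted} (A only), {wasted * 2} (A + AAAA)")
print(candidate_names(HOST + ".", SEARCH))  # trailing dot: a single candidate
```

With four search domains, that is four wasted lookups if only A records are requested and eight if AAAA is requested too.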

RabbitMQ is used everywhere — messaging, telemetry, queues, notifications.

So every millisecond meant more queries → more NXDOMAIN → more pressure.

We weren’t resolving DNS.

We were manufacturing DNS traffic.

⚡ The One-Character Fix That Saved Us

Under pressure and needing a fast mitigation,

I tried a tiny change that felt almost silly:

I added a trailing dot to the hostname.

Just one dot:

rabbitmq.eu-west-1.aws.company.production.

That trailing dot tells the Linux resolver:

“This is a fully qualified domain.

Do not apply search paths.”

The effect was instant:

❌ NXDOMAIN flood dropped immediately
💡 CoreDNS CPU reduced by ≈50%
⚡ Lookup performance improved ~5x
🧘 Zero failures since
😊 Developers finally stopped pinging me about DNS issues

We didn’t scale DNS.

We didn’t tune CoreDNS.

We didn’t rewrite applications.

We removed unnecessary work.

One dot → stability restored.

We later applied additional DNS optimizations to handle even larger query loads

— more on that in the next article.

🔍 The Root Cause: `ndots` in `/etc/resolv.conf`

Every Kubernetes pod has a resolver config like:

search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5

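The ndots value sets a threshold: a hostname needs at least that many dots before the resolver tries it as an absolute name first. With fewer dots, the search domains are tried first and the name as written is tried last.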

If hostname dot-count < ndots → search domains appended
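To check what a given pod would do, the resolver config can be read and compared against a hostname. The sketch below is a minimal parser for illustration only: `parse_resolv_conf` is a hypothetical helper, it handles just the three keywords shown above, and it assumes the default of `ndots:1` documented in resolv.conf(5) when no option is set.

```python
def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search domains and ndots from resolv.conf text."""
    config = {"nameservers": [], "search": [], "ndots": 1}  # ndots defaults to 1
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # a later search line replaces an earlier one
        elif keyword == "options":
            for option in values:
                if option.startswith("ndots:"):
                    config["ndots"] = int(option.split(":", 1)[1])
    return config


SAMPLE = """\
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5
"""

config = parse_resolv_conf(SAMPLE)
host = "rabbitmq.eu-west-1.aws.company.production"
search_first = not host.endswith(".") and host.count(".") < config["ndots"]
print(config)
print(f"{host}: search list tried first = {search_first}")
```

Inside a real pod, the same function could be fed the contents of `/etc/resolv.conf` instead of the inline sample.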

This Kubernetes default exists to support internal service discovery:

service → service.default.svc.cluster.local → resolves successfully

But for external hostnames?

🚩 Disaster waiting to happen.

🧪 Benchmark: Measured Results

I used this Python script to measure the effect of ndots:

#!/usr/bin/env python3
import argparse
import asyncio
import time

import dns.asyncresolver


async def main() -> None:
    parser = argparse.ArgumentParser(
        description="Measure DNS lookup time for multiple queries using current resolv.conf (ndots/search)."
    )
    parser.add_argument("host", help="Hostname to resolve (bare name to exercise ndots/search)")
    parser.add_argument(
        "-n",
        "--queries",
        type=int,
        default=100,
        help="Number of concurrent queries to issue (default: 100)",
    )
    parser.add_argument(
        "-t",
        "--timeout",
        type=float,
        default=2.0,
        help="Per-query timeout in seconds (default: 2.0)",
    )
    args = parser.parse_args()

    resolver = dns.asyncresolver.Resolver()  # uses /etc/resolv.conf (ndots/search respected)
    resolver.timeout = args.timeout
    resolver.lifetime = args.timeout
    resolver.use_edns = False

    async def one_query() -> None:
        try:
            await resolver.resolve(args.host, "A", search=True)
        except Exception:
            # Ignore failures; we only care about timing behavior.
            pass

    tasks = [asyncio.create_task(one_query()) for _ in range(args.queries)]
    start = time.monotonic()
    await asyncio.gather(*tasks)
    elapsed = time.monotonic() - start
    print(f"{args.queries} queries for '{args.host}' in {elapsed:.3f}s ({elapsed/args.queries:.4f}s/query)")


if __name__ == "__main__":
    asyncio.run(main())

Python async DNS resolver test:

Before fix (no trailing dot)

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 100
100 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 2.207s (0.0221s/query)

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 10
10 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 0.302s (0.0302s/query)

100 queries → 2.207s (0.0221 s/query)
10 queries → 0.302s (0.0302 s/query)

After fix (trailing dot → FQDN): rabbitmq.eu-west-1.aws.company.production.

python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 100
100 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.399s (0.0040s/query)
python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 10
10 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.095s (0.0095s/query)

100 queries → 0.399s (0.0040 s/query)
10 queries → 0.095s (0.0095 s/query)

🚀 DNS became ~5x faster (about 5.5x at 100 queries, about 3.2x at 10)
🧨 NXDOMAIN traffic dropped nearly in half
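The speedup figure comes straight from the timings above. As a quick check, this small snippet redoes the arithmetic; the only inputs are the four measured totals, and the helper name `speedup` is just for illustration.

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the 'after' run was than the 'before' run."""
    return before_s / after_s


# (queries, total seconds without trailing dot, total seconds with trailing dot)
runs = [(100, 2.207, 0.399), (10, 0.302, 0.095)]

for queries, before, after in runs:
    saved_ms = (before - after) / queries * 1000  # average time saved per query
    print(f"{queries:>3} queries: {speedup(before, after):.1f}x faster, "
          f"{saved_ms:.1f} ms saved per query")
```

These are single runs from one pod, so treat the ratios as indicative rather than as a rigorous benchmark.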

🛠 Fixing the Problem (Two Options)

1️⃣ Use Fully Qualified Domain Names with a trailing dot

Examples:

api.company.internal.
googleapis.com.
rabbitmq.eu-west-1.aws.company.production.

✔ Easiest fix
✔ No Kubernetes changes
✔ Zero search-domain expansion
✔ Best performance
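If hostnames come from configuration, the trailing dot can be added in one place instead of by hand. The helper below is a hypothetical sketch (`to_fqdn` is not from any library) and is meant only for external hostnames; short in-cluster service names still rely on the search list and should be left alone. Some clients handle trailing-dot hostnames poorly, so it is worth testing each one before rolling this out.

```python
import ipaddress


def to_fqdn(host: str) -> str:
    """Append a trailing dot unless host is an IP literal or already absolute."""
    try:
        ipaddress.ip_address(host)
        return host  # IP literals are not looked up in DNS, leave them untouched
    except ValueError:
        pass  # not an IP address, so treat it as a hostname
    return host if host.endswith(".") else host + "."


for name in ["api.company.internal", "googleapis.com.", "172.20.0.10"]:
    print(f"{name} -> {to_fqdn(name)}")
```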

2️⃣ Reduce ndots for external workloads

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
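For workloads managed by a Deployment, the same setting lives under the pod template. As a rough sketch, the snippet below builds a merge-patch body for that path; `ndots_patch` is a made-up helper, and the value 2 is simply the one used in the example above, not a recommendation for every workload.

```python
import json


def ndots_patch(ndots: int) -> dict:
    """Merge-patch body that sets ndots on a Deployment's pod template."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "dnsConfig": {
                        # The option value must be a string, not an integer.
                        "options": [{"name": "ndots", "value": str(ndots)}]
                    }
                }
            }
        }
    }


print(json.dumps(ndots_patch(2)))
```

The printed JSON could then be passed to something like `kubectl patch deployment <name> --type merge -p '<json>'`. Note that a merge patch replaces the whole `options` list, so any existing DNS options would need to be included as well.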

AWS docs state:

You can reduce the number of requests to CoreDNS by lowering the ndots option of your workload or fully qualifying your domain requests by including a trailing . (e.g. api.example.com. ).

📘 References

Kubernetes Docs https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

AWS EKS https://docs.aws.amazon.com/eks/latest/best-practices/scale-cluster-services.html#:~:text=Reduce%20external%20queries%20by%20lowering%20ndots

Linux Resolver https://man7.org/linux/man-pages/man5/resolv.conf.5.html

🎯 Final Thought

Sometimes the biggest reliability problems come from the smallest defaults.

ndots=5 is perfect for Kubernetes internal services… but for external hostnames it can quietly overwhelm DNS and drag performance down across the entire cluster.

One dot fixed everything.

Fix it once → enjoy peace and performance forever.

💬 If you’d like Part 2 (the full debugging journey — how I traced and proved the root cause), comment below:

Show me the debugging story
