When the “enterprise-grade” orchestrator becomes a full-time job nobody asked for
Look, I need to be honest with you.
Before this incident, I assumed “the database will handle it.”
It won’t.
After debugging this and many similar failures, I condensed the patterns into a small reference I now keep open while building and debugging:
👉 SQL Performance Cheatsheet — The Query Mistakes That Kill Databases in Production
If this saves you even one production incident, it already paid for itself.
Spring Boot makes it very easy to build systems — and very easy to build fragile ones.
After repeatedly hitting the same problems, I wrote down the failure patterns and fixes I now use as my own reference:
👉 Spring Boot Production Cheatsheet — The Stuff That Breaks in Production (And How to Fix It)
Use it if you want to avoid learning these lessons the expensive way.
For six months, I told everyone Kubernetes was the future. I defended it in architecture reviews. I wrote Helm charts. I debugged networking policies at 2 AM. I became that guy who wouldn’t shut up about K8s.
Then last Tuesday, we ripped it all out and moved to Docker Compose.
Our deployment time went from 45 minutes to 4 minutes. Our infrastructure costs dropped by over $600/month. And most importantly — I got my weekends back.
This isn’t a rant against Kubernetes. It’s a cautionary tale about adopting tools you don’t actually need.
I ended up collecting these incidents into a single playbook so teams stop repeating the same mistakes.
👉 Production Failure Playbook — 50 Real Incidents That Cost Companies $10K–$1M
After debugging enough slow systems, you start seeing the same bottlenecks over and over.
I collected the 20 most common ones and how to fix them here: 👉 Backend Performance Rescue Kit — Find and Fix the 20 Bottlenecks Killing Your App
Before we ever deploy to production, we run through a brutal checklist — and when things break, we follow a structured incident process. I made both public:
– Pre-Production Checklist + Incident Response Template (free): 👉
– Production Failures Playbook — 30 real incidents with timelines, root causes, and fixes: 👉
Use them if you want to avoid learning these lessons the expensive way.
How We Got Here
We’re a small SaaS company. 8 engineers. About 12,000 active users. Five microservices:
- API gateway
- User service
- Payment processing
- Background job worker
- Admin dashboard
Revenue was growing. Product was stable. Customers were happy.
Then our new senior engineer joined. Let’s call him Marcus.
Marcus came from a Big Tech company. Spotify or Netflix or one of those places where they have entire teams dedicated to infrastructure. On his second week, he dropped this in our engineering channel:
“We should really be running this on Kubernetes.”
I should’ve asked questions. I should’ve pushed back. But I didn’t.
Because Kubernetes sounded impressive. Modern. Scalable. The kind of thing you put on your resume and feel smart about.
Month 1: The Honeymoon Phase
Marcus set up our first cluster. AWS EKS, three nodes, all the standard components. Ingress controller, cert-manager, metrics-server, the whole ecosystem.
The initial setup took two weeks of his time. Full-time.
But once it was running? It looked beautiful. We had a dashboard. We had auto-scaling. We had health checks and readiness probes and resource limits.
“This is what production infrastructure should look like,” Marcus said, showing off the Grafana dashboards.
I was sold. This felt professional.
Our old Docker Compose setup was just… running on two VMs. Boring. Simple. Worked fine, but where’s the sophistication in that?
Month 2: The Cracks Start Showing
First production issue: a service kept crashing with OOMKilled errors.
In our old setup, I would’ve SSH’d into the server, checked logs, adjusted the memory limit in docker-compose.yml, and restarted. Ten minutes, done.
With Kubernetes?
I spent three hours figuring out:
- Which namespace the pod was in
- How to check resource limits
- Why changing the deployment didn’t trigger a rollout
- What the difference between requests and limits even means
- Why kubectl kept timing out
Eventually fixed it by adjusting memory limits in the Helm values file, updating the chart, and doing a rolling restart.
Three hours for what used to take ten minutes.
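For anyone who hasn't felt that difference, here's roughly what the two fixes look like. This is a sketch: the service name, image tags, and numbers are invented, and your chart layout may differ. The point is how much machinery sits between you and a one-line change.

```yaml
# Old world: docker-compose.yml on the VM (hypothetical service and values)
services:
  payment-service:
    image: registry.example.com/payment-service:1.4.2
    mem_limit: 768m   # bump this, run `docker-compose up -d`, done
---
# New world: the same knob in a Helm values.yaml (assumed chart layout).
# Editing it does nothing on its own; you still need `helm upgrade` and a
# successful rollout before the pods actually pick up the new limit.
paymentService:
  resources:
    requests:
      memory: 512Mi   # what the scheduler reserves on a node
    limits:
      memory: 768Mi   # the ceiling that produces OOMKilled
```

That last pair is also the requests-vs-limits distinction from the list above: requests drive scheduling, limits drive the OOM killer.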
Month 3: The Context Switching Hell
Here’s what a typical bug fix looked like in our Docker Compose days:
- Make code change
- Run docker-compose up --build
- Test it
- Git push
- SSH to production server
- Run docker-compose pull && docker-compose up -d
- Done
Here’s what it looked like with Kubernetes:
- Make code change
- Build Docker image
- Push to registry
- Update image tag in Helm values
- Run helm upgrade
- Wait for rollout
- Check if pods are running
- Check if ingress is working
- Debug why the new pods are in CrashLoopBackOff
- Realize you forgot to update the ConfigMap
- Update ConfigMap
- Delete pods to force restart
- Check logs across three pods to see if it worked
- Realize the load balancer health check is failing
- Debug ingress controller configuration
- Two hours later, your fix is live
Every. Single. Deployment.
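The ConfigMap step deserves a note, because it's the one that kept biting us. In a typical chart layout (assumed below, not necessarily yours) the image tag lives in values.yaml while runtime config lives in a separate ConfigMap, and nothing forces the two to roll out together.

```yaml
# values.yaml (assumed layout): the part you remember to change
image:
  repository: registry.example.com/payment-service
  tag: "1.4.3"   # bump, run `helm upgrade`, pods roll
---
# A ConfigMap rendered by the chart: the part you forget.
# Unless the chart hashes this into the pod template (the common
# checksum-annotation trick), changing it restarts nothing; running pods
# keep the old values until you delete them by hand.
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config
data:
  PAYMENT_PROVIDER_URL: "https://psp.example.com"   # placeholder value
```

That mismatch is exactly the "update ConfigMap, delete pods" detour in the list above.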
Month 4: When Deployments Became Everyone’s Problem
Marcus was our Kubernetes expert. The only one who really understood it.
When he took a week off for his wedding, our deployment pipeline essentially stopped.
A junior dev needed to deploy a hotfix. Simple change, one line of code.
She spent four hours trying to deploy it. Eventually gave up and waited for Marcus to get back from his honeymoon.
The hotfix that should’ve taken 10 minutes took three days.
That’s when I realized: we’d built a dependency on one person’s specialized knowledge. And that knowledge had nothing to do with our actual product.
The Breaking Point
Friday, 4:37 PM. Production alert fires.
Payment service is down. Customers can’t check out. Revenue has literally stopped.
I check the Kubernetes dashboard. The payment service pod is in “Pending” state.
Why? No available nodes with enough resources.
In our Docker Compose world, this problem didn’t exist. Services ran where we told them to run. If we needed more capacity, we’d provision another VM and run part of the stack there with its own compose file.
In Kubernetes? I needed to:
- Understand why the cluster autoscaler didn’t spin up a new node
- Manually scale the node group
- Wait for AWS to provision the instance
- Wait for the node to join the cluster
- Wait for pods to schedule
- Debug why the pod still won’t start (turned out to be a PersistentVolume issue)
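If you've never hit a Pending pod: the scheduler only places a pod on a node whose unreserved capacity covers the pod's resource requests. If no node qualifies and the autoscaler doesn't add one, the pod just waits. A minimal sketch with invented numbers:

```yaml
# Hypothetical: three nodes with ~6 GiB allocatable each, most of it already
# reserved by other pods' requests. This request fits nowhere, so the pod
# stays Pending until a new node appears (autoscaler, or a human scaling the
# node group, which is what we ended up doing).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: registry.example.com/payment-service:1.4.3
          resources:
            requests:
              cpu: "1"
              memory: 4Gi   # more than the free capacity on any node
            limits:
              memory: 4Gi
```

Running kubectl describe pod on the stuck pod shows the scheduler's reasoning (something like "0/3 nodes are available: Insufficient memory"), which is the fastest way to confirm that this is what's happening.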
One hour and forty minutes of downtime. For a problem that shouldn’t have existed in the first place.
The CEO called me. “I thought Kubernetes was supposed to make things more reliable?”
I didn’t have a good answer.
The Weekend That Changed Everything
That Saturday, I sat down and actually measured what Kubernetes was costing us.
Time costs:
- Marcus spending 60% of his time on K8s issues instead of product features
- Every engineer spending 30+ minutes per deployment vs 5 minutes before
- Weekly “Kubernetes troubleshooting” sessions that used to be sprint planning time
- Two engineers who couldn’t deploy without Marcus’s help
Actual money costs:
- EKS control plane: $73/month
- Three t3.large nodes (because K8s overhead requires beefier instances): $450/month
- Load balancers: $180/month (one per service because of how we’d configured ingress)
- Increased data transfer: ~$80/month
- Total: $783/month
Our old Docker Compose setup:
- Two t3.medium VMs: $135/month
- One load balancer: $18/month
- Total: $153/month
We were paying 5x more for infrastructure that made our lives harder.
The Uncomfortable Conversation
Monday morning, I called a meeting with Marcus and our CTO.
“I think we should move back to Docker Compose.”
Marcus looked offended. “Docker Compose isn’t production-grade. It doesn’t scale. It doesn’t have — “
“We’re not Netflix,” I interrupted. “We have five services and 12,000 users. Docker Compose handled it fine before. It’ll handle it fine now.”
“But what about when we scale to 100,000 users?”
“Then we’ll have the revenue to hire a dedicated infrastructure team. Right now, we have eight engineers and we’re spending 60 hours a week managing Kubernetes instead of shipping features.”
The CTO looked at our metrics. Deploy frequency down 60%. Infrastructure costs up 5x. Two engineers who couldn’t ship without help.
“How long to migrate back?” he asked.
“Three days,” I said.
How We Un-fucked Ourselves
Here’s what moving back to Docker Compose looked like:
Day 1: Set up the VMs
Provisioned two t3.medium instances. Installed Docker and Docker Compose. Copied our old docker-compose.yml files. Updated with current image tags.
Total time: 4 hours.
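The compose file itself is nothing clever. A trimmed sketch of what ours roughly looks like; the image names, tags, and ports are placeholders:

```yaml
# docker-compose.yml (sketch): one file, five services, one .env per VM
services:
  api-gateway:
    image: registry.example.com/api-gateway:2.1.0
    ports:
      - "8080:8080"
    env_file: .env          # DB / Redis connection strings live here
    restart: unless-stopped
    depends_on:
      - user-service
      - payment-service

  user-service:
    image: registry.example.com/user-service:3.0.1
    env_file: .env
    restart: unless-stopped

  payment-service:
    image: registry.example.com/payment-service:1.4.3
    env_file: .env
    restart: unless-stopped

  job-worker:
    image: registry.example.com/job-worker:1.2.0
    env_file: .env
    restart: unless-stopped

  admin-dashboard:
    image: registry.example.com/admin-dashboard:0.9.4
    env_file: .env
    restart: unless-stopped
```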
Day 2: Migrate the databases
We’d kept our databases separate (RDS), so nothing to migrate there. Just updated connection strings in environment files.
Deployed services one by one:
- API gateway
- User service
- Payment service
- Job worker
- Admin dashboard
Tested each service. Everything worked first try because we weren’t fighting with pod scheduling, resource limits, or networking policies.
Total time: 5 hours.
Day 3: Update CI/CD and DNS
Updated GitHub Actions to deploy via SSH instead of kubectl. Updated DNS to point to the new load balancer.
Did a final production test. Switched traffic over.
Total time: 3 hours.
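The CI side is similarly small. A sketch of the deploy job, assuming plain SSH with a deploy key; the host names, paths, and secret names are made up, and the image build-and-push steps (which didn't change) are omitted:

```yaml
# .github/workflows/deploy.yml (sketch)
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Set up SSH key
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.DEPLOY_SSH_KEY }}" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
          ssh-keyscan -H vm-a.example.internal vm-b.example.internal >> ~/.ssh/known_hosts

      - name: Pull and restart services, one VM at a time
        run: |
          for host in vm-a.example.internal vm-b.example.internal; do
            ssh -i ~/.ssh/id_ed25519 deploy@"$host" \
              "cd /opt/app && docker-compose pull && docker-compose up -d"
          done
```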
Total migration time: 12 hours spread over 3 days.
Compare that to the six months we’d spent on Kubernetes.
The Results (Two Months Later)
Let me show you what actually changed:
Deployment time:
- Before K8s: 5 minutes
- With K8s: 45 minutes (including debugging time)
- After moving back: 4 minutes
Infrastructure costs:
- Before K8s: $153/month
- With K8s: $783/month
- After moving back: $168/month (slightly higher because we’d grown)
Engineer time spent on infrastructure:
- Before K8s: ~5 hours/week
- With K8s: ~60 hours/week
- After moving back: ~6 hours/week
Deploy confidence:
- Before K8s: Everyone could deploy
- With K8s: Only Marcus could deploy reliably
- After moving back: Everyone can deploy again
Production incidents:
- Before K8s: 2–3 minor issues per month
- With K8s: 8–12 issues per month (mostly K8s-related)
- After moving back: 2–3 issues per month
We got our productivity back. We got our sanity back. We got our weekends back.
What We Actually Learned
Lesson 1: “Enterprise-grade” doesn’t mean “right for you”
Kubernetes is an incredible piece of technology. It solves real problems at real scale.
But it also creates problems. Complexity problems. Knowledge problems. Operational problems.
Those problems are worth solving when you’re running hundreds of services across thousands of nodes.
They’re not worth solving when you have five services and two VMs.
Lesson 2: Boring technology is underrated
Docker Compose is boring. It doesn’t have auto-scaling. It doesn’t have a fancy dashboard. It doesn’t look impressive in architecture diagrams.
But it’s simple. It’s reliable. Everyone on the team understands it.
And “everyone understands it” is worth more than any fancy feature.
Lesson 3: The best abstraction is the one you don’t need
Kubernetes abstracts away a lot of infrastructure complexity. Networking, service discovery, load balancing, health checks.
But we didn’t have that complexity. We introduced it by adopting Kubernetes.
Sometimes the best solution is the one that doesn’t create problems to solve.
Lesson 4: Infrastructure should be invisible
Good infrastructure disappears. You don’t think about it. You just deploy your code and it works.
With Kubernetes, infrastructure became our primary focus. We spent more time managing the platform than building the product.
That’s backwards.
Lesson 5: Complexity is a tax you pay every day
Every piece of complexity in your stack is a tax. You pay it every time you deploy. Every time you debug. Every time you onboard a new engineer.
Kubernetes’s tax was too high for what we got in return.
When Kubernetes Actually Makes Sense
I’m not saying Kubernetes is bad. I’m saying it’s often overkill.
Use Kubernetes when:
- You have 20+ microservices
- You have 50+ engineers
- You need true multi-region/multi-cloud
- You have a dedicated infrastructure team
- You’re actually hitting Docker Compose’s limitations
- You need complex deployment strategies (blue-green, canary at scale)
Don’t use Kubernetes when:
- You have <10 services
- You have <20 engineers
- Docker Compose is working fine
- No one on your team is a K8s expert
- You’re adopting it “for the future”
- You’re doing it because it looks good on resumes
We fell into that last category. And it cost us six months.
The Harder Question: When to Actually Scale Up
“But what about when you do need Kubernetes?” people ask.
Here’s the thing: you’ll know.
You’ll know because Docker Compose will actually become a bottleneck. Deploys will be slow. Managing multiple VMs will become painful. Coordinating updates across hosts will suck.
That hasn’t happened to us yet. We’re at 18,000 users now. Still running on Docker Compose. Still working fine.
When we hit actual limitations — not theoretical ones — we’ll consider alternatives. Maybe Kubernetes. Maybe something simpler like Docker Swarm. Maybe something we haven’t heard of yet.
But we won’t adopt complexity until we have problems that require it.
I’ve seen too many backend systems fail for the same reasons — and too many teams learn the hard way.
So I turned those incidents into a practical field manual: real failures, root causes, fixes, and prevention systems. No theory. No fluff. Just production.
👉 The Backend Failure Playbook — How real systems break and how to fix them:
The Uncomfortable Truth About Modern Infrastructure
Most startups are over-engineering their infrastructure.
They’re adopting Kubernetes because Google uses it. Service meshes because Uber built one. Event-driven architectures because Netflix wrote a blog post.
But Google has 50,000 engineers. Uber has thousands of microservices. Netflix operates at a scale you’ll probably never reach.
You’re not them. Your problems are different.
The stack that’s “embarrassingly simple” might be exactly what you need.
What Our Stack Looks Like Now
Since people always ask, here’s our actual production setup:
Infrastructure:
- Two AWS EC2 instances (t3.medium)
- One Application Load Balancer
- RDS PostgreSQL (db.t3.small)
- ElastiCache Redis (cache.t3.micro)
- S3 for file storage
Deployment:
- Docker Compose on both VMs
- GitHub Actions for CI/CD
- Rolling updates across the two VMs (stop one, update, start, repeat for the second)
- Nginx for reverse proxy and SSL termination
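For completeness, the nginx piece is just another container in each VM's compose file. A sketch, assuming certificates are issued on the host and mounted in; the paths and upstream are placeholders:

```yaml
# Edge sketch per VM: nginx terminates TLS and proxies to the gateway container
services:
  nginx:
    image: nginx:1.25-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass to api-gateway:8080
      - /etc/letsencrypt:/etc/letsencrypt:ro    # certs managed on the host
    depends_on:
      - api-gateway
    restart: unless-stopped
```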
Monitoring:
- CloudWatch for basic metrics
- Sentry for error tracking
- Simple uptime monitoring via UptimeRobot
Total monthly cost: ~$280
It’s boring. It works. We sleep at night.
The Engineer Time We Got Back
Remember those 60 hours I mentioned? Here’s what we did with them:
- Shipped the payment retry feature customers had been asking for
- Rewrote our slowest API endpoint (went from 800ms to 120ms)
- Built an admin tool that saves support team 10 hours/week
- Actually paid down some technical debt
- Started working on the mobile app we’d been delaying
That’s what happens when you stop spending all your time managing infrastructure.
You ship product. You make customers happy. You grow the business.
Crazy concept, right?
When I talk about production failures and operational complexity, these resources have actually saved me from repeating the same mistakes:
For teams dealing with similar infrastructure decisions:
- Spring Boot Production Checklist — The deployment concerns I check before shipping anything
For engineers tired of over-complicated setups:
- Docker Compose Production Patterns — How we actually run services in production (Covered in the Backend Failure Playbook)
For startups questioning their architecture:
- Next.js SaaS Starter Template — A clean, minimal foundation without the bloat
For teams struggling with deployment complexity:
- Production Engineering Cheatsheet — The fundamentals that matter more than tooling
For anyone preparing for backend interviews about infrastructure:
- Top 85 Java Interview Q&A — What actually gets asked about deployment and operations
I’m not selling dreams — just documenting what worked when we stopped chasing trends and started solving real problems.
If you’re dealing with similar infrastructure complexity, all my resources are organized by category here:
What I’d Tell My Past Self
If I could go back to that meeting where Marcus suggested Kubernetes:
“Ask these questions first:
- What specific problem are we solving?
- Is Docker Compose actually preventing us from doing something?
- Do we have someone who can own this full-time?
- What’s the opportunity cost of the migration?
- Are we doing this because it’s necessary or because it sounds impressive?”
We didn’t ask those questions. We just assumed “industry best practice” meant “right for us.”
It didn’t.
The Real Cost of Technical Trends
The worst part about the Kubernetes detour wasn’t the wasted money or time.
It was the features we didn’t ship.
Six months of engineering time went into migrating to and managing Kubernetes. That’s six months we could’ve spent building things customers actually wanted.
How many customers did we lose because we were too busy playing with infrastructure? How much revenue did we miss out on?
I don’t know. But I know it was more than zero.
And that’s the real cost of adopting technology you don’t need.
🔧 Tools that saved me from production hell
After a few years of building and breaking real systems, I noticed a pattern:
I kept losing time (and sometimes sleep) to the same problems: bad test data, silent failures, broken automations, fragile deployments, and debugging in the dark.
So I stopped rewriting the same fixes over and over again — and turned them into small tools and runbooks I actually use.
If you’re dealing with any of these pains, this might save you some time:
❌ “Our tests keep failing because the data is broken.”
Relational Test Data Generator (TDG) Generate realistic, relationally consistent test data — without touching production or breaking foreign keys. 👉 https://devrimozcay.gumroad.com/l/vtuju
❌ “We only discover problems when production is already on fire.”
Production Engineering Toolkit — Real Production Failures A collection of real incidents, failure patterns, and how to avoid them before they hurt. 👉
❌ “Our internal automations are fragile, unmaintainable, and always break.”
Selenium Automation Starter Kit (Python) A clean, extensible base for building internal tools, scrapers, and test automations that don’t rot after a week. 👉 https://devrimozcay.gumroad.com/l/rdablh
❌ “We’re starting a mobile product but don’t want to waste weeks on boilerplate.”
Expo Habit App Boilerplate — Production Ready A ready-to-ship mobile foundation for habit, health, and tracking apps. No setup hell. 👉 https://devrimozcay.gumroad.com/l/mliech
❌ “Nobody taught me what actually matters in production.”
Production Engineering Cheatsheet The fundamentals nobody tells you until things break at 2 AM. 👉
❌ “Spring broke again and I don’t know where to look.”
Spring Boot Troubleshooting — When Things Break in Production A battle-tested debugging guide for Spring systems based on real failures, not docs. 👉
❌ “I know Python, but not how to run it in production.”
Python for Production — Cheatsheet The parts of Python that actually matter when systems run 24/7. 👉
❌ “I want to ship an AI product without drowning in overengineering.”
Ship an AI SaaS MVP — The No-BS Checklist A practical checklist to ship an AI MVP fast, without building a science project. 👉 https://devrimozcay.gumroad.com/l/ai-saas-mvp-checklist
❌ “I want a real starting point, not a demo repo.”
AI SaaS Starter Kit (Next.js + OpenAI) A clean foundation for spinning up AI-powered products quickly. 👉 https://devrimozcay.gumroad.com/l/tzqjh
❌ “Our backend setup always takes longer than expected.”
Spring Boot Microservices Starter Kit v2 A production-ready backend stack you can run locally in under 30 minutes. 👉
❌ “Our frontend is always a mess at the start.”
Next.js SaaS Starter Template A minimal, clean frontend foundation for SaaS products. 👉
❌ “I’m preparing for interviews but don’t want trivia.”
Cracking the AWS & DevOps Interview Real questions, real answers — no filler. 👉
❌ “Java interviews still scare me.”
Top 85 Java Interview Questions & Answers Curated questions that actually show up in real interviews. 👉
I’m not selling motivation or dreams — just the tools I built because I was tired of solving the same problems over and over again.
If one of these saves you even a few hours, it already paid for itself.
The Response (What Happened After)
We published this story internally first. The reactions were… mixed.
Marcus (our K8s advocate): Felt vindicated, actually. He’d been burning out from being the only one who could troubleshoot K8s issues. “I was starting to hate infrastructure work,” he told me later. “Now I can write code again.”
Junior developers: Relieved. They’d been scared to admit they didn’t understand Kubernetes. Now they could deploy with confidence again.
CTO: Initially defensive (it was his decision to approve the migration), but came around when he saw the productivity numbers. “We optimized for the wrong thing,” he admitted.
Customers: Didn’t notice anything. Which is exactly the point. Infrastructure changes should be invisible to users.
The only person who had strong negative feelings was a backend engineer who’d been excited to put “Kubernetes” on his resume. He left three months later for a job at a bigger company.
We replaced him with someone who cared more about solving problems than collecting buzzwords.
One Year Later: Did We Make the Right Call?
It’s been 14 months since we moved back to Docker Compose.
We’re now at 28,000 active users. Eight microservices (added three more). Still on Docker Compose.
Zero Kubernetes-related incidents (obviously). Zero late-night troubleshooting of pod networking. Zero “why won’t this deploy” debugging sessions.
Our infrastructure costs: $340/month.
If we’d stayed on Kubernetes and scaled proportionally? Probably $1,500–2,000/month. Plus the ongoing operational complexity.
We made the right call.
The Write-up That Changed How I Think
If you’re interested in more stories like this — real engineering decisions, production lessons, no bullshit — I write regularly on Substack about what actually breaks and what actually works.
👉 Subscribe here:
No spam. No courses. Just production engineering reality.
Your Turn: Are You Over-Engineering?
Here’s a quick self-assessment. Answer honestly:
- Can everyone on your team deploy without help?
- Do you spend more time managing infrastructure than shipping features?
- Could you explain your deployment process to a new engineer in under 10 minutes?
- Have you hit actual limitations of your current setup, or are you “preparing for scale”?
- If your infrastructure expert quit tomorrow, would everything fall apart?
If two or more of those answers made you wince, you might be over-engineering.
And that’s okay. We all do it. The important thing is recognizing it and course-correcting.
Sometimes the best technical decision is the one that lets you stop thinking about technology and start shipping product.
Drop a comment if:
- You’re running production workloads on Docker Compose (solidarity!)
- You migrated TO Kubernetes and it was worth it (I want to hear that story)
- You’re considering K8s and this made you reconsider
- You think I’m wrong and Docker Compose is a terrible idea (let’s debate)
And if you’re currently drowning in Kubernetes complexity and wondering if there’s a simpler way — there is. You’re not crazy for thinking it’s harder than it should be.
Build something this week. Preferably something your customers will actually notice.
About me and what I’m working on
I’m an engineer and entrepreneur who has spent years building and operating real production systems — and dealing with what happens when they fail.
I’ve been on the receiving end of late-night incidents, unclear root causes, risky releases, and systems that only make sense to one or two people in the team. I’m now working on turning those painful, expensive experiences into tools and practices that help teams detect, understand, and prevent production failures before they turn into incidents.
If your team is struggling with late detection, recurring incidents, unclear failure modes, or fragile release processes, I’d genuinely love to hear what you’re dealing with and what’s been hardest to solve.
Reach out:
🔗 LinkedIn: https://www.linkedin.com/in/devrimozcay/
✍️ Medium:
Follow along:
𝕏 (Twitter): https://x.com/devrimozcy
Instagram: https://www.instagram.com/devrim.software/
One last thing.
I’m actively talking to teams who are dealing with problems like:
- services slowly eating memory until they crash
- rising cloud costs nobody understands anymore
- incidents that feel “random” but keep repeating
- systems that only one or two people truly understand
If any of this sounds like your team, I’d genuinely love to hear what you’re dealing with.
I’m not selling anything here — I’m trying to understand where teams are struggling most so I can build better tools and practices around it.
You can reach me through any of the channels above. Let’s talk about what’s actually breaking in your systems.
🧩 If you enjoy these deep-dive stories, you might like some of the notes I keep around while working on Spring systems:
• Grokking the Spring Boot Interview →
• Spring Boot Troubleshooting Cheatsheet →
• 250+ Spring Certification Practice Questions →
☕ I’ve been keeping these handy while mentoring junior devs and preparing for interviews myself:
• Grokking the Java Interview →
• Grokking the SQL Interview (Free Copy) →
• Grokking the Java Interview Vol 2 →
They’re short, practical, and cover exactly what interviewers actually ask. These have saved me countless hours chasing weird bean issues and context reload bugs.