Architecture and Cost Trade-offs
How we designed a scalable, reliable microservices platform on Azure Kubernetes Service while significantly reducing infrastructure costs, without sacrificing observability or security.
Introduction
When building cloud-native systems, teams often face a familiar dilemma:
Do we optimize for reliability or for cost?
Fully managed cloud services promise simplicity, strong SLAs, and fewer operational headaches. The trade-off is cost, which is often far higher than expected once a system grows beyond a handful of services. On the other hand, self-hosting everything can reduce spend but increases operational complexity and risk.
This post shares how we approached this problem while building a production-grade microservices platform on Azure Kubernetes Service (AKS). Instead of choosing one extreme, we adopted a hybrid architecture: using managed services where they truly matter, and self-hosting components where the risk-to-cost trade-off made sense.
The result was a platform that is:
- Cost-efficient (70–90% lower than a fully managed stack)
- Reliable for critical workloads
- Observable without per-GB ingestion costs
- Portable and fully defined as code
This article focuses on architecture and decision-making, not step-by-step implementation. A follow-up post will cover the Terraform and Kubernetes details.
The Core Problem: Cost, Reliability, and Complexity
Modern microservices platforms are more than application code. Even small systems require:
- Databases and message queues
- Caching layers
- Metrics, logs, and dashboards
- Secure networking and access control
Each component introduces the same question:
Should this be a managed service or something we run ourselves?
The Cost Reality
Managed services are excellent-but their pricing compounds quickly.
A typical microservices setup can look like this:
| Component  | Managed Service       | Typical Monthly Cost |
| ---------- | --------------------- | -------------------- |
| PostgreSQL | Azure PostgreSQL      | $65–200              |
| Redis      | Azure Cache for Redis | $50–150              |
| RabbitMQ   | Managed broker        | $100–300             |
| Logging    | Azure Monitor         | $2.50 per GB         |
Individually, these costs are reasonable. Together, they add up fast, especially for early-stage or cost-sensitive workloads.
The Key Insight: Not All State Is Equal
The most important architectural decision we made was to classify state.
Two Types of State
- Irreplaceable state: data that cannot be rebuilt if lost, such as the database of record.
- Regenerable state: data that can be repopulated or replayed, such as caches and message queues.
Once you make this distinction, the managed vs self-hosted decision becomes much clearer.
Our Hybrid Architecture Strategy
We deliberately mixed managed and self-hosted components.
What We Ran as Managed Services
- PostgreSQL (database of record), chosen for durability, point-in-time recovery, backups, and SLA guarantees.
What We Ran Inside Kubernetes
- Redis (cache)
- RabbitMQ (message broker)
- Observability stack (metrics and logs)
These components are important, but failure is recoverable. Kubernetes handles orchestration, restarts, and rescheduling, making this a reasonable trade-off.
This single decision accounted for most of the cost reduction, without increasing the blast radius of real failures.
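To make the self-hosted side concrete, here is a minimal sketch of what running the cache in-cluster can look like. It is illustrative rather than our exact manifests: the `platform` namespace, image tag, memory limits, and eviction policy are assumptions, and RabbitMQ and the observability stack follow the same pattern.
```yaml
# Minimal sketch (not the exact production manifest) of a self-hosted cache.
# Regenerable state only: if this pod is rescheduled, the cache simply refills.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache          # illustrative name
  namespace: platform        # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          # Cap memory and evict least-recently-used keys; losing entries is fine for a cache.
          args: ["--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: platform
spec:
  selector:
    app: redis-cache
  ports:
    - port: 6379
      targetPort: 6379
```
Kubernetes restarts or reschedules the pod on failure; because nothing irreplaceable lives here, that is good enough.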
Kubernetes-First, but Not Kubernetes-Everything
Kubernetes was chosen as the orchestration layer, not as a place to host everything indiscriminately.
Why Kubernetes Works Well Here
- Consistent deployment model
- Horizontal scaling built-in
- Strong ecosystem
- Portability across clouds
Where Kubernetes Is Not Ideal
- Primary databases
- Highly stateful systems requiring strong consistency guarantees
Using Kubernetes for compute and managed services for critical state gives the best of both worlds.
Scaling Without Paying for Idle Capacity
One of the easiest ways to waste money in the cloud is provisioning for peak load.
The Approach
At idle, the platform runs on a small footprint. During traffic spikes, it scales automatically, then scales back down.
This approach dramatically reduced monthly costs without affecting availability.
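Part 2 covers the autoscaling details; as a rough sketch, a Horizontal Pod Autoscaler keeps the idle footprint small while the cluster autoscaler adds and removes nodes behind it. The deployment name and thresholds below are assumptions, not our exact values.
```yaml
# Sketch: keep a small idle footprint and scale out on load.
# "api" and the CPU threshold are illustrative values.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2        # small baseline at idle
  maxReplicas: 10       # ceiling during traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
On AKS, the cluster autoscaler then grows or shrinks the node pool to match, so idle capacity is not billed for long.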
Using Spot Instances Safely
Not all workloads need guaranteed uptime.
Background workers, batch jobs, and asynchronous processing can tolerate interruptions. These workloads ran on spot instances, trading availability guarantees for steep discounts.
When Spot Instances Make Sense
- Background processing
- Data pipelines
- Non-user-facing jobs
When They Don’t
- APIs
- Databases
- Stateful services
By isolating these workloads, we reduced compute costs significantly without impacting user experience.
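On AKS, spot node pools are tainted with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so only workloads that explicitly tolerate interruption land on them. A sketch of an interruption-tolerant worker follows; the name and image are placeholders.
```yaml
# Sketch: pin a background worker to the spot node pool.
# Only this toleration + affinity pair sends it to spot capacity;
# APIs and stateful services stay on regular nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: background-worker   # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: background-worker
  template:
    metadata:
      labels:
        app: background-worker
    spec:
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values: ["spot"]
      containers:
        - name: worker
          image: example.azurecr.io/worker:latest   # placeholder image
```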
Observability Without Per-GB Pricing
Observability is not optional, but managed logging platforms charge heavily at scale.
Instead of paying per-GB ingestion fees, we used a self-hosted stack:
- Metrics collected and stored in-cluster
- Logs stored in object storage using low-cost tiers
- Dashboards built on top of open-source tooling
Why This Works
- Logs are queried infrequently
- Storage is cheap
- Ingestion costs dominate managed observability pricing
This approach reduced observability costs by orders of magnitude while preserving full visibility into the system.
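The post deliberately does not prescribe tooling, but a common shape for this stack is Prometheus for in-cluster metrics, Grafana for dashboards, and Loki shipping log chunks to blob storage. Assuming Loki as the log backend, the storage side is roughly the fragment below; the storage account and container names are placeholders.
```yaml
# Fragment of a Loki configuration (assumed tooling): index kept via TSDB,
# log chunks in Azure Blob Storage on a low-cost tier. Not a complete config.
storage_config:
  azure:
    account_name: examplelogs        # placeholder storage account
    container_name: loki-chunks      # placeholder container
    use_managed_identity: true       # no stored account keys
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: azure
      schema: v13
      index:
        prefix: index_
        period: 24h
```
Because ingestion is just writes to object storage, cost scales with retention rather than with per-GB ingest fees.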
Security and Access: Simple, Auditable, and Cheap
Security was designed around a few principles:
Private by Default
- Databases are accessible only within the virtual network
- No public endpoints for internal services
Identity Over Secrets
- Workloads authenticate using cloud identity
- No long-lived credentials stored in Kubernetes
Least Privilege
- Day-to-day operations require minimal permissions
- Elevated access is limited to initial setup
This keeps the system secure without introducing VPNs, bastion hosts, or unnecessary operational overhead.
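"Identity over secrets" maps to AKS Workload Identity, which Part 2 covers in detail. As a minimal sketch: annotate a service account with a managed identity's client ID and label the pod so it receives a projected token instead of a stored credential. The names, namespace, image, and client ID below are placeholders.
```yaml
# Sketch: the workload authenticates to Azure via a federated managed identity,
# so no connection strings or long-lived secrets live in the cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api                                   # illustrative name
  namespace: platform                         # assumed namespace
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        azure.workload.identity/use: "true"   # opts the pod into token projection
    spec:
      serviceAccountName: api
      containers:
        - name: api
          image: example.azurecr.io/api:latest   # placeholder image
```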
Cost Snapshot (Production)
A representative production setup looked roughly like this:
| Component              | Monthly Cost |
| ---------------------- | ------------ |
| AKS compute (baseline) | $80–100      |
| Managed PostgreSQL     | $65          |
| Storage and networking | $30          |
| Persistent volumes     | $10          |
| **Total**              | **~$190**    |
Comparable fully managed setups often landed in the $500–$1,500 per month range.
Lessons Learned
1. Don’t Optimize the Wrong Layer
Saving money on your database of record is rarely worth the risk.
2. Spot Instances Are High-Leverage
Used correctly, they offer some of the highest cost savings available.
3. Observability Is a Requirement
Skipping it to save money always costs more later.
4. Infrastructure as Code Pays Off Early
Teams that automate early spend less time firefighting later.
5. Kubernetes Is an Enabler, Not the Goal
Use it where it adds leverage, not as a default for everything.
Conclusion
Production-grade systems don’t require premium managed services at every layer.
By:
- Classifying state correctly
- Using managed services selectively
- Scaling dynamically
- Leveraging open-source observability
it’s possible to build platforms that are cost-efficient, reliable, secure, and portable, even with small teams.
In Part 2, we’ll dive into the actual implementation: Terraform modules, AKS autoscaling, spot node pools, Workload Identity, and Kubernetes deployment patterns.