🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Supercharge your Karpenter: Tactics for smarter K8s optimization (COP208)
In this video, Alexey, CTO at Zesty, presents Karpenter best practices for Kubernetes cost optimization. He covers consolidation policies with Pod Disruption Budgets and delays, choosing the right instance types while avoiding extremes, managing DaemonSet overhead, leveraging spot instances with on-demand failover, using Graviton ARM instances, and limiting availability zones for non-production environments. He emphasizes the challenge of continuous optimization and introduces Zesty’s platform, which combines resource and financial optimization layers, offering predictive scaling, automatic workload resizing, Fast Scale technology for 30-second node recovery, and spot-to-savings-plan distribution to achieve 70% cost savings.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Karpenter Best Practices: Consolidation Policies, Instance Selection, and Spot Instance Management
Hey guys, welcome. Thank you for joining the session. We’re going to spend the next roughly 15 minutes talking about Karpenter and its great features that will actually help you achieve cost efficiency and make your setup really adaptive at scale. Before we start, I would actually want to ask you a quick question. How many of you guys are using Karpenter today in production? Perfect, amazing. And keep your hands up, those who use it for managing spot instances. Amazing, perfect. Thank you for sharing.
So my name is Alexey. I’m a co-founder and CTO at Zesty, where we are focusing on making cloud really efficient. And I hope you will actually be able to take some actual and practical tips from this presentation that you can implement in your setup. Let’s dive in.
When it comes to stability and performance, it’s really easy to solve the problem by just throwing money at it, right? We can always take the biggest, most powerful instances out there and get amazing performance at a terrible cost. On the other hand, if we squeeze the budget and scale down to the bare minimum, we get efficiency, but suddenly stability and performance suffer. So as DevOps engineers and platform engineers, our goal is to find the sweet spot: run complex, resource-demanding production workloads on just the right infrastructure, neither overprovisioned nor underprovisioned. And this is a tough balance to strike, right? Especially in Kubernetes, it always means tuning the configuration to stay efficient, and it’s easier said than done.
So let’s dive in. One of the most powerful features in Karpenter is consolidation policies. Karpenter monitors your nodes, looks for underutilized ones, and checks whether it can safely terminate them and move their containers somewhere else to save costs. While this sounds appealing, there are some guardrails and fine-tuning you want to have in your setup.
Number one is using what are called PDBs, or Pod Disruption Budgets. A PDB makes sure that not all of your replicas are consolidated at the same time: you control how many replicas Karpenter can interrupt at any given moment to keep your setup safe, right? Another thing is consolidation delays. You can configure how long Karpenter waits before it actually terminates nodes. You set this to prevent Karpenter from consolidating during short-term dips in usage, so it’s not going to go crazy and interrupt your containers.
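The two guardrails above could be sketched roughly as follows. This is a minimal illustration, not the configuration shown in the talk: it builds a Karpenter v1 `NodePool` and a Kubernetes `PodDisruptionBudget` as Python dictionaries and prints them as JSON. All names (`general-purpose`, `web-pdb`, the `app: web` label, the `default` node class) and all values (the 10-minute delay, the 10% budget, the 80% minimum, the weekday window) are assumptions chosen for the example.

```python
import json

# Hypothetical NodePool: every name, duration, and percentage here is illustrative.
node_pool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general-purpose"},
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # assumed to exist already
                },
                "requirements": [
                    {
                        "key": "karpenter.sh/capacity-type",
                        "operator": "In",
                        "values": ["on-demand"],
                    }
                ],
            }
        },
        "disruption": {
            "consolidationPolicy": "WhenEmptyOrUnderutilized",
            # Consolidation delay: wait before acting on an underutilized node.
            "consolidateAfter": "10m",
            "budgets": [
                # At most 10% of this pool's nodes disrupted at once.
                {"nodes": "10%"},
                # No voluntary disruption for 8 hours starting 09:00 UTC on
                # weekdays (an assumed peak-traffic window).
                {"nodes": "0", "schedule": "0 9 * * mon-fri", "duration": "8h"},
            ],
        },
    },
}

# Hypothetical PDB: keep at least 80% of the "web" replicas available.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "web-pdb"},
    "spec": {
        "minAvailable": "80%",
        "selector": {"matchLabels": {"app": "web"}},
    },
}

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": [node_pool, pdb]}
    print(json.dumps(bundle, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped to `kubectl apply -f -` in a test cluster. The right delay and budget values depend on how spiky your traffic is, so treat these numbers as placeholders.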
So by implementing those controls, you can avoid consolidation during your peak traffic hours, and you can see an example configuration here.
Okay, another important strategy is to choose the right instances for your pool. It’s not just about what’s available; it’s about giving Karpenter the right amount of flexibility and a broader set of choices.
So you definitely want diversification, because Karpenter balances cost against availability and tries to find the instance that delivers the best performance at the lowest cost. In theory, the more instance types you give Karpenter across different families, generations, and sizes, the better the fit it can find. But you always want to avoid extremes, right? You don’t want to include instances that are too small, because small instances cause a lot of network overhead and add complexity to scheduling and monitoring. So in this example, we are cutting out all the tiny flavors.
But you also don’t want nodes that are too large, because, as we discussed with consolidation earlier, large nodes that are not fully utilized are a waste of money and make consolidation less efficient. So to summarize, we want to constantly monitor the behavior of our cluster, and it’s up to us to define what’s too small or too big in our exact case.
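One way to express "diverse, but without the extremes" is through NodePool requirements on Karpenter's well-known AWS instance labels. The sketch below is an assumption about what such a selection might look like, not the slide from the talk: the categories, the generation cutoff, the excluded sizes, and the 64 vCPU ceiling are all illustrative, and the helper `instance_requirements` is hypothetical.

```python
import json


def instance_requirements(excluded_sizes, max_vcpu):
    """Build NodePool requirements that allow a broad set of instance
    types while excluding the smallest sizes and the largest nodes.

    The thresholds are workload specific; the caller decides what
    "too small" and "too large" mean.
    """
    return [
        # Several families: compute, general purpose, memory optimized.
        {
            "key": "karpenter.k8s.aws/instance-category",
            "operator": "In",
            "values": ["c", "m", "r"],
        },
        # Only newer generations (strictly greater than 4).
        {
            "key": "karpenter.k8s.aws/instance-generation",
            "operator": "Gt",
            "values": ["4"],
        },
        # Cut out the tiny flavors.
        {
            "key": "karpenter.k8s.aws/instance-size",
            "operator": "NotIn",
            "values": list(excluded_sizes),
        },
        # Cap node size: Lt is strict, so max_vcpu itself is still allowed.
        {
            "key": "karpenter.k8s.aws/instance-cpu",
            "operator": "Lt",
            "values": [str(max_vcpu + 1)],
        },
    ]


requirements = instance_requirements(["nano", "micro", "small"], max_vcpu=64)

if __name__ == "__main__":
    print(json.dumps(requirements, indent=2))
```

The resulting list would go under `spec.template.spec.requirements` of a NodePool. Widening or narrowing the bounds is then a one-line change, which suits the "monitor and adjust" loop described above.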
Another factor that is constantly overlooked is DaemonSets. Basically, those are containers for system services like monitoring, logging, and telemetry, and they run on every node. So when we have larger nodes, DaemonSet overhead is small. Conversely, if we have a lot of small nodes, DaemonSet overhead is large, and it consumes real resources. And if we’re using any software that is licensed per node, this can silently inflate your costs, so take this into consideration as well.
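A quick back-of-the-envelope calculation makes the DaemonSet effect concrete. The numbers below are invented for illustration (a workload needing 200 vCPU and DaemonSets that reserve 0.5 vCPU on every node), and `daemonset_overhead` is a hypothetical helper rather than anything Karpenter provides.

```python
import math


def daemonset_overhead(workload_vcpu, node_vcpu, daemonset_vcpu_per_node):
    """Return (node_count, total_daemonset_vcpu) for a fleet of identical nodes.

    Each node loses `daemonset_vcpu_per_node` to system pods, so only the
    remainder is usable by the workload.
    """
    usable = node_vcpu - daemonset_vcpu_per_node
    if usable <= 0:
        raise ValueError("DaemonSets consume the whole node")
    nodes = math.ceil(workload_vcpu / usable)
    return nodes, nodes * daemonset_vcpu_per_node


# Same 200 vCPU workload on small (2 vCPU) versus larger (16 vCPU) nodes.
small_nodes = daemonset_overhead(200, node_vcpu=2, daemonset_vcpu_per_node=0.5)
large_nodes = daemonset_overhead(200, node_vcpu=16, daemonset_vcpu_per_node=0.5)

if __name__ == "__main__":
    print("small nodes:", small_nodes)  # (134, 67.0)
    print("large nodes:", large_nodes)  # (13, 6.5)
```

Under these assumptions the small-node fleet spends 67 vCPU on DaemonSets across 134 nodes, while the larger-node fleet spends 6.5 vCPU across 13 nodes. Any per-node license fee scales with the same node counts.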
So let’s be honest, talking about cost savings without mentioning spot instances probably is not the right thing. And the good news is that Karpenter makes management of spot instances really easy. You just need to allow spot instances in your node pool, and Karpenter will just take it further. It will automatically identify spots with best price and availability, take this into account and choose the right machines, and also take care if they’re interrupted so pods can move somewhere else and basically provide cost savings.
While this sounds good, let’s think about where spot instances fit. If you ask me, I think everywhere: every workload that can tolerate interruptions, whether it’s a web server behind a load balancer, a stateless microservice, batch jobs, or your dev and staging environments. All of them can recover quickly from spot interruptions, and you can leverage this for cost savings. So the best practice here is to configure spot instances with a failover to on-demand, and you can see an example configuration here. In this case, Karpenter will try to provision a spot instance for cost savings, and if there is no capacity, it will transparently move you to on-demand.
So it’s a good idea to create two separate node pools for spot and on-demand, so you can use node affinity in your deployments to configure how much you put on spot and how much you keep on on-demand for safety. And finally, you always need to keep an eye on fallbacks. If you’re falling back to on-demand frequently, then, as we said about choosing the right instances, you may want to give Karpenter a much broader selection: instances that are not the perfect fit may still provide more than a 60% discount, so it’s the right thing to do.
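The two-pool setup described above might look something like this. It is a sketch under assumptions, not the speaker's slide: the pool names, the weights (100 and 10), and the `default` node class are invented, and `capacity_pool` is a hypothetical helper. It relies on the NodePool `weight` field, where Karpenter considers higher-weight pools first, and on the `karpenter.sh/capacity-type` label for node affinity.

```python
import json


def capacity_pool(name, capacity_type, weight):
    """Build a Karpenter v1 NodePool pinned to a single capacity type."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "weight": weight,  # higher-weight pools are considered first
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",  # assumed to exist already
                    },
                    "requirements": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": [capacity_type],
                        }
                    ],
                }
            },
        },
    }


spot_pool = capacity_pool("spot", "spot", weight=100)
on_demand_pool = capacity_pool("on-demand", "on-demand", weight=10)

# Pod-spec fragment (goes under spec.template.spec.affinity of a Deployment):
# prefer spot nodes, but still allow scheduling onto on-demand nodes.
prefer_spot_affinity = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "preference": {
                    "matchExpressions": [
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot"],
                        }
                    ]
                },
            }
        ]
    }
}

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": [spot_pool, on_demand_pool]}
    print(json.dumps(bundle, indent=2))
```

Workloads that must never be interrupted could instead use a required affinity (or a `nodeSelector`) on `karpenter.sh/capacity-type: on-demand`, which is one way to control how much stays off spot.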
Another strategy would be using Graviton instances. As you probably know, Graviton provides at least the same performance, or even better, for a lower price. And most modern applications are able to run on ARM architectures with no code change. The only investment here is to make sure that your CI process builds your Docker images for both ARM and x86, so that you have both versions of the container in your registry. Kubernetes will automatically pick the right one depending on the node that the workload is running on.
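As a rough sketch of the Graviton setup, a NodePool can allow both architectures through the standard `kubernetes.io/arch` label, and CI can guard against shipping an image that lacks one of them. The helper `missing_platforms` is hypothetical, and how you obtain an image's platform list (for example, from its manifest list in the registry) is left out and assumed to be available.

```python
# Requirement fragment for a NodePool: allow Graviton (arm64) and x86 (amd64).
arch_requirement = {
    "key": "kubernetes.io/arch",
    "operator": "In",
    "values": ["arm64", "amd64"],
}


def missing_platforms(image_platforms, node_archs):
    """Return the linux/<arch> platforms that nodes may have but the image lacks.

    `image_platforms` is a set such as {"linux/amd64", "linux/arm64"};
    `node_archs` is the list of architectures the NodePool allows.
    """
    required = {f"linux/{arch}" for arch in node_archs}
    return required - set(image_platforms)


# Hypothetical CI check: an x86-only image is not safe for this pool.
x86_only = missing_platforms({"linux/amd64"}, arch_requirement["values"])
multi_arch = missing_platforms(
    {"linux/amd64", "linux/arm64"}, arch_requirement["values"]
)

if __name__ == "__main__":
    print("x86-only image is missing:", x86_only)
    print("multi-arch image is missing:", multi_arch)
```

A non-empty result means a pod could land on a node whose architecture the image does not support, so a CI job might fail the build in that case.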
One last tip for today, and please use it with caution, not for production, but it can really save some costs. As you probably know, cross-availability-zone traffic is very expensive. So for a non-production environment, what you can do is limit your node pool to a specific availability zone. This enforces data locality at the infrastructure level, so it significantly reduces the cost of networking. But again, I don’t recommend putting this in production. As you can see in this example, we can also configure a failover: use weights and affinity rules so that if a zone fails, we can go and utilize another one.
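A single-zone pool with a lower-priority fallback could be sketched like this for a non-production cluster. The zone names (`us-east-1a`, `us-east-1b`), pool names, and weights are assumptions, and `zonal_pool` is a hypothetical helper; the zone restriction itself uses the standard `topology.kubernetes.io/zone` label.

```python
import json


def zonal_pool(name, zone, weight):
    """Build a Karpenter v1 NodePool restricted to one availability zone."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "weight": weight,  # higher-weight pools are considered first
            "template": {
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": "default",  # assumed to exist already
                    },
                    "requirements": [
                        {
                            "key": "topology.kubernetes.io/zone",
                            "operator": "In",
                            "values": [zone],
                        }
                    ],
                }
            },
        },
    }


# Non-production only: keep traffic in one zone, fall back to a second zone.
primary = zonal_pool("dev-primary-zone", "us-east-1a", weight=100)
fallback = zonal_pool("dev-fallback-zone", "us-east-1b", weight=10)

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": [primary, fallback]}
    print(json.dumps(bundle, indent=2))
```

Note that the EC2NodeClass must select subnets in both zones for the fallback to be usable, and workloads with zonal persistent volumes will stay tied to the zone where their volume lives.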
Beyond Configuration: Addressing Kubernetes Cost Optimization Challenges with Zesty’s Automation Platform
So now that we’ve covered all of these best practices, the big question comes up: are we done? And the answer is not really. There are still a lot of challenges to solve, so let’s mention them.
Kubernetes gives us a lot of metrics, and it’s really hard to connect them to costs in order to identify inefficiencies, especially when we operate large environments with multiple clusters. We talked about Karpenter a lot, and even though Karpenter makes instant decisions and works amazingly well, that doesn’t mean the underlying infrastructure can keep up: it takes a couple of minutes for the infrastructure to be ready to serve our containers. And obviously there’s still operational complexity in constantly modifying and adjusting our setup to make it truly efficient.
So the question is, can you afford to have your teams constantly invested in making sure your environment is configured properly? At the end of the day, in large environments, we need continuous automation, something that lets us focus on building rather than optimizing, and this is exactly where tools and platforms come to assist. At Zesty, we developed a Kubernetes optimization platform that helps close those gaps. Our platform provides deep visibility into all of your clusters and containers, from both a usage and a utilization perspective, making it very easy to identify inefficiencies.
It uses predictive models to make scaling decisions far more efficient. We execute optimization actions automatically on your behalf, so again, your teams can focus on building and not optimizing. Our platform automatically resizes workloads both vertically and horizontally based on workload behavior. It covers not just CPU and memory but also persistent volumes for stateful workloads that require local storage.
By the way, we call this the resource optimization layer, which does active right-sizing. We also distribute workloads between spot instances and savings plans to reduce expensive on-demand costs. With our Fast Scale technology, we maintain a fleet of hibernated nodes that are pre-cached and have all of your containers ready to kick in within 30 seconds. This allows near-instant recovery from spot instance failures, so we can move more workloads to spot without the risk of being interrupted or having no capacity.
We call it the financial optimization layer, and Zesty is the only platform out there that combines both resource optimization and cloud financial optimization practices to provide at least 70% cost savings for your clusters. If you want to learn more and hear exactly how we can help companies achieve efficiency, our booth is located right here to my right. So feel free to come say hi and ask questions. Thank you so much and have a good convention ahead. Thank you.
This article is entirely auto-generated using Amazon Bedrock.