🦄 Making great presentations more accessible. This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Build AI-powered developer experiences with MCP on ECS and EKS (CNS358)
In this video, Steve Kendrex and George introduce the Model Context Protocol (MCP) and demonstrate AWS’s hosted MCP servers for Amazon ECS and EKS. They explain how MCP addresses limitations of LLMs by providing standardized integration between AI agents and data sources through tools, resources, and prompts. The presentation covers the architecture of hosted MCP servers using SigV4 authentication, IAM permissions like InvokeMCP and CallReadOnlyTools, and the MCP proxy for request signing. George demonstrates using Kiro IDE with natural language commands to manage EKS clusters, assess upgrade readiness, create clusters, and troubleshoot LoadBalancer issues. The video also showcases the new "Inspect with Amazon Q" feature in ECS and EKS consoles, which leverages MCP tools and knowledge bases to automatically diagnose and provide resolution steps for common issues like stuck task definitions and container image pull failures.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Building Developer Experiences with Model Context Protocol
Good afternoon. My name is Steve Kendrex. With me today is George. We are here to present a session on building developer experiences with the Model Context Protocol. If you aren’t familiar with MCP, you’re going to get a great introduction, as well as guidance on using MCP in your daily work with our container services, Elastic Container Service and Elastic Kubernetes Service.
We’re going to go through a few things. Generally speaking, we’ll cover how GenAI agents and LLMs have evolved and why we got to a point where MCP is so useful. We’ll give you an introduction to MCP just a little bit as a refresher for exactly how we at Amazon tend to think about these things and particularly why container services benefit from using MCP. We’ll talk in detail about the ECS and EKS MCP servers that we have released for your consumption today. Then we’ll have a demo demonstrating what you can do today using these MCP servers both in the console and through your IDE of choice such as Kiro.
So let’s talk a little bit about MCP, the Model Context Protocol, and how we got here. This is going to be a very fundamental introduction to LLMs. I want to cover some of the fundamental constraints so that you understand exactly why we built what we did and how we went about building it.
Understanding the Fundamental Limitations of LLMs and AI Agents
Let’s imagine I am a grandmother trying to buy children’s gifts for my grandkids. That is my fundamental task, and I have heard all about these AI assistants which are supposed to help me with tasks like this. I have an AI assistant, I have an LLM, maybe I think those are exactly the same thing. I don’t really necessarily know. All I know is that I can ask questions. So here we go, we ask what is AWS or maybe I say, hey, what are some great kids’ gift ideas for Christmas? Of course that interface, whatever it is that I am using, is going to come back to me and say, okay, here’s the answer that I have for you. It’s perfect, it is exactly what we need.
Here’s the limitation. Let’s get a little bit more specific. I’m a bit more used to using these agents now, I’ve used them for quite some time, so I want to go beyond just asking for great gift ideas for kids. I move on to what are the most popular gifts this year, or what is the must-have children’s toy this year. My kids are super into Bluey, for example, and different kids have different needs, so there are lots of different options. Maybe I want to know the best STEM toys for this year. In the best case, the agent will come back and say it doesn’t have access to current data; in the worst case, it will simply give me an answer without knowing that the answer isn’t fully correct.
Of course there are other limitations when we try and make agents do specific tasks. If I want to actually do something with the agent, I want the agent to give me specific data not just about the world itself but maybe say go look at my Amazon purchase history and tell me what I bought. Of course the agent doesn’t have access to that information. Or if I want the agent to do something like say specifically I want it to purchase something on my behalf, of course without anything intermediating the agent and allowing the agent to take that action, it’s not going to be able to do that. Here we have a message where somebody’s tried to get a bug fixed in the code and the LLM of course says I can’t do that. LLMs, as those of you who have played around with them to at least some degree know, are only as good as the specific context we give them and they are allowed to access.
So let’s summarize the limitation. We have two fundamental problems: an input problem and an output problem. LLMs don’t know what they don’t know, because they are trained at a point in time and know nothing about the world after that training cutoff. Now, you could sit there providing that context in exquisite detail, and this is what many of us did when LLMs first became popular. We said things like: assume that the world is this way; assume the world has done this in the last six months. The model would do a decent job, but that got tiring, it would often get things wrong, or it would even say, well, I can’t assume that because it’s not in my training data.
Evolution from RAG to Tools: Addressing Dynamic Context Retrieval Challenges
So then we came up with many different ways to address this workflow. Here is one: RAG became popular shortly after LLMs became big. RAG provides a way to give the agent context through knowledge bases. Instead of relying only on the information in the model’s training, the AI assistant takes its prompt, say, what is the most popular STEM gift for kids this holiday season, looks up the knowledge base first, and injects that context into the prompt on the user’s behalf. The retrieved information is pulled into the prompt, which is then fed to the LLM, and we get a response.
There are some limitations with this, of course, because the knowledge base has to be maintained by somebody. It has to be updated. It is only as good as the context that is provided into the knowledge base. And of course there are limitations about how much we can do. Context windows became very, very large as prompts became larger and larger as the knowledge base information was injected. In many cases the entire thing was injected into the prompt on the user’s behalf.
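To make that flow concrete, here is a minimal, purely illustrative sketch of the RAG pattern just described. Everything here is a stand-in: the tiny knowledge base, the word-overlap scoring (a real system would use embeddings and a vector store), and the stubbed model call.

```python
# Minimal RAG sketch: retrieve knowledge-base entries relevant to the
# query and inject them into the prompt before it reaches the LLM.
# The knowledge base, scoring, and call_llm() are illustrative stand-ins.

KNOWLEDGE_BASE = [
    "2025 toy trends: STEM robotics kits topped holiday sales.",
    "Bluey playsets remain best-sellers for ages three to seven.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy relevance score: count shared words. A real system would use
    # embeddings and a vector database instead.
    def score(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Stand-in for a real model invocation (e.g., via Amazon Bedrock).
    return f"[model answer grounded in a {len(prompt)}-character prompt]"

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    # The retrieved context is injected into the prompt on the user's behalf.
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("What is the most popular STEM gift this holiday season?"))
```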
So there are other ways to do dynamic retrieval and dynamic context, and this is where tools came about: the AI assistant can actually take an action for you. The most common way to understand a tool is something like web search, which is an extremely powerful tool. We enter the prompt, and the prompt along with the tool descriptions goes to the LLM. The LLM says, okay, I know what I need to do, and there’s a mechanism for the LLM to interact with the agent and call the tool. In this case we go search the internet for the best STEM gifts so that I can make my child happy, then we get the tool result, and the LLM consolidates that information and provides it back to the user. A minimal sketch of this loop follows.
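Here is that tool-calling loop as a sketch. Again everything is a stand-in: fake_model() plays the role of a real LLM and web_search() the role of a real search tool; the point is the shape of the exchange, prompt plus tool descriptions in, tool call out, tool result back in, final answer out.

```python
# Minimal tool-calling loop sketch. The model decides whether to call a
# tool; the runtime executes it and feeds the result back to the model.
# fake_model() and web_search() are hypothetical stand-ins.

def web_search(query: str) -> str:
    # Stand-in for a real web-search tool.
    return f"Top result for '{query}': STEM robotics kits."

TOOLS = {"web_search": web_search}

def fake_model(messages: list[dict]) -> dict:
    # Stand-in for an LLM: the first turn requests a tool call, the second
    # turn consolidates the tool result into a final answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "web_search",
                "arguments": {"query": "best STEM gifts this year"}}
    return {"answer": "Current search results point to STEM robotics kits."}

def run(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"]
        # Execute the requested tool and append the result for the model.
        result = TOOLS[reply["tool"]](**reply["arguments"])
        messages.append({"role": "tool", "content": result})

print(run("What are the best STEM gifts this year?"))
```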
Okay, so we seem to have solved this problem, at least for simple use cases. The problem comes with more complex use cases. For example, moving beyond the Christmas gift idea, let’s say there are three things the agent needs to do. It needs to access a database. It has to access source code repositories so it knows if anything has changed. Maybe it needs to populate a CRM. In the previous world, achieving all these outcomes meant creating my own integrations, specifically teaching and training my agent to access these different systems. I would have to build some kind of mechanism, give it the correct APIs, and spell out exactly how to use each tool, and it would all be very custom. If I wanted to switch one system out or share the setup with a colleague, we would have to work through all of these limitations again. It wasn’t repeatable.
That is a summary of the problems we have with model-tool interactions, and it is an end-to-end problem. Every model-to-tool connection required custom engineering, there was no standardized way to share context, and knowledge went stale over time: unless the agent was consistently calling out, expending a bunch of tokens, and knowing how and when to acquire new knowledge, it was still cut off from real-time data. And the model is inherently restricted, by the nature of its development, from new data sources.
Model Context Protocol: A Standardized Integration Framework for AI
Okay, so this is where the MCP came from, the Model Context Protocol. It was introduced by Anthropic. Model Context Protocol is an open standard for integration between AI apps and agents that use tools and data sources. So APIs, if you think about it, standardize how web backends work. MCP is effectively a similar way to think about how AI should be interacting with tools as well. So think about it this way. It’s not necessarily 100% accurate, but it is useful as a way to think about how these tools interact.
Okay, so with MCP we can achieve standardized integration. I’ll explain a little bit about what MCP is and how it achieves this, and then we’ll explain about how our container services have used these tools to provide good results to you. So of course when we start talking about MCPs we have different terminology. We have the MCP host which hosts MCP clients.
An MCP client is simply the component in the host application that connects to an MCP server, whether that server is local or remote. We have an MCP server for our database, an MCP server for Git, and an MCP server for our CRM. The database host, the source code repository, and the CRM system, generally speaking, own and maintain their respective MCP servers. The things on the left have a standard way to interact with external sources, such that if we wanted to replace our CRM system, we could do that easily without having to recalibrate everything on the left side.
So that is effectively how MCP operates. Again, here are the core components. We have the MCP client, which is usually your agent; Kiro, for example, would act as an MCP client. And then we have the MCP server, which has three components: tools, which are the most common thing people think about when they think of MCP servers; resources; and prompts. I’ll explain each of these at a 50,000-foot level.
An MCP tool is what we call model-controlled: the LLM controls the use of the tool through the MCP server. A tool is basically a function the model can call. Going back to my example of having an agent help me purchase a Christmas gift, one tool might be a price lookup for different types of toys. It might be web search, or something very specific and very actionable. That’s usually a tool. In your specific use case it might be retrieving data, searching for a message, or updating a record. Most people are very familiar with MCP tools.
There are other categories of capabilities that MCP defines. MCP resources are application-controlled: your application, the thing that is actually running, maybe your web service or your database, controls them. Resources are effectively the things that used to live in your knowledge base: your files, your database records, the history and context that matter to me. My purchase history might be an example of a resource.
Prompts are an interesting one, because we usually think of a prompt as the thing we type into the agent. In MCP it’s effectively the same idea, but an MCP prompt is a specific capability that lets the server suggest well-crafted request templates to the user. Think about it this way: if I own the MCP server, I might say, here is a specific prompt which I know will produce very good results. The user can discover those prompts within my MCP server and then use one, and of course modify it, change it, and augment it however they want. So those are the three kinds of capabilities an MCP server exposes.
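To ground those three capability types, here is a minimal sketch of an MCP server built with the open-source MCP Python SDK. The gift-helper tool, resource, and prompt are invented purely for illustration; only the FastMCP decorators reflect the actual SDK.

```python
# Minimal MCP server sketch using the open-source MCP Python SDK
# (pip install mcp). The gift-shopping tool, resource, and prompt are
# invented for illustration.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gift-helper")

@mcp.tool()
def gift_price_lookup(toy_name: str) -> str:
    """Model-controlled: a function the LLM can choose to call."""
    return f"{toy_name}: $29.99"  # stand-in for a real price API

@mcp.resource("purchases://history")
def purchase_history() -> str:
    """Application-controlled: context data, like a knowledge base."""
    return "2024: Bluey playset; 2023: robotics kit"

@mcp.prompt()
def gift_ideas(age: int) -> str:
    """A server-suggested prompt the user can discover and modify."""
    return f"Suggest three gift ideas for a {age}-year-old, with prices."

if __name__ == "__main__":
    mcp.run()  # defaults to the stdio transport for local clients
```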
Why AWS Container Services Need MCP: Bridging the Gap Between AI and Real-Time Infrastructure
Now, let’s talk in detail. That was a very rapid rush through the history of why the Model Context Protocol is needed and how we got to this point. Let’s talk about the container services. Generally speaking, if you are watching this, we assume general knowledge of our container services: the Elastic Container Service, which is our AWS-native, very simple container orchestrator, and the Elastic Kubernetes Service, our Kubernetes-optimized orchestrator, which we consider the best way to run Kubernetes in the cloud.
Why did we launch MCP servers for AWS services? If these things are easy to use, especially something like ECS where we say it’s very easy to use, what do we need an MCP server for? We believe, as Amazon generally does, that AI is transforming how people work, build, ship, and observe their code.
AI agents are transforming developer productivity, and we believe there is effectively a new persona of user for our systems. Agents are generally not very good at looking at our console, because the console wasn’t built for an agent to look at. They are good at using the CLI, but that doesn’t mean they optimally take advantage of everything our APIs have to offer. They need an interface designed for them, so they can extract the most out of our services.
AI tools lack real-time awareness. Take a recent announcement: ECS launched ECS Express Mode two weeks ago. That is very unlikely to be in an LLM’s training data. Now, we have documentation, and these agents have access to tools like web search, so an agent can say: okay, the user is telling me there’s something called Express Mode, let me go look up what Express Mode is and list the resources it has, and it can funnel its way to a response.
But there are two problems with that. Number one, generally speaking, LLMs do well with a very robust set of information, so a model will stick to what it knows, especially older features. Many of you have spent time researching, writing, and sharing about how to use ECS and EKS, so the model has a wealth of data about those features in its training set. That means it will trust that data and be better equipped to use the older features, and even if it can see the new feature, it may not know how to use it optimally. That is a big problem that even the most sophisticated agents have.
There’s one other limitation: knowing the way AWS intends things to be used. You learn through trial and error, and maybe through some very sophisticated guidance we steer you toward the intended experience of using our tools, but the agent doesn’t have that experience. It can’t learn from trial and error the way you do. So we need a way to teach it and say: this is how you use it.
So MCP is an opportunity to fix this, because we can give the agent updated, up-to-the-minute information: instead of these APIs, use this set of APIs we just launched; instead of this deployment mechanism, use this one. We can also provide guidance for the user and improve troubleshooting and operations, because we have a wealth of knowledge that would be very difficult to communicate to you otherwise. Nobody wants to read a 1,000-page manual on how to respond to every single thing that might ever go awry with one of the container orchestrators; nobody has time for that. LLMs actually do a pretty good job of it if we give them the right mechanism to integrate with that knowledge, and that’s the opportunity MCP servers give us.
So we started building these, and we’re going to explain how they work in detail. There are two different ways to run MCP servers, and we started by shipping them locally. Earlier this year we launched MCP servers for AWS containers and serverless: ECS, EKS, Finch, and Lambda. Each MCP server has its own unique set of tools, and we will describe in more detail what all of these do. They are available to download today from the AWS Labs GitHub repository. You run them on your local machine, and they’re a great way to get up and running and start testing.
However, there are some limitations with running MCP servers locally once you start operating at scale. To address those, and to talk in more detail about exactly how the ECS and EKS hosted MCP servers work, I’m going to pass it off to George. So George, take it away please.
Launching Fully Managed, Remotely Hosted MCP Services for Enterprise Use
Right, thank you, Steve. There are three or four main reasons why we launched the hosted MCP server, and I’ll get into the details in a bit. One of the main requests we have been getting, especially from enterprises, is that they are not comfortable having their developers locally install and run an MCP server. Their security teams want full visibility into what the users are doing, and when there is a zero-day vulnerability, they want full control of the ability to patch it. The other reason is broader integration.
For example, a lot of third-party software-as-a-service providers are building AI agents that need to interact with Amazon EKS and Amazon ECS clusters, which means they need an MCP server that’s running and hosted in the cloud. The last reason is the superior reliability and scalability that AWS has to offer.
Two weeks ago, we launched the fully managed, remotely hosted MCP service for both Amazon EKS and Amazon ECS. They’re available in preview. For the rest of the presentation, I’ll dive a little bit deeper into each of these, including what capabilities they offer and how you get started with them. This has also enabled a pretty cool feature in that you can now troubleshoot issues in the EKS and ECS consoles. We’ll dive into that, and finally, we’ll take a look at a demo.
Amazon EKS MCP Server: Natural Language Management for Kubernetes Resources
I’ll start with the Amazon EKS MCP server. Before I dive into the details, I just want to give you a high-level idea of what you’re able to achieve once you have configured your AI assistant or your AI tool with the MCP server. Essentially, you’re able to use natural language to interact and manage your resources. For example, you can just type into your AI assistant, let’s say you’re using Kiro, which is an AI coding IDE from AWS, and you can just type in "show me the status of my production EKS cluster."
It’s not just limited to EKS resources. You’re also able to manage and interact with Kubernetes resources like pods and namespaces, so you could do things like "show me all the pods that are not in a running state." Just type in natural language. You don’t need kubectl, which is the official Kubernetes CLI, and you don’t need to set up kubeconfig. Just type it and you get the response. We have also enabled the MCP server with tools that are going to help with troubleshooting. For example, let’s say you have a pod stuck in a failed state. Just type that in: "Why is my nginx-ingress-controller pod failing?" The last thing I want to call out is that these tools are not just read-only tools. These also enable you to create resources.
Let’s take a look at the request flow. This applies to both Amazon EKS and Amazon ECS. These servers are hosted in the cloud and are available in preview in all commercial regions, except regions in China and GovCloud. You can pick the region you prefer depending on where your resources are. Let’s say you have configured the MCP client; the MCP client here is an AI system, and it could be Kiro, Cursor, Cline, or any MCP-compatible tool. Once you have set that up, and later in the presentation I’ll show you how to configure it, the request reaches the hosted MCP server (the hosted EKS and ECS servers are two separate MCP servers with separate endpoints in each commercial region) through a proxy in between.
Now the question is, why do we need a proxy? Both the EKS and ECS MCP servers are AWS services, and you authenticate to AWS services through IAM, using a mechanism called SigV4 signing. The MCP protocol today doesn’t support SigV4 natively, so you need this intermediary, which signs the request and then forwards it to the hosted MCP server. Once the request lands in the hosted MCP server, depending on the tool you’re using, the appropriate downstream AWS service is invoked.
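As a sketch of what the proxy does on your behalf, here is SigV4 signing applied to an MCP-style request with botocore. The endpoint URL is a placeholder; the service name eks-mcp and the region follow the talk, and the real proxy handles the full request/response flow for you.

```python
# Sketch of the proxy's job: SigV4-sign an MCP request before forwarding
# it to the hosted server. The endpoint URL below is a placeholder.
import json

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

session = boto3.Session(profile_name="demo-profile")
body = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})

request = AWSRequest(
    method="POST",
    url="https://<eks-mcp-regional-endpoint>/mcp",  # placeholder endpoint
    data=body,
    headers={"Content-Type": "application/json"},
)
# Sign for the eks-mcp service in us-west-2 (Oregon), per the talk.
SigV4Auth(session.get_credentials(), "eks-mcp", "us-west-2").add_auth(request)

# The request now carries Authorization and X-Amz-Date headers; the proxy
# forwards the signed request to the hosted MCP server.
print(dict(request.headers))
```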
I want to give you a concrete example now. I know it’s a little bit hard to read. In the demo, I have zoomed in, so you’ll be able to see this better. But here I have Kiro, which is the AWS AI-powered coding assistant. Once you have configured the MCP server, on the left you can see this is just a one-time step you do, and all the tools are listed out. In the case of EKS, which is the screenshot here, there are tools that are all available to you. As a user, like Steve mentioned earlier, you are not directly interacting with the tools. These tools are for LLMs, but it’s always good to know about them. Once you have configured and done the one-time step, you can just type in natural language in the chat window. In this case, it’s asking "show me the status of a particular EKS cluster." The MCP client or the LLM knows about the tool called describe EKS resource, which would then return the response. I just wanted to give you a high-level view of what you’re able to achieve with the MCP server.
Deep Dive into EKS MCP Server Capabilities: Tools, Documentation, and Troubleshooting
Now let’s get into the details. The capabilities of the MCP server can be defined in terms of the tools we support, and in the case of EKS you can classify them into four main categories. The first one is cluster management: tools that help with managing your EKS resources, like creating clusters, creating add-ons, and deleting them. Some of these are read-only tools; for example, the second one here is just doing a list operation, listing all the EKS resources. Then there are tools like the first one, manage EKS stacks, which actually creates a cluster. As a user, you don’t really need to interact with these directly, but it’s always good to know about them; if you want to learn more, our user guide covers the parameters each tool takes and which of those are required.
The next set of tools we have are for the Kubernetes resource management. So previously we were talking about EKS resources. Now we have Kubernetes resources like pods and services. You can create them and delete them. There are tools for applying YAMLs. There are tools for generating manifests.
The third set of tools is documentation and troubleshooting, and I want to spend a little bit more time here. Steve earlier mentioned the knowledge gap. Let me take a step back. Initially, when we launched the open source version of the MCP server, we did not have search EKS documentation. What we found during development was exactly the knowledge gap Steve was describing. Every LLM version has a documented knowledge cutoff date, which is basically the date through which that LLM was trained. If a feature came out after the knowledge cutoff date, the model wasn’t trained on it; it might still be able to use tools to get information, but it isn’t really trained on that feature. We were finding gaps when trying to troubleshoot newer EKS features, and the LLMs were not doing a good job. Hence we built this tool.
What this tool does is connect, in the backend, to an index of all AWS documentation: all AWS documentation, all AWS What’s New posts, and all AWS blog posts are indexed there, and the tool retrieves information from that index. Now you’re augmenting the LLM with the missing information through this tool. The second one is the EKS troubleshooting guide. This is actually a knowledge base we host in the cloud, and it is pretty much a distillation of all the knowledge we have gained managing millions of Kubernetes clusters over the years. We have a lot of internal runbooks, and those that can be shared externally and are useful have been added to this knowledge base. This particular tool reaches out to the knowledge base, so you are augmenting the LLM with the various runbooks it can use for troubleshooting.
The next set of tools is also related to troubleshooting. If you look at any troubleshooting runbook, one of the first steps is to get more telemetry: more metrics, more events, more logs. These tools feed that needed telemetry back into the LLM so that it can effectively troubleshoot an issue. And then we have a couple of tools for security. These were introduced because we noticed that a lot of the time, issues are related to IAM permissions, or the lack of them, and these tools help with that.
Configuring IAM Permissions and MCP Proxy for Secure Access
All right, so now that we’ve looked at the capabilities, let’s take a look at some of the prerequisites. Like I said before, the EKS MCP server (and the ECS one, which we’ll dive into a little later) is protected by SigV4 and IAM, so there are certain permissions you need to set up. The first key permission is InvokeMCP. This is what allows the IAM entity, whether it’s an IAM user or an IAM role, to connect to the hosted MCP server endpoint at all; it’s what lets the MCP client make a list call to discover all the available tools. Then we have a permission called CallReadOnlyTools, which grants access to the read-only tools. If you recall, some of the tools only do read-only actions, like getting or listing a resource, while a bunch of tools are actually mutating, like creating a cluster or creating a pod. This permission ensures access to the read-only tools only.
The last one is CallPrivilegedTools. If you want to allow all tools, including the mutating ones, you also need CallPrivilegedTools. One thing I really want to call out and emphasize: especially in production, start with just the read-only tools. LLMs have their own intelligence; most of the time they do a good job, but sometimes they can hallucinate. I’ve been in scenarios where one inadvertently went and deleted resources I didn’t want it to. So definitely start with just InvokeMCP and CallReadOnlyTools, at least in production. You can use CallPrivilegedTools in your dev or test environment, but I would say start with the first two.
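As a rough sketch, a least-privilege connection policy along the lines George describes might look like the following. The action names here are approximations of the InvokeMCP and CallReadOnlyTools permissions from the talk; check the AmazonEKSMCPReadOnlyAccess managed policy or the EKS user guide for the exact spellings, and add the privileged-tools action only in dev or test accounts.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ConnectAndReadOnlyToolsOnly",
      "Effect": "Allow",
      "Action": [
        "eks-mcp:InvokeMcp",
        "eks-mcp:CallReadOnlyTools"
      ],
      "Resource": "*"
    }
  ]
}
```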
Those are the three permissions that govern connecting to the MCP server, but once you connect, the tools themselves need permissions too. If you recall, there were tools that reach out to CloudWatch to get logs, so you need additional permissions configured on your IAM identity. We now have a managed policy called AmazonEKSMCPReadOnlyAccess. If you go into your IAM console, search for policies, and type that name, you can see this pre-canned policy; it has the full list of permissions that all the read-only tools need.
For the write tools, we don’t have a pre-canned policy today, but the permissions are listed in the EKS user guide if you want to learn more. So we’ve looked at all the permissions you need to configure on the IAM principal you use to connect to the MCP server endpoint. Now, let’s talk about the proxy.
I covered it a little earlier: the proxy is the intermediary between the MCP client and the MCP server that does the SigV4 signing on behalf of the request. The proxy is available in that GitHub repo if you want to take a look, and it’s also published in the Python Package Index, so it’s very simple to install; you can run it with uvx.
The good thing is that this same proxy works with any AWS first-party hosted MCP server, not just EKS and ECS. There are many other remotely hosted MCP servers coming out; you only need to set the proxy up once, and then you can have different configuration blocks to connect to the different AWS hosted MCP servers you’re interested in. What are some of the other configurations? The first one is profile. The MCP proxy does SigV4 signing, and to do the signing it needs credentials, which it picks up from the profile you specify here. This refers to the AWS profile.
Depending on which profile you specify, it uses those credentials to sign the request to the remotely hosted MCP server. A couple of other things you should be aware of: the regional endpoint. Like I mentioned, the remotely hosted MCP server is available in all commercial regions. In this case, I’m connecting to our Oregon region, so you would use us-west-2. And the service here is eks-mcp.
This is an important call-out. Like I said, there are two ways to ensure the LLM doesn’t take actions you don’t want, especially mutating actions. One is IAM: when you create the IAM permissions, make sure you don’t grant CallPrivilegedTools. The second is that if you specify a read-only argument here, the MCP client gets access to only the read-only tools, not all of them.
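Putting those options together, a proxy configuration block might look roughly like this. The proxy package name, endpoint hostname, and flag spellings below are placeholders; the exact Mac and Windows snippets are in the EKS user guide, as shown later in the demo. An analogous block pointing at the ECS endpoint with service ecs-mcp covers the ECS server.

```json
{
  "mcpServers": {
    "aws-eks-mcp": {
      "command": "uvx",
      "args": [
        "<aws-mcp-proxy-package>",
        "--endpoint", "https://<eks-mcp-endpoint>.us-west-2.amazonaws.com",
        "--service", "eks-mcp",
        "--profile", "demo-profile",
        "--read-only"
      ]
    }
  }
}
```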
Amazon ECS MCP Server and Console Integration with Amazon Q
So far I was talking about EKS; now let’s jump to ECS. A lot of what I said also applies to ECS, but let’s take a deeper look. The ECS MCP server enables you to use natural language to manage ECS resources. These are just examples, and it goes much broader than this, but to give you an idea: you can do deployment monitoring, investigate the health of containers or tasks, troubleshoot issues, and look at things like network configurations.
ECS classifies the tools into three broad buckets. The first one is operational tools. There are a bunch of tools that help you with doing things like getting the status of your deployments or fetching network configurations. These are very critical when you’re trying to troubleshoot issues. You have tools for resource management. There are tools for tasks like getting more details about task definitions.
You also have tools for troubleshooting. I think ECS has done a really good job of pinpointing customers’ biggest pain points and has come up with tools that help troubleshoot some of the most common issues I’ve seen. In terms of permissions, there’s a key difference between the EKS and ECS MCP servers: all of the ECS MCP server’s tools are read-only, so there is no third CallPrivilegedTools-style permission.
These are the two key permissions you need to configure on the IAM principal you use to connect to the ECS MCP server endpoint. If you recall, some of the tools, like fetch network configuration, actually make calls to EC2 APIs, and a bunch of the tools call ECS APIs, so there are additional permissions you need to configure on your IAM entity. You can learn more about what those are in the user guide.
Same as with EKS, you need to run the MCP proxy, but you don’t need to run it multiple times. If you have a scenario where you need to connect to both the EKS and ECS MCP servers, you can just add an additional configuration block that points to ECS instead, as in the earlier config sketch. The key difference from EKS is the endpoint, which starts with ecs.
Like EKS, ECS has regional endpoints, so you point to the appropriate ECS regional endpoint. The service here would be ecs-mcp, and the profile is where the proxy picks up the credentials for SigV4 signing. All right, now switching gears a little: one of the cool things we have been able to do with the MCP server is really improve the troubleshooting experience in the console, for both EKS and ECS. In the demo I’ll give you a better understanding, but let me quickly walk over it.
In many places where we surface an issue or error in the EKS or ECS AWS Management Console, you can now see an Inspect with Amazon Q button in context. What we have done is integrate Amazon Q with the MCP server tools to quickly help you triage and root-cause issues, and also give you recommendations on how to resolve them.
Just to give you an idea, I don’t know how much of this is visible, but just at a high level, these are some of the various points in the EKS console where you have this Inspect with Amazon Q integration. For example, if you go to the Observability dashboard, there are various kinds of health issues for the cluster that we surface up. Now you would see there’s an Inspect with Amazon Q button. If you click on it, you have an Amazon Q chat panel open up where you can go and triage and resolve the issue. Similarly, you have other points like Upgrade Insights or Control Plane Monitoring and Node Health Issues where this integration is available.
In the ECS console, similarly, there are multiple points where you are able to leverage this integration. For example, here I’m showing the deployment. This deployment is rolled back. When you click on that status saying Rollback Successful, if you click on it, you have an Inspect with Amazon Q button that you can use to learn more about what was the reason and how you can go mitigate it. Similarly, for task failures, you can click on the status and you have that integration available.
Earlier, we saw how the request flow works to connect to the hosted MCP server. Now, with this Q integration, there is another flow from the console where you can troubleshoot various issues, an AI-powered troubleshooting experience that in the backend uses the various tools provided by the EKS and ECS MCP servers. All right, with that, let me move to the demo. This is a recording I took a couple of days ago; with LLMs and hallucinations, I wanted to make sure we had something decent to show. It’s a recording, but it’s real.
Live Demonstration: From Setup to Troubleshooting with EKS and ECS MCP Servers
There are two main sections in the demo. The first is the hosted MCP server, and for that I’m going to use EKS, but ECS is very similar in terms of capabilities. In the second part, I’ll show you the troubleshooting experience in the console, using Amazon ECS. We’ll start with the prerequisites. If you remember, there are two things you have to do. First, configure your IAM entity with the right permissions so that you can connect to the remotely hosted MCP endpoint. Second, configure your MCP client; here I’m using Kiro, the AWS AI-powered IDE. You configure it one time so that the MCP client, Kiro, can connect to the MCP server. Those are the two prerequisites. Let’s take a look at that first.
All right, I’m in the IAM console; let’s go to the policies. I mentioned earlier there is a pre-canned managed policy called AmazonEKSMCPReadOnlyAccess. If you click on JSON, you can see the full set of read-only permissions required by all the tools. At the bottom are the two key permissions I mentioned that are required to connect to the hosted EKS MCP server endpoints, InvokeMCP and CallReadOnlyTools. Above that you can see all the other permissions; some of the tools make calls to EC2, some call CloudWatch, and some call STS, and the full set of permissions the tools need is listed out here.
Now, for the demo today, I’m going to use both read and write tools, so let me go ahead and create a new policy. I’m using the console here: you create a policy and click on JSON, where you now have the ability to fill in the policy. Right now it doesn’t have anything, so I’m going to switch to the EKS user guide, under Tools MCP Protocol. On this particular page, under Step 3, the full set of permissions you need for both read and write tools is listed. You can copy that into the policy editor and use it to create your IAM policy. As mentioned before, you can see the three key permissions required for the EKS MCP server at the bottom.
All right, so now let’s go ahead and click Next. I’m going to name it EKS MCP Server Write Policy. You can verify all the permissions look good and you can hit Create Policy. Now you have created the policy. Now we need to attach this policy to an IAM principal.
For the sake of simplicity in this demo, I’m just going to create a new IAM user. Our goal is to attach the previously created policy to this user, so I’m going to call it EKS MCP Server User. Now we need to locate the policy we created in the previous step: search for MCP, and you can see it at the bottom. Select it, click Next, and if everything looks good, create it. Now we have the user created.
Next we need to create an access key and secret key for this user, so go into the user, go to the Security Credentials tab, and create an access key and secret key. Once you have created these credentials, you copy them over to your local machine, the machine where you have the MCP client set up. In my case I’m using a MacBook, so I’m copying the access key to my local machine and setting it up as an AWS profile. Here I’m using the Kiro IDE. So we are done with the first prerequisite, creating the IAM permissions.
The next step, the next one-time step you have to do, is now configure your AI coding assistant to connect to the MCP server. Here, the AI coding assistant or MCP client that I’m using is Kiro. If you go to Kiro, you can see that I do not have any MCP servers configured right now. You can click Open MCP Server Config. As you can see, there are no MCP servers. If you go to the EKS user guide, we have instructions for Mac and Windows. Since I’m on MacBook, I’m going to copy the Mac section here, and all of this applies to ECS as well. In the interest of time, I’m just showing it for EKS today.
So you copy it over. You can see the proxy. I’m going to connect to the Oregon region, so I’m going to update the region to be us-west-2 since that’s the Oregon region endpoint. My profile, if you recall, even though I did not show it in the demo, I have a profile called demo-profile that has the access key and secret key that was created previously. So I’m updating to use that profile and update the region to us-west-2. Now when I hit save, notice on the left side you can see the MCP. Kiro is trying to connect to the remotely hosted MCP server. It takes a few seconds. It’s using SigV4 for signing and it’s connecting to it, making sure I have the appropriate permissions.
Once it does that, you can see that it has found 20 tools. There are 20 EKS MCP server tools. All of them are listed here. As a user, you would not be directly calling them, but it’s good to know. You can hover over them and you can see detailed descriptions, but these are also available in the EKS user guide if you want to learn more about these. All right, now we have done both the steps, right? We created the IAM permissions in step one. Now we have configured my MCP client, which is Kiro, to connect with the MCP server.
Now let’s start using it. I’m going to try something simple. Let’s start with showing me the status of my clusters. I’m going to open the chat window in Kiro, and then I’m going to just type in the prompt: Show me all EKS clusters and their status. I have four EKS clusters in my environment. Kiro automatically detects that there is an MCP server. Based on the tool description, it figures out that there is a tool called list_eks_resources, which looks like the right tool to get more details about the clusters. Actually, it’s also listed on the left side. If you see, there’s list_eks_resources, one of the tools there.
I’ve set up Kiro in a way that I’m not auto-approving. You have the option to say auto-approve all tool calls. Especially when you’re getting started, it’s good to individually approve and make sure it is calling tools that you understand and the tools that you wanted to invoke. So I approve this request. It’s calling a few other tools. It found out there are four clusters. It’s doing a describe call for each of these, and in the end what it returns is it gives me a status about the four EKS clusters. It gives information about the status of each one of them, the version, when it was created, the region, all that information.
It was so easy. I didn’t even have to set up kubectl, and I didn’t set up the AWS CLI. The only things I needed were the AWS profile, the IAM permissions, and the MCP server configuration I showed you. With that and just natural language, you are able to interact with and manage your EKS clusters. This is a very simple case.
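For a sense of what that read-only tool is presumably doing under the hood, here is the equivalent direct call with boto3; the profile and region match the demo setup.

```python
# Roughly what "show me all EKS clusters and their status" resolves to:
# read-only EKS API calls like these.
import boto3

eks = boto3.Session(profile_name="demo-profile").client(
    "eks", region_name="us-west-2")

for name in eks.list_clusters()["clusters"]:
    cluster = eks.describe_cluster(name=name)["cluster"]
    print(name, cluster["status"], cluster["version"], cluster["createdAt"])
```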
Let’s try another one. What I want to show next is that I’m going into one of those clusters, the second one here, the EKS ECR cluster. That is on Kubernetes version 1.31. If you have worked with Kubernetes, you know version upgrades are not a simple thing. Kubernetes doesn’t support rolling back versions, which is a little scary.
Before you decide to upgrade to a version, you want to make sure things all look good and won’t break. So I’m going to demonstrate this with version 1.31, which is not the latest; the latest version is actually 1.34. I’m trying to migrate from 1.31 to 1.32, but before that, I just want to see if I’m ready and good to go ahead and hit the upgrade button.
My prompt, if you read it at the bottom, says assess my EKS ECR test cluster’s upgrade readiness, including support status, upgrade timelines, and identify any blocking issues. That’s all I entered and I hit enter. Now Kiro figures out that there are tools within the 20 tools that the MCP server supports which help with this scenario. There are tools like insights which give information about the upgrade readiness. So Kiro is actually thinking through it and making a bunch of calls.
It found something and is doing a list EKS, so it’s trying to learn more about the resources it found. Let’s just give it a few seconds here for it to complete its information gathering. Finally, it generates a report here. I’ll scroll back up and just let it finish. So it’s giving me this report. It identified one blocking issue, saying that one of the add-ons I have running on my cluster called AWS GuardDuty agent is on a version that’s incompatible with the version I’m trying to upgrade to. It’s flagging that for me, saying I have to go fix that. If I don’t fix it and upgrade, my cluster is going to be in a broken state.
After that, it has a bunch of checks that all look good. It has a few other checks for kube-proxy and cluster health issues, and all of those succeeded. Finally, it’s giving me a recommendation of the steps I have to do before I hit the upgrade button.
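The upgrade-readiness checks draw on EKS upgrade insights, which you can also query directly. A sketch with boto3 follows; the cluster name is illustrative, and the field names follow the EKS ListInsights API.

```python
# Query EKS upgrade insights directly, the same signal the MCP tools
# surface for upgrade readiness. The cluster name is illustrative.
import boto3

eks = boto3.Session(profile_name="demo-profile").client(
    "eks", region_name="us-west-2")

resp = eks.list_insights(clusterName="eks-ecr-test-cluster")
for insight in resp["insights"]:
    status = insight["insightStatus"]["status"]  # e.g. PASSING or ERROR
    print(f"{status:8} {insight['name']}")
```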
All right, let’s do another one here: creating a new cluster. I don’t know how many of you have created a cluster using the AWS CLI or eksctl. EKS cluster creation involves a few steps. There are a bunch of prerequisites: you have to create a VPC, you have to create subnets, and then you have to pass those to the create cluster calls. There’s a lot of work you need to do today to create a cluster. But here you can see that with just a simple line of text, I’m just saying create a new EKS cluster, and that’s it. It figures out all the dependencies and uses CloudFormation behind the scenes to create the cluster.
Now, you might not create production clusters this way, but if you have an idea you want to quickly try out and need a cluster fast, this is a great way to get started. Even though this particular tool uses CloudFormation behind the scenes, you can use other tools. Say you want Terraform: as long as you specify in the prompt that you want to create a new EKS cluster with Terraform, LLMs usually respect what you have specified. If you don’t specify anything, by default it uses the tool that drives CloudFormation in the backend.
There’s a tool called manage EKS stacks. You can actually go to CloudFormation and see that particular template that’s being used here. It creates a template and then goes ahead and starts deploying the template. Now creating a cluster still takes around 15 to 20 minutes, so it’s going to take a while, but at least it gives you a start. Here you can see that it started that process of creating the cluster and all the configurations associated with the cluster.
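Since the tool drives CloudFormation, you can watch the same stack directly; a small boto3 sketch follows, with the stack name as a placeholder for whatever the tool created.

```python
# Inspect the CloudFormation stack the manage-EKS-stacks tool created.
# The stack name is a placeholder; find the real one in the console.
import boto3

cf = boto3.Session(profile_name="demo-profile").client(
    "cloudformation", region_name="us-west-2")

stack = cf.describe_stacks(StackName="<stack-created-by-tool>")["Stacks"][0]
print(stack["StackName"], stack["StackStatus"])  # e.g. CREATE_IN_PROGRESS
```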
All right, so I’m not going to wait for that. Let’s move to the next and last scenario, which is a troubleshooting scenario. I have a cluster called MCP demo cluster with a load balancer that’s in a broken state. This LoadBalancer service, which is a Service object in Kubernetes, is stuck in a pending state and not able to get an external IP. I’m just typing in natural language: my LoadBalancer in the default namespace on my cluster MCP demo cluster is stuck in pending state and not getting an external IP, can you help troubleshoot? This is where some of the troubleshooting runbooks and troubleshooting tools I mentioned earlier come into play. So let’s see what Kiro does here.