This article describes how AI can operate on Kubernetes clusters, both via GitOps and directly, under human control.
Human-controlled AI operations on a Kubernetes cluster
Problem: Traditional Documentation Doesn't Work
I work in a team that provides observability services on Kubernetes clusters. Most of the work is DevOps and Ops, but sometimes development in Python or Golang. The 24/7 standby is handled by another support team. Deployment and configuration are managed with an Application Controller (ArgoCD) in a GitOps way.
My team is responsible for everything from setting up a new Kubernetes cluster with all its services, through handling user requests, to incident handling during working hours. Outside working hours, incidents are handled by the 24/7 support team.
Broad competence, experience and a willingness to self-learn are required for this job. The learning curve is long; honestly, it never ends. From a knowledge point of view, the most critical role is the 24/7 support team member, who has neither the time nor the chance to gain enough competence and experience to operate responsibly on our Kubernetes clusters.
Several knowledge-sharing pages were written, but they aren't useful for incidents: each incident is a different story, and we cannot write Runbooks detailed and well-structured enough for everybody.
Our system is not robust enough because of a tradeoff. The FinOps organization pushes my team to use as few resources as possible to keep operational costs low, while our users want 99.9% availability, which needs more idle resources and is therefore more expensive.
So the allocated base resource consumption is 2–3 times higher than the average real usage, in order to handle small usage peaks. If usage (CPU or RAM) grows beyond that, the autoscaler starts a new instance or node. But this autoscaling does not work perfectly in every case on our heterogeneous system. Most of the incidents in the last year were resource issues: not enough resources (CPU, RAM, storage) or limits hit (message size, messages/sec, MB/s).
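To make this concrete (my illustration, not from the incident reports; the namespace and pod names are hypothetical), the first discovery steps for such a resource incident typically look like this:

```sh
# Compare live usage against the tight requests/limits described above
kubectl --context cluster-1 top pods -n loki --sort-by=memory

# Look for OOMKilled containers and failed scheduling events
kubectl --context cluster-1 describe pod loki-ingester-0 -n loki | grep -A 5 'Last State'
kubectl --context cluster-1 get events -n loki --field-selector reason=FailedScheduling
```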
Goal: AI as a Tool to Glue Everything
Traditional info sharing was not good enough in the team, so we had to make a paradigm change.
Instead of trying to figure out all the possible incident cases (which is not possible) and trying to write documentation that suits everybody (which is also not possible), we decided to follow another approach: promote the AI as the main tool for configuring, troubleshooting and incident handling.
The AI is the glue between the scattered pieces of information, and it collects information from the live systems automatically. Based on this information, the AI provides solution alternatives, either as executable commands on the live systems or as pull requests to the GitOps repos.
There are several roles with different competence, experience and expectations:
24/7 Support Engineer
These engineers must know several systems, so they cannot know each system deeply enough. They work on our Kubernetes clusters outside working hours, and their goal is to hotfix or work around the problem until working hours, when my team takes over the incident and implements the final solution.
Most of the incidents are related to resource consumption or to reaching a rate limit, so they should be able to identify this and add more resources and/or increase a limit. These changes may be made with GitOps or without GitOps (directly on the live system).
The AI should discover the cluster, identify the bottleneck and provide solution alternatives: either commands which can be executed on the cluster with one click, or a pull request to the GitOps repo.
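A sketch of what such a pair of alternatives could look like for an out-of-memory component (the resource values, file paths and branch name are hypothetical):

```sh
# Alternative 1: direct hotfix on the live system (incident handling only;
# ArgoCD will reconcile it away once the GitOps fix below is merged)
kubectl --context cluster-1 -n loki patch statefulset loki-ingester \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"ingester","resources":{"limits":{"memory":"4Gi"}}}]}}}}'

# Alternative 2: long-term GitOps fix as a pull request
git checkout -b fix/loki-ingester-memory
yq -i '.loki.ingester.resources.limits.memory = "4Gi"' values/loki.yaml
git commit -am "Increase Loki ingester memory limit"
git push origin fix/loki-ingester-memory
```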
Junior Team Members
Any job grade (intern, junior, medior, senior, staff, etc.) can have this role, but mostly it's the newcomers in the team. They have access to all the clusters (including test and sandbox), the git repos and the documentation. But they don't have experience with our system: they don't know how our system can be discovered or where the proper config files can be found.
They should be able to implement tickets on their own as soon as possible. They can access our test and sandbox clusters, so they can independently try out solution alternatives that they are unsure of.
The AI should help them similarly to the 24/7 Support Engineer, but their preference is different: they should prefer the long-term GitOps solution instead of a direct change on the live system. The preference depends on the task: if the task is incident handling, a direct change to the system is acceptable (until the final solution is deployed by GitOps).
The AI does not know that the preference depends on the task type, so the engineers must be prepared to select among the solution alternatives, depending on the case.
Developer Team Member
Higher job grades (medior, senior, staff, etc.) can have this role. This role integrates a new solution, service or application, or implements a new feature in Python or Golang.
These engineers must read a lot of internal and external documentation. They may be in contact with users or Product Owners. New features should be tested on the test or sandbox cluster before being deployed to the live clusters.
The AI should help them similarly to the Junior Team Members and should support them in learning new things. Most of the time, this role does not work on the live nodes.
Concerns: Responsibility and Information
AI is a buzzword in the IT world nowadays. In the DevOps and Ops domain, AI is mostly used only to get useful information. Operating on a live system by AI is risky, so human control is advised; without human control it's science fiction. Typical issues and concerns with using AI in the Ops and DevOps domain:
- Responsibility: Who is responsible if the AI suggests a bad solution or makes a mistake on a live system?
- Unpredictable behavior: The base concept of LLMs is predicting the next words and sentences, based on the learned information and using probability. As a result of this conceptual probability, the output will sometimes be different.
- Lack of information: The AI does not join informal conversations, where the key principles and concepts are decided and shared verbally.
- Custom information: The team's way of working and its priorities among solution alternatives differ from the typical methods the AI knows.
- Quality of information: Part of the information shared on the corporate intranet is not maintained and becomes obsolete.
- Heterogeneous information: The information needed to make good decisions on live systems comes in different formats, which may not be understandable by the AI.
- Fragmented information: The needed information is scattered across several places.
- No access to information: The information on the corporate intranet must be accessed in different ways and with different credentials.
Based on the above, the engineer is the one responsible. The AI is only a tool, which can present what it discovered and can suggest solution alternatives. The engineer has to make sure that any change on the live system and in the GitOps repos is safe.
So the AI can only support the engineer's work, and the engineer will produce a better solution faster.
Solution: VSCode + Copilot
My company made a smart decision: use only GitHub Enterprise for version control and GitOps. If the company needs a service and it exists on GitHub, it is selected instead of the competitors. An example of this is Copilot for AI.
VSCode is used in my team with Copilot Chat. In the past, Copilot was used in Ask mode with a ChatGPT model, because it was the default. After a while, we changed to Agent mode and an Anthropic Claude model, mostly Sonnet. This setup is good enough for implementing features and deployments.
Connection Setup
AGENTS.md for DevOps and Ops
There was an important decision: the solution must be used on real clusters from the beginning, with strict control (on the sandbox and test clusters first), in order to gain enough experience and to have a safe context config before using this solution on the live clusters.
When I tried to use Copilot for DevOps and Ops activities, I had to put a lot of information into the context manually, which was annoying and obstructive. The solution was creating AGENTS.md files in the root of each cluster repository (*-infra, *-controller, *-helm); Copilot adds them to the context automatically if these repos are in the VSCode workspace. When I set the cluster name at the beginning of the prompt, Copilot can pick the selected cluster-related information from the AGENTS.md files.
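For example, a prompt can then start with the cluster name (a hypothetical illustration):

```text
cluster-1: the Loki ingester pods are restarting every few minutes.
Discover the cluster, find the root cause and propose solution alternatives.
```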
The AGENTS.md files can also be placed elsewhere, for example as .github/copilot-instructions.md, but that is too specific to Copilot. It may become important when we want to use the Copilot Agent running on the GitHub web page.
At first, I asked Copilot to write the AGENTS.md files, based on what I wrote in the prompt and what it discovered in the repos. The results were 500-line files, which are added to the context automatically. Such long files consume tokens from the limited token budget, so they should be shorter. I used the Writing a good CLAUDE.md suggestions to make them shorter (less is more: write only information which cannot be discovered from the repo). To shorten the AGENTS.md files further, I put an AGENTS.md file with the cluster-independent information into the obs-ai repo.
The following information is in the AGENTS.md files (a minimal skeleton is sketched after the list):
- What this repo is used for.
- Purpose and role description (what kind of engineers will use the promt).
- Relationship between the repos of one cluster (*-infra, *-controller, *-helm).
- References to other repos (local git directory, GitHub URL): private and public.
- Kubectl connection info to the cluster (config file location, kubectl context).
- Autoscaling, high availability and multizone concept
- Notes on the users of the cluster (tenants)
- Notes on the main services of the cluster (Grafana stack: Loki, Mimir, Tempo, Grafana UI)
- Notes on the main data flow of the cluster (telemetry endpoint and pipeline)
- Notes on the network setup (Istio)
- Troubleshooting guidance and preferences
- Troubleshooting info references: Runbooks
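A minimal sketch of such a file, with hypothetical content assembled from the points above (the real files are cluster-specific):

```markdown
# AGENTS.md — cluster-1 (hypothetical sketch)

## Purpose
Deployment repo for the cluster-1 observability stack; used by DevOps/Ops
engineers via Copilot Chat for configuring, troubleshooting and incidents.

## Related repos
cluster-1-infra (Terraform), cluster-1-controller (services), cluster-1-helm (ArgoCD)

## Cluster access
Kubectl config file: $HOME/.kube/cluster-1.yaml, context: cluster-1

## Main services and data flow
Grafana stack (Loki, Mimir, Tempo, Grafana UI); telemetry pipeline managed by
the OTEL operator; Istio service mesh; 3-zone high-availability setup.

## Troubleshooting preferences
- Prefer GitOps changes; direct kubectl changes only as incident hotfixes.
- Warn before any kubectl command that would change the cluster.
- If a workload is managed by an operator, change the CR instead.
```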
Generating initial AGENTS.md
I generated the initial AGENTS.md files with the prompt below:
```text
Create AGENTS.md for each below repositories, which will be used by Copilot AI agents, especially Claude Sonnet 4.5. The developers will use Copilot chat to change and improve the cluster deployments. If needed, use kubectl commands to discover the cluster. The Copilot prompts should be able to use kubectl commands to discover and validate the current configuration.

Discover and describe which services are deployed in each repositories. Discover and describe how the deployments and services are related between the repositories. Make more detailed discovery for Grafana, Loki, Mimir, Tempo, Alloy, OpenTelemetry Gateway and Istio. Make more detailed discovery for the pattern of telemetry data path, managed by OTEL operator with opentelemetrycollectors.opentelemetry.io CR, but do not list each Tenant Applications and Tenant Namespaces, because it will be extended in the future. Make more detailed discovery for self monitoring and log collection (telemetry) in org-obs.

Discover and describe the referred Helm charts.

For the Context:

I have a Kubernetes cluster named cluster-1.

This cluster can be accessed by below kubectl configuration:
Kubectl config file: $HOME/.kube/cluster-1.yaml
Kubectl context: cluster-1

This cluster can be managed by below repositories:

Terraform deployment for infrastructure:
* Git URL: https://github.com/company-org-infra/cluster-1-infra
* Local directory: cluster-1

Terraform deployment for services:
* Git URL: https://github.com/company-org-obs/cluster-1-controller
* Local directory: cluster-1-config

ArgoCD deployment:
* Git URL: https://github.com/company-org-obs/cluster-1-helm
* Local directory: cluster-1-cluster-resources

More info about referred charts:

obs-agent
* Helm repo URL: https://nexus.company.int/repository/obs-charts/
* Git URL: https://github.com/company-org-obs/obs-agent
* Local directory: obs-agent

telemetry-gateway
* Helm repo URL: https://nexus.company.int/repository/obs-charts/
* Git URL: https://github.com/company-org-obs/telemetry-gateway
* Local directory: telemetry-gateway

helm-charts/grafana
* Helm repo URL: https://grafana.github.io/helm-charts/
* Git URL: https://github.com/grafana/helm-charts
* Local directory: ../github/grafana/helm-charts/charts/grafana

helm-charts/loki
* Helm repo URL: https://grafana.github.io/helm-charts/
* Git URL: https://github.com/grafana/loki
* Local directory: ../github/grafana/loki/production/helm/loki

helm-charts/mimir
* Helm repo URL: https://grafana.github.io/helm-charts/
* Git URL: https://github.com/grafana/mimir
* Local directory: ../github/grafana/mimir/main/operations/helm

The parent directory to local repositories is $HOME/work/src, but it's different for the developers, so use relative directory references.

Validate yourself, do not hallucinate.
```
During usage, I had to improve the AGENTS.md files, for example:
- Versions: remove the specified versions (they get updated all the time).
- Guidance: check the Istio config in the repos and on the cluster in case of communication errors.
- Guidance: if a pod is not able to schedule, analyze the reason more deeply.
- Guidance: check whether the Deployment/StatefulSet/DaemonSet is managed by an operator; if yes, change the CR (see the sketch after this list).
- Infrastructure: a description of the 3-zone high-availability setup.
- Safety: highlight (warn) if a kubectl command would change the cluster.
- Optimization: do not use another cluster's AGENTS.md file.
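The operator check in the guidance above could look like this minimal sketch (the namespace and resource names are hypothetical):

```sh
# If ownerReferences names a CR, edit the CR instead of the workload itself
kubectl --context cluster-1 -n telemetry get statefulset otel-gateway-collector \
  -o jsonpath='{.metadata.ownerReferences[*].kind}{"\n"}'
# e.g. "OpenTelemetryCollector" -> change the opentelemetrycollectors.opentelemetry.io CR
kubectl --context cluster-1 -n telemetry edit opentelemetrycollector otel-gateway
```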
Command safety
The harmful and harmless commands executed by Copilot Chat are defined in the VSCode workspace file, for example:
"settings": { "chat.tools.terminal.autoApprove": { "kubectl": true, "/kubectl.*\\b(annotate|apply|attach|auth|autoscale|certificate|cordon|cp|create|delete|drain|edit|exec|expose|kustomize|label|patch|replace|rollout|run|scale|set|taint|uncordon)\\b/": false, "awk": true, "uniq": true, "yq": true, "/^yq\\b.*(-[a-zA-Z]*i[a-zA-Z]*|--inplace)\\b/": false, "bc":true, ... }}
Confluence pages and Runbooks
I gave the Copilot Chat connection info to the company Confluence pages, in order to get additional troubleshooting information, but its processing of the HTML responses was not good enough. Instead, I vibe-coded a simple Python script in the obs-ai repo that fetches the selected index pages and their references and converts them to Markdown files in the docs/md directory. Copilot processes the Markdown files better.
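The article doesn't show the script, but a minimal sketch of the idea could look like this, assuming the standard Confluence REST API and the html2text package (the base URL, token handling and page IDs are hypothetical):

```python
# Minimal sketch: fetch Confluence pages and convert them to Markdown.
# Assumptions: Confluence REST API reachable, html2text installed,
# CONFLUENCE_TOKEN in the environment; page IDs are hypothetical.
import os
import pathlib

import html2text
import requests

BASE = "https://confluence.company.int"
HEADERS = {"Authorization": f"Bearer {os.environ['CONFLUENCE_TOKEN']}"}
PAGE_IDS = ["123456", "123457"]  # index pages; references would be crawled similarly

converter = html2text.HTML2Text()
converter.body_width = 0  # don't hard-wrap lines

out_dir = pathlib.Path("docs/md")
out_dir.mkdir(parents=True, exist_ok=True)

for page_id in PAGE_IDS:
    resp = requests.get(
        f"{BASE}/rest/api/content/{page_id}",
        params={"expand": "body.storage"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    page = resp.json()
    html = page["body"]["storage"]["value"]
    markdown = converter.handle(html)
    (out_dir / f"{page['title']}.md").write_text(markdown, encoding="utf-8")
```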
The private and public Runbooks have similar issues, so they should be converted in the same way.
Jira
Tasks are managed with Jira in my team. The connection and the info processing will be worked out in the future.
Logs
The Pod logs are stored in Grafana Loki. The connection has not been worked out yet. Most probably, the logs will be fetched through the Grafana API, not directly from the Loki API.
Temporarily, the Copilot Agent automatically runs the kubectl logs command, which can fetch the logs of running Pods, but not of stopped ones.
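Both the current workaround and the planned path could look like this sketch (the Grafana URL, datasource UID and token are my assumptions; the datasource proxy and the Loki query_range endpoint are standard APIs):

```sh
# Current workaround: logs of a running (or just-restarted) Pod
kubectl --context cluster-1 -n loki logs loki-ingester-0 --previous --tail=200

# Possible future path: query Loki through the Grafana datasource proxy
curl -s -G -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.cluster-1.company.int/api/datasources/proxy/uid/loki-uid/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="loki", pod="loki-ingester-0"}' \
  --data-urlencode "start=$(date -d '-1 hour' +%s)000000000"
```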
Metrics
The metrics are stored in Grafana Mimir. The connection has not been worked out yet. Most probably, the metrics will be fetched through the Grafana API, not directly from the Mimir API.
Temporarily, the Copilot Agent automatically runs kubectl port-forward and curl commands, which can fetch the current metrics of running Pods, but not for a time range.
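A sketch of both options (the port, metric names, Grafana URL and datasource UID are hypothetical; the PromQL range API is the standard Prometheus-compatible endpoint):

```sh
# Current workaround: scrape the instantaneous /metrics endpoint of a Pod
kubectl --context cluster-1 -n loki port-forward loki-ingester-0 3100:3100 &
curl -s http://localhost:3100/metrics | grep ingester

# Possible future path: PromQL range query through the Grafana datasource proxy
curl -s -G -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.cluster-1.company.int/api/datasources/proxy/uid/mimir-uid/api/v1/query_range" \
  --data-urlencode 'query=container_memory_working_set_bytes{pod="loki-ingester-0"}' \
  --data-urlencode "start=$(date -d '-1 hour' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode "step=60"
```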
Alerts
The alerts are stored in Grafana Mimir. The connection has not been worked out yet. Most probably, the alerts will be fetched through the Grafana API, not directly from the Mimir API.
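As with metrics, a hypothetical sketch through the Grafana datasource proxy (the URL and UID are assumptions; the alerts/rules paths are the standard Prometheus-compatible API):

```sh
# Firing alerts and rule state of the Mimir ruler, via the Grafana proxy
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.cluster-1.company.int/api/datasources/proxy/uid/mimir-uid/api/v1/alerts"
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://grafana.cluster-1.company.int/api/datasources/proxy/uid/mimir-uid/api/v1/rules"
```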
Automation
GitHub Copilot CLI is not GA yet, so it's disabled in my company. When it reaches GA, the automation possibilities will be investigated. At first, only the typical info collection will be automated. Some automated change actions may be possible, but not in the near future.
Experiences
My opinion about the Copilot Agent for Ops and DevOps is mixed. Most of the time, the commands it executes for discovery are better than what I would write, but a few times I could have done better.
Sometimes the issue analysis finds the root cause faster than me (or at least I find the possible root cause in a list), but sometimes it cannot find it, so additional hints must be given in a new prompt. This can be improved by extending the AGENTS.md files.
Most of the time, the right alternative in the provided list of solution alternatives is not in 1st place. The root cause of this inaccuracy can be:
- Lack of information: the design and architecture decisions and WoWs are not written in the AGENTS.md files (they should be written there).
- Use case: depending on the cluster delivery type (production, test, sandbox) and the engineer role, the best solution alternative can be different for the same symptoms.
But overall, with competent human control, this AI solution is useful and safe, and it increases the team velocity and improves the service readiness SLOs.
Conclusions
Even though I wrote good documentation, it was rarely found or read, so it wasn't worth the effort. Sometimes even I don't remember what I documented and where. This may be a cognitive limitation of heterogeneous DevOps and GitOps systems.
Using an AI assistant with a well-configured context increases the confidence of the engineers. This human-controlled AI way promises better and more effective knowledge sharing and a better learning path.
The AI agent cannot make the best decision all the time, because of a lack of information: the delivery type, the engineer role and the non-documented info. Working out how to provide this information will make the AI agent's decisions better and safer in the future.