The performance of your Kubernetes operators often influences the behavior of the applications they manage. Operators automate the day-to-day management of your applications by executing critical activities, which may include scaling replicas, performing upgrades, and recovering from failures. For example, a PostgreSQL operator can ensure that standby servers are always deployed, that the database’s failover is correctly configured, and that data is backed up on schedule.
An operator encodes the operational logic of the application it manages, and its behavior is shaped not only by Kubernetes but also by the application’s own configuration, life cycle, and dependencies. Operator metrics and logs can give you early indications of application problems, so monitoring your operators is critical in maintaining the reliability of the workloads running in your cluster.
In this post, we’ll show you how you can monitor your operators’ performance through metrics and logs, and how you can use that visibility to diagnose and resolve operator performance issues.
But first, we’ll take a look at the components and behaviors that enable operators to manage applications in Kubernetes.
How Kubernetes operators work
To understand how operators work, it helps to identify some fundamental components and functionality. Many operators are published and maintained by the Kubernetes community, including those available on OperatorHub.io. Organizations also build their own operators to manage internal services or applications that don’t have a community-provided alternative. In either case, operators share a common structure and set of behaviors.
Each operator includes a Custom Resource Definition (CRD), which extends the Kubernetes API by defining a new Kubernetes resource type. The CRD provides the schema for the Custom Resources (CRs) that describe the application. CRs represent the application’s configuration, including storage requirements, replication settings, and upgrade parameters.
Operators implement the Kubernetes controller pattern. Whereas a controller reconciles the state of the cluster to keep it aligned with its desired state, an operator’s reconciliation is targeted specifically to the application it manages. Operators continually watch their CRs and modify the cluster as necessary, creating, updating, or deleting cluster objects to align with changes to those resources. For example, if you create or update a PostgreSQL CR, the operator detects this change and might automatically provision a PersistentVolumeClaim (PVC) to handle database storage.
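To make this concrete, the following Go sketch shows what a controller-runtime reconciler for that scenario might look like. The PostgresCluster type, the postgresv1 API package, and the buildPVC helper are hypothetical stand-ins for whatever your operator's scaffolding defines.

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	postgresv1 "example.com/postgres-operator/api/v1" // hypothetical API package
)

// PostgresClusterReconciler reconciles PostgresCluster custom resources.
type PostgresClusterReconciler struct {
	client.Client
}

// Reconcile compares the desired state declared in the CR with the actual
// cluster state and creates anything that's missing (here, a PVC for storage).
func (r *PostgresClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pg postgresv1.PostgresCluster
	if err := r.Get(ctx, req.NamespacedName, &pg); err != nil {
		// The CR may have been deleted since the request was queued.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Check whether the PVC described by the CR already exists.
	var pvc corev1.PersistentVolumeClaim
	err := r.Get(ctx, client.ObjectKey{Namespace: pg.Namespace, Name: pg.Name + "-data"}, &pvc)
	if apierrors.IsNotFound(err) {
		// Actual state diverges from desired state: create the missing PVC.
		// buildPVC is a hypothetical helper that derives the PVC spec from the CR.
		if createErr := r.Create(ctx, buildPVC(&pg)); createErr != nil {
			return ctrl.Result{}, createErr
		}
	} else if err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

Every change to a watched CR results in another call to Reconcile like this one.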
This continual activity is called a reconciliation loop, and it allows the operator to maintain the application’s consistency automatically. In the next section, we’ll look at metrics that track how the reconciliation loop and other operator components shape the performance of your application.
Track operator metrics to monitor performance
Most Kubernetes operators are written in Go, and most of those Go-based operators are built by using either the Operator SDK or Kubebuilder framework. Both frameworks rely on the controller-runtime library, which exposes a set of Prometheus metrics that describe the operator’s performance.
With this library as a common foundation, these metrics are consistently available. To monitor your operator, you can collect them by scraping its metrics endpoint, which is http://<IP>:8080/metrics by default. Many Operator SDK and Kubebuilder scaffolds put a secure proxy such as kube-rbac-proxy in front of this endpoint and expose it externally on a different port (commonly 8443), but in all cases, the metrics address and port are configurable.
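If you're working from a scaffolded operator, the metrics address is typically set in main.go when the manager is created. The following sketch uses the field names from recent controller-runtime releases; older releases use a MetricsBindAddress string field on ctrl.Options instead.

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

func main() {
	// BindAddress controls where controller-runtime serves the Prometheus
	// metrics endpoint that your monitoring agent scrapes.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Metrics: metricsserver.Options{
			BindAddress: ":8080", // i.e., http://<pod-IP>:8080/metrics
		},
	})
	if err != nil {
		panic(err)
	}

	// Controllers are registered with the manager here before it starts.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}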
To track your operator’s performance over time, you can use a Prometheus server or the Datadog Agent to collect and store these metrics, and then query and visualize them in your monitoring tools.
In this section, we’ll look at some metrics you can collect to understand the health and performance of your operator. We’ll explore metrics that describe the behavior of the operator’s reconciliation loop and the status of the work queue, where reconciliation requests wait for workers to process them. We’ll also look at Go runtime metrics that give you insight into the resource usage that can shape the operator’s performance.
Reconciliation loop metrics
Like all controllers, operators continuously reconcile cluster state. The metrics described in this section can help you understand how frequently these reconciliations occur and how long they take, enabling you to detect and resolve latency and errors that could affect the performance of your application.
controller_runtime_reconcile_total
This counter tracks the number of reconciliation attempts, labeled by outcome: success, error, or requeue. Tasks are requeued when the operator can't complete them yet but no error has occurred, such as when it's waiting for a pod to become ready.
You can use this metric to see trends in the ratio of successful reconciliations to errors. The following code sample illustrates a PromQL query that you can use to graph or alert on the proportion of reconciliations that resulted in errors over the last 5 minutes.
sum(rate(controller_runtime_reconcile_total{result="error"}[5m]))
  / sum(rate(controller_runtime_reconcile_total[5m]))
An increasing proportion of errors indicates reconciliation problems. Kubernetes events and logs can provide context, for example, to show you whether rate limiting is leading to API request errors and causing reconciliation to fail.
controller_runtime_reconcile_time_seconds
This metric shows you how long it takes the operator to execute each reconciliation. Operators that interact with many objects or external systems (for example, cloud APIs) can experience spikes here during high load. Latency increases may indicate network delays or unresponsive API endpoints.
Depending on the cause, you may be able to address this by changing the operator’s code, such as by using caching or timeouts. There are also steps a cluster administrator can take. For example, if this metric is rising due to resource exhaustion, increasing the CPU requests and limits in the operator’s deployment spec may reduce latency.
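On the code side, one common pattern is to bound calls to external dependencies with a context timeout so that a slow endpoint can't stall a reconciliation indefinitely. The sketch below assumes a hypothetical externalClient interface; the 10-second budget is only illustrative.

package controllers

import (
	"context"
	"time"
)

// externalClient is a hypothetical interface for a slow dependency, such as a
// cloud provider API the operator must call during reconciliation.
type externalClient interface {
	EnsureResource(ctx context.Context, name string) error
}

// syncExternalResource bounds the external call with a timeout so an
// unresponsive endpoint can't hold a reconciliation worker indefinitely.
func syncExternalResource(ctx context.Context, api externalClient, name string) error {
	callCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// If the call times out, returning the error lets controller-runtime
	// requeue the request with backoff instead of blocking the worker.
	return api.EnsureResource(callCtx, name)
}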
controller_runtime_max_concurrent_reconciles
The controller_runtime_max_concurrent_reconciles metric indicates the maximum number of reconciliation loops your operator can run at one time, based on its configuration. This metric reflects the operator's configured limit rather than its actual behavior; compare it with controller_runtime_active_workers (described next) to see how much of that capacity is in use.
controller_runtime_active_workers
The controller_runtime_active_workers metric shows you how many workers are actively processing tasks at any given moment. The depth of an operator’s work queue depends in part on the size and complexity of the application. For example, a PostgreSQL operator creates one CR for each instance of the database. Since each CR must be reconciled every time the application changes, a larger cluster means more reconciliation work for the operator. To efficiently process the queue, the operator can use multiple workers to execute reconciliation tasks in parallel.
Work queue metrics
The reconciliation loop's speed and concurrency, as well as the rate of incoming reconciliation requests, all shape the depth of the work queue. Each item in the queue represents a CR to be reconciled. Workers in the pool each pull the next CR from the queue and reconcile it, which includes looking up its actual state, comparing that with its desired state, and updating the cluster if the two don't match. Monitoring the queue's status helps you understand whether the operator is keeping up with demand.
workqueue_depth and workqueue_queue_duration_seconds
These metrics show how many items are currently in the queue and how long they’ve been waiting. If either metric rises, the operator may be receiving more work than it can process. This can happen when:
- Many CRs change in a short period
- The Kubernetes API responds slowly
- The operator has too few workers
For example, a helm upgrade that modifies many CRs at once can suddenly increase queue length and wait time. The operator will gradually reduce the backlog, but how quickly it catches up depends on its concurrency configuration.
workqueue_adds_total
This metric shows the rate at which items are added to the queue. If you’re investigating a queue backlog, it can help to understand the rate of incoming work. A sudden spike in incoming queue items often indicates a burst of CR updates, and inspecting logs and events can help you identify the source.
Go runtime metrics
If your operator’s reconciliation loops are slow or its work queue is backlogged, you can look to Go runtime metrics to determine whether its memory consumption is a contributing factor. Whether you’ve written your own operator or you’re using a third-party operator, there’s some risk that flaws in the code could lead to resource consumption problems and performance issues. If you see anomalies with Go runtime metrics, you may also see impact on the operator’s performance and on some of the other metrics we’ve looked at so far. Use these to complement those signals and home in on an issue’s root cause.
go_memstats_alloc_bytes
This metric measures heap memory currently allocated by the Go process. A steady increase can indicate a memory leak, which could stem from cached client responses, unbounded queues, or references to objects that are never released. A leak will eventually degrade operator performance and may lead to out-of-memory (OOM) errors.
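As an illustration (not drawn from any particular operator), a cache keyed by CR that is only ever written to will grow with every reconciliation and show up as a steady climb in this metric:

package controllers

import "sync"

// statusCache sketches a common leak pattern: entries are added on every
// reconciliation but never evicted, so heap usage grows with the number of
// CRs and reconciliations.
type statusCache struct {
	mu      sync.Mutex
	entries map[string][]byte // keyed by namespace/name
}

func (c *statusCache) put(key string, payload []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = payload
}

// Evicting entries when their CR is deleted (or swapping in a bounded/LRU
// cache) keeps the heap from growing without bound.
func (c *statusCache) evict(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}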
go_goroutines
A rising goroutine count can indicate that goroutines are not exiting correctly, for example, because of network calls without proper timeouts or retry loops that never terminate. Goroutine leaks consume memory and can slow down your operator's reconciliations.
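One way to avoid this class of leak is to tie every long-lived goroutine the operator starts to a context so it exits when its work is no longer needed. A rough sketch, with a hypothetical poll callback:

package controllers

import (
	"context"
	"time"
)

// watchExternal starts a background polling loop. Without the ctx.Done()
// case, every reconciliation that called this would leak a goroutine and
// go_goroutines would climb steadily.
func watchExternal(ctx context.Context, poll func() error) {
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				_ = poll() // hypothetical call to an external system
			case <-ctx.Done():
				// Exit when the reconcile (or manager) context is canceled.
				return
			}
		}
	}()
}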
Explore operator logs to gain context and detail
The operator metrics we’ve covered so far can help you detect problems with your operator’s health and performance. After metrics alert you to an issue, logs can help you determine why it occurred.
Log content varies among operators since each developer decides what information to log. But because most operators are built on the controller-runtime library, there is some standard information you can expect to see in your logs. For example, if an operator generates an error when it encounters a CR with an invalid configuration, the controller-runtime library logs the error. The library also logs a subset of other activity, including operator startup and leader elections. These logs are emitted automatically by controller-runtime for any operator built on it.
Beyond the default information logged by the controller-runtime library, individual operators may log other information depending on their implementation. Your logs may include expanded information at the INFO and WARNING log levels that provides context around the errors in the log. For example, if you see an increasing ratio of requeued-to-successful reconciliations (both of these numbers are shown in the controller_runtime_reconcile_total metric), your logs may help you understand the cause.
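As a sketch of what that custom logging often looks like in a controller-runtime operator, the reconciler below pulls the request-scoped logger from the context and emits INFO and ERROR lines around a requeue decision. The MyAppReconciler type and its dependenciesReady check are hypothetical.

package controllers

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// MyAppReconciler is a minimal stub for illustration.
type MyAppReconciler struct{}

// dependenciesReady is a hypothetical check, e.g., whether the application's
// pods have become ready.
func (r *MyAppReconciler) dependenciesReady(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}

func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	logger.Info("starting reconciliation", "resource", req.NamespacedName)

	ready, err := r.dependenciesReady(ctx, req)
	if err != nil {
		logger.Error(err, "failed to check dependencies", "resource", req.NamespacedName)
		return ctrl.Result{}, err
	}
	if !ready {
		// This is the kind of context that helps explain requeued
		// reconciliations showing up in controller_runtime_reconcile_total.
		logger.Info("dependencies not ready, requeuing", "resource", req.NamespacedName)
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}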
Operator logs are exposed through the Kubernetes API, just like other pod logs. You can access them with kubectl logs -n <namespace> <pod_name>, where <pod_name> is one of the pods in your operator’s deployment.
Diagnose and resolve operator performance issues
Metrics and logs can help you detect problems and show you data that indicates what’s wrong. But to successfully investigate and mitigate an emerging issue with your operator, you need visibility into the health of your Kubernetes cluster and the infrastructure where it runs. In this section, we’ll look at how you can use Datadog to identify the root causes of operator performance issues and take action before they affect the applications running in your cluster.
Find and fix queue backlogs
If you see the workqueue_depth or workqueue_queue_duration_seconds metrics increasing, your operator may be receiving reconciliation requests faster than it can process them. This can be caused by an increasing rate of incoming events (such as when a large number of CRs change at once), slow responses from the Kubernetes API, or too few workers reconciling the items in the queue.
For example, if you apply a helm upgrade to your application, all affected CRs are added to the work queue for reconciliation. If this is a large number of objects, the length of the queue and the time each task spends in the queue will increase until workers are able to process the tasks. How quickly they catch up depends on the number of workers in the pool.
Because queue backlogs often signal early performance issues, it’s useful to create a monitor that alerts you when queue depth or wait time exceeds a threshold you specify. A rising queue length, especially when paired with longer reconciliation times, is a strong indicator that the operator is struggling to keep up with incoming work and may need tuning or additional resources.
To resolve a backlog of queued reconciliation tasks, you may be able to increase the operator’s MaxConcurrentReconciles setting to allow more reconciliation workers to run in parallel. Some operators allow you to configure this value through an environment variable. In the following code sample, the operator reads the MAX_CONCURRENT_RECONCILES variable and uses it to set concurrency.
(Note that this example does not directly update the operator’s configuration. Instead, it illustrates a way this can be done, assuming the operator’s code reads the MAX_CONCURRENT_RECONCILES variable declared here.)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application-operator
  namespace: operators
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-application-operator
  template:
    metadata:
      labels:
        app: my-application-operator
    spec:
      serviceAccountName: my-application-operator
      containers:
        - name: manager
          image: my-application-operator:latest
          command:
            - /manager
          env:
            - name: MAX_CONCURRENT_RECONCILES
              value: "10"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
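To show the operator side of this, here is a rough Go sketch of code that reads the variable and applies it when the controller is registered with the manager. The MyAppReconciler type, the myappv1 API package, and the helper name are all hypothetical.

package controllers

import (
	"os"
	"strconv"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	myappv1 "example.com/my-application-operator/api/v1" // hypothetical API package
)

// concurrencyFromEnv reads MAX_CONCURRENT_RECONCILES and falls back to a
// default when the variable is unset or invalid.
func concurrencyFromEnv(defaultWorkers int) int {
	if v, err := strconv.Atoi(os.Getenv("MAX_CONCURRENT_RECONCILES")); err == nil && v > 0 {
		return v
	}
	return defaultWorkers
}

// SetupWithManager registers the reconciler and applies the concurrency limit,
// which is then reported by controller_runtime_max_concurrent_reconciles.
func (r *MyAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&myappv1.MyApp{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: concurrencyFromEnv(1),
		}).
		Complete(r)
}

With wiring like this, adjusting the value in the Deployment manifest above is enough to change the operator's concurrency on its next restart.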
Increasing concurrency means the operator can run more workers simultaneously, which increases CPU usage. It’s important to monitor the operator’s CPU utilization after this change to determine whether you need to increase CPU requests or limits.
Investigate latency in the controller’s reconciliation loop
If your operator’s reconciliation work slows down, it could affect your application’s performance. Elevated latency can be caused by errors, slow dependencies, or resource constraints such as CPU throttling or memory pressure.
To visualize reconciliation latency, you can graph the 95th percentile (p95) value of the controller_runtime_reconcile_time_seconds metric. For example, this PromQL query will show you the p95 reconciliation duration over the previous five minutes:
histogram_quantile(0.95, sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le))
If the p95 duration increases, the operator pod may be experiencing CPU throttling or starvation. You can confirm this in Datadog’s Live Containers view, which surfaces CPU limits, throttling, and resource pressure on the node.
Reconciliation latency can also increase when the operator is encountering errors due to timeouts, RBAC issues, or invalid CR specs. To diagnose these issues, correlate latency spikes with increases in the controller_runtime_reconcile_total{result="error"} metric, then inspect operator logs and Kubernetes events for messages that indicate misconfigurations or access issues.
Address runtime inefficiencies in your custom code
Go runtime metrics can reveal efficiency problems inside your operator, especially issues that cause gradual performance degradation over time. Two metrics in particular, go_memstats_alloc_bytes and go_goroutines, provide early visibility into code problems such as resource leaks and infinite retry loops. A rise in either of these metrics signals increased resource consumption, which can be a leading indicator of deteriorating operator performance.
Runtime inefficiencies like these typically require code-level fixes. Datadog’s Continuous Profiler can help you identify memory-heavy functions, leaked goroutines, and inefficient code paths. If you maintain the operator yourself, you can patch it. Alternatively, you may be able to find a community-provided one that offers similar functionality.
The performance of your operator and the application it manages also depends on the health of the infrastructure it runs on and the activity across your wider environment. Datadog enables you to correlate operator metrics and logs with signals from the rest of your stack so you can trace performance issues back to resource constraints, dependency failures, or upstream configuration changes.
Keep your Kubernetes operators reliable and performant
An operator that falls behind or encounters errors can degrade the performance of the application it manages, which makes monitoring operator health essential. Datadog’s unified observability helps you track operator performance at every level. You can visualize reconciliation behavior, queue health, concurrency utilization, and runtime resource pressure alongside Kubernetes events, API server metrics, and node-level signals.
See the documentation for more information about how to keep your operators healthy and performant with Kubernetes Monitoring, Live Containers view, and Continuous Profiler. If you’re not already using Datadog, sign up for a free 14-day trial.