Welcome to the next pillar of the observability stack every SRE should understand: Tracing
If you haven’t yet set up the other pillars (monitoring, alerting, and logging), I recommend starting there first. You can follow the complete series in the link below, which walks through each component step by step before diving into tracing.
Why Tracing Matters
From an SRE perspective, tracing is the most critical — and often the most undervalued — pillar of the observability stack. Metrics and alerts are excellent at telling you something is broken, but they completely fall short when it comes to explaining why. In real production environments, it’s common to see all dashboards green, SLIs within thresholds, and alerts silent — while users still complain that the application feels slow. This is where tracing becomes non-negotiable. Tracing gives you request-level truth: how a request propagates across services, where latency is actually introduced, which downstream dependency is slowing things down, what status codes are returned, and how retries or fan-outs behave. For an SRE, this depth is not a “nice to have”; it is the difference between guessing and knowing, between reactive firefighting and confident, data-driven debugging.
Introduction
In this section, we will implement distributed tracing using a dummy microservices-based application. The service is designed to mimic a real-world architecture, where a single request fans out to multiple downstream dependencies such as Redis, PostgreSQL, and other internal services. As requests flow through the system, we will generate end-to-end traces, store them in Tempo, and visualize them in Grafana. This setup allows us to observe how each request propagates across services, identify latency bottlenecks, and understand dependency behavior — exactly the kind of visibility required to debug and operate modern microservice architectures with confidence.
HotROD: Our Trace Generator for This Blog
HotROD (“Hot Rides on Demand”) is an open-source demo microservices application provided by the Jaeger project. It simulates a ride-sharing service composed of several services that communicate with each other, making it ideal for generating distributed trace data that reflects real-world latency, service calls, and dependencies. HotROD is widely used in the community to illustrate how tracing works end-to-end because it naturally emits spans across multiple services without requiring manual instrumentation in every component.
We run HotROD as a Kubernetes pod and configure it to send trace spans to Tempo so we can visualize them in Grafana. Below is the deployment manifest we use:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: hotrod
  name: hotrod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hotrod
  template:
    metadata:
      labels:
        app: hotrod
    spec:
      containers:
        - name: hotrod
          image: jaegertracing/example-hotrod:1.41.0
          args:
            - all
          env:
            - name: JAEGER_AGENT_HOST
              value: tempo.monitoring.svc.cluster.local
            - name: JAEGER_AGENT_PORT
              value: "6831"
          ports:
            - containerPort: 8080
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: hotrod
  labels:
    app: hotrod
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 8080
      targetPort: 8080
  selector:
    app: hotrod
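Assuming the manifest above is saved locally (the file name here is just an illustration), apply it to the cluster:
# File name is an example; use whatever you saved the manifest as.
kubectl apply -f hotrod.yaml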
In this configuration, HotROD emits spans using the Jaeger Thrift protocol by specifying JAEGER_AGENT_HOST and JAEGER_AGENT_PORT. Tempo accepts Jaeger traces natively, so this setup works without additional collectors.
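For this to work, Tempo must have its Jaeger receiver enabled. The fragment below is an illustrative piece of Tempo configuration (your Helm chart values may already set this); treat it as a sketch rather than a drop-in config.
# Illustrative tempo.yaml fragment: accept Jaeger Thrift compact spans on UDP 6831.
distributor:
  receivers:
    jaeger:
      protocols:
        thrift_compact:
          endpoint: 0.0.0.0:6831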
Can HotROD Be Used with OpenTelemetry?
Yes. HotROD can be launched with an OTLP exporter (--otel-exporter=otlp). In this mode, it emits traces using the OpenTelemetry Protocol (OTLP) instead of native Jaeger Thrift:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hotrod-otel
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hotrod-otel
  template:
    metadata:
      labels:
        app: hotrod-otel
    spec:
      containers:
        - name: hotrod
          image: jaegertracing/example-hotrod:latest
          args:
            - all
            - --otel-exporter=otlp
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.monitoring.svc.cluster.local:4317"
          ports:
            - containerPort: 8080
This configures HotROD to emit OTLP traces to an OpenTelemetry Collector running at otel-collector.monitoring.svc.cluster.local:4317.
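For completeness, here is a minimal sketch of what that Collector's configuration could look like, assuming Tempo also has its OTLP gRPC receiver enabled on port 4317; treat it as a starting point rather than a verified config.
# Hypothetical OpenTelemetry Collector config: receive OTLP from HotROD
# and forward the traces to Tempo over OTLP gRPC.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp:
    endpoint: tempo.monitoring.svc.cluster.local:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]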
Once the deployment is applied and the HotROD pod is running in the Kubernetes cluster, you can access the UI by port-forwarding the pod to your local machine.
kubectl port-forward $(kubectl get pods | grep hotrod | awk '{print $1}') 8080
After the port-forward is active, open your browser and navigate to http://localhost:8080 to reach the HotROD UI. Click the available buttons to generate requests; each action triggers distributed tracing spans across the simulated services.
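If you would rather generate traffic from the command line, the UI buttons call HotROD's /dispatch endpoint under the hood; the loop below is a simple sketch (the customer ID is an arbitrary example).
# Fire 20 ride requests through the port-forward; each one produces a trace.
for i in $(seq 1 20); do
  curl -s "http://localhost:8080/dispatch?customer=123" > /dev/null
done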
Grafana Tempo
At this stage, traces are actively being generated and exported, but they are not yet stored anywhere. To persist and query these traces, we now need a backend designed specifically for distributed tracing. This is where Tempo comes into the picture.
Tempo is a natural fit because it is cost-efficient, scalable, and index-free. Unlike traditional tracing backends, Tempo stores traces cheaply (object storage–friendly) and relies on metrics and logs for trace discovery. This makes it ideal for high-cardinality, high-volume tracing workloads commonly seen in microservice architectures. Tempo also integrates seamlessly with Grafana, which keeps the observability stack consistent and simple. We will install Tempo using Helm for a quick and production-aligned setup.
git clone https://github.com/sanskar153/Grafana-Tempo.git
cd Grafana-Tempo
helm install tempo tempo -n monitoring --debug
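Before moving on, it is worth confirming that the Tempo pod is running and that its service is reachable inside the cluster (the release is named tempo, so the service should resolve as tempo.monitoring.svc.cluster.local, which is what the HotROD deployment points at):
kubectl get pods -n monitoring | grep tempo
kubectl get svc -n monitoring tempo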
Now it’s time to install Grafana for trace visualization. For a quick and straightforward setup, we will use Helm.
Run the following commands to add the Grafana Helm repository and install Grafana in the monitoring namespace:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install my-grafana grafana/grafana --namespace monitoring
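You can confirm the Grafana pod has come up before fetching the credentials:
kubectl get pods -n monitoring | grep grafana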
Once Grafana is installed, retrieve the admin credentials by running:
kubectl get secret --namespace monitoring my-grafana \ -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Next, port-forward the Grafana pod to access the UI locally:
kubectl port-forward $(kubectl get pods -n monitoring | grep grafana | awk '{print $1}') -n monitoring 3000
Now open your browser and navigate to http://localhost:3000, log in using the retrieved credentials, and Grafana will be ready for configuration and visualization.
Configuring Tempo as a Grafana Data Source
Now that Tempo and Grafana are up and running, the next step is to configure Tempo as a data source in Grafana.
Follow these steps:
- Navigate to Grafana → Connections → Data sources.
- Search for and select Tempo.
- In the URL field, add the Tempo service endpoint:
http://tempo.monitoring.svc.cluster.local:3200
- Scroll to the bottom and click Save & Test.
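If you prefer to manage this as code instead of clicking through the UI, Grafana also supports data source provisioning. A sketch of such a provisioning file is below (typically mounted under /etc/grafana/provisioning/datasources/ or passed via the Grafana Helm chart's datasources value); the data source name is just an example.
# Illustrative Grafana data source provisioning file for Tempo.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.monitoring.svc.cluster.local:3200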
If the configuration is correct, Grafana will display a “Data source is working” message, confirming successful connectivity with Tempo.
At this point, Grafana is fully configured to query and visualize traces stored in Tempo. To explore the traces, navigate to the Explore section in Grafana, select Tempo as the data source, and choose the Search query type. You will now be able to see all the generated traces along with their individual spans. Each trace provides detailed insights into the request flow, including which services were called, the duration of each span, HTTP status codes, and the associated request and response metadata. This view gives a complete, end-to-end picture of how a request traverses your system.
With this, the tracing setup is complete. You can now visualize end-to-end traces and individual spans, gaining deep visibility into how requests flow through your services. Beyond exploration, this data becomes even more powerful when combined with dashboards — for example, identifying the busiest services or highlighting requests that take longer than one second. These insights help SREs move from reactive debugging to proactive performance optimization.
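If your Tempo version supports TraceQL, questions like these can be asked directly in Explore. The queries below are illustrative, and the service name is just an example.
Traces containing a span slower than one second:
{ duration > 1s }
Error spans from a specific service:
{ resource.service.name = "frontend" && status = error }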
This brings the blog to a close. I encourage you to explore tracing further, experiment with different queries and dashboards, and adapt this setup to your own production environments. If you have any questions or run into issues, feel free to reach out. And finally, a shout-out to all the SREs who debug late into the night, chase elusive latency spikes, and quietly keep systems reliable at scale — your work may not always be visible, but it is foundational to everything running smoothly.