Modern software rarely lives in a single process anymore. What used to be a monolith is now a web of services, APIs, queues, and background workers, often written in different languages and deployed across regions.
This shift brings clear benefits in scalability and team autonomy, but it also makes systems harder to reason about. When something goes wrong, the failure is rarely confined to one place.
A single user action, like loading a page or submitting a payment, can trigger a cascade of calls across half a dozen services. Small delays compound, failures propagate in indirect ways, and the original symptom observed by the user may be several layers removed from the underlying cause.
Traditional observability signals like logs and metrics remain essential, but on their own they struggle to reconstruct this journey with enough fidelity to explain why a specific request behaved the way it did.
Distributed tracing exists to fill that gap. By following a request end to end, across service and network boundaries, tracing provides the connective tissue between components that otherwise look independent. You can see which services were involved, how long each step took, and where the critical path formed.
This article brings together the practical foundations of distributed tracing with the architectural context that makes it essential. We’ll look at why tracing became necessary, how it fits alongside logs and metrics, and what you should understand before instrumenting real systems.
The goal is not just to define tracing, but to make it intuitive and useful when systems stop behaving the way you expect.
What is distributed tracing?
Distributed tracing is an observability technique that records how a single request flows through a distributed system as it passes between services, processes, and network boundaries.
Rather than analyzing components in isolation, tracing follows the request itself and captures what each participating service did and how long it took.
By stitching these individual operations together through shared identifiers and relationship markers, tracing produces a coherent view of the full request lifecycle. This makes it possible to understand where latency accumulates, how errors propagate, and which parts of the system are on the critical path for a given outcome.
Tracing complements logs and metrics by answering a different question. Metrics describe overall behavior and trends, logs provide local detail, and tracing explains how a specific request experienced the system end to end.
How distributed tracing works
Tracing starts by assigning a unique identifier to a request at its entry point and carrying that identifier as the request moves through the system.
Each service involved records its own slice of work, including timing and relevant context, and links it back to the same trace identifier. This is what allows these otherwise independent observations to be assembled into a single, coherent view of the request.
These units of work are captured as spans. Each span represents a specific operation, such as serving an HTTP request, running a database query, or performing internal application logic, and records its duration and outcome. As a request fans out or triggers additional work, spans naturally nest within one another, forming a hierarchy that reflects how the request actually moved through the system.
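To make this concrete, here is a minimal sketch of how nested spans can be created with the OpenTelemetry JavaScript API. It assumes an SDK and tracer provider have already been configured elsewhere; the tracer and span names are illustrative.

```javascript
const { trace } = require("@opentelemetry/api");

const tracer = trace.getTracer("checkout-service");

// The outer span represents the inbound request; the span created inside the
// callback becomes its child automatically via the active context.
tracer.startActiveSpan("POST /checkout", (parentSpan) => {
  tracer.startActiveSpan("db.query", (childSpan) => {
    // ... perform the query here ...
    childSpan.end();
  });
  parentSpan.end();
});
```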
To make this possible across service boundaries, trace context is propagated alongside requests, usually through protocol metadata like HTTP headers or message attributes. Instrumentation libraries handle most of this automatically, ensuring that downstream services can attach their spans to the correct parent and preserve the causal relationship between operations.
Once spans are created, they’re exported to a tracing backend, often asynchronously and in batches, and assembled into a coherent trace. This final representation allows you to inspect the full path of the request, identify bottlenecks, and understand how different services contributed to the overall behavior.
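As a rough sketch of that export step, the Node.js SDK can be configured with a batching processor and an OTLP exporter. The endpoint URL below is illustrative, and constructor options vary slightly between SDK versions.

```javascript
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-http");

// Buffer finished spans and ship them asynchronously, in batches, to an
// OTLP-compatible backend (here assumed to be a local Collector).
const provider = new NodeTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })
    ),
  ],
});
provider.register();
```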
OpenTelemetry distributed tracing example
One of the easiest ways to understand distributed tracing is to study a system that was built to expose its behavior rather than hide it. The OpenTelemetry Demo is exactly that: a deliberately rich microservice environment designed to resemble a real production system instead of a simplified example.
It’s composed of more than a dozen services, implemented in a wide range of programming languages and communicating over both HTTP and gRPC. Each service is responsible for a narrow slice of functionality, which means even simple user interactions result in a chain of cross-service calls. This makes it an ideal reference for understanding how trace context is propagated and how spans from different runtimes come together to form a single trace.
The diagram below shows the overall structure of the application and how its services interact. What makes this view interesting is that it’s derived entirely from distributed tracing data, not from static configuration or documentation:
Running the demo locally is straightforward. Clone the repository and start the full environment with Docker Compose:
```sh
docker compose up --force-recreate --remove-orphans --detach
```
Once the services are running, you can explore the generated traces through the Jaeger user interface at http://localhost:8080/jaeger/ui/:
What is a trace?
A trace captures the full lifecycle of a single request as it passes through a distributed system. It provides a unified view of how that request was handled across service boundaries, from the moment it entered the system to the point where a response was produced.
Rather than being a single event, a trace is composed of multiple spans, each describing a discrete piece of work performed along the way. These spans are linked through shared context, which preserves ordering and parent-child relationships and allows the trace to reflect the actual execution flow of the request.
This structure gives tracing its explanatory power. By grouping related operations together and preserving their causal relationships, traces make it possible to reason about latency, error propagation, and control flow at the level that matters most during debugging: a single request and the path it took through the system.
Understanding spans: the building blocks of traces
A span is the basic unit of work in distributed tracing. It represents a single, well-defined operation performed by a service while handling a request, capturing both what happened and how long it took.
Spans are intentionally flexible. They can describe an inbound handler, a call to an external dependency, a message publish or consume step, a cache read, a CPU-heavy function, or any other slice of work where you want clarity about duration and behavior.
Every span records three critical things:
- Timing: When it started and how long it took.
- Status: Did it succeed or fail?
- Context: What actually happened and where.
Spans are connected into a trace by explicit relationships and a shared trace identifier. The first span created for a request is the entry point for the trace and establishes its context. From there, any additional work triggered as part of handling that request creates new spans that reference their immediate predecessor.
This parent-child structure captures both sequencing and causality, allowing the trace to reflect not just what happened, but why one operation led to another.
When assembled into a full trace, spans form a timeline that makes it possible to isolate latency and failures with clarity, even when the root cause emerges from the interaction between multiple services rather than a single faulty component.
Attributes are what enable observability
Without attributes, a span is little more than a fancy stopwatch. It can tell you how long something took, but not what actually happened during that time or why the operation behaved the way it did.
Attributes give spans their explanatory power by describing the operation and the circumstances under which it ran. They capture concrete details such as which route was hit, which query was executed, which response code was returned, or which tenant or customer triggered the work. This contextual data is what turns raw timing into insight.
There are two categories of attributes, each answering a different class of question:
1. Span attributes
Span attributes provide contextual detail about what the code was doing when the span was recorded. Common examples include request metadata, feature flags, and application-level identifiers:
```text
http.request.method: GET
db.query.text: SELECT * FROM users
app.user_id: 12345
```
Without such context, all you can conclude is that an operation is slow. With it you might observe that latency only appears for a specific endpoint, under certain inputs, or for a particular class of users, allowing you to move from vague suspicion to a concrete explanation.
2. Resource attributes
Resource attributes describe the environment in which the code ran. They capture where a span originated rather than what the code was doing, and they tend to be stable across many spans produced by the same process.
Examples include deployment and infrastructure context such as service identity, runtime location, and execution platform:
```text
service.name: checkout-service
cloud.region: us-east-1
k8s.pod.name: checkout-service-x8j2
```
When latency or error rates spike, resource attributes are what make it possible to determine whether the issue is systemic or isolated to a specific region, host, container, or deployment.
Without this layer, traces lack the environmental grounding needed to connect software behavior to real world infrastructure conditions.
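One low-friction way to set resource attributes, assuming your services use the OpenTelemetry SDKs, is through the standard environment variables the SDKs read at startup; the service name and attribute values below are illustrative.

```sh
# Applied once per process and attached to every span it produces.
export OTEL_SERVICE_NAME=checkout-service
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,cloud.region=us-east-1
```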
A note on attribute names
Attribute names are part of the contract between your instrumentation and every tool that consumes the data. Where possible, you should rely on established semantic conventions rather than inventing your own vocabulary.
Standardized attribute names allow observability tools to apply built-in understanding, such as recognizing HTTP routes, grouping database operations, or calculating latency metrics automatically. When conventions are followed, tools can work with your data out of the box instead of requiring custom parsing and dashboards.
Custom attributes still have their place, particularly for domain-specific concepts that have no standard representation. When you do introduce them, always namespace them to prevent collisions with future standard attributes.
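A small sketch of what that looks like in practice; the "shop." namespace below is a hypothetical example of a domain prefix you might choose.

```javascript
// Standard semantic convention: backends can interpret this automatically.
span.setAttribute("http.response.status_code", 200);

// Domain-specific attributes: namespaced to avoid collisions with
// current or future semantic conventions ("shop." is hypothetical).
span.setAttribute("shop.checkout.payment_provider", "card");
span.setAttribute("shop.cart.items_count", 3);
```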
What a span actually looks like
At the wire level, a span is a structured object with clear separation of concerns. This is roughly what a tracing backend receives when a service exports spans using the OpenTelemetry protocol:
```json
{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "k8s.pod.name",
            "value": { "stringValue": "frontend-proxy-x92ks" }
          }
        ]
      },
      "scopeSpans": [
        {
          "spans": [
            {
              "name": "ingress_request",
              "kind": 2,
              "traceId": "da5b97cecb0fe7457507a876944b3cf",
              "spanId": "fa7f0ea9cb73614c",
              "parentSpanId": "",
              "startTimeUnixNano": "1756571696706248000",
              "endTimeUnixNano": "1756571696709237000",
              "status": { "code": 0 },
              "attributes": [
                {
                  "key": "http.route",
                  "value": { "stringValue": "/api/v1/checkout" }
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
```
One important detail to notice is that resource attributes and span attributes live in different parts of the payload. Resource attributes are attached once at the resource level and shared by every span from that process, while span attributes are attached to each individual span.
This separation is what allows backends to efficiently associate many spans with the same execution context while still preserving detailed, per-operation data.
Span events: the logs inside your trace
We already established that spans tell you when an operation started and when it finished, and that attributes provide the context that explains what the operation represents.
Span events serve a different purpose: they mark notable moments that occur within the lifetime of a span without turning that moment into a separate operation. In practice, span events behave like structured logs that are inseparable from the trace context they belong to.
When to use span events vs attributes
These two concepts are easy to blur together, especially early on. A simple mental model helps keep the distinction clear:
- Use attributes for state that applies to the entire operation. They describe what the span is about and remain true for its whole lifetime.
- Use events for notable moments that happen at a specific point in time during the operation. These describe transitions, waits, retries, or other internal steps.
Consider a span that represents a database query. Attributes might describe the operation itself:
```text
db.query.text: SELECT * FROM orders
db.query.duration: 154.322
```
Events, on the other hand, describe what happened during execution:
- Event at T+0ms: acquiring connection pool lock
- Event at T+15ms: connection acquired
- Event at T+150ms: first byte received
Without events, all you see is a 154ms database span. With events, it becomes obvious that most of the time was spent waiting for a connection, which points to an entirely different fix.
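With the OpenTelemetry JavaScript API, that timeline could be recorded roughly like this, assuming span is the active database span; the event names are illustrative.

```javascript
// Attributes describe the operation as a whole...
span.setAttribute("db.query.text", "SELECT * FROM orders");

// ...while events mark moments within its lifetime.
span.addEvent("acquiring connection pool lock");
// ... wait for a pooled connection ...
span.addEvent("connection acquired");
// ... execute the query and stream results ...
span.addEvent("first byte received");
```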
The killer use case for events: exceptions
One of the most practical uses of span events is error recording. In OpenTelemetry, calling span.recordException(e) creates a span event named exception that includes the error message and stack trace.
This means you open the trace, select the failing span, and see the exception exactly where it occurred in the timeline along with the relevant context to understand why it happened.
Correlating application logs with traces
Regular application logs can also be correlated with traces by including trace and span identifiers in log records.
OpenTelemetry already provides many integrations that do this automatically, allowing you to jump from a trace to the logs that were written while it was executing.
In observability backends, this lets existing logs behave much like span events, appearing alongside the spans that were active when they were written.
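If your logging library isn't covered by an existing integration, the correlation can also be done by hand. The helper below is a hypothetical sketch that reads the active span's identifiers and includes them in a structured log line.

```javascript
const { trace } = require("@opentelemetry/api");

// Hypothetical helper: attach trace and span IDs from the active span
// to a structured log record so the backend can link it to the trace.
function logWithTraceContext(message) {
  const activeSpan = trace.getActiveSpan();
  const ctx = activeSpan ? activeSpan.spanContext() : undefined;
  console.log(
    JSON.stringify({
      message,
      trace_id: ctx ? ctx.traceId : undefined,
      span_id: ctx ? ctx.spanId : undefined,
    })
  );
}
```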
Connecting the dots with context propagation
Distributed tracing is predicated on the ability to carry trace context alongside a request as it moves through a system.
Without context propagation, each service would record its own activity in isolation. You may still collect spans, but they no longer form a trace, which defeats the purpose of instrumenting your services in the first place.
The mechanism that prevents this fragmentation is trace context. At a minimum, that context includes the trace identifier and the identifier of the currently active span. Together, these values define both which trace the work belongs to and where it fits within the overall timeline.
When a service makes an outbound call, this context is injected into the request, typically using protocol metadata such as HTTP headers or message attributes. The receiving service extracts that context and uses it to attach its own spans to the existing trace rather than starting a new one.
OpenTelemetry instrumentation handles this automatically for common protocols like HTTP and gRPC, which is why traces can span processes, hosts, and programming languages without requiring every boundary to be manually wired.
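For boundaries that automatic instrumentation does not cover, such as a custom transport or an internal queue, the same propagation API can be used directly. A minimal sketch, where a plain object stands in for whatever carrier your protocol provides:

```javascript
const { context, propagation } = require("@opentelemetry/api");

// Outbound: copy the active trace context into a carrier (e.g. HTTP headers).
const headers = {};
propagation.inject(context.active(), headers);
// headers now contains a traceparent entry to send along with the request.

// Inbound (on the receiving side): restore the context from the carrier so
// new spans attach to the existing trace instead of starting a fresh one.
// Here we reuse `headers` as a stand-in for the received metadata.
const extractedContext = propagation.extract(context.active(), headers);
```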
While OpenTelemetry supports multiple propagation formats, its default configuration follows the vendor-neutral W3C Trace Context specification, which defines a standardized way to carry trace context across system boundaries.
If you inspect a network request in your browser while tracing is active, you’ll see a traceparent header among the request headers:
```text
traceparent: 00-6841c10c878e1849dc7efb598905c04f-cb6b9f8432533349-01
```
It includes four fields separated by hyphens:
- Version: The trace context version (currently 00).
- Trace ID: The unique identifier for the entire trace (6841c10c878e1849dc7efb598905c04f).
- Parent ID: The identifier of the span that directly caused the current operation, establishing the parent-child relationship (cb6b9f8432533349).
- Flags: Sampling and tracing options, such as whether this trace should be recorded and exported (01).
By reading this header, the receiving service knows exactly where to attach its new spans in the trace hierarchy.
While traceparent defines the core identifiers needed to correlate spans across systems, tracestate provides an extension mechanism for tool-specific context:
```text
tracestate: congo=ucfJifl5GOE,rojo=00f067aa0ba902b7
```
Each entry in tracestate is a key/value pair, and multiple vendors can contribute entries as long as they follow ordering and size rules. This makes it possible for different tracing systems to interoperate without losing vendor-specific features or metadata.
W3C Baggage
Baggage is a mechanism for propagating arbitrary key-value data alongside trace context as a request moves through a distributed system.
Unlike span attributes, baggage is not tied to a specific operation. Instead, it travels with the request itself and is available to any service that participates in handling it.
This makes baggage useful for information that is known early in a request’s lifecycle but remains relevant downstream. Examples include session identifiers, user or tenant IDs, feature flags, or other request metadata. By propagating this data automatically, services do not need to manually pass it through every function call or API boundary.
A value placed into baggage can later be attached to spans, emitted as a metric attribute, or included in logs, provided the participating service explicitly reads it. It’s also useful for correlating traces, metrics, and logs using the same contextual data without duplicating instrumentation logic.
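A rough sketch with the OpenTelemetry JavaScript API; the tenant key and value are illustrative:

```javascript
const { context, propagation } = require("@opentelemetry/api");

// Early in the request lifecycle: attach a value to the baggage.
const baggage = propagation.createBaggage({
  "app.tenant_id": { value: "tenant-42" },
});
const ctxWithBaggage = propagation.setBaggage(context.active(), baggage);

// In any downstream code running within that context: read it back and,
// for example, copy it onto the current span as an attribute.
context.with(ctxWithBaggage, () => {
  const entry = propagation
    .getBaggage(context.active())
    ?.getEntry("app.tenant_id");
  if (entry) {
    // e.g. span.setAttribute("app.tenant_id", entry.value);
  }
});
```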
Instrumenting your services to emit tracing data
Instrumenting a service means adding the ability to emit tracing data while it handles requests. In practice, this involves using a tracing library that can create spans, manage context propagation, and export trace data to a backend for analysis.
Most modern tracing setups rely on OpenTelemetry, which provides a consistent API and SDKs across languages and frameworks along with a Collector that can receive, process, and export telemetry data.
For many services, instrumentation requires little or no code changes. Automatic instrumentation can wrap common libraries and frameworks, such as HTTP servers, database clients, and messaging systems, generating spans for standard operations and handling context propagation transparently. This is often enough to gain immediate visibility into request flows and broad latency patterns.
For example, a Node.js service can be instrumented without touching application code:
```sh
npm install --save @opentelemetry/api
npm install --save @opentelemetry/auto-instrumentations-node

env OTEL_TRACES_EXPORTER=otlp \
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=<your-endpoint> \
  node --require @opentelemetry/auto-instrumentations-node/register app.js
```
Manual instrumentation becomes useful when you need insight into service-specific behavior rather than just infrastructure wiring. By creating spans around important sections of business logic and attaching meaningful attributes or events, traces can reflect how your system actually behaves under real conditions.
```javascript
const span = tracer.startSpan("calculate_cart_total");
try {
  const total = calculateTotal(items);
  span.setAttribute("app.cart.items_count", items.length);
  span.setAttribute("app.cart.value", total);
  return total;
} catch (err) {
  span.recordException(err);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw err;
} finally {
  span.end();
}
```
A common and effective approach is to start with automatic instrumentation where possible to get a broad map of the system, then layer in manual spans where deeper understanding is needed.
The reality check: challenges and pitfalls
Distributed tracing is essential for understanding modern systems, but adopting it successfully is far from trivial. Here are the most common reasons tracing implementations fail, and how to survive them.
1. Fragile context propagation
Tracing relies on trace context being propagated flawlessly across every boundary a request crosses. In real systems, this is where things most often break.
Context can be lost in unremarkable ways:
- A piece of middleware drops headers it does not recognize.
- An older load balancer rewrites requests.
- A message queue treats work as fire-and-forget and never forwards the original context.
The impact only shows up later, when you look at a trace and it appears to stop at the API gateway, even though the request continued on after that point.
The downstream services did record spans, but without a parent to attach to, they appear as separate traces. These orphan spans obscure causality and can easily mislead an investigation.
2. The cost vs value trap
Every span recorded, transported, and stored adds cost and overhead, and in high-traffic systems, that cost compounds quickly. Many organizations end up paying to ingest vast amounts of data that is rarely queried, creating friction between engineering needs and financial reality.
To make tracing sustainable, you most likely have to accept a hard constraint: not every request can be kept. This is where sampling becomes unavoidable, and trade-offs start to matter.
With head-based sampling, the decision to keep or drop a trace is made at the start of the request, usually based on a fixed percentage. This keeps infrastructure simple and costs predictable, but it risks discarding the very traces you needed to debug a failure.
Tail-based sampling delays that decision until after the request completes, keeping traces that are slow or erroneous. This preserves the most valuable data, but shifts complexity and resource usage into the tracing pipeline, and often means paying to process data that is ultimately discarded.
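As an illustration of the head-based approach, here is a minimal sketch using the Node.js SDK that keeps roughly 10% of traces while honoring the decision made by a parent service. Tail-based sampling, by contrast, is typically configured in the OpenTelemetry Collector rather than in application code.

```javascript
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} = require("@opentelemetry/sdk-trace-base");

// Head-based sampling: the keep/drop decision is made when the root span is
// created, and child services simply follow their parent's decision.
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // keep ~10% of traces
  }),
});
provider.register();
```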
3. The clock skew problem
Distributed tracing assumes that time moves forward in a straight line, but distributed systems do not share a single clock. Every service measures time using its own local clock, and even small differences between machines can distort how a trace appears.
When clocks drift, spans can appear to finish before they start or fall outside their parent’s duration. These artifacts are confusing and can send engineers chasing problems that do not exist.
Clock synchronization via tools like NTP should be treated as mandatory infrastructure. Even then, traces should be interpreted carefully, as minor timing inconsistencies are often artifacts of skew rather than real issues.
4. Privacy and security risks
Useful traces tend to carry rich context, and that makes them easy places to leak sensitive data. Request parameters, headers, or payload fragments can end up in span attributes without anyone realizing it.
The OpenTelemetry Collector allows sensitive attributes to be removed or masked before data leaves your environment, which is far safer than relying on every service to sanitize its own output.
Distributed tracing pays off only when these realities are acknowledged upfront and accounted for. Success depends less on the tooling itself and more on disciplined instrumentation, cost awareness, and the understanding that observability is an ongoing practice, not a one-time setup.
Future directions and bridging the gap
The limits of early tracing systems are shaping the next phase of observability. The focus is shifting toward unifying signals, reducing cognitive load, and extracting insight automatically rather than relying entirely on manual investigation.
Unified observability and the decline of silos
The long-standing separation between logs, metrics, and traces is fading as modern observability practices increasingly rely on unified data models and shared storage, where traces are not just visual artifacts but a primary source of truth.
Latency histograms, error rates, and service-level indicators can be derived directly from trace data, removing duplication and inconsistency between signals.
Engineers can now move from a high-level alert to the exact trace instance that triggered it, without switching tools or reconstructing context by hand. The result is a shorter path from detection to understanding, which is critical during incidents when time and attention are limited.
Trace-based testing
Another emerging direction is the use of traces as an active input to testing rather than a passive record of production behavior. Trace-based testing applies assertions to traces generated during automated tests or staging runs, allowing teams to verify that distributed flows behave as expected.
Instead of validating only local outcomes, developers can assert properties of the full request path, such as how many times a database is written to or which downstream services are contacted. This approach extends the value of instrumentation into the development lifecycle and helps catch architectural regressions before they reach production.
AI-driven analysis and aggregate views
As trace volumes grow, manual inspection no longer scales. To address this, observability platforms are increasingly applying machine learning and LLMs to identify anomalies, emerging failure patterns, and unusual latency distributions across large sets of traces.
Visualization is also evolving beyond individual timelines. Aggregate views such as heatmaps and density plots make it possible to analyze thousands of traces at once, revealing outliers and systemic slowdowns without scrolling through individual examples. Together, automated analysis and higher-level views help operators focus on what matters most.
Distributed tracing tools
Distributed tracing tools are responsible for turning raw span data into something you can actually reason about. They collect spans emitted by services, group them by trace ID, reconstruct parent-child relationships, and present the resulting traces through query and visualization interfaces.
Most modern tools let you search for slow or failing requests, filter traces by service or operation, and inspect timelines to see where time was spent. Since the observability ecosystem has largely converged on OpenTelemetry as the standard for instrumentation and transport, you only need to instrument once and choose a backend based on scale, cost, and workflow preferences.
Within that landscape, OpenTelemetry-native backends have a clear advantage as they’re designed around OpenTelemetry’s data model rather than adapting to it after the fact. This tends to result in better support for semantic conventions and more reliable correlation across traces, metrics, and logs.
Dash0 takes this a step further by focusing on how traces are used during real investigations. As telemetry becomes easier to collect, the limiting factor is no longer data availability but human attention.
Rather than expecting engineers to inspect traces one by one, Dash0 leans on aggregate views and comparison-based workflows. Patterns across many spans are surfaced first, and individual traces are used to confirm or refine an explanation.
Features like Triage reflect this shift. Instead of starting with a hypothesis and searching for evidence, engineers can select an interesting slice of telemetry and let the system highlight which attributes or conditions actually distinguish it.
This approach is particularly effective in high-volume environments, where manual inspection does not scale and the fastest path from symptom to cause is the one that reduces cognitive load.
Final thoughts
Distributed tracing is no longer optional in modern cloud-native systems, as it’s the only reliable way to understand how real requests behave across service boundaries.
As telemetry volumes grow, the challenge shifts from mere collection to interpretation. Tools that are OpenTelemetry-native and designed around real investigation workflows make that gap easier to cross.
If you want to see what that looks like in practice, Dash0 is worth a look. You can plug it into your existing OpenTelemetry setup and see how quickly traces turn into answers.
Thanks for reading!