OpenTelemetry for Java Services: Auto-Instrumentation, Context Propagation, and Custom Spans

The fastest way to get a Java service producing traces is to add nothing to your code. The OpenTelemetry Java agent is a -javaagent JAR that hooks bytecode at class load and instruments the libraries you already use — Servlet, Spring MVC/WebFlux, JDBC, Kafka, gRPC, and roughly a hundred others — then exports spans and metrics over OTLP (the OpenTelemetry Protocol). No SDK wiring, no recompile, no per-method annotations to start. You drop the agent in, point it at a Collector, set a service name, and a request that crosses three services lights up as one connected trace.

That is the easy 80%. The remaining 20% is where the real engineering lives, and it is the reason traces look broken in production despite “auto-instrumentation working.” Context propagates automatically within a thread and across a network hop, but the moment you hand work to an ExecutorService, a CompletableFuture, a reactive scheduler, or a message queue, the agent can lose the thread of execution — literally — and you get orphan spans that never join their parent. On top of that you have decisions the agent can’t make for you: how much to sample and where, which libraries to silence, what goes in the resource so every span is attributable, and how to add the handful of spans that carry business meaning the agent will never know about (promotion.apply, inventory.reserve).

This guide walks through all of it for a current agent release (v2.x): attaching and configuring the agent, exactly what auto-instrumentation captures, the W3C context-propagation model and the async boundaries you have to bridge by hand, writing custom spans against the API artifact only (never the SDK — that is the double-export trap), samplers and per-library control, the Collector as your policy layer, the resource and attribute model, and a troubleshooting playbook for the failure modes that actually happen. Every section anchors to tables you can scan mid-incident, with config and JVM flags that work against a real agent, not pseudocode.

What problem this solves

Distributed systems fail across boundaries, not inside them. A user reports “checkout is slow,” and the request touched a gateway, an order service, a payments service, a fraud check, two databases and a Kafka topic. Without distributed tracing you have six sets of logs with no shared identifier, and you correlate by eyeballing timestamps — which is to say you don’t. Tracing solves this by stamping every operation with a trace ID that flows with the request across every hop, so one search reconstructs the entire causal path and tells you the fraud check added 800 ms.

The classic way to get there is to write instrumentation by hand: import an SDK, create a tracer, wrap every HTTP handler and database call in a span, thread context through manually. That is enormous, repetitive, error-prone work, and it rots — a new library gets added and nobody instruments it. The OpenTelemetry Java agent eliminates the bulk of it by instrumenting the libraries themselves at the bytecode level, so a Spring controller, a JDBC statement and a Kafka consumer produce correct spans with zero lines of your code involved.

What breaks without understanding the agent’s boundaries: you attach it, see some traces, and assume you’re done. Then half your traces are fragments because an @Async method or a worker-pool task dropped context; your trace backend bill triples because nobody set sampling and a Kafka pipeline emits a span per message at full volume; services show up as unknown_service:java because the image shipped without OTEL_SERVICE_NAME; and a “custom spans don’t appear” ticket turns out to be the SDK shipped alongside the agent, giving you two providers and duplicate exports. Who hits this: every team running JVM services — Spring Boot, Quarkus, Micronaut, plain main() — that wants tracing without rewriting their codebase, which is most of them. The fix is almost never “more instrumentation”; it’s understanding precisely where the agent sees and where it goes blind, and helping it across exactly those gaps.

Learning objectives

By the end of this article you can:

Attach the OpenTelemetry Java agent via -javaagent (and JAVA_TOOL_OPTIONS), configure it entirely through environment variables / system properties, and point it at a Collector over the correct OTLP protocol and port.
Explain exactly what auto-instrumentation captures per library category (HTTP, JDBC, messaging, RPC), which semantic-convention attributes it sets, and how to opt into extra headers safely without leaking secrets.
Trace how W3C traceparent/tracestate and baggage propagate across the wire, and bridge the async boundaries the agent can’t see — thread pools, CompletableFuture, reactive schedulers, and manual message handoffs.
Write custom spans against the opentelemetry-api artifact only (compile-only), avoiding the SDK double-export trap, and add @WithSpan/@SpanAttribute declaratively.
Choose a sampling strategy and the layer to apply it: parentbased_traceidratio at the agent for head sampling versus tail-based, error-biased sampling in the Collector — and explain why the agent samples blind.
Build a production resource (service name/namespace/version, deployment.environment, k8s pod/namespace via the Downward API) so every span and metric is attributable and joinable.
Stand up an OpenTelemetry Collector pipeline (receivers → processors → exporters) as the policy and fan-out layer between your fleet and your backend.
Diagnose the real failure modes — lost context, double instrumentation, wrong port/protocol, unknown_service, baggage bloat, missing async spans — by symptom, with the exact check and fix.

Prerequisites & where this fits

You should be comfortable running a Java application from a JAR, setting environment variables and JVM flags, and reading JSON/YAML. Familiarity with HTTP, TCP ports, and the idea of a span (a timed operation with a name, attributes, and a parent) helps. You do not need prior OpenTelemetry experience — that is the point of zero-code instrumentation — but you should understand that a trace is a tree of spans sharing one trace ID, and a span has a kind (SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL) that says what role it played.

This sits in the Observability track, on the instrumentation/data-production side. The agent is the source; everything downstream consumes what it emits. It pairs directly with the Collector deep dives — OpenTelemetry Collector pipelines for production is where the data this article produces gets processed, and OpenTelemetry Collector tail-based sampling and load balancing is where the real sampling decision is made for the head-sampled traces you send. The traces land in a backend covered by Distributed tracing with Tempo, Jaeger, exemplars and correlation. The metrics the same agent emits connect to OpenTelemetry metrics, exemplars and Prometheus trace correlation. If you’d rather not run a JVM agent at all, the kernel-level alternative is eBPF zero-code instrumentation with Grafana Beyla and OTel.

Here is the layered map of who owns what in this pipeline, so you know which layer a problem lives in before you start debugging:

Layer	What lives here	Who configures it	Failure classes it causes
Application code	Business logic; manual spans; thread/async handoffs	App / dev team	Lost context across async; missing business spans
OTel Java agent	Auto-instrumentation; propagation; OTLP export	App + platform	`unknown_service`, wrong port, double export, over-emission
Resource / config (env)	`OTEL_*` vars, JVM flags, Downward API	Platform / deploy	Unattributable spans, wrong endpoint
Network / OTLP	gRPC 4317 / HTTP 4318 to the Collector	Platform / network	Silent connection failure, TLS, DNS
Collector	Receivers, processors (sampling, batch), exporters	Observability team	Dropped/duplicated data, tail-sampling policy
Backend (Tempo/Jaeger)	Trace storage, query, exemplar correlation	Observability team	Search slowness, cost, retention

Core concepts

Six mental models make every later decision obvious.

The agent instruments libraries, not your code. The -javaagent JAR uses the JVM’s instrumentation API to transform bytecode as classes load. When it sees a class from a library it knows — javax.servlet.Servlet, java.sql.Connection, org.apache.kafka.clients.consumer.KafkaConsumer — it weaves in span creation around the relevant methods. Your code is untouched; you don’t import anything to get HTTP and database spans. The trigger is presence on the classpath, not an opt-in. This is why “it just works” for standard libraries and why nothing happens for a bespoke RPC framework the agent has never heard of.

Context is the current span, carried implicitly. OpenTelemetry maintains a Context (at minimum the currently active span) attached to the executing thread via a thread-local, with a stack of scopes. When the agent creates a child span, it reads the current context to find the parent. Within one thread this lookup is automatic. The entire propagation problem is: how does the current context get from thread A to thread B (async) or from process X to process Y (network)? Network is solved by injecting/extracting headers; async is the part you sometimes bridge by hand.

Propagation across the wire is W3C headers by default. On every outbound HTTP/gRPC client call the agent injects the current context into headers (traceparent, tracestate, and baggage); on every inbound request it extracts them to set the context. The default propagators are W3C tracecontext + baggage. Because both sides default to W3C, a Java service calling another Java service correlates with zero configuration — the child span’s parent is the upstream span, across the network.

The API is the contract; the agent is the implementation. OpenTelemetry splits into the API (interfaces: Tracer, Span, Context) and the SDK (the implementation that builds, samples and exports spans). When the agent is attached, it is the SDK — it installs a live OpenTelemetry implementation behind GlobalOpenTelemetry. Your custom-span code depends on the API only. If you also ship the SDK, you get two implementations and duplicate telemetry. This single rule prevents the most confusing class of bug.

Sampling decides which traces to keep, and where you decide matters. Tracing every request at high throughput is expensive and rarely necessary. A sampler decides keep/drop. Head sampling decides at the start of the trace (at the agent) with no knowledge of the outcome — fast, cheap, but blind to whether the trace errored or was slow. Tail sampling decides after the whole trace is assembled (in the Collector) — it can keep 100% of errors and slow traces and a fraction of the boring ones, but needs the Collector to buffer spans. The agent does head sampling; serious policy belongs in the Collector.

The resource describes who emitted the telemetry. A span is useless if you can’t tell which deployment produced it. The resource is the set of attributes attached to every span and metric from a process — service.name, service.version, deployment.environment, k8s pod/namespace. service.name is mandatory; without it you are unknown_service:java and unsearchable. The resource is what lets traces and metrics from the same pod join on shared attributes in your backend.

The vocabulary in one table

Pin down every moving part before the deep sections. The glossary at the end repeats these for lookup; this is the model side by side:

Term	One-line definition	Where it lives	Why it matters
Java agent	`-javaagent` JAR that instruments bytecode at class load	JVM startup flag	Zero-code traces + metrics
OTLP	The wire protocol for exporting telemetry	Agent → Collector	gRPC 4317 / HTTP 4318
Span	One timed operation (name, kind, attributes, parent)	In a trace tree	The unit of a trace
Trace ID	16-byte ID shared by every span in one trace	`traceparent` header	The thing you search on
Context	The currently active span, carried on the thread	Thread-local	Parent lookup; propagation
Propagator	Reads/writes context to/from headers	Agent (configurable)	W3C `tracecontext`, `baggage`, `b3`
Baggage	Key-value context that rides alongside the trace	`baggage` header	Cross-cutting values deep in the graph
Resource	Attributes on every span/metric (who emitted it)	Agent config	Attribution; join key
Sampler	Decides keep/drop for a trace	Agent (head) / Collector (tail)	Cost vs. completeness
API artifact	`opentelemetry-api` — interfaces only	`compileOnly` dependency	Custom spans without the SDK
SDK	The implementation (build/sample/export)	Provided by the agent	Do not also ship it
Collector	Receive → process → export pipeline	Separate process	Policy, batching, fan-out

Attaching the agent and pointing it at a Collector

Download the agent JAR and run it ahead of your application. Pin a version — never float on latest in production, because an agent upgrade can change span names or attributes and shift your dashboards.

# Pin a version. Do not float on latest in production.
curl -sSLo /otel/opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.20.1/opentelemetry-javaagent.jar

java -javaagent:/otel/opentelemetry-javaagent.jar \
     -jar /app/checkout-service.jar

Everything is driven by environment variables (or matching -D system properties), so the same image works in every environment. These are the variables that matter on day one:

export OTEL_SERVICE_NAME=checkout-service
export OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,deployment.environment=prod,service.version=2.14.0
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

Two protocol facts trip people up constantly, and both produce silent failure (the app runs, no data arrives):

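OTEL_EXPORTER_OTLP_PROTOCOL is grpc, http/protobuf, or http/json. The 2.x agent defaults to http/protobuf (1.x agents and the standalone SDK autoconfiguration default to grpc), so set the protocol explicitly rather than relying on whichever default your version ships. The gRPC port is 4317; the HTTP port is 4318. Match the port to the protocol or you get a connection that never delivers.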
With http/protobuf, the agent appends the signal path itself (/v1/traces, /v1/metrics) to OTEL_EXPORTER_OTLP_ENDPOINT. Set the base URL only. If you instead use the per-signal override OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, that one is the full path and no suffix is added.
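The two rules above can be sketched in a few lines of plain Java. This is an illustration of the behaviour described here, not the exporter's own code: the class and method names (OtlpEndpoints, tracesUrl, looksMismatched) are hypothetical, and the port check only makes sense when the Collector listens on the conventional 4317/4318 ports.

```java
import java.net.URI;

final class OtlpEndpoints {

    // With http/protobuf the signal path is appended to the base endpoint;
    // a per-signal override is used exactly as given, with no suffix added.
    static String tracesUrl(String baseEndpoint, String tracesOverride) {
        if (tracesOverride != null && !tracesOverride.isEmpty()) {
            return tracesOverride;
        }
        String base = baseEndpoint.endsWith("/")
                ? baseEndpoint.substring(0, baseEndpoint.length() - 1)
                : baseEndpoint;
        return base + "/v1/traces";
    }

    // Flags the classic mismatch: gRPC aimed at 4318, or HTTP aimed at 4317.
    // Assumes the conventional ports; a Collector can be configured to listen elsewhere.
    static boolean looksMismatched(String protocol, String endpoint) {
        int port = URI.create(endpoint).getPort();
        boolean grpc = protocol.equals("grpc");
        return (grpc && port == 4318) || (!grpc && port == 4317);
    }

    public static void main(String[] args) {
        String base = "http://otel-collector.observability.svc:4318";
        System.out.println(tracesUrl(base, null));
        System.out.println(looksMismatched("grpc", base));
    }
}
```

A check like looksMismatched is cheap to run in a deploy pipeline, which matters because this failure is silent at runtime.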

System properties are equivalent and win over env when both are set. This is handy when you can’t edit the environment but can edit JAVA_TOOL_OPTIONS (the JVM honours it automatically, so even a java command you don’t control picks up the agent):

export JAVA_TOOL_OPTIONS="-javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=checkout-service \
  -Dotel.exporter.otlp.protocol=http/protobuf \
  -Dotel.exporter.otlp.endpoint=http://otel-collector.observability.svc:4318"

The env var OTEL_SERVICE_NAME and the property otel.service.name are the same setting. The naming rule is mechanical: uppercase the property, replace dots with underscores. Knowing this means you never have to look up a variable name — otel.traces.sampler is OTEL_TRACES_SAMPLER, full stop.
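Both the naming rule and the property-over-env precedence are mechanical enough to write down. The following is a minimal sketch, not the agent's actual resolver; OtelConfigNames, toEnvVar and resolve are hypothetical names, and mapping dashes to underscores is an extra assumption added for property names that contain a dash.

```java
import java.util.Locale;
import java.util.Map;

final class OtelConfigNames {

    // Rule from the text: uppercase the property and replace dots with underscores.
    // Treating '-' the same way is an assumption of this sketch.
    static String toEnvVar(String property) {
        return property.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    // Precedence from the text: a -D system property wins over the matching env var.
    static String resolve(String property, Map<String, String> sysProps, Map<String, String> env) {
        String fromProperty = sysProps.get(property);
        if (fromProperty != null) {
            return fromProperty;
        }
        return env.get(toEnvVar(property));
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of("otel.service.name", "from-property");
        Map<String, String> env = Map.of(
                "OTEL_SERVICE_NAME", "from-env",
                "OTEL_TRACES_SAMPLER", "parentbased_traceidratio");
        System.out.println(toEnvVar("otel.traces.sampler"));
        System.out.println(resolve("otel.service.name", props, env));
        System.out.println(resolve("otel.traces.sampler", props, env));
    }
}
```

The maps are passed in rather than read from System.getProperties() and System.getenv() so the behaviour can be exercised without touching the real process environment.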

The configuration reference

Here is the day-one-through-day-ninety configuration surface — the variables you will actually set, what each does, its default, and when you change it:

Variable (env)	Property	What it controls	Default	When to change
`OTEL_SERVICE_NAME`	`otel.service.name`	The `service.name` resource attribute	`unknown_service:java`	Always — never ship the default
`OTEL_RESOURCE_ATTRIBUTES`	`otel.resource.attributes`	Extra resource attrs (`key=val,…`)	empty	Add namespace, env, version, k8s
`OTEL_EXPORTER_OTLP_ENDPOINT`	`otel.exporter.otlp.endpoint`	OTLP target (all signals)	`http://localhost:4318` (agent 2.x); `http://localhost:4317` when the protocol is `grpc`	Always, to your Collector
`OTEL_EXPORTER_OTLP_PROTOCOL`	`otel.exporter.otlp.protocol`	`grpc` / `http/protobuf` / `http/json`	`http/protobuf` (agent 2.x); `grpc` in 1.x	Match Collector receiver + port
`OTEL_EXPORTER_OTLP_HEADERS`	`otel.exporter.otlp.headers`	Headers on export (e.g. auth)	empty	SaaS backends needing an API key
`OTEL_TRACES_SAMPLER`	`otel.traces.sampler`	Head sampler strategy	`parentbased_always_on`	Cut volume; pair with parentbased
`OTEL_TRACES_SAMPLER_ARG`	`otel.traces.sampler.arg`	Sampler argument (ratio/endpoint)	n/a	Set the ratio (e.g. `0.1`)
`OTEL_PROPAGATORS`	`otel.propagators`	Propagator set, comma list	`tracecontext,baggage`	Add `b3multi` for a non-OTel hop
`OTEL_TRACES_EXPORTER`	`otel.traces.exporter`	Trace exporter (`otlp`/`none`/`logging`)	`otlp`	`none` to disable; `logging` to debug
`OTEL_METRICS_EXPORTER`	`otel.metrics.exporter`	Metric exporter	`otlp`	`none` if JVM metrics live elsewhere
`OTEL_LOGS_EXPORTER`	`otel.logs.exporter`	Log exporter	`otlp`	`none` if you don’t ship logs via OTel
`OTEL_METRIC_EXPORT_INTERVAL`	`otel.metric.export.interval`	Metric export period (ms)	60000	Lower for tighter metric resolution
`OTEL_BSP_SCHEDULE_DELAY`	`otel.bsp.schedule.delay`	Batch span processor flush delay (ms)	5000	Lower for snappier export in dev
`OTEL_BSP_MAX_QUEUE_SIZE`	`otel.bsp.max.queue.size`	Max spans buffered before drop	2048	Raise for bursty high-throughput apps
`OTEL_JAVAAGENT_ENABLED`	`otel.javaagent.enabled`	Master switch for the agent	true	`false` to fully disable without removing the JAR
`OTEL_JAVAAGENT_DEBUG`	`otel.javaagent.debug`	Verbose agent logging	false	`true` when debugging attach/config

A precedence note that saves real time: when the same setting is given as both an env var and a -D property, the property wins. And JAVA_TOOL_OPTIONS is additive to flags already on the command line, so an agent attached there layers on top of an existing java -jar invocation.

What auto-instrumentation actually captures

The agent ships instrumentation modules that activate only when the matching library is on the classpath. You do not opt in per module; presence is the trigger. The agent follows the OpenTelemetry semantic conventions (a standard naming scheme for spans and attributes), so http.request.method means the same thing whether the span came from a Java agent or a Go SDK, which is what makes cross-language traces coherent.

The high-value categories and the attributes they set:

Category	Example libraries	Span kind	Key captured attributes
HTTP server	Servlet, Spring MVC/WebFlux, JAX-RS, Vert.x	SERVER	`http.request.method`, `url.path`, `http.route`, `http.response.status_code`
HTTP client	Apache HttpClient, OkHttp, JDK HttpClient, Reactor Netty	CLIENT	`http.request.method`, `server.address`, `server.port`, `http.response.status_code`
Databases (SQL)	JDBC (any driver), Hibernate, R2DBC, jOOQ	CLIENT	`db.system.name`, `db.namespace`, `db.query.text`, `server.address`
Databases (NoSQL)	MongoDB, Cassandra, Redis (Lettuce/Jedis), Elasticsearch	CLIENT	`db.system.name`, `db.operation.name`, `db.namespace`
Messaging	Kafka, RabbitMQ, JMS, AWS SQS, Pulsar	PRODUCER / CONSUMER	`messaging.system`, `messaging.destination.name`, `messaging.operation`
RPC	gRPC, Apache Dubbo	CLIENT / SERVER	`rpc.system`, `rpc.service`, `rpc.method`
Frameworks	Spring Boot, Spring Scheduling, Quartz, Executors	INTERNAL	framework-specific operation spans
Cloud SDKs	AWS SDK v1/v2, Google Cloud, Azure SDK	CLIENT	`rpc.system`, target service + operation

How a span gets its name and shape

The agent follows the conventions for span names too, and the rules are worth knowing because a badly-named span wrecks your “group by operation” view. HTTP server spans are named for the route template (GET /orders/{id}), not the concrete URL (GET /orders/12345); using the template keeps cardinality bounded. Database spans are named for the operation and target (SELECT orders). When the agent can’t derive a route (an unmapped path), it falls back to a low-cardinality name built from the HTTP method alone (for example GET) rather than the raw path.

For JDBC the agent wraps the DataSource/Connection and produces a CLIENT span per statement. By default db.query.text is sanitized — literals are stripped and replaced with ? so the attribute carries the statement shape, not customer data. That sanitization is what makes “group by query” meaningful and keeps PII out of spans. Leave it on. The trade-offs of the default capture behaviour:

Captured by default?	What	Why the default is set this way	How to change
Yes (sanitized)	`db.query.text` with literals → `?`	Avoids PII; bounds cardinality	`OTEL_INSTRUMENTATION_COMMON_DB_STATEMENT_SANITIZER_ENABLED=false` (risky)
No	Full SQL with literal values	Would leak data + explode cardinality	Generally don’t
Yes	A curated set of HTTP headers	Useful, safe defaults	—
No	Arbitrary request/response headers	Could leak `authorization`/`cookie`	Opt in by name (below)
No	HTTP request/response bodies	Huge + sensitive	Not supported as attributes; use events sparingly
Yes	Exceptions on failed spans (`exception.*`)	Core debugging signal	—
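To make the "statement shape, not customer data" idea concrete, here is a deliberately simplified sketch of literal stripping. It is not the agent's sanitizer, which handles far more SQL syntax; the class name SqlShape and the two regular expressions are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

final class SqlShape {

    // Single-quoted strings, treating '' as an escaped quote inside the literal.
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");

    // Standalone integers and decimals; digits inside identifiers such as o1 are left alone.
    private static final Pattern NUMBER_LITERAL = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    // Replaces literals with ? so two queries that differ only in values share one shape.
    static String sanitize(String sql) {
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }

    public static void main(String[] args) {
        System.out.println(sanitize(
                "SELECT * FROM orders o1 WHERE o1.id = 42 AND status = 'open' AND total > 19.99"));
    }
}
```

Strings are stripped before numbers so that digits inside a quoted value never survive as a partial match. Two requests for different order ids collapse to the same attribute value, which is what keeps a "group by query" view bounded.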

Capturing extra headers — deliberately

HTTP spans capture a curated header set, not every header. To record specific request headers as span attributes (for example a tenant header you route on), opt in explicitly:

export OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_REQUEST_HEADERS=x-tenant-id,x-request-source
export OTEL_INSTRUMENTATION_HTTP_CLIENT_CAPTURE_RESPONSE_HEADERS=x-ratelimit-remaining

Each listed header becomes an attribute like http.request.header.x-tenant-id. Be deliberate — capturing authorization, cookie, or set-cookie writes secrets into your trace store, where they live as long as your retention and are visible to everyone with trace access.
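One way to stay deliberate is to lint the capture list before it ships. The sketch below is a hypothetical deploy-time check, not an agent feature: HeaderCaptureLint and its methods are invented names, and proxy-authorization is an extra entry added on the assumption that it is as sensitive as the three headers named above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class HeaderCaptureLint {

    // authorization, cookie and set-cookie come from the text; proxy-authorization is an assumption.
    private static final Set<String> SENSITIVE =
            Set.of("authorization", "cookie", "set-cookie", "proxy-authorization");

    // Returns the entries of a comma-separated capture list that should not be recorded.
    static List<String> sensitiveEntries(String captureList) {
        List<String> hits = new ArrayList<>();
        for (String raw : captureList.split(",")) {
            String name = raw.trim().toLowerCase(Locale.ROOT);
            if (SENSITIVE.contains(name)) {
                hits.add(name);
            }
        }
        return hits;
    }

    // Attribute key under which a captured request header is recorded.
    static String requestAttributeKey(String header) {
        return "http.request.header." + header.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(sensitiveEntries("x-tenant-id, Authorization"));
        System.out.println(requestAttributeKey("X-Tenant-Id"));
    }
}
```

Failing the build when sensitiveEntries is non-empty is cheaper than purging secrets from a trace store after the fact.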

Context propagation: the part you have to think about

This is the section that separates a working tracing setup from a broken one. Within a single thread the agent threads context automatically. Across a network hop it serializes context into headers. Across an async boundary you may have to help. Here is the propagation matrix — what the agent handles for free and what you bridge:

Boundary	Example	Does the agent handle it?	What you do
Same thread	Method calls within one request	Yes, automatically	Nothing
Outbound HTTP/gRPC (network)	`RestTemplate`, `WebClient`, OkHttp, gRPC stub	Yes — injects `traceparent`	Nothing
Inbound HTTP/gRPC (network)	Servlet, controller, gRPC service	Yes — extracts `traceparent`	Nothing
Messaging produce/consume	Kafka, RabbitMQ, JMS	Mostly — links via message headers	Verify headers propagate; sometimes link
`@Async` / Spring task executor	`@Async` method, `TaskExecutor`	Often, via executor instrumentation	Verify; wrap if orphaned
Raw `ExecutorService`	`Executors.newFixedThreadPool`	Sometimes — depends on submit path	Wrap with `Context.taskWrapping`
`CompletableFuture` chains	`supplyAsync`, `thenApplyAsync`	Partially	Capture + `makeCurrent` in the task
Reactive (Reactor/RxJava)	`Mono`/`Flux` on a scheduler	Partially	Reactor: context is propagated via the agent; verify on custom schedulers
Manual `new Thread()`	Hand-rolled threads	No	Capture context, restore in `run()`

Across the wire (W3C, by default)

The agent’s default propagators are W3C tracecontext and baggage. Every outbound client call injects headers and every inbound request extracts them:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate:  vendora=value,vendorb=value
baggage:     tenant.id=acme,request.origin=mobile

traceparent has four hyphen-separated fields, left to right: version (00, one byte / two hex — the W3C trace-context version); trace-id (4bf92f35…e4736, 16 bytes / 32 hex — shared by every span in the trace, the value you search on); parent-id (00f067aa0ba902b7, 8 bytes / 16 hex — the caller’s span id, which becomes the child’s parent); and trace-flags (01, one byte — bit 0 is the sampled flag, so 01 means sampled and 00 means not). tracestate is an optional vendor-specific companion. Because both sides default to W3C, a Java service calling another Java service is correlated with zero configuration.
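The field layout is simple enough to parse by hand, which helps when you are reading a raw header out of an access log. The following is a minimal sketch for version 00 headers only; TraceParent is a hypothetical class, not part of the OpenTelemetry API, and it skips some checks the specification requires (for example rejecting an all-zero trace-id).

```java
final class TraceParent {
    final String version;
    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String version, String traceId, String parentId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Splits version-traceid-parentid-flags and validates each field's length and hex alphabet.
    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 fields, got " + parts.length);
        }
        requireHex(parts[0], 2, "version");
        requireHex(parts[1], 32, "trace-id");
        requireHex(parts[2], 16, "parent-id");
        requireHex(parts[3], 2, "trace-flags");
        int flags = Integer.parseInt(parts[3], 16);
        // Bit 0 of trace-flags is the sampled flag.
        return new TraceParent(parts[0], parts[1], parts[2], (flags & 0x01) != 0);
    }

    private static void requireHex(String value, int length, String field) {
        if (value.length() != length || !value.matches("[0-9a-f]+")) {
            throw new IllegalArgumentException("bad " + field + ": " + value);
        }
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId + " sampled=" + tp.sampled);
    }
}
```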

The moment a non-OTel hop sits in the middle — say a legacy edge that speaks B3 (the Zipkin format) — you widen the propagator set so the agent reads and writes both:

export OTEL_PROPAGATORS=tracecontext,baggage,b3multi

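Every configured propagator runs on both inject and extract, in the order listed. On inject, all of them write their headers. On extract, each one reads its own headers in turn, so when a request carries both W3C and B3 headers the list order decides which value ends up as the parent; exercise that mixed case in a test rather than assuming one format is authoritative. The propagators you can choose and when each applies: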

Propagator value	Format	Headers	Use when
`tracecontext`	W3C Trace Context	`traceparent`, `tracestate`	Default; always include
`baggage`	W3C Baggage	`baggage`	Default; to carry baggage
`b3`	B3 single-header	`b3`	Interop with single-header Zipkin/Istio
`b3multi`	B3 multi-header	`X-B3-TraceId`, `X-B3-SpanId`, …	Interop with multi-header B3 systems
`jaeger`	Jaeger	`uber-trace-id`	Interop with legacy Jaeger clients
`ottrace`	OpenTracing	`ot-tracer-*`	Interop with old OpenTracing apps
`xray`	AWS X-Ray	`X-Amzn-Trace-Id`	Interop on AWS / X-Ray daemon

Baggage is not attributes

Baggage is key-value context that rides alongside the trace across every hop. Two things to internalize: it is not automatically copied onto spans (you set an attribute separately if you want it queryable), and it is not free — every key inflates the baggage header on every outbound request for the rest of the call graph. Use it for a small number of cross-cutting values (tenant id, request origin) you need deep in the call graph, set via the API:

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

try (Scope scope = Baggage.current().toBuilder()
        .put("tenant.id", tenantId)
        .build().makeCurrent()) {
    // Downstream HTTP/gRPC calls now carry baggage=tenant.id=...
    paymentClient.authorize(request);
}

Baggage versus span attributes versus resource attributes — three different things people conflate:

Property	Baggage	Span attribute	Resource attribute
Scope	Whole trace, all hops	One span	Every span/metric from the process
Travels across the network?	Yes (`baggage` header)	No	No
Queryable in the backend?	Only if copied to an attribute	Yes	Yes
Cost of adding one	Header bytes on every request	A few bytes on one span	Negligible (set once)
Set via	`Baggage` API	`span.setAttribute` / agent	`OTEL_RESOURCE_ATTRIBUTES`
Good for	tenant, request origin, feature flag	operation specifics	service identity, env, pod
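The header cost of baggage is easy to underestimate, so it can help to measure it. Below is a rough, stdlib-only model of how entries serialize into the baggage header; it is not the OpenTelemetry propagator, BaggageHeaderModel is a hypothetical name, and the percent-encoding here is an approximation built on URLEncoder.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class BaggageHeaderModel {

    // Serializes entries as key=value pairs joined by commas, in insertion order.
    static String toHeaderValue(Map<String, String> entries) {
        StringJoiner header = new StringJoiner(",");
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            // URLEncoder emits '+' for spaces; swap to %20 to approximate percent-encoding.
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8)
                    .replace("+", "%20");
            header.add(entry.getKey() + "=" + value);
        }
        return header.toString();
    }

    // Bytes added to every outbound request for the rest of the call graph.
    static int headerBytes(Map<String, String> entries) {
        return ("baggage: " + toHeaderValue(entries)).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        Map<String, String> baggage = new LinkedHashMap<>();
        baggage.put("tenant.id", "acme");
        baggage.put("request.origin", "mobile");
        System.out.println(toHeaderValue(baggage));
        System.out.println(headerBytes(baggage) + " bytes per outbound request");
    }
}
```

Multiply the byte count by the number of downstream calls a request fans out to, and the case for keeping baggage to a couple of short keys makes itself.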

Across thread pools and reactive boundaries

This is where traces break in real systems. When you hand work to an ExecutorService, the runnable executes on a thread that has no current context, so the child span attaches to nothing and you get an orphan. The agent instruments common executors, but the robust, explicit fix is to wrap the executor so context is captured at submit time and restored at run time:

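import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Wrap once at creation: context is captured at submit time and restored at run time.
ExecutorService traced = Context.taskWrapping(Executors.newFixedThreadPool(8));
// Work submitted to `traced` runs under the submitting thread's context.
traced.submit(() -> chargeAsync(order));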

For a single manual handoff — a one-off CompletableFuture or a raw thread — capture the context and re-make it current inside the task:

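import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.concurrent.CompletableFuture;

// Capture on the request thread, where the context is still correct.
Context captured = Context.current();
CompletableFuture.runAsync(() -> {
    // Restore on the worker thread; closing the Scope puts back the worker's previous context.
    try (Scope scope = captured.makeCurrent()) {
        settle(order);   // spans here parent correctly
    }
});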

The rule is uniform: capture Context.current() on the thread that has the right context, carry it to the other thread, and makeCurrent() it there inside a try-with-resources. The decision table for each async construct:

Construct	Symptom if unhandled	Cleanest fix
`ExecutorService.submit/execute`	Orphan child spans	`Context.taskWrapping(executor)` once at creation
Spring `@Async`	Sometimes orphaned	Verify; if orphaned, wrap the `TaskExecutor`
`CompletableFuture.runAsync/supplyAsync`	Orphan in the async stage	Capture + `makeCurrent` inside the lambda
`CompletableFuture.*Async` with custom executor	Orphan	Wrap the executor and/or capture context
Raw `new Thread(r).start()`	Orphan	Capture + `makeCurrent` in `run()`
Reactor `Mono`/`Flux`	Usually fine (agent propagates)	Verify on custom `Schedulers`; avoid manual context juggling
Kafka consumer batch	Spans link but may not nest	Process per-record under the record’s context
Scheduled jobs (`@Scheduled`)	New root each run (correct)	Nothing — each run is its own trace
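Why the capture-and-restore pattern works is easiest to see in a model with no OpenTelemetry in it at all. In the sketch below a plain ThreadLocal<String> stands in for the thread-bound Context, and the hypothetical wrap helper plays the role that Context.taskWrapping or a manual makeCurrent plays in real code. It illustrates the mechanism only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ContextHandoffModel {

    // Stand-in for the thread-bound Context; here it only holds a parent span name.
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Captures the caller's value now, restores it on whichever thread runs the task,
    // and puts the worker's previous value back afterwards.
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = CURRENT.get();
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    CURRENT.remove();
                } else {
                    CURRENT.set(previous);
                }
            }
        };
    }

    // Returns what the worker sees: unwrapped, wrapped, then unwrapped again.
    static String[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CURRENT.set("POST /checkout");
        try {
            Callable<String> readParent = () -> {
                String parent = CURRENT.get();
                return parent == null ? "orphan" : parent;
            };
            String unwrapped = pool.submit(readParent).get();
            String wrapped = pool.submit(wrap(readParent)).get();
            String afterwards = pool.submit(readParent).get();
            return new String[] {unwrapped, wrapped, afterwards};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            CURRENT.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Prints: orphan, POST /checkout, orphan
        System.out.println(String.join(", ", demo()));
    }
}
```

The third result matters as much as the second: the worker thread goes back to having no context once the wrapped task finishes, which is the same job that closing the Scope does in the real API.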

Manual spans with the API only — no SDK dependency

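You do not pull in the OpenTelemetry SDK to write custom spans under the agent. You depend on the API artifact only (io.opentelemetry:opentelemetry-api), mark it compileOnly / provided, and the agent supplies the live implementation at runtime. One caveat on scope: the agent bridges the API classes it finds in your application to its own implementation, so those classes still have to be loadable at runtime. If no other dependency brings opentelemetry-api onto the runtime classpath and you see NoClassDefFoundError for io.opentelemetry.api classes, declare it as a regular implementation dependency; that is still API-only and does not reintroduce the SDK. If you accidentally ship the SDK too, you get two providers and duplicate exports, the single most common “why are my spans doubled / why are there two services” bug.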

// build.gradle.kts — API only; the agent provides the implementation.
dependencies {
    compileOnly("io.opentelemetry:opentelemetry-api:1.55.0")
    // Optional: declarative @WithSpan support
    compileOnly("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:2.20.1")
    // DO NOT add opentelemetry-sdk here when running under the agent.
}

The dependency decision in one table — what to include for each scenario:

Scenario	`opentelemetry-api`	`…-instrumentation-annotations`	`opentelemetry-sdk`
Agent attached, want custom spans	`compileOnly`	optional `compileOnly`	No
Agent attached, only `@WithSpan`	(transitive)	`compileOnly`	No
No agent, manual SDK setup	`implementation`	optional	`implementation` (you wire it)
Library author (don’t pick impl)	`implementation` (API)	—	No
Tests with no exporter	`compileOnly` (no-op at runtime)	—	optional in-memory exporter

Then create spans that wrap business operations the agent can’t name:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

private static final Tracer tracer =
    GlobalOpenTelemetry.getTracer("checkout-service", "2.14.0");

void applyPromotion(Order order) {
    Span span = tracer.spanBuilder("promotion.apply")
        .setSpanKind(SpanKind.INTERNAL)
        .startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("promotion.code", order.promoCode());
        span.setAttribute("order.item_count", order.items().size());
        // ... business logic ...
        span.addEvent("discount.calculated");
    } catch (Exception e) {
        span.setStatus(StatusCode.ERROR, "promotion failed");
        span.recordException(e);
        throw e;
    } finally {
        span.end();   // ALWAYS end in finally; makeCurrent() does not end the span
    }
}

Two correctness rules underpin that snippet. makeCurrent() only sets the span as the current context for the scope — it does not end the span, so end() belongs in finally. And GlobalOpenTelemetry.getTracer(...) returns a no-op tracer if no agent is attached, so the exact same code runs harmlessly in a unit test with no exporter and no NPE. The span lifecycle calls you actually use:

Call	What it does	Mistake to avoid
`tracer.spanBuilder(name).startSpan()`	Creates + starts a span	Forgetting to start (builder isn’t a span)
`span.makeCurrent()`	Sets it as current for the scope (returns `Scope`)	Not closing the `Scope` → leaked context
`span.setAttribute(k, v)`	Adds a typed attribute	High-cardinality values (user id) blow up storage
`span.addEvent(name)`	Timestamped event within the span	Using events where a child span fits better
`span.setStatus(ERROR, desc)`	Marks the span failed	Forgetting it — error traces look successful
`span.recordException(e)`	Attaches `exception.*` attributes/event	Recording then swallowing the exception silently
`span.end()`	Ends + makes it exportable	Not in `finally` → leaked/unended on throw

Declarative spans with `@WithSpan`

If you prefer declarative spans, add the opentelemetry-instrumentation-annotations dependency and annotate methods; the agent weaves them at load time:

import io.opentelemetry.instrumentation.annotations.WithSpan;
import io.opentelemetry.instrumentation.annotations.SpanAttribute;

@WithSpan("inventory.reserve")
int reserve(@SpanAttribute("sku") String sku, int qty) {
    // a span named inventory.reserve wraps this method automatically
    return warehouse.decrement(sku, qty);
}

Manual spanBuilder versus @WithSpan — pick by control vs. brevity:

Aspect	Manual `spanBuilder`	`@WithSpan` annotation
Control over name/kind/attrs/events	Full	Limited (name + `@SpanAttribute` params)
Boilerplate	More (`try`/`finally`)	Minimal
Status/exception handling	Explicit, your choice	Automatic on thrown exception
Conditional / dynamic spans	Easy	Not really
Works on private methods	n/a	Only on agent-woven (public/visible) methods
Best for	Rich business spans, branching	Quick coverage of a service method

Sampling: how much to keep, and where to decide

Out of the box (default parentbased_always_on) you trace everything, including health checks and internal polls. At any real throughput that is too much data and too much cost. Three levers cut the noise, and the most important decision is which layer makes the call.

Sampler. parentbased_traceidratio respects an upstream sampling decision and applies your ratio only to traces that start at this service. That keeps traces whole: a child is never kept when its parent was dropped (which would leave a span with a missing parent), and never dropped when its parent was kept (which would leave a hole in the middle of the trace).

export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1   # keep 10% of root traces

The samplers the agent supports, and what each does:

Sampler value	Behaviour	Argument	When to use
`always_on`	Sample every trace	—	Dev; very low throughput
`always_off`	Sample nothing	—	Disable sampling (keep config)
`traceidratio`	Keep a fixed fraction by trace id	ratio `0.0–1.0`	Rarely alone — ignores parent
`parentbased_always_on`	Follow parent; root → on	—	Default; full volume
`parentbased_always_off`	Follow parent; root → off	—	Only child of sampled parents
`parentbased_traceidratio`	Follow parent; root → ratio	ratio	The standard head sampler
`parentbased_jaeger_remote`	Follow parent; root → remote-configured rate	`endpoint=…`	Centrally-managed per-service rates
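
To make the parent-based ratio rule concrete, here is a minimal sketch of the decision in plain Java. It illustrates the idea and is not the agent's sampler: the class name HeadSamplerSketch, the Boolean parentSampled parameter (null meaning "this service starts the trace"), and the choice to read the last 16 hex characters of the trace id as the random part are all assumptions made for this example.

```java
/** Illustrative parent-based trace-id-ratio decision. Not the agent's implementation. */
final class HeadSamplerSketch {
    private final double ratio;
    private final long upperBound;

    HeadSamplerSketch(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
        }
        this.ratio = ratio;
        // Scale the ratio onto the non-negative long range.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /**
     * @param traceIdHex    32 lowercase hex characters
     * @param parentSampled upstream decision, or null when this service starts the trace
     */
    boolean shouldSample(String traceIdHex, Boolean parentSampled) {
        if (parentSampled != null) {
            return parentSampled;   // follow the parent so the trace stays whole
        }
        if (ratio >= 1.0) {
            return true;
        }
        // Assumption: treat the low 64 bits of the trace id as uniformly random.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return low < upperBound;    // the same trace id always yields the same answer
    }

    public static void main(String[] args) {
        HeadSamplerSketch sampler = new HeadSamplerSketch(0.1);
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("root decision:  " + sampler.shouldSample(traceId, null));
        System.out.println("sampled parent: " + sampler.shouldSample(traceId, true));
        System.out.println("dropped parent: " + sampler.shouldSample(traceId, false));
    }
}
```

The point of keying the decision on the trace id rather than on a random draw is that every service seeing the same id reaches the same verdict, and the ratio only ever applies when there is no parent to follow.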

Head versus tail — the decision that actually matters. The agent samples blind: it decides at the first span whether to keep the trace, before it knows if the request errored or was slow. That means head sampling at 10% drops 90% of your errors too. The fix is to head-sample generously at the agent (or keep 100% and rely on the Collector) and put the real decision — keep all errors, keep slow traces, sample the boring ones — in the Collector as tail sampling, where the whole trace is visible. Side by side:

Property	Head sampling (agent)	Tail sampling (Collector)
Decides when	At trace start	After the trace is assembled
Knows the outcome?	No (blind to errors/latency)	Yes — can keep all errors/slow
Cost on the app	Tiny	None (off-box)
Cost on the Collector	None	Buffers spans; needs memory + grouping
Risk	Drops errors with the baseline	Misconfigured policy / under-provisioned buffer
Best practice	Generous ratio or 100% to Collector	Error-biased + latency-biased + base rate

The standard production posture: agent on a generous head ratio (or 100% in lower-volume systems), Collector doing tail-based, error-biased sampling — covered in depth in OpenTelemetry Collector tail-based sampling and load balancing. The agent stays dumb; the Collector gets smart.
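
The difference between the two columns is easiest to see as code. The sketch below is a hypothetical, simplified stand-in for a tail-sampling policy, not the Collector's tail_sampling processor: the SpanSummary type, the slow threshold, and the hash-based base rate are all assumptions for illustration. What it shows is that the decision takes the whole trace as input, which a head sampler never has.

```java
import java.util.List;

/** Illustrative tail-sampling decision over a fully assembled trace. */
final class TailPolicySketch {

    /** Only the fields this policy needs; a real span carries far more. */
    static final class SpanSummary {
        final String traceId;
        final boolean error;
        final long durationMillis;

        SpanSummary(String traceId, boolean error, long durationMillis) {
            this.traceId = traceId;
            this.error = error;
            this.durationMillis = durationMillis;
        }
    }

    private final long slowThresholdMillis;
    private final int baseRatePercent;

    TailPolicySketch(long slowThresholdMillis, int baseRatePercent) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.baseRatePercent = baseRatePercent;
    }

    /** Keep every errored or slow trace; keep a fixed share of the rest. */
    boolean keep(List<SpanSummary> trace) {
        if (trace.isEmpty()) {
            return false;
        }
        for (SpanSummary span : trace) {
            if (span.error || span.durationMillis >= slowThresholdMillis) {
                return true;
            }
        }
        // Deterministic bucket in [0, 100) derived from the trace id.
        int bucket = Math.floorMod(trace.get(0).traceId.hashCode(), 100);
        return bucket < baseRatePercent;
    }
}
```

With a base rate of 10, an errored trace is always kept while an uneventful one survives roughly one time in ten, which is the error-biased posture described above.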

Drop specific endpoints and disable chatty modules. Health and readiness probes are pure noise. Every instrumentation module has a flag otel.instrumentation.<name>.enabled; turn one off when it’s redundant, and use the “disable everything, opt back in” pattern for a service that should emit only a curated handful of span types:

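# Disable a single module by name. jdbc-datasource (the DataSource.getConnection span)
# is already off by default in agent 2.x; it is shown here for the naming pattern.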
export OTEL_INSTRUMENTATION_JDBC_DATASOURCE_ENABLED=false

# Surgical mode: disable all, opt back in only what you want
export OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED=false
export OTEL_INSTRUMENTATION_SPRING_WEBMVC_ENABLED=true
export OTEL_INSTRUMENTATION_JDBC_ENABLED=true
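
The environment variable names above follow mechanically from the otel.instrumentation.<name>.enabled property: upper-case it and replace dots and hyphens with underscores. A small sketch of that mapping, with a hypothetical helper class OtelEnvNames that is not part of any OpenTelemetry artifact:

```java
import java.util.Locale;

/** Maps an agent property name to its environment-variable form. */
final class OtelEnvNames {
    private OtelEnvNames() {}

    // Rule: upper-case the property, then replace '.' and '-' with '_'.
    static String toEnvVar(String property) {
        return property.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    // Convenience for the per-module switch: otel.instrumentation.<name>.enabled
    static String moduleSwitch(String moduleName) {
        return toEnvVar("otel.instrumentation." + moduleName + ".enabled");
    }

    public static void main(String[] args) {
        System.out.println(moduleSwitch("jdbc-datasource"));
        System.out.println(moduleSwitch("spring-webmvc"));
        System.out.println(toEnvVar("otel.instrumentation.common.default-enabled"));
    }
}
```

Getting the module name wrong does not fail loudly, so it is worth checking a derived name against the agent's documented module list before relying on it.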

The noise-control levers, ranked by precision:

Lever	Granularity	Mechanism	Trade-off
Lower the head sample ratio	Whole service	`OTEL_TRACES_SAMPLER_ARG`	Drops errors too (blind)
Tail sampling in the Collector	Per-trace by outcome	Collector `tail_sampling` processor	Collector complexity + buffering
Disable a module	Per library	`OTEL_INSTRUMENTATION_<name>_ENABLED=false`	Lose that library’s spans entirely
`COMMON_DEFAULT_ENABLED=false` + opt-in	Per library, whitelist	Disable all, enable a few	Easy to forget a needed module
Drop by attribute in the Collector	Per span/route	`filter`/`tail_sampling` policies	Logic lives off-box

The Collector: your policy and fan-out layer

The agent’s job is to produce telemetry well. The OpenTelemetry Collector is a separate process whose job is to process and route it. You almost always want one between your services and your backend, because it lets you change sampling, batching, redaction and destinations without redeploying 140 services — config is the product surface. Point the agent at the Collector; point the Collector at your backend.

A Collector pipeline is three stages: receivers (how data comes in), processors (what happens to it), exporters (where it goes). A minimal but production-shaped config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  resourcedetection:
    detectors: [env, system]
  # tail_sampling lives in a gateway tier — see the tail-sampling article

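exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: false
  # Tempo stores traces only; metrics need their own OTLP-capable backend.
  # The exporter name and endpoint below are placeholders, not a real deployment.
  otlp/metrics:
    endpoint: metrics-backend.observability.svc:4317
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp/metrics]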

The pipeline components you’ll actually wire, and why:

Stage	Component	Purpose	Notes
Receiver	`otlp`	Accept agent traffic (gRPC 4317 / HTTP 4318)	Match the agent’s protocol
Processor	`memory_limiter`	Shed load before OOM	Put it first in the list
Processor	`batch`	Group spans for efficient export	Big throughput win; tune size/timeout
Processor	`resourcedetection`	Add host/cloud/k8s resource attrs	Enriches what the agent already set
Processor	`tail_sampling`	Keep errors/slow, sample the rest	Needs spans of a trace on one Collector
Processor	`attributes` / `redaction`	Drop/scrub sensitive attributes	Last line of defence for PII
Exporter	`otlp/<backend>`	Send to Tempo/Jaeger/SaaS	One pipeline can fan out to several

The deployment topology matters for tail sampling specifically: tail sampling needs all spans of a given trace to reach the same Collector instance, which is why a horizontally-scaled Collector tier uses a load-balancing exporter keyed on trace id in front of the tail-sampling Collectors. That, and the agent-vs-gateway split, are detailed in OpenTelemetry Collector pipelines for production.
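
Why keying on trace id works can be shown in a few lines. The sketch below is a deliberate simplification with hypothetical names: it uses a plain modulo over a hash, whereas the Collector's load-balancing exporter uses consistent hashing so that resizing the tier moves fewer traces. The property that matters for tail sampling is the same in both: every span of one trace maps to one instance.

```java
/** Illustrative trace-id keyed routing. A real tier would use consistent hashing. */
final class TraceRoutingSketch {
    private TraceRoutingSketch() {}

    /** Returns the index of the Collector instance that should receive this trace. */
    static int collectorFor(String traceIdHex, int collectorCount) {
        if (collectorCount <= 0) {
            throw new IllegalArgumentException("collectorCount must be positive");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(traceIdHex.hashCode(), collectorCount);
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        // Two spans from different services, same trace id, same destination.
        System.out.println("checkout span -> collector " + collectorFor(traceId, 3));
        System.out.println("payments span -> collector " + collectorFor(traceId, 3));
    }
}
```

Routing by span id, or round-robin, would scatter a trace across instances and leave each tail sampler deciding on a fragment.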

Resource detection: service identity, k8s, and cloud

A span is unattributable if you can’t tell which deployment emitted it. The resource is the set of attributes attached to every span and metric. OTEL_SERVICE_NAME is the one you must always set — without it spans land under unknown_service:java and your backend becomes unsearchable.

The agent runs resource detectors automatically and merges what it finds: process and host attributes, container id, and cloud/k8s metadata when the environment exposes it. You stitch the rest in through the Kubernetes Downward API so each pod is self-describing:

env:
  - name: OTEL_SERVICE_NAME
    value: checkout-service
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: >-
      service.namespace=payments,
      deployment.environment=prod,
      service.version=2.14.0,
      k8s.pod.name=$(K8S_POD_NAME),
      k8s.namespace.name=$(K8S_NAMESPACE),
      k8s.node.name=$(K8S_NODE_NAME)

Kubernetes expands $(VAR) references in env values only when the referenced variable is declared earlier in the same env list — which is exactly why the fieldRef entries come first. The resource attributes that earn their place:

Attribute	Example	Source	Why it matters
`service.name`	`checkout-service`	`OTEL_SERVICE_NAME`	Mandatory; the primary search/group key
`service.namespace`	`payments`	resource attrs	Disambiguates same-named services across teams
`service.version`	`2.14.0`	resource attrs	Correlate a regression to a release
`deployment.environment`	`prod`	resource attrs	One backend serves prod + staging
`service.instance.id`	UUID	agent / detector	Per-process uniqueness
`host.name`	node hostname	system detector	Node-level correlation
`container.id`	docker id	container detector	Map a span to a container
`k8s.pod.name`	`checkout-7f9…`	Downward API	Join traces to pod logs/metrics
`k8s.namespace.name`	`payments`	Downward API	Scope by namespace
`k8s.node.name`	`aks-np-0`	Downward API	Node-affinity / noisy-neighbour analysis

The deployment.environment attribute is what lets one backend query prod separately from staging without separate stores; the k8s attributes are what let a trace join the pod’s logs and JVM metrics, which is the payoff in the next section.
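
OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs, and the folded YAML scalar (>-) shown earlier joins its lines with spaces, so the value the JVM sees has a space after each comma. The following sketch shows a tolerant parse of that shape. It is a hypothetical helper for illustration, not the agent's parser, and the trimming behaviour is an assumption of this example; if you feed the same string to a stricter consumer, remove the whitespace at the source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for a comma-separated key=value resource string. */
final class ResourceAttrsSketch {
    private ResourceAttrsSketch() {}

    static Map<String, String> parse(String raw) {
        Map<String, String> attributes = new LinkedHashMap<>();
        if (raw == null) {
            return attributes;
        }
        for (String pair : raw.split(",")) {
            String[] keyValue = pair.split("=", 2);   // split on the first '=' only
            if (keyValue.length != 2) {
                continue;                              // skip entries without a value
            }
            String key = keyValue[0].trim();
            if (!key.isEmpty()) {
                attributes.put(key, keyValue[1].trim());
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String folded = "service.namespace=payments, deployment.environment=prod, "
            + "k8s.pod.name=checkout-7f9";
        parse(folded).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```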

JVM and runtime metrics from the same agent

The agent is not trace-only. It emits JVM runtime metrics over OTLP on the same pipeline — no Micrometer, no separate Prometheus scrape. You get heap/non-heap usage, GC duration, thread counts, and class loading as OTLP metrics, exported to the metrics endpoint your OTLP config already points at.

export OTEL_METRICS_EXPORTER=otlp
export OTEL_METRIC_EXPORT_INTERVAL=30000   # milliseconds

This matters operationally because both signals carry the identical resource attributes: when a trace shows a 4-second pause, the JVM metrics from the same agent on the same pod tell you whether it was a stop-the-world GC, and they join on service.name + k8s.pod.name in your backend with no extra correlation work. The metric families the agent emits:

Metric family	What it measures	Use it to diagnose
`jvm.memory.used` / `committed` / `limit`	Heap + non-heap usage	OOM risk, leak trends
`jvm.gc.duration`	Time spent in garbage collection	Stop-the-world pauses behind latency spikes
`jvm.thread.count`	Live thread count	Thread leaks, pool saturation
`jvm.class.count` / `loaded`	Loaded classes	Classloader leaks (redeploy churn)
`jvm.cpu.recent_utilization`	Process CPU usage	Saturation vs. throttling
`http.server.request.duration`	Request latency histogram	RED metrics straight from the agent

To turn metrics off entirely (some teams keep JVM metrics in Prometheus already), set OTEL_METRICS_EXPORTER=none. These metrics also carry exemplars — links from a metric bucket to an example trace — which is how you jump from “p99 latency spiked” to the exact slow trace; see OpenTelemetry metrics, exemplars and Prometheus trace correlation.

Architecture at a glance

Picture the path of one request and where each OpenTelemetry mechanism engages. A user request arrives at the checkout-service JVM and hits a Spring controller. The agent’s Servlet/Spring instrumentation extracts any incoming traceparent and opens a SERVER span as the root of this service’s contribution. Inside the handler, a JDBC call to the orders database produces a CLIENT span (sanitized db.query.text), and an outbound WebClient call to the payments-service produces another CLIENT span — and on that outbound call the agent injects the current traceparent and baggage headers, so the downstream service’s agent extracts them and its SERVER span nests directly under the checkout CLIENT span. The trace is now spanning two processes with no code on either side.

Where you intervene is the async seam: the handler offloads a fraud check to a thread pool. Unless that pool is wrapped (Context.taskWrapping) or the context is captured and makeCurrent()'d on the worker thread, the fraud-check span has no parent and floats free — the single most common reason a trace “looks broken.” And where you add value is the business span: a manual promotion.apply span (API artifact only) wraps logic the agent can’t name, carrying promotion.code and order.item_count.

Every span — SERVER, CLIENT, the manual INTERNAL one, plus JVM metrics — leaves the JVM over OTLP (gRPC 4317 or HTTP 4318) carrying the shared resource (service.name, deployment.environment, k8s pod). It lands on the Collector, which batches, optionally tail-samples (keeping every errored trace and a fraction of the rest), redacts anything sensitive, and fans out to the trace backend and the metrics backend. The mental model to hold: the agent produces and propagates, you bridge async and add business spans, and the Collector decides policy — three responsibilities, three places, and almost every problem you hit is “the wrong layer is trying to do another layer’s job.”

Real-world scenario

Northwind Logistics runs ~140 JVM services (Spring Boot 3, Java 21) on EKS, shipping ~9,000 requests/second at peak across order capture, routing, and a Kafka-backed events pipeline. The platform team of five standardized on the OpenTelemetry Java agent via a shared base image so every service got tracing “for free.” Within a week of the rollout, two things went wrong at once: the trace backend bill roughly tripled, and search got slow enough that on-call engineers stopped using traces during incidents — the opposite of the goal.

The root causes were textbook. First, the agent defaulted to parentbased_always_on, so they were keeping 100% of traces including a Kafka consumer fleet emitting a CONSUMER span per message at full volume — millions of low-value spans per minute. Second, a meaningful fraction of services showed up as unknown_service:java because teams had shipped the shared image without setting OTEL_SERVICE_NAME, so those traces were both unattributable and impossible to filter out. Third — the subtle one — the routing service’s traces were full of orphan spans: it offloaded geocoding to a raw ExecutorService that wasn’t wrapped, so half of each routing trace was disconnected, which made traces look incomplete and eroded trust further.

The constraint was the usual one: they could not touch 140 codebases or coordinate 140 redeploys on any reasonable timeline. So they fixed what they could from the deployment layer and reserved code changes for the one service that needed it. Via the platform Helm chart they mandated OTEL_SERVICE_NAME from the existing release name ({{ .Release.Name }}) so it was impossible to deploy nameless again, pinned a generous head ratio on the agent, and moved the real sampling decision to a Collector gateway doing tail-based, error-biased sampling — keep 100% of error traces and 10% of the rest:

# Injected by the platform Helm chart into every service. No app change.
env:
  - name: OTEL_SERVICE_NAME
    value: {{ .Release.Name }}              # never unknown_service:java again
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.25"                           # head ratio; the gateway makes the real call
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-gateway.observability.svc:4318
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf

For the routing service’s orphans they made the one targeted code change that mattered: wrapping the geocoding pool with Context.taskWrapping(...), a three-line diff and a single redeploy. The combined result over the next billing period: stored span volume fell ~68% while the signal improved, because every error trace that reached the gateway was kept and orphans vanished from the routing traces. (The 0.25 head ratio still drops three in four traces before the gateway sees them, errors included.) Search latency dropped back to usable. The line the team wrote in their runbook: “With the agent, configuration is your product surface. Keep the agent boring, push policy to the Collector, and bridge async in code only where the agent truly can’t see.”

Advantages and disadvantages

Zero-code agent instrumentation is a genuine force multiplier, but it has sharp edges you must know about. Weigh it honestly:

Advantages	Disadvantages
Instruments ~100 libraries with no code change — HTTP, JDBC, Kafka, gRPC for free	Auto-spans are generic; business meaning needs manual spans you still write
Cross-service W3C propagation works out of the box between OTel services	Async boundaries (executors, futures, reactive) can drop context silently
Same agent emits traces + JVM metrics on one pipeline, joinable by resource	Defaults are not production-safe: full sampling, `unknown_service`, no env attrs
Vendor-neutral — point OTLP at any backend; swap without re-instrumenting	An accidental SDK dependency causes double export that’s confusing to spot
Configuration is the surface — change policy via env/Collector, no redeploy	Wrong protocol/port fails silently — app runs, no data, no error
Semantic conventions make traces coherent across languages	Convention/attribute names shift between agent major versions — pin the version
Custom spans use API-only (no-op without the agent) — safe in tests	High-cardinality attributes or unbounded baggage quietly inflate cost

The model is right for almost every JVM service where you want tracing without a rewrite — which is most of them. It bites hardest on heavily-async or reactive codebases (where you must bridge context deliberately), on cost-sensitive high-throughput fleets (where head-only sampling is a trap), and on any team that ships the agent with defaults and never tunes service name, sampling, or the Collector. Every disadvantage is manageable — but only if you know it exists, which is the entire point of this guide.

Hands-on lab

Run a real Spring Boot service under the agent, see auto-instrumented spans, add a manual span and baggage, watch a thread pool drop context and fix it, all against a local Collector with the debug exporter so you can read spans in your terminal. No cloud account required.

Step 1 — Start a local Collector that prints spans. Create collector.yaml:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]

Run it in Docker:

docker run --rm -p 4318:4318 \
  -v "$(pwd)/collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest \
  --config /etc/otelcol/config.yaml

Expected: the Collector logs Everything is ready. Begin running and processing data.

Step 2 — Download a pinned agent JAR.

curl -sSLo opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.20.1/opentelemetry-javaagent.jar
ls -l opentelemetry-javaagent.jar   # expect a ~20+ MB file

Step 3 — Configure and run any Spring Boot JAR under the agent. Use a service that exposes an HTTP endpoint (a minimal Spring Web app, or any existing one):

export OTEL_SERVICE_NAME=lab-checkout
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=lab,service.version=0.0.1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_TRACES_SAMPLER=always_on
export OTEL_METRICS_EXPORTER=none
export OTEL_BSP_SCHEDULE_DELAY=1000        # flush fast so you see spans quickly

java -javaagent:./opentelemetry-javaagent.jar -jar app.jar

Expected: at startup the agent logs a line containing opentelemetry-javaagent and its version.

Step 4 — Generate a request and read the SERVER span. In another terminal:

curl -s http://localhost:8080/api/checkout >/dev/null

Watch the Collector terminal: a span with Kind: Server, a name like GET /api/checkout, attributes http.request.method=GET, url.path=/api/checkout, http.response.status_code=200, and a 32-hex Trace ID. That span exists with zero instrumentation code.

Step 5 — Validate cross-process propagation. Send an explicit traceparent and confirm your span attaches to that exact trace id:

curl -s http://localhost:8080/api/checkout \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" >/dev/null

Expected: in the Collector output, the SERVER span’s Trace ID is 4bf92f3577b34da6a3ce929d0e0e4736 and its Parent ID is 00f067aa0ba902b7 — the agent extracted your header and parented to it.
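
The header you just sent has four dash-separated fields: version, trace id, parent span id, and flags. A minimal sketch of reading it, handling only version 00 and using a hypothetical class name, makes it clear which part became the Trace ID and which became the Parent ID in the Collector output:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses a version-00 W3C traceparent header value. Illustrative only. */
final class TraceParentSketch {
    private static final Pattern VERSION_00 =
        Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceParentSketch(String traceId, String parentSpanId, boolean sampled) {
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    static TraceParentSketch parse(String headerValue) {
        Matcher m = VERSION_00.matcher(headerValue.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a version-00 traceparent: " + headerValue);
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // All-zero ids are invalid in the Trace Context format.
        if (traceId.equals("0".repeat(32)) || parentSpanId.equals("0".repeat(16))) {
            throw new IllegalArgumentException("all-zero id in traceparent: " + headerValue);
        }
        int flags = Integer.parseInt(m.group(3), 16);
        return new TraceParentSketch(traceId, parentSpanId, (flags & 0x01) == 0x01);
    }

    public static void main(String[] args) {
        TraceParentSketch tp =
            parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId + " / " + tp.parentSpanId + " / sampled=" + tp.sampled);
    }
}
```

The trailing 01 is the sampled flag. Under a parent-based sampler that bit is what the agent follows, so sending 00 instead is a quick way to check that an unsampled parent suppresses the span.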

Step 6 — Add a manual span + baggage (code). Add compileOnly("io.opentelemetry:opentelemetry-api:1.55.0") and, in a service method called by the endpoint:

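import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Inside the service method; under the agent this tracer is backed by the agent's SDK.
Tracer tracer = GlobalOpenTelemetry.getTracer("lab-checkout");
Span span = tracer.spanBuilder("promotion.apply").startSpan();
try (Scope scope = span.makeCurrent()) {
    span.setAttribute("promotion.code", "SAVE10");
} finally {
    span.end();   // end() in finally; closing the Scope does not end the span
}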

Rebuild, rerun (Step 3), hit the endpoint. Expected: a child span named promotion.apply with attribute promotion.code=SAVE10, nested under the SERVER span (same Trace ID, its Parent ID = the SERVER span’s id).

Step 7 — Break context across a thread pool, then fix it. Make the endpoint offload work to an unwrapped pool:

ExecutorService pool = Executors.newFixedThreadPool(2);   // UNWRAPPED — orphans
pool.submit(() -> {
    Span s = tracer.spanBuilder("async.work").startSpan();
    try (Scope sc = s.makeCurrent()) { /* ... */ } finally { s.end(); }
});

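Hit the endpoint. Expected: async.work appears with a different Trace ID (or no parent), which makes it an orphan. If it still nests correctly, the agent’s own executor instrumentation is carrying context across this pool for you; disable it for the experiment with OTEL_INSTRUMENTATION_EXECUTORS_ENABLED=false and retry. Now wrap the pool: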

ExecutorService pool = Context.taskWrapping(Executors.newFixedThreadPool(2));

Rebuild, rerun, hit the endpoint. Expected: async.work now shares the request’s Trace ID and nests correctly. That diff is the entire lesson of context propagation.
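
If you want to see the mechanism without the agent in the way, the same behaviour can be reproduced with nothing but the JDK. The sketch below is an analogue, not OpenTelemetry code: a ThreadLocal stands in for the current context, and the hypothetical wrap helper does what a task-wrapping executor does in spirit, which is capture on the submitting thread and restore on the worker.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** JDK-only analogue of context loss and task wrapping. */
final class ContextBridgeSketch {
    // Stand-in for "the current context": one slot per thread.
    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    /** Captures the caller's value now and restores it around the task later. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = CURRENT_TRACE.get();
        return () -> {
            String previous = CURRENT_TRACE.get();
            CURRENT_TRACE.set(captured);
            try {
                return task.call();
            } finally {
                CURRENT_TRACE.set(previous);   // do not leak into the next task
            }
        };
    }

    /** Returns what an unwrapped and a wrapped task each observed. */
    static String[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CURRENT_TRACE.set("trace-4bf92f35");
        try {
            Callable<String> readContext = () -> String.valueOf(CURRENT_TRACE.get());
            String unwrapped = pool.submit(readContext).get();
            String wrapped = pool.submit(wrap(readContext)).get();
            return new String[] {unwrapped, wrapped};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            CURRENT_TRACE.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("unwrapped task saw: " + seen[0]);
        System.out.println("wrapped task saw:   " + seen[1]);
    }
}
```

The unwrapped task sees nothing because the worker thread has its own empty slot; the wrapped one sees the submitting thread's value. The restore in finally matters as much as the capture, since pool threads are reused.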

Step 8 — Teardown. Stop the JVM (Ctrl-C), stop the Collector (Ctrl-C), and remove the JAR and collector.yaml if you don’t need them. Nothing was provisioned in the cloud.

The lab’s validation checkpoints in one table:

Step	What you did	Expected evidence
4	Hit the endpoint	SERVER span, `GET /api/checkout`, status 200, a trace id
5	Sent a `traceparent`	Span’s trace id = `4bf92f35…`, parent = `00f067aa…`
6	Added a manual span	`promotion.apply` child with `promotion.code`
7a	Unwrapped pool	`async.work` orphaned (different trace id)
7b	`Context.taskWrapping`	`async.work` nests under the request trace

Common mistakes & troubleshooting

The failure modes that actually happen, by symptom. Each row gives the tell, how to confirm it, and the fix:

#	Symptom	Root cause	How to confirm	Fix
1	No telemetry at all; app runs fine	Wrong protocol/port (gRPC config → 4318, or vice versa)	`OTEL_JAVAAGENT_DEBUG=true`; check exporter target vs. Collector receiver	Match protocol to port: `grpc`/4317, `http/protobuf`/4318
2	Service shows as `unknown_service:java`	`OTEL_SERVICE_NAME` never set	Backend service list; grep env in the pod	Set `OTEL_SERVICE_NAME` (mandate via Helm)
3	Spans appear twice; two “services”	SDK shipped alongside the agent → two providers	`./gradlew dependencies` shows `opentelemetry-sdk`	Make custom-span deps `compileOnly`; drop the SDK
4	Async work makes orphan spans	Thread pool / future not carrying context	Span has a different trace id or no parent	`Context.taskWrapping` or capture + `makeCurrent`
5	Trace stops at a service boundary	Downstream not OTel, or propagator mismatch (B3 vs W3C)	Downstream SERVER span has no parent	Add `b3multi` to `OTEL_PROPAGATORS` (W3C first)
6	Trace bill exploded	`parentbased_always_on` default at high volume	Backend ingestion graph; sampler env unset	Head ratio + tail sampling in the Collector
7	Errors missing from sampled traces	Head sampling drops errors blind	Error count in metrics/logs far exceeds stored error traces; head ratio below 1.0	Keep 100% to Collector; tail-sample errors
8	`baggage` header huge / requests slow	Too many / large baggage keys	Inspect outbound `baggage` header size	Prune baggage to a few small cross-cutting keys
9	Custom span never appears	No agent attached → no-op tracer; or wrong tracer	`GlobalOpenTelemetry` returns no-op without agent	Run under the agent; verify it logged at startup
10	Span has no parent though same thread	`Scope` from `makeCurrent()` never closed earlier	A prior span leaked its scope	Always close `Scope` in try-with-resources
11	Health checks flood the traces	Probe endpoints traced at full volume	High-count low-value SERVER spans on `/health`	Drop in Collector by route; or disable the module
12	DB span shows literal values (PII)	Statement sanitizer disabled	`db.query.text` contains real values	Re-enable sanitizer (it’s on by default)
13	Agent didn’t attach	`-javaagent` path wrong, or flag placed after `-jar`	No agent version line at startup	Put an absolute-path `-javaagent` before `-jar`, or set it via `JAVA_TOOL_OPTIONS`
14	Metrics missing but traces fine	`OTEL_METRICS_EXPORTER=none` or interval too long	Metrics endpoint quiet; check env	Set exporter `otlp`; lower `OTEL_METRIC_EXPORT_INTERVAL`
15	Two agents / conflicting APM	Vendor APM agent + OTel agent both attached	Two `-javaagent` flags or an APM env injector	Run one tracing agent; pick OTel or the vendor
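
Row 1 is cheap to catch before deploy. The check below is a hypothetical pre-flight helper, not an agent feature, and it only knows the conventional ports (4317 for OTLP/gRPC, 4318 for OTLP/HTTP); a Collector listening on non-standard ports would need its own rules.

```java
import java.net.URI;

/** Flags the classic protocol/port mismatch using conventional OTLP ports only. */
final class OtlpEndpointCheck {
    private OtlpEndpointCheck() {}

    static String check(String protocol, String endpoint) {
        int port = URI.create(endpoint).getPort();   // -1 when the URL has no explicit port
        if ("grpc".equals(protocol) && port == 4318) {
            return "grpc exporter pointed at 4318, the conventional OTLP/HTTP port";
        }
        if ("http/protobuf".equals(protocol) && port == 4317) {
            return "http/protobuf exporter pointed at 4317, the conventional OTLP/gRPC port";
        }
        return "ok";
    }

    public static void main(String[] args) {
        System.out.println(check("http/protobuf", "http://localhost:4318"));
        System.out.println(check("grpc", "http://otel-gateway.observability.svc:4318"));
    }
}
```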

Lost context — the number-one issue

Orphan spans (#4, #10) are the most common and most damaging failure because they make traces look incomplete, which destroys trust faster than no traces at all. Run this mental checklist when a span won’t parent, in order:

Is the span created on the same thread as its intended parent? If not, you crossed an async boundary and the context did not follow on its own; the next check tells you how to bridge it.
If it’s async, is the executor wrapped with Context.taskWrapping? If not, wrap it at creation, or capture context and makeCurrent() inside the submitted task.
Was the parent’s Scope still open when the child was created? If you closed the parent scope too early (didn’t use try-with-resources), the child has no current parent — keep the scope open while children are created.
Is the boundary a network hop to a non-OTel service? If so, the downstream isn’t reading W3C — align propagators (b3multi, W3C first) or instrument the downstream.

Double instrumentation — the confusing one

Symptom #3 (doubled spans / two services) is almost always the SDK shipped next to the agent. The agent is the SDK at runtime; a second one means two OpenTelemetry providers, two batch processors, two exports. Confirm with a dependency tree and fix by making OpenTelemetry deps compileOnly. There are four variants worth recognizing, each with the same root shape — two SDK implementations active at once:

opentelemetry-sdk on the runtime classpath — declared implementation instead of compileOnly. Confirm with ./gradlew dependencies (it shows opentelemetry-sdk); fix by switching it to compileOnly.
A Spring Boot OTel starter and the agent — both auto-configure an SDK. You’ll see Spring’s OTel autoconfiguration active alongside the agent; pick one path: the agent or the starter, not both.
A vendor APM agent plus the OTel agent — two -javaagent JARs, two agent banners at startup. Run only one tracing agent.
Manual SDK init in code plus the agent — someone called OpenTelemetrySdk.builder() and installed a global provider. Remove the manual SDK setup; under the agent you never wire the SDK yourself.

Best practices

Always set OTEL_SERVICE_NAME, and make it impossible to forget. Source it from the deployment (Helm release, Downward API) so a nameless deploy can’t happen. unknown_service:java is the single most common avoidable mistake.
Pin the agent version. Never float on latest — a major-version bump can rename spans/attributes and silently move your dashboards. Upgrade deliberately and review the changelog.
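Match protocol and port exactly. grpc/4317 or http/protobuf/4318. Agent 2.x defaults to http/protobuf (1.x defaulted to grpc), so a config carried over from an older agent can mismatch, and mismatches fail silently. For http/protobuf, set the base endpoint and let the agent append the path.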
Depend on the API artifact only for custom spans. compileOnly/provided. Never ship the SDK under the agent — that is the double-export trap.
End every manual span in finally and close every Scope in try-with-resources. makeCurrent() doesn’t end the span; an unclosed scope leaks context into later work.
Bridge async deliberately. Wrap executors with Context.taskWrapping, or capture Context.current() and makeCurrent() it in the task. Treat every submit/runAsync/new Thread as a context boundary until proven otherwise.
Head-sample generously at the agent; tail-sample for real in the Collector. The agent is blind to outcomes; keep errors and slow traces by deciding in the Collector where the whole trace is visible.
Keep tracecontext first in OTEL_PROPAGATORS. Add b3multi/others only when a specific non-OTel hop needs it. Don’t carry propagators you don’t use.
Use baggage sparingly. A few small cross-cutting keys (tenant, origin), never a grab-bag — every key is header bytes on every request, and baggage isn’t queryable until copied to an attribute.
Keep span attributes low-cardinality. Route templates, not concrete URLs; ids belong on a small number of spans, not as grouping keys. High cardinality is what blows up storage cost.
Push policy to config, not code. Service name, sampling, redaction, destinations all live in env vars and the Collector — change them without redeploying the fleet.
Validate propagation end to end on day one. Send a known traceparent and confirm the backend shows your span as a child. Re-run it after any agent or propagator change.

Security notes

Traces are a data-exfiltration surface if you’re careless, because they capture request metadata and can capture more if you let them. The controls that matter:

Risk	What leaks	Control
Sensitive headers as attributes	`authorization`, `cookie`, `set-cookie` tokens	Never list them in `OTEL_INSTRUMENTATION_HTTP_*_CAPTURE_*_HEADERS`; redact in the Collector
Full SQL with literals	Customer data, PII in `db.query.text`	Keep the statement sanitizer on (default)
Request/response bodies	Payload PII	Don’t put bodies in attributes/events
High-cardinality PII attributes	Emails, account numbers as attributes	Hash or omit; keep ids off grouping attributes
Unencrypted OTLP in transit	Traces sniffable on the wire	TLS on OTLP (`tls.insecure: false`); mTLS to the gateway
Backend auth tokens in env	Exporter API keys exposed	`OTEL_EXPORTER_OTLP_HEADERS` from a secret, not plaintext
Over-broad trace access	Anyone reads everything	RBAC on the backend; redact before storage as defence-in-depth

Two operational rules. First, redaction belongs in the Collector as a last line of defence — even if an app accidentally captures something sensitive, an attributes/redaction processor can drop it before it reaches storage, fleet-wide, without an app redeploy. Second, the OTLP path should be encrypted: terminate TLS at the Collector and, in zero-trust environments, use mTLS between the agent and a gateway Collector so telemetry can’t be sniffed or spoofed. The sanitizer being on by default is doing real security work — turning it off to “see the values” is a frequent, regrettable mistake.
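
As a rough picture of what that last line of defence does, the sketch below drops attributes whose final key segment is on a deny-list. It is plain Java for illustration and not Collector configuration; the class name and the deny-list contents are assumptions, and in practice this logic lives in the Collector's attributes or redaction processor rather than in application code.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative deny-list redaction over span attributes. */
final class RedactionSketch {
    // Hypothetical deny-list, matched against the last dot-separated key segment.
    private static final Set<String> BLOCKED = Set.of("authorization", "cookie", "set-cookie");

    private RedactionSketch() {}

    static Map<String, String> redact(Map<String, String> attributes) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String key = entry.getKey().toLowerCase(Locale.ROOT);
            String lastSegment = key.substring(key.lastIndexOf('.') + 1);
            if (BLOCKED.contains(lastSegment)) {
                continue;   // drop the attribute entirely rather than masking it
            }
            kept.put(entry.getKey(), entry.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("http.request.method", "GET");
        attrs.put("http.request.header.authorization", "Bearer abc");
        System.out.println(redact(attrs));
    }
}
```

A deny-list only catches what you thought to list, which is why it complements, rather than replaces, not capturing sensitive headers in the first place.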

Cost & sizing

Tracing cost is driven almost entirely by span volume (count) and secondarily by per-span size (attribute cardinality/length). The agent and Collector themselves are cheap; the backend storage and the SaaS per-span/per-GB pricing are where money goes. The levers:

Cost driver	What inflates it	How to control it	Rough effect
Trace/span volume	Full sampling at high RPS; chatty modules	Head ratio + Collector tail sampling	The dominant lever — 60–90% reductions typical
Per-span size	Many/long attributes, captured headers	Keep attributes lean; don’t capture bodies	Linear in attribute bytes
Attribute cardinality	Concrete URLs, user ids as grouping keys	Route templates; ids on few spans	Explodes index/storage cost
Baggage	Many keys on every request	A few small keys only	Network + per-span bytes
Retention	Long trace retention windows	Shorter retention for sampled traffic	Linear in days kept
Metric series	High-cardinality metric labels	Bound label cardinality	Separate from trace cost

Resource footprint of the moving parts, for sizing. The agent adds modest overhead to the JVM: typically low-single-digit % CPU and tens of MB of memory for buffering. The batch span processor queue (OTEL_BSP_MAX_QUEUE_SIZE, default 2048 spans) is the main memory knob; raise it for bursty high-throughput services, and watch for dropped spans, because a full queue discards new spans rather than blocking the application.

The Collector sizes with throughput. A gateway tier doing tail sampling has to buffer in-flight traces (all spans of a trace until the decision window closes), which is why memory_limiter goes first in the processor chain and why tail-sampling Collectors are sized by concurrent traces × spans/trace × span size, not just request rate.

There is no per-span fee for the OpenTelemetry software itself, which is open source; you pay for the backend (Tempo/Jaeger you run, or a SaaS that bills per span/GB). The cheapest large saving is almost always moving from 100% sampling to a generous head ratio plus error-biased tail sampling: you keep the traces that matter and drop the ones nobody ever opens.
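The sizing rule for a tail-sampling tier (concurrent traces × spans/trace × span size) can be turned into a quick back-of-envelope calculation. The sketch below is illustrative only: the class and method names are hypothetical, every input is a made-up number, and it assumes concurrent traces can be approximated as trace arrival rate times the decision window. It estimates raw span bytes, not the Collector's real heap usage, so treat the result as a floor and leave headroom.

```java
/** Back-of-envelope buffer estimate for a tail-sampling Collector tier (hypothetical helper). */
final class TailSamplingSizing {

    /** Traces held at once, approximated as arrival rate x seconds each waits for a decision. */
    static long concurrentTraces(long tracesPerSecond, long decisionWindowSeconds) {
        return tracesPerSecond * decisionWindowSeconds;
    }

    /** Buffered bytes = concurrent traces x spans per trace x bytes per span. */
    static long bufferedBytes(long concurrentTraces, long spansPerTrace, long bytesPerSpan) {
        return concurrentTraces * spansPerTrace * bytesPerSpan;
    }

    public static void main(String[] args) {
        // Assumed inputs; measure your own traffic before sizing anything.
        long inFlight = concurrentTraces(2_000, 10);      // 20,000 traces in flight
        long bytes = bufferedBytes(inFlight, 25, 1_024);  // 512,000,000 bytes
        System.out.printf("in-flight traces: %d, span buffer: ~%d MiB%n",
                inFlight, bytes / (1024 * 1024));
    }
}
```

With these assumed inputs the span buffer alone is roughly 488 MiB. Doubling the decision window doubles it, which is why request rate on its own is a poor sizing input.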

Interview & exam questions

Q1. How does the OpenTelemetry Java agent instrument code without changes? It attaches as a -javaagent and transforms bytecode at class load via the JVM instrumentation API, weaving span creation into known library classes (Servlet, JDBC, Kafka, gRPC, …). The trigger is the library’s presence on the classpath, not an opt-in. Your code is untouched.

Q2. What’s the difference between the OpenTelemetry API and SDK, and why does it matter under the agent? The API is interfaces (Tracer, Span, Context); the SDK is the implementation that builds, samples and exports. Under the agent, the agent is the SDK. Custom-span code depends on the API only (compileOnly); shipping the SDK too creates two providers and duplicate exports.

Q3. Walk through W3C traceparent propagation across two services. On an outbound call the agent injects traceparent (version-traceid-parentid-flags) and baggage. The downstream agent extracts them, sets the context, and opens its SERVER span parented to the upstream span id — same trace id across both. Both defaulting to W3C means it works with zero config.
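To make the header layout in Q3 concrete, here is a minimal sketch of parsing a version-00 traceparent and building the value a service would send downstream. It is not the agent's implementation: the TraceParent class and its method names are hypothetical, it accepts only the strict version-00 shape, and it skips the spec's extra validity checks (such as rejecting all-zero ids). The sample header is the example value from the W3C Trace Context specification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal parser for a version-00 W3C traceparent header (illustrative only). */
final class TraceParent {
    // version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
    private static final Pattern FORMAT =
            Pattern.compile("([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String version;
    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String version, String traceId, String parentId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        // The lowest bit of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), m.group(3), sampled);
    }

    /** Outbound value: same trace id, this service's own span id becomes the parent id. */
    String forChild(String ownSpanId) {
        return version + "-" + traceId + "-" + ownSpanId + "-" + (sampled ? "01" : "00");
    }

    public static void main(String[] args) {
        TraceParent in = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(in.traceId + " sampled=" + in.sampled);
        System.out.println(in.forChild("b7ad6b7169203331"));
    }
}
```

The trace id is copied unchanged on every hop while the parent id is replaced with the caller's span id, which is what lets the backend stitch spans from separate services into one tree.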

Q4. Why do async operations lose trace context, and how do you fix it? Context lives in a thread-local. When work moves to another thread (executor, future, raw thread), the new thread has no current context, so the child span orphans. Fix by wrapping the executor with Context.taskWrapping, or capturing Context.current() and makeCurrent()'ing it inside the task.
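The mechanism behind Q4 can be reproduced with nothing but the JDK. The sketch below uses a plain ThreadLocal<String> as a stand-in for the active trace context; it is not the OpenTelemetry Context class, and wrap is a hypothetical helper that only imitates the capture-then-restore pattern that task wrapping applies. It shows the value vanishing on a pool thread and surviving once the task is wrapped.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Demonstrates thread-local context loss across an executor (stand-in, not the OTel API). */
final class ContextLossDemo {
    // Stand-in for "the current trace context"; one value per thread.
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Captures the caller's value now and restores it around the task on the worker thread. */
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = CURRENT.get();
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);
            try {
                return task.call();
            } finally {
                CURRENT.set(previous); // do not leak context into the next pooled task
            }
        };
    }

    /** Returns {value seen by an unwrapped task, value seen by a wrapped task}. */
    static String[] run() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CURRENT.set("trace-abc");
            Callable<String> readContext = CURRENT::get;
            String lost = pool.submit(readContext).get();       // worker thread has no value
            String kept = pool.submit(wrap(readContext)).get(); // captured value is restored
            return new String[] {lost, kept};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            CURRENT.remove();
        }
    }

    public static void main(String[] args) {
        String[] seen = run();
        System.out.println("unwrapped task saw: " + seen[0]);
        System.out.println("wrapped task saw:   " + seen[1]);
    }
}
```

The restore in the finally block matters as much as the capture: pool threads are reused, so a context left behind would attach the next unrelated task to the wrong trace.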

Q5. What is baggage, and how does it differ from a span attribute? Baggage is key-value context that travels across every hop in the baggage header. It is not automatically a span attribute and not queryable until you copy it onto a span; every key costs header bytes on every request. Attributes live on one span and are queryable.

Q6. Compare head and tail sampling. Where does each happen? Head sampling decides at trace start (at the agent), blind to the outcome: it is cheap, but it drops error traces at the same rate as routine ones. Tail sampling decides after the trace is assembled (in the Collector), so it can keep all errors/slow traces and sample the rest. Production: generous head ratio at the agent, real decision in the Collector.

Q7. Why does parentbased_traceidratio keep traces whole where traceidratio doesn’t? parentbased_* follows the upstream sampling decision for non-root spans and applies the ratio only to root spans. So a child of a sampled parent is always kept — you never sample the parent in and a child out, which would leave dangling spans.
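The decision order described in Q7 can be sketched in a few lines. This is a simplified illustration rather than the SDK's sampler: the class and parameter names are hypothetical, and deriving the ratio decision from the low 64 bits of the trace id is an assumption made to keep the example deterministic.

```java
/** Simplified parent-based ratio decision (illustrative; not the SDK implementation). */
final class ParentBasedRatioSketch {

    /**
     * @param parentSampled null for a root span, otherwise the upstream sampled flag
     * @param traceIdLow    a number derived from the trace id (assumed: its low 64 bits)
     * @param ratio         fraction of root traces to keep, from 0.0 to 1.0
     */
    static boolean shouldSample(Boolean parentSampled, long traceIdLow, double ratio) {
        if (parentSampled != null) {
            return parentSampled; // non-root span: follow the upstream decision, ignore the ratio
        }
        return ratioDecision(traceIdLow, ratio); // root span: apply the ratio
    }

    /** Deterministic in the trace id: the same id always gives the same answer for a ratio. */
    static boolean ratioDecision(long traceIdLow, double ratio) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        long threshold = (long) (ratio * Long.MAX_VALUE);
        return (traceIdLow & Long.MAX_VALUE) < threshold; // mask to a non-negative value
    }

    public static void main(String[] args) {
        System.out.println(shouldSample(true, Long.MAX_VALUE, 0.0)); // kept: parent was sampled
        System.out.println(shouldSample(false, 0L, 1.0));            // dropped: parent was not
        System.out.println(shouldSample(null, 0L, 0.1));             // root: ratio decides
    }
}
```

Because a non-root span never consults its own ratio, two services configured with different ratios still agree on every trace that crosses both of them.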

Q8. A service shows up as unknown_service:java. What happened and how do you prevent it? Neither OTEL_SERVICE_NAME nor a service.name entry in OTEL_RESOURCE_ATTRIBUTES was set, so the agent fell back to the default. Prevent it by sourcing the name from the deployment (Helm release name, Downward API) so a nameless deploy is impossible.

Q9. You configured grpc but data isn’t arriving and the app is healthy. First check? Protocol/port mismatch: gRPC is 4317, HTTP is 4318. Confirm the agent’s protocol matches the Collector receiver and the port. Enable OTEL_JAVAAGENT_DEBUG=true to see the exporter target. The failure is silent because export errors don’t crash the app.

Q10. How do you add a custom span that means something to the business, safely? Depend on opentelemetry-api as compileOnly, get a Tracer from GlobalOpenTelemetry, startSpan(), makeCurrent() in a try, set attributes/events, set ERROR status + recordException on failure, and end() in finally. Without an agent the tracer is a no-op, so it’s test-safe.

Q11. Why keep the database statement sanitizer on? By default db.query.text has literals replaced with ?, so it carries the query shape, not customer data. That keeps PII out of the trace store and bounds attribute cardinality so “group by query” stays meaningful. Disabling it leaks data and explodes cardinality.
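To see what "query shape, not customer data" means in Q11, the following sketch replaces literals with ? using two regular expressions. It is a deliberately naive stand-in: the agent's real sanitizer is more thorough than a pair of regexes, and the class name and patterns here are assumptions chosen only to show the effect on the attribute value.

```java
import java.util.regex.Pattern;

/** Naive SQL literal masking, for illustration only (not the agent's sanitizer). */
final class SqlShapeSketch {
    // Single-quoted strings, allowing '' as an escaped quote inside the literal.
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
    // Standalone integers or decimals; digits inside identifiers such as t1 are left alone.
    private static final Pattern NUMBER_LITERAL = Pattern.compile("\\b\\d+(?:\\.\\d+)?\\b");

    static String sanitize(String sql) {
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }

    public static void main(String[] args) {
        System.out.println(sanitize(
                "SELECT * FROM users WHERE email = 'a@b.com' AND age > 30"));
        System.out.println(sanitize("SELECT name FROM t1 WHERE id = 42"));
    }
}
```

Every lookup by a different email or id collapses to the same sanitized string, which is what keeps "group by query" bounded and keeps the literal values out of the trace store.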

Q12. What does the Collector add that the agent can’t do alone? A central policy/fan-out layer: tail sampling (outcome-aware), batching, redaction, resource enrichment, and routing to multiple backends — all changeable without redeploying services. The agent produces; the Collector decides policy and routes.

These map to OpenTelemetry Certified Associate (OTCA) topics (signals, context propagation, the Collector, sampling) and to the observability sections of cloud and SRE certifications. The propagation and double-export questions are the ones that separate people who’ve run this from people who’ve only read about it.

Quick check

Which OTLP port does the agent use by default, and for which protocol?
You depend on opentelemetry-sdk and run the agent. What symptom appears and why?
A span created inside CompletableFuture.runAsync has no parent. What’s the fix?
Name two things that are true of baggage but false of span attributes.
Where should the real (error-aware) sampling decision be made, and why not at the agent?

Answers

4318, for HTTP/protobuf: agent 2.x defaults to the http/protobuf protocol (1.x defaulted to gRPC on 4317). If you switch the protocol to grpc, the port becomes 4317. Match port to protocol or export fails silently.
Doubled spans / two providers (double export): the agent installs the SDK at runtime, so a second SDK on the classpath means two OpenTelemetry implementations exporting independently. Fix: make the API compileOnly, drop the SDK.
Capture Context.current() before the async stage and makeCurrent() it inside the lambda (try-with-resources), or wrap the executor with Context.taskWrapping. The async thread had no current context.
Baggage travels across the network (in the baggage header) and is not queryable in the backend until copied onto a span; span attributes don’t travel across hops and are queryable on their span. (Also: baggage costs header bytes on every request.)
In the Collector, as tail sampling, because the agent decides at trace start and is blind to the outcome — head sampling at 10% drops 90% of errors too. The Collector sees the whole trace and can keep all errors/slow traces.

Glossary

Term	Definition
Java agent	A `-javaagent` JAR that instruments application bytecode at class load via the JVM instrumentation API.
Auto-instrumentation	Span/metric creation added by the agent for known libraries, with no application code.
OTLP	OpenTelemetry Protocol — the wire format for exporting telemetry (gRPC on 4317, HTTP on 4318).
Span	A single timed operation with a name, kind, attributes, events, status, and a parent.
Span kind	SERVER, CLIENT, PRODUCER, CONSUMER, or INTERNAL — the role the span played.
Trace	A tree of spans sharing one trace id, representing one request’s path.
Context	The currently active span (and baggage), carried on the executing thread.
Propagator	The component that serializes/deserializes context to/from headers (W3C, B3, Jaeger, …).
`traceparent`	The W3C header carrying version, trace id, parent span id, and sampled flag.
Baggage	Key-value context that travels with the trace across hops in the `baggage` header.
Resource	Attributes attached to every span/metric identifying the emitting service/host/pod.
API artifact	`opentelemetry-api` — interfaces only; the dependency you use for custom spans under the agent.
SDK	The implementation that builds, samples and exports — supplied by the agent at runtime.
Sampler	Logic deciding whether to keep a trace; head (at the agent) or tail (in the Collector).
Head sampling	A keep/drop decision at trace start, blind to the eventual outcome.
Tail sampling	A keep/drop decision after the trace is assembled, able to keep errors/slow traces.
Collector	A separate process (receivers → processors → exporters) for policy, batching and routing.
Semantic conventions	Standard attribute/span names so telemetry is consistent across languages.
`@WithSpan`	An annotation (instrumentation-annotations artifact) that wraps a method in a span.
Statement sanitizer	The default that replaces SQL literals with `?` in `db.query.text` to avoid PII/cardinality.

Next steps

OpenTelemetry Collector pipelines for production — build the receiver→processor→exporter layer that consumes everything this agent emits.
OpenTelemetry Collector tail-based sampling and load balancing — where the real, error-aware sampling decision lives for your head-sampled traces.
Distributed tracing with Tempo, Jaeger, exemplars and correlation — store, query and correlate the traces you’ve produced.
OpenTelemetry metrics, exemplars and Prometheus trace correlation — join the JVM/RED metrics from the same agent to your traces via exemplars.
eBPF zero-code instrumentation with Grafana Beyla and OTel — the kernel-level alternative when you can’t attach a JVM agent.