
OpenTelemetry for Java Services: Auto-Instrumentation, Context Propagation, and Custom Spans

The fastest way to get a Java service producing traces is to add nothing to your code. The OpenTelemetry Java agent is a -javaagent JAR that hooks bytecode at class load and instruments the libraries you already use — Servlet, JDBC, Kafka, gRPC, and roughly a hundred others — then exports spans and metrics over OTLP. No SDK wiring, no recompile. The work that remains is configuration, deliberate propagation across the async boundaries the agent can’t see, and a handful of manual spans where business meaning lives. This guide walks through all of it, with config that actually works against a current agent release.

1. Attach the agent and point it at a collector

Download the agent JAR and run it ahead of your application. Everything is driven by environment variables (or matching -D system properties), so the same image works in every environment.

# Pin a version. Do not float on latest in production.
# Download to the same path that -javaagent points at below; -f fails on HTTP errors.
curl -fsSLo /otel/opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.20.1/opentelemetry-javaagent.jar

java -javaagent:/otel/opentelemetry-javaagent.jar \
     -jar /app/checkout-service.jar

Configure it via env. These are the variables that matter on day one:

export OTEL_SERVICE_NAME=checkout-service
export OTEL_RESOURCE_ATTRIBUTES=service.namespace=payments,deployment.environment=prod,service.version=2.14.0
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

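Two protocol notes that trip people up constantly:

The protocol and the port have to match. OTLP over HTTP (http/protobuf, the default in agent 2.x) conventionally listens on 4318; OTLP over gRPC (grpc) conventionally listens on 4317. Pointing one protocol at the other’s port shows up as export errors in the agent log, not as a startup failure.

With http/protobuf, OTEL_EXPORTER_OTLP_ENDPOINT is a base URL. The agent appends /v1/traces, /v1/metrics, and /v1/logs itself, so do not add a signal path to it.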

System properties are equivalent and win over env when both are set. This is handy when you can’t edit the environment but can edit JAVA_TOOL_OPTIONS:

export JAVA_TOOL_OPTIONS="-javaagent:/otel/opentelemetry-javaagent.jar \
  -Dotel.service.name=checkout-service \
  -Dotel.exporter.otlp.protocol=http/protobuf \
  -Dotel.exporter.otlp.endpoint=http://otel-collector.observability.svc:4318"

The env var OTEL_SERVICE_NAME and the property otel.service.name are the same setting. The naming rule is mechanical: uppercase the property and replace dots and hyphens with underscores. Knowing this means you never have to look up a variable name.
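The naming rule is small enough to write down as code. The sketch below is an illustration only: the class and method names are hypothetical and nothing here is part of the agent. It uppercases a property name and maps dots and hyphens to underscores.

```java
import java.util.Locale;

/** Illustrative helper (hypothetical name); not an agent or SDK class. */
final class OtelEnvNames {
    private OtelEnvNames() {}

    /** Uppercases the property and turns dots and hyphens into underscores. */
    static String toEnvVar(String property) {
        return property.toUpperCase(Locale.ROOT)
                .replace('.', '_')
                .replace('-', '_');
    }

    public static void main(String[] args) {
        // prints OTEL_SERVICE_NAME
        System.out.println(toEnvVar("otel.service.name"));
        // prints OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED
        System.out.println(toEnvVar("otel.instrumentation.common.default-enabled"));
    }
}
```

The hyphen case matters for instrumentation flags, where property names such as default-enabled contain hyphens.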

2. What auto-instrumentation actually captures

The agent ships instrumentation modules that activate only when the matching library is on the classpath. You do not opt in per module; presence is the trigger. The high-value categories:

| Category | Examples | Span kind | Key captured attributes |
|---|---|---|---|
| HTTP server | Servlet, Spring MVC/WebFlux, JAX-RS | SERVER | http.request.method, url.path, http.response.status_code |
| HTTP client | Apache HttpClient, OkHttp, JDK HttpClient | CLIENT | http.request.method, server.address, http.response.status_code |
| Databases | JDBC (any driver), Hibernate, R2DBC | CLIENT | db.system.name, db.namespace, db.query.text |
| Messaging | Kafka, RabbitMQ, JMS, AWS SQS | PRODUCER / CONSUMER | messaging.system, messaging.destination.name |
| RPC | gRPC | CLIENT / SERVER | rpc.system, rpc.service, rpc.method |

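For JDBC the agent wraps the DataSource/Connection and produces a CLIENT span per statement. By default the statement text is sanitized: literals are stripped and replaced with ? so the attribute carries the statement shape, not customer data. The attribute is db.query.text under the stable database conventions; agent 2.x releases that have not been opted in (OTEL_SEMCONV_STABILITY_OPT_IN=database) report it as db.statement instead, so check which name your backend actually receives. That sanitization is what makes “group by query” in your backend meaningful and keeps PII out of spans. Leave it on.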
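To make "statement shape" concrete, here is a minimal sketch of literal stripping. It is not the agent's sanitizer, which is a proper SQL lexer; this version uses two regular expressions and the hypothetical class name SqlShapeSketch, and it only handles single-quoted strings and plain numeric literals.

```java
import java.util.regex.Pattern;

/** Toy illustration of query sanitization; the agent's real sanitizer is more thorough. */
final class SqlShapeSketch {
    // Single-quoted string literals, treating '' as an escaped quote.
    private static final Pattern STRING_LITERAL = Pattern.compile("'(?:[^']|'')*'");
    // Standalone integer or decimal literals, not digits inside identifiers like col1.
    private static final Pattern NUMBER_LITERAL =
            Pattern.compile("(?<![\\w.])\\d+(?:\\.\\d+)?(?![\\w.])");

    private SqlShapeSketch() {}

    /** Replaces string and numeric literals with ? and leaves the rest untouched. */
    static String shapeOf(String sql) {
        String withoutStrings = STRING_LITERAL.matcher(sql).replaceAll("?");
        return NUMBER_LITERAL.matcher(withoutStrings).replaceAll("?");
    }

    public static void main(String[] args) {
        // prints SELECT * FROM orders WHERE email = ? AND total > ?
        System.out.println(shapeOf(
                "SELECT * FROM orders WHERE email = 'a@b.com' AND total > 100"));
    }
}
```

Two queries that differ only in their literals collapse to the same string, which is what lets a backend group them.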

HTTP spans capture a curated header set, not every header. To record specific request headers as span attributes (for example a tenant header you route on), opt in explicitly:

export OTEL_INSTRUMENTATION_HTTP_SERVER_CAPTURE_REQUEST_HEADERS=x-tenant-id,x-request-source

Each listed header becomes an attribute like http.request.header.x-tenant-id. Be deliberate — capturing authorization or cookie leaks secrets into your trace store.

3. Context propagation: the part you have to think about

Within a single thread the agent threads context automatically. Across a network hop it serializes context into headers, and across an async boundary you may have to help.

Across the wire (W3C, by default)

The agent’s default propagators are W3C tracecontext and baggage. Every outbound HTTP/gRPC client call injects traceparent, plus tracestate when there is vendor state to carry and baggage when baggage is set, and every inbound request extracts them:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate:  vendorA=value,vendorB=value

traceparent carries the version, trace ID, parent span ID, and the sampled flag. Because both sides default to W3C, a Java service calling another Java service is correlated with zero configuration. The moment a non-OTel hop sits in the middle — say a legacy edge that speaks B3 — you widen the propagator set so the agent reads and writes both:

export OTEL_PROPAGATORS=tracecontext,baggage,b3multi

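All configured propagators run on both inject and extract. On extract they run in the listed order, each building on the context the previous one returned, so when a request carries both header formats the last propagator that finds a valid context wins. With the order above, B3 headers override traceparent when both are present; list tracecontext after b3multi if W3C must be authoritative.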
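The traceparent layout described above is easy to check by hand. The sketch below parses a version-00 header into its four fields. The class name is hypothetical and this is not the agent's propagator, only an illustration of the wire format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for a version-00 traceparent header (hypothetical class). */
final class TraceparentSketch {
    // version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase.
    private static final Pattern FORMAT = Pattern.compile(
            "^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String version;
    final String traceId;
    final String parentSpanId;
    final boolean sampled;

    private TraceparentSketch(String version, String traceId, String parentSpanId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
    }

    static TraceparentSketch parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        String traceId = m.group(2);
        String parentSpanId = m.group(3);
        // All-zero IDs are invalid per the W3C Trace Context format.
        if (traceId.equals("0".repeat(32)) || parentSpanId.equals("0".repeat(16))) {
            throw new IllegalArgumentException("all-zero id in traceparent: " + header);
        }
        int flags = Integer.parseInt(m.group(4), 16);
        return new TraceparentSketch(m.group(1), traceId, parentSpanId, (flags & 0x01) != 0);
    }

    public static void main(String[] args) {
        TraceparentSketch tp =
                parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println(tp.traceId + " sampled=" + tp.sampled);
    }
}
```

The sampled bit is the lowest bit of the flags byte, which is why 01 means "sampled" and 00 means "not sampled".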

Baggage is not attributes

Baggage is key-value context that rides alongside the trace across every hop. It is not automatically copied onto spans, and it is not free — every key inflates the baggage header on every request. Use it for a small number of cross-cutting values (tenant, request origin) you need deep in the call graph, and set it through the API:

import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.context.Scope;

try (Scope scope = Baggage.current().toBuilder()
        .put("tenant.id", tenantId)
        .build().makeCurrent()) {
    // Downstream HTTP/gRPC calls now carry baggage=tenant.id=...
    paymentClient.authorize(request);
}
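To see why baggage "is not free", it helps to look at what goes on the wire. The following sketch builds a W3C-style baggage header value from a map and reports its size. It is an approximation under stated assumptions: the class name and keys are hypothetical, keys are assumed to be plain tokens, and only values are percent-encoded.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Rough model of the baggage header value; not the SDK's propagator. */
final class BaggageHeaderSketch {
    private BaggageHeaderSketch() {}

    /** Joins entries as key=value pairs separated by commas, percent-encoding values. */
    static String encode(Map<String, String> entries) {
        StringJoiner header = new StringJoiner(",");
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            // URLEncoder writes spaces as '+', so convert them to %20 for header use.
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8)
                    .replace("+", "%20");
            header.add(entry.getKey() + "=" + value);
        }
        return header.toString();
    }

    public static void main(String[] args) {
        Map<String, String> baggage = new LinkedHashMap<>();
        baggage.put("tenant.id", "acme");
        baggage.put("request.origin", "mobile app");
        String header = encode(baggage);
        System.out.println("baggage: " + header);
        System.out.println(header.length() + " extra characters on every downstream request");
    }
}
```

Every entry added here is repeated on each outbound call for the rest of the trace, which is the cost the section warns about.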

Across thread pools and reactive boundaries

This is where traces break in real systems. When you hand work to an ExecutorService, the runnable executes on a thread that has no current context, so the child span attaches to nothing and you get an orphan. The agent instruments common executors, but the robust fix is to wrap the executor so context is captured at submit time and restored at run time:

import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService traced = Context.taskWrapping(Executors.newFixedThreadPool(8));
// Work submitted to `traced` runs under the submitting thread's context.
traced.submit(() -> chargeAsync(order));

For a single manual handoff, capture and re-make current explicitly:

Context captured = Context.current();
CompletableFuture.runAsync(() -> {
    try (Scope scope = captured.makeCurrent()) {
        settle(order);   // spans here parent correctly
    }
});
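The mechanism behind both fixes is the same: read the context on the submitting thread, then install it on the worker for the duration of the task. The sketch below reproduces that with a plain ThreadLocal and no OpenTelemetry dependency, so it can be run anywhere. The names (CURRENT, wrap, demo) are hypothetical stand-ins for Context.current() and the wrapping the library does.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Stand-alone model of context capture and restore; not OpenTelemetry code. */
final class ContextHandoffSketch {
    // Stand-in for the per-thread "current context".
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private ContextHandoffSketch() {}

    /** Captures the caller's context now and restores it around the task later. */
    static Runnable wrap(Runnable task) {
        String captured = CURRENT.get();          // runs on the submitting thread
        return () -> {
            String previous = CURRENT.get();      // runs on the worker thread
            CURRENT.set(captured);
            try {
                task.run();
            } finally {
                CURRENT.set(previous);            // do not leak context into the pool
            }
        };
    }

    /** Returns what an unwrapped task and a wrapped task each see as the current context. */
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CURRENT.set("trace-4bf9");
            String[] seen = new String[2];
            pool.submit(() -> { seen[0] = CURRENT.get(); }).get();        // unwrapped
            pool.submit(wrap(() -> { seen[1] = CURRENT.get(); })).get();  // wrapped
            return seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            CURRENT.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("unwrapped task saw: " + seen[0]);
        System.out.println("wrapped task saw:   " + seen[1]);
    }
}
```

The unwrapped task sees no context at all, which is the orphan-span case; the wrapped one sees the submitter's value.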

4. Manual spans with the API only — no SDK dependency

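You do not pull in the OpenTelemetry SDK to write custom spans under the agent. You depend on the API artifact only (io.opentelemetry:opentelemetry-api) as an ordinary compile-and-runtime dependency, and the agent bridges those API calls to its own implementation at runtime. Do not mark the API compileOnly / provided: the agent does not put the API classes on your application’s classpath, so the first GlobalOpenTelemetry call would fail with NoClassDefFoundError. If you also ship and configure the SDK yourself, you risk a second pipeline and duplicate exports.

// build.gradle.kts: API only; the agent supplies the implementation behind it.
dependencies {
    implementation("io.opentelemetry:opentelemetry-api:1.55.0")
}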

Then create spans that wrap business operations the agent can’t name:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

private static final Tracer tracer =
    GlobalOpenTelemetry.getTracer("checkout-service", "2.14.0");

void applyPromotion(Order order) {
    Span span = tracer.spanBuilder("promotion.apply").startSpan();
    try (Scope scope = span.makeCurrent()) {
        span.setAttribute("promotion.code", order.promoCode());
        span.setAttribute("order.item_count", order.items().size());
        // ... business logic ...
        span.addEvent("discount.calculated");
    } catch (Exception e) {
        span.setStatus(StatusCode.ERROR, "promotion failed");
        span.recordException(e);
        throw e;
    } finally {
        span.end();   // ALWAYS end in finally; makeCurrent() does not end the span
    }
}

Two correctness rules: makeCurrent() only sets the span as the current context for the scope — it does not end the span, so end() belongs in finally. And GlobalOpenTelemetry.getTracer(...) returns a no-op tracer if no agent is attached, so this same code runs harmlessly in a unit test with no exporter.

If you prefer declarative spans, add the opentelemetry-instrumentation-annotations dependency and annotate methods; the agent weaves them:

import io.opentelemetry.instrumentation.annotations.WithSpan;
import io.opentelemetry.instrumentation.annotations.SpanAttribute;

@WithSpan("inventory.reserve")
int reserve(@SpanAttribute("sku") String sku, int qty) { ... }

5. Suppress noise: samplers and per-library control

Out of the box you will trace your own health checks and every internal poll. Three levers cut the noise.

Sampler. parentbased_traceidratio respects an upstream sampling decision and applies your ratio only to traces that start at this service. That keeps traces whole — you never sample the parent in and the child out.

export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.05
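The decision logic can be sketched in a few lines. This is a simplified model, not the SDK's sampler: the class and method names are hypothetical, and the real implementation differs in detail. It shows the two properties that matter: a parent's decision always wins, and root decisions are a deterministic function of the trace ID, so every service using the same ratio agrees.

```java
/** Simplified model of parent-based, trace-ID-ratio head sampling (hypothetical class). */
final class HeadSamplerSketch {
    private HeadSamplerSketch() {}

    /** Deterministic ratio decision derived from the low 64 bits of the trace ID. */
    static boolean ratioDecision(String traceIdHex, double ratio) {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;
        // Take the last 16 hex digits and clear the sign bit to get a non-negative long.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        long bound = (long) (ratio * Long.MAX_VALUE);
        return low < bound;
    }

    /**
     * parentSampled is null for a root span (no incoming context); otherwise it is the
     * sampled flag carried by the parent, which is followed unconditionally.
     */
    static boolean shouldSample(Boolean parentSampled, String traceIdHex, double ratio) {
        if (parentSampled != null) {
            return parentSampled;
        }
        return ratioDecision(traceIdHex, ratio);
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("root @ 0.05:          " + shouldSample(null, traceId, 0.05));
        System.out.println("child of sampled:     " + shouldSample(true, traceId, 0.05));
        System.out.println("child of not sampled: " + shouldSample(false, traceId, 1.0));
    }
}
```

Note the last case: a child of an unsampled parent stays unsampled even at ratio 1.0, which is what keeps traces whole.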

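If you want the sampling policy, including rate-limiting strategies that hold up under bursts, controlled centrally instead of baked into each pod, the agent can poll a Jaeger-compatible remote sampling endpoint: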

export OTEL_TRACES_SAMPLER=parentbased_jaeger_remote
export OTEL_TRACES_SAMPLER_ARG=endpoint=http://otel-collector:14250

Most teams should do real sampling decisions (tail-based, error-biased) in the Collector and keep the agent on a generous head ratio. The agent samples blind to outcome; only the Collector sees the whole trace.

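Drop specific endpoints. Health and readiness probes are pure noise. The agent does not document a plain environment variable that takes a list of paths to skip, so there are two realistic options: a rule-based sampler loaded into the agent (for example the contrib RuleBasedRoutingSampler) that drops matching SERVER spans before they are recorded, or a filter in the Collector that discards them before they reach storage. The Collector route needs no change to the service; remember to add the processor to the traces pipeline:

# Collector config: drop health-probe server spans by path.
processors:
  filter/health:
    traces:
      span:
        - 'attributes["url.path"] == "/api/checkout/health"'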

Disable a whole instrumentation. Every module has a flag otel.instrumentation.<name>.enabled. Turn one off when it is chatty or redundant:

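# Pin the DataSource.getConnection() span off. jdbc-datasource is already disabled by
# default in current agent releases, so this mainly guards against it being enabled elsewhere.
export OTEL_INSTRUMENTATION_JDBC_DATASOURCE_ENABLED=false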
# Or disable everything and opt back in only what you want:
export OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED=false
export OTEL_INSTRUMENTATION_SPRING_WEBMVC_ENABLED=true
export OTEL_INSTRUMENTATION_JDBC_ENABLED=true

The COMMON_DEFAULT_ENABLED=false pattern is the surgical option for a service that should emit only a curated handful of span types.
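The way these flags combine can be modeled as a two-step lookup: a per-module flag wins if present, otherwise the common default applies. The sketch below is a simplified model with hypothetical names; it ignores modules that the agent ships disabled by default, and it is not the agent's actual resolution code.

```java
import java.util.Locale;
import java.util.Map;

/** Simplified model of how per-module flags override the common default. */
final class InstrumentationToggleSketch {
    private InstrumentationToggleSketch() {}

    /** module is the instrumentation name, for example "spring-webmvc" or "jdbc". */
    static boolean isEnabled(Map<String, String> env, String module) {
        String key = "OTEL_INSTRUMENTATION_"
                + module.toUpperCase(Locale.ROOT).replace('-', '_')
                + "_ENABLED";
        String explicit = env.get(key);
        if (explicit != null) {
            return Boolean.parseBoolean(explicit);   // explicit per-module setting wins
        }
        // Otherwise fall back to the common default, which is "true" when unset.
        return Boolean.parseBoolean(
                env.getOrDefault("OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED", "true"));
    }

    public static void main(String[] args) {
        Map<String, String> env = Map.of(
                "OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED", "false",
                "OTEL_INSTRUMENTATION_SPRING_WEBMVC_ENABLED", "true",
                "OTEL_INSTRUMENTATION_JDBC_ENABLED", "true");
        System.out.println("spring-webmvc: " + isEnabled(env, "spring-webmvc"));
        System.out.println("jdbc:          " + isEnabled(env, "jdbc"));
        System.out.println("kafka:         " + isEnabled(env, "kafka"));
    }
}
```

With the allow-list environment from the example above, only the two modules that were opted back in resolve to enabled.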

6. Resource detection: service.name, k8s, and cloud

A span is useless if you can’t tell which deployment emitted it. The resource is the set of attributes attached to every span and metric. OTEL_SERVICE_NAME is the one attribute you must always set — if you don’t, spans land under unknown_service:java and your backend becomes unsearchable.

The agent runs resource detectors automatically and merges what it finds: process and host attributes, the container ID, and cloud metadata where the matching resource provider is enabled. It does not discover Kubernetes pod metadata on its own, so you stitch that in through the Downward API so each pod is self-describing:

env:
  - name: OTEL_SERVICE_NAME
    value: checkout-service
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: >-
      service.namespace=payments,
      deployment.environment=prod,
      k8s.pod.name=$(K8S_POD_NAME),
      k8s.namespace.name=$(K8S_NAMESPACE),
      k8s.node.name=$(K8S_NODE_NAME)

Kubernetes expands $(VAR) references in env values only when the referenced variable is declared earlier in the same env list — which is exactly why the fieldRef entries come first. The deployment.environment attribute is what lets one backend query separate prod from staging without separate stores.
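Once Kubernetes has expanded the references, OTEL_RESOURCE_ATTRIBUTES is just a comma-separated list of key=value pairs. The sketch below shows one way to parse such a string for a quick sanity check in a test or a startup log. It is a simplified reading with hypothetical names: it trims whitespace around keys and values and skips malformed pairs, but does not percent-decode values the way a full implementation would.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified parser for an OTEL_RESOURCE_ATTRIBUTES-style string (hypothetical class). */
final class ResourceAttributesSketch {
    private ResourceAttributesSketch() {}

    static Map<String, String> parse(String raw) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : raw.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue;                         // no key or no '=': skip the entry
            }
            String key = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1).trim();
            if (!key.isEmpty()) {
                attributes.put(key, value);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        // The folded YAML scalar above yields spaces after each comma.
        String raw = "service.namespace=payments, deployment.environment=prod, "
                + "k8s.pod.name=checkout-7d9f, k8s.namespace.name=payments";
        parse(raw).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A check like this catches the common failure where a $(VAR) reference was not expanded and the literal text ends up as the attribute value.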

7. JVM and runtime metrics from the same agent

The agent is not trace-only. It emits JVM runtime metrics over OTLP on the same pipeline — no Micrometer, no separate scrape. You get heap and non-heap usage, GC duration, thread counts, and class loading as OTLP metrics, exported to the metrics endpoint your OTLP config already points at.

export OTEL_METRICS_EXPORTER=otlp
export OTEL_METRIC_EXPORT_INTERVAL=30000   # milliseconds

This matters operationally: when a trace shows a 4-second pause, the JVM metrics from the same agent on the same pod tell you whether it was a stop-the-world GC. Because both signals carry the identical resource attributes, they join on service.name and k8s.pod.name in your backend with no extra correlation work. To turn metrics off entirely (some teams keep JVM metrics in Prometheus already), set OTEL_METRICS_EXPORTER=none.
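For a feel of what these numbers are, the same kind of data is readable from the JVM's standard platform MXBeans. The sketch below is only that, a local snapshot with hypothetical names; it does not export anything and is not how the agent collects or names its metrics.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Local snapshot of JVM runtime data via platform MXBeans (illustration only). */
final class JvmSnapshotSketch {
    private JvmSnapshotSketch() {}

    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Sum of accumulated collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();   // -1 means "undefined" for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("gc time (ms):      " + totalGcMillis());
        System.out.println("live threads:      " + liveThreads());
        System.out.println("loaded classes:    " + loadedClasses());
    }
}
```

A jump in accumulated GC time between two readings that brackets a slow trace is the signal the paragraph above describes.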

Verify

Confirm the agent attached and data is flowing before you trust a dashboard.

# 1. The agent logs its version and config at startup. Confirm it loaded.
java -javaagent:/otel/opentelemetry-javaagent.jar -jar app.jar 2>&1 \
  | grep -i "opentelemetry-javaagent"
# Expect: "opentelemetry-javaagent - version: 2.20.1"

# 2. Smoke-test the OTLP HTTP endpoint the agent targets.
#    A reachable collector returns 200 (often with an empty body) for an empty payload.
curl -i -X POST http://otel-collector.observability.svc:4318/v1/traces \
  -H "Content-Type: application/json" -d '{"resourceSpans":[]}'

# 3. Force a request through a SERVER span and confirm it lands in the backend.
curl -s http://localhost:8080/api/checkout/health
# Then query your trace backend for service.name="checkout-service".

# 4. Validate end-to-end propagation: send a traceparent and confirm the
#    backend shows your span as a CHILD of this exact trace ID.
curl -s http://localhost:8080/api/checkout \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
# In the backend, search trace 4bf92f3577b34da6a3ce929d0e0e4736 -> your span attaches.

If step 4 produces an orphan trace instead of a child, your propagators are misconfigured or an async hop is dropping context — go back to sections 3 and 5.

Enterprise scenario

A platform team running ~140 JVM services on EKS standardized on the Java agent via a shared base image. Within a week their trace backend cost spiked and search got slow. Two causes: a Kafka-heavy pipeline of consumers was emitting a CONSUMER span per message at full volume, and several services had unknown_service:java because teams shipped the image without setting OTEL_SERVICE_NAME.

The constraint was that they could not touch 140 codebases or coordinate 140 redeploys on any reasonable timeline. So they solved it entirely through the deployment layer. They mandated the service name in the shared Helm chart, templated from the existing release name, so it was impossible to deploy nameless, and they pinned a generous head sample ratio on the agent while moving the real decision to a Collector gateway doing tail-based, error-biased sampling. The agent stayed dumb; the gateway got smart.

# Injected by the platform Helm chart into every service. No app change.
env:
  - name: OTEL_SERVICE_NAME
    value: {{ .Release.Name | quote }}      # never unknown_service again
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.25"                            # head ratio; gateway makes the real call
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-gateway.observability.svc:4318
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf

The fix was a chart change and a rolling restart — not a code migration. The gateway then kept 100% of error traces and a sample of the rest, cutting stored volume by roughly 70% while improving the signal because no error trace was ever dropped at the source. The lesson the team wrote down: with the agent, configuration is your product surface. Push policy to the deployment and the Collector, and keep the agent boring.
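The gateway's policy, keep every trace that contains an error and a fraction of the rest, can be sketched as a single decision made once all spans of a trace are in hand. This is a toy model under loud assumptions: the class name, the string statuses, and the keepOneIn parameter are all hypothetical, and a real Collector tail-sampling setup is configured, not coded this way.

```java
import java.util.List;

/** Toy model of an error-biased tail-sampling decision (hypothetical class). */
final class TailDecisionSketch {
    private TailDecisionSketch() {}

    /**
     * spanStatuses holds the status of every span in one completed trace.
     * Error traces are always kept; the rest are kept at roughly 1 in keepOneIn,
     * chosen deterministically from the trace ID so retries give the same answer.
     */
    static boolean keep(List<String> spanStatuses, String traceIdHex, int keepOneIn) {
        if (spanStatuses.contains("ERROR")) {
            return true;
        }
        if (keepOneIn <= 1) {
            return true;
        }
        return Math.floorMod(traceIdHex.hashCode(), keepOneIn) == 0;
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("error trace kept: " + keep(List.of("OK", "ERROR", "OK"), traceId, 1000));
        System.out.println("ok trace kept:    " + keep(List.of("OK", "OK"), traceId, 4));
    }
}
```

The key difference from head sampling is the input: the decision sees the outcome of the whole trace, which an agent deciding at the first span cannot.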

