A payments platform runs forty microservices across three Amazon EKS clusters, and the on-call engineer’s nightmare is the 2 a.m. page that says “checkout latency is up” with no trace to follow — the team has CloudWatch metrics, scattered application logs in three formats, and no single view that connects a slow POST /charge to the downstream ledger pod that is actually GC-thrashing. The mandate from the new VP of Engineering is blunt: one observability backend, full-stack, with distributed traces that cross service boundaries and a live topology map the SRE team can point at during an incident. This guide deploys exactly that — the Dynatrace Operator managing OneAgent for host, process, and deep-code monitoring, alongside an OpenTelemetry Collector that owns vendor-neutral trace/metric/log pipelines — onto EKS, so every span, metric, and log line lands in one Dynatrace tenant and feeds the Smartscape dependency model.
Prerequisites
- An EKS cluster on Kubernetes 1.28+ with at least three worker nodes (the OneAgent DaemonSet runs one pod per node). Confirm with kubectl version and kubectl get nodes.
- kubectl, helm v3.12+, eksctl, and the AWS CLI v2 installed and authenticated against the target account.
- Cluster-admin RBAC on the EKS cluster (the Operator installs CRDs and a privileged DaemonSet).
- A Dynatrace SaaS tenant (e.g. https://abc12345.live.dynatrace.com) with permission to create access tokens.
- HashiCorp Vault reachable from the cluster, used here to hold the Dynatrace API and data-ingest tokens so they never live as plaintext in Git or a bare Kubernetes Secret.
- An OIDC identity provider (Okta federated to Microsoft Entra ID) already wired to Dynatrace for SSO, so engineers log into the Dynatrace UI with corporate credentials and SCIM-provisioned groups, not local Dynatrace users.
- git and access to the GitOps repo that Argo CD reconciles, plus a GitHub Actions runner with OIDC trust to AWS (no long-lived keys).
Target topology
The data plane has two complementary ingest paths into the same Dynatrace tenant. OneAgent, deployed by the Operator as a node-level DaemonSet plus an ActiveGate StatefulSet, auto-instruments every process on every node (JVMs, Node.js, Go binaries, the kubelet) and streams host metrics, deep-code traces (PurePath), and process topology that builds Smartscape with zero code changes. In parallel, the OpenTelemetry Collector (deployed as both a per-node DaemonSet for logs/host metrics and a gateway Deployment for trace aggregation) receives OTLP from services that emit their own spans and metrics via OTel SDKs, batches and enriches them, then exports over OTLP to the same tenant through the ActiveGate. Application pods talk OTLP to the Collector’s ClusterIP Service; the Collector and OneAgent both egress to Dynatrace through the in-cluster ActiveGate, so only one component holds an outbound path to the tenant. The Collector still presents the data-ingest token, but only on the in-cluster hop to the ActiveGate. Vault injects tokens at pod start; Argo CD reconciles the whole stack from Git.
1. Create the Dynatrace access tokens
Dynatrace separates the operator/API token (used by the Operator to query the deployment API and pull OneAgent images) from the data-ingest token (used to push metrics/traces/logs). Create both with least-privilege scopes. You can do this in the UI under Access Tokens, or via the API:
DT_TENANT="https://abc12345.live.dynatrace.com"
DT_PAT="dt0c01.SEED.BOOTSTRAP_PAT_WITH_TOKEN_SCOPES" # a one-time PAT to mint the others
# API/operator token: deployment + cluster ACL scopes
curl -sX POST "$DT_TENANT/api/v2/apiTokens" \
-H "Authorization: Api-Token $DT_PAT" -H "Content-Type: application/json" \
-d '{"name":"eks-operator","scopes":[
"activeGateTokenManagement.create","entities.read","settings.read",
"settings.write","DataExport","InstallerDownload"]}'
# Data-ingest token: metrics, logs, OpenTelemetry traces
curl -sX POST "$DT_TENANT/api/v2/apiTokens" \
-H "Authorization: Api-Token $DT_PAT" -H "Content-Type: application/json" \
-d '{"name":"eks-data-ingest","scopes":[
"metrics.ingest","logs.ingest","openTelemetryTrace.ingest","events.ingest"]}'
2. Store the tokens in HashiCorp Vault
Do not paste tokens into a manifest. Write them into Vault and let the Vault Agent (or the Vault Secrets Operator) materialize them as a Kubernetes Secret at deploy time, so the token rotates centrally and never appears in Git or argocd history.
vault kv put secret/dynatrace/eks-prod \
apiToken="dt0c01.OPERATOR_TOKEN_FROM_STEP_1" \
dataIngestToken="dt0c01.DATA_INGEST_TOKEN_FROM_STEP_1"
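To avoid copying token values by hand between step 1 and the vault kv put above, the API response can be captured in a shell variable instead. The following is a minimal sketch, assuming the token-creation response is a JSON object with a "token" field; the helper names extract_token and require_dt_token are hypothetical, and sed is used only so the sketch does not depend on jq.

```shell
# Pull the "token" field out of a JSON body read from stdin.
# Assumes the field name "token" appears exactly once in the response.
extract_token() {
  tr -d '\n' | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Fail early if the value does not carry the dt0c01 prefix used in this guide.
require_dt_token() {
  case "$1" in
    dt0c01.*.*) return 0 ;;
    *) echo "value does not look like a Dynatrace token" >&2; return 1 ;;
  esac
}

# Example wiring (not executed here):
#   OPERATOR_TOKEN="$(curl -sX POST "$DT_TENANT/api/v2/apiTokens" ... | extract_token)"
#   require_dt_token "$OPERATOR_TOKEN" && \
#     vault kv put secret/dynatrace/eks-prod apiToken="$OPERATOR_TOKEN" ...
```

Keeping the value in a variable means it never lands in a file or in the terminal scrollback, though it is still visible in the process arguments of vault kv put while that command runs.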
Bind a Kubernetes auth role so only the dynatrace namespace service accounts can read it:
vault write auth/kubernetes/role/dynatrace \
bound_service_account_names=dynatrace-operator,dynakube-oneagent \
bound_service_account_namespaces=dynatrace \
policies=dynatrace-read ttl=1h
The Vault Secrets Operator then syncs secret/dynatrace/eks-prod into a Secret named dynakube in the dynatrace namespace — the exact name the DynaKube custom resource expects in step 4.
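The sync described above could be declared with a VaultStaticSecret resource. This is a sketch under stated assumptions rather than the guide's confirmed manifest: it assumes the secret mount is KV version 2, that a VaultAuth resource named dynatrace-auth exists (a hypothetical name), and that a 60-second refresh is acceptable. The shell function render_vault_static_secret is likewise hypothetical; it only prints the manifest so it can be committed to the GitOps repo.

```shell
# Print a VaultStaticSecret manifest that syncs secret/dynatrace/eks-prod
# into a Kubernetes Secret named "dynakube" in the given namespace.
# "dynatrace-auth" is a placeholder for your VaultAuth resource.
render_vault_static_secret() {
  ns="$1"
  cat <<EOF
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: dynakube
  namespace: ${ns}
spec:
  vaultAuthRef: dynatrace-auth
  mount: secret
  type: kv-v2
  path: dynatrace/eks-prod
  refreshAfter: 60s
  destination:
    name: dynakube
    create: true
EOF
}

# Example: render_vault_static_secret dynatrace > vault-static-secret.yaml
```

Because the namespace is a parameter, the same function can render a second copy for any other namespace that needs to read the tokens.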
3. Install the Dynatrace Operator with Helm
Add the chart repo and install into a dedicated dynatrace namespace. The Operator brings the DynaKube and EdgeConnect CRDs and a webhook that injects OneAgent into application pods.
helm repo add dynatrace https://raw.githubusercontent.com/Dynatrace/dynatrace-operator/main/config/helm/repos/stable
helm repo update
kubectl create namespace dynatrace
helm upgrade --install dynatrace-operator dynatrace/dynatrace-operator \
--namespace dynatrace \
--set "installCRD=true" \
--set "csidriver.enabled=true" \
--atomic
csidriver.enabled=true installs the CSI driver that lets OneAgent run in cloudNativeFullStack mode with a shared read-only code module per node, instead of a separate copy per pod — this is the recommended mode on EKS for memory efficiency. Confirm the Operator is healthy:
kubectl -n dynatrace rollout status deploy/dynatrace-operator
kubectl -n dynatrace get pods # expect operator, webhook, and csi-driver pods Running
4. Apply the DynaKube custom resource
The DynaKube CR is the single declarative object that tells the Operator what to deploy: the tenant URL, the Secret holding the tokens, the OneAgent mode, and the ActiveGate role set. Save this as dynakube.yaml in the GitOps repo so Argo CD owns it.
apiVersion: dynatrace.com/v1beta3
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: https://abc12345.live.dynatrace.com/api
  # references the Secret synced from Vault in step 2
  tokens: dynakube
  oneAgent:
    cloudNativeFullStack:
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
      args:
        - --set-host-group=eks-payments-prod
  activeGate:
    capabilities:
      - routing # in-cluster egress proxy to the tenant
      - kubernetes-monitoring
      - dynatrace-api
    resources:
      requests: { cpu: 500m, memory: 512Mi }
      limits: { cpu: "1", memory: 1.5Gi }
Apply it (or let Argo CD sync it — see step 8):
kubectl apply -f dynakube.yaml
kubectl -n dynatrace get dynakube dynakube -o jsonpath='{.status.phase}' # -> Running
kubectl -n dynatrace get daemonset # oneagent DaemonSet, one pod per node
kubectl -n dynatrace get statefulset # activegate
The --set-host-group=eks-payments-prod flag tags every host so Smartscape and management-zone rules can scope this cluster cleanly. The kubernetes-monitoring ActiveGate capability pulls cluster events, node/pod metrics, and workload topology straight from the Kubernetes API.
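The status query in this step is a one-shot check, and the DynaKube usually needs some time to reach Running. In a script or pipeline it may help to poll instead. The helper below is a generic sketch; the name wait_until and its argument order are assumptions made for illustration, not part of kubectl or the Operator.

```shell
# Poll a command until its output equals the expected value.
# usage: wait_until <expected> <max_tries> <sleep_seconds> <command...>
wait_until() {
  expected="$1"; tries="$2"; pause="$3"
  shift 3
  i=0
  actual=""
  while [ "$i" -lt "$tries" ]; do
    # A failing command is treated as "not ready yet", not as a fatal error.
    actual="$("$@" 2>/dev/null || true)"
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timed out waiting for '$expected' (last value: '$actual')" >&2
  return 1
}

# Example (30 tries, 10 seconds apart):
#   wait_until Running 30 10 \
#     kubectl -n dynatrace get dynakube dynakube -o jsonpath='{.status.phase}'
```

The same helper can wrap any of the other readiness checks in this guide, since it only compares the command's output with a string.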
5. Deploy the OpenTelemetry Collector
OneAgent covers auto-instrumentation; the Collector covers everything you instrument yourself with OTel SDKs and any third-party OTLP source. Install it with the official Helm chart in deployment mode for the trace gateway. Create otel-values.yaml:
mode: deployment
replicaCount: 2
image:
  repository: otel/opentelemetry-collector-contrib
presets:
  kubernetesAttributes:
    enabled: true # stamps k8s.pod.name, k8s.namespace.name, etc.
config:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }
  processors:
    batch:
      send_batch_size: 1000
      timeout: 5s
    memory_limiter:
      check_interval: 2s
      limit_percentage: 80
      spike_limit_percentage: 20
    k8sattributes: {}
  exporters:
    otlphttp/dynatrace:
      # route through the in-cluster ActiveGate, not the public tenant
      endpoint: https://dynakube-activegate.dynatrace.svc.cluster.local:443/e/abc12345/api/v2/otlp
      headers:
        Authorization: "Api-Token ${env:DT_INGEST_TOKEN}"
      tls:
        insecure_skip_verify: true # ActiveGate uses its self-signed internal cert
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlphttp/dynatrace]
      metrics:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlphttp/dynatrace]
      logs:
        receivers: [otlp]
        processors: [memory_limiter, k8sattributes, batch]
        exporters: [otlphttp/dynatrace]
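The DT_INGEST_TOKEN env var is injected from a Vault-synced Secret, so the Collector never carries a hardcoded token. A secretKeyRef can only resolve a Secret in the pod’s own namespace, so the dynakube Secret from step 2 (which lives in dynatrace) is not visible to the Collector. Create the observability namespace and have the Vault Secrets Operator sync the same Vault path into a Secret of the same name there before installing: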
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
--namespace observability --create-namespace \
-f otel-values.yaml \
--set-string "extraEnvs[0].name=DT_INGEST_TOKEN" \
--set-string "extraEnvs[0].valueFrom.secretKeyRef.name=dynakube" \
--set-string "extraEnvs[0].valueFrom.secretKeyRef.key=dataIngestToken" \
--atomic
6. Point application services at the Collector
Instrumented services send OTLP to the Collector’s in-cluster Service. Set the standard OTel environment variables on each workload — here on the checkout deployment:
kubectl -n payments set env deployment/checkout \
OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector-opentelemetry-collector.observability.svc.cluster.local:4317" \
OTEL_EXPORTER_OTLP_PROTOCOL="grpc" \
OTEL_SERVICE_NAME="checkout" \
OTEL_RESOURCE_ATTRIBUTES="service.namespace=payments,deployment.environment=prod"
Services that have no SDK at all are still covered automatically: the OneAgent code module injected by the Operator’s webhook produces PurePath traces for them without any config. The two streams reconcile in Dynatrace because both carry the same k8s.pod.name resource attribute — OneAgent stamps it natively, and the Collector’s k8sattributes processor adds it to SDK spans.
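The Collector hostname in the command above follows from the Helm release name, which is easy to get wrong when the release is renamed. The sketch below mirrors the common Helm "fullname" convention (release name joined with chart name unless the release already contains it); it assumes the chart follows that convention and that no fullnameOverride is set, and it ignores Helm's 63-character truncation. Both function names are hypothetical.

```shell
# Derive the Service name the opentelemetry-collector chart is assumed to create.
collector_service_name() {
  release="$1"; chart="opentelemetry-collector"
  case "$release" in
    *"$chart"*) printf '%s\n' "$release" ;;
    *)          printf '%s-%s\n' "$release" "$chart" ;;
  esac
}

# Build the in-cluster OTLP endpoint; the port defaults to 4317 (gRPC).
collector_endpoint() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' \
    "$(collector_service_name "$1")" "$2" "${3:-4317}"
}

# Example:
#   kubectl -n payments set env deployment/checkout \
#     OTEL_EXPORTER_OTLP_ENDPOINT="$(collector_endpoint otel-collector observability)"
```

If the computed name does not match what kubectl -n observability get svc reports, trust the cluster and adjust the endpoint accordingly.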
7. Add log collection (optional but recommended)
For application logs, run a second Collector instance as a DaemonSet tailing container log files, so stdout/stderr from every pod reaches Dynatrace with full Kubernetes context. Create otel-logs-values.yaml:
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib # filelog and k8sattributes ship in the contrib image
presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: false
  kubernetesAttributes:
    enabled: true
config:
  exporters:
    otlphttp/dynatrace:
      endpoint: https://dynakube-activegate.dynatrace.svc.cluster.local:443/e/abc12345/api/v2/otlp
      headers:
        Authorization: "Api-Token ${env:DT_INGEST_TOKEN}"
      tls:
        insecure_skip_verify: true
  service:
    pipelines:
      logs:
        receivers: [filelog]
        processors: [k8sattributes, batch]
        exporters: [otlphttp/dynatrace]
helm upgrade --install otel-logs open-telemetry/opentelemetry-collector \
--namespace observability \
-f otel-logs-values.yaml \
--set-string "extraEnvs[0].name=DT_INGEST_TOKEN" \
--set-string "extraEnvs[0].valueFrom.secretKeyRef.name=dynakube" \
--set-string "extraEnvs[0].valueFrom.secretKeyRef.key=dataIngestToken" \
--atomic
8. Put it under GitOps with Argo CD
Everything above should be declarative and reconciled, not applied by hand in production. Commit dynakube.yaml and the Helm value files, then define an Argo CD Application that points at the repo. The Operator chart and DynaKube live together so Argo CD enforces drift correction.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: dynatrace-observability
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://github.com/kloudvin/eks-observability.git
    targetRevision: main
    path: clusters/eks-payments-prod/dynatrace
  destination:
    server: https://kubernetes.default.svc
    namespace: dynatrace
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]
The promotion flow is: a pull request changes a value file, GitHub Actions validates it (helm template + kubeconform + a policy check), and on merge Argo CD auto-syncs to the cluster. The Actions runner assumes an AWS role via OIDC, so there are no static AWS credentials in CI. If you prefer Jenkins or Terraform/Ansible for the surrounding cluster lifecycle, the same DynaKube manifest applies unchanged — the Operator is the contract.
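One possible shape for the policy check in that pipeline is a scan that rejects any commit containing a plaintext Dynatrace token. The sketch below is an assumption about what such a check might look like, not the guide's actual CI job: the function name scan_for_tokens is hypothetical, and the pattern assumes the token layout of a dt0c01 prefix, a 24-character public part, and a 64-character secret part.

```shell
# Return non-zero if any file under the directory contains a string
# shaped like a Dynatrace token. Only file names are printed, so the
# token itself does not end up in the CI log.
scan_for_tokens() {
  dir="$1"
  if grep -rEl 'dt0c01\.[A-Z0-9]{24}\.[A-Z0-9]{64}' "$dir"; then
    echo "plaintext Dynatrace token found under $dir" >&2
    return 1
  fi
  return 0
}

# Example CI step: scan_for_tokens clusters/eks-payments-prod
```

Placeholders such as dt0c01.OPERATOR_TOKEN_FROM_STEP_1 do not match the pattern, so documentation and examples in the repo would pass the check.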
Validation
Confirm the full stack is live end to end:
# 1. OneAgent injected and reporting
kubectl -n dynatrace get pods -l app.kubernetes.io/name=oneagent -o wide
kubectl -n dynatrace logs ds/dynakube-oneagent | grep -i "connected to"
# 2. ActiveGate reachable as the egress proxy
kubectl -n dynatrace get svc dynakube-activegate
# 3. Collector pipelines healthy (check the internal metrics endpoint)
kubectl -n observability port-forward deploy/otel-collector-opentelemetry-collector 8888:8888 &
curl -s localhost:8888/metrics | grep otelcol_exporter_sent_spans
# otelcol_exporter_sent_spans{exporter="otlphttp/dynatrace"} > 0 means traces are flowing
Then in the Dynatrace UI (logged in via Okta/Entra SSO): open Kubernetes and confirm the eks-payments-prod cluster with its nodes and namespaces; open Distributed traces and trigger a checkout request, where you should see a PurePath that crosses checkout → ledger; open Smartscape and verify the live service-to-service topology. The acceptance test is a single trace showing both a OneAgent-captured span and an SDK span on the same PurePath.
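The Collector check in the command block above relies on reading the grep output by eye. For an automated smoke test, the counter can be summed instead. This is a minimal sketch: the function name sum_sent_spans is hypothetical, and it assumes the metric lines carry the exporter="otlphttp/dynatrace" label with the sample value as the last field, as in the comment above.

```shell
# Read Prometheus text format on stdin and print the total number of spans
# sent by the otlphttp/dynatrace exporter (0 if no matching sample exists).
sum_sent_spans() {
  awk '/^otelcol_exporter_sent_spans/ && /exporter="otlphttp\/dynatrace"/ { total += $NF }
       END { printf "%d\n", total }'
}

# Example:
#   sent="$(curl -s localhost:8888/metrics | sum_sent_spans)"
#   [ "$sent" -gt 0 ] || echo "no spans exported yet" >&2
```

Matching on the metric name prefix keeps the check tolerant of Collector versions that append a suffix to the counter name.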
Rollback and teardown
Because the stack is declarative, removal is clean and ordered — tear down the data producers before the Operator that owns the CRDs:
# 1. Stop sending new data
helm uninstall otel-logs -n observability
helm uninstall otel-collector -n observability
# 2. Remove the DynaKube CR (Operator deletes OneAgent DaemonSet + ActiveGate)
kubectl delete -f dynakube.yaml
kubectl -n dynatrace wait --for=delete daemonset/dynakube-oneagent --timeout=120s
# 3. Remove the Operator and its CRDs last
helm uninstall dynatrace-operator -n dynatrace
kubectl delete namespace dynatrace observability
If you manage this via Argo CD, disable auto-sync first (argocd app set dynatrace-observability --sync-policy none) or revert the Git commit so self-heal does not immediately re-create what you just deleted. To roll back a bad config rather than remove everything, helm rollback otel-collector or git revert the offending PR and let Argo CD reconcile.
Common pitfalls
- Skipping the CSI driver. Without csidriver.enabled=true, cloudNativeFullStack falls back to copying the code module into every pod, inflating memory and slowing pod start. Install the CSI driver on EKS.
- OTLP exporter bypassing the ActiveGate. Exporting directly to *.live.dynatrace.com works but puts the data-ingest token on every Collector egress and adds a public hop per pod. Route through the in-cluster ActiveGate endpoint as shown: one egress point, one token surface.
- Missing k8sattributes processor. Without it, SDK spans lack k8s.pod.name, so Dynatrace cannot correlate them with OneAgent data and the Smartscape topology looks broken. Always include the kubernetesAttributes preset.
- Webhook race on first install. If application pods start before the Operator webhook is ready, they launch un-instrumented. Roll the affected deployments (kubectl rollout restart) after the Operator is Running.
- TLS verification failures to the ActiveGate. The in-cluster ActiveGate presents a self-signed cert; either trust its CA or set insecure_skip_verify: true on the internal hop (acceptable because it stays inside the cluster network).
- Tolerations omitted. If you want host metrics from control-plane or tainted nodes, the OneAgent tolerations in the DynaKube must match the taints, or those nodes silently go unmonitored.
Security notes
Tokens are the crown jewels here: keep the operator and data-ingest tokens separate and least-scoped (step 1), source them from HashiCorp Vault with a short TTL and Kubernetes-auth binding rather than committing them, and never embed them in Helm values in Git. Human access to the Dynatrace tenant flows through Okta federated to Entra ID with SCIM-provisioned groups mapped to Dynatrace management zones, so an engineer who leaves loses access on de-provisioning, not on a manual cleanup. For runtime threat detection on the same EKS nodes, CrowdStrike Falcon sensors run alongside OneAgent — Falcon watches for malicious process behavior while OneAgent watches performance; they are complementary, not redundant. Pair this with Wiz (and Wiz Code in the pipeline) for cloud-posture and IaC scanning so a misconfigured ActiveGate Service or an over-scoped token is flagged before it ships. Restrict the Collector’s OTLP receiver to in-cluster traffic with a NetworkPolicy so nothing outside the mesh can inject spans.
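The NetworkPolicy mentioned above might look like the following. Treat it as a sketch under several assumptions: the pod labels are the ones the opentelemetry-collector chart is assumed to apply for a release named otel-collector, the policy name is hypothetical, and the cluster must have a NetworkPolicy-capable CNI enabled for the policy to have any effect. The function render_otlp_network_policy only prints the manifest.

```shell
# Print a NetworkPolicy that admits OTLP traffic (4317/4318) to the gateway
# Collector only from pods inside the cluster. Labels and name are assumptions.
render_otlp_network_policy() {
  cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-collector-otlp-in-cluster-only
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-collector
      app.kubernetes.io/instance: otel-collector
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector: {} # any pod in any namespace of this cluster
      ports:
        - { protocol: TCP, port: 4317 }
        - { protocol: TCP, port: 4318 }
EOF
}

# Example: render_otlp_network_policy > collector-networkpolicy.yaml
```

Because the policy lists only the two OTLP ports, other Collector ports become unreachable from other pods once it is applied; add them explicitly if something in the cluster scrapes the Collector. Narrowing the namespaceSelector to labeled application namespaces would tighten it further.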
Cost notes
Dynatrace bills primarily on Host Units (driven by the RAM of each monitored node) and Davis Data Units / ingest volume for metrics, logs, and traces. Three levers keep the bill predictable: set OneAgent host groups (step 4) so you can scope monitoring modes and even disable deep monitoring on low-value batch nodes; use the OTel Collector’s tail_sampling or probabilistic_sampler processor to drop a percentage of high-volume, low-signal traces before they are billed; and apply log-ingest processing rules in the Collector to filter chatty DEBUG lines rather than paying to store them. Because the Collector sits in the path, sampling and filtering are a config change in Git, reviewed and rolled out through Argo CD, rather than a vendor support ticket. Right-size the ActiveGate (the requests/limits in step 4) to the cluster’s egress volume; one or two replicas handle a forty-service cluster comfortably.
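The sampling and filtering levers could be expressed as a small values overlay passed to Helm after otel-values.yaml. The sketch below is illustrative only: the function name render_sampling_overlay, the 25 percent default, and the processor instance name filter/drop-debug are assumptions, and the filter condition deliberately skips records with no parsed severity so that unparsed logs are not dropped by accident.

```shell
# Print a Helm values overlay that samples traces and drops sub-INFO logs.
# usage: render_sampling_overlay [sampling_percentage]
render_sampling_overlay() {
  pct="${1:-25}"
  cat <<EOF
config:
  processors:
    probabilistic_sampler:
      sampling_percentage: ${pct}
    filter/drop-debug:
      error_mode: ignore
      logs:
        log_record:
          - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'
  service:
    pipelines:
      traces:
        processors: [memory_limiter, k8sattributes, probabilistic_sampler, batch]
      logs:
        processors: [memory_limiter, k8sattributes, filter/drop-debug, batch]
EOF
}

# Example:
#   render_sampling_overlay 10 > otel-sampling.yaml
#   helm upgrade --install otel-collector ... -f otel-values.yaml -f otel-sampling.yaml
```

Helm merges maps but replaces lists, so the overlay restates each pipeline's full processor list rather than only the added entry.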