A mid-size SaaS platform team is bleeding money on its logging bill. Every container ships raw JSON straight to a hosted Elasticsearch cluster, the cluster is the single point of failure for all observability, and when a marketing campaign triples traffic the ingest queue backs up, log shippers OOM-kill, and the on-call engineer is blind during the exact incident they need logs for. The mandate from the platform lead is concrete: cut Elasticsearch index volume by routing low-value logs elsewhere, keep a cheap immutable copy of everything in object storage for compliance and replay, send the metrics-shaped log lines to Loki so Grafana dashboards stop hammering Elasticsearch, and make the whole thing survive a sink outage without dropping data on the floor. That is exactly the job Vector was built for: a single high-throughput observability pipeline that collects logs at the edge, transforms them in flight with VRL (Vector Remap Language), and fans them out to multiple sinks with per-sink buffering and backpressure. This guide builds that pipeline end to end — an agent tier on every node, an aggregator tier doing the heavy parsing, and three sinks (Loki, S3, Elasticsearch) — and shows you how to validate it, roll it back, and not get paged for it.
This is an intermediate, hands-on guide. By the end you will have a working two-tier topology you can adapt to AWS, Azure, or GCP.
Prerequisites
- A Kubernetes cluster (1.27+) or a fleet of VMs where logs originate. Examples below use Kubernetes; the VM path is called out where it differs.
- kubectl and helm 3.x configured against the cluster, or vector 0.40+ installed locally for VM/dev testing.
- A reachable Grafana Loki endpoint (self-hosted or Grafana Cloud) and its push URL.
- An Elasticsearch 8.x cluster (or OpenSearch) reachable over HTTPS, with an ingest user.
- An AWS S3 bucket (or any S3-compatible store — MinIO, GCS via the S3 interop endpoint) and an IAM identity for writes.
- Credentials staged in HashiCorp Vault — Vector reads secrets from environment variables; the deploy job templates them from Vault rather than baking them into the manifest. We never put a sink password in plaintext YAML.
- Identity for operators via Okta / Entra ID SSO into Grafana, Kibana, and the AWS console — Vector itself authenticates to sinks with scoped tokens, not human credentials.
Target topology
The pipeline is deliberately split into two tiers, because collapsing them is the most common mistake teams make and it is the one that wakes you up at 3 a.m.
Agent tier runs as a DaemonSet — one lightweight Vector per node. Its only jobs are to tail container logs, attach Kubernetes metadata (pod, namespace, labels), do the cheapest possible normalization, and forward everything to the aggregator over the Vector-native protocol. Agents are stateless and disposable; if a node dies, you lose nothing that was already acknowledged downstream.
Aggregator tier runs as a horizontally scaled workload behind a Service; with the Helm chart used below, the Aggregator role is a StatefulSet, which lets each replica keep its disk buffers across restarts. This is where the expensive work lives: full VRL parsing, enrichment, sampling, routing (deciding which logs go to which sinks), and the actual fan-out to Loki, S3, and Elasticsearch. Each sink has its own disk-backed buffer, so a slow or down Elasticsearch does not stall delivery to Loki and S3 until its own buffer fills. Keeping the CPU-heavy parsing on a tier you can scale independently, and out of the per-node agent, is what lets the agent stay tiny and the cluster stay cheap.
Around the pipeline: CrowdStrike Falcon runs on the nodes for runtime threat detection and is itself one of the log sources Vector collects; Wiz scans the S3 bucket and the cluster for posture drift (a world-readable bucket, an over-broad IAM policy); Dynatrace (or Datadog) scrapes Vector’s own internal metrics so the pipeline that watches everything is itself watched; and Terraform provisions the bucket, IAM, and Loki/Elasticsearch infrastructure while Argo CD syncs the Vector manifests from Git.
1. Provision the sinks and credentials
Stand up the destinations first — a pipeline with nowhere to write is just a memory leak. Use Terraform so the bucket and IAM are reproducible and reviewable; this is the only place AWS resources get created.
# infra/vector-sinks.tf (applied by Terraform via Argo CD's pre-sync or a CI job)
resource "aws_s3_bucket" "logs_archive" {
  bucket = "kv-logs-archive-prod"
}

resource "aws_s3_bucket_lifecycle_configuration" "expire" {
  bucket = aws_s3_bucket.logs_archive.id

  rule {
    id     = "to-glacier-then-expire"
    status = "Enabled"

    filter {} # empty filter: the rule applies to every object in the bucket

    # Blocks with more than one argument must be written across multiple lines in HCL.
    transition {
      days          = 30
      storage_class = "GLACIER"
    }

    expiration {
      days = 400 # compliance retention window
    }
  }
}

resource "aws_iam_policy" "vector_s3_write" {
  name = "vector-s3-archive-write"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:PutObject"] # write-only; Vector never reads
      Resource = "${aws_s3_bucket.logs_archive.arn}/*"
    }]
  })
}
For Kubernetes, bind that policy to a service account with IRSA (IAM Roles for Service Accounts) so the aggregator pods get short-lived credentials — no static AWS keys in the cluster. The Loki and Elasticsearch endpoints come from your existing platform; create a scoped Elasticsearch role that can only write to the target indices:
# Elasticsearch: a write-only role + ingest user for Vector
curl -u admin:"$ES_ADMIN_PW" -X POST "https://es.internal:9200/_security/role/vector_writer" \
-H 'Content-Type: application/json' -d '{
"indices": [{ "names": ["logs-app-*","logs-audit-*"], "privileges": ["create_index","create","index"] }]
}'
curl -u admin:"$ES_ADMIN_PW" -X POST "https://es.internal:9200/_security/user/vector" \
-H 'Content-Type: application/json' -d '{ "password":"'"$ES_VECTOR_PW"'", "roles":["vector_writer"] }'
Stage ES_VECTOR_PW, the Loki tenant token, and any S3 fallback keys in HashiCorp Vault under a path like secret/observability/vector. The deploy job (Step 5) injects them as environment variables; Vector references them with the ${VAR} syntax, never literals.
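As a rough illustration of how the IRSA binding and the secret injection could look in the aggregator's Helm values, here is a minimal sketch. The role ARN, the account ID, and the Kubernetes Secret name vector-sink-credentials (with its keys) are hypothetical; it also assumes the deploy job materializes the Vault values into that Secret rather than templating the env values directly.

```yaml
# aggregator-values.yaml (excerpt, sketch only)
serviceAccount:
  create: true
  annotations:
    # Hypothetical role ARN; the role needs the vector-s3-archive-write policy attached.
    eks.amazonaws.com/role-arn: "arn:aws:iam::111122223333:role/vector-aggregator"
env:
  - name: ES_VECTOR_PW
    valueFrom:
      secretKeyRef:
        name: vector-sink-credentials   # hypothetical Secret populated from Vault at deploy time
        key: es_vector_pw
  - name: LOKI_TOKEN
    valueFrom:
      secretKeyRef:
        name: vector-sink-credentials
        key: loki_token
```

With this shape the manifest in Git carries only references; the secret values themselves stay in Vault and in the cluster Secret.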
2. Install the agent tier (DaemonSet)
The agents do almost nothing — that is the point. Add the Vector Helm repo and install the vector chart in Agent role:
helm repo add vector https://helm.vector.dev
helm repo update
# agent-values.yaml
role: Agent                      # DaemonSet: one pod per node
resources:
  requests: { cpu: "100m", memory: "128Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }
customConfig:
  data_dir: /vector-data-dir
  sources:
    k8s_logs:
      type: kubernetes_logs      # tails /var/log/pods, auto-enriches with pod metadata
  transforms:
    add_origin:
      type: remap
      inputs: [k8s_logs]
      source: |
        .observed_by = "vector-agent"
        .cluster = "prod-ap-south-1"
  sinks:
    to_aggregator:
      type: vector               # native Vector-to-Vector protocol
      inputs: [add_origin]
      address: "vector-aggregator.observability.svc:6000"
      compression: true
      buffer:
        type: disk               # survive an aggregator blip without dropping
        max_size: 268435488      # 256 MiB on-node spool
helm upgrade --install vector-agent vector/vector \
-n observability --create-namespace -f agent-values.yaml
On VMs / virtual appliances instead of Kubernetes, the equivalent is the vector package with a file source globbing /var/log/*.log (or journald), the same vector sink pointing at the aggregator, and Ansible to roll the config out fleet-wide. The agent stays equally thin.
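For the VM path, a minimal agent config might look like the sketch below. The file location, the data_dir, and the aggregator hostname are assumptions (a VM cannot resolve the in-cluster Service name, so the aggregator would need to be exposed at some reachable address); the sources and the sink mirror the Kubernetes agent above.

```yaml
# /etc/vector/vector.yaml on a VM (sketch; paths and hostname are illustrative)
data_dir: /var/lib/vector
sources:
  host_logs:
    type: file
    include: ["/var/log/*.log"]
  journal:
    type: journald
transforms:
  add_origin:
    type: remap
    inputs: [host_logs, journal]
    source: |
      .observed_by = "vector-agent"
sinks:
  to_aggregator:
    type: vector
    inputs: [add_origin]
    address: "vector-aggregator.internal.example:6000"   # hypothetical address of the exposed aggregator
    compression: true
    buffer:
      type: disk
      max_size: 268435488
```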
3. Deploy the aggregator tier
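The aggregator is a scalable workload that receives from all agents, runs the real VRL, routes, and fans out. Install a second release in Aggregator role (the chart deploys this role as a StatefulSet, so each replica gets persistent storage for its disk buffers) with replicas and headroom: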
# aggregator-values.yaml
role: Aggregator
replicas: 3
resources:
  requests: { cpu: "1", memory: "1Gi" }
  limits:   { cpu: "2", memory: "2Gi" }
service:
  enabled: true
  ports:
    - { name: vector, port: 6000, protocol: TCP }
customConfig:
  data_dir: /vector-data-dir
  api: { enabled: true, address: "0.0.0.0:8686" }   # for `vector top` + health
  sources:
    from_agents:
      type: vector
      address: "0.0.0.0:6000"
  # transforms and sinks are defined in the next two steps
helm upgrade --install vector-aggregator vector/vector \
-n observability -f aggregator-values.yaml
Front it with the vector-aggregator Service the agents already target. Because each agent disk-buffers to the aggregator and the aggregator disk-buffers to each sink, the pipeline degrades gracefully at every hop instead of dropping events the moment something downstream slows down.
4. Write the VRL: parse, enrich, route
This is the heart of the pipeline. Add a chain of remap (VRL) transforms and a route transform to the aggregator config. VRL parses untyped log lines into structured fields, drops noise, enriches, and tags each event with a destination class.
transforms:
  # 4a. Parse: turn a raw line into structured fields, defensively.
  parse:
    type: remap
    inputs: [from_agents]
    source: |
      # Most app logs are JSON; fall back gracefully if not.
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) { . = merge(., object!(parsed)) }
      # Normalize the severity field that every team spells differently.
      # ?? coalesces errors, not nulls: string() fails on a missing or non-string
      # field, so the chain falls through to the next candidate.
      .level = downcase(string(.level) ?? string(.severity) ?? string(.lvl) ?? "info")
      # Coerce the timestamp (assumes RFC 3339 strings); if it is missing or junk, stamp ingest time.
      .timestamp = parse_timestamp(.timestamp, "%+") ?? parse_timestamp(.ts, "%+") ?? now()
      # Drop a chatty health-check path entirely; never bill for it.
      if .http.path == "/healthz" { abort }
  # 4b. Enrich + redact: PII hygiene before anything leaves the pipeline.
  enrich:
    type: remap
    inputs: [parse]
    source: |
      .env = "prod"
      # Redact emails so they never land in a searchable index.
      if is_string(.message) {
        .message = replace(string!(.message), r'[\w.+-]+@[\w-]+\.[\w.-]+', "[redacted-email]")
      }
      # Derive a cheap routing signal.
      .is_audit = exists(.audit) || starts_with(string(.logger) ?? "", "audit")
      .is_metric = exists(.metric_name)
  # 4c. Sample the firehose: keep 1 in 10 routine events; errors, warnings, and audit events always pass.
  sample_access:
    type: sample
    inputs: [enrich]
    rate: 10                     # keep 1 of every 10...
    exclude: '.level == "error" || .level == "warn" || .is_audit == true'   # ...but never drop these
  # 4d. Route: classify each event to its sink(s) by condition.
  route:
    type: route
    inputs: [sample_access]
    route:
      archive: 'true'            # catch-all: every event that reaches the router goes to S3
      searchable: '.level == "error" || .level == "warn" || .is_audit == true'
      metrics_shaped: '.is_metric == true'
A few of these choices deserve an explanation. parse_json with an explicit err check means a single malformed line never fails the transform; it just stays a raw string. abort on /healthz is the single highest-leverage cost cut in most pipelines. The route transform emits named outputs (route.archive, route.searchable, route.metrics_shaped) you wire to sinks in the next step, and because archive is the literal true, S3 receives every event that reaches the router while Elasticsearch only gets the high-value subset, which is precisely the index-volume cut the platform lead asked for. One caveat: route reads from sample_access, so events the sampler drops never reach route.archive. If the archive must hold every event, feed the S3 sink from enrich directly instead.
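The parsing behavior described above can be pinned down with Vector's built-in unit tests, which live under a top-level tests key in the same config and run with vector test. The sketch below is illustrative only: the test names and sample log lines are made up, and it assumes the parse transform normalizes a severity field into .level and aborts on /healthz as described in step 4a.

```yaml
# Unit tests for the parse transform (sketch; sample inputs are invented)
tests:
  - name: "json line is parsed and severity is normalized"
    inputs:
      - insert_at: parse
        type: log
        log_fields:
          message: '{"severity":"ERROR","msg":"payment failed"}'
    outputs:
      - extract_from: parse
        conditions:
          - type: vrl
            source: |
              assert_eq!(.level, "error")
              assert_eq!(.msg, "payment failed")
  - name: "health checks are dropped"
    inputs:
      - insert_at: parse
        type: log
        log_fields:
          message: '{"http":{"path":"/healthz"},"level":"info"}'
    # abort drops the event, so parse should emit nothing for this input
    no_outputs_from: [parse]
```

Running these in CI alongside vector validate catches logic regressions (a renamed field, a dropped fallback) that a pure syntax check would miss.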
5. Wire the three sinks with per-sink buffers
Now attach Loki, S3, and Elasticsearch, each consuming the appropriate route output, each with its own disk buffer so one sink’s outage is isolated.
sinks:
  # Loki: the metric-shaped lines, for Grafana dashboards.
  loki:
    type: loki
    inputs: [route.metrics_shaped]
    endpoint: "https://loki.internal:3100"
    auth: { strategy: basic, user: "vector", password: "${LOKI_TOKEN}" }
    encoding: { codec: json }        # the loki sink requires an explicit codec
    labels:                          # keep label cardinality LOW
      app: '{{ kubernetes.pod_labels.app }}'
      level: '{{ level }}'
    out_of_order_action: accept
    buffer: { type: disk, max_size: 536870912 }    # 512 MiB
  # S3: the immutable catch-all archive, compressed + partitioned by date.
  s3_archive:
    type: aws_s3
    inputs: [route.archive]
    bucket: "kv-logs-archive-prod"
    region: "ap-south-1"
    compression: gzip
    encoding: { codec: json }
    key_prefix: "year=%Y/month=%m/day=%d/"         # Athena-friendly partitions
    batch: { max_bytes: 10485760, timeout_secs: 300 }
    buffer: { type: disk, max_size: 1073741824 }   # 1 GiB; S3 must not block
  # Elasticsearch: only the searchable, high-value subset.
  elasticsearch:
    type: elasticsearch
    inputs: [route.searchable]
    endpoints: ["https://es.internal:9200"]
    auth: { strategy: basic, user: "vector", password: "${ES_VECTOR_PW}" }
    bulk: { index: "logs-app-%Y.%m.%d" }           # daily indices for easy ILM
    buffer:
      type: disk
      max_size: 1073741824
      when_full: block                             # apply backpressure, don't drop
The when_full: block on Elasticsearch is deliberate: when ES is overwhelmed and its buffer fills, Vector applies backpressure up the chain rather than discarding events. The agents’ on-node buffers absorb the slack, and nothing is lost as long as the outage fits the buffer budget. S3 gets a full 1 GiB buffer, as large as Elasticsearch’s, because it is the one sink that must never drop a compliance copy. Keeping Loki’s labels low-cardinality (no request IDs, no user IDs) is what stops Loki from melting; high-cardinality labels are the classic Loki footgun.
Roll the change out through Git, not by hand. Argo CD watches the manifests repo and syncs the updated aggregator ConfigMap; GitHub Actions (or Jenkins) runs vector validate in CI before the merge so a broken VRL never reaches the cluster. The deploy job pulls LOKI_TOKEN and ES_VECTOR_PW from HashiCorp Vault and injects them as env vars into the pod spec.
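# CI gate: fails the build on any config or VRL error. The Helm values file is not itself
# a Vector config, so extract customConfig first (yq v4); sink env vars need dummy values.
yq '.customConfig' aggregator-values.yaml > vector-aggregator.yaml
vector validate --no-environment --config-yaml vector-aggregator.yaml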
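One way the CI gate could be wired into GitHub Actions is sketched below. The workflow file name, job name, and trigger paths are assumptions; it also assumes the runner image ships yq v4 and that the Vector install script places the binary under ~/.vector/bin, so adjust both to your environment.

```yaml
# .github/workflows/vector-validate.yml (sketch; names and paths are illustrative)
name: vector-validate
on:
  pull_request:
    paths: ["aggregator-values.yaml", "agent-values.yaml"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Vector
        run: |
          curl --proto '=https' --tlsv1.2 -sSfL https://sh.vector.dev | bash -s -- -y
          echo "$HOME/.vector/bin" >> "$GITHUB_PATH"   # assumed install location
      - name: Validate aggregator config
        env:
          # Dummy values: interpolation needs the variables to exist, not to be real.
          LOKI_TOKEN: dummy
          ES_VECTOR_PW: dummy
        run: |
          yq '.customConfig' aggregator-values.yaml > /tmp/aggregator.yaml
          vector validate --no-environment --config-yaml /tmp/aggregator.yaml
```

Marking this job as a required status check is what turns it from advice into a gate.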
6. Add self-observability and change management
The pipeline must report on itself. Vector exposes an internal-metrics source; scrape it with Dynatrace or Datadog so you can alert on vector_component_errors_total, vector_buffer_events, and per-sink vector_component_sent_events_total.
sources:
  internal:
    type: internal_metrics
sinks:
  prom:
    type: prometheus_exporter    # Dynatrace/Datadog scrape :9598
    inputs: [internal]
    address: "0.0.0.0:9598"
Wire two integrations that keep humans in the loop. Route Vector’s own error events (a sink consistently failing, a buffer near full) to ServiceNow so a sustained delivery failure opens an incident with a ticket number, not just a log line nobody reads. And if you ship a Moodle LMS or other off-cluster appliances, point their file/syslog logs at the same aggregator Service so they land in the identical parse-route-fan-out path — one pipeline, every source.
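For the off-cluster appliances, one possible shape is a syslog listener on the aggregator that feeds the same parse transform as the agents. This is a sketch: the source name appliance_syslog and port 6514 are assumptions, and the port would also have to be added to the Helm service.ports list before anything outside the cluster can reach it.

```yaml
# Aggregator config excerpt (sketch; appliance_syslog and port 6514 are illustrative)
sources:
  from_agents:
    type: vector
    address: "0.0.0.0:6000"
  appliance_syslog:
    type: syslog
    mode: tcp
    address: "0.0.0.0:6514"
transforms:
  parse:
    # Both sources now share one parse-route-fan-out path;
    # type and source stay exactly as in step 4a.
    inputs: [from_agents, appliance_syslog]
```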
Validation
Prove each hop before you trust it.
- Config is valid and live. vector validate in CI must pass, and on a running pod:
  kubectl exec -it -n observability statefulset/vector-aggregator -- vector top
  vector top shows live throughput per component; confirm from_agents is receiving and all three sinks show non-zero “sent”.
- Routing actually splits traffic. Emit a known error and a known access log, then check counts:
  kubectl logs -n observability statefulset/vector-aggregator | grep -c "logs-app"
  # Then in Kibana: an `error`/`warn`/audit doc IS present; a sampled 200 access log is NOT.
  Errors and audit events must appear in Elasticsearch; routine 200s must be largely absent (1-in-10) yet fully present in S3.
- S3 archive is complete and partitioned.
  aws s3 ls s3://kv-logs-archive-prod/year=2026/month=06/day=10/ | head
  aws s3 cp s3://kv-logs-archive-prod/year=2026/month=06/day=10/<obj>.log.gz - | gunzip | head
  Every event, including the ones sampled out of Elasticsearch, must be present here.
- Loki has the metric-shaped lines and low label cardinality. In Grafana Explore, run {app="checkout"} and confirm series count is sane (not thousands).
- Backpressure holds. Scale Elasticsearch to zero replicas for two minutes; confirm Loki and S3 keep flowing (vector top), vector_buffer_events for the ES sink climbs, and when ES returns the backlog drains with zero gaps in the S3 archive. This is the test that proves the design.
Rollback / teardown
Because everything is GitOps-managed, rollback is a revert, not a scramble.
- Bad config / VRL: revert the commit and let Argo CD sync, or pin back fast:
  helm rollback vector-aggregator -n observability
  The previous aggregator ConfigMap is restored; agents keep buffering to the Service throughout, so no gap.
- Drain buffers before teardown so no in-flight events are lost. Scaling the aggregator to zero first would strand whatever is still in its buffers, so stop intake at the agents, let the buffers empty, and only then scale down:
  helm uninstall vector-agent -n observability   # stop new intake
  # confirm vector_buffer_events -> 0 in Dynatrace/Datadog before continuing
  kubectl scale statefulset/vector-aggregator -n observability --replicas=0
- Full removal:
  helm uninstall vector-aggregator -n observability   # vector-agent was already removed in the drain step
  Leave the S3 bucket in place (Terraform-managed, retention-governed); only destroy it through Terraform with explicit sign-off, since it is your compliance record.
Common pitfalls
- One shared buffer for all sinks. If you let a single buffer or when_full: drop_newest govern the whole pipeline, a slow Elasticsearch silently throttles or drops Loki and S3 too. Give every sink its own disk buffer; that isolation is the entire reason for the design.
- High-cardinality Loki labels. Putting request_id, user_id, or trace_id in Loki labels explodes the index and OOMs the ingesters. Keep labels to a tiny stable set; everything else stays in the log body.
- VRL that aborts on bad input. Using parse_json! (the bang form, which aborts the program on error) without a guard fails on every malformed line; by default the remap transform then logs an error and forwards the event unparsed, so it reaches the router without the fields the routes depend on. Always capture the err and fall back.
- Routing with no catch-all. If your route conditions do not cover every event, unmatched events go only to the route transform's _unmatched output and vanish unless something consumes it. Keep an archive: 'true' catch-all to S3 so nothing is ever silently lost.
- Under-provisioned agent buffers. If the on-node spool is too small, a long aggregator outage overflows it and drops at the edge, exactly when you need the logs. Size agent buffers to your worst tolerable aggregator downtime.
- No CI validation. Shipping VRL without vector validate in GitHub Actions means a typo takes the whole pipeline down on sync. Gate every merge.
Security notes
Vector authenticates to each sink with a scoped, least-privilege credential: the Elasticsearch role can only write to the logs-app-* and logs-audit-* indices, the S3 IAM policy is PutObject-only with no read or delete, and the Loki token is tenant-scoped. All three live in HashiCorp Vault and are injected as env vars at deploy time; none appear in Git or in a ConfigMap. Redact PII in VRL before fan-out (the email-masking transform above) so personal data never reaches a searchable index. CrowdStrike Falcon protects the nodes at runtime and feeds its own events into the pipeline; Wiz continuously scans the S3 bucket and IAM for posture drift (a bucket gone public, a policy gone broad), and the bucket should enforce SSE-KMS encryption and Block Public Access. Operators reach Grafana, Kibana, and AWS through Okta / Entra ID SSO with MFA; Vector never holds a human credential.
Cost notes
The savings come from routing, not from a cheaper vendor. Sampling routine access logs and aborting health checks in VRL cuts ingest volume before it costs anything. The index-volume reduction comes from sending only error/warn/audit events to Elasticsearch, the most expensive sink per GB, while S3 holds the cheap, gzip-compressed, lifecycle-tiered (Glacier after 30 days) archive. Loki, whose cost tracks ingested volume and label cardinality, takes only the metric-shaped lines that drive dashboards, so Grafana stops querying Elasticsearch. The replayability bonus: because S3 holds the archive, you can re-feed an archive day through a one-off Vector pipeline into Elasticsearch on demand instead of paying to keep cold data hot. Track per-sink throughput in Dynatrace / Datadog to see the volume split in numbers and tune the sample rate against your actual bill.
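The one-off replay pipeline mentioned above could be as small as the sketch below, which reads newline-delimited JSON from stdin and writes it to Elasticsearch. The file name replay.yaml and the index name logs-app-replay are assumptions; the index is chosen to match the logs-app-* pattern the vector_writer role is allowed to write, and a fixed name avoids depending on how archived timestamps are typed after decoding.

```yaml
# replay.yaml (sketch): re-feed archived events into Elasticsearch
sources:
  archive_in:
    type: stdin
    decoding:
      codec: json
sinks:
  es_replay:
    type: elasticsearch
    inputs: [archive_in]
    endpoints: ["https://es.internal:9200"]
    auth: { strategy: basic, user: "vector", password: "${ES_VECTOR_PW}" }
    bulk: { index: "logs-app-replay" }   # hypothetical index name within the role's allowed pattern
```

A day's objects can then be streamed through it from a machine with read access to the bucket, for example with aws s3 cp s3://kv-logs-archive-prod/year=2026/month=06/day=10/<obj>.log.gz - | gunzip | vector --config replay.yaml. Objects already moved to Glacier need a restore first, and the write-only Vector role cannot perform the download, so the replay needs its own read-scoped identity.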