Deploy Vector for High-Throughput Log Routing, Transformation, and Multi-Sink Delivery

A mid-size SaaS platform team is bleeding money on its logging bill. Every container ships raw JSON straight to a hosted Elasticsearch cluster, and that cluster is the single point of failure for all observability. When a marketing campaign triples traffic, the ingest queue backs up, the log shippers get OOM-killed, and the on-call engineer is blind during the exact incident they need logs for. The mandate from the platform lead is concrete: cut Elasticsearch index volume by routing low-value logs elsewhere, keep a cheap immutable copy of everything in object storage for compliance and replay, send the metrics-shaped log lines to Loki so Grafana dashboards stop hammering Elasticsearch, and make the whole thing survive a sink outage without dropping data on the floor.

That is the job Vector was built for: a single high-throughput observability pipeline that collects logs at the edge, transforms them in flight with VRL (Vector Remap Language), and fans them out to multiple sinks with per-sink buffering and backpressure. This guide builds that pipeline end to end, with an agent tier on every node, an aggregator tier doing the heavy parsing, and three sinks (Loki, S3, Elasticsearch), and shows you how to validate it, roll it back, and not get paged for it.

This is an intermediate, hands-on guide. By the end you will have a working two-tier topology you can adapt to AWS, Azure, or GCP.

Prerequisites

Target topology

[Figure: pipeline topology diagram]

The pipeline is deliberately split into two tiers, because collapsing them is the most common mistake teams make and it is the one that wakes you up at 3 a.m.

Agent tier runs as a DaemonSet — one lightweight Vector per node. Its only jobs are to tail container logs, attach Kubernetes metadata (pod, namespace, labels), do the cheapest possible normalization, and forward everything to the aggregator over the Vector-native protocol. Agents are stateless and disposable; if a node dies, you lose nothing that was already acknowledged downstream.

Aggregator tier runs as a horizontally scaled set of replicas behind a Service (the Helm chart’s Aggregator role deploys it as a StatefulSet, so each replica keeps its own buffer volume). This is where the expensive work lives: full VRL parsing, enrichment, sampling, routing (deciding which logs go to which sinks), and the actual fan-out to Loki, S3, and Elasticsearch. Each sink has its own disk-backed buffer, so a slow or down Elasticsearch does not stall delivery to Loki and S3 while its buffer still has room. Keeping the CPU-heavy parsing on a tier you can scale independently, and out of the per-node agent, is what lets the agent stay tiny and the cluster stay cheap.

Around the pipeline: CrowdStrike Falcon runs on the nodes for runtime threat detection and is itself one of the log sources Vector collects; Wiz scans the S3 bucket and the cluster for posture drift (a world-readable bucket, an over-broad IAM policy); Dynatrace (or Datadog) scrapes Vector’s own internal metrics so the pipeline that watches everything is itself watched; and Terraform provisions the bucket, IAM, and Loki/Elasticsearch infrastructure while Argo CD syncs the Vector manifests from Git.

1. Provision the sinks and credentials

Stand up the destinations first — a pipeline with nowhere to write is just a memory leak. Use Terraform so the bucket and IAM are reproducible and reviewable; this is the only place AWS resources get created.

# infra/vector-sinks.tf  (applied by Terraform via Argo CD's pre-sync or a CI job)
resource "aws_s3_bucket" "logs_archive" {
  bucket = "kv-logs-archive-prod"
}

resource "aws_s3_bucket_lifecycle_configuration" "expire" {
  bucket = aws_s3_bucket.logs_archive.id
  rule {
    id     = "to-glacier-then-expire"
    status = "Enabled"
    filter {}                            # empty filter: the rule applies to every object
    transition {                         # a block with two arguments must span multiple lines in HCL
      days          = 30
      storage_class = "GLACIER"
    }
    expiration { days = 400 }            # compliance retention window
  }
}

resource "aws_iam_policy" "vector_s3_write" {
  name = "vector-s3-archive-write"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:PutObject"]              # write-only; Vector never reads
      Resource = "${aws_s3_bucket.logs_archive.arn}/*"
    }]
  })
}

For Kubernetes, bind that policy to a service account with IRSA (IAM Roles for Service Accounts) so the aggregator pods get short-lived credentials — no static AWS keys in the cluster. The Loki and Elasticsearch endpoints come from your existing platform; create a scoped Elasticsearch role that can only write to the target indices:

# Elasticsearch: a write-only role + ingest user for Vector
curl -u admin:"$ES_ADMIN_PW" -X POST "https://es.internal:9200/_security/role/vector_writer" \
  -H 'Content-Type: application/json' -d '{
    "indices": [{ "names": ["logs-app-*","logs-audit-*"], "privileges": ["create_index","create","index"] }]
  }'
curl -u admin:"$ES_ADMIN_PW" -X POST "https://es.internal:9200/_security/user/vector" \
  -H 'Content-Type: application/json' -d '{ "password":"'"$ES_VECTOR_PW"'", "roles":["vector_writer"] }'

Stage ES_VECTOR_PW, the Loki tenant token, and any S3 fallback keys in HashiCorp Vault under a path like secret/observability/vector. The deploy job (Step 5) injects them as environment variables; Vector references them with the ${VAR} syntax, never literals.
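As a rough sketch of how the IRSA binding and the injected variables might look in the aggregator's Helm values file (introduced in Step 3): the role ARN, the account ID, and the Secret name vector-sink-credentials are placeholders invented for illustration. The Secret is assumed to be written by the deploy job from the Vault path above.

```yaml
# aggregator-values.yaml (additional keys; ARN and Secret name are hypothetical)
serviceAccount:
  create: true
  annotations:
    # IRSA: EKS exchanges this annotation for short-lived credentials.
    # The referenced role is assumed to carry the vector-s3-archive-write policy.
    eks.amazonaws.com/role-arn: "arn:aws:iam::111122223333:role/vector-aggregator"

env:
  # Assumed to be populated by the deploy job from secret/observability/vector.
  - name: LOKI_TOKEN
    valueFrom:
      secretKeyRef: { name: vector-sink-credentials, key: LOKI_TOKEN }
  - name: ES_VECTOR_PW
    valueFrom:
      secretKeyRef: { name: vector-sink-credentials, key: ES_VECTOR_PW }
```

With this shape the Vector config only ever contains ${LOKI_TOKEN} and ${ES_VECTOR_PW}; the values themselves stay in the Secret.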

2. Install the agent tier (DaemonSet)

The agents do almost nothing — that is the point. Add the Vector Helm repo and install the vector chart in Agent role:

helm repo add vector https://helm.vector.dev
helm repo update

# agent-values.yaml
role: Agent          # DaemonSet: one pod per node
resources:
  requests: { cpu: "100m", memory: "128Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }

customConfig:
  data_dir: /vector-data-dir
  sources:
    k8s_logs:
      type: kubernetes_logs        # tails /var/log/pods, auto-enriches with pod metadata
  transforms:
    add_origin:
      type: remap
      inputs: [k8s_logs]
      source: |
        .observed_by = "vector-agent"
        .cluster = "prod-ap-south-1"
  sinks:
    to_aggregator:
      type: vector                 # native Vector-to-Vector protocol
      inputs: [add_origin]
      address: "vector-aggregator.observability.svc:6000"
      compression: true
      buffer:
        type: disk                 # survive an aggregator blip without dropping
        max_size: 268435488        # ~256 MiB on-node spool (Vector's minimum disk buffer size)

helm upgrade --install vector-agent vector/vector \
  -n observability --create-namespace -f agent-values.yaml

On VMs / virtual appliances instead of Kubernetes, the equivalent is the vector package with a file source globbing /var/log/*.log (or journald), the same vector sink pointing at the aggregator, and Ansible to roll the config out fleet-wide. The agent stays equally thin.
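A minimal sketch of that VM-side config, assuming the aggregator is reachable from outside the cluster at a hostname such as vector-aggregator.example.internal (a hypothetical name; cluster-internal Service DNS does not resolve from a VM) and that the package's default data directory is used:

```yaml
# /etc/vector/vector.yaml on a VM (sketch; hostname and data_dir are assumptions)
data_dir: /var/lib/vector
sources:
  host_logs:
    type: file
    include: ["/var/log/*.log"]      # swap for `type: journald` on systemd hosts
transforms:
  add_origin:
    type: remap
    inputs: [host_logs]
    source: |
      .observed_by = "vector-agent"
      .cluster = "prod-ap-south-1"
sinks:
  to_aggregator:
    type: vector                     # same native protocol as the DaemonSet agents
    inputs: [add_origin]
    address: "vector-aggregator.example.internal:6000"   # hypothetical external address
    compression: true
    buffer:
      type: disk
      max_size: 268435488            # same on-host spool size as the Kubernetes agents
```

Because the transform and sink mirror the DaemonSet values, events from VMs arrive at the aggregator with the same shape as events from pods.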

3. Deploy the aggregator tier

The aggregator is a scalable tier that receives from all agents, runs the real VRL, routes, and fans out. Install a second release in Aggregator role, which the chart deploys as a StatefulSet, with replicas and headroom:

# aggregator-values.yaml
role: Aggregator
replicas: 3
resources:
  requests: { cpu: "1",   memory: "1Gi" }
  limits:   { cpu: "2",   memory: "2Gi" }
service:
  enabled: true
  ports:
    - { name: vector, port: 6000, protocol: TCP }

customConfig:
  data_dir: /vector-data-dir
  api: { enabled: true, address: "0.0.0.0:8686" }   # for `vector top` + health
  sources:
    from_agents:
      type: vector
      address: "0.0.0.0:6000"
  # transforms and sinks are defined in the next two steps

helm upgrade --install vector-aggregator vector/vector \
  -n observability -f aggregator-values.yaml

Front it with the vector-aggregator Service the agents already target. Because each agent disk-buffers to the aggregator and the aggregator disk-buffers to each sink, the pipeline degrades gracefully at every hop instead of dropping events the moment something downstream slows down.

4. Write the VRL: parse, enrich, route

This is the heart of the pipeline. Add a chain of remap (VRL) transforms and a route transform to the aggregator config. VRL parses untyped log lines into structured fields, drops noise, enriches, and tags each event with a destination class.

  transforms:
    # 4a. Parse: turn a raw line into structured fields, defensively.
    parse:
      type: remap
      inputs: [from_agents]
      source: |
        # Most app logs are JSON; fall back gracefully if not.
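        parsed, err = parse_json(.message)
        if err == null && is_object(parsed) { . = merge(., object!(parsed)) }

        # Normalize the severity field that every team spells differently.
        # `??` only catches errors, so assert the type with string() and fall through on failure.
        .level = downcase(string(.level) ?? string(.severity) ?? string(.lvl) ?? "info")

        # Coerce the timestamp; if it is missing or junk, stamp ingest time.
        .timestamp = timestamp(.timestamp) ?? parse_timestamp(.timestamp, "%+") ?? parse_timestamp(.ts, "%+") ?? now()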

        # Drop a chatty health-check path entirely — never bill for it.
        if .http.path == "/healthz" { abort }

    # 4b. Enrich + redact: PII hygiene before anything leaves the pipeline.
    enrich:
      type: remap
      inputs: [parse]
      source: |
        .env = "prod"
        # Redact emails so they never land in a searchable index.
        .message = replace(string!(.message), r'[\w.+-]+@[\w-]+\.[\w.-]+', "[redacted-email]")
        # Derive a cheap routing signal.
        .is_audit  = exists(.audit) || starts_with(string(.logger) ?? "", "audit")
        .is_metric = exists(.metric_name)

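    # 4c. Sample the firehose: keep 1 in 10 routine events; errors, warnings, and audit events are exempt.
    #     Everything downstream of this transform, including route.archive, sees only the sampled stream.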
    sample_access:
      type: sample
      inputs: [enrich]
      rate: 10                       # keep 1 of every 10...
      exclude: '.level == "error" || .level == "warn" || .is_audit'   # ...but never drop these

    # 4d. Route: classify each event to its sink(s) by condition.
    route:
      type: route
      inputs: [sample_access]
      route:
        archive:  'true'                       # catch-all: matches every event that reaches the router
        searchable: '.level == "error" || .level == "warn" || .is_audit'
        metrics_shaped: '.is_metric == true'

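A few choices worth the why. parse_json with an explicit err check means a single malformed line never crashes the transform; the line just stays a raw string. abort on /healthz is often the highest-leverage cost cut in a pipeline, because health checks are high-volume and carry almost no information. The route transform emits named outputs (route.archive, route.searchable, route.metrics_shaped) that you wire to sinks in the next step. Elasticsearch receives only route.searchable, the high-value subset, which is precisely the index-volume cut the platform lead asked for. One ordering detail matters for the archive: route sits downstream of sample_access, so route.archive matches everything that survives sampling, not everything that was ingested. For S3 to hold a complete copy, the archive sink has to read from a point upstream of the sampler, such as enrich.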
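The parsing behaviour described above can be pinned down with Vector's built-in config unit tests. The following is a sketch, assuming the tests block is appended to the same Vector config that defines the parse transform and is run with vector test; the test names and sample log lines are made up for illustration.

```yaml
tests:
  # A health-check line should be aborted and never leave `parse`.
  - name: healthz_is_dropped
    inputs:
      - insert_at: parse
        type: log
        log_fields:
          message: '{"level":"info","http":{"path":"/healthz"}}'
    no_outputs_from: [parse]

  # A team that logs `severity` in upper case still ends up with a lower-case `level`.
  - name: severity_is_normalized_to_level
    inputs:
      - insert_at: parse
        type: log
        log_fields:
          message: '{"severity":"WARN","msg":"disk at 91%"}'
    outputs:
      - extract_from: parse
        conditions:
          - type: vrl
            source: |
              assert_eq!(.level, "warn")

  # A non-JSON line passes through as a raw string with the default level.
  - name: non_json_line_survives
    inputs:
      - insert_at: parse
        type: log
        log_fields:
          message: "plain text, not JSON"
    outputs:
      - extract_from: parse
        conditions:
          - type: vrl
            source: |
              assert_eq!(.message, "plain text, not JSON")
              assert_eq!(.level, "info")
```

Running these in CI next to vector validate catches a regression in the VRL itself, not only a syntax error.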

5. Wire the three sinks with per-sink buffers

Now attach Loki, S3, and Elasticsearch, each consuming the appropriate route output, each with its own disk buffer so one sink’s outage is isolated.

  sinks:
    # Loki: the metric-shaped lines, for Grafana dashboards.
    loki:
      type: loki
      inputs: [route.metrics_shaped]
      endpoint: "https://loki.internal:3100"
      auth: { strategy: basic, user: "vector", password: "${LOKI_TOKEN}" }
      labels:                          # keep label cardinality LOW
        app: '{{ kubernetes.pod_labels.app }}'
        level: '{{ level }}'
      encoding: { codec: json }        # the loki sink needs an explicit encoding
      out_of_order_action: accept
      buffer: { type: disk, max_size: 536870912 }     # 512 MiB

    # S3: the immutable catch-all archive, compressed + partitioned by date.
    s3_archive:
      type: aws_s3
      inputs: [enrich]                                # read upstream of sample_access so the archive is unsampled and complete
      bucket: "kv-logs-archive-prod"
      region: "ap-south-1"
      compression: gzip
      encoding: { codec: json }
      key_prefix: "year=%Y/month=%m/day=%d/"          # Athena-friendly partitions
      batch:   { max_bytes: 10485760, timeout_secs: 300 }
      buffer:  { type: disk, max_size: 1073741824 }   # 1 GiB — S3 must not block

    # Elasticsearch: only the searchable, high-value subset.
    elasticsearch:
      type: elasticsearch
      inputs: [route.searchable]
      endpoints: ["https://es.internal:9200"]
      auth: { strategy: basic, user: "vector", password: "${ES_VECTOR_PW}" }
      bulk: { index: "logs-app-%Y.%m.%d" }            # daily indices for easy ILM
      buffer:
        type: disk
        max_size: 1073741824
        when_full: block                              # apply backpressure, don't drop

The when_full: block on Elasticsearch is deliberate: when ES is overwhelmed, Vector applies backpressure up the chain rather than discarding events. The agents’ on-node buffers absorb the slack, and nothing is lost as long as the outage fits the buffer budget. S3 gets a full 1 GiB buffer, matching Elasticsearch and double Loki’s 512 MiB, because it is the one sink that must never drop a compliance copy. Keeping Loki’s labels low-cardinality (no request IDs, no user IDs) is what stops Loki from melting; high-cardinality labels are the classic Loki footgun.
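To make "fits the buffer budget" concrete, here is a back-of-the-envelope sizing sketch. The ingest rates are hypothetical; substitute the per-sink throughput you observe in vector top. On-disk size will also differ somewhat from raw log bytes, so treat the results as rough estimates.

```yaml
# Buffer budget sketch (rates are assumptions, not measurements):
#
#   time to fill ≈ max_size / sustained rate into the sink
#
elasticsearch:
  buffer:
    type: disk
    max_size: 1073741824   # 1 GiB = 1024 MiB; at 2 MiB/s: 1024 / 2 = 512 s, about 8.5 minutes
    when_full: block
loki:
  buffer:
    type: disk
    max_size: 536870912    # 512 MiB; at 0.5 MiB/s: 512 / 0.5 = 1024 s, about 17 minutes
```

Under these assumed rates, a two-minute Elasticsearch outage consumes about 240 MiB of the 1 GiB buffer, so it drains without backpressure ever reaching the agents. An outage longer than roughly 8.5 minutes would start to block upstream.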

Roll the change out through Git, not by hand. Argo CD watches the manifests repo and syncs the updated aggregator ConfigMap; GitHub Actions (or Jenkins) runs vector validate in CI before the merge so a broken VRL never reaches the cluster. The deploy job pulls LOKI_TOKEN and ES_VECTOR_PW from HashiCorp Vault and injects them as env vars into the pod spec.

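# CI gate: fails the build on any config or VRL error. The Helm values file is not itself
# a Vector config, so extract customConfig first (yq v4 syntax; dummy values satisfy ${VAR} lookups).
yq '.customConfig' aggregator-values.yaml > vector-aggregator.yaml
LOKI_TOKEN=ci ES_VECTOR_PW=ci vector validate --no-environment --config-yaml vector-aggregator.yaml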

6. Add self-observability and change management

The pipeline must report on itself. Vector exposes an internal_metrics source; publish it through a prometheus_exporter sink and have Dynatrace or Datadog scrape that endpoint so you can alert on vector_component_errors_total, vector_buffer_events, and per-sink vector_component_sent_events_total.

  sources:
    internal:
      type: internal_metrics
  sinks:
    prom:
      type: prometheus_exporter      # Dynatrace/Datadog scrape :9598
      inputs: [internal]
      address: "0.0.0.0:9598"

Wire two integrations that keep humans in the loop. Route Vector’s own error events (a sink consistently failing, a buffer near full) to ServiceNow so a sustained delivery failure opens an incident with a ticket number, not just a log line nobody reads. And if you ship a Moodle LMS or other off-cluster appliances, point their file/syslog logs at the same aggregator Service so they land in the identical parse-route-fan-out path — one pipeline, every source.
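One possible shape for the ServiceNow hand-off, sketched under heavy assumptions: the intake URL is a hypothetical endpoint on your ServiceNow instance that accepts a JSON array of events, and the user name and SERVICENOW_PW variable are placeholders. The Vector components themselves (internal_logs, filter, throttle, http) are standard.

```yaml
sources:
  vector_logs:
    type: internal_logs              # Vector's own log events
transforms:
  vector_errors:
    type: filter
    inputs: [vector_logs]
    condition: '.metadata.level == "ERROR"'
  one_event_per_window:
    type: throttle                   # pass at most 1 event per 15 minutes to avoid ticket floods
    inputs: [vector_errors]
    threshold: 1
    window_secs: 900
sinks:
  servicenow:
    type: http
    inputs: [one_event_per_window]
    uri: "https://example.service-now.com/api/x_hypothetical/vector_events"   # hypothetical intake endpoint
    method: post
    encoding: { codec: json }
    auth: { strategy: basic, user: "vector-integration", password: "${SERVICENOW_PW}" }
```

The throttle is the design choice that matters here: a failing sink logs errors continuously, and one incident per window is usually what the on-call engineer wants.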

Validation

Prove each hop before you trust it.

  1. Config is valid and live. vector validate in CI must pass, and on a running pod:

    kubectl exec -it -n observability statefulset/vector-aggregator -- vector top

    vector top shows live throughput per component; confirm from_agents is receiving and all three sinks show non-zero “sent”.

  2. Routing actually splits traffic. Emit a known error and a known access log, then check counts:

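    # Count error docs across the logs-app-* indices (needs a user with read access, not the write-only vector user).
    curl -s -u admin:"$ES_ADMIN_PW" "https://es.internal:9200/logs-app-*/_count?q=level:error"
    # Then in Kibana: an `error`/`warn`/audit doc IS present; a routine 200 access log is NOT.

    Errors and audit events must appear in Elasticsearch; routine 200s must be absent there, because route.searchable admits only error, warn, and audit events. Look for the routine lines in the S3 archive instead.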

  3. S3 archive is complete and partitioned.

    aws s3 ls s3://kv-logs-archive-prod/year=2026/month=06/day=10/ | head
    aws s3 cp s3://kv-logs-archive-prod/year=2026/month=06/day=10/<obj>.log.gz - | gunzip | head
    

    Every event — including the ones sampled out of Elasticsearch — must be present here.

  4. Loki has the metric-shaped lines and low label cardinality. In Grafana Explore, run {app="checkout"} and confirm series count is sane (not thousands).

  5. Backpressure holds. Scale Elasticsearch to zero replicas for two minutes; confirm Loki and S3 keep flowing (vector top), vector_buffer_events for the ES sink climbs, and when ES returns the backlog drains with zero gaps in the S3 archive. This is the test that proves the design.

Rollback / teardown

Because everything is GitOps-managed, rollback is a revert, not a scramble.

Common pitfalls

Security notes

Vector authenticates to each sink with a scoped, least-privilege credential: the Elasticsearch role can only write to the logs-app-* and logs-audit-* indices, the S3 IAM policy is PutObject-only with no read or delete, and the Loki token is tenant-scoped. The Loki token, the Elasticsearch password, and any S3 fallback keys live in HashiCorp Vault and are injected as env vars at deploy time; none appear in Git or in a ConfigMap. Redact PII in VRL before fan-out (the email-masking transform above) so personal data never reaches a searchable index. CrowdStrike Falcon protects the nodes at runtime and feeds its own events into the pipeline; Wiz continuously scans the S3 bucket and IAM for posture drift (a bucket gone public, a policy gone broad), and the bucket should enforce SSE-KMS encryption and Block Public Access. Operators reach Grafana, Kibana, and AWS through Okta / Entra ID SSO with MFA; Vector never holds a human credential.

Cost notes

The savings come from routing, not from a cheaper vendor. Sampling routine events and abort-ing health checks in VRL cuts volume before it costs anything. Sending only error/warn/audit events to Elasticsearch, the most expensive sink per GB, while S3 holds the cheap, gzip-compressed, lifecycle-tiered (Glacier after 30 days) full archive, is where the index-volume reduction comes from. Loki, whose cost scales with ingested volume and label cardinality, takes only the metric-shaped lines that drive dashboards, so Grafana stops querying Elasticsearch. The replayability bonus: because S3 holds everything, you can re-feed an archive day through a one-off Vector pipeline into Elasticsearch on demand instead of paying to keep cold data hot. Track per-sink throughput in Dynatrace / Datadog to see the volume split in numbers and tune the sample rate against your actual bill.
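A one-off replay pipeline could look roughly like the sketch below. It assumes the archived objects contain one JSON event per line, that the file name replay.yaml is yours to choose, and that the replay runs with separate read credentials for the bucket, since the pipeline's own IAM policy is PutObject-only.

```yaml
# replay.yaml: one-off replay sketch (file name and archive layout are assumptions)
sources:
  archive_in:
    type: stdin
    decoding: { codec: json }        # one archived JSON event per line
transforms:
  restore_time:
    type: remap
    inputs: [archive_in]
    source: |
      # Archived timestamps are strings; turn them back into real timestamps
      # so the daily index template uses the original event date.
      .timestamp = parse_timestamp(.timestamp, "%+") ?? now()
      .replayed = true
sinks:
  es_replay:
    type: elasticsearch
    inputs: [restore_time]
    endpoints: ["https://es.internal:9200"]
    auth: { strategy: basic, user: "vector", password: "${ES_VECTOR_PW}" }
    bulk: { index: "logs-app-%Y.%m.%d" }
```

To use it, stream an archived object through gunzip into vector --config replay.yaml; Vector exits once stdin closes. The .replayed marker makes it easy to find and delete the re-fed documents afterwards.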

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
