Free-text logs are where observability goes to die. The moment you need to count errors by tenant, alarm on p99 latency, or correlate a request across three services, a wall of printf strings forces you into brittle regex and full-text scans. This article builds the alternative end to end on AWS: a logging contract that emits JSON, CloudWatch metric filters and Logs Insights for analysis, and a Kinesis Data Firehose path that streams those same logs into OpenSearch in real time — encrypted, least-privilege, and cost-aware.
1. Why structured JSON beats free-text
The core idea is that a log line is a structured event, not a sentence. Once every field is addressable, querying becomes selection instead of parsing, and CloudWatch can index discovered fields automatically.
Pick a logging contract and enforce it across every service. A minimal but production-grade shape:
{
  "timestamp": "2026-04-14T09:21:04.512Z",
  "level": "ERROR",
  "service": "checkout",
  "env": "prod",
  "trace_id": "1-66134a30-1f2c3d4e5f6a7b8c9d0e1f2a",
  "request_id": "8f3c1b2a-0d4e-4a6b-9c8d-7e6f5a4b3c2d",
  "tenant_id": "acme",
  "route": "POST /orders",
  "status_code": 502,
  "latency_ms": 1840,
  "message": "upstream payment gateway timed out"
}
Three rules make or break this contract:
- A correlation ID on every line. Use the X-Ray trace_id if you run X-Ray, plus a request_id you generate at the edge and propagate. This is what turns a pile of lines into a request narrative.
- Stable field names and types. status_code is always a number, level is always one of a fixed set. OpenSearch infers types on first write; a field that is sometimes a string and sometimes a number will cause mapping conflicts and rejected documents downstream.
- Bounded cardinality. tenant_id is fine as a field; a raw user email or a full URL with query string is not, because it explodes metric-filter dimensions and OpenSearch field data.
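One way to keep the "stable types" rule from depending on goodwill is a small check that runs in unit tests or CI against sample events. The sketch below is illustrative only: the field list is taken from the sample event above, while the `LEVELS` set and the helper name `contract_violations` are assumptions, not part of any AWS API.

```python
from numbers import Integral

# Assumed contract: field name -> required Python type (from the sample event).
CONTRACT = {
    "timestamp": str,
    "level": str,
    "service": str,
    "env": str,
    "request_id": str,
    "status_code": Integral,
    "latency_ms": Integral,
    "message": str,
}
# Assumed level set; the contract only requires that it be fixed.
LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def contract_violations(event: dict) -> list[str]:
    """Return readable contract violations; an empty list means the event conforms."""
    problems = []
    for field, expected in CONTRACT.items():
        if field not in event:
            problems.append(f"missing field: {field}")
            continue
        value = event[field]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field} should be {expected.__name__}, got {type(value).__name__}"
            )
    level = event.get("level")
    if isinstance(level, str) and level not in LEVELS:
        problems.append(f"level {level!r} is not one of {sorted(LEVELS)}")
    return problems
```

Running it over a fixture of representative log lines per service catches a string-typed status_code before it ever reaches a shared index.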
CloudWatch automatically discovers fields in JSON logs and exposes them to Logs Insights as service, status_code, and so on. With free-text you would parse them out by hand on every query. That single difference is why the contract pays for itself within a week.
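If a service does not already use a structured logging library, the contract can be emitted with the standard library alone. This is a minimal sketch, assuming per-request fields are passed through `extra={"fields": {...}}`; the class and function names (`ContractFormatter`, `build_logger`) are made up for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class ContractFormatter(logging.Formatter):
    """Render each record as one JSON object per line, following the contract."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        doc = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Python calls the level WARNING; the contract in this article uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "env": self.env,
            "message": record.getMessage(),
        }
        # Per-request context (request_id, status_code, ...) arrives via `extra`.
        doc.update(getattr(record, "fields", {}))
        return json.dumps(doc)


def build_logger(service: str, env: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)  # stderr when stream is None
    handler.setFormatter(ContractFormatter(service, env))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", "prod")
    log.error(
        "upstream payment gateway timed out",
        extra={"fields": {"request_id": "8f3c1b2a", "status_code": 502, "latency_ms": 1840}},
    )
```

Keeping service and env on the formatter rather than on each call means individual log statements only supply the fields that vary per request.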
2. Getting logs into CloudWatch
The destination is always a CloudWatch log group containing log streams. How you fill it depends on the compute.
Lambda. Anything you write to stdout/stderr lands in /aws/lambda/<function-name>. Just emit JSON — most runtimes’ structured loggers (Powertools for AWS Lambda, for example) already do. Set the format to JSON so platform fields are structured too:
aws lambda update-function-configuration \
--function-name checkout \
--logging-config LogFormat=JSON,ApplicationLogLevel=INFO,SystemLogLevel=WARN
EC2 / on-prem with the CloudWatch agent. The unified agent tails files and ships them. A minimal config:
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/checkout.json",
            "log_group_name": "/app/checkout",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  }
}
Start it with the config you just wrote:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json -s
ECS on Fargate. The simple path is the awslogs driver, which sends each container’s stdout straight to a log group:
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/checkout",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "checkout"
    }
  }
}
When you need routing or parsing at the source — sending some streams to CloudWatch and others to S3, or attaching metadata — switch the log driver to awsfirelens, which runs a Fluent Bit sidecar. FireLens is the right tool when one log driver’s worth of configuration is not enough; for a single JSON stream to CloudWatch, awslogs is less to operate.
Whichever path you choose, the output is identical: JSON events in a log group, ready for the rest of the pipeline.
3. Querying with CloudWatch Logs Insights
Logs Insights is the interactive query layer. Because the logs are JSON, fields are already parsed and you go straight to filter and stats. Error count by route over the queried window:
fields @timestamp, route, status_code
| filter status_code >= 500
| stats count(*) as errors by route
| sort errors desc
Latency percentiles, which free-text logging cannot give you at all without first extracting the number:
fields latency_ms, route
| filter ispresent(latency_ms)
| stats
pct(latency_ms, 50) as p50,
pct(latency_ms, 90) as p90,
pct(latency_ms, 99) as p99
by route
Trace a single request across whatever streams it touched:
fields @timestamp, service, level, message
| filter request_id = "8f3c1b2a-0d4e-4a6b-9c8d-7e6f5a4b3c2d"
| sort @timestamp asc
Run a query from the CLI and fetch the result. Queries run asynchronously, so repeat get-query-results until its status is Complete:
qid=$(aws logs start-query \
--log-group-name /ecs/checkout \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, route, status_code | filter status_code >= 500 | stats count(*) by route' \
--query 'queryId' --output text)
aws logs get-query-results --query-id "$qid"
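Because a query may still be running when results are first fetched, scripted use usually needs a polling loop. The sketch below assumes `logs` is a boto3 CloudWatch Logs client (for example `boto3.client("logs")`) and uses its `start_query` and `get_query_results` calls; the function name, poll interval, and poll limit are illustrative choices.

```python
import time

# Statuses after which polling again cannot change the outcome.
TERMINAL = {"Complete", "Failed", "Cancelled", "Timeout"}


def run_insights_query(logs, log_group, query, start, end, poll_s=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it reaches a terminal status.

    `start` and `end` are epoch seconds. Returns one dict per result row.
    """
    qid = logs.start_query(
        logGroupName=log_group, startTime=start, endTime=end, queryString=query
    )["queryId"]
    for _ in range(max_polls):
        resp = logs.get_query_results(queryId=qid)
        if resp["status"] in TERMINAL:
            break
        time.sleep(poll_s)
    else:
        raise TimeoutError(f"query {qid} still running after {max_polls} polls")
    if resp["status"] != "Complete":
        raise RuntimeError(f"query {qid} ended with status {resp['status']}")
    # Each row is a list of {"field": ..., "value": ...} pairs; flatten to a dict.
    return [{cell["field"]: cell["value"] for cell in row} for row in resp["results"]]
```

Passing the client in as an argument keeps the function easy to exercise with a stub in tests, and values come back as strings, so numeric columns need converting before arithmetic.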
Save the queries your team reruns (error rate, slow routes, a specific tenant’s traffic) so they are one click away during an incident instead of retyped from memory.
Cross-group searches matter at scale. You can pass several groups in one call with --log-group-names (the singular --log-group-name accepts only one), or query a log group field index. For fleet-wide investigations, organize related groups so a single query spans the whole service.
4. Metric filters: alarms straight from log fields
Logs Insights is for humans asking questions. For machines watching continuously, use metric filters, which turn matching log events into CloudWatch metrics with no extra infrastructure — no Lambda, no scrape job.
Create a metric that counts 5xx responses, reading the value directly from the JSON field:
aws logs put-metric-filter \
--log-group-name /ecs/checkout \
--filter-name checkout-5xx \
--filter-pattern '{ $.status_code >= 500 }' \
--metric-transformations \
metricName=Checkout5xx,metricNamespace=App/Checkout,metricValue=1,defaultValue=0
That { ... } syntax is the JSON metric-filter dialect: $.status_code addresses the field by JSON path. Setting defaultValue=0 is what makes the metric continuous — without it, the metric is sparse and alarms misbehave on the missing-data path.
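The effect of defaultValue is easiest to see on a toy timeline. The following is a simplified local model, not CloudWatch itself: it assumes the 5xx pattern above, treats each inner list as one aggregation period, and uses `None` to stand for "no data point". It also reflects the documented detail that the default is only reported for periods in which logs were ingested but nothing matched.

```python
def simulate_metric_filter(periods, default_value=None):
    """Approximate the Sum a 5xx metric filter reports for each period.

    `periods` is a list of periods, each a list of parsed JSON log events.
    Returns one value per period; None stands for "no data point".
    """
    series = []
    for events in periods:
        matches = sum(
            1
            for e in events
            if isinstance(e.get("status_code"), (int, float)) and e["status_code"] >= 500
        )
        if matches:
            series.append(matches)        # metricValue=1 per matching event
        elif events:
            series.append(default_value)  # logs arrived, nothing matched
        else:
            series.append(None)           # nothing ingested: no data point either way
    return series


if __name__ == "__main__":
    timeline = [[{"status_code": 502}, {"status_code": 200}], [{"status_code": 200}], []]
    print(simulate_metric_filter(timeline))                   # sparse metric
    print(simulate_metric_filter(timeline, default_value=0))  # continuous while logs flow
```

The last period is the reason the alarm's missing-data setting still matters even with a default value: a completely silent log group produces no data points at all.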
You can also emit a field’s value as the metric, not just a count. Publish latency so you can alarm on it and graph it cheaply:
aws logs put-metric-filter \
--log-group-name /ecs/checkout \
--filter-name checkout-latency \
--filter-pattern '{ $.latency_ms = * }' \
--metric-transformations \
metricName=CheckoutLatencyMs,metricNamespace=App/Checkout,metricValue='$.latency_ms'
Now wire an alarm. Alarm when 5xx count crosses a threshold over five minutes:
aws cloudwatch put-metric-alarm \
--alarm-name checkout-5xx-high \
--namespace App/Checkout --metric-name Checkout5xx \
--statistic Sum --period 300 --evaluation-periods 1 \
--threshold 10 --comparison-operator GreaterThanThreshold \
--treat-missing-data notBreaching \
--alarm-actions arn:aws:sns:us-east-1:111122223333:oncall
This is the cheapest reliable alerting you can build on AWS: the log you already emit becomes a metric, and the metric becomes a page. Reserve metric filters for the handful of signals you alarm on — every distinct dimension combination is a separate custom metric with its own cost — and leave exploratory slicing to Logs Insights.
5. Subscription filters and Kinesis Data Firehose
Metric filters give you numbers. To get the events themselves out of CloudWatch in real time — into OpenSearch for search and dashboards, or S3 for archive — use a subscription filter, which pushes matching log events to a destination as they arrive. The destination here is a Kinesis Data Firehose delivery stream.
First, an IAM role CloudWatch Logs can assume to write to Firehose. Trust policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "logs.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
Create the subscription filter pointing at the delivery stream. An empty filter pattern forwards every event; narrow it to cut volume and cost:
aws logs put-subscription-filter \
--log-group-name /ecs/checkout \
--filter-name to-opensearch \
--filter-pattern '{ $.level = "ERROR" || $.level = "WARN" }' \
--destination-arn arn:aws:firehose:us-east-1:111122223333:deliverystream/checkout-logs \
--role-arn arn:aws:iam::111122223333:role/CWLtoFirehoseRole
Records arriving at Firehose from a subscription filter are gzip-compressed and base64-encoded, and a single record can contain multiple log events. That is exactly why you need a transform before indexing, which the next section covers. Also note that a log group allows only a small, fixed number of subscription filters (two at the time of writing), so plan one Firehose fan-out rather than many overlapping subscriptions.
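To make the envelope concrete, the sketch below builds a record in the shape a transform Lambda receives and then reverses the encoding. It is a test fixture, not captured output: the stream name, event ids, and timestamps are invented, and a real payload carries additional keys (such as the owning account and the subscription filter names) that are omitted here.

```python
import base64
import gzip
import json


def make_subscription_record(record_id, messages, log_group="/ecs/checkout"):
    """Build one transform input record shaped like a CloudWatch Logs delivery."""
    payload = {
        "messageType": "DATA_MESSAGE",
        "logGroup": log_group,
        "logStream": "checkout/app/0001",  # hypothetical stream name
        "logEvents": [
            # ids and timestamps are arbitrary placeholder values
            {"id": str(i), "timestamp": 1700000000000 + i, "message": m}
            for i, m in enumerate(messages)
        ],
    }
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode())).decode()
    return {"recordId": record_id, "data": data}


def decode_subscription_record(record):
    """Reverse the encoding: base64 -> gzip -> JSON envelope."""
    return json.loads(gzip.decompress(base64.b64decode(record["data"])))


if __name__ == "__main__":
    rec = make_subscription_record("r1", ['{"level": "ERROR"}', '{"level": "WARN"}'])
    envelope = decode_subscription_record(rec)
    print(envelope["messageType"], len(envelope["logEvents"]))
```

A fixture like this lets you run a transform handler locally with `{"records": [rec]}` before attaching it to a delivery stream.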
6. Transform and deliver to OpenSearch
Firehose can deliver to an Amazon OpenSearch Service domain directly, but raw CloudWatch records are not index-ready. Attach a Lambda transform that decompresses and flattens each record into one JSON document per log event.
The transform must return each record with a recordId, a result of Ok/Dropped/ProcessingFailed, and base64 data. A Python sketch:
import base64
import gzip
import json


def handler(event, _ctx):
    out = []
    for rec in event["records"]:
        # Each record is base64-encoded, gzip-compressed CloudWatch Logs JSON.
        payload = json.loads(gzip.decompress(base64.b64decode(rec["data"])))
        if payload.get("messageType") == "CONTROL_MESSAGE":
            out.append({"recordId": rec["recordId"], "result": "Dropped"})
            continue
        lines = []
        for ev in payload.get("logEvents", []):
            try:
                doc = json.loads(ev["message"])
            except json.JSONDecodeError:
                doc = {"message": ev["message"]}
            if not isinstance(doc, dict):
                # Valid JSON that is not an object (e.g. a bare number or string).
                doc = {"message": ev["message"]}
            doc["@timestamp"] = ev["timestamp"]  # epoch ms from CloudWatch
            doc["log_group"] = payload["logGroup"]
            lines.append(json.dumps(doc))
        data = ("\n".join(lines) + "\n").encode()
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode(),
        })
    return {"records": out}
Two correctness details that bite people: drop CONTROL_MESSAGE records (CloudWatch Logs sends them to check that the destination is reachable), and return an entry for every input record, because a missing recordId fails the whole batch.
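The second rule is cheap to assert in a unit test. The helper below is a hypothetical check, not part of any AWS SDK; it compares a transform's response against its input event and lists anything that breaks the response shape described above.

```python
VALID_RESULTS = {"Ok", "Dropped", "ProcessingFailed"}


def transform_response_problems(event: dict, response: dict) -> list[str]:
    """List the ways a transform response breaks the record contract."""
    returned = response.get("records", [])
    expected_ids = {r["recordId"] for r in event["records"]}
    returned_ids = {r.get("recordId") for r in returned}

    problems = []
    missing = expected_ids - returned_ids
    if missing:
        problems.append(f"missing recordIds: {sorted(missing)}")
    for rec in returned:
        rid = rec.get("recordId")
        if rec.get("result") not in VALID_RESULTS:
            problems.append(f"{rid}: invalid result {rec.get('result')!r}")
        elif rec["result"] == "Ok" and "data" not in rec:
            problems.append(f"{rid}: result is Ok but data is missing")
    return problems
```

In a test, call the handler with a fixture event and assert that this function returns an empty list.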
Now create the delivery stream wired to OpenSearch with the transform and buffering. The key knobs are buffering (deliver when the buffer hits a size in MB or an interval in seconds, whichever first) and an index rotation period so indices stay a manageable size:
aws firehose create-delivery-stream \
  --delivery-stream-name checkout-logs \
  --delivery-stream-type DirectPut \
  --amazonopensearchservice-destination-configuration '{
    "RoleARN": "arn:aws:iam::111122223333:role/FirehoseToOpenSearch",
    "DomainARN": "arn:aws:es:us-east-1:111122223333:domain/logs",
    "IndexName": "checkout-logs",
    "IndexRotationPeriod": "OneDay",
    "BufferingHints": { "SizeInMBs": 5, "IntervalInSeconds": 60 },
    "S3BackupMode": "FailedDocumentsOnly",
    "ProcessingConfiguration": {
      "Enabled": true,
      "Processors": [{
        "Type": "Lambda",
        "Parameters": [{
          "ParameterName": "LambdaArn",
          "ParameterValue": "arn:aws:lambda:us-east-1:111122223333:function:cwl-transform"
        }]
      }]
    },
    "S3Configuration": {
      "RoleARN": "arn:aws:iam::111122223333:role/FirehoseToOpenSearch",
      "BucketARN": "arn:aws:s3:::checkout-logs-backup"
    }
  }'
S3BackupMode set to FailedDocumentsOnly is non-negotiable for a log pipeline: any document OpenSearch rejects (almost always a mapping conflict) lands in S3 instead of vanishing, so you can inspect and replay it. Smaller buffers mean fresher data and more, smaller PUTs; larger buffers mean cheaper delivery and more latency. For a logging pipeline, 60s / 5 MB is a sane default — tune toward larger if cost dominates, smaller if you live in the dashboards during incidents.
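The "whichever first" rule turns into simple arithmetic once you know your ingest rate. This back-of-the-envelope sketch assumes a steady rate and ignores everything else in the delivery path (transform time, retries); the function name and return shape are illustrative.

```python
def flush_profile(ingest_mb_per_s: float, size_mb: float = 5, interval_s: float = 60) -> dict:
    """Estimate which buffering hint fires first for a steady ingest rate."""
    if ingest_mb_per_s <= 0:
        raise ValueError("ingest rate must be positive")
    seconds_to_fill = size_mb / ingest_mb_per_s
    size_first = seconds_to_fill < interval_s
    flush_every_s = seconds_to_fill if size_first else interval_s
    return {
        "trigger": "size" if size_first else "interval",
        "flush_every_s": flush_every_s,
        "flushes_per_hour": 3600 / flush_every_s,
    }


if __name__ == "__main__":
    print(flush_profile(0.01))  # quiet stream: the 60 s interval fires first
    print(flush_profile(1.0))   # busy stream: the 5 MB size limit fires first
```

For a quiet stream the interval dominates, so raising SizeInMBs changes nothing; for a busy one the size limit dominates, and the interval only sets a worst-case delay during lulls.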
7. Retention and cost
Logs are cheap to write and expensive to forget about. Three levers control the bill.
Log class. CloudWatch offers a Standard class and an Infrequent Access class. Infrequent Access costs less to ingest but supports a reduced feature set (notably it does not support metric filters or subscription filters). Keep alarm-driving and OpenSearch-bound groups on Standard; consider Infrequent Access for high-volume, rarely-queried debug logs you keep only for occasional Insights queries.
Retention. The default is never expire, which is a quiet, unbounded cost. Set it explicitly on every group:
aws logs put-retention-policy \
--log-group-name /ecs/checkout \
--retention-in-days 30
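Finding the groups that were missed is a one-liner once you have the group descriptions. The sketch assumes its input is the `logGroups` list returned by the CloudWatch Logs DescribeLogGroups API, where a group with no retention policy simply has no `retentionInDays` key; the function name is made up.

```python
def groups_without_retention(log_groups: list[dict]) -> list[str]:
    """Names of log groups that never expire (no retentionInDays set)."""
    return sorted(g["logGroupName"] for g in log_groups if "retentionInDays" not in g)


if __name__ == "__main__":
    sample = [
        {"logGroupName": "/ecs/checkout", "retentionInDays": 30},
        {"logGroupName": "/ecs/payments"},
        {"logGroupName": "/aws/lambda/cwl-transform"},
    ]
    print(groups_without_retention(sample))
    # With boto3 (not imported here), the input could be gathered like this:
    #   pages = boto3.client("logs").get_paginator("describe_log_groups").paginate()
    #   groups = [g for page in pages for g in page["logGroups"]]
```

Run on a schedule, a check like this catches groups that Lambda or ECS created implicitly before anyone set a policy.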
Archive tier. CloudWatch is not a cheap long-term store. Keep 14-30 days hot in CloudWatch and OpenSearch for incident response, and let the Firehose-to-S3 path (or a dedicated S3 destination) hold the cold copy. Apply S3 lifecycle rules to transition to Glacier-class storage for multi-year retention at a fraction of the cost, and query it with Athena when audit time comes.
A note on Insights cost: Logs Insights bills by data scanned per query. Always bound the time range, filter early before you stats, and prefer saved narrow queries over fields @message across a week of a chatty service.
| Tier | Lives in | Typical window | Use |
|---|---|---|---|
| Hot | CloudWatch + OpenSearch | 14-30 days | Alarms, dashboards, incident response |
| Warm | OpenSearch UltraWarm (optional) | 30-90 days | Slower interactive search |
| Cold | S3 (+ Glacier classes) | months to years | Compliance, audit, replay |
8. Locking it down
Logs contain your most sensitive runtime data. Three controls are mandatory.
KMS encryption at rest. Encrypt the log group with a customer-managed key. The key policy must allow the CloudWatch Logs service principal to use it:
aws logs associate-kms-key \
--log-group-name /ecs/checkout \
--kms-key-id arn:aws:kms:us-east-1:111122223333:key/abcd-1234
Encrypt the OpenSearch domain, the S3 backup bucket, and the Firehose stream with KMS as well so the data is covered along the entire path.
Least-privilege IAM. The Firehose delivery role should grant only what delivery needs — es:ESHttpPost/es:ESHttpPut scoped to the one domain, s3:PutObject scoped to the backup bucket, lambda:InvokeFunction scoped to the transform, and the kms actions for the specific keys. No wildcards on resources. The CloudWatch-to-Firehose role should allow only firehose:PutRecord/firehose:PutRecordBatch on the single delivery stream.
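Generating the policy documents from the four ARNs makes the scoping reviewable and testable. The sketch below encodes only the actions named above and is not a complete, verified policy: a real delivery role typically needs a few more read and describe actions, and the two kms actions shown are an assumption, since the text does not list them.

```python
import json


def firehose_delivery_policy(domain_arn, bucket_arn, transform_arn, key_arn):
    """Policy for the Firehose delivery role, limited to the actions named in the text."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["es:ESHttpPost", "es:ESHttpPut"],
             "Resource": f"{domain_arn}/*"},
            {"Effect": "Allow", "Action": "s3:PutObject",
             "Resource": f"{bucket_arn}/*"},
            {"Effect": "Allow", "Action": "lambda:InvokeFunction",
             "Resource": transform_arn},
            # Assumption: the exact kms actions depend on what the key protects.
            {"Effect": "Allow", "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
             "Resource": key_arn},
        ],
    }


def logs_to_firehose_policy(stream_arn):
    """Policy for the role CloudWatch Logs assumes to write to one delivery stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
             "Resource": stream_arn},
        ],
    }


def unscoped_statements(policy):
    """Return statements whose Resource is, or includes, a bare "*"."""
    bad = []
    for stmt in policy["Statement"]:
        res = stmt["Resource"]
        resources = [res] if isinstance(res, str) else res
        if "*" in resources:
            bad.append(stmt)
    return bad


if __name__ == "__main__":
    policy = firehose_delivery_policy(
        "arn:aws:es:us-east-1:111122223333:domain/logs",
        "arn:aws:s3:::checkout-logs-backup",
        "arn:aws:lambda:us-east-1:111122223333:function:cwl-transform",
        "arn:aws:kms:us-east-1:111122223333:key/abcd-1234",
    )
    assert unscoped_statements(policy) == []
    print(json.dumps(policy, indent=2))
```

A CI step that fails when `unscoped_statements` returns anything is a simple guard against a deadline-driven wildcard slipping in.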
Redact before it leaves the app. The cheapest place to handle PII is the logger. Never log raw tokens, full PANs, or passwords. For defense in depth, CloudWatch Logs data protection policies can detect and mask sensitive data identifiers (emails, credentials, and similar) at ingest, and the Firehose transform Lambda is a second chokepoint where you can drop or hash a field before it reaches OpenSearch. Layer all three rather than trusting any one.
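A redaction step can be the same small function in both the logger and the transform Lambda. This is a sketch under stated assumptions: the field names in `DROP_FIELDS` and `HASH_FIELDS` are hypothetical and must be replaced with the sensitive fields your services actually emit, and the keyed hash (HMAC-SHA256) is one reasonable choice for keeping values correlatable without storing them in the clear.

```python
import hashlib
import hmac

# Hypothetical field names; adapt to your own contract.
DROP_FIELDS = frozenset({"password", "authorization", "card_number"})
HASH_FIELDS = frozenset({"email"})


def redact(doc: dict, key: bytes, drop=DROP_FIELDS, hash_fields=HASH_FIELDS) -> dict:
    """Return a copy of `doc` with dropped fields removed and hashed fields replaced."""
    clean = {}
    for name, value in doc.items():
        if name in drop:
            continue  # never forward these at all
        if name in hash_fields and value is not None:
            # Keyed hash: stable for correlation, not reversible without the key.
            clean[name] = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
        else:
            clean[name] = value
    return clean
```

The function only inspects top-level keys, so nested objects would need a recursive variant, and the key should come from a secret store rather than source code.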
Enterprise scenario
A fintech platform team ran this exact pipeline across ~140 ECS services. Three weeks after launch, OpenSearch ingestion fell off a cliff for one domain while Firehose still reported ACTIVE. The FailedDocumentsOnly prefix in S3 was filling up: a newly onboarded service emitted status_code as the string "502" instead of a number, and OpenSearch had already inferred long for that field from the first write. Every document from that service after the rotation was a mapping conflict and got dropped. The contract said “numbers stay numbers” — but nothing enforced it, so a single team’s logger shipped strings.
The fix had two parts. Short term, they made the Firehose transform Lambda coerce the known-numeric fields instead of trusting upstream, so one bad emitter could not poison the index:
for k in ("status_code", "latency_ms"):
    v = doc.get(k)
    if isinstance(v, str):
        try:
            doc[k] = int(v)  # "502" -> 502
        except ValueError:
            pass  # leave non-numeric strings for the index to reject
Long term, they stopped relying on dynamic mapping. They created an index template that pins types and rejects surprises, applied before any data lands:
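curl -XPUT "https://<domain>/_index_template/checkout-logs" -H 'Content-Type: application/json' -d '{
  "index_patterns": ["checkout-logs-*"],
  "template": { "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp":  { "type": "date" },
      "timestamp":   { "type": "date" },
      "level":       { "type": "keyword" },
      "service":     { "type": "keyword" },
      "env":         { "type": "keyword" },
      "trace_id":    { "type": "keyword" },
      "request_id":  { "type": "keyword" },
      "tenant_id":   { "type": "keyword" },
      "route":       { "type": "keyword" },
      "log_group":   { "type": "keyword" },
      "status_code": { "type": "integer" },
      "latency_ms":  { "type": "integer" },
      "message":     { "type": "text" }
    }
  }}
}'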
With dynamic: strict, a stray new field now fails loudly into S3 backup at onboarding time — in a test deploy, not silently in prod a month later.
Verify
Confirm each stage end to end:
# Logs are arriving and are valid JSON
aws logs tail /ecs/checkout --since 5m --format short
# Metric filter is producing data points
aws cloudwatch get-metric-statistics \
--namespace App/Checkout --metric-name Checkout5xx \
--start-time $(date -d '15 min ago' -u +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 60 --statistics Sum
# Firehose is delivering, not erroring
aws firehose describe-delivery-stream \
--delivery-stream-name checkout-logs \
--query 'DeliveryStreamDescription.DeliveryStreamStatus'
# Documents are landing in OpenSearch
curl -s "https://<domain-endpoint>/checkout-logs-*/_count" | jq .
If _count is flat but Firehose is ACTIVE, check the S3 FailedDocumentsOnly prefix — mapping conflicts are the usual culprit, and the rejected document there tells you exactly which field changed type.
Pitfalls
- Mapping conflicts silently drop documents. A field that is a number in one service and a string in another will be rejected by OpenSearch. Enforce types in the contract, and always enable FailedDocumentsOnly so rejects are recoverable instead of lost.
- Sparse metrics break alarms. Without defaultValue=0, a metric filter only emits on matches, and treat-missing-data becomes load-bearing. Set both deliberately.
- Forgetting control messages. A transform that does not drop CONTROL_MESSAGE records, or omits a recordId, fails entire Firehose batches, and the failure is not obvious from the OpenSearch side.
- Never-expire retention. The default is unbounded storage cost. Set retention on every group at creation, ideally in your IaC so it cannot be missed.
- Wildcard IAM on the delivery role. It is tempting under deadline; scope every action to the specific domain, bucket, key, and function instead.