Change Data Capture with DynamoDB Streams: Lambda Triggers, EventBridge Pipes, and Exactly-Once Processing

Change data capture off DynamoDB looks deceptively simple — turn on a stream, point a Lambda at it, ship the deltas somewhere. The trouble starts the first time a single poison record wedges a shard, an OpenSearch index drifts out of sync, or someone discovers that “exactly-once” was never a property the stream promised. DynamoDB Streams give you an ordered, at-least-once feed of item-level mutations. Everything downstream — ordering across reprocessing, deduplication, partial-batch recovery — is yours to engineer. This is how to build that pipeline so it survives throttles, redrives, and the occasional malformed item, using Lambda event source mappings where you need code and EventBridge Pipes where you do not.

The reason this is hard is that the failure modes are invisible until load. A pipeline that processes ten records a minute in dev never reveals that you have no idempotency, no bisect, no DLQ — and then a flash sale drives one partition key hot, the iterator age on that shard climbs past forty minutes, a malformed item silently blocks every update behind it, and your “real-time” search index is now lying about stock. The platform did exactly what you configured; you simply never configured it for the bad day. So this article treats DynamoDB CDC the way a senior engineer learns it the hard way: not “how do I move a change” but “how do I move a change such that it survives replay, isolates poison, preserves per-key order, and never silently drops data.”

By the end you will know which of the four delivery surfaces (native Stream + Lambda ESM, native Stream + EventBridge Pipes, Kinesis Data Streams for DynamoDB, or a glue Lambda fan-out) fits a given freshness/retention/ordering requirement; exactly which event-source-mapping knobs (BatchSize, ParallelizationFactor, MaximumRecordAgeInSeconds, BisectBatchOnFunctionError, ReportBatchItemFailures) prevent a stalled shard; and how to make every downstream write idempotent so at-least-once delivery is a non-event. Because this is a reference you will return to mid-incident, the knobs, the error/limit numbers, the view types, and the symptom→cause→confirm→fix playbook are all laid out as scannable tables — read the prose once, then keep the tables open when the iterator age starts climbing.

What problem this solves

A relational database hands you a transaction log and a dozen mature CDC tools that tail it. DynamoDB hands you a Stream: an ordered, append-only feed of every INSERT, MODIFY, and REMOVE, retained for 24 hours, partitioned to match your table’s shards. That is a genuine gift — you get item-level deltas with per-partition-key ordering and no triggers to write — but it is a primitive, not a pipeline. The delivery contract is at-least-once, the retention is hard-capped at 24 hours, and the ordering guarantee is per partition key only, never table-wide. Everything that makes a production CDC pipeline trustworthy sits in the gap between that primitive and your sinks.

What breaks without this knowledge is depressingly consistent. A team points a single Lambda at the stream with default settings, fans out to OpenSearch + S3 + a warehouse from that one function, and ships it. Then: (1) a malformed item throws, the whole batch retries forever, and the shard stalls — every change behind the poison record is blocked for hours; (2) a redrive replays a batch and the search index double-applies a delete, or a stale replay overwrites a newer document; (3) one slow sink (OpenSearch under merge pressure) stalls the shared function, so the data lake and warehouse go stale too; (4) a record that needed reprocessing ages past the 24-hour window before anyone noticed, and it is gone for good. None of these are exotic. All of them are the default outcome of treating the stream as if it were exactly-once, infinitely retained, and globally ordered.

Who hits this: anyone projecting DynamoDB state into a search index, a cache, a data lake, an audit trail, or another service. It bites hardest on high-write, hot-key workloads (a single partition going hot starves the one-poller-per-shard model), multi-sink fan-out (coupling sinks behind one function), and first-time CDC builders who assume the stream gives them guarantees it explicitly does not. The fix is almost never “make the stream stronger” — the stream is what it is — it is “engineer idempotency, bisect, DLQ, and bus-level fan-out around it.” Here is the whole field at a glance: every guarantee the stream gives you, every one it withholds, and where you make up the difference.

Property	What the native Stream guarantees	What it does NOT guarantee	Where you close the gap
Delivery	At-least-once to the consumer	Exactly-once	Idempotent downstream write (conditional-on-`SequenceNumber`)
Ordering	Per partition key, within a shard lineage	Global / table-wide ordering	Design so consumers only assume per-key order
Retention	24 hours, rolling	Anything beyond 24h	S3 system-of-record via Firehose; one-time scan for backfill
Failure isolation	Records stay readable for 24h	Auto-skip of a poison record	`BisectBatchOnFunctionError` + on-failure destination
Partial success	— (whole-batch retry by default)	Per-record checkpointing	`ReportBatchItemFailures`
Fan-out	One stream, up to ~2 ESMs efficiently	Independent retry per sink from one Lambda	EventBridge bus/Pipe, one rule+DLQ per sink
Dedup across replay	—	Suppression of duplicates	Sink-native upsert + external version from `SequenceNumber`

Learning objectives

By the end of this article you can:

Choose between the four CDC delivery surfaces — native Stream + Lambda ESM, native Stream + EventBridge Pipes, Kinesis Data Streams for DynamoDB, and a glue-Lambda fan-out — by retention, ordering, freshness, and code-vs-config requirements.
Enable a stream with the correct StreamViewType for the consumer’s job, and explain what each of the four view types lets you compute (diffs, soft-delete capture, invalidation).
Tune a Lambda event source mapping deliberately: BatchSize, MaximumBatchingWindowInSeconds, ParallelizationFactor, MaximumRetryAttempts, MaximumRecordAgeInSeconds, and the starting position — and name the trade-off each one makes.
Stop a stalled shard by combining ReportBatchItemFailures (precise per-record checkpointing) with BisectBatchOnFunctionError (the safety net for unhandled exceptions), and read iterator age as the leading indicator.
Build filter-enrich-route pipelines with EventBridge Pipes — pushing selectivity into the filter stage and reshaping with a target transformer — and know when to keep a Lambda ESM instead.
Engineer exactly-once downstream semantics as an idempotent write: conditional-on-SequenceNumber in DynamoDB, external versioning in OpenSearch, deterministic pk#sk#seq keys elsewhere.
Operate the pipeline: alarm on iterator age / ApproximateAgeOfOldestRecord, evict poison records to a DLQ while they are still readable, and run a deliberate, reviewed redrive from shardId + sequence range.

Prerequisites & where this fits

You should already understand DynamoDB’s data model — partition keys, sort keys, items, and that a table’s throughput is sharded by partition key. Familiarity with the basics of Amazon DynamoDB, In Depth: Tables, Keys, Capacity Modes, Indexes & Streams is assumed — this article picks up where the “Streams” section there ends and goes deep on consuming them. You should be comfortable with AWS Lambda at the level of AWS Lambda, In Depth: Runtimes, Triggers, Layers, Concurrency & Every Setting, reading IAM-scoped CLI/Terraform, and JSON event shapes.

This sits in the event-driven / serverless integration track. It is downstream of table design — if your partition keys create hot shards, no consumer tuning saves you, so DynamoDB Single-Table Design: Modeling Access Patterns, GSIs, and Hot Partition Avoidance is upstream of everything here. It pairs tightly with Designing Event-Driven Architectures with Amazon EventBridge: Buses, Rules, Schemas, and Archive/Replay (the bus you fan out to) and Resilient Messaging with SQS and SNS: Fan-Out, FIFO Ordering, DLQs, and Poison-Message Handling (the DLQ and poison-handling patterns reused wholesale). The “land it in a lake” sink is the front half of A Structured Logging Pipeline on AWS: JSON Logs, CloudWatch Metric Filters, and Firehose to OpenSearch.

A quick map of who owns what during an incident, so you escalate to the right layer fast:

Layer	What lives here	Typical owner	Failure classes it causes
Table + key design	Partition keys, write distribution	Data / app team	Hot shard, throttles → consumer falls behind
Stream	Shards, view type, 24h retention	Platform + app	Records age out; wrong view type starves consumers
Event source mapping / Pipe	The managed poller + its knobs	Platform / app	Stalled shard, whole-batch retry, no DLQ
Processor (Lambda)	Your code, idempotency, error handling	App team	Poison records, non-idempotent writes
Sinks (OpenSearch / S3 / DW)	Durability, freshness, versioning	App + data team	Index drift, duplicate apply, small-object tax
Observability	IteratorAge, DLQ depth, throttles	Platform / SRE	Silent stalls nobody alarmed on

Core concepts

Six mental models make every later decision obvious.

A Stream is an ordered, append-only log, sharded like the table. Records for a given partition key always land in the same shard lineage and are delivered in commit order. That per-partition ordering is the single most important property to internalize — there is no global ordering across the table, only within a partition key. Shard topology mirrors the table’s partition structure and changes as the table splits/merges partitions, so your consumer must tolerate shards appearing and closing.

Delivery is at-least-once; “exactly-once” is something you build downstream. Redrives, bisect retries, and poller restarts all mean a sink can see the same change more than once. There is no setting that makes the stream exactly-once. The good news: every record hands you a stable item key and a monotonic SequenceNumber — exactly the tools an idempotent write needs.

Retention is 24 hours, hard. A record not consumed within 24 hours is gone. This drives two rules: set MaximumRecordAgeInSeconds below 24h so poison records are evicted to a DLQ while still readable, and treat S3 (fed via Firehose) as your replayable system of record once the window rolls off — the stream cannot backfill history.

The poller is managed; you tune it, you don’t write it. A Lambda event source mapping (ESM) is a poller AWS runs for you. You never write the GetShardIterator/GetRecords loop — you set BatchSize, ParallelizationFactor, batching window, retries, and failure handling, and the platform polls, batches, invokes, and checkpoints.

A stalled shard is the master failure mode. By default, if your function throws on a batch, the ESM retries the entire batch, and because the shard checkpoint has not advanced, it keeps retrying until success, age-out, or retry-exhaustion. One un-processable record blocks everything behind it. You see it as a climbing iterator age. Two mechanisms fix it: partial-batch responses and bisect-on-error.

Fan-out belongs at the bus, not in one function. If a single mega-Lambda writes to OpenSearch + S3 + a warehouse, one slow sink stalls the shard for all of them. Land changes on an EventBridge bus (often via a Pipe) and let independent rules deliver to each sink with its own retry budget and DLQ.

The vocabulary in one table

Pin down every moving part before the deep sections; the glossary repeats these for lookup.

Term	One-line definition	Where it lives	Why it matters to CDC
Stream	Ordered, append-only log of item changes	On the table (enable per table)	The source of every change
Shard	An ordered partition of the stream	Within the stream	Unit of ordering and parallelism
`StreamViewType`	What each record carries (keys/new/old/both)	Stream spec	Dictates what consumers can compute
`SequenceNumber`	Monotonic per-shard record id	On every record	The idempotency / versioning anchor
Event source mapping (ESM)	Managed Lambda poller	On the Lambda	Where you tune batch/parallelism/failure
`ParallelizationFactor`	Concurrent invokes per shard (1–10)	ESM config	Drains a hot shard without resharding
`ReportBatchItemFailures`	Per-record checkpoint contract	ESM + function response	Stops one bad record blocking good ones
Bisect-on-error	Split-and-retry a failing batch	ESM config	Isolates a poison record
On-failure destination / DLQ	Where failure context lands	SQS/SNS (ESM), SQS (Pipe)	Points you at records to redrive
EventBridge Pipes	Managed source→filter→enrich→target	Standalone resource	No-code filter-enrich-route
Iterator age	How far behind the consumer is	CloudWatch metric	The leading indicator of a stall
Kinesis Data Streams for DynamoDB	Alternative change feed	Kinesis stream you own	Long retention, many fan-out consumers

Delivery surfaces: native Streams, Pipes, Kinesis, or glue Lambda

Before any tuning, pick the right surface. Four options carry DynamoDB changes, and they differ on retention, ordering, dedup, freshness, and how much code you own. Choosing wrong here is the most expensive mistake in the article — you cannot tune your way out of a surface that doesn’t give you the retention or ordering you need.

Surface	Retention	Ordering	Dedup	Freshness	Code you own	Best for
Native Stream + Lambda ESM	24h	Per partition key (strict)	At-least-once (you dedupe)	Sub-second to seconds	The processor	Real compute, transactional multi-sink writes
Native Stream + EventBridge Pipes	24h	Per partition key (strict)	At-least-once	Seconds	Only enrichment (optional)	Filter-enrich-route with no poller code
Kinesis Data Streams for DynamoDB	Up to 1 year	Best-effort (reorder on seq)	May duplicate (you dedupe)	Seconds	Consumer(s)	Long retention, many independent fan-out consumers
Glue Lambda fan-out (one ESM, code fans out)	24h	Per partition key	At-least-once	Sub-second	The whole fan-out	Small scale, tight code control, few sinks

The native Stream’s headline advantage is strict per-key ordering and no duplication at the source; its costs are the 24-hour cap and a practical limit on how many consumers can read it efficiently. Kinesis Data Streams for DynamoDB inverts that: configurable retention up to a year and clean multi-consumer fan-out, at the price of best-effort ordering and possible duplicates — so you must reorder and dedupe on a sequence number. The trade is real and worth stating plainly:

Dimension	Native DynamoDB Stream	Kinesis Data Streams for DynamoDB
Max retention	24 hours	Up to 365 days (configurable)
Ordering	Strict per partition key	Best-effort; reorder on `ApproximateCreationDateTime`/seq
Duplicates	At-least-once (rare in practice)	Possible; dedupe required
Consumers	~2 efficient ESMs per stream	Many (enhanced fan-out, shared, Firehose, KCL)
Throughput model	Managed, tied to table	Provisioned/on-demand Kinesis shards
When to pick	Strict order, short window	Long replay, many sinks, analytics teams

Enable the Kinesis surface when you need any of: retention beyond 24 hours, several independent teams consuming the same changes, or direct Firehose delivery to a lake without a Lambda in the path. Stay on the native Stream when strict per-key ordering matters most or your consumer count is small. A quick decision table:

If you need…	It’s probably…	Choose
Strict per-key order, one or two consumers	A real-time projector	Native Stream + Lambda ESM
Filter + reshape + route, no poller code	A glue pipeline	Native Stream + EventBridge Pipes
Replay older than 24h / many fan-out consumers	An analytics backbone	Kinesis Data Streams for DynamoDB
A handful of sinks, tight code control, small scale	A bespoke processor	Glue Lambda fan-out
Both real-time projection AND long replay	A hybrid	Native Stream for projection + Kinesis for lake

Stream internals: shards, records, and view types

A DynamoDB Stream is an ordered, append-only log of item-level changes, retained for 24 hours, composed of shards whose topology mirrors the table’s partition structure. Enable it and choose a StreamViewType, which controls what each record carries:

aws dynamodb update-table \
  --table-name orders \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

resource "aws_dynamodb_table" "orders" {
  name             = "orders"
  hash_key         = "pk"
  billing_mode     = "PAY_PER_REQUEST"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
  attribute {
    name = "pk"
    type = "S"
  }
}

The four view types are not cosmetic — they dictate what your consumers can compute. For CDC you almost always want NEW_AND_OLD_IMAGES: the diff is what lets you route on transitions (“status went from PENDING to SHIPPED”) and reconstruct deletes — on a REMOVE event there is no new image, only the old one.

`StreamViewType`	Record contains	Record size impact	Use when	Cannot do
`KEYS_ONLY`	Key attributes only	Smallest	You re-read the item yourself, or only need invalidation signals	Compute diffs; reconstruct deletes
`NEW_IMAGE`	Item after the write	Medium	Projecting current state to a search index or cache	See what changed from; capture old value on delete
`OLD_IMAGE`	Item before the write	Medium	Audit “what changed from”, soft-delete capture	Project current state on insert/modify
`NEW_AND_OLD_IMAGES`	Both images	Largest	Computing diffs, conditional routing on field transitions	(nothing — fullest fidelity)

Each stream record carries an eventName of INSERT, MODIFY, or REMOVE, plus the keys, the requested image(s), the SequenceNumber, and (on TTL deletes) a userIdentity block. Know the record anatomy cold — it is what your filters and idempotency keys read:

Record field	Meaning	Present when	CDC use
`eventName`	`INSERT` / `MODIFY` / `REMOVE`	Always	Route by mutation type
`dynamodb.Keys`	The item’s key attributes (typed)	Always	The stable idempotency key
`dynamodb.NewImage`	Item after the write	Not on `REMOVE` (or `KEYS_ONLY`)	Project current state
`dynamodb.OldImage`	Item before the write	Not on `INSERT` (or `KEYS_ONLY`)	Diffs; reconstruct deletes
`dynamodb.SequenceNumber`	Monotonic per-shard id	Always	Idempotency + external version
`dynamodb.ApproximateCreationDateTime`	When the change happened	Always	Latency measurement; Kinesis reorder
`userIdentity`	Set to `dynamodb.amazonaws.com` for TTL	TTL deletes only	Filter housekeeping deletes
`eventSourceARN`	The stream ARN	Always	Redrive target

A subtle trap: a TTL deletion arrives as REMOVE with userIdentity.principalId = dynamodb.amazonaws.com and type = Service. If you mirror deletes downstream, filter TTL expiries explicitly or you replicate housekeeping you did not intend to.

The stream’s own limits are worth memorizing because several of them shape your tuning. These are the numbers that bite:

Limit / quota	Value	Consequence if you hit it
Retention	24 hours	Unconsumed records lost permanently
Max record size	~400 KB (item size)	Large items still fit; very wide items inflate batches
Stream view type changes	Requires disable + re-enable stream	New stream ARN; consumers must re-point
Efficient simultaneous readers	~2 processes per shard	More readers risk `GetRecords` throttling
Shard lineage	Splits/merges with table partitions	Consumer must handle new/closed shards
`GetRecords` throughput	5 calls/sec, up to 1,000 records or 1 MB	Over-reading throttles; ESM manages this for you

Consuming with Lambda: batch size, window, and parallelization

The native stream is consumed through a Lambda event source mapping — a managed poller AWS runs on your behalf. You tune it; you do not write the polling loop. Three knobs dominate throughput and latency:

aws lambda create-event-source-mapping \
  --function-name orders-cdc-processor \
  --event-source-arn "$STREAM_ARN" \
  --starting-position LATEST \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 4 \
  --maximum-retry-attempts 3 \
  --bisect-batch-on-function-error \
  --maximum-record-age-in-seconds 21600 \
  --function-response-types ReportBatchItemFailures \
  --destination-config '{"OnFailure":{"Destination":"'"$DLQ_ARN"'"}}'

Every tunable on the mapping, what it does, its default, its bounds, and the trade-off — this is the table you keep open when you create or adjust an ESM:

Setting	What it controls	Default	Range / values	Trade-off / gotcha
`BatchSize`	Max records per invocation	100	1–10,000 (streams)	Bigger amortizes overhead but widens failure blast radius and raises duration
`MaximumBatchingWindowInSeconds`	Wait before invoking	0	0–300	Higher = fewer/fuller invokes but more latency
`ParallelizationFactor`	Concurrent invokes per shard	1	1–10	Drains a hot shard; cross-key order within a shard relaxes (per-key holds)
`StartingPosition`	Where to begin	—	`LATEST` / `TRIM_HORIZON`	`LATEST` ignores history; `TRIM_HORIZON` replays the 24h window
`MaximumRetryAttempts`	Retries before giving up on a record	-1 (infinite)	-1, or 0–10,000	Infinite + no bisect = a poison record stalls forever
`MaximumRecordAgeInSeconds`	Evict records older than this	-1 (= 24h)	-1, or 60–604,800 (capped at 24h effective)	Set below 24h so poison hits DLQ while readable
`BisectBatchOnFunctionError`	Split a failing batch and retry halves	false	true/false	The safety net for unhandled throws
`FunctionResponseTypes`	Enables partial-batch responses	(none)	`ReportBatchItemFailures`	Precise per-record checkpointing
`DestinationConfig.OnFailure`	Where failure context goes	(none)	SQS or SNS ARN	Without it, exhausted records vanish silently
`TumblingWindowInSeconds`	Stateful aggregation window	0	0–900	Enables windowed aggregation across invokes
`Enabled`	Poller on/off	true	true/false	Disable to pause without deleting the mapping
`FilterCriteria`	Pre-invoke event filtering	(none)	Up to 5 patterns	Drops non-matching records before you pay

ParallelizationFactor deserves a warning. Ordering is preserved per partition key even with a factor above 1: the poller hashes each record’s partition key to one of the concurrent invocations, so all records for a key still go to the same invocation in order. What you lose is ordering across different keys within the same shard — almost always fine, because cross-key ordering was never guaranteed table-wide. Use it freely as long as your processing logic only assumes per-key order. How the parallelization factor interacts with order and throughput:

Parallelization factor	Concurrency per shard	Per-key order	Cross-key order in a shard	When to use
1 (default)	1	Preserved	Preserved	Strict in-shard order needed; low throughput
2–4	2–4	Preserved	Relaxed	Moderate hot-shard catch-up
5–10	5–10	Preserved	Relaxed	Drain a badly hot shard during a spike

--starting-position LATEST begins at the tip and ignores history; TRIM_HORIZON replays everything still in the 24-hour window. For a brand-new pipeline that must backfill, TRIM_HORIZON plus a one-time scan-and-load for older data is the usual combination — the stream alone cannot reach beyond 24 hours. The starting-position choices side by side:

Starting position	Begins at	Replays history?	Use for
`LATEST`	The tip (new changes only)	No	Steady-state pipelines; you backfill separately
`TRIM_HORIZON`	Oldest record still in the 24h window	Yes (≤24h)	New pipeline catching up recent changes
(one-time `Scan` + load)	The full table snapshot	N/A (not the stream)	Backfilling data older than 24h

Rather than reason about each knob in isolation, start from the workload and read off a recipe. A decision table mapping the requirement to the ESM configuration:

If your requirement is…	Set	Because
Lowest possible latency	`MaximumBatchingWindowInSeconds=0`, small `BatchSize`	Invoke as soon as records arrive
Highest throughput / fewest invokes	`BatchSize=100–1000`, window `5`	Full batches amortize overhead
Catch up a hot shard fast	`ParallelizationFactor=8–10`	Drain one shard across many invokes
Strict in-shard cross-key order	`ParallelizationFactor=1`	No concurrent invokes per shard
Survive a poison record	`BisectBatchOnFunctionError` + finite `MaxRetryAttempts` + DLQ	Isolate and evict it
Never reprocess good records	`FunctionResponseTypes=ReportBatchItemFailures`	Per-record checkpointing
Evict before the 24h cliff	`MaximumRecordAgeInSeconds=21600`	DLQ while still readable
Backfill a brand-new pipeline	`StartingPosition=TRIM_HORIZON` + one-time `Scan`	Replay 24h + load older data

The concurrency limits that bound how far you can push parallelism — real numbers, because they cap your catch-up:

Limit	Value	What hits it
`ParallelizationFactor` max	10 per shard	Hot-shard catch-up ceiling
`BatchSize` max (streams)	10,000 records	Very large batches widen failure blast radius
`MaximumBatchingWindowInSeconds` max	300	Latency vs invoke-count trade
Account concurrent executions (default)	1,000 (raisable)	Total Lambda concurrency across functions
Effective concurrency per stream	shards × `ParallelizationFactor`	The real parallel ceiling for one stream
Efficient readers per shard	~2	More risks `GetRecords` throttling
`GetRecords`	5/sec, ≤1,000 records or 1 MB	The poller manages this for you

Ordering and partial-batch failures: bisect-on-error and checkpointing

Here is the failure mode that bites everyone. By default, if your function throws on a batch, the ESM retries the entire batch — and because the shard’s checkpoint has not advanced, it keeps retrying the same batch until it succeeds, expires past MaximumRecordAgeInSeconds, or exhausts MaximumRetryAttempts. A single un-processable record blocks every record behind it in that shard. That is a stalled shard, and you see it as a climbing iterator age.

Two mechanisms fix this; use both.

Partial batch responses let the function tell the poller exactly which records failed so the rest are checkpointed and never reprocessed. Set --function-response-types ReportBatchItemFailures and return the failed sequence numbers:

def handler(event, context):
    failures = []
    for record in event["Records"]:
        seq = record["dynamodb"]["SequenceNumber"]
        try:
            process(record)
        except Exception:
            failures.append({"itemIdentifier": seq})
    return {"batchItemFailures": failures}

The contract has a sharp edge worth memorizing: because ordering matters, the poller treats the earliest reported failure as the checkpoint. Everything at or after that sequence number is retried; everything before it is considered done. So return failures honestly — do not report a late record as failed while silently dropping an earlier one, or you will skip data. Returning an empty list ({"batchItemFailures": []}) means the whole batch succeeded. The exact response contract and how the poller interprets each shape:

Function returns	Poller interprets as	Records reprocessed	Trap
`{"batchItemFailures": []}`	Whole batch succeeded	None	—
`{"batchItemFailures":[{"itemIdentifier": seq}]}`	Fail from earliest listed seq	That seq and everything after	List the earliest real failure
Throws an exception	Whole batch failed	Entire batch	One bad record burns the whole budget (without bisect)
Returns `null` / nothing	Whole batch succeeded	None	Swallowing an error here silently drops data
Reports a late seq, drops an earlier one	Checkpoints past the dropped one	Only from the late seq	Data loss — never do this

Bisect-on-error (--bisect-batch-on-function-error) handles the case where the function throws instead of reporting failures. On error the poller splits the batch in two and retries each half, recursively narrowing down to the single offending record. That record then goes to the on-failure destination once retries are exhausted, and the rest of the batch proceeds. Without bisect, one bad record can burn your entire retry budget on a 100-record batch. The two mechanisms compared — ship both, but know which does what:

Mechanism	Precision	Reprocesses good records?	Handles unhandled throw?	Role
`ReportBatchItemFailures`	Per record (exact seq)	No	Only if you catch and report	Primary — precise checkpointing
`BisectBatchOnFunctionError`	Narrows to one record via halving	Yes (re-runs halves)	Yes	Safety net for exceptions you didn’t anticipate
`MaximumRetryAttempts`	Batch/record budget	—	—	Bounds how long a failure retries
`MaximumRecordAgeInSeconds`	Time-based eviction	—	—	Evicts to DLQ before 24h age-out

Rule of thumb: ReportBatchItemFailures is the primary mechanism — precise, avoids reprocessing good records. BisectBatchOnFunctionError is the safety net for unexpected exceptions and serialization bugs you did not anticipate. Ship both.

EventBridge Pipes: filtering, enrichment, and routing without glue Lambdas

A large fraction of CDC “processors” are glue: filter the events you care about, reshape them, send them to a target. EventBridge Pipes does exactly that as a managed integration — source, optional filter, optional enrichment, target — with no poller code to own. The source can be a DynamoDB Stream directly. The four stages and what each is for:

Pipe stage	Purpose	Optional?	Cost timing	Typical implementation
Source	Read the DynamoDB Stream (batch/window/retry like an ESM)	Required	—	Stream ARN + `dynamodb_stream_parameters`
Filter	Drop non-matching records before you pay	Optional	Runs first, cheapest	Up to 5 EventBridge patterns
Enrichment	Hydrate/transform the event	Optional	After filter	Lambda, Step Functions Express, API GW, API destination
Target	Where the final payload goes	Required	After enrichment	EventBridge bus, SQS, SNS, Step Functions, Lambda, …

The filter stage runs before you pay for enrichment or invocation, so push as much selectivity into it as you can. Pipes filter patterns match the stream record shape, including the DynamoDB-typed attribute values:

{
  "eventName": ["MODIFY"],
  "dynamodb": {
    "NewImage": {
      "status": { "S": ["SHIPPED"] }
    }
  }
}

The Terraform wiring is compact, and the dynamodb_stream_parameters block carries the same batch/window/retry semantics as a Lambda ESM:

resource "aws_pipes_pipe" "orders_shipped" {
  name     = "orders-shipped-to-bus"
  role_arn = aws_iam_role.pipe.arn
  source   = aws_dynamodb_table.orders.stream_arn
  target   = aws_cloudwatch_event_bus.commerce.arn

  source_parameters {
    dynamodb_stream_parameters {
      starting_position                  = "LATEST"
      batch_size                         = 100
      maximum_batching_window_in_seconds = 5
      maximum_retry_attempts             = 4
      on_partial_batch_item_failure      = "AUTOMATIC_BISECT"
      dead_letter_config {
        arn = aws_sqs_queue.pipe_dlq.arn
      }
    }
    filter_criteria {
      filter {
        pattern = jsonencode({
          eventName = ["MODIFY"]
          dynamodb  = { NewImage = { status = { S = ["SHIPPED"] } } }
        })
      }
    }
  }
}

The enrichment stage can hydrate the event — look up a customer record, call a pricing service — and return a transformed payload, all without you running a poller. The target transformer then reshapes the final payload with an input template. The net effect: a filter-enrich-route pipeline that used to be 150 lines of Lambda becomes declarative infrastructure, and the only code you keep is genuine business logic in the enrichment. The decision between Pipes and a Lambda ESM, made explicit:

If the work is…	Lean toward	Why
Filter + reshape + route	EventBridge Pipes	No poller code; filter runs before you pay
Non-trivial transformation needing libraries	Lambda ESM	Real code, real error handling
Transactional writes to multiple sinks	Lambda ESM	Coordinated, conditional writes
Enrich from one service, then route	Pipes (Lambda enrichment)	Keep only business logic as code
Fan-out to many independent consumers	Pipes → EventBridge bus	One rule + DLQ per consumer
Windowed aggregation across batches	Lambda ESM (tumbling window)	Pipes has no tumbling window

Choose Pipes when the work is route and reshape. Keep a Lambda ESM when the work is compute — non-trivial transformation, transactional writes to multiple sinks, or anything needing libraries and real error handling.

The Pipes target options, end to end, with what each is good for as a CDC destination:

Pipe target	Delivery shape	Good CDC use	Note
EventBridge bus	Event onto a bus	Fan-out to many independent rules/sinks	The canonical decoupling target
SQS queue	Message to a queue	Buffer + a single consumer with backpressure	Pair with a DLQ
SNS topic	Pub/sub fan-out	Push to multiple subscribers	No replay/buffer
Step Functions (Standard)	Start an execution	Long-running orchestration per change	Async; durable
Step Functions (Express)	Synchronous workflow	Short enrichment/route logic	Also valid as enrichment
Lambda	Invoke a function	Real compute on the routed event	When you do want code
Kinesis / Firehose	Record to a stream	Land in a lake / re-stream	Firehose → S3 for the system of record
API destination	HTTP call	Push to an external SaaS/webhook	Built-in retry + auth

Idempotency and exactly-once semantics in downstream sinks

DynamoDB Streams are at-least-once. Redrives, bisect retries, and poller restarts all mean a downstream sink can see the same change more than once. “Exactly-once” downstream is therefore not a delivery setting you toggle — it is an idempotent write you design. The good news is every stream record hands you the tools: a stable item key and a monotonic SequenceNumber.

The cleanest pattern is a conditional write keyed on sequence. Store the last applied sequence number per item and reject anything not strictly newer. This collapses duplicates and protects against out-of-order application during reprocessing:

def apply_idempotent(key, new_image, sequence):
    try:
        table.put_item(
            Item={**new_image, "_seq": sequence},
            ConditionExpression="attribute_not_exists(#s) OR #s < :seq",
            ExpressionAttributeNames={"#s": "_seq"},
            ExpressionAttributeValues={":seq": sequence},
        )
    except ClientError as e:
        if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise
        # Duplicate or stale replay -> safe no-op

DynamoDB SequenceNumber values are strictly increasing within a shard for a given key, which is exactly the scope this guard needs. For sinks that lack conditional writes, fall back to a deterministic idempotency key — partitionKey#sortKey#sequenceNumber — and let the sink’s natural upsert (OpenSearch _id, an S3 object key, a primary-key MERGE) absorb the duplicate. The principle is invariant: make replay a no-op, never an additive side effect. The idempotency strategy per sink type:

Sink	Idempotency mechanism	Dedup key	Out-of-order protection
DynamoDB (another table)	Conditional `PutItem` on `_seq`	Item key	`_seq < :seq` rejects stale
OpenSearch	`index` with external version	`_id` = item key	`_version_type=external` rejects lower version
S3 (data lake)	Deterministic object key	`pk#sk#seq.json`	Same key overwrites; immutable log
Relational warehouse	`MERGE` / upsert on PK	Primary key	Compare a version/seq column
SQS → consumer	Idempotency table at consumer	`MessageDeduplicationId` / `pk#sk#seq`	Consumer-side last-seq check
Cache (Redis/Memcached)	`SET` keyed by item key	Item key	Conditional `SET` on stored seq

The failure modes idempotency defends against, so you can reason about why each guard exists:

Without idempotency you get…	Caused by	The guard that prevents it
Double-applied insert/modify	Redrive or bisect replay	Conditional-on-seq / upsert by id
Re-issued delete on an already-deleted item	Replay of a `REMOVE`	Idempotent delete (no-op if absent)
Newer doc overwritten by older replay	Out-of-order reprocessing	External version / `_seq < :seq`
Duplicate rows in the warehouse	At-least-once + naive `INSERT`	`MERGE` on PK
Inflated counters/aggregates	Additive side effect on replay	Make the write a set, not an increment

Fan-out to OpenSearch, S3, and analytics targets

Real pipelines feed several sinks with different durability and freshness needs. Resist the urge to fan out from a single mega-Lambda — couple them and one slow target stalls the shard for all of them. Fan out at the bus instead: a Pipe lands changes on an EventBridge bus, and independent rules deliver to each consumer with its own retry budget and DLQ.

OpenSearch — index the NewImage with _id set to the item key so writes are idempotent upserts; on REMOVE, issue a delete by that same _id. Honor per-key order by deriving a monotonic version from the sequence number and using external versioning so a stale replay cannot overwrite a newer doc.
S3 (data lake) — buffer through Kinesis Data Firehose and write newline-delimited JSON partitioned by event date. Firehose batches and compresses, which keeps you off the small-object tax that kills Athena performance. This is your replayable system of record once the 24-hour stream window has rolled off.
Analytics / warehouse — land in S3, then COPY/MERGE into Redshift or query in place with Athena. Keep the raw change log immutable and do transformation downstream; never let a warehouse load be the only place a change exists.

Each sink has a different freshness/durability/cost profile — match the sink to the requirement:

Sink	Freshness	Durability role	Cost driver	Per-sink failure handling
OpenSearch	Sub-second to seconds	Queryable projection (rebuildable)	Cluster + indexing	Bus rule retry + DLQ; `_version` guard
Firehose → S3	Buffered (≈60s / 5MB)	System of record (replayable)	Per-GB + PUTs	Firehose retry + S3 error bucket
Redshift / Athena	Minutes (after S3 land)	Analytical store	Compute + scan	Reload from immutable S3
Another DynamoDB table	Sub-second	Materialized view	RU/s	Conditional-on-seq write
ElastiCache	Sub-second	Cache (rebuildable)	Node hours	Conditional `SET`; TTL

A single Lambda target can also do the OpenSearch fan-out directly when you want code-level control — emit per-document failures via ReportBatchItemFailures so a 409 on one doc does not block the batch:

from opensearchpy import helpers

def to_action(record):
    key = record["dynamodb"]["Keys"]["pk"]["S"]
    if record["eventName"] == "REMOVE":
        return {"_op_type": "delete", "_index": "orders", "_id": key}
    img = deserialize(record["dynamodb"]["NewImage"])
    version = int(record["dynamodb"]["SequenceNumber"])
    return {
        "_op_type": "index", "_index": "orders", "_id": key,
        "_version": version, "_version_type": "external", "_source": img,
    }

Mapping each eventName to the right sink action — the table that keeps your projector correct:

`eventName`	Image present	OpenSearch action	S3 action	Warehouse action
`INSERT`	NewImage	`index` (upsert by `_id`)	Write `pk#sk#seq.json`	`MERGE` insert
`MODIFY`	New + Old	`index` (upsert by `_id`)	Write new version object	`MERGE` update
`REMOVE` (user)	OldImage only	`delete` by `_id`	Write tombstone object	`MERGE` soft-delete flag
`REMOVE` (TTL)	OldImage + `userIdentity`	Filter out (or delete)	Skip or tag as housekeeping	Skip

Handling poison records with on-failure destinations and DLQs

A poison record is one that will never succeed no matter how many times you retry — a schema your code cannot parse, a downstream constraint it will always violate. The entire point of MaximumRetryAttempts, MaximumRecordAgeInSeconds, and an on-failure destination is to evict that record so the shard keeps moving.

For a Lambda ESM, the on-failure destination (SQS or SNS) receives metadata about the failed batch, not the record bodies themselves — shard ID, the failing sequence-number range, and an error summary. That is deliberate: the records still live in the stream for 24 hours, so the destination’s job is to point you at them. Your redrive tooling reads the DynamoDBStreamArn, shard, and sequence-number window from the failure message and re-reads the offending records with GetShardIterator at AT_SEQUENCE_NUMBER. An EventBridge Pipes DLQ behaves the same way — it captures failure context after retries are exhausted.

{
  "requestContext": { "functionArn": "arn:aws:lambda:...:orders-cdc-processor" },
  "responseContext": { "statusCode": 200, "functionError": "Unhandled" },
  "DDBStreamBatchInfo": {
    "shardId": "shardId-00000001700000000000-abcdef01",
    "startSequenceNumber": "700000000000000000001",
    "endSequenceNumber": "700000000000000000099",
    "approximateArrivalOfFirstRecord": "2026-06-08T14:00:00Z",
    "batchSize": 99,
    "streamArn": "arn:aws:dynamodb:...:table/orders/stream/2026-06-08T00:00:00.000"
  }
}

What the failure message contains, and what you do with each field during a redrive:

Field in `DDBStreamBatchInfo`	Meaning	Redrive use
`shardId`	The shard that failed	`GetShardIterator` target
`startSequenceNumber`	First failing seq	Iterator at `AT_SEQUENCE_NUMBER`
`endSequenceNumber`	Last failing seq	Stop boundary for the re-read
`streamArn`	The stream	Which stream to read
`approximateArrivalOfFirstRecord`	When it arrived	How close to the 24h cliff you are
`batchSize`	How many records	Scope of the redrive

Alarm on ApproximateAgeOfOldestRecord and on DLQ depth, and make the redrive a deliberate, reviewed action — not an automatic retry that re-poisons the pipeline. Set MaximumRecordAgeInSeconds comfortably below 24 hours (e.g. 6 hours) so records are evicted to the DLQ while they are still readable in the stream; if a record ages past the 24-hour retention before you redrive, it is gone for good. The on-failure surface differs slightly between ESM and Pipes:

Aspect	Lambda ESM on-failure	EventBridge Pipes DLQ
Destination types	SQS or SNS	SQS (or SNS)
Payload	Batch metadata (shard + seq range)	Failure context after retries
Records included	No (still in stream 24h)	No (still in stream 24h)
Trigger	After `MaximumRetryAttempts` / age	After `maximum_retry_attempts`
Redrive method	Re-read shard at `AT_SEQUENCE_NUMBER`	Re-read shard at `AT_SEQUENCE_NUMBER`
Bisect setting	`--bisect-batch-on-function-error`	`on_partial_batch_item_failure = AUTOMATIC_BISECT`

The metrics you alarm on are the difference between catching a stall in minutes and discovering it from a customer complaint. The CloudWatch reference for a DynamoDB-Streams CDC pipeline — namespace, what it measures, and a starting alarm threshold:

Metric	Namespace	What it tells you	Starting alarm threshold
`IteratorAge` / `GetRecords.IteratorAgeMilliseconds`	`AWS/Lambda` / `AWS/DynamoDB`	How far behind the consumer is (the stall signal)	> 60,000 ms sustained 5 min
`ApproximateAgeOfOldestRecord`	`AWS/Lambda` (ESM)	Oldest unprocessed record’s age	> 1/4 of `MaximumRecordAgeInSeconds`
`Errors`	`AWS/Lambda`	Function invocation failures	> 0 sustained (with retries looping)
`Throttles`	`AWS/Lambda`	Concurrency ceiling hit	> 0 sustained 5 min
`ConcurrentExecutions`	`AWS/Lambda`	Headroom vs account/reserved limit	> 80% of reserved
`DeadLetterErrors` / DLQ `ApproximateNumberOfMessagesVisible`	`AWS/Lambda` / `AWS/SQS`	Poison records captured	≥ 1 (page on any)
`ReadThrottleEvents`	`AWS/DynamoDB` (stream)	Too many readers per shard	> 0 sustained
`ProvisionedConcurrencySpilloverInvocations`	`AWS/Lambda`	PC pool overflow (if using PC)	> 0

The error and exception strings you will actually see, what each means on this path, and the first move:

Error / condition	Where it surfaces	Likely cause	How to confirm	First fix
`ConditionalCheckFailedException`	Idempotent `PutItem`	Duplicate or stale replay (expected)	It’s a no-op by design	Swallow it; do NOT re-raise
`version_conflict_engine_exception` (409)	OpenSearch index	Older version rejected by external versioning	Stale replay arrived after newer	Treat as success; the guard worked
`ProvisionedThroughputExceededException`	`GetRecords`	Too many readers per shard	`ReadThrottleEvents` > 0	≤2 readers; move to Kinesis fan-out
`TrimmedDataAccessException`	`GetShardIterator`/redrive	Record aged past 24h before redrive	`ApproximateAgeOfOldestRecord` near 24h	Lower `MaxRecordAge`; alarm earlier
`ExpiredIteratorException`	Long-running consumer	Shard iterator older than ~5 min unused	Iterator reused too late	Re-acquire the iterator
`ResourceNotFoundException` (stream)	ESM create / Pipe	Stream ARN stale after view-type change	View type changed → new ARN	Re-point the ESM/Pipe to the new ARN
`Unhandled` (in DLQ `functionError`)	On-failure destination	Function threw; batch exhausted retries	DLQ message present	Redrive from shard + seq range
Lambda `Task timed out`	CloudWatch logs	Per-record work exceeds function timeout	Duration ≈ configured timeout	Smaller batch; raise timeout; optimize

Architecture at a glance

The diagram traces a change as it actually flows, then maps each failure class onto the exact hop where it bites. Read it left to right. A write commits to the orders table and lands on its Stream (NEW_AND_OLD_IMAGES, sharded, 24-hour retention) — badge 1 marks where records age out of the window if a stall lets them. The capture/poll zone holds the two managed pollers: the Lambda ESM (batch 100, parallelization 1–10) and EventBridge Pipes (filter, no code) — badge 2 sits on the ESM, where a hot shard stalls the single poller per shard. In process, the CDC processor writes idempotently (conditional on _seq) — badge 3 is the duplicate / out-of-order apply class — and the DLQ captures the shardId + sequence range when a poison record wedges the shard (badge 4). The fan-out sinks — OpenSearch (upsert by _id/_version), Firehose→S3 (NDJSON, gzip), encrypted with a KMS CMK — carry badge 5, the sink-drift / fan-out-coupling class. Finally CloudWatch watches IteratorAge, the leading indicator for every stall.

The whole method is in that left-to-right path: a change is captured once, processed exactly-once via an idempotent write, fanned out at the bus so each sink owns its retries, and observed so a climbing iterator age pages you before the 24-hour cliff. The five numbered badges are the five ways it goes wrong, and the legend narrates each as symptom · confirm · fix — so when the pager fires you localize the failure to a hop, read the cause, run the named confirm, and apply the fix.

Real-world scenario

A retail platform team — call them MeridianStock — ran inventory on a single DynamoDB table fronting 2,000 stores, with a Lambda ESM projecting every change into an OpenSearch index that powered the storefront’s “in stock near you” search. Average load was 600 writes/second; the team was five engineers; the monthly bill for the table + stream + Lambda + OpenSearch ran about ₹95,000. On a normal day it was fine.

During a flash sale, one SKU’s partition went hot. The ESM’s iterator age on that shard climbed past 40 minutes, and search results went stale right when traffic peaked — the exact failure that costs money. Worse, a malformed bundle item (a nested map their deserializer choked on) had been silently stuck in that shard for hours, blocking every inventory update behind it because the whole batch kept failing and retrying. Their original ESM had BatchSize=500, ParallelizationFactor=1, no partial-batch reporting, and no bisect — so a single bad record consumed the entire retry budget on a 500-record batch, over and over.

The constraint was twofold: they needed the hot shard to catch up without resharding a 2,000-store table mid-sale, and they could not let one un-parseable item hold the rest hostage. The breakthrough was reading the right metric. GetRecords.IteratorAgeMilliseconds was pinned near 2,400,000 (40 minutes) on exactly one shard while the others sat near zero — a textbook hot-shard-plus-poison signature, not a capacity problem. Scaling the table would have done nothing.

They made three changes, all configuration. They raised ParallelizationFactor to 8 so the hot shard drained across eight concurrent invocations (ordering held, because all records for a given SKU still hashed to one invocation). They switched the function to ReportBatchItemFailures and returned precise failed sequence numbers, so good records checkpointed immediately instead of being reprocessed behind the poison item. And they enabled BisectBatchOnFunctionError with MaximumRecordAgeInSeconds=21600, so the malformed bundle item was narrowed down and evicted to a DLQ within minutes instead of wedging the shard for hours.

aws lambda update-event-source-mapping \
  --uuid "$ESM_UUID" \
  --parallelization-factor 8 \
  --batch-size 100 \
  --maximum-record-age-in-seconds 21600 \
  --bisect-batch-on-function-error \
  --function-response-types ReportBatchItemFailures \
  --destination-config '{"OnFailure":{"Destination":"'"$DLQ_ARN"'"}}'

The outcome: iterator age on the hot shard fell back under 30 seconds within the next sale, the poison bundle item surfaced in the DLQ with its exact sequence range for an engineer to fix offline, and the storefront search stopped lying about stock during the one window where accuracy mattered most. The following sprint they also moved the data-lake sink off the projector Lambda and onto an EventBridge Pipe → bus, so a slow OpenSearch never again stalled the lake. No table redesign, no resharding — just the ESM knobs used the way they were designed. The incident as a timeline, because the order of moves is the lesson:

Time	Symptom	Action taken	Effect	What it should have been
T+0	Search stale during sale	(alert on stale results, not the pipeline)	—	Alarm on IteratorAge, not user reports
T+10m	One shard’s age at 40 min	Considered scaling the table	Would not help	Read per-shard IteratorAge first
T+20m	Confirmed hot shard + poison	`ParallelizationFactor` 1 → 8	Hot shard drains 8-way	Correct catch-up lever
T+25m	Poison still blocking	`ReportBatchItemFailures` on	Good records checkpoint	Stop reprocessing good behind bad
T+30m	Poison record isolated	Bisect + `MaxRecordAge=21600`	`bundle` item to DLQ in minutes	Evict while still readable
+1 sprint	Lake coupled to projector	Pipe → EventBridge bus	Sinks decoupled	Fan out at the bus from day one

Advantages and disadvantages

DynamoDB Streams give you a genuinely powerful change feed, but the model’s strengths and its sharp edges are two sides of the same coin. Weigh them honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
Ordered, item-level deltas with per-key ordering and no triggers to write	Ordering is per key only — never table-wide; consumers must not assume global order
Managed pollers (Lambda ESM / Pipes) — you tune, you don’t write the loop	At-least-once delivery means you must build idempotency; nothing is exactly-once
`ReportBatchItemFailures` + bisect give precise poison isolation	Defaults are unsafe: infinite retries, no bisect, no DLQ → a poison record stalls a shard for hours
24-hour retention buffers consumers and enables redrive from the stream	The 24-hour cap is hard — miss the window and the record is gone; you need S3 as the system of record
EventBridge Pipes turns glue into declarative infrastructure	Pipes is route-and-reshape only — real compute still needs a Lambda
`ParallelizationFactor` drains a hot shard without resharding	A hot partition key still bottlenecks one shard; tuning helps, key design fixes
Native Stream never duplicates at the source (vs Kinesis)	Only ~2 efficient readers per shard; many consumers force the Kinesis surface
Tight integration with KMS, IAM, CloudWatch	Several failure modes are invisible until load (hot shard, SNAT-like coupling, ageing out)

The model is right whenever you need item-level CDC with strict per-key ordering and a short replay window — search projection, cache invalidation, audit trails, materialized views. It bites hardest on hot-key workloads, multi-sink fan-out done in one function, and teams that ship the defaults. Every disadvantage is manageable — but only if you know it exists, which is the entire point of building idempotency, bisect, DLQ, and bus-level fan-out in from the start.

Hands-on lab

Build a minimal CDC pipeline end to end — a table with a stream, a Lambda ESM with the safety knobs on, an idempotent projection, and a deliberate poison record — then watch it recover. Free-tier-friendly; teardown at the end. Run in CloudShell.

Step 1 — Variables and a table with a stream.

export AWS_PAGER=""
TABLE=cdc-orders-lab
DLQ=cdc-orders-dlq
FN=cdc-orders-processor

aws dynamodb create-table \
  --table-name "$TABLE" \
  --attribute-definitions AttributeName=pk,AttributeType=S \
  --key-schema AttributeName=pk,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
aws dynamodb wait table-exists --table-name "$TABLE"
STREAM_ARN=$(aws dynamodb describe-table --table-name "$TABLE" \
  --query 'Table.LatestStreamArn' --output text)
echo "$STREAM_ARN"

Expected: a stream ARN ending in a timestamp. Note it — the ESM points at it.

Step 2 — Create the DLQ and capture its ARN.

DLQ_URL=$(aws sqs create-queue --queue-name "$DLQ" --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url "$DLQ_URL" \
  --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)
echo "$DLQ_ARN"

Step 3 — A processor that is idempotent and reports per-record failures. Save as app.py and zip it. It throws on a deliberately malformed item so you can watch bisect isolate it:

import json
def handler(event, context):
    failures = []
    for r in event["Records"]:
        seq = r["dynamodb"]["SequenceNumber"]
        try:
            img = r["dynamodb"].get("NewImage", {})
            if img.get("poison", {}).get("BOOL"):     # the malformed item
                raise ValueError("poison record")
            print("applied", r["dynamodb"]["Keys"]["pk"]["S"], seq)
        except Exception as e:
            print("FAILED", seq, str(e))
            failures.append({"itemIdentifier": seq})
    return {"batchItemFailures": failures}

Step 4 — Deploy the function (assumes an existing Lambda execution role with stream + SQS + logs permissions; see Security notes for the exact policy).

zip fn.zip app.py
aws lambda create-function --function-name "$FN" \
  --runtime python3.12 --handler app.handler \
  --role "$LAMBDA_ROLE_ARN" --zip-file fileb://fn.zip

Step 5 — Create the event source mapping with every safety knob on.

aws lambda create-event-source-mapping \
  --function-name "$FN" --event-source-arn "$STREAM_ARN" \
  --starting-position TRIM_HORIZON \
  --batch-size 10 --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 2 \
  --maximum-retry-attempts 2 --maximum-record-age-in-seconds 3600 \
  --bisect-batch-on-function-error \
  --function-response-types ReportBatchItemFailures \
  --destination-config '{"OnFailure":{"Destination":"'"$DLQ_ARN"'"}}'

Expected: a mapping with State progressing to Enabled. Confirm: aws lambda list-event-source-mappings --function-name "$FN" --query 'EventSourceMappings[0].State'.

Step 6 — Write a few good items and one poison item, then watch.

aws dynamodb put-item --table-name "$TABLE" --item '{"pk":{"S":"ORDER#1"},"status":{"S":"PENDING"}}'
aws dynamodb put-item --table-name "$TABLE" --item '{"pk":{"S":"ORDER#2"},"poison":{"BOOL":true}}'
aws dynamodb put-item --table-name "$TABLE" --item '{"pk":{"S":"ORDER#3"},"status":{"S":"SHIPPED"}}'
sleep 30
aws logs tail "/aws/lambda/$FN" --since 5m

Expected in the logs: applied ORDER#1, a FAILED line for ORDER#2, and applied ORDER#3 — proof the good records checkpointed while the poison one was isolated and routed to the DLQ after its retries.

Step 7 — Confirm the poison landed in the DLQ.

aws sqs receive-message --queue-url "$DLQ_URL" --wait-time-seconds 5 \
  --query 'Messages[0].Body' --output text

Expected: a JSON body with DDBStreamBatchInfo carrying the shardId and the failing sequence range. The lab steps mapped to what each proves:

Step	What you did	What it proves
1	Table + `NEW_AND_OLD_IMAGES` stream	The source feed and its view type
5	ESM with bisect + partial-batch + DLQ	The safety knobs are real and composable
6	Good + poison items	Good records checkpoint; poison is isolated
7	Read the DLQ message	Failure context (shard + seq) is what you redrive from

Cleanup.

ESM=$(aws lambda list-event-source-mappings --function-name "$FN" --query 'EventSourceMappings[0].UUID' --output text)
aws lambda delete-event-source-mapping --uuid "$ESM"
aws lambda delete-function --function-name "$FN"
aws sqs delete-queue --queue-url "$DLQ_URL"
aws dynamodb delete-table --table-name "$TABLE"

Cost note. On-demand DynamoDB, a tiny Lambda, and an SQS queue for an hour cost a few rupees; deleting the four resources stops everything. There is no charge for enabling a stream — you pay per GetRecords call the ESM makes, which is negligible at this scale.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First as a scannable table you can read while the iterator age climbs, then the detail underneath for the entries that bite hardest.

#	Symptom	Root cause	Confirm (exact metric / command)	Fix
1	One shard’s iterator age climbs to minutes; downstream stale	Hot partition key + one-poller-per-shard	`GetRecords.IteratorAgeMilliseconds` high on one shard, flat on others	`ParallelizationFactor` up to 10; `ReportBatchItemFailures` on
2	Whole pipeline wedged; same batch retried forever	Poison record + no bisect + infinite retries	Invocation errors loop; iterator age climbs everywhere	`BisectBatchOnFunctionError` + on-failure DLQ + finite `MaxRetryAttempts`
3	Search index double-applies or shows stale doc	Non-idempotent write; out-of-order replay	OpenSearch `_version` regresses; counts drift	Conditional-on-`_seq`; external versioning from `SequenceNumber`
4	Records vanished; never processed	Aged past 24h before redrive	`ApproximateAgeOfOldestRecord` was near 24h	`MaximumRecordAgeInSeconds` well below 24h; alarm earlier
5	Housekeeping deletes replicated downstream	TTL `REMOVE` not filtered	Record has `userIdentity.principalId=dynamodb.amazonaws.com`	Filter TTL deletes in the function/filter
6	One slow sink stalls all sinks	Mega-Lambda fan-out couples sinks	Iterator age climbs when only OpenSearch is slow	Fan out at an EventBridge bus/Pipe; one rule+DLQ per sink
7	`REMOVE` events have no new data to project	Expecting `NewImage` on delete	`eventName=REMOVE` has only `OldImage`	Use `OldImage`; reconstruct delete from keys
8	Consumers see no field-transition info	Wrong `StreamViewType` (`KEYS_ONLY`/`NEW_IMAGE`)	Record lacks `OldImage`	Re-enable stream as `NEW_AND_OLD_IMAGES`
9	DLQ message has no record body	Expecting records, got metadata	Body is `DDBStreamBatchInfo`, not items	Re-read shard at `AT_SEQUENCE_NUMBER`
10	`GetRecords` throttling; lag despite low function load	Too many readers per shard	`ReadThrottleEvents` / throttled `GetRecords`	≤2 efficient readers; move to Kinesis fan-out
11	Reported failures but data still skipped	Reported a late seq, dropped an earlier one	Gap in processed sequence numbers	Always report the earliest real failure
12	Kinesis CDC shows duplicates / out of order	Kinesis is best-effort, not native Stream	Same record id twice; out-of-order seq	Dedupe + reorder on seq; or use native Stream

The expanded form for the entries that cause the most lost hours:

1. One shard’s iterator age climbs to minutes; downstream goes stale. Root cause: A hot partition key funnels writes into one shard, and the default one-poller-per-shard model falls behind. Confirm: aws cloudwatch get-metric-statistics --namespace AWS/DynamoDB --metric-name ... — or simpler, the Lambda IteratorAge metric high on the one shard. The signature is one shard high while the rest are near zero. Fix: Raise ParallelizationFactor (up to 10) so the shard drains across concurrent invocations; per-key order still holds because each key hashes to one invocation. Enable ReportBatchItemFailures so good records checkpoint immediately. The durable fix is key design — see single-table modeling — but tuning buys you the incident.

2. The whole pipeline is wedged; the same batch retries forever. Root cause: A poison record with default settings (infinite retries, no bisect) — the batch fails, the checkpoint never advances, it retries indefinitely. Confirm: Function invocation errors loop on the same batch; iterator age climbs across shards. Fix: BisectBatchOnFunctionError to isolate the record by halving, a finite MaximumRetryAttempts, and an on-failure destination so the isolated record lands in a DLQ. Combine with ReportBatchItemFailures so you rarely reach bisect at all.

3. The search index double-applies a change or shows a stale document. Root cause: A non-idempotent write, or an out-of-order replay overwriting a newer document during reprocessing. Confirm: OpenSearch _version regresses; row counts in the warehouse drift from the table. Fix: Conditional write keyed on SequenceNumber (attribute_not_exists(_seq) OR _seq < :seq); for OpenSearch use _version_type=external with the sequence-derived version so a lower version is rejected.

4. Records vanished and were never processed. Root cause: They aged past the 24-hour retention before anyone redrove them. Confirm: ApproximateAgeOfOldestRecord was approaching 86,400 seconds. Fix: Set MaximumRecordAgeInSeconds well below 24h (e.g. 21,600) so poison records hit the DLQ while still readable; alarm on iterator age at minutes, not hours.

6. One slow sink stalls all sinks. Root cause: A single mega-Lambda fans out to every sink, so one slow target (OpenSearch under merge pressure) stalls the shard for all of them. Confirm: Iterator age climbs precisely when one sink is slow; the others are healthy but starved. Fix: Land changes on an EventBridge bus (via a Pipe) and give each sink its own rule, retry budget, and DLQ. The projector’s slowness no longer blocks the lake.

Iterator age is the master signal, but its shape tells you which failure you have. Read the pattern, not just the number:

Iterator-age pattern	It’s probably…	Confirm	Do this
High on one shard, flat on the rest	Hot partition key (± poison)	Per-shard `IteratorAge`	`ParallelizationFactor` up; fix key design
Climbing across all shards together	Poison + no bisect, or function too slow	Looping `Errors`; duration ≈ timeout	Bisect + DLQ; smaller batch; optimize
Climbs only when one sink is slow	Mega-Lambda fan-out coupling	Sink dependency latency vs age	Fan out at the bus/Pipe
Sawtooth (climbs, drops, repeats)	Under-provisioned concurrency vs bursts	`Throttles` spiking on bursts	Raise concurrency; `ParallelizationFactor`
Flat-high near `MaxRecordAge` then drops	Records being evicted to DLQ	DLQ depth rising	Expected — go fix the poison offline

Best practices

Pick the delivery surface before tuning. Native Stream for strict per-key order and a short window; Kinesis Data Streams for DynamoDB for long retention or many fan-out consumers. You cannot tune your way out of the wrong surface.
Enable NEW_AND_OLD_IMAGES unless you have a concrete reason for a leaner view. The diff is what lets you route on transitions and reconstruct deletes.
Filter TTL-driven REMOVE events explicitly (or in the Pipe filter) so housekeeping is not replicated downstream.
Ship ReportBatchItemFailures and BisectBatchOnFunctionError together on every ESM — the former for precise checkpointing, the latter as the safety net for unhandled exceptions.
Always report the earliest real failure in a partial-batch response. Reporting a late one while dropping an earlier one silently skips data.
Set MaximumRecordAgeInSeconds below 24 hours (e.g. 6h) so poison records are evicted to a DLQ while still readable in the stream.
Make every downstream write idempotent — conditional-on-SequenceNumber, or a deterministic pk#sk#seq key with the sink’s natural upsert. Treat replay as a non-event.
Derive a monotonic version from SequenceNumber and use external versioning (OpenSearch _version) so a stale replay cannot overwrite a newer doc.
Fan out at an EventBridge bus/Pipe, never one mega-Lambda, so each sink owns its retries and DLQ and one slow target can’t stall the rest.
Use ParallelizationFactor to drain hot shards during a spike, but fix chronic hot keys in the data model.
Keep S3 (via Firehose) as the immutable system of record — the only thing that survives the 24-hour window and lets you rebuild any sink.
Alarm on IteratorAge / ApproximateAgeOfOldestRecord, function errors/throttles, and DLQ depth before go-live — a silent stall you didn’t alarm on is the worst case.
Make redrives deliberate and reviewed. Read shardId + sequence range from the DLQ and re-read at AT_SEQUENCE_NUMBER; never auto-retry a poison record back into the pipeline.

Security notes

DynamoDB Streams inherit the table’s encryption and are governed by their own IAM action set; the consumer needs least-privilege access to read the stream and write the DLQ, nothing more.

Encrypt at rest with a customer-managed KMS key. A stream encrypts with the same key as its table; using a KMS CMK (rather than the AWS-owned key) gives you key policy control, rotation, and an auditable grant to the consumer’s role. See AWS KMS in Depth: Multi-Region Keys, Envelope Encryption, Key Policies, and Grants.
Scope the consumer’s IAM to the four stream actions. The Lambda/Pipe role needs only DescribeStream, GetRecords, GetShardIterator, and ListStreams on the specific stream ARN — never dynamodb:*.
Separate read-from-stream from write-to-sink. Grant sqs:SendMessage to the DLQ, the specific OpenSearch/S3 write actions to the sinks, and nothing broader. Cross-account fan-out should use scoped assume-role — see Secure Cross-Account Access: Assume-Role Patterns, External ID, Confused Deputy, and Session Policies.
Don’t leak PII into the lake unfiltered. The NewImage/OldImage carry the full item; mask or drop sensitive attributes in the enrichment/processor before they land in S3 or a search index.
Lock the DLQ down. Failure metadata names shard IDs and sequence ranges; restrict sqs:ReceiveMessage to the redrive role and encrypt the queue with SSE-KMS.

The minimal consumer policy — what to grant and why each action is needed:

IAM action	On resource	Why the consumer needs it
`dynamodb:DescribeStream`	The stream ARN	Discover shards and topology
`dynamodb:GetShardIterator`	The stream ARN	Position a read (incl. `AT_SEQUENCE_NUMBER` for redrive)
`dynamodb:GetRecords`	The stream ARN	Read the change records
`dynamodb:ListStreams`	The stream ARN	Enumerate streams on the table
`sqs:SendMessage`	The DLQ ARN	Write failure context on exhaustion
`kms:Decrypt`	The table’s CMK	Decrypt encrypted stream records
(sink-specific writes)	Each sink ARN	Project the change (scoped per sink)

data "aws_iam_policy_document" "cdc_consumer" {
  statement {
    actions   = ["dynamodb:DescribeStream", "dynamodb:GetRecords",
                 "dynamodb:GetShardIterator", "dynamodb:ListStreams"]
    resources = [aws_dynamodb_table.orders.stream_arn]
  }
  statement {
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.cdc_dlq.arn]
  }
  statement {
    actions   = ["kms:Decrypt"]
    resources = [aws_kms_key.table.arn]
  }
}

Cost & sizing

The bill for a CDC pipeline is dominated by the consumers and sinks, not the stream itself — enabling a stream is free, and you pay only for the GetRecords calls the managed poller makes. What actually drives the cost and how to right-size it:

GetRecords requests are billed per call (DynamoDB Streams read-request units); the ESM batches efficiently, so this is small unless you set a tiny batch size with a zero batching window (many under-full reads). Use a sensible BatchSize (100) and a short MaximumBatchingWindowInSeconds (1–5s) to keep calls few and full.
Lambda is the usual largest line — invocations × duration × memory. Bigger batches amortize invocation overhead; right-size memory to the per-record work.
OpenSearch is often the single most expensive sink (cluster hours regardless of change volume) — size it to query load, and remember it is rebuildable from S3, so you can run it leaner.
Firehose + S3 are cheap per GB; the win is avoiding the small-object tax by letting Firehose batch and compress, which keeps Athena fast and S3 PUTs low.
Kinesis Data Streams for DynamoDB adds shard-hour (provisioned) or per-GB (on-demand) cost plus the longer retention — only pay this when you genuinely need >24h replay or many consumers.

A rough monthly picture for a mid-size pipeline (600 writes/s projected to OpenSearch + a lake):

Cost driver	What you pay for	Rough INR / month	How to right-size
Stream `GetRecords`	Read-request units the poller makes	~₹500–2,000	Sensible batch size + window (few, full reads)
Lambda (processor)	Invocations × duration × memory	~₹4,000–12,000	Larger batches; tune memory to per-record work
OpenSearch domain	Cluster hours (data + master nodes)	~₹40,000–90,000	Size to query load; rebuild from S3, run lean
Firehose → S3	Per-GB ingest + storage + PUTs	~₹2,000–6,000	Let Firehose batch/compress; partition by date
SQS DLQ	Requests (tiny)	< ₹200	Negligible
Kinesis (if used)	Shard-hours or per-GB + retention	~₹6,000–20,000	Only when >24h replay / many consumers

The lesson MeridianStock learned: most “we need a bigger pipeline” instincts are actually “we mis-tuned the poller” or “we coupled our sinks.” Fix the configuration and the fan-out topology first; the bill usually goes down, not up. There is no free tier for streams specifically, but the per-GetRecords cost is small enough that the consumers dominate every realistic bill.

Interview & exam questions

1. What ordering guarantee does a DynamoDB Stream actually provide? Strict ordering per partition key, within a shard lineage — records for a given key are delivered in commit order. There is no global/table-wide ordering. Consumers must only assume per-key order; this is what makes ParallelizationFactor > 1 safe.

2. Is delivery exactly-once? How do you achieve exactly-once downstream? No — delivery is at-least-once; redrives and retries can replay a record. Exactly-once downstream is an idempotent write you design: a conditional write keyed on SequenceNumber in DynamoDB, external versioning in OpenSearch, or a deterministic pk#sk#seq key with the sink’s natural upsert.

3. A single poison record has stalled a shard for hours. Why, and what two settings fix it? With defaults (infinite retries, no bisect) the failing batch never checkpoints, so it retries forever and blocks everything behind it. Fix with BisectBatchOnFunctionError (isolate the record by halving) plus an on-failure destination, and ReportBatchItemFailures so good records checkpoint immediately. Bound it with a finite MaximumRetryAttempts and a MaximumRecordAgeInSeconds below 24h.

4. What does ParallelizationFactor do, and what does it cost you? It runs up to 10 concurrent invocations per shard, draining a hot shard without resharding. Per-key order is preserved (each key hashes to one invocation); you relax ordering across keys within a shard — almost always acceptable since cross-key order was never guaranteed.

5. How does the ReportBatchItemFailures checkpoint contract work, and what’s the trap? You return the failed sequence numbers; the poller treats the earliest reported failure as the checkpoint — everything at or after it is retried, everything before is done. The trap: reporting a late failure while dropping an earlier record silently skips data. Always report the earliest real failure.

6. When do you choose EventBridge Pipes over a Lambda ESM? When the work is route and reshape — filter, optionally enrich, deliver — with no poller code to own. Keep a Lambda ESM when the work is genuine compute: non-trivial transformation needing libraries, transactional writes to multiple sinks, or windowed aggregation (tumbling windows are ESM-only).

7. Why is a 24-hour stream not enough, and what do you pair with it? Records age out after 24 hours and are gone; you cannot backfill history from the stream. Pair it with S3 (via Firehose) as the immutable, replayable system of record, and a one-time Scan-and-load for data older than the window. Set MaximumRecordAgeInSeconds below 24h so poison records hit the DLQ while still readable.

8. How do you keep an OpenSearch projection from being corrupted by an out-of-order replay? Derive a monotonic version from the record’s SequenceNumber and index with _version_type=external, so OpenSearch rejects any write with a lower version. Set _id to the item key so writes are idempotent upserts and a REMOVE is a delete by the same _id.

9. What’s the difference between native DynamoDB Streams and Kinesis Data Streams for DynamoDB? Native Streams: 24-hour retention, strict per-key order, at-least-once, ~2 efficient readers per shard. Kinesis for DynamoDB: retention up to a year, best-effort order (reorder on sequence), possible duplicates, and many fan-out consumers. Pick Kinesis for long replay or many consumers; native for strict order and a short window.

10. The on-failure destination has a message but no record body. Why, and how do you recover the records? The destination receives batch metadata (shardId, failing sequence range, stream ARN), not the records — they still live in the stream for 24h. Recover by GetShardIterator at AT_SEQUENCE_NUMBER with the range from the message and re-reading the offending records for a deliberate redrive.

11. How do you handle TTL deletes you don’t want to replicate? A TTL deletion arrives as REMOVE with userIdentity.principalId=dynamodb.amazonaws.com. Filter those explicitly (in the function or the Pipe filter pattern) so housekeeping expiries are not mirrored to downstream sinks.

12. Iterator age is climbing on exactly one shard while others are flat. What does that tell you? A hot partition key is funneling writes into that shard and the poller is falling behind (often compounded by a poison record). Confirm with per-shard IteratorAge; mitigate with ParallelizationFactor and ReportBatchItemFailures, and fix the root cause in the data model (key design).

These map to the AWS Certified Developer – Associate (DVA-C02) — develop event-driven solutions, Lambda event source mappings, DynamoDB Streams — and the Solutions Architect – Associate / Professional for the integration and fan-out patterns. A compact cert mapping:

Question theme	Primary cert	Objective area
Stream ordering / view types	DVA-C02	Develop with DynamoDB; event-driven
ESM tuning, bisect, partial-batch	DVA-C02	Lambda event source mappings
EventBridge Pipes vs Lambda	DVA-C02 / SAA-C03	Event-driven integration
Idempotency / exactly-once	DVA-C02 / SAP-C02	Resilient, idempotent design
Kinesis vs native Streams	SAA-C03 / DAS	Choosing a streaming surface
KMS / IAM scoping for streams	SCS-C02 / DVA-C02	Secure data + least privilege

Quick check

A redrive replays a batch and your OpenSearch index now shows an older document than before. What single mechanism prevents this, and what value do you derive it from?
One shard’s IteratorAge is at 30 minutes while every other shard is near zero. What is the most likely root cause, and which ESM knob buys you time?
True or false: setting MaximumRetryAttempts infinite (the default) with no bisect is a safe configuration for a CDC processor.
You need to replay changes from eight days ago and have three independent teams consuming the same feed. Native Stream or Kinesis Data Streams for DynamoDB — and why?
Your on-failure SQS message contains a DDBStreamBatchInfo block but none of the actual records. How do you get the records to fix them?

Answers

External versioning derived from the record’s SequenceNumber: index into OpenSearch with _version set to the sequence (cast to a number) and _version_type=external, so a write with a lower version is rejected and a stale replay can’t overwrite a newer doc. In DynamoDB the equivalent is a conditional write attribute_not_exists(_seq) OR _seq < :seq.
A hot partition key is funneling writes into that one shard, so the single poller per shard falls behind (often with a poison record compounding it). Raise ParallelizationFactor (up to 10) to drain the shard across concurrent invocations — per-key order still holds — and enable ReportBatchItemFailures. The durable fix is the data model.
False. With infinite retries and no bisect, a single poison record fails the batch, the checkpoint never advances, and it retries forever — stalling every record behind it. Use BisectBatchOnFunctionError, a finite MaximumRetryAttempts, MaximumRecordAgeInSeconds below 24h, and an on-failure DLQ.
Kinesis Data Streams for DynamoDB. The native Stream caps retention at 24 hours (so eight days is impossible) and supports only ~2 efficient readers per shard. Kinesis offers retention up to a year and clean multi-consumer fan-out — at the cost of best-effort ordering and possible duplicates, so dedupe and reorder on a sequence number.
The destination carries batch metadata, not records — the records still live in the stream for 24 hours. Read the shardId and startSequenceNumber/endSequenceNumber from the message, call GetShardIterator at AT_SEQUENCE_NUMBER, and re-read the offending records for a deliberate, reviewed redrive (provided they haven’t aged past 24h).

Glossary

DynamoDB Stream — an ordered, append-only log of item-level changes on a table, retained for 24 hours and sharded to match the table’s partitions.
Shard — an ordered partition of the stream; the unit of both ordering and consumer parallelism.
StreamViewType — what each record carries: KEYS_ONLY, NEW_IMAGE, OLD_IMAGE, or NEW_AND_OLD_IMAGES.
SequenceNumber — a monotonic, per-shard record identifier; the anchor for idempotency and external versioning.
Event source mapping (ESM) — the managed Lambda poller AWS runs for you; you tune BatchSize, ParallelizationFactor, retries, and failure handling.
ParallelizationFactor — 1–10 concurrent invocations per shard; drains a hot shard while preserving per-key order.
ReportBatchItemFailures — the partial-batch response contract; you return failed sequence numbers so good records checkpoint and aren’t reprocessed.
Bisect-on-error — BisectBatchOnFunctionError; on a thrown error the poller splits the batch and retries halves to isolate the offending record.
Iterator age — GetRecords.IteratorAgeMilliseconds / ApproximateAgeOfOldestRecord; how far behind the consumer is — the leading indicator of a stall.
On-failure destination / DLQ — an SQS/SNS target that receives failure metadata (shard + sequence range) after retries are exhausted; the records stay in the stream for 24h.
EventBridge Pipes — a managed source→filter→enrich→target integration; filter-enrich-route with no poller code.
Poison record — an item that will never process successfully (unparseable schema, constraint violation); must be evicted so the shard keeps moving.
Idempotent write — a downstream write designed so replay is a no-op (conditional-on-seq, upsert by id, deterministic pk#sk#seq key).
Kinesis Data Streams for DynamoDB — an alternative change feed with retention up to a year and many fan-out consumers, at the cost of best-effort ordering and possible duplicates.
TTL delete — an item expiry that arrives as REMOVE with userIdentity.principalId=dynamodb.amazonaws.com; filter it to avoid replicating housekeeping.
System of record (S3) — the immutable, replayable change log (via Firehose) that survives the 24-hour stream window and lets you rebuild any sink.

Next steps

You can now build a CDC pipeline off DynamoDB Streams that survives replay, isolates poison, preserves per-key order, and fans out cleanly. Build outward:

Next: Designing Event-Driven Architectures with Amazon EventBridge: Buses, Rules, Schemas, and Archive/Replay — the bus you fan out to, with archive/replay that complements the 24-hour stream window.
Related: Resilient Messaging with SQS and SNS: Fan-Out, FIFO Ordering, DLQs, and Poison-Message Handling — the DLQ and poison-handling patterns this article leans on, in depth.
Related: DynamoDB Single-Table Design: Modeling Access Patterns, GSIs, and Hot Partition Avoidance — fix hot keys at the source so no consumer tuning is needed.
Related: A Structured Logging Pipeline on AWS: JSON Logs, CloudWatch Metric Filters, and Firehose to OpenSearch — the Firehose→S3→OpenSearch sink half, built out.
Related: Event-Driven Order Processing with the Saga Pattern on AWS — where CDC events become saga steps in a real workflow.