AWS Lambda Patterns: Event-Driven Functions That Scale to Zero

A startup processed user-uploaded documents on a long-running EC2 instance that sat idle 80% of the day. They moved the pipeline to AWS Lambda — functions triggered by S3 uploads, fanning work out through SQS — and the bill fell 80% while end-to-end latency dropped from minutes to seconds. The catch was not the migration; it was the redesign. A 40-minute batch job had to become a graph of small, idempotent functions because Lambda kills any single invocation at 15 minutes, retries async events on its own schedule, and can deliver the same event twice. The team that wins with Lambda is the team that internalises those three facts before writing a line of handler code.

This is the reference for getting event-driven Lambda right. Lambda runs your code in response to an event, gives it up to 15 minutes and up to 10 GB of memory with proportional CPU, and bills you per millisecond of execution — scaling from zero to thousands of concurrent environments without a server in sight. But “Lambda” is really four different services wearing one name, depending on how it was invoked: a synchronous request/response API, an asynchronous fire-and-forget queue, a poll-based event-source mapping for streams and queues, and destinations/DLQs for the outcomes. Each model has its own retry behaviour, its own error surface, its own ordering and concurrency rules, and its own way of losing your data when you get it wrong. Treat them as one thing and you will ship a pipeline that drops events under load and double-charges customers on retry.

By the end you will stop guessing which model you are in. When an event “disappears,” you will know whether it died in an async retry with no DLQ, was swallowed by a poller that advanced the stream iterator past a poison record, or never arrived because an EventBridge rule pattern was one field too narrow. You will know the exact limits (15-minute timeout, 1,000 default concurrency, 256 KB async payload, 6 MB sync payload, up to 10,000 SQS messages per batch on a standard queue), the precise aws command to confirm each failure, and the Terraform to wire the fix. Because this is a reference you reach for mid-incident, the invocation models, the trigger contracts, the limits, the error codes and the failure playbook are all laid out as scannable tables: read the prose once, keep the tables open when the pipeline is on fire.

What problem this solves

Traditional servers are paid for whether they are busy or not, and they make you responsible for scaling, patching and capacity planning around traffic you cannot predict. An event-driven Lambda architecture removes the server: you write the function that reacts to “a file landed,” “a message arrived,” “a row changed,” “an order was placed,” and AWS runs exactly as many copies as the event rate demands, billing only for the milliseconds they execute. For spiky, intermittent, event-shaped workloads — image processing, ETL steps, webhooks, stream consumers, scheduled jobs, glue between services — this is unbeatable on both cost and operational burden.

What breaks without the patterns, not just the service: teams lift a monolith into one giant function and hit the 15-minute wall; they invoke Lambda synchronously from an API and watch p99 latency spike on cold starts; they trigger directly off a high-volume source with no queue and get throttled into a retry storm; they assume “exactly once” and get duplicate side effects because async and stream invocations are at-least-once. The failure mode is rarely a crash — it is silent: an event that retried into the void because no dead-letter queue was attached, a stream consumer stuck for hours because one poison record blocks the shard, a customer billed twice because the function was not idempotent.

Who hits this: anyone building on serverless past “hello world.” It bites hardest on teams new to the at-least-once delivery contract (idempotency is not optional), on high-throughput stream and queue consumers (batching, concurrency and DLQs are load-bearing), on latency-sensitive synchronous APIs (cold starts and provisioned concurrency), and on anyone fanning one event out to many consumers (SNS vs EventBridge vs SQS is an architecture decision, not a coin flip). The fix is almost never “more memory” — it is “pick the right invocation model, attach the right failure destination, and make the handler idempotent.”

To frame the whole field before the deep dive, here is every event pattern this article covers, the AWS primitive that powers it, and the one trap that defines it:

| Pattern | Powered by | What it’s for | The defining trap |
| --- | --- | --- | --- |
| Synchronous invoke | API Gateway / SDK / ALB | Request/response APIs, low-latency reads | Cold start in the user’s p99; 6 MB payload cap |
| Asynchronous invoke | S3, SNS, EventBridge | Fire-and-forget reactions | At-least-once + retries to nowhere without a DLQ |
| Stream poller | Kinesis, DynamoDB Streams | Ordered change processing | One poison record blocks the whole shard |
| Queue poller | SQS (standard / FIFO) | Buffered, decoupled work | Visibility timeout < 6× function timeout = duplicates |
| Fan-out | SNS / EventBridge | One event → many consumers | Picking the wrong broker (filtering, replay, ordering) |
| Choreography | EventBridge bus + rules | Loosely-coupled service flows | A rule pattern too broad double-delivers |
| Orchestration | Step Functions | Stateful, long, branching flows | Using Lambda chaining where a state machine belongs |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already understand the AWS basics: an IAM role (Lambda assumes an execution role for its permissions), CloudWatch Logs (where every invocation’s logs land), and the core event services at a “what they are” level — S3 buckets, SQS queues, SNS topics, EventBridge buses, and Kinesis/DynamoDB Streams. You should be able to run aws from a shell, read JSON output, and read a basic Terraform resource block. Familiarity with HTTP status codes and the idea of “retry” and “idempotency” helps.

This sits in the Serverless & Event-Driven track. It assumes the compute-model fundamentals — the Compute on AWS: EC2 vs Lambda vs ECS vs EKS decision is upstream of it, and the ECS, EKS & Fargate: Choosing Your Container Path comparison tells you when a long-running container beats a function. It pairs tightly with the front-door choices in ALB vs NLB vs API Gateway, Compared, because API Gateway is the most common synchronous trigger, and with DynamoDB, RDS & Aurora, Compared since DynamoDB is the natural state store for a stateless function (and its streams are a first-class trigger).

A quick map of who owns what when an event-driven pipeline misbehaves, so you look in the right place fast:

| Layer | What lives here | Failure classes it causes | First place to look |
| --- | --- | --- | --- |
| Event source (S3/SQS/EventBridge…) | The producer + delivery contract | Event never arrived; wrong/too-broad routing | Source metrics (e.g. NumberOfMessagesSent) |
| Event-source mapping | The poller config (batch, concurrency) | Stuck shard, throttling, batch too big | aws lambda get-event-source-mapping |
| Function (your code + role) | Handler logic, permissions | Timeout, OOM, unhandled exception, AccessDenied | CloudWatch Logs + X-Ray |
| Concurrency / scaling | Account + reserved limits | 429 TooManyRequests, throttles | ConcurrentExecutions, Throttles metrics |
| Failure destination | DLQ / on-failure target | Silent data loss on retry exhaustion | DLQ depth (ApproximateNumberOfMessages) |
| Downstream (DB/API) | Where the function writes | Connection exhaustion under scale-out | Downstream connection/throttle metrics |

Core concepts

Six mental models make every later decision obvious.

“Lambda” is four services, chosen by how it was invoked. This is the single most important idea in this article. A synchronous invocation (API Gateway, ALB, an SDK Invoke with RequestResponse) blocks the caller and returns the result: the caller owns retries, and Lambda does not retry. An asynchronous invocation (S3, SNS, EventBridge, an SDK Invoke with Event) drops the event onto an internal queue, returns 202 immediately, and Lambda retries on failure (twice by default) before sending it to a DLQ or on-failure destination, or dropping it. A poll-based event-source mapping (Kinesis, DynamoDB Streams, SQS, Kafka, MQ) means Lambda polls the source for you and invokes your function with a batch; its retry, ordering and checkpointing rules are specific to the source, which is why stream pollers and queue pollers are treated as separate models below. Knowing which of the four you are in tells you the retry count, the error surface, the ordering guarantee and where data goes when it fails, before you read another word.

Your function is stateless and ephemeral; the execution environment is reused but not guaranteed. Lambda creates an execution environment (a micro-VM), runs your init code once, then runs the handler per event. AWS may reuse a warm environment for the next event (fast — init already paid) or spin up a new one (a cold start — init runs again). You get no guarantee about reuse, so anything you need across invocations lives in an external store (DynamoDB, S3, ElastiCache) — but you can exploit reuse by initialising expensive things (SDK clients, DB pools, parsed config) in init code, outside the handler, so warm invocations skip them.
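
The init/handler split can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed layout: `ExpensiveClient` is a hypothetical stand-in for an SDK client or connection pool, and `INIT_COUNT` exists only to make environment reuse visible.

```python
import time

INIT_COUNT = 0  # illustrative counter, not something Lambda provides


class ExpensiveClient:
    """Hypothetical stand-in for an SDK client or DB pool that is slow to build."""

    def __init__(self):
        global INIT_COUNT
        INIT_COUNT += 1
        self.created_at = time.time()

    def lookup(self, key):
        return f"value-for-{key}"


# Init code: module scope runs once per execution environment (the cold start).
CLIENT = ExpensiveClient()


def handler(event, context):
    # Handler: runs once per event and reuses CLIENT on warm invocations.
    return {"result": CLIENT.lookup(event["key"]), "inits": INIT_COUNT}
```

Calling `handler` repeatedly in the same process keeps `inits` at 1, which mirrors what a warm environment does. Anything stored on `CLIENT` is a cache, not durable state, because the environment can be replaced at any time.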

Delivery is at-least-once; design for duplicates or be wrong. Async invocations and stream/queue pollers can deliver the same event more than once (a retry after a partial success, a re-drive, a poller redelivery). Only synchronous invocation is “exactly as many times as the caller called.” If processing an event has a side effect — charging a card, incrementing a counter, sending an email — and you do not make it idempotent (safe to run twice with the same result), at-least-once delivery will eventually double it. Idempotency is not a nice-to-have; it is the price of admission to event-driven Lambda.
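
One common way to get idempotency is to claim the event ID with a put-if-absent write before running the side effect. The sketch below assumes the producer supplies a stable `id` field and uses an in-memory dictionary as a stand-in for a real store; in a deployed function a conditional write against a table (for example DynamoDB with an `attribute_not_exists` condition) would typically play this role. All names here are hypothetical.

```python
class IdempotencyStore:
    """In-memory stand-in for a table that supports put-if-absent."""

    def __init__(self):
        self._seen = {}

    def claim(self, event_id):
        # True only for the first caller; a real store must do this atomically.
        if event_id in self._seen:
            return False
        self._seen[event_id] = "IN_PROGRESS"
        return True

    def release(self, event_id):
        self._seen.pop(event_id, None)


STORE = IdempotencyStore()
CHARGES = []  # records the side effect so duplicates are visible


def charge_card(order_id, amount):
    CHARGES.append((order_id, amount))


def handler(event, context):
    event_id = event["id"]  # assumed: a stable ID chosen by the producer
    if not STORE.claim(event_id):
        return {"status": "duplicate-skipped"}
    try:
        charge_card(event["orderId"], event["amount"])
    except Exception:
        STORE.release(event_id)  # let a retry attempt the work again
        raise
    return {"status": "charged"}
```

Delivering the same event twice charges the card once. The release-on-failure step matters: without it, a failed first attempt would block every retry of that event.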

Concurrency is finite, shared, and the real scaling unit. Lambda scales by running more concurrent execution environments — one per simultaneous event. Your account has a default 1,000 concurrent executions across all functions in a region (raisable via quota request). One runaway function can starve every other function in the account. Reserved concurrency caps and guarantees a function’s slice; provisioned concurrency pre-warms a number of environments to kill cold starts. A burst that exceeds your available concurrency gets throttled (429 TooManyRequests), and what happens next depends on the invocation model — the sync caller sees the 429, the async/stream path retries.

Cold start is latency, not an error. A new environment must initialise — download your code/layers, start the runtime, run your init code (SDK clients, DB connect, parsed config). That is the cold start, typically 100 ms–1 s+ depending on runtime, package size and VPC attachment. It is not a 5xx unless it blows a caller’s timeout; it is a slow first request on a fresh environment, fixed by keeping environments warm (provisioned concurrency) or making init cheap (smaller package, SnapStart, fewer/lighter dependencies).

The failure destination is where your data goes when the handler can’t. Every async and poll-based path needs an explicit answer to “what happens to an event the function repeatedly fails to process?” For async invokes that is a dead-letter queue (legacy) or an on-failure destination (preferred — SQS, SNS, EventBridge, or another Lambda, with richer metadata). For SQS event sources it is the queue’s own redrive policy → DLQ. For stream sources it is an on-failure destination plus bisect/age controls. Leave it unset and “failed” means “silently gone” — the single most common way to lose production events.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
| --- | --- | --- | --- |
| Function | Code + runtime + config, the unit of deploy | Lambda service | The thing that runs and scales |
| Handler | The entry point AWS calls per event | Your code | Runs once per event (warm or cold) |
| Init code | Code outside the handler, run once per env | Your code (module scope) | Where you cache clients/pools to beat cold starts |
| Execution environment | The micro-VM that runs your code | Lambda-managed | Reused (warm) or new (cold start) |
| Invocation model | Sync / async / poll-based | Determined by trigger | Sets retry, ordering, error surface |
| Event-source mapping | Lambda’s poller for a stream/queue | Lambda + source | Batching, concurrency, checkpointing |
| Concurrency | Simultaneous environments running | Account + per-function | The scaling unit; throttles when exceeded |
| Reserved concurrency | A function’s guaranteed/capped slice | Per function | Protect others / protect downstream |
| Provisioned concurrency | Pre-warmed environments | Per function/version | Kills cold starts for latency-critical paths |
| DLQ / on-failure dest | Where exhausted events go | Per function / per queue | The difference between “logged” and “lost” |
| Idempotency | Safe to process the same event twice | Your handler design | Required under at-least-once delivery |
| Cold start | First-request latency on a fresh env | Environment lifecycle | Slow first call; can trip caller timeouts |

The four invocation models, end to end

This is the depth anchor: get this table and the four sub-sections right and 80% of event-driven Lambda bugs become obvious. The four models differ on who retries, how many times, in what order, and where data goes when it fails.

| Property | Synchronous | Asynchronous | Stream poll (Kinesis/DDB) | Queue poll (SQS) |
| --- | --- | --- | --- | --- |
| Triggers | API GW, ALB, SDK RequestResponse | S3, SNS, EventBridge, SDK Event | Kinesis, DynamoDB Streams | SQS standard / FIFO |
| Who retries | The caller | Lambda (async queue) | Lambda (poller, in place) | Lambda (poller, via visibility) |
| Default retries | 0 (caller’s job) | 2 (configurable 0–2) | until success or maxRecordAge/retryAttempts | until success or moved to DLQ |
| Ordering | N/A (one call) | None | Per shard / partition key | None (FIFO: per message group) |
| Batching | One event | One event | Batch per shard | Batch (≤10k / 6 MB) |
| Delivery | Exactly as called | At-least-once | At-least-once | At-least-once |
| Payload limit | 6 MB req / 6 MB resp | 256 KB | 6 MB batch | 6 MB batch |
| On exhaustion | Error returned to caller | DLQ / on-failure dest or dropped | On-failure dest or blocks shard | Queue redrive → DLQ |
| Throttle behaviour | Caller gets 429 | Retried with backoff (up to ~6 h) | Poller backs off, shard waits | Poller backs off, messages stay |

Synchronous invocation

The caller sends a request and blocks for the response. API Gateway, ALB, Cognito triggers, and any SDK Invoke with InvocationType=RequestResponse are synchronous. Lambda does not retry — if the function errors or times out, the error goes straight back to the caller, who decides whether to retry. This is the model for request/response APIs where latency matters and the user is waiting, which is exactly why cold starts and the 6 MB payload cap bite here.

# Synchronous invoke — the CLI blocks until the function returns
aws lambda invoke --function-name order-api \
  --invocation-type RequestResponse \
  --payload '{"orderId":"o-123"}' --cli-binary-format raw-in-base64-out \
  response.json

The synchronous-specific limits and behaviours, because they shape your API design:

| Aspect | Value / behaviour | Why it matters |
| --- | --- | --- |
| Request payload | 6 MB | Large uploads must go via S3 + presigned URL, not the body |
| Response payload | 6 MB (buffered) / 20 MB (streamed) | Use response streaming for large responses on supported runtimes |
| Retries by Lambda | None | The caller (API GW, your SDK) owns retry + backoff |
| Timeout visibility | Caller sees the timeout/error | Set function timeout < API GW’s 29 s integration timeout |
| Cold start in path | Yes, in the user’s latency | Provisioned concurrency for latency SLOs |
| Concurrency throttle | 429 to the caller | API GW returns 502/429; client must handle it |
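
Because the synchronous caller owns retries, the client side usually needs its own backoff. The following is a minimal sketch of exponential backoff with full jitter, written against a generic callable rather than any particular SDK. `Throttled`, `call_with_retry` and `flaky_invoke` are hypothetical names, and the `sleep` parameter is injectable only so the example can run without waiting.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error a caller raises when it sees a 429 from Lambda."""


def call_with_retry(invoke, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call invoke(); on Throttled, wait with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Throttled:
            if attempt == max_attempts:
                raise
            # Full jitter: a random delay up to base_delay * 2^(attempt - 1).
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Usage with a fake invoke that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_invoke():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise Throttled("429 TooManyRequests")
    return {"statusCode": 200}


delays = []
result = call_with_retry(flaky_invoke, sleep=delays.append)  # record delays, do not sleep
```

Retrying a synchronous call is only safe when the function behind it is idempotent, since a timeout on the caller's side does not prove the first attempt did nothing.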

Asynchronous invocation

The caller (S3 event, SNS, EventBridge, or InvocationType=Event) hands the event to Lambda’s internal async queue and gets an immediate 202 Accepted — it does not wait for processing. Lambda then invokes your function and, on failure, retries twice by default with backoff, over a window up to 6 hours. If all attempts fail, the event goes to your configured on-failure destination (or legacy DLQ) — and if none is configured, it is dropped. This is the model where events silently disappear.

# Async invoke — returns 202 immediately, processing happens later
aws lambda invoke --function-name thumbnail-generator \
  --invocation-type Event \
  --payload '{"bucket":"uploads","key":"a.png"}' --cli-binary-format raw-in-base64-out \
  /dev/stdout

# Configure retries + an on-failure destination (the critical bit)
aws lambda put-function-event-invoke-config --function-name thumbnail-generator \
  --maximum-retry-attempts 2 --maximum-event-age-in-seconds 3600 \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:thumb-dlq"}}'

# Terraform equivalent: retries, event age and destinations
resource "aws_lambda_function_event_invoke_config" "thumb" {
  function_name                = aws_lambda_function.thumb.function_name
  maximum_retry_attempts       = 2     # 0–2
  maximum_event_age_in_seconds = 3600  # 60–21600 (6h)
  destination_config {
    on_failure  { destination = aws_sqs_queue.thumb_dlq.arn }
    on_success  { destination = aws_sns_topic.thumb_done.arn }  # optional success routing
  }
}

The async knobs and exactly what each controls:

| Setting | Controls | Default | Range / targets | When to change |
| --- | --- | --- | --- | --- |
| MaximumRetryAttempts | Async retries after first failure | 2 | 0–2 | 0 if the source already retries; keep 2 for transient errors |
| MaximumEventAgeInSeconds | How long Lambda keeps retrying | 21,600 (6 h) | 60–21,600 | Lower to fail fast on time-sensitive events |
| OnFailure destination | Where exhausted events go | none → dropped | SQS/SNS/EventBridge/Lambda | Always set in production |
| OnSuccess destination | Route successful outcomes | none | SQS/SNS/EventBridge/Lambda | Event-driven success chains |
| Legacy DeadLetterConfig | Old DLQ (less metadata) | none | SQS/SNS | Prefer on-failure destination instead |

The difference between a legacy DLQ and an on-failure destination matters enough to tabulate:

| Aspect | Legacy DLQ (DeadLetterConfig) | On-failure destination (DestinationConfig) |
| --- | --- | --- |
| Targets | SQS, SNS | SQS, SNS, EventBridge, Lambda |
| Payload | The original event only | Event + invocation context (error, attempts, request id) |
| Success routing | No | Yes (OnSuccess) |
| Recommended | Legacy; avoid for new work | Preferred for all new async functions |
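
The extra context in an on-failure record is what makes triage possible. The sketch below parses one such record as it might arrive in an SQS message body. The field names (`requestContext.condition`, `approximateInvokeCount`, `requestPayload`, `responsePayload`) follow the destination record format as I understand it from the Lambda documentation; treat them as assumptions to verify against a real record, and note that `SAMPLE` is invented data.

```python
import json


def summarise_failure(record_body):
    """Pull the fields an on-call engineer needs out of a destination record."""
    record = json.loads(record_body)
    ctx = record.get("requestContext") or {}
    response = record.get("responsePayload") or {}  # may be null, e.g. on timeout
    return {
        "request_id": ctx.get("requestId"),
        "condition": ctx.get("condition"),            # e.g. RetriesExhausted
        "attempts": ctx.get("approximateInvokeCount"),
        "error": response.get("errorMessage"),
        "original_event": record.get("requestPayload"),
    }


# Invented sample record for illustration only.
SAMPLE = json.dumps({
    "version": "1.0",
    "requestContext": {
        "requestId": "req-1",
        "condition": "RetriesExhausted",
        "approximateInvokeCount": 3,
    },
    "requestPayload": {"bucket": "uploads", "key": "a.png"},
    "responsePayload": {"errorMessage": "boom", "errorType": "RuntimeError"},
})

summary = summarise_failure(SAMPLE)
```

Because `original_event` carries the payload that failed, a redrive tool can re-invoke the function with it once the bug is fixed. A legacy DLQ message gives you only that payload and none of the surrounding context.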

Stream poll (Kinesis & DynamoDB Streams)

Lambda runs a poller that reads records from each shard in order and invokes your function with a batch. Records within a shard (i.e., a given partition key) are processed in order, one batch at a time — which is the whole point and also the whole danger: a poison record that always fails will, by default, be retried until it expires, blocking every record behind it on that shard. You control this with batch size, parallelization, retry attempts, record age, bisect-on-error, and an on-failure destination that receives metadata about the failed batch.

# Create a stream event-source mapping with poison-pill controls
aws lambda create-event-source-mapping --function-name order-projector \
  --event-source-arn arn:aws:kinesis:ap-south-1:111122223333:stream/orders \
  --starting-position LATEST --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 4 \
  --maximum-retry-attempts 3 \
  --maximum-record-age-in-seconds 3600 \
  --bisect-batch-on-function-error \
  --function-response-types ReportBatchItemFailures \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:ap-south-1:111122223333:proj-dlq"}}'

# Terraform equivalent of the mapping above
resource "aws_lambda_event_source_mapping" "proj" {
  event_source_arn                   = aws_kinesis_stream.orders.arn
  function_name                      = aws_lambda_function.projector.arn
  starting_position                  = "LATEST"
  batch_size                         = 100
  maximum_batching_window_in_seconds = 5
  parallelization_factor             = 4        # 1–10 concurrent batches per shard
  maximum_retry_attempts             = 3        # -1 = infinite (the default poison trap)
  maximum_record_age_in_seconds      = 3600     # -1 = infinite
  bisect_batch_on_function_error     = true     # split a failing batch to isolate the bad record
  function_response_types            = ["ReportBatchItemFailures"]  # partial-batch success
  destination_config {
    on_failure {
      destination_arn = aws_sqs_queue.proj_dlq.arn  # receives metadata about the failed batch
    }
  }
}

The stream event-source-mapping controls, the defaults that bite, and when to change them:

| Setting | Controls | Default | The trap if left default | Change to |
| --- | --- | --- | --- | --- |
| BatchSize | Records per invoke | 100 (Kinesis/DDB) | Large batch + one bad record fails all | Tune to processing cost; pair with bisect |
| MaximumBatchingWindow | Wait to fill a batch (s) | 0 | Tiny batches = more invokes/cost | 1–5 s to batch efficiently |
| ParallelizationFactor | Concurrent batches per shard | 1 | Throughput capped at 1/shard | 1–10 (still per-key ordered) |
| MaximumRetryAttempts | Retries before giving up | -1 (infinite) | Poison record blocks shard forever | A finite number (e.g. 3–5) |
| MaximumRecordAge | Drop records older than | -1 (infinite) | Stale records retried endlessly | Bound it (e.g. 1–24 h) |
| BisectBatchOnFunctionError | Split failing batch | false | Whole batch keeps failing together | true (isolates the poison record) |
| ReportBatchItemFailures | Partial-batch success | off | One bad record reprocesses good ones | Return failed IDs; checkpoint past good |
| StartingPosition | Where to begin | n/a | TRIM_HORIZON replays all history | LATEST for new, TRIM_HORIZON to backfill |
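
With ReportBatchItemFailures enabled, the handler tells the poller where to resume by returning the sequence number of the first record that failed. A minimal Python sketch for a Kinesis batch follows; `process` is hypothetical business logic, and the handler stops at the first failure because every later record on the shard has to be retried in order anyway.

```python
import base64
import json


def process(payload):
    # Hypothetical business logic: reject records without an orderId.
    if "orderId" not in payload:
        raise ValueError("missing orderId")


def handler(event, context):
    for record in event["Records"]:
        seq = record["kinesis"]["sequenceNumber"]
        try:
            # Kinesis record data arrives base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            # Report the first failing record; the poller checkpoints just
            # before it and retries from there.
            return {"batchItemFailures": [{"itemIdentifier": seq}]}
    return {"batchItemFailures": []}
```

Records before the reported sequence number are treated as done, so they are not replayed on the retry. An empty `batchItemFailures` list signals that the whole batch succeeded.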

The concurrency math for streams is fixed and worth memorising: concurrency = number of shards × parallelization factor. Ten shards at a parallelization factor of 4 gives up to 40 concurrent invocations of this function from this stream. That ceiling is set by the shape of the stream rather than by the backlog, but every one of those invocations still counts against your account’s concurrency limit.
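
The same arithmetic as a small Python helper, useful as a pre-deploy sanity check. The function names and the `other_usage` parameter are illustrative; the 1,000 default mirrors the account limit quoted earlier.

```python
def stream_concurrency(shards, parallelization_factor=1):
    """Upper bound on concurrent invocations from one stream mapping."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shards * parallelization_factor


def fits_account(shards, parallelization_factor, other_usage, account_limit=1000):
    """Check the stream's ceiling against what is left of the account limit."""
    needed = stream_concurrency(shards, parallelization_factor)
    return needed + other_usage <= account_limit
```

For the example in the text, `stream_concurrency(10, 4)` is 40. If the rest of the account already peaks at 970 concurrent executions, `fits_account(10, 4, 970)` is false, which is a signal to request a quota increase or lower the factor before the stream throttles other functions.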

Queue poll (SQS)

Lambda polls the SQS queue and invokes your function with a batch of up to 10,000 messages on a standard queue (up to 10 on a FIFO queue), bounded by a 6 MB payload. The critical interaction is the visibility timeout: when Lambda reads a message it becomes invisible for that window; if the function succeeds, Lambda deletes it; if it fails (or times out), the message becomes visible again and is redelivered. The rule to follow, a visibility timeout of at least 6× the function timeout, exists so that a slow or throttled invocation does not get the same message redelivered to a second environment while the first is still working (instant duplicates). Messages that fail repeatedly go to the queue’s DLQ via its redrive policy.

# SQS event-source mapping with partial-batch reporting
aws lambda create-event-source-mapping --function-name invoice-worker \
  --event-source-arn arn:aws:sqs:ap-south-1:111122223333:invoices \
  --batch-size 10 --maximum-batching-window-in-seconds 0 \
  --scaling-config '{"MaximumConcurrency":50}' \
  --function-response-types ReportBatchItemFailures

# Queue with a DLQ wired via redrive policy (the SQS-side failure destination)
resource "aws_sqs_queue" "invoices_dlq" { name = "invoices-dlq" }

resource "aws_sqs_queue" "invoices" {
  name                       = "invoices"
  visibility_timeout_seconds = 180  # >= 6 x the 30s function timeout
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.invoices_dlq.arn
    maxReceiveCount     = 5          # attempts before a message goes to the DLQ
  })
}

resource "aws_lambda_event_source_mapping" "invoices" {
  event_source_arn        = aws_sqs_queue.invoices.arn
  function_name           = aws_lambda_function.invoice_worker.arn
  batch_size              = 10
  function_response_types = ["ReportBatchItemFailures"]
  scaling_config { maximum_concurrency = 50 }  # cap concurrent invocations from this queue (2–1000)
}
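
The 6× rule is easy to check mechanically before a deploy. This sketch adds the batching window to the minimum, which is my reading of AWS's guidance rather than something stated above, so treat that term as an assumption; the function names are illustrative.

```python
def min_visibility_timeout(function_timeout_s, batching_window_s=0):
    """Smallest visibility timeout that satisfies the 6x rule of thumb."""
    return 6 * function_timeout_s + batching_window_s


def visibility_ok(visibility_timeout_s, function_timeout_s, batching_window_s=0):
    """True when the queue hides messages long enough for retries to finish."""
    return visibility_timeout_s >= min_visibility_timeout(
        function_timeout_s, batching_window_s
    )
```

For the queue above, `min_visibility_timeout(30)` is 180, so the configured 180 seconds passes. Raising the function timeout to 60 seconds without touching the queue would fail the check.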

The SQS-poller settings and the duplicate/loss traps they govern:

| Setting | Where | Controls | Trap if wrong |
| --- | --- | --- | --- |
| Visibility timeout | Queue | How long a read message is hidden | < 6× function timeout → duplicate processing |
| maxReceiveCount | Queue redrive | Attempts before DLQ | Too high = poison loops; too low = premature DLQ |
| BatchSize | Mapping | Messages per invoke (≤10,000) | Big batch + no partial-fail = reprocess all on one failure |
| ReportBatchItemFailures | Mapping | Return only failed message IDs | Without it, one failure redelivers the whole batch |
| MaximumConcurrency (scaling) | Mapping | Cap concurrent invocations from the queue (2–1,000) | Without it, SQS can stampede a fragile downstream |
| FIFO message group | Queue | Ordering + dedup scope | Wrong group ID serialises unrelated work |
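
For SQS, partial-batch reporting means returning the `messageId` of each message that failed so that only those return to the queue. A minimal Python sketch for a standard queue, with `process` as hypothetical business logic:

```python
import json


def process(body):
    # Hypothetical business logic: invoices must carry a positive amount.
    if body.get("amount", 0) <= 0:
        raise ValueError("invalid amount")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message becomes visible again; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

This only takes effect when the mapping sets `ReportBatchItemFailures`; without it the return value is ignored and a raised exception redelivers the whole batch. On a FIFO queue it is safer to stop at the first failure and report that message plus everything after it, so that ordering within the group is preserved.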

Standard vs FIFO SQS as a Lambda source, because the choice changes ordering, throughput and dedup:

| Aspect | Standard queue | FIFO queue |
| --- | --- | --- |
| Ordering | Best-effort, none guaranteed | Strict, per message group ID |
| Delivery | At-least-once | Exactly-once processing (with dedup) |
| Throughput | Nearly unlimited | 300 msg/s (3,000 batched) per queue without high-throughput mode |
| Dedup | None (you handle it) | 5-minute dedup window (content or ID) |
| Lambda concurrency | Scales with backlog | Bounded by active message groups |
| Use when | Throughput, parallel work | Order matters (per entity), no duplicates |

The trigger-by-trigger contract

Each event source has its own wiring, its own event shape, and its own gotchas. This is the reference matrix — which invocation model each uses, the key limits, and the one thing that catches everyone — followed by the wiring detail for the heavy hitters.

| Source | Invocation model | Key limit / batch | Event shape gotcha | The classic mistake |
| --- | --- | --- | --- | --- |
| API Gateway | Synchronous | 29 s integration timeout; 10 MB payload (REST) | Proxy vs non-proxy integration | Function timeout > 29 s API timeout |
| Application Load Balancer | Synchronous | 1 MB response; no 29 s cap | Must return specific JSON shape | Wrong response structure → 502 |
| S3 | Asynchronous | One event per object (mostly) | No delivery order; possible duplicates | Recursive loop (write back to same bucket) |
| SNS | Asynchronous | 256 KB message | Fan-out; no replay, no ordering | Expecting ordering or filtering richness |
| SQS (standard) | Queue poll | ≤10,000/batch, 6 MB | Partial-batch failures | Visibility < 6× timeout → duplicates |
| SQS (FIFO) | Queue poll | Per-group ordering | Group ID controls parallelism | One group ID serialises everything |
| Kinesis Data Streams | Stream poll | Per-shard order; 100/batch | Iterator advances past poison | Infinite retries block the shard |
| DynamoDB Streams | Stream poll | Per-key order; 100/batch | NEW/OLD image config | Forgetting StreamViewType |
| EventBridge (bus) | Asynchronous | 256 KB; 300 rules/bus | Pattern matching is exact | Pattern too broad → double-delivery |
| EventBridge Scheduler | Asynchronous | One-time or cron/rate | Time zone + flexible windows | Confusing it with EventBridge rules |

S3 → Lambda (asynchronous)

An S3 bucket notification invokes your function asynchronously when an object is created, removed, restored, or replicated. Delivery is typically once but not guaranteed exactly-once, and not ordered — design idempotently. The single most expensive mistake is the recursive loop: a function triggered on s3:ObjectCreated:* that writes a derived object back into the same bucket re-triggers itself, billing you in a runaway until you notice. Scope the prefix/suffix or write to a different bucket.

# Grant S3 permission to invoke, then add the bucket notification
aws lambda add-permission --function-name thumbnail-generator \
  --statement-id s3invoke --action lambda:InvokeFunction \
  --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::uploads-bucket --source-account 111122223333

aws s3api put-bucket-notification-configuration --bucket uploads-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations":[{
      "LambdaFunctionArn":"arn:aws:lambda:ap-south-1:111122223333:function:thumbnail-generator",
      "Events":["s3:ObjectCreated:*"],
      "Filter":{"Key":{"FilterRules":[{"Name":"prefix","Value":"raw/"},{"Name":"suffix","Value":".png"}]}}
    }]}'

The S3 notification options and the gotcha each hides:

| Option | Values | Gotcha |
| --- | --- | --- |
| Event types | ObjectCreated:*, ObjectRemoved:*, ObjectRestore:*, Replication:* | Put vs CompleteMultipartUpload differ; * catches both |
| Prefix/suffix filter | string match | Overlapping filters on one bucket can double-fire |
| Destination | Lambda, SQS, SNS, EventBridge | EventBridge gives richer routing + replay than direct notify |
| Delivery | At-least-once, unordered | Always idempotent; never assume order |
| Recursion guard | scope prefix / separate bucket | Writing back to the trigger bucket = billing loop |
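
A prefix filter on the notification is the first recursion guard; a second check inside the handler is cheap insurance in case the filter is later loosened. The sketch below only plans the work (it does not call S3) so it stays runnable anywhere. `SOURCE_PREFIX` matches the `raw/` filter shown earlier, while `OUTPUT_PREFIX` and `plan_outputs` are hypothetical names.

```python
from urllib.parse import unquote_plus

SOURCE_PREFIX = "raw/"      # matches the notification filter shown earlier
OUTPUT_PREFIX = "thumbs/"   # hypothetical prefix outside the trigger scope


def plan_outputs(event):
    """Return (bucket, source_key, output_key) for each record worth processing."""
    plans = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.startswith(SOURCE_PREFIX):
            continue  # defence in depth: ignore anything the function itself wrote
        output_key = OUTPUT_PREFIX + key[len(SOURCE_PREFIX):]
        plans.append((bucket, key, output_key))
    return plans
```

Because every output key starts with `thumbs/`, a write to the same bucket can never match the `raw/` trigger, and a stray event for a `thumbs/` object is skipped rather than processed again.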

EventBridge → Lambda (asynchronous, the choreography hub)

EventBridge is the event bus for service-to-service choreography. Producers PutEvents; rules match events by a JSON event pattern (exact-match on fields, with content filters); matching events fan out to up to 5 targets per rule — including Lambda. It supports a schema registry, archive and replay, and cross-account/cross-region routing. The defining trap is a pattern that is too broad: two rules whose patterns both match the same event each fire (there is no first-match-wins), so the same fact double-delivers unless your patterns are tight and your consumers idempotent.

# Rule that matches a specific event, targeting a Lambda
aws events put-rule --name order-placed --event-bus-name orders \
  --event-pattern '{"source":["com.acme.orders"],"detail-type":["OrderPlaced"]}'

aws events put-targets --rule order-placed --event-bus-name orders \
  --targets '[{"Id":"fn","Arn":"arn:aws:lambda:ap-south-1:111122223333:function:fulfil","RetryPolicy":{"MaximumRetryAttempts":4,"MaximumEventAgeInSeconds":3600},"DeadLetterConfig":{"Arn":"arn:aws:sqs:ap-south-1:111122223333:eb-dlq"}}]'

aws lambda add-permission --function-name fulfil --statement-id eb \
  --action lambda:InvokeFunction --principal events.amazonaws.com \
  --source-arn arn:aws:events:ap-south-1:111122223333:rule/orders/order-placed
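
The double-delivery trap is easier to see with a toy matcher. The Python sketch below is a deliberately simplified model of pattern matching (exact values and nested objects only, none of EventBridge's content filters), and the `all-orders` rule is a hypothetical second rule added to show the overlap.

```python
def matches(pattern, event):
    """Simplified matcher: every pattern field must list the event's value."""
    for field, allowed in pattern.items():
        if isinstance(allowed, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(event.get(field), dict) or not matches(allowed, event[field]):
                return False
        elif event.get(field) not in allowed:
            return False
    return True


RULES = {
    "order-placed": {"source": ["com.acme.orders"], "detail-type": ["OrderPlaced"]},
    "all-orders": {"source": ["com.acme.orders"]},  # hypothetical, too broad
}


def matching_rules(event):
    """Every rule whose pattern matches fires; there is no first-match-wins."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]
```

An `OrderPlaced` event from `com.acme.orders` matches both rules, so a consumer attached to both receives it twice. Tightening `all-orders` with a `detail-type` list, or making the consumer idempotent, removes the duplicate side effect.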

EventBridge vs SNS vs SQS for fan-out — the actual decision table, since this is the most-asked serverless design question:

| Need | SNS | EventBridge | SQS |
| --- | --- | --- | --- |
| One→many fan-out | Yes (subscriptions) | Yes (rules, 5 targets each) | No (point-to-point) |
| Content-based routing | Limited (message filtering) | Rich (event patterns) | No |
| Replay / archive | No | Yes | No (it is the buffer) |
| Schema registry | No | Yes | No |
| Ordering | FIFO topics (per group) | No | FIFO queues (per group) |
| Buffering / backpressure | No (push) | No (push) | Yes (pull) |
| Throughput | Very high | High (per-account limits) | Very high |
| Latency | Lowest | Low | Pull-interval bound |
| Best for | High-fanout, low-latency push | Service choreography, routing | Decoupling, buffering, retries |

A pattern that combines them, SNS→SQS fan-out, is so common it deserves its own row of reasoning: SNS pushes one event to many SQS queues (fan-out), and each queue buffers for its own Lambda (backpressure + retries + DLQ per consumer). You get SNS’s fan-out and SQS’s durability, which neither gives alone.

DynamoDB Streams & Kinesis (stream poll)

A DynamoDB Stream emits an ordered, per-key record for every item change; a Kinesis Data Stream is a general ordered log. Both use the stream-poll model from the section above. The DynamoDB-specific knob is StreamViewType — whether the record carries the new image, old image, both, or just keys — which you must set when enabling the stream and cannot change without re-enabling.

# Enable a DynamoDB stream with both images, then map it to a function
aws dynamodb update-table --table-name orders \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES

STREAM_ARN=$(aws dynamodb describe-table --table-name orders \
  --query 'Table.LatestStreamArn' --output text)

aws lambda create-event-source-mapping --function-name order-indexer \
  --event-source-arn "$STREAM_ARN" --starting-position LATEST \
  --batch-size 100 --maximum-retry-attempts 3 --bisect-batch-on-function-error

StreamViewType choices and when each is right:

StreamViewType Record contains Use when
KEYS_ONLY Just the key attributes You’ll re-read the item; minimise stream size
NEW_IMAGE Item after the change Projections, search indexing, caches
OLD_IMAGE Item before the change Auditing what was deleted/overwritten
NEW_AND_OLD_IMAGES Both Change-data-capture, diff logic, full audit
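To make the view types concrete, here is a small sketch of the diff logic that NEW_AND_OLD_IMAGES enables. The helper name `changed_attributes` is hypothetical; the record layout (`dynamodb.OldImage`, `dynamodb.NewImage`) is the standard DynamoDB Streams event shape, and values are left in DynamoDB's typed JSON form.

```python
def changed_attributes(stream_record: dict) -> dict:
    """Map attribute name -> (old, new) for every attribute that differs.

    A missing image is treated as empty, so the same function degrades
    gracefully on INSERT (no OldImage) and REMOVE (no NewImage).
    """
    change = stream_record["dynamodb"]
    old = change.get("OldImage", {})  # absent on INSERT or with NEW_IMAGE / KEYS_ONLY
    new = change.get("NewImage", {})  # absent on REMOVE or with OLD_IMAGE / KEYS_ONLY
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

With KEYS_ONLY both images are absent and the result is always empty, which is the signal that the handler must re-read the item instead.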

Kinesis vs DynamoDB Streams as a Lambda source:

Aspect DynamoDB Streams Kinesis Data Streams
Source of records Table item changes Anything you PutRecord
Retention 24 h 24 h–365 d (configurable)
Ordering Per item key Per partition key
Shards Managed by the table You provision (or on-demand)
Multiple consumers Limited Enhanced fan-out (per-consumer throughput)
Use for CDC off a table General event log, multi-consumer streaming

Fan-out, fan-in & choreography patterns

The patterns that turn single functions into systems. Each has a shape, a primitive, and a failure mode.

Pattern Shape Built with Why use it Failure mode to guard
Simple fan-out 1 event → N consumers SNS or EventBridge Decouple producers from consumers Lost delivery to one consumer (per-target DLQ)
Fan-out + buffer 1 event → N queues → N fns SNS→SQS Backpressure + retry per consumer Queue without redrive = stuck poison
Fan-in / aggregation N events → 1 store Lambda → DynamoDB Collate results Race on concurrent writes (use conditional updates)
Pipe / transform source → filter → target EventBridge Pipes Point-to-point with enrichment Filter too loose forwards noise
Choreography services react to events EventBridge bus + rules Loose coupling, autonomy No central view; hard to trace
Orchestration central state machine Step Functions Long, branching, stateful flows Chaining Lambdas instead (no visibility)
Saga (compensation) distributed txn + undo Step Functions / EventBridge Multi-service consistency Missing compensating action on failure

Choreography vs orchestration — the architecture fork

The most consequential design decision in event-driven systems. Choreography (EventBridge): each service emits events and reacts to others’ events; no central controller; maximum autonomy and loose coupling — but no single place to see “where is order o-123 in the flow?” Orchestration (Step Functions): a central state machine invokes each step, holds the state, branches, retries, and gives you a visual execution history — at the cost of a coordinator that knows about every step. Lambda chaining (function A invokes function B invokes C) is the worst of both: orchestration logic smeared across functions with no visibility and no built-in retry/branching.

Dimension Choreography (EventBridge) Orchestration (Step Functions) Lambda chaining (anti-pattern)
Coupling Loosest Coupled to the state machine Tightly, invisibly coupled
Visibility of flow Low (trace across events) High (execution history) None
Long/branching flows Hard Native (Map, Choice, Parallel) Painful, error-prone
Built-in retry/catch Per-target only Per-state Hand-rolled in each function
Best for Reactive, autonomous services Defined business processes Almost never

On the orchestration side, the Step Functions state-machine model is the tool you reach for when a flow has more than two or three steps, has branches, or needs a human-readable history.

At-least-once, idempotency & not losing events

This is where most production incidents live. Three disciplines: make the handler idempotent, attach a failure destination on every async/poll path, and stop poison records from blocking streams.

Idempotency — the non-negotiable

Under at-least-once delivery, the same event will arrive twice eventually. An idempotent handler produces the same result whether it runs once or five times. The canonical implementation: derive an idempotency key from the event (message ID, event ID, or a hash of the meaningful fields), and use a conditional write to a store so the second attempt is a no-op.

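# DynamoDB conditional put as an idempotency guard
import hashlib
import json
import time

import boto3

ddb = boto3.client("dynamodb")  # init code: reused across warm invocations
DEDUP_WINDOW_S = 24 * 3600  # assumed window; must exceed the source's max retry age

def _ttl():
    # Epoch seconds after which DynamoDB TTL may expire the dedup entry
    return int(time.time()) + DEDUP_WINDOW_S

def handler(event, _ctx):
    for record in event["Records"]:
        key = record.get("messageId") or hashlib.sha256(
            json.dumps(record["body"], sort_keys=True).encode()).hexdigest()
        try:
            ddb.put_item(
                TableName="processed-events",
                Item={"id": {"S": key}, "ttl": {"N": str(_ttl())}},
                ConditionExpression="attribute_not_exists(id)")  # fails on duplicate
        except ddb.exceptions.ConditionalCheckFailedException:
            continue  # already processed: skip the side effect
        # do_the_side_effect is a placeholder for your business logic. The key is
        # claimed before it runs, so if it raises, a retry will skip this record;
        # delete the key on failure if the effect must eventually happen.
        do_the_side_effect(record)  # runs at most once per key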

Idempotency strategies and their trade-offs:

Strategy How it works Pro Con
Conditional write (DynamoDB) First write wins; dupes fail the condition Strong, simple, TTL-able Extra write per event
Natural idempotency Operation is inherently safe (PUT to a key) Free Only some operations qualify
Dedup table + TTL Store seen IDs, expire them Bounded storage Window must exceed max retry age
SQS FIFO dedup 5-min content/ID dedup Built-in 5-min window only; FIFO throughput limits
Powertools Idempotency Library wraps the handler Battle-tested, persistence-backed Adds a dependency + a store

Where events go when they fail — the destination matrix

Every async and poll path needs an explicit failure destination, or “failed” means “gone.” Map yours:

Invocation path Failure mechanism Configure If unset
Async (S3/SNS/EventBridge) On-failure destination / DLQ put-function-event-invoke-config Event dropped after retries
EventBridge target Per-target DLQ + retry policy DeadLetterConfig on the target Event dropped after target retries
SQS source Queue redrive → DLQ Queue redrivePolicy Message retried up to maxReceiveCount, then moved to the DLQ; with no redrive policy it is redelivered until retention expires, then deleted
Kinesis/DDB source On-failure destination mapping DestinationConfig Shard blocked (infinite default retries)
SNS subscription Subscription redrive → DLQ SNS RedrivePolicy Delivery attempts exhausted, message lost
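For the async row, the configuration can be sketched in Python as a builder for the arguments that boto3's `put_function_event_invoke_config` accepts. The helper names `async_failure_config` and `apply_config`, the function name and the ARN in the usage line are hypothetical; the range checks reflect the documented bounds for async invocation (0 to 2 retries, 60 to 21,600 seconds of event age).

```python
def async_failure_config(function_name: str, on_failure_arn: str,
                         max_retries: int = 2, max_event_age_s: int = 3600) -> dict:
    """Build keyword arguments for Lambda's PutFunctionEventInvokeConfig."""
    if not 0 <= max_retries <= 2:
        raise ValueError("async MaximumRetryAttempts must be 0, 1 or 2")
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


def apply_config(config: dict) -> None:
    # Imported lazily so the builder above can be exercised without AWS access.
    import boto3
    boto3.client("lambda").put_function_event_invoke_config(**config)


# Hypothetical usage: build the config, review it, then call apply_config(cfg).
cfg = async_failure_config("order-worker", "arn:aws:sqs:ap-south-1:123456789012:order-dlq")
```

Keeping the builder separate from the API call makes the failure destination easy to assert in a unit test before anything is deployed.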

Stopping poison records on streams

A poison record is one that always fails. On a stream, the default MaximumRetryAttempts=-1 (infinite) means it blocks its shard until the record ages out of the stream, and every record behind it waits. Five controls fix this, used together:

Control What it does Set to
MaximumRetryAttempts Cap retries before skipping/destination A finite number (e.g. 3–5)
MaximumRecordAge Skip records older than N seconds Bound it (e.g. 3,600–86,400)
BisectBatchOnFunctionError Halve a failing batch to isolate the bad record true
ReportBatchItemFailures Report only the failed record IDs Return them; good records checkpoint
DestinationConfig.OnFailure Send the failed batch’s metadata somewhere An SQS/SNS DLQ for inspection
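The ReportBatchItemFailures row is the one that needs handler cooperation. Below is a minimal sketch for a stream source, assuming the mapping has ReportBatchItemFailures enabled; `process_stream_batch` and the `process` callback are hypothetical names. On streams the item identifier is the record's sequence number, and because Lambda retries from the lowest failed sequence number, the sketch stops at the first failure instead of collecting them all.

```python
def _sequence_number(record: dict) -> str:
    # Kinesis and DynamoDB Streams carry the sequence number in different places.
    if "kinesis" in record:
        return record["kinesis"]["sequenceNumber"]
    return record["dynamodb"]["SequenceNumber"]


def process_stream_batch(event: dict, process) -> dict:
    """Process records in order and report the first failure, if any."""
    for record in event["Records"]:
        seq = _sequence_number(record)
        try:
            process(record)
        except Exception as exc:  # any failure stops the batch at this record
            print("failed", seq, exc)
            return {"batchItemFailures": [{"itemIdentifier": seq}]}
    return {"batchItemFailures": []}  # empty list: checkpoint the whole batch
```

Records before the failed one are checkpointed, so the retry starts at the poison record itself, where MaximumRetryAttempts and the on-failure destination can deal with it.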

Concurrency, scaling & protecting downstream

Concurrency is the scaling unit and the thing that takes down your database. Three numbers govern it: the account concurrency limit (default 1,000/region, shared), reserved concurrency (a per-function guaranteed cap), and provisioned concurrency (pre-warmed environments). A fourth, the burst concurrency rate, governs how fast you can scale into that ceiling.

# Reserve 100 concurrent executions for a function (guarantees AND caps it)
aws lambda put-function-concurrency --function-name payment-worker \
  --reserved-concurrent-executions 100

# Provision 20 always-warm environments on a version/alias (kills cold starts)
aws lambda put-provisioned-concurrency-config --function-name order-api \
  --qualifier live --provisioned-concurrent-executions 20

# Check current usage against the account limit
aws lambda get-account-settings \
  --query 'AccountLimit.{Concurrent:ConcurrentExecutions,Unreserved:UnreservedConcurrentExecutions}'

# The same reservation and provisioning expressed in Terraform
resource "aws_lambda_function" "payment" {
  function_name                  = "payment-worker"
  reserved_concurrent_executions = 100   # cap + guarantee; 0 would DISABLE the function
  # ...
}

resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.order_api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 20
}

The four concurrency levers, side by side:

Lever What it does Cost Set when
Account limit (1,000) Ceiling across all functions in a region n/a Raise via Service Quotas before you need it
Reserved concurrency Caps a function AND carves it out of the pool Free (just allocation) Protect a downstream DB / protect other functions
Provisioned concurrency Pre-warms N environments Paid hourly even idle Latency-critical sync paths with cold-start SLOs
Burst concurrency Scale-up rate: each function can add up to 1,000 concurrent executions every 10 seconds n/a Understand it; you can’t raise it

Critical gotchas, because each has bitten teams in production:

Gotcha What happens Fix
reserved_concurrent_executions = 0 Disables the function entirely Use null/unset for “no reservation,” not 0
One function reserves 900 of 1,000 Every other function shares 100 Reserve deliberately; monitor UnreservedConcurrentExecutions
Lambda scales faster than RDS allows Connection storm → DB at max_connections Reserved concurrency cap + RDS Proxy for pooling
Provisioned concurrency on $LATEST Not allowed — needs a version/alias Publish a version, point an alias, provision the alias
Stream concurrency surprise shards × parallelization, not 1 Size downstream for shards×factor

The downstream-protection pattern deserves emphasis: Lambda will happily open 1,000 concurrent connections to an RDS instance that allows 100, and the database falls over. The fix is a reserved concurrency cap sized to the database’s connection budget, plus RDS Proxy to pool and reuse connections so 1,000 functions share a small pool. DynamoDB, being serverless, scales with you — which is one reason it pairs so naturally with Lambda.
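The sizing arithmetic behind that cap can be sketched as follows. All three helper names are hypothetical, and the figures in the usage lines (360 ms average duration, 20 connections held back for other clients) are illustrative assumptions, not measurements. The first function is the standard relation: concurrency equals arrival rate times average duration.

```python
def required_concurrency(events_per_second: int, avg_duration_ms: int) -> int:
    """Concurrency demanded by a load: rate x duration, rounded up."""
    return -(-events_per_second * avg_duration_ms // 1000)  # integer ceiling


def reserved_cap(db_max_connections: int, held_back: int, conns_per_env: int = 1) -> int:
    """Largest reserved concurrency that stays inside the DB connection budget."""
    usable = db_max_connections - held_back
    if usable < conns_per_env:
        raise ValueError("no connection budget left for the function")
    return usable // conns_per_env


def stream_concurrency(shards: int, parallelization_factor: int = 1) -> int:
    """Peak concurrent invocations a stream mapping can reach."""
    return shards * parallelization_factor


demand = required_concurrency(2500, 360)  # 900 environments wanted at the spike
cap = reserved_cap(100, held_back=20)     # 80 environments the database can take
print(demand, cap, demand > cap)
```

When demand exceeds the cap, the difference has to be absorbed somewhere other than the database, which is the argument for a queue in front of the function.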

Cold starts: causes, costs & cures

A cold start is the latency of initialising a new execution environment. It is not an error — but on a synchronous, user-facing path it shows up in p99 and can trip an upstream timeout. First, what actually consumes the cold-start budget:

Cost component Typical magnitude Reduce by Trade-off
Code/layer download 10s–100s ms (size-dependent) Smaller package; fewer/lighter deps Build discipline
Runtime bootstrap 50–400 ms (varies by runtime) Choose a faster runtime; SnapStart Language/ecosystem constraints
VPC ENI attach Now ~ms (Hyperplane) (Mostly solved) historically the big one n/a
Your init code 10 ms–1 s+ Lazy-init non-critical clients; cache config First real call may pay deferred cost
First DB connect 10s–100s ms Pool in init; use serverless/proxy DBs Connection still primes once

The cure menu, ranked by cost and effort:

Technique What it does Cost Effort Best for
Smaller package / fewer deps Less to download + init Free Medium Every function
Init clients in module scope Warm invocations skip init Free Trivial Every function
Right-size memory up More CPU → faster init + run Pay per GB-s (may lower total cost) Trivial CPU-bound init
Provisioned concurrency Pre-warmed envs, no cold start Paid hourly Low Latency-SLO sync APIs
SnapStart (Java, .NET, Python) Snapshot a warmed env, restore fast Free (some restore cost) Low JVM/.NET cold-start pain
Avoid heavy frameworks in handler Less per-invoke overhead Free Medium High-RPS functions

Runtime cold-start characteristics, roughly, because runtime choice is a cold-start decision:

Runtime Relative cold start SnapStart support Notes
Node.js / Python Fast Python: yes The serverless default for latency
Go / Rust (provided.al2) Fast n/a (already fast) Compiled, tiny, quick init
Java Slow (JVM + JIT) Yes — big win SnapStart cuts it dramatically
.NET Slow-ish Yes SnapStart / ReadyToRun help

The memory-CPU coupling is the under-used lever: Lambda allocates CPU proportional to memory, so a function that is CPU-bound during init often runs faster and cheaper at 1,024 MB than at 256 MB, because it finishes in a fraction of the time. Profile with AWS Lambda Power Tuning rather than defaulting to 128 MB.
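The claim reduces to simple arithmetic: the duration charge is proportional to memory times billed duration, so more memory is cheaper whenever duration falls by a larger factor than memory rises. A sketch with hypothetical measurements (the durations below are invented for illustration; real ones come from profiling):

```python
def gb_seconds(memory_mb: int, billed_duration_ms: int) -> float:
    """Billed compute for one invocation, in GB-seconds."""
    return (memory_mb * billed_duration_ms) / (1024 * 1000)


def cheapest(measurements: dict) -> int:
    """Return the memory size (MB) with the lowest GB-seconds per invocation."""
    return min(measurements, key=lambda mb: gb_seconds(mb, measurements[mb]))


# Hypothetical measurements: memory size (MB) -> billed duration (ms).
measured = {256: 2000, 512: 900, 1024: 400, 2048: 350}
for mb, ms in measured.items():
    print(mb, "MB", ms, "ms", round(gb_seconds(mb, ms), 3), "GB-s")
print("cheapest:", cheapest(measured), "MB")
```

In this made-up data the curve bottoms out at 1,024 MB: doubling again to 2,048 MB barely shortens the run, so the cost per invocation rises. The per-request charge is the same at every size and is left out.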

Limits & quotas reference

The numbers you will hit. Keep this open when sizing or debugging a “why did it stop” mystery:

Limit Value Hard/soft What hitting it looks like
Function timeout 15 min (900 s) Hard Invocation killed mid-work; Task timed out
Memory 128 MB – 10,240 MB Hard OOM kill; Runtime exited / errno 137
Ephemeral /tmp 512 MB – 10,240 MB Configurable No space left on device
Sync request payload 6 MB Hard RequestEntityTooLarge
Async payload 256 KB Hard Event rejected at invoke
Deployment package (zipped, direct) 50 MB Hard Upload rejected; use S3
Deployment package (unzipped) 250 MB Hard Use container image (up to 10 GB) instead
Container image 10 GB Hard Bigger won’t deploy
Layers per function 5 Hard Consolidate layers
Account concurrency (region) 1,000 default Soft (raisable) 429 TooManyRequestsException
Burst concurrency +1,000 concurrent executions per function every 10 s Hard Throttles during a sharp spike
Environment variables size 4 KB total Hard Move config to SSM/Secrets Manager
ENI per function (VPC) scales (Hyperplane) Managed (Historically a hard limit)
/tmp + invocations per-env, reused n/a Stale state across warm invokes

Error & status-code reference

Every error you realistically see, what it means, how to confirm, and the fix:

Error / code Meaning Likely cause Confirm with Fix
429 TooManyRequestsException Throttled Concurrency limit hit Throttles metric; account settings Raise quota; reserved concurrency; backoff
Task timed out after N seconds Function exceeded timeout Slow work / hung downstream CloudWatch Logs END vs timeout Raise timeout (≤900 s); fix the slow call
Runtime exited (errno 137) OOM killed Memory too low / leak Logs “Runtime exited”; MaxMemoryUsed Increase memory; fix leak
AccessDeniedException IAM denied Execution role missing a permission CloudTrail; the log’s denied action Add the action to the role
ResourceConflictException Concurrent update Two deploys/updates at once Activity; deploy logs Serialise deploys
EventSourceMapping ... Disabled Poller stopped Repeated failures / manual disable get-event-source-mapping State Fix function; re-enable
ProvisionedConcurrencyConfigNotFound PC not on this qualifier Provisioned $LATEST or wrong alias get-provisioned-concurrency-config Provision a version/alias
KMSAccessDeniedException Can’t decrypt env vars Role lacks KMS key access Logs at init Grant kms:Decrypt on the key
Lambda was unable to decompress... Bad package Corrupt/oversized zip Deploy output Rebuild; use container image
Calls to <fn> are being throttled (async) Async backlog throttled Downstream of an async flood Throttles; invocations queued Reserved concurrency; smooth the source
Empty receive / no invokes (SQS) Poller not pulling Mapping disabled; permissions get-event-source-mapping; role sqs:* Enable mapping; grant SQS perms
Stale $LATEST behind alias Wrong code served Alias points to old version get-alias Update alias to the new version

Architecture at a glance

The diagram traces a real event-driven order pipeline left to right and maps the four invocation models onto the exact hops where each one fails. Read it as the path an event actually takes. On the left, producers emit facts: an API Gateway call places an order (a synchronous invoke of the intake function), and an S3 upload of an attachment fires an asynchronous invoke of a processor. Those producers land on the ingestion & buffering zone — an SQS queue absorbs the order workload (the queue-poll model, with a DLQ on its redrive policy) and an SNS topic fans the “order placed” fact out to interested consumers. The EventBridge custom bus in the routing zone is the choreography hub: rules pattern-match the event and fan it to up to five targets, with a per-target DLQ catching exhausted deliveries.

From there the processing zone is where the worker functions run — a fulfilment Lambda (async, retries to an on-failure destination), a projection Lambda fed by DynamoDB Streams (the stream-poll model, where a poison record can block the shard), and a Kinesis consumer for the analytics tap (also stream-poll). Everything converges on the state & failure zone: a DynamoDB table as the idempotency store and projection target, and the DLQs that are the difference between a logged failure and a lost event. The numbered badges sit on the five places an event silently dies or duplicates — a throttle at the concurrency ceiling, an async retry with no destination, a poison record on the stream, a visibility-timeout duplicate on the queue, and a too-broad EventBridge rule that double-delivers. The legend narrates each as symptom, the metric that confirms it, and the fix.

Figure: AWS event-driven Lambda reference architecture. Producers (API Gateway synchronous invoke, S3 asynchronous invoke) feed an ingestion and buffering zone (SQS queue with DLQ redrive, SNS fan-out topic), then an EventBridge custom bus with pattern-matching rules and per-target dead-letter queues, then a processing zone of worker Lambdas (fulfilment, DynamoDB Streams projection, Kinesis analytics consumer), converging on a state and failure zone (DynamoDB idempotency and projection table plus dead-letter queues). Numbered badges mark the five silent-failure points: concurrency throttle, async retry with no destination, stream poison record blocking a shard, SQS visibility-timeout duplicate, and a too-broad EventBridge rule double-delivering.

Real-world scenario

Parcelo, a fictional last-mile delivery startup in Bengaluru, ran its parcel-event pipeline on a single 8-vCPU EC2 instance: a Python worker that polled a queue, processed scan events from courier apps, updated a Postgres database, and pushed notifications. Traffic averaged 200 events/second with a 7pm surge to ~2,500/second as the evening delivery wave finished. The instance sat near-idle overnight, cost about ₹14,000/month running 24×7, and — worse — during the evening surge it fell behind, processing events minutes late, so customers saw “out for delivery” long after the parcel arrived. The four-engineer platform team decided to go event-driven on Lambda.

The first cut was naive and instructive. They pointed a Lambda directly at the courier SNS topic (asynchronous) and had it write straight to Postgres. It worked in testing. In the first evening surge it fell apart: the function scaled to ~900 concurrent environments, each opened a Postgres connection, and the database hit max_connections and started refusing — so functions errored, Lambda retried the async events, and the retry storm made it worse. Meanwhile a malformed scan event from one buggy courier app build threw on every attempt; with no on-failure destination configured, those events simply vanished after two retries. The team had reproduced two textbook traps at once: a concurrency stampede on a non-serverless downstream, and silent async data loss.

The breakthrough was redesigning around the invocation models rather than fighting them. They inserted an SQS queue between SNS and the worker (the SNS→SQS fan-out pattern), switching the worker to the queue-poll model. That gave them three things at once: a buffer that absorbed the 2,500/s surge instead of stampeding, reserved concurrency capped at 80 (sized to the database’s connection budget) so Lambda could never open more connections than Postgres allowed, and a DLQ via redrive policy (maxReceiveCount=5) so the poison events landed somewhere inspectable instead of disappearing. They made the handler idempotent with a DynamoDB conditional write on the scan event’s ID, because at-least-once delivery meant duplicates were now expected, not exceptional. Finally, they put RDS Proxy in front of Postgres so the 80 concurrent workers shared a small pooled connection set.

The numbers told the story. The evening surge now drained through the queue with sub-second processing latency end to end; the database never exceeded 80 connections; the DLQ caught exactly the malformed events (which turned out to be one courier app version, fixed at the source) with full payloads for replay. Cost fell to about ₹3,800/month — Lambda billed only for the milliseconds of actual processing, near-zero overnight, scaling to the surge automatically. The lesson the team wrote on the wall: “Don’t point a function at a fragile thing. Buffer it, cap it, make it idempotent, and give failures somewhere to land.”

The incident and redesign as a timeline, because the order of the fixes is the lesson:

Stage What they did Result What it should have been
v0 (EC2) One 24×7 worker ₹14,000/mo, falls behind at surge
v1 (naive Lambda) SNS → Lambda → Postgres direct Stampede; DB refuses connections Buffer with SQS first
v1 failure No on-failure destination Malformed events vanish DLQ on every async/queue path
Fix 1 Insert SQS (SNS→SQS) Surge buffered, no stampede The core architectural fix
Fix 2 Reserved concurrency = 80 DB connections bounded Size to the downstream budget
Fix 3 DLQ via redrive (maxReceiveCount 5) Poison events captured Never lose an event silently
Fix 4 Idempotent handler (DDB conditional) Duplicates harmless At-least-once demands it
Fix 5 RDS Proxy 80 workers share a pool Pool connections to non-serverless DBs
Outcome ₹3,800/mo, sub-second latency The fix was design, not bigger compute

Advantages and disadvantages

The event-driven serverless model both unlocks huge wins and introduces failure modes you must design against. Weigh it honestly:

Advantages (why this model wins) Disadvantages (why it bites)
Pay per millisecond of execution; near-zero cost when idle At sustained high RPS, a container can be cheaper than per-invoke billing
Scales from zero to thousands of environments with no capacity planning A scale-out stampede can overwhelm any non-serverless downstream (RDS)
Each event source is a first-class, declarative trigger Each source has its own retry/ordering/error contract you must learn
Built-in retries and DLQs for async/poll paths “Failed” means “silently gone” unless you configure a destination
Stateless functions are trivially horizontally scalable At-least-once delivery means you must build idempotency
Cold starts now small for most runtimes; provisioned concurrency for the rest Cold starts still hurt latency-critical synchronous paths
Tight integration with the whole AWS event ecosystem Observability is fragmented across many small functions
15-min timeout fits most event reactions Long/heavy jobs hit the wall — wrong tool

The model is right for event-shaped, intermittent, spiky workloads where you want to ship reactions, not operate servers — and where the work decomposes into small, idempotent steps. It bites hardest on sustained high-throughput workloads (where per-invoke billing loses to a reserved container), on functions fronting fragile non-serverless downstreams (without concurrency caps and pooling), and on teams that haven’t internalised at-least-once delivery (duplicates and silent loss). The disadvantages are all manageable — but only if you design for them, which is the entire point of the patterns above. When the workload is long-running, stateful, or steady-state high-CPU, the container path is the honest answer.

Hands-on lab

Build a real, free-tier-friendly buffered consumer, the SQS→Lambda leg of the SNS→SQS fan-out pattern: messages sent to an SQS queue drive a Lambda that writes an idempotent record to DynamoDB, with a DLQ catching failures. Run in a shell with the AWS CLI configured; everything here is within Free Tier for a short test, and we tear it down at the end.

Step 1 — Variables.

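export R=ap-south-1 ACC=$(aws sts get-caller-identity --query Account --output text)
export AWS_DEFAULT_REGION=$R  # the SQS commands below omit --region; this keeps every resource in one region
export PFX=lab-evt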

Step 2 — Create the DynamoDB idempotency/projection table.

aws dynamodb create-table --table-name ${PFX}-events \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST --region $R

Expected: a TableDescription with TableStatus: CREATING, soon ACTIVE.

Step 3 — Create the work queue and its DLQ with a redrive policy.

DLQ_URL=$(aws sqs create-queue --queue-name ${PFX}-dlq --query QueueUrl --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)
Q_URL=$(aws sqs create-queue --queue-name ${PFX}-work \
  --attributes "{\"VisibilityTimeout\":\"180\",\"RedrivePolicy\":\"{\\\"deadLetterTargetArn\\\":\\\"$DLQ_ARN\\\",\\\"maxReceiveCount\\\":\\\"5\\\"}\"}" \
  --query QueueUrl --output text)
Q_ARN=$(aws sqs get-queue-attributes --queue-url $Q_URL --attribute-names QueueArn --query Attributes.QueueArn --output text)

Note the visibility timeout 180s ≥ 6× the 30s function timeout — the duplicate-prevention rule from the SQS section, made concrete.

Step 4 — Create the execution role.

aws iam create-role --role-name ${PFX}-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name ${PFX}-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam attach-role-policy --role-name ${PFX}-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess

Step 5 — Package and deploy the idempotent function.

cat > handler.py <<'PY'
import boto3, json, os
ddb = boto3.client("dynamodb")  # init code: reused on warm invokes
T = os.environ["TABLE"]
def handler(event, _):
    failures = []
    for r in event["Records"]:
        mid = r["messageId"]
        try:
            ddb.put_item(TableName=T, Item={"id": {"S": mid}},
                ConditionExpression="attribute_not_exists(id)")
            print("processed", mid)
        except ddb.exceptions.ConditionalCheckFailedException:
            print("duplicate, skipped", mid)  # idempotent: no double side effect
        except Exception as e:
            print("error", mid, str(e))
            failures.append({"itemIdentifier": mid})  # partial-batch failure
    return {"batchItemFailures": failures}
PY
zip fn.zip handler.py
sleep 10  # let the role propagate
aws lambda create-function --function-name ${PFX}-worker \
  --runtime python3.12 --handler handler.handler --timeout 30 --memory-size 256 \
  --role arn:aws:iam::${ACC}:role/${PFX}-role \
  --environment "Variables={TABLE=${PFX}-events}" \
  --zip-file fileb://fn.zip --region $R

Step 6 — Map the queue to the function with partial-batch reporting.

aws lambda create-event-source-mapping --function-name ${PFX}-worker \
  --event-source-arn $Q_ARN --batch-size 10 \
  --function-response-types ReportBatchItemFailures --region $R

Step 7 — Send a message twice; prove idempotency.

aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
aws sqs send-message --queue-url $Q_URL --message-body '{"scan":"SC-1"}'
sleep 8
aws logs tail /aws/lambda/${PFX}-worker --since 2m --region $R
# Expect two "processed" lines: each send-message call gets its own messageId, so these are
# two distinct messages and the table count is 2. "duplicate, skipped" appears only on a redelivery.
aws dynamodb scan --table-name ${PFX}-events --select COUNT --region $R

Each SQS message gets its own messageId, so two sends are two distinct messages, not a duplicate. The duplicate path runs only when SQS redelivers the same message, for example when the visibility timeout expires while an invocation is still running. When that happens, the conditional write turns the second attempt into a safe no-op, which is exactly what protects you under at-least-once delivery.
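Because a genuine redelivery is awkward to force on demand, the duplicate path can be exercised locally instead. This is a sketch only: `FakeTable` is a hypothetical in-memory stand-in for the lab's DynamoDB table, and `handle` copies the lab handler's control flow without calling AWS.

```python
class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class FakeTable:
    """In-memory table; put_if_absent mimics put_item with
    ConditionExpression="attribute_not_exists(id)"."""

    def __init__(self):
        self.items = {}

    def put_if_absent(self, item_id: str) -> None:
        if item_id in self.items:
            raise ConditionalCheckFailed(item_id)
        self.items[item_id] = {"id": item_id}


def handle(event: dict, table: FakeTable):
    """Return (lambda_response, outcomes) using the lab handler's logic."""
    outcomes, failures = [], []
    for record in event["Records"]:
        mid = record["messageId"]
        try:
            table.put_if_absent(mid)
            outcomes.append(("processed", mid))
        except ConditionalCheckFailed:
            outcomes.append(("duplicate, skipped", mid))  # no second side effect
        except Exception:
            failures.append({"itemIdentifier": mid})
    return {"batchItemFailures": failures}, outcomes


table = FakeTable()
batch = {"Records": [{"messageId": "m-1", "body": '{"scan":"SC-1"}'}]}
_, first = handle(batch, table)   # first delivery
_, second = handle(batch, table)  # redelivery of the same messageId
print(first, second, len(table.items))
```

The first call records the message, the second is recognised as a duplicate, and the table still holds a single item.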

Validation checklist. You created a buffered, idempotent, DLQ-backed consumer: an SQS source (queue-poll model), a visibility timeout sized to the function timeout, partial-batch failure reporting so one bad record doesn’t reprocess the batch, a DLQ via redrive policy so poison messages land somewhere, and a DynamoDB conditional write for idempotency. The lab steps mapped to what each proves:

Step What you did What it proves
3 Queue with redrive + 180s visibility Visibility ≥ 6× timeout; failures have a DLQ
5 boto3.client in init scope Warm invokes skip client init (cold-start lever)
5 Conditional put Idempotency under at-least-once delivery
5/6 ReportBatchItemFailures One bad record doesn’t reprocess the whole batch
7 Send + scan COUNT Each distinct messageId yields exactly one item; a redelivered ID would be a no-op

Cleanup.

MID=$(aws lambda list-event-source-mappings --function-name ${PFX}-worker --query 'EventSourceMappings[0].UUID' --output text --region $R)
aws lambda delete-event-source-mapping --uuid $MID --region $R
aws lambda delete-function --function-name ${PFX}-worker --region $R
aws sqs delete-queue --queue-url $Q_URL ; aws sqs delete-queue --queue-url $DLQ_URL
aws dynamodb delete-table --table-name ${PFX}-events --region $R
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole
aws iam detach-role-policy --role-name ${PFX}-role --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess
aws iam delete-role --role-name ${PFX}-role

Cost note. Lambda’s free tier (1M requests + 400,000 GB-seconds/month), DynamoDB on-demand at trivial volume, and SQS’s first million requests are all free; this lab costs effectively ₹0 and deleting the resources stops everything.

Common mistakes & troubleshooting

The playbook — the part you bookmark. First as a scannable table you read mid-incident, then the detail for the entries that bite hardest.

# Symptom Root cause Confirm (exact command / metric) Fix
1 Async events silently disappear No on-failure destination; retries exhausted aws lambda get-function-event-invoke-config; check for DestinationConfig Set OnFailure destination; alarm on DLQ depth
2 Stream consumer stuck for hours, no progress Poison record + infinite default retries blocks shard IteratorAge climbing; aws lambda get-event-source-mapping retries=-1 Finite MaximumRetryAttempts, BisectBatchOnFunctionError, on-failure dest
3 Same event processed twice (double charge) At-least-once delivery + non-idempotent handler Duplicate side effects in logs/DB; no dedup store DynamoDB conditional write keyed on event ID
4 SQS messages reprocessed repeatedly Visibility timeout < 6× function timeout Queue VisibilityTimeout vs function Timeout Set visibility ≥ 6× timeout
5 429 TooManyRequestsException under load Concurrency limit reached Throttles > 0; ConcurrentExecutions at ceiling Raise account quota; reserved concurrency; backoff
6 Database refuses connections during spikes Lambda scale-out > DB connection budget RDS DatabaseConnections at max; function fan-out Reserved concurrency cap + RDS Proxy
7 Whole SQS batch reprocessed on one failure No partial-batch reporting Mapping FunctionResponseTypes empty ReportBatchItemFailures + return failed IDs
8 EventBridge event handled twice Two rules’ patterns both match (no first-match-wins) MatchedEvents on two rules for one event Tighten patterns; keep consumers idempotent
9 S3-triggered function loops, billing spikes Function writes back to the trigger bucket Runaway Invocations; CloudWatch billing alarm Scope prefix/suffix or write to a different bucket
10 Task timed out after 900.00 seconds Job exceeds 15-min hard limit Logs show timeout at 900 s Re-architect into steps; use Step Functions/Fargate
11 Cold starts spike p99 on the API Sync path, no warm envs (esp. JVM/.NET/VPC) InitDuration in logs; p99 latency Provisioned concurrency; SnapStart; smaller package
12 AccessDeniedException calling an AWS service Execution role missing a permission CloudTrail shows the denied action Add the action to the execution role
13 Function returns old code after deploy Alias/trigger points to a stale version aws lambda get-alias; trigger qualifier Update alias to new version; trigger the alias
14 Reserved concurrency “broke” the function Set to 0 (which disables it) ReservedConcurrentExecutions: 0 Use unset for “no reservation,” not 0
15 Runtime exited / errno 137 OOM — memory too low MaxMemoryUsed near limit Increase MemorySize; fix the leak

The expanded form for the ones that cause the most damage:

1. Async events silently disappear. Root cause: An async invocation (S3, SNS, EventBridge, or a direct invoke with InvocationType=Event) failed, retried twice, and had no on-failure destination or DLQ, so the event was dropped. Confirm: aws lambda get-function-event-invoke-config --function-name <fn> returns no DestinationConfig; the Errors metric is non-zero while nothing lands in any queue. Fix: Configure an OnFailure destination (SQS preferred) on every async function; alarm on the DLQ’s ApproximateNumberOfMessagesVisible > 0.

2. Stream consumer stuck for hours. Root cause: A poison record that always fails, combined with the default MaximumRetryAttempts=-1 (infinite), blocks its shard, so every record behind it waits. Confirm: The IteratorAge metric climbs steadily (records aging without being processed); aws lambda get-event-source-mapping --uuid <id> shows MaximumRetryAttempts: -1. Fix: Set a finite MaximumRetryAttempts and a MaximumRecordAgeInSeconds, enable BisectBatchOnFunctionError to isolate the bad record, and add a DestinationConfig.OnFailure so the failed batch’s metadata is captured.

3. Same event processed twice. Root cause: At-least-once delivery (async, stream, or queue) delivered a duplicate, and the handler is not idempotent, so a side effect (charge, email, counter) ran twice. Confirm: Duplicate side effects with the same source event ID; no dedup/idempotency store in the code path. Fix: Derive an idempotency key from the event and conditional-write it to DynamoDB (attribute_not_exists) before the side effect; or use Powertools Idempotency.

4. SQS messages reprocessed repeatedly. Root cause: The queue’s visibility timeout is shorter than ~6× the function timeout, so a still-running invocation’s message becomes visible and is redelivered to a second environment. Confirm: Compare the queue’s VisibilityTimeout to the function’s Timeout; duplicates correlate with slow invocations. Fix: Set the visibility timeout to at least 6× the function timeout (e.g. 180s for a 30s function).

5. 429 TooManyRequestsException under load. Root cause: Demand exceeded available concurrency: the account’s 1,000 default, a too-small reserved allocation, or another function hogging the pool. Confirm: Throttles metric > 0; ConcurrentExecutions pinned at the limit; aws lambda get-account-settings shows UnreservedConcurrentExecutions close to its floor of 100, meaning other functions have reserved most of the pool. Fix: Raise the account concurrency quota via Service Quotas; set reserved concurrency to guarantee this function a slice; ensure synchronous callers back off and retry.

6. Database refuses connections during spikes. Root cause: Lambda scaled to hundreds of environments, each opening a connection, exceeding the database’s max_connections — a stampede no RDS instance survives. Confirm: RDS DatabaseConnections at the ceiling exactly when the function fans out; function errors are connection failures. Fix: Cap the function with reserved concurrency sized to the DB’s connection budget, and front the database with RDS Proxy so functions share a pool.

9. S3-triggered function loops and billing spikes. Root cause: A function triggered on s3:ObjectCreated:* writes a derived object back into the same bucket, re-triggering itself in a runaway loop. Confirm: Invocations climbing without external cause; a billing alarm fires; the same bucket appears as both source and write target. Fix: Scope the notification to a narrow prefix/suffix that excludes derived objects, or write outputs to a different bucket entirely.
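The idempotency fix in item 3 can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the table layout (a partition key named pk and a TTL attribute named expires_at) and the helper names claim_event and handle_once are assumptions made for the example. The table argument is expected to behave like a boto3 DynamoDB Table resource, whose put_item accepts a ConditionExpression string.

```python
import time


def is_duplicate_error(exc: Exception) -> bool:
    """True if exc looks like DynamoDB's ConditionalCheckFailedException.

    boto3's ClientError carries a `.response` dict, so inspecting it here
    keeps the sketch free of a botocore import.
    """
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ConditionalCheckFailedException"


def claim_event(table, event_id: str, ttl_seconds: int = 86400) -> bool:
    """Record event_id with a conditional write; True only for the first caller.

    `pk` and `expires_at` are hypothetical attribute names for this example.
    """
    try:
        table.put_item(
            Item={"pk": event_id, "expires_at": int(time.time()) + ttl_seconds},
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except Exception as exc:
        if is_duplicate_error(exc):
            return False  # someone already claimed this event
        raise  # any other failure should surface and trigger a retry


def handle_once(table, event_id: str, side_effect) -> bool:
    """Run side_effect only if this event_id has not been claimed before."""
    if not claim_event(table, event_id):
        return False  # duplicate delivery: skip the charge/email/counter
    side_effect()
    return True
```

One design caveat: this sketch claims the key before the side effect runs, so a crash between the two leaves the event marked as handled. Tracking an in-progress versus completed state, as Powertools Idempotency does, closes that gap.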

Best practices

The alarms worth wiring before the next incident — leading indicators, not lagging:

| Alarm on | Metric | Threshold (starting point) | Why it’s leading |
|---|---|---|---|
| Throttling | Throttles | > 0 sustained 5 min | Concurrency ceiling before users feel 429s |
| Stream lag | IteratorAge | > 60,000 ms | Shard falling behind / poison record blocking |
| Dead-letter fill | DLQ ApproximateNumberOfMessagesVisible | > 0 | Events are leaving the happy path |
| Error rate | Errors | > 1% of invocations | Handler failing; confirm with logs |
| Async age | AsyncEventAge | climbing toward 6 h | Async backlog not draining |
| Downstream saturation | RDS DatabaseConnections | > 80% of max | Stampede before the DB refuses |
| Cold-start latency | InitDuration | p99 > your SLO | Sync path latency creeping up |
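As a rough illustration of how three of these thresholds could be expressed, the sketch below builds plain dictionaries shaped like the keyword arguments of CloudWatch's put_metric_alarm call. The helper names, the alarm naming scheme and the choice of statistics are assumptions for the example; the percentage-based error-rate alarm needs metric math and is left out.

```python
def alarm(name, namespace, metric, dimension, statistic, threshold,
          periods, period_seconds=60):
    """Build kwargs for CloudWatch put_metric_alarm (fires when value > threshold)."""
    dim_name, dim_value = dimension
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": dim_name, "Value": dim_value}],
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no data means nothing is wrong
    }


def starter_alarms(function_name, dlq_name):
    """Three of the leading-indicator alarms from the table, as starting points."""
    fn_dim = ("FunctionName", function_name)
    return [
        # Throttles > 0 for five consecutive one-minute periods
        alarm(f"{function_name}-throttles", "AWS/Lambda", "Throttles",
              fn_dim, "Sum", 0, periods=5),
        # IteratorAge above 60,000 ms: the stream consumer is falling behind
        alarm(f"{function_name}-iterator-age", "AWS/Lambda", "IteratorAge",
              fn_dim, "Maximum", 60_000, periods=1),
        # Anything visible in the DLQ means events left the happy path
        alarm(f"{dlq_name}-depth", "AWS/SQS", "ApproximateNumberOfMessagesVisible",
              ("QueueName", dlq_name), "Maximum", 0, periods=1),
    ]
```

With boto3 available, each dictionary could be passed as cloudwatch.put_metric_alarm(**params), typically with an AlarmActions entry added to notify an on-call topic.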

Security notes

The security controls that also improve resilience — they pull the same direction here:

| Control | Mechanism | Secures against | Also prevents |
|---|---|---|---|
| Per-function least-privilege role | Scoped IAM policy | Blast radius of a compromised function | AccessDenied surprises from over-broad churn |
| Secrets in Secrets Manager | Secrets Manager secret (or SSM SecureString parameter) + role grant | Secrets leaking via env vars | Rotation breaking a hard-coded value |
| --source-arn on invoke permission | Resource policy condition | Any topic/bucket invoking your fn | Accidental cross-source triggering |
| Input validation + size caps | Handler-level checks | Injection / oversized payloads | Poison records crashing the consumer |
| Customer-managed KMS key | kms:Decrypt scoped | Unauthorised decrypt of config | Silent init failure (grant it correctly) |
| ECR scanning + digest pinning | Image supply chain | Tampered/unknown images | Surprise breakage from a moved tag |

Cost & sizing

The bill drivers and how they interact with the design:

A rough monthly picture for a moderate event pipeline (say 50M events/month, 256 MB, ~150 ms each):

| Cost driver | What you pay for | Rough INR / month | Watch-out |
|---|---|---|---|
| Lambda requests | ~50M invocations | ~₹700–900 | Batch where possible to cut request count |
| Lambda GB-seconds | 256 MB × 150 ms × 50M | ~₹2,500–4,000 | Right-size memory; shorten duration |
| Provisioned concurrency | N warm envs × hours | ~₹1,000+ per 10 envs | Idle cost; only for latency SLOs |
| SQS requests | Polls + sends | ~₹300–600 | Batching window reduces poll count |
| DynamoDB (on-demand) | Idempotency + projection writes | ~₹500–1,500 | TTL the idempotency table |
| CloudWatch Logs | Ingestion + storage | ~₹500–2,000 | Set retention; sample noisy logs |
| RDS Proxy (if used) | Per-vCPU-hour of the DB | ~₹1,500–3,000 | Only if fronting a relational DB |

Sizing rules of thumb: start at 256 MB and profile up; set timeout to ~2× the observed p99 duration (not the 15-min max); set batch size to balance throughput against the cost of reprocessing a failed batch; and cap reserved concurrency to whatever your most fragile downstream can survive. The cheapest correct pipeline is almost always “small functions, buffered sources, idempotent handlers, right-sized memory” — not a bigger anything.
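The two Lambda line items and the timeout rule of thumb can be reproduced with a few lines of arithmetic. The sketch below is only an estimate under stated assumptions: the per-request and per-GB-second prices are assumed x86 list prices, the exchange rate is an assumed round figure, and the free tier is ignored. Check the current pricing page before relying on any of these constants.

```python
import math

# Assumptions for illustration only; verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS_USD = 0.20
PRICE_PER_GB_SECOND_USD = 0.0000166667
USD_TO_INR = 83.0  # assumed exchange rate
LAMBDA_MAX_TIMEOUT_SECONDS = 900  # the 15-minute hard limit


def monthly_lambda_cost_inr(invocations: int, memory_mb: int,
                            avg_duration_ms: float) -> dict:
    """Estimate the request and compute charges for one month, in INR."""
    # GB-seconds = invocations x memory in GB x duration in seconds
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    requests_usd = (invocations / 1_000_000) * PRICE_PER_MILLION_REQUESTS_USD
    compute_usd = gb_seconds * PRICE_PER_GB_SECOND_USD
    return {
        "gb_seconds": gb_seconds,
        "requests_inr": requests_usd * USD_TO_INR,
        "compute_inr": compute_usd * USD_TO_INR,
    }


def suggested_timeout_seconds(p99_duration_ms: float) -> int:
    """Roughly 2x the observed p99, rounded up, capped at the hard limit."""
    return min(LAMBDA_MAX_TIMEOUT_SECONDS, math.ceil(2 * p99_duration_ms / 1000))
```

Calling monthly_lambda_cost_inr(50_000_000, 256, 150) gives about 1.875 million GB-seconds, which under these assumed prices lands inside the request and GB-second ranges in the table. Doubling memory doubles the compute line unless duration falls in proportion, which is why profiling comes before sizing up.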

Interview & exam questions

1. Explain Lambda’s four invocation models and why the distinction matters. Synchronous (caller waits, caller retries), asynchronous (Lambda queues it, retries twice, sends to a DLQ/destination or drops it), stream poll (Lambda polls Kinesis/DynamoDB Streams, per-shard ordering, checkpointing), and queue poll (Lambda polls SQS, visibility-timeout-driven redelivery). The model determines who retries, how many times, the ordering guarantee, and where data goes on failure — so it dictates how you design for correctness.

2. An async-triggered function’s events are disappearing. What’s happening and how do you fix it? Async invocations retry twice and then send the event to a configured on-failure destination or DLQ — and if none is configured, the event is dropped. Confirm there’s no DestinationConfig via get-function-event-invoke-config. Fix by attaching an OnFailure destination (SQS) and alarming on its depth.

3. Why must an SQS visibility timeout be at least 6× the function timeout? When Lambda reads a message it becomes invisible for the visibility-timeout window. If that window is shorter than the time the function needs, the message becomes visible again and is redelivered to a second environment while the first is still processing, which produces instant duplicates. The six-times margin follows AWS’s guidance: it leaves room for Lambda to retry a batch that was throttled, and for any batching window to elapse, before the message reappears.

4. What is a poison record on a stream, and how do you prevent it blocking the shard? A record that always fails. Because stream records are processed in order and the default MaximumRetryAttempts is -1 (infinite), the bad record is retried forever, blocking every record behind it on that shard. Prevent it with a finite retry count, a MaximumRecordAgeInSeconds, BisectBatchOnFunctionError to isolate it, and an on-failure destination.

5. Why is idempotency mandatory in event-driven Lambda, and how do you implement it? Async, stream and queue deliveries are at-least-once — the same event will eventually arrive twice. Without idempotency, side effects (charges, emails, counters) double. Implement it by deriving a key from the event and doing a conditional write (attribute_not_exists) to DynamoDB before the side effect, so the second attempt is a no-op.

6. SNS vs EventBridge vs SQS for fan-out — how do you choose? SNS: low-latency one-to-many push fan-out, limited filtering. EventBridge: rich content-based routing, schema registry, archive/replay — the choreography hub. SQS: not fan-out at all but a buffer giving backpressure, retries and a DLQ. Combine SNS→SQS to get fan-out with per-consumer durability.

7. A Lambda is exhausting a relational database’s connections during traffic spikes. Fix? Lambda scales to hundreds of concurrent environments, each opening a connection, blowing past max_connections. Cap the function with reserved concurrency sized to the DB’s connection budget, and put RDS Proxy in front so the functions share a pooled set. DynamoDB avoids this problem because clients call it over a stateless HTTPS API instead of holding one long-lived database connection per environment.

8. What causes cold starts and what reduces them? Initialising a new environment: code/layer download, runtime bootstrap, and your init code (clients, connections). Reduce with smaller packages, initialising clients in module scope, right-sizing memory (more CPU → faster init), provisioned concurrency (pre-warmed envs), and SnapStart for Java/.NET/Python. It only matters on latency-critical synchronous paths.

9. Difference between reserved and provisioned concurrency? Reserved concurrency caps and guarantees a function’s slice of the account pool (free, because it is only an allocation), used to protect downstreams or other functions. Provisioned concurrency pre-warms a number of environments to eliminate cold starts (billed for as long as it is configured, even when idle), used for latency-sensitive paths. Setting reserved to 0 disables the function.

10. When is Lambda the wrong choice? For long-running (>15 min), stateful, or sustained high-CPU/steady-state workloads, where per-invoke billing loses to a reserved container and the timeout/statelessness constraints fight you. Use ECS/EKS/Fargate or EC2 there; use Lambda for event-shaped, intermittent, spiky work.

11. How do you process a batch from SQS so one bad message doesn’t reprocess the whole batch? Enable ReportBatchItemFailures on the event-source mapping and return a batchItemFailures list of only the failed message IDs. Lambda then deletes the successful messages and redelivers only the failures, instead of redelivering the entire batch on any single failure.

12. An EventBridge event is being handled twice. Why? Two rules whose event patterns both match the same event each fire — there is no first-match-wins in EventBridge. Confirm with the MatchedEvents metric on both rules. Fix by tightening the patterns (more specific source + detail-type + content fields) and keeping consumers idempotent.
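The partial-batch pattern from question 11 can be sketched as a handler like the one below. It assumes the event-source mapping has ReportBatchItemFailures enabled in FunctionResponseTypes; the process function and its order_id check are hypothetical stand-ins for real business logic.

```python
import json


def process(body: dict) -> None:
    """Hypothetical business logic; raises on a message it cannot handle."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    """SQS batch handler that reports only the failed messages back to Lambda."""
    failures = []
    for record in event.get("Records", []):
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Report this message ID so only it is retried, not the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Without the mapping setting, Lambda ignores this return value and treats any raised exception as a failure of the whole batch, so the handler and the mapping configuration have to change together.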

These map to AWS Certified Developer – Associate (DVA-C02): develop event-driven and serverless solutions, Lambda configuration, SQS/SNS/EventBridge integration, and error handling; and to AWS Certified Solutions Architect – Associate (SAA-C03): design decoupled and event-driven architectures, choosing between fan-out brokers, and resilience patterns. A compact cert mapping:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Invocation models, retries, DLQs | DVA-C02 | Develop event-driven solutions |
| Idempotency, at-least-once | DVA-C02 | Resilient application design |
| Fan-out broker choice | SAA-C03 | Design decoupled architectures |
| Concurrency, throttling, scaling | DVA-C02 / SAA-C03 | Performance & resilience |
| Stream poison records, bisect | DVA-C02 | Troubleshoot serverless |
| Cold starts, provisioned concurrency | DVA-C02 | Optimize serverless performance |

Quick check

  1. An S3-triggered function’s events sometimes vanish with nothing in any queue. Which invocation model is this, and what one thing is almost certainly missing?
  2. Your Kinesis consumer’s IteratorAge has been climbing for two hours with no progress. What’s the single most likely cause, and the setting that’s enabling it?
  3. True or false: setting a function’s reserved concurrency to 0 is a good way to give it a tiny guaranteed slice.
  4. Why might the same SQS message be processed by two execution environments at once, and what’s the rule that prevents it?
  5. You need one “order placed” event to reach four independent consumers, each with its own retry and DLQ. What’s the cleanest pattern?

Answers

  1. It’s the asynchronous model (S3 invokes Lambda async). Almost certainly missing: an on-failure destination / DLQ — async retries twice and then drops the event if none is configured. Confirm with aws lambda get-function-event-invoke-config and attach an OnFailure destination.
  2. A poison record that always fails, blocking the shard because the default MaximumRetryAttempts is -1 (infinite), so every record behind it waits. Fix with a finite retry count, MaximumRecordAgeInSeconds, BisectBatchOnFunctionError, and an on-failure destination.
  3. False. Reserved concurrency of 0 disables the function entirely. For “no reservation,” leave it unset/null; to give a small slice, set a small positive number.
  4. Because the visibility timeout is shorter than the time the function takes, so the message reappears and is redelivered to a second environment while the first is still working. The rule: set the visibility timeout to at least 6× the function timeout.
  5. SNS→SQS fan-out: an SNS topic fans the event out to four SQS queues, one per consumer; each queue buffers for its own Lambda with its own redrive policy → DLQ. You get SNS’s fan-out and SQS’s per-consumer durability and backpressure. (EventBridge with four rule targets is the alternative when you want content-based routing and replay.)

Glossary

Next steps

You can now choose the right invocation model, wire each event source correctly, and design for at-least-once delivery without losing events.
