AWS Enterprise Architecture: Event-Driven Serverless

The phrase “event-driven serverless” gets sold as a billing model — “you only pay when code runs” — and that framing causes most of the bad architectures I review. Teams reach for Lambda to save money, wire every function to call the next one synchronously, and end up with a distributed monolith billed by the millisecond: the same tight coupling as before, now spread across forty functions, each holding a connection open and waiting on the one in front of it. The win of this architecture is not the pricing. The win is that the network of facts becomes the integration layer. Services stop calling each other; they emit events about what happened, and other services react. The broker — not a shared database, not a synchronous API mesh — becomes the contract. Done well, you can delete a consumer, add three new ones, or replay yesterday’s traffic into a brand-new service, and nobody upstream knows or cares.

This article is a concrete AWS reference architecture for getting that right. The protagonists are AWS Lambda for stateless compute that scales to zero; Amazon EventBridge as the central event bus, schema registry, and cross-account router; Amazon SQS and SNS for the durable buffering and fan-out that make Lambda reliable under load; Amazon DynamoDB as the single-digit-millisecond operational store with Streams as a first-class event source; and AWS Step Functions for the long-running, multi-step business transactions (sagas) that you must never try to cram into a single function. The running domain is an order-and-fulfilment platform for an omnichannel retailer, because order processing is the canonical workload where commands, events, sagas, idempotency, and read models all show up at once and you cannot hand-wave any of them away.

The business scenario

Lakeside Outfitters is a fictional but representative company: an outdoor-gear retailer doing roughly USD 60M in annual GMV across a website, an iOS/Android app, 32 physical stores, and two marketplace integrations (Amazon and a regional outdoors marketplace). They run on a three-year-old Django monolith backed by a single PostgreSQL RDS instance and a fleet of EC2 instances behind an Application Load Balancer. On a quiet Tuesday the monolith is fine. The problem is the days that actually matter to the business.

The pain that triggered the rebuild:

Promotions take the whole site down. During a seasonal sale, checkout, payment capture, inventory decrement, loyalty accrual, fraud scoring, marketplace inventory sync, and the confirmation email all run synchronously inside one HTTP request. When the payment processor slows from 250 ms to 5 seconds, request workers pile up, the RDS connection pool (capped at 100) exhausts, and the entire application — including catalogue browsing, which needs none of that — falls over. One slow downstream takes everything down with it.
Inventory is wrong everywhere at once. Store POS, the website, and both marketplaces each write stock counts on their own schedule into the same inventory table with their own locking assumptions. Oversells on fast-moving SKUs are routine, and customer service absorbs the cancellations and goodwill credits.
Every new channel is a quarter-long project. Onboarding a marketplace means threading its calls through the monolith’s request path. The blast radius of any change is the whole application, so releases are monthly and stressful.
Spiky, unpredictable load. Traffic is near zero at 3 a.m. and 40x baseline when an email campaign lands at 9 a.m. They are paying for a fleet sized for the peak and idle the rest of the day, yet the peak still knocks them over.

The business goals are not “go serverless.” They are: survive a 40x traffic spike on promo days without checkout failing; never oversell; onboard a new sales channel in under two weeks; let four product squads release independently; and stop paying for idle capacity at night. Those goals are precisely what make this event-driven rather than a lift-and-shift to Lambda. Spike survival and the never-oversell guarantee both demand that work be accepted in milliseconds and processed reliably out of band, against a single authoritative stock record that every channel reacts to instead of races against.

Critically, this architecture scales down as cleanly as it scales up. A 15-person startup shipping 800 orders a day deploys the exact same shape — pay-per-request Lambda, on-demand DynamoDB, a default EventBridge bus, standard SQS — for a few hundred dollars a month, and grows into the provisioned-concurrency, multi-Region version without re-drawing the diagram. That is what makes it a reference architecture and not a hyperscaler special.

Architecture overview

The system separates three planes that monoliths smear together: the synchronous request plane (a human is waiting), the asynchronous command plane (work that must happen reliably but not while the user waits), and the event notification plane (facts that many independent consumers react to). Each plane gets the AWS primitive whose semantics fit, instead of forcing every interaction through one broker.

The end-to-end flow for placing an order:

Ingress and edge. A client (web, mobile, store POS, marketplace webhook) hits Amazon CloudFront (global CDN, TLS, AWS WAF for the OWASP rules and rate-based blocking) in front of Amazon API Gateway. API Gateway terminates the API, validates the JWT against an Amazon Cognito user pool (or a Lambda authorizer for partner API keys) using a cached authorizer result, enforces per-key usage plans and throttling, and validates the request body against a JSON Schema model so malformed payloads never reach compute. This is the only public door.
Accept fast, decouple immediately. For the write path, API Gateway does not invoke a “do everything” Lambda. It uses a direct service integration to put the validated request onto the command plane — either PutEvents to EventBridge or SendMessage to an SQS queue — via an IAM role, with no Lambda in the hot path at all. The client gets a 202 Accepted with an order ID in well under 100 ms. The order is now durably captured; everything else happens out of band. (Read paths — catalogue, order status — do invoke Lambda, but against DynamoDB read models, not the write path.)
The command lands and the order is created. An OrderIngest Lambda consumes the command from SQS (which gives it batching, retries, and a dead-letter queue for free), validates business rules, and performs a conditional write to the Orders DynamoDB table — attribute_not_exists(PK) keyed on a client-supplied idempotency key — so a retried or duplicated submission can never create two orders. This is where idempotency is enforced, not hoped for.
DynamoDB Streams turns the write into an event. The committed order write appears on the DynamoDB Stream. A thin OrderEventPublisher Lambda (or an EventBridge Pipe — more on that below) reads the stream and emits a well-typed OrderCreated event to the EventBridge custom bus. This is the transactional-outbox pattern done natively: the event is published because and only because the database commit succeeded, so the store and the bus can never disagree.
Fan-out to independent reactors. On the EventBridge bus, content-based rules route OrderCreated to every interested consumer, each on its own SQS queue (the “fan-out with buffering” pattern, SNS-style but on EventBridge):
- Inventory service reserves stock with a conditional UpdateItem and emits StockReserved or StockReservationFailed.
- Loyalty service accrues points.
- Notifications service sends the confirmation (via SNS → email/SMS).
- Analytics archives the raw event to S3 through Kinesis Data Firehose.
- Marketplace sync pushes the new stock level outward.
Each consumer is a separate squad’s code, deployed independently, ignorant of the others. Adding another is a new rule and a new queue, with zero changes upstream.
The multi-step transaction runs as a saga. Order fulfilment is not one event; it is a sequence with money and physical goods at stake — reserve stock, capture payment, allocate from a warehouse, generate a shipping label, and compensate (refund, release stock) if any step fails. That orchestration lives in an AWS Step Functions state machine (Standard workflow), kicked off by the OrderCreated event. Step Functions owns the retries, timeouts, parallel branches, human-approval waits, and — crucially — the compensating transactions. None of that belongs in a tangle of Lambdas calling Lambdas.
Read models for queries. Consumers also project events into purpose-built DynamoDB read models — an OrderStatusView, an InventoryByStore view — so the synchronous read API serves single-digit-millisecond queries without ever touching the write path or doing cross-service joins. This is CQRS: the write model and the read models are different shapes, kept eventually consistent by the event stream.

Drawn out, the diagram is three horizontal bands. Top band (request plane): Client → CloudFront/WAF → API Gateway → (reads) Lambda → DynamoDB read models; (writes) direct integration → SQS/EventBridge, returning 202. Middle band (command + event plane): SQS → OrderIngest Lambda → Orders table → DynamoDB Stream → EventBridge custom bus, which fans out through rules to per-consumer SQS queues, each draining into its own Lambda. Bottom band (orchestration): the OrderCreated event also starts a Step Functions saga that calls the inventory, payment, and shipping services with built-in retry/compensation, writing terminal results back as events. Cross-cutting all three: CloudWatch + X-Ray for traces and metrics, DLQs on every async hop, and EventBridge Archive capturing every event for replay.
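To make the outbox step in the flow above concrete, here is a minimal sketch of what a thin OrderEventPublisher function might do: turn DynamoDB Stream INSERT records into EventBridge PutEvents entries. The attribute names (orderId, orderValue) and the helper names are assumptions for illustration, only the S, N, and BOOL attribute types are unwrapped, and the actual PutEvents call is left out so the sketch runs without AWS.

```python
import json


def _plain(attr):
    """Unwrap one DynamoDB-typed attribute value (only S, N, BOOL handled here)."""
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        number = attr["N"]
        # Stream numbers arrive as strings; keep integers exact.
        return int(number) if number.lstrip("-").isdigit() else float(number)
    if "BOOL" in attr:
        return attr["BOOL"]
    raise ValueError(f"unsupported attribute type: {attr}")


def to_event_entries(stream_event, bus_name="lakeside.orders"):
    """Map DynamoDB Stream INSERT records to EventBridge PutEvents entries."""
    entries = []
    for record in stream_event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # only a newly committed order becomes OrderCreated
        image = record["dynamodb"]["NewImage"]
        detail = {name: _plain(value) for name, value in image.items()}
        entries.append({
            "EventBusName": bus_name,
            "Source": "lakeside.orders",
            "DetailType": "OrderCreated",
            "Detail": json.dumps(detail, sort_keys=True),  # Detail must be a JSON string
        })
    return entries


# Hypothetical stream batch: one committed order and one unrelated delete.
sample = {"Records": [
    {"eventName": "INSERT",
     "dynamodb": {"NewImage": {"orderId": {"S": "o-1"}, "orderValue": {"N": "750"}}}},
    {"eventName": "REMOVE",
     "dynamodb": {"Keys": {"orderId": {"S": "o-2"}}}},
]}
entries = to_event_entries(sample)
```

Because the entries are derived only from records that already sit on the stream, nothing is published for a write that never committed. An EventBridge Pipe does the same job without this function existing at all.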

Component breakdown

Component	AWS service	What it does here	Key configuration choices
Edge & WAF	CloudFront + AWS WAF	TLS, caching of read responses, OWASP and rate-based protection	Rate-based rule per IP; managed rule groups; cache only safe GETs
API / authN	API Gateway (HTTP API) + Cognito	The single public door; token validation, throttling, schema validation	JWT authorizer with result caching; throttling; usage plans and request validators are REST API features, so use a REST API on routes that need them
Synchronous compute	Lambda (read paths, event handlers)	Stateless functions; scale to zero, scale to thousands	ARM/Graviton2; right-sized memory; provisioned concurrency only on latency-critical reads
Central event bus	EventBridge (custom bus)	Routing, schema registry, content-based rules, cross-account delivery, archive/replay	Custom bus per domain; schema registry on; rules → SQS targets; DLQ + retry policy on every target
Command buffering	SQS (Standard + FIFO where ordering matters)	Durable buffer that absorbs spikes and decouples producer from consumer rate	Long polling; `maxReceiveCount` → DLQ; partial-batch-response reporting; FIFO + dedup for ordered flows
Fan-out & notifications	SNS	Push fan-out and end-user notifications (email/SMS); pairs with SQS for buffered fan-out	SNS → SQS subscriptions; message filtering; FIFO topics for ordered fan-out
Operational store	DynamoDB (on-demand) + Streams	Source of truth for orders/inventory; Streams as the event source	Single-table design; conditional writes for idempotency; Streams → Pipe/Lambda outbox; PITR on
Orchestration	Step Functions (Standard)	Long-running saga: reserve → pay → allocate → ship, with compensation	Standard (not Express) for durability/audit; `Retry`/`Catch`; `.waitForTaskToken` for async/human steps
Stream glue	EventBridge Pipes	Point-to-point source→filter→enrich→target without boilerplate Lambdas	DynamoDB Stream → Pipe → EventBridge; filter at the Pipe to cut invocations
Observability	CloudWatch + X-Ray + Lambda Powertools	Structured logs, metrics, distributed traces across the whole async graph	EMF metrics; active tracing; correlation IDs propagated through events

A few of these choices carry the design and deserve the why, not just the what.

EventBridge is the bus, not SNS, but SNS still has a job. People ask why both. EventBridge gives you content-based routing on the event payload (a JSON rule like {"detail":{"orderValue":[{"numeric":[">",500]}]}}), a schema registry, delivery retries for up to 24 hours with a DLQ per target, and archive-and-replay. That makes it the right backbone for integration events between services. SNS is simpler, higher-throughput, and lower-latency for raw fan-out, and it is the natural fit for end-user notifications (it speaks email/SMS/push directly) and for the classic SNS→SQS fan-out when many queues need the same message at very high rates. Rule of thumb in this architecture: EventBridge for service-to-service business events; SNS for notifications and ultra-high-fanout pub/sub. Putting everything on one and ignoring the other is the common mistake.
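Content-based routing is easier to reason about once you have seen a rule evaluated. The sketch below is a deliberately tiny local matcher, useful for unit-testing rule patterns before deploying them. It is my own simplification, not EventBridge's implementation: it handles only exact-value and numeric comparisons, and ignores the other operators (prefix, anything-but, exists, and so on).

```python
import operator

_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq,
}


def _numeric_ok(conditions, value):
    """Check a flat list like [">", 0, "<=", 500] against one number."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    pairs = zip(conditions[0::2], conditions[1::2])
    return all(_OPS[op](value, bound) for op, bound in pairs)


def _leaf_ok(candidates, value):
    """A leaf is a list of alternatives; any one of them may match."""
    for candidate in candidates:
        if isinstance(candidate, dict) and "numeric" in candidate:
            if _numeric_ok(candidate["numeric"], value):
                return True
        elif candidate == value:
            return True
    return False


def matches(pattern, event):
    """Return True when every key in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif not _leaf_ok(expected, event[key]):
            return False
    return True


# The rule from the paragraph above: orders worth more than 500.
high_value_rule = {"detail": {"orderValue": [{"numeric": [">", 500]}]}}
```

Keys that appear in the event but not in the pattern are ignored, which is what lets a producer add fields without breaking existing rules.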

SQS sits in front of nearly every Lambda for a reason. EventBridge can invoke Lambda directly, but a raw async invoke gives you only two internal retries and then the event is gone (unless a DLQ catches it). Routing EventBridge rule → SQS → Lambda instead buys you four things that matter under real load: a durable buffer that absorbs a 40x spike while Lambda concurrency catches up; batching (up to 10,000 records / 6 MB per invoke) that slashes invocation count and cost; controlled concurrency via the event-source-mapping maximumConcurrency, so a downstream database is not stampeded; and a first-class DLQ with maxReceiveCount for poison messages. This single pattern is the difference between “survives the campaign” and “melts at 9:01 a.m.”

DynamoDB single-table design with Streams as the outbox. The Orders and Inventory data lives in a DynamoDB single-table design (composite PK/SK, plus GSIs for access patterns like “orders by customer” and “stock by store”). Two properties make it the right core: conditional writes give you optimistic concurrency and idempotency without a lock, and Streams give you a change log in which each modification appears exactly once and changes to any one item stay in order, which you turn into events (a consumer can still be handed a record twice on retry, so handlers stay idempotent). The change log is the transactional outbox solved with zero extra infrastructure: you do not need a separate outbox table and poller, because the table is the log.
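The "optimistic concurrency without a lock" claim is easiest to see with two channels racing for the last unit of stock. This is a minimal sketch using an in-memory stand-in for the table, so it runs without AWS; the class and function names are hypothetical, and the comment shows the DynamoDB expressions the stand-in imitates.

```python
class ConditionFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class FakeStockTable:
    """In-memory stand-in for the inventory table; not a DynamoDB client."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def decrement_if_available(self, sku, qty):
        # Imitates an UpdateItem with
        #   UpdateExpression    = "SET available = available - :qty"
        #   ConditionExpression = "available >= :qty"
        # where DynamoDB evaluates the check and the write as one atomic step.
        if self._stock.get(sku, 0) < qty:
            raise ConditionFailed(sku)
        self._stock[sku] -= qty
        return self._stock[sku]


def reserve_stock(table, order_id, sku, qty):
    """Return the event the inventory service would emit for this order."""
    try:
        remaining = table.decrement_if_available(sku, qty)
    except ConditionFailed:
        return {"detail-type": "StockReservationFailed", "orderId": order_id, "sku": sku}
    return {"detail-type": "StockReserved", "orderId": order_id,
            "sku": sku, "remaining": remaining}


# One tent left; the website and a marketplace both try to sell it.
table = FakeStockTable({"TENT-2P": 1})
web = reserve_stock(table, "o-web", "TENT-2P", 1)
marketplace = reserve_stock(table, "o-mkt", "TENT-2P", 1)
```

Whichever request reaches the record first wins and the other gets a clean failure event to react to. Neither caller reads the count first and writes it back later, which is the race that produces oversells.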

Step Functions, not a Lambda chain, for the saga. The temptation is to have the payment Lambda invoke the shipping Lambda invoke the label Lambda. That recreates the synchronous monolith with worse failure modes (a 15-minute Lambda timeout ceiling, no built-in compensation, opaque debugging). A Standard state machine instead gives you durable execution that survives for up to a year, declarative Retry/Catch, parallel branches, .waitForTaskToken for steps that wait on a human or an external callback, and a visual execution history that is the difference between a 5-minute and a 5-hour incident postmortem. Use Express workflows only for the high-volume, short, idempotent orchestrations where you do not need the per-execution audit trail.
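For a feel of what "declarative Retry/Catch" looks like, the sketch below assembles an Amazon States Language definition for the order saga as a Python dict and checks that every transition points at a state that exists. The state names, function names, and ARN prefix are hypothetical, and the retry policy is a placeholder; a real definition would name the specific transient errors its tasks raise.

```python
import json

# Shared retry policy: retry a failed task three times with exponential backoff.
RETRY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def task(resource, next_state, on_error=None):
    """Build one Task state; on_error names the compensation state to jump to."""
    state = {"Type": "Task", "Resource": resource, "Retry": RETRY, "Next": next_state}
    if on_error:
        state["Catch"] = [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": on_error}
        ]
    return state


def build_saga(lambda_arn_prefix):
    """Forward path on the left, compensation target on the right."""
    def fn(name):
        return f"{lambda_arn_prefix}:function:{name}"

    return {
        "Comment": "Order fulfilment saga (illustrative sketch)",
        "StartAt": "ReserveStock",
        "States": {
            "ReserveStock": task(fn("reserve-stock"), "CapturePayment", "OrderFailed"),
            "CapturePayment": task(fn("capture-payment"), "AllocateWarehouse", "ReleaseStock"),
            "AllocateWarehouse": task(fn("allocate-warehouse"), "CreateShippingLabel", "RefundPayment"),
            "CreateShippingLabel": task(fn("create-shipping-label"), "OrderFulfilled", "RefundPayment"),
            # Compensations run in reverse order of the work they undo.
            "RefundPayment": task(fn("refund-payment"), "ReleaseStock"),
            "ReleaseStock": task(fn("release-stock"), "OrderFailed"),
            "OrderFulfilled": {"Type": "Succeed"},
            "OrderFailed": {"Type": "Fail", "Error": "OrderFailed", "Cause": "Saga compensated"},
        },
    }


def dangling_targets(definition):
    """List transition targets that are not defined as states."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        targets.update(catcher["Next"] for catcher in state.get("Catch", []))
    return sorted(targets - set(states))


# Hypothetical account and Region, for illustration only.
definition = build_saga("arn:aws:lambda:eu-west-1:123456789012")
definition_json = json.dumps(definition, indent=2)
```

Retries are exhausted before a Catch fires, so a payment failure that survives three attempts falls through to ReleaseStock, and a failure after payment falls through to RefundPayment and then ReleaseStock. None of that control flow lives in function code.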

Implementation guidance

Compute. Functions are Python 3.13 (or Node 22) on ARM/Graviton2 — roughly 20% cheaper per GB-second and usually faster for this workload. Right-size memory with AWS Lambda Power Tuning (a Step Functions state machine that sweeps memory settings against real payloads); for these handlers the cost-optimal point is typically 512–1024 MB, where more memory buys proportionally more CPU and the function finishes faster and cheaper. Adopt Lambda Powertools (Python/TypeScript) on day one for structured logging, EMF custom metrics, tracing, and — importantly — its idempotency and batch-processing utilities, which save you from re-implementing both badly.

Idempotency is non-negotiable and lives in three places. At-least-once delivery is the law of this land: SQS, EventBridge, and DynamoDB Streams can all hand you the same message twice. (1) The write path enforces it structurally with the DynamoDB conditional attribute_not_exists on the idempotency key. (2) Side-effecting consumers (charge a card, send an email) wrap their handler with the Powertools idempotency utility backed by a DynamoDB idempotency table with TTL, so a replay is a no-op. (3) The saga makes each task idempotent and uses idempotency keys on external calls (e.g. the payment processor’s Idempotency-Key header) so a Step Functions retry never double-charges.
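What the idempotency wrapper in point (2) buys you can be shown in a few lines. The decorator below is a hand-rolled sketch with a plain dict as the store. It is not the Powertools API, and it skips what a production version must do (claim the key with a conditional write before running the handler, so two concurrent deliveries cannot both proceed). The function names and the amount_cents field are assumptions.

```python
import functools
import time


def idempotent(store, ttl_seconds=3600, clock=time.time):
    """Run the wrapped function at most once per idempotency key within the TTL."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(key, *args, **kwargs):
            now = clock()
            hit = store.get(key)
            if hit is not None and hit["expires_at"] > now:
                return hit["result"]  # replay: saved result, no side effect
            result = func(key, *args, **kwargs)
            store[key] = {"result": result, "expires_at": now + ttl_seconds}
            return result
        return wrapper
    return decorate


# Demo: a side-effecting handler that records every real charge it makes.
charges = []
store = {}


@idempotent(store)
def capture_payment(idempotency_key, amount_cents):
    charges.append((idempotency_key, amount_cents))
    return {"status": "captured", "amount_cents": amount_cents}


first = capture_payment("order-123", 4999)
replay = capture_payment("order-123", 4999)  # redelivered message
```

The redelivered call returns the stored result and the card is charged once. The TTL bounds the size of the store and defines how long a replay stays safe.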

Always-on partial batch responses. When Lambda reads a batch of 10 SQS messages and message 7 fails, the naive behaviour re-delivers all 10 — re-processing the 9 that succeeded. Set ReportBatchItemFailures on the event-source mapping and return the failed message IDs in batchItemFailures; only the genuine failures are retried. Forgetting this is the single most common correctness bug I see in SQS→Lambda pipelines.
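The handler side of partial batch responses is short enough to show in full. This sketch assumes a standard (non-FIFO) queue and a hypothetical process_order function. The record fields (messageId, body) and the batchItemFailures / itemIdentifier response shape are the contract Lambda expects once ReportBatchItemFailures is enabled on the mapping.

```python
import json


def process_order(message):
    """Hypothetical business logic; raises on a message it cannot handle."""
    if "orderId" not in message:
        raise ValueError("missing orderId")


def handler(event, context=None):
    """Process an SQS batch and report only the messages that failed."""
    failures = []
    for record in event["Records"]:
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Hypothetical batch: the second message is a poison message.
batch = {"Records": [
    {"messageId": "m-1", "body": json.dumps({"orderId": "o-1"})},
    {"messageId": "m-2", "body": "not json"},
    {"messageId": "m-3", "body": json.dumps({"orderId": "o-3"})},
]}
response = handler(batch)
```

An empty batchItemFailures list tells Lambda the whole batch succeeded. The Powertools batch utility wraps this same pattern, so in practice you would reach for that rather than maintain the loop yourself.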

A few IaC snippets (Terraform) that capture the load-bearing wiring. First, the EventBridge rule that routes OrderCreated to a buffered consumer queue, with a DLQ and retry policy on the target (the part people omit):

resource "aws_cloudwatch_event_rule" "order_created" {
  name           = "order-created-to-inventory"
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  event_pattern  = jsonencode({
    "source"      = ["lakeside.orders"],
    "detail-type" = ["OrderCreated"]
  })
}

resource "aws_cloudwatch_event_target" "to_inventory_queue" {
  rule           = aws_cloudwatch_event_rule.order_created.name
  event_bus_name = aws_cloudwatch_event_bus.orders.name
  arn            = aws_sqs_queue.inventory.arn

  dead_letter_config { arn = aws_sqs_queue.inventory_dlq.arn }
  retry_policy {
    maximum_retry_attempts       = 10
    maximum_event_age_in_seconds = 3600
  }
}

Second, the SQS → Lambda event-source mapping that pins concurrency and turns on partial-batch reporting (the two settings that keep a downstream store safe and retries correct):

resource "aws_lambda_event_source_mapping" "inventory_consumer" {
  event_source_arn                   = aws_sqs_queue.inventory.arn
  function_name                      = aws_lambda_function.inventory.arn
  batch_size                         = 10
  maximum_batching_window_in_seconds = 5
  function_response_types            = ["ReportBatchItemFailures"]

  scaling_config { maximum_concurrency = 20 } # protect the DynamoDB write path
}

Third, the direct API-Gateway-to-SQS integration that keeps Lambda out of the write hot path entirely (sub-100 ms 202, nothing to cold-start):

resource "aws_apigatewayv2_integration" "place_order" {
  api_id              = aws_apigatewayv2_api.public.id
  integration_type    = "AWS_PROXY"
  integration_subtype = "SQS-SendMessage"
  credentials_arn     = aws_iam_role.apigw_to_sqs.arn
  request_parameters  = {
    "QueueUrl"    = aws_sqs_queue.order_ingest.url
    "MessageBody" = "$request.body"
  }
}

Package functions with AWS SAM or the Serverless Framework for fast local iteration (sam local invoke, sam local start-api), but keep the shared, account-level platform — the buses, VPC, IAM boundaries, Organizations guardrails — in Terraform so the platform team owns it independently of the squads’ function repos. Wire CI/CD as CodePipeline → CodeBuild (or GitHub Actions → OIDC into AWS, no long-lived keys), deploying each function behind a Lambda alias with weighted/canary shifting and a CloudWatch alarm that auto-rolls-back on an error-rate breach.

Networking and identity. Most of this runs on AWS-managed service endpoints and needs no VPC: Lambda, EventBridge, SQS, SNS, DynamoDB, and Step Functions are all reachable over IAM-authenticated AWS APIs, and keeping functions out of a VPC spares you the subnets, security groups, and endpoints you would otherwise manage (the ENI cold-start penalty of older Lambda VPC networking is largely gone, but the operational overhead is not). The moment a Lambda must reach a private resource (an RDS replica, an internal microservice, the on-prem ERP over Direct Connect), attach it to private subnets and reach AWS services through VPC Gateway/Interface Endpoints (PrivateLink) so traffic never leaves the AWS network. There is a Gateway Endpoint for DynamoDB and S3 (free) and Interface Endpoints for SQS, SNS, EventBridge, and Step Functions. Identity is least-privilege IAM per function: each Lambda gets its own execution role scoped to exactly the one queue it drains and the one table partition it writes, never a shared “Lambda can do anything” role. Cross-account event flow (e.g. a central security account subscribing to all order events) is a resource policy on the EventBridge bus granting PutEvents/rule creation to specific account IDs, with no credential sharing.

Enterprise considerations

Security and Zero Trust. Identity is the perimeter, because there is barely a network one. Every hop authenticates with IAM and authorizes with a scoped policy; there are no implicit trust zones. Concretely: the public edge has WAF + Cognito/JWT validation + request-schema validation, so unauthenticated or malformed traffic dies at the door. Every internal interaction is an IAM-signed AWS API call — a compromised inventory function holds a role that can drain one queue and conditional-write one table prefix, nothing more, so its blast radius is bounded by policy, not by hope. Data is encrypted with customer-managed KMS keys at rest (DynamoDB, SQS, SNS, S3) and in transit via TLS everywhere. Secrets (payment-processor keys, marketplace tokens) live in Secrets Manager with automatic rotation and are fetched at runtime, never baked into env vars in plaintext. Events on the bus carry no card numbers or raw PII — they carry references (an order ID, a tokenised payment handle), so the event log is not a liability. Guardrails are enforced org-wide with Service Control Policies (deny public S3, deny disabling CloudTrail, require KMS), and GuardDuty + Security Hub watch the whole account continuously.

Cost optimization. This is where the architecture’s economics shine, and where the naive version quietly bleeds money. The headline win is scale-to-zero: at 3 a.m. you pay essentially nothing, with no idle EC2 fleet. You pay per Lambda invocation-ms, per million SQS/EventBridge messages, and per DynamoDB request unit. The non-obvious levers: (1) Batch aggressively: a Lambda that processes 10 SQS messages per invoke makes a tenth of the invocations (and pays a tenth of the per-request charge) of one that processes them one at a time, and the maximumBatchingWindow lets you trade a little latency for far fewer invocations. (2) Graviton + Power Tuning typically cuts compute 20–40% with no code change. (3) Move from DynamoDB on-demand to provisioned with auto-scaling once traffic is predictable enough: on-demand is perfect for spiky/unknown load but costs ~5–7x per request at sustained high volume, and this is the single biggest line-item swing at scale. (4) EventBridge Pipes replace “glue” Lambdas (stream→transform→target) with a managed integration that has no function code to run or patch and is billed only on the events that pass its filter. (5) Use Express Step Functions for the high-volume short orchestrations (priced by duration/memory, far cheaper than Standard’s per-state-transition charge) and reserve Standard for the durable, auditable, long-running ones. A realistic bill for this platform at ~3,000 orders/day with promo spikes lands around USD 1,800–2,500/month, versus the over-provisioned EC2 fleet it replaced.

Scalability. Each plane scales on its own axis. Lambda scales out to thousands of concurrent executions (default 1,000/account/Region, raised on request) and, with SQS as the buffer, a spike does not drop work — it queues it, and the consumers drain it as concurrency ramps. The two things you must actively manage: reserved/provisioned concurrency to protect latency-critical reads (and to cap functions that hit a fragile downstream), and the downstream you are protecting — set the event-source-mapping maximumConcurrency so 5,000 queued messages do not translate into 5,000 simultaneous writes against a database that tops out at 500. DynamoDB on-demand absorbs the spike natively (it adapts to traffic), which is exactly why it is the default here. The design’s superpower is that the write acceptance path (API GW → SQS) has effectively unbounded throughput and near-zero latency regardless of how backed-up the processing is — the customer’s 202 never slows down because fulfilment is busy.
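The trade-off behind maximumConcurrency is simple arithmetic, and it is worth running the numbers before a campaign rather than during one. This back-of-envelope sketch assumes no new arrivals while the backlog drains and a uniform time per batch; the 0.5-second batch duration is a made-up figure, not a measurement.

```python
import math


def drain_seconds(queued, batch_size, max_concurrency, seconds_per_batch):
    """Estimate how long a fixed backlog takes to drain at capped concurrency."""
    batches = math.ceil(queued / batch_size)
    waves = math.ceil(batches / max_concurrency)  # batches processed side by side
    return waves * seconds_per_batch


def peak_in_flight_messages(batch_size, max_concurrency):
    """Upper bound on messages being written downstream at the same moment."""
    return batch_size * max_concurrency


# 5,000 queued messages, batches of 10, concurrency capped at 20.
backlog_drain = drain_seconds(5000, 10, 20, 0.5)
downstream_peak = peak_in_flight_messages(10, 20)
```

With these inputs the backlog clears in 12.5 seconds while the downstream store never sees more than 200 messages in flight, comfortably under the 500 it tops out at. Raising the cap shortens the drain and raises the peak, so pick the cap from the downstream limit, not from the queue depth.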

Reliability and DR (RTO/RPO). Within a Region, every managed service here is already multi-AZ — SQS, EventBridge, DynamoDB, Step Functions, and Lambda all replicate across Availability Zones with no work from you, so a single-AZ failure is a non-event. The deliberate reliability work is at the message level: a DLQ on every async hop (EventBridge target, SQS consumer, Lambda async, Step Functions task) so nothing is ever silently lost; alarms on DLQ depth; and EventBridge Archive + Replay so you can re-drive events into a fixed or new consumer. Idempotent consumers make redelivery and replay safe. For multi-Region DR, the cost-effective default is warm standby: DynamoDB Global Tables replicate the source-of-truth data to a second Region (RPO of ~1 second), the same IaC deploys the stack there, and a Route 53 health-check failover repoints the API. Realistic targets: RPO ≈ seconds (Global Tables) and RTO ≈ 10–20 minutes (DNS failover + provisioned-concurrency warm-up). The subtle bit is the event bus: use a second custom bus in the DR Region and cross-Region EventBridge replication (bus-to-bus) so in-flight events are mirrored — otherwise you fail over the data but lose the events in flight.

Observability. Asynchronous, distributed systems are invisible without deliberate instrumentation, and “tail the logs” does not work when one order touches eleven functions. Three pillars: (1) A correlation/causation ID stamped on the first request and propagated through every event’s detail — Powertools does this — so you can reconstruct one order’s entire journey across all functions and queues. (2) AWS X-Ray active tracing on Lambda, API Gateway, and Step Functions for the service map and latency breakdown; X-Ray now traces across EventBridge and SQS hops, so the async graph is one trace, not eleven disconnected ones. (3) CloudWatch EMF custom business metrics (orders accepted, stock reservations failed, saga compensations triggered) plus the operational ones that actually page you: SQS ApproximateAgeOfOldestMessage (the truest “are we falling behind” signal), DLQ depth > 0, Lambda error rate and throttles, and Step Functions ExecutionsFailed. Step Functions’ visual execution history is itself a debugging tool — you see which state failed and why.
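The correlation/causation discipline in pillar (1) comes down to copying two fields every time one event gives rise to another. Here is a minimal hand-rolled sketch; the metadata field names and the helper names are assumptions, not a Powertools or EventBridge convention.

```python
import uuid


def new_event(detail_type, payload, parent=None):
    """Create an event that inherits its parent's correlation ID, if any."""
    event_id = str(uuid.uuid4())
    parent_meta = parent["detail"]["metadata"] if parent else None
    metadata = {
        "event_id": event_id,
        # The first event in a journey starts the correlation chain.
        "correlation_id": parent_meta["correlation_id"] if parent_meta else event_id,
        # The causation ID points at the event that directly triggered this one.
        "causation_id": parent_meta["event_id"] if parent_meta else None,
    }
    return {"detail-type": detail_type, "detail": {"metadata": metadata, "data": payload}}


def journey(events, correlation_id):
    """Reconstruct one order's path from a mixed stream of events."""
    return [
        event["detail-type"]
        for event in events
        if event["detail"]["metadata"]["correlation_id"] == correlation_id
    ]


created = new_event("OrderCreated", {"orderId": "o-1"})
reserved = new_event("StockReserved", {"orderId": "o-1"}, parent=created)
unrelated = new_event("OrderCreated", {"orderId": "o-2"})
```

The correlation ID answers "which order is this about"; the causation ID answers "what triggered this", which is the question you actually ask when a compensation fires unexpectedly.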

Governance. The platform team owns the buses, the schema registry, the IAM boundaries, and the Organizations guardrails as code; squads own their functions, queues, and rules within those guardrails. The EventBridge Schema Registry is the linchpin of governance here: every event type has a registered, versioned schema, and consumers generate typed bindings from it. The registry documents the contract rather than rejecting non-conforming events at PutEvents, so check events against the registered schema in the producer’s CI; that check is what stops a producer silently changing an event’s shape and breaking six consumers. Schema evolution follows the same additive-only discipline as any public contract (add optional fields freely; never remove or retype one without a new version). CloudTrail captures every control-plane action org-wide; AWS Config rules enforce that every queue has a DLQ, every table has PITR enabled, and nothing is unencrypted; cost allocation tags per squad/service make the bill attributable so teams see their own spend.
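The additive-only rule is mechanical enough to check in CI. The sketch below compares two versions of an event schema expressed in a simplified, made-up shape ({field: {"type", "required"}}) rather than the registry's real JSON Schema or OpenAPI documents, and reports the changes that would break existing consumers or replayed events.

```python
def breaking_changes(old, new):
    """Compare two {field: {"type": str, "required": bool}} schemas."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"retyped field: {name} ({spec['type']} -> {new[name]['type']})"
            )
    for name, spec in new.items():
        # A new required field rejects every event emitted before it existed.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems


v1 = {
    "orderId": {"type": "string", "required": True},
    "orderValue": {"type": "number", "required": True},
}
# Additive change: one new optional field.
v2 = {**v1, "channel": {"type": "string", "required": False}}
# Breaking change: a retyped field and a new required one.
v2_bad = {
    "orderId": {"type": "string", "required": True},
    "orderValue": {"type": "string", "required": True},
    "storeId": {"type": "string", "required": True},
}
```

A non-empty result fails the producer's build and forces the conversation about a new event version to happen before deployment.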

Reference enterprise example

Lakeside Outfitters committed to the rebuild after a flagship Memorial Day sale: an email blast drove 38x normal traffic at 9 a.m., the RDS connection pool saturated within ninety seconds, checkout returned 500s for forty minutes, and they oversold a popular tent by 340 units because the website and two marketplaces all decremented the same row under contention. The post-incident number that got the CFO’s attention was USD 210,000 in lost orders plus goodwill credits from a single morning.

Their constraints were concrete: a four-squad engineering org (orders, inventory, fulfilment, growth), a hard mandate to keep the existing monolith serving catalogue browsing during the migration (no big-bang cutover), and a CFO ceiling of “the new platform must cost less at steady state than the EC2 fleet it replaces.” They migrated the write path first — the part that actually fell over.

What they built, mapped to this architecture:

Ingress: CloudFront + WAF + API Gateway (HTTP API) with a Cognito JWT authorizer. The POST /orders route is a direct API-Gateway-to-SQS integration — no Lambda in the hot path — returning 202 with an order ID in a measured p50 of 41 ms / p99 of 88 ms, independent of backend load.
Ingest + source of truth: OrderIngest Lambda (Python 3.13, ARM, 768 MB after Power Tuning) drains the SQS command queue in batches of 10 with ReportBatchItemFailures, conditional-writes to a single-table Orders DynamoDB (on-demand, PITR on), keyed on a client idempotency key. Duplicate submissions are silently no-ops.
Outbox: an EventBridge Pipe reads the Orders DynamoDB Stream, filters to committed INSERT/MODIFY events, and publishes typed OrderCreated / OrderUpdated events to the lakeside.orders custom bus. No glue-Lambda to maintain.
Fan-out: EventBridge rules route to four per-squad SQS queues (inventory, loyalty, notifications, analytics), each with a DLQ and a concurrency-pinned consumer. Inventory’s consumer does a conditional UpdateItem to reserve stock and emits StockReserved / StockReservationFailed — one authoritative stock record, reacted to, never raced against. Oversells went to zero.
Saga: a Standard Step Functions workflow, triggered by OrderCreated, runs reserve-stock → capture-payment → allocate-warehouse → create-shipping-label, with Retry on transient errors, Catch → compensating refund + stock-release on hard failure, and .waitForTaskToken on a manual-review branch for orders flagged by fraud scoring. Every execution is visible end-to-end in the console.
Notifications & marketplace sync: SNS handles the customer confirmation (email/SMS); a marketplace-sync consumer pushes new stock levels outward, reacting to StockReserved.
Reads: a GET /orders/{id} Lambda serves an OrderStatusView DynamoDB read model (provisioned concurrency = 5 to kill cold-start tail latency on the customer-facing path).
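The read model in the last item is kept current by a projector that folds events into a view row. The sketch below uses a dict in place of the OrderStatusView table and assumes each event carries a per-order, monotonically increasing version; that field, the event shape, and the status names are all hypothetical.

```python
STATUS_BY_EVENT = {
    "OrderCreated": "PLACED",
    "StockReserved": "RESERVED",
    "PaymentCaptured": "PAID",
    "OrderShipped": "SHIPPED",
}


def project(view, event):
    """Apply one order event to the view; safe under redelivery and replay."""
    order_id = event["orderId"]
    row = view.get(order_id, {"orderId": order_id, "status": "UNKNOWN", "version": 0})
    if event["version"] <= row["version"]:
        return view  # duplicate or stale delivery: the view already reflects it
    view[order_id] = {
        **row,
        "status": STATUS_BY_EVENT[event["type"]],
        "version": event["version"],
    }
    return view


view = {}
stream = [
    {"type": "OrderCreated", "orderId": "o-1", "version": 1},
    {"type": "StockReserved", "orderId": "o-1", "version": 2},
    {"type": "OrderCreated", "orderId": "o-1", "version": 1},  # redelivered
]
for event in stream:
    project(view, event)
```

Against DynamoDB the version comparison would be a condition on the write rather than a read followed by a write, for the same reason the stock reservation is. The GET handler then only ever reads a finished row.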

The growth squad onboarded a third marketplace six weeks later by adding one EventBridge rule and one consumer Lambda — zero changes to orders, inventory, or fulfilment code, which is the entire point of the event backbone. The next seasonal sale drove 44x baseline traffic; checkout p99 stayed under 95 ms, the SQS queues peaked at ~9,000 messages and drained in under three minutes as Lambda concurrency ramped, and not a single order was lost or oversold. Steady-state cost settled at ~USD 2,100/month against the old fleet’s ~USD 5,400, clearing the CFO’s bar with room to spare. The one scar they earned: an early version emitted full customer addresses on the bus, which a security review flagged; they refactored to emit references and fetch PII inside the consumer that needed it — the “events carry references, not payloads” lesson, learned the way most teams learn it.

When to use it

Use this architecture when your workload is genuinely event-shaped: many independent reactions to business facts, spiky or unpredictable load, a need to onboard consumers without touching producers, and a tolerance for eventual consistency between services. Order processing, IoT ingestion, media pipelines, real-time fraud and notifications, SaaS activity feeds, and any “fan-out to N teams” integration are the sweet spot. It is also the right starting point for a small team precisely because it scales to zero — you pay for traffic, not for a fleet sitting idle, and you grow into the enterprise version without re-architecting.

The trade-offs are real and you should price them in. Eventual consistency is a feature, not a bug, but it is a cognitive tax: a customer may see “order placed” before the loyalty points appear, and your product and support teams must be fine with that. Debugging is harder than a monolith’s stack trace — which is exactly why the correlation-ID + X-Ray + Step Functions-history discipline above is not optional. And at-least-once delivery means you must build idempotency; if you skip it, you will double-charge a customer in production. There is no version of this architecture that is correct without idempotent consumers.

Anti-patterns to avoid:

The distributed monolith. Lambda-A synchronously invokes Lambda-B invokes Lambda-C, each waiting on the next. You have rebuilt the monolith with worse latency, a 15-minute timeout ceiling, and no compensation. If a flow is a sequence of steps with rollback, it is a Step Functions saga, not a Lambda chain.
One broker for everything. Forcing high-fanout notifications through EventBridge, or routing nuanced business events through raw SNS, or — worst — using a DynamoDB table as a message queue. Match the primitive to the semantic.
EventBridge → Lambda with no SQS buffer for anything that can spike or whose downstream is fragile. You lose batching, controlled concurrency, and a real DLQ, and you stampede the database on the first campaign.
“Eventual consistency” used to dodge a requirement that is actually strongly consistent. Moving money between two accounts in one atomic step is a transaction, not a saga of independent events — use a DynamoDB TransactWriteItems (or a Step Functions saga with explicit compensation), and be honest about which it is.
Forgetting partial batch responses, which silently re-processes successful messages on every batch with one failure.

Alternatives, and when they win. If your workload is a steady, predictable, high-throughput stream (always-on at scale, not spiky), a containerised event-driven stack on ECS/EKS with Kafka (MSK) can be cheaper per unit and gives you Kafka’s log-replay and consumer-group semantics — at the cost of running and patching the platform. If you need strict global ordering and stream replay as a first-class primitive, Kinesis Data Streams (or MSK) beats SQS/EventBridge. If the system is genuinely a handful of synchronous request/response APIs with no fan-out and strong-consistency needs throughout, a modular service on Fargate or a well-factored monolith is simpler and you should not reach for an event bus at all — the operational and cognitive overhead of async only pays off when there are real, independent consumers reacting to real events. Choose the broker, the consistency model, and the compute on the shape of the work, not on a slide that says “serverless is cheaper.”

Written by Vinod
