A national online grocery and logistics company — same-day delivery, 1,200 dark stores, a few million orders a week — has a problem that every growing platform eventually hits. Their monolithic order service does everything the instant a customer taps “Place order”: it charges the card, decrements inventory, books a picker, notifies the driver app, emails a receipt, and updates the loyalty ledger. On a quiet Tuesday this is fine. On the first cold morning of winter, when demand triples in an hour, the payment gateway gets slow, every one of those downstream steps blocks behind it, the checkout request times out, and a customer who was charged sees a spinning wheel and taps “Place order” again. Now there are duplicate orders, oversold inventory, and a furious operations team.
The fix is not a faster server. It is to stop doing all that work synchronously inside the customer’s request. The moment the order is accepted, the only thing that must happen immediately is “write down that this order exists.” Everything else — payment, picking, notifications, loyalty — can happen a few hundred milliseconds later, independently, each at its own pace, each able to fail and retry without dragging checkout down with it. That decoupling is what asynchronous messaging buys you. This article is a junior-friendly but production-honest walk through the two patterns that dominate async messaging — queues and pub/sub — when each one wins, and the operational details (dead-letter handling, ordering, idempotency) that separate a demo from the system that survives that cold winter morning.
The two patterns, in one paragraph each
A message queue is point-to-point. A producer drops a message in; exactly one consumer (from a pool of workers competing for work) picks it up, processes it, and deletes it. Think of a single line at a bank with several tellers: each customer is served by one teller. Queues are about distributing work so it gets done reliably, even if the worker pool is busy or temporarily down. AWS SQS, Azure Service Bus queues, and Google Cloud Pub/Sub (used in pull-subscription mode) are the managed services here.
Pub/Sub (publish/subscribe) is fan-out. A producer publishes an event to a topic, and every subscriber interested in that topic gets its own copy. Think of a radio broadcast: one transmission, many independent listeners, none of them aware of each other. Pub/sub is about broadcasting facts — “order 4471 was placed” — to many consumers who each react differently. AWS SNS, Azure Event Grid, and Google Cloud Pub/Sub (topics with multiple subscriptions) are the managed services here.
The single most useful mental test: “Does exactly one thing need to handle this, or do many things need to know about it?” Work that must happen once → queue. A fact that many parts of the system care about → pub/sub. Most real architectures, including the grocery platform below, use both — fan out the event, then queue the work for each consumer.
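The difference between the two patterns can be sketched in a few lines. This is a minimal in-memory illustration, not any cloud SDK: the `WorkQueue` and `Topic` classes and the sample messages are hypothetical names invented for the example.

```python
from collections import deque


class WorkQueue:
    """Point-to-point: each message is handed to exactly one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whichever worker asks first gets the message; nobody else sees it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Fan-out: every subscriber receives its own copy of each event."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(dict(event))  # separate copy per subscriber


# Queue: two workers compete, and each job is handled once.
jobs = WorkQueue()
jobs.send("charge order 4471")
jobs.send("charge order 4472")
worker_a_got = jobs.receive()
worker_b_got = jobs.receive()

# Topic: two subscribers both learn about the same event.
payments_inbox, analytics_inbox = [], []
order_placed = Topic()
order_placed.subscribe(payments_inbox.append)
order_placed.subscribe(analytics_inbox.append)
order_placed.publish({"order_id": 4471})
```

After this runs, each worker holds a different job, while both inboxes hold their own copy of the same event.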
Architecture overview
Here is the order pipeline rebuilt around async messaging. Trace one order through it.
- Checkout (the synchronous part, kept tiny). The customer’s request hits the API behind Akamai, which terminates TLS at the edge, absorbs traffic spikes, and runs WAF/bot protection so a flash sale or a credential-stuffing bot never reaches the origin. The order service does the absolute minimum: validate the cart, write an `order: ACCEPTED` row to the database, and publish a single `OrderPlaced` event to a pub/sub topic. Then it returns `200 OK` to the customer in tens of milliseconds. The card has not been charged yet; that is deliberate, and we will handle the consequences.
- Fan-out (pub/sub). The `OrderPlaced` event lands on an SNS topic (the company is AWS-primary). SNS does not process anything; it fans the event out to several subscribers at once. This is the broadcast: payments, inventory, fulfilment, notifications, and analytics all need to know an order was placed, but each will do something completely different about it.
- Topic-to-queue, the pattern that matters most. Each subscriber is not a raw function; it is its own SQS queue subscribed to the topic. SNS delivers a copy of the event into the Payments queue, the Inventory queue, the Fulfilment queue, and so on. Why a queue behind every subscriber instead of invoking a function directly? Because the queue is a buffer. If the payment processor is slow that cold morning, its queue simply gets deeper while every other consumer keeps working at full speed, completely unaffected. Fan-out gives you parallelism; the per-consumer queue gives you isolation and back-pressure. This “SNS → SQS fan-out” shape is the bread and butter of AWS event-driven design.
- Workers (queue consumers doing the actual work). Behind each queue is a pool of workers (Lambda functions or containers on ECS/EKS) competing to pull messages. The Payments worker calls the card processor and writes `order: PAID` (or `PAYMENT_FAILED`). The Inventory worker decrements stock. The Fulfilment worker books a picker at the right dark store. The Notifications worker pushes to the driver app and emails the receipt. Each scales independently: if payments are slow, you add payment workers without touching anything else.
- The control plane around it. Human and machine identity is brokered through Okta as the workforce IdP, federated to Entra ID so cloud RBAC sees first-class tokens; engineers and on-call operators authenticate once to reach the consoles and dashboards. Application secrets the workers need (the payment processor’s API key, the SMTP credential, third-party tokens) are pulled at runtime from HashiCorp Vault with short-lived dynamic leases, never baked into a container image or an environment variable. Dynatrace (with Datadog on a second team) traces a message end to end and watches queue depth and consumer lag; Wiz and Wiz Code scan the infrastructure-as-code and running cloud for misconfigured topics, over-permissive policies, and a queue accidentally left world-readable; CrowdStrike Falcon sensors run on the worker hosts for runtime threat detection feeding the SOC. ServiceNow is where a stuck dead-letter queue or a paging incident becomes a tracked ticket with an owner.
The shape to remember: publish once to one topic, fan out to many subscribers, buffer each consumer behind its own queue, scale workers per queue. That one sentence is the whole design.
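To make the isolation property concrete, here is a small in-memory sketch of the pipeline. It assumes nothing about the real services: `FanOutTopic`, `place_order`, `drain`, and the in-memory `orders_db` are hypothetical stand-ins for the SNS topic, the checkout handler, a worker pool, and the database.

```python
from collections import deque


class FanOutTopic:
    """Stand-in for a topic whose subscribers are all queues."""

    def __init__(self):
        self.queues = {}

    def subscribe_queue(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event):
        for queue in self.queues.values():
            queue.append(dict(event))  # one copy per consumer queue


orders_db = {}
topic = FanOutTopic()
payments_q = topic.subscribe_queue("payments")
inventory_q = topic.subscribe_queue("inventory")
notifications_q = topic.subscribe_queue("notifications")


def place_order(order_id, cart):
    """The only synchronous work: validate, record, publish, return."""
    if not cart:
        return 400
    orders_db[order_id] = "ACCEPTED"
    topic.publish({"type": "OrderPlaced", "order_id": order_id})
    return 200


def drain(queue, limit):
    """Process up to `limit` messages; returns how many were handled."""
    handled = 0
    while queue and handled < limit:
        queue.popleft()
        handled += 1
    return handled


statuses = [place_order(order_id, ["milk"]) for order_id in range(100)]

# Payments is slow (handles 5); the other consumers keep up.
drain(payments_q, limit=5)
drain(inventory_q, limit=100)
drain(notifications_q, limit=100)
```

Every checkout call succeeded, the payments queue is 95 messages deep, and the inventory and notifications queues are empty: the slow consumer only delayed itself.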
Queue vs Pub/Sub: a side-by-side
| Dimension | Message queue (point-to-point) | Pub/Sub (fan-out) |
|---|---|---|
| Delivery | One message → exactly one consumer | One event → every subscriber gets a copy |
| Purpose | Distribute work reliably | Broadcast facts / notifications |
| Coupling | Producer knows a queue exists, not who drains it | Producer knows nothing about subscribers |
| Add a consumer | Adds workers to the same pool (more throughput) | Adds a new independent reaction to the event |
| Back-pressure | Natural — messages wait in the queue | None on its own; pair each subscriber with a queue |
| AWS | SQS | SNS, EventBridge |
| Azure | Service Bus queues | Event Grid, Service Bus topics |
| Google Cloud | Pub/Sub (single pull subscription) | Pub/Sub (multiple subscriptions) |
| Classic mistake | Using a queue when 5 systems each need the event | Invoking functions directly with no buffer |
A note junior engineers trip on: the cloud product names do not map cleanly to the patterns. Google Cloud Pub/Sub is named after the pattern but is happily used as a work queue too. Azure Service Bus does both queues and topics in one service. AWS splits them: SQS is queues, SNS is fan-out notifications, and EventBridge is a richer event bus with content-based routing and schemas. Learn the pattern first; the product is just which managed box implements it.
When to reach for Kafka / Confluent
SQS, SNS, Service Bus, and Event Grid are message brokers: a message is delivered, consumed, and deleted. That is exactly right for “do this work once.” But the grocery company’s analytics and data teams want something those services do not give: a durable, replayable log of every order event, kept for days, that multiple independent systems can read at their own offset and re-read from the past.
That is the log/streaming model, and it is what Apache Kafka (managed as Confluent Cloud here) provides. A Kafka topic is an append-only log retained for a configured window (say 7 days). Consumers track their own position (offset); the data analytics platform, the real-time fraud model, and a brand-new reporting service can all read the same orders stream independently, and a consumer that was down for an hour simply resumes from where it left off — or rewinds to reprocess a whole day after a bug fix. You cannot rewind an SQS queue; once a message is gone, it is gone.
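The log model is easy to picture as a list plus one integer per consumer. The sketch below is a toy model under that assumption, not the Kafka client API: `EventLog`, `LogConsumer`, `poll`, and `seek` are hypothetical names, and retention and partitions are left out.

```python
class EventLog:
    """Append-only log; reading a record never removes it."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        return self._records[offset:offset + max_records]


class LogConsumer:
    """Each consumer owns its offset, so consumers never affect each other."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=100):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset  # rewind (or skip ahead) for reprocessing


log = EventLog()
for order_id in (4471, 4472, 4473):
    log.append({"order_id": order_id})

analytics = LogConsumer(log)
fraud = LogConsumer(log)

first_pass = analytics.poll()
fraud_first = fraud.poll(max_records=1)  # the fraud model is further behind
analytics.seek(0)                        # replay after a bug fix
replayed = analytics.poll()
```

The analytics consumer reads the same three records twice, and the fraud consumer's position is untouched by any of it. A delete-on-consume queue cannot offer either behaviour.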
| Use a broker (SQS/SNS/Service Bus/Event Grid) when | Use a log (Kafka/Confluent) when |
|---|---|
| Discrete tasks consumed once and deleted | A durable, replayable history is needed |
| Low-to-moderate, bursty volume | Very high sustained throughput (100k+ events/s) |
| Simple fan-out and work distribution | Stream processing, joins, windowed aggregations |
| You want zero ops, pay-per-use | You accept partition/consumer-group complexity for power |
The honest junior-level guidance: do not start with Kafka. It is powerful and operationally heavy — partitions, consumer groups, rebalancing, retention tuning. For the order pipeline’s actual work (charge the card, book the picker), SNS + SQS is simpler, cheaper, and fully sufficient. The grocery company introduces Confluent only for the analytics stream where replay and high-throughput stream processing genuinely earn their keep — and they let the rest of the system stay on the managed broker. Reach for the log when you need the log, not because it is fashionable.
Failure modes and dead-letter queues
The whole reason to go async is graceful failure — but “graceful” is something you have to engineer. A few failure modes and how this design handles each.
A poison message. The payment processor returns a response the worker cannot parse, so the worker throws. SQS keeps a message until a consumer explicitly deletes it, so after the visibility timeout the message reappears for another worker, which also fails. Without a guard, that one bad message loops until the retention period expires, burning worker capacity on every attempt (and on a FIFO queue it blocks every message behind it in the same message group). The guard is a dead-letter queue (DLQ): configure a maxReceiveCount, and after (say) 5 failed attempts SQS moves the message to a separate payments-dlq instead of retrying forever. The queue keeps flowing; the bad message is quarantined for a human.
    {
      "RedrivePolicy": {
        "deadLetterTargetArn": "arn:aws:sqs:ap-south-1:123456789012:payments-dlq",
        "maxReceiveCount": 5
      }
    }
A DLQ is not “set and forget.” A message landing there is an operational event: an alarm on ApproximateNumberOfMessagesVisible > 0 on any DLQ fires a Dynatrace alert and auto-raises a ServiceNow incident, so a human investigates, fixes the bug, and uses SQS redrive to replay the quarantined messages back to the main queue. A DLQ silently filling up is one of the most common “we lost orders and didn’t know” incidents in production.
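The retry-then-quarantine behaviour can be simulated without any cloud dependency. This is a simplified model, assuming a queue entry is a (body, receive_count) pair and that a failed message becomes visible again immediately; `run_consumer` and the handlers are hypothetical names for the sketch.

```python
from collections import deque

MAX_RECEIVE_COUNT = 5  # mirrors maxReceiveCount in the redrive policy


def run_consumer(queue, dlq, handler, max_receive_count=MAX_RECEIVE_COUNT):
    """Drain `queue`, retrying failures and quarantining poison messages.

    Returns the bodies that were processed successfully.
    """
    processed = []
    while queue:
        body, receive_count = queue.popleft()
        receive_count += 1
        try:
            handler(body)
        except Exception:
            if receive_count >= max_receive_count:
                dlq.append(body)  # quarantine for a human
            else:
                queue.append((body, receive_count))  # visible again for a retry
            continue
        processed.append(body)  # success: the message is deleted
    return processed


attempts = []


def payment_handler(body):
    attempts.append(body)
    if body == "garbled":
        raise ValueError("unparseable processor response")


payments = deque([("order-1", 0), ("garbled", 0), ("order-2", 0)])
payments_dlq = deque()
done = run_consumer(payments, payments_dlq, payment_handler)
```

The two good messages are processed, the poison message is tried five times and then parked in the DLQ, and the main queue ends up empty.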
A slow or dead downstream. The card processor is down for ten minutes. Without async, checkout would be down too. With this design, the Payments queue simply gets deeper while customers keep checking out and every other consumer keeps working. When the processor recovers, the workers drain the backlog. The customer experience degrades to “your receipt is a few minutes late,” not “the site is down.”
Duplicate delivery, and why you must design for it. Standard SQS, SNS, and Pub/Sub are at-least-once: under retries and network hiccups a message can be delivered more than once. If your payment worker is not careful, that means charging a customer twice, the worst possible bug for this business. The fix is idempotency: make processing the same message twice have the same effect as processing it once. The order service generates an idempotency key per order; the payment worker atomically records “I have processed key X” before charging (a conditional write that fails if the key already exists), and on a duplicate it sees the key and skips. Passing the same key to the card processor, where the processor supports idempotency keys, also makes it safe to retry after a worker crashes between claiming the key and recording the outcome. Two cloud features help: SQS FIFO queues offer exactly-once processing within a five-minute deduplication window, and SNS FIFO topics preserve order. Neither removes your need to write idempotent consumers. Assume duplicates; design so they are harmless.
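A minimal sketch of an idempotent payment worker follows. It assumes an in-memory dict in place of a durable table with an atomic conditional write, and a fake card processor that accepts an idempotency key; `FakeCardProcessor`, `PaymentWorker`, and the message fields are hypothetical.

```python
class FakeCardProcessor:
    """Stand-in for the card processor; records every real charge."""

    def __init__(self):
        self.charges = []

    def charge(self, idempotency_key, amount_cents):
        self.charges.append((idempotency_key, amount_cents))
        return "PAID"


class PaymentWorker:
    def __init__(self, processor):
        self.processor = processor
        # idempotency key -> outcome; a durable table in a real system
        self.results = {}

    def handle(self, message):
        key = message["idempotency_key"]
        if key in self.results:
            return self.results[key]  # duplicate delivery: no second charge
        self.results[key] = "IN_PROGRESS"  # claim the key before charging
        outcome = self.processor.charge(key, message["amount_cents"])
        self.results[key] = outcome
        return outcome


processor = FakeCardProcessor()
worker = PaymentWorker(processor)
message = {"idempotency_key": "order-4471", "amount_cents": 2599}

first = worker.handle(message)
second = worker.handle(message)  # the same message delivered again
```

Both calls report the same outcome, but the processor saw exactly one charge. In a real deployment the claim step would need to be atomic across workers, which a plain dict does not give you.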
Ordering. Standard queues and topics do not guarantee order — message B can be processed before message A. For most of the pipeline that is fine (it does not matter whether the receipt email or the loyalty update happens first). Where order does matter — all events for one order must apply in sequence — use a FIFO queue with a MessageGroupId set to the order ID. AWS guarantees ordering within a group while still processing different groups (different orders) in parallel. The lesson: demand ordering only where you truly need it, because strict ordering caps throughput and adds cost.
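The per-group ordering idea can be modelled in memory. This sketch assumes a simplified rule (a group with a message in flight is locked until that message is acknowledged) and is not the SQS API; `GroupedFifoQueue` and its methods are hypothetical names.

```python
from collections import OrderedDict, deque


class GroupedFifoQueue:
    """Toy model of FIFO message groups.

    Messages in the same group come out in send order, and a group with a
    message in flight stays locked until that message is acknowledged.
    """

    def __init__(self):
        self._groups = OrderedDict()  # group id -> deque of pending bodies
        self._in_flight = set()       # group ids currently being processed

    def send(self, group_id, body):
        self._groups.setdefault(group_id, deque()).append(body)

    def receive(self):
        for group_id, pending in self._groups.items():
            if pending and group_id not in self._in_flight:
                self._in_flight.add(group_id)
                return group_id, pending.popleft()
        return None

    def ack(self, group_id):
        self._in_flight.discard(group_id)


queue = GroupedFifoQueue()
queue.send("order-1", "PLACED")
queue.send("order-1", "PAID")
queue.send("order-2", "PLACED")

first = queue.receive()   # order-1 starts processing
second = queue.receive()  # order-1 is locked, so order-2 is served in parallel
third = queue.receive()   # nothing available until something is acknowledged
queue.ack("order-1")
fourth = queue.receive()  # order-1 continues, still in sequence
```

Two orders progress in parallel, yet the PAID event for order-1 cannot overtake its PLACED event.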
Scaling, security, and cost
Scaling. Each consumer scales on queue depth, the most natural signal in async systems. Lambda consumers scale automatically with the backlog; container workers on ECS/EKS scale on the ApproximateNumberOfMessagesVisible metric. Because fan-out copies the event into a separate queue per consumer, a spike in one consumer (payments at peak) never starves another (notifications). This per-queue isolation is the property that fixed the original cold-morning outage: slow payments now means a deeper payments queue, not a frozen checkout.
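Scaling on queue depth usually reduces to a small piece of arithmetic. The function below is a back-of-envelope sketch, not the algorithm any autoscaler actually runs; `desired_workers`, the per-worker rate, the drain target, and the worker limits are all assumed values for illustration.

```python
import math


def desired_workers(backlog, per_worker_rate, target_drain_seconds,
                    min_workers=1, max_workers=50):
    """Workers needed to clear `backlog` messages within the target time.

    per_worker_rate is the messages per second one worker sustains.
    """
    needed = math.ceil(backlog / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(max_workers, needed))


quiet = desired_workers(backlog=0, per_worker_rate=5, target_drain_seconds=60)
peak = desired_workers(backlog=6000, per_worker_rate=5, target_drain_seconds=60)
```

With 6,000 visible messages, 5 messages per second per worker, and a one-minute drain target, the result is 20 workers. An empty queue falls back to the minimum of 1.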
Security. Treat messaging infrastructure like any other sensitive surface:
- Identity, not keys. Workers assume IAM roles (federated from Okta → Entra ID for humans, native cloud roles for services) scoped to exactly the queues/topics they touch — the payment worker can read the Payments queue and nothing else. Secrets the workers genuinely need (processor API key, SMTP credential) come from HashiCorp Vault as short-lived leases, never from a container image.
- Encryption. Enable server-side encryption (SQS SSE, SNS encryption) so messages are encrypted at rest, and keep all data-plane traffic over TLS / VPC endpoints so events never traverse the public internet.
- Least-privilege policies. A queue policy that allows `*` is a classic leak. Wiz and Wiz Code scan both the Terraform and the live cloud for an over-permissive topic policy, an unencrypted queue, or a subscription open to the world, and flag it before it ships. CrowdStrike Falcon covers runtime threats on the worker hosts.
- PII discipline. Do not put a full card number or raw PII in a message body that fans out to five consumers and a 7-day Kafka log. Put an order ID; let each consumer fetch only the fields it is authorized to see.
Cost. Managed messaging is cheap relative to the outages it prevents, but the model differs by service:
| Service | Pricing model | Cost-control lever |
|---|---|---|
| SQS / SNS | Per million requests | Batch sends/receives (up to 10 messages) — fewer API calls |
| Service Bus | Per-operation (or fixed Premium messaging units) | Right-size the tier; batch operations |
| Event Grid | Per million operations | Filter at the topic so subscribers only get relevant events |
| Confluent / Kafka | Throughput + retention + partitions | Shorten retention; size partitions to real throughput, not “just in case” |
The biggest avoidable cost on the broker side is chatty, one-at-a-time API calls — batching 10 messages per request cuts request charges ~10x. On the Kafka side it is over-retention and over-partitioning: keeping 30 days when 7 suffices, or 100 partitions for a topic that needs 6.
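The batching arithmetic is simple enough to check directly. This sketch counts requests only and deliberately uses no price figures; `chunk` and `request_count` are hypothetical helper names, and the weekly volume is an assumed round number.

```python
import math


def chunk(messages, batch_size=10):
    """Split messages into batches of at most `batch_size`."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]


def request_count(message_count, batch_size):
    """API requests needed to send `message_count` messages."""
    return math.ceil(message_count / batch_size)


batches = chunk(list(range(25)))
unbatched = request_count(3_000_000, batch_size=1)
batched = request_count(3_000_000, batch_size=10)
```

Twenty-five messages become three batches, and three million messages drop from 3,000,000 requests to 300,000 when sent ten at a time.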
Provisioning and operations
This is infrastructure, so it is code. The queues, topics, subscriptions, DLQs, and policies are defined in Terraform (with Ansible for any VM-based worker configuration), reviewed in a pull request, and applied by a pipeline. CI/CD runs through GitHub Actions (one team uses Jenkins) authenticating to the cloud via OIDC with no stored credentials, and progressive rollout of the worker services is driven by Argo CD doing GitOps onto the Kubernetes cluster. Wiz Code runs as a pull-request check so a misconfigured queue policy fails the build instead of reaching production.
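A minimal Terraform shape communicates the whole fan-out intent: a topic, a per-consumer queue, its DLQ, the queue policy that lets the topic deliver into the queue, and the subscription wiring them together:

    resource "aws_sns_topic" "order_placed" {
      name = "order-placed"
    }

    resource "aws_sqs_queue" "payments_dlq" {
      name = "payments-dlq"
    }

    resource "aws_sqs_queue" "payments" {
      name                       = "payments"
      visibility_timeout_seconds = 60
      redrive_policy = jsonencode({
        deadLetterTargetArn = aws_sqs_queue.payments_dlq.arn
        maxReceiveCount     = 5
      })
    }

    # Without this policy SNS is not allowed to deliver into the queue.
    resource "aws_sqs_queue_policy" "payments_from_sns" {
      queue_url = aws_sqs_queue.payments.id
      policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Principal = { Service = "sns.amazonaws.com" }
          Action    = "sqs:SendMessage"
          Resource  = aws_sqs_queue.payments.arn
          Condition = { ArnEquals = { "aws:SourceArn" = aws_sns_topic.order_placed.arn } }
        }]
      })
    }

    resource "aws_sns_topic_subscription" "payments_sub" {
      topic_arn = aws_sns_topic.order_placed.arn
      protocol  = "sqs"
      endpoint  = aws_sqs_queue.payments.arn
    }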
Repeat the queue/DLQ/subscription block for inventory, fulfilment, and notifications, and you have the fan-out pipeline. Operationally, the dashboards that matter are simple and few: queue depth per consumer, consumer age / lag (how stale the oldest unprocessed message is), DLQ count (which should normally be zero), and end-to-end latency from OrderPlaced to fully fulfilled — all in Dynatrace / Datadog, with DLQ and lag alarms wired to ServiceNow.
One non-obvious operational point worth internalizing early: this same async backbone is what lets the company bolt on new reactions cheaply. When marketing wanted abandoned-cart nudges, and when the L&D team wanted to feed delivery-driver performance events into their Moodle training platform, neither required touching checkout or the existing workers — they each just added a new subscription to the existing OrderPlaced topic. That is the quiet superpower of pub/sub: new consumers are additive and isolated. The same logic extends to legacy systems that cannot speak the cloud’s native protocols — a virtual appliance (a vendor’s software router/gateway running as a VM in the VPC) bridges an older warehouse-management system onto the topic without that legacy box ever knowing SNS exists.
Explicit tradeoffs
Async is not free, and a junior engineer should name the costs out loud. You trade simplicity for resilience. A synchronous call returns a result you can immediately act on; an async message means the result arrives later, so your UX must handle “accepted, processing” states (the customer sees “order confirmed, payment processing” rather than an instant final answer). You inherit eventual consistency — for a brief window the order exists but is not yet paid or picked, and the system has to be correct despite that gap. You must design idempotent consumers because at-least-once delivery will hand you duplicates. And you add operational surface: queues, DLQs, and lag are new things to monitor, and a silently-filling DLQ is a real failure mode you now own.
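One way to keep the "accepted, processing" window manageable is to make the order status an explicit state machine. The sketch below uses the three statuses named earlier in the article; the transition table, the `apply_status` helper, and the customer-facing wording for the final states are assumptions made for illustration.

```python
ALLOWED_TRANSITIONS = {
    "ACCEPTED": {"PAID", "PAYMENT_FAILED"},
    "PAID": set(),
    "PAYMENT_FAILED": set(),
}

CUSTOMER_MESSAGES = {
    "ACCEPTED": "Order confirmed, payment processing",
    "PAID": "Payment received",
    "PAYMENT_FAILED": "Payment failed, please update your card",
}


def apply_status(current, new):
    """Return the status after applying an event, tolerating duplicates."""
    if new == current:
        return current  # duplicate delivery: nothing changes
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


status = "ACCEPTED"
shown_while_pending = CUSTOMER_MESSAGES[status]
status = apply_status(status, "PAID")
status = apply_status(status, "PAID")  # a duplicate event is harmless
```

The UI always has something truthful to show for the current state, duplicates are absorbed, and an out-of-order event that would move a paid order backwards is rejected instead of silently applied.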
The decision, distilled. Use a queue when exactly one worker must do a unit of work reliably. Use pub/sub when many independent systems must learn about a fact. In practice, use both — fan out the event, then buffer each consumer behind its own queue, which is the SNS-to-SQS pattern at the heart of this design. Reach for Kafka/Confluent only when you need a durable, replayable, high-throughput log, not for ordinary work distribution. Always add dead-letter queues and treat a message landing there as an incident, not a curiosity. Get those four decisions right and the rest is detail.
The shape of the win
For the grocery company, the payoff is the inverse of the outage that started it. The next cold winter morning, demand triples and the payment processor slows — and checkout stays fast, because accepting an order is now a tiny synchronous write plus one published event. The payments queue gets deeper, drains when the processor recovers, and not a single other consumer notices. No duplicate charges, because every worker is idempotent. No lost orders, because anything unprocessable lands in a DLQ that pages a human instead of vanishing. And when the business wants a new reaction to an order next quarter, an engineer adds one subscription and ships it without going anywhere near checkout. The async patterns here are foundational on purpose: master queue versus pub/sub, dead-letter handling, and idempotency, and you have the toolkit that nearly every resilient, scalable backend is built on.