
Black Friday-Ready E-Commerce Platform on AWS with Surge Autoscaling

A mid-market fashion retailer — call it the kind of brand that does ₹1,800 crore a year online and another ₹1,400 crore in stores — gets one sentence from its CEO in the September board meeting: “Last Black Friday the site fell over for ninety minutes at 8 PM and we will not let that happen again.” The post-mortem from the prior year is brutal and specific. At the stroke of the doorbuster launch, traffic went from a steady 3,000 requests per second to roughly 120,000 in under two minutes. The autoscaler — reacting to CPU metrics on a five-minute lag — was still spinning up the previous surge’s capacity when the database connection pool saturated, checkout started throwing 500s, and the load balancer dutifully kept routing customers into a dying fleet. The brand lost an estimated ₹22 crore in that ninety-minute window and, worse, a chunk of trust. This article is the reference architecture for the rebuild: a Black Friday-ready storefront on AWS that treats a 40x surge not as an emergency but as a planned-for Tuesday.

The pressures in retail at peak are unforgiving and they all arrive at once. Spikiness is the defining trait: not a gentle ramp but a near-vertical wall the instant a 50%-off banner goes live, and reactive autoscaling is structurally too slow for a wall. Revenue per second is so high that even a 99.9% availability target leaks real money; the brand thinks in rupees per minute of downtime, not nines. The database is the choke point: stateless web tiers scale linearly and cheaply, but the writes (inventory decrements, order inserts) hit a relational store that does not, and that asymmetry is where peak architectures live or die. And the customer’s patience is zero: a shopper who sees a spinner at checkout abandons the cart and tells everyone. The architecture below answers each of these with a specific, named mechanism rather than “add more servers and hope.”

Why the naive scaling story fails

Three obvious fixes get proposed every year, and each fails predictably at 40x.

“Just turn the autoscaler up.” Target-tracking on CPU is reactive — it observes a breach, waits out a cooldown, launches tasks, and waits for them to pass health checks. On a vertical traffic wall, the fleet is perpetually chasing a number it passed minutes ago. You cannot out-react a doorbuster.

“Just over-provision for peak and leave it running.” Sizing the steady fleet for the 8 PM spike means paying for 40x capacity 360 days a year to use it for two hours, which finance vetoes the moment they see the bill, and which still does not protect the database tier that does not scale horizontally on writes.

“Just make checkout synchronous and fast.” Coupling the customer’s checkout click directly to an inventory write and a payment call means every downstream slowness — a payment gateway hiccup, a hot inventory row — becomes the customer’s spinner and the customer’s abandoned cart. The write path will be slower than the read path at peak; the only question is whether the customer waits for it.

The real architecture does three things instead: it scales ahead of demand with predictive scaling and pre-warming, it decouples the write path so the customer’s order is captured in milliseconds and settled asynchronously, and it degrades deliberately — shedding the least valuable load first — rather than collapsing uniformly. Those three ideas drive every component choice that follows.

Architecture overview

[Figure: architecture diagram of the Black Friday-Ready E-Commerce Platform on AWS with Surge Autoscaling]

The platform separates cleanly into two paths that must be reasoned about independently: a read-heavy browse path that serves the catalogue, product pages, and search to the flood of shoppers, and a write-critical commerce path that captures carts and orders and must never lose a customer’s intent even while it sheds load. Most peak failures come from treating these as one system; keeping them distinct is the first design move.

The defining property of the whole topology is this: the customer’s request is answered as close to the edge as possible and as far from the database as possible. Most Black Friday traffic is browsing the same few hundred hot products, and the architecture is built so the database barely notices that crowd.

Browse path, following the request:

  1. A shopper hits Akamai at the very edge for TLS termination, global anycast, bot mitigation, and — critically at peak — WAF and rate-based bot defenses that strip out the credential-stuffing and scalper-bot traffic which can be 30-40% of “demand” during a hyped drop. Akamai also absorbs a large share of static and cacheable content so it never reaches AWS.
  2. Behind Akamai sits Amazon CloudFront as the AWS-native CDN and the front door to origin. CloudFront caches product images, CSS/JS bundles, and — using cache policies tuned for peak — even semi-dynamic catalogue fragments, so a hot product page is served from a POP without touching compute.
  3. Cache misses reach an Application Load Balancer fronting the storefront tier running on Amazon ECS on Fargate. Fargate is chosen deliberately over EC2-backed ECS: there are no nodes to pre-warm or patch, task launch is fast, and at peak the only scaling dimension is task count, which is exactly the simplicity you want at 2 AM on Black Friday.
  4. The storefront’s read queries for catalogue, pricing, and inventory-display hit Amazon ElastiCache for Redis first. The cache is the load-bearing wall of the browse path: hot products, category listings, and price lookups are served from memory, and only a genuine miss falls through to the database.
  5. Misses land on Amazon Aurora (PostgreSQL-compatible) — but on the reader endpoint, against a fleet of read replicas, never the writer. Browse traffic is read-only by definition and is isolated onto replicas so it can never contend with the order-write path.
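The "cache policies tuned for peak" in step 2 could be declared roughly as follows. This is a sketch under stated assumptions: the policy name, the TTL values, and the `variant` query-string key are illustrative and do not come from the retailer's configuration.

```hcl
# Sketch only: name, TTLs, and the whitelisted query key are assumed values.
resource "aws_cloudfront_cache_policy" "hot_catalogue" {
  name        = "bf-hot-catalogue"
  comment     = "Semi-dynamic catalogue fragments during peak"
  min_ttl     = 1
  default_ttl = 60  # seconds; short enough that price changes propagate
  max_ttl     = 300

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    # Keep cookies and headers out of the cache key so every shopper
    # shares the same cached object for a hot product page.
    cookies_config {
      cookie_behavior = "none"
    }
    headers_config {
      header_behavior = "none"
    }

    # Only the query parameter that actually changes the page varies the key.
    query_strings_config {
      query_string_behavior = "whitelist"
      query_strings {
        items = ["variant"]
      }
    }
  }
}
```

The design point is the narrow cache key: anything per-user in the key would fragment the cache and push the browse flood back to origin.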

Commerce path, the part that must not break:

  1. The shopping cart lives in Amazon DynamoDB, not in a relational table. A cart is a per-session, key-value, write-heavy object with no need for joins, and DynamoDB with on-demand capacity (or pre-provisioned with auto-scaling for a predictable peak) absorbs millions of cart mutations at single-digit-millisecond latency without a connection pool to exhaust. This single choice removes the cart from the database’s blast radius entirely.
  2. When the shopper clicks Place Order, the request does not synchronously write the order and decrement inventory. Instead the order-capture service validates the cart, reserves payment authorization, writes an “order accepted” record, and publishes the order onto an Amazon SQS queue — then returns success to the customer in milliseconds. The customer’s intent is now durably captured; the slow work happens behind the queue.
  3. A pool of order-processing workers (ECS Fargate services, scaled on queue depth) consume from SQS, perform the transactional work against Aurora’s writer — insert the order, decrement inventory in a controlled way, trigger fulfillment — and handle retries. SQS is the shock absorber: it lets the customer-facing tier accept orders far faster than the database can durably commit them, smoothing a vertical spike into a steady drain the writer can sustain.
  4. Poison messages and repeated failures route to an SQS dead-letter queue for inspection rather than blocking the line, and an idempotency key on each order makes the at-least-once delivery safe to retry without double-charging or double-decrementing.
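As a rough illustration of step 1, an on-demand cart table might look like the sketch below. The table name, the `session_id` key, and the `expires_at` TTL attribute are hypothetical names chosen for the example, not details from the source design.

```hcl
# Sketch only: table name, key name, and TTL attribute are assumptions.
resource "aws_dynamodb_table" "carts" {
  name         = "bf-carts"
  billing_mode = "PAY_PER_REQUEST" # on-demand capacity, no provisioned peak
  hash_key     = "session_id"      # one cart item collection per session

  attribute {
    name = "session_id"
    type = "S"
  }

  # Abandoned carts expire on their own instead of accumulating.
  ttl {
    attribute_name = "expires_at" # epoch seconds, written by the cart service
    enabled        = true
  }

  point_in_time_recovery {
    enabled = true
  }
}
```

With `PAY_PER_REQUEST` there is no capacity figure to size for the doorbuster, which is the property the cart path relies on.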

Identity, secrets, and security wrap both paths. Customer auth is handled by the storefront’s own consumer identity provider; the internal operators, fulfillment staff, and the on-call engineers authenticate through Okta as the workforce IdP (federated to Microsoft Entra ID where the retailer’s Microsoft 365 estate requires native Azure RBAC for back-office tools), so every human touching the admin console or the deploy pipeline carries an SSO identity with MFA and conditional access. Application secrets — the payment-gateway API keys, the third-party tax-service token, database credentials for the workers — are issued by HashiCorp Vault with short-lived dynamic database credentials, so a leaked credential expires on its own and no long-lived password sits in a task definition or environment variable. (That discipline is non-negotiable here: this team has been burned by credentials in source control before and treats Vault-issued, auto-rotating secrets as the only acceptable pattern.)

Component breakdown

| Component | Service / tool | Role at peak | Key configuration choices |
|---|---|---|---|
| Edge & bot defense | Akamai | TLS, anycast, WAF, scalper-bot mitigation, static offload | Rate-based rules for drop events; origin shield to CloudFront |
| CDN | Amazon CloudFront | Cache product pages, assets, semi-dynamic fragments | Tuned cache policies; high TTLs on hot catalogue; origin failover |
| Storefront compute | ECS on Fargate | Stateless read/render tier | Predictive + target-tracking scaling; pre-warmed pool before drop |
| Read cache | ElastiCache for Redis | Hot products, pricing, inventory display | Cluster mode; read-through; short TTL on stock counts |
| Read database | Aurora (reader endpoint) | All browse queries, isolated from writes | Auto-scaling read replicas; reader endpoint only |
| Cart store | DynamoDB | Per-session cart, high write volume | On-demand (or provisioned + auto-scale); single-digit-ms latency |
| Order buffer | Amazon SQS | Decouple order capture from durable commit | Standard queue + DLQ; long polling; idempotency keys |
| Order workers | ECS on Fargate | Drain queue, write to Aurora writer, fulfill | Scale on ApproximateNumberOfMessagesVisible; controlled concurrency |
| Write database | Aurora (writer) | Orders, inventory decrement, transactions | Single writer; connection pooling via RDS Proxy; reserved headroom |
| Workforce SSO | Okta + Microsoft Entra ID | Operator/on-call/back-office login | OIDC; MFA + conditional access; Okta→Entra federation for M365 tools |
| Secrets | HashiCorp Vault | Dynamic DB creds, gateway keys, tokens | Short-lived leases; DB secrets engine; no static secrets in tasks |
| CSPM / IaC scanning | Wiz + Wiz Code | Cloud posture, attack paths, IaC misconfig in PRs | Agentless account scan; Wiz Code gates Terraform PRs |
| Runtime security | CrowdStrike Falcon | Workload runtime protection, container threat detection | Sensor on Fargate runtime + admin EC2; detections to the SOC |
| Observability | Datadog (with Dynatrace option) | Metrics, traces, peak war-room dashboards, anomaly alerts | APM tracing on order path; synthetic checks; SLO monitors |
| ITSM / change | ServiceNow | Change freeze, peak runbook, incident records | Change gate; auto-incident on SLO breach; on-call routing |
| CI/CD & GitOps | Jenkins / GitHub Actions + Argo CD | Build/test; declarative deploy to ECS/EKS | OIDC to AWS (no static keys); Argo CD reconciles desired state |
| IaC & config | Terraform + Ansible | Provision AWS; configure admin/appliance hosts | Remote state; Wiz Code pre-merge; Ansible for non-container hosts |
| Network appliances | Virtual appliances (NGFW/WAF) | North-south inspection for the admin/VPN plane | HA pair across AZs; inspect back-office and partner traffic |

A few of these choices carry the weight of the design and deserve the why.

Why DynamoDB for the cart and SQS in front of orders. These two choices, together, are what take the database off the critical path of a spike. A relational cart table means every “add to cart” is a write to the same store that commits orders, and at 40x that contention alone can topple the writer. Moving the cart to DynamoDB makes it a horizontally-scaling, schemaless, latency-flat store that simply does not have a connection pool to exhaust. And putting SQS between order capture and order commit decouples the rate at which customers place orders from the rate at which the database can durably commit them — the customer gets an instant “order received,” and the writer drains the queue at its own sustainable pace. The retailer would rather tell a customer “your order is confirmed and processing” in 80 milliseconds than make them watch a spinner for eight seconds while a payment gateway is slow.

Why predictive scaling, not just target tracking. Target-tracking reacts; predictive scaling anticipates. AWS Auto Scaling’s predictive policy learns the daily and weekly demand shape and provisions ahead of a forecasted ramp. But Black Friday is precisely the day history does not predict, so predictive scaling is paired with a scheduled scaling action that pre-warms the Fargate fleet (and pre-provisions DynamoDB and Aurora replicas) to a known floor before the doorbuster timestamp, with target-tracking layered on top to handle the residual. The lesson from the failed year is blunt: on a vertical wall you must already have the capacity when the wall arrives, because there is no time to react after it does.
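The scheduled pre-warm described here can be sketched with Application Auto Scaling. Everything specific in the block is an assumption for illustration: the variable names, the capacity numbers, the date, and the 30-minute lead time before an 8 PM doorbuster.

```hcl
# Sketch only: capacities, date, and variable names are assumed values.
variable "storefront_cluster_name" { type = string }
variable "storefront_service_name" { type = string }

resource "aws_appautoscaling_target" "storefront" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.storefront_cluster_name}/${var.storefront_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 20   # steady-state floor
  max_capacity       = 1500 # hard ceiling for the event
}

# Raise the floor before the doorbuster so capacity exists when the wall arrives.
resource "aws_appautoscaling_scheduled_action" "prewarm_doorbuster" {
  name               = "prewarm-before-doorbuster"
  service_namespace  = aws_appautoscaling_target.storefront.service_namespace
  resource_id        = aws_appautoscaling_target.storefront.resource_id
  scalable_dimension = aws_appautoscaling_target.storefront.scalable_dimension

  schedule = "at(2025-11-28T19:30:00)" # placeholder: 30 minutes before launch
  timezone = "Asia/Kolkata"

  scalable_target_action {
    min_capacity = 800
    max_capacity = 1500
  }
}
```

A matching scheduled action after the event would lower `min_capacity` again; target tracking then operates between the floor and ceiling that the schedule sets.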

Why Aurora split reader/writer with RDS Proxy. Reads and writes have opposite scaling stories, so they get opposite treatment. Browse reads fan out across auto-scaling Aurora read replicas behind the reader endpoint and never touch the writer. The single writer is the scarcest resource in the whole system, so it is protected on every side: SQS rate-limits what reaches it, RDS Proxy pools and multiplexes connections so a stampede of workers cannot exhaust max_connections, and it is sized with deliberate headroom because you cannot horizontally add writers mid-event.
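One way the auto-scaling reader fleet might be expressed is shown below. It is a hedged sketch: the cluster identifier is passed in as a variable, and the replica floor, CPU target, and cooldowns are assumed numbers rather than values from the source.

```hcl
# Sketch only: replica floor, CPU target, and cooldowns are assumptions.
variable "aurora_cluster_id" { type = string }

resource "aws_appautoscaling_target" "aurora_readers" {
  service_namespace  = "rds"
  scalable_dimension = "rds:cluster:ReadReplicaCount"
  resource_id        = "cluster:${var.aurora_cluster_id}"
  min_capacity       = 4  # pre-provisioned floor for the event
  max_capacity       = 15 # Aurora's replica limit per cluster
}

resource "aws_appautoscaling_policy" "aurora_readers_cpu" {
  name               = "readers-track-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.aurora_readers.service_namespace
  scalable_dimension = aws_appautoscaling_target.aurora_readers.scalable_dimension
  resource_id        = aws_appautoscaling_target.aurora_readers.resource_id

  target_tracking_scaling_policy_configuration {
    target_value       = 60
    scale_in_cooldown  = 300
    scale_out_cooldown = 60

    predefined_metric_specification {
      predefined_metric_type = "RDSReaderAverageCPUUtilization"
    }
  }
}
```

Because a new replica takes minutes to become available, the `min_capacity` floor does most of the work on a vertical spike; the tracking policy handles the residual.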

Implementation guidance

Provision with Terraform; gate the IaC with Wiz Code. The whole estate — VPC, subnets across three Availability Zones, ECS services, Aurora cluster, ElastiCache, DynamoDB tables, SQS queues, scaling policies — is declared in Terraform with remote state. Every Terraform change goes through a pull request that Wiz Code scans pre-merge for misconfigurations (a public S3 bucket, an over-permissive security group, an unencrypted queue) so an insecure change is caught in review, not in the running account. Hosts that are not containers — the bastion/admin tier, the virtual network appliances — are configured with Ansible for repeatable, auditable state.

A minimal Terraform shape for the order-buffer queue and its dead-letter companion communicates the intent — durable capture, safe retries:

resource "aws_sqs_queue" "orders_dlq" {
  name                      = "bf-orders-dlq"
  message_retention_seconds = 1209600 # 14 days to inspect poison messages
}

resource "aws_sqs_queue" "orders" {
  name                       = "bf-orders"
  visibility_timeout_seconds = 90      # > worker p99 processing time
  receive_wait_time_seconds  = 20      # long polling, fewer empty receives
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.orders_dlq.arn
    maxReceiveCount     = 5            # retry, then quarantine to DLQ
  })
}

And the scaling shape that matters most — workers tracking queue depth, not CPU, so the order drain rate follows the backlog:

resource "aws_appautoscaling_policy" "workers_on_queue" {
  name               = "scale-workers-by-backlog"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.order_workers.resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    # Target is the queue's total visible backlog, not a per-task figure;
    # tune it to the drain SLA.
    target_value       = 1000
    scale_in_cooldown  = 120
    scale_out_cooldown = 30 # scale out fast, scale in slow

    customized_metric_specification {
      metric_name = "ApproximateNumberOfMessagesVisible"
      namespace   = "AWS/SQS"
      statistic   = "Average"

      # SQS metrics are published per queue; without this dimension
      # the policy would match no data.
      dimensions {
        name  = "QueueName"
        value = aws_sqs_queue.orders.name
      }
    }
  }
}
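The policy above tracks the queue's total visible depth. If a true backlog-per-task target is wanted, one option is CloudWatch metric math inside the same policy type, sketched below. It assumes Container Insights is enabled on the cluster (the source does not say so), that the worker service keeps at least one running task, and it reuses `aws_sqs_queue.orders` and `aws_appautoscaling_target.order_workers` from the snippets above. The variable names and the target of 100 are hypothetical.

```hcl
# Sketch only: assumes Container Insights publishes RunningTaskCount.
variable "worker_cluster_name" { type = string }
variable "worker_service_name" { type = string }

resource "aws_appautoscaling_policy" "workers_backlog_per_task" {
  name               = "scale-workers-by-backlog-per-task"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.order_workers.resource_id
  scalable_dimension = aws_appautoscaling_target.order_workers.scalable_dimension
  service_namespace  = aws_appautoscaling_target.order_workers.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = 100 # assumed acceptable messages per running task
    scale_in_cooldown  = 120
    scale_out_cooldown = 30

    customized_metric_specification {
      metrics {
        id          = "backlog"
        label       = "Visible messages in the orders queue"
        return_data = false
        metric_stat {
          stat = "Sum"
          metric {
            namespace   = "AWS/SQS"
            metric_name = "ApproximateNumberOfMessagesVisible"
            dimensions {
              name  = "QueueName"
              value = aws_sqs_queue.orders.name
            }
          }
        }
      }
      metrics {
        id          = "tasks"
        label       = "Running worker tasks"
        return_data = false
        metric_stat {
          stat = "Average"
          metric {
            namespace   = "ECS/ContainerInsights"
            metric_name = "RunningTaskCount"
            dimensions {
              name  = "ClusterName"
              value = var.worker_cluster_name
            }
            dimensions {
              name  = "ServiceName"
              value = var.worker_service_name
            }
          }
        }
      }
      # Only the expression is returned to the scaling policy.
      metrics {
        id          = "backlog_per_task"
        label       = "Backlog per running task"
        expression  = "backlog / tasks"
        return_data = true
      }
    }
  }
}
```

Only one of the two policies should be attached to the worker service at a time; the per-task form keeps the target meaningful as the fleet grows.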

Deploy via GitOps, freeze before the event. Application builds and tests run in Jenkins or GitHub Actions (authenticating to AWS through OIDC so there is no long-lived access key to leak), producing immutable container images. Argo CD then reconciles the desired state declared in Git onto any Kubernetes (EKS) workloads; because Argo CD targets Kubernetes clusters only, the ECS services would be rolled out by the pipeline itself from Git-declared task definitions. Either way, what is running is always exactly what is in the repository and a rollback is a Git revert. Crucially, a ServiceNow change freeze locks the production estate in the days before Black Friday (the most reliable peak system is the one nobody has touched in a week), with only pre-approved, rehearsed changes allowed through the gate.
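For the GitHub Actions case, the keyless OIDC trust could be sketched like this. It assumes the GitHub OIDC identity provider is already registered in the AWS account; the role name and the `example-retailer/storefront` repository are hypothetical.

```hcl
# Sketch only: role name and repository are placeholders.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

data "aws_iam_policy_document" "github_deploy_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    # Token must be minted for AWS STS.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows on the main branch of this one repository may assume the role.
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-retailer/storefront:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "github_deploy" {
  name               = "bf-github-deploy"
  assume_role_policy = data.aws_iam_policy_document.github_deploy_trust.json
}
```

The `sub` condition is what keeps the role from being assumable by any other repository or branch; permissions for the role itself would be attached separately under least privilege.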

Kill the static secrets; federate the humans. Order workers and storefront tasks obtain database credentials as short-lived dynamic secrets from HashiCorp Vault’s database engine, leased for minutes and auto-rotated, so a compromised task cannot yield a durable credential. Payment-gateway and tax-service API keys live in Vault too, never in a task definition. Human access to the admin console, the deploy pipeline, and the AWS account is federated through Okta with MFA and conditional access, brokered to Microsoft Entra ID for the back-office tools that live in the Microsoft estate — so every privileged action carries a real, audited workforce identity.
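If Vault itself is managed with Terraform, a short-lived database role for the order workers might look like the sketch below. It assumes a database secrets mount and a PostgreSQL connection already exist in Vault; the role name, the TTLs, and the `orders` and `inventory` table names are illustrative only.

```hcl
# Sketch only: mount, connection, TTLs, and table names are assumptions.
variable "vault_db_mount_path" { type = string } # existing database secrets mount
variable "vault_db_connection" { type = string } # existing connection on that mount

resource "vault_database_secret_backend_role" "order_worker" {
  backend = var.vault_db_mount_path
  name    = "order-worker"
  db_name = var.vault_db_connection

  default_ttl = 900  # seconds; credentials expire after 15 minutes
  max_ttl     = 3600 # seconds; hard cap on lease renewal

  # Vault fills in the templated name, password, and expiration per lease.
  creation_statements = [
    "CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}';",
    "GRANT SELECT, INSERT, UPDATE ON orders, inventory TO \"{{name}}\";",
  ]
}
```

Each worker task would request its own lease at startup, so revoking one compromised credential does not disturb the rest of the fleet.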

Enterprise considerations

The load-shedding strategy: degrade on purpose. This is the heart of a peak architecture and the part most teams skip. When demand genuinely exceeds capacity, the system must shed the least valuable load first instead of failing uniformly. The retailer encodes a priority order: protect checkout and payment above all, then add-to-cart, then browse, and shed in reverse. Concretely: a virtual waiting room at the Akamai/CloudFront edge admits shoppers into the buy flow at a controlled rate and politely queues the overflow (“you’re in line, ~2 minutes”) rather than letting them all stampede checkout and crash it for everyone. Non-essential features (personalized recommendations, the live “X people viewing” widget, wishlist sync) are toggled off by feature flags the moment latency budgets tighten, freeing capacity for the buy path. CloudFront serves a cached, slightly stale catalogue if Aurora readers are saturated. The principle: a customer who waits two minutes in a tidy queue and then checks out successfully is a sale; a customer who hits a 500 at checkout is a loss and a tweet. Graceful degradation turns would-be 500s into queued customers who still check out.

Security & Zero Trust. Customer traffic is scrubbed at the edge by Akamai’s WAF and bot defenses — at peak, bot and scalper traffic is not noise, it is a material fraction of “demand” and shedding it early protects real capacity. Virtual network appliances (an HA pair of next-gen firewall/WAF instances across AZs) inspect north-south traffic on the admin and partner-integration plane that CloudFront does not front. Wiz runs continuous CSPM and attack-path analysis across the AWS account, alerting on any drift to public exposure or an over-broad IAM role, while Wiz Code shifts that left into the Terraform PR. CrowdStrike Falcon sensors provide runtime threat detection on the Fargate workloads and the admin EC2 hosts, feeding the retailer’s SOC. Any guardrail breach or SLO violation auto-raises a ServiceNow incident, so security and reliability events become tracked tickets rather than log lines lost in the war-room scroll. IAM follows least privilege per service, and OIDC federation means the CI/CD pipeline holds no static AWS keys at all.

Cost optimization. Peak architectures are an exercise in paying for elasticity, not for a permanent peak.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Baseline on Savings Plans | Cover the steady fleet with Compute Savings Plans; burst on on-demand Fargate | ~30-50% off the always-on base |
| Predictive + scheduled scale | Pre-warm to a floor before the drop, scale back to the baseline after | Pay for 40x only for the hours it exists |
| DynamoDB on-demand for peak | Let cart capacity track real traffic instead of provisioning peak year-round | No idle write-capacity spend off-season |
| Cache and edge offload | Akamai + CloudFront + Redis answer the browse flood off the database | Smaller, cheaper Aurora reader fleet |
| Fargate Spot for workers | Run interruptible order-drain workers on Spot, where SQS redelivery and idempotency keys make retries safe | Material savings on the async tier |
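The Fargate Spot lever could be expressed as a capacity provider strategy on the worker service, roughly as below. This is a sketch, not the retailer's service definition: the variables, the base of four on-demand tasks, and the 1:3 weighting are assumptions, and the cluster is assumed to have both `FARGATE` and `FARGATE_SPOT` capacity providers associated.

```hcl
# Sketch only: base, weights, and variable names are assumed values.
variable "cluster_arn" { type = string }
variable "worker_task_definition_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "worker_security_group_id" { type = string }

resource "aws_ecs_service" "order_workers" {
  name            = "bf-order-workers"
  cluster         = var.cluster_arn
  task_definition = var.worker_task_definition_arn
  desired_count   = 4

  # A guaranteed on-demand base keeps the queue draining if Spot is reclaimed.
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 4
    weight            = 1
  }

  # Beyond the base, roughly three of every four tasks land on Spot.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 3
  }

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.worker_security_group_id]
  }

  lifecycle {
    ignore_changes = [desired_count] # Application Auto Scaling owns the count
  }
}
```

An interrupted Spot task simply stops deleting its in-flight message; SQS makes it visible again after the visibility timeout and another worker retries it under the same idempotency key.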

Datadog meters cost-relevant signals — task count over time, DynamoDB consumed capacity, NAT and data-transfer egress — so the post-peak review can show finance exactly what the two-hour surge cost and prove the elastic model beats the over-provisioned one.

Scalability: where each tier tops out. The storefront Fargate tier scales near-linearly on task count and is rarely the ceiling. ElastiCache scales with cluster-mode shards and replicas. Aurora readers scale out to fifteen replicas, so the browse path has enormous headroom. The writer is the real ceiling, which is exactly why SQS, RDS Proxy, and a deliberately oversized writer instance exist, and why the most write-heavy workload (the cart) was moved off Aurora to DynamoDB entirely. DynamoDB and SQS are effectively unbounded at this scale. The honest summary: you scale the read path with money and the write path with architecture, and a peak design lives or dies on how much write pressure it can keep away from the single writer.

Reliability & DR (RTO/RPO). Everything spans three Availability Zones: Fargate tasks, Aurora (writer plus replicas with automatic failover, typically under 30 seconds), ElastiCache with Multi-AZ, and the regionally-redundant DynamoDB and SQS. For the order path the durability guarantee is concrete: once SQS has accepted an order message, the order is not lost even if every worker dies — they restart and resume draining. A pragmatic target for the commerce path: RTO under 5 minutes, RPO near zero, because the queue and Aurora’s continuous backup mean accepted orders survive a tier failure. For a full regional event, the catalogue and assets are already global at the CDN, and a warm cross-region Aurora replica plus DynamoDB global tables make a region failover a rehearsed runbook rather than an improvisation. The single most important reliability practice is the game-day: the team load-tests the full stack to above projected peak weeks in advance, deliberately triggers the load-shedding and waiting-room paths, and fixes what breaks while it is cheap to break.

Observability — the war room. The order path is traced end to end in Datadog (Dynatrace is the alternative the team evaluated) with APM: one trace covering edge → storefront → order-capture → SQS → worker → Aurora-commit, so a latency or error regression is attributable to a specific hop. The dashboards that matter during the event are business signals, not just infrastructure: orders per second accepted vs. committed, SQS queue depth and age of oldest message (the canary for whether the writer is keeping up), checkout success rate, p95 add-to-cart and place-order latency, cache hit ratio, and Aurora writer CPU and connection count. Synthetic checks hammer the checkout flow continuously, and SLO monitors page the on-call and open a ServiceNow incident the moment checkout success dips. The growing gap between “orders accepted” and “orders committed” is the single number the war room watches — it is the early warning that the queue is filling faster than it drains, with minutes of runway to act before customers feel it.
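The "age of oldest message" canary can also be alarmed on natively in CloudWatch, alongside the Datadog dashboards. The sketch below is a CloudWatch stand-in rather than the Datadog monitor the team uses; the SNS topic variable and the 120-second threshold are assumptions, and the queue name matches the `bf-orders` queue declared earlier.

```hcl
# Sketch only: threshold and notification target are assumed values.
variable "oncall_sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "orders_oldest_message_age" {
  alarm_name          = "bf-orders-oldest-message-age"
  alarm_description   = "Order queue is filling faster than workers drain it"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 3 # three consecutive minutes over threshold
  comparison_operator = "GreaterThanThreshold"
  threshold           = 120 # seconds an accepted order may wait uncommitted
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.oncall_sns_topic_arn]

  dimensions = {
    QueueName = "bf-orders"
  }
}
```

Message age rises before customers notice anything, which is why it works as the early-warning number for the accepted-versus-committed gap.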

The training and runbook angle. A peak event is as much an operations rehearsal as an engineering one. The retailer runs its seasonal fulfillment staff, store associates, and support agents through structured readiness courses on its Moodle LMS — the peak runbook, the escalation tree, how the waiting room looks to a customer so support can answer “am I stuck?” calls calmly — so the humans are as pre-warmed as the Fargate fleet. The on-call engineering rotation drills the same runbook against the game-day environment until the load-shedding toggles and failover steps are muscle memory.

Explicit tradeoffs

Accept these or do not build it. Decoupling orders behind SQS buys survivability at the cost of eventual consistency: the customer is told “order received” before it is durably committed and inventory is decremented, which means you must handle the rare case where an accepted order cannot be fulfilled (oversold stock) with a graceful, automated apology-and-refund path — and you must make every step idempotent so at-least-once delivery never double-charges. Predictive and scheduled scaling demand that you know your event calendar and rehearse; they reward planning and punish the surprise spike you forgot to pre-warm for. The load-shedding and waiting-room machinery is real engineering you build and test — a waiting room nobody has rehearsed is a liability, not a safety net. And the full multi-AZ, predictive-scaled, queue-buffered estate is genuine operational complexity that a small flash-sale site does not need and should not carry.

The alternatives, and when they win. If your peaks are modest and gentle, plain target-tracking autoscaling on Fargate with a healthy Aurora is enough — skip the queue and the waiting room. If you are a smaller team that wants to offload undifferentiated heavy lifting, a managed commerce platform (a hosted storefront, or AWS’s serverless-first patterns end to end) trades control for less to operate at peak. If your write volume is genuinely extreme and relational guarantees are negotiable, going fully event-sourced with DynamoDB and Lambda for the entire order lifecycle removes the Aurora writer ceiling altogether — at the cost of a harder consistency and reporting story. This architecture is the right destination when the stakes are a 40x spike, millions in revenue per hour, and a CEO who has already lived through the outage once.

The shape of the win

For the retailer, the payoff is not “the site stayed up.” It is that at 8:00:00 PM on Black Friday the doorbuster goes live, 120,000 requests per second hit an edge and a fleet that were already provisioned for it, the browse flood is answered from Akamai, CloudFront, and Redis without the database breaking a sweat, every Place Order click is captured in DynamoDB and SQS in under a hundred milliseconds and confirmed instantly, the writer drains the queue at its own steady pace, and when a brief overflow does occur a few thousand shoppers wait politely in a queue for ninety seconds and then check out successfully — instead of ninety minutes of 500s and a ₹22 crore hole. Everything upstream — the predictive pre-warming, the read/write split, the cart on DynamoDB, the orders behind SQS, the Vault-issued credentials, the Wiz and CrowdStrike coverage, the Datadog war-room dashboard, the rehearsed load-shedding — exists so that the single most important number of the year, orders successfully placed in the first ten minutes, goes up instead of to zero. Start narrower if your peak is gentler, but for a brand whose whole year rides on one night, this is where it has to land.
