AWS Enterprise Architecture: SaaS Multi-Tenant Platform

Building a SaaS product is not the same as building a single-customer application and selling it many times. The moment your second customer signs up, you inherit a class of problems that no amount of feature work will solve: how do you keep Tenant A’s data invisible to Tenant B, how do you stop one noisy tenant from starving everyone else, how do you bill each tenant for what they actually consumed, and how do you offer a premium customer a dedicated, compliant environment without forking your codebase? This article is a reference architecture for a multi-tenant SaaS control plane and application plane on AWS that answers those questions deliberately rather than by accident.

The business scenario

You are building a B2B SaaS product — say a document-collaboration platform, a vertical CRM, or an analytics suite. You sell to other businesses (tenants), and each tenant has many users. Your commercial reality looks like this:

The central architectural tension is pool vs silo. In a pool model, all tenants share the same compute and the same databases, and isolation is enforced in software (every query is scoped by tenant_id, every IAM policy is dynamically scoped to the caller’s tenant). Pooling is cheap and operationally simple — one deployment, one database fleet — but a single bug can leak data across tenants, and a noisy tenant degrades everyone. In a silo model, each tenant gets dedicated resources (its own database, sometimes its own compute, its own KMS key). Silos give you hard, infrastructure-level isolation and clean per-tenant cost attribution and blast-radius containment — but they are expensive and the operational surface grows linearly with tenant count.

The mature answer, codified in the AWS SaaS Lens of the Well-Architected Framework and the SaaS Builder Toolkit, is not to pick one — it is a tiered, bridge model: pool the long tail to protect margins, silo the demanding head to win enterprise deals, and run both from a single control plane and a single codebase. This article shows how to build exactly that on AWS, with Amazon Cognito as the tenant-aware identity provider, JWT-embedded tenant context flowing through every request, dynamically-scoped IAM and PartiQL/leading-key partitioning for data isolation, and an EventBridge + Kinesis metering pipeline that turns raw usage into invoices.

Architecture overview

The platform splits into two cooperating planes, a separation that is the single most important idea in SaaS architecture.

Figure: AWS SaaS multi-tenant reference architecture. CloudFront/WAF edge and Cognito tenant-aware JWTs feed a shared control plane (API Gateway + Lambda authorizer minting STS tenant-scoped credentials) that tier-routes to an application plane split into a pooled tier (shared Lambda/Fargate over DynamoDB LeadingKeys + Aurora RLS) and a siloed enterprise tier (dedicated Fargate/account + Aurora + KMS CMK), with an EventBridge/Kinesis → S3 → aggregator → counters metering pipeline feeding billing and a Step Functions onboarding flow.

The control plane is shared by every tenant and is where SaaS-specific concerns live: tenant onboarding/provisioning, the tenant and tier catalog, identity, billing, metering aggregation, and operations. It is multi-tenant by definition — a Starter tenant and an Enterprise tenant are both rows in the same control-plane tables. The application plane is where your actual product runs and where tenant workloads are isolated. The application plane is deployed in different isolation shapes depending on the tenant’s tier, but it is always built from the same code.

Follow a single authenticated request end to end:

  1. A user hits the app. The SPA is served from Amazon S3 behind Amazon CloudFront. Static assets are global and tenant-agnostic.
  2. The user authenticates against Amazon Cognito. Critically, Cognito is configured so that the issued JWT carries a custom:tenantId claim and a custom:tier claim (and often custom:tenant_tier_config, such as which silo/pool the tenant maps to). A Pre-Token-Generation Lambda trigger injects/refreshes these claims from the tenant catalog at sign-in; note that customizing the access token, rather than only the ID token, requires the V2 trigger event version. This is the linchpin: tenant identity is bound to the user’s token, not passed as an untrusted request parameter.
  3. The token-bearing request reaches Amazon API Gateway. A Lambda authorizer validates the signature, extracts tenantId and tier, and, in the key isolation step, constructs a scoped session: it assumes an IAM role and injects a session policy whose conditions are templated with the caller’s tenantId. (A built-in JWT authorizer on an HTTP API can validate the token, but the credential minting then has to happen in a separate token-vending step.) The result is short-lived AWS credentials that IAM will not allow to touch another tenant’s data, even if downstream code has a bug.
  4. The request is routed by tier. Pool-tier traffic lands on shared compute — AWS Lambda functions or shared Amazon ECS/Fargate services running behind an internal Application Load Balancer. Silo-tier traffic is routed (by an API Gateway stage variable, a header, or a dedicated CloudFront behavior keyed on the tenant) to that tenant’s dedicated compute and data stores.
  5. Application code reads/writes data. For pooled data in Amazon DynamoDB, the tenant’s credentials carry a dynamodb:LeadingKeys condition so the caller can only access items whose partition key begins with their tenant prefix (TENANT#<tenantId>#). For pooled relational data in Amazon Aurora (PostgreSQL), isolation is enforced by Row-Level Security (RLS) policies keyed on a current_tenant session variable plus a per-tenant database role. For siloed data, the tenant simply has its own DynamoDB table or its own Aurora cluster, so isolation is the infrastructure boundary itself.
  6. Every meaningful action emits a usage event. Application code (or API Gateway access logs, or a Lambda extension) publishes a structured event — {tenantId, metric: "api.call" | "storage.gb" | "doc.processed", quantity, timestamp} — to Amazon EventBridge or directly to Amazon Kinesis Data Streams. A Kinesis Data Firehose lands the raw events in S3 (the durable system of record for billing disputes), while a Lambda or Managed Service for Apache Flink aggregates them per tenant per metric.
  7. Aggregated usage flows to the billing system. AWS Marketplace Metering Service (if you sell through Marketplace) or a third-party billing engine (Stripe Billing, Metronome, m3ter) receives the per-tenant metered quantities and produces invoices. A DynamoDB table holds the running per-tenant usage counters for in-app dashboards and for enforcing plan limits/throttles in real time.
  8. Onboarding is its own asynchronous flow. When a tenant signs up, an AWS Step Functions state machine orchestrates provisioning: create the Cognito group/app-client mapping, write the tenant record, and — for silo tenants — invoke a provisioning pipeline (CodePipeline + Terraform/CloudFormation) that stands up the dedicated stack. AWS Control Tower / Organizations is used when the silo boundary is a whole AWS account per tenant.
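
The tier routing in step 4 can be sketched as a pure function over the tenant catalog. This is a minimal illustration, not the article's implementation: the `TenantRecord` shape borrows `tier`, `isolationModel`, and `status` from the catalog described in this article, while `silo_endpoint` and `POOL_ENDPOINT` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

POOL_ENDPOINT = "https://pool.internal.example"  # hypothetical shared backend


@dataclass(frozen=True)
class TenantRecord:
    tenant_id: str
    tier: str
    isolation_model: str                  # "pool" or "silo"
    status: str                           # e.g. "active", "suspended"
    silo_endpoint: Optional[str] = None   # hypothetical: set only for silo tenants


def resolve_backend(claims: dict, catalog: Dict[str, TenantRecord]) -> str:
    """Pick the backend for a request from already-verified JWT claims."""
    tenant_id = claims["custom:tenantId"]  # trusted only because the JWT was verified upstream
    record = catalog.get(tenant_id)
    if record is None or record.status != "active":
        raise PermissionError(f"unknown or inactive tenant: {tenant_id}")
    if record.isolation_model == "silo":
        if not record.silo_endpoint:
            raise RuntimeError(f"silo tenant {tenant_id} has no provisioned endpoint")
        return record.silo_endpoint
    return POOL_ENDPOINT
```

Keeping the decision in one function means the same handler code can serve both shapes; only the catalog record differs per tenant.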

The mental model: CloudFront/S3 (global edge) → Cognito (tenant identity in the JWT) → API Gateway + Lambda authorizer (scoped credential minting) → tier-routed compute (pool Lambda/Fargate or silo dedicated) → tenant-partitioned data (DynamoDB LeadingKeys / Aurora RLS, or siloed stores) → EventBridge/Kinesis metering → billing. Control plane (onboarding, catalog, identity, metering, billing) orchestrates; application plane (the product) executes per-tenant.

Component breakdown

| Component | AWS service | Role in the architecture | Key configuration choices |
|---|---|---|---|
| Edge & static hosting | CloudFront + S3 + WAF | Serve the SPA globally; first line of defence | Origin Access Control to S3; WAF rate-based rules per IP; optional tenant-keyed cache behaviors for vanity domains (acme.app.com) |
| Identity provider | Amazon Cognito User Pools | Authenticates users; mints tenant-scoped JWTs | Custom attributes custom:tenantId, custom:tier; Pre-Token-Generation Lambda to inject claims; one user pool with per-tenant groups, or pool-per-tenant for strict silo identity |
| API front door | API Gateway (HTTP or REST API) | Single entry; validates tokens; routes by tier | Lambda authorizer returns an IAM policy + context (tenantId); usage plans/API keys as a coarse throttle; per-tenant rate limits |
| Tenant-context / token vending | Lambda (authorizer) + STS | Converts tenantId claim into least-privilege, tenant-scoped AWS credentials | sts:AssumeRole with an inline session policy containing ${aws:PrincipalTag/tenantId} or templated dynamodb:LeadingKeys conditions |
| Pooled compute | Lambda / ECS Fargate | Runs shared business logic for the long tail | Reserved/provisioned concurrency to bound noisy neighbours; tenant-aware structured logging on every invocation |
| Siloed compute | Dedicated Fargate service / Lambda alias / dedicated account | Runs the same code, isolated, for premium tenants | Per-tenant ECS service or account; per-tenant compute budget; identical container image, different deployment target |
| Pooled NoSQL data | Amazon DynamoDB | High-scale pooled storage with cheap isolation | Partition key = TENANT#<id>#...; dynamodb:LeadingKeys IAM condition; on-demand or per-tenant capacity; tenant tag for cost-by-tag |
| Pooled relational data | Aurora PostgreSQL (Serverless v2) | Pooled relational store needing joins/transactions | Row-Level Security policies; SET app.current_tenant; non-superuser per-tenant DB role; FORCE ROW LEVEL SECURITY |
| Siloed data | Dedicated DynamoDB table / dedicated Aurora cluster | Hard, infra-level isolation for compliance tenants | Per-tenant KMS CMK (BYOK); per-tenant backup/PITR policy; data-residency region pinning |
| Provisioning / onboarding | Step Functions + CodePipeline + Terraform/CFN | Orchestrates new-tenant setup, pool or silo | Idempotent state machine; silo branch triggers IaC pipeline; writes tenant record + Cognito mapping |
| Metering ingestion | EventBridge + Kinesis Data Streams | Captures per-tenant usage events at scale | Event schema {tenantId, metric, quantity, ts}; partition by tenantId; Firehose → S3 raw log |
| Metering aggregation | Lambda / Managed Service for Apache Flink | Rolls raw events into per-tenant/per-metric totals | Tumbling windows (hourly/daily); idempotent, exactly-once-ish reconciliation against the S3 raw store |
| Usage + billing | DynamoDB (counters) + Marketplace Metering / Stripe / Metronome | Real-time limits + invoicing | Atomic counter updates; daily BatchMeterUsage to Marketplace; reconcile against raw S3 monthly |
| Tenant & tier catalog | DynamoDB | System of record for who is who and which tier/shape | tenantId → {tier, isolationModel, siloStackArn, kmsKeyArn, status, region} |
| Observability | CloudWatch + X-Ray + OpenSearch | Per-tenant metrics, traces, and logs | tenantId as a structured log field and metric dimension; CloudWatch Embedded Metric Format; per-tenant dashboards & anomaly alarms |

A few of these deserve emphasis because they are where SaaS architectures most often go wrong.

The Pre-Token-Generation Lambda is non-negotiable. A common rookie mistake is to let the client send tenantId in the request body or a header. That is a horizontal-privilege-escalation vulnerability waiting to happen — any user can edit the request and read another tenant’s data. Binding tenantId into the signed JWT, server-side, at token-mint time, means the value is cryptographically attested by Cognito and cannot be forged.

Scoped credentials (the token vending machine) are the difference between “we hope the query is filtered” and “IAM refuses any request for another tenant’s items.” The authorizer (or a dedicated token-vending service) calls sts:AssumeRole and attaches a session policy that templates the tenant id into resource/condition fields. For DynamoDB this means dynamodb:LeadingKeys = ["TENANT#${tenantId}#*"] (the trailing wildcard is needed because the partition key continues past the tenant id); for S3 it means a prefix condition s3:prefix = ["${tenantId}/*"]. Downstream code uses those credentials. This is defence in depth: even an injection-style bug or a forgotten tenantId filter cannot cross the tenant boundary because the credentials themselves are constrained. Note that these credentials constrain AWS API calls only; pooled Aurora rows are protected by RLS, not by IAM.
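
A session policy of the kind described above might be built like this. Treat it as a sketch under stated assumptions: `build_session_policy`, the tenant-id pattern, and the bucket layout (`<bucket>/<tenantId>/...`) are illustrative choices, not part of the article's codebase. The id is validated before templating so a value such as `*` cannot widen the policy.

```python
import json
import re

# Assumption: tenant ids are UUID-like (letters, digits, hyphens).
TENANT_ID_RE = re.compile(r"[A-Za-z0-9-]{1,64}")


def build_session_policy(tenant_id: str, table_arn: str, bucket: str) -> str:
    """Return a JSON session policy that confines a session to one tenant."""
    # Reject ids containing IAM wildcards or key separators before templating.
    if not TENANT_ID_RE.fullmatch(tenant_id):
        raise ValueError("invalid tenant id")
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # DynamoDB: only items whose partition key starts with this tenant's prefix
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query",
                           "dynamodb:PutItem", "dynamodb:UpdateItem"],
                "Resource": table_arn,
                "Condition": {"ForAllValues:StringLike": {
                    "dynamodb:LeadingKeys": [f"TENANT#{tenant_id}#*"]}},
            },
            {   # S3 objects: only keys under the tenant's prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{tenant_id}/*",
            },
            {   # S3 listing: only when the requested prefix is the tenant's
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{tenant_id}/*"]}},
            },
        ],
    }
    # Compact separators help stay under the session-policy size limit.
    return json.dumps(policy, separators=(",", ":"))
```

A session policy can only narrow what the assumed role already allows, so the role's own policy still has to grant these actions on the table and bucket.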

Implementation guidance

Identity wiring (Cognito → JWT → API Gateway)

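Choose your Cognito topology by tier: a single shared user pool with per-tenant groups for pooled tiers, or a user pool per tenant where a silo tenant needs strict identity isolation.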

The Pre-Token-Generation trigger (Lambda) fires on every token issuance. It looks up the user’s tenant in the catalog and returns claimsToAddOrOverride with custom:tenantId, custom:tier, and any feature flags. Keep this Lambda fast and cache the catalog (DynamoDB DAX or in-memory), because it is on the hot path of every login and token refresh.
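
A Pre-Token-Generation handler along these lines could look as follows. It is a sketch that assumes the V1 trigger event shape (`response.claimsOverrideDetails.claimsToAddOrOverride`); `lookup_tenant`, the 60-second TTL, and the `loader` parameter (there only so the catalog read can be swapped out) are hypothetical.

```python
import time

_CACHE = {}          # username -> (cached_at, tenant record), kept in the warm container
_TTL_SECONDS = 60    # assumption: tier changes may take up to a minute to appear


def lookup_tenant(username: str) -> dict:
    """Hypothetical catalog lookup; replace with a DynamoDB read."""
    raise NotImplementedError


def cached_lookup(username, loader=None, now=time.monotonic):
    # Resolve the loader at call time so it can be replaced in tests.
    loader = loader or lookup_tenant
    hit = _CACHE.get(username)
    if hit and now() - hit[0] < _TTL_SECONDS:
        return hit[1]
    record = loader(username)
    _CACHE[username] = (now(), record)
    return record


def handler(event, context, loader=None):
    """Cognito Pre-Token-Generation trigger: inject tenant claims into the token."""
    record = cached_lookup(event["userName"], loader)
    event["response"]["claimsOverrideDetails"] = {
        "claimsToAddOrOverride": {
            "custom:tenantId": record["tenantId"],
            "custom:tier": record["tier"],
        }
    }
    return event  # Cognito expects the (mutated) event back
```

The in-memory cache trades a short staleness window for lower sign-in latency; a tenant that changes tier sees the new claim once the cached entry expires and a new token is issued.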

In Terraform, the spine looks like this (illustrative, trimmed):

resource "aws_cognito_user_pool" "main" {
  name = "saas-pool"

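  # Custom attributes appear in tokens as custom:tenantId and custom:tier.
  # Leave them out of the app client's write_attributes so users cannot edit them.
  schema {
    name                = "tenantId"
    attribute_data_type = "String"
    mutable             = true
    # Explicit constraints on String attributes avoid a perpetual Terraform diff.
    string_attribute_constraints {
      min_length = 1
      max_length = 64
    }
  }
  schema {
    name                = "tier"
    attribute_data_type = "String"
    mutable             = true
    string_attribute_constraints {
      min_length = 1
      max_length = 32
    }
  }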

  lambda_config {
    pre_token_generation = aws_lambda_function.pre_token.arn
  }
}

# The per-request tenant-scoped role the authorizer assumes.
resource "aws_iam_role" "tenant_scoped" {
  name               = "tenant-runtime"
  assume_role_policy = data.aws_iam_policy_document.trust.json
}

# A *base* policy; the authorizer adds a tighter SESSION policy at assume time.
data "aws_iam_policy_document" "tenant_base" {
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem", "dynamodb:UpdateItem"]
    resources = [aws_dynamodb_table.app.arn]
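    condition {
      test     = "ForAllValues:StringLike"
      variable = "dynamodb:LeadingKeys"
      # The trailing "#*" matches the rest of the composite key (TENANT#<id>#ENTITY#<id>)
      # and stops tenant "a1" from matching keys that belong to tenant "a10".
      values   = ["TENANT#$${aws:PrincipalTag/tenantId}#*"]
    }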
  }
}

The authorizer Lambda then does, conceptually:

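# sketch of the Lambda authorizer / token-vending function (Python, boto3);
# verify_jwt and scoped_session_policy are helpers that are not shown here
import boto3

sts = boto3.client("sts")

claims     = verify_jwt(token, cognito_jwks)          # validates signature, expiry, issuer
tenant_id  = claims["custom:tenantId"]
session    = sts.assume_role(
    RoleArn=TENANT_RUNTIME_ROLE_ARN,
    RoleSessionName=f"t-{tenant_id}",
    Tags=[{"Key": "tenantId", "Value": tenant_id}],   # becomes aws:PrincipalTag; the trust policy must allow sts:TagSession
    Policy=scoped_session_policy(tenant_id),           # JSON string: LeadingKeys / s3 prefix
    DurationSeconds=900,                               # 15 minutes, the STS minimum
)
creds = session["Credentials"]                         # AccessKeyId, SecretAccessKey, SessionToken
# return APIGW policy + context so the handler receives credentials & tenantId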

Use IaC for all of this. Terraform (or CloudFormation/CDK) is mandatory because silo provisioning is an IaC pipeline: the onboarding Step Functions state machine for a silo tenant kicks off a terraform apply (via CodeBuild) parameterized with the new tenantId, standing up that tenant’s table/cluster/KMS key/Fargate service from the same module the pooled stack uses.

Networking

Pooled compute sits in private subnets behind an internal ALB or is invoked directly (Lambda). Aurora and any siloed databases live in isolated subnets reachable only from the application tier’s security group — no public IPs. Use VPC endpoints (Gateway endpoint for DynamoDB/S3, Interface endpoints for STS, Secrets Manager, KMS) so tenant data traffic never traverses the public internet. For account-per-tenant silos, wire connectivity with AWS RAM-shared subnets or Transit Gateway, and centralize egress through a shared-services account.

Data partitioning patterns

DynamoDB (pooled): make tenantId the leading component of the partition key (TENANT#<id>#ENTITY#<id>). This gives you free isolation via LeadingKeys and natural per-tenant cost visibility. For a noisy tenant, you can later promote them to their own table without changing the key schema.
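
As a small illustration of that key layout, the helpers below build and parse tenant-leading partition keys. The function names are hypothetical, and the sketch assumes the ENTITY segment of TENANT#<id>#ENTITY#<id> stands for an upper-cased entity type such as DOC.

```python
def tenant_pk(tenant_id: str, entity: str, entity_id: str) -> str:
    """Build a pooled-table partition key with the tenant as the leading component."""
    for part in (tenant_id, entity, entity_id):
        # '#' is the separator, so it may not appear inside a component.
        if not part or "#" in part:
            raise ValueError(f"key parts must be non-empty and '#'-free: {part!r}")
    return f"TENANT#{tenant_id}#{entity.upper()}#{entity_id}"


def tenant_of(pk: str) -> str:
    """Recover the tenant id from a partition key, e.g. for cost attribution."""
    parts = pk.split("#")
    if len(parts) < 4 or parts[0] != "TENANT":
        raise ValueError(f"not a tenant-scoped key: {pk!r}")
    return parts[1]
```

Routing every write through one key builder keeps the prefix that the LeadingKeys condition depends on from drifting between code paths.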

Aurora PostgreSQL (pooled): enable RLS. Each request sets app.current_tenant to the tenantId from the JWT claim before it runs any query; an RLS policy USING (tenant_id = current_setting('app.current_tenant')::uuid) filters every row. With a connection pool, set the value per transaction (SET LOCAL, or set_config with is_local = true) so that one tenant’s context cannot linger on a connection that is later handed to another tenant. Crucially, the application connects as a non-superuser role and you ALTER TABLE ... FORCE ROW LEVEL SECURITY, because table owners bypass RLS unless it is forced and superusers bypass it always. Forgetting this is a classic, dangerous mistake.

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
  USING (tenant_id = current_setting('app.current_tenant')::uuid);
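
On the application side, the tenant context for these policies can be set once per transaction. The following is a minimal sketch, assuming a DB-API 2.0 connection with `%s` placeholders (psycopg2 is the assumed driver) and autocommit turned off; `tenant_transaction` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def tenant_transaction(conn, tenant_id: str):
    """Run one transaction with app.current_tenant scoped to that transaction."""
    cur = conn.cursor()
    try:
        # is_local = true limits the setting to the current transaction, so a
        # pooled connection cannot carry one tenant's context into the next request.
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Usage would be `with tenant_transaction(conn, tenant_id) as cur: cur.execute("SELECT * FROM documents")`, where every statement inside the block is filtered by the tenant_isolation policy.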

Metering wiring

Emit usage events from the closest reliable point to the action. Three complementary sources:

  1. API Gateway access logs → Firehose → S3 for raw call counts (cheap, loss-tolerant).
  2. Application-emitted EventBridge events for business metrics (doc.processed, report.generated) where the meaning matters for billing.
  3. Periodic sweeps (a scheduled Lambda) for stock metrics like storage GB, which you sample rather than stream.

A Kinesis stream partitioned by tenantId feeds an aggregator (Lambda for simple sums, Managed Service for Apache Flink for windowed/complex aggregation). The aggregator writes to a DynamoDB counters table for real-time enforcement and dashboards, and a daily job calls the billing API (BatchMeterUsage for AWS Marketplace, or Stripe/Metronome ingest). The S3 raw event lake is the source of truth — always reconcile aggregated numbers against it monthly, because billing disputes are won and lost on auditability.
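
The simple-sum aggregator can be sketched as an hourly tumbling window over usage events. This is illustrative only: the `eventId` field is an assumption added here for de-duplication (the article's event schema does not include it), and timestamps are assumed to be ISO-8601 strings with an explicit UTC offset.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hour_bucket(ts: str) -> str:
    """Truncate an ISO-8601 timestamp to its UTC hour (the tumbling window)."""
    dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:00Z")


def aggregate(events, seen_ids=None):
    """Sum quantity per (tenantId, metric, hour), skipping duplicate event ids."""
    seen = set() if seen_ids is None else seen_ids
    totals = defaultdict(float)
    for ev in events:
        if ev["eventId"] in seen:   # streams redeliver; count each event once
            continue
        seen.add(ev["eventId"])
        key = (ev["tenantId"], ev["metric"], hour_bucket(ev["timestamp"]))
        totals[key] += ev["quantity"]
    return dict(totals)
```

Re-running the same function over the raw S3 events for a billing period gives the reconciliation figure to compare against the live counters.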

Enterprise considerations

Security & Zero Trust. The whole design is Zero Trust at the tenant boundary: never trust a client-supplied tenant id; bind it to the JWT; mint per-request, least-privilege, tenant-scoped credentials; assume the application code will have a bug and make the data layer enforce isolation regardless. Add per-tenant KMS keys for silo tenants (this satisfies BYOK and gives you a cryptographic kill-switch: disabling the key blocks further decryption of that tenant’s data). Run AWS WAF at the edge, keep all data stores private, use Secrets Manager with rotation for DB creds, and turn on GuardDuty, CloudTrail (org-wide), and Security Hub. For the highest tier, an account-per-tenant model in AWS Organizations gives you the strongest blast-radius and compliance boundary available on AWS.

Cost optimization. This is the entire reason for the tiered model. Pooling the long tail amortizes fixed costs across thousands of small tenants — a single Aurora Serverless v2 cluster and a shared Lambda fleet serve them all, scaling to near-zero at night. Silos cost more but only for tenants who pay for them. Attribute cost per tenant using cost-allocation tags (tag every silo resource with tenantId; for pooled DynamoDB/Lambda, derive cost from the metering data) so you know your per-tenant margin and can spot a Pro tenant who is secretly costing you Enterprise money — a signal to either throttle, reprice, or graduate them to a silo. Use Graviton (arm64) for Lambda/Fargate and Savings Plans for the steady pooled baseline.

Scalability. Pooled tiers scale horizontally and automatically (DynamoDB on-demand, Lambda concurrency, Aurora Serverless v2 ACUs, Fargate auto-scaling). The defence against the noisy-neighbour problem is layered: API Gateway usage plans throttle per tenant when each tenant has its own API key; reserved concurrency caps how much of the account’s Lambda concurrency a function can consume (it is a per-function limit, so capping an individual tenant inside a shared function still needs application-level throttling); and DynamoDB per-tenant request accounting (via the metering pipeline) lets you detect and rate-limit abusers before they impact others. When a pooled tenant outgrows the pool, the bridge model lets you migrate just them to a silo without touching anyone else.
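
Application-level per-tenant throttling inside shared compute is commonly done with a token bucket. The class below is a minimal in-process sketch; `TenantRateLimiter` and its parameters are hypothetical, and a fleet of containers would need the bucket state in a shared store rather than in local memory.

```python
import time


class TenantRateLimiter:
    """Per-tenant token bucket: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self._buckets = {}  # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= cost
        if allowed:
            tokens -= cost
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant has its own bucket, one tenant exhausting its allowance has no effect on the tokens available to any other tenant.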

Reliability & DR (RTO/RPO). Set tier-differentiated targets. Pooled tier: multi-AZ everywhere (DynamoDB replicates across AZs natively; Aurora storage does too, but add a reader instance in a second AZ for fast compute failover), automated backups, RPO ≈ 5 min (DynamoDB PITR / Aurora continuous backup), RTO ≈ 1 hr for a regional failover with pre-warmed infra. Enterprise silos can buy a stricter SLA: cross-region replication (DynamoDB Global Tables, Aurora Global Database) for RPO < 1 min, RTO < 15 min, and an active-passive standby in a second region. Because silos are pure IaC, your DR runbook for a silo is “re-run the Terraform module in the DR region and restore the snapshot,” which is testable and fast.

Observability. The golden rule: tenantId is a first-class dimension on every log line, metric, and trace. Use CloudWatch Embedded Metric Format to emit per-tenant latency/error/throttle metrics from application code; propagate tenantId through X-Ray segments; ship structured logs to OpenSearch so support can pull “everything that happened for tenant ACME between 14:00 and 15:00” instantly. Build per-tenant dashboards and anomaly-detection alarms so you can answer “is the platform slow, or just slow for this one tenant?” — the question SaaS on-call gets asked at 2 a.m.
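
A per-tenant metric in CloudWatch Embedded Metric Format is a single structured log line. The sketch below builds one; the function name, the `SaaS/App` namespace, and the default unit are assumptions made for illustration.

```python
import json
import time


def emf_record(tenant_id: str, metric: str, value: float,
               unit: str = "Milliseconds", namespace: str = "SaaS/App",
               now_ms=None) -> str:
    """Build one Embedded Metric Format log line with tenantId as the dimension."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    return json.dumps({
        "_aws": {
            "Timestamp": ts,  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["tenantId"]],          # names of the dimension fields below
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        "tenantId": tenant_id,   # dimension value
        metric: value,           # metric value, keyed by the metric name
    })
```

In a Lambda function, printing this string to stdout is enough for CloudWatch to extract the metric, and the same line remains searchable as a log entry for that tenant.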

Governance. The control-plane tenant catalog is your governance system of record. Drive provisioning, deprovisioning (GDPR right-to-erasure for a silo = delete the stack + key), and tier changes through Step Functions so every lifecycle action is audited and idempotent. Use AWS Organizations SCPs to enforce guardrails (no public S3, mandatory encryption, region restrictions) across all tenant accounts. Tag everything; reconcile billing monthly against the S3 raw event lake.

Reference enterprise example

Quillstream Inc. is a fictional 60-person startup selling a collaborative document-analysis SaaS to legal and financial firms. They have 1,400 tenants: 1,250 on Starter/Pro (pooled), 138 on Business (pooled, with higher limits and a dedicated DynamoDB table for their hottest collection), and 12 Enterprise tenants (full silo — dedicated Aurora cluster, dedicated Fargate service, per-tenant KMS key, one in eu-central-1 for a German bank with data-residency requirements). Two Enterprise tenants are large enough to warrant a dedicated AWS account each, governed via Control Tower.

Their numbers:

A decision they made and why. A Pro tenant, Harbor & Vance LLP, was running 11x the average API volume and 2.1 TB of storage on a flat $290/mo plan — their fully-loaded cost was ~$1,900/mo, a heavily negative margin. The metering data surfaced it on a per-tenant-margin dashboard. Quillstream offered them a Business-tier upsell with a dedicated table and metered overage pricing; the firm, now aware of their own heavy usage, upgraded to $2,400/mo. The metering pipeline didn’t just enable billing — it found a loss-making customer and turned them into a profitable one.

The outcome. When a Fortune-500 prospect’s security team sent the 200-line questionnaire demanding dedicated database, BYOK, EU residency, and proof of tenant isolation, Quillstream answered “yes” to all of it by pointing at the silo model and the scoped-credential design — and closed a $540k/year contract that pooled-only competitors couldn’t service. Meanwhile their blended infra cost stayed at ~11% of revenue because 99% of tenants are pooled. One codebase, one control plane, three isolation shapes.

When to use it

Use this architecture when you are building genuine B2B multi-tenant SaaS with a heterogeneous customer base — a price-sensitive long tail and a compliance-demanding head — and you intend to run a single codebase. The tiered pool/silo bridge is the right default for almost any SaaS that expects to sell up-market over time.

Trade-offs. The flexibility costs you complexity: you now operate two planes, multiple isolation models, a token-vending layer, and a metering pipeline. That is a lot of moving parts for a pre-product-market-fit startup with five tenants. The honest counsel from the AWS SaaS Lens is to start pooled-only (one shared stack, JWT tenant context, DynamoDB LeadingKeys / Aurora RLS, and the metering pipeline from day one) and add the silo machinery only when a real enterprise deal demands it — but design the tenant-context and metering plumbing up front, because retrofitting tenantId into a single-tenant codebase later is agony.

Anti-patterns to avoid:

Alternatives. If you only ever serve one customer profile, you may not need the bridge: a pure-pool SaaS (everyone shares, isolation in software) is dramatically simpler and right for high-volume, low-touch products. A pure-silo model (a dedicated stack per customer, fully automated by IaC) suits very-high-value, low-count, regulated businesses (think 30 hospital systems) where pooling buys you nothing and isolation is the whole product. And if “multi-tenancy” for you just means logical separation inside one big database with no enterprise/compliance pressure, a single shared cluster with RLS — without Cognito custom claims, scoped credentials, or silos — may be all you need. This reference architecture exists precisely for the common, harder case in between: when you must serve both ends of the market, profitably, from one team and one codebase.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.