Multi-Tenant SaaS Control Plane and Tenant Isolation on AWS

A health-tech company that sells a clinical scheduling and patient-intake product to hospital groups gets a board-level ultimatum after its third enterprise deal stalls in security review: one prospect — a 14-hospital system — will not sign until the vendor can prove, on paper, that its patient data can never be read by another customer’s application, even by accident, even by a bug. The product today is a single Rails monolith with a tenant_id column on every table and a WHERE tenant_id = ? clause the developers promise they always remember. The prospect’s CISO asked one question — “show me the IAM boundary that makes a forgotten WHERE clause harmless” — and there wasn’t one. The deal is worth more than the entire current book of business, and it is HIPAA-regulated PHI where a cross-tenant leak is a reportable breach. The ask is no longer “ship features.” It is “re-architect into a real multi-tenant SaaS platform with provable tenant isolation, and do it without a year-long rewrite.” This article is the reference architecture for that platform on AWS.

The pressures here are the ones every B2B SaaS company eventually hits, just compressed by regulation. Isolation has to be enforced by the cloud’s own access-control plane, not by application discipline, because “we always remember the WHERE clause” is not an audit answer. Scale means hundreds of tenants ranging from a five-clinic practice to a multi-hospital system, with wildly different load profiles, sharing infrastructure economically. Onboarding has to be self-service and fast — sales closes a deal Friday, the tenant expects a working environment Monday — which means a control plane that provisions tenants programmatically, not a runbook an engineer executes. And billing has to charge each tenant for what they actually consumed, which means metering usage per tenant from day one. The pattern that satisfies all four is the control-plane / application-plane split with tiered tenant isolation, and getting the isolation model right is the whole game.

Why not the obvious shortcuts

Three shortcuts get proposed in every one of these projects, and each fails for a reason worth naming out loud.

“Just keep the shared database with a tenant_id column.” This is pooled tenancy taken to its logical end with zero isolation, and it is exactly what the CISO rejected. A single missing predicate, an ORM default scope that gets bypassed, a reporting query that forgets the filter — any one of them is a cross-tenant PHI leak. The data is co-mingled and the only thing keeping tenants apart is application code that humans wrote. For regulated data, that is not isolation; it is a promise.

“Give every tenant their own AWS account and their own full stack.” This is silo tenancy, and it is genuinely the strongest isolation you can buy — a hard AWS account boundary per tenant. But running a complete stack per tenant for a five-clinic practice paying a modest monthly fee is economically absurd; you would spend more on idle infrastructure than the contract is worth, and operating hundreds of identical stacks is its own nightmare. Silo for everyone doesn’t scale down.

“Spin up a separate Kubernetes namespace per tenant and call it isolated.” Namespaces are a soft boundary — a logical partition inside a shared cluster, not a security boundary against a determined attacker or a serious misconfiguration. Network policies and RBAC help, but a namespace does not stop an over-broad IAM role on the node from reaching another tenant’s data in S3. Namespaces are an operational convenience, not the isolation a regulator wants to see.

The architecture that actually works refuses to pick one globally. It offers isolation tiers — pooled for the cost-sensitive small tenants, silo for the enterprise tenant who demands a hard boundary and will pay for it — and crucially, it makes even the pooled tier enforce isolation through IAM, not through a WHERE clause. That is the move the CISO was asking for.

Architecture overview

[Figure: architecture diagram of the multi-tenant SaaS control plane and tenant isolation on AWS]

The defining structural decision — the one to fix in your head before anything else — is the split between two planes that share an AWS organization but live on completely different schedules and have completely different blast radii:

The control plane is the SaaS provider’s own system. It owns tenant onboarding, the tenant registry, identity configuration, billing/metering aggregation, and tenant lifecycle (provision, suspend, offboard). It is shared infrastructure that no tenant ever touches directly. If it goes down, no new tenants onboard, but existing tenants keep running.
The application plane is where tenant workloads actually run and serve end users. Depending on the tenant’s tier, this is either pooled (shared compute and data, partitioned by tenant) or silo (dedicated resources, in the strongest case a dedicated AWS account).

Keeping these separate is what lets you reason about security and failure independently. A bug in the application plane cannot reach the tenant registry; a control-plane deploy cannot take down a tenant’s runtime.

Onboarding flow, following the control plane:

1. Sales closes a deal and a rep submits an onboarding request through ServiceNow: tenant name, tier (pooled or silo), data-residency region, contracted seat count. ServiceNow runs the approval workflow (a HIPAA SaaS provisioning a new covered-entity tenant has a compliance sign-off gate here) and on approval fires a webhook.
2. The webhook lands on the control-plane onboarding service, an API behind API Gateway backed by a Step Functions state machine. Onboarding is a multi-step, long-running, must-be-idempotent workflow, which is exactly what Step Functions is for.
3. The state machine writes the new tenant into the Tenant Registry, a DynamoDB table that is the system of record for every tenant: tenantId, tier, status, region, IAM role ARNs, isolation metadata.
4. It configures tenant identity: a per-tenant user pool or group in Amazon Cognito for tenants who use the product’s native login, federated to Okta for enterprise tenants who bring their own IdP via SAML/OIDC. Either way the outcome is the same: tokens that carry a tenant_id claim.
5. It provisions the tenant’s resources by invoking infrastructure-as-code. For a pooled tenant this is lightweight: create the tenant’s IAM role and its data partition. For a silo tenant it runs a full Terraform apply (and Ansible for any in-instance configuration) to stand up dedicated infrastructure. In the strongest tier, that means a brand-new AWS account vended through AWS Control Tower Account Factory so the tenant gets a hard account boundary.
6. It registers the tenant in billing/metering and flips status to active. The whole thing is event-driven and idempotent, so a partial failure replays cleanly rather than leaving a half-built tenant.
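
The replay behaviour described in the last step can be illustrated without any AWS calls. The sketch below is a plain-Python stand-in for the state machine, not Step Functions itself: the in-memory `registry` dict plays the role of the Tenant Registry, and `run_onboarding`, the step names, and the `completed_steps` field are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# In-memory stand-in for the DynamoDB Tenant Registry (assumption for illustration).
registry: Dict[str, dict] = {}

Step = Tuple[str, Callable[[dict], None]]


def run_onboarding(tenant_id: str, tier: str, steps: List[Step]) -> dict:
    """Run each step at most once per tenant; safe to call again after a failure."""
    record = registry.setdefault(
        tenant_id,
        {"tenantId": tenant_id, "tier": tier, "status": "provisioning", "completed_steps": []},
    )
    for name, action in steps:
        if name in record["completed_steps"]:
            continue  # finished on an earlier attempt, so skip it on replay
        action(record)  # may raise; status then stays "provisioning"
        record["completed_steps"].append(name)
    record["status"] = "active"  # reached only when every step has succeeded
    return record
```

Calling `run_onboarding` a second time after a failed step resumes at that step, and the tenant never reads as active while any step is outstanding.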

Request flow, following the application plane (a pooled tenant):

1. An end user (a hospital scheduler) authenticates. They hit Akamai at the edge first for TLS termination, global anycast, WAF, and bot/DDoS mitigation, then reach Cognito (or are federated through to Okta for an enterprise tenant). They receive a JWT whose claims include tenant_id and the user’s role.
2. The request reaches API Gateway, which is the single front door for the application plane. A Lambda authorizer validates the JWT, extracts the tenant_id, and (this is the pivotal step) vends scoped, short-lived credentials for that tenant before the request ever touches tenant data.
3. The application code (on ECS Fargate or Lambda) handles the request using those scoped credentials only. Every read and write to the tenant’s data goes through the AWS access-control plane, which enforces the partition regardless of what the SQL or DynamoDB query says.
4. The application emits a usage metering event per billable operation (an appointment created, an intake processed, MB stored) onto Kinesis, tagged with tenant_id.
5. The tenant-scoped result returns to the user, and the request is observable end to end in Datadog with tenant_id on every span.

The thing that makes the pooled tier actually isolated — and the thing the CISO wanted to see — is step 2 and step 3 together: the application never holds broad credentials. It only ever holds credentials scoped to the one tenant in the current request.
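
A minimal sketch of the authorizer's decision step follows. It assumes the JWT signature and expiry have already been verified upstream, and it omits the credential-vending call. The function name `build_authorizer_response` and the `role` claim name are illustrative assumptions. The response shape (`principalId`, `policyDocument`, `context`) is the one API Gateway expects from a Lambda authorizer.

```python
from typing import Any, Dict


def build_authorizer_response(claims: Dict[str, Any], method_arn: str) -> Dict[str, Any]:
    """Turn already-verified JWT claims into a Lambda authorizer response."""
    tenant_id = claims.get("tenant_id")
    allowed = isinstance(tenant_id, str) and tenant_id != ""
    response: Dict[str, Any] = {
        "principalId": str(claims.get("sub", "anonymous")),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": "Allow" if allowed else "Deny",  # no tenant_id, no entry
                "Resource": method_arn,
            }],
        },
    }
    if allowed:
        # The backend reads the tenant from this context, never from the request body.
        response["context"] = {"tenant_id": tenant_id, "role": str(claims.get("role", ""))}
    return response
```

A token without a tenant_id claim is denied outright rather than mapped to a default tenant, which keeps the failure mode closed.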

The isolation mechanism: dynamic per-tenant IAM

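This is the technical heart of the platform, so it gets its own section. In the pooled tier, data for many tenants lives in shared stores: a shared DynamoDB table, a shared S3 bucket, a shared Aurora cluster. Isolation comes from IAM policies that scope access by tenant at request time, a pattern AWS calls dynamic policy generation (and what the SaaS Factory community popularized as token vending). IAM conditions can pin DynamoDB partition keys and S3 prefixes directly. They cannot see rows inside a relational database, so a shared Aurora cluster needs a database-level equivalent, such as per-tenant database credentials combined with row-level security or a schema per tenant, behind the same tenant identity.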

For DynamoDB, the tenant’s data is partitioned by a key prefix (tenantId as the partition key), and the scoped credential carries an IAM policy with a condition that pins access to that prefix:

```json
{
  "Effect": "Allow",
  "Action": ["dynamodb:Query", "dynamodb:PutItem", "dynamodb:GetItem"],
  "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Appointments",
  "Condition": {
    "ForAllValues:StringEquals": {
      "dynamodb:LeadingKeys": ["${aws:PrincipalTag/tenant_id}"]
    }
  }
}
```
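
To see why that condition blocks a cross-tenant key, here is a deliberately simplified local model of how the `ForAllValues:StringEquals` check on `dynamodb:LeadingKeys` resolves. It is not the IAM policy engine, and the function name `leading_keys_allowed` is an assumption; it only mirrors the substitution of the principal tag and the set comparison.

```python
from typing import Dict, List


def leading_keys_allowed(policy_values: List[str], principal_tags: Dict[str, str],
                         requested_keys: List[str]) -> bool:
    """Simplified model of ForAllValues:StringEquals on dynamodb:LeadingKeys."""
    resolved = []
    for value in policy_values:
        # Substitute ${aws:PrincipalTag/<name>} with the session's tag value.
        for tag, tag_value in principal_tags.items():
            value = value.replace("${aws:PrincipalTag/" + tag + "}", tag_value)
        resolved.append(value)
    # ForAllValues: every requested key must equal one of the policy values.
    # An empty request list yields True, as ForAllValues does for an absent key.
    return all(key in resolved for key in requested_keys)
```

A session tagged `tenant_id=hosp-a` passes for partition key `hosp-a` and fails for `hosp-b`, whatever the application's query asked for.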

For S3, the same idea uses a per-tenant key prefix: the policy’s Resource ARN pins object reads and writes to that prefix, and an s3:prefix condition does the same for ListBucket. The Lambda authorizer (or a token-vending service) calls STS AssumeRole with a session policy or session tags carrying the tenant_id, and the application receives credentials that cannot read another tenant’s partition even if the application code is wrong. Now the CISO’s question has an answer: a forgotten WHERE clause is harmless, because the IAM boundary denies the cross-tenant read before the query ever returns a row.
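
One way to express the S3 side is an inline session policy generated per request. The sketch below assumes a `tenants/<tenant_id>/` key layout and a function name, `build_s3_session_policy`, that are not in the original design. It also validates the tenant id, since a wildcard character in it would silently widen the policy.

```python
import re
from typing import Any, Dict


def build_s3_session_policy(bucket: str, tenant_id: str) -> Dict[str, Any]:
    """Build an inline session policy limited to one tenant's key prefix."""
    if not re.fullmatch(r"[A-Za-z0-9-]+", tenant_id):
        raise ValueError("tenant_id contains characters that could widen the policy")
    prefix = f"tenants/{tenant_id}/"  # assumed key layout
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # object reads and writes are scoped by the resource ARN itself
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {   # listing is scoped by the s3:prefix condition key
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
        ],
    }
```

Serialized with `json.dumps`, the result can be passed as the `Policy` parameter of STS AssumeRole. The effective permissions are then the intersection of the role's own policy and this session policy, so the session can only narrow access.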

The silo tier doesn’t need these gymnastics: the AWS account boundary (or at minimum a dedicated table/bucket with a resource policy) does the isolating. That is precisely the tradeoff: silo buys isolation with money and operational overhead; pooled buys it with a more sophisticated IAM design and pays for it in engineering complexity.

| Isolation tier | What’s shared | Isolation boundary | Cost per tenant | Operational load | Right for |
|---|---|---|---|---|---|
| Pooled | Compute + data stores, partitioned | Dynamic per-tenant IAM (STS scoped creds) | Lowest (shared) | Low (one stack) | Small/mid tenants, cost-sensitive |
| Bridge | Compute shared, data siloed | Dedicated DB/bucket per tenant + IAM | Medium | Medium | Tenants wanting data separation, not full silo |
| Silo (account) | Nothing | AWS account boundary | Highest | High (N stacks) | Enterprise, regulated, contractually required |

Most real platforms run all three at once, and the tenant registry records which tier each tenant is on so the routing and provisioning logic can branch accordingly.
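
The branching on tier can be sketched as a small lookup over the registry record. The `TenantRecord` fields and the action names below are hypothetical labels for illustration, not the platform's actual provisioning steps.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(str, Enum):
    POOLED = "pooled"
    BRIDGE = "bridge"
    SILO = "silo"


@dataclass(frozen=True)
class TenantRecord:
    tenant_id: str
    tier: Tier
    region: str


def provisioning_plan(tenant: TenantRecord) -> List[str]:
    """Return the ordered provisioning actions for a tenant's tier."""
    plan = ["create_tenant_iam_role"]  # every tier gets a scoped role
    if tenant.tier is Tier.POOLED:
        plan.append("create_data_partition")
    elif tenant.tier is Tier.BRIDGE:
        plan.append("create_dedicated_datastore")
    else:  # Tier.SILO: hard account boundary plus a full stack
        plan += ["vend_aws_account", "apply_full_stack"]
    return plan
```

Keeping the tier in one typed field means routing and provisioning read the same value, so a tier change is a single registry update.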

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot/DDoS mitigation at the perimeter | Per-tenant custom domains map to one origin; WAF rules at the edge |
| Identity (native) | Amazon Cognito | Per-tenant user pools/groups; issues JWT with `tenant_id` claim | One user pool per tenant (silo) or shared pool + group (pooled) |
| Identity (enterprise) | Okta | SAML/OIDC federation for tenants bringing their own IdP | Per-tenant SAML connection federated into Cognito |
| Front door | API Gateway + Lambda authorizer | Single entry; validates JWT, vends scoped per-tenant credentials | Authorizer caches by token; STS AssumeRole with session tags |
| Application compute | ECS Fargate / Lambda | Tenant request handling using scoped credentials only | No broad IAM role; only the vended tenant credential |
| Control plane | Step Functions + API Gateway + DynamoDB | Onboarding, tenant registry, lifecycle orchestration | Idempotent state machine; registry is system of record |
| Account vending | Control Tower Account Factory | Hard account boundary for silo-tier tenants | Account Factory for Terraform; guardrails inherited |
| Secrets | HashiCorp Vault | Per-tenant DB credentials, third-party API keys, signing keys | Namespaces per tenant; dynamic DB creds with short leases |
| Metering | Kinesis + Lambda + DynamoDB | Per-tenant usage events aggregated for billing | `tenant_id` on every event; aggregation to billing system |
| CSPM / posture | Wiz + Wiz Code | Cloud posture, cross-tenant exposure, attack-path; IaC scanning | Alerts on any IAM drift that widens a tenant boundary |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on Fargate tasks and silo EC2 | Sensor on compute; detections piped to the SOC |
| Observability | Datadog | Per-tenant dashboards, traces, usage telemetry | `tenant_id` tag on every metric/span; per-tenant SLOs |
| ITSM / approvals | ServiceNow | Onboarding approvals, change gates, incident records | Provisioning approval workflow; auto-ticket on isolation alert |
| CI / IaC | GitHub Actions + Jenkins + Argo CD + Terraform/Ansible | Build/test/deploy; tenant provisioning; GitOps for app plane | OIDC to AWS (no stored keys); Argo CD reconciles app-plane state |

A few of these choices deserve the why, because they are the ones teams get wrong.

Why the control plane is a separate system, not “an admin page in the app.” Folding tenant management into the application means a bug in the customer-facing app can corrupt the tenant registry, and a tenant request path shares blast radius with provisioning. Separating them means the registry — the thing that defines who every tenant is — sits behind its own boundary, deployed on its own cadence, with its own tightly scoped access. The application plane reads tenant config; it never writes it.

Why Vault sits alongside per-tenant IAM. IAM scopes access to AWS resources, but silo tenants often have their own database with its own credentials, and tenants integrate third-party systems (a hospital’s EHR) with their own API keys. HashiCorp Vault holds those, isolated by per-tenant namespaces, and issues dynamic, short-lived database credentials so a tenant’s DB password is leased for minutes, not stored forever. IAM handles the AWS-native partition; Vault handles everything that isn’t AWS-native.

Why GitOps with Argo CD for the application plane but Terraform for tenant provisioning. The application plane is one shared deployment whose desired state belongs in Git and should be continuously reconciled — that is Argo CD’s job, driven by GitHub Actions for build/test. But vending a new tenant (especially a silo account) is an imperative, parameterized infrastructure action best expressed as Terraform (with Ansible for in-instance config), invoked by the control plane’s Step Functions workflow, with Jenkins orchestrating the heavier silo-account pipelines that the org already runs. Two different problems, two right tools.

Implementation guidance

Provision the organization and guardrails first. The deployment order matters because the isolation story rests on the AWS Organizations structure underneath it.

1. An AWS Organizations structure with a dedicated control-plane account, a shared application-plane account for pooled tenants, and an OU under which silo tenant accounts are vended.
2. Control Tower with guardrails (SCPs) that every account inherits (deny public S3, require encryption, deny disabling CloudTrail), so a tenant boundary cannot be weakened even by a mistake inside a tenant account.
3. The tenant registry DynamoDB table and the control-plane onboarding Step Functions workflow.
4. The token-vending path: the Lambda authorizer, the per-tenant IAM role template, and the STS scoped-credential logic.
5. The application-plane compute (Fargate services / Lambda) wired to receive only vended credentials.

A minimal shape of the token-vending step communicates the intent — assume a role, tag the session with the tenant, get back credentials that cannot escape the partition:

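```python
import boto3

sts = boto3.client("sts")
# Placeholder ARN; the role's trust policy must also allow sts:TagSession.
TENANT_ACCESS_ROLE_ARN = "arn:aws:iam::123456789012:role/TenantAccessRole"


def vend_tenant_credentials(tenant_id: str) -> dict:
    """Assume the tenant-access role with the session tagged for one tenant."""
    resp = sts.assume_role(
        RoleArn=TENANT_ACCESS_ROLE_ARN,
        RoleSessionName=f"tenant-{tenant_id}",
        Tags=[{"Key": "tenant_id", "Value": tenant_id}],  # read as aws:PrincipalTag/tenant_id
        DurationSeconds=900,  # 15 minutes, the STS minimum
    )
    return resp["Credentials"]  # scoped to this tenant only
```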

The pipelines that apply infrastructure run in GitHub Actions and authenticate to AWS via OIDC federation, so there is no stored access key to leak, a mistake this team intends never to repeat. Jenkins handles the silo-account provisioning runs that predate the move to Actions.

Identity: federate the enterprise, pool the rest. Native tenants log in through Cognito; the JWT carries a tenant_id claim that drives everything downstream. Enterprise tenants who insist on their own IdP federate through Okta (per-tenant SAML connection) into Cognito so the application sees one token shape regardless of origin. The tenant_id claim is the thread that runs through the entire request: the authorizer reads it, STS tags the session with it, IAM conditions enforce it, Kinesis tags the meter with it, and Datadog dashboards slice by it.

Metering wiring. Emit a usage event per billable action onto Kinesis at the moment it happens, tagged with tenant_id and event type. A Lambda consumer aggregates into a per-tenant usage table, which the billing system reads to produce invoices. Meter from day one — bolting metering on later means you cannot bill accurately for the months before you added it, and “we’ll figure out billing later” is how SaaS companies leave revenue on the floor.
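
A minimal sketch of the aggregation step in the Lambda consumer, under stated assumptions: the payload field names `event_type` and `quantity` are illustrative, and writing the totals to the usage table is left out. The `Records[].kinesis.data` base64 envelope is the shape Lambda delivers for a Kinesis event source.

```python
import base64
import json
from collections import Counter
from typing import Any, Dict, Tuple


def aggregate_usage(event: Dict[str, Any]) -> Dict[Tuple[str, str], int]:
    """Sum billable quantities per (tenant_id, event_type) for one Kinesis batch."""
    totals: Counter = Counter()
    for record in event.get("Records", []):
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        tenant_id = payload["tenant_id"]  # a missing tenant_id raises KeyError instead of being guessed
        totals[(tenant_id, payload["event_type"])] += int(payload.get("quantity", 1))
    return dict(totals)
```

Because Kinesis delivery to Lambda is at-least-once, a real consumer would also deduplicate on an event id before incrementing the per-tenant usage table, so a retried batch does not double-bill.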

Enterprise considerations

Security & tenant isolation as the product. For this company, isolation is a feature you sell, not just a control you operate. The layered story: (a) dynamic per-tenant IAM makes pooled-tier isolation enforced by AWS, not by application code, and is the core of the architecture; (b) AWS account boundaries via Control Tower give silo tenants the hardest isolation available; (c) HashiCorp Vault namespaces isolate per-tenant secrets and lease DB credentials dynamically; (d) Wiz runs continuous CSPM and attack-path analysis across every account, alerting the instant any IAM change widens a tenant boundary or a resource drifts to public exposure, and Wiz Code scans the Terraform before it applies, so a boundary-weakening change is caught in the pull request, not in production; (e) CrowdStrike Falcon sensors on Fargate tasks and silo EC2 give runtime threat detection feeding the SOC; (f) an isolation alert auto-raises a ServiceNow incident so security has a ticket, not just a log line. The single most important test in the entire CI suite is the cross-tenant isolation test: spin up two tenants, vend credentials for tenant A, assert that an attempt to read tenant B’s data is denied by IAM. That test failing blocks every deploy.
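
The shape of that isolation test can be sketched independently of the AWS SDK. In the sketch below, `vend` and `read` are injected callables (in CI they would wrap the real STS and data-store calls), and `AccessDenied` is a local stand-in for the SDK's access-denied error. All three names are assumptions for illustration.

```python
from typing import Callable


class AccessDenied(Exception):
    """Stand-in for the access-denied error the AWS SDK would raise."""


def cross_tenant_read_is_denied(
    vend: Callable[[str], dict],
    read: Callable[[dict, str], object],
    tenant_a: str,
    tenant_b: str,
) -> bool:
    """True only if tenant A's credentials can read A's data and are denied B's."""
    creds = vend(tenant_a)
    read(creds, tenant_a)  # positive control: must succeed, or the test proves nothing
    try:
        read(creds, tenant_b)
    except AccessDenied:
        return True
    return False  # the cross-tenant read succeeded, so isolation is broken
```

The positive control matters: credentials that can read nothing at all would also be "denied" tenant B's data, and the test would pass vacuously.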

Cost optimization. The whole reason pooled tenancy exists is cost, so engineer the economics deliberately.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Tier by tenant value | Pooled for small tenants, silo only where required/paid-for | Avoids per-tenant idle stacks for the long tail |
| Shared serverless compute | Fargate/Lambda scales to actual pooled load, not peak-per-tenant | One elastic pool instead of N reserved stacks |
| Per-tenant cost attribution | Tag every resource with `tenant_id`; metering ties spend to revenue | Surfaces unprofitable tenants; informs pricing |
| Right-size silo tenants | Terraform-vended silo stacks scoped to the tenant’s real load | Stops over-provisioning the expensive tier |
| Aggregate metering | Kinesis → per-tenant usage → chargeback | Each tenant’s bill reflects actual consumption |

Tag everything with tenant_id and pipe per-tenant cost and usage into Datadog, which the team uses for the per-tenant profitability dashboard the CFO sees — the same view that flags when a pooled tenant has grown enough to justify (and pay for) a move to silo.

Scalability. Each tier scales differently, which is the point. The pooled application plane scales as one elastic system — Fargate services scale on concurrency, Lambda on invocations — so adding a small tenant adds near-zero fixed cost. Silo tenants scale independently inside their own account, isolated from noisy neighbors entirely. The control plane scales with onboarding rate, not request rate, so it stays small and cheap. The natural ceiling on the pooled tier is the “noisy neighbor” problem — one tenant’s spike degrading others — which is why per-tenant usage throttling at API Gateway (rate limits keyed on tenant_id) and per-tenant Kinesis/DynamoDB capacity awareness matter, and why a tenant who consistently saturates the pool is a sales signal to upgrade them to silo.
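
Per-tenant throttling at API Gateway is configured rather than coded, but the behaviour it provides can be modelled in a few lines. The token-bucket class below is an in-process illustration under that assumption. `TenantRateLimiter` and its parameters are hypothetical names, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable, Dict, Tuple


class TenantRateLimiter:
    """Token bucket per tenant_id, so one tenant's burst cannot drain another's."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets: Dict[str, Tuple[float, float]] = {}  # tenant -> (tokens, last_seen)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)  # refill
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Because each tenant has its own bucket, a tenant that exhausts its allowance is rejected while every other tenant's requests continue to pass.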

Failure modes, and what each one looks like. Name them before they page you.

A widened IAM boundary — a policy change accidentally drops the tenant_id condition, and the pooled tier silently loses isolation. Mitigation: the cross-tenant isolation test as a hard CI gate, plus Wiz alerting on IAM drift in production.
Noisy neighbor in the pool — one tenant’s traffic spike starves others sharing the Fargate pool and DynamoDB capacity. Mitigation: per-tenant API Gateway throttling, usage-based isolation, and a silo-upgrade path for chronic offenders.
Onboarding partial failure — a silo provision fails halfway, leaving a half-built tenant. Mitigation: the idempotent Step Functions workflow replays from the failed step; status stays provisioning until complete, never active.
Control-plane outage — the registry is unreachable. Mitigation: the application plane caches tenant config and runs on it; onboarding pauses but live tenants are unaffected because the planes are decoupled.
Silo sprawl — too many under-utilized silo accounts because everyone got upgraded. Mitigation: tier decisions driven by the metering/cost data, not by sales reflex.
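
The control-plane outage mitigation above depends on the application plane holding a usable copy of tenant config. One hedged way to sketch that is a read-through cache that serves the last good value when the registry cannot be reached. `TenantConfigCache`, the `fetch` callable, and the TTL default are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TenantConfigCache:
    """Read-through cache that keeps serving the last good config if the registry is down."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[dict, float]] = {}  # tenant -> (config, fetched_at)

    def get(self, tenant_id: str) -> dict:
        now = self.clock()
        cached = self._entries.get(tenant_id)
        if cached and now - cached[1] < self.ttl:
            return cached[0]  # fresh enough, no registry call
        try:
            config = self.fetch(tenant_id)
        except Exception:
            if cached:
                return cached[0]  # registry unreachable: serve stale config
            raise  # never seen this tenant, nothing safe to serve
        self._entries[tenant_id] = (config, now)
        return config
```

The cost of this design is staleness: a tenant suspended during an outage keeps its cached status until the registry is reachable again, so a real implementation would likely cap how old a stale entry may be.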

Reliability & DR (RTO/RPO). Decide the numbers per plane and per tier. The control plane is recoverable from the DynamoDB tenant registry (point-in-time recovery on) and the IaC in Git. If it is down, no live tenant’s traffic is affected, so a longer RTO is acceptable: target RTO 1 hour, RPO 5 minutes. The pooled application plane is the customer-facing tier and deserves tighter numbers: multi-AZ Fargate, DynamoDB global tables or PITR, target RTO 15 minutes, RPO 1 minute. Silo tenants get DR scoped (and priced) per contract, since an enterprise tenant may demand cross-region standby that a small pooled tenant would never pay for. Akamai health checks drive edge failover for ingress.

Observability. Instrument with Datadog and put tenant_id on every metric, log, and span — this single discipline is what makes per-tenant SLOs, per-tenant cost, and per-tenant incident triage possible. Emit the metrics the business actually cares about: per-tenant request rate and error rate, per-tenant p95 latency, per-tenant usage/consumption (the metering feed), and pool saturation (the early warning for noisy-neighbor). Alert on per-tenant SLO breaches so a single unhappy enterprise tenant surfaces before their CISO emails. New tenant onboarding and any tier change pass through a ServiceNow change record, giving compliance a documented gate for every covered-entity tenant added.
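
The "tenant_id on every log" discipline is easier to keep when it is automatic. The sketch below uses only the Python standard library, not any Datadog API: a context variable holds the current request's tenant and a logging filter stamps it on each record. The names `current_tenant`, `TenantFilter`, and `build_logger` are illustrative assumptions.

```python
import contextvars
import logging

# Set once per request (for example from the authorizer context); assumed wiring.
current_tenant = contextvars.ContextVar("current_tenant", default="unknown")


class TenantFilter(logging.Filter):
    """Attach the current request's tenant_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = current_tenant.get()
        return True  # never drop the record, only enrich it


def build_logger(name: str = "app") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s tenant_id=%(tenant_id)s %(message)s"))
        handler.addFilter(TenantFilter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Application code then logs normally, and any line emitted while handling a request carries that request's tenant without the caller passing it around.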

Governance. Pin the SCP guardrails at the Organizations level so no individual account — tenant or shared — can weaken them. Keep all infrastructure and IAM policy templates in version control, reviewable and revertable, with Wiz Code scanning IaC in the pull request. Log every credential-vending event and every cross-tenant access denial (the denials are the proof your isolation is working) for audit. Maintain a tenant offboarding path that is as rigorous as onboarding — data export, then verifiable deletion — because HIPAA and most enterprise contracts require provable data destruction when a tenant leaves, and “we think we deleted it” is not a compliant answer.
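
Alongside commercial IaC scanning, a small hand-rolled check on the policy templates can catch a dropped tenant condition in the pull request. The sketch below is such a check under narrow assumptions: it looks only at DynamoDB Allow statements and accepts only the exact `ForAllValues:StringEquals` pin on `dynamodb:LeadingKeys`. The function name is hypothetical, and this is not how Wiz Code works.

```python
from typing import Any, Dict, List

TENANT_VARIABLE = "${aws:PrincipalTag/tenant_id}"


def statements_missing_tenant_condition(policy: Dict[str, Any]) -> List[int]:
    """Return indexes of Allow statements touching DynamoDB that lack the tenant pin."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]  # IAM allows a single statement object
    offenders = []
    for index, stmt in enumerate(statements):
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        touches_dynamodb = any(a == "*" or a.startswith("dynamodb:") for a in actions)
        if stmt.get("Effect") != "Allow" or not touches_dynamodb:
            continue
        values = (stmt.get("Condition", {})
                      .get("ForAllValues:StringEquals", {})
                      .get("dynamodb:LeadingKeys", []))
        values = [values] if isinstance(values, str) else values
        if values != [TENANT_VARIABLE]:  # missing, or widened with extra values
            offenders.append(index)
    return offenders
```

A non-empty result fails the build. The check is intentionally strict, so a legitimately different policy needs an explicit, reviewed exception rather than a silent pass.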

Explicit tradeoffs

Accept these or do not build it. A tiered multi-tenant platform is more complex than either extreme — you are running pooled and silo simultaneously, with a control plane on top, and the dynamic-IAM token-vending design is genuinely harder to build and reason about than a tenant_id column. The control plane is real software you have to build, operate, and secure, separate from your product. Per-tenant IAM adds an STS hop and an authorizer on the hot path — small, but non-zero latency and a piece that, if it breaks, breaks every tenant. And running silo accounts means operating N stacks with all the patching, monitoring, and drift management that implies. None of this is overhead you’d take on for a five-customer startup; all of it is mandatory the moment a regulated enterprise tenant makes isolation a contractual condition.

The alternatives, and when they win. If every customer is small and none is regulated, pure pooled with a tenant_id column and good test discipline is simpler and may be entirely adequate — graduate to IAM-enforced isolation when a customer’s security team demands proof. If you have only a handful of large, well-funded, isolation-obsessed customers and no long tail, silo-everything (account per tenant) skips the control-plane sophistication and the dynamic-IAM design — the operational cost is real but you avoid the pooled tier’s complexity entirely. And if you are early and optimizing for speed over isolation guarantees, a managed multi-tenant platform or a framework’s built-in tenancy gets you running fast; graduate to this architecture when isolation, scale, regulation, or per-tenant economics force the issue — which, for any B2B SaaS that sells upmarket, they eventually will.

The shape of the win

For the health-tech company, the payoff is not “we re-platformed.” It is that the prospect’s CISO asks the same question — “show me the IAM boundary that makes a forgotten WHERE clause harmless” — and this time the answer is a live demo: vend credentials for tenant A, attempt to read tenant B’s PHI, watch AWS IAM deny it before a single row returns, and point at the CI test that proves it stays that way on every deploy. The 14-hospital deal closes, and it closes into a pooled tier the company can actually afford to run at scale, with a clean upgrade path to a silo account the day that tenant’s volume — or their next audit — justifies it. Everything upstream — the control-plane split, the Cognito/Okta identity with its tenant_id claim, the STS token vending, the Vault-held per-tenant secrets, the Wiz boundary monitoring, the Datadog per-tenant telemetry, the ServiceNow approval gates — exists to turn “trust our developers” into “here is the boundary, enforced by AWS, and here is the test that proves it.” That is the sentence that unlocks the enterprise market. Start narrower if you must, but this is where regulated, at-scale multi-tenant SaaS has to land.

Written by Vinod
