
Multi-Tenant ML SaaS: Isolation and Per-Tenant Models

A B2B logistics-analytics company sells one product: a forecasting and routing-optimization service that ingests a carrier’s shipment events and returns ETA predictions, demand forecasts, and exception alerts. The model is the product. Their first ten customers ran fine on a single shared model and a shared database. Then they signed a national parcel carrier with a contractual requirement that its data never co-mingle with a competitor’s, a clause requiring the model serving its predictions be trained only on its own history, and a security questionnaire 340 lines long. The next prospect — a pharmaceutical cold-chain operator — wanted a region locked to the EU and a signed DPA. The shared-everything design that got them to ten customers is exactly the design that loses them the eleventh.

This is the inflection every ML SaaS hits: the moment a customer’s data becomes valuable and regulated enough that isolation stops being a feature and becomes the contract. This article is a reference architecture for that platform — multi-tenant where it is safe and cheap to share, single-tenant where the contract or the model quality demands it, with the metering, identity, and noisy-neighbor controls that let one engineering team operate hundreds of tenants without forking the codebase per account.

The business scenario

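The driver is a sales motion that has moved upmarket faster than the platform. A startup ML product is born shared-everything: one model, one database with a tenant_id column, one Kubernetes namespace. That is correct for product-market fit: it is cheap, simple, and every customer benefits from a model trained on pooled data. The problems arrive with the logos: contract clauses that forbid co-mingling data with a competitor’s, requirements that a tenant’s model be trained only on its own history, long security questionnaires, and region-locked residency with a signed DPA.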

You cannot solve these by giving every customer a private copy of the entire stack — that is not a SaaS, it is a hundred bespoke deployments and a bankruptcy. Nor can you stay shared-everything and lose every regulated deal. The architecture that wins is tiered multi-tenancy: a shared pool tier for price-sensitive customers who are fine with a pooled model and logical isolation, a bridge tier with a per-tenant model on shared infrastructure, and a silo tier with dedicated, optionally in-region, infrastructure for the customers whose contracts demand it — all behind one control plane, one identity layer, and one billing system, so the company operates a product and not a portfolio of snowflakes.

Architecture overview

The platform splits cleanly into a control plane (tenant lifecycle, identity, routing, metering, model registry — one global brain) and data planes (the actual per-tenant or pooled inference and training, which can live in different accounts, regions, or clouds). The control plane never holds tenant payload data; it holds tenant metadata and decides where requests go.

[Figure: Multi-Tenant ML SaaS: Isolation and Per-Tenant Models, architecture diagram]

Request path. (1) A tenant’s call enters through Akamai at the edge for global anycast, TLS termination, WAF, and per-tenant rate limiting at L7 — the first noisy-neighbor dam, dropping a tenant’s flood before it consumes a single core of yours. (2) It reaches the API gateway / tenant router, which authenticates the caller. Identity is federated: each customer is an Okta org (or their own IdP brokered through Okta), so the JWT carries a verified org_id that is the tenant identity — the platform never invents tenant IDs, it trusts the one the identity layer asserts. (3) The router reads the tenant registry (which tier, which region, which model version, which data-plane endpoint this org_id maps to) and forwards the request to the correct data plane. (4) In the data plane, an inference service loads the right model — the pooled model, or this tenant’s dedicated model pulled from the model registry — and serves the prediction, reading any required features from that tenant’s isolated feature store / database. (5) Every call emits a metered usage event (tenant, model, tokens or rows or prediction-units, latency) onto a stream, which feeds both billing and the noisy-neighbor governor. The prediction returns to the caller.
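Step (3) of the request path can be sketched as a small lookup. This is a minimal illustration, assuming an in-memory registry; the names TenantRoute, TENANT_REGISTRY, and resolve_route, the endpoint URLs, and the "pooled-current" version label are hypothetical and stand in for whatever the real tenant registry exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantRoute:
    tier: str           # "pool", "bridge", or "silo"
    region: str
    model_version: str
    endpoint: str       # data-plane endpoint for this tenant


# Hypothetical in-memory tenant registry keyed by the verified org_id.
TENANT_REGISTRY = {
    "org_8f2a": TenantRoute("bridge", "eu-west-1", "2026-06-03.4",
                            "https://bridge.eu-west-1.example.internal"),
    "org_3c10": TenantRoute("pool", "us-east-1", "pooled-current",
                            "https://pool.us-east-1.example.internal"),
}


class UnknownTenant(Exception):
    """Raised when a verified org_id has no registry entry."""


def resolve_route(claims: dict) -> TenantRoute:
    # The org_id is read only from already-verified JWT claims,
    # never from the request body or a client-supplied header.
    org_id = claims.get("org_id")
    if org_id is None or org_id not in TENANT_REGISTRY:
        raise UnknownTenant(org_id)
    return TENANT_REGISTRY[org_id]
```

Failing closed on an unknown org_id is the point of the sketch: a request that cannot be mapped to a tenant is rejected rather than routed to a default data plane.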

Training / model path runs on its own schedule. Each tenant’s raw data lands in its isolated storage; a training pipeline (orchestrated per tenant) produces a model, which is versioned in the model registry and promoted through evaluation gates before any inference service is allowed to load it. Pooled-tier tenants share one training run over pooled (and consented) data; bridge and silo tenants get their own runs over only their own data — the registry is what makes “this tenant, this model version” an addressable, auditable fact.

The defining property: the tenant identity is established once, at the edge, by the identity layer, and then carried as a signed claim through every layer — routing, data access, metering, and the model load itself. Isolation is not a filter applied late; it is a routing decision made early from a verified claim, and re-enforced at every boundary.

The isolation tiers

The heart of the design is choosing, per tenant, how much to isolate — because isolation trades directly against cost and operational simplicity. The three tiers map to three points on that curve.

| Tier | Compute | Data | Model | Who it is for | Marginal cost / tenant |
|---|---|---|---|---|---|
| Pool | Shared inference pool, shared namespace | Shared store, tenant_id + RLS | One pooled model for all | Price-sensitive, no isolation clause, benefits from pooled model | Near zero |
| Bridge | Shared compute, per-tenant model artifact | Per-tenant schema or database | Dedicated model, trained on tenant data | Wants model quality / data isolation, fine with shared infra | Low (a model artifact + a schema) |
| Silo | Dedicated cluster / account / region | Dedicated account, dedicated keys | Dedicated model, dedicated everything | Regulated, residency, contractual hard isolation | High (a full stack) |

Pool tier is classic shared-everything done carefully. Isolation is logical: a tenant_id on every row, PostgreSQL Row-Level Security so a query physically cannot return another tenant’s rows even with a bug in the app, and a tenant_id partition key on every stream and cache key. The model is the global pooled model — its whole value proposition is that it learned from everyone. This tier is your margin engine; most small customers live here and cost almost nothing to add.
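The Row-Level Security setup described above can be sketched as a helper that emits the PostgreSQL statements. This is an illustration under assumptions: the helper names, the policy name tenant_isolation, and the app.tenant_id setting are hypothetical, tenant_id is assumed to be a text column, and the table name is assumed to be a trusted identifier (it is interpolated, not parameterized).

```python
def rls_statements(table: str, setting: str = "app.tenant_id") -> list:
    """Return PostgreSQL DDL that restricts `table` to the current tenant."""
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        # FORCE applies the policy to the table owner as well.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY;",
        f"CREATE POLICY tenant_isolation ON {table} "
        f"USING (tenant_id = current_setting('{setting}'));",
    ]


def tenant_scope_statement(setting: str = "app.tenant_id") -> str:
    """Parameterized statement that pins the tenant for one transaction."""
    # The third argument (true) makes the setting transaction-local.
    return f"SELECT set_config('{setting}', %s, true);"
```

The application would run tenant_scope_statement() at the start of each transaction with the verified org_id as the parameter. Note that PostgreSQL superusers and roles with BYPASSRLS are not subject to policies, so the application role should have neither.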

Bridge tier is the workhorse for ML SaaS specifically, because the per-tenant model is often the actual reason a customer pays more. Infrastructure stays shared — the same inference cluster, the same Kubernetes — but the model artifact is per-tenant, pulled from the registry by org_id at load time, and the data is isolated to a per-tenant schema or database. A tenant gets a model trained only on its own shipment history (better predictions for its lanes) and a data boundary that survives an application bug, without you standing up a cluster per customer. The trick that makes this affordable is multi-model serving: a serving runtime (Triton, KServe, BentoML, or a sidecar pattern) that holds many tenants’ models in one fleet and loads/evicts them like a cache, so 200 tenant models do not need 200 servers.

Silo tier is for the contracts. A silo tenant gets a dedicated cluster — often a dedicated cloud account or subscription, sometimes a dedicated region — provisioned by the same Terraform that builds everything else, just with the tenant as a parameter. Its data, its encryption keys (its own KMS/Key Vault key, so you can prove cryptographic separation in an audit), its model, its blast radius. This is what you sell the regulated parcel carrier and the EU pharma. It is expensive per tenant, which is fine — you price it accordingly, and the contract justifies it.

The architectural win is that a tenant moves between tiers without re-platforming. A pool customer that signs an isolation addendum gets promoted to bridge by a control-plane flag plus a training run; a bridge customer that hits a residency requirement gets promoted to silo by Terraform spinning up an in-region stack and the registry re-pointing its route. One codebase, three isolation postures, movement between them as a config change — that is the difference between a SaaS and a consulting shop.

Per-tenant models and the registry

The model registry is the linchpin of an ML multi-tenant platform — it is what a generic SaaS multi-tenancy guide leaves out. It answers, authoritatively: for this org_id, what is the current production model version, where is its artifact, what data was it trained on, and what did it score on evaluation?

A registry entry is essentially:

tenant: org_8f2a               # the Okta org id, = the tenant identity
model_name: eta-forecaster
version: 2026-06-03.4
artifact_uri: s3://tenant-org_8f2a-models/eta/2026-06-03.4/
trained_on: org_8f2a-only      # provenance: NEVER pooled for this tier
eval:
  mape: 0.082                  # beat the 0.11 gate
  baseline_global_mape: 0.119
status: production             # promoted past the gate
kms_key: arn:aws:kms:eu-west-1:...:key/tenant-org_8f2a

Two properties matter enormously. Provenance (trained_on) is a contractual artifact: when the parcel carrier’s auditor asks “prove this model never saw a competitor’s data,” the registry is the answer, and the training pipeline enforces it by reading only that tenant’s storage. Promotion gates keep per-tenant models honest — a per-tenant model that does worse than the global baseline should not ship, so promotion is automated only when eval.mape beats both an absolute threshold and the global model. This is where per-tenant models go wrong in practice: a tenant with thin data gets a worse model than the pooled one. The gate catches it, and the fallback is to serve the pooled model and flag the tenant as “insufficient data for a dedicated model” — an honest, cheaper answer than a bad bespoke model.
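The promotion gate reduces to a two-condition check. The sketch below assumes the 0.11 absolute threshold from the registry example; EvalResult, promotion_decision, and the "fallback_pooled" status label are hypothetical names, not part of any registry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    mape: float                  # candidate per-tenant model error
    baseline_global_mape: float  # pooled model error on the same tenant data


def promotion_decision(result: EvalResult, absolute_gate: float = 0.11) -> str:
    """Return "production" or "fallback_pooled" for a candidate model."""
    beats_gate = result.mape < absolute_gate
    beats_global = result.mape < result.baseline_global_mape
    if beats_gate and beats_global:
        return "production"
    # Thin-data tenants land here: keep serving the pooled model.
    return "fallback_pooled"
```

With the registry entry above (mape 0.082 against a global baseline of 0.119), the candidate passes both conditions. A candidate that beats the absolute threshold but not the pooled model still falls back.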

Serving these efficiently is the multi-model problem. You do not run one container per tenant model; you run a fleet where each node can serve many models, loaded on demand:

# Sidecar/router resolves tenant -> model, serving runtime caches artifacts
def handle(request, claims):
    tenant = claims["org_id"]                  # from the verified Okta JWT
    spec = registry.production_model(tenant)   # cached; which model + uri
    model = model_cache.get_or_load(spec.uri)  # LRU across tenants on this node
    feats = feature_store.read(tenant, request.entity_ids)  # tenant-scoped
    pred  = model.predict(feats)
    meter.emit(tenant, spec.model_name, units=len(request.entity_ids))
    return pred

The cache is the cost lever: hot tenants stay resident, cold tenants pay a load latency on first hit. Pin latency-sensitive tenants (silo, premium SLAs) to dedicated replicas; let the long tail share an over-committed pool. This is how one fleet serves hundreds of distinct models at a sane cost — the inverse of the naive “a deployment per customer” that does not scale past a few dozen.
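A minimal sketch of such a cache, assuming a least-recently-used eviction policy and a caller-supplied loader function. ModelCache is a hypothetical class; real serving runtimes have their own model-management mechanisms, and this only illustrates the load/evict behavior.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """LRU cache of loaded models, shared by all tenants on one node."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]):
        self.capacity = capacity
        self.loader = loader      # fetches and deserializes one artifact
        self._models = OrderedDict()
        self.misses = 0

    def get_or_load(self, uri: str) -> Any:
        if uri in self._models:
            self._models.move_to_end(uri)     # mark as most recently used
            return self._models[uri]
        self.misses += 1                      # cold tenant pays the load latency
        model = self.loader(uri)
        self._models[uri] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model
```

The misses counter is the number worth exporting per tenant: a rising miss rate means the cache is churning and the fleet needs more capacity or fewer tenants per node.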

Noisy-neighbor control

Shared infrastructure means one tenant can starve the rest, and in ML the failure modes are nastier than in CRUD apps because inference is CPU/GPU-heavy and training is enormous. Control happens at four layers, defense in depth:

| Layer | Mechanism | Stops |
|---|---|---|
| Edge | Akamai per-tenant rate limit + burst | Volumetric floods, scrapers, runaway client loops |
| Gateway | Per-org_id quota + concurrency cap | A tenant exceeding its plan’s QPS/concurrency |
| Compute | K8s ResourceQuota, namespace limits, priority classes | One tenant’s pods consuming the cluster |
| Data | Per-tenant connection pool + statement timeout | A backfill exhausting DB connections or locking tables |

The edge and gateway layers cap a tenant before it reaches expensive compute — a 50-million-row replay gets throttled to its provisioned rate at Akamai, so it drains over hours instead of crushing the inference pool in minutes. The compute layer uses Kubernetes priority classes so a silo or premium tenant’s pods preempt best-effort batch work, and ResourceQuota per namespace so a tenant’s training job cannot eat the whole node pool. The data layer is the one teams forget: a shared Postgres has a finite connection count, and one tenant opening 200 connections for a bulk import locks everyone out — so each tenant gets a bounded pool (PgBouncer per-tenant pools) and a statement timeout that kills a runaway query before it holds locks for minutes.
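The per-tenant throttling at the edge and gateway is commonly implemented as a token bucket. The following is a sketch of that general technique, not of Akamai’s or any gateway’s actual implementation; TokenBucket, TenantLimiter, and the plans mapping are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows `rate` requests per second with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per org_id, created lazily from the tenant's plan."""

    def __init__(self, plans: dict):
        self.plans = plans    # org_id -> (rate, burst); hypothetical shape
        self.buckets = {}

    def allow(self, org_id: str, now: float) -> bool:
        if org_id not in self.buckets:
            rate, burst = self.plans[org_id]
            self.buckets[org_id] = TokenBucket(rate, burst, now)
        return self.buckets[org_id].allow(now)
```

Because each org_id has its own bucket, a tenant that exhausts its burst is rejected while other tenants are unaffected, which is the property the edge and gateway rows in the table depend on.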

Separate the batch plane from the interactive plane entirely. Training jobs, backfills, and large exports run on a different node pool (often spot/preemptible) than online inference, so the heaviest tenant workload — a full retrain — physically cannot touch the latency of live predictions. The two planes share the model registry and storage but never the serving CPUs. This single decision eliminates the most common noisy-neighbor incident in ML SaaS.

Observability makes the governor possible. Every metered event carries tenant_id, so Datadog (or Dynatrace) dashboards and monitors are sliced per tenant: p95 inference latency by tenant, model-load misses by tenant, DB connection saturation by tenant, GPU utilization by tenant. The alert that matters is per-tenant SLO breach and tenant-attributable resource spikes — “tenant org_8f2a’s p99 crossed 400 ms” and “tenant org_3c10 is consuming 60% of the inference pool” — so on-call knows who to throttle, not just that something is slow. Without per-tenant telemetry, noisy-neighbor control is blind.
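The “who is consuming the pool” alert can be derived directly from per-tenant usage events. The sketch below assumes events arrive as (tenant_id, units) pairs for one time window and uses the 60% figure from the example above as the default threshold; dominant_tenants is a hypothetical helper, not a Datadog or Dynatrace feature.

```python
from collections import Counter


def dominant_tenants(events, threshold: float = 0.6) -> dict:
    """Return {tenant_id: share} for tenants at or above `threshold`.

    `events` is an iterable of (tenant_id, units) pairs from one window.
    """
    usage = Counter()
    for tenant_id, units in events:
        usage[tenant_id] += units
    total = sum(usage.values())
    if total == 0:
        return {}
    shares = {tenant: units / total for tenant, units in usage.items()}
    return {tenant: share for tenant, share in shares.items()
            if share >= threshold}
```

An empty result means no single tenant dominates the window, so a slowdown is more likely a capacity problem than a noisy neighbor.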

Metering and billing

In an ML SaaS the unit of value is not “a user” — it is predictions, tokens, rows scored, or model-training-hours. Metering is therefore a first-class data path, not an afterthought, and it has to be accurate enough to bill on and reconcile.

Every request emits an event the instant the work completes — (tenant_id, metric, quantity, model, region, timestamp) — onto a durable stream (Kafka / Kinesis / Event Hubs). A metering aggregator rolls these into per-tenant usage records:

{ "tenant": "org_8f2a", "period": "2026-06",
  "predictions": 4120338, "training_hours": 6.5,
  "tier": "bridge", "region": "eu-west-1",
  "overage_predictions": 120338 }

Three principles keep this trustworthy. Meter at the source, bill from the aggregate — emit at the serving layer where the work is unambiguous, aggregate idempotently (dedupe on an event id so a retried request is not double-counted), and reconcile the aggregate against raw events daily; a billing system that drifts from reality loses customer trust instantly. Tie metering to the same org_id the identity layer asserts, so the bill and the access control agree on who the tenant is — no mapping table to drift. Make the meter the noisy-neighbor signal too — the same stream that bills a tenant tells the governor it is spiking, so usage and enforcement come from one source of truth. Feed the aggregates to billing (Stripe, or an enterprise system like Zuora) for invoicing, and surface them in ServiceNow when a tenant’s usage crosses a plan threshold and triggers an upsell or an approval workflow for a tier change.
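Idempotent aggregation with dedupe on an event id can be sketched as follows. UsageRecord and its fields are hypothetical; the 4,000,000 included-predictions allowance is an assumption chosen to be consistent with the usage record above, and the in-memory set stands in for whatever durable dedupe store a production aggregator would use.

```python
from dataclasses import dataclass, field


@dataclass
class UsageRecord:
    tenant: str
    period: str
    included_predictions: int     # plan allowance; hypothetical field
    predictions: int = 0
    seen_event_ids: set = field(default_factory=set)

    def apply(self, event_id: str, quantity: int) -> bool:
        """Add one usage event; return False if it was a duplicate."""
        if event_id in self.seen_event_ids:
            return False          # retried request, already counted
        self.seen_event_ids.add(event_id)
        self.predictions += quantity
        return True

    @property
    def overage_predictions(self) -> int:
        return max(0, self.predictions - self.included_predictions)
```

Replaying the same event id leaves the totals unchanged, which is what makes it safe to re-run the aggregator over the raw stream during the daily reconciliation.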

Silo tenants get a twist: because their stack is dedicated, you can bill them on infrastructure cost + margin rather than per-prediction, and cloud cost-allocation tags (tenant=org_8f2a on every resource) let you attribute the actual spend. Tagging every resource with the tenant id is non-negotiable — it is how you know a tenant is unprofitable before the renewal, not after.

Enterprise considerations

Security & isolation enforcement. Isolation is only as strong as its weakest re-check, so enforce it at every layer, not just routing. (a) The org_id claim is verified at the gateway (validate-jwt against the Okta org’s JWKS) and then propagated and re-checked at the data layer: RLS in Postgres uses the same tenant id, so even a compromised app server cannot read across tenants. (b) HashiCorp Vault issues per-tenant dynamic credentials: a serving pod for tenant org_8f2a gets a short-lived DB credential scoped to that tenant’s schema and a path to that tenant’s KMS key, so a leaked credential is both ephemeral and tenant-bounded. (c) Silo tenants get their own KMS/Key Vault key (envelope encryption), giving cryptographic separation you can demonstrate in an audit and a kill switch: destroy the key, the data is gone. (d) Wiz runs continuously as CSPM and data-security posture management, catching the multi-tenant-specific disasters: a storage bucket for one tenant made public, an IAM role that can assume across tenant accounts, a security group bridging two silos, a secret in a container image. (e) CrowdStrike Falcon provides runtime protection on the nodes, detecting a compromised inference pod attempting lateral movement toward another tenant’s data plane, the attack that turns a single-tenant breach into a multi-tenant headline. The cardinal rule: a vulnerability in shared infrastructure is a vulnerability for every tenant on it, which is precisely why regulated customers pay for the silo tier and why CSPM/runtime coverage is not optional.

Identity at the org boundary. Federating through Okta orgs (with Entra ID as the brokered IdP for Microsoft-shop customers) means each customer administers its own users, SSO, and SCIM provisioning, and the platform consumes a verified org_id rather than running its own user directory per tenant. SCIM deprovisioning matters in multi-tenancy: when a customer offboards an employee, that user can no longer obtain new tokens in any tier, and tokens already issued stop working once they expire or are revoked, so keep access-token lifetimes short. The platform’s own operators authenticate separately and their access to a tenant’s data plane is itself brokered through Vault with full audit; break-glass access to a silo tenant should page and log, not be a standing privilege.

Cost optimization. Tiering is the cost strategy: keep the long tail of small customers in the pool tier where marginal cost is near zero, reserve dedicated infrastructure for tenants whose contracts pay for it. (1) Multi-model serving with an LRU cache so hundreds of bridge-tier models share a fleet rather than one-server-per-model. (2) Spot/preemptible nodes for the batch plane (training, backfills) — interruptible work belongs on interruptible compute, often 60–70% cheaper. (3) Right-size silo tenants — a dedicated stack idling at 5% utilization is pure loss; scale-to-zero or scheduled scale-down for silo tenants with predictable hours. (4) Per-tenant cost attribution via tags so you can see which tenants are unprofitable and reprice at renewal. (5) Bill overage at the source meter so heavy tenants pay for the load they impose — metering is a cost control, not just revenue.

Scalability. Each plane scales independently. The interactive inference fleet scales on concurrency/latency (HPA on a custom queue-depth metric, not just CPU); the batch plane scales on job queue depth; the control plane is small and scales trivially because it carries metadata, not payload. Per-tenant scaling is the subtle part: a single huge tenant can outgrow a shared pool and should be promoted to silo (its own cluster) rather than allowed to dominate a shared one — promotion is a scaling tool, not only an isolation tool. The natural ceiling on the bridge tier is how many models fit in the serving cache before churn hurts latency; past that, shard tenants across serving fleets by org_id.
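Sharding tenants across serving fleets by org_id can be as simple as a stable hash. This is a sketch of one option, assuming a fixed list of fleets; fleet_for and the fleet names are hypothetical.

```python
import hashlib


def fleet_for(org_id: str, fleets: list) -> str:
    """Map a tenant to one serving fleet with a stable hash of its org_id."""
    # hashlib gives the same digest in every process, unlike built-in hash().
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(fleets)
    return fleets[index]
```

Plain modulo hashing reassigns many tenants when the number of fleets changes, which would cold-start their models. Recording the assignment in the tenant registry, or using consistent hashing, avoids that reshuffle.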

Reliability & DR (RTO/RPO). Blast radius is the multi-tenant reliability story. A pool-tier incident hits everyone, so the pool gets the most engineering rigor and the tightest SLO. A bridge incident can affect every tenant on the shared infrastructure, but each tenant’s data stays contained in its isolated schema. Silo incidents are, by design, one tenant. Set RTO/RPO per tier in the contract: silo tenants often buy a tighter RPO (their own cross-region replica) than pool tenants get. The model registry and tenant registry are the crown jewels: back them up and replicate them cross-region, because losing the mapping of org_id → model → endpoint is losing the whole platform’s ability to route correctly. A pragmatic target: control plane RTO 15 min / RPO near zero (it is small and replicated), pool data plane RTO 30 min, silo data planes per their individual SLAs.

Operations & governance. Terraform provisions every tenant — pool tenants are a database schema and a registry row, silo tenants are a full stack module parameterized by tenant id — so onboarding is terraform apply, not a runbook, and every tenant is reproducible and auditable. GitHub Actions (or Jenkins) runs the pipeline that promotes a tenant model: train, evaluate against the gate, and only on a pass does it update the registry to production — model promotion is CI/CD with a quality gate, identical in spirit to deploying code. Tier changes and silo provisioning route through ServiceNow for approval, because standing up a dedicated stack has a cost the account owner should sign off on. Pin model versions explicitly per tenant in the registry so a tenant’s predictions never drift under a silent retrain; promote new versions per tenant through the same gate, and keep the old version warm for instant rollback if a tenant’s metrics regress.
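Explicit version pinning with a warm previous version can be sketched as a small piece of registry state. TenantModelPins is a hypothetical class that keeps one level of rollback history per tenant; a real registry would persist this and record who promoted what.

```python
class TenantModelPins:
    """Tracks the pinned production version and one rollback target per tenant."""

    def __init__(self):
        self._current = {}
        self._previous = {}

    def promote(self, org_id: str, version: str) -> None:
        # Called only after the evaluation gate has passed.
        if org_id in self._current:
            self._previous[org_id] = self._current[org_id]
        self._current[org_id] = version

    def rollback(self, org_id: str) -> str:
        if org_id not in self._previous:
            raise LookupError(f"no previous version pinned for {org_id}")
        self._current[org_id] = self._previous.pop(org_id)
        return self._current[org_id]

    def production_version(self, org_id: str) -> str:
        return self._current[org_id]
```

Because a version changes only through promote or rollback, a retrain that has not passed the gate cannot alter what a tenant is served.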

Reference example

Polaris Routing, a fictional logistics-analytics SaaS (~70 employees, ~140 customers), runs exactly this platform on AWS with an EU region for residency customers. The bulk of its customers — ~115 small and mid-size carriers — live in the pool tier: one shared eta-forecaster model, a shared Aurora PostgreSQL with RLS on tenant_id, served from one KServe fleet. They cost almost nothing each and subsidize the platform.

The decisions. ~20 mid-market customers sit in the bridge tier with per-tenant models trained on their own shipment history, served from a multi-model KServe fleet that holds ~30 models hot in an LRU cache across 6 GPU nodes; their data lives in per-tenant Aurora schemas. Five enterprise customers are silo tenants (the national parcel carrier and the EU pharma among them), each in a dedicated AWS account with its own KMS key and its own EKS cluster, all built by one parameterized Terraform module; the two EU tenants run in eu-west-1. Identity is federated through Okta orgs (the pharma brokers its own Entra ID); every request carries a verified org_id that drives routing, RLS, Vault credential scoping, and metering alike. Vault issues short-lived per-tenant DB credentials to serving pods. Wiz and CrowdStrike Falcon cover posture and runtime across all accounts; Datadog dashboards are sliced per tenant for SLO and noisy-neighbor alerts. Training and backfills run on a separate spot-node batch plane that cannot touch interactive latency. Model promotion runs in GitHub Actions with a MAPE gate against the global baseline; tier upgrades route through ServiceNow.

The numbers. ~210 million predictions/month across all tenants; pool-tier marginal cost per customer is effectively rounding error. Monthly platform run cost landed near ₹46 lakh (~$55,000): the five silo stacks ~$22,000 (dedicated EKS + Aurora + replication), the bridge GPU fleet ~$11,000, the shared pool fleet + Aurora ~$8,000, batch/spot training ~$3,500, edge (Akamai) + observability (Datadog) + security (Wiz/Falcon) the remainder. Two per-tenant models failed their promotion gate (thin data, worse than the global model) and correctly fell back to the pooled model. A noisy-neighbor incident — a bridge tenant’s 40-million-row replay — was absorbed at the Akamai rate limit and the spot batch plane, with zero impact on interactive p95, which the per-tenant Datadog monitor confirmed in real time.
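As a quick check on the figures above, the unnamed remainder and a blended unit cost follow from simple arithmetic. All inputs are the approximate numbers quoted in this example; the variable names are illustrative only.

```python
total_monthly_usd = 55_000
named_costs_usd = {
    "silo_stacks": 22_000,
    "bridge_gpu_fleet": 11_000,
    "pool_fleet_and_aurora": 8_000,
    "batch_spot_training": 3_500,
}

# Edge + observability + security make up whatever is left.
remainder_usd = total_monthly_usd - sum(named_costs_usd.values())

# Share of the bill carried by the five silo tenants.
silo_share = named_costs_usd["silo_stacks"] / total_monthly_usd

# Blended cost per 1,000 predictions at ~210 million predictions/month.
cost_per_1k_predictions = total_monthly_usd / 210_000_000 * 1_000
```

The remainder works out to about $10,500, the silo stacks to roughly 40% of the total, and the blended cost to about $0.26 per thousand predictions, which is why silo pricing has to be set per stack and not per prediction.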

The outcome. The parcel-carrier deal closed on the strength of the silo tier and the registry’s provenance proof that its model trained only on its own data; the EU pharma closed on in-region silo plus per-tenant KMS keys. Per-tenant models lifted ETA accuracy ~14% for the bridge tenants that had distinctive lanes — the upsell that justified the tier. And because one Terraform module and one control plane run all three tiers, the 70-person company operates 140 tenants without a dedicated SRE per customer.

When to use it

Use this architecture when you sell an ML product to businesses, your customers’ data is sensitive or regulated enough that isolation becomes contractual, per-tenant model quality is a real differentiator, and you need to serve many tenants from one engineering team without forking the platform. That is most B2B ML SaaS the moment it moves upmarket — forecasting, scoring, document intelligence, recommendation, or any model that improves when trained on a specific customer’s data.

Trade-offs to accept. Tiered multi-tenancy is more complex than shared-everything — a control plane, a model registry, multi-model serving, per-tenant metering, and the discipline to keep them coherent. Per-tenant models multiply your training and evaluation surface; you now run (and must monitor) many models, not one. Silo tenants are genuinely expensive and must be priced to cover dedicated infrastructure. And the isolation guarantees are only as strong as your weakest re-check — a single place that trusts an unverified tenant id, or one IAM role that spans accounts, undoes the whole story.

Anti-patterns. (1) Inventing tenant ids in the app instead of trusting the verified org_id claim — the boundary must come from identity, re-checked at every layer. (2) A deployment per customer from day one — does not scale past a few dozen and turns a SaaS into bespoke ops; share in the pool/bridge tiers and silo only what the contract demands. (3) Filtering tenants only in app code — without RLS, one query bug leaks across tenants; enforce isolation in the data layer too. (4) Sharing the serving plane between training and inference — a retrain crushes live latency; split batch from interactive. (5) Per-tenant models with no promotion gate — thin-data tenants get worse models than the pool; gate on beating the global baseline and fall back. (6) Metering as an afterthought — if the bill drifts from reality you lose trust; meter at the source and reconcile.

Alternatives, and when they win. If your customers are not regulated and your model genuinely benefits from pooled data, stay shared-everything (pool tier only) — it is cheaper and simpler, and the per-tenant machinery is premature. If you have only a handful of large, deep-pocketed enterprise customers and no long tail, single-tenant-per-deployment (every customer a silo) can be acceptable — fewer isolation subtleties, at the cost of per-deployment ops. If your differentiation is a foundation model you fine-tune lightly per tenant rather than full per-tenant training, a shared base model with per-tenant adapters / LoRA collapses the bridge tier’s cost dramatically — one base model, tiny per-tenant deltas, loaded like the multi-model cache here. And if you are pre-product-market-fit, do none of this yet: ship shared-everything, win customers, and build the control plane the week a regulated logo asks for an isolation clause — which, if your product works, it will.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
