
Enterprise GenAI Gateway: Governing LLM Access Across Providers

A global pharmaceutical company’s platform team gets handed a problem that looks like a win and behaves like a fire. In eighteen months, “let’s try the LLM thing” turned into forty-one separate teams calling models directly — medical-affairs writing literature summaries against Azure OpenAI, the clinical data-science group on Amazon Bedrock because their pipeline already lived in AWS, a newly acquired biotech subsidiary defaulting to Google Vertex AI, and a long tail of teams pasting protocol text and adverse-event narratives into whatever endpoint had a key. The CISO finds out the way CISOs always do: a security researcher demonstrates that a prompt sent to a third-party model included a patient identifier and an unblinded trial arm — the kind of disclosure that, under GxP and HIPAA, is not a bug ticket but a reportable event. The CFO finds out separately, when the consolidated cloud bill shows a quarter-million dollars of “AI services” no one can attribute to a team, a project, or a justification.

The mandate that comes down is precise and uncomfortable: every LLM call in the company — across all three clouds, every team, every model — must go through one governed path that redacts sensitive data before it leaves the building, enforces per-team budgets, applies the same safety policy everywhere, and produces a single audit trail. And it cannot lock the company into one model provider, because in a field where the best model for protein-structure reasoning and the best model for regulatory-document drafting may come from different vendors and change every quarter, betting the platform on a single API is its own risk. This article is the reference architecture for that control plane: an enterprise GenAI gateway that governs LLM access across providers.

Why a gateway, and why not the obvious alternatives

The instinct in most shops is to skip the gateway, and three shortcuts get proposed in the first meeting. Naming why each fails saves a quarter of rework.

“Just standardize on one provider.” Tempting, and wrong for an enterprise of any size. Model leadership rotates between Azure OpenAI, Bedrock, and Vertex on a timescale of months; a frontier reasoning model lands on one cloud first, a cheaper-per-token workhorse on another, a long-context model on a third. Standardizing on one vendor means either using a worse model for half your workloads or migrating the whole company every time the frontier moves. Worse, it does nothing for governance — a single-provider sprawl of forty teams with forty API keys is exactly as ungoverned as a multi-provider one.

“Put governance in a shared SDK.” A library every team imports, which redacts and meters before calling the model. This fails the moment one team pins an old version, one team writes in a language the SDK does not support, or one team simply calls the provider’s REST endpoint directly to ship faster. Governance you can bypass by not importing a package is not governance; it is a suggestion.

“Let each cloud’s native tools handle it.” Azure has APIM policies, AWS has Bedrock guardrails, Google has Vertex safety filters. Real, but per-cloud — three different policy languages, three different audit formats, three places to update when the redaction rules change, and no single budget that spans them. The whole point of the mandate is one path, one policy, one ledger.

The gateway threads the needle. It is a single network-enforced choke point that every team must traverse to reach any model on any cloud, where redaction, safety, identity, budgeting, routing, and audit are applied once and consistently — while the model provider behind it stays swappable. The enforcement is the architecture: you make the gateway the only egress path to model providers, so bypassing it is not a policy violation an engineer can rationalize but a network route that does not exist.

Architecture overview

Figure: Enterprise GenAI Gateway architecture (governing LLM access across providers).

The gateway is best understood as a request pipeline with a control plane bolted alongside it. The request pipeline is the hot path every prompt travels; the control plane is the slower-moving machinery — config, budgets, secrets, policy — that the hot path reads from. Keeping the two separate in your head is the first step to operating this without melting it down.

The defining property of the whole topology is the one the CISO signs for: model providers are reachable from exactly one place. Egress firewall rules and private networking are configured so that Azure OpenAI, Bedrock, and Vertex endpoints accept traffic only from the gateway’s egress addresses; every other subnet in every cloud is denied. A team that tries to call a provider directly does not get a 403 they can argue about — the packet has nowhere to go.

Request path, following the control flow:

  1. A caller — an application, a notebook, an internal “chat with our SOPs” portal — sends an OpenAI-compatible request to the gateway’s single endpoint. It does not name a cloud; it names a logical model like reasoning-large or summarize-cheap. This indirection is what keeps providers swappable: the caller never hardcodes bedrock or vertex.
  2. Traffic enters through Akamai at the edge — TLS termination, global anycast so a Basel team and a Boston team hit a near point of presence, and a WAF tuned with custom rules for prompt-flood and token-exhaustion patterns before anything reaches the gateway origin.
  3. The request hits the gateway core (the LLM proxy itself, running on a private Kubernetes cluster). The first thing it does is authenticate: it validates the caller’s token against Okta as the workforce and service identity provider, where every team and service has been provisioned with scopes that say which logical models and which data-sensitivity tier they may use. A medical-affairs app scoped to reasoning-large on confidential data is allowed; the same app reaching for a restricted/PHI tier is rejected at the door. For Azure-resident workloads, Okta federates to Microsoft Entra ID so the call carries a first-class Entra token where Azure RBAC expects one.
  4. The gateway checks the per-team token budget before doing any work. A budgeting service (backed by Redis for the hot counters, a database for the ledger) holds each team’s monthly prompt- and completion-token allowance; if the team is over budget or over its per-minute rate, the request is throttled or rejected now, before a single expensive token is spent.
  5. The prompt passes through the input guardrail stage: PII/PHI redaction and prompt-safety screening. A redaction engine (Microsoft Presidio for named-entity detection, plus domain dictionaries for trial IDs, patient identifiers, and compound codes) strips or tokenizes sensitive spans before the text leaves the company’s network. In parallel, a prompt-safety check screens for jailbreaks and indirect injection. This is the stage that exists because of the patient-identifier incident.
  6. The router maps the logical model to a concrete provider deployment — reasoning-large to Azure OpenAI gpt-4o today, with Bedrock Claude or Vertex Gemini configured as fallbacks — and pulls the provider credential it needs from HashiCorp Vault at call time. The gateway holds no long-lived provider keys in config or environment variables; Vault issues short-lived, leased credentials, and rotates them out from under the running pods.
  7. The gateway calls the chosen provider over a private egress path — Azure Private Endpoint / Private Link for Azure OpenAI, AWS PrivateLink for Bedrock, Private Service Connect for Vertex — so the prompt and completion never traverse the public internet even on the leg between the gateway and the model.
  8. The response passes back through the output guardrail stage: content-safety filtering on the model’s output, optional de-tokenization to restore redacted spans for trusted callers, and a groundedness/policy check. The governed completion streams back to the caller.
  9. Every hop — identity, tokens consumed, logical and concrete model, redaction hits, guardrail verdicts, latency, cost — is emitted as structured telemetry to Datadog, which is the single ledger the CFO, the CISO, and the platform team all read from.
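
The request path above can be read as an ordered chain of stages, each of which may reject the request before any later stage runs. The sketch below illustrates only that short-circuit ordering. The stage functions, the dict-shaped request, and the PATIENT-001 placeholder are all hypothetical, and the real gateway has more stages than the four shown.

```python
from typing import Callable

class Rejected(Exception):
    """Raised by a stage to stop the pipeline before later stages run."""

# Hypothetical stages; each takes the request dict and returns an updated copy.
def authenticate(req: dict) -> dict:
    if not req.get("token"):
        raise Rejected("unauthenticated")
    return req

def check_budget(req: dict) -> dict:
    if req.get("tokens_remaining", 0) <= 0:
        raise Rejected("over budget")
    return req

def redact(req: dict) -> dict:
    # Stand-in for the redaction engine: swaps one made-up identifier.
    return {**req, "prompt": req["prompt"].replace("PATIENT-001", "<PATIENT_ID>")}

def route(req: dict) -> dict:
    # Stand-in for the router: always picks the same provider.
    return {**req, "provider": "azure-openai"}

PIPELINE: list[Callable[[dict], dict]] = [authenticate, check_budget, redact, route]

def handle(req: dict) -> dict:
    """Run stages in order and record which ones completed."""
    trace = []
    for stage in PIPELINE:
        req = stage(req)  # a Rejected raised here skips every later stage
        trace.append(stage.__name__)
    return {**req, "trace": trace}
```

Because the budget check sits ahead of redaction and routing, an over-budget request is rejected before any provider is selected, which matches the intent of step 4.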

Control plane, alongside: policy and routing config live in Git and are reconciled onto the cluster by Argo CD; the gateway and its infrastructure are defined in Terraform (the cloud resources, networking, private endpoints) with Ansible handling the configuration of the self-managed pieces; Jenkins or GitHub Actions runs the build, the policy tests, and the evaluation gate before any change reaches Argo CD. None of this is on the request hot path — it is how the hot path gets safely changed.

Component breakdown

| Component | Service / tool | Role in the gateway | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot and prompt-flood mitigation | Custom WAF rules for token-exhaustion; origin shield to the private gateway |
| Identity / scopes | Okta (+ Entra ID) | Per-team/service auth; scopes bind identity to logical models and data tiers | OAuth2 client-credentials for services, OIDC for humans; Okta→Entra federation for Azure workloads |
| Gateway core | LLM proxy on private K8s | The choke point: auth, budget, redact, route, guard, log | OpenAI-compatible API; horizontal autoscale on concurrency |
| Budgets / metering | Budget service + Redis | Per-team token quotas and rate limits, enforced pre-call | Hot counters in Redis; durable ledger in Postgres; reset on billing cycle |
| Redaction | Presidio + domain dictionaries | Strip/tokenize PII/PHI before egress; restore on trusted return | NER recognizers + custom trial-ID/compound-code patterns; reversible vault-backed tokenization |
| Content safety | Provider guardrails + own filter | Block jailbreaks, injection, harmful in/out content | Bedrock Guardrails / Azure Content Safety / Vertex safety, normalized to one verdict schema |
| Routing | Router in gateway core | Logical model → concrete provider; fallback and load-balance | Weighted routing; health-based failover Azure→Bedrock→Vertex |
| Secrets | HashiCorp Vault | Short-lived provider credentials, signing keys | Dynamic secrets per cloud; lease + auto-rotation; no static keys in pods |
| Observability / FinOps | Datadog | Unified cost, token, latency, guardrail telemetry across all clouds | OTel traces per request; per-team cost dashboards; anomaly monitors |
| CSPM / IaC scanning | Wiz + Wiz Code | Posture across three clouds; scan Terraform before apply | Agentless multi-cloud scan; Wiz Code blocks a public-endpoint IaC change in CI |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on gateway nodes | Sensor on the node pool; detections to the SOC |
| ITSM / approvals | ServiceNow | Onboarding approvals, budget-change gates, incident records | Change gate to raise a team’s quota; auto-ticket on guardrail breach |
| Delivery | Jenkins / GitHub Actions + Argo CD | Build, test, eval gate, GitOps reconcile of policy/routing | OIDC to each cloud; eval gate before promote; Argo CD as the deployer |
| IaC / config | Terraform + Ansible | Provision cloud + network; configure self-managed components | Per-cloud modules; private endpoints as first deliverable |

A few choices carry the design and deserve the why.

Why a logical-model abstraction, not direct provider names. If callers request gpt-4o or anthropic.claude by name, you have hardcoded the provider into forty codebases and lost the swappability that justified the multi-cloud bet in the first place. By exposing logical names — reasoning-large, summarize-cheap, long-context — the platform team can re-point reasoning-large from Azure OpenAI to a Bedrock or Vertex model in a single config change reconciled by Argo CD, with zero caller redeploys, the day a better or cheaper model ships. The abstraction is the product.
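
A minimal sketch of that indirection, assuming a hypothetical in-memory routing table in place of the Git-managed config. The function names resolve and repoint and the placeholder model name for summarize-cheap are illustrative and not part of any real gateway.

```python
# Hypothetical routing table: logical name -> (provider, concrete model).
ROUTES = {
    "reasoning-large": ("azure-openai", "gpt-4o"),
    "summarize-cheap": ("bedrock", "example-small-model"),  # placeholder name
}

def resolve(logical_model: str) -> tuple[str, str]:
    """Return the provider and concrete model currently behind a logical name."""
    try:
        return ROUTES[logical_model]
    except KeyError:
        raise ValueError(f"unknown logical model: {logical_model}") from None

def repoint(logical_model: str, provider: str, model: str) -> None:
    """Swap the backing provider; callers keep using the same logical name."""
    ROUTES[logical_model] = (provider, model)
```

A caller that asks for reasoning-large before and after a repoint sends the same request; only the value returned by resolve changes.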

Why redaction lives in the gateway, not in each app. It is tempting to ask every team to scrub PHI before they call. Do not — that distributes the single most compliance-critical control across forty teams of varying diligence, and one missed scrub is a reportable disclosure. Centralizing redaction in the gateway means the rule is written once, tested once, and applied identically to a notebook, a portal, and a batch job. Reversible tokenization (the span is replaced with a token, the mapping held in Vault) lets trusted callers get the real value back on the return trip while the model provider only ever saw a placeholder.
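
Reversible tokenization can be sketched as follows. This is an illustration under stated assumptions, not the gateway's redaction engine: the PT-123456 identifier format is invented, a single regex stands in for Presidio and the domain dictionaries, and an in-memory class stands in for the Vault-held mapping.

```python
import re
import secrets

# Hypothetical patient-identifier format, used only for this sketch.
PATIENT_ID = re.compile(r"\bPT-\d{6}\b")

class TokenVault:
    """In-memory stand-in for the Vault-held token mapping."""

    def __init__(self) -> None:
        self._by_token: dict[str, str] = {}
        self._by_value: dict[str, str] = {}

    def tokenize(self, text: str) -> str:
        """Replace each identifier with an opaque token; same value, same token."""
        def swap(match: re.Match) -> str:
            value = match.group(0)
            token = self._by_value.get(value)
            if token is None:
                token = f"<PID_{secrets.token_hex(4)}>"
                self._by_value[value] = token
                self._by_token[token] = value
            return token
        return PATIENT_ID.sub(swap, text)

    def detokenize(self, text: str) -> str:
        """Restore original values; intended only for trusted callers."""
        for token, value in self._by_token.items():
            text = text.replace(token, value)
        return text
```

The provider would see only the output of tokenize. Whether detokenize runs on the return trip would depend on the caller's scope.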

Why one normalized safety verdict over three native guardrails. Bedrock Guardrails, Azure AI Content Safety, and Vertex safety filters each have their own categories and response shapes. The gateway calls whichever is native to the chosen provider and applies its own filter, then normalizes everything into one verdict schema (blocked, category, severity) so the audit log, the dashboards, and the incident automation see one consistent signal regardless of which cloud served the request.
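
One way to picture the normalization step is shown below. The two input shapes are deliberately simplified, hypothetical stand-ins and do not reproduce any provider's real response format. The 0 to 3 severity scale and the rule that the most severe verdict wins are also assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Verdict:
    blocked: bool
    category: str
    severity: int  # 0 (none) to 3 (high); scale assumed for this sketch

def from_provider_a(resp: dict) -> Verdict:
    # Hypothetical shape: {"action": "INTERVENED", "policy": "prompt_attack"}
    blocked = resp["action"] == "INTERVENED"
    return Verdict(blocked, resp.get("policy", "none"), 3 if blocked else 0)

def from_provider_b(resp: dict) -> Verdict:
    # Hypothetical shape:
    # {"categories": [{"name": "violence", "level": 2}], "threshold": 2}
    worst = max(resp["categories"], key=lambda c: c["level"], default=None)
    if worst is None:
        return Verdict(False, "none", 0)
    return Verdict(worst["level"] >= resp["threshold"], worst["name"], worst["level"])

def combine(*verdicts: Verdict) -> Verdict:
    """Pick the most severe verdict; a blocking verdict outranks any other."""
    return max(verdicts, key=lambda v: (v.blocked, v.severity))
```

Downstream consumers such as the audit log would then read only Verdict fields, regardless of which adapter produced them.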

Implementation guidance

Provision with Terraform, and treat the cross-cloud network as the first deliverable. The deployment order matters because private connectivity to three providers is where this architecture silently breaks.

  1. A landing-zone VNet/VPC per cloud with a subnet for the gateway and a dedicated egress subnet, peered or transit-connected so the gateway core can reach all three providers from one logical place.
  2. Private connectivity to each provider — Azure Private Endpoint + the privatelink.openai.azure.com private DNS zone for Azure OpenAI; an AWS PrivateLink interface endpoint for bedrock-runtime; Private Service Connect for the Vertex endpoint. A forgotten private DNS zone here produces the same silent hang it does anywhere: the endpoint resolves to a firewalled public IP and the call times out with no clear error.
  3. Egress firewall rules that allow the provider endpoints only from the gateway’s egress subnet and deny them everywhere else — this is the rule that makes the gateway unbypassable.
  4. The private Kubernetes cluster for the gateway core, with workload identity per cloud and CrowdStrike Falcon sensors on the node pool.
  5. Wiz Code wired into the Terraform pipeline so a plan that would expose a provider or gateway endpoint publicly is blocked before apply, not caught after.

A trimmed gateway routing config communicates the intent — logical names, ordered fallbacks across clouds, budgets attached:

models:
  reasoning-large:
    route:
      - provider: azure-openai      # primary today
        deployment: gpt-4o
        weight: 100
      - provider: bedrock           # health-based fallback
        model: anthropic.claude-3-5-sonnet
      - provider: vertex            # last resort
        model: gemini-1.5-pro
    guardrails: [pii_redact, prompt_shield, output_safety]
teams:
  medical-affairs:
    allowed_models: [reasoning-large, summarize-cheap]
    max_data_tier: confidential     # not restricted/PHI
    monthly_token_budget: 80_000_000
    rpm_limit: 600

The pipeline that applies this runs in GitHub Actions (or Jenkins for the teams standardized on it), authenticating to each cloud via OIDC federation so there is no stored service-principal or IAM secret to leak — the exact failure mode the company has already lived through. The same pipeline runs the policy tests and the offline evaluation gate before handing the change to Argo CD to reconcile.

Identity: bind every caller to a scope, not a network location. Provision each application and service in Okta as a first-class client with OAuth2 client-credentials, and encode the entitlement in scopes: which logical models, which data-sensitivity tier, which team budget it draws from. The gateway treats the scope as the source of truth — being inside the corporate network grants nothing on its own. Azure-resident workloads federate Okta → Entra ID so the token is native where Azure RBAC reads it. The provider credentials the gateway itself uses to reach Azure OpenAI, Bedrock, and Vertex are never static: HashiCorp Vault issues short-lived dynamic secrets per cloud, leased and auto-rotated, so a leaked pod memory dump yields a credential that is already dead.
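
The scope check itself is small. The sketch below assumes the scopes have already been extracted from a validated token into a plain dict, and it assumes an ordered list of tiers in which only confidential and restricted come from this article; the other tier names are invented for illustration.

```python
# Data-sensitivity tiers in ascending order. "public" and "internal" are
# assumed names; the article only mentions confidential and restricted/PHI.
TIERS = ["public", "internal", "confidential", "restricted"]

def authorize(scopes: dict, logical_model: str, data_tier: str) -> bool:
    """Allow only if the model is in scope and the tier is within the cap."""
    if logical_model not in scopes["allowed_models"]:
        return False
    return TIERS.index(data_tier) <= TIERS.index(scopes["max_data_tier"])

# Mirrors the medical-affairs entry in the routing config shown earlier.
medical_affairs = {
    "allowed_models": ["reasoning-large", "summarize-cheap"],
    "max_data_tier": "confidential",
}
```

Note that nothing in authorize looks at a source address: the decision depends only on what the identity is entitled to.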

Budgets and metering, enforced before the spend. The budget service holds two layers: fast Redis counters for the per-minute rate and the running monthly token tally, and a durable Postgres ledger for chargeback and audit. The check happens before the model call, so an over-budget team is stopped before incurring cost, not billed after. Raising a team’s quota is not a config edit an engineer makes quietly — it routes through ServiceNow as a change request with an owner and a justification, which is how the CFO gets a defensible answer to “why did medical-affairs spend triple this month.”
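
A minimal sketch of the pre-call check, assuming a fixed one-minute window and an in-process counter in place of Redis. The class name, the estimated_tokens parameter (the gateway cannot know completion size before the call, so some estimate is assumed), and the split between admit and record are illustrative choices.

```python
import time
from typing import Optional

class BudgetGate:
    """In-memory stand-in for the Redis hot counters; single process only."""

    def __init__(self, monthly_token_budget: int, rpm_limit: int) -> None:
        self.monthly_token_budget = monthly_token_budget
        self.rpm_limit = rpm_limit
        self.tokens_used = 0
        self._window_start = 0.0
        self._window_count = 0

    def admit(self, estimated_tokens: int, now: Optional[float] = None) -> bool:
        """Decide before the provider call whether the request may proceed."""
        now = time.monotonic() if now is None else now
        if now - self._window_start >= 60:  # start a new one-minute window
            self._window_start, self._window_count = now, 0
        if self._window_count >= self.rpm_limit:
            return False  # over the per-minute rate
        if self.tokens_used + estimated_tokens > self.monthly_token_budget:
            return False  # would exceed the monthly budget
        self._window_count += 1
        return True

    def record(self, actual_tokens: int) -> None:
        """Add the tokens actually consumed once the provider has responded."""
        self.tokens_used += actual_tokens
```

A production version would need atomic increments shared across gateway pods, which is the job the Redis counters do here.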

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction: identity-and-scope-based access, no implicit trust from network position, model providers reachable from one egress path only, and no public data-plane surface on the gateway. Layered on top: (a) input redaction as the hard control against PHI ever reaching a provider, with reversible tokenization for trusted return; (b) prompt-injection screening, including the under-appreciated indirect injection hidden inside a document a RAG-style caller retrieved and forwarded; (c) Wiz running continuous CSPM across all three clouds, alerting the moment any provider or gateway endpoint drifts to public exposure or an IAM policy widens, with Wiz Code shifting that check left into the Terraform pipeline; (d) CrowdStrike Falcon on the gateway node pool feeding the SOC; (e) an automatic ServiceNow incident on any guardrail breach (a blocked jailbreak, or the redaction engine flagging a sustained PHI leak attempt from one team), so security gets a ticket, not just a log line. Policy-as-code in each cloud denies a provider resource created with public access, and Wiz independently verifies the policy is actually holding across all three.

Cost optimization (FinOps is the headline, not a footnote). This gateway exists as much for the CFO as the CISO, so cost control is a first-class feature.

| Lever | Mechanism | Typical effect |
|---|---|---|
| Model tiering | Route simple summarization to summarize-cheap, reserve reasoning-large for hard tasks | Large share served at a fraction of the per-token cost |
| Cross-provider arbitrage | Re-point a logical model to whichever cloud is cheapest for equivalent quality | Captures provider price moves without caller changes |
| Semantic caching | Serve near-identical prior prompts from cache above a cosine threshold | Deflects a meaningful share of calls on repetitive corpora |
| Pre-call budget enforcement | Stop over-budget teams before the spend, not after | Hard ceiling per team, not a surprise on the invoice |
| Prompt-size hygiene | Cap context and trim system prompts centrally | Cuts input tokens on every request across all teams |
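
The semantic-caching lever can be sketched as a nearest-neighbour lookup over prompt embeddings. The sketch assumes embeddings are computed elsewhere and passed in as plain lists of floats, uses a linear scan instead of a vector index, and picks 0.95 as an arbitrary example threshold.

```python
import math
from typing import Optional

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], answer: str) -> None:
        self._entries.append((embedding, answer))

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the closest stored answer if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

In a governed gateway the cache would presumably be partitioned by team or data tier so that one team's cached completion is never served to another; that partitioning is omitted here.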

The single most valuable artifact is the per-team, per-model cost dashboard in Datadog, fed by the token and cost telemetry the gateway emits on every request. For the first time the company can answer “what does AI cost us, by team, by model, by cloud” from one place — and that visibility alone changes behavior, because teams that can see their spend manage it.
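
The roll-up behind such a dashboard amounts to grouping per-request telemetry by team, model, and cloud. In the sketch below the record fields and the prices are hypothetical; the prices are invented round numbers in USD per million tokens and do not reflect any provider's rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens, for illustration only.
PRICE_PER_MILLION = {
    "reasoning-large": {"prompt": 2, "completion": 8},
    "summarize-cheap": {"prompt": 1, "completion": 2},
}

def cost_by_team(records: list[dict]) -> dict[tuple[str, str, str], float]:
    """Sum request cost per (team, logical model, cloud)."""
    totals: dict[tuple[str, str, str], float] = defaultdict(float)
    for r in records:
        price = PRICE_PER_MILLION[r["model"]]
        cost = (
            r["prompt_tokens"] * price["prompt"]
            + r["completion_tokens"] * price["completion"]
        ) / 1_000_000
        totals[(r["team"], r["model"], r["cloud"])] += cost
    return dict(totals)
```

Keying on the logical model keeps the report stable when a logical name is re-pointed to another provider, while the cloud field still shows where the spend landed.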

Scalability. The gateway core is stateless on the hot path (state lives in Redis and the providers), so it scales horizontally on request concurrency behind the load balancer; the budget counters in Redis are the one shared bottleneck, sized and sharded accordingly. Throughput ceilings are the providers’ own quotas — Azure OpenAI PTUs/regional limits, Bedrock and Vertex per-region quotas — which the router smooths by load-balancing across deployments and failing over Azure→Bedrock→Vertex when one provider throttles, so a 429 on one cloud degrades gracefully instead of failing the request. Plan multi-region for the gateway itself early; a single-region gateway is a single point of failure for all AI in the company.
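
The failover behaviour described above can be sketched as an ordered walk over routes that advances only when a provider throttles. The Throttled exception and the call parameter (a function that sends the prompt to one provider) are hypothetical stand-ins for the real provider clients.

```python
from typing import Callable

class Throttled(Exception):
    """Hypothetical signal that a provider answered with HTTP 429."""

def call_with_failover(
    routes: list[str],
    prompt: str,
    call: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each provider in order; return (provider, completion) on first success."""
    errors = []
    for provider in routes:
        try:
            return provider, call(provider, prompt)
        except Throttled as exc:
            errors.append((provider, str(exc)))  # remember why, then try the next
    raise RuntimeError(f"all providers throttled: {errors}")
```

Only throttling triggers a fallback in this sketch. Other errors propagate to the caller, since retrying a malformed request on a second provider would spend tokens without fixing anything.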

Failure modes, and what each looks like. Name them before they page you.

Reliability & DR (RTO/RPO). Decide the numbers per tier. The gateway core, being stateless, recovers by standing pods up in a paired region behind the load balancer — RTO in minutes. The budget ledger (Postgres) replicates cross-region for near-zero RPO on the financial record; the Redis hot counters are rebuildable from the ledger, so their loss is a performance event, not a data-loss one. The model providers each handle their own regional availability; the gateway’s job is to route around a degraded region, which the failover config already does. A pragmatic target: RTO 10 minutes, RPO 1 minute for the gateway service and its budget ledger, with Akamai health checks driving edge failover for ingress.

Observability. Instrument the full request span in Datadog with OpenTelemetry: one trace covering authenticate → budget-check → redact → route → provider-call → output-guard, with token counts, cost, chosen provider, and guardrail verdicts on each hop. Emit the metrics each stakeholder actually reads — tokens and cost per team/model/cloud (the CFO’s view), redaction-hit and guardrail-block rates (the CISO’s view), per-provider latency and error rate and p95 time-to-first-token (the platform team’s view). Datadog anomaly monitors then surface a cost spike or a redaction-rate change on their own. An offline evaluation harness runs in the delivery pipeline so a routing or policy change is scored — does re-pointing reasoning-large to a new provider hold answer quality? — before Argo CD ships it.

Governance and adoption. Pin concrete provider model versions explicitly behind each logical name (never a floating alias) so a silent provider-side model update cannot drift behavior; promote new versions through the eval gate. Keep routing, budgets, and policy in Git, reviewable and instantly revertable, deployed only via Argo CD so there is no out-of-band change. New teams onboard through ServiceNow with their scope, data tier, and budget approved by an owner — the documented gate compliance needs. And because adoption is the real success metric, the platform team publishes a self-serve enablement track on Moodle: how to call the gateway, how to choose a logical model, what the data tiers mean, and the redaction rules — so teams onboard correctly instead of inventing their own ungoverned path again. Where a team needs an isolated network posture for a regulated workload, the gateway can also be deployed as a hardened virtual appliance image into that team’s VPC, governed by the same central policy but running on their side of a trust boundary.

Explicit tradeoffs

Accept these or do not build it. A central gateway is, by definition, a chokepoint in front of every AI call in the company — that is its value and its liability. It adds a network hop and a few milliseconds of policy processing to every request, it becomes a tier you must make as available as the providers behind it, and it concentrates operational responsibility on the platform team. The multi-cloud abstraction that buys provider independence costs you the lowest-common-denominator problem: a feature unique to one provider (a particular structured-output mode, a provider-specific tool format) is awkward to expose through a normalized API, so you either special-case it and erode the abstraction or forgo it. Centralized redaction is the right call for compliance but means the gateway must understand enough of every team’s data shapes to redact correctly — a real, ongoing curation cost. And running this across three clouds means three sets of private-link plumbing, three quota regimes, and three IAM models to keep in Terraform, which a single-cloud shop simply does not pay.

The alternatives, and when they win. If you are genuinely committed to one cloud and one provider for the foreseeable future, that cloud’s native gateway (APIM with LLM policies on Azure, Bedrock-fronted patterns on AWS) is simpler and you should use it — the multi-provider machinery is overhead you do not need. If you have a handful of teams and trust them, a shared SDK with redaction and metering is faster to stand up than a network-enforced gateway, accepting that it is bypassable. If your concern is purely cost and not safety or data leakage, a FinOps tagging-and-dashboard effort against the native billing exports gets you the CFO’s view without a request-path component. The full gateway earns its complexity precisely when all three pressures land at once — multi-provider reality, a hard data-protection mandate, and uncontrolled spend — which is exactly the situation the pharmaceutical company found itself in, and exactly when “every LLM call goes through one governed path” stops being architecture astronomy and becomes the only defensible answer.

The shape of the win

For the company, the payoff is not “an AI proxy.” It is that the CISO can state, with an audit trail to prove it, that no patient identifier has reached a model provider since the gateway went live — because redaction is one control, applied to every call, not a hope spread across forty teams. It is that the CFO opens one Datadog dashboard and sees AI spend by team, model, and cloud, with each team stopped at its budget before the overage rather than after. It is that when a better reasoning model ships on Bedrock next quarter, the platform team re-points one logical name and every team gets the upgrade with no redeploy and no new key. The Private Links, the Okta scopes, the Vault-leased provider credentials, the Wiz posture scanning, the normalized safety verdicts, the per-team token ledger — all of it exists so that a regulator, a CISO, and a CFO each say yes to the same sentence: every LLM call in this company is governed, attributed, and safe, no matter which cloud served it. That sentence is the architecture. Start with one cloud and one logical model if you must, but in a multi-provider, regulated enterprise, this is where governing GenAI access has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
