A retail bank’s wealth-management arm gets an ultimatum from its head of advisory: the 4,000 relationship managers spend a quarter of every client meeting hunting through product disclosures, suitability rules, and a decade of internal research notes — and twice last year an RM quoted a withdrawn fund fact sheet to a client, which is the kind of thing that ends with a regulator’s letter. The ask is “an assistant that answers from our own documents, with a citation, instantly.” The constraint is brutal and non-negotiable: this is MiFID II / SEBI-regulated material under data-residency rules, the security team has a standing veto on anything where a client’s portfolio context could leave the bank’s network, and the model must never invent a number. A weekend with an API key is not the answer. This article is the reference architecture for building that assistant properly on Azure — a private-networked, identity-gated, guardrailed RAG platform that a bank’s CISO and compliance officer will actually sign.
The pressures stack the way they always do in finance. Regulation means every answer needs a verifiable source and an audit trail, and the data cannot traverse the public internet. Scale means 4,000 concurrent advisors during market hours, not a demo. Latency means a relationship manager mid-conversation will not wait eight seconds for a streamed token. And cost means token spend that grows with adoption and has to be charged back to each business line. RAG — retrieval-augmented generation — is the pattern that satisfies all four at once: it grounds the model’s answer in retrieved, permissioned, citable passages from a search index you control, instead of the model’s frozen and uncitable training memory. The knowledge stays in your index; the model supplies language and reasoning, not facts.
Why not the obvious shortcuts
The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.
Keyword search over the document store returns documents, not answers, and misses anything phrased differently from the query — “early redemption penalty” will not find a fact sheet that says “exit load.” Fine-tuning a model on the corpus bakes facts into weights you cannot cite, update the morning a fund is withdrawn, or unlearn — and the model still hallucinates confidently. Pasting documents into a public chatbot leaks regulated, client-linked data across a tenant boundary the security team will never approve, and in a residency regime is simply illegal.
RAG threads the needle. At query time the system retrieves the handful of passages actually relevant to the question, hands them to the model as grounding context, and the model composes an answer from that context with citations back to the source. Retrieval is also the natural choke point to enforce document-level permissions — an advisor only ever grounds on documents they are cleared to read — and the citations turn an unauditable black box into something a compliance officer can verify line by line.
Architecture overview
The platform runs two distinct paths that share infrastructure but live on different schedules: a synchronous query pipeline that serves advisors, and an event-driven ingestion pipeline that keeps the knowledge base current. Keeping them separate in your head is the first step to operating this well.
The defining property of the entire topology is the one the security team cares about most: every Azure PaaS data-plane call rides a Private Endpoint, and every public endpoint is disabled. Azure OpenAI, AI Search, Cosmos DB, Blob Storage, and Content Safety expose private IPs inside the VNet only. No prompt, no retrieved passage, and no completion ever touches the public internet — which is what makes a residency story defensible.
Query path, following the control flow:
- An advisor opens the assistant inside the bank’s portal. The portal federates identity through Okta as the workforce IdP, which is brokered to Microsoft Entra ID so Azure resources see a first-class Entra token. The user hits Akamai at the edge for TLS termination, global anycast, and WAF/bot protection before traffic ever reaches Azure.
- The request lands on Azure API Management (APIM) acting as the AI gateway in internal VNet mode. APIM validates the Entra JWT (
validate-jwt), attaches the caller’s identity and group claims, enforces a per-business-line token budget with thellm-token-limitpolicy, and rate-limits by tenant. This is the single front door for every model call in the bank — one place to meter, throttle, and audit. - The request reaches the orchestrator — your RAG application code on AKS (a private cluster, for the steady high throughput of market hours). The orchestrator pulls the few secrets it cannot get from managed identity — third-party API tokens, the Okta introspection secret — from HashiCorp Vault via the Vault Agent sidecar with Entra-backed auth, so nothing sensitive sits in a Kubernetes Secret or config map.
- The orchestrator embeds the question with
text-embedding-3-largeand queries Azure AI Search with a hybrid query (vector + BM25 keyword) plus semantic reranking, filtered by the advisor’s permission groups so a passage they may not see is never even returned. - It assembles the top passages into a grounded prompt and calls Azure OpenAI (
gpt-4ofor substantive reasoning,gpt-4o-minifor simple lookups). - Both the inbound prompt and the model’s output pass through Azure AI Content Safety — Prompt Shields for jailbreak and indirect-injection detection, harm-category filters, and a groundedness check on the answer to flag anything the retrieved context does not support. The cited answer streams back to the advisor; the turn is written to Cosmos DB.
Ingestion path, independent and event-driven: source documents (product disclosures, research, suitability rules) land in Blob Storage, synced from SharePoint and the bank’s content systems. A Blob event triggers Azure AI Document Intelligence to extract clean text and layout — tables, headings, OCR for scanned legacy fact sheets — then Azure Functions chunk the text, call Azure OpenAI to embed each chunk, and push the vectors stamped with the source document’s ACL groups into the AI Search index. That ACL stamp at ingestion time is what makes per-document security real at query time.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot mitigation at the perimeter | Custom WAF rules for prompt-flood patterns; origin shield to APIM’s private origin |
| Identity / SSO | Okta + Microsoft Entra ID | Workforce SSO (Okta) federated to Entra for native Azure RBAC | OIDC federation; group claims flow to APIM/Search; conditional access on Entra |
| AI gateway | Azure API Management | JWT validation, per-line token metering, rate limiting, routing | validate-jwt; llm-token-limit; backend pool for OpenAI spillover |
| Orchestrator | AKS (private cluster) | RAG logic: embed → retrieve → rerank → prompt → call → guard → cite | Workload Identity; KEDA scaling; streaming responses |
| Secrets | HashiCorp Vault | Third-party tokens, introspection secrets, signing keys | Entra auth method; dynamic leases; Vault Agent sidecar injection |
| Chat & embeddings | Azure OpenAI | Generation + vectorization | gpt-4o (PTU) + gpt-4o-mini (cost tier); text-embedding-3-large; local_auth off |
| Retrieval | Azure AI Search | Hybrid vector+keyword index, semantic rerank, security filtering | HNSW profile; semanticConfiguration; OData filter on ACL field |
| Extraction | AI Document Intelligence | PDF/Office/image → structured text + layout | Layout model; per-page provenance for citations |
| Guardrails | AI Content Safety | Block jailbreaks, harmful content, ungrounded answers | Prompt Shields on input; groundedness + harm filters on output |
| State | Cosmos DB | Conversation history, feedback, per-user memory | Autoscale RU/s; partition by conversation id; TTL on transient turns |
| CSPM / data posture | Wiz | Cloud posture, sensitive-data exposure, attack-path analysis | Agentless scan of Storage/Search; alerts on any public-exposure drift |
| Runtime security | CrowdStrike Falcon | Workload runtime protection on AKS nodes and ingestion VMs | Sensor on node pool; detections piped to the SOC |
| Observability | Dynatrace | Distributed tracing, token/cost telemetry, AI-specific spans | OneAgent on AKS; OpenTelemetry RAG span; Davis anomaly detection |
| ITSM / approvals | ServiceNow | Corpus onboarding approvals, change requests, incident records | Change gate before a new corpus goes live; auto-ticket on guardrail breach |
| CI / IaC | GitHub Actions + Terraform | Pipeline build/test/eval; infrastructure as code | OIDC to Azure (no stored creds); eval gate before deploy |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why hybrid search, not pure vector. Vector similarity is excellent at “passages about early-redemption penalties” even when the document says “exit load.” But it is mediocre at exact-match needs — ISIN codes, fund identifiers, regulation clause numbers — where a single token must match. Hybrid retrieval runs both a vector query and a BM25 keyword query, fuses the results, and then semantic reranking (a cross-encoder Microsoft hosts inside AI Search) re-scores the top ~50 candidates for true relevance. In practice, hybrid plus semantic ranking lifts answer quality more than any prompt tweak you will make.
Why security trimming belongs in the index, not the app. It is tempting to retrieve broadly and filter in application code. Do not — that means restricted content leaves the index into your process memory and your traces before being dropped, and one bug leaks it across a regulatory line. Instead, stamp every chunk with the Entra group GUIDs allowed to read its source, and pass the caller’s groups as an OData filter so AI Search never returns a forbidden passage:
$filter=groups/any(g: search.in(g, '8f3a..., 21b7..., d40c...'))
Permission stays a property of the data, not a hope in the application layer.
Why an AI gateway in front of Azure OpenAI. Every enterprise eventually needs one place that enforces token budgets per business line, load-balances across OpenAI deployments, honors Retry-After, and produces a single audit log of who asked the model what. APIM is that layer here. Its llm-token-limit policy meters prompt and completion tokens against a per-line quota so the equities desk cannot exhaust the budget the advisory desk paid for, and its backend pool routes overflow from the provisioned deployment to a pay-as-you-go one.
Implementation guidance
Provision with Terraform, and treat the network as the first deliverable. The deployment order matters because of private DNS — get it wrong and endpoints resolve to firewalled public IPs and calls hang silently.
- A hub/spoke or single VNet with subnets for the orchestrator, the private endpoints, and APIM (its own delegated subnet).
- Private DNS zones for each PaaS service —
privatelink.openai.azure.com,privatelink.search.windows.net,privatelink.documents.azure.com(Cosmos),privatelink.blob.core.windows.net,privatelink.cognitiveservices.azure.com(Content Safety + Document Intelligence) — linked to the VNet. Forgetting one zone is the single most common failure on this architecture. - The PaaS resources, each with
publicNetworkAccess = 'Disabled'and a Private Endpoint in the PE subnet. - The private AKS cluster (Azure CNI) with Workload Identity enabled.
- APIM in internal VNet mode, with Akamai pointed at its private origin.
A minimal Terraform shape for the Azure OpenAI account communicates the intent — keys off, Entra only:
resource "azurerm_cognitive_account" "openai" {
name = "oai-ragbank-prod-cin"
kind = "OpenAI"
sku_name = "S0"
custom_subdomain_name = "oai-ragbank-prod-cin" # required for Private Link + AAD
public_network_access_enabled = false
local_auth_enabled = false # no API keys; Entra only
identity { type = "SystemAssigned" }
}
resource "azurerm_cognitive_deployment" "chat" {
name = "gpt-4o"
cognitive_account_id = azurerm_cognitive_account.openai.id
model { format = "OpenAI" name = "gpt-4o" version = "2024-11-20" }
sku { name = "ProvisionedManaged" capacity = 120 } # PTUs sized to p95
}
The pipeline that applies this runs in GitHub Actions, authenticating to Azure via OIDC federation so there is no stored service-principal secret to leak — a hard lesson the platform team intends never to repeat. The same pipeline runs the offline evaluation harness (below) as a required gate.
Identity: kill the keys, federate the humans. Set local_auth_enabled = false on Azure OpenAI and disableLocalAuth on AI Search and Cosmos, so the only way in is Entra. The orchestrator authenticates with Workload Identity on AKS, granted exactly three role assignments — Cognitive Services OpenAI User on the OpenAI account, Search Index Data Reader on the index, and Cosmos DB Built-in Data Reader/Contributor — with the ingestion identity additionally holding Search Index Data Contributor and Storage Blob Data Reader. Human SSO flows Okta → Entra: advisors log in once with the bank’s Okta credentials and conditional-access policies, Okta federates to Entra over OIDC, and the resulting Entra token carries the group claims that both APIM and AI Search consume. The few residual secrets that are not managed identities — the Okta introspection secret, third-party research-feed tokens — live in HashiCorp Vault, leased dynamically and injected by the Vault Agent sidecar, so they are short-lived and never written to a Kubernetes Secret.
Indexing wiring. Prefer the push pattern through the ingestion function for this corpus: full control over chunking, ACL stamping, and metadata, which is essential when permissions are non-trivial. Chunk at ~300–500 tokens with ~10–15% overlap; carry title, sourceUri, page, effectiveDate, and groups[] on every chunk; define an HNSW vector field and a semantic configuration naming your title and content fields. Carry effectiveDate specifically so a withdrawn fact sheet can be filtered out at query time — the exact failure that started this project.
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: identity-based access only, least-privilege RBAC scoped per resource, no public data-plane surface. Layer on top: (a) Prompt Shields to catch jailbreaks and the under-appreciated indirect injection where a malicious instruction is hidden inside a retrieved document; (b) groundedness detection as the last line against hallucinating a number; © Wiz running continuous CSPM and sensitive-data-exposure scanning across Storage and Search, alerting the moment any resource drifts to public exposure or a misconfigured ACL widens access — the posture backstop behind the policy controls; (d) CrowdStrike Falcon sensors on the AKS node pool and ingestion compute for runtime threat detection, feeding the bank’s SOC; (e) a guardrail breach (a blocked jailbreak, a sustained groundedness failure) auto-raises a ServiceNow incident so security has a ticket, not just a log line. Azure Policy denies any Cognitive Services or Search resource created with public network access, and Wiz independently verifies that the policy is actually holding.
Cost optimization. Token spend dominates and grows with success, so engineer for it from day one.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Model tiering | Route simple lookups to gpt-4o-mini, reserve gpt-4o for reasoning |
~10× cheaper on the routed share |
| Semantic caching | Serve near-identical prior questions from cache (cosine threshold) | Deflects 30–50% of model calls on repetitive corpora |
| PTU to p95 + spillover | Size provisioned throughput to steady demand; burst to pay-as-you-go | Avoids buying for peak |
| Prompt hygiene | Tune top-k to 3–5 after semantic reranking, not 20 passages | Cuts input tokens every turn |
| Per-line metering | APIM llm-token-limit feeds chargeback |
Makes each desk own its spend |
Meter tokens per business line in APIM and pipe the metric to Dynatrace, which the platform team uses for the chargeback dashboard the CFO sees.
Scalability. Each tier scales independently. AKS scales pods on concurrency and nodes via the cluster autoscaler; ingestion Functions scale on Blob-event depth via KEDA. AI Search scales out with replicas (query QPS and the read SLA) and partitions (index size and indexing throughput) — size them separately. Azure OpenAI scales by adding PTUs or fanning out across deployments and regions behind APIM’s backend pool, which round-robins and honors Retry-After. The natural ceiling is the OpenAI regional quota, which is why a 4,000-seat rollout plans multi-region early.
Failure modes, and what each one looks like. Name them before they page you.
- A missing private DNS zone link — the endpoint deploys clean but resolves to a firewalled public IP and every call hangs until timeout. Mitigation: assert all zone links in Terraform and in a post-deploy smoke test.
- Azure OpenAI 429s under load — pay-as-you-go shares a regional pool; at market open you get throttling and latency jitter, fatal for a streaming UX. Mitigation: a PTU deployment for interactive traffic with pay-as-you-go as APIM-routed spillover.
- Retrieval miss — the relevant passage is not in the top-k, so the model cannot ground on it and a confidently wrong answer slips out. Mitigation: hybrid + semantic rerank, groundedness detection set to flag-and-cite, and an eval harness that catches regressions.
- Stale corpus — a withdrawn fact sheet still in the index gets cited. Mitigation: the
effectiveDatefilter and an ingestion job that tombstones superseded documents within the hour. - Regional outage — see DR below.
Reliability & DR (RTO/RPO). Decide the numbers per tier. Cosmos DB with multi-region writes gives near-zero RPO and seconds RTO for chat state. AI Search has no native cross-region replication, so DR means maintaining a warm index in a paired region — re-run ingestion against both — with Blob (geo-redundant, the durable source of truth) as the real recovery guarantee. Azure OpenAI deployments are regional; for DR, deploy the same models in a paired region and fail over at the APIM layer. A pragmatic target for this platform: RTO 15 minutes, RPO 5 minutes for the conversational service, with the knowledge base rebuildable from geo-redundant Blob within hours if the index itself is lost. Akamai health checks drive edge failover for ingress.
Observability. Instrument the RAG span end to end in Dynatrace with OpenTelemetry: one trace covering embed → retrieve → rerank → generate → guard, with timing and token counts on each hop, and Davis anomaly detection on top so a latency or cost regression surfaces on its own. Emit the metrics the business actually cares about — retrieval hit-rate, groundedness score, cache-deflection rate, tokens and cost per business line, and p95 time-to-first-token (the latency advisors feel in a stream). Run an offline evaluation harness (Azure AI Foundry prompt-flow evaluators or a golden-question set) inside the GitHub Actions pipeline so a prompt or model change is scored on groundedness and relevance before it ships. New corpora pass through a ServiceNow change approval before going live, giving compliance a documented gate.
Governance. Pin model versions explicitly (gpt-4o 2024-11-20, never a floating alias) so behavior does not drift; promote new versions through the eval gate. Keep prompt templates in version control, reviewable and instantly revertable. Apply Azure Policy to deny public network access and require diagnostic settings on every relevant resource, with Wiz as the independent check that the controls are real. Log every prompt/response pair for audit, incident review, and future eval data — with a right-to-be-forgotten path, since advisor conversations and client context are personal data under the same regime that started this whole project.
Explicit tradeoffs
Accept these or do not build it. RAG adds real moving parts — an ingestion pipeline, an index to keep fresh, embedding costs, and retrieval quality you must measure and tune. Latency is the sum of retrieval and generation, never just one. And RAG answers are only as good as retrieval: if the relevant passage is not in the top-k, the model cannot ground on it, and groundedness checks mitigate but do not eliminate the confidently-wrong failure. The private-networking posture that makes the security team sign costs you setup complexity — five private DNS zones, an internal-mode APIM, no public debugging shortcuts — and the price of forgetting one piece is a silent hang, not a clear error. The Okta-to-Entra federation adds a hop and a token-translation step that the simpler single-IdP shops will not need. And the AI gateway, the guardrails, and the per-line metering are all overhead you can skip for a ten-user pilot and absolutely cannot skip for 4,000 regulated seats.
The alternatives, and when they win. If your corpus is small, static, and fits the model’s context window, long-context prompting skips the index entirely and is simpler. If you need the model to adopt a style or domain vocabulary rather than recall facts, fine-tuning is the right tool — and it composes with RAG (fine-tune for behavior, retrieve for facts). If you need the model to take actions — place a trade, update a CRM — you want an agent/tool-calling architecture, with RAG as one of its tools. And if you are a small team optimizing for speed over control, Azure AI Foundry’s “chat on your data” wizard stands up a working RAG app in an afternoon; graduate to this full private-networked platform when security, scale, residency, or governance demand it.
The shape of the win
For the bank’s advisory desk, the payoff is not “a chatbot.” It is that an advisor mid-meeting asks “what is the exit load on this fund after two years,” gets a one-line answer with a citation to the current fact sheet’s exact page in about a second, and — because the answer is grounded, cited, permission-trimmed, and never left the bank’s network — compliance approved the tool for advisory use, which a fine-tuned, uncitable model would never have cleared. That last sentence is the one that funds the platform. Everything upstream — the Private Endpoints, the Okta-to-Entra federation, the Vault-held secrets, the Wiz posture scanning, the Content Safety groundedness gate, the Dynatrace RAG span — exists to make a regulator, a CISO, and a CFO each say yes. The architecture here is the destination; start narrower if you must, but this is where a regulated, at-scale “chat with our documents” has to land.