The demo RAG app that worked on twelve PDFs falls apart at fifty thousand documents. Pure vector search returns plausible-but-wrong chunks for keyword-heavy queries (“error code 0x80070057”), the LLM hallucinates because retrieval missed the one relevant paragraph, and nobody can answer “is this answer based on the current SOP or last year’s?” The retrieval layer is where most RAG systems silently fail, and Azure AI Search is the component that earns its keep there — if you design the index, chunking, and ranking deliberately.
This guide builds a production retrieval layer end to end: vector index schema with HNSW tuning, integrated vectorization so you never run an embedding pipeline yourself, hybrid search fused with RRF, semantic reranking, indexers with change tracking, and the security and scaling decisions that bite at the SLA. Examples use the 2024-07-01 stable REST API (generally available; the surface used here is stable across the 2024 GA line).
Mental model: the LLM is the reasoning layer; AI Search is the retrieval layer. Grounding quality is bounded by what retrieval surfaces. Spend your effort here.
1. Index schema: vector fields, HNSW, and analyzers
An index for RAG holds chunks, not whole documents. Each document in the index is one chunk plus the metadata you need for filtering, citation, and freshness. The schema below pairs a searchable text field (for BM25) with a vector field (for similarity), and carries a parent_id, title, url, and last_modified for grounding.
{
"name": "kb-chunks",
"fields": [
{ "name": "chunk_id", "type": "Edm.String", "key": true, "filterable": true,
"sortable": true, "analyzer": "keyword" },
{ "name": "parent_id", "type": "Edm.String", "filterable": true },
{ "name": "title", "type": "Edm.String", "searchable": true, "filterable": true },
{ "name": "url", "type": "Edm.String", "filterable": true },
{ "name": "security_group", "type": "Collection(Edm.String)", "filterable": true },
{ "name": "last_modified", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true },
{ "name": "content", "type": "Edm.String", "searchable": true,
"analyzer": "en.microsoft" },
{ "name": "content_vector", "type": "Collection(Edm.Single)",
"searchable": true, "dimensions": 1536,
"vectorSearchProfile": "hnsw-profile" }
],
"vectorSearch": {
"algorithms": [
{ "name": "hnsw-algo", "kind": "hnsw",
"hnswParameters": { "m": 4, "efConstruction": 400, "efSearch": 500, "metric": "cosine" } }
],
"profiles": [
{ "name": "hnsw-profile", "algorithm": "hnsw-algo", "vectorizer": "aoai-vectorizer" }
],
"vectorizers": [
{ "name": "aoai-vectorizer", "kind": "azureOpenAI",
"azureOpenAIParameters": {
"resourceUri": "https://my-aoai.openai.azure.com",
"deploymentId": "text-embedding-3-large",
"modelName": "text-embedding-3-large"
}
}
]
}
}
Decisions that matter:
- dimensions must match the embedding model. text-embedding-3-small is 1536; text-embedding-3-large is 3072 by default but supports shortened dimensions via the dimensions request parameter. Pick one and pin it, because changing it later means a reindex. The example above pairs 1536 with the large model, which is valid only if you also pass dimensions: 1536 at embed time; otherwise set the field to 3072.
- HNSW m is the bidirectional link count per node. Higher m improves recall at the cost of index size and memory. The service default is m: 4; m: 4 to 8 is plenty for most KBs. efConstruction (default 400) trades build time for graph quality; efSearch (default 500) trades query latency for recall.
- metric: cosine matches how OpenAI embeddings are trained. Do not switch to dotProduct unless your model is normalized and documented for it.
- Analyzer choice drives BM25 quality. en.microsoft does lemmatization (so “running” matches “run”); en.lucene is lighter. Use the Microsoft analyzer for natural-language content and keyword for IDs you want matched verbatim.
If you store vectors you never return to the client (you usually do not — they are large), set "stored": false on the vector field to cut index size substantially. The field stays searchable; it just is not retrievable.
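To see why the dimensions choice and vector storage matter at scale, a back-of-the-envelope estimate helps. The sketch below assumes float32 vectors (4 bytes per dimension, which is what Collection(Edm.Single) holds) and a hypothetical corpus of two million chunks. It ignores HNSW graph overhead and the service's own storage accounting, so read the numbers as a rough lower bound, not a quota calculation.

```python
BYTES_PER_FLOAT32 = 4  # Collection(Edm.Single) stores 32-bit floats


def raw_vector_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw size of the vector data alone, excluding HNSW graph overhead."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


if __name__ == "__main__":
    num_chunks = 2_000_000  # hypothetical corpus size, not a figure from this guide
    for dims in (1536, 3072):
        size = to_gib(raw_vector_bytes(num_chunks, dims))
        print(f"{dims} dims: {size:.1f} GiB of raw vectors")
```

Halving the dimensions halves the raw vector footprint, which is the main argument for pinning 1536 when your eval set shows no recall loss.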
2. Chunking and integrated vectorization with skillsets
The single biggest quality lever is chunk size. Too large and embeddings blur multiple topics, hurting precision; too small and you fragment context the LLM needs. Start at 300-500 tokens per chunk (roughly 1,200-2,000 characters of English prose, at about four characters per token) with ~10-15% overlap, then tune against your eval set.
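The size-and-overlap arithmetic is easy to get wrong, so here is a minimal sketch of fixed-size chunking with overlap. It uses plain character windows with defaults of 2000 and 250 characters (the values this section's skillset uses). It is an illustration of the idea only: SplitSkill's own splitting also tries to respect sentence boundaries, so its output will not be identical, and chunk_text is a hypothetical helper name.

```python
def chunk_text(text: str, max_len: int = 2000, overlap: int = 250) -> list[str]:
    """Split text into windows of at most max_len characters.

    Each window starts (max_len - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters of context.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(5000))  # synthetic 5000-char text
    parts = chunk_text(sample)
    print([len(p) for p in parts])
```

With an overlap of 250 on 2000 characters you are at 12.5%, inside the 10-15% starting range suggested above.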
You can chunk and embed yourself, but integrated vectorization lets the indexer do both via a skillset. The SplitSkill chunks; the AzureOpenAIEmbeddingSkill embeds each chunk; an index projection writes one search document per chunk. This means at query time you can send raw text and the index vectorizer embeds it for you — no embedding code in your application path.
{
"name": "kb-skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"textSplitMode": "pages",
"maximumPageLength": 2000,
"pageOverlapLength": 250,
"context": "/document",
"inputs": [{ "name": "text", "source": "/document/content" }],
"outputs": [{ "name": "textItems", "targetName": "pages" }]
},
{
"@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
"resourceUri": "https://my-aoai.openai.azure.com",
"deploymentId": "text-embedding-3-large",
"modelName": "text-embedding-3-large",
"dimensions": 1536,
"context": "/document/pages/*",
"inputs": [{ "name": "text", "source": "/document/pages/*" }],
"outputs": [{ "name": "embedding", "targetName": "vector" }]
}
],
"indexProjections": {
"selectors": [
{
"targetIndexName": "kb-chunks",
"parentKeyFieldName": "parent_id",
"sourceContext": "/document/pages/*",
"mappings": [
{ "name": "content", "source": "/document/pages/*" },
{ "name": "content_vector", "source": "/document/pages/*/vector" },
{ "name": "title", "source": "/document/title" },
{ "name": "url", "source": "/document/metadata_storage_path" },
{ "name": "last_modified", "source": "/document/metadata_storage_last_modified" }
]
}
],
"parameters": { "projectionMode": "skipIndexingParentDocuments" }
}
}
projectionMode: skipIndexingParentDocuments is the key flag: it tells the indexer to index only the chunk projections, not the original parent document, so your index contains exactly the chunk granularity you want. The AzureOpenAIEmbeddingSkill should authenticate with a managed identity (assign the search service’s identity the Cognitive Services OpenAI User role on the AOAI resource) rather than an API key — set authIdentity or rely on the system-assigned identity.
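textSplitMode: pages is a misnomer: it means fixed-size chunks bounded by maximumPageLength (measured in characters in the stable API), not literal document pages. The other split mode is sentences. SplitSkill itself has no header-aware mode; for documentation sites, splitting on Markdown header structure happens upstream of the skill, for example through the indexer's markdown parsing mode. That mode was introduced in preview API versions, so confirm it is available in the API version you deploy with.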
3. Hybrid search: BM25 + vectors with RRF
Vector search wins on semantics (“how do I roll back a release”); BM25 wins on exact tokens (product names, error codes, acronyms). Hybrid search runs both and fuses them with Reciprocal Rank Fusion (RRF), which scores each document by sum(1 / (k + rank)) across result sets. It needs no score normalization or weight tuning: k is a fixed smoothing constant (60 in the original RRF formulation), not something you set per query. You almost always want hybrid for RAG.
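A worked example makes the fusion concrete. The sketch below fuses two ranked lists of chunk IDs with the RRF formula. The IDs and rankings are made up, and k = 60 is taken from the RRF literature as an assumption about the constant; the service computes this for you, so this is for intuition only.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["c7", "c2", "c9"]    # hypothetical keyword ranking
vector_hits = ["c2", "c4", "c7"]  # hypothetical vector ranking
fused = rrf_fuse([bm25_hits, vector_hits])
print([doc_id for doc_id, _ in fused])  # ['c2', 'c7', 'c4', 'c9']
```

Note that c2 beats c7 even though c7 was the top keyword hit: a document that ranks well in both lists outscores one that is first in a single list and lower in the other.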
A hybrid query sends both a search (text) and a vectorQueries block. With an index-level vectorizer you can pass text and let the service embed it (kind: text):
curl -X POST \
"https://my-search.search.windows.net/indexes/kb-chunks/docs/search?api-version=2024-07-01" \
-H "Content-Type: application/json" \
-H "api-key: $SEARCH_QUERY_KEY" \
-d '{
"search": "how do I roll back a failed deployment",
"vectorQueries": [
{
"kind": "text",
"text": "how do I roll back a failed deployment",
"fields": "content_vector",
"k": 50
}
],
"select": "chunk_id,title,url,content,last_modified",
"top": 10
}'
Notes that change behavior:
- k on the vector query is the number of nearest neighbors retrieved before fusion, distinct from top (final result count). Over-retrieve (k: 50) then let RRF and semantic ranking trim to top: 10.
- exhaustive: true forces a brute-force KNN scan instead of the HNSW approximation. It is useful for evaluating recall offline and far too slow for production.
- Add a filter to scope by metadata (security, freshness, source). With hybrid + filters you get the precision of keyword search and the recall of vectors, scoped to what the user is allowed to see.
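If you assemble the request body in application code, a small builder keeps k, top, and the filter in one place. This is a sketch, not an SDK call: build_hybrid_query is a hypothetical helper, the field names come from the kb-chunks schema above, and the freshness filter in the usage line is an illustrative value.

```python
import json
from typing import Optional


def build_hybrid_query(text: str, *, k: int = 50, top: int = 10,
                       odata_filter: Optional[str] = None) -> dict:
    """Build a hybrid (BM25 + vector) search request body.

    The same text feeds both the keyword query and the vector query;
    the index-level vectorizer embeds it server-side (kind: text).
    """
    body = {
        "search": text,
        "vectorQueries": [
            {"kind": "text", "text": text, "fields": "content_vector", "k": k}
        ],
        "select": "chunk_id,title,url,content,last_modified",
        "top": top,
    }
    if odata_filter:
        body["filter"] = odata_filter  # applied to both keyword and vector legs
    return body


if __name__ == "__main__":
    body = build_hybrid_query(
        "how do I roll back a failed deployment",
        odata_filter="last_modified ge 2024-01-01T00:00:00Z",  # illustrative value
    )
    print(json.dumps(body, indent=2))
```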
4. Semantic ranker: reranking and tradeoffs
RRF gives a good first-pass ordering, but the top result is not always the most relevant — it is the most similar. Semantic ranking sends the top ~50 fused results to a cross-encoder model that re-scores them on actual query-passage relevance, and it can return captions (highlighted snippets) and extractive answers. This is the single highest-ROI feature for grounding quality.
Configure a semantic configuration on the index, naming which fields the reranker reads:
{
"semantic": {
"configurations": [
{
"name": "kb-semantic",
"prioritizedFields": {
"titleField": { "fieldName": "title" },
"prioritizedContentFields": [{ "fieldName": "content" }],
"prioritizedKeywordsFields": []
}
}
]
}
}
Then enable it on the query:
{
"search": "how do I roll back a failed deployment",
"vectorQueries": [
{ "kind": "text", "text": "how do I roll back a failed deployment",
"fields": "content_vector", "k": 50 }
],
"queryType": "semantic",
"semanticConfiguration": "kb-semantic",
"captions": "extractive",
"answers": "extractive|count-3",
"top": 10
}
Tradeoffs to internalize:
- Latency: semantic ranking adds tens to low-hundreds of milliseconds. It reranks at most the top 50 documents (the fused candidate set), so it does not scale with index size — but it is not free.
- Cost and quota: semantic ranking is billed and rate-limited (queries per second). It is a separate plan tier you enable per service. Budget it; do not enable it on every autocomplete keystroke.
- @search.rerankerScore ranges 0-4. It is a far better confidence signal than the raw similarity score. A practical pattern: if the top rerankerScore is below ~1.5, treat retrieval as “no good answer” and have the LLM say so instead of grounding on weak context.
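The low-confidence cutoff can be sketched as a small gate between retrieval and prompt construction. The code below assumes results are dicts shaped like the REST response (with an "@search.rerankerScore" key) and uses the ~1.5 threshold suggested above as a default; select_grounding is a hypothetical helper, and the right threshold for your corpus should come from your own eval set.

```python
def select_grounding(results: list[dict], min_top_score: float = 1.5) -> list[dict]:
    """Return results to ground on, or [] when the best reranker score is weak.

    An empty list is the signal for the caller to answer "I don't know"
    instead of building a prompt from marginal context.
    """
    scored = [r for r in results if r.get("@search.rerankerScore") is not None]
    if not scored:
        return []
    best = max(r["@search.rerankerScore"] for r in scored)
    if best < min_top_score:
        return []
    return sorted(scored, key=lambda r: r["@search.rerankerScore"], reverse=True)


if __name__ == "__main__":
    strong = [
        {"chunk_id": "b", "@search.rerankerScore": 1.1},
        {"chunk_id": "a", "@search.rerankerScore": 2.8},
    ]
    weak = [{"chunk_id": "z", "@search.rerankerScore": 1.2}]
    print([r["chunk_id"] for r in select_grounding(strong)])
    print(select_grounding(weak))
```

Gating on the top score rather than dropping individual low-scoring results keeps supporting context available once at least one strong hit exists.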
5. Indexers, data sources, and change tracking
Indexers pull from a data source, run the skillset, and write to the index — on a schedule, incrementally. For Blob Storage, the indexer tracks LastModified automatically. For SQL or Cosmos DB, configure a high-water-mark change-detection policy so only changed rows are re-pulled.
{
"name": "kb-datasource",
"type": "azureblob",
"credentials": { "connectionString": "ResourceId=/subscriptions/.../storageAccounts/kbsa;" },
"container": { "name": "documents" },
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
}
}
Using a ResourceId= connection string (no account key) lets the search service authenticate to Blob with its managed identity — assign it Storage Blob Data Reader. The NativeBlobSoftDeleteDeletionDetectionPolicy makes soft-deleted blobs propagate as deletes into the index, so retired documents stop grounding answers.
{
"name": "kb-indexer",
"dataSourceName": "kb-datasource",
"targetIndexName": "kb-chunks",
"skillsetName": "kb-skillset",
"schedule": { "interval": "PT2H" },
"parameters": {
"configuration": {
"dataToExtract": "contentAndMetadata",
"parsingMode": "default",
"indexedFileNameExtensions": ".pdf,.docx,.md,.html"
}
}
}
schedule.interval uses an ISO 8601 duration (PT2H = every 2 hours; minimum is PT5M). To keep enrichment costs down, enable an enrichment cache (a cache property on the indexer that points at a storage connection) so unchanged documents are not re-embedded on every run. Embeddings are the expensive part, and re-vectorizing unchanged content is pure waste. Incremental enrichment has been a preview feature, so confirm that the API version you deploy with accepts the cache property.
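The principle behind skipping unchanged content is simple enough to show outside the service. This sketch assumes you run your own embedding step and keep a map of content hashes between runs; it illustrates the idea only and makes no claim about how the indexer's cache is implemented. chunks_needing_embedding and the in-memory seen_hashes store are hypothetical.

```python
import hashlib


def chunks_needing_embedding(chunks: dict[str, str],
                             seen_hashes: dict[str, str]) -> list[str]:
    """Return chunk IDs whose content changed since the last run.

    seen_hashes maps chunk_id -> SHA-256 of the content embedded last time
    and is updated in place, so the next call sees this run's state.
    """
    changed = []
    for chunk_id, content in chunks.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if seen_hashes.get(chunk_id) != digest:
            changed.append(chunk_id)
            seen_hashes[chunk_id] = digest
    return changed


if __name__ == "__main__":
    state: dict[str, str] = {}
    run1 = chunks_needing_embedding({"a": "rollback steps", "b": "escalation"}, state)
    run2 = chunks_needing_embedding({"a": "rollback steps", "b": "escalation v2"}, state)
    print(run1, run2)
```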
6. Security: private endpoints, RBAC, and document-level filtering
Three layers, all of which a regulated tenant will require:
Network. Disable public access and reach the service over a private endpoint; the indexer reaches Blob/AOAI over a shared private link.
az search service update \
--name my-search --resource-group rg-search \
--public-network-access disabled
az search shared-private-link-resource create \
--name spl-aoai --service-name my-search --resource-group rg-search \
--group-id openai_account \
--resource-id "/subscriptions/<sub>/resourceGroups/rg-aoai/providers/Microsoft.CognitiveServices/accounts/my-aoai" \
--request-message "indexer access to AOAI"
Identity, not keys. Enable RBAC for the data plane and turn off API keys entirely once callers use Entra tokens:
az search service update \
--name my-search --resource-group rg-search \
--auth-options aadOrApiKey --aad-auth-failure-mode http403
# After migrating callers to Entra tokens, tighten to RBAC-only:
az search service update \
--name my-search --resource-group rg-search \
--disable-local-auth true
Use built-in roles: Search Index Data Reader for query-only app identities, Search Index Data Contributor for ingestion, Search Service Contributor for control-plane changes. Application backends should hold Data Reader and nothing more.
Document-level security. AI Search has no row-level security, so you enforce it in the query with a filter on a field carrying the user’s allowed groups. Stamp each chunk with security_group (the AD groups that may see it), then filter at query time using search.in for an efficient set match:
{
"search": "quarterly revenue",
"filter": "security_group/any(g: search.in(g, 'finance-readers,exec-team', ','))",
"queryType": "semantic",
"semanticConfiguration": "kb-semantic",
"top": 10
}
The app must derive that group list from the caller’s validated token, never from client input. This is the standard “security trimming” pattern — there is no server-side identity join, so a missing filter is a data-leak bug, not a feature gap.
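Because a malformed or missing filter is a leak, it is worth building the filter string in exactly one function. The sketch below is one way to do that under stated assumptions: groups is the list your backend extracted from the validated token, build_security_filter is a hypothetical helper, and "|" is used as the search.in delimiter instead of the comma shown above so that group names containing commas or spaces are not split.

```python
def build_security_filter(groups: list[str], field: str = "security_group") -> str:
    """Build an OData filter matching chunks visible to any of `groups`."""
    if not groups:
        # Fail closed: a caller with no groups must never get an open query.
        raise PermissionError("caller has no groups; refusing to build a filter")
    for group in groups:
        if "|" in group:
            raise ValueError(f"group name contains the delimiter: {group!r}")
    # OData string literals escape a single quote by doubling it.
    joined = "|".join(groups).replace("'", "''")
    return f"{field}/any(g: search.in(g, '{joined}', '|'))"


if __name__ == "__main__":
    print(build_security_filter(["finance-readers", "exec-team"]))
```

Raising on an empty group list, rather than returning an empty string, means a token-parsing bug surfaces as an error instead of an unfiltered query.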
7. Scaling: replicas, partitions, and latency
Capacity is replicas x partitions search units (SUs), billed as their product.
- Partitions shard the index — they add storage and write/ingestion throughput, and improve query latency on large indexes by parallelizing the scan. Add partitions when the index outgrows one partition or queries are scan-bound.
- Replicas are full copies — they add query QPS and are required for the query SLA (the 99.9% read SLA needs >= 2 replicas; writes need >= 3). Add replicas when you are QPS-bound or need the SLA.
az search service update \
--name my-search --resource-group rg-search \
--partition-count 2 --replica-count 3
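The capacity arithmetic and the SLA thresholds stated above fit in a few lines, which is handy for a pre-deployment check. This sketch encodes only what this section says (SUs are the product, two replicas for the read SLA, three for read-write); the function names are hypothetical and tier-specific limits are deliberately left out.

```python
def search_units(replicas: int, partitions: int) -> int:
    """Billed search units are the product of replicas and partitions."""
    if replicas < 1 or partitions < 1:
        raise ValueError("replicas and partitions must both be at least 1")
    return replicas * partitions


def sla_coverage(replicas: int) -> str:
    """Which workloads the SLA covers for a given replica count."""
    if replicas >= 3:
        return "read-write"
    if replicas == 2:
        return "read-only"
    return "none"


if __name__ == "__main__":
    replicas, partitions = 3, 2  # the values from the az command above
    print(search_units(replicas, partitions), sla_coverage(replicas))
```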
Latency levers, in order of impact:
- Lower efSearch in the HNSW algorithm configuration if recall headroom allows; it is the direct query-time vs. recall knob.
- select only the fields you need. Never return content_vector to the client; it inflates payload and serialization time. Set stored: false on it.
- Keep k and top modest. Over-retrieval feeds semantic ranking but costs bandwidth and rerank time.
- Right-size chunks. Fewer, well-sized chunks mean a smaller graph and faster traversal than millions of tiny fragments.
Scaling is not instant — adding partitions reprovisions and can take time, and you cannot reduce partitions and replicas in the same operation as some other changes. Provision ahead of launch, not during the incident.
8. Grounding into Azure OpenAI with citations and freshness
Now wire retrieval into the LLM. Two paths:
Path A — “On Your Data” (managed). Azure OpenAI’s chat completions accept a data_sources block pointing at your index; the service runs hybrid+semantic retrieval and injects grounding automatically, returning citations in the response. Lowest code, less control over the prompt.
Path B — explicit RAG (recommended for control). You retrieve, you build the prompt, you own the citation format. This is where freshness and provenance live:
# 1) Retrieve (hybrid + semantic) via the Search SDK, then:
sources = "\n\n".join(
f"[{i+1}] (updated {d['last_modified']}) {d['content']}\nURL: {d['url']}"
for i, d in enumerate(results)
)
system = (
"Answer ONLY from the sources below. Cite every claim as [n]. "
"If the sources do not contain the answer, say you don't know. "
"Prefer sources with the most recent 'updated' date when they conflict."
)
messages = [
{"role": "system", "content": system},
{"role": "user", "content": f"{question}\n\nSources:\n{sources}"},
]
# 2) Call chat.completions; render [n] -> d['url'] in the UI.
The freshness rule earns its place when an SOP changes: stamp last_modified into every chunk (Section 2 mapped it), surface it in the prompt, and instruct the model to prefer recent sources on conflict. Combined with the soft-delete deletion policy from Section 5, retired content stops grounding answers and current content wins ties — which is exactly the “is this the current SOP?” question that sinks naive RAG.
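The last step, turning [n] markers into links, is also where you can catch a model citing a source that was never supplied. A minimal sketch, assuming results is the same ordered list used to build the prompt and that each item carries a url; link_citations is a hypothetical helper.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def link_citations(answer: str, results: list[dict]) -> tuple[dict[int, str], list[int]]:
    """Map cited [n] markers to URLs and report markers with no matching source."""
    links: dict[int, str] = {}
    invalid: list[int] = []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(results):
            links[n] = results[n - 1]["url"]  # [n] is 1-based in the prompt
        elif n not in invalid:
            invalid.append(n)
    return links, invalid


if __name__ == "__main__":
    results = [
        {"url": "https://example.com/sop/rollback"},    # placeholder URLs
        {"url": "https://example.com/sop/escalation"},
    ]
    answer = "Roll back with the pipeline [2]. Notify on-call [1][5]."
    print(link_citations(answer, results))
```

A non-empty invalid list is a cheap hallucination signal: log it, and consider withholding the answer or stripping the unsupported claim.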
Enterprise scenario
A financial-services platform team ran a single shared “all-knowledge-base” index for an internal copilot across legal, HR, and trading-desk content. Two failures surfaced in the same week. First, a compliance audit found that an HR analyst’s query had surfaced a trading-desk memo in the citations — the app filtered on department only when the request included it, and one code path omitted the filter, leaking restricted content. Second, the copilot kept citing a superseded trading SOP because the old PDF and its replacement both sat in the index with near-identical embeddings, and pure similarity ranked the longer (older) document first.
The constraint: they could not split into per-department services (cost and operational sprawl across twelve business units), and they could not afford another audit finding.
The fix was two changes, both at the retrieval layer. They made the security filter non-optional by moving it into a thin retrieval wrapper that every caller had to go through — the wrapper derived groups from the validated Entra token and always appended security_group/any(g: search.in(...)), so no application code could issue an unfiltered query. And they enabled semantic ranking plus a freshness tiebreak so the current SOP won.
{
"search": "margin call escalation procedure",
"filter": "security_group/any(g: search.in(g, 'trading-desk', ',')) and is_current eq true",
"vectorQueries": [
{ "kind": "text", "text": "margin call escalation procedure",
"fields": "content_vector", "k": 50 }
],
"queryType": "semantic",
"semanticConfiguration": "kb-semantic",
"select": "chunk_id,title,url,content,last_modified",
"top": 8
}
An is_current boolean was stamped by the ingestion pipeline (set false on the prior version when a replacement landed), turning “prefer recent” into a hard filter rather than a soft hope. Zero unfiltered queries could leave the wrapper, and the superseded-SOP citations disappeared. No new services, no per-team indexes — just the two retrieval-layer guarantees that the naive design had left to chance.
Verify
Confirm each layer behaves before you point an LLM at it.
# Index exists with the expected fields and vector profile
curl -s "https://my-search.search.windows.net/indexes/kb-chunks?api-version=2024-07-01" \
-H "api-key: $ADMIN_KEY" | jq '.fields[].name, .vectorSearch.profiles'
# Indexer ran and produced documents (check status + counts)
curl -s "https://my-search.search.windows.net/indexers/kb-indexer/status?api-version=2024-07-01" \
-H "api-key: $ADMIN_KEY" | jq '.lastResult.status, .lastResult.itemsProcessed, .lastResult.errors'
# Document count is non-zero
curl -s "https://my-search.search.windows.net/indexes/kb-chunks/docs/\$count?api-version=2024-07-01" \
-H "api-key: $ADMIN_KEY"
- Run an exact-token query (an error code) and a semantic query (“how do I undo a release”) against the same hybrid endpoint; both should return relevant chunks. If the token query misses, your analyzer or BM25 field is wrong.
- Inspect @search.rerankerScore on a known-good query; it should be high (>2). On an off-topic query it should be low, proving your low-confidence cutoff will fire.
- Issue a query as a user who should not see restricted content and confirm the filter excludes it. Then remove the filter and confirm the restricted doc would have appeared, proving the trim is doing real work.
- Soft-delete a blob, let the indexer run, and confirm the chunk disappears from the index.
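Once the manual checks pass, the same queries can become a regression metric. The sketch below computes hit rate at k over a small eval set, assuming you have already run each query and collected the parent_id of every returned chunk in rank order. The eval-set shape and the name hit_rate_at_k are assumptions for illustration, not part of any SDK.

```python
def hit_rate_at_k(eval_set: dict[str, str],
                  retrieved: dict[str, list[str]],
                  k: int = 10) -> float:
    """Fraction of queries whose expected parent_id appears in the top-k results.

    eval_set maps query -> expected parent_id;
    retrieved maps query -> parent_ids of returned chunks, best first.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        1 for query, expected in eval_set.items()
        if expected in retrieved.get(query, [])[:k]
    )
    return hits / len(eval_set)


if __name__ == "__main__":
    eval_set = {"error code 0x80070057": "docA", "how do I undo a release": "docB"}
    retrieved = {
        "error code 0x80070057": ["docX", "docA"],
        "how do I undo a release": ["docC"],
    }
    print(hit_rate_at_k(eval_set, retrieved, k=10))
```

Track this number whenever you change chunk size, analyzer, or efSearch, so a tuning change that hurts retrieval shows up before users notice.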