
Hybrid Vector Search Architecture (pgvector + reranking)

A national retail bank ships a “search our knowledge” feature to 6,000 branch and contact-centre agents, and within a fortnight the complaints arrive. An agent types “early closure penalty on FD account no. 50410023” and the system confidently returns a passage about foreclosure of home loans — semantically adjacent, factually wrong, and the kind of mistake that ends in a mis-sold product and a regulator’s letter. The team had done everything the tutorials said: chunked the documents, embedded them with a good model, stored the vectors, and ranked by cosine similarity. The problem is that pure vector search is fuzzy by design. It is brilliant at “this means roughly the same thing” and mediocre at “this exact account number, this exact clause, this exact circular dated last Tuesday.” In a domain where account numbers, IFSC codes, statute citations, and effective dates must match exactly, fuzzy is a liability.

This article is a reference architecture for the pattern that fixes it: hybrid vector search with reranking. Dense vectors find what the query means; a lexical engine (BM25) finds what it literally says; the two result sets are fused; and a cross-encoder reranker does a final, expensive, high-precision pass over the survivors. Layered on top is a freshness-aware embedding pipeline so the index reflects last night’s circular, not last quarter’s. It is built to run on either of the two stacks most enterprises already own — PostgreSQL with pgvector, or OpenSearch / Elasticsearch — and to satisfy the security, cost, and governance bar a regulated workload demands.

The business scenario

The driver is precision under regulation. Our bank — call it a mid-size scheduled commercial bank, ~8,000 staff, ~1,400 branches — has a policy corpus that is unforgiving: product T&Cs, RBI master directions, internal circulars superseded weekly, and FAQ sheets in three languages. An agent answering a customer has seconds, cannot read a 40-page PDF, and must not improvise. Three properties make naïve retrieval fail here.

Exactness matters. “Section 80C” is not “Section 80D.” “FD” (fixed deposit) is not “RD” (recurring deposit). A dense embedding squashes these into nearby points in vector space; a customer gets the wrong tax treatment. Lexical search treats the token as the token.

Freshness is non-negotiable. When a circular changes the premature-withdrawal penalty on Monday, an answer grounded in Friday’s version is a compliance breach, not a stale cache. Retrieval must know which version is current and must be able to demote or hide the superseded one within minutes of ingestion.

Precision beats recall at the top. The agent reads passage #1, maybe #2. It does not matter that the right answer is somewhere in the top 50; it must be rank 1. This is exactly what a cross-encoder reranker is for, and exactly what cosine-sorted vector search is bad at, because bi-encoder similarity is a coarse proxy for true relevance.

The naïve fixes fail predictably. Pure vector search loses every exact-match query. Pure keyword search (the bank’s legacy Solr box) misses anything phrased differently from the document: “penalty for breaking my deposit early” never matches “premature closure charges.” A bigger embedding model improves semantics but does nothing for exactness or freshness, and costs more per token. Hybrid + rerank threads the needle: recall from two complementary retrievers, precision from a reranker, currency from the pipeline.

The scenario scales cleanly from a single product team (one corpus, a few hundred queries a day, pgvector on the database they already run) to the whole bank (dozens of corpora, thousands of queries an hour, OpenSearch with dedicated reranking GPUs). The shape of the diagram does not change — only the engine and the replica counts.

Architecture overview

The design has two paths on different clocks: an ingestion/indexing pipeline (event-driven, keeps the index fresh and correct) and a query pipeline (synchronous, serves agents in real time). Keeping them mentally separate is the first discipline of operating this well.

[Figure: Hybrid Vector Search Architecture (pgvector + reranking), architecture diagram]

Query path, numbered as in the diagram: (1) the agent’s request enters through an edge tier — Akamai for global TLS termination, WAF, and bot/flood protection — and authenticates against the corporate IdP. The bank standardised on Okta for workforce SSO (it federates the contact-centre app), issuing a short-lived OIDC token that carries the agent’s role and the data-domain groups they may read. (2) The request lands on the retrieval orchestrator, a stateless service the bank runs on Kubernetes. The orchestrator does query understanding — light normalisation, optional spell-fix, and crucially entity extraction that pulls out account-number-shaped and citation-shaped tokens so the lexical leg can match them verbatim. (3) It fans out two retrievers in parallel: a dense vector query (embed the question, ANN search) and a BM25 lexical query against the same corpus. (4) The two candidate lists are fused — Reciprocal Rank Fusion (RRF) is the default — into one merged top-N (typically N≈50–100). (5) A cross-encoder reranker scores each (query, passage) pair jointly and re-sorts; the orchestrator keeps the top-k (often 3–8) and applies a freshness/recency tiebreak and a security filter. (6) The results — passage text plus citations and effective dates — return to the agent UI; for a RAG flow they would be handed to an LLM as grounding, but precision retrieval is valuable on its own, and many enterprise “answer search” surfaces stop at step 6.
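The entity-extraction part of step (2) can be sketched roughly as follows. This is a minimal illustration, not the bank's implementation: the three regular expressions and the `ParsedQuery` / `parse_query` names are hypothetical, and real account-number and citation formats would need institution-specific rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only; real formats need bank-specific rules.
ACCOUNT_RE = re.compile(r"\b\d{8,16}\b")                       # account-number-shaped digit runs
SECTION_RE = re.compile(r"\b[Ss]ection\s+(\d+[A-Z]{0,3})\b")   # e.g. "Section 80C"
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")              # 4 letters, a zero, 6 alphanumerics


@dataclass
class ParsedQuery:
    text: str                                           # normalised text sent to both retrievers
    entities: list[str] = field(default_factory=list)   # tokens the lexical leg should match verbatim


def parse_query(raw: str) -> ParsedQuery:
    """Collapse whitespace and pull out exact-match tokens for the lexical leg."""
    text = " ".join(raw.split())
    entities: list[str] = []
    entities += ACCOUNT_RE.findall(text)
    entities += [f"Section {m}" for m in SECTION_RE.findall(text)]
    entities += IFSC_RE.findall(text)
    return ParsedQuery(text=text, entities=entities)
```

The extracted entities would typically be sent to a boosted exact-match field on the lexical leg, while the full normalised text goes to both retrievers.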

Ingestion path runs independently: documents land in object storage (S3 or Azure Blob, synced from SharePoint, the DMS, and the circulars portal), an event triggers a parsing + chunking worker, each chunk is embedded by the embedding service, and the chunk plus its vector plus rich metadata (source_uri, effective_from, superseded_by, acl_groups[], doc_version) is upserted into both the vector store and the lexical index. The metadata is what makes freshness and security real at query time, not bolt-ons.

The defining property of the topology: dense and lexical signals are first-class peers, the reranker is the arbiter, and freshness/ACL are columns on the data — not afterthoughts in app code.

Component breakdown

| Stage | pgvector stack | OpenSearch stack | Role | Key configuration |
|---|---|---|---|---|
| Edge | Akamai + WAF | Akamai + WAF | TLS, DDoS, bot/flood, caching | Custom rule for query-flood; private origin to the orchestrator |
| Identity | Okta / Entra ID (OIDC) | Okta / Entra ID (OIDC) | SSO, role + data-domain claims | validate-jwt; map groups → acl_groups filter |
| Dense retrieval | Postgres + pgvector (HNSW) | OpenSearch knn_vector (HNSW/FAISS) | ANN over embeddings | vector_cosine_ops; m=16, ef_construction=64; ef_search tuned to recall |
| Lexical retrieval | Postgres FTS / ParadeDB (BM25) | OpenSearch BM25 (match) | Exact tokens, rare terms | analyzer per language; boost exact phrase + entity fields |
| Fusion | App-side RRF | OpenSearch hybrid query (search pipeline) | Merge the two lists | RRF k≈60, or weighted-sum on normalised scores |
| Rerank | Cross-encoder (BGE/Cohere/Voyage) | Same, or OpenSearch ML rerank processor | Final precision sort | top-50 in, top-k out; batch on GPU/serverless |
| Embeddings | Embedding API (managed or self-hosted) | Same | Vectorise chunks + queries | pin model + dimension; same model for index & query |
| Secrets | HashiCorp Vault | HashiCorp Vault | DB creds, API keys, rotation | dynamic Postgres creds; short TTL leases |
| Observability | Datadog / Dynatrace | Datadog / Dynatrace | Traces, recall, latency, cost | span per retriever + rerank; custom recall@k metric |

A few choices deserve the why, because they are the ones teams get wrong.

Why two retrievers, not a better single one. Dense and lexical retrieval fail in different directions, which is exactly what you want in an ensemble. Dense recovers paraphrase and intent (“break my deposit early” → “premature closure”); lexical recovers exact tokens and rare strings (account numbers, “80C”, error codes, product SKUs) that an embedding blurs. Run both, fuse, and you get the union of their strengths. A single retriever, however good, leaves you with its own blind spots on exactly the queries it is bad at.

Why RRF for fusion, before you reach for weighted sums. Dense similarity scores and BM25 scores live on incompatible scales; naïvely adding them lets whichever engine has the larger numeric range dominate. Reciprocal Rank Fusion sidesteps the problem entirely by fusing on rank position, not score: each document’s fused score is Σ 1/(k + rank_i) across the lists it appears in (with k≈60). It is parameter-light, robust, and a strong default. Move to a tuned weighted-sum (with per-engine score normalisation) only once you have an evaluation set proving it beats RRF on your corpus — usually it is a few points, not a transformation.
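A minimal sketch of Reciprocal Rank Fusion as described above, assuming each retriever returns an ordered list of chunk ids (best first). The function name `rrf_fuse` and the `per_leg_cap` parameter are illustrative choices, not part of any library.

```python
from collections import defaultdict


def rrf_fuse(
    ranked_lists: list[list[str]],
    k: int = 60,
    top_n: int = 50,
    per_leg_cap: int = 100,
) -> list[tuple[str, float]]:
    """Fuse ranked id lists by summing 1 / (k + rank) over the lists each id appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Cap each leg so one over-eager retriever cannot flood the candidate pool.
        for rank, doc_id in enumerate(ranking[:per_leg_cap], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break exact ties by id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]
```

With `dense = ["a", "b", "c"]` and `lexical = ["c", "a", "d"]`, document `a` (ranks 1 and 2) edges out `c` (ranks 3 and 1), and both beat the documents that appear in only one list. No raw scores are compared at any point, which is why the incompatible scales do not matter.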

Why a cross-encoder reranker is the highest-leverage component. The dense retriever uses a bi-encoder: query and document are embedded separately and compared by cosine. That is fast (you can pre-compute every document vector) but coarse, because the model never sees the two texts together. A cross-encoder feeds (query, passage) through the model jointly, so it can weigh every query term against every passage term, and it is markedly more accurate at “is this passage actually the answer.” It is too slow to run over the whole corpus, which is the entire point of the two-stage design: cheap retrievers cut millions of chunks to ~50; the expensive reranker judges only those 50. In practice the reranker often lifts top-1 accuracy more than an embedding-model upgrade or prompt tweak does; confirm that on your own evaluation set.
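The second stage of that funnel can be sketched as below. The pair scorer is passed in as a plain callable; in production it would wrap a cross-encoder model, whereas `toy_overlap_scorer` here is only a stand-in so the control flow can run without a model. All names are hypothetical.

```python
from typing import Callable

# (query, passage) -> relevance score. In production this would call a cross-encoder.
PairScorer = Callable[[str, str], float]


def rerank(
    query: str,
    candidates: dict[str, str],   # chunk_id -> passage text, already cut to ~50 by retrieval
    score_pair: PairScorer,
    top_k: int = 6,
) -> list[tuple[str, float]]:
    """Score every (query, passage) pair jointly and keep the best top_k."""
    scored = [(cid, score_pair(query, text)) for cid, text in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_scorer(query: str, passage: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query tokens found in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0
```

The design point is the shape: the expensive scorer sees only the short candidate list, so its cost is bounded by the candidate count rather than the corpus size.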

Why freshness is a retrieval concern, not a cache TTL. A superseded circular is not “stale” the way a CDN asset is stale — serving it is wrong. So freshness is encoded as data: every chunk carries effective_from, and superseding a document sets superseded_by on the old chunks. At query time the orchestrator filters out superseded chunks and applies a recency tiebreak so that, between two equally-relevant passages, the current one wins. This makes “as of today” correctness a property of the index, recomputed on every ingest, not a hope that a cache expired.
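As a rough sketch of the query-time half of this, the filter and tiebreak might look like the following. The `Passage` type, the `tie_epsilon` bucketing, and the exclusion of not-yet-effective chunks are assumptions added for illustration; the source only specifies filtering superseded chunks and preferring the current one among equally relevant passages.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Passage:
    chunk_id: str
    score: float                          # reranker (or fused) relevance score
    effective_from: Optional[date]        # nullable, as in the chunk table
    superseded_by: Optional[str] = None   # None means current


def apply_freshness(passages: list[Passage], today: date, tie_epsilon: float = 0.01) -> list[Passage]:
    """Drop superseded (and, by assumption, not-yet-effective) passages, then sort by
    relevance with recency as the tiebreak among near-equal scores."""
    current = [
        p for p in passages
        if p.superseded_by is None
        and (p.effective_from is None or p.effective_from <= today)
    ]

    def sort_key(p: Passage) -> tuple[int, int]:
        bucket = round(p.score / tie_epsilon)   # scores in the same bucket count as tied
        recency = p.effective_from.toordinal() if p.effective_from else 0
        return (-bucket, -recency)

    return sorted(current, key=sort_key)
```

Bucketing is a crude way to define "equally relevant" and has edge effects at bucket boundaries; a production system would tune or replace it against the evaluation set.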

Implementation guidance

Choosing the engine. Both stacks implement the same architecture; pick on operational gravity, not benchmarks.

| Aspect | pgvector (Postgres) | OpenSearch / Elasticsearch |
|---|---|---|
| Best when | You already run Postgres; corpus ≲ low tens of millions of chunks; you want vectors, metadata, and ACL filters transactionally consistent in one store | Corpus is large; you need native hybrid search, sharding, and a mature analyzer/BM25 stack; search is its own platform |
| Hybrid | App-side: run vector + FTS queries, fuse in the orchestrator | Native: one hybrid query + search pipeline does fusion server-side |
| Freshness/ACL | SQL WHERE on the same row: strong consistency, trivial joins | Filter clause in the query DSL; eventually consistent across shards |
| Scaling pain | Vertical first; ANN recall vs. ef_search vs. write load tradeoffs | Horizontal sharding, but cluster ops + JVM heap tuning |
| Reranking | External service (engine reranks nothing itself) | External, or in-cluster ML rerank processor |

A pragmatic rule: start on pgvector if Postgres is already in your estate — one store, ACID metadata, dynamic creds from Vault, far less to operate — and graduate to OpenSearch when corpus size, native hybrid, or search-as-a-platform needs force it. Our bank ran the contact-centre corpus on pgvector and the enterprise-wide document search on OpenSearch; same architecture, two engines.

The pgvector index, concretely. Use HNSW (not IVFFlat) for low-latency ANN, and put the lexical and vector columns on the same table so filters are one WHERE:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunk (
  id            bigserial PRIMARY KEY,
  doc_id        text NOT NULL,
  content       text NOT NULL,
  embedding     vector(1024) NOT NULL,        -- pin the model's dimension
  ts            tsvector GENERATED ALWAYS AS  -- lexical full-text column (ts_rank scoring, not true BM25)
                (to_tsvector('simple', content)) STORED,
  acl_groups    text[]  NOT NULL,             -- security filter
  effective_from date,
  superseded_by  text                          -- NULL = current
);

CREATE INDEX ON chunk USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
CREATE INDEX ON chunk USING gin (ts);
CREATE INDEX ON chunk USING gin (acl_groups);

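The two legs of the hybrid query then run against this one table; the orchestrator fuses their ranks. Raise hnsw.ef_search (a session setting, e.g. SET hnsw.ef_search = 100; pgvector’s default is 40) until recall@50 plateaus on your eval set: higher ef_search means better recall at the cost of latency. An HNSW scan considers at most ef_search candidates, so set it at least as high as the per-leg LIMIT.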
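The two legs might be issued roughly as below. This is a sketch under assumptions: it only builds the SQL text and parameter dictionaries (psycopg-style named placeholders) and needs a live Postgres with pgvector to execute; `build_legs` and `to_vector_literal` are hypothetical helper names. The lexical leg uses `plainto_tsquery`, which requires every query term to match, so a real orchestrator might instead query only the extracted entities or build an OR query.

```python
# Dense leg: cosine distance (<=>) ordered scan, filtered on the same row.
DENSE_SQL = """
SELECT id
FROM chunk
WHERE superseded_by IS NULL
  AND acl_groups && %(groups)s::text[]
ORDER BY embedding <=> %(qvec)s::vector
LIMIT %(limit)s
"""

# Lexical leg: Postgres full-text match ranked by ts_rank_cd (not true BM25).
LEXICAL_SQL = """
SELECT id
FROM chunk
WHERE superseded_by IS NULL
  AND acl_groups && %(groups)s::text[]
  AND ts @@ plainto_tsquery('simple', %(qtext)s)
ORDER BY ts_rank_cd(ts, plainto_tsquery('simple', %(qtext)s)) DESC
LIMIT %(limit)s
"""


def to_vector_literal(vec: list[float]) -> str:
    """Render a Python list in pgvector's text input format, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.8g}" for x in vec) + "]"


def build_legs(qtext: str, qvec: list[float], groups: list[str], limit: int = 100):
    """Return [(sql, params), (sql, params)] for the dense and lexical legs."""
    if not groups:
        # Fail closed: a caller with no groups must not reach the database at all.
        raise PermissionError("caller has no data-domain groups")
    common = {"groups": list(groups), "limit": limit}
    return [
        (DENSE_SQL, {**common, "qvec": to_vector_literal(qvec)}),
        (LEXICAL_SQL, {**common, "qtext": qtext}),
    ]
```

Each `(sql, params)` pair can be passed to a cursor's `execute`, and the two id lists handed to the fusion step. One caveat worth testing on your own data: pgvector applies the WHERE filters after the approximate index scan, so a very selective ACL filter can return fewer rows than the LIMIT.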

Embedding pipeline + freshness. Ingestion is event-driven so the index tracks reality. A document upload to S3/Blob emits an event (S3 Notification / Event Grid) onto a queue; a worker parses, chunks (~300–500 tokens, 10–15% overlap), and embeds. Two rules keep it correct: (1) pin the embedding model and dimension and use the identical model for indexing and querying — mixing models silently destroys recall because the vectors live in different spaces; (2) make ingestion idempotent and version-aware — upsert by a stable (doc_id, chunk_hash) key, and when a new version of a document arrives, write the new chunks and stamp the old ones superseded_by = <new doc_id> in the same transaction so retrieval never sees a gap or a duplicate. Re-embedding the whole corpus (when you upgrade the embedding model) is a blue/green index operation: build the new index alongside the old, evaluate, then cut traffic over — never re-embed in place.
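The chunking and idempotency rules can be sketched as follows. Whitespace splitting stands in for a real tokenizer, and `chunk_tokens`, `chunk_hash`, and `chunks_to_embed` are hypothetical names; the defaults (400 tokens, 50 overlap, about 12.5%) are just one point inside the ranges mentioned above.

```python
import hashlib


def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break   # the last window already reached the end of the document
    return chunks


def chunk_hash(chunk_text: str) -> str:
    """Stable content hash used with doc_id as the idempotent upsert key."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def chunks_to_embed(text: str, existing_hashes: set[str], size: int = 400, overlap: int = 50) -> list[tuple[str, str]]:
    """Return (chunk_hash, chunk_text) only for chunks not already indexed for this document."""
    tokens = text.split()   # stand-in for the embedding model's tokenizer
    out: list[tuple[str, str]] = []
    for window in chunk_tokens(tokens, size, overlap):
        chunk_text = " ".join(window)
        digest = chunk_hash(chunk_text)
        if digest not in existing_hashes:
            out.append((digest, chunk_text))
    return out
```

Re-running ingestion on an unchanged document yields an empty list, so no embedding calls are made; only new or edited chunks are paid for.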

Identity and secrets. The orchestrator does not hold database passwords. HashiCorp Vault issues dynamic Postgres credentials with short TTL leases (the service requests a credential, uses it, Vault revokes it on expiry), and brokers the embedding/reranker API keys the same way — nothing static to leak, automatic rotation, and a full audit trail of who got which lease. (This is exactly the discipline the bank adopted after a prior incident with long-lived DB passwords committed to a repo.) At query time, the Okta/Entra OIDC token’s group claims are mapped to the acl_groups filter; the agent only ever retrieves passages their groups permit.

Why ACL belongs in the query filter, not after. It is tempting to retrieve broadly and drop forbidden results in app code — don’t. That pulls restricted content into process memory and logs before it is dropped, and one bug leaks it. Instead pass the caller’s groups as a filter the engine applies during retrieval (WHERE acl_groups && :user_groups in pgvector, a terms filter in OpenSearch), so a forbidden passage is never returned and never reranked. Permission stays a property of the data.
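A sketch of turning token claims into an engine-side filter, assuming the token signature has already been validated upstream. The claim name `groups` and the `kb-` prefix are hypothetical conventions, not something Okta or Entra ID prescribes.

```python
def acl_groups_from_claims(claims: dict, claim_name: str = "groups", prefix: str = "kb-") -> list[str]:
    """Extract data-domain groups from validated OIDC claims; fail closed if there are none."""
    raw = claims.get(claim_name, [])
    groups = sorted({g for g in raw if isinstance(g, str) and g.startswith(prefix)})
    if not groups:
        raise PermissionError("no data-domain groups in token; refusing to query")
    return groups


def opensearch_acl_filter(groups: list[str]) -> dict:
    """Bool clause for the OpenSearch query DSL: match any permitted group, hide superseded chunks."""
    return {
        "bool": {
            "filter": [{"terms": {"acl_groups": groups}}],
            "must_not": [{"exists": {"field": "superseded_by"}}],
        }
    }


# Equivalent predicate for the pgvector table; && is the Postgres array-overlap operator.
PG_ACL_PREDICATE = "acl_groups && %(user_groups)s::text[] AND superseded_by IS NULL"
```

Failing closed on an empty group list matters: an empty filter must never be interpreted as "no restriction".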

Enterprise considerations

Failure modes, and they are specific to this pattern. (1) Reranker down or slow. It is a synchronous hop on the hot path; if its GPU pool is saturated, every query stalls. Mitigate with a strict timeout and graceful degradation to fusion-only (RRF-ranked, no rerank): measurably worse precision, but the feature stays up, and you alert on the degradation. (2) Embedding-model drift. The day someone changes the embedding model or its version without re-indexing, query vectors and stored vectors diverge and recall silently collapses: no error, just bad answers. Guard it: store the model id/version on every chunk and refuse to query if the live embedder’s version doesn’t match the index’s. (3) ANN recall cliff. Set ef_search (pgvector) or ef/k (OpenSearch) too low in pursuit of query latency and the right passage never reaches fusion; the reranker cannot fix what retrieval never surfaced. (4) Fusion swamping. One over-eager retriever (e.g., BM25 matching a stopword-heavy query) floods the candidate pool; cap each leg’s contribution before fusing. (5) Stale-but-relevant. A superseded circular scores highest and, without the freshness filter, becomes rank 1, which is exactly the compliance breach the freshness requirement exists to prevent.
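Mitigations (1) and (2) can be sketched together. This is an illustration under assumptions: the reranker is any callable taking `(query, ids)`, the 250 ms default mirrors the timeout used in the reference example, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# Shared pool so a timed-out call does not block the request thread.
_POOL = ThreadPoolExecutor(max_workers=8)

RerankFn = Callable[[str, list[str]], list[str]]   # returns ids re-sorted best first


def rerank_or_fallback(
    query: str,
    fused_ids: list[str],
    rerank_fn: RerankFn,
    timeout_s: float = 0.25,
    top_k: int = 6,
) -> tuple[list[str], bool]:
    """Return (top_k ids, degraded). degraded=True means fusion order was served instead."""
    future = _POOL.submit(rerank_fn, query, fused_ids)
    try:
        return future.result(timeout=timeout_s)[:top_k], False
    except FutureTimeout:
        future.cancel()                      # best effort; a running call cannot be interrupted
        return fused_ids[:top_k], True
    except Exception:
        return fused_ids[:top_k], True       # reranker error: degrade rather than fail the query


def assert_same_embedder(index_model_id: str, live_model_id: str) -> None:
    """Refuse to query when the live embedder differs from the one that built the index."""
    if index_model_id != live_model_id:
        raise RuntimeError(
            f"embedding model mismatch: index={index_model_id!r}, live={live_model_id!r}"
        )
```

The `degraded` flag is what you would count to produce the fusion-only fallback rate and to alert on. Note that a timed-out thread keeps running until the reranker returns, so the reranker client should carry its own network timeout as well.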

Security posture. Beyond identity-based ACL filtering and Vault-brokered secrets: run a CSPM/data-posture scanner — Wiz — over the vector store and object storage so a publicly-exposed S3 bucket of source documents or a Postgres instance with a permissive security group is caught before an attacker finds it (the index is a concentrated copy of your sensitive corpus, so its blast radius is high). Put CrowdStrike Falcon on the orchestrator and OpenSearch/Postgres nodes for runtime threat detection — the reranker and embedding services pull models and make outbound calls, exactly the egress an exfiltration attempt would abuse. Route ingestion changes and any production index cutover through ServiceNow change requests so a re-embed or schema migration has an approval trail and a rollback owner. Treat the embedding pipeline’s parser as an attack surface too: a malicious PDF in the corpus is untrusted input, so sandbox extraction and validate before it ever reaches the index.

Cost optimisation. Three line items dominate, and each has a lever. (1) Reranking compute: the cross-encoder is the priciest per-query stage; control it by reranking fewer candidates (top-50 is usually plenty; reranking 200 rarely changes rank-1 and roughly quadruples rerank compute, since cost grows linearly with the number of pairs scored) and by running the reranker on serverless GPU or batched inference that scales to zero off-peak rather than a pinned 24×7 fleet. A managed rerank API (Cohere, Voyage) trades per-call cost for zero ops; it is cheaper until volume is high, at which point self-hosting an open cross-encoder (BGE-reranker) on your own GPUs wins. (2) Embedding spend: one-time on ingest plus per-query on the question; cache query embeddings for repeated questions and never re-embed unchanged chunks (idempotent upsert pays for itself). (3) Storage/compute for the index: pgvector with HNSW keeps it inside your existing Postgres bill; OpenSearch costs scale with shard count and hot-node RAM, so use index lifecycle management to tier old corpora to cheaper storage. A blunt but effective rule: size the reranker to your p95 query rate, not your peak, and let the timeout-to-fusion-only path absorb the tail.

Scalability. Each stage scales independently. The orchestrator is stateless — scale on concurrency. Dense retrieval scales vertically on pgvector (more RAM so the HNSW graph stays resident) and horizontally on OpenSearch (more shards/replicas). The reranker is the bottleneck to watch: it is GPU-bound and on the hot path, so autoscale it on queue depth and keep batch sizes tuned. A useful release valve at very high QPS is caching the reranked top-k for popular queries (the contact centre asks the same 300 questions on repeat), deflecting both the retrievers and the reranker.
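A minimal sketch of that top-k cache follows. Two details are assumptions added here rather than stated in the text: the cache key includes the caller's ACL groups, so agents with different permissions never share an entry, and an index version, so an ingest that supersedes a circular invalidates old entries. The class name and TTL are hypothetical.

```python
import time
from typing import Callable, Optional


class TopKCache:
    """In-process TTL cache for reranked top-k results (illustrative, single process)."""

    def __init__(self, ttl_s: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_s
        self._clock = clock
        self._store: dict[tuple, tuple[float, list[str]]] = {}

    @staticmethod
    def make_key(query: str, groups: list[str], index_version: str) -> tuple:
        """Key on normalised query text, the caller's permission set, and the index version."""
        normalised = " ".join(query.lower().split())
        return (normalised, tuple(sorted(set(groups))), index_version)

    def get(self, key: tuple) -> Optional[list[str]]:
        hit = self._store.get(key)
        if hit is None:
            return None
        expires_at, value = hit
        if self._clock() >= expires_at:
            del self._store[key]    # expired: drop it and report a miss
            return None
        return value

    def put(self, key: tuple, value: list[str]) -> None:
        self._store[key] = (self._clock() + self._ttl, list(value))
```

In a multi-replica orchestrator the same keying would apply to a shared cache; the TTL should stay shorter than the freshness lag you are willing to tolerate.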

Observability. Instrument the retrieval span end-to-end in Datadog or Dynatrace: one trace covering embed → (dense ‖ lexical) → fuse → rerank → filter, with latency and candidate counts on each hop. Emit the metrics the business actually cares about — recall@k and MRR/nDCG against a golden query set (run in CI so a model or fusion change is scored before it ships), rerank latency p95, fusion-only fallback rate (how often the reranker timed out), freshness lag (ingest-to-queryable time), and cost per 1,000 queries split by stage. The single most useful operational signal is “queries where rank-1 changed after rerank” — it tells you the reranker is earning its cost; if it is near zero, you are paying for nothing.
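The offline metrics named above can be computed with a few lines. This is a sketch assuming the golden set maps each query to the set of chunk ids judged relevant; function names are illustrative.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: dict[str, list[str]], golden: dict[str, set[str]], k: int = 5) -> dict[str, float]:
    """Mean recall@k and MRR over every query in the golden set."""
    n = len(golden)
    if n == 0:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in golden.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in golden.items()) / n
    return {"recall@k": recall, "mrr": mrr}


def rank1_change_rate(before: list[list[str]], after: list[list[str]]) -> float:
    """Share of queries whose top result differs before vs. after reranking."""
    pairs = [(b, a) for b, a in zip(before, after) if b and a]
    if not pairs:
        return 0.0
    return sum(1 for b, a in pairs if b[0] != a[0]) / len(pairs)
```

Running `evaluate` in CI against a fixed golden set, once per candidate configuration, gives a score to gate on before a model or fusion change ships.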

Reference enterprise example

Aravind Federal Bank, a fictional mid-size Indian scheduled commercial bank (~8,000 staff, ~1,400 branches), built this to give contact-centre and branch agents instant, correct, current answers from a policy corpus that changes weekly. Their corpus: ~140,000 documents — product T&Cs, RBI master directions, internal circulars, and tri-lingual FAQs — chunked to ~1.1 million passages.

Decisions they made. They ran the contact-centre surface on pgvector because the agent app already sat on a managed Postgres they operated; one store gave them transactional ACL and freshness filters and let Vault hand the service dynamic, short-TTL DB credentials. Dense retrieval used a 1024-dim open embedding model, self-hosted; lexical used Postgres FTS with a per-language analyzer and an exact-phrase boost on an entity field populated from query-time extraction (account numbers, section citations). Fusion was RRF (k=60) in the orchestrator over top-100 from each leg; a self-hosted BGE cross-encoder reranker on a small serverless-GPU pool re-sorted the top-50 to top-6, with a 250 ms timeout falling back to fusion-only. Freshness was enforced by superseded_by filtering plus a recency tiebreak, written transactionally on every ingest. Okta federated the agent SSO and supplied the data-domain group claims; Akamai fronted the edge; Wiz watched the Postgres instance and the S3 source bucket; CrowdStrike Falcon ran on the nodes; index cutovers went through ServiceNow; the whole estate was Terraform-provisioned and built in GitHub Actions, with the eval harness (recall@k on 2,000 golden agent queries) gating every embedding/fusion change. The enterprise-wide document search, a separate corpus, ran the identical design on OpenSearch with its native hybrid query and in-cluster rerank processor.

The numbers. ~22,000 queries/day at peak. Median end-to-end latency ~190 ms; rerank p95 ~140 ms. Fusion-only fallback fired on ~0.6% of queries (a rerank-pool blip), invisible to agents. Monthly run cost landed near ₹6.3 lakh (~$7,500): the serverless-GPU reranker ~$2,600, embedding inference ~$1,200, the incremental pgvector/Postgres capacity ~$1,500, OpenSearch for the enterprise corpus ~$1,400, edge/observability the remainder. Caching the reranked top-k for the ~300 most-repeated questions deflected ~31% of full retrieval+rerank cycles — roughly the gap between this budget and a 40%-higher one.

The outcome. The metric that moved was rank-1 correctness on exact-match queries: account/section/circular lookups went from ~58% (pure-vector) to ~94% (hybrid + rerank + freshness) on the golden set, because BM25 recovered the exact tokens and the cross-encoder put the right passage first. Average handle time on policy questions fell ~22%. And the compliance line that got sign-off: because every result carried an effective_from date and superseded circulars were filtered out, the risk team certified the tool for use during live customer calls — which the original cosine-only system, occasionally surfacing last quarter’s penalty schedule, never could have cleared.

When to use it

Use this architecture when your queries mix meaning and exactness (paraphrased questions and account numbers, citations, SKUs, error codes); when freshness is correctness, not convenience; when rank-1 precision matters because users read the top result, not the top fifty; and when you can keep the index inside your security boundary. That covers most regulated “find the right passage” demand — financial-services policy lookup, legal and contract search, clinical-protocol retrieval, technical support over changing docs, and the retrieval stage of any serious enterprise RAG system.

Trade-offs to accept. Hybrid + rerank is more moving parts than a single vector index: two retrievers to keep in sync, a fusion step to tune, a reranker on the hot path with its own GPU bill and failure mode, and an ingestion pipeline that must be idempotent and version-aware. Latency is the sum of retrieval and reranking. And it is only as good as the eval set behind it — without recall@k and rank-change metrics you are tuning blind.

Anti-patterns. (1) Pure-vector retrieval in an exact-match domain — you will lose every account-number and citation query. (2) Score-sum fusion without normalisation — the larger-scaled engine silently dominates; start with RRF. (3) Reranking the whole corpus — defeats the two-stage design; retrieve cheaply, rerank ~50. (4) Mismatched embedding models for index vs. query — silent recall collapse; pin and assert the version. (5) ACL filtering in app code — restricted data leaks into memory and logs; filter in the query. (6) Freshness as a cache TTL — superseded ≠ stale; encode it as data and filter it. (7) No evaluation harness — every fusion or model change becomes a gamble.

Alternatives, and when they win. If your corpus is tiny, static, and all exact-match, a plain keyword index (the Solr/Elasticsearch you already run) is simpler — skip the vectors. If queries are purely semantic with no exact-token needs and precision-at-1 is forgiving, vector-only search is less to operate. If you need the model to take actions rather than find passages, you want an agent/tool-calling design with this retrieval as one tool. And if you want hybrid + rerank with the least assembly, a managed search platform (OpenSearch Serverless with its hybrid pipeline, Vespa, or a vector DB with built-in reranking) gets you most of the way before you build the orchestrator yourself — graduate to this explicit, engine-portable architecture when cost, control, or multi-stack reality demands it. The pattern here is the destination; pick the on-ramp that matches your scale.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
