A national grocery and general-merchandise retailer — 1,400 stores, an app and web storefront doing 18 million sessions a week — gets a directive from its Chief Digital Officer after a brutal quarter: a discount-grocery competitor has been eating its margin, basket sizes are flat, and the “Recommended for you” rail on the homepage is a static, merchandiser-curated list that has not been personalized to a single shopper since it launched. The ask is specific: every logged-in customer should see a recommendation rail that reflects what they actually browse and buy, it should update within the same session when someone adds milk and bread to the cart, and the merchandising team must be able to prove — with a number the CFO trusts — that the new rail lifts revenue before it rolls out to all stores. The constraint is just as specific: this runs on GCP, the data platform is already BigQuery, peak is the 6pm weekday rush and the days before a holiday, and the security team will not approve anything that lets a third party touch customer purchase history. This article is the reference architecture for building that engine properly on Vertex AI — a two-tower retrieval model, a feature store, a streaming clickstream, a stateless serving tier, and an experimentation harness that a head of merchandising and a CISO will both sign.
The pressures stack the way they do in retail. Scale means tens of thousands of recommendation requests per second at the dinner-hour peak, against a catalog of 250,000 SKUs. Latency means the rail must render inside the page’s render budget — a recommendation that arrives after the homepage paints is a recommendation nobody sees, so the p99 budget is tens of milliseconds, not seconds. Freshness means in-session signal: the customer who just put nappies in the cart should see formula and wipes before they reach checkout, not tomorrow. And measurability means the engine is worthless to the business until an A/B test proves incremental revenue, because merchandising has been burned by “AI” that looked clever and moved no money. A two-tower retrieval model on Vertex AI satisfies all four, and the rest of this architecture exists to feed it, serve it, and prove it.
Why two-tower retrieval, not the obvious shortcuts
The naive approaches each fail predictably, and naming why matters because someone on the project will propose all three.
Hand-curated merchandiser rails — the status quo — do not personalize at all; they show the same “top sellers” to a vegan and a household of five, and they go stale the moment a promotion ends. A single “people who bought X also bought Y” co-occurrence table is better but brittle: it is recomputed in a nightly batch, has nothing to say about a brand-new SKU or a brand-new customer (the cold-start problem), and cannot fuse a shopper’s long-term taste with what they did ten seconds ago. A monolithic deep model that scores every one of 250,000 SKUs per request is mathematically the “right” answer and operationally impossible — scoring a quarter-million items inside a 40-millisecond budget at peak QPS is not something you can pay your way out of.
The two-tower architecture threads the needle, and it is worth understanding because it shapes everything downstream. You train two neural networks. The query tower ingests user and context features — recent browse and purchase history, time of day, store, device — and emits a single embedding vector representing “what this shopper wants right now.” The candidate tower ingests item features — category, brand, price band, attributes — and emits an embedding for each SKU. The training objective pulls the embeddings of items a user actually engaged with close to that user’s embedding, and pushes everything else away. The payoff is the operational trick: because items and users live in the same vector space, serving a recommendation becomes an approximate nearest-neighbor (ANN) lookup — embed the user once, find the closest item vectors in single-digit milliseconds, and never score 250,000 items at request time. Retrieval narrows a quarter-million SKUs to a few hundred candidates; a cheaper ranking pass then orders that shortlist.
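To make the training objective concrete, here is a minimal sketch of one common way to express it: an in-batch softmax loss, where each user's engaged item is the positive and the other items in the batch serve as negatives. The article does not say which loss the retailer uses, so treat the function name `in_batch_softmax_loss` and the toy two-dimensional vectors as illustrative assumptions. A real implementation would run inside a training framework with learned towers rather than hand-written vectors.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean cross-entropy where user i's positive is item i and every
    other item in the batch acts as a negative."""
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(user_vecs)


# Toy embeddings (hypothetical): two users, two items.
users = [[1.0, 0.0], [0.0, 1.0]]
aligned_items = [[2.0, 0.0], [0.0, 2.0]]   # each item points at its user
swapped_items = [[0.0, 2.0], [2.0, 0.0]]   # each item points at the wrong user

loss_aligned = in_batch_softmax_loss(users, aligned_items)
loss_swapped = in_batch_softmax_loss(users, swapped_items)
print(loss_aligned < loss_swapped)
```

The loss is lower when each user's vector points at the item they engaged with, which is the geometric property that lets serving reduce to a nearest-neighbor lookup.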
Architecture overview
The platform runs three distinct paths that share infrastructure but live on different schedules, and keeping them separate in your head is the first step to operating this well: an offline training path that learns the towers from history, an online serving path that answers recommendation requests in real time, and a streaming event path that keeps features fresh as customers act.
Serving path, following the request as a shopper sees it:
- A logged-in customer opens the app or storefront. The edge is Akamai — TLS termination, global anycast, CDN for the static catalog imagery, and WAF/bot mitigation so scrapers and credential-stuffers never reach the origin. Customer identity is federated through Okta as the consumer/workforce IdP (brokered to Microsoft Entra ID for the corporate merchandising and data-science staff who administer the platform), so the storefront carries a verified user token and the back-office tools enforce SSO and conditional access.
- The storefront’s backend-for-frontend calls the recommendation service running on Cloud Run. Cloud Run is the deliberate choice for the serving tier: it scales to zero overnight, scales out automatically to absorb the 6pm spike, and bills per request — you are not paying for idle GKE nodes at 3am. The service runs behind an internal HTTPS load balancer; it is stateless, which is what lets it scale horizontally without coordination.
- The service fetches the shopper’s online features from Vertex AI Feature Store with a single low-latency read — their last-N viewed categories, current cart contents, loyalty segment, recency/frequency aggregates. It pulls any secrets it needs that are not GCP-native identities — the Okta introspection secret, the Datadog API key, a partner pricing-feed token — from HashiCorp Vault via the Vault Agent with GCP IAM auth, so nothing sensitive sits in an environment variable or a container image.
- The service runs the query tower to turn those features into a user embedding, then issues an ANN query against the Vertex AI Vector Search index (the deployed item embeddings) to retrieve the few hundred nearest candidate SKUs — filtered server-side for what is in stock at this customer’s store and not on a suppression list (allergens the customer flagged, age-restricted items, already-purchased one-offs).
- A lightweight ranking model re-scores that shortlist with richer features — predicted click-through and predicted margin — so the final order optimizes for revenue, not just relevance. The top ~20 SKUs, hydrated with price and imagery, stream back and render in the rail. Every served impression and the user’s subsequent click or add-to-cart is emitted as an event.
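The serving steps above can be sketched as a two-stage funnel. This is a toy, assuming a brute-force dot-product scan in place of the Vector Search ANN query and a lookup table in place of the ranking model. `Item`, `retrieve`, `rank`, and every SKU, tag, and score below are hypothetical names and values, not the retailer's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    sku: str
    embedding: tuple
    in_stock_stores: frozenset
    tags: frozenset
    margin: float


def retrieve(user_vec, catalog, store_id, blocked_tags, k):
    """Stand-in for the ANN query: keep in-stock, unsuppressed items,
    then return the k with the highest dot product to the user vector."""
    eligible = [
        it for it in catalog
        if store_id in it.in_stock_stores and not (it.tags & blocked_tags)
    ]
    eligible.sort(
        key=lambda it: sum(a * b for a, b in zip(user_vec, it.embedding)),
        reverse=True,
    )
    return eligible[:k]


def rank(candidates, predicted_ctr):
    """Order the shortlist by predicted click-through times margin."""
    return sorted(
        candidates,
        key=lambda it: predicted_ctr[it.sku] * it.margin,
        reverse=True,
    )


catalog = [
    Item("formula", (0.9, 0.1), frozenset({1183}), frozenset(), 4.0),
    Item("wipes", (0.8, 0.2), frozenset({1183}), frozenset(), 1.0),
    Item("wine", (0.95, 0.0), frozenset({1183}), frozenset({"age_restricted"}), 6.0),
    Item("cereal", (0.1, 0.9), frozenset({1183}), frozenset(), 2.0),
    Item("rusks", (0.85, 0.1), frozenset({42}), frozenset(), 3.0),  # other store only
]

shortlist = retrieve((1.0, 0.0), catalog, store_id=1183,
                     blocked_tags=frozenset({"age_restricted"}), k=2)
rail = [it.sku for it in rank(shortlist, {"formula": 0.05, "wipes": 0.30})]
print(rail)
```

Retrieval picks the closest eligible items by relevance alone; the ranking pass can then reorder them, here because the assumed click-through for wipes outweighs the higher margin on formula.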
Streaming event path, independent and always-on: every storefront interaction — page view, product view, add-to-cart, purchase — is published to Pub/Sub as a clickstream event. Two subscribers consume the same stream. A Dataflow streaming job computes near-real-time features (this session’s category affinities, rolling counts) and writes them to the online Feature Store so the next request in the same session reflects what just happened — that is the in-session freshness the CDO demanded. The same events land in BigQuery through its streaming insert path as the durable, queryable system of record for all behavioral data.
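The in-session feature computation can be illustrated with a small in-memory sketch. It assumes events for a session arrive in timestamp order and that each carries a unique event ID, which is what makes at-least-once redelivery harmless. The class name `SessionAffinity` and the ten-minute window are assumptions for illustration; a production Dataflow job would express the same idea with its own windowing and keyed state rather than a Python object.

```python
from collections import Counter, deque


class SessionAffinity:
    """Rolling per-session category counts over a sliding time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, category), oldest first
        self.seen_ids = set()    # unbounded here; a real job would expire these
        self.counts = Counter()

    def add(self, event_id, ts, category):
        if event_id in self.seen_ids:
            return               # duplicate delivery: ignore
        self.seen_ids.add(event_id)
        self.events.append((ts, category))
        self.counts[category] += 1
        self._expire(ts)

    def _expire(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, cat = self.events.popleft()
            self.counts[cat] -= 1
            if self.counts[cat] == 0:
                del self.counts[cat]

    def top(self, n):
        return [cat for cat, _ in self.counts.most_common(n)]


session = SessionAffinity(window_seconds=600)
session.add("e1", 0, "dairy")
session.add("e2", 10, "bakery")
session.add("e2", 10, "bakery")   # redelivered by the broker, counted once
session.add("e3", 20, "bakery")
top_before = session.top(1)

session.add("e4", 605, "baby")    # pushes the dairy view out of the window
categories_after = set(session.counts)
print(top_before, sorted(categories_after))
```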
Offline training path, on a schedule: BigQuery is the analytical heart — years of transactions, the full clickstream, catalog and inventory snapshots. A Vertex AI Pipeline (the orchestrator, Kubeflow-based) reads training data from BigQuery, engineers features and materializes them to the Feature Store so training and serving compute features the same way (defeating training/serving skew, the single most common cause of “it worked offline and tanked in production”), trains the two towers on Vertex AI custom training, evaluates against held-out data, registers the model in the Vertex AI Model Registry, and — on passing the eval gate — rebuilds and redeploys the Vector Search index and the query-tower endpoint. This pipeline runs nightly for embeddings and is re-triggered whenever drift monitoring fires.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, image CDN, WAF, bot mitigation | Origin shield to the internal LB; cache catalog imagery, never the personalized rail |
| Identity / SSO | Okta + Microsoft Entra ID | Shopper auth (Okta) and corporate SSO for data-science/merch staff (Entra) | OIDC; conditional access on Entra for the MLOps console; short-lived tokens |
| Serving tier | Cloud Run | Stateless recommendation API: feature read → embed → ANN → rank | Min instances for warm peak, scale-to-zero off-peak; concurrency tuned to CPU |
| Retrieval | Vertex AI Vector Search | ANN over item embeddings; in-stock + suppression filtering | ScaNN index; restricts/crowding tags for store & category filters |
| Feature store | Vertex AI Feature Store | Low-latency online features + offline materialization | Online serving for reads; offline store backed by BigQuery for training parity |
| Streaming | Pub/Sub + Dataflow | Clickstream ingest; real-time feature computation | At-least-once delivery; dead-letter topic; windowed aggregations |
| Analytics / lake | BigQuery | System of record; training data; experiment analysis | Streaming inserts; partitioned + clustered tables; scheduled queries |
| Training / orchestration | Vertex AI Pipelines + Training | Build, train, evaluate, register the towers | KFP components; eval gate; Model Registry promotion |
| Secrets | HashiCorp Vault | Okta introspection, Datadog key, partner feed tokens | GCP IAM auth; dynamic leases; Vault Agent injection |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, sensitive-data exposure, IaC scanning | Agentless scan of BigQuery/buckets; Wiz Code gates Terraform in CI |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on build/runner hosts and any GKE nodes | Sensor on CI runners and node pools; detections to the SOC |
| Observability | Datadog | Service APM, infra metrics, and model-performance dashboards | Agent + OTel traces; custom metrics for CTR, latency, model version |
| ITSM / approvals | ServiceNow | Model-promotion change approvals, incident records | Change gate before a model serves prod; auto-ticket on drift breach |
| CI/CD + IaC | GitHub Actions + Argo CD + Terraform | Build/test/deploy; GitOps rollout; infra as code | OIDC to GCP (no stored keys); Argo CD syncs Cloud Run revisions |
| Config mgmt | Ansible | Configure the self-managed Vault cluster and any VM appliances | Idempotent playbooks; pulls no secrets into the repo |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why the Feature Store is non-negotiable, not a nice-to-have. The subtlest bug in any recommender is training/serving skew: the offline pipeline computes “30-day category purchase count” one way and the online service computes it slightly differently, so the model sees inputs in production it never saw in training and quietly degrades. A Feature Store fixes this by being the single place a feature is defined and computed — the training pipeline materializes features into the offline store, and the serving path reads the identical feature definitions from the online store. One definition, two read paths, no skew.
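A minimal sketch of the "one definition, two read paths" idea, assuming the feature is written once as a plain function and both paths call it with an explicit `as_of` timestamp. The function name, the tuple layout, and the dates are hypothetical; in the platform described here the definition would live in the Feature Store rather than in application code.

```python
from datetime import datetime, timedelta


def category_purchase_count_30d(purchases, category, as_of):
    """The single definition: purchases in `category` within the 30 days
    ending at `as_of`. Events after `as_of` are never counted."""
    start = as_of - timedelta(days=30)
    return sum(
        1 for ts, cat in purchases
        if cat == category and start < ts <= as_of
    )


purchases = [
    (datetime(2024, 1, 1), "dairy"),
    (datetime(2024, 1, 20), "dairy"),
    (datetime(2024, 2, 10), "dairy"),
]

# Offline path: value as of the training example's timestamp, so the
# February purchase (still in the future at that point) is excluded.
offline_value = category_purchase_count_30d(purchases, "dairy", datetime(2024, 1, 25))

# Online path: the same function, evaluated at request time.
online_value = category_purchase_count_30d(purchases, "dairy", datetime(2024, 2, 25))

print(offline_value, online_value)
```

Because both paths share the window boundaries and the inclusive/exclusive rules, the only thing that differs between training and serving is the timestamp, which is the point.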
Why Vector Search instead of scoring everything. Restating the core trick because it is load-bearing: at 250,000 SKUs and tens of thousands of QPS, you cannot score the full catalog per request. Vertex AI Vector Search (built on Google’s ScaNN) does approximate nearest-neighbor in single-digit milliseconds and supports filtering via restricts — so “only items in stock at store 1183, excluding suppressed categories” is applied inside the index, not by fetching candidates and filtering in app code. Retrieval is the cheap funnel; ranking is the expensive polish on a shortlist of hundreds.
Why Cloud Run, not GKE, for serving. Recommendation traffic is spiky — a 6pm dinner-hour and pre-holiday peak many times the 3am trough. A stateless, autoscaling, scale-to-zero serverless tier matches that curve and the bill to it. GKE would mean paying for provisioned nodes through the quiet hours, and the serving logic (read features, embed, query ANN, rank) holds no state that needs a long-lived pod. Keep min-instances warm enough to absorb the peak’s leading edge without cold-start latency, and let it scale from there.
Implementation guidance
Provision with Terraform, and make the data boundary the first deliverable. The security team’s veto is about customer purchase history, so the network and IAM posture is the foundation, not an afterthought.
- A VPC with VPC Service Controls drawing a perimeter around BigQuery, Cloud Storage, the Feature Store, and Vector Search — so even a leaked credential cannot exfiltrate purchase data to a project outside the perimeter.
- Private Service Connect / private endpoints for the Vertex AI services and BigQuery, with public access off, so data-plane traffic never traverses the public internet.
- Least-privilege service accounts: the Cloud Run runtime SA gets only `aiplatform.user`, scoped for online prediction and Feature Store reads; the pipeline SA gets BigQuery read and Vertex AI training; no human holds standing write to the prod model endpoint.
- The Pub/Sub topics, Dataflow templates, and BigQuery datasets (partitioned by event date, clustered by customer and SKU).
A minimal Terraform shape for the serving tier communicates the intent — internal ingress, dedicated SA, autoscaling bounds:
```hcl
resource "google_cloud_run_v2_service" "recs" {
  name     = "reco-serving-prod"
  location = "europe-west2"
  ingress  = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER" # not public

  template {
    service_account = google_service_account.reco_runtime.email

    scaling {
      min_instance_count = 6   # warm for the 6pm peak's leading edge
      max_instance_count = 400 # headroom for pre-holiday surge
    }

    containers {
      image = "europe-docker.pkg.dev/${var.project}/reco/serving:${var.git_sha}"
      resources {
        limits = { cpu = "2", memory = "2Gi" }
      }
    }
  }
}
```
The pipeline that builds and ships this runs in GitHub Actions, authenticating to GCP via Workload Identity Federation (OIDC) so there is no stored service-account key to leak — a hard lesson the platform team intends never to repeat. Argo CD then drives the GitOps rollout: a new Cloud Run revision is declared in the config repo, Argo CD reconciles it, and a model-serving change is therefore auditable and instantly revertable by reverting a commit. Wiz Code scans the Terraform and container manifests in that pipeline and fails the build on a public-bucket or over-broad-IAM regression before it can ever reach an environment.
Identity: federate the humans, kill the static keys. Shoppers authenticate through Okta; the corporate data scientists and merchandisers who operate the MLOps console and the experiment dashboards authenticate through Microsoft Entra ID with conditional access, so promoting a model to production requires a verified, MFA-backed corporate identity — not a shared key. The residual non-IAM secrets — the Okta introspection secret, the Datadog API key, a partner pricing-feed token — live in HashiCorp Vault, leased dynamically and injected by the Vault Agent, never written into a Cloud Run env var or baked into an image. Ansible configures that self-managed Vault cluster and any virtual appliances (for example, a partner-supplied pricing or fraud appliance that must run as a VM image) idempotently, so the security configuration is code, not a console click someone forgets.
Feature and index wiring. Define every feature once in the Feature Store registry. Materialize from BigQuery for training; serve from the online store at request time. Co-locate the Vector Search index, the Feature Store online store, and Cloud Run in the same region as the BigQuery dataset — a cross-region hop in the serving path is latency you cannot afford in a 40ms budget. Rebuild the index from the freshly trained candidate tower as the final pipeline step, and deploy it behind a new endpoint so a bad index can be rolled back by repointing traffic.
A/B experimentation: proving the lift
This is the section the CFO cares about, and it is where most recommender projects quietly fail — they ship a model that looks smarter and never prove it earns more. The engine is built for experimentation from day one.
Traffic is split at the serving tier by a stable hash of the customer ID into arms — control (the old merchandiser rail or the incumbent model) versus one or more challenger models. The split is deterministic so a given shopper always sees a consistent experience for the experiment’s duration, and it is configured, not coded, so merchandising can launch a test without a deploy. Every served impression is tagged with its experiment arm and model version and flows through Pub/Sub into BigQuery, where the analysis is a scheduled query: per arm, the add-to-cart rate, conversion rate, revenue per session, and average order value, with a statistical-significance test so “the challenger wins” is a defensible claim, not a vibe.
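A stable split of this kind is usually a few lines of hashing. The sketch below assumes SHA-256 over an experiment ID plus the customer ID, mapped to a fraction in [0, 1) and compared against cumulative arm weights. The function name, the experiment ID, and the 50/50 weights are hypothetical and stand in for whatever the configured experiment defines.

```python
import hashlib


def assign_arm(customer_id, experiment_id, arms):
    """Deterministically map a customer to an arm.

    `arms` is a list of (name, weight) pairs whose weights sum to 1.0.
    Salting with the experiment ID keeps assignments independent
    across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    cumulative = 0.0
    for name, weight in arms:
        cumulative += weight
        if bucket < cumulative:
            return name
    return arms[-1][0]   # guard against floating-point rounding in the weights


arms = [("control", 0.5), ("challenger", 0.5)]   # would come from config
first = assign_arm("cust-000123", "rail-test-1", arms)
second = assign_arm("cust-000123", "rail-test-1", arms)

assignments = [assign_arm(f"cust-{i}", "rail-test-1", arms) for i in range(10_000)]
control_share = assignments.count("control") / len(assignments)
print(first == second, round(control_share, 2))
```

The same customer always lands in the same arm for a given experiment, and changing the weights in configuration shifts the split without a deploy.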
| Stage | Mechanism | What it answers |
|---|---|---|
| Online split | Stable customer-ID hash at Cloud Run | Which shoppers see which model, consistently |
| Metric capture | Impression + outcome events → Pub/Sub → BigQuery | Per-arm CTR, conversion, revenue/session, AOV |
| Live monitoring | Datadog dashboards on the same metrics, by model version | Is a challenger silently tanking right now |
| Decision | BigQuery scheduled query with significance test | Did the lift clear the bar to promote |
The crucial pairing is Datadog for the live view and BigQuery for the verdict. Datadog dashboards — fed custom metrics tagged with model version and experiment arm — show CTR, p99 latency, and error rate per arm in real time, so if a freshly promoted challenger is dropping conversion or blowing the latency budget, an alert fires and the arm can be cut in minutes via config, not after a week of lost revenue. BigQuery is where the statistically sound end-of-experiment decision is made and where merchandising gets the revenue number to take to the CFO. Promotion of a winning model to 100% of traffic passes through a ServiceNow change approval, giving merchandising leadership and the platform owner a documented gate before a model owns the homepage.
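For the conversion-rate comparison, the significance check can be as simple as a two-proportion z-test. The sketch below is one standard choice, not necessarily the test this retailer runs, and the session and conversion counts are invented for illustration. Revenue per session is a continuous, heavy-tailed metric and would need a different method.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between
    arm A (control) and arm B (challenger). Returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 5.0% vs 5.5% conversion on 20,000 sessions per arm.
z, p_value = two_proportion_z_test(1000, 20_000, 1100, 20_000)
print(f"z={z:.2f}, p={p_value:.3f}, significant at 5%: {p_value < 0.05}")
```

In practice the same arithmetic would run in the scheduled analysis, with the sample size and significance threshold agreed before the experiment starts rather than after the numbers are in.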
Enterprise considerations
Security & data posture. The architecture is Zero-Trust by construction: identity-based access only, least-privilege service accounts, and a VPC Service Controls perimeter that blocks customer purchase history from being copied to a project outside the perimeter. Layer on top: (a) Wiz running continuous CSPM and sensitive-data scanning across BigQuery, Cloud Storage, and the Feature Store, alerting the moment a dataset drifts to broader access or a bucket turns public, as the posture backstop behind the IAM policy; (b) Wiz Code shifting that same checking left into the Terraform/container pipeline so misconfigurations are caught pre-merge; (c) CrowdStrike Falcon sensors on the CI runners and any GKE node pools for runtime threat detection, feeding the retailer’s SOC; (d) a drift or guardrail breach auto-raising a ServiceNow incident so security and MLOps get a ticket, not just a log line. Org Policy denies public ingress on the Vertex AI and serving resources, and Wiz independently verifies the policy is actually holding.
Cost optimization. Compute and data scanning dominate, and both grow with success, so engineer for them from day one.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Serverless serving | Cloud Run scale-to-zero off-peak, autoscale to peak | Pay for the dinner-rush, not 3am idle |
| Retrieval funnel | ANN to a few hundred candidates, rank only the shortlist | Avoids scoring 250k SKUs per request |
| Embedding cadence | Rebuild item embeddings nightly, not per-request | Amortizes training cost across millions of calls |
| BigQuery hygiene | Partition + cluster tables; scheduled rollups; BI Engine for dashboards | Cuts bytes scanned per analytical query |
| Feature reuse | One feature definition serves train + online | No duplicate compute, no skew-debugging cost |
Tag spend by environment and pipe BigQuery and Cloud Run cost metrics to Datadog, which the platform team uses for the showback dashboard the CDO and CFO see.
Scalability. Each tier scales independently. Cloud Run scales on concurrent requests; Vector Search scales by adding replicas to the deployed index for query QPS; the Feature Store online store scales its serving nodes for read throughput; Pub/Sub and Dataflow scale on backlog depth, with the streaming job’s autoscaler matching the clickstream’s diurnal curve. BigQuery’s analytical capacity is elastic by design. The natural ceiling at peak is feature-read and ANN-query latency under load, which is why the index and online store are over-provisioned and warmed ahead of the known dinner-hour and pre-holiday spikes rather than discovered cold.
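A back-of-envelope sizing for the serving tier follows from Little's law: requests in flight equal arrival rate times mean latency. Every number below (peak QPS, mean latency, per-instance concurrency, target utilization) is an assumed placeholder, not a measurement from this platform; substitute load-test results before setting autoscaling bounds.

```python
import math


def instances_needed(peak_qps, mean_latency_s, concurrency_per_instance, target_utilization):
    """Little's law: in-flight requests = arrival rate x mean latency.
    Divide by the usable concurrency of one instance and round up."""
    in_flight = peak_qps * mean_latency_s
    usable_slots = concurrency_per_instance * target_utilization
    return math.ceil(in_flight / usable_slots)


# Assumed inputs for illustration only.
needed = instances_needed(
    peak_qps=30_000,               # "tens of thousands" of requests per second
    mean_latency_s=0.030,          # 30 ms mean inside a 40 ms budget
    concurrency_per_instance=40,   # concurrent requests one instance handles
    target_utilization=0.6,        # leave headroom for bursts
)
print(needed)
```

The useful output is less the number itself than the sensitivity: doubling latency under load doubles the instances required, which is why latency regressions show up as cost as well as as a slower rail.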
Failure modes, and what each one looks like. Name them before they page you.
- Cold start at the peak’s leading edge — Cloud Run scaled to zero overnight and the 6pm surge hits a cold pool, so the first wave of shoppers eats container start latency and the rail renders late. Mitigation: min-instances warmed ahead of the known peak; scheduled scale-up before 6pm.
- Stale features / streaming lag — Dataflow falls behind, so in-session signal arrives late and the just-added-to-cart nudge does not appear. Mitigation: monitor subscription backlog as a first-class SLO; autoscale the streaming job; a dead-letter topic so one poison event does not stall the pipeline.
- Cold-start items and users — a brand-new SKU has no engagement history and a first-time shopper has no profile, so pure two-tower retrieval has little to go on. Mitigation: content-based item features (category, brand, attributes) give new SKUs a reasonable embedding immediately; fall back to popularity-and-context rails for anonymous or brand-new users.
- Training/serving skew — a feature computed differently offline vs online silently degrades quality with no error. Mitigation: the shared Feature Store definition, plus Vertex AI model monitoring on feature drift between training and serving distributions.
- A bad model promoted — a challenger wins offline but tanks live conversion. Mitigation: the staged A/B rollout, the Datadog live guardrail that cuts the arm in minutes, and instant rollback by repointing serving traffic to the prior endpoint via Argo CD.
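For the training/serving skew item above, a drift check compares the distribution of a feature at training time with what the service sees live. The sketch below uses the population stability index (PSI) over pre-binned counts as one simple, widely used measure. It is not a description of how Vertex AI model monitoring computes drift, and the bin counts and the 0.25 alert threshold are assumptions (0.25 is a commonly quoted rule of thumb, not a value from this platform).

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two histograms that share
    the same bins. 0.0 means identical distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        pe = max(e / e_total, eps)   # floor to avoid log(0) on empty bins
        pa = max(a / a_total, eps)
        score += (pa - pe) * math.log(pa / pe)
    return score


training_bins = [500, 300, 200]   # hypothetical feature histogram at training time
serving_same = [50, 30, 20]       # same shape, smaller sample
serving_shifted = [200, 300, 500]

DRIFT_THRESHOLD = 0.25            # assumed alerting threshold
stable = psi(training_bins, serving_same)
drifted = psi(training_bins, serving_shifted)
print(stable < DRIFT_THRESHOLD, drifted > DRIFT_THRESHOLD)
```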
Reliability & DR (RTO/RPO). Decide the numbers per tier. BigQuery is multi-region-durable and is the recoverable source of truth for all behavioral and training data, so the engine is rebuildable from it. The serving tier is stateless and regional — for DR, deploy the Cloud Run service, a replicated Vector Search index, and the online Feature Store in a paired region and fail over at the load-balancer/Akamai layer. A pragmatic target for this platform: RTO 15 minutes, RPO near-zero for the behavioral event stream (Pub/Sub buffers and re-drives), with the model and index rebuildable from BigQuery within hours if lost. Critically, the storefront must degrade gracefully: if the recommendation service is unavailable, the rail falls back to a cached popularity list rather than failing the page — a recommender outage must never take down the homepage.
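The graceful-degradation rule can be sketched as a timeout wrapper around the recommendation call. This assumes a thread pool and a hard deadline in the calling service; `rail_with_fallback`, the SKU lists, and the 50 ms deadline are hypothetical. A real backend-for-frontend would more likely rely on its HTTP client's timeout, but the control flow is the same.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=8)


def rail_with_fallback(fetch_personalized, cached_popular, timeout_s):
    """Return (rail, source). Any timeout or error from the recommender
    yields the cached popularity list instead of failing the page."""
    future = _pool.submit(fetch_personalized)
    try:
        return future.result(timeout=timeout_s), "personalized"
    except Exception:        # deadline exceeded or recommender error
        future.cancel()      # best effort; a call already running is not interrupted
        return cached_popular, "fallback"


POPULAR = ["milk", "bread", "eggs"]   # hypothetical cached popularity rail


def healthy():
    return ["formula", "wipes"]


def too_slow():
    time.sleep(0.3)
    return ["formula", "wipes"]


def broken():
    raise RuntimeError("recommender unavailable")


ok = rail_with_fallback(healthy, POPULAR, timeout_s=0.05)
late = rail_with_fallback(too_slow, POPULAR, timeout_s=0.05)
failed = rail_with_fallback(broken, POPULAR, timeout_s=0.05)
print(ok[1], late[1], failed[1])
```

Returning the source alongside the rail lets the caller tag the impression, so fallback traffic can be excluded from experiment analysis and counted on a dashboard.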
Observability. Instrument the serving request end to end in Datadog with OpenTelemetry: one trace covering feature-read → embed → ANN → rank, with timing on each hop and the model version as a tag, so a latency regression is attributable to a specific stage and release. Emit the metrics the business actually cares about — recommendation CTR, add-to-cart rate, conversion and revenue per session by model version, plus p99 serving latency and streaming backlog. The model-performance dashboards in Datadog are the daily operational view; BigQuery scheduled queries are the rigorous periodic verdict. Vertex AI model monitoring watches feature and prediction drift and, on breach, re-triggers the training pipeline and opens a ServiceNow ticket. New models pass a ServiceNow change gate before serving production, giving the business a documented promotion record.
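The per-stage timing idea can be shown without any tracing library. The sketch below is a standard-library stand-in for spans, not OpenTelemetry or Datadog API usage; `RequestTrace`, the model version string, and the split of the 40 ms budget across stages are all assumptions for illustration.

```python
import time
from contextlib import contextmanager


class RequestTrace:
    """Records wall-clock time per serving stage for one request."""

    def __init__(self, model_version):
        self.model_version = model_version
        self.stage_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stage_ms[name] = (time.perf_counter() - start) * 1000.0

    def over_budget(self, budget_ms):
        """Stages whose measured time exceeded their budget."""
        return {
            name: ms for name, ms in self.stage_ms.items()
            if ms > budget_ms.get(name, float("inf"))
        }


trace = RequestTrace(model_version="two-tower-v1")   # hypothetical version tag
with trace.stage("feature_read"):
    pass
with trace.stage("embed"):
    pass
with trace.stage("ann"):
    time.sleep(0.02)    # simulate a slow ANN query (about 20 ms)
with trace.stage("rank"):
    pass

# Assumed split of a 40 ms budget across the four hops.
budget_ms = {"feature_read": 10.0, "embed": 5.0, "ann": 10.0, "rank": 15.0}
slow = trace.over_budget(budget_ms)
print(sorted(slow), trace.model_version)
```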
Governance. Pin and register every model version in the Vertex AI Model Registry with its training-data snapshot and eval metrics, so a production model is always traceable to exactly what produced it and promotion is an auditable event, never a floating “latest.” Keep feature definitions, pipeline code, and the experiment configuration in version control, reviewable and revertable. Apply Org Policy to deny public ingress and require the VPC-SC perimeter, with Wiz as the independent check that the controls are real. Retain the clickstream under the retailer’s data-retention and consent regime, honoring deletion requests across BigQuery and the Feature Store, since purchase history and behavioral data are personal data.
Explicit tradeoffs
Accept these or do not build it. A two-tower retrieval engine adds real moving parts — a streaming pipeline, a feature store to keep consistent, embedding rebuilds, and retrieval quality you must measure and tune. Latency is the sum of feature read and embedding and ANN and ranking; budget all four. Approximate nearest-neighbor is approximate — you trade a sliver of recall for the latency that makes real-time serving possible, and ranking on the shortlist is what recovers quality. Cold start is a genuine weakness of any embedding recommender and needs the content-feature and popularity-fallback mitigations baked in, not bolted on. The VPC-SC perimeter and private endpoints that make the security team sign cost you setup complexity and remove public debugging shortcuts. And the A/B harness, the dual Okta/Entra identity planes, and the Vault-held secrets are overhead you could skip for a single-store pilot and absolutely cannot skip for a 1,400-store, regulated-data rollout.
The alternatives, and when they win. If your catalog is small and traffic modest, a nightly co-occurrence/matrix-factorization batch served from a cache is far simpler and may be enough — graduate to two-tower when real-time freshness and a large catalog demand it. If you want a managed black box and less ML ownership, Vertex AI Search for commerce / Recommendations AI packages much of this as a service — you trade control and customizability for speed-to-launch. If your problem is really ranking a known small set rather than retrieving from a huge catalog, you may not need the retrieval tower at all — a single ranking model suffices. And if personalization is not yet a proven revenue lever for your business, start with the experimentation harness and a simple popularity model, prove the rail moves money, and only then invest in the full Vertex AI stack described here.
The shape of the win
For the retailer, the payoff is not “an AI rail.” It is that a shopper who just put nappies in their cart sees formula and wipes in the recommendation slot before checkout, the rail rendered inside the page’s latency budget, the purchase history never left the GCP perimeter — and, the sentence that funds the platform, merchandising can show the CFO a BigQuery-backed, statistically significant lift in revenue per session from the controlled rollout before the model ever owned 100% of the homepage. Everything upstream — the two towers, the Feature Store parity, the Pub/Sub clickstream, the Vector Search funnel, the Okta-to-Entra identity planes, the Vault-held secrets, the Wiz posture scanning, the Datadog model dashboards — exists to make a head of merchandising, a CISO, and a CFO each say yes. The architecture here is the destination; start with the experiment and a simple model if you must, but this is where real-time, at-scale, provable retail personalization on GCP has to land.