AI/ML Multi-cloud

A Two-Tower Recommendation System at Scale

A streaming-media company lives and dies by the home screen. Helix Media, a fictional subscription video service with 38 million monthly actives across twelve markets, measured the truth bluntly: a member who does not start something to watch within roughly ninety seconds of opening the app is materially more likely to churn that month. Their old recommender — a nightly batch job that pre-computed a single ranked row per user and cached it — could not keep up. It did not know that you had just finished a thriller four minutes ago, it served the same shelf to a household’s children’s profile and its adult profile, and every catalog change or new title took a full day to influence a single recommendation. Engagement was flat, the content team could not tell whether a new ranking idea helped or hurt, and the cost of re-scoring the entire catalog for every user every night had grown into a line item finance kept circling in red.

The fix is not “a better model.” It is a system: one that can consider a 600,000-title catalog for tens of millions of users, react to the last action a member took seconds ago, stay inside a strict latency budget, run controlled experiments so every change is measured, and do all of it at a cost that does not scale linearly with the catalog. The pattern the industry converged on for exactly this shape of problem is the two-tower architecture — split into a retrieval stage that cheaply narrows 600,000 candidates down to a few hundred, and a ranking stage that expensively scores those few hundred with full context. This article is a reference architecture for building that properly at enterprise scale.

Why two towers, and why two stages

The core difficulty is combinatorial. You cannot run a rich, feature-heavy model over every (user, item) pair at request time — 38M users times 600K items is an absurd number of scores per day, and the per-request version (score the whole catalog while a person waits) blows the latency budget by orders of magnitude. So the system is deliberately split by cost.

The two-tower model (also called a dual-encoder) is the engine of the cheap first stage. It is two separate neural networks trained jointly:

They are trained so that the dot product of a user’s query vector and an item’s item vector approximates the likelihood that the user engages with that item. The decisive property: the item tower does not depend on the user. So you can run every title in the catalog through the item tower offline, store the resulting 600,000 vectors, and at request time only run the query tower once (for the single user) and then find the nearest item vectors. Nearest-neighbour search over a few hundred thousand vectors is a millisecond-class operation. That is how you reduce 600K candidates to ~500 fast enough to serve a home screen.

But a dot product of two embeddings is a coarse instrument — it deliberately cannot model rich cross-features between this exact user and this exact item (the architecture’s strength is that separation, because it is what makes precomputation possible). So the ~500 survivors flow into a second, heavier ranking model — typically a gradient-boosted tree or a deep cross network — that does see full crossed features, real-time signals, and business constraints, and produces the final ordered shelf. Retrieval optimises recall (don’t miss anything good); ranking optimises precision and ordering (put the best thing first). Conflating the two into one model is the most common way teams paint themselves into a corner.

Architecture overview

The end-to-end design has three loops running on three different clocks: an offline training + indexing loop (hours/daily), a near-real-time feature loop (seconds), and a synchronous serving loop (milliseconds). Holding them separate in your head is the prerequisite to operating this without losing your mind.

A Two-Tower Recommendation System at Scale — architecture

Serving loop, numbered as in the diagram: (1) a member opens the app; the request hits the edge through Akamai, which terminates TLS, absorbs DDoS, and serves cached static assets, then forwards the personalization call to the recommendation API gateway. The member’s session is already authenticated — Helix uses Okta for consumer identity and SSO, and the gateway validates the JWT and extracts the profile ID. (2) The serving orchestrator fetches online features for this user and context from the feature store’s online store (a low-latency key-value tier, here Redis-backed) — the keys are profile ID, household, device, and the rolling “last N titles watched” list updated seconds ago. (3) The orchestrator runs the query tower (a small ONNX model served on a GPU/CPU inference node) over those features to produce the live query embedding. (4) It issues an approximate nearest-neighbour (ANN) query against the vector index of precomputed item embeddings to retrieve the top ~500 candidates, applying hard filters in the same call (market licensing, maturity rating for a kids profile, already-watched exclusion). (5) Those candidates plus their ranking features (fetched from the same feature store) go to the ranking service, which scores and orders them with the heavy model and applies business rules — diversity, freshness boosts, editorial pins. (6) The ordered shelf returns through the gateway and Akamai to the device, and the served slate plus the eventual interaction (play, skip, dwell) is emitted as an event for logging and training. The whole loop has a p99 budget of ~120 ms.

Feature loop runs continuously alongside it: interaction events stream through Apache Kafka, a stream processor (Flink) computes rolling aggregates — “titles watched in last hour,” “genre affinity decayed over 7 days” — and writes them to the online feature store so the next request sees them within seconds. The same events land in the data lake for the offline side.

Training + indexing loop runs on a schedule: the offline feature store materializes point-in-time-correct training data from the lake; the two-tower model retrains (daily or on a trigger); the item tower is run in batch over the entire catalog to produce fresh item embeddings; those vectors are loaded into the ANN index; and the ranking model retrains on the same logged data. Crucially the query tower and the item-embedding batch job must use the same model version — a skew here silently destroys retrieval quality because the two towers stop sharing a vector space.

Component breakdown

Component Representative technology Role in the system Key configuration choices
Edge / CDN Akamai TLS, DDoS, static caching, geo-routing of the personalization call Do not cache the personalized response; cache only the catalog/art assets
Identity Okta (consumer) / Entra ID (workforce) Authenticate the member session; SSO for the internal ML/ops consoles Short-lived JWTs validated at the gateway; profile + household claims
Event bus Apache Kafka Durable interaction stream feeding both real-time and offline paths Partition by profile ID; tiered storage; exactly-once to the lake
Feature store Feast / Tecton (Redis online + warehouse offline) Single source of feature definitions; serves online + materializes offline Same transformations both sides to kill training/serving skew
Query tower serving ONNX Runtime / Triton on Kubernetes Runs the user tower per request → query embedding Pinned model version; warm pods; sub-10 ms inference
Vector index (ANN) Vespa / Milvus / ScaNN Stores item embeddings; top-K retrieval with metadata filters HNSW or IVF; filtered ANN so licensing/maturity apply inside the search
Ranking service XGBoost / DCN-v2 served on K8s Heavy scoring of ~500 candidates with crossed + real-time features Latency-bounded; feature timeout fallbacks; business-rule layer
Orchestrator / API gRPC service on Kubernetes (EKS/GKE) Sequences feature fetch → retrieve → rank; enforces budget Per-stage deadlines; graceful degradation paths
Experimentation In-house / Optimizely + metrics pipeline Assigns members to model variants; measures lift Deterministic hashing; guardrail metrics; sequential testing
Secrets HashiCorp Vault Model-registry creds, DB/Redis creds, signing keys Dynamic short-lived DB creds; no static secrets in pods
Observability Dynatrace Distributed traces, the latency budget, model-quality SLOs OneAgent on nodes; per-stage spans; business-metric SLOs
Cloud security posture Wiz (CSPM) + CrowdStrike Falcon (runtime) Data-posture scanning of the feature store/lake; runtime threat detection Wiz watches PII in the lake; Falcon on every node

A few choices deserve the why, because they are the ones teams get wrong.

Why a feature store is non-negotiable, not a nicety. The single most expensive bug in production ML is training/serving skew: the feature you computed in a notebook for training is subtly different from the one your serving code computes at request time (a different time window, a timezone, a null-handling rule), and your offline metrics look great while production quietly underperforms. A feature store like Feast or Tecton fixes this by defining each feature once, then both materializing it offline (point-in-time correct, so you never leak the future into training labels) and serving it online from the same definition. It also gives you the low-latency online store the serving loop reads in step 2, and a clean contract — “feature views” — that data scientists and the serving team share instead of reimplementing.

Why filtered ANN, not filter-after-retrieve. Naively you retrieve the top 500 by vector similarity and then drop the ones the member is not licensed to watch in their market, or that exceed a kids profile’s maturity rating. This is both a correctness and a quality bug: if 400 of your top 500 happen to be unlicensed in that market, you are left with 100 candidates and a thin shelf, and a single mistake leaks an inappropriate title onto a children’s profile. Instead, push the hard constraints into the ANN query as metadata filters (Vespa and Milvus both support filtered search) so the index only ever returns eligible candidates and back-fills to a full top-K from the eligible set. Eligibility stays a property of the retrieval, not an afterthought in app code.

Why retrieval and ranking use different model families. Retrieval needs a model whose item side is decomposable from the user side (that is the entire reason the item tower can be precomputed) — hence the two-tower dot-product structure. Ranking has no such constraint because it scores only ~500 items synchronously, so it can use the richest model you can afford in the latency budget: gradient-boosted trees (XGBoost/LightGBM) are still the workhorse for tabular ranking features, while deep cross networks (DCN-v2) shine when you have many high-cardinality categorical crosses. Using the two-tower model itself to do final ranking throws away all the crossed and real-time signal that actually orders the shelf well.

Implementation guidance

Provision with IaC and treat the data plane as the first deliverable. Use Terraform for the cloud substrate — Kubernetes (EKS or GKE), the Redis/online-store tier, Kafka (MSK or Confluent Cloud), object storage for the lake, and the GPU node pools for tower serving and training. A minimal shape for the serving node pool and the online store communicates the intent:

resource "aws_eks_node_group" "tower_serving" {
  cluster_name    = aws_eks_cluster.reco.name
  node_group_name = "tower-serving-gpu"
  instance_types  = ["g5.xlarge"]              # query-tower + ranking inference
  scaling_config { min_size = 6  max_size = 40  desired_size = 8 }
  labels = { workload = "inference", "nvidia.com/gpu" = "true" }
}

resource "aws_elasticache_replication_group" "online_features" {
  replication_group_id = "reco-online-features"
  engine               = "redis"
  node_type            = "cache.r6g.2xlarge"   # online feature store hot tier
  num_node_groups      = 6                      # sharded by profile id
  replicas_per_node_group = 2                    # read replicas for QPS + HA
  multi_az_enabled     = true
  transit_encryption_enabled = true
}

Kill static secrets. The serving pods need credentials for Redis, the model registry, and the metadata store. Do not bake them into config or images — issue them from HashiCorp Vault with dynamic, short-lived database credentials and the Vault Agent sidecar injecting them, so a leaked pod never yields a long-lived secret and nothing needs manual rotation. Workforce access to the ML platform, training UIs, and the experimentation console goes through Entra ID (or Okta) SSO with SCIM-provisioned groups, while the member-facing session is the consumer Okta tenant — keep the two identity planes separate.

Wire CI/CD for two artifacts, not one. A recommender change is rarely just code; it is usually a model. Run application CI through GitHub Actions (build, unit/integration tests, container scan, Terraform plan), but route model promotion through a registry-driven pipeline: a retrained two-tower or ranking model is registered, evaluated offline against a golden held-out set, and only on passing the offline gates does it become eligible for an online experiment. Any promotion of a model variant to a larger traffic share routes a change request through ServiceNow for approval, because flipping the recommender that drives the home screen for millions of members is a production change the on-call and content stakeholders must sign off on — the approval and the rollback plan live on the ServiceNow ticket, linked to the experiment ID.

Get the embedding-refresh cadence right. The item embeddings drift as the catalog and the model change. Two independent triggers regenerate them: a scheduled daily batch of the item tower over the whole catalog, and an incremental path for newly-added titles (run the current item tower over just the new items and upsert them into the ANN index within minutes, so a launch-day title is recommendable immediately). The hard rule from the architecture overview bears repeating: the query tower in serving and the item-tower batch must be the same model version, so promote them together as a unit and pin the version explicitly.

Failure modes and graceful degradation

This system has more ways to fail partially than to fail completely, and the difference between a good and a bad design is what happens when one stage is sick. The orchestrator enforces a per-stage deadline and a fallback for each:

Failure Symptom Graceful degradation
Online feature store slow/down Step 2 times out Serve the query tower with default/aggregate features (cohort-level), not personalized — degraded but present
ANN index unavailable Retrieval returns nothing Fall back to a precomputed popular-in-market shelf cached at the edge; never an empty home screen
Ranking service overloaded Step 5 exceeds budget Return candidates in retrieval order (the two-tower score) — worse ordering, still relevant
Stale item embeddings (batch failed) New titles missing Alert + serve yesterday’s index; incremental path back-fills launches
Model rollout regresses metrics Engagement drop in a variant Experimentation guardrail auto-halts the variant; route 100% to control

The governing principle is that a stale or generic recommendation is vastly better than an error or a blank shelf — a member who sees something reasonable keeps browsing; a spinner or an empty row is an immediate churn signal. Every stage therefore has a “good enough” fallback that is itself cheap and pre-warmed, and Dynatrace drives the alerting on each fallback firing so the team learns a dependency is degraded before it cascades.

Observability with Dynatrace

You cannot operate this on infrastructure metrics alone, because the system can be perfectly healthy (CPU fine, no errors, p99 inside budget) while silently recommending garbage — and that only shows up in business metrics days later. The observability strategy is therefore two-layered, both unified in Dynatrace:

Operational layer. Deploy the Dynatrace OneAgent across the Kubernetes nodes and instrument the orchestrator so a single distributed trace spans the whole serving loop — edge → gateway → feature fetch → query-tower inference → ANN retrieval → ranking → response — with the time and the candidate counts on each span. This is what makes the 120 ms budget enforceable: when p99 breaches, the trace immediately shows which stage ate the budget (a slow Redis shard, a cold ranking pod, an ANN recall-vs-latency setting), instead of a blind hunt. Set Dynatrace SLOs on p99 latency, error rate, and feature-store hit-rate, and let its causal-AI (Davis) correlate a latency regression to the exact deployment or node that introduced it.

Model-quality layer. Emit business and ML metrics as custom events into Dynatrace and put SLOs on them too: retrieval recall@K against logged engagement, prediction-serving skew (do online feature distributions match training?), candidate-pool size after filtering (a collapsing pool is an early warning of a licensing-filter bug), catalog coverage (are we only ever recommending the same 2,000 titles — a popularity-bias failure?), and embedding freshness lag. A drop in recall@K or a coverage collapse fires the same way a latency breach does, so a quality regression is treated as an incident, not a quarterly surprise. This is the single biggest maturity gap between teams that ship recommenders and teams that operate them.

A/B experimentation: how every change earns its way in

Nothing reaches all members on a hunch. Every model, every ranking tweak, every retrieval parameter ships behind a controlled experiment, and the experimentation platform is first-class infrastructure, not a spreadsheet.

Members are assigned to variants by deterministic hashing of a stable unit (the household, not the request — so the experience is consistent across a household’s sessions and devices, and you don’t contaminate by re-randomizing). A typical rollout: 1% canary → guardrails clear → 5% → 50/50 → ship. The metrics split into three classes that must all be respected:

Two statistical disciplines keep this honest. Sequential testing (always-valid p-values) lets you peek at results continuously and stop early when a variant is clearly winning or harmful, without the false-positive inflation of naively re-checking a fixed-horizon test. And a holdback group — a small slice (say 1%) that never gets the new model for a quarter — measures the cumulative, compounding value of the whole personalization effort against a frozen baseline, which is the number you show the board. The experiment platform reads the same logged events that feed training, so the loop is closed: what you measure is what you learn from.

Enterprise considerations

Security and data posture. The feature store and the data lake are the crown jewels — they hold viewing history, which in many markets is personal data with real regulatory weight. Wiz continuously scans the lake and warehouse for data-posture problems: PII landing in an unencrypted bucket, an over-permissioned IAM role that can read raw viewing logs, a public-exposure regression. At runtime, CrowdStrike Falcon sensors on every Kubernetes node detect compromise — a cryptominer in an inference pod, anomalous egress that could signal model or data exfiltration. Identity is least-privilege end to end: member sessions via Okta, workforce via Entra ID, machine credentials issued dynamically by HashiCorp Vault. Encrypt the online store in transit and at rest, and treat the embeddings themselves as sensitive — a query embedding is a fingerprint of a person’s taste, and a leaked item-embedding table is your recommendation IP.

Cost optimization. The economics are the whole reason for the two-stage split, and they still need active management. (1) Retrieval is cheap, ranking is expensive — so retrieve aggressively and rank conservatively: pulling ~500 candidates instead of ~2,000 and ranking 500 instead of all is the dominant lever, because ranking cost scales with candidate count. (2) Cache the embeddings, not the recommendations — re-running the item tower over 600K titles nightly is far cheaper than re-scoring every user-item pair, which is exactly the trap the old nightly system fell into. (3) Right-size ANN recall — pushing the index toward 99.9% recall costs latency and compute for a quality gain users cannot perceive past a point; tune it on the recall@K SLO, not vibes. (4) GPU only where it pays — the towers and DCN ranking justify GPUs; XGBoost ranking does not, so split node pools. (5) Kafka tiered storage and lake lifecycle policies keep the event history (which is also your training data) from becoming an unbounded bill. Dynatrace cost-and-utilization views tie spend back to the per-stage spans, so you optimize the stage that actually costs money.

Scalability. Each tier scales independently and on a different signal. The stateless orchestrator, query-tower, and ranking services scale horizontally on QPS/concurrency via the Kubernetes HPA and cluster autoscaler — straightforward. The online feature store scales by sharding on profile ID and adding read replicas for read-heavy QPS. The ANN index is the one to watch: a single 600K-vector index is trivial, but as the catalog and per-market variants grow you shard the index and fan out the query, and you decide the recall-vs-latency-vs-memory trade explicitly (HNSW gives great recall at higher memory; IVF is leaner but needs tuning). Kafka scales by partition count (partitioned by profile ID, so a single household’s events stay ordered). The natural ceiling is rarely compute — it is the embedding-refresh and index-rebuild time, which is why the incremental upsert path for new titles matters more as you grow.

Reliability and DR. Decide the numbers per tier. The serving path is the strict one: it is in the member’s critical path, so it runs active-active across at least two regions behind Akamai’s geo-routing, with the online feature store and ANN index replicated per region — RTO in low minutes via health-probe failover, and the pre-warmed popular-in-market fallback covering any single-region retrieval outage with near-zero member impact. The training and indexing loop is not latency-critical: if a region’s training cluster is down, you serve yesterday’s model and index and rebuild when it recovers — the durable source of truth is the event log and lake (geo-replicated), from which the entire model and index are reproducible. A pragmatic enterprise target: serving RTO 5 minutes / RPO seconds (you might lose a few seconds of just-emitted interaction events), and a fully rebuildable model and index within hours from the geo-redundant lake.

Governance. Pin model versions explicitly (the two-tower version and the ranking version) and promote them only through the offline-evaluation and online-experiment gates — never hot-swap the home-screen model without an experiment and a ServiceNow-approved rollback. Version feature definitions in the feature store so a feature change is reviewable and revertable. Keep an audit trail linking every shipped model to its training-data snapshot, its offline metrics, its experiment ID, and its approval ticket — because when someone asks “why was this recommended,” you need to reproduce the exact model and features that produced it. And honor member privacy controls: a “do not personalize” or “clear watch history” request must propagate to the feature store and remove the member from the personalized path, falling back to the generic popular shelf.

Reference enterprise example

Helix Media rebuilt around exactly this architecture. Their catalog: ~600,000 titles across twelve markets with heavy per-market licensing variation; ~38M monthly members, ~110M household profiles. The old nightly batch recommender was replaced stage by stage.

Decisions they made. They stood up Tecton as the feature store (Redis online tier, Snowflake offline) to kill the training/serving skew that had been quietly capping their old model — the rolling “last-hour watches” and decayed genre-affinity features now compute identically in Flink for serving and in the warehouse for training. Retrieval used a two-tower model with 128-dimension embeddings, the item tower run nightly over the full catalog plus a 5-minute incremental path for launches, indexed in Vespa with filtered ANN so market-licensing and maturity constraints applied inside the search. Ranking was DCN-v2 scoring ~500 candidates with crossed and real-time features, served on g5 GPU pods, with an XGBoost fallback model on CPU for the degraded path. The whole serving loop ran on EKS with a p99 budget of 120 ms, enforced and traced in Dynatrace end to end. Every change shipped through their Optimizely-driven experiment platform with household-level assignment and a permanent 1% holdback. Identity was Okta for members, Entra ID for staff; secrets came from HashiCorp Vault; Wiz watched the lake’s data posture and CrowdStrike Falcon ran on every node; model promotions were approved in ServiceNow.

The numbers. ~4,200 personalization requests/second at peak (evening prime time across the Americas), p99 ~108 ms inside the 120 ms budget. Daily item-embedding batch over 600K titles ran in ~40 minutes on a transient GPU pool. The shift from “re-score every user-item pair nightly” to “precompute item embeddings + retrieve + rank ~500” cut the recommendation compute bill by roughly 70% even as request volume rose, because the expensive ranking model now touches 500 items per request instead of the catalog. Retrieval recall@200 held above 0.94 against logged engagement on the Dynatrace SLO; catalog coverage tripled versus the old popularity-biased batch (the long tail finally surfaced).

The outcome. In a 50/50 experiment against the old batch system, the two-tower stack lifted home-screen start-rate by ~11% and first-week completion by ~6%, with p99 latency down, and zero kids-profile maturity leaks across the rollout (the guardrail would have auto-halted otherwise). The permanent holdback let Helix put a defensible number on personalization’s contribution to retention for the board. And because every change now ships behind a measured experiment with a ServiceNow-approved rollback, the content and ML teams stopped arguing about whether a ranking idea worked and started measuring it — which, more than any single model, was the cultural unlock.

When to use it

Use this architecture when you have a large item catalog (thousands to millions), a large user base, a strict serving latency budget, a need to react to fresh user behavior, and a business that demands every change be measured. That describes most large-scale personalization: media and content feeds, e-commerce product discovery, app/marketplace recommendations, music and podcast queues, and large-catalog search ranking.

Trade-offs to accept. Two stages mean two models to train, evaluate, and keep in sync, plus a feature store and an ANN index to operate — real moving parts and a real platform team. The two-tower model’s strength (decomposable towers) is also its ceiling: it cannot model user-item cross-features in retrieval, so a genuinely relevant item with a weak embedding match can be missed before ranking ever sees it (mitigated by retrieving generously and by multiple retrieval sources, not eliminated). And the system is only as fresh as your feature and embedding pipelines — stale features or a failed embedding batch degrades quality silently unless your observability catches it.

Anti-patterns. (1) One model for retrieval and ranking — you lose either precomputation or crossed features; keep them separate. (2) Filter-after-retrieve — thin shelves and safety leaks; push constraints into the ANN query. (3) No feature store — training/serving skew quietly caps your quality. (4) Shipping models without experiments — you cannot tell improvement from regression, and the home screen is too important to guess. (5) Infra-only observability — the system stays “green” while recommending garbage; put SLOs on recall and coverage too. (6) Letting query-tower and item-batch versions drift — the two towers stop sharing a vector space and retrieval silently collapses.

Alternatives, and when they win. If your catalog is small (hundreds of items), skip retrieval entirely and rank everything every request — the two-tower complexity earns nothing. If you have almost no interaction data yet, start with content-based or simple collaborative filtering and graduate to two-tower once you have engagement at scale. If sequence matters intensely — the order of recent actions strongly predicts the next one — a sequential/transformer recommender (SASRec-style) can replace or augment the query tower. And if you need to balance long-term value or exploration explicitly rather than greedily predict the next click, a bandit or reinforcement-learning layer sits on top of this stack as the policy that chooses among ranked candidates. The two-tower retrieve-and-rank architecture here is the proven default for large-catalog, low-latency personalization — the destination most teams should aim for, and the substrate the fancier ideas plug into.

Recommendation SystemsMachine LearningArchitectureFeature StoreMLOpsEnterprise
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading