Most teams can train a model. Very few can answer, on a Tuesday afternoon when the model starts returning garbage, which version is live, what data it was trained on, whether the features it sees in production match the ones it learned from, and how to roll back without a redeploy that takes an afternoon. The gap between “we have a model” and “we operate models” is the entire discipline of MLOps, and on Google Cloud that discipline has a concrete shape: Vertex AI Pipelines, Feature Store, Model Registry, Endpoints, and Model Monitoring, wired together so that a model’s whole lifecycle is reproducible, governed, and observable. This is the reference architecture for that platform — one that a five-person data team and a five-hundred-person ML org can both run, because the moving parts are the same and only the scale changes.
The business scenario
Picture a mid-market lender, an e-commerce marketplace, or a B2B SaaS company — any business where a handful of models now sit on the revenue path. Fraud scoring on every transaction. Propensity-to-churn driving a retention budget. Dynamic pricing. Demand forecasting that decides what gets bought. These are not science projects; if the fraud model goes blind, money walks out the door, and if the churn model silently drifts, marketing burns spend on the wrong cohort.
The failure mode these organizations hit is organizational, not algorithmic. The data scientist’s notebook computes a “days since last login” feature one way; the production service computes it another way, off a slightly stale replica, and the model degrades for reasons nobody can see — classic training-serving skew. The model that is live in production was trained from a CSV on someone’s laptop that no longer exists, so it cannot be reproduced or audited. Promotion to production is a person copying a model file and updating a config, with no record of who approved it or what it scored on the holdout set. When a regulator or a board member asks “why did the model decline this customer,” there is no lineage to walk back from prediction to model version to training data.
The platform in this article solves exactly that. It gives you one place features are defined and served so training and serving see identical values; one pipeline definition that produces a model the same way every time; one registry that is the single source of truth for what is promotable and what is live; endpoints that scale and roll out safely; and continuous monitoring that tells you when reality has drifted away from the data the model learned on. It is widely useful because the problems are universal — they show up the moment a second model and a second engineer enter the picture.
Architecture overview
The architecture is best read as a loop that data flows around, not a top-to-bottom stack. Five planes cooperate: a feature plane, a training plane, a registry/governance plane, a serving plane, and a monitoring plane that closes the loop back to training.
Imagine the diagram as a wide horizontal ring. On the left, raw and curated data lands in BigQuery (warehouse tables, event streams materialized from Pub/Sub via Dataflow, batch loads from Cloud Storage). A feature engineering step — a Dataflow or BigQuery job orchestrated as part of a pipeline — transforms that data into feature values and writes them to the Vertex AI Feature Store. The Feature Store has two faces: an offline store backed by BigQuery that serves point-in-time-correct feature values for training, and an online store backed by Bigtable that serves the same features at single-digit-millisecond latency for live prediction. That dual face is the whole point — both training and serving draw features from one definition, killing skew.
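To make "point-in-time-correct" concrete, here is a minimal sketch in plain Python. It is not the Feature Store API; the record layout and the helper name `point_in_time_join` are assumptions made for illustration. For each label row it picks the latest feature value whose timestamp is at or before the label's timestamp, so no future value leaks into training.

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_join(labels, feature_rows):
    """Attach to each label the latest feature value observed at or before it.

    labels:       list of (entity_id, label_timestamp, label)
    feature_rows: list of (entity_id, feature_timestamp, value)
    Timestamps are any comparable values (ints here for brevity).
    """
    # Group feature history per entity, sorted by timestamp.
    history = defaultdict(list)
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[entity_id].append((ts, value))

    joined = []
    for entity_id, label_ts, label in labels:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        # Number of feature rows with timestamp <= label timestamp.
        idx = bisect_right(timestamps, label_ts)
        value = rows[idx - 1][1] if idx > 0 else None  # None: nothing known yet
        joined.append((entity_id, label_ts, value, label))
    return joined

# Hypothetical data: u1's feature changes at t=10 and t=20; u2's first value is at t=15.
features = [("u1", 10, 3), ("u1", 20, 7), ("u2", 15, 1)]
labels = [("u1", 12, 0), ("u1", 25, 1), ("u2", 5, 0)]
training_rows = point_in_time_join(labels, features)
```

The label for `u2` at t=5 gets no feature value, because the only value for that entity was recorded later. A naive "latest value" join would have leaked it.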
Moving clockwise to the top, Vertex AI Pipelines (Kubeflow Pipelines under the hood, authored with the KFP DSL plus Google Cloud Pipeline Components) orchestrates the training workflow: pull a point-in-time feature set from the offline store, validate the data, train on Vertex AI Training (CPU/GPU/TPU), evaluate against a holdout, and — only if the eval gate passes — register the resulting model. Every pipeline run emits artifacts and lineage into Vertex ML Metadata, so each model is traceable back to the exact dataset, code, and hyperparameters that produced it. Vertex AI Experiments tracks runs so candidates can be compared.
At the top-right sits the Model Registry — the governance choke point. Pipelines do not deploy models; they register them as new versions of a logical model. A model version carries its eval metrics, its lineage, and a model card. Promotion is an explicit act: a human or an automated gate moves an alias (staging, production, champion, challenger) onto a specific version. Aliases, not version numbers, are what serving points at, so promotion and rollback are pointer moves, not redeployments.
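The register-then-promote discipline can be sketched as a toy in-memory registry. This is an illustration of the pointer idea only, not the Vertex AI SDK: the class `ModelRegistry`, its method names, and the `deploy-sa` actor are hypothetical, and how the serving layer follows an alias is outside the sketch.

```python
class ModelRegistry:
    """Toy registry: versions are immutable records, aliases are movable pointers."""

    def __init__(self):
        self.versions = {}   # version number -> metadata dict
        self.aliases = {}    # alias name -> version number
        self.audit_log = []  # (actor, alias, previous_version, new_version)

    def register(self, metrics):
        """Pipelines call this; it never touches aliases."""
        version = len(self.versions) + 1
        self.versions[version] = {"metrics": metrics}
        return version

    def move_alias(self, alias, version, actor):
        """Promotion and rollback are the same operation: repoint an alias."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        previous = self.aliases.get(alias)
        self.aliases[alias] = version
        self.audit_log.append((actor, alias, previous, version))

    def resolve(self, alias):
        return self.aliases[alias]

registry = ModelRegistry()
v1 = registry.register({"auc": 0.91})
v2 = registry.register({"auc": 0.93})
registry.move_alias("production", v1, actor="deploy-sa")
registry.move_alias("production", v2, actor="deploy-sa")  # promotion
registry.move_alias("production", v1, actor="deploy-sa")  # rollback
```

Every alias move leaves an audit record, and registering a version never changes what `production` resolves to.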
Down the right side, the serving plane deploys the version that carries the production alias to a Vertex AI Endpoint. For request/response, that is an online endpoint — a dedicated public endpoint for isolation, or a dedicated private endpoint over Private Service Connect when traffic must stay inside the VPC. The endpoint autoscales replicas against CPU/GPU utilization, and a single endpoint can host multiple deployed models with traffic splitting for canaries and A/B tests. For bulk scoring, Batch Prediction reads from BigQuery or Cloud Storage and writes results back, no standing endpoint required.
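As a rough illustration of what a percentage traffic split means for a canary, the sketch below routes requests to deployed models by weight. It is not how Vertex AI implements splitting internally; hashing the request id for a stable choice, the function name `pick_deployed_model`, and the model ids are all assumptions.

```python
import hashlib

def pick_deployed_model(request_id, traffic_split):
    """Route a request to a deployed model according to percentage weights.

    traffic_split maps deployed-model id -> integer percentage; must sum to 100.
    Hashing the request id makes the choice stable for a given request.
    """
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # 0..99
    cumulative = 0
    for model_id, percent in sorted(traffic_split.items()):
        cumulative += percent
        if bucket < cumulative:
            return model_id
    raise AssertionError("unreachable when percentages sum to 100")

# Hypothetical 90/10 canary between two deployed versions.
split = {"champion": 90, "challenger": 10}
counts = {"champion": 0, "challenger": 0}
for i in range(10_000):
    counts[pick_deployed_model(f"req-{i}", split)] += 1
```

Over many requests the challenger receives roughly a tenth of the traffic, which is what lets a canary fail small before the split is widened.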
At the bottom, the live request path: an application calls the endpoint; the serving container fetches the latest feature values for the entity (e.g. this user, this transaction) from the online Feature Store, combines them with request-time features, runs inference, and returns the prediction in tens of milliseconds. A sample of prediction requests and responses is logged.
The monitoring plane closes the ring. Vertex AI Model Monitoring compares the live serving distribution against the training baseline and raises feature drift and training-serving skew alerts; with v2 it can also watch prediction drift and models served outside Vertex. Those alerts feed back to the top of the ring: drift is the signal to trigger continuous training — the same pipeline runs again on fresh data, produces a new model version, and the loop repeats. That feedback edge is what makes the diagram a loop rather than a pipeline.
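A drift check boils down to comparing two distributions of the same feature. The sketch below uses Jensen-Shannon divergence over histograms, a common choice for numerical features. The bin edges, the 0.3 threshold, and the sample values are hypothetical, and this is a simplified stand-in rather than the Model Monitoring implementation.

```python
import math
from bisect import bisect_right

def histogram(values, edges):
    """Normalized bin frequencies; bins are left-closed, the top edge joins the last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if not edges[0] <= v <= edges[-1]:
            continue  # out-of-range values are ignored in this sketch
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

edges = [0, 2, 4, 6, 8, 10]
baseline = histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], edges)   # training-time values
serving = histogram([6, 7, 7, 8, 8, 8, 9, 9, 10], edges)   # live values, shifted up

same_score = js_divergence(baseline, baseline)
shifted_score = js_divergence(baseline, serving)
DRIFT_THRESHOLD = 0.3  # hypothetical per-feature threshold
drifted = shifted_score > DRIFT_THRESHOLD
```

In this deliberately extreme example the two histograms share no bins, so the score sits at the top of its range and the alert fires. Real drift is usually subtler, which is why thresholds are tuned per feature.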
Component breakdown
Each component earns its place by removing a specific failure mode. The table maps component to purpose to the configuration decisions that actually matter.
| Component | What it does | Why it’s here | Key configuration choices |
|---|---|---|---|
| BigQuery (data + offline store) | Warehouse for curated data and the offline Feature Store backend | Point-in-time-correct training data with no data movement | Partition/cluster feature tables; use feature_timestamp for point-in-time joins to avoid label leakage |
| Dataflow / BigQuery jobs | Compute feature values from raw/streaming data | Feature logic lives once, in pipeline code, not in two services | Stream features from Pub/Sub for freshness; batch backfills for history; idempotent writes |
| Vertex AI Feature Store | Offline (BigQuery) + online (Bigtable) feature serving | Single feature definition for train and serve → kills skew | Online store node count for QPS; sync cadence offline→online; TTLs on online entities |
| Vertex AI Pipelines | Orchestrates the ML workflow as a DAG (KFP) | Reproducible, parameterized, schedulable training | Compile to KFP IR; cache successful steps; use Google Cloud Pipeline Components for train/eval/register |
| Vertex AI Training | Managed custom/AutoML training jobs | Elastic compute without managing clusters | machineType + acceleratorType (e.g. NVIDIA_L4, NVIDIA_TESLA_A100); reduction server for distributed; Spot for cheap retrains |
| Vertex ML Metadata + Experiments | Lineage graph and run tracking | Auditability: prediction → version → data | Auto-logged by pipeline; tag runs with git SHA and dataset hash |
| Model Registry | Versioned catalog of models with aliases | The single source of truth for what is promotable/live | Logical model + versions; aliases (production, champion, challenger); attach model cards |
| Vertex AI Endpoints | Online serving of deployed model versions | Low-latency, autoscaling, safe rollouts | Dedicated public vs PSC private; min/max replicas; 60% util target; traffic split for canary |
| Batch Prediction | Large-scale offline scoring | Cost-efficient bulk inference, no standing endpoint | Source/sink in BigQuery or GCS; right-size machine pool per job |
| Model Monitoring | Drift / skew / prediction-drift detection | Detects silent degradation; triggers CT | Training baseline; sampling rate; thresholds per feature; alert to Pub/Sub/Cloud Monitoring |
Two components are worth dwelling on because teams under-invest in them and pay later.
The Feature Store is the load-bearing wall. Without it, every model team rebuilds feature pipelines, and serving inevitably computes features differently from training. With it, a feature is defined once, materialized to BigQuery for training and to Bigtable for serving from the same job, and reused across models. The configuration that bites people is freshness: the online store is only as current as the last sync, so for features that must be real-time (a running transaction count this session), you write them straight to the online store from a streaming Dataflow job rather than relying on a periodic offline→online sync.
The Model Registry is the governance choke point, not a filing cabinet. Its value is the discipline it enforces: pipelines register, they do not deploy; promotion moves an alias; serving follows the alias. This indirection is what makes rollback instant — repoint production from version 7 back to version 6 and the next requests route there, with zero rebuild. The model card attached to each version (intended use, training data summary, eval metrics, fairness slices) is what you hand a regulator or risk committee, and it is generated by the pipeline, not written after the fact.
Implementation guidance
Project and environment topology. Use separate GCP projects per environment — ml-dev, ml-staging, ml-prod — under a folder, with a shared ml-shared project for the artifact registry and Terraform state. The Model Registry, Feature Store, and Endpoints are regional resources; pick a region (e.g. europe-west1) and keep data residency in mind, since the offline store is BigQuery in that region. Promotion across environments is a registry alias move plus a controlled redeploy in the higher project, gated in CI.
Infrastructure as Code. Provision the platform with Terraform using the google and google-beta providers. The durable, slow-changing resources belong in Terraform; the fast-changing model versions do not (those are produced by pipelines).
| Resource | Terraform | Notes |
|---|---|---|
| Feature Store + online store | google_vertex_ai_feature_online_store, google_vertex_ai_feature_group | Bigtable-backed online store; feature groups map to BigQuery sources |
| Endpoint | google_vertex_ai_endpoint | Create the endpoint in TF; deploy models to it from the pipeline |
| Pipeline schedule | google_cloud_scheduler_job → pipeline run, or Vertex Pipeline Schedules | Cron-triggered continuous training |
| Service accounts + IAM | google_service_account, google_project_iam_member | One SA per plane (pipelines, serving, monitoring) |
| Networking | google_compute_network, PSC endpoint resources | VPC + Private Service Connect for private endpoints |
| Artifact + model storage | google_artifact_registry_repository, GCS buckets | Containers for custom training/serving; model artifacts in GCS |
A clean split is: Terraform owns the platform (stores, endpoints, IAM, network, schedules); the KFP pipeline owns the model lifecycle (train, evaluate, register, and — guarded by an approval gate — deploy). Trying to manage model versions in Terraform fights the tool; the registry is the state store for those.
Pipeline authoring and CI/CD vs CT. Author pipelines with the KFP SDK, lean on Google Cloud Pipeline Components for the train/evaluate/register/deploy steps, and compile to the pipeline IR as a build artifact. Distinguish two cadences clearly:
- CI/CD (continuous integration/delivery) ships pipeline code. A change to feature logic or model code goes through Cloud Build: unit tests, compile the pipeline, run it in ml-dev, and on green, publish the compiled pipeline and container images.
- CT (continuous training) runs the published pipeline on a schedule or a trigger to produce fresh model versions. New data, or a drift alert from Model Monitoring landing on Pub/Sub, kicks off a run that registers a new candidate. Promotion to production stays gated, usually by a human approval in Cloud Build or a metric threshold the candidate must beat the incumbent on.
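The metric-threshold form of the promotion gate can be sketched as a small pure function. The name `promotion_decision`, the AUC metric, and the default margin are assumptions for illustration; a real gate would read both scores from the registry's recorded eval metrics.

```python
def promotion_decision(candidate_auc, incumbent_auc, min_margin=0.005):
    """Return 'auto_promote' or 'human_review' for a freshly registered candidate.

    The candidate is promoted automatically only when it beats the incumbent
    by at least min_margin on the same holdout set; anything else goes to a person.
    """
    for auc in (candidate_auc, incumbent_auc):
        if not 0.0 <= auc <= 1.0:
            raise ValueError("AUC must be in [0, 1]")
    if candidate_auc - incumbent_auc >= min_margin:
        return "auto_promote"
    return "human_review"

clear_win = promotion_decision(candidate_auc=0.94, incumbent_auc=0.93)
marginal = promotion_decision(candidate_auc=0.931, incumbent_auc=0.93)
regression = promotion_decision(candidate_auc=0.90, incumbent_auc=0.93)
```

Keeping the gate as a pure function makes it easy to unit-test in CI, separately from the pipeline that calls it.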
Identity and networking wiring. Give each plane its own least-privilege service account: the pipeline SA can read the offline store, run training, and register models, but cannot deploy to prod; the deploy SA (used only behind the approval gate) can move aliases and deploy; the serving runtime SA can read the online store and write prediction logs and nothing else. For private serving, deploy the endpoint as a dedicated private endpoint over Private Service Connect so application VPCs reach it without traversing the public internet, and front it with the app’s existing internal load balancing. Keep BigQuery, GCS, and Vertex behind VPC Service Controls so data cannot exfiltrate to a project outside the perimeter even if credentials leak.
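One way to keep the per-plane split honest is to write it down as data and test it. The sketch below does that with made-up action names. They are placeholders rather than real IAM permission strings, and the account names are hypothetical.

```python
# Hypothetical action names for illustration; these are not real IAM permissions.
PLANE_PERMISSIONS = {
    "pipeline-sa": {"offline_store.read", "training.run", "registry.register"},
    "deploy-sa": {"registry.move_alias", "endpoint.deploy"},
    "serving-sa": {"online_store.read", "prediction_log.write"},
}

def is_allowed(service_account, action):
    """Least privilege: allow only actions explicitly granted to the account."""
    return action in PLANE_PERMISSIONS.get(service_account, set())

def overlapping_grants(permissions):
    """Return actions granted to more than one account, a smell worth reviewing."""
    seen, shared = set(), set()
    for actions in permissions.values():
        shared |= actions & seen
        seen |= actions
    return shared

pipeline_can_deploy = is_allowed("pipeline-sa", "endpoint.deploy")
deploy_can_promote = is_allowed("deploy-sa", "registry.move_alias")
shared_actions = overlapping_grants(PLANE_PERMISSIONS)
```

A check like this can run in CI next to the Terraform plan, so a change that lets the pipeline account deploy fails review before it reaches production.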
Enterprise considerations
Security & Zero Trust. Treat every plane as mutually distrusting. Per-plane service accounts with minimal IAM mean a compromised serving container cannot read training data or promote a model. Wrap the data and ML projects in a VPC Service Controls perimeter; serve privately via Private Service Connect; encrypt model artifacts and feature data with CMEK (customer-managed keys in Cloud KMS) where compliance requires control of the key. The registry’s alias mechanism is a Zero-Trust control: nothing reaches production without an explicit, audited alias move, and Cloud Audit Logs record who moved it. For regulated decisions, the per-version model card plus ML Metadata lineage gives you the “why did the model decide this” trail end to end.
Cost optimization. The two cost sinks are idle endpoints and over-eager training. For endpoints, set min replicas honestly — scale-to-low for spiky internal models, but remember online prediction nodes cost while provisioned, so a model that gets one request an hour should be a batch job, not a standing endpoint. Use the 60% utilization autoscaling target as the default and tune per workload. For training, run retrains on Spot machine types and cache successful pipeline steps so a re-run that only changed the eval step does not retrain from scratch. The online Feature Store (Bigtable nodes) is billed for provisioned capacity — size it to real QPS, not aspiration. Batch prediction for anything that can tolerate latency is dramatically cheaper than keeping an endpoint warm.
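To reason about what a utilization target does to replica count, a proportional scaling rule is a useful mental model: scale replicas by the ratio of observed to target utilization, then clamp to the configured bounds. The sketch below is that rule only. The function name and parameters are assumptions, and the managed autoscaler's actual behaviour (smoothing, cooldowns) is not modelled.

```python
def desired_replicas(current_replicas, utilization_pct, target_pct=60,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling: ceil(current * utilization / target), clamped to bounds.

    Percentages are integers so the ceiling division stays exact.
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    raw = -(-(current_replicas * utilization_pct) // target_pct)  # integer ceil
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical endpoint configured with min 3 / max 20 replicas at a 60% target.
busy = desired_replicas(3, 90, min_replicas=3, max_replicas=20)
quiet = desired_replicas(10, 30, min_replicas=3, max_replicas=20)
idle = desired_replicas(3, 5, min_replicas=3, max_replicas=20)
```

The `idle` case is the cost point made above: however low utilization falls, the endpoint never drops below its minimum replicas, so a rarely called model still pays for them.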
Scalability. Each plane scales independently. The online store scales by Bigtable node count for read QPS; endpoints scale by replica autoscaling and by sharding traffic across deployed models; training scales by machine type, accelerators, and distributed training with a reduction server. Because features are centralized, onboarding the tenth model is cheaper than the first — it reuses existing features and the same pipeline skeleton. This is the payoff that makes the architecture span small to large orgs: marginal model cost falls as the platform matures.
Reliability & DR (RTO/RPO). Define targets per plane, because they differ:
- Serving is the hot path. Run endpoints with min replicas ≥ 2 across zones; the regional endpoint survives a zonal failure transparently. For regional DR, register the same model version in a second region’s registry and pre-create a standby endpoint — failover is a traffic redirect at your load balancer. RTO: minutes (redirect), RPO: 0 for the model (versions are reproducible artifacts).
- Feature Store online must be backfillable: it is a serving cache of values that live authoritatively in BigQuery. If the online store is lost, re-materialize from the offline store. RPO is bounded by your sync cadence for batch-synced features; stream-written features need a replayable Pub/Sub source.
- Training/registry is not latency-sensitive. Pipelines are reproducible, metadata and registry are managed and regionally durable. A lost pipeline run is simply re-run. RTO: hours is fine.
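The recovery path for the online store described above amounts to replaying offline history and keeping the newest value per entity and feature. A minimal sketch follows, using a hypothetical row layout and the helper name `rematerialize_online`. It stands in for what a managed sync or backfill job would do.

```python
def rematerialize_online(offline_rows):
    """Rebuild an online view: latest value per (entity_id, feature) from history.

    offline_rows: iterable of (entity_id, feature_name, feature_timestamp, value)
    Rows may arrive in any order; the newest timestamp wins.
    """
    latest = {}
    for entity_id, feature, ts, value in offline_rows:
        key = (entity_id, feature)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (_, value) in latest.items()}

# Hypothetical offline history, deliberately out of order.
offline = [
    ("u1", "spend_90d_avg", 200, 41.5),
    ("u1", "spend_90d_avg", 100, 39.0),
    ("u2", "spend_90d_avg", 150, 12.0),
    ("u1", "days_since_login", 180, 2),
]
online_view = rematerialize_online(offline)
```

Because the rebuild depends only on the offline rows, it can be rerun safely. Anything written after the last offline snapshot is what the RPO above refers to.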
Observability. Three layers. Infrastructure metrics (endpoint latency, error rate, replica count, utilization) flow to Cloud Monitoring with SLO alerts. Model quality — drift, skew, prediction drift — comes from Vertex AI Model Monitoring against the training baseline, with thresholds per feature and alerts routed to Pub/Sub and on to your incident channel. Lineage from ML Metadata answers the audit questions. The crucial wiring is the drift-alert-to-retraining edge: a monitoring alert publishes to Pub/Sub, which triggers the CT pipeline, so the platform self-heals against gradual data shift rather than waiting for a human to notice degraded business metrics.
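The drift-alert-to-retraining edge needs a small piece of decision logic between the Pub/Sub message and the pipeline submission. The sketch below shows one shape it could take. The base64 `data` field follows the Pub/Sub message envelope, but the payload schema (`threshold`, `feature_scores`), the cooldown, and the function name are assumptions, and the actual pipeline submission is left out.

```python
import base64
import json

def should_trigger_retrain(pubsub_message, last_run_epoch, now_epoch,
                           cooldown_seconds=6 * 3600):
    """Decide whether a monitoring alert should start a continuous-training run.

    A cooldown keeps a burst of alerts from launching overlapping retrains.
    """
    payload = json.loads(base64.b64decode(pubsub_message["data"]).decode("utf-8"))
    breached = sorted(
        name for name, score in payload["feature_scores"].items()
        if score > payload["threshold"]
    )
    in_cooldown = (now_epoch - last_run_epoch) < cooldown_seconds
    return {"trigger": bool(breached) and not in_cooldown, "features": breached}

# Hypothetical alert payload, encoded the way a Pub/Sub message carries data.
alert = {"threshold": 0.3,
         "feature_scores": {"txn_count_1h": 0.42, "account_age_days": 0.05}}
message = {"data": base64.b64encode(json.dumps(alert).encode("utf-8")).decode("ascii")}

decision = should_trigger_retrain(message, last_run_epoch=0, now_epoch=24 * 3600)
suppressed = should_trigger_retrain(message, last_run_epoch=0, now_epoch=3600)
```

Returning the breached feature names alongside the decision gives the retraining run, and whoever reviews it, a record of why it started.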
Governance. The registry is the policy enforcement point. Enforce that only the gated deploy SA can move the production alias; require a model card and a passing eval before a version is promotable; keep champion/challenger aliases so you always have a tested rollback target. Audit logs over alias moves and IAM give you the compliance story. For fairness and responsible-AI obligations, compute sliced metrics in the eval step and persist them on the model card, so the evidence is attached to the version, not living in a notebook.
Reference enterprise example
Meridian Mutual, a fictional mid-market consumer lender (about 1,800 employees, ~2.4 million active accounts), runs three revenue-critical models: real-time fraud scoring on card-not-present transactions, credit-line propensity for cross-sell, and a batch collections-prioritization model. Before the platform, fraud recall had quietly slipped because the production “transactions in last hour” feature was computed off a 15-minute-stale replica while training used exact values — textbook skew — and nobody could reproduce the live model because it had been trained from an ad-hoc export. A regulator’s question about a declined applicant took the team three weeks to answer.
They rebuilt on this Vertex AI architecture in europe-west1, with ml-dev/ml-staging/ml-prod projects under VPC Service Controls. Decisions and numbers:
- Features: 140 features across the three models, defined once. Real-time fraud features (session transaction count, velocity) are written straight to the online store from a streaming Dataflow job for sub-second freshness; slower features (90-day spend averages) sync from BigQuery hourly. Online store sized at the QPS for ~900 transactions/second peak.
- Training & CT: the fraud pipeline retrains nightly on Spot GPU machines (NVIDIA_L4), ~40 minutes a run, with step caching so eval-only changes skip training. A Model Monitoring skew alert can trigger an off-schedule retrain via Pub/Sub.
- Promotion: nightly runs register a challenger; it is auto-promoted to production only if it beats the incumbent’s holdout AUC by a set margin, otherwise a human reviews. Promotion is an alias move — no redeploy.
- Serving: fraud runs on a dedicated private endpoint over PSC (transaction data never leaves the VPC), min 3 / max 20 replicas at 60% utilization, p99 inference ~28 ms including online feature fetch. Credit-line propensity shares an endpoint with a 90/10 traffic split for A/B. Collections runs as nightly Batch Prediction from BigQuery to BigQuery — no standing endpoint.
Outcome after one quarter: training-serving skew was eliminated, recovering roughly 6 points of fraud recall that the stale-feature bug had been costing. The “why was this applicant declined” answer dropped from three weeks to under an hour by walking ML Metadata from prediction to version to training dataset, with the model card in hand. Total platform run-rate landed near the cost of the single over-provisioned always-on endpoint they had before, because collections moved to batch and retrains moved to Spot. The tenth feature reused by a new model cost effectively nothing to onboard — the marginal-cost curve had bent the right way.
When to use it
Use this architecture when you have — or are about to have — more than one model on a path that matters, served to more than one consumer, maintained by more than one person. The moment those plurals appear, ad-hoc notebooks and hand-copied model files become the bottleneck and the risk, and the centralized feature/registry/monitoring loop pays for itself. It is equally valid at small scale: a two-person team gets reproducibility, skew-free serving, and instant rollback without running any clusters, because every plane is managed.
Trade-offs and anti-patterns to avoid:
- Standing endpoints for batch workloads. A model scored once a day does not need a warm endpoint burning replicas around the clock — use Batch Prediction. Keeping it online is the most common avoidable cost.
- Skipping the Feature Store “to move faster.” Computing features in the serving service is the express lane to training-serving skew and the exact bug this design exists to prevent. The shortcut costs you model accuracy you cannot see.
- Treating the Registry as storage. If pipelines deploy directly and serving points at version numbers, you have lost the governance and the instant rollback. Pipelines register; aliases promote; serving follows aliases.
- Monitoring without a feedback edge. Drift dashboards nobody wires to retraining are decoration. Connect the alert to the CT trigger or you will still find out from a business-metric dip a week late.
- One giant shared endpoint for everything. Multi-model per endpoint is great for A/B and canary within a model family, but co-tenanting unrelated, differently-scaling models couples their failure and scaling domains. Separate them.
Alternatives. If you are all-in on Gemini and generative models rather than classic predictive ML, the gravity shifts toward Model Garden, grounding, and agent tooling, and the feature-store-centric loop matters less. If you have one model and no near-term roadmap for a second, a single Vertex training job plus one endpoint is legitimately enough — adopt the registry and monitoring early but defer the full feature platform until the second model justifies it. And if you are deliberately multi-cloud with portability as a hard requirement, an open KFP-on-GKE plus an open feature store (e.g. Feast) trades managed convenience for portability — a real choice, but you take on the operational weight that Vertex otherwise carries for you. For the common enterprise case — several predictive models, real users, real governance, on Google Cloud — this Vertex AI platform is the path of least regret.