A training script that runs once on a data scientist’s laptop is not a model lifecycle. Production MLOps is the discipline of making every run reproducible, lineage-tracked, governed, and re-runnable from a commit hash — so that six months from now you can answer “which data and which code produced the model serving this prediction?” without guessing. This guide builds that on Vertex AI end to end: a typed Kubeflow pipeline, Feature Store and managed datasets wired into training, promotion through the Model Registry, canary deployment to an endpoint, drift monitoring, and the CI/CD plus IAM scaffolding that ties it together.
1. The architecture: KFP components, artifacts, and metadata lineage
Vertex AI Pipelines is a managed runner for pipelines authored with the Kubeflow Pipelines (KFP) v2 SDK. You compile a Python pipeline to a JSON spec; Vertex executes each step as an isolated container, captures inputs and outputs as typed artifacts, and records the whole graph in Vertex ML Metadata. That metadata store is the point of the whole exercise — it is what gives you lineage.
commit / schedule
|
v
[KFP pipeline spec (compiled JSON in GCS)]
|
v
[Vertex AI Pipelines runner] --records--> [Vertex ML Metadata]
| | | ^
v v v |
[ingest]->[train]->[evaluate]->[register]---------+
| | | |
Dataset Model Metrics Model Registry version
(artifact)(artifact)(artifact) |
v
[Endpoint: canary -> 100%]
Two distinctions matter before you write code:
- Components are the unit of execution. A lightweight Python component is a decorated function packaged into a container at compile time; a container component points at a prebuilt image. Both declare typed inputs and outputs.
- Parameters vs. artifacts. Parameters are small values (a string, an int) passed by value. Artifacts (
Dataset,Model,Metrics) are files in Cloud Storage plus metadata, passed by reference. Getting this right is what makes lineage work — Vertex tracks artifact-to-artifact edges, not parameter values.
Pin your SDK. KFP v2 and the
google-cloud-aiplatformSDK move quickly and the compiled spec format is versioned. Pin exact versions (for examplekfp==2.*and a known-goodgoogle-cloud-aiplatform) in the build image and in CI so a compile today produces the same spec next quarter.
2. Author a typed pipeline with custom and prebuilt components
Start with one custom component. The @component decorator turns a function into a containerized step; base_image and packages_to_install define its environment. Inputs and outputs are declared with type annotations — Output[Dataset] hands you a path to write to, and Vertex registers the result as an artifact.
from kfp import dsl
from kfp.dsl import component, Input, Output, Dataset, Model, Metrics
@component(
base_image="python:3.11-slim",
packages_to_install=["pandas==2.2.2", "scikit-learn==1.5.1", "joblib==1.4.2"],
)
def train(
training_data: Input[Dataset],
model: Output[Model],
metrics: Output[Metrics],
max_depth: int = 8,
):
import pandas as pd, joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
df = pd.read_csv(training_data.path)
X, y = df.drop(columns=["label"]), df["label"]
clf = RandomForestClassifier(max_depth=max_depth, random_state=42)
score = cross_val_score(clf, X, y, cv=5).mean()
clf.fit(X, y)
# Write to the artifact path Vertex provisioned, then log lineage metadata.
joblib.dump(clf, model.path)
model.metadata["framework"] = "sklearn"
metrics.log_metric("cv_accuracy", float(score))
Compose components into a pipeline with @dsl.pipeline. Passing task_a.outputs["x"] into task_b is what creates the dependency edge and the lineage link — never shuttle data through side channels.
@dsl.pipeline(
name="fraud-training",
pipeline_root="gs://acme-mlops-artifacts/pipeline-root",
)
def training_pipeline(project: str, region: str, max_depth: int = 8):
ingest_task = ingest(project=project, region=region)
train_task = train(
training_data=ingest_task.outputs["training_data"],
max_depth=max_depth,
)
evaluate(
model=train_task.outputs["model"],
metrics=train_task.outputs["metrics"],
)
Compile to a spec. The compiled JSON is the deployable artifact — treat it like a build output, not source.
python -c "from kfp import compiler; import pipeline; \
compiler.Compiler().compile(pipeline.training_pipeline, 'fraud_training.json')"
For heavy lifting, lean on Google Cloud Pipeline Components (GCPC) rather than rolling your own — pip install google-cloud-pipeline-components. It ships first-party, supported ops such as CustomTrainingJobOp (runs a training container as a Vertex Custom Job, which is how you get GPUs and distributed training), ModelUploadOp, EndpointCreateOp, and ModelDeployOp. Prefer these for any step that touches a Vertex resource; they are maintained against API changes and emit the correct metadata.
3. Wire Feature Store and managed datasets into training
Reproducibility dies the moment training reads from a mutable source. Two Vertex primitives fix this.
Managed datasets give a stable, versioned handle to training data. Reference one in the pipeline and Vertex records the dataset resource as lineage. Create it from BigQuery or GCS:
gcloud ai datasets create \
--display-name="fraud-training-v3" \
--metadata-schema-uri="gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml" \
--region=us-central1
Feature Store solves the harder problem: training-serving skew caused by computing features differently in batch and online paths. The current generation, Vertex AI Feature Store, serves features directly from a BigQuery source registered as a feature view. You define the source once; offline training reads point-in-time-correct values and online serving reads the same definitions, so a feature means the same thing in both places.
# Online store that backs low-latency serving
gcloud ai feature-online-stores create fraud_online_store \
--region=us-central1 \
--bigtable-min-node-count=1 \
--bigtable-max-node-count=3 \
--bigtable-cpu-utilization-target=70
# A feature view mapping a BigQuery source into that store
gcloud ai feature-views create txn_features \
--feature-online-store=fraud_online_store \
--region=us-central1 \
--big-query-source-uri="bq://acme.features.txn_features" \
--entity-id-columns=account_id \
--sync-config-cron="0 * * * *"
In the pipeline, an ingest component fetches the offline feature values for the training entities, writing them as an Output[Dataset] so the exact slice is captured. The principle: training data flows in as a tracked artifact, never a raw query embedded in code.
4. Promote artifacts through the Model Registry with versioning
A trained model that lives only as a file in a bucket has no governance story. Upload it to the Vertex AI Model Registry, which gives a stable resource name, immutable version IDs, and aliases (such as default or champion) that you can repoint without changing serving config.
The clean pattern is a register step in the pipeline using ModelUploadOp. Crucially, set parent_model so a new upload becomes the next version of an existing model rather than a brand-new model — this is what builds a version history you can audit and roll back across.
from google_cloud_pipeline_components.v1.model import ModelUploadOp
ModelUploadOp(
project=project,
location=region,
display_name="fraud-detector",
parent_model="projects/acme/locations/us-central1/models/1234567890",
unmanaged_container_model=train_task.outputs["model"],
serving_container_image_uri=(
"us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-5:latest"
),
)
Gate the upload behind your evaluate step. In KFP, return a boolean from evaluation and branch with dsl.If so a model that fails the accuracy or skew bar never reaches the registry. The registry is a promotion boundary, not a dumping ground.
| Concept | What it is | Use it for |
|---|---|---|
| Model | Stable parent resource | The logical model (“fraud-detector”) |
| Version | Immutable child of a Model | Each promoted training run |
| Alias | Movable pointer to a version | champion, default, fast rollback |
Inspect history from the CLI before any deploy:
gcloud ai models list --region=us-central1
gcloud ai models describe MODEL_ID --region=us-central1
5. Deploy to endpoints: traffic splitting, canaries, and autoscaling
A Vertex Endpoint is the serving surface; you deploy one or more model versions behind it and split traffic by percentage. This is the mechanism for canaries: deploy the new version at 10 percent, watch monitoring, then shift to 100.
Create the endpoint, then deploy the candidate alongside the incumbent with a traffic split:
gcloud ai endpoints create \
--display-name="fraud-detector-ep" \
--region=us-central1
# Deploy candidate at 10%, leave 90% on the existing deployed model.
gcloud ai endpoints deploy-model ENDPOINT_ID \
--region=us-central1 \
--model=MODEL_ID \
--display-name="fraud-v3-canary" \
--machine-type=n1-standard-4 \
--min-replica-count=2 \
--max-replica-count=10 \
--traffic-split=0=90,NEW_DEPLOYED_MODEL=10
Key points that bite people:
min-replica-countgoverns cost and cold starts. Set it to at least 2 in production for availability; setting it to 1 saves money but gives you no headroom and a single point of failure.- Autoscaling ranges between min and max replicas based on load. There is no scale-to-zero for standard online endpoints — if you need that economic profile, you are looking at a different serving pattern, not this one.
- Promotion is a traffic update, not a redeploy. Once the canary looks healthy, shift traffic without touching the deployed models:
gcloud ai endpoints update ENDPOINT_ID \
--region=us-central1 \
--traffic-split=NEW_DEPLOYED_MODEL=100
Rollback is the same command pointed back at the old deployed model ID. Because both versions stay deployed during the canary window, rollback is seconds, not a rebuild.
6. Model Monitoring for training-serving skew and feature drift
A model that was accurate at deploy time silently rots as production data shifts. Vertex AI Model Monitoring samples prediction requests, compares feature distributions against a baseline, and alerts when they diverge.
Two signals to configure:
- Training-serving skew compares live serving feature distributions to the training dataset. Use this to catch a feature that is computed or scaled differently in serving than it was in training — the classic silent failure that Feature Store reduces but does not eliminate.
- Prediction drift compares recent serving distributions to an earlier serving window. Use this to catch the world changing under a model that was deployed correctly.
You attach a monitoring job to the endpoint, point it at the training data as baseline, set per-feature thresholds (a distance score above which an alert fires), and set a sampling rate so you monitor a representative fraction without paying to log every request. Wire alerts to a notification channel and triage on a threshold breach — a fired skew alert is a strong prior that retraining is due.
Pick thresholds empirically. Start with the platform defaults, watch the distance scores for a week, then tighten. Thresholds set too tight produce alert fatigue; too loose and drift is real before anyone notices.
7. Trigger pipelines from CI/CD and Cloud Scheduler with parameterization
A pipeline you launch by hand from a notebook is not production. Two triggers cover the real cases.
CI/CD on commit. A merge to main should compile the pipeline, push the spec to GCS, and submit a run. The submit step is a few lines with the SDK; parameter_values is where reproducibility lives — every knob is an explicit input, never a default buried in code.
from google.cloud import aiplatform
aiplatform.init(project="acme", location="us-central1")
job = aiplatform.PipelineJob(
display_name="fraud-training",
template_path="gs://acme-mlops-artifacts/specs/fraud_training.json",
pipeline_root="gs://acme-mlops-artifacts/pipeline-root",
parameter_values={"project": "acme", "region": "us-central1", "max_depth": 8},
enable_caching=True,
)
job.submit(service_account="vertex-pipelines@acme.iam.gserviceaccount.com")
In Cloud Build, run that submit inside a step. Note submit() is non-blocking, which is what you want in CI — fire the run and let Vertex own the long-running execution rather than holding a build minute open for an hour.
steps:
- name: "python:3.11"
entrypoint: bash
args:
- -c
- |
pip install -q kfp==2.* google-cloud-aiplatform
python compile_pipeline.py
gsutil cp fraud_training.json gs://acme-mlops-artifacts/specs/
python submit_pipeline.py
options:
logging: CLOUD_LOGGING_ONLY
Scheduled retraining. For periodic retraining, use the native pipeline scheduler rather than a hand-rolled cron job — it manages the recurring run and reuses the same parameterized template.
schedule = job.create_schedule(
cron="0 3 * * 1", # 03:00 every Monday
display_name="weekly-fraud-retrain",
max_concurrent_run_count=1,
service_account="vertex-pipelines@acme.iam.gserviceaccount.com",
)
enable_caching=Trueskips steps whose inputs are unchanged, which saves real money on reruns. Turn it off for the scheduled retrain — fresh data is the entire point, and a cache hit on ingest would defeat it.
8. Governance: IAM on runs, cost controls, and reproducibility guarantees
Three controls turn a working pipeline into a governed one.
IAM, least privilege. Pipeline steps execute as a dedicated service account, not the default Compute Engine SA. Create one and grant only what the pipeline touches:
gcloud iam service-accounts create vertex-pipelines \
--display-name="Vertex Pipelines runtime"
gcloud projects add-iam-policy-binding acme \
--member="serviceAccount:vertex-pipelines@acme.iam.gserviceaccount.com" \
--role="roles/aiplatform.user"
The runtime SA also needs object access on the pipeline-root bucket and read access on data sources (BigQuery, Feature Store). Grant those scoped to the specific bucket and dataset, not project-wide. Humans submitting runs need roles/aiplatform.user; reserve admin roles for the platform team.
Cost controls. The recurring spend is endpoint replicas (running 24/7) and training compute. Right-size min-replica-count, label every Vertex resource for cost attribution, and set a billing budget with alerts so a runaway autoscale or a forgotten endpoint surfaces before the invoice does.
Reproducibility guarantees. This is the payoff. With the pieces above, every production model satisfies: the code is a compiled spec from a commit; the data is a versioned managed dataset and point-in-time Feature Store reads; the parameters are explicit parameter_values; the environment is pinned base images; and the lineage is recorded in ML Metadata linking the serving model version back through training to the exact input artifacts. That chain is what lets you answer the audit question instead of shrugging at it.
Enterprise scenario
A fraud team’s weekly retrain started shipping models that passed every offline gate but degraded live precision within hours. The lineage graph looked clean — versioned dataset, point-in-time Feature Store reads, the works. The actual cause: their scheduled run had inherited enable_caching=True from the CI submit script. Cloud Scheduler fired every Monday, but the ingest component’s inputs (project, region, the BigQuery view URI) were byte-identical week over week, so Vertex served a cached ingest artifact from weeks earlier. Training ran on stale data while the feature view kept syncing fresh rows for serving — a self-inflicted training-serving skew that monitoring eventually flagged, but only after the bad version took traffic.
The fix was twofold. First, caching off on the schedule, non-negotiable:
job = aiplatform.PipelineJob(
display_name="weekly-fraud-retrain",
template_path="gs://acme-mlops-artifacts/specs/fraud_training.json",
pipeline_root="gs://acme-mlops-artifacts/pipeline-root",
parameter_values={"project": "acme", "region": "us-central1", "max_depth": 8},
enable_caching=False, # fresh data is the entire point of a retrain
)
job.create_schedule(
cron="0 3 * * 1",
max_concurrent_run_count=1,
service_account="vertex-pipelines@acme.iam.gserviceaccount.com",
)
Second, they made the ingest input change on purpose: the component now takes a data_snapshot_date parameter bound to the run date, so identical-input cache hits are impossible by construction even if someone re-enables caching. The lesson the platform team wrote into their runbook: caching keys off declared inputs, not data freshness — a pipeline that reads a mutable source must encode time as an explicit parameter or it will silently train on the past.
Verify
Confirm each layer is actually wired, not just declared.
# Pipeline run reached completion
gcloud ai pipeline-jobs describe PIPELINE_JOB_ID --region=us-central1 \
--format="value(state)"
# expect: PIPELINE_STATE_SUCCEEDED
# A new model version was registered
gcloud ai models list --region=us-central1 --filter="displayName=fraud-detector"
# Endpoint shows the expected traffic split
gcloud ai endpoints describe ENDPOINT_ID --region=us-central1 \
--format="yaml(trafficSplit, deployedModels[].id)"
# Online prediction returns
gcloud ai endpoints predict ENDPOINT_ID --region=us-central1 \
--json-request=instances.json
In the console, open the pipeline run and confirm the lineage graph links the dataset artifact, the model artifact, and the registered version — if the edges are missing, you passed data through a side channel instead of as a typed output.
Production checklist
Pitfalls
- Caching surprises.
enable_cachingis on by default. A “successful” rerun that finished in 30 seconds probably hit the cache and trained nothing new. Disable it for scheduled retrains. - The default service account trap. Pipelines fall back to the Compute Engine default SA, which is wildly over-privileged. Always pass an explicit
service_account. - Forgotten endpoints. A deployed model with
min-replica-count >= 1bills continuously whether or not it serves traffic. Undeploy stale canaries; budget alerts are your backstop. - Parameters smuggled as defaults. Every value a run depends on must be an explicit
parameter_value. A threshold or path hidden as a function default breaks the reproducibility chain the moment someone edits the code. - Skew baselines that drift with you. A monitoring baseline pinned to training data catches serving bugs; one pinned to a recent serving window will slowly accept the drift as normal. Run both signals and know which question each answers.