A data scientist trains a model that beats the baseline by four points of AUC. It lives in a notebook on her laptop. Six weeks later it is still on her laptop, because nobody can answer the questions that stand between a good experiment and a production system: which exact feature values did it train on, and can we reproduce them at serving time? Where do the credentials for the warehouse come from when the pipeline runs unattended at 2 a.m.? Who approved promoting it to production, and what happens when its inputs drift three months from now? MLOps is the discipline of answering those questions by construction rather than by heroics. This article is a reference architecture for a real MLOps platform: Kubeflow Pipelines, MLflow, and a Feast feature store on Amazon EKS, governed by GitOps and wired into the same security and observability fabric the rest of the enterprise already runs on. It is not a tutorial stack you tear down on Friday, but a platform that survives an audit and an on-call rotation.
The business scenario
The sharpest version of this problem lives in consumer lending. A fintech issuing instant-decision credit lines runs models that matter in two different ways: a fraud model that must score an application in under 80 milliseconds at the point of sale, and a credit-risk model whose decisions a regulator can demand to see explained, reproduced, and justified two years after the fact. The business is under genuine pressure from four directions at once: regulation (model risk management under SR 11-7 and fair-lending scrutiny), latency (a slow score is an abandoned checkout), scale (millions of decisions a day, hundreds of features each), and cost (GPU training bills and a feature store that, done naively, hammers the warehouse on every request).
The naive approaches fail in characteristic ways. Training in notebooks and copying a pickle to a server gives you a model nobody can reproduce — the features it learned on were computed by a SQL query that has since changed, so the offline AUC and the online behaviour quietly diverge. Each team building its own feature pipeline produces the single most expensive bug in production ML: training-serving skew, where the “average transaction amount over 30 days” computed in the training warehouse is subtly different from the one computed live at inference, and the model degrades for reasons no dashboard explains. Hand-deploying models means no audit trail of who promoted what, when, and on whose sign-off — which is exactly the artifact a model-risk examiner asks for first.
The platform threads these needles. Features are defined once and served two ways — a low-latency online store for inference and a point-in-time-correct offline store for training — so the same definition feeds both and skew is eliminated by design. Training runs as a versioned, parameterised pipeline whose every step, input dataset, and output metric is logged. Models live in a registry with explicit lifecycle stages and gated promotion, so “this model, version 7, was approved by Risk on this date and is the one in production” is a queryable fact, not tribal memory. The same shape serves a five-person team scoring thousands of decisions a day and a thousand-person org scoring millions — the difference is node-group sizing, store sharding, and how many regions you light up, not the architecture.
Architecture overview
The platform has three distinct planes that share infrastructure but run on different schedules, and keeping them mentally separate is the first step to operating it well: a feature plane (continuous, populates the stores), a training plane (batch, produces models), and a serving plane (synchronous, makes decisions). All three sit on a single multi-tenant Amazon EKS cluster provisioned and governed entirely through code.
Control plane and GitOps. Nothing reaches the cluster by kubectl apply from a human. Argo CD watches a Git repository of Kubernetes manifests and Helm/Kustomize overlays — Kubeflow, MLflow, Feast, the serving deployments — and continuously reconciles the cluster to match Git. A merge to main is a deploy; a git revert is a rollback; the cluster’s actual state and its declared state are the same object. Jenkins runs the CI side: on a pull request it builds and tests pipeline component images, runs unit tests on feature transforms, scans containers, pushes signed images to Amazon ECR, and bumps the image tag in the GitOps repo via PR — which Argo CD then rolls out. CI builds and tests; CD (Argo) reconciles; the boundary is clean.
Feature plane. Raw events and warehouse tables (Kafka streams, an S3 data lake, the analytics warehouse) feed transformation jobs whose outputs become the data sources behind Feast feature views. Feast works against two backing stores: an offline store (S3 + the warehouse), where the transformed data lives and from which point-in-time-correct training sets are built, and an online store (Amazon ElastiCache for Redis or DynamoDB), into which Feast materialises the latest feature value per entity for single-digit-millisecond reads at inference. The feature registry, the catalog of what features exist along with their types, owners, and freshness, is the contract that both training and serving code against.
Training plane. Kubeflow Pipelines orchestrates the training DAG: pull a point-in-time-correct dataset from Feast’s offline store, validate it, train (on GPU node groups when needed), evaluate against a hold-out and fairness slices, and — if the gates pass — register the resulting model in MLflow. Every run logs parameters, metrics, the exact dataset snapshot reference, and the model artifact to MLflow’s tracking server, with artifacts in S3 and metadata in a managed Amazon RDS for PostgreSQL backend. The MLflow Model Registry holds the lifecycle: Staging → Production → Archived, with transitions gated by approval.
Serving plane. Approved models are served by KServe (or Seldon) on EKS as autoscaling inference services. A request arrives at the edge, the serving service fetches the entity’s live features from Feast’s online store, runs the model, and returns a decision. Akamai sits in front as the global edge and WAF, terminating TLS close to the user and shielding the origin. Identity for every human and pipeline flows through Okta (or Entra ID) via OIDC; HashiCorp Vault issues short-lived database and cloud credentials so no long-lived secret ever sits in a manifest. Dynatrace (or Datadog) traces the request end to end and watches the models for drift; Wiz continuously audits the cluster and data stores for misconfiguration and exposure; CrowdStrike Falcon runs on the nodes for runtime threat detection; and ServiceNow is the system of record for production-promotion approvals and incidents.
Component breakdown
| Concern | Tool / service | Role in the platform | Key configuration choices |
|---|---|---|---|
| Cluster & infra | Amazon EKS + Terraform | Multi-tenant compute substrate, provisioned as code | IRSA for pod-level IAM; managed + GPU node groups; Karpenter for just-in-time scaling |
| Pipeline orchestration | Kubeflow Pipelines | Training/eval DAGs, parameterised and versioned | Argo Workflows backend; pipeline-as-code (KFP SDK v2); cached steps |
| Experiment tracking & registry | MLflow | Log params/metrics/artifacts; gate model lifecycle | RDS PostgreSQL backend, S3 artifact store; registry stages + aliases |
| Feature store | Feast | One feature definition → online + offline serving | Redis/DynamoDB online; S3/warehouse offline; point-in-time joins |
| Model serving | KServe (or Seldon) | Autoscaling, canary-able inference services | Scale-to-zero for cold models; transformer pulls online features |
| GitOps CD | Argo CD | Reconcile cluster to Git declaratively | App-of-apps; sync waves; auto-prune + self-heal |
| CI | Jenkins | Build/test/scan/sign images, bump GitOps tags | Ephemeral agents on EKS; image signing; PR to manifests repo |
| Identity / SSO | Okta or Entra ID | Human + workload auth via OIDC | SSO to Kubeflow/MLflow; SCIM provisioning; IRSA itself trusts the EKS cluster's OIDC issuer rather than the IdP |
| Secrets | HashiCorp Vault | Short-lived DB/cloud creds, no static secrets | Kubernetes auth; dynamic DB engine; Agent Injector sidecars |
| Edge / WAF | Akamai | Global TLS termination, L7 WAF, DDoS, caching | Origin shielding to ALB; bot mitigation on the decision API |
| Observability | Dynatrace or Datadog | Traces, metrics, model drift & data-quality monitors | OpenTelemetry from serving; custom drift/quality metrics |
| Cloud posture | Wiz | CSPM + data security posture over cluster & stores | Agentless scan of EKS, S3, RDS; toxic-combination alerts |
| Runtime security | CrowdStrike Falcon | Node + container runtime threat detection | Falcon sensor as DaemonSet; drift & exploit detection |
| ITSM / approvals | ServiceNow | Promotion approvals, change records, incidents | Change request gates Prod promotion; incident auto-create |
A few choices deserve the why, because they are the ones teams get wrong.
Why a feature store at all, and why one definition for two stores. The expensive failure in production ML is not a bad model — it is training-serving skew. If the training set is built by one SQL query and the serving features by a different code path, the model sees subtly different inputs in production than it learned on, and it degrades silently. Feast collapses that into a single source of truth: you declare a feature once, Feast computes a point-in-time-correct join for training (every feature value as it existed at the label’s timestamp — never leaking the future into the past) and serves the same logic’s freshest value online. Skew stops being a bug you chase and becomes a class of bug the architecture forecloses.
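The point-in-time rule is easy to state and easy to get wrong, so a small illustration may help. The following is a minimal pure-Python sketch of the idea, not Feast's implementation: for each label it picks the latest feature row at or before the label timestamp and discards rows older than a TTL. The function name, the tuple layout, and the sample values are all assumptions made for the example.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_join(labels, feature_rows, ttl):
    """For each (entity_id, label_ts) return the latest feature value whose
    timestamp is <= label_ts and no older than ttl; None if nothing qualifies."""
    # Group feature rows per entity, sorted by timestamp, so we can binary-search.
    by_entity = {}
    for entity_id, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        by_entity.setdefault(entity_id, []).append((ts, value))

    joined = []
    for entity_id, label_ts in labels:
        rows = by_entity.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, label_ts) - 1  # last row at or before label_ts
        value = None
        if idx >= 0 and label_ts - rows[idx][0] <= ttl:
            value = rows[idx][1]
        joined.append((entity_id, label_ts, value))
    return joined


# Hypothetical avg_amount_30d history for one applicant.
features = [
    ("a1", datetime(2024, 3, 1), 120.0),
    ("a1", datetime(2024, 3, 10), 150.0),
    ("a1", datetime(2024, 3, 20), 900.0),  # written after the first label: must not leak
]
labels = [
    ("a1", datetime(2024, 3, 11)),  # sees the 10 March value
    ("a1", datetime(2024, 2, 1)),   # no feature existed yet
    ("a1", datetime(2024, 3, 15)),  # 10 March value is older than the TTL
]
result = point_in_time_join(labels, features, ttl=timedelta(days=2))
```

The 20 March value never reaches the 11 March label, which is the leak a naive "latest value per applicant" join would introduce.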
Why MLflow as the registry, not a folder of pickles in S3. A model is not a file; it is a file plus the parameters, the dataset snapshot, the metrics, the environment, and the approval that put it in production. MLflow’s tracking server captures the first set automatically on every run, and the Model Registry adds the lifecycle: a model version sits in Staging, gets promoted to Production only through a gated transition, and the previous Production version moves to Archived but stays retrievable. When a regulator asks “reproduce the decision your model made on this application in March,” you query MLflow for the exact version, its dataset reference, and its params — and you can. “Which pickle was live in March?” is not a question a well-run lender should have to guess at.
Why Kubeflow Pipelines instead of a cron job that runs a script. Reproducibility and lineage. A KFP pipeline is a versioned DAG of containerised steps with typed inputs and outputs; each step’s inputs are content-addressed, so unchanged steps are cached and skipped, and every run records exactly which data and code produced which artifact. A cron’d script has none of this — no lineage, no caching, no typed contract, and no way to answer “what changed between the run that worked and the one that didn’t.” On EKS, KFP also gives you Kubernetes-native resource control (GPU requests, node selectors) the cron job can only fake.
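To make the caching idea concrete, here is a hedged sketch of content-addressed step caching in plain Python. It does not reproduce how Kubeflow Pipelines computes its cache keys; it only shows the principle that a key derived from everything that determines a step's output lets an unchanged step be skipped. The function names, the in-memory cache, and the image digest are hypothetical.

```python
import hashlib
import json


def step_cache_key(image_digest, command, inputs):
    """Hash everything that determines a step's output into a single key."""
    payload = json.dumps(
        {"image": image_digest, "command": command, "inputs": inputs},
        sort_keys=True,  # stable ordering so equal inputs give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache = {}  # stand-in for a persistent cache store


def run_step(image_digest, command, inputs, fn):
    """Run fn(inputs) unless an identical step already ran; return (output, was_cached)."""
    key = step_cache_key(image_digest, command, inputs)
    if key in _cache:
        return _cache[key], True
    output = fn(inputs)
    _cache[key] = output
    return output, False


# Hypothetical data-prep step that records how often it really executes.
calls = []


def prep(inputs):
    calls.append(inputs["snapshot"])
    return "prepared:" + inputs["snapshot"]


IMAGE = "sha256:0000000000000000"  # placeholder digest, not a real image
CMD = ["python", "prep.py"]
first = run_step(IMAGE, CMD, {"snapshot": "2024-03-01"}, prep)
second = run_step(IMAGE, CMD, {"snapshot": "2024-03-01"}, prep)  # identical: served from cache
third = run_step(IMAGE, CMD, {"snapshot": "2024-03-02"}, prep)   # new input: runs again
```

Because the image digest is part of the key, rebuilding a component image also invalidates the cache, which is one reason to pin images by digest rather than by tag.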
Why GitOps for the whole platform. Drift and auditability. When humans kubectl apply, the cluster’s real state diverges from any document, and you cannot answer “what is deployed right now and who changed it.” With Argo CD reconciling from Git, the deployed state is the Git history: every change is a reviewed commit, every rollback is a revert, and Argo’s self-heal corrects any out-of-band drift back to declared state. For a regulated lender, “show me the change that promoted this model and who approved it” is a git log and a linked ServiceNow record, not an archaeology project.
Implementation guidance
Provision everything with Terraform, and treat IAM as a first-class deliverable. The cluster, node groups, ECR repos, RDS for MLflow, ElastiCache for Feast, S3 buckets, and the OIDC provider all live in Terraform. The pattern that makes pod security tractable is IRSA (IAM Roles for Service Accounts): each workload’s Kubernetes service account maps to a narrowly scoped IAM role, so the Feast materialisation job can read the warehouse and write Redis but cannot touch the training buckets, and the serving pod can read the online store but nothing else. No node-wide instance profile that every pod inherits — least privilege at the pod boundary.
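# Feature-materialization job: read offline lake, write online store, nothing else
data "aws_iam_policy_document" "feast_materialize" {
  # Read-only access to the offline feature lake
  statement {
    actions   = ["s3:GetObject", "s3:ListBucket"]
    resources = ["arn:aws:s3:::lend-feature-lake", "arn:aws:s3:::lend-feature-lake/*"]
  }
  # Connect to the online store; assumes IAM authentication is enabled on the
  # replication group (a matching ElastiCache user ARN would normally be listed too)
  statement {
    actions   = ["elasticache:Connect"]
    resources = [aws_elasticache_replication_group.feast_online.arn]
  }
}

# Managed policy built from the document above; the IRSA module attaches it by ARN
resource "aws_iam_policy" "feast_materialize" {
  name   = "feast-materialize"
  policy = data.aws_iam_policy_document.feast_materialize.json
}

# IAM role assumable only by the feast-materialize-sa service account in the feast namespace
module "feast_irsa" {
  source           = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  role_name        = "feast-materialize"
  role_policy_arns = { policy = aws_iam_policy.feast_materialize.arn }
  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["feast:feast-materialize-sa"]
    }
  }
}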
Define a feature once; let Feast serve both planes. Feature views are code, reviewed in PR and version-controlled. The offline path builds training data via a point-in-time join; the online path is materialised on a schedule (or streamed) so inference reads are O(1).
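from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Entity the features are keyed on; the join key name is an assumption
applicant = Entity(name="applicant", join_keys=["applicant_id"])

txn_stats = FeatureView(
    name="txn_stats_30d",
    entities=[applicant],
    ttl=timedelta(days=2),  # values older than this are ignored in joins
    schema=[
        Field(name="avg_amount_30d", dtype=Float32),
        Field(name="txn_count_30d", dtype=Int64),
    ],
    online=True,  # materialised to the online store for single-digit-millisecond reads
    source=FileSource(
        path="s3://lend-feature-lake/txn_stats/",
        timestamp_field="event_timestamp",  # assumed column name; drives point-in-time joins
    ),
)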
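Training pulls these with store.get_historical_features(entity_df, features=[...]), where Feast joins each feature as of the label timestamp, and serving calls store.get_online_features(...) for the same fields. Both paths read one definition, so skew can no longer come from diverging feature logic; the remaining risk is stale materialisation, which the TTL and freshness monitoring bound.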
Wire the training pipeline to register, not just to run. The last step of every KFP pipeline logs to MLflow and registers the model only if the evaluation gates pass — so the registry never accumulates models that failed their own quality or fairness bar.
import mlflow

mlflow.set_tracking_uri("https://mlflow.lend.internal")

# params, auc, fairness_gap and model come from the upstream evaluation step
with mlflow.start_run() as run:
    mlflow.log_params(params)
    mlflow.log_metric("auc", auc)
    mlflow.log_metric("ks_protected_group_gap", fairness_gap)
    if auc >= 0.82 and fairness_gap <= 0.05:  # evaluation gate
        # Registers a new version of "credit_risk"; a new version has no stage
        # until it is explicitly moved to Staging
        mlflow.sklearn.log_model(model, "model",
                                 registered_model_name="credit_risk")
Promotion from Staging to Production is deliberately not automatic: it is triggered by a ServiceNow change request that Risk approves, which flips the MLflow stage (or alias) and, through the GitOps repo, points KServe at the new version. The audit trail writes itself.
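The promotion rule described above can be sketched as a small decision function. This is an illustration under stated assumptions, not a ServiceNow or MLflow integration: the ChangeRequest fields, the "approved" state string, and the "model-risk" group name are hypothetical, while the 0.82 AUC and 0.05 fairness-gap thresholds are the ones used in the registration gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical, simplified view of a change record fetched from the ITSM system."""
    number: str
    state: str
    approver_group: str
    model_name: str
    model_version: int


def can_promote(change, model_name, model_version, auc, fairness_gap):
    """Return (allowed, reason) for promoting a model version to Production."""
    if change.model_name != model_name or change.model_version != model_version:
        return False, "change request targets a different model version"
    if change.state != "approved":
        return False, "change request is not approved"
    if change.approver_group != "model-risk":
        return False, "approval did not come from the model-risk group"
    if auc < 0.82 or fairness_gap > 0.05:
        return False, "evaluation gates not met"
    return True, "approved under " + change.number


# Hypothetical change record for version 7 of the credit-risk model.
cr = ChangeRequest("CHG0012345", "approved", "model-risk", "credit_risk", 7)
decision = can_promote(cr, "credit_risk", 7, auc=0.84, fairness_gap=0.03)
```

Only when this returns True would the automation flip the registry stage or alias and open the GitOps pull request, so the approver and the change number travel with the promotion.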
Kill static secrets with Vault. Pods authenticate to HashiCorp Vault using their Kubernetes service-account token (Vault’s Kubernetes auth method), and Vault’s dynamic database secrets engine issues a short-lived PostgreSQL/warehouse credential per session that auto-expires. The Vault Agent Injector mounts it as a file; nothing long-lived sits in a manifest, an env var, or Git. Combined with IRSA for AWS APIs, the platform has no static secrets to rotate or leak — the lesson every team relearns the hard way after the first credential ends up in a commit.
Enterprise considerations
Security & Zero Trust. The architecture is identity-first by construction: humans reach Kubeflow and MLflow through Okta/Entra ID SSO, workloads get AWS access through IRSA and database access through Vault dynamic secrets, and no service holds a long-lived key. Layer on top: Wiz runs agentless posture management across EKS, S3, RDS, and ElastiCache, flagging the “toxic combinations” that matter here — a public bucket holding training data with PII, or an over-permissive IRSA role — that a single-resource scan would miss. CrowdStrike Falcon runs as a DaemonSet for runtime detection on the nodes, catching container drift and in-memory exploits that posture scanning cannot see. ECR images are scanned and signed in the Jenkins pipeline, and an admission policy (Kyverno/OPA Gatekeeper) refuses to run an unsigned or unscanned image — so the supply chain is enforced at the cluster door, not just hoped for in CI. Network policies isolate tenants and planes; the serving plane can reach the online store but not the training buckets.
Cost optimization. ML platforms bleed money in two places — idle GPUs and a feature store that over-reads the warehouse — so engineer for both from day one. (1) Just-in-time GPU nodes via Karpenter: training pods request GPUs, Karpenter provisions the node, and it is deprovisioned when the run ends, so you never pay for idle accelerators between nightly retrains. (2) Spot for training, on-demand for serving: training is interruptible and checkpointed, so run it on Spot at a large discount; the serving plane stays on-demand for stability. (3) Scale-to-zero serving with KServe for cold or low-traffic models — a champion-challenger experiment that gets one request an hour should not hold a warm pod. (4) Materialise the online store, do not query through: serving reads precomputed values from Redis at near-zero marginal cost instead of hitting the warehouse per request, which is both faster and an order of magnitude cheaper at volume. (5) Cache KFP steps: content-addressed step caching means an unchanged data-prep step on a re-run costs nothing.
| Lever | What it controls | Typical impact |
|---|---|---|
| Karpenter JIT GPU nodes | Idle accelerator spend between runs | Largest single saving on training |
| Spot for training | Compute price of interruptible jobs | ~60–70% off the GPU bill |
| KServe scale-to-zero | Warm pods for cold models | Eliminates idle serving cost |
| Online-store materialization | Warehouse load per inference | Removes a per-request cost line |
| KFP step caching | Recompute of unchanged steps | Cuts re-run time and compute |
Scalability. Each plane scales independently. The feature online store shards across Redis/DynamoDB to hold read QPS and entity cardinality; materialisation parallelises across the offline data. Training scales by GPU node count (Karpenter) and, for large models, distributed training within a KFP step. Serving autoscales pods on concurrency/latency (KServe + HPA, or KEDA on queue depth) and nodes via Karpenter, with canary rollouts shifting a slice of traffic to a new model version before full cutover. The natural ceilings are the online store’s read latency under burst and your GPU quota for training, which is why large lenders put serving behind Akamai: connections terminate close to the user and bot traffic is filtered at the edge, so the bursts that reach the origin are legitimate ones.
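As an illustration of what "shifting a slice of traffic" means, the sketch below splits requests deterministically by hashing a request identifier into 100 buckets. KServe exposes a canary traffic percentage on the inference service and handles routing itself; this code does not reproduce that mechanism, and the function name and identifiers are assumptions for the example.

```python
import hashlib


def route(request_id, canary_percent):
    """Send roughly canary_percent of requests to the canary, consistently per id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Simulate 10,000 hypothetical application ids at a 10% canary.
ids = ["app-" + str(i) for i in range(10_000)]
canary_share = sum(route(i, 10) == "canary" for i in ids) / len(ids)
```

Hashing rather than drawing a random number means a retried request lands on the same model version, which keeps canary metrics comparable.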
Reliability & DR (RTO/RPO). Decide the numbers per plane, because they differ. The serving plane is the tight one, since a credit decision API that is down is lost revenue, so it runs active-active across AZs (and, for the largest deployments, regions) with the online store replicated; target RTO in minutes, RPO in seconds. The training plane can tolerate hours of downtime: pipelines are re-runnable from versioned data, so its DR guarantee is “the data and the pipeline definitions are durable,” not “the cluster is always up.” MLflow’s RDS backend and S3 artifacts are the system of record for models, so back up RDS (PITR) and replicate the artifact bucket cross-region, because losing the registry means losing the provenance of every production model. A pragmatic enterprise target: serving RTO 5 min / RPO 30 s, training and registry rebuildable from durable storage within hours. Akamai health checks detect an unhealthy origin and fail ingress traffic over to a healthy one automatically.
Observability & model monitoring. Instrument the decision path end to end with Dynatrace/Datadog via OpenTelemetry: one trace covers edge → feature fetch → inference → response, with latency on each hop, so a slow decision is attributable to the online store or the model, not a mystery. Beyond infra metrics, emit the ones ML actually lives or dies on — feature drift (has the live distribution of avg_amount_30d moved from training?), prediction drift, online/offline feature parity (a direct skew alarm), p99 feature-fetch latency, and model accuracy on delayed labels as ground truth arrives weeks later. A drift breach opens a ServiceNow incident and can trigger a Kubeflow retrain pipeline — closing the loop from detection to remediation without a human paging through dashboards at midnight.
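One common way to put a number on feature drift is the Population Stability Index (PSI). The article does not prescribe a specific metric, so treat this as one reasonable choice rather than the platform's method. The sketch bins a training sample and a live sample on fixed edges and sums the weighted log-ratio of bin proportions; the bin edges and sample values are invented, and the 0.1 and 0.25 levels mentioned afterwards are a widely quoted rule of thumb, not a standard.

```python
import math


def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical avg_amount_30d samples: training is uniform over four bins,
# live traffic has shifted towards larger amounts.
edges = [15, 25, 35]
training = [10, 20, 30, 40] * 25
live = [10] * 10 + [20] * 20 + [30] * 30 + [40] * 40

psi_same = psi(training, training, edges)
psi_shifted = psi(training, live, edges)
```

Under the usual rule of thumb, values below about 0.1 are read as stable and values above about 0.25 as a significant shift; the shifted sample here falls in between, the range where a team might open a ticket rather than trigger a retrain.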
Governance & model risk. For a regulated lender this is the point of the whole platform. Every production model has, queryable in MLflow: its training dataset reference, parameters, evaluation and fairness-slice metrics, the run that produced it, and the ServiceNow change record (with approver) that promoted it. Pin component and base images by digest so pipelines do not drift under you; promote model versions through evaluation and fairness gates, never on AUC alone. Keep feature definitions and pipeline code in Git so any computation is reviewable and revertable. The combination satisfies the examiner’s three questions — reproduce it, explain it, show who approved it — as routine queries rather than fire drills.
Reference enterprise example
Northwind Credit, a fictional digital consumer lender (~600 employees, ~3 million credit decisions a month across point-of-sale and app channels), built this platform to get models out of notebooks and under model-risk governance after a regulator flagged that they could not reproduce a six-month-old decision.
Decisions they made. They consolidated onto a single EKS cluster, Terraform-provisioned, with three node groups: general (Graviton, on-demand) for the control plane and serving, GPU (Spot, Karpenter-provisioned) for training, and a small on-demand pool for stateful add-ons. Feast defined ~140 features across applicant, transaction, and device entities; the offline store was S3 + Snowflake, the online store ElastiCache for Redis, materialised every 15 minutes for batch features and streamed for real-time ones. The fraud model served on KServe with the transformer fetching online features, hitting a p99 of ~52 ms end to end behind Akamai; the credit-risk model ran the same serving path with a heavier model and a 200 ms budget. Kubeflow Pipelines ran nightly retrains pulling point-in-time-correct sets from Feast, logging everything to MLflow (RDS Postgres + S3). Promotion to Production required a ServiceNow change approved by the Model Risk team, which flipped the MLflow alias; Argo CD then rolled KServe to the new version with a 10% canary for an hour. Okta fronted Kubeflow/MLflow SSO; Vault issued short-lived Snowflake and Postgres credentials; Jenkins built, scanned, and signed every component image and bumped the GitOps repo. Wiz watched posture, CrowdStrike Falcon the runtime, Dynatrace the traces and drift.
The numbers. ~3M decisions/month, peaking ~140 requests/second on weekend retail spikes. Monthly run cost landed near ₹19.6 lakh (~$23,500): EKS + serving on-demand nodes ~$7,000, GPU training on Spot + Karpenter ~$4,500 (against an estimated ~$13,000 for the same jobs on on-demand GPU capacity), ElastiCache online store ~$3,200, RDS + S3 + Snowflake compute for offline/MLflow ~$4,000, and Akamai + Dynatrace + the security tooling the remainder. Scale-to-zero on the eleven champion-challenger experiments saved roughly a node’s worth of idle serving; online-store materialisation kept Snowflake out of the serving path entirely, which their FinOps lead estimated saved more than the Redis cluster cost.
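The figures above can be cross-checked with a few lines of arithmetic. The sketch uses only the numbers quoted for the fictional Northwind Credit example; the 30-day month and the variable names are assumptions.

```python
monthly_total_usd = 23_500
line_items_usd = {
    "eks_and_serving": 7_000,
    "gpu_training_spot": 4_500,
    "elasticache_online": 3_200,
    "rds_s3_snowflake": 4_000,
}

# Whatever is left covers Akamai, Dynatrace and the security tooling.
remainder_usd = monthly_total_usd - sum(line_items_usd.values())  # 4800

# Spot saving relative to the ~$13,000 on-demand GPU figure.
gpu_on_demand_usd = 13_000
spot_saving_usd = gpu_on_demand_usd - line_items_usd["gpu_training_spot"]  # 8500
spot_discount = spot_saving_usd / gpu_on_demand_usd  # about 0.654

# Unit economics and load shape, assuming a 30-day month.
decisions_per_month = 3_000_000
cost_per_decision_usd = monthly_total_usd / decisions_per_month  # about 0.0078
avg_rps = decisions_per_month / (30 * 24 * 3600)                  # about 1.16
peak_to_avg = 140 / avg_rps                                       # about 121
```

Two things fall out of it: the Spot discount of roughly 65% sits inside the 60 to 70% range in the cost-lever table, and the peak is about two orders of magnitude above the average rate, which is why serving autoscaling matters more than steady-state sizing.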
The outcome. Time from a finished experiment to a governed production model fell from “weeks of manual handoff” to a same-day pipeline-and-approval. The regulator’s reproducibility finding closed: any production decision could be replayed from the exact model version and its point-in-time feature set in MLflow and Feast. Training-serving skew incidents — previously the cause of two unexplained model-degradation events the prior year — went to zero once Feast owned both planes. And in a regional game day, serving failed over inside the 5-minute RTO on the replicated online store while training, being re-runnable, was simply paused.
When to use it
Use this architecture when you run more than a handful of models, you serve at least one of them with a real latency budget, you need reproducibility and an audit trail (regulation, model risk, or just sanity), and training-serving skew is a risk you cannot hand-wave. That covers most production ML at scale — fraud and credit scoring, real-time recommendation and pricing, churn and risk models, anywhere features must be consistent between training and serving and a model’s lineage must be provable.
Trade-offs to accept. This is a platform, and platforms have operational weight: you are running Kubeflow, MLflow, Feast, Argo CD, and KServe, each with its own upgrades and failure modes. The feature store adds a freshness pipeline you must keep materialised, or online values go stale. GitOps demands discipline: if someone kubectls around Argo, you have drift and a debugging session. And the latency of a real-time decision is the sum of feature fetch and inference, so budget for both and treat the online store’s tail latency as a first-class SLO.
Anti-patterns. (1) Separate feature pipelines for training and serving — the skew bug, the most expensive in production ML; let one definition serve both. (2) Pickles in S3 instead of a registry — no lineage, no gated promotion, no answer for the auditor. (3) Static secrets in manifests — they end up in Git; use Vault dynamic credentials and IRSA. (4) kubectl apply by humans — drift and no audit trail; reconcile from Git. (5) On-demand GPUs sitting idle between nightly runs — the biggest needless line on the bill; provision them just-in-time on Spot. (6) No drift or feature-parity monitoring — the model degrades silently and you learn about it from the loss numbers, not a dashboard.
Alternatives, and when they win. If you have one or two models and modest scale, a managed platform — Amazon SageMaker (with SageMaker Feature Store, Pipelines, and Model Registry), Vertex AI, or Azure ML — gives you most of this without operating the open-source stack, and is the right call when you value speed over portability and are happy inside one cloud. If you are multi-cloud, want to avoid lock-in, or already run everything on Kubernetes, the Kubeflow/MLflow/Feast stack here is the portable destination. If your models are all batch with no low-latency serving, you can drop the online store and Akamai entirely and keep just the training-plus-registry spine. And if you are a small team standing this up for the first time, start with MLflow plus a managed feature store and graduate to the full Kubeflow-on-EKS platform when the number of models, the latency requirements, or the governance burden demand it. The architecture here is the destination, not always the starting line.