A trained model is worthless until something can call it under load, safely, at a price the business will tolerate. That last mile — inference serving — is where most ML platforms quietly fall apart. The data-science team ships a model.pkl and a Flask wrapper; six months later there are forty of those wrappers, each on its own VM, each with a different way to roll out a new version, none of them autoscaling, all of them holding an expensive GPU idle through the night. Nobody can answer “what changed when the fraud-detection model’s false-positive rate spiked at 2am,” because there is no canary, no traffic history, and no per-model latency dashboard. This article is a reference architecture for doing model serving properly on Kubernetes with KServe: progressive canary rollouts, GPU autoscaling that scales to zero when traffic dies, ModelMesh to pack thousands of small models onto a shared fleet, and a Prometheus/Grafana stack that makes the whole thing observable. Not a demo — a multi-tenant inference platform that holds up from a handful of models to a thousand.
The business scenario
The driver here is a financial-services firm whose models now sit directly in the revenue path — and in the regulator’s path. Picture a mid-size payments company: a real-time fraud-scoring model gates every card authorization (single-digit-millisecond budget, because it runs inline before the issuer responds), a credit-decisioning model runs synchronously when a customer applies for a limit increase, and a fleet of smaller per-merchant anomaly models — hundreds of them — watch transaction streams for account takeover. On top of that, the data-science org wants to start serving an internal LLM for analyst copilots and document extraction, which means GPUs enter the picture and the cost conversation changes entirely.
The pressures are specific and they compound. Latency: the fraud model’s p99 has a hard ceiling because exceeding it means the authorization times out and the transaction is declined — a declined good transaction is lost revenue and an angry customer. Scale: card volume is brutally spiky — lunchtime, Black Friday, payday — so capacity must follow demand within seconds, not minutes. Regulation: a model that makes credit decisions is subject to model-risk governance (think SR 11-7 / fair-lending scrutiny); every version that ever touched a real decision must be traceable, and you cannot roll out a new challenger model to 100% of traffic on a Friday and hope. Cost: GPUs for the LLM are the single largest line item on the platform, and an A100 sitting idle overnight at on-demand pricing will end up in a finance review with your name on it.
The naive approaches each fail in a predictable way. One VM per model gives you no bin-packing, no shared autoscaling, and a bespoke deployment ritual per team — it does not survive past a dozen models. A single monolithic inference service couples every model’s release cycle and blast radius together; one bad deploy takes down fraud and credit. A managed SaaS inference endpoint is fast to start but the regulated data and the model weights leave your boundary, the per-call price does not pencil out at card volume, and you lose the fine-grained control over canary percentages and GPU scaling that the business specifically needs. What the firm needs is a standardized serving substrate — every model deployed the same way, every rollout progressive and reversible, GPUs shared and scaled to demand, and the whole estate observable from one pane of glass. That substrate is KServe on Kubernetes.
Architecture overview
KServe gives you a single Kubernetes custom resource — the InferenceService — that abstracts the whole serving lifecycle behind a declarative spec. Underneath, it leans on Knative Serving for request-driven autoscaling (including scale-to-zero) and revision-based traffic splitting, and on a model-serving runtime (the model server itself) chosen per framework. You describe what you want — this model, this runtime, this much GPU, this canary percentage — and the platform reconciles the how.
There are two distinct serving modes that share infrastructure but solve different problems, and keeping them mentally separate is the first step to operating this well. Serverless mode (one pod-set per InferenceService, Knative-driven) is for the heavyweight, latency-sensitive, or GPU-bound models — fraud, credit, the LLM — where each model justifies its own scaling behavior and you want scale-to-zero on the GPU ones. ModelMesh mode is for the long tail: the hundreds of small per-merchant anomaly models, where giving each its own pod would waste enormous memory; ModelMesh instead loads many models into a shared pool of serving pods, pulling them in and out of memory on demand like a cache.
Request path, numbered as in the diagram: (1) a client (the authorization service, an analyst app) calls the platform through an ingress edge: Akamai at the CDN/edge tier for external surfaces and DDoS absorption, fronting an Istio ingress gateway inside the cluster that does mTLS, routing, and L7 policy. (2) Calls are authenticated against the enterprise IdP (Entra ID or Okta issuing OIDC/JWT tokens), and the gateway validates the token and attaches the caller’s identity and tenant before anything reaches a model. (3) The request lands on KServe’s routing layer, which consults the InferenceService’s traffic split and forwards the request to either the stable (Predictor) revision or the canary revision according to the configured percentage. (4) Inside a predictor pod, an optional Transformer does pre/post-processing (feature lookup, tokenization, output formatting), then calls the Predictor, the actual model server (Triton, TorchServe, the HF/vLLM runtime for the LLM), which runs the forward pass: on GPU for the LLM and credit ensemble, on CPU for lightweight low-latency models such as the fraud scorer. (5) For the long-tail models the request instead hits ModelMesh, which checks whether the requested model is resident; if not, it pulls the model artifact from the model registry / object store (S3/Blob/GCS), loads it into a serving pod, and routes the call, keeping hot models in memory and evicting cold ones. (6) Throughout, every component emits metrics scraped by Prometheus, visualized in Grafana, and traced into Dynatrace or Datadog for end-to-end APM; the answer streams back to the caller.
The control plane runs alongside: model artifacts and versions live in a model registry (MLflow, or a cloud equivalent) backed by object storage; a GitOps pipeline (Argo CD / Flux) reconciles InferenceService manifests from Git so every rollout is a reviewed, auditable commit; and Terraform stands up the cluster, node pools, GPU drivers, and IAM. Secrets — registry credentials, signing keys, downstream API tokens — come from HashiCorp Vault via the Vault Agent injector or Secrets Store CSI, never baked into images.
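To make the GitOps leg concrete, here is a minimal sketch of an Argo CD Application that reconciles one tenant's serving manifests from Git. The repository URL, path, and application name are hypothetical placeholders, not anything the platform prescribes; the automated sync settings are one reasonable choice, and a regulated tenant might prefer manual sync behind the change-approval gate.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: risk-prod-serving              # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    # Hypothetical repository that holds the InferenceService manifests
    repoURL: https://git.example.com/ml-platform/serving-manifests.git
    targetRevision: main
    path: tenants/risk-prod            # assumed one-directory-per-tenant layout
  destination:
    server: https://kubernetes.default.svc
    namespace: risk-prod
  syncPolicy:
    automated:
      prune: true                      # delete resources removed from Git
      selfHeal: true                   # revert manual drift in the cluster
```

With this shape, a rollout is a merged commit under `tenants/risk-prod`, and the Git history doubles as the audit trail.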
Component breakdown
| Component | Technology | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge / ingress | Akamai + Istio ingress gateway | DDoS absorption, TLS termination, mTLS, L7 routing into the mesh | WAF at Akamai; Istio AuthorizationPolicy per tenant; gateway mTLS STRICT |
| Identity | Entra ID / Okta (OIDC) | Issue and validate JWTs; bind calls to tenant/identity | RequestAuthentication + AuthorizationPolicy validating issuer & audience |
| Serving control plane | KServe InferenceService CRD | Declarative model serving: predictor, transformer, traffic split, autoscale | canaryTrafficPercent; minReplicas/maxReplicas; runtime per framework |
| Autoscaling | Knative Serving (KPA) | Request/concurrency-driven scaling incl. scale-to-zero | autoscaling.knative.dev/target (concurrency); scale-to-zero for GPU svc |
| Predictor runtime | Triton / TorchServe / vLLM | The model server executing the forward pass | Triton for multi-framework + dynamic batching; vLLM for LLM throughput |
| Long-tail density | KServe ModelMesh | Pack thousands of small models onto a shared pod pool | Memory-based LRU eviction; per-model ServingRuntime |
| GPU scheduling | NVIDIA GPU Operator + node pools | Expose GPUs, MIG/time-slicing, driver lifecycle | Tainted GPU node pool; cluster-autoscaler scale-from-zero on GPU nodes |
| Model registry | MLflow + S3/Blob/GCS | Versioned model artifacts, lineage, stage promotion | Immutable version URIs; stage tags drive canary promotion |
| GitOps delivery | Argo CD / Flux | Reconcile serving manifests from Git; auditable rollouts | App-of-apps per tenant; sync waves; drift detection |
| CI | GitHub Actions / Jenkins | Build runtime images, run eval gates, open the promotion PR | Model eval + load test as required checks before merge |
| Secrets | HashiCorp Vault | Registry creds, signing keys, downstream tokens | Agent injector / CSI; short-lived dynamic creds |
| Security posture | Wiz + CrowdStrike Falcon | CSPM/data-posture on the cluster; runtime threat detection on nodes | Wiz scans IaC + cluster config; Falcon sensor as DaemonSet |
| Change governance | ServiceNow | Change requests + approvals for prod promotions | Argo promotion gated on an approved CR |
| Observability | Prometheus + Grafana + Dynatrace/Datadog | Metrics, dashboards, SLOs, distributed tracing | ServiceMonitor scrape; recording rules for p99 & GPU util |
A few choices deserve the why, because they are the ones teams get wrong.
Why a CRD instead of hand-rolled Deployments. The temptation is to “just” write a Deployment + HPA + Service + Ingress per model. That works for one model and collapses at forty — every team reinvents readiness probes, traffic shifting, and autoscaling, inconsistently. The InferenceService collapses all of that into one declarative object with sane, security-reviewed defaults, so a data scientist ships a spec, not a Kubernetes thesis, and the platform team controls the substrate once.
Why Knative for autoscaling rather than the HPA. Out of the box, the Horizontal Pod Autoscaler scales on CPU/memory, which are poor proxies for inference load: a GPU model can be saturated at 30% CPU. Knative’s Pod Autoscaler (KPA) scales on concurrency / requests-per-second, which is what actually correlates with latency, and crucially it can scale to zero: an LLM endpoint used only during business hours drops to no GPU pods overnight and cold-starts on the next request. For models that cannot tolerate cold starts (fraud), you pin minReplicas: N and forgo scale-to-zero; it is the same knob with a different value per model.
Why ModelMesh for the long tail, not more InferenceServices. Each serverless InferenceService is at least one pod holding its model in memory. Five hundred per-merchant anomaly models that way is five hundred idle pods — financially absurd. ModelMesh inverts the model: a fixed pool of serving pods, into which models are loaded on demand and evicted LRU-style, so a few dozen pods can serve a thousand models with the hot ones always resident. The tradeoff is a possible load latency on a cold model and slightly more complex routing — acceptable for low-QPS tail models, wrong for your hot path.
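For contrast with the serverless specs elsewhere in this article, a tail model handed to ModelMesh might look like the sketch below. The model name, format, and storage path are hypothetical; the deployment-mode annotation is the switch that routes the InferenceService to ModelMesh instead of giving it its own pod set. It assumes a ModelMesh-enabled namespace with a ServingRuntime that supports the chosen format.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: merchant-0417-anomaly          # hypothetical per-merchant model
  namespace: risk-prod
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh   # shared pod pool, not a dedicated pod set
spec:
  predictor:
    model:
      modelFormat: { name: sklearn }   # assumed format for a small tabular model
      storageUri: s3://models/merchant-anomaly/merchant-0417/v3   # hypothetical, immutable URI
```

Note what is absent: no replica counts and no resource limits per model. Capacity is a property of the shared runtime pool, which is exactly why the density works.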
Canary rollouts
This is the heart of why the firm chose KServe, so it is worth being concrete. With a single field, KServe runs two revisions of a model side by side and splits live traffic between them:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-scorer
  namespace: risk-prod
spec:
  predictor:
    minReplicas: 3              # never scale to zero: hot path
    maxReplicas: 40
    canaryTrafficPercent: 10    # 10% to the new revision, 90% to the last good one
    model:
      modelFormat: { name: triton }
      storageUri: s3://models/fraud-scorer/v37   # immutable, registry-pinned
      resources:
        limits: { cpu: "4", memory: 8Gi }
```
When you apply a spec pointing at v37 with canaryTrafficPercent: 10, KServe stands up a new revision, routes 10% of live traffic to it, and keeps the previously-promoted revision serving the other 90%. You watch the canary’s real metrics — p99 latency, error rate, and the business signal (fraud catch-rate, false-positive rate) — on Grafana. If it holds, you bump the percentage (25 → 50 → 100) over hours; if it degrades, you set it back to 0 and traffic instantly reverts to the known-good revision with zero redeploy. Because the revision is immutable and addressable, a rollback is a traffic change, not a rebuild.
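The rollback described above is a one-field change. As a sketch, reusing the fraud-scorer spec from this section, the reverted manifest keeps the v37 revision defined but sends it no traffic:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-scorer
  namespace: risk-prod
spec:
  predictor:
    minReplicas: 3
    maxReplicas: 40
    canaryTrafficPercent: 0     # rollback: all traffic returns to the last promoted revision
    model:
      modelFormat: { name: triton }
      storageUri: s3://models/fraud-scorer/v37
      resources:
        limits: { cpu: "4", memory: 8Gi }
```

Promotion is the mirror image: once the canary has earned it, remove the canaryTrafficPercent field and re-apply, and v37 becomes the stable revision. Both moves are Git commits, so both are reviewable.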
The progression should be automated, not nursed by hand. A common pattern wires Argo Rollouts (or a small controller) to Prometheus analysis: it bumps canaryTrafficPercent on a schedule only if a query like histogram_quantile(0.99, ...) < threshold and error_rate < threshold hold for the canary’s metrics; otherwise it aborts and pins to 0. For the credit-decisioning model, that automated gate is itself gated by a ServiceNow change request — a human risk-owner approves the promotion to 100% because a regulator may later ask who signed off. GitHub Actions (or Jenkins) runs the offline eval and a load test as required checks before the promotion PR can even merge, so a model that fails its golden-set evaluation never reaches a canary.
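One way to express such a gate is an Argo Rollouts AnalysisTemplate, sketched below. Treat the specifics as assumptions: the Prometheus address, the 8 ms and 1% thresholds, and the metric and label names (taken here from Knative's queue-proxy metrics, where latencies are reported in milliseconds) all need checking against what your cluster actually exports. The business signals (catch rate, false-positive rate) are deliberately left out because they come from your own pipeline.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-latency-and-errors      # hypothetical name
  namespace: risk-prod
spec:
  args:
    - name: revision                   # canary revision name, supplied per run
  metrics:
    - name: p99-latency-ms
      interval: 1m
      count: 10                        # ten consecutive one-minute checks
      failureLimit: 1                  # more than one failed check fails the analysis
      successCondition: result[0] < 8  # assumed budget, in milliseconds
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed in-cluster address
          query: |
            histogram_quantile(0.99,
              sum by (le) (
                rate(revision_request_latencies_bucket{revision_name="{{args.revision}}"}[5m])
              )
            )
    - name: error-ratio
      interval: 1m
      count: 10
      failureLimit: 1
      successCondition: result[0] < 0.01   # assumed 1% ceiling
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            (sum(rate(revision_request_count{revision_name="{{args.revision}}", response_code_class="5xx"}[5m])) or vector(0))
            /
            sum(rate(revision_request_count{revision_name="{{args.revision}}"}[5m]))
```

The `or vector(0)` guard keeps the error query from returning an empty result when the canary has served no 5xx responses at all, which would otherwise register as an analysis error instead of a pass.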
| Strategy | What it does | Rollback | Best for |
|---|---|---|---|
| Canary (traffic %) | N% live traffic to new revision, rest to stable | Set % to 0 — instant, no redeploy | Default for every model; required for regulated ones |
| Blue/Green | 100% cut from old revision to new at once | Re-point tag back | Low-risk batch/offline models |
| Shadow (mirror) | Copy traffic to new revision, ignore its responses | N/A (never served) | Validating a challenger on real traffic with zero user risk |
| Pinned (no canary) | Single revision, manual cutover | Manual | Dev/test only |
Shadow deployments deserve a mention: KServe can mirror production traffic to a candidate model whose responses are discarded, so you measure a challenger’s latency and outputs against real load with no customer exposure — invaluable for the credit model, where you want to compare the new model’s decisions to the incumbent’s on real applications before a single live decision is at stake.
GPU autoscaling
GPUs are where this architecture earns or loses its budget. Three layers cooperate:
- Pod-level (Knative): the GPU-bound InferenceService (the LLM) scales pods on concurrency. For business-hours-only workloads, minReplicas: 0 enables scale-to-zero, so overnight there are no LLM pods and therefore no GPU cost. The price is a cold start (pull image + load weights into VRAM), which for a multi-GB model is tens of seconds; mitigate with node-level warm pools or a minReplicas: 1 floor if even one cold start per morning is unacceptable.
- Node-level (cluster autoscaler): GPU nodes are expensive and slow to provision, so the GPU node pool is tainted (only GPU workloads schedule there) and configured to scale from zero: when a GPU pod is pending and no GPU node exists, the autoscaler adds one; when the pool is idle, it scales back to zero nodes. Pair this with spot/preemptible GPU nodes for batch and shadow workloads and on-demand for the hot path.
- GPU-sharing (NVIDIA GPU Operator): not every model needs a whole A100. MIG (Multi-Instance GPU) partitions one physical GPU into isolated slices, and time-slicing lets several low-QPS models share a GPU cooperatively. The credit ensemble and a couple of mid-size models can share one card via MIG instead of each pinning a full GPU.
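The node-level and GPU-sharing layers show up in a predictor spec as a toleration, a node selector, and a MIG resource request. The fragment below is a sketch under several assumptions: the taint key, the node-pool label, the artifact path, and the MIG profile name (which depends on the GPU model and on the MIG strategy configured in the GPU Operator) are illustrative and not taken from the platform described here.

```yaml
spec:
  predictor:
    nodeSelector:
      pool: gpu                        # hypothetical label on the GPU node pool
    tolerations:
      - key: nvidia.com/gpu            # assumed taint key on the GPU node pool
        operator: Exists
        effect: NoSchedule
    model:
      modelFormat: { name: triton }
      storageUri: s3://models/credit-ensemble/v12   # hypothetical artifact path
      resources:
        limits:
          nvidia.com/mig-3g.20gb: "1"  # one MIG slice, not a whole card (assumed profile)
```

Because the pod is the only thing that tolerates the taint, a pending replica is also the signal the cluster autoscaler needs to bring the GPU pool up from zero.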
A representative GPU spec with scale-to-zero:
```yaml
spec:
  predictor:
    minReplicas: 0              # scale to zero overnight
    maxReplicas: 8
    scaleTarget: 4              # target concurrent requests per pod
    scaleMetric: concurrency
    annotations:
      autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"
    model:
      modelFormat: { name: vllm }   # high-throughput LLM runtime
      storageUri: s3://models/analyst-copilot-llm/v4
      resources:
        limits: { nvidia.com/gpu: "1" }
```
The honest tradeoff: scale-to-zero saves the most money on the most expensive resource but introduces cold-start latency, so you apply it surgically: yes for the internal analyst LLM (humans tolerate a one-time spin-up of tens of seconds), never for inline fraud scoring. The economic lever is enormous: an LLM endpoint that runs 10 GPU-hours/day instead of 24 cuts that line by roughly 58% before any other optimization.
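The saving is simple arithmetic. Assuming a flat hourly GPU price and ignoring the cost of cold starts, with $h$ the GPU-hours actually run per day:

```latex
\text{saving} \;=\; 1 - \frac{h}{24}
\qquad\Rightarrow\qquad
h = 10:\;\; 1 - \frac{10}{24} \;=\; \frac{14}{24} \;\approx\; 0.58
```

That is consistent with the rough figure quoted above, and it makes the sensitivity obvious: every extra hour the endpoint stays warm gives back about four percentage points of the saving.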
Failure modes and reliability
Decide the numbers per tier, then engineer to them. Cold-start storms: if many models scale from zero simultaneously (a regional failover, a deploy that cycles everything), you get a thundering herd of image pulls and weight loads. Mitigate with image pre-pull DaemonSets, a warm node pool, and staggered minReplicas floors on the hot models. GPU exhaustion: when the GPU pool hits its ceiling or the cloud is out of that instance type, GPU pods sit Pending. Guard the hot path with PriorityClasses so fraud/credit preempt batch and shadow workloads, and set realistic maxReplicas plus quota alerts. Bad canary: covered above — its blast radius is bounded to canaryTrafficPercent, and rollback is a traffic flip. Model-server crashloop (OOM on a fat model, corrupt artifact): readiness probes keep a non-ready revision out of rotation, and because the prior revision is still live, users never see it. Registry/object-store outage: a model that needs loading cannot, so keep hot models pinned in memory (minReplicas ≥ 1) and cache artifacts on nodes so a registry blip does not take down resident models.
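The PriorityClass guard for GPU exhaustion can be sketched as two classes. The names and numeric values here are hypothetical; only the relative ordering matters.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-hot-path             # hypothetical class for fraud and credit predictors
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority # may evict lower-priority pods to get scheduled
description: "Inline fraud and credit predictors."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-batch                # hypothetical class for batch and shadow workloads
value: 1000
globalDefault: false
preemptionPolicy: Never                # waits for capacity, never evicts others
description: "Batch, eval, and shadow workloads."
```

A predictor opts in with `priorityClassName: inference-hot-path` in its pod spec. When the GPU pool is full, the scheduler then evicts a shadow or batch pod so the hot path does not sit Pending.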
For DR, a pragmatic enterprise target is RTO 15 minutes, RPO ~0 for the serving layer: the model artifacts are immutable and replicated in geo-redundant object storage (your durable source of truth), the InferenceService manifests live in Git, so recovery in a paired region is “point Argo CD at the cluster and let it reconcile” — minutes, not a rebuild. Multi-region active/active for the hot path runs the same manifests in two clusters behind Akamai global load balancing with health-based failover.
Security
The platform is Zero-Trust by construction and you layer defense-in-depth on top. Identity everywhere: no anonymous calls — Entra ID / Okta JWTs validated at the Istio gateway via RequestAuthentication, and mTLS STRICT inside the mesh so pod-to-pod traffic is mutually authenticated and encrypted. Secrets: registry credentials, model-signing keys, and downstream tokens come from HashiCorp Vault with short-lived dynamic credentials injected at runtime — nothing in images, nothing to rotate manually. Supply chain: models are artifacts and must be treated like code — sign them (cosign) and verify the signature before load, so a tampered or unverified model never reaches a GPU; pin immutable version URIs (.../v37, never latest). Posture and runtime: Wiz continuously scans the cluster config and IaC for misconfigurations and data-exposure paths (an InferenceService accidentally exposed without auth, an over-broad IAM role on the model bucket), while CrowdStrike Falcon runs as a node DaemonSet for runtime threat detection — catching a compromised serving pod attempting lateral movement or crypto-mining on those idle GPUs. Tenant isolation: per-namespace NetworkPolicy and Istio AuthorizationPolicy so the risk team’s models and the analyst LLM cannot call each other, and a malicious actor who lands in one tenant cannot pivot to another. Model-specific threats: rate-limit per tenant at the gateway to blunt model-extraction (stealing a model by querying it exhaustively) and adversarial-probing, and log every inference request/response (subject to retention/privacy rules) for audit and incident review.
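The identity layer reduces to three small Istio resources. This is a sketch, not a hardened policy set: the issuer, JWKS URL, and audience are placeholders for whatever your Entra ID or Okta tenant publishes, and the selector assumes the default ingress gateway labels.

```yaml
# Mesh-wide: pod-to-pod traffic must be mutually authenticated.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Gateway: validate JWTs issued by the enterprise IdP.
apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: idp-jwt                        # hypothetical name
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway            # assumed default gateway label
  jwtRules:
    - issuer: https://idp.example.com/           # placeholder issuer
      jwksUri: https://idp.example.com/keys      # placeholder JWKS endpoint
      audiences:
        - inference-platform                     # placeholder audience
---
# Gateway: reject any request that carries no valid token.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-jwt                    # hypothetical name
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: DENY
  rules:
    - from:
        - source:
            notRequestPrincipals: ["*"]
```

The third resource matters more than it looks: RequestAuthentication on its own rejects invalid tokens but lets requests with no token through, so the DENY rule is what actually enforces "no anonymous calls."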
Cost optimization
Inference cost is dominated by GPUs and grows with success, so engineer for it from day one. (1) Scale-to-zero on every workload that tolerates a cold start — the single biggest lever, applied to the LLM and all batch/shadow models. (2) GPU sharing via MIG/time-slicing so mid-size models share a card instead of each pinning one. (3) Spot/preemptible GPU nodes for batch, eval, and shadow traffic — often 60–80% cheaper, and the workload is interruption-tolerant. (4) Right-size the runtime: pick CPU for models that meet latency on CPU (most classic tabular models do), reserve GPU for what genuinely needs it; a surprising number of “GPU” models were never benchmarked on CPU. (5) Dynamic batching in Triton/vLLM raises GPU throughput-per-dollar by amortizing the forward pass across concurrent requests. (6) ModelMesh density turns a thousand idle pods into a few dozen shared ones for the long tail. (7) Meter GPU-hours and request volume per tenant in Prometheus and feed it to chargeback, so the cost has an owner — the fastest way to get an idle endpoint turned off is to put its bill on someone’s budget.
Observability
Instrument the serving span end to end. Prometheus scrapes KServe, Knative, the model runtimes, and the NVIDIA DCGM exporter (GPU utilization, memory, temperature, ECC errors) via ServiceMonitor objects; Grafana dashboards turn that into the views the business actually watches: p50/p99 latency per model, request rate and error rate, GPU utilization and saturation, canary-vs-stable comparison (the single most important panel during a rollout), scale-to-zero / cold-start counts, and GPU-hours and cost per tenant. Define SLOs as Prometheus recording rules — e.g. fraud p99 < its budget, credit error-rate < threshold — and alert on burn rate, not just instantaneous breach. Layer Dynatrace or Datadog for distributed tracing across the gateway → transformer → predictor → downstream-feature-store hops, so when a model’s latency spikes you can see where in the chain. The metrics that matter most are the ones tied to money and risk: false-positive rate on the fraud canary, decision-distribution drift on the credit model, and time-to-first-token on the LLM (the latency a human actually feels). Run an offline evaluation harness in CI (GitHub Actions / Jenkins) so a model change is scored on its golden set before it becomes a canary, never after a Grafana panel turns red in prod.
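A burn-rate alert built on a recording rule might look like the sketch below. It assumes the Prometheus Operator is installed, a 99.9% availability SLO (so an error budget of 0.001), and Knative queue-proxy metric and label names; verify all three against your own stack. The 14.4x factor is the common fast-burn threshold for a one-hour window against a 30-day budget.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fraud-scorer-slo               # hypothetical name
  namespace: monitoring                # assumed monitoring namespace
spec:
  groups:
    - name: fraud-scorer.slo
      rules:
        # Share of requests answered with a 5xx over the last hour.
        - record: model:fraud_scorer_error_ratio:rate1h
          expr: |
            (sum(rate(revision_request_count{namespace_name="risk-prod", response_code_class="5xx"}[1h])) or vector(0))
            /
            sum(rate(revision_request_count{namespace_name="risk-prod"}[1h]))
        # Fleet-wide GPU utilization from the DCGM exporter, for the dashboard.
        - record: fleet:gpu_utilization:avg
          expr: avg(DCGM_FI_DEV_GPU_UTIL)
        # Page when the error budget is burning at more than 14.4x the sustainable rate.
        - alert: FraudScorerErrorBudgetFastBurn
          expr: model:fraud_scorer_error_ratio:rate1h > (14.4 * 0.001)
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "fraud-scorer is burning its error budget at more than 14.4x"
```

Alerting on the ratio against the budget, not on a raw error count, is what keeps the page meaningful at both lunchtime peak and 3am trough.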
Reference enterprise example
Crestline Pay, a fictional mid-size card processor (~2,400 employees, ~6 million transactions/day), built this platform to standardize a sprawl of one-VM-per-model deployments after an incident where a hand-rolled fraud-model update was pushed straight to 100% and spiked false declines for forty minutes with no fast rollback.
Decisions they made. They ran KServe in serverless mode for the hot path — the fraud scorer (minReplicas: 6, no scale-to-zero, p99 budget 8ms on CPU via Triton with dynamic batching) and the credit-decisioning ensemble (MIG-shared GPU) — and ModelMesh for the ~420 per-merchant anomaly models, which collapsed onto 18 serving pods instead of 420. The internal analyst-copilot LLM ran on vLLM with scale-to-zero overnight on a spot-backed A100 pool. Every promotion went through canary: 10 → 25 → 50 → 100 with an Argo Rollouts + Prometheus gate, and the credit model’s jump to 100% additionally required an approved ServiceNow change request signed by a model-risk owner. Manifests lived in Git, reconciled by Argo CD; GitHub Actions ran the golden-set eval and a load test as merge gates; Terraform stood up the AKS cluster, the tainted GPU node pool, and IAM. Vault injected registry and signing creds; models were cosign-signed and verified on load. Wiz scanned cluster config and the model buckets, CrowdStrike Falcon ran on every node, and Istio enforced mTLS + per-namespace policy. Akamai fronted the external analyst surface; Entra ID issued the tokens.
The numbers. Inference ran at ~6M fraud scores/day inline plus the credit and tail volume. The fraud p99 held at ~6.5ms. Monthly serving run-cost landed near ₹11.3 lakh (~$13,500): GPU compute ~$6,000 (scale-to-zero + spot on the LLM cut what would have been ~$15,000 of 24/7 on-demand A100 to roughly a third), CPU fraud/tail fleet ~$3,000, AKS control + networking + Istio ~$2,000, with observability (Prometheus/Grafana self-hosted, Datadog for tracing) and everything else making up the remaining ~$2,500. ModelMesh density saved an estimated ~$4,000/month versus one pod per tail model. The LLM scale-to-zero alone (9 GPU-hours/day instead of 24) was the difference between this budget and a ~40%-higher one.
The outcome. The forty-minute-incident class of failure became a non-event: a bad fraud canary now self-aborts at 10% and reverts in seconds. New models reach production in days through a standardized, reviewed GitOps path instead of weeks of bespoke VM setup. And — the line that got the CRO’s attention — because every credit-model version that ever served a live decision is pinned, signed, traced, and tied to an approved change request, the model-risk and audit teams could finally answer a regulator’s “show me exactly which model decided this application, and who approved its rollout” without a forensic archaeology project. A paired-region game day held RTO at 12 minutes: Argo CD reconciled the manifests into the secondary cluster, artifacts pulled from geo-redundant storage, and Akamai shifted traffic.
When to use it
Use this architecture when you have more than a handful of models, they are on the revenue or compliance path, you need progressive and reversible rollouts, and you run GPUs whose idle cost you must control. That covers most serious enterprise ML serving — fraud/risk scoring, recommendation and ranking, document/vision pipelines, and internal LLM serving — anywhere you need one standardized, observable, governed substrate instead of a museum of bespoke endpoints.
Trade-offs to accept. KServe rides on Kubernetes + Knative + Istio + a GPU operator + a registry + GitOps — that is real platform surface area and a real on-call burden; a two-person team serving three models does not need it and should reach for a managed endpoint first. Scale-to-zero trades cost for cold-start latency. ModelMesh trades per-model isolation for density. And canary only protects you if you actually watch the right metrics — a rollout gated on latency but blind to the business signal (false-positive rate) can promote a model that is fast and wrong.
Anti-patterns. (1) Hand-rolled Deployment-per-model — no standard rollout, no shared autoscaling, collapses past a dozen. (2) Scale-to-zero on the hot path — cold starts blow your latency SLO. (3) One pod per long-tail model — financially absurd; that is what ModelMesh exists for. (4) Cutting a regulated model to 100% with no canary or approval — exactly the incident the firm was trying to prevent. (5) GPUs with no DCGM metrics — you cannot manage utilization or cost you cannot see. (6) Mutable model URIs (latest) — destroys reproducibility and makes rollback ambiguous; pin immutable versions.
Alternatives, and when they win. If you serve one or two models and value speed over control, a managed endpoint (SageMaker, Vertex, Azure ML online endpoints) or a single Seldon / BentoML service is simpler and skips the platform tax — graduate to this when model count, rollout safety, or GPU cost demand it. If your workload is purely batch / offline scoring, you do not need request-driven autoscaling at all — a scheduled job is cheaper and simpler. If you are serving a single very large LLM at high throughput as your whole product, a dedicated inference stack (vLLM/TGI on a fixed GPU fleet with its own router) may out-optimize the general substrate. KServe is the destination for a fleet of heterogeneous models that must all be deployed, rolled out, scaled, secured, and observed the same way — which is precisely the financial-services scenario above, and not always the starting line.