GPU Inference Platform for LLMs on AWS EKS with Karpenter

A national health-insurance payer’s clinical-operations group lands a directive from the chief medical officer: claims adjudicators and nurse case managers are drowning in 40-page prior-authorization packets, and every hour a complex case sits in the queue is an hour a member waits for care and a day the payer carries open liability. The ask is an assistant that summarizes a packet, extracts the diagnosis and procedure codes, and drafts a medical-necessity rationale against policy — at the desk, in seconds. The constraint is the one that kills the easy answer: this is PHI under HIPAA, the data-governance board has a standing prohibition on sending member clinical records to any third-party model API, and the legal team will not accept “your prompts may be retained for abuse monitoring” in a BAA. So the model has to run inside the payer’s own AWS account, on their own GPUs, with no member data ever leaving the VPC. That single non-negotiable — self-hosted inference — is what turns a weekend prototype into the platform this article describes: open-weight LLMs served on Amazon EKS, with GPUs that appear only when there is work and vanish when there is not, behind a gateway that meters and protects every request.

The pressures stack the way they always do when you own the silicon. Cost is the dominant one: an A10G or L40S GPU left running idle overnight burns real money, and an H100 burns a great deal of it, so utilization is not a nice-to-have but the whole economic case. Latency means an adjudicator mid-review will not wait fifteen seconds for a summary, and a cold GPU that takes four minutes to come up and load 30 GB of weights is a latency event, not a scaling event. Scale means a Monday-morning queue surge of hundreds of concurrent packets against a Tuesday-afternoon lull, and the platform has to track that curve without either dropping requests or paying for peak all week. And governance means every token, every model version, and every node has to be auditable for a regulator. Self-hosting on EKS with Karpenter doing GPU lifecycle and vLLM doing high-throughput serving is the pattern that satisfies all four — but only if the pieces are assembled deliberately.

Why not the obvious shortcuts

Three shortcuts will be proposed in the first planning meeting, and naming why each fails saves a quarter of wasted effort.

Call a hosted model API. The cleanest engineering answer and a complete non-starter here: it sends PHI across a tenant boundary the governance board has explicitly forbidden, and no abuse-monitoring retention clause survives the BAA review. For this payer, the option is not “less private”; it is prohibited outright.

Run the model on a fixed fleet of always-on GPU EC2 instances. This works and it is what most teams reach for first, but it inverts the cost problem: you size the fleet for Monday’s peak and pay for it through Sunday’s trough, and GPU instances do not get cheaper when they idle. You also inherit manual capacity management — someone watching dashboards and resizing an Auto Scaling Group, badly, at 8 a.m.

Put a single GPU behind a Flask app and a queue. It demos beautifully and collapses under concurrency: one in-flight request blocks the GPU, batching is naive or absent, and throughput per dollar is a fraction of what the hardware can do. The moment a second adjudicator submits, latency doubles.

The platform threads the needle by separating three concerns that the shortcuts conflate. Karpenter owns when GPUs exist — it provisions a right-sized GPU node within a minute of a pending pod and terminates it when the work drains, so you pay for silicon by the request-storm, not by the calendar. vLLM owns how efficiently each GPU serves — continuous batching and PagedAttention push tokens-per-second per GPU to several times what a naive server delivers. And KServe owns how many model replicas exist — it scales serving pods on real demand signals, down to zero between surges. Three independent control loops, each tuned for the thing it controls.

Architecture overview

[Figure: GPU Inference Platform for LLMs on AWS EKS with Karpenter, architecture diagram]

The platform runs two paths that share a cluster but live on different clocks: a synchronous inference path that serves adjudicators in real time, and an asynchronous model-lifecycle path that publishes, validates, and promotes model versions. Keeping them mentally separate is the first step to operating this well — one is measured in milliseconds, the other in builds.

The defining property of the topology is the one the governance board cares about most: the entire serving plane lives in private subnets with no inbound path from the internet, model weights never leave the account, and every egress to AWS services rides a VPC endpoint. Members’ clinical text enters through the payer’s own edge, is processed on GPUs the payer owns, and the completion returns — without a single prompt or token transiting a third party. That is what makes the self-hosting story defensible to a HIPAA auditor.

Inference path, following the control flow:

An adjudicator opens the assistant inside the payer’s claims workbench. Identity federates through Okta as the workforce IdP (the payer’s standard), brokered to Microsoft Entra ID for the Microsoft-side resources, so every call carries a first-class, group-stamped token. Traffic hits Akamai at the edge for TLS termination, global anycast, and WAF/bot protection before it reaches AWS.
The request lands on the token-aware inference gateway — an Envoy-based gateway (the AI Gateway pattern) running in the cluster. It validates the OIDC JWT, attaches the caller’s team and cost-center claims, enforces a per-team token-per-minute budget, rate-limits by tenant, and — critically for self-hosted LLMs — meters prompt and completion tokens rather than just request count, because a 30-page packet and a one-line question cost wildly different amounts of GPU. This is the single front door: one place to authenticate, throttle, meter, and audit every model call.
The gateway routes to a KServe InferenceService fronting the target model. KServe’s router handles model-name routing and, where configured, canary traffic splitting between model versions.
The request reaches a vLLM serving pod scheduled on a GPU node. vLLM has already loaded the weights into GPU memory at startup; it adds the incoming request to its continuous batch, runs decoding with PagedAttention for memory-efficient KV-cache handling, and streams tokens back. If KServe has to add a replica and no GPU node has free capacity, the new pod sits Pending, which is the signal that drives the next loop.
A Pending GPU pod is seen by Karpenter, which evaluates its resource and nvidia.com/gpu requirements, selects the cheapest instance type that fits from its NodePool (a g6.xlarge/L4 for a 7-8B model, a g6e/L40S or p4d/A100 for a 70B model), launches it (often as Spot for batch-tolerant traffic, with On-Demand fallback), and bootstraps it with the GPU device plugin via a Bottlerocket GPU AMI; the pod schedules within roughly a minute.
The cited, structured answer streams back through the gateway to the adjudicator; the request, token counts, model version, and latency are emitted to Datadog.

Model-lifecycle path, independent and build-driven: a new or fine-tuned open-weight model (say a Llama- or Mistral-family checkpoint, or an internally fine-tuned variant) is published to an S3 model registry — an immutable, versioned prefix layout (s3://payer-models/clinical-summary/v7/) with weights, tokenizer, and a signed manifest. A pipeline runs an offline evaluation (groundedness against a golden packet set, code-extraction accuracy, latency on a reference GPU), and only a passing version is tagged for promotion. KServe storage initializers pull weights from S3 at pod startup over a Gateway VPC endpoint, so the multi-gigabyte download stays on the AWS backbone and never touches the public internet.
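The promotion gate described above can be sketched in a few lines of Python. This is a minimal illustration rather than the payer's pipeline: the `EvalResult` type, the metric names, and every threshold are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    groundedness: float        # share of summary claims supported by the packet
    code_extraction_f1: float  # accuracy of extracted diagnosis/procedure codes
    p95_latency_s: float       # measured on the reference GPU


# Hypothetical thresholds; a real gate would take these from reviewed config.
MIN_GROUNDEDNESS = 0.95
MIN_CODE_F1 = 0.90
MAX_P95_LATENCY_S = 8.0


def gate_failures(result: EvalResult) -> list[str]:
    """Return the names of the checks that failed; empty means promotable."""
    failures = []
    if result.groundedness < MIN_GROUNDEDNESS:
        failures.append("groundedness")
    if result.code_extraction_f1 < MIN_CODE_F1:
        failures.append("code_extraction_f1")
    if result.p95_latency_s > MAX_P95_LATENCY_S:
        failures.append("p95_latency_s")
    return failures


def may_promote(result: EvalResult) -> bool:
    """Only a version with no failed checks is tagged for promotion."""
    return not gate_failures(result)
```

Returning the list of failed checks, instead of a bare boolean, gives the change record something concrete to cite when a version is rejected.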

Component breakdown

Component	Service / tool	Role in the platform	Key configuration choices
Edge	Akamai	TLS, anycast, WAF, bot mitigation at the perimeter	WAF rules for prompt-flood / token-abuse patterns; origin shield to the private gateway
Identity / SSO	Okta + Microsoft Entra ID	Workforce SSO (Okta) federated to Entra; group claims to the gateway	OIDC federation; cost-center claim drives token chargeback
AI gateway	Envoy AI Gateway	JWT validation, token-aware metering, rate limiting, model routing	Per-team token-per-minute limits; OpenAI-compatible `/v1` routes
Model serving	vLLM	High-throughput LLM inference engine	Continuous batching; PagedAttention; `--tensor-parallel-size` per model
Serving control	KServe	Model deployment, request-driven autoscaling, scale-to-zero, canary	KEDA/KPA on concurrency; `minReplicas: 0` for off-peak models
GPU lifecycle	Karpenter	Just-in-time GPU node provisioning and consolidation	GPU NodePool; Spot + On-Demand fallback; `consolidationPolicy`; TTL
Model registry	Amazon S3	Immutable versioned weight + manifest store	Versioned prefixes; Object Lock; bucket-key SSE-KMS; VPC endpoint
Secrets	HashiCorp Vault	Registry creds, gateway signing keys, fine-tune data tokens	IRSA-backed auth; dynamic leases; Vault Agent sidecar injection
CSPM / posture	Wiz + Wiz Code	Cloud posture, attack-path analysis, IaC scanning pre-merge	Agentless EKS scan; alert on public-exposure or open-SG drift
Runtime security	CrowdStrike Falcon	Runtime threat detection on GPU nodes and the cluster	Sensor as DaemonSet; container drift detection; SOC pipeline
Observability	Datadog	GPU utilization, token throughput, latency SLOs, cost telemetry	DCGM GPU metrics; APM trace per request; SLO monitors
ITSM / approvals	ServiceNow	Model-promotion approvals, change requests, incident records	Change gate before a model version goes live; auto-ticket on SLO breach
CI/CD + IaC	GitHub Actions + Argo CD + Terraform	Build/eval pipeline; GitOps deploy; infrastructure as code	OIDC to AWS (no stored creds); eval gate; Argo syncs KServe manifests

A few of these choices deserve the why, because they are the ones teams get wrong.

Why Karpenter, not the Cluster Autoscaler with GPU ASGs. The Cluster Autoscaler scales pre-defined node groups, which forces you to guess GPU instance types up front and maintain a separate ASG per type. Karpenter is instance-type-aware and bin-packs to the workload: it reads the pending pod’s exact GPU and memory request and picks the cheapest instance that fits from a flexible NodePool, mixes Spot and On-Demand, and — the part that drives the cost case — runs consolidation, proactively replacing underused nodes and terminating empty ones so a GPU never idles for long. For a fleet where a single H100 hour is expensive, “provision the exact GPU the pod needs, and reclaim it the moment it drains” is the entire economic argument.
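The "cheapest instance that fits" idea can be sketched as a filter followed by a minimum. This is a toy model and not Karpenter's scheduler, which weighs many more dimensions (CPU, memory, zones, Spot availability). The `relative_cost` figures are invented placeholders, not AWS prices, and `min_gpu_memory_gb` stands in for whatever node selector or requirement constrains the GPU type, since GPU memory is not part of a pod's resource request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    gpus: int
    gpu_memory_gb: int    # memory per GPU
    relative_cost: float  # placeholder units, NOT real pricing


# Small illustrative catalog; GPU counts and sizes follow published specs.
CATALOG = [
    InstanceType("g6.xlarge", 1, 24, 1.0),      # 1x L4
    InstanceType("g6e.xlarge", 1, 48, 2.3),     # 1x L40S
    InstanceType("p4d.24xlarge", 8, 40, 40.0),  # 8x A100
]


def cheapest_fit(gpus_needed: int, min_gpu_memory_gb: int,
                 catalog=CATALOG) -> Optional[InstanceType]:
    """Pick the lowest-cost instance that satisfies the pod's GPU needs."""
    candidates = [
        inst for inst in catalog
        if inst.gpus >= gpus_needed and inst.gpu_memory_gb >= min_gpu_memory_gb
    ]
    return min(candidates, key=lambda inst: inst.relative_cost, default=None)
```

A `None` result corresponds to the case where nothing in the NodePool fits and the pod stays Pending.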

Why vLLM, not a vanilla model server. The naive pattern processes one request per forward pass and leaves the GPU underfed. vLLM’s continuous (in-flight) batching keeps the GPU saturated by adding and retiring requests from the batch every decoding step, and PagedAttention manages the KV cache like virtual memory so you fit far more concurrent sequences in GPU RAM. The result is several times the tokens-per-second per GPU, which directly divides your cost-per-million-tokens. On self-hosted hardware, throughput per GPU is the unit economics.
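A toy simulation makes the batching argument concrete. It is a deliberately simplified model and not vLLM's scheduler: it counts decode steps only, ignores prefill and KV-cache limits, and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # short requests idle until the longest one ends
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(output_lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one step emits one token per active request
        active = [left - 1 for left in active if left > 1]
    return steps


# Two long generations mixed with six short ones, four slots on the GPU.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)          # 200
continuous_steps = continuous_batching_steps(lengths, batch_size=4)  # 110
```

For this workload the same tokens come out in 110 steps instead of 200, because short requests no longer hold a slot hostage to the longest request in their batch.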

Why a token-aware gateway, not a plain API gateway. A standard gateway meters requests; an LLM platform must meter tokens, because cost and GPU time scale with prompt and completion length, not request count. The Envoy AI Gateway enforces a per-team token-per-minute budget so the appeals team cannot exhaust the GPU capacity the adjudication team is paying for, routes by model name across KServe backends, and produces one audit log of who sent what to which model version — the chargeback and compliance backbone in one layer.
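The per-team token-per-minute budget is, at its core, a token bucket keyed by team. The sketch below shows the mechanism under stated assumptions; it is not the Envoy AI Gateway's implementation, and the class name, team names, and limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class _Bucket:
    available: float    # LLM tokens the team may still spend
    last_refill: float  # timestamp of the last refill, in seconds


class TokenRateLimiter:
    """Per-team budget that refills continuously at tokens_per_minute."""

    def __init__(self, tokens_per_minute_by_team):
        self._limits = dict(tokens_per_minute_by_team)
        self._buckets = {}

    def allow(self, team: str, tokens: int, now: float) -> bool:
        """Charge `tokens` (prompt plus completion) to the team if it fits."""
        limit = self._limits[team]
        bucket = self._buckets.setdefault(team, _Bucket(float(limit), now))
        elapsed = max(0.0, now - bucket.last_refill)
        # Refill in proportion to elapsed time, capped at the per-minute limit.
        bucket.available = min(limit, bucket.available + elapsed * limit / 60.0)
        bucket.last_refill = now
        if tokens > bucket.available:
            return False  # over budget: the gateway would reject or queue
        bucket.available -= tokens
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a gateway would read a monotonic clock. Because each team has its own bucket, a burst from one team cannot drain another team's budget.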

Implementation guidance

Provision with Terraform, and treat the GPU NodePool and the network as first deliverables. Get the device plugin, the AMI, and the VPC endpoints right before anything else, or pods schedule onto nodes with no usable GPU and weight downloads silently traverse a NAT gateway you are paying per-GB for.

An EKS cluster (private API endpoint) with private subnets for the serving plane and Gateway VPC endpoints for S3 plus Interface endpoints for ECR, STS, and CloudWatch, so weight pulls and image pulls stay on the AWS backbone.
Karpenter installed with an EC2NodeClass using a Bottlerocket NVIDIA GPU AMI (the device plugin is built in) and a GPU NodePool constrained to GPU families, Spot+On-Demand, with consolidation and a node TTL.
KServe (with Knative/KEDA for request-driven scaling) and the Envoy AI Gateway.
The S3 model registry bucket with versioning, Object Lock, and SSE-KMS; IRSA roles granting the storage initializer read-only access to exactly its model prefix.
Argo CD watching the GitOps repo so InferenceService and gateway manifests deploy by merge, not by kubectl.

A minimal Karpenter GPU NodePool communicates the intent — cheapest fitting GPU, Spot-first, reclaim aggressively:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g6", "g6e", "p4d"]      # L4 / L40S / A100
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]      # Spot first, On-Demand fallback
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule                 # only GPU pods land here
      nodeClassRef:
        group: karpenter.k8s.aws             # karpenter.sh/v1 requires group and kind with name
        kind: EC2NodeClass
        name: gpu-bottlerocket
  limits:
    nvidia.com/gpu: 64                        # hard ceiling on fleet GPUs
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m                      # reclaim idle GPUs fast

And the KServe InferenceService that serves a model from the S3 registry with scale-to-zero between surges:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: clinical-summary
spec:
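  predictor:
    minReplicas: 0                            # scale to zero off-peak
    maxReplicas: 8
    scaleTarget: 12                           # target concurrent requests/replica
    scaleMetric: concurrency
    tolerations:                              # needed to land on the tainted GPU NodePool
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    model:
      modelFormat: { name: vLLM }             # assumes a ServingRuntime registered for this format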
      storageUri: s3://payer-models/clinical-summary/v7/
      args: ["--max-model-len=16384", "--tensor-parallel-size=1"]
      resources:
        limits: { nvidia.com/gpu: "1" }

The pipeline that applies this runs in GitHub Actions, authenticating to AWS via OIDC federation so there is no stored access key to leak. It builds the serving image, runs the offline eval gate, pushes the model version to S3, and updates the GitOps manifest; Argo CD then reconciles the change onto the cluster. Terraform owns the cluster, NodePool, S3, IAM/IRSA, and endpoints; Ansible handles any node-level or golden-AMI hardening config that lives outside the managed AMI.
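The `scaleTarget` and `scaleMetric` settings in the InferenceService above imply simple arithmetic for how many replicas a given load asks for. The sketch below is a simplification: the real autoscaler averages concurrency over time windows and has burst handling that is not modeled here. The defaults mirror the manifest; the function name is hypothetical.

```python
import math


def desired_replicas(in_flight_requests: int,
                     target_per_replica: int = 12,
                     min_replicas: int = 0,
                     max_replicas: int = 8) -> int:
    """Replicas needed to keep per-replica concurrency at or under target."""
    if in_flight_requests <= 0:
        return min_replicas  # no traffic: fall back to the floor (zero here)
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# With these settings the ceiling is reached at 8 * 12 = 96 concurrent
# requests; anything beyond that queues instead of adding replicas.
saturation_point = 8 * 12
```

Each replica requests one GPU, so this replica count is also the number of GPUs Karpenter may be asked to provide for the model.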

Identity: kill the static keys, federate the humans. No node and no pod holds a long-lived AWS key. The serving pods, the storage initializer, and the gateway authenticate to AWS with IRSA (IAM Roles for Service Accounts), each scoped to the minimum — the summarizer’s role can read only its own S3 model prefix and decrypt with only its KMS key. Human SSO flows Okta → Entra: adjudicators log in once with the payer’s Okta credentials and conditional-access policies, Okta federates to Entra over OIDC, and the resulting token carries the team and cost-center claims the gateway consumes for routing and chargeback. The residual secrets that are not IAM roles — third-party registry credentials, the gateway’s JWT signing key, tokens for the fine-tuning data lake — live in HashiCorp Vault, leased dynamically and injected by the Vault Agent sidecar, so nothing sensitive sits in a Kubernetes Secret.

Enterprise considerations

Security & Zero Trust. The architecture is Zero Trust by construction: private API endpoint, no public ingress to the serving plane, identity-based access only, least-privilege IRSA per workload, and model weights that never leave the account. Layer on top: (a) Wiz running continuous CSPM and attack-path analysis across the EKS cluster and S3, alerting the moment a security group opens or a bucket policy drifts toward public, with Wiz Code scanning the Terraform and KServe manifests in the pull request so a misconfiguration is caught before merge, not after exposure; (b) CrowdStrike Falcon sensors as a DaemonSet on every GPU node for runtime threat detection and container-drift alerts, feeding the payer’s SOC, which matters because GPU nodes run third-party model and CUDA images that warrant runtime scrutiny; (c) prompt-injection and PHI-leak guardrails at the gateway, since a malicious instruction hidden in an uploaded packet is the LLM-specific attack here; (d) automatic ServiceNow incidents on an SLO breach or a blocked-injection event, so security and operations have a ticket, not just a metric. KMS encrypts weights at rest, and the S3 Object Lock on the registry means a published, validated model version is immutable: you cannot silently swap the weights an auditor signed off on.

Cost optimization. GPU spend dominates and idle GPUs are the leak, so engineer utilization from day one.

Lever	Mechanism	Typical effect
Scale-to-zero	KServe `minReplicas: 0` drains a model with no traffic; Karpenter reclaims the node	Eliminates off-peak GPU cost entirely
Spot GPUs	Karpenter NodePool Spot-first with On-Demand fallback for batch-tolerant traffic	~60-70% off the GPU hour on the Spot share
Node consolidation	Karpenter bin-packs and replaces underutilized nodes	Keeps live GPUs near-saturated, not half-idle
Right-sized instance	Karpenter picks the cheapest GPU that fits the model (L4 vs L40S vs A100)	Stops paying H100 prices for a 7B model
Continuous batching	vLLM keeps the GPU fed, raising tokens/sec per GPU	Lowers cost-per-million-tokens several-fold
Token chargeback	Gateway meters tokens per team; piped to Datadog	Makes each desk own its GPU spend

Meter tokens per team at the gateway and pipe the metric to Datadog, which the platform team uses for the per-cost-center chargeback dashboard the CFO sees. The single highest-leverage number is GPU utilization: a fleet averaging 25% is a fleet you are overpaying for by 3-4x, and consolidation plus batching plus scale-to-zero exist to drive it up.
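The 3-4x figure follows directly from how utilization enters cost per token. The sketch below works the arithmetic; the hourly price and throughput in the example are hypothetical round numbers, not measured values.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second_at_full_load: float,
                            utilization: float) -> float:
    """Effective cost of a million tokens on a GPU busy `utilization` of the time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second_at_full_load * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def overpayment_factor(actual_utilization: float,
                       achievable_utilization: float) -> float:
    """How many times more you pay per token than at the achievable level."""
    return achievable_utilization / actual_utilization


# Hypothetical GPU: 1.00 USD/hour, 1,000 tokens/s when fully fed.
idle_heavy = cost_per_million_tokens(1.00, 1000, 0.25)  # about 1.11 USD
saturated = cost_per_million_tokens(1.00, 1000, 1.00)   # about 0.28 USD
```

Against a perfectly saturated fleet, 25% utilization is a 4x overpayment; against a more realistic 75% it is 3x, which is where the 3-4x range comes from.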

Scalability. Each loop scales independently and on its own signal. KServe scales replicas on request concurrency (KPA/KEDA), from zero up to its ceiling. Karpenter scales GPU nodes on pending pods, choosing instance types live. vLLM scales throughput within a GPU via batching, and across GPUs via --tensor-parallel-size for a model too large for one card (a 70B model sharded across 4 GPUs on a p4d). The natural ceilings are your GPU service quota in the region and Spot availability for the chosen instance family — which is why a payer planning a Monday surge requests quota and configures On-Demand fallback early, rather than discovering the ceiling during an incident.
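A back-of-the-envelope check shows why the tensor-parallel degree matters. The sketch assumes 16-bit weights (2 bytes per parameter), 40 GB A100s as found on p4d.24xlarge, and a 0.90 memory fraction, which matches vLLM's documented default for `--gpu-memory-utilization` as best understood here. It ignores activations and runtime overhead, so treat the results as upper bounds on headroom.

```python
def weights_per_gpu_gb(params_billion: float, bytes_per_param: float,
                       tensor_parallel: int) -> float:
    """Weight memory each GPU holds when the model is sharded evenly."""
    return params_billion * bytes_per_param / tensor_parallel


def kv_cache_headroom_gb(params_billion: float, bytes_per_param: float,
                         tensor_parallel: int, gpu_memory_gb: float,
                         gpu_memory_utilization: float = 0.90) -> float:
    """Memory left per GPU for KV cache after weights (negative: does not fit)."""
    budget = gpu_memory_gb * gpu_memory_utilization
    return budget - weights_per_gpu_gb(params_billion, bytes_per_param,
                                       tensor_parallel)


# 70B parameters in 16-bit precision on 40 GB cards.
four_way = kv_cache_headroom_gb(70, 2, 4, 40)   # about 1 GB per GPU
eight_way = kv_cache_headroom_gb(70, 2, 8, 40)  # about 18.5 GB per GPU
```

Under these assumptions a 4-way split fits the weights but leaves very little room for KV cache, so concurrency would be limited; spreading across all 8 GPUs of the instance, or using quantized weights, buys considerably more headroom.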

Failure modes, and what each one looks like. Name them before they page you.

Cold-start latency on a scaled-to-zero model: the first request after a lull waits for Karpenter to launch a node, for the storage initializer to download 30+ GB of weights, and for vLLM to load them into GPU memory. That is minutes, not milliseconds. Mitigation: keep minReplicas: 1 (a warm pool) for latency-critical models and reserve scale-to-zero for batch ones; pre-bake weights into the AMI or use a faster registry pull; size the GPU so weights fit without paging.
Spot GPU reclamation mid-request: AWS reclaims a Spot node with two minutes’ notice and in-flight requests on it fail. Mitigation: Karpenter drains on the interruption signal; the gateway retries idempotent requests on another replica; keep interactive traffic on On-Demand and Spot for batch.
GPU quota exhaustion at peak: Karpenter cannot launch the node the pending pod needs because the regional GPU quota or Spot pool is dry, and requests queue. Mitigation: pre-request quota for peak, configure On-Demand fallback and a second instance family, and alert on sustained Pending GPU pods.
OOM / KV-cache exhaustion: concurrency or prompt length exceeds GPU memory and vLLM rejects or degrades. Mitigation: cap --max-model-len, tune --gpu-memory-utilization, and let KServe’s concurrency target shed load to a new replica rather than overloading one.
A bad model version promoted: a regression ships and answers degrade. Mitigation: the offline eval gate, KServe canary traffic splitting to validate a new version on a slice of live traffic, and instant rollback by reverting the GitOps manifest.
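The cold-start failure mode is easy to size before it pages anyone. The sketch below adds up the three phases; the 30 GB weight size and the roughly one-minute node launch come from this article, while the download bandwidth, GPU load time, and latency budget are hypothetical values to replace with measurements.

```python
def cold_start_seconds(weights_gb: float, effective_gbit_per_s: float,
                       node_launch_s: float, gpu_load_s: float) -> float:
    """Node launch + weight download + load into GPU memory, in seconds."""
    download_s = weights_gb * 8 / effective_gbit_per_s  # GB -> gigabits
    return node_launch_s + download_s + gpu_load_s


def needs_warm_replica(cold_start_s: float, first_token_budget_s: float) -> bool:
    """True when a scaled-to-zero model cannot meet the latency budget."""
    return cold_start_s > first_token_budget_s


# 30 GB of weights, an assumed 5 Gbit/s effective pull rate,
# 60 s node launch, and an assumed 45 s to load weights onto the GPU.
estimate_s = cold_start_seconds(30, 5.0, 60, 45)
interactive_needs_warm = needs_warm_replica(estimate_s, first_token_budget_s=5)
```

With these inputs the estimate is about two and a half minutes, far beyond any interactive budget, which is the quantitative case for keeping a warm replica on latency-critical models.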

Reliability & DR (RTO/RPO). Decide the numbers per tier. The model registry in S3 (versioned, Object-Locked, cross-region-replicated) is the durable source of truth with near-zero RPO — a model version, once published, is recoverable indefinitely. The serving plane itself is stateless: there is no conversation state on a GPU node, so DR is “stand the platform back up in a paired region,” which Terraform and Argo CD make a rebuild, not a restore. A pragmatic target for this platform: RTO 30 minutes to bring the serving plane up in a second region (cluster + Karpenter + Argo sync, gated by GPU quota in that region, which you pre-request), RPO effectively zero for models given S3 replication. Akamai health checks drive edge failover for ingress. The honest caveat: GPU capacity in the failover region is itself a dependency — DR for a GPU platform is partly a quota exercise, not only an automation one.

Observability. Instrument the request end to end in Datadog with APM: one trace covering gateway → KServe router → vLLM, with token counts and per-hop timing. Scrape DCGM GPU metrics so the dashboards show GPU utilization, memory used, temperature, and tokens-per-second per GPU alongside the request metrics. Define explicit latency SLOs — p95 time-to-first-token (what an adjudicator feels in a stream) and end-to-end p95 — as Datadog SLO monitors, with the error budget visible. Emit the business metrics that matter: GPU-hours per team, tokens and cost per cost-center, scale-to-zero/cold-start frequency, Spot-reclamation rate, and eval scores per model version. An SLO burn or a sustained GPU-saturation alert auto-raises a ServiceNow incident; a new model version passes a ServiceNow change approval before promotion, giving compliance a documented gate.
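The two SLO quantities named above, a p95 latency and the remaining error budget, reduce to short calculations. This sketch uses the nearest-rank percentile definition, which may differ slightly from how a monitoring backend interpolates; the sample latencies and the 99% target are illustrative.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_budget_remaining(total_requests: int, bad_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget left; negative means the SLO is blown."""
    allowed_bad = (1.0 - slo_target) * total_requests
    return 1.0 - bad_requests / allowed_bad


# Time-to-first-token samples in milliseconds (made-up values).
ttft_ms = [100] * 18 + [900, 2500]
p95_ttft_ms = percentile(ttft_ms, 95)  # 900

# 25 slow requests out of 10,000 against a 99% target uses a quarter of the budget.
budget_left = error_budget_remaining(10_000, 25, slo_target=0.99)
```

The example also shows why p95 is the right lens: the median here is 100 ms, but one request in twenty waits most of a second or more.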

Governance. Pin model versions explicitly by immutable S3 prefix (clinical-summary/v7, never a latest alias) so behavior does not drift, and promote new versions only through the eval gate and the ServiceNow change. Keep KServe and gateway manifests in version control under Argo CD, reviewable and instantly revertable. Log every prompt/response pair (with PHI handled under the same HIPAA controls and a retention/right-to-erasure path) for audit, incident review, and future eval data. Wiz Code enforces that no infrastructure change ships with a public-exposure or open-IAM regression, and Wiz independently verifies in production that the controls are actually holding.
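The "never a latest alias" rule is easy to enforce mechanically, for example as a pre-merge check on the manifest's `storageUri`. The sketch below assumes the `s3://bucket/model/vN/` layout used in this article; the function name and the exact character rules in the pattern are illustrative choices.

```python
import re

# s3://<bucket>/<model-name>/v<integer>/ and nothing else.
_PINNED_URI = re.compile(
    r"s3://(?P<bucket>[a-z0-9.\-]+)/(?P<model>[a-z0-9\-]+)/v(?P<version>\d+)/"
)


def parse_pinned_uri(storage_uri: str) -> tuple[str, str, int]:
    """Return (bucket, model, version) or raise if the URI is not pinned."""
    match = _PINNED_URI.fullmatch(storage_uri)
    if match is None:
        raise ValueError(f"storageUri is not pinned to a version: {storage_uri!r}")
    return match["bucket"], match["model"], int(match["version"])
```

Using `fullmatch` rejects both floating aliases such as `.../latest/` and URIs with trailing path segments, so only an explicit, numbered version can reach the cluster.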

Explicit tradeoffs

Accept these or do not build it. Self-hosting LLMs trades a managed API’s simplicity for control and data residency, and the bill is real: you now operate Kubernetes, GPU drivers, a model registry, an autoscaling stack, and a serving engine — none of which a hosted API made you think about. Cold starts are the price of scale-to-zero: you cannot have both zero idle cost and instant first-token latency on the same model, so you choose per model (warm pool for interactive, scale-to-zero for batch). Spot is the price of cheap GPUs: you accept mid-request reclamation in exchange for 60-70% off, and you keep interactive traffic on On-Demand. And open-weight models on your hardware will, for the hardest reasoning, trail the largest frontier hosted models — you trade some capability ceiling for the right to keep PHI in your account, which for this payer is the entire point and not a regret.

The alternatives, and when they win. If your data is not regulated and you can send it out, a hosted model API is dramatically less to operate and usually the right call until governance, scale, or cost says otherwise. If your inference is sporadic and bursty with no latency floor, a serverless GPU offering or batch-only jobs may beat a standing platform. If you need only one model at modest, steady traffic, a managed inference endpoint (e.g. SageMaker) skips the Kubernetes layer entirely. Graduate to this full EKS-plus-Karpenter platform when you must self-host for residency, run multiple models with independent scaling, drive GPU cost down through consolidation and scale-to-zero, and hold the whole thing to auditable SLOs — which is exactly the payer’s situation.

The shape of the win

For the payer’s clinical-operations team, the payoff is not “an internal chatbot.” It is that an adjudicator opens a 40-page prior-auth packet, gets a structured summary with extracted codes and a policy-grounded medical-necessity draft in a few seconds, and — because the model ran on the payer’s own GPUs inside the VPC and no member record ever left the account — the governance board cleared it for PHI workloads, which a hosted API would never have passed. That clearance is what funds the platform. Everything upstream — Karpenter reclaiming idle GPUs, vLLM saturating the ones that remain, KServe scaling to zero overnight, the S3 registry’s immutable versions, the token-aware gateway’s chargeback, the Vault-held keys, the Wiz posture checks, the CrowdStrike runtime sensors, the Datadog GPU-utilization and latency SLOs — exists to make a CISO, a compliance officer, and a CFO each say yes. The architecture here is the destination; start with a single warm model behind the gateway if you must, but a regulated, at-scale, cost-disciplined “run our own LLMs” lands here.


Written by Vinod
