A national grocery chain’s e-commerce team has one question in their architecture review and they cannot agree on the answer: when a customer hits “place order,” what kind of compute should run the code that reserves stock, charges the card, and books a delivery slot? One engineer wants to lift the existing app onto a couple of large virtual machines because that is what they know. Another wants Kubernetes because “everyone uses containers.” A third has read about Lambda and wants the whole thing serverless so they “never patch a server again.” All three are partly right, and the disagreement is costing weeks. This article is the decision framework that resolves it — not by declaring a winner, but by showing what each compute model actually is, what it costs you in money and operational effort, and which parts of a real retail workload each one fits.
The grocer’s situation is the one every team meets eventually. Their traffic is brutally spiky: a Tuesday afternoon is quiet, but a Friday-evening “weekend shop” rush or a one-day promotion can drive ten times the normal load for a few hours. Their operations team is small: eight engineers, no dedicated platform team, and a CFO who reads the monthly cloud bill line by line. And they have a mix of workloads that genuinely differ: an always-on product catalogue API, a bursty checkout flow, a nightly batch job that re-prices thousands of SKUs, and a legacy warehouse-integration component that ships as a vendor virtual appliance they cannot containerise. No single compute model will serve all four well, and the real skill is matching each workload to the right one.
The three models, in one breath each
Before the tradeoffs, the definitions — because half of every compute argument is two people meaning different things by the same word.
A virtual machine (VM) is a whole emulated computer: its own operating system, kernel, and disk, that you boot, patch, and own. On AWS that is EC2, on Azure Virtual Machines, on GCP Compute Engine. You get total control and you carry total responsibility — the OS is yours to secure and keep alive.
A container packages your application and its dependencies into an image that shares the host’s kernel but runs isolated. You no longer manage an OS per app, but you do run an orchestrator that schedules containers across a pool of machines: ECS or EKS on AWS, AKS on Azure, GKE on GCP (all three “K” services are managed Kubernetes). The unit of deployment is an image; the unit you still operate is the cluster.
Serverless (specifically Functions-as-a-Service) means you hand the platform a function, and it runs on demand, scales to zero when idle, and bills per invocation and millisecond. AWS Lambda, Azure Functions, GCP Cloud Functions. There is no server for you to see — hence the name — and no cluster to operate. The platform owns everything below your code. A middle ground, container-based serverless — AWS Fargate, Azure Container Apps, GCP Cloud Run — runs your container image with serverless scaling and no node management, blending two of the models.
The honest mental model is a spectrum of how much of the stack you operate versus how much the cloud operates for you. VMs put almost everything on you; serverless puts almost nothing on you; containers sit in between, and exactly where depends on whether you self-manage the cluster or let the cloud do it.
Architecture overview
Rather than force the whole grocer onto one model, the reference architecture places each of the four workloads on the model that fits it — which is what a real production estate looks like, and exactly the lesson a junior architect needs. Trace the request and you can see why each choice lands where it does.
A customer’s browser hits Akamai at the edge first — CDN caching for static product images and pages, TLS termination, and WAF/bot mitigation so credential-stuffing and scraping traffic is absorbed before it reaches any of your compute. Akamai routes dynamic requests to the cloud, and from there the four workloads diverge:
- Product catalogue API: containers (managed Kubernetes). This service is always on, gets steady high traffic, and runs many small replicas behind a load balancer. It lives on AKS / EKS / GKE as a Deployment with a horizontal pod autoscaler. Containers fit because the workload is continuous (so serverless’s pay-per-call pricing offers little, and cold starts would hurt a latency-sensitive read path) and because the team wants fast, image-based rollouts and easy horizontal scaling.
- Checkout flow: container-based serverless (Cloud Run / Container Apps / Fargate). This is the spiky one: near-zero between rushes, then a wall of traffic during a promotion. It runs as a container image on a serverless runtime with no nodes to pre-provision, so the grocer pays almost nothing on a quiet Tuesday and the platform absorbs the Friday spike automatically. (Cloud Run and Container Apps scale to zero natively; Fargate removes node management, but an ECS service typically needs a scheduled or metric-driven scaling action to drop to zero and come back.) Packaging it as a container (rather than a raw function) keeps it portable and lets it share the same image build as the rest of the estate.
- Nightly re-pricing batch: serverless functions. A scheduled job fans out over thousands of SKUs once a night. Lambda / Azure Functions / Cloud Functions triggered on a schedule, fanning out across many parallel invocations, is ideal: it runs for minutes a day and bills only for those minutes, with massive built-in parallelism and nothing running the rest of the day.
- Warehouse integration: a virtual machine. The vendor ships this as a virtual appliance, a pre-built VM image with a kernel and drivers the supplier supports and you may not modify. It has no container build, it holds a long-lived stateful connection to the warehouse system, and the support contract is void if you re-platform it. It runs on EC2 / Azure VM / Compute Engine, full stop. This is the workload that proves “serverless everything” is a fantasy: some software only ships as a VM.
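The fan-out in the re-pricing job can be sketched in a few lines. This is a minimal illustration, assuming the scheduler hands each parallel invocation one batch of SKUs; the function names, the batch size, and the payload shape are hypothetical and not tied to any cloud SDK.

```python
from typing import Iterator, List, Sequence


def chunk_skus(skus: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` SKUs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(skus), batch_size):
        yield list(skus[start:start + batch_size])


def plan_fan_out(skus: Sequence[str], batch_size: int) -> List[dict]:
    """Build one payload per function invocation (payload shape is hypothetical)."""
    return [
        {"batch_index": index, "skus": batch}
        for index, batch in enumerate(chunk_skus(skus, batch_size))
    ]


if __name__ == "__main__":
    # Illustrative catalogue; real SKU identifiers would come from the pricing system.
    catalogue = [f"SKU-{n:05d}" for n in range(2500)]
    payloads = plan_fan_out(catalogue, batch_size=200)
    print(len(payloads), "invocations planned")
```

Each payload would then be handed to one function invocation, for example through a queue, so the batch size becomes the knob that trades parallelism against per-invocation run time.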
Every one of these authenticates the same way and is observed the same way, which is the second half of the picture. Workforce and customer identity flow through Okta (or Entra ID on Azure) as the identity provider — Okta issues the OIDC/SAML tokens that the catalogue and checkout services validate, so a model choice never means a separate auth story. Application secrets — database passwords, the payment-gateway API key — come from HashiCorp Vault, which issues short-lived dynamic credentials to the containers, functions, and VMs alike rather than baking secrets into an image or a VM disk. And every tier emits telemetry to Datadog (or Dynatrace), so one dashboard spans all three compute models — VM host metrics, container/pod metrics, and per-function invocation traces in a single pane.
How they compare on what matters
The three properties a junior architect should weigh first are operational burden, scaling behaviour, and cost shape. Here they are side by side.
| Dimension | Virtual machines | Containers (managed K8s) | Serverless (FaaS) |
|---|---|---|---|
| You operate | OS, patching, runtime, app, scaling | Cluster, node pools, images, app | Just your function code |
| Cloud operates | Hardware, hypervisor | Hardware, control plane | Everything below your code |
| Scaling unit | Whole VM (minutes) | Pod / container (seconds) | Invocation (milliseconds) |
| Scale to zero | No (you pay while it runs) | Rarely (nodes stay up) | Yes (pay only on use) |
| Cold starts | None (always warm) | None once pods are up | Yes — tens of ms to seconds |
| Cost shape | Pay for reserved time | Pay for the node pool | Pay per request + duration |
| Portability | Moderate (machine images are cloud-specific) | High (image runs anywhere) | Low (vendor-specific triggers) |
| Best for | Legacy, stateful, appliances, special hardware | Steady, microservices, mixed languages | Spiky, event-driven, glue, batch |
A second table reframes the same trade-offs as questions to ask yourself, because matching is easier than memorising.
| Ask yourself | If yes, lean toward |
|---|---|
| Is traffic spiky or near-zero between bursts? | Serverless |
| Is it event-driven or scheduled (a queue, a timer, a file upload)? | Serverless |
| Is it always-on with steady, latency-sensitive traffic? | Containers |
| Do I run many services in different languages I want deployed uniformly? | Containers |
| Is it a vendor appliance, a legacy app, or does it need a specific kernel/GPU/driver? | VMs |
| Is it stateful with long-lived in-process connections? | VMs (or stateful containers) |
| Is my ops team tiny and I want minimal infrastructure to run? | Serverless first, then container-serverless |
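The questions in the table can be read as a rough decision procedure. The sketch below encodes one possible ordering of them; the field names, the precedence of the checks, and the returned labels are assumptions made for illustration, and the output is a first guess to discuss rather than a verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    must_be_vm: bool = False          # vendor appliance, special kernel/driver/GPU
    long_lived_state: bool = False    # long-lived in-process connections
    spiky_or_idle: bool = False       # near-zero traffic between bursts
    event_or_scheduled: bool = False  # queue, timer, file upload
    latency_sensitive: bool = False   # customer-facing request path


def suggest_model(w: Workload) -> str:
    """Return a first-guess compute model for a workload."""
    # Hard constraints first: some software only runs as a VM.
    # (Stateful containers are an alternative the table also allows.)
    if w.must_be_vm or w.long_lived_state:
        return "vm"
    # Scheduled or event-driven work that tolerates cold starts.
    if w.event_or_scheduled and not w.latency_sensitive:
        return "serverless-functions"
    if w.spiky_or_idle:
        # Spiky but latency-sensitive: a container image on a serverless runtime.
        return "container-serverless" if w.latency_sensitive else "serverless-functions"
    # Steady, always-on services default to containers.
    return "containers"


if __name__ == "__main__":
    grocer = [
        Workload("catalogue", latency_sensitive=True),
        Workload("checkout", spiky_or_idle=True, latency_sensitive=True),
        Workload("re-pricing", spiky_or_idle=True, event_or_scheduled=True),
        Workload("warehouse", must_be_vm=True, long_lived_state=True),
    ]
    for w in grocer:
        print(w.name, "->", suggest_model(w))
```

Team size is deliberately left out of the function: it shifts the default toward the serverless end of the spectrum rather than overriding any of the hard constraints.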
What “operational burden” really means
This is the dimension juniors underestimate most, so make it concrete. With the warehouse VM, the grocer owns the operating system: every month there is OS patching, kernel CVEs to track, the runtime to upgrade, and capacity to size by hand. CrowdStrike Falcon runs as an endpoint agent on that VM for runtime threat detection precisely because the OS is attack surface the team owns. The VM never scales itself; if the warehouse job needs more headroom, someone resizes the instance.
With the catalogue running on managed Kubernetes, the cloud runs the control plane, but the team still owns node pools, cluster upgrades, ingress, and autoscaler tuning — real work, just less than a fleet of hand-patched VMs. Kubernetes is powerful and genuinely complicated, and “we picked Kubernetes” quietly signs the team up for that complexity. The container images themselves get scanned by Wiz (specifically Wiz Code in the pipeline) for vulnerable dependencies and misconfigurations before they ship, so a known-bad base image never reaches the cluster.
With the checkout and re-pricing on serverless, there is no OS, no patching, and no cluster — the platform absorbs all of it, which is exactly why a small team reaches for it. The burden that remains is different: you must design for statelessness, accept the platform’s limits, and reason about cold starts. The burden does not vanish; it moves.
A blunt way to put it to the team: every layer you operate is a layer you patch, secure, scale, and get paged for. Choosing a compute model is largely choosing how many of those layers you want to own.
Cold starts — the serverless catch juniors miss
The headline objection to serverless is the cold start: when a function has been idle and a request arrives, the platform must spin up an execution environment before your code runs, adding latency from tens of milliseconds to a couple of seconds depending on language, package size, and whether it sits in a VPC.
For the nightly re-pricing batch, this is a non-issue: a few hundred milliseconds of warm-up on a job that runs for minutes at 2 a.m. is invisible. For checkout, it could matter: a customer waiting two extra seconds at “place order” is a real problem during a promotion. The mitigations are standard and worth knowing: provisioned concurrency (Lambda), minimum instances (Cloud Run), or minimum replicas (Container Apps) keep a few environments permanently warm, trading some of serverless’s pay-per-use savings for predictable latency. The grocer keeps a small warm pool on checkout during business hours and lets it scale to zero overnight.
The general rule: serverless is excellent for spiky, event-driven, and batch work where occasional cold-start latency is acceptable, and a poorer fit for a steady low-latency hot path — which is exactly why the always-on catalogue is on containers, not functions.
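A little arithmetic shows why a warm pool matters more for tail latency than for the average. The sketch below assumes a deliberately simple two-point model (every request is either fully warm or pays one fixed cold-start penalty); the numbers are illustrative, not measurements from any platform.

```python
def cold_start_impact(cold_fraction: float, warm_ms: float, cold_penalty_ms: float) -> dict:
    """Summarise latency when a fraction of requests pays a cold-start penalty."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    # The mean barely moves when cold starts are rare.
    mean_ms = warm_ms + cold_fraction * cold_penalty_ms
    # In a two-point model the 99th percentile is a cold request as soon as
    # more than 1% of requests are cold.
    p99_ms = warm_ms + cold_penalty_ms if cold_fraction > 0.01 else warm_ms
    return {"mean_ms": mean_ms, "p99_ms": p99_ms}


if __name__ == "__main__":
    # Hypothetical checkout figures: 120 ms warm, 2 s cold-start penalty.
    no_warm_pool = cold_start_impact(cold_fraction=0.02, warm_ms=120.0, cold_penalty_ms=2000.0)
    warm_pool = cold_start_impact(cold_fraction=0.001, warm_ms=120.0, cold_penalty_ms=2000.0)
    print("without warm pool:", no_warm_pool)
    print("with warm pool:   ", warm_pool)
```

Under these assumptions the mean shifts by only tens of milliseconds, while the 99th percentile jumps by the full two seconds, which is the case for judging checkout on percentiles rather than averages.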
Cost — three different bill shapes
Cost is where the CFO’s line-by-line reading bites, and the three models bill so differently that comparing them needs the workload’s shape, not a sticker price.
- VMs bill for reserved time. You pay for the instance whether it is busy or idle. An always-on VM at low utilisation is the classic waste: paying 24/7 for a box that is busy 15% of the time. VMs win on cost only when utilisation is high and steady (commit to a 1- or 3-year reserved instance / savings plan and the hourly rate drops sharply), or when a workload simply must be a VM.
- Containers bill for the node pool. You pay for the worker nodes the cluster runs on, sized to hold your pods. Bin-packing many services onto shared nodes gives strong economics at steady scale, but an over-provisioned cluster idling overnight is the same waste as idle VMs. Cluster autoscaling and scaling node pools down off-peak are the levers.
- Serverless bills per request and per millisecond of execution. Idle costs nothing. This is unbeatable for spiky and low-volume work: the checkout flow costs almost nothing on a quiet day. But at sustained high volume, per-invocation pricing can cost more than a well-utilised reservation. The re-pricing batch is cheap on serverless because it runs briefly; a hypothetical always-on service doing billions of calls might be cheaper on containers.
The crossover is the lesson: serverless is cheapest at low and spiky volume; reserved containers or VMs win at high and steady volume. Tag every resource by workload and pipe spend into Datadog cloud-cost dashboards so the team sees, per workload, whether it sits on the right side of that crossover — and revisits the choice as volume grows.
| Workload | Shape | Model chosen | Why it is cost-optimal |
|---|---|---|---|
| Catalogue API | Steady, always-on | Containers | High utilisation amortises the node pool |
| Checkout | Spiky, bursty | Container-serverless | Scales to ~zero between rushes |
| Re-pricing batch | Minutes/day | Serverless functions | Pays only for the nightly run |
| Warehouse integration | Constant, stateful | VM (reserved) | Must be a VM; reservation cuts the rate |
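The crossover can be estimated with back-of-envelope arithmetic. The sketch below assumes a simple pricing model (a flat monthly cost for reserved capacity versus a per-request fee plus a per-GB-second fee for serverless); every price and workload figure in it is a placeholder to be replaced with numbers from your own bill, not a quoted list price.

```python
def serverless_monthly_cost(requests: float, avg_duration_s: float, memory_gb: float,
                            price_per_million_requests: float,
                            price_per_gb_second: float) -> float:
    """Monthly cost under a per-request plus per-GB-second pricing model."""
    request_cost = requests / 1_000_000 * price_per_million_requests
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


def break_even_requests(fixed_monthly_cost: float, avg_duration_s: float, memory_gb: float,
                        price_per_million_requests: float,
                        price_per_gb_second: float) -> float:
    """Requests per month at which serverless costs the same as fixed capacity."""
    cost_per_request = (price_per_million_requests / 1_000_000
                        + avg_duration_s * memory_gb * price_per_gb_second)
    return fixed_monthly_cost / cost_per_request


def effective_hourly_rate(hourly_rate: float, utilisation: float) -> float:
    """What a busy hour really costs on a machine that is idle the rest of the time."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_rate / utilisation


if __name__ == "__main__":
    # Placeholder inputs: substitute real prices and measured durations.
    prices = dict(price_per_million_requests=0.20, price_per_gb_second=0.0000166667)
    crossover = break_even_requests(300.0, avg_duration_s=0.2, memory_gb=0.5, **prices)
    print(f"break-even at roughly {crossover:,.0f} requests per month")
    print(f"a 15%-utilised VM costs {effective_hourly_rate(0.10, 0.15):.2f} per busy hour")
```

Below the break-even volume the serverless bill is lower; above it the fixed capacity wins, provided it is actually kept busy.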
Security and the shared-responsibility line
The compute model changes where the line falls between what you secure and what the cloud secures — and a junior architect must know which side they are on.
On a VM, you own the OS and everything in it: patching, hardening, the runtime, and runtime threat detection via a CrowdStrike Falcon agent on the instance. On containers, you own the image and the workload; Wiz Code scans images and IaC for vulnerabilities and misconfigurations in the pipeline, and Falcon can run as a node/runtime sensor, while the cloud secures the control plane. On serverless, the platform owns the runtime and host entirely, so your security surface shrinks to your code, its dependencies, and its IAM permissions — get the function’s least-privilege role wrong and that is your exposure.
Three controls span all three models and keep the security story uniform regardless of compute choice. Identity is centralised in Okta / Entra ID, so every service validates the same tokens and there is one place to enforce MFA and conditional access. Secrets come from HashiCorp Vault as short-lived dynamic credentials — the checkout function, the catalogue pod, and the warehouse VM each fetch a database credential at runtime rather than carrying a baked-in password, so a leaked image or snapshot does not leak a standing secret. And Wiz runs continuous posture management (CSPM) across the whole estate — VMs, clusters, and serverless functions — flagging an over-permissive IAM role or a publicly exposed resource no matter which model it lives on.
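On the application side, short-lived credentials mean each service has to refresh its secret before the lease runs out. The sketch below shows one way to structure that, assuming a `fetch` callable that returns a fresh lease; `fetch`, `Lease`, and `CredentialCache` are hypothetical names standing in for whatever Vault client call the service actually uses, and no real Vault API is invoked here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float  # seconds, on the same clock as `now()`


class CredentialCache:
    """Hand out a cached credential and refresh it shortly before it expires."""

    def __init__(self, fetch: Callable[[], Lease],
                 now: Callable[[], float] = time.monotonic,
                 refresh_margin_s: float = 60.0) -> None:
        self._fetch = fetch              # hypothetical: wraps the secrets-manager call
        self._now = now                  # injectable clock, which keeps tests simple
        self._margin = refresh_margin_s
        self._lease: Optional[Lease] = None

    def get(self) -> Lease:
        lease = self._lease
        # Refresh when nothing is cached or the lease is inside the safety margin.
        if lease is None or self._now() >= lease.expires_at - self._margin:
            lease = self._fetch()
            self._lease = lease
        return lease


if __name__ == "__main__":
    def demo_fetch() -> Lease:
        # Stand-in for a real dynamic-credential request; values are fake.
        return Lease("app-user", "not-a-real-secret", time.monotonic() + 300.0)

    cache = CredentialCache(demo_fetch)
    print(cache.get().username)
```

The same wrapper shape works in a pod, a function, or on the VM, since none of it depends on where the code runs.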
Failure modes worth naming
Each model fails in a characteristic way; recognising the signature is half of operating it.
- VM: the box dies, and so does everything on it. A single VM is a single point of failure — patch it wrong, fill its disk, or lose the host, and the workload is down. Mitigation: run at least two across availability zones behind a load balancer or in a scaling group, and never treat a VM as a pet.
- Containers: a node fails or the cluster is misconfigured. A bad node drains its pods (Kubernetes reschedules them — usually graceful), but a misconfigured autoscaler, an exhausted node pool, or a control-plane issue can stall deploys or starve the service. Mitigation: pod anti-affinity across zones, sane resource requests/limits, and autoscaler headroom.
- Serverless: throttling and downstream overload. Functions scale so fast they can overwhelm a database or hit a concurrency limit and start returning throttle errors. The classic incident is a spike fanning out thousands of concurrent functions that exhaust the database connection pool. Mitigation: reserved/maximum concurrency caps, a connection proxy (e.g. RDS Proxy) in front of the database, and queues to smooth bursts.
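The idea behind a concurrency cap can be shown in-process. The sketch below uses an `asyncio.Semaphore` to bound how many calls reach a downstream dependency at once; it is an analogy for a platform-level reserved or maximum concurrency setting, not a replacement for it, and `fake_db_call` is a stand-in for a real query.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_capped(jobs: List[Callable[[], Awaitable[int]]],
                     max_concurrency: int) -> Tuple[List[int], int]:
    """Run every job, never allowing more than `max_concurrency` in flight."""
    gate = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def guarded(job: Callable[[], Awaitable[int]]) -> int:
        nonlocal in_flight, peak
        async with gate:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return list(results), peak


async def fake_db_call(i: int) -> int:
    await asyncio.sleep(0.01)  # stands in for a query holding one connection
    return i * 2


if __name__ == "__main__":
    work = [lambda i=i: fake_db_call(i) for i in range(50)]
    _, observed_peak = asyncio.run(run_capped(work, max_concurrency=5))
    print("peak concurrent calls:", observed_peak)
```

Sizing the cap is then a matter of dividing the database's connection limit by the connections each worker holds, leaving headroom for other clients.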
When any of these trips, Datadog / Dynatrace is the common nervous system (host metrics for the VM, pod and cluster events for containers, and invocation traces with cold-start timing for functions), and a breached alert threshold automatically raises a ServiceNow incident so on-call gets a ticket with context, not just a pager buzz. New production deployments and any change to the warehouse VM also pass through a ServiceNow change request, giving a small team a lightweight but real approval gate.
Build and deploy — the same pipeline, three targets
The reassuring part for a junior team is that the delivery path is largely shared. Infrastructure for all three models is declared in Terraform — the VMs and their networking, the Kubernetes cluster and node pools, and the serverless functions and their triggers all live as code in one repository, reviewed and reproducible — with Ansible handling in-guest configuration of the warehouse VM (the one place a config-management tool still earns its keep, since there is an OS to configure). The application pipeline runs in GitHub Actions (or Jenkins): it builds and tests, runs Wiz Code as a security gate on images and IaC, and then deploys — pushing container images that Argo CD rolls out to the Kubernetes cluster via GitOps, and publishing function packages to the serverless platform. One team, one pipeline, three deployment targets — which is the practical reason a mixed estate is manageable for eight engineers rather than overwhelming.
A minimal Terraform sketch makes the “all three as code” point concrete:
```hcl
# Sketch only: required arguments such as IAM roles, subnets, names, and
# task definitions are omitted for brevity.

# Always-on catalogue → managed Kubernetes (containers)
resource "aws_eks_node_group" "catalogue" {
  cluster_name   = aws_eks_cluster.main.name
  instance_types = ["m6i.large"]

  scaling_config {
    min_size     = 3
    max_size     = 12
    desired_size = 3
  }
}

# Spiky checkout → container-serverless (scales to zero)
resource "aws_ecs_service" "checkout" {
  launch_type = "FARGATE"
  # autoscaling target tracks request count; min can be 0 off-peak
}

# Nightly re-pricing → serverless function on a schedule
resource "aws_lambda_function" "reprice" {
  function_name = "nightly-reprice"
  timeout       = 900  # seconds (15 minutes, Lambda's maximum)
  memory_size   = 1024 # MB
}

# Warehouse appliance → a plain VM (vendor image, reserved)
resource "aws_instance" "warehouse_gw" {
  ami           = var.vendor_appliance_ami # cannot be containerised
  instance_type = "m6i.xlarge"
}
```
Explicit tradeoffs
Accept these, or pick differently. A mixed estate — the architecture above — fits each workload optimally but costs you cognitive and tooling breadth: your team must understand VMs, Kubernetes, and serverless, and operate monitoring and security across all three. The alternative, standardising on one model, trades fit for simplicity: an all-Kubernetes shop runs even the batch job as a CronJob and the spiky checkout as a pod (over-paying a little for idle nodes) but only has to master one platform — a defensible choice for a small team that values focus over per-workload optimisation. There is no universally correct answer; there is the answer that fits your workload shapes and your team’s size.
The honest cautions per model. Serverless buys the least operational burden and the best spiky-cost story, but pays in cold starts, execution and statelessness limits, and vendor lock-in — a Lambda’s triggers and event shapes are AWS-specific, and moving to another cloud is a rewrite, not a redeploy. Containers buy portability and clean scaling but make you operate Kubernetes, which is real, ongoing complexity that a team must be honest about resourcing. VMs buy total control and run literally anything — legacy apps, vendor appliances, special kernels and GPUs — but hand you the entire OS to patch, secure, and scale, which is the most operational burden of the three.
When each clearly wins. Reach for serverless when the workload is event-driven, scheduled, spiky, or low-volume, and the team is small — it is the lowest-effort starting point and scales to zero when idle. Reach for containers when you run many always-on services, want them deployed uniformly across languages, and need portability and steady-state scaling efficiency. Reach for VMs when something must be a VM — a vendor virtual appliance like the warehouse gateway, a legacy monolith that will not containerise, a stateful service with long-lived connections, or a workload needing a specific kernel, driver, or GPU. Most real estates, like the grocer’s, end up using all three — and the mark of a good architect is not loyalty to one model but the judgement to put each workload where it belongs.
The shape of the win
For the grocer’s review, the resolution is not “the VM person won” or “serverless won” — it is that the argument was the wrong shape. The catalogue goes on containers because it is always on and latency-sensitive; checkout goes on container-serverless because it is spiky and the team should not pay for idle capacity; the nightly re-pricing goes on serverless functions because it runs for minutes a day; and the warehouse appliance stays a VM because the vendor ships it as one and the support contract demands it. Underneath, one identity layer in Okta / Entra ID, one secrets layer in HashiCorp Vault, one observability layer in Datadog, one security posture in Wiz, and one delivery pipeline in GitHub Actions / Argo CD with Terraform make a three-model estate operable by eight people. The lesson a junior architect should carry out of this is the durable one: a compute model is not a religion to adopt but a tool to match — to the workload’s traffic shape, its statefulness, its packaging, and the size of the team that has to run it at 3 a.m.