Google Compute Engine, In Depth: Machine Types, Disks, Images, Metadata & Every Option

A Google Compute Engine (GCE) instance is the most fundamental piece of compute you can rent on Google Cloud: a virtual machine — vCPUs, memory, disks, a network interface — running on Google’s infrastructure, that you control from the operating system upward. It is pure Infrastructure as a Service (IaaS). Google runs the physical host, the hypervisor, the datacentre, the power and the network fabric; you own the OS, the patches, the software you install, and your data. If you have ever installed Ubuntu or Windows Server on a laptop, you already understand most of what a VM is. The remaining part — the part interviewers and the ACE and Professional Cloud Architect exams probe relentlessly — is the dozens of choices GCE asks you to make when you create an instance, and the operations you can (and cannot) perform afterwards.

This lesson is deliberately exhaustive. We go family by family through every machine type, then through images, disks (from Balanced Persistent Disk all the way to Hyperdisk), every provisioning and discount model (on-demand, Spot/preemptible, sole-tenant, committed and sustained use), networking, the metadata server with startup and shutdown scripts, OS Login vs SSH keys, service accounts and access scopes, and the security shells Shielded VM and Confidential VM. Every option gets the same treatment: what it is · the choices · the default · when to pick which · the trade-off · the limit · the cost impact · the gotcha. Each core operation comes with a real gcloud command so you can do this by hand or wire it into Terraform later. By the end you will know the Compute Engine instance end to end — enough to ace an ACE or PCA question, sail through an interview, and run VMs safely in production.

Learning objectives

By the end of this lesson you can:

Choose the right machine family and type (E2, N2, N2D, C3, C3D, T2D, M3, A3, plus custom machine types) for a workload and justify the vCPU-to-memory ratio.
Select the correct disk type — Balanced, SSD, Extreme, or Hyperdisk — and explain boot vs data disks, snapshots, and regional Persistent Disk.
Pick a provisioning model — on-demand, Spot, or sole-tenant — and apply committed use discounts (CUDs) and sustained use discounts (SUDs) correctly.
Configure networking (VPC/subnet, internal vs external IP, network tags) and read the metadata server, including startup and shutdown scripts.
Explain OS Login vs metadata SSH keys, attach a service account with the right scopes, and turn on Shielded and Confidential VM.
Reproduce the core create and configure operations in real gcloud, and know the path from an instance template to a managed instance group.

Prerequisites & where this fits

You should already understand Google Cloud’s resource hierarchy — organisation → folder → project → resource — what a region and a zone are, and how to run gcloud from Cloud Shell or a local SDK install (covered in the Fundamentals module). No prior VM experience is assumed; we define every term. This is the anchor lesson of the Compute module in the GCP Zero-to-Hero course: it introduces the machine types, disks, images, metadata, and identity model that the rest of the compute track — managed instance groups, Cloud Run, GKE — builds on. Once you can drive a single instance fluently, the leap to a self-healing fleet in Regional Managed Instance Groups: Autohealing, Canary Rollouts, and Stateful MIGs is small.

Core concepts

Before the options, fix five mental models. They explain why the settings are shaped the way they are.

An instance is an assembly, not a single resource. When you “create a VM” you actually create and wire several objects: the instance itself, one or more disks (a boot disk, optional data disks), a network interface attached to a subnet, optionally an external IP, and an attached service account. The console hides this behind one form; gcloud and Terraform make it explicit. It matters for deletion too: by default the boot disk is deleted with the instance, but additional disks are not unless you set auto-delete — a classic source of orphaned-disk cost.

Compute and storage are decoupled. The instance is the CPU and RAM; Persistent Disk (PD) and Hyperdisk are independent network-attached block devices. This is the single most important architectural idea: you can stop an instance (largely stop paying for compute) while keeping its disks, change the machine type without touching the disks, or detach a disk and attach it to another instance for recovery. The exception is Local SSD, which is physically attached to the host: blisteringly fast but ephemeral. Its data is lost when the instance is stopped, deleted, or preempted, or when the host fails; it is preserved across a live migration.

Zonal vs regional resources. An instance is a zonal resource — it lives in exactly one zone (e.g. europe-west2-a). A standard Persistent Disk is also zonal and must be in the same zone as the instance it attaches to. Regional Persistent Disk synchronously replicates across two zones in a region for high availability. Images, snapshots, instance templates, and firewall rules are global; subnets are regional. Knowing the scope of each resource explains where you can move things and what survives a zone outage.

Live migration keeps you running through maintenance. Unlike many clouds, GCE can live-migrate a running instance to another host during planned host maintenance with no reboot — controlled by the instance’s availability policy (onHostMaintenance = MIGRATE by default for standard VMs). Spot VMs and some accelerator/Confidential configurations cannot migrate and are instead terminated on maintenance. This is why GCE rarely forces a reboot for host patching.

Projects carry quota, defaults, and billing. vCPU counts, IP addresses, and disk capacity are all governed by per-region quotas on the project. Every instance also runs as a service account identity and bills to the project’s billing account. Key terms used throughout: vCPU (one hyperthread on most families), machine type (a named shape such as n2-standard-4 = 4 vCPU, 16 GiB), image (the OS template you boot from), metadata (key/value config the instance can read about itself), and access scope (a legacy ceiling on what the attached service account’s token may do).

Choosing a machine type: every family

A machine type defines the instance’s vCPU count, memory, and the underlying CPU platform. Google groups machine types into families by purpose, and within a family into series (a hardware generation) and types (standard, highmem, highcpu, sometimes highgpu/ultramem). The four broad categories are general-purpose, compute-optimised, memory-optimised, and accelerator-optimised.

Family	Series	Category	CPU platform	vCPU:memory feel	Live migration	Typical use cases
E2	`e2`	General-purpose (cost)	Intel/AMD (abstracted)	Balanced (`standard` 1:4)	Yes	Dev/test, small/medium web and app servers, microservices on a budget
N2	`n2`	General-purpose (balanced)	Intel Cascade/Ice Lake	`standard` 1:4, `highmem` 1:8, `highcpu` 1:1	Yes	Most production workloads needing predictable Intel performance
N2D	`n2d`	General-purpose (balanced)	AMD EPYC (Rome/Milan)	Same ratios as N2	Yes	Same as N2 but cheaper per vCPU on AMD; scale-out web/app
T2D / T2A	`t2d` (AMD), `t2a` (Arm)	General-purpose (scale-out)	AMD Milan / Ampere Altra Arm	1:4, no SMT (vCPU = physical core)	T2D yes; T2A no	High throughput-per-cost scale-out; T2A for Arm-native workloads
C3 / C3D	`c3` (Intel), `c3d` (AMD)	General-purpose (performance)	Intel Sapphire Rapids / AMD Genoa	`standard` 1:4, `highcpu` 1:2, `highmem` 1:8	Yes (with newer support)	CPU-bound, latency-sensitive: gaming, HPC front-ends, ad serving, high-traffic web
C2 / C2D	`c2`, `c2d`	Compute-optimised (prior gen)	Intel Cascade Lake / AMD Milan	High clock, 1:4	Yes	Single-thread-sensitive, HPC, electronic design
M3 / M2 / M1	`m3`, `m2`, `m1`	Memory-optimised	Intel	`megamem`/`ultramem` up to ~1:28+	Limited	SAP HANA, large in-memory databases, big analytics
A3 / A2 / G2	`a3`, `a2` (NVIDIA), `g2`	Accelerator-optimised	Intel + NVIDIA H100/A100 / L4	GPU-attached	No (terminated)	AI/ML training and inference, rendering, GPU compute

A few rules make sense of the table:

Series names encode the family, generation, and vendor. The letter is the family line, the number is the generation, a trailing d means AMD (e.g. n2d, c3d), a trailing a means Arm (t2a), and no suffix is usually Intel. So c3d-highcpu-8 = C series, generation 3, AMD Genoa, high-CPU ratio, 8 vCPU.
Types set the vCPU:memory ratio. standard ≈ 4 GiB per vCPU, highmem ≈ 8 GiB, highcpu ≈ 1–2 GiB, ultramem/megamem push far higher for the M-family. Pick the type by your workload’s memory hunger, not by guesswork.
SMT (hyperthreading). On E2/N2/N2D/C3 a vCPU is one hyperthread (two vCPUs share a physical core). On T2D/T2A a vCPU is a full physical core — relevant for licensing and for CPU-bound throughput. You can also disable SMT (--threads-per-core=1) on supported families for licence-bound or security-sensitive workloads.
E2 is the default cost choice; C3 is the performance edge. Start at E2 for dev/test and modest services, move to N2/N2D for steady production, and reach for C3/C3D when you are genuinely CPU-bound and latency-sensitive.
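
The naming rules above can be sketched as a small shell helper. This is an illustration only: describe_machine_type is a hypothetical function, it handles names of the form series-type-vcpus (not shared-core shapes such as e2-micro), and the vendor label is the suffix heuristic from this section, not an API lookup.

```shell
# Hypothetical helper: split a machine type such as c3d-highcpu-8 into its parts.
# The vendor is guessed from the series suffix; it is a heuristic, not an API call.
describe_machine_type() (
  name="$1"
  series="${name%%-*}"   # text before the first dash, e.g. c3d
  rest="${name#*-}"      # text after the first dash, e.g. highcpu-8
  ratio="${rest%%-*}"    # ratio type, e.g. highcpu
  vcpus="${rest##*-}"    # trailing vCPU count, e.g. 8
  case "$series" in
    *d) vendor="AMD" ;;
    *a) vendor="Arm" ;;
    *)  vendor="Intel-or-abstracted" ;;
  esac
  printf 'series=%s vendor=%s type=%s vcpus=%s\n' "$series" "$vendor" "$ratio" "$vcpus"
)

describe_machine_type c3d-highcpu-8
describe_machine_type t2a-standard-4
describe_machine_type n2-highmem-16
```

For an authoritative answer, gcloud compute machine-types describe reports the real vCPU and memory figures for a given zone.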

Custom machine types

The predefined types may not fit your ratio — perhaps you need 6 vCPU with 40 GiB rather than the 24 GiB a standard gives. Custom machine types (E2, N2, N2D, and others) let you choose vCPU and memory independently within the family’s limits, with extended memory available above the normal per-vCPU ceiling at a small premium.

Aspect	Predefined	Custom
vCPU/memory	Fixed combinations	You pick both (within family rules)
When to use	Standard ratios, simplest billing	Right-sizing to avoid paying for unused RAM or CPU
Constraints	n/a	vCPU count of 1 or an even number (allowed steps vary by family); memory typically 0.5–8 GiB per vCPU, in 256 MiB multiples (more with extended memory)
Cost	List price per shape	Per-vCPU + per-GiB pricing; extended memory billed higher
Gotcha	May force you up a size	Slightly higher unit price than the closest predefined; not every family supports custom

Create a custom shape with --custom-cpu and --custom-memory (use --custom-extensions for extended memory). Right-sizing with custom types is one of the cheapest wins on a GCE bill, and Google’s rightsizing recommendations in the console will suggest moves based on observed utilisation.
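
As a rough sketch of the sizing rule, the helper below builds a custom machine type name and checks the per-vCPU memory bounds. custom_type_name is a hypothetical function; the 0.5 to 8 GiB bounds are the ones quoted in the table above and may differ for a given family, and only whole GiB values are handled.

```shell
# Hypothetical helper: build a custom machine type name from vCPUs and GiB of RAM.
# Assumed bounds: 0.5 to 8 GiB per vCPU; above that, the extended-memory suffix is used.
custom_type_name() (
  series="$1"; vcpus="$2"; mem_gib="$3"
  mem_mib=$(( mem_gib * 1024 ))   # custom type names carry memory in MiB
  min_mib=$(( vcpus * 512 ))      # 0.5 GiB per vCPU
  max_mib=$(( vcpus * 8192 ))     # 8 GiB per vCPU
  if [ "$mem_mib" -lt "$min_mib" ]; then
    echo "memory is below 0.5 GiB per vCPU" >&2
    exit 1
  fi
  if [ "$mem_mib" -gt "$max_mib" ]; then
    printf '%s-custom-%s-%s-ext\n' "$series" "$vcpus" "$mem_mib"
  else
    printf '%s-custom-%s-%s\n' "$series" "$vcpus" "$mem_mib"
  fi
)

# The 6 vCPU / 40 GiB shape from the text.
custom_type_name n2 6 40
```

The printed name can be passed to --machine-type; the same shape can also be requested with --custom-vm-type=n2 --custom-cpu=6 --custom-memory=40GB.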

Choosing an image

An image is the OS template the boot disk is created from. GCE offers three categories.

Image kind	What it is	When to use	Gotcha
Public images	Google- and partner-maintained OS images: Debian, Ubuntu, RHEL, Rocky, SLES, Windows Server, Container-Optimized OS (COS)	The default starting point; COS for running containers directly on a VM	Some (RHEL, SLES, Windows) carry a per-second premium licence charge on top of the VM
Custom images	Your own image baked from a configured disk (a “golden image”)	Bake dependencies and hardening once, boot identical VMs fast	You own patching and lifecycle; store in a dedicated image project
Image families	A named pointer (e.g. `debian-12`, or your `my-app-prod`) that always resolves to the latest non-deprecated image	Templates and MIGs — get patches without editing the template	Pin a specific image instead when you need fully reproducible builds

Reference a public image by --image-family + --image-project (e.g. --image-family=debian-12 --image-project=debian-cloud). Image families are the right default for instance templates because they let a rebuild pick up the latest patched image automatically; pin an exact image (--image=...) when you need byte-for-byte reproducibility. Images are global resources and can be shared across projects via IAM. Machine images (a related but distinct object) capture the whole instance — config plus all disks — and are handy for cloning or backup.

Disks: every type

Storage is decoupled from compute, so the disk choice is its own decision. The boot disk holds the OS; data disks hold everything you want to survive a machine-type change or a rebuild. The modern, recommended block-storage line is Hyperdisk; the long-standing line is Persistent Disk (PD); Local SSD is ephemeral host-attached storage.

Disk type	Media	Performance model	Boot disk?	Best for	Cost feel	Gotcha
Standard PD (`pd-standard`)	HDD	Throughput scales with size; low IOPS	Yes	Cold/sequential, logs, cheap bulk	Lowest	Poor random-IOPS; avoid for databases
Balanced PD (`pd-balanced`)	SSD	Good IOPS/throughput per GB; the sensible default	Yes	Most boot disks and general workloads	Mid	n/a — this is the default pick
SSD PD (`pd-ssd`)	SSD	Higher IOPS/throughput per GB than Balanced	Yes	Latency-sensitive databases, high-IOPS apps	Higher	More expensive; size still gates performance
Extreme PD (`pd-extreme`)	SSD	Provisioned IOPS independent of size	Yes (limited)	Highest-performance PD workloads (large DBs)	High	Only on larger machine types; you pay for provisioned IOPS
Hyperdisk Balanced	SSD (next-gen)	Independently provisioned IOPS and throughput	Yes	New general workloads wanting tuned performance	Mid–high	Family/region support varies; the strategic default going forward
Hyperdisk Extreme	SSD (next-gen)	Very high provisioned IOPS	Data	Mission-critical DBs (SAP HANA, large SQL)	High	Larger machine types only
Hyperdisk Throughput	SSD (next-gen)	Provisioned throughput, cost-efficient	Data	Throughput-oriented analytics, Kafka, Hadoop	Mid	Optimised for MB/s, not IOPS
Hyperdisk ML	SSD (next-gen)	Very high read throughput, multi-attach read-only	Data	Loading large ML datasets/models to many VMs	Varies	Read-optimised; specialised use
Local SSD	Physically attached NVMe	Highest IOPS/lowest latency; ephemeral	No	Scratch, caches, temp, shuffle space	Per-device	Data lost on stop, delete, preemption, or host error (preserved across live migration); back it up if it matters

Three properties cut across all disk types:

Performance often scales with provisioned size (PD families) — a tiny pd-ssd is slow; the same type at 500 GB is fast. Hyperdisk and Extreme PD break this link by letting you provision IOPS and throughput directly, so you can buy performance without buying capacity you do not need.
Boot vs data. The boot disk is created from an image and (by default) deleted with the instance. Data disks are created or attached separately and should set --device-name for stable in-OS paths; set auto-delete=no if they must outlive the VM.
Snapshots are incremental. A snapshot is a point-in-time backup of a disk; it is a global resource whose data is stored in a regional or multi-regional location. After the first full snapshot, later ones store only changed blocks. Use a resource policy to schedule snapshots automatically. Machine images capture the whole instance. For zone-failure resilience on a single VM, use Regional Persistent Disk, which keeps a synchronous replica in a second zone so you can force-attach it elsewhere after a zone outage.
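
A snapshot schedule is created as a resource policy and then attached to a disk. The sketch below prints the two gcloud commands rather than running them (set DRY_RUN=0 to execute). The policy name daily-14d, the 04:00 start time, and the 14-day retention are illustrative assumptions, and run and schedule_daily_snapshots are hypothetical helpers.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical helper: create a daily snapshot schedule and attach it to one disk.
schedule_daily_snapshots() (
  policy="$1"; disk="$2"; region="$3"; zone="$4"
  # The schedule itself is a regional resource policy.
  run gcloud compute resource-policies create snapshot-schedule "$policy" \
    --region="$region" --daily-schedule --start-time=04:00 --max-retention-days=14
  # Attaching the policy to a disk is what makes the snapshots happen.
  run gcloud compute disks add-resource-policies "$disk" \
    --resource-policies="$policy" --zone="$zone"
)

schedule_daily_snapshots daily-14d gce-lab-data us-central1 us-central1-a
```

The policy must live in the region that contains the disk's zone.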

Provisioning and discount models

How you acquire the instance changes both its resilience and its price. There are three current provisioning models (on-demand, Spot, and sole-tenant), plus legacy preemptible VMs, and two discount programmes: one automatic, one committed.

Model	Discount	Eviction/termination	SLA	When to use	Gotcha
On-demand (standard)	None (list price)	None	Standard VM SLA	Steady production you cannot interrupt	Most expensive per hour
Spot VMs	~60–91% off	Google can preempt any time (30-second notice); cannot live-migrate	No SLA	Fault-tolerant, stateless, batch, CI, rendering, MIG burst capacity	Can vanish mid-run; never for stateful primaries
Preemptible VMs (legacy)	Similar discount	Preempted, hard 24-hour cap	No SLA	Legacy; prefer Spot	Spot is the modern replacement with no 24h cap
Sole-tenant nodes	n/a (premium)	None	Standard	Licensing (BYOL per-core), compliance/isolation requirements	You pay for the whole physical host; more expensive

And the two discount programmes that apply on top:

Programme	What it is	Commitment	Typical saving	Applies to	Gotcha
Sustained Use Discounts (SUDs)	Automatic discount the longer an instance runs in a month	None — applied automatically	Up to ~20–30% (general-purpose/memory-optimised)	E2 is excluded; N2/N2D/C2 etc. qualify	Nothing to do; not stackable with CUDs on the same vCPUs
Committed Use Discounts (CUDs)	A 1- or 3-year commitment to spend/usage	1 or 3 years	Up to ~57% (resource-based) / flexible (spend-based)	Resource-based (per family/region) or spend-based (flexible)	Pay even if unused; plan to baseline, not peak

The mental model: Spot for anything interruptible (huge savings), CUDs for your steady baseline (commit to what you always run), and let SUDs apply automatically to the rest. Reserve on-demand for the spiky top of the curve. Create a Spot instance with --provisioning-model=SPOT --instance-termination-action=STOP (or DELETE); set --max-run-duration if you want a self-terminating box.
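
The Spot flags above fit together roughly as follows. This sketch only prints the command by default (set DRY_RUN=0 to run it); the instance name batch-worker-01 and the e2-medium machine type are assumptions chosen for illustration, and run and create_spot_vm are hypothetical helpers.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical helper: create a Spot VM with an explicit termination action.
create_spot_vm() (
  name="$1"; action="${2:-STOP}"
  case "$action" in
    STOP|DELETE) ;;   # STOP keeps the VM and its disks; DELETE removes the VM
    *) echo "termination action must be STOP or DELETE" >&2; exit 1 ;;
  esac
  run gcloud compute instances create "$name" \
    --machine-type=e2-medium \
    --provisioning-model=SPOT \
    --instance-termination-action="$action"
)

create_spot_vm batch-worker-01 DELETE
```

DELETE suits disposable batch workers; STOP suits a VM whose disks you want to keep and restart later.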

Networking

Every instance attaches to a VPC network and a subnet through a network interface, and that placement governs its IP addressing and reachability.

VPC and subnet. GCP VPCs are global; subnets are regional with an IP range. The instance takes a primary internal IP from the subnet’s range (static or ephemeral). Choose the subnet in the same region as the instance.
Internal vs external IP. The internal IP is private (RFC 1918) and used for VPC-internal traffic. An external IP (ephemeral or reserved/static) makes the instance reachable from the internet, but you can omit it entirely: use Cloud NAT for outbound internet access, IAP TCP forwarding for admin access, and an internal load balancer for inbound traffic from inside the VPC. Best practice: no external IP on private workloads.
Network tags. Free-form labels on the instance (e.g. web, allow-ssh) that firewall rules target. Tags are the primary way firewall policy attaches to instances; keep a small, intentional vocabulary.
Multiple NICs and alias IP ranges. An instance can have several network interfaces (each in a different VPC) and alias IP ranges (extra IPs drawn from the subnet’s primary or secondary ranges, used heavily by GKE Pods).
IP forwarding. Off by default; enable --can-ip-forward only for NAT gateways, VPN/router VMs, or appliances that route traffic not addressed to themselves.

You will go far deeper on subnets, routes, firewall rules, and Cloud NAT in the Networking module; here, just place the instance in the right subnet and tag it for the firewall rules it needs.

The metadata server, startup and shutdown scripts

Every instance can query a metadata server at the link-local address http://metadata.google.internal/ (i.e. 169.254.169.254) to learn about itself and its project, and to fetch credentials for its service account. Requests must send the header Metadata-Flavor: Google — a deliberate guard against confused-deputy SSRF attacks.

Metadata kind	Scope	Examples	Set/changed by
Project metadata	All instances in the project	Project-wide SSH keys, `enable-oslogin`, custom keys	Project editors
Instance metadata	A single instance	`startup-script`, `shutdown-script`, custom app config, per-instance SSH keys	Instance owner
Default/derived	Per instance	Hostname, zone, machine type, network, service-account token	Google (read-only)

Two metadata keys are workhorses:

startup-script runs as root on every boot (first boot and every restart) — install packages, pull config, register with a load balancer. Supply it inline (--metadata startup-script=...), from a file (--metadata-from-file startup-script=...), or from a Cloud Storage object (--metadata startup-script-url=gs://...).
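shutdown-script runs on graceful stop/delete and on Spot preemption, on a best-effort basis and within a limited window (about 30 seconds on preemption; standard VMs get roughly 90 seconds by default). Use it to flush buffers, deregister, and drain connections. Keep it short and idempotent; the window is small.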

Fetch the service-account token or any attribute from inside the VM:

# Read this instance's zone and the active service-account access token
curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/zone

curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token

This token is how Application Default Credentials authenticate on a VM with no key file — the recommended pattern. Gotcha: never expose the metadata endpoint through a proxy or SSRF-prone app; the Metadata-Flavor header requirement and firewalling exist precisely because the token is sensitive.
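
The zone attribute comes back as a full path (projects/PROJECT_NUMBER/zones/ZONE), so scripts usually trim it. A minimal sketch, assuming hypothetical helper names metadata_get, zone_from_path, and region_from_zone; only the two string helpers run here, because metadata_get works solely from inside a VM.

```shell
# Hypothetical helper: fetch one metadata attribute (only works on a GCE instance).
metadata_get() {
  curl -s -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/$1"
}

# Keep only the last path segment: projects/123/zones/europe-west2-a -> europe-west2-a
zone_from_path() ( printf '%s\n' "${1##*/}" )

# Drop the trailing zone letter: europe-west2-a -> europe-west2
region_from_zone() ( printf '%s\n' "${1%-*}" )

# Example with a literal path (the project number is made up):
zone="$(zone_from_path "projects/123456789/zones/europe-west2-a")"
region="$(region_from_zone "$zone")"
printf 'zone=%s region=%s\n' "$zone" "$region"

# On a real VM the path would come from the metadata server instead:
#   zone="$(zone_from_path "$(metadata_get instance/zone)")"
```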

OS Login vs SSH keys

There are two ways to grant Linux SSH access, and choosing correctly is a recurring exam and security topic.

Aspect	OS Login (recommended)	Metadata SSH keys (legacy)
Where identity lives	Tied to Google/Cloud Identity users via IAM	Public keys stored in project or instance metadata
Access control	IAM roles: `roles/compute.osLogin`, `roles/compute.osAdminLogin`	Whoever’s public key is in metadata can log in
Provisioning	POSIX accounts created automatically from the directory	You manually add/rotate keys in metadata
2FA / centralisation	Supports 2-step verification; central audit	None; sprawls across instances
Audit	Logins tied to a Google identity in Cloud Audit Logs	Hard to attribute; keys outlive people
Enable	`enable-oslogin=TRUE` in project or instance metadata	Default if OS Login is off
Gotcha	IAM role needed in addition to network access; org policy can enforce it	Stale keys are a real breach vector; avoid at scale

Use OS Login. Set enable-oslogin=TRUE at the project level, grant users roles/compute.osLogin (or osAdminLogin for sudo), and you get IAM-governed, auditable, centrally revocable SSH with optional 2FA — and you can enforce it org-wide with an org policy. Metadata keys remain useful only for break-glass or automation that cannot use a Google identity. For day-to-day admin without an external IP, combine OS Login with IAP TCP forwarding (gcloud compute ssh --tunnel-through-iap) so you never open port 22 to the internet.
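
Those two steps (turn OS Login on project-wide, then grant the role) might look like the sketch below. It prints the commands by default (set DRY_RUN=0 to run them); the project ID my-project and the user alice@example.com are placeholders, and run and enable_os_login are hypothetical helpers.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical helper: enable OS Login for a project and grant one member access.
enable_os_login() (
  project="$1"; member="$2"; role="${3:-roles/compute.osLogin}"
  # Project-level metadata applies to every instance that does not override it.
  run gcloud compute project-info add-metadata --project="$project" \
    --metadata=enable-oslogin=TRUE
  # Pass roles/compute.osAdminLogin as the third argument for sudo access.
  run gcloud projects add-iam-policy-binding "$project" \
    --member="$member" --role="$role"
)

enable_os_login my-project user:alice@example.com
```

Users who connect through IAP also need permission to use the tunnel (roles/iap.tunnelResourceAccessor).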

Service account & access scopes

Every instance runs as a service account identity. By default that is the Compute Engine default service account, but you should attach a dedicated, least-privilege service account per workload. What the instance can actually do is the intersection of two controls:

IAM roles granted to the service account — the real, modern permission model (e.g. roles/storage.objectViewer).
Access scopes — a legacy coarse ceiling set on the instance that caps which APIs the attached token may touch, regardless of IAM. The old default (https://www.googleapis.com/auth/devstorage.read_only plus a few) is narrow; the broad cloud-platform scope lets IAM be the sole gate.

Best practice: set the scope to cloud-platform (--scopes=cloud-platform) and control everything precisely with IAM roles on a dedicated service account (--service-account=...). Scopes are an artefact from before fine-grained IAM existed; treating IAM as the single source of truth avoids the confusing “I granted the role but it still says permission denied” trap caused by a too-narrow scope.

Shielded VM & Confidential VM

Two security shells harden the instance itself.

Shielded VM defends the boot integrity of the guest. It bundles Secure Boot (only signed bootloaders/kernels run), a virtual TPM (vTPM) (a virtualised root of trust for keys and measured boot), and integrity monitoring (baselines the boot measurements and alerts on drift, a rootkit/bootkit signal). On images that support Shielded VM, vTPM and integrity monitoring are on by default, but Secure Boot is off by default and must be enabled explicitly; none of the three adds cost. Enable all of them with --shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring. Gotcha: unsigned third-party kernel modules can fail Secure Boot, so test before enforcing.
Confidential VM encrypts memory in use using AMD SEV/SEV-SNP (or Intel TDX on supported types), so even Google’s hypervisor cannot read the VM’s RAM. Use it for regulated, “data-in-use” workloads; enable with --confidential-compute (which requires a supported machine type such as N2D and a compatible image). Gotcha: it is limited to specific families/regions, carries a small performance overhead, and in most configurations cannot live-migrate, so the VM is terminated on host maintenance (AMD SEV on N2D is the main exception; check the current support matrix).

From an instance to a fleet: templates & MIGs

A single instance is fine for a pet server, but production runs fleets. An instance template is an immutable, global definition of an instance — machine type, image, disks, network, metadata, service account, every option above frozen into one object. A Managed Instance Group (MIG) stamps out identical instances from a template and then autoheals (recreates failed instances against a health check), autoscales (adds/removes instances on a signal), and performs rolling and canary updates (shift the fleet to a new template gradually). Build the template once with gcloud compute instance-templates create, then drive the fleet — zonal or, for production, regional across zones — exactly as covered in Regional Managed Instance Groups: Autohealing, Canary Rollouts, and Stateful MIGs. Everything you have learned about machine types, disks, provisioning (including Spot for cheap burst capacity), metadata, and identity flows straight into the template.
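
The template-to-fleet path can be sketched in two commands. The block below prints them by default (set DRY_RUN=0 to run them); the names web-tpl-v1 and web-mig, the e2-medium machine type, and the size of 3 are illustrative assumptions, and run and create_fleet are hypothetical helpers.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Hypothetical helper: freeze the instance options into a template, then create a regional MIG.
create_fleet() (
  template="$1"; group="$2"; region="$3"; size="${4:-3}"
  run gcloud compute instance-templates create "$template" \
    --machine-type=e2-medium \
    --image-family=debian-12 --image-project=debian-cloud \
    --boot-disk-type=pd-balanced \
    --scopes=cloud-platform
  # --region makes the group regional, spreading instances across zones.
  run gcloud compute instance-groups managed create "$group" \
    --template="$template" --size="$size" --region="$region"
)

create_fleet web-tpl-v1 web-mig us-central1
```

Because templates are immutable, a change means creating a new template (for example web-tpl-v2) and rolling the group onto it.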

Google Compute Engine anatomy & options

The diagram above shows the full anatomy of a Compute Engine instance — the machine type and CPU platform at the centre, the boot and data disks (and ephemeral Local SSD) attaching from the storage plane, the network interface into a regional subnet with optional external IP and network tags, the metadata server feeding startup/shutdown scripts and the service-account token, and the OS Login, Shielded, and Confidential security shells wrapping it — and how a template projects all of this onto a managed instance group.

Hands-on lab

We will create a small instance on the Free Tier, inspect it, read its metadata, attach a data disk, then clean everything up. The e2-micro in an eligible US region is part of the GCE Always Free allowance, so this lab is effectively free; a $300 free-trial credit covers it comfortably regardless.

1. Set your project and a default zone.

gcloud config set project YOUR_PROJECT_ID
gcloud config set compute/zone us-central1-a

2. Create the instance — an e2-micro, Debian 12, Balanced boot disk, no external IP, with a startup script and the broad scope so IAM governs permissions.

gcloud compute instances create gce-lab-01 \
  --machine-type=e2-micro \
  --image-family=debian-12 --image-project=debian-cloud \
  --boot-disk-type=pd-balanced --boot-disk-size=10GB \
  --no-address \
  --shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring \
  --scopes=cloud-platform \
  --metadata=enable-oslogin=TRUE \
  --metadata-from-file=startup-script=<(echo '#!/bin/bash
echo "hello from $(hostname) in $(curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/zone | cut -d/ -f4)" > /var/tmp/lab.txt')

Expected output: a table showing gce-lab-01, its zone, machine type e2-micro, an internal IP, and STATUS: RUNNING.

3. Validate. Confirm it is running and read back the metadata and machine type.

gcloud compute instances describe gce-lab-01 \
  --format="value(status, machineType.basename(), networkInterfaces[0].networkIP)"

You should see RUNNING e2-micro 10.x.x.x. Because there is no external IP, SSH in over IAP (OS Login is enabled, so use your Google identity):

gcloud compute ssh gce-lab-01 --tunnel-through-iap --command="cat /var/tmp/lab.txt"

It should print hello from gce-lab-01 in us-central1-a, proving the startup script and metadata server both worked.

4. Attach a data disk. Create a 10 GB Balanced PD and attach it; because it is a separate resource, it can outlive the instance if needed.

gcloud compute disks create gce-lab-data --size=10GB --type=pd-balanced
gcloud compute instances attach-disk gce-lab-01 \
  --disk=gce-lab-data --device-name=data1

Confirm the OS sees the new block device: over the IAP SSH session, lsblk lists an extra 10G disk (for example sdb), and ls -l /dev/disk/by-id/google-data1 shows the stable symlink derived from the device name.

5. Cleanup. Delete the instance (and its boot disk), then the data disk, to stop all charges.

gcloud compute instances delete gce-lab-01 --quiet
gcloud compute disks delete gce-lab-data --quiet

Cost note. An e2-micro in us-central1/us-west1/us-east1 falls under the Always Free tier (one per month) with a 30 GB-month standard PD allowance; this lab’s 10 GB Balanced boot disk plus a short-lived 10 GB data disk costs only a few pennies even outside the free allowance, and nothing once deleted. The number-one source of surprise GCE cost is leftover disks and external IPs after the instance is gone — the explicit disk delete above is why we clean up by hand.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
`Permission denied` calling an API from the VM despite the right IAM role	Access scope on the instance is too narrow (legacy default)	Stop the VM and run `gcloud compute instances set-service-account` with `--scopes=cloud-platform` (or recreate it); let IAM gate access
Cannot SSH; `gcloud compute ssh` times out	No external IP and no IAP, or firewall blocks 22/IAP range	Use `--tunnel-through-iap` and allow IAP source range `35.235.240.0/20` to port 22
Data lost after stopping a VM	Data was on Local SSD (ephemeral)	Move persistent data to PD/Hyperdisk; back up Local SSD before stop
Orphaned disks/IPs still billing after deleting VMs	Additional disks default to `auto-delete=no`; reserved IPs persist	Delete leftover disks and release static IPs explicitly
Database disk feels slow	`pd-standard` or a tiny `pd-ssd` (perf scales with size)	Use Balanced/SSD/Hyperdisk; size up or provision IOPS (Extreme/Hyperdisk)
Instance unexpectedly terminated	It is a Spot VM and was preempted	Expected — handle with a shutdown script and MIG recreate; use on-demand for stateful
OS Login user cannot log in	Missing `roles/compute.osLogin` even with network access	Grant the OS Login (or `osAdminLogin`) IAM role to the user
Secure Boot blocks a custom kernel module	Module is unsigned; Shielded Secure Boot rejects it	Sign the module or temporarily disable Secure Boot to validate

Best practices

Right-size with custom machine types and act on Google’s rightsizing recommendations; do not pay for RAM or vCPU you never use.
Default to Balanced PD, step up to SSD/Extreme/Hyperdisk only where measured IOPS demand it, and schedule snapshots with a resource policy.
No external IP on private workloads — egress via Cloud NAT, admin via IAP, traffic via internal load balancers.
Use OS Login with IAM and IAP tunnelling; treat metadata SSH keys as break-glass only.
Attach a dedicated, least-privilege service account with scope cloud-platform, and never bake key files onto disks — rely on the metadata token and ADC.
Spot for interruptible work, CUDs for the steady baseline, and let SUDs apply automatically; reserve on-demand for the spiky peak.
Template everything and run production behind a regional MIG for autohealing, autoscaling, and safe rollouts.

Security notes

Enable Shielded VM (Secure Boot + vTPM + integrity monitoring) by default; use Confidential VM for data-in-use sensitivity.
Enforce OS Login org-wide and disable project-wide SSH keys to kill the stale-key attack surface.
Lock down the metadata server: never proxy 169.254.169.254, keep the Metadata-Flavor: Google guard, and watch for SSRF in apps that take URLs.
Apply least privilege with per-workload service accounts and IAM; use cloud-platform scope so IAM is the single gate rather than fighting legacy scopes.
Keep firewall exposure minimal with tight network tags; allow the IAP range to port 22 instead of 0.0.0.0/0.
Encrypt with CMEK where policy requires it (boot disks, data disks, and images all support customer-managed keys), and use Regional PD or scheduled snapshots so a zone failure is recoverable.

Interview & exam questions

What is the difference between E2, N2, and N2D, and when would you choose each? E2 is the cost-optimised general-purpose family (Intel/AMD abstracted, no SUDs); N2 is balanced Intel; N2D is the same balance on AMD EPYC at a lower per-vCPU price. Use E2 for dev/test and budget services, N2 for predictable Intel production, N2D when AMD price/performance wins.
Explain Spot VMs vs preemptible VMs. Both are deeply discounted surplus capacity with no SLA that Google can reclaim. Preemptible (legacy) has a hard 24-hour lifetime; Spot is the modern replacement with no 24-hour cap and configurable termination action. Use either only for interruptible, stateless, or checkpointed work.
CUDs vs SUDs? Sustained Use Discounts are automatic, applied as an instance runs longer in a month (E2 excluded). Committed Use Discounts require a 1- or 3-year commitment for a larger, predictable saving. Commit CUDs to your baseline; SUDs need no action.
Why might an API call from a VM fail even though the service account has the IAM role? The instance’s legacy access scope is too narrow and caps the token below what IAM allows. Set the scope to cloud-platform and govern with IAM.
OS Login vs metadata SSH keys — which and why? OS Login ties SSH to Google identities and IAM (auditable, centrally revocable, 2FA-capable); metadata keys are static public keys that sprawl and go stale. Prefer OS Login, enforced by org policy.
What does the metadata server do, and how do you query it safely? It serves instance/project metadata and the service-account token at 169.254.169.254; requests require the header Metadata-Flavor: Google, which (with firewalling) guards against SSRF. It powers Application Default Credentials.
Difference between a startup script and a shutdown script? A startup script runs as root on every boot; a shutdown script runs best-effort on graceful stop/delete and on Spot preemption (~30 s) — keep it short and idempotent.
Standard PD vs Balanced vs SSD vs Extreme vs Hyperdisk? Standard is HDD (cheap, low IOPS); Balanced is the SSD default; SSD is higher IOPS per GB; Extreme and Hyperdisk let you provision IOPS/throughput independently of size for the most demanding databases.
Persistent Disk vs Local SSD? PD/Hyperdisk are network-attached, durable, and survive a stop; Local SSD is host-attached and fastest, but ephemeral: data is lost when the VM is stopped, deleted, or preempted, or when the host fails. Local SSD data does survive a live migration.
What is live migration and when does it not happen? GCE can move a running standard VM to a new host during maintenance with no reboot (onHostMaintenance=MIGRATE). Spot VMs and GPU/accelerator VMs cannot live-migrate, and most Confidential VM configurations cannot either; these are stopped on maintenance (onHostMaintenance=TERMINATE) instead.
Shielded VM vs Confidential VM? Shielded protects boot integrity (Secure Boot, vTPM, integrity monitoring); Confidential encrypts memory in use (AMD SEV-SNP / Intel TDX) so the hypervisor cannot read RAM.
When do you reach for a custom machine type? When no predefined shape matches your vCPU:memory ratio — right-size to avoid paying for unused RAM or CPU, using extended memory for very RAM-heavy needs.
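
The metadata-server answer above can be illustrated with a short client. This is a sketch using only the Python standard library; the helper names metadata_request, read_metadata, and zone_name are hypothetical, and read_metadata only succeeds when run inside a Compute Engine instance. The point to notice is that every request carries the Metadata-Flavor: Google header.

```python
import urllib.request

METADATA_ROOT = "http://169.254.169.254/computeMetadata/v1/"


def metadata_request(path: str) -> urllib.request.Request:
    """Build a GET request for a metadata path with the mandatory header."""
    return urllib.request.Request(
        METADATA_ROOT + path.lstrip("/"),
        headers={"Metadata-Flavor": "Google"},
    )


def read_metadata(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata value; only reachable from inside a GCE instance."""
    with urllib.request.urlopen(metadata_request(path), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def zone_name(zone_path: str) -> str:
    """instance/zone returns 'projects/NUMBER/zones/ZONE'; keep the last segment."""
    return zone_path.rsplit("/", 1)[-1]
```

On a VM, zone_name(read_metadata("instance/zone")) would yield the short zone name. Production code would more likely rely on a Google client library and Application Default Credentials than on hand-built requests.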

Quick check

Which machine family gives a vCPU as a full physical core (no SMT) and offers an Arm variant?
Which discount applies automatically the longer an instance runs in a month, and which family is excluded from it?
What header must a request to the metadata server include?
Which disk types let you provision IOPS independently of disk size?
What single instance setting should you set to cloud-platform so that IAM becomes the sole permission gate?

Answers

The Tau family: T2D (AMD EPYC Milan with SMT disabled, so one vCPU is one physical core) and its Arm sibling T2A, which is the Arm variant.
Sustained Use Discounts (SUDs) apply automatically; the E2 family is excluded.
Metadata-Flavor: Google.
Extreme Persistent Disk and the Hyperdisk family (Hyperdisk Balanced/Extreme), which decouple performance from capacity.
The instance access scope (--scopes=cloud-platform); permissions are then governed entirely by the attached service account’s IAM roles.
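
The last answer is easier to remember as a toy model: a call from a VM succeeds only when both the access scope and IAM allow it. The sketch below is deliberately simplified. The two scope URLs are real, but the operation names and the SCOPE_ALLOWS mapping are hypothetical stand-ins for how scopes gate APIs.

```python
CLOUD_PLATFORM = "https://www.googleapis.com/auth/cloud-platform"
STORAGE_RO = "https://www.googleapis.com/auth/devstorage.read_only"

# Hypothetical, heavily simplified map of which operations each scope admits.
SCOPE_ALLOWS = {
    CLOUD_PLATFORM: {"storage.read", "storage.write", "logging.write"},
    STORAGE_RO: {"storage.read"},
}


def call_allowed(operation: str, iam_permissions: set, scopes: list) -> bool:
    """A call passes only if some scope admits it AND IAM grants it."""
    in_scope = any(operation in SCOPE_ALLOWS.get(s, set()) for s in scopes)
    return in_scope and operation in iam_permissions


# The service account holds storage read/write in IAM (illustrative values).
iam = {"storage.read", "storage.write"}
narrow = call_allowed("storage.write", iam, [STORAGE_RO])      # scope blocks it
broad = call_allowed("storage.write", iam, [CLOUD_PLATFORM])   # IAM decides
```

With the narrow scope the write fails even though IAM grants it; with cloud-platform the scope stops being the limiting factor and IAM alone decides.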

Exercise

Provision a small production-shaped web instance and harden it. Using gcloud: (a) create a dedicated service account with only roles/logging.logWriter and roles/monitoring.metricWriter; (b) launch an e2-small Debian 12 instance with no external IP, Balanced boot disk, --scopes=cloud-platform, enable-oslogin=TRUE, Shielded VM on, the dedicated service account attached, and a startup script that installs nginx (with no external IP, the subnet needs Cloud NAT for the package download to succeed); (c) add the network tag web and confirm a firewall rule allowing the IAP range to port 22; (d) SSH in through IAP, verify nginx is running, and read the instance’s zone from the metadata server; (e) take a snapshot of the boot disk; then (f) delete the instance, the disk, and the snapshot. Note in a sentence why you set the scope to cloud-platform rather than relying on legacy narrow scopes.

Certification mapping

Associate Cloud Engineer (ACE): “Deploying and implementing Compute Engine resources” — machine types, images, disks, metadata, startup scripts, SSH/OS Login, and instance lifecycle map directly to exam objectives; expect questions on Spot vs on-demand and on scopes vs IAM.
Professional Cloud Architect (PCA): designing compute that meets cost, performance, and resilience requirements — choosing families (E2/N2/N2D/C3/M3/A3), provisioning models (Spot/CUDs/sole-tenant), Regional PD, live-migration behaviour, and Shielded/Confidential VM are recurring design-scenario themes.
Both exams probe the stop-vs-billing, OS Login vs keys, scope vs IAM, and live-migration distinctions covered above.

Glossary

vCPU — a virtual CPU; on most families one hyperthread of a physical core (a full core on T2D/T2A).
Machine type — a named shape of vCPU + memory, e.g. n2-standard-4; predefined or custom.
Series / family — a hardware generation and purpose grouping (E2, N2, C3, M3, A3…).
Persistent Disk (PD) — network-attached, durable block storage (Standard/Balanced/SSD/Extreme).
Hyperdisk — next-generation block storage with independently provisioned IOPS and throughput.
Local SSD — host-attached, very fast, ephemeral NVMe storage.
Snapshot — incremental, point-in-time backup of a disk; machine image captures a whole instance.
Spot VM — deeply discounted preemptible capacity with no SLA (successor to preemptible VMs).
CUD / SUD — Committed (1/3-year) and Sustained (automatic) Use Discounts.
Metadata server — 169.254.169.254; serves instance/project metadata and the service-account token (requires Metadata-Flavor: Google).
Startup/shutdown script — root scripts run on every boot / on graceful stop and preemption.
OS Login — IAM-governed SSH access tied to Google identities.
Access scope — legacy per-instance ceiling on the service-account token’s API reach; use cloud-platform and rely on IAM.
Shielded VM — Secure Boot + vTPM + integrity monitoring for boot integrity.
Confidential VM — memory encrypted in use (AMD SEV-SNP / Intel TDX).
Instance template / MIG — an immutable instance definition / a self-healing, autoscaling, rolling-update fleet built from it.
Live migration — moving a running VM to a new host during maintenance without reboot.
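
To make the "Machine type" and "Series / family" entries concrete, the sketch below splits a predefined machine type name into its parts. It is an illustration only: parse_machine_type is a hypothetical helper, and it assumes the simple series-shape-vCPUs pattern. Custom types, accelerator shapes, and other suffixed names fall outside that pattern and return None.

```python
import re
from typing import NamedTuple, Optional


class MachineType(NamedTuple):
    series: str            # e.g. "n2", "n2d", "e2"
    shape: str             # e.g. "standard", "highmem", "small"
    vcpus: Optional[int]   # None for shared-core names such as e2-small


# Assumed pattern: <series>-<shape>[-<vcpus>], e.g. n2-standard-4.
_PATTERN = re.compile(r"^([a-z]\d+[a-z]?)-([a-z]+)(?:-(\d+))?$")


def parse_machine_type(name: str) -> Optional[MachineType]:
    """Split a simple predefined machine type name; None if it does not fit."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    series, shape, vcpus = match.groups()
    return MachineType(series, shape, int(vcpus) if vcpus else None)
```

For example, parse_machine_type("n2-standard-4") separates the N2 series, the standard shape, and the 4 vCPUs.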

Next steps

You can now drive a single Compute Engine instance end to end. The natural next step is to turn one instance into a resilient, self-managing fleet: read Regional Managed Instance Groups: Autohealing, Canary Rollouts, and Stateful MIGs to learn instance templates, autohealing health checks, autoscaling signals, and canary rollouts across zones. From there, continue into the deep dive on Cloud Run for serverless containers when you want to stop managing VMs altogether, and the VPC networking module to master the subnets, firewall rules, and Cloud NAT that the instances in this lesson depend on.