Carbon-Aware Workload Scheduling Across Cloud Regions

A global consumer-genomics company — the kind that sequences a saliva kit and emails you an ancestry report two weeks later — has a sustainability problem that just became a board problem. Their secondary-analysis pipeline (alignment, variant calling, annotation across millions of samples) burns tens of thousands of vCPU-hours a night, and under the EU’s CSRD reporting regime their auditors now want Scope 2 and Scope 3 emissions for cloud compute, broken down and defensible. The head of platform engineering gets the brief: “Cut the carbon intensity of the batch fleet without missing a single customer-facing SLA, and give the auditors a number they can’t poke holes in.” The pipeline is genuinely deferrable — a variant-calling job that finishes at 02:00 versus 06:00 makes no difference to a customer who gets their report on day fourteen — which is exactly the property that makes it a perfect candidate for carbon-aware scheduling. This article is the reference architecture for doing that properly: a multi-cloud batch platform that routes deferrable work to the cleanest available region and the cleanest available hour, with cost and SLA as hard constraints rather than afterthoughts.

The economics and the physics line up better than people expect. The carbon intensity of electricity — grams of CO2 per kWh — varies by 5–10x depending on where and when you compute. A vCPU-hour in eu-north-1 (Stockholm, predominantly hydro and nuclear) can be a tenth as dirty as the same vCPU-hour in a coal-heavy grid at peak demand, and within a single region the intensity swings hour to hour as wind drops off and gas peakers spin up. For an interactive web tier none of this is actionable — the user is where the user is. But for a fleet of non-urgent, relocatable batch jobs, you have two free dimensions to optimize over: which region and which hour. Spend them well and the same workload emits a fraction of the carbon, often at a lower dollar cost, because cheap clean power and cheap spare capacity frequently coincide.

Why this is a scheduling problem, not a hardware problem

The naive responses each fall short, and naming why matters because someone on the sustainability committee will propose all of them.

“Just buy renewable energy certificates (RECs)” makes the annual report look green via accounting, but it does not change a single electron your jobs actually consume, and a serious CSRD auditor increasingly wants location-based emissions (what the local grid really emitted) alongside the market-based REC number. “Move everything to the greenest region permanently” ignores data-residency law (you cannot ship EU genomic data to eu-north-1 if the customer contract pins it to Frankfurt), ignores data-gravity egress costs, and ignores that the “greenest region” is only greenest on average — at the wrong hour it can be dirtier than the alternative. “Buy more efficient instances” helps the denominator but is a one-time step change, not a continuous lever, and you will do it anyway.

Carbon-aware scheduling threads the needle. It treats grid carbon intensity as a first-class input to the placement decision, the same way a cost-aware scheduler treats spot price. For every deferrable job you ask: across the regions this job is legally and practically allowed to run, and across the time window before its deadline, what is the placement that minimizes carbon subject to the SLA and a cost ceiling? The job still runs, still meets its deadline, still respects residency — it just runs in the cleanest feasible (region, time) cell. The signal is external and continuous; the scheduler is the thing you build.

Architecture overview

Figure: Carbon-Aware Workload Scheduling Across Cloud Regions (architecture diagram)

The platform has two planes that share infrastructure but run on different clocks: a signal plane that continuously ingests grid-intensity forecasts and normalizes them, and a scheduling plane that consumes those signals to place and dispatch jobs onto AWS Batch and GKE compute. Holding them apart is the key to operating this well — the signal plane is a slowly-changing data pipeline, the scheduling plane is a latency-sensitive control loop.

The defining property of the topology is that carbon intensity, cost, and SLA deadline are three inputs to one optimization, and SLA is the hard constraint that can never be violated to chase a cleaner grid. A job with a 06:00 deadline will run on a dirtier grid at 05:30 rather than wait for cleaner power at 09:00 — the scheduler is carbon-aware, not carbon-obsessed. That ordering is what lets the platform team promise the business “zero SLA regressions” while still cutting emissions hard.

Signal plane, following the data flow:

  1. A forecast ingester (an AWS Lambda on a 5-minute EventBridge schedule, mirrored by a GKE CronJob for cloud independence) pulls marginal carbon-intensity forecasts per grid region from providers like Electricity Maps and WattTime, plus each cloud’s own published region carbon data. Marginal intensity — the emissions of the next kWh you’d draw — is the right signal for a shiftable load, because your job is the marginal demand.
  2. The raw signals are heterogeneous (different grid zones, units, refresh cadences), so a normalizer maps every cloud region to its grid zone, converts to a common gCO2/kWh, and writes a tidy time series (current value plus a 24-hour forecast) into DynamoDB (global table, low-latency reads), with the historical series kept in Amazon Timestream for trend analysis.
  3. A policy service overlays the constraints that have nothing to do with carbon: data-residency rules (which regions a given job class may legally use), a per-region cost ceiling fed from spot-price feeds, and capacity reality. This is the layer that keeps the optimizer honest and lawful.
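
The normalizer step can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the region-to-zone table, the zone codes, the `Reading` shape, and the assumption that one feed reports in lb CO2/MWh while another reports in gCO2/kWh are all hypothetical stand-ins for whatever the chosen providers really return.

```python
from dataclasses import dataclass

# Hypothetical region-to-zone table; real zone identifiers depend on the provider.
REGION_TO_ZONE = {
    "aws:eu-north-1": "SE",
    "aws:eu-central-1": "DE",
    "gcp:europe-west3": "DE",
}

# 1 lb = 453.59237 g and 1 MWh = 1000 kWh.
LB_PER_MWH_TO_G_PER_KWH = 453.59237 / 1000.0


@dataclass(frozen=True)
class Reading:
    zone: str       # grid zone the provider reports on
    hour_utc: str   # ISO-8601 hour the value applies to
    value: float
    unit: str       # "gCO2/kWh" or "lbCO2/MWh" (assumed feed units)


def to_g_per_kwh(value: float, unit: str) -> float:
    """Convert a reading to the common unit, gCO2/kWh."""
    if unit == "gCO2/kWh":
        return value
    if unit == "lbCO2/MWh":
        return value * LB_PER_MWH_TO_G_PER_KWH
    raise ValueError(f"unknown unit: {unit}")


def normalize(readings):
    """Return {(cloud_region, hour_utc): gCO2/kWh} for every mapped region."""
    by_zone = {}
    for r in readings:
        by_zone.setdefault(r.zone, {})[r.hour_utc] = to_g_per_kwh(r.value, r.unit)
    series = {}
    for region, zone in REGION_TO_ZONE.items():
        for hour, grams in by_zone.get(zone, {}).items():
            series[(region, hour)] = grams
    return series
```

Two cloud regions that sit on the same grid zone end up with identical series, which is the point: the scheduler reasons about cloud regions, while the signal is published per grid zone.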

Scheduling plane, following the control flow:

  1. Jobs are submitted to a single intake API behind an Akamai edge (TLS, anycast, WAF, bot mitigation at the perimeter) with a manifest declaring resource shape, a deadline, a residency class, and a carbon-flexibility flag (urgent jobs bypass shifting entirely; flexible jobs are the optimizer’s playground).
  2. The carbon-aware scheduler — the brain, running on a private GKE Autopilot cluster — is a custom controller that, for each pending flexible job, queries the signal store for the feasible (region, hour) cells the policy service permits, scores each cell as a weighted blend of carbon and cost, and emits a placement decision: region X, dispatch at time T. Deferred jobs sit in a queue with their scheduled wake time; ready jobs are dispatched now.
  3. Dispatch fans out to the right compute backend. CPU-bound, embarrassingly parallel stages (alignment, variant calling) land on AWS Batch with Fargate Spot and EC2 Spot compute environments in the chosen region; container-native and GPU stages (deep-learning variant refinement) land on GKE node pools with Spot VMs. Both run the same job container, so a job is genuinely portable across clouds and regions — the property the whole design depends on.
  4. Results land in the job’s residency-correct object store (S3 or GCS), and every placement decision — chosen region, the carbon intensity at dispatch, the counterfactual intensity of the default region, and the realized cost — is emitted as a metric to Datadog, which is where the cost-vs-carbon story gets told.
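
The manifest in step 1 is what every later decision hangs on, so it helps to see one possible shape for it. The sketch below is an assumption throughout: the field names, the residency class names, and the `route` labels are hypothetical, and a real intake API would validate far more.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical residency classes; the real sets come from the policy service.
RESIDENCY_CLASSES = {"eu-only", "de-pinned", "global"}


@dataclass(frozen=True)
class JobManifest:
    job_id: str
    vcpus: int
    memory_gib: int
    deadline: datetime      # must be timezone-aware
    residency_class: str
    flexible: bool          # False means urgent: skip shifting entirely


def validate(m: JobManifest, now: datetime) -> list:
    """Return a list of human-readable problems; empty means acceptable."""
    errors = []
    if m.vcpus <= 0 or m.memory_gib <= 0:
        errors.append("resource shape must be positive")
    if m.deadline.tzinfo is None:
        errors.append("deadline must be timezone-aware")
    elif m.deadline <= now:
        errors.append("deadline is in the past")
    if m.residency_class not in RESIDENCY_CLASSES:
        errors.append(f"unknown residency class: {m.residency_class}")
    return errors


def route(m: JobManifest) -> str:
    """Urgent jobs bypass the optimizer; flexible jobs go through it."""
    return "optimize" if m.flexible else "dispatch_now"
```

Rejecting a manifest without a usable deadline at intake matters more than it looks: a job with no deadline gives the scheduler no wall to respect.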

Component breakdown

Component | Service / tool | Role in the platform | Key configuration choices
--- | --- | --- | ---
Edge | Akamai | TLS, anycast, WAF, bot mitigation for the intake API | WAF on the submit endpoint; origin shield to the private intake origin
Identity / SSO | Okta + Entra ID | Workforce SSO (Okta) federated to Entra; OIDC for human + service access to the control plane | Group claims gate who may submit urgent jobs and override placement
Signal ingest | Lambda + GKE CronJob | Pull marginal-intensity forecasts every 5 min from Electricity Maps / WattTime + cloud region data | Dual-cloud schedulers so no single provider outage blinds the fleet
Signal store | DynamoDB + Timestream | Current + 24h forecast per region; historical trend | Global table for low-latency reads; TTL on stale forecasts
Policy service | Custom on GKE | Residency rules, per-region cost ceiling, capacity guardrails | Residency class → allowed-region set; spot-price feed for cost cap
Scheduler | Custom controller on GKE | Score feasible (region, hour) cells; place + defer jobs | Weighted carbon+cost objective; SLA deadline as hard constraint
Batch compute (CPU) | AWS Batch | Run portable job containers on Spot in the chosen region | Fargate Spot + EC2 Spot CEs; managed scaling; retry on reclaim
Batch compute (GPU/containers) | GKE | GPU + container-native stages on Spot VMs | Autopilot or Spot node pools; same image as AWS Batch
Secrets | HashiCorp Vault | Provider API keys (Electricity Maps/WattTime), cross-cloud creds, signing keys | Dynamic short-lived leases; agent injection; no long-lived keys in CI
Cloud posture | Wiz + Wiz Code | CSPM across both clouds; IaC scanning of the Terraform before it ships | Agentless scan of S3/GCS/DynamoDB; Wiz Code gate in the PR
Runtime security | CrowdStrike Falcon | Runtime threat detection on GKE nodes and EC2 batch hosts | Sensor on node pools + EC2 AMIs; detections to the SOC
Observability & carbon dashboards | Datadog | Cost-vs-carbon dashboards, placement traces, SLA + scheduler health | Custom carbon.* metrics; monitors on missed-deadline risk
ITSM / approvals | ServiceNow | Change gate for residency-policy edits; auto-incident on SLA breach risk | Change request before allowed-region set changes; auto-ticket on guardrail trip
CI / IaC | GitHub Actions + Argo CD + Terraform + Ansible | Build/test images; GitOps deploy of the scheduler; multi-cloud infra; host config | OIDC to AWS/GCP (no stored creds); Argo CD syncs scheduler; Ansible bakes batch AMIs

A few choices deserve the why, because they are the ones teams get wrong.

Why marginal, not average, intensity. Cloud providers publish average grid carbon for a region — useful for the annual report, wrong for the scheduling decision. When your deferrable job adds load, it is served by whatever generation is on the margin — usually the dispatchable fossil plant ramping to meet demand, not the wind farm that was already running flat out. Optimizing against marginal intensity (the emissions you actually cause) can route a job to a region with a higher average but lower marginal footprint at that hour. This distinction is the difference between a dashboard that looks green and a fleet that is greener; it is also the number a rigorous CSRD auditor respects.
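
A toy comparison makes the point concrete. The numbers below are invented purely for illustration (they are not measurements of any real grid): region-a has a clean average mix but a fossil plant on the margin, while region-b has a dirtier average and a cleaner margin at this particular hour.

```python
# Hypothetical intensities in gCO2/kWh for one hour; illustrative, not measured.
regions = {
    "region-a": {"average": 50.0, "marginal": 600.0},
    "region-b": {"average": 300.0, "marginal": 200.0},
}


def cleanest(candidates, signal):
    """Pick the region with the lowest value of the given signal."""
    return min(candidates, key=lambda name: candidates[name][signal])


by_average = cleanest(regions, "average")    # what the annual report would favor
by_marginal = cleanest(regions, "marginal")  # what the added load actually causes
```

With these made-up inputs the two signals pick different regions, which is exactly the case where optimizing on the average would send the job to the wrong place.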

Why the SLA deadline is sacred. It is tempting to let “wait for cleaner power” win ties. Do not — the moment a customer report slips because the scheduler held out for greener electrons, the entire program loses its mandate. Encode the deadline as a hard constraint: the scheduler may only choose among (region, time) cells that still leave enough runway to finish before the deadline, with a safety margin for spot reclaims and retries. Carbon is the objective; the deadline is the wall.
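
The wall reduces to one subtraction: the latest start that still leaves room for the run and the safety margin. A minimal sketch, where the function names and the example margin are assumptions rather than anything the platform prescribes:

```python
from datetime import datetime, timedelta


def latest_safe_start(deadline, est_runtime, safety_margin):
    """Latest start time that still finishes before the deadline with margin."""
    return deadline - est_runtime - safety_margin


def feasible_starts(candidates, now, deadline, est_runtime, safety_margin):
    """Keep only candidate start times between now and the latest safe start."""
    cutoff = latest_safe_start(deadline, est_runtime, safety_margin)
    return [t for t in candidates if now <= t <= cutoff]
```

For a 06:00 deadline, a two-hour estimated runtime, and a 30-minute margin, the cutoff is 03:30, so a cleaner hour at 04:00 is simply not on the menu however green it looks.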

Why run the same container on both AWS Batch and GKE. Portability is not architectural vanity here — it is what gives the optimizer enough feasible cells to find a genuinely clean one. A job that can only run in eu-central-1 on AWS has one option and no carbon lever. A job that can run across three AWS regions and two GCP regions, on either backend, has a rich (region, time) search space, and on any given night one of those cells is meaningfully cleaner and often cheaper. Multi-cloud is the enabler of carbon savings, not a checkbox.

Implementation guidance

Provision with Terraform across both clouds, and treat the signal store as the first deliverable. Without a trustworthy, low-latency carbon signal, the scheduler is just a queue.

  1. The signal plane: the ingester Lambda + GKE CronJob, the DynamoDB global table and Timestream database, and the normalizer mapping each cloud region to its grid zone.
  2. The policy and scheduler services on a private GKE Autopilot cluster, with Workload Identity to GCP and an IAM-Roles-for-Service-Accounts (IRSA)-equivalent web-identity federation to AWS so the scheduler can dispatch to AWS Batch without static keys.
  3. The compute backends: AWS Batch compute environments (Fargate Spot + EC2 Spot) in each allowed region, and GKE Spot node pools — all consuming one container image.
  4. The intake API behind Akamai, with Okta-federated auth gating who may submit and who may override placement.

A minimal Terraform shape for an AWS Batch compute environment that prefers Spot and lets Batch optimize instance choice communicates the intent:

resource "aws_batch_compute_environment" "carbon_cpu_euw" {
  compute_environment_name = "carbon-cpu-eu-west-1"
  type                     = "MANAGED"
  compute_resources {
    type                = "SPOT"            # cheap + clean often coincide
    allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"  # fewer reclaims, less retry waste
    max_vcpus           = 4096
    instance_role       = aws_iam_instance_profile.batch.arn
    subnets             = var.private_subnets
    security_group_ids  = [aws_security_group.batch.id]
    instance_type       = ["optimal"]
  }
}

The job container itself carries no region affinity — the scheduler chooses the region and submits to the matching compute environment.

The scheduling decision, made concrete. The controller’s core is small and auditable. For each pending flexible job it gathers the feasible cells from the policy service, scores them, and picks the minimum:

def score_cell(cell, weights):
    # cell.carbon_norm: marginal gCO2/kWh forecast for (region, hour)
    # cell.cost_norm:   $/vCPU-hr for that region/time (spot feed)
    # both min-max normalized to 0..1 across the feasible set before scoring
    return weights.carbon * cell.carbon_norm + weights.cost * cell.cost_norm

def place(job, cells, weights):
    # SAFETY is a timedelta margin for spot reclaims and retries, defined elsewhere.
    # The cost wall compares raw $/vCPU-hr (cell.cost), not the normalized value.
    feasible = [c for c in cells
                if c.region in job.allowed_regions          # residency
                and c.cost <= job.cost_ceiling              # cost wall
                and c.start_time + job.est_runtime + SAFETY  # SLA wall
                    <= job.deadline]
    if not feasible:                       # no clean+cheap+legal cell in time
        return place_now_cheapest(job)     # fall back: meet the SLA, full stop
    return min(feasible, key=lambda c: score_cell(c, weights))

The weights are a business dial: carbon=0.7, cost=0.3 leans green, the inverse leans frugal, and the sustainability committee owns the number. Crucially, an empty feasible set never blocks a job — it falls back to run now, cheapest legal region, so the SLA is structurally protected even when the grid is dirty everywhere in the window.
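
The controller snippet above leaves the normalization, the safety margin, and the fallback to other code. One way to fill those in so the whole decision runs end to end is sketched below. The `Cell` and `Job` shapes, the 30-minute `SAFETY` value, and the fallback rule (earliest start, then cheapest, among legal regions) are assumptions made for this example, not the platform's definitive logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SAFETY = timedelta(minutes=30)  # assumed margin for spot reclaims and retries


@dataclass(frozen=True)
class Cell:
    region: str
    start_time: datetime
    carbon: float   # marginal gCO2/kWh forecast
    cost: float     # raw $/vCPU-hr


@dataclass(frozen=True)
class Job:
    allowed_regions: frozenset
    cost_ceiling: float
    est_runtime: timedelta
    deadline: datetime


def minmax(values):
    """Scale values to 0..1; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def place_now_cheapest(job, cells):
    """Fallback: earliest available start in a legal region, cheapest first."""
    legal = [c for c in cells if c.region in job.allowed_regions]
    if not legal:
        raise ValueError("no legally allowed region has capacity")
    earliest = min(c.start_time for c in legal)
    return min((c for c in legal if c.start_time == earliest), key=lambda c: c.cost)


def place(job, cells, w_carbon=0.7, w_cost=0.3):
    feasible = [c for c in cells
                if c.region in job.allowed_regions                        # residency
                and c.cost <= job.cost_ceiling                            # cost wall
                and c.start_time + job.est_runtime + SAFETY <= job.deadline]  # SLA wall
    if not feasible:
        return place_now_cheapest(job, cells)
    carbon = minmax([c.carbon for c in feasible])
    cost = minmax([c.cost for c in feasible])
    scores = [w_carbon * g + w_cost * d for g, d in zip(carbon, cost)]
    return feasible[scores.index(min(scores))]
```

Note that residency survives the fallback: even when nothing is feasible, the job is only ever placed in a region its residency class allows.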

Identity: federate the humans, kill the static keys. Human access to the control plane and the override path flows Okta → Entra ID over OIDC — engineers log in once with the company’s Okta credentials and conditional-access policies, Okta federates to Entra, and the resulting group claims decide who may submit urgent jobs (which skip shifting and cost real carbon) versus who may only submit flexible work. The scheduler authenticates to AWS via web-identity federation and to GCP via Workload Identity, so there are no long-lived cloud keys anywhere in the loop. The few residual secrets that are not federated identities — the Electricity Maps and WattTime API tokens, the cross-cloud signing keys — live in HashiCorp Vault, leased dynamically and injected at runtime, never written into a Kubernetes Secret or a CI variable.
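
The group-claim gate can be as small as the sketch below. It assumes the token's signature and expiry have already been verified upstream, and the group names and the `groups` claim key are hypothetical; the real values are whatever the identity provider is configured to emit.

```python
# Hypothetical group names; the real ones are whatever the IdP puts in the token.
SUBMIT_GROUP = "batch-submitters"
URGENT_GROUP = "batch-urgent-submitters"
OVERRIDE_GROUP = "batch-placement-overriders"


def authorize(claims: dict, flexible: bool, overrides_placement: bool) -> bool:
    """Decide whether an already-verified token may submit this job."""
    groups = set(claims.get("groups", []))
    if SUBMIT_GROUP not in groups:
        return False
    if not flexible and URGENT_GROUP not in groups:
        return False          # urgent jobs skip shifting and cost real carbon
    if overrides_placement and OVERRIDE_GROUP not in groups:
        return False
    return True
```

Keeping urgent submission behind its own group makes the carbon cost of bypassing the optimizer a deliberate, attributable choice rather than a default.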

Enterprise considerations

Security & Zero Trust. The platform is Zero Trust by construction: federated identity only, least-privilege roles scoped per cloud, no static keys, private clusters. Layer on: (a) Wiz running continuous CSPM across both AWS and GCP, alerting the moment a result bucket drifts to public or an IAM role over-grants, as the posture backstop behind the policy controls; (b) Wiz Code scanning the Terraform and the scheduler image in the pull request, so a misconfiguration is caught before it ever reaches a cloud; (c) CrowdStrike Falcon sensors on the GKE node pools and the EC2 batch AMIs for runtime threat detection feeding the SOC (batch fleets are juicy crypto-mining targets precisely because they burst huge and quiet); (d) any guardrail trip, such as a residency policy that would have placed a job in a forbidden region or a scheduler decision that risks a deadline, auto-raises a ServiceNow incident, so the on-call has a ticket, not just a log line. The carbon signal is advisory; the residency and SLA constraints are enforced, and Wiz independently verifies the controls are holding.

Cost optimization. The pleasant surprise of this architecture is that carbon-aware and cost-aware mostly agree: clean power is frequently cheap power, and Spot capacity is most abundant where demand (and price, and often marginal intensity) is low. But they diverge sometimes, and the divergence must be explicit.

Lever | Mechanism | Typical effect
--- | --- | ---
Region+time shifting | Place flexible jobs in the cleanest feasible cell | 40–70% lower marginal carbon on the deferrable share
Spot-first compute | Fargate/EC2 Spot + GKE Spot, capacity-optimized | 60–80% cheaper than on-demand for batch
Carbon/cost weight dial | Tune weights per job class | Lets the business price carbon vs dollars explicitly
Forecast-driven deferral | Wait within deadline for a cleaner hour | Cuts carbon at zero dollar cost when slack exists
Right-sized retries | Capacity-optimized allocation reduces reclaims | Less wasted recompute (which is wasted carbon and money)

The one honest caveat: pushing the carbon weight too high can route work to a clean-but-pricey region, raising the bill. Datadog dashboards plot realized cost and realized carbon against the do-nothing baseline on the same screen, so the CFO and the sustainability lead negotiate the weight with real numbers instead of slogans.

Scalability. Each plane scales independently. The signal plane is tiny and constant — a few hundred region time-series refreshed every five minutes. The scheduler is a control loop whose load tracks job submission rate, not job size, so it stays light even as the fleet bursts to tens of thousands of vCPUs. The compute backends scale on their own: AWS Batch managed scaling grows EC2/Fargate Spot to the queue depth, GKE Autopilot and Spot node pools scale pods and nodes to demand. The natural ceiling is regional Spot capacity in the clean regions — when eu-north-1 is cheapest-and-greenest, so is everyone else’s batch fleet, so plan the allowed-region set with several clean fallbacks, not one.

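Failure modes, and what each one looks like. Name them before they page you: a stale or missing carbon signal (keep running in cost-only mode), a Spot reclaim mid-job (retry inside the safety margin), a forecast miss (the job lands on a dirtier hour than predicted), Spot capacity exhausted in the clean region (fall back to the next allowed region), and an empty feasible set (run now in the cheapest legal region).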

Reliability & DR (RTO/RPO). Decide the numbers per plane. The signal store (DynamoDB global table) is multi-region by design, so its RPO is near zero. The scheduler is stateless except for the deferred-job queue, which is persisted and replayable, so a scheduler-pod loss costs seconds of RTO while Kubernetes restarts the pod, and a lost cluster is rebuilt from Git by an Argo CD sync. The compute is inherently disposable: a lost batch node just means a retried job. A pragmatic target: RTO 5 minutes, RPO near-zero for the control plane, with the explicit and deliberate degradation that if the carbon signal is unavailable the platform keeps running in cost-only mode rather than stopping. Losing the green dimension is an inconvenience; stopping the pipeline is an outage.
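
The cost-only degradation can be expressed as a guard on the weights. This is a sketch under assumptions: the 30-minute staleness threshold and the function name are invented for the example, and a real controller would also raise an alert when it degrades.

```python
from datetime import datetime, timedelta

MAX_SIGNAL_AGE = timedelta(minutes=30)  # assumed staleness threshold


def effective_weights(w_carbon, w_cost, last_signal_at, now):
    """Return (carbon, cost) weights, dropping carbon when the signal is stale."""
    if last_signal_at is None or now - last_signal_at > MAX_SIGNAL_AGE:
        return 0.0, 1.0       # cost-only mode: keep the pipeline running
    return w_carbon, w_cost
```

Because the deadline and residency checks do not depend on the carbon signal, degrading the weights changes which feasible cell wins, never whether the job runs.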

Observability. This is where the architecture earns its board mandate. Instrument the placement decision as a Datadog trace: for every job, record the chosen region, the marginal intensity at dispatch, the counterfactual intensity of the default region (so you can prove avoided emissions), the realized cost, and whether it ran flexible or urgent. Emit the metrics the business actually reports: total marginal CO2 avoided vs baseline, carbon per job / per sample / per dollar, percentage of deferrable work successfully shifted, SLA attainment (the number that must stay at 100%), and scheduler decision latency. The avoided-emissions figure, with its counterfactual baseline, is exactly what goes to the CSRD auditor — and because every decision is traced, it is defensible line by line rather than an estimated annual hand-wave. Monitors watch missed-deadline risk (a deferred job approaching its safety margin) so the on-call acts before an SLA actually slips.
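
The avoided-emissions figure is simple arithmetic once each placement carries its counterfactual. The sketch below assumes a flat energy figure per vCPU-hour, which is a placeholder and not a measured value; the record fields are likewise hypothetical, and a real pipeline would emit these numbers as metrics rather than return a dict.

```python
from dataclasses import dataclass

# Assumed average energy per vCPU-hour; a placeholder, not a measured figure.
KWH_PER_VCPU_HOUR = 0.004


@dataclass(frozen=True)
class Placement:
    vcpu_hours: float
    actual_g_per_kwh: float      # marginal intensity where the job ran
    baseline_g_per_kwh: float    # counterfactual: default region at that time
    flexible: bool
    met_deadline: bool


def avoided_grams(p: Placement) -> float:
    """Grams of CO2 avoided versus the baseline; negative if the forecast missed."""
    energy_kwh = p.vcpu_hours * KWH_PER_VCPU_HOUR
    return (p.baseline_g_per_kwh - p.actual_g_per_kwh) * energy_kwh


def summarize(placements):
    """Roll per-job records up into the headline numbers."""
    total = sum(avoided_grams(p) for p in placements)
    attained = sum(p.met_deadline for p in placements) / len(placements)
    return {"avoided_g": total, "sla_attainment": attained}
```

Letting `avoided_grams` go negative is deliberate: a defensible number has to include the nights the forecast was wrong.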

Governance. Residency-policy edits (changing a job class’s allowed-region set) pass through a ServiceNow change request, because a careless edit could send regulated data somewhere illegal; the carbon weights, by contrast, are a tunable the sustainability committee owns directly. The scheduler and policy services deploy via Argo CD GitOps, so every change to placement logic is a reviewed commit that can be reverted in one click. GitHub Actions builds and tests the portable job image and runs Wiz Code as a required gate; Terraform provisions both clouds via OIDC with no stored credentials; Ansible bakes the hardened EC2 batch AMIs and configures the (few) long-lived hosts and virtual appliances, namely the network firewall and egress-inspection appliances that sit at each cloud’s perimeter to enforce that batch traffic only talks to sanctioned endpoints. Pin the carbon-data provider and its API version explicitly so the signal’s methodology does not drift mid-reporting-period, and version the placement weights alongside the code so an auditor can see exactly what objective the fleet optimized in any given quarter.

Explicit tradeoffs

Accept these or do not build it. Carbon-aware scheduling only pays off for genuinely deferrable, relocatable work — it does nothing for your interactive web tier, and a fleet that is mostly urgent has little to shift. It adds real moving parts: a signal pipeline to keep fresh, a custom scheduler to operate, and a portable-container discipline that forbids region-pinned assumptions in your job code. The carbon signal is a forecast, so a deferred job can occasionally land on a dirtier hour than predicted — you optimize expected emissions, not guaranteed ones. Multi-region, multi-cloud portability is the enabler, but it is also genuine engineering cost: one image that runs on both AWS Batch and GKE, residency rules encoded and enforced, egress economics understood. And the data-gravity tax is real — shifting a job to a clean region is only a net win if the job’s input data is already there or cheap to reach; for data-heavy stages, the time dimension (wait for a cleaner hour in the same region) often beats the region dimension precisely because moving the data would cost more carbon than it saves.
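
The data-gravity tradeoff can be checked with a back-of-the-envelope comparison. Everything numeric here is an assumption: in particular `transfer_g_per_gb`, the carbon attributed to moving one gigabyte between regions, is a hypothetical parameter that a team would have to estimate for itself.

```python
def net_carbon_saving_g(energy_kwh, home_g_per_kwh, remote_g_per_kwh,
                        input_gb, transfer_g_per_gb):
    """Compute saving from the cleaner grid minus the carbon of moving the inputs."""
    compute_saving = (home_g_per_kwh - remote_g_per_kwh) * energy_kwh
    transfer_cost = input_gb * transfer_g_per_gb
    return compute_saving - transfer_cost


def should_relocate(energy_kwh, home_g_per_kwh, remote_g_per_kwh,
                    input_gb, transfer_g_per_gb):
    """Relocate only when the move is a net carbon win."""
    return net_carbon_saving_g(energy_kwh, home_g_per_kwh, remote_g_per_kwh,
                               input_gb, transfer_g_per_gb) > 0
```

With the same compute saving, a job with a small input set relocates and a job with ten times the input stays put and waits for a cleaner hour instead, which is the behavior the paragraph above describes.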

The alternatives, and when they win. If your batch fleet is small or your grid is already near-zero-carbon (you are entirely in Québec or the Nordics), the savings may not justify the scheduler — buy efficient instances and move on. If you cannot move data across regions for residency or gravity reasons, do time-shifting only: keep every job in its required region and merely defer it to the cleanest hour, which captures a large share of the benefit with none of the multi-region complexity. If you are a single-cloud shop, the same pattern works on AWS Batch alone or GKE alone — you lose the cross-cloud feasible cells but keep cross-region and cross-time, and the architecture above degrades gracefully to it. And if your goal is purely reporting rather than reduction, you do not need this at all — you need a carbon-accounting tool and honest RECs. This platform is for the organization that has decided to actually emit less, with auditable proof, without paying a dollar or an SLA for the privilege.

The shape of the win

For the genomics company, the payoff is not “a greener dashboard.” It is that the same nightly pipeline — the same alignment, the same variant calling, the same millions of samples — emits a large fraction less marginal CO2 because it runs in the cleanest feasible region at the cleanest feasible hour, costs less on Spot capacity that clean regions tend to have going spare, never once misses a customer’s day-fourteen report, and hands the CSRD auditor a per-decision, counterfactual-backed avoided-emissions number that survives scrutiny. That last clause is what turns a platform-engineering project into a board-level win. Everything upstream — the marginal-intensity signal, the Okta-to-Entra federation, the Vault-held provider keys, the Wiz posture scanning, the SLA-as-hard-constraint scheduler, the Datadog cost-vs-carbon traces — exists to let a CFO, a CISO, and a Chief Sustainability Officer each say yes to the same architecture. Start with time-shifting in one cloud if you must; this is where a serious carbon-aware batch platform has to land.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
