A global online-travel marketplace runs its booking funnel across AWS in Virginia, Azure in Dublin, and GCP in Singapore — partly for latency to three continents of travellers, partly because an airline-distribution partner contractually requires a non-AWS failover. On a Friday before a long weekend, a release engineer flips a flag to roll out a new fare-bundling algorithm to 5% of traffic. The flag’s config propagates to the US edge in seconds, to Singapore in ninety, and to Dublin not at all, because the regional config service there was mid-restart. For eleven minutes the same user, refreshing the page, sees two different prices depending on which POP terminated their TLS — and a few of them screenshot it. By the time anyone correlates the price-mismatch tickets to the rollout, the algorithm has also quietly pushed checkout-error rate up 4x in the cohort that did get it, and nobody wired that signal to anything that could pull the flag.
That incident is the entire reason this platform exists. A feature flag looks trivial — a boolean in a config file — until it is the control surface for revenue across three clouds and a dozen regions, at which point its three hard problems surface all at once. Consistency: every region must converge to the same flag state fast, and a partial rollout must look identical to a given user no matter which edge serves them. Latency: the flag is evaluated on the hot path of every page render, so a read cannot cost a network round trip to a central database. Safety: a flag is a production change made by someone who is not doing a deploy, with no code review and no canary by default — so the platform itself has to supply the guardrails, the audit trail, and the automatic kill switch that a code deploy would get from CI. This article is the reference architecture for a globally distributed configuration and feature-flag platform that solves all three, spanning AWS, Azure, and GCP, with progressive rollout governed by live guardrail metrics.
Why not the obvious shortcuts
Three cheaper answers get proposed on every team, and each fails in a way worth naming before someone builds it.
A config file baked into the deploy makes a flag change require a full release: minutes to hours, a pipeline run, and a rollback that is another deploy. That defeats the point of a flag, which is to decouple “ship the code” from “turn it on,” and to let you turn it off in seconds when it misbehaves.
A single central flag database that every region reads live gives you one consistent source of truth and a cross-region round trip on the hot path of every request. From Singapore to a us-east database that is roughly 230 ms of round-trip time added to every page render, and the day that database has a bad minute, every region’s funnel stalls together. A flag read must be local and must survive the source of truth being unreachable.
Each cloud’s native flag service in isolation — AWS AppConfig here, Azure App Configuration there, GCP runtime config elsewhere — means three consoles, three audit logs, three rollout semantics, and no way to say “this flag is at 25% globally.” A flag at 25% on AWS and 60% on Azure is not a 25% rollout; it is a silent inconsistency that produces exactly the two-prices-for-one-user bug.
The platform threads the needle the way mature progressive-delivery systems do: one authoritative control plane where flags are defined, targeted, and audited; a replicated data plane that pushes evaluated config to every region and to the edge so reads are local and fast; and a rollout controller that watches guardrail metrics and moves (or rolls back) the rollout itself, so safety is automatic rather than a human noticing a graph.
Architecture overview
The platform separates cleanly into a control plane (where flags are authored, targeted, approved, and audited — low traffic, strongly consistent, runs in one home region with DR) and a data plane (where flags are evaluated billions of times a day — high traffic, eventually consistent, replicated everywhere). The single most important design rule follows from that split: the data plane never blocks on the control plane. If the control plane is down, every flag keeps serving its last-known state from local cache; you simply cannot change a flag until it recovers. Conflating the two is how teams build a flag system whose outage takes down everyone who reads a flag.
Control plane — authoring and rollout, following the flow:
- An engineer or product manager opens the admin console to create or change a flag. Admin access is gated by Okta as the workforce IdP (federated to Entra ID for the staff who live in the Microsoft estate), enforcing SSO, MFA, and conditional access. Critically, it also enforces fine-grained authorization: who may touch a billing.* flag is not who may touch a ui.* flag, and production targeting changes require a second approver. The flag-management layer here is LaunchDarkly (or an equivalent), which owns flag definitions, targeting rules, segments, and rollout state.
- Every change is written to an immutable audit log (who, what, old value, new value, justification, timestamp) before it takes effect. A change to a high-blast-radius flag opens a ServiceNow change record automatically, so production config changes live in the same change-management system as everything else and an auditor can reconstruct exactly who turned what on when.
- The control plane resolves the change into evaluated config per environment and per region and hands it to the distribution layer. Targeting rules (percentages, user segments, geo, plan tier) are part of this payload so the edge can evaluate them locally without calling home.
Data plane — distribution and evaluation, the part that must always be up:
- The distribution layer streaming-pushes flag state to regional relay nodes in each cloud (AWS, Azure, GCP) over a persistent connection (SSE/streaming), so a change reaches every region in well under a second rather than waiting for a poll. Each region also has the full ruleset, not just on/off — so a 25% rollout is the same 25% of users everywhere, because the bucketing hash is deterministic and identical across regions.
- Akamai at the edge holds a copy of flag config close to users and terminates the marketplace’s traffic; for flags evaluated in the CDN/edge-worker tier (the price-display decision that started this story), evaluation happens at the POP with zero origin round trip. Origin services read from an in-process SDK that keeps flags in local memory, updated by the streaming connection — so a flag evaluation is a hash and a map lookup (microseconds), never a network call.
- The booking, pricing, and checkout services evaluate flags inline on every request. The SDK emits flag-exposure events (which user saw which variation) back through the telemetry pipeline — these are the denominator for every guardrail metric and every experiment readout.
Rollout control — the safety loop: as a flag ramps (1% → 5% → 25% → 50% → 100%), the rollout controller continuously compares the exposed cohort against the control on a set of guardrail metrics in Datadog — checkout-error rate, p95 latency, payment-decline rate, conversion. If a guardrail breaches its threshold, the controller halts the ramp and rolls the flag back to 0% automatically, then raises a ServiceNow incident. This is the loop the opening incident did not have.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Admin identity | Okta + Entra ID | SSO/MFA + fine-grained authz for who may edit which flag namespace | OIDC; SCIM provisioning; prod-change approver group; conditional access |
| Flag management | LaunchDarkly (control plane) | Flag definitions, targeting rules, segments, rollout state | One project per business domain; environments per stage; SDK keys per service |
| Audit & change | Immutable log + ServiceNow | Tamper-evident history; change record for high-blast-radius flags | Append-only store; auto-ticket on billing.*/auth.* changes |
| Distribution | Streaming relay (per cloud) | Push flag state to every region in sub-second; survive control-plane outage | Persistent SSE; regional relay in AWS/Azure/GCP; last-known-good cache |
| Edge evaluation | Akamai (edge workers) | Evaluate hot-path flags at the POP with no origin round trip | EdgeKV/edge config for flag payload; deterministic bucketing |
| In-process eval | Flag SDK (in service memory) | Microsecond local flag reads; emit exposure events | Streaming mode; bounded staleness fallback; offline-default per flag |
| Guardrail metrics | Datadog | Real-time cohort metrics that gate every ramp | Monitors per guardrail; cohort tag from exposure events; anomaly detection |
| Rollout controller | Progressive-delivery service | Ramp the flag, watch guardrails, auto-halt/rollback | Bake-time per stage; rollback to 0% on breach; SLO-aware thresholds |
| Secrets | HashiCorp Vault | SDK server keys, relay credentials, Datadog API keys | Dynamic leases; per-cloud auth method; no static keys in config |
| Cloud posture | Wiz / Wiz Code | Misconfig + exposure scanning across all three clouds; IaC scanning pre-merge | Agentless scan of relay infra; Wiz Code gate in the PR |
| Runtime security | CrowdStrike Falcon | Runtime protection on relay nodes and edge-compute hosts | Sensor on every relay VM/host; detections to the SOC |
| ITSM | ServiceNow | Change records, approvals, incident on guardrail breach | Change gate for risky flags; auto-incident from the rollout controller |
| CI/CD + IaC | GitHub Actions / Jenkins / Argo CD + Terraform / Ansible | Build relays, deploy SDKs, provision per-cloud infra, GitOps the relay config | OIDC to each cloud; Argo CD syncs relay manifests; Terraform per cloud |
A few choices carry the design and are worth the why.
Why deterministic bucketing is non-negotiable for multi-region. A percentage rollout works by hashing a stable user key (account id, or a sticky anonymous id) into a number and comparing it to the rollout percentage. If every region uses the same hash function and the same salt per flag, then user u-91823 is either in the 5% everywhere or out of it everywhere — so they get one price in every POP. Get this wrong (different salts, region-local randomness) and you reintroduce the two-prices bug at the data-plane level, invisibly. The rule: bucketing is a pure function of (flagKey, userKey), computed identically at the edge, in the SDK, and in any region.
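The bucketing rule above can be sketched in a few lines. This is a minimal illustration, not the algorithm any particular flag vendor ships: the choice of SHA-256, the `BUCKETS` granularity, and the function names `bucket` and `in_rollout` are all assumptions made for the example.

```python
import hashlib

BUCKETS = 100_000  # bucket-space size; an illustrative choice, not a vendor constant


def bucket(flag_key: str, salt: str, user_key: str) -> int:
    """Map (flag, per-flag salt, user) to a stable bucket in [0, BUCKETS)."""
    material = f"{flag_key}.{salt}.{user_key}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # The first 8 bytes of the digest are plenty for an even spread.
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(flag_key: str, salt: str, user_key: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` of the bucket space."""
    # Integer arithmetic only, so every runtime reaches the same answer.
    return bucket(flag_key, salt, user_key) * 100 < percent * BUCKETS
```

Because the function depends only on its inputs, the edge worker, the in-process SDK, and every regional relay reach the same decision for the same user. Comparing against a threshold also makes the ramp monotonic: a user inside the 5% cohort stays inside it at 25%, so nobody flips back to the old price mid-ramp.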
Why the data plane must serve from last-known-good. Each relay and each SDK keeps the last config it successfully received and serves it even if the control plane and the streaming connection are both unreachable. A flag platform that fails closed — refuses to evaluate when it can’t reach home — is worse than no flag platform, because it takes down every service that reads a flag. So each flag also declares an offline default the SDK uses if it has never received config (cold start during an outage), chosen to be the safe variation.
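A rough sketch of that read path, assuming a hypothetical `FlagStore` class and a caller-supplied `fetch` function standing in for the streaming connection (neither is a real SDK API):

```python
from typing import Callable, Dict, Optional


class FlagStore:
    """Serves flags from the last snapshot received, else from offline defaults."""

    def __init__(self, offline_defaults: Dict[str, str]) -> None:
        self._offline_defaults = dict(offline_defaults)
        self._last_known_good: Optional[Dict[str, str]] = None

    def refresh(self, fetch: Callable[[], Dict[str, str]]) -> bool:
        """Try to pull fresh config; keep the previous snapshot on any failure."""
        try:
            snapshot = fetch()
        except Exception:
            return False  # fail open: keep serving what we already have
        self._last_known_good = dict(snapshot)
        return True

    def variation(self, flag_key: str) -> str:
        """Local lookup only; this method never performs a network call."""
        if self._last_known_good is not None and flag_key in self._last_known_good:
            return self._last_known_good[flag_key]
        # Cold start during an outage: fall back to the declared safe variation.
        return self._offline_defaults[flag_key]
```

The point of the shape is that `variation` cannot fail because of the control plane: a failed refresh changes nothing about what readers see.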
Why exposure events are the backbone, not an afterthought. Every guardrail check, every experiment result, and every “who saw what” audit answer is computed from the stream of exposure events the SDKs emit. They are the denominator. Skimp on them — sample them, drop them under load — and the rollout controller is comparing cohorts it cannot actually measure, which is how a bad flag rolls past a guardrail that was silently blind.
Implementation guidance
Provision the data plane in every cloud with Terraform, GitOps the config with Argo CD. The asymmetry to internalize: the control plane lives in one home region (with DR), while the data plane is replicated to all of them. So the IaC splits in two.
- Control plane (home region, e.g. AWS us-east-1 with an Azure DR standby): the flag-management service, the immutable audit store, the rollout controller, and the ServiceNow integration. Strongly consistent, backed up, version-pinned.
- Data plane (every region in all three clouds): a regional relay deployment, the edge config push to Akamai, and the SDK configuration baked into each service. Terraform stands up the relay infra per cloud (separate state per cloud, OIDC auth, no stored keys); Ansible handles any relay-node host config and the virtual appliances some regions require — e.g. a hardened relay/forward-proxy appliance in a regulated region’s VPC that brokers the streaming connection without that region’s services egressing directly to the SaaS control plane.
- Argo CD syncs the relay manifests across clusters from a single Git repo, so “what config is in which region” is a git log, and a drifted region self-heals back to the declared state.
A minimal relay deployment shape communicates the intent — stream from the control plane, serve locally, default safe:
```yaml
# relay-config.yaml: one per region, GitOps'd via Argo CD
relay:
  streamUri: https://stream.flags.internal   # control-plane push channel
  cache:
    persist: true        # last-known-good survives restarts
    staleIfError: true   # serve cache when control plane unreachable
  region: asia-southeast1   # GCP Singapore
  cloud: gcp
flagDefaults:
  pricing.fareBundlingV2: "control"   # offline default = safe variation
  checkout.newAddressForm: "off"
```
Identity and authz: who can flip what. Admin SSO is Okta, federated to Entra ID for staff in the Microsoft estate, with MFA and conditional access enforced. Beyond authentication, the platform enforces per-namespace authorization: a pricing.* flag change requires the pricing on-call role and a second approver; a billing.* or auth.* change additionally opens a ServiceNow change record before it can take effect. The server-side SDK keys, relay credentials, and the Datadog API key the controller uses are not config-file constants — they are leased from HashiCorp Vault per cloud, short-lived, so a leaked manifest never leaks a working key.
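The per-namespace rules in this paragraph could be expressed as a small policy table. The sketch below is illustrative only: the role names, the `NamespacePolicy` type, the `may_apply` function, and the default-deny behavior for unlisted namespaces are assumptions, not part of Okta, LaunchDarkly, or ServiceNow.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional, Set


@dataclass(frozen=True)
class NamespacePolicy:
    pattern: str                 # flag-key glob, e.g. "pricing.*"
    required_role: str           # hypothetical role name carried in the SSO token
    needs_second_approver: bool
    needs_change_record: bool


POLICIES: List[NamespacePolicy] = [
    NamespacePolicy("billing.*", "billing-oncall", True, True),
    NamespacePolicy("auth.*", "auth-oncall", True, True),
    NamespacePolicy("pricing.*", "pricing-oncall", True, False),
]


def may_apply(flag_key: str, editor_roles: Set[str], editor: str,
              approver: Optional[str], change_record: Optional[str]) -> bool:
    """Return True only if every control for the flag's namespace is satisfied."""
    for policy in POLICIES:
        if fnmatchcase(flag_key, policy.pattern):
            if policy.required_role not in editor_roles:
                return False
            # The approver must exist and must be a different person.
            if policy.needs_second_approver and (approver is None or approver == editor):
                return False
            if policy.needs_change_record and not change_record:
                return False
            return True
    return False  # assumption: namespaces without a policy are denied by default
```

Keeping the rules in data rather than in code paths means the policy itself can be reviewed and audited like any other change.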
Wire the guardrails before you wire the first rollout. The rollout controller is only as good as the metrics it watches. In Datadog, define a monitor per guardrail (checkout-error rate, p95 render latency, payment-decline rate, conversion delta), each tagged so it can be sliced by the exposed cohort using the exposure events. The controller then runs a staged ramp with a bake time at each stage and an explicit rollback trigger:
```yaml
rollout:
  flag: pricing.fareBundlingV2
  stages: [1, 5, 25, 50, 100]   # percent
  bakeMinutes: 20               # observe guardrails before next stage
  guardrails:
    - datadogMonitor: checkout.error_rate.cohort
      maxRelativeIncrease: 0.10   # >10% worse than control → halt
    - datadogMonitor: render.p95.cohort
      maxRelativeIncrease: 0.15
  onBreach: rollbackToZero        # not "page a human and hope"
```
The opening incident is exactly what onBreach: rollbackToZero prevents: the 4x checkout-error spike trips checkout.error_rate.cohort, the controller pulls the flag to 0% globally within a bake window, and the price-mismatch never reaches the volume that produces screenshots.
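The decision the controller makes at the end of each bake window can be sketched as a pure function. This is a simplified illustration under stated assumptions: metric values arrive as plain dictionaries keyed by monitor name (the Datadog query itself is out of scope), and `Guardrail`, `relative_increase`, and `next_percent` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Guardrail:
    name: str                     # monitor name, e.g. "checkout.error_rate.cohort"
    max_relative_increase: float  # 0.10 means "at most 10% worse than control"


def relative_increase(control: float, cohort: float) -> float:
    """How much worse the exposed cohort is than control, as a fraction."""
    if control <= 0:
        return float("inf") if cohort > 0 else 0.0
    return (cohort - control) / control


def next_percent(stages: List[int], current: int, guardrails: List[Guardrail],
                 control: Dict[str, float], cohort: Dict[str, float]) -> int:
    """Return 0 on any breach, otherwise the next stage (or stay at the last one)."""
    for g in guardrails:
        if relative_increase(control[g.name], cohort[g.name]) > g.max_relative_increase:
            return 0  # rollbackToZero
    idx = stages.index(current)
    return stages[min(idx + 1, len(stages) - 1)]
```

A 4x checkout-error rate is a relative increase of 3.0 against a threshold of 0.10, so the function returns 0 regardless of how healthy the other guardrails look.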
Enterprise considerations
Security & Zero Trust. Access is identity-first: admin actions require Okta/Entra SSO + MFA, authorization is least-privilege per flag namespace, and every change is in an immutable audit log. The relay/edge tier is hardened and watched: Wiz runs continuous posture and exposure scanning across all three clouds — flagging a relay endpoint accidentally made public, an over-broad IAM role on a relay, or a misconfigured edge config — and Wiz Code runs as a gate in the pull request so a Terraform change that would expose a relay is caught before merge, not after Wiz finds it in prod. CrowdStrike Falcon sensors run on every relay node and edge-compute host for runtime threat detection feeding the SOC. A particular risk unique to flag platforms deserves naming: a flag is a production change, so a compromised admin account is a way to change behavior across three clouds without touching a single deploy pipeline — which is why high-blast-radius namespaces require a second approver and a ServiceNow record, turning a single stolen credential into an insufficient one.
Cost optimization. The dominant costs are the SaaS flag platform (priced per seat and per monthly context/evaluation) and the egress of exposure events and metrics. Engineer for both.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Edge/in-process eval | Evaluate locally; never call a flag service per request | Removes per-eval network cost and the SaaS per-eval charge entirely |
| Exposure-event batching | Batch and compress exposure events per SDK before egress | Cuts cross-cloud egress on the highest-volume telemetry |
| Relay fan-in | Region’s services connect to a local relay, not the SaaS directly | One streaming connection per region instead of thousands |
| Context hygiene | Use a stable, minimal user context key; avoid unbounded contexts | Controls the per-context SaaS billing dimension |
| Flag lifecycle | Retire stale flags on a cadence; Wiz/dashboards surface dead flags | Avoids paying for and reasoning about flags nobody uses |
The relay fan-in is also a resilience win, not only a cost one: thousands of services holding direct streaming connections to the SaaS is both a bill and a thundering-herd risk on reconnect.
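The exposure-event batching lever from the table might look like the following. It is a minimal sketch: `ExposureBatcher`, its `send` callback, and the `max_events` default are assumptions for illustration, and a production SDK would also flush on a timer and bound the buffer.

```python
import gzip
import json
from typing import Callable, Dict, List


class ExposureBatcher:
    """Buffers exposure events and ships them as one gzip-compressed JSON payload."""

    def __init__(self, send: Callable[[bytes], None], max_events: int = 500) -> None:
        self._send = send
        self._max_events = max_events
        self._pending: List[Dict[str, str]] = []

    def record(self, flag_key: str, user_key: str, variation: str) -> None:
        self._pending.append({"flag": flag_key, "user": user_key, "variation": variation})
        if len(self._pending) >= self._max_events:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        payload = gzip.compress(json.dumps(self._pending).encode("utf-8"))
        self._send(payload)
        # Cleared only after send returns, so a raised error keeps the events.
        self._pending = []
```

Batching reduces the number of egress requests and compression shrinks the bytes, but neither drops an event, which matters because the guardrails depend on a complete exposure stream.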
Scalability. Reads scale trivially because they are local — adding traffic adds no load on the control plane at all, only on the services already serving that traffic. The data plane scales by adding regions/relays; the control plane is low-QPS by nature (humans and the rollout controller, not the request path) and scales vertically with DR. The real scaling discipline is flag-count governance: an organization with 5,000 live flags has an unmaintainable combinatorial state space, so enforce a lifecycle that retires flags after a rollout completes, with dashboards (and Wiz Code checks for references to removed flags) to find the dead ones.
Failure modes, and what each one looks like. Name them before they page you.
- Control plane outage — flags cannot be changed, but every region keeps serving last-known-good. Mitigation: this is the designed-for case; the alert is “config is frozen,” not “site is down.” DR-failover the control plane to the standby region to restore editing.
- Streaming connection drops to one region — that region stops getting new flag states and silently diverges. Mitigation: serve last-known-good, alert on relay staleness (time since last successful update), and fall back to polling so the gap is bounded — the exact failure (Dublin mid-restart) from the opening incident.
- Non-deterministic bucketing — different regions put the same user in different cohorts; one user sees two prices. Mitigation: a single pure bucketing function with a per-flag salt, verified by a cross-region consistency test in CI.
- Guardrail blindness — exposure events sampled/dropped under load, so the controller ramps past a real regression. Mitigation: treat exposure events as critical telemetry (don’t sample them), and have the controller refuse to advance a stage if event volume for the cohort looks implausibly low.
- Flag sprawl: thousands of stale flags make behavior hard to reason about and rollouts risky. Mitigation: enforced flag lifecycle and retirement SLAs.
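The guardrail-blindness mitigation in the list above, refusing to advance when cohort event volume looks implausibly low, can be sketched as a simple ratio check. The function names, the `min_ratio` default of 0.8, and the assumption that each request evaluates the flag exactly once are illustrative choices, not measured values.

```python
def expected_exposures(total_requests: int, rollout_percent: float) -> float:
    """Exposure events we would expect if each request evaluated the flag once."""
    return total_requests * rollout_percent / 100


def volume_is_plausible(observed: int, total_requests: int,
                        rollout_percent: float, min_ratio: float = 0.8) -> bool:
    """False when observed exposure volume is too low to trust the guardrails."""
    expected = expected_exposures(total_requests, rollout_percent)
    if expected <= 0:
        return observed == 0
    return observed >= min_ratio * expected
```

With 100,000 requests at a 5% rollout the controller would expect about 5,000 exposure events. Seeing 1,200 suggests events are being dropped, so the safe move is to hold the current stage rather than read a quiet guardrail as a healthy one.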
Reliability & DR (RTO/RPO). Decide the numbers per plane, because they differ sharply. The data plane has effectively zero RTO/RPO for reads by design — last-known-good means a region serves through almost anything, and a brand-new pod uses the flag’s offline default. The control plane is where DR actually matters: replicate the flag definitions and the immutable audit log to a standby region (cross-cloud, AWS→Azure here, to honor the same non-single-cloud requirement the marketplace has elsewhere), targeting RTO 15 minutes / RPO 1 minute for the ability to edit and roll out flags. Losing the control plane for fifteen minutes is an inconvenience (no changes) rather than an outage (the site runs), which is the property that makes this architecture calm under failure.
Observability. Datadog is doing double duty: the guardrail metrics that gate rollouts, and the operational health of the platform itself. Monitor relay staleness per region (the single most important platform health metric — it catches silent divergence), streaming-connection state, exposure-event throughput, and rollout-controller actions (every halt and rollback as an event on the timeline). Emit the metrics the business cares about per flag: exposure counts per variation, guardrail deltas vs. control, and rollout state across regions on one board so an operator can see at a glance that a flag is at 25% everywhere, not 25% in one cloud and 60% in another. Every automatic rollback raises a ServiceNow incident so there is a ticket and a postmortem trigger, not just a graph nobody watched.
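Relay staleness reduces to one comparison per region. The sketch below assumes each relay reports the timestamp of its last successful stream message, keep-alives included (otherwise a quiet period with no flag changes would look stale). The `stale_regions` function, the 30-second threshold, and the region labels in the usage example are hypothetical.

```python
from typing import Dict, List


def stale_regions(last_update_epoch_s: Dict[str, float], now_epoch_s: float,
                  max_staleness_s: float = 30.0) -> List[str]:
    """Return regions whose relay has not heard from the stream recently."""
    return sorted(
        region
        for region, last_seen in last_update_epoch_s.items()
        if now_epoch_s - last_seen > max_staleness_s
    )


# Usage: the Dublin relay last heard from the stream 110 seconds ago.
last_seen = {"aws-virginia": 1000.0, "azure-dublin": 900.0, "gcp-singapore": 995.0}
flagged = stale_regions(last_seen, now_epoch_s=1010.0)
```

Emitting this as a per-region gauge lets a single monitor catch the silent-divergence case before users notice it.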
Governance and an unexpected stakeholder. Two governance touchpoints are easy to miss. First, flag changes are deploys without code review, so the platform supplies the missing controls: approvals on risky namespaces, the ServiceNow change record, and the immutable audit log are the substitute for the PR review a code change would get. Second — the unexpected one — the company’s internal enablement runs on Moodle, and “how to safely use feature flags” is a required course before an engineer is granted production flag-edit rights in Okta. Flags are powerful enough that ungoverned access is a real risk, so authorization to flip a production flag is gated on a Moodle completion the same way access to a regulated system would be. It is the cheapest guardrail in the whole architecture and prevents the most common failure: a well-meaning engineer who did not know that a 100% flag flip is a production change.
Comparison: edge-evaluated vs. central-service flags
| Dimension | Edge / in-process evaluation (this platform) | Central flag service per request |
|---|---|---|
| Read latency | Microseconds (local hash + map lookup) | A network round trip per evaluation (10s–100s ms cross-region) |
| Outage behavior | Serves last-known-good; site stays up | Flag service down ⇒ every reader degraded together |
| Cross-region consistency | Deterministic bucketing ⇒ identical cohort everywhere | Consistent only if every region hits the same central instance (latency) |
| Cost at scale | No per-eval network/SaaS charge | Per-eval cost on the highest-frequency operation in the app |
| Trade-off | Eventual consistency (sub-second), more moving parts | Strong consistency, but latency and a shared-fate outage surface |
Explicit tradeoffs
Accept these or do not build it. Local, edge-evaluated flags are eventually consistent — sub-second in the happy path, but a region with a dropped stream is briefly behind, and you mitigate the user-visible effect with deterministic bucketing rather than eliminate the staleness. The platform is more moving parts than a config file: a control plane, per-cloud relays, edge config, an exposure-event pipeline, and a rollout controller — overhead a ten-service startup does not need and a three-cloud marketplace cannot live without. Running across AWS, Azure, and GCP multiplies the operational surface — three sets of IAM, three relays, three Terraform states — which is justified only because the business already requires multi-cloud, not for its own sake. And the safety machinery (guardrails, bake times, auto-rollback) costs you rollout speed: a flag that could be at 100% in one click instead climbs through staged bake windows — slower on purpose, because the opening incident is what “fast” looks like without it.
The alternatives, and when they win. If you run in a single cloud and a single region, your cloud’s native flag service (AWS AppConfig, Azure App Configuration) is the right amount of machinery — skip the multi-cloud relay layer entirely. If you have a handful of flags and a small team, a flags-in-config-with-a-fast-deploy approach is simpler than any platform, as long as your deploy is genuinely fast and your blast radius is small. If you need flags only for internal experimentation rather than production safety, an experimentation platform without the cross-cloud data plane and auto-rollback may suffice. Graduate to this architecture when flags control revenue, span multiple clouds or regions, and are flipped by people who are not running a deploy pipeline — which is exactly when a flag stops being a boolean and becomes a control surface that needs a control plane, a replicated data plane, and a safety loop.
The shape of the win
For the travel marketplace, the payoff is not “feature flags.” It is that the next Friday-before-a-long-weekend rollout ramps to 5%, the new bundling algorithm pushes checkout-error rate up in the exposed cohort, Datadog’s guardrail trips, and the rollout controller pulls the flag to 0% across all three clouds within a bake window — before a single customer sees two prices, and before anyone is paged. The release engineer reads about it in a ServiceNow incident the controller filed on their behalf, with the audit log showing exactly what changed and the deterministic bucketing guaranteeing that whatever did ship, shipped identically everywhere. Everything upstream — the Okta-gated admin, the Vault-held keys, the per-cloud relays serving last-known-good, the Wiz Code gate, the exposure-event backbone, the Datadog guardrails — exists so that a production change made without a code review is safer than one made with it. That inversion is the architecture. Start with one cloud’s native service if that is all you need; come here when a flag flip can move revenue on three continents at once.