Internal Developer Platform on Backstage with Golden Paths

A national health-insurance carrier — 1,400 engineers across claims, member portal, provider network, and a fast-growing telehealth division — has a velocity problem that everybody can feel and nobody can name precisely. The head of engineering puts a number on it in a board deck: the median time from “we have approval to build a new service” to “that service is in production serving traffic” is six weeks, and most of those weeks are not coding. They are a developer opening a ticket for a repo, waiting four days; opening another ticket for a Kubernetes namespace and a database, waiting a week; copying a Helm chart from whatever service shipped last quarter and inheriting its three-year-old logging config; and discovering on day twelve that the security team requires a Vault path and a Wiz scan nobody told them about. Meanwhile the carrier operates across AWS for claims, Azure for the member portal (a legacy acquisition), and GCP for the telehealth analytics stack — so every one of those manual steps is subtly different per cloud, and the tribal knowledge lives in four senior engineers’ heads.

The mandate is not “buy a tool.” It is “make the right way the easy way” — a paved road where a developer who wants a new service clicks a button, answers five questions, and ten minutes later has a repo, a pipeline, an environment, observability, and an on-call rotation, all wired to the carrier’s standards by default. Under HIPAA, “by default” is the load-bearing phrase: PHI-touching workloads must inherit encryption, network policy, audit logging, and least-privilege identity without a developer ever choosing to. This article is the reference architecture for that paved road — an Internal Developer Platform (IDP) on Backstage with golden paths, spanning three clouds, that a platform team of eight can run for fourteen hundred engineers.

Why a portal, and why not the obvious alternatives

The instinct of most organizations is to fix this with documentation and goodwill, and it fails for reasons worth naming because someone will propose each one.

A wiki of “how to create a service” runbooks rots the day it is written, is ignored under deadline, and produces fourteen hundred subtly-divergent interpretations of the standard. A pile of Terraform modules in a shared repo is better — it is real code — but it assumes every developer is fluent in HCL, knows which fifteen modules to compose in which order, and will not copy last year’s version. A dedicated “platform request” queue staffed by SREs simply moves the six weeks onto the SREs, who become a ticket-driven bottleneck and burn out. And letting every team self-serve raw cloud consoles is how a HIPAA carrier ends up with an unencrypted S3 bucket of claims data on the front page of a newspaper.

The pattern that actually works is platform-as-a-product: treat the internal platform like a product with developers as customers, expose it through a single portal, and encode the standards as executable golden paths rather than prose. Backstage, the open-source developer portal that Spotify created and donated to the CNCF, is the substrate. It gives you a software catalog (every service, its owner, its dependencies, its docs, in one searchable place), software templates (the scaffolding engine that turns a five-question form into a real repo and real infrastructure), and a plugin model that pulls the rest of the toolchain (CI status, deployments, incidents, cost) into one pane of glass. The golden path is the opinionated default route from idea to production; the developer can step off it for genuinely novel needs, but stepping on it is one click and stepping off is the exception that gets a second look.

Architecture overview

[Figure: architecture of the Internal Developer Platform on Backstage with Golden Paths]

The platform has two planes that are easy to conflate and must be kept distinct in your head: a control plane — Backstage, the catalog, the templating and provisioning machinery, the policy gates — and the runtime planes, the three clouds where the services actually run. Backstage never runs the customer’s workloads; it orchestrates the creation and the visibility of them. Everything below follows one developer through a single golden path: “stand me up a new PHI-touching microservice.”

The control flow, end to end:

1. A developer opens the Backstage portal. Identity federates through Okta as the corporate IdP, brokered to Microsoft Entra ID for the Azure-resident workloads so each cloud sees a first-class token; Backstage maps the resulting OIDC identity and group claims onto catalog ownership, so the portal knows who you are and which teams you belong to. Traffic hits Akamai at the edge first (TLS termination, global anycast, and WAF/bot protection) before it ever reaches the Backstage frontend.
2. The developer picks the “PHI Microservice (Go)” software template and answers the form: service name, owning team, target cloud (defaulted by the team’s domain: claims → AWS, portal → Azure, telehealth → GCP), data-classification (defaulting to PHI), and an SLO tier. These five answers are the entire decision surface; everything else is the golden path’s opinion.
3. Backstage’s scaffolder executes the template’s steps. It (a) renders a repository from a templated skeleton (application code stub, a pre-wired GitHub Actions workflow, a Dockerfile, a Helm chart, an app-config with logging/tracing already correct); (b) creates the GitHub repo with branch protection, CODEOWNERS set to the owning team, and required-checks enabled; (c) commits a Terraform stack describing the service’s infrastructure; and (d) registers the new component back into the Backstage catalog via its catalog-info.yaml so it is discoverable from the moment it exists.
4. The committed Terraform does not apply itself from a laptop. The scaffolder opens a pull request against the platform’s infrastructure repository, and merge triggers Atlantis/Terraform running in a CI runner that holds the cloud credentials via HashiCorp Vault dynamic secrets: short-lived, per-run AWS/Azure/GCP credentials leased at apply time, never long-lived keys in a pipeline variable. Terraform provisions the per-cloud primitives: an EKS/AKS/GKE namespace with a default-deny network policy, an encrypted database (RDS/Azure SQL/Cloud SQL with CMEK), a Vault path seeded for the service’s secrets, an IAM/workload-identity binding scoped to exactly what the service needs, and the observability and security agents pre-attached.
5. Delivery is GitOps via Argo CD. The scaffolder also writes an Argo CD Application (or an ApplicationSet entry) pointing at the new service’s Helm chart in Git. Argo CD, running per-cloud or as a single hub with three target clusters, continuously reconciles the cluster to match Git, so the desired state lives in version control and a drift or a manual kubectl change is detected and reverted. The first deploy happens automatically once the infra PR merges; every subsequent deploy is a Git commit that GitHub Actions builds and Argo CD rolls out.
6. From the first minute the service is observable and governed by default. The Helm chart ships with Datadog Agent and tracing libraries wired in, so APM traces, logs, infrastructure metrics, and an auto-generated service dashboard appear without the developer configuring anything; the Backstage Datadog plugin surfaces those dashboards and SLOs right on the service’s catalog page. The ServiceNow plugin links the catalog entity to its CMDB record and incident/change history, so an on-call engineer sees deployments and tickets in one place, and a failed change can open a ServiceNow incident automatically.
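
The defaulting in the form step (the developer's five answers) can be sketched as a small pure function. This is only an illustration in Python: in a real Backstage template the defaults live in the template YAML, and the names used here (`resolve_request`, `DOMAIN_DEFAULT_CLOUD`, `ServiceRequest`) and the "bronze" default tier are assumptions, not part of Backstage or of the carrier's platform.

```python
from dataclasses import dataclass
from typing import Optional

# Domain -> default cloud, mirroring the mapping described in the form step.
DOMAIN_DEFAULT_CLOUD = {"claims": "aws", "portal": "azure", "telehealth": "gcp"}
VALID_CLOUDS = {"aws", "azure", "gcp"}


@dataclass(frozen=True)
class ServiceRequest:
    """The five answers that make up the whole decision surface."""
    name: str
    owner: str
    cloud: str
    data_classification: str
    slo_tier: str


def resolve_request(name: str, owner: str, team_domain: str,
                    cloud: Optional[str] = None,
                    data_classification: str = "phi",   # PHI unless changed
                    slo_tier: str = "bronze") -> ServiceRequest:
    """Fill in the golden path's defaults for anything left blank."""
    chosen = cloud or DOMAIN_DEFAULT_CLOUD.get(team_domain)
    if chosen not in VALID_CLOUDS:
        raise ValueError(
            f"no default cloud for domain {team_domain!r}; pick one explicitly")
    return ServiceRequest(name, owner, chosen, data_classification, slo_tier)
```

A claims team that supplies only a name and an owner ends up with an AWS-targeted, PHI-classified request; an explicit `cloud` argument overrides the domain default.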

The defining property of this topology — the one that makes a HIPAA CISO sign — is that steps 3 through 6 are not optional and not configurable by the developer. Encryption, network policy, least-privilege identity, audit logging, the security agents, and the observability stack are baked into the template and the Terraform modules. A developer cannot ship a PHI service without them because there is no path through the portal that omits them. The paved road is also the compliant road.

Component breakdown

Component	Tool	Role in the platform	Key configuration choices
Edge	Akamai	TLS, anycast, WAF, bot mitigation in front of the Backstage portal	WAF rules; origin shield to the Backstage backend; health-checked failover
Developer portal	Backstage	Catalog, software templates, plugin hub — the single front door	Custom entity processors; org-model from Okta groups; TechDocs enabled
Identity / SSO	Okta + Entra ID	Workforce SSO (Okta) federated to Entra for Azure-resident workloads	OIDC; group claims drive catalog ownership and RBAC; SCIM for org sync
Scaffolding	Backstage software templates	Turn a form into a repo + pipeline + infra PR + catalog entry	Custom scaffolder actions for Terraform PR and Argo App generation
Source + CI	GitHub + GitHub Actions	Repos with branch protection; build/test/scan/publish pipeline	Reusable workflows; OIDC to clouds (no stored keys); Wiz Code + Wiz scan steps
Provisioning	Terraform (+ Ansible)	Self-service infra: namespaces, DBs, IAM, network policy, agents	Per-cloud module library; PR-gated apply; Ansible for VM/appliance config
Delivery	Argo CD	GitOps reconciliation of Helm charts to the three clusters	ApplicationSets per cloud; sync waves; auto-prune + self-heal on
Secrets	HashiCorp Vault	Dynamic short-lived cloud creds; per-service secret paths	Per-run leases for Terraform; Kubernetes auth for workloads; no static keys
Code & cloud security	Wiz / Wiz Code	CSPM across all three clouds; IaC + code scanning in the pipeline	Wiz Code as a PR gate; agentless posture scan; attack-path alerts
Runtime security	CrowdStrike Falcon	Runtime threat detection on nodes and any VM-based appliances	Sensor in the node image / DaemonSet; detections to the SOC
Observability	Datadog	APM, logs, metrics, SLOs surfaced into Backstage	Agent in the base chart; unified service tagging; Datadog Backstage plugin
ITSM	ServiceNow	CMDB linkage, change approval, incident records on catalog entities	Backstage ServiceNow plugin; change gate for prod; auto-incident on failed deploy
Internal enablement	Moodle	Golden-path onboarding courses and HIPAA training, linked from the portal	SSO via Okta; completion status linked from the catalog’s “getting started”

A few of these choices carry the architecture and deserve the why, because they are where teams go wrong.

Why Terraform behind a pull request, not behind a “Provision” button that applies directly. It is tempting to have Backstage call terraform apply synchronously so the developer gets infra in the same ten minutes. Resist it. Direct apply means Backstage holds powerful cloud credentials and the state-changing action has no review, no plan output, and no audit artifact. The PR pattern keeps Terraform’s plan visible, lets a platform engineer (or an automated policy) approve genuinely sensitive changes, produces a Git history of every infrastructure mutation, and confines the credentials to a hardened runner using Vault dynamic leases. The developer waits a few extra minutes for a merge; the carrier gets an auditable, revertible change trail that a HIPAA auditor will ask for by name.
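
One way to picture the "automated policy" that approves routine changes and escalates sensitive ones is a check over Terraform's JSON plan (the output of `terraform show -json`), whose `resource_changes` entries carry an `address`, a `type`, and `change.actions`. The sketch below is a hedged illustration: the set of sensitive resource types and the function name `review_required` are assumptions chosen for the example, not the carrier's actual policy.

```python
import json
from typing import List

# Illustrative assumption: which resource types always need a human reviewer.
SENSITIVE_TYPES = {"aws_iam_role", "aws_iam_policy", "aws_kms_key"}


def review_required(plan_json: str) -> List[str]:
    """Return reasons a human must review; an empty list means auto-approvable."""
    plan = json.loads(plan_json)
    reasons = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions:
            # Covers plain deletes and delete/create replacements.
            reasons.append(f"{rc['address']}: destructive change {actions}")
        elif rc["type"] in SENSITIVE_TYPES and actions != ["no-op"]:
            reasons.append(f"{rc['address']}: sensitive type {rc['type']}")
    return reasons
```

Because the function returns reasons rather than a bare boolean, the same output can be posted on the pull request as the audit artifact.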

Why GitOps with Argo CD instead of kubectl from the pipeline. A push-based pipeline that runs kubectl apply works until it does not: there is no continuous reconciliation, drift accumulates silently, and the cluster’s true state is whatever the last human or pipeline did to it. Argo CD makes Git the single source of truth and the cluster a continuously-reconciled projection of it. Self-heal reverts a panicked 2 a.m. manual change; sync waves order the rollout (database migration job before app pods); and the diff between desired and live state is a first-class, reviewable object. Across three clouds this matters more, not less — one delivery model and one drift story instead of three bespoke ones.
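
The difference between push-based deploys and continuous reconciliation is easiest to see in a toy model. The sketch below is not Argo CD's implementation; it is a minimal illustration, with hypothetical names (`plan_sync`, `reconcile`), of the idea that Git's desired state always wins: missing resources are created, drifted ones are overwritten, and anything absent from Git is pruned.

```python
from typing import Dict, List, Tuple

Manifest = Dict[str, dict]  # resource name -> spec


def plan_sync(desired: Manifest, live: Manifest,
              prune: bool = True) -> List[Tuple[str, str]]:
    """Compute the actions that make `live` match `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))   # manual drift gets overwritten
    if prune:
        for name in sorted(set(live) - set(desired)):
            actions.append(("delete", name))   # not in Git, so it is removed
    return actions


def reconcile(desired: Manifest, live: Manifest) -> Manifest:
    """Apply the plan in place; afterwards `live` equals `desired`."""
    for action, name in plan_sync(desired, live):
        if action == "delete":
            del live[name]
        else:
            live[name] = dict(desired[name])
    return live
```

Running `plan_sync` again after `reconcile` yields an empty plan, which is the property a one-shot `kubectl apply` from a pipeline cannot guarantee over time.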

Why a real internal-developer-platform layer and not “just Backstage.” Backstage out of the box is a catalog and a templating engine; it is not the platform. The platform is the combination of Backstage’s portal, the golden-path templates, the curated Terraform module library, the GitOps delivery, and the policy gates. A common failure is to stand up vanilla Backstage, declare victory, and leave the templates so thin that “scaffold a service” produces an empty repo and nothing else — at which point developers are back to copying last year’s Helm chart and the six weeks return. The value is in the opinionated back half of every template: the infrastructure, the pipeline, the security, the observability that the developer never had to think about.

Implementation guidance

Start the catalog before you start the templates. The catalog is the foundation; templates and plugins all reference catalog entities. Onboard your existing services first — even imperfectly — so the org model (teams, owners, domains, systems) is real before you build paved roads on top of it. A minimal catalog-info.yaml is the contract every component honors:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: claims-eligibility-svc
  annotations:
    github.com/project-slug: carrier/claims-eligibility-svc
    datadoghq.com/service-name: claims-eligibility-svc
    servicenow.com/cmdb-ci: CI042117
spec:
  type: service
  lifecycle: production
  owner: team-claims
  system: claims-platform
  dataClassification: phi      # drives policy: PHI inherits encryption + netpol by default
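
Treating the file as a contract implies something checks it. A minimal sketch of such a check follows, operating on the already-parsed entity as a plain dictionary. Requiring `type`, `lifecycle`, and `owner` matches Backstage's Component kind; the `dataClassification` rule and its allowed values other than `phi` are assumptions made for this example, as is the function name `validate_component`.

```python
from typing import List

REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")
# Assumed vocabulary; only "phi" appears in the example entity above.
ALLOWED_CLASSIFICATIONS = {"phi", "internal", "public"}


def validate_component(entity: dict) -> List[str]:
    """Return a list of problems; an empty list means the contract is honored."""
    problems = []
    if entity.get("kind") != "Component":
        problems.append("kind must be Component")
    if not entity.get("metadata", {}).get("name"):
        problems.append("metadata.name is required")
    spec = entity.get("spec", {})
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")
    # Carrier-specific rule (assumption): classification must be declared.
    if spec.get("dataClassification") not in ALLOWED_CLASSIFICATIONS:
        problems.append("spec.dataClassification must be phi, internal, or public")
    return problems
```

A check like this could run in CI on every repository so that an entity without an owner or a classification never reaches the catalog.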

The golden-path template is where the opinions live. A Backstage software template is a YAML definition with a parameter form and an ordered list of scaffolder steps. The form should be short — five fields, sane defaults — and the steps should do the heavy lifting the developer used to do by hand. The structure below shows the intent (abbreviated):

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata: { name: phi-microservice-go, title: "PHI Microservice (Go)" }
spec:
  parameters:
    - title: Service
      properties:
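        name:  { type: string }
        owner: { type: string, "ui:field": OwnerPicker }   # key quoted so the inner colon parses safely
        cloud: { type: string, enum: [aws, azure, gcp] }   # defaulted by team domain
        dataClassification: { type: string, default: phi } # fifth form field; PHI unless changed
        sloTier: { type: string, enum: [bronze, silver, gold] }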
  steps:
    - id: fetch        # render repo skeleton: code, Dockerfile, Helm, app-config, CI
      action: fetch:template
    - id: publish      # create GitHub repo with branch protection + CODEOWNERS
      action: publish:github
    - id: infra-pr     # open a Terraform PR in the infra repo (custom action)
      action: carrier:terraform:pr
    - id: argo         # write the Argo CD Application for GitOps delivery
      action: carrier:argocd:app
    - id: register     # add the new component to the Backstage catalog
      action: catalog:register

The two custom actions — carrier:terraform:pr and carrier:argocd:app — are the platform team’s own code, and they are what separate a toy from a platform. The Terraform action selects the right per-cloud module set based on cloud and stamps the data-classification so a phi service automatically composes the encryption, network-policy, and audit-logging modules. The Argo action writes a delivery manifest into the GitOps repo. Everything downstream is standard Git mechanics.
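
The selection logic inside the Terraform action might look roughly like the following. Real Backstage custom actions are written in TypeScript against the scaffolder's action API; Python is used here only to show the decision itself. The module names and the `modules/<cloud>/<name>` path layout are hypothetical, chosen to mirror the primitives listed earlier.

```python
from typing import List

# Hypothetical module names; every service gets these regardless of input.
BASE_MODULES = ["namespace", "workload-identity", "vault-path",
                "observability-agents"]
# Added automatically when the service is classified as PHI.
PHI_MODULES = ["encryption-cmek", "default-deny-netpol", "audit-logging"]
DATABASE_BY_CLOUD = {"aws": "rds", "azure": "azure-sql", "gcp": "cloud-sql"}


def select_modules(cloud: str, data_classification: str) -> List[str]:
    """Pick the per-cloud Terraform modules for one new service."""
    if cloud not in DATABASE_BY_CLOUD:
        raise ValueError(f"unsupported cloud: {cloud!r}")
    names = BASE_MODULES + [DATABASE_BY_CLOUD[cloud]]
    if data_classification == "phi":
        names += PHI_MODULES   # the developer cannot opt out of these
    return [f"modules/{cloud}/{name}" for name in names]
```

The point of the design is that the PHI controls are a consequence of one form answer, not a checklist the developer assembles.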

CI: the pipeline is part of the golden path, not a developer’s afterthought. The rendered repo ships with a GitHub Actions workflow built from reusable, centrally-owned workflows so the platform team can roll a fix to all 1,400 services by updating one file. That pipeline authenticates to each cloud via OIDC federation (no stored cloud keys to leak), runs build and unit tests, and — critically for a HIPAA shop — runs Wiz Code as a required gate: scanning the application code, dependencies, and the IaC for vulnerabilities, secrets, and misconfigurations before anything merges, with the build failing on a policy violation. Ansible handles the cases Terraform should not: configuring the handful of virtual appliances the carrier still runs (a legacy fax-to-EDI gateway for provider claims, network appliances in the Azure landing zone) with idempotent, version-controlled playbooks rather than hand-edits.
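
"The build fails on a policy violation" reduces to a small decision over scan findings. The sketch below assumes a simplified, hypothetical findings format (a `kind` and a `severity` per finding); it is not Wiz's output schema, and the threshold and the rule that any leaked secret blocks are illustrative choices.

```python
from typing import Iterable, List

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def blocking_findings(findings: Iterable[dict],
                      fail_at: str = "high") -> List[dict]:
    """Findings that must stop the merge: any secret, or severity >= fail_at."""
    threshold = SEVERITY_RANK[fail_at]
    return [
        f for f in findings
        if f["kind"] == "secret" or SEVERITY_RANK[f["severity"]] >= threshold
    ]


def exit_code(findings: Iterable[dict], fail_at: str = "high") -> int:
    """Process exit code for the CI step: non-zero fails the required check."""
    return 1 if blocking_findings(findings, fail_at) else 0
```

Keeping the threshold in the centrally-owned workflow, rather than in each repository, is what lets the platform team tighten the gate for every service at once.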

Enterprise considerations

Security & Zero Trust. The platform is Zero Trust by construction and, just as importantly, by default. Identity is federated (Okta → Entra) and every workload runs under a least-privilege, scoped identity that Terraform provisions, never a shared key. Secrets are HashiCorp Vault dynamic leases: the Terraform runner gets per-apply cloud credentials that expire in minutes, and each service authenticates to Vault via its Kubernetes service account to fetch its own secrets, so a compromised pod cannot read another service’s data. Layer on top: (a) Wiz running continuous CSPM across all three clouds plus Wiz Code as the PR-time gate, so a misconfiguration is caught in the pull request; if one slips through, posture scanning and attack-path analysis flag it in the live cloud, and the agentless scan independently verifies that the golden path’s controls are actually holding; (b) CrowdStrike Falcon sensors baked into the node image (and on the VM-based appliances Ansible manages) for runtime threat detection feeding the SOC; (c) a failed security gate or a runtime detection auto-raises a ServiceNow incident so security has a ticket, not just a log line. The single highest-leverage control is the one already described: because the golden path is the only paved route, the encryption, the default-deny network policy, the audit logging, and the scoped identity are not things a developer remembers to add; they are things a developer cannot remove.

Cost optimization. A self-service platform’s danger is that “click to provision” makes it trivial to spin up cloud spend, so the platform must make cost visible and accountable from day one.

Lever	Mechanism	Typical effect
Cost surfaced in the portal	Backstage cost plugin / Datadog Cloud Cost on each service’s catalog page	Owners see their spend where they already look
Right-sizing in the template	SLO tier (bronze/silver/gold) maps to default replica counts and instance sizes	Stops every dev from picking “gold” by reflex
Ephemeral preview envs	Argo CD ApplicationSets spin up per-PR namespaces, torn down on merge	Avoids long-lived idle staging environments
Golden AMIs/base images	Shared, scanned base images instead of per-team bespoke ones	Less drift, smaller images, fewer redundant builds
Showback by team	Unified Datadog tagging (team, system, dataClassification) drives a chargeback view	Each domain owns its cloud bill

Unified tagging is the keystone: because every resource the golden path creates is stamped with team, system, and dataClassification, Datadog can produce the per-team showback dashboard the CFO sees, and finance stops guessing which cloud line belongs to whom.
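
The showback roll-up itself is simple once the tags are reliable. The sketch below assumes cost line items arrive as dictionaries with a `cost_usd` amount and a `tags` mapping; that shape and the `UNATTRIBUTED` bucket are assumptions for illustration, not Datadog's export format.

```python
from collections import defaultdict
from typing import Dict, Iterable

REQUIRED_TAGS = ("team", "system", "dataClassification")


def showback(line_items: Iterable[dict]) -> Dict[str, float]:
    """Sum cost per team; anything missing a required tag is called out."""
    totals = defaultdict(float)
    for item in line_items:
        tags = item.get("tags", {})
        complete = all(tags.get(t) for t in REQUIRED_TAGS)
        key = tags["team"] if complete else "UNATTRIBUTED"
        totals[key] += item["cost_usd"]
    return {team: round(total, 2) for team, total in totals.items()}
```

The size of the unattributed bucket is itself a useful signal: it measures how much spend was created outside the golden path.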

Scalability — of the platform, not just the services. Two different scaling stories matter here. The runtime services scale per cloud as normal (HPA/cluster-autoscaler on EKS/AKS/GKE). The more interesting axis is scaling the platform to 1,400 engineers: Backstage itself runs as a horizontally-scaled backend behind Akamai with Postgres for its catalog; the scaffolder work is asynchronous so a burst of “new service” requests queues rather than blocks; and Argo CD is sharded — either an instance per cloud or a hub with application-controller sharding — because a single Argo controller reconciling thousands of applications across three clusters becomes the bottleneck long before the clusters do. The catalog ingestion (processing every catalog-info.yaml) is the other thing that strains at scale; tune its refresh cadence and use a GitHub org-level discovery provider rather than registering repos one by one.

Failure modes, and what each one looks like. Name them before they page the eight-person platform team.

The scaffolder fails halfway — the repo got created but the Terraform PR step errored, leaving an orphan repo and a confused developer. Mitigation: make scaffolder steps idempotent and surface a clear “what succeeded / what to retry” status; treat partial creation as the common case, not the exception.
Argo CD drift fight — a developer makes a manual kubectl change for a “quick fix”; Argo’s self-heal reverts it; the developer re-applies; a loop ensues. Mitigation: make Git the only write path culturally and technically (RBAC that denies direct cluster writes for app teams), and surface the drift in Backstage so it is visible, not mysterious.
A golden-path template change breaks new services silently — a bad edit to the shared template renders broken repos for everyone who scaffolds after it. Mitigation: version templates, test them in CI by actually scaffolding a throwaway service, and roll changes through a canary team first.
Vault unavailable at apply time — the Terraform runner cannot lease cloud credentials, so all provisioning halts. Mitigation: Vault HA with a clear, time-bounded fallback, and a circuit-breaker that fails the apply loudly rather than retrying into a stuck state.
Plugin sprawl degrades the portal — a slow Datadog or ServiceNow plugin call makes catalog pages crawl. Mitigation: cache plugin data backend-side, set timeouts, and degrade gracefully so one slow integration does not take the portal down.
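
The first mitigation, idempotent steps with a clear "what succeeded / what to retry" status, can be sketched as a resumable step runner. This is an illustration under stated assumptions rather than how Backstage's scaffolder works internally: the `done` set stands in for whatever durable record the platform keeps of completed steps, and `run_steps` is a hypothetical name.

```python
from typing import Callable, Dict, List, Set, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step], done: Set[str]) -> Dict[str, List[str]]:
    """Run steps in order, skipping those already recorded in `done`.

    Stops at the first failure and reports what succeeded and what to retry,
    so a second run resumes instead of recreating the repository.
    """
    report: Dict[str, List[str]] = {"succeeded": [], "skipped": [], "to_retry": []}
    for index, (step_id, action) in enumerate(steps):
        if step_id in done:
            report["skipped"].append(step_id)
            continue
        try:
            action()
        except Exception:
            # Everything from the failed step onward still needs to run.
            report["to_retry"] = [sid for sid, _ in steps[index:] if sid not in done]
            break
        done.add(step_id)
        report["succeeded"].append(step_id)
    return report
```

On a retry, the already-published repository is skipped and only the Terraform PR and later steps run, which is what turns an orphan repo into a recoverable state.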

Reliability & DR. Decide the numbers per plane. The Backstage control plane is important but not in the request path of the running services — if the portal is down for an hour, no member-facing traffic is affected; developers just cannot scaffold. So a pragmatic target is RTO 1 hour, RPO 15 minutes for the portal (Postgres backups + redeploy from its own GitOps definition), while the runtime services carry their own per-tier SLAs independent of the platform. The crucial design property is that Argo CD and the GitOps repos are the disaster-recovery mechanism for the workloads: because desired state lives entirely in Git, recovering a lost cluster is “point a fresh Argo CD at the repo and let it reconcile,” not a frantic manual rebuild. Akamai health checks drive edge failover for the portal itself.

Observability and governance. Instrument the golden path itself, not only the services it produces: track time-to-first-deploy (idea → production), scaffolds per week, golden-path adoption rate (what fraction of new services went through the paved road versus around it), and template failure rate — these are the platform-as-a-product KPIs that tell you whether the road is actually paved or just painted. Datadog carries the service-level telemetry into Backstage; ServiceNow carries the change and incident governance, with a change gate before any production deploy of a high-tier service and an auto-opened incident on a failed rollout. Governance lives in version control: golden-path templates, the Terraform module library, and the reusable CI workflows are all reviewed, versioned, and instantly revertible, so “the standard” is a Git history, not an opinion. New developers are onboarded through Moodle courses — golden-path walkthroughs and mandatory HIPAA training — linked directly from each service’s “getting started” page in the catalog and gated by the same Okta SSO, so enablement is part of the platform rather than a separate forgotten wiki.
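
Three of the platform KPIs named above are straightforward to compute once the events are recorded. The sketch below assumes two hypothetical record shapes: one per service (`approved_at`, `first_prod_deploy`, `via_golden_path`) and one per scaffolder run (`ok`). These field names are assumptions for the example; in practice the data would come from the catalog, CI, and scaffolder task history.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Optional


def platform_kpis(services: Iterable[dict], scaffold_runs: Iterable[dict]) -> dict:
    """Median time-to-first-deploy (hours), adoption rate, template failure rate."""
    services = list(services)
    runs = list(scaffold_runs)
    lead_times_h = [
        (s["first_prod_deploy"] - s["approved_at"]).total_seconds() / 3600
        for s in services
        if s.get("first_prod_deploy")          # ignore services not yet in prod
    ]
    adoption: Optional[float] = (
        sum(1 for s in services if s["via_golden_path"]) / len(services)
        if services else None
    )
    failure_rate: Optional[float] = (
        sum(1 for r in runs if not r["ok"]) / len(runs) if runs else None
    )
    return {
        "median_time_to_first_deploy_h": median(lead_times_h) if lead_times_h else None,
        "golden_path_adoption": adoption,
        "template_failure_rate": failure_rate,
    }
```

Using the median rather than the mean keeps one long-stalled service from hiding an otherwise healthy trend, which matches how the six-week figure was framed at the outset.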

Explicit tradeoffs

Accept these or do not build it. An IDP is a product, which means a standing team and a roadmap forever — there is no “ship it and walk away.” Build it with two people as a side project and you get vanilla Backstage with empty templates, which is worse than nothing because it carries the maintenance cost without the paved-road payoff. The opinionated golden path that makes the common case effortless also makes the uncommon case harder: a team with a genuinely novel need now has to either extend the platform or get an explicit exception, and if you make exceptions too painful, teams route around the platform entirely and you are back to fourteen hundred snowflakes. The multi-cloud surface multiplies the work — three sets of Terraform modules, three identity integrations, three Argo targets — and the honest move is to ask whether you truly need three clouds or whether the Azure-acquisition workloads should migrate so the platform spans two. And the PR-gated, GitOps-everything discipline that gives you auditability and drift control costs you the raw speed of “click and it is live” — a few minutes per provision and a cultural insistence that nobody touches the cluster directly.

The alternatives, and when they win. If you are a thirty-engineer single-cloud startup, this is wild over-engineering — a well-templated cookiecutter, a shared Terraform repo, and a Slack channel will serve you for years; build the portal when the coordination cost of not having one exceeds the cost of running one. If you want the paved-road outcome without operating Backstage yourself, a managed/commercial IDP (a hosted Backstage distribution or a turnkey product) trades flexibility and cost for a smaller platform team — a reasonable choice when your toolchain is mainstream and your differentiation is not your platform. And if your pain is purely delivery and not discovery or scaffolding, you may only need the Argo CD + Terraform + Vault spine and can skip the portal until catalog sprawl actually hurts. Backstage earns its keep precisely when the org is large enough that finding services, standardizing their creation, and governing them centrally are each real problems — which, at 1,400 engineers across three clouds under HIPAA, they unambiguously are.

The shape of the win

For the carrier, the payoff is not “a developer portal.” It is that the six-week median from approval to production collapses to under a day, and — the part that funds the platform — that every service created through the paved road is encrypted, network-isolated, least-privilege, audit-logged, observable, and on-call-ready the moment it exists, because there is no path through the portal that produces anything less. A new telehealth service on GCP and a new claims service on AWS now look the same to the developer, inherit the same HIPAA controls, and show up in the same catalog with the same Datadog dashboards and the same ServiceNow change history. The platform team of eight stops being a ticket queue and becomes the team that paves roads. Everything upstream — the Backstage catalog, the golden-path templates, the Terraform module library, the Vault-leased credentials, the Wiz Code gate, the Argo CD reconciliation, the Datadog and ServiceNow plugins — exists so that the right way is the only way that is also the easy way. Start with the catalog and one golden path if you must; but this is where a large, regulated, multi-cloud engineering organization’s developer experience has to land.

Written by Vinod
