A national EdTech provider runs Moodle for 1.4 million students across a dozen state education boards, and the platform team has a problem that has nothing to do with Moodle and everything to do with how they build the ground it stands on. They have four engineers, forty environments (each board gets isolated dev/staging/prod, plus shared platform tooling), two clouds because two boards mandated data residency on a domestic provider, and an audit obligation: every exam-season scale-up has to be reproducible, reviewed, and traceable back to a commit. The obligation dates from the year the autoscaling was hand-edited in the console during a results-day surge and a misconfigured security group left a grades database reachable for nine hours. The mandate from the new head of platform is blunt: everything that touches infrastructure goes through code, through review, through a pipeline, with no exceptions and no console. The question that has stalled them for a month is not whether to use infrastructure as code. It is which tool: someone proposed Terraform, someone else swears by Pulumi, the senior SRE wants Ansible for the VM fleet, and a blog post convinced a junior that Terragrunt fixes everything. This article is the decision framework that ends that argument.
The trap here is treating “IaC tool” as a single choice with one winner. It is not. These four tools occupy different points on two axes — what they describe (provisioning cloud resources vs configuring what runs inside them) and how they describe it (declarative desired-state vs imperative steps) — and the mature answer for a team like this is usually a small, deliberate combination, not a monoculture. The skill is knowing which tool owns which job, and refusing to use a tool outside the job it is good at.
The two axes that actually decide this
Before naming tools, name the distinctions, because every real disagreement traces back to one of them.
Declarative vs imperative. A declarative tool takes a description of the desired end state and figures out the diff to get there — you say “this VNet, these three subnets, this database tier,” and the tool computes what to create, change, or destroy. An imperative tool runs steps in order — “install this package, write this file, restart this service.” Declarative is reproducible and converges to the same state regardless of where you started; imperative is procedural and depends on order. Provisioning cloud resources is naturally declarative. Configuring the inside of a server — packages, files, services, OS tuning — is often more naturally imperative.
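To make the declarative idea concrete, here is a minimal sketch in plain Python of what "figures out the diff" means. It is not how Terraform or Pulumi are implemented; the resource names and the `compute_plan` / `apply_plan` helpers are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def compute_plan(desired: Dict[str, Resource],
                 actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs that move `actual` to `desired`."""
    plan: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            plan.append(("create", name))
        elif desired[name] != actual[name]:
            plan.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            plan.append(("delete", name))
    return plan


def apply_plan(plan: List[Tuple[str, str]],
               desired: Dict[str, Resource],
               actual: Dict[str, Resource]) -> Dict[str, Resource]:
    """Apply the plan to a copy of `actual` and return the new reality."""
    result = dict(actual)
    for action, name in plan:
        if action == "delete":
            del result[name]
        else:  # create and update both end at the desired definition
            result[name] = dict(desired[name])
    return result


# Hypothetical desired state (the code) and actual state (the cloud).
desired = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "subnet-a": {"cidr": "10.0.1.0/24"},
    "db": {"tier": "standard"},
}
actual = {
    "vnet": {"cidr": "10.0.0.0/16"},
    "db": {"tier": "basic"},
    "old-vm": {"size": "small"},
}

plan = compute_plan(desired, actual)
converged = apply_plan(plan, desired, actual)
```

The point of the sketch is the second run: once reality matches the description, computing the plan again yields nothing to do, whatever the starting point was. An imperative script has no equivalent of that empty plan.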
State model. A declarative tool needs to know what it already created so it can compute the next diff. Terraform and Pulumi keep an explicit state file — a recorded inventory of managed resources mapped to real cloud IDs. That state is powerful (it enables plan, drift detection, dependency graphs) and dangerous (it can drift from reality, it must be stored securely because it contains resource metadata and sometimes secrets, and concurrent writes corrupt it without locking). Ansible is largely stateless — it inspects the target at run time, decides what is out of compliance, and fixes it, holding no persistent record between runs. Stateless is simpler and harder to corrupt; stateful is what gives you a true plan-before-apply and drift detection.
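The locking point can be sketched in a few lines of Python. This is an illustrative in-memory model, not any real backend's API: the `StateStore` class, its `serial` counter, and the run names are assumptions chosen to show why a second writer must be refused rather than allowed to overwrite.

```python
from typing import Dict, Optional


class StateLockedError(RuntimeError):
    """Raised when a run touches state that it does not hold the lock for."""


class StateStore:
    """Toy state backend: one lock holder at a time, serial bumps on write."""

    def __init__(self) -> None:
        self.serial = 0
        self.resources: Dict[str, str] = {}
        self._holder: Optional[str] = None

    def lock(self, who: str) -> None:
        if self._holder is not None:
            raise StateLockedError(f"state is locked by {self._holder}")
        self._holder = who

    def unlock(self, who: str) -> None:
        if self._holder == who:
            self._holder = None

    def write(self, who: str, resources: Dict[str, str]) -> None:
        if self._holder != who:
            raise StateLockedError(f"{who} does not hold the lock")
        self.resources = dict(resources)
        self.serial += 1


store = StateStore()
store.lock("pipeline-run-1")

# A second pipeline run arriving mid-apply is refused instead of racing.
try:
    store.lock("pipeline-run-2")
    second_run_blocked = False
except StateLockedError:
    second_run_blocked = True

store.write("pipeline-run-1", {"db": "db-0a1b"})  # hypothetical cloud ID
store.unlock("pipeline-run-1")
```

A stateless tool never needs this machinery, which is the simplicity the paragraph above describes. It also has no recorded inventory to plan against.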
Hold those two axes in your head and the four tools sort themselves cleanly.
The four tools, by the job each one owns
Terraform — declarative cloud provisioning in HCL. This is the default substrate for standing up cloud resources: networks, managed databases, Kubernetes clusters, load balancers, IAM. You write HCL (HashiCorp Configuration Language) describing desired state; Terraform builds a dependency graph, shows you a plan (the exact diff it intends to apply), and on apply reconciles reality to the config, recording everything in state. Its decisive advantage is the provider ecosystem — thousands of providers covering every major cloud and SaaS, all driven by the same workflow. For the EdTech team, Terraform is what creates the per-board VPCs/VNets, the managed Postgres for Moodle, the object storage for course content, and the Kubernetes clusters — identically, on both clouds, from reviewed code.
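The dependency graph is what lets a declarative tool decide ordering for you. A rough sketch of the idea, using Python’s standard-library `graphlib` rather than anything Terraform-specific; the resource names and edges below are hypothetical.

```python
from graphlib import TopologicalSorter

# resource -> the resources it depends on (hypothetical example graph)
depends_on = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "postgres": {"subnet", "security_group"},
    "k8s_cluster": {"subnet"},
}

# Dependencies come before dependents when creating...
create_order = list(TopologicalSorter(depends_on).static_order())
# ...and the reverse order is safe for tearing things down.
destroy_order = list(reversed(create_order))
```

Nobody writes "create the VPC first" anywhere. The order falls out of the references between resources, which is why reordering blocks in a declarative config changes nothing.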
Terragrunt — a thin DRY wrapper around Terraform for many environments. Terragrunt is not a separate IaC engine; it is a wrapper that calls Terraform and solves the specific pain of running the same Terraform across many environments without copy-paste. Forty environments in plain Terraform means forty near-identical sets of backend config, provider config, and variable wiring — and the day you need to change the state-bucket naming convention, you edit it forty times. Terragrunt lets you define that boilerplate once and generate it per environment, keep each environment’s inputs in a small file, and run a change across a whole tree of environments. You only reach for Terragrunt once the environment count makes Terraform’s own repetition genuinely painful — which is exactly this team’s situation, and would be over-engineering for a shop with three environments.
Ansible — agentless, procedural configuration management. Ansible owns the inside of machines and the day-2 procedural work Terraform is bad at: install and patch packages, lay down config files, manage services, run an ordered upgrade, orchestrate a rolling restart. It is agentless (it connects over SSH/WinRM — nothing to pre-install on targets), describes work as ordered playbooks (YAML), and is built to be idempotent (re-running converges rather than duplicating), but the model is fundamentally procedural and largely stateless. For the EdTech fleet, Ansible is what hardens the Moodle application VMs to the CIS benchmark, installs the PHP/Moodle stack and cron jobs on any board still on VMs rather than containers, applies emergency OS patches across the fleet when a CVE lands, and configures the virtual appliances — the third-party WAF and load-balancer appliances that ship as VM images and expose no Terraform provider, so they have to be driven over SSH/API by playbooks.
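Idempotence is the property that makes procedural configuration safe to re-run. The sketch below imitates, in plain Python, the "ensure this line is present and report whether anything changed" behaviour that configuration-management tasks typically have. The `ensure_line` helper and the `php.ini` setting are hypothetical and are not Ansible code.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already compliant: do nothing, report "ok"
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True  # fixed: report "changed"


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "php.ini"
    first_run = ensure_line(cfg, "memory_limit = 512M")   # file is created
    second_run = ensure_line(cfg, "memory_limit = 512M")  # nothing to do
    final_text = cfg.read_text()
```

Note that no record survives between the two runs: the second call inspects the target, finds it compliant, and leaves it alone. That is the stateless convergence described above.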
Pulumi — declarative provisioning in real programming languages. Pulumi covers the same job as Terraform — declarative cloud provisioning with an explicit state model — but you write it in TypeScript, Python, Go, or C# instead of HCL. The payoff is real programming-language power: loops, conditionals, functions, classes, unit tests, and your IDE’s autocomplete and type-checking, plus the ability to build genuine reusable abstractions (a typed MoodleEnvironment component you instantiate forty times). The cost is that you now carry general-purpose-language complexity into your infrastructure, your reviewers must read code rather than declarative config, and the talent pool that knows your IaC shrinks to people who know both the cloud and that language. Pulumi wins when your team is software-engineering-heavy and your infrastructure has genuinely complex logic that HCL expresses awkwardly; it is a harder sell when ops engineers, not application developers, own the platform.
Architecture overview
The reference shape for the EdTech team is a layered pipeline where each tool does the one job it is best at, and a single source of identity, secrets, security scanning, and ITSM wraps all of them. Follow the control flow from a commit to a running, configured environment.
- An engineer opens a pull request against the infrastructure monorepo. Authentication to every system in this pipeline — the Git host, the CI runners, the cloud accounts — is brokered through Okta as the workforce IdP, federated to Entra ID where Azure resources need a first-class token, so there is one identity plane and no tool holds its own user database.
- The PR triggers GitHub Actions (and, for the boards still on the team’s older self-hosted estate, a Jenkins pipeline that mirrors the same stages). The pipeline authenticates to the clouds via OIDC federation — short-lived tokens, no stored cloud keys — and pulls any unavoidable secrets (a virtual-appliance admin credential, a third-party API token) from HashiCorp Vault with dynamic, short-TTL leases rather than baking them into the repo or the CI config.
- Provisioning layer: Terraform, orchestrated by Terragrunt. Terragrunt runs Terraform across the affected environments, generating each environment’s backend and provider config from a single shared definition and feeding per-board inputs. Terraform produces a plan for every environment, which is posted to the PR for human review. State for all forty environments lives in a locked remote backend (one state object per environment, isolated so a bad apply to one board cannot touch another).
- Policy and security gates. Before any apply, Wiz Code scans the Terraform/Pulumi in the PR for misconfigurations (a public storage bucket, an over-broad security group, an unencrypted database) and blocks the merge if it finds the class of mistake that caused the nine-hour exposure. Post-deploy, Wiz continuously scans the live cloud posture so configuration drift or a console hot-fix that bypassed the pipeline surfaces as an alert, not as next year’s incident.
- Configuration layer: Ansible. Once Terraform has created the VMs and appliances, Ansible playbooks (triggered by the same pipeline, using Terraform’s output as a dynamic inventory) configure the inside: harden the OS, install the Moodle/PHP stack on VM-based boards, configure the virtual WAF appliances, lay down cron and backup jobs. This is the imperative, day-2 work that Terraform deliberately does not do.
- Runtime and operations. On the running fleet, CrowdStrike Falcon provides runtime threat detection on every node and appliance VM, feeding the SOC. Dynatrace (with Datadog on the boards that standardized on it before the consolidation) instruments the platform end to end — and, crucially, watches the pipeline itself, so a Terraform apply that regresses latency or a drift-remediation run shows up on a dashboard. Any failed apply, blocked policy gate, or detected drift auto-raises a ServiceNow change/incident record, giving audit the documented trail the mandate requires. At the edge, Akamai terminates TLS and provides WAF/DDoS protection for results-day surges — itself provisioned as code through its Terraform provider, so even the CDN config lives in the same reviewed pipeline.
The shape’s discipline is the point: Terraform/Pulumi provision, Ansible configures, Terragrunt keeps the forty-way repetition DRY, and one identity/secrets/security/ITSM spine governs all of it. No tool reaches outside its job.
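The hand-off between the provisioning and configuration layers is the least obvious step, so here is one way it could look. The sketch converts the JSON printed by `terraform output -json` into the JSON structure Ansible accepts from an inventory script. The output names and the convention that any output ending in `_ips` becomes a host group are assumptions made for this example, not something the team’s pipeline is known to use.

```python
import json


def inventory_from_outputs(outputs_json: str) -> dict:
    """Turn `terraform output -json` text into an Ansible inventory dict."""
    outputs = json.loads(outputs_json)
    inventory: dict = {"_meta": {"hostvars": {}}}
    for name, out in sorted(outputs.items()):
        if not name.endswith("_ips"):  # assumed naming convention
            continue
        group = name[: -len("_ips")]
        inventory[group] = {"hosts": list(out["value"])}
        for host in out["value"]:
            inventory["_meta"]["hostvars"][host] = {"tf_output": name}
    return inventory


# Hypothetical outputs, in the shape `terraform output -json` produces.
sample = json.dumps({
    "moodle_app_ips": {"sensitive": False, "type": ["list", "string"],
                       "value": ["10.0.1.10", "10.0.1.11"]},
    "waf_appliance_ips": {"sensitive": False, "type": ["list", "string"],
                          "value": ["10.0.2.5"]},
    "db_endpoint": {"sensitive": False, "type": "string",
                    "value": "db.internal"},
})

inventory = inventory_from_outputs(sample)
```

The design choice worth noting is direction: Terraform stays the source of truth for which machines exist, and Ansible only reads that list. Neither tool writes into the other’s domain.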
The decision table
This is the artifact that ends the team’s argument. Map the job to the tool, not the tool to a preference.
| If the job is… | Reach for | Because | Not because the others “can’t” |
|---|---|---|---|
| Standing up cloud resources (network, DB, K8s, IAM) declaratively | Terraform | Largest provider ecosystem; plan/state/graph; team-standard HCL | Pulumi does this too; pick by language preference |
| The same Terraform across 10+ environments without copy-paste | Terragrunt | DRY backend/provider generation; run a change across a whole env tree | Plain Terraform works for a handful of envs; this is for many |
| Configuring inside servers — packages, files, services, OS hardening | Ansible | Agentless, idempotent, procedural day-2 work; drives appliances over SSH/API | Terraform can run remote-exec, but it is the wrong tool for config mgmt |
| Provisioning with complex logic, in an app team’s own language | Pulumi | Real languages: loops, types, tests, true reusable components | HCL handles most cases; choose Pulumi for genuine code complexity + skills |
| Driving a virtual appliance with no first-class provider | Ansible | SSH/WinRM/API automation; no provider required | Terraform needs a provider; appliances often don’t have one |
| An auditable, repeatable exam-season scale-up | Terraform (+ Terragrunt) | Declarative desired state, reviewed plan, state-tracked, drift-detectable | Ansible is stateless, so it is a weaker fit for “what exactly is provisioned” |
The horizontal split (provision vs configure) is Terraform/Pulumi above the line, Ansible below it — these compose, they do not compete. The vertical choice (Terraform vs Pulumi) is one-or-the-other on language and team, and Terragrunt only enters once Terraform’s own repetition hurts.
Where teams get this wrong
Using Terraform to configure servers. Terraform has remote-exec and local-exec provisioners, and it is tempting to install packages with them. Do not. They run once at create time, are not idempotent, are invisible to plan, and turn your state into a lie about what is actually installed. Provisioning and configuration are different jobs; remote-exec is an escape hatch, not a config-management strategy. Hand the inside of the box to Ansible.
Using Ansible to provision cloud resources. Ansible has cloud modules and can create a VPC. But it is stateless, so it has no real plan, no dependency graph, and no clean answer to “what does this manage and what would change” — exactly the questions an audit asks. For the declarative “here is the desired set of cloud resources” job, a state-backed tool is the right model.
Reaching for Terragrunt on day one. Terragrunt earns its keep when repetition across environments is genuinely painful. A team with three environments that adopts it early pays the cost — another tool, another layer of indirection, a steeper onboarding — for benefits they don’t yet have. Start with plain Terraform; adopt Terragrunt when the forty-way copy-paste is the actual pain.
Choosing Pulumi because “real code is nicer,” then handing it to ops. Pulumi is excellent if your team writes that language daily and your infra logic is genuinely complex. If your platform is owned by ops engineers who do not write TypeScript, you have traded HCL — which they can read — for code that shrinks your reviewer pool and your bus factor. The honest question is “who maintains this at 2 a.m. on results day,” not “which is more elegant.”
Multi-environment DRY: a worked comparison
The forty-environment problem is where the choice becomes concrete. In plain Terraform, each environment needs its own backend and provider config, repeated:
```hcl
# environments/board-ka/prod/backend.tf: duplicated, with edits, ×40
terraform {
  backend "s3" {
    bucket = "edtech-tfstate-board-ka-prod"
    key    = "moodle/terraform.tfstate"
    region = "ap-south-1"
  }
}
```
With Terragrunt, the backend is generated from one root definition and each environment shrinks to its inputs:
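```hcl
# root.hcl: written ONCE (assumed here to live in environments/)
locals {
  # Assumption: each child sits at <board>/<env> relative to this file
  path_parts = split("/", path_relative_to_include())
  board      = local.path_parts[0]
  env        = local.path_parts[1]
  region     = "ap-south-1" # assumed, matching the plain-Terraform example
}

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    bucket = "edtech-tfstate-${local.board}-${local.env}"
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = local.region
  }
}
```

```hcl
# environments/board-ka/prod/terragrunt.hcl: tiny, per environment
include "root" {
  # The file name must be passed explicitly when the root is not terragrunt.hcl
  path = find_in_parent_folders("root.hcl")
}

inputs = { board = "board-ka", env = "prod", moodle_db_tier = "db.r6g.xlarge" }
```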
Change the state-bucket convention once in root.hcl and it applies to all forty. With Pulumi, the same DRY goal is met with a typed component and a loop — no wrapper tool, but you are now maintaining a software project:
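```python
import pulumi

BOARDS = ["board-ka"]                  # hypothetical: one entry per board in practice
TIERS = {"board-ka": "db.r6g.xlarge"}  # hypothetical per-board database tiers

# one reusable component, instantiated per board
class MoodleEnv(pulumi.ComponentResource):
    def __init__(self, board: str, env: str, db_tier: str, opts=None):
        # the type token "edtech:platform:MoodleEnv" is an illustrative name
        super().__init__("edtech:platform:MoodleEnv", f"{board}-{env}", None, opts)
        ...  # child resources are declared here with parent=self

for board in BOARDS:  # one prod instance per board, from real code
    MoodleEnv(board, "prod", db_tier=TIERS[board])
```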
Two valid ways to kill the same duplication. Terragrunt wins when the team is HCL-native and wants the smallest new concept. Pulumi wins when they would rather express it in a language they already test and refactor.
Drift, security, and operating reality
Drift detection is where the state model pays off, and where Ansible’s statelessness plays out differently. A scheduled terraform plan (or pulumi preview) against an environment reports any difference between code and reality: the console hot-fix someone applied during a surge shows up as a diff to be reverted or codified. Wiz backstops this from outside the pipeline by scanning live cloud posture continuously, so even drift the plan misses, or a resource created entirely outside IaC, gets caught. Ansible, by contrast, doesn’t report drift so much as erase it: re-running the playbook simply re-converges the box to the playbook’s definition, which is its own kind of guarantee for the inside of a server.
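A scheduled drift check can be little more than a wrapper around the plan command. The sketch below relies on Terraform’s `-detailed-exitcode` flag, which makes `terraform plan` exit with 0 for an empty diff, 1 for an error, and 2 for a non-empty diff. The function names, status strings, and the idea of calling this from a scheduler are assumptions for illustration.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a status."""
    if code == 0:
        return "in-sync"  # plan succeeded, empty diff
    if code == 2:
        return "drift"    # plan succeeded, non-empty diff
    return "error"        # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir`; plan only reports, it never applies."""
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    done = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(done.returncode)
```

A non-empty diff does not say which side moved. It may be a console change, or it may be merged code that has not been applied yet, so a "drift" status is a prompt for a human to read the plan and not a verdict on its own.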
Security wraps every tool identically. Wiz Code scans the Terraform/Pulumi in the PR before merge, catching the public bucket or open security group as code — the shift-left control that would have prevented the original exposure. The blast-radius discipline of one state object per environment means a bad apply is contained to one board. Secrets never live in code: Vault issues short-lived, dynamic credentials, and the pipeline authenticates to clouds via OIDC, so there are no long-lived keys in the repo or CI — a direct answer to the standing rule that leaked credentials must never be re-committed.
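To show what "catching the open security group as code" means mechanically, here is a small stand-in for a pre-merge check. It is not how Wiz Code works internally. It is a sketch that reads the JSON form of a plan (as produced by `terraform show -json` on a saved plan file) and flags any `aws_security_group` whose planned ingress allows `0.0.0.0/0`. The resource addresses in the sample are hypothetical.

```python
import json
from typing import List


def open_ingress_findings(plan_json: str) -> List[str]:
    """Return one finding per planned ingress rule open to the whole internet."""
    plan = json.loads(plan_json)
    findings: List[str] = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        # "after" is null when the resource is being deleted
        after = (rc.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                findings.append(
                    f'{rc["address"]}: port {rule.get("from_port")} open to the internet'
                )
    return findings


# Hypothetical plan fragment: one bad rule, one acceptable rule.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_security_group.grades_db", "type": "aws_security_group",
         "change": {"actions": ["update"], "after": {"ingress": [
             {"from_port": 5432, "to_port": 5432, "cidr_blocks": ["0.0.0.0/0"]}]}}},
        {"address": "aws_security_group.app", "type": "aws_security_group",
         "change": {"actions": ["create"], "after": {"ingress": [
             {"from_port": 443, "to_port": 443, "cidr_blocks": ["10.0.0.0/16"]}]}}},
    ]
})

findings = open_ingress_findings(sample_plan)
```

A pipeline step would fail the PR whenever `findings` is non-empty. A commercial scanner covers far more rule classes, but the control point is the same: the mistake is rejected while it is still a diff.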
Cost is mostly about avoiding the wrong tool’s overhead. Terragrunt and Ansible are open-source and free; cost shows up as operational burden. Terragrunt is another concept to learn, Pulumi pulls language-runtime and dependency management into your infra repo, and HCP Terraform, Pulumi Cloud, and Ansible Automation Platform are paid platforms you adopt only if you need managed state, RBAC, and policy-as-a-service rather than running runners and a state backend yourself. For a four-person team, the dominant cost is cognitive: every extra tool is onboarding time and a thing that breaks at 2 a.m.
Explicit tradeoffs and the honest recommendation
The combined stack’s cost is real. Running Terraform + Terragrunt + Ansible means three tools, three mental models, and a pipeline that orchestrates all of them — more than a one-tool shop carries. State is a liability as well as an asset: it must be stored securely, locked against concurrent writes, and reconciled when it drifts. The provision/configure split means an environment is only “done” after both layers run, so your pipeline and your runbooks have to treat them as one logical unit. And Terragrunt, for all its DRY power, is another layer between you and Terraform — when something breaks, you debug through the wrapper.
The alternatives, and when each genuinely wins. If you provision only cloud resources and never touch the inside of a VM (a pure-serverless, pure-managed-services shop), you may need just Terraform or just Pulumi, with no Ansible and no Terragrunt. If you have a handful of environments, skip Terragrunt; plain Terraform’s repetition is tolerable and the wrapper is overkill. If your platform is owned by application engineers who live in TypeScript or Python and your infra has real branching logic, Pulumi alone (using its own stack configuration and components instead of Terragrunt) is a coherent, single-language stack. If you mostly manage a fleet of long-lived VMs and appliances and provision little, Ansible-heavy with a thin Terraform base is right. And if you want managed state, policy-as-code, and RBAC out of the box rather than assembling them, the paid platforms (HCP Terraform, Pulumi Cloud, Ansible Automation Platform) buy that, at a recurring price a four-person team should weigh hard.
For the EdTech team specifically, the framework lands here: Terraform as the provisioning substrate on both clouds, Terragrunt to keep the forty environments DRY (they have earned it), Ansible for OS hardening, the Moodle stack on VM-based boards, and the virtual appliances with no provider, and Pulumi held in reserve — not adopted, because the platform is ops-owned and HCL is what the team reads, but the right escape hatch the day a piece of infra genuinely needs real-language logic. All of it rides one spine: Okta/Entra identity, Vault secrets, GitHub Actions/Jenkins with Argo CD for the GitOps delivery of the Kubernetes workloads themselves, Wiz/Wiz Code for shift-left and posture, CrowdStrike Falcon at runtime, Dynatrace/Datadog for observability, ServiceNow for the audit trail, and Akamai at the edge. The win is not a favorite tool. It is that next results-day’s scale-up is a reviewed pull request with a readable plan, applied by a pipeline, scanned before it merges, and traceable to a commit — and the security group that was hand-edited last year is now a line of code that Wiz would refuse to let through.