CloudFormation is a declarative language with no loops, no first-class functions, and a deliberately small set of intrinsics. That ceiling is a feature: a template is meant to be a static, reviewable artifact. But the moment you need ten near-identical subnets, a conditional that branches on a list length, or a resource type AWS has not modeled yet, you hit the wall. The interesting part of CloudFormation is the set of extension points the service exposes for exactly these cases: AWS-hosted transforms (SAM), template macros, the AWS::LanguageExtensions transform, the resource provider registry, custom resources, and finally CDK escape hatches when you generate the template instead of writing it.
This guide walks through each mechanism, where it runs in the deployment lifecycle, and the failure modes that bite in production. Everything targets the current CloudFormation control plane and CDK v2.
1. Know where each extension runs before you reach for it
The single most common mistake is using the wrong extension for the job because people do not internalise when each one executes. Macros and transforms run at template-processing time, before any resource is touched. Resource providers and custom resources run during the actual stack operation, as part of the change set being executed.
| Mechanism | Runs when | Runs where | Use it for |
|---|---|---|---|
| Transform (SAM, LanguageExtensions) | Template processing, pre-changeset | CloudFormation service | Macro-expanding shorthand into full resources |
| Template macro | Template processing, pre-changeset | Your Lambda | Custom template-to-template rewriting (loops, string ops) |
| Resource provider (registry type) | Stack operation | AWS-hosted, your handler | A real, first-class resource type with full CRUD + drift |
| Custom resource | Stack operation | Your Lambda / SNS | One-off gaps, side effects, lookups, glue |
Rule of thumb: if you are rewriting the template, you want a macro or transform. If you are managing a thing that has a lifecycle, you want a resource provider or a custom resource. Mixing these up produces code that is impossible to reason about.
A processed template is what CloudFormation actually deploys. Always inspect it before trusting a macro:
aws cloudformation get-template \
  --stack-name my-stack \
  --template-stage Processed \
  --query 'TemplateBody' --output json
2. Author a Lambda-backed template macro
A macro is a Lambda function plus an AWS::CloudFormation::Macro resource that registers it by name. When a template references the macro under its top-level Transform, CloudFormation invokes your function with the template fragment, and your function returns a rewritten fragment. This is the escape hatch for syntactic features the language lacks: real loops, string manipulation, injecting boilerplate.
The contract is strict. CloudFormation sends an event and expects a JSON response containing requestId (echoed back unchanged), a status, and the rewritten fragment. The status is compared case-insensitively against success; any other value is treated as a failure.
# macro_handler.py - expands a "Count" property into N copies of a resource
import copy


def handler(event, context):
    fragment = event["fragment"]
    new_resources = {}
    for name, resource in fragment.get("Resources", {}).items():
        count = resource.get("Count")
        if count is None:
            new_resources[name] = resource
            continue
        # Strip the synthetic Count key before emitting real CFN
        template = copy.deepcopy(resource)
        template.pop("Count", None)
        for i in range(int(count)):
            new_resources[f"{name}{i}"] = copy.deepcopy(template)
    fragment["Resources"] = new_resources
    return {
        "requestId": event["requestId"],
        "status": "SUCCESS",
        "fragment": fragment,
    }
Register the function as a macro in its own stack. The macro and the Lambda must live in the same account and region as the stacks that consume it.
# macro-registration.yaml
# MacroRole (the Lambda execution role) and ArtifactBucket are assumed to be
# defined elsewhere in this template; they are omitted here for brevity.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  MacroFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: macro_handler.handler
      Runtime: python3.12
      Timeout: 30
      Role: !GetAtt MacroRole.Arn
      Code:
        S3Bucket: !Ref ArtifactBucket
        S3Key: macro_handler.zip
  CountMacro:
    Type: AWS::CloudFormation::Macro
    Properties:
      Name: CountMacro  # this is the name templates reference
      FunctionName: !GetAtt MacroFunction.Arn
Consume it by listing the macro name in Transform. The synthetic Count property is legal in the source template only because the macro strips it before CloudFormation validates the resource:
AWSTemplateFormatVersion: "2010-09-09"
Transform: [CountMacro]
Resources:
  Topic:
    Type: AWS::SNS::Topic
    Count: 3
    Properties:
      DisplayName: worker-topic
Hard-won lessons that are not obvious from the docs:
- No drift, no rollback semantics inside the macro. A macro is a pure template-rewrite. If it throws, the entire operation fails before a change set exists. You get one error string back; log generously to CloudWatch because that is your only debugger.
- Macros do not compose with cross-stack references cleanly. A template that uses a macro cannot be used as a nested stack via AWS::CloudFormation::Stack in some configurations, and package/deploy will refuse certain combinations. Validate the processed output early.
- Macros run with their own IAM role, but they cannot read other AWS resources unless you make API calls inside the handler. Keep them deterministic; a macro that calls out to live infrastructure is a macro that makes your template non-reproducible.
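The first lesson above implies the handler should fail loudly and on purpose rather than by crashing. A minimal sketch of that, under some assumptions: expand_counts is a hypothetical helper name, the Count validation rules are illustrative, and the errorMessage field follows AWS's published sample macros rather than anything stated in this guide.

```python
import copy


def expand_counts(fragment):
    """Return a copy of the fragment with every Count-bearing resource duplicated."""
    out = copy.deepcopy(fragment)  # never mutate the caller's fragment
    expanded = {}
    for name, resource in out.get("Resources", {}).items():
        count = resource.pop("Count", None)
        if count is None:
            expanded[name] = resource
            continue
        count = int(count)  # raises ValueError on junk such as "three"
        if count < 1:
            raise ValueError(f"{name}: Count must be >= 1, got {count}")
        for i in range(count):
            expanded[f"{name}{i}"] = copy.deepcopy(resource)
    out["Resources"] = expanded
    return out


def handler(event, context):
    try:
        fragment = expand_counts(event["fragment"])
        return {
            "requestId": event["requestId"],
            "status": "success",
            "fragment": fragment,
        }
    except Exception as exc:
        # print() output lands in the function's CloudWatch log stream
        print(f"macro failed: {exc!r}")
        return {
            "requestId": event["requestId"],
            "status": "failure",
            "errorMessage": str(exc),  # assumed field, per AWS sample macros
        }
```

Keeping the rewrite in a pure function means it can be unit-tested with a plain dictionary, without deploying the Lambda or building a full macro event.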
3. Use AWS::LanguageExtensions for loops and intrinsics
Before writing a custom macro for a loop, check whether the AWS-managed AWS::LanguageExtensions transform already covers it. It is a first-party transform that adds Fn::ForEach, Fn::Length, and Fn::ToJsonString, and relaxes some intrinsic-function restrictions (for example, allowing intrinsic functions in the DeletionPolicy and UpdateReplacePolicy attributes, and a default value in Fn::FindInMap). No Lambda, no registration, no IAM.
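Fn::ForEach takes a loop name, an identifier, a collection, and an output map. Keys and values in the output map reference the identifier as ${Identifier}, which substitutes the item as-is, or as &{Identifier}, which substitutes it with non-alphanumeric characters removed so that generated logical IDs stay valid.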
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::LanguageExtensions
Parameters:
  BucketNames:
    Type: CommaDelimitedList
    Default: "logs,artifacts,backups"
Resources:
  Fn::ForEach::Buckets:
    - LogicalId                # the loop identifier
    - !Ref BucketNames         # the collection
    - "${LogicalId}Bucket":    # output key template
        Type: AWS::S3::Bucket
        Properties:
          BucketName: !Sub "myorg-${LogicalId}"
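To build intuition for what the loop above turns into, here is a rough textual model of the expansion in Python. It is not AWS's implementation; expand_foreach is a hypothetical helper that only mimics the substitution rules for ${Identifier} and &{Identifier}.

```python
import json
import re


def expand_foreach(identifier, collection, output_template):
    """Mimic Fn::ForEach textually: emit one copy of the output map per item."""
    serialized = json.dumps(output_template)
    resources = {}
    for item in collection:
        # &{Identifier}: the item with non-alphanumeric characters stripped
        logical_safe = re.sub(r"[^A-Za-z0-9]", "", item)
        # ${Identifier}: the item as-is (JSON-escaped so the text stays valid)
        escaped = json.dumps(item)[1:-1]
        text = serialized.replace("&{" + identifier + "}", logical_safe)
        text = text.replace("${" + identifier + "}", escaped)
        resources.update(json.loads(text))
    return resources


template = {
    "${LogicalId}Bucket": {
        "Type": "AWS::S3::Bucket",
        "Properties": {"BucketName": {"Fn::Sub": "myorg-${LogicalId}"}},
    }
}
expanded = expand_foreach("LogicalId", ["logs", "artifacts", "backups"], template)
print(sorted(expanded))  # ['artifactsBucket', 'backupsBucket', 'logsBucket']
```

The real expansion is what shows up in the processed template; treat this model only as a way to predict the logical IDs you should expect to see there.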
Fn::Length is the conditional-on-list-length primitive that plain CloudFormation cannot express. Pair it with Conditions:
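Transform: AWS::LanguageExtensions
Conditions:
  # Full function names are used because YAML short forms cannot be
  # chained back to back. Note this is also true for an empty list.
  HasMultipleAZs:
    Fn::Not:
      - Fn::Equals:
          - Fn::Length: !Ref SubnetList
          - 1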
The transform is the right default for templated infrastructure because AWS owns the implementation and its expansion is deterministic and visible in the processed template. Reach for a custom macro only when you need string operations or rewriting logic that LanguageExtensions does not provide.
If Fn::ForEach plus Fn::Length solves it, never write a Lambda macro for the same thing. You are taking on a runtime, an IAM role, and a CloudWatch debugging surface to reinvent something AWS maintains for free.
4. Build a first-class resource type with the CloudFormation CLI
When you need a real resource type, not template sugar, build a resource provider and publish it to the registry. A registry resource type gets a fully namespaced name (Vendor::Service::Resource), participates in drift detection, supports create/read/update/delete/list handlers, and is referenced exactly like an AWS-native type. This is the path for managing third-party SaaS or internal control-plane objects as native CloudFormation resources.
Scaffold with the CloudFormation CLI (cfn). It generates a JSON schema for your type and language-specific handler stubs (Java, Go, Python, TypeScript).
pip install cloudformation-cli cloudformation-cli-python-plugin
cfn init # choose RESOURCE, type name MyOrg::Billing::Budget, language Python
The schema is the contract. You declare the properties, mark which are createOnlyProperties (changing them forces replacement) and which are readOnlyProperties (set by the handler, not the user), and name the primaryIdentifier:
{
  "typeName": "MyOrg::Billing::Budget",
  "description": "A team spending budget managed through CloudFormation.",
  "properties": {
    "Name": { "type": "string" },
    "Limit": { "type": "number" },
    "Arn": { "type": "string" }
  },
  "primaryIdentifier": ["/properties/Arn"],
  "readOnlyProperties": ["/properties/Arn"],
  "createOnlyProperties": ["/properties/Name"],
  "additionalProperties": false
}
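Because the pointer lists in the schema are plain strings, a typo in one of them is easy to miss in review. A small pre-flight check, sketched here with the hypothetical helper check_schema_pointers, confirms that every pointer names a declared top-level property. It does not replace the validation the cfn CLI performs.

```python
import json

POINTER_KEYS = ("primaryIdentifier", "readOnlyProperties", "createOnlyProperties")
PREFIX = "/properties/"


def check_schema_pointers(schema):
    """Return a list of problems; an empty list means every pointer resolves."""
    declared = set(schema.get("properties", {}))
    problems = []
    for key in POINTER_KEYS:
        for pointer in schema.get(key, []):
            if not pointer.startswith(PREFIX):
                problems.append(f"{key}: {pointer!r} is not a {PREFIX} pointer")
                continue
            # Only the top-level property name is checked in this sketch
            top_level = pointer[len(PREFIX):].split("/")[0]
            if top_level not in declared:
                problems.append(f"{key}: {pointer!r} names undeclared property {top_level!r}")
    if not schema.get("primaryIdentifier"):
        problems.append("primaryIdentifier is missing or empty")
    return problems


schema = json.loads("""
{
  "typeName": "MyOrg::Billing::Budget",
  "properties": {
    "Name": {"type": "string"},
    "Limit": {"type": "number"},
    "Arn": {"type": "string"}
  },
  "primaryIdentifier": ["/properties/Arn"],
  "readOnlyProperties": ["/properties/Arn"],
  "createOnlyProperties": ["/properties/Name"]
}
""")
print(check_schema_pointers(schema))  # []
```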
Implement the handlers, then submit. cfn submit builds the package, registers the type version, and (with --set-default) makes it the active version in the account/region:
cfn generate # regenerate code from schema after edits
cfn submit --set-default --region us-east-1
A submitted private type is then usable like any native resource:
Resources:
  TeamBudget:
    Type: MyOrg::Billing::Budget
    Properties:
      Name: platform-team
      Limit: 5000
The reason to pay the cost of a provider over a custom resource: drift detection works (CloudFormation calls your read handler and diffs), the type is discoverable in the registry, and list enables import. A custom resource gets none of that.
5. Fill the gaps with custom resources and lifecycle hooks
For genuinely one-off needs (a side effect, an AMI lookup, a string transform, calling an API once during deploy), a full resource provider is overkill. The AWS::CloudFormation::CustomResource (or its Custom:: alias) backed by Lambda is the right tool. CloudFormation invokes your function on create, update, and delete, and blocks the stack operation until your function calls back to the pre-signed S3 URL in event["ResponseURL"].
The two failure modes that cause stuck stacks: not responding at all, and not handling Delete.
import json
import urllib.request


def send(event, status, data=None, physical_id=None):
    # On Update/Delete, reuse the ID CloudFormation already knows so the
    # response is never mistaken for a replacement.
    physical_id = (
        physical_id
        or event.get("PhysicalResourceId")
        or event["LogicalResourceId"]
    )
    body = json.dumps({
        "Status": status,
        "Reason": "See CloudWatch logs",
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }).encode()
    req = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"content-type": "", "content-length": str(len(body))},
    )
    urllib.request.urlopen(req)


def handler(event, context):
    try:
        if event["RequestType"] == "Delete":
            # Always succeed Delete unless you truly own teardown,
            # or a failed create will wedge the rollback.
            send(event, "SUCCESS")
            return
        # Create / Update logic here
        send(event, "SUCCESS", data={"Result": "ok"})
    except Exception:
        send(event, "FAILED")  # never let the Lambda time out silently
Non-negotiable patterns:
- Always respond, including in the failure path. A try/except that posts FAILED is what saves you from a stack stuck in CREATE_IN_PROGRESS for an hour until the resource timeout fires.
- Treat Delete as best-effort. If a create fails, CloudFormation rolls back by deleting the resource it just half-created. A Delete that throws on a resource that never fully existed wedges the rollback.
- Watch the PhysicalResourceId. If you return a different physical ID during an Update, CloudFormation interprets it as a replacement and issues a Delete for the old ID afterward. Keep it stable unless you intend replacement.
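The physical-ID rule in the last bullet is easier to keep straight when the decision lives in one small function. The following is a sketch only: choose_physical_id and build_response are hypothetical helper names, and the fallback to the logical ID is an illustrative choice rather than a CloudFormation requirement.

```python
def choose_physical_id(event, new_id=None):
    """Pick the PhysicalResourceId to report back for this request."""
    existing = event.get("PhysicalResourceId")
    if event["RequestType"] == "Create":
        # First sighting of the resource: use the minted ID if there is one
        return new_id or event["LogicalResourceId"]
    if event["RequestType"] == "Delete":
        # Always echo what CloudFormation already knows
        return existing or event["LogicalResourceId"]
    # Update: a new_id that differs from the existing ID signals replacement
    return new_id or existing or event["LogicalResourceId"]


def build_response(event, status, physical_id, data=None, reason=""):
    """Assemble the response body that gets PUT to event['ResponseURL']."""
    return {
        "Status": status,
        "Reason": reason or "See CloudWatch logs",
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }
```

Separating the body construction from the HTTP call also lets you test all three request types locally with plain dictionaries before a stack ever depends on the function.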
This is also where CloudFormation Hooks differ in intent: a custom resource manages a thing, whereas a Hook (AWS::Hooks) inspects and can block create/update/delete of other resources for policy enforcement, before they are provisioned. Reach for Hooks when the goal is a guardrail, not a managed object.
6. Drop to L1 constructs and escape hatches in CDK
Most of the time you are not hand-writing templates; you are generating them with CDK. CDK’s L2 constructs are opinionated, and periodically the property you need is not surfaced, or a brand-new CloudFormation property ships before the L2 catches up. CDK has a layered set of escape hatches for exactly this, and knowing them prevents the “I’ll just drop CDK and write YAML” overreaction.
Escape hatch 1: override properties on the underlying L1 (Cfn*) resource. Every L2 wraps an L1. Reach into it and override raw CloudFormation properties by their CloudFormation names (not the CDK camelCase):
import * as s3 from "aws-cdk-lib/aws-s3";

// Inside a Stack or Construct constructor
const bucket = new s3.Bucket(this, "Data");
// Get the L1 child and override a raw CFN property
const cfnBucket = bucket.node.defaultChild as s3.CfnBucket;
cfnBucket.addPropertyOverride(
  "AccelerateConfiguration.AccelerationStatus",
  "Enabled",
);
// Remove a property the L2 set that you do not want
cfnBucket.addPropertyDeletionOverride("LoggingConfiguration");
Escape hatch 2: raw overrides for non-property fields such as UpdateReplacePolicy, DeletionPolicy, Metadata, or Condition, which are not under Properties:
cfnBucket.addOverride("DeletionPolicy", "Retain");
cfnBucket.addOverride("Metadata.guard.SuppressedRules", ["S3_BUCKET_LOGGING_ENABLED"]);
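Escape hatch 3: use the L1 directly when there is no L2 at all (common for day-one resource launches). Generated Cfn* constructs map one-to-one onto the resource and accept every property the resource supports. For a type with no generated L1 either, such as a private registry type, the generic CfnResource takes the type name and raw properties:
import { CfnResource } from "aws-cdk-lib";

new CfnResource(this, "Raw", {
  type: "MyOrg::Billing::Budget",
  properties: { Name: "platform-team", Limit: 5000 },
});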
The escape-hatch order is the mental model: prefer the L2 property, then addPropertyOverride, then addOverride, then drop to the Cfn* L1. Abandoning CDK for raw YAML because one property is missing is almost always the wrong trade.
After applying any escape hatch, synthesize and read the actual template. CDK’s job is to emit CloudFormation; verify the override landed where you expect:
cdk synth MyStack > /tmp/synth.yaml
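Reading the synthesized template by eye works once; a scripted check keeps working after the next refactor. A minimal sketch follows, assuming the JSON template that CDK writes under cdk.out (the path cdk.out/MyStack.template.json is an assumption based on default output naming). find_resources and get_path are hypothetical helper names.

```python
import json


def load_template(path):
    """Load a synthesized CloudFormation template from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_resources(template, resource_type):
    """Return {logical_id: resource} for every resource of the given type."""
    return {
        logical_id: resource
        for logical_id, resource in template.get("Resources", {}).items()
        if resource.get("Type") == resource_type
    }


def get_path(resource, dotted_path):
    """Walk a dotted path such as 'Properties.A.B'; return None if absent."""
    node = resource
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Example use against a synthesized stack (path is an assumption):
# template = load_template("cdk.out/MyStack.template.json")
# for logical_id, bucket in find_resources(template, "AWS::S3::Bucket").items():
#     assert get_path(bucket, "DeletionPolicy") == "Retain", logical_id
```

The dotted paths deliberately mirror the strings passed to addPropertyOverride and addOverride, so each override can be checked with the same path it was written with (prefixed by Properties for property overrides).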
7. Verify
Treat every extended template as untrusted until the processed output, linting, policy, and a real deploy agree.
Inspect the processed template. Macros and transforms only manifest after processing, so lint the expanded form, not your source:
aws cloudformation get-template \
  --stack-name my-stack --template-stage Processed \
  --query 'TemplateBody' --output json > processed.json
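One cheap check on the processed file is that no transform-only syntax survived expansion. The sketch below assumes processed.json holds the processed template as a JSON object; leftover_syntax is a hypothetical helper, and the allowed set is the list of standard resource attributes.

```python
import json

# Keys CloudFormation accepts at the resource level, next to Type/Properties
ALLOWED_RESOURCE_KEYS = {
    "Type", "Properties", "Condition", "DependsOn", "Metadata",
    "CreationPolicy", "DeletionPolicy", "UpdatePolicy", "UpdateReplacePolicy",
}


def leftover_syntax(template):
    """Return a list of signs that a macro or transform did not fully expand."""
    problems = []
    for logical_id, resource in template.get("Resources", {}).items():
        if logical_id.startswith("Fn::ForEach::"):
            problems.append(f"unexpanded loop: {logical_id}")
            continue
        extra = sorted(set(resource) - ALLOWED_RESOURCE_KEYS)
        if extra:
            problems.append(f"{logical_id}: unexpected resource-level keys {extra}")
    return problems


def check_file(path):
    """Load a processed template from disk and report leftovers."""
    with open(path, encoding="utf-8") as handle:
        return leftover_syntax(json.load(handle))


# Example: problems = check_file("processed.json")
```

A synthetic key such as Count appearing in the output means the macro did not run or skipped that resource, which is much easier to catch here than from a validation error during deploy.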
Lint with cfn-lint. It understands the resource specification, validates intrinsic usage, and supports the LanguageExtensions transform natively:
pip install cfn-lint
cfn-lint template.yaml
Enforce policy with CloudFormation Guard. cfn-guard runs declarative rules against the template (or the processed output) and fails the build on violations. This is your policy-as-code gate in CI:
cfn-guard validate --data processed.json --rules guardrails.guard
Integration-test with taskcat. It deploys the stack into real accounts/regions from a config, reports pass/fail per region, and tears down. This is the only check that proves your macro/provider/custom resource behaves end to end:
# .taskcat.yml
project:
  name: extended-cfn
  regions: [us-east-1, eu-west-1]
tests:
  default:
    template: template.yaml

pip install taskcat
taskcat test run
For resource providers specifically, run the contract tests the CLI generates before you trust submit:
cfn test # runs the resource type contract test suite against your handlers