Active Directory Forest Recovery: Building and Testing a Ransomware-Ready Recovery Runbook

At 03:40 the SOC calls: file servers across three sites are throwing ransom notes, and the domain controllers stopped answering LDAP twenty minutes ago. You log into the jump host — it prompts you for credentials against a domain that no longer authenticates anyone. This is the moment every identity architect prepares for and hopes never arrives: not “a domain controller died,” but the entire Active Directory forest can no longer be trusted. Once an attacker owns Tier 0 and has almost certainly forged a Golden Ticket from the krbtgt hash, every surviving DC is a suspect. SYSVOL group policy objects may be weaponised, the schema poisoned, service accounts backdoored, and the AdminSDHolder ACL rewritten. Restoring one DC does nothing if the rest of the forest still answers to the attacker. Forest recovery is the deliberate teardown and rebuild of the whole forest from known-good backups, inside an environment the attacker cannot reach — and this article is the runbook I build, harden, and rehearse for that day.

Active Directory is a distributed, multi-master database. That design is a gift in normal operations (any DC can take a write, replication converges) and a curse during compromise (a poisoned change replicates everywhere in minutes, and there is no single “master copy” to fall back to). Microsoft’s own guidance is blunt: if the forest is compromised at the level where you cannot prove the directory, schema, and SYSVOL are clean on every DC, the only safe path is a full forest recovery from backup. This is not the AD Recycle Bin. It is not Restore-ADObject. It is not force-removing a dead DC and re-promoting a replica. It is the nuclear option — restore one authoritative DC per domain from a backup taken before compromise, isolate it, cleanse it, rotate every secret, rebuild every other DC from that clean seed, and only then reconnect a workforce whose every machine and credential you must also treat as burned.

By the end of this article you will have a complete, tested procedure: what to back up and how to make those backups immutable and offline; how to design the isolated recovery environment (IRE) before the incident; the exact Microsoft-sanctioned sequence for restoring the first DC, cleaning metadata, seizing FSMO roles, invalidating the RID pool, resetting krbtgt twice, purging lingering objects, and rebuilding; the recovery order across a multi-domain forest; where the AD Forest Recovery (ADFR) automation tool fits; how to keep the runbook alive with quarterly restore tests and annual tabletops; and how Purple Knight and an ITDR posture reduce the odds you ever run this at 03:40. Every step comes with real PowerShell, ntdsutil, repadmin, and wbadmin commands, and the reference material is laid out as scannable tables you can keep open during the incident.

What problem this solves

Active Directory is still the beating heart of most enterprises: it authenticates users and computers, authorises access via Kerberos and NTLM, distributes configuration through Group Policy, and anchors the hybrid identity plane that Entra ID, ADFS, PKI, and countless applications trust. When AD goes down hard, everything that depends on it goes down with it — file shares, Exchange, SQL Windows-auth, VPN, RADIUS, certificate issuance, and the very administrative tooling you would use to fix it. A forest-wide ransomware event is therefore not an “identity outage”; it is a business-extinction event if you cannot recover, and companies have folded over exactly this.

What breaks without a rehearsed runbook is not the theory — Microsoft publishes the procedure — but the execution under pressure. Teams discover mid-incident that their only backup is an online copy the ransomware already encrypted; that the backup is older than the tombstone lifetime and therefore un-restorable as a DC; that nobody knows the DSRM password because it was set at promotion in 2019 and never documented; that the FSMO role map lives in a Visio on a file share that is now encrypted; that they have no isolated network to restore into and are about to restore a clean DC straight back onto the poisoned production LAN. Each of these turns a 12-hour recovery into a multi-week catastrophe, and several of them mean no recovery at all.

Who hits this: any organisation running on-premises or hybrid Active Directory — which is still the overwhelming majority of enterprises, including those “mostly in the cloud” whose Entra Connect sync and legacy apps still depend on a healthy forest. It bites hardest where AD grew organically over 15–20 years: multi-domain forests with forgotten trusts, DCs nobody documented, SYSVOL still on the deprecated FRS engine, and a Tier 0 that was never truly isolated. Ransomware crews specifically target AD because owning it means owning the deployment path for the ransomware itself — they encrypt after they have used your own Group Policy and PsExec to push the payload domain-wide. The runbook is what turns their best day into your long, ugly, but survivable shift.

To frame the whole field before the deep dive, here are the failure classes this article addresses, the question each forces, and the first move — the map you keep open at 03:40:

Failure class	What it means	First question	First move
Forest-wide ransomware	Every DC encrypted/wiped; forest offline	Is any backup restorable and pre-compromise?	Verify immutable backup < tombstone age; declare forest recovery
Tier 0 compromise (no crypto yet)	Attacker owns DA/EA; `krbtgt` likely dumped	Can I prove the directory is clean anywhere?	If no → forest recovery; reset `krbtgt` twice regardless
Backups deleted/encrypted	Recovery points gone with the DCs	Did immutability/MUA survive the delete requests?	Restore from immutable/offline copy; if none → escalate
Poisoned SYSVOL / GPO	Malware re-deploys via policy after restore	Was SYSVOL restored authoritatively from a clean point?	Authoritative SYSVOL restore; audit every GPO
Golden Ticket persistence	Forged TGTs survive a naive restore	Did I reset `krbtgt` twice, per domain?	Double reset with replication wait
Silent divergence (USN rollback)	A snapshot-reverted DC quietly ignored by partners	Was any DC restored via VM snapshot?	Never; rebuild that DC fresh

Learning objectives

By the end of this article you can:

Decide correctly between object-level recovery, single-DC rebuild, and full forest recovery — and articulate the trust-based decision gate that separates them.
Design a backup strategy that survives ransomware: system-state and Install-From-Media (IFM) backups, tombstone-lifetime constraints, and offline/immutable/isolated copies using Azure Recovery Services vault immutability plus multi-user authorization (MUA), or WORM/air-gapped media.
Build an isolated recovery environment (IRE) ahead of time: air-gapped network, independent DNS and time, a clean staging forest concept, and a break-glass credential store that does not depend on the forest you are recovering.
Execute the Microsoft forest-recovery procedure end to end: isolate, restore the first writable DC per domain into DSRM, disable replication, seize all five FSMO roles, clean metadata of every other DC, invalidate and raise the RID pool, reset krbtgt twice, reset trust and DSRM and Tier 0 secrets, purge lingering objects, and rebuild replicas from the clean seed.
Sequence recovery across a multi-domain forest — forest root first, then child domains — and reason about the dependencies (schema, PDC emulator, DNS, global catalog) that dictate the order.
Use the ADFR (AD Forest Recovery) automation tool to orchestrate and time the recovery, and know what it does and does not remove from your responsibility.
Keep the runbook alive with quarterly single-DC restore tests, annual full tabletops, and a defensible, timed RTO/RPO — and use Purple Knight and an ITDR platform (Microsoft Defender for Identity / Semperis / Quest) to shrink your attack surface before the incident.

Prerequisites & where this fits

You should already understand Active Directory operations at a senior level: the roles of the five FSMO (Flexible Single Master Operation) holders (Schema Master, Domain Naming Master, PDC Emulator, RID Master, Infrastructure Master), how multi-master replication and the KCC build the topology, what SYSVOL contains and the difference between FRS and DFSR replication of it, how Kerberos issues TGTs and service tickets and the special role of the krbtgt account, and how tombstone lifetime and AD Recycle Bin govern deletion and reanimation. You should be fluent in PowerShell, the ActiveDirectory module, ntdsutil, repadmin, dcdiag, netdom, and wbadmin, and comfortable operating a DC in Directory Services Restore Mode (DSRM).

This article sits at the top of the Identity resilience stack. Upstream of it is sound forest design and DC placement — see Active Directory Domain Services: Forest Design & DC Promotion on Azure — because a well-designed forest is far easier to recover. It is the on-premises identity twin of the general pattern in Ransomware Resilience: Immutable Backup & Isolated Recovery Environment, and it plugs directly into your broader Security Incident Response Runbooks, Tabletops & Cloud Forensics program. The Tier 0 hardening that prevents you ever running this lives in Privileged Identity Management: PIM & PAM Architecture, Privileged Access Management: Vaulting, Session Brokering & Credential Rotation, and the mandatory Entra Break-Glass Emergency Access: Monitoring & Governance. If your workforce authenticates via hybrid, the recovery also has to account for Entra Connect Sync Deep Dive: PHS, PTA & Seamless SSO. The whole effort is an instance of a Zero Trust Architecture Blueprint: Identity, Network & Data.

A quick map of who owns and confirms what during a forest recovery, so you page the right people at 03:40:

Layer / concern	What lives here	Who usually owns it	Why it matters to recovery
AD directory (NTDS.dit)	Users, groups, computers, schema	Identity / AD team	The thing you restore; must be pre-compromise + trusted
SYSVOL / Group Policy	GPOs, scripts, ADMX	Identity + endpoint team	Weaponised GPOs re-deploy ransomware; must be clean
DNS (AD-integrated)	`_msdcs`, SRV, A records	Identity / network	Broken DNS = DCs can’t find each other; rebuild in IRE
Backup platform	System-state / IFM, vaults	Backup / infra team	If it’s online-only, the attacker already encrypted it
Network / segmentation	VLANs, VNets, firewall	Network team	The IRE lives here; must have no route to production
Time source (NTP)	Authoritative clock	Infra / network	Kerberos dies past 5-min skew; IRE needs its own
Endpoints / member servers	Machine secrets, cached creds	Endpoint / app teams	Every machine secret is suspect; rejoin or reimage
Credential vault (break-glass)	DSRM, FSMO map, runbook	Security / IAM	If it depends on the dead AD, you have no procedure

Core concepts

Six mental models make every later step obvious. Internalise these and the runbook stops being a checklist you follow blindly and becomes a set of decisions you understand.

AD is multi-master, so there is no golden copy — only a golden point in time. Any DC can accept a write, and replication converges the forest. During compromise this works against you: a malicious change (a new Enterprise Admin, a modified AdminSDHolder, a poisoned logon script GPO) replicates to every DC within minutes. Recovery is therefore not “find the good DC” — there may be none — but “restore the state the forest was in before the attacker touched it,” which lives only in a backup taken before compromise. This is why backup frequency and retention are the two dials that set your recovery-point floor.

Trust, not blast radius, decides the recovery type. A few thousand deleted objects in an otherwise-healthy forest is an object problem: if the AD Recycle Bin was enabled before the deletion, restore them with Restore-ADObject (enabling it afterwards does not bring them back, so the fallback is an authoritative restore of the affected subtree). One dead DC is a metadata problem: force-remove and re-promote. A forest where you cannot prove the directory, schema, and SYSVOL are clean on every DC is a forest-recovery problem, full stop. Over-recovering costs a weekend; under-recovering restores objects into the attacker’s still-live forest and hands them the keys again. When genuinely in doubt, recover the forest.

krbtgt is the master key, and it has a two-generation memory. Every Kerberos TGT in the domain is encrypted with the krbtgt account’s key. A Golden Ticket is a TGT the attacker forged offline using a stolen krbtgt hash — valid for up to ten years, for any user, undetectable by normal auth logs. Resetting krbtgt invalidates forged tickets — but the account retains a password history of N-1 (the current and the immediately previous key both validate tickets), so a single reset leaves the old key live and Golden Tickets still working. You must reset it twice, with enough time (or forced replication) between resets for the first new key to propagate and the original to age out. Getting this wrong is the single most dangerous, most common recovery mistake — it feels done and it is not.

RID pools and USNs make a restored DC dangerous if you skip the housekeeping. Each DC leases blocks of RIDs (Relative Identifiers) from the RID Master to mint SIDs for new objects. A DC restored from backup has an older RID pool than the forest reached at compromise, so if it starts issuing RIDs it may hand out identifiers other (now-gone) DCs already used — creating duplicate SIDs, which are silent until two principals collide. Likewise, a restored DC has a lower USN (Update Sequence Number) than its partners remember, which is why AD stamps restores with an InvocationID reset and a VM-GenerationID check — but you still explicitly invalidate the RID pool and, when the gap is large, raise rIDAvailablePool. Skip this and you plant a time bomb.
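Raising rIDAvailablePool is plain 64-bit arithmetic, and it is easy to get wrong by hand at 03:40. The following is a minimal sketch, assuming the documented layout of the attribute (upper 32 bits: the ceiling on RIDs the domain can ever issue; lower 32 bits: the RIDs already allocated) and the increment of 100,000 that Microsoft's forest recovery guide suggests. The function names are hypothetical, and nothing here reads or writes the directory:

```powershell
# Split rIDAvailablePool into its two 32-bit halves.
function ConvertFrom-RidAvailablePool {
    param([Parameter(Mandatory)][long]$Value)
    [pscustomobject]@{
        TotalRids     = $Value -shr 32            # upper half: ceiling on RIDs for the domain
        AllocatedRids = $Value -band 4294967295   # lower half: RIDs already handed out
    }
}

# Compute the attribute value to write after raising the allocated count.
function Get-RaisedRidAvailablePool {
    param(
        [Parameter(Mandatory)][long]$Value,
        [long]$Increment = 100000                 # assumption: the commonly recommended jump
    )
    $pool = ConvertFrom-RidAvailablePool -Value $Value
    $newAllocated = $pool.AllocatedRids + $Increment
    if ($newAllocated -ge $pool.TotalRids) {
        throw "Raising by $Increment would exhaust the RID space."
    }
    # Reassemble: ceiling back in the upper half, raised count in the lower half.
    ([long]$pool.TotalRids -shl 32) -bor $newAllocated
}
```

The input value comes from the rIDAvailablePool attribute on the CN=RID Manager$,CN=System object of each domain; the result is what you would write back with ADSI Edit or LDP before invalidating the local pool.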

Lingering objects and the tombstone lifetime govern what can and cannot come back. When an object is deleted it becomes a tombstone for the tombstone lifetime (default 180 days on modern forests). A backup older than that lifetime is un-restorable as a DC, because reviving it would reanimate objects the rest of the forest already garbage-collected, resurrecting lingering objects (objects present on a reconnected/restored DC but deleted everywhere else) and corrupting the forest. Strict replication consistency stops partners from accepting those objects, and repadmin /removelingeringobjects purges them; keeping every recovery point younger than tombstone lifetime is your prevention.

Isolation is the whole game. The most common way forest recoveries fail is restoring a clean DC straight back onto the network the attacker still owns, where it is re-compromised before you finish. The restored DC must come up air-gapped from production — no routing, no peering, no shared DNS, no VPN — with its own time and DNS, and stay there until every secret is rotated and health is proven. You lift the gap once, deliberately, at the end.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side.

Concept	One-line definition	Where it lives	Why it matters to forest recovery
Forest recovery	Rebuild the whole forest from pre-compromise backups	The runbook	The nuclear option when no DC is trustworthy
System-state backup	NTDS.dit + SYSVOL + registry + COM+ + boot files	`wbadmin` / Azure Backup	The only backup you can restore as a DC
IFM (Install From Media)	An ntdsutil-created media set to seed a new DC	`ntdsutil ifm`	Promote a replica without WAN replication
DSRM	Directory Services Restore Mode (safe-mode for DCs)	Boot option + local password	You restore/authoritative-restore here
FSMO	The five single-master roles	Two per forest, three per domain	Seize onto the restored DC; don’t transfer
Authoritative restore	Mark restored data as the winning version	`ntdsutil authoritative restore`	Forces SYSVOL/objects to win replication
`krbtgt`	The Kerberos master-key account	Per domain	Reset twice to kill Golden Tickets
RID pool	Block of RIDs a DC leases to mint SIDs	Per DC, from RID Master	Invalidate on a restored DC to avoid dup SIDs
Tombstone lifetime	How long a deleted object lingers	Forest-wide (default 180 d)	Backups older than this are un-restorable
Lingering object	Object on one DC, deleted everywhere else	On a stale/restored DC	Must be purged or replication is poisoned
Metadata cleanup	Remove a dead DC’s objects from the directory	`ntdsutil metadata cleanup`	Stale DC objects poison replication
IRE	Isolated recovery environment	Pre-built network	Where you restore, air-gapped from production
ADFR	Semperis Active Directory Forest Recovery tool	Commercial product	Automates/times the recovery steps
ITDR	Identity Threat Detection & Response	Defender/Semperis/Quest	Shrinks the odds you ever run this

The three recovery types, side by side

The first and most consequential decision is which fight you are in. Choosing forest recovery when object restore would do wastes a weekend; choosing object restore when the forest is compromised is fatal. This table is the decision gate.

Dimension	Object-level recovery	Single-DC rebuild	Full forest recovery
When	Bounded accidental/malicious deletions; forest trusted	One DC dead/corrupt; rest of forest trusted	No DC provably clean; forest-wide compromise
Trigger example	OU of users deleted by a bad script	DC hardware failure, single-DC ransomware on an isolated box	Every DC encrypted; `krbtgt` stolen; Golden Tickets in play
Tooling	AD Recycle Bin, `Restore-ADObject`	Force-removal + metadata cleanup + re-promote	This whole runbook
Isolation needed?	No	No	Yes — mandatory IRE
Reset `krbtgt`?	No	No	Yes, twice, per domain
Rebuild every DC?	No	Just the one	Yes — all, from clean seed
Typical duration	Minutes to hours	Hours	8–24+ hours, then days of cleanup
Risk if you pick wrong (too little)	—	Restore an object into a live-compromised forest	Hand the forest back to the attacker
Risk if you pick wrong (too much)	Needless downtime	Needless downtime	A wasted weekend (survivable)
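The gate in the table above can be written down as a few lines of logic, which is useful mainly because it forces the team to agree on the inputs before the incident. This is a sketch only: the function name, parameter names, and return labels are hypothetical, and the answers still come from a human making the trust call.

```powershell
# Hypothetical decision gate: inputs are the incident commander's answers.
function Get-RecoveryType {
    param(
        [Parameter(Mandatory)][bool]$ForestProvablyClean,  # directory, schema, SYSVOL verified on every DC
        [bool]$Tier0Compromised = $false,                  # DA/EA owned, or krbtgt likely stolen
        [int]$FailedDcCount = 0,                           # DCs dead or corrupt in a trusted forest
        [bool]$ObjectsDeleted = $false                     # bounded deletions in a trusted forest
    )
    # Trust decides first: any doubt about the forest means forest recovery.
    if (-not $ForestProvablyClean -or $Tier0Compromised) { return 'FullForestRecovery' }
    if ($FailedDcCount -gt 0) { return 'SingleDcRebuild' }
    if ($ObjectsDeleted)      { return 'ObjectLevelRecovery' }
    'NoRecoveryNeeded'
}

# Example: krbtgt is believed stolen, so the answer is forest recovery
# even though no DC has failed.
Get-RecoveryType -ForestProvablyClean $true -Tier0Compromised $true
```

The ordering of the checks is the point: blast radius (failed DCs, deleted objects) is only consulted once trust has been established.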

Backup strategy: system state, IFM, immutable & isolated copies

Your recovery is only ever as good as the backup the attacker could not reach, alter, or delete. Ransomware operators know backups are the counter to extortion, so they hunt and encrypt them first — and increasingly they exfiltrate, then delete, backup catalogues before detonating. Everything in this section is about ensuring that at least one restorable, pre-compromise copy of each domain survives.

Back up system state, not files

You cannot restore a DC from a file-level copy of C:\Windows\NTDS. AD is a live, transactional (ESE/Jet) database, and a naive file copy is inconsistent and un-restorable. You must capture a system-state backup, which the OS assembles as a consistent snapshot of the components that make a DC a DC:

System-state component	What it contains	Why recovery needs it
NTDS database (`ntds.dit`)	The directory: objects, schema, partitions	The forest’s data — the whole point
SYSVOL	GPOs, logon scripts, ADMX, DFSR/FRS staging	Policy; a poisoned SYSVOL re-deploys malware
Registry	System hive incl. security, LSA secrets	Machine identity, service config
COM+ class registration DB	Registered COM+ components	App/service integration
Boot files / system files	Bootloader, protected OS files	Bootable, functioning OS
Certificate Services DB (if AD CS on DC)	Issued-cert database, CA keys	PKI continuity (better: CA off DCs)
Cluster DB (if clustered)	Cluster config	Rarely on DCs; avoid clustering DCs

The canonical tool is wbadmin, scheduled to a dedicated, ACL-locked volume that the rest of the domain cannot browse or write:

# One-off system-state backup to a dedicated, locked volume.
wbadmin start systemstatebackup -backupTarget:E: -quiet

# List what's on the target so you know your restore points.
wbadmin get versions -backupTarget:E:

Schedule it daily (Task Scheduler or the Windows Server Backup policy) on at least two DCs per domain, one of them a FSMO holder, so a single corrupt backup or a single dead DC does not leave a domain un-restorable.

IFM — Install From Media — and why it is a recovery accelerator, not a backup

IFM (Install From Media) is an ntdsutil-generated media set (a copy of the NTDS database, optionally including SYSVOL) used to seed a new DC without replicating the entire database across the WAN. It is not a system-state backup and cannot restore a failed DC, but during recovery it is invaluable for the rebuild phase: once you have one clean, authoritative DC, promoting the next replicas over a slow or constrained IRE link is far faster from IFM than from live replication of a large directory.

# Create an IFM media set from a healthy (clean) DC, for seeding replicas.
ntdsutil
  activate instance ntds
  ifm
    create sysvol full C:\IFM-clean
    quit
  quit

Backup / media type	Restores a dead DC?	Seeds a new DC fast?	Contains SYSVOL?	Primary recovery role
System-state (`wbadmin`)	Yes (authoritative)	No (not its job)	Yes	Restore the first DC per domain
Azure Backup (MARS/MABS) system state	Yes	No	Yes	Off-box, immutable restore point
IFM (`ntdsutil ifm`)	No	Yes	If `create sysvol full`	Rebuild replicas from the clean seed
Bare-metal recovery (BMR)	Yes (whole server)	No	Yes	Full server incl. OS, if needed
VM snapshot / checkpoint	Dangerous*	No	N/A	*USN rollback risk — avoid as DC backup

* Reverting a DC to a VM snapshot outside the supported VM-GenerationID mechanism can cause a USN rollback — the DC re-uses update numbers, replication partners silently ignore its changes, and the directory quietly diverges. Modern hypervisors expose VM-GenerationID so AD can detect a rollback and reset its InvocationID, but snapshots are still not a substitute for system-state backups.

Tombstone lifetime — the hard expiry on every backup

A system-state backup is only restorable as a DC while it is younger than the tombstone lifetime. Default is 180 days on any forest built on Windows Server 2003 SP1 or later (older upgraded forests may still show 60 days — check yours). Restore a backup older than that and AD refuses, because reviving it would reanimate objects the forest already garbage-collected.

# Read the forest's actual tombstone lifetime (a null value means the 60-day default applies).
Get-ADObject -Identity "CN=Directory Service,CN=Windows NT,CN=Services,$((Get-ADRootDSE).configurationNamingContext)" `
  -Properties tombstoneLifetime |
  Select-Object tombstoneLifetime

Value	Meaning	Backup implication
`tombstoneLifetime = 180`	Modern default	Keep every recovery point < 180 days old
`tombstoneLifetime = 60`	Legacy/upgraded forest	Very tight; consider raising to 180
`tombstoneLifetime` null	Uses the built-in default of 60 days	Set it explicitly so you’re not guessing
Backup age > lifetime	Un-restorable as a DC	You have no valid DC backup — fix retention now

Practically: keep frequent recovery points and retain enough history to outlast attacker dwell time (attackers often lurk for weeks, so your restore point must predate their first foothold, not just detonation), but never let a point you might restore as a DC exceed the tombstone lifetime.
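Those two constraints (older than the first foothold, younger than the tombstone lifetime) define a window, and picking the restore point is a matter of finding the newest backup inside it. A minimal sketch, assuming you already have the backup timestamps and an estimate of the first foothold from the incident responders; the function and parameter names are hypothetical:

```powershell
# Pick the newest restore point that predates the attacker's first foothold
# and is still younger than the tombstone lifetime. Returns nothing if no
# backup falls inside that window.
function Select-RestorePoint {
    param(
        [Parameter(Mandatory)][datetime[]]$BackupTimes,
        [Parameter(Mandatory)][datetime]$FirstFoothold,   # earliest evidence of compromise
        [int]$TombstoneLifetimeDays = 180,                # read the real value from the forest
        [datetime]$Now = (Get-Date)
    )
    $oldestAllowed = $Now.AddDays(-$TombstoneLifetimeDays)
    $BackupTimes |
        Where-Object { $_ -lt $FirstFoothold -and $_ -gt $oldestAllowed } |
        Sort-Object -Descending |
        Select-Object -First 1
}

# Example with made-up dates: foothold on 25 May, so the 20 May backup wins.
$backups = [datetime[]]@('2025-11-01', '2026-05-10', '2026-05-20', '2026-05-30')
Select-RestorePoint -BackupTimes $backups `
    -FirstFoothold ([datetime]'2026-05-25') -Now ([datetime]'2026-06-01')
```

An empty result is the case the retention policy exists to prevent: every backup is either post-compromise or past the tombstone lifetime.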

Offline, immutable, isolated — the copies ransomware cannot touch

Online backups reachable from the domain are precisely what ransomware encrypts. You need copies the production identity plane cannot reach or delete. There are two proven patterns.

Pattern A — Azure Backup with vault immutability + MUA. Back up DC system state to an Azure Recovery Services vault (via the MARS agent or MABS/DPM), then lock immutability (recovery points cannot be deleted or have retention shortened before expiry) and enable multi-user authorization (MUA) so even a compromised backup admin cannot weaken protection without a second, separately-controlled approver.

# Create the vault, then lock immutability irreversibly.
az backup vault create \
  --name rsv-ad-forest-recovery \
  --resource-group rg-identity-dr \
  --location eastus2

az backup vault update \
  --name rsv-ad-forest-recovery \
  --resource-group rg-identity-dr \
  --immutability-state Locked

# Resource guard for MUA. Once the vault is associated with this guard, deletions and
# retention changes require approval from a principal with access to the guard, ideally
# in a separate subscription or tenant.
az dataprotection resource-guard create \
  --name rg-guard-ad-dr \
  --resource-group rg-identity-security \
  --location eastus2

Pattern B — WORM / air-gapped media. Write-once media (tape with WORM, or an immutable object-lock bucket) and/or a pull-based backup host in a separate trust domain that authenticates to the DCs and pulls, so nothing on the production side holds credentials to the backup store. The backup network is one-way: production cannot initiate to it.

Property	Vault immutability + MUA	WORM / air-gapped media
Deletion protection	Immutability lock (irreversible)	Physical/logical WORM
Insider/compromised-admin protection	MUA second approver	Separate custody of media/host
Attacker with Tier 0 can delete it?	No (needs MUA approver)	No (offline / separate trust)
Restore speed	Fast (cloud, parallel)	Slower (media handling)
Ongoing cost	Storage + egress	Media + handling + storage
Recovery-time note	Restore into Azure IRE directly	Ship/mount media into IRE
Best for	Cloud-adjacent orgs, fast RTO	Regulated/air-gap mandates

Protect the recovery credentials and artifacts themselves

The most humiliating failure is having perfect backups and no way to use them because the DSRM password, FSMO map, or runbook lived on the encrypted file share. Store the operational artifacts offline and outside the forest you are recovering:

Artifact	Why you need it during recovery	Where to store it
DSRM password (per DC, or a reset plan)	Boot the restored DC into DSRM	Printed in a safe + break-glass vault
FSMO role-holder map	Know what to seize and where	Offline doc, refreshed on change
DC inventory (sites, IPs, roles)	Metadata cleanup + rebuild targets	Offline doc
Trust list (internal + external)	Reset trust passwords; rebuild trusts	Offline doc
Tombstone lifetime value	Validate backup age	Offline doc
The runbook itself	The procedure under pressure	Offline + non-AD vault
Break-glass admin creds	Operate the IRE tooling	Non-AD vault (e.g. cloud PIM, HSM)
Backup encryption keys / vault creds	Decrypt/restore the backups	Separately-secured, not on DCs

The rule: if the only copy of your procedure or credentials authenticates against the dead forest, you do not have a procedure. And a green backup report proves bytes were written — it does not prove they restore into a bootable DC. Only a real restore test does that.
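Keeping the FSMO map current is easier when it is generated rather than drawn. The sketch below flattens role holders into rows that can be printed or exported to offline media. It assumes input objects shaped like the output of Get-ADForest and Get-ADDomain (properties Name, SchemaMaster, DomainNamingMaster, DNSRoot, PDCEmulator, RIDMaster, InfrastructureMaster); the function name and the export path are hypothetical.

```powershell
# Flatten forest- and domain-level role holders into rows for an offline document.
function ConvertTo-FsmoMap {
    param(
        [Parameter(Mandatory)]$Forest,    # needs Name, SchemaMaster, DomainNamingMaster
        [Parameter(Mandatory)]$Domains    # each needs DNSRoot, PDCEmulator, RIDMaster, InfrastructureMaster
    )
    # The two forest-wide roles.
    [pscustomobject]@{ Scope = $Forest.Name; Role = 'SchemaMaster';       Holder = $Forest.SchemaMaster }
    [pscustomobject]@{ Scope = $Forest.Name; Role = 'DomainNamingMaster'; Holder = $Forest.DomainNamingMaster }
    # The three domain-wide roles, once per domain.
    foreach ($d in $Domains) {
        foreach ($role in 'PDCEmulator', 'RIDMaster', 'InfrastructureMaster') {
            [pscustomobject]@{ Scope = $d.DNSRoot; Role = $role; Holder = $d.$role }
        }
    }
}

# Usage on a healthy DC with the ActiveDirectory module (not run here):
#   $forest  = Get-ADForest
#   $domains = $forest.Domains | ForEach-Object { Get-ADDomain -Identity $_ }
#   ConvertTo-FsmoMap -Forest $forest -Domains $domains |
#       Export-Csv E:\offline\fsmo-map.csv -NoTypeInformation
```

Regenerate the file whenever a role moves, and store the output with the other offline artifacts, not on a domain-joined share.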

Recovery point / recovery time targets, honestly

RPO is bounded by backup frequency; RTO is bounded by your timed rehearsal, not your aspiration.

Backup cadence	RPO (data-loss window)	Notes
Daily system state	Up to ~24 h	Common baseline; retain > dwell time
Twice-daily	Up to ~12 h	For high-change forests
Continuous VM-level (with DC caveats)	Minutes (data)	But VM revert has USN-rollback risk
Retention < dwell time	You lose the pre-compromise point	Attacker predates all your restore points

Designing an isolated recovery environment (IRE)

The IRE is the sealed room you rebuild the forest inside. Build it — and document it — before the incident; you cannot design isolation while an attacker watches your network. The IRE’s contract is bidirectional: nothing the attacker controls can reach it, and nothing inside it phones home to production until you deliberately lift the gap.

The five IRE requirements

Requirement	On-prem realisation	Azure realisation	Failure if you skip it
Isolated network	Physically separate switch/VLAN, no route to prod	Standalone VNet, no peering, NSG deny-all except intra-subnet	Restored DC re-compromised on the prod LAN
Independent DNS	Standalone resolver; DC hosts AD-integrated DNS	Same; no custom DNS pointing at prod	Names resolve against poisoned prod DNS
Independent time	Dedicated NTP/GPS appliance	Azure host time / dedicated NTP VM	Kerberos fails past 5-min skew
Clean management host	Freshly imaged jump box, not a prod admin PWS	Fresh VM, hardened, current OS	Your admin workstation may be compromised too
Capacity for the seed set	One DC per domain + staging host	Sized VMs for all domains + staging	Can’t restore all domains in parallel

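# Azure IRE skeleton: an isolated VNet (no peering created) behind a deny-all NSG.
az network vnet create -g rg-identity-dr -n vnet-ire \
  --address-prefix 10.250.0.0/16 --subnet-name snet-dc --subnet-prefix 10.250.1.0/24

az network nsg create -g rg-identity-dr -n nsg-ire-deny
# Deny both directions: the default NSG rules would otherwise allow outbound Internet traffic.
for direction in Inbound Outbound; do
  az network nsg rule create -g rg-identity-dr --nsg-name nsg-ire-deny \
    -n "DenyAll${direction}" --priority 4096 --access Deny --direction "${direction}" \
    --protocol '*' --source-address-prefixes '*' --destination-address-prefixes '*' \
    --destination-port-ranges '*'
done
# Then allow ONLY intra-subnet traffic explicitly at a higher priority (lower number).

# The NSG has no effect until it is associated with the subnet.
az network vnet subnet update -g rg-identity-dr --vnet-name vnet-ire -n snet-dc \
  --network-security-group nsg-ire-deny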

The staging vs production-restore model

Microsoft’s guidance and the ADFR tool support recovering into a staging environment that mirrors production naming, then either cutting over to it or using it to validate the procedure. In practice you choose one of these postures:

IRE posture	What it is	Pros	Cons
Restore-in-place (isolated LAN)	Restore onto original hardware after wiping, on an isolated segment	No new naming; familiar	Trusting the same hardware; slower to isolate
Parallel staging forest (same names)	Rebuild the forest in a fresh isolated network, same domain/DC names	Clean hardware; safe to work; becomes production	Must physically/logically retire the old kit
Cloud IRE (Azure VNet)	Restore into Azure isolated VNet, then extend/cutover	Fast to stand up; elastic capacity	Needs backups reachable in Azure; egress planning

Whichever you choose, the invariant holds: air-gapped until proven clean, then a single deliberate gap-lift onto cleansed segments.

The recovery procedure, step by step

This is the core of the runbook — the Microsoft AD Forest Recovery sequence, adapted with the specifics that trip teams up. Order matters; do not improvise it live. The table is the map; the subsections are the detail.

Step	Action	Primary tool	Gate before proceeding
0	Declare forest recovery; convene team; open offline runbook	Human decision	Trust gate met; leadership informed
1	Isolate — sever production; stand up the IRE	Network + IRE	No route between prod and IRE
2	Restore the first writable DC of the forest root into DSRM	`wbadmin`	Chosen backup < tombstone lifetime
3	Disable inbound/outbound replication on that DC	`repadmin /options`	Confirmed both flags set
4	Seize all five FSMO roles onto the restored DC	`Move-ADDirectoryServerOperationMasterRole -Force` / `ntdsutil`	All five now local
5	Clean metadata of every other DC	`ntdsutil metadata cleanup` / delete the DC object in Active Directory Users and Computers	No stale DC objects remain
6	Invalidate the RID pool; raise `rIDAvailablePool` if needed	ADSI (`invalidateRidPool` on RootDSE) / ADSI Edit	Pool invalidated; gap accounted for
7	Clean up DNS + purge lingering objects	DNS + `repadmin /removelingeringobjects`	No lingering objects, clean DNS
8	Reset `krbtgt` twice (per domain)	`Set-ADAccountPassword`	Two resets, replication/wait between
9	Reset trust passwords, DSRM, Tier 0, gMSA secrets	`netdom` / `ntdsutil` / `Set-ADAccountPassword`	All compromise-era secrets rotated
10	Repeat 2–9 for each child domain in order	(all above)	Every domain has one clean seed
11	Rebuild additional DCs fresh from the clean seed	`Install-ADDSDomainController`	Enough DCs for redundancy
12	Re-enable replication; converge inside the IRE	`repadmin /options -` + `/syncall`	`repadmin /replsummary` clean
13	Redistribute FSMO to intended holders; fix Sites & Services	`Move-ADDirectoryServerOperationMasterRole`	Topology matches target
14	Health gate: `dcdiag /e`, DNS, SYSVOL/DFSR, auth smoke tests	`dcdiag` / `repadmin` / `nltest`	Everything green
15	Lift the air gap onto cleansed segments; rejoin/reimage endpoints	Network + endpoint	Only after step 14 passes
16	Rebuild external trusts; restore member-server workloads	`netdom` / app teams	Trusts and apps validated

Step 1 — Isolate

Sever every path between production and the IRE. On-prem: pull the uplinks / disable the trunk to the recovery segment. Azure: confirm no VNet peering, no VPN/ExpressRoute reaching the IRE VNet, and a deny-by-default NSG. The restored DC must never see a production DC on first boot — a single inbound replication from a poisoned partner re-poisons your clean seed.

Step 2 — Restore the first writable DC of the forest root into DSRM

Always recover the forest root domain first: it holds the schema and the forest-wide roles, and every child depends on it. Boot the recovery server into Directory Services Restore Mode (DSRM) with the DSRM password you stored offline, then restore the chosen system-state backup from there. DSRM is essential: system-state recovery of a domain controller has to run with AD DS offline, and you do not want this DC advertising or replicating while its database is being replaced.

# Identify the system-state backup version on the recovery media, then restore it.
wbadmin get versions -backupTarget:E:

# Authoritative system-state recovery (also authoritatively restores SYSVOL in one shot).
wbadmin start systemstaterecovery -version:05/25/2026-09:00 -authsysvol -quiet

The -authsysvol switch performs an authoritative restore of SYSVOL, forcing this DC’s GPO/script content to win DFSR/FRS convergence when replicas come up. Boot options: use bcdedit /set safeboot dsrepair (then remove it after) or F8/advanced-boot to enter DSRM.

Restore choice	Command / flag	When to use
Non-authoritative restore	`wbadmin start systemstaterecovery` (no auth)	Rejoin a DC to an existing healthy forest (not this scenario)
Authoritative SYSVOL	`-authsysvol`	Forest recovery — make this DC’s SYSVOL win
Authoritative object restore	`ntdsutil authoritative restore` in DSRM	Reanimate specific deleted objects/subtrees
Boot into DSRM	`bcdedit /set safeboot dsrepair`	Every restore step here
Leave DSRM	`bcdedit /deletevalue safeboot`	After the DC is cleansed and ready
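
The tombstone gate in the step table comes down to date arithmetic, so it is worth scripting rather than eyeballing at 03:40. A minimal sketch follows. The helper name Test-BackupWithinTombstone is hypothetical, and the 180-day default is an assumption that you should replace with the tombstoneLifetime value recorded in your offline forest documentation:

```powershell
# Hypothetical helper: is a backup still younger than the tombstone lifetime?
function Test-BackupWithinTombstone {
    param(
        [Parameter(Mandatory)] [datetime] $BackupTime,
        [int] $TombstoneLifetimeDays = 180,   # assumption: use the value documented for YOUR forest
        [datetime] $Now = (Get-Date)
    )
    $ageDays = ($Now - $BackupTime).TotalDays
    [pscustomobject]@{
        AgeDays   = [math]::Round($ageDays, 1)
        LimitDays = $TombstoneLifetimeDays
        # A backup "from the future" (negative age) is treated as unusable too.
        Usable    = ($ageDays -ge 0) -and ($ageDays -lt $TombstoneLifetimeDays)
    }
}

# Example: a backup taken 25 May 2026 09:00, evaluated one week later.
Test-BackupWithinTombstone -BackupTime ([datetime]::new(2026, 5, 25, 9, 0, 0)) `
                           -Now        ([datetime]::new(2026, 6, 1, 9, 0, 0))
```

Run it against each candidate version listed by wbadmin get versions, and pick the newest one that is both usable and older than the suspected start of the compromise.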

Step 3 — Disable replication before the DC ever touches a network

Before this DC can reach any partner, block replication both ways so it cannot pull from — or push to — a compromised, currently-unreachable partner on first boot:

# Block inbound and outbound replication until the forest is rebuilt and trusted.
repadmin /options localhost +DISABLE_INBOUND_REPL +DISABLE_OUTBOUND_REPL

# Verify both flags are set.
repadmin /options localhost
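
Reading the flags by eye is easy to get wrong under pressure. The sketch below parses the text that repadmin /options prints and reports each flag separately. It assumes the flags appear on a line beginning "Current DSA Options:" (check this against your own output), and Test-ReplicationBlocked is a hypothetical helper name, not a built-in:

```powershell
# Hypothetical helper: confirm both replication-disable flags from repadmin output.
function Test-ReplicationBlocked {
    param([string[]] $RepadminOutput)

    # Assumption: the flags are listed on a "Current DSA Options:" line.
    $line = $RepadminOutput |
        Where-Object { $_ -match 'Current DSA Options' } |
        Select-Object -First 1

    $flags = @()
    if ($line) { $flags = @(($line -split ':', 2)[1].Trim() -split '\s+') }

    [pscustomobject]@{
        InboundBlocked  = $flags -contains 'DISABLE_INBOUND_REPL'
        OutboundBlocked = $flags -contains 'DISABLE_OUTBOUND_REPL'
    }
}

# On the restored DC (not executed here):
# Test-ReplicationBlocked -RepadminOutput (repadmin /options localhost)
```

Treat the step 3 gate as passed only when both properties come back True.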

Step 4 — Seize the FSMO roles

The original role holders are gone or untrusted. Seize (not transfer) all five roles onto this restored DC — transfer requires the old holder online, which by definition it is not.

# Seize all five FSMO roles onto the restored DC (PowerShell path).
Move-ADDirectoryServerOperationMasterRole `
  -Identity "DC-ROOT-RECOVERY-01" `
  -OperationMasterRole SchemaMaster,DomainNamingMaster,PDCEmulator,RIDMaster,InfrastructureMaster `
  -Force

# Confirm the seizure.
netdom query fsmo

-Force triggers the seizure path when the holder is unreachable. The classic ntdsutil roles → seize <role> commands do the same; the cmdlet is cleaner and scriptable. Understand what each role does so you know the blast radius of getting it wrong:

FSMO role	Scope	What it controls	Seizure note during recovery
Schema Master	Forest	Schema modifications	Seize once, onto the root seed
Domain Naming Master	Forest	Add/remove domains, cross-refs	Seize onto the root seed
PDC Emulator	Domain	Time, password chaining, lockout, GPO edits	Most critical — the DFSR primary; seize per domain
RID Master	Domain	Issues RID pools	Seize per domain; then invalidate pool (step 6)
Infrastructure Master	Domain	Cross-domain reference updates	Seize per domain; keep off GC if multi-domain
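
The "all five now local" gate can be checked mechanically rather than by reading netdom output. A minimal sketch, assuming you collect the role holders into a hashtable first; Test-FsmoAllLocal is a hypothetical helper name:

```powershell
# Hypothetical helper: confirm every FSMO role is held by the recovery seed.
function Test-FsmoAllLocal {
    param(
        [Parameter(Mandatory)] [hashtable] $RoleHolders,   # role name -> holder FQDN
        [Parameter(Mandatory)] [string]    $SeedDc
    )
    $required = 'SchemaMaster', 'DomainNamingMaster', 'PDCEmulator', 'RIDMaster', 'InfrastructureMaster'

    # A role is "stray" if it is missing from the map or held by any other DC.
    $stray = @($required | Where-Object { $RoleHolders[$_] -ne $SeedDc })

    [pscustomobject]@{
        AllLocal   = ($stray.Count -eq 0)
        StrayRoles = $stray
    }
}

# Building the map on the restored DC (not executed here):
# $f = Get-ADForest; $d = Get-ADDomain
# $roles = @{ SchemaMaster = $f.SchemaMaster; DomainNamingMaster = $f.DomainNamingMaster
#             PDCEmulator  = $d.PDCEmulator;  RIDMaster = $d.RIDMaster
#             InfrastructureMaster = $d.InfrastructureMaster }
# Test-FsmoAllLocal -RoleHolders $roles -SeedDc 'DC-ROOT-RECOVERY-01.contoso.com'
```

In a child domain only the three domain-scoped roles belong on that domain’s seed, so trim the required list accordingly before reusing the check there.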

Step 5 — Clean metadata of every other DC

AD still believes all the dead DCs exist — their NTDS Settings objects, computer accounts, DFSR/FRS members, and DNS records are stale and will poison replication and topology. Remove every DC except the one you restored.

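:: ntdsutil metadata cleanup: the authoritative scrub for orphaned DC objects.
:: The numeric indexes below are examples; use the numbers printed by each preceding "list" command.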
ntdsutil
  metadata cleanup
  connections
    connect to server DC-ROOT-RECOVERY-01
    quit
  select operation target
    list domains
    select domain 0
    list sites
    select site 0
    list servers in site
    select server 1
    quit
  remove selected server
  quit
quit

Since Windows Server 2008, deleting a dead DC’s computer object in Active Directory Users and Computers (or its NTDS Settings object in Sites and Services) triggers the same cleanup for most objects, but ntdsutil metadata cleanup remains authoritative when objects are truly orphaned. After removing the server objects, hunt the leftovers by hand:

Stale artifact	Where it lives	How to remove
NTDS Settings object	`Sites → <site> → Servers → <DC>` (Sites & Services)	Metadata cleanup / delete the server object
Computer account	Domain Controllers OU	`Remove-ADObject` after cleanup
DFSR/FRS member	SYSVOL subscription / member objects	Metadata cleanup handles; verify in ADSI Edit
DNS A / AAAA record	Forward zone	Delete via DNS console / `Remove-DnsServerResourceRecord`
DNS SRV / CNAME under `_msdcs`	`_msdcs.<forestroot>`	Delete stale `_ldap`, `_kerberos`, GUID CNAME records
Site link / subnet mapping	Sites & Services	Fix to match the recovery topology
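
With many dead DCs, walking ntdsutil by index for each one is slow and error-prone. ntdsutil also accepts the server object’s distinguished name directly after "remove selected server", so the target list can be generated from the offline DC inventory. A sketch under stated assumptions: Get-MetadataCleanupTarget is a hypothetical helper, the inventory objects carry Name and Site properties, and the names contain no characters that need DN escaping:

```powershell
# Hypothetical helper: build server-object DNs for every DC except the seed.
function Get-MetadataCleanupTarget {
    param(
        [Parameter(Mandatory)] [object[]] $Inventory,     # objects with Name and Site
        [Parameter(Mandatory)] [string]   $SeedDc,        # short name of the restored DC
        [Parameter(Mandatory)] [string]   $ForestRootDn   # e.g. DC=contoso,DC=com
    )
    foreach ($dc in $Inventory) {
        if ($dc.Name -eq $SeedDc) { continue }   # never clean up the seed itself
        "CN=$($dc.Name),CN=Servers,CN=$($dc.Site),CN=Sites,CN=Configuration,$ForestRootDn"
    }
}

# Example with a two-entry inventory (illustrative names only).
$inventory = @(
    [pscustomobject]@{ Name = 'DC-ROOT-RECOVERY-01'; Site = 'RecoverySite' }
    [pscustomobject]@{ Name = 'DC02';                Site = 'London' }
)
Get-MetadataCleanupTarget -Inventory $inventory -SeedDc 'DC-ROOT-RECOVERY-01' `
                          -ForestRootDn 'DC=contoso,DC=com'
```

Each DN it emits can then be fed to ntdsutil "metadata cleanup" "remove selected server <DN>" quit quit, one DC at a time, with a human confirming every removal.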

Step 6 — Invalidate the RID pool (and raise `rIDAvailablePool` if the gap is large)

A restored DC has an older RID pool than the forest had at compromise; if it starts minting SIDs it may reuse RIDs the lost DCs already issued, creating duplicate SIDs. Invalidate the current pool so this DC requests a fresh, higher block from the (now local) RID Master:

# Invalidate the current RID pool so the DC leases a fresh, higher block (run on the restored DC).
# invalidateRidPool is a rootDSE operational attribute; it takes the domain SID, not a flag value.
$domain    = New-Object System.DirectoryServices.DirectoryEntry
$domainSid = $domain.objectSid
$rootDSE   = New-Object System.DirectoryServices.DirectoryEntry("LDAP://RootDSE")
$rootDSE.UsePropertyCache = $false
$rootDSE.Put("invalidateRidPool", $domainSid.Value)
$rootDSE.SetInfo()

When Microsoft’s guidance calls for it (a large gap, or you know a big block was consumed pre-compromise), raise rIDAvailablePool directly so the forest skips past any RIDs that may already be in use. Do the arithmetic carefully: the value is a packed 64-bit number whose upper 32 bits hold the ceiling (the total number of RIDs the domain can ever issue) and whose lower 32 bits hold the RID at which the next allocated pool will start, so only the lower half is ever raised:

# Read the current rIDAvailablePool (packed 64-bit value).
Get-ADObject "CN=RID Manager$,CN=System,$((Get-ADDomain).DistinguishedName)" `
  -Properties rIDAvailablePool | Select-Object rIDAvailablePool
# Raise it (per the Microsoft Forest Recovery Guide's arithmetic) only when the gap warrants.
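
The packed-value arithmetic is where manual edits go wrong, so here is a sketch of the split and the raise as pure functions you can test before touching the directory. The function names are hypothetical, and the default increment of 100,000 reflects my reading of Microsoft’s forest recovery guidance; confirm the figure against the current guide before relying on it:

```powershell
# Hypothetical helpers for the packed 64-bit rIDAvailablePool value.
function ConvertFrom-RidAvailablePool {
    param([Parameter(Mandatory)] [long] $Value)
    [pscustomobject]@{
        Ceiling = [long]($Value -shr 32)             # upper 32 bits: total RIDs the domain can issue
        NextRid = [long]($Value -band 4294967295)    # lower 32 bits: start of the next pool
    }
}

function Get-RaisedRidAvailablePool {
    param(
        [Parameter(Mandatory)] [long] $Value,
        [long] $RaiseBy = 100000    # assumption: increment taken from Microsoft's guidance
    )
    $parts = ConvertFrom-RidAvailablePool -Value $Value
    $next  = $parts.NextRid + $RaiseBy
    if ($next -ge $parts.Ceiling) {
        throw "Raising by $RaiseBy would exceed the RID ceiling ($($parts.Ceiling))."
    }
    # Re-pack: the ceiling stays in the upper half, only the lower half moves.
    ($parts.Ceiling -shl 32) + $next
}

# Example with an illustrative value: ceiling 1073741823, next pool starting at 2100.
ConvertFrom-RidAvailablePool -Value 4611686014132422708
Get-RaisedRidAvailablePool   -Value 4611686014132422708
```

The result would be written back to the RID Manager$ object with Set-ADObject -Replace @{ rIDAvailablePool = <new value> }, and only after a second person has checked the number.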

RID concern	Symptom if ignored	Recovery action
Restored DC has old pool	Reuses RIDs → duplicate SIDs	Invalidate pool (always)
Large consumption pre-compromise	Even a fresh block may overlap	Raise `rIDAvailablePool` per guide
RID pool ~90% exhausted	Global RID exhaustion looms	Check with `dcdiag /test:ridmanager /v` post-recovery
Unsupported manual edit	Corrupt RID master state	Follow the guide’s exact arithmetic only

Duplicate SIDs are silent until two objects collide months later — do not skip this.

Step 7 — Clean DNS and purge lingering objects

Delete the stale DNS records enumerated in step 5, then enable strict replication consistency and purge any lingering objects so a reconnected/rebuilt DC cannot reintroduce garbage-collected objects:

# Enforce strict replication consistency (blocks replication of lingering objects).
repadmin /regkey * +strict

# Detect lingering objects against a known-good reference DC (advisory mode).
repadmin /removelingeringobjects <targetDC> <referenceDC-GUID> <NamingContext> /advisory_mode

# Then remove for real (drop /advisory_mode).
repadmin /removelingeringobjects <targetDC> <referenceDC-GUID> <NamingContext>

Step 8 — Reset `krbtgt` twice, per domain

This is the step that kills Golden Tickets, and the one most often botched. A single reset leaves the previous key valid (the account keeps a password history of two, and Kerberos validates tickets against both the current and the previous key), so TGTs forged with the stolen key still work. Only the second reset pushes the attacker-known key out of that history. Where the domain already has more than one DC, let the first reset replicate to every DC before issuing the second; on a single isolated seed there is no partner to wait for, so the second reset can follow as soon as the first has succeeded.

# First krbtgt reset with a random GUID-based password (36 characters including hyphens).
Set-ADAccountPassword -Identity krbtgt -Reset `
  -NewPassword (ConvertTo-SecureString (New-Guid).Guid -AsPlainText -Force)

# With replicas present, wait for the first reset to replicate everywhere; a lone seed has nothing to wait for.
# Then reset a SECOND time so the ORIGINAL (attacker-known) key drops out of the password history.
Set-ADAccountPassword -Identity krbtgt -Reset `
  -NewPassword (ConvertTo-SecureString (New-Guid).Guid -AsPlainText -Force)

Do this for the krbtgt of every domain. Microsoft ships a New-KrbtgtKeys.ps1 script that automates a safe, replication-aware double reset for larger environments; use it if your forest is big enough that manual timing is risky.

Reset mistake	What still works for the attacker	Correct action
Reset `krbtgt` once	Golden Tickets forged with the previous key	Reset twice
Reset twice with no wait between	The original key may not have aged out	Wait/replicate between the two resets
Reset only the root domain’s `krbtgt`	Golden Tickets in child domains	Reset per domain
Forget read-only DC `krbtgt_<n>` accounts	RODC-scoped tickets	Reset those too if RODCs existed
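
Because the account keeps only the current and previous key, "two resets" is something you can verify rather than trust. A minimal sketch, assuming the msDS-KeyVersionNumber attribute on krbtgt increments by one per reset (verify this in your lab); Test-KrbtgtDoubleReset is a hypothetical helper name:

```powershell
# Hypothetical helper: compare krbtgt key version numbers before and after the resets.
function Test-KrbtgtDoubleReset {
    param(
        [Parameter(Mandatory)] [int] $KvnBefore,   # recorded right after the restore
        [Parameter(Mandatory)] [int] $KvnAfter     # recorded after the resets
    )
    $resets = $KvnAfter - $KvnBefore
    [pscustomobject]@{
        Resets             = $resets
        OriginalKeyRetired = ($resets -ge 2)   # history of two: needs at least two resets
    }
}

# Reading the value on the restored DC (not executed here):
# (Get-ADUser krbtgt -Properties 'msDS-KeyVersionNumber').'msDS-KeyVersionNumber'

Test-KrbtgtDoubleReset -KvnBefore 7 -KvnAfter 9
```

Record the "before" value per domain in the incident log so the check is repeatable for every krbtgt account you reset.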

Step 9 — Reset trust, DSRM, Tier 0, and service-account secrets

Every credential that existed during compromise is assumed stolen. Rotate all of them while still isolated.

# Reset the trust secret on one side of an inter-domain / forest trust, then repeat on the other side with the SAME password.
netdom trust child.contoso.com /domain:contoso.com /resetOneSide /passwordT:<NewTrustPassword> /userO:Administrator /passwordO:*

# Set a new DSRM password on the DC.
ntdsutil
  set DSRM password
    reset password on server null
    quit
  quit
quit

# Reset the built-in Administrator and rotate Tier 0 accounts.
Set-ADAccountPassword -Identity Administrator -Reset `
  -NewPassword (ConvertTo-SecureString (New-Guid).Guid -AsPlainText -Force)

Secret to rotate	Why it’s suspect	Tool
`krbtgt` (per domain)	Golden Ticket master key	`Set-ADAccountPassword` (×2)
Domain/forest trust passwords	Enable cross-domain forged auth	`netdom trust /resetOneSide`
DSRM password	Local admin on the DC	`ntdsutil set DSRM password`
Built-in `Administrator` (per domain)	Prime target	`Set-ADAccountPassword`
Domain Admins / Enterprise Admins	Tier 0 principals	Reset each; review membership
gMSA / sMSA for Tier 0 services	Managed service secrets	`Reset-ADServiceAccountPassword` (sMSA); new KDS root key and recreated gMSAs
Service accounts with SPNs (Kerberoastable)	Offline-crackable	Reset; move to gMSA
`AdminSDHolder` ACL	Attacker persistence via SDProp	Reset ACL to default; audit protected groups
DSRM-to-domain admin mappings, backdoors	Persistence (SID history, ACL abuse)	Audit and remove

Step 10 — Recover each child domain, in order

Repeat steps 2–9 for each child domain once the forest root is up. The order is dictated by dependency: the root holds schema and forest roles; children need a reachable (isolated) root for their own recovery and DNS delegation, and a parent domain must be back before its own children. Among siblings, recover first any domain that hosts a global catalog or that other domains reference.

Recovery order	Domain	Why this order
1st	Forest root	Schema, forest FSMO, `_msdcs`, DNS root of trust
2nd	Child domains hosting shared services (GC, DNS delegation targets)	Others depend on them
3rd	Remaining child / regional domains	Fewer downstream dependencies
Last	Read-only DCs (RODCs)	Rebuilt fresh from clean writable DCs, never restored
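
The dependency ordering can be drafted from the domain list in the offline inventory. The sketch below puts the forest root first and then sorts parents ahead of their children by counting DNS labels. It is only a first draft of the order: it assumes a single contiguous namespace and knows nothing about which children host shared services, so promote those by hand. Get-DomainRecoveryOrder is a hypothetical helper name:

```powershell
# Hypothetical helper: forest root first, then shallower domains before deeper ones.
function Get-DomainRecoveryOrder {
    param(
        [Parameter(Mandatory)] [string]   $ForestRoot,
        [Parameter(Mandatory)] [string[]] $Domains
    )
    $sortKeys = @(
        @{ Expression = { $_ -ne $ForestRoot } }         # $false sorts first, so the root leads
        @{ Expression = { ($_ -split '\.').Count } }     # fewer labels = closer to the root
        @{ Expression = { $_ } }                         # alphabetical tie-break for readability
    )
    $Domains | Sort-Object -Unique | Sort-Object -Property $sortKeys
}

# Example with illustrative domain names.
Get-DomainRecoveryOrder -ForestRoot 'contoso.com' `
    -Domains 'emea.contoso.com', 'contoso.com', 'amer.contoso.com', 'lab.emea.contoso.com'
```

RODCs never appear in this list; they are rebuilt last, from clean writable DCs.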

Step 11 — Rebuild additional DCs fresh from the clean seed

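Do not restore additional DCs from backup; those backups carry the same compromise. Stand up new, fully patched, hardened Windows Server VMs in the IRE and promote each as a replica, pulling its database from the clean seed (optionally from install-from-media (IFM) media created on the cleansed seed, for speed). Promotion replicates from the seed, so clear DISABLE_OUTBOUND_REPL on the seed before promoting; inbound replication can stay blocked until step 12: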

# Promote a fresh server as an additional DC, replicating from the cleansed seed.
Install-ADDSDomainController `
  -DomainName "contoso.com" `
  -ReplicationSourceDC "DC-ROOT-RECOVERY-01.contoso.com" `
  -InstallDns `
  -SiteName "RecoverySite" `
  -InstallationMediaPath "C:\IFM-clean" `
  -SafeModeAdministratorPassword (Read-Host -AsSecureString) `
  -Force

Step 12 — Re-enable replication and converge inside the IRE

Once you have enough clean DCs for redundancy and confidence is high, re-enable replication on the seed and let convergence happen inside the gap:

# Re-enable replication once additional clean DCs exist, then force a sync.
repadmin /options localhost -DISABLE_INBOUND_REPL -DISABLE_OUTBOUND_REPL
repadmin /syncall /AdeP

Step 13 — Redistribute FSMO and fix Sites & Services

Move the FSMO roles from the emergency seed to their intended permanent holders, and reconfigure Sites and Services subnets and site links to match the topology you will cut back to. Confirm the DFSR primary (PDC emulator) authoritatively seeded SYSVOL.

Step 14 — Health gate (covered in the Verify section below)

Do not declare recovery complete on vibes — the objective health gate is its own section. Everything must be green before step 15.

Step 15–16 — Lift the gap, then rebuild the estate

Only after health validation do you lift the air gap, reconnecting on cleansed segments. Then treat the workforce estate as also compromised: rejoin or reimage member servers and workstations (their machine secrets are suspect), rebuild external trusts, and validate applications. This is often the longest tail of the whole event — measured in days.

Verify — the objective health gate

Run this gate before reintroducing the forest. Every check must be green; a forest that replicates but cannot issue tickets, or shares SYSVOL but serves a poisoned GPO, is not recovered.

# Replication health — look for any failures or large queues.
repadmin /replsummary
repadmin /showrepl * /csv | ConvertFrom-Csv | Where-Object { [int]$_.'Number of Failures' -gt 0 }

# Comprehensive DC diagnostics across all DCs in the enterprise.
dcdiag /v /c /e

# DNS-specific tests across all DCs.
dcdiag /test:dns /v /e

# SYSVOL / DFSR state and migration state.
dfsrmig /getmigrationstate
Get-WmiObject -Namespace "root\microsoftdfs" -Class dfsrreplicatedfolderinfo |
  Select-Object ReplicatedFolderName, State

# Authentication smoke test from a clean client.
klist purge
nltest /sc_verify:contoso.com
gpupdate /force

Gate	Command / check	Green looks like	Red means
Replication summary	`repadmin /replsummary`	0 failures, small deltas	Failed links → topology/DNS issue
Per-partner replication	`repadmin /showrepl * /csv`	No `Number of Failures > 0`	A partner not converging
DC diagnostics	`dcdiag /v /c /e`	All tests pass on every DC	Named failing test → fix that subsystem
DNS	`dcdiag /test:dns /v /e`	`_msdcs`, SRV, zones healthy	Missing SRV → clients can’t find DCs
SYSVOL share	`net share` shows SYSVOL & NETLOGON	Both shared on every DC	Not shared → DFSR didn’t seed
DFSR state	`Get-WmiObject dfsrreplicatedfolderinfo`	State = `4` (Normal)	Stuck initial sync → set primary flag
Kerberos	`nltest /sc_verify` + create/delete test object	Success; secure channel OK	Ticket failures → time skew or krbtgt
GPO apply	`gpupdate /force` on a test client	Policies apply cleanly	Failure → SYSVOL/GPO corruption
Time	`w32tm /query /status`	Skew < 5 min forest-wide	Skew → Kerberos will fail
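
The DFSR row of the gate depends on reading a numeric state correctly. A small sketch that names the state codes and reduces the collected results to a single pass/fail. The code-to-name mapping and the 'SYSVOL Share' folder name are stated from memory of the DFSR WMI provider, so verify them in your lab; both function names are hypothetical:

```powershell
# Hypothetical helpers: interpret DfsrReplicatedFolderInfo state codes for SYSVOL.
function Get-DfsrStateName {
    param([Parameter(Mandatory)] [int] $State)
    $names = @{
        0 = 'Uninitialized'; 1 = 'Initialized'; 2 = 'Initial Sync'
        3 = 'Auto Recovery'; 4 = 'Normal';      5 = 'In Error'
    }
    if ($names.ContainsKey($State)) { $names[$State] } else { "Unknown ($State)" }
}

function Test-SysvolConverged {
    # Accepts objects with ReplicatedFolderName and State, gathered from every DC.
    param([Parameter(Mandatory)] [object[]] $FolderInfo)
    $sysvol = @($FolderInfo | Where-Object { $_.ReplicatedFolderName -eq 'SYSVOL Share' })
    $notNormal = @($sysvol | Where-Object { $_.State -ne 4 })
    ($sysvol.Count -gt 0) -and ($notNormal.Count -eq 0)
}

# Example with hand-built sample data.
$sample = @(
    [pscustomobject]@{ ReplicatedFolderName = 'SYSVOL Share'; State = 4 }
    [pscustomobject]@{ ReplicatedFolderName = 'SYSVOL Share'; State = 2 }
)
$sample | ForEach-Object { Get-DfsrStateName -State $_.State }
Test-SysvolConverged -FolderInfo $sample
```

Returning False when no SYSVOL entry is found at all is deliberate: an empty result usually means the query failed, not that SYSVOL is healthy.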

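If SYSVOL did not converge, force the seed (PDC emulator) to be the authoritative DFSR primary. On the seed’s SYSVOL subscription object set msDFSR-Enabled to FALSE and msDFSR-Options to 1, and set msDFSR-Enabled to FALSE on every other DC. Then restart the DFSR service, run dfsrdiag pollad, and set msDFSR-Enabled back to TRUE on the seed first and on the others only after the seed has reinitialised. This is the DFSR equivalent of the old FRS D4/D2 burflags. The authoritative-SYSVOL controls differ by replication engine, and mixing them up leaves SYSVOL empty on every DC: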

Replication engine	“This DC is authoritative”	“This DC is non-authoritative”	Applied via
DFSR (modern)	`msDFSR-Enabled = FALSE` and `msDFSR-Options = 1` on the seed’s SYSVOL subscription, restart DFSRS, then `msDFSR-Enabled = TRUE`	`msDFSR-Enabled = FALSE`, leave `msDFSR-Options` clear, then `TRUE` once the seed has initialised	ADSI Edit on the subscription object, then `dfsrdiag pollad`
FRS (legacy — migrate off)	`BurFlags = D4` (`HKLM\...\NtFrs\Parameters\Backup/Restore\...`)	`BurFlags = D2`	Registry, then restart NtFrs
Verify shared	`net share` lists SYSVOL and NETLOGON	—	On every DC after convergence
Check state	DFSR state `4` (Normal) via WMI	FRS event 13516 (share published)	`Get-WmiObject dfsrreplicatedfolderinfo` / event log
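
Finding the right subscription object in ADSI Edit is the fiddly part of the DFSR row. A sketch that builds its distinguished name, assuming the DC’s computer account sits in the default Domain Controllers OU and the standard DFSR-LocalSettings layout; Get-SysvolSubscriptionDn is a hypothetical helper name:

```powershell
# Hypothetical helper: DN of a DC's SYSVOL subscription object.
function Get-SysvolSubscriptionDn {
    param(
        [Parameter(Mandatory)] [string] $DcName,     # short computer name of the DC
        [Parameter(Mandatory)] [string] $DomainDn    # e.g. DC=contoso,DC=com
    )
    # Assumption: default "Domain Controllers" OU and standard DFSR-LocalSettings container names.
    "CN=SYSVOL Subscription,CN=Domain System Volume,CN=DFSR-LocalSettings,CN=$DcName,OU=Domain Controllers,$DomainDn"
}

Get-SysvolSubscriptionDn -DcName 'DC-ROOT-RECOVERY-01' -DomainDn 'DC=contoso,DC=com'

# Setting the authoritative flag on the seed (not executed here):
# $dn = Get-SysvolSubscriptionDn -DcName 'DC-ROOT-RECOVERY-01' -DomainDn 'DC=contoso,DC=com'
# Set-ADObject -Identity $dn -Replace @{ 'msDFSR-Options' = 1 }
```

Confirm the object exists with Get-ADObject before writing to it; a DN that does not resolve usually means the DC object was moved out of the default OU.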

The ADFR automation tool

Restoring a multi-domain forest by hand through 03:40 fatigue is where mistakes creep in: a skipped metadata cleanup, a single krbtgt reset, a mistyped ntdsutil target. Semperis ships the Active Directory Forest Recovery (ADFR) tool to orchestrate and time the procedure. It does not replace understanding the steps (it enforces them consistently and produces an audit trail), but it materially reduces human error and shrinks a chaotic manual run into a guided one.

ADFR capability	What it does	Why it matters in recovery
Orchestrated restore	Drives restore of the first DC per domain from backup	Consistent, ordered, less manual `wbadmin` fiddling
Automated metadata cleanup	Removes stale DC objects programmatically	Eliminates the error-prone `ntdsutil` walk
FSMO seizure	Seizes roles onto the recovery DCs	No missed role
RID pool handling	Invalidates/raises the pool per guidance	Prevents duplicate SIDs
`krbtgt` double reset	Performs the two resets with correct waits	Kills the #1 recovery mistake
Isolation enforcement	Keeps recovery DCs from replicating prematurely	Protects the clean seed
Timing / reporting	Records how long each phase took	Turns aspirational RTO into a measured one
Repeatability for rehearsal	Same tool in the quarterly test as on the day	You practise the exact thing you’ll run

Aspect	Manual runbook	ADFR-assisted
Consistency under fatigue	Depends on the operator	Enforced by the tool
Multi-domain ordering	You track it	Tool tracks it
`krbtgt` double reset	Easy to do once and stop	Tool does both with timing
Audit trail	Manual notes	Generated log/report
Learning value	High (you know every step)	High (still runs the same steps)
Recommendation	Know it cold anyway	Use it and understand it

The rule: automation is a force multiplier over competence, never a substitute. Rehearse with ADFR so the tool and the team are both proven before the day you need them. And a runbook decays the moment AD changes, so its maintenance is a scheduled activity, not an afterthought:

Cadence	Activity	Validates	Owner
Daily	System-state backups run on ≥2 DCs/domain; report checked	Backups exist and are current	Backup team
Weekly	Immutability/MUA status; recovery-point age vs tombstone	Backups are restorable and protected	Backup + IAM
Monthly	Purple Knight / ITDR posture scan; remediate top findings	Attack surface shrinking	Security
Quarterly	Single-DC restore into the IRE, timed end to end	Backup restores as a DC; RTO of slow steps	AD + security
Semi-annually	Refresh FSMO map, DC inventory, trust list, DSRM passwords	Artifacts match reality	AD team
Annually	Full forest-recovery tabletop with named participants	Team, tooling, and decisions under pressure	AD + security + leadership
On material change	Re-vault artifacts; update the runbook	No drift between docs and forest	AD team

ITDR and Purple Knight — shrinking the odds you ever run this

The best forest recovery is the one you never execute because you detected and evicted the attacker before they reached Tier 0. Identity Threat Detection and Response (ITDR) is the discipline of continuously assessing AD’s security posture and detecting the attacks that precede a forest-wide event. Bolt it on before the incident.

Tool / capability	What it does	Where it fits
Purple Knight (Semperis, free)	Scans AD/Entra for 100+ indicators of exposure & compromise	Point-in-time posture assessment; run monthly
Semperis DSP / ADFR	Continuous AD threat monitoring + forest recovery automation	ITDR + the recovery tool itself
Microsoft Defender for Identity	Real-time detection of AD attacks (DCSync, Golden Ticket, recon)	SOC-integrated detection
Microsoft Security Copilot / Sentinel	Correlate identity signals; hunt with KQL	Investigation & response
Quest / Cayosoft / Cohesity	AD backup + recovery + change auditing	Backup and rollback tooling
Native auditing + `Get-ADReplAccount`/`DSInternals`	Detect DCSync, dump/compare secrets	DIY hunting and validation

Some of the specific indicators worth watching — most of these are also what an attacker does on the road to forest compromise, so catching them early is how you avoid the runbook entirely:

Indicator of exposure/compromise	Why it precedes a forest event	Detection
`krbtgt` password age very old	A never-rotated `krbtgt` means old Golden Tickets stay valid	`Get-ADUser krbtgt -Properties pwdLastSet`
DCSync from a non-DC principal	Attacker dumping all hashes (incl. `krbtgt`)	Defender for Identity; audit 4662 replication rights
Unconstrained delegation on odd hosts	TGT theft / privilege escalation path	Purple Knight; `Get-ADComputer -Filter {TrustedForDelegation -eq $true}`
`AdminSDHolder` ACL modified	SDProp-based persistence	Compare ACL to baseline
Privileged group membership changes	Attacker adding themselves to EA/DA	Event 4728/4732/4756; PIM review
SID history on unexpected accounts	Cross-domain privilege injection	`Get-ADUser -Filter * -Properties sidHistory`
Weak/reversible-encryption accounts	Easy credential theft	Purple Knight
Stale, over-privileged service accounts	Kerberoasting targets	Purple Knight; SPN inventory
Print Spooler on DCs	Coerced auth (PrinterBug)	Disable Spooler on DCs
Old SYSVOL FRS still in use	Legacy, fragile SYSVOL replication	`dfsrmig /getmigrationstate` — migrate to DFSR
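
The first indicator in the table is the easiest to automate. A minimal sketch that turns the krbtgt password timestamp into an age and a stale flag. The 180-day threshold is an assumption (a commonly used rotation target, not a figure from this article), and Get-KrbtgtPasswordAge is a hypothetical helper name:

```powershell
# Hypothetical helper: how old is the krbtgt password, and is it past the rotation target?
function Get-KrbtgtPasswordAge {
    param(
        [Parameter(Mandatory)] [datetime] $PasswordLastSet,
        [int] $MaxAgeDays = 180,          # assumption: set this to your own rotation policy
        [datetime] $Now = (Get-Date)
    )
    $age = [int][math]::Floor(($Now - $PasswordLastSet).TotalDays)
    [pscustomobject]@{
        AgeDays = $age
        Stale   = ($age -gt $MaxAgeDays)
    }
}

# Feeding it from the directory (not executed here):
# $k = Get-ADUser krbtgt -Properties PasswordLastSet
# Get-KrbtgtPasswordAge -PasswordLastSet $k.PasswordLastSet

Get-KrbtgtPasswordAge -PasswordLastSet ([datetime]::new(2020, 1, 1)) -Now ([datetime]::new(2021, 1, 1))
```

Run it per domain; each domain has its own krbtgt account and its own age.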

The link to recovery is direct: the same hygiene that makes AD hard to compromise (rotated krbtgt, no unconstrained delegation, DFSR SYSVOL, clean AdminSDHolder, tight Tier 0) also makes it faster and safer to recover, because there is less attacker persistence to hunt down after you restore.

Architecture at a glance

Picture the recovery as three concentric zones and a strict one-way flow of trust between them. In the outer zone sits production — the compromised forest you must assume the attacker still owns: every domain controller across every site, SYSVOL, DNS, member servers, and workstations, all suspect. Nothing flows out of this zone into recovery; the only artifact you ever take from it is the knowledge (topology, names) you already documented offline, never live data or live connections.

In the protected zone sits your backup estate — deliberately unreachable from production. This is the Azure Recovery Services vault with immutability Locked and MUA enabled, or the WORM/air-gapped media in separate custody. The critical property is directionality: backups were written into this store on a schedule, but production holds no standing credential to delete or alter them, so when ransomware swept the outer zone it could not reach in here. Alongside the backups, in a separate non-AD vault, live the recovery credentials and the runbook: DSRM passwords, the FSMO map, the DC inventory, the trust list, and break-glass admin access that does not authenticate against the dead forest.

In the inner zone sits the isolated recovery environment (IRE) — a sealed network (a standalone Azure VNet with no peering and a deny-by-default NSG, or a physically separate on-prem segment) with its own DNS and its own time source. The recovery flows strictly inward and forward: backups move from the protected zone into the IRE, where you restore the first writable DC of the forest root into DSRM, disable its replication, seize all five FSMO roles onto it, clean the metadata of every dead DC, invalidate the RID pool, reset krbtgt twice, and rotate every trust, DSRM, and Tier 0 secret. That single cleansed DC becomes the seed; you then build fresh replicas around it (never restore them from the same compromised backups), and repeat the whole cleanse for each child domain in dependency order — root first, shared-service children next, regional children last. Replication stays disabled until enough clean DCs exist, then converges inside the gap. Only after the objective health gate — repadmin /replsummary, dcdiag /e, DNS, SYSVOL/DFSR, and Kerberos smoke tests all green — do you lift the air gap once, deliberately, onto cleansed network segments, and begin the long tail of rejoining or reimaging the workforce estate whose machine secrets are also burned. The whole design is one-directional distrust: compromised production can never touch the backups or the IRE, and the IRE never phones home to production until you prove it is safe.

Real-world scenario

A global manufacturer — call them Helvern Industrial — ran a single-forest, three-domain AD: an empty forest root (helvern.net) and two regional child domains (emea.helvern.net, amer.helvern.net), fifteen DCs across nine sites, SYSVOL on DFSR, hybrid-synced to Entra ID for Microsoft 365. At 02:10 on a Sunday, a ransomware crew that had lurked for three weeks used a stolen Domain Admin credential and Helvern’s own Group Policy to push their payload, then encrypted every domain controller in both regions inside forty minutes. By the time the on-call engineer logged in, no DC answered LDAP and the crew had already issued delete requests against the Azure Backup recovery points using the same Tier 0 service account they had compromised.

What saved Helvern was a control enabled six months earlier for exactly this threat model: multi-user authorization (MUA) on the Recovery Services vault. The compromised account could request deletion of the recovery points; it could not approve it, because approval required a second principal with access to a resource guard held in a separate security tenant. The delete requests sat pending and harmless. Helvern’s backups — daily DC system-state, retained 35 days, all younger than the 180-day tombstone lifetime, and critically dating back before the three-week dwell — were intact.

They executed the runbook. Into a pre-built isolated Azure VNet (no peering, standalone DNS, dedicated NTP), they restored the forest root’s PDC-emulator DC into DSRM with -authsysvol, disabled replication, seized all five FSMO roles, cleaned metadata for the four other root DCs, invalidated the RID pool, reset krbtgt twice, and rotated the trust, DSRM, and Tier 0 secrets. Then they repeated it for emea and amer in turn, and rebuilt eight fresh replicas from the cleansed seeds using IFM media to avoid saturating the recovery link.

The constraint that nearly broke them was time. Their stated RTO was 8 hours, but the slow steps — metadata cleanup across a stale topology and the double krbtgt reset with its replication waits — had never been timed. The first real run took 14 hours to a healthy core forest, and the estate rebuild (rejoining 4,000 member servers and reimaging workstations whose machine secrets were suspect) took a further nine days. Three lessons went into the post-incident review, and they are the reason this article exists: first, MUA (and immutability) is what turned an extinction event into a long shift — without it the recovery points were gone and the answer would have been “pay or fold.” Second, an untested RTO is a guess; Helvern re-baselined to 12 hours for the core forest, scripted the metadata cleanup, adopted the ADFR tool, and instituted a quarterly single-DC restore test so the wait windows were known, not discovered. Third, the workforce estate is the long tail nobody plans for — the DCs were healthy in 14 hours, but “the business is running normally again” was nine days out, because every machine secret was burned too.

Advantages and disadvantages

Forest recovery is a capability, not a feature you toggle on. It is expensive to build and maintain, and you hope never to use it — but the asymmetry is stark: the cost of having it is a quarterly rehearsal and some immutable storage; the cost of not having it is the company.

Advantages	Disadvantages
The only complete answer to forest-wide compromise / ransomware	Complex, high-stakes, easy to get wrong under pressure
Restores a trusted forest, not a re-compromised one	Long RTO — 8–24 h to core, days for the estate
Immutable/MUA backups defeat backup-deletion attacks	Requires ongoing investment: IRE, immutable storage, rehearsals
Rehearsed procedure turns chaos into a checklist	Skipping any step (esp. double `krbtgt`) can silently fail
Kills attacker persistence (Golden Tickets, trusts, Tier 0)	Requires deep AD expertise on the recovery team
Timed RTO gives leadership a defensible commitment	Data loss up to the RPO (backup cadence)
ADFR + ITDR reduce both error rate and likelihood	Endpoint/member-server rebuild is a huge separate effort

When each matters: the advantages dominate for any organisation that would cease to function without AD — which is nearly all of them. The disadvantages bite hardest on teams that treat the runbook as a document rather than a rehearsed drill; an un-rehearsed forest-recovery plan has roughly the reliability of no plan at all, because the first time you discover the DSRM password is undocumented or the RTO is triple your estimate is the worst possible time.

Hands-on lab: rehearse a single-DC restore into an isolated environment

You cannot safely rehearse a full multi-domain forest recovery against production, but you can, and must, rehearse the mechanics on a throwaway lab forest, which validates the backup, the DSRM restore, FSMO seizure, metadata cleanup, and the krbtgt double-reset, and times them. Do this quarterly against real backups of a lab that mirrors production. Everything below runs on three Windows Server VMs (LAB-DC1, LAB-DC2, and the fresh LAB-DC3 promoted in step 7) in an isolated Hyper-V/Azure network, never on production.

Step 1 — Stand up a lab forest (two DCs, isolated network)

# On LAB-DC1 (fresh Windows Server, isolated vSwitch, static IP):
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
Install-ADDSForest -DomainName "lab.local" `
  -SafeModeAdministratorPassword (Read-Host -AsSecureString) `
  -InstallDns -Force

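# On LAB-DC2 (after joining lab.local; sign in as a lab.local domain admin):
Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
Install-ADDSDomainController -DomainName "lab.local" `
  -SafeModeAdministratorPassword (Read-Host -AsSecureString) -InstallDns -Force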

Expected: Get-ADDomainController -Filter * lists both DCs; repadmin /replsummary shows zero failures.

Step 2 — Seed data and take a system-state backup

# Create some objects so you can prove they survive the restore.
1..50 | ForEach-Object { New-ADUser -Name "labuser$_" -Enabled $true `
  -AccountPassword (ConvertTo-SecureString "P@ssw0rd$_!x" -AsPlainText -Force) }

# System-state backup of LAB-DC1 to a dedicated disk.
wbadmin start systemstatebackup -backupTarget:E: -quiet
wbadmin get versions -backupTarget:E:

Expected: wbadmin get versions lists a system-state version with a timestamp.

Step 3 — Simulate loss and isolate

Snapshot the VMs (so you can roll the lab back after the drill), then shut down LAB-DC2 and treat LAB-DC1 as “the surviving backup to restore.” Disconnect the lab vSwitch from anything else. Start a stopwatch — you are timing the recovery.

Step 4 — Restore LAB-DC1 into DSRM and disable replication

# Boot into DSRM.
bcdedit /set safeboot dsrepair
Restart-Computer -Force
# ... after reboot, log in with the DSRM (SafeMode) password ...

# Authoritative system-state restore including SYSVOL.
wbadmin start systemstaterecovery -version:<your-version> -authsysvol -quiet

# On next normal boot, disable replication immediately.
bcdedit /deletevalue safeboot
Restart-Computer -Force
# After reboot:
repadmin /options localhost +DISABLE_INBOUND_REPL +DISABLE_OUTBOUND_REPL

Expected: the 50 lab users are present after restore; replication options show both DISABLE flags.

Step 5 — Seize FSMO, clean metadata for the dead DC

# Seize all roles onto the restored DC.
Move-ADDirectoryServerOperationMasterRole -Identity "LAB-DC1" `
  -OperationMasterRole SchemaMaster,DomainNamingMaster,PDCEmulator,RIDMaster,InfrastructureMaster -Force
netdom query fsmo

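# Remove LAB-DC2's metadata (it's "gone"). Adjust the site name if the lab does not use the default site.
ntdsutil "metadata cleanup" "remove selected server CN=LAB-DC2,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=lab,DC=local" quit quit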
# Verify it's gone from Sites & Services / the DC list.
Get-ADDomainController -Filter *

Expected: netdom query fsmo shows all five roles on LAB-DC1; LAB-DC2 no longer listed.
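
For a scripted pass/fail instead of reading the `netdom` output by eye, a minimal sketch follows. `Get-MisplacedFsmoRole` is a hypothetical helper; the commented usage assumes the ActiveDirectory module and the lab's `LAB-DC1.lab.local` name.

```powershell
# Hypothetical helper: returns the names of roles NOT held by the expected DC.
function Get-MisplacedFsmoRole {
    param(
        [Parameter(Mandatory)] [hashtable] $RoleHolders,   # role name -> holder FQDN
        [Parameter(Mandatory)] [string] $ExpectedHolder
    )
    foreach ($role in $RoleHolders.Keys) {
        # -ne on strings is case-insensitive, which suits DNS names.
        if ($RoleHolders[$role] -ne $ExpectedHolder) { $role }
    }
}

# On LAB-DC1:
# $forest = Get-ADForest
# $domain = Get-ADDomain
# $holders = @{
#     SchemaMaster         = $forest.SchemaMaster
#     DomainNamingMaster   = $forest.DomainNamingMaster
#     PDCEmulator          = $domain.PDCEmulator
#     RIDMaster            = $domain.RIDMaster
#     InfrastructureMaster = $domain.InfrastructureMaster
# }
# Get-MisplacedFsmoRole -RoleHolders $holders -ExpectedHolder 'LAB-DC1.lab.local'
```

No output means all five roles landed on the seed; any role name printed is one the seizure missed.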

Step 6 — Invalidate the RID pool and reset `krbtgt` twice

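# Invalidate the RID pool by writing the domain SID to the rootDSE invalidateRidPool operation.
$Domain = New-Object System.DirectoryServices.DirectoryEntry
$DomainSid = $Domain.objectSid
$RootDSE = New-Object System.DirectoryServices.DirectoryEntry("LDAP://RootDSE")
$RootDSE.UsePropertyCache = $false
$RootDSE.Put("invalidateRidPool", $DomainSid.Value)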

# Reset krbtgt twice (in the lab, a short wait suffices to demonstrate the two-step).
Set-ADAccountPassword -Identity krbtgt -Reset -NewPassword (ConvertTo-SecureString (New-Guid).Guid -AsPlainText -Force)
Start-Sleep -Seconds 30    # in production this is a replication/aging wait, not 30s
Set-ADAccountPassword -Identity krbtgt -Reset -NewPassword (ConvertTo-SecureString (New-Guid).Guid -AsPlainText -Force)
Get-ADUser krbtgt -Properties PasswordLastSet | Select-Object PasswordLastSet

Expected: PasswordLastSet shows a timestamp from the last minute or two.

Step 7 — Rebuild a fresh replica and validate

# On LAB-DC1: re-enable replication first; promotion fails while the source DC rejects outbound replication.
repadmin /options localhost -DISABLE_INBOUND_REPL -DISABLE_OUTBOUND_REPL

# On a NEW LAB-DC3 VM: promote as a replica from the cleansed seed.
Install-ADDSDomainController -DomainName "lab.local" `
  -ReplicationSourceDC "LAB-DC1.lab.local" -InstallDns `
  -SafeModeAdministratorPassword (Read-Host -AsSecureString) -Force

# Back on LAB-DC1: force a sync.
repadmin /syncall /AdeP

# Health gate.
repadmin /replsummary
dcdiag /v /c /e

Expected: repadmin /replsummary clean; dcdiag passes. Stop the stopwatch — record the elapsed time. That number, not your hope, is your RTO baseline for this phase.
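
If a wall-clock stopwatch feels too coarse, per-phase timings can be captured in the same session. The sketch below is one way to do it; `Measure-RecoveryPhase` is a hypothetical helper built on the .NET `Stopwatch` class, and the phase names are only examples.

```powershell
# Hypothetical helper: runs one recovery phase and returns how long it took.
function Measure-RecoveryPhase {
    param(
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [scriptblock] $Action
    )
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    & $Action | Out-Host          # show the phase's output, keep it out of the result
    $stopwatch.Stop()
    [pscustomobject]@{
        Phase   = $Name
        Minutes = [math]::Round($stopwatch.Elapsed.TotalMinutes, 2)
    }
}

# Example: time the health gate and add it to a running list.
# $timings = @()
# $timings += Measure-RecoveryPhase -Name 'Health gate' -Action { repadmin /replsummary; dcdiag /v /c /e }
# ($timings | Measure-Object -Property Minutes -Sum).Sum    # total scripted minutes
```

Manual waits (reboots, DSRM logins) still need the stopwatch; this only covers the steps you run as commands.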

Step 8 — Teardown

# Roll the lab VMs back to the pre-drill snapshot, or delete them entirely.
# Azure: az group delete -n rg-adfr-lab --yes --no-wait

The point of the lab is not to “pass” it — it is to time the slow steps and surface the surprises (an undocumented DSRM password, a metadata-cleanup snag, a SYSVOL that won’t converge) in a lab on a Tuesday rather than in production at 03:40.

Common mistakes & troubleshooting

The playbook: symptom → root cause → how to confirm → fix. Most of these are the difference between a recovery that works and one that quietly fails or re-compromises.

#	Symptom	Root cause	Confirm (command / where)	Fix
1	Golden Tickets still work after recovery	`krbtgt` reset once, not twice	`Get-ADUser krbtgt -Properties pwdLastSet`; only one reset in logs	Reset `krbtgt` a second time (with wait), per domain
2	Restored DC re-encrypted / re-compromised on boot	Restored onto the production network, not isolated	Check routing/peering to the recovery segment	Rebuild in a true IRE; no route to production
3	`wbadmin` refuses to restore the backup	Backup older than tombstone lifetime	`Get-ADObject ... tombstoneLifetime`; compare to backup date	Use a newer valid backup; fix retention going forward
4	Duplicate SIDs appear weeks later	RID pool not invalidated on the restored DC	RID collision events; `dcdiag /test:ridmanager`	Invalidate pool; raise `rIDAvailablePool` per guide
5	Replication fails with lingering-object errors	Lingering objects reintroduced from a stale DC	`repadmin /showrepl` error 8606/8614	`repadmin /removelingeringobjects`; enable strict consistency
6	New DC can’t find partners / clients can’t log in	Stale DNS `_msdcs` SRV records after metadata cleanup	`dcdiag /test:dns`; inspect `_msdcs` zone	Delete stale SRV/CNAME/A; re-register (`ipconfig /registerdns`, restart Netlogon)
7	SYSVOL not shared; GPOs don’t apply	DFSR didn’t seed authoritatively	`net share` (no SYSVOL); `dfsrmig /getmigrationstate`	On the seed set `msDFSR-Enabled=FALSE` and `msDFSR-Options=1`, restart DFSR, then set `msDFSR-Enabled=TRUE`; bring the other DCs up non-authoritatively
8	Kerberos auth fails everywhere post-recovery	Time skew > 5 min between DCs/clients	`w32tm /query /status`; compare clocks	Fix NTP hierarchy; PDC emulator as authoritative time root
9	Backup recovery points deleted during the attack	Backups reachable/deletable by compromised Tier 0	Vault soft-delete/immutability status	Enable immutability Locked + MUA; move to offline/WORM
10	FSMO seize fails or roles “still” on a dead DC	Tried to transfer (needs old holder online)	`netdom query fsmo` shows dead holder	Use `-Force` / `ntdsutil roles seize`; then metadata-clean the dead holder
11	Restored DC shows old data / changes ignored by partners	USN rollback (VM snapshot revert outside VM-GenID)	Event 2095/1113; `repadmin /showrepl` anomalies	Never revert DCs via snapshot; rebuild the DC fresh
12	“The forest is up” but attacker persists	Skipped Tier 0 / trust / gMSA / AdminSDHolder rotation	Review privileged group members, `AdminSDHolder` ACL, SPNs	Rotate all Tier 0 secrets; reset AdminSDHolder ACL; audit SID history
13	DSRM login impossible during restore	DSRM password undocumented/forgotten	You can’t authenticate to DSRM	Pre-incident: document/rotate DSRM (`ntdsutil set DSRM password`) and vault it
14	Recovery takes 3× the planned RTO	RTO never timed; slow steps unknown	Compare drill time to stated RTO	Time a quarterly restore; re-baseline RTO with margin
15	Poisoned GPO re-deploys malware after recovery	SYSVOL restored non-authoritatively / from compromised copy	Inspect GPO edit history; compare SYSVOL contents	Authoritative SYSVOL restore (`-authsysvol`) from a clean point; audit GPOs
16	Child domain recovery fails / can’t validate	Recovered children before the forest root	Root not yet up when child restore ran	Recover root first, then children in dependency order
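
Row 3 is the easiest of these to check ahead of time. The sketch below compares a backup's date with the forest's tombstone lifetime. `Test-BackupWithinTombstoneLifetime` is a hypothetical helper; the commented lookup assumes the ActiveDirectory module, and the fallback to 60 days reflects the legacy default that applies when the attribute is not set.

```powershell
# Hypothetical helper: is a backup still younger than tombstone lifetime?
function Test-BackupWithinTombstoneLifetime {
    param(
        [Parameter(Mandatory)] [datetime] $BackupDate,
        [Parameter(Mandatory)] [int] $TombstoneLifetimeDays,
        [datetime] $Now = (Get-Date)
    )
    $ageDays = ($Now - $BackupDate).TotalDays
    [pscustomobject]@{
        AgeDays    = [math]::Floor($ageDays)
        Restorable = $ageDays -lt $TombstoneLifetimeDays
    }
}

# Reading the real value from the forest:
# $config = (Get-ADRootDSE).configurationNamingContext
# $tsl = (Get-ADObject "CN=Directory Service,CN=Windows NT,CN=Services,$config" `
#         -Properties tombstoneLifetime).tombstoneLifetime
# if (-not $tsl) { $tsl = 60 }   # attribute not set: legacy default applies
# Test-BackupWithinTombstoneLifetime -BackupDate '2024-01-15' -TombstoneLifetimeDays $tsl
```

Running this against every recovery point in the vault on a schedule turns the retention rule into an alert rather than a surprise during a restore.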

Best practices

Back up system state on ≥2 DCs per domain, one a FSMO holder, daily — and retain longer than your realistic attacker dwell time, so a restore point predates their first foothold, not just detonation.
Make at least one backup copy immutable and offline — Azure vault immutability Locked + MUA, or WORM/air-gapped media in separate custody. A backup Tier 0 can delete is not a backup.
Keep every restorable DC recovery point younger than tombstone lifetime (know your value — 180 or the legacy 60) and confirm it explicitly.
Store DSRM passwords, the FSMO map, DC inventory, trust list, and the runbook offline and outside the forest — in a break-glass vault that does not authenticate against AD.
Pre-build the IRE — isolated network with no route to production, independent DNS and NTP, a clean jump box — and document it now, not during the incident.
Reset krbtgt twice, per domain, with a wait between — bake it into the runbook and use New-KrbtgtKeys.ps1 or ADFR so it can’t be done once and forgotten.
Always recover the forest root first, then children in dependency order; rebuild replicas fresh from the clean seed, never from the same compromised backups.
Rotate every compromise-era secret — trusts, DSRM, built-in Administrator, Domain/Enterprise Admins, gMSA/sMSA, Kerberoastable service accounts — and reset the AdminSDHolder ACL.
Run the health gate before lifting the air gap — repadmin /replsummary, dcdiag /e, DNS, SYSVOL/DFSR, and Kerberos smoke tests all green, every time.
Rehearse quarterly (single-DC restore, timed) and tabletop annually (full forest, named participants) — an un-rehearsed runbook is unreliable by default.
Adopt ADFR to enforce the sequence and an ITDR posture (Purple Knight monthly, Defender for Identity continuous) to reduce the odds you ever run it.
Migrate SYSVOL off FRS to DFSR, kill unconstrained delegation, and disable Print Spooler on DCs — the hygiene that prevents compromise also speeds recovery.
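
The replication part of the health gate can be scripted so that it fails loudly instead of relying on someone reading the summary. This is a sketch under assumptions: `Get-FailingReplicationLink` is a hypothetical helper, and the column names are those `repadmin /showrepl * /csv` is expected to emit, so confirm them against your own output before relying on it.

```powershell
# Hypothetical helper: returns replication links that report failures.
# Input rows come from `repadmin /showrepl * /csv | ConvertFrom-Csv`.
function Get-FailingReplicationLink {
    param([Parameter(Mandatory)] [object[]] $ShowReplRows)

    $ShowReplRows |
        Where-Object { [int]$_.'Number of Failures' -gt 0 } |
        Select-Object 'Source DSA', 'Destination DSA', 'Naming Context',
                      'Number of Failures', 'Last Failure Status'
}

# In the IRE, before lifting the air gap:
# $rows = repadmin /showrepl * /csv | ConvertFrom-Csv
# $failing = @(Get-FailingReplicationLink -ShowReplRows $rows)
# if ($failing.Count -gt 0) {
#     $failing | Format-Table -AutoSize
#     throw 'Health gate failed: replication errors present.'
# }
```

This covers replication only; `dcdiag`, DNS, SYSVOL/DFSR, and the Kerberos smoke tests still need their own checks.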

Security notes

Tier 0 isolation is the prevention that makes recovery rare. Administer DCs only from hardened, isolated Privileged Access Workstations; never with a credential that also touches Tier 1/2. See Privileged Identity Management: PIM & PAM Architecture and Privileged Access Management: Vaulting, Session Brokering & Credential Rotation.
Break-glass access must not depend on the forest you might lose. Maintain cloud-side or HSM-anchored emergency access so you can operate the IRE tooling even with AD dead — the pattern in Entra Break-Glass Emergency Access: Monitoring & Governance.
Protect the backups as Tier 0 assets. Immutability, MUA, and network isolation of the backup store are security controls, not just DR features — they are what defeat the modern backup-deletion playbook.
Assume every credential and machine secret is burned. After recovery, rotate all Tier 0 secrets, reset krbtgt twice, and rejoin/reimage member servers and workstations — a machine account or cached credential the attacker holds is re-entry.
Audit the directory for persistence after restore. Check AdminSDHolder, SID history, privileged group membership, delegation settings, and GPO edit history — the point of forest recovery is to erase attacker persistence, which a naive restore preserves.
Encrypt and segregate the recovery artifacts. The runbook, DSRM passwords, and vault credentials are themselves high-value targets; store them with strict, separately-controlled access.
Least privilege on the backup and recovery paths. The account that writes backups should not be able to delete them (that is what MUA enforces); the IRE management identity should be scoped to the IRE only.

Cost & sizing

Forest-recovery readiness has three cost buckets: immutable backup storage, the (usually idle) IRE, and the human time to build and rehearse. All three are trivial against the cost of the alternative.

Immutable backup storage dominates the recurring bill. DC system-state backups are small (a few GB to tens of GB per DC depending on directory and SYSVOL size), so an Azure Recovery Services vault holding 30–60 days of daily points across your DCs typically costs hundreds to low thousands of INR per month in storage, plus egress when you restore. MUA and immutability add no direct charge; they are configuration.
The IRE is near-zero at rest. In Azure, the isolated VNet, NSG, and DNS cost effectively nothing when idle; you pay for DC VMs only during a rehearsal or a real recovery. Budget a handful of B-/D-series VMs for the seed set during the quarterly drill (a few hundred INR for a day). On-prem, it’s a switch and some standby capacity you likely already own.
The real cost is expertise and rehearsal time — a quarterly half-day drill and an annual tabletop for the AD/security team. That is the expensive part, and the non-negotiable one.

Cost driver	What you pay for	Rough INR / month	Notes
Immutable backup vault	System-state recovery points, 30–60 d	~hundreds–low thousands	Small data; MUA/immutability free
Restore egress	Data out during a restore/rehearsal	Per-GB, occasional	Only when you restore
IRE at rest	Isolated VNet, NSG, DNS	~0	Elastic; pay for VMs only when used
IRE during a drill	Seed-set VMs for a day	~hundreds (per drill)	Quarterly, not continuous
ADFR / ITDR tooling	Semperis/Quest/Defender for Identity	Varies (licensed)	Purple Knight is free; DSP/Defender licensed
Human rehearsal time	Quarterly drill + annual tabletop	(staff time)	The true cost and the true value

Sizing the seed set: one DC per domain for the initial restore, plus a staging/management host, then enough fresh replicas that each domain has at least two DCs before you lift the gap. A three-domain forest therefore needs three restored seeds plus three-to-six fresh replicas in the IRE during recovery — size the isolated network’s capacity for that peak, even though it sits idle the rest of the year.
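
The arithmetic above can be written down as a small calculator. `Get-IreSeedSetSize` is a hypothetical helper; its defaults (two DCs per domain, one staging host) are assumptions taken from the paragraph above, not fixed rules.

```powershell
# Hypothetical helper: peak VM count the IRE must hold during recovery.
function Get-IreSeedSetSize {
    param(
        [Parameter(Mandatory)] [ValidateRange(1, 1000)] [int] $DomainCount,
        [ValidateRange(2, 100)] [int] $DcsPerDomain = 2,   # target before lifting the gap
        [ValidateRange(0, 100)] [int] $StagingHosts = 1
    )
    # One restored seed per domain; every other DC is a fresh replica.
    $freshReplicas = $DomainCount * ($DcsPerDomain - 1)
    [pscustomobject]@{
        RestoredSeeds = $DomainCount
        FreshReplicas = $freshReplicas
        StagingHosts  = $StagingHosts
        PeakVmCount   = $DomainCount + $freshReplicas + $StagingHosts
    }
}

Get-IreSeedSetSize -DomainCount 3                   # 3 seeds, 3 replicas, 1 staging host
Get-IreSeedSetSize -DomainCount 3 -DcsPerDomain 3   # 3 seeds, 6 replicas, 1 staging host
```

`PeakVmCount` is the number to size IRE compute and address space against.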

Interview & exam questions

1. When is full forest recovery the right response, versus object restore or a single-DC rebuild? Full forest recovery is warranted only when you cannot prove the directory, schema, and SYSVOL are clean on every DC — i.e., forest-wide compromise (all DCs encrypted, krbtgt stolen, Golden Tickets in play). Object restore (AD Recycle Bin, Restore-ADObject) fits bounded deletions in a trusted forest; single-DC rebuild fits one dead DC when the rest is trusted. The decision gate is trust, not blast radius.

2. Why must you reset krbtgt twice during recovery, and what happens if you reset it only once? The krbtgt account keeps password history (the current and immediately previous keys both validate Kerberos tickets), so a single reset leaves the previous — attacker-known — key valid and Golden Tickets still working. Resetting twice, with a replication/aging wait between, retires the original key entirely. It must be done per domain.

3. What is an isolated recovery environment (IRE) and why is it mandatory for forest recovery? An IRE is a pre-built, air-gapped network — no routing/peering to production, independent DNS and NTP, a clean management host — where you restore and cleanse DCs. It’s mandatory because restoring a clean DC onto the network the attacker still owns simply re-compromises it; the IRE guarantees nothing hostile reaches the seed and the seed never phones home until proven clean.

4. Why do you seize (not transfer) FSMO roles during recovery, and which roles? All five roles (Schema Master, Domain Naming Master, PDC Emulator, RID Master, Infrastructure Master). You seize because transfer requires the current holder online, and in a forest recovery the holders are dead or untrusted. Seizure (Move-ADDirectoryServerOperationMasterRole -Force or ntdsutil roles seize) forces the roles onto the restored DC regardless.

5. What is the tombstone lifetime and how does it constrain your backups? It’s how long a deleted object persists as a tombstone before garbage collection — default 180 days (legacy forests may be 60). A system-state backup older than tombstone lifetime is un-restorable as a DC, because reviving it would reanimate objects the forest already purged. So every restorable recovery point must be younger than that value.

6. Why invalidate the RID pool on a restored DC, and what’s the risk if you don’t? A restored DC carries an older RID pool than the forest reached at compromise; if it mints new SIDs it may reuse RIDs the lost DCs already issued, producing duplicate SIDs — silent until two principals collide. Invalidating the pool (and raising rIDAvailablePool when the gap is large) forces a fresh, non-overlapping block.

7. How do multi-user authorization (MUA) and immutability protect backups from ransomware? Modern ransomware crews delete backups before detonating, using compromised Tier 0. Immutability (Locked) prevents deleting or shortening retention of recovery points before expiry; MUA requires a second, separately-controlled approver for protective changes, so a compromised backup admin can request but not approve deletion. Together they keep a pre-compromise restore point alive.

8. What is a lingering object and how do you deal with it during recovery? A lingering object exists on one DC but was deleted (and garbage-collected) everywhere else — typically reintroduced by a DC restored from too-old a backup or reconnected after being offline past tombstone lifetime. Enable strict replication consistency (repadmin /regkey * +strict) and purge with repadmin /removelingeringobjects against a known-good reference DC.

9. In a multi-domain forest, what order do you recover in and why? Forest root first — it holds the schema, forest-wide FSMO roles, _msdcs, and the DNS/trust root everything depends on — then child domains in dependency order (shared-service/GC-hosting children before purely regional ones). RODCs are rebuilt fresh from clean writable DCs, never restored.

10. Why must you rebuild additional DCs fresh rather than restore them from backup? Every backup taken during the compromise window carries the same poison (weaponised SYSVOL, backdoored objects, attacker persistence). Restoring more DCs from those backups reintroduces the compromise. You restore one clean seed per domain, then promote fresh, hardened replicas that pull a clean copy from it.

11. What does the ADFR tool do for you, and what does it not remove from your responsibility? ADFR (AD Forest Recovery) orchestrates and times the recovery: ordered restore, metadata cleanup, FSMO seizure, RID handling, the double krbtgt reset, isolation enforcement, and reporting — cutting human error and giving a measured RTO. It does not remove the need to understand the procedure, to have valid immutable backups, or to build the IRE; automation multiplies competence, it doesn’t replace it.

12. After the core forest is healthy, why isn’t the incident over? Because every member server and workstation’s machine secret and cached credentials are also suspect — re-entry vectors — so the estate must be rejoined or reimaged, external trusts rebuilt, and applications validated. In real incidents the DCs are healthy in hours but “business as usual” is days-to-weeks out; the endpoint/member-server rebuild is the long tail nobody budgets for.

These map to SC-300 (Identity and Access Administrator) and AZ-800/AZ-801 (Windows Server Hybrid Administrator) — AD DS operations, backup/recovery, and hybrid identity — and to security certifications (SC-200, CISSP domains on BC/DR and identity) for the ransomware-resilience and ITDR angle.

Quick check

You confirm every DC in the forest is encrypted and krbtgt was dumped via DCSync a week ago. Which recovery type do you invoke, and what is the single decision gate?
You reset krbtgt once during recovery and move on. Why are you still exposed, and what exactly must you do?
Your only backup of the forest root is 210 days old and tombstone lifetime is 180. Can you restore it as a DC? What does this tell you to fix?
Name three properties the isolated recovery environment must have before you restore the first DC into it.
After restoring the seed DC, you skip invalidating the RID pool. What silent problem have you planted, and how does it eventually surface?

Answers

Full forest recovery. The decision gate is trust — you cannot prove any DC’s directory/schema/SYSVOL is clean, and a stolen krbtgt means forest-wide Golden Tickets — so object restore or single-DC rebuild would restore into an attacker-controlled forest.
A single krbtgt reset leaves the previous key (which the attacker knows) still valid because Kerberos validates against current and previous keys and the account keeps that history. Golden Tickets still work. You must reset krbtgt a second time, per domain, with a replication/aging wait between the two resets so the original key is fully retired.
No — a backup older than the 180-day tombstone lifetime is un-restorable as a DC, because reviving it would reanimate garbage-collected objects. This tells you to fix retention/cadence immediately so you always keep a restorable recovery point younger than tombstone lifetime (and older than realistic attacker dwell time).
Any three of: no route/peering to production (true air gap); independent DNS; independent NTP/time source (Kerberos needs < 5-min skew); a clean, freshly-imaged management host; and capacity for the full seed set (one DC per domain plus staging).
You’ve planted duplicate SIDs: the restored DC’s old RID pool can mint RIDs the lost DCs already issued. It’s silent until two principals end up with the same SID and collide (access or replication errors) weeks or months later. Fix by invalidating the pool (and raising rIDAvailablePool if the gap is large) before the DC issues new objects.

Glossary

Forest recovery — the full teardown and rebuild of an entire AD forest from pre-compromise backups, in isolation, when no DC can be trusted.
System-state backup — a consistent OS-assembled snapshot of NTDS.dit, SYSVOL, registry, COM+, and boot files; the only backup restorable as a DC.
IFM (Install From Media) — an ntdsutil-generated media set used to seed a new DC without full WAN replication; not a DC backup.
DSRM (Directory Services Restore Mode) — a boot mode with AD offline, using a local password, in which you perform restores and authoritative restores.
Authoritative restore — marking restored data (objects or SYSVOL) as the winning version so it wins replication convergence.
FSMO (Flexible Single Master Operations) — the five single-holder roles (Schema, Domain Naming, PDC Emulator, RID, Infrastructure Master); seized onto the restored DC in recovery.
krbtgt — the account whose key encrypts every Kerberos TGT; a stolen hash yields Golden Tickets. Reset twice in recovery, per domain.
Golden Ticket — a forged TGT created offline with a stolen krbtgt hash, valid for any user for years; invalidated only by resetting krbtgt twice.
RID pool — a block of Relative Identifiers a DC leases from the RID Master to mint SIDs; invalidated on a restored DC to prevent duplicate SIDs.
Tombstone / tombstone lifetime — a deleted object’s residual state and the window (default 180 days) before garbage collection; backups older than this are un-restorable.
Lingering object — an object present on one DC but deleted and garbage-collected everywhere else; must be purged or it poisons replication.
Metadata cleanup — removal of a dead DC’s objects (NTDS Settings, computer, DFSR member, DNS) from the directory, via ntdsutil or by deleting the DC’s computer object in Active Directory Users and Computers.
USN rollback — silent replication divergence caused by reverting a DC to a snapshot outside the VM-GenerationID mechanism, re-using update numbers.
IRE (Isolated Recovery Environment) — a pre-built, air-gapped network with independent DNS/time where the forest is restored and cleansed.
MUA (Multi-User Authorization) — a backup control requiring a second, separately-controlled approver for protective changes, defeating compromised-admin deletion.
ADFR (Active Directory Forest Recovery) — Semperis tooling that orchestrates and times the recovery steps, reducing human error.
ITDR (Identity Threat Detection and Response) — the discipline and tooling (Purple Knight, Defender for Identity, Semperis DSP) for assessing AD posture and detecting the attacks that precede a forest event.
AdminSDHolder / SDProp — a container and process that stamps a template ACL onto protected (privileged) objects; a common attacker persistence vector to audit and reset.

Next steps

You can now build, harden, and rehearse a forest-recovery runbook. Build outward:

Foundation: Active Directory Domain Services: Forest Design & DC Promotion on Azure — a well-designed forest is far easier to recover; get the topology, DNS, and DC placement right first.
The general pattern: Ransomware Resilience: Immutable Backup & Isolated Recovery Environment — the cross-platform blueprint this article specialises for Active Directory.
Fit it into IR: Security Incident Response Runbooks, Tabletops & Cloud Forensics — wire the forest-recovery runbook into your broader incident-response program and tabletops.
Prevent it: Privileged Identity Management: PIM & PAM Architecture and Privileged Access Management: Vaulting, Session Brokering & Credential Rotation — Tier 0 isolation is what keeps you out of the IRE in the first place.
Break-glass: Entra Break-Glass Emergency Access: Monitoring & Governance — emergency access that survives even a dead forest.
Hybrid impact: Entra Connect Sync Deep Dive: PHS, PTA & Seamless SSO — understand what breaks in the cloud when the on-prem forest goes down, and how sync recovers.