Diagnosing AD Replication and FSMO Failures with repadmin and dcdiag

When Active Directory replication breaks, nothing fails loudly at first. Logons keep working off cached credentials, group policy keeps applying from the last good SYSVOL copy, and DNS keeps resolving because the records were already there. Then three weeks later a password reset does not take on half your domain controllers, a Kerberos ticket gets issued against a stale group membership, an account you disabled is still able to authenticate against one site, and repadmin starts spitting 8606 while the forest quietly diverges. By the time a human notices, you may already have lingering objects re-animating deleted principals, and if you let a domain controller roll its database backward you can poison the entire topology. Replication failures are the most consequential class of Active Directory incident precisely because they are silent, cumulative, and — when mishandled — self-propagating.

This is the triage playbook I run when a replication or FSMO (Flexible Single Master Operations) problem lands on my desk. It assumes Windows Server 2019/2022/2025 domain controllers, Domain Admin rights (plus Enterprise Admin for forest-wide operations), and that you understand the difference between fixing a symptom and fixing a cause. The two workhorse tools are repadmin (which reports and manipulates the replication engine directly) and dcdiag (which runs a battery of correctness tests against a DC’s role as a directory server, DNS resolver, and Kerberos endpoint). Every command in the early sections is read-only. The destructive steps — seizing a role, removing lingering objects, force-demoting a rolled-back DC, metadata cleanup — come later and each carries a warning, because the entire job is judgement about which DC holds the authoritative copy of the data before you delete anything.

By the end you will stop guessing. When repadmin /replsummary shows a wall of failures you will know whether you face a DNS resolution problem wearing an AD error code (8524), a blocked RPC dynamic port (1722), a genuine lingering object needing authoritative cleanup (8606), a DC that quarantined itself after a snapshot revert (event 2095), or a dead FSMO holder that needs seizing before you can issue new RIDs. Knowing which within the first ten minutes is what separates a contained incident from a weekend forest rebuild.

What problem this solves

Active Directory is a distributed database with no single source of truth by design: every writable DC accepts changes and reconciles them with every other DC through multimaster replication. That design is a strength: any DC can service a password change, a group edit, a new computer join, and the change propagates everywhere without a central bottleneck. It becomes a liability the moment reconciliation stops, because now you have N copies of “the truth” that disagree, and the platform will not tell you which one is right. It only tells you they differ, in the form of an error code on a replication link.

What breaks without this knowledge: an on-call engineer sees 8606 and “removes lingering objects” pointed at the wrong reference DC, deleting live accounts across the forest. Or they see 2095 (USN rollback) and clear the Dsa Not Writable registry flag to “fix replication,” re-injecting a rolled-back database and corrupting the topology permanently. Or they seize the RID Master from a DC that later comes back online, producing two DCs handing out overlapping RID pools and duplicate security principals. Every one of these is a well-intentioned action that turns a recoverable incident into an unrecoverable one. The failure mode of AD troubleshooting is not inaction — it is confident wrong action.

Who hits this: anyone operating on-premises or hybrid Active Directory at any scale beyond a single DC. It bites hardest on estates with multiple sites over WAN links (where a link outage lets a DC fall behind the tombstone lifetime), virtualised DCs (where snapshot reverts and P2V/clone operations cause USN rollback), and hybrid environments where Entra Connect is synchronising a diverging directory upward into the cloud. The fix is almost never “restart the Netlogon service” — it is “find the mechanism that stopped converging, confirm which DC is authoritative, and act with the specific tool that respects the replication invariants.”

To frame the whole field before the deep dive, here is every symptom class this article covers, the question it forces, and the first place to look:

Symptom class	What it usually means	First question to ask	First command to run	Most common single cause
Replication failing on a link	A specific partner pair cannot exchange changes	Is it DNS, RPC, or a directory problem?	`repadmin /showrepl <DC> /verbose`	DNS resolution of the partner’s CNAME
Forest-wide failures	Many links failing at once	Is one DC sick or the topology broken?	`repadmin /replsummary`	One DC unreachable or KCC not building links
8606 storm	A DC is offering references to deleted objects	Which DC has stale (lingering) data?	`repadmin /removelingeringobjects ... /advisory_mode`	A DC offline longer than tombstone lifetime
DC quarantined itself	Event 2095, `Dsa Not Writable` set	Was this DB restored improperly?	`reg query ...NTDS\Parameters /v "Dsa Not Writable"`	VM snapshot revert / clone → USN rollback
Can’t create users / RIDs exhausted	RID pool cannot be refilled	Is the RID Master alive and reachable?	`dcdiag /test:RidManager /v`	RID Master down or replication to it broken
FSMO operation fails	Schema change, RID issue, PDC-dependent op fails	Who holds the role and is it reachable?	`netdom query fsmo`	FSMO holder offline; role needs transfer/seize
SYSVOL not replicating	GPOs/scripts diverge across DCs	Is this FRS or DFSR, and is it healthy?	`dfsrmig /getglobalstate` / DFSR event log	DFSR backlog or a leftover FRS migration

Learning objectives

By the end of this article you can:

Model multimaster replication end to end — USNs, the high-watermark vector, the up-to-dateness vector, invocation IDs, propagation dampening, and how the KCC/ISTG builds the connection topology across sites and site links — well enough to reason about any convergence failure.
Read replication health top-down with repadmin /replsummary, then drill into a single DC with repadmin /showrepl /verbose, inspect the pending queue with repadmin /queue, and force convergence with repadmin /syncall /AdeP.
Run the correct dcdiag test suite for a replication incident (Replications, Intersite, DNS, KnowsOfRoleHolders, FSMOCheck, RidManager, KccEvent) and interpret each result.
Decode the error codes that account for most cases — 8524, 1722, 1256, 8606, 8456/8457, 8451, 8453, 1908, 1396 — and know for each whether it is a DNS, network, directory, or security problem.
Detect and remove lingering objects safely with advisory mode first, and enable strict replication consistency fleet-wide as a permanent guardrail.
Recognise USN rollback from event 2095 and the Dsa Not Writable flag, and execute the only correct recovery (force-demote and rebuild) rather than the fatal one (clearing the flag).
Enumerate the five FSMO roles, decide correctly between a graceful transfer and a unilateral seize, perform both via Move-ADDirectoryServerOperationMasterRole and ntdsutil, and complete metadata cleanup afterward.

Prerequisites & where this fits

You should already understand Active Directory Domain Services fundamentals: that a forest is the security and schema boundary, a domain is a partition and replication/administrative boundary within it, and that a domain controller hosts writable copies of several naming contexts (partitions): the Schema NC and Configuration NC (forest-wide), the Domain NC (per domain), plus application partitions such as the DNS zones (DomainDnsZones, ForestDnsZones). You should be comfortable in an elevated PowerShell prompt on a DC, able to read event logs, and know that the RSAT Active Directory module (Import-Module ActiveDirectory) and the classic tools (repadmin, dcdiag, ntdsutil, netdom, nltest) ship with the AD DS role and RSAT.

This sits at the operational heart of the on-premises identity track. It assumes the build knowledge from Building an AD DS Forest the Right Way: Deployment, FSMO, and a Tiered Admin Model, because you cannot troubleshoot a topology you did not design deliberately. It leans heavily on Highly Available DNS and DHCP on Windows Server, End to End, because the majority of “replication” incidents are DNS incidents in disguise. Kerberos correctness depends on time, so Accurate Hybrid Time Sync: chrony on Linux and w32time in Active Directory is upstream of the authentication symptoms here. In a hybrid estate, Microsoft Entra Connect Sync Deep Dive: Designing Hybrid Identity with PHS, PTA, and Seamless SSO is downstream — a diverging on-prem directory becomes a diverging cloud directory.

A quick map of who confirms what during an incident, so you open the right tool first:

Layer	What lives here	Tool that inspects it	Failure classes it causes
Name resolution (DNS)	DC A/CNAME records, `_msdcs` SRV	`dcdiag /test:DNS`, `nslookup`	8524, “target not found”, KCC can’t build links
Network / RPC	135/TCP endpoint mapper + dynamic ports	`repadmin /bind`, `portqry`	1722, 1256, RPC-over-SMB fallbacks
Replication engine	USN, UTDV, connection objects	`repadmin /showrepl`, `/queue`	8451, 8456, stuck queues, no inbound partners
Directory data	Objects, tombstones, lingering objects	`repadmin /removelingeringobjects`	8606, re-animated deleted objects
Database (ESE / NTDS.dit)	The Jet database, transaction logs	Directory Service event log, `esentutl`	8451, USN rollback (2095), journal wrap
FSMO roles	Schema/Naming/RID/PDC/Infrastructure masters	`netdom query fsmo`, `dcdiag /test:FSMOCheck`	Can’t extend schema, RID exhaustion, time skew
SYSVOL	Policies and scripts (DFSR or FRS)	`dfsrmig`, DFSR event log, `Get-DfsrBacklog`	GPO/script divergence, no logon scripts

Core concepts

Six mental models make every later diagnosis obvious. Skipping them is why engineers guess.

USNs are local, monotonic, and load-bearing. Every DC keeps a monotonically increasing 64-bit counter, the Update Sequence Number. Each originating write (a real change made on that DC) and each replicated-in change increments the local USN and stamps the affected attribute with the value. USNs are strictly local — DC-A’s USN 50,000 has nothing to do with DC-B’s USN 50,000. The USN is the currency of “how far along am I,” and everything about replication convergence is bookkeeping over USNs.

The high-watermark tracks per-partner progress. When DC-A replicates a partition from DC-B, it remembers the highest USN it has seen from DC-B for that partition — the high-watermark vector (HWMV). On the next cycle DC-A sends that watermark, and DC-B only ships changes with a USN higher than it. This is what makes replication incremental instead of a full re-sync every fifteen seconds.

The up-to-dateness vector prevents redundant transfers. The high-watermark alone is not enough in a mesh, because DC-A might already have a change indirectly via DC-C. So each DC also keeps an up-to-dateness vector (UTDV): for every DC that has ever originated a change to a partition (keyed by that DC’s invocation ID), the highest originating USN this DC has received, from any path. DC-A sends its UTDV, and DC-B uses it to skip changes DC-A already has by another route. The HWMV answers “where did you and I leave off”; the UTDV answers “what do I already have regardless of source.” Together they give propagation dampening — no loops, no redundant transfers — in an arbitrary mesh.
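The two filters can be sketched in a few lines of Python. This is a simplified model, not the actual replication protocol: the `Change` type, the `changes_to_send` function, and the invocation ID strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    local_usn: int    # USN the source DC assigned when it committed the change locally
    origin_id: str    # invocation ID of the database that originated the change
    origin_usn: int   # USN of the change on the originating database
    attribute: str


def changes_to_send(source_changes, high_watermark, utdv):
    """Return the changes a source DC should ship to one destination.

    high_watermark: highest source-local USN the destination already processed.
    utdv: dict of origin invocation ID -> highest originating USN the
          destination already holds, received by any path.
    """
    to_send = []
    for change in sorted(source_changes, key=lambda c: c.local_usn):
        if change.local_usn <= high_watermark:
            continue  # HWMV: covered in an earlier cycle with this partner
        if change.origin_usn <= utdv.get(change.origin_id, 0):
            continue  # UTDV: destination already has it via another route
        to_send.append(change)
    return to_send


# DC-B's change log as seen by DC-A (hypothetical values).
dc_b_changes = [
    Change(100, "inv-B", 100, "description"),
    Change(101, "inv-C", 7001, "telephoneNumber"),  # replicated in from DC-C
    Change(102, "inv-B", 102, "unicodePwd"),
]

# DC-A last saw DC-B at USN 100 and already holds DC-C's changes up to 7001.
sent = changes_to_send(dc_b_changes, high_watermark=100,
                       utdv={"inv-B": 100, "inv-C": 7001})
```

Only the `unicodePwd` change is shipped: USN 100 is below the watermark, and the change that originated on DC-C is dampened because DC-A already received it by another path.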

The invocation ID identifies the database instance, not the DC. Each writable copy of the NTDS.dit database has an invocation ID (a GUID). It is not the DC’s GUID — it identifies this specific instance of the database. When a DC is restored properly (from a system-state backup, or on a VM-GenerationID-aware host after a snapshot revert), AD resets the invocation ID so partners treat it as a “new” source and re-send anything it might have lost. The invocation ID is the safety mechanism that makes UTDV bookkeeping survive a restore. USN rollback is what happens when the database moves backward but the invocation ID does not change — partners think they already have those USNs, dampen the “new” changes, and the rolled-back DC silently stops originating updates. That is the single most dangerous failure in this article.
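To see why an unchanged invocation ID is the dangerous part, here is a toy Python model of the partner-side check. It is a sketch under simplifying assumptions (a single UTDV lookup stands in for the whole exchange), and the function name and invocation ID strings are hypothetical.

```python
def partner_requests_change(utdv, origin_id, origin_usn):
    """True if a partner would ask for this originating change.

    False means the change is dampened: the partner's UTDV says it already
    holds everything from that database instance up to a higher USN.
    """
    return origin_usn > utdv.get(origin_id, 0)


# Before the incident the partner holds DC-A's changes up to USN 50,120.
partner_utdv = {"inv-A-1": 50120}

# DC-A is reverted to an image taken at USN 49,000 and keeps its invocation ID.
# Its next originating write reuses USN 49,001, which the partner believes it has.
rolled_back_write = ("inv-A-1", 49001)

# A supported restore resets the invocation ID, so the same USN comes from
# what partners treat as a brand-new source.
restored_write = ("inv-A-2", 49001)

rollback_is_dampened = not partner_requests_change(partner_utdv, *rolled_back_write)
restore_is_accepted = partner_requests_change(partner_utdv, *restored_write)
```

Both flags evaluate to true in this model: the rolled-back write is silently ignored, while the same write under a fresh invocation ID replicates normally.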

The KCC and ISTG build the topology you replicate over. You do not hand-wire who replicates from whom; the Knowledge Consistency Checker (KCC), running on every DC every fifteen minutes, computes the intra-site connection objects so that no two DCs in a site are more than three hops apart. For inter-site replication, one DC per site is elected the Inter-Site Topology Generator (ISTG) and builds the connections across site links you define. Sites model network locality (a set of subnets), site links model the WAN paths between them (with a cost, a schedule, and a replication interval, default 180 minutes), and bridgehead servers are the DCs that actually carry inter-site traffic. Get the site/subnet mapping wrong and DCs replicate over expensive or nonexistent paths, or a new DC never gets a connection object at all.
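A quick way to sanity-check a subnet-to-site map offline is to model it in Python. The sketch below assumes that when subnets overlap the most specific one wins; the site names, subnets, and the `site_for_address` helper are hypothetical.

```python
import ipaddress


def site_for_address(ip, subnet_to_site):
    """Return the site whose subnet most specifically contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None  # (network, site) of the longest matching prefix so far
    for cidr, site in subnet_to_site.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, site)
    return best[1] if best else None


# Hypothetical subnet objects and the sites they map to.
subnets = {
    "10.0.0.0/8": "HQ",
    "10.20.0.0/16": "Branch-East",
}

branch_dc_site = site_for_address("10.20.5.10", subnets)
unmapped_dc_site = site_for_address("192.168.1.5", subnets)
```

An address that returns `None` here is the case to look for: a DC whose IP falls in no defined subnet is the one most likely to end up with a bad or missing topology.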

Tombstones, deletion, and the two lifetimes. When you delete an object, AD does not remove it immediately; it converts it to a tombstone (stripped of most attributes, marked deleted) so the deletion can replicate. The tombstone persists for the tombstone lifetime — 180 days on any forest first built at Server 2003 SP1 or later — after which garbage collection physically removes it. If a DC is offline longer than the tombstone lifetime, it never learns of deletions that garbage-collected elsewhere; when it returns it still holds those objects as live, and offering them to partners produces the lingering object class of failure (error 8606). (Modern forests with the AD Recycle Bin enabled keep a recoverable, attribute-complete “deleted” state before the tombstone stage, but the lifetimes and the lingering-object risk are unchanged.)
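The isolation threshold is simple date arithmetic. A minimal Python sketch, assuming the 180-day default quoted above (read the real value from your own forest before relying on it); the function name is hypothetical.

```python
from datetime import datetime, timedelta

TOMBSTONE_LIFETIME_DAYS = 180  # default quoted above; your forest may differ


def exceeded_tombstone_lifetime(last_success, now,
                                lifetime_days=TOMBSTONE_LIFETIME_DAYS):
    """True if a DC has gone longer than the tombstone lifetime without
    a successful inbound replication."""
    return (now - last_success) > timedelta(days=lifetime_days)


now = datetime(2026, 6, 7)
long_isolated = exceeded_tombstone_lifetime(datetime(2025, 11, 1), now)
recently_synced = exceeded_tombstone_lifetime(datetime(2026, 5, 19), now)
```

A DC for which this returns true should be treated as a lingering-object risk before it is allowed to replicate outbound again.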

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to troubleshooting
USN	Local 64-bit change counter	Per DC, in NTDS.dit	The unit of “how far along”; rollback = disaster
High-watermark (HWMV)	Highest USN seen from a specific partner	Per partner, per partition	Drives incremental “give me changes since X”
Up-to-dateness vector (UTDV)	Highest originating USN per source DB (invocation ID)	Per partition	Propagation dampening; skips redundant changes
Invocation ID	GUID for a database instance	Per NTDS.dit copy	Must change on restore; unchanged + rewound = USN rollback
KCC	Builds intra-site connection objects	On every DC (15 min)	Broken KCC → no inbound partners → no replication
ISTG	Builds inter-site connection objects	One elected DC per site	Broken ISTG → sites stop replicating to each other
Site / site link	Network locality / WAN path model	Configuration NC	Wrong subnet-to-site map → bad or missing topology
Bridgehead	DC that carries inter-site traffic	Per site	Overloaded/failed bridgehead → inter-site backlog
Tombstone	A deleted object awaiting garbage collection	Every partition	Object outlives it on an isolated DC → lingering object
Tombstone lifetime	Days a tombstone survives (default 180)	Configuration NC attribute	The threshold past which a stale DC is dangerous
Lingering object	A deleted object surviving on an isolated DC	The stale DC	Offered to partners → error 8606
FSMO role	One of five single-master operations	Specific role-holder DCs	Loss blocks schema/RID/PDC-dependent operations
Connection object	“Replicate inbound from partner X”	Under a DC’s NTDS Settings	Missing/orphaned → a link never runs

How multimaster replication actually works

You cannot diagnose what you cannot model, so this section walks a single change from origination to convergence, then names the places it can stall. Suppose an administrator resets a user’s password on DC-A.

DC-A performs an originating write: it updates unicodePwd and related attributes, increments its local USN (say to 50,120), and stamps each changed attribute with (originating DSA = DC-A's invocation ID, originating USN = 50,120, version = n+1, timestamp). The local USN and per-attribute version number are the two counters that later decide conflicts.
DC-A’s change is now in its copy of the Domain NC. Nothing has left the box yet. Intra-site, a change notification is queued: DC-A will notify its replication partners after a short delay (default 15 seconds, historically 5 minutes on older builds) so bursts coalesce.
DC-B, a notified partner, calls back to DC-A over RPC asking for changes to the Domain NC, presenting its high-watermark for DC-A and its full UTDV. DC-A replies with every change past DC-B’s high-watermark that DC-B does not already have per the UTDV — here, the password reset.
DC-B applies the change. If DC-B’s current version for unicodePwd is lower, DC-A’s wins and DC-B takes it, incrementing DC-B’s own local USN and updating its UTDV entry for DC-A’s invocation ID to 50,120. If the versions tie, the higher timestamp wins; if those tie too, the higher originating invocation-ID GUID wins — deterministic, so every DC resolves the conflict identically without coordination.
DC-B, having changed, now notifies its partners, and the change fans out across the mesh, each hop dampened by UTDV so nobody re-applies it and no loop forms. Inter-site, there is no notification by default (to conserve WAN): the destination bridgehead polls on the site-link schedule/interval.
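The tie-break order in the walk-through (version, then timestamp, then originating invocation ID) maps naturally onto a tuple comparison. The Python sketch below is illustrative only: `AttributeStamp` and `winner` are hypothetical names, and comparing GUIDs as integers is a simplification of however the directory orders them internally.

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class AttributeStamp:
    version: int
    timestamp: datetime
    origin_id: UUID   # invocation ID of the originating database
    value: str


def winner(a, b):
    """Pick the surviving value: highest version, then latest timestamp,
    then highest originating GUID (compared here as an integer)."""
    def key(stamp):
        return (stamp.version, stamp.timestamp, stamp.origin_id.int)
    return a if key(a) >= key(b) else b


earlier = datetime(2026, 6, 7, 2, 10, 44)
later = datetime(2026, 6, 7, 2, 10, 45)
low_guid, high_guid = UUID(int=1), UUID(int=2)

# A higher version wins even though its timestamp is older.
newer_version = AttributeStamp(7, earlier, low_guid, "reset-on-DC-A")
older_version = AttributeStamp(6, later, high_guid, "old-value")
resolved = winner(newer_version, older_version)
```

Because the comparison depends only on data carried with the change, every DC that evaluates the same pair reaches the same answer regardless of the order in which the changes arrived.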

That is the happy path. Here is where each step can stall, which is your diagnostic map:

Step that stalls	Symptom you observe	Underlying cause	Confirm with
DC-B can’t resolve DC-A	Link fails, error 8524	DNS: missing/stale CNAME or A record	`dcdiag /test:DNS`, `nslookup <guid>._msdcs.<forest>`
DC-B can’t RPC to DC-A	Link fails, error 1722 / 1256	Firewall/dynamic-port block, dead partner	`repadmin /bind DC-A`, `portqry -n DC-A -e 135`
DC-A offers a reference DC-B lacks	Link fails, error 8606	Lingering object on DC-A	`repadmin /removelingeringobjects ... /advisory_mode`
No connection object exists	DC-B has no inbound partner for the NC	KCC/ISTG not building links (site config, KCC error)	`repadmin /showrepl`, `dcdiag /test:KccEvent`
Notifications don’t fire	Changes converge only every poll interval	Notification disabled or DC not in a site	`repadmin /showconn`, check `Site` membership
Change never applied	Object differs across DCs indefinitely	ESE/database error on destination	`repadmin /showobjmeta`, Directory Service log
Rolled-back DC dampens changes	It stops originating; partners look “fine”	USN rollback (invocation ID unchanged)	Event 2095, `Dsa Not Writable` registry flag

Notification, polling, and the numbers that matter

The two knobs that shape when a change lands are notification (push) intra-site and polling (pull) inter-site. Know the defaults so “it hasn’t replicated yet” versus “it’s stuck” is a decision, not a hope:

Parameter	Scope	Default	Where set	Troubleshooting relevance
Change notification delay (first)	Intra-site	15 s (modern)	Registry / forest functional level	Sub-minute intra-site convergence is normal
Change notification delay (subsequent)	Intra-site	3 s	Same	Bursts coalesce, then fan out fast
Inter-site replication interval	Site link	180 min	Site link object	“Not replicated for 2 h” inter-site can be normal
Inter-site notification	Site link	Off (poll-based)	Enable per link if needed	Enable for low-latency inter-site (WAN cost trade-off)
KCC run interval	Per DC	15 min	Registry	Topology repairs itself within ~15 min
Garbage collection interval	Per DC	12 h	Configuration NC	Tombstones/lingering candidates cleared on this cadence
Tombstone lifetime	Forest	180 days	Configuration NC attribute	The isolation threshold for lingering objects
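The defaults in the table can be turned into a rough "late or stuck" check. This Python sketch is only a rule of thumb: the one-minute intra-site figure comes from the "sub-minute" note in the table, the slack multiplier is an arbitrary choice, and the function name is hypothetical.

```python
from datetime import timedelta

INTRA_SITE_EXPECTED = timedelta(minutes=1)    # "sub-minute" per the table
INTER_SITE_EXPECTED = timedelta(minutes=180)  # default site-link interval


def assess_delay(elapsed, inter_site, slack=2.0):
    """Classify how long a change has been outstanding.

    slack is an arbitrary multiplier giving a 'watch' band before escalation.
    """
    expected = INTER_SITE_EXPECTED if inter_site else INTRA_SITE_EXPECTED
    if elapsed <= expected:
        return "normal"
    if elapsed <= expected * slack:
        return "watch"
    return "investigate"


two_hours_inter_site = assess_delay(timedelta(hours=2), inter_site=True)
two_hours_intra_site = assess_delay(timedelta(hours=2), inter_site=False)
```

The same two-hour gap is unremarkable across a site link on default settings and a clear signal to investigate inside a site. Adjust the constants if your site links use a shorter interval or have notification enabled.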

Reading replication health with repadmin

Start wide, then drill in. repadmin /replsummary is the one-screen forest health check and the correct first command in almost every incident.

repadmin /replsummary

It returns a source-side table and a destination-side table, each listing every DC with its largest delta (the longest time since any of its links last replicated successfully), fails/total counts, and an error percentage. Any non-zero fails column, or a largest delta measured in hours-to-days, is your starting thread. A DC missing from the list entirely is often the sick one: it could not even be contacted to report on.

Next, get the full per-partition, per-partner state from the DC you suspect:

repadmin /showrepl DC01 /verbose

Read it partition by partition — Schema, Configuration, the Domain NC, and each application partition (DomainDnsZones, ForestDnsZones). For each inbound neighbor you want Last attempt @ <time> was successful. A failing line is the smoking gun and names the error:

CORP\DC02 via RPC
    DSA object GUID: 8f4c2a91-...-a1b2c3d4e5f6
    Last attempt @ 2026-06-07 02:14:31 failed, result 8606 (0x219e):
        Insufficient attributes were given to create an object.
    142 consecutive failure(s).
    Last success @ 2026-05-19 23:51:07.

Two numbers matter here: the consecutive failure count (how long it’s been broken) and the last success (whether it ever worked — a link that never succeeded is a topology/DNS problem, one that stopped is a data or connectivity problem). To sweep the whole forest and dump parseable CSV for triage:

repadmin /showrepl * /csv > C:\Temp\showrepl.csv
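Once the CSV exists, a few lines of Python can rank the failing links. This is a sketch that assumes the column headers shown in the sample (`Destination DSA`, `Source DSA`, `Naming Context`, `Number of Failures`, `Last Failure Status`); check them against the header row of your own file. The `failing_links` helper and the `HQ` site name are hypothetical.

```python
import csv
import io


def failing_links(csv_text):
    """Return failing replication links from showrepl CSV text, worst first."""
    failures = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        count = int(row.get("Number of Failures") or 0)
        if count > 0:
            failures.append({
                "destination": row["Destination DSA"],
                "source": row["Source DSA"],
                "naming_context": row["Naming Context"],
                "failures": count,
                "status": row["Last Failure Status"],
            })
    return sorted(failures, key=lambda f: f["failures"], reverse=True)


# Sample rows modelled on the /showrepl output shown earlier.
sample = (
    "showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,"
    "Source DSA Site,Source DSA,Transport Type,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    'showrepl_INFO,HQ,DC01,"DC=corp,DC=example,DC=com",HQ,DC02,RPC,142,'
    "2026-06-07 02:14:31,2026-05-19 23:51:07,8606\n"
    'showrepl_INFO,HQ,DC02,"DC=corp,DC=example,DC=com",HQ,DC01,RPC,0,'
    "0,2026-06-07 02:10:00,0\n"
)

worst_first = failing_links(sample)
```

To run it against the real file, read the file into a string first. The encoding depends on the shell that performed the redirect, so if the header row looks garbled, try opening the file as UTF-16.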

The commands I reach for constantly, and exactly what each tells you:

Command	What it reports	When to run it
`repadmin /replsummary`	One-line health per DC: largest delta, fails/total, error %	First, every incident
`repadmin /showrepl <DC> /verbose`	Every inbound partner, per partition, last success/failure + error	To localise which link/partition is broken
`repadmin /showrepl * /csv`	The above for all DCs, as CSV	Forest-wide triage, sorting/filtering failures
`repadmin /queue <DC>`	Pending inbound replication work items	A growing queue means it’s stuck, not just slow
`repadmin /showutdvec <DC> <NC-DN>`	The UTDV for a partition — who this DC trusts and to what USN	Diagnosing dampening / “why won’t it re-send”
`repadmin /showobjmeta <DC> "<object-DN>"`	Per-attribute version/originating-DC/USN for one object	One object disagrees across DCs (a user, a GPO)
`repadmin /bind <DC>`	Confirms an RPC bind to the AD replication interface	Rule connectivity in or out (1722 vs directory error)
`repadmin /showconn <DC>`	The connection objects (who this DC replicates from)	Missing/orphaned connections; KCC problems
`repadmin /showreps <DC>`	Legacy view of neighbours (older syntax, still useful)	Quick partner list without full verbosity
`repadmin /kcc <DC>`	Forces the KCC to recompute the topology now	After fixing sites/subnets, to rebuild connections
`repadmin /replicate <dest> <src> <NC>`	Force one specific inbound replication now	Reproduce a single failing link on demand
`repadmin /syncall <DC> /AdeP`	Sync all partitions (/A) across site links (/e), pushing changes outward (/P), with servers shown by DN (/d)	Force convergence after a fix; verify

/queue — is it stuck or just busy?

repadmin /queue <DC> lists work items the replication engine has queued but not yet processed. A queue that drains over seconds is a healthy, busy DC. A queue that only grows, with the same items pinned at the top, is a stuck DC — often blocked on a single failing source or an ESE problem.

repadmin /queue DC01

Queue contains 0 items.

An empty queue with failing links means the failures are being reported (the engine tried and got an error), not stuck in a backlog. A large, non-draining queue means the engine is jammed and you should look at CPU, disk, and the Directory Service event log on that DC.
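The stuck-versus-busy distinction is about the trend across several readings, not any single number. A small Python sketch, assuming you have recorded the item count from repeated `repadmin /queue` runs a few seconds apart; the function name and labels are hypothetical.

```python
def queue_state(samples):
    """Classify successive queue item counts, oldest first."""
    if not samples or samples[-1] == 0:
        return "empty"
    if len(samples) < 2:
        return "unknown"   # one reading cannot show a trend
    pairs = zip(samples, samples[1:])
    if all(later >= earlier for earlier, later in pairs):
        return "stuck"     # never shrank between readings
    return "draining"


busy_dc = queue_state([40, 22, 5])
jammed_dc = queue_state([40, 41, 47])
```

A result of `stuck` is the cue to check CPU, disk, and the Directory Service event log on that DC rather than waiting for the queue to clear.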

/showobjmeta — when exactly one object disagrees

The most surgical command in the set. When a single user, group, or GPO differs between DCs (a password that “didn’t take,” a group membership that reverted), /showobjmeta shows the per-attribute version, the originating DC, the originating USN, and the timestamp for the winning value of each attribute:

repadmin /showobjmeta DC01 "CN=jsmith,OU=Users,DC=corp,DC=example,DC=com"

Loc.USN   Originating DSA     Org.USN  Org.Time/Date        Ver Attribute
=======   ===============     =======  =============        === =========
 50120    CORP\DC02            50118   2026-06-07 02:10:44    7  unicodePwd
 41003    CORP\DC01            41003   2026-05-30 09:12:01    3  memberOf

Run it on two DCs and compare. If DC-A shows version 7 of unicodePwd originated on DC-B, and DC-C shows version 6 originated elsewhere, DC-C is behind — and now you know which link into DC-C to chase. This is how you resolve “the reset worked for some users but not others” without guessing.
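The comparison itself is mechanical once you have the version column from each DC. A minimal Python sketch, assuming you have already transcribed the `/showobjmeta` rows into dictionaries; the `stale_attributes` helper and the `CORP\DC03` originator are hypothetical.

```python
def stale_attributes(meta_first, meta_second):
    """Compare per-attribute versions for one object on two DCs.

    Each argument maps attribute name -> (version, originating DSA).
    Returns {attribute: "first" or "second"} naming the side that is behind.
    """
    behind = {}
    for attr in meta_first.keys() & meta_second.keys():
        ver_first = meta_first[attr][0]
        ver_second = meta_second[attr][0]
        if ver_first < ver_second:
            behind[attr] = "first"
        elif ver_second < ver_first:
            behind[attr] = "second"
    return behind


# Values modelled on the sample output above.
dc_a_meta = {"unicodePwd": (7, "CORP\\DC02"), "memberOf": (3, "CORP\\DC01")}
dc_c_meta = {"unicodePwd": (6, "CORP\\DC03"), "memberOf": (3, "CORP\\DC01")}

lagging = stale_attributes(dc_a_meta, dc_c_meta)
```

Here the second DC is behind on `unicodePwd` only, which points at an inbound link into that DC rather than at a forest-wide problem.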

Confirming with dcdiag

repadmin reports the replication engine’s mechanics; dcdiag runs correctness tests against a DC’s broader responsibilities as a directory server, DNS resolver, and Kerberos endpoint. In practice you run repadmin to localise and dcdiag to characterise — especially to rule DNS in or out, which is the single highest-value test.

Run the replication-focused tests first:

dcdiag /test:Replications /test:Intersite /v

Then the DNS suite, because the overwhelming majority of “replication” problems are name-resolution problems. A DC cannot replicate from a partner it cannot resolve, and inter-site topology building fails if the <guid>._msdcs.<forest> CNAME records are missing:

dcdiag /test:DNS /v /e

/e runs the tests against every DC in the enterprise; /s:<DC> targets a single DC (useful when /e itself hangs on an unreachable DC). For a fast, broad smoke test of one DC, the connectivity/advertising/role tests cover most ground:

dcdiag /test:Connectivity /test:Advertising /test:KnowsOfRoleHolders /test:FSMOCheck /v

KnowsOfRoleHolders and FSMOCheck verify this DC can locate and bind to every FSMO role holder — critical before you decide whether a role needs transferring or seizing. The full test catalogue you should know, and what each proves:

dcdiag test	What it verifies	When it’s the right test	Failure points to
`Connectivity`	DNS-resolvable + LDAP/RPC reachable	Always (it’s implicit before others)	DNS or network (8524/1722)
`Replications`	Inbound replication succeeded recently, no errors	Every replication incident	The specific link error (8606, 8451…)
`Intersite`	Inter-site topology and bridgehead health	Sites not converging	ISTG/site-link/bridgehead problems
`KccEvent`	KCC ran without logging errors	“No inbound partners” / topology gaps	KCC can’t build the topology
`DNS` (`/DnsAll`)	A/CNAME/SRV records, forwarders, delegation	Suspected DNS root cause (usually)	Missing SRV/CNAME, broken delegation
`KnowsOfRoleHolders`	This DC can locate all five FSMO holders	Before any FSMO decision	A role holder is unreachable/dead
`FSMOCheck`	This DC can bind to the FSMO holders	Before transfer/seize	The holder is down or unreachable
`RidManager`	RID Master reachable, pool allocatable	“Can’t create accounts”, RID warnings	RID Master down / replication to it broken
`Advertising`	The DC advertises itself as a DC (Netlogon)	DC “invisible” to clients	Netlogon/SYSVOL not ready, not advertising
`NetLogons`	Correct Netlogon/SYSVOL share permissions	Logon/GPO problems on one DC	SYSVOL not shared/permissioned
`MachineAccount`	The DC’s own computer account is correct	Secure-channel / trust problems	Broken machine account / secure channel
`Services`	Required AD services are running	DC misbehaving broadly	A stopped service (NTDS, DNS, Netlogon, KDC)
`SysVolCheck`	SYSVOL is ready and advertised	GPO/script divergence	DFSR/FRS SYSVOL not healthy
`VerifyReferences`	FRS/DFSR/Netlogon reference attributes intact	Post-cleanup validation	Orphaned metadata after a bad demotion

A reading habit that saves time: dcdiag prints passed / failed per test, but the reason is in the verbose (/v) body above the summary line. Never act on the one-word summary alone — read the error text, which usually contains the same Win32 code repadmin gave you, plus context (which SRV record was missing, which partner it couldn’t bind to).

Decoding the common errors

The same handful of codes account for most cases. Memorise the meaning and the layer, not just the number — because the layer tells you which tool fixes it. The cardinal rule: 8524 and 1722 are network problems wearing an AD error code; do not reach for directory surgery until you have ruled out DNS and RPC.

Code	Text (abbreviated)	Layer	Likely cause	How to confirm	First fix
8524	The DSA operation could not proceed because of a DNS lookup failure	DNS	Can’t resolve the source DC’s CNAME/A record	`dcdiag /test:DNS`; `nslookup <guid>._msdcs.<forest>`	Fix DNS; register records (`ipconfig /registerdns`, restart Netlogon)
1722	The RPC server is unavailable	Network/RPC	Firewall blocking 135 or dynamic RPC ports; dead partner	`repadmin /bind <src>`; `portqry -n <src> -e 135`	Open RPC endpoint mapper + dynamic range; revive/verify partner
1256	The remote system is not available	Network	Partner unreachable/offline; routing	`ping`/`Test-NetConnection`; is the DC up?	Restore connectivity or remove the dead DC (metadata cleanup)
8606	Insufficient attributes were given to create an object	Directory	Source offered a reference to an object the dest deleted (lingering)	`repadmin /removelingeringobjects ... /advisory_mode`	Authoritative lingering-object removal, then re-verify
8451	The replication operation encountered a database error	Database (ESE)	NTDS.dit / Jet error on the destination	Directory Service event log for the ESE error	If transient, retry; if persistent, restore from backup
8456 / 8457	Source/destination is currently rejecting replication requests	Engine	DC quarantined after USN rollback, or `Dsa Not Writable` set	Event 2095; `reg query ...NTDS\Parameters /v "Dsa Not Writable"`	Do NOT clear the flag; force-demote + rebuild (USN rollback)
8453	Replication access was denied	Security	Missing replication rights / broken secure channel	`dcdiag /test:MachineAccount`; check `Get/Replicate` ACEs	Reset secure channel; restore replication permissions
8452	The naming context is in the process of being removed / not replicated	Engine	Partition being removed, or not yet instantiated	`repadmin /showrepl` partition state	Wait for instantiation; verify the NC should exist here
1908	Could not find the domain controller for this domain	DNS/Locator	DC locator can’t find a DC (SRV records)	`nltest /dsgetdc:<domain>`; `dcdiag /test:DNS`	Fix `_ldap`/`_kerberos` SRV records under `_msdcs`
1396	The target account name is incorrect	Kerberos/SPN	SPN mismatch / duplicate SPN / stale DNS→wrong host	`setspn -X` (duplicates); check DNS points at the right DC	Fix DNS or remove the duplicate SPN
5	Access is denied (during an op)	Security	Rights/UAC/secure channel on the operation	Run elevated; `nltest /sc_verify:<domain>`	Elevate; repair secure channel
1753	There are no more endpoints available from the endpoint mapper	RPC	RPC dynamic port range exhausted/blocked	`portqry`/`netstat`; firewall rules	Fix the RPC port range / firewall; check for port exhaustion
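Part of the table can be encoded as a lookup so that a triage script names the layer before anyone touches the directory. This Python sketch covers only some of the codes above and copies the layer and first check from the table; the `triage` helper and the dictionary layout are hypothetical.

```python
# code -> (layer, first thing to check), taken from the table above
ERROR_LAYERS = {
    8524: ("DNS", "dcdiag /test:DNS"),
    1722: ("Network/RPC", "repadmin /bind <src>"),
    1256: ("Network", "Test-NetConnection"),
    8606: ("Directory", "repadmin /removelingeringobjects ... /advisory_mode"),
    8451: ("Database (ESE)", "Directory Service event log"),
    8456: ("Engine", "Event 2095 and the Dsa Not Writable flag"),
    8457: ("Engine", "Event 2095 and the Dsa Not Writable flag"),
    8453: ("Security", "dcdiag /test:MachineAccount"),
    1908: ("DNS/Locator", "nltest /dsgetdc:<domain>"),
    1396: ("Kerberos/SPN", "setspn -X"),
}

# Layers where DNS and RPC must be ruled out before any directory surgery.
NETWORK_LAYERS = {"DNS", "Network/RPC", "Network", "DNS/Locator"}


def triage(code):
    """Map a replication error code to its layer and first check."""
    layer, first_check = ERROR_LAYERS.get(
        code, ("Unknown", "read the verbose error text"))
    return {
        "layer": layer,
        "first_check": first_check,
        "rule_out_network_first": layer in NETWORK_LAYERS,
    }


dns_case = triage(8524)
lingering_case = triage(8606)
```

The `rule_out_network_first` flag encodes the cardinal rule: for 8524 and 1722 the next step is a DNS or RPC test, not a change to the directory.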

Three reading traps that waste the most hours:

Trap	Why it misleads	How to avoid it
Treating 8524/1722 as AD problems	The error says “DSA”/“replication,” so people edit AD	Always run `dcdiag /test:DNS` and `repadmin /bind` first — rule out the network layer before touching the directory
Chasing 8606 by “just deleting” the object	Pointing the removal at a stale reference DC deletes live objects	Advisory mode first; verify the reference DC is authoritative and current before any real removal
Clearing `Dsa Not Writable` to “fix” 8456	It re-injects the rolled-back database and corrupts the forest	That flag is AD protecting you; the fix is force-demote + rebuild, never clearing the flag

The five FSMO roles: what they do and how they fail

Most operations in AD are multimaster, but five are single-master because concurrent multimaster changes would corrupt them. These are the FSMO roles. Two are forest-wide (one holder per forest); three are domain-wide (one holder per domain). Losing a role is not usually an immediate outage — AD tolerates a missing FSMO holder for a while — but each has a specific set of operations that stop working while it’s gone.

Role	Scope	What it does	What breaks if it’s down	Urgency
Schema Master	Forest	Only DC that can write schema changes	Can’t extend the schema (e.g. install Exchange, raise functional level)	Low — schema changes are rare
Domain Naming Master	Forest	Authorises adding/removing domains and app partitions	Can’t add/remove a domain or an application partition	Low — infrequent operation
RID Master	Domain	Hands out pools of RIDs to every DC	DCs eventually can’t create new security principals (users/groups/computers) once their pool runs low	Medium — bites when a DC’s pool empties
PDC Emulator	Domain	Time source, password-change hub, account-lockout authority, GPO editing target, legacy PDC	Time skew, worse password/lockout behaviour, GPO edit conflicts	High — most user-visible role
Infrastructure Master	Domain	Updates cross-domain object references (phantom cleanup)	Stale cross-domain group memberships/names	Low (and moot if all DCs are GCs)

Two facts that resolve most FSMO confusion:

The PDC Emulator is the one you protect most. It is the authoritative time source for the domain (the forest-root PDC is the root of the whole time hierarchy), the DC that other DCs forward password changes to for immediate effect, the account-lockout authority, and the default target for GPO edits (so two admins editing GPOs both hit the same DC and don’t conflict). A missing PDCe degrades authentication quality and time immediately.
The Infrastructure Master vs global catalog gotcha. The Infrastructure Master updates references to objects in other domains. If the DC holding it is also a global catalog (GC), it can’t detect stale references (the GC already knows every object, so nothing looks stale) and cross-domain reference updates silently stop. The rule: in a multi-domain forest, the Infrastructure Master should not be a GC — unless every DC in the domain is a GC, in which case it doesn’t matter (there are no non-GC DCs to have stale references). Single-domain forests are immune.

Confirm who holds the roles two ways (they should agree):

netdom query fsmo

Get-ADForest | Select-Object SchemaMaster, DomainNamingMaster
Get-ADDomain  | Select-Object PDCEmulator, RIDMaster, InfrastructureMaster

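And verify from the directory’s perspective that every DC agrees on the holders. A disagreement is itself a symptom: the partition that stores that role’s fSMORoleOwner stamp (Schema or Configuration for the two forest-wide roles, the domain NC for the other three) has only partially replicated: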

dcdiag /test:KnowsOfRoleHolders /v
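
To make the "every DC agrees" check concrete, here is a small sketch that takes one observation per DC and role and reports the roles on which the DCs disagree. The function name Get-RoleHolderDisagreement and the DC/Role/Holder record shape are assumptions for illustration; the commented collection loop needs the ActiveDirectory module and a reachable domain.

```powershell
# Hypothetical helper: return the roles for which different DCs report different holders.
# Each observation is an object with DC, Role and Holder properties.
function Get-RoleHolderDisagreement {
    param([Parameter(Mandatory)][object[]]$Observation)

    $Observation |
        Group-Object -Property Role |
        Where-Object { @($_.Group.Holder | Sort-Object -Unique).Count -gt 1 } |
        ForEach-Object {
            [pscustomobject]@{
                Role    = $_.Name
                Holders = @($_.Group.Holder | Sort-Object -Unique)
            }
        }
}

# Collecting the observations (domain-wide roles shown), one query per DC:
# $obs = foreach ($dc in (Get-ADDomainController -Filter *).HostName) {
#     $d = Get-ADDomain -Server $dc
#     foreach ($r in 'PDCEmulator', 'RIDMaster', 'InfrastructureMaster') {
#         [pscustomobject]@{ DC = $dc; Role = $r; Holder = $d.$r }
#     }
# }
# Get-RoleHolderDisagreement -Observation $obs
```

An empty result means the DCs agree; any row returned names a role whose ownership has not replicated everywhere.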

RID Master mechanics, because RID exhaustion is a real outage

The RID Master allocates RID pools — blocks of ~500 relative IDs — to each DC on request. A DC uses its pool to mint the SID of each new principal (domain SID + RID). When a DC has used ~50% of its pool, it asks the RID Master for the next block. If the RID Master is down and a DC exhausts its current pool, that DC can no longer create users, groups, or computers until the RID Master returns or the role is seized. Check pool status:

dcdiag /test:RidManager /v

* Available RID Pool for the Domain is 21500 to 1073741823
* DC01.corp.example.com is the RID Master
* rIDAllocationPool is 15500 to 15999
* rIDPreviousAllocationPool is 15000 to 15499
* rIDNextRID: 15087

The domain-wide RID space is huge (2^30 ≈ 1.07 billion), so global exhaustion is rare, but local pool exhaustion on one DC while the RID Master is unreachable is a concrete “can’t onboard anyone” outage. This is why RID Master reachability is a first-class check in any FSMO incident.
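
The pool arithmetic can be checked by hand or with a few lines of PowerShell. The sketch below is an illustration, not a dcdiag feature: Get-RidPoolUsage is a hypothetical name, the 50% threshold is the figure quoted above, and it assumes rIDNextRID is the next RID to be issued from the pool in use (if it is the last one issued, the counts shift by one).

```powershell
# Hypothetical helper: how much of a DC's current RID pool is consumed, and whether
# the DC should already have asked the RID Master for its next block.
function Get-RidPoolUsage {
    param(
        [Parameter(Mandatory)][long]$PoolStart,   # low end of the pool in use
        [Parameter(Mandatory)][long]$PoolEnd,     # high end of the pool in use
        [Parameter(Mandatory)][long]$NextRid,     # rIDNextRID from dcdiag
        [double]$RequestThreshold = 0.5           # assumption: request at ~50% consumed
    )

    $size = $PoolEnd - $PoolStart + 1
    $used = $NextRid - $PoolStart

    [pscustomobject]@{
        PoolSize             = $size
        Used                 = $used
        Remaining            = $size - $used
        PercentUsed          = [math]::Round(100 * $used / $size, 1)
        ShouldHaveRequested  = (($used / $size) -ge $RequestThreshold)
    }
}

# Using the sample numbers above: pool 15000 to 15499, rIDNextRID 15087
Get-RidPoolUsage -PoolStart 15000 -PoolEnd 15499 -NextRid 15087
```

With those inputs the pool is 500 RIDs, 87 are used and 413 remain, so this DC is well short of the request threshold. A DC near the end of its pool with the RID Master unreachable is the one to worry about.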

Transferring vs. seizing FSMO roles

The decision between transfer and seize is the single most consequential FSMO judgement, and it has exactly one deciding question: is the current holder alive and reachable, and will it ever come back?

If the holder is alive and reachable, transfer. A transfer is a clean, coordinated handoff: the roles move, both DCs agree, and it is fully reversible (transfer it back). Do this for planned maintenance, decommissioning, or rebalancing.
If the holder is permanently dead, seize — but only after committing that it will NEVER come back online holding that role. A seizure is unilateral: the new holder simply asserts the role. If the old holder later rejoins the network still believing it holds the role, you have a split-brain, which is catastrophic for the RID Master (two DCs allocating overlapping RID pools → duplicate SIDs) and the Schema Master (two authorities for schema writes).

Situation	Old holder state	Action	Reversible?	The trap
Planned move / decommission	Alive, reachable	Transfer	Yes	None — the clean path
Holder briefly offline (reboot, patch)	Temporarily down	Wait	n/a	Seizing prematurely then the DC returns = split-brain
Holder permanently dead (hardware, USN rollback)	Dead, will not return	Seize	No	Ever powering the old holder back on with the role
Holder isolated by network only	Alive but unreachable	Fix network, then transfer	Yes	Seizing when the DC is actually fine, just partitioned
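
The decision table above reduces to three yes/no questions. A sketch of it as a function, with a hypothetical name and parameter set, purely so the ordering of the questions is explicit:

```powershell
# Hypothetical helper: encode the transfer / wait / seize decision table.
# It only returns a recommendation string; it never touches the directory.
function Get-FsmoAction {
    param(
        [Parameter(Mandatory)][bool]$HolderAlive,      # is the OS and NTDS on the holder running?
        [Parameter(Mandatory)][bool]$HolderReachable,  # can other DCs reach it over the network?
        [Parameter(Mandatory)][bool]$WillReturn        # if down, will it ever come back?
    )

    if ($HolderAlive -and $HolderReachable) { return 'Transfer' }
    if ($HolderAlive)                        { return 'Fix network, then transfer' }
    if ($WillReturn)                         { return 'Wait' }
    return 'Seize'
}
```

Note that Seize is the fall-through: it is reached only after every path that keeps the old holder in play has been ruled out.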

Transfer with the AD module (the modern way — you can move all five at once):

Move-ADDirectoryServerOperationMasterRole -Identity "DC02" `
  -OperationMasterRole PDCEmulator,RIDMaster,InfrastructureMaster,SchemaMaster,DomainNamingMaster

Seize is the same cmdlet with -Force, used only against a holder you have retired for good:

# Seize from an unrecoverable holder — commit to never bringing it back
Move-ADDirectoryServerOperationMasterRole -Identity "DC02" `
  -OperationMasterRole RIDMaster,PDCEmulator -Force

The classic ntdsutil seize path is fully supported and worth knowing when the AD module isn’t available (Server Core without RSAT, an emergency shell):

ntdsutil
  roles
    connections
      connect to server DC02
      quit
    seize RID master
    seize PDC
    seize infrastructure master
    seize schema master
    seize naming master
    quit
  quit

A note on the mapping between friendly names and ntdsutil verbs, because they differ and it matters under pressure:

Role	AD module `-OperationMasterRole` value	`ntdsutil` verb	Post-seize must-do
Schema Master	`SchemaMaster`	`seize schema master`	Ensure old holder never returns with the role
Domain Naming Master	`DomainNamingMaster`	`seize naming master`	Same
RID Master	`RIDMaster`	`seize RID master`	Critical: old holder must be destroyed (duplicate RIDs)
PDC Emulator	`PDCEmulator`	`seize PDC`	Re-point time hierarchy; verify with `w32tm /query /status`
Infrastructure Master	`InfrastructureMaster`	`seize infrastructure master`	Re-check the GC-vs-Infra rule on the new holder

After any transfer or seize, confirm with netdom query fsmo and dcdiag /test:KnowsOfRoleHolders, and — if you seized the PDC Emulator — reconfigure the time hierarchy (the forest-root PDCe should sync from an external NTP source; see the time-sync article) and validate with w32tm /query /status.

Detecting and removing lingering objects safely

A lingering object is one that was deleted (tombstoned, then garbage-collected) on most DCs but survives on a DC that was offline longer than the tombstone lifetime. When that isolated DC comes back and replicates, its partners reject the reference to the “revived” object with 8606 — because from their perspective, the source is offering a reference to an object that no longer exists. Lingering objects are dangerous beyond the error noise: a re-animated user or group is a security problem (a disabled account back in existence, a deleted admin group re-populated), not just a replication annoyance.

The cardinal safety rule: detect first, delete never on the first pass. Run removelingeringobjects in advisory mode against the suspect DC, using a known-good reference DC as the source of truth. Advisory mode deletes nothing; it logs every object it would remove to the Directory Service event log so you can review them.

# Advisory only — logs candidates (event 1946), deletes NOTHING
repadmin /removelingeringobjects DC02 <reference-DC-GUID> "DC=corp,DC=example,DC=com" /advisory_mode

The three arguments, precisely, because getting them backward is how people delete live data:

Argument	Meaning	Common mistake
First DC name	The DC to clean (the one with stale/lingering objects)	Swapping it with the reference DC → cleans the wrong box
Reference DC GUID	The authoritative, current source of truth (its DSA object GUID)	Using a stale reference DC → deletes live objects everywhere
NC distinguished name	The partition to check	Forgetting to repeat for Configuration/Schema/DNS partitions
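
Because swapped arguments are the dangerous mistake, it can help to build the command line through a small guard instead of typing it. The sketch below is a hypothetical wrapper (the name New-LingeringObjectCommand and its parameters are assumptions): it only returns the command string, defaults to advisory mode, and refuses an obviously swapped pair.

```powershell
# Hypothetical helper: compose a repadmin /removelingeringobjects command line.
# It does NOT run anything; review the returned string, then execute it yourself.
function New-LingeringObjectCommand {
    param(
        [Parameter(Mandatory)][string]$TargetDc,          # the DC to clean
        [Parameter(Mandatory)][guid]$ReferenceDsaGuid,    # DSA object GUID of the known-good DC
        [Parameter(Mandatory)][string]$NamingContext,     # partition DN
        [string]$ReferenceDcName,                         # optional, used only for the sanity check
        [switch]$RemoveForReal                            # omit to stay in advisory mode
    )

    if ($ReferenceDcName -and ($ReferenceDcName -eq $TargetDc)) {
        throw "Target and reference DC are both '$TargetDc'; the arguments are probably swapped."
    }

    $cmd = 'repadmin /removelingeringobjects {0} {1} "{2}"' -f $TargetDc, $ReferenceDsaGuid, $NamingContext
    if (-not $RemoveForReal) { $cmd += ' /advisory_mode' }
    return $cmd
}
```

Making advisory mode the default means the destructive form requires a deliberate extra switch, which matches the "detect first" rule.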

Get the reference DC’s DSA object GUID (the second argument, the one that points at the good copy):

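# The first "DSA object GUID" line in the output is DC01's own; later ones belong to its partners
repadmin /showrepl DC01 | Select-String "DSA object GUID" | Select-Object -First 1
# or read the objectGUID of DC01's NTDS Settings object (this is the DSA object GUID, NOT the InvocationId)
(Get-ADObject -Identity (Get-ADDomainController -Identity DC01).NTDSSettingsObjectDN).ObjectGUID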

Advisory mode writes event 1946 (and 1942 summaries) listing every lingering object it would remove. Review that list object by object. Confirm the reference DC genuinely holds the authoritative, current copy of that partition — pointing at a DC that is itself behind will delete objects that are actually live. Once you trust the list, remove for real by dropping /advisory_mode:

repadmin /removelingeringobjects DC02 <reference-DC-GUID> "DC=corp,DC=example,DC=com"

Repeat for every partition that showed 8606 — do not forget the Configuration and Schema NCs and the DomainDnsZones/ForestDnsZones application partitions. After cleanup, re-run repadmin /showrepl DC02 /verbose and confirm the 8606s are gone. The event IDs you’ll see during this process:

Event ID	Log	Meaning
1937	Directory Service	Lingering-object removal starting on an NC
1938	Directory Service	(Advisory) lingering-object verification starting on an NC
1939	Directory Service	Lingering-object removal completed
1946	Directory Service	(Advisory) a specific lingering object that would be removed
1942	Directory Service	(Advisory) count summary of candidates found
1988	Directory Service	Strict replication blocked a lingering object (the guardrail firing)
2042	Directory Service	Replication has been stopped because a DC has not replicated in the tombstone lifetime

Strict replication consistency — the permanent guardrail

There is a defense-in-depth setting that should be enabled fleet-wide so a single isolated DC can never inject lingering objects in the first place: strict replication consistency. With it on, a DC that receives a reference to an object it lacks halts that replication and demands attention (logging event 1988) rather than silently accepting the reference. It converts a silent forest-poisoning into a loud, contained, single-link failure.

# Enable strict replication consistency on every DC
repadmin /regkey * +strict

# Verify it took (per DC): the "Strict Replication Consistency" registry value = 1
reg query "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Strict Replication Consistency"

Strict is the default on forests built from clean 2003 SP1+ media, but upgraded forests may have it off. Confirm and enforce it — it is the cheapest insurance in this entire article. The behaviour difference:

Mode	On receiving a reference to an unknown object	Consequence	Recommended?
Strict (`+strict`, value 1)	Halt that inbound replication, log 1988, wait for admin	One link stops; forest stays clean; you get a loud signal	Yes — everywhere
Loose (`-strict`, value 0)	Accept the reference and reanimate the object	Lingering objects spread silently across the forest	No — legacy default only
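
For a fleet audit, the comparison itself is trivial once you have each DC's registry value. A sketch under stated assumptions: Get-NonStrictDc is a hypothetical name, the input is a hashtable of DC name to the "Strict Replication Consistency" value (or $null where the value is absent), and the commented collection step assumes PowerShell remoting is enabled on the DCs.

```powershell
# Hypothetical helper: list DCs whose strict-replication value is anything other than 1.
function Get-NonStrictDc {
    param([Parameter(Mandatory)][hashtable]$StrictValueByDc)

    foreach ($dc in ($StrictValueByDc.Keys | Sort-Object)) {
        # A missing value ($null) or 0 both count as "not strict"
        if ($StrictValueByDc[$dc] -ne 1) { $dc }
    }
}

# One way to collect the values (requires remoting to each DC):
# $values = @{}
# foreach ($dc in (Get-ADDomainController -Filter *).HostName) {
#     $values[$dc] = Invoke-Command -ComputerName $dc -ScriptBlock {
#         (Get-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' `
#             -Name 'Strict Replication Consistency' -ErrorAction SilentlyContinue).'Strict Replication Consistency'
#     }
# }
# Get-NonStrictDc -StrictValueByDc $values
```

Any name it returns is a DC that would accept a lingering object silently and should get +strict applied.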

Recognising and recovering from USN rollback

This is the failure that turns a bad afternoon into a forest rebuild if you mishandle it, and it deserves its own discipline. USN rollback happens when a DC’s database moves backward in USN space without its invocation ID changing — so partners, tracking it by invocation ID, believe they already have those USNs and dampen the “new” changes. The rolled-back DC keeps accepting local writes that never propagate; the divergence is invisible until something downstream breaks.
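
The dampening logic is easier to see as a toy model. The sketch below is a simplification, not the real replication engine: it assumes a partner's up-to-dateness vector is just a map from invocation ID to the highest originating USN already received, and the function name and sample IDs are hypothetical.

```powershell
# Toy model of dampening: a partner skips any change whose originating USN is at or
# below what its up-to-dateness vector already records for that invocation ID.
function Test-ChangeDampened {
    param(
        [Parameter(Mandatory)][hashtable]$UpToDatenessVector,  # invocation ID -> highest USN seen
        [Parameter(Mandatory)][string]$InvocationId,
        [Parameter(Mandatory)][long]$OriginatingUsn
    )

    if (-not $UpToDatenessVector.ContainsKey($InvocationId)) { return $false }
    return ($OriginatingUsn -le $UpToDatenessVector[$InvocationId])
}

# The partner has already received USNs up to 9000 from invocation ID 'inv-A'
$partnerVector = @{ 'inv-A' = 9000 }

# After a snapshot revert the DC restarts at USN 7000 with the SAME invocation ID;
# its next genuine write is stamped 7001 and the partner ignores it.
Test-ChangeDampened -UpToDatenessVector $partnerVector -InvocationId 'inv-A' -OriginatingUsn 7001

# With a reset invocation ID ('inv-B'), the same write is treated as new.
Test-ChangeDampened -UpToDatenessVector $partnerVector -InvocationId 'inv-B' -OriginatingUsn 7001
```

The first call returns True (dampened) and the second False, which is the whole difference between an unsupported revert and an AD-aware restore that resets the invocation ID.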

The causes are almost always improper “recovery” of a virtualised DC:

Trigger	Why it rolls USNs back	The safe alternative
VM snapshot/checkpoint revert on a non-GenID-aware host	Restores an older NTDS.dit with an older USN, invocation ID unchanged	Use a GenID-aware host (2012+) OR restore via system-state backup
P2V of a running DC	Captures a moving database; the copy is stale on boot	Never P2V a live DC; build a fresh VM and `Install-ADDSDomainController`
Cloning/imaging a running DC	Same as P2V; two DCs with the same DB state/invocation ID	Use supported cloning (`DCCloneConfig.xml`) which resets identity
Restoring a copied VMDK/VHD as a “backup”	Not an AD-aware restore; invocation ID doesn’t reset	Only restore from Windows Server Backup system state
SAN rollback / storage-array snapshot revert	Storage-level rewind bypasses AD’s safety net	Coordinate storage recovery with AD; treat as USN rollback

The signature is unmistakable once you’ve seen it:

Directory Service event 2095 on the affected DC, logged when a replication request shows that a partner has already received changes under the USNs this DC is now reusing, i.e. its USN has rolled back.
Inbound and/or outbound replication is paused; the DC sets Dsa Not Writable in the registry.
Partners reject replication from it with 8456/8457, and the affected DC may also log 2103 (database restored by an unsupported procedure).

Check whether a DC has quarantined itself:

# A non-zero value here means the DC detected USN rollback and quarantined itself
reg query "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Dsa Not Writable"

There is exactly one correct recovery, and it is not clearing that flag. That flag is AD protecting the forest from you; clearing it re-injects the rolled-back state permanently.

Isolate and force-demote the affected DC with Uninstall-ADDSDomainController -ForceRemoval (Server 2012 and later; dcpromo /forceremoval applies only to older releases), removing it from the domain. Do this first so it can never originate more divergent writes.
Metadata cleanup for it from a healthy DC (next section), removing its NTDS Settings, server object, computer account, and DNS records.
Garbage-collect / converge, verify no traces of it remain in the topology.
Rebuild: promote a clean OS as a new DC (Install-ADDSDomainController against a fresh install), or restore the role properly from a supported, AD-aware backup (Windows Server Backup system state, or a hypervisor exposing VM-GenerationID — 2012+ DCs on GenID-aware hosts survive a snapshot revert because the changed GenID triggers an invocation-ID reset and a re-sync).

The root-cause prevention is policy, not a command: never revert a DC from a VM snapshot on a host that does not expose VM-GenerationID, never image/clone a running DC by hand, and never P2V a live DC. Use system-state backups or Install-ADDSDomainController against a fresh OS. The decision table for a DC showing 2095:

Situation	Do this	Do NOT do this
Single DC, no other writable copy	Restore system state from a pre-rollback backup; if none, this DC is the forest — treat as forest recovery	Clear `Dsa Not Writable`; run it forward diverged
Multiple DCs, this one rolled back	Force-demote, metadata cleanup, rebuild clean	Clear the flag; try to “merge” the divergence
Host is GenID-aware, snapshot reverted	Let AD’s GenID reset re-sync it (it may self-heal)	Panic-demote before checking if GenID fired
You cleared the flag already (mistake)	Force-demote and rebuild immediately; assess damage	Assume it’s fine because replication “resumed”

Metadata cleanup after a DC is gone

After a DC is gone for good (USN-rollback demotion, dead hardware, a botched promotion, a forced removal), its objects linger in the directory and in DNS: the NTDS Settings object and server object in the Configuration partition, the computer account in the domain partition, and DNS records (A/CNAME and the _ldap/_kerberos SRV records). Left behind, they cause other DCs to attempt replication with a ghost (producing 1722/1256 forever) and confuse the DC locator. Metadata cleanup removes them authoritatively.

Modern ntdsutil and the RSAT consoles do most of the cleanup automatically when you delete the DC’s computer account in AD Users and Computers or its NTDS Settings object in AD Sites and Services, but the canonical manual sequence, the one that works even when the DC is unreachable, is:

ntdsutil
  metadata cleanup
    connections
      connect to server HEALTHY-DC
      quit
    select operation target
      list domains
      select domain 0
      list sites
      select site 0
      list servers in site
      select server <number-of-dead-DC>
      quit
    remove selected server
    quit
  quit

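The modern equivalent, when the DC’s objects still show in the consoles, is to delete its computer account in AD Users and Computers or its NTDS Settings object in AD Sites and Services (on Server 2008 and later either deletion triggers the same cleanup), then delete the empty server object. The AD module has no dedicated metadata-cleanup cmdlet, but it can show you what is left:

# List every NTDS Settings object in the forest; a dead DC that still appears here needs cleanup
$cfg = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase "CN=Sites,$cfg" -Filter 'objectClass -eq "nTDSDSA"' | Select-Object DistinguishedName
# When the DC is already gone, remove the leftover NTDS/server object via Sites and Services,
# or ntdsutil metadata cleanup as above.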

Then clean up what ntdsutil does not always remove. This checklist is the difference between a truly-gone DC and a haunted topology:

Leftover	Where it lives	How to remove it	Symptom if left
NTDS Settings object	AD Sites and Services → Site → Server → NTDS Settings	Deleted by metadata cleanup	Partners keep trying to replicate (1722/1256)
Server object	AD Sites and Services → Site → Server	Delete the server node	Empty server shell in the topology
Computer account	AD Users and Computers → Domain Controllers OU	Delete the computer object	Stale DC account; auth confusion
A / AAAA record	Forward DNS zone	Delete in DNS Manager	Clients/DCs resolve to a dead host
CNAME (`<guid>._msdcs`)	`_msdcs.<forest>` zone	Delete in DNS Manager	Replication targets a dead alias (8524 elsewhere)
`_ldap` / `_kerberos` SRV	`_msdcs`, `_sites`, `_tcp`, `_udp`	Delete stale SRV records	DC locator hands clients a dead DC
FRS/DFSR member object	Configuration NC	Removed by cleanup / delete manually	SYSVOL replication references a ghost member
FSMO role ownership	On the dead DC (if it held any)	Seize the roles first, then clean up	Roles stuck on a nonexistent DC

The ordering matters: if the dead DC held any FSMO roles, seize those roles first (previous section), then run metadata cleanup — cleaning up a DC that still “owns” a role leaves the role orphaned.
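
Stale SRV records are the leftover most often missed, and finding them is a set difference between what DNS advertises and what actually exists. A sketch, assuming a hypothetical helper name and that you supply both lists; the commented lines show one way to gather them with Resolve-DnsName and Get-ADDomainController for the article's example domain.

```powershell
# Hypothetical helper: return SRV targets that do not correspond to any live DC.
function Get-StaleSrvTarget {
    param(
        [Parameter(Mandatory)][string[]]$SrvTarget,  # host names the SRV records point at
        [Parameter(Mandatory)][string[]]$LiveDc      # FQDNs of DCs that still exist
    )

    $SrvTarget |
        ForEach-Object { $_.TrimEnd('.') } |         # tolerate a trailing root dot
        Sort-Object -Unique |
        Where-Object { $_ -notin $LiveDc }           # -notin is case-insensitive
}

# Gathering the inputs:
# $targets = (Resolve-DnsName -Type SRV -Name '_ldap._tcp.dc._msdcs.corp.example.com').NameTarget
# $live    = (Get-ADDomainController -Filter *).HostName
# Get-StaleSrvTarget -SrvTarget $targets -LiveDc $live
```

Repeat the lookup for the _kerberos records and for each site-specific name under _sites; anything the function returns is a record still handing clients a dead DC.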

SYSVOL replication: DFSR, FRS, and the migration trap

A subtlety that catches people: SYSVOL (the share holding group policy objects and logon scripts) replicates through a separate engine from the directory itself. In any domain that was created at the 2008 functional level or higher, or that has completed the FRS-to-DFSR migration, SYSVOL uses DFS Replication (DFSR); older domains used the legacy File Replication Service (FRS). A directory that replicates perfectly can still have divergent GPOs if SYSVOL replication is broken: GPO edits made on one DC never reach the others, so users in different sites get different policy.

Check which engine SYSVOL uses and its migration state:

dfsrmig /getglobalstate

Current DFSR global state: 'Eliminated'

Eliminated (state 3) means migration to DFSR is complete and FRS is gone — the correct end state. States Prepared (1) or Redirected (2) mean a migration was started but never finished, which is its own class of fragility. Check DFSR health and backlog:

# Backlog of files waiting to replicate between two DCs' SYSVOL
Get-DfsrBacklog -GroupName "Domain System Volume" `
  -SourceComputerName DC01 -DestinationComputerName DC02 -FolderName "SYSVOL Share"

# DFSR state and any errors
Get-DfsrState -ComputerName DC02
Get-WinEvent -LogName "DFS Replication" -MaxEvents 50 -ComputerName DC02

The failure modes and how to tell them apart:

SYSVOL symptom	Engine/cause	Confirm with	Fix
GPOs differ across DCs	DFSR backlog or stopped	`Get-DfsrBacklog`; DFSR event log	Clear the blocking file; resume DFSR; check disk/quota
DFSR event 2213 (dirty shutdown)	DFSR paused after unclean stop	DFSR event log 2213	Resume via the documented `ResumeReplication` WMI method
SYSVOL not shared on a DC	DFSR/FRS not initialised	`dcdiag /test:SysVolCheck`, `net share`	Non-authoritative sync (D2) of SYSVOL on that DC
Migration stuck at `Prepared`/`Redirected`	Incomplete FRS→DFSR migration	`dfsrmig /getglobalstate`	Complete the migration to `Eliminated`
Journal wrap (legacy FRS)	FRS lost its change journal (NTFS USN journal overflow)	FRS event log (13568)	Non-authoritative restore (BurFlags D2) of FRS
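
The migration row above depends on reading the dfsrmig state correctly. A small sketch that turns the numeric state into a verdict; the function name is hypothetical, states 1 to 3 are as described earlier in this section, and state 0 (Start, meaning migration has not begun) is added here from dfsrmig's own naming.

```powershell
# Hypothetical helper: interpret a dfsrmig global state number.
function Get-SysvolMigrationStatus {
    param([Parameter(Mandatory)][ValidateRange(0, 3)][int]$State)

    $names = 'Start', 'Prepared', 'Redirected', 'Eliminated'

    [pscustomobject]@{
        State      = $State
        Name       = $names[$State]
        Complete   = ($State -eq 3)        # Eliminated: DFSR only, FRS gone
        InProgress = ($State -in 1, 2)     # started but never finished
    }
}
```

A result with InProgress set to True on a long-lived domain is the "migration started but never finished" case and worth scheduling to completion.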

Journal wrap deserves a note because it echoes across both engines’ history: it happens when the volume’s NTFS change journal overflows (too many changes during an outage, or a journal too small) and the replication engine loses its place. Legacy FRS recovers via a non-authoritative restore (BurFlags D2), pulling a fresh copy from a good partner; DFSR handles the equivalent more gracefully but a dirty shutdown (event 2213) requires an explicit resume. In both cases the fix is “re-seed from a known-good copy,” never “delete SYSVOL and hope.”

Architecture at a glance

Picture the replication system as three concentric layers, because that ordering is the diagnostic method — you work from the outside in, and most failures live in the outer two layers, not the directory itself.

The outermost layer is the network fabric: DNS and RPC. Every replication attempt begins with the destination DC resolving the source’s host record and its <DSA-object-GUID>._msdcs.<forest-root> CNAME, then binding to the source over RPC: the endpoint mapper on 135/TCP hands back a dynamic high port, and the actual replication rides that. When a link fails with 8524, resolution failed; when it fails with 1722/1256, the RPC bind failed. This layer is where the majority of “replication is broken” tickets actually resolve, which is why dcdiag /test:DNS and repadmin /bind come before anything directory-related.

The middle layer is the replication engine and topology: the KCC computing intra-site connection objects, the ISTG building inter-site connections across the site links you defined, and the bridgeheads carrying that inter-site traffic on a poll schedule. Each DC holds a high-watermark per partner and an up-to-dateness vector per partition, and the whole mesh converges through change notification (intra-site push) and polling (inter-site pull), dampened so nothing loops. Failures here look like missing inbound partners, stuck queues (repadmin /queue), or sites not converging (dcdiag /test:Intersite) — the plumbing is up but the topology or the engine is wrong.

The innermost layer is the directory data and the ESE database: the actual objects, their per-attribute versions and originating stamps, the tombstones awaiting garbage collection, and the invocation ID that identifies this database instance. This is where the dangerous failures live — 8606 (a lingering object being offered), 8451 (an ESE/Jet error), and USN rollback / event 2095 (the database rewound without its invocation ID resetting, so partners silently dampen its changes). Reaching this layer means the network and topology are healthy and the problem is genuinely in the data — and it is exactly here that a wrong move (deleting live objects, clearing Dsa Not Writable) turns a link failure into a forest failure.

Overlaid on all three layers are the five FSMO roles, single-master operations pinned to specific DCs: Schema and Domain Naming (forest-wide), RID/PDC/Infrastructure (per domain). They are not a layer so much as a set of responsibilities that must remain reachable — and the moment a role holder is unreachable or dead, the question flips from “why won’t this link replicate” to “who holds this role, is it coming back, and do I transfer or seize.” The whole method is: confirm the network layer, then the topology layer, then the data layer, and in parallel confirm the FSMO holders are alive — localise the failure to exactly one layer before you act, because the tool that fixes a DNS problem must never be pointed at the directory, and vice versa.

Real-world scenario

Meridian Logistics runs a single-forest, four-site Active Directory estate: roughly 30,000 objects, six DCs (two in the primary datacentre, two in a DR site, one each in two regional offices over MPLS), all Windows Server 2022, forest at the 2016 functional level. The identity team is three engineers; the forest also feeds Entra Connect for Microsoft 365. Six DCs and 30,000 objects is small enough that one bad decision touches everything.

The incident began on a Friday afternoon during an unrelated storage migration. A vSphere administrator “recovered” DC04 (a DR-site DC) by reverting it to a three-week-old VM snapshot, to roll back a driver change. The host template predated VM-GenerationID exposure for that VM, so the GenID safety net never fired. DC04 booted, logged event 2095, set Dsa Not Writable, and its partners began rejecting its outbound changes with 8456. Worse: deletions that had garbage-collected across the forest during those three weeks were still live on DC04’s rolled-back database, so as partners pulled the Configuration and Domain partitions, 8606 storms erupted across two sites — a re-animated service account and a deleted distribution group among the lingering objects.

The constraint made it a real incident: DC04 held the RID Master and PDC Emulator for the domain. It was a Friday, the business would not approve a weekend rebuild window, and — critically — one engineer’s first instinct was to “clear that registry flag to get replication going.” That would have re-injected three weeks of divergence permanently. The senior on the bridge stopped it: Dsa Not Writable is the platform protecting the forest, and the only correct outcome for a genuine USN rollback is demote-and-rebuild.

The fix followed the playbook exactly. First, triage: repadmin /replsummary showed DC04 failing every link and two other DCs failing inbound from it; repadmin /showrepl DC02 /verbose confirmed 8606 on the Domain NC; the reg query confirmed a non-zero Dsa Not Writable value (0x4, the USN-rollback marker) on DC04 and event 2095 in its log. Diagnosis was unambiguous: USN rollback on DC04, lingering objects on its partners. They isolated DC04 from replication immediately (disabled its NIC), then seized the two FSMO roles onto the healthy DC02 rather than wait:

# Seize from the unrecoverable rolled-back holder, run on DC02
Move-ADDirectoryServerOperationMasterRole -Identity "DC02" `
  -OperationMasterRole RIDMaster,PDCEmulator -Force

They confirmed the seizure with netdom query fsmo, re-pointed the domain time hierarchy at DC02, the new PDCe (validated with w32tm /query /status), then force-demoted DC04 with Uninstall-ADDSDomainController -ForceRemoval and ran metadata cleanup from DC02, scrubbing DC04’s NTDS Settings, server object, computer account, and stale DNS/SRV records. On each surviving partner they ran repadmin /removelingeringobjects DC0x <DC02-GUID> "<NC>" /advisory_mode for every partition, using DC02’s DSA GUID as the authoritative reference, reviewed the event 1946 candidate lists object by object (confirming the re-animated service account and group were genuinely deleted), and only then removed them for real. They enabled repadmin /regkey * +strict fleet-wide so the same class of incident would halt a future DC (event 1988) instead of poisoning the topology. A clean Server 2022 VM was promoted into DC04’s slot the following week.

The proof: repadmin /replsummary returned zero failures, dcdiag /e /test:DNS /test:FSMOCheck passed enterprise-wide, and a throwaway test object created on DC01 converged to the DR site inside two minutes. User-facing impact: none — PDC and RID were live on DC02 within the hour, so authentication, time, and account creation never stopped. Entra Connect, which had been about to sync the re-animated objects into Microsoft 365, was paused during the cleanup and resumed clean. The lesson written on the runbook afterward: “A rolled-back DC is not a replication problem to un-stick — it is a database to destroy. The flag is your friend.”

The incident as a timeline, because the order of moves is the lesson:

Time	Symptom / event	Action taken	Effect	What it should have been
14:32	vSphere admin reverts DC04 snapshot	(unrelated to AD, done for a driver)	Seeds the rollback	Never revert a DC without GenID
14:40	Event 2095 on DC04; 8456 on partners	(alert fires)	—	Recognise USN rollback immediately
14:45	8606 storm across two sites	Engineer proposes clearing the flag	Would corrupt the forest	Stop — the flag is protection
14:50	Triage	`replsummary` + `showrepl` + `reg query`	Confirms rollback + lingering objects	Correct first moves
15:00	Isolate	Disable DC04 NIC	Stops further divergence	Correct
15:10	FSMO at risk on a dead DC	Seize RID + PDC to DC02	Roles safe; users unaffected	Correct — seize before rebuild
15:30	DC04 removed	Force-demote + metadata cleanup	DC04 gone cleanly	Correct
16:00–17:30	Lingering objects	Advisory mode → review 1946 → remove	Forest converges clean	Never remove without advisory review
17:45	Guardrail	`repadmin /regkey * +strict`	Future rollbacks halt loudly	Should have been on already
+1 week	Rebuild	Fresh Server 2022 DC into DC04 slot	Full six-DC estate restored	Correct end state

Advantages and disadvantages

Multimaster replication is what makes Active Directory resilient and available — and the same properties are what make its failures silent and self-propagating. Weigh the model honestly:

Advantages (why multimaster helps you)	Disadvantages (why it bites)
Any writable DC accepts changes — no central write bottleneck, survives losing any single DC	No single source of truth: when DCs diverge, the platform tells you that they differ, not which is right
Incremental replication (HWMV/UTDV) is efficient — only deltas cross the wire, dampened against loops	The same dampening lets a USN-rolled-back DC silently stop originating — the failure is invisible until downstream breakage
Deterministic conflict resolution (version → timestamp → GUID) converges every DC identically without coordination	A “resolved” conflict can silently pick the wrong value if clocks are skewed (why time sync is load-bearing)
Tombstones let deletions replicate reliably across an unreliable WAN	A DC offline past the tombstone lifetime returns with lingering objects — a security problem, not just noise
The KCC/ISTG auto-build and self-heal the topology every 15 min — you don’t hand-wire connections	A wrong subnet-to-site map makes the KCC build bad or missing links, and it “self-heals” back to the wrong topology
FSMO isolates the few operations that can’t be multimaster, keeping them consistent	A dead FSMO holder blocks a specific class of operation, and seizing wrong (RID/Schema) is catastrophic and irreversible
Rich, specific diagnostics (repadmin/dcdiag) expose the engine’s exact state	The tooling is old-school and terse; the error codes (8524 vs 8606) demand you know which layer each belongs to

The model is right — there is no better design for a globally-distributed, highly-available directory. But its failure modes reward discipline over speed: the wrong fast action (clear the flag, delete the object, seize the role) is worse than doing nothing while you confirm which DC is authoritative. Every disadvantage above is manageable — strict replication consistency, GenID-aware hosts, correct site design, and the transfer-vs-seize discipline neutralise each one — but only if you know it exists, which is the entire point of this article.
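
The conflict-resolution order named in the table (version, then timestamp, then GUID) is worth seeing as code, because it shows exactly where clock skew gets its vote. This is a simplified sketch: the function name and the stamp shape are hypothetical, and the final tie-break compares GUID strings ordinally, which only approximates how the directory orders originating DSA GUIDs.

```powershell
# Simplified model of per-attribute conflict resolution: the higher version wins;
# on a version tie the later timestamp wins; on a full tie the GUID breaks it.
function Resolve-AttributeConflict {
    param(
        [Parameter(Mandatory)]$A,   # object with Version, Timestamp, OriginatingDsa, Value
        [Parameter(Mandatory)]$B
    )

    if ($A.Version -ne $B.Version) {
        if ($A.Version -gt $B.Version) { return $A } else { return $B }
    }
    if ($A.Timestamp -ne $B.Timestamp) {
        if ($A.Timestamp -gt $B.Timestamp) { return $A } else { return $B }
    }
    # Assumption: ordinal string comparison stands in for the real GUID ordering
    if ([string]::CompareOrdinal([string]$A.OriginatingDsa, [string]$B.OriginatingDsa) -ge 0) {
        return $A
    }
    return $B
}

# Two writes at the same version; DC02's clock runs an hour fast, so its stamp is "later"
$fromDc01 = [pscustomobject]@{ Version = 4; Timestamp = [datetime]'2024-05-01T10:05:00'; OriginatingDsa = 'aaaa'; Value = 'written last in real time' }
$fromDc02 = [pscustomobject]@{ Version = 4; Timestamp = [datetime]'2024-05-01T11:00:00'; OriginatingDsa = 'bbbb'; Value = 'written first, skewed clock' }
(Resolve-AttributeConflict -A $fromDc01 -B $fromDc02).Value
```

The example resolves to the DC02 value even though, in this made-up scenario, DC01's write was the more recent one in real time. That is the "silently picks the wrong value" case, and it only arises on a version tie.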

Hands-on lab

This lab establishes a health baseline and safely exercises the diagnostic tools on a live (lab) forest, then simulates a recoverable link failure and fixes it — no destructive operations against a production forest. Run it in a two-DC lab (DC01, DC02 in corp.example.com); everything is read-only except the deliberately-created test object and the firewall rule you add and remove. Run from an elevated PowerShell prompt on DC01.

Step 1 — Baseline the forest health. Capture the “good” state so you can recognise “bad” later.

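New-Item -ItemType Directory -Path C:\Temp -Force | Out-Null  # make sure the output folder exists (harmless if it already does)
repadmin /replsummary
repadmin /showrepl * /csv > C:\Temp\baseline-showrepl.csv
dcdiag /test:Replications /test:FSMOCheck /test:KnowsOfRoleHolders /v > C:\Temp\baseline-dcdiag.txt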

Expected: /replsummary shows both DCs with 0 in the fails column and a small largest-delta (minutes). dcdiag reports passed on Replications, FSMOCheck, and KnowsOfRoleHolders.
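
The baseline CSV is easier to compare later if you reduce it to the links that report failures. The following is a minimal Python sketch, not part of the lab itself. It assumes the column names `Destination DSA`, `Source DSA`, `Naming Context`, `Number of Failures`, and `Last Failure Status` that `repadmin /showrepl * /csv` typically emits, so check the header row of your own file. The sample rows are invented for illustration and `failing_links` is a hypothetical helper name.

```python
import csv
import io

# Invented sample in the shape repadmin /showrepl * /csv usually produces.
# The column names are an assumption; verify them against your own header row.
SAMPLE = """showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,DC01,"DC=corp,DC=example,DC=com",HQ,DC02,RPC,0,0,2025-01-10 09:00:12,0
showrepl_INFO,HQ,DC02,"DC=corp,DC=example,DC=com",HQ,DC01,RPC,3,2025-01-10 09:05:40,2025-01-10 08:15:02,1722
"""


def failing_links(text):
    """Return (destination, source, naming context, failures, status) per failing link."""
    failures = []
    for row in csv.DictReader(io.StringIO(text)):
        count = int(row["Number of Failures"] or 0)
        if count > 0:
            failures.append((
                row["Destination DSA"],
                row["Source DSA"],
                row["Naming Context"],
                count,
                row["Last Failure Status"],
            ))
    return failures


# Against a real file, read the text first. A file written by the ">" redirection
# in Windows PowerShell 5.1 is usually UTF-16, so open it with encoding="utf-16".
for link in failing_links(SAMPLE):
    print(link)
```

On a healthy baseline the function returns an empty list; keep that result next to the raw CSV so a later run can be compared at a glance.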

Step 2 — Confirm FSMO placement and the GC/Infra rule.

netdom query fsmo
Get-ADDomainController -Filter * | Select-Object Name, Site, IsGlobalCatalog, OperationMasterRoles

Expected: five roles listed against their holders; note whether the Infrastructure Master is a GC (in a single-domain lab this is fine).

Step 3 — Inspect a partition’s up-to-dateness vector. See who DC01 trusts, and to what USN.

repadmin /showutdvec DC01 "DC=corp,DC=example,DC=com"

Expected: one line per DC (by invocation ID) with the highest originating USN DC01 has from it, and a timestamp. This is the UTDV made concrete.
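
To see why that vector matters, here is a toy Python model of propagation dampening. It is a sketch under a simplifying assumption: each change carries only its originating DC and originating USN. The names `Change` and `changes_to_send` are illustrative and not part of any AD API, and DC03 is a hypothetical third DC that the two-DC lab does not have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    originator: str       # stands in for the originating DC's invocation ID
    originating_usn: int
    description: str


def changes_to_send(source_changes, destination_utdv):
    """Keep only the changes the destination has not already received by some path."""
    return [
        c for c in source_changes
        if c.originating_usn > destination_utdv.get(c.originator, 0)
    ]


# DC01's vector: it already holds DC02's changes up to USN 120 and
# DC03's up to USN 45 (possibly learned through another partner).
utdv_dc01 = {"DC02": 120, "DC03": 45}

offered = [
    Change("DC02", 118, "already held"),
    Change("DC02", 121, "new"),
    Change("DC03", 45, "already held via another path"),
    Change("DC03", 46, "new"),
]

to_send = changes_to_send(offered, utdv_dc01)
for c in to_send:
    print(c.originator, c.originating_usn, c.description)
```

Only the two changes past the recorded USNs are sent; everything DC01 already holds, by whatever route it arrived, is skipped.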

Step 4 — Force a full convergence and confirm it’s clean.

repadmin /syncall DC01 /AdeP
repadmin /replsummary

Expected: /syncall reports SyncAll terminated with no errors; /replsummary still shows zero fails.

Step 5 — Simulate a link failure (block RPC) and observe 1722. Add a temporary firewall rule on DC02 blocking the RPC endpoint mapper from DC01, then force replication and watch it fail.

# On DC02 (elevated): block inbound TCP 135 from DC01 only — REMEMBER to remove this in Step 7
New-NetFirewallRule -DisplayName "LAB-block-rpc-from-DC01" -Direction Inbound `
  -Protocol TCP -LocalPort 135 -RemoteAddress <DC01-IP> -Action Block

# On DC01: try to replicate from DC02 and watch it fail
repadmin /replicate DC01 DC02 "DC=corp,DC=example,DC=com"

Expected: the /replicate fails with 1722 (The RPC server is unavailable). Confirm the layer with a bind test:

repadmin /bind DC02

Expected: the bind fails too — proving this is a network/RPC problem, not a directory one. This is the exact signal that tells you to check firewalls, not AD.

Step 6 — Confirm it’s not DNS (rule out the other network cause).

dcdiag /s:DC01 /test:DNS /v
nslookup DC02.corp.example.com

Expected: DNS resolves fine and dcdiag /test:DNS passes — so with 1722 present but DNS healthy, you’ve correctly isolated it to RPC/firewall.

Step 7 — Remove the block and prove recovery.

# On DC02: remove the lab firewall rule
Remove-NetFirewallRule -DisplayName "LAB-block-rpc-from-DC01"

# On DC01: replicate again — now it succeeds
repadmin /replicate DC01 DC02 "DC=corp,DC=example,DC=com"
repadmin /showrepl DC01 /verbose | Select-String "was successful","failed"

Expected: the /replicate succeeds; /showrepl shows the DC02 neighbour with was successful.

Step 8 — End-to-end origination proof. Create a tagged test object on DC01, force a sync, confirm it lands on DC02, then remove it.

# On DC01: originate a throwaway contact object
$name = "repltest-$(Get-Date -f yyyyMMddHHmmss)"
New-ADObject -Type contact -Name $name -Path "CN=Users,DC=corp,DC=example,DC=com" `
  -OtherAttributes @{ description = "repltest" }
repadmin /syncall DC01 /AdeP

# On DC02 (or targeting it): confirm it exists, then clean up
Get-ADObject -Server DC02 -Filter "description -eq 'repltest'" -SearchBase "CN=Users,DC=corp,DC=example,DC=com" |
  Tee-Object -Variable o | Format-Table Name, ObjectGUID
$o | Remove-ADObject -Confirm:$false

Expected: the object created on DC01 is returned by the query against DC02 (convergence proven), then is removed.

Validation checklist — the lab steps mapped to what each proves:

Step	What you did	What it proves	Real-incident analogue
1–2	Baseline `/replsummary`, FSMO, GC rule	You have a “good” reference to compare against	The first thing to capture before touching anything
3	`/showutdvec`	The UTDV is real and inspectable	Diagnosing “why won’t it re-send”
4	`/syncall /AdeP`	You can force and confirm convergence	Post-fix verification
5	Block RPC → 1722; `/bind` fails	1722 is a network error, provable by bind	The 90-second “is it network or directory?” call
6	`dcdiag /test:DNS` passes	Isolating RPC vs DNS within the network layer	Ruling out the #1 false lead
7	Remove block → success	The fix was network, and it converged	Confirming the real cause was outside AD
8	Originate + converge a test object	End-to-end replication genuinely works	The final proof any recovery is truly done

Teardown. The lab leaves nothing persistent if you completed Steps 7 and 8 (firewall rule removed, test object deleted). Verify:

Get-NetFirewallRule -DisplayName "LAB-block-rpc-from-DC01" -ErrorAction SilentlyContinue  # should return nothing
Get-ADObject -Filter "description -eq 'repltest'" -SearchBase "CN=Users,DC=corp,DC=example,DC=com"  # should return nothing
Remove-Item C:\Temp\baseline-*.csv, C:\Temp\baseline-*.txt -ErrorAction SilentlyContinue

Common mistakes & troubleshooting

This is the playbook — the part you bookmark and open at 02:14. First the scannable symptom→cause→confirm→fix table, then the expanded reasoning for the entries that bite hardest. Read the prose once; keep the table open during the incident.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	Link fails with 8524 on one or many partners	DNS: can’t resolve the source DC’s CNAME/A record	`dcdiag /test:DNS /v`; `nslookup <guid>._msdcs.<forest>`	Register records (`ipconfig /registerdns`; restart Netlogon); fix the `_msdcs` zone/delegation
2	Link fails with 1722 (RPC server unavailable)	Firewall blocking 135 or the dynamic RPC port range; or a dead partner	`repadmin /bind <src>`; `portqry -n <src> -e 135`; is the DC up?	Open RPC endpoint mapper + dynamic range; if the DC is dead, metadata-cleanup it
3	Link fails with 1256 (remote system not available)	Partner offline / unreachable (routing, powered off)	`Test-NetConnection <src> -Port 135`; ping	Restore connectivity, or remove the dead DC via metadata cleanup
4	8606 storm; “insufficient attributes to create an object”	Lingering objects on a DC offline past the tombstone lifetime	`repadmin /removelingeringobjects <staleDC> <goodGUID> "<NC>" /advisory_mode`; review event 1946	Advisory review, then remove for real per NC; enable `+strict` fleet-wide
5	A DC won’t originate; partners reject it with 8456/8457	USN rollback: DB rewound, invocation ID unchanged	Event 2095; `reg query ...NTDS\Parameters /v "Dsa Not Writable"` (non-zero)	Never clear the flag. Force-demote, metadata-cleanup, rebuild clean
6	`dcdiag` shows a DC with no inbound partners for an NC	KCC/ISTG not building connection objects (site/subnet or KCC error)	`repadmin /showconn <DC>`; `dcdiag /test:KccEvent /v`	Fix subnet-to-site mapping; `repadmin /kcc <DC>` to rebuild; check ISTG
7	Can’t create users/groups/computers on a DC	Local RID pool exhausted while the RID Master is unreachable	`dcdiag /test:RidManager /v`; `netdom query fsmo`	Restore/seize the RID Master; the DC then refills its pool
8	Schema change / functional-level raise fails	Schema Master unreachable or down	`dcdiag /test:KnowsOfRoleHolders /v`; `netdom query fsmo`	Bring the holder online, or transfer/seize the Schema Master
9	Widespread time skew; Kerberos “clock skew” (KRB_AP_ERR_SKEW)	PDC Emulator down/misconfigured as the time root	`w32tm /query /status`; `netdom query fsmo`	Restore/seize PDCe; re-point the time hierarchy to external NTP
10	Replication fails with 8451 (database error)	ESE/Jet error or corruption in NTDS.dit on the destination	Directory Service event log for the ESE error code	If transient, retry; if persistent, restore that DC from system-state backup
11	Link fails with 8453 (replication access denied)	Missing replication rights or a broken secure channel	`dcdiag /test:MachineAccount /v`; check ACEs on the NC head	Reset the secure channel (`netdom resetpwd`); restore replication permissions
12	Objects converge only every few hours (inter-site)	Normal poll interval, or a broken bridgehead/site link	`repadmin /showrepl` inter-site neighbours; `dcdiag /test:Intersite`	If truly stuck, fix the bridgehead/site link; else it’s the 180-min interval
13	GPOs/logon scripts differ across DCs	SYSVOL (DFSR) backlog or a stalled DFSR after dirty shutdown	`Get-DfsrBacklog`; DFSR event log (2213); `dfsrmig /getglobalstate`	Resume DFSR; clear the blocking file; complete any stuck migration
14	`/replsummary` fails against a DC that’s actually up	Stale metadata for a removed DC still in the topology	`repadmin /showrepl` referencing a ghost DSA GUID	Metadata cleanup for the removed DC; delete stale Sites-and-Services/DNS entries
15	Deleted/disabled account is authenticating in one site	Re-animated lingering object (a live copy on a stale DC)	`repadmin /showobjmeta` on the object across DCs	Lingering-object removal (advisory first); investigate as a security event
16	`repadmin` reports 1908 (“can’t find the DC for this domain”)	DC locator can’t resolve a DC — missing SRV records	`nltest /dsgetdc:<domain>`; `dcdiag /test:DNS`	Re-register SRV records (restart Netlogon); fix the `_msdcs` SRV set

The expanded form, for the entries where the reasoning is the whole game:

1. Link fails with 8524 (DNS lookup failure). This is the single most common “replication” ticket and it is not a directory problem. Replication locates the source DC through its <DSA-GUID>._msdcs.<forest-root> CNAME (the GUID is the objectGUID of the DC’s NTDS Settings object, not its invocation ID) and the A record that CNAME points to; if either is missing or stale, the destination can’t even find the source. Confirm: dcdiag /test:DNS /v names the missing record; nslookup <guid>._msdcs.<forest-root> fails. Fix: on the source DC, ipconfig /registerdns and Restart-Service Netlogon to re-register the records; ensure the _msdcs.<forest> zone exists, is AD-integrated, and its delegation from the parent zone is intact. Do not touch the directory: this is DNS, top to bottom.

2. Link fails with 1722 (RPC server unavailable). Replication rides RPC: the destination hits the endpoint mapper on 135/TCP, gets back a dynamic high port, and replicates over it. A firewall that blocks 135 or the dynamic range (or a genuinely dead partner) produces 1722. Confirm: repadmin /bind <src> fails (proving it’s connectivity, not data); portqry -n <src> -e 135 shows the endpoint mapper unreachable or the dynamic port filtered. Fix: open the RPC endpoint mapper and the dynamic port range between DCs (or pin AD replication to a static port via the TCP/IP Port registry value and open just that); if the partner is actually dead, this becomes a metadata-cleanup job, not a firewall one.

4. 8606 storm — lingering objects. A DC offline past the tombstone lifetime returns holding objects everyone else deleted; offering references to them triggers 8606 on partners (with strict replication) or silently reanimates them (without it). Confirm: run repadmin /removelingeringobjects <staleDC> <good-DC-GUID> "<NC>" /advisory_mode and read event 1946 for the candidate list. Fix: verify the reference DC is authoritative and current, remove per partition (Config, Schema, Domain, DNS app partitions), then enable repadmin /regkey * +strict everywhere. The failure mode to avoid is deleting live objects by pointing at a stale reference DC — advisory mode exists precisely to make you look before you delete.

5. USN rollback — event 2095, Dsa Not Writable. The DC’s database moved backward without its invocation ID changing, so partners dampen its “new” changes and it silently diverges; it quarantines itself and rejects outbound replication with 8456. Confirm: event 2095 in the Directory Service log; reg query "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Dsa Not Writable" returns a non-zero value. Fix — the only correct one: force-demote the DC (Uninstall-ADDSDomainController -ForceRemoval), metadata-cleanup its remnants from a healthy DC, and rebuild a clean OS as a new DC (or restore from a supported system-state backup). Clearing the flag is the fatal mistake — it re-injects the rolled-back state and corrupts the forest permanently.

6. No inbound partners — KCC/ISTG. If a DC has no connection object for a partition, no replication can happen — the plumbing exists but there’s no pipe. This usually traces to a wrong subnet-to-site mapping (the DC or its partners are in the wrong site, so the KCC builds nonsensical or no links) or a KCC that’s erroring. Confirm: repadmin /showconn <DC> shows missing inbound connections; dcdiag /test:KccEvent /v surfaces KCC errors; check the subnet-to-site associations in Sites and Services. Fix: correct the subnet objects and their site assignment, then repadmin /kcc <DC> to force an immediate topology recompute; confirm the ISTG is healthy for inter-site links.

9. Time skew and Kerberos. Kerberos rejects tickets when the clock skew between client and DC exceeds the tolerance (default 5 minutes). If the PDC Emulator (the domain’s time root) is down or misconfigured to sync from a bad source, the whole domain’s time can drift and authentication breaks with KRB_AP_ERR_SKEW. Confirm: w32tm /query /status on DCs and the PDCe; netdom query fsmo to find the PDCe. Fix: restore or seize the PDCe, and configure the (forest-root) PDCe to sync from a reliable external NTP source with all other DCs following the domain hierarchy — see the hybrid time-sync article. This is why time and FSMO are entangled: the PDCe is the time authority.

13. SYSVOL divergence (DFSR). The directory can replicate perfectly while SYSVOL — a separate DFSR engine — is stalled, so GPO edits made on one DC never reach others and users in different sites get different policy. Confirm: Get-DfsrBacklog shows a growing backlog; the DFSR event log shows event 2213 (paused after a dirty shutdown) or another error; dfsrmig /getglobalstate should read Eliminated. Fix: resume DFSR (the documented ResumeReplication method for a 2213), clear whatever file is blocking, and complete any migration stuck at Prepared/Redirected. Never “just delete SYSVOL” — re-seed non-authoritatively from a good DC.
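
The symptom table above reduces to a lookup: each error code belongs to a layer, and network-layer codes get checked before anything touches the directory. A small Python sketch of that ordering follows. The layer labels and the helper names `triage` and `triage_order` are my own shorthand; the codes and confirming checks are copied from the table.

```python
# Error code -> (layer, first confirming check), taken from the symptom table.
TRIAGE = {
    8524: ("DNS", "dcdiag /test:DNS /v"),
    1908: ("DNS", "nltest /dsgetdc:<domain>"),
    1722: ("network/RPC", "repadmin /bind <src>"),
    1256: ("network/RPC", "Test-NetConnection <src> -Port 135"),
    8606: ("directory: lingering objects",
           'repadmin /removelingeringobjects <staleDC> <goodGUID> "<NC>" /advisory_mode'),
    8456: ("directory: USN rollback", "event 2095 in the Directory Service log"),
    8457: ("directory: USN rollback", "event 2095 in the Directory Service log"),
    8451: ("database", "Directory Service event log for the ESE error code"),
    8453: ("security", "dcdiag /test:MachineAccount /v"),
}

NETWORK_LAYERS = {"DNS", "network/RPC"}


def triage(code):
    """Return (layer, first check) for a code, or a cautious default."""
    return TRIAGE.get(code, ("unknown", "look the code up before changing anything"))


def triage_order(codes):
    """Order observed codes so network-layer causes are checked before directory ones."""
    return sorted(set(codes), key=lambda c: (triage(c)[0] not in NETWORK_LAYERS, c))


observed = [8606, 1722, 8524, 8606]
for code in triage_order(observed):
    layer, check = triage(code)
    print(code, layer, check)
```

With 8606, 1722, and 8524 all present, the sketch lists 1722 and 8524 first, which mirrors the discipline of ruling out DNS and RPC before any directory surgery.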

Best practices

Baseline health continuously and alert on non-zero failures. Schedule repadmin /replsummary and dcdiag /e (e.g. via a scheduled task feeding a monitor) so you learn about a broken link in minutes, not three weeks later when a password reset “doesn’t take.” The whole point is to catch silent divergence before it becomes lingering objects.
Enable strict replication consistency (+strict) on every DC. It converts silent forest-poisoning into a loud, contained, single-link halt (event 1988). It’s the cheapest insurance in this article and should be on everywhere, especially in upgraded forests where it may default off.
Never revert, clone, image, or P2V a running DC by hand. Use VM-GenerationID-aware hosts (2012+), supported cloning (DCCloneConfig.xml), and system-state backups. This one policy prevents the entire USN-rollback disaster class.
Rule out DNS and RPC before touching the directory. For any link failure, run dcdiag /test:DNS and repadmin /bind first. 8524 and 1722 are network errors wearing AD codes; directory surgery for a DNS problem is how good afternoons become bad ones.
Get the site and subnet topology right. Map every subnet to a site; a DC or client in the wrong site makes the KCC build bad or missing links and pushes authentication to a distant DC. Review it whenever you add subnets or DCs.
Protect the PDC Emulator hardest and re-point time after any PDCe move. It’s the time root, the password/lockout authority, and the GPO-edit target — the most user-visible FSMO. After seizing or transferring it, reconfigure the external NTP source and validate with w32tm /query /status.
Transfer when you can, seize only when the holder is dead for good. A seize of the RID or Schema Master cannot be safely undone; never seize a role from a DC you might power back on. If in doubt whether the holder is truly dead, it isn’t: fix the network and transfer.
Always run lingering-object removal in advisory mode first, verify the reference DC is authoritative and current, and review the event 1946 list object by object. Treat any re-animated user/group as a security incident, not just a replication one.
Complete SYSVOL migration to DFSR (Eliminated) and monitor DFSR backlog. A directory that replicates while SYSVOL doesn’t gives you divergent GPOs — a silent, confusing failure. Don’t leave a migration stuck at Prepared/Redirected.
Keep supported, tested AD backups (system state) with a retention shorter than the tombstone lifetime. A backup older than the tombstone lifetime (180 days by default) is itself a lingering-object time bomb on restore. Test restores; a backup you can’t restore is a rumour.
Force and verify convergence after every fix with repadmin /syncall /AdeP, then re-run /replsummary and a test-object round-trip. Recovery isn’t done until an originating write demonstrably reaches a distant DC.
Document FSMO placement and keep it deliberate. Know which DC holds each role, keep the Infrastructure Master off GCs in multi-domain forests (unless all DCs are GCs), and keep the record current so an incident starts from knowledge, not discovery.
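
The backup-retention rule in the list above is simple date arithmetic, and it is worth encoding in whatever checks your backup catalogue. A minimal Python sketch, assuming the 180-day default tombstone lifetime cited in this article (confirm the actual value in your forest); `backup_is_safe_to_restore` is a hypothetical helper name and the dates are invented.

```python
from datetime import date, timedelta

# Default cited in this article; confirm the configured value for your forest.
DEFAULT_TOMBSTONE_LIFETIME_DAYS = 180


def backup_is_safe_to_restore(backup_date, today,
                              tombstone_days=DEFAULT_TOMBSTONE_LIFETIME_DAYS):
    """True if the backup is younger than the tombstone lifetime."""
    return (today - backup_date) < timedelta(days=tombstone_days)


today = date(2025, 6, 30)
print(backup_is_safe_to_restore(date(2025, 6, 1), today))   # True: 29 days old
print(backup_is_safe_to_restore(date(2024, 12, 1), today))  # False: 211 days old
```

A backup that fails this check should be treated as unusable for a DC restore, because objects deleted since it was taken may already have been garbage-collected everywhere else.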

Security notes

Lingering objects are a security event, not just a replication one. A re-animated user, a re-populated privileged group, or a “deleted” service account back in the directory is exactly the kind of thing an attacker (or a mistake) hides behind. When you find and remove lingering objects, investigate what was re-animated and whether it was used, and treat privileged principals with special scrutiny.
Replication rights are powerful — audit them. The Replicating Directory Changes / Replicating Directory Changes All permissions enable DCSync, the attack that pulls password hashes from a DC by impersonating a DC. Only DCs and specific sync accounts (e.g. Entra Connect) should hold them; audit for any other principal that does. A replication problem is a good moment to also verify who can replicate.
Guard the FSMO holders and the transfer/seize capability. Seizing FSMO roles is a high-privilege operation; the ability to move roles, run ntdsutil, and force-demote DCs must be tightly held (tiered admin). A malicious or mistaken seize of the RID or Schema Master is as damaging as any data breach.
Protect DC backups and the restore path. A system-state backup contains everything needed to reconstruct authentication material; a stolen or tampered DC backup is a full compromise. Encrypt backups, restrict who can restore, and never restore an unverified image (that’s also the USN-rollback path).
Secure the RPC and DNS surface. The dynamic RPC range and 135/TCP between DCs should be reachable only among DCs and management hosts, not the general network — segment it. DNS records under _msdcs are a map of your DCs; treat DNS integrity as part of the security boundary, since poisoned SRV/CNAME records misdirect authentication.
Keep DFSR/SYSVOL integrity in scope. SYSVOL carries logon scripts and GPOs — code and policy that run with high privilege on every domain-joined machine. A DFSR compromise or a divergent SYSVOL is a policy-injection vector; monitor SYSVOL content changes and DFSR health.
Prefer least-privilege sync accounts and rotate them. In hybrid estates, the on-premises account Entra Connect uses to read the directory (its AD DS connector account) should be scoped to exactly what it needs; a diverging or lingering-object-laden directory synced upward can propagate the mess into the cloud tenant.

Cost & sizing

Active Directory itself has no license SKU beyond the Windows Server licenses on your DCs, so “cost” here is about sizing DCs and site topology for reliable replication, plus the operational cost of getting it wrong.

Number and placement of DCs. The dominant reliability lever. Every site with local authentication needs enough DCs to survive one failing (two per critical site is the practical floor); a single-DC site means every restart or patch is an outage and a snapshot revert there is a USN-rollback landmine. Each DC is a Windows Server license plus a modest VM — a few thousand rupees a month of infrastructure — cheap insurance against the multi-hour incidents in this article.
DC sizing. DCs are not resource-hungry for typical directories: 2–4 vCPU and 8–16 GB RAM comfortably serves tens of thousands of objects, with the working set (ideally the whole NTDS.dit) cached in RAM for fast LDAP/replication. The one thing to not skimp on is disk: put NTDS.dit and its logs on fast, reliable storage (SSD), because ESE performance and recovery time depend on it, and a slow disk turns a routine reboot into a long, tense wait.
Site links and WAN. Inter-site replication is pull-based on a schedule (default 180-minute interval) precisely to bound WAN cost; for 30,000 objects the steady-state delta traffic is small (kilobytes per cycle), but a re-sync after a lingering-object cleanup or a rebuild can be large. Size site-link schedules to your convergence tolerance, not to minimise a WAN bill that replication barely touches at steady state.
The real cost is the incident. The infrastructure to prevent these failures — a second DC per site, GenID-aware hosts, monitoring — costs far less than one mishandled USN rollback (a forest rebuild is days of senior time and a business-wide authentication risk) or one lingering-object security investigation. Right-size for resilience, not for the smallest DC count that “works” on a quiet day.

A rough sizing picture for a mid-size single-forest estate:

Item	Guidance	Rough cost signal	What it buys you
DCs per critical site	≥ 2	2× Windows Server license + small VM each	Survives one DC failing; patching without outage
DC vCPU / RAM	2–4 vCPU / 8–16 GB	Small VM tier	NTDS.dit cached; fast LDAP + replication
DC disk	SSD for NTDS.dit + logs	Modest premium over HDD	ESE performance; fast recovery/reboot
GenID-aware virtualisation host	2012+ hypervisor	Usually already have it	Snapshot reverts self-heal, not USN-rollback
Monitoring (`replsummary`/`dcdiag`)	Scheduled + alerting	Time to set up	Catch divergence in minutes, not weeks
System-state backups	Retention < tombstone lifetime	Backup storage	Restorable DCs without a rollback risk

Interview & exam questions

1. Explain how AD decides which change wins when two DCs modify the same attribute concurrently. Conflict resolution is deterministic and needs no coordination: the higher per-attribute version number wins; if versions tie, the later timestamp wins; if those tie, the higher originating invocation-ID GUID wins. Because the rule is identical on every DC, all replicas converge to the same value. This is also why time sync matters — a skewed clock can make the “wrong” write win the timestamp tiebreak.
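
The three-step rule in that answer can be sketched as a tuple comparison. This is a toy Python model, not AD’s implementation: `Stamp` and `winner` are illustrative names, and Python’s ordering of `UUID` values is only a stand-in for however the directory actually compares GUIDs.

```python
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class Stamp:
    version: int              # per-attribute version number
    timestamp: datetime       # time of the originating write
    originating_guid: UUID    # identifies the originating database instance


def winner(a, b):
    """Pick the surviving write: version, then timestamp, then GUID."""
    def key(s):
        return (s.version, s.timestamp, s.originating_guid)
    return a if key(a) >= key(b) else b


low_guid = UUID("00000000-0000-0000-0000-0000000000aa")
high_guid = UUID("00000000-0000-0000-0000-0000000000bb")
early = datetime(2025, 1, 10, 9, 0, 0)
late = datetime(2025, 1, 10, 9, 0, 5)

# Higher version wins even though its timestamp is earlier.
assert winner(Stamp(3, early, low_guid), Stamp(2, late, high_guid)).version == 3
# Versions tie: the later timestamp wins.
assert winner(Stamp(2, early, high_guid), Stamp(2, late, low_guid)).timestamp == late
# Versions and timestamps tie: the higher GUID wins.
assert winner(Stamp(2, late, low_guid), Stamp(2, late, high_guid)).originating_guid == high_guid
print("all three tiebreak levels behave as described")
```

Because the comparison depends only on the stamps themselves, every DC that evaluates the same pair reaches the same result, which is what lets replicas converge without coordination. The second case also shows the clock-skew risk: a DC whose clock runs fast wins the timestamp tiebreak whether or not its write was really later.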

2. What is the difference between the high-watermark vector and the up-to-dateness vector? The high-watermark (HWMV) is per-partner-per-partition: the highest USN a DC has seen from a specific source, so that source only sends changes past it (incremental pull). The up-to-dateness vector (UTDV) is per-partition: for every DC that ever originated a change (by invocation ID), the highest originating USN this DC has from any path, so the source can skip changes the destination already has indirectly (propagation dampening). HWMV = “where you and I left off”; UTDV = “what I already have regardless of source.”

3. What is USN rollback, why is it so dangerous, and how do you recover? USN rollback is when a DC’s database moves backward in USN space without its invocation ID changing — usually from a VM snapshot revert, P2V, or clone of a running DC. Partners track it by invocation ID, so they believe they already have those USNs and dampen its new changes; it silently stops originating and diverges. AD detects it (event 2095) and sets Dsa Not Writable. The only correct recovery is to force-demote the DC, metadata-clean it, and rebuild — never clear the flag, which re-injects the rolled-back state permanently.
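
The dampening effect described in that answer is easy to reproduce in a toy model. The Python sketch below assumes a partner decides what to pull purely from the highest originating USN it has recorded per invocation ID; `pulled_by_partner` and the invocation ID labels are hypothetical names for illustration.

```python
def pulled_by_partner(invocation_id, new_usns, partner_utdv):
    """USNs the partner will actually pull, given what it believes it already holds."""
    already_known = partner_utdv.get(invocation_id, 0)
    return [usn for usn in new_usns if usn > already_known]


# The partner has recorded everything from this database instance up to USN 500.
partner_utdv = {"invocation-A": 500}

# Snapshot revert: the database is back at USN 400 but still calls itself
# invocation-A, so new originating writes reuse USNs 401..403.
writes_after_revert = [401, 402, 403]

rolled_back = pulled_by_partner("invocation-A", writes_after_revert, partner_utdv)
print(rolled_back)  # [] : the partner thinks it already has these

# Supported restore: the invocation ID is reset, so the partner has no entry
# for the new instance and pulls everything it originates.
restored = pulled_by_partner("invocation-B", writes_after_revert, partner_utdv)
print(restored)  # [401, 402, 403]
```

The empty list is the silent divergence: the rolled-back DC keeps accepting writes that no partner will ever request.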

4. When do you transfer a FSMO role versus seize it? Transfer when the current holder is alive and reachable — it’s a clean, reversible handoff (Move-ADDirectoryServerOperationMasterRole). Seize (-Force) only when the holder is permanently dead and you’ve committed it will never come back online with that role, because a seizure is unilateral and an old holder returning with the same role causes split-brain — catastrophic for the RID Master (duplicate RID pools → duplicate SIDs) and Schema Master. If unsure whether the holder is truly dead, it isn’t; fix connectivity and transfer.

5. A DC is failing replication with error 8524. What layer is the problem in and what do you check? 8524 is a DNS problem, not a directory one — the destination can’t resolve the source DC’s _msdcs CNAME or A record. Confirm with dcdiag /test:DNS /v and nslookup <guid>._msdcs.<forest>. Fix DNS: re-register records on the source (ipconfig /registerdns, restart Netlogon), verify the _msdcs zone is AD-integrated and its delegation is intact. Do not touch the directory.

6. What is a lingering object, how does it arise, and how do you remove it safely? A lingering object was deleted (tombstoned and garbage-collected) on most DCs but survives on a DC that was offline longer than the tombstone lifetime (default 180 days); when it returns, offering references to the object produces 8606 on partners. Remove it with repadmin /removelingeringobjects <staleDC> <good-DC-GUID> "<NC>" /advisory_mode first (logs candidates as event 1946, deletes nothing), verify the reference DC is authoritative and current, then rerun without /advisory_mode. Enable strict replication consistency (+strict) to prevent recurrence.

7. Why can the Infrastructure Master conflict with a global catalog, and when doesn’t it matter? The Infrastructure Master updates references to objects in other domains. If its DC is also a global catalog, the GC already knows every object so nothing appears stale, and cross-domain reference updates silently stop. So in a multi-domain forest the Infrastructure Master should not be a GC — unless every DC in the domain is a GC (then there are no non-GC DCs to have stale references, so it’s moot). Single-domain forests are unaffected.
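
That rule has just enough conditions to be worth writing down as a check. A minimal Python sketch of the logic as stated in the answer; the `DC` record and `infra_master_placement_ok` are hypothetical names, and in practice the inputs would come from something like the Get-ADDomainController output used in the lab.

```python
from dataclasses import dataclass


@dataclass
class DC:
    name: str
    is_gc: bool
    is_infrastructure_master: bool = False


def infra_master_placement_ok(domain_dcs, multi_domain_forest):
    """Apply the Infrastructure Master / global catalog rule to one domain's DCs."""
    if not multi_domain_forest:
        return True                      # single-domain forests are unaffected
    if all(dc.is_gc for dc in domain_dcs):
        return True                      # every DC is a GC, so the conflict is moot
    holder = next((dc for dc in domain_dcs if dc.is_infrastructure_master), None)
    if holder is None:
        raise ValueError("no Infrastructure Master found among the DCs given")
    return not holder.is_gc              # otherwise the holder must not be a GC


mixed = [DC("DC01", is_gc=True, is_infrastructure_master=True), DC("DC02", is_gc=False)]
all_gc = [DC("DC01", is_gc=True, is_infrastructure_master=True), DC("DC02", is_gc=True)]

print(infra_master_placement_ok(mixed, multi_domain_forest=True))    # False
print(infra_master_placement_ok(all_gc, multi_domain_forest=True))   # True
print(infra_master_placement_ok(mixed, multi_domain_forest=False))   # True
```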

8. Error 1722 vs 8524 — how do you tell them apart and why does it matter? 1722 (RPC server unavailable) is a network/RPC problem — a blocked 135/TCP or dynamic RPC port, or a dead partner — confirmed by repadmin /bind <src> failing and portqry -n <src> -e 135. 8524 is a DNS resolution failure, confirmed by dcdiag /test:DNS. It matters because the fixes are entirely different (firewall/RPC vs DNS records), and both are network-layer — neither is a reason to edit the directory. Ruling out the network layer first is the core discipline.

9. What does the PDC Emulator do, and what breaks if it’s down? The PDC Emulator is the domain’s authoritative time source (the forest-root PDCe roots the whole hierarchy), the DC other DCs forward password changes to for immediate effect, the account-lockout authority, the default target for GPO edits (avoiding edit conflicts), and the legacy NT PDC. If it’s down you get time skew (Kerberos KRB_AP_ERR_SKEW), degraded password/lockout behaviour, and GPO edit conflicts — the most user-visible FSMO loss, so it’s the one to protect and restore first.
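
The time-skew failure in that answer comes down to one comparison against the Kerberos tolerance. The Python sketch below assumes the 5-minute default and uses the PDCe clock as the reference because it is the time root; `skewed_hosts`, the host names, and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta

KERBEROS_MAX_SKEW = timedelta(minutes=5)  # default tolerance; confirm your policy


def skewed_hosts(reference_time, host_times, tolerance=KERBEROS_MAX_SKEW):
    """Names of hosts whose clock differs from the reference by more than the tolerance."""
    return sorted(
        name for name, observed in host_times.items()
        if abs(observed - reference_time) > tolerance
    )


pdce_time = datetime(2025, 1, 10, 9, 0, 0)
clocks = {
    "DC02": datetime(2025, 1, 10, 9, 0, 40),       # 40 seconds ahead: within tolerance
    "BRANCH-DC": datetime(2025, 1, 10, 8, 52, 0),  # 8 minutes behind: beyond tolerance
}

print(skewed_hosts(pdce_time, clocks))  # ['BRANCH-DC']
```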

10. How do you prove that replication has fully recovered after a fix? Force convergence with repadmin /syncall /AdeP (/A all partitions, /d identify servers by distinguished name in messages, /e across site boundaries, /P push changes outward from the named DC), then confirm repadmin /replsummary shows zero failures and no multi-day deltas, and dcdiag /e /test:Replications /test:DNS /test:FSMOCheck passes enterprise-wide. The definitive test is an end-to-end origination proof: create a tagged throwaway object on one DC, sync, and confirm it appears on a distant DC, then remove it. Recovery isn’t done until an originating write demonstrably propagates.

11. What is strict replication consistency and why should it be on? With strict replication consistency (repadmin /regkey * +strict), a DC that receives a reference to an object it doesn’t have halts that inbound replication and logs event 1988 instead of silently reanimating the object. It turns a silent, forest-wide lingering-object spread into a loud, contained, single-link failure — the difference between an alert and a forest cleanup. It’s the default on clean 2003 SP1+ forests but may be off on upgraded ones, so verify and enforce it everywhere.

12. SYSVOL and the directory replicate through different engines — why does that matter? The directory (NTDS.dit) replicates via the AD replication engine; SYSVOL (GPOs and scripts) replicates via DFSR (or legacy FRS). A directory that replicates perfectly can still have divergent GPOs if DFSR is stalled — GPO edits on one DC never reach others, and users in different sites get different policy. Diagnose SYSVOL separately (Get-DfsrBacklog, DFSR event log, dfsrmig /getglobalstate); a healthy repadmin /replsummary does not prove SYSVOL is healthy.

These map primarily to AZ-800 / AZ-801 (Windows Server Hybrid Administrator) — manage Active Directory Domain Services, troubleshoot on-premises and hybrid AD DS — and to the legacy AD DS knowledge assessed across Windows Server certifications. The hybrid-identity angle (Entra Connect, DCSync/replication rights) touches SC-300 (Identity and Access Administrator). A compact cert-mapping for revision:

Question theme	Primary cert	Objective area
Replication model (USN/HWMV/UTDV, conflict)	AZ-800	Manage/troubleshoot AD DS
FSMO roles, transfer vs seize	AZ-800/AZ-801	Manage AD DS; disaster recovery
USN rollback, metadata cleanup, recovery	AZ-801	Implement AD DS disaster recovery
repadmin/dcdiag diagnostics, error codes	AZ-800	Troubleshoot AD DS
Lingering objects, strict replication, DCSync rights	AZ-801 / SC-300	Secure & recover AD; hybrid identity
Sites, site links, KCC/ISTG topology	AZ-800	Manage AD DS sites and replication

Quick check

1. A DC fails replication from one partner with error 1722, but dcdiag /test:DNS passes and nslookup resolves the partner fine. What layer is the problem in, and what single command confirms it?
2. True or false: if a domain controller shows event 2095 and Dsa Not Writable, the fastest safe fix is to clear the registry flag so replication resumes.
3. Your on-call peer wants to seize the RID Master because its holder rebooted for a monthly patch and is offline for ten minutes. What do you tell them, and why?
4. A previously-disabled user account is suddenly able to authenticate against one branch-office DC only. Name the most likely cause and the first (non-destructive) command you run.
5. You’ve fixed a broken link and repadmin /replsummary now shows zero failures. Name the one additional test that proves originating writes actually propagate end to end.

Answers

It’s a network/RPC problem (not DNS, since DNS passed) — a blocked RPC endpoint mapper/dynamic port or a firewall between the DCs. The confirming command is repadmin /bind <partner>: if the bind fails while DNS resolves, the failure is RPC connectivity, so you check firewalls/ports, not the directory.
False. Dsa Not Writable is AD protecting the forest from a USN rollback; clearing it re-injects the rolled-back database and corrupts the forest permanently. The only correct fix is to force-demote the DC, run metadata cleanup, and rebuild it clean (or restore from a supported system-state backup).
Do not seize — wait. A ten-minute patch reboot is a temporary outage, not a dead holder. Seizing the RID Master and then having the original holder come back online produces two DCs allocating overlapping RID pools (duplicate SIDs) — a catastrophic split-brain. Seize only when the holder is permanently gone and will never return with the role.
Most likely a lingering object — a re-animated (deleted) account surviving on a DC that was offline past the tombstone lifetime. First run repadmin /showobjmeta <that-DC> "<account-DN>" (and compare to a healthy DC) to confirm the object exists there; then use removelingeringobjects in advisory mode before removing it — and treat it as a security event, since a re-animated account is exactly what an attacker or mistake hides behind.
An end-to-end origination proof: create a tagged throwaway object on one DC, run repadmin /syncall /AdeP, and confirm it appears on a distant DC (then remove it). Zero failures in /replsummary means links report success; a converged test object proves an originating write genuinely reaches the far side.
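
The end-to-end proof in the last answer amounts to a polling loop: after the test object is created and repadmin /syncall /AdeP has run, ask each DC whether it holds the object until all of them do or a deadline passes. The sketch below is a minimal illustration of that loop only. The has_object callable is hypothetical: it stands in for whatever per-DC lookup you use (for example an LDAP search bound to that specific DC), and the DC names, timeout, and interval are assumptions rather than recommended values.

```python
import time
from typing import Callable, Dict, Iterable


def wait_for_convergence(
    dcs: Iterable[str],
    has_object: Callable[[str], bool],   # hypothetical per-DC lookup you supply
    timeout_s: float = 300.0,            # assumed deadline, tune for your WAN
    interval_s: float = 5.0,             # assumed pause between polling rounds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, bool]:
    """Poll every DC until each reports the test object, or the deadline passes.

    Returns a mapping of DC name -> True if that DC saw the object in time.
    """
    deadline = clock() + timeout_s
    seen = {dc: False for dc in dcs}
    while True:
        for dc in list(seen):
            if not seen[dc]:
                # Only re-query DCs that have not shown the object yet.
                seen[dc] = bool(has_object(dc))
        if all(seen.values()) or clock() >= deadline:
            return seen
        sleep(interval_s)
```

Any DC still mapped to False when the function returns is where the originating write did not arrive, even if /replsummary reports zero failures. The clock and sleep parameters are injectable only so the loop can be exercised without waiting in real time.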

Glossary

Multimaster replication — the AD model in which every writable DC accepts changes and reconciles them with all others; no single write master.
USN (Update Sequence Number) — a per-DC, monotonically increasing 64-bit counter incremented on every originating or replicated-in change; the unit of replication progress. Local to each DC.
Originating write — a change made on a DC (versus one replicated in from a partner); stamped with the DC’s invocation ID, USN, version, and timestamp.
High-watermark vector (HWMV) — per-partner-per-partition record of the highest USN a DC has received from a specific source, driving incremental replication.
Up-to-dateness vector (UTDV) — per-partition record, keyed by invocation ID, of the highest originating USN a DC has from every source by any path; enables propagation dampening.
Invocation ID — a GUID identifying a specific instance of the NTDS.dit database (not the DC); must reset on a proper restore, and an unchanged invocation ID over a rewound database causes USN rollback.
Propagation dampening — using the UTDV to avoid replication loops and redundant transfers in a mesh topology.
KCC (Knowledge Consistency Checker) — the per-DC process that builds and maintains intra-site connection objects every ~15 minutes.
ISTG (Inter-Site Topology Generator) — one DC per site that builds the inter-site connection objects across site links.
Site / site link / bridgehead — a site models network locality (a set of subnets); a site link models a WAN path with cost/schedule/interval; a bridgehead is the DC that carries inter-site replication.
Connection object — a directory object under a DC’s NTDS Settings that says “replicate inbound from partner X”; the KCC/ISTG create these.
FSMO (Flexible Single Master Operations) — the five single-master roles: Schema Master and Domain Naming Master (forest-wide); RID Master, PDC Emulator, Infrastructure Master (per domain).
RID Master — the domain FSMO that allocates pools of relative IDs (RIDs) to DCs so they can mint SIDs for new security principals.
PDC Emulator — the domain FSMO that is the authoritative time source, password-change hub, account-lockout authority, and default GPO-edit target.
Tombstone — a deleted object stripped of most attributes and marked deleted, retained so the deletion can replicate.
Tombstone lifetime — the number of days a tombstone survives before garbage collection (default 180); the isolation threshold past which a returning DC carries lingering objects.
Garbage collection — the periodic (default 12-hour) process that physically removes tombstones past their lifetime.
Lingering object — a deleted object that survives on a DC isolated longer than the tombstone lifetime; offering references to it causes error 8606.
Strict replication consistency — the setting (+strict) that makes a DC halt (event 1988) rather than reanimate a referenced object it lacks; prevents lingering-object spread.
USN rollback — a DC’s database moving backward in USN without an invocation-ID reset, causing partners to dampen its changes; signalled by event 2095 and Dsa Not Writable.
Dsa Not Writable — the NTDS registry flag AD sets to quarantine a DC that detected USN rollback; must never be cleared to “fix” replication.
Metadata cleanup — the authoritative removal of a dead DC’s objects (NTDS Settings, server object, computer account, DNS/SRV records) from the directory.
DFSR (DFS Replication) — the engine that replicates SYSVOL (GPOs and scripts) on modern forests; separate from the directory replication engine (legacy: FRS).
repadmin / dcdiag — the two primary AD diagnostic tools: repadmin reports and manipulates the replication engine; dcdiag runs correctness tests across a DC’s directory, DNS, and Kerberos roles.
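
To make the UTDV, propagation dampening, invocation ID, and USN rollback entries concrete, here is a toy model in Python. It is a deliberately simplified sketch, not how the directory service is implemented: ToyDC, Change, and revert are hypothetical names, changes are whole strings rather than per-attribute updates, and the high-watermark vector is left out. It shows only the rule the glossary describes: a destination accepts a change when its originating USN is higher than the UTDV entry for that invocation ID.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    origin_invocation_id: str  # database instance where the write originated
    origin_usn: int            # USN that instance assigned to the write
    payload: str


@dataclass
class ToyDC:
    name: str
    invocation_id: str
    usn: int = 0
    changes: List[Change] = field(default_factory=list)
    utdv: Dict[str, int] = field(default_factory=dict)  # invocation ID -> highest originating USN held

    def originate(self, payload: str) -> Change:
        """Originating write: stamp with our invocation ID and next USN."""
        self.usn += 1
        change = Change(self.invocation_id, self.usn, payload)
        self.changes.append(change)
        self.utdv[self.invocation_id] = self.usn
        return change

    def pull_from(self, source: "ToyDC") -> List[str]:
        """Inbound replication: accept only changes our UTDV does not cover."""
        fresh = [c for c in source.changes
                 if c.origin_usn > self.utdv.get(c.origin_invocation_id, 0)]
        for c in fresh:
            self.usn += 1  # replicated-in changes also advance the local USN
            self.changes.append(c)
            self.utdv[c.origin_invocation_id] = max(
                self.utdv.get(c.origin_invocation_id, 0), c.origin_usn)
        return [c.payload for c in fresh]


def revert(snapshot: ToyDC, new_invocation_id: Optional[str] = None) -> ToyDC:
    """Bring back an old database copy; a proper restore also resets the invocation ID."""
    dc = copy.deepcopy(snapshot)
    if new_invocation_id is not None:
        dc.invocation_id = new_invocation_id
    return dc


if __name__ == "__main__":
    dc1, dc2, dc3 = ToyDC("DC1", "inv-A"), ToyDC("DC2", "inv-B"), ToyDC("DC3", "inv-C")
    dc1.originate("create alice")
    dc1.originate("create bob")
    dc2.pull_from(dc1)
    dc3.pull_from(dc1)
    print("dampened:", dc3.pull_from(dc2))        # DC3 already holds inv-A up to USN 2

    snapshot = copy.deepcopy(dc1)                 # image taken at USN 2
    dc1.originate("disable carol")                # USN 3, replicated to DC2
    dc2.pull_from(dc1)

    rolled = revert(snapshot)                     # same invocation ID: USN rollback
    rolled.originate("reset dave")                # reuses USN 3 under inv-A
    print("after rollback:", dc2.pull_from(rolled))

    restored = revert(snapshot, "inv-A2")         # invocation ID reset on restore
    restored.originate("reset dave")
    print("after proper restore:", dc2.pull_from(restored))
```

In the rollback case DC2 already records USN 3 for inv-A, so the reused USN is dampened and the new write never leaves the reverted DC. With a fresh invocation ID the same write is accepted, which is why a supported restore resets it.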

Next steps

You can now triage any replication or FSMO failure top-down — network, then topology, then data — and act with the tool that respects the invariants. Build outward:

Prerequisite: Building an AD DS Forest the Right Way: Deployment, FSMO, and a Tiered Admin Model — design the topology and role placement you’re troubleshooting here, and the tiered-admin model that keeps seize/transfer in the right hands.
Related: Highly Available DNS and DHCP on Windows Server, End to End — because most “replication” failures are DNS failures; get the _msdcs zone and SRV records right so 8524 never happens.
Related: Accurate Hybrid Time Sync: chrony on Linux and w32time in Active Directory — the PDC Emulator’s time hierarchy that Kerberos (and conflict-resolution tiebreaks) depend on.
Related: Resilient File Services with DFS Namespaces and DFS Replication — the DFSR engine that also carries SYSVOL, and how to diagnose its backlog and recovery.
Related: Microsoft Entra Connect Sync Deep Dive: Designing Hybrid Identity with PHS, PTA, and Seamless SSO — where a diverging on-prem directory becomes a diverging cloud one, and why you pause sync during a cleanup.
Related: Hardening SMB and Enabling Credential Guard to Block Lateral Movement — because replication rights (DCSync) and lingering objects are security surfaces, not just availability ones.