Diagnosing AD Replication and FSMO Failures with repadmin and dcdiag

When Active Directory replication breaks, nothing fails loudly at first. Logons keep working off cached credentials and GPOs keep applying from the last good SYSVOL copy, and then three weeks later a password reset does not take on half your domain controllers, a Kerberos ticket gets issued against a stale object, or repadmin starts spitting 8606 and the forest is quietly diverging. By then you may already have lingering objects, and if you let a DC roll back its database you can poison the topology.

This is the triage playbook I run when a replication or FSMO problem lands. It assumes Windows Server 2019/2022/2025 DCs and Domain Admin (plus Enterprise Admin for forest-wide operations) rights. Every command here is read-only until the section explicitly says otherwise. Do not run destructive recovery steps until you understand which DC holds the authoritative copy of the data - that judgment is the entire job.

1. How multi-master replication actually works

You cannot diagnose what you cannot model. Three concepts carry almost every troubleshooting decision.

USNs (Update Sequence Numbers). Every DC keeps a monotonically increasing 64-bit counter. Each originating write (a real change made on that DC) and each replicated-in change bumps the local USN and stamps the affected attribute. USNs are local - DC-A’s USN 50,000 has nothing to do with DC-B’s USN 50,000.

The up-to-dateness vector (UTDV). Each DC tracks, per source DC, the highest USN it has successfully received from that source, keyed by the source DC’s Invocation ID (a GUID identifying the specific copy of the NTDS database). When DC-A asks DC-B for changes, it sends its UTDV so DC-B only ships deltas. This is what makes replication efficient and convergent instead of a full re-sync every cycle.

Propagation dampening. Combined with per-attribute version numbers and originating-DC stamps, the UTDV lets AD avoid both endless loops and redundant transfers in a mesh topology. The Knowledge Consistency Checker (KCC) builds and maintains the connection objects that define who replicates from whom.

The Invocation ID is the load-bearing detail for the scariest failure here. If a DC’s database is restored incorrectly (a VM snapshot rollback, a P2V of a live DC, a non-AD-aware restore), the Invocation ID does not change but the USN counter jumps backward. Partners believe they already hold those USNs from that Invocation ID, dampen the “new” changes, and the rolled-back DC's subsequent updates silently never leave it. That is USN rollback, covered in step 6.
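To make the dampening behaviour concrete, here is a minimal toy model in Python. It is a sketch, not AD's actual algorithm: `ToyDC`, `rollback`, and `demo` are hypothetical names invented for illustration, and the model keeps only a USN counter, an Invocation ID, and a per-source high-water mark. It shows why a rolled-back DC's new writes are filtered out by a partner's up-to-dateness vector unless the Invocation ID changes.

```python
"""Toy model of UTDV filtering and USN rollback. Illustrative only, not real AD code."""
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    name: str
    invocation_id: str
    usn: int = 0
    # Changes held by this DC as (originating invocation ID, USN, payload).
    changes: list = field(default_factory=list)
    # Up-to-dateness vector: source invocation ID -> highest USN received.
    utdv: dict = field(default_factory=dict)
    received: list = field(default_factory=list)

    def write(self, payload):
        """Originating write: bump the local USN and stamp the change."""
        self.usn += 1
        self.changes.append((self.invocation_id, self.usn, payload))

    def pull_from(self, source):
        """Accept only changes above our high-water mark for their invocation ID."""
        shipped = []
        for inv_id, usn, payload in source.changes:
            if usn > self.utdv.get(inv_id, 0):
                self.utdv[inv_id] = usn
                self.received.append(payload)
                shipped.append(payload)
        return shipped


def rollback(dc, to_usn, reset_invocation_id=False):
    """Simulate a revert: the USN counter and the change log both go backward."""
    dc.usn = to_usn
    dc.changes = [c for c in dc.changes if c[1] <= to_usn]
    if reset_invocation_id:
        # What a supported restore does: a new database identity for new writes.
        dc.invocation_id = dc.invocation_id + "-restored"


def demo(reset_invocation_id):
    a = ToyDC("DC-A", "inv-A")
    b = ToyDC("DC-B", "inv-B")
    for i in range(5):
        a.write(f"change-{i}")
    b.pull_from(a)                      # B now holds inv-A up to USN 5
    rollback(a, to_usn=3, reset_invocation_id=reset_invocation_id)
    a.write("post-rollback-1")          # reuses USN 4
    a.write("post-rollback-2")          # reuses USN 5
    return b.pull_from(a)


if __name__ == "__main__":
    print(demo(reset_invocation_id=False))  # -> []
    print(demo(reset_invocation_id=True))   # -> ['post-rollback-1', 'post-rollback-2']
```

In the first run the partner already believes it has USNs 4 and 5 from `inv-A`, so the two new writes never replicate. In the second, the new Invocation ID gives them a fresh high-water mark and they ship normally.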

2. Reading replication health with repadmin

Start wide, then drill in. repadmin /replsummary is the one-screen forest health check.

repadmin /replsummary

It returns a source-side and destination-side table with largest delta, fails/total, and an error percentage per DC. Any non-zero fails column or a largest delta measured in days is your starting thread.

Next, get the full per-partition, per-partner state from the DC you suspect:

repadmin /showrepl DC01 /verbose

Read it partition by partition (Schema, Configuration, the domain NC, and any application partitions like the DNS zones). For each inbound neighbor you want to see Last attempt @ ... was successful. A line like this is the smoking gun:

Last attempt @ 2026-06-07 02:14:31 failed, result 8606 (0x219e):
    Insufficient attributes were given to create an object.

To sweep the whole forest at once and dump it as parseable CSV:

repadmin /showrepl * /csv > C:\Temp\showrepl.csv
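Once the CSV exists, a few lines of Python can rank the failing links. This is a rough sketch that assumes the column headers shown in `SAMPLE` (the layout repadmin commonly emits; verify against your own file). The sample rows and the helper name `failing_links` are made up for illustration.

```python
import csv
import io

# Assumed header layout of `repadmin /showrepl * /csv`; the rows are invented sample data.
SAMPLE = """\
showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,HQ,DC01,"DC=corp,DC=example,DC=com",HQ,DC02,RPC,0,0,2026-06-07 02:14:31,0
showrepl_INFO,HQ,DC01,"DC=corp,DC=example,DC=com",Branch,DC04,RPC,12,2026-06-07 02:14:31,2026-05-17 01:00:02,8606
"""


def failing_links(csv_text):
    """Return (destination, source, naming context, failures, status), worst first."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            fails = int(row.get("Number of Failures") or 0)
        except ValueError:
            continue  # skip rows that do not carry a numeric failure count
        if fails > 0:
            bad.append((
                row.get("Destination DSA", ""),
                row.get("Source DSA", ""),
                row.get("Naming Context", ""),
                fails,
                row.get("Last Failure Status", ""),
            ))
    return sorted(bad, key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    for dest, src, nc, fails, status in failing_links(SAMPLE):
        print(f"{dest} <- {src}  {nc}  fails={fails}  status={status}")
```

For a real file, read it with the right encoding first: output redirected from Windows PowerShell 5.1 is typically UTF-16, so something like `open(path, encoding="utf-16").read()` may be needed before passing the text in.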

A few more I reach for constantly:

| Command | What it tells you |
| --- | --- |
| repadmin /replsummary | One-line health per DC, error percentages, largest delta |
| repadmin /showrepl <DC> /verbose | Every inbound partner, per partition, with last success/failure |
| repadmin /queue <DC> | Pending inbound replication work; a growing queue means it is stuck |
| repadmin /showutdvec <DC> <NC-DN> | The up-to-dateness vector for a partition - who this DC trusts and to what USN |
| repadmin /bind <DC> | Confirms RPC bind to the AD replication interface (rules out connectivity) |
| repadmin /showobjmeta <DC> "<object-DN>" | Per-attribute version/originating-DC/USN metadata for one object |

/showobjmeta is invaluable when one object (often a specific user or a GPO) disagrees across DCs - it shows you exactly which DC originated the winning version of each attribute.

3. Confirming with dcdiag

repadmin tells you replication mechanics; dcdiag runs a battery of correctness tests against the DC’s role as a directory server, DNS resolver, and Kerberos endpoint.

Run the replication-focused tests first:

dcdiag /test:Replications /test:Intersite /v

Then the DNS suite, because the overwhelming majority of “replication” problems are actually name-resolution problems - a DC cannot replicate from a partner it cannot resolve or whose CNAME (<GUID>._msdcs.<forest>) does not exist.

dcdiag /test:DNS /v /e

/e runs the tests against every DC in the enterprise; /s:<DC> targets one. For a fast, broad smoke test of a single DC, run the connectivity, advertising, and role-holder tests together:

dcdiag /test:Connectivity /test:Advertising /test:KnowsOfRoleHolders /test:FSMOCheck /v

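KnowsOfRoleHolders verifies that this DC can locate every FSMO role holder. FSMOCheck confirms it can find the supporting services it depends on (a global catalog, the PDC, a KDC, and a time server). Both matter before you decide whether a role needs transferring or seizing (step 7).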

4. Decoding the common errors

The same handful of error codes account for most cases. Memorize the meaning, not just the number.

8606 - “Insufficient attributes were given to create an object.” The destination was sent a change to an object it does not have. This is the canonical lingering-object symptom: a source DC is replicating a change to an object that the destination has already deleted and garbage-collected. Go to step 5.

8451 - “The replication operation encountered a database error.” A problem in the NTDS (ESE) database on the destination DC. Check the Directory Service event log for the underlying ESE error. Sometimes transient; if persistent, you are looking at corruption and a restore-from-backup conversation, not a config fix.

8524 - “The DSA operation is unable to proceed because of a DNS lookup failure.” Pure DNS - the destination cannot resolve the source DC’s CNAME or A record. Fix DNS (dcdiag /test:DNS); do not touch the directory.

1722 (RPC server is unavailable) / 1256 (remote system not available). Network or firewall. Replication rides RPC over dynamic ports plus 135/TCP (endpoint mapper); a blocked dynamic RPC range or a dead partner produces these. Confirm with repadmin /bind <DC>.

8456 / 8457 - source or destination is “currently rejecting replication requests.” Usually a DC that detected a USN rollback on itself and quarantined replication (step 6), or Dsa Not Writable set in the registry.

The mistake I see most often is treating 8524 and 1722 as AD problems. They are network problems wearing an AD error code. Rule out DNS and RPC connectivity before you reach for directory surgery.
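The ordering rule above (DNS and RPC first, directory surgery last) can be captured as a small lookup. This is a sketch using only the codes discussed in this section; the layer labels, `TRIAGE`, `triage`, and `triage_order` are hypothetical names chosen for illustration, not anything repadmin exposes.

```python
# Error code -> (layer to investigate, first action), taken from the list above.
TRIAGE = {
    8606: ("directory", "suspect lingering objects; go to step 5"),
    8451: ("database", "check the Directory Service log for the ESE error"),
    8524: ("dns", "run dcdiag /test:DNS; do not touch the directory"),
    1722: ("network", "check RPC and firewall; repadmin /bind <DC>"),
    1256: ("network", "check RPC and firewall; repadmin /bind <DC>"),
    8456: ("quarantine", "source is rejecting replication; check for USN rollback"),
    8457: ("quarantine", "destination is rejecting replication; check for USN rollback"),
}

# Cheapest, least invasive layers come first.
ORDER = ["dns", "network", "quarantine", "database", "directory", "unknown"]


def triage(code):
    """Describe one replication error code in decimal and hex with its first action."""
    layer, action = TRIAGE.get(code, ("unknown", "look the code up before acting"))
    return f"{code} (0x{code:04x}): {layer} - {action}"


def triage_order(codes):
    """Sort observed codes so DNS and network causes are ruled out first."""
    def layer_rank(code):
        layer = TRIAGE.get(code, ("unknown", ""))[0]
        return (ORDER.index(layer), code)
    return sorted(set(codes), key=layer_rank)


if __name__ == "__main__":
    for code in triage_order([8606, 1722, 8524, 8606]):
        print(triage(code))
```

Feeding it the distinct status values from a showrepl sweep gives a worklist that starts with name resolution and connectivity and ends with the directory itself.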

5. Detecting and removing lingering objects safely

A lingering object is one that was deleted (tombstoned, then garbage-collected) on most DCs but survives on a DC that was offline longer than the tombstone lifetime (180 days on any forest first built on Server 2003 SP1 or later; read it from the tombstoneLifetime attribute on CN=Directory Service,CN=Windows NT,CN=Services in the Configuration NC). When that isolated DC comes back and tries to replicate, its partners reject the reference with 8606.
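The risk condition is plain date arithmetic: time out of replication compared with the tombstone lifetime. A minimal sketch follows, assuming the 180-day default. `lingering_risk` and `days_of_margin` are hypothetical helper names, and the real lifetime should be read from the tombstoneLifetime attribute rather than assumed.

```python
from datetime import datetime, timedelta


def days_of_margin(last_replicated, now, tombstone_lifetime_days=180):
    """Whole days left before the DC outlives the tombstone lifetime (negative = past it)."""
    deadline = last_replicated + timedelta(days=tombstone_lifetime_days)
    return (deadline - now).days


def lingering_risk(last_replicated, now, tombstone_lifetime_days=180):
    """True if the DC has been out of replication longer than the tombstone lifetime."""
    return (now - last_replicated) > timedelta(days=tombstone_lifetime_days)


if __name__ == "__main__":
    now = datetime(2026, 6, 7)
    print(lingering_risk(datetime(2025, 11, 1), now))   # offline 218 days
    print(lingering_risk(datetime(2026, 5, 17), now))   # offline 21 days
    print(days_of_margin(datetime(2026, 5, 17), now))
```

A DC that last replicated on 2025-11-01 is 218 days stale on 2026-06-07 and past a 180-day lifetime. One that last replicated three weeks ago still has 159 days of margin.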

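Detect, do not delete yet. Run removelingeringobjects in /advisory_mode against the suspect DC, using a known-good reference DC as the source of truth. The suspect (destination) DC is the first argument, the reference DC’s DSA object GUID is the second (it is printed in the header of repadmin /showrepl <reference-DC>), and the third is the naming context to check.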

# Advisory only - logs candidates to the Directory Service event log, deletes nothing
repadmin /removelingeringobjects DC02 <reference-DC-GUID> "DC=corp,DC=example,DC=com" /advisory_mode

Advisory mode writes event 1946/1942 entries listing every lingering object it would remove. Review them. Confirm the reference DC genuinely holds the authoritative, current copy of that partition - pointing at a stale reference DC will delete live objects.

Once you trust the list, remove them by dropping /advisory_mode:

repadmin /removelingeringobjects DC02 <reference-DC-GUID> "DC=corp,DC=example,DC=com"

Repeat for every partition that showed 8606 (do not forget the Configuration and Schema NCs and the _msdcs / DNS application partitions). After cleanup, re-run repadmin /showrepl DC02 /verbose and confirm the 8606s are gone.

There is also a defense-in-depth setting worth enabling fleet-wide so a single isolated DC cannot inject lingering objects in the first place - strict replication consistency, which makes a DC halt and demand attention rather than accept a reference to an object it lacks:

repadmin /regkey * +strict

6. Recognizing and recovering from USN rollback

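This is the failure that turns a bad afternoon into a forest rebuild if you mishandle it. The signature is unmistakable once you have seen it: event 2095 in the Directory Service log on the affected DC, partners reporting 8456/8457 against it, and the Dsa Not Writable registry value set.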

Check whether a DC has quarantined itself:

# A non-zero / DSA-not-writable value here means the DC has detected USN rollback
reg query "HKLM\SYSTEM\CurrentControlSet\Services\NTDS\Parameters" /v "Dsa Not Writable"

Do not clear that registry value to “fix” replication. That flag is AD protecting the forest from you. There is exactly one correct outcome for a DC in genuine USN rollback:

  1. Forcibly demote the affected DC (Uninstall-ADDSDomainController -ForceRemoval on Server 2012 and later; dcpromo /forceremoval applies only to 2008 R2 and earlier), removing it from the domain.
  2. Perform metadata cleanup for it (step 7) from a healthy DC.
  3. Run garbage collection / let it converge, verify no traces remain.
  4. Re-promote a clean OS as a new DC, or restore the role properly from a supported, AD-aware backup (Windows Server Backup system state, or a hypervisor that exposes the VM-GenerationID - 2012+ DCs on Gen-ID-aware hosts can survive a snapshot revert because the changing Gen-ID triggers an Invocation ID reset).

The root-cause prevention is policy, not a command: never restore a DC from a VM snapshot on a host that does not expose VM-GenerationID, never image/clone a running DC, and never P2V a live DC. Use system-state backups or Install-ADDSDomainController against a fresh OS.

7. Transferring vs. seizing FSMO roles, and metadata cleanup

The five FSMO roles (Schema Master and Domain Naming Master per forest; PDC Emulator, RID Master, and Infrastructure Master per domain) are single-master. Confirm who holds them:

netdom query fsmo

If the role holder is alive and reachable, transfer. A graceful transfer is clean and reversible:

Move-ADDirectoryServerOperationMasterRole -Identity "DC02" `
  -OperationMasterRole PDCEmulator,RIDMaster,InfrastructureMaster,SchemaMaster,DomainNamingMaster

If the role holder is permanently dead, seize - but only after you have committed to never bringing it back online. A seizure is unilateral. If the old holder ever rejoins the network holding the same role, you have a split-brain that is especially catastrophic for the RID Master (duplicate RID pools) and Schema Master.

# Same cmdlet, add -Force to seize from a dead holder
Move-ADDirectoryServerOperationMasterRole -Identity "DC03" `
  -OperationMasterRole RIDMaster,PDCEmulator -Force

The old ntdsutil seize path is still fully supported and worth knowing when the module is unavailable:

ntdsutil
  roles
    connections
      connect to server DC03
      quit
    seize RID master
    seize PDC
    quit
  quit
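The transfer-or-seize rule above reduces to two questions, sketched here as a tiny decision function. The function name, parameter names, and the "hold" outcome are illustrative assumptions; "hold" stands for the case the text warns about, where the holder is down but might come back.

```python
def fsmo_action(holder_reachable, holder_retired_for_good):
    """Pick the FSMO move: transfer when possible, seize only when there is no way back."""
    if holder_reachable:
        return "transfer"   # graceful and reversible
    if holder_retired_for_good:
        return "seize"      # unilateral; the old holder must never return
    return "hold"           # holder may come back: repair it or commit to retiring it first


if __name__ == "__main__":
    print(fsmo_action(holder_reachable=True, holder_retired_for_good=False))
    print(fsmo_action(holder_reachable=False, holder_retired_for_good=True))
    print(fsmo_action(holder_reachable=False, holder_retired_for_good=False))
```

The third branch is the one that prevents split-brain: if nobody has committed to keeping the old holder offline, a seizure is premature.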

Metadata cleanup. After a DC is gone for good (USN rollback demotion, dead hardware, a botched promo), its metadata lingers: the NTDS Settings object and server object in the Configuration partition, the computer account in the domain partition, and its DNS records. Modern ntdsutil does most of the cleanup automatically when you remove the server, but the canonical sequence is:

ntdsutil
  metadata cleanup
    connections
      connect to server HEALTHY-DC
      quit
    select operation target
      list domains
      select domain 0
      list sites
      select site 0
      list servers in site
      select server <number-of-dead-DC>
      quit
    remove selected server
    quit
  quit

Then clean up what ntdsutil does not always remove: the CN=<deadDC> server object in Sites and Services, the DC’s computer account, and stale DNS records (the A/CNAME records and the _ldap/_kerberos SRV records under _msdcs and the domain zone).

Verify

Recovery is not done until replication has converged and you can prove it. Force replication of all partitions in every direction, then re-check:

# Pull all partitions into this DC across site links, then push its changes outward (P)
repadmin /syncall /Ade
repadmin /syncall /AdeP

# Re-run the forest health summary - expect 0 fails everywhere
repadmin /replsummary

# Per-DC confirmation that every inbound partner last succeeded
repadmin /showrepl * /csv > C:\Temp\post-fix-showrepl.csv

# Full enterprise dcdiag, replication + DNS + FSMO
dcdiag /e /test:Replications /test:DNS /test:FSMOCheck /v

End-to-end proof that originating writes propagate: create a throwaway test object on one DC, force a sync, and confirm it appears on a distant partner.

# Originate a tagged test object on DC01 (-Server pins the write to that DC)
New-ADObject -Type contact -Name "repltest-$(Get-Date -f yyyyMMddHHmmss)" -Server DC01 `
  -Path "OU=Temp,DC=corp,DC=example,DC=com" -OtherAttributes @{ description = "repltest" }

# After repadmin /syncall, query DC03 directly: it must exist there - then remove it
Get-ADObject -Filter "description -eq 'repltest'" -Server DC03 `
  -SearchBase "OU=Temp,DC=corp,DC=example,DC=com" |
  Tee-Object -Variable o | Format-Table Name, ObjectGUID
$o | Remove-ADObject -Server DC03 -Confirm:$false

Confirm USNs are climbing again on a previously stuck DC with repadmin /showrepl, and that no DC still reports Dsa Not Writable.

Enterprise scenario

A platform team running a single-forest, four-site AD estate (roughly 30,000 objects, six DCs) hit 8606 across two sites after a vSphere admin “recovered” DC04 by reverting it to a three-week-old VM snapshot during an unrelated storage migration. The host predated VM-GenerationID exposure for that template, so the Gen-ID safety net never fired. DC04 booted, logged event 2095, set Dsa Not Writable, and its partners began rejecting its outbound changes - but not before deletions that had garbage-collected elsewhere reappeared as references, producing lingering objects and 8606 storms on its partners.

The constraint: DC04 held the RID Master and PDC Emulator for the domain, it was a Friday, and the business would not approve an outage window to rebuild over the weekend. Clearing the registry flag was not an option - that would re-inject the rolled-back state into the forest.

The fix followed the playbook. They isolated DC04 from replication immediately, then seized the two FSMO roles onto the healthy DC02 rather than wait:

# Seize from the unrecoverable rolled-back holder, on DC02
Move-ADDirectoryServerOperationMasterRole -Identity "DC02" `
  -OperationMasterRole RIDMaster,PDCEmulator -Force

They confirmed the seizure with netdom query fsmo, force-demoted DC04 with Uninstall-ADDSDomainController -ForceRemoval, and ran metadata cleanup from DC02. On each surviving partner they ran repadmin /removelingeringobjects ... /advisory_mode using DC02’s GUID as the authoritative reference, reviewed the event 1946 candidate lists, and only then removed the objects for real. They enabled repadmin /regkey * +strict fleet-wide so the same class of incident would halt a future DC instead of poisoning the topology. A clean Server 2022 VM was promoted into the DC04 slot the following week. repadmin /replsummary returned zero failures and a converged test object proved end-to-end replication. User-facing impact: none, because PDC and RID were live on DC02 within the hour.
