DNS and DHCP are the two services nobody notices until they break, at which point the whole estate appears to be on fire. This is the build I use for production: two domain controllers running AD-integrated DNS, a pair of DHCP servers in a failover relationship, and the scavenging, conditional-forwarding, and dynamic-update plumbing that keeps records clean without ever deleting a live host.
Scope: this assumes a single AD domain (contoso.local) with two domain controllers, dc01 (10.10.0.10) and dc02 (10.10.0.11), each also running the DNS Server role. DHCP runs on the same two boxes. Everything is done in PowerShell so it is repeatable and reviewable. Run the DNS cmdlets from the DnsServer module and the DHCP cmdlets from the DhcpServer module; both ship with the respective RSAT/role features.
1. AD-integrated vs. primary/secondary, and replication scope
If your DNS servers are domain controllers, use Active Directory-integrated zones and stop thinking about primary/secondary. The reasons are not stylistic:
- Multi-master writes. Every DC hosting the zone is writable. A classic primary/secondary setup has one writable primary; lose it and dynamic updates stop until you promote a secondary. AD-integrated zones have no single writable node.
- Replication rides AD. The zone lives in a directory partition and replicates over the existing DC replication topology with the same security and compression. You do not configure zone transfers between DCs.
- Secure dynamic updates. Only AD-integrated zones support Secure dynamic updates, which gate record writes with ACLs (Step 4). This is the whole reason scavenging and DHCP registration stay sane.
The decision that actually matters is replication scope — which partition the zone lands in:
| Scope | Replicates to | Use when |
|---|---|---|
| Forest | All DNS servers on DCs in the forest | Forest-wide name (e.g. a _msdcs root, shared infra zone) |
| Domain | All DNS servers on DCs in this domain | The default and right answer for a domain’s own zone |
| Legacy | All DCs in the domain (Windows 2000 partition) | Never, on a modern forest |
| Custom | DCs enlisted in a named app partition | You explicitly want a subset of DCs to host it |
Create the forward and reverse zones domain-wide:
# Forward lookup zone, AD-integrated, domain-wide replication, secure updates only
Add-DnsServerPrimaryZone -Name "contoso.local" `
-ReplicationScope "Domain" `
-DynamicUpdate "Secure" `
-ComputerName dc01
# Reverse lookup zone for 10.10.0.0/24
Add-DnsServerPrimaryZone -NetworkId "10.10.0.0/24" `
-ReplicationScope "Domain" `
-DynamicUpdate "Secure" `
-ComputerName dc01
You only run this on one DC. Replication delivers the zone to dc02 automatically; confirm with Get-DnsServerZone -ComputerName dc02 after convergence.
-DynamicUpdate "Secure" is non-negotiable for any zone DHCP writes into. NonsecureAndSecure lets any host overwrite any record, which is both a cleanup nightmare and a spoofing vector.
2. Forwarders, conditional forwarders, and root hints
Three different mechanisms, frequently confused. Get them straight:
- Forwarders are where a DNS server sends queries it is not authoritative for — typically your ISP or a cloud resolver. This is your default outbound path for general internet resolution.
- Conditional forwarders override that for a specific domain: “anything for partner.example goes to these nameservers.” This is the standard tool for resolving a partner domain or an Azure private DNS zone across a VPN.
- Root hints are the fallback when no forwarder answers: the server walks the public root servers itself. Keep them present, but with forwarders configured they are a safety net, not the primary path.
Set general forwarders (do not point a DC at itself or at the other DC as a forwarder — that creates resolution loops; forwarders are for external resolvers):
# General forwarders -> a public resolver pair (use your approved upstreams)
Set-DnsServerForwarder -IPAddress 1.1.1.1, 9.9.9.9 -UseRootHint $true -ComputerName dc01
Set-DnsServerForwarder -IPAddress 1.1.1.1, 9.9.9.9 -UseRootHint $true -ComputerName dc02
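A quick guard against the self-forwarding mistake, as a sketch. Test-ForwarderList is a hypothetical helper (it is not part of the DnsServer module) that returns any forwarder address that is also one of your own DNS servers. The DC addresses are the ones assumed in the Scope paragraph.

```powershell
# Hypothetical helper: return forwarders that point back at our own DNS servers.
function Test-ForwarderList {
    param(
        [Parameter(Mandatory)] [string[]] $Forwarder,    # addresses configured as forwarders
        [Parameter(Mandatory)] [string[]] $InternalDns   # our own DC/DNS addresses
    )
    # Any overlap means a DC forwards to itself or to its peer.
    $Forwarder | Where-Object { $InternalDns -contains $_ }
}

# Live usage (assumes the DnsServer module and the two DCs from this build):
# $fwd = (Get-DnsServerForwarder -ComputerName dc01).IPAddress.IPAddressToString
# Test-ForwarderList -Forwarder $fwd -InternalDns '10.10.0.10', '10.10.0.11'
```

An empty result is the healthy case; anything returned is a forwarder to remove.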
Add a conditional forwarder, and make it AD-integrated so it replicates to both DCs instead of being configured twice:
Add-DnsServerConditionalForwarderZone -Name "partner.example" `
-MasterServers 192.0.2.53, 192.0.2.54 `
-ReplicationScope "Forest" `
-ComputerName dc01
-MasterServers is the list of authoritative servers for that domain. -ReplicationScope "Forest" (or Domain) makes it an AD object; omit it and you get a server-local forwarder you must replicate by hand. Verify root hints are intact with Get-DnsServerRootHint.
3. Aging and scavenging without deleting live records
Scavenging deletes stale dynamic records. Done wrong, it deletes records people are still using and you spend an afternoon explaining why a file server “disappeared.” The mechanism is two intervals plus a server-level sweep:
- No-refresh interval (default 7 days): after a record’s timestamp is written, the server refuses to re-stamp it. This exists to suppress AD write churn — without it every renewal would replicate.
- Refresh interval (default 7 days): the window after no-refresh during which the record can be refreshed and re-stamped.
- A record becomes eligible for scavenging only after no-refresh + refresh have both elapsed — 14 days by default — with no successful refresh in the refresh window.
The rule that keeps you out of trouble: no-refresh + refresh must be >= your DHCP lease duration. A client refreshes its DNS record at lease renewal (50% of lease). If the combined interval is shorter than the lease, a live machine that renews normally can still age out between renewals. With an 8-day lease, 4 + 4 is safe; with the default 7 + 7 = 14, never use a lease longer than 14 days.
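The interval arithmetic is easy to get backwards, so here is a small sketch that spells it out. Test-AgingAgainstLease is a hypothetical helper, not a DnsServer cmdlet. It treats a configuration as safe when the combined aging window is at least as long as the lease, which is the condition the paragraph above argues for.

```powershell
# Hypothetical helper: check aging intervals against a DHCP lease and show the timeline.
function Test-AgingAgainstLease {
    param(
        [Parameter(Mandatory)] [timespan] $NoRefresh,
        [Parameter(Mandatory)] [timespan] $Refresh,
        [Parameter(Mandatory)] [timespan] $Lease,
        [datetime] $Stamped = (Get-Date)    # when the record was last time-stamped
    )
    $combined = $NoRefresh + $Refresh
    [pscustomobject]@{
        RefreshWindowOpens = $Stamped + $NoRefresh    # re-stamping is accepted from here
        EligibleToScavenge = $Stamped + $combined     # deletable if never refreshed
        RenewalEvery       = [timespan]::FromTicks([long]($Lease.Ticks / 2))  # renewal at 50% of lease
        Safe               = $combined -ge $Lease
    }
}

Test-AgingAgainstLease -NoRefresh '4.00:00:00' -Refresh '4.00:00:00' -Lease '8.00:00:00'    # Safe = True
Test-AgingAgainstLease -NoRefresh '7.00:00:00' -Refresh '7.00:00:00' -Lease '30.00:00:00'   # Safe = False
```

The second call is the failure case: a 30-day lease renews every 15 days, one day after a 7 + 7 record becomes eligible for deletion.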
Scavenging has to be enabled in two places, and forgetting the second is the most common reason “scavenging is on but nothing gets cleaned”:
# 1. Per-zone aging: enable, set the two intervals
Set-DnsServerZoneAging -Name "contoso.local" -Aging $true `
-NoRefreshInterval 4.00:00:00 `
-RefreshInterval 4.00:00:00 `
-ComputerName dc01
# 2. Server-level scavenging: the sweep that actually deletes, plus how often it runs
Set-DnsServerScavenging -ScavengingState $true `
-ScavengingInterval 7.00:00:00 `
-ApplyOnAllZones `
-ComputerName dc01
TimeSpan strings are days.hours:minutes:seconds, so 4.00:00:00 is four days. Enable server-level scavenging on one DC only at first; a single scavenging server avoids two DCs racing to delete the same records. The intervals on the zone replicate; the server ScavengingState is per-server.
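Before letting a sweep delete anything, it helps to list what it would be allowed to touch. The sketch below uses a hypothetical filter, Select-ScavengeCandidate, that passes through only time-stamped records older than a given age. It relies on the Timestamp property of the objects Get-DnsServerResourceRecord returns, which is empty for static records.

```powershell
# Hypothetical helper: pass through only time-stamped records older than -MaxAge.
function Select-ScavengeCandidate {
    param(
        [Parameter(Mandatory, ValueFromPipeline)] $Record,
        [Parameter(Mandatory)] [timespan] $MaxAge,    # no-refresh + refresh
        [datetime] $Now = (Get-Date)
    )
    process {
        # Static records carry no timestamp and are never scavenged, so skip them.
        if ($Record.Timestamp -and ($Now - $Record.Timestamp) -gt $MaxAge) { $Record }
    }
}

# Live usage (assumes the 4 + 4 day intervals configured above):
# Get-DnsServerResourceRecord -ZoneName "contoso.local" -RRType "A" -ComputerName dc01 |
#     Select-ScavengeCandidate -MaxAge '8.00:00:00' |
#     Select-Object HostName, Timestamp
```

If a host you know to be alive shows up in that list, fix the intervals or the client's registration before the first real sweep.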
Before you trust it, force a pass and watch the DNS event log rather than waiting a week:
Start-DnsServerScavenging -ComputerName dc01 -Force -Verbose
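To read the result of that pass without opening Event Viewer, something like the following should work. It assumes scavenging cycles are reported in the "DNS Server" log under event IDs 2501 (records were removed) and 2502 (nothing to remove); confirm the IDs on your own build before alerting on them.

```powershell
# Assumption: 2501 = cycle completed with deletions, 2502 = cycle completed, nothing deleted.
Get-WinEvent -ComputerName dc01 -MaxEvents 5 -ErrorAction SilentlyContinue -FilterHashtable @{
    LogName = 'DNS Server'
    Id      = 2501, 2502
} | Format-List TimeCreated, Id, Message
```

No output means no scavenging cycle has been logged yet, which usually points at server-level scavenging not being enabled on that DC.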
4. Secure dynamic updates and cleaning up stale records
With Secure updates, the host (or DHCP, acting on its behalf) that first registers a record becomes its owner via an ACL. Only the owner can update it. This is great until DHCP failover enters the picture (Step 6): if each DHCP server registers under its own machine account, the partner cannot update the other’s records and you accumulate stale, un-updatable duplicates.
The fix is a dedicated, low-privilege service account used by both DHCP servers for dynamic updates, so a single identity owns every DHCP-registered record:
# Create a plain user account with no special rights; it only needs to own DNS records
New-ADUser -Name "svc-dhcp-dnsupdate" `
-SamAccountName "svc-dhcp-dnsupdate" `
-AccountPassword (Read-Host -AsSecureString "Password") `
-Enabled $true `
-PasswordNeverExpires $true `
-CannotChangePassword $true
Then point both DHCP servers at it (the credential is stored per server, so it has to be set on each one):
$cred = Get-Credential "contoso\svc-dhcp-dnsupdate"
Set-DhcpServerDnsCredential -Credential $cred -ComputerName dc01
Set-DhcpServerDnsCredential -Credential $cred -ComputerName dc02
Both failover partners must use the same DNS credential. Mismatched (or absent) credentials are the textbook cause of “DHCP works but half my hosts have wrong/stale DNS records.”
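A short check that both partners really hold the same identity, as a sketch. It assumes Get-DhcpServerDnsCredential returns UserName and DomainName properties, which is what the cmdlet reports on the servers I have used it on.

```powershell
# Compare the stored DNS update identity on both failover partners.
$creds = foreach ($server in 'dc01', 'dc02') {
    $c = Get-DhcpServerDnsCredential -ComputerName $server
    [pscustomobject]@{
        Server   = $server
        Identity = "$($c.DomainName)\$($c.UserName)"
    }
}
$creds | Format-Table -AutoSize

# More than one distinct identity (or a blank one) means records will end up with mixed owners.
if (@($creds.Identity | Sort-Object -Unique).Count -ne 1) {
    Write-Warning "DHCP partners use different DNS update credentials"
}
```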
Finding and removing stale duplicates
To audit duplicate A records (multiple hosts on one IP, or one host with several IPs from churn), pull the resource records and group:
# A records whose IP is shared by more than one name -> likely stale duplicates
Get-DnsServerResourceRecord -ZoneName "contoso.local" -RRType "A" -ComputerName dc01 |
Where-Object { $_.RecordData.IPv4Address } |
Group-Object { $_.RecordData.IPv4Address.IPAddressToString } |
Where-Object Count -gt 1 |
Sort-Object Count -Descending |
Format-Table Name, Count, @{n='Hosts';e={ ($_.Group.HostName) -join ', ' }} -AutoSize
Remove a confirmed-dead record explicitly rather than waiting for scavenging:
$rr = Get-DnsServerResourceRecord -ZoneName "contoso.local" -Name "oldhost" -RRType "A" -ComputerName dc01
Remove-DnsServerResourceRecord -ZoneName "contoso.local" -InputObject $rr -ComputerName dc01 -Force
Always inspect the grouped output before deleting. Scavenging is the bulk safety mechanism; manual removal is for known offenders.
5. DHCP scopes, reservations, options, and policies
Authorize the servers in AD first — an unauthorized Windows DHCP server will not hand out leases:
Add-DhcpServerInDC -DnsName "dc01.contoso.local" -IPAddress 10.10.0.10
Add-DhcpServerInDC -DnsName "dc02.contoso.local" -IPAddress 10.10.0.11
Create a scope and set options. Lease duration here is 8 days to satisfy the scavenging rule from Step 3:
Add-DhcpServerv4Scope -Name "LAN-10.10.0.0" `
-StartRange 10.10.0.50 -EndRange 10.10.0.250 `
-SubnetMask 255.255.255.0 `
-LeaseDuration 8.00:00:00 `
-State Active -ComputerName dc01
# Scope-level options: gateway (003), DNS servers (006), DNS domain (015)
Set-DhcpServerv4OptionValue -ScopeId 10.10.0.0 `
-Router 10.10.0.1 `
-DnsServer 10.10.0.10, 10.10.0.11 `
-DnsDomain "contoso.local" `
-ComputerName dc01
Reservations pin an IP to a MAC for servers and printers that need a stable address but still benefit from centrally managed options:
Add-DhcpServerv4Reservation -ScopeId 10.10.0.0 `
-IPAddress 10.10.0.60 `
-ClientId "AA-BB-CC-11-22-33" `
-Name "printer-floor2" `
-ComputerName dc01
Policy-based assignment lets one scope hand different options to different device classes — e.g. give VoIP phones (matched by MAC OUI / vendor class) a different gateway or a dedicated address range:
# Policy matching a vendor's MAC prefix; -Condition is required even with a single criterion
Add-DhcpServerv4Policy -Name "VoIP-Phones" -ScopeId 10.10.0.0 `
-Condition OR -MacAddress EQ,"AABBCC*" -ComputerName dc01
Set-DhcpServerv4OptionValue -PolicyName "VoIP-Phones" -ScopeId 10.10.0.0 `
-Router 10.10.0.2 -ComputerName dc01
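The dedicated address range mentioned above is a separate step from the policy itself. A minimal sketch follows; the 10.10.0.200 to 10.10.0.220 range is an illustrative assumption, so pick one that fits your own scope. Once a policy has a range, matching clients lease only from it, so size it for the device count.

```powershell
# Assumption: 10.10.0.200-10.10.0.220 is an example range inside the scope.
Add-DhcpServerv4PolicyIPRange -Name "VoIP-Phones" -ScopeId 10.10.0.0 `
    -StartRange 10.10.0.200 -EndRange 10.10.0.220 -ComputerName dc01

# Confirm the policy, its condition, and its processing order
Get-DhcpServerv4Policy -ScopeId 10.10.0.0 -ComputerName dc01 |
    Format-List Name, Enabled, ProcessingOrder, Condition, MacAddress

# Confirm the range attached to the policy
Get-DhcpServerv4PolicyIPRange -ScopeId 10.10.0.0 -Name "VoIP-Phones" -ComputerName dc01
```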
6. DHCP failover: load-balance vs. hot-standby and MCLT
DHCP failover replicates lease and scope data between two servers so either can serve clients. It is a per-IPv4-scope relationship (IPv6 is not supported). Two modes:
- Load balance (default): both servers actively lease, split by LoadBalancePercent (commonly 50/50). Best for a single site where both servers are healthy and you want active/active.
- Hot standby: one server (Active) leases; the partner (Standby) waits and takes over on failure. ReservePercent reserves a slice of the pool for the standby to lease during the MCLT window before it assumes the full pool. Best for a branch whose standby lives in another site.
The parameter that confuses everyone is MCLT (Maximum Client Lead Time). It is not the lease time. MCLT governs three things:
- The temporary lease length a server grants when it has lost contact with its partner (Communication Interrupted state).
- How long a server waits in Partner Down before it claims 100% of the address pool.
- How long an address is held back before it can be reassigned to a new client after the partner owned it.
Smaller MCLT = faster takeover but more replication overhead in normal operation; larger MCLT = less overhead but a longer delay before the survivor controls the whole pool. An hour is a reasonable middle ground for most LANs; very latency-sensitive shops go lower.
Create a load-balance relationship across both DCs (run once; it configures both ends):
Add-DhcpServerv4Failover -Name "LAN-Failover" `
-PartnerServer dc02.contoso.local `
-ScopeId 10.10.0.0 `
-LoadBalancePercent 50 `
-MaxClientLeadTime 01:00:00 `
-AutoStateTransition $true `
-StateSwitchInterval 01:00:00 `
-SharedSecret "UseAStrongSecretHere" `
-ComputerName dc01
For hot standby instead, swap the mode-specific parameters (ServerRole + ReservePercent replace LoadBalancePercent):
Add-DhcpServerv4Failover -Name "Branch-Failover" `
-PartnerServer dc02.contoso.local `
-ScopeId 10.10.0.0 `
-ServerRole Active `
-ReservePercent 5 `
-MaxClientLeadTime 01:00:00 `
-AutoStateTransition $true `
-SharedSecret "UseAStrongSecretHere" `
-ComputerName dc01
-AutoStateTransition $true with -StateSwitchInterval lets a server automatically move from Communication Interrupted to Partner Down after the interval, instead of waiting for an admin. Add scopes to an existing relationship later with Add-DhcpServerv4FailoverScope.
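How the two timers stack is worth working through once. The sketch below uses a hypothetical helper, Get-FailoverTimeline, and assumes AutoStateTransition is enabled: the survivor enters Partner Down after StateSwitchInterval, then waits one MCLT inside Partner Down before it owns the whole pool.

```powershell
# Hypothetical helper: when does the surviving server change state and own the whole pool?
function Get-FailoverTimeline {
    param(
        [Parameter(Mandatory)] [datetime] $ContactLost,
        [Parameter(Mandatory)] [timespan] $StateSwitchInterval,
        [Parameter(Mandatory)] [timespan] $MaxClientLeadTime
    )
    $partnerDown = $ContactLost + $StateSwitchInterval
    [pscustomobject]@{
        CommunicationInterrupted = $ContactLost
        PartnerDown              = $partnerDown                        # automatic transition
        FullPoolControl          = $partnerDown + $MaxClientLeadTime   # MCLT wait inside Partner Down
    }
}

Get-FailoverTimeline -ContactLost '2024-01-01 09:00' `
    -StateSwitchInterval '01:00:00' -MaxClientLeadTime '01:00:00'
```

With the values used in this build, contact lost at 09:00 means Partner Down at 10:00 and full-pool control at 11:00. Until then the survivor leases only from its own share, with lease times capped at MCLT.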
7. Diagnostics: Resolve-DnsName, nslookup, and the analytic log
Resolve-DnsName is the modern, scriptable resolver. Use -Server to test a specific DNS server (vital when verifying both DCs agree) and -DnsOnly to bypass other name providers:
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.10 -Type A
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.11 -Type A # confirm dc02 matches
Resolve-DnsName -Name partner.example -Server 10.10.0.10 # exercises the conditional forwarder
nslookup is still useful for interactive, low-level checks (and for proving the forwarder path):
nslookup
> server 10.10.0.10
> set type=srv
> _ldap._tcp.dc._msdcs.contoso.local
The DNS Analytic log is ETW-based, off by default, and the best tool for “who is querying what.” Audit events are on already; turn analytic on only when investigating, because at very high QPS it has measurable cost:
# Inspect current diagnostic settings
Get-DnsServerDiagnostics -ComputerName dc01
# Enable query logging detail (audit is already on; this raises diagnostic verbosity)
Set-DnsServerDiagnostics -ComputerName dc01 `
-Queries $true -Answers $true -ReceivePackets $true -SendPackets $true
The analytic channel itself (Microsoft-Windows-DNSServer/Analytical) is enabled via Event Viewer (DNS-Server node, Show Analytic and Debug Logs) or wevtutil, then read with ETW consumers. Disable it again once you have your answer.
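For the wevtutil route, a session looks roughly like this. Treat it as a sketch: it assumes an elevated prompt on the DNS server itself, and it disables the channel before reading because analytic logs are most reliably read once tracing has stopped.

```powershell
$channel = 'Microsoft-Windows-DNSServer/Analytical'

# Enable the channel (/q suppresses the confirmation prompt)
wevtutil sl $channel /e:true /q:true

# ... reproduce the problem here ...

# Stop tracing, then read what was captured. Analytic logs must be read oldest-first.
wevtutil sl $channel /e:false /q:true
Get-WinEvent -LogName $channel -Oldest -MaxEvents 200 |
    Select-Object TimeCreated, Id, Message
```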
Enterprise scenario
A retail client ran load-balance DHCP failover (50/50) across two DCs in the same datacenter. After a switch upgrade, the two DCs ended up on opposite sides of a firewall pair that NAT-ed nothing but did stateful inspection on the failover channel (TCP 647). Failover went Communication Interrupted, then both sides independently hit Partner Down and each started serving the full /23 pool. For about 40 minutes neither server knew the other was leasing, so both handed out addresses from the same ranges. Result: duplicate IP assignments, gratuitous-ARP conflicts, and a wave of clients dropping off the network.
Root cause was the firewall silently dropping idle 647 sessions after 30 minutes, well under our MaxClientLeadTime of one hour, so the survivor claimed the pool before the partner could re-establish. The real fix was a firewall exception for the channel, but we also stopped trusting auto-transition to paper over a flapping link:
# Don't auto-jump to Partner Down on a flaky channel; require a human for full-pool takeover
Set-DhcpServerv4Failover -Name "LAN-Failover" `
-AutoStateTransition $false `
-MaxClientLeadTime 01:00:00 `
-ComputerName dc01
The durable lesson: in load-balance mode an unreliable failover channel is worse than no failover, because both nodes confidently lease the same addresses. Either guarantee the channel (firewall rule, dedicated path) or switch that scope to hot-standby with a ReservePercent, where only the active node owns the bulk of the pool.
Verify
Run these after the build; every one should pass before you call it done.
# Zones exist and replicated to both DCs, secure updates on
Get-DnsServerZone -ComputerName dc01 | Where-Object IsDsIntegrated
Get-DnsServerZone -ComputerName dc02 | Where-Object IsDsIntegrated
# Aging/scavenging settings on the zone
Get-DnsServerZoneAging -Name "contoso.local" -ComputerName dc01
# Forwarders and conditional forwarders present
Get-DnsServerForwarder -ComputerName dc01
Get-DnsServerZone -ComputerName dc01 | Where-Object ZoneType -eq "Forwarder"
# DHCP authorized in AD
Get-DhcpServerInDC
# Failover relationship healthy on BOTH partners (state should be "Normal")
Get-DhcpServerv4Failover -ComputerName dc01
Get-DhcpServerv4Failover -ComputerName dc02
# Both DNS servers resolve a known host identically
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.10
Resolve-DnsName -Name dc01.contoso.local -Server 10.10.0.11
A healthy failover relationship reports State : Normal on both servers. If one says Normal and the other says Communication Interrupted, the relationship is not actually redundant — investigate connectivity and the shared secret before trusting it.
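A first-pass connectivity check for a one-sided relationship might look like this. It uses the failover port named in the scenario above (TCP 647) and only proves that the port answers, not that the shared secret matches.

```powershell
# Run on dc01, then repeat on dc02 pointing back at dc01: the channel must work both ways.
Test-NetConnection -ComputerName dc02.contoso.local -Port 647 |
    Select-Object ComputerName, RemoteAddress, RemotePort, TcpTestSucceeded

# Side-by-side view of what each partner believes
foreach ($server in 'dc01', 'dc02') {
    Get-DhcpServerv4Failover -ComputerName $server |
        Select-Object @{n='Server';e={ $server }}, Name, Mode, State
}
```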
Build checklist
Monitoring and a failover test runbook
Alert on the things that signal real trouble, not noise. At minimum:
- DHCP failover state != Normal on either partner (poll Get-DhcpServerv4Failover; alert on Communication Interrupted or Partner Down).
- Scope utilization above ~85% (Get-DhcpServerv4ScopeStatistics). A pool about to exhaust is an outage waiting to happen, doubly so during a partner-down period.
- DHCP service (DHCPServer) and DNS service (DNS) stopped.
- DNS scavenging deletions spiking. Sudden large deletes in the DNS event log usually mean an interval is misconfigured against your lease.
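The first two checks can be sketched as one small function that a scheduled task or monitoring agent calls. Get-DhcpAlert is a hypothetical helper, not a DhcpServer cmdlet. It takes the objects returned by the two Get- cmdlets, so it makes no assumptions about how you collect them. Service state and scavenging spikes are left to your monitoring platform.

```powershell
# Hypothetical helper: turn failover and scope-statistics objects into alert strings.
function Get-DhcpAlert {
    param(
        [object[]] $Failover = @(),          # objects with Name and State
        [object[]] $ScopeStatistics = @(),   # objects with ScopeId and PercentageInUse
        [double]   $UtilizationThreshold = 85
    )
    foreach ($f in $Failover) {
        if ("$($f.State)" -ne 'Normal') { "Failover '$($f.Name)' is $($f.State)" }
    }
    foreach ($s in $ScopeStatistics) {
        if ($s.PercentageInUse -gt $UtilizationThreshold) {
            "Scope $($s.ScopeId) is {0:N0}% used" -f $s.PercentageInUse
        }
    }
}

# Live usage, polling both partners:
# foreach ($server in 'dc01', 'dc02') {
#     Get-DhcpAlert -Failover (Get-DhcpServerv4Failover -ComputerName $server) `
#                   -ScopeStatistics (Get-DhcpServerv4ScopeStatistics -ComputerName $server)
# }
```

No output means nothing to alert on; each returned string is one alert.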
A failover test you can actually run during a maintenance window:
- Record baseline: Get-DhcpServerv4Failover (expect Normal both sides) and Get-DhcpServerv4ScopeStatistics on both.
- Stop DHCP on dc01: Stop-Service DHCPServer (or take the box down to simulate a real failure).
- Confirm dc02 keeps leasing: release/renew on a test client (ipconfig /release then /renew) and confirm it gets a valid address with correct options.
- After the configured switch interval, confirm dc02 reports Partner Down; one MCLT later it should be serving the full pool.
- Restore dc01: Start-Service DHCPServer. Confirm both sides return to Normal and lease data re-synchronizes (Get-DhcpServerv4Failover shows matching scope counts).
- Repeat in the other direction (fail dc02, verify dc01). Redundancy you have only tested in one direction is half-tested.
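The waiting steps in that test are easier with a small polling helper. Wait-DhcpFailoverState below is hypothetical, a sketch built on Get-DhcpServerv4Failover. It assumes the State property stringifies without spaces (for example PartnerDown), so check the exact values your servers report before relying on it.

```powershell
# Hypothetical helper: poll a failover relationship until it reaches the expected state.
function Wait-DhcpFailoverState {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [Parameter(Mandatory)] [string] $Name,
        [Parameter(Mandatory)] [string] $State,    # e.g. 'Normal' or 'PartnerDown'
        [timespan] $Timeout = '02:00:00',
        [int] $PollSeconds = 30
    )
    $deadline = (Get-Date) + $Timeout
    do {
        $current = "$((Get-DhcpServerv4Failover -ComputerName $ComputerName -Name $Name).State)"
        if ($current -eq $State) { return $true }
        Start-Sleep -Seconds $PollSeconds
    } while ((Get-Date) -lt $deadline)
    Write-Warning "$Name on $ComputerName is still '$current' after $Timeout"
    return $false
}

# After stopping DHCP on dc01, wait for dc02 to declare Partner Down:
# Wait-DhcpFailoverState -ComputerName dc02 -Name 'LAN-Failover' -State 'PartnerDown'
# After restoring dc01, wait for the relationship to settle:
# Wait-DhcpFailoverState -ComputerName dc01 -Name 'LAN-Failover' -State 'Normal'
```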
Pitfalls
- Scavenging interval shorter than the DHCP lease. The single most common cause of live records vanishing. Keep no-refresh + refresh >= lease, always.
- Mismatched DHCP DNS credentials across failover partners. Records register fine, then half of them go stale because the partner cannot update what it does not own.
- Pointing a DC’s forwarder at itself or the other DC. Forwarders are for external resolvers; doing this creates resolution loops and intermittent SERVFAIL.
- Trusting a relationship that is Normal on only one side. Always verify both partners; a one-sided Normal is not redundancy.
- Enabling server-level scavenging on every DC at once on day one. Start with one, confirm behavior in the event log, then it is safe to leave as-is. Multiple scavenging servers are fine in steady state but make first-run debugging harder.