Ransomware Resilience: Immutable Backups, Recovery Vaults, and Isolated Recovery Environments

Prevention is a probability game you will eventually lose. Recovery is the game you have to win every time. Modern ransomware crews do not encrypt and leave - they spend days living off the land, escalate to domain admin, and the first thing they do once they own the directory is find and destroy your backups. Veeam’s own incident data has shown that the backup repository is targeted in the large majority of attacks, and a meaningful fraction of victims lose some or all of their backups. If your last line of defense is reachable with the same credentials that just got compromised, you do not have a last line of defense.

This guide is about that last line: backups the attacker cannot delete, a control plane they cannot pivot into, and a clean room where you can actually restore without re-detonating the malware. I will use Azure and AWS primitives because most of you run there, but the architecture - assume the production identity plane is fully compromised, and design backwards from that - transfers to any platform.

1. Design for assume-breach: the 3-2-1-1-0 model

Start by writing down the threat model explicitly, because it changes every downstream decision: the adversary has domain admin, global admin, or the backup service’s own credentials, and they have had them for two weeks. Everything reachable from those identities is presumed gone. The job is to ensure something survives anyway.

The classic 3-2-1 rule is no longer sufficient against an active human adversary. The current bar is 3-2-1-1-0:

| Digit | Requirement | Why it matters under ransomware |
|---|---|---|
| 3 | Three copies of data | Survives correlated hardware/site loss |
| 2 | Two distinct media types | One ransomware family can’t reach both |
| 1 | One copy off-site | Survives site-wide compromise or destruction |
| 1 | One copy immutable or air-gapped | Survives an attacker with admin on the backup system |
| 0 | Zero errors on recovery verification | A backup you never restored is a hypothesis, not a backup |

The two digits that matter most here are the second 1 and the 0. Immutability defeats the deletion attack; verification defeats the silent-corruption attack (encrypted-in-place data backed up faithfully for 60 days, so your retention window is full of garbage).

The mental shift: stop asking “is my backup running” and start asking “if my backup admin account is the attacker, what still survives, and have I proven I can restore it?” If the answer to either half is uncertain, you have a reporting system, not a recovery system.
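As a rough illustration of turning the 3-2-1-1-0 rule into something you can run against an inventory, the sketch below scores a list of copies digit by digit. The `BackupCopy` fields and the `check_32110` function are hypothetical names invented for this example, not a vendor schema, and the production copy is counted as one of the three. It also takes the strict reading that the "0" only passes when an immutable copy has been restore-tested cleanly, which is an interpretation of the text above rather than a formal definition.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset. Field names are illustrative only."""
    name: str
    media: str                    # e.g. "disk", "object-storage", "tape"
    offsite: bool                 # stored outside the primary site/region
    immutable: bool               # WORM-locked or air-gapped
    verify_errors: Optional[int]  # errors in last restore test; None = never tested


def check_32110(copies: list[BackupCopy]) -> dict[str, bool]:
    """Return a pass/fail flag for each digit of the 3-2-1-1-0 rule."""
    return {
        "3_copies": len(copies) >= 3,
        "2_media": len({c.media for c in copies}) >= 2,
        "1_offsite": any(c.offsite for c in copies),
        "1_immutable": any(c.immutable for c in copies),
        # Strict reading: the copy that survives an admin-level attacker
        # must itself have been restored with zero errors.
        "0_errors": any(c.immutable and c.verify_errors == 0 for c in copies),
    }


inventory = [
    BackupCopy("production", "disk", False, False, None),
    BackupCopy("local-repo", "disk", False, False, 0),
    BackupCopy("worm-bucket", "object-storage", True, True, None),
]
result = check_32110(inventory)
print(result)
```

In this sample inventory every digit passes except the last: the only copy that was ever restore-tested is the one an attacker with backup-admin rights could delete, which is exactly the "reporting system, not a recovery system" case.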

A second principle: the backup control plane must live in a separate identity and trust boundary from production. In Azure, that means a dedicated tenant or at minimum a separate management group with no inherited role assignments and its own break-glass accounts. In AWS, a separate account in a locked-down OU. If a single global admin can both run production and purge the vault, you are one identity away from total loss.
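One way to audit that boundary is to export role assignments and look for any principal that appears on both sides. The sketch below is a minimal version of that check, assuming you already have assignments as `(principal, role, scope)` tuples. The role sets, scope names, and the `single_points_of_loss` function are placeholders made up for this example, and scopes are matched exactly, so inherited assignments would need to be expanded before calling it.

```python
from __future__ import annotations

# Hypothetical role names; map these to the real roles on your platform.
PROD_ADMIN_ROLES = {"Owner", "Contributor", "Global Administrator"}
VAULT_DESTRUCTIVE_ROLES = {"Owner", "Backup Contributor"}


def single_points_of_loss(assignments, prod_scopes, vault_scopes):
    """Return principals holding admin on production AND destructive
    rights on the backup vault.

    assignments: iterable of (principal, role, scope) tuples.
    Scopes are compared exactly; expand inheritance beforehand.
    """
    prod_admins = {
        principal for principal, role, scope in assignments
        if role in PROD_ADMIN_ROLES and scope in prod_scopes
    }
    vault_admins = {
        principal for principal, role, scope in assignments
        if role in VAULT_DESTRUCTIVE_ROLES and scope in vault_scopes
    }
    return sorted(prod_admins & vault_admins)


assignments = [
    ("alice", "Owner", "sub-prod"),
    ("alice", "Backup Contributor", "sub-backup"),
    ("bob", "Contributor", "sub-prod"),
    ("carol", "Backup Contributor", "sub-backup"),
]
overlap = single_points_of_loss(assignments, {"sub-prod"}, {"sub-backup"})
print(overlap)
```

Any name in the result is an identity whose compromise takes out both production and the way back, so the target for this list is empty.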

2. Implement immutable and air-gapped backups (WORM + soft-delete)

Immutability has to be enforced by the storage platform, below the layer the backup admin operates at. There are two complementary mechanisms, and you want both:

Azure: Backup vault with immutability locked + soft-delete

On an Azure Recovery Services or Backup vault, enable immutability and then lock it. Unlocked immutability can be disabled by an attacker with vault-admin rights; locked immutability is irreversible - that irreversibility is the entire point.

# Enable immutability on a Recovery Services vault, then LOCK it.
# Unlocked = reversible (test here first). Locked = irreversible.
az backup vault update \
  --resource-group rg-backup-tier0 \
  --name rsv-prod-immutable \
  --immutability-state Unlocked

# After validating retention/policies, escalate to Locked.
# This CANNOT be undone, including by a Global Admin or Owner.
az backup vault update \
  --resource-group rg-backup-tier0 \
  --name rsv-prod-immutable \
  --immutability-state Locked

# Make soft-delete permanent and set the retention window.
# (Multi-user authorization is configured separately; see step 3.)
az backup vault backup-properties set \
  --resource-group rg-backup-tier0 \
  --name rsv-prod-immutable \
  --soft-delete-feature-state AlwaysOn \
  --soft-delete-duration 30

AlwaysOn is deliberate: it means soft-delete can no longer be turned off, even by a vault admin. Combined with a locked immutability state, an attacker who fully owns the vault still cannot shorten retention or hard-delete a recovery point inside the window.

Azure: WORM on the storage account (immutability policy with legal hold or time-based)

For backups that land in blob storage (database dumps, archive tiers), use a version-level or container-level time-based retention policy and lock it:

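# Container-level policy shown here; it does not require versioning.
# Version-level WORM additionally needs versioning and version-level
# immutability support enabled on the account or container.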
az storage container immutability-policy create \
  --account-name stbackupworm \
  --container-name db-archives \
  --period 30 \
  --allow-protected-append-writes true

# Lock the policy - after this, blobs are WORM for the retention window.
ETAG=$(az storage container immutability-policy show \
  --account-name stbackupworm --container-name db-archives \
  --query etag -o tsv)
az storage container immutability-policy lock \
  --account-name stbackupworm --container-name db-archives \
  --if-match "$ETAG"

allow-protected-append-writes lets backup software append new blocks to existing append blobs (log-style backups, for example) without being able to modify or delete already-written data - useful for streaming backups, and it does not weaken the WORM guarantee on committed blocks.

AWS: S3 Object Lock in Compliance mode + MFA Delete

The AWS equivalent is S3 Object Lock in Compliance mode. Governance mode can be bypassed by a principal with s3:BypassGovernanceRetention; Compliance mode cannot be bypassed by anyone, including the root account, until retention expires. For ransomware resilience, use Compliance.

# Enabling Object Lock at creation also turns on versioning. (Since late 2023
# it can also be enabled on an existing bucket, provided versioning is on.)
aws s3api create-bucket \
  --bucket acme-cyber-recovery-vault \
  --object-lock-enabled-for-bucket \
  --region us-east-1

# Default retention: every new object is locked for 30 days, COMPLIANCE mode.
aws s3api put-object-lock-configuration \
  --bucket acme-cyber-recovery-vault \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": { "DefaultRetention": { "Mode": "COMPLIANCE", "Days": 30 } }
  }'

The strongest pattern is air-gapped rather than merely immutable: a target with no inbound network path and no shared credentials. AWS Backup’s logically air-gapped vault is purpose-built for this - it is immutable by default, stored in an AWS-owned account outside your control plane, and shared cross-account only for restore. The principle on any platform: the backup target should be write-only from production and readable only from the recovery environment, never both at once from the same identity.
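The "write-only from production, read-only from recovery" split can be expressed as explicit deny statements on the vault bucket. The sketch below only builds the policy document as a Python dictionary. The role ARNs and the `vault_bucket_policy` helper are hypothetical, the bucket name is the one used above, and the matching Allow permissions are assumed to be granted elsewhere (for example on the roles themselves). Treat it as a starting shape to review, not a complete policy.

```python
import json


def vault_bucket_policy(bucket, writer_role_arn, reader_role_arn):
    """Build a bucket policy that denies reads/deletes to the production
    writer and denies writes/deletes to the recovery reader."""
    objects = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProdWriterCannotReadOrDelete",
                "Effect": "Deny",
                "Principal": {"AWS": writer_role_arn},
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                ],
                "Resource": objects,
            },
            {
                "Sid": "RecoveryReaderCannotWriteOrDelete",
                "Effect": "Deny",
                "Principal": {"AWS": reader_role_arn},
                "Action": [
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                ],
                "Resource": objects,
            },
        ],
    }


# Hypothetical account IDs and role names, for illustration only.
policy = vault_bucket_policy(
    "acme-cyber-recovery-vault",
    "arn:aws:iam::111111111111:role/prod-backup-writer",
    "arn:aws:iam::222222222222:role/ire-restore-reader",
)
print(json.dumps(policy, indent=2))
```

Because explicit denies win over allows in IAM evaluation, a production identity that later gains broader S3 permissions still cannot read or delete from the vault through this role.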

3. Harden the control plane: multi-user authorization and RBAC

Immutable storage stops deletion of data. It does nothing about an attacker who reconfigures the policy - shortens future retention, disables protection, or deletes the vault wholesale before locking takes effect. Destructive control-plane operations need a second human.

Azure Backup provides Multi-User Authorization (MUA) built on Resource Guard. You place a Resource Guard in a separate tenant or subscription that the backup admin has no access to. Critical operations (disable soft-delete, reduce retention, stop protection with delete data, modify MUA) then require a just-in-time PIM-approved role on the Resource Guard - so the backup admin alone cannot perform them.

# Resource Guard lives in a SEPARATE security subscription/tenant.
# Backup admins have ZERO standing access to this resource group.
az resource create \
  --resource-group rg-security-guard \
  --name rg-prod-backup-guard \
  --resource-type "Microsoft.DataProtection/resourceGuards" \
  --properties '{
    "vaultCriticalOperationExclusionList": []
  }' \
  --api-version 2023-05-01

# Associate the vault with the Resource Guard. From now on, protected
# operations require a JIT role (granted via PIM) on the guard's scope.
# The vault above is a Recovery Services vault, so the mapping is set through
# `az backup vault`; add --tenant-id if the guard lives in another tenant.
az backup vault resource-guard-mapping update \
  --resource-group rg-backup-tier0 \
  --name rsv-prod-immutable \
  --resource-guard-id "/subscriptions/<sec-sub>/resourceGroups/rg-security-guard/providers/Microsoft.DataProtection/resourceGuards/rg-prod-backup-guard"

Empty vaultCriticalOperationExclusionList means nothing is excluded - every protected operation is gated. That is what you want for tier-0.

On the RBAC side, apply least privilege ruthlessly.

Treat the backup vault exactly like a tier-0 asset, because it is one. The richest target in your estate is not your database - it is the system that can simultaneously read every byte of it and delete the only way back.

4. Architect the isolated recovery environment (clean room)

Here is the failure mode that catches teams who did keep good backups: they restore the encrypted, exfiltration-laden image straight back into production, the dormant payload re-detonates, and they are back to square one - now with the attacker tipped off. You need somewhere clean to restore into.

An Isolated Recovery Environment (IRE), also called a clean room or cyber recovery vault, is a network-isolated landing zone where you restore, inspect, and clean systems before promoting them back. Its non-negotiable properties start with the network: no path to production and no path out.

# Deny-all NSG for the clean-room subnet. No path to prod, no path out.
az network nsg create -g rg-ire -n nsg-cleanroom

az network nsg rule create -g rg-ire --nsg-name nsg-cleanroom \
  -n deny-all-inbound --priority 4096 \
  --direction Inbound --access Deny --protocol '*' \
  --source-address-prefixes '*' --destination-address-prefixes '*' \
  --destination-port-ranges '*'

az network nsg rule create -g rg-ire --nsg-name nsg-cleanroom \
  -n deny-all-outbound --priority 4096 \
  --direction Outbound --access Deny --protocol '*' \
  --source-address-prefixes '*' --destination-address-prefixes '*' \
  --destination-port-ranges '*'

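# The clean-room VNet has NO peering and NO route to the prod hub.
# Attach the NSG to the restore subnet; an unattached NSG filters nothing.
az network vnet create -g rg-ire -n vnet-cleanroom \
  --address-prefixes 10.250.0.0/16 \
  --subnet-name snet-restore --subnet-prefixes 10.250.1.0/24 \
  --network-security-group nsg-cleanroom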

Access into the clean room for operators is via a single hardened, monitored jump host (Azure Bastion or a bastion in a separate management subnet) - not a flat RDP rule from the corporate LAN. Once a system is restored, scanned, confirmed clean, and patched against the original entry vector, only then does it get promoted to a rebuilt production network. The clean room is also where you mount immutable recovery points read-only to extract just the data when the OS itself is untrustworthy.

5. Define recovery tiers, RPO/RTO, and a tier-0-first sequence

Not everything recovers at once, and trying to recover everything in parallel during a real incident guarantees you recover nothing on time. Classify applications into recovery tiers up front, and recover the dependencies before the things that depend on them.

| Tier | Examples | Target RTO | Target RPO | Backup cadence |
|---|---|---|---|---|
| Tier 0 (foundation) | AD/Entra, DNS, PKI, IPAM, the backup system itself | < 4 h | < 1 h | Continuous / hourly |
| Tier 1 (critical revenue) | Core DB, payments, primary app | < 8 h | < 1 h | Hourly + log shipping |
| Tier 2 (important) | Internal apps, reporting | < 24 h | < 4 h | 4-hourly |
| Tier 3 (deferrable) | Dev/test, archives | Best effort | 24 h | Daily |

The recovery sequence is dependency order, not business-priority order. A common mistake is restoring the revenue app first; it then can’t authenticate because Active Directory isn’t back, can’t resolve names because DNS isn’t back, and can’t validate certs because PKI isn’t back. Restore the foundation, validate it in the clean room, then layer critical workloads on top.
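Dependency order is a topological sort, so the sequence can be generated rather than argued about mid-incident. The sketch below uses Python's standard `graphlib` module to group systems into restore waves. The dependency map is an invented example (your real edges will differ, and AD and DNS are often co-hosted), and `recovery_waves` is a name chosen for this illustration.

```python
from graphlib import TopologicalSorter

# system -> set of systems it depends on (illustrative names and edges)
DEPENDS_ON = {
    "dns": set(),
    "active-directory": {"dns"},
    "pki": {"active-directory"},
    "core-db": {"active-directory", "dns"},
    "payments-app": {"core-db", "pki"},
    "reporting": {"core-db"},
}


def recovery_waves(depends_on):
    """Group systems into waves. Everything in one wave has all of its
    dependencies restored already, so a wave can be restored in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = recovery_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

With this map the revenue application lands in the last wave, not the first. A `CycleError` from `prepare()` is also useful information: it means two systems each need the other to start, and the runbook needs a manual bootstrap step.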

Tier 0 deserves special handling: keep an AD forest recovery runbook that does not depend on any surviving production infrastructure. It should cover System State (or Veeam AD object-level) backups of at least two DCs and the recovery sequence itself: seize the FSMO roles, reset the krbtgt password twice, and clean up metadata. Microsoft publishes the canonical forest-recovery procedure; pre-stage a copy inside the IRE, because you will not be able to download it when your domain is encrypted.

6. Validate backup integrity and detect tampering early

A backup you have not verified is Schrödinger’s recovery point. Two things must be automated: proving a restore works, and detecting that backups are being tampered with while there is still clean data behind the bad data.

Veeam’s SureBackup (or the equivalent in your stack) boots restored VMs in an isolated virtual lab, runs heartbeat/ping/application-level tests, and marks the recovery point verified - on a schedule, without touching production:

# Veeam SureBackup: schedule automated recovery verification in an
# isolated virtual lab. A pass = a restore you have actually performed.
# Cmdlet and parameter names differ between Veeam versions; check Get-Help.
$lab  = Get-VSBVirtualLab -Name "CleanRoom-Lab"
$job  = Get-VBRJob -Name "Tier1-Core-DB"
# Assumes an application group with this (hypothetical) name already exists.
$app  = Get-VSBApplicationGroup -Name "Tier1-Core-DB-AppGroup"

Add-VSBJob -Name "Verify-Tier1-Nightly" `
  -VirtualLab $lab -AppGroup $app `
  -LinkedJob $job

# Run the verification job; alert on any result other than success.
Start-VSBJob -Job (Get-VSBJob -Name "Verify-Tier1-Nightly")

For tamper detection, treat backup activity as a security signal. Sudden mass deletions, retention policy changes, soft-delete being disabled, or a spike in “backup failed” across many jobs are all early indicators that the adversary has reached the backup tier. Stream the control-plane logs to your SIEM and alert. In Azure, control-plane operations against the vault are recorded in the subscription activity log, which you can query in the AzureActivity table once it is exported to Log Analytics:

// Sentinel/Log Analytics: detect destructive operations against backup vaults.
AzureActivity
| where ResourceProviderValue in~ (
    "MICROSOFT.RECOVERYSERVICES", "MICROSOFT.DATAPROTECTION")
| where OperationNameValue has_any (
    "delete", "stopProtection", "softDelete", "immutabilitySettings",
    "backupResourceGuardProxies")
// ActivityStatusValue uses "Start"/"Success"/"Failure"; in~ ignores case.
| where ActivityStatusValue in~ ("Success", "Start")
| project TimeGenerated, Caller, OperationNameValue,
          ActivityStatusValue, _ResourceId, CallerIpAddress
| sort by TimeGenerated desc
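The "spike in backup failures" signal mentioned above can be approximated with a simple baseline comparison before you invest in anything cleverer. The sketch below flags any hour whose failure count is well above the median of the preceding window. The function name, the 24-hour window, the 3x factor, and the floor of 5 failures are all assumptions picked for illustration and would need tuning against your own job history.

```python
from statistics import median


def failure_spikes(hourly_failures, window=24, factor=3.0, floor=5):
    """Return indexes of hours whose failure count is at least `floor`
    and more than `factor` times the median of the previous `window` hours."""
    spikes = []
    for i in range(window, len(hourly_failures)):
        baseline = median(hourly_failures[i - window:i])
        count = hourly_failures[i]
        # max(baseline, 1) keeps a zero baseline from flagging every blip.
        if count >= floor and count > factor * max(baseline, 1):
            spikes.append(i)
    return spikes


# 24 quiet hours, one normal hour, then a burst of failures.
counts = [1, 0, 2, 1] * 6 + [1, 14]
spikes = failure_spikes(counts)
print(spikes)
```

A median baseline is used rather than a mean so that one earlier bad night does not raise the threshold and hide the next one.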

Pair that with an integrity check that an attacker cannot alter: keep a content hash of each critical recovery point’s manifest written to a separate, append-only store (a different cloud, a WORM bucket), and reconcile periodically. If the vault claims a recovery point exists but its hash no longer matches your external ledger, you have detected tampering independent of the system being tampered with.
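A minimal sketch of that reconciliation, assuming each recovery point exposes a manifest you can serialize as JSON and that the ledger maps recovery point IDs to SHA-256 digests recorded at backup time. The manifest fields, the ID format, and both function names are hypothetical; how you fetch manifests from your backup product and where the ledger lives are left out.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Hash a manifest in a canonical form so key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(ledger: dict, vault_manifests: dict) -> dict:
    """Compare the external ledger against what the vault reports now.

    ledger:          recovery_point_id -> digest recorded at backup time
    vault_manifests: recovery_point_id -> manifest currently in the vault
    """
    report = {"ok": [], "tampered": [], "missing": [], "unledgered": []}
    for rp_id, recorded in sorted(ledger.items()):
        manifest = vault_manifests.get(rp_id)
        if manifest is None:
            report["missing"].append(rp_id)       # deleted from the vault
        elif manifest_digest(manifest) == recorded:
            report["ok"].append(rp_id)
        else:
            report["tampered"].append(rp_id)      # content changed
    # Points the vault has but the ledger never saw.
    report["unledgered"] = sorted(set(vault_manifests) - set(ledger))
    return report


good = {"vm": "vm-dc01", "size_bytes": 1024, "blocks": ["a1", "b2"]}
ledger = {
    "rp-001": manifest_digest(good),
    "rp-002": manifest_digest({"vm": "vm-db01", "size_bytes": 2048}),
    "rp-003": manifest_digest({"vm": "vm-app01", "size_bytes": 512}),
}
vault_now = {
    "rp-001": good,
    "rp-002": {"vm": "vm-db01", "size_bytes": 9999},  # altered
    "rp-004": {"vm": "vm-new", "size_bytes": 1},      # never ledgered
}
report = reconcile(ledger, vault_now)
print(report)
```

Anything in `tampered` or `missing` is worth paging on. `unledgered` is usually a pipeline gap rather than an attack, but it means a recovery point exists that you cannot vouch for.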

7. Run recovery rehearsals and measure time-to-restore honestly

The number that matters in a board update is not “we have backups.” It is “we restored tier-0 and tier-1 in the clean room last quarter in 6 hours 40 minutes, against an 8-hour RTO.” You only have that number if you rehearse.

Run a full isolated recovery drill at least quarterly for tier-0/tier-1:

  1. Assume the production identity plane is gone - log in to the IRE with break-glass only.
  2. Restore AD/DNS/PKI from immutable points into the clean room.
  3. Restore the tier-1 application stack on top of the recovered foundation.
  4. Validate application-level health, not just “the VM booted.”
  5. Record wall-clock time per phase. The honest RTO includes decision time, approvals, and the inevitable stumbles - not just the restore-job duration.

# Measure restore wall-clock per item and emit a CSV for the drill report.
# Real time-to-restore = decision + approval + restore + validate, not just this.
# restore-disks only submits the job, so wait for it before stopping the clock.
# Azure VM container/item names use the IaasVMContainer/VM;iaasvmcontainerv2 form.
start=$(date +%s)
job=$(az backup restore restore-disks \
  --resource-group rg-backup-tier0 --vault-name rsv-prod-immutable \
  --container-name "IaasVMContainer;iaasvmcontainerv2;rg-prod;vm-dc01" \
  --item-name "VM;iaasvmcontainerv2;rg-prod;vm-dc01" \
  --rp-name "$RECOVERY_POINT" \
  --storage-account stcleanroomstaging \
  --target-resource-group rg-ire \
  --query name -o tsv)
az backup job wait \
  --resource-group rg-backup-tier0 --vault-name rsv-prod-immutable \
  --name "$job"
end=$(date +%s)
printf 'vm-dc01,%s,%d\n' "$(date -u +%FT%TZ)" "$((end - start))" >> drill-rto.csv

Track the trend over quarters. RTO should fall as the runbook tightens; if it is flat or rising, the rehearsal is theater. Also rehearse the ugly paths: the recovery point you wanted is corrupt and you fall back one generation; the IRE capacity is half of production and you have to triage which tier-1 systems come first.
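To keep the drill report honest, the restore-job seconds should be combined with the time spent on everything else. The sketch below reads rows in the `item,timestamp,seconds` shape and adds a separately recorded overhead figure before comparing against the RTO. The `drill_summary` function, the sample rows, and the assumption that restores ran one after another (so durations add up) are all illustrative.

```python
import csv
import io


def drill_summary(csv_text, overhead_seconds, rto_seconds):
    """Sum per-item restore durations and add non-restore overhead
    (decision, approvals, validation). Assumes sequential restores."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    restore = sum(int(seconds) for _item, _timestamp, seconds in rows)
    total = restore + overhead_seconds
    return {
        "restore_s": restore,
        "total_s": total,
        "within_rto": total <= rto_seconds,
    }


# Made-up drill data: two tier-0 restores, then two hours of overhead.
sample = (
    "vm-dc01,2024-01-01T00:00:00Z,5400\n"
    "vm-dns01,2024-01-01T01:30:00Z,3600\n"
)
summary = drill_summary(sample, overhead_seconds=2 * 3600, rto_seconds=4 * 3600)
print(summary)
```

In this made-up drill the restore jobs alone take 2.5 hours and look comfortable against a 4-hour target, but the total comes to 4.5 hours and misses it. That gap is the number to track quarter over quarter.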

Enterprise scenario

A European logistics company running a roughly 400-VM VMware estate took a ransomware hit through a compromised VPN appliance. The crew had domain admin for nine days, and on detonation night they used the backup service account - which had vCenter admin - to delete the Veeam backup jobs and the primary repository before encrypting the VMs. Standard playbook.

What saved them was a control they had added eighteen months earlier after a tabletop exercise exposed exactly this gap: a Veeam hardened Linux repository on a dedicated box outside the Windows domain, where backup files are made immutable at the filesystem level via the XFS i (immutable) attribute for the retention period. The attacker’s domain credentials were useless against it: it had no domain trust, its SSH access used single-use credentials managed out of band, and the repo service itself sets and clears the immutable flag. While the flag is set, not even root can delete or modify a locked file, and nothing the attacker held got them onto the box to clear it.

The constraint they hit during recovery was speed: restoring 400 VMs over the repository’s network link was projected at four days, blowing every RTO. They solved it by restoring in dependency-tier order into an isolated vSphere cluster (the clean room), bringing back DCs, DNS, and the four revenue-critical workloads first - about 30 VMs - and getting the business transacting in under twelve hours, then back-filling tier 2 and 3 over the following days. The immutable flag is what made the restore possible; the tiered clean-room sequence is what made it fast enough.
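The arithmetic behind that decision is worth making explicit. The sketch below scales the four-day projection down to the first wave, assuming every VM is roughly the same size and the repository link is the only bottleneck. Both assumptions are simplifications introduced here and are not stated in the incident itself, and `projected_restore_hours` is a name made up for the example.

```python
def projected_restore_hours(total_vms, total_hours, subset_vms):
    """Linear estimate: equal-sized VMs, link throughput as the only limit."""
    return total_hours * subset_vms / total_vms


full_estate_hours = 4 * 24  # the four-day projection for all 400 VMs
first_wave_hours = projected_restore_hours(400, full_estate_hours, 30)
print(f"first wave of 30 VMs: about {first_wave_hours:.1f} h")
```

Under those assumptions the 30-VM first wave needs roughly 7 hours of transfer, which leaves headroom for validation inside the twelve hours they actually achieved.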

The single config that mattered:

# Hardened Linux repo: data made immutable at the filesystem layer.
# Veeam's repo service sets/clears the +i attr itself; the manual chattr below
# only illustrates the mechanism. While +i is set, even root cannot delete or
# modify the file. Clearing the flag takes root on the box, so lock that down.
chattr +i /backups/veeam/Tier1-Core-DB/*.vbk
lsattr /backups/veeam/Tier1-Core-DB/    # ----i--------- on locked files

Post-incident, they added an external append-only hash ledger (step 6) so they would have detected the job-deletion attempt within minutes rather than at detonation, and moved the repo’s single-use credentials into a PAM vault in a separate tenant.

Verify

Before you call this resilient, confirm each guarantee holds against an admin-level adversary, not just on paper:

Checklist
