Automate MySQL Hot Backups with Percona XtraBackup and Binlog Point-in-Time Recovery

At 09:14 on a Tuesday an application deploy ships a migration with a missing WHERE clause, and UPDATE orders SET status='cancelled' runs against every row in a 600 GB MySQL 8.0 database before anyone notices. The nightly mysqldump is eleven hours stale and took four hours to restore last time you tried; restoring it loses a full business day of orders and still leaves you at 21:00 the previous night — not at 09:13:58, the instant before the bad transaction. This guide builds the capability that turns that incident from a career-defining outage into a forty-minute recovery: hot physical backups with Percona XtraBackup that never lock the database, continuous binary-log shipping so you can replay forward, and a point-in-time recovery that stops at an exact GTID or timestamp. Everything here is automated, scheduled, monitored, and gated so it survives an audit — not a heroic manual xtrabackup run remembered only by one engineer.

The reason mysqldump is the wrong tool at this size is worth saying once: it is a logical, single-threaded export that holds a consistent snapshot by way of a long-running transaction (and on non-InnoDB tables, real locks), it scales with row count not disk, and restoring it means re-executing every INSERT and rebuilding every index from scratch. Percona XtraBackup instead copies the InnoDB data files at the physical block level while the server runs, reads the redo log to stay consistent, and restores by simply putting files back and replaying redo — minutes of file copy, not hours of SQL replay. Pair that physical base backup with the binary log (MySQL’s transaction journal) and you can roll a restored backup forward transaction by transaction to any point you choose. That is point-in-time recovery (PITR), and it is the only thing that gets you to 09:13:58.

Prerequisites

A running MySQL 8.0 (or Percona Server 8.0) primary with log_bin enabled and gtid_mode = ON — GTIDs make PITR deterministic and are assumed throughout.
The binlog_format = ROW and binlog_row_image = FULL settings, so binary logs are replayable and auditable.
A backup host or volume with at least 1.5× the dataset size free, plus object storage (S3 / Azure Blob / GCS) for offsite copies.
Percona XtraBackup 8.0 whose minor version matches the server’s — XtraBackup is version-locked to the MySQL series it reads.
Root or sudo on the DB host, and a dedicated MySQL user for backups (created below) — not root.
HashiCorp Vault reachable for backup-user and storage credentials; Ansible (or your config tool) to roll the config out repeatably.

Target topology

Automate MySQL Hot Backups with Percona XtraBackup and Binlog Point-in-Time Recovery — topology

The topology has three planes. The data plane is the MySQL primary and its replicas; XtraBackup runs against a replica when one exists so the primary never feels the I/O. The backup plane is a dedicated backup host that pulls a weekly full plus daily incrementals, streams every binary log offsite within seconds of it closing, and pushes both to versioned, object-locked storage fronted by Akamai for fast, cached restores to remote regions. The control plane ties it together: a HashiCorp Vault AppRole issues short-lived MySQL and storage credentials so no secret sits in a cron file; Jenkins (or GitHub Actions) drives the scheduled jobs and the quarterly restore drill; Terraform provisions the storage bucket, IAM, and object-lock policy while Ansible lays down the XtraBackup config and systemd units; Dynatrace watches backup age, duration, and binlog-ship lag and pages on a missed window; Wiz scans the bucket and DB host for public-exposure or weak-encryption drift; CrowdStrike Falcon runs on every node for runtime threat detection; and a failed backup or a started recovery auto-raises a ServiceNow change/incident record so the audit trail writes itself. Okta, federated to Entra ID, gates every human who can reach Jenkins, Vault, or the restore runbook.

1. Prepare the server and a least-privilege backup user

XtraBackup needs a MySQL account with a specific, minimal grant set — never reuse root. Confirm the server is configured for PITR first, then create the user. Run inside mysql:

-- Verify PITR-critical settings; all four must be as shown.
SHOW VARIABLES WHERE Variable_name IN
  ('log_bin','gtid_mode','binlog_format','binlog_row_image','enforce_gtid_consistency');

-- Keep binary logs long enough that a weekend incident is still replayable.
SET PERSIST binlog_expire_logs_seconds = 1209600;   -- 14 days

-- Dedicated backup principal. Credentials come from Vault (Step 2), not here.
CREATE USER 'xtrabackup'@'localhost' IDENTIFIED BY '<<from-vault>>';
GRANT BACKUP_ADMIN, RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT,
      SELECT ON *.* TO 'xtrabackup'@'localhost';
-- BACKUP_ADMIN unlocks the lightweight LOCK INSTANCE FOR BACKUP used by XtraBackup 8.0
-- instead of a global read lock.
FLUSH PRIVILEGES;

Install XtraBackup matched to the server series (Percona’s repo selects the right one):

sudo percona-release setup ps80
sudo apt-get update && sudo apt-get install -y percona-xtrabackup-80 qpress
xtrabackup --version   # must print an 8.0.x that matches your MySQL 8.0.x line

qpress enables on-the-fly compression. If you back up from a replica (recommended), point all xtrabackup commands at it and add --slave-info --safe-slave-backup so the replica’s coordinates and GTID position are captured in xtrabackup_slave_info.

2. Pull short-lived credentials from Vault

No long-lived password belongs in a cron file or ~/.my.cnf. Store the backup user’s password and the object-storage keys in HashiCorp Vault, and have the job fetch them with an AppRole at run time. This is also what lets Wiz pass the host without flagging a plaintext secret on disk.

# Authenticate with the host's AppRole (role_id baked by Ansible, secret_id wrapped).
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
  role_id="$(cat /etc/vault/role_id)" \
  secret_id="$(cat /etc/vault/secret_id)")
export VAULT_TOKEN

# Pull just-in-time credentials into the process environment only.
export MYSQL_PWD=$(vault kv get -field=password secret/mysql/xtrabackup)
export AWS_ACCESS_KEY_ID=$(vault kv get -field=akid secret/backup/s3)
export AWS_SECRET_ACCESS_KEY=$(vault kv get -field=secret secret/backup/s3)

Write these into a ~/.mylogin.cnf via mysql_config_editor if you prefer obfuscated local creds, but the Vault-at-runtime pattern keeps the secret off disk entirely and is the one to standardize on. Okta → Entra ID SSO governs which engineers can even read these Vault paths or unwrap a secret_id.

3. Take the weekly full backup

The full backup is the foundation every incremental and every PITR builds on. Stream it compressed straight to a target directory; in production you would pipe it to object storage (Step 6). Take the base on a Sunday:

BASE=/backups/mysql/full/$(date +%F)
mkdir -p "$BASE"

xtrabackup --backup \
  --user=xtrabackup \
  --target-dir="$BASE" \
  --compress --compress-threads=4 \
  --parallel=4 \
  --slave-info --safe-slave-backup     # omit these two if backing up the primary

A clean run ends with completed OK! and writes xtrabackup_checkpoints (the LSN this backup reached — the anchor for the next incremental) and xtrabackup_binlog_info (the binlog file + GTID set at backup time — the anchor for PITR). Capture both; they are the coordinates everything downstream depends on.

Prepare the base immediately so it is restore-ready and can seed incrementals. Use --apply-log-only on the base and on every incremental except the last, which leaves the redo log open to accept the next delta:

xtrabackup --decompress --parallel=4 --target-dir="$BASE"   # qpress -> raw files
xtrabackup --prepare --apply-log-only --target-dir="$BASE"

4. Take daily incrementals against the base

An incremental copies only the InnoDB pages whose LSN advanced since a reference backup — typically last night’s. This is what makes a daily cadence cheap on a 600 GB database. Monday through Saturday:

PREV=/backups/mysql/full/2026-06-07          # Sunday's base, or yesterday's incremental
INCR=/backups/mysql/incr/$(date +%F)
mkdir -p "$INCR"

xtrabackup --backup \
  --user=xtrabackup \
  --target-dir="$INCR" \
  --incremental-basedir="$PREV" \
  --compress --compress-threads=4 --parallel=4

Roll each incremental into the base as it lands, so the prepared base always represents “as of last night” and a restore never has to merge a long chain under pressure:

xtrabackup --decompress --parallel=4 --target-dir="$INCR"
xtrabackup --prepare --apply-log-only \
  --target-dir="$BASE" --incremental-dir="$INCR"

5. Ship binary logs continuously for the “to-the-second” capability

A daily incremental alone still loses up to 24 hours. The binary log closes that gap: mysqlbinlog --read-from-remote-server --stop-never streams every committed transaction to the backup host as it happens, so your worst-case data loss (RPO) shrinks from a day to seconds. Run it as a long-lived systemd service, not a cron job:

# /usr/local/bin/binlog-ship.sh
NEXT=$(ls /backups/mysql/binlog/ | grep -E '^mysql-bin\.[0-9]+$' | sort | tail -1)
NEXT=${NEXT:-mysql-bin.000001}     # cold start from the first log

exec mysqlbinlog \
  --read-from-remote-server \
  --host=mysql-primary.internal --user=xtrabackup \
  --raw --stop-never --stop-never-slave-server-id=99 \
  --result-file=/backups/mysql/binlog/ \
  "$NEXT"

# /etc/systemd/system/binlog-ship.service
[Unit]
Description=Continuous MySQL binlog shipping
After=network-online.target
[Service]
ExecStart=/usr/local/bin/binlog-ship.sh
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target

sudo systemctl enable --now binlog-ship.service

A rclone/aws s3 sync sidecar then mirrors each closed mysql-bin.NNNNNN to object storage within seconds. The lag between a binlog closing on the primary and landing offsite is the single most important number to monitor (Step 9) — it is your real RPO.

6. Push everything to immutable object storage

Local backups die with the rack. Provision a versioned, object-locked (WORM) bucket so neither the ransomware that just hit your DB host nor a fat-fingered rm can delete your recovery point. Define it in Terraform so the retention and encryption are reviewed, not clicked:

resource "aws_s3_bucket" "mysql_backups" {
  bucket              = "kloudvin-mysql-backups-prod"
  object_lock_enabled = true
}

resource "aws_s3_bucket_object_lock_configuration" "lock" {
  bucket = aws_s3_bucket.mysql_backups.id
  rule { default_retention { mode = "COMPLIANCE"  days = 35 } }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "enc" {
  bucket = aws_s3_bucket.mysql_backups.id
  rule { apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" } }
}

Sync the prepared full, the incrementals, and the binlog stream:

aws s3 sync /backups/mysql/ s3://kloudvin-mysql-backups-prod/ \
  --storage-class STANDARD_IA --only-show-errors

Akamai fronts the bucket as a caching/origin-offload layer so a restore into a far region pulls a warm copy instead of hammering one bucket across an ocean. Wiz continuously scans this bucket and alerts the moment versioning, object-lock, or KMS encryption drifts — the independent backstop that the policy actually holds.

7. Automate the schedule and gate it through CI

Wire the steps into Jenkins (or GitHub Actions) so backups are scheduled, logged, retried, and visible — never a silent cron line one person owns. Ansible lays down the scripts and systemd timer; Jenkins owns orchestration, the restore drill, and the audit record.

// Jenkinsfile (declarative) — runs the right job per day, alarms on failure.
pipeline {
  agent { label 'db-backup' }
  triggers { cron('15 1 * * *') }            // 01:15 daily
  stages {
    stage('Credentials') { steps { sh 'vault-login.sh' } }   // Step 2
    stage('Backup') {
      steps {
        sh '''
          if [ "$(date +%u)" -eq 7 ]; then ./full-backup.sh; else ./incr-backup.sh; fi
        '''
      }
    }
    stage('Offsite + verify') { steps { sh './sync-s3.sh && ./verify-restore.sh' } }
  }
  post {
    failure {
      // Auto-raise a ServiceNow incident so a missed backup is never silent.
      sh './servicenow-incident.sh "MySQL backup FAILED on ${NODE_NAME}"'
    }
  }
}

GitHub Actions with an OIDC-federated cloud role (no stored keys) is an equally good driver if that is your stack; the gating principle is identical. Argo CD has no role here — this is scheduled batch, not a reconciled deployment — so resist the temptation to model it as one.

8. Recover to a precise transaction (the PITR drill)

This is the payoff. You know the bad transaction landed at 09:14:00 on 2026-06-10; you want the database as of 09:13:58. Recovery is two acts: restore the physical base, then replay binlog up to — but not including — the bad transaction.

Act 1 — restore the prepared base into a clean datadir. The base was already --prepare --apply-log-only’d in Steps 3–4; finalize it (a plain --prepare, no --apply-log-only) and copy it back:

systemctl stop mysql
xtrabackup --prepare --target-dir=/backups/mysql/full/2026-06-07     # final prepare
mv /var/lib/mysql /var/lib/mysql.broken
xtrabackup --copy-back --target-dir=/backups/mysql/full/2026-06-07
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

The server is now consistent as of last night’s backup. The binlog carries it forward.

Act 2 — find the cutoff, then replay. Read xtrabackup_binlog_info for the GTID set the backup already contains, then scan the binlog around 09:14 to find the exact GTID of the offending statement:

cat /backups/mysql/full/2026-06-07/xtrabackup_binlog_info
# -> mysql-bin.000412   9c4...:1-880431   (server_uuid:txn_range already in the base)

# Inspect transactions around the incident to identify the bad GTID.
mysqlbinlog --base64-output=DECODE-ROWS -vv \
  --start-datetime="2026-06-10 09:13:50" \
  --stop-datetime="2026-06-10 09:14:30" \
  /backups/mysql/binlog/mysql-bin.000412 | less
# Suppose the runaway UPDATE is GTID 9c4...:880440.

Replay everything the base does not already have, stopping before the bad GTID. GTID-based exclusion is exact — far safer than guessing a timestamp:

# Replay from the first GTID after the backup up to (not including) the bad one.
mysqlbinlog --skip-gtids \
  --include-gtids='9c4...:880432-880439' \
  /backups/mysql/binlog/mysql-bin.000412 \
  | mysql -u root -p

If you must use time instead (no GTIDs, or a damaged GTID set), substitute --stop-datetime="2026-06-10 09:13:59" — but prefer GTID exclusion, because two transactions can share a second. The database now sits at 09:13:58, every order intact and the cancellation never applied. Promote it, repoint the application, and let the replicas rebuild from it. A ServiceNow change record opened at the start of this drill captures who recovered what, to which point, and why.

9. Validation — prove the backup and the restore

A backup you have never restored is a hypothesis. Verify three layers, on every run and on a schedule.

# (a) Did the base prepare cleanly? The marker line must be present.
grep -q 'completed OK!' /backups/mysql/full/2026-06-07/xtrabackup.log && echo "prepare OK"

# (b) Page-level integrity of the restored InnoDB files — no torn/corrupt pages.
innochecksum /var/lib/mysql.restored/test/orders.ibd && echo "checksum OK"

# (c) The only validation that counts: an automated restore drill.
#     Jenkins restores last night's chain into a throwaway instance and asserts a row count.
RESTORED=$(mysql -N -e "SELECT COUNT(*) FROM shop.orders" -h restore-test.internal)
[ "$RESTORED" -ge 5000000 ] && echo "restore drill PASS" || exit 1

Schedule step © as a recurring Jenkins job (a full quarterly drill, a smoke restore nightly). Dynatrace ingests the emitted metrics — backup_age_seconds, backup_duration_seconds, binlog_ship_lag_seconds, last_successful_restore_epoch — and pages when backup age exceeds the window, a backup runs long enough to threaten the next one, or binlog-ship lag (your live RPO) breaches threshold. Alert on the absence of success, not just on errors: a cron that silently never fired is the classic way teams discover they have no backups exactly when they need one.

Rollback and teardown

Decommission cleanly so you never leave an orphaned binlog firehose or a forgotten WORM bucket on the bill.

# Stop continuous shipping and the schedule.
sudo systemctl disable --now binlog-ship.service
# In Jenkins: disable the backup pipeline; in Ansible: remove the timer + scripts.

# Drop the backup user once nothing references it.
mysql -e "DROP USER 'xtrabackup'@'localhost';"

# Object-locked objects in COMPLIANCE mode CANNOT be deleted before their retention
# expires — by design. Let the 35-day lock lapse, then:
aws s3 rm s3://kloudvin-mysql-backups-prod/ --recursive
terraform destroy -target=aws_s3_bucket.mysql_backups

To roll back a botched recovery, you kept /var/lib/mysql.broken from Step 8 — stop MySQL, move it back, and you are exactly where you started, having lost nothing. Never delete the pre-recovery datadir until the recovered instance is validated and promoted.

Common pitfalls

Version skew. XtraBackup 8.0.34 cannot reliably back up MySQL 8.0.36. Pin the XtraBackup minor to the server’s and bump them together — a mismatch fails loudly at best and silently corrupts at worst.
--apply-log-only on the final prepare. Using it on the last incremental (or the base, when no incrementals follow) leaves the backup un-finalized and unbootable. The final prepare is always a plain --prepare.
Binlog expiry shorter than your backup interval. If binlog_expire_logs_seconds purges logs the offsite copy has not captured, your PITR has a hole. Keep expiry well above your full-backup cadence and confirm the offsite stream is current.
STATEMENT binlog format. Non-deterministic statements (NOW(), UUID(), multi-row INSERT ... SELECT) replay differently and silently diverge your recovery. ROW format is non-negotiable for trustworthy PITR.
Backing up the primary at peak. XtraBackup’s I/O is real; run it on a replica with --safe-slave-backup, or throttle with --throttle=<MB/s> so it does not starve production.
Untested restores. The cardinal sin. The first time you run a restore must never be during the outage — drill it on a schedule (Step 9) or you do not have backups, you have files.

Security notes

Backups are a full, portable copy of your most sensitive data — treat the bucket like the database. Encrypt at rest with KMS and in transit with TLS; turn on object-lock (WORM) so backups survive the ransomware that encrypts the live DB. Credentials live only in HashiCorp Vault, fetched just-in-time (Step 2), never in a cron file or .my.cnf. Okta federated to Entra ID gates every human who can reach Vault, Jenkins, or the restore runbook, with least-privilege roles. Wiz scans the bucket and DB host for public-exposure or weak-encryption drift; CrowdStrike Falcon runs on every node for runtime threat detection and feeds the SOC. Every recovery opens a ServiceNow change record, so the audit trail — who restored what, to which point — writes itself.

Cost notes

The dominant cost is storage, and it is small relative to the data it protects. Keep prepared fulls and incrementals in a warm tier for fast restore, but lifecycle older recovery points to cold storage (S3 Glacier / Azure Archive) past your operational window — a 90-day compliance copy does not belong in STANDARD_IA. Compression (--compress, Step 3) cuts both storage and egress materially on typical schemas. Incrementals are the real lever: a daily incremental on a 600 GB database moves only the changed pages, often a few percent, so daily protection costs a fraction of daily fulls. Run the heavy backup I/O on a replica you already pay for rather than scaling the primary up for backup headroom. Dynatrace doubles as the cost guardrail here — trend backup size and storage growth so a runaway table or a stuck lifecycle policy shows up as a line on a graph before it shows up on the invoice.