The Only Rule That Matters
- Untested backups are not backups — schedule regular restore tests, not just backup jobs
- Test restores to different hardware/location — validates both backup and restore procedure
- Time the restore — know how long recovery actually takes before disaster strikes
3-2-1 Rule Violations
- Same disk as source data = not a backup — disk failure loses both
- Same server as source = not a backup — ransomware/fire/theft takes both
- Same cloud account = risky — account compromise or provider issue loses both
- Cloud sync (Dropbox, Drive) is not backup — syncs deletions and corruption too
Ransomware Protection
- Backups accessible from production get encrypted too — air gap or immutable storage required
- Append-only/immutable storage prevents deletion — S3 Object Lock, Backblaze B2 with retention
- Offline rotation (USB drives, tapes) for critical data — can't encrypt what's not connected
- Test restoring from immutable backup — verify ransomware can't corrupt the restore process
Database Backup Traps
- File copy of running database = corrupted backup — use pg_dump, mysqldump, mongodump
- Point-in-time recovery needs WAL/binlog archiving — dump alone loses recent transactions
- Large databases: pg_dump locks tables — use pg_basebackup or logical replication for zero downtime
- Test restore to different server — verifies backup is self-contained
Incremental Backup Pitfalls
- Incrementals depend on chain — one corrupted backup breaks all following
- Long chains slow restores — schedule periodic full backups
- Deduplication saves space but adds complexity — single repo corruption affects all backups
- Verify backup integrity regularly — bit rot happens, checksums catch it
Retention Mistakes
- No retention policy = storage fills up — define and automate cleanup
- Too aggressive retention = can't recover old corruption — keep monthlies for a year minimum
- Legal/compliance requirements may mandate retention — check before setting policy
- Grandfather-father-son pattern: daily/weekly/monthly tiers
Filesystem Traps
- Permissions and ownership often lost — verify restore preserves them, or document expected state
- Symlinks may not backup correctly — some tools follow, some copy link, test behavior
- Sparse files may inflate — 1GB sparse file becomes 1GB actual in backup
- Extended attributes and ACLs — not all tools preserve them
Cloud and Remote
- Encrypt before upload — cloud provider breach shouldn't expose your data
- Bandwidth costs add up — initial seed via physical drive for large datasets
- Region matters for disaster recovery — same region as production doesn't survive regional outage
- Egress fees can be brutal — know restore costs before emergency
Tool-Specific
- rsync
--deleteon wrong direction destroys source — always double-check source/destination - restic/borg need repository password — lose it = lose all backups, no recovery
- Tarball without compression: faster, but larger — choose based on CPU vs storage tradeoff
- Snapshots (LVM, ZFS, cloud) are not backups — same storage system, same failure domain
Documentation
- Document restore procedure — you won't remember under pressure
- Store procedure outside the backup — printed, different system, password manager
- Include credentials, paths, expected time — everything needed to restore at 3am