Databases

Backup & Disaster Recovery

3 min read
Focus: DATABASES

TL;DR — Quick Summary

  • 3-2-1 rule: 3 copies, 2 media types, 1 offsite — prevents any single failure from wiping all backups.
  • RPO = max acceptable data loss (drives backup frequency). RTO = max acceptable downtime (drives recovery speed).
  • PITR (Point-in-Time Recovery) via WAL archiving lets you restore to any exact minute — the gold standard.
  • An untested backup is not a backup — run restore drills regularly.

Lesson Overview

💾 The Data Loss Reality Check

Data loss doesn't just come from hardware failures. The most common causes are human error (accidental DELETE with no WHERE clause), software bugs that corrupt data, and ransomware. The question isn't if you'll need a backup — it's when.

The 3-2-1 Backup Rule:

  • 🔢 3 copies of your data
  • 💾 2 different storage media types
  • 🌍 1 copy stored offsite (geographically separate)
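The rule can be sketched in a few lines of shell. This is a minimal illustration, assuming the dump file already exists; `replicate_321` is a hypothetical helper, and the "offsite" step is a plain `cp` here, where production would use an object-storage upload (e.g. `aws s3 cp` or rclone).

```shell
#!/bin/sh
# Minimal 3-2-1 sketch: given an existing dump (copy 1), place a
# copy on a second storage medium and a copy offsite. The directory
# arguments are placeholders; in production the offsite step would
# be an object-storage upload, not a local cp.
replicate_321() {
  src="$1"; second_medium="$2"; offsite="$3"
  cp "$src" "$second_medium/" || return 1   # copy 2: different storage medium
  cp "$src" "$offsite/"       || return 1   # copy 3: geographically separate
  echo "3 copies of $(basename "$src") now exist"
}
```

Run it right after the dump completes, and alert on a non-zero exit code so a failed copy never goes unnoticed.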

📦 Types of Backups

  • Full backup: Complete snapshot of the entire database. Largest, slowest, but simplest to restore. Run weekly or daily.
  • Incremental backup: Only changes since the last backup (full or incremental). Small and fast. Must replay all increments for restoration.
  • Differential backup: Changes since the last full backup. Larger than incremental, but only need last full + last differential to restore.
  • Point-in-Time Recovery (PITR): Restore to any exact moment in time by replaying the transaction log (WAL). The gold standard for production databases.
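The full-vs-incremental distinction is easy to see with a file-level tool. The sketch below uses GNU tar's `--listed-incremental` mode in a throwaway temp directory (all paths are illustrative): the snapshot file records what was already backed up, so the second archive contains only what changed, and restores must replay the full backup first, then each increment in order.

```shell
#!/bin/sh
# Full + incremental cycle with GNU tar, entirely inside a temp dir.
work=$(mktemp -d); data="$work/data"; bk="$work/backups"
mkdir -p "$data" "$bk"

echo "row1" > "$data/t1.csv"
tar --listed-incremental="$bk/snap" -cf "$bk/full.tar" -C "$work" data    # full (level 0)

echo "row2" > "$data/t2.csv"
tar --listed-incremental="$bk/snap" -cf "$bk/incr1.tar" -C "$work" data   # only t2.csv is new

# Restore order matters: full first, then each incremental in sequence.
# /dev/null tells tar to extract in incremental mode without a snapshot.
restore="$work/restore"; mkdir -p "$restore"
tar --listed-incremental=/dev/null -xf "$bk/full.tar"  -C "$restore"
tar --listed-incremental=/dev/null -xf "$bk/incr1.tar" -C "$restore"
```

Skipping one incremental in the restore sequence silently loses that day's changes, which is exactly why differentials (full + latest diff only) are simpler to restore.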

⏱️ Key Recovery Metrics

  • RPO (Recovery Point Objective): Maximum acceptable data loss — "we can tolerate losing up to 1 hour of data."
  • RTO (Recovery Time Objective): Maximum acceptable downtime — "the system must be back online within 4 hours."

An untested backup is not a backup — it's a false sense of security. Run restore drills regularly. Many companies discover their backups are corrupt or incomplete only when they desperately need them.

Conceptual Deep Dive

Think of RPO and RTO as a contract with your business. RPO answers 'how much work can we afford to redo?' RTO answers 'how long can the business survive without the database?'

A hospital's patient records system might have RPO = 0 (zero data loss tolerated) and RTO = 15 minutes (must be back online fast). A blog might have RPO = 24 hours (daily backup is fine) and RTO = 4 hours (a few hours of downtime is acceptable). Your backup strategy should be sized to meet these objectives — and they should be defined by the business, not guessed by engineers.

Architecture & Data Flow

3-2-1 Backup Rule Visualized
Point-in-Time Recovery — How It Works

Implementation Lab

MySQL / PostgreSQL Backup Commands
# ── MySQL: Full backup with mysqldump ──
mysqldump \
  --user=root \
  --password \
  --single-transaction \
  --routines \
  --triggers \
  --all-databases \
  | gzip > /backups/full_$(date +%Y%m%d_%H%M%S).sql.gz
 
# Restore from mysqldump (an --all-databases dump needs no db name)
gunzip < /backups/full_20240115_020000.sql.gz | mysql -u root -p
 
# ── PostgreSQL: pg_dump (logical) ──
pg_dump -U postgres -Fc myapp > /backups/myapp_$(date +%Y%m%d).dump
# -Fc = custom compressed format
 
# Restore
pg_restore -U postgres -d myapp /backups/myapp_20240115.dump
 
# ── PostgreSQL: pg_basebackup (physical — for PITR) ──
pg_basebackup -U replicator -D /backups/base -Ft -z -Xs -P
# This creates a binary backup + WAL stream for point-in-time recovery
 
# ── Automated backup with cron ──
# Edit crontab: crontab -e
# Store credentials in ~/.my.cnf ([client] section); an interactive
# -p password prompt would hang inside a cron job
# Daily at 2am: back up; daily at 3am: prune files older than 30 days
0 2 * * * mysqldump --single-transaction myapp | gzip > /backups/daily_$(date +\%Y\%m\%d).sql.gz
0 3 * * * find /backups -name 'daily_*.sql.gz' -mtime +30 -delete
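Since an untested backup is not a backup, it is worth verifying each dump the moment it is written. A minimal sketch (`verify_dump` is a hypothetical helper): `gunzip -t` checks the gzip stream without extracting anything, and `-s` catches empty files.

```shell
#!/bin/sh
# verify_dump FILE -- basic sanity checks on a gzipped dump.
verify_dump() {
  f="$1"
  [ -s "$f" ] || { echo "FAIL: $f is empty or missing"; return 1; }
  gunzip -t "$f" 2>/dev/null || { echo "FAIL: $f is not valid gzip"; return 1; }
  echo "OK: $f passes basic checks"
}
```

This only proves the file is readable; a real restore drill into a scratch database remains the only true test.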
Point-in-Time Recovery (PITR) — PostgreSQL
# Scenario: a developer ran DELETE FROM orders WHERE... at 14:32
# and forgot the WHERE clause. You need to restore to 14:31.
 
# 1. Stop the database
systemctl stop postgresql
 
# 2. Restore the base backup
rm -rf /var/lib/postgresql/data/*
tar -xzf /backups/base.tar.gz -C /var/lib/postgresql/data/
 
# 3. Create recovery config to replay WAL up to 14:31
#    (PostgreSQL 12+: put these settings in postgresql.conf and
#    create an empty recovery.signal file instead of recovery.conf)
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:31:00'
recovery_target_action = 'pause'
EOF
 
# 4. Start PostgreSQL — it replays WAL up to 14:31 and stops
systemctl start postgresql
 
# 5. Verify the data is correct, then resume; when paused at the
#    recovery target this promotes the server and opens it to connections
psql -c "SELECT pg_wal_replay_resume();"

# The accidentally deleted orders are back; at most one minute of
# writes made after 14:31 is lost.
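This scenario only works if WAL archiving was enabled before the incident. A sketch of the prerequisite postgresql.conf settings (the archive directory is a placeholder):

```
# postgresql.conf — must be in place before you ever need PITR
wal_level = replica          # emit enough WAL detail for recovery
archive_mode = on            # requires a server restart
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
```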

Best Practices — Interactive Comparison

Always use --single-transaction for InnoDB — never lock your database during a backup

bash
# ❌ Default mysqldump takes table locks for the duration of the dump.
# The app is effectively read-only while the backup runs; at 100GB
# that can mean 30+ minutes of degraded service, and writers see:
#   ERROR 1205 Lock wait timeout exceeded
mysqldump -u root -p myapp > backup.sql

# ✅ --single-transaction dumps InnoDB from a consistent snapshot
# without blocking writes
mysqldump --single-transaction -u root -p myapp > backup.sql

Pro Tips — Senior Dev Insights

1. Use managed database services (AWS RDS, GCP Cloud SQL, Azure Database) — they handle automated backups, PITR, and cross-region replication with a few clicks.

2. Implement backup encryption — backup files often contain sensitive data. Encrypt with GPG or use your cloud provider's KMS-managed encryption.

3. Set up a read replica in a separate region — it serves as both a disaster recovery target and reduces read load on the primary.

4. pg_dump --schema-only exports just the schema without data — invaluable for quickly spinning up a fresh environment or comparing schema versions.
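The encryption tip can be sketched with openssl as a scripting-friendly stand-in for GPG (same idea: symmetric AES-256). The temp directory and hard-coded passphrase are for illustration only; in production the passphrase would live in a root-only file or a KMS.

```shell
#!/bin/sh
# Encrypt a backup before shipping it offsite, then verify the
# round trip. Everything happens in a throwaway temp dir.
work=$(mktemp -d)
echo "CREATE TABLE users (id INT);" > "$work/backup.sql"

# Encrypt (AES-256, key derived from the passphrase via PBKDF2)
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass pass:example-only \
    -in "$work/backup.sql" -out "$work/backup.sql.enc"

# Decrypt and confirm the bytes match the original
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass pass:example-only \
    -in "$work/backup.sql.enc" -out "$work/restored.sql"
cmp -s "$work/backup.sql" "$work/restored.sql" && echo "round-trip OK"
```

Losing the passphrase means losing the backup, so key material needs the same 3-2-1 discipline as the data itself.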

⚖️ Backup Types — Trade-offs

Feature               Full                 Incremental                    Differential                    PITR
Backup size           Largest              Smallest                       Medium                          Medium + WAL logs
Backup speed          Slowest              Fastest                        Medium                          Continuous
Restore complexity    Simple               High — replay all increments   Medium — full + latest diff     Low — pick a timestamp
Restore speed         Fast                 Slow                           Medium                          Varies by target time
Recovery granularity  Snapshot only        Snapshot only                  Snapshot only                   Any second
Best used for         Weekly full backup   Daily deltas                   Daily deltas, simpler restore   Production standard

Common Developer Pitfalls

  • ⚠️ Never testing backups — discovering they're corrupt or incomplete during a crisis is far too late.
  • ⚠️ Storing backups in the same location as the database — a single failure (fire, ransomware, account compromise) destroys both.
  • ⚠️ Not monitoring backup jobs — a silent failure means you think you have backups but you don't (the GitLab incident).
  • ⚠️ Ignoring restore time — a 1TB backup that takes 6 hours to restore is useless if your RTO is 2 hours.
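The monitoring pitfall has a cheap mitigation: a cron-run freshness check. A minimal sketch (`check_backup_freshness` is a hypothetical helper; wire the ALERT branch to email or your pager):

```shell
#!/bin/sh
# check_backup_freshness DIR MAX_AGE_MINUTES
# Alerts when no backup file newer than the window exists, which
# catches cron jobs that have started failing silently.
check_backup_freshness() {
  dir="$1"; max_minutes="$2"
  recent=$(find "$dir" -name '*.sql.gz' -mmin "-$max_minutes" | head -n 1)
  if [ -n "$recent" ]; then
    echo "OK: recent backup found: $recent"
  else
    echo "ALERT: no backup in the last $max_minutes minutes"
    return 1
  fi
}
```

A 25-hour window (1440 + slack) suits a daily schedule; the point is that the check fails loudly instead of the backup failing quietly.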

Interview Mastery

RPO (Recovery Point Objective) is the maximum amount of data loss acceptable — measured in time. RPO = 1 hour means you can tolerate losing at most 1 hour of data. This drives backup frequency (hourly backups for RPO = 1 hour). RTO (Recovery Time Objective) is the maximum acceptable downtime before the system must be operational again. RTO = 4 hours means the database must be restored within 4 hours of an incident. This drives your recovery process speed and whether you need hot standby replicas.

PITR allows restoring a database to any specific moment in time — not just the last backup. It works by: (1) taking a full base backup periodically, (2) continuously archiving the transaction log (WAL in PostgreSQL, binary log in MySQL) to a separate location. During recovery, you restore the base backup then replay all WAL/binlog records up to your desired timestamp. This means if someone accidentally deleted data at 3:42 PM, you can restore to 3:41 PM — just 1 minute of data loss.

The 3-2-1 rule: keep 3 total copies of data (1 primary + 2 backups), store backups on 2 different types of storage media (e.g., local disk + cloud), and keep 1 copy offsite (geographically separate). This ensures that no single failure — hardware failure, fire, flood, ransomware attack — can destroy all copies simultaneously.

Real-World Blueprint

"In 2017, GitLab accidentally deleted their production database during a maintenance operation. The incident revealed that their backups had been failing silently for months; none of the five backup and replication mechanisms they relied on produced a usable copy. Recovery took roughly 18 hours, restoring from a 6-hour-old staging snapshot, and about 6 hours of production data was permanently lost. GitLab then published a public post-mortem and rebuilt their entire backup/DR process from scratch. The lesson: test your backups regularly, don't just assume they work."

Hands-on Lab Exercises

1. Set up automated daily mysqldump backups with a cron job. Verify the file is created and has non-zero size.

2. Practice a full restore: take a backup, create a test database, restore the backup into it, and verify the data matches.

3. Enable PostgreSQL WAL archiving and perform a point-in-time recovery: make changes, wait 2 minutes, then restore to 1 minute ago.

4. Write a backup monitoring script that alerts you if a backup file hasn't been created in the last 25 hours.

Real-World Practice Scenarios

A developer ran UPDATE users SET role = 'admin' without a WHERE clause — every user is now an admin. You need to fix this. What's your recovery plan and how long will it take?

Your company's RPO is 4 hours and RTO is 2 hours. Design a backup strategy that meets both requirements.

Your database backup files are stored in the same AWS account as the production database. A security breach compromises the account. Why is this a problem, and how should backups be isolated?

You restored from backup and the data is from 6 hours ago, but you need the last 5 hours. What process (if you had set it up in advance) could recover this data?