Databases

Backup & Disaster Recovery

3 min read
Focus: DATABASES

TL;DR — Quick Summary

  • 3-2-1 rule: 3 copies, 2 media types, 1 offsite — prevents any single failure from wiping all backups.
  • RPO = max acceptable data loss (drives backup frequency). RTO = max acceptable downtime (drives recovery speed).
  • PITR (Point-in-Time Recovery) via WAL archiving lets you restore to any exact minute — the gold standard.
  • An untested backup is not a backup — run restore drills regularly.

Lesson Overview

💾 The Data Loss Reality Check

Data loss doesn't just come from hardware failures. The most common causes are human error (accidental DELETE with no WHERE clause), software bugs that corrupt data, and ransomware. The question isn't if you'll need a backup — it's when.

The 3-2-1 Backup Rule:

  • 🔢 3 copies of your data
  • 💾 2 different storage media types
  • 🌍 1 copy stored offsite (geographically separate)
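The rule can be sketched in a few lines of shell. This is a minimal illustration, assuming the dump file already exists; `replicate_321` is a hypothetical helper, and the "offsite" step is a plain `cp` here, where production would use an object-storage upload (e.g. `aws s3 cp` or rclone).

```shell
#!/bin/sh
# Minimal 3-2-1 sketch: given an existing dump (copy 1), place a
# copy on a second storage medium and a copy offsite. The directory
# arguments are placeholders; in production the offsite step would
# be an object-storage upload, not a local cp.
replicate_321() {
  src="$1"; second_medium="$2"; offsite="$3"
  cp "$src" "$second_medium/" || return 1   # copy 2: different storage medium
  cp "$src" "$offsite/"       || return 1   # copy 3: geographically separate
  echo "3 copies of $(basename "$src") now exist"
}
```

Run it right after the dump completes, and alert on a non-zero exit code so a failed copy never goes unnoticed.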

📦 Types of Backups

  • Full backup: Complete snapshot of the entire database. Largest, slowest, but simplest to restore. Run weekly or daily.
  • Incremental backup: Only changes since the last backup (full or incremental). Small and fast. Must replay all increments for restoration.
  • Differential backup: Changes since the last full backup. Larger than incremental, but only need last full + last differential to restore.
  • Point-in-Time Recovery (PITR): Restore to any exact moment in time by replaying the transaction log (WAL). The gold standard for production databases.
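The full-vs-incremental distinction is easy to see with a file-level tool. The sketch below uses GNU tar's `--listed-incremental` mode in a throwaway temp directory (all paths are illustrative): the snapshot file records what was already backed up, so the second archive contains only what changed, and restores must replay the full backup first, then each increment in order.

```shell
#!/bin/sh
# Full + incremental cycle with GNU tar, entirely inside a temp dir.
work=$(mktemp -d); data="$work/data"; bk="$work/backups"
mkdir -p "$data" "$bk"

echo "row1" > "$data/t1.csv"
tar --listed-incremental="$bk/snap" -cf "$bk/full.tar" -C "$work" data    # full (level 0)

echo "row2" > "$data/t2.csv"
tar --listed-incremental="$bk/snap" -cf "$bk/incr1.tar" -C "$work" data   # only t2.csv is new

# Restore order matters: full first, then each incremental in sequence.
# /dev/null tells tar to extract in incremental mode without a snapshot.
restore="$work/restore"; mkdir -p "$restore"
tar --listed-incremental=/dev/null -xf "$bk/full.tar"  -C "$restore"
tar --listed-incremental=/dev/null -xf "$bk/incr1.tar" -C "$restore"
```

Skipping one incremental in the restore sequence silently loses that day's changes, which is exactly why differentials (full + latest diff only) are simpler to restore.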

⏱️ Key Recovery Metrics

  • RPO (Recovery Point Objective): Maximum acceptable data loss — "we can tolerate losing up to 1 hour of data."
  • RTO (Recovery Time Objective): Maximum acceptable downtime — "the system must be back online within 4 hours."

An untested backup is not a backup — it's a false sense of security. Run restore drills regularly. Many companies discover their backups are corrupt or incomplete only when they desperately need them.

Conceptual Deep Dive

Think of RPO and RTO as a contract with your business. RPO answers 'how much work can we afford to redo?' RTO answers 'how long can the business survive without the database?'

A hospital's patient records system might have RPO = 0 (zero data loss tolerated) and RTO = 15 minutes (must be back online fast). A blog might have RPO = 24 hours (daily backup is fine) and RTO = 4 hours (a few hours of downtime is acceptable). Your backup strategy should be sized to meet these objectives — and they should be defined by the business, not guessed by engineers.

Architecture & Data Flow

3-2-1 Backup Rule Visualized
Point-in-Time Recovery — How It Works

Implementation Lab

MySQL / PostgreSQL Backup Commands
# ── MySQL: Full backup with mysqldump ──
mysqldump \
  --user=root \
  --password \
  --single-transaction \
  --routines \
  --triggers \
  --all-databases \
  | gzip > /backups/full_$(date +%Y%m%d_%H%M%S).sql.gz
 
# Restore from mysqldump (an --all-databases dump needs no db name)
gunzip < /backups/full_20240115_020000.sql.gz | mysql -u root -p
 
# ── PostgreSQL: pg_dump (logical) ──
pg_dump -U postgres -Fc myapp > /backups/myapp_$(date +%Y%m%d).dump
# -Fc = custom compressed format
 
# Restore
pg_restore -U postgres -d myapp /backups/myapp_20240115.dump
 
# ── PostgreSQL: pg_basebackup (physical — for PITR) ──
pg_basebackup -U replicator -D /backups/base -Ft -z -Xs -P
# This creates a binary backup + WAL stream for point-in-time recovery
 
# ── Automated backup with cron ──
# Edit crontab: crontab -e
# Store credentials in ~/.my.cnf ([client] section); an interactive
# -p password prompt would hang inside a cron job
# Daily at 2am: back up; daily at 3am: prune files older than 30 days
0 2 * * * mysqldump --single-transaction myapp | gzip > /backups/daily_$(date +\%Y\%m\%d).sql.gz
0 3 * * * find /backups -name 'daily_*.sql.gz' -mtime +30 -delete
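Since an untested backup is not a backup, it is worth verifying each dump the moment it is written. A minimal sketch (`verify_dump` is a hypothetical helper): `gunzip -t` checks the gzip stream without extracting anything, and `-s` catches empty files.

```shell
#!/bin/sh
# verify_dump FILE -- basic sanity checks on a gzipped dump.
verify_dump() {
  f="$1"
  [ -s "$f" ] || { echo "FAIL: $f is empty or missing"; return 1; }
  gunzip -t "$f" 2>/dev/null || { echo "FAIL: $f is not valid gzip"; return 1; }
  echo "OK: $f passes basic checks"
}
```

This only proves the file is readable; a real restore drill into a scratch database remains the only true test.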
Point-in-Time Recovery (PITR) — PostgreSQL
# Scenario: a developer ran DELETE FROM orders WHERE... at 14:32
# and forgot the WHERE clause. You need to restore to 14:31.
 
# 1. Stop the database
systemctl stop postgresql
 
# 2. Restore the base backup
rm -rf /var/lib/postgresql/data/*
tar -xzf /backups/base.tar.gz -C /var/lib/postgresql/data/
 
# 3. Create recovery config to replay WAL up to 14:31
#    (PostgreSQL 12+: put these settings in postgresql.conf and
#    create an empty recovery.signal file instead of recovery.conf)
cat > /var/lib/postgresql/data/recovery.conf << EOF
restore_command = 'cp /backups/wal/%f %p'
recovery_target_time = '2024-01-15 14:31:00'
recovery_target_action = 'pause'
EOF
 
# 4. Start PostgreSQL — it replays WAL up to 14:31 and stops
systemctl start postgresql
 
# 5. Verify the data is correct, then resume; when paused at the
#    recovery target this promotes the server and opens it to connections
psql -c "SELECT pg_wal_replay_resume();"

# The accidentally deleted orders are back; at most one minute of
# writes made after 14:31 is lost.
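This scenario only works if WAL archiving was enabled before the incident. A sketch of the prerequisite postgresql.conf settings (the archive directory is a placeholder):

```
# postgresql.conf — must be in place before you ever need PITR
wal_level = replica          # emit enough WAL detail for recovery
archive_mode = on            # requires a server restart
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
```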

Best Practices — Interactive Comparison

Always use --single-transaction for InnoDB — never lock your database during a backup

bash
# ❌ Default mysqldump takes table locks for the duration of the dump.
# The app is effectively read-only while the backup runs; at 100GB
# that can mean 30+ minutes of degraded service, and writers see:
#   ERROR 1205 Lock wait timeout exceeded
mysqldump -u root -p myapp > backup.sql

# ✅ --single-transaction dumps InnoDB from a consistent snapshot
# without blocking writes
mysqldump --single-transaction -u root -p myapp > backup.sql

Pro Tips — Senior Dev Insights

1. Use managed database services (AWS RDS, GCP Cloud SQL, Azure Database) — they handle automated backups, PITR, and cross-region replication with a few clicks.

2. Implement backup encryption — backup files often contain sensitive data. Encrypt with GPG or use your cloud provider's KMS-managed encryption.

3. Set up a read replica in a separate region — it serves as both a disaster recovery target and reduces read load on the primary.

4. pg_dump --schema-only exports just the schema without data — invaluable for quickly spinning up a fresh environment or comparing schema versions.
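The encryption tip can be sketched with openssl as a scripting-friendly stand-in for GPG (same idea: symmetric AES-256). The temp directory and hard-coded passphrase are for illustration only; in production the passphrase would live in a root-only file or a KMS.

```shell
#!/bin/sh
# Encrypt a backup before shipping it offsite, then verify the
# round trip. Everything happens in a throwaway temp dir.
work=$(mktemp -d)
echo "CREATE TABLE users (id INT);" > "$work/backup.sql"

# Encrypt (AES-256, key derived from the passphrase via PBKDF2)
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -pass pass:example-only \
    -in "$work/backup.sql" -out "$work/backup.sql.enc"

# Decrypt and confirm the bytes match the original
openssl enc -d -aes-256-cbc -pbkdf2 \
    -pass pass:example-only \
    -in "$work/backup.sql.enc" -out "$work/restored.sql"
cmp -s "$work/backup.sql" "$work/restored.sql" && echo "round-trip OK"
```

Losing the passphrase means losing the backup, so key material needs the same 3-2-1 discipline as the data itself.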

⚖️ Backup Types — Trade-offs

Feature               Full                 Incremental                    Differential                    PITR
Backup size           Largest              Smallest                       Medium                          Medium + WAL logs
Backup speed          Slowest              Fastest                        Medium                          Continuous
Restore complexity    Simple               High — replay all increments   Medium — full + latest diff     Low — pick a timestamp
Restore speed         Fast                 Slow                           Medium                          Varies by target time
Recovery granularity  Snapshot only        Snapshot only                  Snapshot only                   Any second
Best used for         Weekly full backup   Daily deltas                   Daily deltas, simpler restore   Production standard

Common Developer Pitfalls

  • ⚠️ Never testing backups — discovering they're corrupt or incomplete during a crisis is far too late.
  • ⚠️ Storing backups in the same location as the database — a single failure (fire, ransomware, account compromise) destroys both.
  • ⚠️ Not monitoring backup jobs — a silent failure means you think you have backups but you don't (the GitLab incident).
  • ⚠️ Ignoring restore time — a 1TB backup that takes 6 hours to restore is useless if your RTO is 2 hours.
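The monitoring pitfall has a cheap mitigation: a cron-run freshness check. A minimal sketch (`check_backup_freshness` is a hypothetical helper; wire the ALERT branch to email or your pager):

```shell
#!/bin/sh
# check_backup_freshness DIR MAX_AGE_MINUTES
# Alerts when no backup file newer than the window exists, which
# catches cron jobs that have started failing silently.
check_backup_freshness() {
  dir="$1"; max_minutes="$2"
  recent=$(find "$dir" -name '*.sql.gz' -mmin "-$max_minutes" | head -n 1)
  if [ -n "$recent" ]; then
    echo "OK: recent backup found: $recent"
  else
    echo "ALERT: no backup in the last $max_minutes minutes"
    return 1
  fi
}
```

A 25-hour window (1440 + slack) suits a daily schedule; the point is that the check fails loudly instead of the backup failing quietly.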

Interview Mastery

RPO (Recovery Point Objective) is the maximum amount of data loss acceptable — measured in time. RPO = 1 hour means you can tolerate losing at most 1 hour of data. This drives backup frequency (hourly backups for RPO = 1 hour). RTO (Recovery Time Objective) is the maximum acceptable downtime before the system must be operational again. RTO = 4 hours means the database must be restored within 4 hours of an incident. This drives your recovery process speed and whether you need hot standby replicas.

PITR allows restoring a database to any specific moment in time — not just the last backup. It works by: (1) taking a full base backup periodically, (2) continuously archiving the transaction log (WAL in PostgreSQL, binary log in MySQL) to a separate location. During recovery, you restore the base backup then replay all WAL/binlog records up to your desired timestamp. This means if someone accidentally deleted data at 3:42 PM, you can restore to 3:41 PM — just 1 minute of data loss.

The 3-2-1 rule: keep 3 total copies of data (1 primary + 2 backups), store backups on 2 different types of storage media (e.g., local disk + cloud), and keep 1 copy offsite (geographically separate). This ensures that no single failure — hardware failure, fire, flood, ransomware attack — can destroy all copies simultaneously.

Real-World Blueprint

"In 2017, GitLab accidentally deleted their production database during a maintenance operation. The incident revealed that their backups had been failing silently for months; none of the five backup and replication mechanisms they relied on produced a usable copy. Recovery took roughly 18 hours, restoring from a 6-hour-old staging snapshot, and about 6 hours of production data was permanently lost. GitLab then published a public post-mortem and rebuilt their entire backup/DR process from scratch. The lesson: test your backups regularly, don't just assume they work."

Hands-on Lab Exercises

1. Set up automated daily mysqldump backups with a cron job. Verify the file is created and has non-zero size.

2. Practice a full restore: take a backup, create a test database, restore the backup into it, and verify the data matches.

3. Enable PostgreSQL WAL archiving and perform a point-in-time recovery: make changes, wait 2 minutes, then restore to 1 minute ago.

4. Write a backup monitoring script that alerts you if a backup file hasn't been created in the last 25 hours.

Real-World Practice Scenarios

A developer ran UPDATE users SET role = 'admin' without a WHERE clause — every user is now an admin. You need to fix this. What's your recovery plan and how long will it take?

Your company's RPO is 4 hours and RTO is 2 hours. Design a backup strategy that meets both requirements.

Your database backup files are stored in the same AWS account as the production database. A security breach compromises the account. Why is this a problem, and how should backups be isolated?

You restored from backup and the data is from 6 hours ago, but you need the last 5 hours. What process (if you had set it up in advance) could recover this data?