Database Monitoring & Maintenance

TL;DR — Quick Summary

  • Monitor: query latency, connections, cache hit rate, replication lag, disk space, and lock contention.
  • Slow query log is your most actionable monitoring tool — review it weekly and fix the top slow queries.
  • VACUUM (PostgreSQL) reclaims space from dead tuples — autovacuum handles it, but tune it for write-heavy tables.
  • Set up monitoring before problems occur — troubleshooting without historical data is like driving blind.

Lesson Overview

🔭 You Can't Fix What You Can't See

A database problem that surprises you in production at 3am is a monitoring failure. Proper database monitoring gives you visibility into performance trends, catches issues before they become outages, and builds the evidence base for optimization decisions.

📊 Key Metrics to Monitor

  • Query Performance: Slow queries, average query time, queries per second (QPS).
  • Connections: Active connections, connection pool utilization, waiting connections.
  • Cache Hit Rate: Buffer pool hit rate (how often data is served from memory vs disk). Target: >99%.
  • Replication Lag: How far behind replicas are from the primary. Should stay under 1 second.
  • Resource Utilization: CPU, memory, disk I/O, disk space, network throughput.
  • Table Bloat: Dead rows accumulating from updates/deletes (PostgreSQL's MVCC model).
  • Lock Contention: Blocked queries waiting for locks — indicates transaction design issues.

🔍 The Slow Query Log

Every database has a slow query log — queries taking longer than a threshold are automatically recorded. This is your most direct signal of performance problems. Review it weekly. The top 5 slowest queries almost always have simple index or rewrite fixes that yield enormous improvements.
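In PostgreSQL the threshold is `log_min_duration_statement` (in milliseconds); a minimal sketch for enabling it at runtime (requires superuser privileges):

```sql
-- Log every statement that runs longer than 1 second
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- value in milliseconds
SELECT pg_reload_conf();                             -- apply without restarting
```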

🔧 Routine Maintenance

  • VACUUM (PostgreSQL): Reclaims space from dead tuples (rows updated/deleted but not yet cleaned up). Auto-vacuum handles this automatically but needs tuning for write-heavy tables.
  • ANALYZE: Updates query planner statistics. Run after bulk data loads.
  • OPTIMIZE TABLE (MySQL): Rebuilds the table to reclaim space and defragment.
  • Index Maintenance: Remove unused indexes, rebuild fragmented ones.

Set up monitoring before you have problems — troubleshooting a crisis without historical metrics is like diagnosing a patient who only visits the doctor once they're already in the ER.

Conceptual Deep Dive

Think of database monitoring like the dashboard of a car. You don't stare at the gauges constantly, but you glance at them regularly. If the temperature gauge rises, you know to pull over before the engine blows. If the fuel gauge drops, you stop before you're stranded. The same idea applies: when query response time starts climbing gradually, or disk space approaches 80%, you get ahead of it — you don't wait until the car catches fire. Trends matter more than point-in-time values — a CPU spike at noon is noise; CPU steadily climbing each day for two weeks is a signal.
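The trend-versus-spike idea can be made concrete with a least-squares slope over daily samples; this is a toy sketch (the one-point-per-day threshold is an arbitrary illustration, not a recommended alert level):

```python
def slope(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    xbar = (n - 1) / 2
    ybar = sum(values) / n
    num = sum((i - xbar) * (v - ybar) for i, v in enumerate(values))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

# Two weeks of daily average CPU (%): a steady climb, not a spike
daily_cpu = [40, 41, 43, 44, 46, 47, 49, 51, 52, 54, 55, 57, 58, 60]
print(slope(daily_cpu) > 1)  # rising more than 1 point/day: worth investigating
```

A one-off noon spike barely moves the slope, so this separates the two cases the analogy describes.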

Architecture & Data Flow

[Diagram: Database Monitoring Stack]
[Diagram: Alert Severity Tiers]

Implementation Lab

Identifying Performance Issues
-- ── MySQL: Enable and query the slow log ──
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log queries over 1 second
SHOW VARIABLES LIKE 'slow_query_log_file';  -- find log location
 
-- ── MySQL: Find blocked and blocking transactions (MySQL 5.7's
-- information_schema; in 8.0 use performance_schema.data_lock_waits) ──
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;
 
-- ── PostgreSQL: Find long-running queries ──
SELECT
  pid,
  now() - pg_stat_activity.query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE state != 'idle'
  AND (now() - pg_stat_activity.query_start) > INTERVAL '5 seconds'
ORDER BY duration DESC;
 
-- Cancel a specific query if needed (use a pid from the list above):
SELECT pg_cancel_backend(<pid>);
-- Force-terminate the backend process (for stuck transactions):
SELECT pg_terminate_backend(<pid>);
 
-- ── PostgreSQL: Buffer cache hit rate (target >99%) ──
SELECT
  sum(heap_blks_read) AS disk_reads,
  sum(heap_blks_hit) AS cache_hits,
  ROUND(sum(heap_blks_hit)::numeric /
    NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) AS cache_hit_pct
FROM pg_statio_user_tables;
PostgreSQL Maintenance — VACUUM and ANALYZE
-- ── Check table bloat (dead tuples) ──
SELECT
  relname AS table_name,
  n_live_tup AS live_rows,
  n_dead_tup AS dead_rows,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY dead_rows DESC
LIMIT 10;
 
-- ── Manual VACUUM (usually auto-vacuum handles this) ──
VACUUM ANALYZE users;           -- reclaim dead tuples + update stats
VACUUM FULL users;              -- full rewrite — reclaims more space but LOCKS TABLE!
REINDEX TABLE users;            -- rebuild fragmented indexes
 
-- ── Find unused indexes (candidates for removal) ──
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan AS times_used,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0                 -- never scanned since statistics were last reset
  AND indexname NOT LIKE '%pkey'   -- keep primary keys (unique indexes also enforce constraints)
ORDER BY pg_relation_size(indexrelid) DESC;
Monitoring Dashboard Queries
-- ── MySQL: Check current database size ──
SELECT
  table_schema AS 'Database',
  ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.TABLES
GROUP BY table_schema
ORDER BY SUM(data_length + index_length) DESC;
 
-- ── MySQL: Top 10 largest tables ──
SELECT
  TABLE_NAME,
  TABLE_ROWS AS estimated_rows,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'myapp'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 10;
 
-- ── PostgreSQL: Active connection count by state ──
SELECT state, COUNT(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;
 
-- ── PostgreSQL: Replication lag on replicas ──
SELECT
  client_addr,
  state,
  sent_lsn,
  replay_lsn,
  (sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;

Best Practices — Interactive Comparison

Use pg_stat_statements to find your most expensive queries — far more actionable than a raw slow log

# Raw slow query log — no aggregation.
# The same query appears thousands of times as separate entries,
# so its total impact across all executions is invisible.
tail -f /var/log/mysql/mysql-slow.log

# Time: 2024-01-15T14:32:01
# Query_time: 1.2  Lock_time: 0.001  Rows: 5000
# SELECT * FROM orders WHERE user_id = 42
# (this entry appears 50,000 times — you'd never know)
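The aggregated alternative in PostgreSQL is pg_stat_statements (requires CREATE EXTENSION pg_stat_statements and an entry in shared_preload_libraries; the column is total_exec_time in PostgreSQL 13+, total_time in earlier versions):

```sql
-- Top 5 queries by total time across ALL executions
SELECT
  calls,
  ROUND(total_exec_time::numeric, 1) AS total_ms,
  ROUND(mean_exec_time::numeric, 2)  AS mean_ms,
  LEFT(query, 60)                    AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

A 1.2-second report query run once a day ranks far below a 5 ms query run 50,000 times an hour, which is exactly the insight the raw log hides.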

Pro Tips — Senior Dev Insights

1. Use the pg_stat_statements extension (PostgreSQL) to aggregate and rank queries by total execution time across all runs — far more useful than the slow log for finding your most expensive queries.

2. Set up Prometheus + postgres_exporter + Grafana for a free, production-grade database monitoring stack with dashboards and alerting.
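A minimal alert rule for that stack might look like the sketch below; the metric name pg_replication_lag is an assumption, so check what your postgres_exporter version actually exports:

```yaml
groups:
  - name: postgres
    rules:
      - alert: PostgresReplicationLagHigh
        expr: pg_replication_lag > 10   # seconds behind primary (assumed metric name)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica more than 10s behind primary for 5 minutes"
```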

3. PgHero (open source) is a web UI for PostgreSQL health — built with Rails but usable with any PostgreSQL database, it surfaces long-running queries, bloated tables, unused indexes, and more with minimal configuration.

4. Configure connection limits per user or per database in PostgreSQL (ALTER USER app CONNECTION LIMIT 50, or ALTER DATABASE myapp CONNECTION LIMIT 200) — this prevents a runaway connection leak in one application from exhausting the entire database.

⚖️ Key Metrics — Healthy vs Concerning Thresholds

Metric                  | Healthy       | Warning        | Critical
Buffer / cache hit rate | > 99%         | 95 – 99%       | < 95%
Replication lag         | < 1 second    | 1 – 10 seconds | > 60 seconds
Disk space used         | < 70%         | 70 – 85%       | > 90%
Active connections      | < 50% of max  | 50 – 80% of max| > 80% of max
Dead tuple ratio (PG)   | < 5%          | 5 – 20%        | > 20%
P99 query latency       | < 50 ms       | 50 – 500 ms    | > 500 ms
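Those thresholds map directly onto alerting code; a tiny illustrative helper for lower-is-better metrics (invert higher-is-better ones such as cache hit rate first):

```python
def classify(value, warn_at, crit_at):
    """Map a lower-is-better metric to a severity tier."""
    if value < warn_at:
        return "healthy"
    if value > crit_at:
        return "critical"
    return "warning"

print(classify(0.5, 1, 60))      # replication lag 0.5 s        -> healthy
print(classify(92, 70, 90))      # disk 92% used                -> critical
print(classify(100 - 97, 1, 5))  # cache hit 97%, inverted      -> warning
```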

Common Developer Pitfalls

  • Not setting up monitoring until after the first production outage — reactive monitoring is always too late.
  • Ignoring the slow query log — it's the single most valuable performance signal, and most developers never look at it.
  • Disabling autovacuum to "improve write performance" — the resulting table bloat eventually causes far worse performance problems.
  • Only monitoring that the database is "up" — availability monitoring tells you the database is running, not that it's running well.

Interview Mastery

Q: What key database metrics would you monitor in production?

Key database metrics to monitor: (1) Query latency — P50, P95, P99 response times, and slow query rate. (2) Connection pool — active connections vs pool size, waiting queue length. (3) Buffer/cache hit rate — % of reads served from memory vs disk (target >99%). (4) Replication lag — seconds behind primary. (5) Resource utilization — CPU, memory, disk I/O, disk space. (6) Error rates — deadlocks, connection timeouts, query errors. (7) Table bloat (PostgreSQL) — dead tuples ratio.

Q: What is VACUUM in PostgreSQL, and why is it needed?

PostgreSQL uses MVCC (Multi-Version Concurrency Control) — when you UPDATE or DELETE a row, the old version isn't immediately deleted. Instead, it's marked as a 'dead tuple' and a new version is created. Over time, dead tuples accumulate, wasting disk space and slowing scans. VACUUM reclaims this space by removing dead tuples and making that space available for reuse. VACUUM ANALYZE also updates the query planner's statistics. PostgreSQL's autovacuum process handles this automatically, but high-traffic tables may need autovacuum settings tuned more aggressively.
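The lifecycle is easy to observe on any table (`users` and `last_seen` are illustrative; the stats counters update asynchronously, so allow a moment before checking):

```sql
UPDATE users SET last_seen = now();   -- each updated row leaves its old version as a dead tuple

SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';              -- dead count now roughly equals the rows updated

VACUUM users;                         -- removes dead tuples; space becomes reusable, but the file doesn't shrink
```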

Q: Your database suddenly got slower — how do you diagnose it?

Systematic diagnosis: (1) Check the slow query log — which specific queries slowed down? (2) Run EXPLAIN ANALYZE on slow queries — are they using indexes or doing full scans? Did a statistics update cause a bad query plan? (3) Check lock contention — are queries waiting for locks from long-running transactions? (4) Check buffer hit rate — if it dropped, you might have a new query pattern loading too much data into memory. (5) Check resource utilization — disk I/O saturation, CPU spike, memory pressure? (6) Check for table bloat — did a bulk operation create millions of dead tuples?
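Step 2 in practice, reusing the slow query from the log excerpt earlier in this lesson:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42;
-- "Seq Scan on orders"    -> no usable index; consider CREATE INDEX ON orders (user_id)
-- "Index Scan using ..."  -> index exists; look at locks, I/O saturation, or a changed plan instead
```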

Real-World Blueprint

"Cloudflare monitors millions of database queries per second across their globally distributed infrastructure. Their SRE team uses dashboards in Grafana showing P50/P95/P99 query latencies, connection pool saturation, and replication lag — visible to every engineer. When a query's P99 latency spikes from 5ms to 500ms, an alert fires within 2 minutes, a Slack message includes the slow query from their query analyzer, and the on-call engineer knows immediately what to look at. No surprise 3am pages — the monitoring tells the story before it becomes an outage."

Hands-on Lab Exercises

1. Enable the slow query log in MySQL/PostgreSQL with a 1-second threshold. Run several intentionally slow queries and verify they appear in the log.

2. Set up a Grafana dashboard connecting to PostgreSQL's pg_stat_* views, showing query count, connection count, and cache hit rate.

3. Run VACUUM ANALYZE on a table and check pg_stat_user_tables before and after to see the dead tuple count change.

4. Find unused indexes on a sample database using pg_stat_user_indexes and practice evaluating whether they're safe to remove.

Real-World Practice Scenarios

At 2pm, your API response times jump from 50ms to 3 seconds. Your monitoring shows database CPU at 95%. Walk through your investigation process step by step.

A table that was 5GB six months ago is now 50GB, but the actual row count only doubled. What's likely happening and how do you fix it?

Your staging database has 50 connections and runs fine. Production has 1,000 connections and is degraded. What configuration change resolves this?

You want to remove what appears to be an unused index from a production table. What verification steps do you take before dropping it?