Database Monitoring & Maintenance
TL;DR — Quick Summary
- Monitor: query latency, connections, cache hit rate, replication lag, disk space, and lock contention.
- Slow query log is your most actionable monitoring tool — review it weekly and fix the top slow queries.
- VACUUM (PostgreSQL) reclaims space from dead tuples — autovacuum handles it, but tune it for write-heavy tables.
- Set up monitoring before problems occur — troubleshooting without historical data is like driving blind.
Lesson Overview
🔭 You Can't Fix What You Can't See
A database problem that surprises you in production at 3am is a monitoring failure. Proper database monitoring gives you visibility into performance trends, catches issues before they become outages, and builds the evidence base for optimization decisions.
📊 Key Metrics to Monitor
- Query Performance: Slow queries, average query time, queries per second (QPS).
- Connections: Active connections, connection pool utilization, waiting connections.
- Cache Hit Rate: Buffer pool hit rate (how often data is served from memory vs disk). Target: >99%.
- Replication Lag: How far behind replicas are from the primary. Should stay under 1 second.
- Resource Utilization: CPU, memory, disk I/O, disk space, network throughput.
- Table Bloat: Dead rows accumulating from updates/deletes (PostgreSQL's MVCC model).
- Lock Contention: Blocked queries waiting for locks — indicates transaction design issues.
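As a quick spot check on disk usage, one of the metrics above, standard PostgreSQL catalog functions can report per-database size (a sketch; run it as a user with monitoring privileges):

```sql
-- Database size per database. Watch the trend, not the absolute number:
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```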
🔍 The Slow Query Log
Most databases can log slow queries — statements taking longer than a configurable threshold are automatically recorded. This is your most direct signal of performance problems. Review it weekly. The top 5 slowest queries almost always have simple index or rewrite fixes that yield enormous improvements.
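PostgreSQL has no separate slow-log file; instead, slow statements are written to the regular server log once a threshold is set. A minimal sketch, assuming superuser access:

```sql
-- Log any statement slower than 1 second (the value is in milliseconds):
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf(); -- apply without a restart
```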
🔧 Routine Maintenance
- VACUUM (PostgreSQL): Reclaims space from dead tuples (rows updated/deleted but not yet cleaned up). Auto-vacuum handles this automatically but needs tuning for write-heavy tables.
- ANALYZE: Updates query planner statistics. Run after bulk data loads.
- OPTIMIZE TABLE (MySQL): Rebuilds the table to reclaim space and defragment.
- Index Maintenance: Remove unused indexes, rebuild fragmented ones.
Set up monitoring before you have problems — troubleshooting a crisis without historical metrics is like diagnosing a patient who only visits the doctor once they're already in the ER.
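The per-table autovacuum tuning mentioned above can be sketched as follows; the table name (orders) is hypothetical, and the right values depend on your write volume:

```sql
-- The default autovacuum_vacuum_scale_factor is 0.2 (vacuum after ~20% of
-- rows are dead), which is far too lazy for a large, write-heavy table.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02, -- vacuum after ~2% dead rows
    autovacuum_vacuum_threshold = 1000
);
```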
Conceptual Deep Dive
Think of database monitoring like the dashboard of a car. You don't stare at the gauges constantly, but you glance at them regularly. If the temperature gauge rises, you know to pull over before the engine blows. If the fuel gauge drops, you stop before you're stranded. The same idea applies: when query response time starts climbing gradually, or disk space approaches 80%, you get ahead of it — you don't wait until the car catches fire. Trends matter more than point-in-time values — a CPU spike at noon is noise; CPU steadily climbing each day for two weeks is a signal.
Implementation Lab
-- ── MySQL: Enable and query the slow log ──
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; -- log queries over 1 second
SHOW VARIABLES LIKE 'slow_query_log_file'; -- find log location
-- ── MySQL: Find blocked transactions and who is blocking them ──
-- (MySQL 5.7; on 8.0+ query sys.innodb_lock_waits or performance_schema.data_lock_waits)
SELECT
r.trx_id AS waiting_trx,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id;
-- ── PostgreSQL: Find long-running queries ──
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
query,
state
FROM pg_stat_activity
WHERE state != 'idle'
AND (now() - pg_stat_activity.query_start) > INTERVAL '5 seconds'
ORDER BY duration DESC;
-- Cancel a specific query if needed (use the pid returned by the query above):
SELECT pg_cancel_backend(12345); -- replace 12345 with the actual pid
-- Force-terminate the whole backend (for stuck transactions):
SELECT pg_terminate_backend(12345);
-- ── PostgreSQL: Buffer cache hit rate (target >99%) ──
SELECT
sum(heap_blks_read) AS disk_reads,
sum(heap_blks_hit) AS cache_hits,
ROUND(sum(heap_blks_hit)::numeric /
NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) AS cache_hit_pct
FROM pg_statio_user_tables;
-- ── Check table bloat (dead tuples) ──
SELECT
relname AS table_name,
n_live_tup AS live_rows,
n_dead_tup AS dead_rows,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY dead_rows DESC
LIMIT 10;
-- ── Manual VACUUM (usually auto-vacuum handles this) ──
VACUUM ANALYZE users; -- reclaim dead tuples + update stats
VACUUM FULL users; -- full rewrite — reclaims more space but LOCKS TABLE!
REINDEX TABLE users; -- rebuild fragmented indexes
-- ── Find unused indexes (candidates for removal) ──
SELECT
schemaname,
relname AS table_name,
indexrelname AS index_name,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0 -- never used since stats were last reset!
AND indexrelname NOT LIKE '%pkey' -- don't drop primary keys!
ORDER BY pg_relation_size(indexrelid) DESC;
-- ── MySQL: Check current database size ──
SELECT
table_schema AS 'Database',
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.TABLES
GROUP BY table_schema
ORDER BY SUM(data_length + index_length) DESC;
-- ── MySQL: Top 10 largest tables ──
SELECT
TABLE_NAME,
TABLE_ROWS AS estimated_rows,
ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) AS 'Size (MB)'
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'myapp'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC
LIMIT 10;
-- ── PostgreSQL: Active connection count by state ──
SELECT state, COUNT(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;
-- ── PostgreSQL: Replication lag on replicas ──
SELECT
client_addr,
state,
sent_lsn,
replay_lsn,
(sent_lsn - replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Best Practices — Interactive Comparison
Use pg_stat_statements to find your most expensive queries — far more actionable than a raw slow log
# Raw slow query log — no aggregation
# Same query appears thousands of times separately
# Impossible to see total impact across all executions
tail -f /var/log/postgresql/slow.log
# Time: 2024-01-15T14:32:01
# Query_time: 1.2 Lock_time: 0.001 Rows: 5000
# SELECT * FROM orders WHERE user_id = 42
# (this line appears 50,000 times — you'd never know)
-- Ranks queries by TOTAL time across all executions
-- Fix these 10 → biggest performance wins
SELECT
LEFT(query, 80) AS query_snippet,
calls AS total_calls,
ROUND(total_exec_time::numeric, 2) AS total_ms,
ROUND(mean_exec_time::numeric, 2) AS avg_ms,
ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Pro Tips — Senior Dev Insights
Use pg_stat_statements extension (PostgreSQL) to aggregate and rank queries by total execution time across all runs — far more useful than the slow log for finding your most expensive queries.
Set up Prometheus + postgres_exporter + Grafana for a free, production-grade database monitoring stack with beautiful dashboards and alerting.
PgHero (open source) is a fantastic web UI for PostgreSQL health — built on Rails but deployable alongside any stack, it shows long-running queries, bloated tables, unused indexes, and more with minimal configuration.
Configure connection limit per user/database in PostgreSQL (ALTER USER app CONNECTION LIMIT 50) — prevents runaway connection leaks from exhausting the entire database.
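Enabling the pg_stat_statements extension mentioned above takes two steps; a sketch, assuming you can edit postgresql.conf and restart the server:

```sql
-- 1) In postgresql.conf (requires a server restart):
--      shared_preload_libraries = 'pg_stat_statements'
-- 2) Then, in each database you want to track:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Optional: reset counters after a deploy for a clean baseline
SELECT pg_stat_statements_reset();
```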
⚖️ Key Metrics — Healthy vs Concerning Thresholds
| Feature | Healthy | Warning | Critical |
|---|---|---|---|
| Buffer / Cache hit rate | > 99% | 95 – 99% | < 95% |
| Replication lag | < 1 second | 1 – 10 seconds | > 60 seconds |
| Disk space used | < 70% | 70 – 85% | > 90% |
| Active connections | < 50% of max | 50 – 80% of max | > 80% of max |
| Dead tuple ratio (PG) | < 5% | 5 – 20% | > 20% |
| P99 query latency | < 50 ms | 50 – 500 ms | > 500 ms |
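The "active connections vs max" row can be checked directly in PostgreSQL; a sketch using current_setting():

```sql
-- Connection utilization as a percentage of max_connections:
SELECT
    count(*) AS active_connections,
    current_setting('max_connections')::int AS max_connections,
    round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_of_max
FROM pg_stat_activity;
```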
Common Developer Pitfalls
Not setting up monitoring until after the first production outage — reactive monitoring is always too late.
Ignoring the slow query log — it's the single most valuable performance signal and most developers never look at it.
Disabling autovacuum to 'improve write performance' — this causes table bloat that eventually causes far worse performance problems.
Only monitoring that the database is 'up' — availability monitoring tells you the database is running, not that it's running well.
Interview Mastery
Q: What metrics would you monitor to ensure database health?
Key database metrics to monitor: (1) Query latency — P50, P95, P99 response times, and slow query rate. (2) Connection pool — active connections vs pool size, waiting queue length. (3) Buffer/cache hit rate — % of reads served from memory vs disk (target >99%). (4) Replication lag — seconds behind primary. (5) Resource utilization — CPU, memory, disk I/O, disk space. (6) Error rates — deadlocks, connection timeouts, query errors. (7) Table bloat (PostgreSQL) — dead tuples ratio.
Q: What is VACUUM in PostgreSQL and why is it necessary?
PostgreSQL uses MVCC (Multi-Version Concurrency Control) — when you UPDATE or DELETE a row, the old version isn't immediately deleted. Instead, it's marked as a 'dead tuple' and a new version is created. Over time, dead tuples accumulate, wasting disk space and slowing scans. VACUUM reclaims this space by removing dead tuples and making that space available for reuse. VACUUM ANALYZE also updates the query planner's statistics. PostgreSQL's autovacuum process handles this automatically, but high-traffic tables may need autovacuum settings tuned more aggressively.
Q: How would you diagnose a sudden increase in query response times?
Systematic diagnosis: (1) Check the slow query log — which specific queries slowed down? (2) Run EXPLAIN ANALYZE on slow queries — are they using indexes or doing full scans? Did a statistics update cause a bad query plan? (3) Check lock contention — are queries waiting for locks from long-running transactions? (4) Check buffer hit rate — if it dropped, you might have a new query pattern loading too much data into memory. (5) Check resource utilization — disk I/O saturation, CPU spike, memory pressure? (6) Check for table bloat — did a bulk operation create millions of dead tuples?
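The lock-contention check in this diagnosis has a direct PostgreSQL counterpart to the MySQL query in the lab; a sketch using pg_blocking_pids(), available since PostgreSQL 9.6:

```sql
-- Which backends are blocked, and by which pids?
SELECT
    pid,
    pg_blocking_pids(pid) AS blocked_by,
    now() - query_start AS waiting_for,
    LEFT(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```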
Real-World Blueprint
"Cloudflare monitors millions of database queries per second across their globally distributed infrastructure. Their SRE team uses dashboards in Grafana showing P50/P95/P99 query latencies, connection pool saturation, and replication lag — visible to every engineer. When a query's P99 latency spikes from 5ms to 500ms, an alert fires within 2 minutes, a Slack message includes the slow query from their query analyzer, and the on-call engineer knows immediately what to look at. No surprise 3am pages — the monitoring tells the story before it becomes an outage."
Hands-on Lab Exercises
Enable the slow query log in MySQL/PostgreSQL with a 1-second threshold. Run several intentionally slow queries and verify they appear in the log.
Set up a Grafana dashboard connecting to PostgreSQL's pg_stat_* views, showing query count, connection count, and cache hit rate.
Run VACUUM ANALYZE on a table and check pg_stat_user_tables before and after to see the dead tuple count change.
Find unused indexes on a sample database using pg_stat_user_indexes and practice evaluating whether they're safe to remove.
Real-World Practice Scenarios
At 2pm, your API response times jump from 50ms to 3 seconds. Your monitoring shows database CPU at 95%. Walk through your investigation process step by step.
A table that was 5GB six months ago is now 50GB, but the actual row count only doubled. What's likely happening and how do you fix it?
Your staging database has 50 connections and runs fine. Production has 1,000 connections and is degraded. What configuration change resolves this?
You want to remove what appears to be an unused index from a production table. What verification steps do you take before dropping it?
© 2026 DevHub Engineering • All Proprietary Rights Reserved
Generated on March 7, 2026 • Ver: 4.0.2
Document Class: Master Education
Confidential Information • Licensed to User