Databases

Indexes & Query Performance

4 min read
Focus: DATABASES

TL;DR — Quick Summary

  • Indexes speed up reads (O(log n) vs O(n)) at the cost of slower writes and extra storage.
  • Use EXPLAIN to see if your query uses an index — look for type=ALL (bad) vs type=ref (good).
  • Composite indexes follow the leftmost prefix rule — design them with your most-filtered column first.
  • Never apply functions to indexed columns in WHERE — it prevents index usage.

Lesson Overview

⚡ Indexes: The Database's Speed Secret

Without an index, a database scans every single row to find matching records — this is called a full table scan. On a table with 10 million rows, that means checking 10 million rows for every query. An index is a separate data structure (usually a B-tree) that allows the database to jump directly to matching rows — like an index in the back of a book.

📊 Types of Indexes

  • Single-column index: Index on one column. Great for frequent WHERE/JOIN conditions on that column.
  • Composite index: Index on multiple columns. Follows the leftmost prefix rule — an index on (A, B, C) can be used for queries on A, or A+B, or A+B+C, but NOT on B alone or C alone.
  • Unique index: Enforces uniqueness + provides lookup speed. Created automatically for PRIMARY KEY and UNIQUE constraints.
  • Full-text index: For searching text content efficiently (MySQL FULLTEXT, PostgreSQL tsvector).
  • Partial index (PostgreSQL): Index only a subset of rows — e.g., WHERE status = 'active'. Smaller and faster for filtered queries.

⚠️ The Index Trade-off

  • Faster reads — queries that use the index are dramatically faster
  • Slower writes — every INSERT, UPDATE, DELETE must also update all indexes on the table
  • More storage — indexes take disk space, sometimes as much as the table itself

Rule of thumb: index columns you frequently query against, not columns you frequently update. A write-heavy table with 20 indexes will be painfully slow on inserts.

🔍 Understanding EXPLAIN

EXPLAIN shows you the database's query execution plan — how it intends to run your query. Key things to look for:

  • type: ALL — full table scan. Usually bad on large tables.
  • type: index, ref, eq_ref, const — index is being used. Good.
  • rows: estimated number of rows examined — lower is better.
  • Extra: Using filesort / Using temporary — indicates expensive operations.

Conceptual Deep Dive

A B-tree index works like a sorted phone book. If you want to find everyone named 'Smith', you don't start at page 1 — you go directly to the 'S' section. The B-tree structure allows the database to do this in O(log n) time instead of O(n) for a full scan. The leftmost prefix rule for composite indexes is critical: an index on (last_name, first_name, email) works for queries filtering on last_name, or last_name + first_name, but NOT for queries filtering only on first_name or only on email. Design composite indexes with your most-selective and most-queried column first.

Architecture & Data Flow

B-Tree Index Structure
Rendering diagram…
Composite Index — Leftmost Prefix Rule
Rendering diagram…

Implementation Lab

Creating and Analyzing Indexes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Single column index on a frequently searched column
CREATE INDEX idx_users_email ON users(email);
 
-- Composite index for common query patterns
-- Good for: WHERE user_id = ? AND created_at > ?
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);
 
-- Partial index (PostgreSQL) — only index active users
CREATE INDEX idx_active_users ON users(email)
WHERE status = 'active';
 
-- See all indexes on a table (MySQL)
SHOW INDEX FROM users;
 
-- See all indexes (PostgreSQL)
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'users';
 
-- Drop an index that's no longer useful
DROP INDEX idx_users_email ON users;
Using EXPLAIN to Diagnose Query Performance
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- Check if your query is using an index
EXPLAIN SELECT * FROM users WHERE email = 'alice@example.com';
-- BEFORE index: type=ALL, rows=500000 ← full scan!
-- AFTER index:  type=ref, rows=1     ← direct lookup!
 
-- PostgreSQL: EXPLAIN ANALYZE runs the query and shows real stats
EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id
ORDER BY order_count DESC
LIMIT 10;
 
-- Look for:
-- Seq Scan = full table scan (bad on large tables)
-- Index Scan = using index (good)
-- Bitmap Heap Scan = using index on multiple values (good)
-- actual rows= vs rows= — large discrepancy = bad planner estimates
-- Execution Time: X ms (shown with ANALYZE)
Practical Performance Patterns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- BAD: Function in WHERE defeats the index
SELECT * FROM users WHERE YEAR(created_at) = 2024;
-- The database can't use an index on YEAR(created_at)
 
-- GOOD: Range condition allows index use
SELECT * FROM users
WHERE created_at >= '2024-01-01'01-01'
  AND created_at < '2025-01-01'01-01';
 
-- BAD: Leading wildcard defeats the index
SELECT * FROM users WHERE name LIKE '%Johnson';
 
-- GOOD: Trailing wildcard can use index
SELECT * FROM users WHERE name LIKE 'Johnson%';
 
-- BAD: Implicit type conversion defeats the index
-- If phone is stored as VARCHAR:
SELECT * FROM users WHERE phone = 5551234567;  -- numeric
 
-- GOOD: Match the column's data type
SELECT * FROM users WHERE phone = '555-123-4567'123-4567';

Best Practices — Interactive Comparison

Never wrap indexed columns in functions — it prevents the index from being used

sql
1
2
3
4
5
6
7
8
9
-- YEAR() wraps the column — index on created_at
-- cannot be used. Causes full table scan on 10M rows!
SELECT * FROM orders
WHERE YEAR(created_at) = 2024;
 
-- Same problem with other functions:
WHERE LOWER(email) = 'alice@example.com'
WHERE DATE(created_at) = '2024-01-15'01-15'
WHERE LENGTH(name) > 10

Pro Tips — Senior Dev Insights

1

A covering index includes all columns needed by a query — the database can answer the entire query from the index alone without touching the main table. Dramatically faster for read-heavy queries.

2

In PostgreSQL, use pg_stat_user_indexes to find indexes that have never been used (idx_scan = 0) — safe candidates for deletion.

3

For UUID primary keys, consider random-vs-sequential UUID generation — random UUIDs cause B-tree fragmentation (lots of page splits). Use UUIDv7 or COMB UUIDs that are time-ordered.

4

VACUUM ANALYZE (PostgreSQL) or ANALYZE TABLE (MySQL) updates query planner statistics — run after large data loads to ensure the optimizer makes good index decisions.

⚖️ Index Types — When to Use Each

FeatureSingle ColumnCompositeUniqueFull-TextPartial (PG)
Best forSingle WHERE filterMulti-column queriesUniqueness + speedText searchFiltered subset
Enforces uniqueness
Supports range scans
Storage overheadLowMediumLowHighLow
Write overheadLowMediumLowHighLow

Common Developer Pitfalls

!

Adding indexes on every column 'just in case' — over-indexing destroys write performance and wastes storage.

!
Using functions in WHERE conditions on indexed columns (WHERE LOWER(email) = ...) — the index is bypassed.
!
Using leading wildcards (LIKE '%value') — the database can't use a B-tree index for prefix-unknown searches.
!

Indexing a boolean or low-cardinality column — an index that can only eliminate 50% of rows provides almost no benefit.

Interview Mastery

A clustered index determines the physical order of rows on disk — the table's data IS the index. There can only be one per table (usually the primary key in MySQL InnoDB). Reading the clustered index is the fastest possible access method. A non-clustered index is a separate data structure that stores a copy of the indexed columns plus a pointer to the actual row. Every additional index you create is non-clustered. In MySQL/InnoDB, all non-clustered index lookups first find the primary key, then do a second lookup on the clustered index.

A composite index on columns (A, B, C) can be used by queries that filter on: A alone, A+B together, or A+B+C together. The database can NOT use this index for queries filtering only on B, only on C, or B+C without A. The index is ordered by A first, then B within each A, then C within each A+B — so skipping the leftmost column makes the index unsorted and unusable. Design your composite index with the column you filter on most often (and most selectively) as the first column.

Avoid adding indexes when: (1) the table is small (< a few thousand rows) — full scans are fast enough and indexes add overhead, (2) the column has very low cardinality (e.g., a boolean or a 'status' with only 2-3 values) — the index won't eliminate enough rows to be worth it, (3) the table is write-heavy and read-light — every write must update all indexes, (4) you already have too many indexes — a table with 20 indexes will have painfully slow INSERTs. Also watch out for unused indexes — they cost write overhead with no read benefit.

Real-World Blueprint

Instagram's posts table has billions of rows. The feed query — 'show me posts from accounts I follow, ordered by time' — is fundamentally a JOIN between posts and follows, filtered by user and sorted by timestamp. Without indexes on (user_id, created_at), every feed load would scan billions of rows. With a composite index, it's a lightning-fast range scan. Instagram engineers obsess over EXPLAIN output and index usage — it's the difference between a 3ms query and a 30-second timeout.

Hands-on Lab Exercises

1
Run EXPLAIN SELECT on a query before and after adding an index. Compare the 'rows' and 'type' values.
2

Create a composite index and test which query patterns use it (using EXPLAIN) and which ones don't — verify the leftmost prefix rule.

3

Run a slow query on 1 million rows, measure execution time, add an appropriate index, and measure again.

4

Find all unused indexes in a database schema and practice identifying which ones could be safely removed.

Real-World Practice Scenarios

A query SELECT * FROM orders WHERE MONTH(created_at) = 6 is slow even though created_at is indexed. Why and how do you fix it?
Your team added 15 indexes to a users table to speed up reads. Now new user registration takes 3 seconds. What's happening and how do you balance it?
A query WHERE email LIKE '%@gmail.com' is doing a full table scan even though email is indexed. Why, and what's an alternative approach?

The DBA asks you to 'add a covering index' for a specific query. What does that mean and how do you implement it?