Databases

Joins & Complex Queries

3 min read
Focus: DATABASES
⚑

TL;DR β€” Quick Summary

  • INNER JOIN = intersection (both sides must match). LEFT JOIN = all left rows + matched right rows.
  • LEFT JOIN + WHERE right.id IS NULL = find rows in left table with NO match in right table.
  • CTEs (WITH name AS) make complex multi-step queries readable and maintainable.
  • Window functions (OVER()) calculate across rows without collapsing them like GROUP BY does.

Lesson Overview

πŸ”— JOINs: Bringing Tables Together

Real-world data lives across multiple tables. JOINs let you combine rows from two or more tables based on a related column β€” usually a foreign key relationship. A well-written JOIN is one of the most powerful tools in a developer's SQL toolkit.

πŸ—‚οΈ Types of JOINs

  • INNER JOIN: Returns only rows where the join condition matches in both tables. Most common. Think: intersection.
  • LEFT JOIN (LEFT OUTER): Returns ALL rows from the left table, plus matching rows from the right. Non-matching right-side values are NULL. Use when you need all records from one table regardless of matches.
  • RIGHT JOIN: Mirror of LEFT JOIN β€” all rows from right table. Rarely used (just swap the table order and use LEFT JOIN).
  • FULL OUTER JOIN: All rows from both tables. NULLs where no match exists. PostgreSQL supports this natively; MySQL requires a UNION workaround.
  • CROSS JOIN: Every row from the left table paired with every row from the right. Returns MΓ—N rows. Rarely intentional β€” an accidental CROSS JOIN can return billions of rows.
  • SELF JOIN: A table joined to itself. Used for hierarchical data like org charts or finding related records in the same table.

πŸ“¦ Subqueries and CTEs

Sometimes one query isn't enough. You can nest queries inside each other:

  • Subquery (inline): A query inside another query's WHERE, FROM, or SELECT clause.
  • CTE (Common Table Expression): Named temporary result sets using WITH. Far more readable than nested subqueries.
  • Window Functions: Perform calculations across rows related to the current row β€” like ROW_NUMBER(), RANK(), LAG(), SUM() OVER.

The Venn diagram mental model for JOINs: INNER JOIN = intersection, LEFT JOIN = full left circle, FULL OUTER = full union of both circles.

Conceptual Deep Dive

Imagine you have a users table and an orders table. An INNER JOIN gives you only users who have placed orders. A LEFT JOIN gives you all users β€” those with orders will have order data, and users with no orders will have NULL in the order columns. When would you choose LEFT over INNER? Use LEFT JOIN when the absence of a match is meaningful information β€” like 'show me all users AND their order count (even users with zero orders).' Use INNER JOIN when you only care about records that exist in both tables.

Architecture & Data Flow

Visual Guide to SQL JOIN Types
Rendering diagram…

Implementation Lab

All Four Main JOINs Compared
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- INNER JOIN: only users WHO HAVE orders
SELECT u.name, o.total
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
 
-- LEFT JOIN: ALL users, with order info if it exists
-- Users with no orders show NULL for order columns
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
 
-- Find users who have NEVER placed an order
SELECT u.name
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;  -- the NULL trick: no matching row = never ordered
 
-- SELF JOIN: find employees and their manager's name
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;
Aggregations with JOINs
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Count orders per user, including users with zero orders
SELECT
  u.name,
  u.email,
  COUNT(o.id) AS order_count,
  COALESCE(SUM(o.total), 0) AS lifetime_value
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name, u.email
ORDER BY lifetime_value DESC;
 
-- Posts with author name AND tag list
SELECT
  p.title,
  u.name AS author,
  GROUP_CONCAT(t.name ORDER BY t.name SEPARATOR ', ') AS tags
FROM posts p
INNER JOIN users u ON p.user_id = u.id
LEFT JOIN post_tags pt ON p.id = pt.post_id
LEFT JOIN tags t ON pt.tag_id = t.id
GROUP BY p.id, p.title, u.name;
CTEs and Window Functions
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- CTE: readable multi-step query
WITH monthly_revenue AS (
  SELECT
    DATE_FORMAT(created_at, '%Y-%m') AS month,
    SUM(total) AS revenue
  FROM orders
  WHERE status = 'completed'
  GROUP BY month
),
revenue_with_growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue
  FROM monthly_revenue
)
SELECT
  month,
  revenue,
  ROUND((revenue - prev_month_revenue) / prev_month_revenue * 100, 2) AS growth_pct
FROM revenue_with_growth
ORDER BY month;
 
-- Window function: rank users by order count
SELECT
  name,
  order_count,
  RANK() OVER (ORDER BY order_count DESC) AS rank,
  NTILE(4) OVER (ORDER BY order_count DESC) AS quartile
FROM user_order_counts;

Best Practices β€” Interactive Comparison

Use LEFT JOIN to include records with no related data β€” INNER JOIN silently drops them

sql
1
2
3
4
5
6
7
8
9
-- INNER JOIN silently excludes users
-- who have never placed an order
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.id;
 
-- Users with 0 orders are completely invisible
-- Your report is missing data and you won't know it

Refactor nested subqueries into readable CTEs

sql
1
2
3
4
5
6
7
8
9
10
11
12
-- 4-level nesting β€” impossible to debug or modify
SELECT * FROM (
  SELECT user_id, SUM(total) AS rev FROM (
    SELECT * FROM orders
    WHERE status = 'completed'
  ) o GROUP BY user_id
) r WHERE r.rev > (
  SELECT AVG(total_rev) FROM (
    SELECT user_id, SUM(total) AS total_rev
    FROM orders GROUP BY user_id
  ) x
);

Pro Tips β€” Senior Dev Insights

1

Run EXPLAIN on JOIN queries to see if the database is doing nested loops (slow) vs hash joins (fast) β€” the difference can be orders of magnitude.

2

In PostgreSQL, LATERAL JOIN is extremely powerful β€” it allows the right side of the join to reference columns from the left side, like a correlated subquery but joinable.

3

For many-to-many JOINs returning duplicate rows, consider EXISTS or IN instead of a JOIN when you only need to check existence, not retrieve data.

4

EXPLAIN ANALYZE (PostgreSQL) runs the query and shows actual vs estimated row counts β€” invaluable for identifying bad query planner estimates.

βš–οΈ JOIN Types at a Glance

FeatureINNERLEFTRIGHTFULL OUTERCROSS
Returns unmatched left rows
Returns unmatched right rows
Can produce NULLs
Common in production
Risk of row explosion

Common Developer Pitfalls

!

Using INNER JOIN when LEFT JOIN is needed β€” losing users/products with no related records silently.

!

Forgetting to include all non-aggregated SELECT columns in the GROUP BY clause β€” causes 'ambiguous column' errors or wrong results.

!

Accidentally creating a CROSS JOIN by forgetting the ON clause β€” can return billions of rows and crash your database.

!

Nesting subqueries 4-5 levels deep instead of using CTEs β€” creates unreadable, unmaintainable query spaghetti.

Interview Mastery

INNER JOIN returns only rows where the join condition matches in both tables β€” rows with no match are excluded entirely. LEFT JOIN returns all rows from the left table regardless of whether a match exists in the right table; non-matching right-side columns are NULL. Use INNER when you only want matched records; use LEFT when you want all records from the primary table and optional data from the secondary.

A CTE (Common Table Expression) is a named temporary result set defined with WITH name AS (...) before the main query. Use CTEs when: (1) you need to reference the same subquery multiple times in a query, (2) the logic is complex enough that nesting subqueries becomes unreadable, (3) you need recursive queries (like traversing org charts). CTEs are not materialized in most databases β€” they're just syntactic sugar for readability, but they make complex queries dramatically easier to understand and debug.

GROUP BY collapses many rows into one row per group β€” you lose the individual row details. Window functions (using OVER()) perform calculations across a set of rows related to the current row, but do not collapse the rows β€” you can see both the individual row and the aggregate value in the same result. Example: GROUP BY gives you 'total sales per department'. A window function gives you 'each employee's salary AND their department's average salary in the same row'.

Real-World Blueprint

"LinkedIn's 'People You May Know' feature uses self-joins on the connections table: find users A and C who both have a connection to user B, but A and C aren't directly connected yet. This is a classic multi-level self-join query. The analytics pipeline that powers LinkedIn's recruiter insights uses CTEs and window functions to calculate rolling averages, percentile rankings, and month-over-month growth β€” all in SQL."

Hands-on Lab Exercises

1

Write a query using INNER, LEFT, and RIGHT JOIN on the same two tables and compare the row counts returned by each.

2

Find all users who have never made a purchase using the LEFT JOIN + IS NULL pattern.

3

Rewrite a deeply nested subquery as a CTE and compare readability.

4
Use ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) to rank employees within each department by salary.

Real-World Practice Scenarios

You need a report showing every product and its total sales revenue β€” including products that have never been sold (show 0). Which JOIN and aggregation do you use?

A query with 5 nested subqueries is timing out and impossible to debug. How do you refactor it using CTEs?

You need to find 'second highest salary per department' β€” GROUP BY alone can't do this. How do you use a window function to solve it?

Your JOIN query returns more rows than expected β€” 3x the actual data. What likely went wrong and how do you diagnose it?