SQL in Y minutes

This is an introduction to SQL from basic to fairly advanced queries. All queries have been tested on SQLite and PostgreSQL.

✨ This is an open source guide. Feel free to improve it!

Concepts: Database

A database is a collection of data that can be searched and retrieved. A database management system (DBMS) is a program that manages a particular kind of database.

Each DBMS stores data in its own way:

  • SQLite stores each database in a single file
  • PostgreSQL spreads information across many files for higher performance

A DBMS can be either a library embedded in other programs (SQLite) or a standalone server (PostgreSQL).

A relational database management system (RDBMS) stores data in tables and uses SQL for queries. Unfortunately, every RDBMS has its own dialect of SQL.
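For example, even something as simple as string concatenation varies by dialect (a small illustration; only a few dialects are shown):

```sql
-- SQLite and PostgreSQL: || concatenates strings
select 'Hello, ' || 'SQL';
-- SQL Server uses + instead, and MySQL uses the concat() function
```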

1: Selecting a constant

select 42;
42

select is a keyword. Normally it is used to select data from a table, but if all we want is a constant value, we don't need to specify a table.

A semicolon terminator is required.

2: Selecting all values from table

Using the penguins database:

select * from little_penguins;
┌───────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬────────┐
│  species  │  island   │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │  sex   │
├───────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┤
│ Gentoo    │ Biscoe    │ 51.3           │ 14.2          │ 218.0             │ 5300.0      │ MALE   │
│ Adelie    │ Dream     │ 35.7           │ 18.0          │ 202.0             │ 3550.0      │ FEMALE │
│ Adelie    │ Torgersen │ 36.6           │ 17.8          │ 185.0             │ 3700.0      │ FEMALE │
│ Chinstrap │ Dream     │ 55.8           │ 19.8          │ 207.0             │ 4000.0      │ MALE   │
│ Adelie    │ Dream     │ 38.1           │ 18.6          │ 190.0             │ 3700.0      │ FEMALE │
│ Adelie    │ Dream     │ 36.2           │ 17.3          │ 187.0             │ 3300.0      │ FEMALE │
│ Adelie    │ Dream     │ 39.5           │ 17.8          │ 188.0             │ 3300.0      │ FEMALE │
│ Gentoo    │ Biscoe    │ 42.6           │ 13.7          │ 213.0             │ 4950.0      │ FEMALE │
│ Gentoo    │ Biscoe    │ 52.1           │ 17.0          │ 230.0             │ 5550.0      │ MALE   │
│ Adelie    │ Torgersen │ 36.7           │ 18.8          │ 187.0             │ 3800.0      │ FEMALE │
└───────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┘

Use * to select all columns and from tablename to specify a table.

3: Specifying columns

select species, island, sex
from little_penguins;
┌───────────┬───────────┬────────┐
│  species  │  island   │  sex   │
├───────────┼───────────┼────────┤
│ Gentoo    │ Biscoe    │ MALE   │
│ Adelie    │ Dream     │ FEMALE │
│ Adelie    │ Torgersen │ FEMALE │
│ Chinstrap │ Dream     │ MALE   │
│ Adelie    │ Dream     │ FEMALE │
│ Adelie    │ Dream     │ FEMALE │
│ Adelie    │ Dream     │ FEMALE │
│ Gentoo    │ Biscoe    │ FEMALE │
│ Gentoo    │ Biscoe    │ MALE   │
│ Adelie    │ Torgersen │ FEMALE │
└───────────┴───────────┴────────┘

Specify column names separated by commas, in any order. Duplicates are allowed. Line breaks are allowed and encouraged for readability.

4: Sorting

select species, sex, island
from little_penguins
order by island asc, sex desc;
┌───────────┬────────┬───────────┐
│  species  │  sex   │  island   │
├───────────┼────────┼───────────┤
│ Gentoo    │ MALE   │ Biscoe    │
│ Gentoo    │ MALE   │ Biscoe    │
│ Gentoo    │ FEMALE │ Biscoe    │
│ Chinstrap │ MALE   │ Dream     │
│ Adelie    │ FEMALE │ Dream     │
│ Adelie    │ FEMALE │ Dream     │
│ Adelie    │ FEMALE │ Dream     │
│ Adelie    │ FEMALE │ Dream     │
│ Adelie    │ FEMALE │ Torgersen │
│ Adelie    │ FEMALE │ Torgersen │
└───────────┴────────┴───────────┘

order by must follow from (which must follow select). asc is ascending, desc is descending. Default is ascending, but it's better to specify it while learning.

5: Limiting output

-- the full dataset has 344 rows,
-- let's select 10 of them
select species, sex, island
from penguins
order by species, sex, island
limit 10;
┌─────────┬────────┬───────────┐
│ species │  sex   │  island   │
├─────────┼────────┼───────────┤
│ Adelie  │        │ Dream     │
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
└─────────┴────────┴───────────┘

Comments start with -- and continue to the end of the line. limit N specifies the maximum number of rows returned by the query.

6: Paging output

select species, sex, island
from penguins
order by species, sex, island
limit 10 offset 3;
-- skip the first 3 records and show the next 10
┌─────────┬────────┬───────────┐
│ species │  sex   │  island   │
├─────────┼────────┼───────────┤
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │        │ Torgersen │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
│ Adelie  │ FEMALE │ Biscoe    │
└─────────┴────────┴───────────┘

offset N specifies the number of rows to skip from the start of the selection. It must follow the limit clause.

7: Removing duplicates

select distinct
  species, sex, island
from penguins;
┌───────────┬────────┬───────────┐
│  species  │  sex   │  island   │
├───────────┼────────┼───────────┤
│ Adelie    │ MALE   │ Torgersen │
│ Adelie    │ FEMALE │ Torgersen │
│ Adelie    │        │ Torgersen │
│ Adelie    │ FEMALE │ Biscoe    │
│ Adelie    │ MALE   │ Biscoe    │
│ Adelie    │ FEMALE │ Dream     │
│ Adelie    │ MALE   │ Dream     │
│ Adelie    │        │ Dream     │
│ Chinstrap │ FEMALE │ Dream     │
│ Chinstrap │ MALE   │ Dream     │
│ Gentoo    │ FEMALE │ Biscoe    │
│ Gentoo    │ MALE   │ Biscoe    │
│ Gentoo    │        │ Biscoe    │
└───────────┴────────┴───────────┘

The distinct keyword leaves only distinct combinations of selected columns. It must appear immediately after select.

Note that there are empty values in the sex column, indicating missing data. We'll talk about this later.

8: Filtering results

select distinct
  species, sex, island
from penguins
where island = 'Biscoe';
┌─────────┬────────┬────────┐
│ species │  sex   │ island │
├─────────┼────────┼────────┤
│ Adelie  │ FEMALE │ Biscoe │
│ Adelie  │ MALE   │ Biscoe │
│ Gentoo  │ FEMALE │ Biscoe │
│ Gentoo  │ MALE   │ Biscoe │
│ Gentoo  │        │ Biscoe │
└─────────┴────────┴────────┘

The where condition filters the rows produced by the selection. The condition is evaluated independently for each row. Only rows that pass the test appear in results.

Use single quotes for 'text data'.

9: Filtering with more complex conditions

select distinct
  species, sex, island
from penguins
where island = 'Biscoe' and sex <> 'MALE';
-- `<>` means "not equal"
┌─────────┬────────┬────────┐
│ species │  sex   │ island │
├─────────┼────────┼────────┤
│ Adelie  │ FEMALE │ Biscoe │
│ Gentoo  │ FEMALE │ Biscoe │
└─────────┴────────┴────────┘

Logical operators:

  • and: both sub-conditions must be true
  • or: either or both sub-conditions must be true

Note that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test. We'll talk about this later.
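As a sketch of or, here is a query that matches either island (output omitted; run it against the penguins database to see the rows):

```sql
select distinct
  species, sex, island
from penguins
where island = 'Biscoe' or island = 'Dream';
```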

10: Doing calculations

select
  flipper_length_mm / 10.0,
  body_mass_g / 1000.0
from penguins
limit 3;
┌──────────────────────────┬──────────────────────┐
│ flipper_length_mm / 10.0 │ body_mass_g / 1000.0 │
├──────────────────────────┼──────────────────────┤
│ 18.1                     │ 3.75                 │
│ 18.6                     │ 3.8                  │
│ 19.5                     │ 3.25                 │
└──────────────────────────┴──────────────────────┘

We can perform the usual kinds of arithmetic on individual values. The calculation is done for each row independently.

11: Renaming columns

select
  flipper_length_mm / 10.0 as flipper_cm,
  body_mass_g / 1000.0 as weight_kg,
  island as "found at"
from penguins
limit 3;
┌────────────┬───────────┬───────────┐
│ flipper_cm │ weight_kg │ found at  │
├────────────┼───────────┼───────────┤
│ 18.1       │ 3.75      │ Torgersen │
│ 18.6       │ 3.8       │ Torgersen │
│ 19.5       │ 3.25      │ Torgersen │
└────────────┴───────────┴───────────┘

Use <expression> as name to give the result of the calculation a meaningful name. Use double quotes for aliases that contain spaces.

12: Calculating with missing values

select
  flipper_length_mm / 10.0 as flipper_cm,
  body_mass_g / 1000.0 as weight_kg,
  island as found_at
from penguins
limit 5;
┌────────────┬───────────┬───────────┐
│ flipper_cm │ weight_kg │ found_at  │
├────────────┼───────────┼───────────┤
│ 18.1       │ 3.75      │ Torgersen │
│ 18.6       │ 3.8       │ Torgersen │
│ 19.5       │ 3.25      │ Torgersen │
│            │           │ Torgersen │
│ 19.3       │ 3.45      │ Torgersen │
└────────────┴───────────┴───────────┘

SQL uses a special value null to represent missing data. It is neither 0 nor an empty string, but "I don't know". Here, the flipper length and body mass are unknown for the fourth penguin.

null is viral, so calculations involving it will always return null ("I don't know" divided by 10 or 1000 is "I don't know").
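A quick way to see this virality:

```sql
select null + 5 as unknown_sum;
-- returns null: adding 5 to "I don't know" is still "I don't know"
```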

13: Null equality

select distinct
  species, sex, island
from penguins
where island = 'Biscoe';
┌─────────┬────────┬────────┐
│ species │  sex   │ island │
├─────────┼────────┼────────┤
│ Adelie  │ FEMALE │ Biscoe │
│ Adelie  │ MALE   │ Biscoe │
│ Gentoo  │ FEMALE │ Biscoe │
│ Gentoo  │ MALE   │ Biscoe │
│ Gentoo  │        │ Biscoe │
└─────────┴────────┴────────┘

Let's ask for female penguins only:

select distinct
  species, sex, island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';
┌─────────┬────────┬────────┐
│ species │  sex   │ island │
├─────────┼────────┼────────┤
│ Adelie  │ FEMALE │ Biscoe │
│ Gentoo  │ FEMALE │ Biscoe │
└─────────┴────────┴────────┘

The row with the null sex is removed from the results.

14: Null inequality

Let's ask for penguins that aren't female:

select distinct
  species, sex, island
from penguins
where island = 'Biscoe' and sex <> 'FEMALE';
┌─────────┬──────┬────────┐
│ species │ sex  │ island │
├─────────┼──────┼────────┤
│ Adelie  │ MALE │ Biscoe │
│ Gentoo  │ MALE │ Biscoe │
└─────────┴──────┴────────┘

The row with the null sex is also removed from the results.

15: Ternary logic

select null = null;
┌─────────────┐
│ null = null │
├─────────────┤
│             │
└─────────────┘

If we don't know the left and right values, we don't know if they're equal or not — so the result is null. The same is true for null <> null.

It's ternary logic — a logic based on three values: true, false, and "don't know" (represented as null).

  equality:
┌──────┬───────┬───────┬──────┐
│      │   X   │   Y   │ null │
├──────┼───────┼───────┼──────┤
│  X   │ true  │ false │ null │
│  Y   │ false │ true  │ null │
│ null │ null  │ null  │ null │
└──────┴───────┴───────┴──────┘
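The same ternary reasoning applies to and and or: the result is null only when it actually depends on the unknown value. A quick check (SQLite shows booleans as 0 and 1; the true/false literals need SQLite 3.23+):

```sql
select
  (null and false) as a,  -- false: the and is false whatever the unknown is
  (null or true)   as b,  -- true: the or is true whatever the unknown is
  (null and true)  as c;  -- null: the result depends on the unknown value
```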

16: Handling null safely

select
  species, sex, island
from penguins
where sex is null;
┌─────────┬─────┬───────────┐
│ species │ sex │  island   │
├─────────┼─────┼───────────┤
│ Adelie  │     │ Torgersen │
│ Adelie  │     │ Torgersen │
│ Adelie  │     │ Torgersen │
│ Adelie  │     │ Torgersen │
│ Adelie  │     │ Torgersen │
│ Adelie  │     │ Dream     │
│ Gentoo  │     │ Biscoe    │
│ Gentoo  │     │ Biscoe    │
│ Gentoo  │     │ Biscoe    │
│ Gentoo  │     │ Biscoe    │
│ Gentoo  │     │ Biscoe    │
└─────────┴─────┴───────────┘

Use is null and is not null to handle null safely.

Other parts of SQL treat null differently, often ignoring it in calculations (e.g. when summing values).
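For example, sum skips nulls instead of returning null. A self-contained sketch (SQLite names the values columns column1, column2, ...; PostgreSQL additionally requires a table alias such as as t):

```sql
select sum(column1) as total
from (values (1), (null), (2));
-- total is 3, not null: the null row is ignored
```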

17: Aggregating

select sum(body_mass_g) as total_mass
from penguins;
┌────────────┐
│ total_mass │
├────────────┤
│ 1437000.0  │
└────────────┘

Aggregation combines column values from multiple rows into a single value. sum is an aggregation function.

18: Common aggregation functions

select
  max(bill_length_mm) as longest_bill,
  min(flipper_length_mm) as shortest_flipper,
  avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;
┌──────────────┬──────────────────┬──────────────────┐
│ longest_bill │ shortest_flipper │   weird_ratio    │
├──────────────┼──────────────────┼──────────────────┤
│ 59.6         │ 172.0            │ 2.56087082530644 │
└──────────────┴──────────────────┴──────────────────┘

Aggregation functions ignore null values, otherwise all three result values would be null.

19: Counting

select
  -- counts rows
  count(*) as count_star,
  -- counts non-null entries in column
  count(sex) as count_specific,
  -- counts distinct non-null entries
  count(distinct sex) as count_distinct
from penguins;
┌────────────┬────────────────┬────────────────┐
│ count_star │ count_specific │ count_distinct │
├────────────┼────────────────┼────────────────┤
│ 344        │ 333            │ 2              │
└────────────┴────────────────┴────────────────┘

20: Grouping

select avg(body_mass_g) as average_mass_g
from penguins
group by sex;
┌──────────────────┐
│  average_mass_g  │
├──────────────────┤
│ 4005.55555555556 │
│ 3862.27272727273 │
│ 4545.68452380952 │
└──────────────────┘

This query puts rows in groups based on distinct combinations of values in columns specified with group by, then performs aggregation separately for each group.

It's not very useful unless we list the sex value for each group in the result.

21: Aggregation key

select
  sex,
  avg(body_mass_g) as average_mass_g
from penguins
group by sex;
┌────────┬──────────────────┐
│  sex   │  average_mass_g  │
├────────┼──────────────────┤
│        │ 4005.55555555556 │
│ FEMALE │ 3862.27272727273 │
│ MALE   │ 4545.68452380952 │
└────────┴──────────────────┘

Distinct sex values identify each of the groups.

22: Unaggregated columns

select
  sex,
  body_mass_g
from penguins
group by sex;
┌────────┬─────────────┐
│  sex   │ body_mass_g │
├────────┼─────────────┤
│        │             │
│ FEMALE │ 3800.0      │
│ MALE   │ 3750.0      │
└────────┴─────────────┘

sex is an aggregation key, but body_mass_g is not (it is not listed in group by). We also didn't specify an aggregation function for body_mass_g, so the database engine doesn't know what to do with it.

SQLite is very forgiving, so it'll just return an arbitrary body_mass_g value for each sex. PostgreSQL (and other DBMSs) will return an error:

ERROR:  column "penguins.body_mass_g" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3:     body_mass_g
            ^

23: Filtering aggregated values

select
  sex,
  avg(body_mass_g) as average_mass_g
from penguins
group by sex
having avg(body_mass_g) > 4000.0;
┌──────┬──────────────────┐
│ sex  │  average_mass_g  │
├──────┼──────────────────┤
│      │ 4005.55555555556 │
│ MALE │ 4545.68452380952 │
└──────┴──────────────────┘

The having clause filters the results after aggregation (as opposed to where, which filters before aggregation).

24: Readable output

select
  sex,
  round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having round(avg(body_mass_g), 1) > 4000.0;
┌──────┬────────────────┐
│ sex  │ average_mass_g │
├──────┼────────────────┤
│      │ 4005.6         │
│ MALE │ 4545.7         │
└──────┴────────────────┘

Use round(value, decimals) to round off a number.

Note that PostgreSQL requires a type conversion:

select
  sex,
  round(avg(body_mass_g)::numeric, 1) as average_mass_g
from penguins
group by sex
having round(avg(body_mass_g)::numeric, 1) > 4000.0;

25: Filtering aggregate inputs

select
  sex,
  round(
    avg(body_mass_g) filter (where body_mass_g < 4000.0),
    1
  ) as average_mass_g
from penguins
group by sex;
┌────────┬────────────────┐
│  sex   │ average_mass_g │
├────────┼────────────────┤
│        │ 3362.5         │
│ FEMALE │ 3417.3         │
│ MALE   │ 3729.6         │
└────────┴────────────────┘

Note: for PostgreSQL, cast the aggregate result before rounding: round((avg(body_mass_g) filter (where body_mass_g < 4000.0))::numeric, 1).

filter (where <condition>) restricts which rows are fed into an aggregate function, before the aggregation happens. Unlike having, which filters whole groups after aggregation, filter applies to each expression in the select list individually.

So we can define different filters for different select expressions:

select
  sex,
  max(body_mass_g) filter (where body_mass_g < 4000.0) as max_4000,
  max(body_mass_g) filter (where species = 'Adelie') as max_adelie
from penguins
group by sex;
┌────────┬──────────┬────────────┐
│  sex   │ max_4000 │ max_adelie │
├────────┼──────────┼────────────┤
│        │ 3700.0   │ 4250.0     │
│ FEMALE │ 3950.0   │ 3900.0     │
│ MALE   │ 3975.0   │ 4775.0     │
└────────┴──────────┴────────────┘

26: Creating tables

create table job (
  name     varchar(10) not null,
  billable real        not null
);

create table work (
  person varchar(5)  not null,
  job    varchar(10) not null
);

To create a table, use create table <name> followed by a list of columns in parentheses. Each column has a name, a data type, and optional extra information (e.g., not null prohibits writing null values).

27: Inserting data

insert into job values
('calibrate', 1.5),
('clean', 0.5);

insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');

select * from job;
select * from work;
┌───────────┬──────────┐
│   name    │ billable │
├───────────┼──────────┤
│ calibrate │ 1.5      │
│ clean     │ 0.5      │
└───────────┴──────────┘
┌────────┬───────────┐
│ person │    job    │
├────────┼───────────┤
│ mik    │ calibrate │
│ mik    │ clean     │
│ mik    │ complain  │
│ po     │ clean     │
│ po     │ complain  │
│ tay    │ complain  │
└────────┴───────────┘

28: Updating rows

update work
set person = 'tae'
where person = 'tay';
select * from work;
┌────────┬───────────┐
│ person │    job    │
├────────┼───────────┤
│ mik    │ calibrate │
│ mik    │ clean     │
│ mik    │ complain  │
│ po     │ clean     │
│ po     │ complain  │
│ tae    │ complain  │
└────────┴───────────┘

The where clause specifies which rows to update. Without where, the query will update all rows in the table, which is usually not what you want.

29: Deleting rows

delete from work
where person = 'tae';

select * from work;
┌────────┬───────────┐
│ person │    job    │
├────────┼───────────┤
│ mik    │ calibrate │
│ mik    │ clean     │
│ mik    │ complain  │
│ po     │ clean     │
│ po     │ complain  │
└────────┴───────────┘

As with update, you use where to specify which rows to delete.

30: Copying tables

create table work_bak (
  person varchar(5) not null,
  job varchar(10) not null
);

insert into work_bak
select person, job
from work;

select count(*) from work_bak;
┌──────────┐
│ count(*) │
├──────────┤
│ 6        │
└──────────┘

insert into ... select inserts the selected rows from one table into another.

You can create and populate the table with a single query:

create table work_bak as
select person, job
from work;

select count(*) from work_bak;
┌──────────┐
│ count(*) │
├──────────┤
│ 6        │
└──────────┘

The work_bak table structure is automatically defined based on the names and types of the select expressions (in this case, the person and job columns).

31: Combining information

select *
from work cross join job;
┌────────┬───────────┬───────────┬──────────┐
│ person │    job    │   name    │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik    │ calibrate │ calibrate │ 1.5      │
│ mik    │ calibrate │ clean     │ 0.5      │
│ mik    │ clean     │ calibrate │ 1.5      │
│ mik    │ clean     │ clean     │ 0.5      │
│ mik    │ complain  │ calibrate │ 1.5      │
│ mik    │ complain  │ clean     │ 0.5      │
│ po     │ clean     │ calibrate │ 1.5      │
│ po     │ clean     │ clean     │ 0.5      │
│ po     │ complain  │ calibrate │ 1.5      │
│ po     │ complain  │ clean     │ 0.5      │
│ tay    │ complain  │ calibrate │ 1.5      │
│ tay    │ complain  │ clean     │ 0.5      │
└────────┴───────────┴───────────┴──────────┘

A join combines information from two tables. cross join constructs their cross product — all pairwise combinations of rows from the first table with rows from the second table.

The result isn't very useful: in most of the combined rows, the job and name values don't match, so the rows pair up records that are not related to each other.

32: Inner join

select *
from work
  inner join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │    job    │   name    │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik    │ calibrate │ calibrate │ 1.5      │
│ mik    │ clean     │ clean     │ 0.5      │
│ po     │ clean     │ clean     │ 0.5      │
└────────┴───────────┴───────────┴──────────┘

Use the <table>.<column> notation to specify columns. Use on <condition> to specify the join condition. Here we are matching people from the work table with corresponding jobs from the job table, effectively selecting all jobs for each person.

Note that while some people in the work table have job = complain, there are no complain jobs in the job table. So for work.job = 'complain' there are no matching records in the job table, and these records are excluded from the result.

33: Aggregating joined data

select
  work.person,
  sum(job.billable) as pay
from work
  inner join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik    │ 2.0 │
│ po     │ 0.5 │
└────────┴─────┘

Here we select all jobs for each person (as in the previous query), and then aggregate them by person, calculating the total pay for each person.

Since Tay in the work table has only complain jobs, and the job table has no complain entry, Tay is excluded from the result.

34: Left join

select *
from work
  left join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │    job    │   name    │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik    │ calibrate │ calibrate │ 1.5      │
│ mik    │ clean     │ clean     │ 0.5      │
│ mik    │ complain  │           │          │
│ po     │ clean     │ clean     │ 0.5      │
│ po     │ complain  │           │          │
│ tay    │ complain  │           │          │
└────────┴───────────┴───────────┴──────────┘

A left join (also known as a left outer join) keeps all rows from the left table (work) and fills missing values from the right table (job) with null.

Now Tay has finally made it into the result, but since there are no jobs for them in the job table, the name and billable columns for Tay are empty.

35: Aggregating left joins

select
    work.person,
    sum(job.billable) as pay
from work
  left join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik    │ 2.0 │
│ po     │ 0.5 │
│ tay    │     │
└────────┴─────┘

As discussed in the previous query, Tay has made it into the result thanks to the left join. And since the individual billable values for Tay are null, their sum is also null.

36: Coalescing values

select
  work.person,
  coalesce(sum(job.billable), 0.0) as pay
from work
  left join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik    │ 2.0 │
│ po     │ 0.5 │
│ tay    │ 0.0 │
└────────┴─────┘

coalesce(val1, val2, ...) returns the first non-null value, so Tay gets 0 pay instead of null.
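A minimal illustration of the fallback order:

```sql
select coalesce(null, null, 42, 7) as first_known;
-- returns 42: the first non-null argument wins
```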

37: Full outer join

Suppose we add one more job to the job table:

insert into job values
('manage', 1);

Now let's select people with their corresponding jobs, but using a full join instead of a left join:

select *
from work
  full join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │    job    │   name    │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik    │ calibrate │ calibrate │ 1.5      │
│ mik    │ clean     │ clean     │ 0.5      │
│ mik    │ complain  │           │          │
│ po     │ clean     │ clean     │ 0.5      │
│ po     │ complain  │           │          │
│ tay    │ complain  │           │          │
│        │           │ manage    │ 1.0      │
└────────┴───────────┴───────────┴──────────┘

A full join (also known as a full outer join) selects the following:

  • matching rows from both tables,
  • rows from the left table (work) that have no match in the right table (job),
  • rows from the right table (job) that have no match in the left table (work).

So the query returns both Tay from the work table (whose job does not match the job table) and "manage" from the job table (whose name does not match the work table).

38: Negating incorrectly

Who doesn't calibrate?

select distinct person
from work
where job <> 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik    │
│ po     │
│ tay    │
└────────┘

Mik made it into the result, which seems wrong: Mik does calibrate (person='mik', job='calibrate'). The problem is that there's also an entry for Mik cleaning (person='mik', job='clean'). And since 'clean' <> 'calibrate', that entry is included in the result.

So this query actually answers a different question: "Who has at least one job other than calibrate?". We need a different approach to answer the original question.

39: Set membership

select *
from work
where person not in ('mik', 'tay');
-- select all people except Mik and Tay
┌────────┬──────────┐
│ person │   job    │
├────────┼──────────┤
│ po     │ clean    │
│ po     │ complain │
└────────┴──────────┘

in <values> and not in <values> clauses work as you'd expect, with one caveat: if the list contains null, not in matches nothing, because every comparison with null yields null rather than true.

40: Subqueries

So who doesn't calibrate?

select distinct person
from work
where person not in (
  select distinct person
  from work
  where job = 'calibrate'
);
┌────────┐
│ person │
├────────┤
│ po     │
│ tay    │
└────────┘

Here we use a subquery to select the people who do calibrate. Then we select all the people who are not in that set (so essentially those who do not calibrate).

41: Union

Who cleans or calibrates?

select person from work
where job = 'clean'

union

select person from work
where job = 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik    │
│ po     │
└────────┘

union returns the records found in the first or second datasets. The records must have exactly the same structure (same set and order of fields).

union also removes the duplicate records from the result (otherwise we'd see mik twice). You can use union all if you don't mind the duplicates — this can be much faster on large datasets.
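With union all, the duplicate row for mik survives (row order is not guaranteed):

```sql
select person from work
where job = 'clean'

union all

select person from work
where job = 'calibrate';
-- mik appears twice: once for cleaning and once for calibrating
```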

union does not guarantee any particular order, so we can add an order by to the last select to sort the result:

select person from work
where job = 'clean'

union

select person from work
where job = 'calibrate'
order by person desc;
┌────────┐
│ person │
├────────┤
│ po     │
│ mik    │
└────────┘

42: Intersection

Who cleans and calibrates?

select person from work
where job = 'clean'

intersect

select person from work
where job = 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik    │
└────────┘

intersect returns the records found in both the first and second datasets. As with union, all records must have the same structure.

43: Exclusion

Who just complains and does no useful work?

select person from work
where job = 'complain'

except

select person from work
where job <> 'complain';
┌────────┐
│ person │
├────────┤
│ tay    │
└────────┘

except returns the records found in the first dataset but not in the second. As with union and intersect, all records must have the same structure.

44: Defining a primary key

A primary key uniquely identifies each record in a table. A table can use any field (or combination of fields) as a primary key, as long as the value(s) are unique for each record.

create table equipment (
  size  real        not null,
  color varchar(10) not null,
  num   integer     not null,

  primary key (size, color)
);

insert into equipment values
(1.5, 'blue', 2),
(1.5, 'green', 1),
(2.5, 'blue', 1);

select * from equipment;

insert into equipment values
(1.5, 'green', 2);
┌──────┬───────┬─────┐
│ size │ color │ num │
├──────┼───────┼─────┤
│ 1.5  │ blue  │ 2   │
│ 1.5  │ green │ 1   │
│ 2.5  │ blue  │ 1   │
└──────┴───────┴─────┘
Runtime error near line 17: UNIQUE constraint failed: equipment.size, equipment.color (19)
 (exit status 1)

Since the table already contains the size=1.5, color='green' record, we can't add another one with the same values.

45: Autoincrementing and primary keys

create table person (
  ident integer primary key autoincrement,
  name  varchar(5) not null
);

insert into person(name) values
('mik'), ('po'), ('tay');

select * from person;

insert into person values (1, 'prevented');
┌───────┬──────┐
│ ident │ name │
├───────┼──────┤
│ 1     │ mik  │
│ 2     │ po   │
│ 3     │ tay  │
└───────┴──────┘
Runtime error near line 37: UNIQUE constraint failed: person.ident (19)
 (exit status 1)

The database autoincrements ident each time a new record is added. Auto-incremented fields are unique for each record, so they are often used as primary keys.

Such surrogate identifiers are also useful because if the "business" attribute of a record changes (say, Mik changes their name), the identifier remains the same, and we don't need to change the data in other tables that reference said record.

Note that PostgreSQL requires a different autoincrement syntax:

create table person (
  ident serial primary key,
  name  varchar(5) not null
);

46: Altering tables

Let's add a unique identifier to the job table:

alter table job
add ident integer not null default -1;

update job
set ident = 1
where name = 'calibrate';

update job
set ident = 2
where name = 'clean';

select * from job;
┌───────────┬──────────┬───────┐
│   name    │ billable │ ident │
├───────────┼──────────┼───────┤
│ calibrate │ 1.5      │ 1     │
│ clean     │ 0.5      │ 2     │
└───────────┴──────────┴───────┘

Here we add the ident column. Since it can't be null and there is already data in the table, we have to provide a default value.

Then we use update to modify existing records and set the actual ident values. This is an example of data migration.

Concepts: M-to-N relationships

Relationships between entities are typically characterized as:

1-to-1. Fields in the same record (such as a person's first and last name).

1-to-many. Each record in table A can have multiple matching records in table B, but each record in table B has at most one matching record in table A (a person can have multiple toothbrushes, but each toothbrush belongs to at most one person).

Many-to-many. Each record in table A can have multiple matching records in table B, and vice versa (a person can have multiple jobs, and each job can be performed by multiple people).

47: Creating a join table

Our work and job tables have a many-to-many relationship. Typically, such relationships are implemented using a separate join table:

  • Extract people into table A.
  • Extract jobs into table B.
  • Create a join table with person-job pairs (referencing A and B).

Let's do this.

-- extract people
create table people (
  ident integer primary key autoincrement,
  name  varchar(5) not null
);

insert into people (name)
select distinct person
from work;

Note: use ident serial primary key for PostgreSQL.

-- extract jobs
create table jobs (
  ident    integer primary key autoincrement,
  name     varchar(10) not null,
  billable real        not null
);

insert into jobs (name, billable)
select name, billable from job;

Note: use ident serial primary key for PostgreSQL.

-- create a join table
create table people_jobs (
  person_id integer not null,
  job_id    integer not null,

  foreign key (person_id) references people (ident),
  foreign key (job_id) references jobs (ident)
);

The foreign key constraint defines references from people_jobs to people and jobs:

  • For each person_id value in people_jobs, there must be a record with the matching ident value in people.
  • For each job_id value in people_jobs, there must be a record with the matching ident value in jobs.
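These constraints can be seen in action with a quick sketch using Python's built-in sqlite3 module (a stripped-down schema for illustration; note that SQLite only enforces foreign keys when the pragma is turned on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite checks foreign keys only when this is on
conn.executescript("""
create table people (
  ident integer primary key autoincrement,
  name  text not null
);
create table people_jobs (
  person_id integer not null,
  foreign key (person_id) references people (ident)
);
insert into people (name) values ('mik');
""")
conn.execute("insert into people_jobs values (1)")       # ok: people.ident 1 exists
try:
    conn.execute("insert into people_jobs values (99)")  # no such person
except sqlite3.IntegrityError as err:
    print(err)  # FOREIGN KEY constraint failed
```

PostgreSQL enforces foreign keys unconditionally, so no pragma is needed there.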

-- fill it with person-job pairs
insert into people_jobs
select
  people.ident as person_id,
  jobs.ident as job_id
from people
  inner join work on people.name = work.person
  inner join jobs on work.job = jobs.name;

select * from people_jobs;
┌───────────┬────────┐
│ person_id │ job_id │
├───────────┼────────┤
│ 1         │ 1      │
│ 1         │ 2      │
│ 2         │ 2      │
└───────────┴────────┘

48: Removing tables

Delete old tables:

drop table work;
drop table job;

Running drop table twice on the same table results in an error, so it's safer to delete a table only if it exists and do nothing otherwise:

drop table if exists work;
drop table if exists job;

49: Comparing individual values to aggregates

Going back to the penguins.sql database, let's select penguins that weigh more than average:

select body_mass_g
from penguins
where
  body_mass_g > (
    select avg(body_mass_g)
    from penguins
  )
limit 5;
┌─────────────┐
│ body_mass_g │
├─────────────┤
│ 4675.0      │
│ 4250.0      │
│ 4400.0      │
│ 4500.0      │
│ 4650.0      │
└─────────────┘

Here, we calculate the average body mass in the subquery, and then compare each row to that average.

Running the query requires two scans of the data. There's no way to avoid this unless we store the average in every record, but then we'd have to update every record each time we insert or update.

null values aren't included in the average or in the final results.
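A quick way to check that avg skips null values, using Python's built-in sqlite3 module (a throwaway single-column table for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table masses (body_mass_g real);
insert into masses values (3000), (5000), (null);
""")
# avg sums the non-null values and divides by their count,
# so the result is (3000 + 5000) / 2 = 4000, not (3000 + 5000) / 3
print(conn.execute("select avg(body_mass_g) from masses").fetchone()[0])  # 4000.0
```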

50: Comparing individual values to aggregates within groups

Find penguins that are heavier than average within their species:

select
  penguins.species,
  penguins.body_mass_g,
  round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins
  inner join (
    select species, avg(body_mass_g) as avg_mass_g
    from penguins
    group by species
  ) as averaged
  on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;
┌─────────┬─────────────┬────────────┐
│ species │ body_mass_g │ avg_mass_g │
├─────────┼─────────────┼────────────┤
│ Adelie  │ 3750.0      │ 3700.7     │
│ Adelie  │ 3800.0      │ 3700.7     │
│ Adelie  │ 4675.0      │ 3700.7     │
│ Adelie  │ 4250.0      │ 3700.7     │
│ Adelie  │ 3800.0      │ 3700.7     │
└─────────┴─────────────┴────────────┘

Here, the subquery runs first to create a temporary table averaged with the average mass per species. Then the database engine joins it with the penguins table, and finally filters to find penguins that are heavier than average within their species.

Note: use round(averaged.avg_mass_g::numeric, 1) for PostgreSQL.

51: Common table expressions

with averaged as (
  select species, avg(body_mass_g) as avg_mass_g
  from penguins
  group by species
)

select
  penguins.species,
  penguins.body_mass_g,
  round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins
  inner join averaged on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;
┌─────────┬─────────────┬────────────┐
│ species │ body_mass_g │ avg_mass_g │
├─────────┼─────────────┼────────────┤
│ Adelie  │ 3750.0      │ 3700.7     │
│ Adelie  │ 3800.0      │ 3700.7     │
│ Adelie  │ 4675.0      │ 3700.7     │
│ Adelie  │ 4250.0      │ 3700.7     │
│ Adelie  │ 3800.0      │ 3700.7     │
└─────────┴─────────────┴────────────┘

We've extracted the averaged subquery into a common table expression (CTE) to make the query clearer. A CTE is essentially a named subquery defined before the main query that can then be referenced by name like a regular table.

Nested subqueries quickly become hard to understand, so it's better to use CTEs. Don't worry that a subquery might be faster than a CTE — the database engine is (usually) smart enough to optimize the two identically.

Note: use round(averaged.avg_mass_g::numeric, 1) for PostgreSQL.

52: Conditionals

Assign a size to each penguin according to its weight, then calculate the number of penguins of each size within each species:

with sized_penguins as (
  select
    species,
    (
      case when body_mass_g < 3500
      then 'small' else 'large' end
    ) as size
  from penguins
  where body_mass_g is not null
)

select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬───────┬─────┐
│  species  │ size  │ num │
├───────────┼───────┼─────┤
│ Adelie    │ small │ 54  │
│ Adelie    │ large │ 97  │
│ Chinstrap │ small │ 17  │
│ Chinstrap │ large │ 51  │
│ Gentoo    │ large │ 123 │
└───────────┴───────┴─────┘

The following expression returns a single result according to the condition:

case when <condition> then <true_result> else <false_result> end

53: Selecting a case

Use three possible sizes instead of two:

with sized_penguins as (
  select
    species,
    case
      when body_mass_g < 3500 then 'small'
      when body_mass_g < 5000 then 'medium'
      else 'large'
    end as size
  from penguins
  where body_mass_g is not null
)

select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬────────┬─────┐
│  species  │  size  │ num │
├───────────┼────────┼─────┤
│ Adelie    │ small  │ 54  │
│ Adelie    │ medium │ 97  │
│ Chinstrap │ small  │ 17  │
│ Chinstrap │ medium │ 51  │
│ Gentoo    │ medium │ 56  │
│ Gentoo    │ large  │ 67  │
└───────────┴────────┴─────┘

case can contain any number of conditions:

case
  when <condition_1> then <result_1>
  when <condition_2> then <result_2>
  when <condition_3> then <result_3>
  when ...
  else <result_else>
end

54: Checking a range

Treat weights in the range 3500 ≤ x ≤ 5000 as "normal" and anything outside it as "abnormal":

with sized_penguins as (
  select
    species,
    case
      when body_mass_g between 3500 and 5000 then 'normal'
      else 'abnormal'
    end as size
  from penguins
  where body_mass_g is not null
)

select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬──────────┬─────┐
│  species  │   size   │ num │
├───────────┼──────────┼─────┤
│ Adelie    │ abnormal │ 54  │
│ Adelie    │ normal   │ 97  │
│ Chinstrap │ abnormal │ 17  │
│ Chinstrap │ normal   │ 51  │
│ Gentoo    │ abnormal │ 61  │
│ Gentoo    │ normal   │ 62  │
└───────────┴──────────┴─────┘

X between L and H is equivalent to X >= L and X <= H.
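The equivalence is easy to verify by running both forms against the same data, here with Python's built-in sqlite3 module (the throwaway table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (x integer)")
conn.executemany("insert into t values (?)", [(3000,), (3500,), (4200,), (5000,), (5100,)])

via_between = conn.execute("select x from t where x between 3500 and 5000").fetchall()
via_compare = conn.execute("select x from t where x >= 3500 and x <= 5000").fetchall()
print(via_between == via_compare)  # True: both ends of the range are inclusive
```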

55: Pattern matching

Islands with names containing o:

select distinct island
from penguins
where island like '%o%';
┌───────────┐
│  island   │
├───────────┤
│ Torgersen │
│ Biscoe    │
└───────────┘

like matches the value against the pattern:

  • _ matches a single character
  • % matches zero or more characters

like is case-sensitive in some DBMSes (such as PostgreSQL) and case-insensitive in others (such as SQLite).
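The difference is easy to observe in SQLite, again via Python's built-in sqlite3 module (the table here is a stand-in for the penguins data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table islands (island text)")
conn.executemany("insert into islands values (?)", [("Torgersen",), ("Biscoe",), ("Dream",)])

# SQLite's like ignores ASCII case, so 'TORG%' still matches 'Torgersen';
# PostgreSQL would return no rows for this pattern (it offers ilike for
# case-insensitive matching instead)
print(conn.execute("select island from islands where island like 'TORG%'").fetchall())
# [('Torgersen',)]
```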

Further reading

SQL is huge. The 2011 edition of the SQL standard ran to roughly 4000 pages, and there have been two more editions since then, so it hasn't gotten any smaller. It's also incredibly hard to read and understand.

What's worse, each DBMS implements only parts of the standard, and each has its own quirks. So your best bet is to read the documentation for the specific DBMS you are working with.

Greg Wilson + 1 others · original · CC-BY-NC-4.0 · 2024-02-18