SQL in Y minutes
This is an introduction to SQL from basic to fairly advanced queries. All queries have been tested on SQLite and PostgreSQL.
✨ This is an open source guide. Feel free to improve it!
Concepts: Database
A database is a collection of data that can be searched and retrieved. A database management system (DBMS) is a program that manages a particular kind of database.
Each DBMS stores data in its own way:
- SQLite stores each database in a single file
- PostgreSQL spreads information across many files for higher performance
A DBMS can be a library embedded in other programs (SQLite) or a server (PostgreSQL).
A relational database management system (RDBMS) stores data in tables and uses SQL for queries. Unfortunately, every RDBMS has its own dialect of SQL.
1: Selecting a constant
select 42;
42
`select` is a keyword. Normally it is used to select data from a table, but if all we want is a constant value, we don't need to specify one.
A semicolon terminator is required.
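`select` can also evaluate expressions, not just literal constants. A minimal sketch:

```sql
select 1 + 2 * 3;
-- 7
```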
2: Selecting all values from table
Using the `penguins` database:
select * from little_penguins;
┌───────────┬───────────┬────────────────┬───────────────┬───────────────────┬─────────────┬────────┐
│ species │ island │ bill_length_mm │ bill_depth_mm │ flipper_length_mm │ body_mass_g │ sex │
├───────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┤
│ Gentoo │ Biscoe │ 51.3 │ 14.2 │ 218.0 │ 5300.0 │ MALE │
│ Adelie │ Dream │ 35.7 │ 18.0 │ 202.0 │ 3550.0 │ FEMALE │
│ Adelie │ Torgersen │ 36.6 │ 17.8 │ 185.0 │ 3700.0 │ FEMALE │
│ Chinstrap │ Dream │ 55.8 │ 19.8 │ 207.0 │ 4000.0 │ MALE │
│ Adelie │ Dream │ 38.1 │ 18.6 │ 190.0 │ 3700.0 │ FEMALE │
│ Adelie │ Dream │ 36.2 │ 17.3 │ 187.0 │ 3300.0 │ FEMALE │
│ Adelie │ Dream │ 39.5 │ 17.8 │ 188.0 │ 3300.0 │ FEMALE │
│ Gentoo │ Biscoe │ 42.6 │ 13.7 │ 213.0 │ 4950.0 │ FEMALE │
│ Gentoo │ Biscoe │ 52.1 │ 17.0 │ 230.0 │ 5550.0 │ MALE │
│ Adelie │ Torgersen │ 36.7 │ 18.8 │ 187.0 │ 3800.0 │ FEMALE │
└───────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┘
Use `*` to select all columns and `from tablename` to specify a table.
3: Specifying columns
select species, island, sex
from little_penguins;
┌───────────┬───────────┬────────┐
│ species │ island │ sex │
├───────────┼───────────┼────────┤
│ Gentoo │ Biscoe │ MALE │
│ Adelie │ Dream │ FEMALE │
│ Adelie │ Torgersen │ FEMALE │
│ Chinstrap │ Dream │ MALE │
│ Adelie │ Dream │ FEMALE │
│ Adelie │ Dream │ FEMALE │
│ Adelie │ Dream │ FEMALE │
│ Gentoo │ Biscoe │ FEMALE │
│ Gentoo │ Biscoe │ MALE │
│ Adelie │ Torgersen │ FEMALE │
└───────────┴───────────┴────────┘
Specify column names separated by commas, in any order. Duplicates are allowed. Line breaks are allowed and encouraged for readability.
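For example, columns can be reordered and even repeated (aliases for the duplicates are covered later):

```sql
select sex, island, island
from little_penguins;
```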
4: Sorting
select species, sex, island
from little_penguins
order by island asc, sex desc;
┌───────────┬────────┬───────────┐
│ species │ sex │ island │
├───────────┼────────┼───────────┤
│ Gentoo │ MALE │ Biscoe │
│ Gentoo │ MALE │ Biscoe │
│ Gentoo │ FEMALE │ Biscoe │
│ Chinstrap │ MALE │ Dream │
│ Adelie │ FEMALE │ Dream │
│ Adelie │ FEMALE │ Dream │
│ Adelie │ FEMALE │ Dream │
│ Adelie │ FEMALE │ Dream │
│ Adelie │ FEMALE │ Torgersen │
│ Adelie │ FEMALE │ Torgersen │
└───────────┴────────┴───────────┘
`order by` must follow `from` (which must follow `select`). `asc` is ascending, `desc` is descending. The default is ascending, but it's better to specify it explicitly while learning.
5: Limiting output
-- the full dataset has 344 rows,
-- let's select 10 of them
select species, sex, island
from penguins
order by species, sex, island
limit 10;
┌─────────┬────────┬───────────┐
│ species │ sex │ island │
├─────────┼────────┼───────────┤
│ Adelie │ │ Dream │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
└─────────┴────────┴───────────┘
Comments start with `--` and continue to the end of the line. `limit N` specifies the maximum number of rows returned by the query.
6: Paging output
select species, sex, island
from penguins
order by species, sex, island
limit 10 offset 3;
-- skip the first 3 records and show the next 10
┌─────────┬────────┬───────────┐
│ species │ sex │ island │
├─────────┼────────┼───────────┤
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ FEMALE │ Biscoe │
└─────────┴────────┴───────────┘
`offset N` specifies the number of rows to skip from the start of the selection. It must follow the `limit` clause.
7: Removing duplicates
select distinct
species, sex, island
from penguins;
┌───────────┬────────┬───────────┐
│ species │ sex │ island │
├───────────┼────────┼───────────┤
│ Adelie │ MALE │ Torgersen │
│ Adelie │ FEMALE │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ MALE │ Biscoe │
│ Adelie │ FEMALE │ Dream │
│ Adelie │ MALE │ Dream │
│ Adelie │ │ Dream │
│ Chinstrap │ FEMALE │ Dream │
│ Chinstrap │ MALE │ Dream │
│ Gentoo │ FEMALE │ Biscoe │
│ Gentoo │ MALE │ Biscoe │
│ Gentoo │ │ Biscoe │
└───────────┴────────┴───────────┘
The `distinct` keyword keeps only distinct combinations of the selected columns. It must appear immediately after `select`.
Note that there are empty values in the `sex` column, indicating missing data. We'll talk about this later.
8: Filtering results
select distinct
species, sex, island
from penguins
where island = 'Biscoe';
┌─────────┬────────┬───────────┐
│ species │ sex │ island │
├─────────┼────────┼───────────┤
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ MALE │ Biscoe │
│ Gentoo │ FEMALE │ Biscoe │
│ Gentoo │ MALE │ Biscoe │
│ Gentoo │ │ Biscoe │
└─────────┴────────┴───────────┘
The `where` condition filters the rows produced by the selection. The condition is evaluated independently for each row; only rows that pass the test appear in the results.
Use single quotes for `'text data'`.
9: Filtering with more complex conditions
select distinct
species, sex, island
from penguins
where island = 'Biscoe' and sex <> 'MALE';
-- `<>` means "not equal"
┌─────────┬────────┬────────┐
│ species │ sex │ island │
├─────────┼────────┼────────┤
│ Adelie │ FEMALE │ Biscoe │
│ Gentoo │ FEMALE │ Biscoe │
└─────────┴────────┴────────┘
Logical operators:
- `and`: both sub-conditions must be true
- `or`: either or both sub-conditions must be true
Note that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test. We'll talk about this later.
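The `or` operator works the same way. For example, to list the distinct species found on either of two islands:

```sql
select distinct species, island
from penguins
where island = 'Torgersen' or island = 'Dream';
```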
10: Doing calculations
select
flipper_length_mm / 10.0,
body_mass_g / 1000.0
from penguins
limit 3;
┌──────────────────────────┬──────────────────────┐
│ flipper_length_mm / 10.0 │ body_mass_g / 1000.0 │
├──────────────────────────┼──────────────────────┤
│ 18.1 │ 3.75 │
│ 18.6 │ 3.8 │
│ 19.5 │ 3.25 │
└──────────────────────────┴──────────────────────┘
We can perform the usual kinds of arithmetic on individual values. The calculation is done for each row independently.
11: Renaming columns
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as "found at"
from penguins
limit 3;
┌────────────┬───────────┬───────────┐
│ flipper_cm │ weight_kg │ found at │
├────────────┼───────────┼───────────┤
│ 18.1 │ 3.75 │ Torgersen │
│ 18.6 │ 3.8 │ Torgersen │
│ 19.5 │ 3.25 │ Torgersen │
└────────────┴───────────┴───────────┘
Use `<expression> as name` to give the result of the calculation a meaningful name. Use double quotes for aliases that contain spaces.
12: Calculating with missing values
select
flipper_length_mm / 10.0 as flipper_cm,
body_mass_g / 1000.0 as weight_kg,
island as found_at
from penguins
limit 5;
┌────────────┬───────────┬───────────┐
│ flipper_cm │ weight_kg │ found_at │
├────────────┼───────────┼───────────┤
│ 18.1 │ 3.75 │ Torgersen │
│ 18.6 │ 3.8 │ Torgersen │
│ 19.5 │ 3.25 │ Torgersen │
│ │ │ Torgersen │
│ 19.3 │ 3.45 │ Torgersen │
└────────────┴───────────┴───────────┘
SQL uses a special value, `null`, to represent missing data. It is neither 0 nor an empty string, but "I don't know". Here, the flipper length and body mass are unknown for the fourth penguin.
`null` is viral, so calculations involving it will always return `null` ("I don't know" divided by 10 or 1000 is still "I don't know").
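For example, any arithmetic involving `null` yields `null`:

```sql
select
  null + 5 as plus,
  null * 0 as times,
  10 / null as division;
-- all three results are null
```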
13: Null equality
select distinct
species, sex, island
from penguins
where island = 'Biscoe';
┌─────────┬────────┬────────┐
│ species │ sex │ island │
├─────────┼────────┼────────┤
│ Adelie │ FEMALE │ Biscoe │
│ Adelie │ MALE │ Biscoe │
│ Gentoo │ FEMALE │ Biscoe │
│ Gentoo │ MALE │ Biscoe │
│ Gentoo │ │ Biscoe │
└─────────┴────────┴────────┘
Let's ask for female penguins only:
select distinct
species, sex, island
from penguins
where island = 'Biscoe' and sex = 'FEMALE';
┌─────────┬────────┬────────┐
│ species │ sex │ island │
├─────────┼────────┼────────┤
│ Adelie │ FEMALE │ Biscoe │
│ Gentoo │ FEMALE │ Biscoe │
└─────────┴────────┴────────┘
The row with the `null` sex is removed from the results.
14: Null inequality
Let's ask for penguins that aren't female:
select distinct
species, sex, island
from penguins
where island = 'Biscoe' and sex <> 'FEMALE';
┌─────────┬──────┬────────┐
│ species │ sex │ island │
├─────────┼──────┼────────┤
│ Adelie │ MALE │ Biscoe │
│ Gentoo │ MALE │ Biscoe │
└─────────┴──────┴────────┘
The row with the `null` sex is also removed from the results.
15: Ternary logic
select null = null;
┌─────────────┐
│ null = null │
├─────────────┤
│ │
└─────────────┘
If we don't know the left and right values, we don't know whether they're equal or not, so the result is `null`. The same is true for `null <> null`.
It's ternary logic, based on three values: true, false, and "don't know" (represented as `null`).
Equality:
┌──────┬───────┬───────┬──────┐
│ │ X │ Y │ null │
├──────┼───────┼───────┼──────┤
│ X │ true │ false │ null │
│ Y │ false │ true │ null │
│ null │ null │ null │ null │
└──────┴───────┴───────┴──────┘
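`and` and `or` follow the same three-valued rules, with one twist: the result is only `null` when the unknown value could change the outcome. A sketch (SQLite uses 0 and 1 for false and true; PostgreSQL would need the boolean literals `false` and `true` instead):

```sql
select
  (null and 0) as and_false,  -- 0: false regardless of the unknown
  (null and 1) as and_true,   -- null
  (null or 0) as or_false,    -- null
  (null or 1) as or_true;     -- 1: true regardless of the unknown
```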
16: Handling null safely
select
species, sex, island
from penguins
where sex is null;
┌─────────┬─────┬───────────┐
│ species │ sex │ island │
├─────────┼─────┼───────────┤
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Torgersen │
│ Adelie │ │ Dream │
│ Gentoo │ │ Biscoe │
│ Gentoo │ │ Biscoe │
│ Gentoo │ │ Biscoe │
│ Gentoo │ │ Biscoe │
│ Gentoo │ │ Biscoe │
└─────────┴─────┴───────────┘
Use `is null` and `is not null` to handle null safely.
Other parts of SQL treat `null` differently, often ignoring it in calculations (e.g. when summing values).
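For example, `is not null` keeps only the rows where the value is known:

```sql
select count(*) as known_sex
from penguins
where sex is not null;
-- 333 of the 344 rows have a non-null sex
```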
17: Aggregating
select sum(body_mass_g) as total_mass
from penguins;
┌────────────┐
│ total_mass │
├────────────┤
│ 1437000.0 │
└────────────┘
Aggregation combines column values from multiple rows into a single value. `sum` is an aggregation function.
18: Common aggregation functions
select
max(bill_length_mm) as longest_bill,
min(flipper_length_mm) as shortest_flipper,
avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio
from penguins;
┌──────────────┬──────────────────┬──────────────────┐
│ longest_bill │ shortest_flipper │ weird_ratio │
├──────────────┼──────────────────┼──────────────────┤
│ 59.6 │ 172.0 │ 2.56087082530644 │
└──────────────┴──────────────────┴──────────────────┘
Aggregation functions ignore `null` values; otherwise all three result values would be `null`.
19: Counting
select
-- counts rows
count(*) as count_star,
-- counts non-null entries in column
count(sex) as count_specific,
-- counts distinct non-null entries
count(distinct sex) as count_distinct
from penguins;
┌────────────┬────────────────┬────────────────┐
│ count_star │ count_specific │ count_distinct │
├────────────┼────────────────┼────────────────┤
│ 344 │ 333 │ 2 │
└────────────┴────────────────┴────────────────┘
20: Grouping
select avg(body_mass_g) as average_mass_g
from penguins
group by sex;
┌──────────────────┐
│ average_mass_g │
├──────────────────┤
│ 4005.55555555556 │
│ 3862.27272727273 │
│ 4545.68452380952 │
└──────────────────┘
This query puts rows into groups based on distinct combinations of the values in the columns specified with `group by`, then performs aggregation separately for each group.
It's not very useful unless we also list the `sex` value for each group in the result.
21: Aggregation key
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex;
┌────────┬──────────────────┐
│ sex │ average_mass_g │
├────────┼──────────────────┤
│ │ 4005.55555555556 │
│ FEMALE │ 3862.27272727273 │
│ MALE │ 4545.68452380952 │
└────────┴──────────────────┘
Distinct `sex` values identify each of the groups.
22: Unaggregated columns
select
sex,
body_mass_g
from penguins
group by sex;
┌────────┬─────────────┐
│ sex │ body_mass_g │
├────────┼─────────────┤
│ │ │
│ FEMALE │ 3800.0 │
│ MALE │ 3750.0 │
└────────┴─────────────┘
`sex` is an aggregation key, but `body_mass_g` is not (it is not listed in `group by`). We also didn't specify an aggregation function for `body_mass_g`, so the database engine doesn't know what to do with it.
SQLite is very forgiving, so it'll just return an arbitrary `body_mass_g` value for each `sex`. PostgreSQL (and other DBMS) will return an error:
ERROR: column "penguins.body_mass_g" must appear in the GROUP BY clause or be used in an aggregate function
LINE 3: body_mass_g
^
23: Filtering aggregated values
select
sex,
avg(body_mass_g) as average_mass_g
from penguins
group by sex
having avg(body_mass_g) > 4000.0;
┌──────┬──────────────────┐
│ sex │ average_mass_g │
├──────┼──────────────────┤
│ │ 4005.55555555556 │
│ MALE │ 4545.68452380952 │
└──────┴──────────────────┘
The `having` clause filters the results after aggregation (as opposed to `where`, which filters before aggregation).
24: Readable output
select
sex,
round(avg(body_mass_g), 1) as average_mass_g
from penguins
group by sex
having round(avg(body_mass_g), 1) > 4000.0;
┌──────┬────────────────┐
│ sex │ average_mass_g │
├──────┼────────────────┤
│ │ 4005.6 │
│ MALE │ 4545.7 │
└──────┴────────────────┘
Use `round(value, decimals)` to round off a number.
Note that PostgreSQL requires a type conversion:
select
sex,
round(avg(body_mass_g)::numeric, 1) as average_mass_g
from penguins
group by sex
having round(avg(body_mass_g)::numeric, 1) > 4000.0;
25: Filtering aggregate inputs
select
sex,
round(
avg(body_mass_g) filter (where body_mass_g < 4000.0),
1
) as average_mass_g
from penguins
group by sex;
┌────────┬────────────────┐
│ sex │ average_mass_g │
├────────┼────────────────┤
│ │ 3362.5 │
│ FEMALE │ 3417.3 │
│ MALE │ 3729.6 │
└────────┴────────────────┘
Note: use `avg(body_mass_g) filter (where body_mass_g < 4000.0)::numeric` for PostgreSQL.
`filter (where <condition>)` filters the values passed into the aggregation function, while `having` filters the results after aggregation. And unlike `having`, it applies to each expression in the `select` list individually.
So we can define different filters for different `select` expressions:
select
sex,
max(body_mass_g) filter (where body_mass_g < 4000.0) as max_4000,
max(body_mass_g) filter (where species = 'Adelie') as max_adelie
from penguins
group by sex;
┌────────┬──────────┬────────────┐
│ sex │ max_4000 │ max_adelie │
├────────┼──────────┼────────────┤
│ │ 3700.0 │ 4250.0 │
│ FEMALE │ 3950.0 │ 3900.0 │
│ MALE │ 3975.0 │ 4775.0 │
└────────┴──────────┴────────────┘
26: Creating tables
create table job (
name varchar(10) not null,
billable real not null
);
create table work (
person varchar(5) not null,
job varchar(10) not null
);
To create a table, use `create table <name>` followed by a list of columns in parentheses. Each column has a name, a data type, and optional extra information (e.g., `not null` prohibits writing `null` values).
27: Inserting Data
insert into job values
('calibrate', 1.5),
('clean', 0.5);
insert into work values
('mik', 'calibrate'),
('mik', 'clean'),
('mik', 'complain'),
('po', 'clean'),
('po', 'complain'),
('tay', 'complain');
select * from job;
select * from work;
┌───────────┬──────────┐
│ name │ billable │
├───────────┼──────────┤
│ calibrate │ 1.5 │
│ clean │ 0.5 │
└───────────┴──────────┘
┌────────┬───────────┐
│ person │ job │
├────────┼───────────┤
│ mik │ calibrate │
│ mik │ clean │
│ mik │ complain │
│ po │ clean │
│ po │ complain │
│ tay │ complain │
└────────┴───────────┘
28: Updating rows
update work
set person = 'tae'
where person = 'tay';
select * from work;
┌────────┬───────────┐
│ person │ job │
├────────┼───────────┤
│ mik │ calibrate │
│ mik │ clean │
│ mik │ complain │
│ po │ clean │
│ po │ complain │
│ tae │ complain │
└────────┴───────────┘
The `where` clause specifies which rows to update. Without `where`, the query will update all rows in the table, which is usually not what you want.
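For example, this query (note the missing `where`) would overwrite the `person` value in every row:

```sql
-- careful: updates ALL rows
update work
set person = 'somebody';
```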
29: Deleting rows
delete from work
where person = 'tae';
select * from work;
┌────────┬───────────┐
│ person │ job │
├────────┼───────────┤
│ mik │ calibrate │
│ mik │ clean │
│ mik │ complain │
│ po │ clean │
│ po │ complain │
└────────┴───────────┘
As with `update`, use `where` to specify which rows to delete.
30: Copying tables
create table work_bak (
person varchar(5) not null,
job varchar(10) not null
);
insert into work_bak
select person, job
from work;
select count(*) from work_bak;
┌──────────┐
│ count(*) │
├──────────┤
│ 6 │
└──────────┘
`insert into ... select` inserts rows selected from one table into another.
You can also create and populate the table with a single query:
create table work_bak as
select person, job
from work;
select count(*) from work_bak;
┌──────────┐
│ count(*) │
├──────────┤
│ 6 │
└──────────┘
The `work_bak` table structure is automatically defined based on the names and types of the `select` expressions (in this case, the `person` and `job` columns).
31: Combining information
select *
from work cross join job;
┌────────┬───────────┬───────────┬──────────┐
│ person │ job │ name │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik │ calibrate │ calibrate │ 1.5 │
│ mik │ calibrate │ clean │ 0.5 │
│ mik │ clean │ calibrate │ 1.5 │
│ mik │ clean │ clean │ 0.5 │
│ mik │ complain │ calibrate │ 1.5 │
│ mik │ complain │ clean │ 0.5 │
│ po │ clean │ calibrate │ 1.5 │
│ po │ clean │ clean │ 0.5 │
│ po │ complain │ calibrate │ 1.5 │
│ po │ complain │ clean │ 0.5 │
│ tay │ complain │ calibrate │ 1.5 │
│ tay │ complain │ clean │ 0.5 │
└────────┴───────────┴───────────┴──────────┘
A join combines information from two tables. `cross join` constructs their cross product: all pairwise combinations of rows from the first table with rows from the second.
The result isn't very useful: the `job` and `name` values don't match, so the combined data contains records that are unrelated to each other.
32: Inner join
select *
from work
inner join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │ job │ name │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik │ calibrate │ calibrate │ 1.5 │
│ mik │ clean │ clean │ 0.5 │
│ po │ clean │ clean │ 0.5 │
└────────┴───────────┴───────────┴──────────┘
Use the `<table>.<column>` notation to specify columns, and `on <condition>` to specify the join condition. Here we match people from the `work` table with corresponding jobs from the `job` table, effectively selecting all jobs for each person.
Note that while some people in the `work` table have `job = 'complain'`, there is no complain job in the `job` table. So rows with `work.job = 'complain'` have no matching records in `job` and are excluded from the result.
33: Aggregating joined data
select
work.person,
sum(job.billable) as pay
from work
inner join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik │ 2.0 │
│ po │ 0.5 │
└────────┴─────┘
Here we select all jobs for each person (as in the previous query), and then aggregate them by person, calculating the total pay for each person.
Since Tay has only complain jobs in the `work` table, and there is no complain job in the `job` table, Tay is excluded from the result.
34: Left join
select *
from work
left join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │ job │ name │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik │ calibrate │ calibrate │ 1.5 │
│ mik │ clean │ clean │ 0.5 │
│ mik │ complain │ │ │
│ po │ clean │ clean │ 0.5 │
│ po │ complain │ │ │
│ tay │ complain │ │ │
└────────┴───────────┴───────────┴──────────┘
A left join (also known as a left outer join) keeps all rows from the left table (`work`) and fills missing values from the right table (`job`) with `null`.
Now Tay has finally made it into the result, but since there are no matching jobs for them in the `job` table, the `name` and `billable` columns for Tay are empty.
35: Aggregating left joins
select
work.person,
sum(job.billable) as pay
from work
left join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik │ 2.0 │
│ po │ 0.5 │
│ tay │ │
└────────┴─────┘
As discussed in the previous query, Tay made it into the result thanks to the `left join`. And since the individual `billable` values for Tay are `null`, their sum is also `null`.
36: Coalescing values
select
work.person,
coalesce(sum(job.billable), 0.0) as pay
from work
left join job on work.job = job.name
group by work.person;
┌────────┬─────┐
│ person │ pay │
├────────┼─────┤
│ mik │ 2.0 │
│ po │ 0.5 │
│ tay │ 0.0 │
└────────┴─────┘
`coalesce(val1, val2, ...)` returns the first non-null value, so Tay gets 0 pay instead of `null`.
37: Full outer join
Suppose we add one more job to the `job` table:
insert into job values
('manage', 1);
Now let's select people with their corresponding jobs, this time using a `full join` instead of a `left join`:
select *
from work
full join job on work.job = job.name;
┌────────┬───────────┬───────────┬──────────┐
│ person │ job │ name │ billable │
├────────┼───────────┼───────────┼──────────┤
│ mik │ calibrate │ calibrate │ 1.5 │
│ mik │ clean │ clean │ 0.5 │
│ mik │ complain │ │ │
│ po │ clean │ clean │ 0.5 │
│ po │ complain │ │ │
│ tay │ complain │ │ │
│ │ │ manage │ 1.0 │
└────────┴───────────┴───────────┴──────────┘
A full join (also known as a full outer join) selects the following:
- matching rows from both tables,
- rows from the left table (`work`) that have no match in the right table (`job`),
- rows from the right table (`job`) that have no match in the left table (`work`).
So the query returns both Tay from the `work` table (whose job has no match in the `job` table) and "manage" from the `job` table (whose name has no match in the `work` table).
38: Negating incorrectly
Who doesn't calibrate?
select distinct person
from work
where job <> 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik │
│ po │
│ tay │
└────────┘
Mik made it into the result, which seems wrong: Mik does calibrate (`person='mik', job='calibrate'`). The problem is that there is also an entry for Mik cleaning (`person='mik', job='clean'`), and since `'clean' <> 'calibrate'`, that entry passes the test.
So essentially this query asks "Who has at least one job other than calibrate?". We need a different approach to answer the original question.
39: Set membership
select *
from work
where person not in ('mik', 'tay');
-- select all people except Mik and Tay
┌────────┬──────────┐
│ person │ job │
├────────┼──────────┤
│ po │ clean │
│ po │ complain │
└────────┴──────────┘
The `in <values>` and `not in <values>` clauses work exactly as you'd expect.
40: Subqueries
So who doesn't calibrate?
select distinct person
from work
where person not in (
select distinct person
from work
where job = 'calibrate'
);
┌────────┐
│ person │
├────────┤
│ po │
│ tay │
└────────┘
Here we use a subquery to select the people who do calibrate. Then we select all the people who are not in that set (so essentially those who do not calibrate).
41: Union
Who cleans or calibrates?
select person from work
where job = 'clean'
union
select person from work
where job = 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik │
│ po │
└────────┘
`union` returns the records found in either the first or the second dataset. The records must have exactly the same structure (same set and order of fields).
`union` also removes duplicate records from the result (otherwise we'd see `mik` twice). You can use `union all` if you don't mind the duplicates; it can be much faster on large datasets.
Union does not guarantee any particular order, so we can add an `order by` to the last `select` to sort the result:
select person from work
where job = 'clean'
union
select person from work
where job = 'calibrate'
order by person desc;
┌────────┐
│ person │
├────────┤
│ po │
│ mik │
└────────┘
42: Intersection
Who cleans and calibrates?
select person from work
where job = 'clean'
intersect
select person from work
where job = 'calibrate';
┌────────┐
│ person │
├────────┤
│ mik │
└────────┘
`intersect` returns the records found in both the first and the second dataset. As with union, all records must have the same structure.
43: Exclusion
Who just complains and does no useful work?
select person from work
where job = 'complain'
except
select person from work
where job <> 'complain';
┌────────┐
│ person │
├────────┤
│ tay │
└────────┘
`except` returns the records found in the first dataset but not in the second. As with union and intersect, all records must have the same structure.
44: Defining a primary key
A primary key uniquely identifies each record in a table. A table can use any field (or combination of fields) as a primary key, as long as the value(s) are unique for each record.
create table equipment (
size real not null,
color varchar(10) not null,
num integer not null,
primary key (size, color)
);
insert into equipment values
(1.5, 'blue', 2),
(1.5, 'green', 1),
(2.5, 'blue', 1);
select * from equipment;
insert into equipment values
(1.5, 'green', 2);
┌──────┬───────┬─────┐
│ size │ color │ num │
├──────┼───────┼─────┤
│ 1.5 │ blue │ 2 │
│ 1.5 │ green │ 1 │
│ 2.5 │ blue │ 1 │
└──────┴───────┴─────┘
Runtime error near line 17: UNIQUE constraint failed: equipment.size, equipment.color (19)
(exit status 1)
Since the table already contains a `size=1.5, color='green'` record, we can't add another one with the same values.
45: Autoincrementing and primary keys
create table person (
ident integer primary key autoincrement,
name varchar(5) not null
);
insert into person(name) values
('mik'), ('po'), ('tay');
select * from person;
insert into person values (1, 'prevented');
┌───────┬──────┐
│ ident │ name │
├───────┼──────┤
│ 1 │ mik │
│ 2 │ po │
│ 3 │ tay │
└───────┴──────┘
Runtime error near line 37: UNIQUE constraint failed: person.ident (19)
(exit status 1)
The database autoincrements `ident` each time a new record is added. Auto-incremented fields are unique for each record, so they are often used as primary keys.
Such surrogate identifiers are also useful because if a "business" attribute of a record changes (say, Mik changes their name), the identifier stays the same, so we don't need to change the data in other tables that reference that record.
Note that PostgreSQL requires a different autoincrement syntax:
create table person (
ident serial primary key,
name varchar(5) not null
);
46: Altering tables
Let's add a unique identifier to the `job` table:
alter table job
add ident integer not null default -1;
update job
set ident = 1
where name = 'calibrate';
update job
set ident = 2
where name = 'clean';
select * from job;
┌───────────┬──────────┬───────┐
│ name │ billable │ ident │
├───────────┼──────────┼───────┤
│ calibrate │ 1.5 │ 1 │
│ clean │ 0.5 │ 2 │
└───────────┴──────────┴───────┘
Here we add the `ident` column. Since it can't be null and there is already data in the table, we have to provide a default value.
Then we use `update` to modify the existing records and set the actual `ident` values. This is an example of data migration.
Concepts: M-to-N relationships
Relationships between entities are typically characterized as:
- 1-to-1: fields in the same record (such as a person's first and last name).
- 1-to-many: each record in table A can have multiple matching records in table B, but each record in table B has at most one matching record in table A (a person can have multiple toothbrushes, but each toothbrush belongs to at most one person).
- Many-to-many: each record in table A can have multiple matching records in table B, and vice versa (a person can have multiple jobs, and each job can be performed by multiple people).
47: Creating a join table
Our `work` and `job` tables have a many-to-many relationship. Typically, such relationships are implemented using a separate join table:
- Extract people into table A.
- Extract jobs into table B.
- Create a join table with person-job pairs (referencing A and B).
Let's do this.
-- extract people
create table people (
ident integer primary key autoincrement,
name varchar(5) not null
);
insert into people (name)
select distinct person
from work;
Note: use `ident serial primary key` for PostgreSQL.
-- extract jobs
create table jobs (
ident integer primary key autoincrement,
name varchar(10) not null,
billable real not null
);
insert into jobs (name, billable)
select name, billable from job;
Note: use `ident serial primary key` for PostgreSQL.
-- create a join table
create table people_jobs (
person_id integer not null,
job_id integer not null,
foreign key (person_id) references people (ident),
foreign key (job_id) references jobs (ident)
);
The `foreign key` constraint defines references from `people_jobs` to `people` and `jobs`:
- For each `person_id` value in `people_jobs`, there must be a record with a matching `ident` value in `people`.
- For each `job_id` value in `people_jobs`, there must be a record with a matching `ident` value in `jobs`.
-- fill it with person-job pairs
insert into people_jobs
select
people.ident as person_id,
jobs.ident as job_id
from people
inner join work on people.name = work.person
inner join jobs on work.job = jobs.name;
select * from people_jobs;
┌───────────┬────────┐
│ person_id │ job_id │
├───────────┼────────┤
│ 1 │ 1 │
│ 1 │ 2 │
│ 2 │ 2 │
└───────────┴────────┘
48: Removing Tables
Delete old tables:
drop table work;
drop table job;
Running `drop table` twice on the same table results in an error. So it's safer to delete a table only if it really exists and do nothing otherwise:
drop table if exists work;
drop table if exists job;
49: Comparing individual values to aggregates
Going back to the penguins database, let's select the penguins that weigh more than average:
select body_mass_g
from penguins
where
body_mass_g > (
select avg(body_mass_g)
from penguins
)
limit 5;
┌─────────────┐
│ body_mass_g │
├─────────────┤
│ 4675.0 │
│ 4250.0 │
│ 4400.0 │
│ 4500.0 │
│ 4650.0 │
└─────────────┘
Here, we calculate the average body mass in the subquery, and then compare each row to that average.
Running the query requires two scans of the data. There's no way to avoid this, unless we store the average in every record (but then we have to update all records every time we insert or update).
`null` values aren't included in the average or in the final results.
50: Comparing individual values to aggregates within groups
Find penguins that are heavier than average within their species:
select
penguins.species,
penguins.body_mass_g,
round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins
inner join (
select species, avg(body_mass_g) as avg_mass_g
from penguins
group by species
) as averaged
on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;
┌─────────┬─────────────┬────────────┐
│ species │ body_mass_g │ avg_mass_g │
├─────────┼─────────────┼────────────┤
│ Adelie │ 3750.0 │ 3700.7 │
│ Adelie │ 3800.0 │ 3700.7 │
│ Adelie │ 4675.0 │ 3700.7 │
│ Adelie │ 4250.0 │ 3700.7 │
│ Adelie │ 3800.0 │ 3700.7 │
└─────────┴─────────────┴────────────┘
Here, the subquery runs first to create a temporary table `averaged` with the average mass per species. Then the database engine joins it with the `penguins` table, and finally filters to find penguins that are heavier than average within their species.
Note: use `round(averaged.avg_mass_g::numeric, 1)` for PostgreSQL.
51: Common table expressions
with averaged as (
select species, avg(body_mass_g) as avg_mass_g
from penguins
group by species
)
select
penguins.species,
penguins.body_mass_g,
round(averaged.avg_mass_g, 1) as avg_mass_g
from penguins
inner join averaged on penguins.species = averaged.species
where penguins.body_mass_g > averaged.avg_mass_g
limit 5;
┌─────────┬─────────────┬────────────┐
│ species │ body_mass_g │ avg_mass_g │
├─────────┼─────────────┼────────────┤
│ Adelie │ 3750.0 │ 3700.7 │
│ Adelie │ 3800.0 │ 3700.7 │
│ Adelie │ 4675.0 │ 3700.7 │
│ Adelie │ 4250.0 │ 3700.7 │
│ Adelie │ 3800.0 │ 3700.7 │
└─────────┴─────────────┴────────────┘
We've extracted the averaged
subquery into a common table expression (CTE) to make the query clearer. A CTE is basically a named subquery that is defined before the main query and can be referenced by name like a regular table.
Nested subqueries quickly become hard to understand, so it's better to use CTEs. Don't worry about subqueries being more performant than CTEs — the database engine is (usually) smart enough to optimize them identically.
Note: use round(averaged.avg_mass_g::numeric, 1)
for PostgreSQL.
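Here is a minimal standalone CTE (with made-up numbers, independent of the penguins data) showing the general shape:

```sql
-- "nums" is a named subquery we can reference like a table
with nums as (
  select 1 as n
  union all select 2
  union all select 3
)
select sum(n) as total from nums;  -- 6
```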
52: Conditionals
Assign a size
to each penguin according to its weight, then calculate the number of penguins of each size within each species:
with sized_penguins as (
select
species,
(
case when body_mass_g < 3500
then 'small' else 'large' end
) as size
from penguins
where body_mass_g is not null
)
select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬───────┬─────┐
│ species │ size │ num │
├───────────┼───────┼─────┤
│ Adelie │ small │ 54 │
│ Adelie │ large │ 97 │
│ Chinstrap │ small │ 17 │
│ Chinstrap │ large │ 51 │
│ Gentoo │ large │ 123 │
└───────────┴───────┴─────┘
The following expression returns a single result according to the condition:
case when <condition> then <true_result> else <false_result> end
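As a self-contained sketch (with made-up values), case picks one of the two results for each row:

```sql
select
  v,
  case when v < 0 then 'negative' else 'non-negative' end as sign
from (
  select -1 as v
  union all select 0
  union all select 2
) as vals;
-- -1 is 'negative'; 0 and 2 are 'non-negative'
```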
53: Selecting a case
Use three possible size values instead of two:
with sized_penguins as (
select
species,
case
when body_mass_g < 3500 then 'small'
when body_mass_g < 5000 then 'medium'
else 'large'
end as size
from penguins
where body_mass_g is not null
)
select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬────────┬─────┐
│ species │ size │ num │
├───────────┼────────┼─────┤
│ Adelie │ small │ 54 │
│ Adelie │ medium │ 97 │
│ Chinstrap │ small │ 17 │
│ Chinstrap │ medium │ 51 │
│ Gentoo │ medium │ 56 │
│ Gentoo │ large │ 67 │
└───────────┴────────┴─────┘
case can contain any number of conditions:
case
when <condition_1> then <result_1>
when <condition_2> then <result_2>
when <condition_3> then <result_3>
when ...
else <result_else>
end
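Note that the conditions are checked top to bottom, and the first match wins, as this small sketch (with an arbitrary constant instead of a column) shows:

```sql
select case
  when 1000 < 3500 then 'small'
  when 1000 < 5000 then 'medium'  -- also true, but never reached
  else 'large'
end as size;  -- 'small'
```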
54: Checking a range
Consider weights in the range 3500 ≤ x ≤ 5000 "normal" and all others "abnormal":
with sized_penguins as (
select
species,
case
when body_mass_g between 3500 and 5000 then 'normal'
else 'abnormal'
end as size
from penguins
where body_mass_g is not null
)
select species, size, count(*) as num
from sized_penguins
group by species, size
order by species, num;
┌───────────┬──────────┬─────┐
│ species │ size │ num │
├───────────┼──────────┼─────┤
│ Adelie │ abnormal │ 54 │
│ Adelie │ normal │ 97 │
│ Chinstrap │ abnormal │ 17 │
│ Chinstrap │ normal │ 51 │
│ Gentoo │ abnormal │ 61 │
│ Gentoo │ normal │ 62 │
└───────────┴──────────┴─────┘
X between L and H is equivalent to X >= L and X <= H.
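A quick self-contained check with literal weights (true displays as 1 in SQLite and t in PostgreSQL):

```sql
select
  4000 between 3500 and 5000 as normal_4000,  -- true
  3000 between 3500 and 5000 as normal_3000;  -- false
```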
55: Pattern matching
Islands with names containing o:
select distinct island
from penguins
where island like '%o%';
┌───────────┐
│ island │
├───────────┤
│ Torgersen │
│ Biscoe │
└───────────┘
like matches the value against the pattern:
- _ matches a single character
- % matches zero or more characters
like is case-sensitive in some DBMS (such as PostgreSQL) and case-insensitive in others (such as SQLite).
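A couple of self-contained pattern checks against literal values (no table needed):

```sql
select
  'Biscoe' like 'B_scoe' as single_char,  -- true: _ matches the "i"
  'Biscoe' like '%coe' as any_start;      -- true: % matches "Bis"
```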
Further reading
SQL is huge. The SQL standard ran to ≈4000 pages in 2011, and there have been two more editions since then, so it hasn't gotten any smaller. It's also incredibly hard to read and understand.
What's worse, each DBMS implements only parts of the standard, and each has its own quirks. So your best bet is to read the documentation for the specific DBMS you are working with:
Greg Wilson + 1 others · original · CC-BY-NC-4.0 · 2024-02-18
Greg Wilson, Anton Zhiyanov