SQL queries¶
The SQL SELECT
statement is a tool for constructing queries which extract data from a database. We will show various ways in which this statement can be used. For demonstration purposes we will be working with a small SQLite database which can be downloaded here:
To execute SQL code in Jupyter Notebook we will use the ipython-sql
notebook extension. Assuming that this extension is already installed (see SQL Basics for installation instructions), the code below will load it:
[1]:
%load_ext sql
Next, we connect to the gradebook.sqlite
database:
[2]:
%%sql
sqlite:///gradebook_data.sqlite
This database consists of a single table gradebook
, with records (names, exam scores etc.) of students enrolled in a college course. The whole table looks as follows:
[3]:
%%sql
SELECT *
FROM gradebook
* sqlite:///gradebook_data.sqlite
Done.
[3]:
student_id | first_name | last_name | major | exam1 | exam2 | exam3 |
---|---|---|---|---|---|---|
2712 | Starr | Bachand | computer science | None | 27 | 85 |
3548 | Marcia | Hinkel | biology | 79 | 10 | 41 |
3877 | Chu | Flax | mathematics | 72 | None | 29 |
5108 | Shawnna | Coate | computer science | 79 | 96 | 61 |
5775 | Rosina | Hovey | biology | 96 | 13 | 99 |
5787 | Tenisha | Rippel | computer science | 27 | 54 | 79 |
6245 | Adelina | Brummitt | mathematics | None | 67 | 99 |
7138 | Fran | Mount | biology | 71 | 5 | 80 |
7256 | Rosalinda | Trippe | computer science | None | None | 82 |
8911 | Lottie | Champine | computer science | None | None | 62 |
9165 | Ludivina | Hoskin | mathematics | 27 | 55 | None |
9346 | Stanton | Mcgaugh | computer science | 6 | 60 | 40 |
9667 | Jonnie | Hillock | mathematics | 56 | 24 | 72 |
SELECT … FROM …¶
The code above shows the simplest application of the SELECT
statement: it can be used to retrieve the entire content of a database table. The complete syntax of this statement, however, can be much more complex:
SELECT [DISTINCT] column_list
FROM table_list
[WHERE condition]
[GROUP BY column [HAVING conditon]]
[ORDER BY column [ASC | DESC]]
[LIMIT count OFFSET offset]
All parts in [square brackets] are optional. The only mandatory parts are SELECT
followed a list of columns we want to retrieve, and FROM
followed by names of tables where these columns come from. In the example below we use just these two keywords to extract a few columns from the gradebook
table:
[5]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[5]:
first_name | last_name | exam3 |
---|---|---|
Rosina | Hovey | 99 |
Stanton | Mcgaugh | 40 |
Jonnie | Hillock | 72 |
Chu | Flax | 29 |
Rosalinda | Trippe | 82 |
Fran | Mount | 80 |
Shawnna | Coate | 61 |
Marcia | Hinkel | 41 |
Starr | Bachand | 85 |
Lottie | Champine | 62 |
Adelina | Brummitt | 99 |
Tenisha | Rippel | 79 |
Ludivina | Hoskin | None |
In some cases it may be convenient to give columns aliases instead of using their original names. This can be accomplished as follows:
[6]:
%%sql
SELECT first_name AS FName, last_name AS LName, exam3 AS Last_Exam
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[6]:
FName | LName | Last_Exam |
---|---|---|
Rosina | Hovey | 99 |
Stanton | Mcgaugh | 40 |
Jonnie | Hillock | 72 |
Chu | Flax | 29 |
Rosalinda | Trippe | 82 |
Fran | Mount | 80 |
Shawnna | Coate | 61 |
Marcia | Hinkel | 41 |
Starr | Bachand | 85 |
Lottie | Champine | 62 |
Adelina | Brummitt | 99 |
Tenisha | Rippel | 79 |
Ludivina | Hoskin | None |
We can use the SELECT DISTINCT ...
syntax to retrieve all distinct values in given columns. As an example, we can use it to get a list of majors listed in the gradebook
table:
[7]:
%%sql
SELECT DISTINCT major
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[7]:
major |
---|
biology |
computer science |
mathematics |
Instead of extracting data as it appears in a table, we can modify it in various ways. For example, SQLite provides the ||
operator which concatenates strings. We can use it to combine columns with first and last names into a single column:
[8]:
%%sql
SELECT last_name || ", " || first_name, exam1, exam2, exam3
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[8]:
last_name || ", " || first_name | exam1 | exam2 | exam3 |
---|---|---|---|
Hovey, Rosina | 96 | 13 | 99 |
Mcgaugh, Stanton | 6 | 60 | 40 |
Hillock, Jonnie | 56 | 24 | 72 |
Flax, Chu | 72 | None | 29 |
Trippe, Rosalinda | None | None | 82 |
Mount, Fran | 71 | 5 | 80 |
Coate, Shawnna | 79 | 96 | 61 |
Hinkel, Marcia | 79 | 10 | 41 |
Bachand, Starr | None | 27 | 85 |
Champine, Lottie | None | None | 62 |
Brummitt, Adelina | None | 67 | 99 |
Rippel, Tenisha | 27 | 54 | 79 |
Hoskin, Ludivina | 27 | 55 | None |
The default name of a column created using some operation is the code specifying the operation. Aliases let us replace it with a more meaningful name:
[39]:
%%sql
SELECT last_name || ", " || first_name AS full_name, exam1, exam2, exam3
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[39]:
full_name | exam1 | exam2 | exam3 |
---|---|---|---|
Hovey, Rosina | 96 | 13 | 99 |
Mcgaugh, Stanton | 6 | 60 | 40 |
Hillock, Jonnie | 56 | 24 | 72 |
Flax, Chu | 72 | None | 29 |
Trippe, Rosalinda | None | None | 82 |
Mount, Fran | 71 | 5 | 80 |
Coate, Shawnna | 79 | 96 | 61 |
Hinkel, Marcia | 79 | 10 | 41 |
Bachand, Starr | None | 27 | 85 |
Champine, Lottie | None | None | 62 |
Brummitt, Adelina | None | 67 | 99 |
Rippel, Tenisha | 27 | 54 | 79 |
Hoskin, Ludivina | 27 | 55 | None |
NULL values and COALESCE¶
Lets say that we want to calculate for each student the sum of scores from the three exams. We can try to do it as follows:
[41]:
%%sql
SELECT first_name, last_name, exam1 + exam2 + exam3 AS exam_total
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[41]:
first_name | last_name | exam_total |
---|---|---|
Rosina | Hovey | 208 |
Stanton | Mcgaugh | 106 |
Jonnie | Hillock | 152 |
Chu | Flax | None |
Rosalinda | Trippe | None |
Fran | Mount | 156 |
Shawnna | Coate | 236 |
Marcia | Hinkel | 130 |
Starr | Bachand | None |
Lottie | Champine | None |
Adelina | Brummitt | None |
Tenisha | Rippel | 160 |
Ludivina | Hoskin | None |
In some cases this works as expected, but several rows show None
in the exam_total
column. The None
table entries correspond to the SQL NULL
value which signifies missing data. In the original gradebook
table several exam scores have the NULL
value, which may indicate e.g. that some students have not taken an exam. Any number added toNULL
results in NULL
(since it is not possible to determine what the missing data stands for), which results in NULL
values
in the column exam_total
.
In our example, it may be sensible to treat NULL
exam scores as zeros for the purpose of computing the total score. This can be accomplished using the COALESCE
function. This function takes a sequence of values as arguments and returns the first value which is not NULL
(or NULL
if there is no such value). For example, COALESCE(exam1, 0)
will return either the value of the column exam1
or 0, depending on whether exam1
is or is not NULL
:
[11]:
%%sql
SELECT first_name, last_name, COALESCE(exam1, 0) AS exam1_modified
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[11]:
first_name | last_name | exam1_modified |
---|---|---|
Rosina | Hovey | 96 |
Stanton | Mcgaugh | 6 |
Jonnie | Hillock | 56 |
Chu | Flax | 72 |
Rosalinda | Trippe | 0 |
Fran | Mount | 71 |
Shawnna | Coate | 79 |
Marcia | Hinkel | 79 |
Starr | Bachand | 0 |
Lottie | Champine | 0 |
Adelina | Brummitt | 0 |
Tenisha | Rippel | 27 |
Ludivina | Hoskin | 27 |
Using COALESCE
we can improve the query which computes the total score from the three exams:
[12]:
%%sql
SELECT first_name, last_name, COALESCE(exam1, 0) + COALESCE(exam2, 0) + COALESCE(exam3, 0) AS exam_total
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[12]:
first_name | last_name | exam_total |
---|---|---|
Rosina | Hovey | 208 |
Stanton | Mcgaugh | 106 |
Jonnie | Hillock | 152 |
Chu | Flax | 101 |
Rosalinda | Trippe | 82 |
Fran | Mount | 156 |
Shawnna | Coate | 236 |
Marcia | Hinkel | 130 |
Starr | Bachand | 112 |
Lottie | Champine | 62 |
Adelina | Brummitt | 166 |
Tenisha | Rippel | 160 |
Ludivina | Hoskin | 82 |
The CASE expression¶
Lets say that we want to assign letter grades based on exam 3 scores: ‘Great’ for scores above 80 points, ‘Pass’ for scores between 80 and 50 points, and ‘Fail’ for all other scores. This can be accomplished using the CASE
expression:
[13]:
%%sql
SELECT first_name, last_name, exam3,
CASE
WHEN exam3 > 80 THEN 'Great'
WHEN exam3 > 50 THEN 'Pass'
ELSE "Fail"
END AS grade
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[13]:
first_name | last_name | exam3 | grade |
---|---|---|---|
Rosina | Hovey | 99 | Great |
Stanton | Mcgaugh | 40 | Fail |
Jonnie | Hillock | 72 | Pass |
Chu | Flax | 29 | Fail |
Rosalinda | Trippe | 82 | Great |
Fran | Mount | 80 | Pass |
Shawnna | Coate | 61 | Pass |
Marcia | Hinkel | 41 | Fail |
Starr | Bachand | 85 | Great |
Lottie | Champine | 62 | Pass |
Adelina | Brummitt | 99 | Great |
Tenisha | Rippel | 79 | Pass |
Ludivina | Hoskin | None | Fail |
WHERE¶
We can use the WHERE
clause of the SELECT
statement to specify which rows of a table should be retrieved. For example, below we use it to select students who received at least 70 points on exam 1:
[14]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 > 70;
* sqlite:///gradebook_data.sqlite
Done.
[14]:
student_id | first_name | last_name | major | exam1 | exam2 | exam3 |
---|---|---|---|---|---|---|
5775 | Rosina | Hovey | biology | 96 | 13 | 99 |
3877 | Chu | Flax | mathematics | 72 | None | 29 |
7138 | Fran | Mount | biology | 71 | 5 | 80 |
5108 | Shawnna | Coate | computer science | 79 | 96 | 61 |
3548 | Marcia | Hinkel | biology | 79 | 10 | 41 |
Logical conditions can be combined using AND
, OR
and NOT
operators. Below we select students with scores above 50 points on each of the three exams:
[15]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 > 50 AND exam2 > 50 AND exam3 > 50;
* sqlite:///gradebook_data.sqlite
Done.
[15]:
student_id | first_name | last_name | major | exam1 | exam2 | exam3 |
---|---|---|---|---|---|---|
5108 | Shawnna | Coate | computer science | 79 | 96 | 61 |
Comparing NULL values¶
In some cases we may be interested in retrieving records which have (or do not have) the NULL
value in some column. The conditions column = NULL
and column != NULL
will not work, since it is not possible to determine if something has the same value as the value of unknown data. Instead, we can use the conditions column IS NULL
and column IS NOT NULL
.
As an example, below we select records of students with NULL
value in the exam1
column:
[16]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 IS NULL;
* sqlite:///gradebook_data.sqlite
Done.
[16]:
student_id | first_name | last_name | major | exam1 | exam2 | exam3 |
---|---|---|---|---|---|---|
7256 | Rosalinda | Trippe | computer science | None | None | 82 |
2712 | Starr | Bachand | computer science | None | 27 | 85 |
8911 | Lottie | Champine | computer science | None | None | 62 |
6245 | Adelina | Brummitt | mathematics | None | 67 | 99 |
LIKE operator¶
The LIKE
operator can be used to perform simple pattern matches in strings. Below we use it to select records of all students whose last names start with an “H”:
[17]:
%%sql
SELECT *
FROM gradebook
WHERE last_name LIKE 'H%';
* sqlite:///gradebook_data.sqlite
Done.
[17]:
student_id | first_name | last_name | major | exam1 | exam2 | exam3 |
---|---|---|---|---|---|---|
5775 | Rosina | Hovey | biology | 96 | 13 | 99 |
9667 | Jonnie | Hillock | mathematics | 56 | 24 | 72 |
3548 | Marcia | Hinkel | biology | 79 | 10 | 41 |
9165 | Ludivina | Hoskin | mathematics | 27 | 55 | None |
In patterns used by LIKE
the percentage symbol %
represents zero of more characters, and the underscore _
stands for a single character. By default pattern matches are case insensitive, so ‘H%’ will match both ‘Hello’ and ‘hello’.
ORDER¶
The ORDER
clause specifies in which order records retrieved by a query should be returned. We can use it, for example, to arrange students records according to the exam 3 scores:
[42]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3;
* sqlite:///gradebook_data.sqlite
Done.
[42]:
first_name | last_name | exam3 |
---|---|---|
Ludivina | Hoskin | None |
Chu | Flax | 29 |
Stanton | Mcgaugh | 40 |
Marcia | Hinkel | 41 |
Shawnna | Coate | 61 |
Lottie | Champine | 62 |
Jonnie | Hillock | 72 |
Tenisha | Rippel | 79 |
Fran | Mount | 80 |
Rosalinda | Trippe | 82 |
Starr | Bachand | 85 |
Rosina | Hovey | 99 |
Adelina | Brummitt | 99 |
By default, records are returned in the ascending order. NULL
is regarded as the smallest possible value, so it appears at the top of the list. By adding the DESC
keyword we can arrange records in the descending order:
[19]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC;
* sqlite:///gradebook_data.sqlite
Done.
[19]:
first_name | last_name | exam3 |
---|---|---|
Rosina | Hovey | 99 |
Adelina | Brummitt | 99 |
Starr | Bachand | 85 |
Rosalinda | Trippe | 82 |
Fran | Mount | 80 |
Tenisha | Rippel | 79 |
Jonnie | Hillock | 72 |
Lottie | Champine | 62 |
Shawnna | Coate | 61 |
Marcia | Hinkel | 41 |
Stanton | Mcgaugh | 40 |
Chu | Flax | 29 |
Ludivina | Hoskin | None |
LIMIT¶
The LIMIT
clause constrains the number of records returned by the SELECT
statement. For example, we can use it to get records of 5 students with the highest exam 3 scores:
[20]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC
LIMIT 5
* sqlite:///gradebook_data.sqlite
Done.
[20]:
first_name | last_name | exam3 |
---|---|---|
Rosina | Hovey | 99 |
Adelina | Brummitt | 99 |
Starr | Bachand | 85 |
Rosalinda | Trippe | 82 |
Fran | Mount | 80 |
LIMIT
can be used with an additional OFFSET
keyword. For example, LIMIT 3 OFFSET 2
will skip the first 2 records and return the following 3 records:
[21]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC
LIMIT 3 OFFSET 2
* sqlite:///gradebook_data.sqlite
Done.
[21]:
first_name | last_name | exam3 |
---|---|---|
Starr | Bachand | 85 |
Rosalinda | Trippe | 82 |
Fran | Mount | 80 |
Aggregate functions¶
Aggregate functions summarize values in a column. SQLite provides a few aggregate functions:
AVG(column)
returns the average of numerical values in a column.MIN(column)
andMAX(my_column)
return the minimal and maximal value of a column, respectively.SUM(column)
returns the sum of numerical values in a column.COUNT(column)
returns the number of non-NULL values in a column.COUNT(*)
returns the number of rows, regardless of their values.
Note. COUNT(column)
will count all non-NULL values, even if some values repeat. In order to count only distinct values we can use COUNT (DISTINCT column)
.
For example, below we count the number of non-NULL values in columns with exam scores.
[22]:
%%sql
SELECT COUNT(exam1), COUNT(exam2), COUNT(exam3)
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[22]:
COUNT(exam1) | COUNT(exam2) | COUNT(exam3) |
---|---|---|
9 | 10 | 12 |
Some values of the columns with exam scores repeat, the above counts include these repetition. We can obtain the number of unique scores in each column as follows:
[4]:
%%sql
SELECT COUNT(DISTINCT exam1), COUNT(DISTINCT exam2), COUNT(DISTINCT exam3)
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[4]:
COUNT(DISTINCT exam1) | COUNT(DISTINCT exam2) | COUNT(DISTINCT exam3) |
---|---|---|
7 | 10 | 11 |
Next, we compute the maximal score recorded for each exam:
[23]:
%%sql
SELECT MAX(exam1), MAX(exam2), MAX(exam3)
FROM gradebook;
* sqlite:///gradebook_data.sqlite
Done.
[23]:
MAX(exam1) | MAX(exam2) | MAX(exam3) |
---|---|---|
96 | 96 | 99 |
GROUP BY¶
The GROUP BY expression
clause of the SELECT
statement adds more flexibility to the aggregate functions. It divides records into groups in such a way, that the specified expression
has the same value for all records in a group. Aggregate functions can be then used to summarize columns in each group separately.
As an example, below we group records based on the value of the major
column. Then we calculate the average exam scores for each group:
[28]:
%%sql
SELECT major, AVG(exam1), AVG(exam2), AVG(exam3)
FROM gradebook
GROUP BY major;
* sqlite:///gradebook_data.sqlite
Done.
[28]:
major | AVG(exam1) | AVG(exam2) | AVG(exam3) |
---|---|---|---|
biology | 82.0 | 9.333333333333334 | 73.33333333333333 |
computer science | 37.333333333333336 | 59.25 | 68.16666666666667 |
mathematics | 51.666666666666664 | 48.666666666666664 | 66.66666666666667 |
The table above is somewhat difficult to read, because it contains too many decimal digits. We can improve it using the ROUND(value, num_digits)
function, which rounds numerical values to a given number of digits:
[31]:
%%sql
SELECT major,
ROUND(AVG(exam1), 2) AS exam1_avg,
ROUND(AVG(exam2), 2) AS exam1_avg,
ROUND(AVG(exam3), 2) AS exam3_avg
FROM gradebook
GROUP BY major;
* sqlite:///gradebook_data.sqlite
Done.
[31]:
major | exam1_avg | exam1_avg_1 | exam3_avg |
---|---|---|---|
biology | 82.0 | 9.33 | 73.33 |
computer science | 37.33 | 59.25 | 68.17 |
mathematics | 51.67 | 48.67 | 66.67 |
SQLite provides an additional aggregate function GROUP_CONCAT(column, separator)
, which concatenates all non-NULL values in a column, separating them by the specified separator string. We can use it to list exam 1 scores obtained by students in each major:
[32]:
%%sql
SELECT major, GROUP_CONCAT(exam1, ", ") AS exam1_scores
FROM gradebook
GROUP BY major;
* sqlite:///gradebook_data.sqlite
Done.
[32]:
major | exam1_scores |
---|---|
biology | 96, 71, 79 |
computer science | 6, 79, 27 |
mathematics | 56, 72, 27 |
HAVING¶
GROUP BY
can be used with an additional HAVING
clause. In such a case, only groups that satisfy the condition given after HAVING
will be returned. Below we use it to check which majors averaged more than 50 points on exam 1.
[38]:
%%sql
SELECT major, ROUND(AVG(exam1),2) AS exam1_avg
FROM gradebook
GROUP BY major
HAVING exam1_avg > 50
* sqlite:///gradebook_data.sqlite
Done.
[38]:
major | exam1_avg |
---|---|
biology | 82.0 |
mathematics | 51.67 |