Here’s a non-exhaustive laundry list of SQL concepts that I’ve found useful. You could think of them as intermediate? advanced? concepts that anyone doing data science or machine learning should probably know. I tried to order the concepts list is loosely in order of complexity, however that may vary from person to person.

All the examples here are done using the postgres Docker image available here (I talk about how to work with a postgres container in a separate post). Below we create the table synthetic and populate it with data that I’ll use to illustrate the various concepts.

CREATE TABLE synthetic (id VARCHAR(1), var1 INT, var2 INT);
CREATE TABLE dim (id VARCHAR(1), meta VARCHAR(100));

INSERT INTO synthetic (id, var1, var2)
VALUES ('A', 1, 1),
('A', 1, 0),
('A', 2, 1),
('B', 5, 1),
('B', 8, 0),
('C', 3, 0),
('C', 4, 1),
('C', 9, 1),
('D', 7, 0);

INSERT INTO dim (id, meta)
VALUES ('A', 'supercali'),
('B', 'fragilistic'),
('C', 'expiali'),
('D', 'docious'),
('E', '!');

SELECT * FROM synthetic;
id | var1 | var2
----+------+------
 A  |    1 |    1
 A  |    1 |    0
 A  |    2 |    1
 B  |    5 |    1
 B  |    8 |    0
 C  |    3 |    0
 C  |    4 |    1
 C  |    9 |    1
 D  |    7 |    0

Processing Order

It’s useful to know the processing order of a SQL query. It helps you understand when certain parts of your query are executed which will help you optimize your code. For example, if you only need to join data on a filtered set of rows then it might make sense to create CTEs that first filter the data and then perform joins using those CTEs.

Basically, the processing order of a SQL query goes something like this (see this doc for more detail),

  1. FROM
  2. JOIN
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. SELECT (including aggregations, window functions, etc)
  7. DISTINCT
  8. ORDER BY
  9. LIMIT

Joins

Joins are pretty basic and pretty fundamental. Sometimes (actually most of the time) your information is split across multiple tables and you need to join them together in order to get all the information in one place.

SELECT a.id, b.meta
FROM synthetic a
JOIN dim b
ON a.id = b.id;
id |    meta
----+-------------
 A  | supercali
 A  | supercali
 A  | supercali
 B  | fragilistic
 B  | fragilistic
 C  | expiali
 C  | expiali
 C  | expiali
 D  | docious

There’s also the anti-join. This can come in handy for problems when you want to select the records that don’t have a match defined by the join. In this case we want any records that are in dim that are not in synthetic.

SELECT b.*
FROM synthetic a
RIGHT JOIN dim b
ON a.id = b.id
WHERE a.id IS NULL;
id | meta
----+------
 E  | !

Another way we could have done this anti-join is with NOT IN.

SELECT *
FROM dim
WHERE id NOT IN (SELECT id FROM synthetic);
id | meta
----+------
 E  | !

Case Statements (Conditionals)

This allows you to implement if-else conditions on columns in your table. The typical formulation of case statements is CASE WHEN <condition> THEN <result> ELSE <result> END <new column name>.

SELECT *
	, CASE WHEN var2 = 1 THEN 'Y'
	       WHEN var2 = 0 THEN 'N'
	       ELSE Null END var3
FROM synthetic;
id | var1 | var2 | var3
----+------+------+------
 A  |    1 |    1 | Y
 A  |    1 |    0 | N
 A  |    2 |    1 | Y
 B  |    5 |    1 | Y
 B  |    8 |    0 | N
 C  |    3 |    0 | N
 C  |    4 |    1 | Y
 C  |    9 |    1 | Y
 D  |    7 |    0 | N

Group By

This is another obvious one. But because it’s so commonly used it’s worth mentioning. A GROUP BY enables you to apply aggregation functions (COUNT, SUM, AVG, etc) to parts of a table. These parts are provided by an identifier (or set of identifiers) which allows you to partition the data into groups that you want to apply your aggregation functions over.

SELECT id
	, COUNT(*) id_count
	, SUM(var1) var1_sum
FROM synthetic
GROUP BY id;
id | id_count | var1_sum
----+----------+----------
 B  |        2 |       13
 C  |        3 |       16
 D  |        1 |        7
 A  |        3 |        4

Where vs Having

Most people know of the WHERE statement. It allows you to filter (reduce the rows of your table) by a specific condition.

You can’t use WHERE on a column that was created by an aggregate function. If you do you’ll be performing the filter before the aggregation calculations take place. You can use the HAVING statement on a column created by an aggregate function to filter results after the aggregations have taken place.

SELECT *
FROM synthetic
WHERE var1 < 5;
id | var1 | var2
----+------+------
 A  |    1 |    1
 A  |    1 |    0
 A  |    2 |    1
 C  |    3 |    0
 C  |    4 |    1
SELECT id
	, COUNT(*) id_count
FROM synthetic
WHERE var1 = 1
GROUP BY id
HAVING id = 'A';
id | id_count
----+----------
 A  |        2

Window Functions

Window functions allow you to perform row-wise calculations on partitions (groups) of your data while allowing you to consider other rows in the calculation. This is useful when you want to create ranks, cumulative sums, rolling averages, etc. There is no group by aggregation taking place. The output table produced after applying a window function will always have the same number of rows as the original table.

SELECT id
	, var1
	, RANK() OVER (ORDER BY var1) var1_rank
	, LAG(var1, 1, Null) OVER (PARTITION BY Id ORDER BY var1) var1_lag
FROM synthetic
ORDER BY id, var1;
id | var1 | var1_rank | var1_lag
----+------+-----------+----------
 A  |    1 |         1 |
 A  |    1 |         1 |        1
 A  |    2 |         3 |        1
 B  |    5 |         6 |
 B  |    8 |         8 |        5
 C  |    3 |         4 |
 C  |    4 |         5 |        3
 C  |    9 |         9 |        4
 D  |    7 |         7 |

Rows Preceding/Following

How do window functions know which data to consider before and/or after the row in question? It does this by using ROWS # PRECEDING # FOLLOWING, where # is how many rows you want the function to consider preceding/following the row you are currently at.

For example, to compute the cumulative sum we use the SUM window function and specify ROWS UNBOUNDED PRECEDING. This means that we are considering all previous rows (within the partition) in addition to the current row when calculating the cumulative sum. If we don’t specify this then we’ll just get the entire sum for each partition assigned to each row of the partition.

SELECT id
	, var1
	, SUM(var1) OVER (PARTITION BY Id ORDER BY var1 ROWS UNBOUNDED PRECEDING) var1_cumsum_1
	, SUM(var1) OVER (PARTITION BY Id ORDER BY var1 ROWS 1 PRECEDING) var1_cumsum_2
FROM synthetic;
id | var1 | var1_cumsum_1 | var1_cumsum_2
----+------+---------------+---------------
 A  |    1 |             1 |             1
 A  |    1 |             2 |             2
 A  |    2 |             4 |             3
 B  |    5 |             5 |             5
 B  |    8 |            13 |            13
 C  |    3 |             3 |             3
 C  |    4 |             7 |             7
 C  |    9 |            16 |            13
 D  |    7 |             7 |             7

Here’s what happens if we don’t provide PRECEDING or FOLLOWING.

SELECT id
	, var1
	, SUM(var1) OVER (PARTITION BY Id ORDER BY var1) var1_cumsum_1
	, SUM(var1) OVER (PARTITION BY Id ORDER BY var1) var1_cumsum_2
FROM synthetic;
id | var1 | var1_cumsum_1 | var1_cumsum_2
----+------+---------------+---------------
 A  |    1 |             2 |             2
 A  |    1 |             2 |             2
 A  |    2 |             4 |             4
 B  |    5 |             5 |             5
 B  |    8 |            13 |            13
 C  |    3 |             3 |             3
 C  |    4 |             7 |             7
 C  |    9 |            16 |            16
 D  |    7 |             7 |             7

Rank vs Dense Rank vs Row Number

This one always trips me up. Dense rank and rank will be the same except when there are ties. In tie situations dense will always give you consecutive ranks (no gaps in ranking values) where as rank will skip ranks in situations where there are ties.

Rank is similar to row number, however rank provides the same value in situations where there are ties.

SELECT id
	, var1
	, RANK() OVER (ORDER BY var1) var1_rank
	, DENSE_RANK() OVER (ORDER BY var1) var1_dense_rank
	, ROW_NUMBER() OVER (ORDER BY var1) var1_row_number
FROM synthetic
ORDER BY var1;
id | var1 | var1_rank | var1_dense_rank | var1_row_number
----+------+-----------+-----------------+-----------------
 A  |    1 |         1 |               1 |               1
 A  |    1 |         1 |               1 |               2
 A  |    2 |         3 |               2 |               3
 C  |    3 |         4 |               3 |               4
 C  |    4 |         5 |               4 |               5
 B  |    5 |         6 |               5 |               6
 D  |    7 |         7 |               6 |               7
 B  |    8 |         8 |               7 |               8
 C  |    9 |         9 |               8 |               9

Views

A view represents a table. It’s not a table that’s stored on disk. Rather the SQL code to construct the table is stored. Typically, every time you call a view, the SQL code that defines that view is executed to generate the table. (I say typically since sometimes the results can be stored/read from a cache). You’ll notice that a view is not stored in the “Table” section of the database, but in a separate “View” section… because it ain’t a table.

CREATE VIEW synthetic_agg AS
SELECT id
	, COUNT(*) id_count
FROM synthetic
GROUP BY id;

SELECT * FROM synthetic_agg;
id | id_count
----+----------
 B  |        2
 C  |        3
 D  |        1
 A  |        3

User Defined Functions

Just as in any programing language, functions in SQL allow you to parameterize your query. The way you specify functions will vary depending on the variant of SQL used by your choice of SQL server provider.

Here’s a function that returns a filtered table that has the same schema as the synthetic table.

CREATE OR REPLACE FUNCTION synthetic_filter(param INT)
RETURNS SETOF synthetic
AS
$$
	SELECT *
	FROM synthetic
	WHERE var1 = param;
$$
LANGUAGE SQL;
id | var1 | var2
----+------+------
 A  |    1 |    1
 A  |    1 |    0

Here’s a function that returns a schema that’s different from the synthetic table (we have to define the schema of the returning table).

CREATE OR REPLACE FUNCTION synthetic_grouped(param INT)
RETURNS TABLE (id VARCHAR(1), id_count INT)
AS
$$
	SELECT id, COUNT(*) id_count
	FROM synthetic
	WHERE var2 = param
	GROUP BY id;
$$
LANGUAGE SQL;

SELECT * from synthetic_grouped(1);
id | id_count
----+----------
 A  |        2
 B  |        1
 C  |        2

Finally, here’s a function that returns a single value (in this case a VARCHAR). These return types are useful to define complicated row-wise operations or for operations that need to be reused.

CREATE OR REPLACE FUNCTION var2_condition(param INT)
RETURNS VARCHAR(1)
AS
$$
	SELECT CASE WHEN param = 1 THEN 'Y'
	       WHEN param = 0 THEN 'N'
	       ELSE Null END;
$$
LANGUAGE SQL;

SELECT *, var2_condition(var2) var2_char
FROM synthetic;
id | var1 | var2 | var2_char
----+------+------+-----------
 A  |    1 |    1 | Y
 A  |    1 |    0 | N
 A  |    2 |    1 | Y
 B  |    5 |    1 | Y
 B  |    8 |    0 | N
 C  |    3 |    0 | N
 C  |    4 |    1 | Y
 C  |    9 |    1 | Y
 D  |    7 |    0 | N

CTE (Common Table Expression)

CTEs allow you to reference a table created by one query in another query, without having to write the table to disk (in other words you don’t have to use CREATE TABLE). Basically you’re keeping the table in memory temporarily.

WITH tmp_table AS (
	SELECT *
	FROM synthetic
	WHERE var2 = 1
)

SELECT *
FROM tmp_table;
id | var1 | var2
----+------+------
 A  |    1 |    1
 A  |    2 |    1
 B  |    5 |    1
 C  |    4 |    1
 C  |    9 |    1

Knowing When to Write a Table to Disk

Sometimes it’s obvious when to create a table. Typically you’ll do it when you want to share the result with someone or some service. But sometimes it’s worth breaking up a complicated ETL workflow into tables that are written to disk. This helps you do QA on different components of the ETL workflow. It also helps when a part of the workflow errors out for some reason. You don’t have to start the entire ETL workflow from the beginning. The process can just start at the point where the error took place. Just make sure you clean up tables that aren’t actively being used so that they don’t unnecessarily consume space.

It’s important to know the tradeoffs though. Writing to disk can be slower since you have to spend time writing/reading from disk. Keeping things in memory circumvents this inefficiency… at the expense of holding the data in memory (typically you have more space on disk than you do in memory). If your code errors out and you kept all your data in memory then whatever transformations you did to that data won’t be saved.