A short focus on joins

Posted 3 years ago.

After the introduction about the CTE, it is important to understand how sets can be joined together in a single set.

All the material needed to play with the following exercise can be found on this gist. There are two tables: one containing sellers an another one containing sales. If you download the gist, be sure to download the RAW file and then import it in a freshly created database using \i in your psql client.

Let’s say we want something as simple as the count of sales per day for the seller id = 3 to draw a graphic. Knowing the aggregate functions, the according query might be as easy as:

select
  date_trunc('day', sale_ts) as day,
  count(sale_id) as sales,
  sum(total_price_ct)
from
  sale
where
  seller_id = 3
group by 1
order by 1 asc
;

Note the field position alias in the group by and order by statements. They refer to the first field of the select part.

So this returns what appears to be the good answer:

┌────────────────────────┬───────┬───────┐
│          day           │ sales │  sum  │
├────────────────────────┼───────┼───────┤
│ 2015-04-13 00:00:00+00 │     3 │  2950 │
│ 2015-04-14 00:00:00+00 │     1 │  6780 │
│ 2015-04-15 00:00:00+00 │     4 │ 23390 │
│ 2015-04-16 00:00:00+00 │     1 │  2640 │
│ 2015-04-18 00:00:00+00 │     1 │   130 │
└────────────────────────┴───────┴───────┘
(5 rows)

The problem here is simple too: there are no sales on the 17th and 19th of April and we need this knowledge to draw a correct graphic. Since the information is not present in the input set (the table) it will be hard to output there are no informations on information we lack.

Fortunately Postgres grants us with a set generator, the function generate_series. Although it is always recommendable to read the official documentation, we can just say it is a set returning function that can generate series of integers or timestamps:

select generate_series(1,5);

┌─────────────────┐
│ generate_series │
├─────────────────┤
│               1 │
│               2 │
│               3 │
│               4 │
│               5 │
└─────────────────┘
(5 rows)

Since this function generates sets, it is used to create data that can be joined against, by example, a set of dates!

select generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval);

┌─────────────────────┐
│   generate_series   │
├─────────────────────┤
│ 2015-04-13 00:00:00 │
│ 2015-04-14 00:00:00 │
│ 2015-04-15 00:00:00 │
│ 2015-04-16 00:00:00 │
│ 2015-04-17 00:00:00 │
│ 2015-04-18 00:00:00 │
│ 2015-04-19 00:00:00 │
│ 2015-04-20 00:00:00 │
└─────────────────────┘
(8 rows)

The SQL query turns to a promising:

select
  week.day::date,
  count(sale.sale_id) as sale_count
from
  generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
  join sale on sale.sale_ts::date = week.day
where
  sale.seller_id = 3
group by 1
order by 1 asc

Which returns a disappointing:

┌────────────┬────────────┐
│    day     │ sale_count │
├────────────┼────────────┤
│ 2015-04-13 │          3 │
│ 2015-04-14 │          1 │
│ 2015-04-15 │          4 │
│ 2015-04-16 │          1 │
│ 2015-04-18 │          1 │
└────────────┴────────────┘
(5 rows)

To understand what is wrong with the above query, it is necessary to understand in which order the SQL processor treats a query.

  1. FROM. Input sets and how they interact with each other are computed.
  2. WHERE clause filters the rows of the input sets.
  3. GROUP BY is evaluated to understand how to reduce sets.
  4. HAVING filters grouped data.
  5. SELECT defines projection from the filtered sets.
  6. ORDER BY sort the set defined by the projection.

The query above is not turned right because of two things:

  from
    generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
    join sale on sale.sale_ts::date = week.day

Sales from all sellers are extracted joined by their selling day. The first error here is the join type: a simple join statement is by default an inner join which means all rows must have a corresponding row in both sets to be fetched. If a date has no sales, there are no matching rows in the sale set hence this date row is discarded. A left join tells the processor to fetch all rows from the first set even though they do not have a corresponding row in the joined set. This would allow the FROM statement to retrieve all dates whatever there are sales or not this day.

where
  sale.seller_id = 3

And here, rows from the input set are filtered so only rows with a seller_id = 3 are kept. This way, days with no sales hence no seller_id are removed from the results. This is not what we want.

What do we want ?

We want to count all sales made by a given seller for each day. A condition in the WHERE clause would be a way to remove days (like Sundays or public holidays) but the given seller has to be part of the join condition:

select
  week.day::date,
  count(sale.sale_id) as sale_count
from
  generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
  left join sale on sale.seller_id = 3 and sale.sale_ts::date = week.day
group by 1
order by 1 asc
;

┌────────────┬────────────┐
│    day     │ sale_count │
├────────────┼────────────┤
│ 2015-04-13 │          3 │
│ 2015-04-14 │          1 │
│ 2015-04-15 │          4 │
│ 2015-04-16 │          1 │
│ 2015-04-17 │          0 │
│ 2015-04-18 │          1 │
│ 2015-04-19 │          0 │
│ 2015-04-20 │          0 │
└────────────┴────────────┘
(8 rows)

A more general example

What if now we want the details for all sellers ? What do we want ?

For each day we want the sales count for all sellers. In another way, we want for each seller his sales count for all days. This Cartesian product is the only join that can not be illustrated by a Venn diagram: a cross join.

select
  week.day::date,
  seller.name,
  count(sale.sale_id) as sale_count
from
  generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
  cross join seller
  left join sale on sale.seller_id = seller.seller_id and sale.sale_ts::date = week.day
group by 1, 2
order by 1, 2 asc
;

┌────────────┬─────────────────────────┬────────────┐
│    day     │          name           │ sale_count │
├────────────┼─────────────────────────┼────────────┤
│ 2015-04-13 │ Delphia Kozey           │          3 │
│ 2015-04-13 │ Dr. Ernie Kilback Sr.   │         10 │
│ 2015-04-13 │ Mr. Johnson Kiehn       │          3 │
│ 2015-04-13 │ Prof. Angelo Parker DVM │          3 │
│ 2015-04-14 │ Delphia Kozey           │          7 │
│ 2015-04-14 │ Dr. Ernie Kilback Sr.   │          5 │
│ 2015-04-14 │ Mr. Johnson Kiehn       │          4 │
│ 2015-04-14 │ Prof. Angelo Parker DVM │          1 │
│ 2015-04-15 │ Delphia Kozey           │          3 │
│ 2015-04-15 │ Dr. Ernie Kilback Sr.   │          3 │
  …
│ 2015-04-20 │ Dr. Ernie Kilback Sr.   │          0 │
│ 2015-04-20 │ Mr. Johnson Kiehn       │          0 │
│ 2015-04-20 │ Prof. Angelo Parker DVM │          0 │
└────────────┴─────────────────────────┴────────────┘
(32 rows)

The article about joins on stack overflow is a real gold mine to understand more deeply about joins. The examples given above are simplistic and more complex queries may stack tens of joins. Since joins have a direct impact on performances, fields used to join sets and join conditions will be examined as excellent candidates for indexes. More on this on Markus Winand’s website (and awesome book). Enjoy.