Posted 7 years ago.
After the introduction about the CTE, it is important to understand how sets can be joined together in a single set.
All the material needed to play with the following exercise can be found on this gist. There are two tables: one containing sellers an another one containing sales. If you download the gist, be sure to download the RAW file and then import it in a freshly created database using \i
in your psql client.
Let’s say we want something as simple as the count of sales per day for the seller id = 3 to draw a graphic. Knowing the aggregate functions, the according query might be as easy as:
select
date_trunc('day', sale_ts) as day,
count(sale_id) as sales,
sum(total_price_ct)
from
sale
where
seller_id = 3
group by 1
order by 1 asc
;
Note the field position alias in the group by
and order by
statements. They refer to the first field of the select part.
So this returns what appears to be the good answer:
┌────────────────────────┬───────┬───────┐
│ day │ sales │ sum │
├────────────────────────┼───────┼───────┤
│ 2015-04-13 00:00:00+00 │ 3 │ 2950 │
│ 2015-04-14 00:00:00+00 │ 1 │ 6780 │
│ 2015-04-15 00:00:00+00 │ 4 │ 23390 │
│ 2015-04-16 00:00:00+00 │ 1 │ 2640 │
│ 2015-04-18 00:00:00+00 │ 1 │ 130 │
└────────────────────────┴───────┴───────┘
(5 rows)
The problem here is simple too: there are no sales on the 17th and 19th of April and we need this knowledge to draw a correct graphic. Since the information is not present in the input set (the table) it will be hard to output there are no informations on information we lack.
Fortunately Postgres grants us with a set generator, the function generate_series
. Although it is always recommendable to read the official documentation, we can just say it is a set returning function that can generate series of integers or timestamps:
select generate_series(1,5);
┌─────────────────┐
│ generate_series │
├─────────────────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
└─────────────────┘
(5 rows)
Since this function generates sets, it is used to create data that can be joined against, by example, a set of dates!
select generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval);
┌─────────────────────┐
│ generate_series │
├─────────────────────┤
│ 2015-04-13 00:00:00 │
│ 2015-04-14 00:00:00 │
│ 2015-04-15 00:00:00 │
│ 2015-04-16 00:00:00 │
│ 2015-04-17 00:00:00 │
│ 2015-04-18 00:00:00 │
│ 2015-04-19 00:00:00 │
│ 2015-04-20 00:00:00 │
└─────────────────────┘
(8 rows)
The SQL query turns to a promising:
select
week.day::date,
count(sale.sale_id) as sale_count
from
generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
join sale on sale.sale_ts::date = week.day
where
sale.seller_id = 3
group by 1
order by 1 asc
Which returns a disappointing:
┌────────────┬────────────┐
│ day │ sale_count │
├────────────┼────────────┤
│ 2015-04-13 │ 3 │
│ 2015-04-14 │ 1 │
│ 2015-04-15 │ 4 │
│ 2015-04-16 │ 1 │
│ 2015-04-18 │ 1 │
└────────────┴────────────┘
(5 rows)
To understand what is wrong with the above query, it is necessary to understand in which order the SQL processor treats a query.
The query above is not turned right because of two things:
from
generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
join sale on sale.sale_ts::date = week.day
Sales from all sellers are extracted joined by their selling day. The first error here is the join type: a simple join
statement is by default an inner join
which means all rows must have a corresponding row in both sets to be fetched. If a date has no sales, there are no matching rows in the sale
set hence this date row is discarded. A left join
tells the processor to fetch all rows from the first set even though they do not have a corresponding row in the joined set. This would allow the FROM
statement to retrieve all dates whatever there are sales or not this day.
where
sale.seller_id = 3
And here, rows from the input set are filtered so only rows with a seller_id = 3 are kept. This way, days with no sales hence no seller_id are removed from the results. This is not what we want.
We want to count all sales made by a given seller for each day. A condition in the WHERE clause would be a way to remove days (like Sundays or public holidays) but the given seller has to be part of the join condition:
select
week.day::date,
count(sale.sale_id) as sale_count
from
generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
left join sale on sale.seller_id = 3 and sale.sale_ts::date = week.day
group by 1
order by 1 asc
;
┌────────────┬────────────┐
│ day │ sale_count │
├────────────┼────────────┤
│ 2015-04-13 │ 3 │
│ 2015-04-14 │ 1 │
│ 2015-04-15 │ 4 │
│ 2015-04-16 │ 1 │
│ 2015-04-17 │ 0 │
│ 2015-04-18 │ 1 │
│ 2015-04-19 │ 0 │
│ 2015-04-20 │ 0 │
└────────────┴────────────┘
(8 rows)
What if now we want the details for all sellers ? What do we want ?
For each day we want the sales count for all sellers. In another way, we want for each seller his sales count for all days. This Cartesian product is the only join that can not be illustrated by a Venn diagram: a cross join.
select
week.day::date,
seller.name,
count(sale.sale_id) as sale_count
from
generate_series('2015-04-13'::timestamp, '2015-04-20', '1 day'::interval) as week (day)
cross join seller
left join sale on sale.seller_id = seller.seller_id and sale.sale_ts::date = week.day
group by 1, 2
order by 1, 2 asc
;
┌────────────┬─────────────────────────┬────────────┐
│ day │ name │ sale_count │
├────────────┼─────────────────────────┼────────────┤
│ 2015-04-13 │ Delphia Kozey │ 3 │
│ 2015-04-13 │ Dr. Ernie Kilback Sr. │ 10 │
│ 2015-04-13 │ Mr. Johnson Kiehn │ 3 │
│ 2015-04-13 │ Prof. Angelo Parker DVM │ 3 │
│ 2015-04-14 │ Delphia Kozey │ 7 │
│ 2015-04-14 │ Dr. Ernie Kilback Sr. │ 5 │
│ 2015-04-14 │ Mr. Johnson Kiehn │ 4 │
│ 2015-04-14 │ Prof. Angelo Parker DVM │ 1 │
│ 2015-04-15 │ Delphia Kozey │ 3 │
│ 2015-04-15 │ Dr. Ernie Kilback Sr. │ 3 │
…
│ 2015-04-20 │ Dr. Ernie Kilback Sr. │ 0 │
│ 2015-04-20 │ Mr. Johnson Kiehn │ 0 │
│ 2015-04-20 │ Prof. Angelo Parker DVM │ 0 │
└────────────┴─────────────────────────┴────────────┘
(32 rows)
The article about joins on stack overflow is a real gold mine to understand more deeply about joins. The examples given above are simplistic and more complex queries may stack tens of joins. Since joins have a direct impact on performances, fields used to join sets and join conditions will be examined as excellent candidates for indexes. More on this on Markus Winand’s website (and awesome book). Enjoy.