A short focus on Common Table Expressions

Posted 3 years ago.

Common Table Expressions aka CTEs are one of Postgresql's most mysterious feature for SQL newcomers. CTE are introduced by the standard ISO SQL99 and are supported by Postgres since version 8.4. To be short, they replace in a more convenient way nested SELECT statements. This example is about employees of a company divided in departments. If you want to get your hands dirty with this one, here is the SQL file you should use to create a ready to go database :

Here is the GIST with the data and the structures.. Be sure you download the file and insert it through a \i psql command to avoid paste & copy annoyances with the COPY commands where data are separated using tabs.

The previous article was about window functions so let's now take the example where we need to fetch all the data related to one given employee augmented with the employee_id of the next employee older than him and the same with the just younger so we can provide links to them on the page showing all the attributes of an employee. The first tempting query is:

select
  employee_id,
  name,
  birthdate,
  lead(employee_id) over employee_age_wdw as older_employee_id,
  lag(employee_id) over employee_age_wdw as younger_employee_id
from
  employee
where
  employee_id = 54
window
  employee_age_wdw as (partition by department_id order by birthdate asc)
;

But this returns a pathetic

┌─────────────┬────────────────────┬────────────────────────┬───────────────────┬─────────────────────┐
│ employee_id │        name        │       birthdate        │ older_employee_id │ younger_employee_id │
├─────────────┼────────────────────┼────────────────────────┼───────────────────┼─────────────────────┤
│          54 │ Carlos Cruickshank │ 1971-08-26 09:33:10+00 │                 ¤ │                   ¤ │
└─────────────┴────────────────────┴────────────────────────┴───────────────────┴─────────────────────┘
(1 row)

The reason is simple: the data window exists within the rows filtered by the where condition. In this case, there is only one row remaining, hence the NULL in older and younger. If you are a web developer, your mind is probably blocking the only solution available: we need to nest a subselect so the window can apply on all employees. Subselects are so tedious to write and to debug, web developers avoid them at all costs. Fortunately, CTE are here for that. Their goal is to define and name prepared sets so they can be used in other statements as they were simple tables:

with
  type1 AS (select … ),
  type2 AS (selectfrom type1 …),
  …
selectfrom something join type2 …

It is exactly what SQL has been made for, defining sets that can be joined together again and again. As a first step, let's write the query that fetch all employees from a given user's department with their younger and older workmate

select
  e1.employee_id,
  e1.birthdate,
  lead(e1.employee_id) over employee_age_wdw as younger_employee_id,
  lag(e1.employee_id) over employee_age_wdw as older_employee_id
from employee e1
  join employee e2 using (department_id)
where e2.employee_id = 27
window
  employee_age_wdw as (partition by e1.department_id order by e1.birthdate asc)
;

It returns the following:

┌─────────────┬────────────────────────┬─────────────────────┬───────────────────┐
│ employee_id │       birthdate        │ younger_employee_id │ older_employee_id │
├─────────────┼────────────────────────┼─────────────────────┼───────────────────┤
│           4 │ 1956-05-13 22:16:15+00 │                  69 │                 ¤ │
│          69 │ 1956-07-02 02:26:05+00 │                  63 │                 4 │
│          63 │ 1957-02-23 02:44:38+00 │                  27 │                69 │
│          27 │ 1957-09-07 15:46:23+00 │                  35 │                63 │
│          35 │ 1958-11-12 05:57:51+00 │                   9 │                27 │
│           9 │ 1965-12-04 16:26:57+00 │                  57 │                35 │
│          57 │ 1966-07-20 17:04:49+00 │                  60 │                 9 │
│          60 │ 1975-03-15 16:52:45+00 │                  55 │                57 │
│          55 │ 1980-07-04 01:14:31+00 │                  89 │                60 │
│          89 │ 1981-02-20 12:21:33+00 │                  26 │                55 │
│          26 │ 1984-01-11 08:08:15+00 │                  83 │                89 │
│          83 │ 1984-09-13 23:26:20+00 │                  36 │                26 │
│          36 │ 1990-09-02 21:38:28+00 │                  56 │                83 │
│          56 │ 1995-01-07 17:18:17+00 │                   3 │                36 │
│           3 │ 1995-11-05 23:25:00+00 │                   ¤ │                56 │
└─────────────┴────────────────────────┴─────────────────────┴───────────────────┘
(15 rows)

Using a CTE, it is possible to create a set containing those rows and join them on a particular employee:

with
  employee_neighbour as ( -- previous query
    select
      e1.employee_id,
      e1.birthdate,
      lead(e1.employee_id) over employee_age_wdw as younger_employee_id,
      lag(e1.employee_id) over employee_age_wdw as older_employee_id
    from employee e1
      join employee e2 using (department_id)
    where e2.employee_id = 27
    window
      employee_age_wdw as (partition by e1.department_id order by e1.birthdate asc)
    )
select -- return a single employee informations
  e.employee_id,
  e.name,
  e.birthdate,
  en.older_employee_id,
  en.younger_employee_id
from
  employee e
    join employee_neighbour en using (employee_id)
where
  employee_id = 27
;

┌─────────────┬─────────────────────┬────────────────────────┬───────────────────┬─────────────────────┐
│ employee_id │        name         │       birthdate        │ older_employee_id │ younger_employee_id │
├─────────────┼─────────────────────┼────────────────────────┼───────────────────┼─────────────────────┤
│          27 │ Mr. Keaton Schulist │ 1957-09-07 15:46:23+00 │                63 │                  35 │
└─────────────┴─────────────────────┴────────────────────────┴───────────────────┴─────────────────────┘
(1 row)

It is disgraceful to use twice the same parameter employee_id = 27. This may be an indication something is not done in the right order. What if swapping both statements, first fetching employee's information then join his workmates ?

with
  employee_alone as ( -- Fetch employee information
    select employee_id, name, birthdate, department_id from employee where employee_id = 27
  ),
  employee_neighbour as ( -- Get work mates age neighbours in the same department
    select
      e.employee_id,
      lead(e.employee_id) over employee_age_wdw as younger_employee_id,
      lag(e.employee_id) over employee_age_wdw as older_employee_id
    from employee e
      join employee_alone ea using (department_id)
    window
      employee_age_wdw as (order by e.birthdate asc)
    )
select -- combine employee with neighbours
  ea.employee_id,
  ea.name,
  ea.birthdate,
  en.older_employee_id,
  en.younger_employee_id
from
  employee_alone ea
    join employee_neighbour en using (employee_id)
;

This is much better, the query is easier to read and understand, the WHERE statements did disappear, well…

As final thoughts of this article: understanding CTEs leads to understanding how SQL works:

  • data are structured in sets of tuples
  • sets are combined using joins
  • sets are shaped through projections (select)
  • again and again

It has to be said that CTEs are optimization barriers for Postgresql's planner so they are seen as a black boxes by the optimizer as opposed to nested queries.

Enjoy.