A short focus on window functions

Posted 3 years ago.

In this hopefully short article, we will try to understand what window functions are and why since 2003, there are the application programmer's best friend. We will first get through well known aggregate functions to spot the differences between theses functions families.

Let's say we have the data about some European factories with production information. See what we can do with Postgres

Setup

If you are willing to get your hands dirty with this article, fire up a Postgres client and create the following table:

create table factory (
    name varchar primary key,
    country varchar not null,
    production int4 not null check(production >=0)
);

Pretty simple! Let's feed it with some data:

copy factory from stdin delimiter ',';

Then, paste the following lines:

un,france,2063
deux,france,1821
eins,germany,2211
zwei,germany,1821
drei,germany,1119
uno,spain,1793
um,portugal,1021
due,italy,1453
jedno,poland,1339
edno,bulgaria,1512
\.

Aggregate functions

Everybody know about aggregate functions even though they might think they not. The most famous SQL function is an aggregate function:

select count(*) from factory ;
┌───────┐
│ count │
├───────┤
│    10 │
└───────┘
(1 row)

Normally, without a WHERE clause that filters rows, every query on this table would return a set composed of 10 rows. But an aggregate function reduces the output to a determined number of rows. In the case above, the result set is composed of one row indicating the cardinality of the factory table. In counter part, we did loose a lot of the information contained in the table. It is of course possible to tell the database about how we do want to reduce the set:

select
  country,
  count(*) as factories,
  sum(production) as production
from
  factory
group by
  country
order by
  factories desc,
  production desc
;
┌──────────┬───────────┬────────────┐
│ country  │ factories │ production │
├──────────┼───────────┼────────────┤
│ germany  │         35151 │
│ france   │         23884 │
│ spain    │         11793 │
│ bulgaria │         11512 │
│ italy    │         11453 │
│ poland   │         11339 │
│ portugal │         11021 │
└──────────┴───────────┴────────────┘
(7 rows)

The example above shows how to reduces the set based on the information contained in the country field. The sum aggregate is also used, see Postgrsql's aggregate functions documentation page for more informations.

Window functions

Let's now imagine we need to rank factories by production. The first think one may state is that it is not a reducing operation: we will output as many rows as there are factories in the table. We need a different beast here able to compare rows in a determined window of data: a window function.

select
  name,
  country,
  production,
  rank() over (order by production desc)
from factory
;
┌───────┬──────────┬────────────┬──────┐
│ name  │ country  │ production │ rank │
├───────┼──────────┼────────────┼──────┤
│ eins  │ germany  │       22111 │
│ un    │ france   │       20632 │
│ deux  │ france   │       18213 │
│ zwei  │ germany  │       18213 │
│ uno   │ spain    │       17935 │
│ edno  │ bulgaria │       15126 │
│ due   │ italy    │       14537 │
│ jedno │ poland   │       13398 │
│ drei  │ germany  │       11199 │
│ um    │ portugal │       102110 │
└───────┴──────────┴────────────┴──────┘
(10 rows)

The window function rank is ranking rows defined in the given window order by production desc. It is even able to manage ex aequo (look at ranking 3). What if we want to rank factories by country ? It is just a matter of defining the data window:

select
  name,
  country,
  production,
  rank() over (partition by country order by production desc)
from
  factory
order by
  country,
  rank asc
;
┌───────┬──────────┬────────────┬──────┐
│ name  │ country  │ production │ rank │
├───────┼──────────┼────────────┼──────┤
│ edno  │ bulgaria │       15121 │
│ un    │ france   │       20631 │
│ deux  │ france   │       18212 │
│ eins  │ germany  │       22111 │
│ zwei  │ germany  │       18212 │
│ drei  │ germany  │       11193 │
│ due   │ italy    │       14531 │
│ jedno │ poland   │       13391 │
│ um    │ portugal │       10211 │
│ uno   │ spain    │       17931 │
└───────┴──────────┴────────────┴──────┘
(10 rows)

The data window now defines a partition by country that resets the ranking for every different country. There is a very rich and powerful syntax to define windows, you might also want to have a look at Postgresql's built-in window functions.

To close this article with a more complex example, let's say we want to have for each factory, the name of the preceding more productive European factory and the following less productive. The functions lead and lag are here for that:

select
  name,
  country,
  production,
  lead(name) over european_production_wdw,
  lag(name) over european_production_wdw
from
  factory
window
  european_production_wdw as (order by production desc)
order by
  production desc
;
┌───────┬──────────┬────────────┬───────┬───────┐
│ name  │ country  │ production │ leadlag  │
├───────┼──────────┼────────────┼───────┼───────┤
│ eins  │ germany  │       2211 │ un    │ ¤     │
│ un    │ france   │       2063 │ deux  │ eins  │
│ deux  │ france   │       1821 │ zwei  │ un    │
│ zwei  │ germany  │       1821 │ uno   │ deux  │
│ uno   │ spain    │       1793 │ edno  │ zwei  │
│ edno  │ bulgaria │       1512 │ due   │ uno   │
│ due   │ italy    │       1453 │ jedno │ edno  │
│ jedno │ poland   │       1339 │ drei  │ due   │
│ drei  │ germany  │       1119 │ um    │ jedno │
│ um    │ portugal │       1021 │ ¤     │ drei  │
└───────┴──────────┴────────────┴───────┴───────┘
(10 rows)

These two functions are very useful to give access to the next or previous resource when looking at a search result. Enjoy !