Customizing Aggregations With Window Frames
Window Frames
Dolt is a database that speaks the MySQL wire protocol but uses a custom SQL Engine to implement relational logic. Our engine, go-mysql-server, is increasingly inching towards 100% drop-in MySQL compatibility, and today we are happy introduce the latest feature: window aggregations and frames.
Window frames are a more recent SQL syntax addition used heavily in
analytical processing
(OLAP) queries. MySQL and Postgres added
full support in
2018,
but some databases like SQLite and SQL Server still lack support for RANGE
frames (described below).
Zach added initial windowing support almost one year
ago, including window
functions like first_value
and row_number
. The most recent release expands
our ability to execute aggregation functions, like sum
, avg
, and count
,
inside windows. The new release also includes window frames, a way to
customize the input set for aggregation functions. Window frames include
ROWS
and RANGE
units, and UNBOUNDED
, CURRENT ROW
, and expression
PRECEDING
and FOLLOWING
bounds. Refer to the MySQL
docs
for more info on window frame syntax.
In general, you might be interested in advanced windowing features if you run data analytics workflows. This blog will focus on practical examples of how to use these new features. We encourage you to install dolt yourself, take windows for a test ride, and let us know what you think!
Tutorial: Using Window Frames in Dolt
Group By
First, let’s create a parts
table with id
, lbs
, and cnt
fields:
mysql> create table parts (id varchar(8) primary key, lbs int, cnt int);
mysql> insert into parts values ('a', 10, 3), ('b', 10, 6), ('c', 10, 2),
('d', 10, 0), ('e', 10, 12), ('f', 10, 7),
('g', 20, 1), ('h', 20, 2), ('i', 20, 3);
Query OK, 8 rows affected
We will look at the simplest aggregation, GROUP BY
, before building
towards more complicated window functions. A GROUP BY
partitions a
dataset before rolling up a rows with an aggregation
function, like SUM
:
mysql> SELECT lbs, SUM(cnt) FROM parts GROUP BY lbs;
+-----+----------+
| lbs | SUM(cnt) |
+-----+----------+
| 10 | 30 |
| 20 | 6 |
+-----+----------+
The first thing we notice is that a GROUP BY
aggregation outputs a
a single row for each partition. Two parts.lbs
partitions yields
two output rows. The diagram below demonstrates this split and
collapse:
Partitioning on the lbs
column divides our table into two sets of
rows, lbs = 10
and lbs = 20
. A GROUP BY
passes each of those
partitions to the aggregation function, SUM
, outputting two result
rows. We have included purple "frames" that overlap with partition
boundaries.
Window Over Partitions
Window syntax is similar to GROUP BY
, but introduces two key features:
-
Windows yield an output row for every input row.
-
Windows use frames to vary their scope of computation, whereas
GROUP BY
is always scoped to an entire partition.
To make the differences concrete, we will first look at a simplified
window that uses the same frame as GROUP BY
:
mysql> SELECT SUM(cnt)
OVER (
PARTITION BY lbs
ROWS BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) as `sum(cnt)` FROM parts;
+----------+
| sum(cnt) |
+----------+
| 30 |
| 30 |
| 30 |
| 30 |
| 30 |
| 30 |
| 6 |
| 6 |
| 6 |
+----------+
The output values are similar to GROUP BY
, but the sums are duplicated for
every input row in a partition. Again, we specified this using window frames, a way to
customize an aggregation function's input set. Here we
used a frame definition that mimics GROUP BY
: the beginning
of the partition (Unbounded Preceding) to the end of the partition
(Unbounded Following).
The diagram below uses the purple frames to highlight the duplicated
frames.
Window Rows Framing
Next, we will execute a query with a more interesting window frame: the sum of every two trailing rows:
mysql> SELECT SUM(cnt)
OVER (
PARTITION BY lbs
ROWS BETWEEN
2 PRECEDING AND
1 PRECEDING
) as `sum(cnt)` FROM parts;
+----------+
| sum(cnt) |
+----------+
| NULL |
| 3 |
| 9 |
| 8 |
| 2 |
| 12 |
| NULL |
| 1 |
| 3 |
+----------+
The first row in each partition lacks trailing rows, so the first frame
is empty, and the first sum is NULL
. There are two partitions in our
table, so the beginning of the second partition is NULL
for the same
reason. The second frames are also special, overrunning the partition
boundary and yielding 1 row frames. The second row's output is the
first row's cnt
. The remaining rows in each partition yield the
sum of the two previous rows.
The diagram above details a frame for every input row, as in the previous diagrams. The frames now use our more complicated frame boundary: two rows back (2 Preceding) to one row back (1 Preceding).
Window Range Framing
So far we have focused on ROWS
framed windows. A row frame uses index
counts to select and move frame bounds. In this final example, we will
consider RANGE
frames.
RANGE
frames use values to select window bounds. For example, a
RANGE
frame could match all rows within 5% of of the current row,
or all rows within 5 units of the current row. A window's ORDER BY
expression serves as a reference point for ranking frame peer values.
To make this explicit, we will make a second table, grades
, with a date
column to RANGE
over:
mysql> create table grades (student varchar(8), subject varchar(8), grade float, date date, primary key (student, subject, date));
mysql> insert into grades values ('max', 'English', 2, '2021-01-21'), ('max', 'English', 6, '2021-02-16'),
('max', 'English', 5, '2021-03-17'), ('max', 'English', 4, '2021-04-20'),
('max', 'English', 8, '2021-05-17'), ('max', 'English', 8, '2021-06-14'),
('max', 'English', 9, '2021-06-23'), ('max', 'English', 10, '2021-06-30');
Now we take a 3 month moving average of grades:
> SELECT avg(grade)
OVER (
PARTITION BY student, subject
ORDER BY date
RANGE BETWEEN
INTERVAL '1' MONTH PRECEDING and
INTERVAL '1' MONTH FOLLOWING
) as `avg(grade)` FROM grades;
+-------------------+
| avg(grade) |
+-------------------+
| 4 |
| 4 |
| 5 |
| 6 |
| 6.666666666666667 |
| 8.75 |
| 9 |
| 9 |
+-------------------4
The diagram above is similar to the others, considering a frame
represents a row mapping between an input row and its output
aggregation set. The difference is that the width of RANGE
frames vary.
A ROWS
frame will always be a fixed length (unless overrunning the start
or end of a partition).
The date
column in the ORDER BY
expression is chosen as the
reference point for computing frames. The INTERVAL '1' MONTH
bounds
mean that a RANGE
frame here matches one or more rows
within one month of the current row's date
. For example, the exam on
2021-03-17
only matches itself, because there are no other exams
between 2021-02-17
and 2021-04-17
. The exam on 2021-06-14
matches
4 rows, on the other hand, every exam in June and one exam in May
following the interval start, 2021-05-14
.
Summary
The latest Dolt release supports window aggregations and customized window framing, a step towards greater MySQL compatibility and stronger OLAP support.
Applying window frames in practice is a bit trickier than in theory, but if you are new to SQL windows hopefully we have provided some starters for how to apply them in your own work.
There are also several improvements we look forward to delivering. We would like to add more functions, like quantiles. There are parser features, like named windows, that we would like to add. And there are performance optimizations that can improve query latency, like sliding window aggregations to reduce memory footprint, partition level sorting to minimize unnecessary compute, and inter/intra partition parallelism to take advantage of multithreading.
Give window frames a try! Let us know if you find bugs or are interested in new features! And if you would like to contribute to our SQL Engine yourself, many of the window extensions mentioned above are good starter tasks.
Feel free to reach out on Twitter, Discord, and GitHub if you have any questions.