WL#9236: Add first batch of SQL window functions

Affects: Server-8.0   —   Status: Complete

Umbrella WL for adding SQL window functions to MySQL.


Allow use of SQL window functions in MySQL. What are they? Cf. this good 
description culled from the PostgreSQL docs:

"A window function performs a calculation across a set of table rows that are 
somehow related to the current row. This is comparable to the type of calculation 
that can be done with an aggregate function. But unlike regular aggregate 
functions, use of a window function does not cause rows to become grouped into a 
single output row — the rows retain their separate identities. Behind the scenes, 
the window function is able to access more than just the current row of the query 

This is a large feature set and we do not expect that all will be implemented,
at least not in the first release of the feature set.


Addition of new (i.e. missing) aggregate functions: only the already existing
aggregate functions for MySQL will be considered for window functions.
SQL 2003 introduced more aggregate functions that could also be used as
window function.


Support for window functions (a.k.a. analytic functions) is a frequent user 
request. Window functions have long been part of standard SQL (SQL 2003).


Yes, but performance would be abysmal and expressing the same semantics would lead
to hard to understand queries.


yes: Oracle, PostgreSQL, Maria DB, SQL server


[SQL 2011] - "SQL 2011 ISO/IEC 9075-2:2011(E) JTC 1/SC 32/WG 3"
[SQL 2016] - "SQL 2016 ISO/IEC 9075-2:2016(E) JTC 1/SC 32/WG 3"

Attached file window-functions-intro.pdf as presented to optimizer team
Feb 2016.

An overview which places window function in the context of OLAP capabilities for 


The PostgreSQL documentation for window functions is also quite readable.

Git branch mysql-trunk-wl9236 contains the prototyping code.

Feature T611, "Elementary OLAP operations"
Feature T612, “Advanced OLAP operations”
Feature T614, "NTILE function"
Feature T615, "LEAD and LAG functions"
Feature T616, "Null treatment option for LEAD and LAG functions"
Feature T617, "FIRST_VALUE and LAST_VALUE functions"
Feature T618, "NTH_VALUE function"
Feature T619, "Nested window functions"
Feature T620, "WINDOW clause: GROUPS option"

Main sections discussing window functions in [SQL 2016]:

  4.15.15 Windowed tables
  4.16.1 Introduction to data analysis operations
  4.16.3 Window functions
  4.16.4 Aggregate functions
[ 6.11  ]

but windows are also treated elsewhere in many contexts.

Old WL#2266 OLAP.
WL#9603 Add remaining non-aggregate window functions
WL#9727 Additional aggregate window functions
WL#10431 Support operator queries with windowing functions

CREATE TABLE employees(name VARCHAR(20), department INT, salary LONG);
CREATE TABLE departments(id INT, name VARCHAR(20));

INSERT INTO departments VALUES (1, 'Cartoons'), (2, 'Fiction');
INSERT INTO employees VALUES ('donald', 1, 20000),
                             ('mickey', 1, 15000),
                             ('wile e. coyote', 1, 30000),
                             ('swann', 2, 40000),
                             ('ishmael', 2, 40000),
                             ('ahab', 2, 50000);

SELECT e.name, d.name, salary,
       RANK() OVER (PARTITION BY d.id ORDER BY salary DESC) rank_in_dep
       FROM employees e, departments d
       WHERE e.department = d.id
       ORDER BY e.name;

| name           | name     | salary | rank_in_dep |
| ahab           | Fiction  | 50000  |           1 |
| donald         | Cartoons | 20000  |           2 |
| ishmael        | Fiction  | 40000  |           2 |
| mickey         | Cartoons | 15000  |           3 |
| swann          | Fiction  | 40000  |           2 |
| wile e. coyote | Cartoons | 30000  |           1 |
6 rows in set (0.00 sec)

F-1 Support window functions, with the limitations described in this WL
F-2 Produce correct query results
F-3 For some moving frame aggregates, applying the inverse aggregate function
    for removing values from the aggregate (to avoid non-linear performance),
    can lead to unexpected results, notably for floating values,
    cf example here:
    Still users may want/need this behavior, but it should be possible
    to control this via a user variable. The default should produce
    the expected values, i.e. we should not use inverse functions when it is
    not safe unless told to do so by the user. 
    Proposed variable:

        windowing_use_high_precision [false / true (default) ]

F-4 Tables with BLOBs should be supported, but there may be a limitation
    on the size of the BLOBs.
F-5 Operator subqueries will only be supported for uncorrelated IN,
    cf WL#10431, i.e. ANY, SOME, ALL are not allowed.

NF-1 performance shouldn't be lower than when using a non-wf SQL rewrite with same 
NF-2 Avoid non-linear performance degradation for large frames when
     competition avoids it.
NF-3 performance of non-wf queries shouldn't be affected in any significant way



New reserved words:


Although the above are all reserved words according to the standard,
PostgreSQL allows all the above as column names, except WINDOW. Presently,
the implementation follows the standard.

New keywords (unreserved):



    opt_window_clause /* new */
    opt_window_clause /* new */

    | sum_expr
    | window_func_call /* new */

sum_expr: /* all aggregates below now have a new opt_windowing_clause */
          AVG '(' in_sum_expr ')' opt_windowing_clause
        | AVG '(' DISTINCT in_sum_expr ')' opt_windowing_clause
        | BIT_AND  '(' in_sum_expr ')' opt_windowing_clause
        | BIT_OR  '(' in_sum_expr ')' opt_windowing_clause
        | BIT_XOR  '(' in_sum_expr ')' opt_windowing_clause
        | JSON_ARRAYAGG '(' in_sum_expr ')' opt_windowing_clause
        | JSON_OBJECTAGG '(' in_sum_expr ',' in_sum_expr ')' opt_windowing_clause
        | COUNT '(' opt_all '*' ')' opt_windowing_clause
        | COUNT '(' in_sum_expr ')' opt_windowing_clause
        | COUNT '(' DISTINCT expr_list ')' opt_windowing_clause
        | MIN '(' in_sum_expr ')' opt_windowing_clause
        | MIN '(' DISTINCT in_sum_expr ')' opt_windowing_clause
        | MAX '(' in_sum_expr ')' opt_windowing_clause
        | MAX '(' DISTINCT in_sum_expr ')' opt_windowing_clause
        | STD '(' in_sum_expr ')' opt_windowing_clause
        | VARIANCE '(' in_sum_expr ')' opt_windowing_clause
        | STDDEV_SAMP '(' in_sum_expr ')' opt_windowing_clause
        | VAR_SAMP '(' in_sum_expr ')' opt_windowing_clause
        | SUM '(' in_sum_expr ')' opt_windowing_clause
        | SUM '(' DISTINCT in_sum_expr ')' opt_windowing_clause
        | GROUP_CONCAT '(' opt_distinct
          expr_list opt_gorder_clause
          ')' opt_windowing_clause
          ROW_NUMBER '(' ')' windowing_clause
        | RANK '(' ')' windowing_clause
        | DENSE_RANK '(' ')' windowing_clause
        | CUME_DIST '(' ')' windowing_clause
        | PERCENT_RANK '(' ')' windowing_clause
        | NTILE '(' simple_expr ')' windowing_clause
        | LEAD '(' expr opt_lead_lag_info ')' 
          opt_null_treatment windowing_clause
        | LAG '(' expr opt_lead_lag_info ')' 
          opt_null_treatment windowing_clause
        | FIRST_VALUE '(' expr ')' opt_null_treatment windowing_clause
        | LAST_VALUE  '(' expr ')' opt_null_treatment windowing_clause
        | NTH_VALUE '(' expr ',' simple_expr ')' opt_from_first_last 
          opt_null_treatment windowing_clause

          /* Nothing */
        | ',' NUM_literal opt_ll_default

          /* Nothing */
        | ',' expr

          /* Nothing */
        | IGNORE NULLS

          /* Nothing */
        | FROM FIRST
        | FROM LAST

          /* Nothing */
        | windowing_clause

        OVER window_name_or_spec

        | window_spec

        '(' window_spec_details ')'


          /* Nothing */
        | window_name

          /* Nothing */
        | PARTITION BY group_list

          /* Nothing */
        | ORDER BY order_list

          /* Nothing*/
        | window_frame_units

        | window_frame_between

        | simple_expr PRECEDING  /* simple_expr correct? FIXME */
        | CURRENT ROW

          BETWEEN window_frame_bound AND window_frame_bound

        | simple_expr FOLLOWING

          /* Nothing */
        | EXCLUDE TIES

         | RANGE
         | GROUPS

          /* Nothing */
        | WINDOW  window_definition_list

        | window_definition_list ',' window_definition

          window_name AS window_spec

The standard states that for window functions that operate on the entire
partition, the opt_window_frame_clause should not be present. These are:


The current implementation allows it to be present, but ignores it for these 
functions, similar to PostGreSQL's behaviour. This has the benefit of allowing
a mix of window functions to be used in the same windowing step. 

The following parts will *not* be included as part of this worklog, but left
to a possible later inclusion:

- All aggregate functions, except the following which we do support:


- All window functions, except the following which we do support:


- DISTINCT in aggregates when used as window functions
- GROUPS frame units (T620)
- EXCLUDE clause (opt_window_frame_exclusion)
- opt_null_treatment (T616). Our semantics is RESPECT NULLS, which
  can optionally be specified.
  The syntax for IGNORE NULLS is recognized, but gives an error.
- Nested window functions (T619): NOTE: syntax for T619 not included above
- Dynamic frame bounds, i.e. bounds based on the value of the current row
- We do not require ORDER BY unless there is a RANGE frame with
   PRECEDING or FOLLOWING. If there is no ORDER BY, all rows
  in a partition are considered peers.


SELECT name, department_id, salary,
       SUM(salary) OVER (PARTITION BY department_id
                         ORDER BY name
                         ROWS 2 PRECEDING) department_total,
       ROW_NUMBER() OVER w
FROM employee
ORDER BY department_id, name;

Refer to the SQL standard (2011) for non-MySQL window functions.
For MySQL specific aggregates, the semantics will be analogous to standard
aggregate functions, cf. for GROUP_CONCAT and any new JSON aggregates we would 
aggregate for all rows in the actual frame.

We extend the standard by allowing the partition clause to use expressions,
not just column names, e.g partitioning on a substring of a column.

The aggregation always happens in the SELECT on the level where the
window function is used. This is per the standard. We extend that standard by
allowing window function arguments and expression in a window's ORDER BY and
PARTITION by clauses to reference outer queries' columns and aliases, similar to
what PG allows.

Non-deterministic ORDERING
If a window function specifies only a partial order on the set of rows,
the actual order is non-deterministic, and will differ across platforms.
This is allowed by the standard as long as the order is the same within a query 
for equal window specifications.

Ordering of NULLs
Currently, MySQL does not allow specification of where NULLs go in ORDER BY.
This also applies to PARTITION BY and ORDER BY in windows: ascending sorts NULLs
first, whereas descending sorts NULLs last.
(I) Parsing:

The grammar as described above is implemented pretty much verbatim.

The following objects are created as part of the parse to hold and represent
the windows:

    PT_window : public Parse_tree_node, public Window
    PT_order_list     - for PARTITION BY, if any
    PT_order_list     - for ORDER BY, if any
    PT_frame          - for window frame if any
       w/ sub-objects PT_border and PT_exclusion

There are three kinds of window objects initially created:

    unnamed window (1)
    named window (2)
    window reference (3)

After resolution, the latter isn't used. Window functions are modeled
on Item_sum, which now have an optional member field
Item_sum::m_window, which point to (1) or (3) before resolution, and
to (1) or (2) after resolution.

The item classes for the supported window functions are shown below
(in square brackets: the ones implemented in WL#9603).

          |                \                         \
                |                         |
          |       Item_first_last_value        Item_rank (incl. dense)
Item_sum_sum      [Item_nth_value]             Item_row_number
Item_sum_avg                                   Item_ntile 
Item_sum_count                                [Item_cume_dist]
       :                                      [Item_percent_rank]
 [the rest not                                [Item_lead_lag]

(II) Contextualizing

is driven from PT_window::contextualize.

All used windows for a SELECT is to be found in the list
SELECT_LEX::m_windows after the Item_sum's have been itemized.

(III) Resolution and static semantic checking

References from window function to a named window are resolved by
Window::resolve_reference, called from Item_sum::fix_fields called
from SELECT_LEX::prepare before Window::setup_windows, see below.

Checking of windows is driven by Window::setup_windows called from
SELECT_LEX::prepare which resolves any inter window references
(ancestor, represented as Window:m_ancestor). Each window has a list
of all the window functions that use it, Window::m_functions.

Window::setup_windows will also check the semantic requirements of
individual window functions on the window by calling

This method will collect the dynamic requirements of the window
function as far as which execution strategy must be used. All window
function for a window will be evaluated in the same pass, so the
strategy will reflect the requirements of the "most hard" window
function used on a window.

(IV) Optimization

Derived table merging is switched off if the subquery has windows, it is
always materialized. Semijoin isn't applicable as it applies to subqueries in
WHERE and JOIN ON condition, which condition cannot contain any window function.

Most of the windows magic during optimization (or rather execution
planning in our case) happens while calling
JOIN::make_tmp_tables_info. Essentially we insert one or more tmp table
step after any grouping (GROUP BY/HAVING) but before any final ORDER
BY.  We decorate each such step with initial sorting if required by
concatenating the expression from PARTITION BY and ORDER BY. This will
give the right order of the result set rows for dividing it into
partitions (all rows with the same partition column values will be
adjacent), and then within each partition the rows will be sorted
according to any ORDER BY, so the rows come in the right order for
windows processing. No PARTITION BY means we have a trivial partition:
the entire result set.

Each extra step corresponds to the evaluation of all window functions
using one window.  At the outset, windows are processed in the order they are
declared, but we sometimes reorder them:

N windows that have the same ordering requirements (in the sense of the
SQL standard) are processed in sequence, so we can skip sorting for windows 2..N. 

No attempt is (yet) made to to merge windows which could be evaluated in one
step, e.g. if several window specifications are identical. The work-around is
obvious: use a named window instead.

If the window step contains window functions in conjunction with a
window specification that requires buffering, an extra temporary table
(the window frame buffer, or the "frame buffer" for short) is created
per step to hold the rows so we can compute the window functions by
(re)visiting the rows as needed during execution of that window step.

The logic in create_tmp_table
is extendedto handle window functions: we make sure that only
the window functions pertaining to the window of that step get
evaluated in that step. In later steps the result of previous steps
are normal fields, and after the last step in the outgoing result set
all functions have been finally evaluated to fields.

The set of item indirections JOIN::ref_items,
all_fields/tmp_all_fields has been extended to include the new tmp
table steps, and so now are dynamically allocated, see class comments
of JOIN::ref_items and JOIN::alloc_indirection_slices.

(V) Execution:

It all comes together in sql_executor.cc's end_write_wf (a copy and extension of
end_write): this is where we
encounter the (possibly sorted) rows as input, and before we write
them on to the out table (and possibly another windowing step) we have a
chance to act on them and evaluate their window functions (the
functions to be evaluated for this step can be found in
Temp_table_param::items_to_copy) as determined by the logic
make_tmp_tables_info, see above.

A windowing step normally produces a temporary out table, which is
used as an input table by the processing of the next window, if any.
If there is no more windowing steps, nor a final DISTINCT or ORDER BY
step, we optimize away this output table, and just send rows out. By "optimize
away" we mean the TABLE object is created in memory but not in the engine (so,
doesn't get any read/writes).

If evaluation of the window requires a frame buffer, we write the
input table to a temporary frame buffer table, which we iterate over as
we evaluate window functions for the result set and either write the
next out table or send the rows on.

An (optimized) alternative would have been to evaluate on the incoming
table as rows are read, possibly re-positioning that (tmp) table in
lieu of a frame buffer, but I didn't see an easy way to do this with
the existing code structure. In any case, the input is not guaranteed
to be in the form of a table that can be (re)positioned: indeed if the first
window doesn't have any ordering requirements, we do not force the
input result set to be buffered to a file; and if the first window does sorting,
the output of filesort isn't re-positionable.

If we don't require buffering, the logic is pretty simple, we just
detect partition boundaries and evaluate window functions as we go
("on the fly") in the "copy_funcs" call.

If we do require buffering, we use a different code path, which in
addition to partition detection, buffers rows: they are written to the
frame buffer using the method buffer_windowing_record until enough
rows (perhaps the entire partition, depending) has been read, and then
they are evaluated by the main method for buffered processing:
process_buffered_windowing_record, which can iterate over the frame

Most of the complexity of the actual evaluation of the window functions
is in the method process_buffered_windowing_record and in the val_xxx methods of
Item_sum sub-classes representing the window function.

See appendix A for details on process_buffered_windowing_record.

Since we need to potentially visit the same row several times, we need
to read the row as quickly as possible given its row number. We do not
add an index to the frame buffer, but rather rely on file
positions(¹).  We keep track of file positions from evaluations of
earlier rows and take advantage of the fact that the sliding windows'
border increase monotonically as long as the frame bounds are static
in the query (current limitation). By recording a suitable set of file
positions we are able to find any row we need by maximum two scan
reads: one to re-read using a former read row using its recorded
location, and one to read the next row, cf the logic in

(¹) The MEM heap engine didn't provide the necessary functionality for
this, since a positioned read did not update the current position in 
contrast to InnoDB. The current work adds this functionality.

A note on RANGE  frames: this requires an ORDER BY in the
window frame, and integral or date/time type of the ordering
expression.  The determination of whether a row is outside the set,
either before or after the range indicated, is achieved by setting up
a set of expressions at prepare-time, cf Item *Window::m_comparators[].

We poke in the value of the ORDER BY expression of the current row and
evaluate the expression on the candidate row, cf logic in

  Cached_item_int::copy_to_Item_cache (and same for Cached_item_real,
Cached_item_decimal, Cached_item_temporal),

called from Window::before_frame and Window::after_frame. The other
Cached_item_xxx are not used for windowing.

(VI) Cleanup

is called from SELECT_LEX::cleanup, the method being Window::cleanup.

(VII) In Opt trace.

As this WL creates potentially many temp tables, some tracing of temp table
creation is added.


In JSON format, windowing (the process of sorting data and calculating
functions) is represented like this:
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "0.65" <<< usual preamble
      "windowing": { <<<<<<<<<<< new 
        "windows": [
            "name": "",
            "using_temporary_table": true,
            "using_filesort": true,
            "filesort_key": [
            "functions": [
        "table": { <<<<<<<< the usual list of tables
          "table_name": "t",
In traditional format, it's impossible to provide windowing information (which
is too rich), so we print nothing, and send a warning:
 "To get information about window functions use EXPLAIN FORMAT=JSON"

(IX) Printing

- SELECT_LEX_UNIT::print() is augmented to print out window functions
  and window definition so windowing can be used inside VIEW definitions

(Appendix A - process_buffered_windowing_record)

This is culled from the Doxygen of the method. It is included here because it
describes major design dimension of the execution strategy.

  While there are more unprocessed rows ready to process given the current
  partition/frame state, process such buffered rows by evaluating/aggregating
  the window functions defined over this window on the current frame, moving
  the frame if required.

  This method contains the main execution time logic of the evaluation
  window functions if we need buffering for one or more of the window functions
  defined on the window.

  Moving (sliding) frames can be executed using a naive or optimized strategy
  for aggregate window functions, like SUM or AVG (but not necessarily for 
  In the naive approach, for each row considered for processing from the buffer,
  we visit all the rows defined in the frame for that row, essentially leading
  to N*M complexity, where N is the number of rows in the result set, and M is
  the number for rows in the frame. This can be slow for large frames,
  obviously, so we can choose an optimized evaluation strategy using inversion.
  This means that when rows leave the frame as we move it forward, we re-use
  the previous aggregate state, but compute the *inverse* function to eliminate
  the contribution to the aggregate by the row(s) leaving the frame, and then
  use the normal aggregate function to add the contribution of the rows moving
  into the frame. The present method contains code paths for both strategies.

  For integral data types, this is safe in the sense that the result will be the
  same if no overflow occurs during normal evaluation. For floating numbers,
  optimizing in this way may lead to different results, so it is not done by
  default, cf the session variable "windowing_use_high_precision".

  Since the evaluation strategy is chosen based on the "most difficult" window
  function defined on the window, we must also be able to evaluate
  non-aggregates like ROW_NUMBER, NTILE, FIRST_VALUE in the code path of the
  optimized aggregates, so there is redundant code for those in the naive and
  optimized code paths. Note that NTILE forms a class of its own of the
  non-aggregates: it needs two passes over the partition's rows since the
  cardinality is needed to compute it. Furthermore, FIRST_VALUE and LAST_VALUE
  heed the frames, but they are not aggregates.

  The is a special optimized code path for *static aggregates*: when the window
  frame is the entire partition and there is no ORDER BY
  specified, the value of the framing window functions, i.e. SUM, AVG,
  FIRST_VALUE, LAST_VALUE can be evaluated once for all and saved when
  we visit and evaluate the first row of the partition. For later rows we
  restore the aggregate values and just fill in the other fields and evaluate
  non-framing window functions for the row.

  The code paths both for naive execution and optimized execution differ
  depending on whether we have ROW or RANGE boundaries in a explicit frame.

  Another special optimized code path ("dynamic_upper_optimizable") exists for
  the case where we have a dynamic upper frame limit based on the value of the
  current rows ORDER BY expression, but otherwise have no explicit frame: in
  this case the last row in the frame is the last peer row of the current row,
  which can be many rows out. In that case also, we save the evaluated
  aggregates and restore them for the next row, as long as we haven't left
  the peer set. Once we do, we add all the new contributions and make a new
  saved result for the new peer set. This code shares some mechanisms with the
  optimized aggregates path for RANGE frames.

  A word on BLOBs. Below we make copies of rows' records into the frame buffer.
  This is a temporary table, so BLOBs get copied in the normal way.

  Sometimes we save records containing already computed framing window
  functions away into memory only: is the lifetime of the referenced BLOBs long
  enough?  We have two cases:

  BLOB results from wfs: Any BLOB results will reside in the copies in result
  fields of the Items ready for the out table, so they no longer need any BLOB
  memory read from the frame buffer tmp table.

  BLOB fields not evaluated by wfs: Any other BLOB field will be copied as
  well, and would not have life-time past the next read from the frame buffer,
  but they are never used since we fill in the fields from the current row
  after evaluation of the window functions, so we don't need to make special
  copies of such BLOBs. This can be (and was) tested by shredding any BLOBs
  deallocated by InnoDB at the next read.

  We also save away in memory the next record of the next partition while
  processing the current partition. Any blob there will have its storage from
  the read of the input table, but we won't be touching that for reading again
  until after we start processing the next partition and save the saved away
  next partition row to the frame buffer.

  @param thd                    Current thread
  @param param                  Current temporary table
  @param out_table              Results of this windowing step go here
  @param new_partition_or_eof   True if we are about to start a new partition,
                                or eof
  @param[out] output_row_ready  True if there is a row record ready to write.

  @return true if error
static bool
process_buffered_windowing_record(const THD *thd,
                                  Temp_table_param *param,
                                  TABLE *out_table,
                                  const bool new_partition_or_eof,
                                  bool *output_row_ready)

(Appendix B - detailed comments on what's changed in individual files)

is available on request from the implementor.