WL#9236: Add first batch of SQL window functions
Affects: Server-8.0
—
Status: Complete
Umbrella WL for adding SQL window functions to MySQL. WHAT ==== Allow use of SQL window functions in MySQL. What are they? Cf. this good description culled from the PostgreSQL docs: "A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result." This is a large feature set and we do not expect that all will be implemented, at least not in the first release of the feature set. WHAT THIS IS NOT ================ Addition of new (i.e. missing) aggregate functions: only the already existing aggregate functions for MySQL will be considered for window functions. SQL 2003 introduced more aggregate functions that could also be used as window function. WHY === Support for window functions (a.k.a. analytic functions) is a frequent user request. Window functions have long been part of standard SQL (SQL 2003). BUT CAN'T I DO THE SAME WITH EXISTING MEANS? ============================================ Yes, but performance would be abysmal and expressing the same semantics would lead to hard to understand queries. DO OTHERS HAVE WINDOW FUNCTIONS? ================================ yes: Oracle, PostgreSQL, Maria DB, SQL server REFERENCES ========== [SQL 2011] - "SQL 2011 ISO/IEC 9075-2:2011(E) JTC 1/SC 32/WG 3" [SQL 2016] - "SQL 2016 ISO/IEC 9075-2:2016(E) JTC 1/SC 32/WG 3" Attached file window-functions-intro.pdf as presented to optimizer team Feb 2016. An overview which places window function in the context of OLAP capabilities for SQL: http://wwwlgis.informatik.uni- kl.de/cms/fileadmin/courses/SS2008/NEDM/RDDM.Chapter.06.Windows_and_Query_Function s_in_SQL.pdf The PostgreSQL documentation for window functions is also quite readable. Git branch mysql-trunk-wl9236 contains the prototyping code. SQL FEATURES ============ Feature T611, "Elementary OLAP operations" Feature T612, “Advanced OLAP operations” Feature T614, "NTILE function" Feature T615, "LEAD and LAG functions" Feature T616, "Null treatment option for LEAD and LAG functions" Feature T617, "FIRST_VALUE and LAST_VALUE functions" Feature T618, "NTH_VALUE function" Feature T619, "Nested window functions" Feature T620, "WINDOW clause: GROUPS option" Main sections discussing window functions in [SQL 2016]: 4.15.15 Windowed tables 4.16.1 Introduction to data analysis operations 4.16.3 Window functions 4.16.4 Aggregate functions 6.10[ 6.11 ] 7.4 7.15
10.9 but windows are also treated elsewhere in many contexts. SEE ALSO ======== Old WL#2266 OLAP. WL#9603 Add remaining non-aggregate window functions WL#9727 Additional aggregate window functions WL#10431 Support operator queries with windowing functions EXAMPLE OF WINDOW FUNCTION ========================== CREATE TABLE employees(name VARCHAR(20), department INT, salary LONG); CREATE TABLE departments(id INT, name VARCHAR(20)); INSERT INTO departments VALUES (1, 'Cartoons'), (2, 'Fiction'); INSERT INTO employees VALUES ('donald', 1, 20000), ('mickey', 1, 15000), ('wile e. coyote', 1, 30000), ('swann', 2, 40000), ('ishmael', 2, 40000), ('ahab', 2, 50000); SELECT e.name, d.name, salary, RANK() OVER (PARTITION BY d.id ORDER BY salary DESC) rank_in_dep FROM employees e, departments d WHERE e.department = d.id ORDER BY e.name; +----------------+----------+--------+-------------+ | name | name | salary | rank_in_dep | +----------------+----------+--------+-------------+ | ahab | Fiction | 50000 | 1 | | donald | Cartoons | 20000 | 2 | | ishmael | Fiction | 40000 | 2 | | mickey | Cartoons | 15000 | 3 | | swann | Fiction | 40000 | 2 | | wile e. coyote | Cartoons | 30000 | 1 | +----------------+----------+--------+-------------+ 6 rows in set (0.00 sec) F-1 Support window functions, with the limitations described in this WL F-2 Produce correct query results F-3 For some moving frame aggregates, applying the inverse aggregate function for removing values from the aggregate (to avoid non-linear performance), can lead to unexpected results, notably for floating values, cf example here: https://www.postgresql.org/docs/9.5/static/xaggr.html#XAGGR-MOVING-AGGREGATES Still users may want/need this behavior, but it should be possible to control this via a user variable. The default should produce the expected values, i.e. we should not use inverse functions when it is not safe unless told to do so by the user. Proposed variable: windowing_use_high_precision [false / true (default) ] F-4 Tables with BLOBs should be supported, but there may be a limitation on the size of the BLOBs. F-5 Operator subqueries will only be supported for uncorrelated IN, cf WL#10431, i.e. ANY, SOME, ALL are not allowed. NF-1 performance shouldn't be lower than when using a non-wf SQL rewrite with same semantics NF-2 Avoid non-linear performance degradation for large frames when competition avoids it. NF-3 performance of non-wf queries shouldn't be affected in any significant waySyntax ====== New reserved words: CUME_DIST DENSE_RANK FIRST_VALUE GROUPS LAG LAST_VALUE LEAD NTH_VALUE NTILE OVER PERCENT_RANK RANK ROW_NUMBER WINDOW Although the above are all reserved words according to the standard, PostgreSQL allows all the above as column names, except WINDOW. Presently, the implementation follows the standard. New keywords (unreserved): EXCLUDE FOLLOWING NULLS OTHERS PRECEDING RESPECT TIES UNBOUNDED Productions =========== query_specification: : opt_having_clause opt_window_clause /* new */ : | : opt_having_clause opt_window_clause /* new */ : simple_expr: : | sum_expr | window_func_call /* new */ : sum_expr: /* all aggregates below now have a new opt_windowing_clause */ AVG '(' in_sum_expr ')' opt_windowing_clause | AVG '(' DISTINCT in_sum_expr ')' opt_windowing_clause | BIT_AND '(' in_sum_expr ')' opt_windowing_clause | BIT_OR '(' in_sum_expr ')' opt_windowing_clause | BIT_XOR '(' in_sum_expr ')' opt_windowing_clause | JSON_ARRAYAGG '(' in_sum_expr ')' opt_windowing_clause | JSON_OBJECTAGG '(' in_sum_expr ',' in_sum_expr ')' opt_windowing_clause | COUNT '(' opt_all '*' ')' opt_windowing_clause | COUNT '(' in_sum_expr ')' opt_windowing_clause | COUNT '(' DISTINCT expr_list ')' opt_windowing_clause | MIN '(' in_sum_expr ')' opt_windowing_clause | MIN '(' DISTINCT in_sum_expr ')' opt_windowing_clause | MAX '(' in_sum_expr ')' opt_windowing_clause | MAX '(' DISTINCT in_sum_expr ')' opt_windowing_clause | STD '(' in_sum_expr ')' opt_windowing_clause | VARIANCE '(' in_sum_expr ')' opt_windowing_clause | STDDEV_SAMP '(' in_sum_expr ')' opt_windowing_clause | VAR_SAMP '(' in_sum_expr ')' opt_windowing_clause | SUM '(' in_sum_expr ')' opt_windowing_clause | SUM '(' DISTINCT in_sum_expr ')' opt_windowing_clause | GROUP_CONCAT '(' opt_distinct expr_list opt_gorder_clause opt_gconcat_separator ')' opt_windowing_clause window_func_call: ROW_NUMBER '(' ')' windowing_clause | RANK '(' ')' windowing_clause | DENSE_RANK '(' ')' windowing_clause | CUME_DIST '(' ')' windowing_clause | PERCENT_RANK '(' ')' windowing_clause | NTILE '(' simple_expr ')' windowing_clause | LEAD '(' expr opt_lead_lag_info ')' opt_null_treatment windowing_clause | LAG '(' expr opt_lead_lag_info ')' opt_null_treatment windowing_clause | FIRST_VALUE '(' expr ')' opt_null_treatment windowing_clause | LAST_VALUE '(' expr ')' opt_null_treatment windowing_clause | NTH_VALUE '(' expr ',' simple_expr ')' opt_from_first_last opt_null_treatment windowing_clause opt_lead_lag_info: /* Nothing */ | ',' NUM_literal opt_ll_default opt_ll_default: /* Nothing */ | ',' expr opt_null_treatment: /* Nothing */ | RESPECT NULLS | IGNORE NULLS opt_from_first_last: /* Nothing */ | FROM FIRST | FROM LAST opt_windowing_clause: /* Nothing */ | windowing_clause windowing_clause: OVER window_name_or_spec window_name_or_spec: window_name | window_spec window_name: ident window_spec: '(' window_spec_details ')' window_spec_details: opt_existing_window_name opt_partition_clause opt_window_order_by_clause opt_window_frame_clause opt_existing_window_name: /* Nothing */ | window_name opt_partition_clause: /* Nothing */ | PARTITION BY group_list opt_window_order_by_clause: /* Nothing */ | ORDER BY order_list opt_window_frame_clause: /* Nothing*/ | window_frame_units window_frame_extent opt_window_frame_exclusion window_frame_extent: window_frame_start | window_frame_between window_frame_start: UNBOUNDED PRECEDING | simple_expr PRECEDING /* simple_expr correct? FIXME */ | CURRENT ROW window_frame_between: BETWEEN window_frame_bound AND window_frame_bound window_frame_bound: window_frame_start | UNBOUNDED FOLLOWING | simple_expr FOLLOWING opt_window_frame_exclusion: /* Nothing */ | EXCLUDE CURRENT ROW | EXCLUDE GROUP | EXCLUDE TIES | EXCLUDE NO OTHERS window_frame_units: ROWS | RANGE | GROUPS opt_window_clause: /* Nothing */ | WINDOW window_definition_list window_definition_list: window_definition | window_definition_list ',' window_definition window_definition: window_name AS window_spec The standard states that for window functions that operate on the entire partition, the opt_window_frame_clause should not be present. These are: NTILE, LEAD, LAG, ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK and CUME_DIST The current implementation allows it to be present, but ignores it for these functions, similar to PostGreSQL's behaviour. This has the benefit of allowing a mix of window functions to be used in the same windowing step. The following parts will *not* be included as part of this worklog, but left to a possible later inclusion: - All aggregate functions, except the following which we do support: COUNT SUM AVG. - All window functions, except the following which we do support: RANK DENSE_RANK ROW_NUMBER NTILE FIRST_VALUE LAST_VALUE - DISTINCT in aggregates when used as window functions - GROUPS frame units (T620) - EXCLUDE clause (opt_window_frame_exclusion) - opt_null_treatment (T616). Our semantics is RESPECT NULLS, which can optionally be specified. The syntax for IGNORE NULLS is recognized, but gives an error. - Nested window functions (T619): NOTE: syntax for T619 not included above - Dynamic frame bounds, i.e. bounds based on the value of the current row - We do not require ORDER BY unless there is a RANGE frame withPRECEDING or FOLLOWING. If there is no ORDER BY, all rows in a partition are considered peers. Example ======= SELECT name, department_id, salary, SUM(salary) OVER (PARTITION BY department_id ORDER BY name ROWS 2 PRECEDING) department_total, ROW_NUMBER() OVER w FROM employee WINDOW w AS () ORDER BY department_id, name; Semantics ========= Refer to the SQL standard (2011) for non-MySQL window functions. For MySQL specific aggregates, the semantics will be analogous to standard aggregate functions, cf. for GROUP_CONCAT and any new JSON aggregates we would aggregate for all rows in the actual frame. We extend the standard by allowing the partition clause to use expressions, not just column names, e.g partitioning on a substring of a column. The aggregation always happens in the SELECT on the level where the window function is used. This is per the standard. We extend that standard by allowing window function arguments and expression in a window's ORDER BY and PARTITION by clauses to reference outer queries' columns and aliases, similar to what PG allows. Non-deterministic ORDERING ========================== If a window function specifies only a partial order on the set of rows, the actual order is non-deterministic, and will differ across platforms. This is allowed by the standard as long as the order is the same within a query for equal window specifications. Ordering of NULLs ================= Currently, MySQL does not allow specification of where NULLs go in ORDER BY. This also applies to PARTITION BY and ORDER BY in windows: ascending sorts NULLs first, whereas descending sorts NULLs last. (I) Parsing: The grammar as described above is implemented pretty much verbatim. The following objects are created as part of the parse to hold and represent the windows: PT_window : public Parse_tree_node, public Window PT_order_list - for PARTITION BY, if any PT_order_list - for ORDER BY, if any PT_frame - for window frame if any w/ sub-objects PT_border and PT_exclusion There are three kinds of window objects initially created: unnamed window (1) named window (2) window reference (3) After resolution, the latter isn't used. Window functions are modeled on Item_sum, which now have an optional member field Item_sum::m_window, which point to (1) or (3) before resolution, and to (1) or (2) after resolution. The item classes for the supported window functions are shown below (in square brackets: the ones implemented in WL#9603). Item_sum ^ |__________________________________________ | \ \| | | Item_first_last_value Item_rank (incl. dense) Item_sum_sum [Item_nth_value] Item_row_number Item_sum_avg Item_ntile Item_sum_count [Item_cume_dist] : [Item_percent_rank] [the rest not [Item_lead_lag] supported] (II) Contextualizing is driven from PT_window::contextualize. All used windows for a SELECT is to be found in the list SELECT_LEX::m_windows after the Item_sum's have been itemized. (III) Resolution and static semantic checking References from window function to a named window are resolved by Window::resolve_reference, called from Item_sum::fix_fields called from SELECT_LEX::prepare before Window::setup_windows, see below. Checking of windows is driven by Window::setup_windows called from SELECT_LEX::prepare which resolves any inter window references (ancestor, represented as Window:m_ancestor). Each window has a list of all the window functions that use it, Window::m_functions. Window::setup_windows will also check the semantic requirements of individual window functions on the window by calling Window::check_window_functions. This method will collect the dynamic requirements of the window function as far as which execution strategy must be used. All window function for a window will be evaluated in the same pass, so the strategy will reflect the requirements of the "most hard" window function used on a window. (IV) Optimization Derived table merging is switched off if the subquery has windows, it is always materialized. Semijoin isn't applicable as it applies to subqueries in WHERE and JOIN ON condition, which condition cannot contain any window function. Most of the windows magic during optimization (or rather execution planning in our case) happens while calling JOIN::make_tmp_tables_info. Essentially we insert one or more tmp table step after any grouping (GROUP BY/HAVING) but before any final ORDER BY. We decorate each such step with initial sorting if required by concatenating the expression from PARTITION BY and ORDER BY. This will give the right order of the result set rows for dividing it into partitions (all rows with the same partition column values will be adjacent), and then within each partition the rows will be sorted according to any ORDER BY, so the rows come in the right order for windows processing. No PARTITION BY means we have a trivial partition: the entire result set. Each extra step corresponds to the evaluation of all window functions using one window. At the outset, windows are processed in the order they are declared, but we sometimes reorder them: N windows that have the same ordering requirements (in the sense of the SQL standard) are processed in sequence, so we can skip sorting for windows 2..N. No attempt is (yet) made to to merge windows which could be evaluated in one step, e.g. if several window specifications are identical. The work-around is obvious: use a named window instead. If the window step contains window functions in conjunction with a window specification that requires buffering, an extra temporary table (the window frame buffer, or the "frame buffer" for short) is created per step to hold the rows so we can compute the window functions by (re)visiting the rows as needed during execution of that window step. The logic in create_tmp_table is extendedto handle window functions: we make sure that only the window functions pertaining to the window of that step get evaluated in that step. In later steps the result of previous steps are normal fields, and after the last step in the outgoing result set all functions have been finally evaluated to fields. The set of item indirections JOIN::ref_items, all_fields/tmp_all_fields has been extended to include the new tmp table steps, and so now are dynamically allocated, see class comments of JOIN::ref_items and JOIN::alloc_indirection_slices. (V) Execution: It all comes together in sql_executor.cc's end_write_wf (a copy and extension of end_write): this is where we encounter the (possibly sorted) rows as input, and before we write them on to the out table (and possibly another windowing step) we have a chance to act on them and evaluate their window functions (the functions to be evaluated for this step can be found in Temp_table_param::items_to_copy) as determined by the logic make_tmp_tables_info, see above. A windowing step normally produces a temporary out table, which is used as an input table by the processing of the next window, if any. If there is no more windowing steps, nor a final DISTINCT or ORDER BY step, we optimize away this output table, and just send rows out. By "optimize away" we mean the TABLE object is created in memory but not in the engine (so, doesn't get any read/writes). If evaluation of the window requires a frame buffer, we write the input table to a temporary frame buffer table, which we iterate over as we evaluate window functions for the result set and either write the next out table or send the rows on. An (optimized) alternative would have been to evaluate on the incoming table as rows are read, possibly re-positioning that (tmp) table in lieu of a frame buffer, but I didn't see an easy way to do this with the existing code structure. In any case, the input is not guaranteed to be in the form of a table that can be (re)positioned: indeed if the first window doesn't have any ordering requirements, we do not force the input result set to be buffered to a file; and if the first window does sorting, the output of filesort isn't re-positionable. If we don't require buffering, the logic is pretty simple, we just detect partition boundaries and evaluate window functions as we go ("on the fly") in the "copy_funcs" call. If we do require buffering, we use a different code path, which in addition to partition detection, buffers rows: they are written to the frame buffer using the method buffer_windowing_record until enough rows (perhaps the entire partition, depending) has been read, and then they are evaluated by the main method for buffered processing: process_buffered_windowing_record, which can iterate over the frame buffer. Most of the complexity of the actual evaluation of the window functions is in the method process_buffered_windowing_record and in the val_xxx methods of Item_sum sub-classes representing the window function. See appendix A for details on process_buffered_windowing_record. Since we need to potentially visit the same row several times, we need to read the row as quickly as possible given its row number. We do not add an index to the frame buffer, but rather rely on file positions(¹). We keep track of file positions from evaluations of earlier rows and take advantage of the fact that the sliding windows' border increase monotonically as long as the frame bounds are static in the query (current limitation). By recording a suitable set of file positions we are able to find any row we need by maximum two scan reads: one to re-read using a former read row using its recorded location, and one to read the next row, cf the logic in bring_back_frame_row. (¹) The MEM heap engine didn't provide the necessary functionality for this, since a positioned read did not update the current position in contrast to InnoDB. The current work adds this functionality. A note on RANGE frames: this requires an ORDER BY in the window frame, and integral or date/time type of the ordering expression. The determination of whether a row is outside the set, either before or after the range indicated, is achieved by setting up a set of expressions at prepare-time, cf Item *Window::m_comparators[]. We poke in the value of the ORDER BY expression of the current row and evaluate the expression on the candidate row, cf logic in Cached_item_int::copy_to_Item_cache (and same for Cached_item_real, Cached_item_decimal, Cached_item_temporal), called from Window::before_frame and Window::after_frame. The other Cached_item_xxx are not used for windowing. (VI) Cleanup is called from SELECT_LEX::cleanup, the method being Window::cleanup. (VII) In Opt trace. As this WL creates potentially many temp tables, some tracing of temp table creation is added. (VIII) in EXPLAIN In JSON format, windowing (the process of sorting data and calculating functions) is represented like this: { "query_block": { "select_id": 1, "cost_info": { "query_cost": "0.65" <<< usual preamble }, ... "windowing": { <<<<<<<<<<< new "windows": [ { "name": " ", "using_temporary_table": true, "using_filesort": true, "filesort_key": [ "`j`" ], "functions": [ "sum" ] } ], "table": { <<<<<<<< the usual list of tables "table_name": "t", In traditional format, it's impossible to provide windowing information (which is too rich), so we print nothing, and send a warning: "To get information about window functions use EXPLAIN FORMAT=JSON" (IX) Printing - SELECT_LEX_UNIT::print() is augmented to print out window functions and window definition so windowing can be used inside VIEW definitions (Appendix A - process_buffered_windowing_record) This is culled from the Doxygen of the method. It is included here because it describes major design dimension of the execution strategy. /** While there are more unprocessed rows ready to process given the current partition/frame state, process such buffered rows by evaluating/aggregating the window functions defined over this window on the current frame, moving the frame if required. This method contains the main execution time logic of the evaluation window functions if we need buffering for one or more of the window functions defined on the window. Moving (sliding) frames can be executed using a naive or optimized strategy for aggregate window functions, like SUM or AVG (but not necessarily for MIN/MAX). In the naive approach, for each row considered for processing from the buffer, we visit all the rows defined in the frame for that row, essentially leading to N*M complexity, where N is the number of rows in the result set, and M is the number for rows in the frame. This can be slow for large frames, obviously, so we can choose an optimized evaluation strategy using inversion. This means that when rows leave the frame as we move it forward, we re-use the previous aggregate state, but compute the *inverse* function to eliminate the contribution to the aggregate by the row(s) leaving the frame, and then use the normal aggregate function to add the contribution of the rows moving into the frame. The present method contains code paths for both strategies. For integral data types, this is safe in the sense that the result will be the same if no overflow occurs during normal evaluation. For floating numbers, optimizing in this way may lead to different results, so it is not done by default, cf the session variable "windowing_use_high_precision". Since the evaluation strategy is chosen based on the "most difficult" window function defined on the window, we must also be able to evaluate non-aggregates like ROW_NUMBER, NTILE, FIRST_VALUE in the code path of the optimized aggregates, so there is redundant code for those in the naive and optimized code paths. Note that NTILE forms a class of its own of the non-aggregates: it needs two passes over the partition's rows since the cardinality is needed to compute it. Furthermore, FIRST_VALUE and LAST_VALUE heed the frames, but they are not aggregates. The is a special optimized code path for *static aggregates*: when the window frame is the entire partition and there is no ORDER BY specified, the value of the framing window functions, i.e. SUM, AVG, FIRST_VALUE, LAST_VALUE can be evaluated once for all and saved when we visit and evaluate the first row of the partition. For later rows we restore the aggregate values and just fill in the other fields and evaluate non-framing window functions for the row. The code paths both for naive execution and optimized execution differ depending on whether we have ROW or RANGE boundaries in a explicit frame. Another special optimized code path ("dynamic_upper_optimizable") exists for the case where we have a dynamic upper frame limit based on the value of the current rows ORDER BY expression, but otherwise have no explicit frame: in this case the last row in the frame is the last peer row of the current row, which can be many rows out. In that case also, we save the evaluated aggregates and restore them for the next row, as long as we haven't left the peer set. Once we do, we add all the new contributions and make a new saved result for the new peer set. This code shares some mechanisms with the optimized aggregates path for RANGE frames. A word on BLOBs. Below we make copies of rows' records into the frame buffer. This is a temporary table, so BLOBs get copied in the normal way. Sometimes we save records containing already computed framing window functions away into memory only: is the lifetime of the referenced BLOBs long enough? We have two cases: BLOB results from wfs: Any BLOB results will reside in the copies in result fields of the Items ready for the out table, so they no longer need any BLOB memory read from the frame buffer tmp table. BLOB fields not evaluated by wfs: Any other BLOB field will be copied as well, and would not have life-time past the next read from the frame buffer, but they are never used since we fill in the fields from the current row after evaluation of the window functions, so we don't need to make special copies of such BLOBs. This can be (and was) tested by shredding any BLOBs deallocated by InnoDB at the next read. We also save away in memory the next record of the next partition while processing the current partition. Any blob there will have its storage from the read of the input table, but we won't be touching that for reading again until after we start processing the next partition and save the saved away next partition row to the frame buffer. @param thd Current thread @param param Current temporary table @param out_table Results of this windowing step go here @param new_partition_or_eof True if we are about to start a new partition, or eof @param[out] output_row_ready True if there is a row record ready to write. @return true if error */ static bool process_buffered_windowing_record(const THD *thd, Temp_table_param *param, TABLE *out_table, const bool new_partition_or_eof, bool *output_row_ready) (Appendix B - detailed comments on what's changed in individual files) is available on request from the implementor.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.