MySQL 8.0.29
Source Code Documentation
interesting_orders.h File Reference

Tracks which tuple streams follow which orders, and in particular whether they follow interesting orders. More...

#include "my_table_map.h"
#include "sql/join_optimizer/interesting_orders_defs.h"
#include "sql/mem_root_array.h"
#include "sql/sql_array.h"
#include <bitset>
#include <string>


Classes

struct  FunctionalDependency
 
class  LogicalOrderings
 
struct  LogicalOrderings::ItemInfo
 
struct  LogicalOrderings::NFSMState
 
struct  LogicalOrderings::DFSMState
 
struct  LogicalOrderings::NFSMEdge
 
struct  LogicalOrderings::DFSMEdge
 
struct  LogicalOrderings::OrderingWithInfo
 

Typedefs

using Ordering = Bounds_checked_array< OrderElement >
 

Functions

bool IsGrouping (Ordering ordering)
 

Detailed Description

Tracks which tuple streams follow which orders, and in particular whether they follow interesting orders.

An interesting order (and/or grouping) is one that we might need to sort by at some point during query execution (e.g. to satisfy an ORDER BY predicate); if the rows already are produced in that order, for instance because we scanned along the right index, we can skip the sort and get a lower cost.

We generally follow these papers:

[Neu04] Neumann and Moerkotte: “An efficient framework for order optimization”
[Neu04b] Neumann and Moerkotte: “A Combined Framework for Grouping and Order Optimization”

[Neu04b] is an updated version of [Neu04] that also deals with interesting groupings but omits some details to make more space, so both are needed. A combined and updated version of the same material is available in Moerkotte's “Query compilers” PDF.

Some further details, like order homogenization, come from

[Sim96] Simmen et al: “Fundamental Techniques for Order Optimization”

All three papers deal with the issue of logical orderings, where any row stream may follow more than one order simultaneously, as inferred through functional dependencies (FDs). For instance, if we have an ordering (ab) but also an active FD {a} → c (c is uniquely determined by a, for instance because a is a primary key in the same table as c), this means we also implicitly follow the orders (acb) and (abc). In addition, we trivially follow the orders (a), (ac) and (ab). However, note that we do not necessarily follow the order (cab).
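To make the FD-application rule concrete, here is a minimal, self-contained sketch, not taken from the MySQL source: attributes are single chars, an ordering is a string, and a hypothetical `ApplyFD` helper inserts the FD's tail at every position where all head attributes have already appeared.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Applying the FD head -> tail to an ordering yields every ordering formed by
// inserting `tail` at a position after which all head attributes have already
// occurred, provided `tail` is not already present in the ordering.
std::vector<std::string> ApplyFD(const std::string &ordering,
                                 const std::set<char> &head, char tail) {
  std::vector<std::string> result;
  if (ordering.find(tail) != std::string::npos) return result;
  std::set<char> seen;  // Attributes occurring before the current position.
  for (size_t pos = 0; pos <= ordering.size(); ++pos) {
    if (std::includes(seen.begin(), seen.end(), head.begin(), head.end())) {
      std::string extended = ordering;
      extended.insert(pos, 1, tail);
      result.push_back(extended);
    }
    if (pos < ordering.size()) seen.insert(ordering[pos]);
  }
  return result;
}
```

With the ordering (ab) and the FD {a} → c, this produces exactly the two extensions from the text, (acb) and (abc), but never (cab), since c may not appear before a.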

Similarly, equivalences, such as WHERE conditions and joins, give rise to a stronger form of FDs. If we have an ordering (ab) and the FD b = c, we can be said to follow (ac), (acb) or (abc). The former would not be inferrable from {b} → c and {c} → b alone. Equivalences with constants are perhaps even stronger, e.g. WHERE x=3 would give rise to {} → x, which could extend (a) to (xa), (ax) or (x).
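The element-replacing behaviour that distinguishes an equivalence from a pair of plain FDs can be sketched in the same toy representation (chars and strings; `ApplyEquivalence` is a hypothetical helper, not part of the MySQL source):

```cpp
#include <string>
#include <vector>

// An equivalence x = y lets us replace an occurrence of x by y (or vice
// versa) in place, rather than only inserting a new trailing element.
std::vector<std::string> ApplyEquivalence(const std::string &ordering, char x,
                                          char y) {
  std::vector<std::string> result;
  for (size_t i = 0; i < ordering.size(); ++i) {
    char other = ordering[i] == x ? y : ordering[i] == y ? x : '\0';
    // Skip positions that mention neither attribute, and replacements that
    // would duplicate an attribute already in the ordering.
    if (other == '\0' || ordering.find(other) != std::string::npos) continue;
    std::string replaced = ordering;
    replaced[i] = other;
    result.push_back(replaced);
  }
  return result;
}
```

For the ordering (ab) and the equivalence b = c, this yields (ac) directly, the ordering that the two one-directional FDs {b} → c and {c} → b alone could not produce.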

Neumann et al solve this by modelling which ordering we're following as a state in a non-deterministic finite state machine (NFSM). By repeatedly applying FDs (which become edges in the NFSM), we can build up all possible orderings from a base (which can be the empty ordering, an ordering from scanning along an index, or one produced by an explicit sort) and then check whether we are in a state matching the ordering we are interested in. (There can be quite a few states, so we need a fair amount of pruning to keep the count manageable, or else performance will suffer.)

Of course, since NFSMs are nondeterministic, a base ordering and a set of FDs can put us in a number of states at once, so we need to convert the NFSM to a DFSM (using the standard powerset construction for NFAs; see ConvertNFSMToDFSM()). This means that the ordering state for an access path is only a single integer, the DFSM state number. When we activate more FDs, for instance because we apply joins, we move through the DFSM into more attractive states. By checking simple precomputed lookup tables, we can quickly find whether a given DFSM state follows a given ordering.

The other kind of edges we follow are from the artificial starting state; they represent setting a specific ordering (e.g. because we sort by that ordering). This is set up in the NFSM and preserved in the DFSM.

The actual collection of FDs and interesting orders happens outside this class, in the caller.

A weakness in the approach is that transitive FDs are not always followed correctly. E.g., if we have an ordering (a), and FDs {a} → b and {b} → c, we will create (ab) and (abc), but not (ac). This is not a problem for equivalences, though, and most of the FDs we collect are equivalences. We do have some heuristics to produce a new FD {a} → c where it is relevant, but they are not always effective.

Neumann and Moerkotte distinguish between “tested-for” (O_T) and “producing” (O_P) orderings, where all orderings are interesting but only some can be produced by explicit operators, such as sorts. Our implementation is exactly the opposite: we allow every ordering to be produced (by means of sort-ahead), but there are orderings that can be produced (e.g. when scanning an index) that are not interesting in themselves. Such orderings can be pruned away early if we can show they do not produce anything interesting.

The operations related to interesting orders, in particular the concept of functional dependencies, are related to the ones we are doing when checking ONLY_FULL_GROUP_BY legality in sql/aggregate_check.h. However, there are some key differences as well:

  • Orderings are lexical, while groupings are just a bag of attributes. This increases the state space for orderings significantly; groupings can add elements at will and just grow the set freely, while orderings need more care. In particular, this means that groupings only need FDs of the form S → x (where S is a set), while orderings also benefit from those of the type x = y, which replace an element instead of adding a new one.
  • ONLY_FULL_GROUP_BY is about correctly rejecting or accepting the query, while interesting orders are just an optimization, so not recognizing rare cases is more acceptable there.
  • ONLY_FULL_GROUP_BY testing only cares about the set of FDs that hold at one specific point (GROUP BY testing, naturally), while interesting orders must be tracked throughout the entire operator tree. In particular, if using interesting orders for merge join, the status at nearly every join is relevant. Also, performance matters much more.

Together, these mean that the code ends up being fairly different, and some cases are recognized by ONLY_FULL_GROUP_BY but not by interesting orders. (The actual FD collection happens in BuildInterestingOrders in join_optimizer.cc; see the comment there for FD differences.)

A note about nomenclature: Like Neumann et al, we use the term “ordering” (and “grouping”) instead of “order”, with the special exception of the already-established term “interesting order”.

Typedef Documentation

◆ Ordering

Function Documentation

◆ IsGrouping()

bool IsGrouping (Ordering ordering)