MySQL 8.4.0
Source Code Documentation
graph_simplification.h File Reference

Heuristic simplification of query graphs to make them execute faster, largely a direct implementation of [Neu09] (any references to just “the paper” will generally be to that).

#include <assert.h>
#include <stddef.h>
#include <limits>
#include <vector>
#include "my_compiler.h"
#include "priority_queue.h"
#include "sql/join_optimizer/hypergraph.h"
#include "sql/join_optimizer/online_cycle_finder.h"
#include "sql/mem_root_allocator.h"
#include "sql/mem_root_array.h"
#include "sql/sql_array.h"


Classes

class  GraphSimplifier
 
struct  GraphSimplifier::ProposedSimplificationStep
 
struct  GraphSimplifier::SimplificationStep
 
struct  GraphSimplifier::EdgeCardinalities
 
struct  GraphSimplifier::NeighborCache
 
struct  GraphSimplifier::CompareByBenefit
 
struct  GraphSimplifier::MarkNeighborCache
 

Functions

void SetNumberOfSimplifications (int num_simplifications, GraphSimplifier *simplifier)
 
void SimplifyQueryGraph (THD *thd, int subgraph_pair_limit, JoinHypergraph *graph, GraphSimplifier *simplifier)
 Repeatedly apply simplifications (in the order of most to least safe) to the given hypergraph, until it is below “subgraph_pair_limit” subgraph pairs or we can simplify it no more.
 

Detailed Description

Heuristic simplification of query graphs to make them execute faster, largely a direct implementation of [Neu09] (any references to just “the paper” will generally be to that).

This is needed for when query hypergraphs have too many possible (connected) subgraphs to evaluate all of them, and we need to resort to heuristics.

The algorithm works by evaluating pairs of neighboring joins (largely, those that touch some of the same tables), finding obviously bad pairwise orderings, and then disallowing them. I.e., if cost heuristics indicate that join A should almost certainly happen before join B, we disallow the B-before-A ordering by extending the hyperedge of B to include A's nodes. This makes the graph visually more complicated (making “simplification” a bit of a misnomer), but it reduces the search space, so that the query is generally faster to plan.
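To make the hyperedge extension concrete, here is a minimal, self-contained sketch (not the actual GraphSimplifier code) using 64-bit node bitmaps in the style of sql/join_optimizer; the Hyperedge struct and ForceOrdering helper are illustrative stand-ins, not types from this header.

#include <cstdint>

using NodeMap = uint64_t;  // Bit N set <=> node N is in the set.

struct Hyperedge {
  NodeMap left;   // Tables that must be joined in on the left side.
  NodeMap right;  // Tables that must be joined in on the right side.
};

// Force join "before" to happen before join "after" by extending
// "after"'s hyperedge with all nodes touched by "before". Afterwards,
// "after" cannot be applied until all of "before"'s tables are already
// joined in, so the bad B-before-A orderings disappear from the
// search space.
inline void ForceOrdering(const Hyperedge &before, Hyperedge *after) {
  NodeMap before_nodes = before.left | before.right;
  // Attach the nodes to whichever side already overlaps them; a real
  // implementation must also take care to keep the two sides disjoint.
  if (after->left & before_nodes) {
    after->left |= before_nodes;
  } else {
    after->right |= before_nodes;
  }
}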

Obviously, as the algorithm is greedy, it will sometimes make mistakes, yielding a more expensive (or at least higher-cost) query plan. This isn't necessarily an optimal or even particularly good algorithm; e.g., LinDP++ [Rad19] claims significantly better results, especially on queries joining 40 or more tables. However, using graph simplification allows us to handle large queries reasonably well, while still reusing nearly all of our query planning machinery (i.e., we don't have to implement a separate query planner and cost model for large queries).

Also note that graph simplification only addresses the problem of subgraph pair explosion. If each subgraph pair generates large amounts of candidate access paths (e.g. through parameterized paths), each subgraph pair will in itself be expensive, and graph simplification does not concern itself with this at all. Thus, to get a complete solution, we must also have heuristic pruning of access paths within a subgraph, which we're currently missing.

[Neu09] Neumann: “Query Simplification: Graceful Degradation for Join-Order Optimization”.

[Rad19] Radke and Neumann: “LinDP++: Generalizing Linearized DP to Crossproducts and Non-Inner Joins”.

Function Documentation

◆ SetNumberOfSimplifications()

void SetNumberOfSimplifications ( int  num_simplifications,
GraphSimplifier *  simplifier 
)

◆ SimplifyQueryGraph()

void SimplifyQueryGraph ( THD *  thd,
int  subgraph_pair_limit,
JoinHypergraph *  graph,
GraphSimplifier *  simplifier 
)

Repeatedly apply simplifications (in the order of most to least safe) to the given hypergraph, until it is below “subgraph_pair_limit” subgraph pairs or we can simplify it no more.

Since we cannot know ahead of time exactly how many simplification steps are required, we need to do this iteratively, running DPhyp (with all the actual, expensive costing removed; counting subgraph pairs only) as we go.
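As an illustration, a counting-only receiver for such a run might look like the sketch below. This is not the class the optimizer actually uses; the callback names follow the general DPhyp receiver pattern in sql/join_optimizer, but treat the exact interface shown here as an assumption.

#include <cstddef>
#include <cstdint>

using NodeMap = uint64_t;  // Bit N set <=> node N is in the subgraph.

// Hypothetical receiver: counts connected subgraph pairs, does no costing.
struct CountingReceiver {
  size_t num_subgraph_pairs = 0;

  // One call per base relation; there is nothing to cost, so just accept.
  bool FoundSingleNode(int /*node_idx*/) { return false; }

  // One call per connected subgraph/complement pair; the count is all
  // the simplification driver needs.
  bool FoundSubgraphPair(NodeMap /*left*/, NodeMap /*right*/,
                         int /*edge_idx*/) {
    ++num_subgraph_pairs;
    return false;  // false = keep enumerating.
  }

  // DPhyp only combines subgraphs it has produced plans for; claiming
  // to have seen everything keeps the enumeration complete.
  bool HasSeen(NodeMap /*subgraph*/) const { return true; }
};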

On the assumption that running DPhyp over the graph is significantly more expensive than applying a simplification step, we do this by means of binary search (what the paper calls “the full algorithm”). We first apply 1, 2, 4, 8, 16, etc. steps until we find a number that takes us below the limit, and then run a simple binary search between that number and the previous one. Once we find the border between too complicated and just simple enough, we set the graph to the latter, and the actual query planning starts afresh.
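A rough sketch of that doubling-plus-binary-search driver is below. SetNumberOfSimplifications() is the real function from this header; CountSubgraphPairs() is a hypothetical wrapper around a counting-only DPhyp run (such as the receiver sketched above), and the case where the graph cannot be simplified below the limit at all is elided.

#include "sql/join_optimizer/graph_simplification.h"

// Hypothetical: run DPhyp over the graph with a counting-only receiver
// and return the number of connected subgraph pairs.
size_t CountSubgraphPairs(const JoinHypergraph &graph);

void SimplifyUntilBelowLimit(int subgraph_pair_limit, JoinHypergraph *graph,
                             GraphSimplifier *simplifier) {
  const size_t limit = static_cast<size_t>(subgraph_pair_limit);

  // Phase 1: exponential search. Apply 1, 2, 4, 8, 16, ... steps until
  // the graph falls below the limit. (A real driver must also stop if
  // no further simplification is possible.)
  int lo = 0, hi = 1;
  for (;;) {
    SetNumberOfSimplifications(hi, simplifier);
    if (CountSubgraphPairs(*graph) <= limit) break;
    lo = hi;
    hi *= 2;
  }

  // Phase 2: binary search in (lo, hi]. Invariant: lo steps is known to
  // be too complex, hi steps is known to be simple enough.
  while (hi - lo > 1) {
    int mid = lo + (hi - lo) / 2;
    SetNumberOfSimplifications(mid, simplifier);
    if (CountSubgraphPairs(*graph) <= limit) {
      hi = mid;
    } else {
      lo = mid;
    }
  }

  // Leave the graph at the border ("just simple enough"); the actual
  // query planning then starts afresh on this graph.
  SetNumberOfSimplifications(hi, simplifier);
}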