MySQL 8.0.37
Source Code Documentation
estimate_selectivity.cc File Reference
#include "sql/join_optimizer/estimate_selectivity.h"
#include <sys/types.h>
#include <algorithm>
#include <initializer_list>
#include <string>
#include "my_bitmap.h"
#include "my_table_map.h"
#include "sql/field.h"
#include "sql/handler.h"
#include "sql/histograms/histogram.h"
#include "sql/item.h"
#include "sql/item_cmpfunc.h"
#include "sql/item_func.h"
#include "sql/join_optimizer/bit_utils.h"
#include "sql/join_optimizer/print_utils.h"
#include "sql/key.h"
#include "sql/sql_bitmap.h"
#include "sql/sql_const.h"
#include "sql/sql_select.h"
#include "sql/table.h"
#include "template_utils.h"

Functions

static double EstimateFieldSelectivity (Field *field, double *selectivity_cap, string *trace)
 Estimate the selectivity of (equi)joining a given field to any other field.
 
double EstimateSelectivity (THD *thd, Item *condition, string *trace)
 Try to estimate the filtering selectivity of the given condition, on a 0..1 scale (where 1.0 lets all records through).
 

Function Documentation

◆ EstimateFieldSelectivity()

static double EstimateFieldSelectivity (Field *field, double *selectivity_cap, string *trace)

Estimate the selectivity of (equi)joining a given field to any other field.

Use cardinality information from indexes, if possible. Otherwise, use a histogram if there is one. Assumes equal distribution and zero correlation between the two fields, so if there are e.g. 100 records and 4 distinct values (A,B,C,D) for the field, it assumes 25% of the values will be A, 25% B, etc. (equal distribution), and thus, when joining a row from some other table against this one, 25% of the records will match (equal distribution, zero correlation).

If there are multiple applicable estimates (e.g., from several candidate indexes), we choose the one with the largest selectivity, i.e. the least selective one. There are two main reasons for this (see the sketch below):

  • Databases generally tend to underestimate join cardinality (due to assuming uncorrelated relations); if we are wrong, it is better to err towards overestimation to compensate.
  • Overestimating the number of rows generally leads to safer choices that are a little slower for few rows (e.g., hash join). Underestimating, however, leads to choices that can be catastrophic for many rows (e.g., nested-loop joins against table scans). We should clearly prefer the less risky choice here.

Returns -1.0 if no index or no histogram was found. Lifted from Item_equal::get_filtering_effect.
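
To make the estimate concrete, here is a minimal, self-contained C++ sketch of the logic described above. The struct and function names are illustrative, not the actual MySQL types: with N rows and V distinct values, the equal-distribution assumption gives a selectivity of 1/V, and among several candidates we keep the largest (least selective) value, returning -1.0 when there is none.

#include <algorithm>
#include <vector>

// Hypothetical, simplified per-source statistics (an index prefix or a
// histogram); not the real MySQL structures.
struct DistinctValueEstimate {
  double distinct_values;  // estimated number of distinct field values (NDV)
};

// Equal distribution: N rows with V distinct values means a join against
// this field matches ~N/V rows, i.e. a selectivity of 1/V. Keep the
// largest (least selective) candidate; -1.0 means no estimate was found.
double EstimateFieldSelectivitySketch(
    const std::vector<DistinctValueEstimate> &candidates) {
  double selectivity = -1.0;
  for (const DistinctValueEstimate &c : candidates) {
    if (c.distinct_values >= 1.0) {
      selectivity = std::max(selectivity, 1.0 / c.distinct_values);
    }
  }
  return selectivity;
}

For the example above (100 records, 4 distinct values A,B,C,D), a single candidate with distinct_values = 4 yields 1.0 / 4 = 0.25: joining a row against this table is expected to match 25% of its records.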

◆ EstimateSelectivity()

double EstimateSelectivity (THD *thd, Item *condition, string *trace)

Try to estimate the filtering selectivity of the given condition, on a 0..1 scale (where 1.0 lets all records through).
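
As an illustration of how selectivities on this scale compose, here is a small standalone C++ sketch. It assumes, as selectivity estimation commonly does, that the predicates of a conjunction are independent so their surviving fractions multiply; the function name is made up for the example and is not part of this file.

#include <vector>

// Combine per-predicate selectivities for an AND condition under the
// independence assumption: the surviving fractions multiply.
double CombineConjunctionSelectivitySketch(const std::vector<double> &parts) {
  double selectivity = 1.0;  // 1.0 = all records pass
  for (double s : parts) selectivity *= s;
  return selectivity;
}

// Example: two independent predicates estimated at 0.25 and 0.1 combine
// to 0.025, i.e. 2.5% of the rows are expected to pass both.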

TODO(sgunders): In some cases, composite indexes might allow us to do better for joins with multiple predicates.