MySQL 8.0.32
Source Code Documentation
anonymous_namespace{} Namespace Reference


class  AggregateRowEstimator
 This class finds disjoint sets of aggregation terms that form prefixes of some non-hash index, and makes row estimates for those sets based on index metadata.


using TermArray = Mem_root_array< const Item * >
 Array of aggregation terms.


double EstimateAggregateRows (const TermArray &terms, double child_rows, string *trace)
 Estimates the row count of an aggregate operation from index metadata, histograms, table sizes, or the input row estimate, in that order of priority.

Typedef Documentation

◆ TermArray

using anonymous_namespace{}::TermArray = Mem_root_array<const Item *>

Array of aggregation terms.

Function Documentation

◆ EstimateAggregateRows()

double anonymous_namespace{}::EstimateAggregateRows ( const TermArray &  terms,
double  child_rows,
string *  trace 
)

We use the following data to make a row estimate, in order of priority:

  1. (Non-hash) indexes where the aggregation terms form some prefix of the index key. The handler can give good estimates for these.
  2. Histograms for aggregation terms that are fields. The histograms give an estimate of the number of unique values.
  3. The table size (in rows) for terms that are fields without histograms. (If we have "SELECT ... FROM t1 JOIN t2 GROUP BY t2.f1", there cannot be more result rows than there are rows in t2.) We also make the pragmatic assumption that field values are not unique, and therefore make a row estimate somewhat lower than the table row count.
  4. In the remaining cases we make an estimate based on the input row estimate. This is based on two assumptions: a) There will be fewer output rows than input rows, as one rarely aggregates on a set of terms that are unique for each row, b) The more terms there are, the more output rows one can expect.
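
The priority order above can be illustrated with a minimal sketch. It is not the MySQL implementation: AvailableStats, PickEstimate, kNonUniqueScale and the fallback exponent are all hypothetical names and values chosen for illustration, and the real code works on sets of Item terms rather than on a plain struct.

```cpp
#include <cassert>
#include <cmath>
#include <optional>

// Hypothetical container for the statistics available for one group of
// aggregation terms. None of these names come from the MySQL source.
struct AvailableStats {
  std::optional<double> index_prefix_rows;   // 1. estimate from a non-hash index prefix
  std::optional<double> histogram_distinct;  // 2. distinct values from a histogram
  std::optional<double> table_rows;          // 3. row count of the owning table
};

// Placeholder for "somewhat lower than the table row count".
// The value 0.5 is an assumption made for this sketch only.
constexpr double kNonUniqueScale = 0.5;

// Returns the estimate from the highest-priority source that is present.
double PickEstimate(const AvailableStats &stats, double child_rows,
                    int term_count) {
  assert(term_count > 0 && child_rows >= 0.0);
  if (stats.index_prefix_rows) return *stats.index_prefix_rows;
  if (stats.histogram_distinct) return *stats.histogram_distinct;
  if (stats.table_rows) return *stats.table_rows * kNonUniqueScale;
  // 4. Fallback: grows with the number of terms, but stays below child_rows
  // whenever child_rows > 1. The exponent is an illustrative choice.
  return std::pow(child_rows, term_count / (term_count + 1.0));
}
```

With this sketch, a group that has both an index estimate and a histogram uses the index estimate, and a group with no statistics at all falls back to a value derived from child_rows that increases as more terms are added.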

We may need to combine multiple estimates into one. As an example, assume that we aggregate on three fields: f1, f2 and f3. There is an index where f1, f2 are a key prefix, and we have a histogram on f3. Then we could make good estimates for "GROUP BY f1,f2" or "GROUP BY f3". But how do we combine these into an estimate for "GROUP BY f1,f2,f3"? If f3 and f1,f2 are uncorrelated, then we should multiply the individual estimates. But if f3 is functionally dependent on f1,f2 (or vice versa), we should pick the larger of the two estimates.

Since we do not know if these fields are correlated or not, we multiply the individual estimates and then multiply the product by a damping factor. The damping factor is a function of the number of estimates (two in the example above). That way, we get a combined estimate that falls between the two extremes of functional dependence and no correlation.
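
A minimal sketch of this combination step follows. The documentation does not give the actual damping function, so CombineEstimates and kDampingPerExtraEstimate are hypothetical, and the explicit clamp between the two extremes is an assumption added here to make the stated bounds visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical damping constant applied once per additional estimate.
// The real damping function is not specified in this documentation.
constexpr double kDampingPerExtraEstimate = 0.5;

// Combines independent row estimates into one estimate that lies between
// the largest single estimate (full functional dependence) and the plain
// product (no correlation).
double CombineEstimates(const std::vector<double> &estimates) {
  assert(!estimates.empty());
  double product = 1.0;
  double largest = 0.0;
  for (double e : estimates) {
    assert(e >= 1.0);  // keeps largest <= product, as std::clamp requires
    product *= e;
    largest = std::max(largest, e);
  }
  // The damping depends only on how many estimates are combined.
  const double damping = std::pow(
      kDampingPerExtraEstimate, static_cast<double>(estimates.size() - 1));
  return std::clamp(product * damping, largest, product);
}
```

For example, with an index estimate of 100 rows for "GROUP BY f1,f2" and a histogram estimate of 10 distinct values for f3, this sketch returns 500, which lies between the functional-dependence extreme (100) and the no-correlation extreme (1000).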

Parameters
    terms       The aggregation terms.
    child_rows  The row estimate for the input path.
    trace       Append optimizer trace text to this if non-null.

Returns
    The row estimate for the aggregate operation.