WL#6635: Make use of condition filtering in the optimizer

Status: Complete   —   Priority: Medium

When the join optimizer calculates the order tables should be joined, fanout is
of great importance. The fanout for a table 't_y' in a join plan consisting of
(t_1,...,t_x,t_y,..) is the average number of rows from 't_y' that will match
each partial row combination from (t_1,...,t_x) and all predicates applicable
to 't_y'. It is important to keep the fanout as low as possible because it
directly influences how many row combinations will be joined with the next table
in the join order.

Currently, MySQL only considers:
 * the fanout that is directly determined by the chosen access method. 
 * a very basic heuristic: 
   - If 'ref' access is applicable, the fanout for 'scan' is 25% of the 
     number of rows in the table. 
   - If 'range' access is applicable, the fanout for 'scan' equals the 
     number of rows estimated by 'range' access relative to number of rows
     in the table ("rows_estimate_range_access/rows_in_table").

This worklog is for making the join optimizer consider fanout estimates for any
access method and for all conditions (except subqueries [1] and outer join ON
predicates [2]), including non-indexed columns, when calculating partial plans.
By taking this into account, query conditions can be evaluated earlier during
query execution which in turn can reduce the number of inspected rows.

[1] Subquery predicates of the form "ty.col IN/ALL/ANY (subquery)" reduce the
fanout for table 'ty', but on the other hand the cost of evaluating the subquery
is currently not part of the cost model. Without a cost estimate of evaluating
the subquery, we cannot determine if it is beneficial to move 'ty' early in the
join sequence. 

[2] Technical reason: ON predicates of INNER JOINs are automatically part of the
join condition and are therefore part of condition filtering whereas ON
predicates for OUTER JOINs are stored in TABLE_LIST but are not available until
make_outerjoin_info() has been called. That happens after filtering is
calculated, so at filtering time the conditions are not easily available. It may
be decided later that the these predicates should also be taken into account.

Facts to take into account - selectivity in other products:
===========================================================

Selinger selectivity:
~~~~~~~~~~~~~~~~~~~~~
The following are "selectivity factors" as suggested by Selinger et al. in
"Access Path Selection in a Relational Database Management System", 1979

Selectivity for non-indexed columns:
------------------------------------
AND       : P(A and B) = P(A) * P(B)
OR        : P(A or B)  = P(A) + P(B) - P(A and B)
=         : 1/10
<,>       : 1/3
BETWEEN   : 1/4
IN (list) : MIN(#items_in_list * SEL(=), 1/2)
IN subq   : [1]
NOT OP    : 1-SEL(OP)

[1] Quote from paper:
columnA IN subquery:
F = (expected cardinality of the subquery result) /
    (product of the cardinalities of all the relations in the subq FROM list)
This formula is derived by the following argument: 
Consider the simplest case, where subq is of the form 
"SELECT columnB FROM relationC ... ". Assume that the set of all columnB
values in relationC contains the set of all columnA values. If all the 
tuples of relationC are selected by the subquery, then the predicate is
always TRUE and F=1. If the tuples of the subquery are restricted by a 
selectivity factor F', then assume that the set of unique values in the 
subquery result that match columnA values is proportionately restricted, i.e.
the selectivity factor for the predicate should be F'. F' is the product of
all the subquery's selectivity factors, namely 
(subq cardinality) / (cardinality of all possible subquery answers). With a 
little optimism, we can extend this reasoning to include subqueries which
are joins and subqueries in which columnB is replaced by an arithmetic 
expression involving column names. This leads to the formula given above.

PostgreSQL selectivity:
~~~~~~~~~~~~~~~~~~~~~~~

Selectivity calculation in PostgreSQL is done in
/src/backend/optimizer/path/clausesel.*

PostgreSQL relies on the following statistics:
  * Most Common Value (MCV) array for the 100 most common
    values in a table
  * "rec_per_key" for values not in MCV
  * histograms

These statistics are checked first and are used to calculate selectivity if
applicable. Otherwise, the values below are used to estimate selectivity for
comparison operations. 

Selectivity for non-indexed columns:
------------------------------------
=         : NULL          -> return 0, 
            BOOL datatype -> return 0.5, 
            <200 rows     -> return 1/rowcount
            >200 rows     -> return DEFAULT_EQ_SEL
<,<=,>,>= : NULL          -> return 0,
            DEFAULT_INEQ_SEL
BETWEEN   : DEFAULT_RANGE_INEQ_SEL
LIKE      : exact match   -> see "="
            DEFAULT_MATCH_SEL
IS NULL   : DEFAULT_UNK_SEL
NOT OP    : 1-SEL(OP)

Default values:
---------------
/* default selectivity estimate for equalities such as "A = b" */
#define DEFAULT_EQ_SEL	0.005

/* default selectivity estimate for inequalities such as "A < b" */
#define DEFAULT_INEQ_SEL  0.3333333333333333

/* default selectivity estimate for range inequalities "A > b AND A < c" */
#define DEFAULT_RANGE_INEQ_SEL	0.005

/* default selectivity estimate for pattern-match operators such as LIKE */
#define DEFAULT_MATCH_SEL	0.005

/* default number of distinct values in a table */
#define DEFAULT_NUM_DISTINCT  200

/* default selectivity estimate for boolean and null test nodes */
#define DEFAULT_UNK_SEL			0.005
#define DEFAULT_NOT_UNK_SEL		(1.0 - DEFAULT_UNK_SEL)
Functional requirements:
------------------------
F1) The feature shall be turned on and off by using the optimizer_switch
    "set optimizer_switch='condition_fanout_filter=off'". The default is 
    'on'. When the feature is off, all query execution plans shall remain
    unchanged.

F2) The feature shall take condition filtering effects into
    account when deciding on the table join order. This may
    result in changed query execution plans if and only if the
    fanout estimate for one or more of the tables changes.
    
F3) EXPLAIN will show the filtering effect of 'post read filtering' in 
    the 'filtered' column.

F4) Post read filtering estimates shall only contain estimates
    from predicates that are not part of the access method and
    only if all tables that predicate depends upon are earlier in
    the join sequence. For complete rules, see "Requirements: how
    to calculate post read filtering estimates".

Non-functional requirements:
----------------------------
NF1) Performance when the feature is 'off': No performance degradation
     is acceptable.
NF2) Performance when the feature is 'on':
     a) The performance shall be as good as or better than before for 95%
        of all test queries.
     b) A performance degradation of 5% for up to 5% of the queries is 
        tolerable in the general case
     c) A higher performance degradation clearly caused by skewed or 
        covariant data [1] is acceptable as long as NF1 is satisfied.

[1] Covariance between column values (column values "city=NewYork" and
    "country=USA" are clearly covariant) cannot be predicted without 
    covariance histograms. Such histograms are not even planned for the 
    5.8 time-frame
THE PROBLEM
===========

Consider the following simple self-join:

EXPLAIN 
SELECT * 
FROM t1 AS t1a JOIN t1 AS t1b ON t1a.idx_col=t1b.idx_col 
WHERE t1b.non_idx_col=5;
id select_type  table  type  key     key_len  ref         rows  Extra
1  SIMPLE       t1a    ALL   NULL    NULL     NULL        1000  Using where
1  SIMPLE       t1b    ref   idx_col 5        t1a.idx_col 8     Using where

For this join, MySQL may choose either (t1a,t1b) or (t1b,t1a) join order. The
join order doesn't matter when deciding access method; in either case the QEP
will be table scan on the first table and 'ref' access for the second. MySQL
considers the cost of these join orders equal, and the chosen join order will be
the same as the order of the tables in the query.

What MySQL fails to recognize here is the filtering effect of the WHERE
predicate on t1b.non_idx_col=5. Let's assume for a second that there are 1000
rows in table t1, and that 25% of these rows satisfy the WHERE predicate. Then
MySQL will have to do the following work for the two join orders:

Join order 1: (t1a,t1b)
-----------------------
Scan t1a (one time)
For 1000 rows in t1a: Read and join on average 8 rows from t1b 
                      (=8000 'ref' accesses)
For 8000 joined rows: Apply WHERE predicate (2000 rows will pass)
Return 2000 rows

Join order 2: (t1b,t1a)
-----------------------
Scan t1b (one time)
For 1000 rows in t1b:  Apply WHERE predicate (250 rows will pass)
For 250 rows from t1b: Read and join on average 8 rows from t1a 
                       (=2000 'ref' accesses)
Return 2000 rows

Thus, in this example, it would be much better if the join optimizer had chosen
join order 2. To put this in the context of fanout, we have that:
 * The fanout for the first table in join order 1 is 1000 
 * The fanout for the first table in join order 2 is 250


TYPES OF FILTERING
==================

There are two types of filtering, and the fanout is the combination of these:

 * Access type filtering: The filtering effect we get from using an access
   method that reads fewer than all rows in the table. From the example
   above, 'ref' access on the second table in the join order enables us to 
   read only 8 rows (instead of all 1000 rows) for each row in the first table.
 * Post read filtering: The filtering effect we get from evaluating 
   conditions on rows after they are read from a table, but before joining
   with the next table in the join order.

Access type filtering is by far the most important of these. To see why, let's
revisit the example query using join order (t1a,t1b):

QEP with 'ref' on t1b (as shown above, repeated for readability):
-----------------------------------------------------------------
Scan t1a (one time)
For 1000 rows in t1a: Read and join on average 8 rows from t1b 
                      (=8000 'ref' accesses)
For 8000 joined rows: Apply WHERE predicate (2000 rows will pass)
Return 2000 rows

QEP with table scan on t1b (only 'post read filtering'):
--------------------------------------------------------
Scan t1a (one time)
For 1000 rows in t1a: Scan table t1b and join 1000 rows with those read 
                      from t1a (1M joins performed)
For 1M joined rows:   Apply ON and WHERE predicate (2000 rows will pass)
Return 2000 rows

Thus, MySQL does a lot more work if t1b is accessed through a table scan. Note,
however, that the *fanout* for t1b is 2 in both cases ("input" is 1000 rows
from t1a, "output" is 2000 rows => fanout=2).

The join optimizer already considers 'access type filtering' when it calculates
the cost of different join orders. In addition, some very basic 'post read
filtering' heuristic is taken into account for 'scan' access types. This
worklog will change the join optimizer so that 'post read filtering' is
considered for all access types and all types of conditions that refer to
fields, regardless of whether or not these fields are indexed. 'Access type
filtering' will remain unchanged, and the old-style 'post read filtering' for
'scan' access types will be replaced.


REQUIREMENTS: HOW TO CALCULATE POST READ FILTERING ESTIMATES
============================================================

A statement's 'post read filtering' effect will be calculated for the 
subset of predicates in JOIN::conds that apply to the table currently being
processed.

The following are requirements for making 'post read filtering' estimates for
table 't_y' in join order (t_1,...,t_x,t_y,...):

R1) Only predicates applicable to 't_y' shall be part of the calculation.

R2) If predicate 'a' applies to table 't_y' but can only be evaluated if
    table 't_a' has been read, the predicate is only part of the calculation
    if 't_a' is in (t_1,...,t_x).

R3) Predicates already part of 'access type filtering' shall not be part of 
    the calculation. If, e.g., a query contains join condition 
    "t_a.col = t_y.col" and table t_y is read using 'ref' access with
    t_a.col as input, this condition will not contribute to 'post read
    filtering'.

R4) The 'post read filtering' estimate shall be visible in the 'filtered'
    column of EXPLAIN.

For many predicates, MySQL will not have a high quality source of information
about the filtering effect. For example, a column that is not indexed will not
have any index statistic or range access row estimate. The following are
requirements for how to estimate the filtering effect.

R5) Filtering estimates for a predicate shall use the source of information 
    with highest quality that is available. Filter estimate sources (sorted on
    quality of estimate): RANGE access estimate, [histograms (later)], 
    index statistics, guesstimate.

R6) Since MySQL has no statistics about covariance between columns, 
    estimates from multiple predicates shall be calculated under the 
    assumption that there is no covariance.

Update: During testing it became clear that a lower bound on the filtering 
effect for a table was needed to avoid breakdown of the cost model. The 
MySQL cost calculations depend on the fact that adding a table to the join
plan increases the cost and that the number of produced rows is a positive
number greater than zero. The lower bound is set so that the minimum fanout
for a table is 0.05 "rows". In other words, this rule is enforced:
"rows_fetched_by_access_method * filter >= 0.05"

Filter calculation details
~~~~~~~~~~~~~~~~~~~~~~~~~~
As specified in R5, range estimates and index statistics are considered to be
better sources for the filtering effect than mere guesstimates. However, when
none of the higher level information sources are available or applicable, the
filtering effect falls back to the guesstimates listed below.

The values are based on the non-indexed guesstimates made by PostgreSQL. See
"Selectivity in other products" in the HLD.

Operation | Value
----------+-------------------------------------------------------------
AND       | Calculated as "Conjunction of independent events":
          | P(A and B) = P(A) * P(B)
OR        | Calculated as "Disjunction of independent events":
          | P(A or B)  = P(A) + P(B) - P(A and B)
<col> IN  | Values in IN are known to not intersect. 
          | Calculated as "Disjunction of events without replacement":
          | MIN(#items_in_list * SEL(=), 1/2)
<list> IN | Calculated as Conjunction of independent '<col> IN' events:
          | SEL (<list> IN) = SEL(col_1 IN) * ... * SEL(col_n IN)
XOR       | Calculated as "Exactly one of independent events":
          | P(A and not B) + P(B and not A) = P(A) + P(B) - 2*P(A and B)
=         | MAX(0.005, 1/rowcount)
<=>       | SEL(=)
<,<=      | MAX(1/3, 1/rowcount)
>, >=     | MAX(1/3, 1/rowcount)
BETWEEN   | MAX(1/9, 1/rowcount)
LIKE      | SEL(BETWEEN)
IS NULL   | SEL(=)
NOT OP    | 1 - SEL(OP)
Fulltext  | SEL(=)
Subquery  | 1 (see WL#7384)


USER VISIBLE EFFECTS
====================

The result of this worklog will be:

1) The join order of queries (and other multi-table DML statements) may
   change because adding 'post read filtering' effects to the join order
   calculations will change the join costs.
2) EXPLAIN will show the filtering effect of 'post read filtering' in 
   the 'filtered' column.

HOW TO DEAL WITH PERFORMANCE REGRESSIONS
========================================

Even though a better join order is the goal of this WL, performance regressions
are inevitable. The reason is that the optimizer does not know all facts about
the data it is optimizing the query for. The main pain points for the optimizer
are as follows:
 PP1) When the filtering effect of a non-indexed column is calculated, the
      optimizer does not know anything about the data distribution. All 
      rows may have the same column value, or there may be only distinct 
      column values. Histograms/statistics for non-indexed columns would 
      reduce this issue.
 PP2) If the data for a column is skewed, the filtering effect may be 
      calculated completely wrong even for indexed columns because the 
      optimizer only has access to very basic statistics (degree of 
      distinctness). Histograms would remove this issue.
 PP3) The optimizer has no way to detect the level of covariance between 
      two columns and may therefore overestimate the filtering effect of 
      predicates over such column combinations (examples: 
      "WHERE city=NewYork AND country=USA" or "WHERE name=John AND sex=male")
      Covariance histograms would reduce this issue.

Since performance regressions are inevitable due to the pain points listed
above, an optimizer_switch that can be used to turn the feature 
'on' and 'off' will be added. The default value will be 'on'.

set optimizer_switch='condition_fanout_filter=off';

A FEW NOTES FOR DOCUMENTATION
=============================
1) Since the filtering effect does not affect the cost of the current table
   but only tables reading its output rows, the condition filtering effect
   is not calculated for the last table in the join order. There are, 
   however, a few exceptions. The filtering effect is calculated also for
   the last table if:
   a) The statement is EXPLAIN, or
   b) Cost is calculated for scans with join buffering, or
   c) The table is in a subselect (because the cost of potential 
      materialization depends on the number of output rows), or
   d) SemiJoin is used (because the cost of some of the SJ strategies depend
      on the number of output rows).
   This may seem like a lot of exceptions, but it removes the overhead for 
   an important class of queries with only a single (non-const) table
   like the ones in SysBench. It also significantly reduces the overhead for
   joins with only a few tables. For example, for 't1 JOIN t2', the filtering
   effect is calculated only for t1 when t1 is first and only for t2 when t2 
   is first. It would go like this:
      <empty plan>
         Add t1 to plan; calculate filter for t1
             Add t2 to plan; do NOT calculate filter (no more tables)
                 Done; save cost and try alternative plans
      <empty plan>
         Add t2 to plan; calculate filter for t2
             Add t1 to plan; do NOT calculate filter (no more tables)
                 Done; compare costs of "t1-t2" vs "t2-t1" and pick cheapest
Implementation:
===============

Currently, best_access_path() has the following heuristic for the 'scan' 
access methods (table/index):

  1) If 'ref' access is possible, fanout is 25%.
  2) If 'range' access is possible, fanout is calculated as 
     "rows_estimate_range_access/rows_in_table"

This heuristic will be replaced by the following implementation:

Main filter implementation:
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main filter calculation function is called from
Optimize_table_order::best_access_path() and relies on
Item*::get_filtering_effect(). The design of get_filtering_effect()
can be found below.

/*
  Calculate 'Post read filtering' effect of JOIN::conds for table 'tab'.
  Only conditions that are not directly involved in the chosen
  access method shall be calculated in this 'Post read filtering' effect.

  The function first identifies fields that are directly used by the
  access method. This includes columns used by range and ref access types,
  and predicates on the identified columns (if any) will not be taken into
  account when the filtering effect is calculated.

  The function will then calculate the filtering effect of any predicate 
  that applies to 'tab' and is not depending on the columns used by the
  access method. The source of information with highest accuracy is 
  always preferred and is as follows:
    1) Row estimates from the range optimizer
    2) Row estimates from index statistics (rec_per_key)
    3) Guesstimates

  Thus, after identifying columns that are used by the access method,
  the function will look for rows estimates made by the range optimizer. 
  If found, the estimates from the range optimizer are calculated into 
  the filtering effect.

  The function then goes through JOIN::conds to get estimates from any
  remaining predicate that applies to 'tab' and does not depend on any
  tables that come later in the join sequence. Predicates that depend on
  columns that are either used by the access method or used in the row
  estimate from the range optimizer will not be considered in this phase.

  @param tab          The table condition filtering effect is calculated
                      for
  @param keyuse       Describes the 'ref' access method (if any) that is
                      chosen
  @param used_tables  Tables earlier in the join sequence than 'tab'

  @return  the 'post read filtering' effect (between 0 and 1) of 
           JOIN::conds
*/
double calculate_condition_filter(JOIN_TAB *const tab,
                                  Key_use *const keyuse,
                                  table_map used_tables)
{
  MY_BITMAP fields_to_ignore;
  double filter= 1.0;

  if (keyuse) // Chosen access method is 'ref' access
    bitmap_set_bit(fields_to_ignore, <fields_used_by_ref_access>);

  if (tab->quick)
    bitmap_set_bit(fields_to_ignore, <fields_used_by_quick_access>);

  for (uint curkey= 0; curkey < tab->table->s->keys; curkey++)
  {
     if (tab->table->quick_keys.is_set(curkey))
     {
       MY_BITMAP fields_used_by_quick;
       quick->get_fields_used(&fields_used_by_quick)
       /*
         Ignore rows estimates from the range optimizer on this index if
         the range access method used fields that have already been 
         accounted for
       */
       if (bitmap_is_overlapping(fields_used_by_quick, fields_to_ignore)
         continue;

       filter*= (tab->table->quick_rows[curkey] /
                 tab->table->file->stats.records);

       bitmap_union(fields_to_ignore, fields_used_by_quick);
  }

  /*
    Filter now includes the filtering effect from the range optimizer,
    and fields_to_ignore is tagged with all fields that are either used
    by the access method or calculated by the range optimizer.

    Proceed with in-depth analysis of all predicates in JOIN::conds to
    get the filtering effect of all predicates that apply to fields in 
    'tab' that are not tagged in fields_to_ignore.
  */
  filter*= join->conds->get_filtering_effect(used_tables,
                                             tab->table->map,
                                             &fields_to_ignore);

  return filter;
}

/**
  Sets the bit of all fields this range access method uses in 
  'used_fields'
  
  @param[out] indexed fields used by this range access method.
*/
QUICK_*::get_fields_used(MY_BITMAP *used_fields)
{
  for (all indexes 'i' used by this quick)
    for (all keyparts 'kp' used in 'i')
      bitmap_set_bit(used_fields, kp->field->field_index);
}


Item tree implementation:
~~~~~~~~~~~~~~~~~~~~~~~~~
To get the filtering contribution from predicates in JOIN::conds, the function
get_filtering_effect() will be added to Item:

/*
  Calculate the filter contribution that is relevant for table
  'filter_for_table' for this item.

  @param filter_for_table   The table we are calculating filter effect for
  @param read_tables        Tables earlier in the join sequence.
                            Predicates for table 'filter_for_table' that 
                            rely on values from these tables can be part of 
                            the filter effect.
  @param fields_to_ignore   Fields in 'filter_for_table' that should not 
                            be part of the filter calculation. The filtering
                            effect of these fields are already part of the
                            calculation somehow (e.g. because there is a 
                            predicate "'col' = <const>", and the optimizer
                            has decided to do ref access on 'col').
  @return  the filtering effect (between 0 and 1) this Item contributes with.
*/
double Item*::get_filtering_effect(table_map filter_for_table,
                                   table_map read_tables,
                                   MY_BITMAP *fields_to_ignore);

All Item classes except the ones that may contribute to a filtering effect 
(see list in "Filter calculation details" below) will inherit this:

virtual double Item::get_filtering_effect(table_map filter_for_table,
                                          table_map read_tables,
                                          MY_BITMAP *fields_to_ignore)
{
  return 1;
}

The Item classes corresponding to the list of filter contributors (see list
below) will override this function. Example:

double Item_cond_and::get_filtering_effect(table_map filter_for_table,
                                           table_map read_tables,
                                           MY_BITMAP *fields_to_ignore)
{
  if (!(used_tables() & filter_for_table)
    return 1.0;  // None of the ANDed conditions apply to this table

  double filter= 1.0;
  // Calculate P(A and B) = P(A) * P(B)
  while (more ANDed conditions)
    filter*= condition->get_filtering_effect(filter_for_table,
                                             read_tables,
                                             fields_to_ignore);
  
  return filter;
}

/**
  Calculate the filtering effect for 'filter_for_table' for a 
  multiple equal item. Multiple equal items contains two or more
  items that are equal, so there may be multiple fields from a 
  single table. All fields from 'filter_for_table' will contribute
  to the filtering effect as long as there is a value these can be 
  compared to.
*/
double Item_equal::get_filtering_effect(table_map filter_for_table,
                                        table_map read_tables,
                                        MY_BITMAP *fields_to_ignore)
{
  double filter= 1.0;

  // No conditions below this apply to the table
  if (!(used_tables() & filter_for_table))
    return COND_FILTER_NONE;

  /* 
    Do we have a usable value to compare to? We have:
      filter_for_table.field = comparable1 = ... = comparableN
    comparable_value_found will be true if one or more of comparable1..N
    is a) a constant, or b) a field from a table in 'read_tables'
  */
  bool comparable_value_found= const_item ? true : false;
  for (all fields 'f' in the multiple equality)
  {
     if (f->used_tables() & read_tables)
       comparable_value_found= true;
     else
     {
       if (rec_per_key estimate available for 'f')
         filter*= rec_per_key / rows_in_table
       else
         filter*= (rows_in_table < 200) ? 1/rows_in_table : 0.005;
     }
  }
    
  return comparable_value_found ? filter : 1.0;
}

/*
  Utility function for condition filtering. Used to determine 
  if a predicate should contribute to condition filtering. In terms of
  this function, a predicate shall contribute to the filtering effect 
  only if it uses one or more fields from 'filter_for_table' that are 
  not tagged in 'fields_to_ignore'.

  @param read_tables      Tables earlier in the join sequence
  @param filter_for_table The table that filtering effect is 
                          calculated for
  @param fields_to_ignore Fields in 'filter_for_table' that are already
                          accounted for in the filtering effect.
  @return true if this item should contribute to filtering effect.
*/
bool Item::item_contributes_to_filter(const table_map read_tables,
                                      const table_map filter_for_table,
                                      const MY_BITMAP *fields_to_ignore)
{
  /* 
    Do we have a usable value to compare to? We have:
      filter_for_table.field OP comparable1 OP ... OP comparableN
    comparable_value_found will be true if one or more of 
    comparable1..N is a) a constant, or b) a field from a table in 
    'read_tables'
  */
  bool comparable_value_found= false;
  /* 
    Whether or not we have a field from 'filter_for_table' that is not 
    in 'fields_to_ignore'
  */
  bool found_usable_field= false;

  for (all arguments 'a' in item)
  {
    if ('a' is a field)
    {
      if ('a' is a field in 'filter_for_table' &&
          'a' is not in 'ignore_fields')
      {
        /*
          If a field from 'filter_for_table' was already found,
          we have two fields from that table. One can be used as value.
        */
        found_usable_field ? comparable_value_found= true :
                             found_usable_field= true;
      }
      else if ('a' is a field in 'filter_for_table' ||
               'a' is a field in one of 'read_tables')
        comparable_value_found= true;
    }
  }
  return comparable_value_found && found_usable_field;
}

double Item_func_X::get_filtering_effect(table_map filter_for_table,
                                         table_map read_tables,
                                         MY_BITMAP *fields_to_ignore)
{
  // Does this predicate apply to filter_for_table?
  if (!(used_tables() & filter_for_table))
    return 1;

  /*
    Cannot have filtering effect if some dependent table has 
    not been read.
  */
  if (used_tables() & ~(read_tables | filter_for_table))
    return 1;

  /*
    Is the filtering effect of this predicate already calculated
    elsewhere?
  */
  if (!item_contributes_to_filter(read_tables,
                                  filter_for_table, fields_to_ignore))
    return 1;

  return <filter_estimate>;
}


How filtering is used in the optimizer:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'post condition filtering' estimate is calculated by calling 
calculate_condition_filter() from Optimize_table_order::best_access_path()
in three locations (see below):
  1) Right before calling calculate_scan_cost(). Only called if scan 
     cost needs to be calculated.
  2) After deciding on the access method (only one of the following two 
     will be called):
     a) After deciding to use scan access method
     b) After deciding to use ref access method

Optimize_table_order::best_access_path()
{
  ...
  double filter_effect= 1.0;
  best_ref= find_best_ref(...);

  if (<heuristics_ref_better_than_scan>)
    ...
  else
  {
    /*
      Get filtering effect of *constant* predicates only 
      ("col OP <const>") where OP is any comparison operator. This
      filter effect is needed because the cost of scanning 'tab'
      depends on how many rows can be filtered away without even 
      comparing them to rows in the join buffer. @see 
      calculate_condition_filter()

      Note: The cost of using an access method is
          read_from_SE_cost
        + CPU_cost_evaluate_row * table_fanout * prefix_rows
      The read cost of the best access method is stored in 
      POSITION::read_cost and the CPU can be calculated based on 
      the row estimates. However, the way calculate_condition_filter()
      is implemented, some of the CPU cost is added to the read_cost
      and compensated by reducing the table_fanout. The resulting 
      total cost remains the same, but the cost of doing scans is
      overestimated whenever pure read costs are compared. This 
      incorrectness can be remedied if const_filter is only calculated
      if a join buffer will be used by the scan.
    */
    const double const_filter= calculate_condition_filter(tab,
                                                          NULL,
                                                          0);   // 1)
    const double scan_read_cost= calculate_scan_cost(...,
                                                     const_filter);
    const double scan_total_cost= ...;
    if (scan_total_cost < ref_total_cost)
    {
      best_ref= NULL;
      ...
      /*
        Condition filtering has no effect if 'tab' is the last table
        to join, unless the statement is EXPLAIN.
      */
      if (thd->variables.optimizer_condition_fanout_filter &&
          (remaining_tables || thd->lex->describe))
      {
        const double full_filter=
          calculate_condition_filter(tab, NULL, read_tables);   // 2a)
        filter_effect= full_filter / const_filter;
      }
      else
        filter_effect= 1.0;
    }
  }
  ...
  /*
    Calculate filtering effect if ref access was chosen. Condition
    filtering has no effect if 'tab' is the last table to join, 
    unless the statement is EXPLAIN.
  */
  if (best_ref &&
      thd->variables.optimizer_condition_fanout_filter &&
      (remaining_tables || thd->lex->describe))
    filter_effect=
      calculate_condition_filter(tab, best_ref, read_tables);   // 2b)

  pos->filter_effect= filter_effect;
  ...
}

The filtering effect is then taken into account in the join optimizer
like this:

Optimize_table_order::best_extension_by_limited_search()
{
  ...
  best_access_path(...);
  /* Compute the cost of extending the plan with 's' */
  current_read_time=  read_time
                      + position->read_cost
                      + read_count * position->fanout * ROW_EVALUATE_COST;

  /*
    Now calculate how many rows this partial plan will produce. This 
    should not affect current_read_time because all rows returned from 
    an access method must be evaluated, including those that are filtered
    out by query conditions (reflected in position->filter_effect).
  */
  current_record_count=
    record_count * position->fanout * position->filter_effect;
  position->set_prefix_costs(current_read_time, current_record_count);
  ...
}