WL#1021: Add exception Value list for index

Affects: Server-7.1   —   Status: Un-Assigned

One of problems of current optimizer is joining the tables where lookup one has 
very different data distribution. Consider:

select * from a,b  where a.c1=const and b.c=a.c2;

In this case optimizer can't use constant based estimation for table "b" but only
to use cardinality value.  If table b will have very uneven value distribution,
for example 1010 rows,  11 distinct values. 1 met in 1000 rows and others each
in one.  


 The MySQL optimizer in such case will do index in lookup tables for all values,
 which is not good. The same can happen other way around - full table scan can be 
 selected for value which really has just a few rows matching.

The fix:  MySQL shall be able to detect what for this case the variance of
distribution is large and so it shall select table scan or range scan for each
outer loop constant.  

There are two solutions how to select it - real data based. Where we do the
range check for each value to decide if we're to use index or full table scan.
Is generally slow. Usable for some cases however (ie small outer table)

More general approach is to use some sort of extended statistics, which covers
these "exceptional" values. It can be done as simple excpetion list +
cardinality as well as histogram.

The histogram is usually build for some fixed, size a way to have same 
amount of rows for each histogram part.

We shall also store amount of NULL values to be able to handle NULL searches as
well.

The histogram building example:
(There are variations possible)

NULL
NULL
1
1
2
2
2
2
2
3
4
5
5
5

Total 12 values + 2 NULLs
Let us build histogram with 4 parts.  Each part matched to 4 rows:

1 1 2 
-------
2 2 2 
-------
2 3 4 
-------
5 5 5 

Which can be stored as:

NULL=2 TOTAL=14
1-2|2-2|2-3|5-5

For large row number we can assume number hi(n)=low(n+1) and store just

min=1|2|2|3|5

which means:

1-2|2-2|2-3|3-5|


Having such histogram we can check

How many rows matches 2?
2 is met in 3/4 ranges -> 9 rows

how many rows matches 1?

1 is met in 1/4 ranger -> 3 rows

How many rows have value 8? 

It is not in the histogram -> 0 rows 

(note we shall be careful as MySQL in some cases assumes 0 is accurate)

how many rows are in range 1-4

4/4 -> 12 rows


We possible can have histogram size adjustable on index level in long run.


Histograms allows you to cover "exceptions" as well. As having histogram
with 10 values would have dedicated positions for exceptions having 10+% values,
basically having the same info as exception list.