WL#4065: Support extended partition pruning for partitioning by COLUMNS

Affects: Server-6.1   —   Status: Assigned   —   Priority: Medium

Partitioning in MySQL 5.1 is limited to functions returning integers.
To perform partition pruning it's only possible to define partition
pruning on ranges if the function is monotonically increasing which
is true for TO_DAYS, YEAR and YEAR_TO_DATE. There are no useful
partitioning variants based on character fields other than KEY
partitioning and possibly some hash partitioning variant.

This worklog makes it possible to make useful partitioning on character
fields. It also enables partition pruning to be useful on much more
generic ranges. Thus we can with this new partitioning scheme partition
over a,b and c and prune partitions away even with a WHERE clause that
states a = 1 AND b < 1. This was impossible with the partition pruning
introduced in MySQL 5.1.

This new variant partitions on a list of columns instead of on a
function. To distinguish this from the current variant and ensure
that there is no confusion to users what the partitioning introduced
is, we add the keyword COLUMN_LIST to the partition definition.

The new partitioning scheme is only defined for RANGE and LIST partitions.

The worklog also adds a small extension to the current function-based
partitioning by adding the function TO_SECONDS since the current allowed
partition pruning functions can only be on per day level, so it's not
currently possible to partition based on parts of a day.

The major improvements of this worklog is thus the possibility to define
partitioning on character fields and a much more advanced partition
pruning possibility.
1. Work overview
2. On copying the approach used in the range optimizer
3. Partition pruning for multi-column non-ranges
3.1 Demonstration of the idea on an example
3.2 General description of handling non-ranges
3.3 Connection to subpartition pruning


1. Work overview

COLUMN_LIST comparisons have the same meaning as index tuple comparisons,
so our task is to make partition pruning module handle multiple-keypart
SEL_ARG graphs, without the limitation that we handle only graphs equivalent
to "col1 = const1 AND col2=const2 AND ...".

2. On copying the approach used in the range optimizer

(for the sake of simplicity this section will ignore subpartition pruning)

The most straightforward way of doing partition pruning for COLUMN_LIST is 
to just do what the range optimizer does for multi-part keys. However, that
would be not the best solution as the range optimizer has a property (or 
rather limitiation) that we don't need in partition pruning.  We'll show 
the property using an example: consider a WHERE clause (and its
corresponding SEL_ARG tree):

  WHERE t.keypart1 < 'foo' AND t.keypart2=1

It specifies a possibly infinite set of intervals
  S = { (keypart1, keypart2) = (x, 1) | x < 'foo' }

The range optimizer is not able produce E(#matching records) for this
possibly infinite set of intervals, as that would require doing many index
lookups into the index to see index to see which intervals we will actually
need to scan (which is too expensive for optimization). Therefore range 
optimizer computes an "envelope" that encloses all of the intervals:

  S_envelope = " (keypart1, keypart2) <= ('foo', 1) "

S_envelope includes regions that are not the original set S, therefore the move
from S to S_envelope is bad as we start examining records that would not match
the original WHERE condition.

In partition pruning, instead of a [big] table index we have a [relatively
small] list of partition definitions, so we can process the set S directly,
without moving to S_envelope.

3. Partition pruning for multi-column non-ranges

3.1 Demonstration of the idea on an example

Suppose partitioning is


and suppose we have a non-range (will use this term because I can't come up
with anything better)

  kp1<5 AND kp2='foo'

It is apparent that partitions p1 and p3 cannot contain any matching records.
Another interesting observation that we're only able to make this observation
about partition p3 because we know that there are no possible values between 2
and 3.

This technique is also applicable when there is no SEL_ARG for kp1 at all, or
when we have a tree (or, more precisely a next_key_part path) which lacks
SEL_ARGs for some key part(s).

3.2 General description of handling non-ranges
We will now give a general description of the approach shown in the previous
section. Suppose we have partitioning on COLUMN_LIST(kp1, ... kpN) and a SEL_ARG 

     ...                   ...
      |                     |     
  SEL_ARG(kp1) -- X -- SEL_ARG(kp2)
      |                     |
     ...                   ...        
Then the tree can be processed as follows:

1. Start from the first keypart (keypart=0) and proceed right until the walked
   chain of SEL_ARGs represents a proper equality interval
      keypart0=c1 AND ... AND keypart_k=c_k            (PrefixInterval)
   let's denote 
     the length of the prefix as const_prefix_len, 
     and the length of the "hole" in keypart numbers as hole_len

2. Use partitioning range analyzer and obtain a set of partitions PS1 that may
   contain records from PrefixInterval.

3. For each partition in PS1, check if partition can contain records that
   match the remaining SEL_ARG chain, and if yes, mark it as used. 

The check is performed as follows:

if (at least one of the missed keyparts is of a non-enumerable type)
  if (prefix(prefix_len + hole_len, left_partition_endpoint) != 
      prefix(prefix_len + hole_len, right_partition_endpoint))
    // The partition may contain rows  ((tuple)PrefixInterval, *)
    mark the partition as used;
    // the prefix is the same, check if the partition matches the column 
    if (interval_intersection( 
          suffix(left_partition_endpoint, part_cols - (prefix_len + hole_len)),
          suffix(left_partition_endpoint, part_cols - (prefix_len + hole_len))
        ) != EMPTY_INTERVAL)
      mark the partition as used;
    The same as above, but we'll need to also handle the special case of
    partition having edges with adjacent prefix values:
      PARTITION ... VALUES LESS THAN COLUMN_LIST(prefixval,   suffix1),
      PARTITION pX  VALUES LESS THAN COLUMN_LIST(prefixval+1, suffix2)
    (if prefix has multiple columns, then "+1" applies to the last column)
    where we can conclude that partition pX may not contain any records with
    suffix columns with some "suf1 < suffix_cols < suf2" range.

3.3 Connection to subpartition pruning
The scheme described in the previous section shows how to obtain the set of
used partitions. The scheme does not compute interval envelopes (like range
optimizer does), so there is no problem with different ranges in the envelope
having different intervals on subpartitioning columns).

We also do not support partitioning by COLUMN_LIST for subpartitioning so no
changes are needed for subpartition pruning.

1. Check if interval analyzer function for COLUMN_LIST can 
   handle prefixes (it should, even for the "naive" solution).

2. Let partitions store the information whether their endpoints have identical
   prefixes (for every prefix length, i.e. we need an array/bitmap).

3. Modify find_used_partitions(): 
    - invoke interval analyzer even if we have a COLUMN_LIST prefix tuple

    - get and remember set of used partitions for the prefix interval
      (Q: do we need a MAX_KEY_PARTS stack of those?)
    - then 
       for each partition in the obtained partition set (**)
         save left and right endpoint prefixes somewhere;
         call find_used_partitions()  (*)
           so it walks over next-key-part SEL_ARGs and runs analysis
    - when find_used_partitions() is invoked in situation like (*) 
       * get the parent's left/right endpoints, attach the suffix in 
         the passed SEL_ARG, obtain the "extended range" and check 
         if this extended range has non-empty intersection with the 
         partition (TODO: could we arrange so only the last components 
         of the range endpoints are compared?)

The problem is that (**) loop runs number-of-partitions-after-pruning times,
and the call (*) will do a loop (using head/tail recursion) over the list of 
SEL_ARGs that can be quite large as well.

3.1 An alternative approach to one described 3.0:

(see the wl4065-r2.odg picture attached to WL entry) The idea is that we don't
need to do loop (**). Instead we need to loop through groups of partitions 
that have rows that belong to some particular value groups of
partitions (which is just as bad in the worst case but can be much better).

Besides that, we can take advantage of the fact SEL_ARGs in SEL_ARG::next/prev 
chain form an ordered disjoint list. When analyzing the first SEL_ARG in the
chain, we should remember which partition the right edge was in. This
partition is the leftmost partition that can have non-empty intersection with
the second (and subsequent) SEL_ARGs in the chain. We can use this to restrict
the set of partition we'll need to search for 2nd and subsequent SEL_ARGs.

This won't prevent the

  O(#partitions_from_prevous_keyparts * #SEL_ARGs_for_this_keypart)

blowup in the general case, but probably could reduce the number of practical
cases where this blowup occurs.

On possible implementations of "move to next value group"
The most straightforward way is to have each partition to have an array of 
indexes of the first partition within the next value group. 

 * 0-th array element will have the number of partition with different value 
   of (part_col1)
 * 1-th array element will have the number of partition with different value 
   of (part_col1, part_col2)
 * .. and so forth.

Another, more space-efficient solution:
Store a bitmap which indicates which partitions have values that are in a
different value from the last value of the previous partition.  This way
we can walk ahead in big jumps (1024/64=16 step max.) and quickly find the 
range of partitions within this value group as well find the start/end of the
next value group.

Additional things to do
 * Add printout of what is passed to range analyzer interface.
 * Add printout of used partitions (i.e. of the result of partition pruning).