WL#1524: Bitmap indexes

Affects: Server-7.1   —   Status: Un-Assigned

A bitmap index is a set of bit vectors, which are useful  
for a certain type of search on columns of a table.  
  
A bit vector is just a series of bits, e.g. 0101010100001111.  
Although I prefer the term "bit vector", others say "bitmap".  
  
An example 
----------  
  
Suppose there is a table T, with a column C. There  
are 5 rows in T. Column C can have any one of these values:  
'mary', 'joe', NULL. The contents of T are:  
  
Row #  column C  
-----  --------  
0      mary  
1      mary  
2      joe  
3      joe  
4      mary  
  
We create a bitmap index:  
  
CREATE BITMAP INDEX ON T (C);  
  
Now there are 3 bit vectors (there must be one bit vector  
for each possible value of the column, and there are three  
possible values). Each vector has 5 bits (there must be one  
bit for each row in the table, and there are 5 rows in the  
table).  
  
'mary' bit vector   'joe' bit vector   NULL bit vector  
-----------------   ----------------   ---------------  
1                   0                  0               /* bit 0 */  
1                   0                  0               /* bit 1 */  
0                   1                  0               /* bit 2 */  
0                   1                  0               /* bit 3 */  
1                   0                  0               /* bit 4 */  
  
The condition "C = 'mary'" is TRUE for rows 0, 1, and 4.  
Therefore in the 'mary' bit vector bits 0, 1, and 4 are  
TRUE (1) but bits 2, 3 are FALSE (0). Meanwhile, the  
condition "C = 'joe'" is TRUE for rows 2, 3 so in the  
'joe' bit vector bits 2 and 3 are on. For any given bit  
position, there is a TRUE (1) bit in one and only one  
of the bit vectors.  
  
Now we search using the bitmap index.  
  
SELECT COUNT(*) FROM T WHERE C IS NULL;  
... Do this by counting the 1 bits in the NULL bit vector.  
  
The good things that we can note from this example are  
that the index is small (a total of 15 bits for 5 records),  
and the search is easy for a certain kind of problem.  
  
Statement Format  
----------------  
  
CREATE BITMAP INDEX index_name ON table_name (column_name);  
  
If you don't say BITMAP, then MySQL makes a btree index.  
(BITMAP will never be the default index type.)  
  
Error if: index_name already exists, for any index type  
  
Error if: row is not fixed-length  
  
The syntax "CREATE BITMAP INDEX ..." is not standard, but does  
appear in Oracle.  
 
We also could add "BITMAP" to the syntax for the KEY clause: 
{ KEY | INDEX } [ ident ] 
[ [ USING | TYPE ] [ BTREE | RTREE | HASH | BITMAP ] ] 
 
Notice that we do not have a clause for stating what the possible  
values are. For some MySQL data types (SET, ENUM) the  
range is limited anyway. For other data types, we should insist  
on the presence of a CHECK clause for the column definition,  
with "IN (...)" to list all possibilities except for NULL.  
That is not what Oracle does -- Oracle can make new bit vectors  
dynamically -- but things are much simpler if we know in advance  
what values are possible. Thus, for our mary/joe/NULL example,  
the creation statements would look like:  
  
CREATE TABLE T (C char(5), CHECK (C IN ('mary','joe')));  
CREATE BITMAP INDEX I ON T (C);  
  
Internals note: the effect is that we will create three files  
file named T.MB0, T.MB1, T.MB2. It is not appropriate to store  
bit vectors in a .MYI file. It is possible to have a single  
file with large pages and with a part of each page reserved  
for each bit vector, but let's try it out with separate files  
first, it's easier that way.  
  
Use in queries  
--------------  
  
When should the optimizer choose to use a bitmap index? That  
is, under what conditions is a bitmap advantageous? We will  
not know the answer until we have done some benchmarks with a  
real database. The likely cases are:  
  
- If there is a btree index for any column in the search,  
  use the btree index and ignore the bitmap index  
  
- If the condition contains anything other than "column = literal"  
  or "column IS [NOT] NULL", use a full-table scan and ignore the  
  bitmap index.  
  
Bitmap indexes might also prove useful for EXISTS and COUNT.  
Sometimes the user will have to use a hint to "force" bitmap  
index use.  
  
The two most important operations involve converting SQL  
conditional expressions to bit vector positions, and back.  
  
Example: for "WHERE C='joe'" in a WHERE clause,  
(a) find out that 'joe' is the second item in the CHECK for  
    the table and therefore the bit vector is T.MB1  
(b) looking at byte #0 from T.MB1, see that the value is  
    0x30, and conclude (by a table lookup) that this means  
    row #2 is TRUE, and row #3 is TRUE. If the value 0x30  
    were in byte #X, we would have to do the calculation as  
    row #(2+(8*X)) is TRUE, and row #(3+(8*X)) is TRUE.  
  
Use in data-change statements  
-----------------------------  
  
Notoriously, bitmap index maintenance is slow. Every time an  
indexed column value changes, we have to change every bit  
vector. An INSERT may cause an increase in the size of every  
bit vector. For a DELETE, just set the bit value to FALSE (0)  
for the appropriate bit position in every bit vector.  
  
Possible improvements  
----------------------  
  
Handling variable-length rows is harder, because the index  
key points directly to a position within the .MYI file --  
there is no "rowid". Therefore, you need ANOTHER index,  
which answers the question: given that row number is 555,  
what is the position in the file? Use of such an index  
might slow access down, so the number of places where a  
bitmap index is useful is reduced when row length is variable.  
  
Handling columns without CHECK clauses is harder, because  
there has to be a list somewhere showing what value is  
associated with what bit vector. This list is dynamic;  
whenever a new value is inserted that is not in the list,  
we add to the list and add a new bit-vector file.  
  
Instead of multiple files, we could have one file with  
multiple pages. Say that a page has N bytes, and there  
are 3 possible values. Then the first N/3 bytes of the  
page are for value 1, the next N/3 bytes of the page are  
for value 2, and the final N/3 bytes of the page are for  
value 3.  
  
Indexing a condition, rather than a value, would be good  
for both btree and bitmap indexes.  
  
From my notes  
-------------  
  
There have been suggestions from the user list about bitmaps. See:  
http://lists.mysql.com/mysql/125271  
 
On June 20 2000, Monty said on the MySQL mailing list: 
"This [bitmap indexing] would be easy to do for fixed length 
rows; For dynamic length rows this would in many case take up 
even more space than the current packed index method!" 
 
SQL Performance Tuning mentions that Informix, Oracle, and Sybase  
support bitmap indexes. So does IBM, but DB2 bitmap indexes are  
not similar to the others.  

References
----------

Bug##30347 Support Bitmap Indexes (feature request)