The most basic storage engines implement read-only table scanning. Such engines might be used to support SQL queries of logs and other data files that are populated outside of MySQL.
The implementation of the methods in this section provide the first steps toward the creation of more advanced storage engines.
The following shows the method calls made during a nine-row table
scan of the CSV engine:
ha_tina::store_lock ha_tina::external_lock ha_tina::info ha_tina::rnd_init ha_tina::extra - ENUM HA_EXTRA_CACHE Cache record in HA_rrnd() ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::rnd_next ha_tina::extra - ENUM HA_EXTRA_NO_CACHE End caching of records (def) ha_tina::external_lock ha_tina::extra - ENUM HA_EXTRA_RESET Reset database to after open
The [custom-engine.html#custom-engine-api-reference-store_lock
store_lock()] method is called before any
reading or writing is performed.
Before adding the lock into the table lock handler mysqld calls store lock with the requested locks. Store lock can modify the lock level, for example change blocking write lock to non-blocking, ignore the lock (if we don't want to use MySQL table locks at all) or add locks for many tables (like we do when we are using a MERGE handler).
When releasing locks, store_lock() is also
called. In this case, one usually doesn't have to do anything.
If the argument of store_lock is
TL_IGNORE, it means that MySQL requests the
handler to store the same lock level as the last time.
The potential lock types are defined in
includes/thr_lock.h and are copied here:
enum thr_lock_type
{
TL_IGNORE=-1,
TL_UNLOCK, /* UNLOCK ANY LOCK */
TL_READ, /* Read lock */
TL_READ_WITH_SHARED_LOCKS,
TL_READ_HIGH_PRIORITY, /* High prior. than TL_WRITE. Allow concurrent insert */
TL_READ_NO_INSERT, /* READ, Don't allow concurrent insert */
TL_WRITE_ALLOW_WRITE, /* Write lock, but allow other threads to read / write. */
TL_WRITE_ALLOW_READ, /* Write lock, but allow other threads to read / write. */
TL_WRITE_CONCURRENT_INSERT, /* WRITE lock used by concurrent insert. */
TL_WRITE_DELAYED, /* Write used by INSERT DELAYED. Allows READ locks */
TL_WRITE_LOW_PRIORITY, /* WRITE lock that has lower priority than TL_READ */
TL_WRITE, /* Normal WRITE lock */
TL_WRITE_ONLY /* Abort new lock request with an error */
};
Actual lock handling will vary depending on your locking implementation and you may choose to implement some or none of the requested lock types, substituting your own methods as appropriate. The following is the minimal implementation, for a storage engine that does not need to downgrade locks:
THR_LOCK_DATA **ha_tina::store_lock(THD *thd,
THR_LOCK_DATA **to,
enum thr_lock_type lock_type)
{
if (lock_type != TL_IGNORE && lock.type == TL_UNLOCK)
lock.type=lock_type;
*to++= &lock;
return to;
}
See also ha_myisammrg::store_lock() for a
more complex implementation.
The
[custom-engine.html#custom-engine-api-reference-external_lock
external_lock()] method is called at the
start of a statement or when a LOCK TABLES
statement is issued.
Examples of using external_lock() can be
found in the sql/ha_innodb.cc file, but most
storage engines can simply return 0, as is
the case with the EXAMPLE storage engine:
int ha_example::external_lock(THD *thd, int lock_type)
{
DBUG_ENTER("ha_example::external_lock");
DBUG_RETURN(0);
}
The method called before any table scan is the
[custom-engine.html#custom-engine-api-reference-rnd_init
rnd_init()] method. The
rnd_init() method is used to prepare for a
table scan, resetting counters and pointers to the start of the
table.
The following example is from the CSV storage
engine:
int ha_tina::rnd_init(bool scan)
{
DBUG_ENTER("ha_tina::rnd_init");
current_position= next_position= 0;
records= 0;
chain_ptr= chain;
DBUG_RETURN(0);
}
If the scan parameter is true, the MySQL
server will perform a sequential table scan, if false the MySQL
server will perform random reads by position.
Prior to commencing a table scan, the
[custom-engine.html#custom-engine-api-reference-info
info()] method is called to provide extra
table information to the optimizer.
The information required by the optimizer is not given through
return values but instead by populating certain properties of
the stats member of the handler class, which the optimizer reads
after the info() call returns. The stats
member is an instance of the ha_statistics class which is also
defined in handler.h
In addition to being used by the optimizer, many of the values
set during a call to the info() method are
also used for the SHOW TABLE STATUS
statement. The flag argument is a bitfield that conveys for
which context the info method was called.
The flags are defined in include/my_base.h. The ones that are used are:
HA_STATUS_NO_LOCK - the handler may use outdated info if it can prevent locking the table shared
HA_STATUS_TIME - only update of stats->update_time required
HA_STATUS_CONST - update the immutable members of stats (max_data_file_length, max_index_file_length, create_time, sortkey, ref_length, block_size, data_file_name, index_file_name)
HA_STATUS_VARIABLE - records, deleted, data_file_length, index_file_length, delete_length, check_time, mean_rec_length
HA_STATUS_ERRKEY - status pertaining to last error key (errkey and dupp_ref)
HA_STATUS_AUTO - update autoincrement value
The public properties are listed in full in
sql/handler.h; several of the more common
ones are copied here:
ulonglong data_file_length; /* Length off data file */ ulonglong max_data_file_length; /* Length off data file */ ulonglong index_file_length; ulonglong max_index_file_length; ulonglong delete_length; /* Free bytes */ ulonglong auto_increment_value; ha_rows records; /* Records in table */ ha_rows deleted; /* Deleted records */ ulong raid_chunksize; ulong mean_rec_length; /* physical reclength */ time_t create_time; /* When table was created */ time_t check_time; time_t update_time;
For the purposes of a table scan, the most important property is
records, which indicates the number of
records in the table. The optimizer will perform differently
when the storage engine indicates that there are zero or one
rows in the table than it will when there are two or more. For
this reason it is important to return a value of two or greater
when you do not actually know how many rows are in the table
before you perform the table scan (such as in a situation where
the data may be externally populated).
The bare minimum implementation for the info method is probably something like what is used for the CSV (tina) engine:
int ha_tina::info(uint flag)
{
DBUG_ENTER("ha_tina::info");
/* This is a lie, but you don't want the optimizer to see zero or 1 */
if (!records_is_known && stats.records < 2)
stats.records= 2;
DBUG_RETURN(0);
}
Prior to some operations, the
[custom-engine.html#custom-engine-api-reference-extra
extra()] method is called to provide extra
hints to the storage engine on how to perform certain
operations.
Implementation of the hints in the extra call
is not mandatory, and most storage engines return
0:
int ha_tina::extra(enum ha_extra_function operation)
{
DBUG_ENTER("ha_tina::extra");
DBUG_RETURN(0);
}
After the table is initialized, the MySQL server will call the
handler's
[custom-engine.html#custom-engine-api-reference-rnd_next
rnd_next()] method once for every row to be
scanned until the server's search condition is satisfied or an
end of file is reached, in which case the handler returns
HA_ERR_END_OF_FILE.
The rnd_next() method takes a single byte
array parameter named *buf. The
*buf parameter must be populated with the
contents of the table row in the internal MySQL format.
The server uses three data formats: fixed-length rows,
variable-length rows, and variable-length rows with BLOB
pointers. In each format, the columns appear in the order in
which they were defined by the CREATE TABLE statement. (The
table definition is stored in the .frm
file, and the optimizer and the handler are both able to access
table metadata from the same source, its
TABLE structure).
Each format begins with a NULL bitmap of one bit per nullable column. A table with as many as eight nullable columns will have a one-byte bitmap; a table with nine to sixteen nullable columns will have a two-byte bitmap, and so forth. One exception is fixed-width tables, which have an additional starting bit so that a table with eight nullable columns would have a two-byte bitmap.
After the NULL bitmap come the columns, one by one. Each column
is of the size indicated in
MySQL
Data Types. In the server, column data types are defined
in the sql/field.cc file. In the fixed length
row format, the columns are simply laid out one by one. In a
variable-length row, VARCHAR columns are
coded as a one or two-byte length, followed by a string of
characters. In a variable-length row with
BLOB columns, each blob is represented by two
parts: first an integer representing the actual size of the
BLOB, and then a pointer to the
BLOB in memory.
Examples of row conversion (or packing) can be found by starting
at rnd_next() in any table handler. In
ha_tina.cc, for example, the code in
find_current_row() illustrates how the
TABLE structure (pointed to by table) and a
string object (named buffer) can be used to pack character data
from a CSV file. Writing a row back to disk requires the
opposite conversion, unpacking from the internal format.
The following example is from the CSV storage
engine:
int ha_tina::rnd_next(byte *buf)
{
DBUG_ENTER("ha_tina::rnd_next");
statistic_increment(table->in_use->status_var.ha_read_rnd_next_count, &LOCK_status);
current_position= next_position;
if (!share->mapped_file)
DBUG_RETURN(HA_ERR_END_OF_FILE);
if (HA_ERR_END_OF_FILE == find_current_row(buf) )
DBUG_RETURN(HA_ERR_END_OF_FILE);
records++;
DBUG_RETURN(0);
}
The conversion from the internal row format to CSV row format is
performed in the find_current_row() method:
int ha_tina::find_current_row(byte *buf)
{
byte *mapped_ptr= (byte *)share->mapped_file + current_position;
byte *end_ptr;
DBUG_ENTER("ha_tina::find_current_row");
/* EOF should be counted as new line */
if ((end_ptr= find_eoln(share->mapped_file, current_position,
share->file_stat.st_size)) == 0)
DBUG_RETURN(HA_ERR_END_OF_FILE);
for (Field **field=table->field ; *field ; field++)
{
buffer.length(0);
mapped_ptr++; // Increment past the first quote
for(;mapped_ptr != end_ptr; mapped_ptr++)
{
// Need to convert line feeds!
if (*mapped_ptr == '"' &&
(((mapped_ptr[1] == ',') && (mapped_ptr[2] == '"')) ||
(mapped_ptr == end_ptr -1 )))
{
mapped_ptr += Move past the , and the "
break;
}
if (*mapped_ptr == '\\' && mapped_ptr != (end_ptr - 1))
{
mapped_ptr++;
if (*mapped_ptr == 'r')
buffer.append('\r');
else if (*mapped_ptr == 'n' )
buffer.append('\n');
else if ((*mapped_ptr == '\\') || (*mapped_ptr == '"'))
buffer.append(*mapped_ptr);
else /* This could only happed with an externally created file */
{
buffer.append('\\');
buffer.append(*mapped_ptr);
}
}
else
buffer.append(*mapped_ptr);
}
(*field)->store(buffer.ptr(), buffer.length(), system_charset_info);
}
next_position= (end_ptr - share->mapped_file)+1;
/* Maybe use \N for null? */
memset(buf, 0, table->s->null_bytes); /* We do not implement nulls! */
DBUG_RETURN(0);
}
