WL#6671: Improve scalability by not using thr_lock.c locks for InnoDB tables

Affects: Server-Prototype Only   —   Status: Complete

Improve scalability by not using thr_lock.c locks for InnoDB tables
(this task is mostly about avoiding calls to thr_lock.c code/locking
THR_LOCK::mutex with minimal changes to SE API and locking code on
SQL-layer).

User Documentation
==================

http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-5.html

http://dev.mysql.com/doc/refman/5.7/en/metadata-locking.html
(mentions new locking behavior re: LOCK TABLES ... READ)
Non-functional requirements:

NF1) This patch should show performance/scalability improvement in
     1-table Sysbench OLTP_RO/POINT_SELECT tests for InnoDB on multi-core
     machines.

NF2) There should be no significant regressions in other benchmarks for
     InnoDB tables.
User visible changes caused by this task
========================================

THR_LOCK will be no longer used for InnoDB tables
-------------------------------------------------

Besides the performance improvements aspect this means that user will
never see waits for THR_LOCK locks for InnoDB tables in P_S schema or
I_S.PROCESSLIST output. Instead in cases when user previously has seen
wait on THR_LOCK lock for InnoDB table it will see wait on MDL.

Particularly this means that for DML like INSERT INTO t1 VALUES(1) which
is blocked due to LOCK TABLES t1 READ active in another connection
we will see in I_S.PROCESSLIST.STATE column "Waiting for table metadata
lock" state instead of "Waiting for table level lock".

There won't be THR_LOCK acquisitions/waits registered in this case
in P_S.TABLE_HANDLES table and for "wait/lock/table/sql/handler" event
(calls to handler::external_lock() will still be registerd).
Instead waits for/acquisitions of metadata lock will be visible in 
P_S.METADATA_LOCKS table and for "wait/lock/metadata/sql/mdl" event.


LOCK TABLES READ and transactions
---------------------------------
LOCK TABLES READ will block and will be blocked by concurrent
transactions changing the table, for all storage engines,
similarly to how LOCK TABLES WRITE work now.

Particularly this means that:

- Explicitly or implicitly started transactions updating non-transactional
  (e.g. MyISAM) table will start to block and get blocked by LOCK TABLES READ
  on this table (unlike before).
- LOCK TABLES READ on InnoDB table in @@autocommit=1 mode or with
  @@innodb_table_locks=0 will start block and blocked by concurrent
  transactions changing this table.
  This means that @@innodb_table_locks will only have in a few scenarios
  involving cascading foreign keys and updates through no-SQL API.
- Storage engines which prefer to ignore LOCK TABLES READ by downgrading
  THR_LOCK lock acquired for it won't be able to do so.


Tables which are implictly locked by LOCK TABLES
------------------------------------------------
Tables which are implicitly used by LOCK TABLES (e.g. through view
or trigger) will be locked using metadata locks in addition to (all SEs
except InnoDB) or instead of (InnoDB) THR_LOCK locks.

These means that changes described in the above two points will also
apply to such tables.


Multi-update and concurrent LOCK TABLES READ
--------------------------------------------

Multi-update will be blocking/start to be blocked by concurrent LOCK TABLES
READ on any table from its join, even though such table will be only used
for reading and won't be updated. Such blocking will be visible as a
wait on metadata lock.


HANDLER READ and concurrent LOCK TABLES WRITE
---------------------------------------------

HANDLER READ statement (for table using any SE) will wait for concurrent
LOCK TABLES WRITE to go away on metadata lock and not THR_LOCK lock.
Vice versa LOCK TABLES WRITE will wait for HANDLER READ to complete on
metadata lock instead of THR_LOCK lock. This will be visible as a different
STATE for waiter in I_S.PROCESSLIST and in statistics gathered by P_S.


ALTER TABLE IMPORT/DISCARD TABLESPACE under LOCK TABLES
-------------------------------------------------------

ALTER TABLE DISCARD/IMPORT under LOCK TABLES will no longer allow
concurrent access to table metadata. This means that concurrent
SHOW CREATE TABLE on the table will be blocked during such statement.
Also information about table won't be accessible in I_S during this
time.

Bugs fixed or alleviated
------------------------

Also this WL is likely to completely fix or at least alleviate the
issues described in the following bug reports:

Bug #11751331 "CONCURRENT DML AND LOCK TABLE ... READ FOR INNODB TABLE
CAUSE WARNINGS IN"
Bug #11764618 "DEADLOCK WHEN DDL UNDER LOCK TABLES WRITE, READ + PREPARE"
Bug #18913551 "LOCK TABLES USES INCORRECT LOCK FOR IMPLICITLY USED TABLES"
Background: why it is possible to get rid of thr_lock.c locks for InnoDB
========================================================================

Currently InnoDB:

*) Downgrades all types of write locks to TL_WRITE_ALLOW_WRITE lock
   unless:
   -) This is ALTER TABLE IMPORT/DISCARD TABLESPACE
   -) This is TRUNCATE TABLE
   -) This is OPTIMIZE TABLE
   -) This is CREATE TABLE
   -) This is LOCK TABLES statement

*) Strong, TL_READ_NO_INSERT, read locks are downgraded to TL_READ locks
   unless this is LOCK TABLES. In case of LOCK TABLES TL_READ_NO_INSERT
   locks are not downgraded and TL_READ locks are upgraded to
   TL_READ_NO_INSERT locks.

This means that in most cases thr_lock.c locks for InnoDB tables are
compatible with each other and don't play any significant role.
Let us take a more detailed look at the above list of exceptions:

1) ALTER TABLE IMPORT/DISCARD TABLESPACE.

Since these operations acquire exclusive (**) MDL lock on the table
being discarded or imported all concurrent DML and DDL would be
blocked for them anyway, so thr_lock.c lock is not really required
in this case.

(**) The fact that under LOCK TABLES we acquire only SNRW lock and
never upgrade it to X lock can be treated as a bug according to
Marko/Sunny and should be fixed (and if it is not a bug - the
situation boils down to the same as for LOCK TABLES WRITE).

2) TRUNCATE TABLE

We always acquire exclusive MDL lock on the table being truncated,
so all concurrent DDL and DML will be blocked during this operation,
without any participation of thr_lock.c locks.

3) OPTIMIZE TABLE

Have to be split in two parts:

  a) Normal OPTIMIZE TABLE for InnoDB is implemented as ALTER TABLE FORCE
     thus it doesn't acquire write lock and relies on MDL/row-locking to
     isolate from concurrent DML and DDL.
  b) OPTIMIZE for FTS indexes is implemented in ha_innobase::optimize(),
     which is called under protection of MDL SNRW lock, which means that
     it is isolated from all DDL and DML from which thr_lock.c can isolate
     except open HANDLER cursors. But this situation is similar to one for
     LOCK TABLES WRITE and will be discussed later.

4) CREATE TABLE

There are two cases when we acquire write locks in CREATE TABLE:

  a) Write lock is acquired on target table in CREATE TABLE SELECT.
     But at the point when it is acquired we will have exclusive
     MDL lock on it, so all concurrent DDL/DML on this table is
     going to be blocked anyway and strong thr_lock.c lock is
     not required.
  b) Write locks are acquired on source tables in CREATE TABLE SELECT
     if FOR UPDATE clause is used. AFAIU the fact that we acquire strong
     write lock TL_WRITE in this case is artefact from a) and is not
     really required (LOCK_X row-locks should be sufficient).

5) LOCK TABLES

Again we have several cases for this statement:

  a) Tables explicitly locked for write (i.e. explicitly present in
     statement table list with with WRITE clause). We acquire strong
     SNRW metadata lock on such tables which blocks concurrent DDL
     and DML access to this table leaving only I_S, access from prepare
     phase of PS and open HANDLER cursors. Out of these three cases only
     the last one uses thr_lock.c locks. So the only scenario in which
     thr_lock.c plays some role in this case is preventing concurrent
     reads from HANDLER statements during LOCK TABLES WRITE duration.
     This problem can be solved by ensuring that HANDLER READ statement
     upgrade MDL from current S lock to SR lock for the duration of read
     (which will also remove nasty exception from assert ensuring that
     all operations which read or modify data acquire SR or SW or stronger
     locks on tables).
     Note that we also acquire table-level InnoDB locks in this case
     but they I) don't block non-locking reads which used by HANDLER
     cursor and II) are not acquired in case of @@autocommit=1 mode.

  b) Tables explicitly locked for read (i.e. explicitly present in
     statement table list with READ clause). For these tables we only
     acquire weak SR metadata lock and InnoDB table-level LOCK_S
     lock if @@autocommit=0. This means that strong read thr_lock.c
     lock which is acquired in this case plays significant role in
     this case.

     Possible options to replace their usage are:
    
     α) Change code to acquire table-level LOCK_S InnoDB locks and
        turn off @@autocommit for the duration of LOCK TABLES.
        This sounds like a rather radical change in behavior to
        me and has some other drawbacks (e.g. according to InnoDB
        documentation this significantly increases chances of
        deadlocks, introduces new intersubsystem deadlocks if
        HANDLER statements are involved, et cetera).
     β) Introduce new type of MDL lock - let us call it
        SHARED_READ_ONLY (SRO) type of lock which will block all
        concurrent updates to the table but will be compatible with
        itself and use it for LOCK TABLES READ. I.e. basically replace
        thr_lock.c lock with a similar MDL lock. There are some
        complications on this path (e.g. need to sort out problems
        with priorities if we are to preserve full compatibility)
        but overall it is doable.

  c) Tables implicitly locked for read and write by LOCK TABLES
     (i.e. underlying tables of views locked, tables used in
     triggers on tables locked, stored programs called from these
     views and triggers). We acquire weak SR and SW metadata locks
     on these tables. We also LOCK_X and LOCK_S table-level InnoDB
     locks but only in @@autocommit=0 mode. This means that strong
     thr_lock.c locks acquired in these cases play significant
     role in this case as well.
     
     Possible options to replace their usage are:

     α) Do not replace thr_lock.c with anything. This means that
        access to implicitly locked tables from under LOCK TABLES
        would block or even might deadlock in some cases. Which
        seems to be too radical change in behavior (e.g. think
        of SELECT from explicitly locked view being blocked due
        to problems with underlying table).
     β) Change code to acquire LOCK_S and LOCK_X table-level InnoDB
        locks and turn off @@autocommit mode until UNLOCK TABLES
        happens. Similarly to solution b.α) this is probably too
        big behavior change which is not free from other issues
        (more probable deadlocks, new deadlocks when there are
        open HANDLER cursors).
     γ) Acquire strong SNRW and newly introduced SRO metadata locks
        on tables implicitly locked for write and read correspondingly
        and thus replace thr_lock.c locks with MDL locks.
        Indeed, this approach is not without its own issues:
        *) We will have to sort out issues with lock priorities somehow.
        *) Thanks to the fact that we can't predict set of implicitly
           locked tables in advance probability of deadlocks is increasing
           so lock acquisition in LOCK TABLES becomes less efficient
           on average + some code acquiring locks for DDL statements
           needs to be adjusted to be able to handle ER_LOCK_DEADLOCK
           errors (e.g. by trying to restart acquisition).
        As a bonus, we might be able to fix some bugs involving
        intersubsystem deadlocks involving LOCK TABLES in this case.


Summary of steps required to get rid of thr_lock.c for InnoDB
=============================================================

Summing up the above it seems that we can totally get rid of thr_lock.c
locks for InnoDB tables provided that we will:

1) Change HANDLER code to temporarily upgrade S metadata
   lock to SR lock for the duration of read from handler.
2) Fix ALTER TABLE IMPORT/EXPORT tablespace code to upgrade
   metadata lock to X lock when it is executed under LOCK
   TABLES.
3) Introduce new type of metadata lock - SRO-lock to replace
   TL_READ_NO_INSERT lock in LOCK TABLES.
4) Change code to acquire SNRW and SRO metadata locks on tables
   which are implicitly locked for LOCK TABLES.

Let us discuss steps 3) and 4) and other issues which can arise
in more details.


Details: New type of metadata lock
==================================

New type of lock should be compatible with S, SH and SR type of metadata
locks and itself. Also for backward compatibility it should be compatible
with SU and SNW types of locks (QQ: does this mean that we need to update
descriptions of SU lock - currently it says that thread holding such lock
can update data?). It should be incompatible with SW, SNRW and X types of
locks. This new type of lock can be named SHARED_READ_ONLY (SRO). Renaming
existing SNW type to SHARED_UPGRADABLE_NO_WRITE and reusing SHARED_NO_WRITE
for new type of lock is another option, but such approach will complicate
merges older code.

Adding such type of lock would have been fairly straightforward if not
issue with priorities. The problem is that in thr_lock.c TL_READ_NO_INSERT
lock has lower priority than normal write lock (and higher than of low-prio
write lock). While it is possible to make priority of new SRO type lower
than of SW type, it means that LOCK TABLES READ can be easily starved by
concurrent stream of DML. thr_lock.c solution to this problem is two-pronged
a) it is possible to use DML with low-priority write locks by using
LOW_PRIORITY clause and corresponding TL_WRITE_LOW_PRIO lock b) it is also
possible to limit number of write locks acquired in a row without letting
read locks in by using max_write_lock_count parameter.

Possible solutions of priority problem for new metadata lock type:

a) Break compatibility and declare that LOCK TABLES READ has higher
   priority than DML, i.e. that SRO type has higher priority than SW
   type (but not SNRW type used for LOCK TABLES WRITE). It is likely
   that we will be immediately requested to add some way to avoid
   LOCK TABLES READ starving out DML. This means new parameter
   similar to max_write_lock_count will have to be added. E.g.
   we can call it max_read_lock_count.
b) Mimic thr_lock.c behavior fully by assigning SRO priority lower
   than of SW, adding low-priority SW locks and obeying
   max_write_lock_count parameter when granting SW locks.
c) Break compatibility with both old thr_lock.c behavior and
   current MDL behavior implement fair scheduling, at least
   for locks used for DML and LOCK TABLES READ/WRITE. This
   will require fairly significant changes to MDL subsystem
   (specifically to the way we discover conflicting locks in
   waiters graph). And likely even more changes to test suite.
d) Mimic thr_lock.c behavior partially by assigning SRO priority
   lower than of SW and obeying max_write_lock_count parameter
   when granting SW locks, without introducing low-priority SW
   locks.

Note that with any choice of priority between SRO and SW types it
probably makes sense to make pending SRO not to block acquisition
of SNRW and X locks as otherwise we can easily get much more
deadlock errors during upgrade of SU lock in ALTER TABLE.

Since approach b) is the safest from backward compatibility point
of view and also only a bit more complex from implementation perspective
than approaches a) and d) we will stick to it.


Details: Acquiring SNRW and SRO on implicitly locked tables
===========================================================

Changing code responsible for opening views and adding tables used by
stored programs to prelocking list to acquire SNRW and SRO metadata
locks in case when this is done for LOCK TABLES statement should be
fairly straightforward. 

The new problem arises thanks to the fact that after such a step we
we won't be able to predict order in which strong MDL locks will be
acquired by LOCK TABLES statements. Due to this we no longer will be
able to rely on that in almost all cases (MERGE tables being exception)
strong locks are acquired in certain order to avoid deadlock errors
in DDL.

Possible solutions to this problem:

a) Change function which chooses deadlock resolution victim to prefer
   waits for strong locks from LOCK TABLES over locks from other DDL.
   Ensure that attempt to acquire locks for LOCK TABLES is restarted
   when we get ER_LOCK_DEADLOCK error.
b) Simply change lock_table_names() and other similar places in
   our code where we acquire several strong locks to react to
   ER_LOCK_DEADLOCK error by releasing locks and restarting lock
   acquisition if possible (i.e. if there were no locks acquired
   before this step which we can't easily release and re-acquire).
   Special analysis needs to be performed to figure out if
   ER_LOCK_DEADLOCK error is possible after such change at
   the points where we do upgrade of MDL locks.
c) Drastically change approach to the acquisition of strong locks
   for LOCK TABLES, acquire weak S locks on all tables first, then
   upgrade them to strong locks at once using order. The drawbacks
   of this approach are that a) we will have to handle deadlocks
   during lock upgrade somehow (plus possibly in other places where
   we acquire strong locks) b) in fact LOCK TABLES will acquire MDL
   locks twice meaning additional performance overhead as compared
   to other variants.

The current plan is to follow approach a) as it is least intrusive and
its implementation is less likely to break anything in existing code.


Additional issue: LOCK TABLE READ LOCAL
=======================================

LOCK TABLE READ LOCAL is special syntax which allows concurrent inserts
into the locked table for MyISAM tables. It is not supported for InnoDB
tables. Still since we don't know table's storage engine before opening
it we will have to change our approach to metadata lock acquisition for
this statement. Possible options are:

a) Ignore LOCAL clause for LOCK TABLES. This is probably too big behavior
   change. (LOCK TABLES LOCAL is used by mysqldump so by doing this step
   we are likely to reduce concurrency of such backups). OTOH this option
   becomes much more attractive once MyISAM is deprecated.

b) Acquire SRO locks first and then downgrade them to SR locks if SE
   supports LOCAL clause (e.g. for MyISAM). This approach reduces a bit
   concurrency between LOCK TABLE READ LOCAL (e.g. mysqldump) and
   concurrent inserts.

c) Acquire SR locks if LOCAL clause is present, upgrade them to SRO if
   SE doesn't support this clause. This approach introduces possibility
   of MDL deadlocks for SE doing MDL upgrade. Such deadlocks will be 
   properly detected and should be handled somehow (e.g. by restarting
   lock acquisition). Since it is unlikely that anybody will use LOCK
   TABLE READ LOCAL for InnoDB tables, performance impact of these
   deadlocks probably can be ignored.

The suggestion is to go with approach c). It allows to fully preserve
backward compatibility and initial code investigation shows that its
implementation should be fairly straightforward.


Additional issue: Multi-UPDATE
==============================

Multi-update doesn't acquire thr_lock.c write locks on table participating
in its join which are not updated. As result such statement can run even
if some other thread locks tables which are not updated but participate
in its main join using LOCK TABLE READ. Since we acquire SW metadata lock
for all tables participating in Multi-UPDATE join the situation is going
to change once LOCK TABLE READ starts to acquire SRO lock - concurrent
LOCK TABLE READ on any table in join will block multi-update (AFAIU the
multi-update is the only place where we acquire metadata lock which is
stronger than thr_lock.c lock, so other places should not be affected).
Possible solutions for this issue are:

a) Accept change in behavior. The upside is that code stays simple, the
   downside is that some users may encounter problems + new behavior might
   be hard to explain without referring to implementation. Support Team
   was asked about this alternative and majority of people who answered
   seem to be OK with it (there is one person who objects).
   Also, to reduce the impact of this change in behavior, we can downgrade
   SW locks for tables which are read only  to SR locks.

b) Keep behavior backward-compatible. Change code to acquire S locks
   initially and then upgrade them to SR or SW locks after figuring out if
   specific table in join going to be only read or read and updated.
   Such an upgrade could lead to deadlocks which will be detected by MDL
   subsystem and need to be handled. Code which clean-ups after resolving
   columns from SET clause and allows to restart locking is present in 5.1.
   The awkwardness of this code (and the fact that it has caused some
   problems in the past) is the main downside of this approach. The upside -
   absence of change in behavior. Performance impact of lock upgrade and
   deadlock handling is negligible since upgrade takes "fast path" and
   deadlocks can occur only due to concurrent multi-update and DDL.

The suggestion is to go with option a) since it is more straighforward and
there were not strong opposition to it from Support. We still will be able
to implement option b) later if there will be loud complaints after MRU/RC
with this task is out.


Details: Not acquiring thr_lock.c locks for InnoDB
==================================================

Changing InnoDB to not acquire thr_lock.c locks itself is fairly
straightforward:

I)   We need to change ha_innobase::lock_count() to return 0, so no
     (THR_LOCK*) elements are allocated in MYSQL_LOCK::locks[] array for
     InnoDB tables.
II)  We need to ensure that ha_innobase::store_lock() doesn't try to store
     type of thr_lock.c lock in MYSQL_LOCK::locks[] array. Note that this
     method still needs to be called as it plays important role in
     determining type of row-locks used by InnoDB (as well as
     ha_innobase::externa_lock() and ha_innobase::start_stmt()).
     Refactoring this part of SE API is subject of different WL.
III) We need to double check that code in sql/lock.cc dealing with
     MYSQL_LOCK structure and thr_lock.c locks correctly handles all
     situations for SEs which don't use any thr_lock.c locks and
     return 0 as handler::lock_count().