WL#7306: Improve MDL performance and scalability by implementing lock-free lock acquisition for DML

Affects: Server-5.7   —   Status: Complete

Since DML statements are the most common in typical workloads and benchmarks,
it makes sense to optimize the MDL subsystem for such statements. One possible
optimization is to make acquisition/release of locks for DML statements (S, SH,
SR and SW locks) lock-free, e.g. by replacing the complex construct involving
locking a mutex, manipulating the object and unlocking the mutex with a single
CAS operation updating one object member.

Preliminary benchmarks of a draft patch implementing this idea have shown good
results. In some cases performance was on par with a tree built without MDL.

User Documentation
==================

http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-4.html

Functional requirements
=======================

There are no functional requirements for this task, as user-visible
functionality is not supposed to be changed by it.

Non-functional requirements
===========================

NF1) The only non-functional requirement is a performance improvement in the
POINT_SELECT/OLTP_RO SysBench tests with a single InnoDB table on a
multi-core machine.

High-Level Architecture
=======================

None needed. See LLD.
General idea
============

After WL#7304 and WL#7305 are implemented, acquisition of an unobtrusive
metadata lock can be described by the following pseudocode:

retry:
  /* Look-up MDL_lock object in LF_HASH. */
  lock= lf_hash_search(pins, &m_locks, mdl_key);

  wrlock(&lock->m_rwlock);

  if (lock->m_is_destroyed)
  {
    unlock(&lock->m_rwlock);
    unpin(pins, lock);
    goto retry;
  }

  unpin(pins, lock);

  if (! lock->m_obtrusive_locks_granted_waiting_count)
  {
    /* Increment number of acquired fast path locks of this type. */
    lock->m_fast_path_granted_count+= unobtrusive_lock_increment;
    unlock(&lock->m_rwlock);
    ...
  }
  else
  {
    /* Slow path. */
    ...
  }

Similarly, release of a fast path lock looks like:

  wrlock(&lock->m_rwlock);

  /* Decrement number of acquired fast path locks of this type. */
  lock->m_fast_path_granted_count-= unobtrusive_lock_increment;

  if (lock->is_empty())
  {
    /* If it was the last ticket for this MDL_lock object, remove it from MDL_map. */
    ...
  }
  else
  {
    /*
      If there are any lock requests waiting for the ticket being released
      to go away, try to reschedule them.
    */
    if (lock->m_obtrusive_locks_granted_waiting_count)
      lock->reschedule_waiters();
    unlock(&lock->m_rwlock);
  }

By representing both the MDL_lock::m_is_destroyed member and the value
of the (MDL_lock::m_obtrusive_locks_granted_waiting_count != 0)
predicate as flags in the MDL_lock::m_fast_path_granted_count packed
counter, it becomes possible to replace the above steps involving mutex
acquisition/checks and packed counter increment with a single atomic
compare-and-swap operation on MDL_lock::m_fast_path_granted_count.
Since after this change MDL_lock::m_fast_path_granted_count will no
longer contain only lock counts, it makes sense to rename this member
to MDL_lock::m_fast_path_state.
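
As a rough illustration, the packed member could look as follows. This is a
minimal C++11 sketch: the concrete bit positions, the counter width and the
MDL_lock_sketch name (reused by later sketches) are assumptions for
illustration, not the actual mdl.h layout:

  #include <atomic>
  #include <cstdint>

  typedef uint64_t fast_path_state_t;

  /*
    Flag bits packed into the top of the state word (positions are
    illustrative). The low bits remain available for the per-lock-type
    "fast path" counters, which are updated by adding a type-specific
    unobtrusive_lock_increment.
  */
  static const fast_path_state_t IS_DESTROYED=  1ULL << 63;
  static const fast_path_state_t HAS_OBTRUSIVE= 1ULL << 62;

  struct MDL_lock_sketch
  {
    /*
      Replaces both MDL_lock::m_fast_path_granted_count and
      MDL_lock::m_is_destroyed.
    */
    std::atomic<fast_path_state_t> m_fast_path_state;
  };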

As a result, the pseudocode for acquisition of an unobtrusive lock is
transformed to:

retry:
  /* Look-up MDL_lock object in LF_HASH. */
  lock= lf_hash_search(pins, &m_locks, mdl_key);

  old_state= lock->m_fast_path_state;

  do
  {
    if (old_state & IS_DESTROYED)
    {
      unpin(pins, lock);
      goto retry;
    }
    if (old_state & HAS_OBTRUSIVE)
      goto slow_path;
    
    /*
      If the lock is not waiting to be destroyed and doesn't have active
      or pending obtrusive locks, increment the number of acquired fast
      path locks of this type using CAS.
    */
  }
  while (!compare_and_swap(&lock->m_fast_path_state, &old_state,
                           old_state + unobtrusive_lock_increment));
  
  /* We have acquired lock. */
  unpin(pins, lock);
  ...
  return;

slow_path:

  /* Acquire MDL_lock::m_rwlock and try to acquire lock using slow path. */
  wrlock(&lock->m_rwlock);
  
  if (lock->m_fast_path_state & IS_DESTROYED)
  {
    unlock(&lock->m_rwlock);
    unpin(pins, lock);
    goto retry;
  }

  unpin(pins, lock);

  ...

And the pseudocode for release of a lock which was acquired using the fast path:

  old_state= lock->m_fast_path_state;

  /*
    We need to decrement the part of the counter which holds the number
    of acquired "fast path" locks of this type.
  */
  do
  {
    if (((old_state - unobtrusive_lock_increment) == 0) ||
        (old_state & MDL_lock::HAS_OBTRUSIVE))
    {
      /*
        Either there is a high chance that this is the last ticket for the
        lock, or there are pending or active obtrusive locks.

        In both cases we need to acquire MDL_lock::m_rwlock and:
        - either try to remove the object from the hash and delete it,
        - or try to reschedule pending locks.
      */
      wrlock(&lock->m_rwlock);
  
      /* Decrement number of acquired fast path locks of this type. */
      atomic_add(&lock->m_fast_path_state, -unobtrusive_lock_increment);

      if (lock->m_granted.is_empty() && lock->m_waiting.is_empty())
      {
        /*
          The above check corresponds to the part of MDL_lock::is_empty()
          in the old code which can be done without an atomic
          compare-and-swap.

          This might be the last ticket for the lock, so atomically check
          that there are no "fast path" locks counted in
          MDL_lock::m_fast_path_state and set the MDL_lock::IS_DESTROYED
          flag. If this step succeeds, remove the lock from the LF_HASH.
        */
        ...
      }
      else
      {
        /*
          If there are any lock requests waiting for the ticket being
          released to go away, try to reschedule them. Since this is a
          "fast path" ticket it represents an "unobtrusive" type of lock,
          so if there are any waiters there must be an "obtrusive" type of
          request among them.
        */
        if (lock->m_obtrusive_locks_granted_waiting_count)
          lock->reschedule_waiters();
        unlock(&lock->m_rwlock);
      }
      break;
    }
  }
  while (! compare_and_swap(&lock->m_fast_path_state, &old_state,
                            old_state - unobtrusive_lock_increment));

As a result, acquisition of an unobtrusive lock in the absence of
obtrusive locks becomes equivalent to a single CAS, which should be
roughly half as expensive as acquiring a mutex, checking, incrementing
and releasing the mutex. Release of a lock acquired on the fast path
(in the absence of obtrusive locks, and assuming we are not releasing
the last lock for this MDL_lock object) also becomes equivalent to a
single CAS, which again should be roughly half as expensive as mutex
acquisition, decrement, checks and mutex release.
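
For concreteness, here is what this acquire/release pair could look like with
C++11 atomics. It is a sketch built on the assumed MDL_lock_sketch type from
the earlier sketch; the function names and the must_retry out-parameter are
illustrative, and the hash look-up, pinning and the slow path are left out:

  /*
    Fast path acquisition. Returns true if the lock was granted; on false,
    *must_retry tells the caller whether to redo the hash look-up (object
    being destroyed) or to fall back to the slow path.
  */
  bool try_acquire_fast_path(MDL_lock_sketch *lock,
                             fast_path_state_t unobtrusive_lock_increment,
                             bool *must_retry)
  {
    fast_path_state_t old_state= lock->m_fast_path_state.load();
    do
    {
      if (old_state & IS_DESTROYED)
      {
        *must_retry= true;   // Object is going away: unpin, re-search hash.
        return false;
      }
      if (old_state & HAS_OBTRUSIVE)
      {
        *must_retry= false;  // Obtrusive lock present: take the slow path.
        return false;
      }
      /*
        compare_exchange_weak() reloads old_state on failure, so the above
        checks are redone against the fresh value on each iteration.
      */
    }
    while (!lock->m_fast_path_state.compare_exchange_weak(
               old_state, old_state + unobtrusive_lock_increment));
    *must_retry= false;
    return true;  // Counter incremented: the lock is granted.
  }

  /*
    Fast path release. Returns false when MDL_lock::m_rwlock must be taken
    instead: either this may be the last ticket (possible destruction of
    the MDL_lock) or obtrusive locks are present (waiters may need
    rescheduling).
  */
  bool release_fast_path(MDL_lock_sketch *lock,
                         fast_path_state_t unobtrusive_lock_increment)
  {
    fast_path_state_t old_state= lock->m_fast_path_state.load();
    do
    {
      if (old_state - unobtrusive_lock_increment == 0 ||
          (old_state & HAS_OBTRUSIVE))
        return false;
    }
    while (!lock->m_fast_path_state.compare_exchange_weak(
               old_state, old_state - unobtrusive_lock_increment));
    return true;
  }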


Detailed description
====================

What changes are necessary to implement the above idea?

Essentially we perform the following three transformations:

I)   MDL_lock::m_fast_path_granted_count is replaced with an atomic
     MDL_lock::m_fast_path_state member, which in the ideal case of
     "fast path" acquisition/release is checked and changed using CAS
     without holding any mutexes.
II)  Since we would like to check in the same atomic CAS operation that
     the MDL_lock object has not been destroyed, its m_is_destroyed member
     is replaced by an IS_DESTROYED bit flag in the m_fast_path_state
     packed counter.
III) Similarly, since we also would like to check in the same atomic CAS
     that there are no granted or pending obtrusive locks, we have to
     add a HAS_OBTRUSIVE bit flag to m_fast_path_state, while
     keeping MDL_lock::m_obtrusive_locks_granted_waiting_count.
     This flag should be set when we are about to try acquiring an obtrusive
     lock and cleared once the last granted or pending obtrusive lock goes
     away. Note that this flag must be set before we call
     MDL_lock::can_grant_lock() when trying to acquire an obtrusive lock,
     as otherwise it would be hard to avoid races due to concurrent changes
     to m_fast_path_state happening on the fast path. This is a subtle
     but important difference from the current handling of
     MDL_lock::m_obtrusive_locks_granted_waiting_count (see the sketch
     below).
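
A sketch of the ordering in III), again on the assumed types from the
"General idea" section (prepare_obtrusive_acquisition is a hypothetical
helper, not an actual mdl.cc function):

  /*
    Called under MDL_lock::m_rwlock before MDL_lock::can_grant_lock().
    fetch_or() publishes HAS_OBTRUSIVE atomically, so any fast path
    acquisition starting after this point sees the flag and falls through
    to the slow path, and thus cannot invalidate the result of the
    compatibility check that follows.
  */
  void prepare_obtrusive_acquisition(MDL_lock_sketch *lock)
  {
    lock->m_fast_path_state.fetch_or(HAS_OBTRUSIVE);
  }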

Let us take a detailed look at all places which currently use these members
and are going to be affected by these transformations:

1) Increment of m_fast_path_state on a fast path in 
   try_acquire_lock_impl() should be done as an atomic compare-and-swap
   operation (which also checks if IS_DESTROYED or HAS_OBTRUSIVE
   flags are set) without taking MDL_lock::m_rwlock.
2) Reads of m_fast_path_state in
   MDL_lock::fast_path_granted_bitmap() can happen in two cases.
   In the first one we call MDL_lock::can_grant_lock() to figure out
   if we can grant an unobtrusive lock. Since unobtrusive locks are
   by definition compatible with all fast path locks, we don't really
   care about the m_fast_path_state value in this case, so reads
   don't have to be atomic.
   In the second one we call MDL_lock::can_grant_lock() to figure out
   if we can grant an obtrusive lock. But this means that the HAS_OBTRUSIVE
   flag is set so all changes to m_fast_path_state happen under
   protection of MDL_lock::m_rwlock (this is an invariant which we need
   to enforce, let us call it [INV1]). Since can_grant_lock() is called
   only when MDL_lock::m_rwlock is held, it is safe to do an ordinary read
   of m_fast_path_state in this case as well.
3) Decrement of m_fast_path_state in
   MDL_context::materialize_fast_path_locks() should be done atomically
   and under protection of MDL_lock::m_rwlock. This is necessary in
   order to ensure that decrement and addition of corresponding ticket
   to m_granted list happen atomically and to enforce [INV1].
4) Increment of m_fast_path_state in MDL_context::clone_ticket()
   needs to happen atomically and under protection of MDL_lock::m_rwlock
   at least when the HAS_OBTRUSIVE flag is set in order to enforce [INV1].
5) Decrement of m_fast_path_state in
   MDL_context::upgrade_shared_lock() needs to happen under protection of
   m_rwlock in order to be atomic with changes to ticket lists. Technically
   this decrement doesn't have to be atomic since we always upgrade
   to obtrusive locks, so we have HAS_OBTRUSIVE flag set at the point
   where decrement happens and thus thanks to [INV1] there can be no
   concurrent changes to m_fast_path_state.
6) Decrement of m_fast_path_state on a fast path in
   MDL_context::release_lock() should happen atomically (without protection
   of MDL_lock::m_rwlock). The same compare-and-swap has to check whether
   the HAS_OBTRUSIVE flag is set or whether we are about to release the
   last fast path ticket. In these cases MDL_lock::m_rwlock has to be
   acquired.
   In the former case this is necessary to make the decrement and the
   rescheduling of waiting locks atomic. This also enforces [INV1].
   In the latter case acquiring m_rwlock is necessary as a prerequisite for
   possible deletion of the MDL_lock (we need to ensure that only one thread
   tries to delete the MDL_lock/set the IS_DESTROYED flag, let us call this
   [INV2]).
   This also allows us to ensure that we can never have an outstanding
   reference to an MDL_lock which is not pinned, does not hold m_rwlock
   and has a zero m_fast_path_state (including the IS_DESTROYED bit).
   Failing to ensure this opens the door to a race where the object whose
   m_fast_path_state we have just decremented to 0 is deleted by a
   concurrent thread immediately afterwards, before we even try to
   destroy the object ourselves.
7) Since checking that m_fast_path_state is 0 and setting the
   IS_DESTROYED flag need to happen atomically, places which use
   MDL_lock::is_empty() (i.e. MDL_context::release_lock() and
   MDL_lock::remove_ticket()) need to be changed to check directly that
   m_granted and m_waiting are empty and then call MDL_map::remove(),
   which will try to check that m_fast_path_state is 0 and set
   IS_DESTROYED with a single atomic compare-and-swap operation.
   Consequently, it makes sense to rename MDL_map::remove() to
   remove_if_no_fast_path() (see the sketch after this list).
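
Point 7) then amounts to a single strong CAS, roughly as follows (a sketch
on the assumed types; the function name follows the rename suggested above):

  /*
    Called under MDL_lock::m_rwlock, with m_granted and m_waiting already
    known to be empty. The CAS succeeds only if the whole packed state is
    zero (no fast path locks, no flags); setting IS_DESTROYED makes
    concurrent acquirers retry their hash look-up and enforces [INV2].
  */
  bool remove_if_no_fast_path(MDL_lock_sketch *lock)
  {
    fast_path_state_t expected= 0;
    return lock->m_fast_path_state.compare_exchange_strong(expected,
                                                           IS_DESTROYED);
  }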

1') Checking of the MDL_lock::m_is_destroyed member which now happens in
    the MDL_map::find_or_insert() method needs to be replaced with a check
    of the IS_DESTROYED flag in m_fast_path_state and moved out of
    find_or_insert() and into MDL_context::try_acquire_lock_impl().
    On the fast path this needs to happen as a part of an atomic
    compare-and-swap operation which also increments packed counter
    in m_fast_path_state.
    On the slow path it can be done after we have acquired
    MDL_lock::m_rwlock. Thanks to the fact that we always set
    IS_DESTROYED with m_rwlock held, we can do this check using the usual
    non-atomic read (it can return outdated/incorrect values for other
    parts of m_fast_path_state but this is irrelevant in this
    case).
    In both these cases we need to carry out the check before unpinning
    the MDL_lock object.
2') As mentioned above, MDL_map::remove() needs to be changed to
    set the IS_DESTROYED flag as part of an atomic compare-and-swap
    operation which also checks that m_fast_path_state is zero and happens
    under protection of MDL_lock::m_rwlock.
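
The slow-path recheck from 1') can then be a plain load, sketched below on
the same assumed types; the relaxed memory order is an illustrative choice,
justified by the fact that IS_DESTROYED is only ever set with m_rwlock held:

  /*
    1') slow-path recheck, called with MDL_lock::m_rwlock held. Since
    IS_DESTROYED is only set under m_rwlock, a plain load suffices here;
    other bits of the returned value may be stale, but they are not
    examined.
  */
  bool is_being_destroyed(const MDL_lock_sketch *lock)
  {
    return (lock->m_fast_path_state.load(std::memory_order_relaxed) &
            IS_DESTROYED) != 0;
  }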

1'') Code on the fast path in MDL_context::try_acquire_lock_impl() which
     checks MDL_lock::m_obtrusive_locks_granted_waiting_count value needs
     to be adjusted to check for the presence of the HAS_OBTRUSIVE flag
     instead. This needs to happen as part of the same compare-and-swap
     construct which increments packed counter in m_fast_path_state.
2'') When we try to acquire an obtrusive lock using the slow path in
     MDL_context::try_acquire_lock_impl() we need to atomically set
     the HAS_OBTRUSIVE flag in m_fast_path_state before we call 
     the MDL_lock::can_grant_lock() method. This is necessary to prevent
     concurrent fast path acquisitions from invalidating the results of
     this method. Consequently it is more convenient to increment the
     MDL_lock::m_obtrusive_locks_granted_waiting_count counter in the
     same place. This means that we also no longer need to increment
     this counter in the MDL_context::acquire_lock() method, but instead
     we need to decrement it and clear the HAS_OBTRUSIVE flag in
     MDL_context::try_acquire_lock().
     This also means that there is a subtle change in semantics of
     m_obtrusive_locks_granted_waiting_count - it now also accounts
     for obtrusive locks we are about to try to acquire.

     QQ: So perhaps its name should be changed? OTOH this change in
         semantics is not strictly necessary, so maybe we should keep
         semantics as it is now?

3'') Similarly, MDL_lock::remove_ticket() needs to be changed to atomically
     clear the HAS_OBTRUSIVE flag when we are removing the last ticket
     for an obtrusive lock.
4'') Code in MDL_context::clone_ticket() doesn't require adjustment as
     we don't decrease the number of active or pending obtrusive locks
     in this method or change this number from 0 to a non-0 value.
5'') Code in MDL_context::upgrade_shared_lock() can be kept as is since
     we always upgrade to obtrusive locks and therefore the HAS_OBTRUSIVE
     flag will be set by acquire_lock().
6'') Instead of checking the MDL_lock::m_obtrusive_locks_granted_waiting_count
     value on fast path in MDL_context::release_lock(), we need to check
     for presence of the HAS_OBTRUSIVE flag and acquire MDL_lock::m_rwlock
     to do rescheduling of waiting tickets and to enforce [INV1] if this
     flag is set. This check should be done atomically by the same
     compare-and-swap construction which decrements packed counter in the
     MDL_lock::m_fast_path_state.
7'') MDL_ticket::downgrade_lock() should be adjusted to clear the HAS_OBTRUSIVE
     flag in cases when we have downgraded the ticket for the last granted or
     pending obtrusive lock. This needs to be done atomically in order
     to avoid lost/partial updates to MDL_lock::m_fast_path_state
     on 32-bit platforms.
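
For 3'') and 7'') the clearing itself can be a single atomic AND, e.g.
(same assumed types; clear_has_obtrusive is a hypothetical helper):

  /*
    Clears HAS_OBTRUSIVE once the last granted or pending obtrusive lock
    goes away. fetch_and() keeps the flag update atomic with respect to
    concurrent fast path CAS loops, which also rules out the lost/partial
    updates possible with non-atomic read-modify-write on 32-bit platforms.
  */
  void clear_has_obtrusive(MDL_lock_sketch *lock)
  {
    lock->m_fast_path_state.fetch_and(~HAS_OBTRUSIVE);
  }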

To sum up the above, we use three approaches for handling reads of and
changes to the MDL_lock::m_fast_path_state member:

A) Write and Read-Modify-Write operations are always carried out
   atomically, this is necessary to avoid lost updates on 32-bit
   platforms among other things.
B) In some cases Reads can be done non-atomically because we don't
   really care about the value they return (for example, when further
   down the line there is a compare-and-swap operation which will
   validate this value and provide the correct value if the validation
   fails).
C) In other cases Reads can be done non-atomically since they happen
   under protection of MDL_lock::m_rwlock and there is some invariant
   which ensures that concurrent updates of the MDL_lock::m_fast_path_state
   member can't happen while MDL_lock::m_rwlock is held.