WL#7047: InnoDB: Optimize buffer pool list scans and related batch processing code

Status: Complete   —   Priority: Medium

Recently our benchmarks showed that we are scanning excessive number of pages when 
doing flush list batches. We fixed it by introducing the concept of Hazard Pointer 
which reduced the time complexity of scan from O(n*n) to O(n). This WL builds on 
that to extend the same concept to other scans within buffer pool. There are 
significant other changes which are made as part of reduced scanning logic.
Hazard Pointers:
There was one hazard pointer (hp) based scan in buffer pool. With this work 
we'll have four hp based scans.

We have used the concept of hazard pointer in kind of reverse order. 
Academically a hazard pointer is a pointer that the thread working on it will 
declare as such and as long as that thread is not done no other thread is 
allowed to do anything with it.

In our case, we declare the pointer as a hazard pointer and then if any other 
thread attempts to work on it, it is allowed to do so but it has to adjust the 
hazard pointer to the next valid value. We use hazard pointer solely for reverse 
traversal of lists within a buffer pool instance.

* Traversal of LRU list during LRU flushing batch.
* If we are working on, say, bpage we set bpage->lru->prev as hazard pointer
* We release buf_pool mutex and start processing bpage
* Meanwhile if any other thread attempts to remove prev from the LRU list, it 
will adjust the hazard pointer by setting it to prev->lru->prev under 
buf_pool::mutex protection.
* When we are done processing bpage we grab buf_pool::mutex and read the hazard 
pointer and start processing it.

For each buffer pool instance we'll have:
* flush_hp: hp for flush list batches
* lru_hp: hp for lru batches
* lru_scan_itr: special purpose iterator we'll use to scan LRU list to find a 
victim for eviction. Threads doing scan will start scan not from tail of LRU but 
from where the previous scan ended.
* single_scan_itr: special purpose iterator we'll use to scan LRU list to find a 
victim for single page flushing/eviction. Threads doing scan will start scan not 
from tail of LRU but from where the previous scan ended.

The last two types of iterators will be capped by LRU->old i.e.: if we reach the 
end of 'old' pages within LRU list we'll reset to the tail of LRU.

Page Eviction from LRU to free list:
We evict pages from LRU to free list when:
* We do LRU batch
* We do single page flushing to find a victim

What we have currently is quite involved. During LRU batch if we hit a page that 
is good for eviction (it is clean) we evict it i.e., put it on free list. If, 
however, it is dirty we flush it but don't put it on free list. This is because 
during flushing we release the buf_pool::mutex and we don't want to start a 
rescan for eviction when we are done flushing of that page. We simply leave it 
for eviction in the next iteration.
In case of single page flushing we do:
* scan to find a victim to flush
* flush the page
* scan to find that free page (scan needed because we released buf_pool::mutex)
* evict the page

In new scheme of things we'll move the eviction to the IO completion routine 
(typically run in IO helper thread). This means we can do both flushing and 
eviction in a single pass during LRU and single page flushing.

LRU batches in single pass:
Currently we divide LRU batches into smaller chunks of 100 pages. Why?
We do this because we want to go through quick iterations of flushing and 
eviction as described above. This ensures that thread waiting for a clean page 
don't have to wait very long in case of a very large LRU batch i.e., high value 
of innodb_lru_scan_depth.
However, in new scheme of things the eviction is done in the same pass as 
flushing. That means we can have arbitrarily large LRU batch and the clean 
blocks will become available to the waiting threads as and when they are flushed 
(because they will be evicted in the IO completion routine).
Therefore, the logic to divide LRU batch into smaller chunks is not needed 

No wait to end LRU batch:
The threads looking for a victim don't need to wait for the LRU batch to end. If 
they can find a page in free list that is fine, otherwise they can retry as the 
pages are becoming available as and when they are flushed and if that doesn't 
help they should try single page flushing.

Other Changes:
One other thing that showed up on perf when working on high end system is that 
when doing a search for replaceable page we call buf_LRU_free_page(). This will 
compute the fold value and grab page_hash::lock before checking whether or not 
page is replaceable. This is called very frequently and we should only try this 
function if buf_flush_ready_for_replace() returns true.

Changes based on Performance meeting discussions:
* When scanning a clean page from common LRU list, only scan minimal pages, hard 
coded to 100 for now. Why minimal? Because with special iterator this much is 
enough to tell us the density of dirty pages. If it is higher then 99% i.e.: 100 
dirty pages without a clean page, in the tail of LRU then we are better off 
doing a single page flush.
* Above means that now innodb_lru_scan_depth only controls the LRU batches i.e.: 
how many pages the page_cleaner thread will try to keep in free list.
* Add extra counters to distinguish between pages flushed and pages evicted 
during and LRU batch.