Recently our benchmarks showed that we are scanning excessive number of pages when doing flush list batches. We fixed it by introducing the concept of Hazard Pointer which reduced the time complexity of scan from O(n*n) to O(n). This WL builds on that to extend the same concept to other scans within buffer pool. There are significant other changes which are made as part of reduced scanning logic.
Hazard Pointers: ================ There was one hazard pointer (hp) based scan in buffer pool. With this work we'll have four hp based scans. We have used the concept of hazard pointer in kind of reverse order. Academically a hazard pointer is a pointer that the thread working on it will declare as such and as long as that thread is not done no other thread is allowed to do anything with it. In our case, we declare the pointer as a hazard pointer and then if any other thread attempts to work on it, it is allowed to do so but it has to adjust the hazard pointer to the next valid value. We use hazard pointer solely for reverse traversal of lists within a buffer pool instance. Example: * Traversal of LRU list during LRU flushing batch. * If we are working on, say, bpage we set bpage->lru->prev as hazard pointer * We release buf_pool mutex and start processing bpage * Meanwhile if any other thread attempts to remove prev from the LRU list, it will adjust the hazard pointer by setting it to prev->lru->prev under buf_pool::mutex protection. * When we are done processing bpage we grab buf_pool::mutex and read the hazard pointer and start processing it. For each buffer pool instance we'll have: * flush_hp: hp for flush list batches * lru_hp: hp for lru batches * lru_scan_itr: special purpose iterator we'll use to scan LRU list to find a victim for eviction. Threads doing scan will start scan not from tail of LRU but from where the previous scan ended. * single_scan_itr: special purpose iterator we'll use to scan LRU list to find a victim for single page flushing/eviction. Threads doing scan will start scan not from tail of LRU but from where the previous scan ended. The last two types of iterators will be capped by LRU->old i.e.: if we reach the end of 'old' pages within LRU list we'll reset to the tail of LRU. Page Eviction from LRU to free list: ==================================== We evict pages from LRU to free list when: * We do LRU batch * We do single page flushing to find a victim What we have currently is quite involved. During LRU batch if we hit a page that is good for eviction (it is clean) we evict it i.e., put it on free list. If, however, it is dirty we flush it but don't put it on free list. This is because during flushing we release the buf_pool::mutex and we don't want to start a rescan for eviction when we are done flushing of that page. We simply leave it for eviction in the next iteration. In case of single page flushing we do: * scan to find a victim to flush * flush the page * scan to find that free page (scan needed because we released buf_pool::mutex) * evict the page In new scheme of things we'll move the eviction to the IO completion routine (typically run in IO helper thread). This means we can do both flushing and eviction in a single pass during LRU and single page flushing. LRU batches in single pass: =========================== Currently we divide LRU batches into smaller chunks of 100 pages. Why? We do this because we want to go through quick iterations of flushing and eviction as described above. This ensures that thread waiting for a clean page don't have to wait very long in case of a very large LRU batch i.e., high value of innodb_lru_scan_depth. However, in new scheme of things the eviction is done in the same pass as flushing. That means we can have arbitrarily large LRU batch and the clean blocks will become available to the waiting threads as and when they are flushed (because they will be evicted in the IO completion routine). Therefore, the logic to divide LRU batch into smaller chunks is not needed anymore. No wait to end LRU batch: ========================= The threads looking for a victim don't need to wait for the LRU batch to end. If they can find a page in free list that is fine, otherwise they can retry as the pages are becoming available as and when they are flushed and if that doesn't help they should try single page flushing. Other Changes: ============== One other thing that showed up on perf when working on high end system is that when doing a search for replaceable page we call buf_LRU_free_page(). This will compute the fold value and grab page_hash::lock before checking whether or not page is replaceable. This is called very frequently and we should only try this function if buf_flush_ready_for_replace() returns true. Changes based on Performance meeting discussions: ================================================= * When scanning a clean page from common LRU list, only scan minimal pages, hard coded to 100 for now. Why minimal? Because with special iterator this much is enough to tell us the density of dirty pages. If it is higher then 99% i.e.: 100 dirty pages without a clean page, in the tail of LRU then we are better off doing a single page flush. * Above means that now innodb_lru_scan_depth only controls the LRU batches i.e.: how many pages the page_cleaner thread will try to keep in free list. * Add extra counters to distinguish between pages flushed and pages evicted during and LRU batch.