WL#12616: InnoDB: Make number of PAUSES in spin loops configurable
Affects: Server-8.0
—
Status: Complete
When performing a spin locking loop to check whether a mutex or rw-lock has become free, we usually pick a small random `delay` (say, a dice roll), multiply it by 50, and perform that many PAUSE instructions. The comment for ut_delay() states that this procedure was calibrated on a 100 MHz Pentium. Things have changed since then. In particular, Skylake processors have a much slower (~15x) PAUSE instruction than other machines. The purpose of this WL is to let the end user configure this algorithm so that it is better suited to the particular processor.
FR1. By default, the patch should not affect the system in any way (in particular, the default value of the new sys var should be chosen so as to simulate the old behaviour), so that upgrading the system will not accidentally affect it.

FR2. It must be possible to configure InnoDB running on a system on which the PAUSE instruction takes 10x as much time as on a "regular" system, in such a way that the observed duration of the performed chains of PAUSES is the same as it would be on a regular system. In other words, if on a "regular" system the observed duration follows the unif{0,6}*50*Regular_pause_duration distribution, then we want to have a similar distribution on a system where a single PAUSE takes Regular_pause_duration*10, so that we can support processor architectures with much slower PAUSES. (NOTE: arguably this is difficult to test using "external" tools, other than debugging/looking at the code. Ideas on how to test FR2 are welcome.)

FR3. The end user should have the ability to fine-tune the system (w.r.t. the number of PAUSES per spin) in real time on a running InnoDB instance, so that she or he can take a more end-to-end view of performance, as it is not clear that the original value of 50 or the suggested value of 50/10 is the actual optimum.
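The arithmetic behind FR2 can be checked with a small sketch. It assumes PAUSE cost is measured in abstract time units (1 on a "regular" system, 10 on the slow one) and that `delay` is drawn from 0..6 inclusive. The helper names `spin_duration` and `duration_support` are hypothetical and are not part of InnoDB.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not InnoDB code): observed duration of one spin wait,
// in abstract time units, for a given random `delay`, pause multiplier and
// per-PAUSE cost.
std::uint64_t spin_duration(std::uint64_t delay, std::uint64_t multiplier,
                            std::uint64_t pause_cost) {
  return delay * multiplier * pause_cost;
}

// Durations for every delay in 0..max_delay, i.e. the support of the
// unif{0,max_delay} * multiplier * pause_cost distribution.
std::vector<std::uint64_t> duration_support(std::uint64_t max_delay,
                                            std::uint64_t multiplier,
                                            std::uint64_t pause_cost) {
  assert(pause_cost > 0);  // a PAUSE that costs nothing makes no sense here
  std::vector<std::uint64_t> out;
  for (std::uint64_t d = 0; d <= max_delay; ++d) {
    out.push_back(spin_duration(d, multiplier, pause_cost));
  }
  return out;
}
```

Under these assumptions, `duration_support(6, 50, 1)` and `duration_support(6, 5, 10)` contain the same values, so a multiplier of 50/10 = 5 on the slow system reproduces the regular system's distribution of wait durations while keeping the same seven equally likely outcomes.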
When performing a spin locking loop to check whether a mutex or rw-lock has become free, we usually pick a number `delay` from the range 0..@@innodb_spin_wait_delay uniformly at random, and pass it to ut_delay(delay), which performs 50*delay calls to UT_RELAX_CPU() (which translates to PAUSE). The comment for ut_delay() states that it was calibrated on a 100 MHz Pentium. Things have changed since then. In particular, Skylake processors have a much slower (~15x) PAUSE instruction than other machines, which means that to achieve a similar slowdown we would need to reduce @@innodb_spin_wait_delay 15 times, but this is:

a) not possible, as its default value is 6, and thus 6/15 after rounding is either 0 or 1,
b) not equivalent, as the goal of randomization seems to be to decorrelate multiple waiters, and if we shrink the range from which the random `delay` is picked, then the probability of a collision (picking the same number as another thread) is higher, and
c) not fine-grained enough, as the granularity of "50" will not be affected, so one cannot get values smaller than 50 but larger than 0.

The constant 50 should instead be platform-dependent (auto-tuning) or configurable (new dynamic sys-var). This would allow running our software on a variety of processors which differ in their implementation of the PAUSE instruction. It would also give the user the power to fine-tune the value based on real-world data (say, by observing the lock contention, or the time wasted in ut_delay while the server performs typical transactions).

New system variables introduced:
================================

@@innodb_spin_wait_pause_multiplier
- a global, dynamic system variable, ranging from 0 to 100, with default equal to the backward-compatible value of 50
- can be accessed like this:
    set global innodb_spin_wait_pause_multiplier=10;
    select @@global.innodb_spin_wait_pause_multiplier;
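The three objections to shrinking @@innodb_spin_wait_delay (points a, b and c) can be illustrated numerically. The sketch below assumes that `delay` is drawn uniformly from 0..max_delay inclusive and that two waiters draw independently. All function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdint>

// a) Shrinking the upper bound of `delay` by the slowdown factor, rounding
//    down or up. Both helpers are hypothetical, not InnoDB code.
std::uint64_t shrink_floor(std::uint64_t max_delay, std::uint64_t slowdown) {
  assert(slowdown > 0);
  return max_delay / slowdown;
}

std::uint64_t shrink_ceil(std::uint64_t max_delay, std::uint64_t slowdown) {
  assert(slowdown > 0);
  return (max_delay + slowdown - 1) / slowdown;
}

// b) Probability that two waiters, each drawing independently and uniformly
//    from 0..max_delay, pick the same value: 1 / (max_delay + 1).
double collision_probability(std::uint64_t max_delay) {
  return 1.0 / static_cast<double>(max_delay + 1);
}

// c) Number of PAUSEs performed for a given `delay`; the smallest non-zero
//    count is reached at delay == 1 and equals the multiplier itself.
std::uint64_t pauses(std::uint64_t delay, std::uint64_t multiplier) {
  return delay * multiplier;
}
```

With the default upper bound of 6 and a 15x slowdown, the shrunken bound is 0 or 1 depending on rounding. At a bound of 1 the collision probability rises from 1/7 to 1/2. With the constant 50 left in place, the smallest non-zero wait is still 50 PAUSEs, whereas a multiplier of 5 would allow 5.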
This WL will modify the `ulint ut_delay(ulint delay)` function, which is used to perform PAUSE instructions. As of today it performs delay*50 PAUSES. The new implementation will replace the constant 50 with a dynamic system variable whose default value is 50. So, two changes are needed:

1. introduction of a new dynamic InnoDB-specific system variable
2. modification of the waiting loop so that the variable's current value is used instead of 50

Some care must be taken to avoid unintentionally affecting the performance of the loop by reading the current value of the sys var itself. A sufficient solution is to use a non-atomic read, and to do it only once before the loop.
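A minimal sketch of the modified loop, under stated assumptions, might look as follows. The global `spin_wait_pause_multiplier`, the helper `relax_cpu()` and the function `ut_delay_sketch()` are hypothetical stand-ins for the new sys var, UT_RELAX_CPU() and ut_delay(); they are not the actual InnoDB definitions. The sketch returns the number of PAUSEs performed purely so that its behaviour can be observed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for @@innodb_spin_wait_pause_multiplier.
// The default of 50 reproduces the old hard-coded behaviour.
unsigned long spin_wait_pause_multiplier = 50;

// Hypothetical stand-in for UT_RELAX_CPU(): a PAUSE on x86 with GCC/Clang,
// otherwise just a compiler barrier.
static inline void relax_cpu() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_ia32_pause();
#else
  std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Sketch of the new waiting loop: performs delay * multiplier PAUSEs and
// returns how many were performed.
std::uint64_t ut_delay_sketch(std::uint64_t delay) {
  // Plain (non-atomic) read of the variable, done once before the loop,
  // so that the loop body itself never touches the sys var.
  const std::uint64_t multiplier = spin_wait_pause_multiplier;
  assert(multiplier <= 100);  // documented range of the variable is 0..100

  const std::uint64_t iterations = delay * multiplier;
  std::uint64_t performed = 0;
  for (std::uint64_t i = 0; i < iterations; ++i) {
    relax_cpu();
    ++performed;
  }
  return performed;
}
```

Reading the variable once means that a concurrent `set global` takes effect on the next call rather than in the middle of a wait, which appears acceptable for a tuning knob of this kind.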
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.