WL#2373: Use cycle counter for timing
Affects: Server-5.5 — Status: Complete — Priority: Medium
For WL#2360 Performance wait_events, we need to time procedures, e.g. waits for mutexes, frequently and unobtrusively. Phase one: compare speed of RDTSC and gettimeofday() or QueryPerformanceCounter. Done. Phase two: add cycle counters for non-x86 platforms. Done. Phase three: check that the test code does what the comments say, check that the results are as stated in high-level specification. Done. Phase four: put cycle counter functions in server code. Done. Phase four took place as part of WL#2360 work. [ Note added by Peter Gulutzan 2009-11-02 ] I have marked the code-review boxes "complete". Reviewers were Vladislav Vaintroub, Olav Sandstaa, Kay Roepke. Emails re review (and approvals of a change in late October) are in lists.mysql.com/commits starting here: http://lists.mysql.com/commits/74177
For WL#2360 Performance Schema, we need to time procedures, e.g. waits for mutexes, frequently and unobtrusively. After testing, I see that gettimeofday would be preferable on a few platforms, but a choice of a low-level cycle counter is usually best. A cycle counter commonly returns the number of cycles that have gone by since the computer was last reset. Thus, on a 1GHz Pentium which has been on for one second, the value of the cycle counter is 1 billion. With the assembler instruction RDTSC it's possible to get this value into registers. An example with gcc inline assembly: #include <sys/time.h> ... #define RDTSC(val) \ __asm__ __volatile__ ("rdtsc" : "=A" (val)) ... long long time1; long long time2; ... RDTSC(time1); test1(); RDTSC(time2); printf("speed: %lld\n",time2-time1); Other proposals --------------- In the original version of this task, I suggested: "Others may be aware of something better [than a cycle counter], we'll see.". Some people noted that we already have my_getsystime(), gettimeofday(), clock_gettime(), and other variants of wall-clock-time functions. InnoDB already uses gettimeofday for mutex timing, see ut_usectime() in .../innobase/ut/ut0ut.c, mutex_spin_wait() in .../innobase/sync/sync0sync.c. Since the performance-schema plan would require "getting the time" thousands, perhaps many thousands, of times per second, I suggest that the overhead of gettimeofday, compared to rdtsc, is unacceptable. This suggestion is testable. Tests ----- I attach four files to this task: rdtsc3.c, configure.in, rdtsc3_notes.txt, my_timer_cycles.il. (Marc Alff took whatever code was necessary and put it in the mysql-6.0-perf tree so the attachments are only of importance for archivists.) Instructions (how-to-compile and how-to-interpret) are in rdtsc3.c comments. Using rdtsc3 I timed a procedure which contained RDTSC itself, and all the alternatives (gettimeofday, clock_gettime, gethrtime, ftime, etc), and all the assembly-level non-x86 equivalents (itc, mftb, etc.). I ran rdtsc3 on all the available MySQL machines listed in the "development machines" wiki, [internal mysql address]/wiki/DevelopmentMachines. Most tests are from 2006. The complete results are in rdtsc3_notes.txt. For illustration here are the lines for just one machine: ROUTINES -------- Platform Cycles Nanoseconds Microseconds Millis Ticks -------- ----------- ----------- ------------ ------ ----- production asm_x86 clock_gettime*! gettimeofday ftime times "*" means: clock_gettime resolution is actually microseconds "!" means: clock_gettime required linking of librt.so ... OVERHEAD -------- Platform Cycles Nanoseconds Microseconds Milliseconds Ticks -------- ------ ----------- ------------ ------------ ----- production 88 4116 3888 4092 2044 The lines mean: "on production.mysql.com the cycle counter is in assembly-language for x86, the nanoseconds can be requested with clock_gettime() but the result is really microseconds; the microseconds are done with gettimeofday(); the milliseconds are done with ftime(); the ticks are done with time(); the overhead for the cycle counter i.e. RDTSC is 88 cycles; the overhead for the microseconds counter i.e. gettimeofday() is 3888 cycles." On some platforms (e.g. sunfire280 compiled without -m64) cycle counter and gettimeofday() are nearly the same; on most machines (e.g. sol10-x86, win2003-x, ita2) cycle counter is from 5 to 20 times faster; on some machines (e.g. win2003-amd64, production) RDTSC is more than 40 times faster. Bad things about RDTSC ---------------------- - RDTSC doesn't "serialize". That is, if there is out-of-order execution, rdtsc might be processed after an instruction that logically follows it. (We can force serialization, but we won't bother.) See: "Q&A: RDTSC to measure performance of small # of FP calculations" http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30226599.aspx This flaw is unimportant since we are trying to measure events that take much longer times. - It is possible to set a flag which renders RDTSC inoperative. Somebody responsible for the kernel of the operating system would have to make this decision. For Windows and Linux, there's no such problem (although CONFIG_X86_TSC_DISABLE exists). - With a multi-processor arrangement, it's possible to get the cycle count from one processor in thread X, and the cycle count from another processor in thread Y. They may not always be in synch. Each processor might have a different TSC value. This is especially noted for AMD multi-socket (as opposed to multi-core) systems. See: "AMD TSC Drift Solutions in Red Hat Enterprise Linux" http://developer.amd.com/article_print.jsp?id=92 "Future TSC Directions and Solutions" http://ltt.polymtl.ca/svn/ltt/branches/poly/doc/developer/tsc.txt "RDTSCP" http://developer.amd.com/articles.jsp?id=92&num=5 "tsc timer related problems/questions" http://kerneltrap.org/mailarchive/linux-kernel/2007/9/9/191506 But "Intel systems are normally all synchronized". See: "linux/arch/i386/kernel/tsc.c" http://lxr.linux.no/linux/arch/i386/kernel/tsc.c#L329 Or the operating system may synchronize TSCs. For example "normally, Windows synchronizes the time stamp counters on all processors" (in special circumstances) (not Windows Server). On Linux, though, the apparent tendency is to check for synchronization but not force it. See: "Measure Code Sections Using The Enhanced Timer" http://softwarecommunity.intel.com/articles/eng/2589.htm "x86: unify/rewrite SMP TSC sync code" http://lwn.net/Articles/211051/ "Hardware Support and Directions for Windows Server" http://download.microsoft.com/download/0/0/b/00bba048-35e6-4e5b-a3dc-36da83cbb0d1/ServerDirections.docx Synchronizing may cause a counter to go backwards. - Converting cycles to elapsed time is only reliable if a CPU always has the same cycle rate. That's not true on a laptop, which might change speed to save power. And it's not true with high-performance chips which might gear down if heating becomes dangerous. (Notice that elsewhere I count this as an argument in favour of RDTSC because such computers generally have slow gettimeofday().) Variability does not exist on some recent processors. See: "Intel secretly changes the rules" http://www.x86-secret.com/?option=newsd&nid=846 "TSC and Power Management Events on AMD Processors" http://lkml.org/lkml/2005/11/4/173 Microsoft describes the flaw as "not common". See: "SQL Server timing values may be incorrect when you use utilities or technologies that change CPU frequencies" http://support.microsoft.com/kb/931279 - The implementor will have to write code for all the processors that MySQL fully supports. I have already done this, read the comments in the attached file rdtsc3.c. But in m attempt to be cautious I left a few gaps in the coverage. So a cycle counter won't be a wonderful solution for all timing situations. However, the defects are acceptable for WL#2360. Other DBMSs ----------- Description of other implementations can be found by clicking 'Progress' and looking at what was deleted from this section on 2009-04-13. What goes in and out -------------------- Here in the high level description, it's inappropriate to state procedure names, state whether one should have a procedure or a macro, etc. -- such things are up to the WL#2360 implementor. However, the functionality will have to include: "Initialize". Determine if a high-resolution timer is known for the computer (if not, set a flag and continue). Determine the cycle rate of the computer by timing a function with both cycles and gettimeofday. Also calculate what the wall-clock time is that corresponds to the current (at initialization time) cycle count. This is done. "Get Cycles". This is RDTSC, though we might want a better name. Although getticks() is a good name, I worry about confusion with the Windows GetTickCounter() function. TSC means "time stamp counter" so we could use the term "get counter" but that is Intel-specific. This is done. "Get Nanoseconds / Microseconds / Milliseconds / Ticks / Seconds". It's okay to provide a choice based on something other than cycle count. Besides, some people insist on it. Not all of the options are available on all platforms, and not all of the options work well, but we can test so that an appropriate choice can be made at run time. This is done. "Get Time". Given a cycle count C, calculate what wall-clock time it represents. To do so, subtract initialization-time cycle count from C, Get Elapsed Seconds from that, and add the seconds to the wall-clock time as of initialization. Here is a trick. Although RDTSC returns a 64-bit value EDX:EAX, it is not necessary to store EDX, or use 64-bit calculations, unless you think the elapsed time will be more than 2**32 cycles (more than a second on current processors). Intel makes this assumption for some of its examples, and therefore uses 32-bit arithmetic. But we won't use this trick, because I'm sure people wouldn't remember when it is okay to use 32-bit. Definitely we need all 64 bits sometimes. This task is not a proposal to replace the already-existing function my_getsystime() in mysys/my_getsystime.c. We need more precise wall-clock time, as noted in WL#1338 "Change timers in MySQL to have better resolution". But the tests show that a cycle counter is sometimes much faster, and of course more precise, and the writer (for WL#2360) does not need to find the time of day. Extracts from comp.arch ----------------------- Andi Kleen of SuSE Labs made some remarks on forum comp.arch on 2003-05-26 in thread " Re: lmbench on x86 and RDTSC". I quote, but with some editing. " [The x-86-64 Linux version] calls gettimeofday() which is implemented using RDTSC. The x86-64 linux kernel uses a special vsyscall to implement gettimeofday() in user space without a system call, so it should be nearly as fast as if used on your own in a subfunction. The Opteron, using RDTSC directly, can be very misleading because the CPU is able to execute it speculatively mixed with your benchmark code (it's the other way round from the P4 where it takes 60+ cycles and stalls everything; Opteron is also much faster on this [about 6 cycles]). The gettimeofday implementation avoids the speculative-execution problem by stalling the pipeline explicitly using a synchronizing instruction. Of course this slows it down a bit again, but it is the only way to get reliable results. " That would explain why our tests show that gettimeofday isn't so bad with 64-bit Linux. References ---------- References about other implementations and internal emails can be found by clicking 'Progress' and looking at what was deleted from this section on 2009-04-13. Some results with performance schema ------------------------------------ This section is FYI. It shows what various people saw with select * from performance_schema.performance_timers; for machines which are outside the MySQL build environment and therefore are not listed in the rdtsc3_notes.txt file attached to this document. Feel free to add your own if it's interesting and not a near-duplicate. Paul DuBois, MacBook Pro, x86, from email in November 2009 +-------------+-----------------+------------------+----------------+ | TIMER_NAME | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD | +-------------+-----------------+------------------+----------------+ | CYCLE | 2390925373 | 1 | 48 | | NANOSECOND | 1000000000 | 1 | 120 | | MICROSECOND | 1000000 | 1 | 144 | | MILLISECOND | 1042 | 1 | 156 | | TICK | 102 | 1 | 3660 | +-------------+-----------------+------------------+----------------+
The code for the timers is inside the program that was used to test timers for Phase 1. The program works with gcc or icc on Unix/Linux platforms, and Windows (Symantec C or icc or cl). It works with CC (Sun Studio C++) on Solaris/Linux platforms. It's standalone. Just remove main() to see what would be added in the server, with a few #define adjustments (probably in mysys/my_rdtsc.c). If you just run the program, it shows what system routine it would use, and what the overhead would be, for timers for cycles, nanoseconds, microseconds, milliseconds, and ticks. In the attached file rdtsc3_notes.txt, I report what the program displayed for all easily-accessible Uppsala machines. All interesting files -- rdtsc3.c, rdtsc3_notes.txt, configure.in, my_timer_cycle.il -- are available as file attachments on this worklog task. Update May 20 2008 ------------------ To avoid confusion caused by having rdtsc3.c and configure.in in two different places, I deleted the copy in the LLD. The definitive copies of rdtsc3.c and configure.in are in the file attachments. I also fixed a few spelling errors and changed the number of loop iterations in two routines from 10 to 20. Otherwise it's the same program that I wrote nearly three years ago. Changes on November 2 2008 -------------------------- This is the first significant set of changes since the original in August 2005. The changes are: 1. fix for AIX 5.2 2. fix for mftb 32-bit 3. fix for non-gcc + Solaris + x86 4. new code for 64-bit PPC 5. new code for Sun Studio, Sun Studio + SPARC, gcc + SPARC. I must draw special attention to the switches for SPARC and/or Sun Studio. For 32-bit Solaris + Sun Studio + SPARC: CC -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so or CC -xarch=v8plus -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so not CC -xarch=v8 -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so For 64-bit Solaris + Sun Studio + SPARC: CC -xarch=v9 -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/64/librt.so For 32-bit Solaris + gcc + SPARC: gcc -mcpu=v9 -m32 -o rdtsc3 rdtsc3.c /usr/lib/librt.so For 64-bit Solaris + gcc + SPARC: gcc -mcpu=v9 -m64 -o rdtsc3 rdtsc3.c /usr/lib/64/librt.so For 32-bit Linux + gcc + SPARC: gcc -mcpu=v9 -m32 -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so 64-bit Linux + gcc + SPARC: gcc -mcpu=v9 -m64 -O3 -o rdtsc3 rdtsc3.c /usr/lib/64/librt.so 32-bit Solaris + gcc + x86: gcc -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so 32-bit Solaris + Sun Studio + x86: CC -O3 -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so This should be generally okay, but there are two extra difficulties. 1. Whenever you use CC, i.e. Sun Studio C++, you have to put an assembler template file, ending with extension .il, on the command line. I've read that this isn't necessary with Sun Studio 12, but that doesn't help us, we have to be able to build with Sun Studio 11. Olav Sandstaa did an .il file for Falcon BUG#37622 /mysql-6.0/storage/falcon/CompareAndSwapSparc.il see http://lists.mysql.com/commits/50538?f=plain Olav has tested the timer code described here and it is okay. 2. I had a problem with gcc + Solaris + 32-bit + SPARC. The "%tick" register came in with SPARC v8+. But that's not the default setting for gcc. This fails: gcc -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so The only way to make it work is to act as if I'm sparcv9 + 32-bit: gcc -mcpu=v9 -m32 -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so Ordinarily MySQL does not use -mcpu=v9 when building 32-bit: http://dev.mysql.com/doc/refman/5.0/en/mysql-binaries.html I am not the first person who's ever had this sort of problem: http://www.nabble.com/-detail--sparc-v8plus-build-problem-td7944960.html So I'd like special permission to change the build flags. Or, we get no cycle counter with gcc + Solaris + 32-bit + SPARC. A reply is taking a long time, so we'll go ahead without -mcpu=v9. And I simply gave up on SPARC + gcc 2.95, which is getting old. For that combination there will never be a cycle counter. Changes in June 2009 -------------------- Four reviewers -- Olav Sandstaa, Vladislav Vaintroub, Kay Roepke, Mark Leith -- approved the code as included in MySQL server for WL#2360. Due to a query from a WL#2360 code reviewer, the configure.in file for building mysqld will not contain code for detecting librt.so. So on some machines the nanoseconds timers will be there for rdtsc3 tests, but won't be there for MySQL server. Changes in October 2009 ----------------------- Due to a request from a WL#2360 code reviewer, the structure of the my_timer_info now contains a structure. That is, now it's two-level instead of flat. Useful tools ------------ On Linux: cat /proc/cpuinfo cpufreq-set cpufreq-info cpufreq-selector
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.