WL#2373: Use cycle counter for timing

Affects: Server-5.5   —   Status: Complete

For WL#2360 Performance wait_events, we need to time        
procedures, e.g. waits for mutexes, frequently and        
unobtrusively.

Phase one: compare speed of RDTSC and gettimeofday() or
QueryPerformanceCounter.
Done.

Phase two: add cycle counters for non-x86 platforms.
Done.

Phase three: check that the test code does what the comments say,
check that the results are as stated in high-level specification.
Done.

Phase four: put cycle counter functions in server code.
Done. Phase four took place as part of WL#2360 work.

[ Note added by Peter Gulutzan 2009-11-02 ]
I have marked the code-review boxes "complete".
Reviewers were Vladislav Vaintroub, Olav Sandstaa, Kay Roepke.
Emails re review (and approvals of a change in late October)
are in lists.mysql.com/commits starting here:
http://lists.mysql.com/commits/74177


For WL#2360 Performance Schema, we need to time            
procedures, e.g. waits for mutexes, frequently and            
unobtrusively. After testing, I see that gettimeofday  
would be preferable on a few platforms, but a choice
of a low-level cycle counter is usually best.                

A cycle counter commonly returns the number of cycles that have            
gone by since the computer was last reset. Thus, on a            
1GHz Pentium which has been on for one second, the value            
of the cycle counter is 1 billion. With the assembler            
instruction RDTSC it's possible to get this value into            
registers. An example with gcc inline assembly:            
            
#include             
...            
#define RDTSC(val) \            
        __asm__ __volatile__ ("rdtsc" : "=A" (val))            
...            
long long time1;            
long long time2;            
...            
  RDTSC(time1); test1(); RDTSC(time2);            
  printf("speed: %lld\n",time2-time1);            
      
Other proposals      
---------------      
      
In the original version of this task, I suggested:      
"Others may be aware of something better [than a cycle      
counter], we'll see.". Some people noted that we      
already have my_getsystime(), gettimeofday(), clock_gettime(),      
and other variants of wall-clock-time functions.      
      
InnoDB already uses gettimeofday for mutex timing, see           
ut_usectime() in .../innobase/ut/ut0ut.c,           
mutex_spin_wait() in .../innobase/sync/sync0sync.c.       
      
Since the performance-schema plan would require "getting the time"      
thousands, perhaps many thousands, of times per second, I suggest
that the overhead of gettimeofday, compared to rdtsc, is unacceptable.
This suggestion is testable.      
 
Tests
-----

I attach four files to this task: rdtsc3.c, configure.in, rdtsc3_notes.txt,
my_timer_cycles.il. (Marc Alff took whatever code was necessary and put it
in the mysql-6.0-perf tree so the attachments are only of importance for
archivists.)

Instructions (how-to-compile and how-to-interpret) are in rdtsc3.c comments.
Using rdtsc3 I timed a procedure which contained RDTSC itself, and all the
alternatives (gettimeofday, clock_gettime, gethrtime, ftime, etc), and all
the assembly-level non-x86 equivalents (itc, mftb, etc.). I ran rdtsc3 on
all the available MySQL machines listed in the "development machines" wiki,
[internal mysql address]/wiki/DevelopmentMachines. Most tests are from 2006.
The complete results are in rdtsc3_notes.txt. For illustration here are
the lines for just one machine:
  ROUTINES
  --------
  Platform   Cycles           Nanoseconds      Microseconds Millis Ticks
  --------   -----------      -----------      ------------ ------ -----
  production asm_x86          clock_gettime*!  gettimeofday ftime  times
  "*" means: clock_gettime resolution is actually microseconds
  "!" means: clock_gettime required linking of librt.so
  ...
  OVERHEAD
  --------
  Platform   Cycles Nanoseconds Microseconds Milliseconds Ticks
  --------   ------ ----------- ------------ ------------ -----
  production     88        4116         3888         4092  2044
The lines mean: "on production.mysql.com the cycle counter is in
assembly-language for x86, the nanoseconds can be requested with
clock_gettime() but the result is really microseconds; the microseconds
are done with gettimeofday(); the milliseconds are done with ftime();
the ticks are done with time(); the overhead for the cycle counter i.e.
RDTSC is 88 cycles; the overhead for the microseconds counter i.e.
gettimeofday() is 3888 cycles."

On some platforms (e.g. sunfire280 compiled without -m64) cycle counter and
gettimeofday() are nearly the same; on most machines (e.g. sol10-x86,
win2003-x, ita2) cycle counter is from 5 to 20 times faster;  on some
machines (e.g. win2003-amd64, production) RDTSC is more than 40 times faster.

Bad things about RDTSC      
----------------------      
            
- RDTSC doesn't "serialize". That is, if there is            
out-of-order execution, rdtsc might be processed            
after an instruction that logically follows it.            
(We can force serialization, but we won't bother.)   
See:
"Q&A: RDTSC to measure performance of small # of FP calculations"
http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30226599.aspx
This flaw is unimportant since we are trying to measure events
that take much longer times.
      
- It is possible to set a flag which renders RDTSC            
inoperative. Somebody responsible for the kernel            
of the operating system would have to make this            
decision. For Windows and Linux, there's no such            
problem (although CONFIG_X86_TSC_DISABLE exists).
      
- With a multi-processor arrangement, it's possible            
to get the cycle count from one processor in            
thread X, and the cycle count from another processor            
in thread Y. They may not always be in synch.            
Each processor might have a different TSC value. This is
especially noted for AMD multi-socket (as opposed to multi-core)
systems. See:
  "AMD TSC Drift Solutions in Red Hat Enterprise Linux"
  http://developer.amd.com/article_print.jsp?id=92
  "Future TSC Directions and Solutions"
  http://ltt.polymtl.ca/svn/ltt/branches/poly/doc/developer/tsc.txt
  "RDTSCP"
  http://developer.amd.com/articles.jsp?id=92&num=5
  "tsc timer related problems/questions"
  http://kerneltrap.org/mailarchive/linux-kernel/2007/9/9/191506
But "Intel systems are normally all synchronized". See:
  "linux/arch/i386/kernel/tsc.c"
  http://lxr.linux.no/linux/arch/i386/kernel/tsc.c#L329
Or the operating system may synchronize TSCs. For example
"normally, Windows synchronizes the time stamp counters on
all processors" (in special circumstances) (not Windows
Server). On Linux, though, the apparent tendency is to
check for synchronization but not force it. See:
  "Measure Code Sections Using The Enhanced Timer"
  http://softwarecommunity.intel.com/articles/eng/2589.htm
  "x86: unify/rewrite SMP TSC sync code"
  http://lwn.net/Articles/211051/
  "Hardware Support and Directions for Windows Server"
 
http://download.microsoft.com/download/0/0/b/00bba048-35e6-4e5b-a3dc-36da83cbb0d1/ServerDirections.docx

Synchronizing may cause a counter to go backwards.
      
- Converting cycles to elapsed time is only reliable            
if a CPU always has the same cycle rate. That's not            
true on a laptop, which might change speed to save            
power. And it's not true with high-performance chips      
which might gear down if heating becomes dangerous.      
(Notice that elsewhere I count this as an argument in      
favour of RDTSC because such computers generally have      
slow gettimeofday().)
Variability does not exist on some recent processors. See:
  "Intel secretly changes the rules"
  http://www.x86-secret.com/?option=newsd&nid=846
  "TSC and Power Management Events on AMD Processors"
  http://lkml.org/lkml/2005/11/4/173
Microsoft describes the flaw as "not common". See:
  "SQL Server timing values may be incorrect when you use utilities
  or technologies that change CPU frequencies"
  http://support.microsoft.com/kb/931279

- The implementor will have to write code for all the      
  processors that MySQL fully supports. I have already
  done this, read the comments in the attached file rdtsc3.c.
  But in m attempt to be cautious I left a few gaps in the coverage.
  
So a cycle counter won't be a wonderful solution for all timing            
situations. However, the defects are acceptable for WL#2360.      

Other DBMSs
-----------

Description of other implementations can be found by clicking 'Progress'
and looking at what was deleted from this section on 2009-04-13.

What goes in and out      
--------------------      
      
Here in the high level description, it's inappropriate to state      
procedure names, state whether one should have a procedure or a      
macro, etc. -- such things are up to the WL#2360 implementor.      
However, the functionality will have to include:      
      
"Initialize". Determine if a high-resolution timer is            
known for the computer (if not, set a flag and continue).            
Determine the cycle rate of the computer by timing a  
function with both cycles and gettimeofday. Also calculate            
what the wall-clock time is that corresponds to the current            
(at initialization time) cycle count. This is done.            
            
"Get Cycles". This is RDTSC, though we might want a better            
name. Although getticks() is a good name, I worry about confusion            
with the Windows GetTickCounter() function. TSC means "time         
stamp counter" so we could use the term "get counter" but         
that is Intel-specific. This is done.  
            
"Get Nanoseconds / Microseconds / Milliseconds / Ticks  
/ Seconds". It's okay to provide a choice based on  
something other than cycle count. Besides, some people  
insist on it. Not all of the options are available on  
all platforms, and not all of the options work well,  
but we can test so that an appropriate choice can be  
made at run time. This is done.  
            
"Get Time". Given a cycle count C, calculate what wall-clock            
time it represents. To do so, subtract initialization-time            
cycle count from C, Get Elapsed Seconds from that, and add            
the seconds to the wall-clock time as of initialization.            
            
Here is a trick. Although RDTSC returns a 64-bit value            
EDX:EAX, it is not necessary to store EDX, or use 64-bit            
calculations, unless you think the elapsed time will be            
more than 2**32 cycles (more than a second on current processors).            
Intel makes this assumption for some of its examples, and            
therefore uses 32-bit arithmetic. But we won't use this            
trick, because I'm sure people wouldn't remember when it            
is okay to use 32-bit. Definitely we need all 64 bits sometimes.            

This task is not a proposal to replace the already-existing  
function my_getsystime() in mysys/my_getsystime.c.
We need more precise wall-clock time, as noted in  
WL#1338 "Change timers in MySQL to have better resolution".  
But the tests show that a cycle counter is sometimes much faster,
and of course more precise, and  the writer (for WL#2360) does
not need to find the time of day.  
                     
Extracts from comp.arch        
-----------------------        
        
Andi Kleen of SuSE Labs made some remarks on forum comp.arch on        
2003-05-26 in thread " Re: lmbench on x86 and RDTSC". I quote,        
but with some editing.        
        
"        
[The x-86-64 Linux version] calls gettimeofday() which is implemented        
using RDTSC. The x86-64 linux kernel uses a special vsyscall to        
implement gettimeofday() in user space without a system call, so it        
should be nearly as fast as if used on your own in a subfunction.        
        
The Opteron, using RDTSC directly, can be very misleading        
because the CPU is able to execute it speculatively mixed with your        
benchmark code (it's the other way round from the P4 where it takes        
60+ cycles and stalls everything; Opteron is also much faster on        
this [about 6 cycles]).  The gettimeofday implementation avoids        
the speculative-execution problem by stalling the pipeline explicitly        
using a synchronizing instruction. Of course this slows it down a bit        
again, but it is the only way to get reliable results.        
"        
  
That would explain why our tests show that gettimeofday  
isn't so bad with 64-bit Linux.  
  
References            
----------            
            
References about other implementations and internal emails can be
found by clicking 'Progress' and looking at what was deleted from
this section on 2009-04-13.

Some results with performance schema
------------------------------------

This section is FYI. It shows what various people saw with
select * from performance_schema.performance_timers;
for machines which are outside the MySQL build environment
and therefore are not listed in the rdtsc3_notes.txt file
attached to this document. Feel free to add your own if it's
interesting and not a near-duplicate.

Paul DuBois, MacBook Pro, x86, from email in November 2009
+-------------+-----------------+------------------+----------------+
| TIMER_NAME  | TIMER_FREQUENCY | TIMER_RESOLUTION | TIMER_OVERHEAD |
+-------------+-----------------+------------------+----------------+
| CYCLE       |      2390925373 |                1 |             48 |
| NANOSECOND  |      1000000000 |                1 |            120 |
| MICROSECOND |         1000000 |                1 |            144 |
| MILLISECOND |            1042 |                1 |            156 |
| TICK        |             102 |                1 |           3660 |
+-------------+-----------------+------------------+----------------+


The code for the timers is inside the program that  
was used to test timers for Phase 1.   
 
The program works with gcc or     
icc on Unix/Linux platforms, and Windows (Symantec     
C or icc or cl). It works with CC (Sun Studio C++)
on Solaris/Linux platforms. It's standalone.   
Just remove main() to see what would be added in        
the server, with a few #define adjustments        
(probably in mysys/my_rdtsc.c).         
        
If you just run the program, it shows what system     
routine it would use, and what the overhead would     
be, for timers for cycles, nanoseconds, microseconds,     
milliseconds, and ticks.     
   
In the attached file rdtsc3_notes.txt, I report what the    
program displayed for all easily-accessible Uppsala    
machines. All interesting files -- rdtsc3.c,   
rdtsc3_notes.txt, configure.in, my_timer_cycle.il --
are available as file attachments on this worklog task. 

Update May 20 2008
------------------

To avoid confusion caused by having
rdtsc3.c and configure.in in two different places, I
deleted the copy in the LLD. The definitive copies of
rdtsc3.c and configure.in are in the file attachments.
I also fixed a few spelling errors and changed the
number of loop iterations in two routines from 10 to 20.
Otherwise it's the same program that I wrote nearly
three years ago.
 
Changes on November 2 2008
--------------------------

This is the first significant set of changes since the original in August 2005.
The changes are:
1. fix for AIX 5.2
2. fix for mftb 32-bit
3. fix for non-gcc + Solaris + x86
4. new code for 64-bit PPC
5. new code for Sun Studio, Sun Studio + SPARC, gcc + SPARC.

I must draw special attention to the switches for SPARC and/or Sun Studio.
For 32-bit Solaris + Sun Studio + SPARC:
  CC -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so
  or
  CC -xarch=v8plus -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so
  not
  CC -xarch=v8     -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so
For 64-bit Solaris + Sun Studio + SPARC:
  CC -xarch=v9 -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/64/librt.so
For 32-bit Solaris + gcc + SPARC:
  gcc -mcpu=v9 -m32 -o rdtsc3 rdtsc3.c /usr/lib/librt.so
For 64-bit Solaris + gcc + SPARC:
  gcc -mcpu=v9 -m64 -o rdtsc3 rdtsc3.c /usr/lib/64/librt.so
For 32-bit Linux + gcc + SPARC:
  gcc -mcpu=v9 -m32 -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so
64-bit Linux + gcc + SPARC:
  gcc -mcpu=v9 -m64 -O3 -o rdtsc3 rdtsc3.c /usr/lib/64/librt.so
32-bit Solaris + gcc + x86:
  gcc -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so
32-bit Solaris + Sun Studio + x86:
  CC -O3 -o rdtsc3 rdtsc3.c my_timer_cycles.il /usr/lib/librt.so

This should be generally okay, but there are two extra difficulties.
1.
Whenever you use CC, i.e. Sun Studio C++, you have to put an
assembler template file, ending with extension .il, on the command line.
I've read that this isn't necessary with Sun Studio 12, but that doesn't
help us, we have to be able to build with Sun Studio 11.
Olav Sandstaa did an .il file for Falcon BUG#37622
/mysql-6.0/storage/falcon/CompareAndSwapSparc.il
see http://lists.mysql.com/commits/50538?f=plain
Olav has tested the timer code described here and it is okay.


2. 
I had a problem with gcc + Solaris + 32-bit + SPARC.
The "%tick" register came in with SPARC v8+.
But that's not the default setting for gcc. This fails:
gcc -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so
The only way to make it work is to act as if I'm sparcv9 + 32-bit:
gcc -mcpu=v9 -m32 -O3 -o rdtsc3 rdtsc3.c /usr/lib/librt.so
Ordinarily MySQL does not use -mcpu=v9 when building 32-bit:
http://dev.mysql.com/doc/refman/5.0/en/mysql-binaries.html
I am not the first person who's ever had this sort of problem:
http://www.nabble.com/-detail--sparc-v8plus-build-problem-td7944960.html
So I'd like special permission to change the build flags.
Or, we get no cycle counter with gcc + Solaris + 32-bit + SPARC.
A reply is taking a long time, so we'll go ahead without -mcpu=v9.

And I simply gave up on SPARC + gcc 2.95, which is getting old.
For that combination there will never be a cycle counter.

Changes in June 2009
--------------------

Four reviewers -- Olav Sandstaa, Vladislav Vaintroub, Kay Roepke,
Mark Leith -- approved the code as included in MySQL server for
WL#2360.

Due to a query from a WL#2360 code reviewer, the configure.in
file for building mysqld will not contain code for detecting
librt.so. So on some machines the nanoseconds timers will be
there for rdtsc3 tests, but won't be there for MySQL server.

Changes in October 2009
-----------------------

Due to a request from a WL#2360 code reviewer, the structure
of the my_timer_info now contains a structure. That is, now
it's two-level instead of flat.

Useful tools
------------

On Linux:
cat /proc/cpuinfo
cpufreq-set
cpufreq-info
cpufreq-selector