Skip navigation links
**Section Navigation** [Toggle]

MySQL Internals Manual :: 10 Important Algorithms and Structures :: 10.8 FLOAT and DOUBLE Data Type Representation

- 10 Important Algorithms and Structures
- 10.1 The Item Class
- 10.2 How MySQL Does Sorting (filesort)
- 10.3 Bulk Insert
- 10.4 How MySQL Does Caching
- 10.5 How MySQL Uses the Join Buffer Cache
- 10.6 How MySQL Handles FLUSH TABLES
- 10.7 Full-Text Search
- 10.8 FLOAT and DOUBLE Data Type Representation
- 10.9 Date and Time Data Type Representation
- 10.10 Threads
- 10.11 Character Sets and Collations
- 10.12 Error Flags and Functions
- 10.13 Functions in the mysys Library
- 10.14 Bitmaps

The MySQL Reference Manual has a discussion of floating-point numbers in Section 11.2 Numeric Types, including details about the storage. Let us now take up the story from where the MySQL Reference Manual leaves off.

The following discussion concentrates on the case where no display
width and decimals are given. This means that
`FLOAT`

is stored as whatever the C type
`float`

is and `REAL`

or
`DOUBLE [PRECISION]`

is stored as whatever the C
type `double`

is. The field length is selected by
the MySQL code.

This document was created when http://bugs.mysql.com/4457 (Different results in SQL-Statements for the same record) was fixed at the end of August 2004. Until then there was some confusion in the double-to-string conversion at different places in the code.

The bugfix for http://bugs.mysql.com/4937
(```
INSERT + SELECT + UNION ALL + DATE to
VARCHAR(8)
```

conversion problem) produced a conversion
function which was a promising approach to the conversion
problems. Unfortunately it was only used for direct field
conversions and not for function results etc. It did not take
small numbers (absolute value less than 1) and negative numbers
into account. It did not take the limited precision of
`float`

and `double`

data types
into account. The bugfix was developed in two steps: The first
attempt looked like this (in principle):

length= sprintf(buff, "%.*g", field_length, nr); if (length > field_length) length= sprintf(buff, "%.*g", field_length-5, nr);

If the `libc`

conversion produces too many
characters, the precision is reduced by the space required for the
scientific notation (1.234e+05). Thus the
`printf()`

conversion is forced to switch to the
scientific notation, since the value would not fit otherwise. Or,
if it was scientific already, the precision is reduced and also
uses less space. I left out some important stuff around limit
checking just to show the idea. This simple algorithm should work
quite well in most cases, but has been discarded for the sake of
performance. The double call to the slow
`printf()`

conversion `%g`

didn't seem reasonable, though it would only be used for extreme
values and small fields. During my explorations of the code I
didn't find places where `float`

or
`double`

were to be converted into small fields.
Remember that I talk only of conversions where field length and
precision are not given. In this case a sufficient field length is
selected at several places, except for a bug where it was selected
wrongly. If a field length is given, a different conversion is
used anyway. But since the code is quite complex, I don't claim to
grasp it in full, and therefore may be in error. So let us look
further:

The second attempt to fix the bug looked like this:

bool use_scientific_notation=TRUE; if (field_length < 32 && nr > 1) { double e[]={1, 1e1, 1e2, 1e4, 1e8, 1e16 }, p=1; for (int i=sizeof(e), j=1<<i-- ; j; i--, j>>=1 ) { if (field_length & j) p*=e[i]; } use_scientific_notation=(p < nr); } length= sprintf(buff, "%.*g", use_scientific_notation ? field_length-5 : field_length, nr);

Here we evaluate if the string representation of a given number fits into field_length characters. If not, we reduce the precision to make it fit. Again, I left out important details. For example, the evaluation is done only once per field for the sake of performance. The downside here is the unconditional reduction of precision for field length > 31 (which doesn't really matter), for negative numbers and for small numbers (absolute value less than 1).

Both algorithms do not take the limited precision of
`float`

and `double`

values into
account. This could lead to conversions with ridiculous bogus
precision output. For example a value of 0.7 converted with
`%.30g`

will give a lot of digits, which pretend
to tell about deviations from the value 0.7 and are completely
absurd: 0.699999988079071044921875. To understand more about the
`%g`

conversion, I quote from a comment
introduced in the source at the beginning of bugfixing #4937 (this
comment was removed because it mainly describes, how the
`printf()`

conversion works, but I think it's
valuable enough to include it here):

/* Let's try to pretty print a floating point number. Here we use '%-*.*g' conversion string: '-' stands for right-padding with spaces, if such padding will take place '*' is a placeholder for the first argument, field_length, and signifies minimal width of result string. If result is less than field length it will be space-padded. Note, however, that we'll not pass spaces to Field_string::store(const char *, ...), due to strcend in the next line. '.*' is a placeholder for DBL_DIG and defines maximum number of significant digits in the result string. DBL_DIG is a hardware specific C define for maximum number of decimal digits of a floating point number, such that rounding to hardware floating point representation and back to decimal will not lead to loss of precision. That is: if DBL_DIG is 15, number 123456789111315 can be represented as double without precision loss. As one can judge from this description, choosing DBL_DIG here is questionable, especially because it introduces a system dependency. 'g' means that conversion will use [-]ddd.ddd (conventional) style, and fall back to [-]d.ddde[+|i]ddd (scientific) style if there is not enough space for all digits. Maximum length of result string (not counting spaces) is (I guess) DBL_DIG + 8, where 8 is 1 for sign, 1 for decimal point, 1 for exponent sign, 1 for exponent, and 4 for exponent value. XXX: why do we use space-padding and trim spaces in the next line? */ sprintf(to,"%-*.*g",(int) field_length,DBL_DIG,nr); to=strcend(to,' ');

There is one small misapprehension in the comment.
`%g`

does not switch to scientific notation when
there is 'not enough space for all digits'. As the commentator
says, the field length gives the minimal output length.
`printf()`

happily outputs more characters if
required to produce a result with 'precision' digits. In fact it
switches to scientific when the value can no longer be represented
by 'precision' digits in conventional notation. The man page says
"Style e is used if the exponent from its conversion is less than
-4 or greater than or equal to the precision." In explanation, a
precision of 3 digits can print a value of 345 in conventional
notation, but 3456 needs scientific notation, as it would require
4 digits (a precision of 4) in conventional notation. Thus, it is
printed as 3.46e+03 (rounded).

Since we don't want spaces in the output, we should not give a
field length, but always use `"%.*g"`

. However,
the precision matters, as seen above. It is worth its own
paragraph.

Since MySQL uses the machine-dependent binary representation of
`float`

and `double`

to store
values in the database, we have to care about these. Today, most
systems use the IEEE standard 754 for binary floating-point
arithmetic. It describes a representation for single precision
numbers as 1 bit for sign, 8 bits for biased exponent and 23 bits
for fraction and for double precision numbers as 1-bit sign,
11-bit biased exponent and 52-bit fraction. However, we can not
rely on the fact that every system uses this representation.
Luckily, the ISO C standard requires the standard C library to
have a header `float.h`

that describes some
details of the floating point representation on a machine. The
comment above describes the value `DBL_DIG`

.
There is an equivalent value `FLT_DIG`

for the C
data type `float`

.

So, whenever we print a floating-point value, we must not specify
a precision above `DBL_DIG`

or
`FLT_DIG`

respectively. Otherwise we produce a
bogus precision, which is wrong. For the honor of the writer of
the first attempt above, I must say that his complete algorithm
took `DBL_DIG`

into account, if however only for
the second call to `sprintf()`

. But
`FLT_DIG`

has never been accounted for. At the
conversion section of the code, it was not even known whether the
value came from a `float`

or
`double`

field.

My attempt to solve the problems tries to take all this into
account. I tried to concentrate all
`float`

/`double`

-to-string
conversions in one function, and to bring the knowledge about
`float`

versus `double`

to this
function wherever it is called. This solution managed to keep the
test suite happy while solving the new problem of
[http://bugs.mysql.com/4457
Bug#4457]. Luckily the first problem was not big, as the
test cases have been very carefully selected, so that they succeed
as long as the machine uses IEEE 754.

Nevertheless, the function is still not perfect. It is not possible to guess how many significant digits a number has. Given that, it is not simple to tell how long the resulting string would be. This applies to numbers with an absolute value smaller then 1. There are probably ways to figure this out, but I doubt that we would win in terms of performance over the simple solution of the first attempt, and besides we might cause new bugs. The compromise taken here is to accept that the resulting string may exceed the destination field length by five characters in the worst case.

if (nr < 0.0) { abs_nr= -nr; extra_space= 1; } else { abs_nr= nr; extra_space= 0; } precision= is_float ? FLT_DIG : DBL_DIG; if (precision > field_length) precision= field_length; if (! initialized) { /* Better switch to scientific too early than too late. */ double mult; mult= 1e0; for (length= 0; length < DBL_DIG; length++) mult/= 1e1; mult= 1e1 - mult; double val; val= 1.0; for (int idx= 0; idx < DBL_DIG+1; idx++) { DBUG_PRINT("info",("double_to_string_conv: big[%d] %.*g", idx, DBL_DIG+3, val)); big_number[idx]= val; val*= mult; } small_number[0]= 1e0; small_number[1]= 1e0; small_number[2]= 1e0; small_number[3]= 1e-1; small_number[4]= 1e-2; small_number[5]= 1e-3; small_number[6]= 1e-4; /* %g switches to scientific when exponent < -4. */ for (int idx= 7; idx < DBL_DIG+1; idx++) small_number[idx]= 1e-4; initialized= TRUE; } use_scientific_notation= (abs_nr != 0.0) && ((abs_nr > big_number[precision]) || (abs_nr < small_number[precision])); if (use_scientific_notation) { if (((nr >= 0.0) && ((nr >= 1e+100) || (nr <= 1e-100))) || ((nr < 0.0) && ((nr <= -1e+100) || (nr >= -1e-100)))) extra_space+= 6; /* .e+100 or .e-100 */ else extra_space+= 5; /* .e+99 or .e-99 */ } if (field_length < extra_space) precision= 0; else if (precision > (field_length - extra_space)) precision= field_length - extra_space; length= sprintf(buff, "%.*g", precision, nr);

This solution takes performance into account by initializing the
limiting numbers arrays only once into static space. It copes with
negative numbers and tries to decide even over small numbers. The
latter has only small implications, as the prefix 0.000 is exactly
the same size as the postfix e-100. But knowing if scientific
notation will be selected by `sprintf()`

allows
for saving one digit when the exponent is larger than -100.

The calculations for the big number array are less precise than in
the second attempt, but faster. The precision is sufficient for
the guess whether `sprintf()`

uses scientific
notation. There may be number to field length combinations which
exploit the gap, but these won't emerge anyway as I found no
situation where this function is called with small field lengths.
Remember again that it is not called with user-supplied field
lengths.

However in the current stable releases (including gamma) we have
some places where the field length is too small by one character.
Thus, the precision is sometimes one digit smaller than
`DBL_DIG`

would allow for. Consequently, we
cannot use the simple algorithm in the stable releases. There is a
chance of doing it in a development release, though.

**Addendum:**

There turned out to be a new solution to the "big number array"
problem. We have a statically initialized array
`log_10`

, which holds the necessary values. But I
did not check whether these values are safe. Even if computed by
the compiler, they could carry values slightly above the decimal
powers, which would be bad. In this case we needed to initialize
by 9.99999999e+xxx, where the number of nines is equal to
`DBL_DIG`

. This must be protected by ```
#if
DBL_DIG == yy
```

, so that a new `DBL_DIG`

on a new platform is detected. And the array is of limited length.
We must at least protect it by a
```
DBUG_ASSERT(sizeof(log_10)/sizeof(log_10[0]) >
DBL_DIG
```

).

But all of this is probably completely unnecessary, since we are
only speaking of cases where no user-supplied field length is
given. So MySQL selects the field length on its own. So it is
totally possible, indeed highly desirable, that MySQL selects a
field length, which allows for a maximum of precision for all
possible values. And these are `DBL_DIG+7`

or
`FLT_DIG+6`

respectively as far as IEEE 754 is
used. In this case we can have values of about +/-1e-307 to
+/-1e+308 for `double`

and +/-1e-37 to +/-1e+38
for `float`

. That is, for example
-1.<DBL_DIG-1 digits>e+100. For cases where a precision
above IEEE 754 is possible, we may need +8 instead. We can detect
this with `#if DBL_MAX_10_EXP >=`

So using a
field length of `DBL_DIG+8`

in all cases should
be sufficient for a simple ```
sprintf(buff, "%.*g", DBL_DIG,
nr)
```

or ```
sprintf(buff, "%.*g", FLT_DIG,
nr)
```

, respectively. To be safe, we should not use the
machine dependent constants everywhere, but instead concentrate
them into definitions like these:

#if (DBL_MAX_10_EXP > 9999) || (DBL_MIN_10_EXP < -9999) # error "Need new definition for UNSPECIFIED_DOUBLE_FIELD_LENGTH" #elif (DBL_MAX_10_EXP > 999) || (DBL_MIN_10_EXP < -999) # define UNSPECIFIED_DOUBLE_FIELD_LENGTH (DBL_DIG+8) #else # define UNSPECIFIED_DOUBLE_FIELD_LENGTH (DBL_DIG+7) #endif #if (FLT_MAX_10_EXP > 999) || (FLT_MIN_10_EXP < -999) #error "Need new definition for UNSPECIFIED_FLOAT_FIELD_LENGTH" #elif (FLT_MAX_10_EXP > 99) || (FLT_MIN_10_EXP < -99) # define UNSPECIFIED_FLOAT_FIELD_LENGTH (FLT_DIG+7) #else # define UNSPECIFIED_FLOAT_FIELD_LENGTH (FLT_DIG+6) #endif

These definitions should be used wherever an item or field of type
`float`

or `double`

without an
explicit field length specification is encountered. We have to
propagate these lengths though all derived items and fields and we
have to select the maximum of all field lengths wherever in two or
more of them are used in an expression or a function.

We need to treat the precision
(`DBL_DIG`

/`FLT_DIG`

) similarly,
but have to select the minimum in expressions or functions.

## User Comments

So - what this doesn't cover currently is how precise FLOAT and DOUBLE values are by default. If no (m,n) value is specified, what is the default m,n? Personally, I prefer to use DECIMAL(m,n) more because it isn't an estimate especially when dealing with financial information and not theory.

@Kevin

DOUBLE[(M,D)] [ZEROFILL] holds double-precision numeric values, pretty similar to FLOAT double-precision, except for its allowable range, which is -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308.

FLOAT[(M[,D])] [ZEROFILL] stores floating point numbers in the range of -3.402823466E+38 to -1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38. If precision isn't specified, or <= 24, it's SINGLE precision, otherwise FLOAT is DOUBLE precision. When specified alone, precision can range from 0 to 53. If the scale is defined, too, precision may be up to 255, scale up to 253.