WL#5051: make CSV engine's TIMESTAMP type behave like TIMESTAMP

Affects: Server-5.4 — Status: Assigned

Description
High Level Architecture
Low Level Design

We want to fix BUG#34612 "CSV storage engine does not honor time_zone".

A) The normal behaviour of a DATETIME column is that in a SELECT it returns the
date&time expressed in the time zone of INSERT (independently of the time zone
at the time of SELECT). Usually it is implemented by, when inserting, storing in
the column the date&time in the time zone of INSERT, and when selecting,
returning this unchanged.
The normal behaviour of TIMESTAMP is that in a SELECT it returns the date&time
expressed in the time zone of the SELECT. Which implies that the stored data
carries a notion of what time zone this stored data is in. This is usually
implemented by, when inserting, converting in UTC, storing that in the column,
and when selecting, converting from UTC to the time zone of SELECT.
In different words, DATETIME ignore timezones, TIMESTAMP don't and represent a
single fixed point in time all over the planet.

B) Short summary of the symptoms with CSV:
create table t1 (ts timestamp, dt datetime not null) engine=csv;
set time_zone='+00:00';
insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00');
set time_zone='+10:30';
insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00');
select * from t1;
ts                      dt
2003-03-30 02:30:00     2003-03-30 02:30:00
2003-03-30 02:30:00     2003-03-30 02:30:00
"dt" values are correct, but we rather expected a difference of 10:30 hours
between the values of "ts" in the two rows (because TIMESTAMP should be
displayed in the SELECT's time zone).
Thus the CSV engine treats TIMESTAMP columns as if they were DATETIME.

C) Consequences of the bug.
C1) this bug is also visible in the mysql.general_log CSV table:
"2009-06-13 17:43:55","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 17:44:08","root[root] @ localhost []",2,1,"Query","set
time_zone='Japan'"
"2009-06-13 22:44:09","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 17:44:35","root[root] @ localhost []",3,1,"Connect","root@localhost
on test"
We see that when a session changed its time zone to Japan, this changed the time
recorded in the log table. Then another session (using the default
Europe/Moscow) led to a previous time being logged. One would expect an always
increasing time in a log, which is how the file-based general log behaves by the
way.
C2) this bug implies that the value of time_zone at the moment when
CHECKSUM TABLE is run influences what CHECKSUM TABLE returns (see BUG#34612 for
testcase), for a CSV table.

D) the goal of this WL is to fix those bugs. IBMDB2i and FEDERATED are said to
have those bugs too. The assigned developer does not have access to IBMDB2i
testing and is not assigned the IBMDB2i bug (BUG#43343) so the assigned
developer will limit itself to fixing CSV and if possible FEDERATED, providing
helper functions for IBMDB2i and assisting IBMDB2i developers to use them.
Another unclarity is that the CSV bug is targetted for fixing in 5.4 but the
IBMDB2i one is for 5.1 (Triage committee has been questioned about this). Work
will start on 5.4 and once we have a prototype we will decide if we rebase it on
5.1.

------------------------------------------------------------------
Dmitri's arch review of this WL, 2009-10-26:

I have the following comments about HLS/LLD for this WorkLog:

1) IMO it makes sense to mention in HLS (or may in the original bug
   report if you think that it is a more appropriate place for such
   notes) that at least we have considered storing TIMESTAMP values
   in some fixed time zone other than UTC.
   For example, by internally storing datetime value for such column
   in server's default time zone we will be able to preserve compatible
   behavior for cases when only this time zone is used and there are
   users/scripts which directly access to contents of .CSV files.
   Unfortunately, for most of time zones (unlike for UTC) there are
   periods of time when simple datetime literal can correspond to two
   different moments of time (e.g. for 'Europe/Moscow' zone literal
   "2009-10-25 02:30:00" can correspond to "2009-10-24 22:30:00 UTC"
   and to ""2009-10-24 23:30:00 UTC"). So to be able to avoid
   information loss when values for such moments of time are stored
   in TIMESTAMP column through usage of DEFAULT NOW() clause or through
   switching to '00:00' time zone we will have to extend datetime
   literals with time zone/type of time abbreviations (e.g. 'MSK'
   and 'MSD' for 'Europe/Moscow').
   Since implementing support for handling of such extended datetime
   literals will require noticeable amount of work probably it is
   not practical to follow this approach at the moment.

2) I think HLS should be more affirmative or negative about
   --csv-timestamp-in-local-time option. IMO it is better to make
   decision whether we are going implement this option or not now
   rather than postponing it to the implementation phase.
   Also believe it is better to clarify what is going to happen
   when this option is enabled. I.e. does "local time" mean current
   time zone for connection or something else (like server's default
   time zone)?

3) Finally, I think that LLD should be a bit more elaborate about the
   way in which we are going convert time_t values which are stored
   for TIMESTAMP columns by Field_timestamp::store() method in buffer
   representing record to value to be stored in .CSV file (I assume
   it is going to be a datetime literal).
   What implementation of UTC we are going to use for this conversion?
   There are at least three options now:
   a) One which corresponds to UTC time zone described in system time
      zone tables (but what should happen if such tables are absent?)
   b) One which corresponds to '+00:00' time zone (but on systems which
      support leap seconds values produced by such conversion might
      differ from UTC timestamps produced by other applications).
   c) One implemented by Time_zone_utc class. This class needs to be
      extended first, as it does not implement backward conversion from
      broken-down datetime respresentation to time_t, which will be
      needed for reading data from .CSV tables.

   I am, personally, inclined towards using option a) by default and
   falling back to option c) if system tables are absent.

Otherwise I am OK with architecture of this WorkLog task and think that
it can be approved once the above issues are ironed out.

High-level design (extracted from BUG#34612):

3) The current behaviour of TIMESTAMP in CSV regarding what is stored,
is the normal semantics of DATETIME (i.e. date and time stored in the
timezone used at INSERT time, verbatim, and returned to reader
verbatim, independently of the reader's timezone). So users willing to
have this special behaviour can change their column to DATETIME.

4) The current behaviour of TIMESTAMP in CSV is unique to this engine and
FEDERATED (some say, I haven't tested) and IBMDB2i, and in contradiction with
the manual's definition of TIMESTAMP. Given that DATETIME exists, it sounds
unnecessary to keep the current behaviour of TIMESTAMP in CSV long-term.

5) Given previous points, my proposal is to change the *default* to
be: store TIMESTAMP in UTC in the CSV file, and return it to SELECT
after adjustment to the reader's timezone, like for other
engines. This should fix all bad cases reported in the bug report
(including CHECKSUM TABLE).

6) There remains upgrade problems and current use cases to tackle with
care, see following points.

7) A pre-upgrade departemental database might be limited to a single
timezone in practice, and users/scripts may be reading the table with
"cat" and expect the same content as SELECT (i.e. local time). After
upgrading, newly inserted TIMESTAMP values are stored in UTC in the
file, scripts breaks. The upgrade procedure to fix this is that for
each TIMESTAMP column where the old behaviour (SELECT==file) is
desired, run:
ALTER TABLE tbl MODIFY col DATETIME NOT NULL;
(this must be run on *pre-upgrade* server). And fixing the
application if it was relying on auto-setting of TIMESTAMP when
INSERT/UPDATE specifies a NULL value (DATETIME does not have auto-set
properties, so NOW() should be explicitely used when
inserting/updating DATETIME).
After that, the user has a DATETIME column with DATETIME semantics,
with correct time content, and SELECT==file.

8) For columns which are not upgraded like above, i.e. real correct
TIMESTAMP behaviour is desired, there is a different procedure. Say
local time is UTC+2. A pre-upgrade old CSV table contained TIMESTAMP
in local time (say 08AM). SELECTing it with a post-upgrade server will
show 10AM which is wrong (the file's content 08AM will be wrongly
interpreted as UTC).
For each column where the user wants to switch to real correct
TIMESTAMP behaviour, she or he should do:
ALTER TABLE tbl MODIFY col DATETIME NOT NULL;
(this must be run on *pre-upgrade* server)
and then
ALTER TABLE tbl MODIFY col TIMESTAMP;
(this must be run on *post-upgrade* server).
After that the user has a TIMESTAMP column with normal TIMESTAMP
semantics and correct time content.
Alternative which would also work:
 * run mysqldump on pre-upgrade server
 * in its output, replace
   SET TIME_ZONE='+00:00'
   with
   SET TIME_ZONE=the timezone used for that column (UTC+2)
   (or use "mysqldump --skip-tz-utc" and add the line above to the
   output's start).
 * pipe this into post-upgrade server.

9) If we don't want to force users to execute the upgrade procedures of
(7)(8) immediately, we can provide an option
--csv-timestamp-in-local-time which keeps the old behaviour, and
deprecate that option later.

Given that CSV and IBMDB2i have the same bug, and the IBMDB2i folks requested a
reusable solution, helper functions could be implemented, accessible to all
engines: they would be used by write_row()/update_row()/rnd_*() of impacted
engines and convert between a TIMESTAMP value (in the record buffer) and a
string in UTC. Then CSV/IBMDB2i would:
- in write_row()/update_row(), use one helper to build the UTC string and send
it to their storage.
- in rnd_*(), use the other helper to generate the TIMESTAMP value based on the
UTC string.