WL#5051: make CSV engine's TIMESTAMP type behave like TIMESTAMP

Affects: Server-5.4   —   Status: Assigned

We want to fix BUG#34612 "CSV storage engine does not honor time_zone".

A) The normal behaviour of a DATETIME column is that a SELECT returns the
date&time expressed in the time zone of the INSERT (independently of the time
zone in effect at SELECT time). This is usually implemented by storing, at
INSERT time, the date&time as given in the INSERT's time zone, and returning it
unchanged at SELECT time.
The normal behaviour of TIMESTAMP is that a SELECT returns the date&time
expressed in the time zone of the SELECT, which implies that the stored data
carries a notion of what time zone it is in. This is usually implemented by
converting to UTC at INSERT time, storing that in the column, and converting
from UTC to the SELECT's time zone at SELECT time.
In other words, DATETIME ignores time zones, while TIMESTAMP does not and
represents a single fixed point in time all over the planet.
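The two semantics above can be sketched in a few lines of Python (a
hypothetical illustration, not MySQL server code; the function names are
invented for this sketch):

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d %H:%M:%S"

def store_datetime(literal: str) -> str:
    # DATETIME semantics: store the literal as-is; time zones are ignored.
    return literal

def store_timestamp(literal: str, insert_tz: timezone) -> datetime:
    # TIMESTAMP semantics: interpret the literal in the INSERT time zone
    # and normalize it to UTC before storing.
    naive = datetime.strptime(literal, FMT)
    return naive.replace(tzinfo=insert_tz).astimezone(timezone.utc)

def read_timestamp(stored_utc: datetime, select_tz: timezone) -> str:
    # On SELECT, convert from UTC to the session's time zone.
    return stored_utc.astimezone(select_tz).strftime(FMT)

utc = timezone.utc
plus_1030 = timezone(timedelta(hours=10, minutes=30))

stored = store_timestamp("2003-03-30 02:30:00", utc)
# The same stored instant renders differently per session time zone:
print(read_timestamp(stored, utc))        # 2003-03-30 02:30:00
print(read_timestamp(stored, plus_1030))  # 2003-03-30 13:00:00
```

The key point is that `store_timestamp` loses nothing by normalizing to UTC:
the original wall-clock value is recoverable in any zone at read time.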

B) Short summary of the symptoms with CSV:
create table t1 (ts timestamp, dt datetime not null) engine=csv;
set time_zone='+00:00';
insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00');
set time_zone='+10:30';
insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00');
select * from t1;
ts                      dt
2003-03-30 02:30:00     2003-03-30 02:30:00
2003-03-30 02:30:00     2003-03-30 02:30:00
The "dt" values are correct, but we would rather expect a difference of 10:30
hours between the "ts" values of the two rows (because TIMESTAMP should be
displayed in the SELECT's time zone).
Thus the CSV engine treats TIMESTAMP columns as if they were DATETIME.
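The expected 10:30 difference can be verified with a short Python sketch
(again a hypothetical illustration of correct TIMESTAMP behaviour, not
engine code):

```python
from datetime import datetime, timedelta, timezone

LIT = "2003-03-30 02:30:00"
FMT = "%Y-%m-%d %H:%M:%S"

def to_utc(literal: str, tz: timezone) -> datetime:
    # Interpret the literal in the INSERT time zone and normalize to UTC.
    return datetime.strptime(literal, FMT).replace(tzinfo=tz).astimezone(timezone.utc)

utc = timezone.utc
plus_1030 = timezone(timedelta(hours=10, minutes=30))

# Row 1 was inserted under +00:00, row 2 under +10:30.
row1 = to_utc(LIT, utc)
row2 = to_utc(LIT, plus_1030)

# Displayed in the SELECT's time zone (+10:30), the rows should differ
# by exactly 10:30 hours, since they denote different instants.
shown1 = row1.astimezone(plus_1030)
shown2 = row2.astimezone(plus_1030)
print(shown1 - shown2)  # 10:30:00
```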

C) Consequences of the bug.
C1) this bug is also visible in the mysql.general_log CSV table:
"2009-06-13 17:43:55","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 17:44:08","root[root] @ localhost []",2,1,"Query","set
time_zone='Japan'"
"2009-06-13 22:44:09","root[root] @ localhost []",2,1,"Query","select * from t"
"2009-06-13 17:44:35","root[root] @ localhost []",3,1,"Connect","root@localhost
on test"
We see that when a session changed its time zone to Japan, this changed the
time recorded in the log table. A subsequent query from another session (using
the default Europe/Moscow) was then logged with an earlier time. One would
expect monotonically increasing times in a log, which is, by the way, how the
file-based general log behaves.
C2) for a CSV table, this bug implies that the value of time_zone at the moment
CHECKSUM TABLE is run influences what CHECKSUM TABLE returns (see BUG#34612 for
a testcase).

D) the goal of this WL is to fix those bugs. IBMDB2i and FEDERATED are said to
have them too. The assigned developer has no access to an IBMDB2i test
environment and is not assigned the IBMDB2i bug (BUG#43343), so the work will
be limited to fixing CSV and, if possible, FEDERATED, while providing helper
functions for IBMDB2i and assisting the IBMDB2i developers in using them.
Another open question is that the CSV bug is targeted for fixing in 5.4 while
the IBMDB2i one is targeted for 5.1 (the Triage committee has been asked about
this). Work will start on 5.4, and once we have a prototype we will decide
whether to rebase it on 5.1.

------------------------------------------------------------------
Dmitri's arch review of this WL, 2009-10-26:

I have the following comments about HLS/LLD for this WorkLog:

1) IMO it makes sense to mention in the HLS (or maybe in the original bug
   report, if you think that is a more appropriate place for such
   notes) that we have at least considered storing TIMESTAMP values
   in some fixed time zone other than UTC.
   For example, by internally storing the datetime value for such a
   column in the server's default time zone, we would be able to preserve
   compatible behaviour for cases when only this time zone is used and
   there are users/scripts which directly access the contents of .CSV
   files.
   Unfortunately, for most time zones (unlike for UTC) there are
   periods of time when a simple datetime literal can correspond to two
   different moments in time (e.g. for the 'Europe/Moscow' zone the
   literal "2009-10-25 02:30:00" can correspond to "2009-10-24 22:30:00
   UTC" and to "2009-10-24 23:30:00 UTC"). So, to avoid
   information loss when values for such moments in time are stored
   in a TIMESTAMP column through a DEFAULT NOW() clause or through
   switching to the '+00:00' time zone, we would have to extend datetime
   literals with time zone/type-of-time abbreviations (e.g. 'MSK'
   and 'MSD' for 'Europe/Moscow').
   Since implementing support for handling such extended datetime
   literals would require a noticeable amount of work, it is probably
   not practical to follow this approach at the moment.
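The Europe/Moscow ambiguity described above can be reproduced with Python's
`zoneinfo` module (assuming system tzdata is available; `fold` selects which
of the two overlapping local times is meant):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, needs system tzdata

msk = ZoneInfo("Europe/Moscow")
# On 2009-10-25 clocks in Moscow went back from 03:00 MSD to 02:00 MSK,
# so the local literal 02:30 occurred twice.
ambiguous = datetime(2009, 10, 25, 2, 30)

# fold=0 picks the first occurrence (MSD, UTC+4),
# fold=1 picks the second occurrence (MSK, UTC+3).
first = ambiguous.replace(tzinfo=msk, fold=0).astimezone(timezone.utc)
second = ambiguous.replace(tzinfo=msk, fold=1).astimezone(timezone.utc)
print(first)   # 2009-10-24 22:30:00+00:00
print(second)  # 2009-10-24 23:30:00+00:00
```

This is exactly the information loss point (1) warns about: a plain datetime
literal stored in a non-UTC zone cannot distinguish the two instants without
an extra MSK/MSD-style annotation.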

2) I think the HLS should be more affirmative or negative about the
   --csv-timestamp-in-local-time option. IMO it is better to decide now
   whether we are going to implement this option, rather than postponing
   it to the implementation phase.
   I also believe it is better to clarify what is going to happen
   when this option is enabled. I.e. does "local time" mean the current
   time zone of the connection or something else (like the server's
   default time zone)?

3) Finally, I think that the LLD should be a bit more elaborate about the
   way in which we are going to convert the time_t values, which are
   stored for TIMESTAMP columns by the Field_timestamp::store() method in
   the buffer representing the record, into the value to be stored in the
   .CSV file (I assume it is going to be a datetime literal).
   Which implementation of UTC are we going to use for this conversion?
   There are at least three options now:
   a) One which corresponds to the UTC time zone described in the system
      time zone tables (but what should happen if such tables are absent?).
   b) One which corresponds to the '+00:00' time zone (but on systems
      which support leap seconds, values produced by such a conversion
      might differ from UTC timestamps produced by other applications).
   c) One implemented by the Time_zone_utc class. This class needs to be
      extended first, as it does not implement the backward conversion
      from a broken-down datetime representation to time_t, which will be
      needed for reading data from .CSV tables.

   I am, personally, inclined towards using option a) by default and
   falling back to option c) if the system tables are absent.
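The round trip that point 3 asks about, in the shape of option b), can be
sketched in Python (hypothetical helper names; this uses a fixed '+00:00'
offset and ignores leap seconds, which is precisely the caveat noted for
that option):

```python
import calendar
import time

FMT = "%Y-%m-%d %H:%M:%S"

def time_t_to_literal(t: int) -> str:
    # Forward conversion: broken-down '+00:00' representation of the
    # stored time_t, formatted as the datetime literal for the .CSV file.
    return time.strftime(FMT, time.gmtime(t))

def literal_to_time_t(literal: str) -> int:
    # Backward conversion, needed when reading data back from .CSV files
    # (the direction Time_zone_utc currently lacks).
    return calendar.timegm(time.strptime(literal, FMT))

t = 1045000000
lit = time_t_to_literal(t)
# At a fixed '+00:00' offset the round trip is lossless:
assert literal_to_time_t(lit) == t
print(lit)
```

Whichever UTC implementation is chosen, the essential property to preserve
is this losslessness of the time_t -> literal -> time_t round trip.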

Otherwise I am OK with the architecture of this WorkLog task and think that
it can be approved once the above issues are ironed out.