WL#5051: make CSV engine's TIMESTAMP type behave like TIMESTAMP
Affects: Server-5.4 — Status: Assigned — Priority: Medium
We want to fix BUG#34612 "CSV storage engine does not honor time_zone". A) The normal behaviour of a DATETIME column is that in a SELECT it returns the date&time expressed in the time zone of INSERT (independently of the time zone at the time of SELECT). Usually it is implemented by, when inserting, storing in the column the date&time in the time zone of INSERT, and when selecting, returning this unchanged. The normal behaviour of TIMESTAMP is that in a SELECT it returns the date&time expressed in the time zone of the SELECT. Which implies that the stored data carries a notion of what time zone this stored data is in. This is usually implemented by, when inserting, converting in UTC, storing that in the column, and when selecting, converting from UTC to the time zone of SELECT. In different words, DATETIME ignore timezones, TIMESTAMP don't and represent a single fixed point in time all over the planet. B) Short summary of the symptoms with CSV: create table t1 (ts timestamp, dt datetime not null) engine=csv; set time_zone='+00:00'; insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00'); set time_zone='+10:30'; insert into t1 values ('2003-03-30 02:30:00','2003-03-30 02:30:00'); select * from t1; ts dt 2003-03-30 02:30:00 2003-03-30 02:30:00 2003-03-30 02:30:00 2003-03-30 02:30:00 "dt" values are correct, but we rather expected a difference of 10:30 hours between the values of "ts" in the two rows (because TIMESTAMP should be displayed in the SELECT's time zone). Thus the CSV engine treats TIMESTAMP columns as if they were DATETIME. C) Consequences of the bug. C1) this bug is also visible in the mysql.general_log CSV table: "2009-06-13 17:43:55","root[root] @ localhost ",2,1,"Query","select * from t" "2009-06-13 17:44:08","root[root] @ localhost ",2,1,"Query","set time_zone='Japan'" "2009-06-13 22:44:09","root[root] @ localhost ",2,1,"Query","select * from t" "2009-06-13 17:44:35","root[root] @ localhost ",3,1,"Connect","root@localhost on test" We see that when a session changed its time zone to Japan, this changed the time recorded in the log table. Then another session (using the default Europe/Moscow) led to a previous time being logged. One would expect an always increasing time in a log, which is how the file-based general log behaves by the way. C2) this bug implies that the value of time_zone at the moment when CHECKSUM TABLE is run influences what CHECKSUM TABLE returns (see BUG#34612 for testcase), for a CSV table. D) the goal of this WL is to fix those bugs. IBMDB2i and FEDERATED are said to have those bugs too. The assigned developer does not have access to IBMDB2i testing and is not assigned the IBMDB2i bug (BUG#43343) so the assigned developer will limit itself to fixing CSV and if possible FEDERATED, providing helper functions for IBMDB2i and assisting IBMDB2i developers to use them. Another unclarity is that the CSV bug is targetted for fixing in 5.4 but the IBMDB2i one is for 5.1 (Triage committee has been questioned about this). Work will start on 5.4 and once we have a prototype we will decide if we rebase it on 5.1. ------------------------------------------------------------------ Dmitri's arch review of this WL, 2009-10-26: I have the following comments about HLS/LLD for this WorkLog: 1) IMO it makes sense to mention in HLS (or may in the original bug report if you think that it is a more appropriate place for such notes) that at least we have considered storing TIMESTAMP values in some fixed time zone other than UTC. For example, by internally storing datetime value for such column in server's default time zone we will be able to preserve compatible behavior for cases when only this time zone is used and there are users/scripts which directly access to contents of .CSV files. Unfortunately, for most of time zones (unlike for UTC) there are periods of time when simple datetime literal can correspond to two different moments of time (e.g. for 'Europe/Moscow' zone literal "2009-10-25 02:30:00" can correspond to "2009-10-24 22:30:00 UTC" and to ""2009-10-24 23:30:00 UTC"). So to be able to avoid information loss when values for such moments of time are stored in TIMESTAMP column through usage of DEFAULT NOW() clause or through switching to '00:00' time zone we will have to extend datetime literals with time zone/type of time abbreviations (e.g. 'MSK' and 'MSD' for 'Europe/Moscow'). Since implementing support for handling of such extended datetime literals will require noticeable amount of work probably it is not practical to follow this approach at the moment. 2) I think HLS should be more affirmative or negative about --csv-timestamp-in-local-time option. IMO it is better to make decision whether we are going implement this option or not now rather than postponing it to the implementation phase. Also believe it is better to clarify what is going to happen when this option is enabled. I.e. does "local time" mean current time zone for connection or something else (like server's default time zone)? 3) Finally, I think that LLD should be a bit more elaborate about the way in which we are going convert time_t values which are stored for TIMESTAMP columns by Field_timestamp::store() method in buffer representing record to value to be stored in .CSV file (I assume it is going to be a datetime literal). What implementation of UTC we are going to use for this conversion? There are at least three options now: a) One which corresponds to UTC time zone described in system time zone tables (but what should happen if such tables are absent?) b) One which corresponds to '+00:00' time zone (but on systems which support leap seconds values produced by such conversion might differ from UTC timestamps produced by other applications). c) One implemented by Time_zone_utc class. This class needs to be extended first, as it does not implement backward conversion from broken-down datetime respresentation to time_t, which will be needed for reading data from .CSV tables. I am, personally, inclined towards using option a) by default and falling back to option c) if system tables are absent. Otherwise I am OK with architecture of this WorkLog task and think that it can be approved once the above issues are ironed out.
High-level design (extracted from BUG#34612): 3) The current behaviour of TIMESTAMP in CSV regarding what is stored, is the normal semantics of DATETIME (i.e. date and time stored in the timezone used at INSERT time, verbatim, and returned to reader verbatim, independently of the reader's timezone). So users willing to have this special behaviour can change their column to DATETIME. 4) The current behaviour of TIMESTAMP in CSV is unique to this engine and FEDERATED (some say, I haven't tested) and IBMDB2i, and in contradiction with the manual's definition of TIMESTAMP. Given that DATETIME exists, it sounds unnecessary to keep the current behaviour of TIMESTAMP in CSV long-term. 5) Given previous points, my proposal is to change the *default* to be: store TIMESTAMP in UTC in the CSV file, and return it to SELECT after adjustment to the reader's timezone, like for other engines. This should fix all bad cases reported in the bug report (including CHECKSUM TABLE). 6) There remains upgrade problems and current use cases to tackle with care, see following points. 7) A pre-upgrade departemental database might be limited to a single timezone in practice, and users/scripts may be reading the table with "cat" and expect the same content as SELECT (i.e. local time). After upgrading, newly inserted TIMESTAMP values are stored in UTC in the file, scripts breaks. The upgrade procedure to fix this is that for each TIMESTAMP column where the old behaviour (SELECT==file) is desired, run: ALTER TABLE tbl MODIFY col DATETIME NOT NULL; (this must be run on *pre-upgrade* server). And fixing the application if it was relying on auto-setting of TIMESTAMP when INSERT/UPDATE specifies a NULL value (DATETIME does not have auto-set properties, so NOW() should be explicitely used when inserting/updating DATETIME). After that, the user has a DATETIME column with DATETIME semantics, with correct time content, and SELECT==file. 8) For columns which are not upgraded like above, i.e. real correct TIMESTAMP behaviour is desired, there is a different procedure. Say local time is UTC+2. A pre-upgrade old CSV table contained TIMESTAMP in local time (say 08AM). SELECTing it with a post-upgrade server will show 10AM which is wrong (the file's content 08AM will be wrongly interpreted as UTC). For each column where the user wants to switch to real correct TIMESTAMP behaviour, she or he should do: ALTER TABLE tbl MODIFY col DATETIME NOT NULL; (this must be run on *pre-upgrade* server) and then ALTER TABLE tbl MODIFY col TIMESTAMP; (this must be run on *post-upgrade* server). After that the user has a TIMESTAMP column with normal TIMESTAMP semantics and correct time content. Alternative which would also work: * run mysqldump on pre-upgrade server * in its output, replace SET TIME_ZONE='+00:00' with SET TIME_ZONE=the timezone used for that column (UTC+2) (or use "mysqldump --skip-tz-utc" and add the line above to the output's start). * pipe this into post-upgrade server. 9) If we don't want to force users to execute the upgrade procedures of (7)(8) immediately, we can provide an option --csv-timestamp-in-local-time which keeps the old behaviour, and deprecate that option later.
Given that CSV and IBMDB2i have the same bug, and the IBMDB2i folks requested a reusable solution, helper functions could be implemented, accessible to all engines: they would be used by write_row()/update_row()/rnd_*() of impacted engines and convert between a TIMESTAMP value (in the record buffer) and a string in UTC. Then CSV/IBMDB2i would: - in write_row()/update_row(), use one helper to build the UTC string and send it to their storage. - in rnd_*(), use the other helper to generate the TIMESTAMP value based on the UTC string.
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.