WL#11009: Logging services: log writers: add error ID to traditional error log

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

WL#9344 moved the supermajority of the server's messages to the error log out of 
the code proper and into the messages file, sql/share/errmsg-utf8.txt.
Beside auxiliary advantages such as deduplication, facilitation of translation, 
and providing a stepping stone for a possible moving of the messages to a table 
later, the main advantage was that all messages concerned would now have a stable 
error symbol (e.g. ER_STARTUP) and an error code (e.g. 1451).

While the JSON log writer (one of the sinks defined in WL#9343) is capable of 
writing either, both, or none of these items (depending on the configuration set 
in the filtering engine, WL#9342), it was decided that even the 
"traditional" error-log would benefit from having each message's ID added to it.

Func-Req 1 All error log messages output from mysqld must have an 

Func-Req 2  Log format of "traditional" error log (stderr/file)

Entry in the error log shall be extended so that the actual error message is
preceded by ' [MY-'  '] '. This will maintain human readability, and
is designed so that new tools can parse the error_ID as an indvidual field if
they wish, while legacy ones will consider it part of the error message (with
the intention of keeping the format nominally intact, and the impact on
aggregation etc. as small as possible).

I-1  SQL syntax

NO CHANGE

I-2  Instrumentation

NO CHANGE

I-3  Error and Warnings

NO CHANGE (except default log format, see I-9, log files)

I-4  Install/Upgrade

NO CHANGE

I-5  Commercial Plugins

NO CHANGE (but designed to be compatible with any future changes further 
empowering loadable plugins/components, allowing the latter to register 
namespaced error messages)

I-6  Replication

NO CHANGE

I-7  Protocols

NO CHANGE

I-8  Security

NO CHANGE

I-9  Log files

The "traditional" error log (file/stderr) will be extended with an error message 
ID. This new field be precede the actual human-readable message string, with the 
intention of both keeping the logs human-readable, and of remaining compatible 
with legacy log analyzers / aggregators etc. See below for full specification, 
analysis, and rationale.

I-10 MySQL clients

NO CHANGE

I-11 Auth plugins

NO CHANGE (but designed to be compatible with any future changes further 
empowering loadable plugins/components, allowing the latter to register 
namespaced error messages)

I-12 Globalization

NO CHANGE

I-13 Configuration

NO CHANGE

I-14 File formats

NO CHANGE (except log files, see there)

I-15 Plugins/APIs

NO CHANGE (but designed to be compatible with any future changes further 
empowering loadable plugins/components, allowing the latter to register 
namespaced error messages)

I-16 Other programs

Designed for compatibility with external log analyzers etc. where possible.
No change for programs that do not read error logs.

I-17 Storage engines

NO CHANGE (but designed to be compatible with any future changes further 
empowering loadable plugins/components, allowing the latter to register 
namespaced error messages)

A few considerations that will not result in extra code or work right now, but 
will hopefully keep this spec future-proof, and will prevent us from painting 
ourselves into a corner.


ERROR NUMBERING: I - STATUS QUO

The error message subsystem supports a (somewhat) sparse numberspace in that 
understands ranges, so that e.g. each version of the server can have its own 
range, allowing us to add messages to older servers without breaking the 
numbering of the newer ones by "pushing the newer messages up to higher indices 
in the array."

This is mostly a convenience for us however, with ranges having little meaning 
to the DBA: we don't have ranges per sub-system, and unrelated messages are 
simple appended (to a given server-version's message list, within that server 
version's number range) in the order they come in. As a result, the numbers are 
for all practical purposes opaque; you can't look at an error-code and know, 
"it's between 1000 and 2000, that's from replication, then."


ERROR NUMBERING: II - CHALLENGES IN THE FUTURE

If we enrich each log-line by an error identifier, it would be nice if not just 
the server proper, but loadable plug-ins and components as well could supply 
this information with relative ease. We could perhaps assign a range to each of 
our own plug-ins, but that would not solve the issue for 3rd party offerings. 
(In fact, even within ORCL the idea of all subsystems using errmsg-utf8.txt is 
less than universally popular.) Since we probably wouldn't want to be a central 
registry for all plugins/components and their error ranges, we might need a 
system later that will allow loadable objects to maintain their own error lists, 
and register them on load. We could then assign ranges dynamically on load, but 
that would mean that codes might change every time the DBA changes their 
server's configuration, which would make these codes inconsistent across logs. 
Hence, at such a time in the future as this issue arises, the correct way of 
handling this will be to namespace the error codes, thus enabling each 
plug-in to start their range at 1, if they wish, as "FooPlug:1" will be disjunct 
from the server's "1".


ON DISK FORMAT / PARSING BY THIRD PARTY TOOLS: I - FIELDS

The traditional log is currently printed as

      len= snprintf(buff, sizeof(buff), "%.*s %u [%.*s] %.*s",
                    (int) ts_len,    iso_timestamp,
                    thread_id,
                    (int) label_len, label,
                    (int) msg_len,   msg);


that is, fields are separated by spaces (rather than tabs etc.).

The last field is the message, which may itself contain white space; therefore 
any tools operating on this log are likely to parse it as "1st field, timestamp; 
2nd field, thread-ID; 3rd field, severity-label; all the rest, message."

As the error-ID semantically belongs with the message anyway, the correct 
place for it is therefore right in front of it, separated by a space:

      

This is not only the format that would make most immediate sense to a human 
reader, it also means that a tool that is not aware of this new, extended format 
would consider the error_id part of the message, i.e. parsing and aggregation 
should still work as expected, keeping the format backwards-compatible.

At the same time, more current tools could consider the error_id a separate 
field where this makes sense. It is suggested to use square brackets for the 
error_id (as we do for the severity) to enable such tools to automatically 
decide which format a given log is in:

2017-06-30T08:33:37.659435Z 1 [Note] [MY-003629] Found data dictionary with
version 1
------------1-------------- 2 ---3-- -------------4-------------
------------1-------------- 2 ---3-- -----4----- ------5--------

In this example, the first line shows an arbitrary log entry,
the second line shows how this entry would be parsed by old tools,
and the third line shows how it would be parsed by new tools.


For comparison, an older server's log would not show the error-ID, like so:

2017-06-30T08:33:37.659435Z 1 [Note] Found data dictionary with version 1
------------1-------------- 2 ---3-- -----------------4------------------
------------1-------------- 2 ---3-- -----------------5------------------

Here, old tools would work as they always have (line 2),
while new tools could detect the old log format (id est, the absence of the 
error-ID, or what they'd consider "field 4") by the fact that the field after 
field 3 does not start with an opening square bracket.


ON DISK FORMAT / PARSING BY THIRD PARTY TOOLS: II - ERROR-ID FIELD

As noted above, the actual error-ID is for all intents and purposes an opaque 
identifier (preceded by a '[', succeeded by a ']'), and it is of paramount 
importance that this be reflected in the public documentation. This not only 
helps with keeping the tools compatible as outlined above; it also allows us to 
extend and modify the format at a later date without violating these specs.

At this point in time, a numeric format prefixed with 'MY-' is chosen for the
identifier (6 digits with leading zeros, for better alignment and therefore 
enhanced readability of the logs), but as identifiers are opaque, tools should
treat this field as a string-literal all the same. The rationale to print "MY"
in front of error log numbers is that users can search for the same to find
appropriate MySQL documentation.

Therefore, the error-ID is now implemented as,

error_code     ::= '['  {6} ]
prefix         ::= 'MY-'
digit          ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9


but will be 100 % upwards compatible with a namespaced format if and when the 
need arises in the future:

error_id               ::= '[' [  ':' ]
                            error_code_in_subsystem ']'
subsystem              ::=  [  |  | '_' | '.' ]
error_code_in_subsytem ::=  [  |  | '_' | '.' ]


ACTUAL CODE

After all these considerations, the actual code-change is very minor, and 
expected to look similar to

      len= snprintf(buff_line, sizeof(buff_line),
                    "%.*s %u [%.*s] [MY-%06u] %.*s",
                    (int) ts_len,    iso_timestamp,
                    thread_id,
                    (int) label_len, label,
                    errcode,
                    (int) msg_len,   msg);

for the time being.