WL#5259: PERFORMANCE_SCHEMA HOST_CACHE

Affects: Server-Prototype Only   —   Status: Complete

The server has an internal table containing information
about the host cache, including 'host name' and 'ip address'
and 'error count'. Expose this name via a new table in the performance_schema.
The benefits are:
* Users will know when the 'error count' approaches the value in
  the system variable 'max_connect_errors'
* Users will know how many hosts are cached.

According to support people, there are several support issues
which this would solve.
Background information
======================

Currently in mysql 5.6.4 (and previous releases),
the server behavior is as follows.

host cache:
- there is an internal host cache maintained in memory.
- this host cache has a fixed size (#define HOST_CACHE_SIZE 128)
- each entry consist of an ip address, a host name, and an error counter.
- the cache hash table, that keeps up to HOST_CACHE_SIZE records.
- the oldest entry is discarded to make room for new ones, if needed.

connection checks:
- when a client connection is accepted at the socket level,
various network related checks are performed, before going through
security for authentication and authorization.
The operations involved are:
- get the IP address of the client connecting to the server
- get the hostname for that IP
- check the hostname format for special cases
- get all the IP addresses for that hostname
- verify that the IP address from the client actually belong to the hostname,
  so that resolving IP <--> HOSTNAME is consistent in DNS.
The hostname found as part of this resolution is:
- cached in the host cache to avoid future DNS lookups
- used for authorization

Authentication and authorization can later fail due to various reasons,
some related to the user (bad username / password), some related to the
configuration of the client and/or server (usage of plugins, protocol handshake
errors)

Current limitations and problems
================================

Assume a client application attempts to connect to a server and fails.

When the failure is in the network itself (no route to host, connection refused,
etc), a system admin has all the network related tools to diagnose the problem
and fix it (ping, telnet, trace route, DNS, netstat, ...).

When the network delivered the client connection all the way to the
mysqld server, on the listening port, and when the failure is in the logic
internal to the mysqld process itself, the only visible sign that a connection
was refused is an increment on the server "Aborted_connects" status counter.

There is no way to tell why the connect fails, where the connection came from,
etc, and whether this is expected or not.

In practice, this makes administration very difficult: given that there are many
checks implemented that could cause a connect to be rejected, and that there are
many configuration settings that can have an influence, trouble shooting why a
given connection fails is not easy.

Design goals and proposed improvements
======================================

The goal of this feature is to expose how the internal host_cache function, so
that failures can be clearly identified, and corrected.

Every time a connection is accepted at the socket level (on the listening port)
and then rejected by the server, the DBA should be able to see:
- the exact reason for the failure
- the identity of the client affected, if known
- timestamp information, to help understanding the system status.

Specification
=============

Connection_errors_select status variable
----------------------------------------

This status variable counts the number of errors found
when calling select / poll on the listening port.

A failure to select / poll may not necessarily means a client
connection was rejected, but it is nonetheless a server error condition
that should be accounted for, as part of the general server connection process.

This variable is global.
This variable can be flushed.

Connection_errors_accept status variable
----------------------------------------

This status variable counts the number of errors found
when calling accept on the listening port.

This variable is global.
This variable can be flushed.

Connection_errors_tcpwrap status variable
-----------------------------------------

This status variable counts the number of connections refused
by libwrap.

This variable is global.
This variable can be flushed.

Connection_errors_internal status variable
------------------------------------------

This status variable counts the number of connections refused
due to internal errors in the server.

Internal errors can be:
- failure to start a new thread
- out of memory conditions

This variable is global.
This variable can be flushed.

Connection_errors_max_connections status variable
------------------------------------------------

This status variable counts the number of connections refused
due to the server max_connections limit.

This variable is global.
This variable can be flushed.

Connection_errors_peer_addr status variable
-------------------------------------------

This status variable counts the number of errors found when searching
for the connecting client IP.

This variable is global.
This variable can be flushed.

host_cache_size system variable
-------------------------------

This variable controls the size of the internal host cache.
It is global.
It can be changed dynamically.
Changing the variable size causes implicitly a FLUSH HOSTS.
Zero is a valid value, changing the host cache size to 0
means to disable the host cache.

TABLE performance_schema.host_cache
-----------------------------------

Columns:

  `IP` varchar(64) NOT NULL,
  `HOST` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `HOST_VALIDATED` enum('YES','NO') NOT NULL,
  `SUM_CONNECT_ERRORS` bigint(20) NOT NULL,
  `COUNT_HOST_BLOCKED_ERRORS` bigint(20) NOT NULL,
  `COUNT_NAMEINFO_TRANSIENT_ERRORS` bigint(20) NOT NULL,
  `COUNT_NAMEINFO_PERMANENT_ERRORS` bigint(20) NOT NULL,
  `COUNT_FORMAT_ERRORS` bigint(20) NOT NULL,
  `COUNT_ADDRINFO_TRANSIENT_ERRORS` bigint(20) NOT NULL,
  `COUNT_ADDRINFO_PERMANENT_ERRORS` bigint(20) NOT NULL,
  `COUNT_FCRDNS_ERRORS` bigint(20) NOT NULL,
  `COUNT_HOST_ACL_ERRORS` bigint(20) NOT NULL,
  `COUNT_NO_AUTH_PLUGIN_ERRORS` bigint(20) NOT NULL,
  `COUNT_AUTH_PLUGIN_ERRORS` bigint(20) NOT NULL,
  `COUNT_HANDSHAKE_ERRORS` bigint(20) NOT NULL,
  `COUNT_PROXY_USER_ERRORS` bigint(20) NOT NULL,
  `COUNT_PROXY_USER_ACL_ERRORS` bigint(20) NOT NULL,
  `COUNT_AUTHENTICATION_ERRORS` bigint(20) NOT NULL,
  `COUNT_SSL_ERRORS` bigint(20) NOT NULL,
  `COUNT_MAX_USER_CONNECTIONS_ERRORS` bigint(20) NOT NULL,
  `COUNT_MAX_USER_CONNECTIONS_PER_HOUR_ERRORS` bigint(20) NOT NULL,
  `COUNT_DEFAULT_DATABASE_ERRORS` bigint(20) NOT NULL,
  `COUNT_INIT_CONNECT_ERRORS` bigint(20) NOT NULL,
  `COUNT_LOCAL_ERRORS` bigint(20) NOT NULL,
  `COUNT_UNKNOWN_ERRORS` bigint(20) NOT NULL,
  `FIRST_SEEN` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `LAST_SEEN` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `FIRST_ERROR_SEEN` timestamp NULL DEFAULT '0000-00-00 00:00:00',
  `LAST_ERROR_SEEN` timestamp NULL DEFAULT '0000-00-00 00:00:00'

Operations:

SELECT is supported.
TRUNCATE is supported, and is used to flush the host cache.

Normal privilege checks do apply: a user must be granted the proper table
privileges to perform an operation.

Note that the existing way of flushing the table cache:
FLUSH HOSTS
still works, and now has a new alternative syntax:
TRUNCATE TABLE performance_schema.host_cache.

Both statements have the same effect, but use a different privilege.

Semantic for each column:

IP:
Shows the IP address of the client that connected to the server.
This address is expressed as a string.
Note: IPv4 requires a length of 15, IPv6 a length of 46 bytes.
This column is exposed as a varchar(64) in utf8.

HOST:
Shows the resolved DNS hostname for that client IP, if any.
Note that hostname length is 255, see RFC 2181.

HOST_VALIDATED:
Indicates if the IP --> HOST --> IP DNS resolution was performed.
When HOST_VALIDATED is still NO, the DNS resolution is attempted again
for each connect, until it eventually completes, either with a valid result, or
with a permanent error.
The reason for this column is to avoid caching bad / missing hostnames during
temporary DNS failures, which would affect clients forever.
When HOST_VALIDATED is YES, the HOST column is used as host name, avoiding calls
to DNS.

SUM_CONNECT_ERRORS:
Number of connect errors that are deemed 'blocking',
when using the max_connect_errors system variable.
Currently, only protocol handshake errors are counted,
and only for hosts that passed validation (HOST_VALIDATED).
This is an existing MySQL feature.

COUNT_HOST_BLOCKED_ERRORS:
Number of connections that were blocked, because SUM_CONNECT_ERRORS
exceeded max_connect_errors.
This is an existing MySQL feature.

COUNT_NAMEINFO_TRANSIENT_ERRORS:
Number of transient errors in DNS, during the IP --> HOSTNAME resolution.
This is an existing MySQL feature.

COUNT_NAMEINFO_PERMANENT_ERRORS:
Number of permanent errors in DNS, during the IP --> HOSTNAME resolution.
This is an existing MySQL feature.

COUNT_FORMAT_ERRORS:
Number of hostname format errors.
Hostnames can not start with a number followed by a ".",
This is an existing MySQL feature.

COUNT_ADDRINFO_TRANSIENT_ERRORS:
Number of transient errors in DNS, during the HOSTNAME --> IP reverse resolution.
This is an existing MySQL feature.

COUNT_ADDRINFO_PERMANENT_ERRORS:
Number of permanent errors in DNS, during the HOSTNAME --> IP reverse resolution.
This is an existing MySQL feature.

COUNT_FCRDNS_ERRORS:
Number of Forward-confirmed reverse DNS errors.
These errors occur when the IP --> HOSTNAME --> IP resolution does not
match the client originating IP. 
This is an existing MySQL feature.

COUNT_HOST_ACL_ERRORS:
Number of errors caused when no user from the client host can possibly
log in: the server returns ER_HOST_NOT_PRIVILEGED
and does not even ask for a user name / password.
This is an existing MySQL feature.

COUNT_NO_AUTH_PLUGIN_ERRORS:
Number of errors seen when an auth plugin is requested, but is not loaded in the
server.
This is an existing MySQL feature.

COUNT_AUTH_PLUGIN_ERRORS:
Number of errors reported by the auth plugin itself.
This is an existing MySQL feature.

Note: authentication plugin can report different error codes
to indicate the root cause of a failure.
Based on the plugin return code when an error occurs,
one of the following columns:
- COUNT_HANDSHAKE_ERRORS
- COUNT_AUTH_PLUGIN_ERRORS
- COUNT_AUTHENTICATION_ERRORS
is incremented in the host cache.
New return codes are an optional extension to the existing plugin API.
Unknown or unexpected plugin errors are counted as COUNT_AUTH_PLUGIN_ERRORS.

COUNT_HANDSHAKE_ERRORS:
Number or errors detected at the wire protocol level.
This is an existing MySQL feature.

COUNT_PROXY_USER_ERRORS:
Number of errors detected when using proxy users for authentication,
and when the proxy user does not exist.
This is an existing MySQL feature.

COUNT_PROXY_USER_ACL_ERRORS:
Number of errors detected when using proxy users for authentication,
and when the PROXY privilege is not granted.
This is an existing MySQL feature.

COUNT_AUTHENTICATION_ERRORS:
Number of errors caused by failed authentication.
This is an existing MySQL feature.

COUNT_SSL_ERRORS:
Number of errors caused by SSL.
This is an existing MySQL feature.

COUNT_MAX_USER_CONNECTIONS_ERRORS:
Number of errors caused by exceeding per user connections quotas.
This is an existing MySQL feature.

COUNT_MAX_USER_CONNECTIONS_PER_HOUR_ERRORS:
Number of errors caused by exceeding per user connections per hour quotas.
This is an existing MySQL feature.

COUNT_DEFAULT_DATABASE_ERRORS:
Number of errors related to the default database.
This is an existing MySQL feature.

COUNT_INIT_CONNECT_ERRORS:
Number of errors caused by failures in the init-connect statement execution.
This is an existing MySQL feature.

COUNT_LOCAL_ERRORS:
Number of errors local to the server implementation and not related
to the network, authentication or authorization.
For example, out of memory conditions fit into this category.

COUNT_UNKNOWN_ERRORS:
Number of other, unknown errors.
At time of writing, all errors are known and accounted for in the previous
columns in this table.
This column is reserved for future use, in case new error conditions need to be
reported, and if preserving the backward compatibility and table structure of
the host_cache table is required.

FIRST_SEEN:
Timestamp of the first connect seen from this client IP.

LAST_SEEN:
Timestamp of the last connect seen from this client IP.

FIRST_ERROR_SEEN:
Timestamp of the first error seen from this client IP.

LAST_ERROR_SEEN:
Timestamp of the last error seen from this client IP.

Requirements
============

(1) Table performance_schema.host_cache
---------------------------------------

Func-Req (1.1): A fresh MySQL installation of CURRENT-VERSION must create the
table performance_schema.host_cache.

(2) Server options and variables
---------------------------------

Func-Req (2.1): The status variable Connection_errors_select is defined.

Func-Req (2.2): The status variable Connection_errors_accept is defined.

Func-Req (2.3): The status variable Connection_errors_tcpwrap is defined.

Func-Req (2.4): The status variable Connection_errors_internal is defined.

Func-Req (2.5): The status variable Connection_errors_max_connections is defined.

Func-Req (2.6): The status variable Connection_errors_peer_addr is defined.

Func-Req (2.7): The system variable host_cache_size is defined.

(3) Functional behavior
-----------------------

Func-Req (3.1): When a TCP/IP client connection reaches the server, a
corresponding row for this client IP is inserted in the host_cache.
The HOST column is set to NULL, the VALIDATED column to NO.
This happens regardless of whether the connection is later accepted or rejected.

Func-Req (3.2): When subsequent TCP/IP client connection reaches the server from
the same originating IP, the corresponding row for this client IP is updated in
the host_cache.
This happens regardless of whether the connection is later accepted or rejected.

Func-Req (3.3): Upon successful IP --> HOSTNAME resolution, the corresponding
row for this IP is updated in the host_cache: the HOST column is updated with
the resolved host name, the VALIDATED column is set to YES.

Func-Req (3.4): Upon permanent failures during IP --> HOSTNAME resolution, the
corresponding row for this IP is updated in the host_cache: the HOST column is
NULL, the VALIDATED column is set to YES.

Func-Req (3.5): Upon transient failures during IP --> HOSTNAME resolution, the
HOST and VALIDATED columns are unchanged in the host_cache.

Func-Req (3.6): When an error XXX occurs when processing an incoming client
connection from a given IP address, the corresponding COUNT_XXX_ERRORS
statistics columns for the given IP are updated in the host_cache. See each
column description for the kind of errors accounted for.

Func-Req (3.7): Setting a value to the host_cache_size server variable
implicitly performs a FLUSH HOSTS.

Func-Req (3.8): TRUNCATE TABLE performance_schema.host_cache implicitly performs
a FLUSH HOSTS.

References
==========

BUG#22821 Adding "SHOW HOST_CACHE" patch from Jeremy Cole
BUG 11746048 - 22821: ADDING "SHOW HOST_CACHE" PATCH FROM JEREMY COLE 

BUG#24906 No command to monitor blocked hosts
BUG 11746269 - 24906: NO COMMAND TO MONITOR BLOCKED HOSTS 

BUG#45817 Please add SHOW command for inc_host_errors(max_connect_errors)
BUG 11754244 - 45817: PLEASE ADD SHOW COMMAND FOR
INC_HOST_ERRORS(MAX_CONNECT_ERRORS) 

BUG#59404 More info from host cache
BUG 11766316 - 59404: MORE INFO FROM HOST CACHE 

dev-private thread "BUG#45817" starting with
[ mysql intranet ] /secure/mailarchive/mail.php?folder=4&mail=32668

Some observations about CONNECT errors on a Facebook page
http://www.facebook.com/notes/mysqlfacebook/monitoring-for-errors/293204560932

More examples of complaints, where host_cache could help:

http://stackoverflow.com/questions/9369981/mysql-completely-random-unable-to-connect-to-any-of-the-specified-mysql-hosts

Impact on documentation
=======================

Notes for the doc team:
beside the new table, new server variables etc,
the documentation of return codes for auth plugins **MAY** need to be revised also.