WL#11541: Allow switching the SSL options for a running server

Affects: Server-8.0 — Status: Complete

For long-running servers the certificates may expire, and one must be able to
replace them without restarting the server.
This worklog will make all of the SSL options dynamic by preparing a new SSL_CTX
for the listening socket and substituting it for the old one.

FR1: We shall add a command to refresh the SSL key material, e.g. ALTER INSTANCE
RELOAD TLS. When called, this will re-initialize the SSL_CTX used to accept
connections, reading from the same sources that the config variables point to.
FR 1.1: All sessions that existed at the time RELOAD TLS is called will continue
to use the old SSL context and operate normally.
FR 1.2: All sessions created after RELOAD TLS takes effect will use the new SSL
context.
FR 1.3: the resources of the old SSL context will be freed after the last active
session using it is closed.
FR 1.4: RELOAD TLS will not kill any of the existing sessions. If one wants to
do this they can use KILL.
FR 1.5: Any new values for the SSL context related system variables
(--ssl-ca, --ssl-cert, --ssl-key, --ssl-capath, --ssl-crl, --ssl-crlpath,
--ssl-cipher, --tls-version) will become effective only after a subsequent call
to ALTER INSTANCE RELOAD TLS.
FR 1.6: ALTER INSTANCE RELOAD TLS will not be replicated: SSL config is local
and resides in files that need to be maintained too.
FR 1.7: CONNECTION_ADMIN will be required to execute ALTER INSTANCE RELOAD TLS.
FR 1.8: A call to ALTER INSTANCE RELOAD TLS will not disable SSL connections if
the parameter values in the ssl system variables do not form an acceptable set
of options and/or some of the functions called to create the new SSL_CTX fail.
It will roll back to the previous set of values. If the optional clause NO
ROLLBACK ON ERROR is specified, the rollback will not be performed and SSL will
be disabled in case of errors setting up the context.
FR 1.9: A call to ALTER INSTANCE RELOAD TLS will enable SSL connections if the
parameter values in the ssl system variables form an acceptable set of options
and all of the functions called to create the new SSL_CTX succeed.
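
The rollback rules in FR 1.8 and FR 1.9 can be illustrated with a minimal C++
sketch. All names here (TlsParams, TlsContext, TlsState, build_context,
reload_tls) are hypothetical and invented for the illustration, and
build_context only stands in for the real SSL_CTX setup by checking that both a
certificate and a key are named.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins: none of these names exist in the server code.
struct TlsParams {
  std::string cert;
  std::string key;
};

struct TlsContext {
  TlsParams params;  // the values this context was built from
};

// Stand-in for the real SSL_CTX setup: fails unless cert and key are both set.
inline std::shared_ptr<TlsContext> build_context(const TlsParams &p) {
  if (p.cert.empty() || p.key.empty()) return nullptr;
  return std::make_shared<TlsContext>(TlsContext{p});
}

struct TlsState {
  std::shared_ptr<TlsContext> active;  // nullptr models "SSL disabled"
};

// Returns true on success. On failure the old context is kept (rollback)
// unless no_rollback is set, in which case SSL ends up disabled.
inline bool reload_tls(TlsState &state, const TlsParams &wanted,
                       bool no_rollback) {
  std::shared_ptr<TlsContext> fresh = build_context(wanted);
  if (fresh) {
    state.active = fresh;  // FR 1.9: an acceptable set enables SSL
    return true;
  }
  if (no_rollback) state.active.reset();  // NO ROLLBACK ON ERROR
  return false;  // FR 1.8: otherwise the previous context stays active
}
```

Calling reload_tls with an incomplete parameter set and no_rollback=false leaves
state.active untouched; the same call with no_rollback=true clears it.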

FR2. We will make the (--ssl-ca, --ssl-cert, --ssl-key, --ssl-capath, --ssl-crl,
--ssl-crlpath, --ssl-cipher, --tls-version) system variables settable at
runtime.
FR2.1. Changing the values of the SSL context related system variables
(--ssl-ca, --ssl-cert, --ssl-key, --ssl-capath, --ssl-crl, --ssl-crlpath,
--ssl-cipher, --tls-version) won't have any effect until
ALTER INSTANCE RELOAD TLS is called: e.g. to change a certificate you need to
change the related key too, so applying the change on every SET would not work
well.

NF3. The current SSL context variable will be converted into an atomic pointer
and all operations with it will be atomic. This might have some performance
impact (TBD).

FR4. This worklog will *only* affect the server's context for the mysql protocol
handler. All other SSL server contexts (X protocol etc.) will need to undergo
a similar operation.

NF5. By default (if one doesn't call ALTER INSTANCE RELOAD TLS) the operation of
the server is backward compatible. A slight delay may be observed at
connect/SHOW STATUS time due to the atomic read of the SSL context pointer.

FR6. All the SSL context related system variables (--ssl-ca, --ssl-cert,
--ssl-key, --ssl-capath, --ssl-crl, --ssl-crlpath, --ssl-cipher, --tls-version)
are to be mirrored by a set of status variables named Current_tls_* (e.g. the
currently effective --ssl-ca value is to be found in "show status like
'Current_tls_ca'") reflecting the *currently active* values. These change when
you start the server or call ALTER INSTANCE RELOAD TLS.
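
The split between the settable system variables (FR2.1) and the Current_tls_*
status variables (FR6) can be modelled as two value sets, one staged and one
active. This is only a sketch: the TlsVariables class and its method names are
hypothetical, and reload() models a successful ALTER INSTANCE RELOAD TLS without
building any real context.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model: the names are illustrative, not the server's own.
class TlsVariables {
 public:
  // Models SET on an SSL system variable: only the staged value changes.
  void set_system_variable(const std::string &name, const std::string &value) {
    staged_[name] = value;
  }

  // Models a successful ALTER INSTANCE RELOAD TLS: staged values become active.
  void reload() { active_ = staged_; }

  // Models reading Current_tls_<name>: the value the live context was built
  // from, or an empty string if nothing has been activated yet.
  std::string current(const std::string &name) const {
    auto it = active_.find(name);
    return it == active_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> staged_;  // system variable values
  std::map<std::string, std::string> active_;  // Current_tls_* values
};
```

Setting "ca" to a new file leaves current("ca") at the old value until reload()
is called, which mirrors why a certificate and its key can be changed in two
separate SET statements without a half-updated context ever being used.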

FR7. The --ssl option will only have an effect at server startup: when it is
turned off, the server will not be prepared to accept SSL connections at
startup. Subsequent calls to ALTER INSTANCE RELOAD TLS will not take the --ssl
option into account anymore.

FR8. Both Group Replication and the X plugin will copy the ssl variable values
from the system variables only at the time they are activated. Subsequent
changes to the values of the system variables or calls to ALTER INSTANCE RELOAD
TLS will currently not result in changing the SSL parameters for these
components.

Current state
---------------

Currently the SSL context is kept in two global variables:
ssl_acceptor_fd: the SSL_CTX itself
ssl_acceptor: one fake SSL handle created at startup so that some status vars
will work

Both are initialized in init_ssl_communications, called by mysqld_main right
before network_init.
Both are deinitialized in end_ssl, called by clean_up right before vio_end().

The ssl_acceptor_fd is used in the following status vars:
* ssl_ctx_sess_accept
* ssl_ctx_sess_accept_good
* ssl_ctx_sess_connect_good
* ssl_ctx_sess_accept_renegotiate
* ssl_ctx_sess_connect_renegotiate
* ssl_ctx_sess_cb_hits
* ssl_ctx_sess_hits
* ssl_ctx_sess_cache_full
* ssl_ctx_sess_cache_misses
* ssl_ctx_sess_cache_timeouts
* ssl_ctx_sess_number
* ssl_ctx_sess_get_cache_size
* ssl_ctx_get_verify_mode
* ssl_ctx_get_verify_depth
* ssl_ctx_get_session_cache_mode

The ssl_acceptor ssl context is used for the following status vars:
* ssl_server_not_before
* ssl_server_not_after

The ssl_acceptor_fd is also used:
* as a flag for whether SSL is configured, in:
** send_server_handshake_packet, to set the CLIENT_SSL capability.
** validate_user_plugin_records, when there are sha256 users.
* to call SSL_accept in send_server_handshake_packet when it's time to set up
SSL layering.

There's also a flag called have_ssl. It is:
* initialized to YES if compiled with SSL library in mysql_init_variables
* initialized to NO (impossible) if compiled without an SSL library in
mysql_init_variables
* set to DISABLED in init_ssl_communications if allocating an SSL_CTX fails
* checked in check_secure_transport() 
* exposed as have_openssl
* exposed as have_ssl


Proposed design
---------------

class ssl_acceptor_context {
protected:
  struct st_VioSSLFd *acceptor_fd;
  SSL *acceptor;
  std::string current_ca_, current_key_, ...
  ssl_acceptor_context();
  ~ssl_acceptor_context();

  // the pointer to hold the current SSL context
  static std::atomic<ssl_acceptor_context *> singleton;
  // Partial RCU implementation: a guard to ensure there are no readers.
  // This works since we expect that the readers will be very brief and
  // most of the time will be spent servicing the session.
  // We only need to mark a thread as a reader from the moment it reads the
  // singleton value until it allocates the new SSL for the session.
  // After that SSL has its own locking scheme.
  static std::atomic<int> rcu_readers;
public:
  // operations
  // the current init_ssl_communications()
  static bool singleton_init(... ssl_params ...);
  // the current end_ssl()
  static void singleton_deinit();
  static bool singleton_flush(... ssl params ...);

  // info functions, to be called for the status vars
  static ssl_ctx_sess_*(...);
  static ssl_ctx_get_*(...);
  static ssl_server_not_*(...);
  static bool have_ssl();
};


Add syntax sugar to the parser: "ALTER INSTANCE RELOAD TLS [NO ROLLBACK ON
ERROR]" calls singleton_flush().

All SSL variables become settable at runtime.

singleton_flush() will:
1. Create a new ssl_acceptor_context instance complete with SSL_CTX and SSL, and
copy the SSL variable values into it.
2. Atomically swap the new instance for the old one in the static atomic
singleton pointer.
3. Free the old instance when there are no readers.

Note that SSL_CTX has an (atomic) reference count. And SSL_free calls
SSL_CTX_free for the SSL_CTX it was created on. So the last SSL_free for the
last session using the old CTX will free the CTX.

The info functions can safely use the SSL_CTX data due to the reference count
bump by the acceptor SSL.
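
The lifetime rule above (the old context lives until its last session goes away)
can be sketched without OpenSSL by letting std::shared_ptr play the role of the
SSL_CTX reference count. This is a swapped-in technique for illustration only:
FakeCtx, FakeSession, accept_session and flush are hypothetical names, and the
sketch is single-threaded, so it says nothing about the atomic swap itself.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Counts destroyed contexts so that the lifetime can be observed.
static int g_freed_contexts = 0;

// Hypothetical stand-in for SSL_CTX.
struct FakeCtx {
  std::string cert;
  explicit FakeCtx(std::string c) : cert(std::move(c)) {}
  ~FakeCtx() { ++g_freed_contexts; }
};

// Hypothetical stand-in for a session's SSL handle. Holding a shared_ptr
// mimics the reference that an SSL keeps on the SSL_CTX it was created from.
struct FakeSession {
  std::shared_ptr<FakeCtx> ctx;
};

// The acceptor's current context (the "singleton" in the design above).
static std::shared_ptr<FakeCtx> g_current;

// A new session always picks up whatever context is current right now.
inline FakeSession accept_session() { return FakeSession{g_current}; }

// Replaces the current context. The old one is destroyed only when the last
// session that still refers to it is destroyed.
inline void flush(const std::string &new_cert) {
  g_current = std::make_shared<FakeCtx>(new_cert);
}
```

A session accepted before flush() keeps seeing the old certificate, a session
accepted afterwards sees the new one, and g_freed_contexts only goes up once the
older session is gone.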


The home-grown RCU (Read-Copy-Update) lockless implementation
-------------------------------------------------------------

We want to impose a minimal penalty on reading the SSL_CTX.
That may come at the cost of some extra effort spent on updating it, since
updates are going to be very infrequent compared to the reads.

The best algorithm for that seems to be an adapted version of RCU. See the test
driver attachment for comparison benchmarking: it compares this implementation
with a mutex and with a read-write lock, and also includes a proof-of-concept
SSL_CTX/SSL implementation.

We have a single global atomic pointer to hold the SSL context.
We can safely prepare a new one and replace the old one atomically.
But the issue is that we need to make sure nobody is using the old SSL context
before we can call SSL_CTX_free() on it.
OpenSSL has its own reference counters for SSL_CTX. Thus we only need to make
sure nobody is reading the global atomic pointer without the help of an SSL:
once we allocate an SSL from the SSL_CTX, OpenSSL's reference count kicks in and
the SSL_CTX is safe and disposed of properly.
So we only have to account ourselves for a very small window: from the moment
the SSL_CTX global atomic pointer is read to the moment SSL_new completes.

Since the window is so small there's a high chance that, even on a busy system,
there will be many moments when no thread is in that state.
Thus all we need is a reliable way to tell whether any reader is currently in
that window.
We achieve this via a simple atomic reference count.
The writer just busy-waits for it to reach zero before disposing of the old CTX.
In the unlikely event that this wait fails (too many sessions reading) we could
either leak the CTX or add it to a global queue for the next writer to dispose
of.
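
The reader window and the writer's busy-wait can be sketched as follows. This is
a minimal illustration under stated assumptions, not the test driver or the
server code: Ctx, g_ctx, g_readers, read_generation and replace_ctx are
hypothetical names, the reader's "use" of the context is a plain field read
where the design has SSL_new, and the wait has no timeout, so the leak/queue
fallback is not modelled.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the acceptor context.
struct Ctx {
  int generation;
};

static std::atomic<Ctx *> g_ctx{nullptr};  // the global atomic pointer
static std::atomic<int> g_readers{0};      // readers currently in the window

// Reader: enter the window, read the pointer, use it briefly, leave the
// window. The counter is bumped BEFORE the pointer is loaded, so a writer
// that swapped the pointer after our load is guaranteed to see us.
inline int read_generation() {
  g_readers.fetch_add(1);
  Ctx *c = g_ctx.load();
  int gen = c ? c->generation : -1;  // stands in for SSL_new on the context
  g_readers.fetch_sub(1);
  return gen;
}

// Writer: publish the new context first, then wait until no reader is in the
// window, and only then free the old context. Readers arriving after the
// exchange can only see the new pointer, so they never touch the old one.
inline void replace_ctx(Ctx *fresh) {
  Ctx *old = g_ctx.exchange(fresh);
  while (g_readers.load() != 0) std::this_thread::yield();  // busy-wait
  delete old;
}
```

All atomic operations use the default sequentially consistent ordering, which
keeps the argument simple: a reader that obtained the old pointer has already
incremented g_readers, so the writer cannot observe zero until that reader has
left the window.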

Results from running the test driver on a Windows laptop with OpenSSL 1.1 (times
in seconds):

Starting type rcu
all threads started in t=20
all readers ended t=20
all threads ended in t=20
Stats: reads=100000000 writes=50
Starting type mutex
all threads started in t=34
all readers ended t=34
all threads ended in t=34
Stats: reads=100000000 writes=50
Starting type smutex
all threads started in t=78
all readers ended t=78
all threads ended in t=79
Stats: reads=100000000 writes=50
Starting type SSL_rcu
all threads started in t=144
all readers ended t=145
all threads ended in t=145
Stats: reads=0 writes=50


Results from running the test driver on a Linux server with OpenSSL 1.0 (times
in seconds):

Starting type rcu
all threads started in t=2
all readers ended t=6
all threads ended in t=6
Stats: reads=100000000 writes=50
Starting type mutex
all threads started in t=0
all readers ended t=18
all threads ended in t=18
Stats: reads=100000000 writes=50
Starting type smutex
all threads started in t=0
all readers ended t=25
all threads ended in t=25
Stats: reads=100000000 writes=50
Starting type SSL_rcu
all threads started in t=1
all readers ended t=387
all threads ended in t=387
Stats: reads=0 writes=50

Effects on other users of the server SSL parameters
---------------------------------------------------

Currently the X and GR plugins piggyback on the server parameters for their own
SSL config.
After this work is done there will be no change there, except that they will
take the startup parameters and freeze them: i.e. if the server values change,
GR and X will not notice this.

Separate work needs to be done by the relevant teams to ensure that server
changes are handled properly,
or that they do not depend on the server parameters (which is the preferred
mode).