WL#3704: Have timeouts for mgmapi
Affects: Server-5.1 — Status: Complete — Priority: Medium
Add proper timeouts for mgmapi functions. Many functions in the mgm api do not time out correctly (or even remotely quickly) on unexpected disconnection from the management server. Properly auditing and fixing existing functions will probably fix a bunch of bugs in our mgm api clients as well as in customer applications.
1. have timeout in NdbMgmHandle
2. function to set timeout
3. modify each mgmapi function to return error on timeout, keeping existing API/ABI
4. add new versions of functions if needed with better return codes for timeout
5. review use of mgmapi in our programs
From comments in commits:
Start using the write_timeout we already have in NdbMgmHandle.
Add error injection, either for this connection or for the whole server. Currently there is nothing for injecting errors into *another* connection... but that's perhaps getting tricky-dicky for this point in time. It is perhaps needed for events if we don't do anything fancy.
In ndb_mgm_call, add checks for an expired timeout in (Input|Output)Stream. In case of timeout, we set NdbMgmHandle->last_error and return NULL. API calls that do not use ndb_mgm_call (or use it in conjunction with their own IO) need to check for timeouts manually. Macros are provided to do this.
Add ndb_mgm_disconnect_quiet(h) to disconnect without checking errors (so we don't clobber NdbMgmHandle->last_error). This helps us provide the *consistent* semantic that on timeout we leave the NdbMgmHandle *disconnected*. We check for this in testMgm.
Change CHECK_REPLY in mgmapi to also check for an error set in handle->last_error. This picks up the ETIMEDOUT errors and returns them to the client (by returning the correct failure code for the API call and setting the NdbMgmHandle error). Applications written to MGMAPI before this patch will behave as before, and hopefully even check get_last_error and report the error back to the end user!
By adding the last CHECK_TIMEDOUT_RET and delete in ndb_mgm_call(), we slightly change the behaviour of mgmapi. Previously, if a disconnect happened midway through a reply, where only optional parameters were left, we'd get a Properties object from ndb_mgm_call() containing NULLs for the optional parameters, leading to interesting error messages. The change enables returning the *real* message and actually improves the API without breaking compatibility.
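The timeout semantics described in these commit comments (record the error in the handle, then leave the handle disconnected without clobbering that error) can be illustrated with a small, self-contained sketch. This is not the mgmapi implementation: `struct sketch_handle`, `sketch_disconnect_quiet()` and `sketch_read_reply()` are hypothetical names, and the use of `poll()` is an assumption made only to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for NdbMgmHandle; all field names are assumptions. */
struct sketch_handle {
  int fd;          /* connected descriptor, or -1 when disconnected */
  int timeout_ms;  /* per-operation timeout in milliseconds */
  int last_error;  /* 0 when no error has been recorded */
};

/* Close the connection without touching last_error, in the spirit of
   ndb_mgm_disconnect_quiet(). */
static void sketch_disconnect_quiet(struct sketch_handle *h)
{
  assert(h != NULL);
  if (h->fd >= 0) {
    close(h->fd);
    h->fd = -1;
  }
}

/* Wait up to timeout_ms for reply data, then read it.
   Returns the number of bytes read, or -1 with last_error set.
   On any failure the handle is left disconnected. */
static long sketch_read_reply(struct sketch_handle *h, char *buf, size_t len)
{
  struct pollfd pfd;
  int rc;
  ssize_t n;

  assert(h != NULL && buf != NULL);
  if (h->fd < 0) {
    h->last_error = ENOTCONN;
    return -1;
  }

  pfd.fd = h->fd;
  pfd.events = POLLIN;
  pfd.revents = 0;

  rc = poll(&pfd, 1, h->timeout_ms);
  if (rc == 0) {
    /* Timeout expired: record it first, then disconnect quietly so the
       recorded error survives. */
    h->last_error = ETIMEDOUT;
    sketch_disconnect_quiet(h);
    return -1;
  }
  if (rc < 0) {
    h->last_error = errno;
    sketch_disconnect_quiet(h);
    return -1;
  }

  n = read(h->fd, buf, len);
  if (n < 0) {
    h->last_error = errno;
    sketch_disconnect_quiet(h);
    return -1;
  }
  return (long)n;
}
```

A caller that sees -1 can inspect `last_error` to distinguish a timeout from other failures, and can rely on the handle being disconnected afterwards, which mirrors the behaviour that testMgm is said to check.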
ndb_mgm_start_signallog, ndb_mgm_stop_signallog, ndb_mgm_log_signals, ndb_mgm_set_trace, ndb_mgm_insert_error, ndb_mgm_set_int64_parameter, ndb_mgm_set_string_parameter, ndb_mgm_purge_stale_sessions: return error code on error during ndb_mgm_call.
TODO: ndb_mgm_report_event: marked for removal, unused, undocumented; return codes incorrect in CHECK_HANDLE/CONNECTED.
Server side: in Services (per session), add a macro for injecting a timeout error (it just waits 10 seconds before continuing... it does work!). We inject these errors in a number of critical places, including the tricky API functions that don't just use ndb_mgm_call but do their own thing (get_config, get_status and friends).
ATRT: Expand testMgm to add timeout tests for the API. Fully automated. *THEORETICALLY* timing dependent: an ultra-slow network will cause problems and "fake" failures... I welcome other solutions. Tests aren't exhaustive, but cover the generics and the tricky bits. They also test some calling semantics (incl. disconnected on error). It is encouraged to add *more* mgmapi tests, not fewer :)
Also add an ERROR_codes.txt file for mgmd.
Default timeout of 30 secs for ConfigRetriever.
Default timeout of 5 sec for use by Transporter (ports etc).
Add the Ndb_cluster_connection::set_timeout() API for setting the timeout from NDBAPI applications. It should be called before connect, e.g. c.set_timeout(4200); c.connect();
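The server-side injection idea (a per-session error code that makes the server stall before continuing, so that the client's timeout fires) might look roughly like the sketch below. Everything here is hypothetical: `struct sketch_session`, `sketch_inject_delay()` and the `SKETCH_INJECT_TIMEOUT` macro are illustrative names, not the macro actually used in Services. Only the 10 second wait is taken from the notes above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical per-session state; the field name is an assumption. */
struct sketch_session {
  int error_insert;  /* injected error code, 0 when nothing is injected */
};

/* Sleep for delay_ms if the session has the given error code injected.
   Returns 1 if a delay was applied, 0 otherwise. */
static int sketch_inject_delay(const struct sketch_session *s, int code,
                               long delay_ms)
{
  struct timespec ts;

  assert(s != NULL && delay_ms >= 0);
  if (s->error_insert != code)
    return 0;

  ts.tv_sec = delay_ms / 1000;
  ts.tv_nsec = (delay_ms % 1000) * 1000000L;

  /* If a signal interrupts the sleep, continue with the remaining time. */
  while (nanosleep(&ts, &ts) != 0 && errno == EINTR) {
  }
  return 1;
}

/* Hypothetical macro form: stall for 10 seconds when the code matches. */
#define SKETCH_INJECT_TIMEOUT(session, code) \
  sketch_inject_delay((session), (code), 10000L)
```

Keeping the delay as a parameter of the helper lets a test use a short stall, while the macro keeps the fixed 10 second wait mentioned in the notes.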
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.