WL#2796: Add slave_status_wait(thread, status, timeout) for more reliable rpl tests
Affects: Server-7.1
—
Status: Un-Assigned
Suggestion from Sasha, to be implemented sometime after 5.0:

There are a fair number of reports of "race conditions": replication tests that fail sporadically and only on certain platforms. The root of the problem is the use of sleep with a fixed number of seconds to wait for some particular event, usually a slave thread stopping on an error that we have caused on purpose, and then expecting the condition to have actually occurred. We test for the condition, and if it has not happened, we consider this an error. Some platforms may have an unusual scheduler or thread implementation, a slow CPU, or just be loaded so heavily that the given number of seconds is not enough.

This problem can be solved by implementing a few server functions similar to master_pos_wait() that do not return until the condition is reached or until we time out, with the default timeout being something like 60 seconds. Actually, I think we really need only one: slave_status_wait(thread_type, status_value, timeout). It will return when the specified thread has reached the specified status.

Lars suggests that the function take an extra argument for the named_master (or the keyword ALL), so that it works without a syntax change with multi-source in 5.1, i.e. slave_status_wait(named_master, thread_type, status_value, timeout).

Magnus suggests that the function should be generic and allow us to wait for any SQL query to return a particular answer. Something like: wait_until("SELECT a from t1", 60);

Unfortunately, not all SQL statements for replication are "selectable" yet, I guess, like SHOW MASTER STATUS, but maybe we can do that in a smart way? Examples:

select position from show master status;

or

create temporary table #temp_stat as show master status;

and then use wait_until("select position from #temp_stat", 60) from the test script.
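The difference between a fixed sleep and a wait with a timeout can be illustrated with a small client-side sketch. This is not the proposed server function: it only polls from the outside, and the names wait_until, FakeSlave and sql_thread_status are hypothetical, chosen for illustration. It assumes the status can be read through some callable and compared against an expected value.

```python
import time


def wait_until(check, expected, timeout=60.0, interval=0.1):
    """Poll check() until it returns `expected` or `timeout` seconds elapse.

    Returns True if the expected value was observed, False on timeout.
    Hypothetical helper: a client-side stand-in for the proposed
    server-side wait, not an existing MySQL API.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeSlave:
    """Simulated slave whose SQL thread stops after a number of polls.

    Stands in for a real status query; entirely hypothetical.
    """

    def __init__(self, polls_until_stopped):
        self.remaining = polls_until_stopped

    def sql_thread_status(self):
        if self.remaining > 0:
            self.remaining -= 1
            return "running"
        return "stopped"


if __name__ == "__main__":
    slave = FakeSlave(polls_until_stopped=3)
    # Returns as soon as the status is reached, however long that takes,
    # instead of sleeping a fixed time and hoping it was enough.
    reached = wait_until(slave.sql_thread_status, "stopped",
                         timeout=5.0, interval=0.01)
    print("reached" if reached else "timed out")
```

A test written this way fails only when the condition is not reached within the whole timeout, so a slow or heavily loaded platform costs extra waiting time rather than a spurious failure.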
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.