WL#7742: Slave performance: Waiting for more transactions to enter Binlog Group Commit (BGC) queues

Affects: Server-5.7   —   Status: Complete


This worklog adds two new options to introduce an artificial delay to
make the binary log group commit procedure wait. This gives a chance
that more transactions are flushed and synced together to disk, thus
reducing the overall time spent to commit a group of transactions (the
bigger the groups, the fewer the sync operations). With the correct
tuning, this can make the slave perform several times faster without
compromising the master's throughput.
These options are named binlog-group-commit-sync-delay and
binlog-group-commit-sync-no-delay-count. The former takes as input the
number of microseconds to wait. The latter takes as input the number
of transactions that the server waits for, before deciding to abort the

- PB2 entry:
- RB entry: http://rb.no.oracle.com/rb/r/4965/

Functional requirements:

F1: The new options SHALL be dynamically changeable.

F2: The new options SHALL only have GLOBAL scope. It does not make
    sense to have SESSION scope.

F3: Zero SHALL be an allowed value for the new options. It means that
    the corresponding feature is deactivated.

F4: binlog-group-commit-sync-delay does not interact directly with the
    existing binlog-max-flush-queue-time option.

F5: binlog-group-commit-sync-no-delay-count does not interact directly
    with the existing binlog-max-flush-queue-time option.

F6: If binlog-group-commit-sync-no-delay-count is zero, the server 
    SHALL wait the entire binlog-group-commit-sync-delay to elapse.

F7: If binlog-group-commit-sync-delay is zero, the server SHALL ignore

This worklog presents two new options that allow the user to introduce a
delay in the binlog group commit pipeline, just before the server
starts processing the flush or sync queue.

Roughly, the pipeline of the binlog group commit (BGC) is divided into
three stages: FLUSH, SYNC and COMMIT. Each stage has a leader thread
that handles the procedure on behalf of other threads that are
parked. Once all stages are done, the leader thread notifies the
parked threads that they have either been committed successfully or
ended up in errors. In the FLUSH stage, the leader thread keeps
flushing incoming threads' caches until the queue of that stage
becomes empty or the binlog-max-flush-queue-time elapses. In the SYNC
stage, the leader thread syncs the binary log file to disk. In the
COMMIT stage, the leader thread commits transactions to the engine and
notifies parked transactions that they should wake up and resume their
after-commit operations and cleanup.
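
The leader/follower split described above can be sketched roughly as
follows. This is a simplified, single-threaded illustration and not the
server code: the class name StageQueue and its methods are hypothetical,
and the rule that the first session to enter an empty queue becomes the
leader is stated here as an assumption of the sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for one BGC stage queue.
// The first session to enroll in an empty queue becomes the leader;
// later sessions are followers and would park until the leader is done.
class StageQueue {
 public:
  // Returns true if the caller became the stage leader.
  bool enroll(int session_id) {
    const bool is_leader = m_sessions.empty();
    m_sessions.push_back(session_id);
    return is_leader;
  }

  // The leader takes the whole queue and processes it as one group.
  std::vector<int> fetch_and_empty() {
    std::vector<int> group;
    group.swap(m_sessions);
    return group;
  }

  std::size_t size() const { return m_sessions.size(); }

 private:
  std::vector<int> m_sessions;
};
```

In this sketch, every session that enrolls before the leader calls
fetch_and_empty() ends up in the same group, which is why delaying the
leader can produce larger groups.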

The idea behind this worklog is to introduce a delay on the flush or
sync stage, before the leader thread actually starts processing the
flush queue or syncs the binary log file. The exact place where to put
the wait needs to be determined by running some tests.

With an additional wait by the leader thread, there is a chance that
more sessions gather on the flush queue and thus are eligible to be
flushed, synced and committed as part of the same/bigger group commit.
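
As a rough illustration of why bigger groups help, the number of sync
operations can be estimated under the idealized assumption that every
group carries the same number of transactions. The helper name
syncs_needed is hypothetical and exists only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Number of sync operations needed to commit `transactions` when each
// group commit carries `group_size` transactions (rounded up).
// Idealized: real group sizes vary from one group to the next.
inline std::uint64_t syncs_needed(std::uint64_t transactions,
                                  std::uint64_t group_size) {
  assert(group_size > 0);  // a group holds at least one transaction
  return (transactions + group_size - 1) / group_size;
}
```

For example, 1000 transactions committed one at a time need 1000 syncs,
while groups of 10 need only 100.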

This should increase performance on the master in some cases, and also
on the multi-threaded applier on the slave side. The reason is that the
wait also increases the chance that more transactions prepare at the
same logical point in time, so more transactions become eligible to be
scheduled in parallel on the slave.


A new CLI option and System Variable is introduced to configure the

  NAME: binlog-group-commit-sync-delay
  RANGE: 0 - 1000000 (usec)

A new CLI option and System Variable is introduced to configure the
maximum number of sessions to wait for before exiting the
waiting procedure.

  NAME: binlog-group-commit-sync-no-delay-count
  RANGE: 0 - 100000 (max connections)

3. BEHAVIOR OF binlog-group-commit-sync-delay

This variable controls whether the leader thread for the flush or sync
stage of the binlog group commit waits before actually processing the
stage queue or syncing the binary log file. This gives a chance for
more follower threads to reach the flush queue.

It is as simple as introducing a sleep operation before the procedure
to process the queue or before syncing the binary log file.

4. BEHAVIOR OF binlog-group-commit-sync-no-delay-count

This variable sets the maximum number of sessions to wait for before
aborting the current wait procedure as specified by 
binlog-group-commit-sync-delay. If binlog-group-commit-sync-delay is 
set to 0, then this option is a no-op.

As such, setting this option means that the server exits the wait procedure
if the number of sessions reaches the count given before the timeout elapses.
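
The combined behavior of the two options can be sketched as a small
simulation. This is not the server implementation: simulate_wait,
WaitResult, the step_usec parameter and the queue_size_at callback are
all hypothetical names introduced for illustration, and time is simulated
rather than slept.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

// Outcome of one simulated wait.
struct WaitResult {
  unsigned long waited_usec;  // simulated time spent waiting
  bool stopped_by_count;      // true if the session count ended the wait
};

// Simulates the leader's wait. `queue_size_at(t)` is a hypothetical
// callback returning how many sessions are queued after t microseconds.
// A `no_delay_count` of zero means "wait for the whole delay".
// A `delay_usec` of zero means "do not wait at all".
WaitResult simulate_wait(
    unsigned long delay_usec, unsigned long no_delay_count,
    unsigned long step_usec,
    const std::function<unsigned long(unsigned long)>& queue_size_at) {
  assert(step_usec > 0);  // a zero step would never make progress
  unsigned long waited = 0;
  while (waited < delay_usec) {
    if (no_delay_count != 0 && queue_size_at(waited) >= no_delay_count)
      return {waited, true};  // enough sessions piled up: stop early
    waited += std::min(step_usec, delay_usec - waited);
  }
  return {waited, false};  // the full delay elapsed (or was zero)
}
```

With a delay of 1000 usec, a count of 3 and one session arriving every
100 usec, the simulated wait ends after 300 usec. With a count of zero
it always lasts the full 1000 usec, and with a delay of zero the count
is never consulted.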

5. INTERACTION BETWEEN binlog-group-commit-sync-delay and

binlog-group-commit-sync-delay is not processed while the flush queue is
being processed; therefore, it does not interact with
binlog-max-flush-queue-time. The behavior of this option remains

6. INTERACTION BETWEEN binlog-group-commit-sync-no-delay-count and

binlog-group-commit-sync-no-delay-count is not processed while the flush
queue is being processed; therefore, it does not interact with
binlog-max-flush-queue-time. The behavior of this option remains

The following details the major low-level changes needed to implement the
design blocks described in the previous sections.

1. Introduce a waiting mechanism on the FLUSH or SYNC stage for the leader
   thread to wait according to the value of binlog-group-commit-sync-delay. 
   Similar to the following:

Performance tests will dictate the best place where to put this wait.

DECISION: Given that basic tests did not show a difference between the
          two approaches, let's wait before SYNCing. If some difference
          is found later, we can still move the wait point to before
          the flush procedure.

                           thd->thread_id, thd->commit_error));
+  /* Shall introduce a delay. */
+  stage_manager.wait_count_or_timeout(
+                           opt_binlog_group_commit_sync_no_delay_count,
+                           opt_binlog_group_commit_sync_delay,
+                           Stage_manager::SYNC_STAGE);
   THD *final_queue= stage_manager.fetch_queue_for(Stage_manager::SYNC_STAGE);
   if (flush_error == 0 && total_bytes > 0)


+  time_t wait_count_or_timeout(ulong count, time_t usec, StageID stage)
+  {
+    time_t to_wait= usec;
+    time_t delta= static_cast<time_t>(to_wait * 0.1);
+    while ((static_cast<ulong>(m_queue[stage].get_size()) < count ||
+            count == 0) && to_wait > 0)
+    {
+      my_sleep(delta);
+      to_wait -= delta;
+    }
+    return to_wait;
+  }
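
One detail of the loop above is worth noting: the step is 10% of the
delay and the cast truncates it, so for a delay below 10 usec the step
would be zero and to_wait would not shrink. A minimal sketch of one
possible guard follows. The helper name wait_step_usec, the 1 usec floor
and the use of integer division instead of the floating-point multiply
are assumptions of this example, not part of the worklog.

```cpp
#include <algorithm>
#include <ctime>

// Step used between queue-size checks: one tenth of the delay, but
// never less than 1 usec so that the remaining time always shrinks.
// The 1 usec floor is an assumption made for this sketch.
inline time_t wait_step_usec(time_t delay_usec) {
  return std::max<time_t>(1, delay_usec / 10);
}
```

With this helper, a delay of 1000 usec gives a step of 100 usec, and a
delay of 5 usec gives a step of 1 usec instead of 0.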

2. Introduce code for the new variables, similar to the following:

+static Sys_var_ulong Sys_binlog_group_commit_sync_delay(
+       "binlog_group_commit_sync_delay",
+       "The number of microseconds the server waits for the "
+       "binary log group commit sync queue to fill before "
+       "continuing. Default: 0. Min: 0. Max: 1000000.",
+       GLOBAL_VAR(opt_binlog_group_commit_sync_delay),
+       VALID_RANGE(0, 1000000 /* max 1 sec */), DEFAULT(0), BLOCK_SIZE(1),
+static Sys_var_ulong Sys_binlog_group_commit_sync_no_delay_count(
+       "binlog_group_commit_sync_no_delay_count",
+       "If there are this many transactions in the commit sync "
+       "queue and the server is waiting for more transactions "
+       "to be enqueued (as set using --binlog-group-commit-sync-delay), "
+       "the commit procedure resumes.",
+       GLOBAL_VAR(opt_binlog_group_commit_sync_no_delay_count),
+       VALID_RANGE(0, 100000 /* max connections */),
+       DEFAULT(0), BLOCK_SIZE(1),