MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
The Query Rewrite Plugins

Why Query Rewrites?

Now that the cost model project is progressing, most of you are going to notice execution plan changes. In the vast majority of the cases, the changes will be for the better, and some bugs with a long history will finally be closed. In some cases, however, you will notice that your queries run slower. This is inevitable: even if the MySQL optimizer is doing a much better job with the information it has, it may still be the case that the information was incomplete and that the best plan was, in fact, found by not trusting that information! Normally, we would just say “add an optimizer hint” and be over with it. But sometimes you can’t do that. For instance your query could be auto-generated from an application that you have no control over. This is why you want to intervene right before the query reaches the server. And on behalf of Sweden I apologize for Ace of Base.

When Query Rewrites?

Ideally, a query re-writer should be placed right between the client and the server, to take a quick sniff at the queries coming in and dose the foul-smelling ones with fragrant optimizer hints. But in practice it would probably have to have a complete SQL parser built-in. This would cause a large performance overhead if that re-writer was running on the same machine as the server. A common request has therefore been to be able to write plugins that can pick up the queries at the server side.

We now offer two API’s for writing query rewrite plugins. The first one is for when you know exactly – as in character-by-character exactly – what the offending queries look like. This one has a hook to intercept the query string right before it’s parsed. Unsurprisingly, we call it the pre-parse rewrite plugin API. The second one comes in right after parsing and acts on the parse tree. It offers the basic functionality to walk over the parsed query, which is a lot more efficient than dealing with a string. It should not surprise anyone that we call this the post-parse rewrite plugin API. As part of the package there is also a plugin we developed called Rewriter, to which I shall devote the remainder of this post. This plugin is of the latter kind, post-parse.

How Query Rewrites?

The Rewriter plugin uses a pattern matching machinery to identify the queries that should be rewritten and what they should be rewritten to. The pattern uses wildcard symbols to match constants in the query, and those symbols may be referenced in the replacement. A bit like search-and-replace with capture. I feel the easiest way to explain it is with examples, so I’ve put together a little tutorial.

Query Rewrite Tutorial

This will be a tutorial on how to get going with your query rewrites. You shouldn’t need more than a running MySQL server and a client to get going.

I don’t recommend loading the plugin by simply doing INSTALL PLUGIN. Granted, this will install the plugin for you, but it won’t let you communicate with it. In order to get the whole package you will have to run the provided install script. You should be able to just copy-paste it in your sql client. I just pipe it to the command line client:

This will set everything up for you. If you change your mind about rewriting queries, you can just uninstall the plugin with an UNINSTALL PLUGIN command. If you want to completely cover your tracks, you can run the provided script uninstall_rewriter_plugin.sql, but be aware that this will drop all of your rewrite rules without a trace.

A simple example

Now, let’s rewrite some queries. You do this by defining rewrite rules. A rule consists of a pattern that the queries have to match and a replacement query that will replace all queries that match the pattern. So now that we’re up and running let’s write a rule and hand it to Rewriter. This being a relational database, rules are of course stored in a table. The table is called rewrite_rules and lives in the database query_rewrite. There is nothing magic about this table, it’s a table like any other.

We will start with the simplest possible example: we’ll rewrite the query SELECT constant to SELECT constant + 1. So let’s make this  rule

We have now created a rule in this table, and you may take a look at at in case you forget it later:

There are other columns in the table as well, but let’s focus on these for now. We have now created a rewrite rule, but we haven’t told Rewriter about it yet. In order to execute rewrites with a minimal effect on performance, the rules have to be loaded into memory. This is done by calling the stored procedure flush_rewrite_rules in the same database as before:

This procedure will commit  your current transaction, just like FLUSH TABLE would, and load the rules up in Rewriter’s memory. We are now ready to rewrite some queries:

Certainly not what you’d expect when writing SELECT 1! Obviously, something just happened here, so let’s note two things: First, the pattern was “SELECT ?”, yet I wrote “select 1”  – lowercase – and still it obviously matched the pattern. Rewriter, being a post-parse plugin, acts on parse trees rather than strings, so as long as the query and the pattern have the same parse tree, it is considered a match. The second thing to note is that there is a warning. So what might that be? Let’s find out:

You didn’t think Rewriter would just rewrite your queries and not tell you, did you? In fact, it will always leave this  note when it rewrites your queries. The really vigilant ones of you will also notice that the original query is spelled out exactly the way I wrote it, in lowercase. This makes the rewrites easy to track in logs.

As you saw in the above example, the rule remembers those constant values matched by the ? symbol and injects them into the replacement. The values are injected strictly left-to-right. The syntax uses the same mechanism for matching parts of queries as the prepared statement syntax does: the question mark symbol may be used to match any single literal. So in a sense patterns are like prepared statements run backwards.

Where Is My Current Database?!

Those of you that really stayed awake during the last section may actually have tried inserting that rewrite rule. You were really apt pupils, trying to be the best in your class. And what did you get for all your hard work? You probably did something like this.

Huh? Failed to load? What’s going on here I hear you saying. Indeed you followed my example to the letter and still you ran into trouble immediately. In order to fix this, let’s gather all information we can from Rewriter. It’s time to see the full contents of the rules table:

As you can see, there’s three more columns that I didn’t tell you about: enabled, pattern_database, and message. Enabled is straightforward, the default value is Y, meaning that the rule should be loaded when you load the table. You can set it to N if you want to disable a rule but wish to keep it so you don’t forget it. In this case Rewriter disabled the rule for you because there’s a problem with it. What the problem is can be seen in the message column. Now you’re thinking “but I did select a database! I ran use mydb!” In order to explain why this happens, I will need to go into a technical detail of the loading procedure. So far I’ve tried to keep digressions into the inner workings of the plugin to a minimum, but this one you really need to know about. When you tell Rewriter to load your rules (calling flush_rewrite_rules(), remember?), what it actually does is log in with its own session. Since the current database is a session-dependent setting, Rewriter has no idea what your current session is. Besides, you may have multiple clients logging in an running queries with their own current database. So Rewriter does the sensible thing and doesn’t make any assumptions about what should be considered the current database. That’s what we use the pattern_database column for. Rewriter will use that as the current database when trying to match a query to a rule. Armed with this knowledge, let’s rectify the situation:

That’s better. This means that the rule will match when you send the query with the unqualified table name mytable and the current database is mydb. Note that the rule will not match queries referencing the qualified table name. If you want to match the query regardless of qualification (and you usually do,) you will have to create two rules, one referencing mydb.mytable and one referencing mytable. This is slightly inconvenient, but it’s a price we have to pay for performance of the pattern matching. I might cover this in more detail in a later blog post, but for now let’s accept it as a fact of life.

A Real-World Example

In this section, I will cover an actual case where the optimizer lacks complete knowledge about data distribution, and makes an optimistic choice which works best on average but is really slow in the worst case. Those who encounter a worst-case data distribution case will want to nudge the optimizer in the right direction.

In this example case we have a health clinic where they frequently need to run the following query:

There are existing indexes on patientId and time.

Now, there are two ways we could go when executing this query.

  1. Do a range scan on the patientId index, then sort it on time using a filesort and finally to return the first row.
  2. Do an index scan on the time field until we find a row that matches the where clause.
Intuitively, the second approach seems to be the most efficient one in this case. Since headache is a fairly common gripe, at least among us lowly developers who spend our days hunched in front of terminals, we can assume that we will find a matching row fairly quickly. Note that the optimizer does not have the information on distribution here, so it has to make an educated guess using general heuristics. Indeed, this is how the query is executed after MySQL 5.6.
Now, consider if the query were:
The illness is different here (I dare you to pronounce it). In this case, a matching row will be rare indeed, if there is one at all. MySQL, still going with option 2, will take 20 seconds to execute this query for the user. There are proposals on how to address this, but it’s not a clear-cut case. That’s why for the time being we recommend adding a FORCE INDEX( patientIdIdx ) clause to the query. The problem is that in this particular case the query was automatically generated by a third-party application that the user couldn’t alter. The solution is now to create a rewrite rule for this particular case:

This will add the needed hint to each matching query and we’ll get a major improvement in the performance of them.

We look forward to your feedback on this new feature! Please let us know if you encounter any problems or have other general input. Thanks!

Acrocephalosyndactylia is a disease that affect one in 100, 000 people.