MySQL :: Row numbering, ranking: how to use LESS user variables in MySQL queries

User variables, even though not part of standard SQL, are great objects in MySQL. They allow to keep some “state” for the life of a session: a session can execute a statement to compute a value, store that value in a user variable, and use it in all next statements. This avoids recalculating the same value in each of the next statements, thus gaining efficiency.

For example, let’s simulate a lottery: pick up a random user, store him in the table of winners, and send him a congratulations email:

# Pick up a random user.
# Unlike column names, a user variable's name always starts with '@'.

SET @winner:=
 ( SELECT pk_id FROM users ORDER BY RAND() LIMIT 1 );

# Store in table of winners
INSERT INTO winners (id, day) VALUES(@winner, NOW());

# Send congratulations email
# with an imaginary user-defined function:

SELECT UDF_SEND_EMAIL(@winner);

# Pick up a random user.

# Unlike column names, a user variable's name always starts with '@'.

SET @winner:=

( SELECT pk_id FROM users ORDER BY RAND() LIMIT 1 );

# Store in table of winners

INSERT INTO winners (id, day) VALUES(@winner, NOW());

# Send congratulations email

# with an imaginary user-defined function:

SELECT UDF_SEND_EMAIL(@winner);

Using @winner has allowed us to run the sub-query-with-RAND once instead of twice; that’s faster, and more correct: running it twice would have generated potentially two different winners for the INSERT and the email sending, causing a logic error in our design.

This is all nice and neat. The above example clearly separates phases:

phase 1 sets the @winner variable
phase 2 reads @winner in next two statements.

But it’s also possible to set and read a user variable in a single statement. The use case is: to have a “moving value” which changes for every row, based on the value which it had for the previous row (this assumes a notion of “previous row” defined by some ordering).

Here’s a classical example which generates row numbers (1, 2, …) according to some order. I have a table of famous people:

CREATE TABLE people (name VARCHAR(100), birthdate DATE, sex CHAR(1),
                     citizenship CHAR(2));

INSERT INTO people VALUES
("Jimmy Hendrix", "19421127", "M", "US"),
("Herbert G Wells", "18660921", "M", "UK"),
("Angela Merkel", "19540717", "F", "DE"),
("Rigoberta Menchu", "19590109", "F", "GT"),
("Georges Danton", "17591026", "M", "FR"),
("Tracy Chapman", "19640330", "F", "US");

CREATE TABLE people (name VARCHAR(100), birthdate DATE, sex CHAR(1),

citizenship CHAR(2));

INSERT INTO people VALUES

("Jimmy Hendrix", "19421127", "M", "US"),

("Herbert G Wells", "18660921", "M", "UK"),

("Angela Merkel", "19540717", "F", "DE"),

("Rigoberta Menchu", "19590109", "F", "GT"),

("Georges Danton", "17591026", "M", "FR"),

("Tracy Chapman", "19640330", "F", "US");

I want to order and number them by birth date:

SET @rownum:=0;

SELECT @rownum:=(@rownum+1) AS num, name, birthdate
FROM people
ORDER BY birthdate;

+--------+------------------+------------+
| num    | name             | birthdate  |
+--------+------------------+------------+
|      1 | Georges Danton   | 1759-10-26 |
|      2 | Herbert G Wells  | 1866-09-21 |
|      3 | Jimmy Hendrix    | 1942-11-27 |
|      4 | Angela Merkel    | 1954-07-17 |
|      5 | Rigoberta Menchu | 1959-01-09 |
|      6 | Tracy Chapman    | 1964-03-30 |
+--------+------------------+------------+

SET @rownum:=0;

SELECT @rownum:=(@rownum+1) AS num, name, birthdate

FROM people

ORDER BY birthdate;

+--------+------------------+------------+

| num | name | birthdate |

+--------+------------------+------------+

| 1 | Georges Danton | 1759-10-26 |

| 2 | Herbert G Wells | 1866-09-21 |

| 3 | Jimmy Hendrix | 1942-11-27 |

| 4 | Angela Merkel | 1954-07-17 |

| 5 | Rigoberta Menchu | 1959-01-09 |

| 6 | Tracy Chapman | 1964-03-30 |

+--------+------------------+------------+

It works.

I add: surprisingly. Let me explain.

Standard SQL says that we should join tables (here we have only one, people), put the result of the join in an imaginary table, then evaluate selected expressions for each row, forming yet another imaginary table, then sort that table and return it to the user.

Yes, in theory ordering is the very final stage of execution, after evaluation of the selected expressions for all matching rows (which is logical, as ORDER BY is allowed to reference selected expressions by alias). So, if we follow this, we read people, evaluate selected expressions for each row, so calculate a row number for each row, and finally we order the result by birth date. Being calculated before ordering, row numbers would have no reason to be according to the order.

Fortunately, here, MySQL sees that there is only one table and that the ordering expression is a column of this table, so it makes an optimization: it first orders the rows of people, then reads this ordered result and, for each row, calculates selected expressions (including @rownum). So @rownum increases in order, as desired. But, as noted, it’s relying on an ORDER BY optimization done by MySQL, which is not guaranteed to remain constant in the future (who knows? we revisit heuristic and cost-based optimization decisions from time to time). Moreover, if later an application developer modifies our query, this modification may lead MySQL to change its optimization strategy and do ordering after evaluating the selected expressions, totally breaking @rownum’s logic. To demonstrate it, let’s say this unaware developer just wants to add to our query a column displaying the full name of the country of citizenship – pretty innocent, uh? He first makes a table with full names:

CREATE TABLE country(code CHAR(2), name VARCHAR(100));

INSERT INTO country VALUES
("GT", "Guatemala"),
("DE", "Germany"),
("FR", "France"),
("UK", "United Kingdom"),
("US", "United States of America");

CREATE TABLE country(code CHAR(2), name VARCHAR(100));

INSERT INTO country VALUES

("GT", "Guatemala"),

("DE", "Germany"),

("FR", "France"),

("UK", "United Kingdom"),

("US", "United States of America");

and now he edits our row-numbering query, joining people with country:

SET @rownum:=0;

SELECT @rownum:=(@rownum+1) AS num, people.name,
       people.birthdate, country.name
FROM people JOIN country ON people.citizenship=country.code
ORDER BY people.birthdate;

+--------+------------------+------------+--------------------------+
| num    | name             | birthdate  | name                     |
+--------+------------------+------------+--------------------------+
|      5 | Georges Danton   | 1759-10-26 | France                   |
|      2 | Herbert G Wells  | 1866-09-21 | United Kingdom           |
|      1 | Jimmy Hendrix    | 1942-11-27 | United States of America |
|      3 | Angela Merkel    | 1954-07-17 | Germany                  |
|      4 | Rigoberta Menchu | 1959-01-09 | Guatemala                |
|      6 | Tracy Chapman    | 1964-03-30 | United States of America |
+--------+------------------+------------+--------------------------+

SET @rownum:=0;

SELECT @rownum:=(@rownum+1) AS num, people.name,

people.birthdate, country.name

FROM people JOIN country ON people.citizenship=country.code

ORDER BY people.birthdate;

+--------+------------------+------------+--------------------------+

+--------+------------------+------------+--------------------------+

| 5 | Georges Danton | 1759-10-26 | France |

| 2 | Herbert G Wells | 1866-09-21 | United Kingdom |

| 1 | Jimmy Hendrix | 1942-11-27 | United States of America |

| 3 | Angela Merkel | 1954-07-17 | Germany |

| 4 | Rigoberta Menchu | 1959-01-09 | Guatemala |

| 6 | Tracy Chapman | 1964-03-30 | United States of America |

+--------+------------------+------------+--------------------------+

num is plain wrong! Because the Optimizer has chosen a plan which does ordering last, thus after evaluation of user variables. A non-limiting list of factors which can disturb num are: join order, use of join buffering, use of temporary table, GROUP BY…

Going further… to avoid the round-trip which sends the initializing SET statement, some clever users (no irony on “clever”) bundle two in one with this trick:

SELECT @rownum:=(@rownum+1) AS num, name, birthdate
FROM (SELECT @rownum:=0) AS initialization, people
ORDER BY birthdate;

SELECT @rownum:=(@rownum+1) AS num, name, birthdate

FROM (SELECT @rownum:=0) AS initialization, people

ORDER BY birthdate;

It works. It relies on the fact that MySQL evaluates the content of the derived table named initialization. But in the future, MySQL may become smarter and say that “this derived table isn’t used by the statement so I can just skip its evaluation”. And it would thus skip the initialization.

The point I’m making is: these queries make assumptions which may hold today, but are not future-proof. If they hold today we have no conscious intention to break them, but one day we unavoidably will, as old code gets refactored, as new SQL features are implemented, as we come closer to the SQL standard. What has never been explicitly guaranteed may change. The accidental change may come from our side, or from your side (see the example with the citizenship’s full name and the join).

Our manual cautiously warns about this:

” As a general rule, other than in SET statements, you should never assign a value to a user variable and read the value within the same statement. For example, to increment a variable, this is okay: SET @a = @a + 1. For other statements, such as SELECT, you might get the results you expect, but this is not guaranteed.”

I hear someone saying: “you’re criticizing, but there’s no other way to get numbered rows in MySQL!“. Here are good news: in MySQL 8.0, there finally is another way: window functions. Here’s a query using the ROW_NUMBER window function:

SELECT ROW_NUMBER() OVER (ORDER BY birthdate) AS num,
       name, birthdate
FROM people;

+--------+------------------+------------+
| num    | name             | birthdate  |
+--------+------------------+------------+
|      1 | Georges Danton   | 1759-10-26 |
|      2 | Herbert G Wells  | 1866-09-21 |
|      3 | Jimmy Hendrix    | 1942-11-27 |
|      4 | Angela Merkel    | 1954-07-17 |
|      5 | Rigoberta Menchu | 1959-01-09 |
|      6 | Tracy Chapman    | 1964-03-30 |
+--------+------------------+------------+

SELECT ROW_NUMBER() OVER (ORDER BY birthdate) AS num,

name, birthdate

FROM people;

+--------+------------------+------------+

| num | name | birthdate |

+--------+------------------+------------+

| 1 | Georges Danton | 1759-10-26 |

| 2 | Herbert G Wells | 1866-09-21 |

| 3 | Jimmy Hendrix | 1942-11-27 |

| 4 | Angela Merkel | 1954-07-17 |

| 5 | Rigoberta Menchu | 1959-01-09 |

| 6 | Tracy Chapman | 1964-03-30 |

+--------+------------------+------------+

It gives the right result. If you are new to window functions’ syntax, you may want to check the manual or Dag’s blog.

Unlike for the query with a user variable, adding a join to the query with a window function doesn’t disturb num at all:

SELECT ROW_NUMBER() OVER (ORDER BY people.birthdate) AS num,
       people.name, people.birthdate, country.name
FROM people JOIN country ON people.citizenship=country.code;

+--------+------------------+------------+--------------------------+
| num    | name             | birthdate  | name                     |
+--------+------------------+------------+--------------------------+
|      1 | Georges Danton   | 1759-10-26 | France                   |
|      2 | Herbert G Wells  | 1866-09-21 | United Kingdom           |
|      3 | Jimmy Hendrix    | 1942-11-27 | United States of America |
|      4 | Angela Merkel    | 1954-07-17 | Germany                  |
|      5 | Rigoberta Menchu | 1959-01-09 | Guatemala                |
|      6 | Tracy Chapman    | 1964-03-30 | United States of America |
+--------+------------------+------------+--------------------------+

SELECT ROW_NUMBER() OVER (ORDER BY people.birthdate) AS num,

people.name, people.birthdate, country.name

FROM people JOIN country ON people.citizenship=country.code;

+--------+------------------+------------+--------------------------+

+--------+------------------+------------+--------------------------+

| 1 | Georges Danton | 1759-10-26 | France |

| 2 | Herbert G Wells | 1866-09-21 | United Kingdom |

| 3 | Jimmy Hendrix | 1942-11-27 | United States of America |

| 4 | Angela Merkel | 1954-07-17 | Germany |

| 5 | Rigoberta Menchu | 1959-01-09 | Guatemala |

| 6 | Tracy Chapman | 1964-03-30 | United States of America |

+--------+------------------+------------+--------------------------+

A similar use case is getting a row number inside the group; let’s say I want to group by sex; for that, clever users have written the following query, which detects if the sex is the same as the previous row; if it is, we’re in the same group as the previous row so increase the counter, otherwise we are entering a new group so reset it to 1:

SET @prev_sex:='';

SELECT
@rownum:=
(
  CASE WHEN (sex=@prev_sex) THEN @rownum+1
  ELSE 1
  END
) AS num,
(@prev_sex:=sex),
name, birthdate
FROM people
ORDER BY sex, birthdate;

+------+------------------+------------------+------------+
| num  | (@prev_sex:=sex) | name             | birthdate  |
+------+------------------+------------------+------------+
|    1 | F                | Angela Merkel    | 1954-07-17 |
|    2 | F                | Rigoberta Menchu | 1959-01-09 |
|    3 | F                | Tracy Chapman    | 1964-03-30 |
|    1 | M                | Georges Danton   | 1759-10-26 |
|    2 | M                | Herbert G Wells  | 1866-09-21 |
|    3 | M                | Jimmy Hendrix    | 1942-11-27 |
+------+------------------+------------------+------------+

SET @prev_sex:='';

SELECT

@rownum:=

(

CASE WHEN (sex=@prev_sex) THEN @rownum+1

ELSE 1

END

) AS num,

(@prev_sex:=sex),

name, birthdate

FROM people

ORDER BY sex, birthdate;

+------+------------------+------------------+------------+

+------+------------------+------------------+------------+

| 1 | F | Angela Merkel | 1954-07-17 |

| 2 | F | Rigoberta Menchu | 1959-01-09 |

| 3 | F | Tracy Chapman | 1964-03-30 |

| 1 | M | Georges Danton | 1759-10-26 |

| 2 | M | Herbert G Wells | 1866-09-21 |

| 3 | M | Jimmy Hendrix | 1942-11-27 |

+------+------------------+------------------+------------+

It works. But it relies on the fact that MySQL sets @rownum before setting @prev_sex, i.e. evaluates selected expressions from left to right. Which is true for this query, but not for all queries. As the manual says: “the order of evaluation for expressions involving user variables is undefined“. The same result is now achievable with a PARTITION sub-clause for the window:

SELECT ROW_NUMBER() OVER (PARTITION BY sex ORDER BY birthdate) AS num,
sex, name, birthdate
FROM people;

+--------+------+------------------+------------+
| num    | sex  | name             | birthdate  |
+--------+------+------------------+------------+
|      1 | F    | Angela Merkel    | 1954-07-17 |
|      2 | F    | Rigoberta Menchu | 1959-01-09 |
|      3 | F    | Tracy Chapman    | 1964-03-30 |
|      1 | M    | Georges Danton   | 1759-10-26 |
|      2 | M    | Herbert G Wells  | 1866-09-21 |
|      3 | M    | Jimmy Hendrix    | 1942-11-27 |
+--------+------+------------------+------------+

SELECT ROW_NUMBER() OVER (PARTITION BY sex ORDER BY birthdate) AS num,

sex, name, birthdate

FROM people;

+--------+------+------------------+------------+

| num | sex | name | birthdate |

+--------+------+------------------+------------+

| 1 | F | Angela Merkel | 1954-07-17 |

| 2 | F | Rigoberta Menchu | 1959-01-09 |

| 3 | F | Tracy Chapman | 1964-03-30 |

| 1 | M | Georges Danton | 1759-10-26 |

| 2 | M | Herbert G Wells | 1866-09-21 |

| 3 | M | Jimmy Hendrix | 1942-11-27 |

+--------+------+------------------+------------+

What are the benefits of the new approach with window functions? First, it’s a much shorter query. Second, it’s standard-compliant so:
– the result is predictable and guaranteed in future versions,
– the query is easily portable to other database software,
– the query is easily understood by database developers not yet familiar with MySQL.

Possibly some especially tricky and smart queries still need user variables read-and-set techniques, or possibly some perform better than their equivalent with window functions (please, let me know about them… we’re eager to know what to optimize).

But as a rule of thumb, this old black-magic practice should not be necessary anymore, and is advantageously replaced with window functions ROW_NUMBER, RANK, LEAD, LAG, FIRST_VALUE…

That’s it for today. Thank you for using MySQL!