If you have used SQL a bit, you are certainly familiar with so-called set functions or aggregate functions COUNT, SUM, AVG, described in the manual. For example, let’s say that I am the owner of a shop and I keep track of daily sales in this table:
1 |
create table sales (month int, day int, amount int); |
The first column is the number of the month, between 1 and 12, the second column is the number of the day in the month, between 1 and 31, and the third column is how much we sold on that date. Let’s add some example rows for a few days of January, March, June and December:
1 2 3 4 5 6 7 8 |
insert into sales values (1,3,9000), (1,5,8000), (3,15,20000), (6,14,110123), (6,18,3000), (12, 20, 101000), (12, 25, 123456); |
I can get a summary of monthly sales like this:
1 |
select month, sum(amount) from sales group by month; |
That will yield one row per month: in the first column there will be the number of the month, and in the second column, how much we sold on that month:
1 2 3 4 5 6 7 8 |
+-------+-------------+ | month | sum(amount) | +-------+-------------+ | 1 | 17000 | | 3 | 20000 | | 6 | 113123 | | 12 | 224456 | +-------+-------------+ |
Today I would like to discuss some less known set functions defined by the SQL standard. Let’s start with SOME (or it synonym ANY). A word of caution first: the topic is set functions; SOME and ANY here are not to be confused with subquery predicates like =SOME(subquery) or >ANY(subquery)!
If I want to find months which had at least one day when we sold over 100,000, I could write:
1 |
select month, some(amount>100000) from sales group by month; |
SOME returns true if some row of the group matched the condition.
In other words, for each group (each month), SOME scans all days of the group (the month), searching for one day which satisfies the inequality condition, and if it finds at least one, it returns TRUE for the month. The months I’m interested in will thus be those having a TRUE value in the second column of the result.
In MySQL, SOME isn’t supported, but it’s very easy to emulate:
1 2 3 4 5 6 7 8 9 |
select month, sum(amount>100000) > 0 from sales group by month; +-------+------------------------+ | month | sum(amount>100000) > 0 | +-------+------------------------+ | 1 | 0 | | 3 | 0 | | 6 | 1 | | 12 | 1 | +-------+------------------------+ |
How does that work? For a given month, the inequality condition is evaluated for all days of the month; for each day it returns TRUE or FALSE. Because they are inside SUM which wants numbers, those Boolean values are cast respectively to number 1 and 0. Thus, the sum is greater than 0 if and only if some day of this month matched the condition. It is exactly equivalent to SOME.
Now, let’s look at EVERY: if I want to find months for which, on all days, we sold below 10,000 (maybe I want to decide upon when to take vacation and close the shop?), I could write:
1 |
select month, every(amount<10000) from sales group by month; |
EVERY returns true if every day of the month matches the condition. The months I’m interested in will thus be those having a TRUE value in the second column of the result.
In MySQL, EVERY isn’t supported either, but we can get the same result with:
1 2 3 4 5 6 7 8 9 |
select month, sum(not(amount<10000)) = 0 from sales group by month; +-------+----------------------------+ | month | sum(not(amount<10000)) = 0 | +-------+----------------------------+ | 1 | 1 | | 3 | 0 | | 6 | 0 | | 12 | 0 | +-------+----------------------------+ |
In other words, this is counting rows which violate the condition, and checking if their count is zero (i.e. if all rows match the condition).
You might say that in my reasoning, I forgot about NULL values. Ok, what if the amount column contains a NULL in a certain row? Then the inequality condition will evaluate to NULL for this row. SOME and EVERY ignore this row, says the standard. But, SUM also ignores it, so there’s no difference. And the NOT above is no worry, as the negation of NULL is NULL. There is no problem!
As an alternative, it’s also possible to use bitwise set functions instead of SUM:
1 |
select month, some(amount>100000) from sales group by month; |
is equivalent to
1 |
select month, bit_or(amount>100000) from sales group by month; |
And
1 |
select month, every(amount<10000) from sales group by month; |
is equivalent to
1 |
select month, bit_and(amount<10000) from sales group by month; |
Pretty simple: just replace SOME with BIT_OR, and EVERY with BIT_AND! How does it work? Again, the condition is tested, producing a TRUE/FALSE/NULL result. If we are using BIT_OR: this function wants numbers, so the Boolean values are cast to 1/0/NULL. Then all values of the group are or-ed together (after ignoring NULLs); the result is 1 if at least one row had a value of 1 (the OR of a single 1 and a thousand 0s is 1); otherwise the result is 0. Regarding BIT_AND, the result is 1 if all rows had a value of 1 (the AND of a single 0 and a thousand 1s is 0). Isn’t it beautiful?
Even though BIT_OR an BIT_AND provide the simplest workaround, they do have a little catch, which fortunately you are unlikely to meet. It has to do with the question “what if there are no meaningful values to aggregate?”. It can manifest itself only if a group has no non-NULL values. In other words, if:
- there is GROUP BY and one group has only NULL amount values
- Or
- there is no GROUP BY (so the single “group” is the result of WHERE), and
- the result of WHERE has only NULL amount values or is empty.
Here is an example for the first bullet. Sales of Sep 23 are not yet known, so someone inserts NULL, which here stands for “unknown”:
1 |
insert into sales values (9, 23, NULL); |
Now, I repeat my question:
1 |
select month, some(amount>100000) from sales group by month; |
For September, SOME would return NULL. The equivalent query using SUM returns NULL, too. That’s what set functions must do, per the Standard: they first ignore NULLs, and if nothing remains, they return NULL. And indeed, this NULL answer, evoking “unknown answer” to us, looks appropriate, as we don’t know the sales for Sep 23 so cannot know if they exceed 100,000 or not. However, BIT_OR returns 0. Awkward… so, one point for SUM. But… let’s also say that sales for another day, Sep 22, are known:
1 |
insert into sales values (9, 22, 9000); |
Repeat the grouped query again: for September, SOME and SUM now return FALSE, just like BIT_OR:
1 2 3 4 5 6 7 8 9 10 |
select month, sum(amount>100000) > 0 from sales group by month; +-------+------------------------+ | month | sum(amount>100000) > 0 | +-------+------------------------+ | 1 | 0 | | 3 | 0 | | 6 | 1 | | 9 | 0 | <<<<<< here! | 12 | 1 | +-------+------------------------+ |
For the DBMS it’s logical: ignore NULL of Sep 23, 9000 of Sep 22 remains, it doesn’t match, so: return FALSE. But for us it may not be intuitive: the addition of a non-matching row changed the result from NULL to FALSE! Even though we still do not know if the sales for Sep 23 exceed 100,000! That brings me to the conclusion: just beware of inserting NULLs and aggregating them, as results are not always intuitive…
Now here is an example for the second bullet, in the “empty result” case. I want to check that there was a day in April this year where we sold for more than 100,000:
1 |
select some(amount>100000) from sales where month=4; |
There are no rows for April in my table, so this will return NULL. That’s what the standard requires. If I may humbly cast an opinion, I find it a bit weird: SOME is usually telling if some row, at least one, matched the condition; if the table is empty, then it is certain that no row matched the condition, so I would expect an unambiguous result of FALSE 1. But I’m ready to accept that more competent people thought differently. Anyway, if you like the standard behavior, then you should use the workaround with SUM; indeed, in MySQL as in the standard, SUM over an empty table returns NULL 2. On the contrary, if like me you would expect SOME(empty) to be FALSE, then you can freely use BIT_OR, as BIT_OR over an empty table returns 0 ; or you can wrap SUM in COALESCE:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
select sum(amount>100000) > 0 from sales where month=4; +------------------------+ | sum(amount>100000) > 0 | +------------------------+ | NULL | +------------------------+ select bit_or(amount>100000) from sales where month=4; +-----------------------+ | bit_or(amount>100000) | +-----------------------+ | 0 | +-----------------------+ select coalesce(sum(amount>100000) > 0, 0) from sales where month=4; +-------------------------------------+ | coalesce(sum(amount>100000) > 0, 0) | +-------------------------------------+ | 0 | +-------------------------------------+ |
Likewise, if I want to know if on all days of April we sold for less than 10,000:
1 |
select every(amount<10000) from sales where month=4; |
If my table is empty this will return NULL. I also find a bit weird: at least in Math lessons, the empty set has all properties, so I would expect a result of TRUE 3. If you want to get NULL, use SUM. If you want to get TRUE, use BIT_AND, as BIT_AND over an empty table returns the largest possible integer (18446744073709551615; just add “<>0” after BIT_AND to make it a 0/1 result). Or wrap SUM in COALESCE.
That’s all for today. The set functions SOME and EVERY are not greatly known, but if you’re migrating to MySQL from a DBMS which features them, the above solutions may help get the job done.
And as always, Thank You for using MySQL!
Notes
1. By the way, if you are using SOME, not as an aggregate function, but as a quantified subquery predicate, like select … from … where a=SOME(subquery), and the subquery returns an empty result, then a=SOME(subquery) will return FALSE, in the standard and in MySQL. Not NULL. Supporting the position that the value of column “a” cannot possibly be equal to something in the empty set…
2. Beats me… Math lessons tell that a sum over an empty set is 0. So that, for example, the sum over the union of two disjoint sets is equal to the sum of the sums over each set… which isn’t true in SQL:
1 2 3 4 5 6 7 8 9 10 11 12 |
select (select sum(amount) from sales where month=4) + (select sum(amount) from sales where month=6) as April_plus_June; +-----------------+ | April_plus_June | +-----------------+ | NULL | # because NULL + non-NULL integer = NULL +-----------------+ select (select sum(amount) from sales where month in (4,6)) as April_plus_June; +-----------------+ | April_plus_June | +-----------------+ | 113123 | +-----------------+ |
But hey, there were probably good reasons for NULL… I just don’t know them 🙂
3. And if you are using the ALL quantified subquery predicate, like select … from … where a=ALL(subquery), and the subquery returns an empty result, then a=ALL(subquery) will return TRUE, in the standard and in MySQL. Not NULL. Supporting the position that the value of column “a” is equal to all elements of the empty set. Now, this is for ALL, which isn’t the same word as EVERY, though they have close meaning, in English…