MySQL :: A must-know about NOT IN in SQL - more antijoin optimization

I will try to make it short and clear: if you are writing SQL queries with “NOT IN” like
SELECT … WHERE x NOT IN (SELECT y FROM …)
you first have to understand what happens when “x” or “y” is NULL: the result might not be what you want! And if it is not, I will show you here how to fix it.

First, the simple case: if “x” and “y” are table columns which have been created with the NOT NULL clause, then they are never NULL and you can relax.

Let us consider the other cases. The complexity stems from the fact that NULL may be understood as “unspecified, might be anything” and thus SQL’s point of view is that it cannot know if NULL is equal to, for example, “coal”. Such a question yields an answer which is neither TRUE, nor FALSE: it yields UNKNOWN, which MySQL prints as a NULL:

SELECT NULL="coal";

+---------------+
| NULL="coal"   |
+---------------+
| NULL          |
+---------------+

Before starting, we need to remember two more details of SQL:

A WHERE clause tests a condition against a row and lets the row pass if and only if this condition yields TRUE (it rejects FALSE and UNKNOWN).
NOT(TRUE) is FALSE, NOT(FALSE) is TRUE, and NOT(UNKNOWN) is UNKNOWN.
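
These two rules can be checked directly in the mysql client. The following is a small sketch rather than part of the houses example; the column aliases are arbitrary names chosen for illustration, and MySQL prints TRUE as 1 and FALSE as 0.

```sql
-- NOT applied to each of the three truth values.
SELECT NOT (1 = 1)         AS not_true,     -- 0, i.e. FALSE
       NOT (1 = 0)         AS not_false,    -- 1, i.e. TRUE
       NOT (NULL = 'coal') AS not_unknown;  -- NULL, i.e. UNKNOWN

-- WHERE only lets TRUE pass: this condition is UNKNOWN, so no row is returned.
SELECT 'row passes' AS result FROM DUAL WHERE NULL = 'coal';
```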

Now that we are ready, let us look at this example:

create table houses (address varchar(100) not null, heating varchar(30));

It is a list of houses; for every house, we know the type of energy used for heating it (“coal”, “wood”, “gas”, etc., or NULL for no heating).

Show me all houses which are heated with either coal or wood:

select * from houses where heating in ("coal", "wood");

and show me all other houses:

select * from houses where heating not in ("coal", "wood");

We have a house A without heating and one B using oil:

insert into houses values ("A", NULL), ("B", "oil");

When we test house A:

heating IN (“coal”, “wood”) -> UNKNOWN
because, says SQL, heating is NULL, and NULL might be coal, or might be wood, or not, we do not know…

heating NOT IN (“coal”, “wood”) -> UNKNOWN
because NOT IN is applying NOT to IN, and IN is UNKNOWN, and NOT(UNKNOWN) is UNKNOWN.
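
To see these truth values for every row rather than reasoning about them, the two conditions can be selected as ordinary columns. This is an illustrative sketch that assumes the houses table holds just the two rows inserted above; the aliases in_result and not_in_result are made up for the example.

```sql
SELECT address,
       heating,
       heating IN ('coal', 'wood')     AS in_result,
       heating NOT IN ('coal', 'wood') AS not_in_result
FROM houses;
-- A: in_result and not_in_result are both NULL (UNKNOWN)
-- B: in_result = 0 (FALSE), not_in_result = 1 (TRUE)
```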

As a consequence:

select * from houses where heating IN ("coal", "wood")
  -> NO ROW
select * from houses where heating NOT IN ("coal", "wood")
  -> (B,"oil")

because WHERE eliminates rows for which the condition is not TRUE, and so eliminates house A.

The two SELECT’s results above are correct from SQL’s point of view. Now it is your turn to decide if they are what you expected.

If yes, then things are fine.

But I know that for some people, it is not what they expected. For example, some are shocked to see that both IN and NOT IN miss house A, as if A were in neither of the two groups (the coal-wood group and the others); A seems to be invisible, a kind of ghost…

The crux of the matter is that when I designed the table of houses, I meant NULL to be “none”, “no heating”.
SQL, by contrast, takes NULL to mean “perhaps coal, or gas, or something else, or nothing”.
Therefore, in my intent, NULL cannot possibly be coal or wood, so I expect IN to not return A (it does not, alright), and I expect NOT IN to return A… and it does not.

So what should I do, to make NOT IN behave as I expect?

Simple! I just have to better express what I want, in SQL.
I can change NOT IN to IN IS NOT TRUE:

select * from houses where heating IN ("coal", "wood") IS NOT TRUE;

This lets through the houses for which IN returns FALSE or UNKNOWN; so A and B both pass, as I expect.
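
If IN … IS NOT TRUE reads as unfamiliar, the same intent can be spelled out with an explicit NULL test. This is a sketch of an alternative formulation, offered as an assumption-laden equivalent: it gives the same rows only as long as the IN list itself contains no NULL, which is the case here.

```sql
-- Keep rows where heating is a known value outside the list, or is NULL.
SELECT * FROM houses
WHERE heating NOT IN ('coal', 'wood')
   OR heating IS NULL;
```

With the two example rows, this returns both A and B.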

The same problem occurs with “NOT IN(subquery)”. Let us add this table:

create table energy(name varchar(30) not null, co2 bool not null);
insert into energy values ("coal", 1), ("gas", 1),
  ("windmill electricity", 0),("wood", 1), ("oil", 1);

Show me houses for which heating does not generate carbon dioxide:

select * from houses where heating NOT IN
  (select name from energy where co2=1);

-> NO ROW.

A is missing, again. And again the solution is:

select * from houses where heating IN
(select name from energy where co2=1) IS NOT TRUE;

Now I get A.
This rewrite to IN IS NOT TRUE works pretty well.
I can instead rewrite to NOT EXISTS, which is more editing work though:

select * from houses where NOT EXISTS
(select * from energy where co2=1 AND name=houses.heating);

That returns A too.

With either of the two rewrites, I am in effect telling MySQL that I want rows with NULL to count as clear-cut matches for my NOT IN.
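
So far only the left-hand side, “x”, was NULL. The case where the list or subquery side, “y”, contains a NULL can be sketched with constant expressions; no table is needed, and the aliases are hypothetical names for the example.

```sql
-- A NULL on the right-hand side: 'oil' is not 'coal', but it might equal the
-- unknown value, so NOT IN cannot answer TRUE.
SELECT 'oil'  NOT IN ('coal', NULL)            AS not_in_no_match,   -- NULL (UNKNOWN)
       'coal' NOT IN ('coal', NULL)            AS not_in_match,      -- 0 (FALSE)
       ('oil' IN ('coal', NULL)) IS NOT TRUE   AS is_not_true_form;  -- 1 (TRUE)
```

In a WHERE clause, a NOT IN whose subquery yields even one NULL therefore never returns TRUE: each row is rejected either as FALSE (a match was found) or as UNKNOWN. In the example schema, energy.name is declared NOT NULL, so this situation cannot arise there.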

An extra benefit is that this also allows MySQL to optimize more “aggressively”. Indeed, when any side of NOT IN is a nullable column (our case here),

SELECT … WHERE heating NOT IN (SELECT name …)

cannot be converted to an antijoin (a new feature of MySQL 8.0.17), precisely because the behaviour of NOT IN with NULLs does not match the definition of an antijoin in relational algebra. MySQL thus has fewer ways to execute this query.

But,

SELECT … WHERE heating IN (SELECT name …) IS NOT TRUE

can be converted to an antijoin. And that is also true for the NOT EXISTS rewrite.
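
For readers unfamiliar with the term, an antijoin keeps the rows of one table that have no matching row in another. A hand-written way to express that idea is an outer join followed by a test for the missing match. The query below is only a sketch to illustrate the semantics; no claim is made here about which plan MySQL picks for it, and it relies on energy.name being NOT NULL so that a NULL in that column can only mean "no match".

```sql
SELECT houses.*
FROM houses
LEFT JOIN energy
       ON energy.name = houses.heating
      AND energy.co2 = 1
WHERE energy.name IS NULL;   -- no CO2-emitting energy matched this house
```

With the two example rows this returns A, like the two rewrites.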

We can check this in EXPLAIN; first we have the initial NOT IN, with a query plan showing one subquery execution per house, with a table scan each time (which is rather inefficient):

mysql> explain format=tree select * from houses where
heating NOT IN (select name from energy where co2=1);
-> Filter: <in_optimizer>(houses.heating,<exists>(select #2) is false)
  -> Table scan on houses
    -> Select #2 (subquery in condition; dependent)
      -> Limit: 1 row(s)
        -> Filter: ((energy.co2 = 1) and <if>(outer_field_is_not_null, (<cache>(houses.heating) = energy.`name`), true))
          -> Table scan on energy

Now here are the rewritten queries, which properly use an antijoin and can thus benefit from our new hash-based join algorithm (introduced in version 8.0.18 for inner joins, and extended to semijoins, antijoins and outer joins in 8.0.20):

mysql> explain format=tree select * from houses where
heating IN (select name from energy where co2=1) IS NOT TRUE;
-> Hash antijoin (houses.heating = energy.`name`)
  -> Table scan on houses
    -> Hash
      -> Filter: (energy.co2 = 1)
        -> Table scan on energy

mysql> explain format=tree select * from houses where
NOT EXISTS (select * from energy where co2=1 and name=houses.heating);
-> Hash antijoin (houses.heating = energy.`name`)
  -> Table scan on houses
    -> Hash
      -> Filter: (energy.co2 = 1)
        -> Table scan on energy

The antijoin plan is indeed faster; to check this experimentally, let us create one million random houses:

delete from houses;
insert into houses
  select rand(),
  case round(rand()*5) when 0 then "oil" when 1 then "wood" when 2 then "gas"
                       when 3 then "windmill electricity" when 4 then "coal" end;
insert into houses
  select rand(),
  case round(rand()*5) when 0 then "oil" when 1 then "wood" when 2 then "gas"
                       when 3 then "windmill electricity" when 4 then "coal" end
  from houses;

RAND() returns a number between 0 and 1; multiplied by 5 and passed through ROUND(), it becomes an integer from 0 to 5; 0 to 4 get a real energy source while 5 gets NULL (as CASE has no branch for 5 and no ELSE).
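
A quick way to confirm what the generator produced is to count rows per heating value. This is a sketch, and no output is shown because the data is random; the alias houses_count is made up for the example. Since ROUND() maps an interval only half as wide to the end values 0 and 5 as to 1 through 4, “oil” and NULL should each appear roughly half as often as the other values.

```sql
SELECT heating, COUNT(*) AS houses_count
FROM houses
GROUP BY heating
ORDER BY houses_count DESC;
```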

To get to one million houses, I only have to repeat the last INSERT: each execution doubles the number of rows, so about twenty executions are enough (2^20 = 1,048,576). And now the times for my search query are:

select count(*) from houses where heating IN (select name from energy where co2=1) IS NOT TRUE;
+----------+
| count(*) |
+----------+
| 314189   |
+----------+
1 row in set (0.48 sec)


select count(*) from houses where heating NOT IN (select name from energy where co2=1);
+----------+
| count(*) |
+----------+
| 209721   |
+----------+
1 row in set (0.59 sec)

The antijoin plan returns more rows (including, as expected, the NULLs) in roughly twenty percent less time (0.48 sec versus 0.59 sec).

The take-aways are: when using NOT IN, and if NULLs cannot be avoided, ask yourself whether the behaviour with NULLs is what you want; if yes, all is well; if not, consider the alternatives IN IS NOT TRUE or NOT EXISTS.
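
To find out whether NULLs can occur at all, the column definitions can be read from information_schema. A sketch, assuming the houses and energy tables from this post live in the current default database:

```sql
SELECT table_name, column_name, is_nullable
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND table_name IN ('houses', 'energy')
ORDER BY table_name, ordinal_position;
```

With the table definitions used here, only houses.heating is reported as nullable, so it is the column to watch in NOT IN conditions.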

Thank you for using MySQL!

Featured image by Pixabay, from pexels.com.