WL#888: Optimization of Search-like DISTINCT queries.

Status: Un-Assigned   —   Priority: Medium

There are queries typical in search application which use manual index, for
handling of AND and OR  condition for additional search result. 

In the example below we're searching for all articles which speak about "SQL" or
"Windows"

We have to use DISTINCT to eliminate duplicates as some articles may match both
"SQL" and Windows"


CREATE TABLE articles (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(255) default NULL,
  rating int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (id),
  KEY rating (rating)
) TYPE=MyISAM;

--
-- Dumping data for table 'articles'
--


INSERT INTO articles VALUES (1,'Interview with Monty',99);
INSERT INTO articles VALUES (2,'Interview with Larry',98);
INSERT INTO articles VALUES (3,'Interview with Bill',97);
INSERT INTO articles VALUES (4,'Interview with Bill',96);
INSERT INTO articles VALUES (63,'Fill',0);
INSERT INTO articles VALUES (62,'Fill',0);
INSERT INTO articles VALUES (61,'Fill',0);
INSERT INTO articles VALUES (60,'Fill',0);
INSERT INTO articles VALUES (59,'Fill',0);
INSERT INTO articles VALUES (58,'Fill',0);
INSERT INTO articles VALUES (57,'Fill',0);
INSERT INTO articles VALUES (56,'Fill',0);
INSERT INTO articles VALUES (55,'Fill',0);
INSERT INTO articles VALUES (54,'Fill',0);
INSERT INTO articles VALUES (53,'Fill',0);
INSERT INTO articles VALUES (52,'Fill',0);
INSERT INTO articles VALUES (51,'Fill',0);
INSERT INTO articles VALUES (50,'Fill',0);
INSERT INTO articles VALUES (49,'Fill',0);
INSERT INTO articles VALUES (48,'Fill',0);
INSERT INTO articles VALUES (47,'Fill',0);
INSERT INTO articles VALUES (46,'Fill',0);
INSERT INTO articles VALUES (45,'Fill',0);
INSERT INTO articles VALUES (44,'Fill',0);
INSERT INTO articles VALUES (43,'Fill',0);
INSERT INTO articles VALUES (42,'Fill',0);
INSERT INTO articles VALUES (41,'Fill',0);
INSERT INTO articles VALUES (40,'Fill',0);

CREATE TABLE keywords (
  article_id int(10) unsigned NOT NULL default '0',
  keyword varchar(40) NOT NULL default '',
  PRIMARY KEY  (article_id,keyword)
) TYPE=MyISAM;

--
-- Dumping data for table 'keywords'
--


INSERT INTO keywords VALUES (1,'Linux');
INSERT INTO keywords VALUES (1,'MySQL');
INSERT INTO keywords VALUES (1,'SQL');
INSERT INTO keywords VALUES (1,'Windows');
INSERT INTO keywords VALUES (2,'Linux');
INSERT INTO keywords VALUES (2,'Oracle');
INSERT INTO keywords VALUES (2,'SQL');
INSERT INTO keywords VALUES (2,'Windows');
INSERT INTO keywords VALUES (3,'SQL');
INSERT INTO keywords VALUES (3,'Windows');
INSERT INTO keywords VALUES (4,'Irak');
INSERT INTO keywords VALUES (4,'Iraq');
INSERT INTO keywords VALUES (4,'Oil');
INSERT INTO keywords VALUES (4,'War');


If we do not have DISTINCT we can have nice execution plan

mysql> explain select straight_join articles.name from articles, keywords k1,
keywords k2 where rating>0 and articles.id=k1.article_id and
k2.article_id=articles.id and (k1.keyword="Linux" or  k2.keyword="Windows")
order by articles.rating desc limit 5;
+----------+-------+----------------+---------+---------+-------------+------+--------------------------+
| table    | type  | possible_keys  | key     | key_len | ref         | rows |
Extra                    |
+----------+-------+----------------+---------+---------+-------------+------+--------------------------+
| articles | range | PRIMARY,rating | rating  |       4 | NULL        |    5 |
Using where              |
| k1       | ref   | PRIMARY        | PRIMARY |       4 | articles.id |    4 |
Using index              |
| k2       | ref   | PRIMARY        | PRIMARY |       4 | articles.id |    4 |
Using where; Using index |
+----------+-------+----------------+---------+---------+-------------+------+--------------------------+
3 rows in set (0.00 sec)

( I used STRAIGHT_JOIN to work this like in real life with small tables)

But we get a lot of dups this way:

mysql> select straight_join articles.name from articles, keywords k1, keywords
k2 where rating>0 and articles.id=k1.article_id and k2.article_id=articles.id
and (k1.keyword="Linux" or  k2.keyword="Windows") order by articles.rating desc
limit 5;
+----------------------+
| name                 |
+----------------------+
| Interview with Monty |
| Interview with Monty |
| Interview with Monty |
| Interview with Monty |
| Interview with Monty |
+----------------------+
5 rows in set (0.00 sec)

If we add DISTINCT to solve the issue query starts to be very slow:
Basically LIMIT does not helps us anymore:

mysql> explain select distinct straight_join articles.name from articles,
keywords k1, keywords k2 where rating>0 and articles.id=k1.article_id and
k2.article_id=articles.id and (k1.keyword="Linux" or  k2.keyword="Windows")
order by articles.rating desc limit 5;
+----------+-------+----------------+---------+---------+-------------+------+----------------------------------------------+
| table    | type  | possible_keys  | key     | key_len | ref         | rows |
Extra                                        |
+----------+-------+----------------+---------+---------+-------------+------+----------------------------------------------+
| articles | range | PRIMARY,rating | rating  |       4 | NULL        |    5 |
Using where; Using temporary; Using filesort |
| k1       | ref   | PRIMARY        | PRIMARY |       4 | articles.id |    4 |
Using index; Distinct                        |
| k2       | ref   | PRIMARY        | PRIMARY |       4 | articles.id |    4 |
Using where; Using index; Distinct           |
+----------+-------+----------------+---------+---------+-------------+------+----------------------------------------------+
3 rows in set (0.00 sec)