How to improve subqueries derived tables performance
Last week I was working on one of the issue where the sub-query related to OLAP staging was running for about 2+ hours in the production server and finally nailed down to get the query to run in < 7 secs. It was bit interesting and kind of known issue in MySQL sub-queries world and one of the limitation from MySQL on giving more control over derived table results.
Sometimes we can re-write the sub-queries so that there is no derived tables complexity involved; but the bad part is; this particular sub-query is part of an UPDATE statement; so not all sub-queries can be re-written especially when they are part of UPDATE or DELETE statements due to its own limitations.
PROBLEM:
Here is the subset of the problem query and as you can see it runs for about 6 minutes in this small subset of data that I used for testing on Mac. All tables are InnoDB based.
-------------- SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN ( SELECT DISTINCT ext_group_id, group_id FROM sub ) sub2 ON(sub2.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub.txt' -------------- Query OK, 47827 rows affected (6 min 47.48 sec) |
HOW TO GET AROUND – SOLUTIONS:
Moving the derived table (in the above case sub2) to a view did not help; and timings are more or less the same. Here is the two alternative ways, which made the query run in < 10 secs.
- By creating external table and adding an index instead of using derived table
- Added an index within the mysql code on the derived table temporary results table by adding FORCE INDEX syntax (changed the syntax to support this, so that engine will create an index on temporary derived table results, which in this case happens to fit within heap engine instead of disk based)
CASE 1:
This is a known alternative and lot of people use this in production by avoiding the derived tables and/or sub-queries completely by creating tables for derived tables. The only thing that made the big difference is adding an index on this; without index it takes more or less the same 6 minutes times.
-------------- CREATE TEMPORARY TABLE sub_temp SELECT DISTINCT ext_group_id, group_id FROM sub -------------- Query OK, 72385 rows affected (0.81 sec) Records: 72385 Duplicates: 0 Warnings: 0 -------------- alter table sub_temp add index i_ext_group_id(ext_group_id) -------------- Query OK, 72385 rows affected (0.10 sec) Records: 72385 Duplicates: 0 Warnings: 0 -------------- SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN sub_temp sub ON(sub.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub-temp.txt' -------------- Query OK, 47827 rows affected (10.41 sec) |
As you can see its a great improvement from 6minutes to 10secs; but without index i_ext_group_id on sub-temp table; then things will be as usual.
CASE 2:
The engine should be smart enough to add the index on intermediate results temporary table (in this case sub2, index on column i_ext_group_id); but MySQL does not support this or at least should allow people to specify one using FORCE/USE INDEX, so that engine can add one.
For example; I patched MySQL to support this syntax, so that it adds the index on the derived table sub2 results automatically and the query immediately returns in 7 secs.
SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN ( SELECT DISTINCT ext_group_id, group_id FROM sub ) sub2 <strong>USE INDEX(ext_group_id)</strong> ON(sub2.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub-force.txt' -------------- Query OK, 47827 rows affected (7.18 sec) |
This is again a great improvement; but again using USE/FORCE INDEX is a hack here (this can also be implemented as hint) with column name; which acts as a hint to optimizer to create an index on that column (if its hint, then on JOIN column).
CHANGES TO MYSQL OPTIMIZER
But the idea for this post is to show how important the index in the intermediate temporary table results is for the query performance and optimizer should be smart enough to identify and add one automatically. The logic is same as how currently optimizer opts for index vs non-index scan by estimating the cost.
Related posts:
12 Comments





Hi Venu
I thought internal temporary tables creates the needed indexes. Subqueries in MySQL is still annoy with performance lot of times when comapred with postgres
Comment :: March 6, 2010 @ 6:04 pm
please please ~please~ don’t write DISTINCT(ext_group_id), group_id
putting one column in parentheses with no intervening space makes it look ~exactly~ like DISTINCT is a function
DISTINCT is ~not~ a function !!!!
you don’t know how many impressionable developers will pick up in this and start writing bad SQL, but i’ve run across plenty of them, and they nned a lot of handholding to unlearn this bad habit
please don’t do that
;o)
Comment :: March 6, 2010 @ 7:20 pm
Well; thanks for that; that’s copy & paste. Either way, the post is not related to that, its on different issue. Thanks anyway as that does not effect anything.
Comment :: March 6, 2010 @ 8:39 pm
Sorry, I pressed some wrong keys and the half cooked comment was posted, maybe you can remove it….
The temporary table in case 1 can be created with one Statement:
CREATE TEMPORARY TABLE sub_temp (
PRIMARY KEY(ext_group_id)
) ENGINE=Memory
SELECT DISTINCT ext_group_id, group_id FROM sub
You have the full power of CREATE TABLE here and can e.g. add sequential row numbers simply by adding an AUTO_INCREMENT column. I selected an engine as an example. This makes having to create a temporary table far less painful. The only problem I have with temporary table is you can use it only once per query, you can’t do e.g. (there are better but more complicated examples than UNION ALL)
SELECT *
FROM a JOIN temptab USING(c)
UNION ALL
SELECT *
FROM b JOIN temptab USING(c)
which means you have to create temptab as non-temporary table, make an own copy of temptab for every invocation or may have to rewrite the query to the far less optimal
SELECT *
FROM ( SELECT * FROM a UNION ALL SELECT * FROM b) t
JOIN temptab USING(c)
Comment :: March 7, 2010 @ 1:33 am
If you want the optimizer to start adding indexes for subquery results automatically, why not start automatically adding indexes to normal tables? There must be reasons why adding indexes is left to the humans usually.
In case of intermediate tables the arguments for your solution would be “the table is deleted afterwards, so no indexes aggregate”, “creating an index is slow, but it would be even slower without the index” and “all rows are created by this statement, so there is no unknown content”.
I have seen many people run into exactly the same situation, most think: There is an optimizer, so it will do this optimally, don’t you think. …
Comment :: March 7, 2010 @ 2:02 am
I bet you should first calculate sums in table `aggrpt`, then join table `sub`.
What is execution time then?
Comment :: March 7, 2010 @ 4:17 am
strcmp,
the point is not about regular table indexes; all tables has indexes and they are used right. If one needs to re-write; we have enough ways to re-write and optimize the query.
But the main theme of the post is to talk about benefit of adding index on intermediate temp results and allowing an option todo so.
Comment :: March 7, 2010 @ 5:56 pm
.. and here I thought my answers made it clear I understood the post the same way.
Comment :: March 8, 2010 @ 12:36 am
[...] strcmp: .. and here I thought my answe… [...]
Pingback :: March 8, 2010 @ 3:30 pm
[...] Venu Anuganti gives you tips on one of the most common MySQL frustrations: optimizing subqueries in how to improve subqueries derived tables performance. Justin Swanhart posts the way in which he Gets Linux performance information from your MySQL [...]
Pingback :: March 12, 2010 @ 2:48 pm
Subqueries materialize in v6 and I think they moved that into 5.4 when 6 was ditched. I’m still running v6 and it works great.
Comment :: May 26, 2010 @ 3:10 pm
Check this out… new in version 5.6 – http://dev.mysql.com/doc/refman/5.6/en/from-clause-subquery-optimization.html
Comment :: January 18, 2012 @ 12:33 pm