Last week; I was working on one of the issue where the sub-query related to OLAP staging was running for about 2+ hours in the production server and finally nailed down to get the query to run in less than 7 secs.
It was bit interesting; and kind of known issue in MySQL sub-queries world and one of the limitation from MySQL on giving more control over intermediate derived table results.
Sometimes we can re-write the sub-query so that there is no derived table complexity involved; but the bad part is; this particular sub-query is part of an UPDATE statement; so not all sub-queries can be re-written especially when they are part of UPDATE or DELETE statements due to its own limitations.
PROBLEM:
Here is the subset of the problem query (taken from complex big query) and as you can see it runs for about 6 minutes with small subset of data that I used for testing on Mac (All tables are InnoDB based)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
-------------- SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN ( SELECT DISTINCT ext_group_id, group_id FROM sub ) sub2 ON(sub2.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub.txt' -------------- Query OK, 47827 rows affected (6 min 47.48 sec) |
HOW TO GET AROUND – SOLUTIONS:
Moving the derived table (in the above case sub2) to a view did not help by indexing on join columns; and timings are more or less the same. Two alternative approaches as described below, made the same query run in less than 10 secs.
- By creating external table and adding an index instead of using derived table
- Adding an index on top of derived table temporary results (internally within MySQL server as part of query execution).
CASE 1:
This is a known alternative and lot of people use this in production by avoiding the derived tables and/or sub-queries completely in MySQL (other open-source databases like PostgreSQL works great with sub-queries) by creating (temporary) tables for derived tables. The only thing that made the big difference is adding an index on this intermediate table; without an index it takes more or less the same amount of time to execute.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
-------------- CREATE TEMPORARY TABLE sub_temp SELECT DISTINCT ext_group_id, group_id FROM sub -------------- Query OK, 72385 rows affected (0.81 sec) Records: 72385 Duplicates: 0 Warnings: 0 -------------- alter table sub_temp add index i_ext_group_id(ext_group_id) -------------- Query OK, 72385 rows affected (0.10 sec) Records: 72385 Duplicates: 0 Warnings: 0 -------------- SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN sub_temp sub ON(sub.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub-temp.txt' -------------- Query OK, 47827 rows affected (10.41 sec) |
As you can see; it is a great improvement from 407 secs (6 min 47.48 sec) to 10 secs; but without an index i_ext_group_id on sub-temp table; then timings will be as usual at 6 minutes.
CASE 2:
The server engine should be smart enough to add an implicit index on intermediate results temporary table (in this case sub2, index on column i_ext_group_id); but MySQL does not support this functionality or at least should allow people to specify one using FORCE/USE INDEX, so that server can add one on temporary table results.
It should add an implicit index in case if it is joining to a much bigger outer table/results (cost estimate one on one reverse lookup vs complete scan for each entry). If it is joining to smaller outer table results when compared with derived one (one or more), then no need to add an implicit index
For example; I patched MySQL to support this syntax, so that it adds implicit index on the derived table sub2 results automatically (small hack) and the query immediately returns in 7 secs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
SELECT SUM(aggrpt.imps) as imps, SUM(aggrpt.clicks) as clicks, SUM(aggrpt.pos) as pos FROM aggrpt LEFT JOIN ( SELECT DISTINCT ext_group_id, group_id FROM sub ) sub2 <strong>USE INDEX(ext_group_id)</strong> ON(sub2.ext_group_id=aggrpt.adgroupid) GROUP BY aggrpt.report_date, aggrpt.campaignid, aggrpt.adgroupid, aggrpt.keywordid ORDER BY NULL INTO OUTFILE '/tmp/test-sub-force.txt' -------------- Query OK, 47827 rows affected (7.18 sec) |
This is again a great improvement; but again using USE/FORCE INDEX is a hack here; his can also be implemented as hint with column name; basically simple way to make implicit index on temporary results data within query processing (pre and post materialized data depending on cost estimates)
CHANGES TO MYSQL OPTIMIZER
But the idea for this post is to show how important the index in the intermediate temporary table results for the query performance; and optimizer should be smart enough to identify and add one implicitly. The logic should be little similar to how currently optimizer opts for index vs non-index scan by estimating the cost, or rather lean towards hints for derived data.
Hi Venu
I thought internal temporary tables creates the needed indexes. Subqueries in MySQL is still annoy with performance lot of times when comapred with postgres
please please ~please~ don’t write DISTINCT(ext_group_id), group_id
putting one column in parentheses with no intervening space makes it look ~exactly~ like DISTINCT is a function
DISTINCT is ~not~ a function !!!!
you don’t know how many impressionable developers will pick up in this and start writing bad SQL, but i’ve run across plenty of them, and they nned a lot of handholding to unlearn this bad habit
please don’t do that
;o)
Well; thanks for that; that’s copy & paste. Either way, the post is not related to that, its on different issue. Thanks anyway as that does not effect anything.
New blog post #mysql "How to improve subqueries derived tables performance" http://tinyurl.com/ydpyvyr
Sorry, I pressed some wrong keys and the half cooked comment was posted, maybe you can remove it….
The temporary table in case 1 can be created with one Statement:
CREATE TEMPORARY TABLE sub_temp (
PRIMARY KEY(ext_group_id)
) ENGINE=Memory
SELECT DISTINCT ext_group_id, group_id FROM sub
You have the full power of CREATE TABLE here and can e.g. add sequential row numbers simply by adding an AUTO_INCREMENT column. I selected an engine as an example. This makes having to create a temporary table far less painful. The only problem I have with temporary table is you can use it only once per query, you can’t do e.g. (there are better but more complicated examples than UNION ALL)
SELECT *
FROM a JOIN temptab USING(c)
UNION ALL
SELECT *
FROM b JOIN temptab USING(c)
which means you have to create temptab as non-temporary table, make an own copy of temptab for every invocation or may have to rewrite the query to the far less optimal
SELECT *
FROM ( SELECT * FROM a UNION ALL SELECT * FROM b) t
JOIN temptab USING(c)
If you want the optimizer to start adding indexes for subquery results automatically, why not start automatically adding indexes to normal tables? There must be reasons why adding indexes is left to the humans usually.
In case of intermediate tables the arguments for your solution would be “the table is deleted afterwards, so no indexes aggregate”, “creating an index is slow, but it would be even slower without the index” and “all rows are created by this statement, so there is no unknown content”.
I have seen many people run into exactly the same situation, most think: There is an optimizer, so it will do this optimally, don’t you think. …
How to improve subqueries derived tables performance in MySQL: http://bit.ly/9jLEym
How to improve subqueries derived tables performance of MySQL: http://is.gd/9RfIt
I bet you should first calculate sums in table `aggrpt`, then join table `sub`.
What is execution time then?
strcmp,
the point is not about regular table indexes; all tables has indexes and they are used right. If one needs to re-write; we have enough ways to re-write and optimize the query.
But the main theme of the post is to talk about benefit of adding index on intermediate temp results and allowing an option todo so.
.. and here I thought my answers made it clear I understood the post the same way.
Mysql: How to improve subqueries derived tables performance http://bit.ly/9mbR1o
[…] strcmp: .. and here I thought my answe… […]
[…] Venu Anuganti gives you tips on one of the most common MySQL frustrations: optimizing subqueries in how to improve subqueries derived tables performance. Justin Swanhart posts the way in which he Gets Linux performance information from your MySQL […]
Subqueries materialize in v6 and I think they moved that into 5.4 when 6 was ditched. I’m still running v6 and it works great.
How to improve subqueries derived tables performance – https://venublog.com/2010/03/06/how-to-improve-subqueries-derived-tables-performance/
Check this out… new in version 5.6 – http://dev.mysql.com/doc/refman/5.6/en/from-clause-subquery-optimization.html
@vanuganti hi,mayi ask u a question https://t.co/4GiYKsWO, how'd u patch to support the syntax <strong> use index in case 2 thanks a lot.
Of course it also helps to understand the difference between derived tables and subqueries:
http://www.programmerinterview.com/index.php/database-sql/derived-table-vs-subquery/
[…] Venu Anuganti gives you tips on one of the most common MySQL frustrations: optimizing subqueries in how to improve subqueries derived tables performance. Justin Swanhart posts the way in which he Gets Linux performance information from your MySQL […]