Since two days ago I’m trying to believe what I’ve found: MySQL («the world’s most popular open source database«, as they say) is unable to resolve a simple SQL query in an efficient way, leading me (and the program I’m developing) to a dead end.
I have a table «mytable» (1,000,000 rows) with a column named «mycolumn» of type «datetime» and an index created on it named «idx_mycolumn«. When I issue the following query:
SELECT * FROM mytable ORDER BY mycolumn
it tooks more than 1 minute to give the results (the 1,000,000 rows ordered by mycolumn.)
Using the command «EXPLAIN» I can see that the MySQL optimizer don’t use the index at all: it generates a filesort (on disk) in order to sort the resultset. The only way to solve this issue (as far as I know) is providing a «hint» to the optimizer, telling what indexes to use:
SELECT * FROM mytable USE INDEX(idx_mycolumn) ORDER BY mycolumn
But the optimizer still chooses to do a full table scan instead of using the «idx_mycolumn» index (and still generates a filesort, taking more than 1 minute.)
There is a way to force the optimizer to use the index:
SELECT * FROM mytable FORCE INDEX(idx_mycolumn) ORDER BY mycolumn
This time, the optimizer uses the index, and the query takes about 3 seconds. (Is the MySQL query optimizer that bad?)
A bigger problem
Then, I came into the real problem for me: Obviously I’ll never need to get all the rows from that table, but is a common task to «paginate» the rows (previously ordered by some column) taking, let’s say, 25 rows at a time and using an offset. The query looks as follows:
SELECT * FROM mytable ORDER BY mycolumn LIMIT 100000,25
It sorts the resultset using the values of «mycolumn» and then returns 25 rows skipping the first 100,000. Surprisingly, this time the query tooks less than 1 second to execute (and «EXPLAIN» says that is using the «idx_mycolumn» index. What a joy!).
But the problem is still there: If the limit reaches (or surpases) the last row, then the optimizer throws away the index and generates a filesort (taking more than 1 minute, making the query totally useless…)
The solution (again) is to force the optimizer to use the index «idx_mycolumn«, getting a response in about 3 seconds.
Far from being a MySQL expert, the only conclusion I can reach is that the optimizer is really screwed up. How can it be possible that changing only the limit of the results, confuses the optimizer in such a way that leads to a simple query (that can’t take more than a few seconds) to take an unacceptable amount of time (not to mention the disk space)?
As I can’t (and don’t want) force indexes in every query involving large tables in my system, the only real solution for me is replacing MySQL with a more robust (and decent) DBMS (I’ve tried PostgreSQL with the same database and the results are far more reasonable.)
Some things to consider
- The problem is the same being the engine used MyISAM or InnoDB.
- The problem doesn’t get solved after doing a «ANALYZE TABLE«.
- I’ve trying with other datatypes for the sorting column, with the same results.
- I’ve tested this issue with versions 5.0.35 and 5.0.51a, with identical results.
When there are many indexes in table, optimizer wrongly does not even consider the one on column used in ORDER BY, when LIMIT N clause is also present.
I’ve oversimplified the example, but in my real scenario I had multiple single-column indexes on the table. I still can’t believe that the optimizer get confused by having multiple indexes to choose from (but only in presence of the LIMIT clause). Can you guess how is the optimizer coded? (That’s why I use the word «toy» in the title.)
A posible solution seems to be moving to the new version, but anyway, MySQL 5.1 also has several issues (and the distribution installation script for GNU/Linux is totally broken), so I’m moving to PostgreSQL.
(Thanks to Abel Chiola for the links.)