Is MySQL a toy RDBMS?

For the last two days I’ve been trying to believe what I’ve found: MySQL («the world’s most popular open source database», as they say) is unable to resolve a simple SQL query in an efficient way, leading me (and the program I’m developing) to a dead end.

The situation

I have a table «mytable» (1,000,000 rows) with a column named «mycolumn» of type «datetime» and an index on it named «idx_mycolumn». When I issue the following query:

SELECT * FROM mytable ORDER BY mycolumn

it takes more than 1 minute to return the results (the 1,000,000 rows ordered by mycolumn).
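
For reference, a minimal table definition matching this setup would look roughly like the following. Only «mycolumn», its type and the «idx_mycolumn» index are exactly as in my schema; the primary key and the remaining columns are just placeholders (my real table has more columns and, as noted in the update below, more indexes):

CREATE TABLE mytable (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- placeholder surrogate key
  mycolumn DATETIME NOT NULL,
  -- ...other application columns...
  INDEX idx_mycolumn (mycolumn)
) ENGINE=InnoDB;  -- same behaviour observed with MyISAM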

Using the command «EXPLAIN» I can see that the MySQL optimizer doesn’t use the index at all: it performs a filesort (on disk) to sort the result set. The only way to address this (as far as I know) is to give the optimizer a «hint», telling it which indexes to use:

SELECT * FROM mytable USE INDEX(idx_mycolumn) ORDER BY mycolumn

But the optimizer still chooses to do a full table scan instead of using the «idx_mycolumn» index (and still performs a filesort, taking more than 1 minute).

There is a way to force the optimizer to use the index:

SELECT * FROM mytable FORCE INDEX(idx_mycolumn) ORDER BY mycolumn

This time, the optimizer uses the index, and the query takes about 3 seconds. (Is the MySQL query optimizer that bad?)
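
For reference, this is roughly how the two plans compare under «EXPLAIN» (output abridged; the exact figures vary, but the important difference is the access type and the «Using filesort» note):

EXPLAIN SELECT * FROM mytable ORDER BY mycolumn;
-- type: ALL (full table scan), key: NULL, Extra: Using filesort

EXPLAIN SELECT * FROM mytable FORCE INDEX(idx_mycolumn) ORDER BY mycolumn;
-- type: index (full index scan), key: idx_mycolumn, Extra: (no filesort)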

A bigger problem

Then I ran into the real problem: obviously I’ll never need to fetch all the rows from that table, but it is a common task to «paginate» the rows (previously ordered by some column), taking, let’s say, 25 rows at a time and using an offset. The query looks as follows:

SELECT * FROM mytable ORDER BY mycolumn LIMIT 100000,25

It sorts the result set using the values of «mycolumn» and then returns 25 rows, skipping the first 100,000. Surprisingly, this time the query takes less than 1 second to execute, and «EXPLAIN» says it is using the «idx_mycolumn» index. What a joy!

But the problem is still there: if the LIMIT reaches (or surpasses) the last row, the optimizer throws away the index and falls back to a filesort (taking more than 1 minute and making the query totally useless…)

The solution (again) is to force the optimizer to use the «idx_mycolumn» index, which brings the response time back to about 3 seconds.
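
For reference, the forced version of the paginated query looks like this (the offset is just an example of one that reaches the end of the 1,000,000-row table):

SELECT * FROM mytable FORCE INDEX(idx_mycolumn) ORDER BY mycolumn LIMIT 999975,25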

My conclusion

I’m far from being a MySQL expert, but the only conclusion I can reach is that the optimizer is really screwed up. How can merely changing the LIMIT of the results confuse the optimizer so badly that a simple query (which shouldn’t take more than a few seconds) ends up taking an unacceptable amount of time (not to mention the disk space used by the filesort)?

As I can’t (and don’t want to) force indexes in every query involving large tables in my system, the only real solution for me is to replace MySQL with a more robust (and decent) DBMS. (I’ve tried PostgreSQL with the same database and the results are far more reasonable.)

Some things to consider

  • The problem is the same whether the storage engine is MyISAM or InnoDB.
  • The problem isn’t solved by running «ANALYZE TABLE» (shown after this list for completeness).
  • I’ve tried other datatypes for the sorting column, with the same results.
  • I’ve tested this issue with versions 5.0.35 and 5.0.51a, with identical results.
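
For completeness, the «ANALYZE TABLE» statement mentioned above (it refreshes the key distribution statistics, but it made no difference here):

ANALYZE TABLE mytable;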

Update (2008/12/30)

The observed behavior must be related to bug #28404 (fixed in version 5.1). From the patch description:

When there are many indexes in table, optimizer wrongly does not even consider the one on column used in ORDER BY, when LIMIT N clause is also present.

I’ve oversimplified the example, but in my real scenario I had multiple single-column indexes on the table. I still can’t believe that the optimizer gets confused by having multiple indexes to choose from (but only in the presence of a LIMIT clause). Can you imagine how the optimizer is coded? (That’s why I use the word «toy» in the title.)
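
To give an idea, the additional indexes on my real table were along these lines (the column and index names here are made up):

ALTER TABLE mytable ADD INDEX idx_col_a (col_a), ADD INDEX idx_col_b (col_b);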

A possible solution would be to upgrade to the new version, but MySQL 5.1 also has several issues (and the installation script of the GNU/Linux distribution package is totally broken), so I’m moving to PostgreSQL.

(Thanks to Abel Chiola for the links.)

19 comments on “Is MySQL a toy RDBMS?”

  1. It is generally bad practice to use SELECT * in any query. In some RDBMSs, the columns specified in the SELECT list are considered by the optimizer when it identifies indexes for execution plans; I know this is a fact with SQL Server. I would retry your query specifying only the columns that are needed by the application and see if the MySQL optimizer comes up with a different execution plan.
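
    For example, something like this (the exact column list obviously depends on what the application needs; id here is just a hypothetical column):

    SELECT id, mycolumn FROM mytable ORDER BY mycolumn LIMIT 100000,25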

  2. I’m not very familiar with the MySQL optimizer, but in general an optimizer can’t do much more than pick the most obvious plan for a particular situation; it can’t ‘try’ several scenarios and select the best one. In your specific example, it is quite possible that, when selecting an entire table, it is faster to do a sequential scan and sort instead of using an index, because indexes generally work best when only a subset of a table is selected. PostgreSQL does it that way, so I wouldn’t be surprised if MySQL does the same. It will always be a challenge to find the best optimization by tuning the configuration of your RDBMS. It is a bit harsh to call MySQL a ‘toy RDBMS’ based on a single issue you’ve experienced. PostgreSQL has similar issues (in one case a query performed 100 times faster when we added a ‘limit 1’ to a query which returned one record anyway), and I bet Oracle and SQL Server have similar issues too. The cause is that optimizers are not hard science and the results may vary in particular scenarios. You’d better get used to it, otherwise you will experience a lot of frustration in the future.

  3. To answer your question – no, MySQL is not a toy. I would suggest that you have a read through section 7.4.5 of the manual as to why you are seeing this particular behaviour (http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html):

    «Sometimes MySQL will not use an index, even if one is available. One way this occurs is when the optimizer estimates that using the index would require MySQL to access a large percentage of the rows in the table. (In this case, a table scan is probably much faster, because it will require many fewer seeks.) However, if such a query uses LIMIT to only retrieve part of the rows, MySQL will use an index anyway, because it can much more quickly find the few rows to return in the result.

    Hash indexes have somewhat different characteristics than those just discussed:
    * They are used only for equality comparisons that use the = or <=> operators (but are very fast).
    * The optimizer cannot use a hash index to speed up ORDER BY operations. (This type of index cannot be used to search for the next entry in order.)
    * MySQL cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use). This may affect some queries if you change a MyISAM table to a hash-indexed MEMORY table.
    * Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)»

    I would suggest that sorting 100000 rows would probably, in the manual’s terms, «require MySQL to access a large percentage of the rows in the table». Have a read through the ORDER BY optimization section of the manual and you will probably find a way to get it to do what you want (http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html).

  4. I’ve run into the same kind of issues with MySQL, but I’ve also had problems with PostgreSQL. As other comments point out, query optimizers sometimes behave in unexpected ways.

    I wouldn’t call MySQL a toy RDBMS for that reason. Not handling relationships between tables unless you use InnoDB (which doesn’t support other stuff, like full-text search) is a better reason.

    What would be interesting to see is documentation of the reported shortcomings of the various RDBMSs and their workarounds.

  5. Jokub:

    I agree with you: sorting 1M (all) rows does require «accessing a large percentage of the rows in the table». But then how do you explain the fact that the query takes 3 seconds if it doesn’t involve the last row and 1 minute if it does (filesort aside)?

    I must insist: If I force the optimizer to use the index, the query performs well.

  6. I suspect that you are not actually retrieving all 1,000,000 rows when you specify «select * from … order by …». Of course it is faster to fetch the top N rows in some client application via the index; however, it would be much slower to fetch all of the rows in the table that way. So in effect you are hiding information from the optimizer: you only want the top N rows (i.e. the ones you are going to work with), but you do not tell it this crucial detail.

    To get an accurate sense of what is going on, try each approach (full table scan and index read) in a situation where you consume all of the rows, say by an insert/select statement that puts all of the data into a separate table. You will see that getting all of the rows by index is much slower.
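
    For example, something along these lines (the copy table is just for the test; compare the timings of the two INSERT ... SELECT statements):

    CREATE TABLE mytable_copy LIKE mytable;
    INSERT INTO mytable_copy SELECT * FROM mytable ORDER BY mycolumn;
    TRUNCATE TABLE mytable_copy;
    INSERT INTO mytable_copy SELECT * FROM mytable FORCE INDEX(idx_mycolumn) ORDER BY mycolumn;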

    The above is not specific to MySQL, but applies to any well-designed rdbms optimizer.

  7. I do, however, agree that this sounds like a bug in the optimizer:

    «But the problem is still there: if the LIMIT reaches (or surpasses) the last row, the optimizer throws away the index and falls back to a filesort (taking more than 1 minute and making the query totally useless…)»

  8. You’re fired… go and read about the different philosophies of database design, then come back and give a write-up on something useful, like what the differences between databases such as Oracle/PostgreSQL and MySQL are and why one philosophy differs from the other. Then maybe the people you work with can slap you around for being a moron.

  9. No, MySQL isn’t a toy… but it’s not a serious tool either. It still can be the right tool for the job, just as long as the people using the tool understand the problems they’ll face.

    The optimizer is horrid, and the query planner is a joke. Correlated subqueries, anybody? The most complex query in my employer’s codebase is only so complex because it’s designed to work around how utterly braindead the MySQL query planner is. I’ve even encountered times when FORCE INDEX() won’t actually force the use of that index, if the optimizer thinks that using another index will be faster.

    Honestly though, if it wasn’t for the out-of-the-box it-just-works replication, I’d have switched to something else ages ago.

  10. Hi,
    I don’t know if I’ll be able to link here, but Shlomi Noach goes into some detail on this issue – and in his case he’s using a WHERE clause to restrict the number of rows, yet the optimizer isn’t using the index as the ORDER BY clause invokes the primary key. So it’s not just the «select *» part which is causing problems.

    http://code.openark.org/blog/mysql/7-ways-to-convince-mysql-to-use-the-right-index

    Once you’re aware of this as a developer or administrator, you can take the relevant action. I’m still happy to use MySQL on my sites; well, until most hosts provide PostgreSQL. Having used both Oracle and SQL Server in the past, I’m well aware that there are gotchas and glitches with all of them. OK, too many in MySQL, but with luck that’ll provide interesting subjects for my blog.

    Greetings from England

    DBMark

  11. Thank you for these comments. I was having the same problem. I have only a few thousand records and it blows chunks even with an index. It seems to not be fetching in the background and returning the first row immediately, like Oracle does when it uses nested loops.

    For users who just want to feel like the database is responsive, the first page needs to come back fast, and paging through the rest needs to stay quick. It’s counterintuitive for it to be otherwise. The first page of an encyclopedia is just as accessible as the first page of a baby book.

  12. Well, the OP’s example is just one of many issues that MySQL has. Its optimizer is in fact a TOY, for many reasons. Here is a good example.

    Views with GROUP BYs aren’t optimized: the entire view result is materialized as a temporary table before any joins are done. The same applies to derived tables.

    Example:
    select 1 from smalltable a, ( select * from bigtable ) b where a.id = b.a_id;

    If bigtable has a billion rows and smalltable has 1 row, the entire result from bigtable must be put into a temporary table before the join is performed.
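
    For comparison, the direct join that a decent query planner would effectively reduce this to:

    select 1 from smalltable a join bigtable b on a.id = b.a_id;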

    I don’t know of any modern RDBMS besides MySQL that has this issue. All it tells me is that MySQL (as a company, I’m not blaming just the developers) didn’t take the time or the effort (or maybe they simply didn’t know how) to write a real optimizer.

    Another example of just piss-poor laziness:

    insert into a ( someintcolumn ) values ('');

    This succeeds in MySQL and puts a 0 in someintcolumn, encouraging all the crappy PHP code out there to keep working and making it very difficult to port to any other RDBMS. I’m not sure if this is just a philosophical thing or a bug, and I don’t care: it encourages bugs. I bet half the time PHP developers don’t even realize they have bugs because of things like that. A date with a value of 0000-00-00? Give me a break.

