December 2, 2010

MySQL At Scale – Zynga Games

Recently am part of Zynga‘s database team as I was pretty much impressed with company’s database usage. As everyone knows how popular Zynga games like Farmville, Cafe World, Mafia Wars, Poker, FrontierVille, FishVille, PetVille and Treasure Island etc are. Zynga launched yet another new game today called CityVille along with series of acquisitions (latest today is NewToy). You can find current Zynga game stats from appdata.

app_full_proxyBut lot of people asked me why I am part of Zynga database team when there is no MySQL being used by any of the games; and lot of articles on the web also indicate the same. For me it does not matter if it is MySQL or NoSQL or any other system as long as the data store can help to scale the systems and/or games in this case.

As a consultant, I help lot of other companies to scale using NoSQL systems apart from MySQL especially on large data handling; as the data store solution should help to scale the systems to yield the desired results; especially MySQL should be used for typical OLTP workloads and combination of MySQL and NoSQL or any other data warehouse clusters for analytics and/or OLAP workloads by combining with right application and caching components based on the business model and how the data is generated, stored, accessed and processed.

If you don’t use the right technology for what you trying to achieve, then you can’t easily scale, and end up spending time in fixing the performance and scalability issues on day to day basis rather than concentrating on building features that is demanded by the business.

As a matter of fact, Zynga may be the second largest MySQL user after Facebook. All games at Zynga are currently powered by MySQL as the backend storage along with memcache as the middle caching layer.

Last month we expanded the MySQL shards to one of the popular game due to increased DAU (Daily Average Users), and the whole expansion of MySQL shards in production happened without any down-time or taking the game down; which is only possible if the application code is tightly integrated with the caching, backend storage and also if the servers are in the cloud and elastic in nature (unless you have your own private cloud).

August 4, 2010

Why Kickfire is a fail in MySQL Data warehouse

Even though Data warehouse is picking very rapidly in the last year or so, but few companies who are already made a right mark in the right time could not take the market share that easily due to number of reasons. Even though am not a marketing guy to go over, but some of the basic design flaws can make things backwards at any time. Was doing research in to this space for last few months and expecting Kickfire will be either acquired or will be lead to dead end, and looks like the later is true as per the latest reports (kickfire website is also down now) even though it is backed by big investors.

Apart from what has been written in the above article by Daniel Abadi, few other reasons why it might be a failure:

  • it was not directly compatible with MySQL
  • over priced, not reachable by small to medium market customers
  • storage solution is tied with the appliance, as it can cause more damage when rapidly changing technology like storage is tied with an architecture.
  • building OLAP appliance on top of OLTP architecture without scalable MPP is a big misfire (or atleast should have built MPP to compete with XtremeData and/or Netezza on the same space as they have much better solutions)

This is my personal view and nothing related to any of the current work that am doing on data warehouse front. In case if you need any advice on current data warehouse technologies and/or want to build a scalable data warehouse infrastructure, please drop me a note.

July 19, 2010

MapReduce – DBInputFormat – Serialization on readers

Last week I was working on EC2 MySQL server where one of the slave is taking lot of time to catch-up; and only job that is running on that server is mapreduce job to access InnoDB tables for read-only meta data. And debugging it further, noticed that every access to database server is serialized with read lock; so slave has to wait until the locks are released (bad part is, if the resource is not released in-time or if it takes lot of time to process and cleanup, then things will be really bad on server front).

As am new to hadoop/mapreduce world; after pulling mapreduce code, noticed that DbInputFormat class by default uses SERIALIZABLE isolation level in getConnection() internally; which is kind of odd and can cause lot of locking and performance issues especially if multiple threads are acting on the same table.

Surprisingly, Sqoop, a database import tool for Hadoop also uses the same DBInputFormat class, so not sure if anyone actually tested for performance especially on live server with multiple threads where table is frequently updated. Even though the reader thread won’t take any performance hit, but other threads has to wait for any write operation as SERIALIZABLE by default acquires read lock.

I patched the mapreduce code to use READ COMMITTED isolation, and things started working back; even though the scenario is bit different here due to the nature of server workload, but this is not something that is un-common in most of the production environment.

Random Pauses In MySQL – File Handle Serialization

Last month, I blogged about a case involving InnoDB, where all threads acting on InnoDB tables completely stuck for about few hours doing nothing; until we found a way to get around and make the threads to run and do the actual work.

There are few more cases where the server can get into pause state without doing anything for a brief time; this is irrespective of storage engines (well, sort of). Even though server is still active and doing nothing; but blocked on certain conditions and every other thread for the most part has to wait, mostly because of serialization

Here is one such case that involves with THR_OPEN_lock mutex, which is mainly used for opening and closing file handles. When the table is flushed or when it is not in the table cache, then it opens and caches it(open_cache structure) for further use, so that sub-sequent operations can (re)use it without re-opening. Even if the table is InnoDB or any other storage engine type, the frm file has to go through the same mechanism. Not just the regular tables; but implicit file based temporary tables and all SELECT … OUTFILES and LOAD DATA LOCAL. uses the same lock to open and close.

For example; here is a live server thread dump; where most of the threads are blocked on this THR_OPEN_lock mutex:

    129 
     71 pthread_cond_wait@@GLIBC_2.3.2,end_thread,handle_one_connection,start_thread,clone
     25 read,vio_read,my_real_read,my_net_read,handle_one_connection,start_thread,clone
     14 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      5 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 select,os_thread_sleep,srv_lock_timeout_and_monitor_thread,start_thread,clone
      1 select,os_thread_sleep,srv_error_monitor_thread,start_thread,clone
      1 select,handle_connections_sockets,main
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_master_thread,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables,Prepared_statement::prepare,mysql_stmt_prepare,dispatch_command,handle_one_connection,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,my_close,openfrm,open_unireg_entry,open_table,open_tables,open_and_lock_tables,select_like_stmt_test_with_open_n_lock,Prepared_statement::prepare,mysql_stmt_prepare,dispatch_command,handle_one_connection,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,my_close,MYSQL_LOG::close,MYSQL_LOG::new_file,rotate_relay_log,process_io_rotate,queue_event,handle_slave_io,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,my_close,mi_close,free_tmp_table,close_thread_tables,dispatch_command,handle_one_connection,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,close_thread_tables,Query_log_event::exec_event,handle_slave_sql,start_thread,clone
      1 __lll_lock_wait,_L_lock_1233,pthread_mutex_lock,close_thread_tables,Prepared_statement::cleanup_stmt,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 do_sigwait,sigwait,signal_hand,start_thread,clone
      1 close,my_close,select_to_file::send_eof,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone

In the above stack trace, close to 11 threads are blocked and waiting on THR_LOCK_open during either open or close of file handles as last thread is taking time in closing the OUTFILE file descriptor, which holds the lock. The following thread states in the "SHOW PROCESSLIST" is a good indication of such a symptom.

  • opening tables
  • opening table
  • closing tables
  • query end
  • init
  • end

There is an open scalability bug on this and hoping that future versions of MySQL will address this. The best bet will be to split and use different mutex locks between regular tables, implicit temporary tables; input/output files, so that main table operations are not blocked by rest of the temporary operations or maintain a separate fd list and use atomic operations to toggle the state by getting rid of this lock on regular open/close operations. The symptom also exhibits when one is deleting a large table or using large IO buffering.

June 23, 2010

How read_buffer_size Impacts Write Buffering and Write Performance

Even though the name read_buffer_size implies that the variable controls only read buffering, but it actually does dual purpose by providing sequential IO buffering for both reads and writes.

In case of write buffering, it groups the sequential writes until read_buffer_size(it is min(read_buffer_size, 8K)); and then actually does the physical write once the buffer is full. In most cases; this value is the initial value of read_buffer_size when server actually started first time; as this is a dynamic global variable; even if you change the value dynamically at run time; it will not affect write buffering size (and in some cases of read buffering as well) as this is stored one-time in my_default_record_cache_size (might be a bug ?); and that variable is used in initializing IO cache buffers.

Here is some use cases where read_buffer_size is actually used for buffering writes:

  • SELECT INTO … OUTFILE ‘fileName
    • When writing to the OUTFILE, the writes are buffered before writing to OUTFILE
  • When filesort is used, during merge buffers and when merged results are written to a temporary file, then writes are buffered

Normally you will see performance boost due to buffering on slower write disks or when you have IO saturation; but it does not matter when you have descent disks or raid controller with write_cache with BBU enabled.

Here is some stats on how many physical writes are actually posted for a simple SELECT … INTO OUTFILE (file size 384936838 bytes) for variable read_buffer_size values (server needs to be restarted in-order to get the new value):

read_buffer_size physical writes exe time in secs
=0 (defaults to 8200) 23495 28.39
=131072 (default) 2937 27.44
=16777216 23 26.71
=33554432 12 26.00
=536870912 1 26.72

Total writes are calculated using simple patch that I wrote around mysys/my_write.c to get the real physical writes posted as a global status counter.

mysql> show global status like 'Write_count';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Write_count   | 23496 |
+---------------+-------+
1 row in set (0.00 sec)

As you can see, increase in read_buffer_size might save total physical writes and might help if you have lot of OUTFILEs or heavy file sorting to some extent; but again this will actually affect overall performance due to one of the known bug and the buffer is also allocated per query based; so be careful as allocation and initialization of big buffers are much costlier than real IO cost.

In either case; may be worth if the sequential read buffering is actually controlled by read_buffer_size and introduce new write_buffer_size that controls the write buffering instead of using the same for both.

June 12, 2010

Autocommit, Implicit Commit and Open Transactions

If you have any open transaction(s); and if there is any statement that causes implicit commit, especially DDL statements; then the current active transaction will be committed and transaction will be closed automatically. Here is the list of statements that causes implicit commit in MySQL.

But other day; we had an issue in production as all of a sudden everything started failing with the following error on RENAME TABLE statements:

ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction

This happened because I changed one of the table type to InnoDB from MyISAM; and as that table is only used for read-only purpose and most part of ETL operations failed with the above error even though no table is explicitly locked or running in any open transactions; and after debugging it further; found that its implicit commit that is not properly closing any open transactions when auto commit is set to ON mode.

Here is the simple repro in MySQL (5.0 and 5.1):

mysql> create table outer_tab(id int)Engine=InnoDB;
Query OK, 0 rows affected (0.12 sec)
 
mysql> insert into outer_tab values(10);
Query OK, 1 row affected (0.00 sec)
 
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
 
mysql> 
mysql> set autocommit=0;
Query OK, 0 rows affected (0.00 sec)
 
mysql> select * from outer_tab;
+------+
| id   |
+------+
|   10 |
+------+
1 row in set (0.00 sec)
 
mysql> set autocommit=1;
Query OK, 0 rows affected (0.00 sec)
 
mysql> create table inner_tab(id int)Engine=InnoDB;
Query OK, 0 rows affected (0.18 sec)
 
mysql> rename table inner_tab to tmp_x;
ERROR 1192 (HY000): Can't execute the given command because you have active locked tables or an active transaction
mysql>

SET AUTOCOMMIT=1 should perform implicit commit and close any open transactions as per the documentation; but it only commits any pending transactional statements; but does not mark transaction as closed; and here is the code that does that in set_var.cc::set_option_autocommit().

 if ((org_options & OPTION_NOT_AUTOCOMMIT))
    {
      /* We changed to auto_commit mode */
      thd->options&= ~(ulonglong) (OPTION_BEGIN | OPTION_KEEP_LOG);
      thd->transaction.all.modified_non_trans_table= FALSE;
      thd->server_status|= SERVER_STATUS_AUTOCOMMIT;
      if (ha_commit(thd))
    return 1;
    }

This is handled by MySQL engine and is in-dependant of storage engines. The fix here is to reset SERVER_STATUS_IN_TRANS in thd->server_status after the successful ha_commit() or by calling end_active_trans() function instead of ha_commit() in the above set_option_autocommit() function; and this holds good for any other implicit commit statements as well, so that it marks transaction as closed. I also filed a bug report

In a side note; here is the same example against Oracle 10G:

SQL>
SQL> create table outer_tab(id int);
Table created.
 
SQL> insert into outer_tab values(10);
1 row created.
 
SQL> commit;
Commit complete.
 
SQL>
SQL> set autocommit off;
SQL> select * from outer_tab;
        ID
----------
        10
 
SQL> set autocommit on;
SQL>
SQL> create table inner_tab(id int);
Table created.
 
SQL>
SQL> rename inner_tab to tmp_x;
Table renamed.

		

June 7, 2010

InnoDB In a Complete Locked Mode

Today morning, we had a weird issue in one of the staging OLAP server (ETL); where all the InnoDB threads were locked and waiting on a signal condition for about 3-4 hours without performing any work related to InnoDB until we noticed it. Processlist indicates every thread is in ACTIVE state mode like Sending Data, update, Copying to tmp table etc; even queries that are acting on single row table are in Sending Data state for more than 3 hours; and slave thread is no exception as it was performing an update on InnoDB table and it got stuck in that query for 3+ hours.

Quick look at the stack trace as shown below indicates that, out of total 92 threads, none of them actually doing anything. Other than sleep, main and monitor threads, every other thread is waiting on a signal condition. The server was actually running with the following concurrency settings:

innodb_thread_concurrency=0
innodb_thread_sleep_delay=0

But looks like by accidentally thread_concurrency and concurrency_tickets count was toggled together; that forced currently executing threads to enter into this signal wait state (pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low). Resetting concurrency back to 0 or increasing tickets did not help; but that allowed me to execute new queries without any issues except that it couldn’t signal to wake rest of the threads.

One way to signal the event is by making a new thread to enter srv_conc_enter_innodb with no more concurrency tickets left; that means reset innodb_concurrency_tickets to a lower number along with setting innodb_thread_concurrency > 0 and by executing an InnoDB related query(create table, bunch of inserts, updates, selects and then drop); and by doing so, actually re-signaled the condition and all threads started working back.

This happened with 5.0.77 build with community patches and looks like a race condition that causes this in combination of innodb_thread_concurrency and innodb_concurrency_tickets when toggling back and forth on a long running active threads. I could not reproduce this again as its hard to simulate the load; but will try again on next weekend as week days we let production systems do their job

Stack trace:

     92 
     12 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,os_aio_simulated_handle,fil_aio_wait,io_handler_thread,start_thread,clone
      8 pthread_cond_wait@@GLIBC_2.3.2,end_thread,handle_one_connection,start_thread,clone
      1 select,os_thread_sleep,srv_lock_timeout_and_monitor_thread,start_thread,clone
      1 select,os_thread_sleep,srv_error_monitor_thread,start_thread,clone
      1 select,handle_connections_sockets,main
      1 read,read,buf=0x2ab822169530,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7e0000030,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7dfda85c0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7dfda45b0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7d4b035e0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7a3d747e0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab7a0a1a970,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab78c5ab320,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab71686c1d0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6d649ad10,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6cca043e0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6bfd3bed0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6bbd3e620,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6af60dbe0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab6903fefd0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab690000010,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab68ddd4270,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab682eea0c0,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2ab681068d50,my_real_read,my_net_read,handle_one_connection,start_thread,clone
      1 read,read,buf=0x2aaaac0010a0,vio_read_buff,my_real_read,my_net_read,cli_safe_read,read_event,at,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_master_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,record=0x2ab74d2a9f00,write_record,mysql_insert,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,record=0x2ab68a89a5e0,write_record,mysql_insert,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab79032a2a0,join_read_key,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab78c4ac270,join_read_always_key,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab76e8e1ea0,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab752517498,write_record,read_sep_field,ex=<value ,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab74bc72430,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6e87bffb0,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6e3a77660,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6be153290,join_read_next_same,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6b7f893f0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6b6b103c0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6b4c1f570,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6af738490,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6a1268040,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab68746fcb0,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab686f64c40,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab685a3c340,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab683cef620,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_delete,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab681b97eb0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab67af5eac0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab672f44fc0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66c88a778,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66bc2f120,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66bbcefe0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66b054ef0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66960f7a0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6694a0620,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab668cd2230,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab667882de0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab660d71870,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,mysql_derived_filling,mysql_handle_derived,open_and_lock_tables,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab66073a070,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6603a8ab0,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6600964a0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65fe180a0,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65f18eed0,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65f1703d0,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,mysql_derived_filling,mysql_handle_derived,open_and_lock_tables,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65f145540,join_read_next_same,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65f1268e0,join_read_next_same,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab65da6d390,join_read_next_same,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab658caabe0,ha_innobase::index_first,ha_innobase::rnd_next,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2ab6574905b0,join_read_always_key,sub_select,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2aaaae28e998,handler::read_range_first,handler::read_multi_range_first,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,Query_log_event::exec_event,exec_relay_log_event,optimized,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2aaaac4fc010,rr_sequential,sub_select,do_select,JOIN::exec,mysql_select,mysql_derived_filling,mysql_handle_derived,open_and_lock_tables,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2aaaac03f000,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_event_wait_low,srv_conc_enter_innodb,innodb_srv_conc_enter_innodb,buf=0x2aaaab42ab00,write_record,select_insert::send_data,end_send,evaluate_join_record,sub_select,do_select,JOIN::exec,mysql_select,handle_select,mysql_execute_command,Prepared_statement::execute,mysql_stmt_execute,dispatch_command,handle_one_connection,start_thread,clone
      1 do_sigwait,sigwait,signal_hand,start_thread,clone

May 11, 2010

MySQL Query Engine Scalability Issues

Lately in the MySQL community, we only hear about scalability or performance improvements of storage engines, but nothing about query engine itself. For example, one classic example being InnoDB; if we look back all the scalability issues that community reported a year back or even few months back; most part of those issues have been fixed in forth coming MySQL 5.5 version (or even Percona Server or Facebook patches).

Even if you look at the new storage engines that are in development, they all going to concentrate on existing scalability issues that are common to any storage engine, and they will address it before it gets into beta or production ready.

But most part; we still have enough issues pending in the MySQL query engine; that is something that can’t be addressed by storage engines unless MySQL platform itself gets redesigned such that everything is plug and play and storage engine can override any behavior of the query engine (like how late binding works, but it is very un-likely due to lot of issues in terms of implementation as storage engines can’t have complete control over query engine, like allowing storage engines to do sorting / plan execution / caching etc; but this also deviates the main purpose of storage engine itself)

Here comes some known pending issues within query engine, that makes it hard to scale on designing larger systems where queries per second (QPS) is highly demanded:

Negative Scalability of Buffers

There is a negative performance with few global/session buffer variables; and there are number of open bugs on this.

  • table_cache
  • read_buffer_size
  • sort_buffer_size
    • Recently there are number of community blog posts on this sort_buffer_size scalability (here, here and here).

      The main problem is not because of memory allocation, but its because of making array of string pointers to keep the sorted data, and this array initialization (make_char_array() in filesort.cc) is what is causing the bottleneck as it directly depends on buffer size. One option is to skip this initialization of char array completely or use dynamic initialization as you sort (might slow down a bit for smaller sort keys); and InnoDB especially takes a bigger hit in performance if you have larger sort_buffer_size as it gives estimated rows (estimated_rows_upper_bound) being more than actual rows.

    • Here is sysbench performance results of negative scalability of sort_buffer_size for read-only queries (completely in memory data); and notice the performance drop by ~10 times from 2MB to 128MB
      sort_buffer_size threads=16 (txn/sec) threads=32 (txn/sec)
      2097144 (2MB,default) 6475 6302
      6291456 (6MB) 6123 6018
      33554432 (32MB) 3481 1590
      67108864 (64MB) 2084 1085
      134217728 (128MB) 1024 748
    • If you have a slave server without any other workload or server with very few threads < 5-10; then none of these might cause any major performance impact.

Query Cache Scalability

Most of the servers that I manage; first thing I do is to turn query cache completely off (query_cache_size=0 and query_cache_type=0) as much as possible unless the cache hit ratio is > 20%. The performance really degrades by magnitude especially if you have larger cache size and more queries in the cache.

This is something that MySQL should address either by allowing relaxed caching model (cache expires automatically after x secs instead of write-invalidation) or per table caching model without the overhead of contention. But still this is something that needs to be completely re-designed.

Key Buffer Size Scalability

Even though this is completely used by MyISAM; but still one thing that needs to be addressed as most of the servers still use MyISAM where transactions are not required due to its simplicity in administration and performance.

May be worth to support multiple key cache buffers similar to that of  multiple buffer pools introduced by InnoDB in 5.5 or even per table cache buffers.

ON Duplicate Key Update Performance

Yet another widely used feature from MySQL is ON DUPLICATE KEY UPDATE; but this also takes a hit in performance as it starts working on larger tables especially if it needs to update more number of duplicated rows. One option is to push the logic to storage engine for in-line update of the duplicated row as it searches; instead of query engine controlling this with multi-pass iterations to storage engine.

Recently there was a discussion on this in the internals list

Status Checking

SHOW SESSION/GLOBAL STATUS or VARIABLES uses temporary table(in-memory or disk does not matter); even though this is something that is widely used by most of the monitoring tools every few seconds if not minutes; and MySQL should avoid using temporary table for this by having a pre-allocated heap for this.

This will help lot of monitoring tools to judge how many real temporary tables has been actually created by real queries; right now Created_tmp_tables status variable gets incremented for every SHOW STATUS/VARIABLE command.

Mutex Locks

Most of the functions still use common mutex locks even though there is no dependency; which causes contention and un-necessary waits on busy server; so MySQL should split and use individual locks of their own as much as possible, especially LOCK_thread_count is a nightmare and even causes SHOW STATUS to be blocked in most cases.

Pre-allocation of Query Buffers

As memory is cheap; and most of the large end systems make use of 32/64/128G memory; it may be worth MySQL to consider to support pre-allocation of query buffers (like join_buffer_size, sort_buffer_size etc) by exposing preallocate_sort_buffer_size=XX, preallocate_join_buffer_size=XX, so that  pre-allocation size gets re-used by x-threads in parallel

For example, lets say one sets preallocate_join_buffer_size=256M and join_buffer_size=32M; then 8 parallel queries ( x join_threads = prellocate_join_buffer_size/join_buffer_size) can re-use the same pre-allocated pool memory instead of allocating and de-allocating from the heap; and if it runs out of pre-allocated pool memory, then it gets from the heap and releases back.

Re-usable Common Code, Distributed and Pluggable Components

One thing that could encourage more plugins or storage engines or even more developers to support the community is by having common re-usable code like separate Lock Management, Parser Management, Cache Management, User Management, Parsing, Storage Management (read/write wrappers), Central Information Schema Management.. etc; so that every time some one needs to write a new storage engine, they don’t need to re-invent the wheel from scratch.

Like solving, how one can develop or integrate a key/value/NoSQL storage engine with least development cycles by making use of existing components within MySQL code base.