May 11, 2010

MySQL Query Engine Scalability Issues

Lately in the MySQL community, we only hear about scalability or performance improvements of storage engines, but nothing about query engine itself. For example, one classic example being InnoDB; if we look back all the scalability issues that community reported a year back or even few months back; most part of those issues have been fixed in forth coming MySQL 5.5 version (or even Percona Server or Facebook patches).

Even if you look at the new storage engines that are in development, they all going to concentrate on existing scalability issues that are common to any storage engine, and they will address it before it gets into beta or production ready.

But most part; we still have enough issues pending in the MySQL query engine; that is something that can’t be addressed by storage engines unless MySQL platform itself gets redesigned such that everything is plug and play and storage engine can override any behavior of the query engine (like how late binding works, but it is very un-likely due to lot of issues in terms of implementation as storage engines can’t have complete control over query engine, like allowing storage engines to do sorting / plan execution / caching etc; but this also deviates the main purpose of storage engine itself)

Here comes some known pending issues within query engine, that makes it hard to scale on designing larger systems where queries per second (QPS) is highly demanded:

Negative Scalability of Buffers

There is a negative performance with few global/session buffer variables; and there are number of open bugs on this.

  • table_cache
  • read_buffer_size
  • sort_buffer_size
    • Recently there are number of community blog posts on this sort_buffer_size scalability (here, here and here).

      The main problem is not because of memory allocation, but its because of making array of string pointers to keep the sorted data, and this array initialization (make_char_array() in filesort.cc) is what is causing the bottleneck as it directly depends on buffer size. One option is to skip this initialization of char array completely or use dynamic initialization as you sort (might slow down a bit for smaller sort keys); and InnoDB especially takes a bigger hit in performance if you have larger sort_buffer_size as it gives estimated rows (estimated_rows_upper_bound) being more than actual rows.

    • Here is sysbench performance results of negative scalability of sort_buffer_size for read-only queries (completely in memory data); and notice the performance drop by ~10 times from 2MB to 128MB
      sort_buffer_size threads=16 (txn/sec) threads=32 (txn/sec)
      2097144 (2MB,default) 6475 6302
      6291456 (6MB) 6123 6018
      33554432 (32MB) 3481 1590
      67108864 (64MB) 2084 1085
      134217728 (128MB) 1024 748
    • If you have a slave server without any other workload or server with very few threads < 5-10; then none of these might cause any major performance impact.

Query Cache Scalability

Most of the servers that I manage; first thing I do is to turn query cache completely off (query_cache_size=0 and query_cache_type=0) as much as possible unless the cache hit ratio is > 20%. The performance really degrades by magnitude especially if you have larger cache size and more queries in the cache.

This is something that MySQL should address either by allowing relaxed caching model (cache expires automatically after x secs instead of write-invalidation) or per table caching model without the overhead of contention. But still this is something that needs to be completely re-designed.

Key Buffer Size Scalability

Even though this is completely used by MyISAM; but still one thing that needs to be addressed as most of the servers still use MyISAM where transactions are not required due to its simplicity in administration and performance.

May be worth to support multiple key cache buffers similar to that of  multiple buffer pools introduced by InnoDB in 5.5 or even per table cache buffers.

ON Duplicate Key Update Performance

Yet another widely used feature from MySQL is ON DUPLICATE KEY UPDATE; but this also takes a hit in performance as it starts working on larger tables especially if it needs to update more number of duplicated rows. One option is to push the logic to storage engine for in-line update of the duplicated row as it searches; instead of query engine controlling this with multi-pass iterations to storage engine.

Recently there was a discussion on this in the internals list

Status Checking

SHOW SESSION/GLOBAL STATUS or VARIABLES uses temporary table(in-memory or disk does not matter); even though this is something that is widely used by most of the monitoring tools every few seconds if not minutes; and MySQL should avoid using temporary table for this by having a pre-allocated heap for this.

This will help lot of monitoring tools to judge how many real temporary tables has been actually created by real queries; right now Created_tmp_tables status variable gets incremented for every SHOW STATUS/VARIABLE command.

Mutex Locks

Most of the functions still use common mutex locks even though there is no dependency; which causes contention and un-necessary waits on busy server; so MySQL should split and use individual locks of their own as much as possible, especially LOCK_thread_count is a nightmare and even causes SHOW STATUS to be blocked in most cases.

Pre-allocation of Query Buffers

As memory is cheap; and most of the large end systems make use of 32/64/128G memory; it may be worth MySQL to consider to support pre-allocation of query buffers (like join_buffer_size, sort_buffer_size etc) by exposing preallocate_sort_buffer_size=XX, preallocate_join_buffer_size=XX, so that  pre-allocation size gets re-used by x-threads in parallel

For example, lets say one sets preallocate_join_buffer_size=256M and join_buffer_size=32M; then 8 parallel queries ( x join_threads = prellocate_join_buffer_size/join_buffer_size) can re-use the same pre-allocated pool memory instead of allocating and de-allocating from the heap; and if it runs out of pre-allocated pool memory, then it gets from the heap and releases back.

Re-usable Common Code, Distributed and Pluggable Components

One thing that could encourage more plugins or storage engines or even more developers to support the community is by having common re-usable code like separate Lock Management, Parser Management, Cache Management, User Management, Parsing, Storage Management (read/write wrappers), Central Information Schema Management.. etc; so that every time some one needs to write a new storage engine, they don’t need to re-invent the wheel from scratch.

Like solving, how one can develop or integrate a key/value/NoSQL storage engine with least development cycles by making use of existing components within MySQL code base.

April 28, 2010

Massachusetts Data Protection Law

Recently I came across this new Massachusetts state data protection security law that has been passed and wondering if anyone took an initiative to fix their data storage, especially if it deals with MA residents. You can find more about this law from Google Search.

One thing that might make a difference for database vendors and users are storing personal information without any encryption and replicating that across the wire and also needs to maintain a Written Information Security Plan (WISP) and file it with the state of Massachusetts.

The main problem is; if you have 1000 users from MA; and if you did not encrypt their personal identification information (PII); then you or your business might end up paying 5M USD (5K per breach or lost record); and same is the case when you loose the data that is stored in USB or laptop or whatever…

This also means; if other states and countries start implementing the same rules; then we might see traction on how the databases actually store the data by having global encryption at different levels like table, file, database or at a system level. Microsoft SQL server 2008,  already started supporting encryption at various levels by introducing Transaparent Data Encryption (TDE)

April 20, 2010

INT and String data comparison, difference in performance because of quotes

In the last post choosing about the right type; there is a case about quoting the tuple values; that I forgot to mention which is pretty much a common mistake when string data types are used for storing int or float/double representation (well sometimes you need to use string due to length or to avoid precision loss); and queries associated with that column does not quote the data to be string when searching…

In the same example; client_id was declared as VARCHAR(255); so without any quotes searching on client_id takes 11 secs:

mysql> explain SELECT SQL_NO_CACHE channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY client_id, channel;
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
| id | select_type | table       | type  | possible_keys      | key                | key_len | ref  | rows     | Extra                    |
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | xxx_sources | index | idx_client_channel | idx_client_channel | 1032    | NULL | 20207319 | Using where; Using index | 
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)
 
mysql> SELECT SQL_NO_CACHE channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY client_id, channel;
+---------+----------+
| channel | visitors |
+---------+----------+
| NULL    |        0 | 
+---------+----------+
1 row in set (11.69 sec)

But if you quote client_id in the search part(client_id=’1301′); then things will run much faster (0.25sec as opposed to 11.69sec) as it does not need to do the conversion, and even the plan uses the direct const checking:

mysql> explain SELECT SQL_NO_CACHE channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = '1301' GROUP BY client_id, channel;
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+--------+--------------------------+
| id | select_type | table       | type | possible_keys      | key                | key_len | ref   | rows   | Extra                    |
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+--------+--------------------------+
|  1 | SIMPLE      | xxx_sources | ref  | idx_client_channel | idx_client_channel | 258     | const | 457184 | Using where; Using index | 
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+--------+--------------------------+
1 row in set (0.00 sec)
 
mysql> SELECT SQL_NO_CACHE channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = '1301' GROUP BY client_id, channel;
+---------+----------+
| channel | visitors |
+---------+----------+
| NULL    |        0 | 
+---------+----------+
1 row in set (0.25 sec)

Same is the case and performance impact if data is quoted when searching on int/double/float columns. At times its worth to double check column data types and use the same notation when using them (with or without quotes)

April 19, 2010

Choosing the right data type makes a big difference

Today evening one of my friend asked me in the IM to look into one of his production server where a query was taking ~11 seconds to run on 20 million row table, even though the query is using the right index and the plan as shown below:

mysql&gt; explain SELECT channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY channel;
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+-----------------------------------------------------------+
| id | select_type | table       | type  | possible_keys      | key                | key_len | ref  | rows     | Extra                                                     |
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | xxx_sources | index | idx_client_channel | idx_client_channel | 1032    | NULL | 19205420 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+-----------------------------------------------------------+
1 row in set (0.01 sec)
 
mysql&gt; SELECT channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY channel;
+---------+----------+
| channel | visitors |
+---------+----------+
| NULL    |        0 |
+---------+----------+
1 row in set (11.61 sec)
 
mysql&gt; show table status like 'xxx_sources'\G
*************************** 1. row ***************************
           Name: xxx_sources
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 19882760
 Avg_row_length: 46
    Data_length: 926941184
Max_data_length: 0
   Index_length: 1188233216
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2010-04-15 21:03:37
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: InnoDB free: 0 kB
1 row in set (0.21 sec)

Quickly looking at the plan; I added client_id in the group by to avoid temporary table, and the new plan looks much better, but still took same time for execution (well, cost of temp and copy is cheap in this case)..

mysql&gt; explain SELECT channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY client_id, channel;
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
| id | select_type | table       | type  | possible_keys      | key                | key_len | ref  | rows     | Extra                    |
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | xxx_sources | index | idx_client_channel | idx_client_channel | 1032    | NULL | 19205420 | Using where; Using index |
+----+-------------+-------------+-------+--------------------+--------------------+---------+------+----------+--------------------------+
1 row in set (0.00 sec)

and then examined the data and noticed that client_id was declared as VARCHAR(255) even though the client_id data is all int; quickly changing client_id to int made a big difference as the query execution took only ~0.24 secs

mysql&gt; SELECT channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY channel;
+---------+----------+
| channel | visitors |
+---------+----------+
| NULL    |        0 |
+---------+----------+
1 row in set (0.24 sec)

The performance difference is too big after changing the type to int. This is just an example; but I noticed lot of tables with VARCHAR(64) or VARCHAR(255) or VARCHAR(512) (or even TEXT at times).. as default types even though they store at max of 10-15 bytes of data; not sure why anyone do that; as this is something that must be followed as rule #1 when designing schema. Even if you are not directly querying on that column; it is always better to design a schema with right type and storage so that it is optimal in terms of storage space and performance.

April 16, 2010

MySQL 5.5 – A Community Winner

Ever since MySQL 5.5 beta has been announced by Edward Screven, Oracle’s chief corporate architect; there is lot of positive buzz (here, here, …) about the performance and scalability improvements added in this release. We should all be thankful to Michael Ronstrom (as most of the key developers are already working on different forks), who did a great job in the improvements especially scalability related that allows to scale beyond 16 cores by improving the performance by 2-5X in most common workloads. Not to forget about numerous improvements to replication by replication team.

Even though 5.5 has lot of new improvements officially from Sun/Oracle; but some of the changes are actually driven by community (yet another thanks to Google, Mark Callaghan and his team, Percona and his team, Facebook etc) and most of the ideas or patches were already floating for a while and they were used in the production as well (5.0 or 5.1). This is actually a good sign that community can look forward for 5.5 GA instead of worrying about what patches and builds to use.

This is a clear indication that 5.5 performance and scalability improvements were actually driven by community.

Key improvements in 5.5:

  1. InnoDB changes in 1.1
    • Multiple buffer pools (controlled by innodb_buffer_pool_instances)
    • Multiple rollback segments
    • Splitting of purge operation from main background thread (controlled by innodb_purge_threads)
    • New log_buf mutex now controls the mini transaction writes in buffer pool instead of shared log_sys, reduces the contention on buffer pool
    • Separate mutex for flush list handling, reduces the contention on buffer pool
    • Improved recovery time
  2. Rest of the changes as part of InnoDB plugin 1.0.x
  3. Numerous replication related changes

Even though they announced InnoDB as the default storage engine in 5.5; but the latest build still has MyISAM as the default

Server version: 5.5.4-m3 MySQL Community Server (GPL)
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
mysql> show engines;
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine             | Support | Comment                                                        | Transactions | XA   | Savepoints |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| InnoDB             | YES     | Supports transactions, row-level locking, and foreign keys     | YES          | YES  | YES        |
| MRG_MYISAM         | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| MEMORY             | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| BLACKHOLE          | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV                | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| MyISAM             | DEFAULT | Default engine as of MySQL 3.23 with great performance         | NO           | NO   | NO         |
| ARCHIVE            | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| FEDERATED          | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| PERFORMANCE_SCHEMA | YES     | Performance Schema                                             | NO           | NO   | NO         |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)

April 15, 2010

RAID Controllers Cache Management – Missing Features

PERC4DC_4 We all know how important hardware RAID controllers are in today’s data storage performance especially when dealing with large data sets. If we look at the trend from now to couple of years back; they really evolved rapidly with lot of useful features and their usage also grown as most of the new servers by default has one or two controllers built-in (one for internal and another one for external storage array or for redundancy).

Few popular RAID controller vendors in the market: 

More or less everyone supports all common features and differs in number of ports, protocol support (ISCSI, SATA, SAS, HBA/FB), transfer speed, RAID levels, total disks support, cache size and its management.

Controller Cache – Database Workloads

For database OLTP workloads (IO bound), controller cache plays a crucial role for overall write or read throughput, depending on how the cache is used. Most RAID controllers are equipped with either 128MB or 256 MB or 512MB cache, and newer controllers like HP Smart Array P812 supports 1GB.

Write-back mode improves the writes performance by magnitude as the write request is returned as completed as soon as the data is in the controller cache without actually writing to the disk (that’s why controller needs a BBU, Battery Backup Module so that there is no data loss on power failures)

In case if you enable the read ahead from the controller (sometimes good for OLAP workloads or ETL data warehouse, especially adaptive read ahead due to heavy sequential access); then the same cache is used to store the pre-fetched data that can be satisfied later from the cache without hitting the disk. But in case if the database system does read ahead (like InnoDB), then it is better to turn off read ahead from controller to avoid page trashing.

For some workloads, the controller cache can also cause negative performance if the cache is not properly utilized by the controller.

Missing Cache Management Tools

At present, none of the controllers either supports any cache management tools nor exposes how the cache has been actually used, so that one can adjust the cache according to the workloads for improved performance.

Some of the missing features:

  • A way to flush the data from cache to disks, so that the systems can be taken for offline maintenance. Right now there is no easy way to flush data from cache to disk; other than some of the controllers will indicate through LED whether data is in the cache or not
  • Way to set the cache threshold in time or %, so that it can start flushing to disk once it meets the threshold value. For example; if you notice big spikes from RRD graphs for every few minutes, then one can adjust the threshold to evenly distribute the load.
  • Cache usage statistics (writes data size, read ahead data size etc ), so that workload can be adjusted to yield much better results
  • Splitting of cache between reads and writes either in size or by %; so that they do not overlap and cause performance issues. For example; one prefers to set 20% for read ahead data and 80% for writes. Only HP Smart Array controller supports this feature at present.

As you get more control over the controller cache, the more you can tweak and adjust the workloads to get improved performance. Hopefully one day all vendors will expose more cache management options.

April 11, 2010

Data Store, Software and Hardware – What is best

Other day we had a small discussion about data stores and hardware; and which one drives the other when it comes to data storage solution, rather it is a hard discussion as both on its own are bigger entities; and one can not easily conclude as it depends on use cases and actually speaking data store limitation(s) drives the need for more powerful hardware for demanding scalability needs.

We all know how important the hardware is in today’s data scalability, especially when dealing with large data sets. Without hardware, it is hard to scale even if you have a powerful data store either it could be SQL (row or columnar) or NoSQL (key/value or other means) or any other data storage solution; because they are limited by the data structures & its implementation and data store performance directly depends on the hardware lately.

At times, data store vendors claim that they have scalable, high performance architecture; that means the solution is directly built on top of hardware scalability and performance by taking advantage of today’s evolving hardware technology. Also, hardware evolution is too aggressive in the recent years when compared to data store solutions due to the market share as hardware is everywhere as it is not just the storage solution.

In short, when a data store performance is directly proportional to hardware performance; that means the data store actually surpassed all of its software performance bottlenecks (algorithms, decision making, data structures etc). Overcoming from software performance is not that easy as the requirement changes day by day and it depends on data size and how data is actually:

  • stored
  • retrieved
  • processed and
  • maintained

If data is stored and retrieved from memory or non-persistent storage solution; then one does not need to worry about rest of the stuff or performance as it yields the best throughput; but memory or non-persistent solution can be a solution for smaller data sets, but not for large data sets that deals with tera bytes of data.

Other than newly evolving columnar data stores (yet to see any one solution that is really pitching with universal acceptance like Oracle/SQLServer/MySQL), NoSQL or big data warehouse solutions (like Aster data, Green Plum etc), none of the existing solutions really take advantage of the latest hardware or even the data  structures as most of the data store kernels are written years back. In today’s world; the only option for scalability is by depending on the hardware and by distributing the load across multiple systems (either in shared-nothing or shared-common or even “cloud” way…).

Hoping to see a solution, one day that actually bridges the gap between data store, hardware and scalability without the need of using multiple technologies for common use cases instead of depending on one single solution that can be universally adopted. Brian Aker in his recent interview and Baron claims the same thought.

April 7, 2010

CAP Theorem, Eventual Consistency, NoSQL

Very nice and interesting post from Michael Stonebraker explaining how errors dictate CAP Theorem (Consistency, Availability and Partition-tolerance); as only one objective from the CAP can be achieved during normal error conditions as NoSQL system seems to relax the consistency model as CAP theorem anyway proves that one can’t get all 3 at the same time, by favoring partition based availability

As most NoSQL systems adopt Eventual Consistency (depends and some systems it is configurable, and here is yet another nice article on variations in eventual consistency from Werner Vogels, CTO of Amazon) especially when data is distributed across the cluster of systems. Stonebraker suggests CA (Consistency and Availability, typical SQL system) rather than AP (Availability and Partition-Tolerance, typical NoSQL) by explaining the different error scenarios.

For example (just for fun); let’s consider the same error conditions, one by one and see how this can be adopted in distributed cloud computing (SQL or NoSQL)…

  • Application errors;  and up on error, system needs to rollback to its previous consistent state…

In case of transactional system, it is easy to rollback if the unit of work is in transaction (automatically by DBMS or manually by the application). In key/value pair, as easy as calling delete (key). Few systems use append-only mode (where there is no concept of updates or deletes, few SAS companies in the valley also use MySQL/InnoDB/MyISAM in this append only mode where system never gets any deletes/updates as everything runs on INSERT and SELECT only), updating with older time-stamp or serial-number can revert the latest change (depending on how the latest value is read).

  • Repeatable system crash on bad query request or bad data or something…

It is really hard to escape from this failure as other system in the cluster or replica can also fail for the same condition unless the replicated system is on a different version (rare case) or other degree of fault tolerance.

  • Unrepeatable system crash (System failure, hardware/power issue, data center down, network issue…)

If it is a system failure; then this can be handled by fail-over to another system or replica in the LAN or WAN. NoSQL defines much better solution for fail-over with online substitution of nodes to the cluster. The only issue is, if data is persistent in master node and if it implements read-your own write or monotonic read consistency model (same value upon subsequent requests); and before the change propagates to replica of other nodes; then the consistency model fails unless the model is designed such that more than one node shares the same write.

  • Partition failure in LAN, WAN

If bunch of partitions/nodes fail within LAN or WAN due to network, power failure etc; then it is hard to fail-over unless there is a replica or same copy within LAN or WAN. But due to network and/or power redundancy in data center or cloud environment; this failure is very unlikely.

  • Fail-over due to slowness or poor response

When one of the node starts performing slow due to higher load or more data processing; instead of failing over to another node (typical NoSQL concept), Stonebraker recommends building a system that can take the load spikes; but it is really hard due to growing data needs and poor capacity planning as things change over time. So, one option is to retire by fail-over to newer one (not always possible if you already have latest and greatest specs) or adding more nodes to take the load spikes (distributed).

But in case of SQL with shared-nothing or shared-common architecture; it is not always possible to distribute the load to more than one system (for example, client X in system A, but when A exceeds the load; only option is to split client X into system A and B; but client X can’t be split due to data integrity or lack of common layer that can interact with both the nodes and perform the join/merge operations)

Both has its advantages and dis-advantages; but if NoSQL adopts strict consistency model as that of SQL; then it is hard to scale in the distributed environment where that architecture is more or less demanded by many big web applications where the scalability comes only by distribution and consistency has to be sacrificed to some extent to get the blend of performance + scalability + high availability