April 11, 2010

Data Store, Software and Hardware – What is best

Other day we had a small discussion about data stores and hardware; and which one drives the other when it comes to data storage solution, rather it is a hard discussion as both on its own are bigger entities; and one can not easily conclude as it depends on use cases and actually speaking data store limitation(s) drives the need for more powerful hardware for demanding scalability needs.

We all know how important the hardware is in today’s data scalability, especially when dealing with large data sets. Without hardware, it is hard to scale even if you have a powerful data store either it could be SQL (row or columnar) or NoSQL (key/value or other means) or any other data storage solution; because they are limited by the data structures & its implementation and data store performance directly depends on the hardware lately.

At times, data store vendors claim that they have scalable, high performance architecture; that means the solution is directly built on top of hardware scalability and performance by taking advantage of today’s evolving hardware technology. Also, hardware evolution is too aggressive in the recent years when compared to data store solutions due to the market share as hardware is everywhere as it is not just the storage solution.

In short, when a data store performance is directly proportional to hardware performance; that means the data store actually surpassed all of its software performance bottlenecks (algorithms, decision making, data structures etc). Overcoming from software performance is not that easy as the requirement changes day by day and it depends on data size and how data is actually:

  • stored
  • retrieved
  • processed and
  • maintained

If data is stored and retrieved from memory or non-persistent storage solution; then one does not need to worry about rest of the stuff or performance as it yields the best throughput; but memory or non-persistent solution can be a solution for smaller data sets, but not for large data sets that deals with tera bytes of data.

Other than newly evolving columnar data stores (yet to see any one solution that is really pitching with universal acceptance like Oracle/SQLServer/MySQL), NoSQL or big data warehouse solutions (like Aster data, Green Plum etc), none of the existing solutions really take advantage of the latest hardware or even the data  structures as most of the data store kernels are written years back. In today’s world; the only option for scalability is by depending on the hardware and by distributing the load across multiple systems (either in shared-nothing or shared-common or even “cloud” way…).

Hoping to see a solution, one day that actually bridges the gap between data store, hardware and scalability without the need of using multiple technologies for common use cases instead of depending on one single solution that can be universally adopted. Brian Aker in his recent interview and Baron claims the same thought.

April 7, 2010

CAP Theorem, Eventual Consistency, NoSQL

Very nice and interesting post from Michael Stonebraker explaining how errors dictate CAP Theorem (Consistency, Availability and Partition-tolerance); as only one objective from the CAP can be achieved during normal error conditions as NoSQL system seems to relax the consistency model as CAP theorem anyway proves that one can’t get all 3 at the same time, by favoring partition based availability

As most NoSQL systems adopt Eventual Consistency (depends and some systems it is configurable, and here is yet another nice article on variations in eventual consistency from Werner Vogels, CTO of Amazon) especially when data is distributed across the cluster of systems. Stonebraker suggests CA (Consistency and Availability, typical SQL system) rather than AP (Availability and Partition-Tolerance, typical NoSQL) by explaining the different error scenarios.

For example (just for fun); let’s consider the same error conditions, one by one and see how this can be adopted in distributed cloud computing (SQL or NoSQL)…

  • Application errors;  and up on error, system needs to rollback to its previous consistent state…

In case of transactional system, it is easy to rollback if the unit of work is in transaction (automatically by DBMS or manually by the application). In key/value pair, as easy as calling delete (key). Few systems use append-only mode (where there is no concept of updates or deletes, few SAS companies in the valley also use MySQL/InnoDB/MyISAM in this append only mode where system never gets any deletes/updates as everything runs on INSERT and SELECT only), updating with older time-stamp or serial-number can revert the latest change (depending on how the latest value is read).

  • Repeatable system crash on bad query request or bad data or something…

It is really hard to escape from this failure as other system in the cluster or replica can also fail for the same condition unless the replicated system is on a different version (rare case) or other degree of fault tolerance.

  • Unrepeatable system crash (System failure, hardware/power issue, data center down, network issue…)

If it is a system failure; then this can be handled by fail-over to another system or replica in the LAN or WAN. NoSQL defines much better solution for fail-over with online substitution of nodes to the cluster. The only issue is, if data is persistent in master node and if it implements read-your own write or monotonic read consistency model (same value upon subsequent requests); and before the change propagates to replica of other nodes; then the consistency model fails unless the model is designed such that more than one node shares the same write.

  • Partition failure in LAN, WAN

If bunch of partitions/nodes fail within LAN or WAN due to network, power failure etc; then it is hard to fail-over unless there is a replica or same copy within LAN or WAN. But due to network and/or power redundancy in data center or cloud environment; this failure is very unlikely.

  • Fail-over due to slowness or poor response

When one of the node starts performing slow due to higher load or more data processing; instead of failing over to another node (typical NoSQL concept), Stonebraker recommends building a system that can take the load spikes; but it is really hard due to growing data needs and poor capacity planning as things change over time. So, one option is to retire by fail-over to newer one (not always possible if you already have latest and greatest specs) or adding more nodes to take the load spikes (distributed).

But in case of SQL with shared-nothing or shared-common architecture; it is not always possible to distribute the load to more than one system (for example, client X in system A, but when A exceeds the load; only option is to split client X into system A and B; but client X can’t be split due to data integrity or lack of common layer that can interact with both the nodes and perform the join/merge operations)

Both has its advantages and dis-advantages; but if NoSQL adopts strict consistency model as that of SQL; then it is hard to scale in the distributed environment where that architecture is more or less demanded by many big web applications where the scalability comes only by distribution and consistency has to be sacrificed to some extent to get the blend of performance + scalability + high availability

March 28, 2010

Dell MD1120 Storage Array Performance

Here is some file IO performance numbers from DELL MD1120 SAS storage array. Last year I did the same test with HP P800 storage array and numbers were impressive. But when it comes to this high end storage array, few surprises.  Before getting into actual details; lets see the test stats and configuration details.

System Configuration:

  1. DELL R710 with CentOS 5.4
  2. NOOP IO Scheduler
  3. MD1120 with 22 10K SAS disks
    • 20 disk RAID-10 (hardware)
    • 2 hot spares
    • Disk Cache disabled
  4. PERC 6/E RAID controller with BBU
    • Connected to DELL MD1120 using SAS
    • Write Back
    • Read Cache Disabled

Test Configuration:

  1. Sysbench fileio test with variable modes and threads
  2. 64 files with 50G total size
  3. All tests ran in un-buffered mode (O_DIRECT) as most of the workload is InnoDB based.

Test Results:

Number of Threads vs Number of Requests/Sec. Every mode ran with 5 iterations and average is taken.

Random IO:

rndio

Sequential IO:

seqio 

HDPARM Test:

[test~]# for i in `seq 1 3`; do hdparm --direct -tT /dev/sdc1; done | grep Timing
 Timing O_DIRECT cached reads:   2068 MB in  2.00 seconds = 1033.21 MB/sec
 Timing O_DIRECT disk reads:  2146 MB in  3.00 seconds = 715.32 MB/sec
 Timing O_DIRECT cached reads:   2020 MB in  2.00 seconds = 1010.26 MB/sec
 Timing O_DIRECT disk reads:  2162 MB in  3.00 seconds = 720.62 MB/sec
 Timing O_DIRECT cached reads:   2052 MB in  2.00 seconds = 1025.90 MB/sec
 Timing O_DIRECT disk reads:  2128 MB in  3.00 seconds = 709.17 MB/sec
 
[test ~]# for i in `seq 1 3`; do hdparm -tT /dev/sdc1; done | grep Timing
 Timing cached reads:   18920 MB in  2.00 seconds = 9475.34 MB/sec
 Timing buffered disk reads:  3442 MB in  3.02 seconds = 1141.44 MB/sec
 Timing cached reads:   19332 MB in  2.00 seconds = 9681.56 MB/sec
 Timing buffered disk reads:  3478 MB in  3.00 seconds = 1159.24 MB/sec
 Timing cached reads:   18012 MB in  2.00 seconds = 9019.50 MB/sec
 Timing buffered disk reads:  3492 MB in  3.02 seconds = 1155.53 MB/sec

Analysis:

  1. Overall the numbers are not bad when it comes to writes, but few surprises when it comes to reads. When compared with HP’s P800 storage array, the numbers still dropped by 20%.
  2. Radon IO:
    • Random write requests ranges from 3200-5000 per sec; due to write back mode (512M cache)
    • Writes are linearly scaling well with the threads, good sign that controller is able to manage the cache efficiently
    • Random reads and writes (rndrw) is also scaling linearly with the threads load, means the IO distribution and cache burst to satisfy reads seems be efficient as it needs to flush the data from controller cache to disk before the read can be satisfied.
  3. Sequential IO:
    • Writes seems to be scaling well even in sequential mode without much overhead
    • When it comes to reads, big surprise is drop from 5626 requests/sec to 615 from one thread to two threads. Which is really odd. Worst case it should be ~2000-3000 requests/sec; not sure where the overhead is. I can’t believe it could be thread scheduling as there is only 2 threads.
  4. During 100% IO, on and off I noticed IO serialization with higher queue waits, which indicates that there is some degree of serialization overhead in OS; but not able to track which layer is triggering this. Tried with cfq/deadline, still the same.
  5. Next attempt will be replacing 3Gb/s SAS to fiber channel HBA or 6Gb/s SAS (PERC H800) to see how it performs along with combination of HW and SW raid instead of only depending on controller.

March 21, 2010

Hyper Threading Performance

Its been a while anyone talked about Intel’s Hyper-Threading performance when it comes to databases. There were enough posts about disabling Hyper-Threading  completely when it comes to MySQL/InnoDB workloads way back when we had enough issues with scalability of InnoDB on multi-core systems. But things has changed quite a bit in the fast year or two in terms of multi-core support (thanks to Innobase/Mark Callaghan/Google and Percona). I still see lot of production servers running with HT disabled completely either in BIOS (append noht to kernel parameter) or manually disabling the CPUs (echo 0 > /sys/devices/system/node/node[0-1]/cpu[2-..]/online ).

Last few weeks or so; I ran quite a few tests on new production launch with 8-core CPUs and decided to go ahead with HT enabled (along with Intel Turbo mode) as it seems to give close to 15-20% performance gain. Ran tests which are completely InnoDB CPU/memory bound and HT seems to enable much better performance in every test mode.

Also tried with different thread schedulers with IO bound loads; still HT seems to give much better results; especially when most of the workload seems to be with 8-50 threads; and HT seems to give much better boost in that range; so no point to disable HT.

I tested both Intel Xeon X5570 and E5530 and here is the test results for CPU bound work load.

e5530

x5570

It is a clear indication that HT is not a bottleneck. In both the cases, turbo boost is enabled though.

As you can also notice; Intel Xeon X series CPUs yield much better results than E series (obvious as X-series is meant for performance where as E is for economy), which is close to $1000 extra, but yields ~15-20% performance for InnoDB workloads. You can find the comparison between these CPUs from Intel. If you are buying any new system and if you can spare extra $$; then always go for X series as its worth the price for performance