April 15, 2010

RAID Controllers Cache Management – Missing Features

We all know how important hardware RAID controllers are to today’s data storage performance, especially when dealing with large data sets. Over the past couple of years they have evolved rapidly and gained a lot of useful features. Their usage has grown too, as most new servers ship with one or two controllers built in (one for internal storage and another for an external storage array or for redundancy).

A few popular RAID controller vendors in the market:

More or less every vendor supports all the common features; they differ in number of ports, protocol support (iSCSI, SATA, SAS, HBA/FC), transfer speed, RAID levels, total number of disks supported, cache size and cache management.

Controller Cache – Database Workloads

For IO-bound database OLTP workloads, the controller cache plays a crucial role in overall read or write throughput, depending on how the cache is used. Most RAID controllers are equipped with 128MB, 256MB or 512MB of cache, and newer controllers like the HP Smart Array P812 support 1GB.

Write-back mode can improve write performance by an order of magnitude, because a write request is reported as completed as soon as the data is in the controller cache, before it is actually written to disk. That is why the controller needs a BBU (battery backup unit): so that no data is lost on power failure.

If you enable read ahead on the controller (sometimes good for OLAP or ETL data warehouse workloads with heavy sequential access, especially adaptive read ahead), the same cache is used to store the pre-fetched data, so later reads can be satisfied from the cache without hitting the disk. But if the database system does its own read ahead (like InnoDB), it is better to turn off read ahead on the controller to avoid page thrashing.

For some workloads, the controller cache can even hurt performance if the controller does not use it properly.

Missing Cache Management Tools

At present, none of the controllers supports any cache management tools or exposes how the cache is actually being used, so one cannot adjust the cache to the workload for improved performance.

Some of the missing features:

  • A way to flush data from cache to disk, so that a system can be taken offline for maintenance. Right now there is no easy way to do this; at best, some controllers indicate through an LED whether there is data in the cache
  • A way to set a cache threshold, in time or %, so that the controller starts flushing to disk once the threshold is reached. For example, if the RRD graphs show big spikes every few minutes, one could adjust the threshold to distribute the load evenly
  • Cache usage statistics (write data size, read ahead data size etc.), so that the workload can be adjusted to yield much better results
  • Splitting the cache between reads and writes, either by size or by %, so that they do not overlap and cause performance issues. For example, one might prefer 20% for read ahead data and 80% for writes. Only HP Smart Array controllers support this feature at present
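
The threshold and read/write split ideas from the list above can be illustrated with a toy model. This is only a sketch under stated assumptions: `WriteBackCacheModel` and `split_cache` are hypothetical names, not any controller's API, and the 8 MB write size is an invented workload. It shows how a lower flush threshold turns a few large bursts to disk into more frequent, smaller ones.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBackCacheModel:
    """Toy model of a write cache with a dirty-data flush threshold (hypothetical)."""
    write_cache_mb: float        # portion of the cache reserved for writes
    flush_threshold_pct: float   # flush once dirty data reaches this share
    dirty_mb: float = 0.0
    flushes: list = field(default_factory=list)

    def write(self, size_mb: float) -> None:
        """Accept a write into the cache; flush everything if the threshold is hit."""
        self.dirty_mb += size_mb
        threshold_mb = self.write_cache_mb * self.flush_threshold_pct / 100.0
        if self.dirty_mb >= threshold_mb:
            self.flushes.append(self.dirty_mb)  # size of the burst sent to disk
            self.dirty_mb = 0.0


def split_cache(total_mb: float, read_pct: float) -> tuple:
    """Split the cache between read ahead and writes by percentage."""
    read_mb = total_mb * read_pct / 100.0
    return read_mb, total_mb - read_mb


if __name__ == "__main__":
    # 512 MB cache, 20% for read ahead and 80% for writes, as in the example above.
    read_mb, write_mb = split_cache(512, 20)
    for pct in (80, 20):
        model = WriteBackCacheModel(write_mb, pct)
        for _ in range(100):
            model.write(8)  # 100 writes of 8 MB each (invented workload)
        print(f"threshold {pct}%: {len(model.flushes)} flushes, "
              f"largest burst {max(model.flushes):.0f} MB")
```

In this model the 80% threshold produces two bursts of 328 MB, while the 20% threshold produces nine bursts of 88 MB, which is the kind of smoothing a tunable threshold would allow.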

The more control you get over the controller cache, the more you can tweak it to suit the workload and improve performance. Hopefully one day all vendors will expose more cache management options.

April 11, 2010

Data Store, Software and Hardware – What is best

The other day we had a small discussion about data stores and hardware, and which one drives the other when it comes to data storage solutions. It is a hard discussion, as each is a big topic on its own, and one cannot easily reach a conclusion because it depends on the use case. Practically speaking, it is the data store’s limitations that drive the need for more powerful hardware to meet demanding scalability needs.

We all know how important hardware is to today’s data scalability, especially when dealing with large data sets. Without capable hardware it is hard to scale even with a powerful data store, whether it is SQL (row or columnar), NoSQL (key/value or otherwise) or any other data storage solution, because data stores are limited by their data structures and implementation, and lately their performance depends directly on the hardware.

At times, data store vendors claim to have a scalable, high-performance architecture; what that means is that the solution is built directly on top of hardware scalability and performance, taking advantage of today’s evolving hardware technology. Also, hardware has evolved far more aggressively in recent years than data store solutions, because its market is much larger: hardware is everywhere, not just in storage solutions.

In short, when a data store’s performance is directly proportional to hardware performance, it means the data store has overcome all of its software performance bottlenecks (algorithms, decision making, data structures etc.). Overcoming software bottlenecks is not easy, as requirements change day by day and performance depends on data size and on how the data is actually:

  • stored
  • retrieved
  • processed and
  • maintained

If data is stored in and retrieved from memory or another non-persistent storage solution, one does not need to worry about the rest, or about performance, as it yields the best throughput. But memory or non-persistent storage only works for smaller data sets, not for large data sets that run into terabytes.

Other than the newly evolving columnar data stores (I have yet to see one that is gaining the universal acceptance of Oracle/SQL Server/MySQL), NoSQL, or big data warehouse solutions (like Aster Data, Greenplum etc.), none of the existing solutions really takes advantage of the latest hardware or even the latest data structures, as most data store kernels were written years ago. In today’s world, the only option for scalability is to depend on the hardware and to distribute the load across multiple systems (shared-nothing, shared-common or even the “cloud” way…).

Hoping to see a solution one day that actually bridges the gap between data store, hardware and scalability: one single solution that can be universally adopted, instead of having to use multiple technologies for common use cases. Brian Aker, in his recent interview, and Baron share the same thought.

March 28, 2010

Dell MD1120 Storage Array Performance

Here are some file IO performance numbers from a DELL MD1120 SAS storage array. Last year I ran the same test with an HP P800 storage array and the numbers were impressive. But this high-end storage array brought a few surprises. Before getting into the actual details, let’s look at the system and test configuration.

System Configuration:

  1. DELL R710 with CentOS 5.4
  2. NOOP IO Scheduler
  3. MD1120 with 22 10K SAS disks
    • 20 disk RAID-10 (hardware)
    • 2 hot spares
    • Disk Cache disabled
  4. PERC 6/E RAID controller with BBU
    • Connected to DELL MD1120 using SAS
    • Write Back
    • Read Cache Disabled

Test Configuration:

  1. Sysbench fileio test with variable modes and threads
  2. 64 files with 50G total size
  3. All tests ran in un-buffered mode (O_DIRECT) as most of the workload is InnoDB based.
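
As a rough sketch of how such a test matrix could be scripted, the snippet below builds the sysbench command lines without running them. The file count, total size, O_DIRECT flag and the 5 iterations come from this post; the sysbench 0.4-style option names, the list of modes and the thread counts are assumptions, since the exact values used are not listed here.

```python
from itertools import product

# Values taken from the test configuration above.
FILE_NUM = 64
FILE_TOTAL_SIZE = "50G"
ITERATIONS = 5

# Assumptions: illustrative modes and thread counts, not the exact ones used.
MODES = ["seqwr", "seqrd", "rndwr", "rndrd", "rndrw"]
THREADS = [1, 2, 4, 8, 16]


def sysbench_cmd(stage, mode, threads):
    """Build one sysbench fileio command line as an argument list."""
    return [
        "sysbench", "--test=fileio",
        f"--file-num={FILE_NUM}",
        f"--file-total-size={FILE_TOTAL_SIZE}",
        f"--file-test-mode={mode}",
        "--file-extra-flags=direct",  # O_DIRECT, bypasses the OS page cache
        f"--num-threads={threads}",
        stage,                        # prepare | run | cleanup
    ]


def run_plan():
    """Yield (mode, threads, iteration, command) for every benchmark run."""
    for mode, threads in product(MODES, THREADS):
        for iteration in range(1, ITERATIONS + 1):
            yield mode, threads, iteration, sysbench_cmd("run", mode, threads)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for mode, threads, iteration, cmd in run_plan():
        print(" ".join(cmd))
```

Printing the plan first makes it easy to review the matrix before handing each command to something like `subprocess.run`.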

Test Results:

Number of threads vs. number of requests/sec. Every mode was run for 5 iterations and the average was taken.

Random IO:

[Figure: random IO, number of threads vs. requests/sec]

Sequential IO:

[Figure: sequential IO, number of threads vs. requests/sec]

HDPARM Test:

[test~]# for i in `seq 1 3`; do hdparm --direct -tT /dev/sdc1; done | grep Timing
 Timing O_DIRECT cached reads:   2068 MB in  2.00 seconds = 1033.21 MB/sec
 Timing O_DIRECT disk reads:  2146 MB in  3.00 seconds = 715.32 MB/sec
 Timing O_DIRECT cached reads:   2020 MB in  2.00 seconds = 1010.26 MB/sec
 Timing O_DIRECT disk reads:  2162 MB in  3.00 seconds = 720.62 MB/sec
 Timing O_DIRECT cached reads:   2052 MB in  2.00 seconds = 1025.90 MB/sec
 Timing O_DIRECT disk reads:  2128 MB in  3.00 seconds = 709.17 MB/sec
 
[test ~]# for i in `seq 1 3`; do hdparm -tT /dev/sdc1; done | grep Timing
 Timing cached reads:   18920 MB in  2.00 seconds = 9475.34 MB/sec
 Timing buffered disk reads:  3442 MB in  3.02 seconds = 1141.44 MB/sec
 Timing cached reads:   19332 MB in  2.00 seconds = 9681.56 MB/sec
 Timing buffered disk reads:  3478 MB in  3.00 seconds = 1159.24 MB/sec
 Timing cached reads:   18012 MB in  2.00 seconds = 9019.50 MB/sec
 Timing buffered disk reads:  3492 MB in  3.02 seconds = 1155.53 MB/sec
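
To compare the runs at a glance, the three passes of each hdparm test above can be averaged. The following is a minimal sketch: the numbers are copied from the output above, while the parsing helper `average_rates` is a hypothetical name introduced for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# hdparm output copied from the two runs above.
HDPARM_OUTPUT = """\
 Timing O_DIRECT cached reads:   2068 MB in  2.00 seconds = 1033.21 MB/sec
 Timing O_DIRECT disk reads:  2146 MB in  3.00 seconds = 715.32 MB/sec
 Timing O_DIRECT cached reads:   2020 MB in  2.00 seconds = 1010.26 MB/sec
 Timing O_DIRECT disk reads:  2162 MB in  3.00 seconds = 720.62 MB/sec
 Timing O_DIRECT cached reads:   2052 MB in  2.00 seconds = 1025.90 MB/sec
 Timing O_DIRECT disk reads:  2128 MB in  3.00 seconds = 709.17 MB/sec
 Timing cached reads:   18920 MB in  2.00 seconds = 9475.34 MB/sec
 Timing buffered disk reads:  3442 MB in  3.02 seconds = 1141.44 MB/sec
 Timing cached reads:   19332 MB in  2.00 seconds = 9681.56 MB/sec
 Timing buffered disk reads:  3478 MB in  3.00 seconds = 1159.24 MB/sec
 Timing cached reads:   18012 MB in  2.00 seconds = 9019.50 MB/sec
 Timing buffered disk reads:  3492 MB in  3.02 seconds = 1155.53 MB/sec
"""

# Matches e.g. "Timing O_DIRECT disk reads:  2146 MB in  3.00 seconds = 715.32 MB/sec"
LINE_RE = re.compile(
    r"Timing (?P<label>.+?):\s+\d+ MB in\s+[\d.]+ seconds = (?P<rate>[\d.]+) MB/sec"
)


def average_rates(text):
    """Return the mean MB/sec per test label, rounded to two decimals."""
    rates = defaultdict(list)
    for match in LINE_RE.finditer(text):
        rates[match.group("label")].append(float(match.group("rate")))
    return {label: round(mean(values), 2) for label, values in rates.items()}


if __name__ == "__main__":
    for label, rate in average_rates(HDPARM_OUTPUT).items():
        print(f"{label}: {rate} MB/sec")
```

On these numbers, O_DIRECT disk reads average roughly 715 MB/sec and buffered disk reads roughly 1152 MB/sec.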

Analysis:

  1. Overall the numbers are not bad when it comes to writes, but there are a few surprises when it comes to reads. When compared with HP’s P800 storage array, the numbers still dropped by 20%.
  2. Random IO:
    • Random write requests range from 3200 to 5000 per sec, thanks to write-back mode (512M cache)
    • Writes scale linearly with the number of threads, a good sign that the controller is able to manage the cache efficiently
    • Random reads and writes (rndrw) also scale linearly with the thread load, which means the IO distribution and the cache bursts needed to satisfy reads seem to be efficient, as the controller needs to flush data from its cache to disk before a read can be satisfied.
  3. Sequential IO:
    • Writes seem to scale well even in sequential mode, without much overhead
    • When it comes to reads, the big surprise is the drop from 5626 requests/sec to 615 when going from one thread to two, which is really odd. In the worst case it should be ~2000-3000 requests/sec; not sure where the overhead is. I can’t believe it could be thread scheduling, as there are only 2 threads.
  4. During 100% IO, I noticed IO serialization with higher queue waits on and off, which indicates some degree of serialization overhead in the OS, but I was not able to track down which layer is triggering it. I tried cfq and deadline as well; still the same.
  5. The next attempt will be to replace 3Gb/s SAS with a fiber channel HBA or 6Gb/s SAS (PERC H800) to see how it performs, along with a combination of HW and SW RAID instead of depending only on the controller.
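
To put the sequential read drop from the analysis in perspective, here is a small worked calculation. The request rates (5626 and 615 requests/sec) are from the results above; the 16 KiB request size is an assumption (it is sysbench's default file block size, and the block size actually used is not stated here), and both helper names are hypothetical.

```python
def drop_pct(before, after):
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100.0


def throughput_mib_s(requests_per_sec, block_size_bytes=16 * 1024):
    """Convert requests/sec to MiB/s for a given request size.

    The 16 KiB default is an assumption, not a value taken from the test.
    """
    return requests_per_sec * block_size_bytes / (1024 * 1024)


if __name__ == "__main__":
    one_thread, two_threads = 5626, 615  # sequential read requests/sec
    print(f"drop: {drop_pct(one_thread, two_threads):.1f}%")
    print(f"1 thread:  {throughput_mib_s(one_thread):.1f} MiB/s")
    print(f"2 threads: {throughput_mib_s(two_threads):.1f} MiB/s")
```

Under that assumption, the drop is roughly 89%, from about 88 MiB/s with one thread to under 10 MiB/s with two.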