April 16, 2008

Notes from Falcon from the beginning

Here is the quick notes from the session Falcon from the beginning by Jim Starkey and Ann Harrison

  • Why Falcon
    • Hardware is evolving rapidly, world is changing, so taking advantage
    • Customers need ACID transactions
  • Where hardware is going
    • CPUS breed like rabbits (more sockets, cores, threads/core)
    • Memory is bigger, faster and cheaper
    • Disks are bigger and cheaper but not much faster
    • In general boxes are getting cheaper
  • Where applications are going
    • batch - dead
    • timesharing - dead
    • departmental computing - dead
    • client server - fading fast
    • application servers for most of us
    • web services for the really big buys
  • Database Challenges
    • Traditional challenge
    • exhaust CPU, memory and disk simultaneously
  • Tradeoffs
    • use memory to page cache to avoid disk reads
    • record cache to avoid page cache manipulation
    • use CPU to find the fastest path to record
    • use CPU to minimize record size
    • Synchronize most data structures with user mode read/write locks
    • Synchronize high contention data structures with interlocked instructions
  • Architecture
    • Incomplete in-memory db with disk backfill
    • Multi-version concurrency control in memory
    • Updates in memory until commit
    • Group commits to a single serial log write
    • post-commit multi-threaded pipe line to move updates to disk
  • Incomplete in-memory database
    • records cached in memory
    • separate cache for disk pages
    • record cache hits 15% the cost of a page cache hit
    • record cache is more memory efficient than page cache
  • Record Encoding - cache efficiency
    • records encoded by value, not declaration
    • string “abc” occupies the same space in varchar(3) or varchar(4096)
    • the number 7 is the same where small, medium, int, bigint, decimal or numeric
  • MVCC
    • update ops create new record versions
    • new one is tagged with id, points to old version
    • keep tracks which
  • Updates are in memory
    • held in memory pending commit
    • index changes held in memory
    • verb rollback is dirt cheap
    • trxs rollback is dirt cheap
  • At commit
    • pending record updates flushed to serial log
    • pending index updates flushed to serial log
    • commit record written to serial log
    • serial log flushed to the oxide
    • and trx is also committed
  • Memory is infinite, so
    • large txns chills uncommitted data (flushes it to the log early)
    • chilled records can be thawed
    • scavenger garbage collects unloved records periodically
    • when things get really had, entire record chains flushed to backlog
  • Weakness
    • transactions are ACID but not serializable
    • latency advantage disappears at saturation
    • very large transactions degrade performance
    • optimized for web, not batch
  • Strengths
    • runs like a memory db when data fits
    • scales like disk-based db when db doesn’t fit in cache
    • lowest possible latency for web apps
    • absorbs huge spiky loads
  • Performance
    • benchmark against InnoDB vs Falcon only
    • DBT2 benchmark (what about sysbench?)
    • High contention
    • Writes intensive - 40% records touched are updated
    • measures only performance at saturation
  • DBT2 is InnoDB’s best spot and Falcon’s worst, so do not take benchmark results, decide on what you want
  • When should you use what ?
    • don’t need ACID ? then MyISAM is good
    • single processor, small memory - InnoDB is good
    • large transactions, batch inserts/updates, InnoDB is good
    • multi cores, more memory, more threads , use Falcon
    • For web, Falcon is hard to beat

Notes from Scaling MySQL - Up or Out

Here is the quick notes from the session Scaling MySQL - Up or Out ? moderated by Kaj Arno as part of the todays keynote.

Here is the list of panelists are ordered by Alexa ranking.

  1. Monty Taylor (MySQL)
  2. Matt Ingerenthron (Sun)
  3. John Allspaw (Flickr)
  4. Farhan Mashraqi (Fotolog)
  5. Domas Mituzas (Wkipedia)
  6. Jeff Rotheschild (Facebook)
  7. Paul Tuckfield (YouTube)

Here is the list of questions and answers from panelists:

 How many servers Number of DBAs How many web servers Number of caching servers Version of MySQL Language, platform Operating System
MySQL

1 M, 3 S

1/10

2

2

5.1.23

Perl,php and bash

Linux fedora

Sun

2 clustered, 2 individual

1.5

160+

8

5.0.21

Lots of stuff (java mostly)

Open Solaris

Flickr

166

At present 0

244

14

5.0.51

Php and some Java

Linux

Fotolog

140 databases on 37 instances

10 instances a DBA

70

40 ( 2 on each, 80 total)

4.11 and 4.4

Php, 90% Java

Solaris 10

Wikipedia

20

None, but everybody is kind of a  DBA

70+200

40 ( 2 on each, 80 total)

Â

Php, c++, python

Fedora / Ubuntu

Facebook

30000 databases, 1800 db servers

2

1200

805

5.0.44 with relay log corruption patch

Php, python, c++ and erlang

Fedora / RHEL

Youtube

I can not say

3

I can not say

I can not say

5.0.24

Python

SuSE 9

Few more misc questions …

Number of times re-architected ?

  • My: 2 times - 1 time slave, 1 time memcached
  • SN: site depend (many times over the year)
  • FR: 2.5 (various clusters federated)
  • FL: many cached replacements (about to do one change now)
  • WK: Never (Spaghetti)
  • FB: Every Tuesday, continual
  • YT: Pretty continual, 2-3 times (replication, sharding, federation)

What happens if server fails ? what actions you will generally take ..

  • FR: All of our servers are federated, pairs of servers, we can loose any one side of shard, we can loose boxes.. traffic goes to either side of shard, now it goes to one, and we will get another one (very transparent to user) 
  • WK: Users shout at them on IRC then they moderate … fixed in seconds
  • FB: one of 1800-1900 will always fail, just operate well, minor impact, with data going away for a while…we restore from binlog and start the server quickly, promote slave to master and number of ways
  • FL: we simply mount the snapshots to different servers and get
  • YT: SAN etc, very important data.. recover the server, mirrored disk …mirrored hard drive is crucial

Any recommendation of scaling technology that you wanted to bring

  • FL: UltraSPARC-T1 (excellent master, multi threaded) and UltraSPARC-T2 for slave (single threaded)
  • WK: good network switch
  • FB: cheap switch causes problems and learned lessons, we do not use SAN, neatly partitioned, they scale independently and fail independently
  • MY: cluster very sad

Server virtualization ?

  • nobody uses at this time
  • FB: ETL cluster, we may run more than one in the future

Anything to worry at present ?

  • FB: app design is the key to use resources, data center power supply and consumption
  • FL: Google has to approve for our lab power (cut app servers by 1/2 by moving from php to java)
  • YT: not at all

Any reco, lessons to DBA

  • better you know what the systems are, then you can
  • performance, scaling taking it serious
  • nothing more permanent than temp solutions (if you don’t know when you will fail, then you will )
  • architect properly in start, schema, cost of serving data

April 15, 2008

Notes from InnoDB status, architecture and new features

Here is the notes from “InnoDB status, architecture and new features” by Heikki Tuuri and Ken Jocobs.

  • Introduction
    • fast
      • row-level locking, MVCC -> high concurrency & throughput
      • high performance CPU, memory and I/O architecture
      • efficient indexing (covering)
    • reliable
      • automatic crash recovery
      • integrated referencial integrity and transactions
      • online backup
      • well written, well tested and large user community
  • New in 5.1
    • plugin 1.0 announced
    • can be usable with partitioning, row based replication
    • reduced “next-key” locking with READ COMMITTED
    • optimized create/drop index
    • optimized create/drop of secondary keys
  • Auto-increment in 5.1
    • Auto increment usedto lock around the counter till SQL statement exists in the engine
    • new in 5.1 lighter-weight locking controlled by InnoDB_autoinc_lock_mode
      • 0 - traditional
      • 1 - consecutive
      • 2 - interleaved, faster
  • Plugin new features
    • fast index creating (add/drop without copying the whole table data)
    • data compression
    • new row format (off-page storage for BLOB, TEXT and VARCHAR)
    • Dynamically installed without re-linking
    • can downgrade to standard edition at any time
    • forums.innodb.com for support and QA
  • InnoDB plugin setup
    • edit my.cnf to have skip_innodb, innodb_file_per_table, innodb_file_format=Barracuda
    • start mysql sever
    • login to server and issue the following set of two commands (INSTALL PLUGIN INNODB SONAME ‘ha_innodb.so’ and
      INSTALL PLUGIN INNODB_LOCKS SONAME ‘ha_innodb.so’ )
    • SHOW PLUGINS to make sure InnoDB is listed
  • Fast Index Create
    • just builds the new index rather than whole table
    • sorts data on secondary key
    • inserts rows into the index
    • much faster as it does not re-create the table
    • DROP INDEX for secondary is much faster, data dictionary change only
    • Need to recreate the whole table if its primary key
    • use one alter table to recreate several secondary indexes
    • adding/dropping foreign keys still needs re-building of tables
    • benchmark: 3G table add index used to take 88 minutes, but now just takes 8 minutes (5.1)
    • Faster index scans
  • Concurrency
    • add/drop index must wait for active updates
    • create of secondary index locks table in shared mode
    • create/drop of primary key index needed exclusive lock
  • Compression
    • compressed page size per table KEY_BLOCK_SIZE=8 (8 & 4)
    • Need innodb_file_per_table=1 and innodb_file_format=Barracuda
    • uses zlib
    • patterns in data will define compression ratio (50%)
    • Innodb can predict if 16KB page will compress to 8KB
    • compresses all data in table and indexes
  • Compression and buffer pool
    • caches compressed (disk) pages in the buffer pool
    • if page is frequently used, keeps it uncompressed
    • LRU algorithm balances memory use to save CPU
    • with CPU bound, do not use compression
  • When to use compression
    • when IO bound
    • compression means more data pages in the buffer pool
    • do not compress small and frequently used tables
  • When not to use compression
    • when it is CPU bound
  • How to monitor compression
    • using INFORMATION_SCHEMA.Innodb_CMP table
  • File format
    • before pre-plugin model, by default it is Antelope
    • now it can be controlled using innodb_file_format
    • Barracuda enables compression, ROW_FORMAT=Dynamic
    • Existing engine can access Antelope but not Burracuda
  • Information schema now has a way to query Innodb locks (INSTALL PLUGIN INNODB_LOCKS
  • Truncate table reclaims .ibd file space
  • Innodb strict mode respects SQL Syntax (innodb_strict_mode)

Try the new plugin today (well I will do it over the weekend anyway)

Notes from Online MySQL Backup in 6.0

Here is the notes from MySQL online backup in 6.0 session by Dr. Charles A. Bell, the Author of MySQL Expert book.

  • Introduction
    • SQL Driven, run from any client
    • backup to localhost
    • new security privileges for backup/restore
    • blocking restore
    • non-blocking backup for Innodb, Falcon etc
  • Overview
    • protection and restore
    • basic functionality of backup and restore available now
    • db level -> table, views, sps, functions, … are supported
    • 6.0.5 has it already, try it
    • source code is another way to look into it
  • Design
    • ent level consistency between engines
    • default driver for engines that don’t support backup
    • consistent snapshot
    • storage engine specific backup methods
    • mix logical and physical backup
    • streaming backup data
  • Future
    • pluggable, modular architecture
    • versioning for backward compatibility
  • State of development
    • ent level consistence
    • default blocking
    • consistent
    • kernel synchronization algorithm
    • backup driver API
    • restore driver API
  • Implemented
    • storage on server host
    • dx transfer protocol already built
    • streaming format
    • MyISAM naive
    • Table Space support for Falcon
    • NoData engine backup (blackhole, merge etc)
    • synchronization with binary log for point-in-time recovery
  • Compare with existing solutions
    1. non-blocking tools
      • mysqldump with single transaction
      • cluster backup
      • replication
      • InnoDB-hot backup (commercial)
    2. blocking tools
      • mysqldump
      • mysqlhotcopy
      • native file system copy
      • select .. INTO OUTFILE
    3. commercial
      • zmanda
      • bakbone
      • other
  • Architecture:
    • backup kernel .. part of kernel
    • backup engine .. backup driver and restore driver
    • default backup or restore driver
    • native backup or restore
  • Interfaces
    • sql control interface (simple sql statements)
    • backup engine interface
  • Backup Kernel responsibility
    • execute backup and restore sql statements
    • backup / restore metadata
    • initialize and co-ordinate work of backup/restore
    • write/read backup archive
  • Capabilities
    • backup database {db_list} TO {image}
    • restore from { image }
    • progress logs (currently tables, online_backup, online_backup_progress)
  • Future plans
    • cover all storage engine
    • plugin architecture
    • backup to other media
    • std tool
    • full-server, database and enhanced
    • incremental
    • Limitations
      • single threaded
      • synchronous
      • no optimization at this time, simple go and do it.
      • integrated schema and data (no way to split at this time)
      • NDB cluser is not supported, use native

April 14, 2008

Notes from MySQL Cluster

Here is the quick notes from MySQL Cluster by Stewart Smith session that I attended today in the MySQL conference 2008.

  • What it is
    • Clustering of in memory databases in a shared-nothing system
    • Designed for HA, 99.9% Uptime (not really) and sub-second failover
    • Supports Hot (online) consistent backup along with compression
    • No locks are used during the backup as NDB uses a global counters
  • Think as virtual file system
  • Redundancy
    • NoOfReplicas (1,2,3,4), For production use 1 or 2 but don’t use any other as it may have bugs (2 is preferred and well tested mode)
    • 1 means no redundancy (a node fails and cluster fails)
    • 2 means two copies
  • High performance (due to parallelism)
  • In memory only
    • Both data and index needs to fit in memory
    • 5.1 optionally supports to have non-index stuff to disk
    • Rows will have in-memory part and disk part
  • Durability
    • Checkpoint to disk (frequency is configurable, default 2 secs?)
    • Rate is limited to avoid IO
    • Cluster will restore from last known Global Check Point (GCP)
    • Any comitted transactions that are not part of checkpoint will be lost
  • Node Types
    1. Data Nodes (ndbd)
      • Grouped into nodegroups
      • NoOfReplicas is number of nodes in nodegroup
      • Up2 48 in one cluster
      • Interact using regular SQL or C++ NDBAPI or using other langauge binders
      • For each transaction, one is used in Transaction Coordinator and load balanced around the cluster
      • Node failure, application has to retry the transaction
      • All MySQL servers connected to server have the same view of the database
      • Replication between data nodes is synchrounous as opposed to standard asynchrounous MySQL replication, uses two-phase commit
      • ndbd is single threaded (thats why it is not CPU bound), multithreading coming soon
      • Compressed backups, checkpoint might use other CPUs
      • Multiple nodes for SMP
      • Every table should have a Primary key, else 64-bit int will be internally used (same as InnoDB)
      • Uses hashkey(PK) to find the right partition
    2. Management Server (ndb_mgmd)
      • Interact with management client (ndb_mgm) or C management API (mgmapi)
      • distributes configuration to other nodes
      • Backup is started by management server
      • Can have more than one for redundancy
      • Stopping management server will not break the existing cluster
      • It has to be running when nodes join the cluster
      • Backing up of config.ini is very much recommended
      • Issues any commands to nodes (start, stop, restart)
    3. SQL Node (mysqld)
      • MySQL Server
      • Many API/SQL nodes are needed to load storage nodes
      • dependent on application and types of queries
      • Is multi-threaded, takes advantage of SMP
  • Collection of nodes constitute Cluster
  • Requirements
    • Node is a process not a physical computer
    • Atleast 3 physical machines needed for HA
      • Avoid split brain problem (network down, node down etc)
      • In case of split brain problem, nodes will be shutdown forecefully and all comitted transacations will be lost and will be restored only to last known checkpoint.
      • With only two, its hard to know if it has relly failed or network connection has failed
      • You probably need more than 3 machines though
    • ndbd is single threaded, and future version might be multi-threaded
    • SQL node is multi-threaded (ofcourse)
    • Management server is not CPU intensive (almost 0 CPU)
    • Data Nodes need lots of memory, Disk I/O can be calculated and generally not CPU bound (as its single threaded)
    • Memory needed for Index and Data can be calculated using a tool provided (ndb_size.pl)
  • Setup
    • Configure the nodes
    • Management server needs to be started first (ndbm_mgmd -f config.ini)
    • Data Nodes
      • on each storage node ndbd -c IP
      • Can have more than one on the same physical box
    • SQL Nodes
      • Make sure NDBCLUSTER option is enabled (command line or config)
      • Make sure the connect string is specified (command line or config)
      • start the server
    • Monitor (using ndb_mgb -c localhost:9350 -e ’show’)
  • Using the Cluster
    • Tables can be stored in cluster with ENGINE=NDBCLUSTER
    • Supports views, stored procedur, triggers etc
    • Permissions needs to exists in each SQL Node (mysql.user*)
  • Limitations:
    • prior to 5.1; everything in memory (both index and data)
    • 5.1, non-index can be to disk
    • Foreignkey constraints are not yet supported
    • maximum of 255 nodes can belong to a single MySQL Cluster with 48 of being data nodes
    • No dynamic expansion and needs to be restarted to introduce new nodes in the cluster
    • ndbd is single threaded

April 13, 2008

Finally iPhone works

Finally I found a solution to all my problems with the iPhone…After contacting the Apple iPhone technical support team over the phone (1-800-My-iPhone), they found that it could be an issue with the hardware and asked me to stop by to the nearby apple store. Before doing that they actually asked me to restore the phone and iTunes wiped everything and it failed to restore back to the old state. Even though the iTunes recognised the device; but it is not letting to perform any sync or restore operation. The local store guys with some basic questions, replaced with a new one (not really sure brand new or refurbished, but at least it does not have any refurbished marks).

Now no crashes when I connect the device to the PC nor it skips the tracks while playing the songs. Good to go and lets wait and see what happens in the coming days ! It will be really great if it can work till the 3G one is released.

Dreamhost + Google apps

Today when am creating one of sub-domain using the web panel of dreamhost, I noticed that dreamhost is now offering Google apps hosted services, which includes Gmail, calender, chat, sites, docs and page creator.

This is bit interesting as you get to configure unlimited email storage when you use dreamhost provided email service (even though the default web based one sucks); where as Gmail only offers 6GB and the same thing continues as standard edition of Google apps is limited in number of ways, including the uptime of email service is not guaranteed. So, am wondering why anyone should use Google apps instead of std services.