Here is the quick notes from the session scaling heavy concurrent writes in real time by Dathan Pattishall. Its bad that he left Flickr i.e. Yahoo. Hopefully they will find a replacement, if not contact me I have few people who are interested.
- Who am I
- since 1999 working on mysql
- scaled many companies (FriendFinder, Friendster, Flickr, now RockYou)
- Favorites federation, partitioning, shards, RAID-10
- Requirements
- scale better
- store data forever
- associate time with the data
- allow for change
- keep it cheap
- and downtime is not an option
- Spread the data around
- federate
- all referrers are owned by the page owner
- spread data out by that
- but federate in a different direction
- add a new column to the global account lookup
- Did not work
- strings as PK was not good
- inserts slowed when the table grew larger than memory
- not enough I/O to handle double the load
- Start over
- converted URL into a 64-bit ID - conv(substr(md5(url),1,16),16,10) (Dathan had a typo in the function in his slides, thanks to Patrick who tried it when the session is running)
- 64-bit number, its unique
- reduced PK size to 8 bytes + owner, object, object-type
- InnoDB and strings
- index on string takes a lot of space
- each index has its own 16K page
- fragmentation across pages was hurting, chewing up I/O
- InnoDB and High Concurrency of string writes
- requirement: 200 ms for total db access for all apps
- writes gets slowed down when you cross buffer_pool_size
- 10 ms to 20 secs sometimes for full transaction
- replication keeping up
- Buffer the writes
- java daemon that buffers up to 4000 messages grouped to a transaction and apply serially
- does not use much memory or CPU
- written by Zen Java master
- Reduce the use of big strings
- Keep smaller amounts of data
- Use MyISAM for non-pro users and keep only X weeks of data
- MyISAM keeps 1/6th the size of InnoDB
- Migrate the data when they migrate
- Distributed locks
- GET_LOCK and RELEASE_LOCK on the same server for which a user operates on
Here is the quick notes from the session Helping InnoDB scale on servers with many cores by Mark Callaghan from Google (mcallaghan at google dot com).
- we have a team now, to help scale MySQL to do the enhancements (9 people, I hope yahoo management reads this)
- Overview
- describe the problems on big servers
- work done by InnoDB community
- ask MySQL/InnoDB to fix the problems by taking the patches
- Community team
- InnoDB/Oracle
- Google MySQL team
- InnoDB community
- Percona - Peter and Vadim
- Goal
- Fix bottlenecks on big SMP
- utilize servers with many disks
- support thousands of connections
- handle corruption in memory and on disk
- make query plans predictable
- make thousands of tables and accounts
- Keep InnoDB beautiful while making these changes
- Desirable features
- linear scalability with cores
- 128GB buffer cache
- with many disks and remote disks
- recovering from corruption
- CPU problems
- mutex implementation
- spin lock mutex uses pthreads rather than automatic
- RW-mutex uses the spin lock mutex
- Mutex hotspots
- buffer cache
- memory allocation
- transaction log
- adaptive hash latch
- Symptoms of CPU problems
- adaptive hash latch contention
- excessive mutex contention
- server has many queries, is slow and is not IO bound
- vmstat will report lot of idle time
- oprofile will show a lot of time spent in pthread functions
- Making InnoDB spin lock mutex fast
- replace pthread_mutex_trylock with CAS
- RW-mutex fast
- use atomic ops to change internal state
- use separate events to wake readers and writers
- More work to be done
- always release adaptive hash latch
- RW-mutex for transaction log
- replace malloc heap with scalable malloc (tcmalloc or mtmalloc)
- use atomix ops for rw-mutex
- reduce contention for the sync array mutex
- remove some counters and fields from mutexes
- platforms with > 8 cores
- transaction log mutex has contention
- buffer cache contention
- To support 128GB buffer cache
- data structures must scale
- walking a list with 8M page entries might be slow
- resources need to be a split
- more than one mutex might be needed for the buffer cache and LRU chain
- Detection of corruption is more important
- memory will be corrupted by software and hardware bugs
The Future of MySQL by Monty Widenius and Jay Pipes.
- Why this talk
- MySQL and Sun should become more transparent
- Easier to discuss and act when you have facts
- when user know the limitations, they know how to go around
- Threads
- one connection/thread doesn’t work well
- no priority threads
- no way to ensure we have X threads
- Symptoms
- Too many context switches
- we are not using multi-core efficiently
- does not scale that well after 4-8 cores
- Solution
- –thread-handling=pool-of-threads (6.0)
- Lots of work
- Memory as a resource
- no single memory allocator
- sort_buffer_size is not flexible
- memory engine has limitations (no varchar/blob)
- Metadata
- no online operations
- .frm files are used to generate table info
- no versioning
- they are not ACID
- slow information schema / SHOW command
- Privileges
- not modular or pluggable
- hard to extend
- Get community to add LDAP support and ask MySQL/Sun to add ROLE as high priority
- Pluggable storage engine
- storage engines are depending on internal mysql data structures (table, field)
- can only be used with the exact MySQL server version that is compiled
- think as loadable engine
- Items
- hold temporary values during evaluation
- are not re-entrant
- hard to make parallel execution of statement
- we cant cache prepared statements and stored procedures
- Parser
- state machine too large
- not pluggable
- not cacheable
- still uses Bison
- bad error messages
- Modularity
- very monolithic
- few defined interfaces
- server and libraries are not documented
- multiple execution paths
- no rewrite state for optimizer
- Stored procedures/triggers
- not cacheable across connections
- only supports SQL
- pre-locking for all tables
- all cursors are materialized
- trigger code is not shared across opened tables
- uses more memory
- hard to debug and profile
- Replication
- not fail safe
- no synchronous option
- no consistency checking
- setup and resync is complicated
- single thread on the slave
- no multi master
- only InnoDB synchronizes with the replication binary log
- Client/Server protocol
- only one running query/connection
- not all statements can be prepared
- not all language support prepared statement protocol
- multi-statements/sps with multiple results (need to add the support)
- Table names
- stored as files (.frm)
- file system can be a case sensitive or not
- falcon has its own interpretation of how things should be done
- Why Falcon and Maria
- we don’t have any transactional engine
- if one fails, another will succeed
- Phone home
- no clue for mysql/sun what features mysql users are using
- endless internals discussion
- what can be deprecated
- default values for future
- Open source project ?
- very hard to contribute code or ideas
- customers are using their production systems as bug-finders
- road map and decision making is not transparent
- Release policy
- constantly shipping releases
- 5.1 was declared RC way too early
- features are removed/added in RC code (hmm?, what’s wrong)
- Good news
- we know the problems and have a good clue of how to fix (most) of them
- sun is more open source / free software friendly than MySQL
- sun really understands developer
From Jeff Rothschild, VP of technology, Facebook
- The power of connectedness
- Impact on the database
- Our challengeÂ
- The power of connectedness
- photo tagging
- getting an email that someone tagged a photo in Facebook
- inter tagging between friends and it continues as a network
- outstanding growth because of the photo tagging
- 26B photos in archive now
- Most trafficked photos applications than anybody else
- events, invites impact on social graph
- Opening up the social graph to outside and create a platform and API, so other developers can make use of it (28K applications so far)
- Hypergrowth because of the platform (1M users in four days)
- 200 applications that has over million users
- circle of connectivity can make the globe (with positive feedback, performance etc)
- the social graph links everything
- and that means lot of work for database
- Impact on the database
- user -> connections -> objects
- interact with people, people has lot of data
- 100 friends, 100 friends will have 100’s of objects/friend…10’s of thousands of possible objects
- Horizontal partitioning supports parallel queries
- 50K+ requests per second now
- memcached can match speeds with the application
- multi-threading with multi-core to take advantage with 1G cards
- New C client which enhances run-time performance for PHP
- New binary protocol for efficiency both ends (app and memcached)
- Introduced memcached proxy to purge dirty records
- MySQL replication to replicate data between data centers
- Proxy is not a solution here due to replication
- Changed Insert statements with MEMCACHE_DIRTY key1, keys …
- Put more RAM in cache (95% hit rate is in cache)
- 20M requests and only 500K requests goes to MySQL server
- Optimizations
- memory + servers (reduce to save $$$)
- Tuning (moderate query scope, aggregate related data)
- Partitioning (selective caching, archiving stale data)
- Our Challenge
- MySQL and memcached is a powerful combination
- MySQL at the speed of memcached
- memcache performance for simple tables
- flash memory storage
- dramatic improvement in I/O
- supports higher miss rate
- allows higher data/ram ratio
- persistent “Cache”
- flash for memcache
- persistent distributed hash table could be a good fit
- structured storage
- not all data benefits
- integrated cache and persistent replication
- unmanaged distribution
- A strong community is our challenge
Here is the list of MySQL benchmarking tools that were discussed during the "Benchmarking tools" session today.
It could have been nice if the speaker included sysbench and DBT-2 or DBT-5 as well. May be we can expect it for next year
Here is the quick notes from the session “Architecture of Maria” from Monty Widenius, one of my all time favorite developer and founder of MySQL.
- Goals
- To create ACID complaint and Multi-version concurrency control (MVCC)Â transactional storage engine for MySQL
- To replace existing replacement for MyISAM and if possible the default transactional database
- Maria can run in both transactional and non-transactional, so its easy to replace existing transactional and non-transactional engines
- To create a storage engine which is good for data warehousing (DW) purposes
- Why Falcon and Maria
- two different architectures
- different in feature, compatibility, performance
- Target is for data warehousing
- Project plan
- 5.1 + Maria release
- 1.0 - crash-safe, released in jan, 2008
- 1.5 - concurrent insert/select, released in apr-may 2008
- 2008 it will be merged with 6.0
- 6.0 + Maria
- 2.0 - transactional and ACID (Q4-2008)
- 3.0 - high concurrency & online backup (Q1-2009)
- 3.0 - GA by Q2-2009
- 6.1 + Maria
- 4.0 - Data warehousing Q3/2009
- Technical goals
- MVCC
- ACID
- MyISAM replacement
- Internal temporary tables replacement
- All Indexes should have equal speed
- Allow any length transactions to work
- Log shipping (incremental backups, just by coping the logs)
- Can be used as standalone library
- copying of tables b/w different Maria servers
- Better blob handling
- Select count ( * )Â and table checksums
- Efficient storage on disk
- Small footprint
- behavior of transactional and non-transactional is dynamic
- Maria 1.0 is basically a crash-safe, non-transactional version of MyISAM with cacheable row format
- Primary goal is to make Maria of good quality
- Comparison to MyISAM
- Mostly everything is compatible
- repair, compression of rows
- different row formats (fixed size, dynamic, compressed + pages)
- fast count ( * )Â and checksum
- different index compress format
- after a clean shutdown, one can copy logs and build a new server
- Advantages
- Data, index and statistics are crash-safe
- Can replay everything from log
- include create/drop/rename/truncate tables
- backup of Maria by just coping logs
- Load index can skip index blocks for not wanted indexes
- It has unit tests that we missed it for MyISAM
- support both crash-safe, transactional
- non-txn will use less space
- supports all MyISAM row formats +PAGE format
- Differences to MyISAM
- uses big log files (1G by default)
- has log control file
- Uses 8K pages by default (MyISAM 1K)
- Disadvantages to MyISAM
- repair with many threads not yet
- Will not be fixed
- No external locking
- Maria has one page size for both index and data
- Maria has only one page cache
- Index number + checksums requires 5 extra bytes per page
- It does not support RAID
- Min data file size for BLOCK format is 16K
- Storage for very small rows (< 25bytes) is not optimal
- TODO
- Make indexes and statistics versioned
- Make fulltext (boolean mode) and R-tree index crash safe
- Parallel-recovery (rolling feature)
- Well documented from day-1 (architecture, faq, everything), maria@lists.mysql.com
- XDB indexes for Maria-4.0
Here is the quick notes from the session Falcon from the beginning by Jim Starkey and Ann Harrison
- Why Falcon
- Hardware is evolving rapidly, world is changing, so taking advantage
- Customers need ACID transactions
- Where hardware is going
- CPUS breed like rabbits (more sockets, cores, threads/core)
- Memory is bigger, faster and cheaper
- Disks are bigger and cheaper but not much faster
- In general boxes are getting cheaper
- Where applications are going
- batch - dead
- timesharing - dead
- departmental computing - dead
- client server - fading fast
- application servers for most of us
- web services for the really big buys
- Database Challenges
- Traditional challenge
- exhaust CPU, memory and disk simultaneously
- Tradeoffs
- use memory to page cache to avoid disk reads
- record cache to avoid page cache manipulation
- use CPU to find the fastest path to record
- use CPU to minimize record size
- Synchronize most data structures with user mode read/write locks
- Synchronize high contention data structures with interlocked instructions
- Architecture
- Incomplete in-memory db with disk backfill
- Multi-version concurrency control in memory
- Updates in memory until commit
- Group commits to a single serial log write
- post-commit multi-threaded pipe line to move updates to disk
- Incomplete in-memory database
- records cached in memory
- separate cache for disk pages
- record cache hits 15% the cost of a page cache hit
- record cache is more memory efficient than page cache
- Record Encoding - cache efficiency
- records encoded by value, not declaration
- string “abc” occupies the same space in varchar(3) or varchar(4096)
- the number 7 is the same where small, medium, int, bigint, decimal or numeric
- MVCC
- update ops create new record versions
- new one is tagged with id, points to old version
- keep tracks which
- Updates are in memory
- held in memory pending commit
- index changes held in memory
- verb rollback is dirt cheap
- trxs rollback is dirt cheap
- At commit
- pending record updates flushed to serial log
- pending index updates flushed to serial log
- commit record written to serial log
- serial log flushed to the oxide
- and trx is also committed
- Memory is infinite, so
- large txns chills uncommitted data (flushes it to the log early)
- chilled records can be thawed
- scavenger garbage collects unloved records periodically
- when things get really had, entire record chains flushed to backlog
- Weakness
- transactions are ACID but not serializable
- latency advantage disappears at saturation
- very large transactions degrade performance
- optimized for web, not batch
- Strengths
- runs like a memory db when data fits
- scales like disk-based db when db doesn’t fit in cache
- lowest possible latency for web apps
- absorbs huge spiky loads
- Performance
- benchmark against InnoDB vs Falcon only
- DBT2 benchmark (what about sysbench?)
- High contention
- Writes intensive - 40% records touched are updated
- measures only performance at saturation
- DBT2 is InnoDB’s best spot and Falcon’s worst, so do not take benchmark results, decide on what you want
- When should you use what ?
- don’t need ACID ? then MyISAM is good
- single processor, small memory - InnoDB is good
- large transactions, batch inserts/updates, InnoDB is good
- multi cores, more memory, more threads , use Falcon
- For web, Falcon is hard to beat