Here are quick notes from the session "Helping InnoDB scale on servers with many cores" by Mark Callaghan of Google (mcallaghan at google dot com).

  • we now have a team working on enhancements to help MySQL scale (9 people; I hope Yahoo management reads this)
  • Overview
    • describe the problems on big servers
    • work done by InnoDB community
    • ask MySQL/InnoDB to fix the problems by taking the patches
  • Community team
    • InnoDB/Oracle
    • Google MySQL team
    • InnoDB community
    • Percona – Peter and Vadim
  • Goal
    • Fix bottlenecks on big SMP
    • utilize servers with many disks
    • support thousands of connections
    • handle corruption in memory and on disk
    • make query plans predictable
    • support thousands of tables and accounts
    • Keep InnoDB beautiful while making these changes
  • Desirable features
    • linear scalability with cores
    • 128GB buffer cache
    • with many disks and remote disks
    • recovering from corruption
  • CPU problems
    • mutex implementation
    • spin lock mutex uses pthreads rather than atomic operations
    • RW-mutex uses the spin lock mutex
    • Mutex hotspots
      • buffer cache
      • memory allocation
      • transaction log
      • adaptive hash latch
  • Symptoms of CPU problems
    • adaptive hash latch contention
    • excessive mutex contention
      • server has many queries, is slow and is not IO bound
      • vmstat will report a lot of idle time
      • oprofile will show a lot of time spent in pthread functions
  • Making InnoDB spin lock mutex fast
    • replace pthread_mutex_trylock with CAS (compare-and-swap)
  • Making RW-mutex fast
    • use atomic ops to change internal state
    • use separate events to wake readers and writers
  • More work to be done
    • always release adaptive hash latch
    • RW-mutex for transaction log
    • replace malloc heap with scalable malloc (tcmalloc or mtmalloc)
    • use atomic ops for RW-mutex
    • reduce contention for the sync array mutex
    • remove some counters and fields from mutexes
    • platforms with > 8 cores
      • transaction log mutex has contention
      • buffer cache contention
  • To support 128GB buffer cache
    • data structures must scale
      • walking a list with 8M page entries might be slow
    • resources need to be split
      • more than one mutex might be needed for the buffer cache and LRU chain
    • Detection of corruption is more important
      • memory will be corrupted by software and hardware bugs
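
Two of the ideas in these notes, taking a spin lock with CAS instead of pthread_mutex_trylock and using more than one mutex for the buffer cache, can be sketched roughly as below. This is only an illustration and not InnoDB's code or the actual patches: SpinLock, PartitionedPageTable, and the partition count of 16 are all hypothetical names and values chosen for the example.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>

// Hypothetical spin lock: acquires with a compare-and-swap on an atomic flag
// instead of calling pthread_mutex_trylock.
class SpinLock {
public:
    bool try_lock() {
        bool expected = false;
        // CAS: set locked_ to true only if it is currently false.
        return locked_.compare_exchange_strong(expected, true,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed);
    }

    void lock() {
        while (!try_lock()) {
            // Wait on a plain load so waiters do not issue a CAS on every spin.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical page table split into partitions, each with its own lock, so
// threads touching different partitions do not contend on a single mutex.
class PartitionedPageTable {
public:
    static constexpr std::size_t kPartitions = 16;  // assumed value

    void insert(std::uint64_t page_id, std::uint64_t frame) {
        Partition& p = partition_for(page_id);
        std::lock_guard<SpinLock> guard(p.lock);
        p.pages[page_id] = frame;
    }

    // Returns true and writes the frame number if the page is present.
    bool lookup(std::uint64_t page_id, std::uint64_t* frame) {
        Partition& p = partition_for(page_id);
        std::lock_guard<SpinLock> guard(p.lock);
        auto it = p.pages.find(page_id);
        if (it == p.pages.end()) {
            return false;
        }
        *frame = it->second;
        return true;
    }

private:
    struct Partition {
        SpinLock lock;
        std::unordered_map<std::uint64_t, std::uint64_t> pages;
    };

    // Pick the partition from the page id; only that partition gets locked.
    Partition& partition_for(std::uint64_t page_id) {
        return parts_[page_id % kPartitions];
    }

    std::array<Partition, kPartitions> parts_;
};
```

Because SpinLock provides lock() and unlock(), it works with std::lock_guard. A production mutex would also need a fallback that puts long waiters to sleep, which this sketch leaves out.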