MapReduce – DBInputFormat – Serialization on readers
Last week I was working on EC2 MySQL server where one of the slave is taking lot of time to catch-up; and only job that is running on that server is mapreduce job to access InnoDB tables for read-only meta data. And debugging it further, noticed that every access to database server is serialized with read lock; so slave has to wait until the locks are released (bad part is, if the resource is not released in-time or if it takes lot of time to process and cleanup, then things will be really bad on server front).
As am new to hadoop/mapreduce world; after pulling mapreduce code, noticed that DbInputFormat class by default uses SERIALIZABLE isolation level in getConnection() internally; which is kind of odd and can cause lot of locking and performance issues especially if multiple threads are acting on the same table.
Surprisingly, Sqoop, a database import tool for Hadoop also uses the same DBInputFormat class, so not sure if anyone actually tested for performance especially on live server with multiple threads where table is frequently updated. Even though the reader thread won’t take any performance hit, but other threads has to wait for any write operation as SERIALIZABLE by default acquires read lock.
I patched the mapreduce code to use READ COMMITTED isolation, and things started working back; even though the scenario is bit different here due to the nature of server workload, but this is not something that is un-common in most of the production environment.