Kick Start Hadoop: Enable Multiple threads in a mapper aka MultithreadedMapper

Friday, February 10, 2012

As the name suggests, it is a map task that spawns multiple threads. A map task can be thought of as a process that runs within its own JVM. MultithreadedMapper spawns multiple threads within that single map task. Don't confuse this with running multiple tasks within the same JVM (that is achieved with JVM reuse). A multithreaded task still works on the input split defined by the input format, and the record reader reads the input just as in a normal map task. The multithreading happens after this stage: once a record has been read, the map processing is spread across multiple threads. (That is, the input IO is not multithreaded; the threads come into the picture only after it.)
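To make the split between serial reading and parallel processing concrete, here is a minimal plain-Java sketch. It does not use Hadoop at all: SerialReader, map and runMapTask are hypothetical names made up for this illustration, and the word-count map function is only a placeholder for real per-record work.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultithreadedMapSketch {

    /** Stand-in for the record reader: one shared, serial source of records. */
    static final class SerialReader {
        private final Iterator<String> records;

        SerialReader(List<String> split) {
            this.records = split.iterator();
        }

        // Only one thread at a time can pull the next record,
        // so reading the input stays single threaded.
        synchronized String next() {
            return records.hasNext() ? records.next() : null;
        }
    }

    /** Placeholder map function: counts the words in one record. */
    public static int map(String record) {
        String trimmed = record.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Runs one "map task": a single reader feeding numThreads worker threads. */
    public static Map<String, Integer> runMapTask(List<String> split, int numThreads) {
        SerialReader reader = new SerialReader(split);
        // Output is shared by all threads, so it must be thread safe.
        // Simplification: records are used as keys, so duplicates overwrite each other.
        Map<String, Integer> output = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int i = 0; i < numThreads; i++) {
            pool.execute(() -> {
                String record;
                // Each worker keeps pulling records until the split is exhausted.
                while ((record = reader.next()) != null) {
                    output.put(record, map(record));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList("a b c", "hello world", "one");
        System.out.println(runMapTask(split, 3));
    }
}
```

The point to notice is that all threads share one reader and one output collector inside the same process, so anything they share has to be thread safe.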

MultithreadedMapper is a good fit if your operation is highly CPU intensive and multiple threads getting multiple CPU cycles could speed up the task. If the job is IO intensive, running multiple tasks is much better than multithreading, because with multiple tasks multiple IO reads happen in parallel.

Let us see how to use MultithreadedMapper. The old mapreduce API and the new API do this differently.

Old API

Enable the multithreaded map runner either on the command line or in the driver:

-D mapred.map.runner.class=org.apache.hadoop.mapred.lib.MultithreadedMapRunner

jobConf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);

New API

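Set org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper as the job's mapper class and register your own mapper with MultithreadedMapper.setMapperClass(job, YourMapper.class). Your mapper still extends org.apache.hadoop.mapreduce.Mapper as usual. MultithreadedMapper has a different implementation of the run() method, which runs your mapper's map() calls on a pool of threads.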

You can set the number of threads within a mapper (the default is 10) by

MultithreadedMapper.setNumberOfThreads(job, n); or

mapred.map.multithreadedrunner.threads=n
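A driver for the new API could look roughly like the sketch below. Treat it as an illustration rather than a tested job: it needs the Hadoop jars on the classpath, HeavyMapper and MultithreadedDriver are hypothetical names, the thread count of 8 is arbitrary, and the input and output paths are taken from the command line. Note that the user mapper extends the ordinary Mapper and is handed to MultithreadedMapper, which is what the job runs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultithreadedDriver {

    // Hypothetical mapper standing in for CPU-heavy work; it extends the ordinary Mapper.
    public static class HeavyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Placeholder logic: emit each line with its length in bytes.
            context.write(value, new IntWritable(value.getLength()));
        }
    }

    public static void main(String[] args) throws Exception {
        // new Job(conf, name) matches the Hadoop releases of this post's era;
        // later releases deprecate it in favour of Job.getInstance(conf, name).
        Job job = new Job(new Configuration(), "multithreaded-example");
        job.setJarByClass(MultithreadedDriver.class);

        // The job runs MultithreadedMapper, which in turn runs HeavyMapper on a thread pool.
        job.setMapperClass(MultithreadedMapper.class);
        MultithreadedMapper.setMapperClass(job, HeavyMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // assumed value; tune per workload

        job.setNumReduceTasks(0); // map-only job keeps the sketch short
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The Hadoop documentation asks for mapper implementations used this way to be thread safe, so avoid unsynchronized shared state in the mapper.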

Note: Don't assume that a multithreaded mapper is better than a normal map task just because it spawns fewer JVMs and fewer processes. If a mapper is loaded with lots of threads, the chances of that JVM crashing are higher, and the cost of re-executing such a Hadoop task would be very high.

Don't use MultithreadedMapper to control the number of JVMs spawned. If that is your goal, tweak the mapred.job.reuse.jvm.num.tasks parameter instead; its default value is 1, which means no JVM reuse across tasks.

The threads live at the lowest level, i.e. within a map task; the higher levels of the Hadoop framework, such as the job, are not aware of them.

Comments:

Unknown, March 29, 2012 at 12:03 PM
Found this useful.
A few questions:
If I set mapred.job.reuse.jvm.num.tasks = -1, my NLineInputFormat is set to 5 lines per split, and the input has 20 lines in total, how is it executed internally?

Bejoy KS, April 15, 2012 at 10:12 AM
Hey,
With NLineInputFormat and your settings it is simple: 5 lines per map task instance, so 20 lines means 4 map tasks. When JVM reuse is -1, all the map tasks of the job that run on the same node/task tracker can reuse the same JVM instance. Note that you can never guarantee that all the mappers will be on the same node; that depends on factors like the slots available, the scheduler used, etc.

Regards
Bejoy
Manish Jain, June 2, 2012 at 10:34 AM
In our case, the maps are memory and CPU bound. We are also planning to use the multithreaded mapper to gain efficiency in several ways:
1. Memory (common data structures) will be shared across multiple threads.
2. Each thread will be scheduled on a different core, so it should be equivalent to running multiple maps on the same physical box.
3. We should be able to specify a bigger split size, so combiner efficiency should improve.

I have the following queries:
1. What are the disadvantages of the multithreaded mapper beyond those mentioned in the notes section?
2. What other options are there to increase efficiency if maps are CPU and memory bound, primarily CPU bound?

Manish Jain
Gauvus, Guavus
Bejoy KS, June 3, 2012 at 10:40 AM
Hi Manish
From your requirement, it looks like you don't need MultithreadedMapper. Use normal map reduce; you can meet your requirement of sharing data across mappers using the distributed cache.

Some pointers inline

1. memory (common data structures) will be shared across multiple threads
>> use the distributed cache
2. each thread will be scheduled on a different core and hence should be equivalent to running multiple maps on the same physical box.
>> multiple maps is a cleaner approach here than going for the multithreaded one
3. should be able to specify a bigger split size and hence combiner efficiency should improve.
>> it is a trade-off between parallelism and data shuffle: more maps, more parallelism.

Regards
Bejoy
arun, July 4, 2012 at 5:50 AM
Hi Bejoy
We have a Java application which spawns multiple threads, and each thread invokes a mapreduce task. Because multiple threads are invoking mapreduce tasks, the application is failing.
Could you please let me know how to execute mapreduce tasks in a multithreaded way?

Thanks
Arun
Bejoy KS, July 5, 2012 at 1:43 AM
Hi Arun

This post is about a map task in a map reduce job spawning multiple threads. From what I understand, your case is different: you are launching mapreduce jobs from each spawned thread in your Java application. If the number of threads is too high and the cluster capacity is limited, it can definitely clog the cluster. You will have to get the failed task/job logs and see what the actual root cause is; there can be many possible causes in your case.