Tuesday, May 22, 2012

Are compression codecs required on client nodes?

Sometimes, even when the compression codecs are available on every node in the cluster, some of our jobs still fail with a ClassNotFoundException for a codec class.


Even though compression and decompression are normally performed by the task trackers, in certain cases the compression codecs are also required on the client node.

Some such scenarios are:

 Total Order Partitioner
          Before the MapReduce job is launched, the job needs an understanding of the range of keys; only then can it decide which range of keys should go to which reducer. This information is needed before the map tasks start, so the client first takes a random sample across the input data (for example: read the first 20 MB, skip the next 200 MB, read the next 20 MB, and so on).
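The client-side sampling step above can be sketched with Hadoop's InputSampler and TotalOrderPartitioner. This is a minimal sketch, not a complete job driver; the input path, reducer count, and sampling parameters are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total-order-sort");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(4);

        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));  // hypothetical path

        // This sampling step runs on the CLIENT, before any map task starts:
        // it opens and reads the input files directly to estimate key ranges,
        // so any codec used to compress the input must be on the client's
        // classpath, or this call fails with ClassNotFoundException.
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

        // ... set mapper/reducer and input/output paths, then submit the job.
    }
}
```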

 Hive and Pig
For better optimization of jobs, uniform distribution of data across reducers, determining the number of reducers, and so on, Hive and Pig actually do a quick scan over a sample of the input data.

In both of these cases a sample of the input data is read on the client side before the MapReduce tasks run, so if the data is compressed, the compression codec must be available on the client node as well.
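One way to see whether a given input file's codec is resolvable on the client is Hadoop's CompressionCodecFactory, which maps file extensions to registered codec classes. A minimal sketch (the file path is a made-up example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Resolve the codec from the file extension, just as the
        // client-side sampling code does before reading a split.
        CompressionCodec codec =
                factory.getCodec(new Path("/data/input/part-00000.gz"));

        if (codec == null) {
            System.out.println("No codec registered for this extension "
                    + "on the client classpath");
        } else {
            System.out.println("Codec: " + codec.getClass().getName());
        }
    }
}
```

If this prints no codec (or throws while instantiating one), the same failure will surface when Hive, Pig, or InputSampler tries to sample that file from the client.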

12 comments:

  1. Hi Bejoy,

    I am getting the same exception but i am not able to understand which one you are mentioning as client node?

    Thanks in advance!
