Tuesday, May 22, 2012

Are compression codecs required on client nodes?

Sometimes, even with compression codecs available on all nodes in the cluster, we see some of our jobs failing with a class-not-found error for a compression codec.


Compression and decompression are normally done by the task trackers. In certain cases, however, the compression codecs are also required on the client node, that is, the machine from which the job is submitted.

Some scenarios are:
 Total Order Partitioner
          Before the MapReduce job is triggered, the job needs to know the ranges of the keys. Only then can it decide which range of keys should go to which reducer. These ranges are needed before the map tasks start, so the client first takes a sample across the input data (the seek pattern could be something like: read the first 20 MB, skip the next 200 MB, read the next 20 MB, and so on).

 Hive and Pig
To optimize jobs better, for example by distributing data uniformly across reducers and determining the number of reducers, Hive and Pig also do a quick sampling pass over the input data.
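
To make the Total Order Partitioner scenario more concrete, here is a minimal Python sketch of how sampled keys could be turned into reducer key ranges. It is only an illustration of the idea, not Hadoop's actual implementation; the function names choose_split_points and partition are hypothetical, and the sketch assumes the sample holds at least as many keys as there are reducers.

```python
import bisect


def choose_split_points(sampled_keys, num_reducers):
    """Pick num_reducers - 1 boundary keys from a sample of the input keys.

    Assumes len(sampled_keys) >= num_reducers (hypothetical helper).
    """
    keys = sorted(sampled_keys)
    step = len(keys) / num_reducers
    # Take evenly spaced keys from the sorted sample as range boundaries.
    return [keys[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


if __name__ == "__main__":
    sample = list(range(100))          # stand-in for keys sampled on the client
    splits = choose_split_points(sample, 4)
    print(splits)                      # boundaries between the 4 reducers
    print(partition(60, splits))       # reducer index for key 60
```

The split points have to exist before any map task runs, which is why the sampling step happens on the client.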

In both of these cases, a sample of the input data is read on the client side before the MR tasks run, so if the data is compressed, the compression codec needs to be available on the client node as well.
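
The failure mode can be imitated outside Hadoop with a small Python sketch. It assumes a hypothetical read_sample helper and a dictionary standing in for the codecs installed on the client; neither is part of Hadoop. Reading a sample of a compressed file works only if the client has a matching codec, otherwise the lookup fails before any task is launched.

```python
import bz2
import gzip
import os


def read_sample(path, num_bytes, codecs):
    """Read the first num_bytes of decompressed data, as a client-side sampler might.

    `codecs` maps a file extension to an opener function and stands in for
    the codecs installed on the client node (hypothetical registry).
    """
    ext = os.path.splitext(path)[1]
    if ext not in codecs:
        # Rough analogue of the class-not-found error seen on the client.
        raise LookupError(f"no codec registered on the client for '{ext}' files")
    with codecs[ext](path, "rb") as stream:
        return stream.read(num_bytes)


# A client that has both codecs, and one that is missing gzip.
FULL_CLIENT = {".gz": gzip.open, ".bz2": bz2.open}
CLIENT_WITHOUT_GZIP = {".bz2": bz2.open}
```

Calling read_sample on a .gz file with FULL_CLIENT returns the decompressed bytes, while the same call with CLIENT_WITHOUT_GZIP raises LookupError even though the "cluster side" could decompress the file without trouble.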

6 comments:

  1. Hi Bejoy,

    I am getting the same exception, but I am not able to understand which node you are referring to as the client node.

    Thanks in advance!
