Tuesday, May 22, 2012

Are compression codecs required on client nodes?

Sometimes, even when the compression codecs are available on every node in the cluster, some of our jobs still fail with a ClassNotFoundException for a codec class.


Even though compression and decompression are normally performed by the task trackers, in certain cases the compression codecs are also required on the client node.

Some such scenarios are:

 Total Order Partitioner
          Before the MapReduce job is launched, the job needs an understanding of the range of keys; only then can it decide which range of keys should go to which reducer. This information is needed before the map tasks start, so the client first takes a random sample across the input data (for example: read the first 20 MB, skip the next 200 MB, read the next 20 MB, and so on).
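The client-side sampling step above can be sketched with Hadoop's InputSampler and TotalOrderPartitioner. This is a minimal sketch, not a complete job driver; the input path, reducer count, and sampling parameters are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total-order-sort");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setNumReduceTasks(4);

        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));  // hypothetical path

        // This sampling step runs on the CLIENT, before any map task starts:
        // it opens and reads the input files directly to estimate key ranges,
        // so any codec used to compress the input must be on the client's
        // classpath, or this call fails with ClassNotFoundException.
        InputSampler.writePartitionFile(job,
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

        // ... set mapper/reducer and input/output paths, then submit the job.
    }
}
```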

 Hive and Pig
For better optimization of jobs, uniform distribution of data across reducers, determining the number of reducers, and so on, Hive and Pig actually do a quick scan over a sample of the input data.

In both of these cases a sample of the input data is read on the client side before the MapReduce tasks run, so if the data is compressed, the compression codec must be available on the client node as well.
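One way to see whether a given input file's codec is resolvable on the client is Hadoop's CompressionCodecFactory, which maps file extensions to registered codec classes. A minimal sketch (the file path is a made-up example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Resolve the codec from the file extension, just as the
        // client-side sampling code does before reading a split.
        CompressionCodec codec =
                factory.getCodec(new Path("/data/input/part-00000.gz"));

        if (codec == null) {
            System.out.println("No codec registered for this extension "
                    + "on the client classpath");
        } else {
            System.out.println("Codec: " + codec.getClass().getName());
        }
    }
}
```

If this prints no codec (or throws while instantiating one), the same failure will surface when Hive, Pig, or InputSampler tries to sample that file from the client.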

12 comments:

  1. Hi Bejoy,

    I am getting the same exception but i am not able to understand which one you are mentioning as client node?

    Thanks in advance!
