Monday, October 24, 2011

Include values during execution time in Hive QL / Dynamically substitute values in Hive


When you play around with data warehousing it is very common to come across scenarios where you'd like to supply values at run time. In production environments, when we have to run a Hive job we usually write our series of Hive operations in an HQL file and trigger it using the hive -f option from a shell script or a workflow management system like Oozie. Let's keep this discussion limited to triggering the Hive job from a shell, as it is the basic case.
      Say I have a Hive job called hive_job.hql; normally, from a shell, I'd trigger the Hive job as
hive -f hive_job.hql

If I need to set some Hive config parameters, say I need to enable compression in Hive, then I'd include the following arguments:
hive -f hive_job.hql -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
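
If the compression settings are fixed for the job rather than varying per run, they can equally live inside the HQL file itself as SET statements; a minimal sketch, equivalent to the -hiveconf flags above:

-- at the top of hive_job.hql
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;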

Now the final one: say my Hive QL operates on a date range, and this range varies from run to run, so the dates have to be accepted from the CLI each time. We can achieve this with the following steps:
1. Pass the variables as config parameters
2. Refer to these config parameters in your hive query

Let me make it more specific with a small example. I need to perform some operation on records in a table called 'test_table', which has a column/field named 'created_date' (i.e. I need to filter records based on created_date). The date range for the operation is supplied at run time. It is achieved as follows:
1. Pass the variables as config parameters
Here we need to pass two parameters, the start date and the end date, to get all the records within a specific date range:
hive -f hive_job.hql -hiveconf start_date=2011-01-01 -hiveconf end_date=2011-12-31
2. Refer to these config parameters in your hive query
In our Hive QL, the start date and end date are written as placeholders that are replaced by the actual values at execution time. Note that there must be no spaces inside a placeholder, or it will not be substituted:
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';
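
Later Hive releases (0.8 onward) also offer a dedicated hivevar namespace for user variables, which keeps them separate from engine configuration; a minimal sketch under that assumption:

hive -f hive_job.hql --hivevar start_date=2011-01-01 --hivevar end_date=2011-12-31

-- and inside hive_job.hql:
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hivevar:start_date}' AND t1.created_date <= '${hivevar:end_date}';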


Let us look into one more example: deciding the number of reducers at run time, combined with the previous set of requirements plus compression. I have 2 components here:

· Shell script that triggers the .hql
hive_trigger.sh
#!/bin/bash
# The script accepts 3 parameters in order: number of reducers, start date and end date.
NUM_REDUCERS=$1
BEGIN_DATE=$2
CLOSE_DATE=$3
hive -f hive_job.hql -hiveconf mapred.reduce.tasks=$NUM_REDUCERS \
  -hiveconf start_date=$BEGIN_DATE -hiveconf end_date=$CLOSE_DATE \
  -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true \
  -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
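
As a defensive touch, the script could also validate its argument count before invoking Hive, so a missing date fails fast instead of launching a misconfigured job. One possible guard, placed right after the shebang line:

# fail fast unless exactly 3 arguments were supplied
if [ $# -ne 3 ]; then
  echo "Usage: $0 <num_reducers> <start_date> <end_date>" >&2
  exit 1
fi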

· HQL that accepts parameters at run time
hive_job.hql
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';

The components are ready; you can now trigger your shell script, for example as
./hive_trigger.sh 50 2011-01-01 2011-12-31
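
With those arguments the shell variables expand, so the script effectively runs:

hive -f hive_job.hql -hiveconf mapred.reduce.tasks=50 -hiveconf start_date=2011-01-01 -hiveconf end_date=2011-12-31 -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec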

4 comments:

  1. Note that there cannot be a space when using the parameter inside the query:

    hive -f test.sql -hiveconf fromdate=2012-07-25 -hiveconf todate=2012-07-26

    select '${hiveconf: fromdate }','${hiveconf: todate }','${hiveconf:fromdate}','${hiveconf:todate}' from dual;

    Ended Job = job_201205071915_24430
    OK
    ${hiveconf: fromdate } ${hiveconf: todate } 2012-07-25 2012-07-26

  2. Hi Karna

    Sorry for the formatting issues that crop up when you copy the commands to your unix box. :)

  3. I am getting "invalid table alias or column reference 'profile_attribute_code' (possible column names are: _col0, _col1, _col2)". I am facing this problem while working; can you please help me?

  4. You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use that variable later in your code.

    BeeTamer is a macro extension to Hive and Impala that extends the functionality of the Apache Hive and Cloudera Impala engines.

    select avg(a) from abc;
    %capture MY_AVERAGE;
    select * from abc2 where avg_var=#MY_AVERAGE#;
    Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
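
    A similar effect is possible with plain Hive and a shell wrapper: run the first query in silent mode, capture its scalar result in a shell variable, and substitute it into the second query. A minimal sketch, assuming the same tables and columns as the snippet above:

    #!/bin/bash
    # -S (silent mode) suppresses log output so only the result is captured
    MY_AVERAGE=$(hive -S -e "select avg(a) from abc;")
    # substitute the captured value into the follow-up query
    hive -e "select * from abc2 where avg_var=${MY_AVERAGE};"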
