Monday, October 24, 2011

Include values during execution time in Hive QL / Dynamically substitute values in Hive


When you play around with data warehousing it is very common to come across scenarios where you'd like to supply values at run time. In production environments, when we have to run a Hive job we usually write our series of Hive operations in an HQL file and trigger it using the hive -f option from a shell script or a workflow management system like Oozie. Let's keep this discussion limited to triggering the Hive job from a shell, as it is the basic case.
      Say I have a Hive job called hive_job.hql; normally, from a shell, I'd trigger the Hive job as
hive -f hive_job.hql

If I need to set some Hive config parameters, say I need to enable compression in Hive, then I'd include the following arguments:
hive -f hive_job.hql -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
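
If the compression settings are fixed for the job rather than varying per run, they can equally live inside the HQL file itself as SET statements; a minimal sketch, equivalent to the -hiveconf flags above:

-- at the top of hive_job.hql
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;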

Now the final one: say my Hive QL operates on a date range, and this range varies from run to run, so the dates have to be accepted from the CLI each time. We can achieve this with the following steps:
1. Pass the variables as config parameters
2. Refer to these config parameters in your hive query

Let me make it more specific with a small example. I need to perform some operation on records in a table called 'test_table', which has a column/field named 'created_date' (i.e. I need to filter records based on created_date). The date range for the operation is supplied at run time. It is achieved as follows:
1. Pass the variables as config parameters
Here we need to pass two parameters, the start date and the end date, to get all the records within a specific date range:
hive -f hive_job.hql -hiveconf start_date=2011-01-01 -hiveconf end_date=2011-12-31
2. Refer to these config parameters in your hive query
In our Hive QL, the start date and end date are written as placeholders that are replaced by the actual values at execution time. Note that there must be no spaces inside a placeholder, or it will not be substituted:
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';
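
Later Hive releases (0.8 onward) also offer a dedicated hivevar namespace for user variables, which keeps them separate from engine configuration; a minimal sketch under that assumption:

hive -f hive_job.hql --hivevar start_date=2011-01-01 --hivevar end_date=2011-12-31

-- and inside hive_job.hql:
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hivevar:start_date}' AND t1.created_date <= '${hivevar:end_date}';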


Let us look into one more example: deciding the number of reducers at run time, combined with the previous set of requirements plus compression. I have 2 components here:

· Shell script that triggers the .hql
hive_trigger.sh
#!/bin/bash
# The script accepts 3 parameters in order: number of reducers, start date and end date.
NUM_REDUCERS=$1
BEGIN_DATE=$2
CLOSE_DATE=$3
hive -f hive_job.hql -hiveconf mapred.reduce.tasks=$NUM_REDUCERS \
  -hiveconf start_date=$BEGIN_DATE -hiveconf end_date=$CLOSE_DATE \
  -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true \
  -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
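
As a defensive touch, the script could also validate its argument count before invoking Hive, so a missing date fails fast instead of launching a misconfigured job. One possible guard, placed right after the shebang line:

# fail fast unless exactly 3 arguments were supplied
if [ $# -ne 3 ]; then
  echo "Usage: $0 <num_reducers> <start_date> <end_date>" >&2
  exit 1
fi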

· HQL that accepts parameters at run time
hive_job.hql
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';

The components are ready; you can now trigger your shell script, for example as
./hive_trigger.sh 50 2011-01-01 2011-12-31
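
With those arguments the shell variables expand, so the script effectively runs:

hive -f hive_job.hql -hiveconf mapred.reduce.tasks=50 -hiveconf start_date=2011-01-01 -hiveconf end_date=2011-12-31 -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec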

4 comments:

  1. Note that there cannot be a space when using the parameter inside the query:

    hive -f test.sql -hiveconf fromdate=2012-07-25 -hiveconf todate=2012-07-26

    select '${hiveconf: fromdate }','${hiveconf: todate }','${hiveconf:fromdate}','${hiveconf:todate}' from dual;

    Ended Job = job_201205071915_24430
    OK
    ${hiveconf: fromdate } ${hiveconf: todate } 2012-07-25 2012-07-26

  2. Hi Karna

    Sorry for the formatting issues that crop up when you copy the commands to your unix box. :)

  3. I am getting "invalid table alias or column reference 'profile_attribute_code' (possible column names are: _col0, _col1, _col2)". I am facing this problem while working; can you please help me?

  4. You can use BeeTamer for that. It allows you to store a result (or part of it) in a variable and use that variable later in your code.

    BeeTamer is a macro extension to Hive and Impala that extends the functionality of the Apache Hive and Cloudera Impala engines.

    select avg(a) from abc;
    %capture MY_AVERAGE;
    select * from abc2 where avg_var=#MY_AVERAGE#;
    Here you save the average value from your query into the macro variable MY_AVERAGE and then reuse it in the second query.
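
    A similar effect is possible with plain Hive and a shell wrapper: run the first query in silent mode, capture its scalar result in a shell variable, and substitute it into the second query. A minimal sketch, assuming the same tables and columns as the snippet above:

    #!/bin/bash
    # -S (silent mode) suppresses log output so only the result is captured
    MY_AVERAGE=$(hive -S -e "select avg(a) from abc;")
    # substitute the captured value into the follow-up query
    hive -e "select * from abc2 where avg_var=${MY_AVERAGE};"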
