Monday, June 29, 2015

Hadoop Archives (har) - Creating and Reading HAR



A quick post that explains the following, with samples:
  • Create a HAR file
  • List the Contents of a HAR file
  • Read the contents of a file that is within a HAR


Listed below is the input directory structure in HDFS that I'll be using to create a HAR:

hadoop fs -ls /bejoyks/test/har/source_files/*
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file1.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file2.tsv
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file3.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file4.tsv


CLI Command to create a HAR

Syntax
hadoop archive -archiveName <archiveName.har> -p <ParentDirHDFS> -r <ReplicationFactor> <childDir01> <childDir02> <DestinationDirectoryHDFS>

Command Used
hadoop archive -archiveName tsv_daily.har -p /bejoyks/test/har/source_files -r 3 srcDir01 srcDir02 /bejoyks/test/har/destination
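
If the archive is created regularly (say, from a daily job), it can help to assemble the command from its parts so the argument order stays right: archive name, parent directory, replication factor, child directories, and the destination last. The snippet below is only a sketch; har_create_cmd is a hypothetical helper name, not part of Hadoop, and it just prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop command): prints the `hadoop archive`
# command line for the given inputs, without executing it.
# Usage: har_create_cmd <archiveName.har> <parentDir> <replication> <destDir> <childDir>...
har_create_cmd() {
    name=$1; parent=$2; repl=$3; dest=$4
    shift 4
    # The remaining arguments are the child directories, relative to the
    # parent directory; "$*" joins them with single spaces.
    printf 'hadoop archive -archiveName %s -p %s -r %s %s %s\n' \
        "$name" "$parent" "$repl" "$*" "$dest"
}

# Rebuilds the command used in this post.
har_create_cmd tsv_daily.har /bejoyks/test/har/source_files 3 \
    /bejoyks/test/har/destination srcDir01 srcDir02
```

Once the printed command looks right, it can be pasted into a shell on a node where the hadoop client is available.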


LISTING DIRS and FILES in HAR
Syntax
hadoop fs -ls har://<AbsolutePathOfHarFile>

Command Used and Output
Command 01 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
drwxr-xr-x   - hadoop supergroup          0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir02

Command 02 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file1.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv

READING a File within a HAR
hadoop fs -text har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv
file2    row1
file2    row2

** Common mistakes while reading a HAR file

Always use the har:// URI while reading a HAR file
Since we are used to listing directories and files in HDFS without a URI scheme, we might follow the same pattern here. But a HAR doesn't behave like an archive unless its path is prefixed with the har:// scheme. If you list it without the scheme, you'll get the HAR metadata under the hood (the index files and the part file), something like below.

hadoop fs -ls /bejoyks/test/har/destination/tsv_daily.har
Found 3 items
-rw-r--r--   5 hadoop supergroup        277 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_index
-rw-r--r--   5 hadoop supergroup         23 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_masterindex
-rw-r--r--   3 hadoop supergroup         88 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/part-0
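
One way to avoid this mistake in scripts is to build the har:// URI from the plain HDFS path before passing it to hadoop fs. The snippet below is a minimal sketch; har_uri is a hypothetical helper name, and it assumes the archive lives on the default file system, so that har:// followed by an absolute path is enough (as in the commands above).

```shell
#!/bin/sh
# Hypothetical helper (not a Hadoop command): turns an absolute HDFS path
# into a har:// URI. Assumes the archive is on the default file system.
har_uri() {
    case $1 in
        har://*) printf '%s\n' "$1" ;;        # already a HAR URI, leave it alone
        /*)      printf 'har://%s\n' "$1" ;;  # absolute path becomes har:///...
        *)       echo "har_uri: need an absolute path: $1" >&2
                 return 1 ;;
    esac
}

# Prints the URI to use with `hadoop fs -ls` or `hadoop fs -text`.
har_uri /bejoyks/test/har/destination/tsv_daily.har
```

The result can then be used directly, for example hadoop fs -ls "$(har_uri /bejoyks/test/har/destination/tsv_daily.har)".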