Storing dynamically created file names with Pig (Piggybank's MultiStorage to the rescue)
A common use case that Pig easily addresses is splitting an input file into separate files based on one of the record's attributes. An easy example to visualize is splitting on a date, as I will show in my quick example, but it could be any relevant attribute such as sales region or originating country. So, let's start with a simple input file to process.
2016-01-01,field1-01a,field2-01a,field3-01a
2016-01-02,field1-02a,field2-02a,field3-02a
2016-01-03,field1-03a,field2-03a,field3-03a
2016-01-01,field1-01b,field2-01b,field3-01b
2016-01-02,field1-02b,field2-02b,field3-02b
2016-01-03,field1-03b,field2-03b,field3-03b
2016-01-01,field1-01c,field2-01c,field3-01c
2016-01-02,field1-02c,field2-02c,field3-02c
2016-01-03,field1-03c,field2-03c,field3-03c
This file, which I placed at /tmp/multistore/all.txt on HDFS in my HDP 2.4-based Hortonworks Sandbox for testing, has nine records total: three records for each of the first three days of January 2016. All I want to do now is create three separate files with the same attributes (Pig can surely do a bunch of transformations before this!!) and use the very first attribute (again, the date) to name the file that will hold only those particular records. This is easy thanks to Piggybank's MultiStorage class, as shown in the following script.
allTogether = LOAD '/tmp/multistore/all.txt' USING PigStorage(',')
    AS (newFileName:chararray, field1:chararray, field2:chararray, field3:chararray);

STORE allTogether INTO '/tmp/multistore/splitUp'
    USING org.apache.pig.piggybank.storage.MultiStorage('/tmp/multistore/splitUp', '0');
As shown in the JavaDoc, that second parameter is the "index of field whose values should be used to create directories and files". You can also supply a specific delimiter for the output files, but I'll let it use the default tab delimiter just to make it clear that the individual attributes are being recognized rather than stored as one long string per input record. Let's run it!
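If you did want to keep the output comma-delimited, the JavaDoc also lists a four-argument constructor that takes a compression type and a field delimiter. A sketch of that variant (the splitUpCsv output path here is just a hypothetical name I picked for illustration):

STORE allTogether INTO '/tmp/multistore/splitUpCsv'
    USING org.apache.pig.piggybank.storage.MultiStorage(
        -- same parent path and split-field index as before
        '/tmp/multistore/splitUpCsv', '0',
        -- compression type ('none', 'gz', or 'bz2') and output field delimiter
        'none', ',');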
[root@sandbox multistore]# pig -x tez -f testMultiStore.pig

Success!

Input(s):
Successfully read 9 records (396 bytes) from: "/tmp/multistore/all.txt"

Output(s):
Successfully stored 9 records (396 bytes) in: "/tmp/multistore/splitUp"

[root@sandbox multistore]#
[root@sandbox multistore]# hdfs dfs -ls -R /tmp/multistore/splitUp
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
-rw-r--r--   1 root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/_SUCCESS
[root@sandbox multistore]#
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
2016-01-01	field1-01a	field2-01a	field3-01a
2016-01-01	field1-01b	field2-01b	field3-01b
2016-01-01	field1-01c	field2-01c	field3-01c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
2016-01-02	field1-02a	field2-02a	field3-02a
2016-01-02	field1-02b	field2-02b	field3-02b
2016-01-02	field1-02c	field2-02c	field3-02c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
2016-01-03	field1-03a	field2-03a	field3-03a
2016-01-03	field1-03b	field2-03b	field3-03b
2016-01-03	field1-03c	field2-03c	field3-03c
[root@sandbox multistore]#
Pretty sweet!