storing dynamically created file names with pig (piggybank's multistorage to the rescue)

A common use case that is easily addressed with Pig is breaking an input file into separate files based on one of the record's attributes.  An easy example to visualize is splitting on a date, as I will show in my quick example, but it could be any relevant attribute, such as sales region or originating country.  So, let's start with a simple input file to process.

/tmp/multistore/all.txt
2016-01-01,field1-01a,field2-01a,field3-01a
2016-01-02,field1-02a,field2-02a,field3-02a
2016-01-03,field1-03a,field2-03a,field3-03a
2016-01-01,field1-01b,field2-01b,field3-01b
2016-01-02,field1-02b,field2-02b,field3-02b
2016-01-03,field1-03b,field2-03b,field3-03b
2016-01-01,field1-01c,field2-01c,field3-01c
2016-01-02,field1-02c,field2-02c,field3-02c
2016-01-03,field1-03c,field2-03c,field3-03c

This file, which I placed at /tmp/multistore/all.txt on HDFS in my HDP 2.4 based Hortonworks Sandbox for testing, has nine records total: three for each of the first three days of January 2016.  All I want to show now is how to create three separate files with the same attributes (Pig can surely do a bunch of transformations before this step!) and use the very first attribute (again, the date) to determine the file that will hold only those particular records.  This is easy thanks to Piggybank's MultiStorage class, as shown in the following script.

testMultiStore.pig
allTogether = LOAD '/tmp/multistore/all.txt' USING PigStorage(',') AS
  (newFileName:chararray, field1:chararray, field2:chararray, field3:chararray);
  
STORE allTogether INTO '/tmp/multistore/splitUp' 
  USING org.apache.pig.piggybank.storage.MultiStorage(
    '/tmp/multistore/splitUp', '0');

As shown in the JavaDoc, that second parameter is asking for the "index of field whose values should be used to create directories and files".  You can also supply a specific delimiter for the output file, but I'll let it use the default tab delimiter just to make it clear that the individual attributes are being recognized, rather than being written out as one long string per input record.  Let's run it!

[root@sandbox multistore]# pig -x tez -f testMultiStore.pig 
Success!
Input(s):
Successfully read 9 records (396 bytes) from: "/tmp/multistore/all.txt"
Output(s):
Successfully stored 9 records (396 bytes) in: "/tmp/multistore/splitUp"
[root@sandbox multistore]# 
[root@sandbox multistore]# hdfs dfs -ls -R /tmp/multistore/splitUp
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
drwxr-xr-x   - root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03
-rw-r--r--   1 root hdfs        132 2016-07-26 16:21 /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
-rw-r--r--   1 root hdfs          0 2016-07-26 16:21 /tmp/multistore/splitUp/_SUCCESS
[root@sandbox multistore]# 
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-01/2016-01-01-0,000
2016-01-01    field1-01a    field2-01a    field3-01a
2016-01-01    field1-01b    field2-01b    field3-01b
2016-01-01    field1-01c    field2-01c    field3-01c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-02/2016-01-02-0,000
2016-01-02    field1-02a    field2-02a    field3-02a
2016-01-02    field1-02b    field2-02b    field3-02b
2016-01-02    field1-02c    field2-02c    field3-02c
[root@sandbox multistore]# hdfs dfs -cat /tmp/multistore/splitUp/2016-01-03/2016-01-03-0,000
2016-01-03    field1-03a    field2-03a    field3-03a
2016-01-03    field1-03b    field2-03b    field3-03b
2016-01-03    field1-03c    field2-03c    field3-03c
[root@sandbox multistore]# 
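For completeness, MultiStorage also accepts optional third and fourth arguments: a compression type ('none', 'gz', or 'bz2') and the output field delimiter.  So a comma-delimited, uncompressed variant of the STORE statement above would look something like the sketch below (I haven't run this variant; the /tmp/multistore/splitUpCsv path is just a hypothetical target I made up so it wouldn't collide with the earlier output directory).

STORE allTogether INTO '/tmp/multistore/splitUpCsv'
  USING org.apache.pig.piggybank.storage.MultiStorage(
    '/tmp/multistore/splitUpCsv', '0', 'none', ',');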

Pretty sweet!