create and share a pig udf (anyone can do it)

The Apache Pig project's User Defined Functions page gives a pretty good overview of how to create a UDF.  In fact, I stole my simple UDF from there.  For Pig UDFs, the obligatory "Hello World" program is actually a "Convert to Upper Case" function.  For this effort, I'm using the Hortonworks Sandbox (version 2.0).  Once you have that set up and operational, follow along and we'll get your first UDF created and placed on HDFS where others can easily share it.

Follow the Sandbox's instructions to log into the VM (root's password is hadoop), but then switch user to hue (bonus points for why this is really a bad idea) before opening up a new file called UPPER.java.

hw10653:~ lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: 
Last login: Fri Mar 28 21:34:47 2014 from 10.0.2.2
[root@sandbox ~]# su hue
[hue@sandbox root]$ cd ~
[hue@sandbox ~]$ mkdir exampleudf
[hue@sandbox ~]$ cd exampleudf
[hue@sandbox exampleudf]$ vi UPPER.java

Paste the following code into the vi editor.  If you "forgot" how, just type i to pop into insert mode and then paste the stuff from below.  (And... if you needed that little hint, then bang the ESC key a couple of times, type :wq, and press ENTER to write the file and quit vi.  Google for a good vi cheat sheet if needed.)

UPPER.java
package exampleudf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

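public class UPPER extends EvalFunc<String>
{
   // Pig calls exec once per input row; the first field holds the text to convert.
   public String exec(Tuple input) throws IOException {
      // Pass missing or null input through as null rather than failing the job.
      if(input == null || input.size() == 0 || input.get(0) == null)
         return null;
      try {
         String str = (String)input.get(0);
         return str.toUpperCase();
      } catch(Exception e) {
         // Covers a non-String field (ClassCastException) among other failures.
         throw new IOException("Caught exception processing input row ", e);
      }
   }
}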

As you are starting to see, the goal is to create a SIMPLE User-Defined Function.  This will give you a strawman that you can build your own slick new function on top of.  That, or pay some decent Java Hadoop programmer to do it for you – heck, I'm not allergic to a little moonlighting.  (wink)
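If you want to poke at the null handling before compiling against the Pig jar, the guard logic can be tried in isolation.  The sketch below is only a stand-in: UpperSketch and upperOrNull are made-up names, and a plain java.util.List takes the place of Pig's Tuple, so this is not the UDF itself.  It sticks to syntax that the Sandbox's JDK 1.6 should accept.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for UPPER.exec: a List replaces Pig's Tuple so the
// logic can run without Pig on the classpath.
class UpperSketch {
    // Returns null for a missing or null first field, otherwise its upper-case form.
    static String upperOrNull(List<Object> fields) {
        if (fields == null || fields.isEmpty() || fields.get(0) == null) {
            return null;
        }
        // Same call as UPPER.exec; it uses the JVM's default locale.
        return ((String) fields.get(0)).toUpperCase();
    }

    public static void main(String[] args) {
        // One row with text, one row whose only field is null.
        System.out.println(upperOrNull(Collections.<Object>singletonList("hello pig")));
        System.out.println(upperOrNull(Collections.<Object>singletonList(null)));
    }
}
```

The second call returns null instead of throwing, which mirrors how the real UDF passes empty rows through.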

Then just compile the class and jar it up (your JDK and Pig version numbers might vary slightly).  If you have trouble compiling/jarring it, or don't even want to try, then just download exampleudf.jar directly and load it into the directory described further down in the post.

[hue@sandbox exampleudf]$ /usr/jdk64/jdk1.6.0_31/bin/javac -cp /usr/lib/pig/pig-0.12.0.2.0.6.0-76.jar UPPER.java 
[hue@sandbox exampleudf]$ cd ..
[hue@sandbox ~]$ /usr/jdk64/jdk1.6.0_31/bin/jar -cf exampleudf.jar exampleudf
[hue@sandbox ~]$ ls -l *.jar
-rw-rw-r-- 1 hue hue 1534 Mar 29 00:54 exampleudf.jar
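The jar command is run from the parent directory because the class is declared in package exampleudf, so the archive has to hold it at exampleudf/UPPER.class.  A quick way to confirm that is sketched below.  JarCheck, entryFor, and containsClass are names invented for this illustration, and the default jar path is assumed to be the current directory; the code avoids try-with-resources so it compiles on JDK 1.6.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks that a jar holds a class at the path its package implies.
class JarCheck {
    // Converts a fully qualified class name to the entry path inside a jar.
    static String entryFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true when the jar contains an entry for the given class name.
    static boolean containsClass(String jarPath, String className) throws IOException {
        JarFile jar = new JarFile(jarPath);
        try {
            return jar.getEntry(entryFor(className)) != null;
        } finally {
            jar.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes exampleudf.jar is in the current directory unless a path is given.
        String jarPath = args.length > 0 ? args[0] : "exampleudf.jar";
        System.out.println(containsClass(jarPath, "exampleudf.UPPER"));
    }
}
```

If this prints false, the jar was most likely built from inside the exampleudf directory, which drops the package folder from the entry path.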

Now that we've got it created, let's share it.  The best way to make it accessible to everyone is to put the jar file on HDFS itself.  Since we are using the Sandbox, we could just use Hue, but everything is always more fun at the command line.

[hue@sandbox ~]$ hadoop fs -mkdir shared
[hue@sandbox ~]$ hadoop fs -mkdir shared/pig
[hue@sandbox ~]$ hadoop fs -mkdir shared/pig/udfs
[hue@sandbox ~]$ ls -l *.jar
-rw-rw-r-- 1 hue hue 1534 Mar 29 00:54 exampleudf.jar
[hue@sandbox ~]$ hadoop fs -put exampleudf.jar shared/pig/udfs/exampleudf.jar
[hue@sandbox ~]$ hadoop fs -ls /user/hue/shared/pig/udfs
Found 1 items
-rw-r--r--   3 hue hue       1534 2014-03-29 00:59 /user/hue/shared/pig/udfs/exampleudf.jar

If all went well you should be able to see that file via Hue's File Browser.

For this compiled UDF library to be accessible to everyone, the jar file needs to have its HDFS permissions set to allow read rights for all users.  The -rw-r--r-- shown in the listing above already grants that.
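For anyone unsure how to read that mode string, the check boils down to one character.  The sketch below assumes the usual ten-character layout printed by hadoop fs -ls (file type, then rwx triplets for owner, group, and others); ModeCheck and worldReadable are hypothetical names, not part of any Hadoop API.

```java
// Hypothetical helper for reading an ls-style mode string such as "-rw-r--r--".
class ModeCheck {
    // Returns true when the "others" triplet grants read access.
    static boolean worldReadable(String mode) {
        if (mode == null || mode.length() < 10) {
            throw new IllegalArgumentException("expected a mode like -rw-r--r--");
        }
        // Index 0 is the file type, 1-3 owner, 4-6 group, 7-9 others.
        return mode.charAt(7) == 'r';
    }

    public static void main(String[] args) {
        // Mode string copied from the listing of exampleudf.jar.
        System.out.println(worldReadable("-rw-r--r--"));
    }
}
```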

Now, create a file (example: typingText.txt) with some random text in it and get it into HDFS at /user/hue/testData/typingText.txt, the path that the script below loads.

Next up: write a simple Pig script that will register the UDF jar file from HDFS and then use it to turn all the words into upper case.

test-UPPER.pig
REGISTER 'hdfs:///user/hue/shared/pig/udfs/exampleudf.jar';
DEFINE SIMPLEUPPER exampleudf.UPPER();

typing_line = LOAD '/user/hue/testData/typingText.txt' AS (row:chararray);

upper_typing_line = FOREACH typing_line GENERATE SIMPLEUPPER(row);

DUMP upper_typing_line;

The logical thing would be to use the Pig UI component of Hue to run this super simple function, but I simply cannot figure out why it complains with the following error each time.

2014-03-29 01:15:19,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Pathname /tmp/udfs/'hdfs:/user/hue/shared/pig/udfs/exampleudf.jar' from hdfs://sandbox.hortonworks.com:8020/tmp/udfs/'hdfs:/user/hue/shared/pig/udfs/exampleudf.jar' is not a valid DFS filename.

I simply could NOT work past this error and decided to do what everyone should do when something doesn't work (more bonus points if you can get it to work in Hue's Pig UI) – I made the problem simpler.  I went back to my trusty old friend, the command-line interface (CLI).

[hue@sandbox ~]$ pig test-UPPER.pig
2014-03-29 01:20:40,579 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.0.2.0.6.0-76 (rexported) compiled Oct 17 2013, 20:44:07
2014-03-29 01:20:40,580 [main] INFO  org.apache.pig.Main - Logging error messages to: /usr/lib/hue/pig_1396081240577.log

... LOTS of lines removed ...

2014-03-29 01:21:12,501 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-03-29 01:21:12,502 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNTRY.)
( NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNTRY)
(. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNTR)
(Y. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNT)
(RY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUN)
(TRY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COU)
(NTRY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR CO)
(UNTRY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR C)
(OUNTRY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR )
(COUNTRY. NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR)

It worked!  You did it!!  Everything has been CAPITALIZED!!!  Congratulations!!!!