parameterizing mapred.* properties (cli vs oozie)

OK... kinda cheating here, so please forgive me. I got an email with the question below from a valued client who was looking at my hadoop mini smoke test (VERY mini) posting, and just before I hit the send button on my response I realized (cheating, of course!) that I could just spin this into a blog posting. I hope it is useful to someone else as well.

Subject: hadoop jar parameters

Per the command execution below, is there a standardized way we can reference “-D” parameter values?  Where we store these values in a config/environment file and they are automagically included upon execution?
 
time hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort -Dmapred.job.queue.name=Queue60 -Dmapred.map.tasks=80 -Dmapred.reduce.tasks=25 /user/root/teragen-1000MB /user/root/terasort-1000MB
 
I’m wondering if it’s best to create a wrapper shell script to facilitate this.  I pause as I’m not sure if some facility is already available.

This is really a general Java question. It can be done programmatically (see hint), but there is no basic "by convention" approach that says to create something like a properties.config file and have the KVPs (key-value pairs) in it ~automagically~ picked up. Well... not for Java main()'s... Thus, for just firing something like this off from the command line, a simple wrapper script would do quite nicely. You could then parameterize all the other values as well, such as the source and target directories.
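To make the wrapper-script idea concrete, here is a minimal sketch, assuming bash and a plain text file with one key=value pair per line. The function names, the properties file layout, and the DRY_RUN switch are all made up for illustration; only the hadoop jar invocation comes from the email above.

```shell
#!/usr/bin/env bash
# Sketch of a wrapper script. The function names, the properties file format,
# and the DRY_RUN switch are assumptions for illustration, not a Hadoop facility.

# Print one -Dkey=value option for each key=value line in the given file,
# skipping blank lines and lines that start with '#'.
build_d_opts() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    printf -- '-D%s\n' "$line"
  done < "$1"
}

# Usage: run_terasort <properties-file> <input-dir> <output-dir>
run_terasort() {
  local props_file="$1" input_dir="$2" output_dir="$3" opt
  local cmd=(hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort)

  # Append each generated -D option as its own argument.
  while IFS= read -r opt; do
    cmd+=("$opt")
  done < <(build_d_opts "$props_file")
  cmd+=("$input_dir" "$output_dir")

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "${cmd[*]}"   # show the command without running it
  else
    "${cmd[@]}"
  fi
}

# Run only when called with the three expected arguments.
if [ "$#" -eq 3 ]; then
  run_terasort "$@"
fi
```

Saved as, say, run-terasort.sh next to a terasort.properties file holding the three lines mapred.job.queue.name=Queue60, mapred.map.tasks=80 and mapred.reduce.tasks=25 (both file names are hypothetical), the call becomes ./run-terasort.sh terasort.properties /user/root/teragen-1000MB /user/root/terasort-1000MB. Prefixing it with DRY_RUN=1 prints the assembled command instead of submitting the job.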

Now, if you were using Oozie, you would have some other options. As https://cwiki.apache.org/confluence/display/OOZIE/Java+Cookbook shows, there is a <java-opts> tag that can be used for parameters such as these when using the Java action. The example they show is just for a memory setting (i.e. -Xms512m), but http://jayatiatblogs.blogspot.com/2011/05/building-java-action-in-oozie.html shows an example of passing in -Denv=stg -DPP=DB_PASSPHRASE options. Furthermore, https://issues.cloudera.org/browse/HUE-1030 shows how the <java-opts> and <arg> tags work within a single action. (Notice from the URL that Hue is a Cloudera project that uses the Apache License, but is not a true ASF project ;-)

All that said, teragen/terasort are MapReduce Java applications, so Oozie can leverage the MapReduce action described at https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook. You’ll notice the generic <java-opts> tag is gone, but there is a <property> tag within <configuration>, and the example sets the same mapred.job.queue.name KVP asked about in the email. The same pattern can be used to pass in the number of mappers and reducers.
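As a rough sketch of what that might look like, here is only the <configuration> portion of a MapReduce action, carrying the three values from the email. This is a fragment rather than a working workflow: a real action also needs its job-tracker and name-node elements plus the job's own mapper/reducer settings, which are left out here, and the ${queueName}, ${numMaps} and ${numReduces} parameter names are hypothetical ones you would define yourself (for example in job.properties).

```xml
<configuration>
  <!-- Same KVPs as the -D options on the command line -->
  <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>   <!-- e.g. Queue60 -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>${numMaps}</value>     <!-- e.g. 80 -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>${numReduces}</value>  <!-- e.g. 25 -->
  </property>
</configuration>
```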

Maybe I'll start answering all of my emails this way!