stopping oozie from limiting the number of reducers on your hive action (just add some more xml)

This write-up was for an issue, and resolution, on a HDP 1.3.2 installation. Default properties (and behavior!) can surely change in future releases, but the general message should be relevant regardless which version of Hadoop you are using.

One of my clients came to me with a concern about Oozie apparently running their Hive script much slower than when they kicked if off via the CLI with a hive -f SCRIPT_FILE.hql command. After a little bit of digging it seems that when run by itself there was a large number of reducers, but when Oozie ran it there was only one. That's a good clue to how to fix it.

So, first question is why does Hive not have any trouble with using multiple reducers? The answer is in the mapred.reduce.tasks configuration property of Hive. It defaults to -1 which has the following explanation (taken from Hive 0.11's HiveConf.java file).

// The number of reduce tasks per job. Hadoop sets this value to 1 by default
// By setting this property to -1, Hive will automatically determine the correct
// number of reducers.
HADOOPNUMREDUCERS("mapred.reduce.tasks", -1),

This value (if explicitly set to this default or another value) ultimately shows up in the hive-site.xml file and, again, is the reason the CLI invocation of the script runs wide.

When Oozie runs a Hive action, it seems that it decides to use the overarching MapReduce settings which by default is set to 1 as seen in Hadoop 1.1.2's default settings and then gets percolated out into the mapred-site.xml configuration file (well, if it is explicitly set to htis default or another value).

So, the answer is probably coming to you. Why not set an override in the Oozie job to let Hive "automatically determine the correct number of reducers" or even some specific value? No reason at all not do that. How to do that? Well, fortunately that's pretty darn easy.

You just need to add the property to Oozie's (3.3.2 for HDP 1.3.2) Hive Action Extension as visualized in the center of the example below.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    ...
    <action name="myfirsthivejob">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-traker>foo:8021</job-tracker>
            <name-node>bar:8020</name-node>
            <prepare>
                <delete path="${jobOutput}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.reduce.tasks</name>
                    <value>-1</value>
                </property>
            </configuration>
            <script>myscript.q</script>
            <param>InputDir=/home/tucu/input-data</param>
            <param>OutputDir=${jobOutput}</param>
        </hive>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    ...
</workflow-app>

This should get you back to the same behavior as running the Hive script from the CLI.