use mapreduce to calculate salary statistics for georgia educators (first of a three-part series)

This is the first of a three-part series on showing alternative Hadoop & Big Data tools being utilized for Open Georgia Analysis.  The data we are working against looks like the following which is an include of the Format & Sample Data for Open Georgia wiki page.


The following describes the format of the dataset used for Open Georgia Analysis and was created by the process described in Preparing Open Georgia Test Data.

NAME (String)TITLE (String)SALARY (float)TRAVEL (float)ORG TYPE (String)ORG (String)YEAR (int)
ABBOTT,DEEDEE WGRADES 9-12 TEACHER52,122.100.00LBOEATLANTA INDEPENDENT SCHOOL SYSTEM2010
ALLEN,ANNETTE DSPEECH-LANGUAGE PATHOLOGIST92,937.28260.42LBOEATLANTA INDEPENDENT SCHOOL SYSTEM2010
BAHR,SHERREEN TGRADE 5 TEACHER52,752.710.00LBOECOBB COUNTY SCHOOL DISTRICT2010
BAILEY,ANTOINETTE RSCHOOL SECRETARY/CLERK19,905.900.00LBOECOBB COUNTY SCHOOL DISTRICT2010
BAILEY,ASHLEY NEARLY INTERVENTION PRIMARY TEACHER43,992.82120.00LBOECOBB COUNTY SCHOOL DISTRICT2010
CALVERT,RONALD MARTINSTATE PATROL (SP)51,370.4062.00SABACPUBLIC SAFETY, DEPARTMENT OF2010
CAMERON,MICHAEL DPUBLIC SAFETY TRN (AL)34,748.60259.35SABACPUBLIC SAFETY, DEPARTMENT OF2010
DAAS,TARWYN TARAGRADES 9-12 TEACHER41,614.500.00LBOEFULTON COUNTY BOARD OF EDUCATION2011
DABBS,SANDRA LGRADES 9-12 TEACHER79,801.5941.00LBOEFULTON COUNTY BOARD OF EDUCATION2011
E'LOM,SOPHIA LIS PERSONNEL - GENERAL ADMIN75,509.00613.73LBOEFULTON COUNTY BOARD OF EDUCATION2012
EADDY,FENNER RSUBSTITUTE13,469.000.00LBOEFULTON COUNTY BOARD OF EDUCATION2012
EADY,ARNETTA AASSISTANT PRINCIPAL71,879.00319.60LBOEFULTON COUNTY BOARD OF EDUCATION2012


In this first installment, let's jump right in where Hadoop began; MapReduce.  After you visit Preparing Open Georgia Test Data and get some test data loaded into HDFS, then you'll want to clone my GitHub repo as referenced in GitHub > lestermartin > hadoop-exploration.  Once you have the code up in your favorite IDE (mine is IntelliJ on my MBPro) then you'll want to hone in on the lestermartin.hadoop.exploration.opengeorgia package (details on the major MapReduce stereotypes in that last link).  You can then build the jar file with Maven; or just grab hadoop-exploration-0.0.1-SNAPSHOT.jar.

As with all three editions of this blog posting series, let's use the Hortonworks Sandbox to run everything.  Make sure the hue user has a folder to put your jar in and then put it there.

HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: 
Last login: Tue Apr 29 16:48:05 2014 from 10.0.2.2
[root@sandbox ~]# su hue
[hue@sandbox root]$ cd ~
[hue@sandbox ~]$ mkdir jars
[hue@sandbox ~]$ exit
exit
[root@sandbox ~]# exit
logout
Connection to 127.0.0.1 closed.
HW10653:target lmartin$ ls 
classes                    maven-archiver
generated-sources            surefire-reports
generated-test-sources            test-classes
hadoop-exploration-0.0.1-SNAPSHOT.jar
HW10653:target lmartin$ scp -P 2222 hadoop-exploration-0.0.1-SNAPSHOT.jar root@127.0.0.1:/usr/lib/hue/jars
root@127.0.0.1's password: 
hadoop-exploration-0.0.1-SNAPSHOT.jar         100%   22KB  22.2KB/s   00:00    
HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: 
Last login: Tue Apr 29 17:48:35 2014 from 10.0.2.2
[root@sandbox ~]# su hue
[hue@sandbox root]$ cd ~/jars
[hue@sandbox jars]$ ls -l
total 24
-rw-r--r-- 1 root root 22678 Apr 29 18:49 hadoop-exploration-0.0.1-SNAPSHOT.jar

Now go ahead and kick it off.

[hue@sandbox jars]$ hdfs dfs -ls /user/hue/opengeorgia
Found 1 items
-rwxr-xr-x   3 hue hue    7612715 2014-04-29 16:53 /user/hue/opengeorgia/salaryTravelReport.csv
[hue@sandbox jars]$ hadoop jar hadoop-exploration-0.0.1-SNAPSHOT.jar lestermartin.hadoop.exploration.opengeorgia.GenerateStatistics opengeorgia/salaryTravelReport.csv opengeorgia/mroutput

   ... MANY LINES REMOVED ...

14/04/29 19:29:42 INFO input.FileInputFormat: Total input paths to process : 1
14/04/29 19:29:42 INFO mapreduce.JobSubmitter: number of splits:1

   ... MANY LINES REMOVED ...

14/04/29 19:29:43 INFO mapreduce.Job: Running job: job_1398691536449_0080
14/04/29 19:29:50 INFO mapreduce.Job: Job job_1398691536449_0080 running in uber mode : false
14/04/29 19:29:50 INFO mapreduce.Job:  map 0% reduce 0%
14/04/29 19:29:58 INFO mapreduce.Job:  map 100% reduce 0%
14/04/29 19:30:05 INFO mapreduce.Job:  map 100% reduce 100%
14/04/29 19:30:05 INFO mapreduce.Job: Job job_1398691536449_0080 completed successfully
14/04/29 19:30:05 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=1279390
        FILE: Number of bytes written=2726197
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=7612865
        HDFS: Number of bytes written=13583
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=48256
        Total time spent by all reduces in occupied slots (ms)=34216
    Map-Reduce Framework
        Map input records=76943
        Map output records=44986
        Map output bytes=1189412
        Map output materialized bytes=1279390
        Input split bytes=144
        Combine input records=0
        Combine output records=0
        Reduce input groups=181
        Reduce shuffle bytes=1279390
        Reduce input records=44986
        Reduce output records=181
        Spilled Records=89972
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=146
        CPU time spent (ms)=5440
        Physical memory (bytes) snapshot=598822912
        Virtual memory (bytes) snapshot=2392014848
        Total committed heap usage (bytes)=507117568
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=7612721
    File Output Format Counters 
        Bytes Written=13583

Here's a few records from the output (not annotating the "... MANY LINES REMOVED ..." as seen above).

[hue@sandbox jars]$ hdfs dfs -cat opengeorgia/mroutput/part-r-00000
ASSISTANT PRINCIPAL    {436,3418.530029296875,119646.3125,76514.0635219189}
AUDIOLOGIST    {9,36329.58984375,102240.4609375,73038.71267361111}
BUS DRIVER    {1321,289.9800109863281,59479.87890625,21016.95711573356}
CROSSING GUARD    {100,188.1699981689453,77890.5703125,4792.818007659912}
CUSTODIAL PERSONNEL    {1393,358.0199890136719,83233.4765625,27299.229113730096}
DEPUTY/ASSOC/ASSISTANT SUPT    {36,15089.6298828125,208606.71875,121005.71796332466}
ELEMENTARY COUNSELOR    {265,1472.8499755859375,96220.078125,58518.51150455115}
ESOL TEACHER    {450,1595.3599853515625,92835.65625,51652.298943684895}
GRADE 1 TEACHER    {1041,1912.280029296875,103549.28125,49438.147760777836}
GRADE 10 TEACHER    {59,3861.159912109375,94732.7421875,55489.47224659031}
GRADE 11 TEACHER    {27,5537.2998046875,101728.1875,58273.76153790509}
GRADE 12 TEACHER    {13,40919.26171875,92376.703125,66923.12620192308}
GRADE 2 TEACHER    {1010,1843.0999755859375,95968.078125,50479.97351545579}
GRADE 3 TEACHER    {1036,730.75,102263.2890625,50409.13624861433}
GRADE 4 TEACHER    {873,2955.43994140625,96430.078125,52342.17116019652}
GRADE 5 TEACHER    {872,1400.0,104698.0,52721.27942734465}
GRADE 6 TEACHER    {239,1665.0799560546875,85595.921875,48597.56130636686}
GRADE 7 TEACHER    {257,1615.260009765625,90778.078125,50304.22050220772}
GRADE 8 TEACHER    {240,1746.1600341796875,85965.3828125,51121.745357767744}
GRADE 9 TEACHER    {35,4128.5,90588.578125,55027.259151785714}
GRADES 6-8 TEACHER    {1607,-909.8400268554688,91402.0390625,49418.24870481651}
GRADES 9-12 TEACHER    {3171,200.0,119430.15625,51375.531383229296}
GRADES K-5 TEACHER    {165,150.0,87925.921875,44650.884348366475}
GRADUATION SPECIALIST    {63,6351.25,91945.0625,58873.631510416664}
HIGH SCHOOL COUNSELOR    {225,1100.0,111393.84375,63814.18197102864}
KINDERGARTEN TEACHER    {1054,1615.4100341796875,103798.0,52818.983106001506}
LIBRARIAN/MEDIA SPECIALIST    {342,3208.25,97282.1875,58324.767315423975}
MIDDLE SCHOOL COUNSELOR    {131,3362.93994140625,99340.078125,61327.238445252864}
MILITARY SCIENCE TEACHER    {98,2328.9599609375,100116.0,62636.252752810105}
PRINCIPAL    {318,2202.22998046875,159299.515625,102604.1484375}
SUBSTITUTE TEACHER    {3816,-1006.5,77007.859375,8846.330627846432}
SUPERINTENDENT    {3,216697.15625,411545.8125,299117.1979166667}
TEACHER SUPPORT SPECIALIST    {204,2409.64990234375,96133.21875,62175.75722608379}
TECHNICAL INSTITUTE PRESIDENT    {1,96884.2421875,96884.2421875,96884.2421875}

Again, many lines were removed as I just wanted an illustrative example.  Being married to a Georgia educator myself, I probably am looking at these numbers and making more observations than most.  For example, do we really have a Kindergarten Teacher making over $100K/year?  Really??  Heck, even the highest paid Military Science Teacher is pulling in six-figures (he probably wasn't doing that on active duty!!).  I also feel for the poor Substitute Teacher that was in the hole over $1000.  I can say with certainty that the average pay for Principals of $102K/year surely isn't enough as that's a job with a TON of responsibilities.

Nonetheless, the goal was see if we could answer the Simple Open Georgia Use Case question which we did.  The next installments will be doing the same thing, but with Pig and then Hive.  Let's make sure we check to see that the average salary for the 9 Audiologists is $73,038.71 and the highest paid 3rd Grade Teacher is $102,263.29 when we perform this analysis again.