use mapreduce to calculate salary statistics for georgia educators (first of a three-part series)
This is the first of a three-part series on showing alternative Hadoop & Big Data tools being utilized for Open Georgia Analysis. The data we are working against looks like the following which is an include of the Format & Sample Data for Open Georgia wiki page.
In this first installment, let's jump right in where Hadoop began; MapReduce. After you visit Preparing Open Georgia Test Data and get some test data loaded into HDFS, then you'll want to clone my GitHub repo as referenced in GitHub > lestermartin > hadoop-exploration. Once you have the code up in your favorite IDE (mine is IntelliJ on my MBPro) then you'll want to hone in on the lestermartin.hadoop.exploration.opengeorgia package (details on the major MapReduce stereotypes in that last link). You can then build the jar file with Maven; or just grab hadoop-exploration-0.0.1-SNAPSHOT.jar.
As with all three editions of this blog posting series, let's use the Hortonworks Sandbox to run everything. Make sure the hue
user has a folder to put your jar in and then put it there.
HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password:
Last login: Tue Apr 29 16:48:05 2014 from 10.0.2.2
[root@sandbox ~]# su hue
[hue@sandbox root]$ cd ~
[hue@sandbox ~]$ mkdir jars
[hue@sandbox ~]$ exit
exit
[root@sandbox ~]# exit
logout
Connection to 127.0.0.1 closed.
HW10653:target lmartin$ ls
classes maven-archiver
generated-sources surefire-reports
generated-test-sources test-classes
hadoop-exploration-0.0.1-SNAPSHOT.jar
HW10653:target lmartin$ scp -P 2222 hadoop-exploration-0.0.1-SNAPSHOT.jar root@127.0.0.1:/usr/lib/hue/jars
root@127.0.0.1's password:
hadoop-exploration-0.0.1-SNAPSHOT.jar 100% 22KB 22.2KB/s 00:00
HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password:
Last login: Tue Apr 29 17:48:35 2014 from 10.0.2.2
[root@sandbox ~]# su hue
[hue@sandbox root]$ cd ~/jars
[hue@sandbox jars]$ ls -l
total 24
-rw-r--r-- 1 root root 22678 Apr 29 18:49 hadoop-exploration-0.0.1-SNAPSHOT.jar
Now go ahead and kick it off.
[hue@sandbox jars]$ hdfs dfs -ls /user/hue/opengeorgia
Found 1 items
-rwxr-xr-x 3 hue hue 7612715 2014-04-29 16:53 /user/hue/opengeorgia/salaryTravelReport.csv
[hue@sandbox jars]$ hadoop jar hadoop-exploration-0.0.1-SNAPSHOT.jar lestermartin.hadoop.exploration.opengeorgia.GenerateStatistics opengeorgia/salaryTravelReport.csv opengeorgia/mroutput
... MANY LINES REMOVED ...
14/04/29 19:29:42 INFO input.FileInputFormat: Total input paths to process : 1
14/04/29 19:29:42 INFO mapreduce.JobSubmitter: number of splits:1
... MANY LINES REMOVED ...
14/04/29 19:29:43 INFO mapreduce.Job: Running job: job_1398691536449_0080
14/04/29 19:29:50 INFO mapreduce.Job: Job job_1398691536449_0080 running in uber mode : false
14/04/29 19:29:50 INFO mapreduce.Job: map 0% reduce 0%
14/04/29 19:29:58 INFO mapreduce.Job: map 100% reduce 0%
14/04/29 19:30:05 INFO mapreduce.Job: map 100% reduce 100%
14/04/29 19:30:05 INFO mapreduce.Job: Job job_1398691536449_0080 completed successfully
14/04/29 19:30:05 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=1279390
FILE: Number of bytes written=2726197
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=7612865
HDFS: Number of bytes written=13583
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=48256
Total time spent by all reduces in occupied slots (ms)=34216
Map-Reduce Framework
Map input records=76943
Map output records=44986
Map output bytes=1189412
Map output materialized bytes=1279390
Input split bytes=144
Combine input records=0
Combine output records=0
Reduce input groups=181
Reduce shuffle bytes=1279390
Reduce input records=44986
Reduce output records=181
Spilled Records=89972
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=146
CPU time spent (ms)=5440
Physical memory (bytes) snapshot=598822912
Virtual memory (bytes) snapshot=2392014848
Total committed heap usage (bytes)=507117568
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=7612721
File Output Format Counters
Bytes Written=13583
Here's a few records from the output (not annotating the "... MANY LINES REMOVED ...
" as seen above).
[hue@sandbox jars]$ hdfs dfs -cat opengeorgia/mroutput/part-r-00000
ASSISTANT PRINCIPAL {436,3418.530029296875,119646.3125,76514.0635219189}
AUDIOLOGIST {9,36329.58984375,102240.4609375,73038.71267361111}
BUS DRIVER {1321,289.9800109863281,59479.87890625,21016.95711573356}
CROSSING GUARD {100,188.1699981689453,77890.5703125,4792.818007659912}
CUSTODIAL PERSONNEL {1393,358.0199890136719,83233.4765625,27299.229113730096}
DEPUTY/ASSOC/ASSISTANT SUPT {36,15089.6298828125,208606.71875,121005.71796332466}
ELEMENTARY COUNSELOR {265,1472.8499755859375,96220.078125,58518.51150455115}
ESOL TEACHER {450,1595.3599853515625,92835.65625,51652.298943684895}
GRADE 1 TEACHER {1041,1912.280029296875,103549.28125,49438.147760777836}
GRADE 10 TEACHER {59,3861.159912109375,94732.7421875,55489.47224659031}
GRADE 11 TEACHER {27,5537.2998046875,101728.1875,58273.76153790509}
GRADE 12 TEACHER {13,40919.26171875,92376.703125,66923.12620192308}
GRADE 2 TEACHER {1010,1843.0999755859375,95968.078125,50479.97351545579}
GRADE 3 TEACHER {1036,730.75,102263.2890625,50409.13624861433}
GRADE 4 TEACHER {873,2955.43994140625,96430.078125,52342.17116019652}
GRADE 5 TEACHER {872,1400.0,104698.0,52721.27942734465}
GRADE 6 TEACHER {239,1665.0799560546875,85595.921875,48597.56130636686}
GRADE 7 TEACHER {257,1615.260009765625,90778.078125,50304.22050220772}
GRADE 8 TEACHER {240,1746.1600341796875,85965.3828125,51121.745357767744}
GRADE 9 TEACHER {35,4128.5,90588.578125,55027.259151785714}
GRADES 6-8 TEACHER {1607,-909.8400268554688,91402.0390625,49418.24870481651}
GRADES 9-12 TEACHER {3171,200.0,119430.15625,51375.531383229296}
GRADES K-5 TEACHER {165,150.0,87925.921875,44650.884348366475}
GRADUATION SPECIALIST {63,6351.25,91945.0625,58873.631510416664}
HIGH SCHOOL COUNSELOR {225,1100.0,111393.84375,63814.18197102864}
KINDERGARTEN TEACHER {1054,1615.4100341796875,103798.0,52818.983106001506}
LIBRARIAN/MEDIA SPECIALIST {342,3208.25,97282.1875,58324.767315423975}
MIDDLE SCHOOL COUNSELOR {131,3362.93994140625,99340.078125,61327.238445252864}
MILITARY SCIENCE TEACHER {98,2328.9599609375,100116.0,62636.252752810105}
PRINCIPAL {318,2202.22998046875,159299.515625,102604.1484375}
SUBSTITUTE TEACHER {3816,-1006.5,77007.859375,8846.330627846432}
SUPERINTENDENT {3,216697.15625,411545.8125,299117.1979166667}
TEACHER SUPPORT SPECIALIST {204,2409.64990234375,96133.21875,62175.75722608379}
TECHNICAL INSTITUTE PRESIDENT {1,96884.2421875,96884.2421875,96884.2421875}
Again, many lines were removed as I just wanted an illustrative example. Being married to a Georgia educator myself, I probably am looking at these numbers and making more observations than most. For example, do we really have a Kindergarten Teacher making over $100K/year? Really?? Heck, even the highest paid Military Science Teacher is pulling in six-figures (he probably wasn't doing that on active duty!!). I also feel for the poor Substitute Teacher that was in the hole over $1000. I can say with certainty that the average pay for Principals of $102K/year surely isn't enough as that's a job with a TON of responsibilities.
Nonetheless, the goal was see if we could answer the Simple Open Georgia Use Case question which we did. The next installments will be doing the same thing, but with Pig and then Hive. Let's make sure we check to see that the average salary for the 9 Audiologists is $73,038.71 and the highest paid 3rd Grade Teacher is $102,263.29 when we perform this analysis again.