This is the first of a three-part series on showing alternative Hadoop tools being utilized for Open Georgia Analysis. The data we are working against looks like the following which is an include of the Format & Sample Data for Open Georgia wiki page.
In this first installment, let's jump right in where Hadoop began; MapReduce. After you visit Preparing Open Georgia Test Data and get some test data loaded into HDFS, then you'll want to clone my GitHub repo as referenced in GitHub > lestermartin > hadoop-exploration. Once you have the code up in your favorite IDE (mine is IntelliJ on my MBPro) then you'll want to hone in on the lestermartin.hadoop.exploration.opengeorgia package (details on the major MapReduce stereotypes in that last link). You can then build the jar file with Maven.
As with all three editions of this blog posting series, let's use the Hortonworks Sandbox to run everything. Make sure the hue user has a folder to put your jar in and then put it there.
HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222 root@127.0.0.1's password: Last login: Tue Apr 29 16:48:05 2014 from 10.0.2.2 [root@sandbox ~]# su hue [hue@sandbox root]$ cd ~ [hue@sandbox ~]$ mkdir jars [hue@sandbox ~]$ exit exit [root@sandbox ~]# exit logout Connection to 127.0.0.1 closed. HW10653:target lmartin$ ls classes maven-archiver generated-sources surefire-reports generated-test-sources test-classes hadoop-exploration-0.0.1-SNAPSHOT.jar HW10653:target lmartin$ scp -P 2222 hadoop-exploration-0.0.1-SNAPSHOT.jar root@127.0.0.1:/usr/lib/hue/jars root@127.0.0.1's password: hadoop-exploration-0.0.1-SNAPSHOT.jar 100% 22KB 22.2KB/s 00:00 HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222 root@127.0.0.1's password: Last login: Tue Apr 29 17:48:35 2014 from 10.0.2.2 [root@sandbox ~]# su hue [hue@sandbox root]$ cd ~/jars [hue@sandbox jars]$ ls -l total 24 -rw-r--r-- 1 root root 22678 Apr 29 18:49 hadoop-exploration-0.0.1-SNAPSHOT.jar |
Now go ahead and kick it off.
[hue@sandbox jars]$ hdfs dfs -ls /user/hue/opengeorgia Found 1 items -rwxr-xr-x 3 hue hue 7612715 2014-04-29 16:53 /user/hue/opengeorgia/salaryTravelReport.csv [hue@sandbox jars]$ hadoop jar hadoop-exploration-0.0.1-SNAPSHOT.jar lestermartin.hadoop.exploration.opengeorgia.GenerateStatistics opengeorgia/salaryTravelReport.csv opengeorgia/mroutput ... MANY LINES REMOVED ... 14/04/29 19:29:42 INFO input.FileInputFormat: Total input paths to process : 1 14/04/29 19:29:42 INFO mapreduce.JobSubmitter: number of splits:1 ... MANY LINES REMOVED ... 14/04/29 19:29:43 INFO mapreduce.Job: Running job: job_1398691536449_0080 14/04/29 19:29:50 INFO mapreduce.Job: Job job_1398691536449_0080 running in uber mode : false 14/04/29 19:29:50 INFO mapreduce.Job: map 0% reduce 0% 14/04/29 19:29:58 INFO mapreduce.Job: map 100% reduce 0% 14/04/29 19:30:05 INFO mapreduce.Job: map 100% reduce 100% 14/04/29 19:30:05 INFO mapreduce.Job: Job job_1398691536449_0080 completed successfully 14/04/29 19:30:05 INFO mapreduce.Job: Counters: 43 File System Counters FILE: Number of bytes read=1279390 FILE: Number of bytes written=2726197 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=7612865 HDFS: Number of bytes written=13583 HDFS: Number of read operations=6 HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job Counters Launched map tasks=1 Launched reduce tasks=1 Data-local map tasks=1 Total time spent by all maps in occupied slots (ms)=48256 Total time spent by all reduces in occupied slots (ms)=34216 Map-Reduce Framework Map input records=76943 Map output records=44986 Map output bytes=1189412 Map output materialized bytes=1279390 Input split bytes=144 Combine input records=0 Combine output records=0 Reduce input groups=181 Reduce shuffle bytes=1279390 Reduce input records=44986 Reduce output records=181 Spilled Records=89972 Shuffled Maps =1 Failed Shuffles=0 Merged Map outputs=1 GC time elapsed (ms)=146 CPU time spent (ms)=5440 Physical memory (bytes) snapshot=598822912 Virtual memory (bytes) snapshot=2392014848 Total committed heap usage (bytes)=507117568 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=7612721 File Output Format Counters Bytes Written=13583 |
Here's a few records from the output (not annotating the "... MANY LINES REMOVED ...
" as seen above).
[hue@sandbox jars]$ hdfs dfs -cat opengeorgia/mroutput/part-r-00000 ASSISTANT PRINCIPAL {436,3418.530029296875,119646.3125,76514.0635219189} AUDIOLOGIST {9,36329.58984375,102240.4609375,73038.71267361111} BUS DRIVER {1321,289.9800109863281,59479.87890625,21016.95711573356} CROSSING GUARD {100,188.1699981689453,77890.5703125,4792.818007659912} CUSTODIAL PERSONNEL {1393,358.0199890136719,83233.4765625,27299.229113730096} DEPUTY/ASSOC/ASSISTANT SUPT {36,15089.6298828125,208606.71875,121005.71796332466} ELEMENTARY COUNSELOR {265,1472.8499755859375,96220.078125,58518.51150455115} ESOL TEACHER {450,1595.3599853515625,92835.65625,51652.298943684895} GRADE 1 TEACHER {1041,1912.280029296875,103549.28125,49438.147760777836} GRADE 10 TEACHER {59,3861.159912109375,94732.7421875,55489.47224659031} GRADE 11 TEACHER {27,5537.2998046875,101728.1875,58273.76153790509} GRADE 12 TEACHER {13,40919.26171875,92376.703125,66923.12620192308} GRADE 2 TEACHER {1010,1843.0999755859375,95968.078125,50479.97351545579} GRADE 3 TEACHER {1036,730.75,102263.2890625,50409.13624861433} GRADE 4 TEACHER {873,2955.43994140625,96430.078125,52342.17116019652} GRADE 5 TEACHER {872,1400.0,104698.0,52721.27942734465} GRADE 6 TEACHER {239,1665.0799560546875,85595.921875,48597.56130636686} GRADE 7 TEACHER {257,1615.260009765625,90778.078125,50304.22050220772} GRADE 8 TEACHER {240,1746.1600341796875,85965.3828125,51121.745357767744} GRADE 9 TEACHER {35,4128.5,90588.578125,55027.259151785714} GRADES 6-8 TEACHER {1607,-909.8400268554688,91402.0390625,49418.24870481651} GRADES 9-12 TEACHER {3171,200.0,119430.15625,51375.531383229296} GRADES K-5 TEACHER {165,150.0,87925.921875,44650.884348366475} GRADUATION SPECIALIST {63,6351.25,91945.0625,58873.631510416664} HIGH SCHOOL COUNSELOR {225,1100.0,111393.84375,63814.18197102864} KINDERGARTEN TEACHER {1054,1615.4100341796875,103798.0,52818.983106001506} LIBRARIAN/MEDIA SPECIALIST {342,3208.25,97282.1875,58324.767315423975} MIDDLE SCHOOL COUNSELOR {131,3362.93994140625,99340.078125,61327.238445252864} MILITARY SCIENCE TEACHER {98,2328.9599609375,100116.0,62636.252752810105} PRINCIPAL {318,2202.22998046875,159299.515625,102604.1484375} SUBSTITUTE TEACHER {3816,-1006.5,77007.859375,8846.330627846432} SUPERINTENDENT {3,216697.15625,411545.8125,299117.1979166667} TEACHER SUPPORT SPECIALIST {204,2409.64990234375,96133.21875,62175.75722608379} TECHNICAL INSTITUTE PRESIDENT {1,96884.2421875,96884.2421875,96884.2421875} |
Again, many lines were removed as I just wanted an illustrative example. Being married to a Georgia educator myself, I probably am looking at these numbers and making more observations than most. For example, do we really have a Kindergarten Teacher making over $100K/year? Really?? Heck, even the highest paid Military Science Teacher is pulling in six-figures (he probably wasn't doing that on active duty!!). I also feel for the poor Substitue Teacher that was in the hole over $1000. I can say with certainty that the average pay for Principals of $102K/year surely isn't enough as that's a job with a TON of responsibilities.
Nonetheless, the goal was see if we could answer the Simple Open Georgia Use Case question which we did. The next installments will be doing the same thing, but with Pig and then Hive. Let's make sure we check to see that the average salary for the 9 Audiologists is $73,038.71 and the highest paid 3rd Grade Teacher is $102,263.29 when we perform this analysis again.