This is the first of a three-part series showing alternative Hadoop tools applied to the Open Georgia Analysis. The data we are working against looks like the following, which is included from the Format & Sample Data for Open Georgia wiki page.
The following describes the format of the dataset used for Open Georgia Analysis; the dataset itself was created by the process described in Preparing Open Georgia Test Data.
NAME (String) | TITLE (String) | SALARY (float) | TRAVEL (float) | ORG TYPE (String) | ORG (String) | YEAR (int) |
---|---|---|---|---|---|---|
ABBOTT,DEEDEE W | GRADES 9-12 TEACHER | 52,122.10 | 0.00 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010 |
ALLEN,ANNETTE D | SPEECH-LANGUAGE PATHOLOGIST | 92,937.28 | 260.42 | LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010 |
BAHR,SHERREEN T | GRADE 5 TEACHER | 52,752.71 | 0.00 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010 |
BAILEY,ANTOINETTE R | SCHOOL SECRETARY/CLERK | 19,905.90 | 0.00 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010 |
BAILEY,ASHLEY N | EARLY INTERVENTION PRIMARY TEACHER | 43,992.82 | 120.00 | LBOE | COBB COUNTY SCHOOL DISTRICT | 2010 |
CALVERT,RONALD MARTIN | STATE PATROL (SP) | 51,370.40 | 62.00 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010 |
CAMERON,MICHAEL D | PUBLIC SAFETY TRN (AL) | 34,748.60 | 259.35 | SABAC | PUBLIC SAFETY, DEPARTMENT OF | 2010 |
DAAS,TARWYN TARA | GRADES 9-12 TEACHER | 41,614.50 | 0.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011 |
DABBS,SANDRA L | GRADES 9-12 TEACHER | 79,801.59 | 41.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2011 |
E'LOM,SOPHIA L | IS PERSONNEL - GENERAL ADMIN | 75,509.00 | 613.73 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012 |
EADDY,FENNER R | SUBSTITUTE | 13,469.00 | 0.00 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012 |
EADY,ARNETTA A | ASSISTANT PRINCIPAL | 71,879.00 | 319.60 | LBOE | FULTON COUNTY BOARD OF EDUCATION | 2012 |
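To make the column types concrete, here is a minimal Java sketch that parses one row, written exactly as it is displayed in the table above, into typed fields. The class name `SalaryRecord` and the idea that a record arrives as pipe-separated text are assumptions for illustration only. The file actually loaded into HDFS is a CSV whose exact quoting isn't shown here, and the repo's own parsing code may well differ. The detail worth noticing is that SALARY and TRAVEL are displayed with thousands separators, which have to be stripped before the values parse as numbers.

```java
public final class SalaryRecord {
    public final String name;
    public final String title;
    public final double salary;   // table says float; double is used here for convenience
    public final double travel;
    public final String orgType;
    public final String org;
    public final int year;

    private SalaryRecord(String name, String title, double salary, double travel,
                         String orgType, String org, int year) {
        this.name = name;
        this.title = title;
        this.salary = salary;
        this.travel = travel;
        this.orgType = orgType;
        this.org = org;
        this.year = year;
    }

    /** Parses one row laid out like the sample table: seven fields separated by '|'. */
    public static SalaryRecord parse(String row) {
        // String.split drops trailing empty strings, so the table's final '|' is harmless.
        String[] f = row.split("\\|");
        if (f.length < 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + f.length);
        }
        return new SalaryRecord(f[0].trim(), f[1].trim(), amount(f[2]), amount(f[3]),
                f[4].trim(), f[5].trim(), Integer.parseInt(f[6].trim()));
    }

    /** Converts a displayed amount such as "52,122.10" into a number. */
    private static double amount(String text) {
        return Double.parseDouble(text.trim().replace(",", ""));
    }

    public static void main(String[] args) {
        SalaryRecord r = parse("ABBOTT,DEEDEE W | GRADES 9-12 TEACHER | 52,122.10 | 0.00 "
                + "| LBOE | ATLANTA INDEPENDENT SCHOOL SYSTEM | 2010 |");
        System.out.println(r.title + " " + r.salary + " " + r.year);
    }
}
```

Note that NAME and ORG both contain commas in the sample rows, so a naive split on commas would not work against this data.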
In this first installment, let's jump right in where Hadoop began: MapReduce. After you visit Preparing Open Georgia Test Data and get some test data loaded into HDFS, clone my GitHub repo as referenced in GitHub > lestermartin > hadoop-exploration. Once you have the code up in your favorite IDE (mine is IntelliJ on my MBPro), home in on the lestermartin.hadoop.exploration.opengeorgia package (that last link has details on the major MapReduce stereotypes). You can then build the jar file with Maven (typically mvn clean package); it lands in the project's target directory.
As with all three posts in this series, let's use the Hortonworks Sandbox to run everything. Make sure the hue user has a folder to put your jar in, and then copy the jar there.
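For readers new to the model, the shape of a MapReduce job can be imitated in plain Java with no Hadoop dependencies: a map step emits key/value pairs, a shuffle groups the values by key, and a reduce step folds each group into a result. The sketch below is not the code from the repo. Keying by TITLE and computing count, min, max, and average SALARY is an assumption chosen purely for illustration, as are the class name `TitleSalaryStats`, its method names, and the simplified "TITLE|SALARY" input lines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class TitleSalaryStats {

    /** Map step: turn one hypothetical "TITLE|SALARY" line into a (title, salary) pair. */
    public static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split("\\|");
        double salary = Double.parseDouble(parts[1].trim().replace(",", ""));
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), salary);
    }

    /** Shuffle step: collect every emitted value under its key, with keys kept sorted. */
    public static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce step: fold one key's salaries into {count, min, max, average}. */
    public static double[] reduce(List<Double> salaries) {
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
        double sum = 0;
        for (double s : salaries) {
            min = Math.min(min, s);
            max = Math.max(max, s);
            sum += s;
        }
        return new double[] {salaries.size(), min, max, sum / salaries.size()};
    }

    /** Runs map, shuffle, and reduce in sequence over in-memory lines. */
    public static Map<String, double[]> run(List<String> lines) {
        List<Map.Entry<String, Double>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        Map<String, double[]> results = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : shuffle(pairs).entrySet()) {
            results.put(e.getKey(), reduce(e.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        // Titles and salaries taken from the sample rows shown earlier.
        List<String> lines = Arrays.asList(
                "GRADES 9-12 TEACHER|52,122.10",
                "GRADE 5 TEACHER|52,752.71",
                "GRADES 9-12 TEACHER|41,614.50",
                "GRADES 9-12 TEACHER|79,801.59");
        for (Map.Entry<String, double[]> e : run(lines).entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%s count=%d min=%.2f max=%.2f avg=%.2f%n",
                    e.getKey(), (long) s[0], s[1], s[2], s[3]);
        }
    }
}
```

In a real Hadoop job the framework performs the shuffle and distributes the map and reduce calls across the cluster; here everything runs in one JVM so the data flow is easy to follow.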
    HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
    root@127.0.0.1's password:
    Last login: Tue Apr 29 16:48:05 2014 from 10.0.2.2
    [root@sandbox ~]# su hue
    [hue@sandbox root]$ cd ~
    [hue@sandbox ~]$ mkdir jars
    [hue@sandbox ~]$ exit
    exit
    [root@sandbox ~]# exit
    logout
    Connection to 127.0.0.1 closed.
    HW10653:target lmartin$ ls
    classes                                 maven-archiver
    generated-sources                       surefire-reports
    generated-test-sources                  test-classes
    hadoop-exploration-0.0.1-SNAPSHOT.jar
    HW10653:target lmartin$ scp -P 2222 hadoop-exploration-0.0.1-SNAPSHOT.jar root@127.0.0.1:/usr/lib/hue/jars
    root@127.0.0.1's password:
    hadoop-exploration-0.0.1-SNAPSHOT.jar 100% 22KB 22.2KB/s 00:00
    HW10653:target lmartin$ ssh root@127.0.0.1 -p 2222
    root@127.0.0.1's password:
    Last login: Tue Apr 29 17:48:35 2014 from 10.0.2.2
    [root@sandbox ~]# su hue
    [hue@sandbox root]$ cd ~/jars
    [hue@sandbox jars]$ ls -l
    total 24
    -rw-r--r-- 1 root root 22678 Apr 29 18:49 hadoop-exploration-0.0.1-SNAPSHOT.jar
Now go ahead and kick it off.
    [hue@sandbox jars]$ hdfs dfs -ls /user/hue/opengeorgia
    Found 1 items
    -rwxr-xr-x 3 hue hue 7612715 2014-04-29 16:53 /user/hue/opengeorgia/salaryTravelReport.csv
    [hue@sandbox jars]$ hadoop jar hadoop-exploration-0.0.1-SNAPSHOT.jar lestermartin.hadoop.exploration.opengeorgia.GenerateStatistics opengeorgia/salaryTravelReport.csv mroutput
    ... MANY LINES REMOVED ...
    XXXXXXxXXXXXXXXX