Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: cross-linked spark implementation

...

Now, we've got some data and are ready to answer the following question as a simple, figurative, example of what kinds of analysis could be done.

Include Page
Simple Open Georgia Use Case
Simple Open Georgia Use Case

As for which tools to use, the following list of blog entries (if they are not linked, they are coming soon) presents varying tool options to address questions such as this.

 

 

 

 

Warning

Please ignore everything below this warning – it will soon move!!

 

 

We can then just upload the file via Hue to /user/hue/something which has the following format.

add table with format and sample data...

We just want to know answer a simple question.  Only for the year 2010 and for employees of local boards of education, what is the total number of employees & min/max/average salary for each job title.

For MapReduce, do ...

 

For Pig, a script such as the following will generate these statistics.

Code Block
themeMidnight
languagetext
titleTitleBreakdownForSchoolsIn2010.pig
-- load up the base User Defined Function library (piggybank) and get a handle on the REPLACE function
register /user/hue/shared/pig/udfs/piggybank.jar;
define REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
 
-- load the salary file and declare its structure
inputFile = LOAD '/user/hue/vzSeminar/openGeorgia/salaryTravelReport-SMALL.csv'
 using org.apache.pig.piggybank.storage.CSVExcelStorage()
 as (name:chararray, title:chararray, salary:chararray, travel:chararray, orgType:chararray, org:chararray, year:int);

-- loop thru the input data to clean up the number fields a bit
cleanedUpNumbers = foreach inputFile GENERATE
 name as name, title as title,
 (float)REPLACE(salary, ',','') as salary,  -- take out the commas and cast to a float
 (float)REPLACE(travel, ',','') as travel,  -- take out the commas and cast to a float
 orgType as orgType, org as org, year as year;
 
-- trim down to just Local Boards of Education
onlySchoolBoards = filter cleanedUpNumbers by orgType == 'LBOE';
 
-- further trim it down to just be for the year in question
onlySchoolBoardsFor2010 = filter onlySchoolBoards by year == 2010;
 
-- bucket them up by the job title
byTitle = GROUP onlySchoolBoardsFor2010 BY title;
 
-- loop through the titles and for each one...
salaryBreakdown = FOREACH byTitle GENERATE
 group as title, -- we grouped on this above
 COUNT(onlySchoolBoardsFor2010), -- how many people with this title
 MIN(onlySchoolBoardsFor2010.salary), -- determine the min
 MAX(onlySchoolBoardsFor2010.salary), -- determine the max
 AVG(onlySchoolBoardsFor2010.salary); -- determine the avg 
 
-- guarantee the order on the way out
sortedSalaryBreakdown = ORDER salaryBreakdown by title;

dump sortedSalaryBreakdown;
-- save results back to HDFS
--STORE sortedSalaryBreakdown into '/user/hue/vzSeminar/openGeorgia/titleBreakdownForSchoolsIn2010.txt';

Sample output.  UPDATE IT WITH FULL DATA VALUES!!

Code Block
languagetext
(ADAPTED PHYS ED TEACHER,4,24349.45,64689.09,43712.0302734375)
(ADULT EDUCATION DIRECTOR/COORD,3,1033.6,74164.01,37078.07238769531)
(ADULT EDUCATION TEACHER,9,3002.0,52424.72,16903.59646267361)
(ASSISTANT PRINCIPAL,38,6243.25,100989.04,77926.6794819079)
(ATTENDANCE WORKER,1,7742.09,7742.09,7742.08984375)
(AUDITOR,1,83317.92,83317.92,83317.921875)
(BOOKKEEPER,11,10650.44,33484.04,26734.2587890625)
(BUS DRIVER,49,767.68,50915.81,20278.801754075655)
(BUSINESS SERV SECRETARY/CLERK,26,789.99,72080.0,26443.907658503605)
(CENTRAL SUPPORT CLERK,1,23392.33,23392.33,23392.330078125)
(CONSTRUCTION MANAGER,1,33890.67,33890.67,33890.671875)
(CROSSING GUARD,13,528.09,5384.09,3670.4876755934497)
(CUSTODIAL PERSONNEL,132,3803.52,46626.88,26624.915542140152)
(DEPUTY/ASSOC/ASSISTANT SUPT,2,108993.76,161051.34,135022.55078125)
(DIAGNOSTICIAN,4,59711.15,69417.72,65523.5888671875)
(DIRECTOR OF CURRICULUM/INSTR,6,31162.52,188324.23,98002.13346354167)

With Hive, tons easier still!!

...

languagetext
titleTitleBreakdownForSchoolsIn2010.hql

...