Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • use mapreduce to calculate salary statistics for georgia educators (first of a three-part series)
  • use pig to calculate salary statistics for georgia educators (second of a three-part series)
  • use hive to calculate salary statistics for georgia educators (third of a three-part series)

 

 

 

 

Warning

Please ignore everything below this warning – it will soon move!!

 

 

We can then just upload the file via Hue to /user/hue/something which has the following format.

add table with format and sample data...

We just want to know answer a simple question.  Only for the year 2010 and for employees of local boards of education, what is the total number of employees & min/max/average salary for each job title.

For MapReduce, do ...

 

For Pig, a script such as the following will generate these statistics.

Code Block
themeMidnight
languagetext
titleTitleBreakdownForSchoolsIn2010.pig
-- load up the base User Defined Function library (piggybank) and get a handle on the REPLACE function
register /user/hue/shared/pig/udfs/piggybank.jar;
define REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
 
-- load the salary file and declare its structure
inputFile = LOAD '/user/hue/vzSeminar/openGeorgia/salaryTravelReport-SMALL.csv'
 using org.apache.pig.piggybank.storage.CSVExcelStorage()
 as (name:chararray, title:chararray, salary:chararray, travel:chararray, orgType:chararray, org:chararray, year:int);

-- loop thru the input data to clean up the number fields a bit
cleanedUpNumbers = foreach inputFile GENERATE
 name as name, title as title,
 (float)REPLACE(salary, ',','') as salary,  -- take out the commas and cast to a float
 (float)REPLACE(travel, ',','') as travel,  -- take out the commas and cast to a float
 orgType as orgType, org as org, year as year;
 
-- trim down to just Local Boards of Education
onlySchoolBoards = filter cleanedUpNumbers by orgType == 'LBOE';
 
-- further trim it down to just be for the year in question
onlySchoolBoardsFor2010 = filter onlySchoolBoards by year == 2010;
 
-- bucket them up by the job title
byTitle = GROUP onlySchoolBoardsFor2010 BY title;
 
-- loop through the titles and for each one...
salaryBreakdown = FOREACH byTitle GENERATE
 group as title, -- we grouped on this above
 COUNT(onlySchoolBoardsFor2010), -- how many people with this title
 MIN(onlySchoolBoardsFor2010.salary), -- determine the min
 MAX(onlySchoolBoardsFor2010.salary), -- determine the max
 AVG(onlySchoolBoardsFor2010.salary); -- determine the avg 
 
-- guarantee the order on the way out
sortedSalaryBreakdown = ORDER salaryBreakdown by title;

dump sortedSalaryBreakdown;
-- save results back to HDFS
--STORE sortedSalaryBreakdown into '/user/hue/vzSeminar/openGeorgia/titleBreakdownForSchoolsIn2010.txt';

Sample output.  UPDATE IT WITH FULL DATA VALUES!!

Code Block
languagetext
(ADAPTED PHYS ED TEACHER,4,24349.45,64689.09,43712.0302734375)
(ADULT EDUCATION DIRECTOR/COORD,3,1033.6,74164.01,37078.07238769531)
(ADULT EDUCATION TEACHER,9,3002.0,52424.72,16903.59646267361)
(ASSISTANT PRINCIPAL,38,6243.25,100989.04,77926.6794819079)
(ATTENDANCE WORKER,1,7742.09,7742.09,7742.08984375)
(AUDITOR,1,83317.92,83317.92,83317.921875)
(BOOKKEEPER,11,10650.44,33484.04,26734.2587890625)
(BUS DRIVER,49,767.68,50915.81,20278.801754075655)
(BUSINESS SERV SECRETARY/CLERK,26,789.99,72080.0,26443.907658503605)
(CENTRAL SUPPORT CLERK,1,23392.33,23392.33,23392.330078125)
(CONSTRUCTION MANAGER,1,33890.67,33890.67,33890.671875)
(CROSSING GUARD,13,528.09,5384.09,3670.4876755934497)
(CUSTODIAL PERSONNEL,132,3803.52,46626.88,26624.915542140152)
(DEPUTY/ASSOC/ASSISTANT SUPT,2,108993.76,161051.34,135022.55078125)
(DIAGNOSTICIAN,4,59711.15,69417.72,65523.5888671875)
(DIRECTOR OF CURRICULUM/INSTR,6,31162.52,188324.23,98002.13346354167)

With Hive, tons easier still!!

Code Block
languagetext
titleTitleBreakdownForSchoolsIn2010.hql
SELECT title, count(title) as count, 
       MIN(salary) as min, MAX(salary) as max, AVG(salary) as avg
  FROM testing1
 where orgType = 'LBOE'
   and year = 2010
 group by title;