Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 8 Next »

WIP

Work In Progress – Please pardon my mess!

This will be the parent page for analysis efforts done on the Open Georgia (Transparency in Government) website; http://www.open.georgia.gov/.

Want a more complete ELT (not ETL) process, but for now can just hand-swizzle the data to generate a test file by doing X, Y, and Z.  We can then just upload the file via Hue to /user/hue/something which has the following format.

add table with format and sample data...

We just want to know answer a simple question.  Only for the year 2010 and for employees of local boards of education, what is the total number of employees & min/max/average salary for each job title.

For MapReduce, do ...

 

For Pig, a script such as the following will generate these statistics.

TitleBreakdownForSchoolsIn2010.pig
-- load up the base User Defined Function library (piggybank) and get a handle on the REPLACE function
register /user/hue/shared/pig/udfs/piggybank.jar;
define REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
 
-- load the salary file and declare its structure
inputFile = LOAD '/user/hue/vzSeminar/openGeorgia/salaryTravelReport-SMALL.csv'
 using org.apache.pig.piggybank.storage.CSVExcelStorage()
 as (name:chararray, title:chararray, salary:chararray, travel:chararray, orgType:chararray, org:chararray, year:int);

-- loop thru the input data to clean up the number fields a bit
cleanedUpNumbers = foreach inputFile GENERATE
 name as name, title as title,
 (float)REPLACE(salary, ',','') as salary,  -- take out the commas and cast to a float
 (float)REPLACE(travel, ',','') as travel,  -- take out the commas and cast to a float
 orgType as orgType, org as org, year as year;
 
-- trim down to just Local Boards of Education
onlySchoolBoards = filter cleanedUpNumbers by orgType == 'LBOE';
 
-- further trim it down to just be for the year in question
onlySchoolBoardsFor2010 = filter onlySchoolBoards by year == 2010;
 
-- bucket them up by the job title
byTitle = GROUP onlySchoolBoardsFor2010 BY title;
 
-- loop through the titles and for each one...
salaryBreakdown = FOREACH byTitle GENERATE
 group as title, -- we grouped on this above
 COUNT(onlySchoolBoardsFor2010), -- how many people with this title
 MIN(onlySchoolBoardsFor2010.salary), -- determine the min
 MAX(onlySchoolBoardsFor2010.salary), -- determine the max
 AVG(onlySchoolBoardsFor2010.salary); -- determine the avg 
 
-- guarantee the order on the way out
sortedSalaryBreakdown = ORDER salaryBreakdown by title;

dump sortedSalaryBreakdown;
-- save results back to HDFS
--STORE sortedSalaryBreakdown into '/user/hue/vzSeminar/openGeorgia/titleBreakdownForSchoolsIn2010.txt';

Sample output.

(ADAPTED PHYS ED TEACHER,35,19384.24,96320.19)
(ADULT EDUCATION DIRECTOR/COORD,19,1033.60,99918.00)
(ADULT EDUCATION TEACHER,63,10120.79,9987.29)
(AFTER-SCHOOL PROGRAM WORKER,2,624.04,78493.98)
(ALTERNATIVE SCHOOL DIRECTOR,2,111199.80,127149.12)
(ASSISTANT PRINCIPAL,436,100008.96,99928.92)
(ATHLETICS DIRECTOR,1,122789.04,122789.04)
(ATTENDANCE WORKER,12,10300.04,9573.33)
(AUDIOLOGIST,9,102240.46,81184.24)
(AUDITOR,11,50214.08,83811.88)
(BOOKKEEPER,118,10650.44,93.38)
(BUS DRIVER,1321,10125.34,9935.17)
(BUSINESS SERV SECRETARY/CLERK,224,1000.00,989.99)
(CENTRAL SUPPORT CLERK,4,23392.33,74785.12)
(CONSTRUCTION MANAGER,8,101025.99,80899.57)
(CROSSING GUARD,100,1074.39,940.85)
(CROSSROADS,17,30390.72,77269.82)
(CROSSROADS ALT SCHOOL TEACHER,16,37425.80,71386.92)
(CUSTODIAL PERSONNEL,1393,10099.11,9915.36)
  • No labels