WIP
Work In Progress – Please pardon my mess!
This is the parent page for (exploratory-oriented) analysis efforts done on the Open Georgia (Transparency in Government) website; http://www.open.georgia.gov/.
As the above screen capture indicates, Open Georgia provides public records regarding how/where money is spent in the State. While there are many avenues to explore on the site, this analysis effort hones in on the salaries/expenses data presented; especially on the local boards of education. Preparing Open Georgia Test Data walks you through the download/preparation process that can be visualized in the Format & Sample Data for Open Georgia.
For this analysis effort, the Hortonworks Sandbox was utilized. Specifically, version 2.0 was utilized, but every effort, but testing was not done, to ensure the code will run on the 1.3 and 2.1 versions as well.
We can then just upload the file via Hue to /user/hue/something which has the following format.
add table with format and sample data...
We just want to know answer a simple question. Only for the year 2010 and for employees of local boards of education, what is the total number of employees & min/max/average salary for each job title.
For MapReduce, do ...
For Pig, a script such as the following will generate these statistics.
-- load up the base User Defined Function library (piggybank) and get a handle on the REPLACE function register /user/hue/shared/pig/udfs/piggybank.jar; define REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE(); -- load the salary file and declare its structure inputFile = LOAD '/user/hue/vzSeminar/openGeorgia/salaryTravelReport-SMALL.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() as (name:chararray, title:chararray, salary:chararray, travel:chararray, orgType:chararray, org:chararray, year:int); -- loop thru the input data to clean up the number fields a bit cleanedUpNumbers = foreach inputFile GENERATE name as name, title as title, (float)REPLACE(salary, ',','') as salary, -- take out the commas and cast to a float (float)REPLACE(travel, ',','') as travel, -- take out the commas and cast to a float orgType as orgType, org as org, year as year; -- trim down to just Local Boards of Education onlySchoolBoards = filter cleanedUpNumbers by orgType == 'LBOE'; -- further trim it down to just be for the year in question onlySchoolBoardsFor2010 = filter onlySchoolBoards by year == 2010; -- bucket them up by the job title byTitle = GROUP onlySchoolBoardsFor2010 BY title; -- loop through the titles and for each one... salaryBreakdown = FOREACH byTitle GENERATE group as title, -- we grouped on this above COUNT(onlySchoolBoardsFor2010), -- how many people with this title MIN(onlySchoolBoardsFor2010.salary), -- determine the min MAX(onlySchoolBoardsFor2010.salary), -- determine the max AVG(onlySchoolBoardsFor2010.salary); -- determine the avg -- guarantee the order on the way out sortedSalaryBreakdown = ORDER salaryBreakdown by title; dump sortedSalaryBreakdown; -- save results back to HDFS --STORE sortedSalaryBreakdown into '/user/hue/vzSeminar/openGeorgia/titleBreakdownForSchoolsIn2010.txt';
Sample output. UPDATE IT WITH FULL DATA VALUES!!
(ADAPTED PHYS ED TEACHER,4,24349.45,64689.09,43712.0302734375) (ADULT EDUCATION DIRECTOR/COORD,3,1033.6,74164.01,37078.07238769531) (ADULT EDUCATION TEACHER,9,3002.0,52424.72,16903.59646267361) (ASSISTANT PRINCIPAL,38,6243.25,100989.04,77926.6794819079) (ATTENDANCE WORKER,1,7742.09,7742.09,7742.08984375) (AUDITOR,1,83317.92,83317.92,83317.921875) (BOOKKEEPER,11,10650.44,33484.04,26734.2587890625) (BUS DRIVER,49,767.68,50915.81,20278.801754075655) (BUSINESS SERV SECRETARY/CLERK,26,789.99,72080.0,26443.907658503605) (CENTRAL SUPPORT CLERK,1,23392.33,23392.33,23392.330078125) (CONSTRUCTION MANAGER,1,33890.67,33890.67,33890.671875) (CROSSING GUARD,13,528.09,5384.09,3670.4876755934497) (CUSTODIAL PERSONNEL,132,3803.52,46626.88,26624.915542140152) (DEPUTY/ASSOC/ASSISTANT SUPT,2,108993.76,161051.34,135022.55078125) (DIAGNOSTICIAN,4,59711.15,69417.72,65523.5888671875) (DIRECTOR OF CURRICULUM/INSTR,6,31162.52,188324.23,98002.13346354167)
With Hive, tons easier still!!
SELECT title, count(title) as count, MIN(salary) as min, MAX(salary) as max, AVG(salary) as avg FROM testing1 where orgType = 'LBOE' and year = 2010 group by title;
Add Comment