This is the parent page for (exploratory-oriented) analysis efforts done on the Open Georgia (Transparency in Government) website; http://www.open.georgia.gov/.
As the above screen capture indicates, Open Georgia provides public records regarding how/where money is spent in the State. While there are many avenues to explore on the site, this analysis effort hones in on the salaries/expenses data presented; especially on the local boards of education. Preparing Open Georgia Test Data walks you through the download/preparation process that can be visualized in the Format & Sample Data for Open Georgia.
For this analysis effort, the Hortonworks Sandbox was utilized. Specifically, version 2.0 was utilized, but every effort, but testing was not done, to ensure the code will run on the 1.3 and 2.1 versions as well.
To get started, just load the salaryTravelReport.csv
file that you create (or simply download) from the instructions in Preparing Open Georgia Test Data into HDFS itself. For the examples provided, log into the Sandbox's Hue UI as the user hue
and from the File Browser create an opengeorgia
folder within /user/hue
and then upload the file. You should see something similar to the following once that is done.
Now, we've got some data and are ready to answer the following question as a simple, figurative, example of what kinds of analysis could be done.
For all Open Georgia Salary/Travel data loaded in HDFS that is aligned with Fiscal Year 2010 and Organization Type of Local Boards of Education, produce a distinct list of all Job Titles along with the total number of employees aligned with each Job Title & the minimum/maximum/average salaries for each of the identified Job Titles. |
---|
As for which tools to use, the following list of blog entries (if they are not linked, they are coming soon) presents varying tool options to address questions such as this.
- use mapreduce to calculate salary statistics for georgia educators (first of a three-part series)
- use pig to calculate salary statistics for georgia educators (second of a three-part series)
- use hive to calculate salary statistics for georgia educators (third of a three-part series)
Please ignore everything below this warning – it will soon move!!
We can then just upload the file via Hue to /user/hue/something which has the following format.
add table with format and sample data...
We just want to know answer a simple question. Only for the year 2010 and for employees of local boards of education, what is the total number of employees & min/max/average salary for each job title.
For MapReduce, do ...
For Pig, a script such as the following will generate these statistics.
-- load up the base User Defined Function library (piggybank) and get a handle on the REPLACE function register /user/hue/shared/pig/udfs/piggybank.jar; define REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE(); -- load the salary file and declare its structure inputFile = LOAD '/user/hue/vzSeminar/openGeorgia/salaryTravelReport-SMALL.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() as (name:chararray, title:chararray, salary:chararray, travel:chararray, orgType:chararray, org:chararray, year:int); -- loop thru the input data to clean up the number fields a bit cleanedUpNumbers = foreach inputFile GENERATE name as name, title as title, (float)REPLACE(salary, ',','') as salary, -- take out the commas and cast to a float (float)REPLACE(travel, ',','') as travel, -- take out the commas and cast to a float orgType as orgType, org as org, year as year; -- trim down to just Local Boards of Education onlySchoolBoards = filter cleanedUpNumbers by orgType == 'LBOE'; -- further trim it down to just be for the year in question onlySchoolBoardsFor2010 = filter onlySchoolBoards by year == 2010; -- bucket them up by the job title byTitle = GROUP onlySchoolBoardsFor2010 BY title; -- loop through the titles and for each one... salaryBreakdown = FOREACH byTitle GENERATE group as title, -- we grouped on this above COUNT(onlySchoolBoardsFor2010), -- how many people with this title MIN(onlySchoolBoardsFor2010.salary), -- determine the min MAX(onlySchoolBoardsFor2010.salary), -- determine the max AVG(onlySchoolBoardsFor2010.salary); -- determine the avg -- guarantee the order on the way out sortedSalaryBreakdown = ORDER salaryBreakdown by title; dump sortedSalaryBreakdown; -- save results back to HDFS --STORE sortedSalaryBreakdown into '/user/hue/vzSeminar/openGeorgia/titleBreakdownForSchoolsIn2010.txt';
Sample output. UPDATE IT WITH FULL DATA VALUES!!
(ADAPTED PHYS ED TEACHER,4,24349.45,64689.09,43712.0302734375) (ADULT EDUCATION DIRECTOR/COORD,3,1033.6,74164.01,37078.07238769531) (ADULT EDUCATION TEACHER,9,3002.0,52424.72,16903.59646267361) (ASSISTANT PRINCIPAL,38,6243.25,100989.04,77926.6794819079) (ATTENDANCE WORKER,1,7742.09,7742.09,7742.08984375) (AUDITOR,1,83317.92,83317.92,83317.921875) (BOOKKEEPER,11,10650.44,33484.04,26734.2587890625) (BUS DRIVER,49,767.68,50915.81,20278.801754075655) (BUSINESS SERV SECRETARY/CLERK,26,789.99,72080.0,26443.907658503605) (CENTRAL SUPPORT CLERK,1,23392.33,23392.33,23392.330078125) (CONSTRUCTION MANAGER,1,33890.67,33890.67,33890.671875) (CROSSING GUARD,13,528.09,5384.09,3670.4876755934497) (CUSTODIAL PERSONNEL,132,3803.52,46626.88,26624.915542140152) (DEPUTY/ASSOC/ASSISTANT SUPT,2,108993.76,161051.34,135022.55078125) (DIAGNOSTICIAN,4,59711.15,69417.72,65523.5888671875) (DIRECTOR OF CURRICULUM/INSTR,6,31162.52,188324.23,98002.13346354167)
With Hive, tons easier still!!
SELECT title, count(title) as count, MIN(salary) as min, MAX(salary) as max, AVG(salary) as avg FROM testing1 where orgType = 'LBOE' and year = 2010 group by title;
0 Comments