Preparing Open Georgia Test Data
If the intention was to build an automated "scraping" application to retrieve all the various information available on Open Georgia, then a rigorous, production-ready, ELT (not ETL) solution would need to be developed, tested, and deployed to ensure all data was presented and that it met a high degree of data quality. For the Open Georgia Analysis effort, we just want to focus in on the salary data presented for teachers and staff across several local boards of education.
Drilling down into the Salaries & Travel Reimbursements area of the site, much information is quickly at your fingertips. For this analysis effort, only the following Organization data was retrieved.
Fiscal Year | Organization Type | Organization | Records |
---|---|---|---|
2010 | Local Boards of Education | ATLANTA INDEPENDENT SCHOOL SYSTEM | 9,201 |
2010 | Local Boards of Education | COBB COUNTY SCHOOL DISTRICT | 20,377 |
2010 | State Agencies, Boards, Authorities and Commissions | PUBLIC SAFETY, DEPARTMENT OF | 1,914 |
2010 | Local Boards of Education | FULTON COUNTY BOARD OF EDUCATION | 15,408 |
2011 | Local Boards of Education | FULTON COUNTY BOARD OF EDUCATION | 15,200 |
2012 | Local Boards of Education | FULTON COUNTY BOARD OF EDUCATION | 14,843 |
Quite obviously these 76,943 (very well-formed) records do not meet Gartner's 3Vs definition of "big data", but they do allow us to exercise the Hadoop & Big Data tooling to perform some level of data analysis. I also did the following little bit of manual pre-processing (that would normally happen in the ELT space) prior to concatenating these six files together.
- I stripped out the "
$
" characters just to help make the later conversion to floating point values go a bit easier. - I added an "organization type" field just before the Organization name itself. For that, I used abbreviations shown in the following CSV file that I also loaded into HDFS.
SABAC|State Agencies, Boards, Authorities and Commissions UUSGMC|Units of the University System and Georgia Military College RESA|Regional Educational Service Agencies TC|Technical Colleges LBOE|Local Boards of Education
Please see Format & Sample Data for Open Georgia for a visual representation of what the data looked like at the end of this preparation phase. I ended up with a (tiny by Hadoop standards for sure) 7.6 MB file I called salaryTravelReport.csv that shrunk down to 1.2 MB when zipped. If you don't want to perform these, or similar, steps to get this data (and possibly much more) then feel free to download salaryTravelReport.csv.zip, unwind it, and then store it on HDFS yourself.
Clean Dataset
I decided for further analysis of this file, such as done in use spark to calculate salary statistics for georgia educators (the fourth book of the trilogy), I would use the code from use pig to calculate salary statistics for georgia educators (second of a three-part series) that actually cleans up the dataset's comma problems. I saved the results of that which is the 76,943 tab-delimited records that are zipped up into cleanSalaryTravelReport.tsv.zip (1.2 MB).