use spark to calculate salary statistics for georgia educators (the fourth book of the trilogy)

As the title suggests, this posting was something I came up with AFTER I published the first three installments of my Open Georgia Analysis way back in 2014. And yes, you might have also noticed I took a long break from blogging about Big Data technologies in 2018 and I’m hoping to change that for 2019. On the other hand, my personal blog had a LOT of fun entries due to a TON of international travel in 2018.


Dataset & Use Case

I decided to clean up the dataset used and focus on just the Simple Open Georgia Use Case, so I ran the cleanup code from use pig to calculate salary statistics for georgia educators (second of a three-part series) to prepare a transformed version of the file; you can find a download link to it at the bottom of Preparing Open Georgia Test Data. This will make the Spark code much simpler and more germane to the analysis code I want to present.

On an HDP 3.1.0 cluster I have available, I loaded this file as shown below.

[dev1@ip-172-30-10-1 ~]$ hdfs dfs -ls
Found 1 items
-rw-r--r--   3 dev1 hdfs    7133542 2019-03-09 22:33 cleanSalaryTravelReport.tsv
[dev1@ip-172-30-10-1 ~]$ hdfs dfs -tail cleanSalaryTravelReport.tsv
ZUCKER,STACEY E	PARAPROFESSIONAL/TEACHER AIDE	23387.87	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZURAS,LINDA D	SPECIAL ED PARAPRO/AIDE	29046.0	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZVONAR,JESSICA L	GIFTED	41672.9	44.37	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
ZWEIGEL,RENEE E	SCHOOL SECRETARY/CLERK	42681.23	0.0	LBOE	FULTON COUNTY BOARD OF EDUCATION	2012
[dev1@ip-172-30-10-1 ~]$

RDD Implementation

The cluster I am using has Spark 2.3.2 available, as shown when launching the pyspark shell I will be using for my Resilient Distributed Dataset (RDD) API example.

[dev1@ip-172-30-10-1 ~]$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2.3.1.0.0-78
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>>

NOTE: The RDD Programming Guide is a great reference.

Load and Filter

Load up the HDFS file and verify it looks good.

>>> inputRDD = sc.textFile("/user/dev1/cleanSalaryTravelReport.tsv")
>>> inputRDD.take(2)
[u'ABBOTT,DEEDEE W\tGRADES 9-12 TEACHER\t52122.1\t0.0\tLBOE\tATLANTA INDEPENDENT SCHOOL SYSTEM\t2010', u'ABBOTT,RYAN V\tGRADE 4 TEACHER\t56567.24\t0.0\tLBOE\tATLANTA INDEPENDENT SCHOOL SYSTEM\t2010']
>>> inputRDD.count()
76943

Tokenize the tab-separated string to create an array.
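Here is a minimal sketch of that step (splitRDD is my own variable name, as are the ones that follow):

>>> splitRDD = inputRDD.map(lambda line: line.split("\t"))
>>> splitRDD.first()
[u'ABBOTT,DEEDEE W', u'GRADES 9-12 TEACHER', u'52122.1', u'0.0', u'LBOE', u'ATLANTA INDEPENDENT SCHOOL SYSTEM', u'2010']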

Trim down to just the Local Boards of Education.
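The organization type lives in the fifth field, so a filter like this should get us down to the LBOE records:

>>> lboeRDD = splitRDD.filter(lambda fields: fields[4] == 'LBOE')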

We just want the 2010 records.
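The year is the last field; note it is still a string at this point, so I compare against '2010':

>>> lboe2010RDD = lboeRDD.filter(lambda fields: fields[6] == '2010')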

Put It All Together

We showed this with a bunch of variables, but normally we would just chain all of these together into one long statement. Additionally, I’m chaining a method to cache these results since we will use them a few times below.
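A chained version might look like the following; the cache() at the end keeps the filtered records in memory for the repeated passes we are about to make:

>>> educatorsRDD = sc.textFile("/user/dev1/cleanSalaryTravelReport.tsv") \
...     .map(lambda line: line.split("\t")) \
...     .filter(lambda fields: fields[4] == 'LBOE') \
...     .filter(lambda fields: fields[6] == '2010') \
...     .cache()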

Number of Employees by Title

To figure out how many people are in each job title we need to treat this a bit like the canonical Word Count problem. We create KVPs of the titles as the keys and hard-code a 1 as the value.
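In code, that could look like this; the job title is the second field, and reduceByKey() sums up all those hard-coded 1's:

>>> titleOnesRDD = educatorsRDD.map(lambda fields: (fields[1], 1))
>>> titleCountsRDD = titleOnesRDD.reduceByKey(lambda a, b: a + b)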

Then we can loop through them to present the totals for each of the 181 distinct job titles.
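A quick sanity check on the number of distinct titles, followed by the loop (we are on Python 2.7, so I use %-formatting that works either way):

>>> titleCountsRDD.count()
181
>>> for (title, count) in titleCountsRDD.collect():
...     print("%s %d" % (title, count))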

Sort and show all results. As with the other analysis efforts, we see there are 9 Audiologists.
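Adding sortByKey() before the collect() gets us an alphabetized report, with the AUDIOLOGIST line showing the 9 mentioned above:

>>> for (title, count) in titleCountsRDD.sortByKey().collect():
...     print("%s %d" % (title, count))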

Put It All Together

This can more simply be expressed by chaining these all together.
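Roughly like this, assuming the cached educatorsRDD from earlier:

>>> educatorsRDD.map(lambda fields: (fields[1], 1)) \
...     .reduceByKey(lambda a, b: a + b) \
...     .sortByKey() \
...     .collect()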

Minimum Salary by Title

Create a PairRDD of title and salary.
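The salary is the third field and needs to be cast to a float:

>>> titleSalariesRDD = educatorsRDD.map(lambda fields: (fields[1], float(fields[2])))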

Incur the shuffle to reduce them by key and figure out the smallest number.
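With an anonymous function, that reduction looks something like this:

>>> minSalariesRDD = titleSalariesRDD.reduceByKey(lambda a, b: a if a < b else b)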

Fortunately, Python already has us covered with the built-in min() function, which we can leverage instead of that silly anonymous function.
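Since min() already takes two arguments and returns the smaller, we can pass it straight in:

>>> minSalariesRDD = titleSalariesRDD.reduceByKey(min)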

Sort and show all results. As with the other analysis efforts, we see the minimum salary for Audiologists is $36,329.59.
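Same sort-and-print pattern as before, with AUDIOLOGIST coming out at 36329.59:

>>> for (title, salary) in minSalariesRDD.sortByKey().collect():
...     print("%s %.2f" % (title, salary))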

Put It All Together
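Here is one way the chained form could look (again, a sketch using my variable names):

>>> educatorsRDD.map(lambda fields: (fields[1], float(fields[2]))) \
...     .reduceByKey(min) \
...     .sortByKey() \
...     .collect()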

Maximum Salary by Title

We already did most of this above, so here it is “all together,” switching min for max. We can see that, like before, the max salary for Audiologists is $102,240.46.
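A sketch of the max version:

>>> educatorsRDD.map(lambda fields: (fields[1], float(fields[2]))) \
...     .reduceByKey(max) \
...     .sortByKey() \
...     .collect()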

Average Salary by Title

This one is a bit more “interesting”… First, we need to roll up the total salary values for each title along with the count for each.
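This is where aggregateByKey() earns its “interesting” label. The zero value is a (sum, count) tuple, the first lambda folds a single salary into a partition-local tuple, and the second merges tuples across partitions (a sketch based on the approach in the Stack Overflow answer linked below):

>>> sumCountsRDD = titleSalariesRDD.aggregateByKey((0.0, 0),
...     lambda a, b: (a[0] + b, a[1] + 1),
...     lambda a, b: (a[0] + b[0], a[1] + b[1]))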

Yes, that does look “interesting”… It took me a bit to figure it all out, as the API docs for this function were a bit tough for me to grasp. Thankfully, I found https://stackoverflow.com/questions/29930110/calculating-the-averages-for-each-key-in-a-pairwise-k-v-rdd-in-spark-with-pyth which explains what the a’s and b’s above all mean. I’ll let you read that Q&A as I did, and we can discuss it in the comments section of this post if there are still questions.

Then, we need to calculate the average from the total salary values and the count of salaries. This is also described in that last link, but is really just total salaries divided by total number of educators with that title.
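mapValues() keeps the title keys intact while doing the division:

>>> avgSalariesRDD = sumCountsRDD.mapValues(lambda v: v[0] / v[1])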

The answer at https://stackoverflow.com/questions/46171294/how-can-i-count-the-average-from-spark-rdd shows how this all could look with Scala.

Sort it and we can see an average salary of $73,038.71 for Audiologists like the other frameworks also calculated.
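The familiar pattern once more, with AUDIOLOGIST averaging out to 73038.71:

>>> for (title, avg) in avgSalariesRDD.sortByKey().collect():
...     print("%s %.2f" % (title, avg))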

Put It All Together

Like before, feel free to pull it all together into one line.
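For completeness, one possible chained form:

>>> educatorsRDD.map(lambda fields: (fields[1], float(fields[2]))) \
...     .aggregateByKey((0.0, 0),
...         lambda a, b: (a[0] + b, a[1] + 1),
...         lambda a, b: (a[0] + b[0], a[1] + b[1])) \
...     .mapValues(lambda v: v[0] / v[1]) \
...     .sortByKey() \
...     .collect()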

Spark SQL Implementation

When I was first learning Spark, I remember the RDD API being described as an “elegant” API. I’m not sure that’s the word I used back then (or would use now). The good news is that for structured datasets like the Format & Sample Data for Open Georgia, we have Spark SQL available for our use. More great news: as with RDDs, the Spark website has a good Programming Guide for Spark SQL that you can use as a place to start.

Getting a DataFrame

There are multiple ways to create a DataFrame, but I’ll use a common practice with pyspark which infers the schema using reflection to promote an RDD to a DataFrame. To get started, let’s read the original file again and convert the TSV records into arrays.
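Something like this, assuming the same file path as before:

>>> rawRDD = sc.textFile("/user/dev1/cleanSalaryTravelReport.tsv") \
...     .map(lambda line: line.split("\t"))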

From here we can create an RDD full of Row objects and then have the Spark SQL API convert it up to a DataFrame.
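A sketch of that promotion; the keyword names I give the Row fields are my own and become the DataFrame column names (note that in Spark 2.x the reflected columns come out in alphabetical order):

>>> from pyspark.sql import Row
>>> rowRDD = rawRDD.map(lambda f: Row(name=f[0], title=f[1],
...     salary=float(f[2]), travel=float(f[3]),
...     orgType=f[4], org=f[5], year=int(f[6])))
>>> salariesDF = spark.createDataFrame(rowRDD)
>>> salariesDF.show(2)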

Now that looks a bit better than the output of the RDD’s take() method!!

Calculate Statistics via API

Trim down to just educator records for 2010.
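Using the column names from my Row sketch above:

>>> educatorsDF = salariesDF.filter((salariesDF.orgType == 'LBOE') &
...     (salariesDF.year == 2010))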

Now we can just do the grouping and apply some aggregate functions.
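A sketch with the built-in aggregate functions (count, min, max, avg):

>>> from pyspark.sql import functions as F
>>> statsDF = educatorsDF.groupBy('title').agg(
...     F.count('salary').alias('cnt'),
...     F.min('salary').alias('min_salary'),
...     F.max('salary').alias('max_salary'),
...     F.avg('salary').alias('avg_salary'))
>>> statsDF.orderBy('title').show()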

Like before, we can “put it all together” on one line with method chaining, so I won’t bore you with that.

Calculate Statistics via SQL

While the API looks fun for us programmers, we probably want to leave code behind that MANY people can read. You know… so we can get promoted to a better job and not be held back by being the only person who can understand our code!

So, instead… let’s just write some SQL.
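Register the DataFrame as a temporary view and query it; the view and alias names below are my own:

>>> salariesDF.createOrReplaceTempView('salaries')
>>> spark.sql("""
...     SELECT title, COUNT(salary) AS cnt,
...            MIN(salary) AS min_salary,
...            MAX(salary) AS max_salary,
...            AVG(salary) AS avg_salary
...       FROM salaries
...      WHERE orgType = 'LBOE' AND year = 2010
...      GROUP BY title
...      ORDER BY title
... """).show()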

Now that wasn’t so bad, was it?