...

I decided to clean up the dataset so I could focus on the Simple Open Georgia Use Case, so I ran the cleanup code from Use Pig to Calculate Salary Statistics for Georgia Educators (second of a three-part series) to prepare a transformed version of the file, which you can find a download link to at the bottom of Preparing Open Georgia Test Data. This will make the Spark code much simpler and more germane to the analysis code I want to present.

...

To figure out how many people are in each job title, we need to treat this a bit like the canonical Word Count problem. We create KVPs (key-value pairs) with the titles as the keys and a hard-coded 1 as the value.
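As a quick illustration of that Word Count pattern, here is a plain-Python stand-in (the real thing runs in the Spark shell; the equivalent PySpark call is in the comments, and the variable names are just illustrative):

```python
# Word Count pattern: emit (title, 1) pairs, then sum the 1s per key.
# Plain-Python stand-in for the PySpark version, roughly:
#   titleCounts = filteredRecs.map(lambda eR: (eR[1], 1)).reduceByKey(add)
records = [
    ["LBOE", "GRADES 9-12 TEACHER", "52122.10"],
    ["LBOE", "GRADE 4 TEACHER", "56567.24"],
    ["LBOE", "GRADES 9-12 TEACHER", "48000.00"],
]

# map step: produce (title, 1) key-value pairs
kvps = [(rec[1], 1) for rec in records]

# reduceByKey step: sum the 1s for each distinct title
counts = {}
for title, one in kvps:
    counts[title] = counts.get(title, 0) + one
```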

...

Minimum Salary by Title

Create a PairRDD of title and salary.

Code Block
breakoutModewide
languagepy
>>> salaryKVPs = filteredRecs.map(lambda eR: (eR[1], float(eR[2])))
>>> salaryKVPs.take(2)
[(u'GRADES 9-12 TEACHER', 52122.1), (u'GRADE 4 TEACHER', 56567.24)]

...

Fortunately, Python has us covered already with the built-in min() function, which we can leverage instead of that silly anonymous function.

Code Block
breakoutModewide
languagepy
>>> minSals = salaryKVPs.reduceByKey(min)
>>> minSals.take(2)
[(u'SPEECH-LANGUAGE PATHOLOGIST', 2917.23), (u'RVI TEACHER', 27918.14)]

...

Yes, that does look “interesting”… It took me a bit to figure it all out, as the API docs for this function were a bit tough for me to grasp. Thankfully, I found https://stackoverflow.com/questions/29930110/calculating-the-averages-for-each-key-in-a-pairwise-k-v-rdd-in-spark-with-pyth which explains what the a’s and b’s above all mean. I’ll let you read that Q&A as I did, and we can discuss it in the comments section of this post if there are still questions.
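For reference, here is the gist of what that answer describes, sketched in plain Python since this page's snippets run in the Spark shell. The aggregateByKey call in the comment mirrors the a's and b's from the Q&A; the variable and data values below are only for illustration:

```python
# Build (total salary, count) per title. The PySpark version is roughly:
#   totalSalaryAndCountByTitle = salaryKVPs.aggregateByKey(
#       (0.0, 0),                                 # zero value: (sum, count)
#       lambda a, b: (a[0] + b, a[1] + 1),        # fold one salary b into accumulator a
#       lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge accumulators from two partitions
# Same logic in plain Python:
pairs = [("AUDIOLOGIST", 36329.59), ("AUDIOLOGIST", 102240.46)]
totals = {}
for title, salary in pairs:
    s, c = totals.get(title, (0.0, 0))
    totals[title] = (s + salary, c + 1)  # running (sum, count) per key
```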

Then we need to calculate the average from the total salary values and the count of salaries. This is also described in that last link, but it is really just total salaries divided by the total number of educators with that title.

Code Block
breakoutModewide
languagepy
>>> avgSals = totalSalaryAndCountByTitle.mapValues(lambda v: v[0]/v[1])
>>> avgSals.take(10)
[(u'SPEECH-LANGUAGE PATHOLOGIST', 31909.640552050496), (u'RVI TEACHER', 66967.09206896552), (u'SPECIAL EDUCATION BUS AIDE', 13871.381803278691), (u'TECHNOLOGY SPECIALIST', 54501.812348993306), (u'SCHOOL IMPROVEMENT SPECIALIST', 79359.73837988825), (u'SPECIAL EDUCATION SECRETARY/CLERK', 32192.98714285714), (u'TEACHER OF EMOTIONAL/BEHAVIORAL', 53380.03302631576), (u'OTHER INSTRUCTIONAL PROVIDER', 18972.18256097561), (u'BUS DRIVER', 21016.957100681313), (u'HOSPITAL/HOMEBOUND INSTRUCTOR', 41357.86272727273)]

...

Sort it and we can see an average salary of $73,038.71 for Audiologists, matching what the other frameworks calculated.

Code Block
breakoutModewide
languagepy
>>> avgSals.sortByKey().take(10)
[(u'ADAPTED PHYS ED TEACHER', 56632.125714285714), (u'ADULT EDUCATION DIRECTOR/COORD', 39572.89157894737), (u'ADULT EDUCATION TEACHER', 19230.814603174604), (u'AFTER-SCHOOL PROGRAM WORKER', 39559.009999999995), (u'ALTERNATIVE SCHOOL DIRECTOR', 119174.45999999999), (u'ASSISTANT PRINCIPAL', 76514.0633944955), (u'ATHLETICS DIRECTOR', 122789.04), (u'ATTENDANCE WORKER', 12826.486666666666), (u'AUDIOLOGIST', 73038.71333333333), (u'AUDITOR', 66145.27090909092)]

...

When I was first learning Spark, I remember the RDD API being described as an “elegant” API. I’m not sure if that’s what I called it back then (or now). (wink) The good news is that for structured datasets like the Format & Sample Data for Open Georgia, we have Spark SQL available for our use. More great news: as with RDDs, the Spark website has a good Programming Guide for Spark SQL that you can use as a place to start.

Getting a DataFrame

There are multiple ways to create a DataFrame, but I’ll use a common PySpark practice: inferring the schema via reflection to promote an RDD to a DataFrame. To get started, let’s read the original file again and convert the TSV records into arrays.
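To make the reflection idea concrete before we touch Spark: giving each record named fields is what lets createDataFrame infer column names and types. Here is a hedged plain-Python sketch with the PySpark equivalent in the comments; the field names and column positions are assumptions about the cleaned file:

```python
# Reflection-based schema inference: named fields on each record become
# DataFrame columns. In PySpark this would be roughly:
#   from pyspark.sql import Row
#   rows = arrays.map(lambda a: Row(orgType=a[0], title=a[1],
#                                   salary=float(a[2]), year=int(a[3])))
#   teachers = spark.createDataFrame(rows)
# A namedtuple plays the Row role in this plain-Python stand-in:
from collections import namedtuple

Teacher = namedtuple("Teacher", ["orgType", "title", "salary", "year"])

line = "LBOE\tAUDIOLOGIST\t73038.71\t2010"  # illustrative TSV record
a = line.split("\t")
rec = Teacher(orgType=a[0], title=a[1], salary=float(a[2]), year=int(a[3]))
```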

...

Like before, we can “put it all together” on one line with method chaining, so I won’t bore you with that.

Calculate Statistics

...

via SQL

While the API looks fun for us programmers, we probably want to leave behind code that MANY people can read. You know… so we can get promoted to a better job and not be held back by being the only person who can understand our code! (wink)

...

Code Block
breakoutModewide
languagepy
>>> teachers.createOrReplaceTempView("teachers")
>>> spark.sql("SELECT title, count(title), min(salary), max(salary), avg(salary) FROM teachers WHERE orgType = 'LBOE' AND year = 2010 GROUP BY title ORDER BY title").show(10)
+--------------------+------------+-----------+-----------+------------------+  
|               title|count(title)|min(salary)|max(salary)|       avg(salary)|
+--------------------+------------+-----------+-----------+------------------+
|ADAPTED PHYS ED T...|          35|   19384.24|   96320.19|56632.125714285714|
|ADULT EDUCATION D...|          19|      182.4|  179041.16| 39572.89157894737|
|ADULT EDUCATION T...|          63|      775.2|   60668.84|19230.814603174604|
|AFTER-SCHOOL PROG...|           2|     624.04|   78493.98|39559.009999999995|
|ALTERNATIVE SCHOO...|           2|   111199.8|  127149.12|119174.45999999999|
| ASSISTANT PRINCIPAL|         436|    3418.53|  119646.31|  76514.0633944955|
|  ATHLETICS DIRECTOR|           1|  122789.04|  122789.04|         122789.04|
|   ATTENDANCE WORKER|          12|    7553.42|    23392.9|12826.486666666666|
|         AUDIOLOGIST|           9|   36329.59|  102240.46| 73038.71333333333|
|             AUDITOR|          11|    5380.63|   83811.88| 66145.27090909092|
+--------------------+------------+-----------+-----------+------------------+

...