Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

As the title suggests, this posting was something I came up with AFTER I published the first three installments of my Open Georgia Analysis way back in 2014. And yes, you might have also noticed I took a long break from blogging about Big Data technologies in 2018 and I’m hoping to change that for 2019. On the other hand, my personal blog had a LOT of fun entries due to a TON of international travel in 2018.

...

Code Block
breakoutModewide
languagepy
>>> filtered = teachers.filter(teachers['orgType'] == 'LBOE').filter(teachers['year'] == 2010)
>>> filtered.show(2)
+---------------+--------------------+-------+--------+-------------------+------+----+
|           name|                 org|orgType|  salary|              title|travel|year|
+---------------+--------------------+-------+--------+-------------------+------+----+
|ABBOTT,DEEDEE W|ATLANTA INDEPENDE...|   LBOE| 52122.1|GRADES 9-12 TEACHER|   0.0|2010|
|  ABBOTT,RYAN V|ATLANTA INDEPENDE...|   LBOE|56567.24|    GRADE 4 TEACHER|   0.0|2010|
+---------------+--------------------+-------+--------+-------------------+------+----+

kdfjkdsfjkdfsjNow we can just do the grouping and apply some aggregate functions.

Code Block
breakoutModewide
languagepy
>>> from pyspark.sql.functions import min, max, avg, count, col
>>> expr = [count(col("title")),min(col("salary")),max(col("salary")),avg(col("salary"))]
>>> filtered.groupBy("title").agg(*expr).sort("title").show(10)
+--------------------+------------+-----------+-----------+------------------+               
|               title|count(title)|min(salary)|max(salary)|       avg(salary)|
+--------------------+------------+-----------+-----------+------------------+
|ADAPTED PHYS ED T...|      19384.24|   96320.19 35|   19384.24|   96320.19|56632.125714285714|
|ADULT EDUCATION D...|          19|      182.4|  179041.16| 39572.89157894737|
|ADULT EDUCATION T...|          63|      775.2|   60668.84|19230.814603174604|
|AFTER-SCHOOL PROG...|           2|     624.04|   78493.98|39559.009999999995|
|ALTERNATIVE SCHOO...|           2|   111199.8|  127149.12|119174.45999999999|
| ASSISTANT PRINCIPAL|         436|    3418.53|  119646.31|  76514.0633944955|
|  ATHLETICS DIRECTOR|           1|  122789.04|  122789.04|         122789.04|
|   ATTENDANCE WORKER|          12|    7553.42|    23392.9|12826.486666666666|
|         AUDIOLOGIST|           9|   36329.59|  102240.46| 73038.71333333333|
|             AUDITOR|          11|    5380.63|   83811.88| 66145.27090909092|
+--------------------+------------+-----------+-----------+------------------+

Like before, we can “put it all together” on one line with method chaining, so won’t bore you with that.

Calculate Statistics with SQL

While the API looks fun for us programmers, we probably want to leave code behind that MANY people can read. You know, so we can get promoted to a better job and not be held back by being the only person who can understand our code! (wink)

So, instead… let’s just write some SQL.

Code Block
breakoutModewide
languagesql
>>> teachers.createOrReplaceTempView("teachers")
>>> spark.sql("SELECT title, count(title), min(salary), max(salary), avg(salary) FROM teachers WHERE orgType = 'LBOE' AND year = 2010 GROUP BY title ORDER BY title").show(10)
+--------------------+------------+-----------+-----------+------------------+  
|               title|count(title)|min(salary)|max(salary)|       avg(salary)|
+--------------------+------------+-----------+-----------+------------------+
|ADAPTED PHYS ED T...|          35|   19384.24|   96320.19|56632.125714285714|
|ADULT EDUCATION D...|          19|      182.4|  179041.16| 39572.89157894737|
|ADULT EDUCATION T...|          63|      775.2|   60668.84|19230.814603174604|
|AFTER-SCHOOL PROG...|           2|     624.04|   78493.98|39559.009999999995|
|ALTERNATIVE SCHOO...|           2|   111199.8|  127149.12|119174.45999999999|
| ASSISTANT PRINCIPAL|         436|    3418.53|  119646.31|  76514.0633944955|
|  ATHLETICS DIRECTOR|           1|  122789.04|  122789.04|         122789.04|
|   ATTENDANCE WORKER|          12|    7553.42|    23392.9|12826.486666666666|
|         AUDIOLOGIST|           9|   36329.59|  102240.46| 73038.71333333333|
|             AUDITOR|          11|    5380.63|   83811.88| 66145.27090909092|
+--------------------+------------+-----------+-----------+------------------+

WORK IN PROGRESSNow that wasn’t so bad, was it?