As the title suggests, this posting was something I came up with AFTER I published the first three installments of my Open Georgia Analysis way back in 2014. And yes, you might have also noticed I took a long break from blogging about Big Data technologies in 2018 and I’m hoping to change that for 2019. On the other hand, my personal blog had a LOT of fun entries due to a TON of international travel in 2018.
...
Code Block | ||||
---|---|---|---|---|
| ||||
>>> filtered = teachers.filter(teachers['orgType'] == 'LBOE').filter(teachers['year'] == 2010) >>> filtered.show(2) +---------------+--------------------+-------+--------+-------------------+------+----+ | name| org|orgType| salary| title|travel|year| +---------------+--------------------+-------+--------+-------------------+------+----+ |ABBOTT,DEEDEE W|ATLANTA INDEPENDE...| LBOE| 52122.1|GRADES 9-12 TEACHER| 0.0|2010| | ABBOTT,RYAN V|ATLANTA INDEPENDE...| LBOE|56567.24| GRADE 4 TEACHER| 0.0|2010| +---------------+--------------------+-------+--------+-------------------+------+----+ |
kdfjkdsfjkdfsjNow we can just do the grouping and apply some aggregate functions.
Code Block | ||||
---|---|---|---|---|
| ||||
>>> from pyspark.sql.functions import min, max, avg, count, col >>> expr = [count(col("title")),min(col("salary")),max(col("salary")),avg(col("salary"))] >>> filtered.groupBy("title").agg(*expr).sort("title").show(10) +--------------------+------------+-----------+-----------+------------------+ | title|count(title)|min(salary)|max(salary)| avg(salary)| +--------------------+------------+-----------+-----------+------------------+ |ADAPTED PHYS ED T...| 19384.24| 96320.19 35| 19384.24| 96320.19|56632.125714285714| |ADULT EDUCATION D...| 19| 182.4| 179041.16| 39572.89157894737| |ADULT EDUCATION T...| 63| 775.2| 60668.84|19230.814603174604| |AFTER-SCHOOL PROG...| 2| 624.04| 78493.98|39559.009999999995| |ALTERNATIVE SCHOO...| 2| 111199.8| 127149.12|119174.45999999999| | ASSISTANT PRINCIPAL| 436| 3418.53| 119646.31| 76514.0633944955| | ATHLETICS DIRECTOR| 1| 122789.04| 122789.04| 122789.04| | ATTENDANCE WORKER| 12| 7553.42| 23392.9|12826.486666666666| | AUDIOLOGIST| 9| 36329.59| 102240.46| 73038.71333333333| | AUDITOR| 11| 5380.63| 83811.88| 66145.27090909092| +--------------------+------------+-----------+-----------+------------------+ |
Like before, we can “put it all together” on one line with method chaining, so won’t bore you with that.
Calculate Statistics with SQL
While the API looks fun for us programmers, we probably want to leave code behind that MANY people can read. You know, so we can get promoted to a better job and not be held back by being the only person who can understand our code!
So, instead… let’s just write some SQL.
Code Block | ||||
---|---|---|---|---|
| ||||
>>> teachers.createOrReplaceTempView("teachers") >>> spark.sql("SELECT title, count(title), min(salary), max(salary), avg(salary) FROM teachers WHERE orgType = 'LBOE' AND year = 2010 GROUP BY title ORDER BY title").show(10) +--------------------+------------+-----------+-----------+------------------+ | title|count(title)|min(salary)|max(salary)| avg(salary)| +--------------------+------------+-----------+-----------+------------------+ |ADAPTED PHYS ED T...| 35| 19384.24| 96320.19|56632.125714285714| |ADULT EDUCATION D...| 19| 182.4| 179041.16| 39572.89157894737| |ADULT EDUCATION T...| 63| 775.2| 60668.84|19230.814603174604| |AFTER-SCHOOL PROG...| 2| 624.04| 78493.98|39559.009999999995| |ALTERNATIVE SCHOO...| 2| 111199.8| 127149.12|119174.45999999999| | ASSISTANT PRINCIPAL| 436| 3418.53| 119646.31| 76514.0633944955| | ATHLETICS DIRECTOR| 1| 122789.04| 122789.04| 122789.04| | ATTENDANCE WORKER| 12| 7553.42| 23392.9|12826.486666666666| | AUDIOLOGIST| 9| 36329.59| 102240.46| 73038.71333333333| | AUDITOR| 11| 5380.63| 83811.88| 66145.27090909092| +--------------------+------------+-----------+-----------+------------------+ |
WORK IN PROGRESSNow that wasn’t so bad, was it?