Spark Cheat Sheet

Spark test wordcount:
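A classic word-count smoke test, runnable in spark-shell (the input path is a placeholder; point it at any text file):

```scala
// spark-shell smoke test: count word occurrences in a text file.
// "input.txt" is a placeholder path.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Show the ten most frequent words.
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```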

Dynamic Resource Allocation;
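The usual knobs, as a spark-defaults.conf sketch (values are illustrative; the external shuffle service is the classic prerequisite on pre-3.0 clusters):

```
# spark-defaults.conf -- illustrative values
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20
spark.shuffle.service.enabled          true
```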

Integration details about Elasticsearch and Spark (RDD, Spark SQL, and Streaming) can be found at
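A minimal Spark SQL sketch using the elasticsearch-hadoop connector (the artifact must be on the classpath; `es.nodes`, the index names, and the column are placeholders):

```scala
// Requires the elasticsearch-hadoop (elasticsearch-spark) artifact on the classpath.
import org.elasticsearch.spark.sql._

// "localhost:9200" and the index names are placeholders.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .load("my-index")

// saveToEs comes from the org.elasticsearch.spark.sql implicits.
df.filter($"status" === "active").saveToEs("filtered-index")
```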

Yes, you can partition your JDBC DataFrame reads as described in 
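A sketch of a partitioned JDBC read (URL, table, credentials, and column are placeholders): Spark issues `numPartitions` parallel queries, splitting `partitionColumn`'s range between `lowerBound` and `upperBound`.

```scala
// Partitioned JDBC read: parallelized across numPartitions tasks.
// URL, table, user, and column names below are placeholders.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "orders")
  .option("user", "etl")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "order_id") // must be numeric, date, or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
```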
spark.sql.shuffle.partitions (default: 200) is the property to modify when you know better (or are experimenting with) how many reducers Spark SQL should use for join and aggregation operations, as referenced in and 
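Setting it is a one-liner (64 is just an illustrative value):

```scala
// Tune the reducer count for joins/aggregations (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Equivalent SQL form:
spark.sql("SET spark.sql.shuffle.partitions=64")
```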

A great write-up on integrating Spark Streaming to consume data from NiFi via Remote Process Groups; 
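The receiver side of that pattern, sketched with the nifi-spark-receiver artifact (the NiFi URL and output-port name are placeholders):

```scala
// Requires the nifi-spark-receiver artifact on the classpath.
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// URL and port name are placeholders; the port is the NiFi output port
// that the remote process group exposes to Spark.
val conf = new SiteToSiteClient.Builder()
  .url("http://nifi-host:8080/nifi")
  .portName("Data for Spark")
  .buildConfig()

val packets = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
packets.map(p => new String(p.getContent)).print()

ssc.start()
```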

(Databricks blog post) Deep Dive into Spark SQL's Catalyst Optimizer;
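Catalyst's work is easy to inspect from the shell; a quick sketch of dumping the plans it produces:

```scala
val df = spark.range(1000)
  .filter($"id" % 2 === 0)
  .groupBy(($"id" % 10).as("bucket"))
  .count()

// extended = true prints the parsed, analyzed, and optimized logical plans
// plus the physical plan Catalyst selected.
df.explain(true)
```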

Francois' blog post about Adaptive Query Execution (i.e. intelligently selecting # of reducers) and other performance concepts;
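The Spark 3.x switches for AQE, as a spark-defaults.conf sketch (values are illustrative):

```
# Spark 3.x adaptive query execution -- illustrative values
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.sql.adaptive.skewJoin.enabled             true
```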

Cloudera blog on UDF and UDAF development;
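A sketch of both flavors (function names are placeholders; the UDAF uses the Spark 3 `Aggregator` style rather than the deprecated `UserDefinedAggregateFunction`):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Scalar UDF: register once, call from SQL.
spark.udf.register("squared", (x: Long) => x * x)
spark.sql("SELECT squared(id) FROM range(5)").show()

// UDAF, Spark 3 style: a typed Aggregator wrapped for untyped use.
val sumAgg = new Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, x: Long): Long = acc + x
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
spark.udf.register("my_sum", org.apache.spark.sql.functions.udaf(sumAgg))
```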

Good stuff from Ranga Reddy shows that the pre-3.0 hint was only for broadcast (and didn't require the broadcast to actually happen, which sounds like a proper "hint" to me), and from 3.0 onward we get 4 types of join hints
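The four Spark 3.0+ hints are BROADCAST (a.k.a. BROADCASTJOIN / MAPJOIN), MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL; a sketch of both syntaxes (table and DataFrame names are placeholders):

```scala
// SQL form: hint the dimension side of the join.
spark.sql("""
  SELECT /*+ BROADCAST(d) */ f.id, d.name
  FROM facts f JOIN dims d ON f.dim_id = d.id
""")

// DataFrame form: same idea via .hint on one side.
largeDf.join(smallDf.hint("shuffle_hash"), "id")
```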

A good article about joins;