Spark Cheat Sheet
Spark test wordcount: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/run-spark2-sample-apps.html
Dynamic Resource Allocation; https://community.hortonworks.com/content/supportkb/49510/how-to-enable-dynamic-resource-allocation-in-spark.html
Integration details for Elasticsearch and Spark (RDD, Spark SQL, and Streaming) can be found at https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
A great write-up on using Spark Streaming to consume data from NiFi via Remote Process Groups: https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html
Databricks blog post, Deep Dive into Spark SQL's Catalyst Optimizer: https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
Francois' blog post about Adaptive Query Execution (e.g. intelligently selecting the number of reducers at runtime) and other performance concepts: https://blog.cloudera.com/how-does-apache-spark-3-0-increase-the-performance-of-your-sql-workloads/
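A rough sketch of what switching on Adaptive Query Execution can look like. The property names are the standard Spark 3.x SQL settings. The helper name `apply_aqe_settings` and the choice of values are assumptions for illustration only, and `spark` is assumed to be an already-created SparkSession.

```python
# Standard Spark 3.x SQL properties; the values are illustrative, not tuned.
AQE_SETTINGS = {
    # Re-optimize the query plan at runtime using shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    # Merge small shuffle partitions, i.e. pick the reducer count dynamically.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Hypothetical helper: set each property on an existing SparkSession."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return spark
```

The same keys can also be passed as `--conf key=value` on spark-submit instead of being set in code.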
Cloudera blog on UDF and UDAF development; https://blog.cloudera.com/working-with-udfs-in-apache-spark/
Good stuff from Ranga Reddy:
- Memory management article >> https://community.cloudera.com/t5/Community-Articles/Spark-Memory-Management/ta-p/317794
- All Cloudera Community articles >> https://community.cloudera.com/t5/tkb/usercontributedarticlespage/user-id/78612/tkb-id/CommunityArticles
- GitHub projects >> https://github.com/rangareddy such as https://github.com/rangareddy/ranga_spark_experiments
Join hints: https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-hints.html shows that before 3.0 the only join hint was BROADCAST (and Spark wasn't required to actually honor it, which sounds like a good "hint" to me). From 3.0 onward we get 4 types of join hints: BROADCAST, MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL
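A minimal sketch of how the Spark 3.0+ join hints can be written in SQL. The hint names come from the linked docs page. The table names `orders` and `customers`, the column `customer_id`, and both helper functions are hypothetical placeholders, and `spark` is assumed to be an existing SparkSession with those tables registered.

```python
# The four join strategy hints documented for Spark 3.0+.
JOIN_HINTS = ("BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL")


def hinted_join_sql(hint, left="orders", right="customers", key="customer_id"):
    """Build a SELECT that asks Spark to use `hint` for the `right` table.

    Table and column names are placeholders, not from any real schema.
    """
    if hint not in JOIN_HINTS:
        raise ValueError(f"unknown join hint: {hint}")
    return (
        f"SELECT /*+ {hint}({right}) */ * "
        f"FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"
    )


def explain_all_hints(spark):
    """Print the physical plan for each hint, given an existing SparkSession
    with `orders` and `customers` registered as tables or temp views."""
    for hint in JOIN_HINTS:
        spark.sql(hinted_join_sql(hint)).explain()
```

Comparing the `explain()` output per hint is a quick way to see whether Spark actually followed the suggestion, since a hint is a request rather than a guarantee. On the DataFrame side, the equivalent would be something like `orders.join(customers.hint("broadcast"), "customer_id")`.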
Good article about joins in Spark: https://towardsdatascience.com/the-art-of-joining-in-spark-dcbd33d693c