Spark Cheat Sheet

Spark test wordcount:
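A classic word-count smoke test, runnable in spark-shell (the input path is a placeholder; point it at any text file):

```scala
// spark-shell smoke test: count word occurrences in a text file.
// "input.txt" is a placeholder path.
val counts = sc.textFile("input.txt")
  .flatMap(line => line.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Show the ten most frequent words.
counts.sortBy(_._2, ascending = false).take(10).foreach(println)
```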

Dynamic Resource Allocation;
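The usual knobs, as a spark-defaults.conf sketch (values are illustrative; the external shuffle service is the classic prerequisite on pre-3.0 clusters):

```
# spark-defaults.conf -- illustrative values
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   20
spark.shuffle.service.enabled          true
```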

Integration details about Elasticsearch and Spark (RDD, Spark SQL, and Streaming) can be found at
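A minimal Spark SQL sketch using the elasticsearch-hadoop connector (the artifact must be on the classpath; `es.nodes`, the index names, and the column are placeholders):

```scala
// Requires the elasticsearch-hadoop (elasticsearch-spark) artifact on the classpath.
import org.elasticsearch.spark.sql._

// "localhost:9200" and the index names are placeholders.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")
  .load("my-index")

// saveToEs comes from the org.elasticsearch.spark.sql implicits.
df.filter($"status" === "active").saveToEs("filtered-index")
```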

Yes, you can partition your JDBC DataFrame reads as described in 
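A sketch of a partitioned JDBC read (URL, table, credentials, and column are placeholders): Spark issues `numPartitions` parallel queries, splitting `partitionColumn`'s range between `lowerBound` and `upperBound`.

```scala
// Partitioned JDBC read: parallelized across numPartitions tasks.
// URL, table, user, and column names below are placeholders.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "orders")
  .option("user", "etl")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "order_id") // must be numeric, date, or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
```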
spark.sql.shuffle.partitions (default: 200) is the property to modify when you know better (or are experimenting with) how many reducers Spark SQL should use for join and aggregation operations, as referenced in and 
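Setting it is a one-liner (64 is just an illustrative value):

```scala
// Tune the reducer count for joins/aggregations (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Equivalent SQL form:
spark.sql("SET spark.sql.shuffle.partitions=64")
```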

A great write-up on integrating Spark Streaming to consume data from NiFi via Remote Process Groups; 
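The receiver side of that pattern, sketched with the nifi-spark-receiver artifact (the NiFi URL and output-port name are placeholders):

```scala
// Requires the nifi-spark-receiver artifact on the classpath.
import org.apache.nifi.remote.client.SiteToSiteClient
import org.apache.nifi.spark.NiFiReceiver
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// URL and port name are placeholders; the port is the NiFi output port
// that the remote process group exposes to Spark.
val conf = new SiteToSiteClient.Builder()
  .url("http://nifi-host:8080/nifi")
  .portName("Data for Spark")
  .buildConfig()

val packets = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
packets.map(p => new String(p.getContent)).print()

ssc.start()
```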

(Databricks blog post) Deep Dive into Spark SQL's Catalyst Optimizer;
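Catalyst's work is easy to inspect from the shell; a quick sketch of dumping the plans it produces:

```scala
val df = spark.range(1000)
  .filter($"id" % 2 === 0)
  .groupBy(($"id" % 10).as("bucket"))
  .count()

// extended = true prints the parsed, analyzed, and optimized logical plans
// plus the physical plan Catalyst selected.
df.explain(true)
```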

Francois' blog post about Adaptive Query Execution (i.e. intelligently selecting # of reducers) and other performance concepts;
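The Spark 3.x switches for AQE, as a spark-defaults.conf sketch (values are illustrative):

```
# Spark 3.x adaptive query execution -- illustrative values
spark.sql.adaptive.enabled                      true
spark.sql.adaptive.coalescePartitions.enabled   true
spark.sql.adaptive.skewJoin.enabled             true
```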

Cloudera blog on UDF and UDAF development;
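A sketch of both flavors (function names are placeholders; the UDAF uses the Spark 3 `Aggregator` style rather than the deprecated `UserDefinedAggregateFunction`):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// Scalar UDF: register once, call from SQL.
spark.udf.register("squared", (x: Long) => x * x)
spark.sql("SELECT squared(id) FROM range(5)").show()

// UDAF, Spark 3 style: a typed Aggregator wrapped for untyped use.
val sumAgg = new Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(acc: Long, x: Long): Long = acc + x
  def merge(a: Long, b: Long): Long = a + b
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
spark.udf.register("my_sum", org.apache.spark.sql.functions.udaf(sumAgg))
```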

Good stuff from Ranga Reddy shows that the pre-3.0 hint was only for broadcast (and didn't require the broadcast to actually happen, which sounds like a proper "hint" to me), and from 3.0 onward we get 4 types of join hints
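The four Spark 3.0+ hints are BROADCAST (a.k.a. BROADCASTJOIN / MAPJOIN), MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL; a sketch of both syntaxes (table and DataFrame names are placeholders):

```scala
// SQL form: hint the dimension side of the join.
spark.sql("""
  SELECT /*+ BROADCAST(d) */ f.id, d.name
  FROM facts f JOIN dims d ON f.dim_id = d.id
""")

// DataFrame form: same idea via .hint on one side.
largeDf.join(smallDf.hint("shuffle_hash"), "id")
```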

A good article about joins;