I'm writing this blog post for those who are learning Spark's RDD programming model, have heard that the mapPartitions() transformation is usually faster than its map() brethren, and wondered why.  When in doubt... RTFM, which brought me to the following excerpt from http://spark.apache.org/docs/latest/programming-guide.html.

 

The good news is that the answer is right there.  It just might not be all that apparent to those from Missouri (you know... the "Show Me State"), and almost all of us can benefit from a working example.  If reading the description above got you almost there, then https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/ might just push you over the edge to the "aha" moment.  It worked for me, but I wanted to build a fresh example all on my own, which I'll try to document well enough for you to review and even recreate if you like.
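Before getting to Spark itself, here is a rough, Spark-free sketch in plain Python of why the two transformations can differ in cost.  Everything in it is a made-up stand-in: fake_map(), fake_map_partitions() and expensive_setup() are hypothetical helpers of mine, not Spark APIs, and the "partitions" are just a list of lists.  The only point is that a function handed to map() runs once per element, while a function handed to mapPartitions() runs once per partition and receives an iterator over that partition's elements.

```python
# Plain-Python stand-ins for the two transformations; this is NOT the Spark API.
# Partitions are modeled as a simple list of lists.

setup_calls = 0  # counts how often the "expensive" setup runs


def expensive_setup():
    """Hypothetical costly initialization (think: opening a connection)."""
    global setup_calls
    setup_calls += 1
    return 10  # pretend resource: a multiplier


def fake_map(partitions, func):
    """Mimics map(): func is called once per element."""
    return [[func(x) for x in part] for part in partitions]


def fake_map_partitions(partitions, func):
    """Mimics mapPartitions(): func is called once per partition with an iterator."""
    return [list(func(iter(part))) for part in partitions]


def per_element(x):
    multiplier = expensive_setup()  # setup cost paid for every element
    return x * multiplier


def per_partition(elements):
    multiplier = expensive_setup()  # setup cost paid once per partition
    for x in elements:
        yield x * multiplier


partitions = [[1, 2, 3], [4, 5, 6]]  # 6 elements spread over 2 partitions

setup_calls = 0
map_result = fake_map(partitions, per_element)
map_setups = setup_calls

setup_calls = 0
mp_result = fake_map_partitions(partitions, per_partition)
mp_setups = setup_calls

print(map_setups, mp_setups)  # prints "6 2"
```

Both versions produce the same output; the difference is that the setup ran six times in the per-element version and only twice in the per-partition one.  If the setup is cheap the gap hardly matters, which is presumably why the claim is "usually faster" rather than "always faster".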

...