
I'm writing this blog post for those who are learning Spark's RDD programming model, have heard that the mapPartitions() transformation is usually faster than its map() brethren, and wondered why.  If in doubt... RTFM, which brought me to the following excerpt from http://spark.apache.org/docs/latest/programming-guide.html.

...

If you made it this far then you know that Spark's Resilient Distributed Dataset (RDD) is a partitioned beast all its own, somewhat similar to how HDFS breaks files up into blocks.  In fact, when we load a file into Spark from HDFS, by default the number of RDD partitions is the same as the number of HDFS blocks.  You can suggest the number of partitions you want the RDD to have (I'm breaking my example into three partitions), and that itself is a topic for another, even bigger, discussion.  If you are still with me then you probably know that "narrow" transformations/tasks happen independently on each of the partitions.  Therefore, the well-used map() method works in parallel on each of the RDD's partitions as it walks through them.
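As a minimal sketch of that (assuming a spark-shell session and that GreenEggsAndHam.txt, used later in this post, is already sitting in your HDFS home directory):

  // Hint that we want three partitions when loading the file; Spark treats this as a minimum.
  val lines = sc.textFile("GreenEggsAndHam.txt", 3)

  // Confirm how many partitions the RDD actually ended up with.
  println(lines.getNumPartitions)

  // A narrow transformation: the function runs on each partition independently, in parallel.
  val upperLines = lines.map(line => line.toUpperCase)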

That's good news – in fact, that's great news, as these narrow tasks are key to performance!  The sister mapPartitions() transformation also works independently on the partitions, so what's so special that makes it run better in most cases?  Well... it comes down to the fact that map() exercises the supplied function at a per-element level while mapPartitions() exercises it at the partition level.  What does that mean in practice?
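To make that concrete, here is a small sketch (the expensiveSetup() helper is hypothetical, standing in for anything costly to construct) showing that the function handed to map() runs once per element, while the function handed to mapPartitions() runs once per partition and receives an iterator over that partition's elements:

  // expensiveSetup() is a hypothetical stand-in for anything costly to build
  // once: a parser, a database connection, a formatter, etc.
  def expensiveSetup(): Int => Int = { x => x * 2 }

  val nums = sc.parallelize(1 to 12, 3)   // 12 elements spread across 3 partitions

  // map(): the supplied function runs once per ELEMENT, so the costly
  // setup below happens 12 times.
  val viaMap = nums.map { x =>
    val f = expensiveSetup()
    f(x)
  }

  // mapPartitions(): the supplied function runs once per PARTITION and is
  // handed an Iterator over that partition's elements, so the costly setup
  // happens only 3 times.
  val viaMapPartitions = nums.mapPartitions { iter =>
    val f = expensiveSetup()
    iter.map(f)
  }

  viaMap.collect()             // Array(2, 4, 6, ..., 24)
  viaMapPartitions.collect()   // same answer, far less per-element overhead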

...

This next little bit simply loads up a timeless Dr. Seuss story into an RDD, which then gets transformed (via a split) into another RDD with a single word for each element.  I'll explain more in a bit, but let's also cache this RDD into memory to aid our ultimate test later in the blog.

For my testing, I used version 2.4 of the Hortonworks Sandbox.  Once you get this started up, SSH into the machine as root, but then switch to the mktg1 user, who has a local Linux account as well as an HDFS home directory.  If you want to use a brand-new user, try my simple Hadoop cluster user provisioning process (simple = w/o PAM or Kerberos).  Once there, you can copy the contents of XXXX of GreenEggsAndHam.txt onto your clipboard and paste it into the vi editor (just type "i" once vi starts and paste from your clipboard, then end it all by hitting ESC, typing ":" to get the command prompt, and "wq" to write & quit the editor) and then copy that file into HDFS.
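With GreenEggsAndHam.txt sitting in HDFS, a rough sketch of the load-split-cache step described above (the bare filename assumes the mktg1 user's HDFS home directory) might look like this:

  // Load the story from the mktg1 user's HDFS home directory, again hinting at three partitions.
  val story = sc.textFile("GreenEggsAndHam.txt", 3)

  // flatMap() splits each line on whitespace so every element of the new RDD is a single word.
  val words = story.flatMap(line => line.split("\\s+"))

  // Cache the word-per-element RDD in memory so the timing comparisons later in
  // the post don't pay to re-read and re-split the file on every run.
  words.cache()
  words.count()   // run an action so the cache actually gets populated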

...