WORK IN PROGRESS!!!!!  SWAP FUNCTION FOR METHOD

I'm writing this blog post for those who are learning Spark's RDD programming model, have heard that the mapPartitions() transformation is usually faster than its map() sibling, and wondered why.  If in doubt... RTFM, which brought me to the following excerpt from http://spark.apache.org/docs/latest/programming-guide.html.

 

The good news is that the answer is right there.  It just might not be as apparent for those from Missouri (you know... the "Show Me State"), and almost all of us can benefit from a working example.  If you're almost there from reading the description above, then https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/ might just push you into the "aha" moment.  It worked for me, but I wanted to build a fresh example of my own, which I'll try to document well enough for you to review and even recreate if you like.

If you made it this far then you know that Spark's Resilient Distributed Dataset (RDD) is a partitioned beast all its own, somewhat similar to how HDFS breaks files up into blocks.  In fact, when we load a file in Spark from HDFS, by default the number of RDD partitions is the same as the number of HDFS blocks.  You can suggest the number of partitions you want for the RDD (I'm breaking my example into three partitions), and that itself is a topic for another, even bigger, discussion.  If you are still with me then you probably know that "narrow" transformations/tasks happen independently on each of the partitions.  Therefore, the well-used map() function works in parallel on each of the RDD's partitions as it walks through them.

That's good news; in fact, that's great news, as these narrow tasks are key to performance!  The sister mapPartitions() transformation also works independently on the partitions, so what's so special that makes it run better in most cases?  Well... it comes down to the fact that map() invokes the supplied function once per element, while mapPartitions() invokes it once per partition.  What does that mean in practice?
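To make the per-element versus per-partition distinction concrete, here is a minimal sketch that uses plain Scala collections as a stand-in for an RDD (no Spark required). The object name, the three sample "partitions", and the expensiveSetup() counter are all hypothetical and only illustrate how often a costly step would run under each style.

```scala
// Plain Scala collections stand in for an RDD here; nothing below is Spark API.
object PerElementVsPerPartition {
  // Hypothetical stand-in for an RDD split into three partitions.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("I", "am", "Sam"),
    Seq("Sam", "I", "am"),
    Seq("green", "eggs", "and", "ham")
  )

  // Counts how often the (pretend) expensive step runs.
  var setupCalls = 0
  def expensiveSetup(): Unit = { setupCalls += 1 }

  // map()-style function: the expensive step runs for every element.
  def perElement(word: String): String = {
    expensiveSetup()
    ">>" + word + "<<"
  }

  // mapPartitions()-style function: the expensive step runs once,
  // then the whole iterator of elements is processed.
  def perPartition(words: Iterator[String]): Iterator[String] = {
    expensiveSetup()
    words.map(w => ">>" + w + "<<")
  }

  // Resets the counter, evaluates the given expression, and reports the count.
  def countSetupCalls(run: => Any): Int = {
    setupCalls = 0
    run
    setupCalls
  }
}
```

With these ten sample words, the per-element style triggers the expensive step once for each word, while the per-partition style triggers it once for each of the three partitions.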

...

This next little bit simply loads up a timeless Dr. Seuss story into an RDD that then gets transformed into another RDD with a single word for each element.  I'll explain more in a bit, but let's also cache this RDD into memory to aid our ultimate test later in the blog.

For my testing, I used version 2.4 of the Hortonworks Sandbox.  Once you get this started up, SSH into the machine as root, but then switch to the mktg1 user, who has a local Linux account as well as an HDFS home directory.  If you want to use a brand-new user, try my simple hadoop cluster user provisioning process (simple = w/o pam or kerberos).  Once there, you can copy the contents of XXXX of GreenEggsAndHam.txt onto your clipboard and paste it into the vi editor (just type "i" once vi starts and paste from your clipboard, then end it all by hitting ESC twice and typing ":" to get the command prompt and "wq" to write & quit the editor) and then copy that file into HDFS.

...

Then let's start up the Spark shell.  NOTE: As in the following capture, I'm eliminating a lot of "noise" to keep us focused on the important stuff.

Code Block
[mktg1@sandbox ~]$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
SQL context available as sqlContext.

scala> 

As described earlier, now we load up the Dr. Seuss story into an RDD that then gets split into another RDD of a single word per element.  Let's also cache it to establish a baseline of simply reading through the RDD so that we don't introduce any additional variability in our comparison testing.

Code Block
scala> val wordsRdd = sc.textFile("GreenEggsAndHam.txt", 3).flatMap(line => line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:27

scala> wordsRdd.persist() //mark it as cached
res0: wordsRdd.type = MapPartitionsRDD[2] at flatMap at <console>:27

scala> sc.setLogLevel("INFO")  //enable timing messages

scala> wordsRdd.take(3)   //trigger cache load
16/05/19 23:08:22 INFO DAGScheduler: Job 0 finished: take at <console>:30, took 0.565019 s
res2: Array[String] = Array(I, am, Daniel)

...

Now that we got an order of magnitude speed improvement, and somewhat consistent response times, we are ready to stand up a test harness to prove that mapPartitions() is faster than map() when the function we are calling hurts performance if it is called once per record instead of once per partition.  The wrapSingleWord() function is just going to add ">>" before each word in the RDD and slap "<<" on the back side.  And to really make the point, I've snuck in a bogus function that burns 10 millis of clock time on each call, which is my simulation of some arbitrary expensive operation.  Just paste the following functions separately into the shell.
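For readers following along without the original listing, here is a minimal sketch of what two such functions might look like. The names wrapSingleWord and simulateExpensiveObjectCreation come from this post, but the bodies are assumptions: in particular, Thread.sleep(10) is just one simple way to burn roughly 10 millis per call, and the enclosing object exists only to keep the sketch self-contained (in the shell you could paste the two defs on their own).

```scala
object SingleWordSketch {
  // Assumption: a 10 ms sleep stands in for some arbitrary expensive operation.
  def simulateExpensiveObjectCreation(): Unit = Thread.sleep(10)

  // Meant to be used with map(), so the expensive step runs for every element.
  def wrapSingleWord(word: String): String = {
    simulateExpensiveObjectCreation()
    ">>" + word + "<<"
  }
}
```

Because the sleep sits inside the per-element function, every word pays the simulated cost, which is exactly the overhead the comparison is meant to expose.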

...

Code Block
scala> wordsRdd.map(word => wrapSingleWord(word)).take(10)
16/05/20 00:18:46 INFO DAGScheduler: Job 7 finished: take at <console>:34, took 0.151721 s
res9: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)

scala> wordsRdd.map(word => wrapSingleWord(word)).take(10)
16/05/20 00:18:52 INFO DAGScheduler: Job 8 finished: take at <console>:34, took 0.146428 s
res10: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)

scala> wordsRdd.map(word => wrapSingleWord(word)).take(10)
16/05/20 00:18:54 INFO DAGScheduler: Job 9 finished: take at <console>:34, took 0.153416 s
res11: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)

Tip: Remember, wordsRdd.map(wrapSingleWord) could be used as shorthand for wordsRdd.map(word => wrapSingleWord(word)) since the function takes in the entire RDD element.
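The shorthand in the tip is ordinary Scala behavior rather than anything Spark-specific. As a small sketch on a plain List (the object and the wrap helper are hypothetical stand-ins, without the simulated delay), both spellings produce the same result:

```scala
object ShorthandSketch {
  // Hypothetical stand-in for wrapSingleWord, minus the simulated delay.
  def wrap(word: String): String = ">>" + word + "<<"

  val words = List("I", "am", "Sam")

  // Explicit lambda that forwards each element to the method.
  val explicit = words.map(word => wrap(word))

  // Shorthand: the compiler turns the method into a function value for us.
  val shorthand = words.map(wrap)
}
```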

In this tightly constrained Sandbox cluster, we are seeing about 0.15 secs to execute this completely.  Now, to get mapPartitions() to work we need another function that does the same thing, but it has to accept a partition's whole collection of elements and return a whole collection on the way back.  The following is just a jazzed-up version of the earlier function.

Code Block
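import java.util._
import scala.collection.JavaConversions._

// Invoked once per partition: receives all of the partition's elements as an iterator.
def wrapMultiWords(words: Iterator[String]) : Iterator[String] = {
  simulateExpensiveObjectCreation()   // the simulated expensive step now runs once per call
  val sb = new StringBuilder()        // one builder reused for every word
  val wList = new ArrayList[String]()
  while( words.hasNext ) {
    sb.setLength(0)                   // clear the builder before the next word
    wList.add( sb.append(">>").append(words.next()).append("<<").toString() )
  }
  wList.iterator()
}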
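As an aside, the same idea can be written without the Java collections. The following is only a sketch of an alternative, not the version used for the timings in this post: the names MultiWordSketch and wrapMultiWordsLazy are hypothetical, and the 10 ms sleep is again an assumed stand-in for the expensive step.

```scala
object MultiWordSketch {
  // Assumption: a 10 ms sleep stands in for some arbitrary expensive operation.
  def simulateExpensiveObjectCreation(): Unit = Thread.sleep(10)

  // Takes a whole iterator of words and hands back an iterator of wrapped words.
  def wrapMultiWordsLazy(words: Iterator[String]): Iterator[String] = {
    simulateExpensiveObjectCreation()        // runs once per call, i.e. once per partition
    words.map(word => ">>" + word + "<<")    // lazy: each word is wrapped as it is consumed
  }
}
```

Since this takes and returns a Scala Iterator[String], it has the shape that mapPartitions() expects for an RDD[String], and it avoids building an intermediate list for the whole partition.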

...

As you can see, the new RDD has the same data, but the performance was dramatically better, finishing in about a third of the time in this testing scenario.  Obviously, you'll need to do some testing with your own data and the functions that are being used in the mapping transformations, but if there is any measurable cost in calling the related function, it will surely surface when you have to call it over and over again.  While Spark is not necessarily Hadoop, the need to POC your particular problem and validate your hypothesis is just as important in this space.  With that, you will likely see improvements by moving from map() to mapPartitions() when your related function can process all the elements at once.
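If you would rather not read timings out of the INFO log while doing your own testing, a tiny helper like the one below can wrap whatever you want to measure. This is a minimal sketch; the TimingSketch object and its timed helper are hypothetical names, and wall-clock measurements like this are only rough indicators.

```scala
object TimingSketch {
  // Evaluates the body once and returns its result with the elapsed milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }
}
```

Keep in mind that Spark transformations are lazy, so the body being timed should end in an action such as take(10); otherwise you would only be measuring the time it takes to define the transformation.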