Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Then let's start-up the Spark shell.  NOTE: Like in the following capture, I'm eliminating a lot of "noise" to keep us focused on the import important stuff.

Code Block
languagebash
themeEmacs
[mktg1@sandbox ~]$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
SQL context available as sqlContext.

scala> 

...

 :help for more information.
SQL context available as sqlContext.

scala> 

As described earlier, now we load up the Dr Suess story into an RDD that then gets split into another RDD of a single word per element.  I'll explain more in a bit, but let's also cache this RDD into memory to aid our ultimate test later in the blog.

...

Now that we got an order of magnitude speed improvement, and somewhat consistent response times, we are ready to stand up a test harness to prove that mapPartitions() is faster than map() when the function we are calling produces negative results when call once per record instead of once per partition.  The wrapSingleWord() function is just going to add ">>" before each word in the RDD and slap on "<<" on the backside.  And to really make the point, I've snuck in a bogus method that burns 50 10 millis of clock time for each function call which is my simulation of some arbitrary expensive operation.  Just paste in the following functions separately in the shell.

...

Code Block
languagebash
themeEmacs
scala> wordsRdd.map(word => wrapSingleWord(word)(wrapSingleWord).take(10)
16/05/20 00:18:46 INFO DAGScheduler: Job 7 finished: take at <console>:34, took 0.151721 s
res9: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)

scala> wordsRdd.map(word => wrapSingleWord(word)).take(10)
16/05/20 00:18:52 INFO DAGScheduler: Job 8 finished: take at <console>:34, took 0.146428 s
res10: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)

scala> wordsRdd.map(word => wrapSingleWord(word)).take(10)
16/05/20 00:18:54 INFO DAGScheduler: Job 9 finished: take at <console>:34, took 0.153416 s
res11: Array[String] = Array(>>I<<, >>am<<, >>Daniel<<, >><<, >>I<<, >>am<<, >>Sam<<, >>Sam<<, >>I<<, >>am<<)
Tip

Remember, wordsRdd.map(wrapSingleWord) is shorthand for wordsRdd.map(word => wrapSingleWord(word)) since the function takes in the entire RDD element.

In this tightly constrained Sandbox cluster, we are seeing about 0.15 secs to execute this completely.  Now, to get mapPartitions() to work we need another method that does the same thing, but it has to allow the whole collection of elements to be passed into the function and it needs to return a whole collection on the way back.  The following is just a jazzed up version of the earlier method.

...