I'm writing this blog post for those learning Spark's RDD programming model who might have heard that the mapPartitions() transformation is usually faster than its map() counterpart and wondered why.  If in doubt... RTFM, which brought me to the following excerpt from http://spark.apache.org/docs/latest/programming-guide.html.

The good news is that the answer is right there.  It just might not be as apparent to those from Missouri (you know... the "Show Me State"), and almost all of us can benefit from a working example.  If you're almost there from reading the description above, then https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/ might just push you to the "aha" moment.  It worked for me, but I wanted to build a fresh example all on my own, which I'll try to document well enough for you to review and even recreate if you like.
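To make the difference concrete before we build the bigger example, here's a minimal sketch of my own (not from the Spark docs or the post above) of why mapPartitions() can win: any per-record setup cost you pay inside map() is paid once per element, while inside mapPartitions() it is paid once per partition.  The expensiveSetup() function below is a hypothetical stand-in, simulated with Thread.sleep(), for something genuinely costly like opening a database connection.

// Hypothetical helper: stands in for anything costly to create, e.g. a DB
// connection or a compiled regex.  Thread.sleep() just simulates the cost.
def expensiveSetup(): Int = { Thread.sleep(10); 1 }

val numbers = sc.parallelize(1 to 100, 4)

// map(): the setup cost is paid once for EVERY element (100 times here).
val viaMap = numbers.map { x =>
  val factor = expensiveSetup()
  x * factor
}

// mapPartitions(): the function receives each partition as an iterator, so
// the setup cost is paid once for every PARTITION (4 times here).
val viaMapPartitions = numbers.mapPartitions { iter =>
  val factor = expensiveSetup()
  iter.map(x => x * factor)
}

viaMap.count()            // forces evaluation; ~100 setup calls
viaMapPartitions.count()  // forces evaluation; only ~4 setup calls

Paste that into spark-shell and the second count() should come back noticeably faster, purely because the setup ran 4 times instead of 100.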

For my testing, I used version 2.4 of the Hortonworks Sandbox.  Once you get it started up, SSH into the machine as root, then switch user to mktg1, who has a local Linux account as well as an HDFS home directory.  If you want to use a brand-new user, try my simple Hadoop cluster user provisioning process (simple = without PAM or Kerberos).  Once there, you can copy the contents of XXXX onto your clipboard and paste it into the vi editor (just type "i" once vi opens, paste from your clipboard, then end it all by hitting ESC, typing ":" to get the command prompt, and "wq" to write & quit the editor), and then copy that file into HDFS.

HW10653-2:~ lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password: 
Last login: Thu May 19 22:04:06 2016 from 10.0.2.2
[root@sandbox ~]# su - mktg1
[mktg1@sandbox ~]$ vi GreenEggsAndHam.txt
[mktg1@sandbox ~]$ hdfs dfs -put GreenEggsAndHam.txt 
[mktg1@sandbox ~]$ hdfs dfs -ls 
Found 1 items
-rw-r--r--   3 mktg1 hdfs       3478 2016-05-19 22:13 GreenEggsAndHam.txt

Then let's start up the Spark shell.  NOTE: As in the following capture, I'm eliminating a lot of "noise" to keep us focused on the important stuff.

[mktg1@sandbox ~]$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
SQL context available as sqlContext.

scala> 

Now that we've got some of the housekeeping out of the way, what are we going to do, and what does it all mean?  As a reminder, Spark's Resilient Distributed Dataset (RDD) is a partitioned beast all its own, somewhat similar to how HDFS breaks files up into blocks.  In fact, when we load a file into Spark from HDFS, by default the number of RDD partitions is the same as the number of HDFS blocks.  You can suggest that you want the RDD partitioned differently (I'm breaking my example into three partitions), and that itself is a topic for another, even bigger, discussion.  For our purposes, the next little bit simply loads the story into an RDD that then gets split into another RDD with a single word per element.  I'll explain more in a bit, but let's also cache this RDD into memory to aid our ultimate test later in the blog.

scala> val wordsRdd = sc.textFile("10xselfishgiant.txt", 3).flatMap(line => line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[34] at flatMap at <console>:27

scala> wordsRdd.persist() //mark it as cached
res44: wordsRdd.type = MapPartitionsRDD[34] at flatMap at <console>:27

scala> wordsRdd.take(3)   //trigger cache load
16/05/19 18:29:27 INFO DAGScheduler: Job 38 finished: take at <console>:30, took 0.059390 s
res45: Array[String] = Array(EVERY, afternoon,, as)
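Before we get to the real test, here's a quick sanity check plus a sketch (my own, not the post's final benchmark) of the kind of side-by-side comparison the cached wordsRdd sets us up for.  The time() helper below is just my own little convenience, not a Spark API, and the (word, 1) pairing is only a placeholder workload.

// My own wall-clock helper (not a Spark API) for rough timings in the shell.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// Confirm we really did get the three partitions we asked textFile() for.
println(wordsRdd.partitions.length)

// map(): the pairing function is applied element by element.
time("map") {
  wordsRdd.map(word => (word, 1)).count()
}

// mapPartitions(): the same pairing, but the function is handed the whole
// partition as an iterator and returns a new iterator.
time("mapPartitions") {
  wordsRdd.mapPartitions(iter => iter.map(word => (word, 1))).count()
}

Because wordsRdd is already cached in memory, repeated runs of either line are measuring the transformation itself rather than the HDFS read.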
