...

As described earlier, we now load the Dr. Seuss story into an RDD, which is then split into another RDD with a single word per element. Let's also cache it to establish a baseline of simply reading through the RDD, so that we don't introduce any additional variability in our comparison testing.

scala> val wordsRdd = sc.textFile("GreenEggsAndHam.txt", 3).flatMap(line => line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:27

scala> wordsRdd.persist() //mark it as cached
res0: wordsRdd.type = MapPartitionsRDD[2] at flatMap at <console>:27

scala> sc.setLogLevel("INFO")  //enable timing messages

scala> wordsRdd.take(3)   //trigger cache load (take only computes the partitions it needs)
16/05/19 23:08:22 INFO DAGScheduler: Job 0 finished: take at <console>:30, took 0.565019 s
res2: Array[String] = Array(I, am, Daniel)
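
To make the flatMap step concrete, here is a minimal sketch that uses plain Scala collections as a stand-in for the RDD, so it runs without Spark. The object name WordSplitSketch, the helper splitWords, and the two sample lines are assumptions made up for illustration; they are not taken from the story file or from the Spark API.

```scala
object WordSplitSketch {
  // Hypothetical helper: same lambda as in the transcript above, applied to a
  // plain Seq instead of an RDD. flatMap flattens the per-line word arrays
  // into a single sequence with one word per element.
  def splitWords(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    // Made-up sample input, not read from GreenEggsAndHam.txt
    val lines = Seq("I am Sam", "Sam I am")

    // map keeps one element per line (each element is an array of words)
    val nested = lines.map(line => line.split(" "))
    println(nested.length)                    // 2

    // flatMap yields one element per word
    println(splitWords(lines).mkString(", ")) // I, am, Sam, Sam, I, am
  }
}
```

With map, the result would still have one element per line; flatMap is what gives the one-word-per-element shape that the rest of the comparison relies on.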

...