...
As described earlier, we now load the Dr. Seuss story into an RDD, which then gets split into another RDD with a single word per element. I'll explain more in a bit, but let's also cache this RDD into memory and read through it once to establish a baseline, so that we don't introduce any additional variability into our comparison testing later in the blog.
```
scala> val wordsRdd = sc.textFile("GreenEggsAndHam.txt", 3).flatMap(line => line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:27

scala> wordsRdd.persist() //mark it as cached
res0: wordsRdd.type = MapPartitionsRDD[2] at flatMap at <console>:27

scala> sc.setLogLevel("INFO") //enable timing messages

scala> wordsRdd.take(3) //trigger cache load
16/05/19 23:08:22 INFO DAGScheduler: Job 0 finished: take at <console>:30, took 0.565019 s
res2: Array[String] = Array(I, am, Daniel)
```
...