...
As described earlier, we now load the Dr. Seuss story into an RDD, which then gets split into another RDD with a single word per element. I'll explain more in a bit, but let's also cache this RDD into memory and read through it once to establish a baseline, so that we don't introduce any additional variability into our comparison testing later in the blog.
```
scala> val wordsRdd = sc.textFile("GreenEggsAndHam.txt", 3).flatMap(line => line.split(" "))
wordsRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:27

scala> wordsRdd.persist() //mark it as cached
res0: wordsRdd.type = MapPartitionsRDD[2] at flatMap at <console>:27

scala> sc.setLogLevel("INFO") //enable timing messages

scala> wordsRdd.take(3) //trigger cache load
16/05/19 23:08:22 INFO DAGScheduler: Job 0 finished: take at <console>:30, took 0.565019 s
res2: Array[String] = Array(I, am, Daniel)
```
...