As you can see, the new RDD has the same data, but performance was dramatically better: it finished in about a third of the time in this test scenario. You will need to test with your own data and with the functions used in your mapping transformations, but any measurable per-call overhead in the related function will surface when that function is called over and over again. While Spark is not necessarily Hadoop, the need to run a proof of concept (POC) for your particular problem and validate your hypothesis is just as important in this space. With that, you will likely see improvements by moving from map() to mapPartitions() when your related function can process all the elements of a partition at once.
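
To make the per-call overhead argument concrete, here is a minimal plain-Python sketch that needs no Spark installation. Partitions are modelled as ordinary lists, and `expensive_setup`, `per_element`, and `per_partition` are hypothetical names invented for illustration; they are not taken from the test described above. The sketch only counts how often the costly setup step runs under each approach and does not attempt to reproduce the timings.

```python
from typing import Iterable, Iterator, List

setup_calls = 0  # counts how often the (hypothetical) expensive setup runs


def expensive_setup() -> int:
    """Stand-in for costly per-call work, such as opening a connection."""
    global setup_calls
    setup_calls += 1
    return 10  # hypothetical multiplier used by both functions below


def per_element(x: int) -> int:
    # Shape of a function you would pass to rdd.map():
    # it is invoked once per element, so the setup cost is paid every time.
    factor = expensive_setup()
    return x * factor


def per_partition(rows: Iterable[int]) -> Iterator[int]:
    # Shape of a function you would pass to rdd.mapPartitions():
    # it receives an iterator over a whole partition and yields results,
    # so the setup cost is paid once per partition.
    factor = expensive_setup()
    for x in rows:
        yield x * factor


# Three pretend partitions holding nine elements in total (made-up data).
partitions: List[List[int]] = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# map()-style: call the function for every element.
setup_calls = 0
mapped = [per_element(x) for part in partitions for x in part]
map_setup_calls = setup_calls

# mapPartitions()-style: call the function once per partition.
setup_calls = 0
mapped_by_partition = [y for part in partitions for y in per_partition(iter(part))]
partition_setup_calls = setup_calls

print(mapped == mapped_by_partition)           # same data either way
print(map_setup_calls, partition_setup_calls)  # setup runs per element vs. per partition
```

In this sketch both approaches produce identical results, but the setup step runs once per element in the first case and once per partition in the second. How much that matters in a real job depends on how expensive the setup actually is, which is why testing against your own data remains necessary.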