# Programming with Apache Spark ![](Gents.jpg)

^ These Spark "introduction" sections and labs are new as of Rev6. If there are time constraints, feel free to convert the labs into demonstrations so the students get an opportunity to at least watch some Spark programming.
^ Est. 45 mins for the slides, plus 30 mins for the lab at the end of the section
^ Spark 1.4.1 Programming Guide: http://spark.apache.org/docs/1.4.1/programming-guide.html

---

## Learning Objectives

After you complete this lesson you should be able to:

- Start the Spark shell
- Understand what an RDD is
- Load data from HDFS and perform a word count
- Know the difference between a transformation and an action
- Explain lazy evaluation

---

## The Spark Ecosystem

![inline](SparkEcosystem.jpg)

^ Spark consists of a core library. Spark SQL, Spark Streaming, MLlib, and GraphX are built on top of it.
^ Spark SQL and DataFrames have exploded in popularity recently as there have been many performance improvements.
^ GraphX is very new and currently has no commercial support.

---

## The Resilient Distributed Dataset

An *immutable* collection of objects (or records) that can be operated on in parallel

- **Resilient**: lost partitions can be recomputed from parent RDDs because each RDD keeps its lineage information
- **Distributed**: partitions of data are distributed across nodes in the cluster
- **Dataset**: a set of data that can be accessed

Each RDD is composed of one or more partitions

- The user can control the number of partitions
- More partitions => more parallelism

---

## What Does "Lazy Execution" Mean?

```python
file = sc.textFile("hdfs://some-text-file")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
```

A DAG of transformations is built by Spark on the driver side

```python
counts.saveAsTextFile("hdfs://wordcount-out")
```

The action triggers execution of the whole DAG

---

## Transformation: filter()

Keep only the elements that satisfy a predicate

```python
rdd = sc.parallelize([1, 2, 3, 4, 5])

rdd.filter(lambda x: x % 2 == 0).collect()
# [2, 4]

rdd.filter(lambda x: x < 3).collect()
# [1, 2]
```

---

## Creating a DataFrame: from a table in Hive

Load the entire table (here `hc` is a HiveContext)

```python
df = hc.table("patients")
```

Load using a SQL query

```python
df1 = hc.sql("SELECT * FROM patients WHERE age > 50")

df2 = hc.sql("""
    SELECT col1 AS timestamp, SUBSTR(date, 1, 4) AS year, event
    FROM events
    WHERE SUBSTR(date, 1, 4) > '2014'""")
```

---

# Defining Workflows with Oozie ![](Gents.jpg)

^ BE SURE IN THIS SECTION TO PULL UP THE APACHE PROJECT PAGE: http://oozie.apache.org
^ With the addition in Rev6 of the Spark sections just before this one, feel free to drop the lab (or even this entire section) as time permits.

---

## Overview of Oozie

Oozie has two main capabilities:

- **Oozie Workflow**: a collection of actions arranged in a directed acyclic graph
- **Oozie Coordinator**: a recurring workflow, triggered on a schedule or by data availability (see the coordinator sketch at the end of this section)

---

## Defining an Oozie Workflow

![inline](OozieWorkflow.png)

---

## Pig Actions

A Pig action runs a Pig script as one node of the workflow; a kill node reports the failure message. The action name, script name, and transition targets below are placeholders:

```xml
<action name="pig-node">
    <pig>
        <job-tracker>${resourceManager}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>myscript.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>

<kill name="fail">
    <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
```
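Where does a Pig action sit in a complete workflow definition? The sketch below is illustrative only: the app name, node names, script name, and schema version are assumptions and should be adapted to the course environment.

```xml
<!-- Illustrative workflow.xml: start -> Pig action -> end, with a kill node for failures.
     App name, node names, script name, and schema version are placeholders. -->
<workflow-app name="pig-wf" xmlns="uri:oozie:workflow:0.4">

    <start to="pig-node"/>

    <action name="pig-node">
        <pig>
            <job-tracker>${resourceManager}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>myscript.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <end name="end"/>

</workflow-app>
```

The application directory (workflow.xml plus the Pig script) is uploaded to HDFS, and the job is typically submitted with the Oozie client, for example `oozie job -oozie http://<oozie-host>:11000/oozie -config job.properties -run`, where job.properties supplies values for `nameNode`, `resourceManager`, and `oozie.wf.application.path`.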
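The Overview of Oozie slide calls a Coordinator a recurring workflow. As a hedged illustration, the coordinator sketch below runs a workflow once a day; the app name, dates, frequency, and application path are placeholders, not values from the course materials.

```xml
<!-- Illustrative coordinator.xml: runs the referenced workflow once a day.
     App name, dates, frequency, and application path are placeholders. -->
<coordinator-app name="daily-pig-coord"
                 frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${nameNode}/user/student/pig-workflow</app-path>
        </workflow>
    </action>
</coordinator-app>
```

A coordinator is submitted the same way as a workflow, with `oozie.coord.application.path` pointing at the HDFS directory that holds coordinator.xml.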