Apache Spark Tutorial
Farzad Nozarian
4/18/15 @AUT
Purpose
This tutorial provides a quick introduction to using Spark. We will first
introduce the API through Spark’s interactive shell, then show how to write
applications in Scala.
To follow along with this guide, first download a packaged release of Spark
from the Spark website.
Interactive Analysis with the Spark Shell - Basics
• Spark’s shell provides a simple way to learn the API, as well as a powerful tool
to analyze data interactively.
• It is available in either Scala or Python.
• Start it by running the following in the Spark directory:
• RDDs can be created from Hadoop InputFormats (such as HDFS files) or by
transforming other RDDs.
• Let’s make a new RDD from the text of the README file in the Spark source
directory:
./bin/spark-shell
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
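• Besides a local text file, the same SparkContext can build RDDs from an in-memory collection or from an HDFS path. A minimal sketch (the collection and HDFS URI below are illustrative, not from the original slides):
scala> val numbers = sc.parallelize(1 to 100) // RDD from a local Scala collection
scala> val hdfsLines = sc.textFile("hdfs://namenode:9000/data/README.md") // RDD from a file on HDFS (hypothetical URI)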
Interactive Analysis with the Spark Shell - Basics
• RDDs have actions, which return values, and transformations, which return
pointers to new RDDs. Let’s start with a few actions:
• Now let’s use a transformation:
• We will use the filter transformation to return a new RDD with a subset of the
items in the file.
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
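• Transformations such as filter are lazy, so nothing is computed until an action runs. For a quick peek at the result we could use the take action (a sketch, output omitted):
scala> linesWithSpark.take(3) // returns the first three matching lines as an Array[String]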
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can chain together transformations and actions:
• RDD actions and transformations can be used for more complex computations.
• Let’s say we want to find the line with the most words:
• The arguments to map and reduce are Scala function literals (closures), and can
use any language feature or Scala/Java library.
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
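• If we wanted the longest line itself rather than just its length, one possible variation (a sketch, not part of the original deck) keeps each line paired with its word count:
scala> textFile.map(line => (line, line.split(" ").size)).reduce((a, b) => if (a._2 > b._2) a else b)
// returns a (line, wordCount) pair for the line with the most words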
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can easily call functions declared elsewhere.
• We’ll use the Math.max() function to make the previous code easier to understand:
• One common data flow pattern is MapReduce, as popularized by Hadoop.
• Spark can implement MapReduce flows easily:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
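• A common follow-up is to look only at the most frequent words. One way (a sketch using standard RDD operations, not from the original slides) is to swap each pair and sort by count:
scala> wordCounts.map(_.swap).sortByKey(false).take(10) // ten (count, word) pairs with the highest counts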
Interactive Analysis with the Spark Shell - More on RDD Operations
• Here, we combined the flatMap, map and reduceByKey transformations to
compute the per-word counts in the file as an RDD of (String, Int) pairs.
• To collect the word counts in our shell, we can use the collect action:
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3),
(Because,1), (Python,2), (agree,1), (cluster.,1), ...)
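• Keep in mind that collect brings the entire RDD back to the driver, which is only appropriate for small results; for large RDDs a lighter-weight peek such as take is usually preferred (a sketch):
scala> wordCounts.take(5) // fetch just five (word, count) pairs instead of everything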
Interactive Analysis with the Spark Shell - Caching
• Spark also supports pulling data sets into a cluster-wide in-memory cache.
• This is very useful when data is accessed repeatedly:
• Querying a small “hot” dataset.
• Running an iterative algorithm like PageRank.
• Let’s mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
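• cache() uses the default MEMORY_ONLY storage level. If the dataset might not fit in memory, we could unpersist it and persist again with an explicit storage level (a sketch, assuming the standard StorageLevel API):
scala> import org.apache.spark.storage.StorageLevel
scala> linesWithSpark.unpersist() // release the cached copy first
scala> linesWithSpark.persist(StorageLevel.MEMORY_AND_DISK) // cache again, spilling to disk if it does not fit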
Self-Contained Applications
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Self-Contained Applications (Cont.)
• This program just counts the number of lines containing ‘a’ and the
number containing ‘b’ in the Spark README.
• Note that you’ll need to replace YOUR_SPARK_HOME with the location
where Spark is installed.
• Note that applications should define a main() method instead of
extending scala.App. Subclasses of scala.App may not work correctly.
• Unlike the earlier examples with the Spark shell, which initializes its own
SparkContext, we initialize a SparkContext as part of the program.
• We pass the SparkContext constructor a SparkConf object which
contains information about our application.
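• For example, when running the program locally without spark-submit, it is common to also set the master URL on the SparkConf; a sketch (the local[2] setting is illustrative, not part of the original code):
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]") // run locally with two worker threads; omit this when submitting to a cluster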
Self-Contained Applications (Cont.)
• Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt, which declares Spark as a dependency.
• For sbt to work correctly, we’ll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure.
• Then we can create a JAR package containing the application’s code and
use the spark-submit script to run our program.
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
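• Additional Spark modules would be declared in the same way; for instance, if the application also used Spark SQL (a hypothetical addition, not needed by SimpleApp):
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.3.1"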
Self-Contained Applications (Cont.)
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23
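• The same JAR could also be submitted to a cluster by changing only the --master argument; for example (the cluster URL is hypothetical):
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master spark://master-host:7077 \
  target/scala-2.10/simple-project_2.10-1.0.jar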