Apache Spark Basics
Apache Spark 1
Starting Spark
Change to the Spark installation directory:
cd $SPARK_HOME
Start the Scala shell (spark-shell) with the following command:
./bin/spark-shell
Start the Python shell (pyspark) with the following command:
./bin/pyspark
Start the R shell (SparkR) with the following command:
./bin/sparkR
Spark Application details
Driver program: The program that runs the user's main function and executes parallel
operations on a cluster.
SparkConf: Object that holds configuration information about your application.
SparkContext: Object used to access the cluster. spark-shell and pyspark create one for you and expose it as sc.
Resilient distributed dataset (RDD): Collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel.
Operations on RDD
Transformation: Returns another RDD.
Action: Returns a value to the driver program.
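Spark evaluates transformations lazily: nothing is computed until an action asks for a value. The sketch below is only a plain-Python analogy and does not use Spark. A generator stands in for a transformation and list() stands in for an action; the names squared and log are made up for illustration.

```python
# Plain-Python analogy (assumption: a generator models laziness; no Spark involved).
log = []  # records when work actually happens


def squared(values):
    """Stand-in for a transformation: describes the work, does not run it yet."""
    for v in values:
        log.append(v)  # side effect so we can see when evaluation occurs
        yield v * v


pending = squared([1, 2, 3])   # like rdd.map(...): returns immediately
calls_before = len(log)        # nothing has been computed at this point
result = list(pending)         # like collect(): forces the computation
calls_after = len(log)         # every element has now been processed
```

Chaining several transformations and ending with one action behaves the same way: the work runs once, when the action is called.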
Create a file spark_notes.txt with the contents below
Apache Spark is an open source Big Data analytical framework.
RDD is the main abstraction in Apache Spark.
Apache Spark can also be called a unified engine.
Scala is a programming language that supports functional programming.
Apache Spark is developed using the Scala programming language.
Let's start learning Apache Spark and become a Data Scientist in the Big Data space.
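RDD creation (Scala)
1) From an in-memory collection:
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val multiply = rdd.map(x => x * x)   // transformation: square each element
multiply.collect()                   // action: Array(1, 4, 9, 16, 25)
2) From a text file:
val textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first()                      // action: returns the first line of the file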
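RDD creation (Python)
1) From an in-memory collection:
rdd = sc.parallelize([1, 2, 3, 4, 5])
multiply = rdd.map(lambda x: x * x)   # transformation: square each element
multiply.collect()                    # action: [1, 4, 9, 16, 25]
2) From a text file:
textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first()                       # action: returns the first line of the file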
Examples
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
lines.count() // Count the number of items (lines) in this RDD
val sparkLines = lines.filter(line => line.contains("Spark")) // keep only lines that mention "Spark"
sparkLines.count()
val scalaLines = lines.filter(line => line.contains("Scala")) // keep only lines that mention "Scala"
scalaLines.count()
Word count example
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
val flatMapWords = lines.flatMap(line => line.split(" "))      // split every line into words
flatMapWords.collect()
val wordwithOneNumber = flatMapWords.map(word => (word, 1))    // pair each word with the number 1
val count = wordwithOneNumber.reduceByKey((x, y) => x + y)     // sum the 1s for each distinct word
count.collect()
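As a rough mental model of the word count pipeline, the plain-Python sketch below mirrors its three steps (split into words, pair each word with 1, sum the 1s per word) on a small in-memory list. It runs without Spark. The helper names to_pairs and reduce_by_key and the sample lines are assumptions made for illustration; Spark performs the same reduction in parallel across partitions.

```python
def to_pairs(lines):
    """flatMap + map: split every line on spaces and emit (word, 1) pairs."""
    return [(word, 1) for line in lines for word in line.split(" ")]


def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: fold the values of each key with func."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    return grouped


sample = ["spark core", "spark ml", "spark core"]   # hypothetical sample input
pairs = to_pairs(sample)
counts = reduce_by_key(pairs, lambda x, y: x + y)   # same lambda shape as the Scala code
```

The function passed to reduce_by_key combines two values of the same key, which is why addition yields a per-word total.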
flatMap() and map()
val lines = sc.parallelize(List("hello world", "hello spark"))
val wordsFlatMap = lines.flatMap(line => line.split(" "))
wordsFlatMap.collect()   // Array(hello, world, hello, spark)
val wordsMap = lines.map(line => line.split(" "))
wordsMap.collect()       // Array(Array(hello, world), Array(hello, spark))
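The difference between the two operations can also be seen without a cluster. The sketch below is plain Python, not Spark code: list comprehensions stand in for map and flatMap, and the helper name split_words is an assumption made for illustration.

```python
def split_words(line):
    """Same job as line.split(" ") in the Scala code."""
    return line.split(" ")


lines = ["hello world", "hello spark"]

# map: exactly one output element per input element, so each line becomes a list.
mapped = [split_words(line) for line in lines]

# flatMap: the per-line lists are flattened into a single sequence of words.
flat_mapped = [word for line in lines for word in split_words(line)]
```

In pyspark the same function can be passed directly to an RDD, for example sc.parallelize(lines).flatMap(split_words).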
Custom method
def sp(n: String): Array[String] = { n.split(" ") }   // named function that splits a string on spaces
val rdd = sc.parallelize(List("Apache spark", "spark core", "spark ml"))
val words = rdd.flatMap(sp)       // one flat RDD of words
words.collect()
val wordArrays = rdd.map(sp)      // one Array[String] per input string
wordArrays.collect()
Transformations & Actions
Assignments
Take the list 1,2,3,4,5,1,2,3,1.
Write code for the problems below:
1) Add each element to itself for the above list.
2) Add one to each element in the list.
3) Filter 1 from the above list.
4) Find the top 10 words in a file.
5) Keep only the words that are longer than 4 characters from a file.
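To check your Spark answers to problems 1 to 3, the sketch below computes the expected values in plain Python, without Spark. It assumes that problem 1 means x + x and that problem 3 means dropping every 1 from the list; if the exercise intends something else, adjust accordingly.

```python
data = [1, 2, 3, 4, 5, 1, 2, 3, 1]

doubled = [x + x for x in data]              # problem 1: each element added to itself
plus_one = [x + 1 for x in data]             # problem 2: one added to each element
without_ones = [x for x in data if x != 1]   # problem 3 (assumed meaning: remove the 1s)
```

In Spark the same expressions would be passed as functions to rdd.map (problems 1 and 2) and rdd.filter (problem 3), followed by collect().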
Thanks