Apache Spark 
In-Memory Data Processing 
September 2014 Meetup 
Organized by Big Data Hyderabad Meetup Group. 
http://www.meetup.com/Big-Data-Hyderabad/ 
Rahul Jain 
@rahuldausa
Agenda 
• Why Spark 
• Introduction 
• Basics 
• Hands-on 
– Installation 
– Examples 
Quick Questionnaire 
How many people know or work with Scala? 
How many people know or work with Python? 
How many people know of, have heard of, or are using Spark?
Why Spark ? 
• Most Machine Learning algorithms are iterative, because each iteration 
can improve the results 
• With a disk-based approach, each iteration’s output is written to disk, 
making it slow (see the caching sketch after this slide) 
Hadoop execution flow 
Spark execution flow 
http://www.wiziq.com/blog/hype-around-apache-spark/
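A minimal sketch of the in-memory advantage (assumes the spark-shell’s SparkContext sc; the HDFS path is a placeholder): with cache(), the file is read from disk once and every later pass runs against memory. 

val data = sc.textFile("hdfs://.../input.txt").cache() // read from disk once, keep in memory 
for (i <- 1 to 10) { 
  // every pass after the first scans the cached partitions, not the disk 
  println("pass " + i + ": " + data.filter(line => line.contains("ERROR")).count()) 
}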
About Apache Spark 
• Initially started at UC Berkeley in 2009 
• Fast and general-purpose cluster computing system 
• 10x (on disk) to 100x (in-memory) faster than Hadoop MapReduce 
• Most popular for running iterative Machine Learning algorithms 
• Provides high-level APIs in 
• Java 
• Scala 
• Python 
• Integrates with Hadoop and its ecosystem, and can read existing data 
• http://spark.apache.org/
Spark Stack 
• Spark SQL 
– For SQL and structured data processing 
• MLlib 
– Machine Learning algorithms 
• GraphX 
– Graph processing 
• Spark Streaming 
– Stream processing of live data streams (see the sketch below) 
http://spark.apache.org
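As a taste of one stack component, a minimal Spark Streaming word count over a socket (a sketch against the 1.x API; host and port are placeholders): 

import org.apache.spark.SparkConf 
import org.apache.spark.streaming.{Seconds, StreamingContext} 
import org.apache.spark.streaming.StreamingContext._ 

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") 
val ssc = new StreamingContext(conf, Seconds(1))    // 1-second micro-batches 
val lines = ssc.socketTextStream("localhost", 9999) // DStream of lines from a socket 
val counts = lines.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) 
counts.print()                                      // print each batch's counts 
ssc.start() 
ssc.awaitTermination()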
Execution Flow 
http://spark.apache.org/docs/latest/cluster-overview.html
Terminology 
• Application Jar 
– User program and its dependencies (except the Hadoop & Spark jars) bundled into a 
jar file 
• Driver Program 
– The process that starts the execution (runs the main() function) 
• Cluster Manager 
– An external service to manage resources on the cluster (standalone manager, 
YARN, Apache Mesos) 
• Deploy Mode 
– cluster : Driver runs inside the cluster 
– client : Driver runs outside the cluster (see the spark-submit sketch below)
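A sketch of picking the deploy mode at submission time (the master URL, class name and jar are placeholders): 

./bin/spark-submit --master spark://host:7077 --deploy-mode cluster --class MyApp my-app.jar 
./bin/spark-submit --master spark://host:7077 --deploy-mode client --class MyApp my-app.jar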
Terminology (contd.) 
• Worker Node : Node that runs the application program in the cluster 
• Executor 
– Process launched on a worker node, that runs the Tasks 
– Keeps data in memory or disk storage 
• Task : A unit of work that will be sent to an executor 
• Job 
– Consists of multiple tasks 
– Created based on an Action 
• Stage : Each Job is divided into smaller sets of tasks called Stages, which run 
sequentially and depend on each other 
• SparkContext : 
– Represents the connection to a Spark cluster; it can be used to create RDDs, 
accumulators and broadcast variables on that cluster (see the sketch below).
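A minimal sketch of those three uses from the spark-shell (1.x API; the numbers are arbitrary): 

val rdd = sc.parallelize(1 to 100)  // an RDD from a local collection 
val factor = sc.broadcast(10)       // read-only value shipped once to each executor 
val sevens = sc.accumulator(0)      // counter that executors add to and the driver reads 
rdd.foreach { x => 
  if (x % 7 == 0) sevens += 1       // factor.value would also be usable here 
} 
println(sevens.value)               // prints 14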
Resilient Distributed Dataset (RDD) 
• Resilient Distributed Dataset (RDD) is the basic abstraction in Spark 
• Immutable, partitioned collection of elements that can be operated on in parallel 
• Basic Operations 
– map 
– filter 
– persist 
• Multiple implementations 
– PairRDDFunctions : RDDs of key-value pairs; groupByKey, join 
– DoubleRDDFunctions : Operations on RDDs of double values 
– SequenceFileRDDFunctions : Operations related to SequenceFiles 
• RDD main characteristics: 
– A list of partitions 
– A function for computing each split 
– A list of dependencies on other RDDs 
– Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned) 
– Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file) 
• Custom RDDs can also be implemented, by overriding these functions (see the sketch below)
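As an illustration (not production code), a custom RDD only has to say what its partitions are and how to compute each one; a sketch against the 1.x API: 

import org.apache.spark.{Partition, SparkContext, TaskContext} 
import org.apache.spark.rdd.RDD 

// Each partition serves a fixed range of integers 
class RangePartition(val index: Int, val start: Int, val end: Int) extends Partition 

class SimpleRangeRDD(sc: SparkContext, n: Int, numParts: Int) 
    extends RDD[Int](sc, Nil) {                 // Nil: no parent dependencies 
  override protected def getPartitions: Array[Partition] = 
    Array.tabulate[Partition](numParts)(i => 
      new RangePartition(i, i * n / numParts, (i + 1) * n / numParts)) 
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = { 
    val p = split.asInstanceOf[RangePartition] 
    (p.start until p.end).iterator              // compute this split's elements 
  } 
} 

// e.g. new SimpleRangeRDD(sc, 100, 4).collect().sum == 4950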
Cluster Deployment 
• Standalone Deploy Mode 
– Simplest way to deploy Spark on a private cluster (see the sketch below) 
• Amazon EC2 
– EC2 scripts are available 
– Very quick to launch a new cluster 
• Apache Mesos 
• Hadoop YARN
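Standalone mode can be brought up by hand (a sketch; <host> is a placeholder): 

./sbin/start-master.sh                                               # master at spark://<host>:7077 
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://<host>:7077   # start a worker 
./bin/spark-shell --master spark://<host>:7077                      # attach an application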
Monitoring 
Monitoring – Stages 
(These slides are screenshots of the Spark web UI, which shows running jobs and their stages; by default it is served on port 4040 of the driver)
Let’s try some examples…
Spark Shell 
./bin/spark-shell --master local[2] 
The --master option specifies the master URL for a distributed cluster, or local to run 
locally with one thread, or local[N] to run locally with N threads. You should start by 
using local for testing. 
./bin/run-example SparkPi 10 
This estimates the value of Pi using 10 slices (partitions)
Basic operations… 
scala> val textFile = sc.textFile("README.md") 
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 
scala> textFile.count() // Number of items in this RDD 
res0: Long = 126 
scala> textFile.first() // First item in this RDD 
res1: String = # Apache Spark 
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) 
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09 
Simpler, as a one-liner: 
scala> textFile.filter(line => line.contains("Spark")).count() 
// How many lines contain "Spark"? 
res3: Long = 15
Map - Reduce 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b) 
res4: Int = 15 
scala> import java.lang.Math 
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) 
res5: Int = 15 
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) 
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8 
wordCounts.collect()
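To inspect just the most frequent words rather than collecting everything, one possible continuation is to swap each pair so the count becomes the key, then sort: 

scala> wordCounts.map { case (word, count) => (count, word) }.sortByKey(false).take(5)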
With Caching… 
scala> linesWithSpark.cache() 
res7: spark.RDD[String] = spark.FilteredRDD@17e51082 
scala> linesWithSpark.count() 
res8: Long = 15 
scala> linesWithSpark.count() 
res9: Long = 15
With HDFS… 
val lines = sc.textFile("hdfs://...") 
val errors = lines.filter(line => line.startsWith("ERROR")) 
println("Total errors: " + errors.count())
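A natural continuation of this classic log-mining example (a sketch using the same RDDs): cache the filtered RDD once, then run several queries against memory: 

errors.cache()   // keep the ERROR lines in memory after the first action 
val mysqlErrors = errors.filter(line => line.contains("MySQL")).count() 
val phpErrors = errors.filter(line => line.contains("PHP")).count()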
Standalone (Scala) 
/* SimpleApp.scala */ 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 
object SimpleApp { 
def main(args: Array[String]) { 
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system 
val conf = new SparkConf().setAppName("Simple Application").setMaster("local") 
val sc = new SparkContext(conf) 
val logData = sc.textFile(logFile, 2).cache() 
val numAs = logData.filter(line => line.contains("a")).count() 
val numBs = logData.filter(line => line.contains("b")).count() 
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs)) 
} 
}
Standalone (Java) 
/* SimpleApp.java */ 
import org.apache.spark.api.java.*; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.function.Function; 
public class SimpleApp { 
public static void main(String[] args) { 
String logFile = "YOUR_SPARK_HOME/README.md"; // Should be some file on your system 
SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local"); 
JavaSparkContext sc = new JavaSparkContext(conf); 
JavaRDD<String> logData = sc.textFile(logFile).cache(); 
long numAs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("a"); } 
}).count(); 
long numBs = logData.filter(new Function<String, Boolean>() { 
public Boolean call(String s) { return s.contains("b"); } 
}).count(); 
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs); 
} 
}
Standalone (Python) 
"""SimpleApp.py""" 
from pyspark import SparkContext 
logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system 
sc = SparkContext("local", "Simple App") 
logData = sc.textFile(logFile).cache() 
numAs = logData.filter(lambda s: 'a' in s).count() 
numBs = logData.filter(lambda s: 'b' in s).count() 
print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
Job Submission 
$SPARK_HOME/bin/spark-submit \ 
  --class "SimpleApp" \ 
  --master local[4] \ 
  target/scala-2.10/simple-project_2.10-1.0.jar
Configuration 
val conf = new SparkConf() 
.setMaster("local") 
.setAppName("CountingSheep") 
.set("spark.executor.memory", "1g") 
val sc = new SparkContext(conf)
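The same settings can also be supplied at submit time instead of being hard-coded (a sketch; the jar name is a placeholder): 

$SPARK_HOME/bin/spark-submit \ 
  --master local \ 
  --name "CountingSheep" \ 
  --conf spark.executor.memory=1g \ 
  my-app.jar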
Questions ? 
Thanks! 
@rahuldausa on twitter and slideshare 
http://www.linkedin.com/in/rahuldausa 
Join us for Solr, Lucene, Elasticsearch, Machine Learning, IR: 
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ 
http://www.meetup.com/DataAnalyticsGroup/ 
Join us for Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting-edge technologies: 
http://www.meetup.com/Big-Data-Hyderabad/
