APACHE-SPARK	

LARGE-SCALE DATA PROCESSING ENGINE
Bartosz Bogacki <bbogacki@bidlab.pl>
CTO, CODER, ROCK CLIMBER
• current: 	

• Chief Technology Officer at Bidlab
• previous:	

• IT Director at
InternetowyKantor.pl SA
• Software Architect / Project
Manager at Wolters Kluwer
Polska
• find out more (if you care):	

• linkedin.com/in/bartoszbogacki
WE PROCESS MORE THAN
200GB OF LOGS DAILY
Did I mention that…?
WHY?
• To discover inventory and potential	

• To optimize traffic	

• To optimize campaigns	

• To learn about trends	

• To calculate conversions
APACHE SPARK !
HISTORY
• 2013-06-19 Project enters Apache incubation	

• 2014-02-19 Project established as an Apache Top-Level Project.

• 2014-05-30 Spark 1.0.0 released
• "Apache Spark is a (lightning-) fast and
general-purpose cluster computing system"	

• Engine compatible with Apache Hadoop	

• Up to 100x faster than Hadoop MapReduce (for in-memory workloads)

• Less code to write, more flexible

• Active community (117 developers
contributed to release 1.0.0)
KEY CONCEPTS
• Runs standalone or on YARN / Mesos as the resource manager

• HDFS / S3 support built-in

• RDD - Resilient Distributed Dataset

• Transformations & Actions

• Written in Scala, with APIs for Java / Scala / Python
ECOSYSTEM
• Spark Streaming	

• Shark	

• MLlib (machine learning)	

• GraphX	

• Spark SQL
RDD
• Collections of objects	

• Stored in memory (or disk)	

• Spread across the cluster	

• Auto-rebuild on failure
TRANSFORMATIONS
• map / flatMap	

• filter	

• union / intersection / join / cogroup	

• distinct	

• many more…
ACTIONS
• reduce / reduceByKey	

• foreach	

• count / countByKey	

• first / take / takeOrdered	

• collect / saveAsTextFile / saveAsObjectFile
EXAMPLES
val s1=sc.parallelize(Array(1,2,3,4,5))
val s2=sc.parallelize(Array(3,4,6,7,8))
val s3=sc.parallelize(Array(1,2,2,3,3,3))
!
s2.map(num => num * num)
// => 9, 16, 36, 49, 64
s1.reduce((a,b) => a + b)
// => 15
s1 union s2
// => 1, 2, 3, 4, 5, 3, 4, 6, 7, 8
s1 subtract s2
// => 1, 5, 2
s1 intersection s2
// => 4, 3
s3.distinct
// => 1, 2, 3
EXAMPLES
val set1 = sc.parallelize(Array[(Integer,String)](
  (1,"bartek"), (2,"jacek"), (3,"tomek")))
val set2 = sc.parallelize(Array[(Integer,String)](
  (2,"nowak"), (4,"kowalski"), (5,"iksiński")))
!
set1 join set2
// =>(2,(jacek,nowak))
set1 leftOuterJoin set2
// =>(1,(bartek,None)), (2,(jacek,Some(nowak))), (3,
(tomek,None))
set1 rightOuterJoin set2
// =>(4,(None,kowalski)), (5,(None,iksiński)), (2,
(Some(jacek),nowak))
EXAMPLES
set1.cogroup(set2).sortByKey()
// => (1,(ArrayBuffer(bartek),ArrayBuffer())), (2,
(ArrayBuffer(jacek),ArrayBuffer(nowak))), (3,
(ArrayBuffer(tomek),ArrayBuffer())), (4,
(ArrayBuffer(),ArrayBuffer(kowalski))), (5,
(ArrayBuffer(),ArrayBuffer(iksiński)))
!
set2.map((t) => (t._1, t._2.length))
// => (2,5), (4,8), (5,8)
!
val set3 = sc.parallelize(Array[(String,Long)](
  ("onet.pl",1), ("onet.pl",1), ("wp.pl",1)))
!
set3.reduceByKey((n1,n2) => n1 + n2)
// => (onet.pl,2), (wp.pl,1)
HANDS ON
RUNNING EC2 	

SPARK CLUSTER
./spark-ec2 -k spark-key -i spark-key.pem \
  -s 5 \
  -t m3.2xlarge \
  launch cluster-name \
  --region=eu-west-1
SPARK CONSOLE
LINKING WITH SPARK
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
</dependency>
If you want to use HDFS	

!
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
If you want to use Spark Streaming	

!
groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.0.0
INITIALIZING
• SparkConf conf = new SparkConf()
.setAppName("TEST")
.setMaster("local");	

• JavaSparkContext sc = new
JavaSparkContext(conf);
CREATING RDD
• List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);	

• JavaRDD<Integer> distData = sc.parallelize(data);
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("data.txt");
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("hdfs://<HOST>:<PORT>/daily/data-20-00.txt");

• JavaRDD<String> logLines = sc.textFile("s3n://my-bucket/daily/data-*.txt");
TRANSFORM
JavaRDD<Log> logs =
logLines.map(new Function<String, Log>() {
	public Log call(String s) {
		return LogParser.parse(s);
	}
}).filter(new Function<Log, Boolean>() {
	public Boolean call(Log log) {
		return log.getLevel() == 1;
	}
});
ACTION :)
logs.count();
TRANSFORM-ACTION
List<Tuple2<String,Integer>> result = 	
	 sc.textFile("/data/notifies-20-00.txt")	
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 })	
	 .reduceByKey(new Function2<Integer, Integer, Integer>(){	
	 	 	 @Override	
	 	 	 public Integer call(Integer v1, Integer v2) throws Exception {	
	 	 	 	 return v1 + v2;	
	 	 	 }})	
	 .sortByKey()	
.collect();
FUNCTIONS, 	

PAIRFUNCTIONS, 	

ETC.
BROADCAST VARIABLES
• "allow the programmer to keep a read-only
variable cached on each machine rather than
shipping a copy of it with tasks"
Broadcast<int[]> broadcastVar =
sc.broadcast(new int[] {1, 2, 3});
!
broadcastVar.value();
// returns [1, 2, 3]
ACCUMULATORS
• variables that are only “added” to through an associative
operation (add())	

• only the driver program can read the accumulator’s value
Accumulator<Integer> accum = sc.accumulator(0);
!
sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x ->
accum.add(x));
!
accum.value();
// returns 10
SERIALIZATION
• All objects used in your code have to be
serializable	

• Otherwise:
org.apache.spark.SparkException: Job aborted: Task not
serializable: java.io.NotSerializableException
USE KRYO SERIALIZER
public class MyRegistrator implements KryoRegistrator {	
	@Override	
	public void registerClasses(Kryo kryo) {	
		kryo.register(BidRequest.class);	
		kryo.register(NotifyRequest.class);	
		kryo.register(Event.class);	
	}	
}
sparkConfig.set(	
	 "spark.serializer", "org.apache.spark.serializer.KryoSerializer");	
sparkConfig.set(	
	 "spark.kryo.registrator", "pl.instream.dsp.offline.MyRegistrator");	
sparkConfig.set(	
	 "spark.kryoserializer.buffer.mb", "10");
CACHE !
JavaPairRDD<String, Integer> cachedSet = 	
	 sc.textFile("/data/notifies-20-00.txt")	
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception
	 	 	 {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 }).cache();
RDD PERSISTENCE
• MEMORY_ONLY	

• MEMORY_AND_DISK	

• MEMORY_ONLY_SER	

• MEMORY_AND_DISK_SER	

• DISK_ONLY	

• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …	

• OFF_HEAP (Tachyon, experimental)
PARTITIONS
• RDD is partitioned	

• You may (and probably should) control the number
and size of partitions with the coalesce() method (see the sketch after this list)	

• By default roughly 1 input split (file block) = 1 partition
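A rough sketch of doing this from spark-shell (the path, RDD names and partition counts below are only illustrative assumptions):

val logLines = sc.textFile("hdfs://<HOST>:<PORT>/daily/*.txt")
// coalesce() shrinks the number of partitions (no shuffle by default)
val merged = logLines.coalesce(32)
// repartition() reshuffles the data into the given number of partitions
val spread = logLines.repartition(128)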
PARTITIONS
• If your partitions are too big, you’ll face:
[GC 5208198K(5208832K), 0,2403780 secs]
[Full GC 5208831K->5208212K(5208832K), 9,8765730 secs]
[Full GC 5208829K->5208238K(5208832K), 9,7567820 secs]
[Full GC 5208829K->5208295K(5208832K), 9,7629460 secs]
[GC 5208301K(5208832K), 0,2403480 secs]
[Full GC 5208831K->5208344K(5208832K), 9,7497710 secs]
[Full GC 5208829K->5208366K(5208832K), 9,7542880 secs]
[Full GC 5208831K->5208415K(5208832K), 9,7574860 secs]
WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(0, ip-xx-xx-xxx-xxx.eu-
west-1.compute.internal, 60048, 0) with no recent heart
beats: 64828ms exceeds 45000ms
RESULTS
• result.saveAsTextFile("hdfs://<HOST>:<PORT>/out.txt")	

• result.saveAsObjectFile("/result/out.obj")	

• collect()
PROCESS RESULTS 	

PARTITION BY PARTITION
for (Partition partition : result.rdd().partitions()) {	
	 List<String> subresult[] = 	
	 	 result.collectPartitions(new int[] { partition.index() });	
	 	
	 for (String line : subresult[0])	
	 {	
	 	 System.out.println(line);	
	 }	
}
SPARK STREAMING
"SPARK STREAMING IS AN EXTENSION OF THE
CORE SPARK API THAT ENABLES
HIGH-THROUGHPUT, FAULT-TOLERANT
STREAM PROCESSING OF LIVE DATA STREAMS."
HOW DOES IT WORK?
DSTREAMS
• continuous stream of data, either the input data
stream received from source, or the processed
data stream generated by transforming the input
stream	

• represented by a continuous sequence of RDDs
INITIALIZING
• SparkConf conf = new
SparkConf().setAppName("Real-Time
Analytics").setMaster("local");	

• JavaStreamingContext jssc = new
JavaStreamingContext(conf, new
Duration(TIME_IN_MILLIS));
CREATING DSTREAM
• JavaReceiverInputDStream<String> logLines =
jssc.socketTextStream(sourceAddr, sourcePort,
StorageLevels.MEMORY_AND_DISK_SER);
DATA SOURCES
• plain TCP sockets	

• Apache Kafka	

• Apache Flume	

• ZeroMQ
TRANSFORMATIONS
• map, flatMap, filter, union, join, etc.	

• transform	

• updateStateByKey
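A minimal running-count sketch with updateStateByKey (the events DStream, the ssc context and the checkpoint path are illustrative assumptions; stateful operations require a checkpoint directory):

import org.apache.spark.streaming.StreamingContext._

// keep a running count per key across batches
ssc.checkpoint("hdfs://<HOST>:<PORT>/checkpoints")
val counts = events.map(event => (event, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], running: Option[Int]) =>
    Some(newValues.sum + running.getOrElse(0))
  }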
WINDOW OPERATIONS
• window	

• countByWindow / countByValueAndWindow	

• reduceByWindow / reduceByKeyAndWindow
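For example, a sketch of a sliding count with reduceByKeyAndWindow (the words DStream and the 60s window / 10s slide are illustrative assumptions):

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext._

// count words seen in the last 60 seconds, recomputed every 10 seconds
val windowedCounts = words.map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))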
OUTPUT OPERATIONS
• print	

• foreachRDD	

• saveAsObjectFiles	

• saveAsTextFiles	

• saveAsHadoopFiles
THINGS TO REMEMBER
USE SPARK-SHELL TO LEARN
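For example, started against a local master (the thread count is just an example):

./bin/spark-shell --master local[4]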
PROVIDE ENOUGH RAM 	

TO WORKERS
PROVIDE ENOUGH RAM 	

TO EXECUTOR
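For example, via SparkConf (the 4g value is only an illustration, tune it to your data):

sparkConfig.set("spark.executor.memory", "4g");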
SET FRAME SIZE / BUFFERS
ACCORDINGLY
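For example (values are illustrative; spark.akka.frameSize is given in MB):

sparkConfig.set("spark.akka.frameSize", "64");
sparkConfig.set("spark.kryoserializer.buffer.mb", "10");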
USE KRYO SERIALIZER
SPLIT DATA INTO AN APPROPRIATE
NUMBER OF PARTITIONS
PACKAGE YOUR APPLICATION	

IN AN UBER-JAR
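With Maven, the shade plugin is a common way to build one; a minimal sketch (the version is just an example):

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>shade</goal></goals>
</execution>
</executions>
</plugin>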
DESIGN YOUR DATA FLOW
AND…
BUILD A FRAMEWORK TO
PROCESS DATA EFFICIENTLY
IT’S EASIER WITH SCALA!
	 // word count example	
	 inputLine.flatMap(line => line.split(" "))	
	 	 .map(word => (word, 1))	
	 	 .reduceByKey(_ + _);
HOW DO WE USE SPARK?
THANKS!
we’re hiring !	

mail me: bbogacki@bidlab.pl
