Apache Spark
AN ENGINE FOR LARGE-SCALE DATA PROCESSING
Introducing myself…
• Mylène Reiners
• Architect @Atos
• Focus innovation
Sketching the context
• Big Data
• New insights
• Analytics
• Data discovery
Sketching the context
• Hadoop
• Storing and managing data
Apache Spark
• Speed
• General purpose
Short demo in Scala (shell)
• Simple data analysis
• Read “README.md”
• Count the number of lines
Role of SparkContext (sc)
RDD
• Resilient Distributed Dataset
• Creation
• Transformations
• Actions
RDD
• Lazy
• Recomputed
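The "lazy" and "recomputed" behaviour above can be illustrated without a cluster. The sketch below is an analogy in plain Java using java.util.stream, not Spark code: intermediate stream operations stand in for RDD transformations, and terminal operations stand in for actions. The class name LazyPipelineSketch, its methods, and the sample lines are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyPipelineSketch {
    // Counts how often the "transformation" function actually runs.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Hypothetical sample input standing in for the lines of a text file.
    static final List<String> LINES = Arrays.asList("a b", "", "c");

    // Describes the pipeline without running it. Every get() builds a fresh,
    // unevaluated stream, loosely comparable to an uncached RDD lineage.
    public static Supplier<Stream<Integer>> lineLengths() {
        return () -> LINES.stream().map(line -> {
            mapCalls.incrementAndGet();
            return line.length();
        });
    }

    // Returns {calls after definition, calls after first "action",
    //          calls after second "action", sum of line lengths}.
    public static int[] run() {
        mapCalls.set(0);
        Supplier<Stream<Integer>> pipeline = lineLengths();
        int afterDefinition = mapCalls.get();          // nothing has run yet

        int total = pipeline.get().reduce(0, Integer::sum); // first "action"
        int afterFirstAction = mapCalls.get();

        List<Integer> lengths =
            pipeline.get().collect(Collectors.toList());    // second "action"
        int afterSecondAction = mapCalls.get();        // map ran again per element

        return new int[] {afterDefinition, afterFirstAction, afterSecondAction, total};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

Defining the pipeline does no work; each terminal operation re-runs the mapping over every element. In Spark, an RDD is likewise recomputed for each action unless it is kept with cache() or persist().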
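Java example (accumulator)
import java.util.Arrays;
import org.apache.spark.Accumulator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
(...)
// Spark 1.x API; sc is a JavaSparkContext
JavaRDD<String> rdd = sc.textFile(args[1]);
// Accumulator that counts blank lines across all tasks
final Accumulator<Integer> blankLines = sc.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(
    new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) {
        if (line.equals("")) {
          blankLines.add(1);
        }
        return Arrays.asList(line.split(" "));
      }
    });
// saveAsTextFile is the action that triggers the computation
callSigns.saveAsTextFile("output.txt");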
Apache Spark stack
Spark SQL
• Interface for working with (semi)structured data
Hive example
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the Hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the SchemaRDD and Row classes
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
(...)
JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext hiveCtx = new HiveContext(ctx);
Hive example (cont’d)
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql(
    "SELECT text, retweetCount FROM tweets ORDER BY "
    + "retweetCount LIMIT 10");
Spark Streaming
• Acting on data as soon as it arrives
• DStreams (discretized streams)
Example
// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter our DStream for lines with "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String line) {
    return line.contains("error");
  }});
// Print out the lines with errors
errorLines.print();
Example
// Start our streaming context
jssc.start();
// Wait for the job to finish
jssc.awaitTermination();
GraphX
• Graph processing
Example
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc, "followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("users.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
Example
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
MLlib
• Machine learning
Thank you

Apache Spark part of Eindhoven Java Meetup