Robert Hryniewicz
Developer Advocate
T: @RobH8z
E: rhryniewicz@hortonworks.com
Apache Spark Crash Course - DataWorks Summit – Sydney 2017
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Data Sources
→ Internet of Things (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
→ User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
[Chart: Data Growth in Zettabytes (ZB), 2006 to 2020; 50+ ZB in 2021]
Visualizing 50 ZB
The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
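The "split work across machines" idea can be sketched in plain Python. This is not Spark code: the "machines" are simulated as list partitions, and the function names `partition`, `count_words`, and `merge` are made up for illustration. The point it shows is that each partition is processed where it lives and only small summaries are combined afterwards.

```python
from collections import Counter


def partition(records, num_partitions):
    """Split records into roughly equal chunks, one per simulated machine."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(chunk):
    """Work done locally on one partition; no raw data leaves the 'machine'."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def merge(partials):
    """Only the small per-partition summaries are combined centrally."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input lines, used only to exercise the sketch.
lines = [
    "spark makes big data simple",
    "big data needs big clusters",
    "spark on yarn",
]
partials = [count_words(chunk) for chunk in partition(lines, 2)]
word_counts = merge(partials)
print(word_counts.most_common(3))
```

A real cluster also has to handle the failure and slow-node questions from the slide, for example by re-running a lost partition on another machine; the sketch leaves that out.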
Apache Spark Background
What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified, general data processing engine that operates across varied data workloads and platforms
Why Apache Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model – Fast!
– Effective for iterative computations and ML
→ Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark MLlib)
→ Spark SQL: Structured Data
→ Spark Streaming: Near Real-time
→ Spark MLlib: Machine Learning
→ GraphX: Graph Analysis
Spark	SQL
[Figure: spectrum from More Flexible to Better Storage and Performance]
Spark SQL Overview
→ Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
→ Two ways to manipulate data:
– DataFrame/Dataset API
– SQL query
SparkSession
What is it?
→ Main entry point for Spark functionality
→ Allows programming with DataFrame and Dataset APIs
→ Represented as spark and auto-initialized in a notebook-type environment (Zeppelin or Jupyter)
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
Data is described as a DataFrame with rows, columns, and a schema
[Diagram: a DataFrame with columns Col1, Col2, …, ColN; each record is a Row]
Sources
[Diagram: CSV, Avro, HIVE, and JSON sources are read through Spark SQL into a DataFrame with columns Col1, Col2, …, ColN]
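Create a DataFrame
Example
// Location of the input file; spark.read.json expects one JSON object per line by default
val path = "examples/flights.json"
// spark is the SparkSession; the schema is inferred from the JSON records
val flights = spark.read.json(path)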
Register a Temporary View (SQL API)
Example
flights.createOrReplaceTempView("flightsView")
Two API Examples: DataFrame and SQL APIs
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
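To see why the two styles return the same rows, here is a small stand-in that needs no Spark installation. It swaps Spark SQL for Python's built-in sqlite3 module and the DataFrame API for a list comprehension, so it only illustrates the equivalence and is not Spark code. The five delayed flights come from the result table on the slide; the two on-time rows are hypothetical and exist only so the filter has something to drop.

```python
import sqlite3

rows = [
    ("IAD", "TPA", 19), ("IND", "BWI", 34), ("IND", "JAX", 25),
    ("IND", "LAS", 67), ("IND", "MCO", 94),
    ("AAA", "BBB", 3), ("CCC", "DDD", -2),  # hypothetical on-time flights
]

# "DataFrame style": projection and filter are chained in code.
api_result = [(o, d, delay) for (o, d, delay) in rows if delay > 15][:5]

# "SQL style": register the data under a name, then query it with SQL text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flightsView (Origin TEXT, Dest TEXT, DepDelay INTEGER)")
conn.executemany("INSERT INTO flightsView VALUES (?, ?, ?)", rows)
sql_result = conn.execute(
    "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15 LIMIT 5"
).fetchall()
conn.close()

print(sorted(api_result) == sorted(sql_result))
```

The results are compared after sorting because a LIMIT without ORDER BY does not promise any particular row order.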
Spark	Streaming
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real-time or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
Modern Data Applications approach to Insights
Traditional Analytics
• Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics
• Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
Spark Streaming
Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
Note: ZeroMQ and MQTT are no longer supported in Spark 2.x
Spark Streaming
Discretized Streams (DStreams)
→ High-level abstraction representing continuous stream of data
→ Internally represented as a sequence of RDDs
→ Operation applied on a DStream translates to operations on the underlying RDDs
Spark Streaming
Example: flatMap operation
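The DStream idea, one logical operation turning into the same operation on every underlying batch, can be sketched without Spark. In this plain-Python illustration each inner list stands in for one RDD of the stream; `flat_map` and the sample lines are assumptions made for the example, not Spark API.

```python
def flat_map(batch, fn):
    """Apply fn to every record of one batch and flatten the results."""
    return [out for record in batch for out in fn(record)]


# A discretized stream modelled as a list of micro-batches (stand-ins for RDDs).
lines_stream = [
    ["to be or", "not to be"],  # batch at time 1
    ["that is"],                # batch at time 2
    [],                         # batch at time 3: no data arrived
]

# One logical flatMap on the stream becomes one flatMap per batch.
words_stream = [flat_map(batch, str.split) for batch in lines_stream]
print(words_stream)
```

Each line produces zero or more words, which is what distinguishes flatMap from a one-to-one map.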
Spark Streaming
Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average
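A rolling average over a sliding window can be sketched as follows. This is a plain-Python illustration rather than the Spark Streaming API; the parameter names `window_length` and `slide_interval` are chosen here to mirror the two knobs a windowed operation needs, both counted in batches, and the readings are hypothetical.

```python
def windowed_average(batches, window_length, slide_interval):
    """Average all values in each window of `window_length` batches,
    advancing `slide_interval` batches at a time."""
    averages = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = [v for batch in batches[end - window_length:end] for v in batch]
        averages.append(sum(window) / len(window) if window else None)
    return averages


readings = [[10, 20], [30], [40, 50], [60]]  # hypothetical micro-batches
rolling = windowed_average(readings, window_length=3, slide_interval=1)
print(rolling)
```

With a window of three batches sliding by one, the first window covers batches 1 to 3 and the second covers batches 2 to 4, so consecutive windows overlap.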
Challenges in Streaming Data
→ Consistency
→ Fault tolerance
→ Out-of-order data
Structured Streaming
→ High-Level APIs - DataFrames, Datasets and SQL. Same in streaming and in batch
→ Event-time Processing - Native support for working with out-of-order and late data
→ End-to-end Exactly Once - Transactional both in processing and output
Structured Streaming: Basics
Structured Streaming: Model
Handling late arriving data
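One common way to handle late data is a watermark: events are grouped by the time they happened rather than the time they arrived, and an event is dropped only if it is older than the latest event time seen minus an allowed lateness. The sketch below is a simplified plain-Python model of that idea under stated assumptions: it updates the watermark after every event, whereas a real engine typically does so per trigger, and the names `count_by_window` and `allowed_lateness` are made up for the example.

```python
from collections import defaultdict


def count_by_window(events, window_size, allowed_lateness):
    """Count events per (event-time window, key).

    `events` are (event_time, key) pairs listed in ARRIVAL order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in events:
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, key))  # too late: its window is closed
            continue
        window_start = (event_time // window_size) * window_size
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# Hypothetical arrivals: time 7 is late but tolerated, time 3 is too late.
arrivals = [(1, "a"), (12, "a"), (7, "a"), (25, "a"), (3, "a")]
counts, dropped = count_by_window(arrivals, window_size=10, allowed_lateness=10)
print(counts, dropped)
```

The event at time 7 arrives after the one at time 12 but still lands in the 0 to 10 window, because the watermark at that point is only 2.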
Spark	MLlib
Machine Learning use cases
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
START
Classification
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
Regression
• Linear Regression
Collaborative Filtering
• Alternating Least Squares (ALS)
Clustering
• K-Means, LDA
Dimensionality Reduction
• Principal Component Analysis (PCA)
What is a ML Model?
→ Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training
→ E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input: 1, 0, 7, 2, … → Model → Output: 7, 5, 19, 9, …
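The training step for this example can be worked through by hand. The sketch below uses the closed-form least-squares formulas for a single feature, in plain Python rather than Spark MLlib, on the four input/output pairs shown on the slide. The helper name `fit_line` is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # learned slope
    c = mean_y - m * mean_x  # learned intercept
    return m, c


xs = [1, 0, 7, 2]   # inputs from the slide
ys = [7, 5, 19, 9]  # outputs from the slide
m, c = fit_line(xs, ys)
print(f"y = {m:g}x + {c:g}")
```

Because the four points lie exactly on one line, the fit recovers m = 2 and c = 5, matching the "after model training" result; with noisy data the same formulas give the best-fitting line instead.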
Scatter 2D Data Visualized
scatterData
|label|features|
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...
Linear Regression Model Training (one feature)
Result after training:
Coefficients: 2.81, Intercept: 3.05
y = 2.81x + 3.05
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
Spark ML Pipeline
→ fit() is for training
→ transform() is for prediction
Train: Input DataFrame (TRAIN) → Pipeline.fit() → Pipeline Model
Predict: Input DataFrame (TEST) → Pipeline Model.transform() → Output DataFrame (PREDICTIONS)
Spark ML Pipeline
Pipeline stages: Feature transform 1, Feature transform 2, Combine features, Linear Regression
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
Sample Spark ML Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature-preparation stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model
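The fit()/transform() split can be imitated in a few lines of plain Python. The classes below are a toy re-creation of the pattern, not the pyspark.ml classes, and every name and data row in the sketch is hypothetical: fitting a pipeline trains each stage that can learn, feeds its output to the next stage, and returns a fitted pipeline that only transforms.

```python
class LengthFeature:
    """Transformer stage: transform() only, nothing to learn."""
    def transform(self, rows):
        return [dict(row, length=len(row["text"])) for row in rows]


class ThresholdClassifier:
    """Estimator stage: fit() learns a length threshold from labelled rows."""
    def fit(self, rows):
        short = [r["length"] for r in rows if r["label"] == 0]
        long_ = [r["length"] for r in rows if r["label"] == 1]
        threshold = (sum(short) / len(short) + sum(long_) / len(long_)) / 2
        return ThresholdModel(threshold)


class ThresholdModel:
    """Fitted model: transform() adds a prediction column."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, prediction=int(r["length"] > self.threshold)) for r in rows]


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Stages with fit() are trained first; plain transformers are used as-is.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [
    {"text": "hi", "label": 0}, {"text": "hey", "label": 0},
    {"text": "a much longer message", "label": 1},
    {"text": "another long message", "label": 1},
]
test = [{"text": "yo"}, {"text": "this one is quite long"}]

model = Pipeline([LengthFeature(), ThresholdClassifier()]).fit(train)
predictions = [row["prediction"] for row in model.transform(test)]
print(predictions)
```

Keeping the fitted pipeline as its own object is what lets the same preprocessing be replayed on test data without retraining.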
Exporting ML Models - PMML
→ Predictive Model Markup Language (PMML)
– XML-based predictive model interchange format
→ Supported models
– K-Means
– Linear Regression
– Ridge Regression
– Lasso
– SVM
– Binary Logistic Regression
Spark	GraphX
→ Page Rank
→ Topic Modeling (LDA)
→ Community Detection
Source: ampcamp.berkeley.edu
GraphX Algorithms
→ PageRank
→ Connected components
→ Label propagation
→ SVD++
→ Strongly connected components
→ Triangle count
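As a feel for what one of these algorithms computes, here is triangle count in plain Python on a tiny undirected graph. It is a single-machine sketch, not the GraphX implementation; the function name and the edge list are made up for illustration. For each vertex it counts the pairs of neighbours that are themselves connected.

```python
from itertools import combinations


def triangle_count(edges):
    """Return {vertex: number of triangles passing through that vertex}."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    counts = {vertex: 0 for vertex in neighbors}
    for vertex, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in neighbors[a]:  # a and b are connected: a triangle closes
                counts[vertex] += 1
    return counts


edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # hypothetical graph
per_vertex = triangle_count(edges)
total_triangles = sum(per_vertex.values()) // 3  # each triangle is seen from 3 vertices
print(per_vertex, total_triangles)
```

Vertices 1, 2, and 3 form the only triangle, so each of them gets a count of 1 and vertex 4 gets 0.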
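Sample GraphX Code in Scala
// vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]]
val graph = Graph(vertices, edges)
// spark is the SparkSession here, so RDDs are read through its sparkContext
val messages = spark.sparkContext.textFile("hdfs://...")
// joinVertices expects (VertexId, value) pairs, so the raw text lines
// in messages still need to be parsed into that shape first
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}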
Apache Zeppelin Basics
What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more
Apache Zeppelin with HDP 2.6
Web-based Notebook for interactive analytics
Use Case
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
• “Modern Data Science Studio”
Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Supports multiple language backends
• Apache top-level project (graduated from incubation in 2016)
How does Zeppelin work?
[Diagram: the Notebook Author and Collaborators/Report viewers connect to Zeppelin, which runs work on the Cluster (Spark | Hive | HBase, any of 30+ back ends)]
Big Data Lifecycle
Collect → ETL / Process → Analysis → Report → Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
All in Zeppelin!
Multitenancy with Zeppelin
Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service
[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
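What "REST interface" means in practice can be sketched by building the two HTTP requests a Livy client typically makes: one to create an interactive session and one to run a statement in it. The sketch only constructs the requests with Python's standard library and does not send them. The host name and the session id 0 are hypothetical, 8998 is assumed to be the port because it is Livy's default, and a Kerberos-secured cluster would additionally need SPNEGO authentication, which is not shown.

```python
import json
from urllib.request import Request

LIVY_URL = "http://livy-host:8998"  # hypothetical host; 8998 is Livy's default port


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request for the Livy server."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 1. Ask Livy to start an interactive PySpark session.
create_session = livy_post("/sessions", {"kind": "pyspark"})

# 2. Submit code to that session (id 0 is assumed here for illustration).
run_statement = livy_post(
    "/sessions/0/statements", {"code": "spark.range(10).count()"}
)

print(create_session.get_method(), create_session.full_url)
print(run_statement.get_method(), run_statement.full_url)
# Passing either object to urllib.request.urlopen(...) would send it.
```

In a real client the session id comes from the response to the first request, and the statement result is fetched by polling the statement's URL until it finishes.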
Security Across Zeppelin-Livy-Spark
[Diagram: Zeppelin (Shiro, LDAP; Ispark Group Interpreter) → SPNego: Kerberos → Livy Server (Livy APIs) → Kerberos → Spark on YARN (Driver)]
Reasons to Integrate with Livy
→ Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
SparkSession Sharing
[Diagram: on the Livy Server, Session-1 maps to SparkSession-1 and Session-2 to SparkSession-2, each with its own SparkContext; Client 1 and Client 2 share Session-1, Client 3 uses Session-2]
Apache Zeppelin + Livy End-to-End Security
[Diagram: user Tommy Callahan signs in to Zeppelin (LDAP; Ispark Group Interpreter) → SPNego: Kerberos → Livy Server (Livy APIs) → Kerberos/RPC → Spark on YARN; the job runs as Tommy Callahan]
Hortonworks Data Platform (HDP) Basics
→ Zeppelin: Interactive notebook
→ Spark
→ YARN: Resource Management
→ HDFS: Distributed Storage Layer
[Diagram: APIs (Scala, Java, Python, R) on top of Spark SQL, Spark Streaming, MLlib, and GraphX, which run on the Spark Core Engine, on YARN, on HDFS (nodes 1 … N)]
Access patterns enabled by YARN
YARN: Data Operating System
HDFS: Hadoop Distributed File System (nodes 1 … N)
Applications: Batch, Interactive, Real-Time
• Batch: needs to happen, but no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
Why Apache Spark on YARN?
→ Resource management
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Diagram: the Client hosts the Spark Driver; the Spark Application Master and three Spark Executors each run in their own YARN container, and every executor runs Tasks]
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
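The block-and-replica idea can be sketched in plain Python. This is a toy model, not HDFS code: the block size is tiny so the example stays readable, placement is uniformly random as the slide describes (real HDFS placement also takes racks into account), and the node names and helper functions are made up for illustration.

```python
import random


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return {block: rng.sample(nodes, replication) for block in range(num_blocks)}


data = bytes(range(10)) * 10                      # a 100-byte "logical file"
blocks = split_into_blocks(data, block_size=32)   # toy block size
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes)

# Fault tolerance: after losing any single node, every block is still available.
for failed in nodes:
    survivors = {b: [n for n in reps if n != failed] for b, reps in placement.items()}
    assert all(len(reps) >= 2 for reps in survivors.values())

print(len(blocks), placement)
```

Because each block lives on three different nodes, computation can be scheduled on whichever of those nodes is free, which is the data-locality point on the slide.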
[Diagram: a Logical File is divided into Blocks 1 to 4; each block is stored as 3 copies on different nodes of the Cluster]
Hortonworks Data Platform
Hortonworks Data Cloud (HDCloud) Basics
Hortonworks Cloud Solutions
Providers: Microsoft, AWS, Google
• Managed: Azure HDInsight
• Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
• Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)
Hortonworks Cloud Solutions: Flexibility and Choice
[Diagram: a spectrum from Hortonworks Data Cloud for AWS (More Prescriptive, More Ephemeral / Short Lived) through Cloudbreak to HDP on Cloud IaaS (More Options, More Long Running)]
Sample Architecture
Modern Data Apps
→ HDP 2.6
– Batch Processing
→ HDF 3.0
– Streaming Apps
[Diagram: DATA AT REST and DATA IN MOTION feed Modern Data Applications, producing ACTIONABLE INTELLIGENCE]
Modern Data Applications
Custom or Off the Shelf
• Real-Time Cyber Security: protects systems with superior threat detection
• Smart Manufacturing: dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars: drive themselves and improve road safety
• Future Farming: optimizing soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines: match products to preferences in milliseconds
[Diagram: Hortonworks DataFlow (DATA IN MOTION) and Hortonworks Data Platform (DATA AT REST) feed Modern Data Applications, producing ACTIONABLE INTELLIGENCE]
Managed Dataflow
SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE
NiFi part of HDF
High-Level Overview
[Diagram: IoT Devices → IoT Edge (single node) → NiFi Hub → Data Broker → Column DB (HBase/Cassandra), Data Store (HDFS/S3), and Live Dashboard, in the Data Center (on prem/cloud)]
Robert Hryniewicz
T: @RobH8z
E: rhryniewicz@hortonworks.com
Thanks!
