Robert	Hryniewicz
Developer	Advocate
@RobertH8z
Apache	Spark	
Crash	Course	- DataWorks Summit	- Munich	2017
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
“Big Data”
→ Internet of Anything (IoT)
– Wind Turbines, Oil Rigs
– Beacons, Wearables
– Smart Cars
→ User Generated Content (Social, Web & Mobile)
– Twitter, Facebook, Snapchat
– Clickstream
– Paypal, Venmo
44 ZB in 2020
Visualizing 44 ZB
Scale: 100 pixels = 1M TB (assumes a screen with a resolution of about 5M pixels)
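The scale on this slide can be checked with a little arithmetic. A minimal sketch, assuming decimal units (1 ZB = 10^9 TB); the object and method names are illustrative only:

```scala
object Visualize44ZB {
  // Assumption: decimal units, so 1 ZB = 10^21 bytes = 10^9 TB.
  val TbPerZb: Long = 1000000000L
  // Scale from the slide: every 100 pixels stand for 1M TB.
  val TbPer100Px: Long = 1000000L

  /** Number of pixels needed to draw `zettabytes` ZB at the slide's scale. */
  def pixelsFor(zettabytes: Long): Long =
    zettabytes * TbPerZb / TbPer100Px * 100

  def main(args: Array[String]): Unit =
    println(s"44 ZB needs ${pixelsFor(44)} pixels")
}
```

Under these assumptions 44 ZB comes to 4.4 million pixels, which fits the roughly 5M-pixel screen the slide assumes.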
The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
Spark	Background
What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California, Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms
Why Apache Spark?
→ Elegant Developer APIs
– Single environment for data munging, data wrangling, and Machine Learning (ML)
→ In-memory computation model – Fast!
– Effective for iterative computations and ML
→ Machine Learning
– Implementation of distributed ML algorithms
– Pipeline API (Spark ML)
Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis
Spark	Basics
SparkSession: What is it?
→ Main entry point for Spark functionality
→ Allows programming with DataFrame and Dataset APIs
– Fewer concepts and constructs a developer has to juggle while interacting with Spark
→ Represented as spark and auto-initialized in the Zeppelin environment
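Outside Zeppelin the session is not created for you. A minimal sketch of building one by hand, assuming Spark 2.x on the classpath and local mode; the application name and the object name are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  /** Builds (or reuses) a session; "local[*]" runs Spark inside this JVM. */
  def localSession(): SparkSession =
    SparkSession.builder()
      .appName("crash-course-sketch") // hypothetical application name
      .master("local[*]")             // assumption: local mode, no cluster
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    // The same object is the entry point for DataFrames, Datasets and SQL.
    spark.range(5).show()
    spark.stop()
  }
}
```

On a cluster the master is normally supplied by the launcher (for example spark-submit) rather than hard-coded.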
[Figure labels: More Flexible; Better Storage and Performance]
Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API
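The DataFrame and SQL routes are shown later in the deck; the third route, the typed Datasets API, might look like the sketch below. The Flight case class, its sample rows and the helper name are assumptions made up for illustration (the column names follow the flights example used later):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names mirror the flights columns.
case class Flight(Origin: String, Dest: String, DepDelay: Int)

object DatasetSketch {
  /** Typed filter: the lambda works on Flight objects, not on column names. */
  def delayed(flights: Dataset[Flight], minutes: Int): Dataset[Flight] =
    flights.filter((f: Flight) => f.DepDelay > minutes)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-sketch").master("local[*]").getOrCreate()
    import spark.implicits._ // brings the encoders needed by toDS()

    // Made-up sample rows standing in for a real data source.
    val flights = Seq(
      Flight("IAD", "TPA", 19),
      Flight("IND", "BWI", 34),
      Flight("SFO", "LAX", 5)
    ).toDS()

    delayed(flights, 15).show()
    spark.stop()
  }
}
```

Compared with a DataFrame, a mistyped field name here fails at compile time instead of at run time.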
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
Data is described as a DataFrame with rows, columns, and a schema
[Diagram: a DataFrame drawn as a grid of Rows and Columns Col1, Col2, …, ColN]
Sources
[Diagram: CSV, Avro, HIVE and JSON sources → Spark SQL → DataFrame (Rows and Columns Col1, Col2, …, ColN)]
Create a DataFrame
Example
// Read a JSON file into a DataFrame; spark is the SparkSession
val path = "examples/flights.json"
val flights = spark.read.json(path)
Register	a	Temporary	View	(SQL	API)
Example
flights.createOrReplaceTempView("flightsView")
Two API Examples: DataFrame and SQL APIs
DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)
SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5
Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
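From Scala, the SQL variant is usually submitted through spark.sql. A minimal sketch that runs both styles on the same data; the in-memory sample rows and the object and method names are assumptions for illustration, not the flights.json file from the deck:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TwoApisSketch {
  /** DataFrame API: select the columns, then keep delays above 15 minutes. */
  def viaDataFrame(flights: DataFrame): DataFrame =
    flights.select("Origin", "Dest", "DepDelay").filter(col("DepDelay") > 15)

  /** SQL API: register a temporary view, then query it with plain SQL. */
  def viaSql(flights: DataFrame): DataFrame = {
    flights.createOrReplaceTempView("flightsView")
    flights.sparkSession.sql(
      "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-apis-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up rows; a real job would use spark.read.json(path) instead.
    val flights = Seq(("IAD", "TPA", 19), ("IND", "BWI", 34), ("SFO", "LAX", 5))
      .toDF("Origin", "Dest", "DepDelay")

    viaDataFrame(flights).show(5)
    viaSql(flights).show(5)
    spark.stop()
  }
}
```

Both calls go through the same Spark SQL engine, so the choice between them is mostly a matter of taste.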
What is Stream Processing?
Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics
Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action
Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics
Modern Data Applications approach to Insights
Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis
• Test against selected data
• Analyze after landing…
Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way
• Explore all data, identify correlations
• Analyze in motion…
Spark Streaming: Overview
→ Extension of Spark Core API
→ Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant
Note: ZeroMQ and MQTT are no longer supported in Spark 2.x
Spark Streaming: Discretized Streams (DStreams)
→ High-level abstraction representing continuous stream of data
→ Internally represented as a sequence of RDDs
→ Operation applied on a DStream translates to operations on the underlying RDDs
Spark	Streaming
Example:	flatMap operation
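The picture for this slide did not survive extraction. As a rough stand-in, here is a sketch of flatMap on a DStream that splits each incoming line into words. It assumes the spark-streaming module is on the classpath and that some process is writing text lines to localhost:9999 (for example a netcat listener); host, port and names are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamFlatMapSketch {
  /** One input line becomes zero or more output words. */
  def splitWords(line: String): Seq[String] =
    line.split(" ").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // local[2]: one thread receives data, the other processes it.
    val conf = new SparkConf().setAppName("dstream-flatmap-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // Assumption: a text source on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)

    // flatMap on the DStream is applied to every RDD in the stream.
    val words = lines.flatMap(line => splitWords(line))
    words.print()

    ssc.start()
    ssc.awaitTermination() // runs until the job is stopped externally
  }
}
```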
Spark Streaming: Window Operations
→ Apply transformations over a sliding window of data, e.g. rolling average
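A rolling average can be sketched with reduceByWindow, which folds all values that fall inside the window. The example assumes numeric readings arrive one per line on localhost:9999; the window sizes, host, port and names are arbitrary choices for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  /** Mean from a running (sum, count) pair; 0.0 for an empty window. */
  def mean(sum: Double, count: Long): Double =
    if (count == 0L) 0.0 else sum / count

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches

    // Assumption: one numeric reading per line on localhost:9999.
    val readings = ssc.socketTextStream("localhost", 9999).map(_.toDouble)

    // Window of 30 s that slides every 10 s; both are multiples of the batch interval.
    val rollingAvg = readings
      .map(v => (v, 1L))
      .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Seconds(30), Seconds(10))
      .map { case (sum, count) => mean(sum, count) }

    rollingAvg.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```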
Challenges in Streaming Data
→ Consistency
→ Fault tolerance
→ Out-of-order data
Structured	Streaming:	Basics
Structured	Streaming:	Model
Handling	late	arriving	data
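The diagrams for the Structured Streaming slides were lost in extraction. One common way to handle late-arriving data in Structured Streaming is a watermark combined with an event-time window, sketched below. The sketch uses the built-in "rate" test source, which to my knowledge requires Spark 2.2 or later; the thresholds and names are arbitrary assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object LateDataSketch {
  /** Counts events per 5-minute event-time window, accepting data up to 10 minutes late. */
  def windowedCounts(events: DataFrame): DataFrame =
    events
      .withWatermark("timestamp", "10 minutes") // state older than this can be dropped
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("late-data-sketch").master("local[2]").getOrCreate()

    // Assumption: the "rate" source (columns timestamp, value) stands in for a real stream.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val query = windowedCounts(events)
      .writeStream
      .outputMode("update") // emit only the windows whose counts changed
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```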
AI	in	Media	&	Pop	Culture
Machine Learning use cases
Healthcare
• Predict	diagnosis
• Prioritize	screenings
• Reduce	re-admittance	rates
Financial	services
• Fraud	Detection/prevention
• Predict	underwriting	risk
• New	account	risk	screens
Public	Sector
• Analyze	public	sentiment
• Optimize	resource	allocation
• Law	enforcement	&	security	
Retail
• Product	recommendation
• Inventory	management
• Price	optimization
Telco/mobile
• Predict	customer	churn
• Predict	equipment	failure
• Customer	behavior	analysis
Oil	&	Gas
• Predictive	maintenance
• Seismic	data	management
• Predict	well	production	levels
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...
Linear Regression Model Training (one feature)
Training Result
Coefficients: 2.81    Intercept: 3.05
y = 2.81x + 3.05
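A fit like this can be produced with Spark ML's LinearRegression. The sketch below assumes the spark-mllib module is available and uses only the first rows of scatterData shown in the deck, so it will not reproduce the 2.81 and 3.05 obtained from the full data set; the object and method names are illustrative:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LinearRegressionSketch {
  /** Fits y = coefficient * x + intercept on columns "label" and "features". */
  def fit(data: DataFrame): LinearRegressionModel =
    new LinearRegression().fit(data) // default column names: "label", "features"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("linear-regression-sketch").master("local[*]").getOrCreate()

    // Partial sample: only the rows visible on the scatterData slide.
    val scatterData = spark.createDataFrame(Seq(
      (-12.0, Vectors.dense(-4.9)), (-6.0, Vectors.dense(-4.5)),
      (-7.2, Vectors.dense(-4.1)), (-5.0, Vectors.dense(-3.2)),
      (-2.0, Vectors.dense(-3.0)), (-3.1, Vectors.dense(-2.1)),
      (-4.0, Vectors.dense(-1.5)), (-2.2, Vectors.dense(-1.2)),
      (-2.0, Vectors.dense(-0.7)), (1.0, Vectors.dense(-0.5)),
      (-0.7, Vectors.dense(-0.2))
    )).toDF("label", "features")

    val model = fit(scatterData)
    println(s"Coefficients: ${model.coefficients}  Intercept: ${model.intercept}")
    spark.stop()
  }
}
```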
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
Spark API for building ML pipelines
Pipeline: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression
Train: Input DataFrame → Pipeline → Pipeline Model
Predict: Input DataFrame → Pipeline Model → Output DataFrame
Export Model
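The train, predict and export flow can be sketched with the spark.ml Pipeline API. This is a reduced version under stated assumptions: the two feature transforms are left out, only the "combine features" step (VectorAssembler) and the regression remain, and the column names, sample rows and export path are made up:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  /** Train: Input DataFrame -> Pipeline -> PipelineModel. */
  def train(input: DataFrame): PipelineModel = {
    // "Combine features": pack the raw columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2")) // hypothetical input columns
      .setOutputCol("features")
    val lr = new LinearRegression()    // reads "features" and "label" by default
    new Pipeline().setStages(Array(assembler, lr)).fit(input)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Made-up training rows: (x1, x2, label).
    val training = spark.createDataFrame(Seq(
      (1.0, 2.0, 3.1), (2.0, 1.0, 2.9), (3.0, 4.0, 7.2), (4.0, 3.0, 6.8)
    )).toDF("x1", "x2", "label")

    val model = train(training)

    // Predict: Input DataFrame -> PipelineModel -> Output DataFrame.
    model.transform(training).select("x1", "x2", "prediction").show()

    // Export: persist the fitted pipeline (path is an assumption).
    model.write.overwrite().save("/tmp/lr-pipeline-model")
    spark.stop()
  }
}
```

A saved model can later be restored with PipelineModel.load on the same path.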
→ Page Rank
→ Topic Modeling (LDA)
→ Community Detection
Source: ampcamp.berkeley.edu
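As an illustration of the first item, PageRank is available directly on a GraphX graph. The sketch assumes the spark-graphx module is on the classpath; the tiny link graph and all names are invented for the example:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object PageRankSketch {
  /** Returns (vertexId, rank) pairs sorted from highest to lowest rank. */
  def rank(sc: SparkContext, links: Seq[(Long, Long)]): Array[(Long, Double)] = {
    val edges = sc.parallelize(links.map { case (src, dst) => Edge(src, dst, 1) })
    // Vertices are created from the edge endpoints; 0 is a placeholder attribute.
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    // Iterate until ranks change by less than the tolerance.
    graph.pageRank(tol = 0.0001).vertices.collect().sortBy { case (_, r) => -r }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pagerank-sketch").master("local[*]").getOrCreate()

    // Made-up links: pages 1, 2, 3 form a cycle and page 4 also links to page 1.
    val links = Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 1L))
    rank(spark.sparkContext, links).foreach { case (id, r) => println(s"$id -> $r") }
    spark.stop()
  }
}
```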
Zeppelin	&	HDP
What’s Apache Zeppelin?
Web-based notebook that enables interactive data analytics.
You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.
Simple line chart
Horizontal plot of three line charts
Streaming data into a line chart
Plotting Iris data features in one plot
Comparing Iris data distributions
What is a Note/Notebook?
• A	web	based	GUI	for	small	code	snippets
• Write	code	snippets	in	browser
• Zeppelin	sends	code	to	backend	for	execution
• Zeppelin	gets	data	back	from	backend
• Zeppelin	visualizes	data
• Zeppelin	Note	=	Set	of	(Paragraphs/Cells)
• Other	Features	- Sharing/Collaboration/Reports/Import/Export
How does Zeppelin work?
[Diagram: Notebook Author and Collaborators/Report viewers ↔ Zeppelin ↔ Cluster (Spark | Hive | HBase, any of 30+ back ends)]
Big Data Lifecycle
Collect → ETL / Process → Analysis → Report → Data Product
Roles: Data Engineer, Data Scientist, Business user, Customer
All in Zeppelin!
→ Zeppelin ⇒ Interactive notebook
→ Spark
→ YARN ⇒ Resource Management
→ HDFS ⇒ Distributed Storage Layer
[Diagram labels: APIs (Scala, Java, Python, R); Spark Core Engine; Spark SQL, Spark Streaming, MLlib, GraphX; YARN; HDFS nodes 1 … N]
Access patterns enabled by YARN
YARN: Data Operating System
HDFS: Hadoop Distributed File System (nodes 1 … N)
Applications
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
Why Apache Spark on YARN?
→ Resource management
→ Utilizes existing HDP cluster infrastructure
→ Scheduling and queues
[Diagram labels: Client with Spark Driver; YARN container with Spark Application Master; three YARN containers, each holding a Spark Executor with two Tasks]
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
[Diagram: a Logical File is split into Blocks 1 to 4; each block appears 3 times across the Cluster]
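The block and replica counts follow from simple arithmetic. A minimal sketch, assuming a 128 MB block size (the slide does not state one) and a hypothetical 500 MB file; names are illustrative:

```scala
object HdfsBlockSketch {
  val MB: Long = 1024L * 1024L

  /** Number of blocks a file occupies; the last block may be partially filled. */
  def blockCount(fileBytes: Long, blockBytes: Long): Long =
    (fileBytes + blockBytes - 1) / blockBytes

  /** Total block replicas stored across the cluster. */
  def replicaCount(fileBytes: Long, blockBytes: Long, replication: Int): Long =
    blockCount(fileBytes, blockBytes) * replication

  def main(args: Array[String]): Unit = {
    val file = 500 * MB   // hypothetical file size
    val block = 128 * MB  // assumption: 128 MB blocks
    println(s"blocks = ${blockCount(file, block)}")
    println(s"replicas = ${replicaCount(file, block, replication = 3)}")
  }
}
```

With these numbers the file splits into 4 blocks and the cluster stores 12 block replicas, so losing a single node still leaves two copies of every block.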
Spark and HDP
HDCloud
Hortonworks Cloud Solutions
Providers: Microsoft, AWS, Google
Managed: Azure HDInsight (Microsoft)
Non-Managed / Marketplace: Hortonworks Data Cloud for AWS (AWS)
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)
Hortonworks Cloud Solutions: Flexibility and Choice
Hortonworks Data Cloud for AWS | Cloudbreak | HDP on Cloud IaaS
More Prescriptive, More Ephemeral ↔ More Options, More Long Running
HDP 2.6 and New Cluster Types
• Spark 2.1
• Druid (TP)
• Interactive Hive
Multitenancy	with	Zeppelin
Livy
→ Livy is the open source REST interface for interacting with Apache Spark from anywhere
→ Installed as Spark Ambari Service
[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]
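To give a feel for the REST interface, the sketch below creates an interactive session and submits one statement using the JDK 11+ HTTP client. It assumes a Livy server at localhost:8998 (Livy's usual default port) without authentication, and it hard-codes session id 0 where a real client would parse the id from the first response and wait for the session to become idle; all names are illustrative:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySketch {
  /** Builds a JSON POST request; nothing is sent until a client executes it. */
  def postJson(url: String, json: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(url))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()

  def main(args: Array[String]): Unit = {
    val livy = "http://localhost:8998" // assumption: local, unsecured Livy server
    val client = HttpClient.newHttpClient()

    // POST /sessions starts an interactive Scala session.
    val create = postJson(s"$livy/sessions", """{"kind": "spark"}""")
    println(client.send(create, HttpResponse.BodyHandlers.ofString()).body())

    // POST /sessions/{id}/statements runs code in that session.
    // Assumption: the session got id 0 and has already reached the idle state.
    val stmt = postJson(s"$livy/sessions/0/statements",
      """{"code": "sc.parallelize(1 to 10).sum()"}""")
    println(client.send(stmt, HttpResponse.BodyHandlers.ofString()).body())
  }
}
```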
Security Across Zeppelin-Livy-Spark
[Diagram labels: LDAP; Shiro; Zeppelin; Ispark Group Interpreter; SPNego: Kerberos; Livy APIs; Livy Server; Kerberos; Spark on YARN; Driver]
Reasons to Integrate with Livy
→ Bring Sessions to Apache Zeppelin
– Isolation
– Session sharing
→ Enable efficient cluster resource utilization
– Default Spark interpreter keeps YARN/Spark job running forever
– Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
→ Identity Propagation
– Send user identity from Zeppelin > Livy > Spark on YARN
SparkSession Sharing
[Diagram: Client 1 and Client 2 use Session-1, Client 3 uses Session-2; inside the Livy Server, Session-1 maps to SparkSession-1 (SparkContext) and Session-2 maps to SparkSession-2 (SparkContext)]
Apache Zeppelin Security: Authentication + SSL
[Diagram labels: user Tommy Callahan; steps 1, 2, 3; LDAP; SSL; Firewall; Zeppelin; Spark on YARN]
Apache Zeppelin + Livy End-to-End Security
[Diagram labels: user Tommy Callahan; LDAP; Zeppelin; Ispark Group Interpreter; SPNego: Kerberos; Livy APIs; Livy Server; Kerberos/RPC; Spark on YARN; Job runs as Tommy Callahan]
Sample	Architecture
Modern Data Apps
→ HDP 2.6
– Batch Processing
→ HDF 2.1
– Streaming Apps
[Diagram labels: DATA AT REST; DATA IN MOTION; ACTIONABLE INTELLIGENCE; Modern Data Applications]
Modern Data Applications
Custom or Off the Shelf
• Real-Time Cyber Security protects systems with superior threat detection
• Smart Manufacturing dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars drive themselves and improve road safety
• Future Farming optimizing soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines match products to preferences in milliseconds
[Diagram labels: DATA AT REST (Hortonworks Data Platform); DATA IN MOTION (Hortonworks DataFlow); ACTIONABLE INTELLIGENCE]
Managed Dataflow
SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE
High-Level Overview
[Diagram labels: IoT Devices; IoT Edge (single node); NiFi Hub; Data Broker; Data Store (HDFS/S3); Column DB (HBase/Cassandra); Live Dashboard; Data Center (on prem/cloud)]
Labs	/	Tutorials
Future Tutorials
→ Deploying Models with Spark Structured Streaming
→ Predicting Airline Delays with SparkR
→ Sentiment Analysis with Apache Spark (Gradient Boosting)
→ Auto Text Classification (Naïve Bayes)
Hortonworks	Community	Connection
Read access for everyone, join to participate and be recognized
• Full	Q&A	Platform	(like	StackOverflow)
• Knowledge	Base	Articles
• Code	Samples	and	Repositories
Community Engagement
Participate now at: community.hortonworks.com
12,000+ Registered Users
35,000+ Answers
55,000+ Technical Assets
One Website!
www.futureofdata.io
Future	of	Data	Meetups
FB Sort
→ Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.
→ Largest real-world Spark job to date!
– Databricks’ Petabyte sort was on synthetic data.
→ Multiple reliability fixes.
“Spark could reliably shuffle and sort 90 TB+ intermediate data and run 250,000 tasks in a single job [...] and it has been running in production for several months.”
What’s new in HDP 2.6 – Spark & Zeppelin
Spark
→ Spark 1.6.3 GA
→ Spark 2.1 GA
→ REST API (Livy) GA
→ Spark Thrift Server doAs GA
→ SparkSQL – Row/Column Security (GA)
→ Spark Streaming + Kafka over SSL
→ Multi Cluster HBase support for SHC
→ Package support in PySpark & SparkR
Zeppelin 0.7.x
→ Spark 2.x support
→ Improved Livy integration
→ No password in clear
→ JDBC interpreter improvements
→ Smart Sense integration
→ Knox proxy Zeppelin UI
Robert	Hryniewicz
@RobertH8z
Thanks!
