1 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
MatFast
IN-MEMORY	DISTRIBUTED	MATRIX	COMPUTATION	
PROCESSING	AND	OPTIMIZATION	BASED	ON	SPARK	SQL
Mingjie	Tang
Yanbo Liang	
Oct,	2017
About	Authors
→ Yongyang Yu
• Machine learning, Database systems, Computation Algebra
• PhD student at Purdue University
→ Mingjie Tang
• Spark SQL, Spark ML, Database, Machine Learning
• Software Engineer at Hortonworks
→ Yanbo Liang
• Apache Spark committer, Spark MLlib
• Staff Software Engineer at Hortonworks
→ … All Other Contributors
Agenda
Motivation
Overview	of	MatFast
Implementation	 and	optimization
Use	cases
Motivation
→ Many applications rely on efficient processing of queries over big matrix data:
– Recommender systems
– Social network analysis
– Traffic flow prediction
– Anti-fraud and spam detection
– Bioinformatics
Motivation
→ Recommender Systems
Problem: Predict the missing entries in the table
Input: User-movie rating table with missing entries
Output: Complete user-movie rating table with predictions
For Netflix, #users = 80 million, #movies = 2 million
Netflix's user-movie rating table (sample; rows are users, columns are movies):
        Batman begins   Alice in Wonderland   Doctor Strange   Trolls   Ironman
Alice   4               ?                     3                5        4
Bob     ?               5                     4                ?        ?
Cindy   3               ?                     ?                ?        2
Motivation
→ Gaussian Non-negative Matrix Factorization (GNMF)
– Assumption: $V_{n \times m} \approx W_{n \times p} \times H_{p \times m}$
– V: users (80M) × movies (2M)
– W: users (80M) × modeling dims (500)
– H: modeling dims (500) × movies (2M)
– Modeling dims: e.g., topics, age, language, etc.
Matrix operations for the GNMF algorithm:
Initialize W and H
for i = 1 to nIter do
    $H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
    $W = W \ast (V \times H^{T}) / (W \times H \times H^{T})$
end
Characteristics: huge volume, dense/sparse storage, iterative execution
Motivation
→ User-Movie Rating Prediction with GNMF
val p = 200                        // number of topics
val V = loadMatrix("in/V")         // read the rating matrix
val max_niter = 10                 // max number of iterations
var W = RandomMatrix(V.nrows, p)   // users x topics
var H = RandomMatrix(p, V.ncols)   // topics x movies
for (i <- 0 until max_niter) {
  // * and / are element-wise; %*% is matrix-matrix multiplication
  H = H * (W.t %*% V) / (W.t %*% W %*% H)
  W = W * (V %*% H.t) / (W %*% H %*% H.t)
}
(W %*% H).saveToHive()             // predicted ratings: W x H
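The loop above can be tried on a toy matrix without a cluster. The following is a minimal single-machine sketch in plain Python, not MatFast code: the helper names (`matmul`, `transpose`, `elementwise`, `gnmf`), the random initialization range, and the small `eps` added to the denominator to avoid division by zero are all assumptions introduced here for illustration.

```python
import random

def matmul(a, b):
    """Matrix-matrix product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def elementwise(a, b, op):
    """Apply a binary operator entry by entry (the deck's mat-elem operator)."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def random_matrix(rows, cols, rng):
    # Strictly positive entries so the multiplicative updates stay well defined.
    return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5

def gnmf(v, p, n_iter, seed=0, eps=1e-9):
    """Multiplicative GNMF updates; eps is an added safeguard, not part of the slides."""
    rng = random.Random(seed)
    w = random_matrix(len(v), p, rng)
    h = random_matrix(p, len(v[0]), rng)
    mul = lambda x, y: x * y
    div = lambda x, y: x / (y + eps)
    for _ in range(n_iter):
        wt = transpose(w)
        # H = H * (W^T x V) / (W^T x W x H)
        h = elementwise(elementwise(h, matmul(wt, v), mul), matmul(matmul(wt, w), h), div)
        ht = transpose(h)
        # W = W * (V x H^T) / (W x H x H^T), evaluated as W x (H x H^T)
        w = elementwise(elementwise(w, matmul(v, ht), mul), matmul(w, matmul(h, ht)), div)
    return w, h

if __name__ == "__main__":
    v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
    w, h = gnmf(v, p=2, n_iter=50)
    print(frobenius_error(v, w, h))
```

Calling `gnmf` with the same seed and more iterations should give a smaller reconstruction error, which is a quick sanity check that the update rules are wired correctly.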
State-of-the-art solution in the Spark ecosystem
→ Alternating Least Squares (ALS) approach in Spark
– Experiment on Spotify data
– 50+ million users x 30+ million songs
– 50 billion ratings; rank 10 with 10 iterations
– ~1 hour running time
→ How to extend ALS to other matrix computations?
– SVD
– PCA
– QR
Observation
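[Figure: matrix computation evaluation pipeline. Inputs H, W, and V flow through one transpose, three mat-mat operators, and two mat-elem operators inside a loop.]
Initialize W and H
for i = 1 to nIter do
    $H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
    $W = W \ast (V \times H^{T}) / ((W \times H) \times H^{T})$
end
Intermediate result cost: $8 \times 10^{16}$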
Observation
[Figure: matrix computation evaluation pipeline. Inputs H, W, and V flow through one transpose, three mat-mat operators, and two mat-elem operators inside a loop, with materialization of the result.]
Initialize W and H
for i = 1 to nIter do
    $H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
    $W = W \ast (V \times H^{T}) / (W \times (H \times H^{T}))$
end
Intermediate result cost: $5 \times 10^{11}$
Observation
[Figure: matrix computation evaluation pipeline. Inputs H, W, and V flow through one transpose, three mat-mat operators, and a chained mat-elem operator inside a loop.]
Initialize W and H
for i = 1 to nIter do
    $H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
    $W = W \ast (V \times H^{T}) / (W \times (H \times H^{T}))$
end
Intermediate result size: $5 \times 10^{11}$
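The two figures on the Observation slides, 8 × 10^16 and 5 × 10^11, match a simple multiplication count using the dimensions from the GNMF slide (80M users, 2M movies, 500 modeling dims). The check below assumes that multiplying an a × b matrix by a b × c matrix costs a·b·c scalar multiplications; that cost measure and the function name `matmul_cost` are assumptions made here, since the slides do not spell out their cost model.

```python
def matmul_cost(a, b, c):
    """Scalar multiplications for an (a x b) times (b x c) product."""
    return a * b * c

users, movies, dims = 80_000_000, 2_000_000, 500

# W is users x dims, H is dims x movies, H^T is movies x dims.
cost_wh = matmul_cost(users, dims, movies)    # W x H
cost_hht = matmul_cost(dims, movies, dims)    # H x H^T

# Total cost of each evaluation order for W x H x H^T.
left_first = cost_wh + matmul_cost(users, movies, dims)    # (W x H) x H^T
right_first = cost_hht + matmul_cost(users, dims, dims)    # W x (H x H^T)

print(f"{cost_wh:.1e}")           # 8.0e+16
print(f"{cost_hht:.1e}")          # 5.0e+11
print(left_first // right_first)  # 7804
```

Under this count, evaluating H × H^T first is several thousand times cheaper than evaluating W × H first, which is the kind of reordering a rule-based rewrite can exploit.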
Overview	of	MatFast
Matrix	operators
→ Unary operator
– Transpose: $B = A^{T}$
→ Binary operators
– Matrix-scalar: $B = A + \beta$; $B = A \ast \beta$
– Matrix-element: $C = A \star B$, $\star \in \{+, \ast, /\}$
– Matrix-matrix multiplication: $C = A B$ (A %*% B)
→ Others
– return a matrix: abs(A), pow(A, p)
– return a vector: rowSum(A), colSum(A)
– return a scalar: max(A), min(A)
Optimization	targets
→ MATFAST generates a computation- and communication-efficient execution plan:
– Optimize a single matrix operator in an expression
– Optimize multiple operators in an expression
– Exploit data dependency between different expressions
Comparison	with	other	systems
Systems compared: R (single node); ScaLAPACK, SciDB, SystemML, MLlib, DMac (distributed w. multiple nodes)
huge volume: ✔ ✔ ✔ ✔ ✔
sparse comp.: ✔ 〜 ✔ 〜 〜
multiple operators: ✔ ✔ ✔ ✔ ✔ ✔
partition w. dependency: ✔
opt. exec. plan: ✔
interface: R script, C/Fortran, SQL-like, R-like, Java/Scala, Scala
fault tolerance: ✔ ✔ ✔ ✔
open source: ✔ ✔ 〜 ✔ ✔
Compare	with	Spark	SQL
                   Matrix operators                            SQL relational query
Data type          matrix                                      relational table
Operators          transpose, mat-mat, mat-scalar, mat-elem    join, select, group by, aggregate
Execution scheme   iterative                                   acyclic
System framework
[Figure: layered system stack]
Applications: Image processing, Text processing, Collaborative filtering, Spatial computation, etc.
ML algorithms: SVD, PCA, NMF, PageRank, QR, etc.
MATFAST
Spark SQL
Spark RDD
System	framework
[Figure: MATFAST components and architecture]
MatFast within	Spark	Catalyst	
→ Extend Spark Catalyst
– Rule-based optimization (single matrix operators, multiple matrix operators)
– Cost-based optimization (optimizing data partitioning)
Implementation	and	optimization
Optimization	1:	a	Single	Operator	- Cost	Based	Optimization
[Figure: average execution time (s), log scale from 10^2 to 10^4, for plan 1 to plan 5 and MatFast]
Candidate plans for A1 × A2 × A3 × A4:
((A1 × A2) × A3) × A4
(A1 × (A2 × A3)) × A4
A1 × ((A2 × A3) × A4)
A1 × (A2 × (A3 × A4))
(A1 × A2) × (A3 × A4)
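Choosing among parenthesizations of a four-matrix product is the classic matrix-chain ordering problem. The sketch below uses the textbook dynamic program with a scalar-multiplication cost. It is an illustrative stand-in only: the slides do not say which search strategy or cost model MatFast applies, and the function name `best_chain_plan` and the example dimensions are made up.

```python
def best_chain_plan(dims):
    """Return (cost, plan) for multiplying A1..An, where Ai is dims[i-1] x dims[i]."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]    # cost[i][j]: cheapest way to multiply Ai+1..Aj+1
    split = [[0] * n for _ in range(n)]   # split[i][j]: where the cheapest plan splits
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = None
            for k in range(i, j):
                # Left part is dims[i] x dims[k+1], right part is dims[k+1] x dims[j+1].
                c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if cost[i][j] is None or c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k

    def render(i, j):
        if i == j:
            return f"A{i + 1}"
        k = split[i][j]
        return f"({render(i, k)} × {render(k + 1, j)})"

    return cost[0][n - 1], render(0, n - 1)

if __name__ == "__main__":
    # Hypothetical shapes: A1 is 40x20, A2 is 20x30, A3 is 30x10, A4 is 10x30.
    print(best_chain_plan([40, 20, 30, 10, 30]))
    # (26000, '((A1 × (A2 × A3)) × A4)')
```

For these example shapes the five candidate orders cost between 26,000 and 69,000 scalar multiplications, so the choice of plan matters even for a short chain.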
Optimization	2:	optimizing	data	partitioning	in	pipeline	
→ Distribute matrix data over a set of workers
→ How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline?
→ Partitioning schemes
– Row scheme (“r”)
– Column scheme (“c”)
– Block-Cyclic scheme (“b-c”)
– Broadcast scheme (“b”)
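One way to read the four schemes is as rules that map a matrix block (i, j) to the executor or executors holding it. The sketch below is an interpretation, not MatFast's partitioner: the modulo-based row and column placement, the 2 × 2 executor grid used for block-cyclic, and the names `owners` and `transpose_moves` are assumptions for illustration.

```python
def owners(scheme, i, j, n_exec, grid=(2, 2)):
    """Executors that hold block (i, j) under a given partitioning scheme."""
    if scheme == "r":      # row scheme: a whole block-row lives on one executor
        return {i % n_exec}
    if scheme == "c":      # column scheme: a whole block-column lives on one executor
        return {j % n_exec}
    if scheme == "b-c":    # block-cyclic: blocks cycle over a 2D executor grid
        rows, cols = grid  # assumed grid with rows * cols == n_exec
        return {(i % rows) * cols + (j % cols)}
    if scheme == "b":      # broadcast: every executor holds a full copy
        return set(range(n_exec))
    raise ValueError(f"unknown scheme: {scheme}")

def transpose_moves(src, dst, n_rows, n_cols, n_exec):
    """Block copies needed so that W^T ends up partitioned by `dst`."""
    moves = 0
    for i in range(n_rows):
        for j in range(n_cols):
            have = owners(src, i, j, n_exec)   # where block W[i][j] is now
            need = owners(dst, j, i, n_exec)   # where block W^T[j][i] must be
            moves += len(need - have)
    return moves

if __name__ == "__main__":
    # W split into 4 x 2 blocks over 4 executors.
    print(transpose_moves("r", "c", 4, 2, 4))  # 0
    print(transpose_moves("r", "r", 4, 2, 4))  # 6
```

Under these assumptions, a row-partitioned W can be transposed in place into a column-partitioned W^T without moving any block, while forcing W^T into a row scheme moves most blocks. The counts cover the transpose step alone and are not meant to reproduce the shuffle totals shown in the deck's figures.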
Optimization	2:	optimizing	data	partitioning	in	pipeline	
→ How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline?
$H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
$W = W \ast (V \times H^{T}) / (W \times H \times H^{T})$
Compute first: $W^{T}$
[Figure: W is split into 4 × 2 blocks (W00 to W31) and W^T into the matching transposed blocks (WT00 to WT31), placed on executors E0 to E3 by hash-based partition, (i + j) % N.]
Annotated costs: 3 block shuffles; 7 block shuffles
Total: 20 block shuffles
Optimization	2:	optimizing	data	partitioning	in	pipeline	
→ How to determine the data partitioning scheme for a matrix such that minimum shuffle cost is introduced for the entire pipeline?
$H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
$W = W \ast (V \times H^{T}) / (W \times H \times H^{T})$
Compute first: $W^{T}$
[Figure: W is split into 4 × 2 blocks (W00 to W31) and W^T into the matching transposed blocks (WT00 to WT31), placed on executors E0 to E3 by row-based partition.]
Annotated costs: 3 block shuffles; 3 block shuffles
Total: 12 block shuffles
Optimization	2:	optimizing	data	partitioning	in	pipeline	
[Figure: execution plan for one GNMF iteration, split into stages 1 to 6. Inputs W, H, and V produce the intermediate results W^T, W^T V, W^T W, W^T W H, V H^T, W H, and W H H^T, then the updated H and W.]
$H = H \ast (W^{T} \times V) / (W^{T} \times W \times H)$
$W = W \ast (V \times H^{T}) / (W \times H \times H^{T})$
→ We need a plan that assigns a data partitioning scheme to each matrix so that the entire pipeline incurs minimum shuffle overhead.
→ For example, with hash-based data partitioning, the computation pipeline involves multiple shuffles to align the data blocks.
Optimization	2:	optimizing	data	partitioning	in	pipeline	
→ MATFAST determines the partitioning scheme for an input matrix with min shuffle cost according to the cost model.
→ Greedily optimizes each operator:
$s_{i_1}(s_{i_2}) \leftarrow \arg\min_{s_{i_1}(s_{i_2})} C_{comm}(op, s_{i_1}, s_{i_2}, s_{o})$
→ Physical execution plan with optimized data partitioning
[Figure: physical plan over stages 1 to 4 with intermediate results W^T V, W^T W, W^T W H, V H^T, H H^T, and W H H^T; annotated "Row scheme for W" and "Row scheme for V".]
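The greedy rule on this slide can be read as: for each operator, try the candidate schemes for its inputs and keep the combination with the lowest communication cost. The sketch below illustrates that argmin with a made-up cost function. `toy_comm_cost`, its numbers, and the reading of the formula's three scheme arguments as the two input schemes and the output scheme are assumptions, not MatFast's actual cost model.

```python
from itertools import product

SCHEMES = ("r", "c", "b-c", "b")

def toy_comm_cost(op, s_in1, s_in2, s_out):
    """Hypothetical block-shuffle cost; a real model would depend on matrix sizes."""
    if op == "transpose":
        # Assumed free when a row-partitioned input becomes column-partitioned, or vice versa.
        return 0 if {s_in1, s_out} == {"r", "c"} else 8
    if op == "mat-mat":
        # Assumed cheapest with a row-partitioned left input and a broadcast right input.
        cost = 0 if s_in1 == "r" else 6
        cost += 2 if s_in2 == "b" else 6
        cost += 0 if s_out == "r" else 4
        return cost
    raise ValueError(f"unknown operator: {op}")

def pick_input_schemes(op, s_out, cost=toy_comm_cost):
    """Greedy step: argmin over (s_in1, s_in2) of cost(op, s_in1, s_in2, s_out)."""
    if op == "transpose":   # unary operator: only one input scheme to choose
        candidates = [(s, None) for s in SCHEMES]
    else:
        candidates = list(product(SCHEMES, repeat=2))
    return min(candidates, key=lambda c: cost(op, c[0], c[1], s_out))

if __name__ == "__main__":
    print(pick_input_schemes("mat-mat", "r"))     # ('r', 'b')
    print(pick_input_schemes("transpose", "c"))   # ('r', None)
```

Because each operator is optimized on its own, the result is a local choice per operator; a different cost function can be passed through the `cost` parameter to explore how the selected schemes change.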
Case	studies
Experiments
→ Dataset APIs
– Code examples link
→ Compare with state-of-the-art systems
– Spark MLlib (provided matrix operations)
– SystemML (Spark)
– ScaLAPACK
– SciDB
→ Netflix data
– 100,480,507 ratings
– 17,770 movies from 480,189 customers
→ Social network data
PageRank	on	different	datasets
[Figure: PageRank results on different datasets, including MATFAST]
GNMF	on	the	Netflix	dataset
[Figure: GNMF results on the Netflix dataset, including MATFAST]
Future	plan
→ More user-friendly APIs
→ Advanced plan optimizer
→ Python and R interface
→ Vertical applications
Conclusion
→ Proposed and realized MATFAST, an in-memory distributed platform that optimizes query pipelines of matrix operations
→ Takes advantage of dynamic cost-based analysis and rule-based heuristics to generate a query execution plan
→ Communication-efficient data partitioning scheme assignment
Reference
– Yongyang Yu, Mingjie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa M. Abbas, Mourad Ouzzani: In-Memory Distributed Matrix Computation Processing and Optimization. ICDE 2017: 1047-1058
Thanks
Q	&	A
mtang@hortonworks.com
