Spark meetup TCHUG

LARGE-SCALE ANALYTICS WITH
APACHE SPARK
THOMSON REUTERS R&D
TWIN CITIES HADOOP USER GROUP
FRANK SCHILDER
SEPTEMBER 22, 2014

THOMSON REUTERS
• The Thomson Reuters Corporation
– 50,000+ employees
– 2,000+ journalists at news desks worldwide
– Offices in more than 100 countries
– $12 billion revenue/year
• Products: intelligent information for professionals and enterprises
– Legal: WestlawNext legal search engine
– Financial: Eikon financial platform; Datastream real-time share price data
– News: REUTERS news
– Science: EndNote, ISI journal impact factor, Derwent World Patent Index
– Tax & Accounting: OneSource tax information
• Corporate R&D
– Around 40 researchers and developers (NLP, IR, ML)
– Three R&D sites in the US and one in the UK: Eagan, MN; Rochester, NY;
New York City; and London
– We are hiring… email me at frank.schilder@thomsonreuters.com

OVERVIEW
• Speed
– Data locality, scalability, fault tolerance
• Ease of Use
– Scala, interactive Shell
• Generality
– Spark SQL, MLlib
• Comparing ML frameworks
– Vowpal Wabbit (VW)
– Sparkling Water
• The Future

WHAT IS SPARK?
Apache Spark is a fast and general engine
for large-scale data processing.
• Speed: runs iterative MapReduce jobs
faster thanks to in-memory computation:
Resilient Distributed Datasets (RDDs)
• Ease of use: enables interactive data analysis
in Scala, Python, or Java; interactive Shell
• Generality: offers libraries for SQL, Streaming
and large-scale analytics (graph processing
and machine learning)
• Integrated with Hadoop: runs on Hadoop 2’s
YARN cluster

ACKNOWLEDGMENTS
• Matei Zaharia and the AMPLab and Databricks teams for
fantastic learning material and tutorials on Spark
• Hiroko Bretz, Thomas Vacek, Dezhao Song, Terry
Heinze for Spark and Scala support and running
experiments
• Adam Glaser for his time as a TSAP intern
• Mahadev Wudali and Mike Edwards for letting us
play in the “sandbox” (cluster)

PRIMARY GOALS OF SPARK
• Extend the MapReduce model to better support
two common classes of analytics apps:
– Iterative algorithms (machine learning, graphs)
– Interactive data mining (R, Python)
• Enhance programmability:
– Integrate into Scala programming language
– Allow interactive use from Scala interpreter
– Make Spark easily accessible from other
languages (Python, Java)

MOTIVATION
• Acyclic data flow is inefficient for
applications that repeatedly reuse a working
set of data:
– Iterative algorithms (machine learning, graphs)
– Interactive data mining tools (R, Python)
• With current frameworks, apps reload data
from stable storage on each query

SOLUTION: RESILIENT DISTRIBUTED DATASETS (RDDs)
• Allow apps to keep working sets in memory for
efficient reuse
• Retain the attractive properties of MapReduce
– Fault tolerance, data locality, scalability
• Support a wide range of applications

PROGRAMMING MODEL
Resilient distributed datasets (RDDs)
– Immutable, partitioned collections of objects
– Created through parallel transformations (map, filter,
groupBy, join, …) on data in stable storage
– Functions follow the same patterns as Scala operations
on lists
– Can be cached for efficient reuse
80+ Actions on RDDs
– count, reduce, save, take, first, …
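The point that RDD functions follow the same patterns as Scala list operations can be illustrated with ordinary Scala collections. The following is a minimal sketch on a local List, with made-up sample data and no Spark involved; the same method names (map, filter, groupBy, reduce, take) are the ones the slide lists for RDDs.

```scala
// Plain Scala collections only; the word list is invented for illustration.
val words = List("spark", "hadoop", "spark", "yarn", "hadoop", "spark")

// Transformation-style operations: each returns a new immutable collection.
val longWords = words.filter(_.length > 4).map(_.toUpperCase)
val grouped = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }

// Action-style operations: each returns a plain value.
val totalChars = words.map(_.length).reduce(_ + _)
val firstTwo = words.take(2)

println(longWords)
println(grouped)
println(totalChars)
println(firstTwo)
```

On an RDD the transformation-style calls would be evaluated lazily and in parallel across partitions, which a local List does not model.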

EXAMPLE: LOG MINING
Load error messages from a log into memory, then
interactively search for various patterns:
val lines = spark.textFile("hdfs://...")          // base RDD
val errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("timeout")).count    // action
cachedMsgs.filter(_.contains("license")).count
. . .
[Figure: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), keeps its part of cachedMsgs in memory (Cache 1-3), and returns results to the driver]
Result: full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)

BEHAVIOR WITH NOT ENOUGH RAM
[Figure: bar chart of iteration time (s) by % of working set in memory. Cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]

RDD Fault Tolerance
RDDs maintain lineage information that can be used
to reconstruct lost partitions
Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))
Lineage: HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD
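The idea of rebuilding a lost partition from lineage can be sketched with plain Scala collections. This is a toy model under stated assumptions, not Spark's implementation: the two source blocks, the `lineage` function, and the `cached` variable are all hypothetical names invented for the illustration.

```scala
// Toy model of lineage-based recovery; no Spark involved.
// Two made-up source blocks standing in for blocks of an HDFS file.
val source: Vector[Vector[String]] = Vector(
  Vector("ERROR\ta\tdisk full", "INFO\tb\tok"),
  Vector("ERROR\tc\ttimeout")
)

// The lineage: the recorded recipe (filter, then map) applied per partition.
val lineage: Vector[String] => Vector[String] =
  part => part.filter(_.startsWith("ERROR")).map(_.split('\t')(2))

// Materialized partitions; None models a partition that was lost.
var cached: Vector[Option[Vector[String]]] = source.map(p => Some(lineage(p)))
cached = cached.updated(1, None) // simulate losing partition 1

// Recovery: recompute only the missing partition from its source block.
val recovered = cached.zipWithIndex.map {
  case (Some(p), _) => p
  case (None, i)    => lineage(source(i))
}
println(recovered)
```

Only the lost partition is recomputed; the surviving one is reused as-is, which is the property the slide attributes to lineage information.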

Fault Recovery Results
[Figure: iteration time (s) for iterations 1-10, comparing a run with no failure to a run with a failure in the 6th iteration. Values shown: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59]

INTERACTIVE SHELL
• Data analysis can be done in the interactive shell.
– Start from local machine or cluster
– Use multiple cores on one machine with local[n]
– The Spark context is already set up for you as sc (a SparkContext)
• Load data from anywhere (local, HDFS,
Cassandra, Amazon S3 etc.):
• Start analyzing your data:
[Screenshot: shell session loading a local data file; callout marks where processing starts]

ANALYZE YOUR DATA
• Word count in one line:
• List the word counts:
• Broadcast variables (e.g. dictionary, stop word list)
because local variables need to be distributed to the workers:
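The three steps on this slide (word count, listing the counts, broadcasting a stop word list) might look like the following in the Scala shell. This is a sketch, not the code from the original slide: the input sentences, the stop word set, and the application name are made up, and it assumes Spark is on the classpath. In spark-shell, `sc` already exists, so the two lines that create it should be skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey on Spark 1.x

// Outside spark-shell we create the context ourselves; local[2] uses two cores.
val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Made-up input; a real session would typically use sc.textFile(...) instead.
val lines = sc.parallelize(Seq("to be or not to be", "to spark or not to spark"))

// Broadcast a small, read-only stop word set once to every worker.
val stopWords = sc.broadcast(Set("to", "or"))

// Word count: split, drop stop words, pair each word with 1, sum per word.
val counts = lines
  .flatMap(_.split(" "))
  .filter(w => !stopWords.value.contains(w))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// collect() is the action that triggers the computation and returns the counts.
val result = counts.collect().toMap
result.foreach { case (w, n) => println(s"$w: $n") }

sc.stop()
```

Using a broadcast variable avoids shipping the stop word set with every task, which matters once the dictionary is large.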

PYTHON SHELL & IPYTHON
• The interactive shell can also be started as a Python
shell called PySpark:
• Start analyzing your data in Python now:
• Since it’s Python, you may want to use IPython
– (command shell for interactive programming in your
browser):

IPYTHON AND SPARK
• The IPython notebook environment and PySpark:
– Document data analysis results
– Carry out machine learning experiments
– Visualize results with matplotlib or other visualization
libraries
– Combine with NLP libraries such as NLTK
• PySpark does not offer the full functionality of
Spark Shell in Scala (yet)
• Some bugs (e.g. problems with Unicode)

PROJECTS AT R&D USING SPARK
• Entity linking
– Alternative name extraction from
Wikipedia, Freebase, free text, ClueWeb12;
a web collection several TB in size (planned)
• Large-scale text data analysis:
– creating fingerprints for entities/events
– Temporal slot filling: Assigning a begin and end time
stamp to a slot filler (e.g. A is employee of company B
from BEGIN to END)
– Large-scale text classification of Reuters News Archive
articles (10 years)
• Language model computation used for search
query analysis

SPARK MODULES
• Spark streaming:
– Processing real-time data streams
• Spark SQL:
– Support for structured data (JSON, Parquet) and
relational queries (SQL)
• MLlib:
– Machine learning library
• GraphX:
– New graph processing API

SPARK SQL
• Relational queries expressed in
– SQL
– HiveQL
– Scala domain-specific language (DSL)
• New type of RDD: SchemaRDD:
– RDD composed of Row objects
– Schema defined explicitly or inferred from a Parquet file, JSON
data set, or data stored in Hive
• Spark SQL is in alpha: the API may change in the
future!

MLLIB
• A machine learning module that comes with Spark
• Shipped since Spark 0.8.0
• Provides various machine learning algorithms for
classification and clustering
• Sparse vector representation since 1.0.0
• New features in recently released version 1.1.0:
– Includes a standard statistics library (e.g. correlation,
Hypothesis testing, sampling)
– More algorithms ported to Java and Python
– More feature engineering: TF-IDF, Singular Value
Decomposition (SVD)

MLLIB
• Provides various machine learning algorithms:
– Classification:
• Logistic regression, support vector machine (SVM), naïve
Bayes, decision trees
– Regression:
• Linear regression, regression trees
– Collaborative Filtering:
• Alternating least squares (ALS)
– Clustering:
• K-means
– Decomposition
• Singular value decomposition (SVD), Principal component
analysis (PCA)

OTHER ML FRAMEWORKS
• Mahout
• LIBLINEAR
• MATLAB
• Scikit-learn
• GraphLab
• R
• Weka
• Vowpal Wabbit
• BigML

LARGE-SCALE ML INFRASTRUCTURE
• More data implies bigger training sets and richer
feature sets.
• More data with simple ML algorithm often beats
small data with complicated ML algorithm
• Large-scale ML requires big data infrastructure:
– Faster processing: Hadoop, Spark
– Feature engineering: Principal Component Analysis,
Hashing trick, Word2Vec

PREDICTIVE ANALYTICS WITH MLLIB
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

VW AND MLLIB COMPARISON
• We compared Vowpal Wabbit and MLlib in
December 2013 (work with Tom Vacek)
• Vowpal Wabbit (VW) is a large-scale ML tool
developed by John Langford (Microsoft)
• Task: binary text classification on Reuters
articles
– Ease of implementation
– Feature Extraction
– Parameter tuning
– Speed
– Accessibility of programming languages

VW VS. MLLIB
• Ease of implementation
– VW: user tool designed for ML, not programming language
– MLlib: programming language, some support now (e.g. regularization)
• Feature Extraction
– VW: specific capabilities for bi-grams, prefix etc.
– MLlib: no limit in terms of creating features
• Parameter tuning
– VW: no parameter search capability, but multiple parameters can be hand-tuned
– MLlib: offers cross-validation
• Speed
– VW: highly optimized, very fast even on a single machine with multiple cores
– MLlib: fast with lots of machines
• Accessibility of programming languages
– VW: written in C++, a few wrappers (e.g. Python)
– MLlib: Scala, Python, Java
• Conclusion at the end of 2013: VW had a slight advantage, but MLlib has since caught up in at
least some of the areas (e.g. sparse feature representation)

FINDINGS SO FAR
• Large-scale extraction is a great fit for Spark when
working with large data sets (> 1GB)
• Ease of use makes Spark an ideal framework for
rapid prototyping.
• MLlib is a fast growing ML library, but “under
development”
• Vowpal Wabbit has been shown to crunch even
large data sets with ease.
[Figure: bar chart comparing 0/1 loss and time for vw, liblinear, and Spark local[4]]

OTHER ML FRAMEWORKS
• Internship by Adam Glaser compared various ML
frameworks with 5 standard data sets (NIPS)
– Mass-spectrometric data (cancer), handwritten digit
detection, Reuters news classification, synthetic data sets
– Data sets were not very big, but had up to 1,000,000
features
• Evaluated the accuracy of the generated models and
their training time
• H2O, GraphLab and Microsoft Azure showed strong
performance in terms of accuracy and training
time.

WHAT IS NEXT?
• 0xdata plans to release Sparkling Water in October
2014
• Microsoft Azure also offers a strong platform with
multiple ML algorithms and an intuitive user interface
• GraphLab has GraphLab Canvas ™ for visualizing your
data and plans to incorporate more ML algorithms.

CONCLUSIONS
• Apache Spark is the most active project in the Hadoop
ecosystem
• Spark offers speed and ease of use because of
– RDDs
– Interactive shell and
– Easy integration of Scala, Java, Python scripts
• Integrated in Spark are modules for
– Easy data access via Spark SQL
– Large-scale analytics via MLlib
• Other ML frameworks enable analytics as well
• Evaluate which framework is the best fit for your data
problem

THE FUTURE?
• Apache Spark will be a unified platform for running
various workloads:
– Batch
– Streaming
– Interactive
• And connect with different runtime systems
– Hadoop
– Cassandra
– Mesos
– Cloud
– …

THE FUTURE?
• Spark will extend its offering of large-scale
algorithms for doing complex analytics:
– Graph processing
– Classification
– Clustering
– …
• Other frameworks will continue to offer similar
capabilities.
• If you can’t beat them, join them.

http://labs.thomsonreuters.com/about-rd-careers/
FRANK.SCHILDER@THOMSONREUTERS.COM

Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: scatter plot of + and – points, showing a random initial line and the target separating line]

Example: Logistic Regression
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
for (i <- 1 to ITERATIONS) {
  // Sum the per-point gradients of the logistic loss over the cached data
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
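The slide code relies on a Spark context, a `readPoint` parser, and a vector class that are not shown. As a rough, self-contained sketch of the same update rule, the version below runs on a local Seq with plain arrays. The `Point` case class, the `dot` helper, the four data points, the seed, and the iteration count are all assumptions made for this illustration.

```scala
import scala.math.exp
import scala.util.Random

// A labeled point; y is +1.0 or -1.0.
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

// Tiny made-up, linearly separable data set (label follows the first coordinate).
val data = Seq(
  Point(Array(2.0, 1.0), 1.0), Point(Array(1.5, -0.5), 1.0),
  Point(Array(-1.0, 0.5), -1.0), Point(Array(-2.0, -1.0), -1.0)
)

val rng = new Random(42)
var w = Array.fill(2)(rng.nextDouble()) // random initial line

for (_ <- 1 to 50) {
  // Same per-point gradient as on the slide, summed with reduce.
  val gradient = data
    .map { p =>
      val scale = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
      p.x.map(_ * scale)
    }
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}

// Classify each point by the side of the line it falls on.
val predictions = data.map(p => if (dot(w, p.x) > 0) 1.0 else -1.0)
println("Final w: " + w.mkString(", "))
```

In the Spark version, `data` is a cached RDD, so each pass of the loop reuses the in-memory points instead of rereading them from storage.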

Logistic Regression Performance
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: 127 s/iteration. Spark: first iteration 174 s, further iterations 6 s]

Spark Scheduler
• Dryad-like DAGs
• Pipelines functions within a stage
• Cache-aware work reuse & locality
• Partitioning-aware to avoid shuffles
[Figure: DAG of RDDs A-G connected by groupBy, map, union and join, divided into Stage 1, Stage 2 and Stage 3; shaded boxes mark cached data partitions]

Spark Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to driver program):
collect, reduce, count, save, lookupKey
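The split between transformations and actions can be seen in a short pair-RDD session. This is a sketch assuming Spark is on the classpath; the two small data sets and the application name are made up. In spark-shell, `sc` already exists and the two lines that create it should be skipped.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations on Spark 1.x

val conf = new SparkConf().setAppName("ops-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Made-up key/value data.
val sales = sc.parallelize(Seq(("mn", 3), ("ny", 5), ("mn", 4)))
val sites = sc.parallelize(Seq(("mn", "Eagan"), ("ny", "Rochester")))

// Transformations: each defines a new RDD; nothing is computed yet.
val totals = sales.reduceByKey(_ + _)
val joined = totals.join(sites).mapValues { case (total, site) => s"$site=$total" }

// Actions: these run the job and return results to the driver program.
val n = joined.count()
val rows = joined.collect().sortBy(_._1)
rows.foreach(println)

sc.stop()
```

Because transformations are lazy, the scheduler sees the whole chain before running it and can pipeline the steps that do not need a shuffle.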
