As simple as Apache Spark


The document discusses Apache Spark and its ecosystem. It begins by introducing the speaker, who has 5 years of experience in knowledge discovery and has used big data technologies such as Hadoop and Spark. It then explains that Spark provides a versatile ecosystem for batch, streaming, SQL, machine learning and graph processing workloads through components such as Spark Core, Spark SQL, Spark Streaming, MLlib and GraphX. The document demonstrates Spark's seamless integration with an example that runs a SQL query, trains a machine learning model and performs streaming analysis in one workflow. It encourages attendees to start using Spark by downloading it and experimenting with hands-on coding examples.


Data Science Warsaw, 2015.10.13
As simple as
Apache Spark

About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)

We still have about 19 minutes...

Obligatory word count example
■ Task: to count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
Tell me you already know it!
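
The word count task above can be sketched in plain Python, with no Hadoop or Spark dependency, to make the map, shuffle and reduce steps visible. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are illustrative assumptions, not part of the talk.

```python
import re
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word found in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Group emitted counts by word, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: reduce(lambda x, y: x + y, counts) for word, counts in groups.items()}


def word_count(lines):
    """Run the three phases over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))


# Hypothetical sample input, only for illustration.
counts = word_count(["to be or not to be", "To be is to do"])
print(counts)
```

In the PySpark shell the same idea is commonly written as one chain: sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y), where the shuffle step is handled by Spark itself.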

All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
● Map-Reduce: good for batch
● Pig: Scripts
● Oozie: Workflows
● Mahout: Machine Learning
● Hive: SQL Queries
● Impala: Ad-hoc Queries
● Storm: Real Time Streaming
● Giraph: Graphs

Hadoop ecosystem is (too) large
■ Using multiple libraries results in
● long deployment, costly support, and the burden of administering a number of configuration files
● lots of glue code between libraries
■ The libraries in question
● Map-Reduce: good for batch
● Pig: Scripts
● Oozie: Workflows
● Mahout: Machine Learning
● Hive: SQL Queries
● Impala: Ad-hoc Queries
● Storm: Real Time Streaming
● Giraph: Graphs

Let’s walk into Big Data like a boss with Spark.

Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
● Spark Core
● Spark SQL
● Spark Streaming: near real-time
● MLlib: machine learning
● GraphX: graph processing
● SparkR: R on Spark

Spark ecosystem is versatile yet seamless
"One to rule them all"

Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring nodes.
3. Every 3 seconds, classify the real-time tweets from the last 20 seconds and count them for each cluster.
points = sc.runSql[Double, Double]("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)
sc.twitterStream(...)
.map(lambda t: (model.closestCenter(t.location), 1))
.reduceByKeyAndWindow(lambda x,y: x+y , Seconds(20), Seconds(3))
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
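
Steps 2 and 3 of the example above can be imitated in plain Python to show what "closest center" and "window of 20 seconds sliding every 3 seconds" mean. This is only a sketch under stated assumptions: the cluster centers are taken as already trained, timestamps are whole seconds, and every name here (closest_center, windowed_cluster_counts, the sample coordinates) is hypothetical rather than Spark API.

```python
from collections import Counter


def closest_center(point, centers):
    """Return the index of the center nearest to `point` (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


def windowed_cluster_counts(events, centers, until, window=20, slide=3):
    """Every `slide` seconds, count per cluster the events seen in the last `window` seconds.

    `events` is a list of (timestamp_in_seconds, (latitude, longitude)) pairs.
    Returns a list of (time, Counter) pairs for time = slide, 2*slide, ..., up to `until`.
    """
    results = []
    for t in range(slide, until + 1, slide):
        in_window = [loc for ts, loc in events if t - window < ts <= t]
        results.append((t, Counter(closest_center(loc, centers) for loc in in_window)))
    return results


# Hypothetical centers and tweet positions, only for illustration.
centers = [(52.2, 21.0), (50.1, 19.9)]
events = [(1, (52.3, 21.1)), (2, (50.0, 20.0)), (5, (52.1, 20.9)), (24, (50.2, 19.8))]

for t, counts in windowed_cluster_counts(events, centers, until=27):
    print(t, dict(counts))
```

Spark Streaming does the same bookkeeping incrementally and in a distributed fashion; the sketch recomputes each window from scratch, which is fine for a handful of events but not for a real stream.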

How to start?

First: download
Second: ./bin/pyspark
Third: Code much and often
from pyspark.mllib.recommendation import ALS, Rating

# Run inside ./bin/pyspark, where the SparkContext `sc` is already defined.
# Transform strings "user_id,movie_id,rating" into Rating(user: int, product: int, rating: float)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')).map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))
# Build the recommendation model using ALS:
# factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10
numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)
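
To make the factorization A ~= B x C more concrete, here is a minimal plain-Python sketch of alternating least squares restricted to a single latent feature (f = 1), so each update has a closed form. It is not how Spark implements ALS; the function name als_rank1, the toy ratings and the regularization value are assumptions chosen for illustration.

```python
def als_rank1(ratings, num_users, num_items, iterations=50, reg=0.01):
    """Alternating least squares with one latent feature per user and per item.

    `ratings` is a list of (user, item, rating) triples with 0-based ids.
    Returns (u, v) so that u[user] * v[item] approximates the rating.
    """
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(iterations):
        # Fix item factors and solve the regularized least-squares problem per user.
        for i in range(num_users):
            seen = [(j, r) for a, j, r in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in seen) / (sum(v[j] ** 2 for j, _ in seen) + reg)
        # Fix user factors and solve the same problem per item.
        for j in range(num_items):
            seen = [(i, r) for i, b, r in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in seen) / (sum(u[i] ** 2 for i, _ in seen) + reg)
    return u, v


# Toy data: three users, two movies; user 2 has not rated movie 1.
toy_ratings = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0)]
u, v = als_rank1(toy_ratings, num_users=3, num_items=2)

# Predicted rating for the missing (user 2, movie 1) entry.
prediction = u[2] * v[1]
print(round(prediction, 2))
```

With more than one feature, each closed-form division becomes a small linear system per user and per item; the alternation between the two sides stays the same.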

Collaborative Filtering: Problem Statement
● We know which products a particular user prefers
● Given these preferences, recommend to that user a product which she or he is likely to purchase
● Iterative method

Collaborative Filtering: Problem Statement
[USER x MOVIE] = [USER x ~TASTE] (demand) X [~TASTE x MOVIE] (supply)
Demo time!

Tweets exploration
Demo time!

Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
● Spark Core
● Spark SQL
● Spark Streaming: near real-time
● MLlib: machine learning
● GraphX: graph processing
● SparkR: R on Spark

What next? Trainings!

Thank you!
Piotr Dendek
pdendek@icm.edu.pl
@pjden
