Data Science Warsaw, 2015.10.13
As simple as Apache Spark
About me
● At ICM for 5 years
● Knowledge Discovery in Documents
○ Object disambiguation, document classification, document similarity, etc.
● Big enough to use Big Data ecosystems
○ Hadoop since 2012
○ Spark since 2013 (2014 for real)
We still have about 19 minutes...
Obligatory word count example
■ Task: to count the number of occurrences of each word in a text
■ Frequently used when introducing the MapReduce paradigm
Tell me you already know it!
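For reference, here is a minimal sketch of the word count task in plain Python, so it runs without a Spark installation. It mirrors the two MapReduce phases: every word is mapped to a (word, 1) pair, then the counts are summed per word. The function name `word_count` and the sample text are illustrative assumptions, not material from the talk.

```python
from collections import defaultdict


def word_count(lines):
    """Count occurrences of each word, MapReduce style."""
    # Map phase: emit a (word, 1) pair for every word in every line.
    pairs = [(word, 1) for line in lines for word in line.lower().split()]
    # Reduce phase: sum the counts that share the same key (word).
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


if __name__ == "__main__":
    text = ["to be or not to be", "that is the question"]  # made-up sample input
    print(word_count(text))
```

In PySpark the same pipeline is usually expressed on an RDD with `flatMap`, `map` and `reduceByKey`; the sketch above only imitates that flow on a local list.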
All rights reserved, © 2015 ICM UW
Hadoop has a rich set of libraries
Map-Reduce — good for batch
Pig — Scripts
Oozie — Workflows
Mahout — Machine Learning
Hive — SQL Queries
Impala — Ad-hoc Queries
Storm — Real Time Streaming
Giraph — Graphs
Hadoop ecosystem is (too) large
Map-Reduce — good for batch
■ Using multiple libraries results in
● long deployments, costly support, and the burden of administering a number of configuration files
● lots of glue code between libraries
Pig — Scripts
Oozie — Workflows
Mahout — Machine Learning
Hive — SQL Queries
Impala — Ad-hoc Queries
Storm — Real Time Streaming
Giraph — Graphs
Let’s walk into Big Data like a Boss with Spark.
Spark ecosystem is versatile yet seamless
Ecosystem of high-level tools for various use-cases
Spark Core
Spark SQL
Spark Streaming (near real-time)
MLlib (machine learning)
GraphX (graph processing)
SparkR (R on Spark)
Spark ecosystem is versatile yet seamless
"One to rule them all"
Example: versatile yet seamless
1. Select positions from historic tweets.
2. Train a model of 10 clusters of neighbouring positions.
3. Every 3 seconds, classify the real-time tweets from the last 20 seconds and count them for each cluster.
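# Pseudocode in the spirit of the cited talk: runSql, twitterStream and closestCenter are illustrative names, not actual PySpark calls.
points = sc.runSql("SELECT latitude, longitude FROM historic_tweets")
model = KMeans.train(points, 10)  # 10 clusters
counts = (sc.twitterStream(...)
    .map(lambda t: (model.closestCenter(t.location), 1))
    .reduceByKeyAndWindow(lambda x, y: x + y, Seconds(20), Seconds(3)))  # 20 s window, sliding every 3 s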
Source: The State of Spark, and Where We’re Going Next, presentation by M. Zaharia, 2013
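Step 3 can be tried out locally with a small plain-Python sketch. It assigns each tweet position to its nearest cluster centre and counts positions per cluster inside a 20-second window that is re-evaluated every 3 seconds. Everything here is an assumption made for illustration: the function names `closest_center` and `windowed_counts`, the fixed centres (a real run would take them from the trained model), and the sample events.

```python
from collections import Counter


def closest_center(point, centers):
    """Return the index of the centre nearest to point (squared Euclidean distance)."""
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(range(len(centers)), key=lambda i: dist2(centers[i]))


def windowed_counts(events, centers, now, window=20):
    """Count events per cluster among those with now - window < timestamp <= now.

    events is a list of (timestamp_seconds, (latitude, longitude)) pairs.
    """
    counts = Counter()
    for ts, location in events:
        if now - window < ts <= now:
            counts[closest_center(location, centers)] += 1
    return dict(counts)


if __name__ == "__main__":
    centers = [(52.2, 21.0), (50.1, 19.9)]  # hypothetical cluster centres
    events = [
        (1, (52.3, 21.1)),
        (10, (50.0, 20.0)),
        (25, (52.1, 20.9)),
        (27, (50.2, 19.8)),
    ]
    # Re-evaluate the 20-second window every 3 seconds, as in step 3.
    for now in range(21, 31, 3):
        print(now, windowed_counts(events, centers, now))
```

Unlike a streaming engine, this sketch rescans all events for every window; it is meant to show the semantics of the window, not an efficient implementation.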
How to start?
First: download
Second: ./bin/pyspark
Third: code a lot, and often
from pyspark.mllib.recommendation import ALS, Rating

# Transform strings "user_id,movie_id,rating" into Rating(user: int, product: int, rating: float)
data = sc.textFile("path/to/data.csv")
ratings = data.map(lambda s: s.split(',')) \
              .map(lambda arr: Rating(int(arr[0]), int(arr[1]), float(arr[2])))

# Build the recommendation model using ALS:
# factor the rating matrix A=[n,m] into B=[n,f] and C=[f,m], where A ~= B x C
numFeatures = 10
numIterations = 20
model = ALS.train(ratings, numFeatures, numIterations, 0.01)  # 0.01 is the regularization parameter
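The parsing step can be checked without a cluster. The sketch below applies the same string-to-rating transformation to a local list. The `Rating` namedtuple is a hypothetical stand-in for the Spark class of the same name, and the sample rows are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the Spark Rating class, so the sketch runs without Spark.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def parse_rating(line):
    """Turn a "user_id,movie_id,rating" string into a Rating tuple."""
    user_id, movie_id, rating = line.strip().split(",")
    return Rating(int(user_id), int(movie_id), float(rating))


if __name__ == "__main__":
    sample = ["1,10,4.0", "1,20,2.5", "2,10,5.0"]  # made-up sample rows
    for r in map(parse_rating, sample):
        print(r)
```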
Collaborative Filtering: Problem Statement
● We know which products are preferred by a particular user
● Given these preferences, recommend to a particular user a product which she or he is likely to purchase
● Iterative method
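The iterative method can be illustrated with a tiny alternating least squares sketch in plain Python. It uses a single latent feature (f = 1): with the movie factors fixed, each user factor has a closed-form regularized least-squares solution, and then the roles are swapped. This is a simplified illustration under stated assumptions, not Spark's implementation; the function name `als_rank1`, the regularization value, and the sample ratings are made up.

```python
def als_rank1(ratings, num_iterations=20, lam=0.01):
    """Factor observed ratings r[u, m] ~= b[u] * c[m] with one latent feature.

    ratings is a list of (user, movie, rating) triples.
    """
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    b = {u: 1.0 for u in users}   # user factors
    c = {m: 1.0 for m in movies}  # movie factors
    for _ in range(num_iterations):
        # Fix movie factors; each user factor has a closed-form ridge solution.
        for u in users:
            num = sum(r * c[m] for uu, m, r in ratings if uu == u)
            den = lam + sum(c[m] ** 2 for uu, m, _ in ratings if uu == u)
            b[u] = num / den
        # Fix user factors; solve for each movie factor the same way.
        for m in movies:
            num = sum(r * b[u] for u, mm, r in ratings if mm == m)
            den = lam + sum(b[u] ** 2 for u, mm, _ in ratings if mm == m)
            c[m] = num / den
    return b, c


if __name__ == "__main__":
    # Made-up ratings; user 2 has not rated movie 20.
    data = [(1, 10, 2.0), (1, 20, 4.0), (2, 10, 3.0)]
    b, c = als_rank1(data)
    # Predict the unobserved rating of user 2 for movie 20.
    print(round(b[2] * c[20], 2))
```

The product `b[u] * c[m]` is the predicted rating, so the highest predictions among a user's unrated movies are the candidates to recommend.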
Collaborative Filtering: Problem Statement
[Diagram: (USER x MOVIE) = (USER x ~TASTE, DEMAND) X (~TASTE x MOVIE, SUPPLY)]
Demo time!
Tweets exploration
Demo time!
What next? Trainings!
Thank you!
Piotr Dendek
pdendek@icm.edu.pl
@pjden
