Taboola's Road to Scale
A Focus on Data & Apache Spark
Engine Focused on Maximizing CTR & Post-Click Engagement
• Collaborative Filtering – Bucketed Consumption Groups
• Geo – Region-based Recommendations
• Context – Metadata
• Social – Facebook/Twitter API
• User Behavior – Cookie Data
Largest Content Discovery and Monetization Network
• 60M monthly unique users
• 130B monthly recommendations
• 1M+ sourced content providers
• 1M+ sourced content items
What Does it Mean?
• Zero downtime allowed
– Every single component is fault tolerant
• 5 data centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
Taboola 2007
• Events and logs (rawdata) written directly to DB
• Recs are read from DB
• Crashed when CNN launched
(Diagram: Frontend – RecServer)
Taboola 2007.5
• Same as before, but without direct write to DB
• Switching to bulk load
• But – very basic reporting, not scalable
(Diagram: Frontend – RecServer, with bulk load into the DB)
Taboola 2008
• Introduced semi-realtime event parsing services: Session Parser and Session Analyzer
• Divided analysis work by unit (session)
• Files were pushed from RecServer(s) to backend processing
• Files are gzipped textual INSERT statements
• But – not real time enough
(Diagram: Frontend/RecServer writes rawdata to NFS; in the backend, SessionParser reads rawdata and writes session files, SessionAnalyzer reads session files and writes summarized data)
Taboola 2010
• Made a leap towards real-time stream processing
• Unified Session Parser and Session Analyzer into an in-memory service (without going through disk)
• Dramatic optimizations to memory allocation and data models
• Failure-safe architecture – can endure data delays and front-end server malfunctions
• No direct DB access – key for performance; bulk loading used only for loading hourly data
(Diagram: Frontend/RecServer writes rawdata to NFS; the backend Session Parser + Analyzer reads rawdata and writes hourly data via bulk loading)
Taboola 2011-2013
• Roughly the same architecture
• Handled backend growth by scaling up (monster machines)
• Introduced real-time analyzers
• Introduced sharding
• Moved to lsync-based file sync
• Introduced Top Reports capabilities
(Diagram: same flow as 2010, with Lsync replacing NFS)
Taboola 2014
• Spark as the distributed engine for data analysis (and distributed computing in general)
• All critical data paths already moved to Spark
• New data modelling based on ProtoStuff(Buf)
• Easily scalable
• Easy ad hoc analysis/research
About Spark
• Open source
• Apache top-level project (since Feb. 19th)
• Databricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos and ZooKeeper, or YARN
Spark Development Community
• With over 100 developers and 25 companies, one of the most active communities in big data
• Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
• Past 6 months: more active devs than Hadoop MapReduce!
The Spark Community
(Image slide: contributor company logos – see Editor's Note #14)
Spark Performance
(Benchmark charts: SQL – response time in seconds for Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem); Graph – response time in minutes for Hadoop, Giraph, GraphX; Streaming – throughput in MB/s/node for Storm vs. Spark)
Spark API
• Simple to write through easy APIs in Java, Scala and Python
• The same analytics code can be used for both streaming data and offline data processing
Spark Key Concepts
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Immutable
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
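To make the lazy/eager split concrete, here is a minimal runnable sketch (data and names invented for illustration): transformations only build the graph, and actions trigger execution.

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-concepts")

nums = sc.parallelize(range(10))           # base RDD
evens = nums.filter(lambda n: n % 2 == 0)  # transformation: lazy, nothing runs yet
squares = evens.map(lambda n: n * n)       # another lazy transformation

print(squares.count())    # action: triggers the computation -> 5
print(squares.collect())  # action: [0, 4, 16, 36, 64]

sc.stop()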
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
. . .

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec vs. 20s for on-disk
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles
(Diagram: a DAG of RDDs A–F divided into Stage 1/2/3 at map, filter, groupBy and reduceByKey boundaries, with cached partitions marked)
Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
(Diagram: your application holds a SparkContext, which runs local threads or talks to a cluster manager; workers run Spark executors against HDFS or other storage)
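As a sketch of that wiring, here is how an application's SparkContext can be pointed at each of those cluster managers; host names below are placeholders, not from the deck.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app")

# Pick one master URL depending on where tasks should run:
conf.setMaster("local[4]")                       # local threads (4 of them)
# conf.setMaster("spark://master-host:7077")     # standalone cluster
# conf.setMaster("mesos://zk://zk1:2181/mesos")  # Mesos, masters found via ZooKeeper

sc = SparkContext(conf=conf)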
System Architecture & Data Flow @ Taboola
(Diagram: FE Servers, C* Cluster, Spark Cluster, Driver + Consumers, MySQL Cluster)
Execution Graph @ Taboola (see the sketch below)
• Data start point (dates, etc.): rdd1 = context.parallelize([data])
• Loading data from external sources (Cassandra, MySQL, etc.): rdd2 = rdd1.mapPartitions(loadfunc())
• Aggregating the data and storing results: rdd3 = rdd2.reduceByKey(reduceFunc())
• Saving the results to a DB: rdd4 = rdd3.mapPartitions(saverfunc())
• Executing the above graph by forcing an output operation: rdd4.count()
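A hedged, runnable PySpark rendering of that graph; the deck's production code is Java-based, so the function bodies here are stand-ins that fabricate and print data.

from pyspark import SparkContext

sc = SparkContext("local[2]", "execution-graph")

def load_partition(dates):
    # Stand-in for fetching raw events per date from Cassandra/MySQL.
    for d in dates:
        yield ("campaign-1:%s" % d, 1)
        yield ("campaign-1:%s" % d, 1)

def save_partition(rows):
    # Stand-in for writing each aggregated row to a DB.
    for row in rows:
        print("would save:", row)
        yield row

rdd1 = sc.parallelize(["2014-02-19", "2014-02-20"])  # data start point (dates)
rdd2 = rdd1.mapPartitions(load_partition)            # load from external sources
rdd3 = rdd2.reduceByKey(lambda a, b: a + b)          # aggregate
rdd4 = rdd3.mapPartitions(save_partition)            # save results
rdd4.count()                                         # force execution of the graph

sc.stop()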
Cassandra as a Distributed Storage
• Event Log Files saved as blobs to a dedicated keyspace
• C* tables holding the Event Log Files are partitioned by day – a new table per day. This makes them easier to maintain and simpler to load into Spark
• Using the Astyanax driver + CQL3
– Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys, then load data by key in batches – in parallel partitions
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
• Worked well, but ended up not using it – instead using mapPartitions (sketched below, after the tables)
– Very simple, no overhead, no need to be tied to Hadoop
– Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this
userevent_2014-02-19
Key (String)                    | Data (blob)
GUID (originally log file name) | Gzipped file
GUID                            | Gzipped file
…                               | …

userevent_2014-02-20
Key (String)                    | Data (blob)
GUID (originally log file name) | Gzipped file
GUID                            | Gzipped file
…                               | …
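A sketch of the mapPartitions-based loading in PySpark terms. The deck's implementation used Java with Astyanax; the DataStax Python driver, host name, and keyspace/schema details below are assumptions for illustration.

import gzip
from pyspark import SparkContext

sc = SparkContext("local[2]", "cassandra-load")

def load_blobs(keys):
    # One Cassandra connection per Spark partition, not per key.
    from cassandra.cluster import Cluster  # DataStax Python driver (assumed)
    cluster = Cluster(["cassandra-host"])
    session = cluster.connect("eventlogs")  # keyspace name is an assumption
    for key in keys:
        row = session.execute(
            'SELECT data FROM "userevent_2014-02-19" WHERE key = %s', (key,)).one()
        if row is not None:
            # Each blob is a gzipped log file; emit its lines.
            for line in gzip.decompress(row.data).decode("utf-8").splitlines():
                yield line
    cluster.shutdown()

# Keys (GUIDs) are assumed to have been fetched up front and split into partitions.
guids = sc.parallelize(["guid-1", "guid-2"], 2)
lines = guids.mapPartitions(load_blobs)  # a lines RDD of raw event strings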
Sample – Click Counting for Campaign Stopping (a sketch covering all four steps follows the next slide)
1. mapPartitions – mapping from strings to objects with a pre-designed click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
Campaign Stopping – Removing Dup Clicks
• When more than one click is found from the same user on the same item, keep only the oldest
• Using accumulators to track duplicate counts
• Notice – not Spark specific
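A hedged PySpark sketch of the four steps plus the dedup rule; the log format, field order, and sample rows are invented for illustration.

from pyspark import SparkContext

sc = SparkContext("local[2]", "campaign-stopping")
dup_clicks = sc.accumulator(0)  # operational counter for duplicates

def parse(lines):
    # Step 1: strings -> ((user, item), (timestamp, campaign, day)) records.
    for line in lines:
        user, item, ts, campaign, day = line.split("\t")
        yield ((user, item), (int(ts), campaign, day))

def keep_oldest(a, b):
    # Step 2: same user clicked the same item more than once - keep the oldest.
    dup_clicks.add(1)
    return a if a[0] <= b[0] else b

raw = sc.parallelize([
    "u1\ti9\t100\tcamp1\t2014-02-19",
    "u1\ti9\t105\tcamp1\t2014-02-19",  # duplicate click, dropped
    "u2\ti7\t101\tcamp1\t2014-02-19",
])

clicks = raw.mapPartitions(parse)                               # 1. click key
unique = clicks.reduceByKey(keep_oldest)                        # 2. dedup
by_campaign = unique.map(lambda kv: ((kv[1][1], kv[1][2]), 1))  # 3. campaign+day key
counts = by_campaign.reduceByKey(lambda a, b: a + b)            # 4. aggregate

print(counts.collect())  # [(('camp1', '2014-02-19'), 2)]
print(dup_clicks.value)  # 1
sc.stop()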
Our Deployment
• 16 nodes, each:
– 24 cores
– 256GB RAM
– 6× 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total cluster power:
– 4096GB RAM
– 384 CPUs
– 96 TB storage (effective space is less – Cassandra keyspaces are defined with replication factor 3)
• Symmetric deployment
– Mesos + Spark
– Cassandra
• More
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
Things that work well with Spark (from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other MapReduce options
– Simple concepts, "small" API
• Easy to unit test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
Things that do not work that well (from our experience)
• Long (endless) running tasks require some workarounds
– Temp files – Spark creates a lot of files in spark.local.dir, which requires periodic cleanup
– Use spark.cleaner.ttl for long-running tasks (see the sketch below)
• Spark Streaming – not fully mature when we tested
– Some edge cases can cause loss of data
– The sliding window / batch model does not fit our needs
• We always load some history to deal with late-arriving data
• State management is left to the user and is not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
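A minimal configuration sketch for those two workarounds; the paths and TTL value are illustrative, and spark.cleaner.ttl is a knob from the Spark versions of that era (since removed).

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("long-running-job")
        .setMaster("local[2]")
        # Directories where Spark spills temp/shuffle files; these
        # accumulate on long-running jobs and need periodic cleanup.
        .set("spark.local.dir", "/mnt/ssd1/spark,/mnt/ssd2/spark")
        # Time-to-live (seconds) for periodic cleanup of old metadata.
        .set("spark.cleaner.ttl", "3600"))

sc = SparkContext(conf=conf)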
General / Optimization Tips
• Use Spark accumulators to collect and report operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse-grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Cassandra – distributed data store
• Monitoring – http://metrics.codahale.com/
Attributions
Many of the general Spark slides were taken from the DataBricks Spark Summit 2013 slides.
There are great materials at:
https://spark.apache.org/
http://spark-summit.org/summit-2013/
Thank You!
tal.s@taboola.com
Editor's Notes
• #14: One of the most exciting things you'll find. Growing all the time. NASCAR slide. Several sponsors of this event are just starting to get involved… If your logo is not up here, forgive us – it's hard to keep up!
• #17: RDD – colloquially referred to as RDDs (e.g. caching in RAM). Lazy operations build RDDs from other RDDs; actions return a result or write it to storage.
• #18: Add "variables" to the "functions" in functional programming.
• #19: NOT a modified version of Hadoop.