Jörg Schad, Mesosphere
SMACK STACK AND BEYOND
BUILDING FAST DATA PIPELINES
@dcos @joerg_schad
© 2017 Mesosphere, Inc. All Rights Reserved. 2
Jörg Schad
Software Engineer @Mesosphere
@joerg_schad
@joerg.mesosphere
Ancient Times...
MapReduce is crunching data
Today...
But then business demanded FAST DATA
We need to turn faster!
Data Processing
Processing styles, slowest to fastest: Batch, Micro-Batch, Event Processing
Latency scale: Days, Hours, Minutes, Seconds, Microseconds
Slower end: reports what has happened using descriptive analytics
Faster end: solves problems using predictive and prescriptive analytics
Example uses: Billing, Chargeback; Product recommendations; Real-time Advertising; Real-time Pricing and Routing; Predictive User Interface
Fast Data Pipelines
Sources: Devices, Client, Sensors
Paths: Data Ingestion, Request/Response
Components: Message Queue/Bus, Microservices, Distributed Storage, Analytics (Streaming)
Fast Data Pipelines
EVENTS: Ubiquitous data streams from connected devices
INGEST: Ingest millions of events per second
STORE: Distributed & highly scalable database
ANALYZE: Real-time and batch process data
ACT: Visualize data and build data driven applications
SMACK Stack
EVENTS: Ubiquitous data streams from connected devices (Sensors, Devices, Clients)
INGEST: Apache Kafka. Ingest millions of events per second
STORE: Apache Cassandra. Distributed & highly scalable database
ANALYZE: Apache Spark. Real-time and batch process data
ACT: Akka. Visualize data and build data driven applications
Runs on: Apache Mesos
Message Queues
[Diagram repeated: SMACK Stack pipeline (INGEST: Apache Kafka, STORE: Apache Cassandra, ANALYZE: Apache Spark, ACT: Akka), here running on DC/OS]
MESSAGE QUEUES
Apache Kafka
ØMQ, RabbitMQ, Disque (Redis-based), etc.
fluentd, Logstash, Flume
Akka streams
cloud-only: AWS SQS, Google Cloud Pub/Sub
APACHE KAFKA
High-throughput, distributed, persistent publish-subscribe messaging system
Originates from LinkedIn
Typically used as buffer/de-coupling layer in online stream processing
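The buffer/de-coupling role can be illustrated with a small in-memory sketch of a persistent publish-subscribe log. This is not the Kafka API: `TopicLog`, `publish`, and `poll` are hypothetical names, and the only idea shown is that messages are retained in order while each consumer tracks its own read position.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for one topic (NOT the Kafka API)."""

    def __init__(self):
        self._records = []                 # retained messages, in arrival order
        self._offsets = defaultdict(int)   # next offset to read, per consumer

    def publish(self, message):
        """Append a message and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets[consumer]
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for reading in ("t=1", "t=2", "t=3"):
    log.publish(reading)

fast = log.poll("dashboard")        # a fast consumer reads everything available
slow = log.poll("archiver", 1)      # a slow consumer reads at its own pace
log.publish("t=4")                  # the producer never waits for either one
fast_next = log.poll("dashboard")
slow_next = log.poll("archiver")
```

Because the log keeps messages after they are read, the slow consumer catches up later without the producer or the fast consumer being affected.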
fluentd
HOW TO CHOOSE?
• Scalability
• Message Type
• Log vs …
• Delivery Guarantees/Message durability
• Routing Capabilities
• Failover
• Community
• Mesos Support ;-)
DELIVERY GUARANTEES
At most once—Messages may be lost but are never redelivered.
At least once—Messages are never lost but may be redelivered.
Exactly once—this is what people actually want, each message
is delivered once and only once.
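The three guarantees can be compared with a small simulation. This is a sketch under a deliberately simplified failure model (a message is either lost on its first transmission, or it arrives but its acknowledgement is lost); `send_all` and `dedupe` are hypothetical helper names. De-duplicating by message ID on the consumer side is shown as one common way to approximate exactly-once behaviour on top of at-least-once delivery.

```python
def send_all(messages, lost, ack_lost, retry):
    """Simulate a flaky link.

    lost:     indices whose first transmission never arrives
    ack_lost: indices that arrive but whose acknowledgement is lost
    retry:    False models at-most-once, True models at-least-once
    """
    received = []
    for i, msg in enumerate(messages):
        if i in lost:
            if retry:
                received.append(msg)   # sender times out and resends
        else:
            received.append(msg)
            if i in ack_lost and retry:
                received.append(msg)   # no ack seen, so a duplicate is sent
    return received


def dedupe(received):
    """Idempotent consumer: ignore message IDs that were already handled."""
    seen, out = set(), []
    for msg in received:
        if msg not in seen:
            seen.add(msg)
            out.append(msg)
    return out


msgs = ["m0", "m1", "m2", "m3"]
at_most_once = send_all(msgs, lost={1}, ack_lost={2}, retry=False)
at_least_once = send_all(msgs, lost={1}, ack_lost={2}, retry=True)
effectively_once = dedupe(at_least_once)
```

Without retries `m1` is lost; with retries nothing is lost but `m2` arrives twice; the de-duplicating consumer sees each message once.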
Murphy’s Law of Distributed Systems:
Anything that can go wrong, will go wrong … partially!
Routing
Simple Pipes vs. Routing
Stream Processing
[Diagram repeated: SMACK Stack pipeline (INGEST: Apache Kafka, STORE: Apache Cassandra, ANALYZE: Apache Spark, ACT: Akka), here running on DC/OS]
STREAM PROCESSING
• Apache Storm
• Apache Spark
• Apache Samza
• Apache Flink
• Apache Apex
• Concord
• cloud-only: AWS Kinesis, Google Cloud Dataflow
APACHE SPARK
APACHE SPARK (STREAMING 2.0)
Typical Use: distributed, large-scale data processing; micro-batching
Why Spark Streaming?
• Micro-batching gives near-real-time latency (bounded by the batch interval) and can yield higher throughput than per-record processing
• Well-defined role means it fits in well with other pieces of the pipeline
HOW TO CHOOSE?
• Execution Model
  • Native Streaming vs Microbatch
• Fault Tolerance Granularity
  • Per record, per batch
• Delivery Guarantees
• API
  • SQL
  • Spark
• Performance…
  • Realtime ≠ Realtime
• Community
• Mesos Support ;-)
EXECUTION MODEL
Micro-Batching vs. Native Streaming
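The two execution models named on this slide can be contrasted in a few lines. This is a minimal sketch with no streaming engine involved: `native_streaming`, `micro_batching`, and `total` are hypothetical helpers, and the 100 ms interval is an arbitrary value chosen for the example.

```python
def native_streaming(events, handle):
    """Process each event on its own, as soon as it arrives."""
    return [handle([event]) for event in events]


def micro_batching(events, handle, interval):
    """Group events into fixed time windows and process each window at once."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append((ts, value))
    return [handle(batch) for _, batch in sorted(batches.items())]


def total(batch):
    """Example computation: sum the values in a batch."""
    return sum(value for _, value in batch)


# (timestamp in ms, value)
events = [(0, 1), (40, 2), (120, 3), (130, 4), (260, 5)]
per_event = native_streaming(events, total)
per_batch = micro_batching(events, total, interval=100)
```

Native streaming emits one result per event, while micro-batching emits one result per interval, so a result cannot appear before its window closes.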
FAULT TOLERANCE
Ack-Per-Record
Checkpoint per Batch
Checkpoint per “Batch”
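The granularity trade-off between acknowledging every record and checkpointing per batch can be made concrete by counting what must be reprocessed after a crash. This is a simplified model, not the recovery logic of any particular engine; `replay_after_crash` and its parameters are hypothetical.

```python
def replay_after_crash(records, crash_at, checkpoint_every):
    """Return the records that must be reprocessed after a crash.

    crash_at:         index of the record in flight when the worker dies
    checkpoint_every: 1 models ack-per-record; N > 1 models a checkpoint
                      taken after every N records
    """
    # Everything before the last completed checkpoint is safe.
    last_safe = (crash_at // checkpoint_every) * checkpoint_every
    return records[last_safe:crash_at + 1]


records = list(range(10))
ack_per_record = replay_after_crash(records, crash_at=7, checkpoint_every=1)
per_batch = replay_after_crash(records, crash_at=7, checkpoint_every=5)
```

Finer granularity means less replay after a failure, at the cost of more bookkeeping during normal operation.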
DELIVERY GUARANTEES
At least once vs. “Exactly once”
Storage
[Diagram repeated: SMACK Stack pipeline (INGEST: Apache Kafka, STORE: Apache Cassandra, ANALYZE: Apache Spark, ACT: Akka), here running on DC/OS]
Datastores
Data Model
Relational: Schema; SQL; Foreign Keys/Joins; OLTP/OLAP
Key-Value: Simple; Scalable; Cache
Graph: Complex relations; Social Graph; Recommendation; Fraud detection
Document: Schema-less; Semi-structured queries; Product catalogue; Session data
Time-Series
Files
Datacenter
NAIVE APPROACH
Typical Datacenter: siloed, over-provisioned servers, low utilization
Industry Average: 12-15% utilization
Silos: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
Workloads: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
Apache Mesos
• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support
MESOS: FUNDAMENTAL ARCHITECTURE
[Diagram: three Mesos Masters; framework schedulers (Cassandra Scheduler, Container Scheduler, Spark Scheduler); Mesos Agents running executors and their tasks (Cassandra Executor/Task, Docker Executor/Task, Spark Executor/Task)]
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
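The offer cycle in steps 1 to 3 can be sketched as a toy simulation. This is not the Mesos API: `Framework`, `consider`, and `run_offer_cycle` are hypothetical names, only CPUs are modelled, offers go to frameworks in a fixed order rather than through a fair-sharing allocator, and step 4 (status reporting) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    name: str
    cpus_per_task: float
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, agent, cpus):
        """Step 3: use as much of the offer as needed, decline the rest."""
        used = 0.0
        while self.pending_tasks and cpus - used >= self.cpus_per_task:
            used += self.cpus_per_task
            self.pending_tasks -= 1
            self.launched.append(agent)   # a task is placed on this agent
        return used


def run_offer_cycle(agents, frameworks):
    """Steps 1-2: collect advertised resources and offer them to frameworks."""
    free = dict(agents)                   # step 1: agents advertise free CPUs
    for agent in free:
        for fw in frameworks:             # step 2: master offers what is left
            free[agent] -= fw.consider(agent, free[agent])
    return free                           # resources nobody accepted


agents = {"agent-1": 4.0, "agent-2": 2.0}
spark = Framework("spark", cpus_per_task=2.0, pending_tasks=2)
cassandra = Framework("cassandra", cpus_per_task=1.0, pending_tasks=3)
leftover = run_offer_cycle(agents, [spark, cassandra])
```

The master only decides who gets an offer; each framework decides for itself what to run on the offered resources, which is what makes the scheduling two-level.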
Challenges
• Mesos is just the kernel
• Need for OS:
  • Scheduler
  • Monitoring
  • Security
  • CLI
  • Package Repository
  • …
Operating Systems
DC/OS
• Datacenter-wide services to power your apps
• Turnkey installation and lifecycle management
• Container operations & big data operations
• Security, fault tolerance & high availability
• Open Source (ASL2.0)
• Based on Apache Mesos
• Production proven at scale
• Requires only a modern Linux distro (Windows coming soon)
• Hybrid Datacenter
Layers: DC/OS Universe, DC/OS, Any Infrastructure
Datacenter Operating System (DC/OS)
Distributed Systems Kernel (Mesos)
Big Data + Analytics Engines: Streaming, Batch, Machine Learning, Analytics, Search, Time Series, SQL / NoSQL Databases
Microservices (containers): Modern App Components
Any Infrastructure (Physical, Virtual, Cloud)
Developing Distributed Services
• Failures (Task, Node, Network, …)
• Zero Downtime Upgrades
• Persistence
• Multiple Frameworks
• Service Discovery
• Metrics
• …
Operating Distributed Services
Distributed Services: Challenges
Defaults
● As simple as Docker Compose
● Don’t need to write any Java code
● Don’t need to be an app expert
Custom Jobs & Strategies
● Need to be an app expert
● Need to write a little Java code
● Don’t want to understand DC/OS
Build Your Own Scheduler
● Can’t use the default scheduler
● Need to write a lot of Java code
● Willing to understand DC/OS
Demo Time
Flow: Generator to Display
1. Financial data created by generator
2. Written to Kafka topics
3. Kafka topics consumed by Spark or Flink
4. Results written back into Kafka stream (another topic)
7. Results displayed
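The demo flow can be mimicked end to end without a cluster. The sketch below is an assumption-laden stand-in, not the demo's code: two `deque` objects play the role of Kafka topics, a plain function plays the role of the Spark or Flink job, and the record fields (`account`, `amount`) are invented for illustration.

```python
from collections import deque

# In-memory stand-ins for two Kafka topics (hypothetical names).
ticks_topic = deque()
results_topic = deque()


def generator(n):
    """Create synthetic financial records and write them to the input topic."""
    for i in range(n):
        ticks_topic.append({"account": f"acct-{i % 2}", "amount": 10 * (i + 1)})


def stream_job():
    """Consume the input topic, aggregate per account, write to another topic."""
    totals = {}
    while ticks_topic:
        rec = ticks_topic.popleft()
        totals[rec["account"]] = totals.get(rec["account"], 0) + rec["amount"]
    for account, total in sorted(totals.items()):
        results_topic.append((account, total))


def display():
    """Read the results topic and format it for presentation."""
    return [f"{account}: {total}" for account, total in results_topic]


generator(4)
stream_job()
lines = display()
```

Each stage only touches a topic, never another stage directly, so the processing job could be swapped (Spark for Flink, as in the demo) without changing the generator or the display.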
Keep it running!
SERVICE OPERATIONS
● Configuration Updates (ex: Scaling, re-configuration)
● Binary Upgrades
● Cluster Maintenance (ex: Backup, Restore, Restart)
● Monitor progress of operations
● Debug any runtime blockages
Questions?
@dcos @joerg_schad
dcos.io
Demo
