Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad

Jörg Schad, Mesosphere
SMACK STACK AND
BEYOND
BUILDING FAST DATA PIPELINES
@dcos @joerg_schad

© 2017 Mesosphere, Inc. All Rights Reserved.
Jörg Schad
Software Engineer @Mesosphere
@joerg_schad
@joerg.mesosphere

Ancient Times...
MapReduce is crunching data

Today...
But then business demanded FAST DATA
We need to turn faster!

Data Processing
Processing styles: Batch, Micro-Batch, Event Processing
Latency scale: Days, Hours, Minutes, Seconds, Microseconds
Slower end: reports what has happened using descriptive analytics
Faster end: solves problems using predictive and prescriptive analytics
Example use cases: Billing, Chargeback; Product recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface

Fast Data Pipelines
Devices, Client, Sensors
Data Ingestion (Request/Response)
Message Queue/Bus
Microservices
Analytics (Streaming)
Distributed Storage

Fast Data Pipelines
EVENTS: Ubiquitous data streams from connected devices
INGEST: Ingest millions of events per second
STORE: Distributed & highly scalable database
ANALYZE: Real-time and batch data processing
ACT: Visualize data and build data-driven applications

SMACK Stack
EVENTS: Sensors, Devices, Clients
INGEST: Apache Kafka. Ingest millions of events per second
STORE: Apache Cassandra. Distributed & highly scalable database
ANALYZE: Apache Spark. Real-time and batch data processing
ACT: Akka. Visualize data and build data-driven applications
All running on Apache Mesos

Message Queues

MESSAGE QUEUES
Apache Kafka
ØMQ, RabbitMQ, Disque (Redis-based), etc.
fluentd, Logstash, Flume
Akka Streams
cloud-only: AWS SQS, Google Cloud Pub/Sub

APACHE KAFKA
High-throughput, distributed, persistent publish-subscribe
messaging system
Originates from LinkedIn
Typically used as buffer/de-coupling layer in online stream
processing
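
To make the "persistent publish-subscribe" and "buffer/de-coupling layer" ideas concrete, here is a minimal in-memory sketch. It is not Kafka's API: `TopicLog`, `publish`, and `poll` are hypothetical names chosen for illustration. It only shows the log idea, where records stay in an append-only log and each consumer group tracks its own read offset.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log, a stand-in for a single topic partition."""

    def __init__(self):
        self._records = []
        # Next offset to read, tracked separately per consumer group.
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for reading in ("t=20.1", "t=20.4", "t=20.9"):
    log.publish(reading)

# Two independent consumer groups read the same records at their own pace.
analytics = log.poll("analytics")
archiver_first = log.poll("archiver", max_records=2)
archiver_rest = log.poll("archiver")
```

Because records are not removed when read, a slow consumer does not block the producer or other consumers, which is the de-coupling property the slide refers to.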

HOW TO CHOOSE?
● Scalability
● Message Type
● Log vs …
● Delivery Guarantees/Message durability
● Routing Capabilities
● Failover
● Community
● Mesos Support ;-)

DELIVERY GUARANTEES
At most once—Messages may be lost but are never redelivered.
At least once—Messages are never lost but may be redelivered.
Exactly once: this is what people actually want. Each message
is delivered once and only once.
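
One common way to get close to exactly-once behaviour is at-least-once delivery combined with an idempotent consumer. The sketch below is a simulation under stated assumptions, not any real broker's API: `deliver_at_least_once` and `IdempotentConsumer` are hypothetical names, acknowledgements are lost at random, and every message is assumed to carry a unique id.

```python
import random


def deliver_at_least_once(messages, ack_loss_rate, rng):
    """Resend each message until its ack arrives; lost acks cause duplicates."""
    delivered = []
    for msg in messages:
        while True:
            delivered.append(msg)  # the message itself always arrives
            if rng.random() >= ack_loss_rate:
                break  # ack received, sender stops retrying
    return delivered


class IdempotentConsumer:
    """Applies each message id at most once, so redelivery is harmless."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg):
        msg_id, amount = msg
        if msg_id in self.seen:
            return False  # duplicate, already applied
        self.seen.add(msg_id)
        self.total += amount
        return True


rng = random.Random(42)
messages = [(i, 10) for i in range(100)]  # (unique id, amount)
delivered = deliver_at_least_once(messages, ack_loss_rate=0.3, rng=rng)

consumer = IdempotentConsumer()
for msg in delivered:
    consumer.handle(msg)
```

No message is lost and duplicates may arrive, yet the consumer's total reflects each message exactly once. The trade-off is that the consumer has to remember which ids it has already seen.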
Murphy’s Law of Distributed Systems:
Anything that can go wrong, will go wrong … partially!

Routing
Simple Pipes vs. Routing

Stream Processing

STREAM PROCESSING
• Apache Storm
• Apache Spark
• Apache Samza
• Apache Flink
• Apache Apex
• Concord
• cloud-only: AWS Kinesis, Google Cloud Dataflow

APACHE SPARK

APACHE SPARK (STREAMING 2.0)
Typical Use: distributed, large-scale data processing; micro-batching

Why Spark Streaming?
• Micro-batching with short batch intervals keeps latency low (on the order of seconds or less), which is fast enough for many use cases
• Well-defined role means it fits in well with other pieces of the pipeline

HOW TO CHOOSE?
● Execution Model
● Native Streaming vs Microbatch
● Fault Tolerance Granularity
● Per record, per batch
● Delivery Guarantees
● API
● SQL
● Spark
● Performance…
● Realtime ≠ Realtime
● Community
● Mesos Support ;-)

EXECUTION MODEL
Micro-Batching vs. Native Streaming
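
The difference between the two execution models can be sketched in a few lines. This is an illustration only and uses no streaming engine: `native_streaming` and `micro_batching` are hypothetical helpers, and events are assumed to be `(timestamp_seconds, value)` pairs sorted by timestamp.

```python
from itertools import groupby


def native_streaming(events):
    """Yield one processing unit per event, as soon as it is seen."""
    for event in events:
        yield [event]


def micro_batching(events, batch_interval):
    """Yield one processing unit per time window of batch_interval seconds."""
    def window(event):
        return int(event[0] // batch_interval)

    for _, group in groupby(events, key=window):
        yield list(group)


events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

native_units = list(native_streaming(events))
batch_units = list(micro_batching(events, batch_interval=1.0))
batch_sizes = [len(unit) for unit in batch_units]
```

With a one-second interval, the five events become three batches, so an event may wait up to the batch interval before it is processed. Native streaming handles each event individually, at the cost of more per-record overhead.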

FAULT TOLERANCE
Ack-Per-Record
Checkpoint per Batch
Checkpoint per “Batch”
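
As a rough illustration of batch-granularity fault tolerance, the sketch below checkpoints state after every completed batch and replays the whole batch after a simulated crash. All names (`run_with_batch_checkpoints`, `crash_once_in`) are hypothetical, and the checkpoint is an in-memory dict rather than durable storage.

```python
def run_with_batch_checkpoints(batches, crash_once_in=()):
    """Sum all values, checkpointing (state, next batch) after each batch.

    crash_once_in: batch indices that crash mid-batch on their first attempt.
    Returns the final total and the number of batch replays.
    """
    checkpoint = {"total": 0, "next_batch": 0}
    crashed = set()
    replays = 0
    while checkpoint["next_batch"] < len(batches):
        index = checkpoint["next_batch"]
        total = checkpoint["total"]  # restore state from the last checkpoint
        try:
            for position, value in enumerate(batches[index]):
                if index in crash_once_in and index not in crashed and position == 1:
                    crashed.add(index)
                    raise RuntimeError("simulated worker crash")
                total += value
        except RuntimeError:
            replays += 1  # partial work from the failed attempt is discarded
            continue      # replay the whole batch from the checkpoint
        checkpoint = {"total": total, "next_batch": index + 1}
    return checkpoint["total"], replays


batches = [[1, 2, 3], [4, 5], [6]]
clean = run_with_batch_checkpoints(batches)
recovered = run_with_batch_checkpoints(batches, crash_once_in={1})
```

Both runs end with the same total; the crashed run simply replays one batch. With ack-per-record, only the unacknowledged record would be replayed, at the cost of tracking every record individually.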

DELIVERY GUARANTEES
“Exactly once”
At least once

Storage

Data Model
● Relational: Schema; SQL; Foreign Keys/Joins; OLTP/OLAP
● Key-Value: Simple; Scalable; Cache
● Graph: Complex relations; Social Graph; Recommendation; Fraud detection
● Document: Schema-less; Semi-structured queries; Product catalogue; Session data
● Files
● Time-Series

Datacenter

NAIVE APPROACH
Typical Datacenter: siloed, over-provisioned servers, low utilization
Industry Average: 12-15% utilization
MySQL
microservice
Cassandra
Spark/Hadoop
Kafka

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
MySQL
microservice
Cassandra
Spark/Hadoop
Kafka
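
A back-of-the-envelope sketch of why multiplexing raises utilization. The core demands and the 16-core machine size are made-up numbers for illustration, and first-fit packing is a simple stand-in, not how Mesos allocates resources.

```python
def first_fit(demands, capacity):
    """Pack demands onto as few machines as first-fit finds."""
    machines = []
    for demand in demands:
        for machine in machines:
            if sum(machine) + demand <= capacity:
                machine.append(demand)
                break
        else:
            machines.append([demand])  # no machine had room, add a new one
    return machines


def utilization(machines, capacity):
    """Fraction of provisioned capacity that is actually used."""
    used = sum(sum(machine) for machine in machines)
    return used / (len(machines) * capacity)


# Hypothetical steady-state demand in CPU cores per workload.
demands = {"MySQL": 2, "microservice": 1, "Cassandra": 3, "Spark/Hadoop": 4, "Kafka": 2}
capacity = 16  # cores per machine (assumption)

siloed = [[cores] for cores in demands.values()]  # one machine per workload
shared = first_fit(list(demands.values()), capacity)

siloed_utilization = utilization(siloed, capacity)
shared_utilization = utilization(shared, capacity)
```

With these numbers, five siloed machines run at 15% utilization, while the same workloads multiplexed onto shared machines fit on one machine at 75%.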

Apache Mesos
• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support

MESOS: FUNDAMENTAL ARCHITECTURE
Diagram: three Mesos Masters; framework schedulers (Cassandra Scheduler, Container Scheduler, Spark Scheduler); Mesos Agents, each running a Mesos Agent Service with executors and tasks (Cassandra Executor and Task, Spark Executor and Task, Docker Executor and Task)
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
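
The four steps above can be sketched as a toy simulation. This is not the Mesos API: `Master`, `Framework`, `Offer`, and their methods are hypothetical, the master offers each agent's full free resources to frameworks in a fixed order, and a task is reported as running immediately.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


class Framework:
    """Second scheduling level: decides whether to use an offer."""

    def __init__(self, name, task_cpus, task_mem_mb, tasks_wanted):
        self.name = name
        self.task_cpus = task_cpus
        self.task_mem_mb = task_mem_mb
        self.tasks_wanted = tasks_wanted

    def resource_offer(self, offer):
        """Step 3: return the resources to use, or None to reject."""
        fits = offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb
        if self.tasks_wanted == 0 or not fits:
            return None
        self.tasks_wanted -= 1
        return (self.task_cpus, self.task_mem_mb)


class Master:
    """First scheduling level: tracks free resources and makes offers."""

    def __init__(self):
        self.free = {}    # agent name -> [cpus, mem_mb]
        self.status = []  # (agent, framework, state) updates

    def register_agent(self, agent, cpus, mem_mb):
        """Step 1: an agent advertises its resources."""
        self.free[agent] = [cpus, mem_mb]

    def offer_round(self, frameworks):
        for framework in frameworks:
            for agent in list(self.free):
                cpus, mem_mb = self.free[agent]
                # Step 2: offer the agent's free resources to the framework.
                used = framework.resource_offer(Offer(agent, cpus, mem_mb))
                if used is None:
                    continue
                self.free[agent] = [cpus - used[0], mem_mb - used[1]]
                # Step 4: the agent reports task status back to the master.
                self.status.append((agent, framework.name, "TASK_RUNNING"))


master = Master()
master.register_agent("agent-1", cpus=4, mem_mb=8192)
master.register_agent("agent-2", cpus=2, mem_mb=4096)

spark = Framework("spark", task_cpus=2, task_mem_mb=4096, tasks_wanted=2)
cassandra = Framework("cassandra", task_cpus=2, task_mem_mb=4096, tasks_wanted=2)
master.offer_round([spark, cassandra])
```

The point of the two levels is that the master only decides who gets offered what; each framework keeps its own placement logic and is free to decline an offer that does not suit it.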

Challenges
• Mesos is just the kernel
• Need for OS:
  • Scheduler
  • Monitoring
  • Security
  • CLI
  • Package Repository
  • …

Operating Systems

DC/OS
● Datacenter-wide services to power your apps
● Turnkey installation and lifecycle management
● Container operations & big data operations
● Security, fault tolerance & high availability
● Open Source (ASL2.0)
● Based on Apache Mesos
● Production proven at scale
● Requires only a modern Linux distro (Windows coming soon)
● Hybrid Datacenter
Stack, top to bottom:
DC/OS Universe
Modern App Components: Microservices (containers), Big Data + Analytics Engines (Streaming, Batch, Machine Learning, Analytics, Search, Time Series, SQL / NoSQL Databases)
Datacenter Operating System (DC/OS)
Distributed Systems Kernel (Mesos)
Any Infrastructure (Physical, Virtual, Cloud)

Developing Distributed Services
• Failures (Task, Node, Network,…)
• Zero Downtime Upgrades
• Persistence
• Multiple Frameworks
• Service Discovery
• Metrics
• …

Operating Distributed Services

Distributed Services: Challenges
Defaults
● As simple as Docker Compose
● Don’t need to write any Java code
● Don’t need to be an app expert
Custom Jobs & Strategies
● Need to be an app expert
● Need to write a little Java code
● Don’t want to understand DC/OS
Build Your Own Scheduler
● Can’t use the default scheduler
● Need to write a lot of Java code
● Willing to understand DC/OS

Demo Time
Generator Display
1. Financial data created by generator
2. Written to Kafka topics
3. Kafka topics consumed by Spark or Flink
4. Results written back into Kafka stream (another topic)
7. Results displayed
Keep it running!

SERVICE OPERATIONS
● Configuration Updates (ex: Scaling, re-configuration)
● Binary Upgrades
● Cluster Maintenance (ex: Backup, Restore, Restart)
● Monitor progress of operations
● Debug any runtime blockages

Questions?
@dcos @joerg_schad
dcos.io
Demo
