Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions
● Increasing importance of data analytics, data mining and
machine learning
● Current state
○ Destructive updates
○ Analytics tools with poor scalability and integration
○ Manual processes
○ Slow iterations
○ Not suitable for large volumes of data or for fast data
● Shared memory, disk, shared nothing, threads, mutexes, transactional memory, message
passing, CSP, actors, futures, coroutines, evented, dataflow, ...
"We can think of two reasons for using distributed machine learning: because you have to (so
much data), or because you want to (hoping it will be faster). Only the first reason is good." (Zygmunt Z)
[Chart: Elapsed times for 20 PageRank iterations] [1, 2]
● Complementary
● Distributed data processing framework Apache Spark won Daytona
Gray Sort 100TB Benchmark
● Distributed databases
● Whole lifecycle of data
● Data processing - Futures, Akka, Akka Cluster, Reactive Streams,
Spark, …
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
[Diagrams: data architectures. ACID mutable state; CQRS; Kappa architecture; batch pipeline; Lambda architecture (batch layer, serving layer, fast stream layer, query). Component labels: all your data, Kafka, NoSQL, SQL, Spark, stream processor, client, views, command DB, query DB, denormalise/precompute, Flume, Sqoop, Hive, Impala, Oozie, HDFS, serving DB.] [3]
[4, 5]
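The Lambda architecture named in the diagram can be sketched in plain Scala: a batch layer recomputes a view from all data, a fast stream layer updates a small real-time view, and a query merges the two. This is a minimal sketch under assumptions; `PageView` and the page-count use case are hypothetical and do not come from the slides.

```scala
object LambdaSketch {
  // Hypothetical event: one page view for a given page.
  final case class PageView(page: String)

  // Batch layer: recompute the view from the complete, immutable data set.
  def batchView(allData: Seq[PageView]): Map[String, Long] =
    allData.groupBy(_.page).map { case (page, views) => page -> views.size.toLong }

  // Stream (fast) layer: fold in one event that the batch view does not cover yet.
  def updateRealtimeView(view: Map[String, Long], event: PageView): Map[String, Long] =
    view.updated(event.page, view.getOrElse(event.page, 0L) + 1L)

  // Query: merge the batch view with the real-time view.
  def query(batch: Map[String, Long], realtime: Map[String, Long], page: String): Long =
    batch.getOrElse(page, 0L) + realtime.getOrElse(page, 0L)
}
```

The batch view can be slow to rebuild because the real-time view covers the gap since the last batch run; once a new batch view is ready, the matching part of the real-time view can be discarded.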
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala by the Bay 2015
Output 0 with result 0.6615020337700888 in 12:15:53.564
Output 0 with result 0.6622847063345205 in 12:15:53.564
● Pure Scala
● Functional programming
● Synchronization and memory management
● Actor framework for truly concurrent and distributed systems
● Thread safe mutable state - consistency boundary
● Domain modelling
● Distributed state, work, communication patterns
● Simple programming model - send messages, create new actors,
change behaviour
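import akka.actor.ActorRef
import akka.cluster.ddata.{DistributedData, GSet, GSetKey}
import akka.cluster.ddata.Replicator.{Update, WriteLocal}
import akka.persistence.PersistentActor
import scalaz.\/-

class UserActor extends PersistentActor {
  override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId

  // Key of the replicated grow-only set of accounts
  private[this] val userAccountKey = GSetKey[Account]("userAccountKey")

  override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator)

  // Initial behaviour: wait for the account to be registered
  def notRegistered(distributedData: ActorRef): Receive = {
    case cmd: AccountCommand =>
      // Persist the event first, then replicate the account and change behaviour
      persist(AccountEvent(cmd.account)) { acc =>
        distributedData ! Update(userAccountKey, GSet.empty[Account], WriteLocal)(_ + acc.account)
        context.become(registered(acc.account))
      }
  }

  // Behaviour after registration: persist each session and reply with its id
  def registered(account: Account): Receive = {
    case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) =>
      persist(eres)(data => sender() ! \/-(id))
  }

  override def receiveRecover: Receive = {
    ...
  }
}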
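The persistent actor above stores events rather than state, so its state after a restart comes from replaying the journal. The idea can be sketched without Akka as a fold over events. The event and state types below (`Registered`, `SessionRecorded`, `UserState`) are hypothetical stand-ins, not the types used in the talk.

```scala
object EventSourcingSketch {
  // Hypothetical domain events; illustrative names only.
  sealed trait Event
  final case class Registered(account: String) extends Event
  final case class SessionRecorded(sessionId: Int) extends Event

  final case class UserState(account: Option[String], sessions: Vector[Int]) {
    // Applying an event is a pure function, so replaying it is safe.
    def updated(event: Event): UserState = event match {
      case Registered(a)       => copy(account = Some(a))
      case SessionRecorded(id) => copy(sessions = sessions :+ id)
    }
  }

  val empty: UserState = UserState(None, Vector.empty)

  // Recovery: fold the journal into a state, oldest event first.
  def recover(journal: Seq[Event]): UserState =
    journal.foldLeft(empty)((state, event) => state.updated(event))
}
```

Because the journal is append-only, this avoids the destructive updates listed earlier as a problem of the current state of data systems.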
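import akka.stream.ActorMaterializer
import akka.stream.actor.ActorPublisher
import akka.stream.actor.ActorPublisherMessage.Request
import akka.stream.scaladsl.{FlowGraph, Source}
import scala.concurrent.Future

class SensorDataProcessor[P, S] extends ActorPublisher[SensorData] with DataSink[P] with DataProcessingFlow[S] {
  implicit val materializer = ActorMaterializer()

  // Build and run the graph: this actor as the source, through the flow, into the sink
  override def preStart() = {
    FlowGraph.closed(sink) { implicit builder: FlowGraph.Builder[Future[Unit]] => s =>
      import FlowGraph.Implicits._
      Source(ActorPublisher[SensorData](self)) ~> flow ~> s
    }.run()
    super.preStart()
  }

  def source(buffer: Seq[SensorData]): Receive = {
    // Downstream demand and nothing queued: emit immediately
    case data: SensorData if totalDemand > 0 && buffer.isEmpty => onNext(data)
    // Otherwise buffer until demand is signalled
    case data: SensorData => context.become(source(buffer :+ data))
    // Demand arrived: emit the oldest buffered element
    case Request(_) if buffer.nonEmpty =>
      onNext(buffer.head)
      context.become(source(buffer.tail))
  }

  override def receive: Receive = source(Seq())
}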
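The buffering logic in the publisher above can be modelled as a small immutable state machine, which makes the interplay of buffer and demand easier to follow. This is a sketch under assumptions, not the talk's code: `Publisher`, `onData` and `onRequest` are hypothetical names, and unlike the actor above, which emits one element per `Request`, this model drains as many buffered elements as the demand allows.

```scala
object DemandBufferSketch {
  // Immutable model of a publisher: queued elements plus outstanding downstream demand.
  final case class Publisher[A](buffer: Vector[A], demand: Long) {
    // New element: emit it if there is demand and nothing is queued ahead of it, else queue it.
    def onData(a: A): (Publisher[A], Vector[A]) =
      if (demand > 0 && buffer.isEmpty) (copy(demand = demand - 1), Vector(a))
      else (copy(buffer = buffer :+ a), Vector.empty)

    // Downstream requested n more elements: drain as much of the buffer as demand allows.
    def onRequest(n: Long): (Publisher[A], Vector[A]) = {
      val total = demand + n
      val count = math.min(total, buffer.size.toLong).toInt
      val (emitted, rest) = buffer.splitAt(count)
      (Publisher(rest, total - count), emitted)
    }
  }

  def empty[A]: Publisher[A] = Publisher(Vector.empty[A], 0L)
}
```

Each call returns the next state together with the elements emitted downstream, so a sequence of `onData` and `onRequest` calls can be checked step by step. Note that the buffer here is unbounded; a production publisher would also need a policy for a full buffer.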
[Diagrams: Persistence; Sharding; Replication. Numbered steps 1 to 11 and values labelled ?, ? + 1 and ? + 2.]
● At-most-once. Messages may be lost.
● At-least-once. Messages may be duplicated but not lost.
● Exactly-once.
[Diagram: message delivery with acknowledgement (Ack)] [6]
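A common way to get close to exactly-once processing is to combine at-least-once delivery (resend until acknowledged) with a receiver that ignores duplicates. The sketch below illustrates that combination in plain Scala. All names are hypothetical, and `ackLost` is an assumed stand-in for a network that drops acknowledgements.

```scala
object DeliverySketch {
  // Hypothetical message carrying a unique id assigned by the sender.
  final case class Message(id: Long, payload: String)

  // Receiver that remembers processed ids, which makes redelivery harmless.
  final case class Receiver(seen: Set[Long], processed: Vector[String]) {
    // Returns the new state and the id to acknowledge.
    // Duplicates are acknowledged again but not processed again.
    def receive(m: Message): (Receiver, Long) =
      if (seen(m.id)) (this, m.id)
      else (Receiver(seen + m.id, processed :+ m.payload), m.id)
  }

  // At-least-once sender: resend until an ack arrives or attempts run out.
  // In this sketch the message always arrives; only the ack can be lost.
  def send(r: Receiver, m: Message, ackLost: Int => Boolean, maxAttempts: Int): (Receiver, Int) = {
    var receiver = r
    var attempt = 0
    var acked = false
    while (!acked && attempt < maxAttempts) {
      attempt += 1
      val (next, _) = receiver.receive(m)
      receiver = next
      acked = !ackLost(attempt)
    }
    (receiver, attempt)
  }
}
```

With the first two acks lost, the message is delivered three times but processed once. The sender still cannot know whether an unacknowledged message was processed, which is the point of the Two Generals' Problem cited above.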
Output 18853 with result 0.6445355972059068 in 17:33:12.248
Output 18854 with result 0.6392081778097862 in 17:33:12.248
Output 18855 with result 0.6476549338361918 in 17:33:12.248
[17:33:12.353] [ClusterSystem-akka.actor.default-dispatcher-21] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@127.0.0.1:2551] - Leader is removing unreachable node [akka.tcp://ClusterSystem@127.0.0.1:54495]
[17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-22] [akka.tcp://ClusterSystem@127.0.0.1:2551/user/sharding/PerceptronCoordinator] Member removed [akka.tcp://ClusterSystem@127.0.0.1:54495]
[17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-35]
[17:33:12.415] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/user/sharding/Edge/e-2-1-3-1] null java.lang.NullPointerException
● Microsoft's data centers see an average failure rate of 5.2 devices per day and 40.8 links per day,
with a median time to repair of approximately five minutes (and a maximum of one week).
● Google, first year of a new cluster: five rack issues (40-80 machines seeing 50 percent
packet loss), eight network maintenance events (four of which might cause ~30-minute
random connectivity losses), and three router failures (resulting in the need to pull traffic
immediately for an hour).
● CENIC: 500 isolating network partitions, with median durations of 2.7 minutes for software problems
and 32 minutes for hardware problems, and 95th percentiles of 19.9 minutes and 3.7 days, respectively.
[7]
● A network partition separated a MongoDB primary from its 2 secondaries. Two hours later the old primary
rejoined and rolled back everything on the new primary.
● A network partition isolated the Redis primary from all secondaries. Every API call caused the billing
system to recharge customer credit cards automatically, resulting in 1.1 percent of customers being
overbilled over a period of 40 minutes.
● The partition caused inconsistency in the MySQL database. Because foreign key relationships were not
consistent, GitHub showed private repositories to the wrong users' dashboards and incorrectly routed
some newly created repositories.
● For several seconds, Elasticsearch is happy to believe two nodes in the same cluster are both primaries, will
accept writes on both of those nodes, and later discard the writes to one side.
● RabbitMQ lost ~35% of acknowledged writes during a network partition.
● Redis threw away 56% of the writes it told us succeeded.
● In Riak, last-write-wins resulted in dropping 30-70% of writes, even with the strongest consistency
settings
● MongoDB “strictly consistent” reads see stale versions of documents, but they can also return garbage
data from writes that never should have occurred.
[8]
● Publisher and subscriber
● Lazy topology definition
Source[Circle].map(_.toSquare).filter(_.color == blue)
[Diagram: Publisher -> toSquare -> color == blue -> Subscriber, with backpressure signalled from the subscriber back to the publisher]
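The lazy topology definition can be illustrated with a plain Scala `Iterator`, which is also lazy and pull-based: defining the pipeline does no work, and the source only produces what the consumer asks for. This is a simplified, synchronous stand-in for a stream, not the Reactive Streams API; `CountingSource` and the string colours are assumptions made for the sketch.

```scala
object LazyTopologySketch {
  final case class Square(color: String)
  final case class Circle(color: String) { def toSquare: Square = Square(color) }

  // Source that counts how many circles it has actually produced.
  final class CountingSource(colors: Seq[String]) {
    var produced = 0
    def iterator: Iterator[Circle] = colors.iterator.map { c => produced += 1; Circle(c) }
  }

  // Topology only: nothing runs until a consumer pulls from the result.
  def blueSquares(source: Iterator[Circle]): Iterator[Square] =
    source.map(_.toSquare).filter(_.color == "blue")
}
```

Given the colours red, blue, green, blue, building the pipeline leaves `produced` at 0, and pulling one blue square advances the source by only two elements. Real backpressure works across asynchronous boundaries, where the subscriber signals demand explicitly instead of calling `next()`.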
weights ~> zip.in0
zip.out ~> transform ~> broadcast
broadcast ~> zipWithIndex ~> sink
zip.in1 <~ concat <~ input
concat <~ broadcast
[Diagram: a network layer as a stream graph with stages zip, transform (*), broadcast and zipWithIndex. Labels: weights, input 1, input n + 1, index.]
[9]
7 * Dumbbell Alternating Curl
● In-memory, dataflow-based distributed data processing framework for streaming
and batch workloads
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant
● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
○ Lazy, form the DAG
○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
○ Execute DAG, retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators
● Broadcast Variables
● Integration
● Streaming
● Machine Learning
● Graph Processing
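The split between lazy transformations and actions that execute the DAG can be mimicked locally with a Scala collection view. The sketch below is an analogy using only the standard library, not Spark's API: the view plays the role of the transformations and `toList` plays the role of an action.

```scala
object LazyDagSketch {
  // Returns (evaluations before the action, evaluations after it, result).
  def run(data: Seq[Int]): (Int, Int, List[Int]) = {
    var evaluated = 0
    // "Transformations": nothing runs yet, the view only records what to do.
    val plan = data.view.map { x => evaluated += 1; x * 2 }.filter(_ > 4)
    val before = evaluated
    // "Action": forces the whole chain and materialises the result.
    val result = plan.toList
    (before, evaluated, result)
  }
}
```

For the input 1, 2, 3, 4 no element is evaluated before the action, all four are evaluated by it, and the result is `List(6, 8)`. Deferring execution is what lets a framework see the whole chain before running it.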
[Diagram: Data -> transform -> transform -> transform -> collect. For the example: textFile -> map -> map -> reduceByKey -> collect.]
sc.textFile("counts")
  .map(line => line.split("\t"))           // split each line on the tab character
  .map(word => (word(0), word(1).toInt))   // (word, count) pairs
  .reduceByKey(_ + _)                      // sum the counts per word
  .collect()
[10]
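To see what the job computes without a cluster, the same steps can be run on a local collection. This sketch assumes the input lines have the form `word<TAB>count`; `groupBy` followed by a per-key sum is a local stand-in for `reduceByKey(_ + _)`, not Spark's implementation.

```scala
object ReduceByKeySketch {
  // Parses tab-separated "word<TAB>count" lines and sums the counts per word.
  def countsPerWord(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1).toInt))
      .groupBy(_._1) // local stand-in for the shuffle by key
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // per-key sum
}
```

For example, `ReduceByKeySketch.countsPerWord(Seq("a\t1", "b\t2", "a\t3"))` yields `Map("a" -> 4, "b" -> 2)`.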
● Catalyst
● Multiple phases
● DataFrame
[11]
[Diagram: machine learning pipeline. Data -> Preprocessing -> Features -> Training, and Data -> Preprocessing -> Features -> Testing -> Error %.]
val events = sc.eventTable().cache().toDF()

val pipeline = new Pipeline().setStages(Array(
  new UserFilter(),
  new ZScoreNormalizer(),
  new IntensityFeatureExtractor(),
  new LinearRegression()
))

// Fit and evaluate one model per eligible user
getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    val model = pipeline.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))
    val testData = // Prepare test data.
    val predictions = model.transform(testData)
    submitResult(user, predictions, config)
  }
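The slides do not show what the `ZScoreNormalizer` stage does internally. As an illustration of the technique its name suggests, z-score normalisation rescales values to zero mean and unit standard deviation. The sketch below is an assumption about that preprocessing step, written over a plain sequence and not as a Spark pipeline stage.

```scala
object ZScoreSketch {
  // Standardise values to zero mean and unit (population) standard deviation.
  def normalize(xs: Seq[Double]): Seq[Double] =
    if (xs.isEmpty) xs
    else {
      val mean = xs.sum / xs.size
      val variance = xs.map(x => (x - mean) * (x - mean)).sum / xs.size
      val std = math.sqrt(variance)
      // A constant column has no spread; map it to zeros instead of dividing by zero.
      if (std == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - mean) / std)
    }
}
```

Normalising before regression keeps features with large numeric ranges from dominating the fit.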
1) Choose the best combination of tools for the given use case.
2) Understand the internals of the selected tools.
3) The environment is often fully asynchronous and distributed.
● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Twitter @zapletal_martin
[1] http://www.csie.ntu.edu.tw/~cjlin/talks/twdatasci_cjlin.pdf
[2] http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/
[3] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
[4] http://malteschwarzkopf.de/research/assets/google-stack.pdf
[5] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
[6] http://en.wikipedia.org/wiki/Two_Generals%27_Problem
[7] https://queue.acm.org/detail.cfm?id=2655736
[8] https://aphyr.com/
[9] http://www.smartjava.org/content/visualizing-back-pressure-and-reactive-streams-akka-streams-statsd-grafana-and-influxdb
[10] http://www.slideshare.net/LisaHua/spark-overview-37479609
[11] https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
