Real-Time Distributed and Reactive Systems
with Apache Kafka and Apache Accumulo
Joe Stein
• Developer, Architect & Technologist
• Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly
Big Data Open Source Security LLC provides professional services and product solutions for the collection, storage,
transfer, real-time analytics, batch processing and reporting for complex data streams, data sets and distributed
systems. BDOSS is all about the "glue" and helping companies to not only figure out what Big Data Infrastructure
Components to use but also how to change their existing (or build new) systems to work with them.
• CEO => Elodina, Inc.
Expanding BDOSS from just consulting, Elodina is an ISV & SaaS provider of stream solutions & open source software.
Elodina helps make data streams actionable.
• Apache Kafka Committer & PMC member
• Blog & Podcast - http://allthingshadoop.com
• Twitter @allthingshadoop
Overview
● Real-time distributed reactive systems
● Quick Intro to Apache Kafka
● Quick Intro to Apache Mesos
● Kafka on Mesos
● Accumulo & HDFS on Mesos
● Real-time distributed reactive systems
● Bringing it all together with Accumulo
Real-Time Distributed and Reactive Systems
A distributed system for asynchronous stream processing with
non-blocking back pressure where complex event processing
systems can influence the response without coupling the
business logic of processing. The response can be calculated
by parallel operations with concurrent orthogonal processing
engines computing their influence towards the final result.
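The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bounded `asyncio.Queue` stands in for the stream and supplies the non-blocking back pressure, and `fraud_engine` and `loyalty_engine` are hypothetical processing engines invented for this example, not anything from the talk.

```python
import asyncio


async def producer(queue, events):
    # put() suspends this coroutine (without blocking the event loop) while
    # the queue is full; that suspension is the back-pressure signal.
    for event in events:
        await queue.put(event)
    await queue.put(None)  # sentinel: end of stream


# Two independent ("orthogonal") engines; neither knows about the other.
async def fraud_engine(event):
    return -1.0 if event["amount"] > 1000 else 0.0


async def loyalty_engine(event):
    return 0.5 if event["repeat_customer"] else 0.0


async def responder(queue, engines):
    results = []
    while True:
        event = await queue.get()
        if event is None:
            return results
        # Run every engine concurrently and combine their influence
        # into a single response for this event.
        scores = await asyncio.gather(*(engine(event) for engine in engines))
        results.append((event["id"], sum(scores)))


async def main(events):
    queue = asyncio.Queue(maxsize=2)  # small bound so back pressure kicks in
    _, results = await asyncio.gather(
        producer(queue, events),
        responder(queue, [fraud_engine, loyalty_engine]),
    )
    return results


def run_demo():
    events = [
        {"id": 1, "amount": 50, "repeat_customer": True},
        {"id": 2, "amount": 5000, "repeat_customer": True},
        {"id": 3, "amount": 20, "repeat_customer": False},
    ]
    return asyncio.run(main(events))
```

Adding a third engine means appending one function to the list passed to `responder`; the existing engines and the producer stay untouched, which is the decoupling the definition describes.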
Real-Time Distributed and Reactive Systems
http://kafka.apache.org
Apache Kafka
• Apache Kafka
o http://kafka.apache.org
• Apache Kafka Source Code
o https://github.com/apache/kafka
• Documentation
o http://kafka.apache.org/documentation.html
• Wiki
o https://cwiki.apache.org/confluence/display/KAFKA/Index
Producers, Consumers, Brokers
• Producers - ** push **
o Batching
o Compression
o Sync (Ack), Async (auto batch)
o Replication
o Sequential writes, guaranteed ordering within each partition
• Consumers - ** pull **
o No state held by broker
o Consumers control reading from the stream
• Zero Copy for producers and consumers to and from the broker
http://kafka.apache.org/documentation.html#maximizingefficiency
• Messages stay on disk after they are consumed; they are deleted on TTL or by compaction
https://kafka.apache.org/documentation.html#compaction
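The push/pull model above can be mimicked with a toy in-memory log. This sketch does not use any Kafka client; `Broker`, `Consumer` and `compact` are hypothetical names, and the CRC32 key hash is an assumption chosen for the example rather than Kafka's own partitioner.

```python
import zlib
from collections import defaultdict


class Broker:
    """Toy stand-in for a broker: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always maps to the same partition, so messages for a
        # key keep their relative order within that partition.
        partition = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def fetch(self, partition, offset, max_messages=10):
        # No per-consumer state here: the caller supplies the offset.
        return self.partitions[partition][offset:offset + max_messages]


class Consumer:
    """Pull-based consumer that tracks its own position per partition."""

    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)

    def poll(self, partition, max_messages=10):
        batch = self.broker.fetch(partition, self.offsets[partition], max_messages)
        self.offsets[partition] += len(batch)
        return batch

    def seek(self, partition, offset):
        # Rewinding works because consumed messages stay in the log.
        self.offsets[partition] = offset


def compact(log):
    """Keep only the latest entry per key, preserving the survivors' order."""
    latest = {key: index for index, (key, _) in enumerate(log)}
    return [entry for index, entry in enumerate(log) if latest[entry[0]] == index]
```

A consumer that calls `seek(0, 0)` rereads partition 0 from the start, and `compact` shows the idea behind compaction: older values for a key are dropped while the most recent one is retained.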
Kafka decouples data-pipelines
Accumulo Summit 2015: Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo [Leveraging Accumulo]
Client Libraries
Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients
• Python - Pure Python implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
• C - High performance C library with full protocol support
• C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.
• Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer
implementations included, GZIP and Snappy compression supported.
• Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy
compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
• Clojure - Clojure DSL for the Kafka API
• JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation
• stdin & stdout
Wire Protocol Developers Guide
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
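For anyone writing a client from the protocol guide, the framing is the first thing to get right. The sketch below encodes a version 0 metadata request as I read the guide: a 4-byte big-endian size, then API key, API version, correlation id and client id, then an array of topic names. The function names are hypothetical, and the layout should be checked against the guide for the broker version in use.

```python
import struct

METADATA_API_KEY = 3  # API key for the metadata request in the protocol guide


def encode_string(text):
    # Protocol strings are an int16 length followed by the UTF-8 bytes.
    data = text.encode("utf-8")
    return struct.pack(">h", len(data)) + data


def encode_metadata_request(correlation_id, client_id, topics):
    # Request header: api_key (int16), api_version (int16),
    # correlation_id (int32), client_id (string).
    header = struct.pack(">hhi", METADATA_API_KEY, 0, correlation_id)
    header += encode_string(client_id)
    # Body: an array (int32 count) of topic-name strings.
    body = struct.pack(">i", len(topics))
    body += b"".join(encode_string(topic) for topic in topics)
    message = header + body
    # Every request is prefixed with its size as a big-endian int32.
    return struct.pack(">i", len(message)) + message
```

The correlation id is echoed back in the response, which lets a client match replies to requests on a shared connection.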
Really Quick Start (Scala)
1) Install Vagrant http://www.vagrantup.com/
2) Install VirtualBox https://www.virtualbox.org/
3) git clone https://github.com/stealthly/scala-kafka
4) cd scala-kafka
5) vagrant up
ZooKeeper will be running on 192.168.86.5
BrokerOne will be running on 192.168.86.10
All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm)
6) ./gradlew test
Really Quick Start (Go)
1) Install Vagrant http://www.vagrantup.com/
2) Install VirtualBox https://www.virtualbox.org/
3) git clone https://github.com/stealthly/go-kafka
4) cd go-kafka
5) vagrant up
6) vagrant ssh brokerOne
7) cd /vagrant
8) sudo ./test.sh
Apache Mesos
http://mesos.apache.org
Origins
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf
Google Borg - https://research.google.com/pubs/pub43438.html
Google Omega: flexible, scalable schedulers for large compute clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Static Partition == Idle Resources
Operating System === Datacenter
Mesos => data center “kernel”
Apache Mesos
● Scalability to 10,000s of nodes
● Fault-tolerant replicated master and slaves using ZooKeeper
● Support for Docker containers
● Native isolation between tasks with Linux Containers
● Multi-resource scheduling (memory, CPU, disk, and ports)
● Java, Python and C++ APIs for developing new parallel applications
● Web UI for viewing cluster state
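Multi-resource scheduling is easier to picture with a toy model of resource offers. The sketch below does not call the Mesos API; `Offer`, `Task` and `schedule` are hypothetical names, and first-fit placement is an assumption chosen for brevity rather than what any particular framework does.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


def schedule(offers, pending):
    """First-fit: place each task on the first offer with enough CPU and memory left."""
    remaining = {offer.agent: [offer.cpus, offer.mem_mb] for offer in offers}
    placements, unplaced = {}, []
    for task in pending:
        for agent, (cpus, mem_mb) in remaining.items():
            if task.cpus <= cpus and task.mem_mb <= mem_mb:
                # Deduct what the task uses so later tasks see the remainder.
                remaining[agent] = [cpus - task.cpus, mem_mb - task.mem_mb]
                placements[task.name] = agent
                break
        else:
            # No offer could hold this task; it waits for future offers.
            unplaced.append(task.name)
    return placements, unplaced
```

Because different frameworks draw from the same pool of offers, capacity left idle by one workload can be picked up by another instead of sitting in a static partition.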
Sample Frameworks
C++ - https://github.com/apache/mesos/tree/master/src/examples
Java - https://github.com/apache/mesos/tree/master/src/examples/java
Python - https://github.com/apache/mesos/tree/master/src/examples/python
Scala - https://github.com/mesosphere/scala-sbt-mesos-framework.g8
Go - https://github.com/mesosphere/mesos-go
Kafka on Mesos
● The Mesos Kafka framework https://github.com/mesos/kafka
○ Smart broker.id assignment.
○ Preservation of broker placement.
○ Ability to make configuration changes.
○ Rolling restarts.
○ Auto-scaling the cluster up and down.
Accumulo on Mesos
No framework yet, but you can use Marathon, no problem!
Marathon https://github.com/mesosphere/marathon is a cluster-wide
init and control system for services in cgroups or Docker,
based on Apache Mesos
HDFS on Mesos https://github.com/mesosphere/hdfs (more on
this in a bit)
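As a rough idea of what running Accumulo through Marathon involves, the sketch below builds an app definition of the kind Marathon accepts on its /v2/apps REST endpoint. The app id, install path and resource sizes are assumptions made up for illustration; the talk does not give them, and nothing is sent over the network here.

```python
import json


def marathon_app(app_id, cmd, cpus, mem_mb, instances):
    """Build the JSON body of a basic Marathon app definition."""
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem_mb,
        "instances": instances,
    }


app = marathon_app(
    app_id="/accumulo/tserver",                 # hypothetical app id
    cmd="/opt/accumulo/bin/accumulo tserver",   # hypothetical install path
    cpus=2.0,
    mem_mb=4096,
    instances=3,
)
payload = json.dumps(app, indent=2)
```

Marathon keeps the requested number of instances running, so scaling the tablet servers becomes a change to `instances` rather than a manual deployment step.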
Real-Time Distributed and Reactive Systems
Where does Accumulo fit in?
● Iterators
○ Accumulo iterators are a real time processing framework with
“reduce like” functionality
● Multi HDFS Volume Support
○ Spin up HDFS clusters when they are needed
● Streaming Large Blobs
○ Post files in producers, process and respond to scans
● More!
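The "reduce like" behaviour of iterators can be illustrated outside Accumulo. Real iterators are Java classes that run inside the tablet servers; the Python generator below is only an analogy, with `summing_combiner` as a hypothetical name, showing how a sorted key/value stream is collapsed to one value per key as it is read.

```python
from itertools import groupby
from operator import itemgetter


def summing_combiner(sorted_entries):
    """Lazily collapse runs of equal keys in a sorted (key, value) stream into their sum."""
    for key, group in groupby(sorted_entries, key=itemgetter(0)):
        yield key, sum(value for _, value in group)
```

Because the input is already sorted by key, the combiner needs to hold only one running sum at a time, which is why this style of aggregation fits a scan over sorted data.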
Real-Time Distributed and Reactive Systems
Questions?
/*******************************************
Joe Stein
CEO, Elodina, Inc
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/
