Distributed Coordination
with ZooKeeper and Curator
Tibor Sulyán
tibor_sulyan@epam.com
April 25, 2015
2CONFIDENTIAL
CAP Theorem by Eric Brewer
• Consistency
• Availability
• Partition Tolerance
Introduction
[Diagram: a write and a subsequent read against replicated values 1 and 2]
Agenda
1. What is ZooKeeper?
2. ZooKeeper features
3. Coordination Recipes
4. Using Curator with ZooKeeper
5. Deploying ZooKeeper clusters
„ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.”
What is ZooKeeper about?
P  /root
P  /root/data
P  /root/state
ES /root/state/service-000000001
[Diagram: three clients, each using the ZK API, connected to an ensemble of five zk servers]
• Filesystem-like hierarchical structure
– Elements are called zNodes
• zNode operations
– Basic CRUD
– Transactional execution of multiple operations
– Watches
– Versioned changes
• zNode contents
– Data
– Children
– Metadata (Stat structure)
• zNode types
ZooKeeper Data Model
P persistent
E ephemeral
PS persistent sequential
ES ephemeral sequential
• Ephemeral zNodes
– Session-scoped
– Exists as long as the ephemeral owner's
session is active
– Not persisted
– No children
• Sequence (sequential) zNodes
– Upon creation, the zNode name is suffixed
with a monotonically increasing integer value
– The value is unique among the children of the parent zNode
• Watches
– Can be set on read operations
(getData(), getChildren(), exists())
– One-time trigger when a zNode changes
ZooKeeper Data Model
P  /
P  /servers
E  /servers/server_A
E  /servers/server_B
P  /leader
ES /leader/server_A0000000001
ES /leader/server_B0000000002
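The sequence suffix shown above can be sketched in plain Java. ZooKeeper appends a 10-digit, zero-padded counter to the requested name; the class and method names below are hypothetical and the snippet does not talk to a ZooKeeper server, it only reproduces the naming.

```java
import java.util.Locale;

// Illustrative helper, not part of the ZooKeeper API.
final class SequentialNames {
    // Builds the name a sequential zNode would get for the given counter value:
    // the requested path followed by a 10-digit, zero-padded number.
    static String withSequence(String requestedPath, int counter) {
        return String.format(Locale.ROOT, "%s%010d", requestedPath, counter);
    }

    // Extracts the counter from the last 10 characters of a sequential zNode name.
    static int sequenceOf(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    public static void main(String[] args) {
        System.out.println(withSequence("/leader/server_A", 1));
    }
}
```

Parsing the suffix back into a number is what recipes such as leader election rely on when they order the children of a zNode.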
// this class will act as the default watcher
class ZooKeeperClient implements Watcher {
    ...
    // connect to the ensemble; 'this' is registered as the default watcher
    ZooKeeper zooKeeper = new ZooKeeper("localhost:2181,localhost:2182,localhost:2183",
            30_000, this);

    @Override
    public void process(WatchedEvent event) {
        // receives zNode changes & connection state changes
        // can be invoked before the constructor returns!
    }
}
ZooKeeper API – connect, default watcher
// synchronous node creation; create() returns the actual path of the new zNode
try {
    String path = zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL);
} catch (KeeperException e) {
    switch (e.code()) {
        case CONNECTIONLOSS:
            // retry operation
            break;
        default:
            // other errors are not recoverable by a simple retry
            break;
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
// asynchronous node creation
zooKeeper.create("/test", "data".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL,
        new StringCallback() {
            @Override
            public void processResult(int rc, String path, Object ctx, String name) {
                switch (Code.get(rc)) {
                    // handle errors (retry on CONNECTIONLOSS)
                }
            }
        }, null /* no context passed to callback */);
ZooKeeper API – create operations &
recoverable errors
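The "retry on CONNECTIONLOSS" step can be sketched as a small retry loop. This is a minimal illustration in plain Java with no ZooKeeper dependency: ConnectionLoss is a hypothetical stand-in for the real error, and callWithRetry, maxAttempts and baseSleepMs are names chosen for this example only.

```java
import java.util.function.Supplier;

// Illustrative retry helper, not part of the ZooKeeper API.
final class RetryOnConnectionLoss {
    // Hypothetical stand-in for a connection-loss error; not a real ZooKeeper class.
    static final class ConnectionLoss extends RuntimeException {
        private static final long serialVersionUID = 1L;
    }

    // Runs the operation, retrying only on ConnectionLoss and doubling the wait each time.
    static <T> T callWithRetry(Supplier<T> operation, int maxAttempts, long baseSleepMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long sleepMs = baseSleepMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.get();
            } catch (ConnectionLoss e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", interrupted);
                }
                sleepMs *= 2;
            }
        }
    }
}
```

With a real client, a create that failed with a connection loss may still have been applied on the server, so a retry should also be prepared for a "node already exists" error.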
// synchronous update: sets "newdata" for /test regardless of its version (-1 matches any version)
// error handling omitted
Stat stat = zooKeeper.setData("/test", "newdata".getBytes(), -1);
// sets "newerdata" only if the current data version is 5; fails with BADVERSION otherwise
zooKeeper.setData("/test", "newerdata".getBytes(), 5);
ZooKeeper API – versioned update operations
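The version check behaves like a compare-and-set. The sketch below models it with an in-memory class; VersionedNode is a hypothetical name, and an IllegalStateException stands in for the BADVERSION error a real server would report.

```java
// In-memory model of a versioned zNode; illustrative only, not the ZooKeeper API.
final class VersionedNode {
    private byte[] data;
    private int version; // starts at 0 and grows by one on every successful update

    VersionedNode(byte[] initialData) {
        this.data = initialData.clone();
    }

    // Applies the update only if expectedVersion matches the current version;
    // -1 matches any version. Returns the new version.
    synchronized int setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            throw new IllegalStateException(
                    "bad version: expected " + expectedVersion + " but was " + version);
        }
        data = newData.clone();
        return ++version;
    }

    synchronized byte[] data() {
        return data.clone();
    }
}
```

A typical optimistic update reads the data together with its version, computes the new value, and writes it back with that version, repeating the cycle if the write is rejected.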
// check if the zNode exists without setting a watch
// error handling omitted
Stat stat = zooKeeper.exists("/parent/child1", false);
// get data & set the default watcher
stat = new Stat();
byte[] data = zooKeeper.getData("/parent/child1", true, stat);
// use a separate Watcher
stat = zooKeeper.exists("/parent/child2", new Watcher() {
    @Override
    public void process(WatchedEvent event) {
        // react to the event, e.g. node deletion
    }
});
ZooKeeper API – read operations & setting
watches
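Watches are one-time triggers: once fired, a watch is gone until the client sets it again with another read. A minimal in-memory sketch of that behaviour follows; OneShotWatches and its methods are hypothetical names, not ZooKeeper classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// In-memory model of one-time watches; illustrative only.
final class OneShotWatches {
    private final Map<String, List<Consumer<String>>> watchers = new HashMap<>();

    // Registers a watcher for the next change of the given path.
    void watch(String path, Consumer<String> watcher) {
        watchers.computeIfAbsent(path, p -> new ArrayList<>()).add(watcher);
    }

    // Fires all watchers registered for the path and forgets them.
    void nodeChanged(String path) {
        List<Consumer<String>> fired = watchers.remove(path);
        if (fired != null) {
            fired.forEach(watcher -> watcher.accept(path));
        }
    }
}
```

Because the registration is removed before the callbacks run, a watcher that wants further notifications has to register again, and changes that happen between the trigger and the re-registration are not reported individually.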
• Atomic updates
• Sequential Consistency
• Single System Image
• Timeliness
• Reliability
• Availability
ZooKeeper Guarantees
[Diagram: three clients connected to an ensemble of five zk servers (zk 1 to zk 5)]
propagate, propose, vote, commit, notify
Sequential Consistency
client
follower
leader
follower
follower
setData
sync return
callback called
watch triggered
time
propagate, propose, commit, notify
Timeliness
client 2
follower
leader
follower
follower
client 1
setData (v2)
v2
v2
time
• ZooKeeper process failures are tolerated if
a quorum is present
• Simplest quorum: majority-based
• Avoids split-brain scenarios
Availability
[Diagram: three clients connected to an ensemble of five zk servers (zk 1 to zk 5)]
behaviour on follower failures
behaviour on leader failure
zk 1
zk 2
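The majority rule can be written down as simple arithmetic. The sketch below uses hypothetical names (MajorityQuorum, quorumSize, toleratedFailures) and assumes every server has one equally weighted vote.

```java
// Majority-quorum arithmetic; illustrative only.
final class MajorityQuorum {
    // Smallest number of servers that forms a strict majority.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that may fail while a majority is still available.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    // True if the servers that are still alive can form a quorum.
    static boolean hasQuorum(int ensembleSize, int alive) {
        return alive >= quorumSize(ensembleSize);
    }
}
```

Two disjoint sets of servers can never both be a majority, which is why a majority quorum avoids split-brain. The arithmetic also shows that a sixth server adds no fault tolerance over five: both tolerate two failures.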
ZooKeeper Recipes
Distributed Coordination Recipes
• Shared Data
• Group Membership
P  /
P  /serviceInstances
E  /serviceInstances/serverA
E  /serviceInstances/serverB
• Service Discovery
P  /
P  /service
E  /service/serviceInfo
• Lock, Mutex
• Leader Election
P  /
P  /service
ES /service/service_0000000001
ES /service/service_0000000002
Leader Election Recipe
P
P
/
service
ES service_0000000001
ES service_0000000002
zk 5
zk 1
zk 2 zk 3
service service service
zk 4
ES service_0000000003
service
watch
service_0000000001
watch
service_0000000001
watch
service_0000000001
watch
service_0000000001
service
watch
service_0000000002
n-1 watches are set on the same node
Improvement: each client watches the sequence node immediately preceding its own
instead of the first one
Improved Leader Election Recipe
P
P
/
service
ES service_0000000001
ES service_0000000002
zk 5
zk 1
zk 2 zk 3
service service service
zk 4
ES service_0000000003
service
watch
service_0000000001
watch
service_0000000002
watch
service_0000000001
service
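The decision each participant makes in the improved recipe can be sketched in plain Java: sort the children by sequence number, treat the lowest as the leader, and otherwise watch the direct predecessor. ElectionOrder and its methods are hypothetical names; fetching the children and setting the watch through a real client is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Ordering logic for the improved leader election recipe; illustrative only.
final class ElectionOrder {
    // Reads the 10-digit sequence suffix of a sequential zNode name.
    static int sequenceOf(String nodeName) {
        return Integer.parseInt(nodeName.substring(nodeName.length() - 10));
    }

    // Returns the child to watch, or empty if ownNode has the lowest sequence (the leader).
    static Optional<String> nodeToWatch(List<String> children, String ownNode) {
        List<String> ordered = new ArrayList<>(children);
        ordered.sort(Comparator.comparingInt(ElectionOrder::sequenceOf));
        int index = ordered.indexOf(ownNode);
        if (index < 0) {
            throw new IllegalArgumentException("own node not among children: " + ownNode);
        }
        return index == 0 ? Optional.empty() : Optional.of(ordered.get(index - 1));
    }

    static boolean isLeader(List<String> children, String ownNode) {
        return !nodeToWatch(children, ownNode).isPresent();
    }
}
```

When the watched predecessor disappears, the participant re-reads the children and repeats the check, since the predecessor may have failed without ever being the leader. Each deletion therefore wakes up a single client instead of all of them.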
• Higher level Client API to
ZooKeeper
• Hides most of the complexity of
communicating with ZK ensemble
• Implemented recipes
Curator and ZooKeeper
[Diagram: three clients, each using Curator on top of the ZK API, connected to an ensemble of five zk servers]
// create & start framework instance
CuratorFramework framework =
        CuratorFrameworkFactory.newClient("localhost:2181,localhost:2182,localhost:2183",
                new ExponentialBackoffRetry(1000, 20));
framework.start();
// foreground operation
Stat stat = framework.setData().forPath("/a/b/c/d", "testdata".getBytes());
// background operation
framework.setData().inBackground().forPath("/a/b/c/d/e", "testdata".getBytes());
Curator API
Protected EPHEMERAL_SEQUENTIAL nodes
Curator Features
zk 1
zk 2 zk 3
client 1 client 2 client 3
zk 4
P
P
/
cluster
framework.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath("/cluster/service");
ES service_0000000002
ES service_0000000003
zk 5
connection loss – reconnect attempt begins
reconnect successful within session timeout – retrying path creation
framework.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).withProtection().forPath("/cluster/service");
ES _c_16c39a25-87b4-4a54-bd05-1666a3e718de_service_0000000002
zk 5
connection loss – reconnect attempt begins
reconnect successful within session timeout – checking for a zNode with the same GUID
no extra zNode created
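The GUID check can be illustrated with a small sketch: before retrying the create, look among the parent's children for a name that already carries this client's GUID. The naming scheme below simply mirrors the example name on the slide, and ProtectedCreate and its methods are hypothetical; this is not Curator's actual implementation.

```java
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Sketch of the "protected create" idea; illustrative only.
final class ProtectedCreate {
    // Name prefix that tags a node with the creating client's GUID (assumed format).
    static String protectedPrefix(UUID guid, String name) {
        return "_c_" + guid + "_" + name;
    }

    // Finds a child that was already created with this GUID, if any.
    static Optional<String> findExisting(List<String> children, UUID guid) {
        String marker = "_c_" + guid + "_";
        return children.stream().filter(child -> child.startsWith(marker)).findFirst();
    }
}
```

If a matching child is found after the reconnect, the earlier create did reach the server and the retry can be skipped, so no orphaned duplicate is left behind.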
• Performance Considerations
• Using Observers to scale
• Using Hierarchical Quorums for multi-datacenter setup
• Surviving network partition with read-only mode
ZooKeeper in the real world
• Replicated data is kept entirely in memory by ZooKeeper processes
– a full GC pause can cause a server to drop out of the ensemble
• Synchronous filesystem writes in the commit phase
– can take seconds on an overloaded storage device
– use a dedicated device for ZooKeeper transaction logs
• Maximum zNode size is 1 MB by default
– data + metadata must fit within this limit
– configurable using a system property (jute.maxbuffer), but increasing it is not recommended
• Watches and performance
– too many watches on a single node: herd effect
– too many watches overall: increased memory footprint
Performance considerations
propagate, propose, vote, commit, notify
Using Observers to scale
client
follower
leader
follower
follower
setData
sync return
callback called
watch triggered
observer
observers:
• no proposals
• no votes
• can’t be leaders
time
Hierarchical Quorums
zk5
zk4
zk6
zk8
zk7
zk9
zk2
zk1
zk3
Majority quorums:
• any 4 zk failures are tolerated
A datacenter goes down
• remaining ensemble becomes
much less resilient
Hierarchical quorums:
• Disjoint groups are formed
• Quorum requires majority of votes
from the majority of groups
• Up to 5 failures can be tolerated (e.g. one whole group plus one server in each of the other two groups)
• Better for clusters spanning
multiple datacenters
group 1 group 2
group 3
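The hierarchical rule can be checked with a few lines of Java. The sketch assumes equally weighted servers and uses hypothetical names (HierarchicalQuorum, hasQuorum); it only evaluates the rule and is not ZooKeeper's quorum verifier.

```java
import java.util.List;
import java.util.Set;

// Evaluates the hierarchical quorum rule for equally weighted servers; illustrative only.
final class HierarchicalQuorum {
    // True if a majority of the groups each have a majority of their members alive.
    static boolean hasQuorum(List<Set<String>> groups, Set<String> alive) {
        int groupsWithMajority = 0;
        for (Set<String> group : groups) {
            long aliveInGroup = group.stream().filter(alive::contains).count();
            if (aliveInGroup > group.size() / 2) {
                groupsWithMajority++;
            }
        }
        return groupsWithMajority > groups.size() / 2;
    }
}
```

With three groups of three servers, four live servers spread as two in each of two groups still form a quorum, while five live servers concentrated as three in one group and one in each of the others do not. Which servers fail matters, not only how many.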
Read-only mode
zk5
zk4
zk6
zk8
zk7
zk9
zk2
zk1
zk3
Network partitions,
a datacenter gets detached
Partitioned zookeepers can operate
in read-only mode
• not connected to the ensemble
• no writes allowed
• read requests are still served
By default read-only mode is disabled
zk2
zk1
zk3
• ACLs
• Quota support
• Authentication support
• Transaction logging
• Connection state handling
• Weighted hierarchical quorums
• Configuration
• Dynamic reconfiguration
• ...
• More info:
• ZooKeeper documentation
http://zookeeper.apache.org/doc/trunk/index.html
• Curator resources
http://curator.apache.org
• ZAB protocol in detail
http://web.stanford.edu/class/cs347/reading/zab.pdf
http://diyhpl.us/~bryan/papers2/distributed/distributed-systems/zab.totally-ordered-broadcast-protocol.2008.pdf
• ZooKeeper book
http://shop.oreilly.com/product/0636920028901.do
Topics not covered
THANK YOU!
