ZooKeeper
  Scott Leberknight
Your mission...
   (should you choose to accept it)

           Build a distributed lock service


           Only one process may own the lock


           Must preserve ordering of requests


           Ensure proper lock release



                  ...this message will self-destruct in 5 seconds
Mission
Training
“Distributed Coordination
         Service”
Distributed, hierarchical filesystem

High availability, fault tolerant

Performant (i.e. it’s fast)

Facilitates loose coupling
Fallacies of Distributed
       Computing...
  # 1 The network is reliable.




       http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
Partial Failure

Did my message get through?


Did the operation complete?


      Or, did it fail???
Follow the Leader
                           One elected leader
      Many followers




Followers may lag leader
                             Eventual consistency
What problems can it solve?
  Group membership


  Distributed data structures
  (locks, queues, barriers, etc.)




  Reliable configuration service


  Distributed workflow
Training exercise:
 Group Membership
Get connected...
public ZooKeeper connect(String hosts, int timeout)
      throws IOException, InterruptedException {

    final CountDownLatch signal = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(hosts, timeout, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() ==
            Watcher.Event.KeeperState.SyncConnected) {
          signal.countDown();
        }
      }
    });
    signal.await();
    return zk;
}
                                  must wait for connected event!
ZooKeeper Sessions

Tick time

Session timeout

Automatic failover
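A minimal sketch of how tick time relates to session timeout. It assumes the server's default bounds of 2x and 20x tickTime, which are configurable, so treat the numbers as illustrative. The class and method names are hypothetical and not part of the ZooKeeper API.

```java
/** Sketch: how a server bounds the session timeout a client asks for. */
final class SessionTimeouts {

    // Assumption: default bounds of 2x and 20x tickTime (both can be overridden on the server).
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTime = 2000; // milliseconds
        System.out.println(negotiate(1000, tickTime));  // request below the minimum
        System.out.println(negotiate(10000, tickTime)); // request within bounds
        System.out.println(negotiate(60000, tickTime)); // request above the maximum
    }
}
```

The timeout passed to connect() is therefore a request. The value the session actually gets may be adjusted by the server.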
znode

Persistent or ephemeral

Optional sequential numbering

Can hold data (like a file)

Persistent ones can have children (like a directory)
public void create(String groupName)
    throws KeeperException, InterruptedException {

    String path = "/" + groupName;
    String createdPath = zk.create(path,
      null /*data*/,
      ZooDefs.Ids.OPEN_ACL_UNSAFE,
      CreateMode.PERSISTENT);
    System.out.println("Created " + createdPath);
}
public void join(String groupName, String memberName)
    throws KeeperException, InterruptedException {

    String path = "/" + groupName + "/" + memberName;
    String createdPath = zk.create(path,
      null /*data*/,
      ZooDefs.Ids.OPEN_ACL_UNSAFE,
      CreateMode.EPHEMERAL);
    System.out.println("Created " + createdPath);
}
public void list(String groupName)
    throws KeeperException, InterruptedException {

    String path = "/" + groupName;
    try {
      List<String> children = zk.getChildren(path, false);
      for (String child : children) {
        System.out.println(child);
      }
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("Group %s does not exist\n", groupName);
    }
}
public void delete(String groupName)
    throws KeeperException, InterruptedException {

    String path = "/" + groupName;
    try {
      List<String> children = zk.getChildren(path, false);
      for (String child : children) {
        zk.delete(path + "/" + child, -1);
      }
      zk.delete(path, -1); // parent
    } catch (KeeperException.NoNodeException e) {
      System.out.printf("%s does not exist\n", groupName);
    }
}


            -1   deletes unconditionally , or specify version
initial group members




node-4 died
Architecture
region




      Data Model
customer        account




address        transaction
Hierarchical filesystem


Composed of znodes
                                     znode data
                                       (< 1 MB)
Watchers


Atomic znode access (reads/writes)


Security
(via authentication, ACLs)
znode - types


Persistent          Persistent Sequential


Ephemeral           Ephemeral Sequential


        die when session expires
znode - sequential



         sequence numbers
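A small sketch of what the sequence numbers look like. ZooKeeper appends a zero-padded, 10-digit counter to the requested name of a sequential znode. The "lock-" prefix and the helper names below are assumptions for illustration only.

```java
/** Sketch: the naming of sequential znodes (pure Java, no ZooKeeper connection needed). */
final class SequentialNames {

    // Mimics the suffix ZooKeeper appends: a 10-digit, zero-padded counter.
    static String withSequence(String prefix, int counter) {
        return prefix + String.format("%010d", counter);
    }

    // Extracts the counter from the last 10 characters of the name.
    static int sequenceOf(String name) {
        return Integer.parseInt(name.substring(name.length() - 10));
    }

    public static void main(String[] args) {
        String name = withSequence("lock-", 7); // hypothetical prefix
        System.out.println(name);               // lock-0000000007
        System.out.println(sequenceOf(name));   // 7
    }
}
```

Because of the zero padding, sorting child names with the same prefix as plain strings gives the same order as sorting by sequence number.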
znode - operations
 Operation     Type
 create        write
 delete        write
 exists        read
 getChildren   read
 getData       read
 setData       write
 getACL        read
 setACL        write
 sync          read
APIs
 synchronous

 asynchronous
Synchronous
  public void list(final String groupName)
      throws KeeperException, InterruptedException {

      String path = "/" + groupName;
      try {
        List<String> children = zk.getChildren(path, false);
        // process children...
      }
      catch (KeeperException.NoNodeException e) {
        // handle non-existent path...
      }
  }
aSync
 public void list(String groupName)
     throws KeeperException, InterruptedException {

     String path = "/" + groupName;
     zk.getChildren(path, false,
       new AsyncCallback.ChildrenCallback() {
         @Override
         public void processResult(int rc, String path,
             Object ctx, List<String> children) {
           // process results when get called back later...
         }
       }, null /* optional context object */);
 }
znode - Watchers

Set (one-time) watches on read operations



    Write operations trigger watches
          on affected znodes



 Re-register watches on events (optional)
public interface Watcher {

    void process(WatchedEvent event);

    // other details elided...
}
public class ChildZNodeWatcher implements Watcher {
    private Semaphore semaphore = new Semaphore(1);

    public void watchChildren()
          throws InterruptedException, KeeperException {
        semaphore.acquire();
        while (true) {
            List<String> children = zk.getChildren(lockPath, this);
            display(children);
            semaphore.acquire();
        }
    }
    @Override public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
          semaphore.release();
        }
    }
    // other details elided...
}
Gotcha!
          When using watchers...


          ...updates can be missed!
                (not seen by client)
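Why updates can be missed: a toy, in-memory model of a one-time watch. This is not the ZooKeeper API. ToyNode and its methods are hypothetical and only mimic the "watch fires once, then must be re-registered" behavior.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of one-time watches; illustrative only. */
final class OneTimeWatchDemo {

    /** A single "znode" with a version and at most one pending watch. */
    static final class ToyNode {
        private int version = 0;
        private Runnable watch; // cleared when it fires

        int read(Runnable watcher) {
            this.watch = watcher; // reading registers the watch
            return version;
        }

        void write() {
            version++;
            Runnable pending = watch;
            watch = null; // the first change consumes the watch
            if (pending != null) {
                pending.run();
            }
        }
    }

    /** Returns {notifications seen, version found on the re-read}. */
    static int[] run(int writes) {
        ToyNode node = new ToyNode();
        AtomicInteger notifications = new AtomicInteger();
        node.read(notifications::incrementAndGet); // watch set once
        for (int i = 0; i < writes; i++) {
            node.write(); // only the first write finds a pending watch
        }
        int latest = node.read(notifications::incrementAndGet); // re-read, re-register
        return new int[] { notifications.get(), latest };
    }

    public static void main(String[] args) {
        int[] result = run(3);
        System.out.println("notifications=" + result[0] + ", version=" + result[1]);
    }
}
```

In this model three writes produce one notification. The client does not see the intermediate versions, but a re-read returns the latest state.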
Data Consistency
“A foolish consistency is the hobgoblin of little
minds, adored by little statesmen and philosophers
and divines. With consistency a great soul has simply
                                     nothing to do.”




                             - Ralph Waldo Emerson
                                      (Self-Reliance)
Data Consistency in ZK
Sequential updates                Durability (of writes)


Atomicity (all or nothin’)        Bounded lag time
                                    (eventual consistency)



Consistent client view
 (across all ZooKeeper servers)
Ensemble
broadcast                            broadcast




                                write
           Follower                                 Leader                           Follower




read                  read                  write            read                read                    read



  client              client               client            client             client          client
Leader election
                               “majority rules”
Atomic broadcast


Clients have session on one server


Writes routed through leader


Reads from server memory
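"Majority rules" in numbers: a short sketch of quorum arithmetic. The class and method names are hypothetical.

```java
/** Sketch: quorum size and failure tolerance for an ensemble. */
final class Quorum {

    // A strict majority of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Servers that can be lost while a majority remains.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 6; n++) {
            System.out.println(n + " servers: majority=" + majority(n)
                + ", tolerates=" + tolerableFailures(n));
        }
    }
}
```

Going from 5 to 6 servers raises the majority from 3 to 4 without tolerating any more failures, which is one reason ensembles are usually sized with an odd number of servers.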
Sessions
Client-requested timeout period



Automatic keep-alive (heartbeats)



 Automatic/transparent failover
Writes
All go through leader


       Broadcast to followers


                Global ordering
                                        every update
                                          has unique
                        zxid (ZooKeeper transaction id)
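A sketch of how a zxid gives a global order. It relies on the layout described in the ZooKeeper documentation: a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter. The helper names are hypothetical.

```java
/** Sketch: splitting a zxid into epoch and counter. */
final class Zxid {

    static long make(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xFFFFFFFFL);
    }

    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long zxid = make(2, 5);
        System.out.println(Long.toHexString(zxid)); // 200000005
        System.out.println(epoch(zxid) + " / " + counter(zxid));
    }
}
```

With this layout, any update made under a newer leader compares greater than every update from an earlier epoch, regardless of the counter value.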
Reads
“Follow-the-leader”
                              Eventual consistency

         Can lag leader


                 In-memory (fast!)
Training Complete

      PASSED
Final Mission:

Distributed Lock
Objectives

Mutual exclusion between processes



       Decouple lock users
Building Blocks
   parent lock znode

   ephemeral, sequential child znodes



                   lock-node


  child-1              child-2    ...   child-N
sample-lock: (no children)
sample-lock: process-1 (Lock)
sample-lock: process-1 (Lock), process-2
sample-lock: process-1 (Lock), process-2, process-3
sample-lock: process-2 (Lock), process-3
sample-lock: process-3 (Lock)
sample-lock: (no children)
pseudocode     (remember that stuff?)


  1. create child ephemeral, sequential znode

  2. list children of lock node and set a watch

  3. if znode from step 1 has lowest number,
     then lock is acquired. done.


  4. else wait for watch event from step 2,
     and then go back to step 2 on receipt
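Step 3 of the pseudocode can be sketched as a pure function over the child list returned in step 2. It assumes child names end in "-" followed by the sequence number, for example "lock-0000000003". The names are hypothetical, and no ZooKeeper calls are made here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Sketch: does my znode have the lowest sequence number? */
final class LockCheck {

    // Assumes the sequence number follows the last '-' in the name.
    static int sequenceOf(String name) {
        return Integer.parseInt(name.substring(name.lastIndexOf('-') + 1));
    }

    static boolean holdsLock(List<String> children, String myNode) {
        if (children.isEmpty()) {
            return false;
        }
        String lowest = Collections.min(children,
            Comparator.comparingInt(LockCheck::sequenceOf));
        return lowest.equals(myNode);
    }

    public static void main(String[] args) {
        // getChildren returns names in no particular order
        List<String> children = Arrays.asList(
            "lock-0000000005", "lock-0000000003", "lock-0000000004");
        System.out.println(holdsLock(children, "lock-0000000003"));
        System.out.println(holdsLock(children, "lock-0000000004"));
    }
}
```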
This is OK, but there are problems...




                          (perhaps a main B bus undervolt?)
Problem #1 - Connection Loss

     If partial failure on znode creation...



...then how do we know if znode was created?
Solution #1 (Connection Loss)

    Embed session id in child znode names


            lock-<sessionId>-


Failed-over client checks for child w/ sessionId
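A sketch of the check a failed-over client could run after a connection loss. It assumes child names of the form "lock-<sessionId>-<sequence>" with the session id written in hexadecimal. That encoding and the helper names are assumptions, not the format used by any particular recipe.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Sketch: find the child znode this session already created, if any. */
final class OwnNodeFinder {

    static Optional<String> findOwnNode(List<String> children, long sessionId) {
        // The trailing '-' keeps session 0x2 from matching a node of session 0x2b.
        String prefix = "lock-" + Long.toHexString(sessionId) + "-";
        return children.stream()
            .filter(child -> child.startsWith(prefix))
            .findFirst();
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
            "lock-1a-0000000004", "lock-2b-0000000005");
        System.out.println(findOwnNode(children, 0x2bL)); // create succeeded earlier
        System.out.println(findOwnNode(children, 0x2L));  // no node: safe to create again
    }
}
```

If a node is found, the earlier create succeeded and the client reuses it. If not, the client can retry the create without leaving a duplicate behind.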
Problem #2 - The Herd Effect
Many clients watching lock node.



Notification sent to all watchers on lock release...



     ...possible large spike in network traffic
Solution #2 (The Herd Effect)
       Set watch only on child znode
       with preceding sequence number


 e.g. client holding lock-9 watches only lock-8


     Note, lock-9 could watch lock-7
               (e.g. lock-8 client died)
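Choosing the single znode to watch can be sketched as "the highest sequence number below mine". As before, names ending in "-<sequence>" and the helper names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Sketch: pick the predecessor znode to watch, avoiding the herd effect. */
final class Predecessor {

    // Assumes the sequence number follows the last '-' in the name.
    static int sequenceOf(String name) {
        return Integer.parseInt(name.substring(name.lastIndexOf('-') + 1));
    }

    // Empty result means no lower node exists, so the caller holds the lock.
    static Optional<String> nodeToWatch(List<String> children, String myNode) {
        int mySequence = sequenceOf(myNode);
        return children.stream()
            .filter(child -> sequenceOf(child) < mySequence)
            .max(Comparator.comparingInt(Predecessor::sequenceOf));
    }

    public static void main(String[] args) {
        // lock-8 is gone (its client died), so lock-9 falls back to lock-7
        List<String> children = Arrays.asList(
            "lock-0000000009", "lock-0000000007");
        System.out.println(nodeToWatch(children, "lock-0000000009"));
        System.out.println(nodeToWatch(children, "lock-0000000007"));
    }
}
```

Each release then wakes at most one waiting client instead of every client watching the lock node.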
Implementation
Do it yourself?



...or be lazy and use existing code?
    org.apache.zookeeper.recipes.lock.WriteLock
    org.apache.zookeeper.recipes.lock.LockListener
Implementation, part deux

 WriteLock calls back when lock acquired (async)

 Maybe you want a synchronous client model...

 Let’s do some decoration...
public class BlockingWriteLock {
    private CountDownLatch signal = new CountDownLatch(1);
    // other instance variable definitions elided...

    public BlockingWriteLock(String name, ZooKeeper zookeeper,
        String path, List<ACL> acls) {
        this.name = name;
        this.path = path;
        this.writeLock = new WriteLock(zookeeper, path,
                                       acls, new SyncLockListener());
    }

    public void lock() throws InterruptedException, KeeperException {
        writeLock.lock();
        signal.await();
    }

    public void unlock() { writeLock.unlock(); }

    class SyncLockListener implements LockListener { /* next slide */ }
}
class SyncLockListener implements LockListener {

    @Override public void lockAcquired() {
      signal.countDown();
    }

    @Override public void lockReleased() {
    }

}




BlockingWriteLock lock =
  new BlockingWriteLock(myName, zooKeeper, path,
                         ZooDefs.Ids.OPEN_ACL_UNSAFE);
try {
  lock.lock();
  // do something while we have the lock
}
catch (Exception ex) {
  // handle appropriately...
}
finally {
  lock.unlock();
}
                            Easy to forget!

(we can do a little better, right?)
public class DistributedOperationExecutor {
  private final ZooKeeper _zk;
  public DistributedOperationExecutor(ZooKeeper zk) { _zk = zk; }

    public Object withLock(String name, String lockPath,
                            List<ACL> acls,
                            DistributedOperation op)
        throws InterruptedException, KeeperException {
      BlockingWriteLock writeLock =
        new BlockingWriteLock(name, _zk, lockPath, acls);
      try {
        writeLock.lock();
        return op.execute();
      } finally {
        writeLock.unlock();
      }
    }
}
executor.withLock(myName, path, ZooDefs.Ids.OPEN_ACL_UNSAFE,
   new DistributedOperation() {
     @Override public Object execute() {
       // do something while we have the lock
       return whateverTheResultIs;
     }
   }
);
Run it!
Apache ZooKeeper
Mission:

      ACCOMPLISHED
Review...
let’s crowdsource
   (for free beer, of course)
Like a filesystem, except distributed & replicated


Build distributed coordination, data structures, etc.


High-availability, reliability


Automatic session failover, keep-alive


Writes via leader, in-memory reads (fast)
Refs
zookeeper.apache.org/




                        shop.oreilly.com/product/0636920021773.do
                        (3rd edition pub date is May 29, 2012)
photo attributions
  Follow the Leader - http://www.flickr.com/photos/davidspinks/4211977680/


  Antique Clock - http://www.flickr.com/photos/cncphotos/3828163139/


  Skull & Crossbones - http://www.flickr.com/photos/halfanacre/2841434212/


  Ralph Waldo Emerson - http://www.americanpoems.com/poets/emerson


  1921 Jazzing Orchestra - http://en.wikipedia.org/wiki/File:Jazzing_orchestra_1921.png


  Apollo 13 Patch - http://science.ksc.nasa.gov/history/apollo/apollo-13/apollo-13.html


  Herd of Sheep - http://www.flickr.com/photos/freefoto/639294974/


  Running on Beach - http://www.flickr.com/photos/kcdale99/3148108922/


  Crowd - http://www.flickr.com/photos/laubarnes/5449810523/
                                                                             (others from iStockPhoto...)
(my info)


scott.leberknight@nearinfinity.com
www.nearinfinity.com/blogs/
twitter: sleberknight
