UNIT - 2
Distribution Models
• As data volumes increase, it becomes more difficult
and expensive to scale up—buy a bigger server to
run the database on.
• A more appealing option is to scale out—run the
database on a cluster of servers.
• Aggregate orientation fits well with scaling out
because the aggregate is a natural unit to use for
distribution.
Advantages of Distribution model:
• Gives the ability to handle larger quantities of data
• Gives the ability to process greater read or write
traffic
• Offers more availability in the face of network
slowdowns or breakages
Disadvantages of Distribution model:
• These important benefits come at a cost
• Running over a cluster introduces complexity
Single Server
• The first and the simplest distribution option is
no distribution at all.
• Run the database on a
single machine that handles all
the reads and writes to the data store.
• It eliminates all the complexities that the other
options introduce
• It’s easy for operations people to manage and easy
for application developers to reason about.
• Although a lot of NoSQL databases are designed
around the idea of running on a cluster, it can make
sense to use NoSQL with a single-server distribution model.
When to use Single Server distribution model:
• Graph databases are the obvious category
here—these work best in a single-server
configuration.
• If your data usage is mostly about
processing aggregates, then a single-server
document or key-value store may well be
worthwhile because it’s easier on application
developers.
Sharding
• Often, a busy data store is busy because different
people are accessing different parts of the dataset. In
these circumstances we can support horizontal
scalability by putting different parts of the data onto
different servers—a technique that’s called sharding.
Fig: Sharding puts different data on separate nodes,
each of which does its own reads and writes.
• In the ideal case, we have different users all talking to
different server nodes. Each user only has to talk to one
server, so gets rapid responses from that server. The load
is balanced out nicely between servers—for example, if
we have ten servers, each one only has to handle 10%
of the load.
• In order to get close to the ideal case, we have to ensure that
data that’s accessed together is clumped together on
the same node and that these clumps are arranged on
the nodes to provide the best data access.
• Data should be clumped so that one user mostly
gets her data from a single server. This is where
aggregate orientation comes in really handy. Aggregates
are designed to combine data that’s commonly accessed
together—so aggregates leap out as an obvious unit of
distribution.
• While arranging the data on the nodes, there are
several factors that can help to improve performance.
• If most accesses of certain aggregates are based on a
physical location, place the data close to where it’s being
accessed.
• Example: If you have orders for someone who lives in
Boston, you can place that data in your eastern US data
center.
• Another factor is trying to keep the load even. Try to
arrange aggregates so they are evenly distributed across
the nodes which all get equal amounts of the load. This
may vary over time.
• Example: if some data tends to be accessed on certain
days of the week, there may be domain-specific rules you’d
like to use when placing it.
• In some cases, it’s useful to put aggregates together if you
think they may be read in sequence.
• Historically most people have done sharding as part of
application logic. You might put all customers with
surnames starting from A to D on one shard and E to G on
another. This complicates the programming model, as
application code needs to ensure that queries are
distributed across the various shards.
• Furthermore, rebalancing the sharding means changing
the application code and migrating the data. Many
NoSQL databases offer auto-sharding, where the
database takes on the responsibility of allocating data
to shards and ensuring that data access goes to the right
shard. This can make it much easier to use sharding in an
application.
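The surname-range scheme described above can be sketched as a small routing function in application code. This is a minimal illustration, not any particular database's API; the shard names and letter ranges are invented for the example.

```python
# Minimal sketch of application-level sharding by surname range.
# Shard names and ranges are illustrative only.

SHARD_RANGES = [
    ("shard-1", "A", "D"),   # customers with surnames A-D
    ("shard-2", "E", "G"),   # customers with surnames E-G
    ("shard-3", "H", "Z"),   # everyone else
]

def shard_for(surname: str) -> str:
    """Return the shard that owns a customer, based on the first letter of the surname."""
    first = surname[0].upper()
    for shard, low, high in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers surname {surname!r}")

print(shard_for("Fowler"))    # shard-2
print(shard_for("Sadalage"))  # shard-3
```

Rebalancing means editing this table and migrating the affected data, which is exactly the burden that auto-sharding removes from the application.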
• Sharding is particularly valuable for performance because
it can improve both read and write performance.
• Using replication, particularly with caching, can greatly
improve read performance but does little for
applications that have a lot of writes. Sharding provides
a way to horizontally scale writes.
• Sharding does little to improve resilience when used
alone. Although the data is on different nodes, a node
failure makes that shard’s data unavailable just as
surely as it does for a single-server solution.
• The resilience benefit it does provide is that only the users
of the data on that shard will suffer; however, it’s not
good to have a database with part of its data missing.
• With a single server it’s easier to pay the effort and
cost to keep that server up and running; clusters
usually try to use less reliable machines, and you’re
more likely to get a node failure. So in practice,
sharding alone is likely to decrease resilience.
• Despite the fact that sharding is made much
easier with aggregates, it’s still not a step to be
taken lightly.
• Some databases are intended from the beginning to
use sharding, in which case it’s wise to run them
on a cluster from the very beginning of
development, and certainly in production.
• Other databases use sharding as a deliberate step up
from a single-server configuration, in which case it’s
best to start single-server and only use sharding
once your load projections clearly indicate that you
are running out of headroom.
• In any case the step from a single node to sharding is
going to be tricky. The lesson here is to use
sharding well before you need to—when you have
enough headroom to carry out the sharding.
Master-Slave Replication
• With master-slave distribution, you replicate data across multiple
nodes.
• One node is designated as the master, or primary. This master is
the authoritative source for the data and is usually responsible
for processing any updates to that data.
• The other nodes are slaves, or secondaries. A replication process
synchronizes the slaves with the master
Fig: Data is replicated from master to slaves.
Advantages:
• Scaling: Master-slave replication is most helpful
for scaling when you have a read-intensive
dataset. You can scale horizontally to handle
more read requests by adding more slave nodes
and ensuring that all read requests are routed to
the slaves.
• You are still, however, limited by the ability of the
master to process updates and its ability to pass
those updates on. Consequently it isn’t such a
good scheme for datasets with heavy write
traffic, although offloading the read traffic will
help a bit with handling the write load.
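As a rough sketch of how an application might route traffic in this model, the hypothetical router below sends every write to the master and spreads reads across the slaves; the node names are placeholders, not a real driver API.

```python
import itertools

# Hypothetical routing for a master-slave setup: writes go to the master,
# reads are rotated across the slaves.

class MasterSlaveRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def node_for_write(self):
        return self.master           # all updates go through the master

    def node_for_read(self):
        return next(self._slaves)    # round-robin reads over the slaves

router = MasterSlaveRouter("master-1", ["slave-1", "slave-2", "slave-3"])
print(router.node_for_write())   # master-1
print(router.node_for_read())    # slave-1
print(router.node_for_read())    # slave-2
```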
• Read resilience: if the master fails, the slaves
can still handle read requests. Again, this is
useful if most of your data access is reads. The
failure of the master does eliminate the ability to
handle writes until either the master is restored or
a new master is appointed. However, having
slaves as replicas of the master does speed
up recovery after a failure of the master, since
a slave can be appointed as the new master very
quickly.
• All read and write traffic can go to the master
while the slave acts as a hot backup. In this
case it’s easiest to think of the system as a
single-server store with a hot backup. You get
the convenience of the single-server
configuration but with greater resilience—
which is particularly handy if you want to be able
to handle server failures gracefully.
• Masters can be appointed manually or automatically.
• Manual appointing typically means that when you
configure your cluster, you configure one node as the
master.
• With automatic appointment, you create a cluster of
nodes and they elect one of themselves to be the
master.
• Apart from simpler configuration, automatic
appointment means that the cluster can
automatically appoint a new master when a master
fails, reducing downtime.
• Replication comes with some attractive benefits, but
it also comes with an unavoidable dark side—
inconsistency.
• You have the danger that different clients, reading
different slaves, will see different values because
the changes haven’t all propagated to the slaves.
• In the worst case, that can mean that a client cannot
read a write it just made.
• Even if you use master-slave replication just for hot
backup this can be a concern, because if the master
fails, any updates not passed on to the backup are
lost.
Peer-to-Peer Replication
• Master-slave replication helps with read scalability but
doesn’t help with scalability of writes. It provides resilience
against failure of a slave, but not of a master.
• Essentially, the master is still a bottleneck and a single
point of failure. Peer-to-peer replication attacks these
problems by not having a master. All the replicas have
equal weight, they can all accept writes, and the loss of any
of them doesn’t prevent access to the data store.
Fig: Peer-to-peer replication has all nodes applying reads
and writes to all the data.
Advantages:
• You can ride over node failures without losing
access to data.
• You can easily add
nodes to improve your
performance.
Disadvantages:
• Inconsistency: When you can write to two
different places, you run the risk that two people
will attempt to update the same record at the same
time—a write-write conflict. Inconsistencies on
read lead to problems but at least they are
relatively temporary. Inconsistent writes are
forever.
How to handle inconsistency?
• At one end, we can ensure that whenever we
write data, the replicas coordinate to ensure
we avoid a conflict. We don’t need all the
replicas to agree on the write, just a
majority, so we can still survive losing a
minority of the replica nodes.
• At the other extreme, we can decide to cope with an
inconsistent write.
Combining Sharding and Replication
• Replication and sharding are strategies that
can be combined.
• If we use both master-slave replication and
sharding, this means that we have multiple
masters, but each data item only has a
single master.
• Depending on your configuration, you may
choose a node to be a master for some data
and slaves for others, or you may dedicate
nodes for master or slave duties.
Fig: Using master-slave replication together with
sharding
• Using peer-to-peer replication and sharding
is a common strategy for column-family
databases.
• In a scenario like this you might have tens or
hundreds of nodes in a cluster with data
sharded over them.
• A good starting point for peer-to-peer
replication is to have a replication factor of 3,
so each shard is present on three nodes.
When a node fails, the shards on that node
will be rebuilt on the other nodes.
Fig: Using peer-to-peer replication together with
sharding
Key Points
There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each
server acts as the single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data
can be found in multiple places. A system may use either or both
techniques.
Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that
handles writes while slaves synchronize with the master and may handle
reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate
to synchronize their copies of the data. Master-slave replication
reduces the chance of update conflicts but peer-to-peer replication
avoids loading all writes onto a single point of failure.
Consistency
• One of the biggest changes from a
centralized relational database to a cluster-
oriented NoSQL database is in how you
think about consistency.
• Relational databases try to exhibit strong
consistency by avoiding all the various
inconsistencies.
• With NoSQL, as soon as you start building
something you have to think about what sort
of consistency you need for your system.
Update Consistency
Consider example of updating a telephone number.
• Coincidentally, Martin and Pramod are looking at the
company website and notice that the phone number
is out of date.
• They both have update access, so they both go in at
the same time to update the number.
• Assume they update it slightly differently, because
each uses a slightly different format.
• This issue is called a write-write conflict: two
people updating the same data item at the same time.
• When the writes reach the server, the server
will serialize them—decide to apply one, then
the other. Let’s assume it uses alphabetical
order and picks Martin’s update first, then
Pramod’s.
• Without any concurrency control, Martin’s
update would be applied and immediately
overwritten by Pramod’s.
• In this case Martin’s is a lost update. Here
the lost update is not a big problem, but often
it is.
Approaches for maintaining consistency:
In the face of concurrency, approaches for maintaining
consistency are often described as pessimistic or
optimistic.
• A pessimistic approach works by preventing conflicts
from occurring.
• An optimistic approach lets conflicts occur, but
detects them and takes action to sort them out.
• For update conflicts, the most common
pessimistic approach is to have write locks,
so that in order to change a value you need to
acquire a lock, and the system ensures that
only one client can get a lock at a time.
• So Martin and Pramod would both attempt to
acquire the write lock, but only Martin (the
first one) would succeed.
• Pramod would then see the result of Martin’s
write before deciding whether to make his own
update.
• A common optimistic approach is a conditional
update where any client that does an update tests the
value just before updating it to see if it’s changed
since his last read.
• In this case, Martin’s update would succeed but
Pramod’s would fail.
• The error would let Pramod know that he should
look at the value again and decide whether to
attempt a further update.
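A minimal sketch of such a conditional update against a toy in-memory store is shown below; the field names are invented for the example. Martin's write succeeds because the version he read is still current, while Pramod's fails and forces a re-read.

```python
# Sketch of an optimistic (conditional) update: the write only succeeds if
# the version the client read is still the current version.

store = {"phone": {"value": "555-1234", "version": 1}}

def conditional_update(key, new_value, expected_version):
    record = store[key]
    if record["version"] != expected_version:
        return False                 # someone else updated it since our read
    record["value"] = new_value
    record["version"] += 1
    return True

# Martin and Pramod both read version 1, then both try to update.
print(conditional_update("phone", "555 1234", 1))   # True: Martin's update applies
print(conditional_update("phone", "5551234", 1))    # False: Pramod must re-read and retry
```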
• Both the pessimistic and optimistic approaches
that we’ve just described rely on a consistent
serialization of the updates. With a single server,
this is obvious—it has to choose one, then the
other.
• But if there’s more than one server, such as with
peer-to-peer replication, then two nodes might
apply the updates in a different order, resulting
in a different value for the telephone number on
each peer.
• Often, when people talk about concurrency in
distributed systems, they talk about sequential
consistency—ensuring that all nodes apply
operations in the same order.
There is another optimistic way to handle a write-write
conflict—
• Save both updates and record that they are in conflict
• You have to merge the two updates somehow. Maybe you
show both values to the user and ask them to sort it out—
this is what happens if you update the same contact on your
phone and your computer.
• Alternatively, the computer may be able to perform the
merge itself; if it was a phone formatting issue, it may be able
to realize that and apply the new number with the standard
format.
• Any automated merge of write-write conflicts is highly
domain-specific and needs to be programmed for each
particular case.
• Often, when people first encounter these issues,
their reaction is to prefer pessimistic
concurrency because they are determined to
avoid conflicts.
• While in some cases this is the right answer, there
is always a tradeoff.
• Concurrent programming involves a
fundamental compromise between safety
(avoiding errors such as update conflicts) and
liveness (responding quickly to clients).
• Pessimistic approaches often severely degrade
the responsiveness of a system to the degree that
it becomes unfit for its purpose.
• Pessimistic concurrency often leads to deadlocks,
which are hard to prevent and debug.
• Replication makes it much more likely to run
into write-write conflicts. If different nodes
have different copies of some data which
can be independently updated, then you’ll
get conflicts unless you take specific
measures to avoid them.
• Using a single node as the target for all writes
for some data makes it much easier to
maintain update consistency. Of the
distribution models we discussed earlier, all
but peer-to-peer replication do this.
Read Consistency
• Having a data store that maintains update consistency is one
thing, but it doesn’t guarantee that readers of that data
store will always get consistent responses to their requests.
• Let’s imagine we have an order with line items and a
shipping charge. The shipping charge is calculated based on
the line items in the order.
• If we add a line item, we thus also need to recalculate and
update the shipping charge.
• In a relational database, the shipping charge and line items
will be in separate tables.
• The danger of inconsistency is that Martin adds a line item to
his order, Pramod then reads the line items and shipping
charge, and then Martin updates the shipping charge. This
is an inconsistent read or read-write conflict: In Figure
Pramod has done a read in the middle of Martin’s write.
Fig: A read-write conflict in logical
consistency
• We refer to this type of consistency as logical
consistency: ensuring that different data items
make sense together.
• To avoid a logically inconsistent read-write
conflict, relational databases support the notion
of transactions. Providing Martin wraps his two
writes in a transaction, the system guarantees that
Pramod will either read both data items before the
update or both after the update.
• A common claim we hear is that NoSQL
databases don’t support transactions and thus
can’t be consistent; such a claim is mostly wrong.
Clarification of why such claim is wrong:
• Any statement about lack of transactions usually
only applies to some NoSQL databases, in particular
the aggregate-oriented ones. In contrast, graph
databases tend to support ACID transactions just
the same as relational databases.
• Aggregate-oriented databases do support atomic
updates, but only within a single aggregate. This
means that you will have logical consistency within
an aggregate but not between aggregates. So in the
example, you could avoid running into that
inconsistency if the order, the delivery charge, and
the line items are all part of a single order
aggregate.
• Of course not all data can be put in the same aggregate,
so any update that affects multiple aggregates leaves
open a time when clients could perform an
inconsistent read. The length of time an inconsistency
is present is called the inconsistency window.
• A NoSQL system may have a quite short
inconsistency window: Amazon’s documentation says
that the inconsistency window for its SimpleDB
service is usually less than a second.
• Once you introduce replication, however, you get a
whole new kind of inconsistency. Let’s imagine
there’s one last hotel room for a desirable event.
• The hotel reservation system runs on many nodes.
Martin and Cindy are a couple considering this
room, but they are discussing this on the phone
because Martin is in London and Cindy is in
Boston.
• Meanwhile Pramod, who is in Mumbai, goes and
books that last room. That updates the replicated
room availability, but the update gets to Boston
quicker than it gets to London.
When Martin and Cindy fire up their browsers to see if the
room is available, Cindy sees it booked and Martin sees
it free. This is another inconsistent read—this form of
consistency we call replication consistency: ensuring that
the same data item has the same value when read from
different replicas (see Figure).
Figure: An example of replication inconsistency
• Eventually, of course, the updates will propagate fully, and
Martin will see the room is fully booked. Therefore this
situation is generally referred to as eventually consistent,
meaning that at any time nodes may have replication
inconsistencies but, if there are no further updates,
eventually all nodes will be updated to the same value. Data
that is out of date is generally referred to as stale.
• Although replication consistency is independent from logical
consistency, replication can make a logical inconsistency worse
by lengthening its inconsistency window. Two different
updates on the master may be performed in rapid
succession, leaving an inconsistency window of milliseconds.
But delays in networking could mean that the same
inconsistency window lasts for much longer on a slave.
• You can usually specify the level of consistency you want
with individual requests. This allows you to use weak
consistency most of the time when it isn’t an issue, but request
strong consistency when it is.
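The shape of a per-request consistency choice might look like the hypothetical client below; the method names and levels are illustrative and not taken from any specific product.

```python
# Hypothetical client with a per-request consistency level: weak reads touch
# one replica, strong reads and writes involve more replicas (and more latency).

class Client:
    def __init__(self, replicas=3):
        self.replicas = replicas

    def read(self, key, consistency="weak"):
        nodes = 1 if consistency == "weak" else (self.replicas // 2 + 1)
        print(f"read {key!r} from {nodes} replica(s)")

    def write(self, key, value, consistency="strong"):
        acks = self.replicas // 2 + 1 if consistency == "strong" else 1
        print(f"write {key!r}, waiting for {acks} acknowledgement(s)")

client = Client()
client.read("blog-comment:17")                               # stale data is fine here
client.write("room:last", "booked", consistency="strong")    # must not be lost
```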
• Consider the example of posting comments on a
blog entry. Few people are going to worry about
inconsistency windows of even a few minutes
while people are typing in their latest thoughts.
• Often, systems handle the load of such sites by
running on a cluster and load-balancing
incoming requests to different nodes.
• Therein lies a danger: You may post a message
using one node, then refresh your browser, but
the refresh goes to a different node which
hasn’t received your post yet—and it looks like
your post was lost.
• In situations like this, you can tolerate reasonably long
inconsistency windows, but you need read-your-writes
consistency, which means that, once you’ve made an
update, you’re guaranteed to continue seeing that
update.
• One way to get this in an otherwise eventually consistent
system is to provide session consistency: Within a
user’s session there is read-your-writes consistency.
This does mean that the user may lose that consistency
should their session end for some reason or should the
user access the same system simultaneously from
different computers, but these cases are relatively rare.
Techniques to provide session consistency
• A common way, and often the easiest way, is to have
a sticky session: a session that’s tied to one node
(this is also called session affinity). A sticky session
allows you to ensure that as long as you keep
read-your-writes consistency on a node, you’ll get
it for sessions too. The downside is that sticky
sessions reduce the ability of the load balancer to
do its job.
• Use version stamps and ensure every interaction
with the data store includes the latest version
stamp seen by a session. The server node must then
ensure that it has the updates that include that version
stamp before responding to a request.
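A minimal sketch of the version-stamp approach is shown below: the session remembers the newest stamp it has seen, and a replica refuses to answer until it has caught up to that stamp. All names here are invented for illustration.

```python
# Sketch of read-your-writes consistency via version stamps.

class Replica:
    def __init__(self):
        self.version = 0
        self.data = {}

    def apply(self, key, value, version):
        self.data[key] = value
        self.version = version

    def read(self, key, min_version):
        if self.version < min_version:
            raise RuntimeError("replica is stale; retry or route to another node")
        return self.data.get(key)

class Session:
    def __init__(self):
        self.last_seen = 0           # newest version stamp this session has seen

    def write(self, replica, key, value):
        self.last_seen = replica.version + 1
        replica.apply(key, value, self.last_seen)

    def read(self, replica, key):
        return replica.read(key, self.last_seen)

fresh, stale = Replica(), Replica()
session = Session()
session.write(fresh, "post", "my latest thoughts")
print(session.read(fresh, "post"))    # works: this replica has our write
# session.read(stale, "post")         # would raise: the stale replica is detected
```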
Relaxing Consistency
• Consistency is a Good Thing—but, sadly, sometimes we
have to sacrifice it. It is always possible to design a
system to avoid inconsistencies, but often impossible to
do so without making unbearable sacrifices in other
characteristics of the system.
• As a result, we often have to compromise consistency
for something else. While some architects see this as a
disaster, we see it as part of the unavoidable
compromises involved in system design.
• Furthermore, different domains have different
tolerances for inconsistency, and we need to take this
tolerance into account as we make our decisions.
• Compromising consistency is a familiar concept even
in single-server relational database systems. Here, our
principal tool to enforce consistency is the transaction,
and transactions can provide strong consistency
guarantees.
• However, transaction systems usually come with the
ability to relax isolation levels, allowing queries to
read data that hasn’t been committed yet, and in
practice we see most applications relax consistency
down from the highest isolation level (serializable) in
order to get effective performance.
• We most commonly see people using the read-
committed transaction level, which eliminates some
read-write conflicts but allows others.
• Many systems go without transactions entirely
because the performance impact of transactions is too
high.
• On a small scale, we saw the popularity of MySQL
during the days when it didn’t support
transactions. Many websites liked the high speed
of MySQL and were prepared to live without
transactions.
• At the other end of the scale, some very large
websites, such as eBay, have to go without
transactions in order to perform acceptably—this is
particularly true when you need to introduce
sharding.
The CAP Theorem
• In the NoSQL world, the CAP theorem is often cited as a reason
why you may need to relax consistency.
• The basic statement of the CAP theorem is that, given the
three properties of Consistency, Availability, and
Partition tolerance, you can only get two. Obviously
this depends very much on how you define these three
properties.
• Consistency means that data is the same across the cluster, so
you can read or write from/to any node and get the same
data.
• Availability has a particular meaning in the context of
CAP—it means that if you can talk to a node in the
cluster, it can read and write data.
• Partition tolerance means that the cluster can survive
communication breakages in the cluster that separate the
cluster into multiple partitions unable to communicate
with each other.
Figure: With two breaks in the communication lines,
the network partitions into two groups.
• A single-server system is the obvious example of a
CA system—a system that has Consistency and
Availability but not Partition tolerance.
• A single machine can’t partition, so it does not have
to worry about partition tolerance. There’s only one
node—so if it’s up, it’s available. Being up and
keeping consistency is reasonable.
• It is theoretically possible to have a CA cluster.
However, this would mean that if a partition ever
occurs in the cluster, all the nodes in the cluster
would go down so that no client can talk to a node.
• By the usual definition of “available,” this would mean
a lack of availability, but this is where CAP’s special
usage of “availability” gets confusing. CAP defines
“availability” to mean “every request received by a
non-failing node in the system must result in a
response”. So a failed, unresponsive node doesn’t
count as a lack of CAP availability.
• This does imply that you can build a CA cluster, but
you have to ensure it will only partition rarely.
• So clusters have to be tolerant of network
partitions. And here is the real point of the CAP
theorem.
• Although the CAP theorem is often stated as “you can
only get two out of three,” in practice what it’s saying
is that in a system that may suffer partitions, as
distributed systems do, you have to trade off
consistency versus availability.
• Often, you can compromise a little consistency to
get some availability. The resulting system would be
neither perfectly consistent nor perfectly
available—but would have a combination that is
reasonable for your particular needs.
• Example : Martin and Pramod are both trying to book the last
hotel room on a system that uses peer-to-peer distribution with
two nodes (London for Martin and Mumbai for Pramod).
• If we want to ensure consistency, then when Martin tries to book
his room on the London node, that node must communicate with
the Mumbai node before confirming the booking. Essentially, both
nodes must agree on the serialization of their requests. This
gives us consistency—but if the network link breaks, then neither
system can book any hotel room, sacrificing availability.
• One way to improve availability is to designate one node as the
master for a particular hotel and ensure all bookings are
processed by that master. If that master is Mumbai, then
Mumbai can still process hotel bookings for that hotel and
Pramod will get the last room.
• If we use master-slave replication, London users can see the
inconsistent room information but cannot make a booking and thus
cause an update inconsistency.
• We still can’t book a room on the London node for the hotel
whose master is in Mumbai if the connection goes down.
• In CAP terminology, this is a failure of availability in that
Martin can talk to the London node but the London node
cannot update the data.
• To gain more availability, we might allow both systems to
keep accepting hotel reservations even if the network link
breaks down. The danger here is that Martin and Pramod
book the last hotel room.
• However, depending on how this hotel operates, that may be
fine. Often, travel companies tolerate a certain amount of
overbooking in order to cope with no-shows.
• Conversely, some hotels always keep a few rooms clear even
when they are fully booked, in order to be able to swap a
guest out of a room with problems or to accommodate a high-
status late booking.
• Some might even cancel the booking with an apology once
they detect the conflict—reasoning that the cost of that
apology is less than the cost of turning away business.
• The classic example of allowing inconsistent writes is
the shopping cart, as discussed in Amazon’s Dynamo.
• In this case you are always allowed to write to your
shopping cart, even if network failures mean you end up
with multiple shopping carts. The checkout process can
merge the two shopping carts by putting the union of
the items from the carts into a single cart and returning
that.
• Almost always that’s the correct answer—but if not, the
user gets the opportunity to look at the cart before
completing the order.
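A sketch of that merge step, assuming each cart is just a list of item names, is as simple as taking the union:

```python
# Sketch of the shopping-cart merge: conflicting carts from different nodes
# are combined by taking the union of their items.

def merge_carts(cart_a, cart_b):
    return sorted(set(cart_a) | set(cart_b))

cart_on_node_1 = ["book", "headphones"]
cart_on_node_2 = ["book", "coffee"]
print(merge_carts(cart_on_node_1, cart_on_node_2))
# ['book', 'coffee', 'headphones']
```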
• The lesson here is that although most software developers
treat update consistency as The Way Things Must Be,
there are cases where you can deal gracefully with
inconsistent answers to requests.
• If you can find a way to handle inconsistent updates, this
gives you more options to increase availability and
performance. For a shopping cart, it means that shoppers
can always shop, and do so quickly.
• A similar logic applies to read consistency. If you
are trading financial instruments over a
computerized exchange, you may not be able to
tolerate any data that isn’t right up to date.
However, if you are posting a news item to a media
website, you may be able to tolerate old pages for
minutes.
• Different data items may have different tolerances for
staleness, and thus may need different settings in
your replication configuration.
• Promoters of NoSQL often say that instead of
following the ACID properties of relational
transactions, NoSQL systems follow the BASE
properties (Basically Available, Soft state, Eventual
consistency).
• It’s usually better to think not about the tradeoff
between consistency and availability but rather
between consistency and latency (response time).
• We can improve consistency by getting more
nodes involved in the interaction, but each node
we add increases the response time of that
interaction.
• We can then think of availability as the limit of
latency that we’re prepared to tolerate; once latency
gets too high, we give up and treat the data as
unavailable—which neatly fits its definition in the
context of CAP.
Relaxing Durability
• Most people would laugh at relaxing
durability—after all, what is the point of a data
store if it can lose updates?
• There are cases where you may want to trade off
some durability for higher performance.
• If a database can run mostly in memory, apply
updates to its in-memory representation, and
periodically flush changes to disk, then it may
be able to provide considerably higher
responsiveness to requests. The cost is that, if
the server crashes, any updates since the last
flush will be lost.
• One example of where this tradeoff may be
meaningful is storing user-session state.
• A big website may have many users and keep
temporary information about what each user is
doing in some kind of session state. There’s a lot of
activity on this state, creating lots of demand, which
affects the responsiveness of the website.
• The vital point is that losing the session data isn’t
too much of a tragedy—it will create some
annoyance, but may be less than a slower website
would cause. This makes it a good candidate for
nondurable writes.
• Often, you can specify the durability needs on a
call-by-call basis, so that more important updates
can force a flush to disk.
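One plausible shape for such a call-by-call durability option is sketched below; the function signature is hypothetical, used only to show the idea of a fast in-memory path versus a forced flush.

```python
import json, os, tempfile

# Hypothetical write call with a per-call durability flag: session state takes
# the fast in-memory path, important updates force a flush to disk.

memory_store = {}
SNAPSHOT = os.path.join(tempfile.gettempdir(), "store-snapshot.json")

def write(key, value, durable=False):
    memory_store[key] = value                # always update the in-memory copy
    if durable:
        with open(SNAPSHOT, "w") as f:       # force the change to disk before acknowledging
            json.dump(memory_store, f)
            f.flush()
            os.fsync(f.fileno())

write("session:42", {"last_page": "/checkout"})        # fast, lost if the server crashes
write("order:99", {"status": "paid"}, durable=True)    # important, flushed to disk
```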
• Another class of durability tradeoffs comes up
with replicated data. A failure of replication
durability occurs when a node processes an
update but fails before that update is replicated to
the other nodes.
• A simple case of this may happen if you have a
master-slave distribution model where the
slaves appoint a new master automatically if
the existing master fails. If a master does fail, any
writes not passed on to the replicas will
effectively be lost.
• If the master comes back online, those updates
will conflict with updates that have happened
since. We think of this as a durability problem
because you think your update has succeeded
since the master acknowledged it, but a master
node failure caused it to be lost.
• You can improve replication durability by
ensuring that the master waits for some replicas
to acknowledge the update before the master
acknowledges it to the client.
• Obviously, however, that will slow down
updates and make the cluster unavailable if
slaves fail—so, again, we have a tradeoff,
depending upon how vital durability is.
• As with basic durability, it’s useful for individual
calls to indicate what level of durability they
need.
Quorums
• When you’re trading off consistency or durability, it’s not an
all-or-nothing proposition.
• The more nodes you involve in a request, the
higher is the chance of avoiding an
inconsistency.
• This naturally leads to the question: How
many nodes need to be involved to get
strong consistency?
Write quorum
• Imagine some data replicated over three nodes. You don’t
need all nodes to acknowledge a write to ensure strong
consistency; all you need is two of them—a majority.
• If you have conflicting writes, only one can get a majority.
This is referred to as a write quorum.
• It is expressed in a slightly pretentious inequality of W >
N/2
• It means the number of nodes participating in the write
(W) must be more than half the number of nodes
involved in replication (N).
• The number of replicas is often called the replication
factor.
Read quorum
• Similarly to the write quorum, there is the notion of
read quorum: How many nodes you need to contact
to be sure you have the most up-to-date change.
• The read quorum is a bit more
complicated because it depends on how many
nodes need to confirm a write.
• Let’s consider a replication factor of 3.
• If all writes need two nodes to confirm (W = 2) then we
need to contact at least two nodes to be sure we’ll get the
latest data.
• If, however, writes are only confirmed by a single node
(W = 1) we need to talk to all three nodes to be sure we
have the latest updates.
• In this case, since we don’t have a write quorum, we
may have an update conflict, but by contacting enough
readers we can be sure to detect it. Thus we can get
strongly consistent reads even if we don’t have strong
consistency on our writes.
• This relationship between the number of nodes you need
to contact for a read (R), those confirming a write (W),
and the replication factor (N) can be captured in an
inequality: You can have a strongly consistent read if R
+ W > N.
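The two inequalities are easy to check directly; a small sketch:

```python
# Sketch of the quorum inequalities: W > N/2 prevents two conflicting writes
# from both getting a quorum, and R + W > N guarantees a read overlaps at
# least one node that saw the latest write.

def has_write_quorum(w, n):
    return w > n / 2

def read_is_strongly_consistent(r, w, n):
    return r + w > n

N = 3                                                # replication factor
print(has_write_quorum(w=2, n=N))                    # True:  2 > 1.5
print(read_is_strongly_consistent(r=2, w=2, n=N))    # True:  2 + 2 > 3
print(read_is_strongly_consistent(r=3, w=1, n=N))    # True:  writes confirmed by one node need reads of all three
print(read_is_strongly_consistent(r=1, w=2, n=N))    # False: a single-node read may be stale
```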
• These inequalities are written with a peer-to-peer
distribution model in mind. If you have a
master-slave distribution, you only have to
write to the master to avoid write-write
conflicts, and similarly only read from the
master to avoid read-write conflicts.
• With this notation, it is common to
confuse the number of nodes in the cluster with
the replication factor, but these are often
different.
• I may have 100 nodes in my cluster, but
only a replication factor of 3, with most of the
distribution handled by sharding.
• Indeed most authorities suggest that a replication
factor of 3 is enough to have good resilience. This
allows a single node to fail while still maintaining
quora for reads and writes. If you have automatic
rebalancing, it won’t take too long for the cluster to
create a third replica, so the chances of losing a
second replica before a replacement comes up are
slight.
• The number of nodes participating in an operation
can vary with the operation.
• When writing, we might require quorum for some
types of updates but not others, depending on how
much we value consistency and availability.
• Similarly, a read that needs speed but can tolerate
staleness should contact fewer nodes.
• Often you may need to take both into account. If you
need fast, strongly consistent reads, you could
require writes to be acknowledged by all the nodes,
thus allowing reads to contact only one (N = 3, W=
3, R = 1).
• That would mean that your writes are slow, since they
have to contact all three nodes, and you would not be
able to tolerate losing a node. But in some
circumstances that may be the tradeoff to make.
• The point to all of this is that you have a range of
options to work with and can choose which
combination of problems and advantages to prefer.
Key Points
• Write-write conflicts occur when two clients try to
write the same data at the same time. Read-write
conflicts occur when one client reads inconsistent
data in the middle of another client’s write.
• Pessimistic approaches lock data records to prevent
conflicts. Optimistic approaches detect conflicts
and fix them.
• Distributed systems see read-write conflicts due to
some nodes having received updates while other
nodes have not. Eventual consistency means that at
some point the system will become consistent once
all the writes have propagated to all the nodes.
• Clients usually want read-your-writes consistency, which
means a client can write and then immediately read the new
value. This can be difficult if the read and the write
happen on different nodes.
• To get good consistency, you need to involve many nodes
in data operations, but this increases latency. So you often
have to trade off consistency versus latency.
• The CAP theorem states that if you get a network partition,
you have to trade off availability of data versus consistency.
• Durability can also be traded off against latency,
particularly if you want to survive failures with replicated
data.
• You do not need to contact all replicas to preserve
strong consistency with replication; you just need a large
enough quorum.
Version Stamps
• Many opponents of NoSQL databases focus on
the lack of support for transactions.
Transactions are a useful tool that helps
programmers support consistency.
• One reason why many NoSQL proponents
worry less about a lack of transactions is that
aggregate-oriented NoSQL databases do
support atomic updates within an aggregate—
and aggregates are designed so that their data
forms a natural unit of update.
• That said, it’s true that transactional needs are
something to take into account when you
decide what database to use.
• As part of this, it’s important to remember that
transactions have limitations.
• Even within a transactional system we still
have to deal with updates that require
human intervention and usually cannot be
run within transactions because they would
involve holding a transaction open for too
long.
• We can cope with these using version
stamps—which turn out to be handy in other
situations as well, particularly as we move
away from the single-server distribution
model.
Business and System Transactions
• The need to support update consistency without
transactions is actually a common feature of systems
even when they are built on top of transactional
databases. When users think about transactions,
they usually mean business transactions.
• A business transaction may be something like
browsing a product catalog, choosing a bottle of
Cold drink at a good price, filling in credit card
information, and confirming the order.
• Yet all of this usually won’t occur within the system
transaction provided by the database because this
would mean locking the database elements while
the user is trying to find their credit card and gets
called off to lunch by their colleagues.
• Usually applications only begin a system
transaction at the end of the interaction with the
user, so that the locks are only held for a short
period of time.
• The problem, however, is that calculations and
decisions may have been made based on data
that’s changed.
• The price list may have updated the price of the
Cold drink bottle, or someone may have updated
the customer’s address, changing the shipping
charges.
• The broad techniques for handling this are offline
concurrency, useful in NoSQL situations too. A
particularly useful approach is the Optimistic Offline
Lock, a form of conditional update where a client
operation rereads any information that the
business transaction relies on and checks that it
hasn’t changed since it was originally read and
displayed to the user.
• A good way of doing this is to ensure that records in
the database contain some form of version stamp: a
field that changes every time the underlying data in
the record changes.
• When you read the data you keep a note of the
version stamp, so that when you write data you can
check to see if the version has changed.
• You may have come across this technique with updating
resources with HTTP. One way of doing this is to use
etags. Whenever you get a resource, the server responds
with an etag in the header.
• This etag is an opaque string that indicates the version
of the resource. If you then update that resource, you can
use a conditional update by supplying the etag that you
got from your last GET.
• If the resource has changed on the server, the etags
won’t match and the server will refuse the update,
returning a 412 (Precondition Failed) response.
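A hedged sketch of that interaction with the `requests` library is below; the URL is a placeholder and the resource format is invented for the example.

```python
import requests

URL = "https://example.com/api/customers/42"    # placeholder resource

# Read the resource and note its version (the ETag header).
response = requests.get(URL)
etag = response.headers.get("ETag")

# Conditional update: only apply if the resource is unchanged since our read.
update = requests.put(
    URL,
    json={"phone": "555-1234"},
    headers={"If-Match": etag},
)

if update.status_code == 412:                   # Precondition Failed
    print("Resource changed since our read; fetch again and retry")
```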
• Some databases provide a similar mechanism of
conditional update that allows you to ensure updates
won’t be based on stale data.
• You can do this check yourself, although you then have
to ensure no other thread can run against the resource
between your read and your update. Sometimes this is
called a compare-and-set (CAS) operation, whose name
comes from the CAS operations done in processors.
• The difference is that a processor CAS compares a
value before setting it, while a database conditional
update compares a version stamp of the value.
• There are various ways you can construct your version
stamps. You can use a counter, always incrementing it
when you update the resource. Counters are useful since
they make it easy to tell if one version is more recent than
another. On the other hand, they require the server to
generate the counter value, and also need a single
master to ensure the counters aren’t duplicated.
• Another approach is to create a GUID, a large random
number that’s guaranteed to be unique. These use some
combination of dates, hardware information, and
whatever other sources of randomness they can pick up.
The nice thing about GUIDs is that they can be generated
by anyone and you’ll never get a duplicate; a
disadvantage is that they are large and can’t be
compared directly for recentness.
• A third approach is to make a hash of the contents
of the resource. With a big enough hash key size, a
content hash can be globally unique like a GUID and
can also be generated by anyone.
• The advantage is that they are deterministic—any
node will generate the same content hash for same
resource data.
• However, like GUIDs they can’t
be directly
compared for recentness, and they can be lengthy.
• A fourth approach is to use the timestamp of the last
update. Like counters, they are reasonably short
and can be directly compared for recentness, yet
have the advantage of not needing a single master.
• Multiple machines can generate timestamps—but
to work properly, their clocks have to be kept in
sync.
• One node with a bad clock can cause all sorts of data
corruptions. There’s also a danger that if the
timestamp is too granular you can get duplicates—
it’s no good using timestamps of a millisecond
precision if you get many updates per millisecond.
• You can blend the advantages of these
different version stamp schemes by using
more than one of them to create a composite
stamp. For example, CouchDB uses a
combination of counter and content hash.
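A rough sketch of a composite stamp in that spirit, combining a counter with a content hash, might look like this (the exact format is illustrative, not CouchDB's actual revision scheme):

```python
import hashlib
import json

# Sketch of a composite version stamp: an incrementing counter for easy
# recentness comparison, plus a content hash that any node can recompute.

def next_version_stamp(current_counter: int, document: dict) -> str:
    content_hash = hashlib.md5(
        json.dumps(document, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f"{current_counter + 1}-{content_hash}"

doc = {"name": "Martin", "phone": "555-1234"}
print(next_version_stamp(1, doc))    # e.g. "2-<hash of the document contents>"
```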
• Version stamps are also useful for providing
session consistency.
Version Stamps on Multiple Nodes
• The basic version stamp works well when you have
a single authoritative source for data, such as a
single server or master-slave replication. In that
case the version stamp is controlled by the master.
Any slaves follow the master’s stamps.
• But this system has to be enhanced in a peer-to-
peer distribution model because there’s no longer a
single place to set the version stamps.
• If you’re asking two nodes for some data, you
run into the chance that they may give you
different answers. If this happens, your reaction
may vary depending on the cause of that
difference.
• It may be that an update has only reached one
node but not the other, in which case you can
accept the latest (assuming you can tell which one
that is).
• Alternatively, you may have run into an
inconsistent update, in which case you need to
decide how to deal with that. In this situation, a
simple GUID or etag won’t suffice, since these
don’t tell you enough about the relationships.
• The simplest form of version stamp is a
counter. Each time a node updates the data, it
increments the counter and puts the value of
the counter into the version stamp.
• If you have blue and green slave replicas of a
single master, and the blue node answers with
a version stamp of 4 and the green node with
6, you know that the green’s answer is more
recent.
• In multiple-master cases, we need something
fancier.
• One approach, used by distributed version control
systems, is to ensure that all nodes contain a
history of version stamps. That way you can see if
the blue node’s answer is an ancestor of the green’s
answer.
• This would either require the clients to hold onto
version stamp histories, or the server nodes to
keep version stamp histories and include them
when asked for data.
• Although version control systems keep these kinds
of histories, they aren’t found in NoSQL databases.
• A simple but problematic approach is to use
timestamps. The main problem here is that
it’s usually difficult to ensure that all the
nodes have a consistent notion of time,
particularly if updates can happen rapidly.
• Should a node’s clock get out of sync, it can
cause all sorts of trouble. In addition, you
can’t detect write-write conflicts with
timestamps, so it would only work well for the
single master case—and then a counter is
usually better.
• The most common approach used by peer-to-peer
NoSQL systems is a special form of version stamp
which we call a vector stamp. In essence, a vector
stamp is a set of counters, one for each node.
• A vector stamp for three nodes (blue, green, black)
would look something like [blue: 43, green: 54,
black: 12]. Each time a node has an internal
update, it updates its own counter, so an update in
the green node would change the vector to [blue: 43,
green: 55, black: 12].
• Whenever two nodes communicate, they synchronize
their vector stamps.
• By using this scheme you can tell if one version
stamp is newer than another because the newer
stamp will have all its counters greater than or
equal to those in the older stamp.
• So [blue: 1, green: 2, black: 5] is newer than
[blue: 1, green: 1, black: 5] since one of its
counters is greater.
• If both stamps have a counter greater than the
other, e.g. [blue: 1, green: 2, black: 5] and
[blue: 2, green: 1, black: 5], then you have a
write-write conflict.
• There may be missing values in the vector, in which
case we treat the missing value as 0. So [blue:
6, black: 2] would be treated as [blue: 6, green: 0,
black: 2]. This allows you to easily add new nodes
without invalidating the existing vector stamps.
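A small sketch of that comparison rule, treating missing counters as 0 and reporting a conflict when each stamp is ahead of the other somewhere:

```python
# Sketch of vector stamp comparison. A stamp is newer if it is ahead on at
# least one counter and behind on none; if both stamps are ahead somewhere,
# the updates were concurrent and we have a write-write conflict.

def compare(a, b):
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a is newer"
    if b_ahead:
        return "b is newer"
    return "equal"

print(compare({"blue": 1, "green": 2, "black": 5}, {"blue": 1, "green": 1, "black": 5}))  # a is newer
print(compare({"blue": 1, "green": 2, "black": 5}, {"blue": 2, "green": 1, "black": 5}))  # conflict
print(compare({"blue": 6, "black": 2}, {"blue": 6, "green": 0, "black": 2}))              # equal
```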
• Vector stamps are a valuable tool that spots
inconsistencies, but doesn’t resolve them. Any
conflict resolution will depend on the domain you
are working in. This is part of the
consistency/latency tradeoff.
• You either have to live with the fact that network
partitions may make your system unavailable, or
you have to detect and deal with inconsistencies.
Key Points
• Version stamps help you detect concurrency
conflicts. When you read data, then update it, you
can check the version stamp to ensure nobody
updated the data between your read and write.
• Version stamps can be implemented using
counters, GUIDs, content hashes, timestamps,
or a combination of these.
• With distributed systems, a vector of version
stamps allows you to detect when different nodes
have conflicting updates.

More Related Content

Similar to NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS (20)

PPTX
TASK AND DATA PARALLELISM in Computer Science pptx
dianachakauya
 
PPTX
Scalable Web Architecture and Distributed Systems
hyun soomyung
 
PPTX
Nosql-Module 1 PPT.pptx
Radhika R
 
PDF
System design fundamentals CAP.pdf
UsmanAhmed269749
 
PPTX
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
PDF
Multithreaded Programming Part- I.pdf
Harika Pudugosula
 
PDF
Scalability Considerations
Navid Malek
 
PPTX
MongoDB
fsbrooke
 
PDF
Talon systems - Distributed multi master replication strategy
Saptarshi Chatterjee
 
PPTX
NoSQL Evolution
Abdul Manaf
 
PPTX
lecture-13.pptx
laiba29012
 
PPTX
Relational and non relational database 7
abdulrahmanhelan
 
PPTX
U1-NOSQL.pptx DIFFERENT TYPES OF NOSQL DATABASES
KusumaS36
 
PPTX
Relational databases store data in tables
HELLOWorld889594
 
PPTX
History and Introduction to NoSQL over Traditional Rdbms
vinayh902
 
PPTX
Database replication
Arslan111
 
KEY
Writing Scalable Software in Java
Ruben Badaró
 
PPT
Intro to MySQL Master Slave Replication
satejsahu
 
PPTX
NoSQL and Couchbase
Sangharsh agarwal
 
PPTX
Introduction to NoSQL Databases and Types of NOSQL Databases.pptx
Abcd463572
 
TASK AND DATA PARALLELISM in Computer Science pptx
dianachakauya
 
Scalable Web Architecture and Distributed Systems
hyun soomyung
 
Nosql-Module 1 PPT.pptx
Radhika R
 
System design fundamentals CAP.pdf
UsmanAhmed269749
 
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
Multithreaded Programming Part- I.pdf
Harika Pudugosula
 
Scalability Considerations
Navid Malek
 
MongoDB
fsbrooke
 
Talon systems - Distributed multi master replication strategy
Saptarshi Chatterjee
 
NoSQL Evolution
Abdul Manaf
 
lecture-13.pptx
laiba29012
 
Relational and non relational database 7
abdulrahmanhelan
 
U1-NOSQL.pptx DIFFERENT TYPES OF NOSQL DATABASES
KusumaS36
 
Relational databases store data in tables
HELLOWorld889594
 
History and Introduction to NoSQL over Traditional Rdbms
vinayh902
 
Database replication
Arslan111
 
Writing Scalable Software in Java
Ruben Badaró
 
Intro to MySQL Master Slave Replication
satejsahu
 
NoSQL and Couchbase
Sangharsh agarwal
 
Introduction to NoSQL Databases and Types of NOSQL Databases.pptx
Abcd463572
 

Recently uploaded (20)

PDF
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
PPTX
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
PDF
Water Industry Process Automation & Control Monthly July 2025
Water Industry Process Automation & Control
 
DOC
MRRS Strength and Durability of Concrete
CivilMythili
 
PDF
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
PDF
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
PPTX
Depth First Search Algorithm in 🧠 DFS in Artificial Intelligence (AI)
rafeeqshaik212002
 
PPTX
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
PPTX
Hashing Introduction , hash functions and techniques
sailajam21
 
PPTX
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
PPTX
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
PPTX
Green Building & Energy Conservation ppt
Sagar Sarangi
 
PPTX
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
PPTX
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
PPTX
Element 7. CHEMICAL AND BIOLOGICAL AGENT.pptx
merrandomohandas
 
PDF
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
PPTX
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
PDF
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
PPTX
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
PPTX
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
Introduction to Productivity and Quality
মোঃ ফুরকান উদ্দিন জুয়েল
 
美国电子版毕业证南卡罗莱纳大学上州分校水印成绩单USC学费发票定做学位证书编号怎么查
Taqyea
 
Water Industry Process Automation & Control Monthly July 2025
Water Industry Process Automation & Control
 
MRRS Strength and Durability of Concrete
CivilMythili
 
Biomechanics of Gait: Engineering Solutions for Rehabilitation (www.kiu.ac.ug)
publication11
 
Viol_Alessandro_Presentazione_prelaurea.pdf
dsecqyvhbowrzxshhf
 
Depth First Search Algorithm in 🧠 DFS in Artificial Intelligence (AI)
rafeeqshaik212002
 
265587293-NFPA 101 Life safety code-PPT-1.pptx
chandermwason
 
Hashing Introduction , hash functions and techniques
sailajam21
 
Element 11. ELECTRICITY safety and hazards
merrandomohandas
 
Arduino Based Gas Leakage Detector Project
CircuitDigest
 
Green Building & Energy Conservation ppt
Sagar Sarangi
 
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
GitOps_Without_K8s_Training_detailed git repository
DanialHabibi2
 
Element 7. CHEMICAL AND BIOLOGICAL AGENT.pptx
merrandomohandas
 
AI TECHNIQUES FOR IDENTIFYING ALTERATIONS IN THE HUMAN GUT MICROBIOME IN MULT...
vidyalalltv1
 
Lecture 1 Shell and Tube Heat exchanger-1.pptx
mailforillegalwork
 
GTU Civil Engineering All Semester Syllabus.pdf
Vimal Bhojani
 
fatigue in aircraft structures-221113192308-0ad6dc8c.pptx
aviatecofficial
 
The Role of Information Technology in Environmental Protectio....pptx
nallamillisriram
 
Ad

NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS

  • 2. Distribution Models • As data volumes increase, it becomes more difficult and expensive to scale up—buy a bigger server to run the database on. • A more appealing option is to scale out—run the database on a cluster of servers. • Aggregate orientation fits well with scaling out because the aggregate is a natural unit to use for distribution.
  • 3. Advantages of Distribution model: • Give ability to handle larger quantities of data • Give ability to process a greater read or write traffic • Offer more availability in the face of network slowdowns or breakages Disadvantages of Distribution model: • Above important benefits come at a cost • Running over a cluster introduces complexity
  • 4. Single Server • The first and the simplest distribution option is no distribution at all. • Run the database on a single machine that handles all the reads and writes to the data store. • It eliminates all the complexities that the other options introduce • It’s easy for operations people to manage and easy for application developers to reason about. • Although a lot of NoSQL databases are designed around the idea of running on a cluster, it can make sense to use NoSQL with a single-server
  • 5. When to use Single Server distribution model: • Graph databases are the obvious category here—these work best in a single-server configuration. • If your data usage is mostly about processing aggregates, then a single-server document or key-value store may well be worthwhile because it’s easier on application developers.
  • 6. Sharding • Often, a busy data store is busy because different people are accessing different parts of the dataset. In these circumstances we can support horizontal scalability by putting different parts of the data onto different servers—a technique that’s called sharding. Fig: Sharding puts different data on separate nodes, each of which does its own reads and writes.
  • 7. • In the ideal case, we have different users all talking to different server nodes. Each user only has to talk to one server, so gets rapid responses from that server. The load is balanced out nicely between servers—for example, if we have ten servers, each one only has to handle 10% of the load. • In order to get close to ideal case we have to ensure that data that’s accessed together is clumped together on the same node and that these clumps are arranged on the nodes to provide the best data access. • Data should be clump up such that one user mostly gets her data from a single server. This is where aggregate orientation comes in really handy. Aggregates designed to combine data that’s commonly accessed together—so aggregates leap out as an obvious unit of distribution.
  • 8. • While arranging the data on the nodes, there are several factors that can help to improve performance. • If most accesses of certain aggregates are based on a physical location, place the data close to where it’s being accessed. • Example: If you have orders for someone who lives in Boston, you can place that data in your eastern US data center. • Another factor is trying to keep the load even. Try to arrange aggregates so they are evenly distributed across the nodes which all get equal amounts of the load. This may vary over time. • Example: if some data tends to be accessed on certain
  • 9. • In some cases, it’s useful to put aggregates together if you think they may be read in sequence. • Historically most people have done sharding as part of application logic. You might put all customers with surnames starting from A to D on one shard and E to G on another. This complicates the programming model, as application code needs to ensure that queries are distributed across the various shards. • Furthermore, rebalancing the sharding means changing the application code and migrating the data. Many NoSQL databases offer auto-sharding, where the database takes on the responsibility of allocating data to shards and ensuring that data access goes to the right shard. This can make it much easier to use sharding in an application.
  • 10. • Sharding is particularly valuable for performance because it can improve both read and write performance. • Using replication, particularly with caching, can greatly improve read performance but does little for applications that have a lot of writes. Sharding provides a way to horizontally scale writes. • Sharding does little to improve resilience when used alone. Although the data is on different nodes, a node failure makes that shard’s data unavailable just as surely as it does for a single-server solution. • The resilience benefit it does provide is that only the users of the data on that shard will suffer; however, it’s not good to have a database with part of its data missing. • With a single server it’s easier to pay the effort and cost to keep that server up and running; clusters usually try to use less reliable machines, and you’re more likely to get a node failure. So in practice, sharding alone is likely to decrease resilience.
  • 11. • Despite the fact that sharding is made much easier with aggregates, it’s still not a step to be taken lightly. • Some databases are intended from the beginning to use sharding, in which case it’s wise to run them on a cluster from the very beginning of development, and certainly in production. • Other databases use sharding as a deliberate step up from a single-server configuration, in which case it’s best to start single-server and only use sharding once your load projections clearly indicate that you are running out of headroom. • In any case the step from a single node to sharding is going to be tricky. The lesson here is to use sharding well before you need to—when you have enough headroom to carry out the sharding.
  • 12. Master-Slave Replication • With master-slave distribution, you replicate data across multiple nodes. • One node is designated as the master, or primary. This master is the authoritative source for the data and is usually responsible for processing any updates to that data. • The other nodes are slaves, or secondaries. A replication process synchronizes the slaves with the master. Fig: Data is replicated from master to slaves.
  • 13. Advantages: • Scaling: Master-slave replication is most helpful for scaling when you have a read-intensive dataset. You can scale horizontally to handle more read requests by adding more slave nodes and ensuring that all read requests are routed to the slaves. • You are still, however, limited by the ability of the master to process updates and its ability to pass those updates on. Consequently it isn’t such a good scheme for datasets with heavy write traffic, although offloading the read traffic will help a bit with handling the write load.
  • 14. • Read resilience: if the master fails, the slaves can still handle read requests. Again, this is useful if most of your data access is reads. The failure of the master does eliminate the ability to handle writes until either the master is restored or a new master is appointed. However, having slaves as replicas of the master does speed up recovery after a failure of the master since a slave can be appointed as the new master very quickly.
  • 15. • All read and write traffic can go to the master while the slave acts as a hot backup. In this case it’s easiest to think of the system as a single-server store with a hot backup. You get the convenience of the single-server configuration but with greater resilience— which is particularly handy if you want to be able to handle server failures gracefully.
  • 16. • Masters can be appointed manually or automatically. • Manual appointing typically means that when you configure your cluster, you configure one node as the master. • With automatic appointment, you create a cluster of nodes and they elect one of themselves to be the master. • Apart from simpler configuration, automatic appointment means that the cluster can automatically appoint a new master when a master fails, reducing downtime.
  • 17. • Replication comes with some attractive benefits, but it also comes with an unavoidable dark side— inconsistency. • You have the danger that different clients, reading different slaves, will see different values because the changes haven’t all propagated to the slaves. • In the worst case, that can mean that a client cannot read a write it just made. • Even if you use master-slave replication just for hot backup this can be a concern, because if the master fails, any updates not passed on to the backup are lost.
  • 18. Peer-to-Peer Replication • Master-slave replication helps with read scalability but doesn’t help with scalability of writes. It provides resilience against failure of a slave, but not of a master. • Essentially, the master is still a bottleneck and a single point of failure. Peer-to-peer replication attacks these problems by not having a master. All the replicas have equal weight, they can all accept writes, and the loss of any of them doesn’t prevent access to the data store. Fig: Peer-to-peer replication has all nodes applying reads and writes to all the data.
  • 19. Advantages: • You can ride over node failures without losing access to data. • You can easily add nodes to improve your performance. Disadvantages: • Inconsistency: When you can write to two different places, you run the risk that two people will attempt to update the same record at the same time—a write-write conflict. Inconsistencies on read lead to problems but at least they are relatively temporary. Inconsistent writes are forever.
  • 20. How to handle inconsistency? • At one end, we can ensure that whenever we write data, the replicas coordinate to ensure we avoid a conflict. We don’t need all the replicas to agree on the write, just a majority, so we can still survive losing a minority of the replica nodes. • At the other extreme, we can decide to manage with an inconsistent write.
  • 21. Combining Sharding and Replication • Replication and sharding are strategies that can be combined. • If we use both master-slave replication and sharding, this means that we have multiple masters, but each data item only has a single master. • Depending on your configuration, you may choose a node to be a master for some data and slaves for others, or you may dedicate nodes for master or slave duties.
  • 22. Fig: Using master-slave replication together with sharding
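One way to picture this combination is a small, hypothetical cluster map in which every shard has exactly one master and the same node can play different roles for different shards; a Python sketch under those assumptions:

cluster = {
    "shard-A": {"master": "node1", "slaves": ["node2", "node3"]},
    "shard-B": {"master": "node2", "slaves": ["node3", "node1"]},
    "shard-C": {"master": "node3", "slaves": ["node1", "node2"]},
}

def node_for(shard: str, operation: str) -> str:
    # Writes must go to the shard's single master; reads may go to any replica.
    roles = cluster[shard]
    if operation == "write":
        return roles["master"]
    return roles["slaves"][0]   # naive read routing; a real system would balance

print(node_for("shard-B", "write"))  # -> node2: master for shard-B, slave for the others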
  • 23. • Using peer-to-peer replication and sharding is a common strategy for column-family databases. • In a scenario like this you might have tens or hundreds of nodes in a cluster with data sharded over them. • A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes. If a node fails, the shards on that node will be rebuilt on the other nodes.
  • 24. Fig: Using peer-to-peer replication together with sharding
  • 25. Key Points There are two styles of distributing data: • Sharding distributes different data across multiple servers, so each server acts as the single source for a subset of data. • Replication copies data across multiple servers, so each bit of data can be found in multiple places. A system may use either or both techniques. Replication comes in two forms: • Master-slave replication makes one node the authoritative copy that handles writes while slaves synchronize with the master and may handle reads. • Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of the data. Master-slave replication reduces the chance of update conflicts but peer-to-peer replication avoids loading all writes onto a single point of failure.
  • 26. Consistency • One of the biggest changes from a centralized relational database to a cluster-oriented NoSQL database is in how you think about consistency. • Relational databases try to exhibit strong consistency by avoiding all the various inconsistencies. • In NoSQL, as soon as you start building something you have to think about what sort of consistency you need for your system.
  • 27. Update Consistency Consider the example of updating a telephone number. • Coincidentally, Martin and Pramod are looking at the company website and notice that the phone number is out of date. • They both have update access, so they both go in at the same time to update the number. • Assume they update it slightly differently, because each uses a slightly different format. • This issue is called a write-write conflict: two people updating the same data item at the same time.
  • 28. • When the writes reach the server, the server will serialize them—decide to apply one, then the other. Let’s assume it uses alphabetical order and picks Martin’s update first, then Pramod’s. • Without any concurrency control, Martin’s update would be applied and immediately overwritten by Pramod’s. • In this case Martin’s is a lost update. Here the lost update is not a big problem, but often it is.
  • 29. Approaches for maintaining consistency: In the face of concurrency, approaches for maintaining consistency are often described as pessimistic or optimistic. • A pessimistic approach works by preventing conflicts from occurring. • An optimistic approach lets conflicts occur, but detects them and takes action to sort them out.
  • 30. • For update conflicts, the most common pessimistic approach is to have write locks, so that in order to change a value you need to acquire a lock, and the system ensures that only one client can get a lock at a time. • So Martin and Pramod would both attempt to acquire the write lock, but only Martin (the first one) would succeed. • Pramod would then see the result of Martin’s write before deciding whether to make his own update.
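As a rough illustration of the pessimistic style, the sketch below uses an in-process lock to stand in for a database write lock; real databases manage such locks on the server, so this is only an analogy:

import threading

phone_lock = threading.Lock()          # one lock guards the phone-number record
record = {"phone": "555-0100"}         # made-up record for illustration

def update_phone(new_value: str) -> None:
    with phone_lock:                   # block until we hold the write lock
        record["phone"] = new_value    # the second writer sees the first write
                                       # before deciding whether to overwrite it

update_phone("+1 555 0100")   # Martin acquires the lock first
update_phone("(555) 0100")    # Pramod waits for the lock, then applies his format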
  • 31. • A common optimistic approach is a conditional update where any client that does an update tests the value just before updating it to see if it’s changed since his last read. • In this case, Martin’s update would succeed but Pramod’s would fail. • The error would let Pramod know that he should look at the value again and decide whether to attempt a further update.
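A minimal sketch of a conditional update in this optimistic style, using an in-memory dictionary as a stand-in for the data store:

db = {"phone": "555-0100"}   # illustrative store

def conditional_update(key: str, expected: str, new_value: str) -> bool:
    # The write only succeeds if the value is still what the client last read.
    if db[key] != expected:
        return False                   # someone changed it since our read
    db[key] = new_value
    return True

last_read = db["phone"]
print(conditional_update("phone", last_read, "+1 555 0100"))   # True: Martin's update wins
print(conditional_update("phone", last_read, "(555) 0100"))    # False: Pramod must re-read first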
  • 32. • Both the pessimistic and optimistic approaches that we’ve just described rely on a consistent serialization of the updates. With a single server, this is obvious—it has to choose one, then the other. • But if there’s more than one server, such as with peer-to-peer replication, then two nodes might apply the updates in a different order, resulting in a different value for the telephone number on each peer. • Often, when people talk about concurrency in distributed systems, they talk about sequential consistency—ensuring that all nodes apply operations in the same order.
  • 33. There is another optimistic way to handle a write-write conflict— • Save both updates and record that they are in conflict • You have to merge the two updates somehow. Maybe you show both values to the user and ask them to sort it out— this is what happens if you update the same contact on your phone and your computer. • Alternatively, the computer may be able to perform the merge itself; if it was a phone formatting issue, it may be able to realize that and apply the new number with the standard format. • Any automated merge of write-write conflicts is highly domain-specific and needs to be programmed for each particular case.
  • 34. • Often, when people first encounter these issues, their reaction is to prefer pessimistic concurrency because they are determined to avoid conflicts. • While in some cases this is the right answer, there is always a tradeoff. • Concurrent programming involves a fundamental compromise between safety (avoiding errors such as update conflicts) and liveness (responding quickly to clients). • Pessimistic approaches often severely degrade the responsiveness of a system to the degree that it becomes unfit for its purpose. • Pessimistic concurrency often leads to deadlocks, which are hard to prevent and debug.
  • 35. • Replication makes it much more likely to run into write-write conflicts. If different nodes have different copies of some data which can be independently updated, then you’ll get conflicts unless you take specific measures to avoid them. • Using a single node as the target for all writes for some data makes it much easier to maintain update consistency. Of the distribution models we discussed earlier, all but peer-to-peer replication do this.
  • 36. Read Consistency • Having a data store that maintains update consistency is one thing, but it doesn’t guarantee that readers of that data store will always get consistent responses to their requests. • Let’s imagine we have an order with line items and a shipping charge. The shipping charge is calculated based on the line items in the order. • If we add a line item, we thus also need to recalculate and update the shipping charge. • In a relational database, the shipping charge and line items will be in separate tables. • The danger of inconsistency is that Martin adds a line item to his order, Pramod then reads the line items and shipping charge, and then Martin updates the shipping charge. This is an inconsistent read or read-write conflict: in the figure, Pramod has done a read in the middle of Martin’s write.
  • 37. Fig: A read-write conflict in logical consistency
  • 38. • We refer to this type of consistency as logical consistency: ensuring that different data items make sense together. • To avoid a logically inconsistent read-write conflict, relational databases support the notion of transactions. Providing Martin wraps his two writes in a transaction, the system guarantees that Pramod will either read both data items before the update or both after the update. • A common claim we hear is that NoSQL databases don’t support transactions and thus can’t be consistent; such a claim is mostly wrong.
  • 39. Clarification of why such claim is wrong: • Any statement about lack of transactions usually only applies to some NoSQL databases, in particular the aggregate-oriented ones. In contrast, graph databases tend to support ACID transactions just the same as relational databases. • Aggregate-oriented databases do support atomic updates, but only within a single aggregate. This means that you will have logical consistency within an aggregate but not between aggregates. So in the example, you could avoid running into that inconsistency if the order, the delivery charge, and the line items are all part of a single order aggregate.
  • 40. • Of course not all data can be put in the same aggregate, so any update that affects multiple aggregates leaves open a time when clients could perform an inconsistent read. The length of time an inconsistency is present is called the inconsistency window. • A NoSQL system may have a quite short inconsistency window: Amazon’s documentation says that the inconsistency window for its SimpleDB service is usually less than a second.
  • 41. • Once you introduce replication, however, you get a whole new kind of inconsistency. Let’s imagine there’s one last hotel room for a desirable event. • The hotel reservation system runs on many nodes. Martin and Cindy are a couple considering this room, but they are discussing this on the phone because Martin is in London and Cindy is in Boston. • Meanwhile Pramod, who is in Mumbai, goes and books that last room. That updates the replicated room availability, but the update gets to Boston quicker than it gets to London.
  • 42. When Martin and Cindy fire up their browsers to see if the room is available, Cindy sees it booked and Martin sees it free. This is another inconsistent read—this form of consistency we call replication consistency: ensuring that the same data item has the same value when read from different replicas (see Figure). Figure: An example of replication inconsistency
  • 43. • Eventually, of course, the updates will propagate fully, and Martin will see the room is fully booked. Therefore this situation is generally referred to as eventually consistent, meaning that at any time nodes may have replication inconsistencies but, if there are no further updates, eventually all nodes will be updated to the same value. Data that is out of date is generally referred to as stale. • Although replication consistency is independent from logical consistency, replication can make a logical inconsistency worse by lengthening its inconsistency window. Two different updates on the master may be performed in rapid succession, leaving an inconsistency window of milliseconds. But delays in networking could mean that the same inconsistency window lasts for much longer on a slave. • You can usually specify the level of consistency you want with individual requests. This allows you to use weak consistency most of the time when it isn’t an issue, but request strong consistency when it is.
  • 44. • Consider the example of posting comments on a blog entry. Few people are going to worry about inconsistency windows of even a few minutes while people are typing in their latest thoughts. • Often, systems handle the load of such sites by running on a cluster and load-balancing incoming requests to different nodes. • Therein lies a danger: You may post a message using one node, then refresh your browser, but the refresh goes to a different node which hasn’t received your post yet—and it looks like your post was lost.
  • 45. • In situations like this, you can tolerate reasonably long inconsistency windows, but you need read-your-writes consistency, which means that, once you’ve made an update, you’re guaranteed to continue seeing that update. • One way to get this in an otherwise eventually consistent system is to provide session consistency: within a user’s session there is read-your-writes consistency. This does mean that the user may lose that consistency should their session end for some reason or should the user access the same system simultaneously from different computers, but these cases are relatively rare.
  • 46. Techniques to provide session consistency • A common way, and often the easiest way, is to have a sticky session: a session that’s tied to one node (this is also called session affinity). A sticky session allows you to ensure that as long as you keep read-your-writes consistency on a node, you’ll get it for sessions too. The downside is that sticky sessions reduce the ability of the load balancer to do its job. • Use version stamps and ensure every interaction with the data store includes the latest version stamp seen by a session. The server node must then ensure that it has the updates that include that version stamp before responding to a request.
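A small sketch of the version-stamp approach to session consistency; the Replica class and its fields are hypothetical stand-ins for a real server node:

class Replica:
    def __init__(self, version=0, data=None):
        self.version = version          # highest version stamp applied so far
        self.data = data or {}

def read_your_writes(replica, key, session_version):
    # Serve the read only if this replica has caught up to the version stamp
    # the session last saw; otherwise route the client to a fresher node.
    if replica.version < session_version:
        raise RuntimeError("replica is behind this session; try another node")
    return replica.data.get(key)

replica = Replica(version=7, data={"comment": "first post"})
print(read_your_writes(replica, "comment", session_version=7))   # served: replica has caught up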
  • 47. Relaxing Consistency • Consistency is a Good Thing—but, sadly, sometimes we have to sacrifice it. It is always possible to design a system to avoid inconsistencies, but often impossible to do so without making unbearable sacrifices in other characteristics of the system. • As a result, we often have to compromise consistency for something else. While some architects see this as a disaster, we see it as part of the unavoidable compromises involved in system design. • Furthermore, different domains have different tolerances for inconsistency, and we need to take this tolerance into account as we make our decisions.
  • 48. • Compromising consistency is a familiar concept even in single-server relational database systems. Here, our principal tool to enforce consistency is the transaction, and transactions can provide strong consistency guarantees. • However, transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn’t been committed yet, and in practice we see most applications relax consistency down from the highest isolation level (serializable) in order to get effective performance. • We most commonly see people using the read-committed transaction level, which eliminates some read-write conflicts but allows others.
  • 49. • Many systems go without transactions entirely because the performance impact of transactions is too high. • On a small scale, we saw the popularity of MySQL during the days when it didn’t support transactions. Many websites liked the high speed of MySQL and were prepared to live without transactions. • At the other end of the scale, some very large websites, such as eBay, have to go without transactions in order to perform acceptably—this is particularly true when you need to introduce sharding.
  • 50. The CAP Theorem • In the NoSQL world, the CAP theorem is often cited as a reason why you may need to relax consistency. • The basic statement of the CAP theorem is that, given the three properties of Consistency, Availability, and Partition tolerance, you can only get two. Obviously this depends very much on how you define these three properties. • Consistency means that the data is the same across the cluster, so you can read or write from/to any node and get the same data. • Availability has a particular meaning in the context of CAP—it means that if you can talk to a node in the cluster, it can read and write data. • Partition tolerance means that the cluster can survive communication breakages that separate the cluster into multiple partitions unable to communicate with each other.
  • 51. Figure: With two breaks in the communication lines, the network partitions into two groups.
  • 52. • A single-server system is the obvious example of a CA system—a system that has Consistency and Availability but not Partition tolerance. • A single machine can’t partition, so it does not have to worry about partition tolerance. There’s only one node—so if it’s up, it’s available. Being up and keeping consistency is reasonable. • It is theoretically possible to have a CA cluster. However, this would mean that if a partition ever occurs in the cluster, all the nodes in the cluster would go down so that no client can talk to a node. • By the usual definition of “available,” this would mean a lack of availability, but this is where CAP’s special usage of “availability” gets confusing. CAP defines “availability” to mean “every request received by a non-failing node in the system must result in a response.” So a failed, unresponsive node doesn’t imply a lack of CAP availability.
  • 53. • This does imply that you can build a CA cluster, but you have to ensure it will only partition rarely. • So clusters have to be tolerant of network partitions. And here is the real point of the CAP theorem. • Although the CAP theorem is often stated as “you can only get two out of three,” in practice what it’s saying is that in a system that may suffer partitions, as distributed systems do, you have to trade off consistency versus availability. • Often, you can compromise a little consistency to get some availability. The resulting system would be neither perfectly consistent nor perfectly available—but would have a combination that is reasonable for your particular needs.
  • 54. • Example: Martin and Pramod are both trying to book the last hotel room on a system that uses peer-to-peer distribution with two nodes (London for Martin and Mumbai for Pramod). • If we want to ensure consistency, then when Martin tries to book his room on the London node, that node must communicate with the Mumbai node before confirming the booking. Essentially, both nodes must agree on the serialization of their requests. This gives us consistency—but if the network link breaks, then neither system can book any hotel room, sacrificing availability. • One way to improve availability is to designate one node as the master for a particular hotel and ensure all bookings are processed by that master. If that master is in Mumbai, then Mumbai can still process hotel bookings for that hotel and Pramod will get the last room. • If we use master-slave replication, London users can see the inconsistent room information but cannot make a booking and thus cause an update inconsistency.
  • 55. • We still can’t book a room on the London node for the hotel whose master is in Mumbai if the connection goes down. • In CAP terminology, this is a failure of availability in that Martin can talk to the London node but the London node cannot update the data. • To gain more availability, we might allow both systems to keep accepting hotel reservations even if the network link breaks down. The danger here is that Martin and Pramod both book the last hotel room. • However, depending on how this hotel operates, that may be fine. Often, travel companies tolerate a certain amount of overbooking in order to cope with no-shows. • Conversely, some hotels always keep a few rooms clear even when they are fully booked, in order to be able to swap a guest out of a room with problems or to accommodate a high-status late booking. • Some might even cancel the booking with an apology once they detect the conflict—reasoning that the cost of doing so is less than the cost of losing bookings by refusing reservations whenever the network has problems.
  • 56. • The classic example of allowing inconsistent writes is the shopping cart, as discussed in Amazon’s Dynamo. • In this case you are always allowed to write to your shopping cart, even if network failures mean you end up with multiple shopping carts. The checkout process can merge the two shopping carts by putting the union of the items from the carts into a single cart and returning that. • Almost always that’s the correct answer—but if not, the user gets the opportunity to look at the cart before completing the order. • The lesson here is that although most software developers treat update consistency as The Way Things Must Be, there are cases where you can deal gracefully with inconsistent answers to requests. • If you can find a way to handle inconsistent updates, this gives you more options to increase availability and performance. For a shopping cart, it means that shoppers can always shop, and do so quickly.
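The cart-merge rule is easy to sketch: take the union of the item sets from the divergent carts (the SKU identifiers below are made up):

cart_on_node_a = {"sku-123", "sku-456"}   # cart as seen by one node
cart_on_node_b = {"sku-456", "sku-789"}   # cart as seen by another node after a partition

merged_cart = cart_on_node_a | cart_on_node_b   # the union is almost always the right answer
print(sorted(merged_cart))   # ['sku-123', 'sku-456', 'sku-789']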
  • 57. • A similar logic applies to read consistency. If you are trading financial instruments over a computerized exchange, you may not be able to tolerate any data that isn’t right up to date. However, if you are posting a news item to a media website, you may be able to tolerate old pages for minutes. • Different data items may have different tolerances for staleness, and thus may need different settings in your replication configuration. • Promoters of NoSQL often say that instead of following the ACID properties of relational transactions, NoSQL systems follow the BASE properties (Basically Available, Soft state, Eventual consistency).
  • 58. • It’s usually better to think not about the tradeoff between consistency and availability but rather between consistency and latency (response time). • We can improve consistency by getting more nodes involved in the interaction, but each node we add increases the response time of that interaction. • We can then think of availability as the limit of latency that we’re prepared to tolerate; once latency gets too high, we give up and treat the data as unavailable—which neatly fits its definition in the context of CAP.
  • 59. Relaxing Durability • Most people would laugh at relaxing durability—after all, what is the point of a data store if it can lose updates? • There are cases where you may want to trade off some durability for higher performance. • If a database can run mostly in memory, apply updates to its in-memory representation, and periodically flush changes to disk, then it may be able to provide considerably higher responsiveness to requests. The cost is that, if the server crashes, any updates since the last flush will be lost.
  • 60. • One example of where this tradeoff may be meaningful is storing user-session state. • A big website may have many users and keep temporary information about what each user is doing in some kind of session state. There’s a lot of activity on this state, creating lots of demand, which affects the responsiveness of the website. • The vital point is that losing the session data isn’t too much of a tragedy—it will create some annoyance, but may be less than a slower website would cause. This makes it a good candidate for nondurable writes. • Often, you can specify the durability needs on a call-by-call basis, so that more important updates can force a flush to disk.
  • 61. • Another class of durability tradeoffs comes up with replicated data. A failure of replication durability occurs when a node processes an update but fails before that update is replicated to the other nodes. • A simple case of this may happen if you have a master-slave distribution model where the slaves appoint a new master automatically if the existing master fails. If a master does fail, any writes not passed on to the replicas will effectively become lost. • If the master comes back online, those updates will conflict with updates that have happened since. We think of this as a durability problem because you think your update has succeeded since the master acknowledged it, but a master node failure caused it to be lost.
  • 62. • You can improve replication durability by ensuring that the master waits for some replicas to acknowledge the update before the master acknowledges it to the client. • Obviously, however, that will slow down updates and make the cluster unavailable if slaves fail—so, again, we have a tradeoff, depending upon how vital durability is. • As with basic durability, it’s useful for individual calls to indicate what level of durability they need.
  • 63. Quorums • When you’re trading off consistency or durability, it’s not an all-or-nothing proposition. • The more nodes you involve in a request, the higher the chance of avoiding an inconsistency. • This naturally leads to the question: How many nodes need to be involved to get strong consistency?
  • 64. Write quorum • Imagine some data replicated over three nodes. You don’t need all nodes to acknowledge a write to ensure strong consistency; all you need is two of them—a majority. • If you have conflicting writes, only one can get a majority. This is referred to as a write quorum. • It is expressed in a slightly pretentious inequality as W > N/2. • It means the number of nodes participating in the write (W) must be more than half the number of nodes involved in replication (N). • The number of replicas is often called the replication factor.
  • 65. Read quorum • Similarly to the write quorum, there is the notion of read quorum: How many nodes you need to contact to be sure you have the most up-to-date change. • The read quorum is a bit more complicated because it depends on how many nodes need to confirm a write.
  • 66. • Let’s consider a replication factor of 3. • If all writes need two nodes to confirm (W = 2) then we need to contact at least two nodes to be sure we’ll get the latest data. • If, however, writes are only confirmed by a single node (W = 1) we need to talk to all three nodes to be sure we have the latest updates. • In this case, since we don’t have a write quorum, we may have an update conflict, but by contacting enough readers we can be sure to detect it. Thus we can get strongly consistent reads even if we don’t have strong consistency on our writes. • This relationship between the number of nodes you need to contact for a read (R), those confirming a write (W), and the replication factor (N) can be captured in an inequality: You can have a strongly consistent read if R + W > N.
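These inequalities are simple enough to check mechanically; a small, purely illustrative helper:

def write_quorum_ok(w: int, n: int) -> bool:
    # W > N/2: two conflicting writes cannot both reach a majority.
    return w > n / 2

def read_is_strongly_consistent(r: int, w: int, n: int) -> bool:
    # R + W > N: every read set overlaps the most recent write set.
    return r + w > n

n = 3
print(write_quorum_ok(2, n))                  # True: 2 of 3 nodes is a majority
print(read_is_strongly_consistent(2, 2, n))   # True: R = 2, W = 2, N = 3
print(read_is_strongly_consistent(1, 1, n))   # False: a one-node read can miss a one-node write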
  • 67. • These inequalities are written with a peer-to-peer distribution model in mind. If you have a master-slave distribution, you only have to write to the master to avoid write-write conflicts, and similarly only read from the master to avoid read-write conflicts. • With this notation, it is common to confuse the number of nodes in the cluster with the replication factor, but these are often different. • I may have 100 nodes in my cluster, but only have a replication factor of 3, with most of the data stored on only three of those nodes.
  • 68. • Indeed most authorities suggest that a replication factor of 3 is enough to have good resilience. This allows a single node to fail while still maintaining quora for reads and writes. If you have automatic rebalancing, it won’t take too long for the cluster to create a third replica, so the chances of losing a second replica before a replacement comes up are slight. • The number of nodes participating in an operation can vary with the operation. • When writing, we might require quorum for some types of updates but not others, depending on how much we value consistency and availability. • Similarly, a read that needs speed but can tolerate staleness should contact fewer nodes.
  • 69. • Often you may need to take both into account. If you need fast, strongly consistent reads, you could require writes to be acknowledged by all the nodes, thus allowing reads to contact only one (N = 3, W = 3, R = 1). • That would mean that your writes are slow, since they have to contact all three nodes, and you would not be able to tolerate losing a node. But in some circumstances that may be the tradeoff to make. • The point to all of this is that you have a range of options to work with and can choose which combination of problems and advantages to prefer.
  • 70. Key Points • Write-write conflicts occur when two clients try to write the same data at the same time. Read-write conflicts occur when one client reads inconsistent data in the middle of another client’s write. • Pessimistic approaches lock data records to prevent conflicts. Optimistic approaches detect conflicts and fix them. • Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
  • 71. • Clients usually want read-your-writes consistency, which means a client can write and then immediately read the new value. This can be difficult if the read and the write happen on different nodes. • To get good consistency, you need to involve many nodes in data operations, but this increases latency. So you often have to trade off consistency versus latency. • The CAP theorem states that if you get a network partition, you have to trade off availability of data versus consistency. • Durability can also be traded off against latency, particularly if you want to survive failures with replicated data. • You do not need to contact all replicas to preserve strong consistency with replication; you just need a large enough quorum.
  • 72. Version Stamps • Many opponents of NoSQL databases focus on the lack of support for transactions. Transactions are a useful tool that helps programmers support consistency. • One reason why many NoSQL proponents worry less about a lack of transactions is that aggregate-oriented NoSQL databases do support atomic updates within an aggregate— and aggregates are designed so that their data forms a natural unit of update. • That said, it’s true that transactional needs are something to take into account when you decide what database to use.
  • 73. • As part of this, it’s important to remember that transactions have limitations. • Even within a transactional system we still have to deal with updates that require human intervention and usually cannot be run within transactions because they would involve holding a transaction open for too long. • We can cope with these using version stamps—which turn out to be handy in other situations as well, particularly as we move away from the single-server distribution model.
  • 74. Business and System Transactions • The need to support update consistency without transactions is actually a common feature of systems even when they are built on top of transactional databases. When users think about transactions, they usually mean business transactions. • A business transaction may be something like browsing a product catalog, choosing a bottle of Cold drink at a good price, filling in credit card information, and confirming the order. • Yet all of this usually won’t occur within the system transaction provided by the database because this would mean locking the database elements while the user is trying to find their credit card and gets called off to lunch by their colleagues.
  • 75. • Usually applications only begin a system transaction at the end of the interaction with the user, so that the locks are only held for a short period of time. • The problem, however, is that calculations and decisions may have been made based on data that’s changed. • The price list may have updated the price of the Cold drink bottle, or someone may have updated the customer’s address, changing the shipping charges.
  • 76. • The broad techniques for handling this are offline concurrency, useful in NoSQL situations too. A particularly useful approach is the Optimistic Offline Lock, a form of conditional update where a client operation rereads any information that the business transaction relies on and checks that it hasn’t changed since it was originally read and displayed to the user. • A good way of doing this is to ensure that records in the database contain some form of version stamp: a field that changes every time the underlying data in the record changes. • When you read the data you keep a note of the version stamp, so that when you write data you can check to see if the version has changed.
  • 77. • You may have come across this technique with updating resources with HTTP. One way of doing this is to use etags. Whenever you get a resource, the server responds with an etag in the header. • This etag is an opaque string that indicates the version of the resource. If you then update that resource, you can use a conditional update by supplying the etag that you got from your last GET. • If the resource has changed on the server, the etags won’t match and the server will refuse the update, returning a 412 (Precondition Failed) response.
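A sketch of that interaction using Python’s requests library; the URL is hypothetical, but the If-Match header and the 412 response are standard HTTP:

import requests

URL = "https://example.com/api/customers/42"   # hypothetical resource

response = requests.get(URL)                   # read the resource...
etag = response.headers.get("ETag")            # ...and note the version the server reports

if etag is not None:
    update = requests.put(
        URL,
        json={"phone": "+1 555 0100"},
        headers={"If-Match": etag},            # only apply if the version is unchanged
    )
    if update.status_code == 412:              # Precondition Failed: someone updated it first
        print("Stale version: re-read the resource before deciding whether to retry.")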
  • 78. • Some databases provide a similar mechanism of conditional update that allows you to ensure updates won’t be based on stale data. • You can do this check yourself, although you then have to ensure no other thread can run against the resource between your read and your update. Sometimes this is called a compare-and-set (CAS) operation, whose name comes from the CAS operations done in processors. • The difference is that a processor CAS compares a value before setting it, while a database conditional update compares a version stamp of the value.
  • 79. • There are various ways you can construct your version stamps. You can use a counter, always incrementing it when you update the resource. Counters are useful since they make it easy to tell if one version is more recent than another. On the other hand, they require the server to generate the counter value, and also need a single master to ensure the counters aren’t duplicated. • Another approach is to create a GUID, a large random number that’s guaranteed to be unique. These use some combination of dates, hardware information, and whatever other sources of randomness they can pick up. The nice thing about GUIDs is that they can be generated by anyone and you’ll never get a duplicate; a disadvantage is that they are large and can’t be compared directly for recentness.
  • 80. • A third approach is to make a hash of the contents of the resource. With a big enough hash key size, a content hash can be globally unique like a GUID and can also be generated by anyone. • The advantage is that they are deterministic—any node will generate the same content hash for the same resource data. • However, like GUIDs they can’t be directly compared for recentness, and they can be lengthy.
  • 81. • A fourth approach is to use the timestamp of the last update. Like counters, they are reasonably short and can be directly compared for recentness, yet have the advantage of not needing a single master. • Multiple machines can generate timestamps—but to work properly, their clocks have to be kept in sync. • One node with a bad clock can cause all sorts of data corruptions. There’s also a danger that if the timestamp is too granular you can get duplicates— it’s no good using timestamps of a millisecond precision if you get many updates per millisecond.
  • 82. • You can blend the advantages of these different version stamp schemes by using more than one of them to create a composite stamp. For example, CouchDB uses a combination of counter and content hash. • Version stamps are also useful for providing session consistency.
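The four schemes, and a composite of two of them, can be sketched with standard-library tools; this is only an illustration of the ideas, not any particular database’s stamp format:

import hashlib, json, time, uuid

record = {"phone": "+1 555 0100"}   # illustrative record

counter_stamp = 43                         # 1. counter: short and comparable, but needs a single issuer
guid_stamp = uuid.uuid4().hex              # 2. GUID: anyone can generate it, but says nothing about recentness
content_stamp = hashlib.sha256(            # 3. content hash: deterministic, same data -> same stamp on any node
    json.dumps(record, sort_keys=True).encode()
).hexdigest()
time_stamp = time.time()                   # 4. timestamp: comparable and decentralized, but needs synced clocks

composite_stamp = f"{counter_stamp}-{content_stamp[:8]}"   # a counter-plus-content-hash blend, in the spirit described above
print(composite_stamp)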
  • 83. Version Stamps on Multiple Nodes • The basic version stamp works well when you have a single authoritative source for data, such as a single server or master-slave replication. In that case the version stamp is controlled by the master. Any slaves follow the master’s stamps. • But this system has to be enhanced in a peer-to-peer distribution model because there’s no longer a single place to set the version stamps.
  • 84. • If you’re asking two nodes for some data, you run into the chance that they may give you different answers. If this happens, your reaction may vary depending on the cause of that difference. • It may be that an update has only reached one node but not the other, in which case you can accept the latest (assuming you can tell which one that is). • Alternatively, you may have run into an inconsistent update, in which case you need to decide how to deal with that. In this situation, a simple GUID or etag won’t suffice, since these don’t tell you enough about the relationships.
  • 85. • The simplest form of version stamp is a counter. Each time a node updates the data, it increments the counter and puts the value of the counter into the version stamp. • If you have blue and green slave replicas of a single master, and the blue node answers with a version stamp of 4 and the green node with 6, you know that the green’s answer is more recent.
  • 86. • In multiple-master cases, we need something fancier. • One approach, used by distributed version control systems, is to ensure that all nodes contain a history of version stamps. That way you can see if the blue node’s answer is an ancestor of the green’s answer. • This would either require the clients to hold onto version stamp histories, or the server nodes to keep version stamp histories and include them when asked for data. • Although version control systems keep these kinds of histories, they aren’t found in NoSQL databases.
  • 87. • A simple but problematic approach is to use timestamps. The main problem here is that it’s usually difficult to ensure that all the nodes have a consistent notion of time, particularly if updates can happen rapidly. • Should a node’s clock get out of sync, it can cause all sorts of trouble. In addition, you can’t detect write-write conflicts with timestamps, so it would only work well for the single master case—and then a counter is usually better.
  • 88. • The most common approach used by peer-to-peer NoSQL systems is a special form of version stamp which we call a vector stamp. In essence, a vector stamp is a set of counters, one for each node. • A vector stamp for three nodes (blue, green, black) would look something like [blue: 43, green: 54, black: 12]. Each time a node has an internal update, it updates its own counter, so an update in the green node would change the vector to [blue: 43, green: 55, black: 12]. • Whenever two nodes communicate, they synchronize their vector stamps.
  • 89. • By using this scheme you can tell if one version stamp is newer than another because the newer stamp will have all its counters greater than or equal to those in the older stamp. • So [blue: 1, green: 2, black: 5] is newer than [blue: 1, green: 1, black: 5] since one of its counters is greater. • If both stamps have a counter greater than the other, e.g. [blue: 1, green: 2, black: 5] and [blue: 2, green: 1, black: 5], then you have a write-write conflict.
  • 90. • There may be missing values in the vector, in which case we treat the missing value as 0. So [blue: 6, black: 2] would be treated as [blue: 6, green: 0, black: 2]. This allows you to easily add new nodes without invalidating the existing vector stamps. • Vector stamps are a valuable tool that spots inconsistencies but doesn’t resolve them. Any conflict resolution will depend on the domain you are working in. This is part of the consistency/latency tradeoff. • You either have to live with the fact that network partitions may make your system unavailable, or you have to detect and deal with inconsistencies.
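A sketch of how two vector stamps might be compared, treating missing counters as zero; the function and its return labels are our own illustration, not any database’s API:

def compare_vector_stamps(a: dict, b: dict) -> str:
    """Return 'a_newer', 'b_newer', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"            # each stamp is ahead somewhere: a write-write conflict
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

print(compare_vector_stamps({"blue": 1, "green": 2, "black": 5},
                            {"blue": 1, "green": 1, "black": 5}))   # a_newer
print(compare_vector_stamps({"blue": 1, "green": 2, "black": 5},
                            {"blue": 2, "green": 1, "black": 5}))   # conflict
print(compare_vector_stamps({"blue": 6, "black": 2},
                            {"blue": 6, "green": 0, "black": 2}))   # equal: missing green counts as 0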
  • 91. Key Points • Version stamps help you detect concurrency conflicts. When you read data, then update it, you can check the version stamp to ensure nobody updated the data between your read and write. • Version stamps can be implemented using counters, GUIDs, content hashes, timestamps, or a combination of these. • With distributed systems, a vector of version stamps allows you to detect when different nodes have conflicting updates.