NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS

Distribution Models
• As data volumes increase, it becomes more difficult
and expensive to scale up—buy a bigger server to
run the database on.
• A more appealing option is to scale out—run the
database on a cluster of servers.
• Aggregate orientation fits well with scaling out
because the aggregate is a natural unit to use for
distribution.

Advantages of Distribution model:
• Give ability to handle larger quantities of data
• Give ability to process a greater read or write
traffic
• Offer more availability in the face of network
slowdowns or breakages
Disadvantages of Distribution model:
• Above important benefits come at a cost
• Running over a cluster introduces complexity

Single Server
• The first and the simplest distribution option is
no distribution at all.
• Run the database on a
single machine that handles all
the reads and writes to the data store.
• It eliminates all the complexities that the other
options introduce
• It’s easy for operations people to manage and easy
for application developers to reason about.
• Although a lot of NoSQL databases are designed
around the idea of running on a cluster, it can make
sense to use NoSQL with a single-server

When to use Single Server distribution model:
• Graph databases are the obvious category
here—these work best in a single-server
configuration.
• If your data usage is mostly about
processing aggregates, then a single-server
document or key-value store may well be
worthwhile because it’s easier on application
developers.

Sharding
• Often, a busy data store is busy because different
people are accessing different parts of the dataset. In
these circumstances we can support horizontal
scalability by putting different parts of the data onto
different servers—a technique that’s called sharding.
Fig: Sharding puts different data on separate nodes,
each of which does its own reads and writes.

• In the ideal case, we have different users all talking to
different server nodes. Each user only has to talk to one
server, so gets rapid responses from that server. The load
is balanced out nicely between servers—for example, if
we have ten servers, each one only has to handle 10%
of the load.
• In order to get close to ideal case we have to ensure that
data that’s accessed together is clumped together on
the same node and that these clumps are arranged on
the nodes to provide the best data access.
• Data should be clump up such that one user mostly
gets her data from a single server. This is where
aggregate orientation comes in really handy. Aggregates
designed to combine data that’s commonly accessed
together—so aggregates leap out as an obvious unit of
distribution.

• While arranging the data on the nodes, there are
several factors that can help to improve performance.
• If most accesses of certain aggregates are based on a
physical location, place the data close to where it’s being
accessed.
• Example: If you have orders for someone who lives in
Boston, you can place that data in your eastern US data
center.
• Another factor is trying to keep the load even. Try to
arrange aggregates so they are evenly distributed across
the nodes which all get equal amounts of the load. This
may vary over time.
• Example: if some data tends to be accessed on certain

• In some cases, it’s useful to put aggregates together if you
think they may be read in sequence.
• Historically most people have done sharding as part of
application logic. You might put all customers with
surnames starting from A to D on one shard and E to G on
another. This complicates the programming model, as
application code needs to ensure that queries are
distributed across the various shards.
• Furthermore, rebalancing the sharding means changing
the application code and migrating the data. Many
NoSQL databases offer auto-sharding, where the
database takes on the responsibility of allocating data
to shards and ensuring that data access goes to the right
shard. This can make it much easier to use sharding in an
application.

• Sharding is particularly valuable for performance because
it can improve both read and write performance.
• Using replication, particularly with caching, can greatly
improve read performance but does little for
applications that have a lot of writes. Sharding provides
a way to horizontally scale writes.
• Sharding does little to improve resilience when used
alone. Although the data is on different nodes, a node
failure makes that shard’s data unavailable just as
surely as it does for a single-server solution.
• The resilience benefit it does provide is that only the users
of the data on that shard will suffer; however, it’s not
good to have a database with part of its data missing.
• With a single server it’s easier to pay the effort and
cost to keep that server up and running; clusters
usually try to use less reliable machines, and you’re
more likely to get a node failure. So in practice,

• Despite the fact that sharding is made much
easier with aggregates, it’s still not a step to be
taken lightly.
• Some databases are intended from the beginning to
use sharding, in which case it’s wise to run them
on a cluster from the very beginning of
development, and certainly in production.
• Other databases use sharding as a deliberate step up
from a single-server configuration, in which case it’s
best to start single-server and only use sharding
once your load projections clearly indicate that you
are running out of headroom.
• In any case the step from a single node to sharding is
going to be tricky. The lesson here is to use
sharding well before you need to—when you have
enough headroom to carry out the sharding.

Master-Slave Replication
• With master-slave distribution, you replicate data across multiple
nodes.
• One node is designated as the master, or primary. This master is
the authoritative source for the data and is usually responsible
for processing any updates to that data.
• The other nodes are slaves, or secondaries. A replication process
synchronizes the slaves with the master
Fig: Data is replicated from master to slaves.

Advantages:
• Scaling: Master-slave replication is most helpful
for scaling when you have a read-intensive
dataset. You can scale horizontally to handle
more read requests by adding more slave nodes
and ensuring that all read requests are routed to
the slaves.
• You are still, however, limited by the ability of the
master to process updates and its ability to pass
those updates on. Consequently it isn’t such a
good scheme for datasets with heavy write
traffic, although offloading the read traffic will
help a bit with handling the write load.

• Read resilience: if the master fail, the slaves
can still handle read requests. Again, this is
useful if most of your data access is reads. The
failure of the master does eliminate the ability to
handle writes until either the master is restored or
a new master is appointed. However, having
slaves as replicates of the master does speed
up recovery after a failure of the master since
a slave can be appointed a new master very
quickly.

• All read and write traffic can go to the master
while the slave acts as a hot backup. In this
case it’s easiest to think of the system as a
single-server store with a hot backup. You get
the convenience of the single-server
configuration but with greater resilience—
which is particularly handy if you want to be able
to handle server failures gracefully.

• Masters can be appointed
manually or
automatically.
• Manual appointing typically means that when you
configure your cluster, you configure one node as the
master.
• With automatic appointment, you create a cluster of
nodes and they elect one of themselves to be the
master.
• Apart from simpler configuration, automatic
appointment means that the cluster can
automatically appoint a new master when a

• Replication comes with some attractive benefits, but
it also comes with an unavoidable dark side—
inconsistency.
• You have the danger that different clients, reading
different slaves, will see different values because
the changes haven’t all propagated to the slaves.
• In the worst case, that can mean that a client cannot
read a write it just made.
• Even if you use master-slave replication just for hot
backup this can be a concern, because if the master
fails, any updates not passed on to the backup are
lost.

Peer-to-Peer Replication
• Master-slave replication helps with read scalability but
doesn’t help with scalability of writes. It provides resilience
against failure of a slave, but not of a master.
• Essentially, the master is still a bottleneck and a single
point of failure. Peer-to-peer replication attacks these
problems by not having a master. All the replicas have
equal weight, they can all accept writes, and the loss of any
of them doesn’t prevent access to the data store.
Fig: Peer-to-peer replication has all nodes applying reads
and writes to all the data.

Advantages:
• You can ride over node failures without losing
access to data.
• You can easily add
nodes to improve your
performance.
Disadvantages:
• Inconsistency: When you can write to two
different places, you run the risk that two people
will attempt to update the same record at the same
time—a write-write conflict. Inconsistencies on
read lead to problems but at least they are
relatively temporary. Inconsistent writes are
forever.

How to handle inconsistency?
• At one end, we can ensure that whenever we
write data, the replicas coordinate to ensure
we avoid a conflict. We don’t need all the
replicas to agree on the write, just a
majority, so we can still survive losing a
minority of the replica nodes.
• At the other extreme, we
can decide to
manage with an inconsistent write.

Combining Sharding and Replication
• Replication and sharding are strategies that
can be combined.
• If we use both master-slave replication and
sharding, this means that we have multiple
masters, but each data item only has a
single master.
• Depending on your configuration, you may
choose a node to be a master for some data
and slaves for others, or you may dedicate
nodes for master or slave duties.

Fig: Using master-slave replication together with
sharding

• Using peer-to-peer replication and sharding
is a common strategy for column-family
databases.
• In a scenario like this you might have tens or
hundreds of nodes in a cluster with data
sharded over them.
• A good starting point for peer-to-peer
replication is to have a replication factor of 3,
so each shard is present on three nodes.
When a node fail, then the shards on that node
will be built on the other nodes

Fig: Using peer-to-peer replication together with
sharding

Key Points
There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each
server acts as the single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data
can be found in multiple places. A system may use either or both
techniques.
Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that
handles writes while slaves synchronize with the master and may handle
reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate
to synchronize their copies of the data. Master-slave replication
reduces the chance of update conflicts but peer-to-peer replication
avoids loading all writes onto a single point of failure.

Consistency
• One of the biggest changes from a
centralized relational database to a cluster-
oriented NoSQL database is in how you
think about consistency.
• Relational databases try to exhibit strong
consistency by avoiding all the various
inconsistencies.
• In NoSQL as soon as you start building
something you have to think about what sort
of consistency you need for your system.

Update Consistency
Consider example of updating a telephone number.
• Coincidentally, Martin and Pramod are looking at the
company website and notice that the phone number
is out of date.
• They both have update access, so they both go in at
the same time to update the number.
• Assume they update it slightly differently, because
each uses a slightly different format.
• This issue is called a write-write conflict: two
people updating the same data item at the same

• When the writes reach the server, the server
will serialize them—decide to apply one, then
the other. Let’s assume it uses alphabetical
order and picks Martin’s update first, then
Pramod’s.
• Without any concurrency control, Martin’s
update would be applied and immediately
overwritten by Pramod’s.
• In this case Martin’s is a lost update. Here
the lost update is not a big problem, but often
it is.

Approaches for maintaining consistency:
In the face of concurrency, approaches for maintaining
consistency are often described as pessimistic or
optimistic.
• A pessimistic approach works by preventing conflicts
from occurring.
• An optimistic approach lets conflicts occur, but
detects them and takes action to sort them out.

• For update conflicts, the most common
pessimistic approach is to have write locks,
so that in order to change a value you need to
acquire a lock, and the system ensures that
only one client can get a lock at a time.
• So Martin and Pramod would both attempt to
acquire the write lock, but only Martin (the
first one)would succeed.
• Pramod would then see the result of Martin’s
write before deciding whether to make his own
update.

• A common optimistic approach is a conditional
update where any client that does an update tests the
value just before updating it to see if it’s changed
since his last read.
• In this case, Martin’s update would succeed but
Pramod’s would fail.
• The error would let Pramod know that he should
look at the value again and decide whether to
attempt a further update.

• Both the pessimistic and optimistic approaches
that we’ve just described rely on a consistent
serialization of the updates. With a single server,
this is obvious—it has to choose one, then the
other.
• But if there’s more than one server, such as with
peer-to-peer replication, then two nodes might
apply the updates in a different order, resulting
in a different value for the telephone number on
each peer.
• Often, when people talk about concurrency in
distributed systems, they talk about sequential
consistency—ensuring that all nodes apply
operations in the same order.

There is another optimistic way to handle a write-write
conflict—
• Save both updates and record that they are in conflict
• You have to merge the two updates somehow. Maybe you
show both values to the user and ask them to sort it out—
this is what happens if you update the same contact on your
phone and your computer.
• Alternatively, the computer may be able to perform the
merge itself; if it was a phone formatting issue, it may be able
to realize that and apply the new number with the standard
format.
• Any automated merge of write-write conflicts is highly
domain-specific and needs to be programmed for each
particular case.

• Often, when people first encounter these issues,
their reaction is to prefer pessimistic
concurrency because they are determined to
avoid conflicts.
• While in some cases this is the right answer, there
is always a tradeoff.
• Concurrent programming involves a
fundamental compromise between safety
(avoiding errors such as update conflicts) and
liveness (responding quickly to clients).
• Pessimistic approaches often severely degrade
the responsiveness of a system to the degree that
it becomes unfit for its purpose.
• Pessimistic concurrency often leads to deadlocks,
which are hard to prevent and debug.

• Replication makes it much more likely to run
into write-write conflicts. If different nodes
have different copies of some data which
can be independently updated, then you’ll
get conflicts unless you take specific
measures to avoid them.
• Using a single node as the target for all writes
for some data makes it much easier to
maintain update consistency. Of the
distribution models we discussed earlier, all
but peer-to-peer replication do this.

Read Consistency
• Having a data store that maintains update consistency is one
thing, but it doesn’t guarantee that readers of that data
store will always get consistent responses to their requests.
• Let’s imagine we have an order with line items and a
shipping charge. The shipping charge is calculated based on
the line items in the order.
• If we add a line item, we thus also need to recalculate and
update the shipping charge.
• In a relational database, the shipping charge and line items
will be in separate tables.
• The danger of inconsistency is that Martin adds a line item to
his order, Pramod then reads the line items and shipping
charge, and then Martin updates the shipping charge. This
is an inconsistent read or read-write conflict: In Figure
Pramod has done a read in the middle of Martin’s write.

Fig: A read-write conflict in logical
consistency

• We refer to this type of consistency as logical
consistency: ensuring that different data items
make sense together.
• To avoid a logically inconsistent read-write
conflict, relational databases support the notion
of transactions. Providing Martin wraps his two
writes in a transaction, the system guarantees that
Pramod will either read both data items before the
update or both after the update.
• A common claim we hear is that NoSQL
databases don’t support transactions and thus
can’t be consistent, such claim is mostly wrong.

Clarification of why such claim is wrong:
• Any statement about lack of transactions usually
only applies to some NoSQL databases, in particular
the aggregate-oriented ones. In contrast, graph
databases tend to support ACID transactions just
the same as relational databases.
• Aggregate-oriented databases do support atomic
updates, but only within a single aggregate. This
means that you will have logical consistency within
an aggregate but not between aggregates. So in the
example, you could avoid running into that
inconsistency if the order, the delivery charge, and
the line items are all part of a single order
aggregate.

• Of course not all data can be put in the same aggregate,
so any update that affects multiple aggregates leaves
open a time when clients could perform an
inconsistent read. The length of time an inconsistency
is present is called the inconsistency window.
• A NoSQL system may have a quite short
inconsistency window: Amazon’s documentation says
that the inconsistency window for its SimpleDB
service is usually less than a second.

• Once you introduce replication, however, you get a
whole new kind of inconsistency. Let’s imagine
there’s one last hotel room for a desirable event.
• The hotel reservation system runs on many nodes.
Martin and Cindy are a couple considering this
room, but they are discussing this on the phone
because Martin is in London and Cindy is in
Boston.
• Meanwhile Pramod, who is in Mumbai, goes and
books that last room. That updates the replicated
room availability, but the update gets to Boston
quicker than it gets to London.

When Martin and Cindy fire up their browsers to see if the
room is available, Cindy sees it booked and Martin sees
it free. This is another inconsistent read—this form of
consistency we call replication consistency: ensuring that
the same data item has the same value when read from
different replicas (see Figure).
Figure: An example of replication inconsistency

• Eventually, of course, the updates will propagate fully, and
Martin will see the room is fully booked. Therefore this
situation is generally referred to as eventually consistent,
meaning that at any time nodes may have replication
inconsistencies but, if there are no further updates,
eventually all nodes will be updated to the same value. Data
that is out of date is generally referred to as stale.
• Although replication consistency is independent from logical
consistency, replication make worse a logical inconsistency
by lengthening its inconsistency window. Two different
updates on the master may be performed in rapid
succession, leaving an inconsistency window of milliseconds.
But delays in networking could mean that the same
inconsistency window lasts for much longer on a slave.
• You can usually specify the level of consistency you want
with individual requests. This allows you to use weak
consistency most of the time when it isn’t an issue, but request
strong consistency when it is.

• Consider the example of posting comments on a
blog entry. Few people are going to worry about
inconsistency windows of even a few minutes
while people are typing in their latest thoughts.
• Often, systems handle the load of such sites by
running on a cluster and load-balancing
incoming requests to different nodes.
• Therein lies a danger: You may post a message
using one node, then refresh your browser, but
the refresh goes to a different node which
hasn’t received your post yet—and it looks like
your post was lost.

• In situations like this, you can tolerate reasonably long
inconsistency windows, but you need read your-writes
consistency which means that, once you’ve made an
update, you’re guaranteed to continue seeing that
update.
• One way to get this in an otherwise eventually consistent
system is to provide session consistency: Within a
user’s session there is read-your-writes consistency.
This does mean that the user may lose that consistency
should their session end for some reason or should the
user access the same system simultaneously from
different computers, but these cases are relatively rare.

Techniques to provide session consistency
• A common way, and often the easiest way, is to have
a sticky session: a session that’s tied to one node
(this is also called session affinity). A sticky session
allows you to ensure that as long as you keep
read-your-writes consistency on a node, you’ll get
it for sessions too. The downside is that sticky
sessions reduce the ability of the load balancer to
do its job.
• Use version stamps and ensure every interaction
with the data store includes the latest version
stamp seen by a session. The server node must then
ensure that it has the updates that include that version
stamp before responding to a request.

Relaxing Consistency
• Consistency is a Good Thing—but, sadly, sometimes we
have to sacrifice it. It is always possible to design a
system to avoid inconsistencies, but often impossible to
do so without making unbearable sacrifices in other
characteristics of the system.
• As a result, we often have to compromise consistency
for something else. While some architects see this as a
disaster, we see it as part of the unavoidable
compromises involved in system design.
• Furthermore, different domains have different
tolerances for inconsistency, and we need to take this
tolerance into account as we make our decisions.

• Compromising consistency is a familiar concept even
in single-server relational database systems. Here, our
principal tool to enforce consistency is the transaction,
and transactions can provide strong consistency
guarantees.
• However, transaction systems usually come with the
ability to relax isolation levels, allowing queries to
read data that hasn’t been committed yet, and in
practice we see most applications relax consistency
down from the highest isolation level (serialized) in
order to get effective performance.
• We most commonly see people using the read-
committed transaction level, which eliminates some
read-write conflicts but allows others.

• Many systems go without transactions entirely
because the performance impact of transactions is too
high.
• On a small scale, we saw the popularity of MySQL
during the days when it didn’t support
transactions. Many websites liked the high speed
of MySQL and were prepared to live without
transactions.
• At the other end of the scale, some very large
websites, such as eBay, have to go without
transactions in order to perform acceptably—this is
particularly true when you need to introduce
sharding.

The CAP
Theorem
• In the NoSQL world refer CAP theorem as a reason
why you may need to relax consistency.
• The basic statement of the CAP theorem is that, given the
three properties of Consistency, Availability, and
Partition tolerance, you can only get two. Obviously
this depends very much on how you define these three
properties.
• Consistency means that data is the same across cluster, so
you can read or write from/to any node and get the same
data.
• Availability has a particular meaning in the context of
CAP—it means that if you can talk to a node in the
cluster, it can read and write data.
• Partition tolerance means that the cluster can survive
communication breakages in the cluster that separate the
cluster into multiple partitions unable to communicate
with each other.

Figure: With two breaks in the communication lines,
the network partitions into two groups.

• A single-server system is the obvious example of a
CA system—a system that has Consistency and
Availability but not Partition tolerance.
• A single machine can’t partition, so it does not have
to worry about partition tolerance. There’s only one
node—so if it’s up, it’s available. Being up and
keeping consistency is reasonable.
• It is theoretically possible to have a CA cluster.
However, this would mean that if a partition ever
occurs in the cluster, all the nodes in the cluster
would go down so that no client can talk to a node.
• By the usual definition of “available,” this would mean
a lack of availability, but this is where CAP’s special
usage of “availability” gets confusing. CAP defines
“availability” to mean “every request received by a
non failing node in the system must result in a
response”. So a failed, unresponsive node doesn’t

• This does imply that you can build a CA cluster, but
you have to ensure it will only partition rarely.
• So clusters have to be tolerant of network
partitions. And here is the real point of the CAP
theorem.
• Although the CAP theorem is often stated as “you can
only get two out of three,” in practice what it’s saying
is that in a system that may suffer partitions, as
distributed system do, you have to compromise
consistency versus availability.
• Often, you can compromise a little consistency to
get some availability. The resulting system would be
neither perfectly consistent nor perfectly
available—but would have a combination that is
reasonable for your particular needs.

• Example : Martin and Pramod are both trying to book the last
hotel room on a system that uses peer-to-peer distribution with
two nodes (London for Martin and Mumbai for Pramod).
• If we want to ensure consistency, then when Martin tries to book
his room on the London node, that node must communicate with
the Mumbai node before confirming the booking. Essentially, both
nodes must agree on the serialization of their requests. This
gives us consistency—but if the network link break, then neither
system can book any hotel room, sacrificing availability.
• One way to improve availability is to designate one node as the
master for a particular hotel and ensure all bookings are
processed by that master. If that master be Mumbai, then
Mumbai can still process hotel bookings for that hotel and
Pramod will get the last room.
• If we use master-slave replication, London users can see the
inconsistent room information but cannot make a booking and thus
cause an update inconsistency.

• We still can’t book a room on the London node for the hotel
whose master is in Mumbai if the connection goes down.
• In CAP terminology, this is a failure of availability in that
Martin can talk to the London node but the London node
cannot update the data.
• To gain more availability, we might allow both systems to
keep accepting hotel reservations even if the network link
breaks down. The danger here is that Martin and Pramod
book the last hotel room.
• However, depending on how this hotel operates, that may be
fine. Often, travel companies tolerate a certain amount of
overbooking in order to cope with no-shows.
• Conversely, some hotels always keep a few rooms clear even
when they are fully booked, in order to be able to swap a
guest out of a room with problems or to accommodate a high-
status late booking.
• Some might even cancel the booking with an apology once
they detected the conflict—reasoning that the cost of that is

• The classic example of allowing inconsistent writes is
the shopping cart, as discussed in Amazon’s Dynamo.
• In this case you are always allowed to write to your
shopping cart, even if network failures mean you end up
with multiple shopping carts. The checkout process can
merge the two shopping carts by putting the union of
the items from the carts into a single cart and returning
that.
• Almost always that’s the correct answer—but if not, the
user gets the opportunity to look at the cart before
completing the order.
• The lesson here is that although most software developers
treat update consistency as The Way Things Must Be,
there are cases where you can deal gracefully with
inconsistent answers to requests.
• If you can find a way to handle inconsistent updates, this
gives you more options to increase availability and
performance. For a shopping cart, it means that shoppers
can always shop, and do so quickly.

• A similar logic applies to read consistency. If you
are trading financial instruments over a
computerized exchange, you may not be able to
tolerate any data that isn’t right up to date.
However, if you are posting a news item to a media
website, you may be able to tolerate old pages for
minutes.
• Different data items may have different tolerances for
staleness, and thus may need different settings in
your replication configuration.
• Promoters of NoSQL often say that instead of
following the ACID properties of relational
transactions, NoSQL systems follow the BASE
properties (Basically Available, Soft state, Eventual
consistency).

• It’s usually better to think not about the tradeoff
between consistency and availability but rather
between consistency and latency(response time).
• We can improve consistency by getting more
nodes involved in the interaction, but each node
we add increases the response time of that
interaction.
• We can then think of availability as the limit of
latency that we’re prepared to tolerate; once latency
gets too high, we give up and treat the data as
unavailable—which neatly fits its definition in the
context of CAP.

Relaxing Durability
• Most people would laugh at relaxing
durability—after all, what is the point of a data
store if it can lose updates?
• There are cases where you may want to trade off
some durability for higher performance.
• If a database can run mostly in memory, apply
updates to its in-memory representation, and
periodically flush changes to disk, then it may
be able to provide considerably higher
responsiveness to requests. The cost is that, if
the server crash, any updates since the last
flush will be lost.

• One example of where this tradeoff may be
meaningful is storing user-session state.
• A big website may have many users and keep
temporary information about what each user is
doing in some kind of session state. There’s a lot of
activity on this state, creating lots of demand, which
affects the responsiveness of the website.
• The vital point is that losing the session data isn’t
too much of a tragedy—it will create some
annoyance, but may be less than a slower website
would cause. This makes it a good candidate for
nondurable writes.
• Often, you can specify the durability needs on a
call-by-call basis, so that more important updates
can force a flush to disk.

• Another class of durability tradeoffs comes up
with replicated data. A failure of replication
durability occurs when a node processes an
update but fails before that update is replicated to
the other nodes.
• A simple case of this may happen if you have a
master-slave distribution model where the
slaves appoint a new master automatically if
the existing master fail. If a master does fail, any
writes not passed onto the replicas will
effectively become lost.
• If the master come back online, those updates
will conflict with updates that have happened
since. We think of this as a durability problem
because you think your update has succeeded
since the master acknowledged it, but a master
node failure caused it to be lost.

• You can improve replication durability by
ensuring that the master waits for some replicas
to acknowledge the update before the master
acknowledges it to the client.
• Obviously, however, that will slow down
updates and make the cluster unavailable if
slaves fail—so, again, we have a tradeoff,
depending upon how vital durability is.
• As with basic durability, it’s useful for individual
calls to indicate what level of durability they
need.

Quorums
• When you’re trading
off
durability, it’s
not
an proposal.
consistency
or
all or
nothi
ng
• The more nodes you involve in a request, the
higher is the chance of avoiding an
inconsistency.
• This naturally leads to the question: How
many nodes need to be involved to get
strong consistency?

Write quorum
• Imagine some data replicated over three nodes. You don’t
need all nodes to acknowledge a write to ensure strong
consistency; all you need is two of them—a majority.
• If you have conflicting writes, only one can get a majority.
This is referred to as a write quorum.
• It is expressed in a slightly pretentious inequality of W >
N/2
• It means the number of nodes participating in the write
(W) must be more than the half the number of nodes
involved in replication (N).
• The number of replicas is often called the replication
factor.

Read quorum
• Similarly to the write quorum, there is the notion of
read quorum: How many nodes you need to contact
to be sure you have the most up-to-date change.
• The read quorum is a bit more
complicated because it depends on how many
nodes need to confirm a write.

• Let’s consider a replication factor of 3.
• If all writes need two nodes to confirm (W = 2) then we
need to contact at least two nodes to be sure we’ll get the
latest data.
• If, however, writes are only confirmed by a single node
(W = 1) we need to talk to all three nodes to be sure we
have the latest updates.
• In this case, since we don’t have a write quorum, we
may have an update conflict, but by contacting enough
readers we can be sure to detect it. Thus we can get
strongly consistent reads even if we don’t have strong
consistency on our writes.
• This relationship between the number of nodes you need
to contact for a read (R), those confirming a write (W),
and the replication factor (N) can be captured in an
inequality: You can have a strongly consistent read if R
+ W > N.

• These inequalities are written with a peer-to-
peer distribution model in mind. If you have a
master slave distribution, you only have to
write to the master to avoid write-write
conflicts, and similarly only read from the
master to avoid read-write conflicts.
• With this notation, it is common to
confuse the number of nodes in the cluster with
the replication factor, but these are often
different.
• I may have100 nodes in my cluster, but
only have a replication factor of 3, with most of

• Indeed most authorities suggest that a replication
factor of 3 is enough to have good resilience. This
allows a single node to fail while still maintaining
quora for reads and writes. If you have automatic
rebalancing, it won’t take too long for the cluster to
create a third replica, so the chances of losing a
second replica before a replacement comes up are
slight.
• The number of nodes participating in an operation
can vary with the operation.
• When writing, we might require quorum for some
types of updates but not others, depending on how
much we value consistency and availability.
• Similarly, a read that needs speed but can tolerate
staleness should contact less nodes.

• Often you may need to take both into account. If you
need fast, strongly consistent reads, you could
require writes to be acknowledged by all the nodes,
thus allowing reads to contact only one (N = 3, W=
3, R = 1).
• That would mean that your writes are slow, since they
have to contact all three nodes, and you would not be
able to tolerate losing a node. But in some
circumstances that may be the tradeoff to make.
• The point to all of this is that you have a range of
options to work with and can choose which
combination of problems and advantages to prefer.

Key Points
• Write-write conflicts occur when two clients try to
write the same data at the same time. Read-write
conflicts occur when one client reads inconsistent
data in the middle of another client’s write.
• Pessimistic approaches lock data records to prevent
conflicts. Optimistic approaches detect conflicts
and fix them.
• Distributed systems see read-write conflicts due to
some nodes having received updates while other
nodes have not. Eventual consistency means that at
some point the system will become consistent once
all the writes have propagated to all the nodes.

• Clients usually want read-your-writes consistency, which
means a client can write and then immediately read the new
value. This can be difficult if the read and the write
happen on different nodes.
• To get good consistency, you need to involve many nodes
in data operations, but this increases latency. So you often
have to trade off consistency versus latency.
• The CAP theorem states that if you get a network partition,
you have to trade off availability of data versus consistency.
• Durability can also be traded off against latency,
particularly if you want to survive failures with replicated
data.
• You do not need to contact all replicants to preserve
strong consistency with replication; you just need a large
enough quorum.

Version
Stamps
• Many opponents of NoSQL databases focus on
the lack of support for transactions.
Transactions are a useful tool that helps
programmers support consistency.
• One reason why many NoSQL proponents
worry less about a lack of transactions is that
aggregate-oriented NoSQL databases do
support atomic updates within an aggregate—
and aggregates are designed so that their data
forms a natural unit of update.
• That said, it’s true that transactional needs are
something to take into account when you
decide what database to use.

• As part of this, it’s important to remember that
transactions have limitations.
• Even within a transactional system we still
have to deal with updates that require
human intervention and usually cannot be
run within transactions because they would
involve holding a transaction open for too
long.
• We can cope with these using version
stamps—which turn out to be handy in other
situations as well, particularly as we move
away from the single-server distribution
model.

Business and System Transactions
• The need to support update consistency without
transactions is actually a common feature of systems
even when they are built on top of transactional
databases. When users think about transactions,
they usually mean business transactions.
• A business transaction may be something like
browsing a product catalog, choosing a bottle of
Cold drink at a good price, filling in credit card
information, and confirming the order.
• Yet all of this usually won’t occur within the system
transaction provided by the database because this
would mean locking the database elements while
the user is trying to find their credit card and gets
called off to lunch by their colleagues.

• Usually applications only begin a system
transaction at the end of the interaction with the
user, so that the locks are only held for a short
period of time.
• The problem, however, is that calculations and
decisions may have been made based on data
that’s changed.
• The price list may have updated the price of the
Cold drink bottle, or someone may have updated
the customer’s address, changing the shipping
charges.

• The broad techniques for handling this are offline
concurrency, useful in NoSQL situations too. A
particularly useful approach is the Optimistic Offline
Lock, a form of conditional update where a client
operation rereads any information that the
business transaction relies on and checks that it
hasn’t changed since it was originally read and
displayed to the user.
• A good way of doing this is to ensure that records in
the database contain some form of version stamp: a
field that changes every time the underlying data in
the record changes.
• When you read the data you keep a note of the
version stamp, so that when you write data you can
check to see if the version has changed.

• You may have come across this technique with updating
resources with HTTP. One way of doing this is to use
etags. Whenever you get a resource, the server responds
with an etag in the header.
• This etag is an opaque string that indicates the version
of the resource. If you then update that resource, you can
use a conditional update by supplying the etag that you
got from your last GET.
• If the resource has changed on the server, the etags
won’t match and the server will refuse the update,
returning a 412 (Precondition Failed) response.

• Some databases provide a similar mechanism of
conditional update that allows you to ensure updates
won’t be based on stale data.
• You can do this check yourself, although you then have
to ensure no other thread can run against the resource
between your read and your update. Sometimes this is
called a compare-and-set (CAS) operation, whose name
comes from the CAS operations done in processors.
• The difference is that a processor CAS compares a
value before setting it, while a database conditional
update compares a version stamp of the value.

• There are various ways you can construct your version
stamps. You can use a counter, always incrementing it
when you update the resource. Counters are useful since
they make it easy to tell if one version is more recent than
another. On the other hand, they require the server to
generate the counter value, and also need a single
master to ensure the counters aren’t duplicated.
• Another approach is to create a GUID, a large random
number that’s guaranteed to be unique. These use some
combination of dates, hardware information, and
whatever other sources of randomness they can pick up.
The nice thing about GUIDs is that they can be generated
by anyone and you’ll never get a duplicate; a
disadvantage is that they are large and can’t be
compared directly for recentness.

• A third approach is to make a hash of the contents
of the resource. With a big enough hash key size, a
content hash can be globally unique like a GUID and
can also be generated by anyone.
• The advantage is that they are deterministic—any
node will generate the same content hash for same
resource data.
• However, like GUIDs they can’t
be directly
compared for recentness, and they can be lengthy.

• A fourth approach is to use the timestamp of the last
update. Like counters, they are reasonably short
and can be directly compared for recentness, yet
have the advantage of not needing a single master.
• Multiple machines can generate timestamps—but
to work properly, their clocks have to be kept in
sync.
• One node with a bad clock can cause all sorts of data
corruptions. There’s also a danger that if the
timestamp is too granular you can get duplicates—
it’s no good using timestamps of a millisecond
precision if you get many updates per millisecond.

• You can blend the advantages of these
different version stamp schemes by using
more than one of them to create a composite
stamp. For example, CouchDB uses a
combination of counter and content hash.
• Version stamps are also useful for providing
session consistency.

Version Stamps on Multiple
Nodes
• The basic version stamp works well when you have
a single authoritative source for data, such as a
single server or master-slave replication. In that
case the version stamp is controlled by the master.
Any slaves follow the master’s stamps.
• But this system has to be enhanced in a peer-to-
peer distribution model because there’s no longer a
single place to set the version stamps.

• If you’re asking two nodes for some data, you
run into the chance that they may give you
different answers. If this happens, your reaction
may vary depending on the cause of that
difference.
• It may be that an update has only reached one
node but not the other, in which case you can
accept the latest(assuming you can tell which one
that is).
• Alternatively, you may have run into an
inconsistent update, in which case you need to
decide how to deal with that. In this situation, a
simple GUID or etag won’t suffice, since these
don’t tell you enough about the relationships.

• The simplest form of version stamp is a
counter. Each time a node updates the data, it
increments the counter and puts the value of
the counter into the version stamp.
• If you have blue and green slave replicas of a
single master, and the blue node answers with
a version stamp of 4 and the green node with
6, you know that the green’s answer is more
recent.

• In multiple-master cases, we need something
fancier.
• One approach, used by distributed version control
systems, is to ensure that all nodes contain a
history of version stamps. That way you can see if
the blue node’s answer is an ancestor of the green’s
answer.
• This would either require the clients to hold onto
version stamp histories, or the server nodes to
keep version stamp histories and include them
when asked for data.
• Although version control systems keep these kinds
of histories, they aren’t found in NoSQL databases.

• A simple but problematic approach is to use
timestamps. The main problem here is that
it’s usually difficult to ensure that all the
nodes have a consistent notion of time,
particularly if updates can happen rapidly.
• Should a node’s clock get out of sync, it can
cause all sorts of trouble. In addition, you
can’t detect write-write conflicts with
timestamps, so it would only work well for the
single master case—and then a counter is
usually better.

• The most common approach used by peer-to-peer
NoSQL systems is a special form of version stamp
which we call a vector stamp. In essence, a vector
stamp is a set of counters, one for each node.
• A vector stamp for three nodes (blue, green, black)
would look something like [blue: 43,green: 54,
black: 12]. Each time a node has an internal
update, it updates its own counter, so an update in
the green node would change the vector to [blue: 43,
green: 55, black: 12].
• Whenever two nodes
communicate, they

• By using this scheme you can tell if one version
stamp is newer than another because the newer
stamp will have all its counters greater than or
equal to those in the older stamp.
• So [blue: 1,green: 2, black: 5] is newer than
[blue:1, green: 1, black 5] since one of its
counters is greater.
• If both stamps have a counter greater than the
other, e.g. [blue: 1, green: 2, black: 5]and [blue:
2, green: 1, black: 5], then you have a write-
write conflict.

• There may be missing values in the vector, in which
case we use treat the missing value as 0. So[blue:
6, black: 2] would be treated as [blue: 6, green: 0,
black: 2]. This allows you to easily add new nodes
without invalidating the existing vector stamps.
• Vector stamps are a valuable tool that spots
inconsistencies, but doesn’t resolve them. Any
conflict resolution will depend on the domain you
are working in. This is part of the
consistency/latency tradeoff.
• You either have to live with the fact that network
partitions may make your system unavailable, or
you have to detect and deal with inconsistencies.

Key Points
• Version stamps help you detect concurrency
conflicts. When you read data, then update it, you
can check the version stamp to ensure nobody
updated the data between your read and write.
• Version stamps can be implemented using
counters, GUIDs, content hashes, timestamps,
or a combination of these.
• With distributed systems, a vector of version
stamps allows you to detect when different nodes
have conflicting updates.

NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS

More Related Content

Similar to NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS (20)

Recently uploaded (20)

NOSQL DATABASES UNIT-3 FOR ENGINEERING STUDENTS