6. RDBMS over File-based system - ACID Concept
ACID is an acronym that represents a set of properties that guarantee the
accuracy and integrity of data in databases.
7. ACID Concept - Atomicity
Atomicity ensures that a transaction is treated as a single, indivisible
unit of work.
Either all the changes made by the transaction are committed to
the database, or none of them are.
If any part of the transaction fails, the entire transaction is rolled
back, and the database remains unchanged.
Example:
Consider a bank transfer where money is being withdrawn from one
account and deposited into another. Atomicity ensures that either
both the withdrawal and the deposit occur, or neither happens.
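The bank transfer above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production pattern: the table layout, the account names and the `fail_midway` flag (which simulates a crash between the two updates) are all assumptions made for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Hypothetical failure between the withdrawal and the deposit.
                raise RuntimeError("simulated crash mid-transfer")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)                         # both updates applied
failed = transfer(conn, "alice", "bob", 50, fail_midway=True)   # withdrawal undone
print(ok, failed, balances(conn))
```

After the failed transfer the balances are the same as before it started: the withdrawal that had already run is rolled back together with the rest of the transaction.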
8. ACID Concept - Consistency
Consistency guarantees that a transaction brings the database from
one valid state to another.
The database must satisfy a set of integrity constraints before and after
the transaction.
If a transaction violates the database's consistency rules, it is rolled
back.
Example:
In a database where each user has a defined account balance,
consistency ensures that a transaction doesn't leave the database in a
state where the total balance is not preserved.
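A small sketch of a consistency rule being enforced, again using sqlite3. The rule chosen here (a balance may never go below zero, expressed as a CHECK constraint) is an assumption for illustration; real systems define their own integrity constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # the integrity constraint
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def total(conn):
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchall()[0][0]


def transfer(conn, src, dst, amount):
    """Return True if the transfer committed, False if it broke a constraint."""
    try:
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the whole transaction, including the deposit, was rolled back


before = total(conn)
accepted = transfer(conn, "alice", "bob", 500)  # would push alice below zero
after = total(conn)
print(accepted, before, after)
```

The deposit runs first and succeeds, but because the withdrawal violates the constraint the database rejects the transaction as a whole, so the total balance is unchanged.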
9. ACID Concept - Isolation
Isolation ensures that the execution of one transaction appears isolated from the execution of
other transactions, even when multiple transactions are executing concurrently. The goal
of isolation is to prevent interference between transactions and to maintain the
consistency of the database.
Isolation is typically implemented through mechanisms such as locks, isolation levels, and
transaction boundaries. Different isolation levels define the degree to which transactions
are isolated from each other, and they determine how changes made by one
transaction become visible to other concurrently executing transactions.
Read Committed: A transaction sees only data that other transactions have already
committed. A rough Git analogy: once a developer commits changes to their local branch,
those changes are not visible to other developers until they are pushed to the
shared repository.
Read Uncommitted (no Git equivalent): A transaction may read changes that other
transactions have made but not yet committed (so-called dirty reads).
Serializable: The strictest level. Concurrent transactions produce the same result as if
they had run one after another. A rough Git analogy: merging or rebasing applies the
changes from one branch in a defined order so that the history stays consistent.
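The visibility rule behind isolation can be observed with two sqlite3 connections to the same database file. This is only a sketch: it shows SQLite's default behaviour, where a second connection never sees uncommitted changes, and other database engines implement their isolation levels differently. The file name and table are assumptions for the example.

```python
import os
import sqlite3
import tempfile


def read_balance(conn):
    rows = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchall()
    return rows[0][0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")
    writer = sqlite3.connect(path)  # plays the role of transaction A
    reader = sqlite3.connect(path)  # plays the role of transaction B

    writer.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    writer.execute("INSERT INTO accounts VALUES ('alice', 100)")
    writer.commit()

    # Transaction A changes the balance but has not committed yet.
    writer.execute("UPDATE accounts SET balance = 40 WHERE name = 'alice'")
    seen_before_commit = read_balance(reader)  # B still sees the old value

    writer.commit()
    seen_after_commit = read_balance(reader)   # now B sees the new value

    writer.close()
    reader.close()

print(seen_before_commit, seen_after_commit)
```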
10. ACID Concept - Durability
Durability guarantees that once a transaction is committed, its changes to the
database persist even in the face of subsequent failures.
The changes are permanently stored in the database and are not lost, even if the
system crashes.
The data written by a completed transaction persists even through network or power
outages. If a later transaction fails, it does not affect data that has already been committed.
Example:
After a user submits a form updating their profile information, durability ensures that the
changes are permanently stored in the database and will not be lost due to a system
failure.
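A minimal sketch of the profile-update example: the change is committed, the connection is closed, and a fresh connection still finds the data. Closing and reopening the connection is only a stand-in for a restart; real durability against crashes depends on the database engine and the storage underneath it. The table and values are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profiles.db")

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE profiles (user TEXT PRIMARY KEY, city TEXT)")
    conn.execute("INSERT INTO profiles VALUES ('dana', 'Cairo')")
    conn.commit()

    # The user submits the form: the update is committed to disk.
    conn.execute("UPDATE profiles SET city = 'Lagos' WHERE user = 'dana'")
    conn.commit()
    conn.close()  # stands in for the application or server going away

    # A new connection, as after a restart, reads the committed value.
    reopened = sqlite3.connect(path)
    city_after_restart = reopened.execute(
        "SELECT city FROM profiles WHERE user = 'dana'"
    ).fetchall()[0][0]
    reopened.close()

print(city_after_restart)
```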
21. Introduction to Big Data
Data has become an ever-expanding array of information. It is collected
as:
• User information,
• Geographic location data,
• Sensor-generated data,
• Social media feeds, and many other forms.
This massive set of largely unstructured data, commonly known as big
data, has become the backbone of analysis for many mission-critical
applications.
22. Introduction to Big Data
Data continues to grow in volume, variety and velocity at unprecedented
speed, and companies are searching for new ways to capture, store and analyse it.
Big Data refers to vast and complex datasets that cannot be effectively managed,
processed, or analyzed using traditional data processing tools and methods. These
datasets typically exhibit three main characteristics, often referred to as the 3Vs:
• Volume: Big Data involves massive amounts of data, often ranging from terabytes
to petabytes or more. This data can come from various sources, including social
media, sensors, devices, and transaction records.
• Velocity: Data is generated at an unprecedented speed. For example, social
media platforms generate millions of posts, comments, and interactions every
minute. This real-time data influx requires rapid processing and analysis.
• Variety: Big Data is heterogeneous and can include structured data (e.g.,
databases), semi-structured data (e.g., JSON or XML), and unstructured data (e.g.,
text, images, videos). Handling this diverse data is a significant challenge.
24. Challenges and Considerations
Big data brings many challenges that call for new ways to handle the
scale of the data and to perform new types of transformations and
analytics.
1. Data Storage & Processing Challenges:
Traditional approaches to data storage, processing and management, including
traditional relational databases, struggle with the volume and
variety of data generated in today's world. Their structural and scaling limitations
make it difficult to manage Big Data effectively using traditional database
systems. Fortunately, a wave of new technologies has arrived along with
big data.
Two main kinds of technology are important:
1. Distributed systems and Parallel Processing, and
2. NoSQL database systems
25. Distributed Systems
• Fragmentation of data (partitioning, also
called sharding).
• Replication of data.
• Yet relational databases can't run
efficiently on distributed systems. Why?
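The two ideas in the list above, partitioning and replication, can be sketched together in a few lines. This is a toy placement scheme under stated assumptions: the node names, the replica count and the choice of a CRC32 hash are all made up for illustration and do not describe any particular database.

```python
import zlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICAS = 2  # assumed replication factor: each record lives on two nodes


def owners(key, nodes=NODES, replicas=REPLICAS):
    """Partitioning: hash the key to pick a primary node.
    Replication: also store the record on the next node(s) in the list."""
    start = zlib.crc32(key.encode("utf-8")) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def place(keys):
    """Build a node -> set-of-keys map for a collection of record keys."""
    placement = {node: set() for node in NODES}
    for key in keys:
        for node in owners(key):
            placement[node].add(key)
    return placement


keys = [f"user:{i}" for i in range(1000)]
placement = place(keys)
for node, held in placement.items():
    print(node, len(held))
```

Each node holds only part of the data (fragmentation), and every record survives the loss of any single node because a second copy exists elsewhere (replication).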
26. Why can't relational databases run efficiently on
distributed systems?
1. ACID Properties and Consistency
Relational databases adhere to the ACID properties. Achieving that level of
strong consistency and integrity in a distributed system is challenging because
of possible network partitions and the need for synchronous communication
among nodes.
• Example:
Imagine a distributed relational database where each node needs to
agree on every transaction to maintain consistency. This requirement
may lead to high-latency communication.
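One common way to make every node agree on a transaction is a two-phase commit protocol, which the slides do not name; the toy model below is an assumption used only to show where the latency comes from. The node names and round-trip times are invented, and real protocols also handle timeouts, logging and recovery.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    round_trip_ms: float      # assumed network latency to this node
    can_commit: bool = True   # the node's vote in the prepare phase
    committed: bool = False


def two_phase_commit(nodes):
    """Return (committed?, total_wait_ms) for one transaction across all nodes."""
    # Phase 1 (prepare): the coordinator needs every vote,
    # so it waits as long as the slowest node takes to answer.
    prepare_wait = max(n.round_trip_ms for n in nodes)
    if not all(n.can_commit for n in nodes):
        return False, prepare_wait  # a single "no" vote aborts everywhere

    # Phase 2 (commit): a second round trip tells every node to commit.
    for n in nodes:
        n.committed = True
    commit_wait = max(n.round_trip_ms for n in nodes)
    return True, prepare_wait + commit_wait


cluster = [
    Node("local", 1.0),
    Node("same-region", 5.0),
    Node("other-continent", 150.0),
]
ok, waited = two_phase_commit(cluster)
print(ok, waited)  # True 300.0
```

In this model the transaction is as slow as the slowest participant, twice over, which is the high-latency communication the example above refers to.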
27. 2. Vertical Scaling vs. Horizontal Scaling
• Traditional relational databases are designed for vertical scaling, where you add
more resources (CPU, RAM) to a single server to handle increased load.
• However, this approach has limitations in terms of cost and scalability.
• In a cluster, the emphasis is on horizontal scaling by adding more nodes, which
may not align with the architecture of traditional relational databases.
Example:
A relational database optimized for a single, powerful server may struggle when
distributed across multiple nodes, each with its own subset of data.
28. Vertical vs. Horizontal Scaling
• Vertical scaling
• refers to increasing the processing power of a
single server.
• Both relational and non-relational databases
can scale up, but eventually, there will be a
limit in terms of maximum processing power.
• Additionally, there are increased costs with
scaling up to high-performing hardware, as
costs do not scale linearly.
• Horizontal scaling
• also known as scale-out
• refers to bringing on additional nodes to share
the load.
• This is difficult with relational databases because
related data is hard to spread out across
nodes.
• Non-relational databases keep collections
self-contained rather than coupled relationally.
This allows them to be distributed across nodes
more simply, as queries do not have to “join”
them together across nodes.
29. 3. Complex Joins and Transactions:
• Relational databases often involve complex join operations and
transactions that require coordination across multiple tables.
• Distributing such operations across nodes in a cluster can introduce
significant overhead due to the need for inter-node communication.
Example:
Consider a query that involves joining several large tables distributed
across different nodes. Coordinating this join operation can be less
efficient than when the tables are co-located on a single server.
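The cost of such a join can be illustrated with a toy simulation that counts how many lookups have to leave the node where an order is stored. Everything here is hypothetical: the tables, the two-node layout and the parity-based sharding are assumptions chosen to keep the sketch small.

```python
def join_orders_with_customers(orders_by_node, customers_by_node):
    """Join every order to its customer and count lookups that cross nodes."""
    # Which node holds each customer id.
    customer_home = {
        cid: node
        for node, customers in customers_by_node.items()
        for cid in customers
    }
    joined, remote_lookups = [], 0
    for node, orders in orders_by_node.items():
        for order_id, cid in orders:
            home = customer_home[cid]
            if home != node:
                remote_lookups += 1  # a network round trip in a real cluster
            joined.append((order_id, customers_by_node[home][cid]))
    return joined, remote_lookups


customers = {1: "Ana", 2: "Bo", 3: "Cy", 4: "Di"}
orders = [(100, 1), (101, 2), (102, 3), (103, 4), (104, 1), (105, 3)]

# Layout A: both tables co-located on a single server.
single = join_orders_with_customers({"n0": orders}, {"n0": customers})

# Layout B: customers sharded by customer id, orders sharded by order id.
customers_sharded = {
    "n0": {cid: name for cid, name in customers.items() if cid % 2 == 0},
    "n1": {cid: name for cid, name in customers.items() if cid % 2 == 1},
}
orders_sharded = {
    "n0": [o for o in orders if o[0] % 2 == 0],
    "n1": [o for o in orders if o[0] % 2 == 1],
}
spread = join_orders_with_customers(orders_sharded, customers_sharded)

print(single[1], spread[1])  # 0 5
```

Both layouts produce the same joined rows, but in the sharded layout five of the six lookups need data from the other node.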
30. 4. Data Distribution and Sharding:
Distributing data across nodes in a cluster (sharding) is a common
technique to achieve parallelism. However, relational databases may
face challenges when deciding how to shard data effectively without
introducing performance bottlenecks.
Example:
Sharding a large table based on a specific column might lead to uneven
data distribution if that column has low cardinality or heavily skewed
values, resulting in certain nodes being overloaded.
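How the choice of shard column affects the load on each node can be checked with a short experiment. The data set is invented for the example (1000 users, 80% of them in one country), as are the node count and the hash function.

```python
import zlib
from collections import Counter

N_NODES = 4  # assumed cluster size


def shard_by_id(row):
    """Shard on a column with many distinct, evenly spread values."""
    return row["user_id"] % N_NODES


def shard_by_country(row):
    """Shard on a column with few distinct values and a dominant one."""
    return zlib.crc32(row["country"].encode("utf-8")) % N_NODES


def country_for(i):
    if i % 10 < 8:
        return "US"
    return "DE" if i % 10 == 8 else "JP"


rows = [{"user_id": i, "country": country_for(i)} for i in range(1000)]

load_by_id = Counter(shard_by_id(r) for r in rows)
load_by_country = Counter(shard_by_country(r) for r in rows)

print("by user_id:", sorted(load_by_id.values()))
print("by country:", sorted(load_by_country.values()))
```

Sharding by `user_id` gives every node 250 rows, while sharding by `country` puts at least 800 rows on one node and leaves at least one node empty.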
31. 5. Schema Changes and Schema Rigidity:
Traditional relational databases often have a rigid schema that requires
careful planning before introducing changes.
In a distributed environment, the need for schema changes across
multiple nodes can be complex and time-consuming.
Example:
Adding a new column to a table in a distributed relational database
might require coordination and schema updates on all nodes, potentially
leading to downtime or operational challenges.
32. Some relational databases have introduced features to support
distributed architectures (e.g., MySQL Cluster, PostgreSQL), but:
• These adaptations often come with trade-offs (see the CAP theorem).
• They may not offer the same level of scalability and efficiency as native
NoSQL solutions designed explicitly for distributed computing.
In the context of Big Data and clusters, NoSQL databases like Apache
Cassandra or key-value stores like Amazon DynamoDB are often
preferred for their inherent scalability and ability to handle distributed
data efficiently.
34. NOSQL
- In the early 2000s, the amount of data that applications needed to store and
query increased sharply. This data came in all shapes and sizes. The demands of
these applications could not be served by SQL technology, so the
early companies affected developed new databases to meet their needs.
- What encouraged NoSQL:
1. Decrease in storage cost.
2. Massive use of mobile applications.
3. Data distribution.
4. In addition, NoSQL databases allow developers to store huge amounts of
unstructured data without a pre-defined schema, giving them the flexibility
to make changes throughout their software stack (Agile methodology).
35. What is NOSQL?
- Not Only SQL
- Non-tabular databases
- Non-relational Data Management System, that does not require a
fixed schema.
- NoSQL is used for Big data and real-time web apps. For example,
companies like Twitter, Facebook and Google collect terabytes of user
data every single day.
- The major purpose of using a NoSQL database is data
distribution.
38. NOSQL Benefits
1. The pace of development with NoSQL databases can be much faster than with
a SQL database. NoSQL databases are used in nearly every industry; use cases
range from the highly critical (e.g., storing financial data and healthcare
records) to the less critical (e.g., storing IoT readings).
2. Fast-paced Agile development: the structure of many different forms of data
is more easily handled and evolved with a NoSQL database.
3. Storing huge volumes of data: NoSQL was created to handle Big Data. The amount of
data in many applications cannot be served affordably by a SQL database.
4. Scale of traffic and fast queries: the scale of traffic and the need for zero
downtime can be hard to meet with a SQL database. SQL queries often need to join
data from multiple tables, and these joins can become expensive. Data in NoSQL
databases, however, is typically stored in a way that is optimized for queries.
5. New application paradigms can be more easily supported, and it is easy for
developers: MongoDB maps its data structures to those of popular programming
languages. This mapping can allow developers to write less code, leading to
faster development time and fewer bugs.
39. Types of NOSQL database format & top engines
1- Document databases
• Stores data in JSON or XML documents (not Word documents or
Google Docs, of course).
• In a document database, documents can be nested. Example engine: MongoDB.
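A nested document can be sketched with a plain Python dictionary and the standard json module. The order, its field names and the `order_total` helper are hypothetical; no document database is involved here, only the shape of the data.

```python
import json

# One self-contained document: the customer and the line items
# are nested inside the order instead of living in separate tables.
order_document = {
    "_id": "order-1001",
    "customer": {
        "name": "Ana",
        "address": {"city": "Lisbon", "country": "PT"},
    },
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.5},
        {"sku": "B-7", "qty": 1, "price": 20.0},
    ],
}


def order_total(doc):
    """Everything needed for the total is inside the document itself."""
    return sum(item["qty"] * item["price"] for item in doc["items"])


as_json = json.dumps(order_document, indent=2)
round_tripped = json.loads(as_json)

print(as_json)
print(order_total(round_tripped))  # 39.0
```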
40. 2- Key-Value DB
• Every data element in the database is stored as a key-value pair.
• Each pair consists of a key or attribute name (such as state) and a value
(such as Alaska). Example engine: Redis.
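The key-value model is small enough to sketch in memory. The `KeyValueStore` class below is a hypothetical stand-in written for this example, not the API of Redis or any other engine.

```python
class KeyValueStore:
    """In-memory stand-in for a key-value database.

    Values are opaque to the store: it only looks them up by key.
    """

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        """Remove the key; return True if it existed."""
        existed = key in self._data
        self._data.pop(key, None)
        return existed


store = KeyValueStore()
store.put("state", "Alaska")
store.put("session:42", b"\x00\x01 any bytes the application likes")

print(store.get("state"))
print(store.get("missing", "not found"))
```

The store cannot be asked "which keys have the value Alaska?" without scanning everything, which is the trade-off for very simple and fast lookups by key.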
41. 3- Wide-column databases
• By comparison, a relational database stores data in tables, where data is queried by
row, and where all rows have the same columns.
• A NoSQL wide-column store permits rows to have differing columns, resulting in a more flexible
data model that provides the ability to evolve and adapt over time.
• In a wide-column store, each column is stored separately, enabling data to be partitioned more
easily across distributed database systems. This means that when you want to run analytics on a
small number of columns, you can read those columns directly without consuming memory with
the unwanted data.
• What makes this model so flexible is that the structure of the column data can vary from row to
row. Used by eBay. The most popular engine is Cassandra.
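The idea that rows may carry different columns can be sketched with a dictionary of dictionaries. The `WideColumnTable` class and its sample rows are assumptions for illustration and do not reflect how Cassandra stores data internally.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table: each row key maps to its own set of columns."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def scan_column(self, column):
        """Read one column across all rows, skipping rows that lack it."""
        return {
            key: cols[column]
            for key, cols in self._rows.items()
            if column in cols
        }


users = WideColumnTable()
users.put("user:1", "name", "Ana")
users.put("user:1", "email", "ana@example.com")
users.put("user:2", "name", "Bo")
users.put("user:2", "phone", "555-0100")  # a column user:1 does not have

print(users.get_row("user:1"))
print(users.scan_column("email"))
```

No schema change was needed to give `user:2` a `phone` column, and reading the `email` column touches only the rows that actually have it.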
42. 4- Graph databases
• Stores data in nodes and edges.
• Nodes typically store information about people, places, and things, while edges store information
about the relationships between the nodes.
• A graph database is optimized to capture and search the connections between data elements,
avoiding the overhead associated with JOINing multiple tables in SQL. Example engine: Neo4j.
Example users of graph data: Facebook and Twitter.
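Nodes, edges and a connection query can be sketched with an adjacency list and a breadth-first search. The `Graph` class, the `FOLLOWS` relationship and the sample people are hypothetical; a real graph database such as Neo4j has its own storage format and query language.

```python
from collections import defaultdict, deque


class Graph:
    """Toy property graph: nodes hold data, edges hold named relationships."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self.edges = defaultdict(list)  # node id -> [(relationship, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, rel, dst):
        self.edges[src].append((rel, dst))

    def neighbors(self, node_id, rel):
        return [dst for r, dst in self.edges.get(node_id, []) if r == rel]

    def within_hops(self, start, rel, max_hops):
        """Breadth-first search: everything reachable in at most max_hops edges."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == max_hops:
                continue
            for nxt in self.neighbors(node, rel):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
        return seen - {start}


g = Graph()
for person in ("ana", "bo", "cy", "di"):
    g.add_node(person, kind="person")
g.add_edge("ana", "FOLLOWS", "bo")
g.add_edge("bo", "FOLLOWS", "cy")
g.add_edge("cy", "FOLLOWS", "di")

print(sorted(g.within_hops("ana", "FOLLOWS", 2)))  # ['bo', 'cy']
```

The query follows edges directly from node to node, which is the kind of traversal that would need repeated self-joins in a relational schema.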
43. What ties NoSQL databases together?
• One commonality is that the majority of them have their roots in the open
source community and have been used and leveraged in an open source
manner. This has been fundamental for spring-boarding their growth in the
industry.
• We often see companies who also provide a commercial version of the
database, and services and support of the technology, at the same time
providing sponsorship of the open source counterpart.
Examples of this include:
• IBM Cloudant for CouchDB,
• Datastax for Apache Cassandra, and
• MongoDB, which also has its own open source version of the MongoDB database.
48. MongoDB
• MongoDB is the poster child for the NoSQL database movement. If asked to
name a NoSQL database, most people will say MongoDB, and many people start
with MongoDB when looking at NoSQL technology.
• It is open source (free to use) and primarily written in C++.
• MongoDB uses JSON-like documents with optional schemas.
• MongoDB uses its own BSON (short for Binary JSON) storage format to reduce
the amount of space and processing required to store JSON documents. This
binary representation provides efficient serialization that is useful for storage and
network transmission.
• One of the benefits of document databases is that each document truly offers a
flexible schema, where no two documents need to be the same or contain the
same information.
• One of the main strengths of MongoDB is the range of programming
languages it officially supports (C, C++, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala).
49. MongoDB in the cloud
Several providers built on Amazon and Azure offer hosted MongoDB
database services. MongoDB supports the following public cloud platforms:
1. Amazon EC2
2. dotCloud
3. Google Compute Engine
4. Joyent Cloud
5. Rackspace Cloud
6. Red Hat OpenShift
7. VMWare Cloud Foundry
8. Windows Azure