NoSQL for AI
HAGAR IBRAHIEM
NOSQL -lecture  1 mongo database expalnation.pdf
RDBMS over File-based system -ACID Concept
ACID is an acronym that represents a set of properties that guarantee the
accuracy and integrity of data in databases.
ACID Concept - Atomicity
• Atomicity ensures that a transaction is treated as a single, indivisible
unit of work.
• Either all the changes made by the transaction are committed to
the database, or none of them are.
• If any part of the transaction fails, the entire transaction is rolled
back, and the database remains unchanged.
Example:
Consider a bank transfer where money is being withdrawn from one
account and deposited into another. Atomicity ensures that either
both the withdrawal and the deposit occur, or neither happens.
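The bank transfer above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not a production design: the accounts table, the account names, and the transfer helper are all hypothetical.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown source account or insufficient funds")
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise ValueError("unknown destination account")
        return True
    except ValueError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)        # both updates are committed
failed = transfer(conn, "alice", "carol", 30)  # no such account: withdrawal is rolled back
```

After the failed call, the withdrawal from the first account has been undone, so the balances are the same as after the successful transfer.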
ACID Concept - Consistency
• Consistency guarantees that a transaction brings the database from
one valid state to another.
• The database must satisfy a set of integrity constraints before and after
the transaction.
• If a transaction violates the database's consistency rules, it is rolled
back.
Example:
In a database where each user has a defined account balance and balances
may not go negative, consistency ensures that a transfer cannot leave the
database in a state where a balance is negative or the total balance across
the accounts involved is not preserved.
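One way to picture a consistency rule is a CHECK constraint. The sketch below assumes a hypothetical accounts table with a rule that balances may not go negative; it uses sqlite3 only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # the integrity constraint
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def total(conn):
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

before = total(conn)
try:
    with conn:  # one transaction: rolled back as a whole if any statement fails
        conn.execute("UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'")
        # This would make alice's balance negative, so the CHECK constraint rejects it.
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'alice'")
    violated = False
except sqlite3.IntegrityError:
    violated = True
after = total(conn)
```

Because the second update violates the constraint, the whole transaction is rolled back and the total balance is unchanged.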
ACID Concept - Isolation
• Isolation ensures that the execution of one transaction appears isolated from the execution of
other transactions, even when multiple transactions are executing concurrently. The goal
of isolation is to prevent interference between transactions and to maintain the
consistency of the database.
• Isolation is typically implemented through mechanisms such as locks, isolation levels, and
transaction boundaries. Different isolation levels define the degree to which transactions
are isolated from each other, and they determine how changes made by one
transaction become visible to other concurrently executing transactions.
• Read Uncommitted: a transaction may read changes that other transactions have not yet
committed (dirty reads). This level has no close equivalent in Git.
• Read Committed: a transaction sees only changes that other transactions have committed.
A loose Git analogy: commits on a developer's local branch are not visible to other
developers until they are pushed to the shared repository.
• Serializable: concurrent transactions produce the same result as if they had run one
after another in some order. A loose Git analogy: rebasing replays commits one at a time
into a single linear history, although Git, unlike a database, may stop and ask the
developer to resolve conflicts.
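The visibility rule behind these levels can be observed with two connections to the same database file. The sketch below assumes SQLite in its default journal mode and a hypothetical accounts table; it shows only that an uncommitted change is invisible to another connection (no dirty read), not the full range of isolation levels.

```python
import os
import sqlite3
import tempfile

def read_balance(conn):
    # fetchall() exhausts the cursor, so the reader's lock is released promptly.
    return conn.execute(
        "SELECT balance FROM accounts WHERE name = 'alice'"
    ).fetchall()[0][0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bank.db")
    writer = sqlite3.connect(path)
    reader = sqlite3.connect(path)

    writer.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    writer.execute("INSERT INTO accounts VALUES ('alice', 100)")
    writer.commit()

    # The writer changes the balance but has not committed yet.
    writer.execute("UPDATE accounts SET balance = 40 WHERE name = 'alice'")
    seen_before_commit = read_balance(reader)  # still the old, committed value

    writer.commit()
    seen_after_commit = read_balance(reader)   # the new value is now visible

    writer.close()
    reader.close()
```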
ACID Concept - Durability
• Durability guarantees that once a transaction is committed, its changes to the
database persist even in the face of subsequent failures.
• The changes are permanently stored in the database and are not lost, even if the
system crashes.
• The data related to the completed transaction will persist even in the case of network
or power outages. A transaction that fails later will not affect data that has already been committed.
Example:
After a user submits a form updating their profile information, durability ensures that the
changes are permanently stored in the database and will not be lost due to a system
failure.
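A small sketch of the profile example, assuming a hypothetical profiles table stored in a SQLite file. Closing the connection stands in for a crash here, which is a simplification: the committed insert survives, while the update that was never committed is lost.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profiles.db")

    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE profiles (user TEXT PRIMARY KEY, city TEXT)")
    conn.execute("INSERT INTO profiles VALUES ('user1', 'Cairo')")
    conn.commit()  # from here on, the row must survive a failure

    conn.execute("UPDATE profiles SET city = 'Giza' WHERE user = 'user1'")
    conn.close()   # closed without commit: the pending update is discarded

    reopened = sqlite3.connect(path)
    city = reopened.execute("SELECT city FROM profiles").fetchall()[0][0]
    reopened.close()
```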
Introduction to Big Data
Data has become an ever-expanding array of information. It is collected
as:
• User information,
• Geographic location data,
• Sensor-generated data,
• Social media feeds, and in many other forms.
This massive set of largely unstructured data, commonly known as big
data, has become the backbone of analysis for many mission-critical
applications.
Introduction to Big Data
Data continues to grow in volume, variety and velocity at an unprecedented
pace, and companies are searching for new ways to capture, store and analyse it.
Big Data refers to vast and complex datasets that cannot be effectively managed,
processed, or analyzed using traditional data processing tools and methods. These
datasets typically exhibit three main characteristics, often referred to as the 3Vs:
• Volume: Big Data involves massive amounts of data, often ranging from terabytes
to petabytes or more. This data can come from various sources, including social
media, sensors, devices, and transaction records.
• Velocity: Data is generated at an unprecedented speed. For example, social
media platforms generate millions of posts, comments, and interactions every
minute. This real-time data influx requires rapid processing and analysis.
• Variety: Big Data is heterogeneous and can include structured data (e.g.,
databases), semi-structured data (e.g., JSON or XML), and unstructured data (e.g.,
text, images, videos). Handling this diverse data is a significant challenge.
Challenges and Considerations
Big data brings many challenges that call for new ways to handle the
scale of the data and to perform new types of transformations and
analytics.
1. Data Storage & Processing Challenges:
Traditional approaches to data storage, processing and management, including
traditional relational databases, struggle with the volume and
variety of data generated today. Their structure and scaling limitations
make it difficult to manage Big Data effectively using traditional database
systems. Fortunately, a wave of new technologies has arrived alongside
big data.
Two main kinds of technology are important:
1. Distributed systems and parallel processing, and
2. NoSQL database systems
Distributed Systems
• Fragmentation of data (partitioning, or
sharding, of data).
• Replication of data.
• Yet relational databases often can't run
efficiently on distributed systems. Why?
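Partitioning and replication can be sketched in a few lines. The node names, the replica count, and the put/get helpers below are hypothetical; real systems use more elaborate placement schemes, so treat this as an illustration of the idea only.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICAS = 2                                      # each record is stored on two nodes

def stable_hash(key: str) -> int:
    """Hash that is identical across runs (unlike Python's built-in hash())."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

def placement(key: str, nodes=NODES, replicas=REPLICAS):
    """Partitioning: pick a primary node by hash. Replication: add the next nodes."""
    start = stable_hash(key) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

cluster = {node: {} for node in NODES}

def put(key, value):
    for node in placement(key):
        cluster[node][key] = value

def get(key, down=()):
    """Read from the first replica that is not listed in `down`."""
    for node in placement(key):
        if node not in down:
            return cluster[node][key]
    raise LookupError("all replicas unavailable")

for user_id in ("u1", "u2", "u3", "u4", "u5"):
    put(user_id, {"id": user_id})

# Simulate the primary node for "u1" being down: the replica still answers.
primary = placement("u1")[0]
survivor = get("u1", down=(primary,))
```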
Why relational databases can't run efficiently on
distributed systems
1. ACID Properties and Consistency
Relational databases adhere to ACID properties, so achieving strong consistency and
integrity in a distributed system can be challenging due to the potential
for network partitions and the need for synchronous communication
among nodes.
• Example:
Imagine a distributed relational database where every node must
agree on every transaction to maintain consistency. This requirement
can lead to high-latency communication.
2. Vertical Scaling vs. Horizontal Scaling
• Traditional relational databases are designed for vertical scaling, where you add
more resources (CPU, RAM) to a single server to handle increased load.
• However, this approach has limitations in terms of cost and scalability.
• In a cluster, the emphasis is on horizontal scaling by adding more nodes, which
may not align with the architecture of traditional relational databases.
Example:
A relational database optimized for a single, powerful server may struggle when
distributed across multiple nodes, each with its own subset of data.
Vertical vs. Horizontal Scaling
• Vertical scaling
• refers to increasing the processing power of a
single server.
• Both relational and non-relational databases
can scale up, but eventually, there will be a
limit in terms of maximum processing power.
• Additionally, there are increased costs with
scaling up to high-performing hardware, as
costs do not scale linearly.
• Horizontal scaling
• also known as scale-out
• refers to bringing on additional nodes to share
the load.
• This is difficult with relational databases because
related data is hard to spread across
nodes.
• In non-relational databases, collections
are self-contained and not coupled relationally,
so they can be distributed across nodes
more simply: queries do not have to "join"
them together across nodes.
3. Complex Joins and Transactions:
• Relational databases often involve complex join operations and
transactions that require coordination across multiple tables.
• Distributing such operations across nodes in a cluster can introduce
significant overhead due to the need for inter-node communication.
Example:
Consider a query that involves joining several large tables distributed
across different nodes. Coordinating this join operation can be less
efficient than when the tables are co-located on a single server.
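The cost of a cross-node join can be illustrated with a toy cluster in which every lookup is counted. The Cluster class, the key names, and the sample order are hypothetical; in a real cluster each counted lookup may be a network round trip.

```python
class Cluster:
    """Toy cluster: each node is a dict, and every get() is counted as one lookup."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]
        self.lookups = 0

    def node_of(self, key):
        return sum(key.encode("utf-8")) % len(self.nodes)  # simple deterministic placement

    def put(self, key, value):
        self.nodes[self.node_of(key)][key] = value

    def get(self, key):
        self.lookups += 1
        return self.nodes[self.node_of(key)][key]

# Normalized layout: the order only references its customer and products.
normalized = Cluster(3)
normalized.put("customer:7", {"name": "Mona"})
normalized.put("product:1", {"title": "Keyboard"})
normalized.put("product:2", {"title": "Mouse"})
normalized.put("order:1", {"customer": "customer:7", "items": ["product:1", "product:2"]})

def load_order_with_join(cluster, order_key):
    order = cluster.get(order_key)
    return {
        "customer": cluster.get(order["customer"]),
        "items": [cluster.get(key) for key in order["items"]],
    }

# Document layout: everything needed to display the order is embedded in one record.
embedded = Cluster(3)
embedded.put(
    "order:1",
    {"customer": {"name": "Mona"}, "items": [{"title": "Keyboard"}, {"title": "Mouse"}]},
)

joined = load_order_with_join(normalized, "order:1")
document = embedded.get("order:1")
```

Both layouts return the same order, but the joined version needs four lookups where the embedded document needs one. The trade-off is that embedded data is duplicated and must be kept in sync by the application.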
4. Data Distribution and Sharding:
Distributing data across nodes in a cluster (sharding) is a common
technique to achieve parallelism. However, relational databases may
face challenges when deciding how to shard data effectively without
introducing performance bottlenecks.
Example:
Sharding a large table on a specific column might lead to uneven
data distribution if that column has low cardinality or heavily skewed values,
resulting in certain nodes being overloaded.
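How much the choice of shard key matters can be checked with a quick experiment. The data set below is invented for illustration (10,000 orders, 90% from one country), and hash-based sharding is only one of several possible strategies.

```python
import hashlib
from collections import Counter

def shard_for(value: str, num_shards: int) -> int:
    """Map a key value to a shard using a hash that is stable across runs."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def distribution(rows, key, num_shards=4):
    """Return the number of rows that land on each shard for the given key."""
    counts = Counter(shard_for(str(row[key]), num_shards) for row in rows)
    return [counts.get(shard, 0) for shard in range(num_shards)]

# Hypothetical data: every tenth order comes from "US", the rest from "EG".
rows = [
    {"order_id": i, "country": "EG" if i % 10 else "US"}
    for i in range(10_000)
]

by_country = distribution(rows, "country")    # only two distinct values, heavily skewed
by_order_id = distribution(rows, "order_id")  # one distinct value per row

print("sharded by country: ", by_country)
print("sharded by order_id:", by_order_id)
```

With only two distinct country values, at most two shards receive any rows and one of them holds at least 90% of the data, while the order id spreads rows roughly evenly.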
5. Schema Changes and Schema Rigidity:
Traditional relational databases often have a rigid schema that requires
careful planning before introducing changes.
In a distributed environment, the need for schema changes across
multiple nodes can be complex and time-consuming.
Example:
Adding a new column to a table in a distributed relational database
might require coordination and schema updates on all nodes, potentially
leading to downtime or operational challenges.
While some relational databases have introduced features to support
distributed architectures (e.g., MySQL Cluster, or PostgreSQL with replication and sharding extensions):
• These adaptations often come with trade-offs (see the CAP theorem).
• They may not offer the same level of scalability and efficiency as native
NoSQL solutions designed explicitly for distributed computing.
In the context of Big Data and clusters, NoSQL databases like Apache
Cassandra or key-value stores like Amazon DynamoDB are often
preferred for their inherent scalability and ability to handle distributed
data efficiently.
NOSQL
- In the early 2000s, the amount of data that applications needed to store and
query increased sharply. This data came in all shapes and sizes. The demands of
these applications could not be met by the SQL technology of the time, and
early adopters each developed new databases to meet their needs.
- What encouraged NoSQL:
1. The decrease in storage cost.
2. The massive use of mobile applications.
3. The need for data distribution.
4. In addition, NoSQL databases allow developers to store huge amounts of
unstructured data without a pre-defined schema, giving them the flexibility
to make changes throughout their software stack (Agile methodology).
What is NOSQL?
- Not Only SQL
- Non-tabular databases
- A non-relational data management system that does not require a
fixed schema.
- NoSQL is used for Big Data and real-time web apps. For example,
companies like Twitter, Facebook and Google collect terabytes of user
data every single day.
- A major purpose of using a NoSQL database is data
distribution.
History of NOSQL - Timeline
NOSQL Benefits
1. The pace of development with NoSQL databases can be much faster than with
a SQL database. NoSQL databases are used in nearly every industry; use cases
range from the highly critical (e.g., storing financial data and healthcare
records) to the less critical (e.g., storing IoT readings).
2. Fast-paced Agile development: the structure of many different forms of data
is more easily handled and evolved with a NoSQL database.
3. Huge volumes of data: NoSQL was created to handle Big Data. The amount of
data in many applications cannot be served affordably by a SQL database.
4. Scale of traffic and fast queries: very high traffic and a need for zero downtime
can be hard to meet with a SQL database. In SQL, queries often need to join data from multiple tables,
and the joins can become expensive. Data in NoSQL databases is typically
stored in a way that is optimized for queries.
5. New application paradigms can be more easily supported. Easy for developers:
MongoDB maps its data structures to those of popular programming
languages; this mapping can allow developers to write less code, leading to
faster development time and fewer bugs.
Types of NoSQL database format & top engines
1- Document databases
• Store data in JSON or XML documents (not Word documents or
Google Docs, of course).
• In a document database, documents can be nested. (Example engine: MongoDB)
2- Key-value databases
• Every data element in the database is stored as a key-value pair.
• The key is the attribute name (such as state) and the value is its content (such as Alaska). (Example engine:
Redis)
3- Wide-column databases
• A relational database stores data in tables, where data is queried by
row, and where all rows have the same columns.
• A NoSQL wide-column store permits rows to have differing columns, resulting in a more flexible
data model that can evolve and adapt over time.
• In a wide-column store, columns are grouped into families that are stored separately, which makes data
easier to partition across distributed database systems. When you run analytics on a
small number of columns, you can read those columns directly without consuming memory with
unwanted data.
• What makes this model so flexible is that the structure of the column data can vary from row to
row. Used by eBay; the most popular engine is Cassandra.
4- Graph databases
• Store data in nodes and edges.
• Nodes typically store information about people, places, and things, while edges store information
about the relationships between the nodes.
• A graph database is optimized to capture and search the connections between data elements,
avoiding the overhead associated with JOINing multiple tables in SQL. (Example engine: Neo4j; typical
examples of graph-shaped data: Facebook and Twitter)
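The four data models can be approximated with plain Python structures. This is only a mental model with invented sample data; it says nothing about how MongoDB, Redis, Cassandra, or Neo4j store data internally.

```python
import json
from collections import deque

# 1. Document: one nested, self-describing record.
document = {
    "_id": 1,
    "name": "Sara",
    "address": {"city": "Cairo", "zip": "11511"},
    "orders": [{"sku": "kb-01", "qty": 1}, {"sku": "ms-02", "qty": 2}],
}
as_json = json.dumps(document)

# 2. Key-value: opaque values that are looked up by key only.
key_value = {"state": "Alaska", "session:42": as_json}

# 3. Wide-column: rows keyed by id, each row carrying its own set of columns.
wide_column = {
    "user:1": {"name": "Sara", "email": "sara@example.com"},
    "user:2": {"name": "Omar", "phone": "555-0100", "last_login": "2024-01-01"},
}

# 4. Graph: nodes plus edges; queries walk relationships instead of joining tables.
edges = {"sara": ["omar"], "omar": ["lina"], "lina": []}

def reachable(start):
    """Breadth-first walk: everyone connected to `start`, directly or indirectly."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in edges[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

city = json.loads(key_value["session:42"])["address"]["city"]
friends_of_friends = reachable("sara")
```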
What ties NoSQL databases together?
• One commonality is that the majority of them have their roots in the open
source community and have been used and leveraged in an open source
manner. This has been fundamental for spring-boarding their growth in the
industry.
• We often see companies who also provide a commercial version of the
database, and services and support of the technology, at the same time
providing sponsorship of the open source counterpart.
Examples of this include:
• IBM Cloudant for CouchDB,
• Datastax for Apache Cassandra, and
• MongoDB, which offers its own freely available community version of the database.
1 - MongoDB
• MongoDB is the poster child for the NoSQL database movement. If asked to
name a NoSQL database, most people will say MongoDB, and many people start
with MongoDB when looking at NoSQL technology.
• It is free to use and its source code is publicly available; it is written primarily in C++.
• MongoDB uses JSON-like documents with optional schemas.
• MongoDB uses its own BSON (short for Binary JSON) storage format, a binary
representation of JSON-like documents. This
binary representation provides efficient serialization that is useful for storage and
network transmission.
• One of the benefits of document databases is that each document truly offers a
flexible schema, where no two documents need to be the same or contain the
same information.
• One of the main strengths of MongoDB is its range of officially supported programming
languages (C, C++, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala).
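To give a feel for what "binary JSON" means, the sketch below hand-encodes a tiny document following the published BSON layout: a little-endian int32 total size, a list of typed elements, and a terminating zero byte. It is an illustration that supports only strings and 32-bit integers, not a replacement for the BSON library shipped with an official driver; the function names are invented for this example.

```python
import struct

def encode_cstring(text: str) -> bytes:
    """Field names are stored as UTF-8 bytes followed by a zero byte."""
    return text.encode("utf-8") + b"\x00"

def encode_element(name: str, value) -> bytes:
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # Type 0x02: string, prefixed with its byte length (including the zero byte).
        return b"\x02" + encode_cstring(name) + struct.pack("<i", len(data)) + data
    if isinstance(value, int) and not isinstance(value, bool):
        # Type 0x10: 32-bit signed integer, little-endian.
        return b"\x10" + encode_cstring(name) + struct.pack("<i", value)
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")

def encode_document(doc: dict) -> bytes:
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    # The size prefix counts itself (4 bytes) plus the body.
    return struct.pack("<i", len(body) + 4) + body

encoded = encode_document({"hello": "world"})
print(encoded)  # b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
```

The length prefixes are what let a reader skip over fields without parsing them, which is one reason a binary format can be faster to scan than JSON text.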
MongoDB in the cloud
Several cloud providers on Amazon and Azure offer hosted MongoDB
database services. MongoDB supports the following public cloud platforms:
1. Amazon EC2
2. dotCloud
3. Google Compute Engine
4. Joyent Cloud
5. Rackspace Cloud
6. Red Hat OpenShift
7. VMWare Cloud Foundry
8. Windows Azure
Document-based NOSQL Database Examples
MQL vs SQL
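As a rough illustration of the comparison, the same predicate can be written as a SQL WHERE clause and as an MQL filter document. In MongoDB the filter would be passed to a call such as db.people.find(...). The sample data is invented, and the matches function below is a toy stand-in that understands only equality and $gt; it is not MongoDB's query engine.

```python
import sqlite3

people = [
    {"name": "Sara", "age": 34, "city": "Cairo"},
    {"name": "Omar", "age": 28, "city": "Giza"},
    {"name": "Lina", "age": 41, "city": "Cairo"},
]

# SQL: the condition is a string in the query language.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO people VALUES (:name, :age, :city)", people)
sql_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM people WHERE city = ? AND age > ? ORDER BY name",
        ("Cairo", 30),
    )
]

# MQL: the same condition is itself a document.
mql_filter = {"city": "Cairo", "age": {"$gt": 30}}

def matches(doc, query):
    """Toy matcher for equality and $gt only (an assumption of this sketch)."""
    for field, condition in query.items():
        if isinstance(condition, dict):
            for operator, bound in condition.items():
                if operator != "$gt":
                    raise ValueError(f"operator not supported in this sketch: {operator}")
                if field not in doc or not doc[field] > bound:
                    return False
        elif doc.get(field) != condition:
            return False
    return True

mql_names = sorted(doc["name"] for doc in people if matches(doc, mql_filter))
```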

NOSQL -lecture 1 mongo database expalnation.pdf

  • 6. RDBMS over File-based system -ACID Concept ACID is an acronym that represents a set of properties that guarantee the accuracy and integrity of data in databases.
  • 7. ACID Concept - Atomicity  Atomicity ensures that a transaction is treated as a single, indivisible unit of work.  Either all the changes made by the transaction are committed to the database, or none of them are.  If any part of the transaction fails, the entire transaction is rolled back, and the database remains unchanged. Example: Consider a bank transfer where money is being withdrawn from one account and deposited into another. Atomicity ensures that either both the withdrawal and the deposit occur, or neither happens.
  • 8. ACID Concept - Consistency  Consistency guarantees that a transaction brings the database from one valid state to another.  The database must satisfy a set of integrity constraints before and after the transaction.  If a transaction violates the database's consistency rules, it is rolled back. Example: In a database where each user has a defined account balance, consistency ensures that a transaction doesn't leave the database in a state where the total balance is not preserved.
  • 9. ACID Concept - Isolation  Ensures that the execution of one transaction appears isolated from the execution of other transactions, even when multiple transactions are executing concurrently. The goal of isolation is to prevent interference between transactions and to maintain the consistency of the database.  Isolation is typically implemented through mechanisms such as locks, isolation levels, and transaction boundaries. Different isolation levels define the degree to which transactions are isolated from each other, and they determine how changes made by one transaction become visible to other concurrently executing transactions.  Read Committed : In Git, once a developer commits changes to their local branch, those changes are not visible to other developers until they push the changes to the shared repository.  Read Uncommitted (Not applicable in Git): In the context of databases, Read Uncommitted allows a transaction to read uncommitted changes by other transactions.  Serializable: When merging or rebasing, Git ensures that changes from one branch are applied in a way that avoids conflicts and maintains a consistent history.
  • 10. ACID Concept - Durability  Durability guarantees that once a transaction is committed, its changes to the database persist even in the face of subsequent failures.  The changes are permanently stored in the database and are not lost, even if the system crashes.  The data related to the completed transaction will persist even in the case of network or power outages. If a transaction fails, it will not impact the already changed data. Example: After a user submits a form updating their profile information, durability ensures that the changes are permanently stored in the database and will not be lost due to a system failure.
  • 21. Introduction to Big Data Data has become an ever-expanding array of information. It is collected as: • User information, • Geographic location data, • Sensor-generated data, • Social media feed, and in many other forms. This massive set of unstructured data which is commonly known as big data has now become the backbone of analysis for many mission- critical applications. LINK
  • 22. Introduction to Big Data Data continues to grow in volume, variety and velocity at an unprecedented fast speed, and companies are searching for new ways to capture, store and analyse it. Big Data refers to vast and complex datasets that cannot be effectively managed, processed, or analyzed using traditional data processing tools and methods. These datasets typically exhibit three main characteristics, often referred to as the 3Vs: • Volume: Big Data involves massive amounts of data, often ranging from terabytes to petabytes or more. This data can come from various sources, including social media, sensors, devices, and transaction records. • Velocity: Data is generated at an unprecedented speed. For example, social media platforms generate millions of posts, comments, and interactions every minute. This real-time data influx requires rapid processing and analysis. • Variety: Big Data is heterogeneous and can include structured data (e.g., databases), semi-structured data (e.g., JSON or XML), and unstructured data (e.g., text, images, videos). Handling this diverse data is a significant challenge.
  • 24. Challenges and Considerations Big data brings out many challenges that call for new ways to handle the scale of that data and perform new types of transformations and analytics 1.Data Storage & Processing Challenges: Traditional ways of data storage, processing and management, as well as traditional relational databases face challenges in handling the volume and variety of data generated in today's world. The structure and scaling limitations make it difficult to manage Big Data effectively using traditional database systems. But luckily, a wave of new technologies is also coming along with the big data. Two main kinds of important technologies 1. Distributed systems and Parallel Processing, and 2. NoSQL database systems
  • 25. Distributed Systems • LINK • Fragmentation of Data (Partitioning/or Sharding of data). • Replication of Data • Yet , relational databases can’t run efficiently on distributed systems??!
  • 26. Why Relational databases can’t run efficiently on Distributed Systems ?! 1. ACID Properties and Consistency RDB adhere to ACID properties, thus achieving strong consistency / integrity in a distributed system can be challenging due to the potential for network partitions and the need for synchronous communication among nodes. • Example: Imagine a distributed relational database where each node needs to agree on every transaction to maintain consistency. This requirement may lead to high-latency communication.
  • 27. 2. Vertical Scaling vs. Horizontal Scaling • Traditional relational databases are designed for vertical scaling, where you add more resources (CPU, RAM) to a single server to handle increased load. • However, this approach has limitations in terms of cost and scalability. • In a cluster, the emphasis is on horizontal scaling by adding more nodes, which may not align with the architecture of traditional relational databases. Example: A relational database optimized for a single, powerful server may struggle when distributed across multiple nodes, each with its own subset of data. Why Relational databases can’t run efficiently on Distributed Systems ?!
  • 28. Vertical vs. Horizontal Scaling • Vertical scaling • refers to increasing the processing power of a single server. • Both relational and non-relational databases can scale up, but eventually, there will be a limit in terms of maximum processing power. • Additionally, there are increased costs with scaling up to high-performing hardware, as costs do not scale linearly. • Horizontal scaling • known as scale-out • refers to bringing on additional nodes to share the load. • This is difficult with relational databases due to the difficulty in spreading out related data across nodes. • With non-relational databases, since collections are self-contained and not coupled relationally. This allows them to be distributed across nodes more simply, as queries do not have to “join” them together across nodes.
  • 29. 3. Complex Joins and Transactions: • Relational databases often involve complex join operations and transactions that require coordination across multiple tables. • Distributing such operations across nodes in a cluster can introduce significant overhead due to the need for inter-node communication. Example: Consider a query that involves joining several large tables distributed across different nodes. Coordinating this join operation can be less efficient than when the tables are co-located on a single server. Why Relational databases can’t run efficiently on Distributed Systems ?!
  • 30. 4. Data Distribution and Sharding: Distributing data across nodes in a cluster (sharding) is a common technique to achieve parallelism. However, relational databases may face challenges when deciding how to shard data effectively without introducing performance bottlenecks. Example: Sharding a large table based on a specific column might lead to uneven data distribution if that column has high cardinality, resulting in certain nodes being overloaded. Why Relational databases can’t run efficiently on Distributed Systems ?!
  • 31. 5. Schema Changes and Schema Rigidity: Traditional relational databases often have a rigid schema that requires careful planning before introducing changes. In a distributed environment, the need for schema changes across multiple nodes can be complex and time-consuming. Example: Adding a new column to a table in a distributed relational database might require coordination and schema updates on all nodes, potentially leading to downtime or operational challenges. Why Relational databases can’t run efficiently on Distributed Systems ?!
  • 32. While some relational databases have introduced features to support distributed architectures (e.g., MySQL Cluster, PostgreSQL): • These adaptations often come with trade-offs (CAP Theory) • May not offer the same level of scalability and efficiency as native NoSQL solutions designed explicitly for distributed computing. In the context of Big Data and clusters, NoSQL databases like Apache Cassandra or key-value stores like Amazon DynamoDB are often preferred for their inherent scalability and ability to handle distributed data efficiently. Why Relational databases can’t run efficiently on Distributed Systems ?!
  • 34. NOSQL - In the early 2000s, the amount of data that applications needed to store and query increased. This data came in all shapes and sizes. The demands of these applications could not be served by SQL technology, and each of the early companies developed new databases to meet its needs. - What encouraged NoSQL: 1. Decrease in storage cost. 2. Massive use of mobile applications. 3. Data distribution. 4. In addition, NoSQL databases allow developers to store huge amounts of unstructured data without a pre-defined schema, giving them the flexibility to make changes throughout their software stack (Agile methodology).
  • 35. What is NOSQL? - Not Only SQL - Non-tabular databases - A non-relational data management system that does not require a fixed schema. - NoSQL is used for Big Data and real-time web apps. For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day. - The major purpose of using a NoSQL database is data distribution.
  • 37. History of NOSQL - Timeline
  • 38. NOSQL Benefits 1. The pace of development with NoSQL databases can be much faster than with a SQL database. NoSQL databases are used in nearly every industry; use cases range from the highly critical (e.g., storing financial data and healthcare records) to the less critical (e.g., storing IoT readings). 2. Fast-paced Agile development: The structure of many different forms of data is more easily handled and evolved with a NoSQL database. 3. Storing huge volumes of data: NoSQL was created to handle Big Data. The amount of data in many applications cannot be served affordably by a SQL database. 4. Scale of traffic and fast queries: The scale of traffic and the need for zero downtime cannot always be handled by a SQL database. In SQL, a query often needs to join data from multiple tables, and the joins can become expensive. Data in NoSQL databases, by contrast, is typically stored in a way that is optimized for queries. 5. New application paradigms can be more easily supported. Easy for developers: MongoDB maps its data structures to those of popular programming languages. This mapping can allow developers to write less code, leading to faster development time and fewer bugs.
  • 39. Types of NOSQL database format & top engines 1- Document databases • Store data in JSON or XML documents (not Word documents or Google Docs, of course). • In a document database, documents can be nested. Example engine: MongoDB.
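As a rough illustration of a nested document, the sketch below builds one as a Python dictionary and round-trips it through JSON with the standard `json` module. The field names and values are invented for the example, and no database driver is involved.

```python
import json

# Hypothetical order document: the customer and the items are nested
# inside the order instead of living in separate tables.
order_document = {
    "_id": 1001,
    "customer": {
        "name": "Sara",
        "address": {"city": "Cairo", "country": "Egypt"},
    },
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# Serialize to JSON text and read it back.
as_json = json.dumps(order_document)
restored = json.loads(as_json)

# Nested fields are reached by walking the structure; no join is needed.
city = restored["customer"]["address"]["city"]
total_qty = sum(item["qty"] for item in restored["items"])
print(city, total_qty)
```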
  • 40. 2- Key-Value DB • Every data element in the database is stored as a key-value pair. • Each pair consists of a key or attribute name (such as state) and a value (such as Alaska). Example engine: Redis.
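The key-value model can be sketched with an in-memory dictionary. The `KeyValueStore` class below is a hypothetical toy, not the API of Redis or any other engine; it only shows that every operation is addressed by key.

```python
class KeyValueStore:
    """Toy in-memory key-value store (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Insert a new pair or overwrite the value of an existing key.
        self._data[key] = value

    def get(self, key, default=None):
        # Look a value up by key; return the default if the key is absent.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the pair and report whether the key existed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("state", "Alaska")
print(store.get("state"))
```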
  • 41. 3- Wide-column databases • A relational database stores data in tables, where data is queried by row and all rows have the same columns. • A NoSQL wide-column store permits rows to have differing columns, resulting in a more flexible data model that can evolve and adapt over time. • In a wide-column store, each column is stored separately, enabling data to be partitioned more easily across distributed database systems. This means that when you want to run analytics on a small number of columns, you can read those columns directly without consuming memory with unwanted data. • What makes this model so flexible is that the structure of the column data can vary from row to row. Used by eBay; the most popular engine is Cassandra.
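A minimal sketch of rows with differing columns, using nested Python dictionaries keyed by row key. The row keys, column names, and the helper `read_column` are made up for illustration and do not reflect how Cassandra lays out data on disk.

```python
# Each row key maps to its own set of columns; rows need not share columns.
rows = {
    "user:1": {"name": "Mona", "email": "mona@example.com"},
    "user:2": {"name": "Omar", "phone": "555-0100", "city": "Giza"},
}


def read_column(rows, column):
    """Return only the requested column, skipping rows that lack it."""
    return {
        row_key: columns[column]
        for row_key, columns in rows.items()
        if column in columns
    }


print(read_column(rows, "name"))
print(read_column(rows, "phone"))
```

Reading a single column touches only the rows that actually define it, which is the flexibility the slide describes.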
  • 42. 4- Graph databases • Store data in nodes and edges. • Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes. A graph database is optimized to capture and search the connections between data elements, avoiding the overhead associated with JOINing multiple tables in SQL. Example engine: Neo4j. Graph databases are used by companies such as Facebook and Twitter.
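The node-and-edge model can be sketched in plain Python with a breadth-first traversal. The sample nodes, the relationship names `FOLLOWS` and `LIVES_IN`, and the helpers `neighbors` and `reachable` are assumptions for illustration; a real graph database such as Neo4j has its own storage format and query language.

```python
from collections import deque

# Nodes hold information about people and places.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "carol": {"type": "person"},
    "cairo": {"type": "place"},
}

# Edges hold relationships as (source, relation, destination).
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("alice", "LIVES_IN", "cairo"),
]


def neighbors(node, relation):
    """Nodes directly connected to `node` by the given relation."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


def reachable(start, relation, max_hops):
    """Breadth-first search along one relation, up to `max_hops` edges away."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbors(node, relation):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, hops + 1))
    return found


print(reachable("alice", "FOLLOWS", 2))
```

Following edges hop by hop replaces the chain of self-joins that the same question would need in a relational schema.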
  • 43. What ties NoSQL databases together? • One commonality is that the majority of them have their roots in the open source community and have been used and leveraged in an open source manner. This has been fundamental in spring-boarding their growth in the industry. • We often see companies that provide a commercial version of the database, along with services and support for the technology, while also sponsoring the open source counterpart. Examples of this include: • IBM Cloudant for CouchDB, • DataStax for Apache Cassandra, and • MongoDB, which has its own open source version of the Mongo database too.
  • 46. 1 - Mongo DB
  • 48. Mongo DB • MongoDB is the poster child for the NoSQL database movement. If asked to name a NoSQL database, most people will say MongoDB, and many people start with MongoDB when looking at NoSQL technology. • It is open source (free to use) and primarily written in C++. • MongoDB uses JSON-like documents with optional schemas. • MongoDB uses its own BSON (short for Binary JSON) storage format to reduce the amount of space and processing required to store JSON documents. This binary representation provides efficient serialization that is useful for storage and network transmission. • One of the benefits of document databases is that each document offers a flexible schema: no two documents need to have the same structure or contain the same information. • One of the main strengths of MongoDB is the range of programming languages it officially supports (C, C++, C#, Java, Node.js, Perl, PHP, Python, Ruby, Scala).
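The flexible-schema point can be illustrated without a database. The sketch below uses a plain Python list as a stand-in for a collection; the documents and the helpers `find` and `field_names` are hypothetical and are not the MongoDB driver API.

```python
# Stand-in for a collection: documents need not share the same fields.
collection = [
    {"_id": 1, "name": "Laptop", "price": 1200, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "E-book", "price": 15,
     "download_url": "https://example.com/book"},
    {"_id": 3, "name": "Gift card"},
]


def find(collection, **criteria):
    """Return documents whose fields equal every given criterion.

    Documents that lack a field named in the criteria simply do not match.
    """
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value
               for key, value in criteria.items())
    ]


def field_names(collection):
    """All field names used by at least one document."""
    return sorted({key for doc in collection for key in doc})


print(find(collection, price=15))
print(field_names(collection))
```

A new field can be added to one document without touching the others, which is what makes schema evolution cheap in this model.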
  • 49. MongoDB in the cloud Several cloud providers on Amazon and Azure offer hosted MongoDB database services. MongoDB supports the following public cloud platforms: 1. Amazon EC2 2. dotCloud 3. Google Compute Engine 4. Joyent Cloud 5. Rackspace Cloud 6. Red Hat OpenShift 7. VMWare Cloud Foundry 8. Windows Azure