Referent
Einrichtung Titel des Vortrages 1
WP-Benchmarking Top NoSQL
Databases
Apache Cassandra, Apache HBase and MongoDB
Presented By
Athiq Ahamed
Supriya
Introduction
• Enormous volumes of data (Big Data)
• Scalability issues in RDBMS
• Rise of NoSQL databases
• Amazon Dynamo
• Google Bigtable
• CAP theorem
• BASE systems
CAP Theorem
 Consistency
 Availability
 Partition tolerance
The CAP theorem states that a distributed system can guarantee at most
two of these three properties at the same time.
RDBMS vs NoSQL
RDBMS | NoSQL
Supports a powerful query language | Supports a very simple query language
Has a fixed schema | No fixed schema
Follows ACID (Atomicity, Consistency, Isolation and Durability) | Is only eventually consistent
Supports transactions | Does not support transactions
Content: tutorialspoint.com
BASE
• Basically available: the system guarantees availability, in the sense of the CAP theorem.
• Soft state: the state of the system may change over time, even without new input, because of the eventual consistency model.
• Eventual consistency: the system becomes consistent over time, once it stops receiving new updates.
Content: www.edureka.in
Selection of NoSQL
• Fast performance is the key.
• Proof-of-concept (POC) processes should include the right benchmarks:
  • Configurations
  • Parameters
  • Workloads
Making the right choice!
Benchmark configuration
• Yahoo Cloud Serving Benchmark (YCSB)
• Top three NoSQL databases: Apache Cassandra, Apache HBase and MongoDB
• Amazon Web Services EC2 instances hosted the tests
• Each test was performed three times, on three different days
Benchmark configuration
• The tests ran on large instances (15 GB RAM and 4 CPU cores).
• Instances used a customized Ubuntu image with Oracle Java 1.6 installed as a base.
• A custom script was written to drive the benchmark processes.
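The custom driver script itself is not shown in the slides. As a rough sketch of what such a script might do, the Python below only assembles YCSB command lines for a load phase and a run phase. The `bin/ycsb` launcher, the `-P`, `-p`, `-threads` and `-s` options and `workloads/workloada` come from the YCSB distribution; the binding name, host address, record count and thread count are placeholder assumptions, not the values used in these tests.

```python
from typing import Dict, List


def ycsb_command(phase: str, binding: str, workload: str,
                 props: Dict[str, str], threads: int) -> List[str]:
    """Build (but do not execute) a YCSB command line as an argv list."""
    if phase not in ("load", "run"):
        raise ValueError("phase must be 'load' or 'run'")
    argv = ["bin/ycsb", phase, binding, "-P", workload,
            "-threads", str(threads), "-s"]
    # Each -p flag overrides one property from the workload file.
    for key, value in sorted(props.items()):
        argv += ["-p", f"{key}={value}"]
    return argv


# Hypothetical values, for illustration only.
props = {"hosts": "10.0.0.1", "recordcount": "1000000"}
load_cmd = ycsb_command("load", "cassandra-cql", "workloads/workloada", props, 32)
run_cmd = ycsb_command("run", "cassandra-cql", "workloads/workloada", props, 32)
print(" ".join(load_cmd))
```

A real driver would also start and stop the EC2 instances and collect the YCSB output, which this sketch leaves out.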
Understanding NoSQL Databases
• Each NoSQL system performs differently.
• The differences come from each system's components and internal workings.
• Apache Cassandra: wide-column (column-family) database model
• Apache HBase: wide-column (column-family) database model
• MongoDB: document storage database model
Apache Cassandra
• Cassandra is scalable, fault-tolerant, and consistent. All nodes are equal.
• Its distribution design is based on Amazon's Dynamo and its data model on Google's Bigtable.
• Key components: node, cluster, commit log, mem-table, SSTable and Bloom filter
Content: http://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
Apache Cassandra
• Ring structure, peer-to-peer architecture
• All nodes are equal
• This improves overall database availability
• Scaling up and scaling down is easier
• Cassandra is a key-value, column-oriented database
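To make the ring idea concrete, here is a minimal sketch of hash-based placement on a ring. It assumes each node owns a single token derived from an MD5 hash of its name; the node names and the `TokenRing` class are hypothetical, and this is not Cassandra's actual partitioner code.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(key: str) -> int:
    """Map a string to a position on the ring using an MD5 hash."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Toy ring: every node is a peer and owns the range ending at its token."""

    def __init__(self, nodes: List[str]) -> None:
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)

    def owner(self, row_key: str) -> str:
        tokens = [t for t, _ in self._ring]
        # First node whose token is >= the key's token; wrap around at the end.
        idx = bisect.bisect_left(tokens, token(row_key)) % len(self._ring)
        return self._ring[idx][1]

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (token(node), node))


ring = TokenRing(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in (f"user{i}" for i in range(100))}
ring.add_node("node-d")
after = {k: ring.owner(k) for k in before}
moved = [k for k in before if before[k] != after[k]]
```

In this sketch, every key that changes owner moves to the newly added node and all other keys stay where they were, which is one reason scaling a ring up or down is comparatively cheap.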
Apache Cassandra
Content: http://demoiselle.sourceforge.net/component/demoiselle-cassandra/1.0.0/images/datamodel1.png
Apache Cassandra
• Cassandra has an internal keyspace called system, which stores metadata about the cluster.
• Metadata:
  • The node's token
  • The cluster name
  • Keyspace and schema definitions (dynamic loading)
  • Whether or not the node is bootstrapped
Content: https://www.edureka.co/blog/category/apache-cassandra/
Apache Cassandra
• Commit log: crash-recovery mechanism. Every write operation is written to the commit log.
• Mem-table: a memory-resident data structure.
• SSTable: a disk file to which the data is flushed from the mem-table.
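The interplay of these three components can be sketched in a few lines. The `ToyStore` class below is a hypothetical, heavily simplified model (plain Python lists and dicts stand in for files on disk, and the flush threshold is an arbitrary assumption); it is meant only to show the order of the steps, not Cassandra's real storage engine.

```python
from typing import Dict, List, Optional, Tuple


class ToyStore:
    """Simplified write path: commit log -> mem-table -> SSTable."""

    def __init__(self, flush_threshold: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []   # append-only, for crash recovery
        self.memtable: Dict[str, str] = {}            # in memory, holds recent writes
        self.sstables: List[Dict[str, str]] = []      # stand-ins for immutable disk files
        self.flush_threshold = flush_threshold

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))          # 1. log the write first
        self.memtable[key] = value                    # 2. then update the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # 3. persist the mem-table as a sorted SSTable and start a fresh one
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()                       # flushed data no longer needs the log

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):         # newest SSTable wins
            if key in table:
                return table[key]
        return None


store = ToyStore(flush_threshold=2)
store.write("k1", "v1")
store.write("k2", "v2")        # reaches the threshold and triggers a flush
store.write("k1", "v1-new")    # newer value lives in the mem-table
```

Reads check the mem-table before the SSTables, so the most recent write for a key is returned even when an older value has already been flushed.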
Apache Cassandra
• Bloom filters are used as a performance booster.
• A Bloom filter is a fast, probabilistic data structure for testing whether an element is a member of a set. It can return false positives but never false negatives.
• Bloom filters serve as a special kind of cache: they reside in memory, so lookups are quick.
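A small Bloom filter is easy to sketch. The version below is illustrative only: the bit-array size, the number of hash functions and the use of salted SHA-256 are assumptions chosen for brevity, not Cassandra's parameters.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit array plus k hash functions; false positives possible, false negatives not."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Derive k positions by salting one hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter()
for key in ("row-1", "row-2", "row-3"):
    bf.add(key)
```

A store can consult such a filter before touching disk: a negative answer means the file cannot contain the key, so the disk read is skipped.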
Apache Cassandra
• Gossip protocol: communication between nodes, coordination and failure detection
• Anti-entropy protocol: replica synchronization mechanism ensuring that data on different nodes is kept up to date (uses Merkle trees)
• Snitches determine host proximity
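The Merkle-tree comparison behind anti-entropy can be illustrated with a small sketch. It assumes both replicas split their data into the same list of ranges, and it compares all leaf hashes once the roots differ instead of descending the tree level by level as a real implementation would; the function names and sample data are hypothetical.

```python
import hashlib
from typing import Dict, List


def _h(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def leaf_hashes(ranges: List[Dict[str, str]]) -> List[str]:
    """One leaf hash per range, computed over its sorted rows."""
    return [_h(repr(sorted(rows.items()))) for rows in ranges]


def root_hash(leaves: List[str]) -> str:
    """Combine hashes pairwise until a single root remains."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash on odd levels
            level = level + [level[-1]]
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def ranges_to_repair(a: List[Dict[str, str]], b: List[Dict[str, str]]) -> List[int]:
    """Return indexes of the ranges whose contents differ between two replicas."""
    la, lb = leaf_hashes(a), leaf_hashes(b)
    if root_hash(la) == root_hash(lb):     # equal roots: replicas agree
        return []
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]


replica_a = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}]
replica_b = [{"k1": "v1"}, {"k2": "stale"}, {"k3": "v3"}]
```

Only the ranges whose hashes disagree need to be repaired, so replicas exchange hashes rather than the full data.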
Apache Cassandra- Read/Write operation
Apache HBase
• A sparse, distributed, consistent, multidimensional sorted map
• HBase is a key/value store
• A key consists of row key, column family, column qualifier and timestamp
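The "sparse, multidimensional sorted map" description can be made concrete with a toy model. The `ToyTable` class, the sample rows and the `info` column family are hypothetical; a real HBase table keeps its cells physically sorted, whereas this sketch simply sorts at scan time.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, str, int]  # (row key, column family, qualifier, timestamp)


class ToyTable:
    """Sparse map from (row, family, qualifier, timestamp) to a value."""

    def __init__(self) -> None:
        self.cells: Dict[Cell, str] = {}

    def put(self, row: str, family: str, qualifier: str, ts: int, value: str) -> None:
        self.cells[(row, family, qualifier, ts)] = value

    def get(self, row: str, family: str, qualifier: str) -> Optional[str]:
        """Return the newest version of one cell, or None if it is absent."""
        versions = [(ts, v) for (r, f, q, ts), v in self.cells.items()
                    if (r, f, q) == (row, family, qualifier)]
        return max(versions)[1] if versions else None

    def scan(self, start_row: str, stop_row: str) -> List[str]:
        """Row keys in [start_row, stop_row), in sorted order."""
        rows = {r for (r, _, _, _) in self.cells if start_row <= r < stop_row}
        return sorted(rows)


table = ToyTable()
table.put("row-a", "info", "name", 1, "Ann")
table.put("row-a", "info", "name", 2, "Anna")   # newer version of the same cell
table.put("row-c", "info", "city", 1, "Oslo")   # sparse: row-c has no "name" cell
table.put("row-b", "info", "name", 1, "Bob")
```

Missing cells cost nothing to store, and the timestamp dimension lets several versions of the same cell coexist.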
Apache HBase
Content: http://zhangjunhd.github.io/assets/2013-02-25-apache-hbase/rowkey-
Apache HBase Components
• Region: a contiguous range of rows forms a region
• Region server (RS): serves one or more regions
• Master server: daemon responsible for managing the HBase cluster
• HDFS: distributed, open-source file system that stores HBase's data
• ZooKeeper: distributed, open-source coordination service for the master and region servers
Content: https://www.mapr.com/blog/in-depth-look-hbase-architecture
Apache HBase Architecture
First Read/Write to HBase
• The client obtains from ZooKeeper the region server (RS) that hosts the meta table.
• The client queries the meta table to find the RS that holds the corresponding row key.
• The client receives the row from the respective region server.
• The client caches this information, along with the location of the meta table server.
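The lookup steps above can be sketched as a client that caches region locations. Everything here is a hypothetical stand-in: the `ZOOKEEPER` and `META_TABLE` structures, the server names and the `ToyClient` class only model the sequence of lookups, not the HBase client API.

```python
from typing import List, Optional, Tuple

Region = Tuple[str, Optional[str], str]  # (start key, stop key or None, region server)

# Hypothetical cluster state, for illustration only.
ZOOKEEPER = {"meta_region_server": "rs-meta"}
META_TABLE: List[Region] = [("", "m", "rs-1"), ("m", None, "rs-2")]


def _covers(region: Region, row_key: str) -> bool:
    start, stop, _ = region
    return start <= row_key and (stop is None or row_key < stop)


class ToyClient:
    def __init__(self) -> None:
        self.meta_server: Optional[str] = None
        self.cached_regions: List[Region] = []
        self.meta_lookups = 0

    def locate(self, row_key: str) -> str:
        for region in self.cached_regions:          # cache hit: no extra round trips
            if _covers(region, row_key):
                return region[2]
        if self.meta_server is None:                # ask ZooKeeper only once
            self.meta_server = ZOOKEEPER["meta_region_server"]
        self.meta_lookups += 1                      # query the meta table
        region = next(r for r in META_TABLE if _covers(r, row_key))
        self.cached_regions.append(region)          # remember the region location
        return region[2]                            # the client would now contact this RS


client = ToyClient()
first = client.locate("apple")
second = client.locate("banana")   # same region, served from the cache
third = client.locate("zebra")     # different region, needs a meta lookup
```

After the first request, later requests for rows in an already known region skip both ZooKeeper and the meta table.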
HBase RS Components
• WAL: the write-ahead log is a file on the distributed file system. It stores new data that has not yet been persisted to permanent storage.
• Block cache: the read cache. It stores frequently read data in memory.
• MemStore: the write cache. It stores new data that has not yet been written to disk.
• HFiles store the rows as sorted key-values on disk.
HBase Write steps (1)
• The client's write is first written to the WAL file, which is stored on disk.
• The WAL is used to recover data that has not yet been persisted if a server crashes.
• Once the data is written to the WAL, it is placed in the MemStore.
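Why the WAL comes first is easiest to see by simulating a crash. In this sketch a plain Python list stands in for the durable WAL file and a dict for the volatile MemStore; the `ToyRegionServer` class and its `recover` method are hypothetical illustrations, not HBase code.

```python
from typing import Dict, List, Tuple


class ToyRegionServer:
    """WAL-then-MemStore write path with crash recovery by WAL replay."""

    def __init__(self, wal: List[Tuple[str, str]]) -> None:
        self.wal = wal                        # durable: survives a crash
        self.memstore: Dict[str, str] = {}    # volatile: lost on a crash

    def put(self, row_key: str, value: str) -> None:
        self.wal.append((row_key, value))     # 1. append to the WAL
        self.memstore[row_key] = value        # 2. then place in the MemStore

    def recover(self) -> None:
        """Rebuild the MemStore by replaying WAL entries in order."""
        for row_key, value in self.wal:
            self.memstore[row_key] = value


durable_wal: List[Tuple[str, str]] = []
server = ToyRegionServer(durable_wal)
server.put("row-1", "a")
server.put("row-1", "b")

# Simulate a crash: a new process starts with an empty MemStore but the same WAL.
restarted = ToyRegionServer(durable_wal)
empty_before_replay = dict(restarted.memstore)
restarted.recover()
```

Because every write reaches the WAL before the MemStore, replaying the log in order restores exactly the data that had not yet been flushed.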
HDFS Write steps (2)
• All writes and reads go to and from the primary node.
• HDFS replicates the WAL and HFile blocks. Replication happens automatically.
• When data is written to HDFS, one copy is written locally; it is then replicated to a secondary node and later to a tertiary node.
Cassandra and HBase
• Cassandra use case: availability and partition-tolerance requirements. Consistency is tunable and can be raised through the consistency-level option.
• HBase use case: consistency and scalability. However, with a small number of nodes/threads, it also achieves high availability.
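Tunable consistency is usually reasoned about with simple quorum arithmetic: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The sketch below only encodes that rule; the replication factor of 3 is an assumed example value.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return r + w > n


N = 3  # hypothetical replication factor
q = quorum(N)
```

With N = 3, reading and writing at quorum (2 + 2 > 3) gives overlapping replica sets, while reading and writing a single replica (1 + 1) does not, which is the trade-off between consistency and latency that the consistency-level option exposes.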
MongoDB
• Document-oriented database
• High performance and automatic scaling
• Strong consistency and partition tolerance
• Replication and failover for high availability
• Low latency
• Flexible indexing
MongoDB Concepts
• A document is the basic unit of data in MongoDB (comparable to a row)
• A collection is similar to a table
• A single instance can host multiple independent databases
• Every document has a special key, "_id"
• Powerful JavaScript shell for administration
• The config database contains the cluster metadata
MongoDB Simple Architecture
Read/Write MongoDB
• A mongos instance receives queries from applications
• It uses metadata from the config server to locate the data
• mongos directs write operations to a particular shard
• mongos uses the cluster metadata from the config database
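The routing role of mongos can be sketched as a lookup against chunk metadata. The `CHUNKS` table, the shard names and the integer shard key are hypothetical; the sketch assumes range-based sharding where each chunk covers shard-key values from its lower bound up to the next chunk's lower bound.

```python
import bisect
from typing import List, Tuple

# Hypothetical chunk metadata, as a config server might hold it:
# each entry is (inclusive lower bound of the shard key, shard name).
CHUNKS: List[Tuple[int, str]] = [(0, "shard-a"), (1000, "shard-b"), (5000, "shard-c")]


def route(shard_key: int) -> str:
    """Pick the shard whose chunk range contains the shard key."""
    bounds = [low for low, _ in CHUNKS]
    idx = bisect.bisect_right(bounds, shard_key) - 1
    if idx < 0:
        raise KeyError(f"no chunk covers shard key {shard_key}")
    return CHUNKS[idx][1]


def route_range(low: int, high: int) -> List[str]:
    """Shards that a query on shard keys in [low, high] must be sent to."""
    touched = {route(low), route(high)}
    touched |= {name for bound, name in CHUNKS if low <= bound <= high}
    return sorted(touched)
```

A write with a known shard key goes to exactly one shard, while a range query may fan out to several, which is why the choice of shard key matters for performance.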
Recap: Importance of Benchmark and Factors
• Scalability
• Availability
• Partition tolerance
• Consistency
Most important: performance
Yahoo Cloud Serving Benchmark (YCSB)
Results: Load Process
Results: Read/Write Mix Workload
Results: Read/Scan Mix Workload
Results: Read Latency across all workloads
Results: Insert Latency across all workloads
Live Demo
Let's migrate from a traditional database!
Discussion
• Identify the data model for the application
• Know the corresponding data sets
• Determine whether the application requires replication
• Identify the performance requirements
• Prototype the application
• Test the performance of the prototype
Conclusion
• NoSQL databases have replaced traditional relational databases in many large-scale applications
• Performance is the key feature
• Benchmarks are important
• The performance of the top three NoSQL databases was tested
• Cassandra outperformed the other NoSQL databases in these tests
• Decide based on the application


Editor's Notes

  • #9: Managing the start-up, configuration and termination of EC2 instances. Running the tests on clients.
  • #10: Apache Cassandra: columnar database model (combination of Amazon Dynamo and Bigtable). Apache HBase: columnar database model (Bigtable-inspired Hadoop system).
  • #12: In Cassandra, rows are split, with a row key for each range of rows (the primary key is hashed with MD5), and a column family (column name) with value and timestamp. In HBase, data is split column-wise; it has a row key for each range of rows, a column family, a column qualifier and a timestamp. HBase uses ordered distribution, not hash distribution. Frequently accessed columns are grouped together under a common family.
  • #14: The system keyspace stores metadata for the local node. The system keyspace cannot be modified or edited by users. The node's token is decided by the partitioner.
  • #16: Memory reads are faster than disk reads. So when we look at the test results, Cassandra outperforms the others, and Bloom filters could be one of the reasons, because of fast in-memory access and reads.
  • #17: Cassandra nodes exchange Merkle trees with their neighbours. A Merkle tree is a hash tree representing the data in a column family. The trees are compared and, if there is any difference, a repair is launched for the ranges that don't agree. Read repair happens in the background. A snitch routes the client to the nearest node (there is no separate config database as in MongoDB, or ZooKeeper as in HBase, to route requests, which may take additional time to respond). The snitch gives host proximity.
  • #27: Give the example of Facebook.