Novateur Publication’s
International Journal of Innovation in Engineering, Research and Technology [IJIERT]
ICITDCEME’15 Conference Proceedings
ISSN No - 2394-3696
EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS
USING HADOOP STREAMING
Dahatonde Varsha Sukhdev,
Department of Computer Engineering, G.H.Raisoni College of Engineering, chas, Ahmednagar India
vaishudahatonde@gmail.com
Ashish Kumar,
Department of Computer Engineering, G.H.Raisoni College of Engineering, chas, Ahmednagar India
ashish.kumar@raisoni.net
ABSTRACT
Unstructured data poses challenges for data storage. Experts estimate that 80 to 90 percent of the data in any
organization is unstructured, and the amount of unstructured data in enterprises is growing significantly, often
many times faster than structured databases are growing. Structured data exists in table format, i.e., it has a
proper schema, whereas unstructured data is schema-less; this directly signifies the importance of the NoSQL
storage model and the MapReduce platform for processing unstructured data. In the existing system, processing is
applied to Cassandra datasets. In the present system, MongoDB is to be implemented along with Cassandra datasets.
MongoDB provides a flexible data model and a large number of options for querying unstructured data, whereas
Cassandra users model their data in such a way as to minimize the total number of queries through more careful
planning and denormalization. Cassandra offers basic secondary indexes, but for the best performance it is
recommended to model the data so as to use them infrequently. So to process
KEYWORDS: Unstructured data, schema less database, secondary indexes, denormalization.
INTRODUCTION
Structured data is generally held in relational databases and can be accessed through predesigned fields. In
contrast, unstructured data does not fit into any predefined data model. Big data techniques are used to analyze
structured as well as unstructured data. Unstructured data grows more rapidly because most user content in
databases is text. For about 40 years, files were likewise most often comprised of just text; now users want rich
content, not just plain text. Handling huge amounts of unstructured data using different programs under varied
conditions becomes difficult. The main problem in handling a NoSQL database is that storing and searching the data
requires high computational resources. NoSQL databases are non-relational, use a schema-less data model, have low
latency, are highly scalable and give high performance. NoSQL databases are coded in distinct programming
languages and are available as open source software. The objective of this paper is to handle unstructured data using
the widely used NoSQL database systems Cassandra and MongoDB [1]. The existing work uses a MapReduce pipeline
that is adopted by Hadoop Streaming and MARISSA. For evaluation of data the pipeline has three stages: Data
Preparation, Data Transformation and Data Processing [1]. This paper is organized as follows. Section 2 provides an
introduction to NoSQL databases and the Cassandra and MongoDB systems. We discuss related work in Section 3 and
present the proposed architecture of the system in Section 4.
NOSQL DATABASE
“A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled in
means other than the tabular relations used in relational databases” [2]. A major difference from relational databases
is the lack of an explicit data schema. NoSQL databases infer the schema from stored data, if they require one at all,
depending on which model is used. The main benefit of using different data models is that each is very good at what
it is designed to do. At the same time, a model should not be forced to do something it was not designed for. This
means that it is of the utmost importance to understand and correctly use the data model when choosing NoSQL
solutions. Generally, data models in NoSQL are grouped into four categories. However, particular NoSQL solutions
may incorporate several models at once.
KEY-VALUE (K-V) STORES
A K-V store is the simplest data model. The key is a unique identifier for a value, which can be any data the
application needs to store. This model is also the fastest way to get data by a known key, but without the flexibility
of more advanced querying. It may be used for data sharing between application instances, such as a distributed
cache, or to store user session data.
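The session-data use case above can be illustrated with a minimal in-memory sketch. The class name `KeyValueStore`, its methods and the sample keys are hypothetical and chosen only for illustration; a real K-V store would add persistence and distribution.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The store treats the value as opaque and never inspects its contents.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Lookup by exact key is the only supported query.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book"]})
session = store.get("session:42")   # found by known key
missing = store.get("session:99")   # unknown key yields None
```

Note that there is no way to ask for "all sessions whose cart contains a book" without scanning every value, which is the querying limitation mentioned above.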
DOCUMENT STORES
A document store is a data model for storing semi-structured document object data and metadata. The JSON format is
normally used to represent such objects. Documents can be queried by their properties in a manner similar to
relational databases, but they are not required to adhere to the strict structure of a database table. Additionally,
only parts of an object may be requested or updated.
Generally speaking, document stores are used for aggregate objects that share no complex data between them, and
for quickly searching or filtering by some object properties.
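Querying by property and requesting only part of an object can be sketched in plain Python as follows. The function `find` and the sample `users` collection are hypothetical and are not the API of MongoDB or any other product; they only mimic the idea.

```python
from typing import Any, Dict, Iterable, List, Optional

Document = Dict[str, Any]


def find(collection: Iterable[Document],
         criteria: Document,
         fields: Optional[List[str]] = None) -> List[Document]:
    """Return documents whose properties equal all criteria values.

    If `fields` is given, only those parts of each document are returned.
    """
    results: List[Document] = []
    for doc in collection:
        if all(doc.get(k) == v for k, v in criteria.items()):
            if fields is None:
                results.append(dict(doc))
            else:
                results.append({k: doc[k] for k in fields if k in doc})
    return results


# Documents in one collection need not share the same set of properties.
users = [
    {"_id": 1, "name": "Asha", "city": "Pune", "tags": ["admin"]},
    {"_id": 2, "name": "Ravi", "city": "Pune"},
    {"_id": 3, "name": "Meera", "city": "Nashik", "age": 31},
]

pune_names = find(users, {"city": "Pune"}, fields=["name"])
```

Here `pune_names` holds only the `name` part of the two matching documents, even though the stored documents differ in structure.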
COLUMN-ORIENTED STORES
A more advanced K-V store data model is a column family. Column families organize data by individual columns,
where actual data is used as a key to refer to whole data collections. This is similar to a relational database
index; however, a column family may be an arbitrary collection of columns. There are more complex aggregation
structures, such as super columns and super column families, that allow access to the data by several keys.
This particular approach is used for very large, scalable databases to greatly reduce the time needed to search data.
It is rarely used outside of enterprise-level applications.
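One rough way to picture a column family is as a map from row key to an arbitrary set of named columns. The sketch below assumes that simplified picture; the names `users_cf` and `get_column` are hypothetical, and real column-oriented stores add timestamps, ordering and on-disk layout that are omitted here.

```python
from collections import defaultdict
from typing import Dict, Optional

# row key -> column name -> value; each row may hold a different set of columns.
ColumnFamily = Dict[str, Dict[str, str]]

users_cf: ColumnFamily = defaultdict(dict)
users_cf["user1"]["name"] = "Asha"
users_cf["user1"]["email"] = "asha@example.org"
users_cf["user2"]["name"] = "Ravi"
users_cf["user2"]["phone"] = "555-0100"   # a column that user1 does not have


def get_column(cf: ColumnFamily, row_key: str, column: str) -> Optional[str]:
    """Read a single column of a single row, or None if either is absent."""
    return cf.get(row_key, {}).get(column)


phone = get_column(users_cf, "user2", "phone")
no_email = get_column(users_cf, "user2", "email")
```

Unlike a relational table, no row is padded with empty values for columns it does not use.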
GRAPH DATABASES
As the name implies, this data model allows objects to link to and be linked by several other objects, thus
constructing a graph structure. Links usually have additional properties that describe the relation between objects.
Graph databases map more directly to object-oriented programming models and are faster for highly associative data
sets and graph queries. Furthermore, they typically support ACID transaction properties in the same way as most
RDBMS.
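The idea of linked objects whose links carry properties can be sketched with an adjacency list. The node names, the `relation` property and the helper functions below are hypothetical; the sketch shows only the data shape and a simple traversal, not the storage or transaction machinery of a real graph database.

```python
from collections import deque
from typing import Dict, List, Tuple

# node -> list of (neighbour, properties of the link)
Graph = Dict[str, List[Tuple[str, Dict[str, str]]]]

graph: Graph = {
    "alice": [("bob", {"relation": "follows"}),
              ("carol", {"relation": "works_with"})],
    "bob": [("carol", {"relation": "follows"})],
    "carol": [],
}


def reachable(g: Graph, start: str) -> List[str]:
    """Breadth-first traversal: every node reachable from `start`, in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour, _props in g.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen


def linked_by(g: Graph, source: str, relation: str) -> List[str]:
    """Neighbours of `source` whose link has the given relation property."""
    return [n for n, props in g.get(source, []) if props.get("relation") == relation]


from_alice = reachable(graph, "alice")
alice_follows = linked_by(graph, "alice", "follows")
```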
CASSANDRA
Cassandra’s architecture is made of nodes, clusters, data centers and a partitioner. A node is a physical instance of
Cassandra. Cassandra does not use a master-slave architecture; rather, it uses a peer-to-peer architecture in which
all nodes are equal. A cluster is a group of nodes or even a single node. A data center is a group of related nodes
within a cluster. A partitioner is a hash function for computing the token of each row key.
When a row is inserted, a token is calculated from its unique row key. This token determines on which node that
particular row will be stored. Each node of a cluster is responsible for a range of tokens, and the row is stored on
the node responsible for its token. The advantage here is that multiple rows can be written in parallel into the
database, as each node is responsible for its own write requests. However, this may be seen as a drawback for data
extraction, which can become a bottleneck.
The Murmur3Partitioner [17] is a partitioner that uses tokens to assign equal portions of data to each node. This
technique was selected because it provides fast hashing, and its hash function helps to distribute data evenly
across all the nodes of a cluster.
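The token-based placement described above can be sketched as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: the Python standard library has no Murmur3, so MD5 is used as a stand-in hash, the token space is assumed to be 0 to 2^64 - 1, and the functions `token_for`, `build_ring` and `node_for` are hypothetical names.

```python
import hashlib
from bisect import bisect_left
from collections import Counter
from typing import List

TOKEN_SPACE = 2 ** 64  # assumed size of the token space for this sketch


def token_for(row_key: str) -> int:
    """Hash a row key to a token. Stand-in hash: MD5, NOT Murmur3."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # value in [0, 2**64)


def build_ring(node_count: int) -> List[int]:
    """Upper token bound of each node; the ranges are (almost) equal-sized."""
    step = TOKEN_SPACE // node_count
    return [step * (i + 1) - 1 for i in range(node_count)]


def node_for(row_key: str, ring: List[int]) -> int:
    """Index of the node responsible for the row's token."""
    # First node whose upper bound is >= the token; wrap around past the last bound.
    return bisect_left(ring, token_for(row_key)) % len(ring)


ring = build_ring(4)
placement = Counter(node_for(f"row-{i}", ring) for i in range(10000))
```

Because the hash spreads keys roughly uniformly over the token space, `placement` should show each of the four nodes receiving about a quarter of the rows, which is the even distribution the text attributes to the partitioner.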
LITERATURE SURVEY
E. Dede et al. [1] proposed two different approaches: one working directly with the distributed Cassandra cluster to
perform MapReduce operations, and the other exporting the dataset from the database servers to the file system for
further processing. They also present approaches to the challenge of integrating NoSQL data stores with
MapReduce for non-Java application scenarios, along with the advantages and disadvantages of each approach. They
also compare Hadoop Streaming with their own streaming framework, MARISSA, to show the performance
implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on
file-system based data stores. Elif Dede et al. [3] report that Cassandra’s RandomPartitioner distributes data evenly,
improving Hadoop’s performance by a factor of 3. They also report that increasing the replication factor on
Cassandra does not affect Hadoop turnaround time; leveraging range scans reduces read repair calls on replicas,
immunizing Hadoop from replication-related performance degradation. CPU-intensive loads perform better using
Hadoop-native, but the difference using Cassandra is minimal. Z. Fadika et al. [4] evaluate Hadoop specifically for
data-intensive scientific operations (filter, merge and reorder) to understand its various design considerations and
performance trade-offs. They evaluate Hadoop for these data operations in the context of High Performance
Computing (HPC) environments to understand the impact of the file system, network and programming modes on
performance. Many research works [5-8] present results involving the performance of a Cassandra database system
for massive data volumes. In this paper, we have decided to evaluate the performance of the Cassandra NoSQL
database system specifically for genomic data.
PROPOSED SYSTEM
The proposed system consists of the following components:
1. Data Preparation: Data Preparation, Figure 1a, is the step of downloading the data from the Cassandra servers to
the corresponding file systems: HDFS for Hadoop Streaming and the shared file system for MARISSA. For
both of these frameworks this step is initiated in parallel. Cassandra allows exporting the records of a dataset
as JSON-formatted files [9]. Using this feature, each node downloads the data from its local Cassandra
server to the file system. In our experimental setup, each node that is running a Cassandra server is also a
worker node for the MapReduce framework in use.
2. Data Transformation (MR1): Cassandra allows users to export datasets as JSON-formatted files. Our
assumption is that the MapReduce applications to be run are legacy applications which are either impossible
or impractical to modify, so the input data needs to be converted into the format expected by these
target executables. For this reason, our software pipeline includes a MapReduce stage, Figure 1b, where
JSON data can be transformed into other formats. In this phase each input record is converted to another
format and stored in intermediary output files. This step does not involve any data or
processing dependencies between nodes and is therefore a great fit for MapReduce.
3. Data Processing (MR2): This is the final step of the MapReduce Streaming pipeline. We run the non-Java
executables over the output of MR1.
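The Data Transformation stage (MR1) above can be sketched as a map-only step that converts each JSON record into a tab-separated line. The record layout used here (one JSON object per line with a `key` and a `columns` list of `[name, value, timestamp]` entries) and the tab-separated target format are assumptions made for illustration; the actual export layout and the format expected by the legacy executables may differ.

```python
import json
from typing import Iterable, Iterator


def transform_record(line: str) -> str:
    """Convert one JSON record into a tab-separated line: key, then column values."""
    record = json.loads(line)
    # Assumed layout: each column is [name, value, timestamp]; keep only the value.
    values = [str(col[1]) for col in record["columns"]]
    return "\t".join([str(record["key"])] + values)


def run(lines: Iterable[str]) -> Iterator[str]:
    """Map-only transformation: each record is handled independently of the others."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield transform_record(line)


# Under Hadoop Streaming the mapper would read sys.stdin and print each result;
# a small in-memory sample is used here instead.
sample = [
    '{"key": "row1", "columns": [["name", "Asha", 1], ["city", "Pune", 2]]}',
    '',
    '{"key": "row2", "columns": [["name", "Ravi", 3]]}',
]
converted = list(run(sample))
```

Since no record depends on any other, the work splits cleanly across mappers with no reduce step, which is why the text calls this stage a good fit for MapReduce.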
To show the full operation, we account for the time taken for Data Preparation and Data Transformation under each
MapReduce framework and repeat our comparisons [1].
Figure: Block diagram of Proposed System
CONCLUSION
Future work may include evaluating other NoSQL databases or running new tests using Cassandra with different
hardware configurations, seeking improvements in performance. Comparing the performance of Cassandra with that of
MongoDB will help in the processing of unstructured data. Further, it is possible to outline new approaches in
studies of processing unstructured data.
REFERENCES
1. E. Dede, B. Sendir, P. Kuzlu, J. Weachock, M. Govindaraju, “A Processing Pipeline for Cassandra Datasets
Based on Hadoop Streaming,” 2014 IEEE International Congress on Big Data, DOI
10.1109/BigData.Congress.2014.32.
2. NoSQL wiki. https://en.wikipedia.org/wiki/NoSQL
3. E. Dede, B. Sendir, P. Kuzlu, J. Hartog, M. Govindaraju, “An Evaluation of Cassandra for Hadoop,” Grid and
Cloud Computing Research Laboratory, SUNY Binghamton, New York, USA, 2013 IEEE Sixth International
Conference on Cloud Computing.
4. Z. Fadika, M. Govindaraju, R. Canon, and L. Ramakrishnan, “Evaluating Hadoop for Data-Intensive Scientific
Operations,” in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on, pp. 67–74, IEEE,
2012.
5. Z. Ye and S. Li, “A request skew aware heterogeneous distributed storage system based on Cassandra,”
in Proceedings of the International Conference on Computer and Management (CAMAN '11), pp. 1–5, May
2011.
6. G. Wang and J. Tang, “The NoSQL principles and basic application of cassandra model,” in Proceedings of the
International Conference on Computer Science and Service System (CSSS '12), pp. 1332–1335, August 2012.
7. B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,”
in Proceedings of the 10th RoEduNet International Conference on Networking in Education and Research
(RoEduNet '11), pp. 1–5, June 2011.
8. Y. Li and S. Manoharan, “A performance comparison of SQL and NoSQL databases,” in Proceedings of the 14th
IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (PACRIM '13), pp. 15–19,
August 2013.
9. Cassandra wiki, operations. http://wiki.apache.org/cassandra/Operations.
10. M. Klems, D. Bermbach, and R. Weinert, “A runtime quality measurement framework for cloud database
service systems,” in Proceedings of the 8th International Conference on the Quality of Information and
Communications Technology (QUATIC '12), pp. 38–46, September 2012.

More Related Content

What's hot (16)

PDF
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
IJEACS
 
PPTX
Big Data Unit 4 - Hadoop
RojaT4
 
PDF
Trends in Computer Science and Information Technology
peertechzpublication
 
PDF
DSM - Comparison of Hbase and Cassandra
Shrikant Samarth
 
PDF
Processing cassandra datasets with hadoop streaming based approaches
LeMeniz Infotech
 
PPTX
Introduction of big data unit 1
RojaT4
 
PDF
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
PDF
MongoDB NoSQL database a deep dive -MyWhitePaper
Rajesh Kumar
 
PDF
Hdfs Dhruba
Jeff Hammerbacher
 
PPTX
Chapter 4 terminolgy of keyvalue databses from nosql for mere mortals
nehabsairam
 
DOCX
Analysis and evaluation of riak kv cluster environment using basho bench
StevenChike
 
PDF
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
PPTX
Nosql
Roxana Tadayon
 
PDF
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
ijscai
 
PPTX
Nosql
ROXTAD71
 
Implementation of Multi-node Clusters in Column Oriented Database using HDFS
IJEACS
 
Big Data Unit 4 - Hadoop
RojaT4
 
Trends in Computer Science and Information Technology
peertechzpublication
 
DSM - Comparison of Hbase and Cassandra
Shrikant Samarth
 
Processing cassandra datasets with hadoop streaming based approaches
LeMeniz Infotech
 
Introduction of big data unit 1
RojaT4
 
Vikram Andem Big Data Strategy @ IATA Technology Roadmap
IT Strategy Group
 
MongoDB NoSQL database a deep dive -MyWhitePaper
Rajesh Kumar
 
Hdfs Dhruba
Jeff Hammerbacher
 
Chapter 4 terminolgy of keyvalue databses from nosql for mere mortals
nehabsairam
 
Analysis and evaluation of riak kv cluster environment using basho bench
StevenChike
 
Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
ijscai
 
Nosql
ROXTAD71
 

Similar to EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING (20)

PDF
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
IJCERT JOURNAL
 
PDF
NoSQL BIg Data Analytics Mongo DB and Cassandra .pdf
Sharmila Chidaravalli
 
PDF
Vskills Apache Cassandra sample material
Vskills
 
PPTX
NoSQL Intro with cassandra
Brian Enochson
 
PPTX
Selecting best NoSQL
Mohammed Fazuluddin
 
PDF
the rising no sql technology
INFOGAIN PUBLICATION
 
PDF
Uint-5 Big data Frameworks.pdf
Sitamarhi Institute of Technology
 
PDF
A Study of Performance NoSQL Databases
AM Publications
 
PPTX
Introduction to NoSql
Omid Vahdaty
 
PDF
Comparative study of relational and non relations database performances using...
IAEME Publication
 
PPTX
Choosing your NoSQL storage
Imteyaz Khan
 
PDF
BDA UNIT5.pdf
SCE
 
PPTX
Introduction to Data Science NoSQL.pptx
tarakesh7199
 
PPT
The No SQL Principles and Basic Application Of Casandra Model
Rishikese MR
 
PPTX
2018 05 08_biological_databases_no_sql
Prof. Wim Van Criekinge
 
PDF
Hybrid Database System for Big Data Storage and Management
IJCSEA Journal
 
PDF
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
IJCSEA Journal
 
PDF
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
IJCSEA Journal
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
IJCERT JOURNAL
 
NoSQL BIg Data Analytics Mongo DB and Cassandra .pdf
Sharmila Chidaravalli
 
Vskills Apache Cassandra sample material
Vskills
 
NoSQL Intro with cassandra
Brian Enochson
 
Selecting best NoSQL
Mohammed Fazuluddin
 
the rising no sql technology
INFOGAIN PUBLICATION
 
Uint-5 Big data Frameworks.pdf
Sitamarhi Institute of Technology
 
A Study of Performance NoSQL Databases
AM Publications
 
Introduction to NoSql
Omid Vahdaty
 
Comparative study of relational and non relations database performances using...
IAEME Publication
 
Choosing your NoSQL storage
Imteyaz Khan
 
BDA UNIT5.pdf
SCE
 
Introduction to Data Science NoSQL.pptx
tarakesh7199
 
The No SQL Principles and Basic Application Of Casandra Model
Rishikese MR
 
2018 05 08_biological_databases_no_sql
Prof. Wim Van Criekinge
 
Hybrid Database System for Big Data Storage and Management
IJCSEA Journal
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
IJCSEA Journal
 
HYBRID DATABASE SYSTEM FOR BIG DATA STORAGE AND MANAGEMENT
IJCSEA Journal
 
Ad

More from ijiert bestjournal (20)

PDF
CRACKS IN STEEL CASTING FOR VOLUTE CASING OF A PUMP
ijiert bestjournal
 
PDF
A COMPARATIVE STUDY OF DESIGN OF SIMPLE SPUR GEAR TRAIN AND HELICAL GEAR TRAI...
ijiert bestjournal
 
PDF
COMPARATIVE ANALYSIS OF CONVENTIONAL LEAF SPRING AND COMPOSITE LEAF
ijiert bestjournal
 
PDF
POWER GENERATION BY DIFFUSER AUGMENTED WIND TURBINE
ijiert bestjournal
 
PDF
FINITE ELEMENT ANALYSIS OF CONNECTING ROD OF MG-ALLOY
ijiert bestjournal
 
PDF
REVIEW ON CRITICAL SPEED IMPROVEMENT IN SINGLE CYLINDER ENGINE VALVE TRAIN
ijiert bestjournal
 
PDF
ENERGY CONVERSION PHENOMENON IN IMPLEMENTATION OF WATER LIFTING BY USING PEND...
ijiert bestjournal
 
PDF
SCUDERI SPLIT CYCLE ENGINE: REVOLUTIONARY TECHNOLOGY & EVOLUTIONARY DESIGN RE...
ijiert bestjournal
 
PDF
EXPERIMENTAL EVALUATION OF TEMPERATURE DISTRIBUTION IN JOURNAL BEARING OPERAT...
ijiert bestjournal
 
PDF
STUDY OF SOLAR THERMAL CAVITY RECEIVER FOR PARABOLIC CONCENTRATING COLLECTOR
ijiert bestjournal
 
PDF
DESIGN, OPTIMIZATION AND FINITE ELEMENT ANALYSIS OF CRANKSHAFT
ijiert bestjournal
 
PDF
ELECTRO CHEMICAL MACHINING AND ELECTRICAL DISCHARGE MACHINING PROCESSES MICRO...
ijiert bestjournal
 
PDF
HEAT TRANSFER ENHANCEMENT BY USING NANOFLUID JET IMPINGEMENT
ijiert bestjournal
 
PDF
MODIFICATION AND OPTIMIZATION IN STEEL SANDWICH PANELS USING ANSYS WORKBENCH
ijiert bestjournal
 
PDF
IMPACT ANALYSIS OF ALUMINUM HONEYCOMB SANDWICH PANEL BUMPER BEAM: A REVIEW
ijiert bestjournal
 
PDF
DESIGN OF WELDING FIXTURES AND POSITIONERS
ijiert bestjournal
 
PDF
ADVANCED TRANSIENT THERMAL AND STRUCTURAL ANALYSIS OF DISC BRAKE BY USING ANS...
ijiert bestjournal
 
PDF
REVIEW ON MECHANICAL PROPERTIES OF NON-ASBESTOS COMPOSITE MATERIAL USED IN BR...
ijiert bestjournal
 
PDF
PERFORMANCE EVALUATION OF TRIBOLOGICAL PROPERTIES OF COTTON SEED OIL FOR MULT...
ijiert bestjournal
 
PDF
MAGNETIC ABRASIVE FINISHING
ijiert bestjournal
 
CRACKS IN STEEL CASTING FOR VOLUTE CASING OF A PUMP
ijiert bestjournal
 
A COMPARATIVE STUDY OF DESIGN OF SIMPLE SPUR GEAR TRAIN AND HELICAL GEAR TRAI...
ijiert bestjournal
 
COMPARATIVE ANALYSIS OF CONVENTIONAL LEAF SPRING AND COMPOSITE LEAF
ijiert bestjournal
 
POWER GENERATION BY DIFFUSER AUGMENTED WIND TURBINE
ijiert bestjournal
 
FINITE ELEMENT ANALYSIS OF CONNECTING ROD OF MG-ALLOY
ijiert bestjournal
 
REVIEW ON CRITICAL SPEED IMPROVEMENT IN SINGLE CYLINDER ENGINE VALVE TRAIN
ijiert bestjournal
 
ENERGY CONVERSION PHENOMENON IN IMPLEMENTATION OF WATER LIFTING BY USING PEND...
ijiert bestjournal
 
SCUDERI SPLIT CYCLE ENGINE: REVOLUTIONARY TECHNOLOGY & EVOLUTIONARY DESIGN RE...
ijiert bestjournal
 
EXPERIMENTAL EVALUATION OF TEMPERATURE DISTRIBUTION IN JOURNAL BEARING OPERAT...
ijiert bestjournal
 
STUDY OF SOLAR THERMAL CAVITY RECEIVER FOR PARABOLIC CONCENTRATING COLLECTOR
ijiert bestjournal
 
DESIGN, OPTIMIZATION AND FINITE ELEMENT ANALYSIS OF CRANKSHAFT
ijiert bestjournal
 
ELECTRO CHEMICAL MACHINING AND ELECTRICAL DISCHARGE MACHINING PROCESSES MICRO...
ijiert bestjournal
 
HEAT TRANSFER ENHANCEMENT BY USING NANOFLUID JET IMPINGEMENT
ijiert bestjournal
 
MODIFICATION AND OPTIMIZATION IN STEEL SANDWICH PANELS USING ANSYS WORKBENCH
ijiert bestjournal
 
IMPACT ANALYSIS OF ALUMINUM HONEYCOMB SANDWICH PANEL BUMPER BEAM: A REVIEW
ijiert bestjournal
 
DESIGN OF WELDING FIXTURES AND POSITIONERS
ijiert bestjournal
 
ADVANCED TRANSIENT THERMAL AND STRUCTURAL ANALYSIS OF DISC BRAKE BY USING ANS...
ijiert bestjournal
 
REVIEW ON MECHANICAL PROPERTIES OF NON-ASBESTOS COMPOSITE MATERIAL USED IN BR...
ijiert bestjournal
 
PERFORMANCE EVALUATION OF TRIBOLOGICAL PROPERTIES OF COTTON SEED OIL FOR MULT...
ijiert bestjournal
 
MAGNETIC ABRASIVE FINISHING
ijiert bestjournal
 
Ad

Recently uploaded (20)

PPTX
Knowledge Representation : Semantic Networks
Amity University, Patna
 
PPTX
Water Resources Engineering (CVE 728)--Slide 3.pptx
mohammedado3
 
PPTX
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
PDF
MODULE-5 notes [BCG402-CG&V] PART-B.pdf
Alvas Institute of Engineering and technology, Moodabidri
 
PDF
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
Ashutosh Satapathy
 
PDF
aAn_Introduction_to_Arcadia_20150115.pdf
henriqueltorres1
 
PPT
New_school_Engineering_presentation_011707.ppt
VinayKumar304579
 
PDF
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
PDF
Data structures notes for unit 2 in computer science.pdf
sshubhamsingh265
 
PPTX
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
PPTX
Final Major project a b c d e f g h i j k l m
bharathpsnab
 
PPTX
OCS353 DATA SCIENCE FUNDAMENTALS- Unit 1 Introduction to Data Science
A R SIVANESH M.E., (Ph.D)
 
PDF
3rd International Conference on Machine Learning and IoT (MLIoT 2025)
ClaraZara1
 
PDF
REINFORCEMENT LEARNING IN DECISION MAKING SEMINAR REPORT
anushaashraf20
 
PDF
Electrical Machines and Their Protection.pdf
Nabajyoti Banik
 
PDF
SERVERLESS PERSONAL TO-DO LIST APPLICATION
anushaashraf20
 
PDF
methodology-driven-mbse-murphy-july-hsv-huntsville6680038572db67488e78ff00003...
henriqueltorres1
 
PPTX
MODULE 04 - CLOUD COMPUTING AND SECURITY.pptx
Alvas Institute of Engineering and technology, Moodabidri
 
PPT
Testing and final inspection of a solar PV system
MuhammadSanni2
 
PDF
methodology-driven-mbse-murphy-july-hsv-huntsville6680038572db67488e78ff00003...
henriqueltorres1
 
Knowledge Representation : Semantic Networks
Amity University, Patna
 
Water Resources Engineering (CVE 728)--Slide 3.pptx
mohammedado3
 
Mechanical Design of shell and tube heat exchangers as per ASME Sec VIII Divi...
shahveer210504
 
MODULE-5 notes [BCG402-CG&V] PART-B.pdf
Alvas Institute of Engineering and technology, Moodabidri
 
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
Ashutosh Satapathy
 
aAn_Introduction_to_Arcadia_20150115.pdf
henriqueltorres1
 
New_school_Engineering_presentation_011707.ppt
VinayKumar304579
 
Basic_Concepts_in_Clinical_Biochemistry_2018كيمياء_عملي.pdf
AdelLoin
 
Data structures notes for unit 2 in computer science.pdf
sshubhamsingh265
 
2025 CGI Congres - Surviving agile v05.pptx
Derk-Jan de Grood
 
Final Major project a b c d e f g h i j k l m
bharathpsnab
 
OCS353 DATA SCIENCE FUNDAMENTALS- Unit 1 Introduction to Data Science
A R SIVANESH M.E., (Ph.D)
 
3rd International Conference on Machine Learning and IoT (MLIoT 2025)
ClaraZara1
 
REINFORCEMENT LEARNING IN DECISION MAKING SEMINAR REPORT
anushaashraf20
 
Electrical Machines and Their Protection.pdf
Nabajyoti Banik
 
SERVERLESS PERSONAL TO-DO LIST APPLICATION
anushaashraf20
 
methodology-driven-mbse-murphy-july-hsv-huntsville6680038572db67488e78ff00003...
henriqueltorres1
 
MODULE 04 - CLOUD COMPUTING AND SECURITY.pptx
Alvas Institute of Engineering and technology, Moodabidri
 
Testing and final inspection of a solar PV system
MuhammadSanni2
 
methodology-driven-mbse-murphy-july-hsv-huntsville6680038572db67488e78ff00003...
henriqueltorres1
 

EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING

  • 1. Novateur Publication’s International Journal of Innovation in Engineering, Research and Technology [IJIERT] ICITDCEME’15 Conference Proceedings ISSN No - 2394-3696 1 | P a g e EVALUATING CASSANDRA, MONGO DB LIKE NOSQL DATASETS USING HADOOP STREAMING Dahatonde Varsha Sukhdev, Department of Computer Engineering, G.H.Raisoni College of Engineering, chas, Ahmednagar India [email protected] Ashish Kumar, Department of Computer Engineering, G.H.Raisoni College of Engineering, chas, Ahmednagar India [email protected] ABSTRACT An unstructured data poses challenges to storing data. Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly— often many times faster than structured databases are growing. As structured data is existing in table format i,e having proper scheme but unstructured data is schema less database So it’s directly signifying the importance of NoSQL storage Model and Map Reduce platform. For processing unstructured data, where in existing it is given to Cassandra dataset. Here in present system along with Cassandra dataset, Mongo DB is to be implemented. As Mongo DB provide flexible data model and large amount of options for querying unstructured data. Whereas Cassandra model their data in such a way as to minimize the total number of queries through more careful planning and renormalizations. It offers basic secondary indexes but for the best performance it’s recommended to model our data as to use them infrequently. So to process KEYWORDS: Unstructured data, schema less database, secondary indexes, denormalization. INTRODUCTION Structured data is generally in the form of relational database i.e relational data and can be accessed through predesigned fields. In contrast unstructured data doesn’t fit into any pre-defined data models. Bigdata is used to analyze the structured as well as unstructured data. 
As unstructured data grows more rapidly, as user content of database is text. For about 40 years, files were likewise most often comprised of just text. Now users want rich content, not just plain text. To handle huge amount of unstructured data by using different programs under varied conditions becomes difficult. The main problem while handling the NOSQL database is about the storage and search of the data requires high computational resources. NoSQL database are Non-relational, Schema-less data model, having low latency, highly scalable and gives high performance. NoSQL database is coded in district programming languages and available as open source software. Objective of this paper is to handle the unstructured data using widely used NoSQL database system, Cassandra and MongoDB [1]. The existing work uses Map Reduce pipeline that is adopted by Hadoop streaming and MARISSA. For evaluation of data the pipeline have three stages: Data preparation, Data Transformation and Data Processing [1]. This paper is organized as follow. Section 2 provides an introduction for NoSQL database, Cassandra and Mongo DB system. We discuss related work in section 3 and we present, at section 4 the proposed architecture of the system. NOSQL DATABASE “A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases”[2]. A major difference from relational databases is the lack of explicit data scheme. NoSQL databases infer scheme from stored data, if it requires it at all, depending on which model was used. The main benefit of using different data models is that they are very good at what they do. At the same time, don’t force them to do something they aren’t designed for. This means that it is of the upmost importance to understand and correctly use the data model when choosing NoSQL solutions.Generally, data models in NoSQL are grouped into four categories. 
However, particular NoSQL solutions may incorporate several models at once. KEY-VALUE (K-V) STORES K-V store is the simplest data model. The key is a unique identifier for a value, which can be any data application needs stored. This model is also the fastest way to get data by known key, but without the flexibility of more advanced querying. It may be used for data sharing between application instances like distributed cache or to store user session data.
  • 2. Novateur Publication’s International Journal of Innovation in Engineering, Research and Technology [IJIERT] ICITDCEME’15 Conference Proceedings ISSN No - 2394-3696 2 | P a g e DOCUMENT STORES Document store is a data model for storing semi-structured document object data and metadata. The JSON format is normally used to represent such objects. Documents can be queried by their properties in a similar manner to relational databases but aren’t required to adhere to the strict structure of a database table. Additionally, only parts of the object may be requested or updated. Generally speaking, document stores are used for aggregate objects that have no shared complex data between them and to quickly search or filter by some object properties. COLUMN-ORIENTED STORES A more advanced K-V store data model is a column family. These are used for organizing data based on individual columns where actual data is used as a key to refer to whole data collections. It is similar to a relational database index; however a column family may be an arbitrary collection of columns. There are more complex aggregation structures like super columns and super column families to allow access to the data by several keys. This particular approach is used for very large scalable databases to greatly reduce time for searching data. It is rarely used outside of enterprise level applications. GRAPH DATABASES As the name implies, this data model allows objects to link and be linked by several other objects thus constructing a graph structure. Links usually have additional properties to describe the relation between objects. Graph databases map more directly to object oriented programming models and are faster for highly associative data sets and graph queries. Furthermore they typically support ACID transaction properties in the same way as most RDBMS. CASSANDRA Cassandra’s architecture is made of nodes, clusters, data centers and a partitioner. A node is a physical instance of Cassandra. 
Cassandra does not use a master-slave architecture; rather, Cassandra uses peer-to-peer architecture, which all nodes are equal. A cluster is a group of nodes or even a single node. A group of clusters is a data center. A partitioned is a hash function for computing the token of each row key. When one row is inserted, a token is calculated, based on its unique row key. This token determines in what node that particular row will be stored. Each node of a cluster is responsible for a range of data based on a token. When the row is inserted and its token is calculated, this row is stored on a node responsible for this token. The advantage here is that multiple rows can be written in parallel into the database, as each node is responsible for its own write requests. However this may be seen as a drawback regarding data extraction, becoming a bottleneck. The MurMur3Partitioner [17] is a partitioner that uses tokens to assign equal portions of data to each node. This technique was selected because it provides fast hashing, and its hash function helps to evenly distribute data to all the nodes of a cluster. LITERATURE SURVEY E. Dede have proposed two different approaches, one working with the distributed Cassandra cluster[1] directly to perform MapReduce operations and the other exporting the dataset from the database servers to the file system for further processing. They also gives an approaches in solving the challenge of integrating NoSQL data stores with Map Reduce for non-Java application scenarios, along with advantages and disadvantages of each approach. Also compare Hadoop Streaming alongside their own streaming framework, MARISSA, to show performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file- system based data stores. Elif Dede have proposed Cassandra’s Random Partitioned distributes data evenly, improving Hadoop’s performance by a factor of 3 [3]. 
Also, increasing the replication factor on Cassandra does not affect Hadoop turnaround time; leveraging range scans reduces read-repair calls on replicas, immunizing Hadoop from replication-related performance degradation. CPU-intensive loads perform better using Hadoop-native, but the difference when using Cassandra is minimal.
Z. Fadika et al. [4] evaluate Hadoop specifically for data-intensive scientific operations (filter, merge and reorder) to understand its design considerations and performance trade-offs. They evaluate Hadoop for these data operations in the context of High Performance Computing (HPC) environments to understand the impact of the file system, network and programming modes on performance.
Many research works [5-8] present results on the performance of the Cassandra database system for massive data volumes. In this paper, we have decided to evaluate the performance of the Cassandra NoSQL database system specifically for genomic data.
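The token-based row placement described in the CASSANDRA section can be illustrated with a minimal sketch. It is not Cassandra's implementation: the class `TokenRing` and the function `token_for` are hypothetical names, the hash is an MD5-based stand-in rather than Murmur3 (which is not in the Python standard library), and the nodes are assumed to own evenly spaced token ranges.

```python
import bisect
import hashlib


def token_for(row_key: str) -> int:
    """Map a row key to a signed 64-bit token.

    Assumption: MD5 is used here only as a stand-in hash;
    the Murmur3Partitioner uses a Murmur3 hash instead.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Hypothetical ring in which each node owns one contiguous token range."""

    def __init__(self, node_names):
        self.nodes = list(node_names)
        n = len(self.nodes)
        # Evenly spaced upper bounds over the signed 64-bit token space.
        # The last bound equals 2**63 - 1, so every token falls in some range.
        self.tokens = [-2**63 + (i + 1) * 2**64 // n - 1 for i in range(n)]

    def node_for(self, row_key: str) -> str:
        """Return the node whose token range contains the row key's token."""
        token = token_for(row_key)
        index = bisect.bisect_left(self.tokens, token)
        return self.nodes[index]
```

For example, `TokenRing(["node-a", "node-b", "node-c"]).node_for("row-42")` always returns the same node for the same key, and a large set of distinct keys spreads roughly evenly over the three nodes, which is the property the partitioner is selected for.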
  • 3. PROPOSED SYSTEM
The proposed system consists of the following components:
1. Data Preparation: Data Preparation, Figure 1a, is the step of downloading the data from the Cassandra servers to the corresponding file systems: HDFS for Hadoop Streaming and the shared file system for MARISSA. For both frameworks this step is initiated in parallel. Cassandra allows exporting the records of a dataset as JSON-formatted files [9]. Using this feature, each node downloads the data from its local Cassandra server to the file system. In our experimental setup, each node that runs a Cassandra server is also a worker node for the MapReduce framework in use.
2. Data Transformation (MR1): Cassandra allows users to export datasets as JSON-formatted files. We assume that the MapReduce applications to be run are legacy applications that are impossible or impractical to modify, so the input data needs to be converted into the format expected by these target executables. For this reason, our software pipeline includes a MapReduce stage, Figure 1b, in which JSON data can be transformed into other formats. In this phase each input record is converted to another format and stored in intermediary output files. This step involves no data or processing dependencies between nodes and is therefore a good fit for MapReduce.
3. Data Processing (MR2): This is the final step of the MapReduce Streaming pipeline. We run the non-Java executables over the output of MR1. To show the full operation, we account for the time taken for Data Preparation and Data Transformation under each MapReduce framework and repeat our comparisons [1].
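The Data Transformation step (MR1) can be sketched as a streaming mapper. This is a minimal illustration under stated assumptions, not the pipeline used in [1]: it assumes each exported record occupies one line and is a JSON object with a "key" field and a "columns" list of [name, value, timestamp] triples, and that the legacy executable expects tab-separated values. The names `record_to_line` and `run_mapper` are hypothetical.

```python
import json
import sys


def record_to_line(raw: str, sep: str = "\t") -> str:
    """Convert one JSON record into a single delimited text line.

    Assumed record layout (hypothetical):
    {"key": "...", "columns": [[name, value, timestamp], ...]}
    """
    record = json.loads(raw)
    # Keep only the column values; names and timestamps are dropped.
    values = [str(column[1]) for column in record["columns"]]
    return sep.join([str(record["key"])] + values)


def run_mapper(lines, out=sys.stdout):
    """Read exported JSON lines and write one transformed line per record."""
    for raw in lines:
        # Tolerate the enclosing array brackets and trailing commas
        # of a JSON array that is written one record per line.
        raw = raw.strip().rstrip(",")
        if not raw or raw in ("[", "]"):
            continue
        out.write(record_to_line(raw) + "\n")
```

Calling `run_mapper(sys.stdin)` at the end of the script would let it act as a map-only job under Hadoop Streaming, since a streaming mapper reads records on standard input and writes results to standard output. No reducer is needed because, as noted above, the records are independent of one another.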
Figure: Block diagram of Proposed System
CONCLUSION
Future work could cover other NoSQL databases, or new tests using Cassandra with different hardware configurations, seeking improvements in performance. Comparing the performance of Cassandra with that of the Mongo DB database will help in the processing of unstructured data. Further, it is possible to outline new approaches in studies of processing unstructured data.
REFERENCES
1. E. Dede, B. Sendir, P. Kuzlu, J. Weachock, M. Govindaraju, “A Processing Pipeline for Cassandra Datasets Based on Hadoop Streaming,” in 2014 IEEE International Congress on Big Data, 2014. DOI 10.1109/BigData.Congress.2014.32.
2. NoSQL wiki. https://en.wikipedia.org/wiki/NoSQL
3. E. Dede, B. Sendir, P. Kuzlu, J. Hartog, M. Govindaraju, “An Evaluation of Cassandra for Hadoop,” Grid and Cloud Computing Research Laboratory, SUNY Binghamton, New York, USA, in 2013 IEEE Sixth International Conference on Cloud Computing, 2013.
4. Z. Fadika, M. Govindaraju, R. Canon, and L. Ramakrishnan, “Evaluating Hadoop for Data-Intensive Scientific Operations,” in 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pp. 67–74, 2012.
5. Z. Ye and S. Li, “A request skew aware heterogeneous distributed storage system based on Cassandra,” in Proceedings of the International Conference on Computer and Management (CAMAN '11), pp. 1–5, May 2011.
6. G. Wang and J. Tang, “The NoSQL principles and basic application of Cassandra model,” in Proceedings of the International Conference on Computer Science and Service System (CSSS '12), pp. 1332–1335, August 2012.
  • 4. 7. B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,” in Proceedings of the 10th RoEduNet International Conference on Networking in Education and Research (RoEduNet '11), pp. 1–5, June 2011.
8. Y. Li and S. Manoharan, “A performance comparison of SQL and NoSQL databases,” in Proceedings of the 14th IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (PACRIM '13), pp. 15–19, August 2013.
9. Cassandra wiki, operations. http://wiki.apache.org/cassandra/Operations
10. M. Klems, D. Bermbach, and R. Weinert, “A runtime quality measurement framework for cloud database service systems,” in Proceedings of the 8th International Conference on the Quality of Information and Communications Technology (QUATIC '12), pp. 38–46, September 2012.