2018 05 08_biological_databases_no_sql

FBW
8-05-2018
Biological Databases
Wim Van Criekinge

Data Warehousing and Decision Support

RMySQL Package
No need to parse the data – the Fetch function puts the
queried data directly into an R data.frame format!

What does NoSQL mean?
● NoSQL stands for:
– No Relational
– No RDBMS
– NonRel
– Not Only SQL
● NoSQL is
– An umbrella term for all databases and data stores that don't follow the RDBMS principles
  ● A class of products
  ● A collection of several (related) concepts about data storage and manipulation
– Often related to large data sets
Where does NoSQL come from?
● Non-relational DBMSs are not new
● But NoSQL represents a new incarnation
– Due to massively scalable Internet applications
– Based on distributed and parallel computing
● Development
– Starts with Google
  ● First research paper published in 2003
– Continues also thanks to Lucene's developers/Apache (Hadoop) and Amazon (Dynamo)
– Then a lot of products and interests came from Facebook, Netflix, Yahoo, eBay, Hulu, IBM, and many more

NoSQL and big data
● NoSQL comes from the Internet, thus it is often related to the "big data" concept
● How big are "big data"?
– Over a few terabytes (> 10^12 ≈ 2^40 bytes)
– Enough to start spanning multiple storage units
● Challenges
– Efficiently storing and accessing large amounts of data is difficult, even more so considering fault tolerance and backups
– Manipulating large data sets involves running immensely parallel processes
– Managing continuously evolving schema and metadata for semi-structured and unstructured data is difficult

Why are RDBMSs not suitable for big data?
● The context is the Internet
● RDBMSs assume that data are
– Dense
– Largely uniform (structured data)
● Data coming from the Internet are
– Massive and sparse
– Semi-structured or unstructured
● With massive sparse data sets, the typical storage mechanisms and access methods get stretched

NoSQL products' categories
● NoSQL products can be categorized in
– Key/value stores
– Sorted ordered column-oriented stores
– Document databases (JSON/XML)
– Graph databases

The Benefits of NoSQL
[https://www.mongodb.com/nosql-explained]
When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:
– Geographically distributed architecture instead of expensive, monolithic architecture
– Large volumes of rapidly changing structured, semi-structured, and unstructured data
– Agile sprints, quick schema iteration, and frequent code pushes
– Object-oriented programming that is easy to use and flexible

NoSQL Database Types
[https://www.mongodb.com/nosql-explained]
• Key-value stores are the simplest NoSQL databases. Every single item
in the database is stored as an attribute name (or 'key'), together with its
value. Examples of key-value stores are Riak and Berkeley DB.
• Wide-column stores such as Cassandra and HBase are optimized for
queries over large datasets, and store columns of data together, instead
of rows.
• Document databases pair each key with a complex data structure
known as a document.
• Graph stores are used to store information about networks of data, such
as social connections. Graph stores include Neo4J and triple stores like
Fuseki.

Key/value stores
Features
– Simple primitive data structure
– No predefined schema
– Limited query capabilities
– Dictionary-like functionality at large scale
[Figure: keys key1, key2, key3, each paired with a value]
Bioinformatics Use Case
– Word vectors in text mining
– Caching
Limitations
– Key lookup only, no generalized query
– Small number of attributes per entity

Key/value stores
● Store data in a schema-less way
● Store data as maps
– HashMaps or associative arrays
– Provide a very efficient average running time algorithm for accessing data
● Notable for:
– Couchbase (Zynga, Vimeo, NAVTEQ, ...)
– Redis (Craigslist, Instagram, StackOverflow, Flickr, ...)
– Amazon Dynamo (Amazon, Elsevier, IMDb, ...)
– Apache Cassandra (Facebook, Digg, Reddit, Twitter, ...)
– Voldemort (LinkedIn, eBay, ...)
– Riak (GitHub, Comcast, Mochi, ...)
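The "data as maps" idea can be sketched with a plain Python dictionary. This is a minimal in-memory illustration, not the API of any product listed above; the class name `KeyValueStore` and the keys used are hypothetical.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key/value store backed by a hash map (Python dict)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # Values are opaque to the store: no schema is declared or checked.
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        # Lookup by key is the only access path; average cost is O(1).
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("gene:BRCA1", {"chromosome": "17"})        # hypothetical key naming
store.put("cache:blast:query42", b"raw result bytes")  # cached, opaque bytes
print(store.get("gene:BRCA1"))
print(store.get("gene:TP53"))  # missing key: falls back to the default
```

There is no way to ask "which genes are on chromosome 17?" without scanning every value, which is the "key lookup only" limitation in practice.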

Sorted ordered column-oriented stores
● Data are stored in a column-oriented way
– Data efficiently stored
– Avoids consuming space for storing nulls
– Unit of data is a set of key/value pairs
  ● Identified by "row-key"
  ● Ordered and sorted based on row-key
– Columns are grouped in column-families
– Data isn't stored as a single table but is stored by column-families
● Notable for:
– Google's Bigtable (used in many Google services)
– HBase (Facebook, StumbleUpon, Hulu, Yahoo!, ...)
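A rough sketch of the ideas above: cells are grouped by column family, only existing cells are stored, and row keys are kept sorted so that range scans are cheap. The class `ColumnFamilyStore` and the sample row keys are hypothetical; this is not how Bigtable or HBase are implemented or called.

```python
from bisect import bisect_left
from typing import Dict, Iterator, List, Tuple


class ColumnFamilyStore:
    """Toy sorted, column-family-oriented store (illustration only)."""

    def __init__(self, families: List[str]) -> None:
        # One sparse map per column family: row_key -> {column: value}.
        self._families: Dict[str, Dict[str, Dict[str, str]]] = {f: {} for f in families}
        self._row_keys: List[str] = []  # kept sorted for range scans

    def put(self, row_key: str, family: str, column: str, value: str) -> None:
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        i = bisect_left(self._row_keys, row_key)
        if i == len(self._row_keys) or self._row_keys[i] != row_key:
            self._row_keys.insert(i, row_key)
        # Only cells that exist are stored, so missing values cost no space.
        self._families[family].setdefault(row_key, {})[column] = value

    def get_row(self, row_key: str, family: str) -> Dict[str, str]:
        return dict(self._families[family].get(row_key, {}))

    def scan(self, start: str, stop: str, family: str) -> Iterator[Tuple[str, Dict[str, str]]]:
        # Range scan over row keys in [start, stop), in sorted order.
        lo = bisect_left(self._row_keys, start)
        hi = bisect_left(self._row_keys, stop)
        for key in self._row_keys[lo:hi]:
            cells = self._families[family].get(key)
            if cells:
                yield key, dict(cells)


table = ColumnFamilyStore(["clinical", "expression"])
table.put("sample:0002", "expression", "geneB", "3.5")
table.put("sample:0001", "clinical", "sex", "M")
table.put("sample:0001", "expression", "geneA", "7.1")
table.put("sample:0003", "expression", "geneA", "0.4")
rows = list(table.scan("sample:0001", "sample:0003", "expression"))
print(rows)
```

Reading one family never touches the other, and a row with no cells in a family takes no space there.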

Features
– Groups attributes into column families
– Column families store key-value pairs
– Implemented as sparse multi-dimensional arrays
– Denormalized
– 10^4-10^6 columns; 10^9 rows
Bioinformatics Use Case
– Large studies
– Many experiments & data types
– Simulations
Limitations
– Operationally challenging
– Suitable for large number of servers

Document databases
● Documents
– Loosely structured sets of key/value pairs in documents, e.g., XML, JSON, BSON
– Encapsulate and encode data in some standard formats or encodings
– Are addressed in the database via a unique key
– Documents are treated as a whole, avoiding splitting a document into its constituent name/value pairs
● Allow retrieving documents by key or by content
● Notable for:
– MongoDB (used in Foursquare, GitHub, and more)
– CouchDB (used in Apple, BBC, Canonical, CERN, and more)

Document databases, JSON
{
  "ApacheLogRecord": {
    "ip": "127.0.0.1",
    "ident": "-",
    "http_user": "frank",
    "time": "10/Oct/2000:13:55:36 -0700",
    "request_line": {
      "http_method": "GET",
      "url": "/apache_pb.gif",
      "http_vers": "HTTP/1.0"
    },
    "http_response_code": "200",
    "http_response_size": "2326",
    "referrer": "http://www.example.com/start.html",
    "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"
  }
}

{
  subject_id: "F8273",
  age: "26",
  sex: "M",
  date_of_death: "12-Jan-1995",
  glycohemoglobin: "10%",
  BMI: 22,
  samples: [ {type: "Thoracic Aorta", AHA_score: 1},
             {type: "Abdominal Aorta", AHA_score: 2},
             {type: "LAD", AHA_score: 5} ],
  sequence: {seq_file: "F8273_08152014.bam",
             variant_file: "F8273_08152014.vcf"}
}
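Retrieval "by key or by content" can be sketched in Python with the standard `json` module. The document below is an abridged strict-JSON version of the subject record; the `_id` field, the `collection` dictionary and the helper `find_by_sample_score` are assumptions made for the illustration and do not correspond to a specific database API.

```python
import json
from typing import Any, Dict, List

# Abridged subject document as strict JSON. "_id" is an assumption added
# here to play the role of the unique key.
RAW = """
{
  "_id": "F8273",
  "subject_id": "F8273",
  "age": "26",
  "sex": "M",
  "BMI": 22,
  "samples": [
    {"type": "Thoracic Aorta", "AHA_score": 1},
    {"type": "Abdominal Aorta", "AHA_score": 2},
    {"type": "LAD", "AHA_score": 5}
  ]
}
"""

collection: Dict[str, Dict[str, Any]] = {}
doc = json.loads(RAW)
collection[doc["_id"]] = doc  # the whole document is stored under one key


def find_by_sample_score(docs: Dict[str, Dict[str, Any]], min_score: int) -> List[str]:
    """Return ids of documents with at least one sample scoring >= min_score."""
    return [
        doc_id
        for doc_id, d in docs.items()
        if any(s.get("AHA_score", 0) >= min_score for s in d.get("samples", []))
    ]


# Retrieval by key, then by content (a value nested inside the document).
print(collection["F8273"]["samples"][2]["type"])
print(find_by_sample_score(collection, 4))
```

The nested `samples` array travels with the subject, so no join is needed to answer the content query.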

Features
– JSON/XML structures
– Fields vary between docs
– No predefined schema
– Documents analogous to rows
– Collections analogous to tables
– Query capabilities
– Object-based query language
Bioinformatics Use Case
– Text mining
– Atherosclerosis
Limitations
– No joins
– No referential integrity checks
{
  id : <value>,
  <key> : <value>,
  <key> : <embedded document>,
  <key> : <array>
}

Graph databases
Features
– Highly normalized
– Graph-based query language (Gremlin)
– SQL-inspired query language (Cypher)
– Support for path finding and recursion
Bioinformatics Use Case
– Epidemiology simulations
– Interaction networks
Limitations
– Less suited for tabular data
Property Graph Model
[Figure: three nodes, each carrying its own properties]
– Node: name: the Doctor; age: 907; species: Time Lord
– Node: first name: Rose; last name: Tyler
– Node: vehicle: tardis; model: Type 40
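A property graph and a path-finding query can be sketched in a few lines of Python. The node properties come from the slide; the edge labels `COMPANION_OF` and `OWNS` are assumptions made for the illustration, since the slide text does not preserve the relationships. A graph database would express the search in Gremlin or Cypher; here it is a plain breadth-first search.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Nodes carry arbitrary properties (taken from the slide).
nodes: Dict[str, Dict[str, object]] = {
    "doctor": {"name": "the Doctor", "age": 907, "species": "Time Lord"},
    "rose": {"first name": "Rose", "last name": "Tyler"},
    "tardis": {"vehicle": "tardis", "model": "Type 40"},
}

# Directed, labelled edges. The labels are assumptions for illustration.
edges: List[Tuple[str, str, str]] = [
    ("rose", "COMPANION_OF", "doctor"),
    ("doctor", "OWNS", "tardis"),
]


def shortest_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search following edge direction; returns node ids or None."""
    adjacency: Dict[str, List[str]] = {}
    for src, _label, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path("rose", "tardis"))  # follows two edges
print(shortest_path("tardis", "rose"))  # no path against edge direction
```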

Modeling NoSQL stores
● NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
– Relational modeling is typically driven by the structure of available data. The main design theme is "What answers do I have?"
– NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is "What questions do I have?"
● Data duplication and denormalization are first-class citizens
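Query-driven modeling can be illustrated with a small Python sketch. Starting from two normalized "tables", the data are reshaped for one specific question ("all samples of a given tissue, with the subject's sex"), duplicating subject attributes on purpose. All names and records here are hypothetical toy data.

```python
from collections import defaultdict
from typing import Dict, List

# Relational-style, normalized data: two "tables" linked by subject_id.
subjects = [
    {"subject_id": "S1", "sex": "M"},
    {"subject_id": "S2", "sex": "F"},
]
samples = [
    {"sample_id": "A", "subject_id": "S1", "tissue": "aorta"},
    {"sample_id": "B", "subject_id": "S2", "tissue": "aorta"},
    {"sample_id": "C", "subject_id": "S1", "tissue": "liver"},
]


def build_by_tissue(subjects: List[dict], samples: List[dict]) -> Dict[str, List[dict]]:
    """Denormalize for the query 'all samples of tissue X, with subject sex'."""
    sex_of = {s["subject_id"]: s["sex"] for s in subjects}
    by_tissue: Dict[str, List[dict]] = defaultdict(list)
    for smp in samples:
        # Subject attributes are copied into each entry (deliberate duplication),
        # so the read path needs no join.
        by_tissue[smp["tissue"]].append(
            {
                "sample_id": smp["sample_id"],
                "subject_id": smp["subject_id"],
                "sex": sex_of[smp["subject_id"]],
            }
        )
    return dict(by_tissue)


by_tissue = build_by_tissue(subjects, samples)
print(by_tissue["aorta"])
```

The cost is on the write side: if a subject record changes, every copy has to be updated.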

Querying NoSQL stores
● Different NoSQL stores provide different querying tools and features
– From "simple" filtering of data based on "columns" names/values (MongoDB, HBase, Redis, ...)
– To SQL-like languages (Google App Engine, HyperTable, Hive, ...)

NoSQL, No ACID
● RDBMSs are based on ACID (Atomicity, Consistency, Isolation, and Durability) properties
● NoSQL
– Does not give importance to ACID properties
– In some cases completely ignores them
● In distributed parallel systems it is difficult/impossible to ensure ACID properties
– Even with a central coordinator
– 2PL, 2PC and SS2PL can help
● Long-running transactions don't work because keeping resources blocked for a long time is not practical
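As an illustration of why a central coordinator is costly, two-phase commit (2PC) can be sketched as follows. This is a simplified single-process toy with hypothetical class and function names; it has no network, timeouts, or crash recovery, which are exactly the hard parts in a real distributed system.

```python
from typing import Dict, List


class Participant:
    """Toy resource manager taking part in two-phase commit (2PC)."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"
        self.data: Dict[str, str] = {}
        self._pending: Dict[str, str] = {}

    def prepare(self, changes: Dict[str, str]) -> bool:
        # Phase 1: stage the changes and vote. In a real system the
        # resources stay locked from here until phase 2 completes.
        if not self.can_commit:
            self.state = "aborted"
            return False
        self._pending = dict(changes)
        self.state = "prepared"
        return True

    def commit(self) -> None:
        self.data.update(self._pending)
        self._pending = {}
        self.state = "committed"

    def rollback(self) -> None:
        self._pending = {}
        self.state = "aborted"


def two_phase_commit(participants: List[Participant], changes: Dict[str, str]) -> bool:
    """Coordinator: commit everywhere only if every participant votes yes."""
    votes = [p.prepare(changes) for p in participants]  # phase 1: voting
    if all(votes):
        for p in participants:  # phase 2: commit
            p.commit()
        return True
    for p in participants:  # phase 2: abort
        p.rollback()
    return False


a, b = Participant("a"), Participant("b")
ok = two_phase_commit([a, b], {"x": "1"})

c, d = Participant("c"), Participant("d", can_commit=False)
failed = two_phase_commit([c, d], {"x": "1"})
print(ok, failed)
```

Between `prepare` and `commit` every participant has to hold its staged changes, which is the blocking that makes long-running transactions impractical.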

CAP Theorem
● A congruent and logical way for assessing the problems involved in assuring ACID-like guarantees in distributed systems is provided by the CAP theorem
– At most two of the following three can be maximized at one time
  ● Consistency - Each client has the same view of the data
  ● Availability - Each client can always read and write
  ● Partition tolerance - System works well across distributed physical networks
– Conjectured by Eric Brewer in 2000
– Proved by Seth Gilbert and Nancy Lynch in 2002
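The trade-off can be made concrete with a toy two-replica store in Python. When the replicas cannot talk to each other, a "CP" configuration refuses writes (giving up availability) while an "AP" configuration accepts them and lets the replicas diverge (giving up consistency). The class names and the `mode` flag are hypothetical; real systems offer much finer-grained choices.

```python
from typing import Dict, Optional


class Replica:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


class TwoReplicaStore:
    """Toy two-replica store showing the choice forced by a network partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_index: int, key: str, value: str) -> bool:
        if not self.partitioned:
            for r in self.replicas:  # normal case: replicate synchronously
                r.data[key] = value
            return True
        if self.mode == "CP":
            return False  # stay consistent by refusing the write (unavailable)
        # AP: accept the write locally; the replicas now diverge.
        self.replicas[replica_index].data[key] = value
        return True

    def read(self, replica_index: int, key: str) -> Optional[str]:
        return self.replicas[replica_index].data.get(key)


cp = TwoReplicaStore("CP")
cp.write(0, "k", "v1")
cp.partitioned = True
cp_accepted = cp.write(0, "k", "v2")

ap = TwoReplicaStore("AP")
ap.write(0, "k", "v1")
ap.partitioned = True
ap_accepted = ap.write(0, "k", "v2")

print(cp_accepted, cp.read(0, "k"), cp.read(1, "k"))
print(ap_accepted, ap.read(0, "k"), ap.read(1, "k"))
```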

References
● Tiwari, Shashank. Professional NoSQL. Wrox, 2011.
● Warden, Pete. Big Data Glossary. O'Reilly Media, 2011.
● Vogels, Werner (Amazon.com's CTO). All Things Distributed. Werner Vogels' weblog on building scalable and robust distributed systems. http://www.allthingsdistributed.com/
● Katsov, Ilya. NoSQL Data Modeling Techniques. http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
● Bushik, Sergey. A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak. October 2012. Available online.
● Gilbert, Seth and Lynch, Nancy. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33.2 (2002): 51-59.
● Redmond, Eric, Wilson, Jim R., and Carter, Jacquelyn. Seven databases in seven weeks: a guide to modern databases and the NoSQL movement. The Pragmatic Programmers, LLC, 2012.
