FBW
8-05-2018
Biological Databases
Wim Van Criekinge
SPARQL 1/2
SPARQL 2/2
noSQL
Data Warehousing and Decision Support
RMySQL Package
No need to parse the data – the Fetch function puts the
queried data directly into an R data.frame format!
What does NoSQL mean?
● NoSQL stands for:
– No Relational
– No RDBMS
– NonRel
– Not Only SQL
● NoSQL is
– An umbrella term for all databases and data stores that don’t follow the RDBMS principles
  ● A class of products
  ● A collection of several (related) concepts about data storage and manipulation
– Often related to large data sets
Where does NoSQL come from?
● Non-relational DBMSs are not new
● But NoSQL represents a new incarnation
– Due to massively scalable Internet applications
– Based on distributed and parallel computing
● Development
– Starts with Google
  ● First research paper published in 2003
– Continues also thanks to Lucene's developers/Apache (Hadoop) and Amazon (Dynamo)
– Then a lot of products and interest came from Facebook, Netflix, Yahoo, eBay, Hulu, IBM, and many more
NoSQL and big data
● NoSQL comes from the Internet, thus it is often related to the “big data” concept
● How big is “big data”?
– Over a few terabytes (> 10^12 ≈ 2^40 bytes)
– Enough to start spanning multiple storage units
● Challenges
– Efficiently storing and accessing large amounts of data is difficult, even more so when considering fault tolerance and backups
– Manipulating large data sets involves running immensely parallel processes
– Managing continuously evolving schema and metadata for semi-structured and unstructured data is difficult
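The terabyte threshold quoted above can be checked with a little arithmetic. This is only a sketch, and it assumes the sizes are counted in bytes, which the slide does not state explicitly.

```python
# Compare the decimal terabyte (10^12) with the binary tebibyte (2^40).
# Assumption: both figures are counts of bytes.
TERABYTE = 10 ** 12
TEBIBYTE = 2 ** 40

# How much larger 2^40 is than 10^12.
ratio = TEBIBYTE / TERABYTE

print(f"10^12 = {TERABYTE:,} bytes")
print(f"2^40  = {TEBIBYTE:,} bytes")
print(f"2^40 / 10^12 = {ratio:.4f}")
```

The two values differ by roughly ten percent, which is why the slide treats them as approximately equal.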
Why are RDBMSs not suitable for big data?
● The context is the Internet
● RDBMSs assume that data are
– Dense
– Largely uniform (structured data)
● Data coming from the Internet are
– Massive and sparse
– Semi-structured or unstructured
● With massive sparse data sets, the typical storage mechanisms and access methods get stretched
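To make the dense-versus-sparse point concrete, the sketch below lays the same records out as a dense table and as sparse key/value sets. The records and attribute names are hypothetical and chosen only for illustration.

```python
# Hypothetical records: each one fills only a few of the possible attributes.
records = [
    {"id": 1, "gene": "BRCA1", "expression": 2.4},
    {"id": 2, "gene": "TP53", "methylation": 0.8},
    {"id": 3, "variant": "rs123"},
]

# A relational table needs one column per attribute that appears anywhere.
columns = sorted({key for record in records for key in record})

# Dense layout: every row has every column; missing values become None (NULL).
dense_rows = [[record.get(col) for col in columns] for record in records]

total_cells = len(dense_rows) * len(columns)
null_cells = sum(cell is None for row in dense_rows for cell in row)
stored_pairs = sum(len(record) for record in records)

print(columns)
print(f"dense table: {total_cells} cells, {null_cells} of them NULL")
print(f"sparse storage: {stored_pairs} key/value pairs")
```

With only three records, nearly half of the dense table is already NULL; the effect grows as the number of distinct attributes increases.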
NoSQL products' categories
● NoSQL products can be categorized in
– Key/value stores
– Sorted ordered column-oriented stores
– Document databases (JSON/XML)
– Graph databases
The Benefits of NoSQL
[https://www.mongodb.com/nosql-explained]
When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:
– Geographically distributed architecture instead of expensive, monolithic architecture
– Large volumes of rapidly changing structured, semi-structured, and unstructured data
– Agile sprints, quick schema iteration, and frequent code pushes
– Object-oriented programming that is easy to use and flexible
NoSQL Database Types
[https://www.mongodb.com/nosql-explained]
• Key-value stores are the simplest NoSQL databases. Every single item
in the database is stored as an attribute name (or 'key'), together with its
value. Examples of key-value stores are Riak and Berkeley DB.
• Wide-column stores such as Cassandra and HBase are optimized for
queries over large datasets, and store columns of data together, instead
of rows.
• Document databases pair each key with a complex data structure
known as a document.
• Graph stores are used to store information about networks of data, such
as social connections. Graph stores include Neo4J and triple stores like
Fuseki.
Key/value stores
Features
– Simple primitive data structure
– No predefined schema
– Limited query capabilities
– Dictionary-like functionality at large scale
[Figure: keys (key1, key2, key3), each mapped to a value]
Bioinformatics Use Case
– Word vectors in text mining
– Caching
Limitations
– Key lookup only, no generalized query
– Small number of attributes per entity
Key/value stores
● Store data in a schema-less way
● Store data as maps
– HashMaps or associative arrays
– Provide a very efficient average running time algorithm for accessing data
● Notable products:
– Couchbase (Zynga, Vimeo, NAVTEQ, ...)
– Redis (Craigslist, Instagram, StackOverflow, flickr, ...)
– Amazon Dynamo (Amazon, Elsevier, IMDb, ...)
– Apache Cassandra (Facebook, Digg, Reddit, Twitter, ...)
– Voldemort (LinkedIn, eBay, …)
– Riak (GitHub, Comcast, Mochi, ...)
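The map-based access pattern can be sketched in a few lines. The class below is a toy in-memory illustration built on a Python dict (a hash map); the class name and method names are hypothetical and do not correspond to the API of any product listed above.

```python
class KeyValueStore:
    """A toy in-memory key/value store backed by a dict (a hash map)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Average-case constant-time insert or overwrite.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup by key only: there is no way to query by value.
        return self._data.get(key, default)

    def delete(self, key):
        # Removing a missing key is silently ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "frank"})
store.put("counter", 7)

print(store.get("session:42"))
print(store.get("missing", "not found"))

store.delete("counter")
print(store.get("counter"))
```

Note that the store never inspects the values it holds, which is what keeps it schema-less and also what limits it to key lookups.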
Sorted ordered column-oriented stores
● Data are stored in a column-oriented way
– Data efficiently stored
– Avoids consuming space for storing nulls
– Unit of data is a set of key/value pairs
  ● Identified by “row-key”
  ● Ordered and sorted based on row-key
– Columns are grouped in column-families
– Data isn’t stored as a single table but is stored by column-families
● Notable products:
– Google's Bigtable (used in all Google's services)
– HBase (Facebook, StumbleUpon, Hulu, Yahoo!, ...)
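The ideas of row-keys, column-families, and sorted order can be sketched as follows. This is a simplified in-memory illustration, not how Bigtable or HBase are implemented; the class, its methods, and the sample data are all hypothetical.

```python
import bisect
from collections import defaultdict


class ColumnFamilyStore:
    """Toy store: data is kept per column family, rows are sorted by row-key."""

    def __init__(self, families):
        # One separate map per column family: row-key -> {column: value}.
        self._families = {name: defaultdict(dict) for name in families}
        self._row_keys = []  # kept in sorted order

    def put(self, row_key, family, column, value):
        # Insert the row-key at its sorted position if it is new.
        idx = bisect.bisect_left(self._row_keys, row_key)
        if idx == len(self._row_keys) or self._row_keys[idx] != row_key:
            self._row_keys.insert(idx, row_key)
        self._families[family][row_key][column] = value

    def get(self, row_key, family):
        # Absent cells cost nothing: they are simply not stored.
        return dict(self._families[family].get(row_key, {}))

    def scan(self, start, stop):
        # Range scan over sorted row-keys, start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, stop)
        return self._row_keys[lo:hi]


store = ColumnFamilyStore(["clinical", "sequence"])
store.put("S003", "clinical", "age", 61)
store.put("S001", "clinical", "age", 26)
store.put("S001", "sequence", "bam", "S001.bam")
store.put("S002", "clinical", "sex", "F")

print(store.scan("S001", "S003"))
print(store.get("S001", "sequence"))
print(store.get("S002", "sequence"))
```

Keeping row-keys sorted is what makes range scans cheap, and keeping each family in its own map mirrors the point that data is not stored as one single table.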
Column-oriented store example
Features
– Groups attributes into column families
– Column families store key-value pairs
– Implemented as sparse multi-dimensional arrays
– Denormalized
– 10^4-10^6 columns; 10^9 rows
Bioinformatics Use Case
– Large studies
– Many experiments & data types
– Simulations
Limitations
– Operationally challenging
– Suitable for a large number of servers
Document databases
● Documents
– Loosely structured sets of key/value pairs in documents, e.g., XML, JSON, BSON
– Encapsulate and encode data in some standard formats or encodings
– Are addressed in the database via a unique key
– Documents are treated as a whole, avoiding splitting a document into its constituent name/value pairs
● Allow document retrieval by keys or contents
● Notable products:
– MongoDB (used in FourSquare, GitHub, and more)
– CouchDB (used in Apple, BBC, Canonical, CERN, and more)
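Retrieval by key versus retrieval by content can be sketched as below. This is a toy in-memory illustration that assumes JSON as the encoding; the class and method names are hypothetical and are not the MongoDB or CouchDB API.

```python
import json


class DocumentStore:
    """Toy document store: each document is kept whole, as encoded JSON."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        # The document is encoded and stored as one unit under a unique key.
        self._docs[key] = json.dumps(document)

    def get(self, key):
        # Retrieval by key.
        raw = self._docs.get(key)
        return None if raw is None else json.loads(raw)

    def find(self, **criteria):
        # Retrieval by content: match top-level fields against given values.
        matches = []
        for raw in self._docs.values():
            doc = json.loads(raw)
            if all(doc.get(field) == value for field, value in criteria.items()):
                matches.append(doc)
        return matches


store = DocumentStore()
store.insert("F8273", {"subject_id": "F8273", "sex": "M", "BMI": 22})
store.insert("F9001", {"subject_id": "F9001", "sex": "F"})  # no BMI field

print(store.get("F8273"))
print(store.find(sex="F"))
print(store.find(BMI=22))
```

The second document omits a field that the first one has, which the store accepts without complaint because no schema is defined in advance.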
Document databases, JSON
{
  "ApacheLogRecord": {
    "ip": "127.0.0.1",
    "ident": "-",
    "http_user": "frank",
    "time": "10/Oct/2000:13:55:36 -0700",
    "request_line": {
      "http_method": "GET",
      "url": "/apache_pb.gif",
      "http_vers": "HTTP/1.0"
    },
    "http_response_code": "200",
    "http_response_size": "2326",
    "referrer": "http://www.example.com/start.html",
    "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"
  }
}
{
  subject_id: "F8273",
  age: "26",
  sex: "M",
  date_of_death: "12-Jan-1995",
  glycohemoglobin: "10%",
  BMI: 22,
  samples: [ {type: "Thoracic Aorta", AHA_score: 1},
             {type: "Abdominal Aorta", AHA_score: 2},
             {type: "LAD", AHA_score: 5} ],
  sequence: {seq_file: "F8273_08152014.bam",
             variant_file: "F8273_08152014.vcf"}
}
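Because a document carries its embedded arrays and sub-documents with it, an application can work with nested data without joins. The sketch below represents part of the subject document above as a Python dict; it assumes the score field is spelled AHA_score consistently in every sample.

```python
import json

# Part of the subject document from the slide, as a Python dict.
# Assumption: every sample uses the same field name, AHA_score.
subject = {
    "subject_id": "F8273",
    "BMI": 22,
    "samples": [
        {"type": "Thoracic Aorta", "AHA_score": 1},
        {"type": "Abdominal Aorta", "AHA_score": 2},
        {"type": "LAD", "AHA_score": 5},
    ],
    "sequence": {
        "seq_file": "F8273_08152014.bam",
        "variant_file": "F8273_08152014.vcf",
    },
}

# Embedded array: find the sample with the highest score.
worst = max(subject["samples"], key=lambda sample: sample["AHA_score"])

# Embedded document: reach a nested field directly.
variant_file = subject["sequence"]["variant_file"]

# The whole document survives a JSON encode/decode round trip unchanged.
round_trip = json.loads(json.dumps(subject))

print(worst["type"], worst["AHA_score"])
print(variant_file)
print(round_trip == subject)
```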
Features
– JSON/XML structures
– Fields vary between docs
– No predefined schema
– Documents analogous to rows
– Collections analogous to tables
– Query capabilities
Bioinformatics Use Case
– Text mining
– Atherosclerosis
Limitations
– No joins
– No referential integrity checks
Object-based query language
{
  id: <value>,
  <key>: <value>,
  <key>: <embedded document>,
  <key>: <array>
}
Graph databases
Features
– Highly normalized
– Graph-based query language (Gremlin)
– SQL-inspired query language (Cypher)
– Support for path finding and recursion
Bioinformatics Use Case
– Epidemiology simulations
– Interaction networks
Limitations
– Less suited for tabular data
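Property Graph Model
[Figure: example property graph with three nodes, each carrying its own properties]
– Node: name: the Doctor; age: 907; species: Time Lord
– Node: first name: Rose; last name: Tyler
– Node: vehicle: tardis; model: Type 40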
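A property graph and a simple path search over it can be sketched as follows. The node properties come from the slide, but the relationships between the nodes are not recoverable from the text, so the edges and their labels below are assumptions made purely for illustration.

```python
from collections import deque

# Nodes with their properties, as listed on the slide.
nodes = {
    "doctor": {"name": "the Doctor", "age": 907, "species": "Time Lord"},
    "rose": {"first name": "Rose", "last name": "Tyler"},
    "tardis": {"vehicle": "tardis", "model": "Type 40"},
}

# Hypothetical directed, labeled relationships (not taken from the slide).
edges = [
    ("rose", "COMPANION_OF", "doctor"),
    ("doctor", "OWNS", "tardis"),
]


def neighbors(node_id):
    """Return the nodes reachable from node_id over one outgoing edge."""
    return [target for source, _label, target in edges if source == node_id]


def shortest_path(start, goal):
    """Breadth-first search; returns a list of node ids or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path("rose", "tardis"))
print(shortest_path("tardis", "rose"))
```

Graph query languages such as Gremlin and Cypher express this kind of traversal declaratively; the breadth-first search here only shows what path finding means.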
Modeling NoSQL stores
● NoSQL data modeling often starts from the application-specific queries as opposed to relational modeling:
– Relational modeling is typically driven by the structure of available data. The main design theme is “What answers do I have?”
– NoSQL data modeling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is “What questions do I have?”
● Data duplication and denormalization are first-class citizens
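The difference between the two design themes can be sketched with a single question: "which samples belong to a given subject?" The data and function names below are hypothetical; the first layout is normalized, the second is denormalized around the question.

```python
# Relational style: normalized, driven by the structure of the data.
subjects = [{"subject_id": "F8273", "sex": "M"}]
samples = [
    {"subject_id": "F8273", "type": "Thoracic Aorta"},
    {"subject_id": "F8273", "type": "LAD"},
]


def samples_by_join(subject_id):
    # Answering the question requires combining two collections.
    return [s["type"] for s in samples if s["subject_id"] == subject_id]


# NoSQL style: denormalized, driven by the query to be supported.
# The sample types are duplicated inside the subject document.
subject_docs = {
    "F8273": {"sex": "M", "samples": ["Thoracic Aorta", "LAD"]},
}


def samples_by_lookup(subject_id):
    # Answering the question is a single key lookup.
    return subject_docs[subject_id]["samples"]


print(samples_by_join("F8273"))
print(samples_by_lookup("F8273"))
```

The lookup is simpler and faster, at the cost of keeping the duplicated data in sync whenever samples change.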
Querying NoSQL stores
● Different NoSQL stores provide different querying tools and features
– From “simple” filtering of data based on “column” names/values (MongoDB, HBase, Redis, …)
– To SQL-like languages (Google App Engine, HyperTable, Hive, ...)
NoSQL, No ACID
● RDBMSs are based on ACID (Atomicity, Consistency, Isolation, and Durability) properties
● NoSQL
– Does not give importance to ACID properties
– In some cases completely ignores them
● In distributed parallel systems it is difficult/impossible to ensure ACID properties
– Even with a central coordinator
– 2PL, 2PC and SS2PL can help
● Long-running transactions don't work because keeping resources blocked for a long time is not practical
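Two-phase commit (2PC), named above as one of the protocols that can help, can be sketched in its simplest form as follows. This is a minimal single-process illustration with hypothetical class and function names; it ignores timeouts, crashes, and network failures, which are exactly what makes the protocol hard in practice.

```python
class Participant:
    """A node that votes on, and then applies, a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes (and hold resources) or vote no.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Central coordinator: commit only if every participant votes yes."""
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit everywhere or abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


ok_nodes = [Participant("a"), Participant("b")]
print(two_phase_commit(ok_nodes), [p.state for p in ok_nodes])

bad_nodes = [Participant("a"), Participant("b", can_commit=False)]
print(two_phase_commit(bad_nodes), [p.state for p in bad_nodes])
```

Between the two phases every prepared participant keeps its resources locked while waiting for the coordinator, which illustrates why long-running transactions are impractical.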
CAP Theorem
● A congruent and logical way for assessing the problems involved in assuring ACID-like guarantees in distributed systems is provided by the CAP theorem
– At most two of the following three can be maximized at one time
  ● Consistency - Each client has the same view of the data
  ● Availability - Each client can always read and write
  ● Partition tolerance - System works well across distributed physical networks
– Conjectured by Eric Brewer in 2000
– Proved by Seth Gilbert and Nancy Lynch in 2002
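The trade-off can be illustrated with a toy two-replica store. This is an illustration only, not a proof, and the class and its "CP"/"AP" modes are hypothetical: during a network partition the store either refuses writes (staying consistent, giving up availability) or accepts them locally (staying available, giving up consistency).

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy store with two replicas and a simulated network partition."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (hypothetical labels)
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False

    def write(self, replica, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: consistent, but not available.
                return False
            # Accept locally: available, but the replicas may now diverge.
            replica.value = value
            return True
        # No partition: the write reaches both replicas.
        self.a.value = self.b.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value


cp = TwoNodeStore("CP")
cp.write(cp.a, 1)
cp.partitioned = True
print(cp.write(cp.a, 2), cp.consistent())

ap = TwoNodeStore("AP")
ap.write(ap.a, 1)
ap.partitioned = True
print(ap.write(ap.a, 2), ap.consistent())
```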
References
● Tiwari, Shashank. Professional NoSQL. Wrox, 2011.
● Warden, Pete. Big Data Glossary. O'Reilly Media, 2011.
● Vogels, Werner (Amazon.com's CTO). All Things Distributed. Werner Vogels' weblog on building scalable and robust distributed systems. http://www.allthingsdistributed.com/
● Katsov, Ilya. NoSQL Data Modeling Techniques. http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
● Bushik, Sergey. A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak. October 2012. Available online.
● Gilbert, Seth and Lynch, Nancy. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News 33.2 (2002): 51-59.
● Redmond, Eric, Wilson, Jim R., and Carter, Jacquelyn. Seven databases in seven weeks: a guide to modern databases and the NoSQL movement. The Pragmatic Programmers, LLC, 2012.