NOSQL in Media
Sander Kieft
About me
• Manager Core Services at Sanoma
• Responsible for all common services, including the Big Data platform
• Work:
– Centralized services
– Data platform
– Search
• Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
24 April 2015
Sanoma, B2C Publishing and Learning company
• 2+100: 2 Finnish newspapers, over 100 magazines
• 5 TV channels in Finland and The Netherlands
• 200+ websites
• 100 mobile applications on various mobile platforms
Use Cases for NoSQL in Media
Not Only SQL
Generic vs specialized solutions
Specialized focus
• Data models
• Speed
• Scalability
• Partition tolerance
• Availability / Redundancy
• Cost per GB
CAP Theorem
• The CAP (or Brewer's) theorem says:
“It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
– Consistency
– Availability
– Partition tolerance”
CAP Theorem
• Availability: each client can always read and write
• Consistency: all clients always have the same view of the data
• Partition tolerance: the system works well despite physical network partitions
• RDBMS (MySQL, Postgres, MS SQL, Oracle) on one side of the triangle, NOSQL on the other two sides
Eventual consistency
-- Werner Vogels, CTO Amazon
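A minimal sketch of what eventual consistency can look like in practice, assuming a last-write-wins rule based on timestamps. The `Replica` class, its methods and the sample values are hypothetical and not taken from the talk; real stores use more careful conflict handling.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node holding key -> (timestamp, value)."""
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Last-write-wins (an assumption): keep the entry with the newest timestamp.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Anti-entropy step: pull every entry the other replica has."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = Replica(), Replica()
a.write("headline", "Draft title", timestamp=1)
b.write("headline", "Final title", timestamp=2)

stale = a.read("headline")  # the replicas disagree until they exchange state
a.sync_from(b)
b.sync_from(a)
converged = a.read("headline")
```

Both replicas accept writes at any time, which favours availability; clients may briefly see the older value until the replicas have synchronised.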
Various data models
• key-value
• column
• document stores
• map/reduce
• graph
• search
• blob storage
Key/value stores
Photo credits: John Chulick - https://www.flickr.com/photos/chulickphotos/8234894686/
Key/value stores
• Stores objects by key
• Based on the Dynamo paper (Werner Vogels)
• Products:
– Riak
– Memcached/Membase
– Tokyo Cabinet
– Redis
– Voldemort
• Use cases:
– Counting
– Top lists
– Caches
– Pre-calculated optimizations
Key/Value buckets
Bucket                   A      B      C
User                     XXXX   YYYY   ZZZZ
Article                  100    200    300
Article_<5 min. TIME>    50     100    150
Real time stats
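The counting and top-list use cases can be sketched with plain counters. This is an illustration only: a `Counter` stands in for a key-value store with atomic increments, and the key layout (`user:…`, `article:…`, `article:…:<bucket>`) and function names are assumptions, not the scheme used at Sanoma.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # five-minute windows, as in "Article_<5 min. TIME>"

kv = Counter()  # stands in for a key-value store with atomic increments


def bucket_start(ts: datetime) -> str:
    """Round a timestamp down to the start of its five-minute window."""
    epoch = int(ts.timestamp())
    return str(epoch - epoch % BUCKET_SECONDS)


def record_view(user_id: str, article_id: str, ts: datetime) -> None:
    kv[f"user:{user_id}"] += 1
    kv[f"article:{article_id}"] += 1
    kv[f"article:{article_id}:{bucket_start(ts)}"] += 1


def top_articles(n: int):
    """Top list over the all-time article counters (keys with one colon)."""
    totals = {k: v for k, v in kv.items()
              if k.startswith("article:") and k.count(":") == 1}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


t0 = datetime(2015, 4, 24, 12, 1, tzinfo=timezone.utc)
t1 = datetime(2015, 4, 24, 12, 7, tzinfo=timezone.utc)
record_view("XXXX", "100", t0)
record_view("YYYY", "100", t1)
record_view("XXXX", "200", t1)
```

Because every statistic is a single increment on a precomputed key, reads for dashboards are simple lookups rather than aggregations.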
Document stores
• Stores “records” as documents
• Versioning
• Easy sharding (documents are self-contained)
• Products:
– MongoDB
– CouchDB
– SimpleDB
• Use cases:
– CMS
– Meta data
– Product catalog
From relational data model to document
Product
Properties
Application
Property
Property
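As a rough illustration of moving from a relational model to a document, the sketch below folds property rows into their product. The row layout, field names and sample values are hypothetical; the slide only names Product, Properties and Application.

```python
import json

# Relational shape (assumed): one product row plus one row per property.
product_rows = [
    {"product_id": 1, "name": "Example magazine", "application": "CMS"},
]
property_rows = [
    {"product_id": 1, "key": "language", "value": "nl"},
    {"product_id": 1, "key": "frequency", "value": "weekly"},
]


def to_document(product, properties):
    """Fold the property rows into the product so the document is self-contained."""
    return {
        "_id": product["product_id"],
        "name": product["name"],
        "application": product["application"],
        "properties": {
            p["key"]: p["value"]
            for p in properties
            if p["product_id"] == product["product_id"]
        },
    }


documents = [to_document(p, property_rows) for p in product_rows]
print(json.dumps(documents[0], indent=2))
```

A self-contained document needs no join at read time, which is also what makes sharding by document straightforward.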
Architecture Content Platform
Diagram columns: Sources, Services, Solutions, Products
Diagram labels: MyJour; Item Based Framework; ….; CMS; Content Platform Core; Search (Solr); Blob Storage (S3 & MT); Article storage (MongoDB); Analyse; CMS; CMS; Editorial reuse-interface; ePub; Digital Template system; WoodWing; Content Portal; Feeds; Noma; Viva; PDF Based Framework; ….; HomeDeco; ??; ??; ??; ??; eLinea; Blendle; Google Currents; LINDA. nieuws; NU.nl search
Column stores
• Lineage: Google's BigTable paper
• Records with many, many columns
• Distinguish between hot and cold data
• Versioning
• Records and columns can be sharded
• Products:
– HBase
– Cassandra
– Hypertable
• Use cases:
– Analytics
– Messages
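A small in-memory sketch of the wide-row, versioned-cell idea behind column stores. `WideRowTable` and the column names are made up for illustration; this is not the API of HBase, Cassandra or Hypertable.

```python
from collections import defaultdict


class WideRowTable:
    """row key -> column name -> list of (timestamp, value), oldest first."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        cells = self.rows[row_key][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0])  # keep versions ordered by time

    def get(self, row_key, column, as_of=None):
        """Return the newest value, or the newest one at or before `as_of`."""
        cells = self.rows.get(row_key, {}).get(column, [])
        if as_of is not None:
            cells = [cell for cell in cells if cell[0] <= as_of]
        return cells[-1][1] if cells else None

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))


table = WideRowTable()
table.put("user:XXXX", "stats:pageviews", 10, timestamp=1)
table.put("user:XXXX", "stats:pageviews", 12, timestamp=2)
table.put("user:XXXX", "msg:2015-04-24", "Welcome", timestamp=2)

latest = table.get("user:XXXX", "stats:pageviews")
earlier = table.get("user:XXXX", "stats:pageviews", as_of=1)
```

Each row can carry a different set of columns, and older versions of a cell stay readable by timestamp.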
Big Data
• Lineage: Google GFS & Map/Reduce
• Distributed data storage and processing
• Advanced analytics capabilities on raw data
• Schema on read
• Products:
– Hadoop
– MPP databases
• Use cases:
– Ad hoc querying of terabytes of data
– Data science: predictive analytics, model training
– Calculating recommendations
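Schema on read and map/reduce can be illustrated on a few raw log lines. The tab-separated log format and the function names below are assumptions for the sake of the example, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict
from itertools import chain

# Raw events are stored untouched; a hypothetical tab-separated click log.
raw_lines = [
    "2015-04-24T12:01\tXXXX\tarticle-100",
    "2015-04-24T12:02\tYYYY\tarticle-100",
    "malformed line",
    "2015-04-24T12:07\tXXXX\tarticle-200",
]


def parse(line):
    """Schema on read: structure is applied only when the data is queried."""
    parts = line.split("\t")
    if len(parts) != 3:
        return None  # bad records are skipped at read time, not at load time
    return {"time": parts[0], "user": parts[1], "article": parts[2]}


def map_phase(line):
    """Emit (article, 1) for every line that parses."""
    record = parse(line)
    return [] if record is None else [(record["article"], 1)]


def reduce_phase(pairs):
    """Sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


views_per_article = reduce_phase(chain.from_iterable(map(map_phase, raw_lines)))
```

A different question about the same raw lines only needs a different `parse` or map function; nothing has to be reloaded.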
Big Data at Sanoma
• Main use case is reporting and analytics, moving towards data science
• A/B and MVT (multivariate) test evaluations
• Using QlikView as a front-end
• Supplies data to other environments (SAS, Advertising, Behavioral Targeting)
• Agile process for adding sources, from raw to intermediate to modeled data warehouse
• Sanoma standard data platform, used in all Sanoma countries
• > 250 dashboard users
• 40 daily users: analysts & developers
• 43 source systems, with 125 different sources
• 400 tables in Hive
• Platform:
– Cloudera Hadoop
– 40-60 nodes
– > 400TB storage
– ~2000 jobs/day
• Typical data node / task tracker:
– 1-2 CPUs, 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
Sanoma Data lake
Traditional BI vs Big Data approach
24.4.2015 © Sanoma Media
Search
Photo credits: http://www.flickr.com/photos/emyanmei/8223998414/
Search
• Keyword search can be combined with advanced forms of ranking the results
• Most of the fields go to an index
• Facets can be used for analytics
• Ranker can be replaced with custom logic
• Products:
– Solr
– ElasticSearch
– MarkLogic
• Use cases:
– Content Search
– Analytics / Faceted
– Percolation
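To make the index, ranking and facet ideas concrete, here is a toy inverted index. The documents, the term-count scoring and the helper names are invented for the example; Solr and ElasticSearch use far richer analysis and scoring.

```python
from collections import Counter, defaultdict

# Hypothetical content items.
docs = {
    1: {"title": "Sailing news from Volendam", "section": "sport"},
    2: {"title": "Whiskey tasting news", "section": "lifestyle"},
    3: {"title": "Sailing gear review", "section": "sport"},
}

# Inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["title"].lower().split():
        index[term].add(doc_id)


def search(query):
    """Rank by number of matching query terms (a stand-in for real scoring)."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked]


def facet(doc_ids, field):
    """Count the hits per value of one field."""
    return Counter(docs[d][field] for d in doc_ids)


hits = search("sailing news")
sections = facet(hits, "section")
```

Swapping the body of `search` for other scoring logic corresponds to replacing the ranker, and `facet` shows why facets double as a simple analytics tool.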
Search
Diagram labels: Content, Q, Σ, Result ranking
Search too
Diagram labels: Content, t, Σ, Result ranking, User
Search too
Diagram labels: Content, Page, Σ, Result ranking, User
Search - Percolation
• Traditional queries: against an index with existing data
• What if the data does not exist at the time of the query?
• Percolation allows registration of queries and then returning the query IDs, e.g. for notification when new matches are available
• Use case:
– Search for a tweet, but after the initial results continuously get newly tweeted items when they come in
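Percolation reverses the usual roles: queries are stored and documents are matched against them. The sketch below assumes a simple "all terms must occur" query model; the `Percolator` class and the query IDs are hypothetical and do not reflect the ElasticSearch percolator API.

```python
class Percolator:
    """Stores queries up front and matches each new document against them."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, text):
        self.queries[query_id] = set(text.lower().split())

    def percolate(self, document):
        """Return the IDs of all registered queries the document satisfies."""
        terms = set(document.lower().split())
        return sorted(
            query_id
            for query_id, required in self.queries.items()
            if required <= terms
        )


percolator = Percolator()
percolator.register("alert-sailing", "sailing volendam")
percolator.register("alert-whiskey", "whiskey")

matches = percolator.percolate("Storm hits sailing race near Volendam")
```

The returned query IDs are what a notification service would use to decide which subscribers to alert about the new item.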
Graph databases
• Lineage: Euler and graph theory
• Data model: nodes & edges, both of which can hold key-value pairs
• Products:
– AllegroGraph
– InfoGrid
– Neo4j
• Use cases:
– Social relationships
– Content linking (entity linking)
Diagram nodes: Jan Smit, 3js, Nick en Simon, Volendam, Article 1, Article 2, Article 3
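A minimal sketch of entity linking on a property graph, using the node names from the slide. The edges are hypothetical, since the slide text does not say which article mentions which entity, and the `Graph` class is illustrative rather than any product's API.

```python
from collections import defaultdict


class Graph:
    def __init__(self):
        self.nodes = {}                 # node id -> key-value properties
        self.edges = defaultdict(list)  # node id -> [(neighbor id, properties)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, a, b, **props):
        # Undirected: store the edge in both directions.
        self.edges[a].append((b, props))
        self.edges[b].append((a, props))

    def neighbors(self, node_id, kind=None):
        return sorted(
            n for n, _ in self.edges.get(node_id, [])
            if kind is None or self.nodes[n].get("kind") == kind
        )


g = Graph()
for name in ["Jan Smit", "3js", "Nick en Simon", "Volendam"]:
    g.add_node(name, kind="entity")
for number in (1, 2, 3):
    g.add_node(f"Article {number}", kind="article")

# Hypothetical links, chosen only to make the example work.
g.add_edge("Article 1", "Jan Smit", type="mentions")
g.add_edge("Article 1", "Volendam", type="mentions")
g.add_edge("Article 2", "Volendam", type="mentions")
g.add_edge("Article 3", "3js", type="mentions")


def related_articles(article):
    """Articles that share at least one entity with the given article."""
    related = set()
    for entity in g.neighbors(article, kind="entity"):
        related.update(g.neighbors(entity, kind="article"))
    related.discard(article)
    return sorted(related)
```

Finding related content becomes a two-hop traversal (article, entity, article) instead of a chain of joins.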
Blob storage
• Endless storage of binary data
• Storing objects larger than a single machine can hold
• “Lower” price/GB compared to SAN storage
• Products:
– Amazon S3
– CAStor
– (Hadoop)
• Use cases:
– Media storage
– Archiving
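One way to store an object that is larger than a single machine is to split it into chunks and spread those over nodes. The sketch below assumes hash-based placement and a tiny chunk size; `BlobStore` and its layout are hypothetical and are not how S3 or CAStor work internally.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow


class BlobStore:
    def __init__(self, node_count):
        self.nodes = [dict() for _ in range(node_count)]  # each dict is one machine
        self.manifests = {}  # blob name -> ordered list of chunk digests

    def _node_for(self, digest):
        # Placement by hash (an assumption): the digest decides the node.
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, name, data: bytes):
        digests = []
        for start in range(0, len(data), CHUNK_SIZE):
            chunk = data[start:start + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self._node_for(digest)[digest] = chunk
            digests.append(digest)
        self.manifests[name] = digests

    def get(self, name) -> bytes:
        return b"".join(self._node_for(d)[d] for d in self.manifests[name])


blobs = BlobStore(node_count=3)
payload = b"cover-photo-bytes"
blobs.put("covers/2015-04-24.jpg", payload)
restored = blobs.get("covers/2015-04-24.jpg")
```

Capacity then grows by adding nodes, which is the property behind the "endless storage" bullet; a production system would also replicate each chunk.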
Summary
• RDBMSs are good enough for many problems
• For specific problems, NOSQL solutions provide a specific solution
• There’s a variety of NOSQL solutions with different characteristics
• NOSQL solutions will require a higher engineering effort
Dream NO SQL Architecture – Content Delivery
• CMS
• Document storage (MongoDB/CouchDB)
• Blob storage (S3/CAStor)
• Search (ElasticSearch/Solr)
• Website / Mobile Application
Dream NO SQL Architecture - Analytics
• Event collection
• Message Queue (Kafka / Flume)
• Event processing (Storm)
• Key-value store (Redis)
• Real time recommendations / targeting
• Column storage (Cassandra/HBase)
• Real time Dashboarding
• Big Data (Hadoop)
• Ad hoc reporting & Data science
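The flow of the analytics architecture can be sketched in a single process. In this illustration a `deque`, a `Counter` and a list stand in for the message queue, the key-value store and the long-term stores; the function names and event shape are assumptions, and no Kafka, Storm or Redis API is used.

```python
from collections import Counter, deque

queue = deque()       # stands in for the message queue (Kafka / Flume)
realtime = Counter()  # stands in for the key-value store (Redis)
history = []          # stands in for column storage and the Hadoop archive


def collect(event):
    """Event collection: producers only append to the queue."""
    queue.append(event)


def process_all():
    """Event processing: drain the queue and fan out to each store."""
    while queue:
        event = queue.popleft()
        realtime[event["article"]] += 1  # feeds real-time recommendations
        history.append(event)            # kept for dashboards and ad hoc reports


def recommend(n):
    """Most-viewed articles right now, read from the real-time counters."""
    return [article for article, _ in realtime.most_common(n)]


for article in ["a1", "a2", "a1", "a3", "a1", "a2"]:
    collect({"article": article})
process_all()
```

The queue decouples collection from processing, so each downstream store can be filled and scaled independently.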
CAP Theorem
• Availability: each client can always read and write
• Consistency: all clients always have the same view of the data
• Partition tolerance: the system works well despite physical network partitions
• Consistency + Availability:
– MySQL, Postgres, MS SQL, Oracle
– Asterdata, Greenplum, Vertica
• Availability + Partition tolerance:
– Dynamo, Voldemort, Tokyo Cabinet, KAI
– Cassandra, SimpleDB, CouchDB, Riak
• Consistency + Partition tolerance:
– Big Table, Hypertable, HBase
– MongoDB, Terrastore, Scalaris
– Berkeley DB, MemcacheDB, Redis
• Data models shown: relational databases, key-value, column-oriented, document-oriented



