Schema Agnostic Indexing with
Azure DocumentDB
@dharmashukla, DocumentDB
Presented at VLDB 2015
Sudipta Sengupta, Justin Levandoski,
David Lomet
Microsoft Research
Dharma Shukla, Shireesh Thota, Karthik Raman,
Madhan Gajendran, Ankur Shah, Sergii Ziuzin,
Krishnan Sundaram, Miguel Gonzalez Guajardo, Anna
Wawrzyniak, Samer Boshra,
Renato Ferreira, Mohamed Nassar,
Michael Koltachev, Ji Huang
Microsoft Corporation
Outline
• Overview of DocumentDB
• Schema Agnostic Indexing
• Logical Index Organization
• Physical Index Organization
• Summary
What is DocumentDB?
• Fully managed, multi-tenant, geo-distributed document database service on Azure
• Born out of the needs of internal Microsoft applications; GA since April 2015
• Built from the ground up with resource governance
• Provisioned throughput, performance isolation, OPEX efficiency
• Well-defined consistency levels with predictable performance
• Database engine built for JSON & JavaScript
• Automatic indexing of JSON values and rich (SQL and JavaScript) query
• JavaScript language-integrated transactions and query directly inside the database engine
Consistency levels: Strong, Bounded Staleness, Session, Eventual
Architecture
[Figure: resource model diagram. Labels: Database, Collection, Document, Account, User, Permission]
Schema-agnostic indexing
JavaScript Object Literals
JSON-serializable values (aka JSON Infoset)
{
  "locations":
  [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
[Figure: JSON document as tree. The root has the children locations, headquarter, and exports. locations and exports each branch into the array indexes 0 and 1, whose country and city keys end in the leaf values Germany, Berlin, France, Paris, Moscow, and Athens; headquarter ends in the leaf value Belgium.]
• Automatic indexing of document trees without requiring schema or secondary indices
• SQL and JavaScript query processing on the trees
• Lazy materialization of JavaScript values from the instances of trees
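The tree view of a JSON document can be illustrated with a short Python sketch. It walks a document and emits one root-to-leaf path per value, treating object keys and array indexes as interior nodes. The function name `tree_paths` and the `$` root label are assumptions made for illustration; this is not DocumentDB's internal representation.

```python
def tree_paths(node, prefix=("$",)):
    """Yield one tuple of path segments per leaf value of a JSON value."""
    if isinstance(node, dict):
        # Object keys become interior tree nodes.
        for key, child in node.items():
            yield from tree_paths(child, prefix + (key,))
    elif isinstance(node, list):
        # Array positions become interior tree nodes labeled by index.
        for index, child in enumerate(node):
            yield from tree_paths(child, prefix + (str(index),))
    else:
        # Scalars are the leaves of the tree.
        yield prefix + (str(node),)


doc = {
    "locations": [
        {"country": "Germany", "city": "Berlin"},
        {"country": "France", "city": "Paris"},
    ],
    "headquarter": "Belgium",
    "exports": [{"city": "Moscow"}, {"city": "Athens"}],
}

paths = ["/".join(segments) for segments in tree_paths(doc)]
for path in paths:
    print(path)
```

No schema is declared anywhere in the sketch: the structure is discovered from each document as it is walked.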
Logical Index Organization
• Index is a union of all the document trees
• Structural information and instance values are normalized into a unifying concept of JSON-Path
[Figure: common structure shared across the document trees]

Terms                  Postings List/Values
$/location/0/          1, 2
location/0/country/    1, 2
location/0/city/       1, 2
0/country/Germany      1, 2
1/country/France       2
…                      …
0/city/Moscow          2
0/dealers/0            2

[Figure: path illustrations built from the segments location, 0, country, and Germany, labeled "Range (>, <, !=) & ORDERBY queries" and "Wildcard queries"; a further illustration with the segments 0 and coordinates is labeled "Spatial queries"]
• Dynamic encoding of postings list (E-WAH/differential)
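A minimal sketch of how a term table like the one above could be built. It assumes each term is a window of three consecutive path segments, which mirrors the shape of the terms shown but is only a guess at the actual encoding. The two sample documents, `path_terms`, and `build_index` are hypothetical.

```python
from collections import defaultdict


def tree_paths(node, prefix=("$",)):
    """Yield one tuple of path segments per leaf value of a JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from tree_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from tree_paths(child, prefix + (str(index),))
    else:
        yield prefix + (str(node),)


def path_terms(path, width=3):
    """Split a path into overlapping windows of `width` segments (assumed term shape)."""
    if len(path) <= width:
        return ["/".join(path)]
    return ["/".join(path[i:i + width]) for i in range(len(path) - width + 1)]


def build_index(docs):
    """Map every term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path in tree_paths(doc):
            for term in path_terms(path):
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, chosen only to show shared and unshared terms.
docs = {
    1: {"locations": [{"country": "Germany"}]},
    2: {"locations": [{"country": "Germany"}, {"country": "France"}]},
}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Because the index is the union of all document trees, a term shared by both documents carries both ids in its postings list, while a term unique to one document carries a single id.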
Query
Input documents:
{
  "locations":
  [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
{
  "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 }],
  "headquarter": "Italy",
  "exports": [{ "city": "Berlin", "dealers": [{ "name": "Hans" }] }, { "city": "Athens" }]
}
[Figure: the two input documents and the query result drawn as trees]
SQL:
SELECT C.locations
FROM company C
WHERE C.headquarter = "Belgium"
JavaScript:
function businessLogic() {
  var country = "Belgium";
  __.filter(function(x) { return x.headquarter === country; });
}
Query result:
{
  "results":
  [
    {
      "locations":
      [
        { "country": "Germany", "city": "Berlin" },
        { "country": "France", "city": "Paris" }
      ]
    }
  ]
}
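To make the query semantics concrete, here is a small Python sketch that answers the same question over the two input documents: look up the term for headquarter = "Belgium" in a path index, then project `locations` from the matching documents. The helper names and the full-path term format are assumptions for illustration and say nothing about how the DocumentDB query processor is implemented.

```python
docs = {
    1: {"locations": [{"country": "Germany", "city": "Berlin"},
                      {"country": "France", "city": "Paris"}],
        "headquarter": "Belgium",
        "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    2: {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
        "headquarter": "Italy",
        "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                    {"city": "Athens"}]},
}


def leaf_paths(node, prefix="$"):
    """Yield one path string per leaf value of a JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, f"{prefix}/{key}")
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from leaf_paths(child, f"{prefix}/{i}")
    else:
        yield f"{prefix}/{node}"


def build_postings(docs):
    """Map each full leaf path to the set of document ids containing it."""
    postings = {}
    for doc_id, doc in docs.items():
        for path in leaf_paths(doc):
            postings.setdefault(path, set()).add(doc_id)
    return postings


def select_locations(docs, postings, headquarter):
    """Index lookup for the filter, then projection of the locations property."""
    matching = sorted(postings.get(f"$/headquarter/{headquarter}", set()))
    return {"results": [{"locations": docs[i]["locations"]} for i in matching]}


postings = build_postings(docs)
result = select_locations(docs, postings, "Belgium")
print(result)
```

The filter is answered from the index alone; the documents are touched only to project the requested property.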
System model for writes and queries
Path/posting list updates sent to the index for doc_id = 5:
key: "age/22"        payload: +doc5
key: "age/21"        payload: -doc5
key: "city/seattle"  payload: +doc5
key: "zip/98103"     payload: +doc5
…
Query Processor to Index: index scan > "age/30", < "age/32"; result: doc1, doc5, doc7
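The write and query sides of this model can be sketched in a few lines of Python. The sketch assumes flat documents and `key/value` terms; `posting_updates` diffs the old and new versions of a document into `+doc`/`-doc` payloads, and `index_scan` answers an exclusive range query over the term keys. Both function names and the sample index contents are hypothetical.

```python
def terms_of(doc):
    """Terms of a flat document, written as 'key/value' strings."""
    return {f"{key}/{value}" for key, value in doc.items()}


def posting_updates(doc_id, old_doc, new_doc):
    """Diff two versions of a document into posting-list updates."""
    old_terms, new_terms = terms_of(old_doc), terms_of(new_doc)
    updates = [(term, f"-doc{doc_id}") for term in sorted(old_terms - new_terms)]
    updates += [(term, f"+doc{doc_id}") for term in sorted(new_terms - old_terms)]
    return updates


def index_scan(index, low, high):
    """Return the documents whose term key lies strictly between low and high.

    Keys are compared as strings, which is only adequate here because the
    sample ages all have the same number of digits.
    """
    hits = set()
    for key in sorted(index):
        if low < key < high:
            hits |= index[key]
    return sorted(hits)


updates = posting_updates(5, {"age": 21},
                          {"age": 22, "city": "seattle", "zip": "98103"})
for key, payload in updates:
    print(key, payload)

# Hypothetical index contents for the range query.
index = {"age/30": {"doc2"}, "age/31": {"doc1", "doc5", "doc7"}, "age/32": {"doc3"}}
scan_result = index_scan(index, "age/30", "age/32")
print(scan_result)
```

A single document update fans out into several unrelated term updates, which is why the write path sees so little locality.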
Index Maintenance Requirements
• Support sustained volume of rapid writes without any term locality
• Queries should honor various consistency levels
• Index maintenance must operate within a frugal resource budget
• Low write, read and space amplification
[Figure labels: B-Tree, Cache, Log Structured Store]
Highly concurrent index updates
[Figure: a mapping table with the columns Page ID and Physical Address maps P to page P. Updates are attached to page P as delta records (Δ: Insert record 50, Δ: Delete record 48, Δ: Update record 35, Δ: Insert record 60) and are later folded into a consolidated page P. Further labels: Update record 35, Insert record 60.]
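The delta-record scheme in the figure can be sketched as follows. A mapping table entry points at the head of a chain of immutable delta records that ends in a base page; an update installs a new delta by swapping the mapping table entry, and consolidation replaces the chain with a fresh base page. In a latch-free implementation the swap would be an atomic compare-and-swap; the plain assignment here is a single-threaded stand-in, and all class and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BasePage:
    records: tuple  # sorted (key, value) pairs


@dataclass(frozen=True)
class Delta:
    op: str                # "insert", "update", or "delete"
    key: int
    value: Optional[str]
    next: Union["Delta", BasePage]


mapping_table = {}  # page id -> head of the delta chain (or a base page)


def apply_delta(page_id, op, key, value=None):
    """Install a delta in front of the current chain without touching the page."""
    # Stand-in for an atomic compare-and-swap on the mapping table slot.
    mapping_table[page_id] = Delta(op, key, value, mapping_table[page_id])


def materialize(node):
    """Replay the delta chain, oldest first, on top of the base page."""
    deltas = []
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next
    records = dict(node.records)
    for delta in reversed(deltas):
        if delta.op == "delete":
            records.pop(delta.key, None)
        else:
            records[delta.key] = delta.value
    return records


def consolidate(page_id):
    """Replace the chain with a new base page holding the merged records."""
    records = materialize(mapping_table[page_id])
    mapping_table[page_id] = BasePage(tuple(sorted(records.items())))


mapping_table["P"] = BasePage(((35, "old"), (48, "x")))
apply_delta("P", "insert", 50, "a")
apply_delta("P", "delete", 48)
apply_delta("P", "update", 35, "new")
apply_delta("P", "insert", 60, "b")
consolidate("P")
print(mapping_table["P"])
```

Because pages are never modified in place, readers that already hold a pointer to an older chain continue to see a consistent state.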
Write optimized storage organization
[Figure: in RAM, the mapping table points to base pages and Δ-records, which are collected in a latch-free flush buffer (8 MB). On SSD, the log-structured store holds base pages and Δ-records in the order they were written (write ordering in log).]
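A rough sketch of the buffering idea in the figure: page fragments are appended to an in-memory flush buffer and written to the log in one sequential write once the buffer fills. The class below is a single-threaded toy, not latch-free, and it uses a tiny threshold in place of the 8 MB buffer on the slide. The class name, its methods, and the record payloads are all hypothetical.

```python
class LogStructuredStore:
    """Toy append-only store with an in-memory flush buffer."""

    def __init__(self, flush_threshold):
        self.log = bytearray()      # stands in for the log on SSD
        self.buffer = bytearray()   # stands in for the flush buffer in RAM
        self.flush_threshold = flush_threshold
        self.flushes = 0
        self.mapping = {}           # page id -> [(offset, length)] in write order

    def append(self, page_id, payload):
        """Buffer a base page or delta record and remember where it will land."""
        offset = len(self.log) + len(self.buffer)
        self.buffer += payload
        self.mapping.setdefault(page_id, []).append((offset, len(payload)))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the whole buffer to the log as one sequential append."""
        if self.buffer:
            self.log += self.buffer
            self.buffer = bytearray()
            self.flushes += 1

    def read(self, page_id):
        """Return every fragment of a page in the order it was written."""
        self.flush()
        return [bytes(self.log[offset:offset + length])
                for offset, length in self.mapping.get(page_id, [])]


store = LogStructuredStore(flush_threshold=16)
store.append("P", b"base:35,48")   # 10 bytes, stays in the buffer
store.append("Q", b"base:7")       # buffer reaches 16 bytes and is flushed
store.append("P", b"+50")          # delta record for P, buffered again
fragments = store.read("P")
print(fragments, store.flushes)
```

Many small page updates turn into a few large sequential writes, which is the property a log-structured layout on SSD is after.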
Lack of term locality
• Little to no term locality on index write path
• Unable to keep "hot set" of leaf pages cached in memory
• Performing read to modify each leaf node leads to very high I/O overhead
• Requires method to maintain efficient write path for sustained term ingestion with predictable performance
Example write stream: update term t1, delete term t58, insert term t109, update term t179, update term t568, delete term t732
Blind updates and value merge
[Figure, regular update: to add doc5 to the posting list for term T, page P with T → {doc1, doc2, doc3} is fetched from the Log Structured Store (LSS) through the mapping table address (Read I/O), yielding T → {doc1, doc2, doc3, doc5}.]
[Figure, blind update for term T: Term T → +doc5 and Term T → -doc2 are attached to a page stub for P (figure labels: T->+doc2, T->-doc2) while page P with T → {doc1, doc2, doc3} stays in the LSS. On term lookup or full page consolidate, {doc1, doc2, doc3}, {+doc5}, and {-doc2} are merged into consolidated page P with T → {doc1, doc3, doc5}.]
[Chart: number of I/Os (axis 0 to 18,000) versus index size in MB (axis 0 to 10,000), with the series Update and Blind Update]
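The contrast between a regular update and a blind update can be sketched with a counter for simulated read I/Os. A blind update records `+doc`/`-doc` deltas against a page stub without reading the page; the deltas are merged with the stored postings only when the term is looked up. The classes and method names are assumptions, and the merge shown (set add and discard in arrival order) is one plausible value-merge rule rather than the documented one.

```python
class Store:
    """Holds pages of term -> postings and counts simulated read I/Os."""

    def __init__(self, pages):
        self.pages = pages
        self.read_ios = 0

    def read_page(self, page_id):
        self.read_ios += 1
        return self.pages[page_id]


class PageStub:
    """In-memory stub that accepts updates without reading the stored page."""

    def __init__(self, store, page_id):
        self.store = store
        self.page_id = page_id
        self.deltas = []  # (term, doc, is_add) in arrival order

    def blind_update(self, term, doc, is_add):
        # No read I/O: the delta is simply recorded.
        self.deltas.append((term, doc, is_add))

    def lookup(self, term):
        """Merge stored postings with pending deltas at read time."""
        postings = set(self.store.read_page(self.page_id).get(term, set()))
        for delta_term, doc, is_add in self.deltas:
            if delta_term != term:
                continue
            if is_add:
                postings.add(doc)
            else:
                postings.discard(doc)
        return postings


store = Store({"P": {"T": {"doc1", "doc2", "doc3"}}})
stub = PageStub(store, "P")
stub.blind_update("T", "doc5", True)    # Term T -> +doc5
stub.blind_update("T", "doc2", False)   # Term T -> -doc2
ios_after_writes = store.read_ios
merged = stub.lookup("T")
print(sorted(merged), ios_after_writes, store.read_ios)
```

The two writes cost no read I/O at all; the single read is deferred to the lookup, where it would have been needed anyway.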
Summary
