Dremel: interactive analysis of web-scale datasets

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer,
Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Google, Inc.
VLDB, 2010

Presentation by ensky (enskylin@gmail.com)

Outline
 Introduction
 Nested columnar storage
 Query processing
 Experiments
 Conclusion


Introduction
 Dremel is a query system for analysis of read-only nested data

 Use cases
◦ Interactive tools
◦ Trends detection
◦ Web dashboards
◦ Spam
◦ Network optimization

Features
 Multi-level structure
 SQL-like query language
 Fast: trillion-row tables in seconds
 Big scale: thousands of CPUs and petabytes of data
 Widely used at Google: in production since 2006

Example record:
DocId: 10
Links
  Forward: 20
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Url: 'http://A'
Name
  Url: 'http://B'

Example query:
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}

Contribution
 This paper presents two major techniques in Dremel:
 Nested columnar storage
◦ Store / split into columns
◦ Assemble into records
 Query processing
◦ Language
◦ Execution


Why nested?
 Flexible
◦ Data in web and scientific computing is often non-relational
 Reduce cost
◦ Normalizing and recombining nested data is often prohibitive at TB to PB scale

Why column?
[Figure: a schema tree (fields A, B, C, D, E) with records r1 and r2 stored record-oriented vs. column-oriented]
Column-oriented: read less, cheaper decompression
Challenge: preserve structure, reconstruct from a subset of fields
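The "read less" argument can be sketched with a toy example. This is only an illustration under simplifying assumptions: the records are flat (no nesting), the field names A, B, C and their values are made up, and "cost" is counted as the number of values a scan has to pass over.

```python
# Hypothetical flat records; field names and values are placeholders.
records = [
    {"A": 1, "B": "x", "C": 0.5},
    {"A": 2, "B": "y", "C": 1.5},
    {"A": 3, "B": "z", "C": 2.5},
]


def to_columns(rows):
    """Turn a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def values_read_from_records(rows, wanted):
    """A record-oriented scan passes over every field of every record,
    no matter which fields are wanted."""
    return sum(len(row) for row in rows)


def values_read_from_columns(columns, wanted):
    """A column-oriented scan touches only the requested columns."""
    return sum(len(columns[field]) for field in wanted)


columns = to_columns(records)
print(values_read_from_records(records, ["A"]))
print(values_read_from_columns(columns, ["A"]))
```

With nested data the column layout alone loses the record structure, which is the challenge the next slides address.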

Nested data model

Schema (multiplicity in brackets):
message Document {
  required int64 DocId;          [1,1]
  optional group Links {
    repeated int64 Backward;     [0,*]
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;   [0,1]
    }
    optional string Url;
  }
}
http://code.google.com/apis/protocolbuffers

Record r1:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

Record r2:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'http://C'

Column-striped representation
(r = repetition level, d = definition level)

DocId          Name.Url          Links.Forward    Links.Backward
value  r  d    value     r  d    value  r  d      value  r  d
10     0  0    http://A  0  2    20     0  2      NULL   0  1
20     0  0    http://B  1  2    40     1  2      10     0  2
               NULL      1  1    60     1  2      30     1  2
               http://C  0  2    80     0  2

Name.Language.Code    Name.Language.Country
value  r  d           value  r  d
en-us  0  2           us     0  3
en     2  2           NULL   2  2
NULL   1  1           NULL   1  1
en-gb  1  2           gb     1  3
NULL   0  1           NULL   0  1

Column-oriented problem
 When querying a sub-tree, there is no way to know the whole structure (or even your own position in it)
 To solve this problem, the paper introduces repetition & definition levels
[Figure: values of records r1 and r2 stored in the columns of a schema tree (A, B, C, D, E), asking "Where am I?"]

Repetition and definition levels
Name.Language.Code (records r1 and r2 from the nested data model slide)
value  r  d
en-us  0  2    first value in r1: r = 0
en     2  2    Language repeats: r = 2
NULL   1  1
en-gb  1  2
NULL   0  1    nothing repeats (start of r2): r = 0

r: at what repeated field in the field's path the value has repeated

Repetition and definition levels
Name.Language.Country (records r1 and r2 from the nested data model slide)
value  r  d
us     0  3
NULL   2  2
NULL   1  1
gb     1  3
NULL   0  1

d: how many fields in the path that could be undefined (optional or repeated) are actually present
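The two definitions above can be checked with a small sketch that splits the example records into columns of (value, r, d) triples. This is a simplified illustration, not the paper's actual algorithm: the tuple-based schema encoding, the dict-based records, and the names `stripe` and `leaf_paths` are all assumptions made for this example.

```python
from collections import defaultdict

# Schema of the Document message as (name, label, children);
# children is None for a leaf field. This encoding is an assumption.
SCHEMA = [
    ("DocId", "required", None),
    ("Links", "optional", [
        ("Backward", "repeated", None),
        ("Forward", "repeated", None),
    ]),
    ("Name", "repeated", [
        ("Language", "repeated", [
            ("Code", "required", None),
            ("Country", "optional", None),
        ]),
        ("Url", "optional", None),
    ]),
]

r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}


def leaf_paths(fields, prefix=""):
    """Yield the dotted path of every leaf field below `fields`."""
    for name, _label, children in fields:
        if children is None:
            yield prefix + name
        else:
            yield from leaf_paths(children, prefix + name + ".")


def stripe(fields, group, columns, r=0, d=0, rep_depth=0, prefix=""):
    """Append (value, r, d) triples for one group to `columns`."""
    for name, label, children in fields:
        path = prefix + name
        # Number of repeated fields on the path, including this one.
        depth = rep_depth + (label == "repeated")
        if label == "repeated":
            values = group.get(name, [])
        else:
            values = [group[name]] if name in group else []
        if not values:
            # Field absent: write a NULL into every leaf column below it,
            # keeping the levels reached so far.
            leaves = [path] if children is None else leaf_paths(children, path + ".")
            for leaf in leaves:
                columns[leaf].append((None, r, d))
            continue
        for i, value in enumerate(values):
            value_r = r if i == 0 else depth        # repeats at this field
            value_d = d + (label != "required")     # one more field defined
            if children is None:
                columns[path].append((value, value_r, value_d))
            else:
                stripe(children, value, columns, value_r, value_d, depth, path + ".")


columns = defaultdict(list)
for record in (r1, r2):
    stripe(SCHEMA, record, columns)

for triple in columns["Name.Language.Code"]:
    print(triple)
```

NULL is represented as `None` here. A missing required field would also be written as NULL by this sketch, where a real implementation would presumably reject the record.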

Column-oriented storage is fine, but what if I still need record-oriented processing (e.g., MapReduce)?

Record assembly FSM
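Transitions labeled with repetition levels:
  DocId -> Links.Backward: 0
  Links.Backward -> Links.Backward: 1
  Links.Backward -> Links.Forward: 0
  Links.Forward -> Links.Forward: 1
  Links.Forward -> Name.Language.Code: 0
  Name.Language.Code -> Name.Language.Country: 0,1,2
  Name.Language.Country -> Name.Language.Code: 2
  Name.Language.Country -> Name.Url: 0,1
  Name.Url -> Name.Language.Code: 1
  Name.Url -> end of record: 0

For record-oriented data processing (e.g., MapReduce)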

Reading two fields

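FSM for the two fields:
  DocId -> Name.Language.Country: 0
  Name.Language.Country -> Name.Language.Country: 1,2
  Name.Language.Country -> end of record: 0

Assembled records:
s1:
DocId: 10
Name
  Language
    Country: 'us'
  Language
Name
Name
  Language
    Country: 'gb'

s2:
DocId: 20
Name

Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country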
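Reading just these two fields can be sketched as follows. This is a hand-specialized illustration for DocId and Name.Language.Country only, not a general FSM-driven assembler: the function name `assemble` and the dict/list output shape are assumptions, and the column contents are the (value, r, d) triples from the column-striped representation slide.

```python
# Column contents as (value, r, d) triples; None stands for NULL.
doc_ids = [(10, 0, 0), (20, 0, 0)]
countries = [("us", 0, 3), (None, 2, 2), (None, 1, 1), ("gb", 1, 3), (None, 0, 1)]


def assemble(doc_id_column, country_column):
    """Rebuild partial records from the DocId and Country columns."""
    records = []
    ids = iter(doc_id_column)
    for value, r, d in country_column:
        if r == 0:
            # r = 0 marks the start of a new record.
            doc_id, _r, _d = next(ids)
            records.append({"DocId": doc_id, "Name": []})
        names = records[-1]["Name"]
        if r <= 1 and d >= 1:
            # A new Name starts here (d >= 1 means Name is present).
            names.append({"Language": []})
        if d >= 2:
            # Language is present; Country is set only when d == 3.
            language = {}
            if d == 3:
                language["Country"] = value
            names[-1]["Language"].append(language)
    return records


records = assemble(doc_ids, countries)
# /Name[3]/Language[1]/Country of the first record (1-based in the path).
print(records[0]["Name"][2]["Language"][0]["Country"])
```

A Name without any Language shows up as an empty `Language` list, so the position of each Name and Language is still recoverable even though only two columns were read.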


Query language
 Based on SQL
 Performs
◦ Projection
◦ Selection
◦ Nested subqueries
◦ Inter- and intra-record aggregation
◦ Top-k
◦ Joins
◦ User-defined functions

Example usage

Input record (r1):
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

Query:
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Output table (t1):
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0

Output schema:
message QueryResult {
  required int64 Id;
  repeated group Name {
    optional uint64 Cnt;
    repeated group Language {
      optional string Str;
    }
  }
}
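To make the WITHIN aggregation and the pruning done by WHERE concrete, here is a minimal sketch that evaluates this one query by hand on dict-based records. It is not a general query evaluator: the function name `run_query` and the record encoding are assumptions, and each output Language is modeled as its own entry holding one Str, following the output schema.

```python
import re

r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}


def run_query(record):
    """Hand-written equivalent of the example query for one record."""
    if not record["DocId"] < 20:            # WHERE ... AND DocId < 20
        return None
    result = {"Id": record["DocId"], "Name": []}
    for name in record.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name without a matching Url
        # is pruned from the result.
        if url is None or not re.search(r"^http", url):
            continue
        languages = name.get("Language", [])
        # COUNT(Name.Language.Code) WITHIN Name: Code is required, so the
        # count equals the number of Language groups in this Name.
        group = {"Cnt": len(languages)}
        strs = [{"Str": url + "," + lang["Code"]} for lang in languages]
        if strs:
            group["Language"] = strs
        result["Name"].append(group)
    return result


print(run_query(r1))
print(run_query(r2))
```

The third Name of r1 has no Url, so it drops out entirely, while the second Name survives with a count of 0.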

Query execution
Serving tree (top to bottom):
  client
  root server
  intermediate servers
  leaf servers (with local storage)
  storage layer (e.g., GFS)
• Parallelizes scheduling and aggregation
• Fault tolerance
• Query dispatcher can dispatch queries across servers

Example: count()
Level 0 (root server):
  SELECT A, COUNT(B) FROM T GROUP BY A
  T = {/gfs/1, /gfs/2, …, /gfs/100000}
  rewritten as:
  SELECT A, SUM(c) FROM (R11 UNION ALL R110) GROUP BY A

Level 1:
  R11: SELECT A, COUNT(B) AS c FROM T11 GROUP BY A
       T11 = {/gfs/1, …, /gfs/10000}
  R12: SELECT A, COUNT(B) AS c FROM T12 GROUP BY A
       T12 = {/gfs/10001, …, /gfs/20000}
  ...

Level 3 (data access ops):
  SELECT A, COUNT(B) AS c FROM T31 GROUP BY A
  T31 = {/gfs/1}
  ...
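The rewrite above (partial COUNT at the leaves, SUM of partial counts further up) can be sketched in a few lines. This is a single-process toy, assuming each tablet is a list of hypothetical (A, B) rows and using a made-up `fanout` parameter to split the tablets among children; real servers, scheduling, and fault tolerance are left out.

```python
def leaf_query(tablet):
    """SELECT A, COUNT(B) AS c FROM tablet GROUP BY A (NULL B is not counted)."""
    counts = {}
    for a, b in tablet:
        counts[a] = counts.get(a, 0) + (b is not None)
    return counts


def merge(partials):
    """SELECT A, SUM(c) FROM (partials, UNION ALL) GROUP BY A."""
    total = {}
    for partial in partials:
        for a, c in partial.items():
            total[a] = total.get(a, 0) + c
    return total


def serve(tablets, fanout=2):
    """Answer the query for a list of tablets by recursing down a tree."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    if not tablets:
        return {}
    if len(tablets) == 1:
        return leaf_query(tablets[0])
    # Split the tablets into at most `fanout` chunks, one per child server.
    chunk = -(-len(tablets) // fanout)  # ceiling division
    children = [tablets[i:i + chunk] for i in range(0, len(tablets), chunk)]
    return merge(serve(child, fanout) for child in children)


# Hypothetical tablets of (A, B) rows; None stands for NULL.
tablets = [
    [("x", 1), ("y", None)],
    [("x", 2), ("x", None)],
    [("y", 3)],
]
print(serve(tablets))
```

The result matches a single scan over all rows because COUNT decomposes into a sum of partial counts, which is what lets every level of the tree aggregate independently.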


Experiment Data
• 1 PB of real data (uncompressed, non-replicated)
• 100K-800K tablets per table
• Experiments run during business hours

Table name  Number of records  Size (unrepl., compressed)  Number of fields  Data center  Repl. factor
T1          85 billion         87 TB                       270               A            3×
T2          24 billion         13 TB                       530               A            3×
T3          4 billion          70 TB                       1200              A            3×
T4          1+ trillion        105 TB                      50                B            3×
T5          1+ trillion        20 TB                       30                B            2×

Column vs. record
"Cold" time on local disk, averaged over 30 runs
[Figure: time (sec) vs. number of fields]
From columns:
  (a) read + decompress
  (b) assemble records
  (c) parse as C++ objects
From records:
  (d) read + decompress
  (e) parse as C++ objects
10x speedup using columnar storage
2-4x overhead of using records
Table partition: 375 MB (compressed), 300K rows, 125 columns

MR and Dremel execution
Avg # of terms in txtField in 85 billion record table T1
[Figure: execution time (sec) on 3000 nodes for MR on records, MR on columns, and Dremel; data read: 87 TB, 0.5 TB, 0.5 TB]

Sawzall program run on MR:
num_recs: table sum of int;
num_words: table sum of int;
emit num_recs <- 1;
emit num_words <- count_words(input.txtField);

Q1: SELECT SUM(count_words(txtField)) / COUNT(*)
FROM T1

MR overheads: launch jobs, schedule 0.5M tasks, assemble records
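For readers unfamiliar with Sawzall, the computation in both the MR program and Q1 amounts to the following sketch. The slide does not define `count_words`, so splitting on whitespace is an assumption, as are the sample rows; the table is assumed to be non-empty.

```python
def count_words(text):
    """Assumed definition: number of whitespace-separated terms."""
    return len(text.split())


def q1(rows):
    """SUM(count_words(txtField)) / COUNT(*) over a non-empty list of rows."""
    num_recs = 0
    num_words = 0
    for row in rows:
        num_recs += 1                                # emit num_recs <- 1
        num_words += count_words(row["txtField"])    # emit num_words <- ...
    return num_words / num_recs


# Hypothetical rows standing in for table T1.
rows = [{"txtField": "a b c"}, {"txtField": "d"}]
print(q1(rows))
```

Only `txtField` is needed, which is why the columnar runs read far less data than the record-oriented MR job.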

Impact of serving tree depth
[Figure: execution time (sec) for Q2 and Q3 at different serving tree depths]

Q2 (returns 100s of records):
SELECT country, SUM(item.amount) FROM T2
GROUP BY country

Q3 (returns 1M records):
SELECT domain, SUM(item.amount) FROM T2
WHERE domain CONTAINS '.net'
GROUP BY domain

40 billion nested items

Scalability
[Figure: execution time (sec) vs. number of leaf servers]

Q5 on a trillion-row table T4:
SELECT TOP(aid, 20), COUNT(*) FROM T4


Observation
Monthly query workload of one 3000-node Dremel instance
[Figure: percentage of queries vs. execution time (sec)]
Most queries complete under 10 sec

Conclusion
 Dremel is a query system for analysis of read-only nested data
 Main features: fast (interactive response time), column-oriented, SQL-like query language
 Introduced two major methods:
◦ Nested columnar storage: solves the partial query problem (reading only a subset of fields)
◦ Query processing: parallel processing, distributing and decomposing queries

Comments
 Google is awesome!
 Pros
◦ Nested storage gives us more flexibility.
◦ Repetition & definition levels are a novel idea and solve the locality problem easily.
◦ The distributed serving tree is awesome, faster than MR.

Comments
 Cons
◦ Read-only may not fit every requirement
◦ Dremel is not a database, so you'll need to convert your real data into Dremel's format before analyzing it
 Converting may cost a lot of time and space
 Google doesn't care about this problem; they have GFS and many servers.

Thank you for listening.

