T R E A S U R E D A T A
USER DEFINED PARTITIONING
A New Partitioning Strategy accelerating CDP Workload
Kai Sasaki
Software Engineer in Treasure Data
ABOUT ME
- Kai Sasaki (@Lewuathe)
- Software Engineer in Treasure Data since 2015
Working in the Query Engine Team (managing Hive and Presto in Treasure Data)
- Contributor of Hadoop, Spark, Presto
TOPICS
PlazmaDB
PlazmaDB is the metadata storage for all log data in
Treasure Data. It supports import, export, INSERT
INTO, CREATE TABLE, DELETE, etc. on top of the
PostgreSQL transaction mechanism.
Time Index Partitioning
Partitions log data by the time the log was generated,
which is stored in the “time” column in Treasure Data.
This lets the query engine skip reading unnecessary
partitions.
User Defined Partitioning
(New!)
In addition to the “time” column, any column can be
used as a partitioning key. This provides a more flexible
partitioning strategy that fits the CDP workload.
OVERVIEW OF QUERY ENGINE IN TD
PRESTO IN TREASURE DATA
• Multiple clusters, each with 50-60 workers
• Presto 0.188
Stats
• 4.3+ million queries / month
• 400 trillion records / month
• 6+ PB / month
At the end of 2017
HIVE AND PRESTO ON PLAZMADB
Bulk Import
Fluentd
Mobile SDK
PlazmaDB
Presto
Hive
SQL, CDP
Amazon S3
PLAZMADB
PlazmaDB
Amazon S3
id data_set_id first_index_key last_index_key record_count path
P1 3065124 187250 1412323028 1412385139 109 abcdefg-1234567-abcdefg-1234567
P2 3065125 187250 1412323030 1412324030 209 abcdefg-1234567-abcdefg-9182841
P3 3065126 187250 1412327028 1412328028 31 abcdefg-1234567-abcdefg-5818231
P4 3065127 187250 1412325011 1412326001 102 abcdefg-1234567-abcdefg-7271828
P5 3065128 281254 1412324214 1412325210 987 abcdefg-1234567-abcdefg-6717284
P6 3065129 281254 1412325123 1412329800 541 abcdefg-1234567-abcdefg-5717274
Multi Column Indexes
s3://plazma-partitions/…
1-hour partitioning
PLAZMADB
PlazmaDB
Amazon S3
Realtime Storage
Amazon S3
Archive Storage
MapReduce
Keeps 1-hour partitioning periodically.
Time-Indexed Partitioning
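The 1-hour time-index pruning above can be sketched as follows. This is a minimal illustration, not PlazmaDB's actual code; the partition metadata shape mirrors the table on the previous slide (first_index_key / last_index_key hold the min/max "time" values a partition contains).

```python
# Sketch of time-index partition pruning (illustrative, not PlazmaDB code).
# A query with a time predicate only needs to read partitions whose
# [first_index_key, last_index_key] range overlaps the predicate range.

partitions = [
    {"id": "P1", "first_index_key": 1412323028, "last_index_key": 1412385139},
    {"id": "P2", "first_index_key": 1412323030, "last_index_key": 1412324030},
    {"id": "P3", "first_index_key": 1412327028, "last_index_key": 1412328028},
]

def prune_by_time(partitions, t_from, t_to):
    """Keep only partitions whose time range overlaps [t_from, t_to]."""
    return [p for p in partitions
            if p["first_index_key"] <= t_to and p["last_index_key"] >= t_from]

selected = prune_by_time(partitions, 1412327000, 1412328000)
# Only P1 (wide range) and P3 (overlapping hour) are read; P2 is skipped.
```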
PROBLEM
• Time index partitioning is efficient only when a “time” value is specified.
Filtering on other columns causes a full scan, which can make performance worse.
• The number of records in a partition depends heavily on the table type and usage.
SELECT
COUNT(1)
FROM table
WHERE
user_id = 1;
id data_set_id first_index_key last_index_key record_count path
P1 3065124 100 1412323028 1412385139 1 abcdefg-1234567-abcdefg-1234567
P2 3065125 100 1412323030 1412324030 1 abcdefg-1234567-abcdefg-9182841
P3 3065126 100 1412327028 1412328028 1 abcdefg-1234567-abcdefg-5818231
P4 3065127 200 1412325011 1412326001 101021 abcdefg-1234567-abcdefg-7271828
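Why the query above cannot prune: the partition index only covers “time”, so a predicate on user_id matches every partition. A sketch, reusing the metadata shape from the table above:

```python
# Illustrative sketch: only a "time" predicate can narrow the scan.
partitions = [
    {"id": "P1", "first_index_key": 1412323028, "last_index_key": 1412385139},
    {"id": "P2", "first_index_key": 1412323030, "last_index_key": 1412324030},
    {"id": "P3", "first_index_key": 1412327028, "last_index_key": 1412328028},
    {"id": "P4", "first_index_key": 1412325011, "last_index_key": 1412326001},
]

def prune(partitions, time_range=None):
    if time_range is None:   # e.g. WHERE user_id = 1 -- no time predicate
        return partitions    # full scan: every partition must be read
    t_from, t_to = time_range
    return [p for p in partitions
            if p["first_index_key"] <= t_to and p["last_index_key"] >= t_from]

assert len(prune(partitions)) == 4                             # full scan
assert len(prune(partitions, (1412327000, 1412328000))) == 2   # time prunes
```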
USER DEFINED PARTITIONING
USER DEFINED PARTITIONING
• Users can specify a partitioning strategy that fits their usage, using a partitioning key column and a max time range.

[Diagram: each 1-hour time partition is further split by the values v1, v2, v3 of partitioning column c1]
USER DEFINED PARTITIONING
• Users can specify a partitioning strategy that fits their usage, using a partitioning key column and a max time range.

[Diagram: a query … WHERE c1 = ‘v1’ AND time = … scans only the v1 bucket within the matching 1-hour time partitions]
USER DEFINED PARTITIONING
CREATE TABLE via Presto or Hive
Insert data partitioned by the configured partitioning key

Set user defined configuration
The number of buckets, the hash function, and the partitioning key

Read the data from the UDP table
The UDP table is now visible via Presto and Hive
USER DEFINED CONFIGURATION
• We need to configure the columns used as the partitioning key and the number of partitions.
This is a custom configuration set by each user.

user_table_id columns bucket_count partition_function
T1 141849 [["o_orderkey","long"]] 32 hash
T2 141850 [["user_id","long"]] 32 hash
T3 141910 [["item_id","long"]] 16 hash
T4 151242 [["region_id","long"],["device_id","long"]] 256 hash
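The bucket assignment implied by this configuration can be sketched as below. The hash function here is hypothetical (MD5-based, as a stand-in for the configured "hash" partition_function; the actual function used in PlazmaDB is not specified in these slides).

```python
import hashlib

def bucket_for(value, bucket_count):
    """Map a partitioning-key value to a bucket number via a stable hash.
    Hypothetical stand-in for the configured 'hash' partition_function."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count

# With bucket_count = 32 (as for the user_id table above), every record
# with the same user_id deterministically lands in the same bucket.
b1 = bucket_for(12345, 32)
b2 = bucket_for(12345, 32)
assert b1 == b2
assert 0 <= b1 < 32
```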
CREATE UDP TABLE VIA PRESTO
• Presto and Hive support CREATE TABLE / INSERT INTO on UDP tables
CREATE TABLE udp_customer
WITH (
  bucketed_on = array['customer_id'],
  bucket_count = 128
)
AS SELECT * FROM normal_customer;
CREATE UDP TABLE VIA PRESTO
• Override ConnectorPageSink to write MPC1 files based on the user-defined partitioning key.
[Diagram: PlazmaPageSink wraps a PartitionedMPCWriter, which routes each incoming Page to a TimeRangeMPCWriter per bucket (b1, b2, b3, …), each backed by a BufferedMPCWriter per 1-hour range]
CREATE UDP TABLE VIA PRESTO
id data_set_id first_index_key last_index_key record_count path bucket_number
P1 3065124 187250 1412323028 1412385139 109 abcdefg-1234567-abcdefg-1234567 1
P2 3065125 187250 1412323030 1412324030 209 abcdefg-1234567-abcdefg-9182841 2
P3 3065126 187250 1412327028 1412328028 31 abcdefg-1234567-abcdefg-5818231 3
P4 3065127 187250 1412325011 1412326001 102 abcdefg-1234567-abcdefg-7271828 2
P5 3065128 281254 1412324214 1412325210 987 abcdefg-1234567-abcdefg-6717284 16
P6 3065129 281254 1412325123 1412329800 541 abcdefg-1234567-abcdefg-5717274 14
• A new bucket_number column is added to the partition record in PlazmaDB.
READ DATA FROM UDP TABLE
Override the Presto Connector for the data source
Presto provides a plugin mechanism to connect to any data source flexibly. The connector provides information about the metadata and location of the real data source, and UDFs.

Receive the constraint as a TupleDomain
The TupleDomain is created from the query plan and passed through the TableLayout, which is available in ConnectorSplitManager.

Decide the target bucket from the constraint
The constraint specifies the range that should be read from the table. ConnectorSplitManager asks PlazmaDB for the partitions in the target bucket.

ConnectorSplitManager#getSplits
Returns the data source splits to be read by the Presto cluster.
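The read path above can be sketched as: given an equality constraint on the partitioning key, derive the target bucket and keep only matching partitions. This is an illustrative sketch, not the actual connector code; `bucket_for` is a hypothetical stand-in for the configured hash function.

```python
import hashlib

def bucket_for(value, bucket_count):
    # Hypothetical stand-in for the configured hash partition_function.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % bucket_count

def get_splits(partitions, constraint_value, bucket_count):
    """Mimics ConnectorSplitManager#getSplits: prune to the partitions in
    the bucket derived from the constraint on the partitioning key."""
    target = bucket_for(constraint_value, bucket_count)
    return [p for p in partitions if p["bucket_number"] == target]

# 64 partitions spread evenly over 16 buckets: pruning keeps only 1/16.
partitions = [{"id": i, "bucket_number": i % 16} for i in range(64)]
splits = get_splits(partitions, "customer_42", 16)
assert len(splits) == 4
```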
READ DATA FROM UDP TABLE
SplitManager
PlazmaDB
TableLayout
SQL
constraint
Map<ColumnHandle, Domain>
Distribute PageSource
… WHERE bucket_number IN (…) …
PERFORMANCE
PERFORMANCE COMPARISON
SQLs on TPC-H (scale factor = 1000), elapsed time (sec):

query          NORMAL   UDP
count1_filter  266.71   87.279
groupby        69.374   36.569
hashjoin       19.478   1.04
COLOCATED JOIN
[Diagram: in a Distributed Join, rows of left and right are shuffled across workers by the join key; in a Colocated Join, matching bucket pairs (l1-r1, l2-r2, l3-r3) already live together, so each pair joins locally without a shuffle]
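The benefit of the colocated join can be sketched as follows: when both tables are bucketed identically on the join key, only rows in the same bucket can match, so each bucket pair joins independently with no redistribution. An illustrative sketch under that assumption (not Presto's implementation):

```python
def bucket_for(key, bucket_count):
    # Stand-in for the shared hash function both tables were bucketed with.
    return hash(key) % bucket_count

def colocated_join(left, right, bucket_count):
    """Join two tables bucketed identically on the join key: route rows to
    their buckets, then join each bucket pair locally (no shuffle)."""
    buckets = [([], []) for _ in range(bucket_count)]
    for row in left:
        buckets[bucket_for(row["key"], bucket_count)][0].append(row)
    for row in right:
        buckets[bucket_for(row["key"], bucket_count)][1].append(row)
    out = []
    for lrows, rrows in buckets:  # each pair could run on its own worker
        out.extend((l["key"], l["lv"], r["rv"])
                   for l in lrows for r in rrows if l["key"] == r["key"])
    return out

left = [{"key": k, "lv": k * 10} for k in range(6)]
right = [{"key": k, "rv": k * 100} for k in range(6)]
result = sorted(colocated_join(left, right, 4))
assert result == [(k, k * 10, k * 100) for k in range(6)]
```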
PERFORMANCE COMPARISON
SQLs on TPC-H (scale factor = 1000), elapsed time (sec):

[Bar chart: between, mod_predicate, count_distinct — NORMAL vs UDP, 0-80 sec]
USER DEFINED PARTITIONING
[Diagram: a query with only a time predicate (… WHERE time = …) still prunes by the 1-hour time partitions across all buckets v1, v2, v3 of c1]
FUTURE WORKS
• Maintaining an efficient partitioning structure
• Developing a Stella job to rearrange the partitioning schema flexibly using Presto resources
• Various kinds of pipelines (streaming import, etc.) should support UDP tables
• Documentation
T R E A S U R E D A T A
