Masahiro Nakagawa
Senior Software Engineer
Treasure Data, Inc.
Treasure Data & AWS
The light and dark side of the Cloud
Who am I?
> Masahiro Nakagawa
> github: repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> Living in OSS :)
> D language - committer to Phobos, a.k.a. the standard library
> Fluentd - main maintainer
> MessagePack / RPC - D and Python (RPC only)
> Organizer of several meetups (Presto, DTM, etc.)
> etc…
TD Service Architecture
Time to Value: Acquire → Store → Analyze → Result Push (send query results)
> Acquire: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Treasure Agent (Server), SDKs (JS, Android, iOS, Unity), Streaming Collector
> Bulk Uploader: Embulk, TD Toolbelt
> Store: Plazma DB - flexible, scalable, columnar storage (@AWS or @IDCF)
> Analyze: SQL-based query (SQL, Pig) - batch / reliability, ad-hoc / low latency
> Result Push: via REST API and ODBC / JDBC to KPI dashboards, BI tools (Metric Insights, Tableau, Motion Board, etc.) and other products (RDBMS, Google Docs, AWS S3, FTP server, etc.)
Connectivity / Economy & Flexibility / Simple & Supported
Treasure Data System Overview
Frontend → Job Queue → Worker → Hadoop / Presto
> Applications push metrics to Fluentd (via a local Fluentd)
> Datadog for realtime monitoring
> Treasure Data for historical analysis
> Fluentd sums up data over minutes (partial aggregation) - see the sketch below
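To make "partial aggregation" concrete, here is a minimal Python sketch (an illustration of the idea, not Fluentd's actual plugin code) that sums counters per one-minute window before forwarding them downstream:

```python
import time
from collections import defaultdict

class MinuteAggregator:
    """Sum metric counters per one-minute window (partial aggregation)."""

    def __init__(self, flush):
        self.flush = flush            # callback: (window_start_epoch, {metric: total})
        self.window = None            # start of the current one-minute window
        self.counts = defaultdict(int)

    def add(self, metric, value, now=None):
        now = time.time() if now is None else now
        window = int(now // 60) * 60  # truncate to the minute
        if self.window is not None and window != self.window:
            self.flush(self.window, dict(self.counts))  # emit the finished window
            self.counts.clear()
        self.window = window
        self.counts[metric] += value

# usage: agg = MinuteAggregator(lambda w, c: print(w, c)); agg.add("requests", 1)
```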
Plazma - Treasure Data’s distributed
analytical database
Plazma by the numbers
> Data import
> 500,000 records / sec
> 43 billion records / day
> Hive Query
> 2 trillion records / day
> 2,828 TB/day
> Presto Query
> 10,000+ queries / day
Used AWS components
> EC2
> Hadoop / Presto Clusters
> API Servers
> S3
> MessagePack Columnar Storage
> RDS
> MySQL for service information
> PostgreSQL for Plazma metadata
> Distributed Job Queue / Scheduler
Used AWS components
> CloudWatch
> Monitor AWS service metrics
> ELB
> Endpoint for APIs
> Endpoint for Heroku drains
> ElastiCache
> Store TD monitoring data
> Event de-duplication for mobile SDKs
Why not use HDFS for storage?
> To separate compute resources from storage
> Easy to add or replace workers
> Import load doesn't affect queries
> Don't want to maintain HDFS…
> HDFS crashes
> Upgrading an HDFS cluster is hard
> The drawbacks of S3-based storage
> Eventual consistency
> Network access
Data Importing
td-agent / fluentd → API Server → Import Queue (MySQL, PerfectQueue) → Import Worker
✓ Buffering for 5 minutes
✓ Retrying (at-least-once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
Chunks are MessagePack: it's like JSON, but fast and small.

unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…}
{"time":1426047912,"uid":9,…}
{"time":1426047939,"uid":3,…}
{"time":1426047951,"uid":2,…}
…

The queue records the unique ID of every uploaded chunk:

unique_id          time
375828ce5510cadb   2015-12-01 10:47
2024cffb9510cadc   2015-12-01 11:09
1b8d6a600510cadd   2015-12-01 11:21
1f06c0aa510caddb   2015-12-01 11:38

A UNIQUE constraint on unique_id makes enqueueing at-most-once; combined with at-least-once retrying, each chunk is imported exactly once. A minimal sketch of the idea follows.
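The sketch uses SQLite for brevity (the real queue is PerfectQueue on MySQL, so the table and column names here are illustrative): a UNIQUE key on unique_id turns at-least-once retries into at-most-once enqueues.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE import_queue (
        unique_id   TEXT PRIMARY KEY,  -- UNIQUE: a chunk can be enqueued at most once
        uploaded_at TEXT NOT NULL
    )
""")

def enqueue(unique_id, uploaded_at):
    # INSERT OR IGNORE: duplicates caused by client retries are silently dropped
    cur = conn.execute(
        "INSERT OR IGNORE INTO import_queue VALUES (?, ?)",
        (unique_id, uploaded_at),
    )
    return cur.rowcount == 1  # True only for the first upload of this chunk

print(enqueue("375828ce5510cadb", "2015-12-01 10:47"))  # True: new chunk
print(enqueue("375828ce5510cadb", "2015-12-01 10:47"))  # False: retried duplicate
```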
Import Queue → Import Worker × 3
✓ HA
✓ Load balancing
Realtime Storage and Archive Storage
Import Queue → Import Workers (×3) → Realtime Storage, backed by Amazon S3 / Basho Riak CS, with metadata on PostgreSQL; a separate Archive Storage holds merged files.
Metadata of the records in each file is stored on PostgreSQL:

uploaded time      file index range                             records
2015-03-08 10:47   [2015-12-01 10:47:11, 2015-12-01 10:48:13]   3
2015-03-08 11:09   [2015-12-01 11:09:32, 2015-12-01 11:10:35]   25
2015-03-08 11:38   [2015-12-01 11:38:43, 2015-12-01 11:40:49]   14
…                  …                                            …
A Merge Worker (MapReduce) merges Realtime Storage into Archive Storage every hour, again with Retrying + Unique (at-least-once + at-most-once). Archive metadata on PostgreSQL:

file index range                             records
[2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
[2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
…                                            …
A GiST (R-tree) index on the "time" column of the file metadata makes time-range lookups fast.
Queries read from Archive Storage if the data has been merged; otherwise, from Realtime Storage. A sketch of such metadata follows.
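A minimal sketch of what such metadata could look like (an illustrative schema, not Plazma's actual DDL): a PostgreSQL tsrange column with a GiST index supports fast overlap queries on the files' time ranges.

```python
import psycopg2  # assumes a reachable PostgreSQL instance; the DSN is a placeholder

conn = psycopg2.connect("dbname=plazma_demo")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE archive_files (
        path        text    PRIMARY KEY,
        index_range tsrange NOT NULL,  -- [min(time), max(time)] of records in the file
        records     bigint  NOT NULL
    )
""")
# GiST (R-tree style) index for fast range-overlap lookups
cur.execute("CREATE INDEX files_range_idx ON archive_files USING gist (index_range)")

# Which files can contain records between 11:00 and 12:00?
cur.execute(
    "SELECT path FROM archive_files WHERE index_range && tsrange(%s, %s)",
    ("2015-12-01 11:00:00", "2015-12-01 12:00:00"),
)
paths = [row[0] for row in cur.fetchall()]
```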
Why not use the LIST API?
> The LIST API is slow
> It causes slow queries on large datasets
> Riak CS's LIST is also far too slow!
> The LIST API has a critical problem… ;(
> LIST skips some objects in high-load environments
> And it doesn't return an error…
> Using PostgreSQL improves performance
> Easy to check a time range
> Operation cost is cheaper than S3 calls
Why not MySQL? - benchmark
[Bar chart, values in seconds (scale 0-180): MySQL vs PostgreSQL on three workloads - INSERT 50,000 rows, SELECT sum(id), and SELECT sum(file_size) WHERE index range. Measured values: 0.65, 6.57, 8.79, 168, 3.66, 17.2. PostgreSQL stays far ahead thanks to index-only scans and the GiST index + range type.]
Data Importing (summary)
> Scalable & reliable importing
> Fluentd buffers data on disk
> The import queue deduplicates uploaded chunks
> Workers take the chunks and put them into Realtime Storage
> Instant visibility
> Imported data is immediately visible to query engines
> Background workers merge the files every hour
> Metadata
> The index is built on PostgreSQL using the RANGE type and a GiST index
Data processing
Archive Storage keeps files in MessagePack Columnar File Format on Amazon S3 / Basho Riak CS, with metadata on PostgreSQL. Example data:

time                  code   method
2015-12-01 10:02:36   200    GET
2015-12-01 10:22:09   404    GET
2015-12-01 10:36:45   200    GET
2015-12-01 10:49:21   200    POST
…                     …      …

time                  code   method
2015-12-01 11:10:09   200    GET
2015-12-01 11:21:45   200    GET
2015-12-01 11:38:59   200    GET
2015-12-01 11:43:37   200    GET
2015-12-01 11:54:52   “200”  GET
…                     …      …

path   index range                                  records
…      [2015-12-01 10:00:00, 2015-12-01 11:00:00]   3,312
…      [2015-12-01 11:00:00, 2015-12-01 12:00:00]   2,143
…      …                                            …

Files are partitioned two ways: time-based partitioning (one file per time bucket) and column-based partitioning (column blocks within each file).
A query touches only the partitions it needs:

SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code

Time-based partitioning prunes files by their index range, and column-based partitioning means only the "time" and "code" column blocks are fetched. A sketch of this read path follows.
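A hedged sketch of how the two partitionings could combine at read time (illustrative only: the bucket name, header layout, and helper names are assumptions, not TD's API). Time-based pruning selects files via the metadata query shown earlier; column-based pruning turns each file read into ranged S3 GETs for just the needed column blocks.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "plazma-archive"  # hypothetical bucket name

def read_columns(path, header, columns):
    """Fetch only the requested column blocks of one file via S3 Range GETs."""
    blocks = {}
    for name in columns:
        offset, length = header[name]  # column block offsets parsed from the file header
        resp = s3.get_object(
            Bucket=BUCKET, Key=path,
            Range=f"bytes={offset}-{offset + length - 1}",  # ranged GET
        )
        blocks[name] = resp["Body"].read()
    return blocks

# Time-based pruning: `paths` comes from the metadata range query shown earlier.
# Column-based pruning: only "time" and "code" are fetched for the GROUP BY query.
# for path in paths:
#     blocks = read_columns(path, load_header(path), ["time", "code"])  # load_header: hypothetical
```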
Handling Eventual Consistency
1. Write data / metadata first
> At this point the data is not yet visible
2. Check whether the S3 data is available
> GET, GET, GET… (sketch below)
3. The S3 data becomes visible
> Queries now include the imported data!
Ex. Netflix handles the same problem:
> https://github.com/Netflix/s3mper
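A minimal boto3 sketch of the write-then-verify idea (bucket, key, and timings are made up): poll until the object is readable, and only then mark the metadata visible.

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def upload_and_wait(bucket, key, body, attempts=30):
    s3.put_object(Bucket=bucket, Key=key, Body=body)   # 1. write data first
    for _ in range(attempts):                          # 2. GET, GET, GET…
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True                                # 3. visible: commit metadata now
        except ClientError as e:
            if e.response["Error"]["Code"] != "404":   # only keep polling on not-found
                raise
            time.sleep(1)
    return False  # still invisible: leave metadata uncommitted and retry later
```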
Hide network cost
> Open many connections to S3
> Use the Range feature with columnar offsets
> Improves scan performance for partitioned data
> Detect recoverable errors
> We keep error lists for fault tolerance
> Stall checker
> Watches the progress of reading data
> If processing time reaches a threshold, re-connect to S3 and re-read the data
Optimizing Scan Performance
> Fully utilize the network bandwidth from S3; TD Presto becomes CPU-bound
> TableScanOperator keeps the S3 file list and table schema, and issues requests through a Request Queue (a priority queue with a max-connections limit); buffers are reused under a size limit and released when done
> MPC1 file layout: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m
> HeaderReader sends the header request and calls back to HeaderParser, which parses the MPC file header: column block offsets and column names
> ColumnBlockReader issues column block requests to S3 / RiakCS and prepares a MessageUnpacker per block (decompression, msgpack-java v0.7); the engine then pulls records
> GET requests are retried on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency (see the sketch below)
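A hedged Python sketch of this retry policy (TD Presto's reader is Java; the error names, back-off, and stall threshold here are assumptions): re-issue the ranged GET on recoverable errors and treat an over-long read as a stall.

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
RECOVERABLE = {"500", "InternalError", "503", "SlowDown", "404", "NoSuchKey"}

def ranged_get(bucket, key, offset, length, retries=5, stall_timeout=60):
    byte_range = f"bytes={offset}-{offset + length - 1}"
    for attempt in range(retries):
        try:
            resp = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
            start = time.monotonic()
            body = resp["Body"].read()
            if time.monotonic() - start > stall_timeout:
                raise TimeoutError("stalled read")      # simplified stall check
            return body
        except (ClientError, TimeoutError) as e:
            code = getattr(e, "response", {}).get("Error", {}).get("Code", "")
            if isinstance(e, ClientError) and code not in RECOVERABLE:
                raise                                   # unrecoverable: surface it
            time.sleep(2 ** attempt)                    # back off, re-connect, re-read
    raise RuntimeError(f"giving up on {key} after {retries} attempts")
```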
Recoverable errors
> Error types:
> User error
> Syntax error, semantic error
> Insufficient resources
> Exceeded task memory size
> Internal failure
> I/O error of S3 / Riak CS
> Worker failure
> etc.
We can retry these patterns (resource and internal failures; user errors are not retried)
Presto retry on Internal Errors
> Queries succeed eventually
[Chart (log scale) omitted: retried queries eventually succeed.]
time                  code   method
2015-12-01 10:02:36   200    GET
2015-12-01 10:22:09   404    GET
2015-12-01 10:36:45   200    GET
2015-12-01 10:49:21   200    POST
…                     …      …

user   time                  code   method
391    2015-12-01 11:10:09   200    GET
482    2015-12-01 11:21:45   200    GET
573    2015-12-01 11:38:59   200    GET
664    2015-12-01 11:43:37   200    GET
755    2015-12-01 11:54:52   “200”  GET
…      …                     …      …

Newer records gained a user column, and code now sometimes arrives as the string “200”: the schema changed over time.
MessagePack Columnar File Format is schema-less
✓ Instant schema change
SQL is schema-full
✓ SQL doesn't work without a schema
→ Schema-on-Read

The query engines (Hive, Pig, Presto) sit in the schema-full world and apply the schema while reading schema-less data from Realtime / Archive Storage:

{"user":54, "name":"plazma", "value":"120", "host":"local"}

CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);

| user | name     | value | host |
| 54   | "plazma" | 120   | NULL |

The string "120" is coerced to the declared INT, while "local" cannot be, so host becomes NULL. A minimal sketch of this coercion follows.
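The sketch shows the general idea only, not Hive/Presto's exact casting semantics: each value is cast to the declared column type, and anything that cannot be cast becomes NULL.

```python
SCHEMA = {"user": int, "name": str, "value": int, "host": int}  # CREATE TABLE events (…)

def apply_schema(record):
    """Schema-on-read: cast each field to its declared type; failures become NULL."""
    row = {}
    for column, column_type in SCHEMA.items():
        raw = record.get(column)            # a missing column also becomes NULL
        try:
            row[column] = column_type(raw) if raw is not None else None
        except (TypeError, ValueError):
            row[column] = None              # "local" cannot become an INT -> NULL
    return row

print(apply_schema({"user": 54, "name": "plazma", "value": "120", "host": "local"}))
# {'user': 54, 'name': 'plazma', 'value': 120, 'host': None}
```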
Monitoring
Datadog-based monitoring
> dd-agent for system metrics
> Application metrics are sent via Fluentd (sketch below)
> Hadoop / Presto usage
> Service metrics
> PostgreSQL status
> Check AWS events
> EC2, CloudTrail and more
> Event-based alerts
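For the application-metric path, a minimal sketch with the fluent-logger Python library (assuming a local Fluentd on port 24224; the tag and field names are made up):

```python
from fluent import event, sender

# point the logger at the local Fluentd forwarder
sender.setup("td.metrics", host="localhost", port=24224)

# emit an application metric; Fluentd routes it on to Datadog / Treasure Data
event.Event("presto.query", {"queue_time_ms": 120, "state": "finished"})
```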
[CloudTrail example (screenshot)]
[Presto example (screenshot)]
Pitfall of PostgreSQL on RDS
> PostgreSQL on RDS sits behind a TCP proxy
> The "DB connections" metric counts TCP connections, not PostgreSQL's backend processes
> PostgreSQL spawns a process for each TCP connection
> The problem: a process sometimes keeps running even after its TCP connection is closed
> As a result "DB connections" goes down, yet PostgreSQL can't accept new requests ;(
> We collect the actual metrics from PostgreSQL's own tables (sketch below)
> Some extensions can't be used
Conclusion
> Build a scalable data analytics platform on the cloud
> Separate compute resources from storage
> Loosely-coupled components
> AWS has some pitfalls, but we can avoid them
> There are many trade-offs
> Use an existing component or create a new one?
> Stick to the basics!
Check: treasuredata.com
  treasure-data.hateblo.jp/ (Japanese blog)
Cloud service for the entire data pipeline
