Performance Evaluation of Apache Tajo
Jihoon Son / Gruter Inc.
Goals
● Performance comparison with other systems
● Scalability test of Tajo
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Query
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
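The Hive rewrite mentioned above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only to show that the implicit and explicit join forms are equivalent; it does not run Hive or Tajo. The table and column names follow TPC-DS, but the rows are invented for illustration and the query is a hypothetical example, not one of the benchmark queries.

```python
import sqlite3

# Hypothetical miniature of two TPC-DS tables; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (d_date_sk INTEGER, d_year INTEGER);
    CREATE TABLE store_sales (ss_sold_date_sk INTEGER, ss_net_paid REAL);
    INSERT INTO date_dim VALUES (1, 2000), (2, 2001);
    INSERT INTO store_sales VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

# Implicit join: tables listed in FROM, join condition in WHERE.
implicit_sql = """
    SELECT d_year, SUM(ss_net_paid) AS total
    FROM store_sales, date_dim
    WHERE ss_sold_date_sk = d_date_sk
    GROUP BY d_year
    ORDER BY d_year
"""

# Explicit join: the same condition moved into JOIN ... ON.
explicit_sql = """
    SELECT d_year, SUM(ss_net_paid) AS total
    FROM store_sales
    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    GROUP BY d_year
    ORDER BY d_year
"""

implicit_rows = conn.execute(implicit_sql).fetchall()
explicit_rows = conn.execute(explicit_sql).fetchall()

# Both forms return the same result set.
assert implicit_rows == explicit_rows
print(explicit_rows)
```

Only the join syntax changes between the two forms; the result is the same, which is why such a rewrite should not affect what the benchmark query computes.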
Performance Comparison with Other Systems
Target Systems
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses all available CPU and memory
Target Systems
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions adjusted for better performance
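The deck does not say which value spark.sql.shuffle.partitions was set to. As a rough sketch only, the snippet below derives a candidate value from the cluster described in this deck (50 instances, 8 cores each) using the general Spark tuning rule of thumb of 2 to 3 tasks per CPU core. The helper name and the multiplier of 2 are assumptions for illustration, not the setting used in the evaluation.

```python
def suggested_shuffle_partitions(instances, cores_per_instance, tasks_per_core=2):
    """Candidate partition count: total cores times tasks per core.

    Hypothetical helper; tasks_per_core=2 is an assumed rule-of-thumb value.
    """
    return instances * cores_per_instance * tasks_per_core


# Cluster size taken from the deck: 50 n1-standard-8 instances, 8 cores each.
partitions = suggested_shuffle_partitions(instances=50, cores_per_instance=8)

# Format the value as a spark-submit style configuration flag.
conf_flag = f"--conf spark.sql.shuffle.partitions={partitions}"
print(conf_flag)
```

Spark's default for this property is 200, which is fewer than the 400 cores in this cluster, so some adjustment is plausible; the best value in practice depends on the data volume per shuffle.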
SF 1000, 50 instances
● Whole data can be loaded into memory
● Tajo is always the fastest one except q50
● Cannot be run on 1TB
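The "whole data can be loaded into memory" note can be checked with simple arithmetic. The sketch below assumes that a TPC-DS scale factor of N corresponds to roughly N GB of raw data (so SF 1000 is about 1 TB) and compares that with the aggregate RAM of the cluster. It ignores in-memory format overhead and memory reserved by the OS and the engines, so it is an upper-bound estimate rather than a measurement.

```python
INSTANCES = 50            # from the deck
RAM_GB_PER_INSTANCE = 30  # n1-standard-8, from the deck


def memory_headroom(scale_factor, instances=INSTANCES, ram_gb=RAM_GB_PER_INSTANCE):
    """Ratio of aggregate cluster RAM to raw data size.

    Assumes raw data size in GB is approximately equal to the scale factor.
    """
    aggregate_ram_gb = instances * ram_gb
    return aggregate_ram_gb / scale_factor


aggregate_ram_gb = INSTANCES * RAM_GB_PER_INSTANCE
for sf in (1000, 10000):
    ratio = memory_headroom(sf)
    verdict = "fits" if ratio >= 1 else "does not fit"
    print(f"SF {sf}: {aggregate_ram_gb} GB RAM / {sf} GB data = {ratio:.2f} ({verdict})")
```

Under these assumptions the SF 1000 data set is smaller than the cluster's total memory, while the SF 10000 data set is several times larger.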
SF 10000, 50 instances
● Complex query: join of 2 tables and 1 derived table (5-table join) + aggregation + sort
● Already fixed in 0.11.1 (TAJO-1271, TAJO-1950, TAJO-1983, TAJO-2000)
Scalability Test
SF 10000
● Ideal improvement
● q55 is a very simple query, so Tajo does not need huge resources to execute it.
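The "ideal improvement" reference in a scalability test is usually linear speedup: doubling the number of instances should halve the runtime. The sketch below shows one way to compare measured runtimes against that ideal. The instance counts and runtimes are hypothetical placeholders, since the deck's chart values are not available here.

```python
def speedup_report(runtimes):
    """Compare measured speedup against ideal linear speedup.

    runtimes maps instance count -> runtime in seconds.
    Returns a list of (instances, actual_speedup, ideal_speedup, efficiency),
    all relative to the smallest cluster size in the input.
    """
    base_n = min(runtimes)
    base_t = runtimes[base_n]
    report = []
    for n in sorted(runtimes):
        ideal = n / base_n            # linear scaling with instance count
        actual = base_t / runtimes[n]  # measured speedup over the baseline
        report.append((n, actual, ideal, actual / ideal))
    return report


# Hypothetical runtimes in seconds; NOT results from the deck.
hypothetical_runtimes = {10: 1000.0, 20: 520.0, 40: 300.0}

report = speedup_report(hypothetical_runtimes)
for n, actual, ideal, efficiency in report:
    print(f"{n} instances: speedup {actual:.2f} (ideal {ideal:.2f}), efficiency {efficiency:.0%}")
```

A very simple query such as q55 would be expected to fall well below the ideal line, because its runtime is dominated by fixed costs rather than by work that more instances can share.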
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
Q & A
