A Fast Intro to Fast Query with ClickHouse
Robert Hodges, Altinity CEO
Altinity Background
● Premier provider of software and services for ClickHouse
● Incorporated in UK with distributed team in US/Canada/Europe
● Main US/Europe sponsor of ClickHouse community
● Offerings:
○ Enterprise support for ClickHouse and ecosystem projects
○ Software (Kubernetes, cluster manager, tools & utilities)
○ POCs/Training
The shape of data has changed
Business insights are hidden in massive pools of automatically collected information
Applications that rule the digital era have a common success factor
The ability to discover and apply business-critical insights from petabyte datasets in real time
Let’s consider a concrete example
Web properties track clickstreams to:
● Calculate clickthrough/buy rates
● Guide ad placement
● Optimize eCommerce services
Constraints:
● Run on commodity hardware
● Simple to operate
● Fast interactive query
● Avoid encumbering licenses
Existing analytic databases do not fully meet the requirements
● Cloud-native data warehouses cannot operate on-prem, limiting range of solutions
● Legacy SQL databases are expensive to run, scale poorly on commodity hardware, and adapt slowly
● Hadoop/Spark ecosystem solutions are resource-intensive with slow response and complex pipelines
● Specialized solutions limit query domain and are complex/resource-inefficient for general use
ClickHouse fills the gaps and does much more besides
Understands SQL
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is open source (Apache 2.0)
Is WAY fast!
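To make "stores data in columns" concrete, here is a minimal Python sketch of the two layouts. It is only an illustration: the sample rows and the function names are invented, and it models the idea rather than ClickHouse's actual storage code. The point is that an aggregate over one column only has to read that column.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    (1, "EWR", 12.0),
    (2, "SFO", 30.5),
    (3, "EWR", 4.5),
]

# Column-oriented layout: one contiguous array per column (hypothetical sample data).
columns = {
    "Id": array("q", [1, 2, 3]),
    "Dest": ["EWR", "SFO", "EWR"],
    "ArrDelayMinutes": array("d", [12.0, 30.5, 4.5]),
}

def avg_delay_rows(rows):
    # Has to walk every full record even though only one field is needed.
    return sum(r[2] for r in rows) / len(rows)

def avg_delay_columns(columns):
    # Reads a single column; Id and Dest are never touched.
    delays = columns["ArrDelayMinutes"]
    return sum(delays) / len(delays)

print(avg_delay_rows(rows), avg_delay_columns(columns))
```

Both functions return the same average; the columnar version simply reads less data, which is what lets analytic queries over a few columns of a wide table stay fast.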
What does “WAY fast” mean?
SELECT Dest d, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime GROUP BY d HAVING c > 100000
ORDER BY ad DESC limit 5
┌─d───┬───────c─┬─────────────────ad─┐
│ EWR │ 3660570 │ 17.637564095209218 │
│ SFO │ 4056003 │ 16.029478528492213 │
│ JFK │ 2198078 │  15.33669824273752 │
│ LGA │ 3133582 │ 14.533851994299177 │
│ ORD │ 9108159 │ 14.431460737565077 │
└─────┴─────────┴────────────────────┘
5 rows in set. Elapsed: 1.182 sec. Processed 173.82 million
rows, 2.78 GB (147.02 million rows/s., 2.35 GB/s.)
(Amazon m5d.2xlarge: Xeon(R) Platinum 8175M, 8vCPU, 30GB RAM, NVMe SSD)
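As a rough sanity check, the throughput figures can be recomputed from the numbers in the client output. This sketch assumes decimal gigabytes, and because the printed values are rounded, the results only approximately match what ClickHouse reports.

```python
rows_processed = 173.82e6   # "Processed 173.82 million rows"
bytes_read = 2.78e9         # "2.78 GB", assuming decimal gigabytes
elapsed_s = 1.182           # "Elapsed: 1.182 sec."

rows_per_s = rows_processed / elapsed_s
bytes_per_s = bytes_read / elapsed_s
bytes_per_row = bytes_read / rows_processed

print(f"{rows_per_s / 1e6:.0f} million rows/s")
print(f"{bytes_per_s / 1e9:.2f} GB/s")
print(f"{bytes_per_row:.0f} bytes per row")
```

The last figure comes out at about 16 bytes per row, which would be consistent with the query reading only the columns it names (Dest and ArrDelayMinutes) rather than whole rows.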
What are the main ClickHouse use patterns?
● Fast, scalable data warehouse for online services (SaaS and in-house apps)
● Built-in data warehouse for installed analytic applications
● Exploration -- throw in a bunch of data and go crazy!
Getting started is easy with Docker image
$ docker run -d --name ch-s yandex/clickhouse-server
$ docker exec -it ch-s clickhouse client
...
11e99303c78e :) select version()
SELECT version()
┌─version()─┐
│ 19.3.3    │
└───────────┘
1 rows in set. Elapsed: 0.001 sec.
Or install recommended Altinity stable version packages
$ sudo apt -y install clickhouse-client=18.16.1 \
    clickhouse-server=18.16.1 \
    clickhouse-common-static=18.16.1
...
$ sudo systemctl start clickhouse-server
...
11e99303c78e :) select version()
SELECT version()
┌─version()─┐
│ 18.16.1   │
└───────────┘
1 rows in set. Elapsed: 0.001 sec.
Examples of table creation and data insertion
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree() PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:29:55', 18.0),
(15, 'TEMP', '2018-01-01', '2018-01-01 23:30:56', 18.7)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:31:53', 18.1),
(2, 'TEMP', '2018-01-01', '2018-01-01 23:31:55', 7.9)
Loading data from CSV files
cat > sdata.csv <<END
DevId,Type,MDate,MDatetime,Value
59,"TEMP","2018-02-01","2018-02-01 01:10:13",19.5
59,"TEMP","2018-02-01","2018-02-01 02:10:01",18.8
59,"TEMP","2018-02-01","2018-02-01 03:09:58",18.6
59,"TEMP","2018-02-01","2018-02-01 04:10:05",15.1
59,"TEMP","2018-02-01","2018-02-01 05:10:31",12.2
59,"TEMP","2018-02-01","2018-02-01 06:10:02",11.8
59,"TEMP","2018-02-01","2018-02-01 07:09:55",10.9
END
cat sdata.csv | clickhouse-client --database foo \
    --query='INSERT INTO sdata FORMAT CSVWithNames'
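Before piping a file into clickhouse-client it can help to check that every row converts cleanly to the table's column types. The following is a minimal Python sketch of such a check, not part of ClickHouse itself: the function name parse_sdata is invented for illustration, and the type mapping simply mirrors the sdata schema (Int32, String, Date, DateTime, Float64).

```python
import csv
import io
from datetime import date, datetime

# Two sample rows in the same header-plus-data layout as sdata.csv.
CSV_TEXT = """DevId,Type,MDate,MDatetime,Value
59,"TEMP","2018-02-01","2018-02-01 01:10:13",19.5
59,"TEMP","2018-02-01","2018-02-01 02:10:01",18.8
"""

def parse_sdata(text):
    """Parse CSV with a header row into typed rows; raises ValueError on bad data."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for rec in reader:
        rows.append({
            "DevId": int(rec["DevId"]),
            "Type": rec["Type"],
            "MDate": date.fromisoformat(rec["MDate"]),
            "MDatetime": datetime.strptime(rec["MDatetime"], "%Y-%m-%d %H:%M:%S"),
            "Value": float(rec["Value"]),
        })
    return rows

rows = parse_sdata(CSV_TEXT)
print(len(rows), "rows parsed")
```

The header row plays the same role here as in the CSVWithNames format: values are matched to columns by name.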
Select results can be surprising!
SELECT *
FROM sdata
WHERE
DevId < 20
Result right after INSERT:
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │    18 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │  18.7 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│     2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │   7.9 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │  18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
Result somewhat later:
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│     2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │   7.9 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │    18 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │  18.7 │
│    15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │  18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
Time for some research into table engines
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
ENGINE = MergeTree(): how to manage data and handle queries
PARTITION BY toYYYYMM(MDate): how to break table into parts
ORDER BY (DevId, MDatetime): how to index and sort data in each part
MergeTree writes parts quickly and merges them in the background
/var/lib/clickhouse/data/default/sdata
201801_1_1_0/
201801_2_2_0/
Multiple parts after initial insertion ( => very fast writes)
201801_1_2_1/
Single part after merge ( => very fast reads)
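The part names above, and the surprising SELECT results earlier, can be reproduced with a small Python model. This is a sketch under stated assumptions, not ClickHouse code: the Part class and the insert and merge helpers are invented for illustration, and rows are reduced to (DevId, MDatetime, Value) tuples so that plain tuple ordering stands in for ORDER BY (DevId, MDatetime).

```python
import heapq
from dataclasses import dataclass

@dataclass
class Part:
    partition: str   # e.g. "201801", from toYYYYMM(MDate)
    min_block: int
    max_block: int
    level: int       # how many merge generations produced this part
    rows: list       # kept sorted by (DevId, MDatetime)

    @property
    def name(self):
        return f"{self.partition}_{self.min_block}_{self.max_block}_{self.level}"

def insert(partition, block, rows):
    # Each INSERT is written as its own small sorted part.
    return Part(partition, block, block, 0, sorted(rows))

def merge(a, b):
    # A merge combines sorted parts of one partition into a larger sorted part.
    assert a.partition == b.partition
    merged = list(heapq.merge(a.rows, b.rows))
    return Part(a.partition, min(a.min_block, b.min_block),
                max(a.max_block, b.max_block),
                max(a.level, b.level) + 1, merged)

p1 = insert("201801", 1, [(15, "2018-01-01 23:29:55", 18.0),
                          (15, "2018-01-01 23:30:56", 18.7)])
p2 = insert("201801", 2, [(15, "2018-01-01 23:31:53", 18.1),
                          (2, "2018-01-01 23:31:55", 7.9)])
p12 = merge(p1, p2)
print(p1.name, p2.name, "->", p12.name)
for row in p12.rows:
    print(row)
```

Before the merge a SELECT returns one block per part, each sorted on its own; after the merge there is a single block with DevId 2 first, which matches the two result shapes shown for sdata.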
Rows are indexed and sorted inside each part
/var/lib/clickhouse/data/default/sdata
201801_1_2_1/
  primary.idx on (DevId, MDatetime)
  one .mrk and one .bin file per column: DevId, Type, MDate, MDatetime, ...
201802_1_1_0/
  primary.idx on (DevId, MDatetime)
  one .mrk and one .bin file per column: DevId, Type, MDate, MDatetime, ...
Sample (DevId, MDatetime) values from the diagram:
... ...
956 2018-01-01 15:22:37
575 2018-01-01 23:31:53
1300 2018-01-02 05:14:47
... ...
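A rough way to picture how primary.idx narrows a search: the index keeps the key of the first row of every block of rows (a granule), and a lookup uses it to pick which granules to read from the .bin files. The Python sketch below models that idea only. The granule size of 3 and the sample keys are made up for illustration (the real default index_granularity is 8192), and the lookup function is hypothetical.

```python
from bisect import bisect_left, bisect_right

GRANULE = 3  # illustrative only; ClickHouse's default index_granularity is 8192

# 20 hypothetical rows, already sorted by (DevId, MDatetime).
keys = [(d, f"2018-01-01 00:{m:02d}:00") for d in range(1, 6) for m in range(4)]

# Sparse index: the key of the first row of each granule.
marks = keys[::GRANULE]

def lookup(dev_id):
    """Return (matching rows, number of rows actually scanned)."""
    first = [k[0] for k in marks]
    # The granule just before the first mark >= dev_id may still end with
    # matching rows, so it is included conservatively.
    lo = max(bisect_left(first, dev_id) - 1, 0)
    hi = bisect_right(first, dev_id)
    scanned = keys[lo * GRANULE:hi * GRANULE]
    return [k for k in scanned if k[0] == dev_id], len(scanned)

hits, scanned = lookup(3)
print(len(hits), "rows found after scanning", scanned, "of", len(keys))
```

Only the selected granules are read and filtered; the rest of the part is skipped without being touched.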
Now we can follow how a query works on a single server
SELECT DevId, Type, avg(Value)
FROM sdata
WHERE MDate = '2018-01-01'
GROUP BY DevId, Type
Identify parts to search
Query in parallel
Aggregate results
Result Set
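The three steps can be sketched in Python as follows. This is a simplified model, not the ClickHouse executor: the parts dictionary, scan_part and query are invented names, and the rows are a small sample shaped like sdata. What it tries to show is that each part yields a partial aggregate (sum and count), and the average is only finalized after the partials are merged.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parts: part name -> rows of (DevId, Type, MDate, Value).
parts = {
    "201801_1_2_1": [(2, "TEMP", "2018-01-01", 7.9),
                     (15, "TEMP", "2018-01-01", 18.0),
                     (15, "TEMP", "2018-01-01", 18.7),
                     (15, "TEMP", "2018-01-01", 18.1)],
    "201802_1_1_0": [(59, "TEMP", "2018-02-01", 19.5),
                     (59, "TEMP", "2018-02-01", 18.8)],
}

def scan_part(rows, mdate):
    # Partial aggregate per group: [sum, count], not a finished average.
    partial = defaultdict(lambda: [0.0, 0])
    for dev_id, type_, row_date, value in rows:
        if row_date == mdate:
            state = partial[(dev_id, type_)]
            state[0] += value
            state[1] += 1
    return partial

def query(mdate):
    # Step 1: identify parts whose partition (YYYYMM prefix) can contain mdate.
    wanted = mdate[:4] + mdate[5:7]
    selected = [rows for name, rows in parts.items() if name.startswith(wanted)]
    # Step 2: scan the selected parts in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda rows: scan_part(rows, mdate), selected))
    # Step 3: merge the partial states, then finalize avg = sum / count.
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}

result = query("2018-01-01")
print(result)
```

Merging sums and counts rather than averaging per-part averages keeps the result correct when groups are spread unevenly across parts.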
ClickHouse Distributed engine spreads queries across shards
SELECT ...
FROM sdata_dist
[Diagram: three ClickHouse servers, each holding sdata_dist (Distributed) and sdata (MergeTree). The query goes to sdata_dist, is spread across sdata on the shards, and comes back as a single Result Set.]
ReplicatedMergeTree engine spreads data over shards and replicas
SELECT ...
FROM sdata_dist
[Diagram: six ClickHouse servers, each holding sdata_dist and sdata (ReplicatedMergeTree engine), with three ZooKeeper nodes alongside. The query against sdata_dist returns a single Result Set.]
With basic engine knowledge you can now tune queries
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
Scans 355 table parts in parallel; does not use index
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
WHERE toYear(FlightDate) = toYear(toDate('2016-01-01'))
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
Faster: scans 12 parts (3% of data) because FlightDate is the partition key
Hint: clickhouse-server.log has the query plan
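The drop in scanned parts can be pictured with a toy layout. The sketch below assumes one part per monthly partition over a made-up range of years, so its totals are illustrative and do not reproduce the 355 parts of the real ontime table; parts_to_scan is a hypothetical helper.

```python
# Hypothetical layout: one part per monthly partition, 1988-01 through 2017-12.
parts = [f"{y}{m:02d}_1_1_0" for y in range(1988, 2018) for m in range(1, 13)]

def parts_to_scan(parts, year=None):
    # Without a predicate on the partition key, every part must be read.
    if year is None:
        return parts
    # With one, parts from other partitions are skipped before any data is read.
    return [p for p in parts if p.startswith(str(year))]

all_parts = parts_to_scan(parts)
pruned = parts_to_scan(parts, 2016)
print(len(all_parts), len(pruned), f"{len(pruned) / len(all_parts):.1%}")
```

A filter on a single year leaves twelve monthly partitions, a few percent of the table, which is the same order of saving the slide reports.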
You can also optimize joins
SELECT
Dest d, Name n, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime
JOIN airports ON (airports.IATA = ontime.Dest)
GROUP BY d, n HAVING c > 100000 ORDER BY ad DESC
Joins on data before GROUP BY, increasing the amount to scan
SELECT dest, Name n, c AS flights, ad FROM (
SELECT Dest dest, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime
GROUP BY dest HAVING c > 100000
ORDER BY ad DESC
) LEFT JOIN airports ON airports.IATA = dest
Faster: subquery minimizes data scanned in parallel; joins on GROUP BY results
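Why the subquery form does less work can be shown by counting lookups into the small table. This Python sketch uses invented sample data and function names; it models the two query shapes rather than ClickHouse's join implementation.

```python
from collections import Counter

# Hypothetical fact rows (Dest only) and a small dimension table.
flights = ["EWR"] * 5 + ["SFO"] * 3 + ["JFK"] * 1
airports = {"EWR": "Newark Liberty", "SFO": "San Francisco", "JFK": "John F. Kennedy"}

def join_then_group(flights, airports, min_count):
    lookups = 0
    counts = Counter()
    for dest in flights:               # one dimension lookup per fact row
        lookups += 1
        counts[(dest, airports[dest])] += 1
    result = {k: c for k, c in counts.items() if c > min_count}
    return result, lookups

def group_then_join(flights, airports, min_count):
    counts = Counter(flights)          # aggregate on the fact table alone
    kept = {d: c for d, c in counts.items() if c > min_count}
    lookups = 0
    result = {}
    for dest, c in kept.items():       # one lookup per surviving group
        lookups += 1
        result[(dest, airports.get(dest))] = c   # .get mirrors LEFT JOIN
    return result, lookups

slow, slow_lookups = join_then_group(flights, airports, 2)
fast, fast_lookups = group_then_join(flights, airports, 2)
print(slow == fast, slow_lookups, fast_lookups)
```

Both shapes return the same groups, but the second touches the dimension table once per group that survives HAVING instead of once per fact row.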
ClickHouse has a wealth of features to help queries go fast
Dictionaries
Materialized Views
Arrays
Specialized functions and SQL extensions
Lots more table engines
...And a nice set of supporting ecosystem tools
Client libraries: JDBC, ODBC, Python, Golang, ...
Kafka table engine to ingest from Kafka queues
Visualization tools: Grafana, Tableau, Tabix, SuperSet
Data science stack integration: Pandas, Jupyter Notebooks
Kubernetes ClickHouse operator
Where to get more information
● ClickHouse Docs: https://clickhouse.yandex/docs/en/
● Altinity Blog: https://www.altinity.com/blog
● Meetups and conference presentations
○ 2 April -- Madrid, Spain ClickHouse Meetup
○ 28-30 May -- Austin, TX Percona Live 2019
○ San Francisco ClickHouse Meetup
Questions?
Thank you!
Contacts:
info@altinity.com
Visit us at:
https://www.altinity.com
Read Our Blog:
https://www.altinity.com/blog
