今天开始,我想做一个 Greenplum 和 ClickHouse 的一系列简单测试对比,使用CK官网提供的数据集,主要目的有2个:
- 简单体验一下 CK 在单表测试上的性能比 GP 好多少;
- 踩踩坑,积累一些二者对比测试经验。
特别说明:由于二者的特点各不相同,该测试仅供大家简单参考,不作为任何评判标准。
测试环境
- 个人 MacBook 笔记本
- 内存 32 GB
- CPU 6 cores
- 本地 SSD
- GPDB 分配 5个 Segment Instance
- CK 单 Server
NYC taxi data 简单测试对比
首先我们今天测试的参考内容源自这里 -> https://clickhouse.com/docs/en/tutorial
1. CK 数据加载
首先我们根据官网指导,创建表:
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
接下来,我们加载数据,CK 提供 S3 上的数据操作函数,可以一条命令下载并插入数据到表中,还是比较方便的:
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames'
)
我家里的网络条件一般般,最后下载入库花了大概 4 分钟:
Query id: 62eeb743-7f34-441a-8e8b-3bd7f82163db
Ok.
0 rows in set. Elapsed: 221.145 sec. Processed 2.00 million rows, 974.44 MB (9.04 thousand rows/s., 4.41 MB