2023-04-21 DuckDB: Parallel Grouped Aggregation

This post takes a deep look at how DuckDB uses parallel, partitioned hash tables to implement efficient grouped aggregation, providing fast and scalable aggregation for large-scale data analysis. It discusses key design points such as sorting, hash tables, collision handling, linear probing, and parallelization strategies, and presents experiments showing DuckDB's performance advantages across different data sizes and group counts.

Parallel Grouped Aggregation in DuckDB

TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups.

Grouped aggregations are a core data analysis operation. They are particularly important for large-scale data analysis ("OLAP") because they are useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.

Jump straight to the benchmarks?

Introduction

GROUP BY changes the result set cardinality: instead of returning the same number of rows as the input (like a normal SELECT), GROUP BY returns as many rows as there are groups in the data. Consider this (weirdly familiar) example query:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_extendedprice),
    avg(l_quantity)
FROM
    lineitem
GROUP BY
    l_returnflag,
    l_linestatus;

GROUP BY is followed by two column names, l_returnflag and l_linestatus. Those are the columns to compute the groups on, and the resulting table will contain every combination of values in those columns that occurs in the data. We refer to the columns in the GROUP BY clause as the "grouping columns" and all occurring combinations of values therein as "groups". The SELECT clause contains four (not five) expressions: references to the two grouping columns, and two aggregates: the sum over l_extendedprice and the avg over l_quantity. We refer to those as the "aggregates". If executed, the result of this query looks something like this:

l_returnflag  l_linestatus  sum(l_extendedprice)  avg(l_quantity)
N             O             114935210409.19       25.5
R             F             56568041380.9         25.51
A             F             56586554400.73        25.52
N             F             1487504710.38         25.52

In general, SQL allows only columns that are mentioned in the GROUP BY clause to appear directly in the SELECT expressions; all other columns need to be wrapped in one of the aggregate functions like sum, avg, etc. Many more aggregate functions are available depending on which SQL system you use.

How should a query processing engine compute such an aggregation? There are many design decisions involved, and we will discuss those below, in particular the decisions made by DuckDB. The main issue when computing grouping results is that the groups can occur in the input table in any order. Were the input already sorted on the grouping columns, computing the aggregation would be trivial: we could just compare the current values for the grouping columns with the previous ones, and whenever a change occurs, the next group begins and a new aggregation result needs to be computed.

Since the sorted case is easy, one straightforward way of computing grouped aggregates is to sort the input table on the grouping columns first and then use the trivial approach. But sorting the input is unfortunately still a computationally expensive operation despite our best efforts. In general, sorting has a computational complexity of O(n log n), with n being the number of rows sorted.
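To make the sort-based approach concrete, here is a minimal C++ sketch of the general technique (an illustration only, not DuckDB's actual code; the Row struct and the hard-coded input are made up for the example). We sort the rows on the grouping columns and then compute sum(l_extendedprice) in a single scan, starting a new group whenever the grouping values change:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <tuple>
#include <vector>

struct Row {
    char returnflag;
    char linestatus;
    double extendedprice;
};

int main() {
    std::vector<Row> input = {
        {'N', 'O', 10.0}, {'R', 'F', 20.0}, {'N', 'O', 30.0}, {'A', 'F', 5.0}};

    // O(n log n): order the rows so that equal groups become adjacent.
    std::sort(input.begin(), input.end(), [](const Row &a, const Row &b) {
        return std::tie(a.returnflag, a.linestatus) <
               std::tie(b.returnflag, b.linestatus);
    });

    // O(n): single scan; a change in the grouping columns starts a new group.
    for (std::size_t i = 0; i < input.size();) {
        std::size_t start = i;
        double sum = 0;
        while (i < input.size() &&
               input[i].returnflag == input[start].returnflag &&
               input[i].linestatus == input[start].linestatus) {
            sum += input[i].extendedprice;
            ++i;
        }
        std::cout << input[start].returnflag << " " << input[start].linestatus
                  << " sum=" << sum << "\n";
    }
}

The scan itself is linear, but the up-front sort dominates the cost, which is exactly why the article looks for an alternative next.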

Hash Tables for Aggregation

A better way is to use a hash table. Hash tables are a fundamental data structure that supports inserts and lookups in (expected) constant time. This allows us to aggregate the input in a single pass: for each row, we probe the hash table with the values of the grouping columns; if an entry for that group already exists, we update its aggregate states, otherwise we create a new entry. The overall complexity drops to O(n), and the input never has to be sorted.
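As a contrast to the sort-based sketch above, here is the same aggregation computed with a hash table. This is again just an illustrative C++ sketch under simplifying assumptions, not DuckDB's aggregate hash table: std::unordered_map stands in for the real data structure, each group key maps to a small aggregate state, and avg(l_quantity) is maintained as a running sum and count that is finalized when the results are read out.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Row {
    char returnflag;
    char linestatus;
    double extendedprice;
    double quantity;
};

// Aggregate state for sum(l_extendedprice) and avg(l_quantity);
// avg is kept as a running (sum, count) pair and finalized at the end.
struct AggState {
    double price_sum = 0;
    double quantity_sum = 0;
    std::size_t count = 0;
};

int main() {
    std::vector<Row> input = {
        {'N', 'O', 10.0, 2}, {'R', 'F', 20.0, 4}, {'N', 'O', 30.0, 6}};

    // O(n) overall: each row does one expected-constant-time lookup/insert.
    std::unordered_map<std::string, AggState> groups;
    for (const Row &row : input) {
        std::string key{row.returnflag, row.linestatus};
        AggState &state = groups[key]; // creates a fresh state for new groups
        state.price_sum += row.extendedprice;
        state.quantity_sum += row.quantity;
        state.count += 1;
    }

    for (const auto &[key, state] : groups) {
        std::cout << key[0] << " " << key[1] << " sum=" << state.price_sum
                  << " avg=" << state.quantity_sum / state.count << "\n";
    }
}

Note that the hash table needs only one entry per group, not per row, so its size is bounded by the number of distinct groups, which is what makes this approach attractive for large inputs with comparatively few groups.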
