Parallel Grouped Aggregation in DuckDB
TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups.
Grouped aggregation is a core data analysis operation. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.
Introduction
`GROUP BY` changes the result set cardinality: instead of returning the same number of rows as the input (like a normal `SELECT`), `GROUP BY` returns as many rows as there are groups in the data. Consider this (weirdly familiar) example query:
SELECT
l_returnflag,
l_linestatus,
sum(l_extendedprice),
avg(l_quantity)
FROM
lineitem
GROUP BY
l_returnflag,
l_linestatus;
`GROUP BY` is followed by two column names, `l_returnflag` and `l_linestatus`. Those are the columns to compute the groups on, and the resulting table will contain all combinations of values in those columns that occur in the data. We refer to the columns in the `GROUP BY` clause as the “grouping columns” and all occurring combinations of values therein as “groups”. The `SELECT` clause contains four expressions: references to the two grouping columns, plus two aggregates, the `sum` over `l_extendedprice` and the `avg` over `l_quantity`. We refer to those as the “aggregates”. If executed, the result of this query looks something like this:
l_returnflag | l_linestatus | sum(l_extendedprice) | avg(l_quantity) |
---|---|---|---|
N | O | 114935210409.19 | 25.5 |
R | F | 56568041380.9 | 25.51 |
A | F | 56586554400.73 | 25.52 |
N | F | 1487504710.38 | 25.52 |
In general, SQL only allows columns that are mentioned in the `GROUP BY` clause to appear directly in the `SELECT` expressions; all other columns need to be wrapped in one of the aggregate functions like `sum`, `avg`, etc. There are many more aggregate functions, depending on which SQL system you use.
How should a query processing engine compute such an aggregation? There are many design decisions involved, and we will discuss those below, in particular the decisions made by DuckDB. The main issue when computing grouping results is that the groups can occur in the input table in any order. Were the input already sorted on the grouping columns, computing the aggregation would be trivial: we could just compare the current values of the grouping columns with the previous ones, and whenever a change occurs, the next group begins and a new aggregation result needs to be computed. Since the sorted case is easy, one straightforward way of computing grouped aggregates is to sort the input table on the grouping columns first and then use this trivial approach. But sorting the input is unfortunately still a computationally expensive operation, despite our best efforts. In general, sorting has a computational complexity of O(n log n), with n being the number of rows sorted.
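
To make the sorted-input approach concrete, here is a minimal C++ sketch of that trivial scan. This is not DuckDB's actual implementation: it assumes the rows already arrive ordered by the grouping columns and emits a result row whenever the grouping values change. The `Row` struct and the column names are illustrative stand-ins borrowed from the example query.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative row layout mirroring the example query; not DuckDB's internal format.
struct Row {
    std::string l_returnflag;
    std::string l_linestatus;
    double l_extendedprice;
    double l_quantity;
};

// Aggregate rows that are ALREADY sorted on (l_returnflag, l_linestatus):
// whenever the grouping values change, the previous group is complete and
// its sum/avg can be emitted.
void aggregate_sorted(const std::vector<Row> &rows) {
    if (rows.empty()) {
        return;
    }
    std::string flag = rows[0].l_returnflag;
    std::string status = rows[0].l_linestatus;
    double sum_price = 0.0, sum_qty = 0.0;
    std::size_t count = 0;

    auto emit = [&]() {
        std::printf("%s | %s | %.2f | %.2f\n", flag.c_str(), status.c_str(),
                    sum_price, sum_qty / static_cast<double>(count));
    };

    for (const Row &row : rows) {
        if (row.l_returnflag != flag || row.l_linestatus != status) {
            emit();  // group boundary reached: output the finished group
            flag = row.l_returnflag;
            status = row.l_linestatus;
            sum_price = sum_qty = 0.0;
            count = 0;
        }
        sum_price += row.l_extendedprice;  // running sum(l_extendedprice)
        sum_qty += row.l_quantity;         // running state for avg(l_quantity)
        ++count;
    }
    emit();  // don't forget the last group
}

int main() {
    // Tiny hand-sorted input; a real engine would handle millions of rows.
    std::vector<Row> rows = {
        {"A", "F", 100.0, 10.0}, {"A", "F", 200.0, 20.0}, {"N", "O", 300.0, 30.0},
    };
    aggregate_sorted(rows);  // prints one line per group
    return 0;
}
```

The scan itself is linear; the expensive part is producing the sorted input in the first place, which is why the next section turns to hash tables.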
Hash Tables for Aggregation
A better way is to use a hash table. Hash tables are a