Hologres的索引类别和使用说明

Double@加贝

已于 2025-06-22 16:26:16 修改

阅读量741

点赞数 21

CC 4.0 BY-SA版权

分类专栏： Hologres 文章标签：数据库 Hologres

于 2025-06-15 00:37:44 首次发布

本文链接：https://blog.csdn.net/mrhs_dhls/article/details/148660419

Hologres 专栏收录该内容

5 篇文章

订阅专栏

文章目录

一、主键Primary Key
二、分布键Distribution Key
三、Event Time Column（Segment Key）
四、聚簇索引Clustering Key
五、位图索引Bitmap

一、主键Primary Key

1、主键介绍

Hologres中，系统会自动在底层保存一个主键索引文件，采用行存结构存储，提供高速的KV（key- value）服务，索引文件的Key为表的主键PK，Value为RID（Row Identified，原名为unique_id）和聚簇索引（Clustering Key）。

2、使用建议

1. 主键尽量选择含有实际业务意义的字段
2. 不建议将Serial类型的字段设置为主键，因为Serial类型在写入的时候是表锁，导致写入性能有损失，且随着数据的增长，长度容易溢出。

3、使用限制

被设置为主键的字段是唯一且非空的列或者列组合，同时只能在一个语句里设置多列为表的主键。

最多支持32个列组成的联合主键。

不支持将FLOAT、DOUBLE、NUMERIC、ARRAY、JSON、JSONB、DATE及其他复杂数据类型的字段设置为主键。Hologres从 V1.3.22及以上版本支持将Date类型字段设为主键。

行存表必须设置主键，行列共存表必须设置主键，列存表不要求有主键。

不支持修改主键，如需修改主键请重新建表。

4、使用示例

①、新建普通列存表单个主键。

BEGIN;
CREATE TABLE tbl_1 (
 id bigint NOT NULL,
 name text NOT NULL,
 age bigint,
 class text,
 reg_timestamp timesatmptz, 
PRIMARY KEY (id)
);
CALL set_table_property('tbl_1', 'orientation', 'column');
CALL set_table_property('tbl_1', 'distribution_key', 'id');
CALL set_table_property('tbl_1', 'clustering_key', 'age');
CALL set_table_property('tbl_1', 'event_time_column', 'reg_timestamp');
CALL set_table_property('tbl_1', 'bitmap_columns', 'name,class');
CALL set_table_property('tbl_1', 'dictionary_encoding_columns', 'class:auto');
COMMIT;

②、新建普通列存表两个主键。

BEGIN;
CREATE TABLE tbl_2 (
 id bigint NOT NULL,
 name text NOT NULL,
 age bigint NOT NULL,
 class text NOT NULL,
 reg_timestamp timestamptz NOT NULL,
PRIMARY KEY (id,age)
);
CALL set_table_property('tbl_2', 'orientation', 'column');
CALL set_table_property('tbl_2', 'distribution_key', 'id');
CALL set_table_property('tbl_2', 'clustering_key', 'age');
CALL set_table_property('tbl_2', 'event_time_column', 'reg_timestamp');
CALL set_table_property('tbl_2', 'bitmap_columns', 'name,class');
CALL set_table_property('tbl_2', 'dictionary_encoding_columns', 'class:auto');
COMMIT;

③、创建行存表并指定主键。

BEGIN;
CREATE TABLE public.tbl_row (
    id text NOT NULL,
    name text NOT NULL,
    class text ,
PRIMARY KEY (id)
);
CALL set_table_property('public.tbl_row', 'orientation', 'row');
CALL set_table_property('public.tbl_row', 'clustering_key', 'id');
CALL set_table_property('public.tbl_row', 'distribution_key', 'id');
COMMIT;

④、创建分区表并指定主键。

BEGIN;
CREATE TABLE public.tbl_parent(
  a text , 
  b int, 
  c timestamp, 
  d text,
  ds text,
  PRIMARY KEY (ds,b)
  )
  PARTITION BY LIST(ds);
CALL set_table_property('public.tbl_parent', 'orientation', 'column');
CREATE TABLE public.tbl_child_1 PARTITION OF public.tbl_parent FOR VALUES IN('20221207');
CREATE TABLE public.tbl_child_2 PARTITION OF public.tbl_parent FOR VALUES IN('20221208');
COMMIT;

5、适用场景

①、Key/Value查询场景

--创建一张行存表test_kv_table并将key字段设置为主键
begin;
create table test_kv_table(
  key text primary key,
  value text
);
call set_table_property('test_kv_table', 'orientation', 'row');
call set_table_property('test_kv_table', 'clustering_key', 'key');
call set_table_property('test_kv_table', 'distribution_key', 'key');
commit;

--一次查询单个Key
select * from test_kv_table where key  = '1';

--一次查询多个Key
select * from test_kv_table where key  in ('1', '2', '3');

二、分布键Distribution Key

1、介绍说明

Distribution Key属性指定了表数据的分布策略，系统会保证Distribution Key相同的记录被分配到同一个Shard上。建表时设置语法如下：

-- Hologres V2.1版本起支持的语法
CREATE TABLE <table_name> (...) WITH (distribution_key = '[<columnName>[,...]]');

-- 所有版本支持的语法
BEGIN;
CREATE TABLE <table_name> (...);
call set_table_property('<table_name>', 'distribution_key', '[<columnName>[,...]]');
COMMIT;

2、使用建议

如果您的数据表经常需要进行GROUP BY、JOIN操作或为了避免数据倾斜，您可以在建表时设置分布键（Distribution Key），合适的分布键可以帮助数据均匀分布在所有计算节点上，显著提高计算和查询性能。

3、使用限制

设置Distribution Key需要在建表时设置，建表后如需修改Distribution Key需要重新建表并导入数据。
不支持修改Distribution Key对应列的值，如需修改请重新建表。
不支持将Float、Double、Numeric、Array、Json及其他复杂数据类型的字段设为Distribution Key。
表未设置PK时，Distribution Key没有限制，可以为空（不指定任何列）。如果为空，即随机Shuffle，数据随机分布到不同Shard上。从Hologres V1.3.28版本开始，Distribution Key禁止为空，示例用法如下。
```
--从1.3.28版本开始，写法将会被禁止 CALL SET_TABLE_PROPERTY('<tablename>', 'distribution_key', ''); 
```
Distribution Key列的值中有null时，当作“”（空串）看待，即Distribution Key为空。

4、使用示例

--设置a列为distribution key，系统会对a列的值做hash操作，再取模，即hash(a)%shard_count = shard_id，结果相同的一组数据分布在同一个Shard内
begin;
create table tbl (
a int not null,
b text not null
);
call set_table_property('tbl', 'distribution_key', 'a');
commit;

--设置a、b两列为distribution key，系统对a,b两个列的值做hash操作，再取模，即hash(a,b)%shard_count = shard_id，结果相同的一组数据分布在同一个Shard内
begin;
create table tbl (
  a int not null,
  b text not null
);
call set_table_property('tbl', 'distribution_key', 'a,b');
commit;

5、适用场景

①、Group By聚合场景

begin;
create table agg_tbl (
a int not null,
b int not null
);
call set_table_property('agg_tbl', 'distribution_key', 'a');
commit;

--示例查询，对a列做聚合查询
select a,sum(b) from agg_tbl group by a;

②、两表关联场景

--tbl1按照a列分布，tbl2按照c列分布，当tbl1与tbl2以a=c关联条件join时，对应的数据分布在同一个Shard内，这种查询可以实现Local Join的加速效果
begin;
create table tbl1(
a int not null,
b text not null
);
call set_table_property('tbl1', 'distribution_key', 'a');

create table tbl2(
c int not null,
d text not null
);
call set_table_property('tbl2', 'distribution_key', 'c');
commit;


select * from tbl1  join tbl2 on tbl1.a=tbl2.c;

③、多表关联场景

每个表的Join字段都相同，那么将Join字段都设置为Distribution Key。
每个表的Join字段不同，优先考虑大表间的Join，将大表的Join字段设置为Distribution Key。

begin;
create table join_tbl1(
a int not null,
b text not null
);
call set_table_property('join_tbl1', 'distribution_key', 'a');

create table join_tbl2(
a int not null,
d text not null,
e text not null
);
call set_table_property('join_tbl2', 'distribution_key', 'a');

create table join_tbl3(
a int not null,
e text not null,
f text not null,
g text not null
);
call set_table_property('join_tbl3', 'distribution_key', 'a');
commit;

--3表join查询
SELECT * FROM join_tbl1
INNER JOIN join_tbl2 ON join_tbl2.a = join_tbl1.a
INNER JOIN join_tbl3 ON join_tbl2.a = join_tbl3.a;

三、Event Time Column（Segment Key）

1、主键介绍

如果您需对大规模数据集使用基于主键的UPDATE或含范围过滤条件的查询，可以考虑为表设置Event Time Column（Segment Key），系统将数据文件基于Event Time Column范围排序后进行合并，减少文件之间的重叠，使得查询时能够过滤掉尽可能多的文件，从而提升查询效率。

-- Hologres V2.1版本起支持的语法
CREATE TABLE <table_name> (...) WITH (event_time_column = '[<columnName>[,...]]');

-- 所有版本支持的语法
BEGIN;
CREATE TABLE <table_name> (...);
call set_table_property('<table_name>', 'event_time_column', '[<columnName> [,...]]');
COMMIT;

2、使用建议

Event_time_column适用于数据为单调递增或单调递减的有序字段，例如时间戳字段。非常适用于日志、流量等和时间强相关的数据，合理设置可极大提升性能。完全无序的Event_time_column会导致合并之后每个文件缺乏区分度，达不到任何文件过滤效果。

如果表不存在明显的单调递增或单调递减的字段，可以选择额外扩充一列update_time，每次UPSERT时将当前时间写入该新增字段。

Event_time_column具备左匹配原则，因此不建议将多个字段设置为Event_time_column，否则使得查询场景受限，达不到加速效果，一般情况建议选择设置两个或者两个以内字段设置为Event_time_column。

3、使用限制

Event_time_column必须为not nullable的列或者列组合，可以不设置，但不能设置为空。如果业务有强需求设置Event_time_column为null，可以在SQL前添加如下参数：
set hg_experimental_enable_nullable_segment_key = true;


不支持修改Event_time_column，如需修改请重新建表。

行存表不能设置Event_time_column。

列存表默认将表中的第一个非空的Timestamp或Timestamptz类型字段作为Event_time_column，如果不存在这样的字段，则默认将第一个非空的Date类型字段作为Event_time_column（Hologres V0.9之前的版本默认为空）。

不支持Decimal、Numeric、Float、Double、Array、Json、Jsonb、Bit、Money及其他复杂数据类型。

4、使用示例

①、建表时创建一个Event_time_column。

BEGIN;
CREATE TABLE tbl_segment_test (
    a int NOT NULL,
    b timestamptz NOT NULL
    );
CALL set_table_property('tbl_segment_test', 'event_time_column', 'b');
COMMIT;

INSERT INTO tbl_segment_test VALUES 
(1,'2022-09-05 10:23:54+08'),
(2,'2022-09-05 10:24:54+08'),
(3,'2022-09-05 10:25:54+08'),
(4,'2022-09-05 10:26:54+08');

EXPLAIN SELECT * FROM tbl_segment_test WHERE b > '2022-09-05 10:24:54+08';

询执行计划（explain SQL），如果出现Segment Filter，说明有查询命中Event_time_column。

②、建表时创建多个Event_time_column。

BEGIN;
CREATE TABLE tbl_segment_test_2 (
    a int NOT NULL,
    b timestamptz NOT NULL
    );
CALL set_table_property('tbl_segment_test_2', 'event_time_column', 'a,b');
COMMIT;

INSERT INTO tbl_segment_test_2 VALUES 
(1,'2022-09-05 10:23:54+08'),
(2,'2022-09-05 10:24:54+08'),
(3,'2022-09-05 10:25:54+08'),
(4,'2022-09-05 10:26:54+08')
;

--不可命中segment key
SELECT * FROM tbl_segment_test_2 WHERE b > '2022-09-05 10:24:54+08';

--可命中segment key
SELECT * FROM tbl_segment_test_2 WHERE a = 3 and b > '2022-09-05 10:24:54+08';
SELECT * FROM tbl_segment_test_2 WHERE a > 3 and b < '2022-09-05 10:26:54+08';
SELECT * FROM tbl_segment_test_2 WHERE a > 3 and b > '2022-09-05 10:24:54+08';

5、适用场景

①、含范围过滤条件（包括等值条件）的查询场景。

②、基于主键的UPDATE。

四、聚簇索引Clustering Key

1、主键介绍

Hologres会按照聚簇索引在文件内对数据进行排序，建立聚簇索引能够加速在索引列上的范围和过滤查询。语法如下，需要建表时指定。

-- Hologres V2.1版本起支持的语法
CREATE TABLE <table_name> (...) WITH (clustering_key = '[<columnName>[,...]]');

-- 所有版本支持的语法
BEGIN;
CREATE TABLE <table_name> (...);
CALL set_table_property('<table_name>', 'clustering_key', '[<columnName>{:asc} [,...]]');
COMMIT;

2、使用建议

Clustering Key主要适用于点查以及范围查询的场景，对于过滤操作有比较好的性能提升，即对于where a = 1或者where a > 1 and a < 5的场景加速效果比较好。可以同时设置Clustering Key和Bitmap Column以达到最佳的点查性能。

Clustering Key具备左匹配原则，因此一般不建议设置Clustering Key超过两个字段，否则适用场景受限。Clustering Key是用于排序，所以Clustering Key里的列组合是有先后关系的，即排在前面列的排序优先级高于后面的列。

指定Clustering Key字段时，可在字段名后添加:asc来构建索引时的排序方式。排序方式默认为asc，即升序。Hologres V2.1以前版本不支持设置构建索引时的排序方式为降序（desc），如果设置了降序，无法命中Clustering Key，导致查询性能不佳；从V2.1版本开始，开启如下GUC后支持设置Clustering Key为desc，但仅支持Text、Char、Varchar、Bytea、Int等类型的字段，其余数据类型的字段暂不支持设置Clustering Key为desc。

set hg_experimental_optimizer_enable_variable_length_desc_ck_filter = on;
对于行存表，Clustering Key默认为主键（Hologres V0.9之前版本默认不设置）。如果设置和主键不同的Clustering Key，那么Hologres会为这张表生成两个排序（Primary Key排序和Clustering Key排序），造成数据冗余。

3、使用限制

1.如需修改Clustering Key，请重新建表并导入数据。

2.Clustering Key必须为not nullable的列或者列组合。如果业务有强需求设置Clustering Key为null，可以在SQL前添加如下参数。
set hg_experimental_enable_nullable_clustering_key = true;

3.不支持将Float、Float4、Float8、Double、Decimal(Numeric)、Json、Jsonb、Bit、Varbit、Money、Time With Time Zone及其他复杂数据类型的字段设置为Clustering Key。

4.Hologres 从V2.1版本开始，开启如下GUC后才 支持设置Clustering Key为desc，但仅支持Text、Char、Varchar、Bytea、Int等类型的字段，其余数据类型的字段暂不支持设置Clustering Key为desc。
set hg_experimental_optimizer_enable_variable_length_desc_ck_filter = on;

5.列存表的Clustering Key默认为空，需要根据业务场景显式指定。

6.每个表只能设置一组Clustering Key。即建表的时候只能使用call命令一次，不能执行多次。

4、使用示例

①、命中Clustering Key的场景

CREATE TABLE table1 (
    col1 int NOT NULL,
    col2 text NOT NULL,
    col3 text NOT NULL,
    col4 text NOT NULL
)
WITH (
    clustering_key = 'col1,col2'
);

--如上的建表sql，query可以被加速的情况如下：
-- 可加速
select * from table1 where col1='abc';

-- 可加速
select * from table1 where col1>'xxx' and col1<'abc';

-- 可加速
select * from table1 where col1 in ('abc','def');

-- 可加速
select * from table1 where col1='abc' and col2='def'; 

-- 不可加速
select col1,col4 from table1 where col2='def';

②、Clustering Key设置为asc/desc

CREATE TABLE tbl (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    clustering_key = 'a:desc,b:asc'
);

5、适用场景

①、范围过滤的场景。

比如where date= 1/1或者where a > 1/1 and a < 1/5的场景加速效果比较好。

五、位图索引Bitmap

1、主键介绍

bitmap_columns属性指定位图索引，是数据存储之外的独立索引结构，以位图向量结构加速等值比较场景，能够对文件块内的数据进行快速的等值过滤，适用于等值过滤查询的场景。

2、使用建议

1.适合将等值查询的列设置为Bitmap，能够快速定位到符合条件的数据所在的行号。但需要注意的是Bitmap对于基数比较高（重复数据较少）的列会有比较大的额外存储开销。

2.不建议为每一列都设置Bitmap，不仅会有额外存储开销，也会影响写入性能（因为要为每一列构造Bitmap）。

3.不建议为实际内容为JSON，但保存为text类型的列设置Bitmap。

3、使用限制

只有列存表和行列共存表支持设置Bitmap，行存表不支持设置。

Bitmap指定的列可以为空。

当前版本默认所有TEXT类型的列都会被隐式地设置为Bitmap。

设置位图索引命令可以在事务之外单独使用，表示修改位图索引列，修改之后非立即生效，比特编码构建和删除在后台异步执行，详情请参见ALTER TABLE。

bitmap_columns属性仅支持设为on或off，Hologres V2.0版本起，不支持将bitmap_columns属性设为auto。

4、使用示例

①、命中Bitmap索引

begin;
create table bitmap_test (
  uid int not null,
  name text not null,
  gender text not null,
  class text not null,
  PRIMARY KEY (uid)
);
call set_table_property('bitmap_test', 'bitmap_columns', 'gender,class');
commit;

INSERT INTO bitmap_test VALUES
(1,'张三','男','一班'),
(2,'李四','男','三班'),
(3,'王五','女','二班'),
(4,'赵六','女','二班'),
(5,'孙七','男','二班'),
(6,'周八','男','三班'),
(7,'吴九','女','一班');

explain SELECT * FROM bitmap_test where gender='男' AND  class='一班';
--当执行计划结果中有Bitmap Filter算子，说明命中Bitmap索引。

②、Clustering Key的优先级会比Bitmap更高

即如果为同一个字段设置了Clustering Key和Bitmap，那么优化器会优先使用Clustering Key去匹配文件，示例如下：

--设置uid,class,date 3列为clustering key，text列设置默认为bitmap
begin;
create table ck_bit_test (
  uid int not null,
  name text not null,
  class text not null,
  date text not null,
  PRIMARY KEY (uid)
);
call set_table_property('ck_bit_test', 'clustering_key', 'uid,class,date');
call set_table_property('ck_bit_test', 'bitmap_columns', 'name,class,date');
commit;

INSERT INTO ck_bit_test VALUES
(1,'张三','1','2022-10-19'),
(2,'李四','3','2022-10-19'),
(3,'王五','2','2022-10-20'),
(4,'赵六','2','2022-10-20'),
(5,'孙七','2','2022-10-18'),
(6,'周八','3','2022-10-17'),
(7,'吴九','3','2022-10-20');

查询uid,class,date 三列，SQL符合左匹配特征，都命中Clustering Key，即使是等值查询也走Clustering Key，而不是走Bitmap。

SELECT * FROM clustering_test WHERE uid = '3' AND class ='2' AND date > '2022-10-17';

查询uid,class,date 三列，但class是范围查询，根据左匹配原则，SQL里匹配到>或者<则停止左匹配，那么date因不满足左匹配原则，就不会命中Clustering Key。date设置了Bitmap，则会使用Bitmap。

SELECT * FROM clustering_test WHERE uid = '3' AND class >'2' AND date = '2022-10-17';

5、适用场景

适用于等值过滤查询的场景。

Hologres的索引类别和使用说明

文章目录

一、 主键Primary Key

1、主键介绍

2、使用建议

3、使用限制

4、使用示例

①、新建普通列存表单个主键。

②、新建普通列存表两个主键。

③、创建行存表并指定主键。

④、创建分区表并指定主键。

5、适用场景

①、Key/Value查询场景

二、分布键Distribution Key

1、介绍说明

2、使用建议

3、使用限制

4、使用示例

5、适用场景

①、Group By聚合场景

②、两表关联场景

③、多表关联场景

三、Event Time Column（Segment Key）

1、主键介绍

2、使用建议

3、使用限制

4、使用示例

①、建表时创建一个Event_time_column。

②、建表时创建多个Event_time_column。

5、适用场景

①、含范围过滤条件（包括等值条件）的查询场景。

②、基于主键的UPDATE。

四、聚簇索引Clustering Key

1、主键介绍

2、使用建议

3、使用限制

4、使用示例

①、命中Clustering Key的场景

②、Clustering Key设置为asc/desc

5、适用场景

①、范围过滤的场景。

五、 位图索引Bitmap

1、主键介绍

2、使用建议

3、使用限制

4、使用示例

①、命中Bitmap索引

②、Clustering Key的优先级会比Bitmap更高

5、适用场景

一、主键Primary Key

五、位图索引Bitmap