UPDATE operations across Hive versions (MERGE INTO: update when matched, insert when not)

宇文智

Last modified 2023-11-03 16:17:49

License: CC 4.0 BY-SA

Category: Big Data Technology. Tags: hive, big data, sql

First published 2022-03-23 09:49:14

Article link: https://blog.csdn.net/m0_38109926/article/details/123678211

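This article describes the conditions for using the MERGE INTO statement in Hive 2.2 and later, including parameter configuration and table creation requirements, and contrasts it with the INSERT OVERWRITE approach used on Hive 1.1.0. It focuses on running batch updates against ORC tables and on the efficiency of those updates.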

I. Prerequisites

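Hive 2.2.0 and later support the MERGE INTO syntax, which uses the rows of a source table to batch-update the rows of a target table. Using this feature also requires the following configuration.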

1. Parameter configuration

set hive.support.concurrency = true;
set hive.enforce.bucketing = true;  -- not required as of Hive 2.0
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;
set hive.auto.convert.join=false;
set hive.merge.cardinality.check=false; -- needed only when a target row is matched by more than one source row; disables the check that would otherwise fail the MERGE
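
As a quick sanity check after applying the settings above, the session values can be printed back and the transaction subsystem queried. This is a minimal sketch, assuming a Hive CLI or Beeline session; the output depends on the cluster and is not shown here.

```sql
-- "SET <name>;" with no value prints the current value of that property.
SET hive.support.concurrency;
SET hive.txn.manager;
SET hive.merge.cardinality.check;

-- These statements only work when DbTxnManager is active,
-- so an error here suggests the transaction settings did not take effect.
SHOW TRANSACTIONS;
SHOW COMPACTIONS;
```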

2. Table requirements

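Hive places specific requirements on the definition of any table that is to be updated:

(1) The table must be created with buckets (a CLUSTERED BY ... INTO n BUCKETS clause).

(2) The table must use a supported storage format. Other formats, such as Parquet, are not supported for now; currently only the ORC file format and AcidOutputFormat are supported.

(3) The table must be created with the table property 'transactional'='true'.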

DROP TABLE IF EXISTS dim_date_10000;
create table dim_date_10000(
 date_key        string   comment 'e.g. 2018-08-08'
,day             int      comment 'day of month (1~31)'
,month           int      comment 'month, e.g. 8'
,month_name      string   comment 'month name, e.g. 8月'
,year            int      comment 'year, e.g. 2018'
,year_month      int      comment 'year and month, e.g. 201808'
,week_of_year    string   comment 'week of the year, e.g. 2018-1'
,week            int      comment 'day of week (1~7)'
,week_name       string   comment 'weekday name, e.g. 星期三 (Wednesday)'
,quarter         int      comment 'quarter (1~4)'
)
CLUSTERED BY (date_key) INTO 10 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- kept from the original; ORC is a binary format and does not use the field delimiter
STORED AS orc
TBLPROPERTIES('transactional'='true');
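
To confirm that a table created this way really accepts row-level changes, a small smoke test can help. The sketch below assumes the dim_date_10000 table above and the session settings from section 1; the sample row is made up purely for illustration.

```sql
-- Should print "true" for a table that was created as transactional.
SHOW TBLPROPERTIES dim_date_10000("transactional");

-- Illustrative sample row (hypothetical data, not from the original article).
INSERT INTO dim_date_10000
VALUES ('2018-08-08', 8, 8, '8月', 2018, 201808, '2018-32', 3, '星期三', 3);

-- Row-level UPDATE and DELETE only succeed on a bucketed, ORC, transactional table.
-- Note that the bucketing column (date_key) itself cannot be updated.
UPDATE dim_date_10000 SET quarter = 4 WHERE date_key = '2018-08-08';
DELETE FROM dim_date_10000 WHERE date_key = '2018-08-08';
```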

II. Batch update syntax comparison

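This section compares INSERT OVERWRITE on Hive 1.1.0 with MERGE INTO on Hive 2.3.5, looking at the syntax and efficiency of each when updating data sets of different sizes. In the table names below, the suffix w presumably stands for 万 (10,000), so dim_date_100w holds about one million rows and dim_date_1w about ten thousand.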

How an update was previously implemented on a Hive table (Hive 1.1.0)

insert overwrite table dim_date_100w
-- existing rows that have changed
select t2.date_key,t2.day,t2.month,t2.month_name,t2.year,t2.year_month,t2.week_of_year,t2.week,t2.week_name,1001 as quarter
from dim_date_100w t1
join dim_date_1w t2 on t1.date_key=t2.date_key
-- existing rows that are unchanged
union all
select t1.*
from dim_date_100w t1
left join dim_date_1w t2 on t1.date_key=t2.date_key
where t2.date_key is null
-- newly added rows
union all
select t1.*
from dim_date_1w t1
left join dim_date_100w t2 on t1.date_key=t2.date_key
where t2.date_key is null
;
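
The join-based rewrite above quietly assumes that date_key is unique in both tables; a duplicated key would multiply rows in the first branch. A pre-flight check along the following lines may be worth running first. It is only a sketch against the article's table names, and the LIMIT is an arbitrary choice.

```sql
-- Any row returned here is a date_key that appears more than once in the source.
SELECT date_key, COUNT(*) AS cnt
FROM dim_date_1w
GROUP BY date_key
HAVING COUNT(*) > 1
LIMIT 20;

-- The same check for the target table.
SELECT date_key, COUNT(*) AS cnt
FROM dim_date_100w
GROUP BY date_key
HAVING COUNT(*) > 1
LIMIT 20;
```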

Hive 2.3.5 (MERGE INTO)

MERGE INTO dim_date_100w AS T USING dim_date_1w AS S
ON T.date_key = S.date_key
WHEN MATCHED THEN
UPDATE SET quarter = 1001   -- matched rows: the changed data
WHEN NOT MATCHED THEN
INSERT       -- rows in S with no match in T: the new data
VALUES (S.date_key, S.day, S.month, S.month_name, S.year, S.year_month, S.week_of_year, S.week, S.week_name, S.quarter);
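
Rather than turning off hive.merge.cardinality.check, one alternative is to deduplicate the source inside the USING clause so that each target row is matched at most once. The following is a sketch of that idea, not the article's method. The ORDER BY column is an arbitrary stand-in, since dim_date_1w has no load timestamp or version column; a real pipeline would order by one.

```sql
MERGE INTO dim_date_100w AS T
USING (
  -- Keep exactly one row per date_key from the source.
  SELECT date_key, day, month, month_name, year, year_month,
         week_of_year, week, week_name, quarter
  FROM (
    SELECT d.*,
           ROW_NUMBER() OVER (PARTITION BY date_key
                              ORDER BY year_month DESC) AS rn  -- arbitrary tie-break (assumption)
    FROM dim_date_1w d
  ) ranked
  WHERE rn = 1
) AS S
ON T.date_key = S.date_key
WHEN MATCHED THEN
  UPDATE SET quarter = S.quarter   -- take the new value from the source row
WHEN NOT MATCHED THEN
  INSERT VALUES (S.date_key, S.day, S.month, S.month_name, S.year,
                 S.year_month, S.week_of_year, S.week, S.week_name, S.quarter);
```

With the source deduplicated this way, the cardinality check can stay at its default and still acts as a guard against unexpected duplicates.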

General batch update (MERGE) syntax

 MERGE INTO <target table> AS T USING <source expression/table> AS S
 ON <boolean expression1>
 WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
 WHEN MATCHED [AND <boolean expression3>] THEN DELETE
 WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES <value list>
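
As an illustration of the general form, the sketch below uses all three WHEN clauses on the article's tables. The conditions (the year 2000 cutoff) are invented for the example. Hive allows at most one clause of each type, WHEN NOT MATCHED must come last, and when both UPDATE and DELETE are present the first of them must carry an AND condition.

```sql
MERGE INTO dim_date_100w AS T USING dim_date_1w AS S
ON T.date_key = S.date_key
-- Matched rows from 2000 onwards: refresh selected columns from the source.
WHEN MATCHED AND S.year >= 2000 THEN
  UPDATE SET quarter = S.quarter, week_name = S.week_name
-- Matched rows before 2000: remove them from the target (hypothetical rule).
WHEN MATCHED AND S.year < 2000 THEN
  DELETE
-- Source rows with no counterpart in the target: insert them.
WHEN NOT MATCHED THEN
  INSERT VALUES (S.date_key, S.day, S.month, S.month_name, S.year,
                 S.year_month, S.week_of_year, S.week, S.week_name, S.quarter);
```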