Getting Started with Hive

lijian972

Last modified: 2022-04-18 18:49:06

License: CC 4.0 BY-SA

First published: 2022-03-22 10:35:48

Article link: https://blog.csdn.net/lijian972/article/details/123653188

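1. Basic Hive syntax

1.1 Database operations

1. Create a database
create database if not exists myhive;
2. Switch to a database
use myhive;
3. Show database information
desc database myhive;
4. Drop a database (fails if it still contains tables)
drop database myhive;
5. Drop a database together with its tables (forced)
drop database myhive cascade;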
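The difference between the two drop statements can be seen in a short session. This is a minimal sketch; demo_db and t1 are hypothetical names introduced only for illustration.

```sql
-- demo_db and t1 are hypothetical names used only for this sketch.
create database if not exists demo_db comment 'scratch database for the example';
use demo_db;
create table if not exists t1 (id int);

-- Switch away before dropping, so the session is not left inside a dropped database.
use default;

-- A plain drop is expected to fail here, because demo_db still contains t1:
-- drop database demo_db;

-- CASCADE drops the contained tables first and then the database itself.
drop database if exists demo_db cascade;
```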

1.2 Table operations

1. Table creation syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
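Notes:

1. CREATE TABLE creates a table with the given name. If a table with that name already exists, an exception is thrown; the IF NOT EXISTS option suppresses it.
2. The EXTERNAL keyword creates an external table, and LOCATION can point it at the path where the data actually lives. For a managed (internal) table, Hive moves loaded data into the warehouse directory; for an external table, Hive only records the data's path and leaves the data where it is. Dropping a managed table deletes both its metadata and its data, whereas dropping an external table deletes only the metadata.
3. LIKE copies the structure of an existing table without copying its data.
4. ROW FORMAT DELIMITED specifies the delimiters used in the data files, such as the field delimiter (FIELDS TERMINATED BY) and the line delimiter (LINES TERMINATED BY).
5. STORED AS SEQUENCEFILE|TEXTFILE|RCFILE sets the storage format of the table data. The default storage format in Hive is TextFile.
6. CLUSTERED BY divides a table into buckets (the counterpart of partitioning in MapReduce). Buckets are a finer-grained division of the data. Hive buckets on a chosen column: it hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record goes into.
7. LOCATION specifies where the table is stored on HDFS.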
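To make the grammar above concrete, here is a minimal sketch of one statement that combines several of the optional clauses. The table page_view, its columns, and the bucket count are hypothetical and chosen only for illustration.

```sql
-- page_view and its columns are hypothetical; adjust names and types to your data.
create table if not exists page_view (
  view_time int comment 'time of the view, epoch seconds',
  user_id bigint comment 'id of the visiting user',
  page_url string
)
comment 'example table combining the optional clauses'
partitioned by (dt string comment 'date partition, for example 2022-03-22')
clustered by (user_id) sorted by (view_time asc) into 4 buckets
row format delimited fields terminated by '\t'
stored as textfile;

-- LIKE copies only the structure of an existing table, not its rows.
create table if not exists page_view_copy like page_view;
```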

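2. Common data types
int (integer), string (character string), date (calendar date), timestamp (date and time)
array (ordered collection of same-typed elements), map (key-value pairs), struct (record with named fields)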
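The complex types are easier to picture with a table that uses them. The following is a sketch under assumptions: emp_demo, its columns, and the chosen delimiters are hypothetical.

```sql
-- emp_demo is a hypothetical table used only to show the complex types.
create table if not exists emp_demo (
  name string,
  skills array<string>,
  scores map<string,int>,
  address struct<city:string,zip:string>
)
row format delimited
  fields terminated by '\t'
  collection items terminated by ','
  map keys terminated by ':';

-- Array elements are read by index, map values by key, struct fields by name.
select name, skills[0], scores['math'], address.city
from emp_demo;
```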

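3. Managed (internal) tables
A table created without the external keyword is a managed table, also called an internal table. Its data location is determined by the hive.metastore.warehouse.dir parameter (default: /user/hive/warehouse). Dropping a managed table deletes both the metadata and the stored data, so managed tables are not suitable for sharing data with other tools.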

4. Ways to create a table
4.1 Create a table and specify the field delimiter
create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t';

4.2 Create a table from a query result
create table stu3 as select * from stu2;

4.3 Create a table from an existing table's structure
create table stu4 like stu2;

5. Check a table's type
desc formatted stu2;

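6. Dropping and truncating tables
Drop a table:
drop table stu2;
Truncate a table (removes all rows but keeps the table definition; works on managed tables only):
truncate table stu2;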

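7. Creating external tables
An external table must be created with the basic syntax, that is, create external table table_name followed by the column definitions. It cannot be created with create external table table_name as select * from table_name.
Dropping an external table deletes its metadata but keeps the files on HDFS; dropping a managed table deletes both the metadata and the files on HDFS.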
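A short sketch of the external-table lifecycle described above. The table teacher_ext and the LOCATION path are hypothetical.

```sql
-- teacher_ext and the LOCATION path are hypothetical.
create external table if not exists teacher_ext (
  tid string,
  tname string
)
row format delimited fields terminated by '\t'
location '/hypothetical/data/teacher';

-- The "Table Type" field in the output should read EXTERNAL_TABLE.
desc formatted teacher_ext;

-- Removes only the metadata; files under the LOCATION path stay on HDFS.
drop table teacher_ext;
```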

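8. Loading data into a table
load data [local] inpath '/export/data/datas/student.txt' [overwrite] into table student [partition (partcol1=val1, ...)];
Parameters:
1. load data: loads data into a table
2. local: load from the local file system; without it, load from HDFS
3. inpath: the path of the data to load
4. overwrite: replace the data already in the table; without it, the new data is appended
5. into table: the table to load into
6. student: the name of the target table
7. partition: load into the specified partition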
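The two most common variants of the statement look like this. The sketch assumes a table named student already exists with a matching tab-delimited layout; the HDFS path is hypothetical.

```sql
-- Local file: the file is copied into the table's directory
-- and appended to the existing data.
load data local inpath '/export/data/datas/student.txt' into table student;

-- HDFS file (hypothetical path): the file is moved, not copied, into the
-- table's directory, and overwrite replaces whatever the table held before.
load data inpath '/hypothetical/hdfs/student.txt' overwrite into table student;
```

Because a load from HDFS moves the source file, the file no longer exists at its original path afterwards.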

9. Partitions
A partition is, in essence, a subdirectory.
9.1 Static partitions
Note: a partition column must not also be a regular column of the table.
Example:
Create the table:
Single-level partitioning:
create table stu(id int,name string) partitioned by(gender string) row format delimited fields terminated by '\t' location '/a/b/c';
Multi-level partitioning:
create table stu(id int,name string) partitioned by(year string,month string,day string) row format delimited fields terminated by '\t' location '/a/b/c';
Loading data into a single-level partition:
load data [local] inpath '/a' [overwrite] into table stu partition (gender='female');
insert {overwrite | into} table stu partition (gender='female') select id,name from stu;
Loading data into a multi-level partition:
load data [local] inpath '/a' [overwrite] into table stu partition (year='2022',month='03',day='22');
insert {overwrite | into} table stu partition (year='2022',month='03',day='22') select id,name from stu;
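Since each partition is a directory, filtering on partition columns limits how much data Hive has to read. The query below is a sketch that assumes the multi-level stu table above and the partition loaded in the example.

```sql
-- Assumes the multi-level stu table above, created with location '/a/b/c'.
-- After loading partition (year='2022', month='03', day='22'), its files are
-- expected under a directory such as /a/b/c/year=2022/month=03/day=22.

-- Partition columns can be selected and filtered like ordinary columns.
-- Filtering on them lets Hive read only the matching directories.
select id, name, year, month, day
from stu
where year = '2022' and month = '03' and day = '22';
```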
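9.2 Dynamic partitions
Notes:
1. Enable dynamic partitioning and nonstrict mode
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
2. The trailing columns of the select list must correspond to the partition columns, in the same order.
Loading data:
insert {overwrite | into} table stu partition (gender) select id,name,gender from stu;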
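An end-to-end sketch of a dynamic-partition load may help here. The tables stu_src and stu_by_gender are hypothetical; the source table keeps gender as an ordinary column, and the target table uses it as the partition column.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical source table: gender is a regular column here.
create table if not exists stu_src (id int, name string, gender string)
row format delimited fields terminated by '\t';

-- Hypothetical target table: gender is the partition column.
create table if not exists stu_by_gender (id int, name string)
partitioned by (gender string)
row format delimited fields terminated by '\t';

-- The last select column (gender) supplies the partition value for each row,
-- so one partition is created per distinct gender value.
insert overwrite table stu_by_gender partition (gender)
select id, name, gender from stu_src;

show partitions stu_by_gender;
```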
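9.3 Mixing static and dynamic partitions
Notes:
1. Dynamic partitioning must be enabled here as well. Strict mode only requires at least one static partition column, so nonstrict mode is sufficient but not required in this case.
2. The trailing columns of the select list must match the dynamic partition columns, in the same order.
Example:
Create the table:
create table stu(id int,name string) partitioned by(gender string,year string,month string,day string) row format delimited fields terminated by '\t';
Loading data:
insert overwrite table stu partition(gender='female',year,month,day) select id,name,'2022' as year,'03' as month,'22' as day from stu;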
9.4 Other partition operations
Show partitions: show partitions table_name;
Add a partition: alter table table_name add partition(month='202008');
Drop a partition: alter table table_name drop partition(month = '202010');
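These maintenance statements can be tried on a small table. The sketch below uses a hypothetical table score_part partitioned by month.

```sql
-- score_part is a hypothetical table partitioned by (month string).
create table if not exists score_part (sid string, score int)
partitioned by (month string)
row format delimited fields terminated by '\t';

-- IF NOT EXISTS and IF EXISTS keep the statements from failing when rerun.
-- Several partitions can be added in one statement.
alter table score_part add if not exists
  partition (month='202008')
  partition (month='202009');

show partitions score_part;

alter table score_part drop if exists partition (month='202008');
```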

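10. Bucketed tables
Bucketing splits a table's data into separate files; it is the same idea as partitioning in MapReduce.
The bucketing column must be a column of the table.
Rows are assigned to buckets according to the value of the specified column, so the data ends up spread across several files.
Enable bucketing (needed on Hive 1.x; from Hive 2.0 onward bucketing is always enforced):
set hive.enforce.bucketing=true;
Set the number of reducers (this setting has no effect from Hive 2.x onward):
set mapreduce.job.reduces=3;
Create a bucketed table:
create table course (cid string,c_name string,tid string) clustered by(cid) into 3 buckets row format delimited fields terminated by '\t';
Loading data into a bucketed table: uploading files with hdfs dfs -put or using load data does not bucket the data correctly, so the data must be written with insert overwrite.
Create an ordinary table, then use insert overwrite with a query to load its data into the bucketed table.
Load the data:
insert overwrite table course select * from course_common cluster by(cid);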
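The staging step can be sketched as follows. course_common is the ordinary table referred to above; its definition and the local file path are assumptions made for this example, and the final query uses bucket sampling as one way to inspect the result.

```sql
-- Ordinary staging table with the same columns as course (definition assumed).
create table if not exists course_common (cid string, c_name string, tid string)
row format delimited fields terminated by '\t';

-- Hypothetical local path to the raw file.
load data local inpath '/hypothetical/path/course.txt' into table course_common;

-- Writing through a query lets Hive hash cid and route each row
-- to one of the 3 bucket files.
insert overwrite table course
select * from course_common cluster by (cid);

-- Sample a single bucket to check the result.
select * from course tablesample (bucket 1 out of 3 on cid);
```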

11. Exporting data
Data in a Hive table can be exported to other destinations, such as the local Linux disk, HDFS, or MySQL.
1) Export a query result to the local file system with formatting
insert overwrite local directory '/export/data/exporthive' row format delimited fields terminated by '\t' select * from student;
2) Export a query result to HDFS (no local keyword)
insert overwrite directory '/exporthive' row format delimited fields terminated by '\t' select * from score;
3) Export from the shell with the hive command
bin/hive -e "select * from myhive.score;" > /aa/bb/a.txt
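One point worth keeping in mind with the directory exports: insert overwrite replaces the existing contents of the target directory. A minimal sketch that writes comma-separated output to a dedicated directory, with a hypothetical path:

```sql
-- Caution: insert overwrite directory replaces the existing contents of the
-- target directory, so point it at a dedicated path. The path is hypothetical.
insert overwrite local directory '/hypothetical/export/score_csv'
row format delimited fields terminated by ','
select * from score;
```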