一、案例项目实施方案
拿到源数据access.log之后,准备工作如下:
1.数据进行预处理,加载hive表之前
>>MR程序处理
>>正则表达式(企业推荐)
>>python脚本
2.表拆分,源数据不变,创建对应业务需求的字表
3.基于子表的基础之上:
3-1.数据文件存储格式:orc/parquet
3-2.数据文件压缩:snappy
3-3.map output:中间结果数据压缩snappy
3-4.外部表
3-5.分区表
3-6.UDF数据处理
二、 RegexSerDe处理数据
查看数据:
创建hive表并通过正则表达式处理数据:
CREATE TABLE apache_log (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (-|[^ ]*) (-|[^ ]*) (-|\\[[^\\]]*\\]) (\"[^\"]*\") ([0-9]*) ([0-9]*) (\"[^\"]*\") (\"[^\"]*\")?")
STORED AS TEXTFILE;
加载数据:
load data local inpath '/opt/datas/access.log' into table apache_log;
查看数据:
hive (db_hive)> select * from apache_log limit 5;
OK
apache_log.host apache_log.identity apache_log.user apache_log.time apache_log.request apache_log.status apache_log.size apache_log.referer apache_log.agent
194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 3040"-" "Mozilla/4.0 (compatible;)"
183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2013:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
Time taken: 0.309 seconds, Fetched: 5 row(s)
三、数据拆分
根据业务需求,我们不需要完全使用大表的数据,我们可以将大表中的部分数据提取出来,组成一个字表。
CREATE TABLE apache_log_comm (
host STRING,
time STRING,
status STRING,
referer STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
加载子表数据
insert into table apache_log_comm select host,time,status,referer from apache_log;
查看子表数据
hive (db_hive)> select * from apache_log_comm limit 10;
OK
apache_log_comm.host apache_log_comm.time apache_log_comm.status apache_log_comm.referer
194.237.142.21 [18/Sep/2013:06:49:18 +0000] 304 "-"
183.49.46.228 [18/Sep/2013:06:49:23 +0000] 400 "-"
163.177.71.12 [18/Sep/2013:06:49:33 +0000] 200 "-"
163.177.71.12 [18/Sep/2013:06:49:36 +0000] 200 "-"
101.226.68.137 [18/Sep/2013:06:49:42 +0000] 200 "-"
101.226.68.137 [18/Sep/2013:06:49:45 +0000] 200 "-"
60.208.6.156 [18/Sep/2013:06:49:48 +0000] 200 "http://cos.name/category/software/packages/"
222.68.172.190 [18/Sep/2013:06:49:57 +0000] 200