实战整理-阿里天池淘宝用户购物行为数据集实战(MySQL数据分析+Navicat)——Maydαy
数据集不一样但相似;用的Pytho分析,但有的内容也可以参考:
【数据分析与挖掘】淘宝用户行为分析(带数据集和代码)——CHRN晨
只参考标题即可,很多分析思路都不符合业务逻辑。
基于MySQL+Tableau的淘宝用户行为分析——小七七
1、项目背景
通过淘宝app平台的用户购买行为数据集,通过行业指标对淘宝用户行为进行分析,从而探索淘宝用户的行为模式。
具体行业指标包括:
日PV、
日UV分析、
付费率 / 转化率分析、
复购行为分析、
用户流失分析、
RFM模型用户分群分析
因此,提出以下问题:
日PV有多少
日UV有多少
付费率 / 转化率情况如何
复购率是多少
用户流失情况如何
用户价值(UV价值)情况
RFM模型用户分群分析
2、数据集来源
3、导入数据
利用Navicat软件直接导入下载好的Excel文件
因原数据集过大(超过一亿条记录),电脑MySQL跑不动,故截取前一百万条数据进行实战演练、
4、数据预处理/数据清洗
4.1 更改字段名
①导入数据时,可以直接更改字段名称。
②也可以导入后,不写代码,通过软件功能修改。
Navicat里修改表名和字段名的方法——Cindy辛蒂
③通过SQL代码修改。
ALTER TABLE 淘宝用户购物行为数据集 CHANGE COLUMN 用户ID user_id INT;
ALTER TABLE `淘宝用户购物行为数据集` CHANGE 商品ID item_id INT; -- COLUMN 关键字可省略
ALTER TABLE `淘宝用户购物行为数据集` CHANGE 商品类目ID category_id INT;
ALTER TABLE `淘宝用户购物行为数据集` CHANGE 行为类型 behavior_type VARCHAR(10);
ALTER TABLE `淘宝用户购物行为数据集` CHANGE 时间戳 timestamps INT;
4.2 检查处理空值
SELECT * FROM 淘宝用户购物行为数据集 WHERE user_id IS NULL;
SELECT * FROM 淘宝用户购物行为数据集 WHERE item_id IS NULL;
SELECT * FROM 淘宝用户购物行为数据集 WHERE category_id IS NULL;
SELECT * FROM 淘宝用户购物行为数据集 WHERE behavior_type IS NULL;
SELECT * FROM 淘宝用户购物行为数据集 WHERE timestamps IS NULL;
SELECT *
FROM 淘宝用户购物行为数据集
WHERE user_id IS NULL
OR item_id IS NULL
OR category_id IS NULL
OR behavior_type IS NULL
OR timestamps IS NULL;
4.3 检查重复值
-- 检查重复值:一个用户,在某个时刻,对一种商品,只能进行一种行为。
SELECT user_id, item_id, timestamps, COUNT(*) AS cnt
FROM 淘宝用户购物行为数据集
GROUP BY user_id, item_id, timestamps
HAVING COUNT(*) > 1;
4.4 设置主键
item_id, category_id 也同样因为包含重复值,所以不能设置为主键。
此表中的情况,属于没有主键,要新增一个自增列,放在第一列,并将其设置为主键。
步骤:
(1)要在表的最前面,先添加一个新的整数型,无默认值(满足设置自增的条件),非空(满足设置主键的条件)字段id;
(2)再将新字段列id设置为主键且自增。
-- 方法一:一步完成
ALTER TABLE 淘宝用户购物行为数据集
ADD id INT -- 由于不是英文,不需要给id加引号
NOT NULL
PRIMARY KEY
AUTO_INCREMENT
FIRST;
-- 方法二:等价于下面两步过程
-- 先增加新的整数型,无默认值(满足设置自增的条件),非空列id(满足设置主键的条件)到最前面
alter table userbehavior add id int not null first;
-- 再将其设置为主键,且自增
alter table userbehavior modify id int primary key auto_increment;
4.4 去除重复值
如果存在重复值,即一个用户,在某个时刻,对一种商品,进行了多种行为。
在添加及设置自增字段id为主键之后,将多种行为中,最小的主键id值所在的保留下来,其它id值都删除。
DELETE FROM 淘宝用户购物行为数据集
WHERE id NOT IN (SELECT a.min_id
FROM (select user_id,item_id,timestamps,min(id) AS min_id
from `淘宝用户购物行为数据集`
group by user_id,item_id,timestamps) a);
-- 在DELETE语句中,不能直接引用被删除的表。凡是引用被删除表的语句查询结果,必须以子查询的形式出现。
-- 注意在MySQL中,只能在WHERE子句中判断一个字段(判断多个字段时无法成功删除记录)。
4.5 新增时间戳、日期、时间、小时字段(一致化处理)
-- 先增加新的可读形式的TIMESTAMP类型时间戳列,名称为datetimes
ALTER TABLE `淘宝用户购物行为数据集` ADD datetimes TIMESTAMP(0);
-- 再使用FROM_UNIXTIME函数,将整数型时间戳timestamps,转换为可读形式的 '%Y-%m-%d %H:%i:%s' TIMESTAMP类型时间戳
-- 并将其作为新TIMESTAMP类型时间戳列datetimes的值
UPDATE `淘宝用户购物行为数据集` SET datetimes = FROM_UNIXTIME(timestamps);
-- 先依次增加 日期列dates(年月日),时间列times(时分秒)、小时列hours
ALTER TABLE `淘宝用户购物行为数据集` ADD dates DATE;
ALTER TABLE `淘宝用户购物行为数据集` ADD times TIME;
ALTER TABLE `淘宝用户购物行为数据集` ADD hours CHAR(2);
-- 再从TIMESTAMP类型时间戳列datetimes中,分别截取DATE类型的日期列dates(年月日),TIME类型的时间列times(时分秒)、CHAR类型的小时列hours的值
/* 提取dates、times列时,填入的结果的数据类型,应该与插入新列时,对这三列设置的字段类型保持一致。
而DATE_FORMAT函数函数,返回的是VARCHAR类型的字符串。
同时,为了避免隐式类型转换造成的效率低下问题,应使用CAST函数对DATE_FORMAT函数的返回结果,进行数据类型的显示转换。*/
UPDATE `淘宝用户购物行为数据集` SET dates = CAST(DATE_FORMAT(datetimes, '%Y-%m-%d') AS DATE);
-- 或者使用DATE(datetimes),DATE函数返回的一定是DATE类型的日期
-- 注意时间列times(时分秒)要将分隔符-换成英文冒号:
UPDATE `淘宝用户购物行为数据集` SET times = CAST(DATE_FORMAT(datetimes, '%H:%i:%s') AS TIME);
-- 或者使用TIME(datetimes),TIME函数返回的一定是TIME类型的日期
-- DATE_FORMAT函数返回的就是字符串类型
UPDATE `淘宝用户购物行为数据集` SET hours = DATE_FORMAT(datetimes, '%H');
4.6 数据类型转换
(1)要更改表中已经存在的字段的数据类型,应使用ALTER语句。
(2)如果要对某个函数的执行结果进行类型转换:
如对上面的DATE_FORMAT函数生成的字符串VARCHAR进行类型转换,在MySQL中,则应该使用CAST、CONVERT、str_to_date函数,将字符串数据类型转换为想要的数据类型。
反之,如果要将日期和时间类型的数据,转化为字符串类型,则应该使用CAST、CONVERT、date_format函数。
4.7 异常值处理
通过计算及观察数据集的多个统计量,发现数据集中包含超出2017年11月25日至2017年12月3日范围的数据,所以把这部分数据删掉。其它均无异常。
-- 首先添加新的列,将行为类型behavior_type 转换为对应的行为代码behavior_code -- behavior_type 中的点击、收藏、加购物车、支付四种行为,分别用数字1、2、3、4表示 ALTER TABLE `淘宝用户购物行为数据集` ADD behavior_code int(1) AFTER behavior_type; UPDATE `淘宝用户购物行为数据集` SET behavior_code = (CASE WHEN behavior_type = 'pv' THEN 1 WHEN behavior_type = 'fav' THEN 2 WHEN behavior_type = 'cart' THEN 3 WHEN behavior_type = 'buy' THEN 4 END);
4.7.1 总计数值、最小值、最大值、平均值、方差、标准差
SELECT "总计数值" AS item, COUNT(user_id) AS user_id, COUNT(item_id) AS item_id, COUNT(category_id) AS category_id, CAST(COUNT(behavior_type) AS CHAR) AS behavior_type, COUNT(timestamps) AS timestamps
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "最小值" AS item, MIN(user_id) AS user_id, MIN(item_id) AS item_id, MIN(category_id) AS category_id, MIN(behavior_code) AS behavior_type, CAST(MIN(datetimes) AS CHAR) AS timestamps
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "最大值" AS item, MAX(user_id) AS user_id, MAX(item_id) AS item_id, MAX(category_id) AS category_id, MAX(behavior_code) AS behavior_type, CAST(MAX(datetimes) AS CHAR) AS timestamps
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "平均值" AS item, NULL AS user_id, ROUND(AVG(item_id),2) AS item_id, ROUND(AVG(category_id),2) AS category_id, ROUND(AVG(behavior_code),2) AS behavior_type, NULL AS timestamps -- 对用户ID和行为发生时间求平均值都没有意义
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "方差" AS item, NULL AS user_id, ROUND(VAR_POP(item_id),2) AS item_id, ROUND(VAR_POP(category_id),2) AS category_id, ROUND(VAR_POP(behavior_code),2) AS behavior_type, ROUND(VAR_POP(timestamps),2) AS timestamps
FROM `淘宝用户购物行为数据集` -- VAR_POP为求总体方差,除以N;VAR_SAMP为求样本方差,除以N-1
UNION ALL
SELECT "标准差" AS item, NULL AS user_id, ROUND(STDDEV_POP(item_id),2) AS item_id, ROUND(STDDEV_POP(category_id),2) AS category_id, ROUND(STDDEV_POP(behavior_code),2) AS behavior_type, ROUND(STDDEV_POP(timestamps),2) AS timestamps -- STDDEV_POP为求总体标准差;STDDEV_SAMP为求样本标准差
FROM `淘宝用户购物行为数据集`
4.7.2 众数、中位数、截断平均值
【SQL】数据分析常见面试题之统计指标计算——中位数/截断平均/累计求和——懂点数分的杉杉
SELECT "众数" AS item,
(SELECT user_id AS cnt FROM `淘宝用户购物行为数据集` GROUP BY user_id HAVING COUNT(*) >= ALL(SELECT COUNT(*) FROM `淘宝用户购物行为数据集` GROUP BY user_id)) AS user_id,
(SELECT item_id AS cnt FROM `淘宝用户购物行为数据集` GROUP BY item_id HAVING COUNT(*) >= ALL(SELECT COUNT(*) FROM `淘宝用户购物行为数据集` GROUP BY item_id)) AS item_id,
(SELECT ROUND(category_id,0) AS cnt FROM `淘宝用户购物行为数据集` GROUP BY category_id HAVING COUNT(*) >= ALL(SELECT COUNT(*) FROM `淘宝用户购物行为数据集` GROUP BY category_id)) AS category_id,
(SELECT behavior_type AS cnt FROM `淘宝用户购物行为数据集` GROUP BY behavior_type HAVING COUNT(*) >= ALL(SELECT COUNT(*) FROM `淘宝用户购物行为数据集` GROUP BY behavior_type)) AS behavior_type,
(SELECT from_unixtime(timestamps) AS cnt FROM `淘宝用户购物行为数据集` GROUP BY timestamps HAVING COUNT(*) >= ALL(SELECT COUNT(*) FROM `淘宝用户购物行为数据集` GROUP BY timestamps)) AS timestamps
UNION ALL
SELECT "中位数" AS item,
(SELECT ROUND(AVG(a.user_id),0)
FROM (SELECT user_id, ROW_NUMBER() OVER(ORDER BY user_id) AS rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk IN (a.n/2, a.n/2+1, (a.n+1)/2)) AS user_id,
(SELECT ROUND(AVG(b.item_id),0)
FROM (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) AS rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集`) b
WHERE b.rk IN (b.n/2, b.n/2+1, (b.n+1)/2)) AS item_id,
(SELECT ROUND(AVG(a.category_id),0)
FROM (SELECT category_id, ROW_NUMBER() OVER(ORDER BY category_id) AS rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk IN (a.n/2, a.n/2+1, (a.n+1)/2)) AS category_id,
(SELECT ROUND(AVG(a.behavior_code),2)
FROM (SELECT behavior_code, ROW_NUMBER() OVER(ORDER BY behavior_code) AS rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk IN (a.n/2, a.n/2+1, (a.n+1)/2)) AS behavior_type,
(SELECT from_unixtime(ROUND(AVG(a.timestamps),0))
FROM (SELECT timestamps, ROW_NUMBER() OVER(ORDER BY timestamps) AS rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk IN (a.n/2, a.n/2+1, (a.n+1)/2)) AS timestamps
UNION ALL
SELECT "截断平均值" AS item, NULL AS user_id,
(SELECT ROUND(AVG(a.item_id),0)
FROM (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) AS rk1, ROW_NUMBER() OVER(ORDER BY item_id DESC) AS rk2
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk1 <> 1 AND a.rk2 <> 1) AS item_id,
(SELECT ROUND(AVG(a.category_id),0)
FROM (SELECT category_id, ROW_NUMBER() OVER(ORDER BY category_id) AS rk1, ROW_NUMBER() OVER(ORDER BY category_id DESC) AS rk2
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk1 <> 1 AND a.rk2 <> 1) AS category_id,
(SELECT ROUND(AVG(a.behavior_code),0)
FROM (SELECT behavior_code, ROW_NUMBER() OVER(ORDER BY behavior_code) AS rk1, ROW_NUMBER() OVER(ORDER BY behavior_code DESC) AS rk2
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk1 <> 1 AND a.rk2 <> 1) AS behavior_code,
(SELECT from_unixtime(ROUND(AVG(a.timestamps),0))
FROM (SELECT timestamps, ROW_NUMBER() OVER(ORDER BY timestamps) AS rk1, ROW_NUMBER() OVER(ORDER BY timestamps DESC) AS rk2
FROM `淘宝用户购物行为数据集`) a
WHERE a.rk1 <> 1 AND a.rk2 <> 1) AS timestamps
4.7.3 极差、标准差系数、偏度
使用SQL 计算一组数据的峰度、偏度 公式依据excel的KURT函数和SKEW函数 Oracle 数据库——wanwgei
SELECT "极差" AS item,
NULL AS user_id,
MAX(item_id) - MIN(item_id) AS item_id,
MAX(category_id) - MIN(category_id) AS category_id,
MAX(behavior_code) - MIN(behavior_code) AS behavior_type,
DATEDIFF(MAX(dates),MIN(dates)) AS timestamps
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "标准差系数 " AS item,
NULL AS user_id,
ROUND(100*STDDEV_POP(item_id)/AVG(item_id),2) AS item_id,
ROUND(100*STDDEV_POP(category_id)/AVG(category_id),2) AS category_id,
ROUND(100*STDDEV_POP(behavior_code)/AVG(behavior_code),2) AS behavior_code,
ROUND(100*STDDEV_POP(timestamps)/AVG(timestamps),2) AS timestamps
FROM `淘宝用户购物行为数据集`
UNION ALL
SELECT "偏度" AS item,
NULL AS user_id,
(select round(sum(power((item_id -(select avg(item_id) from `淘宝用户购物行为数据集`)),3)) / power(stddev(item_id), 3) * ( count(*)/((count(*)-1) * (count(*)-2)) ),2)
from `淘宝用户购物行为数据集`) as item_id, -- 对称
(select round(sum(power((category_id -(select avg(category_id) from `淘宝用户购物行为数据集`)),3)) / power(stddev(category_id), 3) * ( count(*)/((count(*)-1) * (count(*)-2)) ),2)
from `淘宝用户购物行为数据集`) as category_id, -- 轻微左偏
(select round(sum(power((behavior_code -(select avg(behavior_code) from `淘宝用户购物行为数据集`)),3)) / power(stddev(behavior_code), 3) * ( count(*)/((count(*)-1) * (count(*)-2)) ),2)
from `淘宝用户购物行为数据集`) as behavior_code, -- 严重右偏(当数据严重偏离时,中位数是位置的首选度量。)
(select round(sum(power((timestamps -(select avg(timestamps) from `淘宝用户购物行为数据集`)),3)) / power(stddev(timestamps), 3) * ( count(*)/((count(*)-1) * (count(*)-2)) ),2)
from `淘宝用户购物行为数据集`) as timestamps -- 轻微左偏
4.7.4 第一分位数Q1、第二分位数Q2(中位数)、第三分位数Q3
SELECT "第一分位数Q1" AS item,
NULL AS user_id,
(SELECT a.item_id
FROM (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.25)) AS item_id,
(SELECT a.category_id
FROM (SELECT category_id, ROW_NUMBER() OVER(ORDER BY category_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.25)) AS category_id,
(SELECT a.behavior_code
FROM (SELECT behavior_code, ROW_NUMBER() OVER(ORDER BY behavior_code) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.25)) AS behavior_type,
(SELECT from_unixtime(a.timestamps)
FROM (SELECT timestamps, ROW_NUMBER() OVER(ORDER BY timestamps) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.25)) AS timestamps
UNION ALL
SELECT "第二分位数Q2(中位数)" AS item,
NULL AS user_id,
(SELECT a.item_id
FROM (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.5)) AS item_id,
(SELECT a.category_id
FROM (SELECT category_id, ROW_NUMBER() OVER(ORDER BY category_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.5)) AS category_id,
(SELECT a.behavior_code
FROM (SELECT behavior_code, ROW_NUMBER() OVER(ORDER BY behavior_code) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.5)) AS behavior_type,
(SELECT from_unixtime(a.timestamps)
FROM (SELECT timestamps, ROW_NUMBER() OVER(ORDER BY timestamps) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.5)) AS timestamps
UNION ALL
SELECT "第三分位数Q3" AS item,
NULL AS user_id,
(SELECT a.item_id
FROM (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.75)) AS item_id,
(SELECT a.category_id
FROM (SELECT category_id, ROW_NUMBER() OVER(ORDER BY category_id) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.75)) AS category_id,
(SELECT a.behavior_code
FROM (SELECT behavior_code, ROW_NUMBER() OVER(ORDER BY behavior_code) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.75)) AS behavior_type,
(SELECT from_unixtime(a.timestamps)
FROM (SELECT timestamps, ROW_NUMBER() OVER(ORDER BY timestamps) rk, COUNT(*) OVER() AS n FROM `淘宝用户购物行为数据集` ) a
WHERE a.rk = CEIL(a.n * 0.75)) AS timestamps
4.7.5 根据四分位数判断异常值
4.7.5.1 商品ID(item_id)——无异常值
WITH a AS (SELECT item_id, ROW_NUMBER() OVER(ORDER BY item_id) rk, COUNT(*) OVER() AS n
FROM `淘宝用户购物行为数据集`),
b AS (SELECT MAX(