数据分析处理库-Pandas基础-下篇

本文是Pandas基础的下篇,主要讲解了SAC分组模式,包括groupby对象用于学生分组分析;数据合并部分介绍了concat函数和merge()函数的应用及参数设置;数据转换函数如map、apply和applymap的操作;时序数据处理的pd.date_range函数;以及数据清洗,包括重复数据和缺失值的处理方法。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

在这里插入图片描述

数据分析处理库-Pandas基础-下篇

1 SAC*Pandas 分组模式及其对象

  • SAC 指的是分组操作中的split-apply-combine过程,其中
    • split 基于某一些规则,将数据拆成若干组
    • apply 对每一组独立地使用函数
    • combine 指将每一组的结果组合成某一类数据结构

1.1groupby对象

  • 学生分组
import pandas as pd
df = pd.read_csv('./data/learn_pandas.csv')
df.head(4)
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 Shanghai Jiao Tong University Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
1 Peking University Freshman Changqiang You Male 166.5 70.0 N 1 2019/9/4 0:04:20
2 Shanghai Jiao Tong University Senior Mei Sun Male 188.9 89.0 N 2 2019/9/12 0:05:22
3 Fudan University Sophomore Xiaojuan Sun Female NaN 41.0 N 2 2020/1/3 0:04:08
df.groupby(['School','Gender'])['Height'].mean()
School                         Gender
Fudan University               Female    158.776923
                               Male      174.212500
Peking University              Female    158.666667
                               Male      172.030000
Shanghai Jiao Tong University  Female    159.122500
                               Male      176.760000
Tsinghua University            Female    159.753333
                               Male      171.638889
Name: Height, dtype: float64
df.groupby(['School','Grade','Gender'])['Gender'].count()
#根据学校和年纪及性别,统计男女人数
School                         Grade      Gender
Fudan University               Freshman   Female     8
                                          Male       1
                               Junior     Female     9
                                          Male       3
                               Senior     Female     9
                                          Male       2
                               Sophomore  Female     4
                                          Male       4
Peking University              Freshman   Female     7
                                          Male       6
                               Junior     Female     6
                                          Male       2
                               Senior     Female     5
                                          Male       3
                               Sophomore  Female     4
                                          Male       1
Shanghai Jiao Tong University  Freshman   Female     8
                                          Male       5
                               Junior     Female    15
                                          Male       2
                               Senior     Female    14
                                          Male       8
                               Sophomore  Female     4
                                          Male       1
Tsinghua University            Freshman   Female    17
                               Junior     Female    13
                                          Male       9
                               Senior     Female    10
                                          Male       4
                               Sophomore  Female     8
                                          Male       8
Name: Gender, dtype: int64
#根据学生的身高是否超过总体的均值来分组,最后计算体重的平均值
condition = df['Height'] > df["Height"].mean()
#身高超过均值True
#身高没有超过均值False
df.groupby(condition)['Weight'].mean()
Height
False    47.672414
True     66.684932
Name: Weight, dtype: float64
df .groupby([ 'School','Gender' ]).agg(
    {
   
   "Height":[
        ('height_max',"max"),
        ('height_min','min'),
        ("height_mean","mean")
    ],
"Weight" :[
        ("weight_ptp",lambda x:x.max()-x.min()),
        ("weight_median" , "median")
    ]
    }
)
Height Weight
height_max height_min height_mean weight_ptp weight_median
School Gender
Fudan University Female 170.2 147.3 158.776923 29.0 47.5
Male 177.3 167.6 174.212500 19.0 73.5
Peking University Female 170.0 147.8 158.666667 22.0 46.5
Male 185.3 162.4 172.030000 29.0 73.5
Shanghai Jiao Tong University Female 167.7 145.4 159.122500 23.0 49.0
Male 188.9 166.0 176.760000 27.0 75.0
Tsinghua University Female 168.9 150.5 159.753333 21.0 49.0
Male 193.9 155.7 171.638889 28.0 71.0
  • 北京不同aqiInfo下,北京的白天最高温度,均值,白天最低温迪,夜晚最高,夜晚最低,均值,以及aqiInfo的数量,aqi的,最大值,最小值,均值,空气质量指数大于50的天数有多少
import pandas as pd
import numpy as np
data = pd.read_csv('./data/beijing_tianqi_2018.csv')
data.head(4)
ymd bWendu yWendu tianqi fengxiang fengli aqi aqiInfo aqiLevel
0 2018-01-01 3℃ -6℃ 晴~多云 东北风 1-2级 59 2
1 2018-01-02 2℃ -5℃ 阴~多云 东北风 1-2级 49 1
2 2018-01-03 2℃ -5℃ 多云 北风 1-2级 28 1
3 2018-01-04 0℃ -8℃ 东北风 1-2级 28 1
data['bWendu'] = data['bWendu'].str.replace("℃",'').astype(np.int0)
data['yWendu'] = data['yWendu'].str.replace("℃",'').astype(np.int0)
df = data.groupby('aqiInfo').agg(
    {
   
   
        "bWendu":[
            ("bwendu_max","max"),
            ('bwendu_min','min'),
            ('bwendu_mean',"mean")
        ],
        "yWendu":[
            ("ywendu_max","max"),
            ('ywendu_min','min'),
            ('ywendu_mean',"mean")
        ],
        'aqiInfo':[
            ("aqi_info_num",'count')
        ],
        'aqi':[
            ('aqi_min','min'),
            ('aqi_max','max'),
            ('aqi_mean','mean'),
            ('aqi_50',lambda x: sum(x>50))
        ]
    }
)
  • set_index(‘label’,inplace=True) # 设置索引
  • reset_index(inplace=True) # 取消索引
df.reset_index(inplace=True)
df
aqiInfo bWendu yWendu aqiInfo aqi
bwendu_max bwendu_min bwendu_mean ywendu_max ywendu_min ywendu_mean aqi_info_num aqi_min aqi_max aqi_mean aqi_50
0 严重污染 25 25 25.000000 9 9 9.000000 1 387 387 387.000000 1
1 中度污染 36 6 20.238095 25 -5 8.047619 21 151 198 175.523810 21
2 35 -5 16.191304 26 -12 6.478261 115 21 50 37.426087 0
3 38 -2 20.233333 27 -10 10.100000 150 51 100 72.766667 150
4 轻度污染 37 -1 19.318182 27 -9 8.303030 66 101 149 117.772727 66
5 重度污染 27 7 15.916667 11 0 5.416667 12 206 293 244.333333 12

2 数据合并相关

2.1 concat函数

  • 不加参数直接合并,默认是按照轴0进行合并,不会忽略索引,缺失值用NaN填充
  • ignore_index= True, 忽略索引
  • axis=0 ,指定合并的方向
  • join=‘outer’ # 取并集,如果设置为"inner"则取交集
  • keys() # 连接的时候,加上一个层次的Key,用于判定数据来源于那个表
df1 = pd.DataFrame({
   
   'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3'],
'E': ['E0', 'E1', 'E2', 'E3']})
df2 = pd.DataFrame({
   
   'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': [
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

不 蔓

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值