Pandas (数据分析处理库)
Pandas:纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具。提供了大量能使我们快速便捷地处理数据的函数和方法。
1.Pandas数据读取
1.1 Pandas读数据
import pandas
food_info = pandas.read_csv("food_info.csv")
print("---------food_info-------")
print(type(food_info))
print("---------food_info.shape-------")
print(food_info.shape)
print("---------food_info.values-------")
print(food_info.values)
print("---------food_info.dtypes-------")
print(food_info.dtypes)
print(help(pandas.read_csv))
#默认打印前5条数据
print(food_info.head())
print(food_info.tail())
#打印所有列名
print(food_info.columns)
1.2 Pandas读取列数据
print (food_info["Iron_(mg)"])
print("------------------")
div_1000 = food_info["Iron_(mg)"]/1000
print (div_1000)
2.Pandas索引与计算
# 2.1 获取索引为0的元素
print(food_info.loc[0])
#获取多条
print(food_info.loc[[2,5,10]])
NDB_No 1001
Shrt_Desc BUTTER WITH SALT
Water_(g) 15.87
Energ_Kcal 717
Protein_(g) 0.85
Lipid_Tot_(g) 81.11
Ash_(g) 2.11
Carbohydrt_(g) 0.06
Fiber_TD_(g) 0
Sugar_Tot_(g) 0.06
Calcium_(mg) 24
Iron_(mg) 0.02
Magnesium_(mg) 2
Phosphorus_(mg) 24
Potassium_(mg) 24
Sodium_(mg) 643
Zinc_(mg) 0.09
Copper_(mg) 0
Manganese_(mg) 0
Selenium_(mcg) 1
Vit_C_(mg) 0
Thiamin_(mg) 0.005
Riboflavin_(mg) 0.034
Niacin_(mg) 0.042
Vit_B6_(mg) 0.003
Vit_B12_(mcg) 0.17
Vit_A_IU 2499
Vit_A_RAE 684
Vit_E_(mg) 2.32
Vit_D_mcg 1.5
Vit_D_IU 60
Vit_K_(mcg) 7
FA_Sat_(g) 51.368
FA_Mono_(g) 21.021
FA_Poly_(g) 3.043
Cholestrl_(mg) 215
Name: 0, dtype: object
#2.2 根据列名获取列
#获取单列
print(food_info["NDB_No"])
#获取多列
print(food_info[["NDB_No","Water_(g)"]])
0 1001
1 1002
2 1003
3 1004
4 1005
5 1006
6 1007
7 1008
8 1009
9 1010
10 1011
11 1012
12 1013
13 1014
14 1015
15 1016
...
8608 44258
8609 44259
8610 44260
8611 48052
8612 80200
8613 83110
8614 90240
8615 90480
8616 90560
8617 93600
Name: NDB_No, Length: 8618, dtype: int64
#2.3 获取以克结尾的列
col_names = food_info.columns.tolist()
print(col_names)
gram_colmns = []
for c in col_names:
if c.endswith("(g)"):
gram_colmns.append(c)
gram_df = food_info[gram_colmns]
print(gram_df.head(3))
#2.4 新增加一列
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"]/1000
print(food_info.shape)
food_info["Iron_(g)"] = iron_grams
print(food_info.shape)
3.数据预处理
#3.1 排序 从小到大
#inplace=True:不创建新的对象,直接对原始对象进行修改;
#inplace=False:对数据进行修改,创建并返回新的对象承载其修改结果
food_info.sort_values("Sodium_(mg)",inplace = True)
print (food_info["Sodium_(mg)"])
#排序 从大到小
food_info.sort_values("Sodium_(mg)",inplace = True,ascending = False)
print ("-----------------------")
print (food_info["Sodium_(mg)"])
NDB_No Shrt_Desc ... FA_Poly_(g) Cholestrl_(mg)
2 1003 BUTTER OIL ANHYDROUS ... 3.694 256.0
5 1006 CHEESE BRIE ... 0.826 100.0
10 1011 CHEESE COLBY ... 0.953 95.0
#3.2 统计缺失值
import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()
age = titanic_survival["Age"]
age_is_null = pd.isnull(age)
age__null_true = age[age_is_null]
age_null_count = len(age__null_true)
print(age_null_count)
#3.3 算不缺失值 的平均数
good_ages = titanic_survival["Age"][age_is_null == False]
correct_mean_age = sum(good_ages) / len(good_ages)
print (correct_mean_age)
#3.4 算不缺失值 的平均数
correct_mean_age = titanic_survival["Age"].mean()
print (correct_mean_age)
#3.5 算不同舱位 价格的平均数
passenger_classes = [1,2,3]
fares_by_class = {}
for this_class in passenger_classes:
pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
pclass_fares = pclass_rows["Fare"]
fare_for_class = pclass_fares.mean()
fares_by_class[this_class] = fare_for_class
print (fares_by_class)
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
#3.6 统计不同船仓的获救率
#以Pclass为基础,求Survived的获救率
passenger_survival = titanic_survival.pivot_table(index="Pclass",values="Survived",aggfunc=np.mean)
print (passenger_survival)
Pclass Survived
1 0.629630
2 0.472826
3 0.242363
#3.7 统计不同船仓与年龄之间关系
passenger_age = titanic_survival.pivot_table(index="Pclass",values="Age")
print (passenger_age)
Pclass Age
1 38.233441
2 29.877630
3 25.140620
#3.8 不同登船地点与车票和获救关系
#登船在C地点,共收10072.2962 获救93
port_stats = titanic_survival.pivot_table(index="Embarked",values=["Fare","Survived"],aggfunc=np.sum)
print (port_stats)
Embarked Fare Survived
C 10072.2962 93
Q 1022.2543 30
S 17439.3988 217
#3.9 把年龄和性别为空的去掉
#dropna函数
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age","Sex"])
print(new_titanic_survival)
#3.10 查看83号样本的年纪和766号的船仓
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_1000_pclass = titanic_survival.loc[766,"Pclass"]
print (row_index_83_age)
print (row_index_1000_pclass)
28.0
1
#3.11 重新编辑索引
#reset_index函数
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print (new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print ("------------------------")
print (titanic_reindexed.loc[0:10])
PassengerId Survived Pclass Name \
630 631 1 1 Barkworth, Mr. Algernon Henry Wilson
851 852 0 3 Svensson, Mr. Johan
493 494 0 1 Artagaveytia, Mr. Ramon
96 97 0 1 Goldschmidt, Mr. George B
116 117 0 3 Connors, Mr. Patrick
672 673 0 2 Mitchell, Mr. Henry Michael
745 746 0 1 Crosby, Capt. Edward Gifford
33 34 0 2 Wheadon, Mr. Edward H
54 55 0 1 Ostby, Mr. Engelhart Cornelius
280 281 0 3 Duane, Mr. Frank
Sex Age SibSp Parch Ticket Fare Cabin Embarked
630 male 80.0 0 0 27042 30.0000 A23 S
851 male 74.0 0 0 347060 7.7750 NaN S
493 male 71.0 0 0 PC 17609 49.5042 NaN C
96 male 71.0 0 0 PC 17754 34.6542 A5 C
116 male 70.5 0 0 370369 7.7500 NaN Q
672 male 70.0 0 0 C.A. 24580 10.5000 NaN S
745 male 70.0 1 1 WE/P 5735 71.0000 B22 S
33 male 66.0 0 0 C.A. 24579 10.5000 NaN S
54 male 65.0 0 1 113509 61.9792 B30 C
280 male 65.0 0 0 336439 7.7500 NaN Q
------------------------
PassengerId Survived Pclass Name Sex \
0 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
1 852 0 3 Svensson, Mr. Johan male
2 494 0 1 Artagaveytia, Mr. Ramon male
3 97 0 1 Goldschmidt, Mr. George B male
4 117 0 3 Connors, Mr. Patrick male
5 673 0 2 Mitchell, Mr. Henry Michael male
6 746 0 1 Crosby, Capt. Edward Gifford male
7 34 0 2 Wheadon, Mr. Edward H male
8 55 0 1 Ostby, Mr. Engelhart Cornelius male
9 281 0 3 Duane, Mr. Frank male
10 457 0 1 Millet, Mr. Francis Davis male
Age SibSp Parch Ticket Fare Cabin Embarked
0 80.0 0 0 27042 30.0000 A23 S
1 74.0 0 0 347060 7.7750 NaN S
2 71.0 0 0 PC 17609 49.5042 NaN C
3 71.0 0 0 PC 17754 34.6542 A5 C
4 70.5 0 0 370369 7.7500 NaN Q
5 70.0 0 0 C.A. 24580 10.5000 NaN S
6 70.0 1 1 WE/P 5735 71.0000 B22 S
7 66.0 0 0 C.A. 24579 10.5000 NaN S
8 65.0 0 1 113509 61.9792 B30 C
9 65.0 0 0 336439 7.7500 NaN Q
10 65.0 0 0 13509 26.5500 E38 S
3.11 apply函数
# apply调用函数
def hundredth_row(column):
hundredth_item = column.loc[99]
return hundredth_item
hundredth_row = titanic_survival.apply(hundredth_row)
print (hundredth_row)
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object
#3.12 通过函数的方式,查看第100号元素
def hundredth_row(column):
hundredth_item = column.loc[99]
return hundredth_item
hundredth_row = titanic_survival.apply(hundredth_row)
print (hundredth_row)
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object
#3.13 返回每行有缺失值的个数
def not_null_count(column):
column_null = pd.isnull(column)
null = column[column_null]
return len(null)
column_null_count = titanic_survival.apply(not_null_count)
print (column_null_count)
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
#3.14 根据pclass不同值,变换成英文
def which_class(row):
pclass = row["Pclass"]
if pd.isnull(pclass):
return "Unknow"
elif pclass == 1:
return "First Class"
elif pclass == 2:
return "Second Class"
elif pclass == 3:
return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)
print (classes)
0 Third Class
1 First Class
2 Third Class
3 First Class
4 Third Class
5 Third Class
6 First Class
7 Third Class
8 Third Class
9 Second Class
10 Third Class
11 First Class
...
887 First Class
888 Third Class
889 First Class
890 Third Class
Length: 891, dtype: object
#3.15 根据年龄不同,返回不同值
def is_minor(row):
if row["Age"]<18:
return True
else:
return False
minors = titanic_survival.apply(is_minor,axis=1)
print(minors)
def generate_age_label(row):
age = row["Age"]
if pd.isnull(age):
return "unknow"
elif age<18:
return "minor"
else:
return "adult"
age_labels = titanic_survival.apply(generate_age_label,axis=1)
print(age_labels)
0 adult
1 adult
2 adult
3 adult
4 adult
5 unknow
6 adult
7 minor
8 adult
9 minor
10 minor
...
885 adult
886 adult
887 adult
888 unknow
889 adult
890 adult
#3.16 年龄与生存率之间关系
titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels",values="Survived")
print(age_group_survival)
age_labels Survived
adult 0.381032
minor 0.539823
unknow 0.293785
4.Series结构
DataFrame结构是由一系列Series结构组成
Series结构是由ndarray结构组成
#4.1 读取文件,打印
import pandas as pd
fandango = pd.read_csv("fandango_score_comparison.csv")
series_film = fandango["FILM"]
print (type(series_film))
print (series_film[0:5])
series_rt = fandango["RottenTomatoes"]
print (series_rt[0:5])
<class 'pandas.core.series.Series'>
0 Avengers: Age of Ultron (2015)
1 Cinderella (2015)
2 Ant-Man (2015)
3 Do You Believe? (2015)
4 Hot Tub Time Machine 2 (2015)
Name: FILM, dtype: object
0 74
1 85
2 80
3 18
4 14
Name: RottenTomatoes, dtype: int64
#4.2 获取元素
from pandas import Series
film_names = series_film.values
print(type(film_names))
print(film_names)
rt_scores = series_rt.values
print(rt_scores)
series_custom = Series(rt_scores , index=film_names)
print(series_custom[["Minions (2015)","Leviathan (2014)"]])
<class 'numpy.ndarray'>
['Avengers: Age of Ultron (2015)' 'Cinderella (2015)' 'Ant-Man (2015)'
'Do You Believe? (2015)' 'Hot Tub Time Machine 2 (2015)'
'The Water Diviner (2015)' 'Irrational Man (2015)' 'Top Five (2014)'
'Shaun the Sheep Movie (2015)' 'Love & Mercy (2015)'
'Far From The Madding Crowd (2015)' 'Black Sea (2015)' 'Leviathan (2014)'
'Unbroken (2014)' 'The Imitation Game (2014)' 'Taken 3 (2015)'
'Ted 2 (2015)' 'Southpaw (2015)'
'Night at the Museum: Secret of the Tomb (2014)' 'Pixels (2015)'
'McFarland, USA (2015)' 'Insidious: Chapter 3 (2015)'
'The Man From U.N.C.L.E. (2015)' 'Run All Night (2015)'
'Trainwreck (2015)' 'Selma (2014)' 'Ex Machina (2015)'
'Still Alice (2015)' 'Wild Tales (2014)' 'The End of the Tour (2015)'
'Red Army (2015)' 'When Marnie Was There (2015)'
'The Hunting Ground (2015)' 'The Boy Next Door (2015)' 'Aloha (2015)'
'The Loft (2015)' '5 Flights Up (2015)' 'Welcome to Me (2015)'
'Saint Laurent (2015)' 'Maps to the Stars (2015)'
"I'll See You In My Dreams (2015)" 'Timbuktu (2015)' 'About Elly (2015)'
'The Diary of a Teenage Girl (2015)'
'Kingsman: The Secret Service (2015)' 'Tomorrowland (2015)'
'The Divergent Series: Insurgent (2015)' 'Annie (2014)'
'Fantastic Four (2015)' 'Terminator Genisys (2015)'
'Pitch Perfect 2 (2015)' 'Entourage (2015)' 'The Age of Adaline (2015)'
'Hot Pursuit (2015)' 'The DUFF (2015)' 'Black or White (2015)'
'Project Almanac (2015)' 'Ricki and the Flash (2015)'
'Seventh Son (2015)' 'Mortdecai (2015)' 'Unfinished Business (2015)'
'American Ultra (2015)' 'True Story (2015)' 'Child 44 (2015)'
'Dark Places (2015)' 'Birdman (2014)' 'The Gift (2015)'
'Unfriended (2015)' 'Monkey Kingdom (2015)' 'Mr. Turner (2014)'
'Seymour: An Introduction (2015)' 'The Wrecking Crew (2015)'
'American Sniper (2015)' 'Furious 7 (2015)'
'The Hobbit: The Battle of the Five Armies (2014)' 'San Andreas (2015)'
'Straight Outta Compton (2015)' 'Vacation (2015)' 'Chappie (2015)'
'Poltergeist (2015)' 'Paper Towns (2015)' 'Big Eyes (2014)'
'Blackhat (2015)' 'Self/less (2015)' 'Sinister 2 (2015)'
'Little Boy (2015)' 'Me and Earl and The Dying Girl (2015)'
'Maggie (2015)' 'Mad Max: Fury Road (2015)' 'Spy (2015)'
'The SpongeBob Movie: Sponge Out of Water (2015)' 'Paddington (2015)'
'Dope (2015)' 'What We Do in the Shadows (2015)' 'The Overnight (2015)'
'The Salt of the Earth (2015)' 'Song of the Sea (2014)'
'Fifty Shades of Grey (2015)' 'Get Hard (2015)' 'Focus (2015)'
'Jupiter Ascending (2015)' 'The Gallows (2015)'
'The Second Best Exotic Marigold Hotel (2015)' 'Strange Magic (2015)'
'The Gunman (2015)' 'Hitman: Agent 47 (2015)' 'Cake (2015)'
'The Vatican Tapes (2015)' 'A Little Chaos (2015)'
'The 100-Year-Old Man Who Climbed Out the Window and Disappeared (2015)'
'Escobar: Paradise Lost (2015)' 'Into the Woods (2014)'
'It Follows (2015)' 'Inherent Vice (2014)' 'A Most Violent Year (2014)'
"While We're Young (2015)" 'Clouds of Sils Maria (2015)'
'Testament of Youth (2015)' 'Infinitely Polar Bear (2015)'
'Phoenix (2015)' 'The Wolfpack (2015)'
'The Stanford Prison Experiment (2015)' 'Tangerine (2015)'
'Magic Mike XXL (2015)' 'Home (2015)' 'The Wedding Ringer (2015)'
'Woman in Gold (2015)' 'The Last Five Years (2015)'
'Mission: Impossible – Rogue Nation (2015)' 'Amy (2015)'
'Jurassic World (2015)' 'Minions (2015)' 'Max (2015)'
'Paul Blart: Mall Cop 2 (2015)' 'The Longest Ride (2015)'
'The Lazarus Effect (2015)' 'The Woman In Black 2 Angel of Death (2015)'
'Danny Collins (2015)' 'Spare Parts (2015)' 'Serena (2015)'
'Inside Out (2015)' 'Mr. Holmes (2015)' "'71 (2015)"
'Two Days, One Night (2014)' 'Gett: The Trial of Viviane Amsalem (2015)'
'Kumiko, The Treasure Hunter (2015)']
[ 74 85 80 18 14 63 42 86 99 89 84 82 99 51 90 9 46 59
50 17 79 59 68 60 85 99 92 88 96 92 96 89 92 10 19 11
52 71 51 60 94 99 97 95 75 50 30 27 9 26 67 32 54 8
71 39 34 64 12 12 11 46 45 26 26 92 93 60 94 98 100 93
72 81 61 50 90 27 30 31 55 72 34 20 13 20 81 54 97 93
78 98 87 96 82 96 99 25 29 57 26 16 62 17 17 7 49 13
40 67 52 71 96 73 90 83 89 81 80 99 84 84 95 62 45 27
52 60 92 97 71 54 35 5 31 14 22 77 52 18 98 87 97 97
100 87]
Minions (2015) 54
Leviathan (2014) 99
dtype: int64
#4.3 重新排序
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
print (sorted_by_index)
'71 (2015) 97
5 Flights Up (2015) 52
A Little Chaos (2015) 40
A Most Violent Year (2014) 90
About Elly (2015) 97
Aloha (2015) 19
American Sniper (2015) 72
American Ultra (2015) 46
..
Two Days, One Night (2014) 97
Unbroken (2014) 51
Unfinished Business (2015) 11
Unfriended (2015) 60
Vacation (2015) 27
Welcome to Me (2015) 71
What We Do in the Shadows (2015) 96
When Marnie Was There (2015) 89
While We're Young (2015) 83
Wild Tales (2014) 96
Woman in Gold (2015) 52
Length: 146, dtype: int64
#4.4 series重新排序
sc2 = series_custom.sort_index()
sc3 = series_custom.sort_values()
print (sc3[0:10])
Paul Blart: Mall Cop 2 (2015) 5
Hitman: Agent 47 (2015) 7
Hot Pursuit (2015) 8
Fantastic Four (2015) 9
Taken 3 (2015) 9
The Boy Next Door (2015) 10
The Loft (2015) 11
Unfinished Business (2015) 11
Mortdecai (2015) 12
Seventh Son (2015) 12
dtype: int64
#4.5 series重新排序后,相加
print(np.add(series_custom,series_custom))
print("-----------------sin-----------------------")
np.sin(series_custom)
print("------------------max----------------------")
np.max(series_custom)
#4.6 series查找
print (series_custom)
series_custom > 50
#print (series_custom)
series_greater_than_50 = series_custom[series_custom>50]
print (series_greater_than_50)
#4.7 series的values相加,再取平均数
rt_critics = Series(fandango["RottenTomatoes"].values,index=fandango["FILM"])
rt_users = Series(fandango["RottenTomatoes_User"].values,index=fandango["FILM"])
rt_mean = (rt_critics + rt_users)/2
print (rt_mean)
FILM
Avengers: Age of Ultron (2015) 80.0
Cinderella (2015) 82.5
Ant-Man (2015) 85.0
Do You Believe? (2015) 51.0
Hot Tub Time Machine 2 (2015) 21.0
...
Inside Out (2015) 94.0
Mr. Holmes (2015) 82.5
'71 (2015) 89.5
Two Days, One Night (2014) 87.5
Gett: The Trial of Viviane Amsalem (2015) 90.5
Kumiko, The Treasure Hunter (2015) 75.0
Length: 146, dtype: float64
#4.8 DataFrame的索引
print(type(fandango))
fandango_films = fandango.set_index("FILM",drop=False)
print(fandango_films.index)
#4.9 DataFrame的切片
print(fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"])
print("-------------------------")
print(fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"])
总结:
1.读数据:pandas.read_csv(“food_info.csv”)
2.打印数据:print (food_info[“Iron_(mg)”])
3.获取索引为0的元素:print(food_info.loc[0])
4.根据列名获取列:print(food_info[“NDB_No”]) print(food_info[[“NDB_No”,“Water_(g)”]])
5.获取以克结尾的列:c.endswith("(g)")
6.排序:food_info.sort_values(“Sodium_(mg)”,inplace = True)
7.统计缺失值: pd.isnull(age)
8.算平均数:titanic_survival[“Age”].mean()
9.船仓的获救率:titanic_survival.pivot_table(index=“Pclass”,values=“Survived”,aggfunc=np.mean)
10.把年龄和性别为空的去掉:titanic_survival.dropna(axis=0,subset=[“Age”,“Sex”])
11.查看83号样本的年纪和766号的船仓:titanic_survival.loc[83,“Age”] titanic_survival.loc[766,“Pclass”]
12.重新编辑索引:new_titanic_survival.reset_index(drop=True)
13.apply调用函数:titanic_survival.apply(hundredth_row)
14.DataFrame结构是由一系列Series结构组成。
Series结构是由ndarray结构组成
15.series重新排序:series_custom.sort_index() series_custom.sort_values()