
Data Mining
进一步有进一步的欢喜
Articles in this column
Common pandas idioms
1. Set every '\N' placeholder value to None, column by column:
    data.loc[data[i] == '\\N', i] = None
2. LabelEncoder:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    data[cat] = le.fit_transform(data[cat])
Original · 2020-08-03 09:53:31 · 992 views · 0 comments
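
The two idioms above can be tried end to end on a toy frame. This is a minimal sketch, not the post's own code: the column names are made up, and pandas category codes stand in for sklearn's `LabelEncoder` so that the example only needs pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are hypothetical.
data = pd.DataFrame({
    "city": ["bj", "\\N", "sh"],
    "os": ["\\N", "ios", "android"],
})

# Replace the literal two-character string \N with a missing value everywhere.
data = data.replace("\\N", np.nan)

# Integer-encode each categorical column; missing values get the code -1.
for cat in ["city", "os"]:
    data[cat + "_code"] = data[cat].astype("category").cat.codes

print(data)
```

Category codes follow the sorted order of the distinct values, which is also how `LabelEncoder` orders its classes. The difference is that missing values are encoded as -1 here.
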
IP processing
    import numpy as np
    import pandas as pd
    a = np.load('ip_explain_by_geoip2_china.npy', allow_pickle=True)
    ip_exp = a.item()
    temp = pd.DataFrame(list(ip_exp.items()), columns=['ip', 'ip_exp'])
    temp[['country','province_exp','c...
Original · 2019-08-13 12:26:43 · 203 views · 0 comments
GBDT, XGBoost, LightGBM, and CatBoost papers
1. GBDT vs. XGBoost comparison: https://wenku.baidu.com/view/f3da60b4951ea76e58fafab069dc5022aaea463e.html
2. XGBoost paper: https://arxiv.org/pdf/1603.02754.pdf
3. LightGBM paper: http://papers.nips.cc/paper/6907-lightg...
Original · 2019-08-13 12:16:30 · 917 views · 0 comments
Data mining: CTR features
    def ctr_fea(train, test, feature):
        for fea in feature:
            print(fea)
            temp = train[['label', fea]].groupby(fea)['label'].agg({fea+'_sum':sum, ...
Original · 2019-08-22 13:19:46 · 826 views · 0 comments
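
The preview above is cut off, so the following is only a sketch of what a CTR feature of this shape might look like, assuming the goal is a per-value click sum, count and ratio merged back onto both frames. The `_count` and `_ctr` column names are assumptions. Note that passing a renaming dict to `.agg()` on a grouped Series raises an error in pandas 1.0 and later, so this sketch uses named aggregation instead.

```python
import pandas as pd

def ctr_fea(train, test, feature):
    """Add <fea>_sum, <fea>_count and <fea>_ctr columns to both frames."""
    for fea in feature:
        stats = (
            train.groupby(fea)["label"]
            .agg(**{fea + "_sum": "sum", fea + "_count": "count"})
            .reset_index()
        )
        # Click-through rate per value, computed on the training labels only.
        stats[fea + "_ctr"] = stats[fea + "_sum"] / stats[fea + "_count"]
        train = train.merge(stats, on=fea, how="left")
        test = test.merge(stats, on=fea, how="left")
    return train, test

train = pd.DataFrame({"ad": ["a", "a", "b"], "label": [1, 0, 1]})
test = pd.DataFrame({"ad": ["a", "c"]})
train_out, test_out = ctr_fea(train, test, ["ad"])
print(train_out)
print(test_out)
```

Values that appear only in the test set get NaN. Because the statistics are computed from the same rows they are merged onto, the training columns leak the label, and out-of-fold statistics are a common way to reduce that.
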
Data mining: count features
    def cnt_fea(data, feature, train_num):
        data['flag'] = '-'
        for fea in feature:
            print(fea)
            data[fea] = data[fea].map(data[fea].value_counts())
        for i in range(len(feature)-1): ...
Original · 2019-08-22 13:19:32 · 584 views · 0 comments
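
The core of the count feature is the `map(value_counts())` idiom. Here is a minimal sketch of just that step, assuming the counts should go into a new `<fea>_cnt` column (a hypothetical name) rather than overwrite the original. The `train_num` argument and the truncated second loop are left out.

```python
import pandas as pd

def cnt_fea(data, feature):
    """Add a <fea>_cnt column holding how often each value occurs."""
    data = data.copy()
    for fea in feature:
        data[fea + "_cnt"] = data[fea].map(data[fea].value_counts())
    return data

df = pd.DataFrame({"os": ["ios", "android", "ios", "ios"]})
out = cnt_fea(df, ["os"])
print(out)
```
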
Data mining: feature difference encoding
A quick way to write difference encoding:
1. Take set() of the values.
2. Build a pd.DataFrame from them.
3. merge() it back.
    arrs = ['adidmd5', 'imeimd5', 'macmd5', 'openudidmd5', 'ip']
    val = []
    for i in range(len(arrs)):
        val.append(list(set(train[arrs[i]].unique()) & s...
Original · 2019-08-13 12:27:39 · 448 views · 0 comments
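
The three steps above can be sketched as follows. The original expression is truncated after `& s...`, so this assumes it intersects the train and test values. The flag column name `<col>_in_both` and the toy data are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"ip": ["1.1.1.1", "2.2.2.2", "3.3.3.3"]})
test = pd.DataFrame({"ip": ["2.2.2.2", "4.4.4.4"]})
arrs = ["ip"]

for col in arrs:
    flag = col + "_in_both"
    # 1. set(): values that occur in both train and test.
    common = sorted(set(train[col].unique()) & set(test[col].unique()))
    # 2. DataFrame: a lookup table with a flag column.
    lookup = pd.DataFrame({col: common})
    lookup[flag] = 1
    # 3. merge(): left-join the flag; rows without a match become 0.
    train = train.merge(lookup, on=col, how="left")
    test = test.merge(lookup, on=col, how="left")
    train[flag] = train[flag].fillna(0).astype(int)
    test[flag] = test[flag].fillna(0).astype(int)

print(train)
print(test)
```
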
Data mining: mode
    # Mode
    def get_mode(arr):
        mode = []
        arr_appear = dict((a, arr.count(a)) for a in arr)  # count occurrences of each element
        if max(arr_appear.values()) == 1:  # every element occurs exactly once
            return  # so there is no mode
        else: ...
Original · 2019-08-10 00:10:10 · 421 views · 0 comments
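
The `arr.count(a)` loop above is quadratic in the length of the list. A minimal alternative sketch with `collections.Counter` follows. Since the `else` branch is truncated, it assumes the function should return every value tied for the highest count.

```python
from collections import Counter

def get_mode(arr):
    """Return all most-frequent values, or None when no value repeats."""
    if not arr:
        return None
    counts = Counter(arr)
    top = max(counts.values())
    if top == 1:
        # Every element occurs exactly once, so there is no mode.
        return None
    return [value for value, n in counts.items() if n == top]

print(get_mode([1, 2, 2, 3, 3]))
print(get_mode([1, 2, 3]))
```
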
leetcode: manual LabelEncoder
    for col in obj_cols:
        data[col] = data[col].fillna('-1')
        data[col] = data[col].map(dict(zip(data[col].unique(), range(data[col].nunique()))))
        print(col + ' over...')
Original · 2019-08-17 17:22:38 · 269 views · 0 comments
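
For comparison, `pd.factorize` assigns integer codes in order of first appearance, which is the same order the `unique()`-based mapping above produces. A small sketch on made-up data:

```python
import pandas as pd

s = pd.Series(["b", "a", None, "b"]).fillna("-1")

# Manual encoding, as in the loop above.
manual = s.map(dict(zip(s.unique(), range(s.nunique()))))

# Built-in equivalent: codes follow the order of first appearance.
codes, uniques = pd.factorize(s)

print(manual.tolist(), codes.tolist(), list(uniques))
```
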
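Huawei Elite Algorithm Competition finals: summary
1. Huawei competition summary
  1. Top-2 team: EDA. The first step of a competition is EDA to find strong features, for example by looking at how a variable is distributed against the label.
  2. Top-1 team: competition theory.
  3. Personal takeaways
    - Dig deeper into theory, such as how the lgb model and neural networks work, rather than relying on luck.
    - Don't get lazy during a competition, and don't get lazy about filling gaps in theory.
    - Don't lean on others; relying on teammates alone is not enough.
    - Technical work takes a calm, focused mind.
    - Don't shy away from hard problems.
2. CTR summary
  1. EDA: inspect the features, for example uid_value_cou...
Original · 2019-08-26 23:06:43 · 619 views · 0 comments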
Data mining: the geoip2 tool
    import geoip2.database
    import sys
    # ip = input()
    ip = '210.32.149.0'
    reader = geoip2.database.Reader('./GeoLite2-City.mmdb')
    data = reader.city(ip)
    def ip_explain(ip):
        data = reader.city(ip)
        ...
Original · 2019-08-13 12:26:09 · 270 views · 0 comments
Data mining: plotting and saving train/test sets
    # train = data[data.label!=-1]
    # test = data[data.label==-1]
    # train = train.dropna()
    # test = test.dropna()
    #
    # for i in data.columns:
    # for i in ['city','lan', 'os', 'osv', 'ver', 'orientation', 'ca...
Original · 2019-08-02 16:56:37 · 557 views · 0 comments
Data mining: plotting and saving positive/negative samples
    # train_pos = data[data['label']==1]
    # train_neg = data[data['label']==0]
    # train = train.dropna()
    # test = test.dropna()
    # for i in ['city','lan', 'os', 'osv', 'ver', 'orientation', 'carrier', 'ntt',...
Original · 2019-08-02 16:55:28 · 430 views · 0 comments
Data mining: clustering numeric features
    cols = ['area','location', 'pv/uv', 'totalFloor', 'pv', 'shi']
    cols_kmeans = []
    for i in cols:
        data[i+'_kmeans'] = (data[i] - data[i].min())/(data[i].max() - data[i].min())
        cols_kmeans.append(i...
Original · 2019-06-02 03:03:50 · 573 views · 1 comment
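
The visible part of the snippet is a min-max scaling step. A minimal sketch of that step alone follows, with a guard for constant columns, which would otherwise divide by zero. The clustering that presumably follows is truncated in the preview and is not reproduced here. The helper name `min_max_scale` is hypothetical.

```python
import pandas as pd

def min_max_scale(data, cols, suffix="_kmeans"):
    """Add a scaled copy of each column, mapped onto [0, 1]."""
    data = data.copy()
    new_cols = []
    for col in cols:
        lo, hi = data[col].min(), data[col].max()
        span = hi - lo
        # A constant column has zero span; map it to 0.0 instead of NaN.
        data[col + suffix] = 0.0 if span == 0 else (data[col] - lo) / span
        new_cols.append(col + suffix)
    return data, new_cols

df = pd.DataFrame({"area": [10, 20, 30], "pv": [5, 5, 5]})
scaled, cols_kmeans = min_max_scale(df, ["area", "pv"])
print(scaled)
```
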
Data mining: plotting distributions
    import seaborn as sns
    import matplotlib.pyplot as plt
    for i in train.columns:
        try:
            g = sns.kdeplot(train[i], color="Red", shade = True)
            g = sns.kdeplot(test[i], ax =g, color="Blue"...
Original · 2019-06-12 08:51:02 · 1293 views · 0 comments
Data mining: stratified sampling
    # Stratified sampling
    gbr = data.groupby("area")
    gbr.groups
    typicalFracDict = {
        1: 0.2,
        2: 0.4,
        3: 0.6
    }
    def typicalSampling(group, typicalFracDict):
        name = group.name
        frac = typicalFracDi...
Reposted · 2019-06-12 09:37:40 · 868 views · 1 comment
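
A self-contained sketch of the same idea: sample a different fraction from each `area` group. It iterates over the groups instead of using `groupby().apply()` with `group.name`, which is a design choice of this sketch and not necessarily the reposted code's approach. The helper name, the seed and the toy data are assumptions.

```python
import pandas as pd

typicalFracDict = {1: 0.2, 2: 0.4, 3: 0.6}

def stratified_sample(data, by, frac_dict, seed=0):
    """Sample frac_dict[key] of the rows from each group of `by`."""
    parts = [
        group.sample(frac=frac_dict[key], random_state=seed)
        for key, group in data.groupby(by)
    ]
    return pd.concat(parts)

data = pd.DataFrame({"area": [1] * 10 + [2] * 10 + [3] * 10, "value": range(30)})
sample = stratified_sample(data, "area", typicalFracDict)
print(sample["area"].value_counts().sort_index())
```
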
Looking up information by IP
    import requests
    import IPy
    def get_location(ip):
        url = 'https://sp0.baidu.com/8aQDcjqpAAV3otqbppnN2DJv/api.php?co=&resource_id=6006&t=1529895387942&ie=utf8&oe=gbk&c...
Original · 2019-08-02 16:43:01 · 486 views · 0 comments
Data mining: trimming the long tail
    # def cut_col(data, col_name, cut_list):
    #     print('cutting', col_name)
    #     def _trans(array):
    #         count = array['box_counts']
    #         for box in cut_list:
    #             if count <= bo...
Original · 2019-08-02 16:45:09 · 526 views · 0 comments
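
The commented-out code is truncated, so the following is only a guess at its intent: each value is replaced by the index of the first bucket in `cut_list` whose upper bound is at least the value's frequency, so that rare values in the long tail share a bucket. The `_box` column name is hypothetical, and `cut_list` is assumed to be sorted in ascending order.

```python
import numpy as np
import pandas as pd

def cut_col(data, col_name, cut_list):
    """Bucket each value of col_name by how often it occurs."""
    data = data.copy()
    # Frequency of each row's value within the column.
    counts = data[col_name].map(data[col_name].value_counts())
    # Index of the first bucket bound >= count; len(cut_list) if none is.
    data[col_name + "_box"] = np.searchsorted(cut_list, counts.to_numpy(), side="left")
    return data

df = pd.DataFrame({"model": ["a"] + ["b"] * 3 + ["c"] * 6})
out = cut_col(df, "model", [1, 5])
print(out.drop_duplicates())
```
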
Data mining: common idioms (ongoing...)
1. Sorting
    train_new.sort_values(by='imeimd5')
    train_new.sort_values(by='imeimd5')['imeimd5'].max()
    train_ime = train_new['imeimd5'].unique()
2. Progress bar for iterators: tqdm
    import tqdm
    cnt = 0
    for i in tqdm.tqdm_notebook(test_ne...
Original · 2019-08-13 12:25:46 · 254 views · 0 comments
Building a fraud blacklist
    import numpy as np
    # a = np.load('ip_dict.npy', allow_pickle=True)
    # data = a.item()
    temp = train[['ip','label']].groupby('ip')['label'].agg({'mean_label':'mean','count_label':'count','sum_label':'sum'}...
Original · 2019-08-12 14:05:55 · 183 views · 0 comments
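
A sketch of how the per-IP statistics above might be turned into a blacklist. The thresholds `MIN_COUNT` and `MIN_RATE` are hypothetical, as is the assumption that `label == 1` marks fraud. The dict-renaming form of `.agg()` on a grouped Series no longer works in pandas 1.0 and later, so named aggregation is used here.

```python
import pandas as pd

train = pd.DataFrame({
    "ip": ["a", "a", "a", "b", "b", "c"],
    "label": [1, 1, 1, 0, 1, 1],
})

temp = (
    train.groupby("ip")["label"]
    .agg(mean_label="mean", count_label="count", sum_label="sum")
    .reset_index()
)

# Hypothetical rule: enough observations and almost always labelled 1.
MIN_COUNT, MIN_RATE = 3, 0.9
mask = (temp["count_label"] >= MIN_COUNT) & (temp["mean_label"] >= MIN_RATE)
blacklist = temp.loc[mask, "ip"].tolist()
print(blacklist)
```

The count threshold keeps IPs that were seen only once or twice from being blacklisted on the strength of a single label.
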
Data mining: feature_importance
    # Feature importance
    import matplotlib.pyplot as plt
    import seaborn as sns
    cols = (feature_importance_data[["feature", "importance"]]
            .groupby("feature")
            .mean()
            .sort_values(by="importance...
Original · 2019-08-02 16:51:06 · 325 views · 0 comments
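
The ranking step of the snippet above can be run without any plotting. In this sketch the importances are made-up numbers standing in for per-fold model output, and `TOP_N` is a hypothetical cutoff. The original expression is truncated after `sort_values`, so the descending order and the slice are assumptions.

```python
import pandas as pd

# Made-up per-fold importances, one row per (feature, fold).
feature_importance_data = pd.DataFrame({
    "feature": ["ip", "os", "ip", "os", "city", "city"],
    "importance": [10.0, 4.0, 12.0, 6.0, 1.0, 3.0],
    "fold": [1, 1, 2, 2, 1, 2],
})

TOP_N = 2
cols = (
    feature_importance_data[["feature", "importance"]]
    .groupby("feature")
    .mean()
    .sort_values(by="importance", ascending=False)[:TOP_N]
    .index
)
print(list(cols))
```
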
Simple grid search for lightgbm
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold
    folds = KFold(n_splits=5, shuffle=True, random_state=1333)
    oof_lgb = np.zeros(len(train))
    predictions_lgb = np.zeros(len(test))
    feature_importance_data = pd.DataFrame()
    best_score = 0
    learning_rate ...
Original · 2019-06-02 02:59:14 · 2248 views · 0 comments