5 Scorecard Model Construction
Learning objectives

- Master how to compute the KS statistic
- Understand the score mapping method
- Understand the basic principles of XGBoost and LightGBM
- Master feature selection with LightGBM
- Apply toad to build a scorecard model
1 Model Building Workflow
1.1 Experiment Design
- A new model can only go live if it improves on the existing solution, and the improvement must be demonstrated through experiments.

| Stage | Experiment design |
|---|---|
| Cold start | Manual review vs. rule model |
| Early business stage | Manual review vs. standard model |
| Growth stage | New vs. old model, comparing long and short performance windows |
| Volatile period | New vs. old model, on stable vs. volatile populations |
| Strategy adjustment | Avoid iterating the model; compare the online model, a shadow (side-by-side) model, and the standard model |
| New data source | New vs. old model: data-driven model vs. previous model version |
- Once the business stabilizes, will manual review be removed?
  - After a model goes live, it typically performs well in the high and low score bands; users in the middle band may still need manual review.
  - As the model gets better and better, the need for manual review gradually decreases, but it is never removed entirely.
- Standard models: logistic regression, random forest
- Strategy and model are never adjusted at the same time.
1.2 Sample Design
- Observation and performance windows for the A/B/C cards

| | Observation window | Performance window |
|---|---|---|
| Real-time A card | 6-12 months before the application date | FPD7, FPD30 |
| Whitelist A card | 6-12 months before the invitation/activation date | FPD30 |
| B card | 6-12 months before any drawdown by a currently non-delinquent user | DPD30/DPD60 in the current or the following 2-6 installments |
| C card | 6-12 months before the point 1 day / 30 days / 60 days past the due date, for currently delinquent users | DPD30/60/90 for the current installment |
- Repayment status and DPD together characterize a user's delinquency.

| | Before Due | After Due |
|---|---|---|
| Fully Repay | FB (good user) | FA (successfully collected) |
| Partially Repay | PB | PA (willing, but unable to repay in full) |
| Extend | EB (extended in advance) | EA (extended after default; possibly high-risk) |
| Not Repay | NB | NA |
- A card: new applicants; B card: existing customers with no current delinquency; C card: delinquent existing customers
- Currently delinquent: became overdue and has not fully repaid as of the observation point (NA, PA)
- Historically delinquent: was overdue in the past and has since repaid, or is currently delinquent (FA, NA, PA)
- Examples

| | Jan | Feb | Mar | Apr | May |
|---|---|---|---|---|---|
| Repayment status | repaid | repaid | repaid | repaid | repaid |
| DPD | 40 | 0 | 0 | 0 | 0 |

The case above is a B-card customer.

| | Jan | Feb | Mar | Apr | May |
|---|---|---|---|---|---|
| Repayment status | repaid | repaid | repaid | repaid | unpaid |
| DPD | 0 | 0 | 0 | 0 | 40 |

The case above is a C-card customer.

| | Jan | Feb | Mar | Apr | May |
|---|---|---|---|---|---|
| Repayment status | repaid | repaid | repaid | unpaid | unpaid |
| DPD | 40 | 0 | 0 | 40 | 10 |

The case above is also a C-card customer.
- Sample design table (months 1-6 form the training set, months 7-8 the test set)

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug |
|---|---|---|---|---|---|---|---|---|
| Total # | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
| Bad # | 3 | 6 | 6 | 8 | 15 | 12 | 14 | 24 |
| Bad % | 3% | 3% | 2% | 2% | 3% | 2% | 2% | 3% |

- Watch the bad-sample ratio across months; it should not fluctuate too much (a pandas sketch of this check follows).
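A hedged sketch of the monthly bad-rate check, assuming application data with a month column (`obs_mth`) and a 0/1 label (`bad_ind`), as in the Bcard dataset used in section 2.2:

```python
import pandas as pd

# Load monthly application data; column names follow the Bcard dataset below.
data = pd.read_csv('data/Bcard.txt')

# Per-month totals, bad counts, and bad rate.
monthly = data.groupby('obs_mth')['bad_ind'].agg(total='count', bad='sum', bad_rate='mean')
print(monthly)  # eyeball bad_rate: large month-to-month swings are a red flag
```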
- Population description: first-order users, rich internal data, high-risk occupations excluded, income in the range XXXX
- Population labels: good: FPD <= 30; bad: FPD > 30
1.3 Model Training and Evaluation
- Machine learning models are still the mainstream here; a few companies are experimenting with deep learning.
- Model priorities: interpretability > stability > discrimination
  - Discrimination: AUC, KS
  - Stability: PSI
- Business metrics: approval rate, delinquency rate
  - With the delinquency rate kept within a reasonable range, aim to raise the approval rate.
  - A card: guarantee a certain approval rate, with some tolerance on the delinquency rate.
  - B card: work to bring the delinquency rate down, and raise credit limits for good users.
- AUC and KS
  - AUC: the area under the ROC curve; it reflects how well the model's output probabilities rank good vs. bad users.
  - KS reflects the maximum difference between the distributions of good and bad users.
  - The ROC curve is a record of (FPR, TPR) value pairs.
  - KS = max(TPR - FPR)
- The difference between AUC and KS can be simplified as:
  - AUC reflects the model's average discrimination.
  - KS reflects the model's best-case discrimination.
- PSI here is exactly the same as the PSI used for feature stability (a minimal sketch follows).
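As a reminder, PSI is computed over binned distributions: $PSI = \sum (actual\% - expected\%) \cdot \ln(actual\% / expected\%)$. A minimal sketch, assuming quantile binning on the baseline sample; the `psi` helper name and the binning details are illustrative, not from the original notes:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a baseline sample and a new sample."""
    eps = 1e-6
    # Bin edges from the baseline sample's quantiles, stretched to cover both samples.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0] = min(edges[0], actual.min()) - eps
    edges[-1] = max(edges[-1], actual.max()) + eps
    e_pct = np.histogram(expected, edges)[0] / len(expected) + eps  # expected %
    a_pct = np.histogram(actual, edges)[0] / len(actual) + eps      # actual %
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Example: scores drifting slightly between two periods.
rng = np.random.default_rng(0)
print(psi(rng.normal(650, 50, 10_000), rng.normal(640, 55, 10_000)))
```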
1.4 Overall Model Deployment Workflow
2 Logistic Regression Scorecard
2.1 Score Mapping Method
- A logistic regression model outputs a value in [0, 1], which in a risk-control setting can be read as the user's probability of default. When building a scorecard, we need to map this default probability to a score.
- Example:
  - The user's base score is 650.
  - When the user's probability of not being overdue is 2x the probability of being overdue, add 50 points.
  - When it is 4x, add 100 points.
  - When it is 8x, add 150 points.
  - Continuing this pattern gives the industry-standard scorecard formula:

$$\text{score} = 650 + 50\,\log_2\!\left(\frac{P_{\text{good}}}{P_{\text{bad}}}\right)$$

where score is the output of the scorecard mapping, $P_{\text{good}}$ is the probability that the sample is not overdue (the positive class), and $P_{\text{bad}}$ is the probability that it is overdue (the negative class).
- How the logistic regression model lines up with the scorecard formula:
  - The logistic regression equation is

$$\ln\!\left(\frac{P_{\text{good}}}{P_{\text{bad}}}\right) = w_1x_1 + w_2x_2 + w_3x_3 + \cdots$$

  - In credit scoring, the linear component of the logistic regression outputs $\ln(P_{\text{good}}/P_{\text{bad}})$, i.e., the log-odds.
  - By the change-of-base formula:

$$\log_2\!\left(\frac{P_{\text{good}}}{P_{\text{bad}}}\right) = \frac{\ln(P_{\text{good}}/P_{\text{bad}})}{\ln 2} = \frac{w_1x_1 + w_2x_2 + w_3x_3 + \cdots}{\ln 2}$$

So once the coefficient of each feature has been fitted, a weighted sum of the sample's feature values yields the customer's standardized credit score.
- The base score is 650 and the PDO (Points to Double the Odds) is 50; both values should be tuned to business needs. A minimal sketch of the mapping follows.
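A minimal sketch of this probability-to-score mapping; `prob_to_score` is an illustrative helper name, not part of any library:

```python
import numpy as np

def prob_to_score(p_bad, base=650, pdo=50):
    """Map a predicted probability of default to a credit score.

    base: score at odds of 1:1; pdo: points added each time the
    good:bad odds double. Both are business choices.
    """
    odds = (1 - p_bad) / p_bad          # P(good) / P(bad)
    return base + pdo * np.log2(odds)

print(prob_to_score(0.5))    # odds 1:1 -> 650.0
print(prob_to_score(1 / 3))  # odds 2:1 -> 700.0
print(prob_to_score(0.2))    # odds 4:1 -> 750.0
```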
2.2 Logistic Regression Scorecard
```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import numpy as np
import random
import math
```
- Load the data

```python
data = pd.read_csv('data/Bcard.txt')
data.head()
```
Output:

| | obs_mth | bad_ind | uid | td_score | jxl_score | mj_score | rh_score | zzc_score | zcx_score | person_info | finance_info | credit_info | act_info |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-10-31 | 0.0 | A10000005 | 0.675349 | 0.144072 | 0.186899 | 0.483640 | 0.928328 | 0.369644 | -0.322581 | 0.023810 | 0.00 | 0.217949 |
| 1 | 2018-07-31 | 0.0 | A1000002 | 0.825269 | 0.398688 | 0.139396 | 0.843725 | 0.605194 | 0.406122 | -0.128677 | 0.023810 | 0.00 | 0.423077 |
| 2 | 2018-09-30 | 0.0 | A1000011 | 0.315406 | 0.629745 | 0.535854 | 0.197392 | 0.614416 | 0.320731 | 0.062660 | 0.023810 | 0.10 | 0.448718 |
| 3 | 2018-07-31 | 0.0 | A10000481 | 0.002386 | 0.609360 | 0.366081 | 0.342243 | 0.870006 | 0.288692 | 0.078853 | 0.071429 | 0.05 | 0.179487 |
| 4 | 2018-07-31 | 0.0 | A1000069 | 0.406310 | 0.405352 | 0.783015 | 0.563953 | 0.715454 | 0.512554 | -0.261014 | 0.023810 | 0.00 | 0.423077 |
- Data fields:
  - bad_ind is the label
  - External credit scores: td_score, jxl_score, mj_score, rh_score, zzc_score, zcx_score
  - Internal data: person_info, finance_info, credit_info, act_info
  - obs_mth: the last day of the month in which the application was made (the data has been preprocessed so that all dates are normalized to month end)
- Check the distribution of months; the last month will serve as the out-of-time validation set.

```python
data.obs_mth.unique()
```

Output:

```
array(['2018-10-31', '2018-07-31', '2018-09-30', '2018-06-30',
       '2018-11-30'], dtype=object)
```
- Split off the training data and the validation data (out-of-time sample)

```python
train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()
```
- Select the features used for modeling

```python
# Features ending in "_info" are outputs of the in-house unsupervised system
# describing the user's own behavior; features ending in "_score" are paid
# external credit bureau data.
feature_lst = ['td_score', 'jxl_score', 'mj_score', 'rh_score', 'zzc_score', 'zcx_score',
               'person_info', 'finance_info', 'credit_info', 'act_info']
```
- Train the model

```python
x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']
lr_model = LogisticRegression(C=0.1)
lr_model.fit(x, y)
```

Output:

```
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr',
                   n_jobs=1, penalty='l2', random_state=None,
                   solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
```
- Model evaluation
- ROC curve: with FPR on the x axis and TPR on the y axis, it traces how TPR changes with FPR as the classification threshold decreases.
  - y axis: TPR = TP / (TP + FN), the probability of classifying a positive correctly, i.e., recall.
  - x axis: FPR = FP / (FP + TN), the probability of misclassifying a negative, i.e., the share of actual 0s predicted as 1 (a small numeric sketch follows this list).
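To make the two definitions concrete, a small sketch with made-up labels; for binary labels, sklearn's `confusion_matrix` returns `[[TN, FP], [FN, TP]]`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1, 0])   # predictions at some threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print('TPR =', tp / (tp + fn))          # 2/3: recall of the positives
print('FPR =', fp / (fp + tn))          # 1/3: share of 0s flagged as 1
```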
- KS value
  - Plotting steps:
    - Sort the samples by the predicted probability of the positive class (note: the probability, not the 0/1 prediction), from high to low; this is the order in which cut-off points are taken.
    - Take the cut-offs in order and compute TPR and FPR at each one (alternatively, take only n cut-offs, at the 1/n, 2/n, 3/n, ... positions).
    - Plot the cumulative sample percentage on the x axis (up to 100%) and TPR and FPR on the y axis; this gives the KS curve, with KS = max(TPR - FPR).
  - KS = max(TPR - FPR): the position where the TPR and FPR curves are farthest apart is the best cut-off point, and that maximum gap is the KS value. A KS above 0.2 is usually taken to indicate reasonably good predictive accuracy.
- Plot the ROC curve and compute KS

```python
y_pred = lr_model.predict_proba(x)[:, 1]                # predicted probabilities on the training set
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)    # training-set FPR and TPR
train_ks = abs(fpr_lr_train - tpr_lr_train).max()       # training-set KS
print('train_ks : ', train_ks)

y_pred = lr_model.predict_proba(val_x)[:, 1]            # predicted probabilities on the validation set
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)            # validation-set FPR and TPR
val_ks = abs(fpr_lr - tpr_lr).max()                     # validation-set KS
print('val_ks : ', val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR')  # training-set ROC
plt.plot(fpr_lr, tpr_lr, label='evl LR')                # validation-set ROC
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
```

Output:

```
train_ks :  0.4151676259891534
val_ks :  0.3856283523530577
```
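The notes import `roc_auc_score` but never print AUC; since AUC was introduced alongside KS in section 1.3, a short sketch checking it on the same fitted model:

```python
# Sketch: AUC on the training and validation sets, reusing the fitted lr_model.
print('train_auc : ', roc_auc_score(y, lr_model.predict_proba(x)[:, 1]))
print('val_auc : ', roc_auc_score(val_y, lr_model.predict_proba(val_x)[:, 1]))
```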
- Feature selection with LightGBM

```python
# lightgbm version 3.0.0
import lightgbm as lgb
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=0, test_size=0.2)

def lgb_test(train_x, train_y, test_x, test_y):
    clf = lgb.LGBMClassifier(boosting_type='gbdt',
                             objective='binary',
                             metric='auc',
                             learning_rate=0.1,
                             n_estimators=24,
                             max_depth=5,
                             num_leaves=20,
                             max_bin=45,
                             min_data_in_leaf=6,
                             bagging_fraction=0.6,
                             bagging_freq=0,
                             feature_fraction=0.8)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (test_x, test_y)],
            eval_metric='auc')
    return clf, clf.best_score_['valid_1']['auc']

lgb_model, lgb_auc = lgb_test(train_x, train_y, test_x, test_y)
feature_importance = pd.DataFrame({'name': lgb_model.booster_.feature_name(),
                                   'importance': lgb_model.feature_importances_}).sort_values(by=['importance'], ascending=False)
```
Output:

```
[1]   training's auc: 0.759467   valid_1's auc: 0.753322
[2]   training's auc: 0.809023   valid_1's auc: 0.805658
[3]   training's auc: 0.809328   valid_1's auc: 0.803858
[4]   training's auc: 0.810298   valid_1's auc: 0.801355
[5]   training's auc: 0.814873   valid_1's auc: 0.807356
[6]   training's auc: 0.816492   valid_1's auc: 0.809279
[7]   training's auc: 0.820213   valid_1's auc: 0.809208
[8]   training's auc: 0.823931   valid_1's auc: 0.812081
[9]   training's auc: 0.82696    valid_1's auc: 0.81453
[10]  training's auc: 0.827882   valid_1's auc: 0.813428
[11]  training's auc: 0.828881   valid_1's auc: 0.814226
[12]  training's auc: 0.829577   valid_1's auc: 0.813749
[13]  training's auc: 0.830406   valid_1's auc: 0.813156
[14]  training's auc: 0.830843   valid_1's auc: 0.812973
[15]  training's auc: 0.831587   valid_1's auc: 0.813501
[16]  training's auc: 0.831898   valid_1's auc: 0.813611
[17]  training's auc: 0.833751   valid_1's auc: 0.81393
[18]  training's auc: 0.834139   valid_1's auc: 0.814532
[19]  training's auc: 0.835177   valid_1's auc: 0.815209
[20]  training's auc: 0.837368   valid_1's auc: 0.815205
[21]  training's auc: 0.837946   valid_1's auc: 0.815099
[22]  training's auc: 0.839585   valid_1's auc: 0.815602
[23]  training's auc: 0.840781   valid_1's auc: 0.816105
[24]  training's auc: 0.841174   valid_1's auc: 0.816869
```

feature_importance:

```
           name  importance
6   person_info          65
8   credit_info          57
9      act_info          55
7  finance_info          50
4     zzc_score          46
5     zcx_score          44
2      mj_score          39
0      td_score          34
3      rh_score          34
1     jxl_score          32
```
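The next step keeps only the most important features. Selecting them programmatically from the importance table is one option; this sketch is illustrative, and the top-4 cut-off is a judgment call, not from the original notes:

```python
# Keep the four most important features from the LightGBM ranking.
feature_lst = feature_importance.head(4)['name'].tolist()
print(feature_lst)  # ['person_info', 'credit_info', 'act_info', 'finance_info']
```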
- Model tuning: drop the weaker features and retrain

```python
# The new, reduced feature list
feature_lst = ['person_info', 'finance_info', 'credit_info', 'act_info']
x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']

lr_model = LogisticRegression(C=0.1)
lr_model.fit(x, y)

y_pred = lr_model.predict_proba(x)[:, 1]
fpr_lr_train, tpr_lr_train, _ = roc_curve(y, y_pred)
train_ks = abs(fpr_lr_train - tpr_lr_train).max()
print('train_ks : ', train_ks)

y_pred = lr_model.predict_proba(val_x)[:, 1]
fpr_lr, tpr_lr, _ = roc_curve(val_y, y_pred)
val_ks = abs(fpr_lr - tpr_lr).max()
print('val_ks : ', val_ks)

from matplotlib import pyplot as plt
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR')
plt.plot(fpr_lr, tpr_lr, label='evl LR')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
```
Output:

```
train_ks :  0.41573985983413414
val_ks :  0.3928959732014397
```
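To close the loop with section 2.1, a hedged sketch mapping the retrained model's probabilities to scorecard scores, using the base score of 650 and PDO of 50 from earlier:

```python
# Sketch: scorecard scores on the validation set via the 650/50 mapping.
p_bad = lr_model.predict_proba(val_x)[:, 1]
score = 650 + 50 * np.log2((1 - p_bad) / p_bad)
print(pd.Series(score).describe())
```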