半年前做过一个项目,中间涉及到计算 WOE 值,然后自己手撕了一次,过了半年就看不懂了,所以老老实实记录一下 WOE 的来由。
WOE 和 IV
IV 指信息量,用来衡量各个特征对标签 y 的预测能力,而计算 IV 是需要 WOE。WOE 是指每个离散值(或者连续值分箱后转换成离散值)的权重。
示例
准备数据
In [1]: import pandas as pd
...: import numpy as np
...: from random import random, randrange
In [2]: data = [[] for _ in range(1000)]
...: for i in range(1000):
...: age = randrange(100)
...: if age < 10:
...: label = 1 if random() <= 0.25 else 0
...: elif age < 18:
...: label = 1 if random() <= 0.1 else 0
...: elif age < 35:
...: label = 1 if random() <= 0.025 else 0
...: elif age < 50:
...: label = 1 if random() <= 0.075 else 0
...: else:
...: label = 1 if random() <= 0.05 else 0
...:
...: data[i].extend([randrange(100), label])
...:
In [3]: df = pd.DataFrame(data = data, columns = ['age', 'y'])
In [4]: df.head(10)
Out[4]:
age y
0 16 0
1 78 0
2 68 0
3 46 0
4 85 0
5 60 0
6 77 0
7 73 0
8 51 0
9 24 0
计算每个分箱内 0 和 1 的个数,以及总体的个数。
In [5]: cut_points = [0, 10, 18, 35, 50, np.inf]
In [6]: pd.cut(df['age'], cut_points)
Out[6]:
0 (10.0, 18.0]
1 (50.0, inf]
2 (50.0, inf]
3 (35.0, 50.0]
4 (50.0, inf]
...
995 (10.0, 18.0]
996 (35.0, 50.0]
997 (35.0, 50.0]
998 (0.0, 10.0]
999 (50.0, inf]
Name: age, Length: 1000, dtype: category
Categories (5, interval[float64, right]): [(0.0, 10.0] < (10.0, 18.0] < (18.0, 35.0] <
(35.0, 50.0] < (50.0, inf]]
In [7]: pd.crosstab(pd.cut(df['age'], bins=cut_points), df['y'])
Out[7]:
y 0 1
age
(0.0, 10.0] 81 6
(10.0, 18.0] 76 7
(18.0, 35.0] 147 16
(35.0, 50.0] 136 17
(50.0, inf] 468 35
In [8]: df['y'].value_counts()
Out[8]:
y
0 919
1 81
Name: count, dtype: int64
最后计算每个分箱 WOE 值
In [9]: gi = pd.crosstab(pd.cut(df['age'], bins=cut_points), df['y'])
...: gb = df['y'].value_counts()
...: gbi = ( gi[0] / ( gi[1]+(1e-9) ) ) / ( gb[0] / ( gb[1]+(1e-9) ) )
In [10]: gbi
Out[10]:
age
(0.0, 10.0] 1.189880
(10.0, 18.0] 0.956941
(18.0, 35.0] 0.809780
(35.0, 50.0] 0.705114
(50.0, inf] 1.178548
dtype: float64
In [11]: woe = [round(i, 8) for i in np.log(gbi).values]
Out[11]: [0.17385272, -0.04401378, -0.2109931, -0.34939543, 0.16428327]
再计算每个分箱的 IV 和整体 IV
In [12]: iv = (gi[1]/gb[1] - gi[0]/gb[0])*woe
...: iv
Out[12]:
age
(0.0, 10.0] -0.002445
(10.0, 18.0] -0.000164
(18.0, 35.0] -0.007928
(35.0, 50.0] -0.021624
(50.0, inf] -0.012675
dtype: float64
In [13]: iv.sum()
Out[13]: -0.04483546099786402
因为 IV 是小于 0.02 的,所以这个特征对于标签来说,没有什么预测效果。
IV 范围 | 预测效果 |
---|---|
小于 0.02 | 几乎没有 |
0.02~0.1 | 弱 |
0.1~0.3 | 中等 |
0.3~0.5 | 强 |
大于 0.5 | 难以置信 |
整理成一个函数
def woe(X, y, c, cut_points):
gi = pd.crosstab(pd.cut(X[c], bins=cut_points), y)
gb = pd.Series(data=y).value_counts()
gbi = ( gi[0] / ( gi[1]+(1e-9) ) ) / ( gb[0] / ( gb[1]+(1e-9) ) )
return [round(i, 8) for i in np.log(gbi).values]