计算IV和WOE

Database and Ruby, Python, History


半年前做过一个项目,中间涉及到计算 WOE 值,然后自己手撕了一次,过了半年就看不懂了,所以老老实实记录一下 WOE 的来由。

WOE 和 IV

IV 指信息量,用来衡量各个特征对标签 y 的预测能力,而计算 IV 是需要 WOE。WOE 是指每个离散值(或者连续值分箱后转换成离散值)的权重。

示例

准备数据

In [1]: import pandas as pd
   ...: import numpy as np
   ...: from random import random, randrange

In [2]: data = [[] for _ in range(1000)]
   ...: for i in range(1000):
   ...:     age = randrange(100)
   ...:     if age < 10:
   ...:         label = 1 if random() <= 0.25 else 0
   ...:     elif age < 18:
   ...:         label = 1 if random() <= 0.1 else 0
   ...:     elif age < 35:
   ...:         label = 1 if random() <= 0.025 else 0
   ...:     elif age < 50:
   ...:         label = 1 if random() <= 0.075 else 0
   ...:     else:
   ...:         label = 1 if random() <= 0.05 else 0
   ...:
   ...:     data[i].extend([randrange(100), label])
   ...:

In [3]: df = pd.DataFrame(data = data, columns = ['age', 'y'])

In [4]: df.head(10)
Out[4]:
   age  y
0   16  0
1   78  0
2   68  0
3   46  0
4   85  0
5   60  0
6   77  0
7   73  0
8   51  0
9   24  0

计算每个分箱内 0 和 1 的个数,以及总体的个数。


In [5]: cut_points = [0, 10, 18, 35, 50, np.inf]

In [6]: pd.cut(df['age'], cut_points)
Out[6]:
0      (10.0, 18.0]
1       (50.0, inf]
2       (50.0, inf]
3      (35.0, 50.0]
4       (50.0, inf]
           ...
995    (10.0, 18.0]
996    (35.0, 50.0]
997    (35.0, 50.0]
998     (0.0, 10.0]
999     (50.0, inf]
Name: age, Length: 1000, dtype: category
Categories (5, interval[float64, right]): [(0.0, 10.0] < (10.0, 18.0] < (18.0, 35.0] <
                                           (35.0, 50.0] < (50.0, inf]]

In [7]: pd.crosstab(pd.cut(df['age'], bins=cut_points), df['y'])
Out[7]:
y               0   1
age
(0.0, 10.0]    81   6
(10.0, 18.0]   76   7
(18.0, 35.0]  147  16
(35.0, 50.0]  136  17
(50.0, inf]   468  35

In [8]: df['y'].value_counts()
Out[8]:
y
0    919
1     81
Name: count, dtype: int64

最后计算每个分箱 WOE 值


In [9]: gi = pd.crosstab(pd.cut(df['age'], bins=cut_points), df['y'])
   ...: gb = df['y'].value_counts()
   ...: gbi = ( gi[0] / ( gi[1]+(1e-9) ) ) / ( gb[0] / ( gb[1]+(1e-9) ) )

In [10]: gbi
Out[10]:
age
(0.0, 10.0]     1.189880
(10.0, 18.0]    0.956941
(18.0, 35.0]    0.809780
(35.0, 50.0]    0.705114
(50.0, inf]     1.178548
dtype: float64

In [11]: woe = [round(i, 8) for i in np.log(gbi).values]
Out[11]: [0.17385272, -0.04401378, -0.2109931, -0.34939543, 0.16428327]

再计算每个分箱的 IV 和整体 IV

In [12]: iv = (gi[1]/gb[1] - gi[0]/gb[0])*woe
    ...: iv
Out[12]:
age
(0.0, 10.0]    -0.002445
(10.0, 18.0]   -0.000164
(18.0, 35.0]   -0.007928
(35.0, 50.0]   -0.021624
(50.0, inf]    -0.012675
dtype: float64

In [13]: iv.sum()
Out[13]: -0.04483546099786402

因为 IV 是小于 0.02 的,所以这个特征对于标签来说,没有什么预测效果。

IV 范围 预测效果
小于 0.02 几乎没有
0.02~0.1
0.1~0.3 中等
0.3~0.5
大于 0.5 难以置信

整理成一个函数

def woe(X, y, c, cut_points):
    gi = pd.crosstab(pd.cut(X[c], bins=cut_points), y)
    gb = pd.Series(data=y).value_counts()
    gbi = ( gi[0] / ( gi[1]+(1e-9) ) ) / ( gb[0] / ( gb[1]+(1e-9) ) )

    return [round(i, 8) for i in np.log(gbi).values]