
Banking Case Study Example 4: IV and WOE

2020-07-22 08:49 Author: python风控模型



http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/

This is a continuation of our banking case study on scorecard development. In this part, we will discuss information value (IV) and weight of evidence (WOE). These concepts are useful for variable selection while developing credit scorecards. We will also learn how to use WOE in logistic regression modeling. The previous three parts are available at the following links: (Part 1), (Part 2) & (Part 3).


Experts in Expensive Suits

Mr. Expert - by Roopam

A couple of weeks ago I was watching this show called ‘Brain Games’ on the National Geographic Channel. In one of the segments, they had a comedian dressed up as a television news reporter. He had a whole television camera crew along with him. He was informing the people coming out of a mall in California that Texas has decided to form an independent country, not part of the United States. Additionally, while on camera he was asking for their opinion on the matter. After the initial amusement, people took him seriously and started giving their serious viewpoints. This is the phenomenon psychologists describe as ‘expert fallacy’ or obeying authority, no matter how irrational the authorities seem. Later after learning the truth, the people on this show agreed that they believed this comedian because he was in an expensive suit with a TV crew.

Nate Silver, in his book The Signal and the Noise, described a similar phenomenon. He analyzed the forecasts made by the panel of experts on the TV program The McLaughlin Group. The forecasts turned out to be true in only 50% of cases; you could have forecast the same by tossing a coin. We do take experts in expensive suits seriously, don’t we? These are not one-off examples. Men in suits or uniforms come in all different forms, from army generals to security personnel in malls. We take them all very seriously.

We have just discovered that rather than accept an expert’s opinion, it would be better to look at the value of the information and make decisions oneself. Let us continue with the theme and explore how to assign a value to information using information value and weight of evidence. Then we will create a simple logistic regression model using WOE (weight of evidence). However, before that, let us recap the case study we are working on.


Case Study Continues ..

This is a continuation of our case study on CyndiCat bank. The bank had disbursed 60816 auto loans with a bad rate of around 2.5% in the quarter April–June 2012. We did some exploratory data analysis (EDA) using data visualization tools in the first two parts (Part 1) & (Part 2). In the previous article, we developed a simple logistic regression model with just age as the variable (Part 3). This time, we will continue from where we left off and use the weight of evidence (WOE) for age to develop a new model. Additionally, we will explore the predictive power of the variable (age) through information value.


Information Value (IV) and Weight of Evidence (WOE)

Information value is a very useful concept for variable selection during model building. The roots of information value, I think, lie in the information theory proposed by Claude Shannon. The reason for my belief is the similarity between information value and entropy, a widely used concept in information theory. The chi-square value, an extensively used measure in statistics, is a good alternative to IV (information value). However, IV is a popular and widely used measure in the industry. The reason is that some very convenient rules of thumb for variable selection are associated with IV; these are really handy, as you will discover later in this article. The formula for information value is shown below.
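Written out in terms of the quantities used in the worked example later in this article (distribution good, DG, and distribution bad, DB, for each coarse class), the formula can be reconstructed as below. The index i and the class count n are notation introduced here for illustration.

```latex
\mathrm{IV} = \sum_{i=1}^{n} \left( \mathrm{DG}_i - \mathrm{DB}_i \right) \ln\!\left( \frac{\mathrm{DG}_i}{\mathrm{DB}_i} \right)
```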


What distribution good and distribution bad mean will become clear when we calculate IV for our case study. This is probably an opportune moment to define weight of evidence (WOE), which is the log component of information value.
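Consistent with the worked calculation later in this article, where WOE is computed as ln(DG/DB), the definition for a coarse class can be written as follows. The subscript i for the class is notation assumed here.

```latex
\mathrm{WOE}_i = \ln\!\left( \frac{\mathrm{DG}_i}{\mathrm{DB}_i} \right)
```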


Hence, IV can further be written as the following.
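With DG and DB denoting distribution good and distribution bad, and WOE the log of their ratio, a reconstruction of this form is given below. The index i over the n coarse classes is notation assumed here.

```latex
\mathrm{IV} = \sum_{i=1}^{n} \left( \mathrm{DG}_i - \mathrm{DB}_i \right) \times \mathrm{WOE}_i
```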


If you examine both information value and weight of evidence carefully, you will notice that both break down when either distribution good or distribution bad goes to zero: the ratio inside the logarithm becomes zero or undefined. A mathematician will hate it. The assumption, a fair one, is that this will never happen during scorecard development because of the reasonable sample size. A word of caution: if you are developing non-standardized scorecards with a smaller sample size, use IV carefully.
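The breakdown is easy to see in code. The following is a minimal Python sketch rather than part of the original case study: the function name `woe_or_none`, the choice to return `None` for an undefined class, and the small sample counts are all illustrative assumptions.

```python
import math

def woe_or_none(good, bad, total_good, total_bad):
    """Return ln(DG / DB) for one coarse class, or None when it is undefined."""
    if good == 0 or bad == 0:
        # DG = 0 gives ln(0); DB = 0 gives a division by zero.
        return None
    dist_good = good / total_good
    dist_bad = bad / total_bad
    return math.log(dist_good / dist_bad)

# Hypothetical small sample: 95 good and 5 bad loans in total.
print(woe_or_none(30, 2, 95, 5))   # well defined
print(woe_or_none(12, 0, 95, 5))   # None: no bad loans in this class
```

Returning `None` only flags the problem. How to treat such a class (merging it with a neighbour, for instance) is a modelling decision this sketch does not make.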


Back to the Case Study

In the previous article, we created coarse classes for the variable age in our case study. Now, let us calculate both information value and weight of evidence for these coarse classes.

Let us examine this table. Here, distribution of loans is the ratio of loans for a coarse class to total loans. For the group 21-30, this is 4821/60801 = 0.079. Similarly, distribution bad (DB) = 206/1522 = 0.135 and distribution good (DG) = 4615/59279 = 0.078. Additionally, DG - DB = 0.078 - 0.135 = -0.057. Further, WOE = ln(0.078/0.135) = -0.553.
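The arithmetic for the 21-30 group can be checked with a short Python sketch. The counts are the ones quoted above; the variable names are chosen here for illustration.

```python
import math

# Counts for the 21-30 coarse class and the portfolio totals, as quoted in the text.
good_21_30, bad_21_30 = 4615, 206
total_good, total_bad = 59279, 1522

dist_good = good_21_30 / total_good   # DG
dist_bad = bad_21_30 / total_bad      # DB
woe = math.log(dist_good / dist_bad)  # WOE = ln(DG / DB)

print(f"DG = {dist_good:.3f}, DB = {dist_bad:.3f}")
print(f"DG - DB = {dist_good - dist_bad:.3f}")
print(f"WOE = {woe:.3f}")
```

A negative WOE means the class holds a larger share of the bad loans than of the good loans.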



Finally, the IV component for this group is (-0.057) * (-0.553) = 0.0318. Similarly, calculate the IV components for all the other coarse classes. Adding these components produces an IV of 0.1093 (last column of the table). Now the question is how to interpret this value of IV. The answer is the rule of thumb described below.
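The summation over coarse classes can be sketched in Python as follows. Only the counts for the 21-30 group and the portfolio totals are quoted in the text, so this sketch collapses every other age class into a single "other" class. That is an assumption made purely for illustration, and the result is therefore not the 0.1093 obtained from the full table. The function name `information_value` is also chosen here.

```python
import math

def information_value(classes):
    """Sum (DG - DB) * ln(DG / DB) over coarse classes given as (label, good, bad)."""
    total_good = sum(good for _, good, _ in classes)
    total_bad = sum(bad for _, _, bad in classes)
    iv = 0.0
    for label, good, bad in classes:
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe = math.log(dist_good / dist_bad)
        component = (dist_good - dist_bad) * woe
        print(f"{label}: WOE = {woe:.3f}, IV component = {component:.4f}")
        iv += component
    return iv

# The 21-30 class uses the counts from the text; "other" is everything else,
# derived by subtracting those counts from the quoted totals.
classes = [
    ("21-30", 4615, 206),
    ("other", 59279 - 4615, 1522 - 206),
]
iv_two_classes = information_value(classes)
print(f"IV (two classes) = {iv_two_classes:.4f}")
```

Merging classes can only lower the information value, so this two-class figure sits below the IV of the full set of coarse classes.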


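Information Value    Predictive Power
< 0.02               Useless for prediction
0.02 to 0.1          Weak predictor
0.1 to 0.3           Medium predictor
0.3 to 0.5           Strong predictor
> 0.5                Suspicious or too good to be true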

Typically, variables with medium and strong predictive power are selected for model development. However, some schools of thought advocate using just the variables with medium IVs for broad-based model development. Notice that the information value for age is 0.1093, so it barely falls within the medium predictors’ range.
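The rule of thumb translates directly into a small lookup. This is a sketch: the function name `iv_predictive_power` is chosen here, and the rule of thumb does not say which band a value exactly on a boundary (0.02, 0.1, 0.3, 0.5) belongs to, so the assignment below is an assumption.

```python
def iv_predictive_power(iv):
    """Map an information value to its rule-of-thumb label."""
    if iv < 0.02:
        return "useless for prediction"
    if iv < 0.1:
        return "weak predictor"
    if iv < 0.3:
        return "medium predictor"
    if iv <= 0.5:
        return "strong predictor"
    return "suspicious or too good to be true"

print(iv_predictive_power(0.1093))  # medium predictor
```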


Logistic Regression with Weight of Evidence (WOE)

Finally, let us create a logistic regression model with the weight of evidence of the coarse classes as the value of the independent variable age. The following are the results generated by statistical software.
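To give a feel for what such a fit produces, here is a pure-Python sketch that fits a logistic regression of bad loans on WOE by Newton-Raphson. It is not the statistical software output. Only the 21-30 counts and the portfolio totals are quoted in this article, so all other ages are merged into one class, which is an assumption of this sketch.

```python
import math

# Grouped data as (WOE, number of loans, number of bad loans).
total_good, total_bad = 59279, 1522
groups = []
for good, bad in [(4615, 206), (total_good - 4615, total_bad - 206)]:
    woe = math.log((good / total_good) / (bad / total_bad))
    groups.append((woe, good + bad, bad))

intercept, slope = 0.0, 0.0
for _ in range(25):  # Newton-Raphson steps on the binomial log-likelihood
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, n, bad in groups:
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        resid = bad - n * p      # observed minus expected bad loans
        w = n * p * (1.0 - p)    # binomial variance weight
        g0 += resid
        g1 += resid * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    intercept += (h11 * g0 - h01 * g1) / det
    slope += (h00 * g1 - h01 * g0) / det

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"ln(total bad / total good) = {math.log(total_bad / total_good):.4f}")
```

In this setup the fitted slope comes out at -1 and the intercept equals the log of the overall bad-to-good odds. The same holds whenever WOE computed on the coarse classes is the only predictor and the model is fitted on the data used to compute it.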


Using the above information, we can estimate the bad rate for the age group 21-30. The result is precisely the value we obtained last time (see the previous part) and is consistent with the observed bad rate for the group, 206/4821 ≈ 4.3%.
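The estimate can be reproduced in a few lines of Python. The fitted coefficients are not quoted in the text, so this sketch assumes an intercept of ln(total bad / total good) and a slope of -1 on WOE. These are the values maximum likelihood yields when WOE on the same coarse classes is the only predictor, but they should be read as assumptions rather than as the software output.

```python
import math

# Counts quoted earlier in the article.
total_good, total_bad = 59279, 1522
good_21_30, bad_21_30 = 4615, 206

woe = math.log((good_21_30 / total_good) / (bad_21_30 / total_bad))

# Assumed coefficients (see the note above); not taken from the article's output.
intercept = math.log(total_bad / total_good)
slope = -1.0

logit = intercept + slope * woe
p_bad = 1.0 / (1.0 + math.exp(-logit))          # predicted bad rate
observed = bad_21_30 / (good_21_30 + bad_21_30)  # 206 / 4821

print(f"predicted bad rate = {p_bad:.4f}")
print(f"observed bad rate  = {observed:.4f}")
```

Under these assumptions the logit reduces algebraically to ln(206/4615), which is why the predicted and observed rates agree.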

Sign-off note

I wish we had an instrument similar to information value to estimate the value of information coming from so-called experts. Still, the next time an expert on a business channel advises you to buy a certain stock, take that advice with a pinch of salt.


Read the remaining part of credit scoring series

  • Part 1: Data visualization for scoring

  • Part 2: Creating ratio variables for better scoring

  • Part 3: Logistic regression

  • Part 5: Reject inference

  • Part 6: Population stability index for scorecard monitoring


