Banking Case Study Example 3: Logistic Regression

2020-07-22 08:37 Author: python风控模型

Python financial risk-control scorecard modeling and data analysis micro-course: http://dwz.date/b9vv

Reference: http://ucanalytics.com/blogs/case-study-example-banking-logistic-regression-3/

The Beautiful Formula

The Beautiful Formula – by Roopam

Mathematicians often hold competitions to pick the most beautiful formula of all. First place, almost every time, goes to the formula discovered by Leonhard Euler, displayed below.

e^{i\pi} + 1 = 0

This formula is phenomenal because it combines the five most important constants in mathematics, i.e.

0 : Additive identity
1 : Multiplicative identity
π : King of geometry and trigonometry
i : King of complex algebra
e : King of logarithms

It is just beautiful how such a simple equation links these fundamental constants of mathematics. I was mesmerized when I learned this formula of Euler's in high school, and I still am. Euler is also responsible for introducing the symbol e (our king of logarithms), which is also known as Euler's number. The name is an apt choice for another reason: Euler is considered the most prolific mathematician of all time. He used to produce novel mathematics at an exponential rate. This is particularly startling since Euler was partially blind for more than half his life and completely blind for around the last two decades of it. Incidentally, he was producing a high-quality scientific paper a week for a significant period when he was completely blind.

Today, before we discuss logistic regression, we must pay tribute to the great man, Leonhard Euler, as Euler's number (e) forms the core of logistic regression.

Case Study Example – Banking

In our last two articles (Part 1 and Part 2), you played the role of the Chief Risk Officer (CRO) for CyndiCat bank. The bank had disbursed 60816 auto loans in the quarter April–June 2012, and you had noticed a bad rate of around 2.5%. You did some exploratory data analysis (EDA) using data visualization tools and found that both age (Part 1) and FOIR (Part 2) are related to bad rates. Now you want to create a simple logistic regression model with age as the only variable. If you recall, you observed the following normalized histogram for age overlaid with bad rates.

We shall use this plot to create the coarse classes for a simple logistic regression. However, the goal here is to learn the nuances of logistic regression, so let us first go through some of its basic concepts.


Logistic regression

In a previous article (Logistic Regression), we discussed some aspects of logistic regression. Let me reuse a picture from that article. I recommend reading it, as it will help in understanding some of the concepts mentioned here.


Logistic Regression

In our case, Z is a function of age, and we define the probability of a bad loan as follows.


P(Bad Loan)=\frac{e^{Z}}{1+e^{Z}}=\frac{e^{\beta \times Age+Constant}}{1+e^{\beta \times Age+Constant}}=\frac{odds}{1+odds}

You must have noticed the role of Euler's number in logistic regression. P(Bad Loan) approaches 0 as Z → –∞ and 1 as Z → +∞, which keeps the probability within the bounds of 0 and 1 no matter how extreme Z becomes.
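The bounding behaviour described above can be checked numerically. The following is a minimal sketch, not code from the original article; the function name `bad_loan_probability` and the sample values of Z are illustrative assumptions.

```python
import math


def bad_loan_probability(z: float) -> float:
    """Logistic function e^z / (1 + e^z), arranged to avoid overflow."""
    if z >= 0:
        # Algebraically identical form that keeps the exponent negative.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative values of Z (assumed, not taken from the article).
for z in (-20.0, -4.0, 0.0, 4.0, 20.0):
    print(f"Z = {z:6.1f}  P(Bad Loan) = {bad_loan_probability(z):.6f}")
```

However large or small Z gets, the printed probability stays between 0 and 1, and Z = 0 gives exactly 0.5.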

Additionally, we know that the probability of a good loan is one minus the probability of a bad loan, hence:

P(Good Loan)=1-P(Bad Loan)=\frac{1}{1+e^{\beta \times Age+Constant}}

If you have ever indulged in betting of any sort, you know that bets are placed in terms of odds. Mathematically, odds are defined as the probability of winning divided by the probability of losing. If we calculate the odds for our problem, we get the following equation.


\frac{P(Bad Loan)}{P(Good Loan)}={e^{\beta \times Age+Constant}}
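The odds equation above can be verified with a few lines of arithmetic. This is a small sketch under assumed values: the helper name `odds_from_probability` and the choice Z = -3.0 are hypothetical and only serve the illustration.

```python
import math


def odds_from_probability(p_bad: float) -> float:
    """Odds of a bad loan: P(Bad Loan) / P(Good Loan)."""
    return p_bad / (1.0 - p_bad)


z = -3.0  # arbitrary illustrative value of Z, not from the article
p_bad = math.exp(z) / (1.0 + math.exp(z))
odds = odds_from_probability(p_bad)
log_odds = math.log(odds)

print(f"P(Bad Loan) = {p_bad:.6f}")
print(f"odds        = {odds:.6f}  (e^Z = {math.exp(z):.6f})")
print(f"ln(odds)    = {log_odds:.6f}  (Z = {z})")
```

The odds come out equal to e^Z, so the log of the odds is simply Z. This is why logistic regression coefficients are read on the log-odds scale.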

Here Euler's number stands out in all its majesty.


Coarse Classing

Now, let us create coarse classes for age groups from the data set we saw in the first article of this series. Coarse classes are formed by combining groups that have similar bad rates while maintaining the overall trend in bad rates. We have done this for the age groups as shown below.


Table 1 – Coarse Class

We will use the above four coarse classes to run our logistic regression algorithm. As discussed in the earlier article, the algorithm tries to find the coefficients of Z that best fit the data. In our case, Z is a linear combination of the age-group indicators, i.e. Z = β1×G1 + β2×G2 + β3×G3 + Constant. You must have noticed that we have not used G4 in this equation. This is because the constant absorbs the information for G4, which makes G4 the baseline group. This is similar to using dummy variables in linear regression. If you want to learn more about this, you could post your questions on this blog and we can discuss it further.
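The dummy-variable idea described above can be sketched in a few lines. The function names and the placeholder coefficients below are assumptions made for illustration only; they are not the fitted values from the article.

```python
GROUPS = ("G1", "G2", "G3", "G4")
BASELINE = "G4"  # the group whose information is absorbed by the constant


def dummy_code(group: str) -> dict:
    """Return 0/1 indicators for every non-baseline group."""
    if group not in GROUPS:
        raise ValueError(f"unknown group: {group}")
    return {g: int(g == group) for g in GROUPS if g != BASELINE}


def linear_predictor(group: str, betas: dict, constant: float) -> float:
    """Z = beta1*G1 + beta2*G2 + beta3*G3 + Constant."""
    indicators = dummy_code(group)
    return constant + sum(betas[g] * indicators[g] for g in betas)


# Placeholder coefficients (hypothetical, for illustration only).
example_betas = {"G1": 1.0, "G2": 0.5, "G3": 0.25}
example_constant = -4.0

for group in GROUPS:
    z = linear_predictor(group, example_betas, example_constant)
    print(group, dummy_code(group), z)
```

For the baseline group G4 all three indicators are 0, so Z reduces to the constant alone. That is the sense in which the constant "absorbs" G4.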


Logistic Regression

Now we are all set to estimate our final logistic regression with a statistical program, using the following equation.


\frac{P(Bad Loan)}{P(Good Loan)}=e^{\beta _{1}\times G_{1}+\beta _{2}\times G_{2}+\beta _{3}\times G_{3}+Constant}

You could use either commercial software (SAS, SPSS or Minitab) or open-source software (R) for this purpose. They will all generate a table similar to the one shown below:


Let us quickly decipher this table and understand how the coefficients are estimated. Look at the last column in this table, i.e. Odds Ratio. How did the software arrive at the value of 3.07 for G1? The odds (bad loans/good loans) for G1 are 206/4615 = 4.46% (refer to Table 1 – Coarse Class above). The odds for G4 (the baseline group) are 183/12605 = 1.45%. The odds ratio is the ratio of these two numbers: 4.46%/1.45% ≈ 3.07 (computed from the unrounded odds). Now take the natural log of this odds ratio, i.e. ln(3.0746) ≈ 1.123; this is the coefficient (β1) for G1. Similarly, you could find the coefficients for G2 and G3. Try it with your calculator!
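The calculator exercise above can also be reproduced in Python. The counts 206/4615 and 183/12605 come from the text; the variable names are illustrative. The last line is a consistency check rather than something the article states explicitly: with G4 as the baseline, the constant should equal the log of the G4 odds.

```python
import math

# Bad and good loan counts quoted in the text for G1 and the baseline G4.
bad_g1, good_g1 = 206, 4615
bad_g4, good_g4 = 183, 12605

odds_g1 = bad_g1 / good_g1
odds_g4 = bad_g4 / good_g4
odds_ratio_g1 = odds_g1 / odds_g4

beta_g1 = math.log(odds_ratio_g1)  # coefficient for G1
constant = math.log(odds_g4)       # log-odds of the baseline group

print(f"odds G1       = {odds_g1:.4%}")
print(f"odds G4       = {odds_g4:.4%}")
print(f"odds ratio G1 = {odds_ratio_g1:.2f}")
print(f"beta G1       = {beta_g1:.3f}")
print(f"constant      = {constant:.3f}")
```

The same two steps (divide by the G4 odds, then take the natural log) would give the G2 and G3 coefficients once their counts are read from Table 1.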

These coefficients are the β values in our original equation, and hence the equation will look like the following:


(Rough calculator check with rounded percentages: 3.5/1.4 = 2.5 and 2.4/1.4 ≈ 1.71.)


\frac{P(Bad Loan)}{P(Good Loan)}=e^{1.123\times G_{1}+0.909\times G_{2}+0.508\times G_{3}-4.232}

Remember, G1, G2 and G3 can only take values of either 0 or 1. Additionally, since they are mutually exclusive, when any one of them is 1 the others automatically become 0. If you set G1 = 1, the equation takes the following form.

\frac{P(Bad Loan)}{P(Good Loan)}=e^{1.123\times 1+0.909\times 0+0.508\times 0-4.232}=e^{-3.109}\approx 0.0446


Similarly, we can find the estimated bad rate for G1:

P(Bad Loan)=\frac{e^{-3.109}}{1+e^{-3.109}}\approx 0.0427=4.27\%

This is precisely the value we observed (206 bad loans out of 206 + 4615 = 4821 loans in G1). Hence, the logistic regression is doing a good job of estimating the bad rate. Great! We have just created our first model.
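To see the fitted equation at work for every group, here is a short sketch that plugs the coefficients quoted above (1.123, 0.909, 0.508 and the constant -4.232) into the model. The function names are illustrative, and the observed G1 bad rate is recomputed from the 206 and 4615 counts given earlier.

```python
import math

# Coefficients from the fitted equation in the text.
BETAS = {"G1": 1.123, "G2": 0.909, "G3": 0.508}
CONSTANT = -4.232


def predicted_odds(group: str) -> float:
    """Odds of a bad loan; the baseline G4 contributes only the constant."""
    z = CONSTANT + BETAS.get(group, 0.0)
    return math.exp(z)


def predicted_bad_rate(group: str) -> float:
    """Convert odds back to a probability: odds / (1 + odds)."""
    odds = predicted_odds(group)
    return odds / (1.0 + odds)


for group in ("G1", "G2", "G3", "G4"):
    print(f"{group}: odds = {predicted_odds(group):.4f}, "
          f"bad rate = {predicted_bad_rate(group):.2%}")

# Observed bad rate for G1 from the counts in the text.
observed_g1 = 206 / (206 + 4615)
print(f"observed G1 bad rate = {observed_g1:.2%}")
```

The predicted and observed G1 bad rates agree to within rounding of the coefficients, which is expected here because the model has one parameter per coarse class.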

Sign-off Note

Euler, though blind, showed us the way to come this far! Let me also reveal one more fact about the most beautiful formulae we discussed at the beginning of this article. In the top five places, you will find two more formulae discovered by Leonhard Euler. That is 3 of the 5 most beautiful formulae. Wow! I guess we need to redefine blind.

To learn more about Leonhard Euler, watch the following YouTube video by William Dunham (Video).


Uploader's WeChat official account: pythonEducation

Blogger's online school homepage: http://dwz.date/bwes
