Credit Scorecards (parts 1-3)
Uploader's WeChat official account: pythonEducation

Credit Scorecards – Introduction (part 1 of 7)
http://ucanalytics.com/blogs/credit-scorecards-part-1/
Credit Scorecards in the Age of Credit Crisis
This incident took place at a friend’s party circa 2009, against the backdrop of the worst financial crisis the planet had seen for a long time. The average Joe on the street was aware of terms such as mortgage-backed securities (MBS), sub-prime lending and credit crisis – the reasons for his plight. Back to our party: I met an informed and compassionate elderly woman, and after a few minutes of chitchat the topic came to what I do for a living. At that point, I was working on a project to develop a credit scorecard for a leading mortgage lender in Mumbai. As I started explaining the details of my job, her expression changed from curiosity to angst and pain. Eventually, she interrupted and said – why would you do such a thing? Is this not the reason for all the mess? I was used to this reaction and had to correct her misconception.

Predictive Analytics: The lurking Danger – by Roopam
Credit or application scorecards can be excellent tools for both lender and borrower to work out the debt-servicing capability of the borrower. For lenders, scorecards can help assess the creditworthiness of the borrower and maintain a healthy portfolio – which will eventually influence the economy as a whole. For the borrower, they can provide valuable information, for example that 45% of people with her socio-economic background have struggled to keep up with the EMI (equated monthly installment) commitment. This could help the borrower make a well-informed decision before getting into a debt trap. Blaming science for reckless human behavior is not new. I believe any rigorous science with practical applications is like a sharp German blade: a master chef prepares delicious meals with it, and the irresponsible leave a deep and painful cut.
Scorecards and Predictive Analytics
In the following series, we will explore the practitioner’s approach to developing and maintaining a scorecard. At a very high level, credit scorecards have their roots in the classification problem in statistics & data mining. The classification problem offers an extremely broad methodology and thought process with multiple business applications. A few applications of the classification problem are:
• Application or credit scorecards to assess repayment risk of the borrower
• Image analytics of MRI to identify if the cancer is benign or malignant
• Behavioral models to identify the most probable future action of the customer
• Identification of potential drug targets in the protein structure
• Fraud detection models
• Sentiment analysis of Tweets and Facebook posts
• Cross/up sell propensity models
• Campaign response models
• Insurance ratings
For that matter, there are subtle links between credit scorecards and the other models mentioned above. The details of these models could be drastically different, but the underlying idea for all of them is linked to the classification problem. In this series, I shall focus on credit or application scorecard methodology but will try to bring in other scorecards and models whenever possible.

Credit Scoring: Development Stages of Credit Scorecard – by Roopam
Flow of Subsequent Articles
The flow of subsequent articles in the series will be as follows:
1. Classification problem and sampling
2. Variable selection and coarse classing
3. Predictive Models
4. Logistic regression and scorecards
5. Model validation
6. Application and business process integration
Books for Credit Scorecards
I have compiled a list of books you may find useful while learning about analytical scorecards. The first four of these books have more or less the same flow, with Anderson’s book (#4) a little more detailed. However, you could choose any one of these four books without losing much. The last book (#5) is a collection of articles and papers by practitioners and academics and is quite interesting.
1. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring – Naeem Siddiqi
2. Credit Scoring, Response Modeling, and Insurance Rating: A Practical Guide to Forecasting Consumer Behavior – Steven Finlay
3. Credit Scoring for Risk Managers: The Handbook for Lenders – Elizabeth Mays and Niall Lynas
4. The Credit Scoring Toolkit: Theory and Practice for Retail Credit Risk Management and Decision Automation – Raymond Anderson
5. Credit Risk Models – Elizabeth Mays
Sign-off Note
Look forward to sharing my views on predictive analytics and hearing back from you. See you soon with the second part of this series.
Credit Scorecards – Classification Problem (part 2 of 7)
http://ucanalytics.com/blogs/credit-scorecards-classification-problem-part-2/
Classification Problem in Statistics & Data Mining
I must say I was shocked when Amishi, a girl a little over three years old, announced that going forward she would only be friends with my wife and not me. Her reason for the breakup was that I am a boy and girls can only be friends with girls. She had learned this social norm from her friends at preschool. I still remember the way she modeled for me in her swimsuit and umbrella just a few months ago. She was aware of the boy-girl difference even then; it is just that she has learned this weird social norm now. The point here is that toddlers can distinguish genders without much effort. Nature has given us a built-in equation to classify gender at a mere glance with a high degree of precision. Imagine a similar mechanism to distinguish between good and bad borrowers. You are talking about every banker’s dream. However, evolution has trained us to mate, not to lend.

Predictive Analytics: Classification Problem – by Roopam
As I have mentioned in the previous article, scorecards have their roots in the classification problem in statistics and data mining. The idea with most classification problems is to create a mathematical equation to distinguish dichotomous variables. These variables can only take two values such as
• Male/ Female
• Good / Bad
• Yes / No
• God / Devil
• Happy / Sad
• Sales / No Sales
The list can go on until eternity. The reason why most business problems try to model dichotomies is that dichotomies are easy for us humans to comprehend. We must appreciate that dichotomies are never absolute and have degrees attached to them. For example, I am 80% good and 20% bad – at least I would like to believe this. I shall keep Pareto’s 80-20 principle away from this, i.e. the idea that my 20% bad is responsible for 80% of my behavior.
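The idea that a dichotomy carries a degree can be sketched in a few lines of Python. This is only an illustration, not the method of this series: it assumes the degree is expressed as a probability produced by a logistic function, and the names `logistic`, `classify` and the 0.5 cutoff are hypothetical choices made for the example.

```python
import math

def logistic(z: float) -> float:
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, cutoff: float = 0.5) -> str:
    """Turn the degree (a probability) back into a dichotomy.

    The cutoff of 0.5 is an assumption for illustration only.
    """
    return "good" if logistic(z) >= cutoff else "bad"

# ln(4) is the log-odds of 80:20, so this borrower is "80% good".
p_good = logistic(math.log(4))
label = classify(math.log(4))
```

The score `z` stays continuous; the good/bad label only appears once a cutoff is chosen, which is why the same model can serve lenders with different risk appetites.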
Credit Scorecards Development – Problem Statement & Sampling (the definition of a bad customer is flexible)
In the case of credit scorecards, the problem statement is to distinguish analytically between good and bad borrowers. Hence, the first task is to define a good and a bad borrower. For most loan products, good and bad credit is defined in the following way:
1. Good loan: never missed an EMI payment, or missed one only once
2. Bad loan: ever missed 3 consecutive EMIs (i.e. 90 days past due)
Additionally, for tagging someone good or bad, you need to observe his or her behavior for a significant length of time. This length of time varies from product to product based on the tenor of the loan. For home loans, with a tenor of 20 years, 2-3 years is a reasonable observation period.
However, there is nothing sacrosanct about the above definition, and it can be modified at the discretion of the analyst. Roll-rate analysis and vintage analysis are the two analytical tools you may want to consider while constructing the above definition.
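The good/bad definition above can be sketched as a small tagging function. This is a minimal sketch under stated assumptions: each loan is represented as a list of monthly flags (True = EMI paid, False = EMI missed) over the observation period, the function names are hypothetical, and the "indeterminate" tag for loans that fit neither definition is my own assumption, since the definition above does not say how to treat them.

```python
from typing import List

def longest_missed_streak(paid: List[bool]) -> int:
    """Length of the longest run of consecutive missed EMIs."""
    longest = current = 0
    for was_paid in paid:
        current = 0 if was_paid else current + 1
        longest = max(longest, current)
    return longest

def tag_loan(paid: List[bool], bad_streak: int = 3, good_max_missed: int = 1) -> str:
    """Tag a loan from its monthly payment history.

    bad:  ever missed `bad_streak` consecutive EMIs (90 days past due)
    good: missed at most `good_max_missed` EMIs in total
    indeterminate: everything in between (an assumption of this sketch)
    """
    if longest_missed_streak(paid) >= bad_streak:
        return "bad"
    if paid.count(False) <= good_max_missed:
        return "good"
    return "indeterminate"

clean_history = [True] * 24
defaulted_history = [True, False, False, False, True]
patchy_history = [False, True, False, True]
```

Keeping the thresholds as parameters reflects the point that the definition is at the analyst's discretion: changing `bad_streak` to 2 gives a 60-days-past-due definition without touching the logic.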
Sampling Strategy for Credit Scorecards
A few years ago, I did a daylong workshop on Statistical Inference for a large German shipping & cargo company in Mumbai. During the Q&A session, the Vice President of Operations asked a tricky question: what is a good sample size to achieve good precision? He was looking for a one-size-fits-all answer and I wish it were that simple. The sample size depends on the degree of similarity or homogeneity of the population in question. For example, what do you think is a good sample size to answer the following two questions?
1. What is the salinity of the Pacific Ocean?
2. Is there another planet with intelligent life in the Universe?
In terms of population size, the number of drops in the ocean and the number of planets in the Universe are similar. A couple of drops of water are enough to answer the first question since the salinity of oceans is fairly constant. On the other hand, the second question is a black swan problem. You may need to visit every single planet to rule out the possibility of an intelligent form of life.
For credit scorecard development, the accepted rule of thumb for sample size is at least 1000 records each of good and bad loans. There is no reason why you cannot build a scorecard with a smaller sample size (say 500 records). However, the analyst needs to be cautious in doing so because a higher degree of randomness creeps into a small data sample. Additionally, it is advisable to keep the sample window as short as possible during scorecard development, i.e. a financial quarter or two. Further, the sample is divided into two pieces – usually 70% for the development sample and the remaining 30% for the validation sample. We discuss the development and validation samples in detail in the subsequent sections of this series.
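The sampling rules above can be sketched as follows. This is a rough illustration rather than a prescribed procedure: it assumes each record is a dict with a "label" key holding "good" or "bad", the function names are hypothetical, and splitting goods and bads separately (a stratified split) is my own assumption to keep the bad rate the same in both pieces; the text only asks for a 70/30 division.

```python
import random
from typing import List, Sequence, Tuple

def meets_rule_of_thumb(records: Sequence[dict], minimum: int = 1000) -> bool:
    """Check for at least `minimum` good and `minimum` bad records."""
    goods = sum(1 for r in records if r["label"] == "good")
    bads = sum(1 for r in records if r["label"] == "bad")
    return goods >= minimum and bads >= minimum

def split_sample(records: Sequence[dict], dev_share: float = 0.7,
                 seed: int = 42) -> Tuple[List[dict], List[dict]]:
    """Shuffle goods and bads separately, then cut each at dev_share."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    development: List[dict] = []
    validation: List[dict] = []
    for label in ("good", "bad"):
        group = [r for r in records if r["label"] == label]
        rng.shuffle(group)
        cut = round(len(group) * dev_share)
        development.extend(group[:cut])
        validation.extend(group[cut:])
    return development, validation

# Synthetic records purely for illustration.
records = ([{"id": i, "label": "good"} for i in range(1000)]
           + [{"id": 1000 + i, "label": "bad"} for i in range(1000)])
development, validation = split_sample(records)
```

With 1000 goods and 1000 bads, the development sample holds 1400 records and the validation sample 600, each with the same mix of goods and bads.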

Credit Scorecard Development: Sampling Strategy – by Roopam
Sign-off Note
In the next article, we will discuss the important topics of variable selection and coarse classing for credit scorecards. See you soon.
Credit Scorecards – Variables Selection (part 3 of 7)
http://ucanalytics.com/blogs/credit-scorecards-variables-selection-part-3/
Variables Selection in Predictive Analytics

Predictive Analytics: Variables Selection – by Roopam
The following story goes back to the time when I just started my transition from physics to business. I met this investment banker* in his mid-thirties during a Friday night party. After gulping down a few pints of beer, his mood became a bit somber and he told me how he hates his job. However, he had a plan of working his ass off until he retires at 45. Then he will do everything that makes him happy. I was thoroughly confused, how could someone debar himself from an emotion – happiness – for so many years and rediscover it later? I was wondering about the recipe for happiness – raindrops on roses and whiskers on kittens. An individual’s happiness is a tricky thing; however, I shall attempt to tackle this issue in my later article on logistic regression. For now, let us try to explore how states measure the collective well-being of their people. I shall use this topic of population well-being to explore an interesting topic in analytical scorecard development: variables selection.
Variables Selection – Lessons from GDP & GNH
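The most popular measure of national prosperity, unanimously projected by economists and TV channels, is Gross Domestic Product (GDP). The equation for measuring GDP as taught in macroeconomics 101 is:
GDP = C + I + G + (X − M)
where C is private consumption, I is gross investment, G is government spending, X is exports and M is imports.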
Clearly, there are 5 factors/variables that govern GDP according to this equation. At first look, GDP seemed to me an incomplete measure of national well-being. All the variables in GDP come from commerce. They are important but cannot be the only factors in a country’s well-being, more so in a highly diverse & complicated country like India.
Gross National Happiness Index – The Story of Bhutan Naresh

Variables Selection – by Roopam
Ok, so what else do we have? A lesser-known index is Gross National Happiness (GNH). The origins of GNH are in Bhutan, which measures its progress through GNH. The term was coined and implemented by Jigme Singye Wangchuck. This name immediately takes me back to the early-nineties live telecast of the SAARC summit by India’s national broadcaster Doordarshan (DD). The old-timer Hindi commentators were referring to a modest man in bathrobe-like attire as ‘Bhutan Naresh’ – King of Bhutan. At first glance, he did not fit in well with the powerhouses of the South Asian region. Nevertheless, he seems to have devised a more holistic metric to measure his country’s well-being. GNH is a combination of the following broad categories:
1. Living standard & income
2. Health coverage
3. Psychological well-being
4. Time spent at work and relaxing
5. Good governance
6. Schooling & education
7. Cultural diversity
8. Community vitality
9. Environmentalism and conservation
There are 72 variables in total in GNH, each measured on a scale of 0 to 1, such as daily hours of sleep and trust in media; hmmm, not a bad start! You could do your own research on GNH and let me know what you feel about it. Actually, we can work out our own formula for a GNH-like metric. The idea is to select the right variables to build your model!
Variables Selection in Credit Scoring
In data mining and statistical model building exercises such as credit scoring, variables are selected on the basis of statistical significance – a reasonably automated process in advanced software. However, the variables are still created and measured by humans. High-impact analyses in businesses are still driven by hunches. Human intelligence is not obsolete yet.
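As a rough illustration of what selection by statistical significance can look like, the sketch below tests whether a two-valued candidate variable is associated with the good/bad outcome. The choice of Pearson's chi-square test is my assumption (the text does not name a specific test), the homeowner/renter counts are invented for the example, and the function names are hypothetical. The sketch also assumes every row and column total is positive.

```python
def chi_square_2x2(good_a: int, bad_a: int, good_b: int, bad_b: int) -> float:
    """Pearson chi-square statistic for a 2x2 table of counts.

    Rows are the two values of the candidate variable (a, b);
    columns are the outcome (good, bad).
    """
    observed = [[good_a, bad_a], [good_b, bad_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [good_a + good_b, bad_a + bad_b]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if variable and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5PCT_DF1 = 3.841

def is_significant(statistic: float) -> bool:
    return statistic > CRITICAL_5PCT_DF1

# Invented counts: homeowners 900 good / 100 bad, renters 800 good / 200 bad.
stat_housing = chi_square_2x2(900, 100, 800, 200)
# A variable whose two groups have identical bad rates carries no signal.
stat_useless = chi_square_2x2(900, 100, 900, 100)
```

A variable that passes such a test is only a candidate; whether it belongs on the application form is still the human judgment call described above.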
In one of the projects I did with a financial organization, the results of credit risk analysis and scoring led to a redesign of the application form. Application forms are a major source of data about the borrower. However, nobody wants to fill in a lengthy form, so a form of optimal size helps ensure that the borrower provides accurate information. The idea is to select the right variables and ensure accurate measurement.
There are several aspects regarding variables but I will mention just one of them here (coarse classing).
Coarse Classing in Credit Scoring

One of my favorite activities as a kid was going to a shoe store and getting my feet measured every summer before school started. The shoe shops had a strange, miniature, slide-like device to measure foot size. It was fun to see my feet grow from one size to another every year or two. The growth was quantized, i.e. you are size 2 or 3, never 2.5 or 2.7. This conversion of measures such as 2.5 and 2.7 to 3 is called grouping, bucketing or classing. It is an integral part of creating scorecards, and you will find it in all the books I have listed in the first part of this blog series.
I have been a part of several heated discussions on the relevance of coarse classing in scorecard development throughout my career. In most academic articles, you will rarely see coarse classing used as a technique during model development. Quite a few academics & practitioners believe, for good reason, that coarse classing results in a loss of information. However, in my opinion, coarse classing has the following advantages over using the raw measurement of a variable.
1. It reduces random noise that exists in raw variables – similar to averaging and yes, you lose some information here.
2. It handles extreme events – on two extremes of a variable – much better where you have thin data.
3. It handles the non-linear relationship between dependent and independent variable without a lot of effort of variable transformation from the analyst.
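Coarse classing as described above can be sketched in a few lines. This is a minimal sketch, not a full scorecard step: the variable (age), the cut points and the sample data are invented for illustration, the function names are hypothetical, and a real project would choose the cut points from the data rather than fix them by hand.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

def coarse_class(value: float, edges: Sequence[float]) -> int:
    """Return the index of the class that `value` falls into.

    `edges` are sorted cut points. Class 0 holds values below edges[0],
    class k holds values in [edges[k-1], edges[k]), and the last class
    holds values at or above edges[-1].
    """
    return bisect_right(edges, value)

def bad_rate_by_class(values: Sequence[float], is_bad: Sequence[bool],
                      edges: Sequence[float]) -> List[Tuple[int, int, float]]:
    """For each class return (record count, bad count, bad rate)."""
    counts = [0] * (len(edges) + 1)
    bads = [0] * (len(edges) + 1)
    for value, bad in zip(values, is_bad):
        k = coarse_class(value, edges)
        counts[k] += 1
        bads[k] += int(bad)
    return [(n, b, b / n if n else 0.0) for n, b in zip(counts, bads)]

# Invented data: borrower ages and whether each loan went bad.
ages = [22, 24, 30, 33, 41, 48, 55, 63]
went_bad = [True, True, True, False, False, True, False, False]
age_edges = [25, 35, 50]  # classes: <25, 25-34, 35-49, 50+

summary = bad_rate_by_class(ages, went_bad, age_edges)
```

Each class gets its own bad rate, so an extreme age lands in an open-ended class instead of stretching the scale, and the relationship between age and risk does not have to be a straight line.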
Sign-off Note
We are halfway through this series on ‘Analytical Scorecard Development’ and I am thoroughly enjoying writing it. I hope as a reader you are on the same page. Scorecard building is highly technical and I have tried to discuss some aspects with easy-to-understand examples. However, to manage the length of the article, I am not able to get into the details. I must say that I love the details! So, if you have any queries, doubts, points of view or recommendations, please write back on the discussion board or to my email: roopam.up@gmail.com
Reference: http://ucanalytics.com/blogs/credit-scorecards-part-1/


