Statistical techniques for fraud detection, prevention, and evaluation |
|
Course URL: | http://videolectures.net/mmdss07_hand_stf/ |
Lecturer: | David J. Hand |
Institution: | Imperial College London |
Date: | 2007-12-03 |
Language: | English |
Course description: | The talk begins by setting the context: fraud is defined and its breadth outlined; figures are given showing how significant fraud is; and different areas of fraud are examined, including health care fraud, banking fraud, and scientific fraud. The particular data analytic challenges of banking fraud are described and illustrated in detail. These include the fact that the classes are highly unbalanced (with typically no more than 1 in 1,000 transactions being fraudulent), that class labels may often be incorrect, that there will typically be delays in discovering the true labels, that the transaction arrival times are random, that the data are dynamic, and, perhaps most challenging of all, that the distributions are reactive, changing in response to the implementation of fraud detection systems. The role of mechanistic and empirical models in tackling these problems is described. Both have been widely used, and both have a contribution to make. Banking data, and in particular banking fraud data, are examined in detail. Raw credit card transaction data have 70-80 variables per transaction, and this can be multiplied many-fold for behavioural data, as in fraud detection problems. Questions arise as to how to aggregate the data: should one try to classify individual transactions, or should activity records be constructed? A fundamental aspect of any predictive problem in data analysis is the choice of an appropriate criterion for estimation and performance assessment. In the case of fraud, one needs, in particular, to combine both classification accuracy and timeliness of classification. This means that standard measures of classification performance, such as error rate, AUC, KS statistic, information value, etc., are not sufficient. Suitable measures and performance curves are described which combine these aspects and which are now being adopted by the industry.
Various statistical (used here in John Chambers’s sense of ‘greater statistics’) approaches have been developed for fraud detection problems, and some are described and illustrated, using data from some of the banks which have been collaborating with us. In particular, we look at supervised classification and anomaly detection methods. Finally, in the context of banking fraud, some of the deeper but very important conceptual issues are outlined, including the economic imperative, whether fraud is now becoming ‘acceptable’, and what exactly we learn from empirical comparisons. Scientific fraud is contrasted with banking fraud. They have rather different drivers. In particular, financial gain is generally irrelevant to scientific fraud, which makes it an unusual kind of fraud, although, of course, the impact can be even more serious. Several examples are given, from a range of disciplines. The role of data analytic tools in detecting scientific fraud, and the nature of such tools, is described. |
Keywords: | data fraud; class imbalance; data analysis |
Source: | VideoLectures.NET |
Last reviewed: | 2020-01-16: chenxin |
Views: | 64 |