Mining Massive Data Sets
Course URL: http://videolectures.net/mmdss07_fogelman_mmds/
Lecturer: Françoise Fogelman Soulié
Institution: KXEN
Date: 2007-11-26
Language: English
Course description: Today, the amount of data coming from all possible sources is enormous and growing fast, due in large part to the ubiquitous Web and its increasing presence in our everyday life, but also to emails, cell phones, credit cards, retail, finance, and more. These data serve all sorts of functions: from query and search to extracting information, providing services, and managing security. Many fields are involved: statistics, data mining, text mining, data streams, search, social networks, and others. There is no lack of sophisticated techniques produced by academic activity, where challenges mostly concern the novelty, accuracy, and scalability of algorithms. In real-world applications, however, the challenges are quite different: scalability (usually one or two orders of magnitude beyond what academic publications address), ease of use, and the ability to integrate efficient techniques into working systems in a transparent way, while always producing value for the customer. Real-world solutions are complex and usually need to integrate many technical components from the fields mentioned above; it thus becomes important to assess how these fields can complement one another. In the first part of the talk, I will present the challenges of real-world data mining applications. I will introduce the general Statistical Learning Theory framework and discuss some of the technical issues involved (high-dimensional data sets, missing data, outliers, non-i.i.d. structured data, unlabelled data, and so on). In the second part, taking examples from the implementation in KXEN and the applications developed with it, I will show how a theoretical framework (Structural Risk Minimization [1]) can be used to solve some of the challenges met in the real world. Finally, I will describe some open practical issues that will require further theoretical investigation.
Keywords: computer science; data mining; missing data
Source: VideoLectures.NET
Last reviewed: 2020-07-29:yumf
Views: 54