Improving Data Mining Utility with Projective Sampling
Course URL: https://videolectures.net/videos/kdd09_last_idmups
Lecturer: Mark Last
Institution: Not available.
Course date: 2025-02-04
Course language: English
Course description: The overall performance of the data mining process depends not just on the value of the induced knowledge but also on various costs of the process itself, such as the cost of acquiring and pre-processing training examples, the CPU cost of model induction, and the cost of committed errors. Recently, several progressive sampling strategies for maximizing the overall data mining utility have been proposed. All these strategies are based on repeated acquisitions of additional training examples until a utility decrease is observed. In this paper, we present an alternative, projective sampling strategy, which fits functions to a partial learning curve and a partial run-time curve obtained from a small subset of potentially available data and then uses these projected functions to analytically estimate the optimal training set size. The proposed approach is evaluated on a variety of benchmark datasets using the RapidMiner environment for machine learning and data mining processes. The results show that the learning and run-time curves projected from only a few data points can lead to a cheaper data mining process than the common progressive sampling methods.
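The projective idea in the description can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method from the talk: the description does not say which function families are fitted, so the sketch assumes a power-law learning curve err(n) = a * n^(-b) and a linear run-time curve t(n) = c * n + d. The cost parameters (`cost_example`, `cost_cpu`, `cost_error`, `n_future`), the function names, and all sample numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_power_law(sizes, errors):
    """Fit err(n) = a * n**(-b) by linear regression in log-log space."""
    slope, intercept = fit_line([math.log(n) for n in sizes],
                                [math.log(e) for e in errors])
    return math.exp(intercept), -slope  # (a, b)


def total_cost(n, a, b, c, d, cost_example, cost_cpu, cost_error, n_future):
    """Assumed cost model: acquisition + CPU time + cost of future errors."""
    return (cost_example * n
            + cost_cpu * (c * n + d)
            + cost_error * n_future * a * n ** (-b))


def optimal_size(a, b, c, cost_example, cost_cpu, cost_error, n_future):
    """Closed-form minimizer of total_cost, from setting d(total)/dn = 0.

    Requires b > 0 and a positive marginal cost per example. The run-time
    intercept d does not affect the optimum, so it is not a parameter.
    """
    marginal = cost_example + cost_cpu * c
    return (cost_error * n_future * a * b / marginal) ** (1.0 / (b + 1.0))


# Hypothetical partial curves measured on a few small training sets.
sizes = [100, 200, 400, 800]
errors = [2.0 * n ** -0.5 for n in sizes]    # synthetic error rates
runtimes = [0.001 * n + 0.5 for n in sizes]  # synthetic seconds

a, b = fit_power_law(sizes, errors)
c, d = fit_line(sizes, runtimes)
costs = dict(cost_example=0.01, cost_cpu=1.0, cost_error=1.0, n_future=10000)
n_opt = optimal_size(a, b, c, **costs)
print(f"projected optimal training set size: {n_opt:.0f}")
```

Under these assumptions the optimum follows analytically from the two fitted curves, so no further training examples have to be acquired to locate it. A progressive strategy, by contrast, keeps acquiring examples until it observes a utility decrease.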
Keywords: model induction; data mining; machine learning
Course source: VideoLectures.NET
Data collected: 2025-03-30:zsp
Last reviewed: 2025-03-30:zsp
Views: 17