Handling noisy data
Course URL: http://videolectures.net/acai05_bratko_hnd/
Lecturer: Ivan Bratko
Institution: University of Ljubljana
Date: 2007-02-25
Language: English
Course description: In the practice of machine learning, learning data typically contain errors. Imperfections in data can have various, often unavoidable causes: measurement errors, human mistakes, errors of expert judgement in classifying training examples, and so on. We refer to all of these as noise. Noise can also come from the treatment of missing values, when an example with an unknown attribute value is replaced by a set of weighted examples corresponding to the probability distribution of the missing value. The typical consequences of noise in learning data are: (a) low prediction accuracy of induced hypotheses on new data, and (b) large hypotheses that are hard for the user to interpret and understand. For example, decision trees with hundreds or thousands of nodes are not suitable for interpretation by a domain expert. We say that such complex hypotheses overfit the data. Overfitting occurs when the hypothesis reflects not only the genuine regularities in the domain but also the noise in the data.
To alleviate the harmful effects of noise, we have to prevent overfitting. One common idea is to simplify induced hypotheses. In the learning of rules or decision trees, this leads to tree pruning or rule truncation. The main question in hypothesis simplification is: how can we know that our hypothesis is of "the right size", neither too simple nor too complex? For example, in tree pruning, when should we stop pruning? The decision can be based on the estimated accuracy of the hypothesis before and after pruning, choosing whichever maximises the estimated accuracy. However, estimating accuracy can be difficult, and involves the problem of estimating probabilities from small samples. Several methods for this will be discussed in this lecture, and the effects of simplification will be illustrated. A somewhat related approach to deciding on the "right size" of a hypothesis is based on the minimum description length (MDL) principle.
Another way of reducing the effects of noise is to use background or prior knowledge about the domain of learning. For example, when learning from numerical data, a useful idea is to make the learning algorithm respect the known qualitative properties of the target concept.
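The description's point about basing the pruning decision on accuracy estimated from small samples can be illustrated with a minimal Python sketch. It is not taken from the lecture: the function names, the choice of the m-estimate with m = 2, the class priors, and the simplification that the children of the node are leaves are all assumptions made for illustration.

```python
from typing import Sequence


def laplace_estimate(n_class: int, n_total: int, n_classes: int) -> float:
    """Laplace estimate: assumes a uniform prior over n_classes classes."""
    return (n_class + 1) / (n_total + n_classes)


def m_estimate(n_class: int, n_total: int, prior: float, m: float = 2.0) -> float:
    """m-estimate: relative frequency shrunk toward `prior`; m sets the strength."""
    return (n_class + m * prior) / (n_total + m)


def static_error(counts: Sequence[int], priors: Sequence[float], m: float = 2.0) -> float:
    """Estimated error if the node becomes a leaf predicting its most probable class."""
    n_total = sum(counts)
    best = max(m_estimate(c, n_total, p, m) for c, p in zip(counts, priors))
    return 1.0 - best


def should_prune(
    node_counts: Sequence[int],
    child_counts: Sequence[Sequence[int]],
    priors: Sequence[float],
    m: float = 2.0,
) -> bool:
    """Prune when the leaf's estimated error is no worse than its children's.

    Simplification (assumption): the children are themselves treated as leaves.
    """
    n_total = sum(node_counts)
    # Weight each child's estimated error by the fraction of examples it receives.
    backed_up = sum(
        (sum(child) / n_total) * static_error(child, priors, m)
        for child in child_counts
    )
    return static_error(node_counts, priors, m) <= backed_up


# Hypothetical class counts at a node holding 10 examples of two classes.
priors = [0.5, 0.5]
print(should_prune([6, 4], [[4, 1], [2, 3]], priors))  # False: the split lowers estimated error
print(should_prune([6, 4], [[1, 0], [5, 4]], priors))  # True: the split does not pay off
```

With raw relative frequencies, a child holding a single example would appear to have zero error, so splitting would always look attractive. Shrinking the estimates toward a prior is one way to keep such small samples from dominating the decision.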
Keywords: noise; handling; machine learning
Source: VideoLectures.NET
Last reviewed: 2019-10-31:lxf
Views: 70