Cleaning Disguised Missing Data: A Heuristic Approach
Course URL: http://videolectures.net/kdd07_pei_cdmd/
Lecturer: Jian Pei
Institution: Simon Fraser University
Course date: Information not available.
Course language: English
Course description: In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as missing but instead appear as potentially valid data values. Such values are known as disguised missing data. They can severely impair the quality of data analysis, for example by causing significant biases and misleading results in hypothesis tests, correlation analysis, and regressions. The few previous studies on cleaning disguised missing data use outlier mining and distribution anomaly detection. They rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers. To tackle the problem of cleaning disguised missing data, in this paper we first model the distribution of disguised missing data and propose the embedded unbiased sample heuristic. We then develop an effective and efficient method to identify the frequently used disguise values, which capture the major body of the disguised missing data. Our method does not require any domain background knowledge to find the suspicious disguise values. We report an empirical evaluation on real data sets, which shows that our method is effective: the frequently used disguise values it finds closely match the values identified by domain experts. Our method is also efficient and scalable for processing large data sets.
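Illustrative sketch: the description above names the embedded unbiased sample heuristic without spelling it out. The toy Python below is not the method from the lecture. It assumes one simplified reading of the idea: rows that share a disguise value tend to look like an unbiased sample of the whole table on the other attributes, whereas rows that share a genuine value tend to be skewed. The table, the column names (`country`, `language`), the placeholder value `AF`, and the helper `rank_disguise_candidates` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def rank_disguise_candidates(rows, attr, other, min_support=5):
    """Rank values of `attr` by how closely their rows mirror the whole table.

    Simplifying assumption: the whole subset with attr == value is compared
    with the table on a single other attribute. A small distance combined
    with enough support marks the value as a suspicious disguise candidate.
    """
    overall = distribution(row[other] for row in rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[other])
    scored = []
    for value, others in groups.items():
        if len(others) < min_support:
            continue  # too few rows to judge
        scored.append((total_variation(distribution(others), overall), value))
    return sorted(scored)


# Hypothetical toy table: genuine countries correlate with language, while
# "AF" (imagined as the first entry of a drop-down list) is left unchanged
# by users of every language.
rows = (
    [{"country": "DE", "language": "de"}] * 40
    + [{"country": "JP", "language": "ja"}] * 40
    + [{"country": "BR", "language": "pt"}] * 40
    + [{"country": "AF", "language": "de"}] * 10
    + [{"country": "AF", "language": "ja"}] * 10
    + [{"country": "AF", "language": "pt"}] * 10
)

ranked = rank_disguise_candidates(rows, "country", "language")
for distance, value in ranked:
    print(f"{value}: distance to overall distribution = {distance:.3f}")
```

In this toy table the placeholder `AF` ranks first because its rows reproduce the overall language mix, while the genuine country values are strongly skewed. A realistic version would have to compare against all other attributes and search for a large embedded subset rather than the whole group.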
Keywords: Computer Science; Machine Learning; Preprocessing
Course source: VideoLectures.NET
Last reviewed: 2019-11-16:cwx