0


发现敏感数据中的频繁模式

Discovering Frequent Patterns in Sensitive Data
课程网址: http://videolectures.net/kdd2010_thakurta_dfp/  
主讲教师: Abhradeep Guha Thakurta
开课单位: 微软公司
开课时间: 2010-10-01
课程语种: 英语
中文简介:
从数据中发现频繁的模式是数据挖掘中一种流行的探索技术。但是,如果数据是敏感的(例如,患者健康记录、用户行为记录),那么发布有关重要模式或趋势的信息会对隐私造成重大风险。本文展示了如何在包含敏感信息的数据集中准确地发现和发布最重要的模式及其频率,同时为存储在其中的信息的个人提供严格的隐私保证。我们提出了两种在敏感记录数据集中发现k最频繁模式的有效算法。我们的算法满足差分隐私,这是最近引入的一个定义,在存在任意外部信息时提供有意义的隐私保证。不同的私有算法需要一定程度的不确定性才能保护隐私。我们的算法通过返回接近数据中k个最频繁模式的实际列表的模式的噪声列表来处理这个问题。我们定义了一个实用的新概念,它量化了私有Top-K模式挖掘算法的输出精度。在典型的数据集中,我们的效用标准意味着报告列表中的假阳性和假阴性率较低。我们证明了我们的方法符合新的效用准则;我们还通过对来自FIMI存储库的事务数据集的大量实验来证明我们的算法的性能。虽然本文关注的是频繁的模式挖掘,但只要数据挖掘输出是根据适当的可靠兴趣度量排序的元素列表,这里开发的技术就具有相关性。
课程简介: Discovering frequent patterns from data is a popular exploratory technique in datamining. However, if the data are sensitive (e.g., patient health records, user behavior records) releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there. We present two efficient algorithms for discovering the k most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning noisy lists of patterns that are close to the actual list of k most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-k pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately robust measure of interest.
关 键 词: 频繁模式; 数据挖掘; 私有算法
课程来源: 视频讲座网
最后编审: 2020-04-13:chenxin
阅读次数: 72