SPARTan: Scalable PARAFAC2 for Large & Sparse Data
Course URL: https://videolectures.net/videos/kdd2017_perros_SPARTan
Lecturer: Ioakeim Perros
Offered by: KDD 2017 Workshop
Date: 2017-10-09
Language: English
Course description: In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of treatments may vary widely in time, meaning there is no meaningful way to align their clinical records across time points for analysis purposes. To handle such data, the state-of-the-art tensor model is the so-called PARAFAC2, which yields interpretable and robust output and can naturally handle sparse data. However, its main limitation up to now has been the lack of efficient algorithms that can handle large-scale datasets. In this work, we fill this gap by developing a scalable method to compute the PARAFAC2 decomposition of large and sparse datasets, called SPARTan. Our method exploits special structure within PARAFAC2, leading to a novel algorithmic reformulation that is both faster (in absolute time) and more memory-efficient than prior work. We evaluate SPARTan on both synthetic and real datasets, showing 22X performance gains over the best previous implementation and also handling larger problem instances for which the baseline fails. Furthermore, we are able to apply SPARTan to the mining of temporally-evolving phenotypes on data taken from real and medically complex pediatric patients. The clinical meaningfulness of the phenotypes identified in this process, as well as their temporal evolution for several patients, have been endorsed by clinical experts.
Keywords: big data; sparse data; efficient algorithms
Source: VideoLectures.NET
Data collected: 2024-12-25:liyq
Last reviewed: 2024-12-26:liyq