Predicting anti-cancer molecule activity using machine learning algorithms
Course URL: http://videolectures.net/licsb08_santos_pam/
Lecturer: Jose Santos
Institution: Imperial College London
Date: 2008-04-17
Language: English

Course description: In this paper we study the anti-cancer activity of approximately 4,000 unique compounds against a set of 60 cell lines (e.g. leukemia, prostate, breast). Small molecules play an important role in biology: they can serve as building blocks for more complex molecules, and they interact with proteins, inhibiting or promoting their action. The consequences of adding such a compound to a cell can be far-reaching, because the protein may be involved in a very complex chain of reactions. It is therefore possible to design small molecules that are useful drugs. Here we concentrate only on predicting one property of a given molecule: whether it will show anti-cancer activity (measured as causing at least 50% cell growth inhibition) against a given cancerous cell line. This computational prediction is important because the number of small molecules in databases worldwide is growing while the capacity for proper lab testing is limited. For instance, the In Vitro Cell Line Screening Project at the National Cancer Institute (NCI) can currently evaluate (only) up to 3000 compounds per year for potential anti-cancer activity. From a machine learning perspective, biological problems are a good application: datasets are abundant, the data is real, the type of algorithm most suitable for a particular problem may vary substantially, and it is not unusual for a problem to highlight research needs in machine learning. Finally, helping to solve biological problems may have a big impact on the wider scientific community. The molecule dataset we used is publicly available at the NCI site. We applied a range of data mining classification algorithms to this problem: Decision Trees, Inductive Logic Programming (ILP) and Support Vector Machines (SVMs). As molecular features for learning we used molecular weight, the octanol-water partition coefficient (logP) and fragment counts. A fragment is a set of connected atoms in which each atom is identified simply by its type (e.g. carbon). If we view the molecule as a graph, the fragment list consists of all connected components with diameter two. The experiments demonstrate that our results using support vector machines (with an RBF kernel) match previously published state-of-the-art work, yielding an average predictive accuracy of 73% (against a baseline of 54%). We noticed, however, to our surprise, that if we use only atom counts instead of fragment counts the results are nearly identical (about 1% lower accuracy, although the difference is statistically significant). An important point is that, although numerical black-box algorithms like SVMs tend to be slightly more accurate than logic models (Decision Trees and ILP on this dataset have an accuracy 3% to 4% below SVMs), the relevance of this extra predictive accuracy for important practical applications like drug design is arguable. In a drug design setting, what is useful is a set of rules that describe what a "good" compound should look like. That goal is much more easily achieved with a human-readable logic model like the ones we also describe in the paper.
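
The two count-based feature sets mentioned in the description (atom counts and fragment counts) can be illustrated with a small sketch. This is not the authors' implementation. It assumes one possible reading of "diameter two", namely linear three-atom paths (two bonds) keyed by the atom types along the path, and it represents a molecule as a list of atom types plus a list of bonds. The function names `atom_counts` and `path_fragment_counts` and the ethanol example are hypothetical, introduced only for illustration.

```python
from collections import Counter


def atom_counts(atom_types):
    """Count atoms by element type, e.g. Counter({'C': 2, 'O': 1})."""
    return Counter(atom_types)


def path_fragment_counts(atom_types, bonds):
    """Count three-atom paths (two bonds), keyed by the atom types on the path.

    Assumption: a 'fragment' is read here as a linear path end-centre-end.
    The two end types are sorted so a path and its reverse share one key.
    """
    # Build an adjacency list from the undirected bond list.
    neighbours = {i: set() for i in range(len(atom_types))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    counts = Counter()
    for centre, adjacent in neighbours.items():
        ordered = sorted(adjacent)
        # Each unordered pair of neighbours of `centre` yields one path.
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                ends = sorted((atom_types[left], atom_types[right]))
                counts[(ends[0], atom_types[centre], ends[1])] += 1
    return counts


# Ethanol heavy atoms only: C-C-O (hydrogens omitted for brevity).
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]

print(atom_counts(ethanol_atoms))
print(path_fragment_counts(ethanol_atoms, ethanol_bonds))
```

Either counter can be flattened into a fixed-length numeric vector (one position per atom type or fragment key seen in the training set) before being passed to a classifier. The atom-count vector is much shorter, which is one reason the near-identical accuracy reported above is notable.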
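Keywords: prediction algorithms; machine learning; logic models; decision trees; support vector machines; SVM
Source: VideoLectures.NET
Data collected: 2020-04-08:zhouxj
Last reviewed: 2020-05-25:cxin
Views: 88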