

Predicting anti-cancer molecule activity using machine learning algorithms
Course URL: http://videolectures.net/licsb08_santos_pam/
Lecturer: Jose Santos
Institution: Imperial College London
Date: 2008-04-17
Language: English
Course description: In this paper we study the anti-cancer activity of approximately 4,000 unique compounds against a set of 60 cell lines (e.g. leukemia, prostate, breast). Small molecules play an important role in biology: they can be used as building blocks for more complex molecules, and they can interact with proteins, inhibiting or promoting their action. In the latter case, the consequences of adding such a compound to a cell can be far-reaching, as the protein may be involved in a very complex chain reaction. It is therefore possible to design small molecules that are useful drugs. Here we concentrate only on predicting one property of a given molecule: whether it will show anti-cancer activity (measured as causing at least 50% cell growth inhibition) against a given cancerous cell line. This computational prediction is important because the number of small molecules in databases worldwide is growing while the capacity for proper lab testing is limited. For instance, the In Vitro Cell Line Screening Project at the National Cancer Institute (NCI) can currently evaluate only up to 3,000 compounds per year for potential anti-cancer activity. From a machine learning perspective, biological problems are a good application because datasets are abundant, the data is real, the type of algorithm most suitable for a particular problem may vary substantially, and it is not unusual for a problem to highlight research needs in machine learning. Finally, helping to solve biological problems may have a big impact on the wider scientific community. The molecule dataset we used is publicly available at the NCI site. We applied a range of data mining classification algorithms to this problem: Decision Trees, Inductive Logic Programming (ILP) and Support Vector Machines (SVMs). As molecular features for learning we used molecular weight, the octanol-water partition coefficient (logP) and fragment counts. A fragment is a set of connected atoms in which each atom is identified simply by its type (e.g. carbon). If we look at the molecule as a graph, the fragment list consists of all connected components with diameter two. The experiments demonstrate that our results using support vector machines (with an RBF kernel) are identical to previously published state-of-the-art work, yielding an average predictive accuracy of 73% (against a baseline of 54%). We noticed, however, to our surprise, that if we use only atom counts instead of fragment counts the results are nearly identical (about 1% less accurate, although the difference is statistically significant). An important point is that, although numerical black-box algorithms like SVMs tend to be slightly more accurate than logic models (on this dataset, Decision Trees and ILP have an accuracy 3% to 4% below SVMs), the relevance of this difference in predictive accuracy is debatable for important practical applications like drug design. In a drug design setting, what is useful is a set of rules that describe what a "good" compound should look like. That goal is much more easily achieved with a human-readable logic model like the ones we also describe in the paper.
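The atom-count and fragment-count features mentioned in the description can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one possible reading of "diameter two", namely a path of three atoms joined by two bonds, identified only by atom types. Star-shaped fragments and three-membered rings are not treated specially, and the function names and the toy molecule are hypothetical.

```python
from collections import Counter


def atom_counts(atoms):
    """Count atoms by element type, e.g. {'C': 2, 'O': 1}."""
    return Counter(atoms.values())


def path_fragment_counts(atoms, bonds):
    """Count three-atom paths a-b-c, keyed by atom types only.

    Assumption: a 'fragment' is read here as a path of two bonds
    (graph diameter two); the original work may define it differently.
    """
    # Build an undirected adjacency map from the bond list.
    neighbours = {i: set() for i in atoms}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    counts = Counter()
    for centre, adjacent in neighbours.items():
        ordered = sorted(adjacent)
        # Every unordered pair of neighbours forms one path through centre.
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                # Canonical key so C-O-H and H-O-C count as the same fragment.
                ends = sorted((atoms[left], atoms[right]))
                counts[(ends[0], atoms[centre], ends[1])] += 1
    return counts


# Hypothetical toy molecule: the heavy atoms of ethanol, C-C-O.
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1), (1, 2)]
print(atom_counts(atoms))
print(path_fragment_counts(atoms, bonds))
```

Either count dictionary could be turned into a fixed-length numeric vector and passed, together with molecular weight and logP, to a classifier such as an SVM. That step is omitted here because the description does not specify how the features were encoded.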
Keywords: anti-cancer activity in cell lines; chain reaction; support vector machines; logic models
Source: VideoLectures.NET
Last reviewed: 2019-05-13:cjy