Estimating the contribution of non-genetic factors to gene expression using Gaussian process latent variable models
Course URL: http://videolectures.net/licsb2010_fusi_ecn/
Lecturer: Nicolò Fusi
Institution: University of Manchester
Date: 2010-05-03
Language: English
Course description: Thanks to the recent increase in available genetic profiling data and to the ability to characterise disease activity through gene expression, it is now possible to understand in more detail the multitude of causal factors linked with each disease. This is a challenging task, both because integrating different sources of biological data is not straightforward and because non-genetic factors (such as differences in the experimental setting, or individual characteristics such as gender and ethnicity) are not always artificially controlled. Since these non-genetic factors may cause most of the variation in gene expression, reducing the accuracy of genetic studies, there is a pressing need for models that take them explicitly into account.
We present a model in which non-genetic factors are unobserved latent variables, and the gene expression levels are described as linear functions of both these latent variables and Single Nucleotide Polymorphisms (SNPs). From a generative point of view, the gene expression levels Y can be written as $Y = SV + XW + \mu \mathbf{1}^{T} + \epsilon$, where S is the matrix containing the SNPs, X are the latent variables, V and W are mapping matrices, $\epsilon$ is an isotropic Gaussian error term, and $\mu$ allows the model to have a non-zero mean.
The model is inspired by the one proposed by Stegle et al. [1], but instead of optimising the parameters and marginalising the latent variables (as in Probabilistic PCA), we marginalise the parameters and optimise the latent variables. For a particular choice of prior over the mapping matrices W and V, the two approaches are equivalent. This kind of model is called dual Probabilistic PCA (dual PPCA), and it belongs to a wider class of models called Gaussian Process Latent Variable Models. Indeed, dual PPCA is the special case where the output dimensions are assumed to be linear, independent and identically distributed. Each of these assumptions can be relaxed, yielding new probabilistic models.
Many extensions of this model are possible, but even in its simplest form the eQTL study results are extremely promising in terms of the number of significant associations found.
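The generative model and the marginalisation of the mapping matrices can be illustrated with a small NumPy sketch. This is not the lecturer's implementation. It assumes a samples-by-genes layout for Y (so the mean term is written as a column of ones times mu^T), genotypes coded as 0/1/2, standard normal priors on V and W, and removal of mu by subtracting the empirical gene means. The function names and default sizes are hypothetical.

```python
import numpy as np

def simulate_expression(n_samples=50, n_genes=200, n_snps=10, n_latent=3,
                        noise_std=0.1, seed=0):
    """Draw Y = S V + X W + 1 mu^T + eps (rows = samples, columns = genes)."""
    rng = np.random.default_rng(seed)
    # Assumption: genotypes coded as minor-allele counts 0, 1 or 2.
    S = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
    X = rng.standard_normal((n_samples, n_latent))   # hidden non-genetic factors
    V = rng.standard_normal((n_snps, n_genes))       # SNP-to-expression mapping
    W = rng.standard_normal((n_latent, n_genes))     # latent-to-expression mapping
    mu = rng.standard_normal(n_genes)                # per-gene mean
    eps = noise_std * rng.standard_normal((n_samples, n_genes))
    Y = S @ V + X @ W + np.ones((n_samples, 1)) @ mu[None, :] + eps
    return Y, S, X

def dual_ppca_log_likelihood(Y, S, X, noise_var):
    """Log marginal likelihood with V and W integrated out under N(0, I) priors.

    Each centred gene column is then modelled as N(0, K) with
    K = S S^T + X X^T + noise_var * I, and genes are treated as independent.
    """
    n, d = Y.shape
    Yc = Y - Y.mean(axis=0, keepdims=True)           # assumption: mu handled by centring
    K = S @ S.T + X @ X.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Yc)
    return -0.5 * (d * n * np.log(2 * np.pi) + d * logdet + np.sum(Yc * Kinv_Y))

if __name__ == "__main__":
    Y, S, X = simulate_expression()
    print(dual_ppca_log_likelihood(Y, S, X, noise_var=0.01))
```

In the dual PPCA setting, this likelihood would be maximised with respect to X (and the noise variance) rather than evaluated at a known X; the sketch only shows the objective being optimised.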
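Keywords: gene expression; non-genetic factors; Gaussian distribution; optimising variables; probabilistic models
Source: VideoLectures.NET
Last reviewed: 2019-12-27:lxf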