By Huwenbo Shi, Feb 16th, 2018, see original post @Huwenbo
The paper “Explaining Missing Heritability Using Gaussian Process Regression” by Sharp et al. tackles the problem of missing heritability and the detection of higher-order interaction effects through Gaussian process regression, a technique widely used in the machine learning community. The authors obtained estimates of broad-sense heritability for a number of mice and yeast phenotypes using an RBF kernel that models higher-order interactions, and found these estimates to be significantly larger than the narrow-sense heritability of the same phenotypes. The authors also detected several loci displaying interaction effects.
In genetics, phenotypes are modeled by the following equation

$$y_i = f(\mathbf{x}_i) + u_i + \epsilon_i,$$

where $y_i$ is the phenotype measurement of the $i$-th individual, $\mathbf{x}_i$ the genotype vector, $u_i$ a random effect term that captures relatedness among individuals, and $\epsilon_i$ the environmental noise. Here, $f(\cdot)$ is a function that maps the genotype vector to a real number. Under this model, heritability is defined as the proportion of variance in $y$ that is due to variation in $f(\mathbf{x})$,

$$h^2 = \frac{\mathrm{Var}[f(\mathbf{x})]}{\mathrm{Var}[y]}.$$
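To make this definition concrete, here is a minimal simulation (hypothetical genotypes and effect sizes, an assumed additive $f$ and unit-variance noise) that estimates heritability as the share of phenotypic variance attributable to $f(\mathbf{x})$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 50

# Simulate genotypes (0/1/2 allele counts) and an additive genetic value f(x)
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
beta = rng.normal(0, 0.1, size=p)   # illustrative per-SNP effect sizes
f = X @ beta                         # genetic value f(x_i)
eps = rng.normal(0, 1.0, size=n)    # environmental noise
y = f + eps

# Heritability: proportion of phenotypic variance due to variation in f(x)
h2 = f.var() / y.var()
print(round(h2, 3))
```

In real data $f(\mathbf{x})$ is of course unobserved; estimating the variance it explains is the hard part that the paper's Gaussian process machinery addresses.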
Different flavors of heritability exist based on the complexity of the function $f(\cdot)$ and the input that goes into $f(\cdot)$. In general, geneticists work with four types of heritability, as listed below.

- Broad-sense heritability ($H^2$): $f(\cdot)$ is an arbitrary function of all genetic variation, capturing additive, dominance, and interaction (epistatic) effects.
- Narrow-sense heritability ($h^2$): $f(\cdot)$ is restricted to an additive (linear) function of all genetic variation.
- SNP heritability ($h^2_g$): $f(\cdot)$ is an additive function of the genotyped SNPs only.
- GWAS heritability ($h^2_{\mathrm{GWAS}}$): $f(\cdot)$ is an additive function of only the genome-wide significant SNPs.
Based on the definitions of the four flavors of heritability, it follows that $h^2_{\mathrm{GWAS}} \le h^2_g \le h^2 \le H^2$. The missing heritability problem often refers to the gap between $h^2_{\mathrm{GWAS}}$ and the narrow-/broad-sense heritability.
Parametric regression problems often involve a function, governed by a set of parameters $\theta$, that maps each input to a response. For example, in Poisson regression, the distribution of the response variable is characterized by the mean parameter $\lambda_i = \exp(\mathbf{x}_i^T \theta)$ and the density function of the Poisson distribution, $y_i \sim \mathrm{Poisson}(\lambda_i)$.
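As a quick illustration of the parametric setting, the following sketch (toy data; the learning rate and iteration count are illustrative) fits a one-parameter Poisson regression by gradient ascent on the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 1))
theta_true = np.array([0.5])
lam = np.exp(x @ theta_true)     # mean parameter lambda_i = exp(x_i^T theta)
y = rng.poisson(lam)

# Maximize the Poisson log-likelihood over theta by gradient ascent
theta = np.zeros(1)
for _ in range(200):
    mu = np.exp(x @ theta)
    grad = x.T @ (y - mu)        # score of the Poisson GLM with log link
    theta = theta + 0.001 * grad
print(theta)
```

Everything about the fitted model is summarized by the finite parameter vector $\theta$, which is exactly the assumption that Gaussian process regression drops.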
Gaussian Process Regression is different from parametric regression in that one does not assume any parametric form for the function $f(\cdot)$. Instead, a Gaussian Process prior assumes that the function values of $f(\cdot)$ for a number of inputs $\mathbf{x}_1, \dots, \mathbf{x}_n$ follow a multivariate normal distribution,

$$(f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)) \sim \mathcal{N}(\mathbf{0}, K),$$

where $K$ is the kernel matrix, measuring the similarity between samples, which constrains the possible space of $f(\cdot)$. Because the only constraint on the kernel function is that the resulting covariance matrix be positive semi-definite, Gaussian Process Regression can model a broad range of functions.
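A small sketch of what this prior means in practice (one-dimensional inputs, an assumed RBF kernel and length-scale): drawing the function values jointly from $\mathcal{N}(\mathbf{0}, K)$ produces random smooth functions, with the kernel controlling how smooth.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)

# RBF kernel matrix: k(x, x') = exp(-(x - x')^2 / (2 * ell^2))
ell = 0.1
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))

# Function values at the 100 inputs follow N(0, K); a small jitter keeps
# the covariance numerically positive definite for the sampler
f_samples = rng.multivariate_normal(np.zeros(100), K + 1e-8 * np.eye(100), size=3)
print(f_samples.shape)  # (3, 100)
```

Each row of `f_samples` is one random function evaluated at the 100 inputs; plotting them gives the familiar "wiggly draws from a GP prior" picture.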
The following is a list of kernel functions that are widely used (credit to Wikipedia):

- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d$
- RBF (Gaussian): $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$
- Laplacian: $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|}{\sigma}\right)$
Specifying the kernel function is a fundamental step of Gaussian Process Regression. An appropriate kernel allows one to model interactions of any order among genetic variants. In the Sharp et al. paper, the authors proposed a generalized version of the RBF kernel to measure similarity between two individuals $i$ and $j$ across the genotypes of $p$ SNPs,

$$k(\mathbf{x}_i, \mathbf{x}_j) = \lambda \exp\!\left(-\frac{1}{2} \sum_{l=1}^{p} \frac{(x_{il} - x_{jl})^2}{\ell_l^2}\right),$$
where $\lambda$ is a parameter that governs the overall similarity between individuals $i$ and $j$, and $\ell_l$ the contribution of SNP $l$ to the variation of the phenotype: a large $\ell_l$ suggests that SNP $l$ contributes little to the variation of the phenotype, and a small $\ell_l$ implies a significant contribution. By examining the magnitudes of the hyperparameters $\ell_l$, one can infer whether a genetic locus contributes significantly to the trait.
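A sketch of this per-SNP (ARD-style) kernel; the exact parameterization in the paper may differ, and `lam` / `ell` are illustrative names:

```python
import numpy as np

def ard_rbf(xi, xj, lam, ell):
    """Generalized RBF kernel with one length-scale ell[l] per SNP l.

    A large ell[l] flattens SNP l's term in the exponent, so that SNP
    has little influence on the modeled similarity between individuals.
    """
    d2 = ((xi - xj) ** 2 / ell ** 2).sum()
    return lam * np.exp(-0.5 * d2)

xi = np.array([0., 1., 2.])                  # genotypes of individual i
xj = np.array([0., 2., 2.])                  # genotypes of individual j
ell = np.array([1.0, 100.0, 1.0])            # SNP 2 is effectively ignored
print(ard_rbf(xi, xj, lam=1.0, ell=ell))     # close to 1: only SNP 2 differs
```

Because the two individuals differ only at the SNP with a huge length-scale, the kernel still rates them as nearly identical; shrinking that length-scale would let the mismatch count.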
Overfitting may occur when the number of parameters to estimate is larger than the amount of data one has. To avoid overfitting and improve the parsimony of the model, the authors imposed a Gamma prior over the inverse of $\ell_l^2$, i.e. $1/\ell_l^2 \sim \mathrm{Gamma}(a, b)$. The Gamma prior has density function

$$\mathrm{Gamma}(x; a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}.$$
Setting the shape parameter $a \le 1$ removes any interior mode in the density function, resulting in a monotonically decreasing density with most of its mass concentrated near 0 and a heavy right tail (see figure below), which encourages most of the $1/\ell_l^2$ to be close to zero.
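This behavior is easy to verify numerically; a minimal check (illustrative shape and rate values) that the Gamma density with shape at most 1 decreases monotonically:

```python
import math

def gamma_pdf(x, a, b):
    # Gamma density with shape a and rate b
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)

# With shape a <= 1 the density has no interior mode: it is monotonically
# decreasing, piling prior mass near zero (i.e., toward irrelevant SNPs)
xs = [0.1 * k for k in range(1, 50)]
vals = [gamma_pdf(x, a=0.5, b=1.0) for x in xs]
print(all(v1 >= v2 for v1, v2 in zip(vals, vals[1:])))  # True
```

With $a > 1$ the same density would instead peak at $x = (a-1)/b$, pulling the $1/\ell_l^2$ away from zero, which is why the sub-unit shape matters for sparsity.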
The Gaussian Process prior allows one to analytically perform the integration over the space of $f(\cdot)$, resulting in a posterior for the parameters

$$P(\theta \mid \mathbf{y}) \propto P(\mathbf{y} \mid \theta)\, P(\theta), \qquad P(\mathbf{y} \mid \theta) = \int P(\mathbf{y} \mid \mathbf{f})\, P(\mathbf{f} \mid \theta)\, d\mathbf{f} = \mathcal{N}(\mathbf{y}; \mathbf{0}, K_\theta + \sigma^2 I),$$
where $P(\theta)$ incorporates the sparsity-inducing priors. The integration step effectively averages over all possible $f(\cdot)$, removing the need to estimate each instance of $f(\cdot)$ separately. This step also increases power to detect loci that contribute to phenotypes.
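The marginalized likelihood is just a zero-mean multivariate normal in $\mathbf{y}$; a sketch of evaluating it (hypothetical helper name; RBF kernel, noise variance, and inputs chosen for illustration):

```python
import numpy as np

def gp_log_marginal(y, K, sigma2):
    """Log marginal likelihood log N(y; 0, K + sigma2 * I),
    with the latent function values f integrated out analytically."""
    n = len(y)
    C = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                 # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * (y @ alpha)                # data-fit term
            - np.log(np.diag(L)).sum()        # -0.5 * log det(C)
            - 0.5 * n * np.log(2 * np.pi))

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))
y = rng.multivariate_normal(np.zeros(30), K + 0.1 * np.eye(30))
print(gp_log_marginal(y, K, 0.1))
```

In the paper's setting this quantity, combined with the sparsity-inducing priors on the kernel hyperparameters, is exactly what the sampler explores.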
There is no analytical solution for the posterior mode or mean of $\theta$. However, a sampling-based approach (e.g. MCMC) can be used to explore the posterior from a starting point and locate its mode. In the Sharp et al. paper, Hybrid (Hamiltonian) Monte Carlo, which simulates a particle's trajectory through the parameter space, was used to make inference over $\theta$.
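A minimal sketch of the Hybrid (Hamiltonian) Monte Carlo idea on a toy one-parameter target (not the paper's actual sampler; the step size and trajectory length are illustrative):

```python
import numpy as np

def hmc_step(theta, grad_log_post, log_post, eps=0.1, n_leapfrog=20, rng=None):
    """One HMC step: simulate a particle's trajectory with leapfrog
    integration, then accept or reject based on the total energy."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.normal(size=theta.shape)                     # random momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new = p_new + 0.5 * eps * grad_log_post(theta_new)  # half momentum step
    for _ in range(n_leapfrog - 1):
        theta_new = theta_new + eps * p_new
        p_new = p_new + eps * grad_log_post(theta_new)
    theta_new = theta_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad_log_post(theta_new)  # final half step
    # Metropolis correction on the Hamiltonian (negative log posterior + kinetic)
    h_old = -log_post(theta) + 0.5 * (p @ p)
    h_new = -log_post(theta_new) + 0.5 * (p_new @ p_new)
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return theta_new
    return theta

# Toy target: a standard normal "posterior" over a single parameter
rng = np.random.default_rng(4)
theta = np.zeros(1)
samples = []
for _ in range(2000):
    theta = hmc_step(theta, lambda t: -t, lambda t: -0.5 * (t @ t), rng=rng)
    samples.append(float(theta[0]))
```

Using gradient information to propose long, correlated moves is what lets HMC explore high-dimensional hyperparameter posteriors (one length-scale per SNP) far more efficiently than random-walk MCMC.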
Once the parameters are estimated, one can use these estimates to quantify broad-sense heritability from the Gaussian Process Regression model. The basic idea is as follows: