Driving Genes

The driving genes are those that contribute most to the shape and variability of the pathway scores. In mathematical terms, they correspond to the variables with the highest absolute loadings in the principal components that define the pathway signatures. Since the loadings in PCA can be understood as the weights applied to each original variable (gene expression) when calculating the scores of the principal components (pathway activity), the driving genes can be interpreted as the genes that are pulling the pathways.
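The relationship between loadings and scores can be illustrated with a minimal sketch. It assumes a small simulated expression matrix (samples in rows, genes in columns) and computes PCA through a plain SVD; the data, the variable names and the choice of the first component are illustrative assumptions, not part of PANA itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression matrix: 20 samples x 6 genes of one pathway.
n_samples, n_genes = 20, 6
latent = rng.normal(size=(n_samples, 1))
# Genes 0 and 1 follow the latent signal strongly; the rest are mostly noise.
weights = np.array([[3.0, 2.5, 0.1, 0.1, 0.1, 0.1]])
X = latent @ weights + 0.3 * rng.normal(size=(n_samples, n_genes))

# PCA through the SVD of the column-centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt[0]        # weight of each gene in the first component
scores = Xc @ loadings  # first-component score (pathway activity) per sample

# Genes ranked by absolute loading, largest first.
ranking = np.argsort(-np.abs(loadings))
print("absolute loadings:", np.round(np.abs(loadings), 3))
print("genes ranked by contribution:", ranking)
```

Each score is a weighted sum of the centered expression values, with the loadings as weights, so genes with loadings near zero barely affect the pathway activity.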

In other words, these genes are fundamental pieces in the regulation of these pathways. For this reason, the study of the feasibility and biological relevance of the pathway links can focus on the analysis of their driving genes.

Regarding the computational identification of the driving genes, PANA uses the minAS (minimum algorithmic selection) method, an algorithmic strategy that classifies features according to the values of a univariate statistic measuring their importance. The only assumption required to apply minAS is that the distribution of the statistic is at least bimodal, that is, the statistic follows a mixture distribution with at least two components. The first component is the one associated with the smallest values of the statistic and, hence, with the features that should be discarded. The minAS strategy is especially useful when the distribution of the mixture components is difficult to estimate parametrically.

The method consists of two basic steps:

  • minAS empirically estimates the density function of the mixture using a kernel density estimator with a Gaussian kernel. The bandwidth (smoothing parameter) is computed following Silverman's rule, which takes into account the dispersion and size of the data.

  • minAS locates the first local minimum of the estimated density and uses the value at which it is reached as the threshold that separates the first component from the rest.
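The two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the PANA implementation: the function name minas_threshold, the grid size and the simulated data are hypothetical, and SciPy's "silverman" bandwidth option is one variant of Silverman's rule that may differ from the one used by minAS.

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde


def minas_threshold(stat, grid_size=512):
    """Return the location of the first local minimum of the KDE of `stat`.

    Hypothetical helper for illustration; not the PANA/minAS source code.
    """
    stat = np.asarray(stat, dtype=float)
    # Step 1: Gaussian kernel density estimate with a Silverman-type bandwidth.
    kde = gaussian_kde(stat, bw_method="silverman")
    grid = np.linspace(stat.min(), stat.max(), grid_size)
    density = kde(grid)
    # Step 2: first strict local minimum of the estimated density on the grid.
    minima = argrelmin(density)[0]
    if minima.size == 0:
        raise ValueError("no local minimum found; the density looks unimodal")
    return grid[minima[0]]


# Simulated statistic: many values near zero plus a smaller group of large ones.
rng = np.random.default_rng(1)
noise = np.abs(rng.normal(0.0, 0.02, size=300))
signal = rng.normal(0.5, 0.05, size=60)
stat = np.concatenate([noise, signal])

threshold = minas_threshold(stat)
selected = stat > threshold
print(f"threshold = {threshold:.3f}, selected features = {selected.sum()}")
```

Features whose statistic lies above the threshold are kept. If the estimated density has no local minimum, the bimodality assumption does not hold and the sketch raises an error instead of returning an arbitrary cut-off.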

In PANA, the values of the statistic are the absolute values of the loadings from a PCA model. Thus, the first component of the mixture corresponds to loadings close to zero, i.e., to features that contribute little to the PCA model. Hence, minAS obtains the empirical threshold that separates those “unimportant” features from the rest, so that the most important features can be selected for the subsequent analysis.

The list of driving genes associated with each pathway signature is provided in this link.