
Extracting Biologically Meaningful Information from Gene Expression Data: Gene CoExpression Networks

Data generated from gene expression experiments hold a substantial amount of biological information (Eisen et al., 1998). The end point of any analysis of this sort is to gain a thorough view and understanding of the “inner life” of a cell, i.e. the biological processes ongoing in the cell. This can be considered a bottom-up approach, whereby we slowly build our way up from transcript levels to cellular processes and ultimately to an understanding of the biological process in question (combined, of course, with other appropriate methods in order to extract causal relationships). A natural assumption is that genes with similar expression patterns within a dataset may participate in common biological processes or even be under the same regulatory mechanism(s) (Tavazoie et al., 1999). Clustering genes with similar expression patterns is a useful way to gain this sort of information, and also to putatively extend the information on common regulatory control to infer the participation of genes in various pathways. Paraphrasing/quoting from Eisen et al., 1998: “Statistical organization (clustering) and graphical display of a microarray dataset allows for researchers to assimilate and explore data in a biologically meaningful way. … Also, similarity in the gene expression pattern may be the easiest way to make -at least provisional- attribution of function on a genomic scale”.
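To make the idea of clustering by expression similarity concrete, here is a minimal sketch in plain Python. It assumes the Pearson correlation coefficient as the similarity measure and a simplified average-linkage merging scheme; it is not necessarily the exact procedure of Eisen et al. The gene names, the expression values and the helper names `pearson` and `average_linkage` are hypothetical, chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(expr, n_clusters):
    """Repeatedly merge the two most similar clusters until n_clusters remain.

    Cluster similarity is the mean Pearson correlation over all
    between-cluster gene pairs (a simplified average-linkage rule).
    """
    clusters = [[gene] for gene in expr]

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the highest mean correlation.
        i, j = max(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: expression of four genes across four conditions.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 4.2, 1.9],
}

clusters = average_linkage(expr, n_clusters=2)
print(clusters)
```

In this toy dataset the rising profiles (geneA, geneB) and the falling profiles (geneC, geneD) end up in separate groups, which is the kind of grouping one would then inspect for shared function or shared regulation.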

Along the same lines, transcription factor (TF) binding sites are critical to our understanding of transcription and transcriptional regulation. A TF binding site lies close to or within a promoter region; a TF bound there can therefore regulate transcription either by recruiting RNA polymerase to the promoter or by blocking its docking on the DNA. The actions of TFs are transcript-specific, i.e. each TF has a range of genes whose transcription it modulates. The “physical approach” to constructing gene networks seeks to determine the TFs and the respective DNA motifs to which they bind to regulate transcription. Another strategy, the “influence approach”, works with gene expression data and describes the relationships between transcript levels and how transcripts interact to regulate each other’s transcription. The transcript interactions are described with a graph, in which the nodes represent transcripts and the edges represent a relationship between the connected transcripts, defined by the graph-construction method followed. The graph can be constructed as a system of differential equations, a Bayesian network, a Boolean network or an association network. The latter approach creates a gene coexpression network by assigning edges to pairs of genes with high statistical similarity. Different similarity metrics have been used, such as Euclidean distance, the Pearson correlation coefficient, mutual information (e.g. ARACNE, CLR) and the partial correlation coefficient (graphical Gaussian models, GGMs). Moreover, to analyze gene expression data from time-series experiments, appropriate algorithms extract correlation relationships between transcript level changes at different time points (Schmit et al., 2004; Arkin et al., 1997).
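As an illustration of the association-network idea, the sketch below builds a small coexpression network by connecting every pair of genes whose Pearson correlation exceeds a cutoff in absolute value. Treat it as one possible reading of “high statistical similarity”, not as a reference implementation of any published method: the 0.9 threshold, the use of the absolute value (so that strongly anti-correlated genes are also linked), the toy expression values and the function name `coexpression_network` are all assumptions made for this example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_network(expr, threshold=0.9):
    """Return {(gene_i, gene_j): r} for every pair with |r| >= threshold.

    Nodes are the genes in `expr`; each returned key is an undirected edge.
    The threshold is an arbitrary illustrative choice.
    """
    edges = {}
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


# Hypothetical toy data: expression of four genes across five samples.
expr = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "gene3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "gene4": [3.0, 1.0, 4.0, 1.0, 5.0],
}

edges = coexpression_network(expr, threshold=0.9)
for (g1, g2), r in edges.items():
    print(f"{g1} -- {g2}  (r = {r:+.3f})")
```

Swapping `pearson` for another metric from the list above (Euclidean distance, mutual information, partial correlation) changes which edges appear but leaves the overall construction unchanged: score every pair, then keep the pairs that pass a cutoff.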

Genomic strategies are nowadays advancing at great speed, and the amounts of data generated are massive. The aforementioned network approaches, borrowed from graph theory and statistics, hold the promise of revealing critical biological information where the “data mining” ability of a bench researcher stops. This is especially important for cancer research, though not only there. For example, breast cancer is one of the leading causes of cancer death in women. It is itself a heterogeneous disease, both in terms of histological origin/initiation (e.g. it can develop in the ducts or the lobules of the breast) and in terms of the mutational landscape of the cancer cells. The latter means that the tumor itself can be highly heterogeneous. Combining transcript-level analysis by coexpression networks with the recent advances in breast tumor whole-genome sequencing (see Gray and Druker, 2012) may prove critical to our understanding of cancer initiation and evolution.

For more information on coexpression network construction, the interested reader is referred to Gardner and Faith (2005).


Tavazoie et al., Nature Genetics 1999

Eisen et al., PNAS 1998

Gardner and Faith, Phys Life Rev 2005

Schmit et al., Genome Res 2004

Arkin et al., Science 1997

Gray and Druker, Nature 2012
