Biological validation of computationally determined gene-gene interaction

Biological validation of computationally determined gene-gene interaction

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

How can a computationally determined three-way gene interaction be biologically validated? What kind of assays or tests must be performed using cell/tissue-based and/or mouse models to prove that the three genes may indeed have a joint effect?

Let's say, it is easier to identify and validate gene interactions that involve transcription factors like FOXM1. Consider a three-way gene interaction in Breast cancer - FOXM1-BUB1-CHEK1 - that can be tested for direct interactions via western blot and reporter assays. But such interactions may or may not be interesting - given the fact that transcription factors may be affecting expression levels of other genes. Most computational studies are focussed on identifying gene interactions based on coexpression or co-occurrence. There is literature on computationally identifying AND/OR relationships between interacting genes. I don't have a specific example to provide but, for argument sake, if we did suspect that three genes are interacting in an AND like manner, how do we biologically validate this finding? I'd also appreciate if you may have comments on the usefulness of such findings, particularly, with regard to designing more efficacious combination therapies against a disease".


Hang in there, this answer will grow over time.

Just off the top of my head, I think we could write a small book (or a very long review article) to cover both the depth and the breadth of your question(s).

First, we need to define, and/or clarify, some terms. The term interaction probably means different things to different authors at different times. In fact, one author may use this term to mean different things at different times. So I would like to propose that we make some distinctions. For example, I would make a distinction between a genetic interaction between two genes (which is typically detected in one of the model organisms, using a genetic test between two loss-of-function alleles) versus a protein-protein interaction (sometimes called a PPI) between the two proteins encoded by those same two genes. PPIs have been detected classically by co-sedimentation in a sucrose density gradient, or co-immunoprecipitation (co-IP), or by affinity chromatography. More recently PPIs have been detected by mass spectrometry (MS) (sometimes coupled with affinity chromatography). PPIs have been inferred by using proxy assays, such as the yeast two hybrid (Y2H) assay.

PPIs can also be detected by surface plasmon resonance (SPR) assays, and there may well be others.

So, when you say gene-gene interactions, what precisely do you mean?

I'm gonna try to cover as many scenarios as I can think of:

  • Gene interaction ( = co-expression):
    This is probably the easiest to validated (e.g. by western blot like you mentioned) but the hardest to interpret. Without further analysis of the genes functions you have no idea what effect their co-occurring activity might have. If the genes are metabolic enzymes or part of a signal cascade it would be a good idea to check both their individual roles in these pathways, but also check for potential synergistic effects. The same is in principal also true for transcription factors, but in that case 'pathways' are often not equally well understood, instead go for:

  • Gene interaction ( = on same targets):
    In the case of transcription factors (but also e.g. Kinases/Phosphatases) that are co-expressed it might not unreasonable that they also affect the same targets. Here you need to again compare both individual effects (list of up/down regulated genes) with synergistic effects (e.g. geneA prevents normal function of geneB, geneC increases geneB function multifold). For transcription factor analysis you should generally analyse expression on the mRNA level (via qPCR) since the protein level introduces (at least) another layer of regulation.

  • Protrein interactions:
    A good way to prove these experimentally is by using co-IP (immuno-precipitation) experiments: you use an antibody to retain a specific protein and all its (direct) binding partners. Then you show the presence of the partners you're interested in with a western blot.
    One caveat for this method with 3-way (or higher order) interactions is that - depending on the binding strength of the proteins - it can be very hard to prevent anything but direct interaction partners from being co-purified. If you want to absolutely prove that two proteins interact directly a yeast-2-hybrid (y2h) experiment is better suited.

Others have covered a lot more ground but I'll throw in a couple of approaches from the spectroscopy side since I've been looking at them recently.

Bioluminescent resonant energy transer (BRET) shows if proteins bind in vivo: "BRET measures the interaction of proteins using a bioluminescent donor fused to a protein of interest and a fluorescent receptor fused to its binding partner. The bioluminescent donor, usually a luciferase, does not excite the fluorophore using light, but transfers resonance energy through dipole-dipole coupling. To transfer resonance energy, the donor must be within 10nm of the receptor and in the proper orientation making the technique useful for measuring proteins in close proximity." (From

Surface plasmon resonance (SPR) also shows if proteins bind, but in vitro: "SPR is sensitive to changes in refractive index within about 150 nm from the sensor surface. To study the interaction between two binding partners, one partner is attached to the surface and the other is passed over the surface in a continuous flow of sample solution. The SPR response is directly proportional to the change in mass concentration close to the surface. Biacore systems can be used to study interactions involving (in principle) any kind of molecule, from organic drug candidates to proteins, nucleic acids, glycoproteins and even viruses and whole cells." (From the Biacore assay handbook)

These are both ways to look at direct interaction, ie binding. Functional interaction is a whole different story.

Computational methods for chromosome-scale haplotype reconstruction

High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short read sequencing does not yield haplotype information spanning whole chromosomes directly. Computational assembly of shorter haplotype fragments is required for haplotype reconstruction, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review recent and discuss methodological progress and perspectives in these areas.

Target Identification

Identifying a biological target that is ‘druggable’ – a target is termed ‘druggable’ if its activity (behavior or function) can be modulated by a therapeutic – whether it be a small molecule drug, or biologic. Proteins and nucleic acids are both examples of biological targets. 2

But what makes a ‘good’ target?

  1. The target has a confirmed role in the pathophysiology of a disease and/or is disease-modifying.
  2. Target expression is not evenly distributed throughout the body.
  3. The target’s 3D-structure is available to assess druggability.
  4. The target is easily ‘assayable’ enabling high-throughput screening.
  5. The target possesses a promising toxicity profile, potential adverse effects can be predicted using phenotypic data.
  6. The proposed target has a favorable intellectual property (IP) status. (relevant for pharma companies)

Access options

Get full journal access for 1 year

All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.

Get time limited or full article access on ReadCube.

All prices are NET prices.


Analysis 1, Tests for Interaction among CVRF SNPs

From the literature sources described above, we identified 242 independent SNPs reported to be robustly associated with CVRFs or cardiovascular endpoints these SNPs, the reported phenotypes, and the p-values for association with MI in the MIGen study are shown in File S1 Supporting Table 1. Using Test A, we performed 29,161 pair-wise interaction tests among these 242 risk factor SNPs (File S1 Supporting Figure 2 ), the results of which did not deviate significantly from their empirical expected distribution ( Figure 2b ). The most significant interaction (p =𠂥.54휐 𢄦 see File S1 Supporting Table 1) occurred between SNPs originally reported to be associated with LDL cholesterol levels (rs2072183, in NPC1L1) and smoking initiation (rs1013442, near BDNF). This result did not exceed the significance threshold for this Analysis (p =𠂡.51휐 𢄦 Figure 2a File S1 Supporting Table 2). Under an interaction model with additive × additive effects, we estimated that this analysis had high power (80%) to detect an odds ratio (OR) for interaction of between 𢏁.6 and 𢏁.3 when both SNPs have a MAF of 𢏀.2 and 𢏀.5, respectively ( Figure 2c File S1 Supporting Table 3, File S1 Supporting Figure 4).

Analysis 2, Tests for Interaction between CVRF SNPs and Marginal SNPs (p� 𢄣 )

We selected 656 independent SNPs that showed moderate marginal association (p� 𢄣 ) with MI in the MIGen study and excluded 13 that had been captured in Analysis 1. Using Test A, we performed 155,606 interaction tests between the remaining 643 SNPs and the 242 CVRF SNPs (File S1 Supporting Figure 2 ), the results of which did not deviate significantly from their empirical expected distribution (File S1 Supporting Figure 3). The most significant result for interaction was p =𠂩.48휐 𢄧 , between SNPs associated with HDL cholesterol levels (rs3136441, in LRP4) and MI (rs9990208, located near RFTN1 and DAZL on chromosome 3, p =𠂡.2휐 𢄤 in MIGen). This result did not exceed the significance threshold for this Analysis (p =𠂣.13휐 𢄧 File S1 Supporting Table 2, File S1 Supporting Figure 3). Under an additive × additive interaction model, this analysis was estimated to have high power to detect interaction effects of between 𢏁.7 and 𢏁.4 for SNPs with MAF of 𢏀.2 and 𢏀.5, respectively (File S1 Supporting Table 3, File S1 Supporting Figure 4).

Analysis 3a, Tests for Interaction among Marginal SNPs (p� 𢄣 )

For the 643 independent SNPs that achieved a p-value of � 𢄣 for association with MI in the MIGen study and that were not captured in Analysis 1, we performed 201,537 pair-wise interaction tests using Test B (out of a possible 206,403 pairs test not feasible for 4,866 pairs (𢏂.35%) due to low allele frequencies, see File S1 Section 3.3, File S1 Supporting Figure 2 ). The results of these tests did not deviate significantly from their empirical expected distribution (File S1 Supporting Figure 3). The most significant p-value for interaction was 3.49휐 𢄦 , between rs761174 (within HHAT on chromosome 1, p =𠂡.75휐 𢄥 in MIGen) and rs167490 (within CHST11 on chromosome 12, p =𠂥.92휐 𢄤 in MIGen), which did not exceed the significance threshold for this Analysis (p =𠂢.93휐 𢄧 File S1 Supporting Figure 3c). Under an additive × additive interaction model, this analysis was estimated to have high power to detect interaction effects of between 𢏁.75 and 𢏁.4 for SNPs with MAF of 𢏀.2 and 𢏀.5, respectively (File S1 Supporting Table 3, File S1 Supporting Figure 4).

Analysis 3b, Tests for Interaction among Marginal SNPs (p� 𢄢 )

Relaxing the minimum threshold of the observed marginal effects of putative interacting SNPs, we selected 6,066 independent SNPs that achieved a p-value of � 𢄢 for association with MI in the MIGen study and that were not captured in the previous Analyses, and performed 17,470,706 interaction tests, out of a possible 18,180,305 pairs (discarded 214,840 tests already captured by previous Analyses test not feasible for a further 709,599 (𢏃.9%) pairs due to low allele frequencies, see File S1 Section 3.3, File S1 Supporting Figure 2 ). The results of these tests did not deviate significantly from their empirical expected distribution (File S1 Supporting Figure 3). The most significant p-value for interaction was 5.51휐 𢄨 , between rs194243 (between CYP26B1 and EXOC6B on chromosome 2, p =𠂣.97휐 𢄣 in MIGen) and rs4589969 (within CACNA2D3 on chromosome 3, p =𠂧.75휐 𢄣 in MIGen), which did not exceed the significance threshold for this Analysis (p =𠂣.57휐 𢄩 File S1 Supporting Figure 3d). Under a double-additive model, this analysis was estimated to have high power to detect interaction effects of between 𢏁.85 and 𢏁.45 for SNPs with MAF of 𢏀.2 and 𢏀.5, respectively (File S1 Supporting Table 3, File S1 Supporting Figure 4).

Validation of the Top Results from Analyses 1𠄳 in an Independent Sample

While the minimum observed p-values in each Analysis were 𢏃� times larger than the corresponding significance threshold, it is possible that real interaction effects are present but could not be declared statistically significant because of the demanding multiple testing burden. Therefore, we sought to validate our findings for all SNP pairs that achieved a p-value for interaction within 3 orders of magnitude of the required significance threshold in each Analysis (File S1 Section 3.8). In a large sample of cases of CHD and controls from the WTCCC (File S1 Section 1), we replicated our analysis for 47, 49, 45 and 50 pairs of SNPs (out of 48, 52, 54 and 55 pairs that met this criterion) in Analyses 1, 2, 3a and 3b, respectively. After correcting for multiple testing, none of these pairs showed nominally significant evidence of interaction in the WTCCC data (File S1 Supporting Table 2) for the SNP pairs from Analysis 1 (pmin =𠂠.0041 α𢒀.05/47𢒀.0011), Analysis 2 (pmin =𠂠.0392 α𢒀.05/49𢒀.001), Analysis 3a (pmin =𠂠.006 α𢒀.05/45𢒀.001) or Analysis 3b (pmin =𠂠.012 α𢒀.05/50𢒀.001). Similarly, we observed no additional evidence of interaction after performing a meta-analysis of both studies (see File S1 Section 3.8 for methods and File S1 Supporting Table 2 for results Analysis 1, pmin =𠂡.49휐 𢄥 Analysis 2, pmin =𠂡.41휐 𢄥 Analysis 3a, pmin =𠂡.01휐 𢄤 Analysis 3b, pmin =𠂧.01휐 𢄧 significance thresholds equal to those for the corresponding discovery Analyses, p =𠂡.51휐 𢄦 , p =𠂣.13휐 𢄧 , p =𠂢.93휐 𢄧 , p =𠂣.57휐 𢄩 , respectively).


A Motivating Example.

A major challenge of genome-wide analyses is how to extract sparse signals from large-scale datasets, which tend to be heterogeneous and noisy. To illustrate how the level of noise in the data increases the complexity of detecting genes involved in a specific biological process, we performed a simple study of the cholesterol metabolic process using transcriptomic measures from 426 LCLs (lymphoblastoid cell lines) derived from participants of the CAP (Cholesterol and Pharmacogenetics) statin clinical trial (13) (CAP-LCLs). This is one of the major datasets that we use in this paper to demonstrate the performance of our GeneFishing method.

From Ensembl BioMart (, we extracted 120 genes that are annotated with the GO BP (Gene Ontology biological process) term “GO:0008203 cholesterol metabolic process,” of which 82 are expressed in the CAP-LCL dataset. We first measured coexpression of all gene pairs as the absolute value of Spearman rank correlation of gene expression values across subjects. Thus, our data can be thought of as a T × T gene coexpression matrix (here, T = 82). We next performed spectral analysis based on the coexpression matrix to project each gene onto the space of the first 2 non-0 eigenvectors of the normalized Laplacian matrix and identified a tight cluster of 21 genes (Fig. 1A), 18 of which encode enzymes in the cholesterol biosynthesis pathway (14), with the remaining 3 genes known to be involved in the transcriptional regulation of these 18 genes (i.e., INSIG1 and SREBF2) or complementary functions (LDLR, the key regulator of low density lipoprotein [LDL] uptake) (SI Appendix, Fig. S1 and Table S1). To test whether this tight cluster persisted in the context of other genes, we repeated the analysis using gene sets composed of the 21 genes as well as 100, 1,500, and 2,000 random genes (Fig. 1 B to D). Since the majority of genes should be unrelated to cholesterol metabolism, we expect that the sheer number of pairs of such genes outweighs those that show patterned relations among our subjects. As shown in Fig. 1B, the 21 genes created an obvious cluster when mixed with 100 random genes. However, this cluster became obscured in the presence of larger sets of random genes as shown in Fig. 1 C and D. These results illustrate how the information provided by the 21 cholesterol genes is progressively obscured by random noise patterns with increasing numbers of random genes.

Motivation and workflow of GeneFishing. (A to D) Spectral clustering plot of the 21 bait genes (colored in red) with another 61 genes (colored in blue) associated with the GO BP term “cholesterol metabolic process” (A), and 100 (B), 1,500 (C), and 2,000 (D) random genes (colored in gray). (E) Workflow of GeneFishing.

The GeneFishing Procedure.

Our goal is to develop an effective procedure to identify genes relevant to known biological processes using transcriptomic data. Leveraging the clustering of the 21 cholesterol-related genes observed above, we develop GeneFishing, a semisupervised, nonparametric clustering procedure based on a bagging-like idea to reconstruct portraits of biological processes of interest in varying context. The input data of GeneFishing are an M × T matrix representing the normalized expression values of T genes across M subjects together with a small set of preidentified genes known to be relevant to the biological process of interest (such as the 21 genes mentioned in the motivating example). This set of genes can be used as “bait” genes to guide our search for additional genes that are potentially relevant to the biological process.

The GeneFishing flowchart is shown in Fig. 1E. Given bait genes, step 1, reduction of search space, is key in our method, facilitating the pulling of “signal” out of the “noise.” In particular, the candidate genes are randomly split into many subsearch spaces of m genes each (e.g., m = 100). The bait genes are then added to each of the candidate gene subsets. In step 2, coexpression matrices are constructed for gene pairs contained within each subsearch space, and the spectral clustering algorithm was applied to each matrix separately. The current implementation uses Spearman rank correlation to generate gene coexpression matrices. Other coexpression measures can be more appropriate in other contexts as discussed in refs. 15 ⇓ –17. While in most cases, the bait genes cluster separately from the candidate genes, in some instances candidate gene(s) will cluster with the bait genes (e.g., when a gray dot clusters within the red dots as shown in Fig. 1B). When this occurs, we consider the candidate gene to be “fished out.” Since a candidate gene may randomly cocluster with the bait genes, we repeat steps 1 and 2 (defining 1 round of GeneFishing) n times (e.g., n = 1000). In step 3, the results from all rounds are aggregated. The final output is a table that records the “capture frequency rate” (CFR the ratio of the number of times that each candidate gene has been fished out in the n rounds of GeneFishing to n). We consider fished out genes with large CFR values as “discoveries.” Note, however, that we can only conclude that these discoveries are likely functionally related to the bait genes, not that they perform a specific or similar function as the bait genes. Complete technical details of the GeneFishing procedure as well as the computation of approximate P values and false discovery rates (FDRs) are provided in Methods and SI Appendix.

Evaluating GeneFishing with Real and Simulated Datasets.

All statistical models (or methods) in genomics are crude approximations to reality. They are used to generate procedures and provide measures using model-based inferences of the potential validity of perceived findings. In the usual case when we lack reliable models for some of the biological systems of interest, we focus on 3 minimum requirements: interpretability, replicability, and stability (18). By interpretability, we mean that some of the results can be related to known biology and ideally, guide further experimental study. Replicability refers to the stability of conclusions when the same methodology is applied to similar independent datasets. Stability means that conclusions should vary little under small statistical perturbations of the data and the model.


We first assessed whether the discoveries derived from GeneFishing were biologically plausible. Since genes involved in sterol metabolism are themselves well known to be transcriptionally coregulated, we used the 21 genes discussed in our motivating example as bait genes and applied GeneFishing to the CAP-LCL dataset. We noted that the CFR distribution of GeneFishing was strongly bimodal, which indicates a very natural cutoff for CFR (Fig. 2A). Finally, we identified 27 genes with CFR ≥ 0.99 (SI Appendix, Table S2). Interestingly, 10 of these had known roles in lipid or sterol metabolism and included TMEM55B, which we had previously identified as a cholesterol-regulatory gene based on its very high degree of coexpression with HMGCR, 1 of the 21 bait genes (19).

Evaluation of GeneFishing. (A) Distribution of CFR values when GeneFishing was applied to the CAP-LCL dataset. (B) For each method, 2 ranked gene lists were generated by applying the method to the CAP-LCL and GEUVADIS-LCL datasets. Each colored curve corresponds to a gene prioritization method, plotting the number of overlapped genes between the 2 lists up to a rank position (y axis) against the rank (x axis). GBA is short for guilt-by-association. (C) Scatterplots of the CFR values when GeneFishing was applied to the raw CAP-LCL dataset and 3 randomly perturbed datasets.


To assess replicability, we tested the performance of GeneFishing in 2 other independent LCL datasets: the GEUVADIS-LCL (20) (462 lymphoblastoid cell lines from Genetic European Variation in Disease project) dataset and the GTEx-LCL (4) dataset (118 lymphoblastoid cell lines from GTEx project). We first checked the expression of the 21 bait genes in both datasets and observed clear clustering of the 21 genes again by spectral analysis (SI Appendix, Fig. S2 A and B). We then applied GeneFishing to each dataset using the 21 genes as bait and tested the overlap within the top t fished out genes (ordered by CFR values with t varying from 20 to 100) between the 3 (CAP, GEUVADIS, and GTEx). For benchmarking purposes, we also compared GeneFishing with other methods, including WGCNA (21) (weighted correlation network analysis, an unsupervised approach for finding gene coexpression clusters) and 3 different versions of guilt by association approaches (i.e., the association between a candidate gene and the set of bait genes is evaluated by the mean, median, and maximum of the Spearman rank correlations between the candidate and each of the bait genes, respectively). Of the methods tested, GeneFishing had the best (or equally good) replicability (Fig. 2B and SI Appendix, Fig. S2C).


Using the CAP-LCL dataset, we assessed the stability of GeneFishing under the following 3 scenarios: (i) when random genes are included in the set of bait genes (i.e., there is noise in the bait set), (ii) when only a subset of the 21 genes is used as baits, and (iii) when the method is applied to subsamples of all subjects (e.g., 80% of subjects were used to construct a gene–gene coexpression matrix when performing GeneFishing). As shown in Fig. 2C, the CFR values of each scenario were reasonably correlated with that derived from original CAP-LCL dataset, especially for high CFRs (e.g., when CFR > 0.9). This suggests that GeneFishing is quite robust to small perturbations of the input dataset. We also performed a simulation study to further investigate the stability of GeneFishing, and the results are presented in SI Appendix.

Application of GeneFishing to Liver and a Follow-Up Wet Laboratory Experiment Implicate GLO1 as a Cholesterol Metabolism Regulator.

Since the liver is the major organ that impacts plasma cholesterol, we applied GeneFishing to the GTEx human liver RNAseq dataset (119 samples). After confirming clear clustering of the 21 bait genes (SI Appendix, Fig. S3A), we identified 56 genes with a CFR ≥ 0.99 (SI Appendix, Table S3). GO term enrichment analysis (with R package GOStats) (22) identified substantial enrichment for multiple GO terms related to sterol metabolism, including “lipid metabolic process” (FDR = 7.56E-09) and “lipid biosynthetic process” (FDR = 5.29E-07). Next, since many genes involved in cholesterol metabolism are themselves transcriptionally regulated by cellular sterols, we sought to determine if any of the 56 genes showed evidence of sterol regulation. We performed transcriptome-wide sequencing on HepG2 cells that were first sterol depleted (incubated with 2 μM simvastatin + 10% lipoprotein-deficient serum for 24 h), after which 50 μg/mL low-density lipoprotein cholesterol (LDLC) was added back and incubated for an additional 24 h. Of the 56 genes, transcript levels of 28 genes were changed in response to sterol depletion (adjusted P value < 0.05), with effects reversed by LDLC add back (SI Appendix, Table S3). Interestingly, 13 of the 56 genes did not appear to be changed in response to sterol depletion (P value > 0.5) 6 of the 56 genes were not expressed at a high-enough level in the HepG2 cells to meet the minimum threshold for expression. Several of the genes identified (e.g., MMAB, SNAI3-AS1) appeared to share promoter elements with 1 of the 21 bait genes (SI Appendix, Fig. S3B).

Of the genes not previously implicated in cholesterol metabolism, we tested the effect of knockdown of 11 of these genes on measures of intracellular cholesterol. We purposefully selected some genes that showed no evidence of sterol regulation (e.g., GLO1, TDRKH, TTC39B, and C2orf82) (Fig. 3A), as the reason why and/or how these genes may have been identified by GeneFishing was unclear. Huh7 cells were reverse transfected with siRNAs (silencing RNAs) targeting each gene of interest or a nontargeting control siRNA, and after 48 h, changes in gene expression and cellular cholesterol were quantified by qPCR and via the Amplex Red Cholesterol assay, respectively (Fig. 3A). Knockdown of 2 genes, GLO1 and RDH11, significantly impacted transcript levels of SQLE, which encodes an enzyme in the cholesterol synthesis pathway (Fig. 3B). This change was confirmed in a second human hepatoma cell line, HepG2 (Fig. 3C). In addition, consistent with the increase in SQLE levels, we found that GLO1 knockdown significantly increased cellular cholesterol esters in both Huh7 and HepG2 cells (Fig. 3D).

Effect of candidate gene knockdown on transcript levels of cholesterol related genes. (A) Transcript levels (in the Huh7 cell line) of candidate genes were quantified by SYBR Green assay via qPCR to assess the degree of gene knockdown. (B) Transcript level of SQLE (in the Huh7 cell line) was quantified by SYBR Green assay to test whether candidate genes knockdown modulated its expression level. (C) Transcript levels (in the HepG2 cell line) of GLO1 and RDH11 were quantified by SYBR Green assay via qPCR to assess the degree of gene knockdown. Transcript level of SQLE (in the HepG2 cell line) was quantified by SYBR Green assay to test whether GLO1 and RDH11 knockdown modulated its expression level. In A to C, data were analyzed using the delta Ct (cycle threshold) method and normalized to CLPTM1 transcript levels as a loading control. All qPCR assays were performed in triplicate. (D) Cellular cholesterol levels were quantified using the Amplex Red Cholesterol Assay kit with values normalized to total cellular protein quantified via Bradford assay. There are 3 to 6 replicates per treatment condition. NTC, nontargeting control.

Pantissue GeneFishing Analysis.

The cholesterol metabolic process functions widely in different human tissues. Motivated by the success of GeneFishing in the application to the GTEx liver data, we next sought to determine if the strong clustering of the 21 bait genes was also observed in other tissue types. In more detail, given a tissue, we performed the same spectral clustering analysis as in Fig. 1A and computed 2 statistics: tightness (defined as the ratio between within-cluster sum-of-squares and total sum-of-squares) of the cluster that contains most of the 21 genes and Jaccard index between the cluster and the 21 bait genes. Most tissues exhibited the 21 genes as a tight cluster. However, the 21-gene module was not apparent in some tissues due either to stronger coexpression with genes outside of the 21-gene module (e.g., adrenal gland) or to complete absence of coexpression (e.g., skeletal muscle) (Fig. 4B). Although it is well established that genes in the cholesterol synthesis pathway are coregulated, the change in their coexpression pattern that we observed across different tissues indicates an unexpectedly high degree of tissue specificity of such coregulation and meanwhile, may inform their unknown functions (or interesting connections of the cholesterol synthesis pathway to other biological processes).

Pantissue GeneFishing analysis. (A) Examination of modularity of the 21 bait genes across GTEx tissues. GeneFishing was applied to the 17 tissues inside the blue circle. The Inset shows the detailed coordinates of the 17 tissues. (B) The coexpression pattern of the genes associated with the GO BP term cholesterol metabolic process in 6 representative tissues. In each heat map, the row and column have identical gene orders, and the side bar indicates whether the gene belongs to the 21 bait genes (red means yes). (C) Visualization of pantissue GeneFishing results. Each row is associated with a gene, and each column is associated with a tissue (labeled with different colors). If the color of an entry is not gray, then it means that the CFR of the corresponding gene is higher than 0.9 in the corresponding tissue.

To construct a somewhat global picture of cholesterol metabolism as well as its potential cross-talk with other biological processes, we next applied GeneFishing to the 17 GTEx tissues in which the coexpression pattern of the 21 genes was well maintained. In the previous sections, when generating candidate gene lists for experimental validation, we used a very strict CFR ≥ 0.99 threshold here, we loosened the cutoff to 0.9, as the coexpression strength between bait genes and genes that are functionally linked to lipid metabolism are strongest in the liver as compared to other tissues. We discuss in SI Appendix that much lower cutoff points than 0.9 are still likely to correspond to very low FDR. In total, 329 genes were identified with a CFR larger than 0.9 in at least 1 tissue (SI Appendix, Table S4). Almost 74% (246 genes) of these were identified in only 1 tissue, while only 7.5% (28 genes) were identified in at least 8 tissues, illustrating that there is a high degree of tissue specificity. Tissue-specific GO enrichment analysis of the 329 genes identified 52 GO BP terms, each of which is significant in at least 1 tissue (FDR < 0.001). Interestingly, all of the 52 GO BP terms were child terms of the “GO:0008152 metabolic process” (SI Appendix, Table S5). As expected, “GO:0006629 lipid metabolic process” was enriched in the genes identified in all of the 17 tissues. We also performed hierarchical clustering based on the GO enrichment profile and found that 6 tissues (artery–aorta, artery–tibial, whole blood, thyroid, pancreas, and stomach) seemed to be distinct from the remaining 11 tissues due to a depletion of the GO terms that were broadly enriched in other tissues (SI Appendix, Fig. S4). For example, while “GO:0006641 triglyceride metabolic process” was identified in 10 of the other 11 tissues, it was not enriched in any of the 6 tissues mentioned above.

Comparing GeneFishing with GIANT and ENDEAVOUR.

Two popular methods, GIANT and ENDEAVOUR, were proposed before our study, and both of them have been widely used for gene prioritization. Although differing in key aspects from GeneFishing, the 3 methods share identical input–output schema: they all accept a group of seed (or bait) genes that are related to a biological process as input and return a list of genes that have been ranked according to computed functional relevance. We ranked all GTEx liver-expressed genes with GIANT and ENDEAVOUR. Since liver is the tissue that plays an important role in lipid metabolism and the 21 bait genes are all related to cholesterol metabolism, it is reasonable to expect that, in the returned gene list from any of the 3 gene prioritization methods, lipid metabolism-related genes should have high rankings. We found that GeneFishing captured the highest number of genes associated with the GO BP term “lipid metabolic process” among its top-ranked genes, demonstrating its superiority to the other 2 methods, at least in this application (Fig. 5). When compared with ENDEAVOUR, GeneFishing did substantially better in the identification of lipid-related genes. Although a similarly high number of lipid-related genes is found among the first 25 genes as ranked by our method and GIANT separately, our method outperforms GIANT substantially from then on. Interestingly, we found that gene PCSK9, a promising drug target to lower the LDLC level (which is also an SREBF2 target gene) (23), was fished out (with CFR = 1) by GeneFishing, while its priority rank in the ranked list of candidate genes by GIANT was low (rank 6,102). In addition, the distribution of functional relevance measure returned by GIANT did not show as strong of bimodality as GeneFishing, suggesting that the calibration of the GIANT scores seems quite inferior to ours (SI Appendix, Fig. S5). We note that GIANT and ENDEAVOUR attempt to incorporate multiple sources of data (such as gene expression, protein–protein interaction, DNA sequence) to perform gene prioritization. They thus have large advantages in terms of broad applicability. However, as we demonstrate here, the generality of the information that they use may lead them to miss patterns specifically related to the biological question of interest. This is consistent with the phenomenon that we observed in Fig. 1 (in which inclusions of too much input data or noisy candidate genes obscure signal) and that we believe accounts for the mediocre performance of “all-purpose systems” in this task.

In both panels, each colored curve corresponds to a method, with x axis representing the rank and the y axis representing the number of lipid metabolism genes among the top-ranked genes.

S1 Fig

The observed proportions of the nine possible SNP pair genotype combinations from models 5, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18 and 20 are depicted in this figure, per cases and controls. Genotypes are ordered according to minor allele frequency, with the wildtype homozygote appearing first, and the rare homozygote appearing last.

S2 Fig

The frequencies of the four possible SNP pair allele combinations from models 5, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18 and 20 are depicted in this figure, per cases and controls. The frequencies were estimated using an EM-algorithm.

S3 Fig

The logits of genotype combinations from models 5, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18 and 20 are depicted in this figure. Genotypes are ordered according to minor allele frequency, with the wildtype homozygote appearing first, and the rare homozygote appearing last. Non-parallel lines are indicative of interaction effects. The effects were estimated by absorbing the marginal effects of the SNPs into the SNP × SNP interaction term, and adjusting for the covariates included in the model by averaging over them.

S4 Fig

The observed proportions of SNP pair genotype combinations from models 3, 5, 7, 8, 9, 16, 17 and 18 are depicted in this figure, per cases and controls. Recessive/dominant effects in these models may better explain the interactions observed in the cohort (smaller p-values were achieved compared to the genotypic models, and the best models with 1 or more recessive or dominant encodings listed in S5 Table are presented in this figure). Rare homozygotes and heterozygotes are combined to represent dominant encoding of alleles, and wild type homozygotes and heterozygotes are combined to represent recessive encoding of alleles. For dominant and recessive allelic encodings of SNPs, the last genotype presented therefore reflects an encoding of 1.

S1 Table

The table summarizes the total number of samples that were successfully genotyped in each candidate gene study and how many samples have complete confounder information (age, gender and ancestry).

S2 Table

P-values were calculated using logistic regression.

S3 Table

A summary listing web URLs, version information and important parameter settings of software used in this study.

S4 Table

A spreadsheet with two worksheets, showing the results of the top 250 SAC models and the 245 Gambian models that were used for validation.

S5 Table

This table provides a summary of each SNP’s individual minor allele frequency (MAF) and association with having TB.

S6 Table

The genotypic model p-values, which were used to select the top 20 models, are presented in this table. The p-values of the corresponding allelic interaction models that achieved the smallest p-values are also shown.



The study population consisted of 1,293 unrelated healthy Korean individuals. These were the same individuals included as controls in our previous study of bipolar disorder [16]. They consisted mostly of college students, nurses, and public officials, who were recruited after a brief psychiatric interview. Potential participants were excluded if they reported a history of a psychotic disorder, mood disorder, anxiety disorder, substance use disorder, brain trauma, or intellectual disability. All participants were informed of the purpose and methods of the study and provided informed consent before enrollment. The Ethics Committee of Eulji General Hospital approved the study protocol (IRB No. 2016-08-009).

Measurement of chronotype

Chronotype was measured using a self-reported questionnaire. The CS is a 13-item questionnaire, which assesses individual differences in the time of day a person prefers to carry out various activities it classifies people as morning, intermediate, or evening types [5]. Three items are scored on a five-point scale from 1 to 5 the other 10 items are scored on a four-point scale, from 1 to 4. Higher scores indicate morning preference. All participants completed the CS questionnaire.


The clock genes investigated in this study were BHLHB2, CLOCK, CSNK1E, NR1D1, PER1, PER2, PER3, and TIMELESS. These eight genes were analyzed for 19 different tag single nucleotide polymorphisms (SNPs) with minor allele frequencies exceeding 5% in Asian populations. DNA was extracted from blood and SNPs were genotyped using the TaqMan method (Applied Biosystems, Foster City, CA, USA). Table 1 presents a summary of the minor allele frequencies and chromosomal locations of the SNPs.

Table 1.

SNPs of clock genes and minor allele frequency

GeneSNP a BaseChrPositionFunctionMAF
rs2137947CT34989276Noncoding transcript variant 0.323

SNP, single nucleotide polymorphism Chr, chromosome MAF, minor allele frequency UTR, untranslated region.

Statistical analysis

Individual SNPs were examined for Hardy-Weinberg equilibrium two SNPs violating Hardy-Weinberg equilibrium were removed. Each SNP association with CS score was analyzed by simple regression analysis. Haplotype association with CS was also analyzed by PLINK if more than two SNPs for each gene were included [17].

Gene-gene interactions were analyzed using the quantitative multifactor-dimensionality reduction (QMDR) method, an extension of the multifactor-dimensionality reduction (MDR) algorithm to work with quantitative or continuous phenotypes [18]. The MDR method is one a commonly used method for detection and characterization of high-order gene-gene or gene-environment interactions in case-control studies this comprises a nonparametric combinatorial approach that reduces the number of dimensions [19]. For each multi-locus genotype combination, QMDR calculates the mean value of phenotype and compares it to the overall mean to determine the genotype combination is high risk or low risk. By pooling all the genotypes into either high-risk or low-risk groups, a new binary attribute is created. The t-test is used to compare the means between high and low risk groups using a t-test and t-statistic is used as a training score to choose the best model. In QMDR, the training and testing score are defined by t-test statistic. The training score is used to determine the best K-order interaction model. QMDR use 10-fold cross-validation and cross-validation consistencies (CVCs) of each model chosen are recorded. The best overall QMDR model is selected as that with the maximum testing score and highest cross-validation consistency. To estimate the p-values of the chosen model, empirical null distribution is used [18].

In this study, interactions of up to three loci were tested using 10-fold cross-validation in a search considering all possible SNP combinations. SNP combination with maximum CVC was considered the best model. p-values were determined empirically by 1,000-fold permutations of case and control labels.


In the context of interactions, a brief explanation of the function and all functional interactions can be used to accurately narrow down a large amount of data. Having sufficient knowledge about interactions is a prerequisite as it reveals a dimensional view of many potential functional activities. Consequently, the complete description of biological phenomenon directly designates the specific interaction between entities 1,2,3 . For large assemblies of entities, a three-dimensional view can be more meaningful.

Cellular modes may be determined by mass transport while the sequestration of signaling interactions and molecular actions may be regulated as well by “cooperative binding”. Based on the valuable insights of interactions, notes have been added that categorize interacting proteins into functional sets that are labeled similar to signaling pathways, physical complexes and a limited tightly linked ‘modules’ 4,5,6 . Nevertheless, the distribution of interactions into diverse complexes or pathways are divisible which are likely to prevent verification of the likelihood of crosstalk and dynamic states in the interacting domain 7 . One commonly employed approach is to avoid the subdividing of functions in a network, particularly creating a network that is based on topological outcomes of all types of known or predicted interactions. In the context of the network, a web-based system is considered outstanding when it accurately integrates numerous kinds of interactions that express stable physical partnerships, frequent attachment, chaining of a substrate, communication of data, and many others. The primary interaction repository 8,9,10,11,12 provides an organized experimental dataset that includes multiple genetic, biochemical and biophysical techniques 13,14 . Progressions have focused on biological interactions from predicted computational data that are mainly focused on several forecasted communications using numerous algorithms 15,16,17,18,19 . Furthermore, the prospect of comprehensive and detailed coverage was elucidated using couple of web-based means that offers information about the combination of identified and forecasted communications. These databases mainly include STRING, GeneMANIA 20 , FunCoup 17 , I2D 21 , ConsensusPathDB 22 and others that are based on specific necessities. The most flexible and stable online platform is the STRING database, which has allowed for confidential interactions, valuable scoring and detailed comprehensive analysis for many years. The primary interaction unit that is typically used for a specific and productive functional relationship regarding a protein interaction is a functional connotation. Interactions can be derived from various available sources, similar to known experimental interactions, counting primary databases, pathway data parsed within manually curated databases, automated text-mining for statistical or semantic connections in proteins, genomic and coexpression interactions’ analysis predicted de novo, and precomputed orthologs. Additionally, the interactions observed in one organism can be orderly transferred to another organism 22,23,24 .

The proposed WeiBI database predominantly focuses on gene (protein-yielding) alternative-loci splice isoforms or genes that are altered at the post-translational stage further alterations are not available but are collapsed for a gene locus. The highly ranked functional grouping familiarized through unautomated curated Kyoto Encyclopedia 4,25,26 of genes and genomes pathway maps provide the sources of interactions, and their declarations have been proven ideal. As stated earlier, WeiBI covers 115570 entries. To gain more knowledge of the biological phenomena, there are supplementary updates available for all the primary data resources, and aims to re-execute the text-mining pipeline with new and long technologies. Through extensive literature investigation, we examined many features and interfaces in other databases 27,28,29 . However, the data are not sufficient to be heavily banking on. Hence, we support ongoing studies that are focused on modifications and alternative additions to the database.

Author information


Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, the Chinese Academy of Sciences, Shanghai, China

Changzheng Dong, Tieliu Shi & Yixue Li

Chinese National Human Genome Center at Shanghai, Shanghai, China

Changzheng Dong, Xun Chu, Ying Wang & Wei Huang

Graduate School of the Chinese Academy of Sciences, Beijing, China

MOE Key Laboratory of Contemporary Anthropology and Center for Evolutionary Biology, Fudan University, Shanghai, China

Rui Jin Hospital, School of Medicine, Shanghai Jiaotong University, Shanghai, China

Watch the video: Genteknik del 1 (February 2023).