J Psychiatry Brain Sci. 2016;1(4):1; https://doi.org/10.20900/jpbs.20160015

Article

Genetic Biomarker Selection for Obsessive-Compulsive Disorder by Sparse Representation Based Variable Selection Method

Xuemin Wang1, Peng Zhou1, Stephanie Buggs2, Shaolei Teng2*

1 Department of Biomedical Engineering, Tianjin University, Tianjin 300072, China

2 Department of Biology, Howard University, 2400 Sixth St NW, Washington, DC 20059, USA

* Correspondence: Dr. Shaolei Teng, Department of Biology, Howard University, 2400 Sixth St NW, Washington, DC 20059, USA; Tel: +1 202-806-6100.

Published: 25 October 2016

ABSTRACT

Background: Obsessive-compulsive disorder (OCD) is a debilitating neuropsychiatric condition estimated to afflict 1-3% of the world population. Dozens of OCD candidate genes have been reported in an increasing number of articles. Nevertheless, each patient or patient group may demonstrate unique etiologic characteristics that require personalized treatment.

Methods: We integrated a sparse representation based variable selection (SRVS) approach with an OCD-gene ResNet relation data analysis to select top genes for a specific group of 118 subjects, including 16 OCD cases and 102 healthy controls. The gene expression profiles were acquired from the dorsolateral prefrontal cortex (DLPFC) of postmortem tissue of these subjects. A pool of 77 OCD candidate genes was acquired from ResNet relation data analysis. Pathway enrichment analysis (PEA), sub-network enrichment analysis (SNEA) and gene-gene interaction analysis (GGI) were conducted to study the functional profile of the top genes selected by SRVS, and the results were compared with previously reported genetic markers.

Results: A significantly high classification accuracy (CR) of 79.66% was acquired (permutation p-value = 0.0046) using the top 9 genes selected by SRVS: HOXB8, HTR2C, CRHR2, GRIK3, HGF, OXT, TPH2, DRD2 and ADRA1A. These genes were enriched within multiple pathways and sub-networks that were previously implicated in OCD. In contrast, using the same number of the most frequently reported genes, a CR of only 65.5% was achieved. Moreover, GGI results showed that the SRVS-selected genes demonstrated a strong functional correlation with the frequently reported OCD genes.

Conclusion: Our study suggests that SRVS is an effective method for data-driven variable selection for OCD, and that the genes most frequently reported to be associated with OCD might not be the best biomarkers for a specific OCD patient or patient group.

Keywords: Obsessive-compulsive disorder; Sparse representation; Variable selection; ResNet database

INTRODUCTION

Obsessive-compulsive disorder (OCD) is a mental disorder in which people feel the need to check things repeatedly. This disorder typically arises in late adolescence or early adulthood and, if left untreated, has a chronic course regardless of sex, race, intelligence, marital status, socioeconomic status, religion or nationality[1,2]. Although the causes of most OCD cases remain unknown, it is believed that both genetic and environmental factors play a role[3, 4].

In recent years, sparse representation has received great attention in applications such as signal recovery and identification of significant components[5, 6]. However, in applications with a large number of variables and a small number of samples, specific modification is required to accomplish the variable selection task. In many biomedical problems (e.g., genomic data, image data), the number of samples is far smaller than the number of variables.

In this study, we proposed a sparse representation based variable selection (SRVS) algorithm that selects significant biomarkers at different detection resolutions. This method has previously been shown to be effective in variable selection with SNP data and fMRI data[7]. Instead of selecting a specific number of variables, this data-driven method ranks all the variables by generating a sparse regression weight for each of them[7].

METHODS AND MATERIALS

In this section, we first describe the proposed SRVS algorithm (Section 1), then apply it to an OCD candidate genetic biomarker selection problem (Section 2). Finally, we study the SRVS-selected genes in terms of pathway enrichment analysis (PEA), sub-network enrichment analysis (SNEA)[8] and gene-gene interaction analysis (GGI) (Section 3).

1. SRVS Algorithm

In general, a sparse representation model can be presented as Eq. (1).

y = Xδ + ε           (1)

where y ∈ R^(n×1) is the observation vector, X ∈ R^(n×p) contains the measurements of the data with p ≫ n, and ε ∈ R^(n×1) is the measurement error caused by noise. The goal is to reconstruct the unknown vector δ ∈ R^(p×1) based on y and X.

To best approximate y using a small number of non-zero entries of δ in the model given by Eq. (1), we consider the following Lp minimization problem (P0):

(P0) min ||δ||_p subject to ||y - Xδ||_2 ≤ ε           (2)

where ||·||_p is the Lp norm with norm index p ∈ [0,1] (distinct from the number of variables p above), and ε here acts as a scalar bound on the residual. The following algorithm is designed to solve the minimization problem (P0) given by Eq. (2) and detect the columns of X relevant to y.

SRVS Algorithm Steps

1. Initialize δ(0) = 0;

2. At Step ι, randomly choose k columns from X = {x1,...,xp} ∈ R^(n×p) to construct an n × k sub-matrix denoted as Xι ∈ R^(n×k), and record the indexes of the selected columns as Iι ∈ R^(1×k);

3. Solve the following Lp minimization problem to find the optimal sparse solution δι ∈ R^(k×1):

min ||δι||_p s.t. ||y - Xιδι||_2 ≤ ε           (3)

*There are many proposed methods for solving the Lp minimization problem, such as the Homotopy method[9] for p = 1, and the orthogonal matching pursuit (OMP) algorithm[10] for p = 0.

4. Update δ(ι) ∈ R^(p×1) with δι: δ(ι)(Iι) = δ(ι-1)(Iι) + δι, where δ(ι)(Iι) and δ(ι-1)(Iι) denote the Iι-th entries of δ(ι) and δ(ι-1), respectively;

5. If ||δ(ι)/ι - δ(ι-1)/(ι-1)||_2 > α, where α is a predefined constant, update ι = ι + 1 and go to Step 2. Otherwise, set δ = δ(ι)/ι. The non-zero entries in δ correspond to the column vectors selected.
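The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `omp` and `srvs`, the default values of `k`, `n_nonzero`, `alpha` and `max_iter`, and the synthetic data are all hypothetical. The sub-problem in Step 3 is solved with a basic OMP (the p = 0 case) that stops at a fixed sparsity level rather than at the residual bound ε.

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Basic orthogonal matching pursuit with a fixed sparsity level (assumption)."""
    support = []
    residual = y.astype(float)
    sol = np.zeros(0)
    for _ in range(min(n_nonzero, X.shape[1])):
        corr = np.abs(X.T @ residual)
        if support:
            corr[support] = -1.0              # never pick the same column twice
        support.append(int(np.argmax(corr)))
        # Refit y on all columns chosen so far, then update the residual.
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef = np.zeros(X.shape[1])
    if support:
        coef[support] = sol
    return coef

def srvs(X, y, k=10, n_nonzero=2, alpha=5e-3, max_iter=5000, seed=0):
    """Sketch of Steps 1-5: accumulate sparse solutions over random column subsets."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    delta_sum = np.zeros(p)                   # Step 1: delta(0) = 0
    avg = np.zeros(p)
    prev_avg = None
    for it in range(1, max_iter + 1):
        idx = rng.choice(p, size=k, replace=False)       # Step 2
        delta_sum[idx] += omp(X[:, idx], y, n_nonzero)   # Steps 3 and 4
        avg = delta_sum / it
        # Step 5: stop once the running average changes by at most alpha.
        if prev_avg is not None and np.linalg.norm(avg - prev_avg) <= alpha:
            break
        prev_avg = avg
    return avg

# Synthetic check (made-up data): only columns 5 and 17 drive y.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = 3.0 * X[:, 5] - 3.0 * X[:, 17] + 0.1 * rng.standard_normal(200)
delta = srvs(X, y)
top2 = np.argsort(-np.abs(delta))[:2]         # indexes of the two largest weights
```

Ranking the variables by the magnitude of the returned weights gives an ordering in the spirit of the data-driven ranking described above; how the weights map to a final score is left open here.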

2. OCD Candidate Genes for Evaluation

A pool of 77 OCD candidate genes was acquired from the OCD-Gene relation data in the Pathway Studio (PS) ResNet database, a group of network databases that are updated in real time. This database includes curated signaling, cellular process and metabolic pathways, ontologies and annotations, as well as molecular interactions and functional relationships extracted from 35M+ references covering all PubMed abstracts and Elsevier full-text journals. For more information about the PS ResNet Mammalian databases, please refer to http://pathwaystudio.gousinfo.com/ResNetDatabase.html.

The gene expression profile (GEO: GSE60190) was acquired from the dorsolateral prefrontal cortex (DLPFC) of postmortem tissue of 118 subjects, of whom 16 were OCD cases and 102 were controls.

3. Validation of SRVS Results

To test the validity of the proposed method, we studied the SRVS-selected genes through four approaches: OCD prediction, PEA, SNEA and GGI. The results were compared with those of frequently reported OCD risk genes. Accordingly, we proposed two metric scores for each gene: 1) Reference score (Rscore): the number of references underlying a gene-disease relationship; 2) SRVS score (Sscore): the weight generated by the SRVS approach for each gene.

3.1 OCD Prediction

We hypothesized that a significant OCD candidate gene or gene set should contribute to distinguishing OCD patients from healthy controls. To validate the effectiveness of the selected genes and the proposed SRVS approach, we performed a Euclidean distance-based multivariate classification[7] on the gene expression data set, followed by a leave-one-out (LOO) cross validation, using the overall gene set and the sub-sets selected by Sscore and Rscore as tentative markers. A permutation test of 5,000 runs was then conducted to test the hypothesis that a randomly selected gene set of the same size can reach an equal or higher classification accuracy (CR).
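The validation procedure can be sketched as follows. The exact classifier of reference [7] is not described here, so this sketch assumes a nearest-class-centroid rule under Euclidean distance; the function names, the reduced number of permutation runs and the synthetic data are hypothetical and only illustrate the workflow.

```python
import numpy as np

def loo_accuracy(X, labels):
    """LOO accuracy of a nearest-centroid Euclidean classifier (assumed classifier).

    Assumes every class keeps at least one sample when a subject is held out.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n = len(labels)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i                      # hold out subject i
        centroids = [X[train & (labels == c)].mean(axis=0) for c in classes]
        dists = [np.linalg.norm(X[i] - m) for m in centroids]
        correct += int(classes[int(np.argmin(dists))] == labels[i])
    return correct / n

def permutation_p_value(X, labels, n_selected, observed_cr, n_runs=5000, seed=0):
    """Fraction of random gene sets of the same size with LOO accuracy >= observed."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    hits = 0
    for _ in range(n_runs):
        cols = rng.choice(p, size=n_selected, replace=False)
        if loo_accuracy(X[:, cols], labels) >= observed_cr:
            hits += 1
    return hits / n_runs

# Synthetic example (made-up data): 10 cases, 20 controls, 10 variables,
# of which only the first two differ between the groups.
rng = np.random.default_rng(0)
labels = np.array([1] * 10 + [0] * 20)
X = rng.standard_normal((30, 10))
X[labels == 1, :2] += 3.0
cr = loo_accuracy(X[:, :2], labels)
p_val = permutation_p_value(X, labels, n_selected=2, observed_cr=cr, n_runs=200)
```

The permutation p-value is simply the share of random same-size gene sets that match or exceed the observed CR, which mirrors the hypothesis stated above.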

3.2 Enrichment and GGI Analysis

PEA was conducted using both Pathway Studio (www.pathwaystudio.com) and DAVID (http://david.abcc.ncifcrf.gov/). In Pathway Studio, a given gene group was compared with GO ontology terms, Pathway Studio ontology terms, and over 2,000 pathways accumulated by biologists. Overlapping genes and enrichment p-values (FDR corrected) from the Fisher exact test were provided. For DAVID, the names of candidate targets were used as the inputs. The 'GOTERM_BP_FAT', 'GOTERM_CC_FAT' and 'GOTERM_MF_FAT' categories were used for the gene ontology search, and 'KEGG_PATHWAY' was used for the pathway search.
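As an illustration of how an enrichment p-value of this kind can be computed, the sketch below implements a one-sided Fisher exact test as a hypergeometric tail, followed by a Benjamini-Hochberg adjustment. The text states only that p-values were FDR corrected, so the choice of Benjamini-Hochberg is an assumption, as are the function names, the background size of 20,000 genes and the pathway size of 50.

```python
from math import comb

def fisher_enrichment_p(overlap, set_size, pathway_size, background):
    """One-sided Fisher exact p-value: P(overlap >= observed) under random draws."""
    max_k = min(set_size, pathway_size)
    total = comb(background, set_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, set_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical numbers: 2 of 9 selected genes fall in a 50-gene pathway,
# drawn from an assumed background of 20,000 genes.
p_raw = fisher_enrichment_p(overlap=2, set_size=9, pathway_size=50, background=20000)
p_adj = benjamini_hochberg([0.01, 0.04, 0.03])
```

With small gene sets such as 9 genes, even an overlap of 2 can give a small raw p-value when the pathway is small relative to the background, which is why the correction across all tested pathways matters.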

RESULTS

1. OCD Candidate Genes

Analysis of OCD-Gene relation data revealed an OCD gene pool of 77 genes, supported by 439 articles. Here we evaluated these 77 OCD candidate genes using the proposed SRVS algorithm with an independent gene expression data set (GEO: GSE60190). Fig. 1 presents these OCD candidate genes. The full list of the 77 genes and related information, including Sscore and Rscore, is provided in Supplementary Table S1, and the supporting references are provided in Supplementary Table S2a.

Fig. 1 The 77 OCD candidate genes analyzed. Genes were acquired from the Pathway Studio ResNet database.

The enrichment analysis of the 77 candidate genes using DAVID shows that many of these targets are enriched in GO terms and pathways associated with neurological processes (Table 1). For example, neurotransmitter-related GO terms included “cell-cell signaling”, “regulation of transmission of nerve impulse”, “regulation of system process”, “transmission of nerve impulse”, “regulation of neurological system process” and “neurological system process” (Supplementary Table S2b). The neurological system pathways included “Neuroactive ligand-receptor interaction”, “Gap junction” and “Long-term potentiation” (Supplementary Table S2c).

Table 1 Enrichment analysis of 77 OCD Candidate Genes

2. Results of OCD Prediction

To evaluate the effectiveness of the SRVS-generated metric, Sscore, a case/control classification with LOO cross validation was conducted on a gene expression data set (GEO: GSE60190), followed by a permutation test of 5,000 runs. For comparison purposes, we also tested the Rscore. For the LOO cross validation, we first ranked the 77 genes by each metric score, then used the top n (n = 1, 2, …) genes as input variables for classification. Fig. 2 presents the results, with the maximum classification ratios (CRs) marked at the corresponding number of genes.
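The top-n sweep described above can be sketched as a small helper. This is an illustrative outline only: `best_top_n` and `toy_evaluate` are hypothetical names, and the toy scoring function merely stands in for the LOO classification accuracy that would be computed on real expression data.

```python
def best_top_n(scores, evaluate):
    """Rank variables by score (descending) and evaluate every top-n prefix.

    `evaluate` takes a list of variable indexes and returns an accuracy-like value.
    Ties in the best value are resolved in favor of the smallest n.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    results = [(n, evaluate(ranked[:n])) for n in range(1, len(ranked) + 1)]
    best_n, best_cr = max(results, key=lambda r: r[1])
    return ranked, best_n, best_cr

# Toy stand-in for the LOO accuracy (made up for illustration): variables 0 and 2
# are informative, and every uninformative variable costs 0.1.
informative = {0, 2}

def toy_evaluate(selected):
    hits = len(informative & set(selected))
    return hits / 2 - 0.1 * (len(selected) - hits)

ranked, best_n, best_cr = best_top_n([0.5, 0.1, 0.9, 0.0], toy_evaluate)
```

In this toy case the best value is reached at n = 2, after which adding lower-ranked variables only lowers the score.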

Fig. 2 Comparison of Different Metrics through a LOO Cross Validation (genes ranked in descending order)

We also present in Table 2 the LOO cross validation results and permutation p-values, including those obtained using the Rscore-ranked genes and all 77 OCD candidate genes.

Table 2 LOO Cross Validation and Permutation Results

Fig. 2 and Table 2 show that, compared with the CRs generated by randomly selected gene sets of the same size, the top genes selected by Sscore lead to significantly better classification accuracies. Notably, the highest CR was acquired using only the top 9 genes selected by Sscore (see Fig. 2 and Table 2), which supports the validity of the Sscore (adding more genes with lower scores had essentially no effect). In contrast, the CR for the top genes by Rscore was only 65.5%. Moreover, the Sscore led to a lower permutation p-value of 0.0046, demonstrating the effectiveness of the proposed method. We present the top 9 genes by Sscore in Table 3. For comparison purposes, we also provide the top 9 genes by Rscore, and the full lists in Supplementary Table S2.

Table 3 Top 9 Genes Reported Associations with OCD Ranked by Sscore and Rscore

3. Comparison of Top Genes

To better understand the profile of the genes selected by the SRVS approach, we further compared the two groups of top genes selected by Sscore and Rscore (Table 3) using the PEA and GGI approaches.

The analysis identified only one gene shared between the 9 genes selected by Sscore and the 9 genes selected by Rscore: HTR2C, as depicted in Fig. 3 (a). Nevertheless, GGI analysis demonstrated that there were 64 relations of different types between 7/9 genes from the Sscore group and 9/9 genes from the Rscore group (Fig. 3 (b)), supported by 407 references (Supplementary Table S3), suggesting a strong relation between the two gene groups.

Fig. 3 Overlap and Association between the Sub Gene Sets with the Highest Sscore and Rscore. (a) Venn diagram of the top 9 genes by both scores; (b) Gene-gene connections between the top 9 genes by both scores; genes selected by Sscore are highlighted in yellow; genes selected by Rscore in blue.

4. Enrichment Analysis

In this section, we present PEA and SNEA results for the different groups listed in Table 3. At the same enrichment p-value threshold (< 5e-03), we identified 80 pathways enriched by the top 9 genes by Rscore, whereas for the Sscore group there were only 20 enriched pathways. We present the top 10 pathways/gene sets by Rscore and Sscore in Table 4. The full results are presented in Supplementary Tables S4a and S4b.

Table 4 Pathways/groups Enriched by 9 Genes with the Highest Sscore and Rscore

For the 20 pathways/gene sets enriched with the 9 genes by Sscore (p-value < 0.005, with 9/9 unique genes; see Supplementary Table S4a), there was 1 pathway (2/9 unique genes) related to the nervous system: negative regulation of synaptic transmission, glutamatergic (GO: 0051967; p-value = 0.00057, overlap: 2); 1 pathway (3/9 unique genes) related to behavior: feeding behavior (GO: 0007631; p-value = 9.2e-005, overlap: 3); 2 pathways (4/9 unique genes) related to heart development: positive regulation of heart rate (GO: 0010460; p-value = 0.0016, overlap: 2) and regulation of heart rate (GO: 0002027; p-value = 0.0034, overlap: 2); and 2 pathways (3/9 unique genes) related to blood pressure: positive regulation of blood pressure (GO: 0045777; p-value = 0.0027, overlap: 2) and negative regulation of blood pressure (GO: 0045776; p-value = 0.0032, overlap: 2)[10-14].

In addition to PEA, we also performed SNEA using Pathway Studio to assess the pathogenic significance of the selected genes to other disorders that are possibly related to OCD. The full results are presented in Supplementary Tables S5a and S5b. Due to lack of space, we present in Table 5 only the top 5 disease-related sub-networks enriched by the two groups of genes. This table also indicates that both groups enriched OCD-related sub-networks as well as genetic mutation related sub-networks. Moreover, we noted one sub-network enriched by both groups: major depressive disorder.

Table 5 SNEA Results by 9 Genes with the Highest Sscore and Rscore

DISCUSSION AND CONCLUSION

Previous studies suggested that OCD may be caused by imbalances of neurotransmitters such as serotonin and dopamine[11]. The enrichment analysis of the 77 genes confirms that these OCD candidates are enriched in neurotransmitter-related terms and pathways (Table 1). This study proposed a sparse representation based genetic marker selection approach and applied it to the evaluation of 77 OCD candidate genes. The candidate genes were identified from an OCD-Gene network relation data set acquired from the ResNet database and were also covered by an RNA gene expression data set. Two metric scores were generated and compared: Sscore from SRVS analysis and Rscore from OCD-Gene relation data set analysis.

LOO cross validation demonstrated that using all 77 OCD candidate genes, a classification ratio of 72.88% was reached with a permutation p-value of 0.3518 (Table 2). However, using the top genes by Sscore, a better maximum CR was acquired (79.66%) with a significant permutation p-value (0.0046). This suggested the necessity of variable selection for the candidate OCD genes tested, as well as the efficacy of Sscore.

To better understand the 9 genes selected by the SRVS method, we compared them with the top 9 genes selected by Rscore. The analysis showed that these two groups share only one gene: HTR2C (Fig. 3 (a)). Their differences were also demonstrated in terms of enriched pathways (Table 4) and associated sub-networks (Table 5). Even though the well-studied OCD candidate genes were significant to the disease and effective in disease prediction (LOO permutation p-value = 0.1142), they were not the best genetic markers for the subjects involved in the expression data tested.

Despite the differences between the top genes selected by high Sscore and Rscore, we identified that many of the Sscore enrichment pathways were previously reported in connection with OCD. Examples include feeding behavior; negative regulation of synaptic transmission, glutamatergic; positive regulation of heart rate; positive regulation of blood pressure; and grooming behavior[12-17]. Furthermore, these genes were also identified as part of the genetic basis of other OCD-related diseases, such as post-traumatic stress disorder, anxiety and major depressive disorder[18-22]. In fact, all these Sscore genes have previously been implicated in OCD, as shown in Fig. 1. For example, HTR2A has been frequently identified by independent studies for its pathogenic importance to OCD[23-25]. These results supported the biological validity of the top genes selected by the SRVS approach.

In addition to the direct literature support for the association between OCD and the top 9 genes selected by Sscore (Supplementary Table S1), we observed a strong functional association between the top genes selected by Sscore and those selected by Rscore (Fig. 3 (b)), supported by over 400 references (Supplementary Table S3). A high Rscore indicates that a gene has strong literature support for its linkage to OCD. Therefore, our observation provides indirect support that the majority of the top genes selected by the SRVS method have functional significance to OCD.

Nevertheless, this study has several limitations that need future work. Although the algorithm was tested on the 77 OCD-candidate genes, there are other genes linked to OCD that were not included in the data set and were therefore not analyzed. More inclusive data sets covering all OCD genes should be used to test the accuracy of the method. Additionally, the method should also be tested on other diseases to study its validity.

Altogether, we conclude that OCD is a complex disease whose genetic causes are linked to a network composed of a large group of genes. Each patient/patient group may present unique genomic variations that require treatment based on their specific disease risk prediction, where our proposed SRVS method can be employed as an effective tool.

DECLARATION OF INTERESTS

The authors declare no conflicts of interest.

REFERENCES

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

11.

12.

13.

14.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

25.

How to Cite This Article

Wang X, Zhou P, Buggs S, Teng S. Genetic Biomarker Selection for Obsessive-Compulsive Disorder by Sparse Representation Based Variable Selection Method. J Psychiatry Brain Sci. 2016;1(4):1; https://doi.org/10.20900/jpbs.20160015

Copyright © 2020 Hapres Co., Ltd.