J Psychiatry Brain Sci. 2016;1(1):4;


Functional Network Composed of 1,219 Genes for Schizophrenia-a Literature Data Mining and Enrichment Analysis

Shunan Li1, Benjamin H Lehrman2, Hongbao Cao3, Lydia C Manor4*

1 Vanderhouwen & Associates, 6342 SW Macadam Ave, Portland 97239, OR;

2 Rush Medical College, 600 S Paulina St, Chicago 60612, IL;

3 Elsevier Inc., 5635 Fishers Ln, Rockville 20852, MD;

4 American Informatics Consultant LLC, Rockville 20852, MD.

* Correspondence: Dr. Manor, American Informatics Consultant LLC, Rockville, MD, 20852, USA.

Published: 25 April 2016


Background: Over the past decade, numerous studies have focused on identifying genetic factors associated with schizophrenia (SCZ). Variations among samples (size, population, race, and disease status) and among data processing methods have resulted in differences in the genes selected. Nevertheless, no systematic study has been completed to summarize these reports and provide an objective full list of genes with a reported association to SCZ.

Methods: We conducted a literature data mining (LDM) of 13,515 articles covering publications from 1958 to Feb. 2016. These articles reported multiple types of marker-disease associations between 1,219 genes and SCZ. Then we conducted a gene set enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) to study the functional profile and validate the pathogenic significance of these genes to SCZ. Finally, we presented additional results from the systematic review, including publication date, quality scores, and author affiliations.

Results: All of these genes have been reported to carry multiple mutations associated with SCZ, some of which were supported by a large number of high quality articles. Enrichment analyses showed that many psychiatric and neuropathic pathways/groups related to SCZ were significantly enriched by these genes and that the genes are functionally associated with each other.

Conclusion: Our results indicate that these genes may operate as a functional biomarker network influencing the development of SCZ, and that LDM together with GSEA and SNEA could serve as an effective approach in finding these potential target genes.


Schizophrenia (SCZ) is a mental disorder characterized by abnormal social behavior and failure to understand reality. The cause of SCZ is believed to be a combination of environmental and genetic factors (Owen et al., 2016). Although the estimates of heritability vary, due to the difficulty in separating the genetic and environmental effects, an average estimate of 0.80 has been used (Herson M, 2011). Over time, we have noticed an increasing number of articles reporting over a thousand genes/proteins that may potentially be related to SCZ (Sacchi et al., 2013; Winchester et al., 2014; Liao et al., 2015). Among those reported marker-disease relations, many were supported by high quality articles with thousands of citations (Stefansson et al., 2002; Egan et al., 2003), while some of those genes were newly reported in recent years, with a limited number of supporting publications (Schmidt et al., 2015). Moreover, most of the genes have been reported in different genomic studies to have different types of relations with SCZ.

According to how the gene-SCZ relations were reported, those articles can generally be classified into several different categories: 1) Biomarker; 2) Clinical Trial; 3) Genetic Change; 4) Quantitative Change; 5) Regulation; 6) State Change.

A relatively small number of articles suggested that the genes reported in their study could serve as biomarkers for the disease (Leonard et al., 2006; Boneberg et al., 2006; Kanazawa et al., 2007; Nohesara et al., 2011; Tenback et al., 2011; Fatemi et al., 2013; Abdolmaleky et al., 2014). Compared to the total number of articles reporting marker-disease relationships, biomarker-type articles are in the minority. It should also be noted that most articles in this group hedge their conclusions with weak modal wording. For example, AKT1 was reported as only potentially related to SCZ because of the weak p-values in the study (Thiselton et al., 2008).

An even smaller number of articles reported SCZ clinical trial related genes (Strick et al., 2011; Lindenmayer et al., 2015). However, clinical study results are effective in evaluating the functional relation between genes and SCZ in a more direct way.

There are a large number of articles claiming genetic changes of the genes in SCZ in terms of gene deletions, amplifications, mutations, or epigenetic changes. This is the largest group of all article categories. Many researchers working in this group studied the direct relation between genetic change of marker genes and the development of the disease using GWAS or sequencing data from both humans and animals (Sacchi et al., 2013; Winchester et al., 2014; Porteous et al., 2014; Agha et al., 2014). Alternatively, many observed marker-disease connections are actually associations between markers and one or more symptoms of SCZ (Deng et al., 2013; Pearson-Fuhrhop et al., 2013; Cheng et al., 2014). It should also be noted that many studies focused on gene polymorphisms (Houston et al., 2013; Liao et al., 2015; Hu et al., 2015). Interestingly, the molecular changes of some genes have been identified to be case-sensitive, and thereby have been suggested as markers to differentiate subgroups of SCZ (Villar-Menéndez et al., 2014). Additionally, the study population has been found to influence the association between the markers and the disease (Gu et al., 2013). Nevertheless, the results are not always consistent. For example, both positive and negative associations of common SNPs in CHRNA7 have been reported in schizophrenia GWAS studies (Winchester et al., 2014).

Quantitative Change is another big group. Articles in this group reported changes in abundance/activity/expression of a gene/protein in the disease state, and most used gene expression studies (Yanagi et al., 2014; Chen et al., 2014; Yin et al., 2013; Le Magueresse et al., 2013; Du et al., 2014; Vauquelin et al., 2012; Penzes et al., 2011). Among those reports, some genes were observed to have increased expression levels in the case of SCZ (Zvara et al., 2005; Graziane et al., 2009; Fuxe et al., 2010; Sacchi et al., 2013; Yin et al., 2013; Chen et al., 2014; Joshi et al., 2014; Du et al., 2014). These genes include: BDNF, NRG1, DRD4, ERBB4, COMT, HTR2A, DRD3, and DAO. Alternatively, many other genes were reported to present decreased gene expression levels in SCZ studies: CHRNA7 (Zheng et al., 2012), GAD1 (Yanagi et al., 2014), DTNBP1 (Le Magueresse et al., 2013), RELN (Folsom et al., 2012), AKT1 (Chakraborty et al., 2014), DRD2 (Vauquelin et al., 2012), DISC1 (Penzes et al., 2011), MTHFR (Hill et al., 2011), and SLC6A3 (El Hage et al., 2015).

Many articles reported a change in the activity of a target gene by an unknown mechanism; these are clustered as the Regulation group. The number of such articles is similar to that of the Quantitative Change group. This is a less specific relation type than the others. Although reports of this type of relation were equivocal when describing the mechanism of association, they were affirmative in stating that the genes/proteins were related to the disease (Nikolaus et al., 2014; Yu et al., 2013; Ivleva et al., 2008; Kret et al., 2015; Luoni et al., 2014; Schroeder et al., 2015; Lee et al., 2012; Roffeei et al., 2014). It should be noted that many genes were suggested to be important for new SCZ drug development (Winchester et al., 2014; Khoddami et al., 2015; Sacchi et al., 2013). Although some conflicting outcomes among different methodologies led to continuous debate, the large number of regulation-type reports did provide more information regarding the association between the markers and SCZ.

The last group is state change, which is about the same size as the Biomarker group. Articles in this group report changes in a protein/gene post-translational modification status or alternative splicing events associated with SCZ (Guest et al., 2000; Kroeger et al., 2003; Neddens et al., 2011; Sequeira et al., 2012; Wilkinson et al., 2013). Although state change is a relatively smaller group, those studies reported specific protein/gene state changes that may be related to SCZ, and therefore are important for the understanding of the mechanism of the disease.

Although it is possible that false positives are reported in some articles, genes that have been frequently identified by different studies as associated with SCZ should be brought up for further analysis and could be used as potential biomarker targets. Moreover, the ones that are newly reported in recent years may attract extra attention. Nevertheless, so far, no systematic analysis has evaluated the quality and strength of these reports to provide a full list of genes associated with SCZ. This is the aim of this study. Instead of focusing on one specific marker or function as was done in other reviews (Ando et al., 2011; Bliksted 2016), we attempt to provide a full view of the genetic map that is related to SCZ.


The workflow of this study is as follows: 1) Literature data mining (LDM) to discover gene-SCZ relations; 2) Enrichment analysis on the identified genes to validate their pathogenic significance to SCZ; 3) A systematic review of the articles underlying these relations, including publication date, quality metric scores, and author-affiliation analysis.

Literature Data Mining

In this study, we performed a literature data mining (LDM) of all articles available in the Pathway Studio database, which covers over 40 million scientific articles, seeking the ones that reported gene-SCZ relations. The LDM was conducted by employing the Natural Language Processing (NLP) text mining module of the Pathway Studio software.

Gene Set Enrichment Analysis

To better understand the underlying functional profile and validate the pathogenic significance of the reported genes, we performed a gene set/pathway enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) on five groups: 1) the whole gene list (1,219 genes); 2) four subgroups selected using the highest quality metric scores (61 genes in each group). In addition, we conducted a network connectivity analysis on a subset of genes using Pathway Studio.

GSEA (also known as functional enrichment analysis) is a method for analyzing high-throughput biological experiments that identifies classes of genes or proteins that are over-represented in a large set of genes or proteins. These gene sets could be known biochemical pathways or otherwise functionally related genes. The method uses statistical approaches to identify significantly enriched or depleted groups of genes to retrieve a functional profile of the input gene set, in order to better understand the underlying biological processes. With this method, one does not consider the perturbation of single genes but of whole (functionally related) gene sets. The advantage of this approach is that it is more robust: a single gene is more likely to be falsely found perturbed than a whole pathway is.
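As a rough illustration of the over-representation idea described above, the following sketch computes a one-sided hypergeometric p-value for a single gene set. It is not the Pathway Studio implementation, whose exact statistic is not described here; the function name, the background size, and the example counts are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_p_value(background_size, set_size, query_size, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    background_size: number of genes in the background (assumed value)
    set_size:        number of background genes annotated to the gene set
    query_size:      number of genes in the input list
    overlap:         number of input genes that fall in the gene set
    """
    total = comb(background_size, query_size)
    upper = min(set_size, query_size)
    # Sum the probability of seeing `overlap` or more annotated genes by chance.
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers for illustration only (not taken from the study).
p = enrichment_p_value(background_size=20000, set_size=300,
                       query_size=1219, overlap=88)
print(f"p = {p:.3e}")
```

Because the test looks at the whole gene set at once, a single spurious gene in the input list changes the overlap by only one and has little effect on the p-value.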

In addition to GSEA, we performed a sub-network enrichment analysis (SNEA), which is implemented in Pathway Studio using master causal networks (database) containing more than 6.5 million relationships derived from more than 4 million full text articles and 25 million PubMed abstracts. These networks are generated by a finely-tuned Natural Language Processing (NLP) text mining system to extract relationship data from the scientific literature, rather than the manual curation process used by IPA. The ability to quickly update the terminologies and linguistic rules used by NLP systems ensures that new terms can be captured soon after entering regular use in the literature. This extensive database of interaction data provides high levels of confidence when interpreting experimentally-derived genetic data against the background of previously published results.

Quality Metric Analysis

In this work, we perform a quality metrics analysis on all marker-disease relations. Output of the analysis includes quality score (QScore), citation score (CScore), novelty score (NScore) and report frequency score (RScore) at article level as well as marker level. These quality measures could be used to sort the marker list.

Using the RScore one can identify the most frequently reported markers. At article level, RScore=1 indicates that a marker-disease relation has been reported; otherwise RScore=0. At marker level, RScore is the sum of article level RScores, representing the report frequency of the marker.

Using the NScore one can identify the newly reported markers. Here we define the publication age as current year - publication year + 1. According to different publication age thresholds n, we differentiate NScores into NScore_n, where n (years) = 1, 2, ...; at article level, NScore_n = 0 when the publication age of the article is older than n; otherwise NScore_n > 0. At marker level, NScore_n = 0 means the marker-disease relation was reported more than n years ago.

Using the CScore one can identify the marker-disease relations that are highly cited. The CScore of an article is defined as its number of citations, and the marker level CScore of a relation is the sum of the total citations of all the articles supporting the relation between the marker and the disease.
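The marker-level report frequency and citation scores defined above can be sketched as simple aggregations over the supporting articles. This is a minimal sketch, not the Pathway Studio implementation. The record layout, function names, and example records are hypothetical. Because the article-level NScore value is not specified beyond being zero or positive, the sketch only counts the articles whose NScore would be non-zero for a given age threshold (2 years, as in the NScore_2 used for Fig. 4).

```python
from collections import defaultdict

CURRENT_YEAR = 2016  # year of the literature cutoff in this study


def publication_age(pub_year, current_year=CURRENT_YEAR):
    """Publication age = current year - publication year + 1."""
    return current_year - pub_year + 1


def marker_scores(records, n_years=2):
    """Aggregate article-level records into marker-level scores.

    records: iterable of (marker, publication_year, citation_count),
             one tuple per article supporting the marker-disease relation
             (hypothetical layout).
    """
    scores = defaultdict(
        lambda: {"RScore": 0, "CScore": 0, "recent_articles": 0}
    )
    for marker, year, citations in records:
        entry = scores[marker]
        entry["RScore"] += 1          # article-level RScore is 1 per report
        entry["CScore"] += citations  # sum of citations over all articles
        if publication_age(year) <= n_years:
            entry["recent_articles"] += 1  # articles with non-zero NScore
    return dict(scores)


# Hypothetical records for illustration only.
example = [
    ("GENE_A", 2015, 10),
    ("GENE_A", 2003, 500),
    ("GENE_B", 2016, 2),
]
result = marker_scores(example)
print(result)
```

A marker with `recent_articles` equal to zero corresponds to a marker-level NScore of zero for that threshold.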

The QScore is a composite index considering three factors of an article-reported relation: 1) the citation number; 2) the publication age; and 3) the RScore. The QScore of an article is in the range of (0,1), and it is inversely related to publication age and positively related to its citation number. If an article is recently published with a high citation number, its QScore will be close to 1; otherwise, if the article is old with few citations, its QScore will be close to 0. The marker level QScore is the sum of the QScores of all the articles supporting the marker. It should be noted that both article level and marker level QScores are designed on the relation level to evaluate the significance of the article(s) to the relation. If multiple marker-disease relations have been reported by one article, this article will have a QScore for each of those relations, although they are equal in value.

Article Author-affiliation Analysis

We summarized author-affiliation info of all the articles that reported a gene-SCZ relationship. Our intent is to provide the affiliations of the authors who are most active in the field, which is especially helpful for those who are seeking collaborators for SCZ related genetic/genomic studies.


Results Summary of LDM

Through the LDM approach, we discovered 13,515 articles (1958 to Feb. 2016) reporting 1,219 genes associated with SCZ. According to the reported category of gene-SCZ relations, the articles can generally be clustered into 6 different groups: 1) Biomarker (0.62%); 2) Clinical Trial (0.16%); 3) Genetic Change (53.91%); 4) Quantitative Change (22.51%); 5) Regulation (21.75%); 6) State Change (1.05%).

Among the reported genes, some are frequently reported and supported by a large number of articles, such as COMT (430 articles), DRD2 (394 articles) and BDNF (341 articles). These genes are the ones with the highest RScores. Some are recently reported (e.g., within the last two years) and have a high NScore, such as FKBP5 (NScore: 6.5), AGO2 (NScore: 4) and TREM2 (NScore: 4). Some are frequently cited and have a high CScore: COMT (CScore: 20369), NRG1 (CScore: 18039), GAD1 (CScore: 18000). It should also be noted that genes with high report frequencies (RScore) do not necessarily have a higher number of citations (CScore), which may be caused by many factors such as the total number of underlying articles and their publication age. To balance these factors, we can use the QScore. We present in Table 1 the top 61 genes ranked by different scores, representing genes with different significance. The full results are provided in Supplementary Material 1.

Table 1 Top 61 Genes Reported Associations with SCZ ranked by Different Scores

We note that the top genes selected by QScore, CScore, RScore have large overlaps with each other; however, they show no overlap with the gene group selected by NScore, which contains all the genes that are newly reported within the last two years.

In addition, we found that 53 out of the 1,219 genes identified in this study are encoded by the 108 loci reported by Ripke et al., 2014. Some of those genes have been frequently reported with many citations, such as DRD2, ZNF804A and MIR137, while many others are relatively new with fewer supporting articles, such as MAD1L1 and AMBRA1. More information regarding these 53 genes is presented in Supplementary Material 1.

Enrichment Analysis

In this section, we present GSEA and SNEA results for 5 different groups: 1) All 1,219 genes; 2) The 4 gene groups listed in Table 1.

1. Enrichment Analysis on All 1,219 Genes

The full list of 819 pathways/gene sets enriched with p-value<1E-5 is provided in Supplementary Material 2. Of these, 605 pathways/gene sets were enriched with p-values<1E-6, 279 with p-values<1E-10, 87 with p-values<1E-20, and 22 with p-values<1E-40 (Jaccard similarity >0.04), as shown in Table 2.

Table 2 Molecular function pathways/ groups enriched by 1,219 genes reported

Table 2 shows that most of the extremely enriched pathways/gene sets were related to nervous system development or neuronal signal transmission. Examples include synapse, neuronal cell body, axon, dendrite, postsynaptic density, signal transduction, neuron projection, synaptic transmission, and nervous system development. Enrichment in these pathways suggests that the genes are related to the neurological basis of SCZ (Danielyan and Nasrallah, 2009).

Some pathways/gene sets were related to drug effects, including response to drug (GO: 0042493; p-value=5.03E-79), response to ethanol (GO: 0045471; p-value=2.35E-52), and response to estradiol (GO: 0032355; p-value=2.58E-42). The genes in these pathways play important roles in the change in state or activity of a cell or an organism in terms of movement, secretion, enzyme production, gene expression, etc. as a result of a drug stimulus. A drug is a substance used in the diagnosis, treatment or prevention of a disease. Enrichment of the groups suggests that the genes may serve as drug targets for SCZ. Indeed, many of them, such as COMT, DRD2, DRD3 and HTR2A (Strick et al., 2011; Winchester et al., 2014), have already been suggested as antagonist genes for SCZ.

Other interesting pathways/gene sets include aging (GO: 0016280; p-value=2.79E-48; 88 genes) and memory (GO: 0007613; p-value=6.77E-41; 47 genes). It has been well established that SCZ is related to learning and memory impairments, and that age is an important factor influencing the pathogenesis of SCZ (Paulsen et al., 1995).

More significantly enriched pathways have been identified and presented in Supplementary Material 2. The results provide insights for understanding the functional processing through which the genes affect the development of SCZ, and validate the pathogenic significance of these 1,219 genes to SCZ.

In addition to GSEA, we performed SNEA using Pathway Studio with the purpose of identifying the pathogenic significance of reported genes with other disorders, especially psychiatric disorders. We provide the full list of results in Supplementary Material 3. We list in Table 3 the disease related sub-networks enriched with a p-value<1.24E-168.

Table 3 Sub-networks Enriched by the 1,219 Genes Reported

From Table 3 we see that many of these reported SCZ related genes were also identified in other mental health disorders, with a large percentage of overlaps (Jaccard similarity>0.15).
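The Jaccard similarity quoted for these overlaps is the size of the intersection of two gene sets divided by the size of their union. A minimal sketch follows; the function name is an assumption, and the two example sets are hypothetical subsets built from gene symbols mentioned in this article, not actual sub-network memberships.

```python
def jaccard_similarity(genes_a, genes_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical sets for illustration only.
scz_genes = {"COMT", "DRD2", "BDNF", "NRG1"}
other_disorder_genes = {"COMT", "BDNF", "HTR2A"}
similarity = jaccard_similarity(scz_genes, other_disorder_genes)
print(similarity)  # 2 shared genes out of 5 distinct genes
```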

2. Enrichment Analysis on Top 61 Genes with Highest Scores

From Table 1 we see that the gene set with the highest NScores has few overlaps with the top gene sets as ranked by the other scores, while the top genes selected by QScore, CScore, and RScore have large overlaps with each other. This is because QScore, CScore, and RScore are strongly related (see Methods), while the NScore measures something different. Here we compared their difference in terms of GSEA and SNEA results. Considering the similarity of the groups selected by QScore, CScore, and RScore, we only present the results for the NScore group and the RScore group here (Table 4), and report the full results for the QScore and CScore groups in Supplementary Material 2 and 3.

Table 4 Pathways/groups Enriched by 61 Genes with the Highest NScore and RScore

From Table 4 we see that the genes with the top NScores and those with the top RScores enrich different groups of pathways, with different p-values (NScore group: 2.14E-05 to 4.98E-08; RScore group: 5.82E-14 to 1.93E-25), indicating that the newly reported genes are functionally different from the frequently reported ones.

Moreover, we observe that the top 10 pathways/gene sets enriched by the RScore group (Table 4) have several overlaps with the 22 pathways/gene sets enriched by the overall 1,219 genes (Table 2), including: response to drug (GO: 0017035), memory (GO: 0007613), axon (GO: 0030424), and synaptic transmission (GO: 0007268), though with larger but still significant p-values. Similarly, we see that the signal transduction group (GO: 0007165) is enriched by both the overall genes and the NScore group alone, although with much weaker significance for the latter (6.53E-68 vs. 4.98E-08), indicating that many more genes with similar function have already been discovered.

For the SNEA analysis, we only test the disease sub-networks that have been enriched by the two groups of genes. We provide the full list of results in Supplementary Material 3. We present in Table 5 the top 10 disease related sub-networks enriched by the two groups of genes.

Table 5 SNEA Results by 61 Genes with the Highest NScore and RScore

From Table 5 we see that both groups enriched some psychiatric disorder related sub-networks. However, the enrichment p-values of the RScore group are much smaller than those of the NScore group. Moreover, more disease sub-networks were enriched by both the RScore group and the overall group.

3. Connectivity Analysis

In addition to GSEA and SNEA, we performed a network connectivity analysis on the top 20 genes with the highest RScore (from Table 1) to generate a functional network, as illustrated in Fig. 1.

Fig. 1 Connectivity Network of 20 Genes with Highest RScore. The network is generated using Pathway Studio. The 20 genes discussed in the paper are highlighted using a green halo. All genes but one, DAO, have functional associations with each other, supported by over 2,000 articles. DAO has a strong indirect connection with other genes, and here we only present its indirect connection with RELN and GAD1 as an example.

From Fig. 1 we see that the genes, especially the ‘hot’ ones in terms of reported frequency, demonstrate strong connectivity between each other.

Quantitative Analysis of Literature Support

Fig. 2 presents the histogram of the number of articles against publication date. As shown in Fig. 2, the LDM of this study covers articles published over the past 59 years (1958 to 2016). However, the articles have an average publication age of 7 years, suggesting that most were published in recent years. Here we define the publication age as current year - publication year + 1. It should be noted that the number of publications has increased in recent years, especially after 2009. Although the highest number of articles was published in 2012, publication volumes in recent years overall remain similarly high. The frequent and continuous reporting in recent years suggests that these genes have attracted, and still attract, significant attention in genetic studies of SCZ.

Fig. 2 Histogram of the Publications Reporting Marker-disease Relationships between SCZ and 1,219 Genes

In addition, we performed a marker-wise analysis of the publication date distribution of the supporting articles and provided the error-bar plot shown in Fig. 3. The X-axis contains the names of the markers ranked by QScore. The Y-axis represents the mean of the publication date plus/minus the corresponding standard deviation. From the figure, we can see that the publication date distributions of the articles associated with each marker are similar for most markers, and are all centered around 2010.

Fig. 3 Error-bar Plot of the Publication Date of the Articles for Each Gene

Quality Metrics Analysis

Fig. 4 presents the marker-wise CScore, QScore and NScore, along with the number of references (RScore). The X-axis represents the index of markers ranked by QScore; the Y-axis shows the CScore, QScore, NScore, and RScore, each normalized by its maximum value. As shown in Fig. 4, the QScores are roughly proportional to the RScore and CScore. The Pathway Studio documentation also indicates that QScore is inversely related to publication age. Nevertheless, the markers we discussed have similar publication ages (Fig. 3). Therefore, publication age contributes less to the ranking of these markers. In this study, the genes we identified have QScores in the range of [0.01, 172.16].
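The normalization used for Fig. 4, in which each score is divided by its maximum over all markers so that the four curves share a common scale, can be sketched as follows. The function name is an assumption; the example values are the article counts (RScores) reported above for COMT, DRD2 and BDNF.

```python
def normalize_by_max(values):
    """Scale a list of non-negative scores so that the largest becomes 1."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]  # avoid dividing by zero
    return [v / peak for v in values]


# RScores (article counts) reported for COMT, DRD2 and BDNF.
rscores = [430, 394, 341]
normalized = normalize_by_max(rscores)
print([round(v, 3) for v in normalized])
```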

It should also be noted that some genes with low QScores present relatively high NScores (Fig. 4). This signifies that these genes have been newly identified during the past two years; the higher the NScore, the 'hotter' the newly identified gene for SCZ.

Fig. 4 Plot of CScore, QScore, NScore and RScore. Each of these measures was normalized using the corresponding maximum value. The NScore presented in this figure is the NScore_2, so the NScore will be zero if the corresponding gene is supported only by articles with a publication age older than 2 years.

Interestingly, we found that the majority of the QScores of the articles underlying associations between particular genes and the disease follow a normal distribution, as shown in the QQPlot provided in Fig. 5. The articles whose QScores follow a normal distribution roughly correspond to those lying within one standard deviation of the mean, representing about 75.29% of the total articles, while for a normal distribution this percentage is 68.27%.

Fig. 5 QQPlot of All QScores of 13,515 Articles

Fig. 6 The Gene-wise Bubble Plot of QScores

Fig. 6 presents the marker-wise bubble plot of the articles, where the number of articles is represented by both the size and the color of the bubbles. As shown in Fig. 6, most of the articles support the genes with higher QScores.

Furthermore, approximately 11.93% of the articles lie above the 1-standard-deviation band, and about 12.78% lie below it, as shown in Fig. 5. Although all the articles designating a marker are important in supporting the marker-disease relation, the quality metric scores, namely RScore, NScore, CScore and QScore, provide several tentative methods to rank the markers related to the disease, as well as an approach to rank the articles behind a marker when there are many.
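The band fractions discussed above can be reproduced for any list of scores with a few lines of code. The sketch below is illustrative only: the function name is an assumption, the population standard deviation is assumed because the text does not say which estimator was used, and the sample scores are hypothetical. The reference value for a normal distribution, erf(1/sqrt(2)), is about 68.27%.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def band_fractions(scores):
    """Return the fractions of scores below, within, and above mean +/- 1 SD."""
    mu = mean(scores)
    sigma = pstdev(scores)  # population SD (assumption; not stated in the text)
    n = len(scores)
    below = sum(1 for s in scores if s < mu - sigma)
    above = sum(1 for s in scores if s > mu + sigma)
    within = n - below - above
    return below / n, within / n, above / n


# Share of a normal distribution within one standard deviation of the mean.
normal_within = erf(1 / sqrt(2))
print(f"normal reference: {normal_within:.4f}")

# The three percentages reported in the text partition all articles.
reported_total = 12.78 + 75.29 + 11.93
print(f"reported total: {reported_total:.2f}%")

# Hypothetical scores for illustration only.
fractions = band_fractions(list(range(11)))
print(fractions)
```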

Author-affiliation Analysis of Literature Support

In this section, we analyzed the affiliations that published these articles, and present the analysis results, including: 1) The top affiliations ranked by number of publications; 2) The collaboration network among these top affiliations; 3) The classification of these top affiliations; and 4) The top companies that studied the relations between the markers and the disease.

Fig. 7 presents the top 15 affiliations with the number of articles they published. There is a total of 13,515 articles from 3,601 affiliations supporting the relations between the markers and SCZ, out of which 3,858 articles are from the top 15 affiliations (0.4% of all affiliations), accounting for 28.5% of all articles. As shown in Fig. 7, most of these top affiliations are famous academic institutes, including National Institute of Mental Health, Shanghai Jiaotong University, VA Medical Center, University of Toronto, Shanghai Mental Health Center, Japan Science and Technology Agency, King's College London, The Johns Hopkins School of Medicine, Icahn School of Medicine at Mount Sinai, Peking University, Chinese Academy of Sciences, Ludwig-Maximilians-University at Munchen, Karolinska Institute, University of Pittsburgh, and Centre for Addiction and Mental Health. Although an article from a famous affiliation is not necessarily a good publication, it is still worth noting that these highly reputed organizations were involved in the studies of the marker-disease relations between the genes reported and SCZ.

Fig. 7 The Top 15 Affiliations Ranked by the Number of Articles They Published Supporting the Marker-disease Relations Between the 1,219 Genes and SCZ

During the affiliation analysis of these articles, we noticed that many of the studies reported in the papers were developed through the joint efforts of two or more affiliations, as shown in Fig. 8.

Fig. 8 Collaboration Map of the Top Organizations Ranked by Publication Numbers. The number in each cell is the number of articles coauthored by both corresponding affiliations. If the number is equal to or greater than 255, it will appear as 255.

Fig. 8 suggests that the top affiliations have strong collaborations with each other in terms of paper publications. For example, Shanghai Mental Health Center and Shanghai Jiaotong University have published 255 or more articles together studying the relations between the disease and the genes discovered. These collaborations indicate joint efforts in the field. Generally, collaborations lead to more solid and reliable conclusions.
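A collaboration map of this kind can be built by counting, for every pair of top affiliations, the articles that list both, and then capping each cell at 255 as described in the Fig. 8 caption. The following is a minimal sketch under assumed inputs: the function name, the input layout (one list of affiliation names per article), and the example data are hypothetical.

```python
from collections import Counter
from itertools import combinations

DISPLAY_CAP = 255  # cell values at or above this appear as 255 (Fig. 8 caption)


def collaboration_counts(article_affiliations, top_affiliations):
    """Count co-authored articles for each pair of top affiliations.

    article_affiliations: iterable of lists, one list of affiliation
                          names per article (hypothetical layout)
    top_affiliations:     the affiliations to include in the map
    """
    top = set(top_affiliations)
    counts = Counter()
    for affiliations in article_affiliations:
        # Deduplicate and sort so each pair has one canonical key.
        present = sorted(set(affiliations) & top)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return {pair: min(n, DISPLAY_CAP) for pair, n in counts.items()}


# Hypothetical articles for illustration only.
articles = [["A", "B"], ["A", "B", "C"], ["C"], ["B", "D"]]
collab = collaboration_counts(articles, top_affiliations=["A", "B", "C"])
print(collab)
```

Because of the cap, a cell showing 255 only indicates that the two affiliations share at least 255 articles.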

It should also be noted that, besides schools and universities, there are also many other types of affiliations, such as hospitals, medical centers and companies. In this work, we separated the affiliations into four different classes (Fig. 9): 1) Academic institutes, including schools, universities, research centers and government academic institutes; 2) Hospitals and medical centers; 3) Companies; 4) Others.

Fig. 9 Pie Plot of the Author Affiliation Distribution

Fig. 10 The 15 Companies Reporting Associations between the Genes and SCZ

From Fig. 9 we can see that most of the papers (69.0%) were developed by academic institutes, while the shares for hospitals and medical centers and for companies were 14.9% and 2.9%, respectively. About 13.2% of articles have affiliations that could not be assigned to any of these three classes. This was partially due to incomplete information extraction. We extracted the affiliation information using the Elsevier Scopus Search API, which is publicly available but is not 100% responsive to queries. Although the articles by companies contribute only a small part of the total publications (Fig. 10), the companies with a relatively high number of publications deserve particular attention. Unlike academic institutes, whose research is mostly in basic science, companies are more focused on applied research. The research they sponsor or directly perform is more likely to be linked to immediate applications, such as drug development. Fig. 10 depicts the top 15 companies reporting associations between the markers we discussed and SCZ.


In this work, we performed an LDM and a systematic summarization of 13,515 articles (from 1958 to Feb. 2016) reporting 1,219 genes associated with SCZ. We provided the full analysis results in Supplementary Material 1. In addition, we conducted GSEA to study the functional profile and the pathogenic significance of the reported genes to SCZ, with full results provided in Supplementary Material 2. We also conducted SNEA to further study the significance of these genes to potential SCZ related diseases and present the full analysis results in Supplementary Material 3. Unlike genetic studies that use raw data to report novel discoveries, this is a literature-based summarization and validation of already reported marker-disease relations.

As an automatic data mining approach, the Natural Language Processing (NLP) technique used for LDM is effective and necessary in dealing with millions of articles. However, the automatic LDM method may produce some false positives. Therefore, the results of this study are intended to lay the groundwork for further studies in the area. To this end, we provided in Supplementary Material 1 the detailed information of all the 13,515 articles studied for further investigation, including the sentences where a specific relation has been located.

This study has several limitations that need further work. 1) The literature data of 13,515 articles studied were extracted from the Pathway Studio database. Although the Pathway Studio database is composed of over 40 million articles, it is still possible that some articles studying gene-SCZ associations were outside its coverage. 2) The four quality scores, RScore, NScore, CScore and QScore, are proposed as quality measures of LDM-identified marker-disease relations and can be used to rank the markers/relations according to different needs/significance. Although related, they are not measures of the biological significance of the markers to the disease, unlike statistical approaches such as GWAS, meta-analysis and enrichment analysis.

Nevertheless, results from this up-to-date LDM suggest that these 1,219 genes have multiple types of association with SCZ, and enrichment analysis suggests that these genes play significant roles in the pathogenesis of SCZ, as well as in the pathogenesis of many other SCZ related psychiatric disorders. Together with the network connectivity analysis, we conclude that SCZ is a complex disease whose genetic causes are linked to a network composed of a large group of genes. LDM together with GSEA and SNEA could serve as an effective approach in finding these potential target genes.


Dr. Hongbao Cao is from Elsevier Inc. The other authors declare no conflicts of interest.

How to Cite This Article

Li S, Lehrman BH, Cao H, Manor LC. Functional Network Composed of 1,219 Genes for Schizophrenia-a Literature Data Mining and Enrichment Analysis. J Psychiatry Brain Sci. 2016;1(1):4;
