Literature Data Mining and Enrichment Analysis on Top 235 Genes for Attention Deficit Hyperactivity Disorder

Background: Attention deficit hyperactivity disorder (ADHD) is a psychiatric disorder of the neuro-developmental type, marked by an ongoing pattern of inattention or hyperactivity/impulsivity that interferes with functioning or development. The disorder affects approximately 5-7% of children and 2-5% of adults worldwide. Numerous studies have indicated that genetic factors predominate among the causes of ADHD. Nevertheless, no systematic study has summarized these findings and provided an objective and complete list of genes with a reported association to ADHD. Methods: Literature and enrichment metrics analyses were used to discover genes of specific significance associated with ADHD. We conducted a literature data mining (LDM) of over 2,410 articles covering publications from Jan. 1988 to Apr. 2016, in which 235 genes were reported to be associated with the disease. We then performed a gene set enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) to study the functional profile and pathogenic significance of these genes. Lastly, we performed a network connectivity analysis (NCA) to study the associations between the reported genes. Results: 181/235 genes enriched 100 pathways (p<1.1e-007), demonstrating multiple associations with ADHD. Twelve genes were found to be associated with ADHD in terms of both functional diversity and replication frequency: SLC6A3, DRD4, BDNF, DRD2, HTR2A, DBH, HTR1B, DRD5, GRM7, DRD3, TH and GRIN2A. In addition, one novel gene, SHANK2, was suggested as worthy of further study. Moreover, SNEA and NCA results indicated that many of these genes form a functional network and play roles in the pathogenesis of other ADHD-related disorders. Conclusion: Our results suggest that the genetic causes of ADHD are linked to a genetic and functional network composed of a large group of genes. The gene lists, together with the literature and enrichment metrics provided in this study, could serve as groundwork for further biological/genetic studies in the field.


INTRODUCTION
Attention deficit hyperactivity disorder (ADHD) is a brain disorder characterized by problems in paying attention, excessive activity, or difficulty in controlling behavior that is inappropriate for a person's age.(1) The World Health Organization estimated that it affected around 39 million people as of 2013.(2) ADHD is diagnosed approximately three times more often in boys than in girls.(3) Despite being the most commonly studied and diagnosed psychiatric disorder in children and adolescents, the cause of the disorder remains unknown in the majority of cases. Numerous studies have indicated that the onset of the disorder involves an interaction between genetic and environmental factors.(4) An increasing number of articles have reported hundreds of genes/proteins related to ADHD, many of which have been suggested as potential biomarkers for the disease, such as SLC6A3 and ADRA2A.(5,6) Some of these genes (e.g., SLC6A2) have been studied in clinical trials as well.(7) Articles have also reported quantitative changes in gene expression in ADHD.(8,9) Both increased and decreased gene expression levels/activities have been observed.(10,11) To note, many genes were also reported to influence the pathogenic development of ADHD through an unknown mechanism.(12,13) Some recent studies have suggested a functional mechanism by which a mutation can cause ADHD. Hong et al. showed that differential expression of Homer 1a and Homer 2a/b, members of a family of scaffolding proteins localized to the postsynaptic density of glutamatergic excitatory synapses, was observed in the prefrontal cortex and extended to the hippocampus. These genes have direct connections to attention and cognition, the two functions that are disturbed in ADHD.(14) Nevertheless, no systematic analysis has been done to evaluate the quality and strength of these reported genes as a functional network/group in order to study the underlying biological processes of ADHD. In this study, instead of focusing on a specific gene, we attempt to provide a full view of the genetic map, and use gene set enrichment analysis (GSEA), as well as sub-network enrichment analysis (SNEA), to study the underlying functional profile of the genes identified.(15) We hypothesized that the majority of these previously reported genes, if not all of them, play roles in the development of ADHD, and that the major pathways/gene sets enriched by these genes are the ones associated with the disease.

METHODS AND MATERIALS
The workflow of the study is as follows: 1) Literature data mining (LDM) to discover gene-ADHD relationships; 2) Enrichment analysis on the identified genes to study their pathogenic significance in ADHD; 3) Literature and enrichment metrics analysis; and 4) Network connectivity analysis (NCA) to test the functional associations between the reported genes.

Literature Data Mining and Article Selection Criterion
In this study, we performed an LDM of all articles available in the Pathway Studio database (www.pathwaystudio.com), which covered over 40 million scientific articles up until Apr. 2016, seeking the ones that reported gene-ADHD relations. The LDM was conducted by employing the finely tuned Natural Language Processing (NLP) system of the Pathway Studio software, which has the capability of identifying and extracting relationship data from scientific literature. Only publications containing a biological gene-ADHD interaction as defined by the ResNet Exchange (RNEF) data format were included (http://www.gousinfo.com/). Results were presented with a full list of gene names, the information on the underlying articles, and the metrics scores, which are described below.

Literature Metrics Analysis
For literature metrics analysis, we proposed two scores for each gene-disease relationship. We define the reference number underlying a gene-disease relationship as the gene's reference score (RScore), given by Eq. (1).

$$\mathrm{RScore} = n \qquad (1)$$
where n is the total number of references supporting a gene-disease relation. We also define the earliest publication age of a gene-disease relationship as the gene's age score (AScore), given by Eq. (2).

$$\mathrm{AScore} = \max_{1 \le i \le n} \mathrm{ArticlePubAge}_i \qquad (2)$$
where n is the total number of references supporting a gene-disease relationship, and ArticlePubAge_i is the publication age of the ith article (ArticlePubAge), given by Eq. (3).
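The two literature scores can be illustrated with a minimal Python sketch. It assumes that AScore is the age of the earliest supporting article, and that an article's publication age is counted in whole years as current year minus publication year plus one; the exact form of Eq. (3) is not reproduced in this text, so that formula, the function names and the sample years are all assumptions made for illustration.

```python
def article_pub_age(pub_year, current_year):
    """Publication age of one article in years (assumed form of Eq. 3)."""
    return current_year - pub_year + 1


def literature_scores(pub_years, current_year):
    """Return (RScore, AScore) for one gene-disease relationship.

    pub_years: publication years of the supporting references.
    """
    if not pub_years:
        raise ValueError("at least one supporting reference is required")
    rscore = len(pub_years)  # Eq. (1): number of supporting references
    # AScore: age of the earliest supporting article (largest age)
    ascore = max(article_pub_age(y, current_year) for y in pub_years)
    return rscore, ascore


# Hypothetical reference years for a single gene, not taken from the study.
example_years = [1998, 2005, 2014, 2016]
rscore, ascore = literature_scores(example_years, current_year=2016)
print(rscore, ascore)
```

Under this convention a small AScore marks a recently reported gene, and a large RScore marks a frequently replicated one.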

Enrichment Metric Analysis
Supposing a disease is associated with n genetic pathways, we define the gene-wise enrichment score (EScore) for the kth gene within a gene set as given by Eq. (4).
$$\mathrm{EScore}_k = \frac{\sum_{i=1}^{m} \left(-\log_{10} \mathrm{pValue}_i\right)}{\max_{1 \le i \le n} \left(-\log_{10} \mathrm{pValue}_i\right)} \qquad (4)$$
where pValue_i is the enrichment p-value of the ith pathway within the gene set; n is the total number of pathways; and m is the number of pathways including the kth gene, over which the sum in the numerator runs.
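As an illustration of the EScore definition, the following Python sketch sums the negative log10 p-values of the pathways that contain a gene and divides by the largest negative log10 p-value over all pathways. The data layout (a dictionary mapping a pathway name to a p-value and a gene set) and the example values are hypothetical.

```python
import math


def escore(gene, pathways):
    """Gene-wise enrichment score.

    pathways: dict mapping pathway name -> (p_value, set of member genes).
    """
    neg_logs = {}
    for name, (p_value, _) in pathways.items():
        if not 0.0 < p_value <= 1.0:
            raise ValueError(f"p-value out of range for {name}")
        neg_logs[name] = -math.log10(p_value)

    denominator = max(neg_logs.values())  # most significant pathway overall
    if denominator == 0.0:
        raise ValueError("all p-values equal 1; EScore is undefined")

    # Sum over the pathways that include the gene.
    numerator = sum(
        neg_logs[name]
        for name, (_, members) in pathways.items()
        if gene in members
    )
    return numerator / denominator


# Hypothetical pathways and p-values, for illustration only.
toy_pathways = {
    "pathway_A": (1e-10, {"G1", "G2"}),
    "pathway_B": (1e-5, {"G1"}),
    "pathway_C": (1e-2, {"G3"}),
}
print(escore("G1", toy_pathways))
```

A gene that appears in several highly significant pathways can score above 1, while a gene that appears in none scores 0.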

Enrichment Analysis
To better understand the underlying functional profile and the pathogenic significance of the reported genes, we performed a gene set/pathway enrichment analysis (GSEA) and a sub-network enrichment analysis (SNEA) on 3 groups: 1) the whole gene list (235 genes); and 2) two subgroups selected using the highest quality metrics scores. In addition, we conducted a network connectivity analysis (NCA) using the Pathway Studio network building module.
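The text does not state which statistical test Pathway Studio applies when it reports an enrichment p-value. A common choice for over-representation analysis is the one-sided hypergeometric test, sketched below in Python under that assumption. The function name and all numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe, pathway_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a universe that contains pathway_size pathway members."""
    upper = min(pathway_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 150-gene pathway,
# a 235-gene input list, and 12 genes shared between the two.
p = hypergeom_enrichment_p(20000, 150, 235, 12)
print(f"{p:.3e}")
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that appear when very small tail probabilities are accumulated in floating point.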

Summary of LDM Results
In this study, we conducted an LDM of 2,410 articles that reported 235 genes associated with ADHD.
For the 235 genes, 1.27% of genes presented a Biomarker relationship to the disease, 0.32% a Clinical Trial relationship, 58.23% a Genetic Change relationship, 12.03% a Quantitative Change relationship, and 28.16% a Regulation relationship. Moreover, 26.81% of genes were reported to have multiple relationships with the disease. Specifically, 73.19% of genes presented 1 type of relationship to the disease, 20.00% presented 2, 5.96% presented 3, and 0.85% presented 4, as shown in Fig. 1.
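The percentages for the number of relationship types per gene can be cross-checked against the total of 235 genes. The gene counts in the Python sketch below (172, 47, 14 and 2) are not stated in the text; they are back-calculated here from the reported percentages and should be read as an assumption.

```python
TOTAL_GENES = 235

# Genes per number of relationship types (back-calculated, not from the text).
counts = {1: 172, 2: 47, 3: 14, 4: 2}

percentages = {k: round(100 * c / TOTAL_GENES, 2) for k, c in counts.items()}
multi_relation = round(
    100 * sum(c for k, c in counts.items() if k > 1) / TOTAL_GENES, 2
)

print(sum(counts.values()))   # should equal TOTAL_GENES
print(percentages)            # compare with 73.19, 20.00, 5.96, 0.85
print(multi_relation)         # compare with 26.81
```

With these counts the four shares and the 26.81% figure for genes with multiple relationships are mutually consistent.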

Fig. 1 Gene-wise Relation Type Distribution of 235 Genes
The publication date distribution of these 2,410 articles is presented in Fig. 2 (a), where we show that this study covers literature data from the past 28 years (1988-2016), with novel genes reported in each year (Fig. 2 (b)). To note, these articles have an average publication age of only 6.4 years, indicating that most of the articles were published in recent years. In addition, recent years saw an increased number of publications, especially after 2010, with more novel genes being discovered (Fig. 2 (b)). Moreover, our analysis showed that the publication date distributions of the articles underlying each of the 235 genes were also similar to that presented in Fig.

Marker Ranking
Using the 2 literature metrics scores, we identified that some genes were frequently reported, with large numbers of articles supporting them, such as SLC6A3 (366 articles), DRD4 (285 articles) and SLC6A2 (111 articles). These genes are the ones with the highest RScores. Some genes were reported only recently (e.g., within the last two years), such as AS3MT, ANKK1 and MAP1B.
Among the 235 genes, 26 were reported within the last two years (2015-2016); these are listed in Table 1, and the full results are provided in Supplementary Material 1. For comparison purposes, Table 1 also lists the top 26 genes with the highest RScores.

Enrichment Analysis
In this section, we present the GSEA and SNEA results for 3 different groups: all 235 genes, and the 2 gene groups listed in Table 1.
Moreover, among these 100 enriched pathways/gene sets, we identified 15 pathways/gene sets (with 136 unique genes) related to the neuronal system, 9 pathways/gene sets (27 unique genes) related to neurotransmitters, 7 pathways/gene sets (43 unique genes) related to brain function/development and 5 pathways/gene sets (42 unique genes) related to behavior. In Table 2, the Jaccard similarity (J_x), a statistical measure used for comparing the similarity and diversity of sample sets, defined by Eq. (5), is given.
Besides GSEA, we also performed an SNEA using Pathway Studio with the purpose of identifying the pathogenic significance of the reported genes to other disorders that are potentially related to ADHD. We provide the full list of results in Supplementary Material 3. In Table 3, we present the disease-related sub-networks enriched with a p-value < 7.05E-46. From Table 3, we see that many of these reported ADHD-related genes were also identified in other mental health related diseases, with a large percentage of overlap (Jaccard similarity ≥ 0.04).
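The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch follows; the gene identifiers in the example are placeholders, not memberships taken from Table 2 or Table 3.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of genes."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Placeholder gene sets for illustration.
adhd_genes = {"G1", "G2", "G3"}
pathway_genes = {"G2", "G3", "G4"}
print(jaccard(adhd_genes, pathway_genes))
```

Because the union in the denominator grows with the size of either set, a small value such as 0.04 can still correspond to a substantial number of shared genes when a large disease sub-network is compared with a shorter gene list.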

Enrichment Analysis on Top 26 Genes with Highest Scores
We compared the top 26 genes listed in Table 1 in terms of GSEA and SNEA results. Here we only present the top 10 pathways/sub-networks for the AScore and the RScore groups, respectively (Table 4 and Table 5), and report the full results in Supplementary Material 2 and 3.
Using the same enrichment p-value threshold (p<1E-3), we identified 41 pathways/gene sets that were enriched with the 26 genes with top AScores, while the corresponding number for the RScore group is 181. The full lists of these pathways/gene sets are provided in Supplementary Material 2. Table 4 presents the top 10 pathways enriched with the 26 genes from the AScore and RScore groups, respectively. Moreover, 6 of the 10 pathways/gene sets enriched by the RScore group (Table 4) were also observed in Table 2, which lists the top 20 pathways/gene sets enriched with 155/235 genes, whereas the number for the AScore group is 0.
For the SNEA, we only performed an enrichment analysis against disease sub-networks. We provide the full list of results in Supplementary Material 3. Table 5 presents the top 10 disease-related sub-networks enriched by the top 26 genes from the AScore group and the RScore group, respectively. From Table 5, we see that both groups enriched other mental health related sub-networks. However, the enrichment p-values of the RScore group were much more significant than those of the AScore group, and the Jaccard similarities were higher.

Connectivity Analysis
In addition to GSEA and SNEA, we performed an NCA on the top 26 genes with the highest RScores and AScores (from Table 1) to generate gene-gene interaction networks. Results showed that for the RScore group, there were 154 connections among 24/26 genes, which are strongly supported by the literature. Only 2 genes presented no direct connection with any other gene (Fig. 3 (a); highlighted in blue). In contrast, genes within the AScore group demonstrated only 36 relations among 18/26 genes, as shown in Fig. 3 (b), with 8 genes showing no direct relation with other genes in the group (Fig. 3 (b); highlighted in blue). This observation is consistent with the GSEA and SNEA results, suggesting that genes with the smallest AScores are not as functionally close to each other as those from the RScore group.
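The connectivity summary reported above (number of connections, number of connected genes, and the genes left unconnected) can be derived from an edge list. The Python sketch below treats relations as undirected and counts each gene pair once, which is an assumption; Pathway Studio may count directed or typed relations separately. The gene names and edges are placeholders.

```python
def connectivity_summary(genes, edges):
    """Return (n_connections, n_connected_genes, isolated_genes).

    genes: iterable of gene names in the group.
    edges: iterable of (gene_a, gene_b) relations.
    """
    genes = set(genes)
    pairs = set()
    connected = set()
    for a, b in edges:
        # Keep only relations between two distinct genes of the group.
        if a in genes and b in genes and a != b:
            pairs.add(frozenset((a, b)))  # undirected, deduplicated
            connected.update((a, b))
    return len(pairs), len(connected), sorted(genes - connected)


# Placeholder group and relations for illustration.
group = ["G1", "G2", "G3", "G4"]
relations = [("G1", "G2"), ("G2", "G1"), ("G2", "G3"), ("G3", "GX")]
print(connectivity_summary(group, relations))
```

Genes returned in the last element correspond to the nodes highlighted as unconnected in a network figure.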

EScore Analysis
Through GSEA, we also generated a biological metric, EScore, for each gene. The value of an EScore represents how strongly a gene is related to the pathways associated with ADHD. To compare the EScore and the literature metrics, we performed a cross-analysis of the top 26 genes selected using different scores, and present a Venn diagram in Fig. 4. Besides comparing the top 26 genes (Fig. 4 (a)), we also compared the averaged metric values of all 235 genes on a group level, as shown in Fig. 4 (b). We used a group size of 7 genes; that is, we first sorted the 235 genes by RScore, then averaged each type of metric value using a moving window of length 7. Results showed that the average scores were strongly correlated, as shown in Table 6.
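The group-level comparison can be sketched in Python as follows. Several details are assumptions, since the text does not specify them: genes are sorted by RScore in descending order, the window of length 7 advances one gene at a time, and the correlation is measured with the Pearson coefficient. The input values are synthetic.

```python
from statistics import mean


def moving_average(values, window=7):
    """Average each run of `window` consecutive values (stride of 1)."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


def pearson(x, y):
    """Pearson correlation; undefined if either input has zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def windowed_correlation(genes, window=7):
    """genes: list of (rscore, other_score) pairs, one per gene."""
    ranked = sorted(genes, key=lambda g: g[0], reverse=True)
    r_avg = moving_average([g[0] for g in ranked], window)
    o_avg = moving_average([g[1] for g in ranked], window)
    return pearson(r_avg, o_avg)


# Synthetic scores: the second score is a linear function of the first.
synthetic = [(r, 2 * r + 1) for r in range(1, 11)]
print(windowed_correlation(synthetic))
```

Averaging within a window smooths gene-level noise, so a correlation computed on windowed averages is typically stronger than one computed on individual genes and should be interpreted at the group level.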

DISCUSSION
In this work, we performed an LDM on 2,410 articles (from 1988 to April 2016) reporting 235 genes associated with ADHD. We provide in Supplementary Material 1 the full gene list together with the literature and enrichment metrics scores. In addition, results from GSEA and SNEA support the current literature in that most of these genes may play roles in the pathogenesis of ADHD. Furthermore, NCA showed that many of these genes were functionally linked to one another.
As an automatic data mining approach, the NLP technique is effective and efficient in dealing with large amounts of literature data for LDM. However, the LDM method may produce some false positives. Therefore, the results of this study are intended to provide an overview map for the current field of genetic studies of ADHD and lay the groundwork for further biological/genetic studies in this area.
Although our analysis did not specifically focus on single genes, we noticed that the 235 genes identified were not equal in terms of publication frequency (RScore), novelty (AScore) and functional diversity (EScore). Using the proposed quality metrics scores, one is able to rank the genes according to different needs/significance and pick the top ones for further analysis (Table 1). For example, the top 5 genes by AScore, namely AS3MT, ANKK1, MAP1B, PER2 and TRAF4, are the ones that were recently reported. On the other hand, SLC6A3, DRD4, SLC6A2, COMT and BDNF are the top 5 genes that were found to be most often replicated in studies (with the highest RScores), suggesting them as common variables in the occurrence of ADHD. These genes likely possess biological significance in relation to the disease.
There is a relatively large overlap between the top EScore and RScore genes (Fig. 4), indicating that the frequently replicated genes tend to play roles within multiple pathways associated with ADHD. Moreover, one gene, SHANK2, reported recently in only a few articles, demonstrated a high EScore. SHANK2 is included in 13 of the top 100 pathways, many of which have been implicated in ADHD. These include neuron projection (0043005); synapse (0045202); postsynaptic membrane (0045211); neuronal cell body (0043025); learning (0007612); memory (0007613); dendritic spine (0043197); social behavior (0035176); postsynaptic density (0014069); long-term synaptic potentiation (0060291); adult behavior (0030534).(16,17) This observation suggests that SHANK2 is worthy of further study with regard to the pathogenesis of ADHD. Additionally, we observed that most genes identified by this LDM were included in pathways previously implicated in ADHD, including 15 neuronal system pathways, 9 neurotransmitter pathways, 7 brain function related pathways and 5 behavior related gene ontology terms.(16,19) To note, 181/235 genes were included in the top 100 enriched pathways (p-value < 1.1e-007), and 155/235 in the top 20 pathways listed in Table 2 (p-value < 2.6e-017). We hypothesize that the majority of these literature-reported genes, especially the ones that were identified from significantly enriched pathways, should be functionally linked to ADHD. Although there may be false positives from the separate studies reported in different publications, it is less likely that a large group of genes was falsely perturbed at the same time than that a single gene was, which is one of the advantages of GSEA.(15) Another advantage of GSEA is that, when the members of a gene set exhibit strong cross-correlation, GSEA can boost the signal-to-noise ratio and make it possible to detect modest changes in individual genes.(15) The NCA showed that many of the 235 reported genes were functionally associated with one another (Fig. 3), indicating that these functionally related genes from the literature are more likely to be true hits than noise (false positives).
In addition to GSEA, we performed a sub-network enrichment analysis (SNEA), which was implemented in Pathway Studio using master causal networks, a database containing more than 6.5 million relationships derived from more than 4 million full-text articles and 25 million PubMed abstracts. These networks were generated by a finely tuned NLP text mining system that extracts relationship data from the scientific literature. The ability to quickly update the terminologies and linguistic rules used by NLP systems ensures that new terms can be captured soon after entering regular use in the literature. This extensive database of interaction data provides high levels of confidence when interpreting experimentally derived genetic data against the background of previously published results (http://help.pathwaystudio.com/fileadmin/standalone/pathway_studio/help_ps_10.0/index.html?analyze_experiment.htm). Here, SNEA results demonstrated that many of the 235 genes (>90%) showing strong association with ADHD were also identified as causal genes involved in other mental health disorders (schizophrenia, aggression, cognitive impairment).(20)(21)(22)
Nevertheless, this study has some limitations that should be considered in future work. The literature data of the 2,410 articles studied were extracted from the Pathway Studio database. Although the Pathway Studio database covers over 40 million articles, it is still possible that some articles studying gene-ADHD associations were beyond its scope of coverage. Additionally, the quality scores RScore, AScore and EScore were proposed as quality measures of the literature-reported gene-disease relations. Although related to biological significance, they are not direct measures of the biological significance of the relationship between the genes and the disease.

CONCLUSION
Results from this up-to-date LDM reveal that the 235 genes identified have multiple types of associations with ADHD, providing an overview map for the current genetic study of ADHD. Meanwhile, the literature and enrichment metrics discovered top genes with specific significance in relation to the disease. In addition,

Fig. 2 Histogram of the Publications Reporting Gene-disease Relationships between ADHD and 235 Genes. (a) presents the histogram of article publication date; (b) presents the histogram of the number of novel genes identified in each year.

Fig. 3 Connectivity Networks Built by 26 Genes from Different Groups. The Networks Were Generated Using Pathway Studio; The Un-related Genes Are Highlighted in Blue.

Fig. 4 Comparison of Different Metrics Ranking the 235 Genes. (a) A Venn diagram of top 26 genes selected by different metrics; (b) Comparison of average metrics values with gene set size of 7.

From Table 4, we see that the genes with the top AScores and those with the top RScores enriched different groups of pathways, with different p-values (AScore group: 3.7E-06~4.12E-05; RScore group: 5.84E-16~1.35E-11), indicating that the newly reported genes are functionally different from the most frequently reported ones. Moreover, we observed that 6 out of the 10 pathways/gene sets enriched by the RScore group (Table 4) were also observed in Table 2.