Alzheimer’s disease (AD) is a common neurodegenerative disorder that impairs memory, language, and various bodily behaviors. Although no database records the mutation type (loss-of-function (LOF) or gain-of-function (GOF)) of AD-related genes, a large body of literature reports studies of AD pathogenesis, and the mutation types of the genes are often implied in those descriptions. Therefore, AGAC was applied to the AD literature to extract the mutated genes and the biological processes they alter.
The AGAC-based Vital Mutation Finding Pipeline on AD
We developed a semi-automatic NLP pipeline to extract vital AD-related mutations from PubMed. The pipeline uses state-of-the-art NLP techniques and consists of the following steps:
- Obtain AD papers from PubMed and pass them through a rule-based text relevance filter.
- Apply a BERT-CRF deep neural network to annotate all AD papers with AGAC labels.
- Use the AGAC corpus to infer LOF/GOF information for the mutations mentioned in the AD papers.
- Have domain experts manually check the results.
What is it for? The pipeline aims to extract hidden knowledge from the large body of PubMed texts.
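The first step listed above, rule-based relevance filtering, can be sketched as follows. This is only a minimal illustration: the text does not give the actual rules, so the keyword patterns and the function names `is_relevant` and `filter_abstracts` are assumptions made for this example, not the pipeline's implementation.

```python
import re

# Hypothetical rules: the real filtering rules are not described in the text.
# An abstract is kept only if it mentions the disease AND something that
# looks like a mutation (a keyword or a substitution such as "A10T").
DISEASE_RE = re.compile(r"alzheimer", re.IGNORECASE)
MUTATION_WORD_RE = re.compile(r"\b(?:mutation|mutant|variant)s?\b", re.IGNORECASE)
SUBSTITUTION_RE = re.compile(r"\b[A-Z]\d+[A-Z]\b")


def is_relevant(abstract: str) -> bool:
    """Return True if the abstract passes both illustrative rules."""
    mentions_disease = DISEASE_RE.search(abstract) is not None
    mentions_mutation = (
        MUTATION_WORD_RE.search(abstract) is not None
        or SUBSTITUTION_RE.search(abstract) is not None
    )
    return mentions_disease and mentions_mutation


def filter_abstracts(abstracts):
    """Keep only the abstracts that pass the relevance rules, in order."""
    return [text for text in abstracts if is_relevant(text)]
```

Abstracts that pass such a filter would then go on to the BERT-CRF annotation step; that step and the LOF/GOF inference are not sketched here because the text does not specify their details.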
An Example of AGAC-based Mutation Finding: Mutations in APP gene
Because the AGAC-based vital mutation finding pipeline was designed to extract key function changes caused by variations, we applied it to all AD-topic PubMed abstracts for large-scale key mutation discovery.
Here is an example of APP gene mutations extracted through AGAC-based vital mutation discovery. These five mutations are at three locations in APP, two of which are located in the Beta-APP domain of the APP protein. Three mutations were found at amino acid position 717, which lies between the sequence producing the Beta-APP domain and the sequence producing the APP amyloid domain; the latter forms beta-amyloid and is strongly implicated in the pathogenesis of AD. The corresponding evidence sentences for these APP mutations are given below.
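Grouping extracted mutations by residue position, as in the description above where several mutations share one location, can be sketched as follows. This is a minimal sketch assuming mutations are written in the simple one-letter substitution form (reference residue, position, new residue); the function name `group_by_position` is hypothetical and not part of the pipeline.

```python
import re
from collections import defaultdict

# Assumed notation: one-letter reference residue, position, one-letter new residue.
SUBSTITUTION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def group_by_position(mutations):
    """Map each residue position to the substitutions reported at it.

    Strings that do not match the assumed notation are skipped.
    """
    groups = defaultdict(list)
    for mutation in mutations:
        match = SUBSTITUTION_RE.match(mutation)
        if match is None:
            continue
        position = int(match.group(2))
        groups[position].append(mutation)
    return dict(groups)


# Made-up inputs for illustration only; these are NOT the five APP
# mutations reported on this page.
example = group_by_position(["A10T", "A10V", "G25S"])
```

A grouping like this makes it easy to see which positions carry more than one reported substitution.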
Overview of AGAC-based Mutation Finding: Biological Process Categories of The Extracted Mutations
The AGAC-based vital mutation finding pipeline extracted 325 mutations from abstracts that carry clear semantics about the downstream biological processes affected by the mutations. After manual curation, these can be divided into 8 types. Gene Expression, Protein Activity, Interaction, Pathway Activity and Cell Activity are the fundamental biological processes, which follow the central dogma and range from the molecular level to the cell level.
In addition, Phosphorylation, Abeta Accumulation and Ca2+ Concentration are frequently mentioned. Interestingly, these three biological processes are related to known hypotheses of AD pathogenesis. Abeta is a product of the APP gene; its accumulation, especially that of Abeta42, forms fibrillar amyloid plaques in the brain and impairs spatial learning and memory. Phosphorylation is related to another hypothesis of AD pathogenesis, in particular the phosphorylation of Tau protein, which is encoded by the MAPT gene. Hyperphosphorylation of Tau protein leads to neurofibrillary tangles in neurons and eventually results in neuronal apoptosis. Intracellular Ca2+ concentration is also thought to be part of the cause of AD: dysregulation of intracellular Ca2+ signaling disturbs many neural processes that are implicated in the AD mechanism.
Click here to see the dynamic visualization of the annotation results.
The full results, with the biological process category of each gene and the evidence sentences, are available for download.
See Supplementary data S2. The data are also available on the AGAC track website: https://sites.google.com/view/bionlp-ost19-agac-track/data-repository.
You can also check the table version of Figure 6.
Cite this work
Kaiyin Zhou, Yuxing Wang, Kevin Bretonnel Cohen, Jin-Dong Kim, Xiaohang Ma, Zhixue Shen, Xiangyu Meng, Jingbo Xia. Bridging Heterogeneous Mutation Data to Enhance Disease-Gene Discovery. Briefings in Bioinformatics, 2021, doi: 10.1093/bib/bbab079.