Date of Award
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Hai Nhat Phan
With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomic data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate progress toward using personal genomes in medicine.
This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models with that of traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes to regulate mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and outperforms competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation.
tRNAscan-SE remains the de facto tool for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and as a result it can output a significant number of false positives. To address this issue, tRNA-DL, a hybrid approach that combines a convolutional neural network with a recurrent neural network, is proposed. The proposed method is shown to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.
Gao, Xin, "Deep learning methods for mining genomic sequence patterns" (2018). Dissertations. 1387.