Date of Award

Fall 2009

Document Type


Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)


Computer Science

First Advisor

Usman W. Roshan

Second Advisor

Dennis Ray Livesay

Third Advisor

Zhi Wei

Fourth Advisor

Barry Cohen

Fifth Advisor

Jason T. L. Wang


The field of comparative genomics is abundant with problems of interest to computer scientists. In this thesis, the author presents solutions to three contemporary problems: obtaining better alignments for phylogeny reconstruction, identifying related RNA sequences in genomes, and ranking Single Nucleotide Polymorphisms (SNPs) in genome-wide association studies (GWAS).

Sequence alignment is a basic and widely used task in bioinformatics. Its applications include identifying protein structure, RNAs and transcription factor binding sites in genomes, and phylogeny reconstruction. Phylogenetic descriptions depend not only on the employed reconstruction technique, but also on the underlying sequence alignment. The author has studied and established a simple prescription for obtaining a better phylogeny by improving the underlying alignments used in phylogeny reconstruction. This was achieved by improving upon Gotoh's iterative heuristic by iterating with maximum parsimony guide-trees. This approach has shown an improvement in accuracy over standard alignment programs.

A novel alignment algorithm named Probalign-RNAgenome that can identify non-coding RNAs in genomic sequences was also developed. Non-coding RNAs play a critical role in the cell such as gene regulation. It is thought that many such RNAs lie undiscovered in the genome. To date, alignment based approaches have shown to be more accurate than thermodynamic methods for identifying such non-coding RNAs. Probalign-RNAgenome employs a probabilistic consistency based approach for aligning a query RNA sequence to its homolog in a genomic sequence. Results show that this approach is more accurate on real data than the widely used BLAST and Smith- Waterman algorithms.

Within the realm of comparative genomics are also a large number of recently conducted GWAS. GWAS aim to identify regions in the genome that are associated with a given disease. The support vector machine (SVM) provides a discriminative alternative to the widely used chi-square statistic in GWAS. A novel hybrid strategy that combines the chi-square statistic with the SVM was developed and implemented. Its performance was studied on simulated data and the Wellcome Trust Case Control Consortium (WTCCC) studies. Results presented in this thesis show that the hybrid strategy ranks causal SNPs in simulated data significantly higher than the chi-square test and SVM alone. The results also show that the hybrid strategy ranks previously replicated SNPs and associated regions (where applicable) of type 1 diabetes, rheumatoid arthritis, and Crohn's disease higher than the chi-square, SVM, and SVM Recursive Feature Elimination (SVM-RFE).