Document Type


Date of Award

Spring 5-31-2014

Degree Name

Master of Science in Bioinformatics - (M.S.)


Computer Science

First Advisor

Usman W. Roshan

Second Advisor

Jason T. L. Wang

Third Advisor

Zhi Wei


Genome-wide association studies (GWAS) are widely used with various machine learning algorithms to predict disease risk. This thesis investigates both this established approach, which uses single nucleotide polymorphism (SNP) genotype data, and a novel approach to disease risk prediction based on whole exome sequencing data, namely the Whole Exome Wide Association Study (WEWAS). It applies a discriminative machine learning algorithm, the Support Vector Machine (SVM), with different kernel functions. Initially, only SNPs generated by genotyping technology, which focuses more on common variants, are used for disease prediction. Later, whole exome data generated by Next Generation Sequencing (NGS) technology is used in the prediction. A further distinction between traditional GWAS and WEWAS is that WEWAS uses insertions and deletions in the genomic sequence (INDELs) together with SNPs as features for prediction. The latter approach achieves a substantial improvement in prediction accuracy. Its success shows that NGS data contains valuable information that the SNP genotyping method is unable to capture.


