Document Type


Date of Award

Spring 5-31-2014

Degree Name

Master of Science in Bioinformatics - (M.S.)


Computer Science

First Advisor

Usman W. Roshan

Second Advisor

Jason T. L. Wang

Third Advisor

Zhi Wei


Genome-wide association studies (GWAS) are widely used with various machine learning algorithms to predict disease risk. This thesis investigates this established approach using Single Nucleotide Polymorphism (SNP) genotype data, alongside a novel approach to disease risk prediction based on whole exome sequencing data, namely the Whole Exome Wide Association Study (WEWAS). It applies a discriminative machine learning algorithm, the Support Vector Machine (SVM), with different kernel functions. Initially, only SNPs generated using genotyping technology, which focuses mainly on common variants, are used for disease prediction. Later, whole exome data generated using Next Generation Sequencing (NGS) technology is used in the prediction. Another distinction between traditional GWAS and WEWAS is that the latter uses insertions and deletions in the genomic sequence (INDELs) together with SNPs as features for prediction. A substantial improvement in prediction accuracy is achieved using the latter approach. The success of using NGS data shows that it contains valuable information that the SNP genotyping method is unable to capture.
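As an illustration only, the idea of combining SNP and INDEL genotypes into one feature vector and comparing samples with different kernel functions might be sketched as follows. The abstract does not specify the encoding or the kernel parameters, so everything here is an assumption: the additive 0/1/2 allele-count coding, the VCF-style genotype strings, the helper names (`encode_sample`, `linear_kernel`, `polynomial_kernel`, `rbf_kernel`), the parameter values, and the toy samples are all hypothetical and are not taken from the thesis.

```python
import math

# Assumed additive coding: count of non-reference alleles at each variant site.
# The thesis abstract does not state which encoding was actually used.
GENOTYPE_CODES = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2}


def encode_sample(snp_calls, indel_calls):
    """Concatenate SNP and INDEL genotype calls into one numeric feature vector."""
    return [GENOTYPE_CODES[call] for call in list(snp_calls) + list(indel_calls)]


def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree; parameters are illustrative."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2); gamma is illustrative."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical samples: three SNP sites followed by one INDEL site.
case = encode_sample(["0/1", "1/1", "0/0"], ["0/1"])
control = encode_sample(["0/0", "0/1", "0/0"], ["0/0"])

print(case, control)
print(linear_kernel(case, control))
print(polynomial_kernel(case, control))
print(rbf_kernel(case, control))
```

An SVM would use such kernel values as pairwise similarities between samples when fitting its decision boundary. Under this assumed encoding, adding INDEL sites simply lengthens the feature vector.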