Document Type
Thesis
Date of Award
Spring 5-31-2012
Degree Name
Master of Science in Bioinformatics - (M.S.)
Department
Computer Science
First Advisor
Usman W. Roshan
Second Advisor
Jason T. L. Wang
Third Advisor
Zhi Wei
Abstract
Genome-wide association studies (GWAS) search for correlations between single nucleotide polymorphisms (SNPs) in a subject genome and an observed phenotype. GWAS can be used to generate models that predict phenotype from genotype and to help identify specific genes affecting the biological mechanism underlying the phenotype.
In this investigation, phenotype prediction models are constructed from GWAS training data and evaluated on test data. Three methods are used to rank SNPs by their correlation with the phenotype: the univariate Wald test, a multivariate technique based on the support vector machine (SVM), and a hybrid method in which a subset of top-ranked SNPs from the Wald test is used to train the SVM. Both case-control studies and quantitative phenotypes are examined. For each method and data set, a series of least squares linear regression models is generated from nested subsets of the top-ranked SNPs under that method. The accuracy of these models is determined on a test data set, and prediction performance is plotted against the number of top-ranked SNPs considered.
The SVM and hybrid methods are found to be consistently superior to the Wald test in ranking predictive SNPs. The hybrid method also allows the trade-off between higher accuracy and fewer SNPs to be tuned as desired.
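The rank-then-fit procedure described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the genotype data are synthetic, the function names `wald_scores` and `nested_model_errors` are hypothetical, the Wald statistic is taken from a univariate linear fit of a quantitative phenotype (a case-control study would more likely use a logistic model), mean squared error stands in for the accuracy measure, and the SVM and hybrid rankings are omitted.

```python
import numpy as np

def wald_scores(X, y):
    """Squared Wald statistic per SNP from a univariate linear fit y ~ a + b*x.

    Assumption: quantitative phenotype and ordinary least squares per SNP.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                  # center each SNP column
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)
    sxx = np.where(sxx == 0, np.nan, sxx)    # monomorphic SNPs get no score
    beta = Xc.T @ yc / sxx                   # per-SNP slope estimate
    resid = yc[:, None] - Xc * beta          # per-SNP residuals
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)
    se = np.sqrt(sigma2 / sxx)               # standard error of each slope
    return np.nan_to_num((beta / se) ** 2, nan=0.0)

def nested_model_errors(X_train, y_train, X_test, y_test, ranking, sizes):
    """Fit least squares models on the top-k ranked SNPs; return test MSE per k."""
    errors = []
    for k in sizes:
        cols = ranking[:k]
        A_train = np.column_stack([np.ones(len(y_train)), X_train[:, cols]])
        A_test = np.column_stack([np.ones(len(y_test)), X_test[:, cols]])
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        pred = A_test @ coef
        errors.append(float(np.mean((pred - y_test) ** 2)))
    return errors

# Synthetic example: genotypes coded 0/1/2, three hypothetical causal SNPs.
rng = np.random.default_rng(0)
n_train, n_test, n_snps = 200, 100, 50
X = rng.integers(0, 3, size=(n_train + n_test, n_snps)).astype(float)
causal = [3, 17, 42]
effects = np.array([1.0, -1.0, 0.8])
y = X[:, causal] @ effects + rng.normal(0.0, 0.5, size=n_train + n_test)

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Rank SNPs on training data only, then evaluate nested models on test data.
ranking = np.argsort(-wald_scores(X_train, y_train))
sizes = [1, 2, 3, 5, 10, 20]
errors = nested_model_errors(X_train, y_train, X_test, y_test, ranking, sizes)

for k, err in zip(sizes, errors):
    print(f"top {k:2d} SNPs: test MSE = {err:.3f}")
```

Plotting `errors` against `sizes` gives the kind of performance curve the abstract refers to. Replacing `wald_scores` with another scoring function would allow different ranking methods to be compared on the same nested-model evaluation.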
Recommended Citation
Roberts, Andrew, "Phenotype prediction and feature selection in genome-wide association studies" (2012). Theses. 130.
https://digitalcommons.njit.edu/theses/130