Cross-validation and cross-study validation of chronic lymphocytic leukemia with exome sequences and machine learning

Document Type

Conference Proceeding

Publication Date

12-16-2015

Abstract

The era of genomics brings the potential of better DNA-based risk prediction and treatment. While genome-wide association studies have been studied extensively for risk prediction, the potential of using whole exome data for this purpose is unclear. We explore this problem for chronic lymphocytic leukemia, for which one of the largest whole exome datasets, with 186 cases and 169 controls, is available from the NIH dbGaP database. We perform a standard next-generation sequencing procedure to obtain SNP variants on 153 cases and 144 controls after excluding samples with missing data. To evaluate their predictive power, we first conduct a cross-validation study with 50% training and 50% test splits on the full dataset, using the support vector machine as the classifier. There we obtain a mean accuracy of 82% with the top 20 SNPs ranked by the Pearson correlation coefficient. We then perform a cross-study validation on cases and controls from an external lymphoma study and on controls only from head and neck cancer and breast cancer studies (all obtained from NIH dbGaP). On the external dataset we obtain an accuracy of 70% with the top-ranked SNPs obtained from the original dataset. We also find that our top Pearson-ranked SNPs lie on genes previously implicated in this disease. Our study shows that even with a small sample size we can obtain moderate to high accuracy with exome sequences, which is encouraging for future work.
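The evaluation described in the abstract (rank SNPs by Pearson correlation with case/control status, keep the top 20, classify on a 50/50 split, and average the accuracy) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the 0/1/2 genotype encoding, the choice to rank on the training half only, the number of repeated splits, and the nearest-centroid classifier (a simple stand-in for the support vector machine named in the abstract) are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_snps(genotypes, labels, top_k=20):
    """Indices of the top_k SNP columns by |Pearson r| with the labels."""
    scores = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # ties broken by column index
    return [j for _, j in scores[:top_k]]


def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (assumption): NOT the SVM used in the study."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in X_test]


def evaluate_split(genotypes, labels, fit_predict, top_k, rng):
    """Accuracy on one random 50% training / 50% test split."""
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    X_train = [genotypes[i] for i in train]
    y_train = [labels[i] for i in train]
    # Assumption: SNPs are ranked on the training half only.
    selected = rank_snps(X_train, y_train, top_k)

    def keep_selected(rows):
        return [[row[j] for j in selected] for row in rows]

    X_test = keep_selected([genotypes[i] for i in test])
    predictions = fit_predict(keep_selected(X_train), y_train, X_test)
    correct = sum(p == labels[i] for p, i in zip(predictions, test))
    return correct / len(test)


def mean_accuracy(genotypes, labels, fit_predict, top_k=20, n_splits=10, seed=0):
    """Mean accuracy over repeated random splits (n_splits is an assumption)."""
    rng = random.Random(seed)
    accs = [evaluate_split(genotypes, labels, fit_predict, top_k, rng)
            for _ in range(n_splits)]
    return sum(accs) / len(accs)
```

Here `genotypes` is a list of per-sample rows and `labels` holds 0 for controls and 1 for cases. Any callable with the signature `fit_predict(X_train, y_train, X_test)` can be passed in place of `nearest_centroid`, so an SVM implementation could be substituted to match the abstract more closely. A cross-study check in the same spirit would rank and fit on the original dataset and predict on the external samples.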

Identifier

84962348905 (Scopus)

ISBN

[9781467367981]

Publication Title

Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015

External Full Text Location

https://doi.org/10.1109/BIBM.2015.7359878

First Page

1367

Last Page

1374

