Dissertations

Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis

Tian Tian, New Jersey Institute of Technology

Document Type

Dissertation

Date of Award

5-31-2019

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer Science

First Advisor

Zhi Wei

Second Advisor

Guiling Wang

Third Advisor

Usman W. Roshan

Fourth Advisor

Wenge Guo

Fifth Advisor

Renqiang Min

Sixth Advisor

Zhigen Zhao

Abstract

Deep learning techniques have achieved tremendous successes in a wide range of real applications in recent years. For dimension reduction, deep neural networks (DNNs) provide a natural choice to parameterize a non-linear transforming function that maps the original high dimensional data to a lower dimensional latent space. Autoencoder is a kind of DNNs used to learn efficient feature representation in an unsupervised manner. Deep autoencoder has been widely explored and applied to analysis of continuous data, while it is understudied for characterizing discrete data. This dissertation focuses on developing model-based deep autoencoders for modeling discrete data. A motivating example of discrete data is the count data matrix generated by single-cell RNA sequencing (scRNA-seq) technology which is widely used in biological and medical fields. scRNA-seq promises to provide higher resolution of cellular differences than bulk RNA sequencing and has helped researchers to better understand complex biological questions. The recent advances in sequencing technology have enabled a dramatic increase in the throughput to thousands of cells for scRNA-seq. However, analysis of scRNA-seq data remains a statistical and computational challenge. A major problem is the pervasive dropout events obscuring the discrete matrix with prevailing 'false' zero count observations, which is caused by the shallow sequencing depth per cell. To make downstream analysis more effective, imputation, which recovers the missing values, is often conducted as the first step in preprocessing scRNA-seq data. Several imputation methods have been proposed. Of note is a deep autoencoder model, which proposes to explicitly characterize the count distribution, over-dispersion, and sparsity of scRNA-seq data using a zero-inflated negative binomial (ZINB) model. This dissertation introduces a model-based deep learning clustering model ? scDeepCluster for clustering analysis of scRNA-seq data. The scDeepCluster is a deep autoencoder which simultaneously learns feature representation and clustering via explicit modeling of scRNA-seq data generation using the ZINB model. Based on testing extensive simulated datasets and real datasets from different representative single-cell sequencing platforms, scDeepCluster outperformed several state-of-the-art methods under various clustering performance metrics and exhibited improved scalability, with running time increasing linearly with the sample size. Although this model-based deep autoencoder approach has demonstrated superior performance, it is over-permissive in defining ZINB model space, which can lead to an unidentifiable model and make results unstable. Next, this dissertation proposes to impose a regularization that takes dropout events into account. The regularization uses a differentiable categorical distribution - Gumbel-Softmax to explicitly model the dropout events, and minimizes the Maximum Mean Discrepancy (MMD) between the reconstructed randomly masked matrix and the raw count matrix. Imputation analyses showed that the proposed regularized model-based autoencoder significantly outperformed the vanilla model-based deep autoencoder.

Recommended Citation

Tian, Tian, "Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis" (2019). Dissertations. 1690.
https://digitalcommons.njit.edu/dissertations/1690

Download

Included in

Bioinformatics Commons, Computer Sciences Commons

COinS

Dissertations

Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis

Document Type

Date of Award

Degree Name

Department

First Advisor

Second Advisor

Third Advisor

Fourth Advisor

Fifth Advisor

Sixth Advisor

Abstract

Recommended Citation

Included in

Search

Browse

Author Corner

Links

Dissertations

Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis

Author

Document Type

Date of Award

Degree Name

Department

First Advisor

Second Advisor

Third Advisor

Fourth Advisor

Fifth Advisor

Sixth Advisor

Abstract

Recommended Citation

Included in

Share

Search

Browse

Author Corner

Links