Document Type
Dissertation
Date of Award
12-31-2025
Degree Name
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Department
Computer Science
First Advisor
Zhi Wei
Second Advisor
Guiling Wang
Third Advisor
Ioannis Koutis
Fourth Advisor
Pan Xu
Fifth Advisor
Nan Gao
Abstract
Semi-supervised learning has emerged as a powerful paradigm for analyzing single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data, where full annotation is often costly or impractical. scRNA-seq technologies measure the expression of thousands of genes across tens of thousands of cells, whereas ST additionally captures the spatial coordinates of gene expression within intact tissue sections. Annotation is a key step in both scRNA-seq and ST analysis pipelines, aiming to identify cell types, spatial domains, and latent biological structures. However, most existing annotation approaches rely on separate clustering methods that are typically fully unsupervised and fail to leverage side information available in real-world experiments, such as partial labels or known marker genes. This dissertation focuses on developing semi-supervised learning frameworks that incorporate minimal supervision to enhance the interpretability, robustness, and accuracy of scRNA-seq and ST analyses.
First, a semi-supervised active learning framework is proposed for annotating scRNA-seq data. By iteratively selecting and labeling the most informative cells, the model achieves superior annotation performance to conventional unsupervised methods while using only a few labeled samples. This framework uses labels efficiently and shows strong potential for practical application in resource-constrained biological studies. Second, for spatial transcriptomics, a method called MGGNN (Marker Gene-Guided Graph Neural Network) is introduced. MGGNN constructs a spatial graph of tissue spots and learns representations through a self-supervised contrastive learning strategy, followed by fine-tuning with a small number of labels derived from marker genes. It achieves state-of-the-art annotation results on both simulated and real ST datasets and maintains robustness under noisy supervision. Third, a model named SCDRL (Semi-Supervised Disentangled Representation Learning) is developed for scRNA-seq data. SCDRL separates latent representations into interpretable components, such as cell type, batch, and biological signature, while isolating residual variation. With only 5% labeled data, SCDRL consistently outperforms existing methods on multiple benchmark datasets by producing disentangled and biologically meaningful latent spaces. Collectively, these methods demonstrate the versatility and effectiveness of semi-supervised learning in scRNA-seq and ST data analysis. They provide scalable and interpretable solutions that bridge domain knowledge with deep learning, offering practical utility for uncovering cellular heterogeneity under realistic, annotation-scarce conditions.
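As an illustration of the active learning step ("iteratively selecting and labeling the most informative cells"), the following is a minimal sketch of one query round. The abstract does not state which selection criterion the dissertation uses; predictive entropy is assumed here purely as a common choice, and the function names and toy probabilities are hypothetical.

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of each row of class probabilities (cells x cell types)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def select_most_uncertain(probs, labeled_mask, k):
    """Return indices of up to k unlabeled cells with the highest entropy.

    Assumption: entropy stands in for "informativeness"; the dissertation
    may use a different criterion.
    """
    entropy = prediction_entropy(probs)
    # Already-labeled cells are pushed to the end so they are never re-queried.
    entropy = np.where(labeled_mask, -np.inf, entropy)
    order = np.argsort(-entropy, kind="stable")
    n_unlabeled = int((~labeled_mask).sum())
    return order[:min(k, n_unlabeled)]

# Toy classifier output for four cells and three candidate cell types.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [0.34, 0.33, 0.33],   # nearly uniform: most uncertain
    [0.60, 0.30, 0.10],   # moderately uncertain
    [0.50, 0.50, 0.00],   # uncertain, but already labeled
])
labeled_mask = np.array([False, False, False, True])

query = select_most_uncertain(probs, labeled_mask, k=2)
print(query)  # cells to send for annotation in this round
```

In a full loop, the queried cells would be labeled, added to the training set, and the classifier retrained before the next round.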
Recommended Citation
Liu, Haoran, "Semi-supervised learning for annotation and representation of single-cell RNA sequencing and spatial transcriptomics data" (2025). Dissertations. 1865.
https://digitalcommons.njit.edu/dissertations/1865
