Document Type

Dissertation

Date of Award

12-31-2025

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer Science

First Advisor

Zhi Wei

Second Advisor

Guiling Wang

Third Advisor

Ioannis Koutis

Fourth Advisor

Pan Xu

Fifth Advisor

Nan Gao

Abstract

Semi-supervised learning has emerged as a powerful paradigm for analyzing single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) data, where full annotation is often costly or impractical. scRNA-seq technologies measure the expression of thousands of genes across tens of thousands of cells, whereas ST additionally captures the spatial coordinates of gene expression within intact tissue sections. Annotation is a key step in both scRNA-seq and ST analysis pipelines, aiming to identify cell types, spatial domains, and latent biological structures. However, most existing annotation approaches rely on separate clustering methods that are typically fully unsupervised and fail to leverage side information available in real-world experiments, such as partial labels or known marker genes. This dissertation focuses on developing semi-supervised learning frameworks that incorporate minimal supervision to enhance the interpretability, robustness, and accuracy of scRNA-seq and ST analyses.

First, a semi-supervised active learning framework is proposed for annotating scRNA-seq data. By iteratively selecting and labeling the most informative cells, the model achieves superior annotation performance with only a few labeled samples compared with conventional unsupervised methods. This framework demonstrates efficient label utilization and strong potential for practical applications in resource-constrained biological studies. For spatial transcriptomics, a method called MGGNN (Marker Gene-Guided Graph Neural Network) is introduced. MGGNN constructs a spatial graph of tissue spots and learns representations through a self-supervised contrastive learning strategy, followed by fine-tuning with a small number of marker-gene-derived labels. It achieves state-of-the-art annotation results on both simulated and real ST datasets and maintains robustness under noisy supervision. Furthermore, a model named SCDRL (Semi-Supervised Disentangled Representation Learning) is developed for scRNA-seq data. SCDRL separates latent representations into interpretable components (such as cell type, batch, and biological signature) while isolating residual variation. With only 5% labeled data, SCDRL consistently outperforms existing methods on multiple benchmark datasets by producing disentangled and biologically meaningful latent spaces. Collectively, these methods demonstrate the versatility and effectiveness of semi-supervised learning in scRNA-seq and ST data analysis. They provide scalable and interpretable solutions that bridge domain knowledge with deep learning, offering practical utility for uncovering cellular heterogeneity under realistic, annotation-scarce conditions.
