Document Type

Dissertation

Date of Award

5-31-2023

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer Science

First Advisor

Ioannis Koutis

Second Advisor

Baruch Schieber

Third Advisor

Zhi Wei

Fourth Advisor

Senjuti Basu Roy

Fifth Advisor

Alexandros Tzatsos

Abstract

High-throughput technologies such as DNA microarrays and RNA-seq measure the expression levels of large numbers of genes simultaneously. To support the extraction of biological knowledge, individual gene expression levels are transformed into Gene Co-expression Networks (GCNs), which are then analyzed to discover gene modules. GCN construction and analysis have been studied for nearly two decades. Although new types of sequencing and the corresponding data are now available, the software package WGCNA and its most recent variants remain widely used and continue to contribute to biological discovery.

The discovery of biologically significant modules of genes from raw expression data is an atypical unsupervised problem: although there are no training data to drive the computational discovery of modules, the biological significance of the discovered modules can be evaluated with the widely used module enrichment metric, which measures the statistical significance of the occurrence of Gene Ontology terms within the computed modules. WGCNA and related methods are entirely heuristic and do not leverage this atypical nature of the underlying unsupervised problem.
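
As an illustration of the enrichment idea described above, the following minimal Python sketch computes an upper-tail hypergeometric p-value for one Gene Ontology term in one module. The hypergeometric test is a common way to score over-representation and is used here as an assumption; it is not necessarily the exact statistic used in the dissertation. The function name and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population_size, annotated_in_population,
                       module_size, annotated_in_module):
    """Upper-tail hypergeometric probability P(X >= annotated_in_module).

    population_size:         total number of genes considered
    annotated_in_population: genes in the population carrying the GO term
    module_size:             number of genes in the module
    annotated_in_module:     genes in the module carrying the GO term
    """
    # Number of ways to draw a module of this size from the population.
    total = comb(population_size, module_size)
    # The overlap cannot exceed either the module size or the annotated set.
    upper = min(annotated_in_population, module_size)
    # Count the draws with at least the observed number of annotated genes.
    tail = sum(
        comb(annotated_in_population, i)
        * comb(population_size - annotated_in_population, module_size - i)
        for i in range(annotated_in_module, upper + 1)
    )
    return tail / total
```

A smaller p-value indicates that the term occurs in the module more often than expected by chance. In practice such p-values are computed for many terms and modules, so a multiple-testing correction would normally be applied as well.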

The main contribution of this thesis is SGCP, a novel Self-Training Gene Clustering Pipeline for discovering modules of genes from raw expression data. SGCP almost entirely replaces the steps followed by existing methods, based on recent progress in mathematically justified unsupervised clustering algorithms. It also introduces a conceptually novel self-training step that leverages Gene Ontology information to modify and improve the set of modules computed by the unsupervised algorithm.

SGCP is tested on a rich set of DNA microarray and RNA-seq benchmarks from various organisms. These tests show that SGCP greatly outperforms all previous methods, producing highly enriched modules. Furthermore, these modules are often quite dissimilar from those computed by previous methods, suggesting that SGCP can become an auxiliary tool for extracting biological knowledge. To this end, SGCP is implemented as an easy-to-use R package available on Bioconductor.

Recommended Citation

Aghaieabiane, Niloofar, "Machine learning and network embedding methods for gene co-expression networks" (2023). Dissertations. 1658.
https://digitalcommons.njit.edu/dissertations/1658

Included in

Artificial Intelligence and Robotics Commons, Computational Biology Commons, Numerical Analysis and Scientific Computing Commons
