Document Type
Dissertation
Date of Award
Fall 1-31-2005
Degree Name
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Department
Computer Science
First Advisor
Jason T. L. Wang
Second Advisor
James A. McHugh
Third Advisor
Frank Y. Shih
Fourth Advisor
Qun Ma
Fifth Advisor
Katherine Grace Herbert
Abstract
Frequent structure mining (FSM) aims to discover and extract patterns frequently occurring in structural data such as trees and graphs. FSM finds many applications in bioinformatics, XML processing, Web log analysis, and so on. In this thesis, two new FSM techniques are proposed for finding patterns in unordered labeled trees. Such trees can be used to model evolutionary histories of different species, among others.
The first FSM technique finds cousin pairs in the trees. A cousin pair is a pair of nodes sharing the same parent, the same grandparent, or the same great-grandparent, etc. Given a tree T, our algorithm finds all interesting cousin pairs of T in O(|T|2) time where |T| is the number of nodes in T. Experimental results on synthetic data and phylogenies show the scalability and effectiveness of the proposed technique. This technique has been applied to locating co-occurring patterns in multiple evolutionary trees, evaluating the consensus of equally parsimonious trees, and finding kernel trees of groups of phylogenies. The technique is also extended to undirected acyclic graphs (or free trees).
The second FSM technique extends traditional MAST (maximum agreement subtree) algorithms by employing the Apriori data mining technique to find frequent agreement subtrees in multiple phylogenies. The correctness and completeness of the new mining algorithm are presented. The method is also extended to unrooted phylogenetic trees.
Both FSM techniques studied in the thesis have been implemented into a toolkit, which is fully operational and accessible on the World Wide Web.
Recommended Citation
Zhang, Sen, "Pattern discovery in structural databases with applications to bioinformatics" (2005). Dissertations. 687.
https://digitalcommons.njit.edu/dissertations/687