Document Type

Dissertation

Date of Award

Spring 5-31-1999

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer and Information Science

First Advisor

Jason T. L. Wang

Second Advisor

James A. McHugh

Third Advisor

James M. Calvin

Fourth Advisor

Pengcheng Shi

Fifth Advisor

Frank C.D. Tsai

Abstract

Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing.

In this dissertation we present algorithms for finding patterns in ordered labeled trees. Specifically, we study the largest approximately common substructure (LACS) problem for such trees. We consider a substructure of a tree T to be a connected subgraph of T. Given two trees T₁ and T₂ and an integer d, the LACS problem is to find a substructure U₁ of T₁ and a substructure U₂ of T₂ such that U₁ is within distance d of U₂ and no other pair of substructures V₁ of T₁ and V₂ of T₂ satisfies the distance constraint with a larger combined size, that is, with the sum of the sizes of V₁ and V₂ greater than the sum of the sizes of U₁ and U₂. The LACS problem is motivated by studies of document and RNA comparison.
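
The definition above can be restated compactly as an optimization problem. In this sketch, δ stands for whichever tree distance is in use and |U| for the size of U, assumed here to be its number of nodes; both symbols are introduced for illustration and are not taken from the dissertation. U₁ and U₂ range over the connected substructures of T₁ and T₂.

```latex
\max_{U_1 \subseteq T_1,\; U_2 \subseteq T_2} \; |U_1| + |U_2|
\quad \text{subject to} \quad \delta(U_1, U_2) \le d
```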

We consider two types of distance measures: the general edit distance and a restricted edit distance originating with Selkow. We present dynamic programming algorithms that solve the LACS problem under each of the two distance measures. When the distance allowed in the common substructures is a constant independent of the input trees, the algorithms run as fast as the best known algorithms for computing the distance between two trees. To demonstrate the utility of our algorithms, we discuss their application to discovering motifs in multiple RNA secondary structures.
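
To make the restricted distance concrete, the following is a minimal sketch of a Selkow-style top-down edit distance between two ordered labeled trees, assuming unit costs for relabeling, inserting, and deleting a node. In this model a node can only be removed or added together with its whole subtree, so the recursion reduces to a sequence edit distance over the children of matched nodes. The names `Node`, `size`, and `selkow_distance` are hypothetical; this is a textbook-style formulation of the distance only, not the dissertation's LACS algorithm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered labeled tree; children are kept left to right."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def selkow_distance(a: Node, b: Node) -> int:
    """Unit-cost top-down edit distance (assumed cost model).

    The roots are always matched (relabel cost 0 or 1). The child
    subtrees are then aligned like two strings, where deleting or
    inserting a child costs the size of its whole subtree.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # dp[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ca, cb = a.children[i - 1], b.children[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + size(ca),                    # delete subtree ca
                dp[i][j - 1] + size(cb),                    # insert subtree cb
                dp[i - 1][j - 1] + selkow_distance(ca, cb)  # match ca with cb
            )
    return relabel + dp[m][n]


# Example trees: t1 = a(b, c(d)) and t2 = a(b, c)
t1 = Node("a", [Node("b"), Node("c", [Node("d")])])
t2 = Node("a", [Node("b"), Node("c")])
```

Here `selkow_distance(t1, t2)` evaluates to 1, since deleting the leaf `d` is the only edit needed. The sketch recomputes subtree sizes and sub-distances for clarity; a practical implementation would memoize them.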

This application is an example of scientific data mining. We represent an RNA secondary structure by an ordered labeled tree based on a previously proposed scheme. The patterns in the trees are substructures that can differ in both substitutions and deletions/insertions of tree nodes. Our techniques incorporate approximate tree matching algorithms and novel heuristics for discovery and optimization. Experimental results obtained by running these algorithms on both generated data and RNA secondary structures show that the algorithms perform well: the optimization heuristics speed up the discovery algorithm by a factor of 10, and our optimized approach is 100,000 times faster than the brute-force method.

Finally, we implement our techniques in a graphical toolbox that enables users to find repeated substructures in an RNA secondary structure as well as frequently occurring patterns in multiple RNA secondary structures pertaining to rhinovirus obtained from the National Cancer Institute. The system is implemented in the C programming language with X Windows and is fully operational on SUN workstations.

Recommended Citation

Chang, Chia-Yo, "Pattern discovery in trees : algorithms and applications to document and scientific data management" (1999). Dissertations. 978.
https://digitalcommons.njit.edu/dissertations/978

Included in

Computer Sciences Commons