#### Title

Pattern discovery in trees : algorithms and applications to document and scientific data management

#### Document Type

Dissertation

#### Date of Award

Spring 5-31-1999

#### Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

#### Department

Computer and Information Science

#### First Advisor

Jason T. L. Wang

#### Second Advisor

James A. McHugh

#### Third Advisor

James M. Calvin

#### Fourth Advisor

Pengcheng Shi

#### Fifth Advisor

Frank C.D. Tsai

#### Abstract

Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing.

In this dissertation, we present algorithms for finding patterns in ordered labeled trees. Specifically, we study the largest approximately common substructure (LACS) problem for such trees. We consider a substructure of a tree T to be a connected subgraph of T. Given two trees T_{1}, T_{2} and an integer d, the LACS problem is to find a substructure U_{1} of T_{1} and a substructure U_{2} of T_{2} such that U_{1} is within distance d of U_{2} and no other pair of substructures V_{1} of T_{1} and V_{2} of T_{2} satisfies the distance constraint with a larger combined size, that is, with the sum of the sizes of V_{1} and V_{2} greater than the sum of the sizes of U_{1} and U_{2}. The LACS problem is motivated by studies of document and RNA comparison.
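The problem statement above can be written compactly as an optimization problem. The notation here is an assumption introduced for illustration and does not come from the dissertation: \sqsubseteq stands for "is a connected subgraph of", |U| for the size of U (taken here to be its number of nodes), and \delta for whichever tree distance is in use.

```latex
\max_{U_1,\, U_2} \; |U_1| + |U_2|
\quad \text{subject to} \quad
U_1 \sqsubseteq T_1, \quad
U_2 \sqsubseteq T_2, \quad
\delta(U_1, U_2) \le d
```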

We consider two types of distance measures: the general edit distance and a restricted edit distance originating from Selkow. We present dynamic programming algorithms to solve the LACS problem based on the two distance measures. When the distance allowed in the common substructures is a constant independent of the input trees, the algorithms run as fast as the best known algorithms for computing the distance between two trees. To demonstrate the utility of our algorithms, we discuss their applications to discovering motifs in multiple RNA secondary structures.

This application is an example of scientific data mining. We represent an RNA secondary structure by an ordered labeled tree based on a previously proposed scheme. The patterns in the trees are substructures that can differ in both substitutions and deletions/insertions of nodes of the trees. Our techniques incorporate approximate tree matching algorithms and novel heuristics for discovery and optimization. Experimental results obtained by running these algorithms on both generated data and RNA secondary structures show that the algorithms perform well. The optimization heuristics speed up the discovery algorithm by a factor of 10. Moreover, our optimized approach is 100,000 times faster than the brute force method.

Finally, we implement our techniques in a graphical toolbox that enables users to find repeated substructures in an RNA secondary structure as well as frequently occurring patterns in multiple RNA secondary structures pertaining to rhinovirus obtained from the National Cancer Institute. The system is implemented in the C programming language with the X Window System and is fully operational on SUN workstations.
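For readers unfamiliar with the restricted edit distance attributed to Selkow in the abstract, the following is a minimal sketch of the commonly described form of that distance, not the dissertation's own algorithm. It assumes unit costs, always matches the two roots (relabeling if needed), and allows insertion or deletion only of whole subtrees. The names `Node`, `size`, and `selkow_distance` are hypothetical, and the code favors readability over speed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """An ordered, labeled tree node; children are kept left to right."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def selkow_distance(a: Node, b: Node) -> int:
    """Selkow-style distance with unit costs (an illustrative assumption).

    Roots are matched; a whole subtree is inserted or deleted at a cost
    equal to its node count; matched children are compared recursively.
    """
    root_cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # dp[i][j]: cost of aligning the first i children of a
    # with the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),      # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),      # insert subtree
                dp[i - 1][j - 1]
                + selkow_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + dp[m][n]


if __name__ == "__main__":
    t1 = Node("a", [Node("b"), Node("c", [Node("d")])])
    t2 = Node("a", [Node("b"), Node("c")])
    print(selkow_distance(t1, t2))  # prints 1: delete the leaf "d"
```

The general edit distance mentioned in the abstract is less restrictive, since it also permits deleting or inserting an internal node while keeping its children; this sketch does not cover that case.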

#### Recommended Citation

Chang, Chia-Yo, "Pattern discovery in trees : algorithms and applications to document and scientific data management" (1999). *Dissertations*. 978.

https://digitalcommons.njit.edu/dissertations/978