Faculty Publications

Knowledge discovering for document classification using tree matching in TEXPROS

Ching Song Wei, Department of Computer Science
Qianhong Liu, Department of Computer Science
Jason T.L. Wang, Department of Computer Science
Peter A. Ng, Department of Computer Science

Document Type

Article

Publication Date

1-1-1997

Abstract

This paper describes a knowledge-based system for classifying documents based upon the layout structure and conceptual information extracted from their contents. The spatial elements in a document are laid out in rectangular blocks, which are represented by nodes in an ordered labeled tree called the "Layout Structure Tree" (L-S Tree). Each leaf node of an L-S Tree points to its corresponding block content. A Knowledge Acquisition Tool (KAT) is devised to perform inductive learning from the L-S Trees of document samples and then generate the Document Sample Tree and Document Type Tree bases. A testing document is classified if a Document Type Tree is discovered as a substructure of the L-S Tree of the testing document. The L-S Tree is then matched against the Document Sample Trees of the classified document type to find the format of the testing document. Together, the Document Sample Trees and Document Type Trees are called the Structural Knowledge Base (SKB). The tree discovering and matching processes compare the SKB trees with a testing document's L-S Tree by using pattern matching and discovering toolkits. Our experimental results demonstrate that many office documents can be classified correctly using the proposed approach. © Elsevier Science Inc. 1997.
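
The classification step described in the abstract (a document is assigned a type when a Document Type Tree is found as a substructure of its L-S Tree) can be illustrated with a minimal Python sketch. This is not the paper's algorithm or its toolkits: it assumes a simplified exact match in which a pattern node matches a tree node when the labels are equal and the pattern's children match, in order, a subsequence of that node's children. The names `Node`, `matches_at`, `is_substructure`, and `classify`, and the block labels such as "header" and "body", are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of an ordered labeled tree (hypothetical stand-in for an L-S Tree node)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def matches_at(pattern: Node, node: Node) -> bool:
    """True if `pattern` matches with its root placed on `node`.

    Assumption: labels must be equal, and the pattern's children must match
    an in-order subsequence of the node's children (extra blocks are allowed).
    """
    if pattern.label != node.label:
        return False
    matched = 0
    for child in node.children:
        if matched < len(pattern.children) and matches_at(pattern.children[matched], child):
            matched += 1
    return matched == len(pattern.children)


def is_substructure(pattern: Node, tree: Node) -> bool:
    """True if `pattern` matches at the root of `tree` or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(is_substructure(pattern, child) for child in tree.children)


def classify(ls_tree: Node, type_trees: Dict[str, Node]) -> Optional[str]:
    """Return the first document type whose tree is a substructure, else None."""
    for type_name, type_tree in type_trees.items():
        if is_substructure(type_tree, ls_tree):
            return type_name
    return None


# Hypothetical Document Type Trees; the labels are illustrative only.
type_trees = {
    "letter": Node("page", [Node("letterhead"), Node("salutation"), Node("body")]),
    "memo": Node("page", [Node("header"), Node("body")]),
}

# Hypothetical L-S Tree of a testing document with extra blocks.
doc = Node("page", [Node("header"), Node("date"), Node("body"), Node("signature")])

print(classify(doc, type_trees))
```

In this sketch the sibling order matters, so a page whose "body" block precedes its "header" block would not be classified as a memo. Approximate matching, the second-stage comparison against Document Sample Trees, and the use of conceptual information from block contents are all left out.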

Identifier

0031206983 (Scopus)

Publication Title

Information Sciences

External Full Text Location

https://doi.org/10.1016/S0020-0255(97)00048-0

ISSN

00200255

First Page

255

Last Page

310

Issue

1-4

Volume

100

Recommended Citation

Wei, Ching Song; Liu, Qianhong; Wang, Jason T.L.; and Ng, Peter A., "Knowledge discovering for document classification using tree matching in TEXPROS" (1997). Faculty Publications. 16837.
https://digitalcommons.njit.edu/fac_pubs/16837

DOI

10.1016/S0020-0255(97)00048-0
