Faculty Publications

XML clustering and retrieval through principal component analysis

Jason T.L. Wang, Department of Computer Science
Jianghui Liu, Department of Computer Science
Junhan Wang, Department of Computer Science

Document Type

Article

Publication Date

8-1-2005

Abstract

XML is increasingly important in data exchange and information management. A great deal of effort has been spent on developing efficient techniques for storing, querying, indexing and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled as ordered labeled trees, and transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA makes it possible to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. We also discuss an extension of our techniques to XML retrieval. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques. © World Scientific Publishing Company.
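
The pipeline described in the abstract (features from trees, occurrence vectors, PCA, clustering in the reduced space) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which features or which clustering algorithm the paper uses, so parent/child tag pairs as features, power iteration as the PCA routine, and k-means with farthest-first seeding are all assumptions made here for illustration, as are the function names and the toy documents.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_pair_counts(xml_text):
    """Count (parent tag, child tag) pairs in one document (assumed feature choice)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            counts[(parent.tag, child.tag)] += 1
    return counts


def to_vectors(docs):
    """Turn documents into occurrence vectors over a shared, sorted feature list."""
    counts = [tag_pair_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[f]) for f in vocab] for c in counts]


def leading_direction(rows, iters=200):
    """Leading eigenvector of X^T X by power iteration; rows must be centred."""
    dim = len(rows[0])
    # Non-uniform start, so it is unlikely to be orthogonal to the answer.
    v = [float(j + 1) for j in range(dim)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        scores = [sum(a * b for a, b in zip(r, v)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def pca_project(vectors, n_components):
    """Centre the vectors and project them onto the top principal directions."""
    means = [sum(col) / len(vectors) for col in zip(*vectors)]
    rows = [[x - m for x, m in zip(v, means)] for v in vectors]
    reduced = [[] for _ in vectors]
    for _ in range(n_components):
        v = leading_direction(rows)
        scores = [sum(a * b for a, b in zip(r, v)) for r in rows]
        for out, s in zip(reduced, scores):
            out.append(s)
        # Deflate: remove the component just found before looking for the next.
        rows = [[x - s * d for x, d in zip(r, v)] for r, s in zip(rows, scores)]
    return reduced


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-first seeding (assumed clusterer)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist2(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centres[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy documents sharing one structure but differing in which children dominate.
docs = [
    "<lib><book><author/><author/><title/></book></lib>",
    "<lib><book><author/><author/><author/><title/></book></lib>",
    "<lib><book><title/><chapter/><chapter/></book></lib>",
    "<lib><book><title/><chapter/><chapter/><chapter/></book></lib>",
]
reduced = pca_project(to_vectors(docs), n_components=2)
labels = kmeans(reduced, k=2)
print(labels)
```

On these toy documents the four-dimensional occurrence vectors are reduced to two dimensions, and the two author-heavy documents fall into one cluster while the two chapter-heavy documents fall into the other. A real collection would call for richer tree features and a library PCA routine.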

Identifier

33746253388 (Scopus)

Publication Title

International Journal on Artificial Intelligence Tools

External Full Text Location

https://doi.org/10.1142/S0218213005002326

ISSN

02182130

First Page

683

Last Page

699

Issue

Volume

Grant

IIS-9988636

Fund Ref

National Science Foundation

Recommended Citation

Wang, Jason T.L.; Liu, Jianghui; and Wang, Junhan, "XML clustering and retrieval through principal component analysis" (2005). Faculty Publications. 19610.
https://digitalcommons.njit.edu/fac_pubs/19610

DOI

10.1142/S0218213005002326
