Date of Award

Fall 2011

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Computing Sciences - (Ph.D.)

Department

Computer Science

First Advisor

James Geller

Second Advisor

Min Song

Third Advisor

Narain Gehani

Fourth Advisor

Vincent Oria

Fifth Advisor

Xiaohua Hu

Abstract

Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. Despite the large number of existing string similarity metrics, there is a need for more precise approximate string matching methods to improve the efficiency of computer-driven data processing, thus decreasing labor-intensive human involvement.

This work introduces a family of novel string similarity methods that outperform a number of well-known and widely used string similarity functions. The new algorithms are designed to overcome the most common problem of the existing methods: the lack of context sensitivity.

In the evaluation, the Longest Approximately Common Prefix (LACP) method achieved the highest values of average precision and maximum F1 on three out of the four medical informatics datasets used. LACP also had the lowest execution time among the evaluated algorithms, owing to its linear computational complexity. An online interactive spell checker of biomedical terms was developed based on the LACP method. The main goal of the spell checker was to evaluate how well the LACP method makes it possible to estimate the similarity of the resulting sets at a glance.

The Shortest Path Edit Distance (SPED) outperformed all evaluated similarity functions and achieved the highest possible values of the average precision and maximum F1 measures on the bioinformatics datasets. The SPED design was inspired by the preceding work on the Markov Random Field Edit Distance (MRFED). SPED eliminates two shortcomings of the MRFED: prolonged execution time and moderate performance.

Four modifications of the Histogram Difference (HD) method demonstrated the best performance on the majority of the life and social sciences data sources used in the experiments. The modifications of the HD algorithm were obtained using several re-scorers: HD with the Normalized Smith-Waterman re-scorer, HD with the TFIDF and Jaccard re-scorers, HD with the Longest Common Prefix and TFIDF re-scorers, and HD with the Unweighted Longest Common Prefix re-scorer.

Another contribution of this dissertation is an extensive evaluation and analysis of string similarity methods for duplicate detection and clustering tasks in the life and social sciences, bioinformatics, and medical informatics domains. The experimental results are illustrated with precision-recall charts and a number of tables presenting the average precision, maximum F1, and execution time.
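
For readers unfamiliar with the two quality measures named above, the following is a minimal sketch of how average precision and maximum F1 are commonly computed from a ranked list of candidate pairs. It assumes the non-interpolated definition of average precision and a list of boolean labels ordered by decreasing similarity score; the dissertation may use a different variant (for example, interpolated precision), and the function names and sample data here are hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(labels: List[bool]) -> List[Tuple[float, float]]:
    """Return (precision, recall) at every rank of a ranked candidate list.

    labels[i] is True when the pair at rank i + 1 is a true duplicate.
    The list is assumed to be sorted by decreasing similarity score.
    """
    total_relevant = sum(labels)
    points = []
    hits = 0
    for rank, is_match in enumerate(labels, start=1):
        if is_match:
            hits += 1
        precision = hits / rank
        recall = hits / total_relevant if total_relevant else 0.0
        points.append((precision, recall))
    return points


def average_precision(labels: List[bool]) -> float:
    """Non-interpolated average precision: mean precision at each true match."""
    total_relevant = sum(labels)
    if total_relevant == 0:
        return 0.0
    points = precision_recall_points(labels)
    matched_precisions = [p for (p, _), is_match in zip(points, labels) if is_match]
    return sum(matched_precisions) / total_relevant


def max_f1(labels: List[bool]) -> float:
    """Largest F1 = 2PR / (P + R) over all cut-off ranks."""
    best = 0.0
    for precision, recall in precision_recall_points(labels):
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical ranking: true duplicates at ranks 1, 3, and 4.
ranked = [True, False, True, True, False]
print(round(average_precision(ranked), 4))
print(round(max_f1(ranked), 4))
```

Each precision-recall chart point corresponds to one cut-off rank in such a list, so the same per-rank values can be used both for plotting and for the two summary measures.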
