Date of Award
Doctor of Philosophy in Computing Sciences - (Ph.D.)
Computer and Information Science
Frank Y. Shih
Peter A. Ng
James A. McHugh
This dissertation presents document preprocessing and fuzzy unsupervised character classification for automatically reading office documents that are received daily and have complex layout structures, such as multiple columns and mixed-mode contents of text, graphics, and half-tone pictures. First, a block segmentation algorithm based on a simple two-step run-length smoothing decomposes a document into single-mode blocks. Next, block classification based on clustering rules assigns each block to one of several types: text, horizontal or vertical lines, graphics, or pictures. The mean white-to-black transition is shown to be an invariant for textual blocks and is useful for block discrimination.
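As a rough illustration of the ideas in this paragraph, the sketch below fills short white runs in a binary image and counts white-to-black transitions per row. It is a minimal sketch under stated assumptions, not the dissertation's algorithm: the abstract does not say what the two steps are, so a horizontal pass followed by a vertical pass is assumed here, and the function names and thresholds are hypothetical.

```python
# Assumption: pixels are 0 (white) or 1 (black); images are lists of rows.

def smooth_runs(row, threshold):
    """Fill white runs of length <= threshold that lie between two black pixels."""
    out = list(row)
    last_black = None
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None and 0 < i - last_black - 1 <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smooth_image(image, h_threshold, v_threshold):
    """Assumed two-step scheme: horizontal pass, then vertical pass on its result."""
    horizontal = [smooth_runs(row, h_threshold) for row in image]
    columns = [smooth_runs(list(col), v_threshold) for col in zip(*horizontal)]
    return [list(row) for row in zip(*columns)]


def mean_white_to_black_transitions(block):
    """Average number of 0 -> 1 transitions per row of a block."""
    counts = [sum(1 for a, b in zip(row, row[1:]) if a == 0 and b == 1)
              for row in block]
    return sum(counts) / len(counts)
```

After smoothing, connected black regions would serve as candidate blocks, and a feature such as the mean transition count could be one input to the block classification rules.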
A fuzzy model for unsupervised character classification is designed to improve the robustness, correctness, and speed of the character recognition system. The classification procedure is divided into two stages. The first stage separates the characters into seven typographical categories based on the word structures of a text line. The second stage uses pattern matching to classify the characters in each category into a set of fuzzy prototypes based on a nonlinear weighted similarity function. The fuzzy model, which represents prototypes for character matching more naturally, is defined, and the weighted fuzzy similarity measure is explored. The characteristics of the fuzzy model are discussed and used to speed up the classification process.
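The second stage can be pictured with a small sketch of unsupervised matching against fuzzy prototypes. The similarity used here (sum of minima over sum of maxima) is only a stand-in, since the abstract does not define the nonlinear weighted similarity function; the threshold, the running-average update, and all names are likewise assumptions made for illustration.

```python
# Assumption: each glyph is a flat list of 0/1 pixels of a fixed size, and a
# prototype is a list of membership values in [0, 1] of the same length.

def fuzzy_similarity(prototype, glyph):
    """Stand-in similarity: sum of minima divided by sum of maxima."""
    num = sum(min(p, g) for p, g in zip(prototype, glyph))
    den = sum(max(p, g) for p, g in zip(prototype, glyph))
    return num / den if den else 1.0


def classify(glyphs, threshold=0.8):
    """Assign each glyph to the best-matching prototype or start a new one."""
    prototypes = []  # list of (membership values, number of merged glyphs)
    labels = []
    for glyph in glyphs:
        best_index, best_score = None, 0.0
        for index, (membership, _) in enumerate(prototypes):
            score = fuzzy_similarity(membership, glyph)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= threshold:
            membership, count = prototypes[best_index]
            # Running average keeps per-pixel membership in [0, 1].
            updated = [(m * count + g) / (count + 1)
                       for m, g in zip(membership, glyph)]
            prototypes[best_index] = (updated, count + 1)
            labels.append(best_index)
        else:
            prototypes.append(([float(g) for g in glyph], 1))
            labels.append(len(prototypes) - 1)
    return labels, prototypes
```

In this sketch, pixels that agree across all merged glyphs keep a membership of 1 or 0, while pixels that vary between samples take intermediate values, which is one way a prototype can tolerate noise in individual characters.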
After classification, the character recognition procedure is simply applied to the limited versions of the fuzzy prototypes. To avoid information loss and extra distortion, a topography-based approach is proposed that operates directly on the fuzzy prototypes to extract their skeletons. First, a convolution with a bell-shaped function is performed to obtain a smooth surface. Second, the ridge points are extracted by rule-based topographic analysis of the structure. Third, a membership function is assigned to the ridge points, with values indicating their degrees of membership with respect to the skeleton of an object. Finally, the significant ridge points are linked to form the strokes of the skeleton, and clues from eigenvalue variation are used to deal with degradation and preserve connectivity. Experimental results show that our algorithm can reduce the deformation of junction points and correctly extract the whole skeleton even when a character is broken into pieces. For characters that are merged together, the breaking candidates can be easily located by searching for saddle points. A pruning algorithm is then applied at each breaking position. Lastly, a multiple context confirmation can be applied to increase the reliability of the breaking hypotheses.
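The first three skeleton steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the dissertation's rule-based analysis: a Gaussian is assumed as the bell-shaped function, ridge points are taken to be strict local maxima along the direction of the most negative Hessian eigenvalue, and membership is that eigenvalue's magnitude normalized by the largest one found. All names and parameters are hypothetical, and stroke linking is not shown.

```python
import math


def gaussian_kernel(sigma, radius):
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(image, kernel):
    """1-D convolution along each row, treating pixels outside the image as 0."""
    radius, width = len(kernel) // 2, len(image[0])
    return [[sum(w * row[x + k - radius] for k, w in enumerate(kernel)
                 if 0 <= x + k - radius < width)
             for x in range(width)] for row in image]


def transpose(image):
    return [list(col) for col in zip(*image)]


def smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian smoothing: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(convolve_rows(transpose(convolve_rows(image, kernel)), kernel))


# Quantized directions as (dx, dy): 0, 45, 90, and 135 degrees.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1)]


def ridge_memberships(surface, eps=1e-9):
    """Map (y, x) of each ridge point to a membership value in (0, 1]."""
    height, width = len(surface), len(surface[0])
    strengths = {}
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            c = surface[y][x]
            fxx = surface[y][x + 1] - 2 * c + surface[y][x - 1]
            fyy = surface[y + 1][x] - 2 * c + surface[y - 1][x]
            fxy = (surface[y + 1][x + 1] - surface[y + 1][x - 1]
                   - surface[y - 1][x + 1] + surface[y - 1][x - 1]) / 4.0
            # Smaller eigenvalue of the Hessian [[fxx, fxy], [fxy, fyy]].
            low = (fxx + fyy) / 2.0 - math.hypot((fxx - fyy) / 2.0, fxy)
            if low >= -eps:
                continue
            # Direction of the smaller eigenvalue, quantized to 45 degrees.
            theta = 0.5 * math.atan2(2.0 * fxy, fxx - fyy) + math.pi / 2.0
            dx, dy = DIRECTIONS[round(theta / (math.pi / 4.0)) % 4]
            if c > surface[y + dy][x + dx] and c > surface[y - dy][x - dx]:
                strengths[(y, x)] = -low
    peak = max(strengths.values(), default=1.0)
    return {point: value / peak for point, value in strengths.items()}
```

For a one-pixel-thick horizontal bar, for example, the surviving points lie along the bar's center row, with lower membership near the image border where the smoothed surface falls off.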
Chen, Shy-Shyan, "Document preprocessing and fuzzy unsupervised character classification" (1995). Dissertations. 1109.