Date of Award
Master of Science in Computer Science - (M.S.)
Computer and Information Science
Jason T. L. Wang
James M. Calvin
Franz J. Kurfess
A biomolecular object, such as a deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or protein molecule, is made up of a long chain of subunits. A protein is represented as a sequence over 20 different amino acids, each denoted by a letter. Many different amino acid sequences can give rise to similar structural domains in proteins. By contrast, the structure of DNA, made up of only four different nucleotide building blocks that occur in two pairs, is relatively simple, regular, and predictable.
Biomolecular sequence alignment, or string search, is a central and challenging task in many areas of science and information processing. It involves identifying one-to-one correspondences between subunits of different sequences. The efficiency of an alignment algorithm or tool depends on several important factors, including scoring systems, alignment statistics, database redundancy, and sequence repetitiveness.
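To make the idea of one-to-one correspondences under a scoring system concrete, the following is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). It is an illustration only and is not taken from the thesis: the function name `global_align` and the match, mismatch, and gap scores are assumed values chosen for the example, and practical tools typically use substitution matrices and more elaborate gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b; scores are illustrative assumptions."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two subunits
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Under these assumed scores, `global_align("ACGT", "AGT")` returns a score of 1 with the aligned strings `ACGT` and `A-GT`: three matched subunits and one gap. Changing the scoring parameters can change which correspondences are chosen, which is why the scoring system is listed first among the factors above.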
Sequence "motifs" are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone.
A more comprehensive approach to efficient string search is to build a small, representative set of motifs and use it as a screening database, with automatic masking of the query subsequences that match. This technology is still under development, but recent studies indicate that a representative set of only 1,000 to 3,000 sequences may suffice and that such a database can be searched in seconds.
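The screening-and-masking step can be sketched roughly as follows. This is a hypothetical illustration, not the method evaluated in the thesis: it assumes motifs are stored as regular expressions in a small dictionary, and the function name `screen_and_mask`, the motif label, and the mask character `X` are all invented for the example. The single pattern shown corresponds to the familiar N-glycosylation site consensus (N, any residue except P, S or T, any residue except P).

```python
import re


def screen_and_mask(query, motifs, mask_char="X"):
    """Scan query against a motif set; return the masked query and the hits.

    motifs maps a motif name to a regular expression (an assumed
    representation chosen for this sketch).
    """
    hits = []
    masked = list(query)
    for name, pattern in motifs.items():
        for m in re.finditer(pattern, query):
            hits.append((name, m.start(), m.end()))
            # Overwrite the matched region so later searches skip it.
            masked[m.start():m.end()] = mask_char * (m.end() - m.start())
    return "".join(masked), hits


# A tiny illustrative motif set; a real screening database would hold many more.
MOTIFS = {"N-glycosylation": r"N[^P][ST][^P]"}
```

For instance, `screen_and_mask("MKNGSAAL", MOTIFS)` reports one hit spanning positions 2 to 6 and returns the masked query `MKXXXXAL`. The masked sequence could then be passed to a slower, full database search with the already-explained region excluded.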
Chuang, Wei-Jen, "A comparative study of sequence analysis tools in computational biology" (1998). Theses. 843.