[comp.theory] Approx. string matching

pete@BONNIE.ICS.UCI.EDU (Peter O'Leary) (01/11/90)

I am attempting to find an efficient approximate string matching
algorithm and so far have had a frustrating time.  I have an paper
by Landau and Vishkin from a 1986 ACM Proceedings that describes an
O(kn) algorithm, however, the paper refers to other works that I have not
been able to locate.  Does anyone know of a good survey paper or textbook
with a good treatment of this subject?

Peter O'Leary.

johnk@cs.arizona.edu (John Kececioglu) (01/12/90)

In article <9001092236.aa13533@PARIS.ICS.UCI.EDU>
Peter O'Leary <pete%BONNIE.ICS.UCI.EDU@VM1.NoDak.EDU> writes:

> I am attempting to find an efficient approximate string matching
> algorithm and so far have had a frustrating time.  I have an paper
> by Landau and Vishkin from a 1986 ACM Proceedings that describes an
> O(kn) algorithm, however, the paper refers to other works that I have not
> been able to locate.  Does anyone know of a good survey paper or textbook
> with a good treatment of this subject?

There are myriad approximate string matching problems.  The two survey papers
I am familiar with are

	A. Apostolico and C. Guerra, "The longest common subsequence problem
	revisited," Algorithmica 2 (1987) 315-336.

	Z. Galil and R. Giancarlo, "Data structures and algorithms for
	approximate string matching," J. of Complexity 4 (1988) 33-72.

The only book I know of is

	David Sankoff and Joseph Kruskal, editors, _Time Warps, String Edits,
	and Macromolecules:  The Theory and Practice of Sequence Comparison_,
	Addison-Wesley, Reading Massachusetts, 1983.

Eugene Myers has a new practical approximate string matching algorithm with
sublinear expected time, though it is unpublished.  If you are interested
in the more restricted problem of finding the edit distance between
two strings, I recommend the following not well-known paper

	Eugene Myers, "An O(ND) difference algorithm and its variations,"
	Algorithmica 1 (1986) 251-266.

The algorithm described is simple, and has O(D^2 + N) expected time.
(Here D is the minimum number of single-character insertions, deletions,
and substitutions to convert one string into the other.)  An impractical
variation has O(D^2 + N log N) worst-case time.  For implementation issues,
see

	Webb Miller and Eugene Myers, "A file comparison program," Software
	Practice and Experience 15:11 (November 1985) 1025-1040.

Hope this helps.

-- 
John Kececioglu
johnk@cs.arizona.edu or arizona!johnk.uucp
Department of Computer Science, University of Arizona, Tucson, Arizona 85721
"Through small apertures we glimpse abysses whose somber depths turn us faint."