ajay@cs.Buffalo.EDU (Ajay Shekhawat) (04/16/91)
Sometime ago I stumbled across this problem: Given a string, find the longest repeated substring of that string. Can this problem be solved in time better than O(n^2) ( where n = length of the string ) ? Any references would be appreciated. Ajay.. Ajay Shekhawat <Dept. of Comp. Sci., SUNY@Buffalo, Amherst, NY 14260> ajay@cs.Buffalo.EDU || ajay@sunybcs.BITNET || ajay@sunybcs.UUCP || 716.636.3180
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/16/91)
In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes: > >Sometime ago I stumbled across this problem: >Given a string, find the longest repeated substring of >that string. Can this problem be solved in time better >than O(n^2) ( where n = length of the string ) ? > >Any references would be appreciated. > >Ajay.. If you mean _consecutive_ repetitions it can be done in linear time. -kym
cs450a03@uc780.umd.edu (04/16/91)
Martin Farach > Kym Horsell >| Ajay Shekhawat >*** >*** Sometime ago I stumbled across this problem: Given a string, find >*** the longest repeated substring of that string. Can this problem >*** be solved in time better than O(n^2) ( where n = length of the >*** string ) ? >| If you mean _consecutive_ repetitions it can be done in linear time. >Actually, it can done in linear time whether or not it is consecutive >repetitions. Just build a suffix tree. I don't think that you can guarantee that building the suffix tree will take linear time. (Consider using a fun string, like 'abbabaabbaaba...') Raul Rockwell
mpf@triplea.cs.umd.edu (Martin Farach) (04/16/91)
In article <1991Apr15.200508.757@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes: > >Sometime ago I stumbled across this problem: >Given a string, find the longest repeated substring of >that string. Can this problem be solved in time better >than O(n^2) ( where n = length of the string ) ? > >Any references would be appreciated. > >Ajay.. If you mean _consecutive_ repetitions it can be done in linear time. -kym Actually, it can done in linear time whether or not it is consecutive repetitions. Just build a suffix tree. -- Martin Farach mpf@cs.umd.edu University of Maryland "Triviality and certainty are the Department of Computer Science kinderkrankheiten of knowledge." College Park, Maryland 20742 -Imre Lakatos
jloup@nocturne.chorus.fr (Jean-Loup Gailly) (04/16/91)
In article <70943@eerie.acsu.Buffalo.EDU>, ajay@cs.Buffalo.EDU (Ajay Shekhawat) writes: | Sometime ago I stumbled across this problem: | Given a string, find the longest repeated substring of | that string. Can this problem be solved in time better | than O(n^2) ( where n = length of the string ) ? This problem can be solved in linear time [O(n)] using a suffix tree. See for example McCreight,E.M. A space-economical suffix tree construction algorithm JACM 23, 2 (1976), 262-272 Jean-loup Gailly Chorus systemes, 6 av G. Eiffel, 78182 St-Quentin-en-Yvelines-Cedex, France email: jloup@chorus.fr Tel: +33 (1) 30 64 82 79 Fax: +33 (1) 30 57 00 66
mark@adler.philosophie.uni-stuttgart.de (Mark Johnson) (04/16/91)
Algorithm for finding longest common substring (I don't know if this is new or not). Input: An array A (of characters) of length n. Output: Integers p, q, and l such that A[p] .. A[p+l] = A[q] .. A[q+l] and there do not exist p', q' and l' > l such that A[p'] .. A[p'+l'] = A[q'] .. A[q'+l'] Its running time is O(l n lg n). The algorithm rests on the observation that the required substrings of A have the property that they are adjacent in a sorted sequence of all substrings of A. This algorithm uses an idea of Martin Kay's for space- and time-efficiently sorting all of the substrings of a string. The algorithm uses two arrays of integers P and L, both of length n. We use P as an array of indices into the array A, whose value is the suffix of A beginning at P[i]. 1. For 0 <= i < n set P[i] = i. P now contains indices to every suffix in A. 2. Sort the array P, where P[i] < P[j] iff the suffix of A beginning at P[i] precedes the suffix beginning at P[j] is standard lexicographic order. P now contains indices to every suffix in A in lexicographic order. 3. For each 0 <= i < n-1, let L[i] be the length of the common prefix of the suffixes of A beginning at P[i] and P[i+1]. 4. Find a maximum m such that for all 0 <= i < n-1, L[m] >= L[i]. Set p = P[m], q = P[m+1], l = L[m] and stop. We estimate the time complexity as follows. Step 1 takes O(n) time. Step 2 takes O(l n lg n) (since each comparision may need to compare at most l characters) Step 3 takes O(l n) time. Step 4 takes O(n) time. Comments? Mark Johnson -- Mark Johnson Institut fuer maschinelle Sprachverarbeitung - Computerlinguistik Universitaet Stuttgart Keplerstrasse 17 D-7000 Stuttgart 1 Germany. phone: 0711 - 121 3132. On leave until mid July 1991 from: Cognitive and Linguistic Sciences, Box 1978 Brown University Providence, RI 02912 USA email addresses: mark@adler.philosophie.uni-stuttgart.de mj@cs.brown.edu markj@ai.mit.edu johnson@csli.stanford.edu mark@ego.mit.edu
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/16/91)
In article <MPF.91Apr15210650@triplea.cs.umd.edu> mpf@triplea.cs.umd.edu (Martin Farach) writes: > In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes: > > > >Sometime ago I stumbled across this problem: > >Given a string, find the longest repeated substring of > >that string. Can this problem be solved in time better > >than O(n^2) ( where n = length of the string ) ? > > > >Any references would be appreciated. >Actually, it can done in linear time whether or not it is consecutive >repetitions. Just build a suffix tree. Please elucidate. -kym
mpf@triplea.cs.umd.edu (Martin Farach) (04/17/91)
In article <1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: In article <MPF.91Apr15210650@triplea.cs.umd.edu> mpf@triplea.cs.umd.edu (Martin Farach) writes: > In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes: > > > >Sometime ago I stumbled across this problem: > >Given a string, find the longest repeated substring of > >that string. Can this problem be solved in time better > >than O(n^2) ( where n = length of the string ) ? > > > >Any references would be appreciated. >Actually, it can done in linear time whether or not it is consecutive >repetitions. Just build a suffix tree. Please elucidate. -kym OK. In: @Article{W-73, author = "P. Weiner", title = "Linear Pattern Matching Algorithm", journal = "Proc. 14 IEEE Symposium on Switching and Automata Theory", year = 1973, pages = "1-11" } or, more legibly in: @InCollection{CS-85, author = "M. T. Chen and J. Seiferas", title = "Efficient and Elegant Subword Tree Construction", booktitle = "Combinatorial Algorithms on Words", publisher = "NATO ASI Series F: Computer and System Sciences", year = 1985, editor = "A. Apostolico and Z. Galil", chapter = 12, pages = "97-107" } there is an algorithm for constructing the suffix tree of a string of lenght n in O(n) time (actually, this assumes a finite alphabet - for an unbounded alphabet S, it takes time O(n log S)). In a suffix tree, each node has a string associated it. The string s(n) of node n is defined as follows. If v is the least common ancestor of n and m, then s(v) is the longest common prefix of s(n) and s(m). Furthermore, for any suffix of the orginal string, there is a leaf whose string is that suffix. Now consider the longest repeated substring, r, of a string S. Since the string is repeated (and maximal), there will be some node n in S's suffix tree such that s(n) = r. Furthermore, for any (internal) node v in S's suffix tree |s(v)| \leq |s(n)| = |r|. So you can build the suffix tree in linear time and then scan through the internal nodes to find which one has the longest associated string - also in linear time. For finding longest _consecutive_ repeated sequence (note that this is usually called a square), you need a different algorithm. It can also be done in linear time, further, in time O(n log n) you can find *all* the squares of a string: @ARTICLE{AP-83, AUTHOR = "A. Apostolico and F.P. Preparata", TITLE = "Optimal Off-line Detection of Repetitions in a String", JOURNAL = "Theoretical Computer Science", YEAR = 1983, VOLUME = 22, PAGES = "297-315", } -- Martin Farach mpf@cs.umd.edu University of Maryland "Triviality and certainty are the Department of Computer Science kinderkrankheiten of knowledge." College Park, Maryland 20742 -Imre Lakatos