[comp.theory] Longest repeated substring problem.

ajay@cs.Buffalo.EDU (Ajay Shekhawat) (04/16/91)

Sometime ago I stumbled across this problem: 
Given a string, find the longest repeated substring of
that string. Can this problem be solved in time better
than O(n^2) ( where  n = length of the string ) ?

Any references would be appreciated.

Ajay..


Ajay Shekhawat           <Dept. of Comp. Sci., SUNY@Buffalo, Amherst, NY 14260>
ajay@cs.Buffalo.EDU || ajay@sunybcs.BITNET || ajay@sunybcs.UUCP || 716.636.3180

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/16/91)

In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes:
>
>Sometime ago I stumbled across this problem: 
>Given a string, find the longest repeated substring of
>that string. Can this problem be solved in time better
>than O(n^2) ( where  n = length of the string ) ?
>
>Any references would be appreciated.
>
>Ajay..

If you mean _consecutive_ repetitions it can be done in linear time.

-kym

cs450a03@uc780.umd.edu (04/16/91)

Martin Farach  >
Kym Horsell    >|
Ajay Shekhawat >***

>*** Sometime ago I stumbled across this problem: Given a string, find
>*** the longest repeated substring of that string. Can this problem
>*** be solved in time better than O(n^2) ( where n = length of the
>*** string ) ?

>| If you mean _consecutive_ repetitions it can be done in linear time.

>Actually, it can done in linear time whether or not it is consecutive
>repetitions.  Just build a suffix tree.

I don't think that you can guarantee that building the suffix tree
will take linear time.  (Consider using a fun string, like
'abbabaabbaaba...')

Raul Rockwell

mpf@triplea.cs.umd.edu (Martin Farach) (04/16/91)

In article <1991Apr15.200508.757@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

   In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes:
   >
   >Sometime ago I stumbled across this problem: 
   >Given a string, find the longest repeated substring of
   >that string. Can this problem be solved in time better
   >than O(n^2) ( where  n = length of the string ) ?
   >
   >Any references would be appreciated.
   >
   >Ajay..

   If you mean _consecutive_ repetitions it can be done in linear time.

   -kym

Actually, it can done in linear time whether or not it is consecutive
repetitions.  Just build a suffix tree.

--
Martin Farach			mpf@cs.umd.edu
University of Maryland		"Triviality and certainty are the
Department of Computer Science	 kinderkrankheiten of knowledge."
College Park, Maryland 20742			  -Imre Lakatos

jloup@nocturne.chorus.fr (Jean-Loup Gailly) (04/16/91)

In article <70943@eerie.acsu.Buffalo.EDU>, ajay@cs.Buffalo.EDU
(Ajay Shekhawat) writes:

| Sometime ago I stumbled across this problem: 
| Given a string, find the longest repeated substring of
| that string. Can this problem be solved in time better
| than O(n^2) ( where  n = length of the string ) ?

This problem can be solved in linear time [O(n)] using a suffix tree.
See for example

     McCreight,E.M.
     A space-economical suffix tree construction algorithm
     JACM 23, 2 (1976), 262-272

Jean-loup Gailly

Chorus systemes, 6 av G. Eiffel, 78182 St-Quentin-en-Yvelines-Cedex, France
email: jloup@chorus.fr    Tel: +33 (1) 30 64 82 79 Fax: +33 (1) 30 57 00 66

mark@adler.philosophie.uni-stuttgart.de (Mark Johnson) (04/16/91)

Algorithm for finding longest common substring (I don't know if
this is new or not).

Input: An array A (of characters) of length n.

Output: Integers p, q, and l such that
            A[p] .. A[p+l] = A[q] .. A[q+l]
  and there do not exist p', q' and l' > l such that
            A[p'] .. A[p'+l'] = A[q'] .. A[q'+l']

  Its running time is O(l n lg n).
  

The algorithm rests on the observation that the required
substrings of A have the property that they are adjacent
in a sorted sequence of all substrings of A.  This algorithm
uses an idea of Martin Kay's for space- and time-efficiently 
sorting all of the substrings of a string.

The algorithm uses two arrays of integers P and L, both of
length n.

We use P as an array of indices into the array A,
whose value is the suffix of A beginning at P[i].

1. For 0 <= i < n set P[i] = i.

     P now contains indices to every suffix in A.

2. Sort the array P, where P[i] < P[j] iff the suffix
   of A beginning at P[i] precedes the suffix beginning
   at P[j] is standard lexicographic order.

      P now contains indices to every suffix in A
      in lexicographic order.

3. For each 0 <= i < n-1, let L[i] be the length of
   the common prefix of the suffixes of A beginning
   at P[i] and P[i+1].

4. Find a maximum m such that for all 0 <= i < n-1,
   L[m] >= L[i].  Set p = P[m], q = P[m+1], l = L[m]
   and stop.

We estimate the time complexity as follows.

Step 1 takes O(n) time.
Step 2 takes O(l n lg n) (since each comparision
   may need to compare at most l characters)
Step 3 takes O(l n) time.
Step 4 takes O(n) time.

Comments?

Mark Johnson
--

Mark Johnson
Institut fuer maschinelle Sprachverarbeitung - Computerlinguistik
Universitaet Stuttgart
Keplerstrasse 17
D-7000 Stuttgart 1
Germany.

phone: 0711 - 121 3132.

On leave until mid July 1991 from:

Cognitive and Linguistic Sciences, Box 1978
Brown University
Providence, RI 02912
USA

email addresses: 

mark@adler.philosophie.uni-stuttgart.de
mj@cs.brown.edu
markj@ai.mit.edu
johnson@csli.stanford.edu
mark@ego.mit.edu

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/16/91)

In article <MPF.91Apr15210650@triplea.cs.umd.edu> mpf@triplea.cs.umd.edu (Martin Farach) writes:
>   In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes:
>   >
>   >Sometime ago I stumbled across this problem: 
>   >Given a string, find the longest repeated substring of
>   >that string. Can this problem be solved in time better
>   >than O(n^2) ( where  n = length of the string ) ?
>   >
>   >Any references would be appreciated.
>Actually, it can done in linear time whether or not it is consecutive
>repetitions.  Just build a suffix tree.

Please elucidate.

-kym

mpf@triplea.cs.umd.edu (Martin Farach) (04/17/91)

In article <1991Apr16.154935.21123@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

   In article <MPF.91Apr15210650@triplea.cs.umd.edu> mpf@triplea.cs.umd.edu (Martin Farach) writes:
   >   In article <70943@eerie.acsu.Buffalo.EDU> ajay@cs.Buffalo.EDU ( Ajay Shekhawat ) writes:
   >   >
   >   >Sometime ago I stumbled across this problem: 
   >   >Given a string, find the longest repeated substring of
   >   >that string. Can this problem be solved in time better
   >   >than O(n^2) ( where  n = length of the string ) ?
   >   >
   >   >Any references would be appreciated.
   >Actually, it can done in linear time whether or not it is consecutive
   >repetitions.  Just build a suffix tree.

   Please elucidate.

   -kym

OK.  In:

@Article{W-73,
  author = 	"P. Weiner",
  title = 	"Linear Pattern Matching Algorithm",
  journal = 	"Proc. 14 IEEE Symposium on Switching and Automata Theory",
  year = 	1973,
  pages = 	"1-11"
}

or, more legibly in:

@InCollection{CS-85,
  author = 	"M. T. Chen and J. Seiferas",
  title = 	"Efficient and Elegant Subword Tree Construction",
  booktitle = 	"Combinatorial Algorithms on Words",
  publisher = 	"NATO ASI Series F: Computer and System Sciences",
  year = 	1985,
  editor = 	"A. Apostolico and Z. Galil",
  chapter = 	12,
  pages = 	"97-107"
}

there is an algorithm for constructing the suffix tree of a string of
lenght n in O(n) time (actually, this assumes a finite alphabet - for
an unbounded alphabet S, it takes time O(n log S)).  In a suffix tree,
each node has a string associated it.  The string s(n) of node n is
defined as follows.  If v is the least common ancestor of n and m,
then s(v) is the longest common prefix of s(n) and s(m).  Furthermore,
for any suffix of the orginal string, there is a leaf whose string is
that suffix.  

Now consider the longest repeated substring, r, of a string S.  Since
the string is repeated (and maximal), there will be some node n in S's
suffix tree such that s(n) = r.  Furthermore, for any (internal) node
v in S's suffix tree |s(v)| \leq |s(n)| = |r|.  So you can build the
suffix tree in linear time and then scan through the internal nodes to
find which one has the longest associated string - also in linear
time.

For finding longest _consecutive_ repeated sequence (note that this is
usually called a square), you need a different algorithm.  It can also
be done in linear time, further, in time O(n log n) you can find *all*
the squares of a string:

@ARTICLE{AP-83,
	AUTHOR = "A. Apostolico and F.P. Preparata",
	TITLE =  "Optimal Off-line Detection of Repetitions in a String",
	JOURNAL = "Theoretical Computer Science",
	YEAR = 1983,
	VOLUME = 22,
	PAGES = "297-315",
}

--
Martin Farach			mpf@cs.umd.edu
University of Maryland		"Triviality and certainty are the
Department of Computer Science	 kinderkrankheiten of knowledge."
College Park, Maryland 20742			  -Imre Lakatos