[comp.bugs.sys5] s/A*./B/g editor bug?

bink@aplcen.apl.jhu.edu (9704) (10/18/89)

I've noticed the ED s///g command works incorrectly when given the command
	s/b*a/c/g
on the input line
	aaaaaaaaaaaaaaaaaaaaa
-- the result is:
	cacacacacacacacacacac
and it should be:
	ccccccccccccccccccccc

This is incorrect according to ED(1):

     (.,.)s/RE/replacement/g
                         The substitute command searches each
                         addressed line for an occurrence of the
                         specified RE.  In each line in which a
                         match is found, ALL (NON-OVERLAPPED)
                         matched strings are replaced by the
                         replacement if the global replacement
                         indicator g appears after the command.

These occurrences ARE non-overlapping.  The form s/ab*/c/g works as expected.
Both VI and ED behave this way, on at least System V.2 and Ultrix-32 V3.1.
Is this a confirmed bug in the RE matcher or the substitute code, and has
it been resolved?  This sounds trivial, but it did cause real problems.

					-- Greg Ubben
					   bink@aplcen.apl.jhu.edu
					   ...!uunet!mimsy!aplcen!bink

gwyn@smoke.BRL.MIL (Doug Gwyn) (10/18/89)

In article <3522@aplcen.apl.jhu.edu> bink@aplcen.apl.jhu.edu (9704) writes:
-I've noticed the ED s///g command works incorrectly when given the command
-	s/b*a/c/g
-on the input line
-	aaaaaaaaaaaaaaaaaaaaa
-Is this a confirmed bug in the RE matcher or the substitute code, and has
-it been resolved?  This sounds trivial, but it did cause real problems.

I believe it's a real bug.  The corresponding "sam" x/b*a/c/c/ works as
you expected the "ed" command to work.

perry@ccssrv.UUCP (Perry Hutchison) (10/19/89)

In article <3522@aplcen.apl.jhu.edu> bink@aplcen.apl.jhu.edu (9704) writes:

+ I've noticed the ED s///g command works incorrectly when given the command
+ 	s/b*a/c/g
+ on the input line
+ 	aaaaaaaaaaaaaaaaaaaaa
+ -- the result is:
+ 	cacacacacacacacacacac
+ and it should be:
+ 	ccccccccccccccccccccc

< documentation citation, according to which this is a bug, deleted >

+ Both VI and ED behave this way, on at least System V.2 and Ultrix-32 V3.1.

Both also behave this way on SunOS 3.5, which is a 4.2 BSD derivative.
Anyone care to try it out on a 2bsd or a V7 system?

henry@utzoo.uucp (Henry Spencer) (10/19/89)

In article <741@ccssrv.UUCP> perry@ccssrv.UUCP (Perry Hutchison) writes:
>+ 	s/b*a/c/g
>+ on the input line
>+ 	aaaaaaaaaaaaaaaaaaaaa
>+ -- the result is:
>+ 	cacacacacacacacacacac
>+ Both VI and ED behave this way, on at least System V.2 and Ultrix-32 V3.1.
>
>Both also behave this way on SunOS 3.5, which is a 4.2 BSD derivative.
>Anyone care to try it out on a 2bsd or a V7 system?

Using our ed, which although it runs on a Sun is a V6-derived program,
I get the right answer.  Looks like somebody broke it -- Berkeley, for
a guess, but I could be wrong -- and everyone else blindly copied the
mistake.  (I unfortunately don't have a V7 ed source handy to check;
just conceivably it was broken there, although it would surprise me.)
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

guy@auspex.auspex.com (Guy Harris) (10/20/89)

>Both also behave this way on SunOS 3.5, which is a 4.2 BSD derivative.
>Anyone care to try it out on a 2bsd or a V7 system?

"vi", but not "ed", behaves that way on SunOS 4.0, whose "vi" is
4.3BSD-derived and whose "ed" is S5R3-derived.

prp@sei.cmu.edu (Patrick Place) (10/20/89)

To add fuel to the broken regular expressions fire, using a 3B1 - I
have no idea where this version of ed was derived from, but I would
guess from AT&T System V - ed is again broken in the way indicated
by others.

I would be interested if Henry Spencer, who has a V6 derived ed,
can determine whether he is using the standard regular expression
package or if in the derivation path the appropriate routines have
been fixed.

Pat Place

dmr@alice.UUCP (10/20/89)

The behavior of ed's
	s/b*a/c/g
on
	aaaaa
to yield
	cacacacacac
is doubtless a bug, and quite old; it occurs in
the Seventh Edition ed.  It relates to handling of null strings.

The problem ed is trying to overcome is shown by the command
	s/b*/c/g
on
	aaaaa
where the most desirable result is presumably
	cacacacacac
where each null string before, inside, and after the subject
matches /b*/ exactly once.  But "each null string" is a bit
slippery in practice.

Ed generally handles the 'g' option of substitution by iterating the
's' command, with an advancing starting position, until it reaches
the end of the subject string.  The new starting position is just
between the end of the replacement string from the last iteration and
the end of the unconsumed part of the subject string (if the r.e. match
succeeded), or one character further along in the subject (if the
match at this position failed).

When the algorithm suggested by this scheme is tried naively,
the second example, 's/b*/c/g', blows up; it puts a 'c'
at the beginning, then again finds the null string after
the inserted 'c' and before the same, initial 'a'; the r.e.
didn't consume any of the subject.

The remedy for this misbehavior that ed chose was
to force * expressions to fail to match empty strings
at the beginning of the subject string on the second
and subsequent iterations of the 's/.../.../g' processing.
(If you have the code, this is the meaning of the 'locs'
variable in advance().)  So in the 's/b*a/c/g' example,
a 'c' is put at the start; on the second iteration,
the subject still begins with the first 'a', and the 'b*' fails on this
initial 'a'.  Then it advances; this time the substitution
succeeds, because it is looking for 'b*a' not at the beginning.

What you've discovered is that this remedy is also naive.
It looks wiser to attack the problem head-on, by permitting
the r.e. null string match, but recasting the 's/.../.../g' loop
so that if the entire r.e. matches a null string immediately to the
right of a string replaced in the preceding iteration,
an advance through the subject string is forced.


	Dennis Ritchie
	dmr@research.att.com
	att!research!dmr

stevens@hsi.UUCP (Richard Stevens) (10/20/89)

V7's ed is broken too, so that must have been where it happened,
since Henry says V6 did it correctly.

Also, the 4.3BSD sed doesn't show this bug.

Under SVR3.2 on a 386, ed, vi, *and* sed all show the bug.

	Richard Stevens
	Health Systems International, New Haven, CT
	   stevens@hsi.com

peter@ficc.uu.net (Peter da Silva) (10/20/89)

Brian Beattie's Software-Tools-derived 'ed' works fine, so it looks like
Kernighan got it right.
-- 
Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation.
Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com. `-_-'
"You can tell when a USENET discussion is getting old when one of the      'U`
 participants drags out Hitler and the Nazis" -- Richard Sexton

henry@utzoo.uucp (Henry Spencer) (10/21/89)

In article <4586@fy.sei.cmu.edu> prp@sei.cmu.edu (Patrick Place) writes:
>I would be interested if Henry Spencer, who has a V6 derived ed,
>can determine whether he is using the standard regular expression
>package or if in the derivation path the appropriate routines have
>been fixed.

The derivation path can be described, most inadequately, as "long and
twisty".  The modification log is several hundred lines long.  I took a
quick look through it and found nothing specifically identified as
fixing regular-expression bugs, but there are enough slightly-mysterious
references to cleanups and such that it's hard to be sure.

It would be rather difficult to recover an original V6 source, and I
probably couldn't make it run on a Sun to check.  I may be able to give
a definitive answer about V7 next week.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

chris@mimsy.umd.edu (Chris Torek) (10/21/89)

In article <10036@alice.UUCP> dmr@alice.UUCP writes:
>It looks wiser to attack the problem head-on, by permitting
>the r.e. null string match, but recasting the 's/.../.../g' loop
>so that if the entire r.e. matches a null string immediately to the
>right of a string replaced in the preceding iteration,
>an advance through the subject string is forced.

Interestingly enough, Emacs (Gosling, v 85 or later) gets this right.
It does not go to much effort: the solution is simply `delete the
matched text; insert the new text; if the matched text has length
zero, advance'.  (That is, I think the procedure described above is
excessively careful.  It should not matter whether the null match is
immediately rightward of a preceding replacement.)

(I did once have to fix the Gosmacs code that handled replacements of
the form

	(re-query-replace "^" "stuff")

which, if you answered `no' to the `replace this?' prompt, would
stay in the same place.  The solution is the same: move right if the
match is empty.)
-- 
`They were supposed to be green.'
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

tso@rocky2 (Daniels Tso (Wiesel)) (10/22/89)

In article <1989Oct20.182045.21541@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>It would be rather difficult to recover an original V6 source, and I
>probably couldn't make it run on a Sun to check.  I may be able to give
>a definitive answer about V7 next week.

	For what it's worth, ed is broken on PWB 1.2 (The precursor to
USG System 3.0 that was released between V6 and V7). Almost everything
on PWB 1.2 is V6-like, so I speculate that if V6 ed "works" (as Henry's
suggests) and PWB 1.2 and V7 is broken, that the change was made on the
research group's UNIX before the PWB people picked it up.
	I could fetch a V6 ed and diff it with the PWB ed. The number of
diffs should be very small...
				Cheers,
				Dan Ts'o
				Dept. Neurobiology	212-570-7671
				Rockefeller Univ.	...cmcl2!rna!dan
				1230 York Ave.		dan@rna.rockefeller.edu
				NY, NY 10021		tso@rockefeller.edu

arnold@mathcs.emory.edu (Arnold D. Robbins {EUCC}) (10/25/89)

In article <6606@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>Brian Beattie's Software-Tools-derived 'ed' works fine, so it looks like
>Kernighan got it right.

Indeed.  The "Georgia Tech 'se' Screen Editor", which is also derived from
the Software Tools 'ed', gets it right as well.  ('Se' is available in your
nearest comp.sources.unix archive.)
-- 
Arnold Robbins -- Emory U. Information Technology Div.  | Laundry increases
DOMAIN: arnold@emoryu1.cc.emory.edu			| exponentially in the
UUCP: gatech!emoryu1!arnold  PHONE: +1 404 727-7636	| number of children.
BITNET: arnold@emoryu1	     FAX:   +1 404 727-2599	|   -- Miriam Hartholz

webb@bass.tcspa.ibm.com (Bill Webb) (10/26/89)

>...
> It would be rather difficult to recover an original V6 source, and I
> probably couldn't make it run on a Sun to check.  I may be able to give
> a definitive answer about V7 next week.
> -- 
> A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
> megabyte of flaming.           | uunet!attcan!utzoo!henry
henry@zoo.toronto.edu

I've just tried it on a Version 6 ed (I knew that PDP 11 simulator would
eventually come in handy!) that had only one slight modification over the
standard v6 ed (it output a prompt character) and it has the bug. I could
probably recompile the original ed.c (dated May 13, 1975) and test that but
I expect it has the bug as well, but I'd have recreate the simulated v6
environment to do so which would take a few hours.

----------------------------------------------------------------
The above views are my own, not necessarily those of my employer.
Bill Webb (IBM AWD Palo Alto), (415) 855-4457).
UUCP: ...!uunet!ibmsupt!webb

tale@pawl.rpi.edu (David C Lawrence) (10/26/89)

To provide one more editor data point in this, GNU Emacs'
(replace-regexp "b*a" "c"), which is equivalent to the original
example, works correctly.  JTYMLTK.
-- 
 (setq mail '("tale@pawl.rpi.edu" "tale@itsgw.rpi.edu" "tale@rpitsmts.bitnet"))

eggert@twinsun.com (Paul Eggert) (10/27/89)

tale@pawl.rpi.edu (David C Lawrence) writes:

	GNU Emacs' (replace-regexp "b*a" "c"), which is equivalent to the
	original example, works correctly.

Ah, but (replace-regexp "a*" "b") does not work correctly in GNU Emacs 18.55;
it misses the empty string just before the end of the buffer.  I found this bug
(and posted a bug report) after reading dmr's explanation of the bug that was
avoided in 'ed'.