[comp.bugs.sys5] Bug in sed regexps?

tml@santra.UUCP (Tor Lillqvist) (12/23/87)

I have noticed some strange behaviour in sed (similar problems
probably also exist in other programs that use regexp(3)).

I have the following sed script:

sed -e 's/^\([^.]*\)[^:]*:\([^	]*\)	\1/\2	\1/' \
    -e 's/^\([^.]*\)[^:]*:\([^	]*\)	/\2	\1:/' \
    -e 's/\\-/-/g' -e 's/\\\*-/-/g'\
    -e 's/^\.TH [^ ]* \([^ 	]*\).*	\([^-]*\)/\2(\1)	/'

and this input file:

ssignal.3c:.TH SSIGNAL 3C "" "" HP-UX	ssignal, gsignal \- software signals
stdio.3s:.TH STDIO 3S "" "" HP-UX	stdio \- standard buffered input/output stream file package
stdipc.3c:.TH STDIPC 3C "" "" HP-UX	ftok \- standard interprocess communication package
string.3c:.TH STRING 3C "" "" HP-UX 	strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok \- character string operations
strtod.3c:.TH STRTOD 3C "" "" HP-UX 	strtod, atof, nl_strtod, nl_atof \- convert string to double-precision number

I get the output:

ssignal, gsignal (3C)	- software signals
stdio (3S)	- standard buffered input/output stream file package
ftok (3C)	- standard interprocess communication package
strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok (3C)	- character string operations
strtod, atof, nl_strtod, nl_atof (3C)	- convert string to double-precision number

which isn't what I want.

However, if I change the sed script to:

sed -e 's/^\([^.]*\)\.[^:]*:\([^	]*\)	\1/\2	\1/' \
    -e 's/^\([^.]*\)\.[^:]*:\([^	]*\)	/\2	\1:/' \
    -e 's/\\-/-/g' -e 's/\\\*-/-/g'\
    -e 's/^\.TH [^ ]* \([^ 	]*\).*	\([^-]*\)/\2(\1)	/'

I get:

ssignal, gsignal (3C)	- software signals
stdio (3S)	- standard buffered input/output stream file package
stdipc:ftok (3C)	- standard interprocess communication package
string:strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok (3C)	- character string operations
strtod, atof, nl_strtod, nl_atof (3C)	- convert string to double-precision number

which is what I want. That is, I added a \. after the \([^.]*\) in
the first two expressions.

(As you have probably noticed, I am trying to enhance the
/usr/lib/mkwhatis script so that the whatis database includes the
title of the manual page when it differs from the (first) entry.)

Is this a bug in sed or regexp(3), or what?  The same behaviour occurs
both in HP-UX on the 9000/840 and BSD4.3 on a VAX.
-- 
Tor Lillqvist, Technical Research Centre of Finland
tml@fingate.bitnet == tml@santra.uucp == mcvax!santra!tml

wk@hpirs.HP.COM (Wayne Krone) (12/29/87)

The behaviour noted is correct.  The apparent problem can be reduced to
the first line of the sed script:

    sed -e 's/^\([^.]*\)[^:]*:\([^	]*\)	\1/\2	\1/'

being processed against the third line of the input file:

    stdipc.3c:.TH STDIPC 3C "" "" HP-UX	ftok \- standard ...

which gives the result:

    .TH STDIPC 3C "" "" HP-UX	ftok \- standard ...

when the intent was that this expression should leave that line of
input unchanged.

The first line of the sed script was intended to operate on patterns such
as:

    <string1><.><junk><:><string2><tab><string1>

and so it was expected that line 3 of the input file would not be
processed, because the obvious match for <string1>, "stdipc", does not
appear a second time in the input line after the <tab>.  However, the
regular expression also permits a non-obvious match in which <string1>
is zero characters long (the empty string), and a zero-length <string1>
trivially matches after the <tab>.  Stating the problem another way:
while the regular expression establishes that <string1> cannot extend
past the first ".", it fails to prevent <junk> from matching the
characters before the first ".", here all of "stdipc.3c".
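
The empty match can be made visible with a small sketch that prints what
\1 captured instead of performing the real substitution.  The variable
names (tab, line1, line3, show) are illustrative only, the input lines
are abbreviated, and the trailing .* and bracketed replacement are added
here purely for display; they are not part of the original script.

```shell
# Illustrative sketch: reveal what \1 captures in the first expression.
tab=$(printf '\t')

# Abbreviated versions of input lines 1 and 3 (hypothetical shortening).
line1="ssignal.3c:.TH SSIGNAL 3C HP-UX${tab}ssignal, gsignal"
line3="stdipc.3c:.TH STDIPC 3C HP-UX${tab}ftok"

# Same pattern as the first -e expression, plus a trailing .* so that the
# whole line is replaced by the captured <string1> in brackets.
show="s/^\([^.]*\)[^:]*:\([^${tab}]*\)${tab}\1.*/[\1]/"

cap1=$(printf '%s\n' "$line1" | sed -e "$show")
cap3=$(printf '%s\n' "$line3" | sed -e "$show")

printf 'line 1 captured: %s\n' "$cap1"
printf 'line 3 captured: %s\n' "$cap3"
```

For line 1 this should print [ssignal]; for line 3 it should print [],
showing that the pattern matched with <string1> empty.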

The solution, as you have already noted, is to establish an explicit
boundary between the <string1> and <junk> expressions.
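
As a rough check of that fix, the sketch below runs the original and
corrected forms of the first expression against abbreviated versions of
input lines 1 and 3.  The variable names (loose, strict, and so on) are
illustrative and the input lines are shortened; only the two sed
expressions are taken from the scripts in the original article.

```shell
# Illustrative sketch: original vs. corrected first expression.
tab=$(printf '\t')

# Abbreviated versions of input lines 1 and 3 (hypothetical shortening).
line1="ssignal.3c:.TH SSIGNAL 3C HP-UX${tab}ssignal, gsignal"
line3="stdipc.3c:.TH STDIPC 3C HP-UX${tab}ftok"

# Original: nothing separates <string1> from <junk>, so <string1> may
# shrink to the empty string.
loose="s/^\([^.]*\)[^:]*:\([^${tab}]*\)${tab}\1/\2${tab}\1/"

# Corrected: the literal \. pins <string1> to the text before the first ".".
strict="s/^\([^.]*\)\.[^:]*:\([^${tab}]*\)${tab}\1/\2${tab}\1/"

loose3=$(printf '%s\n' "$line3" | sed -e "$loose")
strict3=$(printf '%s\n' "$line3" | sed -e "$strict")
strict1=$(printf '%s\n' "$line1" | sed -e "$strict")

printf 'original,  line 3: %s\n' "$loose3"
printf 'corrected, line 3: %s\n' "$strict3"
printf 'corrected, line 1: %s\n' "$strict1"
```

With the corrected expression, line 3 should come through unchanged,
while line 1, where the page name really does repeat after the <tab>,
is still rewritten.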

Wayne Krone
Hewlett-Packard