tml@santra.UUCP (Tor Lillqvist) (12/23/87)
I have noticed some strange behaviour in sed (similar problems probably also exist in other users of regexp(3)). I have the following sed script:

    sed -e 's/^\([^.]*\)[^:]*:\([^ ]*\) \1/\2 \1/' \
        -e 's/^\([^.]*\)[^:]*:\([^ ]*\) /\2 \1:/' \
        -e 's/\\-/-/g' -e 's/\\\*-/-/g'\
        -e 's/^\.TH [^ ]* \([^ ]*\).* \([^-]*\)/\2(\1) /'

and this input file:

    ssignal.3c:.TH SSIGNAL 3C "" "" HP-UX ssignal, gsignal \- software signals
    stdio.3s:.TH STDIO 3S "" "" HP-UX stdio \- standard buffered input/output stream file package
    stdipc.3c:.TH STDIPC 3C "" "" HP-UX ftok \- standard interprocess communication package
    string.3c:.TH STRING 3C "" "" HP-UX strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok \- character string operations
    strtod.3c:.TH STRTOD 3C "" "" HP-UX strtod, atof, nl_strtod, nl_atof \- convert string to double-precision number

I get the output:

    ssignal, gsignal (3C) - software signals
    stdio (3S) - standard buffered input/output stream file package
    ftok (3C) - standard interprocess communication package
    strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok (3C) - character string operations
    strtod, atof, nl_strtod, nl_atof (3C) - convert string to double-precision number

which isn't what I want. However, if I change the sed script to:

    sed -e 's/^\([^.]*\)\.[^:]*:\([^ ]*\) \1/\2 \1/' \
        -e 's/^\([^.]*\)\.[^:]*:\([^ ]*\) /\2 \1:/' \
        -e 's/\\-/-/g' -e 's/\\\*-/-/g'\
        -e 's/^\.TH [^ ]* \([^ ]*\).* \([^-]*\)/\2(\1) /'

I get:

    ssignal, gsignal (3C) - software signals
    stdio (3S) - standard buffered input/output stream file package
    stdipc:ftok (3C) - standard interprocess communication package
    string:strcat, strncat, strcmp, strncmp, strcpy, strncpy, strlen, strchr, strrchr, strpbrk, strspn, strcspn, strtok (3C) - character string operations
    strtod, atof, nl_strtod, nl_atof (3C) - convert string to double-precision number

which is what I want. I.e. I add an \. after the \([^.]*\).

(As you probably notice, I am trying to enhance the /usr/lib/mkwhatis script so that the whatis database would include the title of the manual page in case it isn't the same as the (first) entry.)

Is this a bug in sed or regexp(3), or what? The same behaviour occurs both in HP-UX on the 9000/840 and BSD4.3 on a VAX.

--
Tor Lillqvist, Technical Research Centre of Finland
tml@fingate.bitnet == tml@santra.uucp == mcvax!santra!tml
wk@hpirs.HP.COM (Wayne Krone) (12/29/87)
The behaviour noted is correct. The apparent problem can be reduced to the first line of the sed script:

    sed -e 's/^\([^.]*\)[^:]*:\([^ ]*\) \1/\2 \1/'

being processed against the third line of the input file:

    stdipc.3c:.TH STDIPC 3C "" "" HP-UX ftok \- standard ...

which gives the result:

    .TH STDIPC 3C "" "" HP-UX ftok \- standard ...

when what was wanted was no change to that line of input by that line of the sed script.

The first line of the sed script was intended to operate on patterns such as:

    <string1><.><junk><:><string2><tab><string1>

and so it was expected that line 3 of the input file would not be processed because the obvious match for <string1> of "stdipc" did not appear a second time in the input line after a <tab>. However, based upon the regular expression, the non-obvious match for <string1> is zero characters (the NULL string) and <string1> as a zero length pattern does match after the <tab>.

Stating the problem another way, while the regular expression establishes that <string1> can not extend past the first ".", it fails to prevent <junk> from matching characters before the first ".". The solution, as you have already noted, is to establish an explicit boundary between the <string1> and <junk> expressions.

Wayne Krone
Hewlett-Packard
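The empty-match explanation above can be checked in isolation with a small sketch that runs only the first expression, in its original and corrected forms, against the stdipc line. The variable names `line`, `loose`, and `strict` are made up for this illustration, and the field separator is written as a single space, which is an assumption: the thread describes it as a <tab>, and the posted script may have contained literal tabs.

```shell
#!/bin/sh
# Third input line from the thread. The separator before "ftok" is written
# as one space here (assumption; the original may have been a tab).
line='stdipc.3c:.TH STDIPC 3C "" "" HP-UX ftok \- standard interprocess communication package'

# Original expression: nothing separates \([^.]*\) from [^:]*, so \1 is
# free to match zero characters, and an empty \1 then "matches" again
# after the separator.
loose=$(printf '%s\n' "$line" | sed -e 's/^\([^.]*\)[^:]*:\([^ ]*\) \1/\2 \1/')

# Corrected expression: the literal \. pins \1 to the text before the
# first dot ("stdipc"), which does not recur, so no substitution happens.
strict=$(printf '%s\n' "$line" | sed -e 's/^\([^.]*\)\.[^:]*:\([^ ]*\) \1/\2 \1/')

printf 'loose:  %s\n' "$loose"
printf 'strict: %s\n' "$strict"
```

With a sed that handles back-references as described in the thread, `loose` should come out with the leading "stdipc.3c:" stripped, while `strict` should be the input line unchanged.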