[comp.lang.icon] deleting characters

cargo@TARDIS.CRAY.COM (David S. Cargo) (01/20/90)

After sending out my mail about deleting or stripping characters out of a
string I received a couple of responses from kindly Iconists.  I later was
showing off Icon to a friend when I notices a "delete" function in
strutil.icn.  Upon closer examination I found that delete also did what I
wanted.  The next step for me was to write a little benchmarking program so
I could see how the different methods compared speedwise.  It turns out
that the IPL delete routine was the slowest, although the fastest was only
15 percent faster.

Here is the test program:

procedure main()
    test_string :=
        "    str ? while tab(upto(chars)) do text ||:= tab(many(chars))"
    remove := &cset -- &lcase
    time1 := &time
    limit := 1000
    every 1 to limit do result1 := delete(test_string, remove)
    time2 := &time
    every 1 to limit do result2 := strip1(test_string, remove)
    time3 := &time
    every 1 to limit do result3 := strip2(test_string, remove)
    time4 := &time
    write(time1)
    write(time2-time1, " ", result1)
    write(time3-time2, " ", result2)
    write(time4-time3, " ", result3)
    return
end

#  from IPL strutil.icn
#  delete characters
#
procedure delete(s,c)
    local i
    while i := upto(c,s) do
        s[i:many(c,s,i)] := ""
    return s
end

#From: Richard Goerwitz <goer@sophist.uchicago.edu>
procedure strip1(s,c)
    s2 := ""
    s ? {
        while s2 ||:= tab(upto(c)) do
            tab(many(c))
        s2 ||:= tab(0)
        }
  return s2
end


#From: Chris Tenaglia - 257-8765 <tenaglia@mis.mcw.edu>
procedure strip2(str,chrs)
    local text,chars
#    /chrs := ' '
#   (I commmented this out because the others don't do such checks.)
    chars := &cset -- chrs
    text  := ""
    str ? while tab(upto(chars)) do text ||:= tab(many(chars))
    return text
end

And the output (first column is time, second column is the result string
which is always supposed to be the same).  Initial time is 0, other times
are in milliseconds.

0
21033 strwhiletabuptocharsdotexttabmanychars
20350 strwhiletabuptocharsdotexttabmanychars
17717 strwhiletabuptocharsdotexttabmanychars

All three are reasonable, but the differences in approach are educational.
                         o       o
                          \_____/
                          /-o-o-\     _______
DDDD      SSSS   CCCC    /   ^   \   /\\\\\\\\
D   D    S      C        \ \___/ /  /\   ___  \
D   D     SSS   C         \_   _/  /\   /\\\\  \
D   D        S  C           \  \__/\   /\ @_/  /
DDDDavid SSSS.   CCCCargo    \____\____\______/ CARGO@TARDIS.CRAY.COM

flee@SHIRE.CS.PSU.EDU (Felix Lee) (01/20/90)

> It turns out that the IPL delete routine was the slowest, although
> the fastest was only 15 percent faster.

On pathological cases, the delete routine can be much slower.  Try
removing spaces from repl("a   ", 1000).

The delete routine is quadratic wrt the length of the source string,
while the strip routines are quadratic wrt the result.  This is due
to the terrible amount of copying involved in manipulating Icon
strings: delete has to copy 3997 + 3994 + ... + 1000 characters,
while the other procedures need only copy 1 + 2 + ... + 1000.

You can get linear performance if you do it in C.
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee

gmt@CS.ARIZONA.EDU ("Gregg Townsend") (01/20/90)

    Felix Lee (flee@shire.cs.psu.edu) writes:

    On pathological cases, the delete routine can be much slower.

True.
    
    ...You can get linear performance if you do it in C.

You get it in Icon, too.  Both Goerwitz's and Tenaglia's procedures are linear.

Building strings by successive concatenation is sufficiently common that it was
worth optimizing the implementation.  If no other concurrent activity disrupts
things, only the new characters (those added at the end) are copied.

    Gregg Townsend / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
    +1 602 621 4325     gmt@cs.arizona.edu     110 57 16 W / 32 13 45 N / +758m

wgg@CS.WASHINGTON.EDU (William Griswold) (01/20/90)

> The delete routine is quadratic wrt the length of the source string,
> while the strip routines are quadratic wrt the result.

I may be wrong, but I believe you will find that in modern implementations 
of Icon that the strip routines are linear in time.  Icon is smart enough 
to know that a string is located at the end of the string memory region 
(in this case the value of the variable holding the accumulating result 
string), and can just add to the end of it to concatenate.  Any other 
*modification* of a string requires copying--substring creation does not
require copying, since it is implemented as a pointer and an index. 

> You can get linear performance if you do it in C.

Many common operations in Icon require *more* time to perform in C--using
available abstractions--such as computing the length of a string.  Also
note that string concatentation in C in the standard way (using strcat)
takes linear time.  It also requires knowing the destination string is
long enough to hold the longer result.  Thus making strip as ``fast'' as
Icon's requires a little effort. 

					Bill Griswold

flee@SHIRE.CS.PSU.EDU (Felix Lee) (01/20/90)

> If no other concurrent activity disrupts things, only the new characters
> (those added at the end) are copied.

Ah, I forgot about that optimization.
--
Felix

jac@paul.rutgers.edu (Jonathan A. Chandross) (01/22/90)

wgg@CS.WASHINGTON.EDU (William Griswold)
> Many common operations in Icon require *more* time to perform in C--using
> available abstractions--such as computing the length of a string.  Also
> note that string concatentation in C in the standard way (using strcat)
> takes linear time.  It also requires knowing the destination string is
> long enough to hold the longer result.  Thus making strip as ``fast'' as
> Icon's requires a little effort. 

I don't know if your statement is totally fair.  There is nothing to
prevent one from using BCPL style strings (i.e. also store a length
with the string) in a C program.

In fact, this is done.  The MESA language (XEROX) generates C code
which is then compiled normally.  Strings in MESA are stored with
a length, and are word aligned.  This allows strcpy, strcmp, et al
to work on word quantities,  producing much faster string routines. 
I see no reason (aside from inertia) for why this has not been done
to C.  (Well, one would have to write routines to convert from the
library function's notion of a character string to the new one with
a length.)

A while back I needed to derive the name from a file pointer.  Since
stdio does not support this I had to write a piece of code like:
	struct N_FILE {
		char	*name;
		FILE	*file;
		};
and the associated front-end routines for stdio.  This was not hard
to do, and did not take all that much time.

Of course, one could say that the pattern matching, associative
table features, etc. that make Icon so popular could also be added
to C using the argument I give above.  I won't (and can't) defend
such a statement.  My point is that condemning C for a shortcoming
in the library routines is not really fair.  Especially when that
problem could be fixed in a few days hacking.


Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac