[comp.lang.prolog] PROLOG Digest V5 #100

PROLOG-REQUEST@SUSHI.STANFORD.EDU (Chuck Restivo, The Moderator) (12/30/87)
PROLOG Digest           Thursday, 31 Dec 1987     Volume 5 : Issue 100

Today's Topics:
                       Announcement - New List,
                Implementation - String Concatenation
----------------------------------------------------------------------

Date: Wed, 16 Dec 87 14:45:25 CST
From: reddy@b.cs.uiuc.edu (Uday S. Reddy)
Subject: Mailing list on narrowing

A new mailing list dedicated to the discussion of narrowing in
functional and equational languages is being created.  Narrowing is an
operational mechanism using which expressions with free variables can
be "executed" producing solutions for the free variables.  This is one
of the approaches to combining functional programming and logic
programming paradigms into a unified framework.

The mailing list will be based at University of Illinois, at the
address narrow@a.cs.uiuc.edu.  Please send requests for addition to
the mailing list to narrow-request@a.cs.uiuc.edu or
reddy@a.cs.uiuc.edu.

-- Uday Reddy

------------------------------

Date: Tue, 29 Dec 87 00:06:55 PST
From: quintus!ok@Sun.COM (Richard A. O'Keefe)
Subject: On String Concatenation

    Although several Prolog dialects provide operations on strings, as
many more do not, and among those that do, the names and capabilities of
the operations vary considerably.  Most of them are just warmed-over
BASIC, which means that as part of a "logic" programming language they
are very badly designed indeed.  Come to that, they aren't all that
wonderful in BASIC, either.  And if people at ----- or --- or --- or ---
or ------ ------- ------------- think I'm talking about their product, I
*am*. Very badly designed indeed.  Most of the others are warmed-over
Lisp, which means that although they are not that badly designed, they
may fairly be described as excessively restrictive.

    I have just been looking at a medium size program (6,400 lines of
Prolog-with-a-blunt-knife) which uses string concatenation lavishly.
Just in case you are alarmed, this is NOT the Quintus Prolog compiler,
nor any other Quintus product.  In the Quintus Prolog version of this
program, the operation is coded thus:

concat_atom(Constants, Atom) :-
        concat_chars(Constants, Chars),
        atom_chars(Atom, Chars).        % you may need name/2

concat_chars([], []).
concat_chars([Constant|Constants], Chars) :-
        name(Constant, Name),
        append(Name, Chars1, Chars),
        concat_chars(Constants, Chars1).

    This is fine, as far as it goes.  Prolog atoms are an almost
perfect analogue of SNOBOL IV strings.  But there are three snags:
(a) in most Prolog systems known to me, the atom length bound is
    smaller than the string length bound.  A good design rule is
    that the Prolog atom length bound should be no less than the
    host operating system's file-name length bound, which is about
    1024 in a good UNIX.  Few Prolog systems meet this not-very-
    stringent condition.
(b) in many Prolog systems there is a smallish limit on the number
    of atoms you can have.  For example, in --- Prolog, you can
    only have 907.  When you recall that NU7 Prolog, running on a
    PDP-11 with 64kbytes of address space, would let you have 1023
    atoms, a limit of 907 on an IBM PC or a Macintosh is absurd.
    Come to that, any limit other than "storage exhausted" is absurd.
(c) Restriction (b) might be tolerable, but most Prolog systems do
    not garbage collect atoms.  (There is no reason why they can't,
    they just don't.)

    So if you rely too heavily on concat_atom/2, you run the risk of
(a) not being able to represent the result, (b) running out of atoms
-- even though there might be lots of storage left --, or (c) tying
up storage forever with atoms that were useful once.

    This sounds like a powerful argument for having a string data-
type in the language.  But is it?
(a) There is no reason why atoms should have to be short.  If space
    is tight, you might want to represent "normal" atoms by a
    pointer to a length byte followed by that many characters, but
    there is no reason why you couldn't reserve length 255 to mean
    "long name" and use some other representation for long atoms.
(b) There is no good reason for the atom count limit being small.
    I put together a symbol table design for the IBM PC where the
    limit was 2^16 minus a few; this didn't even store atom lengths
    explicitly, and was very compact.  If you can do it for the IBM
    PC, you can do it for decent machines.
(c) There is no reason why the atom table couldn't be garbage
    collected.  SNOBOL did it, after all.

    But still, it must be more efficient to allocate strings in the same
area as compound terms and have them vanish on backtracking, and never
have to look them up in the symbol table?  This is almost certainly
true, but the odd thing is that if efficiency is your concern you are
better off using lists of character codes than strings.  This is a very
surprising result, but it is true in the Lisp systems I have tried too.
[In case this isn't clear, I mean that I have MEASURED the cost of
strings -vs- lists of character codes, and found that in well-designed
Prolog and Lisp systems using lists is FASTER.]

    Even if efficiency is on the side of lists of character codes, they
do take up more space than strings.  So there is still some sort of an
argument for strings here, especially on PCs and Mac.  True, but there
is a better argument for not doing concatenation at all.

    I thought it would be interesting to see how the program mentioned
above used concatenation.  Eight different uses could be found:

(1) Making a file name.
    The typical example was
        concat_atom([Base,'.s'], AssemFileName),
        open(AssemFileName, write, AssemStream)

(2) Making a prompt.
    The typical example was
        concat_atom(['(',State,') ?- '], NewPrompt),
        prompt(OldPrompt, NewPrompt),
            <stuff>
        prompt(_, OldPrompt)

(3) Making a comment to write in a generated file.
    The typical example was
        concat_atom(['#   Date   : ',Day,-,Month,-,Year], DateLine),
        ...
        write_to_assem_file([FileLine,DateLine,VersionInfo])

(4) As a preliminary to making a list of character codes,
    also to be written in a generated file.
    The typical example was
        concat_atom(['source function: ',Function,/,Arity], Comment),
        name(Comment, CommentChars),
        ...
        write_to_assem_file([...,c(CommentChars),...])

(5) Changing the name of a compound term.
    This particular program has two families of constructor functions
    (among others):  where one has <foo> the other has c<foo>.
    The typical example was
        PlainTerm =.. [PlainFunction|Args],
        concat_atom([c,PlainFunction], DecoratedFunction),
        DecoratedTerm =.. [DecoratedFunction|Args]

(6) Constructing an error message.
    The typical example was
        target_machine(Machine),
        concat_atom(['A ',Machine,' cannot handle ',
            Operation,' on ',Size,' data'], ErrMsg),
        report_error(ErrMsg, ...)

(7) Turning a term into an atom.
    The typical example was
        flatten(X+Y, Flat) :-
                flatten(X, X1),
                flatten(Y, Y1),
                concat_atom([X1,+,Y1], Flat).

(8) Generating labels.
    The typical example was
        label_bounds(vms, '', '$').
        label_bounds(unix, 'L', '').

        make_label(Number, Label) :-
                target_os(OS),
                label_bounds(OS, Prefix, Suffix),
                concat_atom([Prefix,Number,Suffix], Label).

Comments.
(1) This shows up a real deficiency in most Prolog systems, but the
    solution adopted was no better.  Common Lisp has addressed this
    problem.  The problem is the problem of handling file names in
    a reasonably portable way.  We have something for internal use
    here that looks quite good, but then we decided to do an IBM
    mainframe port, and it didn't look quite so portable any more.
    What we'd really like to do in this case is
        file_extension(assembler, Extension),
        open('b.e'(Base,Extension), write, AssemStream)
    or something like that.

    The relevance of this to string operations is that we don't
    really want a string (or, for that matter, an atom) at all.
    The syntax of file names is something we want to keep OUT of
    our source code.  What we want here is operations on file
    names such as Common Lisp provides, not operations on strings.
    In fact string operations are unbelievably clumsy for such
    elementary things as "parent directory".

(2) The prompt/2 operation being specified to take atoms, concat_atom/2
    is needed here.  On the other hand, prompt/2 doesn't do quite what
    the programmer of this program thought (it sets the prompt used by
    *programmed* input, not the prompt used by the top level) so in
    fact the operation accomplished nothing at all.  So this can't be
    all that vital.  In any case, it only happens once, so making an
    atom is no big deal.

(3) String concatenation should not be used here (A).
(4) In this case, it would have been better to call concat_chars/2
    directly, rather than pack the result into an atom and straight
    away unpack it again.  Anyway, string concatenation should not
    be used here (A).

(5) The right way to do this is to make a table
        decorate(foo(X,Y),      cfoo(X,Y)).
        decorate(baz(X),        cbaz(X)).
        decorate(ugh(X,Y,Z),    cugh(X,Y,Z)).
        ...
                decorate(PlainTerm, DecoratedTerm)
        ...
    This is considerably faster and purer.

(6) String concatenation should not be used here (B).
(7) String concatenation should not be used here (A).
(8) String concatenation should not be used here (A).

(A) All of these particular uses are constructing something which
    is going to be written to a file.  Now, we DO care that the
    right sequence of characters should be written to the file,
    but we DON'T need to represent that sequence of characters
    EXPLICITLY in the program!

    For example, we can handle example (8) thus:
        make_label(Number, label(Number)).
        ...
        write_operand(label(Number)) :-
                target_os(OS),
                label_format(OS, Format),
                format(Format, [Number]).

    Example (7) and lots of others like it in the program can be
    handled by more clauses of the same sort:

        write_operand(A+B) :-
                write_operand(A),
                write(+),
                write_operand(B).

    What is happening here is that instead of constructing a string
    we are constructing a compound term, and finally we are saying
    how to print these compound terms.

    There are so many advantages to doing it this way!  For example,
    we can easily produce several different printing routines (not
    just different ones for different operating systems; we could
    have a debugging version which prints abbreviated forms to the
    terminal so that we can see what is happening).  Best of all,
    we keep the data structures as *Prolog* data structures for as
    long as possible, meaning that we can unpack them using
    unification rather than slow string operations, and we can leave
    parts of them uninstantiated (for all time, if we so choose).
    The long and short of it is that this approach is much more
    efficient, and much more flexible, than consing character strings
    all over the place.

(B) Once again, the program as such has no use for the character
    string.  The appropriate thing is to imitate Common Lisp (well,
    when we did it in DEC-10 Prolog years ago, we thought we were
    imitating C, but it comes to the same thing).  Instead of
    consing up a string, you pass your error reporting command a
    format and a list of arguments.  Thus we might do

                target_machine(Machine),
                report_error('A ~w cannot handle ~w on ~w data',
                        [Machine,Operation,Size], ...)

        report_error(Format, Arguments, ...) :-
                current_input(Stream),
                current_stream(FileName, read, Stream),
                line_count(Stream, Line),
                format(user_error, '~N! Error near line ~w of ~w~n! ',
                        [Line, FileName]),
                format(user_error, Format, Arguments),
                format(user_error, '~n', []),
                interact_with_user(...).

    format/2 and format/3 are specific to Quintus Prolog, just as
    writef/2 and fwritef/3 were specific to DEC-10 Prolog.  That
    doesn't matter.  It is trivially easy to knock together something
    of the sort which is powerful enough to handle your error messages.

    The same technique of passing around a "format" and a list of
    arguments for it may be applicable in other cases as well.

The point is:
    unless the program itself has a use for the string,
    don't make it.  Leave the thing as a compound term or something
    right up until you write it out.

It is sobering to reflect that the reason why C is such a good language
for writing text-processing applications is that it HASN'T got a string
data type.

Some other operations.
(i)  Pattern matching.
     Grammar rules are far and away the best way of doing this in Prolog.
     SNOBOL and ICON fans will recognise most of their favourite
     pattern matching operations as basic Prolog control structures
     (cut = FENCE) and as I pointed out earlier, grammar rules in a
     good Prolog are likely to be faster than string operations.

(ii) Accessing data fields in external data bases.
     The argument here is that the external data base may contain
     millions of character fields which we do not loaded into Prolog's
     symbol table.  This is a very strong argument indeed in favour of
     what have been called "uninterned atoms" (more or less what all
     the atoms in the LOGIX system are).  But
(ii.1)  If the external data base is a relational one, and if the
        data base designer had even the faintest understanding of
        what it means to design a relational data base, you will not
        want to do string operations on these character fields.
        (Domains in a relational data base are supposed to be ATOMIC
        as far as the application is concerned.)  Dates might be an
        exception, but they should be stored as a DATE data type or
        as coded numbers.
(ii.2)  If you are going to do any joins in Prolog, you definitely
        DO want the attributes in that domain to be held uniquely (at
        least for the duration of a query) in Prolog, so why not put
        them in the hash table?
(iii.3) Attributes which are only involved in selection, and are not
        included directly in the answer, should be kept on the other
        side of the interface.
     So the set of attributes where uninterned atoms are wanted is the
     set of attributes which are going to be printed but not used in
     joins on the Prolog side of the interface (and in particular, are
     not going to be stored in the internal data base).
     For this application, it looks as though an atom symbol table
     which lent itself to fast garbage collection would be little if
     at all worse than the use of uninterned atoms.

    I have used a lot of programming languages having a string data
type.  I am STILL looking for an application where it would be a good
idea to use them in a Prolog which had them.  If we can't get a
discussion going about Prolog textbooks, can we get a discussion going
about this?
The topic for debate is:
        assuming a Prolog system which imposes no arbitrary
        restrictions on atoms, so that the only difference between
        atoms and strings is that atoms are stored uniquely and
        recovered by garbage collection (so allocation is slow,
        but equality testing is rapid) whereas strings are not
        stored uniquely and are recovered by both backtracking
        and garbage collection (so allocation is fast but
        equality testing is slow)

        what examples can be found, OTHER than communication with
        another language (such as Lisp) which supports strings,
        where it is a good idea to USE strings?

        Practical experience concerning the relative utility of
        strings and uninterned atoms from people who are interfacing
        Prolog with a large external data base would be useful.

------------------------------

End of PROLOG Digest
********************