PROLOG-REQUEST@SUSHI.STANFORD.EDU (Chuck Restivo, The Moderator) (12/30/87)
PROLOG Digest Thursday, 31 Dec 1987 Volume 5 : Issue 100
Today's Topics:
Announcement - New List,
Implementation - String Concatenation
----------------------------------------------------------------------
Date: Wed, 16 Dec 87 14:45:25 CST
From: reddy@b.cs.uiuc.edu (Uday S. Reddy)
Subject: Mailing list on narrowing
A new mailing list dedicated to the discussion of narrowing in
functional and equational languages is being created. Narrowing is an
operational mechanism using which expressions with free variables can
be "executed" producing solutions for the free variables. This is one
of the approaches to combining functional programming and logic
programming paradigms into a unified framework.
The mailing list will be based at University of Illinois, at the
address narrow@a.cs.uiuc.edu. Please send requests for addition to
the mailing list to narrow-request@a.cs.uiuc.edu or
reddy@a.cs.uiuc.edu.
-- Uday Reddy
------------------------------
Date: Tue, 29 Dec 87 00:06:55 PST
From: quintus!ok@Sun.COM (Richard A. O'Keefe)
Subject: On String Concatenation
Although several Prolog dialects provide operations on strings, as
many more do not, and among those that do, the names and capabilities of
the operations vary considerably. Most of them are just warmed-over
BASIC, which means that as part of a "logic" programming language they
are very badly designed indeed. Come to that, they aren't all that
wonderful in BASIC, either. And if people at ----- or --- or --- or ---
or ------ ------- ------------- think I'm talking about their product, I
*am*. Very badly designed indeed. Most of the others are warmed-over
Lisp, which means that although they are not that badly designed, they
may fairly be described as excessively restrictive.
I have just been looking at a medium size program (6,400 lines of
Prolog-with-a-blunt-knife) which uses string concatenation lavishly.
Just in case you are alarmed, this is NOT the Quintus Prolog compiler,
nor any other Quintus product. In the Quintus Prolog version of this
program, the operation is coded thus:
concat_atom(Constants, Atom) :-
concat_chars(Constants, Chars),
atom_chars(Atom, Chars). % you may need name/2
concat_chars([], []).
concat_chars([Constant|Constants], Chars) :-
name(Constant, Name),
append(Name, Chars1, Chars),
concat_chars(Constants, Chars1).
This is fine, as far as it goes. Prolog atoms are an almost
perfect analogue of SNOBOL IV strings. But there are three snags:
(a) in most Prolog systems known to me, the atom length bound is
smaller than the string length bound. A good design rule is
that the Prolog atom length bound should be no less than the
host operating system's file-name length bound, which is about
1024 in a good UNIX. Few Prolog systems meet this not-very-
stringent condition.
(b) in many Prolog systems there is a smallish limit on the number
of atoms you can have. For example, in --- Prolog, you can
only have 907. When you recall that NU7 Prolog, running on a
PDP-11 with 64kbytes of address space, would let you have 1023
atoms, a limit of 907 on an IBM PC or a Macintosh is absurd.
Come to that, any limit other than "storage exhausted" is absurd.
(c) Restriction (b) might be tolerable, but most Prolog systems do
not garbage collect atoms. (There is no reason why they can't,
they just don't.)
So if you rely too heavily on concat_atom/2, you run the risk of
(a) not being able to represent the result, (b) running out of atoms
-- even though there might be lots of storage left --, or (c) tying
up storage forever with atoms that were useful once.
This sounds like a powerful argument for having a string data-
type in the language. But is it?
(a) There is no reason why atoms should have to be short. If space
is tight, you might want to represent "normal" atoms by a
pointer to a length byte followed by that many characters, but
there is no reason why you couldn't reserve length 255 to mean
"long name" and use some other representation for long atoms.
(b) There is no good reason for the atom count limit being small.
I put together a symbol table design for the IBM PC where the
limit was 2^16 minus a few; this didn't even store atom lengths
explicitly, and was very compact. If you can do it for the IBM
PC, you can do it for decent machines.
(c) There is no reason why the atom table couldn't be garbage
collected. SNOBOL did it, after all.
But still, it must be more efficient to allocate strings in the same
area as compound terms and have them vanish on backtracking, and never
have to look them up in the symbol table? This is almost certainly
true, but the odd thing is that if efficiency is your concern you are
better off using lists of character codes than strings. This is a very
surprising result, but it is true in the Lisp systems I have tried too.
[In case this isn't clear, I mean that I have MEASURED the cost of
strings -vs- lists of character codes, and found that in well-designed
Prolog and Lisp systems using lists is FASTER.]
Even if efficiency is on the side of lists of character codes, they
do take up more space than strings. So there is still some sort of an
argument for strings here, especially on PCs and Mac. True, but there
is a better argument for not doing concatenation at all.
I thought it would be interesting to see how the program mentioned
above used concatenation. Eight different uses could be found:
(1) Making a file name.
The typical example was
concat_atom([Base,'.s'], AssemFileName),
open(AssemFileName, write, AssemStream)
(2) Making a prompt.
The typical example was
concat_atom(['(',State,') ?- '], NewPrompt),
prompt(OldPrompt, NewPrompt),
<stuff>
prompt(_, OldPrompt)
(3) Making a comment to write in a generated file.
The typical example was
concat_atom(['# Date : ',Day,-,Month,-,Year], DateLine),
...
write_to_assem_file([FileLine,DateLine,VersionInfo])
(4) As a preliminary to making a list of character codes,
also to be written in a generated file.
The typical example was
concat_atom(['source function: ',Function,/,Arity], Comment),
name(Comment, CommentChars),
...
write_to_assem_file([...,c(CommentChars),...])
(5) Changing the name of a compound term.
This particular program has two families of constructor functions
(among others): where one has <foo> the other has c<foo>.
The typical example was
PlainTerm =.. [PlainFunction|Args],
concat_atom([c,PlainFunction], DecoratedFunction),
DecoratedTerm =.. [DecoratedFunction|Args]
(6) Constructing an error message.
The typical example was
target_machine(Machine),
concat_atom(['A ',Machine,' cannot handle ',
Operation,' on ',Size,' data'], ErrMsg),
report_error(ErrMsg, ...)
(7) Turning a term into an atom.
The typical example was
flatten(X+Y, Flat) :-
flatten(X, X1),
flatten(Y, Y1),
concat_atom([X1,+,Y1], Flat).
(8) Generating labels.
The typical example was
label_bounds(vms, '', '$').
label_bounds(unix, 'L', '').
make_label(Number, Label) :-
target_os(OS),
label_bounds(OS, Prefix, Suffix),
concat_atom([Prefix,Number,Suffix], Label).
Comments.
(1) This shows up a real deficiency in most Prolog systems, but the
solution adopted was no better. Common Lisp has addressed this
problem. The problem is the problem of handling file names in
a reasonably portable way. We have something for internal use
here that looks quite good, but then we decided to do an IBM
mainframe port, and it didn't look quite so portable any more.
What we'd really like to do in this case is
file_extension(assembler, Extension),
open('b.e'(Base,Extension), write, AssemStream)
or something like that.
The relevance of this to string operations is that we don't
really want a string (or, for that matter, an atom) at all.
The syntax of file names is something we want to keep OUT of
our source code. What we want here is operations on file
names such as Common Lisp provides, not operations on strings.
In fact string operations are unbelievably clumsy for such
elementary things as "parent directory".
(2) The prompt/2 operation being specified to take atoms, concat_atom/2
is needed here. On the other hand, prompt/2 doesn't do quite what
the programmer of this program thought (it sets the prompt used by
*programmed* input, not the prompt used by the top level) so in
fact the operation accomplished nothing at all. So this can't be
all that vital. In any case, it only happens once, so making an
atom is no big deal.
(3) String concatenation should not be used here (A).
(4) In this case, it would have been better to call concat_chars/2
directly, rather than pack the result into an atom and straight
away unpack it again. Anyway, string concatenation should not
be used here (A).
(5) The right way to do this is to make a table
decorate(foo(X,Y), cfoo(X,Y)).
decorate(baz(X), cbaz(X)).
decorate(ugh(X,Y,Z), cugh(X,Y,Z)).
...
decorate(PlainTerm, DecoratedTerm)
...
This is considerably faster and purer.
(6) String concatenation should not be used here (B).
(7) String concatenation should not be used here (A).
(8) String concatenation should not be used here (A).
(A) All of these particular uses are constructing something which
is going to be written to a file. Now, we DO care that the
right sequence of characters should be written to the file,
but we DON'T need to represent that sequence of characters
EXPLICITLY in the program!
For example, we can handle example (8) thus:
make_label(Number, label(Number)).
...
write_operand(label(Number)) :-
target_os(OS),
label_format(OS, Format),
format(Format, [Number]).
Example (7) and lots of others like it in the program can be
handled by more clauses of the same sort:
write_operand(A+B) :-
write_operand(A),
write(+),
write_operand(B).
What is happening here is that instead of constructing a string
we are constructing a compound term, and finally we are saying
how to print these compound terms.
There are so many advantages to doing it this way! For example,
we can easily produce several different printing routines (not
just different ones for different operating systems; we could
have a debugging version which prints abbreviated forms to the
terminal so that we can see what is happening). Best of all,
we keep the data structures as *Prolog* data structures for as
long as possible, meaning that we can unpack them using
unification rather than slow string operations, and we can leave
parts of them uninstantiated (for all time, if we so choose).
The long and short of it is that this approach is much more
efficient, and much more flexible, than consing character strings
all over the place.
(B) Once again, the program as such has no use for the character
string. The appropriate thing is to imitate Common Lisp (well,
when we did it in DEC-10 Prolog years ago, we thought we were
imitating C, but it comes to the same thing). Instead of
consing up a string, you pass your error reporting command a
format and a list of arguments. Thus we might do
target_machine(Machine),
report_error('A ~w cannot handle ~w on ~w data',
[Machine,Operation,Size], ...)
report_error(Format, Arguments, ...) :-
current_input(Stream),
current_stream(FileName, read, Stream),
line_count(Stream, Line),
format(user_error, '~N! Error near line ~w of ~w~n! ',
[Line, FileName]),
format(user_error, Format, Arguments),
format(user_error, '~n', []),
interact_with_user(...).
format/2 and format/3 are specific to Quintus Prolog, just as
writef/2 and fwritef/3 were specific to DEC-10 Prolog. That
doesn't matter. It is trivially easy to knock together something
of the sort which is powerful enough to handle your error messages.
The same technique of passing around a "format" and a list of
arguments for it may be applicable in other cases as well.
The point is:
unless the program itself has a use for the string,
don't make it. Leave the thing as a compound term or something
right up until you write it out.
It is sobering to reflect that the reason why C is such a good language
for writing text-processing applications is that it HASN'T got a string
data type.
Some other operations.
(i) Pattern matching.
Grammar rules are far and away the best way of doing this in Prolog.
SNOBOL and ICON fans will recognise most of their favourite
pattern matching operations as basic Prolog control structures
(cut = FENCE) and as I pointed out earlier, grammar rules in a
good Prolog are likely to be faster than string operations.
(ii) Accessing data fields in external data bases.
The argument here is that the external data base may contain
millions of character fields which we do not loaded into Prolog's
symbol table. This is a very strong argument indeed in favour of
what have been called "uninterned atoms" (more or less what all
the atoms in the LOGIX system are). But
(ii.1) If the external data base is a relational one, and if the
data base designer had even the faintest understanding of
what it means to design a relational data base, you will not
want to do string operations on these character fields.
(Domains in a relational data base are supposed to be ATOMIC
as far as the application is concerned.) Dates might be an
exception, but they should be stored as a DATE data type or
as coded numbers.
(ii.2) If you are going to do any joins in Prolog, you definitely
DO want the attributes in that domain to be held uniquely (at
least for the duration of a query) in Prolog, so why not put
them in the hash table?
(iii.3) Attributes which are only involved in selection, and are not
included directly in the answer, should be kept on the other
side of the interface.
So the set of attributes where uninterned atoms are wanted is the
set of attributes which are going to be printed but not used in
joins on the Prolog side of the interface (and in particular, are
not going to be stored in the internal data base).
For this application, it looks as though an atom symbol table
which lent itself to fast garbage collection would be little if
at all worse than the use of uninterned atoms.
I have used a lot of programming languages having a string data
type. I am STILL looking for an application where it would be a good
idea to use them in a Prolog which had them. If we can't get a
discussion going about Prolog textbooks, can we get a discussion going
about this?
The topic for debate is:
assuming a Prolog system which imposes no arbitrary
restrictions on atoms, so that the only difference between
atoms and strings is that atoms are stored uniquely and
recovered by garbage collection (so allocation is slow,
but equality testing is rapid) whereas strings are not
stored uniquely and are recovered by both backtracking
and garbage collection (so allocation is fast but
equality testing is slow)
what examples can be found, OTHER than communication with
another language (such as Lisp) which supports strings,
where it is a good idea to USE strings?
Practical experience concerning the relative utility of
strings and uninterned atoms from people who are interfacing
Prolog with a large external data base would be useful.
------------------------------
End of PROLOG Digest
********************