PROLOG-REQUEST@SUSHI.STANFORD.EDU (Chuck Restivo, The Moderator) (12/30/87)
PROLOG Digest Thursday, 31 Dec 1987 Volume 5 : Issue 100 Today's Topics: Announcement - New List, Implementation - String Concatenation ---------------------------------------------------------------------- Date: Wed, 16 Dec 87 14:45:25 CST From: reddy@b.cs.uiuc.edu (Uday S. Reddy) Subject: Mailing list on narrowing A new mailing list dedicated to the discussion of narrowing in functional and equational languages is being created. Narrowing is an operational mechanism using which expressions with free variables can be "executed" producing solutions for the free variables. This is one of the approaches to combining functional programming and logic programming paradigms into a unified framework. The mailing list will be based at University of Illinois, at the address narrow@a.cs.uiuc.edu. Please send requests for addition to the mailing list to narrow-request@a.cs.uiuc.edu or reddy@a.cs.uiuc.edu. -- Uday Reddy ------------------------------ Date: Tue, 29 Dec 87 00:06:55 PST From: quintus!ok@Sun.COM (Richard A. O'Keefe) Subject: On String Concatenation Although several Prolog dialects provide operations on strings, as many more do not, and among those that do, the names and capabilities of the operations vary considerably. Most of them are just warmed-over BASIC, which means that as part of a "logic" programming language they are very badly designed indeed. Come to that, they aren't all that wonderful in BASIC, either. And if people at ----- or --- or --- or --- or ------ ------- ------------- think I'm talking about their product, I *am*. Very badly designed indeed. Most of the others are warmed-over Lisp, which means that although they are not that badly designed, they may fairly be described as excessively restrictive. I have just been looking at a medium size program (6,400 lines of Prolog-with-a-blunt-knife) which uses string concatenation lavishly. Just in case you are alarmed, this is NOT the Quintus Prolog compiler, nor any other Quintus product. In the Quintus Prolog version of this program, the operation is coded thus: concat_atom(Constants, Atom) :- concat_chars(Constants, Chars), atom_chars(Atom, Chars). % you may need name/2 concat_chars([], []). concat_chars([Constant|Constants], Chars) :- name(Constant, Name), append(Name, Chars1, Chars), concat_chars(Constants, Chars1). This is fine, as far as it goes. Prolog atoms are an almost perfect analogue of SNOBOL IV strings. But there are three snags: (a) in most Prolog systems known to me, the atom length bound is smaller than the string length bound. A good design rule is that the Prolog atom length bound should be no less than the host operating system's file-name length bound, which is about 1024 in a good UNIX. Few Prolog systems meet this not-very- stringent condition. (b) in many Prolog systems there is a smallish limit on the number of atoms you can have. For example, in --- Prolog, you can only have 907. When you recall that NU7 Prolog, running on a PDP-11 with 64kbytes of address space, would let you have 1023 atoms, a limit of 907 on an IBM PC or a Macintosh is absurd. Come to that, any limit other than "storage exhausted" is absurd. (c) Restriction (b) might be tolerable, but most Prolog systems do not garbage collect atoms. (There is no reason why they can't, they just don't.) So if you rely too heavily on concat_atom/2, you run the risk of (a) not being able to represent the result, (b) running out of atoms -- even though there might be lots of storage left --, or (c) tying up storage forever with atoms that were useful once. This sounds like a powerful argument for having a string data- type in the language. But is it? (a) There is no reason why atoms should have to be short. If space is tight, you might want to represent "normal" atoms by a pointer to a length byte followed by that many characters, but there is no reason why you couldn't reserve length 255 to mean "long name" and use some other representation for long atoms. (b) There is no good reason for the atom count limit being small. I put together a symbol table design for the IBM PC where the limit was 2^16 minus a few; this didn't even store atom lengths explicitly, and was very compact. If you can do it for the IBM PC, you can do it for decent machines. (c) There is no reason why the atom table couldn't be garbage collected. SNOBOL did it, after all. But still, it must be more efficient to allocate strings in the same area as compound terms and have them vanish on backtracking, and never have to look them up in the symbol table? This is almost certainly true, but the odd thing is that if efficiency is your concern you are better off using lists of character codes than strings. This is a very surprising result, but it is true in the Lisp systems I have tried too. [In case this isn't clear, I mean that I have MEASURED the cost of strings -vs- lists of character codes, and found that in well-designed Prolog and Lisp systems using lists is FASTER.] Even if efficiency is on the side of lists of character codes, they do take up more space than strings. So there is still some sort of an argument for strings here, especially on PCs and Mac. True, but there is a better argument for not doing concatenation at all. I thought it would be interesting to see how the program mentioned above used concatenation. Eight different uses could be found: (1) Making a file name. The typical example was concat_atom([Base,'.s'], AssemFileName), open(AssemFileName, write, AssemStream) (2) Making a prompt. The typical example was concat_atom(['(',State,') ?- '], NewPrompt), prompt(OldPrompt, NewPrompt), <stuff> prompt(_, OldPrompt) (3) Making a comment to write in a generated file. The typical example was concat_atom(['# Date : ',Day,-,Month,-,Year], DateLine), ... write_to_assem_file([FileLine,DateLine,VersionInfo]) (4) As a preliminary to making a list of character codes, also to be written in a generated file. The typical example was concat_atom(['source function: ',Function,/,Arity], Comment), name(Comment, CommentChars), ... write_to_assem_file([...,c(CommentChars),...]) (5) Changing the name of a compound term. This particular program has two families of constructor functions (among others): where one has <foo> the other has c<foo>. The typical example was PlainTerm =.. [PlainFunction|Args], concat_atom([c,PlainFunction], DecoratedFunction), DecoratedTerm =.. [DecoratedFunction|Args] (6) Constructing an error message. The typical example was target_machine(Machine), concat_atom(['A ',Machine,' cannot handle ', Operation,' on ',Size,' data'], ErrMsg), report_error(ErrMsg, ...) (7) Turning a term into an atom. The typical example was flatten(X+Y, Flat) :- flatten(X, X1), flatten(Y, Y1), concat_atom([X1,+,Y1], Flat). (8) Generating labels. The typical example was label_bounds(vms, '', '$'). label_bounds(unix, 'L', ''). make_label(Number, Label) :- target_os(OS), label_bounds(OS, Prefix, Suffix), concat_atom([Prefix,Number,Suffix], Label). Comments. (1) This shows up a real deficiency in most Prolog systems, but the solution adopted was no better. Common Lisp has addressed this problem. The problem is the problem of handling file names in a reasonably portable way. We have something for internal use here that looks quite good, but then we decided to do an IBM mainframe port, and it didn't look quite so portable any more. What we'd really like to do in this case is file_extension(assembler, Extension), open('b.e'(Base,Extension), write, AssemStream) or something like that. The relevance of this to string operations is that we don't really want a string (or, for that matter, an atom) at all. The syntax of file names is something we want to keep OUT of our source code. What we want here is operations on file names such as Common Lisp provides, not operations on strings. In fact string operations are unbelievably clumsy for such elementary things as "parent directory". (2) The prompt/2 operation being specified to take atoms, concat_atom/2 is needed here. On the other hand, prompt/2 doesn't do quite what the programmer of this program thought (it sets the prompt used by *programmed* input, not the prompt used by the top level) so in fact the operation accomplished nothing at all. So this can't be all that vital. In any case, it only happens once, so making an atom is no big deal. (3) String concatenation should not be used here (A). (4) In this case, it would have been better to call concat_chars/2 directly, rather than pack the result into an atom and straight away unpack it again. Anyway, string concatenation should not be used here (A). (5) The right way to do this is to make a table decorate(foo(X,Y), cfoo(X,Y)). decorate(baz(X), cbaz(X)). decorate(ugh(X,Y,Z), cugh(X,Y,Z)). ... decorate(PlainTerm, DecoratedTerm) ... This is considerably faster and purer. (6) String concatenation should not be used here (B). (7) String concatenation should not be used here (A). (8) String concatenation should not be used here (A). (A) All of these particular uses are constructing something which is going to be written to a file. Now, we DO care that the right sequence of characters should be written to the file, but we DON'T need to represent that sequence of characters EXPLICITLY in the program! For example, we can handle example (8) thus: make_label(Number, label(Number)). ... write_operand(label(Number)) :- target_os(OS), label_format(OS, Format), format(Format, [Number]). Example (7) and lots of others like it in the program can be handled by more clauses of the same sort: write_operand(A+B) :- write_operand(A), write(+), write_operand(B). What is happening here is that instead of constructing a string we are constructing a compound term, and finally we are saying how to print these compound terms. There are so many advantages to doing it this way! For example, we can easily produce several different printing routines (not just different ones for different operating systems; we could have a debugging version which prints abbreviated forms to the terminal so that we can see what is happening). Best of all, we keep the data structures as *Prolog* data structures for as long as possible, meaning that we can unpack them using unification rather than slow string operations, and we can leave parts of them uninstantiated (for all time, if we so choose). The long and short of it is that this approach is much more efficient, and much more flexible, than consing character strings all over the place. (B) Once again, the program as such has no use for the character string. The appropriate thing is to imitate Common Lisp (well, when we did it in DEC-10 Prolog years ago, we thought we were imitating C, but it comes to the same thing). Instead of consing up a string, you pass your error reporting command a format and a list of arguments. Thus we might do target_machine(Machine), report_error('A ~w cannot handle ~w on ~w data', [Machine,Operation,Size], ...) report_error(Format, Arguments, ...) :- current_input(Stream), current_stream(FileName, read, Stream), line_count(Stream, Line), format(user_error, '~N! Error near line ~w of ~w~n! ', [Line, FileName]), format(user_error, Format, Arguments), format(user_error, '~n', []), interact_with_user(...). format/2 and format/3 are specific to Quintus Prolog, just as writef/2 and fwritef/3 were specific to DEC-10 Prolog. That doesn't matter. It is trivially easy to knock together something of the sort which is powerful enough to handle your error messages. The same technique of passing around a "format" and a list of arguments for it may be applicable in other cases as well. The point is: unless the program itself has a use for the string, don't make it. Leave the thing as a compound term or something right up until you write it out. It is sobering to reflect that the reason why C is such a good language for writing text-processing applications is that it HASN'T got a string data type. Some other operations. (i) Pattern matching. Grammar rules are far and away the best way of doing this in Prolog. SNOBOL and ICON fans will recognise most of their favourite pattern matching operations as basic Prolog control structures (cut = FENCE) and as I pointed out earlier, grammar rules in a good Prolog are likely to be faster than string operations. (ii) Accessing data fields in external data bases. The argument here is that the external data base may contain millions of character fields which we do not loaded into Prolog's symbol table. This is a very strong argument indeed in favour of what have been called "uninterned atoms" (more or less what all the atoms in the LOGIX system are). But (ii.1) If the external data base is a relational one, and if the data base designer had even the faintest understanding of what it means to design a relational data base, you will not want to do string operations on these character fields. (Domains in a relational data base are supposed to be ATOMIC as far as the application is concerned.) Dates might be an exception, but they should be stored as a DATE data type or as coded numbers. (ii.2) If you are going to do any joins in Prolog, you definitely DO want the attributes in that domain to be held uniquely (at least for the duration of a query) in Prolog, so why not put them in the hash table? (iii.3) Attributes which are only involved in selection, and are not included directly in the answer, should be kept on the other side of the interface. So the set of attributes where uninterned atoms are wanted is the set of attributes which are going to be printed but not used in joins on the Prolog side of the interface (and in particular, are not going to be stored in the internal data base). For this application, it looks as though an atom symbol table which lent itself to fast garbage collection would be little if at all worse than the use of uninterned atoms. I have used a lot of programming languages having a string data type. I am STILL looking for an application where it would be a good idea to use them in a Prolog which had them. If we can't get a discussion going about Prolog textbooks, can we get a discussion going about this? The topic for debate is: assuming a Prolog system which imposes no arbitrary restrictions on atoms, so that the only difference between atoms and strings is that atoms are stored uniquely and recovered by garbage collection (so allocation is slow, but equality testing is rapid) whereas strings are not stored uniquely and are recovered by both backtracking and garbage collection (so allocation is fast but equality testing is slow) what examples can be found, OTHER than communication with another language (such as Lisp) which supports strings, where it is a good idea to USE strings? Practical experience concerning the relative utility of strings and uninterned atoms from people who are interfacing Prolog with a large external data base would be useful. ------------------------------ End of PROLOG Digest ********************