[comp.lang.forth] Amiga FORTH Newsletter V1, #2

thomson@utah-cs.UUCP (Richard A Thomson) (03/08/88)

Amiga FORTH Newsletter
Volume 1
Number 2
March 8th, 1988

[ I'm posting just the contents of AFN V1, #2 along with the first document
  since the total length is 157K.  If you wish to obtain the whole thing,
  you can contact me (Rich Thomson), or Marcus Gabriel.  Comments to AFN V1
  numbers 1 and 2 will appear in AFN V1, #3 -- RT ]

Contents:  Introduction

           Forth-83.Txt (Offered but not included, 200114 bytes)

           FStrings Package
                Files:  FStrings.Txt   (56796 bytes)
                        FStrings.Scr   (26432 bytes)
                        FSTest.Scr     (28543 bytes)
                        MFStrings.Scr  (27381 bytes)

           GStrings Proposal
                 File:  GStrings.Pro   (10981 bytes)

           WARNING CONCERNING THE MULTI-FORTH ASSEMBLER AND MFSTRINGS.SCR

           THANK YOU Richard Thomson

Sources:  marcus@newton.physics.purdue.edu (Marcus D. Gabriel)
          East Coast Forth Board (ECFB)
            (703-442-8695)(SYSOP: Jerry Shifrin, nice PCBoard)
___________________________________________________________________________

Subject: FStrings Package by George T. Hawkins
From: marcus@newton.physics.purdue.edu (Marcus D. Gabriel)


To One and All:

  Allow me to begin by explaining the one offered file which is not
included here.  The file Forth-83.Txt is the 89 page "FORTH-83 STANDARD,"
a publication of the FORTH STANDARDS TEAM, and it is reproducible
in whole or in part.  Their address is

    FORTH STANDARDS TEAM
    P.O. BOX 4545
    MOUNTAIN VIEW, CA 94040
    USA

If you would like this document, please e-mail a note to this effect to
marcus@newton.physics.purdue.edu (Marcus D. Gabriel), and I will respond either
by sending it to you by e-mail or, if there is sufficient response and when
Richard Thomson has time to create a complete mailing list, by propagating
it to the group.

  Last January I downloaded from the ECFB George T. Hawkins' FSTRINGS
package, adapted it to Multi-Forth in January, and put some more work into
it in late February.  The file MFStrings.Scr is the result; more on this
file later.  The file FStrings.Txt is George Hawkins' original narrative
on his FSTRINGS package, and this file is highly recommended reading.
The file FStrings.Scr is George Hawkins' original source file (FORTH-83
Standard) for the implementation of the abstract string operators of
FSTRINGS, although I have converted it from a block file to a stream file.
This file is included for three reasons:

      i)    George Hawkins' code is enjoyable and, I believe, instructive
          to read.
     ii)    For those with JFORTH or MVP Forth, this will give additional
          aid in adapting FSTRINGS.
    iii)    You may wish to adapt this package differently than I have,
          and thus you have the original source.

The file FSTest.Scr is George Hawkins' Test/Validation file for
FStrings.Scr or, in this case, for MFStrings.Scr also.  It has been
minimally modified in order to compile under Multi-Forth, and you should
have no difficulty in seeing where I have made my changes, which are either
documented there or obvious.  This file will help if you are adapting or
re-adapting
FSTRINGS.  The file MFStrings.Scr is my adaptation of FSTRINGS to
Multi-Forth.

  I have coded all of the words that George Hawkins recommended
and many others besides these.  I have defined two words for "dynamically"
allocating a temporary string storage buffer, eliminated some internal
words through the use of local variables, factored some words of generic
interest beyond this package, etc.  If you do not like the way I did
something, please feel free to change it or bring it up for discussion.
I must admit that upon review, in some instances, but not all, I may have
gotten carried away with the use of local variables.  In these cases, I
believe I was trying to make myself understand each and every word without
skimming over them by simply reworking them, even to excess.
C'est la vie :-).

  If you have questions or problems, you can e-mail me a note if you do
not think it is of generic interest to the Amiga FORTH Newsletter,
otherwise please feel free to participate.

  The file GStrings.Pro (I gather the pun was intended) was posted to the
ECFB by Rj Brown, and it is a proposed incremental improvement over
FSTRINGS.  The abstract string operators would stay the same; it is the
implementation of these that would change.  In essence, at the ECFB,
they are discussing lists and list processing as applied to strings.
See THE ART OF COMPUTER PROGRAMMING, FUNDAMENTAL ALGORITHMS, by Donald
Knuth.  I am sure that Richard Thomson or others of the Amiga FORTH
Newsletter could supply additional or better references.  I include this
file as an example of where we might take this discussion of strings,
as a group, if we wished, although it is not necessarily specific to the
Amiga.  On the other hand, one does use strings with the Amiga.

WARNING AND SOLUTION
--------------------

  The Multi-Forth Assembler has an interesting "feature."  On page
43 of Chapter 21, see

    CMPM  Aa Ab size CMPM,  test:<(Ab)-(Aa)>

whereas in fact the test <(Aa)-(Ab)> is performed, opposite of CMP, ,
CMPA, , or the Motorola convention.  However note that on page 35 of
Chapter 21, -TEXT is coded correctly, that is, given that CMPM, functions
oppositely from the documentation.  Note that I am referring to an old
manual, but after calling CSI, I gather that the new manual has the same
discrepancy.  At least they, and you, have been notified.

  Find CMPM, in the Multi-Forth Assembler and change the
original CMPM,

: CMPM,  ( Aa\Ab\sz -- MEMORY MEMORY COMPARE)
      B108 ,OP  3 AND !SIZE 6 SCALE
      SWAP ?AREG  OR  SWAP ?AREG
      SWAP 9 SCALE  OR  <OROP -OS ;

to

: CMPM,  ( Aa\Ab\sz -- MEMORY MEMORY COMPARE)
( Modified by MDG Feb 1988 )
  >R  2SWAP  R>
      B108 ,OP  3 AND !SIZE 6 SCALE
      SWAP ?AREG  OR  SWAP ?AREG
      SWAP 9 SCALE  OR  <OROP -OS ;

Remember that an Effective Address (ea) places TWO items on the Top Of
the Stack (TOS), hence the 2SWAP .
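
  The stack effect of the added line can be tried outside the assembler.
The following is only a sketch with stand-in numbers: 10 11 play the role
of the two items left by the first effective address, 20 21 those of the
second, and 2 the size.  The name SWAP-EAS is hypothetical and is not part
of Multi-Forth.

```forth
DECIMAL
\ Stand-in values only; a real ea would leave assembler-specific items.
: SWAP-EAS ( a1 a2 b1 b2 sz -- b1 b2 a1 a2 sz )   \ hypothetical name
  >R  2SWAP  R> ;       \ set sz aside, exchange the two pairs, restore sz

10 11 20 21 2 SWAP-EAS
. . . . .               \ prints 2 11 10 21 20
```

The two pairs trade places as units while the size stays on top, which is
the order the unchanged remainder of CMPM, expects.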

  I made this change so that CMPM, functioned as CMP, , CMPA, , and
respected the Motorola convention as did CMP, and CMPA, .  YOU MUST
DO THIS IN ORDER FOR COMPARE , ?SAME , AND _$$COMPARE OF MFSTRINGS.SCR
TO WORK AS ADVERTISED, OTHERWISE YOU MUST RECODE THEM.

END WARNING AND SOLUTION
------------------------


Finally, THANK YOU Richard Thomson for such a nice idea!



Enjoy
Marcus D. Gabriel
marcus@newton.physics.purdue.edu


==[ File:  FStrings.Txt  (56796 bytes) ]==CUT=HERE=========================
                               FORTH STRINGS 
 
 
 
                            TABLE OF CONTENTS: 
 
 
        1.      OVERVIEW 
        2.      SYNTAX 
        3.      PARAMETERS/DATA STACK CONVENTIONS 
        4.      INTERNAL REPRESENTATION/STRUCTURE OF STRINGS 
        5.      DATA ABSTRACTION 
        6.      MAJOR STRING FUNCTIONS 
        6.0.        A NOTE ON THE SYMBOLOGY USED HEREIN 
        6.1.        STRING LITERALS 
        6.2.        STRING DEFINITION 
        6.3.        STRING CONSTANTS 
        6.4.        STRING VARIABLES 
        6.5.        STRING REFERENCE 
        6.6.        BASIC STRING MANIPULATION 
        6.7.        STRING/CHARACTER FETCHES/STORES 
        6.8.        STRING INSERTIONS 
        6.9.        STRING DELETIONS 
        6.10.       STRING REPLACEMENTS 
        6.11.       STRING ROTATIONS 
        6.12.       STRING COMPARISONS 
        6.13.       STRING PATTERN MATCHING 
        6.14.       STRING SET OPERATIONS 
        6.15.       STRING TRANSLATION 
        6.16.       STRING ENCODING/DECODING 
        7.      AREAS NOT COVERED 
        7.1.        INPUT/OUTPUT 
        7.2.        STRING STACK 
        7.3.        PARSING STRINGS 
        8.      THE FORTH SOURCE FILE 
        9.      THE VALIDATION FILE 
        10.     FORTH-83 DEVIATIONS 
        11.     INTERNALS 
        11.1.       INTERNAL STRING STRUCTURE 
        11.2.       VISIBILITY 
        11.3.       ERROR CHECKING 
        11.4.       MEMORY OPERATORS 
        11.5.       THE ISSUE OF STANDARDS 
        11.6.       NULL STRINGS 
        11.7.       CONVERSION PROBLEMS 
        12.     GLOSSARY 
 
 
 
1.  OVERVIEW 
 
 
    This file discusses the concept of string manipulation primitives 
within the Forth language. 
 
    The formulation attempts to be as general as possible in both problem 
approach and solution.  That is, no a priori assumptions are made about 
what is "best" or about any imposed constraints, and I strive for the 
most complete, minimal solution.  
 
 
2.  SYNTAX 
 
 
    Following the general Forth philosophy of making word names short yet 
meaningful, the convention of using the dollar sign "$" to represent 
string operations is used throughout.  All string words begin with an 
initial "$" with two exceptions.  Additionally, where possible, other 
Forth naming conventions are "blended in".  
 
    For example, the Forth word chosen to copy a string is merely "$$!" 
(pronounced "s-s-store"), which means to copy/store (!) one string ($) 
into another string ($).  
 
    In a like fashion, the word for string concatenation is "$$+" and the 
word for typing a string is "$.".  This seems, to me, more meaningful (in 
the context of Forth) than, say, the words "STRCAT" or "STRPRINT". 
 
    As a caveat, I would like to add that I am NOT an advocate of BASIC, 
and I do not regard the use of the "$" prefix as giving the string 
operators presented here a "BASIC-like" flavor.  In the Forth source code 
provided (and in this documentation as well) a "pronunciation guide" 
is given as a suggestion for pronouncing the string manipulation 
primitives provided herein.  It can be found (in double quotes) to the far 
right of the block on the line which begins the colon definition. 
 
    Two successive dollar signs "$$" as a word prefix indicate that two 
string references are needed (e.g., the concatenation word $$+ previously 
given).  In general there is no restriction on the two string references so 
long as they are validly defined strings.  The two references may, in fact, 
be the same.  For example, if the string variable "STV" contains the value 
"Hello there!", then the code: "STV STV $$+" would give STV the value 
"Hello there!Hello there!".  (Of course the "insert string in string" word 
($$INS) and the "replace string in string" word ($$REP) - to be introduced 
later - should not contain the same reference for both strings.)  Whenever 
a string primitive begins with "$$" the first string is always understood 
to be the "source" string and the second string is always the "target" 
string.  So, using the concatenation example again, the code: 
"STV1 STV2 $$+" would add/concatenate STV1 to the end of STV2. 
 
    A prefix of "$C" indicates that a string and a character reference are 
needed/returned.  For example, the string operator "$C!" (pronounced 
"s-c-store") takes a character, a string, and an index (which starts at 0) 
into the string and stores the character in the string at the designated 
index.  Using the previous variable STV, then the code: "65 STV 1 $C!" 
would give STV the value "HAllo there!Hello there!". 
 
    A single "$" indicates that only a single string reference is needed. 
For example, the string operator "$NULL" (pronounced "string-null") sets a 
given string to null. 
 
    Hopefully the pattern and logic behind the syntax chosen for Forth 
string operators will become more apparent as additional examples are 
given. 
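
    The STV example above can be reproduced with a small stand-alone 
sketch.  This is not the FSTRINGS source: the definitions below are 
minimal stand-ins that assume a one-cell length field followed by the 
characters, perform no error checking, and use a hypothetical helper 
named $DATA.  The characters of "Hello there!" are laid down by their 
ASCII codes. 

```forth
DECIMAL
HERE 0 , HERE SWAP - CONSTANT /LEN        \ bytes in one cell (the length field)
: $LEN  ( string -- length )  @ ;
: $DATA ( string -- addr )    /LEN + ;    \ hypothetical helper: first character
: $.    ( string -- )         DUP $DATA SWAP $LEN TYPE ;
: $C!   ( char string index -- )  SWAP $DATA + C! ;
: $$+   ( string1 string2 -- )
  OVER $LEN >R                    \ R: length1
  SWAP $DATA                      \ -- string2 data1
  OVER DUP $DATA SWAP $LEN +      \ -- string2 data1 end2
  R@ CMOVE                        \ append the characters of string1
  R> SWAP +! ;                    \ length2 grows by length1

\ STV holds "Hello there!" (12 characters) plus room for 12 more.
CREATE STV  12 ,
  72 C, 101 C, 108 C, 108 C, 111 C, 32 C,
  116 C, 104 C, 101 C, 114 C, 101 C, 33 C,   12 ALLOT

STV STV $$+      \ source and target may be the same string
65 STV 1 $C!     \ 65 is ASCII "A"; index 1 is the second character
STV $.           \ prints HAllo there!Hello there!
```

    Note that the room for the doubled string had to be reserved by 
hand; nothing in these stand-ins checks that the target is large enough. 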
 
 
3.  PARAMETERS/DATA STACK CONVENTIONS 
 
 
    The method selected to pass parameters to/from the string operators 
provided herein follows a regular, consistent form.  Operands (e.g., 
strings and characters) always precede operators (e.g., indices, lengths, 
and counts).  The general form of parameter passing is: 
 
        {operands} {operators} 
 
    where {operands} = 
 
        {addr,length|char|string} {string} 
 
    and {operators} = 
 
        {index} {length} {number} {status} {flag} 
 
    The enclosing braces (i.e., "{" and "}") indicate the optional 
occurrence of a parameter and the parameters themselves are: 
 
        addr        =   a machine address, 
 
        char        =   an 8 bit ASCII character, 
 
        string      =   an abstract string reference which may be further 
                        qualified if needed (e.g., "string1"), 
 
        index       =   an index into a string (0..n), 
 
        length      =   the length of the (sub)string in question, 
 
        number      =   a number or character count, 
 
        status      =   a set of values indicating possible outcomes, and 
 
        flag        =   a TRUE/FALSE flag (a returned value, designated 
                        as "t | f"). 
 
    Finally, the general Forth usage of "consuming" a data stack item is 
followed. 
 
 
4.  INTERNAL REPRESENTATION/STRUCTURE OF STRINGS 
 
 
    As a start, I feel one should clearly separate the actual internal 
representation of Forth strings from their external appearance. 
 
    That is, the string data type (just as any other data type) should be 
abstract.  The Forth programmer should be given a rich enough set of 
string primitives so that he/she need not have to be concerned with the 
internal representation. 
 
    The implementation of the string manipulation capability itself must, 
however, be closely concerned with the internal representation/structure. 
 
    In general, two major classes of representation for strings have 
tended to evolve: counted strings and (null) terminated strings. 
 
    Terminated strings (also called ASCIIZ strings) are simply strings 
which are terminated by a special symbol (generally the ASCII zero, or nul 
character).  This is the representation chosen, for example, by the C 
language.  
 
    Counted strings are strings which reserve some initial portion of the 
data structure itself to hold the length of the string data that follows. 
 
    There are, of course, advantages to both string types. 
 
    The chief advantage of the terminated string type is that the string 
may be arbitrarily long. 
 
    The chief advantage of the counted string is that it is fast (i.e., 
many manipulations on strings require knowing the string length and this 
can be determined immediately in the case of counted strings). 
 
    A disadvantage with the terminated string is that the special 
terminating symbol cannot be used within the body of the string and a 
disadvantage with the counted string is that there is always some fixed 
(system imposed) limitation on the maximum string size. 
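
    The difference in cost between the two representations can be seen 
in a short sketch.  It assumes a one-cell length field for the counted 
form; the word ZLEN and the names CSTR and ZSTR are hypothetical and 
exist only for this illustration. 

```forth
DECIMAL
\ Counted form: a one-cell length field, then the characters.
: $LEN ( string -- length )  @ ;          \ a single fetch

\ Terminated form: hypothetical word ZLEN must scan for the nul byte.
: ZLEN ( addr -- length )
  DUP  BEGIN  DUP C@  WHILE  1+  REPEAT  SWAP - ;

CREATE CSTR  5 ,  72 C, 101 C, 108 C, 108 C, 111 C,  0 C,   \ "Hello" + pad byte
CREATE ZSTR       72 C, 101 C, 108 C, 108 C, 111 C,  0 C,   \ "Hello" + nul

CSTR $LEN .      \ prints 5
ZSTR ZLEN .      \ prints 5
```

    Both words return the same answer, but $LEN does so in one fetch 
regardless of string size, while ZLEN examines every character. 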
 
    All in all, I feel the counted string representation is the superior 
choice for Forth.  In Forth, speed is always a consideration and the 
problem of having a fixed string size limitation is resolved to a large 
extent by ensuring that the implementation treat strings as an abstract 
data type. 
 
    As a consequence, the string implementation given here uses counted 
strings.  I did not however select the current Forth usage (e.g., as 
reflected in the word COUNT) of representing a counted string with a 
preceding count byte.  My feeling was that in our (current) era of 32-bit 
micros, a single byte limiting the string length to 255 characters is 
simply too restrictive.  I have, rather, used an initial 16-bit word to 
represent the current string length.  Note also that this internal 
representation can be easily changed without, in general, requiring 
application program modifications since the string is implemented as an 
abstract data type. 
 
    Let me make the emphatic point that the string operators defined here 
are *totally* independent of implementation considerations.  That is, there 
is nothing in the definition of the string primitives provided which, in 
any way, requires any assumptions about the underlying string structure. 
This point may seem minor, but it is an extremely important one. 
 
 
5.  DATA ABSTRACTION 
 
 
    Having just discussed some of the considerations involved in the 
choice of the internal representation/structure of strings, I would now 
ask that the reader (from this point forward) ignore these considerations 
entirely!  The reason is that such considerations should only be a factor 
for implementation (i.e., a vendor concern or a concern if you implement a 
string package yourself).  The string type, from the programmer's 
perspective, should be an abstract data type. 
 
    The power and advantage of data abstraction, in general, is well 
known, so I shall not go into further detail here.  The key issue is, 
rather, how one reflects the abstraction at the programming level. 
 
    A natural choice would seem that of implementing strings as an 
address/pointer type.  That is, any reference to a string merely leaves an 
address upon the data stack.  This address is meaningless (per se) to the 
programmer, but is the "hook" used by all of the underlying string 
primitives to perform string operations. 
 
    Of course, this implies that the set of string primitives must be 
sufficiently powerful not to require the direct programming access of the 
abstract data type. 
 
    Not only does the use of a pointer type keep the underlying 
implementation hidden, but it (as well) allows string references to be 
treated as single data elements (since the FORTH-83 standard defines the 
address type to be the same as an unsigned number which is represented as 
a single stack element).  Thus, if two string references are on TOS they 
may be swapped via "SWAP" rather than having to introduce a new word such 
as "$SWAP".  I should add that I do not agree with any number of the 
particulars of the FORTH-83 definition in this regard, but (at least for 
this subcase) it serves the purpose well. 
 
    Although the definition and implementation of string types as abstract 
data structures provides immeasurable benefit in the ability to write 
clean, clear code using strings - it provides another major benefit as 
well.  Portability.  That is, if strings (or any other data type) are 
abstract, then they are prohibited (at the programming/application level) 
from assuming anything about the underlying system.  The underlying 
details must, of course, be handled by the Forth system - but, they more 
properly belong at this "hidden" level anyway.  Since the code/application 
can assume nothing about the underlying implementation, then - by 
definition - the code is portable. 
 
    The only "wrinkle" in the above logic is that, unfortunately, some 
sort of "interface" may be needed between strings and some other vendor 
supplied package/subsystem (e.g., I/O) which is non-portable.  To this end, 
an "import" and "export" capability has been added which "translates" 
between (to/from) the internal/abstract string representation and an 
external/memory sequential one (which seems to be a common format across 
vendors for the data content of strings).  Note that (ideally) this 
import/export capability should only be temporary.  That is, once other 
extensions to the Forth language have been defined which achieve true 
portability/abstraction, then the need for importing and exporting will go 
away. 
 
 
6.  MAJOR STRING FUNCTIONS 
 
 
    The major string functions proposed are discussed next.  In general 
they fall into the categories of: 
 
        -  String literals 
        -  String constants 
        -  String variables 
        -  String reference 
        -  Basic string manipulation 
        -  String/character fetches/stores 
        -  String insertions 
        -  String deletions 
        -  String replacements 
        -  String rotations 
        -  String comparisons 
        -  String pattern matching 
        -  String set operators 
        -  String translation 
        -  String encoding/decoding 
 
    Admittedly the taxonomy is arbitrary but it should, hopefully, 
represent a good presentation framework and it should, at least, be 
exhaustive.  All reasonable string functions I can think of are either 
represented within the primitives given or capable of being built from 
these primitives. 
 
 
6.0.  A NOTE ON THE SYMBOLOGY USED HEREIN 
 
 
    In discussing the major string functions to follow, excerpts from the 
code in "FSTRINGS.SCR" will be used.  Section 3 (i.e., PARAMETERS/DATA 
STACK CONVENTIONS) explains the symbology used here.  
 
 
6.1.  STRING LITERALS 
 
 
    As a way of introducing the string data type into a program, a string 
literal capability must be provided. 
 
    Since a string may be any arbitrary sequence of characters, some way 
of delimiting the string must be found which does not interfere with the 
capability to define an arbitrary sequence of characters. 
 
    The string literal (interpretative) is introduced via: 
 
        $LIT "This is a string" 
 
    Its action is to read the input stream until a non-blank character 
(i.e., a "delimiter") is encountered and to return the string that follows 
(up to a subsequent occurrence of that delimiter) as an abstract string 
reference upon the stack.  The delimiters are not included as part of the 
returned 
string.  However, embedded delimiters may be included in the string by 
simply using two consecutive occurrences of the delimiter. 
 
    The definition of $LIT (as it occurs within the code) is: 
 
\ : $LIT ( -- string )                                \ "s-lit" 
\ Two string literal primitives are provided - "$LIT" for 
\ interpretation and "[$LIT]" for colon definitions - otherwise 
\ their actions are the same. 
\ The string in the input stream following $LIT (or [$LIT]) is 
\ delimited by the first following non-blank character.  The 
\ string is defined by all characters immediately following the 
\ delimiting character until a subsequent delimiter is 
\ encountered in the input stream.  The delimiter, itself, may 
\ also be included in the string by using two successive 
\ delimiters to represent a single delimiter.  For example: 
\    $LIT "The double quote character is """ 
 
    Within a colon definition "[$LIT]" at run time is equivalent to $LIT 
interpretatively.  Thus the word "X", defined as: 
 
        : X [$LIT] ' This is a string' ; 
 
leaves the string " This is a string" as a reference upon TOS when 
executed, and is equivalent to the interpretative sequence: 
 
        $LIT | This is a string| 
 
    The use of "$LIT" leaves the abstract string data type on TOS.  This 
data type is used in the remainder of the string primitives provided 
herein. 
 
    The above use of [$LIT] within a compiled word cannot be used when run 
time interactive input is needed.  For this reason, two primitives are 
provided which read typed keyboard characters directly into a string until 
a <RETURN> character (i.e., ASCII decimal value 13) is encountered.  The 
two primitives are:  
 
: $KEY ( string -- )                                \ "s-key" 
\ Reads keyboard characters into "string" until <RETURN>. 
 
    and 
 
: $$KEY ( string -- )                               \ "s-s-key" 
\ Same as $KEY except input echoed back to console. 
 
    The string data type may be compiled directly into the current 
dictionary via the use of the "$," word once it is placed upon the stack 
via the $LIT word.  Alternatively, the ",$" word may be used to read a 
string from the Forth input stream and place it into the current 
dictionary.  Thus the two segments of code: 
 
        $LIT "This is a string" $, 
 
            and 
 
        ,$ *This is a string* 
 
    are equivalent. 
 
    The source code description of the above discussed words follows: 
 
: $, ( string -- )                            \ "s-comma" 
\ Compiles a string into the dictionary. 
 
: ,$ ( -- )                                   \ "comma-s" 
\ Compiles the following word string into the dictionary. 
 
    A point to keep in mind with $LIT is that only one string literal may 
be defined concurrently.  That is, $LIT returns a "standard" string 
reference address, so if, for example, "$LIT 'hello' $LIT 'there'" is 
typed in, then the two string references left on TOS both "point to" the 
string "there".  In general, this type of situation should not occur. 
 
 
6.2.  STRING DEFINITION 
 
 
    Once a string literal capability is defined, then string data types 
may be defined.  The following subsections discuss the two possible 
cases of string constants and string variables. 
 
 
6.3.  STRING CONSTANTS 
 
 
    No provision is made for implementing string constants.  The primary 
reason is that since the string is implemented as an abstract data type, 
then a string constant is no different in reference from a string 
variable.  Thus the only meaning of a string constant is that assignments 
to string constants would be caught at compile (or run) time and the 
corresponding compilation (or run) aborted.  This would have the negative 
effect of introducing additional complexity in the package with a possible 
run time speed penalty as well.  Since the author felt this direction to 
be counter to his perceived "philosophy" for Forth, the capability was not 
implemented.  In a like fashion, no error checking has been provided 
either.  The user can, of course, modify this package as he/she desires to 
introduce a constant and/or error checking capability if so desired.  
 
    In some Forth systems I've seen, it appears that the only difference 
between a string constant and a string variable is that the former allows 
initialization (i.e., preassigning a value) and the latter does not.  In 
such systems a reference to a string constant or a string variable *both* 
leave identical stack data types.  Thus in these systems there is nothing 
(save the programmer's own good judgment) to prevent the run time 
alteration of string constants.  (Note that this is entirely different 
from the use of standard Forth scalar constants and variables where the 
former leave a value and the latter leave an address on the stack.) 
 
    Such usage of the terms "constant" and "variable" seems, to the author 
at least, to be artificial, contrived, and misleading.  That is, a string 
constant, in such systems, is not a constant but, rather, an initialized 
variable - and this is merely a minor syntactical distinction. 
 
    The implementation given here allows allocating a string variable in 
either an initialized or a NULL state with equal facility and avoids the 
use of string constants altogether, since (as previously mentioned) this 
would invariably exact a performance penalty. 
 
    For example, if a string constant were provided, then the appropriate 
definition might be something like: 
 
        $LIT "This is a string constant" $CON STR-CON 
 
    However this same capability is provided with an initialized variable 
via: 
 
        CREATE STR-CON ,$ "This is a string constant" 
 
    I fail to see much difference between the two nor any advantage to the 
use of a "constant". 
 
 
6.4.  STRING VARIABLES 
 
 
    String variables are "CREATE"d just as are regular Forth variables. 
This is due to the implementation of the abstract string data type as a 
pointer/address. 
 
    For example, to create the string variable SV1 with the initial value 
"Hello world!", one could code: 
 
 
        $LIT "Hello world!" CREATE SV1 $, 
 
    or 
 
        CREATE SV1 ,$ "Hello world!" 
 
    or 
 
        CREATE SV1 12 $ALLOT $LIT "Hello world!" SV1 $$! 
 
 
    As shown, if a string variable is needed, but its initial contents are 
not known, then the variable (plus the dictionary space required) can be 
defined via the CREATE and $ALLOT words. 
 
    For example, to reserve a string variable of 97 characters, one could 
code: 
 
        CREATE SV1 97 $ALLOT 
 
    or 
 
        97 CREATE SV1 $ALLOT 
 
    The use of $ALLOT not only reserves the necessary dictionary space but 
initializes the string to NULL (i.e., zero length) as well.  This is 
important since a null string is different from an uninitialized string (and 
the string package provided here incorporates null strings).  All strings 
are initialized with this package. 
 
    The source code definition of $ALLOT is: 
 
: $ALLOT ( number -- )                              \ "s-allot" 
\ Reserves "number" characters in the dictionary for a string 
\ and sets the string to NULL. 
 
    Although initialization is provided, error checking is not.  This 
means that it is the responsibility of the Forth programmer to ensure that 
the dynamic (i.e., run time) size of strings never exceeds their initially 
(implicitly or explicitly) allocated sizes.  (Of course, error checking 
can be added if so desired, at a performance cost.) 
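
    As an illustration only, a word with the same stack effect as $ALLOT 
might look like the sketch below.  It assumes a one-cell length field 
laid down in front of the reserved characters; the actual definition in 
FSTRINGS.SCR may differ. 

```forth
DECIMAL
HERE 0 , HERE SWAP - CONSTANT /LEN      \ bytes in one cell (the length field)
: $LEN   ( string -- length )  @ ;
: $ALLOT ( number -- )  0 ,  ALLOT ;    \ length field = 0, then the space

CREATE SV1 97 $ALLOT

SV1 $LEN .               \ prints 0  (the string starts out NULL)
HERE SV1 - /LEN - .      \ prints 97 (characters reserved after the length)
```

    The 97 is recorded nowhere in the string itself, which is why the 
programmer must keep track of the allocated size. 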
 
 
6.5.  STRING REFERENCE 
 
 
    Once strings are defined as variables or immediately available as 
literals, then they may be referenced.  "Referencing", in this sense, 
means an immediate operation with a string without altering its value. 
 
    Four basic reference operators are provided: 1) determining the length 
of a string, 2) printing a string, 3) "importing" a string from another 
package (e.g., I/O), and 4) "exporting" a string to another package. 
 
    String length is determined via: 
 
 
: $LEN ( string -- length )                        \ "s-length" 
\ Returns the length of string. 
 
 
    Printing a string is accomplished via: 
 
 
: $. ( string -- )                                 \ "s-dot" 
\ Prints a string. 
 
 
    Their actions should be self-explanatory. 
 
    Since almost all Forth packages of which I am aware treat the data 
portion of a string via contiguous memory locations and since it may be 
necessary to "send/export" a string to some other vendor provided Forth 
package (e.g., I/O) and/or to "receive/import" a string from some other 
vendor provided package - an import/export ability is provided. 
 
    Note that exporting and importing strings are necessary (but hopefully 
temporary) evils.  That is, once the "string data type" has been 
established within the framework of the Forth standard and once all 
packages recognize (and act appropriately upon) this data type - then the 
export/import artifact will no longer be needed. 
 
    The definitions of import and export are: 
 
 
: $IMPORT ( addr length string -- )                \ "s-import" 
\ Imports a string from addr, length. 
 
    and 
 
: $EXPORT ( string addr -- )                       \ "s-export" 
\ Exports a string to addr. 
 
 
    Note that it is the user's responsibility when using $EXPORT to ensure 
that enough contiguous memory is available/reserved at "addr" to hold the 
data portion of the string. 
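
    A round trip through import and export can be sketched as follows. 
These are stand-in definitions with the stack effects given above, 
assuming a one-cell length field; $DATA is a hypothetical helper, and 
RAW-IN, S1, and RAW-OUT are names made up for the example. 

```forth
DECIMAL
HERE 0 , HERE SWAP - CONSTANT /LEN        \ bytes in one cell (the length field)
: $LEN  ( string -- length )  @ ;
: $DATA ( string -- addr )    /LEN + ;    \ hypothetical helper
: $IMPORT ( addr length string -- )
  OVER OVER !                 \ record the new length
  $DATA SWAP CMOVE ;          \ copy the characters in
: $EXPORT ( string addr -- )
  >R  DUP $DATA  SWAP $LEN    \ -- data length     R: addr
  R> SWAP CMOVE ;             \ copy the characters out

CREATE RAW-IN   65 C, 109 C, 105 C, 103 C, 97 C, 0 C,   \ "Amiga" + pad byte
CREATE S1       0 ,  20 ALLOT                            \ empty string, room for 20
CREATE RAW-OUT  20 ALLOT                                 \ caller-reserved buffer

RAW-IN 5 S1 $IMPORT
S1 RAW-OUT $EXPORT
RAW-OUT S1 $LEN TYPE          \ prints Amiga
```

    Only the characters are exported; the receiving package must obtain 
the length separately, here via $LEN. 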
 
 
6.6.  BASIC STRING MANIPULATION 
 
 
    The use of the definition "basic string manipulation" (as was the 
previous case of string reference) is somewhat arbitrary.  What I am trying 
to capture here are those essential actions necessary for *any* string 
package. 
 
    Two basic string manipulation operators are provided; they are: 
 
 
: $NULL ( string -- )                             \ "s-null" 
\ Forces a string to NULL. 
 
    and 
 
: $$+ ( string1 string2 -- )                      \ "s-s-plus" 
\ Adds (concatenates) string1 onto string2. 
 
    Again, the actions should be self-explanatory. 
 
 
6.7.  STRING/CHARACTER FETCHES/STORES 
 
 
    Following the Forth fetch/store convention, a number of operators are 
provided which perform fetches/stores across strings/characters.  They are: 
 
 
: $C! ( char string index -- )                    \ "s-c-store" 
\ Stores "char" in "string" at position "index". 
 
: $C@ ( string index -- char )                    \ "s-c-fetch" 
\ Fetches "char" from position "index" in "string". 
 
: $$! ( string1 string2 -- )                      \ "s-s-store" 
\ Stores string1 in string2.  Original contents of string2 are 
\ lost. 
 
    and 
 
: $$@ ( string1 string2 index length -- )         \ "s-s-fetch" 
\ "string2" is built/fetched using the "string1" substring 
\ starting at position "index" for "length" characters. 
 
 
    The $C@ and $LEN primitives provide, essentially, the capability to 
"examine" any string in terms of size/content, and the $C! primitive 
provides the primary method of altering a string. 
 
    The $$@ operator is, in effect, the "substring" operator used in other 
languages.  When using $$@, the original contents of string2 are lost. 
 
 
6.8.  STRING INSERTIONS 
 
 
    Often it is necessary to dynamically "add" information into a string; 
the string insertion operators provide this capability, and they are: 
 
 
: $CINS ( char string index -- )                    \ "s-c-ins" 
\ Inserts "char" into "string" at position "index". 
\ Remaining characters, if any, are moved right. 
 
    and 
 
: $$INS ( string1 string2 index -- )                \ "s-s-ins" 
\ Inserts "string1" into "string2" starting at position "index" 
\ of "string2".  Remaining characters, if any, are moved right. 
 
 
    When a character, or string, is inserted into another string, the 
original character(s) - if any - beginning at the point of insertion are 
right shifted past the inserted character(s). 
 
    Note that: 
 
        "string1 string2 DUP $LEN $$INS" 
 
    is equivalent to: 
 
        "string1 string2 $$+" 
 
    The latter usage is to be preferred since it is simpler to understand 
and faster in execution. 
 
 
6.9.  STRING DELETIONS 
 
 
    The converse of string insertion (just presented) is string deletion. 
One general purpose and several special purpose string deletion primitives 
are provided. 
 
    The general purpose string deletion primitive is: 
 
: $DEL ( string index number -- )                     \ "s-del" 
\ Deletes "number" characters from "string" starting at 
\ position "index". 
 
    Although any of the special purpose string deletion primitives 
presented next can be built using $DEL, $C@ and $LEN, they are provided 
based on both the perceived frequency of use and in order to standardize 
naming conventions.  They are: 
 
 
: $|TRIM  ( string number -- )                 \ "s-left-trim" 
\ Deletes "number" characters from the left/start of "string". 
 
 
: $TRIM|  ( string number -- )                 \ "s-trim-right" 
\ Deletes "number" characters from the right/end of "string". 
 
 
: $|SPACES ( string -- )                     \ "s-left-spaces" 
\ Trims leading spaces from string. 
 
    and 
 
: $SPACES| ( string -- )                     \ "s-spaces-right" 
\ Trims trailing spaces from string. 
 
 
 
6.10.  STRING REPLACEMENTS 
 
 
    Having now considered string insertions and deletions, string 
replacements are now addressed.  A single, general purpose string 
replacement primitive is provided; it is: 
 
 
: $$REP ( string1 string2 index -- )                \ "s-s-rep" 
\ Replaces current substring in "string2" with "string1" 
\ starting at position "index" of "string2". 
 
 
6.11.  STRING ROTATIONS 
 
 
    The ability to circularly rotate strings left/right is also provided 
via four primitives.  The single shift string primitives are: 
 
 
: $<ROT ( string -- )                          \ "s-left-rote" 
\ Rotates a string left one character. 
 
    and 
 
: $>ROT ( string -- )                          \ "s-right-rote" 
\ Rotates a string right one character. 
 
 
    The multiple shift string primitives are: 
 
 
: $<<ROT ( string number -- )              \ "s-many-left-rote" 
\ Rotates "string" left "number" characters. 
 
    and 
 
: $>>ROT ( string number -- )             \ "s-many-right-rote" 
\ Rotates "string" right "number" characters. 
 
    Note that neither $<<ROT nor $>>ROT is strictly necessary since $<<ROT
is merely (for "number" greater than zero):

        : $<<ROT ( string number -- ) 0 DO DUP $<ROT LOOP DROP ;
 
    And, of course, $>>ROT can be similarly defined. 
 
    The rationale behind providing $<<ROT and $>>ROT was simply that of 
execution time efficiency. 
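
    To illustrate where the time saving comes from, the sketch below
rotates a run of bytes left by "n" places with three block moves through a
scratch area, rather than "n" single-character rotations.  It is only a
sketch under stated assumptions: it works on a raw (addr, len) pair rather
than on an FSTRINGS string, the names SCRATCH and ROTATE-LEFT are
hypothetical, and "n" is assumed to be non-negative.

```forth
DECIMAL
\ Sketch only.  SCRATCH and ROTATE-LEFT are hypothetical names; the real
\ $<<ROT operates on FSTRINGS strings and uses _BUFFER instead.
CREATE SCRATCH 64 ALLOT                \ assumes n mod len never exceeds 64

: ROTATE-LEFT ( addr len n -- )        \ n is assumed non-negative
  OVER 0= IF DROP 2DROP EXIT THEN      \ nothing to rotate in the null case
  OVER MOD >R                          \ ( addr len )  R: n mod len
  OVER SCRATCH R@ CMOVE                \ 1. save the first n bytes
  R@ -                                 \ ( addr rest )  rest = len - n
  2DUP OVER R@ + ROT ROT CMOVE         \ 2. slide the other bytes down
  + SCRATCH SWAP R> CMOVE ;            \ 3. append the saved bytes
```

    Applied to the five bytes "ABCDE" with n = 2, this leaves "CDEAB".  A
right rotation can be sketched the same way with CMOVE> for the overlapping
move.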
 
 
6.12.  STRING COMPARISONS 
 
 
    The determination of string ordinality (i.e., the relative 
lexicographic order of two strings) can be satisfied with a single, all 
purpose, string comparator (i.e., "$$COMPARE") which compares two strings 
and returns -1, 0, or +1 depending on whether the first string is less 
than, equal to, or greater than the second string, respectively.  
 
    This implementation treats the null string as having the lowest 
possible ordinality (and there is valid mathematical reason for doing 
so).  Further, the ordinality of a string is based upon the ASCII 
collating sequence. 
 
    Although the "$$COMPARE" primitive is adequate to resolve all 
questions concerning string ordinality, from a coding/programming 
perspective, it is often more expedient to ask a specific question (such 
as: Is string1 less than string2?).  For this reason, and also from the 
standpoint of standardizing string comparator terminology, a number of 
additional, special purpose string comparators are introduced (e.g., $$=, 
$$<, $$<=, etc.). As might be expected, these comparators are all simple 
words invoking the $$COMPARE comparator. 
 
 
    The single general purpose string comparator is: 
 
 
: $$COMPARE ( string1 string2 -- status )       \ "s-s-compare" 
\ Returns -1, 0, or +1 depending on whether string1 is 
\ lexicographically less than, equal to, or greater than 
\ string2, respectively. 
 
 
    And the six special purpose string comparators are: 
 
 
: $$= ( string1 string2 -- t | f )   \ "s-s-equal" 
\ Returns -1 if string1 =  string2, else returns 0. 
 
 
: $$< ( string1 string2 -- t | f )   \ "s-s-less-than" 
\ Returns -1 if string1 <  string2, else returns 0. 
 
 
: $$<= ( string1 string2 -- t | f )  \ "s-s-less-than-or-equal" 
\ Returns -1 if string1 <= string2, else returns 0. 
 
 
: $$> ( string1 string2 -- t | f )         \ "s-s-greater-than" 
\ Returns -1 if string1 >  string2, else returns 0. 
 
 
: $$>= ( string1 string2 -- t | f ) \ "s-s-greater-than-or-equal" 
\ Returns -1 if string1 >= string2, else returns 0. 
 
    and 
 
: $$<> ( string1 string2 -- t | f )        \ "s-s-not-equal" 
\ Returns -1 if string1 <> string2, else returns 0. 
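
    Since each special purpose comparator only has to interpret the status
returned by $$COMPARE, the six tests can be sketched as follows.  The
STATUS words are hypothetical names (they are not part of FSTRINGS.SCR);
each maps a -1/0/+1 status onto a FORTH-83 flag.

```forth
DECIMAL
\ Sketch only: hypothetical helpers mapping a -1/0/+1 status to a flag
\ (-1 for true, 0 for false).
: STATUS=  ( status -- flag )  0= ;
: STATUS<  ( status -- flag )  0< ;
: STATUS>  ( status -- flag )  0> ;
: STATUS<= ( status -- flag )  0> 0= ;   \ "not greater than"
: STATUS>= ( status -- flag )  0< 0= ;   \ "not less than"
: STATUS<> ( status -- flag )  0= 0= ;   \ "not equal"
```

    Under this sketch a comparator such as $$< would amount to $$COMPARE
followed by STATUS< (i.e., by 0<).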
 
 
 
6.13.  STRING PATTERN MATCHING 
 
 
    The ability to locate arbitrary characters and substrings within a 
string is provided.  The two primitives for locating an arbitrary character 
within a string are: 
 
 
: $CFIND ( char string -- index | -1 )        \ "s-c-find" 
\ Searches for leftmost occurrence of "char" in "string". 
\ Returns "index" if found, else returns -1. 
 
    and 
 
: $CFIND< ( char string -- index | -1 )       \ "s-c-find-back" 
\ Searches for rightmost occurrence of "char" in "string". 
\ Returns "index" if found, else returns -1. 
 
 
    The primitive which searches for a substring (to include the searched 
string itself) within a string is: 
 
 
: $$FIND ( string1 string2 index length -- index | -1 ) 
                                                   \ "s-s-find" 
\ Searches for the first occurrence of the string1 substring 
\ starting at "index" for "length" characters in string2. 
\ Returns index if found, else returns -1. 
 
 
    Note that "index" and "length" identify the substring within "string2" 
to be used for matching purposes. 
 
 
6.14.  STRING SET OPERATIONS 
 
 
    Two set theoretic string primitives are provided.  In both cases a 
character is considered a set element and a string is considered a set. 
 
    The two set theoretic string primitives are: 
 
    For single element set membership: 
 
 
: $CMEM ( char string -- t | f )                    \ "s-c-mem" 
\ Returns -1 if "char" is in "string", else returns 0. 
 
 
    Note that the $CMEM primitive is sufficient (along with other
primitives introduced here such as $NULL, $C@, $$+, etc.) to build any
necessary higher order set theoretic string operators.
 
    For multiple element set membership: 
 
 
: $$VER ( string1 string2 -- index | -1 )           \ "s-s-ver" 
\ Verifies that string2 contains only those characters in 
\ string1 by returning a -1.  Otherwise the index of the first 
\ character in string2 not contained in string1 is returned. 
 
 
    The $$VER (verify) operator is borrowed from PL/I and it is more 
powerful, and practical, than might at first be expected. 
 
    As an example if we define: 
 
        CREATE NUMBERS ,$ "0123456789" 
 
    Then any string can be tested to see if it contains only numbers with 
the word: 
 
 
    \ Returns -1 if "string" contains only the ASCII characters 0..9, 
    \ else returns 0. 
    : NUMBERS? ( string -- flag ) 
      NUMBERS SWAP $$VER 0< ; 
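
    For readers who want to see the verify logic itself, here is a sketch
of the same idea on raw (addr, count) pairs.  It is not the FSTRINGS.SCR
implementation: MEMBER?, VERIFY, DIGITS, DIGITS-ONLY? and the two scratch
variables are hypothetical names introduced only for this illustration.

```forth
DECIMAL
\ Sketch only; all names here are hypothetical.
: MEMBER? ( c a n -- flag )            \ -1 if byte c occurs in a..a+n-1
  DUP 0= IF DROP 2DROP 0 EXIT THEN     \ the empty set has no members
  0 ROT ROT                            \ ( c flag a n )
  OVER + SWAP DO                       \ ( c flag )  scan a..a+n-1
    OVER I C@ = IF DROP -1 LEAVE THEN
  LOOP SWAP DROP ;

VARIABLE SET-ADDR   VARIABLE SET-LEN   \ the set being verified against

: VERIFY ( a1 n1 a2 n2 -- index | -1 ) \ a1/n1 is the set, a2/n2 the data
  2SWAP SET-LEN ! SET-ADDR !           \ ( a2 n2 )
  DUP 0= IF 2DROP -1 EXIT THEN         \ null data trivially verifies
  -1 SWAP 0 DO                         \ ( a2 result )
    OVER I + C@ SET-ADDR @ SET-LEN @ MEMBER?
    0= IF DROP I LEAVE THEN            \ first offender: return its index
  LOOP SWAP DROP ;

CREATE DIGITS 10 ALLOT                 \ filled with "0123456789" below
: INIT-DIGITS  10 0 DO I 48 + DIGITS I + C! LOOP ;
INIT-DIGITS

: DIGITS-ONLY? ( a n -- flag )  DIGITS 10 2SWAP VERIFY 0< ;
```

    As with NUMBERS?, the final 0< turns the "-1 or index" result into a
flag.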
 
 
6.15.  STRING TRANSLATION 
      
 
    Two straightforward string translation primitives are provided based, 
primarily, on frequency of use.  They translate lower-case characters in a 
string to upper-case, and vice-versa.  They are: 
 
 
: $>UPPER ( string -- )                          \ "s-to-upper" 
\ Converts any lower-case characters in "string" to upper-case. 
 
    and 
 
: $>LOWER ( string -- )                          \ "s-to-lower" 
\ Converts any upper-case characters in "string" to lower-case. 
 
 
6.16.  STRING ENCODING/DECODING 
 
 
    String encoding (i.e., translating a number to its ASCII string 
representation) and string decoding (i.e., translating an ASCII string 
number representation to a number) are handled by three primitives which 
are modeled after the currently defined FORTH-83 word set which performs 
double number conversion (i.e., "<#", "#", "#>", etc.). 
 
    String encoding, or decoding, is initiated via the use of the 
"$CONVERT" word.  Its definition is: 
 
 
: $CONVERT ( string index -- )                    \ "s-convert" 
\ Defines a string and index within the string to be used for 
\ subsequent string-to-number or number-to-string conversions. 
 
 
    An initial call to $CONVERT, in effect, defines a specific string and 
a specific place/index within that string to be used for subsequent 
encoding or decoding. 
 
    String encoding is accomplished via successive calls to the word 
"$<N".  Its definition is: 
 
 
: $<N ( number -- )                                \ "s-from-n" 
\ Converts "number" to the appropriate ASCII representation 
\ character and inserts it into the string and index defined 
\ by the last call to $CONVERT.  Note that successive calls to 
\ $<N "build" the number string in reverse sequence. 
\ For example, the code: 
\ 
\     CREATE SNUMBER 10 $ALLOT 
\     SNUMBER 0 $CONVERT 1 $<N 2 $<N 3 $<N 
\ 
\ would build the image "321" in string SNUMBER. 
 
 
    Note that no "editing" is done on the number passed to $<N.  It is 
assumed that the user has already written some sort of "single digit from 
number" extraction routine (e.g., the standard "divide number by current 
base" passing remainder and saving quotient) and will "feed" this number 
sequence to $<N.  Any alphabetic characters added to a string by $<N are 
always upper-case. 
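
    The "divide number by current base" routine mentioned above might be
sketched as follows.  This is only an illustration under stated
assumptions: PUT-DIGIT is a hypothetical stand-in for $<N at index 0 that
writes into a plain byte buffer, NBUF, NLEN, DIGIT>CHAR and ENCODE are
likewise hypothetical names, and the number is assumed to be non-negative.

```forth
DECIMAL
\ Sketch only; all names here are hypothetical.
CREATE NBUF 66 ALLOT   VARIABLE NLEN   \ output characters and their count

: DIGIT>CHAR ( digit -- char )         \ 0..35 to "0".."9", "A".."Z"
  DUP 9 > IF 7 + THEN 48 + ;

: PUT-DIGIT ( digit -- )               \ stand-in for $<N at index 0
  NBUF DUP 1+ NLEN @ CMOVE>            \ shift existing characters right
  DIGIT>CHAR NBUF C!                   \ the new character goes in front
  1 NLEN +! ;

: ENCODE ( n -- )                      \ n is assumed non-negative
  0 NLEN !
  BEGIN BASE @ /MOD SWAP PUT-DIGIT     \ remainder is the next digit
        ?DUP 0= UNTIL ;                \ stop once the quotient is zero
```

    Because the digits are produced least significant first and each one
is inserted in front of the previous ones, 123 ENCODE in DECIMAL leaves the
three characters "123" in NBUF.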
 
    Finally, string decoding is accomplished via successive calls to the 
word "$>N".  Its definition is: 
 
 
: $>N ( -- number | -1 )                             \ "s-to-n" 
\ Converts the character at the string/index position defined 
\ by the last call to $CONVERT to a number if possible.  An 
\ error (i.e., -1) is returned if the character is non-numeric 
\ (i.e., does not lie between 0..BASE-1).  The index position 
\ established in the call to $CONVERT is always incremented 
\ after a call to $>N. 
 
 
    The fact that $>N always increments the index position is important 
since this allows $>N to be used to "search" a string for the first valid 
numeric character (and then proceed with the decode operations). 
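
    The per-character test that $>N is described as performing could be
sketched like this.  CHAR>DIGIT and ACCUMULATE are hypothetical names, and
accepting only upper-case letters for digits above 9 is an assumption made
here; the text does not say how $>N treats lower-case letters.

```forth
DECIMAL
\ Sketch only; CHAR>DIGIT and ACCUMULATE are hypothetical names.
: CHAR>DIGIT ( char -- digit | -1 )
  48 -                                 \ "0" maps to 0
  DUP 0< IF DROP -1 EXIT THEN          \ characters below "0"
  DUP 9 > IF
    7 -                                \ "A" maps to 10
    DUP 10 < IF DROP -1 EXIT THEN      \ characters between "9" and "A"
  THEN
  DUP BASE @ < 0= IF DROP -1 THEN ;    \ digit must lie in 0..BASE-1

: ACCUMULATE ( n digit -- n' )         \ one step of the decode loop
  SWAP BASE @ * + ;
```

    A higher-level decode word would then call ACCUMULATE for each valid
digit and decide for itself what a -1 means (skip, stop, or error).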
 
    The primitives given here are quite simple (more so than the "<#", 
"#>", and associated word set) and it is assumed that the user will tailor 
the "higher levels" of string encoding/decoding to his/her taste. 
 
 
7.  AREAS NOT COVERED 
 
 
    Since this paper tries to concentrate exclusively on Forth strings, a 
number of related topics were purposely excluded.  In all cases the primary 
reason for exclusion was that the topic area would necessarily introduce 
other major considerations which would lie well beyond the area of strings 
and thereby cause a considerable loss of focus.  
 
 
7.1.  INPUT/OUTPUT 
 
 
    Input/output, per se, has not been addressed in this file.  The 
reasoning is straightforward.  Forth, itself, provides minimal input/output 
facilities.  The Forth machine (upon which the Forth language itself is 
based) is a virtual machine with an on-line console (for interactive 
input/output) and a block file system for everything else.  This obviously 
falls far short of an input/output system in the traditional sense. 
 
    Correspondingly, the string functions given will work with the console 
and the block file (i.e., the Forth input stream).  No attempt has been 
made to extend this concept.  The reasoning is simply that input/output 
considerations for Forth are a separate issue, of which, string structures 
are a subcomponent (e.g., any binary read capability could easily input 
Forth strings - along with any other internal Forth data type for a given 
system). 
 
    In truth, as long as string primitives are provided to read string 
literals from a standard Forth block file (as they are in this 
implementation) - then this, in itself, handles the issue of portability. 
So, in essence, there is no need to augment string primitives beyond this 
point (at least from the perspective of portability). 
 
    Note: this is not to say that many Forth systems do not have excellent 
input/output facilities (well beyond the FORTH-83 standard) - it is rather 
to say that this is an independent subject and more properly treated as 
input/output instead of strings. 
 
 
7.2.  STRING STACK 
 
 
    The idea and use of a string stack will also be considered outside the 
bounds of this discussion.  This, in no way, indicates that I feel the 
idea of a string stack is bad (I think it is an excellent idea).  It is 
rather because it introduces additional issues and the line has to be 
drawn somewhere.  For example, if a string stack were introduced then 
string operations could proceed independently of data stack operations 
with no confusion between the two.  This is excellent except that it 
completely changes the semantics/effect of the same code when string and 
data stack operations are intermixed. 
 
    For example, if "SR1" is a string reference, then the code: 
 
        "3 SR1 7 + $." 
 
    Could be expected to produce entirely different results depending on 
whether or not a string stack were implemented. 
 
    Of course this same logic applies if a floating stack (etc.) is 
introduced.  The point, however, is that this is really more of a stack 
(and semantics) issue than a string manipulation issue.  I have long 
argued for multiple stacks in Forth (and even *God forbid* for a type 
stack), but the arguments have fallen upon deaf ears.  I'm getting very 
tired of arguing at this point. 
 
 
7.3.  PARSING STRINGS 
 
 
    Most C language implementations (for example) provide a healthy 
serving of string parsing functions.  This is nice except that often 
underlying assumptions must be made (e.g., what characters constitute a 
"whitespace") which may not apply in all/most situations - and, of course, 
if you don't need it - why have it?  I can think of no parsing functions 
which cannot be built upon the primitives provided here if needed - so 
I've left them out. 
 
    Additionally, since the string package provided herein is primarily 
targeted toward writing application code (rather than, for example, 
parsing Forth code) - no provisions have been made for parsing functions 
as they relate to the Forth language itself. 
 
 
8.  THE FORTH SOURCE FILE 
 
 
    The file "FSTRINGS.SCR" contains all of the Forth string primitives 
discussed in this document. 
 
 
9.  THE VALIDATION FILE 
 
    The file "FSTEST.SCR" contains a set of validation tests to ensure 
that the string primitives provided in FSTRINGS.SCR function properly. 
Although not exhaustive, the validation tests should give reasonable 
confidence that this package will work on your particular Forth system. 
(The validation package will also attempt to isolate malfunctions down to 
a specific primitive if possible.) 
 
 
10.  FORTH-83 DEVIATIONS 
 
 
    As far as I am able to determine, FSTRINGS.SCR is all kosher FORTH-83 
Standard.  I use words from the double number extension word set (i.e., 
2DUP, 2DROP, 2SWAP, etc.). Also I use the BRANCH word from the system 
extension word set in [$LIT]. 
 
    I would suspect any reasonable FORTH-83 implementation would provide 
these words (along with the "\" word which is also used). 
 
    I have tested the system (although not extensively) under both 
MasterFORTH and Laxen & Perry's F83 - no problems. 
 
    I would be glad to assist anyone who encounters difficulty in getting 
this system to run - but, remember, the whole idea is just to provide 
working FORTH-83 code to demonstrate the functions themselves so that they 
can then be rewritten consistent with the underlying Forth system and 
extended to your taste. 
 
 
11.  INTERNALS 
 
 
    Some of the key points which may be of assistance in understanding the 
code/logic in file FSTRINGS.SCR follow: 
 
 
11.1.  INTERNAL STRING STRUCTURE 
 
 
    Although nothing in the use of the primitives provided here makes any 
assumptions about the underlying internal string structure used, this 
information is required for the actual implementation itself. 
 
    The internal representation selected for strings in this 
implementation was that of counted strings.  The initial string *word* 
contains the count (which may be 0 or null).  Thus the limit on the 
working size of a string word is 0..32,767 bytes due to the use of signed 
number arithmetic. 
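
    The layout just described (a count word followed by the characters)
can be pictured with the sketch below.  The names CELL-SIZE, S-LEN, S-DATA,
S-C@ and DEMO are hypothetical and are not the package's own interface
words; the point of computing CELL-SIZE rather than writing "2" is only to
show how the sketch stays independent of the word size.

```forth
DECIMAL
\ Sketch only; all names here are hypothetical.
HERE 0 , HERE SWAP - CONSTANT CELL-SIZE   \ bytes in one word; uses one cell

: S-LEN  ( string -- length )      @ ;            \ count is the first word
: S-DATA ( string -- addr )        CELL-SIZE + ;  \ characters follow it
: S-C@   ( string index -- char )  SWAP S-DATA + C@ ;

CREATE DEMO  3 ,  3 ALLOT              \ count of 3, then room for 3 bytes
65 DEMO S-DATA C!                      \ "A" at index 0
66 DEMO S-DATA 1+ C!                   \ "B" at index 1
67 DEMO S-DATA 2 + C!                  \ "C" at index 2
```

    With this layout DEMO S-LEN gives 3 and DEMO 1 S-C@ gives 66 (i.e.,
"B").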
 
    The reserve scratch area "_BUFFER" has been allocated at 1024+2 bytes. 
 
    The only words which reference _BUFFER are $LIT, $<<ROT, and $>>ROT. 
Other than this everything is done "core-to-core" via CMOVEs. 
 
    _BUFFER's initial allocation (and use) should be thought out carefully 
when implementing this system since it represents a (potentially) large 
permanent allocation of space.  For example, it may be more expedient to 
temporarily "grab" unused dictionary space for this purpose and avoid any 
permanent loss of space.  How one does this is dependent upon the 
particular Forth system being used. 
 
    Another point to keep in mind is that only one _BUFFER is provided. 
Thus one cannot have two string literals simultaneously defined.  (Also, 
for example, a call to $<<ROT would destroy the contents of a string 
literal if it is also active.)  Normally, these considerations would not 
occur. 
 
 
11.2.  VISIBILITY 
 
 
    All routines meant to be "hidden" begin with an initial underscore "_" 
character.  Ideally these words should be headerless if your Forth system 
supports this capability. 
 
 
11.3.  ERROR CHECKING 
 
 
    Simple.  There isn't any!!  (Well, $<<ROT and $>>ROT will reduce the
"number" modulo the length of the string given, but that's hardly
editing!)
 
 
11.4.  MEMORY OPERATORS 
 
 
    The string primitives provided herein do most of their real "work" 
through the underlying "memory" primitives (i.e., CMOVE, CMOVE>, COMPARE, 
CFIND, CFIND<, and -optionally- SAME?). 
 
    For this reason alone it would be advisable to code these memory 
operators in assembler.  (Still another reason is that these memory 
operators should also be useful independent of this string package.) 
 
    The "interface" operators (i.e., _CF+, _LEN, and _$>AL) provide the 
connection/interface between the high-level string operators and the 
low-level memory operators.  They also make it easier to modify this 
package for stack widths of other than 16 bits and/or allow using other 
internal representations than the one selected here.  Since the use of the 
interface operators is pervasive as well, it is recommended that they be 
coded in assembler also (in addition they are very short words). 
 
    So, in general, the basic modus operandi of these string operators is 
to call upon the interface operators; to perform the appropriate "stack 
juggling"; and to call upon the appropriate memory operators. 
 
    In the file FSTRINGS.SCR, block 1 is the load block for a full
FORTH-83 implementation.  Block 2 is the load block for a MasterFORTH
system (i.e., the words COMPARE, CFIND, CFIND<, SAME?, _CF+, _LEN, and
_$>AL are written in assembler for a MasterFORTH system on an 80XX chip).
 
    Even if you don't have a MasterFORTH system (in fact does *anyone* 
besides me!?), I would strongly recommend you examine the assembler 
implementation of these words and rewrite them (in assembler) for your own 
particular Forth system (they really make a rather dramatic difference in 
performance). 
 
    The words are documented and the underlying algorithmic approach 
should be valid regardless of the system you're running.  As an aid to
understanding the MasterFORTH implementation for the 80XX series chip (so
that you may convert these words):
 
    -   The MasterFORTH assembler is based upon the Laxen/Perry F83 public 
domain model with the following major differences: 
 
        -   The mode #) is simply ); 
 
        -   The mode S#) has disappeared; 
 
        -   Local labels are available for unconditional jumps; 
 
        -   REPZ and REPNZ are REPE and REPNE instead; 
 
        -   JZ, JNZ, JC, and JNC are added; 
 
        -   The string instructions use BYTE and WORD rather than AL and 
            AX. 
 
    -   The MasterFORTH implementation uses the following machine 
resources: 
 
        -   The four segment registers must be maintained at the same value 
            (i.e., CS = DS = SS = ES); 
 
        -   The following three registers are used internally by 
            MasterFORTH and must be restored prior to returning to Forth: 
 
                -   SI -- the Forth instruction pointer (IP) 
 
                -   SP -- the Forth parameter stack pointer 
 
                -   BP -- the Forth return stack pointer. 
 
 
    The definitions of the memory operators are: 
 
 
: COMPARE ( a1 a2 n -- status ) 
\ Compares a1-a2, (a1+1)-(a2+1), ... (a1+n-1)-(a2+n-1) 
\ as required returning -1, 0, or +1 depending on lexicographic 
\ order.  If n=0, then 0 (i.e., =) is returned. 
 
 
: SAME? ( a1 a2 n -- t | f ) 
\ Compares a1-a2, (a1+1)-(a2+1), ... (a1+n-1)-(a2+n-1) 
\ as required returning -1 if all "n" bytes match, else returns 
\ 0.  Note that this operator is not strictly required since one 
\ may define SAME? as 
\     : SAME? COMPARE 0= ; 
\ It is included primarily for reasons of speed. 
 
 
\ Memory character/byte FIND                        GTH 08/22/87 
: CFIND ( c a1 n -- a2 | 0 )                    \ "c-find" 
\ Searches a1, a1+1, ..., a1+n-1 for c returning first 
\ address of match or 0 if no match.  NOTE: n=0 returns 0. 
 
 
: CFIND< ( c a1 n -- a2 | 0 )                   \ "c-find-back" 
\ Same as CFIND except search is high-to-low memory. 
\ NOTE: Memory search begins at address: a1+n-1. 
 
    The words "CFIND" and "CFIND<" were selected to match the current 
FORTH-83 words "CMOVE" and "CMOVE>". 
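
    For systems where no assembler version is available yet, the four
memory operators can be approximated in high-level Forth.  The sketch below
is my reading of the stack comments above, not the code in FSTRINGS.SCR,
and the MEM- prefix is a hypothetical naming choice made so the sketch does
not collide with same-named words a host system may already define.

```forth
DECIMAL
\ Sketch only; the MEM- names are hypothetical.
: MEM-COMPARE ( a1 a2 n -- status )
  DUP 0= IF DROP 2DROP 0 EXIT THEN     \ n=0 compares equal
  0 SWAP 0 DO                          \ ( a1 a2 diff )
    DROP
    OVER I + C@  OVER I + C@  -        \ byte difference at offset I
    DUP IF LEAVE THEN                  \ stop at the first mismatch
  LOOP
  ROT ROT 2DROP                        \ ( diff )
  DUP 0< IF DROP -1 EXIT THEN
  0> IF 1 ELSE 0 THEN ;

: MEM-SAME? ( a1 a2 n -- t | f )  MEM-COMPARE 0= ;

: MEM-CFIND ( c a1 n -- a2 | 0 )
  DUP 0= IF DROP 2DROP 0 EXIT THEN     \ n=0: not found
  0 ROT ROT                            \ ( c result a1 n )
  OVER + SWAP DO                       \ scan a1..a1+n-1, low to high
    OVER I C@ = IF DROP I LEAVE THEN
  LOOP SWAP DROP ;

: MEM-CFIND< ( c a1 n -- a2 | 0 )
  DUP 0= IF DROP 2DROP 0 EXIT THEN     \ n=0: not found
  SWAP OVER + 1-                       \ ( c n last )  last = a1+n-1
  SWAP 0 SWAP 0 DO                     \ ( c last result )
    DROP 2DUP I - C@ =                 \ test the byte at last-I
    IF DUP I - LEAVE THEN 0
  LOOP ROT ROT 2DROP ;
```

    Each word tests for n=0 on its first line, so the null case never
reaches the DO loop.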
 
    Also note that no attempt was made to code a general-purpose assembler 
word which searches memory attempting to locate a substring.  The primary 
reason for this is that this operation, although frequently used, is far 
more complex (when done efficiently) than is generally realized.  The 
reader is referred to the excellent article "Searching for Strings with 
Boyer-Moore" by Richard Wiggins and Paul Walberg in the November, 1986 
(Volume 3, Number 11) edition of Computer Language (pp. 28-42) just to see 
how hairy things can get! 
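
    For contrast with the efficient methods that article covers, a naive
substring search is short enough to sketch in high-level Forth.  To be
clear, this is the simple try-every-position approach and not Boyer-Moore,
it is not part of FSTRINGS.SCR, and BYTES=, MEM-SEARCH, PAT-ADDR and
PAT-LEN are hypothetical names.

```forth
DECIMAL
\ Sketch only: naive search, not Boyer-Moore.  All names are hypothetical.
: BYTES= ( a1 a2 n -- flag )           \ -1 if n bytes match (n=0 matches)
  DUP 0= IF DROP 2DROP -1 EXIT THEN
  -1 SWAP 0 DO                         \ ( a1 a2 flag )
    DROP
    OVER I + C@  OVER I + C@  =
    DUP 0= IF LEAVE THEN               \ stop at the first mismatch
  LOOP ROT ROT 2DROP ;

VARIABLE PAT-ADDR   VARIABLE PAT-LEN   \ the pattern being searched for

: MEM-SEARCH ( a1 n1 a2 n2 -- a3 | 0 ) \ a1/n1 is the pattern, a2/n2 the text
  2SWAP PAT-LEN ! PAT-ADDR !           \ ( a2 n2 )
  PAT-LEN @ - 1+                       \ number of candidate positions
  DUP 1 < IF 2DROP 0 EXIT THEN         \ pattern is longer than the text
  0 SWAP 0 DO                          \ ( a2 result )
    DROP
    DUP I + PAT-ADDR @ PAT-LEN @ BYTES=
    IF DUP I + LEAVE THEN 0            \ match: return its address
  LOOP SWAP DROP ;
```

    In the worst case this does on the order of n1 times n2 byte
comparisons, which is the cost the cleverer algorithms are designed to
avoid.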
 
 
11.5.  THE ISSUE OF STANDARDS 
 
 
    The string operators provided herein will hopefully be of benefit to 
some of you in string processing applications.  The true test, however, of 
the utility of any programming language extension is its proven utility 
over time.  There is thus no good measure at present of what string 
operators, as provided in FSTRINGS, are needed, are defined well, or are 
missing. 
 
    It is more reasonable to expect that FSTRINGS will serve as a catalyst 
(or perhaps seed) to encourage members of the Forth community to 
investigate the fundamental nature of and need for string operations in 
Forth. 
 
    In addition to the issue of functionality addressed above, there are 
as well the matters of stack/data conventions and naming conventions.  All 
of these issues need to be thoroughly investigated and an informal standard 
developed and used for an extended period before any sort of string 
standards proposal is warranted (at least in this author's opinion). 
 
    This has been a rather long-winded way of stating that I am in no way 
proposing FSTRINGS as a standards extension. 
 
    The development of FSTRINGS did, however, provide reasonably good 
assurance that some more fundamental, underlying operators (upon which any 
contiguous memory string package could easily be built) could be good 
candidates as standards extensions.  These are just the "memory operators" 
earlier mentioned.  Thus I would like to propose that the words: COMPARE, 
CFIND, CFIND<, and -optionally- SAME? be considered as standards 
extensions to augment the CMOVE and CMOVE> words. 
 
 
11.6.  NULL STRINGS 
 
 
    This package allows the use of null strings.  You may wish to ignore 
them altogether, but I would caution against this.  The null string 
processing introduces very little overhead for the capability gained. 
 
    Null string handling is greatly simplified by the fact that all of the 
underlying memory operators are defined to accept null (or zero) counts. 
For example, in the null case: 
 
 
    CMOVE and CMOVE>            do not move anything; 
 
    COMPARE                     returns 0 (i.e., "equal"); 
 
    CFIND and CFIND<            return 0 (i.e., "not found"); 
 
        and -optionally- 
 
    SAME?                       returns -1 (i.e., "equal"). 
 
 
    Thus the majority of the string primitives herein can be written 
without concern for the special case of null (or zero length) strings. 
 
    For those primitives which must explicitly test for the null string 
case, this action is always performed initially.  If the null string case 
applies, then the return stack is set appropriately and the primitive is 
immediately EXITed.  In all cases, this amounts to a single line of 
(commented) code at the beginning of the word definition. 
 
 
11.7.  CONVERSION PROBLEMS 
 
 
    As mentioned, this system has been tested with both MasterFORTH and 
F83.  Jerry Shifrin was kind enough to allow me/us to test this package 
under LMI's Forth as well. 
 
    The primary difficulty encountered was the underlying implementation 
of the "BRANCH" word. 
 
    Both MasterFORTH and F83 use absolute addresses for BRANCH.  LMI seems 
to use a relative offset instead. 
 
    BRANCH is only used in the definition of [$LIT] as follows: 
 
 
: [$LIT] ( -- string )                        \ "bracket-s-lit" 
\ Compiled string literal 
  COMPILE BRANCH HERE 0 ,  \ forward branch around string 
  HERE ,$ HERE ROT ! [COMPILE] LITERAL ; IMMEDIATE 
 
 
    It may be necessary to modify [$LIT] prior to the "!" in order to 
adjust the "address/offset" which follows BRANCH in the compiled word as 
necessary.  It may also be necessary to perform "aligns" as well. 
 
    Alternatively, if your Forth system supports the ">MARK" and 
">RESOLVE" words from the system extension word set, then it may be more 
expedient to simply recode [$LIT] as: 
 
 
: [$LIT] ( -- string )                        \ "bracket-s-lit" 
\ Compiled string literal 
  COMPILE BRANCH >MARK 
  HERE ,$ SWAP >RESOLVE [COMPILE] LITERAL ; IMMEDIATE 
 
 
    Finally, it may be necessary to modify the "_CF+" and "_$>AL" words to 
use some value other than "2+" if your Forth system's word size is other 
than 16 bits. 
 
 
 
12.  GLOSSARY 
 
 
    A short glossary of all string primitives follows: 
 
 
: COMPARE ( a1 a2 n -- status ) 
 
: SAME? ( a1 a2 n -- t | f ) 
 
: CFIND ( c a1 n -- a2 | 0 )                 \ "c-find" 
 
: CFIND< ( c a1 n -- a2 | 0 )                \ "c-find-back" 
 
: $LIT ( -- string )                         \ "s-lit" 
 
: $, ( string -- )                           \ "s-comma" 
 
: ,$ ( -- )                                  \ "comma-s" 
 
: [$LIT] ( -- string )                       \ "bracket-s-lit" 
 
: $KEY ( string -- )                         \ "s-key" 
 
: $$KEY ( string -- )                        \ "s-s-key" 
 
: $ALLOT ( number -- )                       \ "s-allot" 
 
: $IMPORT ( addr length string -- )          \ "s-import" 
 
: $EXPORT ( string addr -- )                 \ "s-export" 
 
: $LEN ( string -- length )                  \ "s-length" 
 
: $. ( string -- )                           \ "s-dot" 
 
: $NULL ( string -- )                        \ "s-null" 
 
: $$+ ( string1 string2 -- )                 \ "s-s-plus" 
 
: $C! ( char string index -- )               \ "s-c-store" 
 
: $C@ ( string index -- char )               \ "s-c-fetch" 
 
: $$! ( string1 string2 -- )                 \ "s-s-store" 
 
: $$@ ( string1 string2 index length -- )    \ "s-s-fetch" 
 
: $CINS ( char string index -- )             \ "s-c-ins" 
 
: $$INS ( string1 string2 index -- )         \ "s-s-ins" 
 
: $|TRIM  ( string number -- )               \ "s-left-trim" 
 
: $TRIM|  ( string number -- )               \ "s-trim-right" 
 
: $|SPACES ( string -- )                     \ "s-left-spaces" 
 
: $SPACES| ( string -- )                     \ "s-spaces-right" 
 
: $DEL ( string index number -- )            \ "s-del" 
 
: $$REP ( string1 string2 index -- )         \ "s-s-rep" 
 
: $<ROT ( string -- )                        \ "s-left-rote" 
 
: $>ROT ( string -- )                        \ "s-right-rote" 
 
: $<<ROT ( string number -- )                \ "s-many-left-rote" 
 
: $>>ROT ( string number -- )                \ "s-many-right-rote" 
 
: $$COMPARE ( string1 string2 -- status )    \ "s-s-compare" 
 
: $$= ( string1 string2 -- t | f )           \ "s-s-equal" 
 
: $$< ( string1 string2 -- t | f )           \ "s-s-less-than" 
 
: $$<= ( string1 string2 -- t | f )          \ "s-s-less-than-or-equal" 
 
: $$> ( string1 string2 -- t | f )           \ "s-s-greater-than" 
 
: $$>= ( string1 string2 -- t | f )          \ "s-s-greater-than-or-equal" 
 
: $$<> ( string1 string2 -- t | f )          \ "s-s-not-equal" 
 
: $CFIND ( char string -- index | -1 )       \ "s-c-find" 
 
: $CFIND< ( char string -- index | -1 )      \ "s-c-find-back" 
 
: $$FIND ( string1 string2 index length -- index | -1 ) 
                                             \ "s-s-find" 
 
: $CMEM ( char string -- t | f )             \ "s-c-mem" 
 
: $$VER ( string1 string2 -- index | -1 )    \ "s-s-ver" 
 
: $>UPPER ( string -- )                      \ "s-to-upper" 
 
: $>LOWER ( string -- )                      \ "s-to-lower" 
 
: $CONVERT ( string index -- )               \ "s-convert" 
 
: $<N ( number -- )                          \ "s-from-n" 
 
: $>N ( -- number | -1 )                     \ "s-to-n"