[comp.lang.forth] I need a word!

mackay@iisat.UUCP (Daniel MacKay) (08/05/88)

My current programming project is a program called "travesty"; it was in a Byte
many moons ago.  It (in a nutshell) produces text with the same
3-or-4 letter sequence distributions, as a piece of sample text you feed it. 
I have tried a couple of implementations, both different from
Byte's, but in the end I think I'll resort to their solution (in forth, tho,
not pathcal).
 
Their solution involved loading the entire text into memory, then scanning the
entire buffer every time an output character has to be generated; what you do
is look at the last three (or whatever) letters you output, then look for the
probability of one more letter following it in the original text.  After
you've collected the probabilities for all the letters that follow it (i.e.
count the # of times each one followed), then you roll a weighted die and 
pick a letter.  Then you go on to the next letter.  
 
The program generates hilarious and interesting text from various sources.

Now here's my problem:  I've been out of forth for a couple of years,
and I need a word that can take a buffer address and a string address,
I guess, and return the address of the letter following first occurrance of
the string in the buffer! 
 
Now this has to be a really efficient word, (cause it gets called on a 
30-K or so buffer for every letter output) and it seems like the kind
of problem that has a very, very simple solution- I'm looking for 
insight- I can use a big hammer but I know there's an elegant solution
(that's what FORTH's all about, right?).
 
Whatcha say, guys? any suggestions?  Anyone done this already?
----
+---------+				IIS Public Usenet
|    _    |     From the		Halifax, Nova Scotia
|   (_)===|     Disk of ...		Canada
|         |      Daniel		mackay@iisat.UUCP
+---------+			...{utai,uunet,watmath}!dalcs!iisat!mackay

jax@well.UUCP (Jack J. Woehr) (08/08/88)

In article <99@iisat.UUCP> mackay@iisat.UUCP (Daniel MacKay) writes:
>
>
>Now here's my problem:  I've been out of forth for a couple of years,
>and I need a word that can take a buffer address and a string address,
>I guess, and return the address of the letter following first occurrance of
>the string in the buffer! 


Try this:

	<xxxx> constant buffersize
	
	create foobuffer buffersize allot

: find-first-string-match ( addr-of-string count-of-string ---- addr|false)
	buffersize over - 0 swap 0	\ install dummy fail flag
	do
		drop				\ dummy fail flag	
		2dup foobuff i + swap compare	\ 0 means perfect match
		0= if foobuff i + leave then	\ exit with addr, not fail
		0				\ restore fail flag
	loop
	-rot 2drop ;	\ lose srcaddr & ct 

: find-first-char-after-match ( $addr ct --- addr|false)
	dup >r		\ save count
	find-first-string-match	\ find where the string matches, if it does
	r> over			\ bring back count
	if + else drop then ;	\ if non-zero, addr was found, add count

This is not entirely optimal, but if you examine these, and the source
in your Forth system for the words called, you will probably be able
to code something *real* fast. Good luck! Send us a copy ... :-)

************
jax@well
jax@chariot	" The only thing better than Forth is beer"
JAX on GEnie

A-PIRARD@BLIULG11.BITNET (Andre PIRARD) (08/10/88)

>and I need a word that can take a buffer address and a string address,
>I guess, and return the address of the letter following first occurrance of
>the string in the buffer!

Well, your problem is one of finding the occurence of a string in
another one.  At the Southern Belgium Fig Chapter,  we have  been
debating  for  a  long  time how  Comforth  should  perform  this
function in terms of stack behaviour.  We finally settled for the
definition and code below.  You will easily be able to turn it to
anything you wish.
We found the definition of -MATCH suggested by the standard often
involving  intricate computation on the parameters  returned.  We
like thinking  of strings in terms of address  and  count.  MATCH
returns  what  we think is most useful  for  immediate  use:  the
substring  preceding  the  match and the one  following  it.  The
following  example  shows  how  easily it  is  used  to  type  on
different  lines  the  parts of a character string  separated  by
commas:

: SPLIT \ scon --
   BEGIN  DUP
   WHILE  " ," MATCH TYPE CR  1 /STRING
   REPEAT 2DROP ;

MATCH     scon1 scon2 -- scon3 scon1'
     Search  the first occurrence in scon1 of the  characters  of
     scon2.  Split  the string at the match point,  giving scon1'
     and scon3,  such that scon1=scon1'+scon3 and either,  when a
     match exists, scon3=scon2+rest or, when no match or scon2 is
     the null string,  scon3=null.  When scon2 is not null, scon3
     size may be used as a flag. MATCH is most useful at removing
     separators  from a string and repeatedly handling the  first
     part  then  continuing with the rest until a null string  is
     encountered.   /STRING may be used to behead scon3 by  scon2
     size  to remove the matching part off that rest.  scon2 size
     is limited to 255 characters on some CPU's.

Note:  our descriptions use the symbol <sconx> to refer to a pair
of stack entries containing the address and count of a  character
string whose contents is not changed by the definition.

CODE MATCH \ a1 l1  a2 l2 -- a1' l1'  a1 l3 \ find scon2 at a1' in scon1
LABEL NOMATCH
   DI , CX ADD   CX , CX XOR \ l1'=0 when no match
LABEL OKMATCH
   DX , DI MOV   SI POP  DI POP  AX POP
   DX PUSH   CX PUSH     \ a1',l1', matching substring
   AX PUSH   DX , AX SUB \ a1,l3=a1'-a1, skipped one
   NEXT
ENTRY:
   BX POP  CX POP  AX POP   AX PUSH  DI PUSH  SI PUSH
   DI , AX MOV   AL , 0 [BX] MOV \ DI=a1 CX=l1 BX=a2 DX=l2 AL=first char of 2
   DX , DX OR       NOMATCH JE   \ null string 2 special case
   DX DEC   BX INC
   BEGIN
      REPNZ  SCASB  NOMATCH JNE \ search next first char equality => a1',l1'
      CX , DX CMP   NOMATCH JB  \ not enough to compare rest, l1'<l2
      DI PUSH  CX PUSH          \ save where to continue
       SI , BX MOV  CX , DX MOV
       CX , CX OR   REPZ  CMPSB \ compare rest (assume equality if null)
      CX POP  DI POP
   E UNTIL
   DI DEC  CX INC  OKMATCH JMP  END-CODE \ exit if equal

Note: our Comforth 8086 environments keep the top of stack in register DX.
This highly enhances performance as in CODE  2*  DX , DX ADD.
Converting  to Forths not using this feature is just a matter  of
adding DX POP ahead and DX PUSH behind.

Andr .