[comp.lang.forth] Optimization

GEnie@willett.UUCP (ForthNet articles from GEnie) (12/21/89)

 Date: 12-20-89 (05:23)              Number: 1681 (Echo)
   To: MIKE SPERL                    Refer#: NONE
 From: MARK SMILEY                     Read: NO
 Subj: OPTIMIZATION                  Status: PUBLIC MESSAGE

 Mike,
    Here's another timing question for you from GRAFKRNL.SEQ.
 \ YOURS:
 code v320.dot  ( x y color -- )
     pop ax
     mov di, ax                   \ color in di
     ...
 \ PROPOSED:
 code v320.dot  ( x y color -- )
     POP DI
     ...
 \ Here are my times, with similar code to that of my previous note.
 \ Here, T2 is the POP DI method.
 100 SS !
 TIMER T1
 Elapsed time   =  00:00:28.39
 TIMER T2
 Elapsed time   =  00:00:27.63
 1000 SS !
 TIMER T1
 Elapsed time   =  00:04:43.80
 TIMER T2
 Elapsed time   =  00:04:36.22
 \ Once again, it seems that the smaller code is (slightly) faster.  Any
 \ comments?
                               Mark
 P.S. I haven't been able to bring myself to try your 22 minute timer
 test, I'm not sure I can part with my computer for 44+ minutes!!  (My
 addiction to my computer must be worse than I thought.)  Besides, all
 the tests agree that the smaller code is faster, so I see no real need
 to run the test.

 ---
  ~ EZ-Reader 1.13 ~ 
-----
This message came from GEnie via willett through a semi-automated program.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

ForthNet@willett.UUCP (ForthNet articles from GEnie) (01/11/90)

 Date: 01-09-90 (09:52)              Number: 1702 (Echo)
   To: MARK SMILEY                   Refer#: 1701
 From: PETE KOZIAR                     Read: NO
 Subj: INSTRUCTION TIMINGS           Status: PUBLIC MESSAGE

 I don't know if this has been said already, but you must be careful 
 about instruction timings on the 80x86/88 family.  The instruction 
 timings given assume that the pre-fetch queue is full. 

 Let me explain.  The 80x86 family has a queue of instructions waiting 
 to be executed.  It fills this queue during instructions that require a 
 lot of cycles to "calculate" (like multiply/divide, etc.).  If you do 
 this, then fetching the instructions from memory are "free," since they 
 are done in spare time. 

 Unfortunately, I believe that almost all of the 80x86 families must 
 purge this queue when a branch, jump, or call occurs (the 80386 may 
 not; I'm not sure and don't feel like digging in the manual). 

 Now, let's think about FORTH: lots of nice, small subroutines, i.e., 
 with lots of jumps and calls.  Bottom line: we tend to run with the 
 prefetch queue empty a large proportion of the time, so we need to add 
 in the number of fetches for each instruction. 

 A corollary to this is that you may wind up with faster code if you use 
 fewer slower instructions than many fast ones! 

 On the 8088 (i.e., XT-class machines), each fetch of each BYTE adds 4 
 whole cycles to the instruction time.  That ain't hay!  The '286 and 
 '386 do better, taking only 2 cycles per fetch, and that being 16 or 32 
 bits at a time, respectively, which may even represent multiple 
 instructions. 

 Remember, that's with no wait states; wait states add a cycle each. 

 Even worse, if you have cache, you need to worry if the instructions 
 executed in a tight loop are all in cache.  In a cached system, the 
 smaller the loop the better. 

 This is why Motorola hedges on their instruction timings for the 68020; 
 you almost need a computer program to figure out instruction timings, 
 and benchmarks are easier anyway. 
 ---
  * Via Qwikmail 2.01  The Baltimore Sun 
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

hgw@rht32.pcs.com (h.-g. willers) (02/09/90)

Please note, that in most modern RISC-processors (such as
Intels i860 or MIPSco's R2000 / R3000) there is no
classical call instruction.

Subroutine linkage is done with a machine register which
holds the return address. So, for routines calling no
further routine, there is no overhead for storing/retrieving
information to/from the machine stack.

Another issue, that has to be taken care of, is the
'branch delay slot' that follows a JAL (jump-and-link
instruction). The branch delay slot is an instruction that
is always executed after the JAL.

So, we have to fill the branch delay slot with something
useful (besides a NOP). That could be the first instruction
of the called routine  (calling effectively routine+4 on a
32-bit machine), if it is not also a branch or jump instruction.

H.-G.

--
H.-G. Willers       PCS-Mail: hgw       internal phone ( -271 )
DOMAIN:  hgw@rht32.pcs.de   (EUR) or  hgw@rht32.pcs.com    (US)
BANG:    ..unido!pcsbst!hgw (EUR) or  ..pyramid!pcsbst!hgw (US)

ForthNet@willett.UUCP (ForthNet articles from GEnie) (02/25/90)

 Date: 02-24-90 (01:48)              Number: 1755 (Echo)
   To: PETE KOZIAR                   Refer#: 1729
 From: MIKE SPERL                      Read: NO
 Subj: OPTIMIZATION                  Status: PUBLIC MESSAGE

 Pete,

 >why not just write in assembly language altogether?
 >A High-level language should be poortable at least to SOME extent, 
 >and writing the whole shebang in assembly-language with a small 
 >FORTH mainline makes it about as portable as an organ transplant. 

 In  the inline code example I referred to, the assembler does the 
 same as the hi level, i.e., the hi level calls an assembler word 
 that does the same thing, except the hi level does a lot of stack 
 thrashing that is unnecessary, and uses next - VERY iinefficient!

 Fortth is a lot handier than assembler alone, and, in NEXT, more
 efficient as well.  But doing everything in hi levell would make 
 forth less efficient than almost any other language, and so far as 
 transportability, moving words even between different versions of 
 F83 or F-PC is a chore.  Do commercial forths guarantee 
 transportability between versions of tthe same brand?  How easy is 
 itt to move code between brands?  This seems like a poor reason to 
 avoid  assembler words, which, I suspect, are more poortablle between 
 versions running on the same cpu than the hi level forth is! 

 >I appreciate that graphics is a unique application because of speed, 
 >but even you'll have to worry about portability.  What happenns if the 
 >graphics co-processors like the IBM 8514 or the TI chip become the 
 >norm, and all the main processor has to do is send them graphics 
 >primitives?  Or, even worse, if you want to convert your code to run on

 >the graphics coprocessors?  Writing it in assembly--language will cook 
 >your oown goose. 

 We've been thru that alreadyy.  The cga code is quite different than 
 the ega/vga code, and if another system takes over, that will neeed
 specific code tooo.  But the algorithms will be similar, and we've 
 tried to comment the code so they're understandable.  Hi level code 
 won't be anyy more transportable, because it's based oon low level 
 code somewhere in the prrogram that will require rewriting.  When 
 you port any language to a new cpu, somebody has to do a lot of 
 programmming, maybe not you. :--)  We are reallly writing a graphics 
 kkernel analogous tto thhe forth kernel.


 >By the other way, a GIF viewer "word" in FORTH would be a great thing 
 >for writing adventure games and/or displaying pictures as error 
 >indicators.  It would be neat to be able, for instance, to guide a 
 >technician throughh servicing a piece of equipment by displaying 
 >pictures showing what's wroong.  The posssibilities are endless. 

 Good ideas, all of them.  I certainly hope someone will use our
 code to make a useful hi level program.

 >Keep up the good work. 

 Thanks.

     - Mike -
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

ForthNet@willett.UUCP (ForthNet articles from GEnie) (02/28/90)

 Date: 02-26-90 (10:45)              Number: 1759 (Echo)
   To: MIKE SPERL                    Refer#: 1755
 From: PETE KOZIAR                     Read: NO
 Subj: STUFF                         Status: PUBLIC MESSAGE

 In thinking back on our conversation, I hope I didn't come across as 
 being too critical.  My comments were supposed to be more on the level 
 of "have you thought about this..." than "YOU'RE DOING IT WRONG." 

 I guess I am sensitive to this, since I have witnessed in my career two 
 major systems that had to be scrappped because they were in 
 assembly-language, and one of them was supposedly in FORTH even though 
 all the words were CODE words.  We just have to be a little careful 
 that we don't go for that extra clock cycle at the expense of 
 maintainability. 

 After saying that, there is a way to shave a few cycles off of UNNEST 
 using LES IP, 0 [BP]  ADD BP, # 2 instead of: 

   XCHG  BP, SP 
   POP   ES             \ These may me in the reverse order -- 
   POP   IP             \   I don't have a listing handy. 
   XCHG  BP, SP 
 ---
  * Via Qwikmail 2.01  The Baltimore Sun 
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

ForthNet@willett.UUCP (ForthNet articles from GEnie) (03/05/90)

 Date: 03-03-90 (20:01)              Number: 1772 (Echo)
   To: PETE KOZIAR                   Refer#: 1759
 From: MIKE SPERL                      Read: NO
 Subj: STUFF                         Status: PUBLIC MESSAGE

 Pete,

 >>there is aa way to shave a few cycles off of UNNEST using
 > LLES IP, 0 [BP]  ADD BP, # 2

     Good ssuggestion!  No real change on the PC, but on a 286 or
 386 a real improvement in performance - 90% faster.  Not bad for 
 a word that is part of forth's main loop and executes thousands of 
 times inn a program.  I'm going to make the change in kernel1.seq, 
 and I think Tom should include it in his next release. 

 Note:  we "popped" 4 bytes, not 2
 CODE UNNEST     ( --- )   \ Same as EXIT
 CODE EXIT       ( -- )  \ Terminate a high-level definition
     LES IP, 0 [BP]      \ 20 clocks (28 on 8088, 7 on 386)
     ADD BP, # 4         \  4 clocks             (2 on 386)
      NEXT                 ===        ===         ===
     ENDD-CODE              24         32          9

 instead of: 

 ( Tom''s timings are for an 8086)
 CODE EXIT       ( -- )  \ Terminate a high-level definition
                 XCHG RP, SP     \ 4 cycles             (3 on 386)
                  POP IP          \ 8 cycles (12 on 8088, 7 on 386)
                 POP ES          \ 8 cycles (12 on 8088, 7 on 386)
                 XCHG RPP, SP     \ 4 cycles             (3 on 386)
                 NEXT             ===        ===         ===
                  END-CODE         24         32         17

     This is a good example of the effect on programming of 
 improving technology.  The 386 does a lot of things in parallel
 that lessor chips do sequentially.  This is aside from the presence 
 of a larger on-chip instruction cache of 15 bytes rather than 3-6.

 >In thinking back on our conversation, I hope I didn't come across 
 >as being too critical.  My comments were supposed to be more on 
 >the level of "have you thought about this..." than "YOU'RE DOING 
 >IT WRONG." 

     No problem.  I likke a good diiscuussion; your ideas are among 
 the best in the conference,, and your experience is valuable to
 all of us.  Please continue to be as critical as you feel is
 justified.

     I feel more comfortable in assembler than in forth, and I'm sure 
 I get carried away at times.  I love forth as a splendid vehicle 
 forr assembler coding and testing, and as you have probably noticed, 
 I try to stick to the low level problems, like screen writing and 
 graphics, and leave the high level programming to others with more 
 experience.  Of course, there is an overlap, annd I may tread on 
 higgh level turf sometimmes (especially if I think I can eliminate
 inefficiencies in the hi level code).  Your observations are 
 appreciateed. 

 >major systems that had to be scrappped because they were in 
 >assembly-language, and one of them was supposedly in FORTH even 
 >though all the words were CODE words.  We just have to be a little 
 >careful 

     What would have made a difference: better commenting?  Did the
 CPU change?  There's a story here that you should tell us.

   I find that much of the high levvel portion of F-PC is very 
 difficullt to deccipher, untiil I trace things to thee actual machine 
 code, sometimes.  I'm glad I don't have to maintain it. 

 >The 80386 modes that allow access to all of memory will cause 
 >problems when used with DOS, DESQview or Windows.  Just be aware 
 >that it can be done,

     "Forth can be an operating system" has been said over and over.
 This is a good time to see this put to a real test!  I don't see
 much point to a 32 bit forth limited to <640K when most of the work 
 can be donne by  hardware built into the 386 chip.  We need an  
 asssembler that can access the 386 protected modee features, like
 multitasking and debugging.

 >but the problems  go beyond the assembly language part of it. 

     Please expand on this point.  I know that DOS doesn't know 
 aboutt protected mode.  Do the others?  How much memory can programs 
 access iin the others?

      BTW, F-PC has a FOR - NEXT construct, which I've never 
 examined, but Ian's question is not far-fetched.  You might take
 a look at it and tell us hoow it works and what you think of it. ;-)

     - Mike -
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

ForthNet@willett.UUCP (ForthNet articles from GEnie) (03/09/90)

 Date: 03-08-90 (01:53)              Number: 1777 (Echo)
   To: GENE CZARCINSKI               Refer#: 1564
 From: MIKE SPERL                      Read: NO
 Subj: OBK3501                       Status: PUBLIC MESSAGE

 Gene,

     The following illustrates the effect of new technology on 
 programming.  Changing the kernel would improve 2/386 perrformance 
 by 100%, while having minimal (stiill beneficiall) effect on 8088 
 performannce. 

 \ CODE 2*  ( n -- 2*n )         8088 286 386
 \                POP AX         \ 12, ?,, 7
 \                 SHL AX, # 1    \  2,,  , 3!
 \                1PUSH          \ 15,  , 7
 \                END-CODE       \       17 for 386

 CODE 2*    ( n -- 2*n ) \ suggested (and tested)) new coding
     MOV BX, SP                \  2,  , 2
     SHL 0 [BX], # 1 WORD      \ 26,  ,  7
     NEXT    END-CODE           \        9 for 386 - mps

     F-PC for the 386 will need a lot of codde changes if we're to 
 get all thhe speed of the new chip..  One change that I'd favor is 
 storing double numbers in memory and on the stack in the lo - hi 
 manner adopted  by Intel (consistent  with the way 16 bit words are 
 stored now).  I think you uncovered  a design flaw in forth! 

     With this change, pop ax and pop eax would both work without 
 bit twiddling.  This shouldn't affect most programs, those that use 
 doubles as units, e.g.., with 2@ and 2!, but might cause problems if 
 programs play "gaames" with the doubles.  The stack needn't be 
 changed; it would bbe a 16 bit and 32 bit stack simultaneously, as 
 it is now!  The program(er) would have to keep track of what's on 
 the sstack, but that's the way it is now, too.

     BTW, I downloaded obk3501 and it's impressiive.  Fine Job.
 Just a suggestion: fill in some code for PERFORM (does anyone
 use it?). ;-)

      - Mike -

-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'

ForthNet@willett.UUCP (ForthNet articles from GEnie) (06/27/90)

 Date: 06-25-90 (06:32)              Number: 1805 (Echo)
   To: MARK SMILEY                   Refer#: NONE
 From: PETE KOZIAR                     Read: NO
 Subj: MORE EFFICIENT 0<             Status: PUBLIC MESSAGE

 Another little optimization for F83 and/or F-PC.  Using the magic CWD 
 instruction, we can gain quite a few cycles, making the definition for 
 0< being: 

 CODE 0<  POP AX  CWD  PUSH DX  NEXT  END-CODE 

 We can extend this to other words in a similar fashion: 

 CODE 0>  POP AX  NEG AX  CWD  PUSH DX  NEXT  END-CODE 
 CODE 0<> POP AX  CWD   MOV BX, DX 
          NEG AX  CWD   AND BX, DX  PUSH BX  NEXT  END-CODE 

 This saves the Jxx instruction, as well as the immediate move 
 instructions.  I think it's used often enough to justify "kernel 
 twiddling." 
 ---
  * Via Qwikmail 2.01  The Baltimore Sun 

 Date: 06-25-90 (06:54)              Number: 1806 (Echo)
   To: MARK SMILEY                   Refer#: NONE
 From: PETE KOZIAR                     Read: NO
 Subj: MISTAKE IN LAST NOTE          Status: PUBLIC MESSAGE

 Oops.  That's what I get for entering a message the first thing in the 
 morning!  The AND BX, DX in the definition for 0<> always results in 
 zero, since the number cannot be both positive and negative at the same 
 time.  The correct instruction would be OR BX, DX which returns true if 
 the number was either > 0 or < 0, and false if it was the only number 
 that is neither positive nor negative, i.e., 0.
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu

ForthNet@willett.UUCP (ForthNet articles from GEnie) (07/06/90)

 Date: 07-04-90 (03:42)              Number: 1810 (Echo)
   To: PETE KOZIAR                   Refer#: 1805
 From: GENE LEFAVE                     Read: NO
 Subj: MORE EFFICIENT 0<             Status: PUBLIC MESSAGE

 PK>Another little optimization for F83 and/or F-PC.  Using the magic CWD 

 I tried this improvement in 32 bit polyFORTH on a 386.  0< and 0> are 
 slightly faster but 0<> is slightly slower.

 Pretty clever! 

 ---
  ~ EZ-Reader 1.13 ~ 
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu

ForthNet@willett.UUCP (ForthNet articles from GEnie) (07/13/90)

 Date: 07-11-90 (07:45)              Number: 1812 (Echo)
   To: GENE LEFAVE                   Refer#: 1810
 From: PETE KOZIAR                     Read: NO
 Subj: CWD MAGIC                     Status: PUBLIC MESSAGE

 It depends a lot on the penalty for a JMP instruction.  On some of the 
 simpler members of the family that have an instruction queue rather 
 than a cache, it's horrendous, since you throw away the queue on a JMP 
 and need to start re-filling it.  In addition, the JMP instruction 
 itself is a lengthy one on the smaller micros. 

 Glad you liked it! 
 ---
  * Via Qwikmail 2.01  The Baltimore Sun 
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu