GEnie@willett.UUCP (ForthNet articles from GEnie) (12/21/89)
Date: 12-20-89 (05:23) Number: 1681 (Echo) To: MIKE SPERL Refer#: NONE From: MARK SMILEY Read: NO Subj: OPTIMIZATION Status: PUBLIC MESSAGE Mike, Here's another timing question for you from GRAFKRNL.SEQ. \ YOURS: code v320.dot ( x y color -- ) pop ax mov di, ax \ color in di ... \ PROPOSED: code v320.dot ( x y color -- ) POP DI ... \ Here are my times, with similar code to that of my previous note. \ Here, T2 is the POP DI method. 100 SS ! TIMER T1 Elapsed time = 00:00:28.39 TIMER T2 Elapsed time = 00:00:27.63 1000 SS ! TIMER T1 Elapsed time = 00:04:43.80 TIMER T2 Elapsed time = 00:04:36.22 \ Once again, it seems that the smaller code is (slightly) faster. Any \ comments? Mark P.S. I haven't been able to bring myself to try your 22 minute timer test, I'm not sure I can part with my computer for 44+ minutes!! (My addiction to my computer must be worse than I thought.) Besides, all the tests agree that the smaller code is faster, so I see no real need to run the test. --- ~ EZ-Reader 1.13 ~ ----- This message came from GEnie via willett through a semi-automated program. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
ForthNet@willett.UUCP (ForthNet articles from GEnie) (01/11/90)
Date: 01-09-90 (09:52) Number: 1702 (Echo) To: MARK SMILEY Refer#: 1701 From: PETE KOZIAR Read: NO Subj: INSTRUCTION TIMINGS Status: PUBLIC MESSAGE I don't know if this has been said already, but you must be careful about instruction timings on the 80x86/88 family. The instruction timings given assume that the pre-fetch queue is full. Let me explain. The 80x86 family has a queue of instructions waiting to be executed. It fills this queue during instructions that require a lot of cycles to "calculate" (like multiply/divide, etc.). If you do this, then fetching the instructions from memory are "free," since they are done in spare time. Unfortunately, I believe that almost all of the 80x86 families must purge this queue when a branch, jump, or call occurs (the 80386 may not; I'm not sure and don't feel like digging in the manual). Now, let's think about FORTH: lots of nice, small subroutines, i.e., with lots of jumps and calls. Bottom line: we tend to run with the prefetch queue empty a large proportion of the time, so we need to add in the number of fetches for each instruction. A corollary to this is that you may wind up with faster code if you use fewer slower instructions than many fast ones! On the 8088 (i.e., XT-class machines), each fetch of each BYTE adds 4 whole cycles to the instruction time. That ain't hay! The '286 and '386 do better, taking only 2 cycles per fetch, and that being 16 or 32 bits at a time, respectively, which may even represent multiple instructions. Remember, that's with no wait states; wait states add a cycle each. Even worse, if you have cache, you need to worry if the instructions executed in a tight loop are all in cache. In a cached system, the smaller the loop the better. This is why Motorola hedges on their instruction timings for the 68020; you almost need a computer program to figure out instruction timings, and benchmarks are easier anyway. --- * Via Qwikmail 2.01 The Baltimore Sun ----- This message came from GEnie via willett through a semi-automated process. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
hgw@rht32.pcs.com (h.-g. willers) (02/09/90)
Please note, that in most modern RISC-processors (such as Intels i860 or MIPSco's R2000 / R3000) there is no classical call instruction. Subroutine linkage is done with a machine register which holds the return address. So, for routines calling no further routine, there is no overhead for storing/retrieving information to/from the machine stack. Another issue, that has to be taken care of, is the 'branch delay slot' that follows a JAL (jump-and-link instruction). The branch delay slot is an instruction that is always executed after the JAL. So, we have to fill the branch delay slot with something useful (besides a NOP). That could be the first instruction of the called routine (calling effectively routine+4 on a 32-bit machine), if it is not also a branch or jump instruction. H.-G. -- H.-G. Willers PCS-Mail: hgw internal phone ( -271 ) DOMAIN: hgw@rht32.pcs.de (EUR) or hgw@rht32.pcs.com (US) BANG: ..unido!pcsbst!hgw (EUR) or ..pyramid!pcsbst!hgw (US)
ForthNet@willett.UUCP (ForthNet articles from GEnie) (02/25/90)
Date: 02-24-90 (01:48) Number: 1755 (Echo) To: PETE KOZIAR Refer#: 1729 From: MIKE SPERL Read: NO Subj: OPTIMIZATION Status: PUBLIC MESSAGE Pete, >why not just write in assembly language altogether? >A High-level language should be poortable at least to SOME extent, >and writing the whole shebang in assembly-language with a small >FORTH mainline makes it about as portable as an organ transplant. In the inline code example I referred to, the assembler does the same as the hi level, i.e., the hi level calls an assembler word that does the same thing, except the hi level does a lot of stack thrashing that is unnecessary, and uses next - VERY iinefficient! Fortth is a lot handier than assembler alone, and, in NEXT, more efficient as well. But doing everything in hi levell would make forth less efficient than almost any other language, and so far as transportability, moving words even between different versions of F83 or F-PC is a chore. Do commercial forths guarantee transportability between versions of tthe same brand? How easy is itt to move code between brands? This seems like a poor reason to avoid assembler words, which, I suspect, are more poortablle between versions running on the same cpu than the hi level forth is! >I appreciate that graphics is a unique application because of speed, >but even you'll have to worry about portability. What happenns if the >graphics co-processors like the IBM 8514 or the TI chip become the >norm, and all the main processor has to do is send them graphics >primitives? Or, even worse, if you want to convert your code to run on >the graphics coprocessors? Writing it in assembly--language will cook >your oown goose. We've been thru that alreadyy. The cga code is quite different than the ega/vga code, and if another system takes over, that will neeed specific code tooo. But the algorithms will be similar, and we've tried to comment the code so they're understandable. Hi level code won't be anyy more transportable, because it's based oon low level code somewhere in the prrogram that will require rewriting. When you port any language to a new cpu, somebody has to do a lot of programmming, maybe not you. :--) We are reallly writing a graphics kkernel analogous tto thhe forth kernel. >By the other way, a GIF viewer "word" in FORTH would be a great thing >for writing adventure games and/or displaying pictures as error >indicators. It would be neat to be able, for instance, to guide a >technician throughh servicing a piece of equipment by displaying >pictures showing what's wroong. The posssibilities are endless. Good ideas, all of them. I certainly hope someone will use our code to make a useful hi level program. >Keep up the good work. Thanks. - Mike - ----- This message came from GEnie via willett through a semi-automated process. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
ForthNet@willett.UUCP (ForthNet articles from GEnie) (02/28/90)
Date: 02-26-90 (10:45) Number: 1759 (Echo) To: MIKE SPERL Refer#: 1755 From: PETE KOZIAR Read: NO Subj: STUFF Status: PUBLIC MESSAGE In thinking back on our conversation, I hope I didn't come across as being too critical. My comments were supposed to be more on the level of "have you thought about this..." than "YOU'RE DOING IT WRONG." I guess I am sensitive to this, since I have witnessed in my career two major systems that had to be scrappped because they were in assembly-language, and one of them was supposedly in FORTH even though all the words were CODE words. We just have to be a little careful that we don't go for that extra clock cycle at the expense of maintainability. After saying that, there is a way to shave a few cycles off of UNNEST using LES IP, 0 [BP] ADD BP, # 2 instead of: XCHG BP, SP POP ES \ These may me in the reverse order -- POP IP \ I don't have a listing handy. XCHG BP, SP --- * Via Qwikmail 2.01 The Baltimore Sun ----- This message came from GEnie via willett through a semi-automated process. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
ForthNet@willett.UUCP (ForthNet articles from GEnie) (03/05/90)
Date: 03-03-90 (20:01) Number: 1772 (Echo) To: PETE KOZIAR Refer#: 1759 From: MIKE SPERL Read: NO Subj: STUFF Status: PUBLIC MESSAGE Pete, >>there is aa way to shave a few cycles off of UNNEST using > LLES IP, 0 [BP] ADD BP, # 2 Good ssuggestion! No real change on the PC, but on a 286 or 386 a real improvement in performance - 90% faster. Not bad for a word that is part of forth's main loop and executes thousands of times inn a program. I'm going to make the change in kernel1.seq, and I think Tom should include it in his next release. Note: we "popped" 4 bytes, not 2 CODE UNNEST ( --- ) \ Same as EXIT CODE EXIT ( -- ) \ Terminate a high-level definition LES IP, 0 [BP] \ 20 clocks (28 on 8088, 7 on 386) ADD BP, # 4 \ 4 clocks (2 on 386) NEXT === === === ENDD-CODE 24 32 9 instead of: ( Tom''s timings are for an 8086) CODE EXIT ( -- ) \ Terminate a high-level definition XCHG RP, SP \ 4 cycles (3 on 386) POP IP \ 8 cycles (12 on 8088, 7 on 386) POP ES \ 8 cycles (12 on 8088, 7 on 386) XCHG RPP, SP \ 4 cycles (3 on 386) NEXT === === === END-CODE 24 32 17 This is a good example of the effect on programming of improving technology. The 386 does a lot of things in parallel that lessor chips do sequentially. This is aside from the presence of a larger on-chip instruction cache of 15 bytes rather than 3-6. >In thinking back on our conversation, I hope I didn't come across >as being too critical. My comments were supposed to be more on >the level of "have you thought about this..." than "YOU'RE DOING >IT WRONG." No problem. I likke a good diiscuussion; your ideas are among the best in the conference,, and your experience is valuable to all of us. Please continue to be as critical as you feel is justified. I feel more comfortable in assembler than in forth, and I'm sure I get carried away at times. I love forth as a splendid vehicle forr assembler coding and testing, and as you have probably noticed, I try to stick to the low level problems, like screen writing and graphics, and leave the high level programming to others with more experience. Of course, there is an overlap, annd I may tread on higgh level turf sometimmes (especially if I think I can eliminate inefficiencies in the hi level code). Your observations are appreciateed. >major systems that had to be scrappped because they were in >assembly-language, and one of them was supposedly in FORTH even >though all the words were CODE words. We just have to be a little >careful What would have made a difference: better commenting? Did the CPU change? There's a story here that you should tell us. I find that much of the high levvel portion of F-PC is very difficullt to deccipher, untiil I trace things to thee actual machine code, sometimes. I'm glad I don't have to maintain it. >The 80386 modes that allow access to all of memory will cause >problems when used with DOS, DESQview or Windows. Just be aware >that it can be done, "Forth can be an operating system" has been said over and over. This is a good time to see this put to a real test! I don't see much point to a 32 bit forth limited to <640K when most of the work can be donne by hardware built into the 386 chip. We need an asssembler that can access the 386 protected modee features, like multitasking and debugging. >but the problems go beyond the assembly language part of it. Please expand on this point. I know that DOS doesn't know aboutt protected mode. Do the others? How much memory can programs access iin the others? BTW, F-PC has a FOR - NEXT construct, which I've never examined, but Ian's question is not far-fetched. You might take a look at it and tell us hoow it works and what you think of it. ;-) - Mike - ----- This message came from GEnie via willett through a semi-automated process. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
ForthNet@willett.UUCP (ForthNet articles from GEnie) (03/09/90)
Date: 03-08-90 (01:53) Number: 1777 (Echo) To: GENE CZARCINSKI Refer#: 1564 From: MIKE SPERL Read: NO Subj: OBK3501 Status: PUBLIC MESSAGE Gene, The following illustrates the effect of new technology on programming. Changing the kernel would improve 2/386 perrformance by 100%, while having minimal (stiill beneficiall) effect on 8088 performannce. \ CODE 2* ( n -- 2*n ) 8088 286 386 \ POP AX \ 12, ?,, 7 \ SHL AX, # 1 \ 2,, , 3! \ 1PUSH \ 15, , 7 \ END-CODE \ 17 for 386 CODE 2* ( n -- 2*n ) \ suggested (and tested)) new coding MOV BX, SP \ 2, , 2 SHL 0 [BX], # 1 WORD \ 26, , 7 NEXT END-CODE \ 9 for 386 - mps F-PC for the 386 will need a lot of codde changes if we're to get all thhe speed of the new chip.. One change that I'd favor is storing double numbers in memory and on the stack in the lo - hi manner adopted by Intel (consistent with the way 16 bit words are stored now). I think you uncovered a design flaw in forth! With this change, pop ax and pop eax would both work without bit twiddling. This shouldn't affect most programs, those that use doubles as units, e.g.., with 2@ and 2!, but might cause problems if programs play "gaames" with the doubles. The stack needn't be changed; it would bbe a 16 bit and 32 bit stack simultaneously, as it is now! The program(er) would have to keep track of what's on the sstack, but that's the way it is now, too. BTW, I downloaded obk3501 and it's impressiive. Fine Job. Just a suggestion: fill in some code for PERFORM (does anyone use it?). ;-) - Mike - ----- This message came from GEnie via willett through a semi-automated process. Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'
ForthNet@willett.UUCP (ForthNet articles from GEnie) (06/27/90)
Date: 06-25-90 (06:32) Number: 1805 (Echo) To: MARK SMILEY Refer#: NONE From: PETE KOZIAR Read: NO Subj: MORE EFFICIENT 0< Status: PUBLIC MESSAGE Another little optimization for F83 and/or F-PC. Using the magic CWD instruction, we can gain quite a few cycles, making the definition for 0< being: CODE 0< POP AX CWD PUSH DX NEXT END-CODE We can extend this to other words in a similar fashion: CODE 0> POP AX NEG AX CWD PUSH DX NEXT END-CODE CODE 0<> POP AX CWD MOV BX, DX NEG AX CWD AND BX, DX PUSH BX NEXT END-CODE This saves the Jxx instruction, as well as the immediate move instructions. I think it's used often enough to justify "kernel twiddling." --- * Via Qwikmail 2.01 The Baltimore Sun Date: 06-25-90 (06:54) Number: 1806 (Echo) To: MARK SMILEY Refer#: NONE From: PETE KOZIAR Read: NO Subj: MISTAKE IN LAST NOTE Status: PUBLIC MESSAGE Oops. That's what I get for entering a message the first thing in the morning! The AND BX, DX in the definition for 0<> always results in zero, since the number cannot be both positive and negative at the same time. The correct instruction would be OR BX, DX which returns true if the number was either > 0 or < 0, and false if it was the only number that is neither positive nor negative, i.e., 0. ----- This message came from GEnie via willett through a semi-automated process. Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu
ForthNet@willett.UUCP (ForthNet articles from GEnie) (07/06/90)
Date: 07-04-90 (03:42) Number: 1810 (Echo) To: PETE KOZIAR Refer#: 1805 From: GENE LEFAVE Read: NO Subj: MORE EFFICIENT 0< Status: PUBLIC MESSAGE PK>Another little optimization for F83 and/or F-PC. Using the magic CWD I tried this improvement in 32 bit polyFORTH on a 386. 0< and 0> are slightly faster but 0<> is slightly slower. Pretty clever! --- ~ EZ-Reader 1.13 ~ ----- This message came from GEnie via willett through a semi-automated process. Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu
ForthNet@willett.UUCP (ForthNet articles from GEnie) (07/13/90)
Date: 07-11-90 (07:45) Number: 1812 (Echo) To: GENE LEFAVE Refer#: 1810 From: PETE KOZIAR Read: NO Subj: CWD MAGIC Status: PUBLIC MESSAGE It depends a lot on the penalty for a JMP instruction. On some of the simpler members of the family that have an instruction queue rather than a cache, it's horrendous, since you throw away the queue on a JMP and need to start re-filling it. In addition, the JMP instruction itself is a lengthy one on the smaller micros. Glad you liked it! --- * Via Qwikmail 2.01 The Baltimore Sun ----- This message came from GEnie via willett through a semi-automated process. Report problems to: uunet!willett!dwp or willett!dwp@hobbes.cert.sei.cmu.edu