yuval@taux01.UUCP (Gideon Yuval) (02/18/88)
Is good support for self-modifying code a real issue? All CPUs support a
"modify code -- invalidate all caches -- execute" cycle, which is enough to
run (say) loaders & debuggers; I THINK this is enough for all real
applications, but want to be sure. Are any exceptions to this (AI or other)
known?
-- 
Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com
kers@otter.hple.hp.com (Christopher Dollin) (02/18/88)
"yuval@taux01.UUCP (Gideon Yuval)" says:
> Is good support for self-modifying code a real issue? all CPUs support a
> "modify code -- invalidate all caches -- execute" cycle, which is enough to
> run (say) loaders & debuggers; i THINK this is enough for all real
> applications, but want to be sure.

That depends. In the Poplog environment, for example, user code (Pop11,
Prolog, ML, Common Lisp) is compiled to native code by an incremental
compiler. You could call this "loading", but I suspect that many people will
think of a loader as being something that loads rather larger pieces of code
than individual procedures.

What's more, the Pop11 "partial application" feature, whereby a procedure is
constructed from a base procedure by supplying some of its arguments, also
involves the creation of new procedure records in the store. While the
compiler is not often invoked repeatedly in user code, partial application
(and to a lesser extent procedure composition) is. This means that code
cache flushing might happen rather often.

Strictly, none of this is "self-modifying" code, since the procedures are
constructed as data-objects before being used. But the data cache has to be
flushed first.

Of course, some people might say that a mixed-language incremental
development system with integrated editor wasn't a "real"
application.....................

Regards, Kers | "Why Lisp if you can talk Poperly?"
webber@athos.rutgers.edu (Bob Webber) (02/19/88)
In article <486@taux01.UUCP>, yuval@taux01.UUCP (Gideon Yuval) writes:
> Is good support for self-modifying code a real issue? all CPUs support a
> "modify code -- invalidate all caches -- execute" cycle, which is enough to
> run (say) loaders & debuggers; i THINK this is enough for all real
> applications, but want to be sure.
>
> Are any exceptions to this (AI or other) known?

On most unix systems, the classic example of ``self-modifying'' code is
dynamic loading, where you want things like the lisp interpreter to be able
to use compiled code while it is interpreting. Similarly, some people have
used dynamic loading by hand to simulate some of the benefits of shared
libraries (i.e., minimizing the size of executables that are not being
executed). Incremental compilation is another place where one could imagine
self-modifying code being used to good effect. Some people generate
floating-point code that modifies itself depending on the availability of
an fpa at execution time.

A number of people interested in object-oriented programming will present
algorithms; when asked how they expect them to run as fast as hand-coded C,
they often start talking about dynamic code generation for common cases.
One even hears people contemplate programs that profile their own execution
and optimize ``hotspots''.

------- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)
ok@quintus.UUCP (Richard A. O'Keefe) (02/19/88)
In article <486@taux01.UUCP>, yuval@taux01.UUCP (Gideon Yuval) writes:
> Is good support for self-modifying code a real issue? all CPUs support a
> "modify code -- invalidate all caches -- execute" cycle, which is enough to
> run (say) loaders & debuggers; i THINK this is enough for all real
> applications, but want to be sure.
>
> Are any exceptions to this (AI or other) known?

Historically, "self-modification" has referred to modifying instructions
which will then be executed again. For example, walking down an array on an
IBM 650 without index registers required changing address fields in
instructions.

What languages like Pop-2, Lisp, Prolog, SmallTalk, &c &c require is the
ability to
(a) dynamically add *new* code; if the operating system provided a
    "load this .obj file wherever you please and protect it as executable
    code, but tell me where you put everything" facility, that would be fine.
(b) reclaim old code which will never be used again; if the operating system
    provided an "unload this chunk that you loaded for me before, and if you
    protect it so that attempts to execute it will trap, so much the better"
    facility, that would be fine.

Note that separate I&D doesn't mean you can't have dynamically loaded code;
it just means that an ordinary user program can't do it, and many operating
system vendors never think to provide it.

Some implementations of SmallTalk and Lisp do use a self-modifying scheme.
For example, one technique for handling "Object msg" in SmallTalk is to have
instructions like

        load    Object
        call    usual_handler_for_msg
                ^^^^^^^^^^^^^^^^^^^^^

where the handler checks to see if it is the right one, and if it isn't,
does a full method lookup to find the right handler, and then pokes the
address of the right handler back in the call instruction. A similar
technique has been used in some Lisp systems ("snapping links"). Indirection
can be used instead, at a price.
aglew@ccvaxa.UUCP (02/20/88)
Branch target caching in SOAR, and in some 68000 Smalltalk implementations,
is self-modifying. Smalltalk methods are multiway branches based on possibly
complicated conditions. Instead of evaluating all the conditions, these
implementations branch to the most recently used method, and do a simple
validity check, defaulting to the more general case. The general case
patches a branch back into the code. Of course, there are other ways to do
it...

Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
aglew@gould.com - preferred, if you have nameserver
aglew@gswd-vms.gould.com - if you don't
aglew@gswd-vms.arpa - if you use DoD hosttable
aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
freeman@spar.SPAR.SLB.COM (Jay Freeman) (02/20/88)
In article <666@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe)
writes:
>What languages like Pop-2, Lisp, Prolog, SmallTalk, &c &c require is the
>ability to
>(a) dynamically add *new* code; if the operating system provided a
>    "load this .obj file whereever you please and protect it as
>    executable code, but tell me where you put everything" facility,

Not quite: occasionally one wants to be able to compile directly to memory
without bothering to write an object file; the loader would not be called in
such a case. The objective is to be able to write some bits into memory and
then call or jump to what you just wrote.
bzs@bu-cs.BU.EDU (Barry Shein) (02/20/88)
Actually I'm not sure exactly where the confusion between self-modifying and
dynamically loaded code is coming from. I've seen at least one lisp which
used self-modifying code to some advantage, typically to poke in debugger
and other evalhooks when desired (into the main eval loop) or simply remove
them when not wanted (for speed's sake). It may only be a few tests saved,
but consider how many times a lisp program travels through the eval loop --
like every iterative loop for interpreted code.

-Barry Shein, Boston University
dhesi@bsu-cs.UUCP (Rahul Dhesi) (02/21/88)
[re: how to overcome the code versus data distinction]

For those implementors who wish to allow data to be safely treated as code,
I suggest a system call that will accept a range of memory addresses and a
request to change it from code to data or from data to code. E.g.:

    #include <modech.h>

    if (modech(startadd, endadd, TO_CODE) == -1)
        error("could not change data into code for execution");

This preserves the advantages, if any, of the code/data distinction without
being a pain to the implementor. No doubt there would be some alignment
restrictions on the values of startadd and endadd.
-- 
Rahul Dhesi UUCP: <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi
aglew@ccvaxa.UUCP (02/22/88)
>[re: how to overcome the code versus data distinction]
>
>To allow data to be safely treated as code...
>
>    #include <modech.h>
>    if (modech(startadd, endadd, TO_CODE) == -1)
>        error("could not change data into code for execution");
>
>Rahul Dhesi UUCP: <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi

That's along the right track, but won't necessarily work on strict Harvard
architectures. How about

    int (*f)();

    f = codealloc(size);
    codecpyin(f, data, size);
    codecpyout(f, data, size);

I promised Karl Heuer that I'd come up with a standard proposal - one day,
maybe.
ok@quintus.UUCP (Richard A. O'Keefe) (02/22/88)
In article <2169@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
> [re: how to overcome the code versus data distinction]
> For those implementors who wish to allow data to be safely treated as
> code, I suggest a system call that will accept a range of memory
> addresses and a request to change it from code to data or from data to
> code. E.g.:
>
>     #include <modech.h>
>     if (modech(startadd, endadd, TO_CODE) == -1)
>         error("could not change data into code for execution");

It won't work on a separate I&D machine. To start with, that range of data
addresses may *already* be in use in the code space, or it may not be valid.
(Imagine a machine where code space is negative and data space is positive.)
Another thing that can go wrong is that, as on the B6700, code and data may
be different *kinds* of address.

If you can use the same addresses for code and data as Rahul Dhesi suggests,
your problem is just an uncooperative operating system. That's why I
suggested "load this .obj file where you like, and tell me where you put
it", because that *does* make sense in genuinely separate I&D machines. Of
course it doesn't have to be a .obj file; the thing you hand to the kernel
might be a chunk of data memory to copy, or for that matter a PostScript
program to set individual bits (:-). The important thing is that it has to
be a copy, because "separate I&D" means *separate*.

An interesting point about dynamic *extension* of programs is that (if the
code space allocator grain size is a multiple of the instruction cache entry
size) you don't need to flush the instruction cache, because valid entries
are still valid.
rogerh@arizona.edu (Roger Hayes) (02/25/88)
Someone (Rob Pike?) used self-modifying code to do bitblits. The general case bitblit code did a case analysis to construct the optimal program for that particular blit, then jumped to the constructed program. Worked quite well, as I recall. Roger Hayes rogerh@arizona.edu
ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/25/88)
rogerh@arizona.edu (Roger Hayes) writes:
>Someone (Rob Pike?) used self-modifying code to do bitblits. The
>general case bitblit code did a case analysis to construct the optimal
>program for that particular blit, then jumped to the constructed
>program. Worked quite well, as I recall.

It was for the Blit bitmap graphics terminal, originally driven by a 68000,
and later by a WE32000. It is more like a very special compiler than
self-modifying code. The BitBlt code does case analysis, generates a small
'code' fragment, and passes control to it (well, it is self-modifying code,
in some sense...). The generated code omits the various case analyses
(BitBlt is not so simple, if special cases are considered), and is faster
than the code with all the 'if' statements.

I believe the same technique is used in Microsoft Windows' BitBlt code. They
generate code in (I thought) the stack segment, and branch to it. It was
fairly nasty code because of the 64K segments on the 80x86, if the frame
buffer size exceeds 64K.

==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi "Life's rock." "Climb now, work later." and, now,
"Life's snow." "Ski now, work later." ohbuchi@cs.unc.edu <on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================
mash@mips.COM (John Mashey) (02/26/88)
In article <1415@thorin.cs.unc.edu> ohbuchi@unc.UUCP (Ryutarou Ohbuchi) writes: >rogerh@arizona.edu (Roger Hayes) write >>Someone (Rob Pike?) used self-modifying code to do bitblits.... John Reiser was the author of the code that saw widespread use in the Blits. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
tainter@ihlpg.ATT.COM (Tainter) (02/27/88)
In article <1415@thorin.cs.unc.edu>, ohbuchi@unc.cs.unc.edu (Ryutarou
Ohbuchi) writes:
> rogerh@arizona.edu (Roger Hayes) writes:
> >Someone (Rob Pike?) used self-modifying code to do bitblits. The
> >general case bitblit code did a case analysis to construct the optimal
> >program for that particular blit, then jumped to the constructed
> >program. Worked quite well, as I recall.

I did precisely this. It gave me a nice compact and fast routine. The only
alternatives I could come up with were:
--separate code for each operation, jumped to through a jump table.
  This was optimally fast but huge.
--calling subroutines through jump tables.
  This was good on space but relatively slow, since it amounted to
  retesting the opcode inside the loop.
--a sort of hybrid of the above.
  This still came out quite large.

My self-modifying code used a template into which it copied instructions
for the various operations. This gave a near optimal tight loop for each of
the operations in a nice compact form.

> Ryutarou Ohbuchi "Life's rock." "Climb now, work later." and, now,

--j.a.tainter
leonid@TAURUS.BITNET (03/03/88)
May I add to this discussion that on most UNIX systems with virtual memory
(e.g. BSD 4.2/3) there is a lot of self-modifying code. Except for the
debugger case, the idea is implemented by copying some code onto the stack,
modifying it there, and executing it. Some examples I can think of are
signal handling and the floating-point-hardware-dependent code on SUNs.
Since dynamic loading is not yet implemented on BSD systems I don't really
know how they're gonna do it.

In a word, VM UNIX systems tend to keep the text segment read-only for the
sake of text sharing, swap space saving, etc., with the few exceptions where
the stack segment is used to keep temporary copies of executable code that
must be modified. Debugging is an exception.

Leonid
pardo@june.cs.washington.edu (David Keppel) (07/11/88)
[ What about that self-modifying code ]

nather@astro.as.utexas.edu (Ed Nather) writes:
> "horrid" [ hard to write, understand, maintain ]
>
>But it's really needed -- if you have to watch an animated display
>of data on a CRT all night long that blinks and flickers, you learn
>very quickly that techniques to get display refresh faster than the
>human flicker-frequency can save lots of pounding headaches. If
>they are "forbidden" techniques that require a descent into
>assembly coding, so be it. Headaches hurt.

Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
+ It's machine dependent (we already knew that).
+ Read-only (e.g., code) pages may be shared.
+ Some machines have separate I and D spaces. It's not inconceivable that
  different machines in the same family will have different memory models.
+ Many machines use separate instruction and data caches. Instruction
  caches can be made much simpler (= smaller = faster) by not requiring
  them to pay attention to the data bus. Writing to a memory location is
  not guaranteed to update the I-cache, thus you may or may not execute
  the modified instruction. Next week's model may have a large enough
  cache that this week's code breaks.

Good enough?

;-D on ( Compiler hints won't save you here ) Pardo
nather@ut-sally.UUCP (Ed Nather) (07/11/88)
In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> [ What about that self-modifying code ]
> nather@astro.as.utexas.edu (Ed Nather) writes:
> > "horrid"

Tsk tsk -- out of context, like the movie ad that quotes a reviewer who
said: "This movie is a great waste of time -- it's insipid, boring, and
stupid. I've never seen anything like it." The ad quoted:
"...great...I've never seen anything like it." I said the practice "...was
formally declared horrid ...", NOT that *I* thought it was. I don't think
that.

> Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
> + It's machine dependent (we already knew that).

Assembly code does tend to be machine dependent, all right -- but C was
designed to give the programmer the same facility, with better control and
better discipline. It works quite well for those things it covers. It is,
in some reasonable sense, portable. A good compiler could generate
self-modifying code to suit, I'm sure. What we lack is the formal
discipline for using it wisely and well.

> + Read-only (e.g., code) pages may be shared.

Only in a time-sharing system. Most real-time control programs can't
tolerate such an environment for lots of reasons, mostly due to critical
timing requirements. Anyway, if generated code is treated as "executable
data" this whole point becomes irrelevant.

> + Some machines have separate I and D spaces. It's not inconceivable
>   that different machines in the same family will have different
>   memory models.

Not a problem. Generated executable code changes, just like any other
variable values, and can be treated the same way.

> + Many machines use separate instruction and data caches. Instruction
>   caches can be made much simpler (= smaller = faster) by not
>   requiring them to pay attention to the data bus. Writing to a
>   memory location is not guaranteed to update the I cache, thus you
>   may or may not execute the modified instruction. Next week's model
>   may have a large enough cache that this week's code breaks.

Again, not a problem. You don't take the generated code out of the I-cache,
but out of the D-cache, since it can (by definition) change.

> Good enough?

Sorry, no. I've heard LOTS of arguments against programs that generate
their own code, and all of them -- so far as I am aware -- assume the
"proper" answer in generating arguments to show the answer is proper. Be
honest: suppose you *had* to do it this way, but needed a formal discipline
to keep it clean and understandable, and *fast.* Couldn't you find a
reasonable way?

"It's unclean because ... well, because it's so ... so ... UNCLEAN!"
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nather@astro.as.utexas.edu
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/12/88)
In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
>> [ What about that self-modifying code ]
....
>> Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
....
>> + Read-only (e.g., code) pages may be shared.
>
>Only in a time-sharing system. Most real-time control programs can't
>tolerate such an environment for lots of reasons, mostly due to critical
>timing requirements. Anyway, if generated code is treated as "executable
>data" this whole point becomes irrelevant.

There are time-sharing systems that can handle critical tasks (e.g.
real-time applications with strict timing requirements) well. These systems
can give tasks priorities; a critical task can be given a high priority to
enable it to meet its time limits. Such a system may very well have code
that is shared.

Another related issue is that self-modifying code cannot be executed
directly from ROM. Executing programs in ROM is an important memory-saving
technique in small, diskless, special-purpose systems.

>> Good enough?
>
>Sorry, no. I've heard LOTS of arguments against programs that generate
>their own code, and all of them -- so far as I am aware -- assume the
>"proper" answer in generating arguments to show the answer is proper. Be
>honest: suppose you *had* to do it this way, but needed a formal discipline
>to keep it clean and understandable, and *fast.* Couldn't you find a
>reasonable way?
>
>"It's unclean because ... well, because it's so ... so ... UNCLEAN!"

I guess there are situations with extreme performance requirements where
self-modifying code is justified. It would be interesting to see what a
semantics for self-modifying programs would look like.

Regarding the "uncleanliness", I think there are two main sources for it.
The first is the difficulty for a human mind to trace what's going on in a
self-modifying program. The second is an advantage read-only data (e.g.
non-self-modifying code) has over writable data: read-only data can be
accessed asynchronously, in any order, even concurrently, without any
constraints. Therefore read-only is "clean" in a concurrent environment. An
example apart from code is databases, where read-only data don't have to be
locked during transactions.

Bjorn Lisper
pardo@june.cs.washington.edu (David Keppel) (07/12/88)
nather@ut-sally.UUCP (Ed Nather) writes:
>pardo@june.cs.washington.edu (David Keppel) writes:
>>[ Out of context quote ]
>[ Funny retort. ]

Sorry. I do like your rejoinder, though. Could be about self-modifying
code, even :-)

> Sorry, no. I've heard LOTS of arguments against programs that
> generate their own code, and all of them -- so far as I am aware
> -- assume the "proper" answer in generating arguments to show the
> answer is proper. Be honest: suppose you *had* to do it this way,
> but needed a formal discipline to keep it clean and understandable,
> and *fast.* Couldn't you find a reasonable way?

Let me rephrase my position. There is nothing architecturally weird about
programs that generate their own code. Doing so does not cause any problems
on any machines that I am aware of, although few OPERATING SYSTEMS support
this.

There is a *major* problem with modifying existing code. Consider a machine
with separate I-cache and D-cache. If you execute an instruction at
location 0x500, then one at 0x504 that branches back to the one at 0x500,
and the one at 0x500 modifies itself, either:

+ The cache can be built so that it pays attention to the changing data
  (code) in the I-cache. This requires a "snoopy" cache, which is more
  complicated, less dense, and slower. If you are constrained by price or
  die space, you may be forced to use a slower, smaller (worse hit-ratio)
  cache.
+ You can execute out of the D-cache. This is effectively the same as
  having a snoopy I-cache. If you have BOTH an I-cache and a D-cache,
  you're left with coherency problems that are solved by building a snoopy
  I-cache.
+ You can define this operation to be illegal, build a smaller, faster,
  non-snoopy cache, and pay the price when you need to run code that could
  be expressed "naturally" as self-modifying code.

Assume that you have an algorithm that is, say, 1000 instructions, and that
each instruction takes 4 cycles to execute (must be a RISC :-). One
particular part of it, 10 instructions, is written as self-modifying code.
It can also be written as 100 instructions of non-self-modifying (static)
code. Further assume that the effect of having a non-snoopy cache is to
make the computer run 10% faster (e.g., the cache is faster AND has a
better hit rate). Then the running time of the program on the machine with
the snoopy I-cache is:

    T = 990 * 4.0 + 10 * 4.0 = 4000 cycles

On the machine without a snoopy I-cache (10% faster),

    T = 990 * 3.6 + 100 * 3.6 = 3924 cycles

which is marginally FASTER for NOT using self-modifying code, even though
it takes an additional 90 instructions.

;-D on ( Cycles to burn reading flames ) Pardo
mash@mips.COM (John Mashey) (07/12/88)
In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
....
>> + Many machines use separate instruction and data caches. Instruction
>>   caches can be made much simpler (= smaller = faster) by not
>>   requiring them to pay attention to the data bus. Writing to a
>>   memory location is not guaranteed to update the I cache, thus you
>>   may or may not execute the modified instruction. Next week's model
>>   may have a large enough cache that this week's code breaks.
>
>Again, not a problem. You don't take the generated code out of the I cache,
>but out of the D cache, since it can (by definition) change.

No, machines that do this don't work that way, i.e., you only execute code
from the I-cache or main memory, not from the D-cache (or the cache
currently acting as the D-cache, in cases where you can swap them).

There is nothing wrong with generating code on the fly. What is wrong is
assuming a model of machines that looks just like a CPU + DRAM, or with
caches that act only like coherent, joint I&D caches. This is just one
instance of a general problem: how do you handle caches that are visible to
the executable code in any way? What's needed is a good set of
cache-control primitives that:
a) Cover all of the cases.
b) Become widely used enough for people to count on.
c) Are easy to make no-ops for environments where they don't matter.
-- 
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
chasm@killer.UUCP (Charles Marslett) (07/12/88)
In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
> Another related issue is that self-modifying code cannot be executed
> directly from ROM. Executing programs in ROM is an important memory-saving
> technique in small, diskless, special-purpose systems.

Actually, I always thought that ROM was much more important as a way of
eliminating the need to load the code and as a way of guaranteeing
non-self-modifying-ness. Usually ROM code is larger (thus memory-consuming)
than RAM-loaded code. This is because most RAM-based systems assume some
form of self-modification of code, ranging from relocation tables that
avoid the need for self-relative addressing everywhere to file
initialization code that gets overwritten by its own buffer.

On the other hand, some RAM-based systems require 4 Mb to run an editor, so
you may be right in those cases :->} !!

Seriously, newer architectures have reduced the difference (to 0 perhaps),
but the emphasis on RISC these days may resurrect self-modifying code -- a
RISC-er technique is not known to mankind! (:-D}===<<).

> Bjorn Lisper

Charles Marslett
hjm@cernvax.UUCP (hjm) (07/12/88)
If the program is considered to be data for an interpreter, which I will for
convenience call a CPU, then does self-modifying code seem any different? As
long as the data in this table remains consistent, there should be no
difficulty in dealing with it mathematically. All of the problems of
database consistency were tackled (but not all solved) some time ago.

Code is data according to a PATCH program, and the only tricky bits there
are to ensure that any changes fit inside the space that they replace and to
patch up the jumps where necessary. A bit like a structure linked by
pointers, such as a binary tree, except that code is in general a directed
cyclic graph (the directedness is due to the program counter).

Now, if the routines that ensure the consistency of the code change
themselves, then the world will most likely fall down around one's head, but
then things get to be so recursive that my stack overflows ...

Hubert Matthews
nather@ut-sally.UUCP (07/12/88)
In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
>
> I guess there are situations with extreme performance requirements where
> self-modifying code is justified. It would be interesting to see what a
> semantics for self-modifying programs would look like.

I agree. That was the whole point of my original posting -- to see if
someone with a talent for languages could ignore the current prejudice
against this practice, and see what might be done. If there's really no way
to do it in a reasonable manner, fine -- but just *saying* it's awful
doesn't *make* it awful.

> The first [source of uncleanliness] is
> the difficulty for a human mind to trace what's going on in a
> self-modifying program.

I think this arises because of the two different intellectual "levels"
involved: what the program is doing that generates the instructions (by
analogy with the instruction generator in a compiler) and what those
instructions are going to do when they are executed. These *must* be kept
completely separate or massive confusion results. But if the written code
is approached with full awareness of this requirement, the confusion
vanishes -- in my experience, anyway.

> The second is an advantage read-only data (e.g. non-self-modifying
> code) has over writable data: read-only data can be accessed
> asynchronously, in any order, even concurrently, without any constraints.
> Therefore read-only is "clean" in a concurrent environment.

I believe self-modifying code was condemned long before concurrent
environments were even possible. This may be a disadvantage, but *no*
technique exists without trade-offs. I think we can live with this one.

> An example apart from code
> is data bases, where read-only data don't have to be locked during
> transactions.

OK, but I'm not sure this is relevant here. I do, however, appreciate your
keeping "data" plural. It happens so rarely nowadays ...
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nather@astro.as.utexas.edu
baum@Apple.COM (Allen J. Baum) (07/13/88)
[]
>In article <5262@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
>
>There is a *major* problem with modifying existing code. Consider a
>machine with separate I-cache and D-cache. If you execute an
>instruction at location 0x500, then one at 0x504 that branches back
>to the one at 0x500, and the one at 0x500 modifies itself, either:
>
> + The cache can be built so that it pays attention to the changing
>   data (code) in the I-cache....
> + You can execute out of the D-cache....
> + You can define this operation to be illegal....

or you can recognize the situation and issue a write-back-Dcache-line
instruction, and a flush-Icache-line instruction, and everything works.

Note that a compile-and-go situation is exactly this: you are building a
program in the D-cache, and then you want to execute it. While it may be
unlikely that any remnants are left over after linking, et al., this may be
the only way to guarantee it.
-- 
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/13/88)
In article <4776@killer.UUCP> chasm@killer.UUCP (Charles Marslett) writes:
>In article <33441@yale-celray.yale.UUCP>, I write:
>> Another related issue is that self-modifying code cannot be executed
>> directly from ROM. Executing programs in ROM is an important memory-saving
>> technique in small, diskless, special-purpose systems.
>
>Actually, I always thought that ROM was much more important as a way of
>eliminating the need to load the code and as a way of guaranteeing non-
>self-modifying-ness. Usually ROM code is larger (thus memory-consuming)
>than RAM-loaded code. This is because most RAM based systems assume some
>form of selfmodification of code, ranging from relocation tables that avoid
>the need for self-relative addressing everywhere to file initialization code
>that gets overwritten by its own buffer....

As I wrote, I was actually thinking primarily of diskless systems. In such a
system the program is stored in ROM. If the program is self-modifying, its
code has to be loaded into RAM before it can execute. If not, it can be
executed directly in ROM and only the data areas need to be RAM. Thus memory
is saved (we don't need the extra RAM to hold the code).

Bjorn Lisper
gillies@p.cs.uiuc.edu (07/13/88)
I think many rules have an exception (e.g. "gotos considered harmful"). There is a difference between self-modifying code and non-reentrant code. Today, people are finding new uses for self-modifying code that were unforeseen 20 years ago: 1. Simple programs that need OPTIMAL speed, and would be outrageously expensive on ANY machine. The major example is BITBLT, a subroutine with about 10-15 parameters that does a massive amount of linear-time work. If the BITBLT subroutine generates the machine code expressly for the particular arguments, you can save a factor of 3 or more in speed. This is very important for interactive graphics. 2. Peter Deutsch and ParcPlace Systems used a new technique to speed up Smalltalk on conventional computers. When a subroutine is executed, the types are checked and if possible, the subroutine is compiled for speed & executed. Then this compiled code is cached for possible use later, when the arguments have the same types. This also gave a very large constant speedup to the Smalltalks marketed by ParcPlace Systems. I think papers on both these techniques appeared in the first OOPSLA conference. In light of these new techniques, I believe it's time to reexamine our opinions about self-modifying code. Don Gillies, Dept. of Computer Science, University of Illinois 1304 W. Springfield, Urbana, Ill 61801 ARPA: gillies@cs.uiuc.edu UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies
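Technique (2) can be sketched without any actual machine-code generation: keep a one-entry cache per call site, keyed on the argument's type, and fall back to a lookup routine (standing in for the runtime compiler) only on a miss. This is an illustrative C sketch, not the ParcPlace implementation; the Obj/Tag representation and all names are invented:

```c
#include <assert.h>

/* Hypothetical tagged object, a stand-in for dynamically typed values. */
typedef enum { T_INT, T_FLOAT } Tag;
typedef struct { Tag tag; union { int i; double f; } u; } Obj;

typedef Obj (*AddFn)(Obj, Obj);

static Obj add_int(Obj a, Obj b)      /* "compiled" fast case for ints */
{ Obj r; r.tag = T_INT; r.u.i = a.u.i + b.u.i; return r; }

static Obj add_float(Obj a, Obj b)    /* fast case for floats */
{ Obj r; r.tag = T_FLOAT; r.u.f = a.u.f + b.u.f; return r; }

/* One-entry cache per call site: the last type seen, and the routine
   specialized for it. */
typedef struct { Tag seen; AddFn fn; int valid; } CallSite;

static AddFn lookup(Tag t)            /* stands in for the runtime compiler */
{ return t == T_INT ? add_int : add_float; }

Obj cached_add(CallSite *cs, Obj a, Obj b)
{
    if (!cs->valid || cs->seen != a.tag) {   /* miss: "compile" and cache */
        cs->seen = a.tag;
        cs->fn = lookup(a.tag);
        cs->valid = 1;
    }
    return cs->fn(a, b);                     /* hit: one compare, one call */
}
```

On a hit the cost is a single comparison plus an indirect call, which is the whole point of the caching.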
hjm@cernvax.UUCP (hjm) (07/13/88)
I have been mulling over the idea of self-modifying code (SMC) for a while and
I've come to the conclusion that there is no good definition of SMC.
For example, if treating code as data is the definition, then does passing a
procedure as a parameter in PASCAL, or a pointer to a function in C count?
Probably not.
OK, what about a jump table. Consider an array of pointers to functions in C.
Does changing the pointers count as SMC? Again, I don't think so.
So, changing a pointer by assigning to it is not SMC, but putting a new jump
instruction (e.g. jmp #somewhere_else) in place of an existing instruction
is SMC. Does one level of indirection really make that much difference?
Of course, if you want to be really horrid in C, you can try something like
this:
char codearray[] = { 0x03, /* one byte instruction for something */
                     0xc9  /* one byte return instruction */
                   };
and then 'call' this function using a cast to turn the pointer codearray into
a pointer to a function. (Who needs #asm anyway!) Then you can modify the
code as much as you want. This _is_ SMC without a doubt, because you can
overwrite code. So, I propose a very weak definition for SMC as code that
writes over other code.
As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
difference? Any thoughts on the subject?
Hubert Matthews
(I don't consider LISP or PROLOG programs that create code on the fly to be
SMC. Does anyone disagree?)
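The jump-table case above can be made concrete. In the fragment below the program's dispatch behaviour changes at run time, yet no instruction byte is ever overwritten, which is why most people would not call it SMC (a minimal sketch with invented names):

```c
#include <assert.h>

static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

/* A jump table: an array of pointers to functions. */
static int (*ops[2])(int, int) = { add, add };

int apply(int which, int a, int b) { return ops[which](a, b); }

void retarget(int which, int (*fn)(int, int))
{
    /* Only data (a pointer) changes; the instructions are untouched. */
    ops[which] = fn;
}
```

The table lives in the data space, so the code segment can still be shared read-only between processes.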
aglew@urbsdc.Urbana.Gould.COM (07/13/88)
>Today, people are finding new uses for self-modifying code that >were unforseen 20 years ago: > >1. ... BITBLT ... > >2. Peter Deutsch and PARCPLACE systems used a new technique to speed >up Smalltalk on conventional computers... > >In light of these new techniques, I believe it's time to reexamine >our opinions about self-modifying code. > >Don Gillies, Dept. of Computer Science, University of Illinois >1304 W. Springfield, Urbana, Ill 61801 >ARPA: gillies@cs.uiuc.edu UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies Two useful techniques. Both of them can be cast in the form of "compile an entire procedure/function" (although I have seen BITBLTs that compile code fragments and branch to them, I think those are better understood as BITBLTs with streamlined function calling sequences). One of my favourite languages was POP-2, where the compiler was a subroutine that could be used to compile code, returning what was essentially a pointer to a function. May I suggest that, perhaps, this is the moral: from the point of view of programming principles, self-modifying code where the granule is the procedure is not bad. Smaller granules are dangerous/hard to use. (I've been running a pretty good streak of stupid mistakes in my posts recently. Tell me all about them) Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801 aglew@gould.com - preferred, if you have MX records aglew@xenurus.gould.com - if you don't ...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
hankd@pur-ee.UUCP (Hank Dietz) (07/14/88)
In article <12360@ut-sally.UUCP>, nather@ut-sally.UUCP (Ed Nather) writes: > In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes: > > > > I guess there are situations with extreme performance requirements where > > self-modifying code is justified. It would be interesting to see what a > > semantics for self-modifying programs would look like. > > I agree. That was the whole point of my original posting -- to see if someone > with a talent for languages could ignore the current prejudice against this > practice, and see what might be done. If there's really no way to do it in a > reasonable manner, fine -- but just *saying* it's awful doesn't *make* it awful. The key problem with self-modifying code is that code created at runtime may REPLACE code that appeared at "compile" time, hence, in general, no STATIC understanding of the code is possible. This can make it hard for humans to understand the code, but more importantly it implies that compilers cannot perform optimizations, e.g., you cannot perform register allocation / value propagation / common subexpression elimination / in-line expansion because the code you remove might be code which is modified at runtime. The fix need not be the elimination of the generate-and-execute programming concept, rather, the language merely needs to specify constraints on the code which is to be generated at runtime. For example, key constraints would be that the runtime generated code could not redefine any existing function, could only be invoked through a statically-specified interface, and could only access data which either was statically-declared as accessible through that interface or was created as new data local to the new code. In other words, we can permit code to add to itself, but not to redefine what existed at compile time: self-extending code. An outline of fixes to be applied to Lisp (to create "Refined Lisp") appears in my PhD thesis: H. 
Dietz, "The Refined-Language Approach to Compiling for Parallel Supercomputers," June 1987. The folks working on Scheme had similar ideas.... -hankd
radford@calgary.UUCP (Radford Neal) (07/14/88)
There are interesting cases where on-the-fly generation of code seems to be essential to get good asymptotic space and/or time complexity. Consider the following code fragment (version A):

    for (i = 0; i<L; i++)
    {   if (c1) p1();
        if (c2) p2();
        ...
        if (cN) pN();
    }

The Boolean variables c1, c2, ... cN are assumed to be loop invariants. Assuming the pi all take the same constant time and code space, the time complexity of this code is O(N), and its space complexity is also O(N). (I'm ignoring the factor of L for the number of loop iterations.)

Now consider the following equivalent code, specialized for N=2, but clearly generalizable to any N (version B):

    if (c1)
    {   if (c2) for (i = 0; i<L; i++) { p1(); p2(); }
        else    for (i = 0; i<L; i++) { p1(); }
    }
    else
    {   if (c2) for (i = 0; i<L; i++) { p2(); }
        else    /* Nothing */ ;
    }

Code like this will have a time complexity of O(M), where M is the number of ci that are actually true, and a space complexity of O(2^N). Let's say that M, the number of true ci, is typically logN. This gives the following comparison for versions A and B:

    version    space      time
       A       O(N)       O(N)
       B       O(2^N)     O(logN)

Thus you get to choose between time that is exponentially worse than it might be, or space that is exponentially worse than it might be. The following lets you get both (version C):

    start generating code;
    if (c1) generate instruction to call p1;
    if (c2) generate instruction to call p2;
    ...
    if (cN) generate instruction to call pN;
    for (i = 0; i<L; i++) execute generated instructions;

This gives you the low time complexity of version B, with the low space complexity of version A. It's machine dependent, however. It _is_ possible to get the same form of time complexity without machine dependent operations, as follows (version D):

    void (*pa[N+1])();

    j = 0;
    if (c1) pa[j++] = p1;
    if (c2) pa[j++] = p2;
    ...
    if (cN) pa[j++] = pN;
    pa[j] = 0;

    for (i = 0; i<L; i++)
    {   for (j = 0; pa[j]; j++) (*pa[j])();
    }

This essentially works by generating code in a very specialized language, and then interpreting that code. This gives the same form of time complexity as version B, but with a larger constant factor - perhaps considerably larger if the pi aren't really procedures, but just short code sequences. This is the general sort of reason for the bit-blit algorithms that generate code on the fly, and I've encountered the same sort of situation in other contexts (e.g. a "life" program). One could imagine a compiler performing the transformation from version A to the machine-dependent version C, but I've never heard of such. Even if it did, it is a bit disturbing that a transformation of such significance for space/time complexity isn't expressed at the source level. This would be solved by having the compiler transform version D to the faster version C, but that's probably even harder for the compiler-writer to accomplish than the A to C transformation. Radford Neal
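Version D is plain, portable C. A runnable rendition for N=3 (with counters standing in for the pi, and invented names) shows that each pass over the generated "program" touches only the procedures whose ci were true:

```c
#include <assert.h>
#include <stddef.h>

static int calls[3];
static void p1(void) { calls[0]++; }
static void p2(void) { calls[1]++; }
static void p3(void) { calls[2]++; }

/* "Generate" the specialized program: an array of function pointers
   terminated by a null pointer; then "interpret" it L times. */
void run(int c1, int c2, int c3, int L)
{
    void (*pa[4])(void);
    int i, j = 0;

    if (c1) pa[j++] = p1;
    if (c2) pa[j++] = p2;
    if (c3) pa[j++] = p3;
    pa[j] = NULL;

    for (i = 0; i < L; i++)
        for (j = 0; pa[j] != NULL; j++)
            (*pa[j])();
}
```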
peter@ficc.UUCP (Peter da Silva) (07/14/88)
I have a question: Why are an Icache plus a Dcache better than just a big shared cache as big as both? -- Peter da Silva `-_-' Ferranti International Controls Corporation. "Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
peter@ficc.UUCP (Peter da Silva) (07/14/88)
In article <76700032@p.cs.uiuc.edu>, gillies@p.cs.uiuc.edu writes: > 1. Simple programs that need OPTIMAL speed, and would be outrageously > expensive on ANY machine. The major example is BITBLT, a subroutine > with about 10-15 parameters that does a massive amount of linear-time > work. If the BITBLT subroutines generates the machine code expressly > for the particular arguments, you can save a factor of 3 or more in > speed. This is very important for interactive graphics. You can save an even greater factor by building the BitBlt into the hardware. This is what the Commodore Amiga does, using a chip called the Blitter. I'm pretty surprised that a chipset this good that's cheap enough to be put in a consumer machine hasn't been picked up (or emulated) by the big boys. It makes a significant difference to the speed of the machine. And you could cut that ugly self modifying code out of BitBlt. Instead you just: Wait for the blitter to be free (lock it). Set up its parameters (it can take 3 source bitmaps and one destination bitmap, with any minterm). Blit (at DMA speeds, up to one word per clock). You can even return to the caller before the blit's finished, if the locking is in hardware or if the blitter can generate an interrupt to clear the semaphore. -- Peter da Silva `-_-' Ferranti International Controls Corporation. "Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
gregr@edge.UUCP (Greg Rose) (07/15/88)
The discussion of self modifying code, as it has progressed in this newsgroup, will never reach an end. This is simply because the opponents and proponents are not talking about the same things at all. "Self modifying code" such as was used on early computers, for example early CDC 6600 machines (not flaming them, just the example I know), was a NECESSARY technique for performing subroutine linkage. There was no instruction to perform a jump to an arbitrary address (i.e. the return address of the subroutine) so the subroutine entry code *had* to overwrite a jump instruction's operand (at a known, per subroutine, address) before calling. This was usually the instruction immediately at or before the start of the subroutine. In pseudo code:

        /* caller */                    /* callee */
        ld  r0,#ret_addr        subr:   jmp anywhere    /*TBA*/
        st  r0,subr+1                   ...
        jmp subr+2                      ...
    ret_addr:                           jmp subr        /* return */
        ...

(Assume that it is word addressed and the jmp instruction takes one word, as does its operand.) THIS is self modifying code at its worst, gaaakk phut, urk, where is the bathroom. Ever wondered why the early Fortran standards didn't even *allow* recursion? Even earlier machines used similar techniques for array indexing and other irrelevant stuff. "Run time generated code" is a whole different kettle of fish, and seems to be what all of the proponents in the discussion have been referring to. This is where issues such as separate I/D spaces/caches, operating system support, and so on, come in. BITBLT is, as noted, the classic example where run time generated code can do the job much faster than any parameterised loops. But load-and-go compilers are also a sensible thing in some environments. The only difference between generating code in your own address space and jumping to it, or generating it in a file and 'exec'ing it, is that the operating system knows how to do the latter correctly. Nobody really objects to optimising compilers, now do they? 
So what is the problem with optimising code generators for some other class of problem? I agree with John Mashey that the most needed thing at the moment is appropriate operating system or hardware support to address the caching and accessibility issues. The second most needed thing is to stop calling "run time generated code" "self modifying code" so people will think before barfing. Greg Rose, Softway Pty Ltd, Australia ...uunet!munnari!greg@softway.oz "Assume your favourite disclaimer since I don't think I need one"
jackson@esosun.UUCP (Jerry Jackson) (07/15/88)
In article <752@cernvax.UUCP> hjm@cernvax.UUCP (hjm) writes:

>> As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>> an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>> but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
>> difference? Any thoughts on the subject?
>> Hubert Matthews
>> (I don't consider LISP or PROLOG programs that create code on the fly to be
>> SMC. Does anyone disagree?)

The main difference that I see is that the code in one case is not reentrant. In systems like unix where it is possible for more than one process to share a code segment at runtime, the jump table is local to each process while the code is not. Imagine if vi had true self-modifying code in it... (Many users typically share a code segment for programs like vi and csh) The example in which you create an array and then call it as a function does not really fit this definition of self-modifying, since the code that is changed is created in data space at runtime and hence cannot interfere with other programs using the same code segment. *Real* SMC would be something like:

    main()
    {
        char func();
        long *longptr;

        longptr = (long *)func;
        *(longptr + 20) = 0xffffffff;    /* shudder!!!!! */
    }

+-----------------------------------------------------------------------------+
| Jerry Jackson                 UUCP:  seismo!esosun!jackson                   |
| Geophysics Division, MS/22    ARPA:  esosun!jackson@seismo.css.gov           |
| SAIC                          SOUND: (619)458-4924                           |
| 10210 Campus Point Drive                                                     |
| San Diego, CA 92121                                                          |
+-----------------------------------------------------------------------------+
cjohnson@unisoft.UUCP (Christopher Lyle Johnson) (07/15/88)
As has been already pointed out, one of the problems with self modifying code stems from separate instruction and data caches. Even on Harvard architectures, if it is possible to mark an area of code with a cache inhibit bit, SM code doesn't cause too many problems. In kernel mode or on systems where user processes can access the mmu, SM code may make sense in certain situations. I have no major aesthetic complaint if the performance win is large enough. Of course, if the code isn't cached, perhaps the win is less significant. If we want to toggle the don't cache bit (i.e. set while modifying the code, clear after the code is written), it may be difficult to claim a performance win for non-supervisor processes. If it takes two system calls per SM operation, the overhead is pretty high. (If it is faster to flush the caches, that may be an option, but there may be very little control over the granularity. How would you implement a 'flush_address(base, bounds)' system call on your favorite mmu/cache system?) However..., if a user process can't get to the mmu the operating system has a bit of a problem. I guess the trick is to write protect program text and execute protect the data. Of course, this causes an execute-protect fault on the modified code page(s). The kernel could then flush the caches and write protect the page. Again, this is a lot of overhead. (Yes, I know some mmus don't have an execute protect mode. Work arounds may exist, e.g. the function code mode on the '030. Ugh.) All in all, a person had better have a good reason for writing self modifying code. In some cases the performance win from the code may be lost in overhead. There are many other facets to this problem, too. Cheers, cj* ========================================================= This message does not reflect the views of UniSoft Corp but this sentence does.
baum@Apple.COM (Allen J. Baum) (07/15/88)
[] >In article <28200177@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes: > >>Today, people are finding new uses for self-modifying code that >>were unforseen 20 years ago: >> >>1. ... BITBLT ... >> >>2. Peter Deutsch and PARCPLACE systems used a new technique to speed >>up Smalltalk on conventional computers... >> This was supposed to be a reply to Doug Pardo, but the Brain-Damaged-Mailer bounced the reply, so I'm posting... What about the case mentioned by someone for speeding up SmallTalk? In this case, you're compiling a small routine, instead of the whole program, but have the same information. HP used something like this for a fast APL compiler long ago in a galaxy far away. They compiled code for adding to objects according to the object types at compilation time, and the code included a check for just those types. If the check failed, it bounced back to the compiler, which generated code for a more general case; so, for example, the first time it might compile code for arrays of size i=constant, and if the size changed, then it would compile slightly more general code for arrays of size n=variable, then if that also failed, maybe for arrays of rank m=var, size n=var, etc. It ran much faster than interpreters running on faster machines so it can be a big win. Again, in all cases, you know which addresses to flush before you branch to them. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
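The bounce-back-to-the-compiler idea in the APL story can be mimicked with ordinary function pointers: each "compiled" routine starts with a guard on the assumptions it was specialized for, and a failed guard rebinds the entry point to a more general routine. This is an illustrative sketch, not HP's implementation; the two-level hierarchy and all names are invented:

```c
#include <assert.h>

static int spec_size;                   /* the size the fast version assumes */
static int (*entry)(const int *, int);  /* the current "compiled" entry point */

/* Fully general version: no assumptions about its arguments. */
static int sum_general(const int *v, int n)
{
    int i, s = 0;
    for (i = 0; i < n; i++) s += v[i];
    return s;
}

/* Specialized for one fixed size.  A failed guard "bounces back to the
   compiler": here, it just rebinds entry to the general version. */
static int sum_fixed(const int *v, int n)
{
    int i, s = 0;
    if (n != spec_size) {               /* guard on the specialization */
        entry = sum_general;            /* recompile as the general case */
        return sum_general(v, n);
    }
    for (i = 0; i < spec_size; i++) s += v[i];
    return s;
}

void compile_for(int n) { spec_size = n; entry = sum_fixed; }
int  sum(const int *v, int n) { return entry(v, n); }
```

A real system would generate progressively more general machine code at each level instead of falling straight to the most general routine.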
barmar@think.COM (Barry Margolin) (07/15/88)
In article <749@cernvax.UUCP> hjm@cernvax.UUCP () writes: >If the program is considered to be data for an interpreter which I will for >convenience call a CPU, then does self-modifying code seem any different? As >long as the data in this table remains consistent, then there should be no >difficulty in dealing with it mathematically. All of the problems of database >consistency have been tackled (but not all solved) some time ago. Code is data >according to a PATCH program, and the only tricky bits there are to ensure that >any changes fit inside the space that they replace and to patch up the jumps >where necessary. A bit like a structure linked by pointers, such as a binary >tree, except that code is in general a directed cyclic graph (the directedness >is due to the program counter). This view is valid, but simplifies the issues. If you are going to think of the program as a database, you must actually consider it as a distributed database. This is because the data is replicated in several places: the physical memory, the swapping device, the CPU caches, and internal processor registers. The problems of maintaining consistency in replicated databases are much less understood than concurrency control in non-distributed databases. Additionally, any form of consistency maintenance must incur some performance penalty, and hardware designers are under great pressure to make machines go as fast as possible. Multiple caches that do not maintain consistency are one answer to this. Since self-modifying code has generally been considered bad style for quite some time, they felt comfortable implementing a hardware optimization that assumes code will not modify nearby locations. Barry Margolin Thinking Machines Corp. barmar@think.com {uunet,harvard}!think!barmar
bclaus@wright.EDU (Brian Clausing) (07/15/88)
in article <5262@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) says: > Xref: wright comp.lang.c:8563 comp.arch:4081 [...discussion about self-modifying code. It fouls up caches. It is intellectually unmanageable with current ISPs....] As to the semantics of self-modifying code, isn't it identical to a Turing machine's or am I missing something? Of course, reduction machines, like Turner's combinatoric implementation of SASL, are inherently self modifying. MB Clausing
baum@Apple.COM (Allen J. Baum) (07/15/88)
[] >In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: >I have a question: > > Why are an Icache plus a Dcache better than just > a big shared cache as big as both? In terms of hit/miss ratios, a unified cache is clearly better. However, in terms of effective access time, a split I/D cache is better. This is because both can be accessed simultaneously, instead of one having to wait for the other to finish. If both are getting accessed simultaneously (which should happen 40% of the time, if Loads/Stores account for 40% of instructions), then this more than offsets the increase in miss ratio. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
tim@amdcad.AMD.COM (Tim Olson) (07/16/88)
In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: | I have a question: | | Why are an Icache plus a Dcache better than just | a big shared cache as big as both? When the processor has separate instruction and data buses, so concurrent accesses can occur to both caches. -- -- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/16/88)
In article <1744@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes: > >There are interesting cases where on-the-fly generation of code seems >to be essential to get good asymptotic space and/or time complexity. > >Consider the following code fragment (version A): > > for (i = 0; i<L; i++) > { if (c1) p1(); > if (c2) p2(); > ... > if (cN) pN(); > } > >The Boolean variables c1, c2, ... cN are assumed to be loop invariants. (...version B deleted...) >The following lets you get both (version C): > > start generating code; > if (c1) generate instruction to call p1; > if (c2) generate instruction to call p2; > ... > if (cN) generate instruction to call pN; > for (i = 0; i<L; i++) execute generated instructions; Interesting. This reminds me a lot of the kind of compile-time manipulation that optimizing compilers do. The loop for (i = 0; i<L; i++) execute generated instructions; is an optimized version of A *when the ci are known*. If they were known at compile-time an optimizing compiler could have generated the same code. The only difference is that the self-modifying program does this *at runtime*, when we are guaranteed to know the values of the ci's. Maybe we should call this technique "run-time compilation"? Bjorn Lisper
pardo@june.cs.washington.edu (David Keppel) (07/16/88)
In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>[ Why are two caches (I, D) better than one (I+D)? ]

For a given real-estate, a non-snoopy I-cache will hold more data bits and be faster than a snoopy I-cache.

+ For a given real-estate, the I-cache hit rate will be better.  This makes
  the average instruction fetch time lower.  This may be a win if
  instruction fetch rate is at all a bottleneck.
+ For a given cache size, the real-estate can be smaller.  This means:
  + The cache may be cheaper.
  + The cache may be small enough to put closer to the instruction
    register, making it effectively faster.
+ The logic is simpler and more regular.
  + Faster design times.
  + Fewer bugs.
  + Easier to incorporate into VLSI designs (== faster).
  + Less logic to drive => faster, less power.
+ The cache doesn't need to be connected to as many things.
  + More freedom to place the cache => faster, cheaper.
  + Less external logic to drive => smaller, faster, cheaper.
+ Aliasing and data distribution are less of a problem (or no problem at
  all).  This lets you build (with much less pain):
  + Hierarchical cache organizations (faster).
  + Virtually-addressed caches (faster).

I-caches are particularly useful on machines that make heavy use of memory. This includes:

+ Many high-performance processors.
+ Shared-memory multiprocessors.

The primary problem (that I'm aware of) with non-snoopy I-caches is that you must manually flush the I-cache every time that an instruction is changed. (Strictly speaking, every time you change an I-space address that has been touched since the last cold start.) Does that answer your question ?-)

;-D on ( Faster ) Pardo
petolino%joe@Sun.COM (Joe Petolino) (07/16/88)
>> Why are an Icache plus a Dcache better than just >> a big shared cache as big as both? > >In terms of hit/miss ratios, a unified cache is clearly better. I beg to differ. We ran a few simulations, comparing split and unified caches (total cache size in the 32K-512K range), and when the caches were direct-mapped (i.e. 1-way set-associative) the unified cache performed worse. Our guess is that when code and data are forced to co-exist in the same space, you get a high probability of collision and thrashing. This effect went away when we increased the degree of set associativity. One way to think about it is that having two (direct-mapped) caches gives you some of the benefits of set-associativity. Take this with a grain of NaCl: we only tried a few test programs, and each was small enough to fit into the cache. -Joe
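The thrashing effect is visible from the index arithmetic alone: in a direct-mapped cache the line is selected by a few address bits, so a code address and a data address that agree in those bits evict each other on every loop iteration, while splitting the caches removes that interference by construction. A toy calculation in C, not a model of any real machine:

```c
#include <assert.h>

/* Direct-mapped placement: line index = (addr / line_size) % n_lines. */
unsigned cache_line(unsigned addr, unsigned line_size, unsigned n_lines)
{
    return (addr / line_size) % n_lines;
}

/* Two addresses thrash in a direct-mapped cache iff they map to the same
   line but belong to different memory blocks. */
int collides(unsigned a, unsigned b, unsigned line_size, unsigned n_lines)
{
    return cache_line(a, line_size, n_lines) ==
           cache_line(b, line_size, n_lines) &&
           a / line_size != b / line_size;
}
```

With 16-byte lines and 1024 lines (a 16K cache), code at 0x1000 and data at 0x5000 land on the same line of a unified cache; in a split arrangement they sit in different caches and cannot conflict.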
tps@chem.ucsd.edu (Tom Stockfisch) (07/16/88)
In article <33652@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes: >In article <1744@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes: >>There are interesting cases where on-the-fly generation of code seems >>to be essential to get good asymptotic space and/or time complexity. >> >> for (i = 0; i<L; i++) >> { if (c1) p1(); >> if (c2) p2(); >> ... >> if (cN) pN(); >> } >>[SMC solution follows] >> start generating code; >> if (c1) generate instruction to call p1; >> if (c2) generate instruction to call p2; >> ... >> if (cN) generate instruction to call pN; >> for (i = 0; i<L; i++) execute generated instructions;

Wouldn't the following be "nearly" as efficient, and be portable to boot? At least it has the same asymptotic complexity.

    enum { END, P1, P2, ..., Pn } code[MAXCODE], *p;

    p = code;
    if (c1) *p++ = P1;
    if (c2) *p++ = P2;
    ...
    if (cn) *p++ = Pn;
    *p = END;

    for ( pixel = 0; pixel++ < NPIXELS; )
        for ( p = code; *p != END; )
            switch (*p++)
            {
            case P1:
                [p1 instructions]
                break;
            case P2:
                [p2 instructions]
                break;
            ...
            case Pn:
                [pn instructions]
                break;
            default:
                assert(0);
                /*NOTREACHED*/
            }

-- || Tom Stockfisch, UCSD Chemistry tps@chem.ucsd.edu
sean@lfcs.ed.ac.uk (Sean Matthews) (07/16/88)
With this talk about self modifying code, no one has mentioned the theoretical work that has been done on the subject. There is a quite well known paper (and quite controversial) on the subject of programs that can modify themselves and their interpreters by Brian Smith Reflection and Semantics in a procedural language MIT-LCS-TR-272 Mass. Inst. Tech. January 1982 Reflection and semantics in Lisp 11th ACM symposium on principles of programming languages Then there is the reply to these papers by Mitchell Wand and Daniel Friedman The mystery of the tower revealed: a non-reflective description of the reflective tower CACM 1986 (this is as far as I can go since I have a copy of a copy and there is no publishing information on it) An extended version of this paper was published in Meta-level architectures and reflection, P. Maes and D. Nardi (editors) Elsevier Science Publishers B.V. (North-Holland) 1988 There is a lot of theory about self modifying systems and self referential systems but by the time you start looking into it you are in philosophical logic, not comp.architecture Se\'an Matthews arpa: sean%uk.ac.ed.aipna@nss.cs.ucl.ac.uk
mash@mips.COM (John Mashey) (07/16/88)
In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: > Why are an Icache plus a Dcache better than just > a big shared cache as big as both? 1) (Major) with split caches, you can access a word of instruction and a word of data in the same cycle. With a joint cache, you get one or the other. As a truly gross estimate, expect to lose 10-40% of your performance, given conflicts between I-refs and loads/stores. 2) Less important, but still useful. As you make a direct-mapped cache bigger, each successive 2X improves the hit-rate, but it improves it less than the last 2X. At some point, it is hard to make it bigger and keep the same speed SRAMs, and then the only way (keeping the same organization) to make the hit rate faster is to make it set-associative. Given the same total cache size, hit rates for many programs (not all, there are exceptions): joint, direct-mapped <= split, direct-mapped <= joint, 2-way set associative Note: those are Hit Rates, not performance. There are a bunch of reasons to keep caches (at least the ones nearest the CPU) direct-mapped when you can. The following is a fine paper that analyzes the various tradeoffs in cache design [rather than just hit rates]: S. Przybylski, M. Horowitz, J. Hennessy, "Performance Tradeoffs in Cache Design", Proc. 15th Annual Symposium on Computer Architecture, May 30 - June 2, 1988, Honolulu, Hawaii. (Computer Architecture News 16, 2 (May 1988), 290-298.) -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
bill@proxftl.UUCP (T. William Wells) (07/16/88)
In article <60175@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) writes: > >> Why are an Icache plus a Dcache better than just > >> a big shared cache as big as both? > > > >In terms of hit/miss ratios, a unified cache is clearly better. > > I beg to differ... > ...Take this with a grain of NaCl: we only > tried a few test programs, and each was small enough to fit into the cache. Another relevant point about separate caches is this: instructions and data have different access patterns. A cache designed for the one is not necessarily going to be right for the other; a cache designed for both may not do as well as the separate ones. So this possibility has to be balanced against the general benefit of having a larger pool to work with.
sl@van-bc.UUCP (pri=-10 Stuart Lynne) (07/17/88)
In article <752@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>I have been mulling over the idea of self-modifying code (SMC) for a while and
>I've come to the conclusion that there is no good definition of SMC.
>For example, if treating code as data is the definition, then does passing a
>procedure as a parameter in PASCAL, or a pointer to a function in C count?
>Probably not.
>OK, what about a jump table. Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC? Again, I don't think so.
>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC. Does one level of indirection really make that much difference?

A simple definition might be: any program that cannot be compiled and run in shared text mode (or the equivalent for non-Unix application environments). Modifying a jump table in your data space does not affect how the program will run for other users. Modifying a jump instruction in the shared text *will* affect how the program runs for other users.

I have used SMC in places that were by design not required to be shared and where a high degree of efficiency was required. For example, in a p-System type interpreter you need a test in the opcode fetch and dispatch loop for a pseudo interrupt. This takes a lot of time. By simply patching in a jmp to_int as the first instruction of the loop, we eliminate the need for an explicit test.

To make the above example work for multi-user systems, we used a jump indirect through a variable which contained either the ifetch address or the interrupt handler address. A little slower, but still faster than an explicit test in the ifetch loop. Similar problems arise when doing high-performance device drivers. -- Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl Vancouver,BC,604-937-7532
leo@philmds.UUCP (Leo de Wit) (07/17/88)
In article <752@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>I have been mulling over the idea of self-modifying code (SMC) for a while and
>I've come to the conclusion that there is no good definition of SMC.
>
>For example, if treating code as data is the definition, then does passing a
>procedure as a parameter in PASCAL, or a pointer to a function in C count?
>Probably not.

True. A pointer is data.

>OK, what about a jump table. Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC? Again, I don't think so.

Again true. Note that it may be possible that the system (O.S., processor) checks whether the new pointer value represents a valid address, or a valid entry point. Whether this is (always) desirable is an entirely different question. E.g. some architectures will protest if you try to jump to an odd address. PASCAL will probably leave you less room to cheat than C.

>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC. Does one level of indirection really make that much difference?

Yes, I think it does make a difference. Maybe not always, but there are cases where you can't get away with this: think of re-entrant code, or shared text segments. Now I'm thinking of recursion; what about putting the code on the stack 8-) ? No worry about re-entrancy, and the C program model becomes more complete (we already have global, static and automatic data, and global and static functions, and now there are automatic functions...).

>Of course, if you want to be really horrid in C, you can try something like
>this:
>
>char codearray[] = { 0x03, /* one byte instruction for something */
>                     0xc9  /* one byte return instruction */
>                   }

This must be (or once was) a Z80 freak! At least I recognize the 0xc9 as: RET.
On a Z80 you could do other horrible things too, since instruction sizes vary from 1-4 bytes; by carefully picking your instructions you could use the same code twice (entering it one byte off), with entirely different results. Good (?) old days, when memory was more expensive than programming effort....

>and then 'call' this function using a cast to turn the pointer codearray into
>a pointer to a function. (Who needs #asm anyway!) Then you can modify the
>code as much as you want. This _is_ SMC without a doubt, because you can
>overwrite code. So, I propose a very weak definition for SMC as code that
>writes over other code.

Note that not all implementations will allow initialized arrays to be altered. I remember a problem we had last year on VMS while passing an initialized char array to mktemp() (or whatever its name is); the program stackdumped because mktemp tried to write into the readonly array (yes, VMS put it in readonly!).

>As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>but mov (pc)+,#1234 which loads an immediate does so too)? Why the subtle
>difference? Any thoughts on the subject?

Try putting your code in ROM and see what happens. Just one example. Besides, I think jump tables are generally not altered, pointers are. The jump tables could well be in the text segment, for instance. A jump table is not an array of function pointers.

> Hubert Matthews
>
>(I don't consider LISP or PROLOG programs that create code on the fly to be
>SMC. Does anyone disagree?)

No, unless the code is buggy; such code on a fly could well be Self Multiplying 8-). Leo.
smryan@garth.UUCP (Steven Ryan) (07/17/88)
>As to the semantics of self-modifying code, isn't it identical to >a Turing machine's or am I missing something? No. The state mechanism of a Turing machine is fixed and finite. To show equivalence, it is necessary to show that self modifying code does not increase the number of states. Usually, people just appeal to Church-Turing and assume all the states are there, somewhere. For something like a load and go compiler, since every possible program is not hidden somewhere in the compiler, it would appear to escape CT. However the compiled program can be regarded as a transformation of the input tape into another region of the tape and its apparent execution as an interpretation of the tape by the compiler/loader/cpu state machine. Then again, if you believe really hard and clap your hands..... sm ryan ---------------------------------- Hugh: Why does he carry a sword? Fred: He has delusions of grandeur. him: Uh, Fred... (SWOOP) Fred: YEEK him: ...before you make a value judgement about a person's delusions... (poink) Fred: AWK! My head feathers! him: ...you'd better be absolutely sure they ARE delusions. (JAB) Fred: OUCH! -- Odd Bodkins
david@daisy.UUCP (David Schachter) (07/17/88)
In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes: >The discussion of self modifying code, as it has progressed in this newsgroup, >will never reach an end. This is simply because the opponents and proponents >are not talking about the same things at all. Actually, the discussion will never end because it changes as it goes along. Oh, and regarding the chap who asked for a definition of SMC: SMC is code that, when you look at it, you go "Blecch-- this is self-modifying code."
seanf@sco.COM (Sean Fagan) (07/17/88)
In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes: [brief summary of the SM code discussion]
>"Self modifying code" such as was used on early computers, for example
>early CDC 6600 machines (not flaming them, just the example I know)
>was a NECESSARY technique for performing subroutine linkage. There was no
>instruction to perform a jump to an arbitrary address (i.e. the return
>address of the subroutine) so the subroutine entry code *had* to overwrite
>a jump instruction's operand (at a known, per subroutine, address) before
>calling. This was usually the instruction immediately at or before the start
>of the subroutine. In pseudo code:
> [sample pseudo code]

NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays as well) had a Jump to Subroutine instruction: RJ <addr> (Return Jump) (on the 6600). What this would do is write a <JP {calling address+1},NOP,NOP> into <addr> (the machine could pack up to 4 instructions in a word [60-bit words, 15- or 30-bit instructions], so the assembler had to look for labels for RJ instructions and pad out the rest of the word with NOPs 8-)). (The machine did this because it lacked one simple little feature found on many of today's machines: a stack.) Therefore, when you did an RJ, it would perform the indicated write and start execution at <addr>+1, and, to return, you would JP (or do an unconditional branch, which wouldn't clear out the instruction cache) to the entry point, and boom! you're back where you want to be. Because of other features of the machine (such as banks of memory [i.e., the memory was accessed as one of 8(?) banks of RAM, with each succeeding address in a different bank. Accessing any individual address could be, oh, 8 times slower than the machine cycle, but accessing the next address wouldn't be.
Therefore, when you did an RJ, it could write to <addr>, and then get <addr+1> at the next cycle], and the fact that you could write to a location and execute another instruction [instruction overlap]), this was *much* faster than having to maintain a stack would be, and it also didn't tie up any of the registers (it only had 24 of them, one of which was hardwired to 0, one of which was usually 1, and 7 of which, if accessed, would either screw up memory or another register). They were very fast machines, as I've said here before. If only they didn't have NOS... >Greg Rose, Softway Pty Ltd, Australia >...uunet!munnari!greg@softway.oz Sean. --------------- Sean Eric Fagan seanf@sco (408) 458-1422, ext. 3561
elg@killer.DALLAS.TX.US (Eric Green) (07/17/88)
in article <4776@killer.UUCP>, chasm@killer.UUCP (Charles Marslett) says: > In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes: >> Another related issue is that self-modifying code cannot be executed >> directly from ROM. Executing programs in ROM is an important memory-saving >> technique in small, diskless, special-purpose systems. > > Actually, I always thought that ROM was much more important as a way of > eliminating the need to load the code and as a way of guaranteeing non- > self-modifying-ness. Usually ROM code is larger (thus memory-consuming) > than RAM-loaded code. Uh, I don't know what systems you're talking about, but for the small microcontrollers that I've used -- the ROM is assembled to be placed in the memory map at an absolute location. That is, it takes up the same amount of space whether it's in ROM or RAM (in fact, one of my favorite hardware designs is a little SRAM card that I use to spoof ROM for prototyping purposes). As for re-using memory buffers etc., those are always in RAM, so those are always re-usable -- no matter where your executable is loaded. 32K bytes of ROM costs about 1/4th of what 32K bytes of static RAM costs (not to mention the eliminating-need-to-load). Note that these are dedicated control systems using 6502's, Z-80's, and 6809's, which have a total address space of 64K -- so 32K of ROM is a heckuva lot. > Seriously, newer architectures have reduced the difference (to 0 perhaps), > but the emphasis on RISC these days may resurrect self-modifying code -- > a RISC-er technique is not known to mankind! (:-D}===<<). Actually, RISC may be the stake-in-the-heart of self-modifying code. RISC technology overcomes the increased memory bandwidth requirement by using enhanced cache technology, 256-byte-at-a-time prefetch from page-mode DRAM's, heavy pipelining, and every other trick in the book. Most of which are quite contrary to the thought of self-modifying code. 
Another RISC trick is to use the saved silicon for extra registers, and have the compilers keep as many values as possible in the CPU registers. You can always do a one-word memory-indirect load off of a register much faster than you can do a "memory-write (the address into the instruction), memory-read (the address part of the modified instruction), memory-read" triplet (which is at least three two-word instructions, versus a single one-word instruction, on a machine where one cycle = one word processed). But I HAVE done self-modifying code on a 6502-based microcontroller (out of sheer necessity). I still make an occasional enhancement to that software, over 3 years later, as new applications arise, and have had no problems with readability. After all, in assembly on a microcontroller, you have to save the address somewhere, and the fact that the "somewhere" happens to be inside an assembly language instruction makes little difference. The argument against self-modifying code lies elsewhere besides readability. -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. Box 92191 Lafayette, LA 70509 MISFORTUNE, n. The kind of fortune that never misses.
cik@l.cc.purdue.edu (Herman Rubin) (07/17/88)
In article <361@scolex>, seanf@sco.COM (Sean Fagan) writes: > In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes: > [brief summary of the SM code discussion] ........ > NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays, as well) had a Jump > to Subroutine instruction: RJ <addr> (Return Jump) (on the 6600); what this > would do is write a <JP {calling address+1},NOP,NOP> into <addr> ........ The RJ would set up the return address, but how about the address in the call? It is not that unusual to have subroutines or functions as arguments of called subroutines, or computed by the program in some other way. I have used a program which allowed the user to decide which way something should be done in a small subprogram, with the main program having many calls to a dummy procedure in that module which changed the call to the desired program. This way the dozens of places in the main program would compile the call to the dummy, and the first execution would change the call (and the return at that time). -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
g-rh@cca.CCA.COM (Richard Harter) (07/17/88)
In article <989@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: >>As to the semantics of self-modifying code, isn't it identical to >>a Turing machine's or am I missing something? >No. The state mechanism of a Turing machine is fixed and finite. >To show equivalence, it is necessary to show that self modifying code >does not increase the number of states. Usually, people just appeal >to Church-Turing and assume all the states are there, somewhere. This needs some qualification. Strictly speaking no computers are Turing machines because they do not have an infinite memory or equivalent. However almost all computers would be universal Turing machines if they somehow did. However it depends on what level you are speaking of. If you are talking about the hardware, that's a *qualified* Turing machine. You can talk about a language being a virtual Turing machine; that makes sense, albeit you want to be more careful with your language than I am being here. The language remains fixed, and most languages are universal Turing machines. In this case the SMC is data. If you want to talk about a specific program as a virtual Turing machine (probably not universal) then, yes, it can't modify itself. -- In the fields of Hell where the grass grows high Are the graves of dreams allowed to die. Richard Harter, SMDS Inc.
smryan@garth.UUCP (Steven Ryan) (07/18/88)
>to Subroutine instruction: RJ <addr> (Return Jump) (on the 6600); what this
>would do is write a <JP {calling address+1},NOP,NOP> into <addr> (the

Technically,
+       EQ   caller_address+1
-       VFD  30/0

>bit instructions], so the assembler had to look for labels for RJ

Compass forces upper on all labels.

>instructions and pad out the rest of the word with NOPs 8-)).

And SB0 B0+46000.

>did this because it lacked one simple little feature found on many of
>today's machines: a stack.)

Actually, hardware support for a stack. I implemented reentrant and recursive stack frames for doing multitasking in Compass.

>the indicated write, and start execution at <addr>+1, and, to return, you
>would JP (or do an unconditional branch, which wouldn't clear out the
>instruction cache) to the entry point, and boom! you're back where you want

And sneaky code puts exit code above the entry point so that it falls into the jump back to the caller:

RTN0    exit processing
RTN     PS              entry point.
        junk
        ZR ...,RTN0
        more junk

>[i.e., the memory was accessed as one of 8(?) banks of RAM, with each

8, Aye.

>succeeding address in a different bank. Accessing any individual address could
>be, oh, 8 times slower than the machine cycle, but accessing the next

Actually, I think a major cycle was 10 minor cycles.

>They were very fast machines, as I've said here before. If only they didn't
>have NOS...

Meaning NOS 1, I hope. NOS 2 has a lot of nifty features I wish Unix had.
aglew@urbsdc.Urbana.Gould.COM (07/18/88)
>I have a question: > > Why are an Icache plus a Dcache better than just > a big shared cache as big as both? > >Peter da Silva `-_-' Ferranti International Controls Corporation. The best data I have seen says that a big shared cache as big as both would be better - all other things being equal. The other things include stuff like cycle time, simultaneous access, line sizes. For example, with split I/D caches you can have simultaneous access to both the I and D cache in the same cycle; with a shared cache this is harder to do. The I cache doesn't need to have expensive bus watching circuitry for invalidations. The I and D caches can be located physically closer to where they will be used. The smaller the cache, the faster. An I cache can be optimized for instruction access, which is much more sequential than data access (this is the same type of argument that can lead to a separate cache, or no cache, for vectors). Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801 aglew@gould.com - preferred, if you have MX records aglew@xenurus.gould.com - if you don't ...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/18/88)
In article <1394@daisy.UUCP> david@daisy.UUCP (David Schachter) writes: >In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes: >>The discussion of self modifying code, as it has progressed in this >>newsgroup, will never reach an end. This is simply because the opponents >>and proponents are not talking about the same things at all. > >Actually, the discussion will never end because it changes as it goes along. In other words: a self-modifying discussion! :-) Bjorn Lisper
conybear@moncsbruce.oz (Roland Conybeare) (07/18/88)
From article <752@cernvax.UUCP>, by hjm@cernvax.UUCP (hjm):
> As a final note, why is it 'clean' to alter a jump table and 'unclean' to
> alter an inline constant (e.g. jmp @offset(r0) uses a value in memory as the
> address but mov (pc)+,#1234 which loads an immediate does so too)? Why
> the subtle difference? Any thoughts on the subject?
>
> Hubert Matthews

I can see several reasons.

* The big, big reason for referring to code via pointers, and getting the effect of self-modifying code via such pointers, is that you make your changes independent of the size of the code. Real SMC will only work when the new code is no larger than the old code. I think this is a very restrictive assumption.

* When you alter a jump table (in C, at least) you are doing so within the language, and can expect the compiler to understand you. A language which allows you to modify instructions directly would of necessity depend strongly on the machine architecture to run these instructions. Otherwise, why don't we all use Universal Assembly Language?

Roland Conybeare conybear@moncsbruce.oz
daveb@geac.UUCP (David Collier-Brown) (07/18/88)
From article <33441@yale-celray.yale.UUCP>, by lisper-bjorn@CS.YALE.EDU (Bjorn Lisper):
> In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>>In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes: [a long, and generally well-reasoned debate elided]
>>Sorry, no. I've heard LOTS of arguments against programs that generate their
>>own code,

Gentles, could we ***PLEASE*** reserve the term "self-modifying code" for code which actually modifies itself on the fly (e.g., for generating indexes by instruction modification instead of using index registers)? Much of what is being discussed here is part of the well-known and respectable "generate and execute" paradigm. The only difference between this and the normal code-generation paradigm is that an application generates code at run-time, not a compiler at a previous time. Mixing the two is making this discussion "noisy". --dave (sorry about the pedantry, but it's important) c-b -- David Collier-Brown. {mnetor yunexus utgpu}!geac!daveb Geac Computers Ltd., | Computer science loses its 350 Steelcase Road, | memory, if not its mind, Markham, Ontario. | every six months.
peter@ficc.UUCP (Peter da Silva) (07/18/88)
In article <1110@ficc.UUCP>, karl@ficc.UUCP (karl lehenbauer#) writes:
[ about some neat examples of SMC ]
What about the PolyForth scheduler? Each entry in the process table contained
a jump instruction. The next instruction was the save space for the current
PC in a context switch. So a context switch involved jumping to the next
process's process table entry. When a process waited, it put the address of
the next process table entry in that location. This meant that the process
was going to effectively busy-wait, but it's a *very* efficient busy-wait.
For the sort of tight real-time control that PolyForth was designed for, this
was apparently the most efficient way of doing things.
--
Peter da Silva `-_-' Ferranti International Controls Corporation.
"Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
henry@utzoo.uucp (Henry Spencer) (07/19/88)
In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: >You can save an even greater factor by building the BitBlt into the >hardware.... No. You can save a large factor over *poor* software implementations of BitBlt. Not over good ones. The dynamic-compilation implementations on the Blit et al run the memory at nearly full speed for the usual cases of BitBlt; it is fundamentally impossible to do better. Commodore et al find hardware implementations attractive because they don't know how to write good software. -- Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu
baum@Apple.COM (Allen J. Baum) (07/19/88)
[] For those of you interested in on-the-fly generated code, here is a reference to an interesting article: F. Pagan, "Turning Interpreters into Compilers", Software: Practice and Experience, June 1988. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
rcd@ico.ISC.COM (Dick Dunn) (07/19/88)
First, a clarification: Self-modifying code was *not* necessary on the 6600 CPU. The call instruction generated a return instruction. However, it *was* "necessary" to use self-modifying code in the peripheral (I/O) processors, because there were certain operations on channels which required an immediate operand (the channel number)--the only reasonable way to select a channel at execution time was to modify the instruction. (A jump table of instructions with constant channel numbers would have worked, but would have been far too slow.) Back to the subroutine call in the CPU, which worked by planting a return instruction at the target, then jumping one word beyond. Herman Rubin asked: "And how would you call a function which is an argument?"... > The RJ would set up the return address, but how about the address in the call? > It is not that unusual to have subroutines or functions as arguments of called > subroutines, or computed by the program in some other way. Of course--even in FORTRAN this is common, and the 6600 was a FORTRAN machine if ever there was one. The answer is simple: You emulate the RJ instruction. You form the return instruction (actually this can be done at compile time to save execution) and plop it into memory just as the RJ would have done; then you jump to the target routine. The jump address is formed from a register plus a constant, so it can be a variable or computed address with no problem. This approach roughly doubles the time to call a routine; the return is unaffected. [It is one of life's enduring mysteries why the "RJ" (subroutine call) instruction allowed only a constant operand, while the "JP" (unconditional branch) allowed register plus constant! There was room in the RJ instruction format for a register.] It's interesting to look at what it costs to do subroutine calls with return addresses on a stack (instead of plunked into memory) so that you can have recursive calls. 
It's only about 1.5* the time for the native call/return to constant address, and it's about a break-even for call/ return to variable address. (It's slightly faster on the 6400, slower on the 6600.) -- Dick Dunn UUCP: {ncar,nbires}!ico!rcd (303)449-2870 ...Are you making this up as you go along?
hwfe@ur-tut (Harlan Feinstein) (07/19/88)
Reasons for writing self-modifying code have been scarce, and have basically centered around speed. But speed is not the only reason someone would want to write s-m code. I've seen a program for the IBM PC called PC-LOCK that prohibits one from accessing the hard disk without providing a password. The install and de-install programs are self-modifying to make tracing them all but impossible. One time last summer I changed my password right before the weekend, and wouldn't you know it, come Monday I had no clue as to what the password was. I tried all kinds of disassembly and debug tracing to no avail. Hence, another reason for self-modifying code. --harlan
hjm@cernvax.UUCP (hjm) (07/19/88)
Programmers have become quite accustomed to the destructive assignment of data ( x := y + 2 ), so why should the destructive changing of code be any different? If the practice were more common, then it wouldn't seem quite so bad, surely. After all, a Turing machine with a finite length tape is not *that* difficult ... Hubert Matthews
david@daisy.UUCP (David Schachter) (07/20/88)
In article <1988Jul18.231158.19500@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >The dynamic-compilation implementations on >the Blit et al run the memory at nearly full speed for the usual cases >of BitBlt; it is fundamentally impossible to do better. Commodore et al >find hardware implementations attractive because they don't know how to >write good software. I disagree. It is not necessarily true that good software blit code can run as fast as hardware supported blit operations. Sometimes, maybe, but not always. Henry's claim "it is fundamentally impossible to do better" doesn't fly. What about a CPU with a 16 bit bus competing with a blit chip with 64 bit access to the frame buffer? What about a CPU that must fetch instructions on the same bus used for data? Henry's statement is over-broad. -- David Schachter #include "disclaimer.legal" #include "funny.quote"
rcd@ico.ISC.COM (Dick Dunn) (07/20/88)
One other quick note about self-modifying code on the CDC 6000/7000 series: It was generally a very bad idea to do it because of the instruction lookahead. The 7600 looked farther ahead than the 6600, and the 6400 only had a one-word instruction lookahead. Thus, depending on the processor, it might or might not have already fetched an instruction before you modified it--leading to the superficially baffling phenomenon that the processor would appear not to have executed the instructions which would show up in a memory dump. But it's an ill wind, etc... The fact that self-modifying code behaved differently (but predictably so) on different processor models was used to create a code sequence which could identify the processor at run time and let code adapt itself accordingly. -- Dick Dunn UUCP: {ncar,nbires}!ico!rcd (303)449-2870 ...Are you making this up as you go along?
darin@laic.UUCP (Darin Johnson) (07/20/88)
In article <1988Jul18.231158.19500@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
> >You can save an even greater factor by building the BitBlt into the
> >hardware....
>
> No. You can save a large factor over *poor* software implementations of
> BitBlt. Not over good ones. The dynamic-compilation implementations on
> the Blit et al run the memory at nearly full speed for the usual cases
> of BitBlt; it is fundamentally impossible to do better.

However, a hardware BitBlt using DMA can be operating while the CPU is servicing other processes, with no slowdown in CPU speed. So even if you can squeeze your BitBlt routine down to the bare minimum and it takes x CPU cycles to do the blit, that is x cycles that are wasted. Granted, if you are using a single-tasking machine, you won't get much speedup using hardware blits. Since the Amiga is multi-tasking, its blitter chip does provide speedup. Also, the Amiga blitter can do logical operations (on three inputs) just as fast as a normal memory copy - it'd be pretty hard to hack some 680x0 code to XOR three inputs and write to three outputs as fast as it can move memory. Also, the overhead is very low compared to the overhead involved in dynamic compilation. If dynamic compilation is that much faster, then you would have expected someone to have written a program that uses this technique on the Amiga, rather than using the blitter. So far, I haven't seen any. If one shows up that is comparable in speed to the hardware, I would bet that the overhead and memory involved are high enough that the program's usefulness is diminished. -- Darin Johnson (...pyramid.arpa!leadsv!laic!darin) (...ucbvax!sun!sunncal!leadsv!laic!darin) "All aboard the DOOMED express!"
colwell@mfci.UUCP (Robert Colwell) (07/21/88)
In article <759@cernvax.UUCP> hjm@cernvax.UUCP () writes: > >Programmers have become quite accustomed to the destructive assignment of >data ( x := y + 2 ), so why should the destructive changing of code be any >different? If the practice were more common, then it wouldn't seem quite so >bad, surely. After all, a Turing machine with a finite length tape is not >*that* difficult ... Hubert, I was going to let this whole SMC discussion slide on by, but your post finally convinced me otherwise. First, you meant "y" := y + 2, not "x", right? Second, it seems like only yesterday when we (the royal we) CPU architects were so concerned with trying to narrow the semantic gap between what a programmer was trying to express and what the machine/language was coercing her into. Languages like Ada and machine architectures like capability machines were intended to address this perceived need. The basic idea is that (oversimplifying drastically) in a small programming environment (you at your standalone workstation) anything goes in terms of protection and language type checking, etc. The interesting domain is that of programming-in-the-large, with large teams of programmers producing hundreds of thousands of lines of code (SDI/Shuttle). There, extracting the last nano-mip of performance is of far less interest than in avoiding bugs and producing maintainable code, and I still believe that supporting that environment is a far more important challenge to architects than worrying about SMC could ever be. And I don't think (heavy sarcasm here) that UNIX/C is the ultimate answer to that environment. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
anw@nott-cs.UUCP (07/21/88)
<Reminiscence mode on> The Manchester University ATLAS, on which I cut my teeth (violins), included in its standard (ABL) Assembler a most extraordinary example of SMC. Each 48-bit instruction consisted of a 10-bit opcode, two 7-bit register numbers and a 24-bit address. If the MSB of the opcode was 1, the operation was an extracode, essentially one of 512 hard-wired subroutines [trig fns, complex arith, double arith, tape motions, I/O, etc, and how 512 such subroutines were squeezed into 4096 words, shared with a compiler, supervisor, test routines etc is another gory story]. If it was 0, then the next bit was ignored, and the following two bits determined the type of operation (00 illegal, 01 integer, 10 test, 11 floating). ABL included 128 user-settable flags. Where were these stored? Where else but in the ignored bits of 128 consecutive instructions within ABL itself which happened not to be extracodes. I remember being amazed by the discovery. I'm still not sure whether one should be appalled, or whether one should marvel at the ingenuity. All those people who were recently distrustful of an optimising assembler would have been really dubious about a self-modifying assembler! I once, and only once, wrote SMC myself. Also on ATLAS. It was a self-initialising random number generator. First time through, it initialised itself, either from the clock or from the user-supplied seed, then pulled the real code in on top of itself. Must have saved 3us per call, may well have amounted to 0.1% of the total run-time in heavy simulations. I was proud of it at the time. ATLAS would be a fine candidate for Bob Webber's collection of historic computers for which there ought to be simulations. I still have a cupboard full of machine-code listings of most of the important software. <Reminiscence mode off> -- Andy Walker, Maths Dept., Nott'm Univ., UK. anw@maths.nott.ac.uk
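The flag-in-the-ignored-bit trick described above can be sketched in C: steal a bit the hardware ignores in certain instruction words and use it as a boolean flag, without disturbing the still-executable instruction. The word width and bit position here are invented for illustration; this is not the real ATLAS encoding:

```c
#include <stdint.h>

/* Hypothetical "ignored" bit in a 48-bit instruction word held in a
 * 64-bit container.  On ATLAS, when the opcode MSB was 0 the next
 * bit was ignored by the hardware; ABL hid its user flags there. */
#define FLAG_BIT ((uint64_t)1 << 46)

void set_flag(uint64_t *insn)   { *insn |=  FLAG_BIT; }
void clear_flag(uint64_t *insn) { *insn &= ~FLAG_BIT; }
int  test_flag(uint64_t insn)   { return (insn & FLAG_BIT) != 0; }
```

The instruction word doubles as data storage: setting or clearing the flag changes nothing the decoder looks at, so the code still runs identically.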
guy@gorodish.Sun.COM (Guy Harris) (07/21/88)
> Second, it seems like only yesterday when we (the royal we) CPU > architects were so concerned with trying to narrow the semantic gap > between what a programmer was trying to express and what the > machine/language was coercing her into. Languages like Ada and > machine architectures like capability machines were intended to > address this perceived need. A naive (and not rhetorical) question: what evidence is there to indicate the degree to which "narrowing the semantic gap" with capability machines and the like would improve the productivity of programmers or the reliability of programs, and to which other techniques (e.g., making a very fast conventional machine, possibly but not necessarily using RISC technology, and using that speed to improve the programming environment with better software) achieve the same goal?
peter@athena.mit.edu (Peter J Desnoyers) (07/21/88)
In article <60782@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: > >A naive (and not rhetorical) question: what evidence is there to indicate the >degree to which "narrowing the semantic gap" with capability machines and the >like would improve the productivity of programmers or the reliability of >programs, and to which other techniques [fast machines, good software] >achieve the same goal? Some protection and tracing features are MUCH slower in software. These features are also useful in writing wonderful debuggers. You need a bus snooper or ICE with an 8088 to achieve debugger features that can be done in software on the 68020 or 80386. And per-task memory protection is a big win in any environment, but impossible in software. Peter Desnoyers peter@athena.mit.edu
guy@gorodish.Sun.COM (Guy Harris) (07/22/88)
> >A naive (and not rhetorical) question: what evidence is there to indicate > >the degree to which "narrowing the semantic gap" with capability machines > >and the like would improve the productivity of programmers or the > >reliability of programs, and to which other techniques [fast machines, good > >software] achieve the same goal? > > Some protection and tracing features are MUCH slower in software. > These features are also useful in writing wonderful debuggers. You > need a bus snooper or ICE with an 8088 to achieve debugger features > that can be done in software on the 68020 or 80386. And per-task > memory protection is a big win in any environment, but impossible in > software. OK, maybe I didn't make myself clear. I'm not referring to features you find on many random processors or systems out there, such as memory management units or "trap on transfer of control/trap on reference to particular location" breakpoints. I'm referring to the sort of architectural features you find in machines that, say, require references to objects to go through a "descriptor" for the object, to do bounds checking and the like. One instantiation of the question would be "are you better off doing that or having a fast machine plus a compiler that can e.g. generate bounds checking code but avoid doing it in some cases where it's 'known' that you don't need it?" For instance, don't bother doing it in double a[SIZE], b[SIZE], c[SIZE]; for (i = 0; i < SIZE; i++) a[i] = b[i] + c[i]; (or the FORTRAN, or Ada, or COBOL, or... equivalent), since you know "i" will always be within the bounds of all the arrays. Not "do you think you'd be better off" or "do you have an analysis that makes it 'intuitively obvious' that you'd be better off", but "has anybody compared actual implementations of the two approaches, and concluded that, with everything taken into account, you're better off with one or the other?" 
I.e., taking into account the fact that microcode and hardware is like software in that it is possible for it to be buggy (either designed incorrectly or implemented incorrectly), and taking into account the time it takes to design that system, and....
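One way to make the example above concrete in C: write the per-access check out explicitly as a helper (a stand-in for what a descriptor machine would do in hardware; `checked` is an invented name, not any real API). Because `i` runs over exactly 0..SIZE-1, a compiler can prove every check true at compile time and delete them all:

```c
#include <assert.h>

#define SIZE 100

double a[SIZE], b[SIZE], c[SIZE];

/* The runtime bounds check in question, made explicit in software. */
double checked(const double *p, int i, int n)
{
    assert(i >= 0 && i < n);
    return p[i];
}

/* The loop from the article: every check is statically redundant,
 * since the loop bounds guarantee i is in range for all three arrays. */
void vector_add(void)
{
    for (int i = 0; i < SIZE; i++)
        a[i] = checked(b, i, SIZE) + checked(c, i, SIZE);
}
```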
henry@utzoo.uucp (Henry Spencer) (07/22/88)
In article <1404@daisy.UUCP> david@daisy.UUCP (David Schachter) writes: >In article <1988Jul18.231158.19500@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >>The dynamic-compilation implementations on >>the Blit et al run the memory at nearly full speed for the usual cases >>of BitBlt; it is fundamentally impossible to do better... > >I disagree. It is not necessarily true that good software blit code can >run as fast as hardware supported blit operations... On the Blit hardware? I doubt it. In any case, if the implementation uses the full memory bandwidth (a more precise form of my earlier statement), then it *is* fundamentally impossible to do better. The ways around this generally involve structuring the hardware in such a way that the hardware has access to memory bandwidth that cannot be used by the software. Other things being equal (which they often aren't, of course), one may question whether this is really such a smart idea. >... What about a CPU with a 16 bit bus competing with a blit >chip with 64 bit access to the frame buffer? One wonders whether it might be a better investment to put in a wider CPU, which should speed up *everything*, not just BitBlt. >What about a CPU that must >fetch instructions on the same bus used for data? What about it? Most modern CPUs have ways around this on at least a small scale. The Blit CPU is a lousy old 68000, not even a 68010; by using fun instructions like move-multiple, one can build code (for the usual BitBlt cases) that needs instruction fetches relatively rarely. I agree that it's possible to break the hardware in such a way that a software BitBlt is inherently slow. I prefer unbroken hardware myself. -- Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu
rcd@ico.ISC.COM (Dick Dunn) (07/22/88)
> > >You can save an even greater factor by building the BitBlt into the > > >hardware.... > > No. You can save a large factor over *poor* software implementations of > > BitBlt... > However, a hardware BitBlt using DMA can be operating while the CPU is > servicing other processes, with no slowdown in CPU speed... First, that's only true if either the bitblt is being done to *and* from memory not accessed by the CPU, or the memory bandwidth is high enough to handle both the CPU and the bitblt DMA cycles. I'd guess that's rare; we seem to be afflicted with memory-bound CPUs in a lot of systems. But more importantly, in many cases you can't do anything else because either (a) there's only one process running or (b) you can't afford the context-switch time. Unless you get a bitblt operation that takes longer than *two* context switches (out and back), it costs more. For the case where you're bitblting characters (probably the most common case), you will lose big. > If dynamic-compilation is that much faster, then you would have expected > someone to have written a program that does uses this technique on the Amiga, > rather than using the blitter. So far, I haven't seen any... Ummm...but what does this prove? Not much, I think, until someone actually tries it and finds out whether it's slower. Given that the Amiga already has a useful bitblt (of whatever form), there's no great incentive to write another. -- Dick Dunn UUCP: {ncar,nbires}!ico!rcd (303)449-2870 ...Are you making this up as you go along?
colwell@mfci.UUCP (Robert Colwell) (07/22/88)
In article <60782@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: >> Second, it seems like only yesterday when we (the royal we) CPU >> architects were so concerned with trying to narrow the semantic gap >> between what a programmer was trying to express and what the >> machine/language was coercing her into. Languages like Ada and >> machine architectures like capability machines were intended to >> address this perceived need. > >A naive (and not rhetorical) question: what evidence is there to indicate the >degree to which "narrowing the semantic gap" with capability machines and the >like would improve the productivity of programmers or the reliability of >programs, and to which other techniques (e.g., making a very fast conventional >machine, possibly but not necessarily using RISC technology, and using that >speed to improve the programming environment with better software) achieve the >same goal? As far as I know, there is no evidence that you would necessarily find compelling, but then I could say that same thing about almost everything else in this field, too. There are, on the other hand, some good reasons to believe that we can do better than imperative languages running on essentially unprotected architectures. I know I can't do this topic justice in this forum, but here's my quick take on the subject. Think about the different computer languages you have used, and what was good or bad about each. Backus argued (in his famous paper on functional languages) that one of the reasons that Lisp is a good language is that you don't have to mentally execute a program (as you do with imperative languages) in order to convince yourself that it will do what is wanted. The language allows you to more closely express what is desired, so you test correctness by inspection. And yes, there are cases where it's more obvious/easier-to-express something in C than Lisp. 
But both examples serve to illustrate the point -- you, as a programmer, have some virtual entity you want to realize (data structure or algorithm), and the closer you can get to realizing exactly what you have in mind, the more likely the code is to be correct, maintainable, and understandable (and the smaller the semantic gap between what you want and what you can express). That's partly an allegory. The capability people argue that the same thing extends into all areas of computer systems. Recall the classic arguments about turning off runtime bounds checking to reclaim that lost performance -- why should a programmer, in the best of all possible worlds, have to worry about things like that? It doesn't help any of the major productivity metrics. Say what you want about Ada on the 432 (and I've said plenty already), as a programming environment (forget the speed problems temporarily) it was really nice. Your program had access to only the data structures you specified, and only in the ways you specified, and if your code screwed up, the machine could tell you where and in what way. To me, the hardest bugs to find are those where you fix one bug (in someone else's code, of course) and inadvertently break it in some obscure way such that something completely unrelated is getting trashed. For this to even be possible (in my opinion) means that the programming environment is lacking something crucial, notwithstanding that about 95% of all programmers on this planet see just such an environment. The environment is failing to express what the programmer wanted, and it's a combined failure of the machine architecture, the language, and probably the OS too. The semantic gap argument says that the more the desired and achieved environments differ, the larger the quantity of bugs, and the worse their obscurity will be. 
I know you're tempted at this point to say that even if one grants this as a shortcoming of current architectures/environments, there have been no successes so far at addressing it. That's another topic of conversation, I think; all I'm trying to do here is offer some of the justification for why a lot of research has gone into other ways to compute (that don't have sheer instrs-executed-per-unit-time as their primary figure of merit). All the strong-type-checking stuff built into some languages has the same motivation as above. For me, the bottom line is that, as usual, there aren't any easy answers (because if there were somebody would've patented them by now), but we shouldn't lose track of the corresponding questions just on that basis. The problem is getting worse, after all -- more, not fewer, lives currently depend on the correctness of software than ever before, and that trend will continue (fly-by-computer airplanes, dynamically-unstable planes, military systems, ...). I assume that the reason people like you and me are concentrating on performance is that that's what sells. I don't think that needs defending, but I also see few reasons to believe that sheer performance alone will provide the cures to the problems I tried to outline above. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
henry@utzoo.uucp (Henry Spencer) (07/22/88)
In article <302@laic.UUCP> darin@laic.UUCP (Darin Johnson) writes: >... a hardware BitBlt using DMA can be operating while the CPU is >servicing other processes, with no slowdown in CPU speed... Uh, using what for memory bandwidth? A good implementation of BitBlt, whether in software *or* hardware, will run the memory flat out, leaving nothing for "servicing other processes". (Caches help a lot on instruction fetches, but data has this annoying tendency to require real memory fetches.) If your CPU is sitting there waiting for memory, it might as well be doing the BitBlt itself. >...the Amiga blitter can do >logical operations (on three inputs) just as fast as a normal memory >copy - It'd be pretty hard to hack some 680x0 code to XOR three inputs >and write to three outputs as fast as it can move memory. True, but who cares? The overwhelmingly common cases of BitBlt are doing an operation like writing a character, where the software overhead of deciding what to do totally dominates doing it, and bulk movement of bits, where a straight copy operation can run memory-bound on a well-built box with any good implementation. >Also, the >overhead is very low compared to the overhead involved in dynamic compilation. See previous paragraph. The Blit code does not do dynamic compilation for the small cases where overhead is dominant anyway (when you're drawing a character, you are only moving a few words of data, and figuring out which words to move is much harder than actually doing it, regardless of whether the actual doing is hardware or software), only for the big ones that can amortize the overhead well. BTW, the overhead is smaller than you think. >If dynamic-compilation is that much faster, then you would have expected >someone to have written a program that does uses this technique on the Amiga, Yes. Hence my harsh words about the competence of the people involved. 
Note that doing a really good software BitBlt is A LOT OF WORK -- it's not the sort of thing an amateur knocks off in an evening. The techniques are not that well known (especially in the micro community, which is notorious for its ignorance of the state of the art -- the people at Commodore most likely had no idea it was even possible to do a fast software BitBlt). Note that some of the earlier Suns had quite a bit of hardware BitBlt assist, and the newer ones don't. Sun learned. Commodore will, someday. -- Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu
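For what it's worth, the easy inner loop of such a BitBlt, the word-aligned bulk-copy case, looks roughly like this in C. Real implementations add edge masking, shifts, and overlap handling, which is where most of the work lies; on a 68000 the inner loop would be unrolled with move-multiple rather than written as a C loop:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy a rectangle of full 32-bit words between two bitmaps.
 * Strides are in words.  This handles only the fully aligned case;
 * a real BitBlt spends most of its code on the unaligned edges. */
void blit_rows(uint32_t *dst, const uint32_t *src,
               size_t words_per_row, size_t rows,
               size_t dst_stride, size_t src_stride)
{
    while (rows--) {
        for (size_t i = 0; i < words_per_row; i++)
            dst[i] = src[i];
        dst += dst_stride;
        src += src_stride;
    }
}
```

A dynamic-compilation BitBlt in the Blit style emits a straight-line sequence of moves specialized to the rectangle's width and alignment, then jumps to it, so the per-word loop overhead above disappears for large blits.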
colwell@mfci.UUCP (Robert Colwell) (07/22/88)
In article <60859@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> >A naive (and not rhetorical) question: what evidence is there to indicate
>> >the degree to which "narrowing the semantic gap" with capability machines
>> >and the like would improve the productivity of programmers or the
>> >reliability of programs, and to which other techniques [fast machines, good
>> >software] achieve the same goal?
> ...
>Not "do you think you'd be better off" or "do you have an analysis that makes
>it 'intuitively obvious' that you'd be better off", but "has anybody compared
>actual implementations of the two approaches, and concluded that, with
>everything taken into account, you're better off with one or the other?" I.e.,
>taking into account the fact that microcode and hardware is like software in
>that it is possible for it to be buggy (either designed incorrectly or
>implemented incorrectly), and taking into account the time it takes to design
>that system, and....

You are asking for way too much, Guy. Computer system designers can't even agree on performance metrics, and performance is probably the easiest thing to measure about a computer system (as compared to, say, reliability or programmer productivity). Even the units in which programmer productivity is usually measured are suspect -- lines-of-debugged-code per person per unit time???? My intuition is that the practical answer is obvious -- if you have a machine that is 2x the performance of mine, the prospect of my environment being more "productive" than yours (without even being able to quantify the difference) isn't going to help me sell very many machines. Since this is a user-perception problem, technical answers won't necessarily help, either.
But my intuition is also that, for the same reasons that we instinctively reach for the best tools for the job at hand (C, assembly, Lisp, Ada, etc.), we would do considerably better as programmers if the "thing" (including architecture, language, & OS) executing our code matched our expectations as closely as possible, and with few surprises. And further, if it made sure we knew the nature of whatever surprises occurred, rather than (a C example) having an array-pointer-out-of-bounds happen to trash a scalar value that was coincidentally declared nearby. I don't see higher performance as a very good means of preventing errors like that. I think the reason you won't find many academic or formal studies attempting to quantify the benefits of object-orientation (or other higher-productivity programming environments) is that such studies would be fraught with difficulty and expense, and would be unlikely to yield completely unambiguous results. Hell, there are still people who think assembly language programming is better; how would you prove to them they're wrong? (If they smoke, too, you'll know you're in big trouble if all you've got is a rational argument on your side.) Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
elg@killer.DALLAS.TX.US (Eric Green) (07/23/88)
in article <1988Jul18.231158.19500@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: > In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: >>You can save an even greater factor by building the BitBlt into the >>hardware.... > No. You can save a large factor over *poor* software implementations of > BitBlt. Not over good ones. The dynamic-compilation implementations on > the Blit et al run the memory at nearly full speed for the usual cases > of BitBlt; it is fundamentally impossible to do better. Commodore et al > find hardware implementations attractive because they don't know how to > write good software. I won't comment on Commodore's software guys (I'm typing this on an Amiga, so, depending on when I last found a bug in the OS, I'm either in complete agreement or violent disagreement), but there's one big advantage of a bitblt chip over even a tight assembly-language memory-move loop (using DBcc, no less): instruction fetch. On newer processors such as the 68020 and 68030, instruction fetch during a loop is no problem, since such a small tight loop is almost certain to be in the cache. But on older processors such as the 68000 or 68010, you would burn up 3/4ths of your bus bandwidth just fetching instructions! (no cache). Also note that even fetching instructions out of cache, it still takes at least the 68020 some time to execute them (presumably, the 68030's pipelining will keep instruction fetch & execution overlapped). The hardware blitter, of course, will never have instruction fetches... Given the technology of 1983-1984, a hardware blit chip made a lot of sense, then. Now.... well, the 68030 costs a bundle, so why pay more for a blit chip that can't go any faster anyhow? -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. Box 92191 Lafayette, LA 70509 PC Pursuit, n., a service specializing in mangling your postings, by eating characters so you can't see what you're typing.
elg@killer.UUCP (07/23/88)
in article <7439@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) says:
>
>> > >You can save an even greater factor by building the BitBlt into the
>> > >hardware....
>> > No. You can save a large factor over *poor* software implementations of
>> > BitBlt...
>> However, a hardware BitBlt using DMA can be operating while the CPU is
>> servicing other processes, with no slowdown in CPU speed...
> First, that's only true if either the bitblt is being done to *and* from
> memory not accessed by the CPU, or the memory bandwidth is high enough to
> handle both the CPU and the bitblt DMA cycles.

That's exactly what the Amiga does. There is 512K of RAM dedicated to screen operations, while the OS and (with extended RAM) user programs run out of RAM or ROM which is on a separate bus inaccessible to the blitter. Thus, while the blitter runs, user programs are free to run without bus contention. In addition, the blitter takes advantage of the fact that the 68000 only uses half the available memory bandwidth, and even when the 68000 is accessing "chip" RAM (blitter-accessible RAM), the blitter can still run at half speed until the 68000 goes somewhere else. All in all, a very useful and elegant design, given its age ('83-'84 or thereabouts) and the low cost of implementation ($650 for a complete machine, with disk drive, whereas 68020s and 68030s would cost almost the same as the complete A-500 after production costs and two levels of markups).

> context-switch time. Unless you get a bitblt operation that takes longer
> than *two* context switches (out and back), it costs more. For the case
> where you're bitblting characters (probably the most common case), you will
> lose big.

That's one of the major flaws that they found in the Amiga's handling of text i/o... Charlie Heath of Microsmiths replaced the bitblt with plain old memory moves, and it is almost twice as fast.
However, I still would not want to use a plain old 68000 machine which had to use a software loop to scroll a 32K bitmap.... I've seen Macs and ST's, which have much simpler hardware and software which should be faster than the Amiga's message-passing OS (which requires a context switch to send the text to the console.device process , then another context switch to get back to the user process)... and the speed of window management and scrolling simply cannot compare. That hardware makes more difference than you'd think, given all the other factors involved. -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. Box 92191 Lafayette, LA 70509 PC Pursuit: A diabolical conspiracy to cause spelling flames upon the net, by rendering the typer inable to read what he has written.
henry@utzoo.uucp (Henry Spencer) (07/24/88)
In article <473@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: >... The capability people argue that the same >thing extends into all areas of computer systems. Recall the classic >arguments about turning off runtime bounds checking to reclaim that >lost performance -- why should a programmer, in the best of all >possible worlds, have to worry about things like that? ... My counterargument is that it is almost as much of an imposition on the programmer to have such checks done at runtime as it is to have them not done at all. Given the impossibility of complete testing, there is no way to guarantee such tests will catch a bug during development. This means that the programmer has to go through the same complicated exercise of trying to assure himself that the problem will never happen. What is really wanted is a way to check these properties at COMPILE TIME... in which case the runtime checks become largely superfluous anyway. Can it be done? Well, in one sense the answer is clearly yes, because a proof of program correctness has to include it, and we know that automating such proofs is possible (although seldom practical at the moment). The question is whether it can be done with practical, affordable machinery without crippling the language. My own long-held conjecture is that the answer is "yes", but I don't have proof of that yet. -- Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (07/24/88)
In article <572@tuck.nott-cs.UUCP> anw@maths.nott.ac.uk (Dr A. N. Walker) writes: >...self-initialising random number generator. First time through, it >initialised itself, either from the clock or from the user-supplied seed, >then pulled the real code in on top of itself... Let us not forget, also, that self-modifying code wasn't uncommon in bootstraps, back in the days when ROMs were expensive or unavailable. The disk bootstrap for a late-model PDP-8 was two instructions, so brief that nobody ever bothered putting it in ROM, and it was about as self-modifying as you could get: power-up-reset happened to leave the disk controller in just the right state for reading block 0 into memory starting at 0, so the bootstrap -- toggled in at a specific place in low core -- was "kick the disk controller; branch to self". It got overwritten as block 0 came in, and of course block 0 was carefully crafted to catch control as the "branch to self" was overwritten! -- Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu
colwell@mfci.UUCP (Robert Colwell) (07/24/88)
In article <1988Jul23.221743.22169@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <473@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>... The capability people argue that the same
>>thing extends into all areas of computer systems. Recall the classic
>>arguments about turning off runtime bounds checking to reclaim that
>>lost performance -- why should a programmer, in the best of all
>>possible worlds, have to worry about things like that? ...
>
>My counterargument is that it is almost as much of an imposition on the
>programmer to have such checks done at runtime as it is to have them not
>done at all. Given the impossibility of complete testing, there is no
>way to guarantee such tests will catch a bug during development. This
>means that the programmer has to go through the same complicated exercise
>of trying to assure himself that the problem will never happen. What is
>really wanted is a way to check these properties at COMPILE TIME... in
>which case the runtime checks become largely superfluous anyway.

We probably agree that the compiler should check as much as it possibly can; the earlier an error is detected, the less the ambiguity about what is wrong and the cheaper the cure. But there is an awful lot the compiler can't know about -- programs that use input data, generate pointers, or do just about anything else that's interesting are going to do many operations that are difficult for the compiler to validate at compile time. I'm not sure I catch your drift on the imposition of runtime checks. To me, the best human debuggers are the ones who can figure out most quickly which mistaken assumptions are being made, and thus get to the heart of a problem fastest. If I have a program bug, I want the machine to express as directly as possible the intent of my program so that I can narrow the bug-space down to my code alone.
If the machine allows a change to one variable (an array, say) to affect some other unrelated variable, then it is not conforming to the intent of my program. In fact, it is not conforming to anything useful at all, since nobody would argue that it is useful programming practice to ever do such a thing on purpose (I hope?). Given that, the fact that the machine can do such a thing, let alone do it as a default, is what I'd label a shortcoming in the basic architecture. One we've all long ago learned to live with, to be sure, but it's not something to be proud of, at very least. >Can it be done? Well, in one sense the answer is clearly yes, because a >proof of program correctness has to include it, and we know that automating >such proofs is possible (although seldom practical at the moment). The >question is whether it can be done with practical, affordable machinery >without crippling the language. My own long-held conjecture is that the >answer is "yes", but I don't have proof of that yet. I sure hope you're right. In fact, if progress towards this goal becomes evident, I'd propose we start discussing ways of turning architecture towards better ways of supporting this instead of throughput or programming environments. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
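The kind of corruption described above, where an out-of-bounds array store silently trashes an unrelated neighbor, is easy to demonstrate in C. The struct layout here is contrived for illustration, and writing past the end of the array is formally undefined behavior; on typical compilers the stray store simply lands on the adjacent field:

```c
#include <string.h>

/* Sketch of the failure mode: an unchecked store through a bad index
 * corrupting a variable that merely happens to be adjacent in memory.
 * The struct pins the layout; with separate stack variables the same
 * bug occurs, but placement is up to the compiler. */
struct frame {
    int a[4];
    int unrelated;   /* the innocent bystander */
};

int overrun_demo(int index, int value)
{
    struct frame f;
    memset(&f, 0, sizeof f);
    f.unrelated = 42;
    f.a[index] = value;   /* no check: index 4 will typically land
                           * on 'unrelated' (undefined behavior)   */
    return f.unrelated;
}
```

On an architecture (or with a compiler) that checks every access against the array's bounds, the bad store faults at the point of the error instead of surfacing later as inexplicable corruption elsewhere.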
elg@killer.DALLAS.TX.US (Eric Green) (07/25/88)
in article <1988Jul22.164129.5495@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
> In article <302@laic.UUCP> darin@laic.UUCP (Darin Johnson) writes:
>>... a hardware BitBlt using DMA can be operating while the CPU is
>>servicing other processes, with no slowdown in CPU speed...
>
> Uh, using what for memory bandwidth? A good implementation of BitBlt,
> whether in software *or* hardware, will run the memory flat out, leaving
> nothing for "servicing other processes".

But the 68000 (uncached) version would use 3/4ths of that bandwidth to fetch INSTRUCTIONS. Also note that the bandwidth of modern 100ns memories is much greater than an 8MHz 68000 can take advantage of... even when the 68000 is on the bus 100% of the time that it can be, you still have half that memory bandwidth left for blitter operations etc. (a fact that the designers of the ST use to do their video refresh). Given the cost constraints, and the state of technology in 1984, it made a lot of sense to have a hardware blitter. The fact that the Blit terminal did fast blits without a blitter probably has more to do with the fact that a) the machine was designed before modern high-speed DRAMs, when 250ns RAMs were fast, and b) the machine had a much higher resolution screen than the Amiga, meaning that much more bus bandwidth was taken up with video refresh (thereby improving the desirability of caching the loop, since you couldn't do data accesses any faster anyhow).

> If your CPU is sitting there waiting for memory, it might as well be doing
> the BitBlt itself.

At least with the 68000 and modern DRAMs, the memory spends most of its time waiting for the CPU!

> deciding what to do totally dominates doing it, and bulk movement of bits,
> where a straight copy operation can run memory-bound on a well-built box
> with any good implementation.

Memory-bound? Yes, if you're using a 68020 or 68030 with a tight loop that'll fit into their internal cache.
But the 68000 would spend half of that bandwidth simply fetching instructions, not moving bits.... > the sort of thing an amateur knocks off in an evening. The techniques are > not that well known (especially in the micro community, which is notorious > for its ignorance of the state of the art -- the people at Commodore most > likely had no idea it was even possible to do a fast software BitBlt). True, the majority of the micro world thinks multitasking is a neat idea, message passing is what secretaries do, etc. But this is the first time I've ever heard the designers of the Amiga accused of such (the Amiga, BTW, was NOT designed by Commodore). Considering that they implemented such nifty state-of-the-art ideas as a message-passing multi-tasking OS, etc. (only to ruin it all by commissioning a cruddy filesystem), and considering that they all had Suns on their desktops when they were designing it, it sounds like you're being harsh for no good reason, considering the design constraints (512K RAM, under-$1500 price, etc.). > Note that some of the earlier Suns had quite a bit of hardware BitBlt > assist, and the newer ones don't. Sun learned. Commodore will, someday. Earlier Suns were based upon 68000/68010 technology, where hardware bitblts make a lot of sense. Later Suns, based upon 68020s, don't need the hardware assist (see the beginning of this message). Unfortunately, Commodore probably will retain their blitter even when they move into 68020/68030 machines, for backward-compatibility reasons (there's a helluva lot of programs out there, mostly games, directly accessing that blitter). BTW, games programmers are known for their dedication to speed... and most of the games programmers that I know have made extensive use of the blitter whenever it made sense (when moving large amounts of data, when setup time is no longer the dominating factor). -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. 
Box 92191 Lafayette, LA 70509 PC Pursuit: Or, How to Eat the Typist's Echo in Three Words or Less.
ge@hobbit.sci.kun.nl (Ge' Weijers) (07/25/88)
From article <1087@ficc.UUCP>, by peter@ficc.UUCP (Peter da Silva): > I have a question: > > Why are an Icache plus a Dcache better than just > a big shared cache as big as both? The answer is bandwidth: with separate caches the CPU does not have to stop filling the instruction pipeline when it reads or writes data, since an instruction fetch and a data access can proceed at the same time. -- Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge
peter@ficc.UUCP (Peter da Silva) (07/25/88)
In article <7439@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes: Me: > > > >You can save an even greater factor by building the BitBlt into the > > > >hardware.... > > > No. You can save a large factor over *poor* software implementations of > > > BitBlt... Me: > > However, a hardware BitBlt using DMA can be operating while the CPU is > > servicing other processes, with no slowdown in CPU speed... > First, that's only true if either the bitblt is being done to *and* from > memory not accessed by the CPU, or the memory bandwidth is high enough to > handle both the CPU and the bitblt DMA cycles. I'd guess that's rare; we > seem to be afflicted with memory-bound CPUs in a lot of systems. Well, for the Amiga the memory bandwidth is high enough, *and* the blitter only affects display memory. Memory requests from the CPU to memory outside the display memory area (the first 512K as it turns out) are on another bus. > But more importantly, in many cases you can't do anything else because > either (a) there's only one process running or (b) you can't afford the > context-switch time. The Amiga context switch time is on the order of a procedure call... the Amiga Exec is a realtime operating system... all processes are "lightweight" in UNIX jargon. > Unless you get a bitblt operation that takes longer > than *two* context switches (out and back), it costs more. For the case > where you're bitblting characters (probably the most common case), you will > lose big. When scrolling a window you just blit the whole thing instead of doing it a character at a time, and you win big (80 by 24 by 8 by 8 bits == 15,360 bytes). As for text, see below. > [dynamic compilation] > Ummm...but what does this prove? Not much, I think, until someone actually > tries it and finds out whether it's slower. Given that the Amiga already > has a useful bitblt (of whatever form), there's no great incentive to write > another. 
Well, there are already programs out that SetFunction the Text() routine to speed up text drawing of fixed 8-by-8 characters (the common case, as you say), and there is apparently lots of dynamically compiled BitBlt code out there waiting to be nabbed. -- Peter da Silva `-_-' Ferranti International Controls Corporation. "Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
guy@gorodish.Sun.COM (Guy Harris) (07/26/88)
> We probably agree that the compiler should check as much as it > possibly can; the earlier an error is detected, the less the > ambiguity about what is wrong and the cheaper the cure. But there is > an awful lot the compiler can't know about -- programs that use input > data, generate pointers, or do just about anything else that's > interesting, are going to do many operations that are difficult for > the compiler to validate at compile time. In the case of programs that use input data, said programs should validate the input data before using it (unless they have good reason to know the data will *never* be incorrect); I would prefer to see

	input, line 12: value "12038" is out of range (should be between
	17 and 137)

than to see

	ILLEGAL ARRAY REFERENCE: "frobozz.q", line 31
	Subscript was 12038; should be between 17 and 137

as the former, at least, tells me that the ultimate problem is in the data, not the code. Could a compiler know to remove run-time subscript checks in:

	i = read();
	if (i < 17 || i > 137)
		fail("input, line %d: value %d is out of range (should be between 17 and 137)", lineno, i);
	printf("input value for %d is %d\n", i, array[i]);

If so, it should probably issue warnings when it *isn't* able to remove run-time subscript checks, to inform programmers that they should probably put those checks in themselves. > >Can it be done? Well, in one sense the answer is clearly yes, because a > >proof of program correctness has to include it, and we know that automating > >such proofs is possible (although seldom practical at the moment). The > >question is whether it can be done with practical, affordable machinery > >without crippling the language. My own long-held conjecture is that the > >answer is "yes", but I don't have proof of that yet. > > I sure hope you're right. 
> In fact, if progress towards this goal > becomes evident, I'd propose we start discussing ways of turning > architecture towards better ways of supporting this instead of > throughput or programming environments. Yes, but does this need "architectural" support, at least in the sense of "instruction set architecture"? If a compiler for a "conventional" machine can do that level of validation, subscript-range checking features in the instruction set would be seldom, if ever, used. If "architecture" includes the compiler, then I agree that "architectural" support for this would be nice.
guy@gorodish.Sun.COM (Guy Harris) (07/26/88)
> You are asking for way too much, Guy. Computer system designers > can't even agree on performance metrics, and performance is probably > the easiest thing to measure about a computer system (as compared to, > say, reliability or programmer productivity). Well, I don't necessarily want numbers complete with reliable error bars; I'd be interested just to hear anecdotal stories such as "it took this team 10 times as long to do this project under XYZZY's Ada on a PC as it did for some other team to do it on FFGGF's Ada on the FFGGF Ada-Interpreter-In-Microcode(TM) processor", as long as the two teams were of comparable quality. > But my intuition is also that, for the same reasons that we > instinctively reach for the best tools for the job at hand (C, > assembly, Lisp, Ada, etc.), we would do considerably better as > programmers if the "thing" (including architecture, language, & OS) > executing our code matched our expectations as closely as possible, > and with few surprises. Yes, but is there any indication of whether we'd do better by changing the architecture (meaning the instruction-set architecture) or by changing the compiler, holding other things pretty much equal (e.g., the language)? > And further, if it made sure we knew the nature of whatever surprises > occurred, rather than (a C example) having an array-pointer-out-of-bounds > happen to trash a scalar value that was coincidentally declared nearby. > I don't see higher performance as a very good means of preventing errors > like that. If the performance is enough greater, you could spend some of the performance gain by inserting instructions to do run-time array bounds checks. If the compiler is good enough to optimize away most of those checks, you may spend less than you might expect to.
smryan@garth.UUCP (Steven Ryan) (07/26/88)
>If the machine allows a change to one variable (an array, say) to >affect some other unrelated variable, then it is not conforming to >the intent of my program. In fact, it is not conforming to anything >useful at all, since nobody would argue that it is useful programming >practice to ever do such a thing on purpose (I hope?). This particular example is done all the time in handling recursive data structures. >>Can it be done? Well, in one sense the answer is clearly yes, because a >>proof of program correctness has to include it, and we know that automating >>such proofs is possible (although seldom practical at the moment). Depends on whether all possible programs (including those of monkey programmers) or just `practical' programs are considered. Formal proofs of all possible programs are impossible, flat out, no appeal. Formal proofs of practical programs depend on how powerful practical has to be to remain useful.
david@sun.uucp (David DiGiacomo) (07/26/88)
In article <1988Jul22.164129.5495@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >Note that some of the earlier Suns had quite a bit of hardware BitBlt >assist, and the newer ones don't. Sun learned. Commodore will, someday. This is just my opinion, but I think what Sun learned is that once you stuck all that other gunk on the CPU board there was no room left for anything beyond a minimal frame buffer. The fastest Sun frame buffers currently available are the "CG3" and "CG5" VME color boards, which have extensive "hardware BitBlt assist". Of course, it's all datapath (the notorious rasterop chips) and no sequencing. The CPU does all the address generation. P.S. An Amiga-like autonomous "blitter" would be pretty useless in a virtual memory workstation unless you gave it its own MMU. -- David DiGiacomo, Sun Microsystems, Mt. View, CA sun!david david@sun.com
peter@ficc.UUCP (Peter da Silva) (07/26/88)
I suppose that if *all* your memory is on the same bus as your display drivers, then this statement is true: In article ... henry@utzoo.uucp (Henry Spencer) writes: > I agree that it's possible to break the hardware in such a way that a > software BitBlt is inherently slow. I prefer unbroken hardware myself. On the other hand, at a given clock speed a 68030 can only go so fast. If you can let it do other things while another processor is doing the display, then why not? Then the statement becomes: "I agree that it's possible to break the hardware in such a way that a hardware BitBlt is inherently slow. I prefer unbroken hardware myself." I have always been uncomfortable with the idea of having your main CPU throwing all those display bits around anyway... hell, given the price of the 68030 it might pay Sun to stick an extra one in there to unload *all* of the display management from the main CPU. -- Peter da Silva `-_-' Ferranti International Controls Corporation. "Have you hugged U your wolf today?" (uunet,tness1)!sugar!ficc!peter.
levy@ttrdc.UUCP (Daniel R. Levy) (07/26/88)
In article <1988Jul23.221743.22169@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes: > -- > Anyone who buys Wisconsin cheese is| Henry Spencer at U of Toronto Zoology > a traitor to mankind. --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu First it was NASA, now it's Wisconsin cheese. What gripe do you (and Pournelle for that matter) have against Wisconsin cheese? (The imagination boggles.) -- |------------Dan Levy------------| THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY | AT&T Data Systems Group | AND ARE NOT TO BE IMPUTED TO AT&T. | Skokie, Illinois | |-----Path: att!ttbcad!levy-----|
jesup@cbmvax.UUCP (Randell Jesup) (07/26/88)
In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >in article <1988Jul22.164129.5495@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: >> Uh, using what for memory bandwidth? A good implementation of BitBlt, >> whether in software *or* hardware, will run the memory flat out, leaving >> nothing for "servicing other processes". > >But the 68000 (uncached) version would use 3/4ths of that bandwidth >to fetch INSTRUCTIONS. Also note that the bandwidth of modern 100ns >memories is much greater than an 8MHz 68000 can take advantage of... >even >when the 68000 is on the bus 100% of the time that it can be, you >still have half that memory bandwidth left for blitter operations etc. >(a fact that the designers of the ST use to do their video refresh). Also, remember that there is more than one memory bus out there, and even if cycles are tight in video ("chip") memory, the processor can run flat out from ROM or expansion RAM. >> If your CPU is sitting there waiting for memory, it might as well be doing >> the BitBlt itself. Don't forget, the 68000 doesn't have the BFEXTU instruction, so the inner loop of a BitBlit isn't as tight as on a 68020. >> the sort of thing an amateur knocks off in an evening. The techniques are >> not that well known (especially in the micro community, which is notorious >> for its ignorance of the state of the art -- the people at Commodore most >> likely had no idea it was even possible to do a fast software BitBlt). Excuse me, I have a paper in front of me published in EUUG last fall by a researcher at AT&T on optimized software BitBlits in C on a 68020. If you want to insult people please do it to the XYZZY-clone makers and the people who write software for them. >> Note that some of the earlier Suns had quite a bit of hardware BitBlt >> assist, and the newer ones don't. Sun learned. Commodore will, someday. 
> >Earlier Suns were based upon 68000/68010 technology, where hardware >bitblts make a lot of sense. Later Suns, based upon 68020s, don't >need the hardware assist (see the beginning of this message). Hardware BitBlit is still useful in a lightweight process architecture where you have divided memory spaces (and thus much higher effective bandwidth). >Unfortunately, Commodore probably will retain their blitter even when >they move into 68020/68030 machines, for backward-compatibility >reasons (there's a helluva lot of programs out there, mostly games, >directly accessing that blitter). BTW, games programmers are known for >their dedication to speed... and most of the games programmers that I >know have made extensive use of the blitter whenever it made sense >(when moving large amounts of data, when setup time is no longer the >dominating factor). Actually, a number of the assembler hacker-types who write games for the Amiga ignore the blitter, because if you bypass the system software it's a pain to set up, or because in their precise setup they can do things faster because of various restrictions. The blitter is still great for general purpose stuff. Other nice things done by our blitter are things like line draws and area fills. Fast software implementations of BitBlit tend to be 2 sources, 1 destination blits (therefore 16 operations). If you expand them to 3 sources, 1 destination (256 operations), they tend to take up a LOT of code/ROM space, or they slow down a lot, or both. 3 sources are VERY useful for animation, since they let you do "cookie cutter" blits, with an object that is masked onto the destination, without destroying the areas in the rectangle but outside the mask. -- Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
henry@utzoo.uucp (Henry Spencer) (07/26/88)
In article <4894@killer.DALLAS.TX.US> elg@killer.UUCP writes: >... In addition, the blitter takes advantage of >the fact that the 68000 only uses half the available memory bandwidth... I.e., that the memory system is too fast for the processor to use fully. Can you say "imbalanced system"? >However, I still would not want to use a plain old 68000 machine which >had to use a software loop to scroll a 32K bitmap.... I've seen Macs >and ST's... the speed of window management and scrolling simply cannot >compare... Have you tried a Blit (aka 5620, aka 630)? That's a demonstration of what *can* be done with a plain old 68000 using software BitBlt. I said Commodore didn't know how to write a fast software implementation; I didn't mean to imply that Atari or Apple did better! One *can* do faster BitBlt than the Blit does, given enough added hardware, but I really doubt the cost-effectiveness. (Remembering, again, that the alternative is not to leave out the custom goodies, but to use that investment of money and hardware effort to make the CPU run faster.) -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (07/26/88)
In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >> ... A good implementation of BitBlt, >> whether in software *or* hardware, will run the memory flat out, leaving >> nothing for "servicing other processes". > >But the 68000 (uncached) version would use 3/4ths of that bandwidth >to fetch INSTRUCTIONS... Sigh. Please think before answering. An artfully-built MOVEM loop does *not* spend 3/4 of its time fetching instructions. >... the bandwidth of modern 100ns >memories is much greater than an 8MHz 68000 can take advantage of... Suggesting that one should either buy a faster processor or spend less money on the memory. There are sillier things than putting 100ns RAMs on an 8MHz 68000, yes, but I'd have to think for a moment to come up with specific examples. >Given the cost constraints, and the state of technology in 1984, it >made a lot of sense to have a hardware blitter. The fact that the Blit >terminal did fast blits without a blitter probably has more to do with >the fact that a) the machine was designed before modern high-speed >DRAMs, when 250ns RAMs were fast... The 630, designed rather more recently, does things the same way. So do the monochrome Suns. > and b) the machine had a much higher >resolution screen than the Amiga, meaning that much more bus bandwidth >was taken up with video refresh (thereby improving the desirability of >caching the loop, since you couldn't do data accesses any faster >anyhow). I don't follow this. There is no cache in the Blit. >> Note that some of the earlier Suns had quite a bit of hardware BitBlt >> assist, and the newer ones don't. Sun learned... > >Earlier Suns were based upon 68000/68010 technology, where hardware >bitblts make a lot of sense. Later Suns, based upon 68020s, don't >need the hardware assist (see the beginning of this message). No, they dumped the hardware assist *before* they switched to the 020, not after, unless I'm much mistaken. 
Most of the Sun-2s (68010) had no hardware assist. More generally, I will repeat -- more explicitly -- something I pointed out before: the fair comparison is not to the same system without BitBlt hardware, but to a system where the same effort and funding (custom chips are *not* cheap to design) have been invested in making the main CPU faster instead. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
colwell@mfci.UUCP (Robert Colwell) (07/26/88)
In article <61251@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: >In the case of programs that use input data, said programs should validate the >input data before using it (unless they have good reason to know the data will >*never* be incorrect); I would prefer to see > > input, line 12: value "12038" is out of range (should be between > 17 and 137) > >than to see > > ILLEGAL ARRAY REFERENCE: "frobozz.q", line 31 > Subscript was 12038; should be between 17 and 137 > >Yes, but does this need "architectural" support, at least in the sense of >"instruction set architecture"? If a compiler for a "conventional" machine can >do that level of validation, subscript-range checking features in the >instruction set would be seldom, if ever, used. > >If "architecture" includes the compiler, then I agree that "architectural" >support for this would be nice. But the whole point of capability machines (to name a single example) is that one cannot cover the space of all interesting exception-possibilities using only a compiler, no matter how smart. For one thing, the program could be coercing data types back and forth such that checking an operation on a type can only be done at the time the operation is applied. But a more fundamental matter is how one manages the development of a large software project with dozens of programmers contributing to a single large end product. The modules are all separately compiled, so there is no question of the compiler helping out much. Given access to all the symbol tables, you could imagine the linker doing some reasonable checks of consistency (number and types of args, for instance), but even that fails when pointers are being passed (pointers to functions, even). You can catch a lot of the common cases with good programming style, as you note above. But you can't catch them all, and the question that capability machines pose is "how close can we come to an airtight programming environment, and how much would it cost"? 
(simplistic paraphrase, I know; maybe it'll draw some capability people out of their woodwork abstraction!) And how about functional languages? Again, the compiler can only do so much, and the space it covers is not intuitively obvious to the programmer. If it doesn't entirely cover the bug-space, and you aren't sure what it *does* cover, then the coverage is much less useful (and may not be useful at all, as Henry Spencer was alluding to in an earlier post.) Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
colwell@mfci.UUCP (Robert Colwell) (07/26/88)
In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: >I suppose that if *all* your emmory is on the same bus as your display >drivers, then this statement is true: > >In article ... henry@utzoo.uucp (Henry Spencer) writes: >> I agree that it's possible to break the hardware in such a way that a >> software BitBlt is inherently slow. I prefer unbroken hardware myself. > >On the other hand, at a given speed a 68030 can only go so fast. If you >can let it do other things while another processor is doing the display, >then why not? Then the statement becomes: > > "I agree that it's possible to break the hardware in such a way that a > hardware BitBlt is inherently slow. I prefer unbroken hardware myself." > >I have always been uncomfortable with the idea of having your main CPU >throwing all those display bits around anyway... hell, given the price >of the 68030 it might pay Sun to stick an extra one in there to unload >*all* of the display management from the main CPU. My blit knowledge is now dated, but when I designed a color version of the Perq workstation (you remember them, sure you do) I had to keep the same basic architecture that the Perq already had -- CPU and display shared main memory. To get the bandwidth to what I needed I had to make the display fetch 256 bits at a time from main memory. Now never mind whether it's a good idea to make CPU and display share a common memory. If you *have* 256 bits/fetch, and your rasterop can handle that, I think it made good sense to provide rasterop hardware. I couldn't see making the CPU data paths that wide just for pushing display bits. I think it comes down to what Henry already said -- if your blit is already saturating memory bandwidth, you don't need hardware (hard to believe that the CPU is going to get anything useful done if no memory bandwidth is available, so the co-processor rationale is a weak one when memory is shared). 
But at least in my case, memory bandwidth could not possibly be saturated by just the CPU. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
daveh@cbmvax.UUCP (Dave Haynie) (07/27/88)
in article <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: > The 630, designed rather more recently, does things the same way. So > do the monochrome Suns. All the monochrome Sun-2s around here have pretty piss-poor graphics performance compared to what the Amiga does with its blitter and some well-written code. Not to imply that all of the Amiga's video routines are as fast as they could be; they certainly aren't. And the Sun-2s are even running faster than the Amiga. Certainly you may be able to get better display performance out of a Sun-3, but with a fast 68020, you'd be pretty hurtin' if you couldn't. And still, at least with our UNIX software (I did put the "well-written" qualifier in), our text scrolling is several times that of the Sun-3s I've seen. > No, they dumped the hardware assist *before* they switched to the 020, > not after, unless I'm much mistaken. Most of the Sun-2s (68010) had > no hardware assist. And as I pointed out, it shows. > MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology > smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu -- Dave Haynie "The 32 Bit Guy" Commodore-Amiga "The Crew That Never Rests" {ihnp4|uunet|rutgers}!cbmvax!daveh PLINK: D-DAVE H BIX: hazy "I can't relax, 'cause I'm a Boinger!"
guy@gorodish.Sun.COM (Guy Harris) (07/27/88)
> >Yes, but does this need "architectural" support, at least in the sense of > >"instruction set architecture"? If a compiler for a "conventional" machine > >can do that level of validation, subscript-range checking features in the > >instruction set would be seldom, if ever, used. > > > >If "architecture" includes the compiler, then I agree that "architectural" > >support for this would be nice. > > But the whole point of capability machines (to name a single example) > is that one cannot cover the space of all interesting > exception-possibilities using only a compiler, no matter how smart. > For one thing, the program could be coercing data types back and > forth such that checking an operation on a type can only be done at > the time the operation is applied. You missed the point. As I said in an earlier posting, you can spend some increased speed by putting in code to do the appropriate checks on a "conventional" machine. If the compiler can eliminate most of those checks, so much the better; you spend less of the increased speed. I don't see that, for example, doing these checks in microcode, as opposed to regular code, is *ipso facto* the only way to do things. In some ways, that's what I've been asking: what reason is there to believe that capability machines and the like are the *only* way, or even just the *best* way, to achieve the kind of programming environment I suspect most, if not all, of us would like to see? > But a more fundamental matter is how one manages the development of a > large software project with dozens of programmers contributing to a > single large end product. The modules are all separately compiled, > so there is no question of the compiler helping out much. > > Given access to all the symbol tables, you could imagine the linker > doing some reasonable checks of consistency (number and types of > args, for instance), but even that fails when pointers are being > passed (pointers to functions, even). I'm not so sure of that. 
I don't see how (assuming a "reasonable" language and "reasonable" linkers) doing the check at link time is intrinsically any different from doing it at compile time, if (as per the assumptions listed) the linker has all the information about types that was available to the compiler. I also don't see that pointers are that bad a problem, assuming pointers are typed (again, assuming a "reasonable" language); in ANSI C, for instance, "pointer to 'int'-valued function taking an 'int' argument" is a different type from "pointer to 'int'-valued function taking a 'float' argument". > You can catch a lot of the common cases with good programming style, > as you note above. But you can't catch them all, and the question > that capability machines pose is "how close can we come to an > airtight programming environment, and how much would it cost"? > (simplistic paraphrase, I know; maybe it'll draw some capability > people out of their woodwork abstraction!) I agree that you may not be able to catch all examples of incorrect code. However, if you can avoid doing any checking in those cases where you *can* catch them at compile time, you may be better off. The further questions that capability machines pose are: 1) How much "safer" are they than "conventional" machines plus compilers that generate run-time checks? and 2) If this extra safety is expensive, is it worth the cost? (Remember, the sort of embedded systems to which you've referred in the past do have real-time constraints, so there is presumably some minimum performance required; as the jobs they do get more sophisticated, the minimum performance required increases.)
smryan@garth.UUCP (Steven Ryan) (07/27/88)
In article <1075@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: >>If the machine allows a change to one variable (an array, say) to >>affect some other unrelated variable, then it is not conforming to >>the intent of my program. In fact, it is not conforming to anything >>useful at all, since nobody would argue that it is useful programming >>practice to ever do such a thing on purpose (I hope?). > >This particular example is done all the time in handling recursive data >structures. What do I mean by that? [I do not know how to reply by private mail--student driver.] A simple example is reversing a singly linked list and, just to make it interesting, inserting the list ordinal:

	p := nil; q := first_element; n := 1;
	while q isnt nil do
	   r := q.link;     -- r and q.link are variables pointing at the same location.
	   q.link := p;     -- p and q.link are now pointing at the same location.
	   p := q;          -- now p and q.
	   p.ordinal := n;  -- as a side effect, also modifies q.ordinal.
	   q := r;          -- now r and q.
	   n +:= 1
	od
elg@killer.DALLAS.TX.US (Eric Green) (07/27/88)
In message <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: >In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >>> ... A good implementation of BitBlt, >>> whether in software *or* hardware, will run the memory flat out, leaving >>> nothing for "servicing other processes". >>... the bandwidth of modern 100ns >>memories is much greater than an 8MHz 68000 can take advantage of... > >Suggesting that one should either buy a faster processor or spend less >money on the memory. There are sillier things than putting 100ns RAMs >on an 8MHz 68000, yes, but I'd have to think for a moment to come up >with specific examples. The slowest 256Kbit DRAM chips that I have ever seen are 150ns. I believe that the Amiga uses the 120ns parts. The difference in price is on the order of cents. >More generally, I will repeat -- more explicitly -- something I pointed >out before: the fair comparison is not to the same system without BitBlt >hardware, but to a system where the same effort and funding (custom chips >are *not* cheap to design) have been invested in making the main CPU faster >instead. The original design goal, as I understand it, was to produce a low-cost consumer machine with high-performance graphics. Which basically means using an off-the-shelf microprocessor with adequate development tools (the Amiga OS was cross-compiled on Sun workstations using the Greenhills compiler). The machine was based around 256Kbit DRAMs, which means that the memory bandwidth is going to be greater than can be used up by an 8MHz 68000. To go faster, they would have needed a faster 68000. Which in 1985 would have probably added $200 to the cost of the machine (after going through two levels of markups). That would have made it miss the $1500 price criterion (versus the Blit, which, I understand, had 1 megabyte of memory and a price tag of $10,000). -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. 
Box 92191 Lafayette, LA 70509 MISFORTUNE, n. The kind of fortune that never misses.
ge@hobbit.sci.kun.nl (Ge' Weijers) (07/27/88)
From article <1848@van-bc.UUCP>, by sl@van-bc.UUCP (pri=-10 Stuart Lynne): [About self-modifying code] > A simple definition might be any program that cannot be compiled and run > in shared text mode (or equivalent for non Unix application environments). What about programs that write code into the data segment and then call it:

    int myproc[1000];
    .....                        /* generating s/m code */
    (*((int (*)())myproc))();    /* and calling it */

We need a better definition, I'm afraid. -- Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge
ge@hobbit.sci.kun.nl (Ge' Weijers) (07/27/88)
From article <1988Jul26.022555.28494@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
) In article <4894@killer.DALLAS.TX.US> elg@killer.UUCP writes:
)>However, I still would not want to use a plain old 68000 machine which
)>had to use a software loop to scroll a 32K bitmap.... I've seen Macs
)>and ST's... the speed of window management and scrolling simply cannot
)>compare...
)
) Have you tried a Blit (aka 5620, aka 630)? That's a demonstration of
) what *can* be done with a plain old 68000 using software BitBlt. I said
) Commodore didn't know how to write a fast software implementation; I
) didn't mean to imply that Atari or Apple did better! One *can* do faster
) BitBlt than the Blit does, given enough added hardware, but I really doubt
) the cost-effectiveness.
If you ever get a chance to see the programmer's editor Tempus for the
Atari ST: it scrolls *FAST*, a factor of 10 better than most programs.
In fact it is so fast that the speed is almost useless.
The authors stated in an interview that they'd rather not have to
do the programming again. The advantage of the Amiga is that most
reasonably well written programs have fast scrolling without having to
reinvent the wheel.
--
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge
henk@ace.nl (Henk Hesselink) (07/27/88)
In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: > I have always been uncomfortable with the idea of having your main CPU > throwing all those display bits around anyway... hell, given the price > of the 68030 it might pay Sun to stick an extra one in there to unload > *all* of the display management from the main CPU. The latest Sony "News" workstation has two '030s for exactly that reason, and proves that it does pay. A very nice machine indeed. Henk Hesselink
darin@nova.laic.uucp (Darin Johnson) (07/28/88)
In article <1988Jul26.022555.28494@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes: > Have you tried a Blit (aka 5620, aka 630)? That's a demonstration of > what *can* be done with a plain old 68000 using software BitBlt. > One *can* do faster > BitBlt than the Blit does, given enough added hardware, but I really doubt > the cost-effectiveness. As I recall, the Blit (or 5620, ATT told me not to call it a blit), had very nice speed. However, it cost quite a lot of money ($8000?) so I wouldn't really call it that cost effective (most of the price was probably video and cpu though). Darin Johnson (...pyramid.arpa!leadsv!laic!darin) (...ucbvax!sun!sunncal!leadsv!laic!darin) "All aboard the DOOMED express!"
baum@Apple.COM (Allen J. Baum) (07/28/88)
[] >In article <61459@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: >I don't see that, for example, doing these checks in microcode, as opposed to >regular code, is *ipso facto* the only way to do things.... Presumably, hardware (done in parallel with fetches) is the way to go >... the questions that capability machines pose are: > > 1) How much "safer" are they than "conventional" machines plus > compilers that generate run-time checks? > >and > > 2) If this extra safety is expensive, is it worth the cost? In some sense, the cost is very little, probably not much more than your standard segmentation scheme. The hardware cost is usually a tag bit, which is fast, pretty cheap on chip, but possibly expensive in your memory system. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
colwell@mfci.UUCP (Robert Colwell) (07/28/88)
In article <61459@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: >> >Yes, but does this need "architectural" support, at least in the sense of >> >"instruction set architecture"? If a compiler for a "conventional" machine >> >can do that level of validation, subscript-range checking features in the >> >instruction set would be seldom, if ever, used. >> > >> >If "architecture" includes the compiler, then I agree that "architectural" >> >support for this would be nice. >> >> But the whole point of capability machines (to name a single example) >> is that one cannot cover the space of all interesting >> exception-possibilities using only a compiler, no matter how smart. >> For one thing, the program could be coercing data types back and >> forth such that checking an operation on a type can only be done at >> the time the operation is applied. > >You missed the point. As I said in an earlier posting, you can spend some >increased speed by putting in code to do the appropriate checks on a >"conventional" machine. If the compiler can eliminate most of those checks, >so much the better; you spend less of the increased speed. Actually, I didn't miss the point, I ignored it because I wanted to deal with something I consider more important. Yes you can put in more checks. And if you aren't going to let me get anything better than that for a programming environment, then I'll take it as an improvement over the C/Unixes that currently exist. But I have a lot of doubts about this approach. To me it's like observing that an earth dam has 300 holes, and proposing to fix them one-by-one, each hole needing some different fix from the one before. The whole point of capabilities (at least as the 432 implemented them) was to provide a seamless programming environment, with exactly the same abstraction (that of an "object") presented to the programmer, no matter in which direction she looked (left or right through the code, or down to the hardware). 
You can't get this by throwing a large number of bandaids at the problem, even if you say they don't cost much in performance. >I don't see that, for example, doing these checks in microcode, as opposed to >regular code, is *ipso facto* the only way to do things. In some ways, that's >what I've been asking: what reason is there to believe that capability >machines and the like are the *only* way, or even just the *best* way, to >achieve the kind of programming environment I suspect most, if not all, of us >would like to see? Actually, I didn't say it was the only way. In fact, all I've been trying to argue is that there are reasons to believe that there are things worth considering from that kind of research and that type of programming environment. Sometimes I feel like all we do is debate the race cars, when it's the production cars that represent all the money. [discussion about specific checks elided; this message is too long] >I agree that you may not be able to catch all examples of incorrect code. >However, if you can avoid doing any checking in those cases where you *can* >catch them at compile time, you may be better off. The further questions that >capability machines pose are: You won't catch me disagreeing with this, even for capability machines. In fact, that was a major point of our article in the 13th Computer Arch Symp (I think it was 13 -- the one in Tokyo). > 1) How much "safer" are they than "conventional" machines plus > compilers that generate run-time checks? > >and > > 2) If this extra safety is expensive, is it worth the cost? Absolutely. I tried to quantify the cost of 432-style object orientation in my thesis. In a nutshell, I concluded that that style of machine would be from 1 to 4 times slower than a conventional unprotected architecture made of equivalent implementation technology. (1 times slower means same speed.) That's an order of magnitude faster than the 432 really was, but that doesn't count (see next issue of ACM TOCS to see why). 
You could do their transparent multiprocessing and buy that factor of 4 back, but if you got above 6 or 7 you'd saturate the bus (memory-to-memory architecture). Things get hard to sort out after that. Is that performance hit worth it? Who knows? I'd say it's probably worth it a lot more often than most people currently think. An awful lot of code got written and debugged on machines that were a lot slower than the could-have-been 432. You have to allow for a natural tendency (which I think largely fuels this whole industry) to always want more and to refuse to ever go backwards in anything, speed, memory, disk, sw complexity... >(Remember, the sort of embedded systems to which you've referred in the past do >have real-time constraints, so there is presumably some minimum performance >required; as the jobs they do get more sophisticated, the minimum performance >required increases.) Yeah, well, the Space Shuttle doesn't exactly have blazing hardware. But it's running some of the most sophisticated software I know of. And telephone switchers aren't mini-crays by any means. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
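For concreteness, the "compilers that generate run-time checks" alternative in question 1 amounts to code like this hedged sketch: a compare-and-branch on every subscript, with an error return standing in for the trap a capability machine raises in hardware.

```c
/* Bounds-checked subscripting, the software stand-in for hardware checks.
   Returns 0 and stores arr[i] into *out, or -1 on a range violation. */
int checked_get(const int *arr, int n, int i, int *out) {
    if (i < 0 || i >= n)        /* the compiler-emitted run-time check */
        return -1;
    *out = arr[i];
    return 0;
}
```

The "spend less of the increased speed" argument is that a compiler can delete this test whenever it proves 0 <= i < n — for a loop induction variable, say — leaving the cost only on the genuinely dynamic cases.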
guy@gorodish.Sun.COM (Guy Harris) (07/28/88)
> But I have a lot of doubts about this approach. To me it's like > observing that an earth dam has 300 holes, and proposing to fix them > one-by-one, each hole needing some different fix from the one before. > The whole point of capabilities (at least as the 432 implemented > them) was to provide a seamless programming environment, with exactly > the same abstraction (that of an "object") presented to the > programmer, no matter in which direction she looked (left or right > through the code, or down to the hardware). What if they look at the microcode, or directly at the hardware (if the underlying machine isn't implemented using microcode)? In order for the environment to be truly airtight, the microcode (or hardware) has to be airtight. It may be that producing a sufficiently airtight compiler that implements the object-oriented model by producing the "right" code, including checks, is either impossible or more costly than producing a microcode or hardware implementation atop a compiler that doesn't have to be so airtight because the microcode or hardware is. If people's experience points in this direction, fine. However, how is this different from, say, implementing the interpreter in some "conventional" machine language, or in some language compiled into "conventional" machine language? If it's not different, could the interpreter then try compiling the code to be interpreted into that machine language, preserving the run-time checks? Could this still be sufficiently airtight? If so, would it be faster than, say, a microcode implementation of the same interpreter with no compilation? How about a microcode implementation that compiles either into some faster-to-interpret code, or directly into microcode? > Actually, I didn't say it was the only way. In fact, all I've been > trying to argue is that there are reasons to believe that there are > things worth considering from that kind of research and that type of > programming environment. 
I'll certainly agree with that; however... > Sometimes I feel like all we do is debate the race cars, when it's the > production cars that represent all the money. I won't agree with that, because 1) I don't think that's *all* people have been doing. While there's been relatively little discussion of it in *this* forum, I get the impression - perhaps erroneously - that people are thinking about how to implement the desired kind of programming environment, even if more in the "atop a 'conventional' machine language" form than in the "machine language that does lots of it directly" form; e.g., the SOAR people, people doing LISP implementations on "conventional" machines, etc.. 2) I don't think the analogy is appropriate. For one thing, most computers are - for better or worse - "conventional", so a better analogy would be "sometimes it feels like all we do is debate the production cars, when we should be concentrating on 'safety' cars", e.g. cars with significantly new features to increase safety. The second analogy raises some interesting questions: 1) How much are people willing to spend to have safer cars? Are they willing to pay X amount of extra money for airbags? (How much of a slowdown are people willing to accept for some amount of additional run-time checking?) 2) How much safer does a particular feature make you? (If 99% of the errors a particular run-time check detects can be caught at compile-time, how much safer are you if you catch the extra 1%?) 3) How many problems does a feature have that reduce its desirability - or even its contribution to safety? For example, if there's a risk of airbags detonating spontaneously, what is the risk of this happening and how bad is it if it does? (What are the chances that the implementation of the environment, and all its checking, has a bug in it, and how badly are you screwed if it is?) > Absolutely. I tried to quantify the cost of 432-style object > orientation in my thesis. 
In a nutshell, I concluded that that style > of machine would be from 1 to 4 times slower than a conventional > unprotected architecture made of equivalent implementation > technology. Just out of curiosity: How would this figure change with newer implementation technologies for "conventional" machines (e.g. more powerful optimizers, different architectural style (for which you may read "the stuff that RISC implementors talk about a lot"), etc.)? How would it change with newer implementations for protected architectures (including some of the same newer technologies mentioned for "conventional" machines, and also techniques such as compiling "protected architecture" code into "conventional" code, or doing "quick" checks in hardware and trapping to software checking code if they fail)? (For all I know, some of these may not be "newer" techniques; if so, how did that change the relative performance?) How easy is it to apply some of those techniques to instruction-set architectures such as the 432's, as opposed to applying them to, say, RISC architectures and using other techniques to implement a "protected" system atop such a "conventional" architecture? > You could do their transparent multiprocessing and buy that factor of > 4 back, but if you got above 6 or 7 you'd saturate the bus > (memory-to-memory architecture). Things get hard to sort out after > that. Yes, but could you buy an equivalent amount back with multiprocessing on "conventional" architectures with the protection implemented elsewhere than in the instruction set, even if the multiprocessing might be less transparent? Could it be a greater amount if the "conventional" architecture were not so memory-reference-intensive? Could you change the "protected" system to make *it* less memory-intensive and improve *its* multiprocessor performance? (At this point, one would no longer expect eyebrows to raise at the appearance of a new register-oriented machine.... :-)) > Is that performance hit worth it? Who knows? 
I'd say it's probably > worth it a lot more often than most people currently think. An awful > lot of code got written and debugged on machines that were a lot slower > than the could-have-been 432. You have to allow for a natural > tendency (which I think largely fuels this whole industry) to always > want more and to refuse to ever go backwards in anything, speed, > memory, disk, sw complexity... Yes, but there may be an economic justification for that tendency; the issues may not all be "technical" in the narrow sense. It may be hard, as you pointed out, to show improvements in programmer productivity, code reliability, etc. in a better programming environment, but if the advocates of such environments want people to spend money on those sorts of improvements rather than straight performance improvements, they should expect those who would pay the bills to be more swayed by that kind of evidence. (Another problem is that it may not even be cost-effective to adopt a technology that's clearly "better", technically; I suspect a new motor fuel would have to be quite a bit better before people were willing to write off the existing investment in the production and distribution of existing fuels. There are a lot of "conventional" machines out there.) > Yeah, well, the Space Shuttle doesn't exactly have blazing hardware. > But it's running some of the most sophisticated software I > know of. And telephone switchers aren't mini-crays by any means. Yeah, well, I don't know that either of those processors would necessarily have the requisite horsepower for some of the embedded applications somebody might want to put computers into in the future. 
That's why I said "as the jobs they do get more sophisticated, the minimum performance required increases;" a 4x performance hit relative to "conventional" computers may rule out a "protected" processor for some application until you can either come up with a way of solving the problem faster with existing hardware or make the hardware faster. I'm certainly willing to believe that there are applications now that could use a "protected" processor, but I'm not sure I'm willing to believe that there aren't applications that, at present, couldn't use a "protected" processor if it were 4x slower than a "conventional" one. In this case, a useful question to ask might be "can I get the same level of protection with an adequate level of performance if I don't put all the protection in the instruction set?" (BTW, I infer from the paper "Engineering a RISC Compiler System", by some people at MIPS, that their compiler can optimize register usage across procedure boundaries even when the calling and called procedures were separately compiled - they keep Ucode around for multiple compilation units, so that information for both calling and called procedure is available to the optimizer - so the claim made in "Fast Object-Oriented Procedure Calls: Lessons from the Intel 432" that "Clearly, the same approach (interprocedural register allocation) is inapplicable to separately compiled modules." is false. It may be true in some cases - for example, if the modules are bound together at run time - but it is hardly clear that it is always true. I believe David Wall has worked on link-time register allocation as well.)
pf@diab.se (Per Fogelstr|m) (07/28/88)
In article <480@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: >I think it comes down to what Henry already said -- if your blit is >already saturating memory bandwidth, you don't need hardware (hard to >believe that the CPU is going to get anything useful done if no >memory bandwidth is available, so the co-processor rationale is a >weak one when memory is shared). But at least in my case, memory >bandwidth could not possibly be saturated by just the CPU. > I disagree! If specialized hardware can do a BitBlit at least twice as fast (and probably even faster) than the CPU, you will always have 50% or more CPU time left over compared with the CPU doing the job itself. Note that a well-designed Blit can saturate the memory bandwidth with no problem. The design I'm working with right now Blits 80-90 kpixels (depth unlimited) per second with a 16-bit bus and 100ns dynamic RAMs. This means it can scroll a 1024 x 1024 screen in 10ms, less than one display time, and it can move windows on the screen in real time as the Intel GPC does with hardware windows. Do that with the CPU. Even with external hardware the CPU will not be able to generate the addresses fast enough. (You have to generate a new address every 50-100ns.)
colwell@mfci.UUCP (Robert Colwell) (07/28/88)
In article <61781@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: [Guy, how come you always delete the previous correspondent's name? Are you trying to save me embarrassment?] [lots of interesting questions and points to ponder removed...I think this forum has run out of gas on a topic this complex, so in true Usenet style I'm going to pick nits instead. I'll try to get back to your 432 questions when I get a chance.] >I'm certainly willing to believe that there are applications now that could use >a "protected" processor, but I'm not sure I'm willing to believe that there >aren't applications that, at present, couldn't use a "protected" processor if >it were 4x slower than a "conventional" one. In this case, a useful question >to ask might be "can I get the same level of protection with an adequate level >of performance if I don't put all the protection in the instruction set?" If you think of the 432 as merely a "protected processor" you've not fully grasped what they created there. It was an *environment*, of which the central processor was just a component. You, as a programmer, did not see the 432 programming environment as a processor running with a certain amount of memory, a set of registers, some available OS calls, a certain type of virtual memory, etc. You the programmer were given a new universe in which all the pieces obeyed the same rules, and the rules were simple and few and not derived from hardware or language artifacts (by and large, anyway). I do not believe you can get this environment piecemeal. (But you can get closer to it than we currently are, and we probably agree on both the feasibility and desirability of this.) 
>(BTW, I infer from the paper "Engineering a RISC Compiler System", by some >people at MIPS, that their compiler can optimize register usage across >procedure boundaries even when the calling and called procedures were >separately compiled - they keep Ucode around for multiple compilation units, so >that information for both calling and called procedure is available to the >optimizer - so the claim made in "Fast Object-Oriented Procedure Calls: >Lessons from the Intel 432" that "Clearly, the same approach (interprocedural >register allocation) is inapplicable to separately compiled modules." is >false. It may be true in some cases - for example, if the modules are bound >together at run time - but it is hardly clear that it is always true. I >believe David Wall has worked on link-time register allocation as well.) We were talking about procedure calls made within an object-based programming environment, and I don't think the claim is false. In such an environment you do not want to tie your parameter-passing abstraction, with its expected runtime type-checking, to whatever register convention happens to be the current fashion. We didn't say you couldn't do it -- we meant you don't want to. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
henry@utzoo.uucp (Henry Spencer) (07/29/88)
In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes: >On the other hand, at a given speed a 68030 can only go so fast. If you >can let it do other things while another processor is doing the display, >then why not? ... hell, given the price >of the 68030 it might pay Sun to stick an extra one in there to unload >*all* of the display management from the main CPU. And when it's not doing display management, of course, it can run user programs. Congratulations! You have just re-invented the multiprocessor system. The Wheel of Reincarnation strikes again. The more smarts you put in your display processor, the more it resembles a somewhat-crippled main processor. You get a much more useful system if you break down and admit that you're building a multiprocessor machine, and make all the CPUs general-purpose. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (07/29/88)
In article <4324@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes: > Don't forget, the 68000 doesn't have the BFEXTU instruction, so >the inner loop of a BitBlit isn't as tight as on a 68020. The inner loops of normal large-area BitBlts do not usually do bitfield extractions. Look at how the operation is actually used. >>> ... The techniques are >>> not that well known (especially in the micro community, which is notorious >>> for its ignorance of the state of the art -- the people at Commodore most >>> likely had no idea it was even possible to do a fast software BitBlt). > > Excuse me, I have a paper in front of me published in EUUG last fall >by a researcher at AT&T on optimized software BitBlits in C on an 68020. If >you want to insult people please do it to the XYZZY-clone makers and the people >who write software for them. I don't consider a 68020 a micro, myself. The 68020 machine I am writing this on is much bigger and faster than the minicomputer I was using a month ago. When I say "micro", I mean something that competes with PCs. As for insulting the wrong people, if you're reading papers like that, I would consider you a (laudable) rare exception to the (deplorable) rule. > Fast software implementations of BitBlit tend to be 2 sources, 1 >destination blits (therefore 16 operations). If you expand them to 3 sources, >1 destination (256 operations), they tend to take up a LOT of code/ROM >space, or they slow down a lot, or both... This is where dynamic compilation, the original subject of this conversation (remember back then? :-)) wins big: you don't need to duplicate all that code N times, just know how to generate it. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
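Henry's point about generating rather than storing the 256 routines: a 3-source combine is one of the 256 boolean functions of (source, destination, mask), fully described by an 8-entry truth table. A hedged sketch of the generic (slow) fallback — the minterm loop below is what a dynamic compiler would unroll into straight-line code for the one `op` actually requested:

```c
/* Generic 3-input raster combine: bit m of `op` gives the result for
   input combination (a<<2)|(b<<1)|c.  Evaluated one minterm at a time,
   bitwise-parallel across a 16-bit word. */
#include <stdint.h>

uint16_t combine3(uint8_t op, uint16_t a, uint16_t b, uint16_t c) {
    uint16_t r = 0;
    for (int m = 0; m < 8; m++) {
        if (!(op & (1 << m)))
            continue;                       /* minterm not in this op */
        uint16_t t = 0xFFFF;
        t &= (m & 4) ? a : (uint16_t)~a;
        t &= (m & 2) ? b : (uint16_t)~b;
        t &= (m & 1) ? c : (uint16_t)~c;
        r |= t;
    }
    return r;
}
```

For example, op 0xF0 selects the minterms where the source bit is 1, so it copies the source; 0xC0 is source AND destination. Shipping 256 hand-unrolled loops costs code/ROM space; running this generic loop costs time; generating the unrolled loop on demand pays neither price, which is the dynamic-compilation win.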
henry@utzoo.uucp (Henry Spencer) (07/29/88)
In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >>More generally, I will repeat -- more explicitly -- something I pointed >>out before: the fair comparison is not to the same system without BitBlt >>hardware, but to a system where the same effort and funding (custom chips >>are *not* cheap to design) have been invested in making the main CPU faster >>instead. > >... To go faster, they would have >needed a faster 68000. Which in 1985 would have probably added $200 to >the cost of the machine (after going through two levels of markups). Which, I would guess, is the same order of magnitude as the cost added by the custom chips, after the same markups. >Blit, which, I understand, had 1 megabyte of memory and a price tag of >$10,000). Wrong twice. I don't think even the 5620 cost that much, and the biggest one had half a meg. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (07/29/88)
In article <4333@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes: > in article <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: > > The 630, designed rather more recently, does things the same way. So > > do the monochrome Suns. > > All the monochrome Sun-2s around here have pretty piss-poor graphics > performance as compared to what the Amiga does with its blitter and > some well written code. Not to imply that all of the Amiga's video > routines are as fast as they could be, they certainly aren't. And the > Sun-2s are even running faster than the Amiga. Certainly you may be > able to get better display performance out of a Sun-3, but with a fast > 68020, you'd be pretty hurtin' if you couldn't. And still, at least > without UNIX software (I did put the "well-written" qualifier in), our > text scrolling is several times that of the Sun-3's I've seen. > Yes, but there is a valid reason. The Amiga is only (at least the workbench) updating a 320x200x2-bitplane display. The Sun screen is much, much larger (maybe 1000 x 800?). It is a whole lot easier to move around 16K or so than up to say 100K on the Sun. This is just like comparing the Amiga to the Mac II. The Amiga isn't that speedy if you are using HAM 640x400, which is 192K of memory. Then the Mac II also slows down in its 640x400x8 mode, which is 256K of memory. But I would say having a 256 color palette is much nicer than HAM, in my opinion. Things like blitters are nice, but I would much rather have a better CPU. Now how about a 68000-like CPU with say 128 data registers and 128 address registers? That is something I really could get into using. Things like blitters are a dime a dozen, I want some real power! Wayne Knapp P.S. The Amiga was pretty hot three years ago, but I just think that in general it is no longer that wonderful. Time for a major upgrade?
karl@sugar.uu.net (Karl Lehenbauer) (07/29/88)
(Those blit terminals were monochrome, weren't they? Amiga screens have up to six bitplanes.) > I.e., that the memory system is too fast for the processor to use fully. > Can you say "imbalanced system"? Henry's remarks are flames in the sense that they're kind of insulting when they don't really have to be, I think. Anytime someone uses Mr. Rogers' "Can you say 'xxx'?", they're being condescending. I don't think the Amiga is imbalanced either. The extra clock cycles are used for screen, audio and disk DMA. Is that really imbalanced? Would slower memory have been better? Whether or not blitters are worthwhile, the Amiga has capabilities in terms of realtime video generation (examples: various Videoscape 3D demos, NewTek demo, PD animations, etc) that can't be touched by anything within about 10X of its price, and can't be touched by a lot of big, expensive workstations because their operating systems aren't realtime. Also, look at the cost of software. Software for that $5K Targa board for your Mac II can run over $7K, fine if you're institutional, but kind of narrow to think that's the Minimum Acceptable System. Should one expect a $550 machine to have a 68020? (No, a 32-bit RISC, someday) My point is that whatever the theory of whether it should have a blitter, in practice there is a lot of cool software that does amazing things, cheap - a gestalt win. I've done some programming on the Amiga and I'm very, very impressed with the realtime exec, the long names for all the calls, the shared libraries and so forth. My other realtime OS work has been with RSX-11M, and M+, RMX, C Exec, Forth and various homegrown-style ones and so far the Amiga is a superset of the others...and it's hardly the work of people who don't know how to write software. -- -- -- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184
karl@sugar.uu.net (Karl Lehenbauer) (07/29/88)
> I.e., that the memory system is too fast for the processor to use fully. > Can you say "imbalanced system"? Henry's remarks are flames in the sense that they're kind of insulting when they don't really have to be, I think. Anytime someone uses Mr. Rogers' "Can you say 'xxx'?", they're being condescending. I don't think the Amiga is imbalanced either. The extra clock cycles are used for screen, audio and disk DMA. Is that really imbalanced? Would slower memory have been better? I think Henry's right that "best of all is a really fast CPU." I don't know what the Amiga people themselves could have done to make the 68000 itself faster, though. Maybe it should be "best of all is a really fast CPU and a bunch of other stuff?" I don't want my Unix system to run really slow while drawing characters as pixels, either. I want a blit card or something like that. I guess I want both. Whether or not blitters are worthwhile, the Amiga has capabilities in terms of realtime video generation (examples: various Videoscape 3D demos, NewTek demo, PD animations, etc) that can't be touched by anything within about 10X of its price, and can't be touched by a lot of big, expensive workstations because their operating systems aren't realtime. Also, look at the cost of software. Software for that $5K Targa board for your Mac II can run over $7K, fine if you're institutional, but kind of narrow to think that's the Minimum Acceptable System. Should one expect a $550 machine to have a 68020? (No, a 32-bit RISC, someday) My point is that whatever the theory of whether it should have a blitter, in practice there is a lot of cool software that does amazing things, cheap - a gestalt win. I've done some programming on the Amiga and its system software is very sophisticated. It is hardly the work of people who don't know how to program, as fun as it may be to stir up trouble by saying that it is. -- -- -- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184
aglew@urbsdc.Urbana.Gould.COM (07/29/88)
..> Talk about capabilities? Why has no-one mentioned one of the biggest problems with capability based systems? The security people who evaluate and classify secure _systems_ (C2, B1, etc., Gould knows about them) don't like "OS in hardware" or "OS in microcode" because then they have to understand the microcode/microarchitecture, and there's more of it to understand than in a conventional system. Plus the fact that capability inheritance rules have never been satisfactorily modeled and shown to be secure (I'm sure that I'll receive flames about that, eh? :-) *************************************************************** Attempting to break out of the C vs ASM and HW vs SW BLT morass *************************************************************** I notice Bob Colwell from Multiflow is a regular reader. Just read the IEEE Transactions on Computers article on Multiflow's TRACE machines. Questions: Exactly how does the store register file on the floating point subsystem operate? You mention that there are only 8 32 bit busses across the board edge. Exactly what are they? Are there more busses in the system, but only a subset to any board? Are address and data busses split? Etc. Everybody I know who has a reason to know the details of the TRACE says that it is loaded down with crossbars. How and where? Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801 aglew@gould.com - preferred, if you have MX records aglew@xenurus.gould.com - if you don't ...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
henry@utzoo.uucp (Henry Spencer) (07/30/88)
In article <308@laic.UUCP> darin@nova.laic.uucp (Darin Johnson) writes: >As I recall, the Blit (or 5620, ATT told me not to call it a blit), had >very nice speed. However, it cost quite a lot of money ($8000?) so I wouldn't >really call it that cost effective ... The selling price was something like $5k, as I recall, and it *cost* a lot less than that (as witness the rather lower price of the 630, its replacement, despite greater functionality with similar technology). -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
darin@nova.laic.uucp (Darin Johnson) (07/30/88)
In article <3057@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes: > This just like comparing the Amiga to the Mac II. The Amiga isn't that > speedy if you are using HAM 640x400, which is 192k of memory. Then > the Mac II also slows down in its 640x400x8 mode which is 256k of memory. > But I would having a 256 color palette is much nicer than HAM in my opinion. > > P.S. The Amiga was pretty hot three years ago, but I just think that in > general it is no longer that wonderful. Time for a major upgrade? however, it is still the biggest bang for the buck. It is still the only computer with NTSC video, without having to pay megabucks. I could buy a 68030 board (and extra 32 bit ram when the prices go down), plug it into my Amiga, and still have spent less money than a Mac II with color - and the Mac still won't have multitasking without spending more money. The extra colors and higher resolution on the MacII are nice, but the Mac II just isn't a personal computer anymore... (sort of like a workstation, but without the performance) Darin Johnson (...pyramid.arpa!leadsv!laic!darin) (...ucbvax!sun!sunncal!leadsv!laic!darin) "All aboard the DOOMED express!"
henry@utzoo.uucp (Henry Spencer) (07/30/88)
In article <477@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: >... there is >an awful lot the compiler can't know about -- programs that use input >data, generate pointers, or do just about anything else that's >interesting, are going to do many operations that are difficult for >the compiler to validate at compile time. My personal thesis is that the difficulty of doing this in a well-written program -- I am not interested in poorly-written ones -- is consistently overestimated. Remember, the programmer is supposed to be sure that these errors aren't going to happen, because even if they are caught immediately they are generally a disaster for production code. Frequently, the sort of checks that a programmer must put in to be sure about such things are the sort of thing that a compiler ought to be able to spot fairly easily, e.g. validation of input data. >I'm not sure I catch your drift on the imposition of runtime checks. >To me, the best human debuggers have the ability to figure out the >fastest which assumptions are being made that are wrong, thus getting >to the heart of some problem the quickest. If I have a program bug, >I want the machine to express as directly as possible the intent of >my program so that I can narrow the bug-space down to my code alone. We may have a slight misunderstanding here, which I admit I didn't notice in my original posting. I'm not claiming that run-time checks aren't useful during debugging. However, I claim that run-time checks are not a particularly good approach to catching problems during production use, because "subscript error in line 32, core dumped" can be every bit as bad as trash output in a production environment. What one really wants is a *guarantee* that (in the absence of hardware malfunction), it will not happen. This cannot be achieved by testing, regardless of how much help the hardware provides. >I sure hope you're right. 
>In fact, if progress towards this goal >becomes evident, I'd propose we start discussing ways of turning >architecture towards better ways of supporting this instead of >throughput or programming environments. This is actually a programming-environment issue, though, and architecture is nearly irrelevant. A much more important issue is the programming languages being used, and how tractable they are. C, for example, is not a very favorable case -- too many problems with pointers. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
oconnor@nuke.steinmetz (Dennis M. O'Connor) (07/30/88)
An article by ge@hobbit.sci.kun.nl (Ge' Weijers) says: ] From article <1087@ficc.UUCP>, by peter@ficc.UUCP (Peter da Silva): ] > Why are an Icache plus a Dcache better than just ] > a big shared cache as big as both? ] ] The answer is bandwidth. The CPU does not have to stop filling ] the instruction pipeline when it accesses/writes data. ] -- ] Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands This is indeed part of the answer. The other part is that instruction-fetch behavior is much simpler and more predictable than operand fetch, and that I-caches don't need to be written-through, allowing an isolated I-cache to be simpler and more effective than a combined operand/instruction cache. Specialized I-caches, like a Branch-Target cache combined with pre-fetch, can produce effective hit rates of 99%+ with only a few thousand bits of storage. Operand caches can't do anywhere near this well. -- Dennis O'Connor oconnor%sungod@steinmetz.UUCP ARPA: OCONNORDM@ge-crd.arpa "Never confuse USENET with something that matters, like PIZZA."
henry@utzoo.uucp (Henry Spencer) (07/30/88)
In article <479@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: >But the whole point of capability machines (to name a single example) >is that one cannot cover the space of all interesting >exception-possibilities using only a compiler, no matter how smart. >For one thing, the program could be coercing data types back and >forth such that checking an operation on a type can only be done at >the time the operation is applied. Yes, it could be -- but how does the *programmer* know it's right? Remember, programs are not supposed to dump core in production operation. Remember also Dijkstra's comments on how human minds are actually fairly simple and find it difficult to reason about complex situations. There are three possibilities here: 1. Code is simple enough for human programmer to figure out easily, in which case the compiler should be able to do likewise, perhaps with a bit of help. 2. Code is wrong, human programmer is (e.g.) not validating his inputs properly. Compiler should reject it. 3. The code really is right but very subtle. Programmer has to have a way to tell the compiler "this is subtle but right". This should not be the default, as it effectively is now. My conjecture, not proven at the moment, is that only modest intelligence in the compiler and occasional tidying-up of code for clarity would be needed to make case 3 fairly rare, given favorable circumstances (it is not clear that C constitutes a sufficiently favorable circumstance). >But a more fundamental matter is how one manages the development of a >large software project with dozens of programmers contributing to a >single large end product. The modules are all separately compiled, >so there is no question of the compiler helping out much. Please consider modern compiler technology, not 20-year-old versions. Ways to achieve inter-module checking are a major concern in modern languages, as witness (e.g.) C++ and Modula-2. It can be done, and ought to be done! 
>Given access to all the symbol tables, you could imagine the linker >doing some reasonable checks of consistency (number and types of >args, for instance), but even that fails when pointers are being >passed (pointers to functions, even). If the assumptions being made about those pointers are simple ones, they should be in the code, for compilers (and humans!) to see. If they are complex ones, then in a multi-programmer project they are probably being violated already. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
jesup@cbmvax.UUCP (Randell Jesup) (07/30/88)
In article <3057@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes: >Yes but there is a valid reason. The Amiga is only (at least the workbench) >updating a 320x200x2bit plane display. The sun screen is much, much larger, >(maybe 1000 x 800?). It is a whole lot easier to move around 16k or so >than up to say 100k on the sun. Wrong: 640x200x2bit or 640x400x2bit (for workbench, more bitplanes are allowed for "custom" screen.) Also, When I'm using suntools, I tend to use 80-column wide windows, 30 or so lines high, 1 bitplane deep. Overall, about the same amount of display memory used for the window I'm scrolling in each case. >This just like comparing the Amiga to the Mac II. The Amiga isn't that >speedy if you are using HAM 640x400, which is 192k of memory. Then >the Mac II also slows down in its 640x400x8 mode which is 256k of memory. >But I would having a 256 color palette is much nicer than HAM in my opinion. HAM only operates in lores (320x{200,400}) in the current chips. The biggest amount of memory for a screen/window is 128K (640x400x4bits). The Mac-II is not a reasonable comparison to the current Amigas (look at price, when they were designed, processor, etc.) The interesting thing is that in windowing and screen update/scrolling speed the 68000 8Mhz 16-bit $600 Amiga can do so well compared to the 68020 16Mhz 32-bit Mac-II that costs several thousand, which also uses VRAMs (a win due to bus contention), which didn't exist when the current amiga was designed. >Things like blitters are nice, but I would much rather have a better CPU. >Now how about a 68000 like CPU with say 128 data reg. and 128 address reg.. >That is somethin I really could get into using. Things like blitters are >a dime a dozen, I want some real power! Ok, where can we buy this CPU for less than $20? In fact, where can one buy this CPU at all? 
Warning: task switch time on that will be atrocious, and instructions will have to double in size or so to hold those register addresses. Leading edge CPU design is an expensive thing compared to building a blitter. The last major CPU MOS (Commodore's silicon arm) designed was the 6502/6510. They concentrate on other things now, like the custom chips for the C64 and now the Amiga. >P.S. The Amiga was pretty hot three years ago, but I just think that in >general it is no longer that wonderful. Time for a major upgrade? The new non-interlaced chips are more or less public knowledge, though most of the details aren't. I can say that we certainly aren't standing still; Gould announced a few months ago in Germany that we are working on the Amiga 3000, based on the 68030. -- Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
aglew@urbsdc.Urbana.Gould.COM (07/30/88)
> Can you say "imbalanced system"?
Getting away from the Blitter wars, what exactly do people mean
by a "balanced system", and why is it so important?
This is a somewhat rhetorical question. And yes, I know about
Amdahl's MIPS/MB/s rule of thumb, and a few others.
I ask this because I have seen several situations where building
unbalanced systems has been an attractive corporate strategy.
Eg. a small company that can't do everything at once, and wants
to maintain a high rate of delivery of new projects:
Start off with a really fast memory system, and good, but not
necessarily the fastest CPUs. Deliver this as your first system.
Keep the memory system, but work all out on the CPU.
Deliver this as your second system - with the advantages that the
memory system is tried and proven, so all that you have to devote
debugging and language effort to is the CPU system.
Now the memory is tired, so spend time and effort working on
it - but keep the proven CPUs...
And so on. This is an overlapping product development strategy
- eventually, the whole system gets redone, but at any time you
don't have the expense and risk of giving the customer a completely
new system.
Reactions?
Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
aglew@gould.com - preferred, if you have MX records
aglew@xenurus.gould.com - if you don't
...!ihnp4!uiucuxc!ccvaxa!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
colwell@mfci.UUCP (Robert Colwell) (07/31/88)
>Just read the IEEE Transactions on Computers article on >Multiflow's TRACE machines. > > Questions: Exactly how does the store register file >on the floating point subsystem operate? The data that you'd like to have end up in memory, you put into the SRF first. When you kick off a memory store, it's an integer opcode -- the integer section has the DTLB. This was done so that the floating section can keep cranking out flops while the integer section can keep track of loop indices, memory addresses, and other scalar types of things. The SRF is a legitimate target of a flop or an integer op. > You mention that there are only 8 32 bit busses across >the board edge. Exactly what are they? Are address and data busses split? By "address bus" I assume you mean physical address bus to memory; those are "dedicated" in the sense that they only carry addresses, never data. There are also a set of store buses that carry data to the memory controllers on stores. There are eight other buses, four FL buses and four IL buses. The "L" means Load, but these buses are also used to carry certain cluster-to-cluster transfers. Buses are all tagged as to destination, and self-drain in the event of stalls, traps, etc. >Are there more busses >in the system, but only a subset to any board? Yes, the integer boards get the four IL buses, one of the physical address buses each, and two IF buses (that don't count because they're carried over very short cables to their floating-point board-pair). The floating boards touch all four FL buses and all four Store buses -- that's where the eight comes from. By the way, if it wasn't clear -- in figure 4 of the paper, the "I3,I2,I1,I0" boxes are the Integer boards, the Fn boxes are the Floating boards, and the Mn boxes are the memory controllers. > Everybody I know who has a reason to know the details of >the TRACE says that it is loaded down with crossbars. How and >where? There are a lot of crossbars, but "loaded down?" 
If you have a lot of general-purpose buses, you need a fair amount of muxing hardware somewhere in order to take advantage of it. Once you've paid the price for having all those buses (bus drivers everywhere you look), providing the crossbars to maximize the win only seems right. Besides, in this architecture, you can put those crossbars into the gate arrays. The only place we couldn't make them fit was on the store buses, so there's a bunch of PALs to handle it there (10 or so per F board, I guess?). The main places that come to mind are the inputs to the Register Files (both integer and floating -- they're built from the same gate array type), and the store file outputs. Also, don't skim over the point we made in the article too fast about putting a full crossbar between address generators and memory. If you don't do this, the compiler has to know exactly where something is in memory (or is going to be) in order to schedule the bus to transfer the datum. With a crossbar, the compiler only has to know where something is relative to everything else, an easier task. Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
mike@arizona.edu (Mike Coffin) (07/31/88)
From article <1988Jul29.202400.28068@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer): > ... There are three possibilities here: > > 1. Code is simple enough for human programmer to figure out easily, in > which case the compiler should be able to do likewise, perhaps > with a bit of help. > > 2. Code is wrong, human programmer is (e.g.) not validating his inputs > properly. Compiler should reject it. > > 3. The code really is right but very subtle. Programmer has to have a > way to tell the compiler "this is subtle but right". This should > not be the default, as it effectively is now. 4. Code is simple enough for a human programmer to figure out easily given "deep" knowledge about what the program is doing, in which case a compiler has big troubles figuring out almost anything. I have worked on some mathematical software --- computational group theory, if it matters --- where the code itself looks broken unless you know the right theorem. The theorem establishes an invariant that makes the code very easy to understand. I just thought of a simple example: a piece of code relies on the fact that an array always contains a permutation of the integers 1:N. This invariant is originally established by generating a random permutation via a very clever algorithm (Floyd's), and then maintained by always permuting the integers --- sometimes swapping them, other times rotating several. A procedure that relies on the fact that the array is a permutation is going to be extremely opaque to the compiler, yet might be transparent to a programmer. (Especially if the array is called "permutation".) -- Mike Coffin mike@arizona.edu Univ. of Ariz. Dept. of Comp. Sci. {allegra,cmcl2,ihnp4}!arizona!mike Tucson, AZ 85721 (602)621-4252
elg@killer.DALLAS.TX.US (Eric Green) (07/31/88)
In message <1988Jul28.173620.7325@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says: >In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >>... To go faster, they would have >>needed a faster 68000. Which in 1985 would have probably added $200 to >>the cost of the machine (after going through two levels of markups). > >Which, I would guess, is the same order of magnitude as the cost added >by the custom chips, after the same markups. An Amiga 500 costs $650 retail. That's with power supply, disk drive, 512K of RAM (which probably accounts for 1/3rd of the cost, by itself), etc. Somehow, I don't see $200 of that as being blitter, especially since there's four custom chips in there and the blitter is only part of one of them. Commodore has a long record of building inexpensive custom chips, dating back to the lowly Commodore 64 (which had a custom video chip and a custom sound chip, along with a gate-array for general "glue" purposes). Perhaps Sun cannot build a blitter for less than $500, but that's Sun's problem. You cannot build an argument against blitters on the basis of cost (also see the $30 NSC chip). You will have to fall back to the bus bandwidth argument. On a 68000, which cannot saturate modern memories, period, having a blitter use the extra bandwidth is smart. On a 68030 with an icache, the CPU can already saturate the bus, so having a blitter is more of a tradeoff. Unless you have a separate bus for the blitter and video memory, a blitter buys you nothing with a 68030 (or a cached 68020, for that matter). -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. Box 92191 Lafayette, LA 70509 MISFORTUNE, n. The kind of fortune that never misses.
karl@sugar.uu.net (Karl Lehenbauer) (08/01/88)
In article <1988Jul28.170834.6949@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes some more flamelets, and... > ...You get a much more useful system if you break down and > admit that you're building a multiprocessor machine, and make all the > CPUs general-purpose. I don't know at what point this becomes weirding out. Isn't it OK to have a dedicated processor on the multiport serial I/O boards? Isn't it OK to have microprocessors in the disk controllers? ...in a graphics terminal? So why not on a graphics board? Why can't we put a processor anywhere that's useful and have general multiprocessing as well? Why should my graphics CPU, with dedicated memory and, eventually, parallel processing, hardware transforms and such have to be complicated by the need to run user programs? Your statement could be read to criticize every engineer designing a part of a system who used a uP for something for not implementing with random logic instead and adding another general purpose processor. That's absurd, right? OK then, so why is the display among all things sacrosanct from the use of custom or dedicated hardware? -- -- -- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184
peter@ficc.UUCP (Peter da Silva) (08/02/88)
In article ... henry@utzoo.uucp (Henry Spencer) writes: > The more smarts you > put in your display processor, the more it resembles a somewhat-crippled > main processor. You get a much more useful system if you break down and > admit that you're building a multiprocessor machine, and make all the > CPUs general-purpose. So, the question then becomes a matter of price. When one is building a cheap (under $1000) computer, the cost of a second 68000 (that still won't be as fast as a blitter, since it doesn't have those nice I-caches in your 68030) becomes a significant factor. Here's a way to decide: you write the fastest LIFE program you can on the 68030, using the "dumb" algorithm (calculate all cells every cycle). Allow for clock rate differences. The Blitter can be used to pump LIFE along at a bit over 20 generations per second. The 68000 can't hope to catch up. How's the 68030? -- Peter da Silva, Ferranti International Controls Corporation, sugar!ficc!peter. "You made a TIME MACHINE out of a VOLKSWAGEN BEETLE?"
jesup@cbmvax.UUCP (Randell Jesup) (08/02/88)
In article <1988Jul28.171444.7068@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >In article <4324@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes: >>>> ... The techniques are >>>> not that well known (especially in the micro community, which is notorious >>>> for its ignorance of the state of the art -- the people at Commodore most >>>> likely had no idea it was even possible to do a fast software BitBlt). >> Excuse me, I have a paper in front of me published in EUUG last fall >>by a researcher at AT&T on optimized software BitBlits in C on an 68020. >I don't consider a 68020 a micro, myself. The 68020 machine I am writing >this on is much bigger and faster than the minicomputer I was using a >month ago. When I say "micro", I mean something that competes with PCs. True, but 68020 code is largely applicable to the 68000, plus the paper included examples in pure C and in 68000. And 68020's are about to become considered micros. (though you can build mini's out of them: the difference between mini and micro is becoming a matter more of price tag and I/O speed than processor type - and even those are fuzzy.) We do compete with PC's (actually, the A500 is positioned even lower than most PC-clones, about $600). However, we try very hard to be much closer to "state of the art" than other micros (multitasking, semaphores, blitters, etc, etc). And our OS comes with the machine, it doesn't cost you an extra $795 or whatever. ;-) >As for insulting the wrong people, if you're reading papers like that, I >would consider you a (laudable) rare exception to the (deplorable) rule. The only reason it ticked me off was the direct reference to Commodore (for whom I work on Amiga system software). >> Fast software implementations of BitBlit tend to be 2 sources, 1 >>destination blits (therefor 16 operations). If you expand them to 3 sources, >>1 destination (256 operations), they tend to take up a LOT of code/ROM >>space, or they slow down a lot, or both... 
> >This is where dynamic compilation, the original subject of this conversation >(remember back then? :-)) wins big: you don't need to duplicate all that >code N times, just know how to generate it. Well, you need to revise your algorithms. The paper I cited, for example, uses effectively "canned" code pre-compiled by the compiler. Quite amusing code to read, it does truly evil things with the macroprocessor, but is more or less portable (and almost entirely in C, though the fastest version has one or two #asm's in it). It means you have to REALLY do code generation on the fly, if you need 256 operations. Not impossible, but much more of a pain to write/support. Having a compiler do most of your work for you makes things much easier. :-) -- Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
jesup@cbmvax.UUCP (Randell Jesup) (08/02/88)
In article <1988Jul28.173620.7325@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes: >>... To go faster, they would have >>needed a faster 68000. Which in 1985 would have probably added $200 to >>the cost of the machine (after going through two levels of markups). > >Which, I would guess, is the same order of magnitude as the cost added >by the custom chips, after the same markups. Actually no, since the custom chips do everything else as well (video, sprites, ram refresh, 4-channel audio, disk io, serial, parallel.) The blitter is a small part of one chip, using the address generation/DMA control of another (which does these for all the custom chip DMA channels). Back in '84-'85 14+Mhz 68000's weren't very cheap. (and we still would have needed the custom chips for everything else). -- Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup
wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (08/03/88)
In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes: > > Here's a way to decide: you write the fastest LIFE program you can > on the 68030, using the "dumb" algorithm (calculate all cells every cycle). > Allow for clock rate differences. > > The Blitter can be used to pump LIFE along at a bit over 20 generations > per second. The 68000 can't hope to catch up. How's the 68030? Okay, so how does blitter help if the cell array is the frame buffer. I mean it seems kind of a waste to work in an array and then to move it out to the frame buffer. Wayne Knapp
henry@utzoo.uucp (Henry Spencer) (08/04/88)
In article <6490@megaron.arizona.edu> mike@arizona.edu (Mike Coffin) writes: >4. Code is simple enough for a human programmer to figure out easily >given "deep" knowledge about what the program is doing, in which case >a compiler has big troubles figuring out almost anything... Well, remember, I'm not talking about proof of correctness; all I want is absence of runtime errors. Even so, I concede that this sort of thing is likely to happen now and then. Having to turn the checking off once in a while (and, preferably, having to justify this in writing) still strikes me as better than not having it at all. -- MSDOS is not dead, it just | Henry Spencer at U of Toronto Zoology smells that way. | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
scott@applix.UUCP (Scott Evernden) (08/04/88)
In article <3084@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes: >In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes: >> >> The Blitter can be used to pump LIFE along at a bit over 20 generations >> per second. The 68000 can't hope to catch up. How's the 68030? > >Okay, so how does blitter help if the cell array is the frame buffer. The blitter can be used as a sort of "parallel processor". You allocate 5 bitmaps, and then blit to and fro simulating a parallel adder. Here's the generation section of an Amiga program I wrote in early '86 which did about 5 gens/sec for a 320 x 200 grid. I needed 39 2-input blits per generation. Wiser folks than I have since improved this to about 20/sec by going straight to the blitter and using its 3-input capability, requiring 9 blits. (See 'Amazing Computing' vol.2 #12 thru vol.3 #2.) Inspired by Mark D. Niemiec's "Life Algorithms" in Byte, Jan '79...
----------------------------------------------
/* Blitter op-codes */
#define MOV 0xC0
#define COM 0x50
#define XOR 0x60
#define BIC 0x20
#define BIS 0xE0
#define AND 0x80

/* Blitter use */
#define SBLIT(s,l,t,d,c) BltBitMap(s,l,t, d,Left,Top,Wid,Hyt,c,0xFF,0)
#define BBLIT(s,d,c)     BltBitMap(s,Left,Top, d,Left,Top,Wid,Hyt,c,0xFF,0)

generation()
/*
 * The blitter "program" to produce the next generation.
 * I think it might be possible to work directly with the blitter
 * to utilize its 3-input capability, but this is for some other day...
 */
{
    /* add up the neighbors above each cell */
    SBLIT(bm[SCREEN], Left-CellX, Top-CellY, bm[0], MOV);
    SBLIT(bm[SCREEN], Left,       Top-CellY, bm[1], MOV);
    SBLIT(bm[SCREEN], Left+CellX, Top-CellY, bm[2], MOV);
    BBLIT(bm[1], bm[0], XOR);
    BBLIT(bm[0], bm[1], BIC);
    BBLIT(bm[2], bm[0], XOR);
    BBLIT(bm[0], bm[2], BIC);
    BBLIT(bm[2], bm[1], BIS);

    /* add up the neighbors below each cell */
    SBLIT(bm[SCREEN], Left-CellX, Top+CellY, bm[2], MOV);
    SBLIT(bm[SCREEN], Left,       Top+CellY, bm[3], MOV);
    SBLIT(bm[SCREEN], Left+CellX, Top+CellY, bm[4], MOV);
    BBLIT(bm[3], bm[2], XOR);
    BBLIT(bm[2], bm[3], BIC);
    BBLIT(bm[4], bm[2], XOR);
    BBLIT(bm[2], bm[4], BIC);
    BBLIT(bm[4], bm[3], BIS);

    /* now add the 'above' and 'below' counts */
    BBLIT(bm[2], bm[0], XOR);
    BBLIT(bm[0], bm[2], BIC);
    BBLIT(bm[2], bm[1], XOR);
    BBLIT(bm[1], bm[2], BIC);
    BBLIT(bm[3], bm[1], XOR);
    BBLIT(bm[1], bm[3], BIC);
    BBLIT(bm[3], bm[2], BIS);

    /* left and right */
    SBLIT(bm[SCREEN], Left-CellX, Top, bm[3], MOV);
    SBLIT(bm[SCREEN], Left+CellX, Top, bm[4], MOV);
    BBLIT(bm[4], bm[3], XOR);
    BBLIT(bm[3], bm[4], BIC);

    /* grand total neighbors */
    BBLIT(bm[3], bm[0], XOR);
    BBLIT(bm[0], bm[3], BIC);
    BBLIT(bm[3], bm[1], XOR);
    BBLIT(bm[1], bm[3], BIC);
    BBLIT(bm[4], bm[1], XOR);
    BBLIT(bm[1], bm[4], BIC);
    BBLIT(bm[3], bm[2], BIS);
    BBLIT(bm[4], bm[2], BIS);

    /* add 'self' in and put on screen */
    SBLIT(bm[SCREEN], Left, Top, bm[0], BIS);
    BBLIT(bm[2], bm[0], BIC);
    BBLIT(bm[1], bm[0], AND);
    SBLIT(bm[0], Left, Top, bm[SCREEN], MOV);
}
-scott
peter@ficc.UUCP (Peter da Silva) (08/04/88)
In article <3084@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes: > In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes: > > The Blitter can be used to pump LIFE along at a bit over 20 generations > > per second. The 68000 can't hope to catch up. How's the 68030? > Okay, so how does blitter help if the cell array is the frame buffer. I > mean it seems kind of a waste to work in an array and then to move it > out to the frame buffer. It took me a while to parse that sentence, because of course the cell array is the frame buffer. Then I realised that you were missing one important point... The Blitter is doing the work. I don't have the article here, but the Blitter can actually do all the work of creating the next generation. I think it takes 11 blits. -- Peter da Silva, Ferranti International Controls Corporation, sugar!ficc!peter. "You made a TIME MACHINE out of a VOLKSWAGEN BEETLE?" "Well, I couldn't afford another deLorean." "But how do you ever get it up to 88 miles per hour????"
elg@killer.DALLAS.TX.US (Eric Green) (08/07/88)
In message <3084@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) says: >In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes: >> >> The Blitter can be used to pump LIFE along at a bit over 20 generations >> per second. The 68000 can't hope to catch up. How's the 68030? > >Okay, so how does blitter help if the cell array is the frame buffer. I >mean it seems kind of a waste to work in an array and then to move it >out to the frame buffer. The Amiga blitter is in "chip" memory (the memory also accessible by the video chip and other custom DMA devices, e.g. the sound DMA channel which runs around 32khz or so). I assume that this is what you are referring to as a "frame buffer". In the microcomputer world, generally the video memory is in the same addressing space as the rest of CPU memory -- it's just on an isolated bus, where video refresh cannot interfere with CPU operations (at least in the case of IBM PC type thingys, and, partially, the Amiga). In any event, blitter memory and "video" memory are the same, so there's no need to do any moving. I assume that this is the case with most viable blitter systems. In the microcomputer world, "frame buffers" are generally associated with "frame grabbers", i.e., video frame digitizers. Perhaps this is a case where workstation/Sun terminology slams into conflicting microcomputer terminology? -- Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg Snail Mail P.O. Box 92191 Lafayette, LA 70509 MISFORTUNE, n. The kind of fortune that never misses.
atbowler@watmath.waterloo.edu (Alan T. Bowler [SDG]) (08/15/88)
In article <61781@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes: > >What if they look at the microcode, or directly at the hardware (if the >underlying machine isn't implemented using microcode)? In order for the >environment to be truly airtight, the microcode (or hardware) has to be >airtight. > >It may be that producing a sufficiently airtight compiler that implements the >object-oriented model by producing the "right" code, including checks, is >either impossible or more costly than producing a microcode or hardware >implementation atop a compiler that doesn't have to be so airtight because the >microcode or hardware is. If people's experience points in this direction, >fine. > Perhaps this is the wrong question. There is also the danger that the existence of the mandatory checks will increase program complexity as the programmer finds that he has to program around the environment to accomplish the task. A classic example is those Pascal programs that allocate one large static linear array and then run their own memory allocator within it. Instead of machine pointers they pass around indexes into the array. This has defeated the purpose of Pascal's tight checking, reintroduced all the old danger of running off the end of your allocated memory (into a chunk logically attached to another pseudo pointer), and increased the notational complexity of the program (and probability of errors). If you examine these programs and what they accomplish, you will usually conclude 1) This was the best choice that could be made given the constraints of the language 2) The programmer must have been persistent to get the thing working at all.
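[A minimal C sketch of the pattern Bowler describes; all names are invented for illustration. One static array serves as the whole heap, and "pointers" are just indices into it, so nothing bounds-checks a pseudo-pointer that wanders out of its block into a neighboring one.]

```c
#include <assert.h>

#define HEAPSIZE 1024

static int heap[HEAPSIZE];   /* the one big static array            */
static int next_free = 0;    /* trivial bump allocator, no freeing  */

/* Returns a pseudo-pointer (an index), or -1 for "out of memory". */
int pseudo_alloc(int ncells)
{
    int p = next_free;
    if (next_free + ncells > HEAPSIZE)
        return -1;
    next_free += ncells;
    return p;
}

/* Nothing here (or in the language) stops p+i from running off the
   end of its block into a cell owned by another pseudo-pointer.    */
int  pseudo_get(int p, int i)        { return heap[p + i]; }
void pseudo_set(int p, int i, int v) { heap[p + i] = v; }
```

Note how an out-of-range offset into the first block quietly reads the second block's cells, exactly the failure Pascal's checking was supposed to prevent.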
lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (08/22/88)
In article <20349@watmath.waterloo.edu> atbowler@watmath.waterloo.edu (Alan T. Bowler [SDG]) writes: >..... A classic example are those Pascal programs >that allocate 1 large static linear array and then run their own >memory allocator withing it. Instead of machine pointers they >pass arround indexes into the array. This has defeated the purpose >of the Pascal tight checking, reintroduced all the old danger of >running off the end of your allocated memory (into a chunk logically >attached to another pseudo pointer), and increased the notational complexity >of the program (and probability of errors). If you examine these programs >and what they accomplish, you will usually conclude > 1) This was the best choice that could be made given the constraints > of the language > 2) The programmer must have been persistent to get the thing working > at all. Just a comment on this: it is the typical technique for implementing pointers and dynamic memory handling in Fortran, which, unlike Pascal, doesn't have pointer types or dynamic memory allocation as language primitives. Are "those Pascal programs" perhaps translated Fortran programs? Bjorn Lisper
Robert_P_Rubendunst@cup.portal.com (08/30/88)
Every operating system that loads code into RAM and then executes it is running that nasty old self-modifying code. Every cache on every computer should be flushable by CPU instruction.
jpoon@mipos2.intel.com (Jack Poon~) (10/11/89)
Greetings, Could any experts out there educate me on WHY and HOW self-modifying code is used? What is the advantage of self-modifying code that non-self-modifying code cannot achieve? Is there any compiler which will generate self-modifying code? A small and useful example of self-modifying code would be very helpful. Thanks, Jack Poon
slackey@bbn.com (Stan Lackey) (10/11/89)
In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes: >Greetings, > >Could any experts out there educate me WHY and HOW does self-modifying code use? >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? I'm not an expert, but I figured I'd answer anyway. There are many examples of self-modifying code. Perhaps the most common is for code to change the value of an immediate operand. This would be an advantage if memory space were very tight. However, it is discouraged mainly because modern processors often have deep prefetching of instructions, if not instruction caches, and it can be difficult to detect whether a normal write touches an instruction that has already been prefetched. So, newer architectures have specified unpredictable behavior when this is done. Yet there are uses for self-modifying code. Some spreadsheet programs compile on-the-fly to improve execution speed (the difference between compiled and interpreted code), and some bitblts as well. Debuggers often write jumps or traps into the instruction stream, to make single-stepping and breakpoints work. All architectures do have some kind of facility to make instruction-space writes work; there's a sequence that clears instruction prefetchers and instruction caches. Often a supplier will furnish a system service that a user program can call. -Stan
firth@sei.cmu.edu (Robert Firth) (10/11/89)
In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes: >Could any experts out there educate me WHY and HOW does self-modifying code use? >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? In the spirit of this question, I won't discuss why the current consensus is that self-modifying code is bad, but rather will give examples. Directly self-modifying code is pretty rare nowadays, but used to be an essential tool. Consider how you access an array element, A[i], on a modern machine:

    Load accumulator, address-of-A[0](index-register)

Some older machines didn't have index registers, so your compiler generated this:

    Load  accumulator, i
    Add   accumulator, address-of-A[0]
    Store accumulator, right-half-of-instruction-L
L:  Load  accumulator, contents-of-address-0

So, by the time you executed L, the right operand had been modified to read 'contents-of-address-A[i]'. And, basically, you HAD to do it this way. Some debuggers still play this trick. If you set a breakpoint at L, the debugger actually overwrites instruction L with 'trap-to-debugger', of course saving the old instruction for subsequent execution. ---- Rather more common are systems where code is generated and executed dynamically. Of course, any code overlay system is like that: a data area is allocated, filled with bits read from a file, and then jumped into and obeyed as code. A more sophisticated system allows the dynamic replacement of code bodies by alternative versions; thus, if during debugging you suspect the new version of "munge()" is bad, you replace it with the old version, read in from some file in the program development environment, and carry on running. This doesn't need true self-modifying code; rather, it needs dynamic variability of procedure bodies (eg called via pointers), and the ability to remap pieces of memory from data space into code space.

    Load index-register, i
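[Firth's "dynamic variability of procedure bodies (eg called via pointers)" can be sketched in C with no actual code modification at all; the version names and behaviors below are hypothetical.]

```c
#include <assert.h>

/* Two versions of the procedure body (both made up). */
static int munge_v1(int x) { return x + 1; }   /* old, trusted version */
static int munge_v2(int x) { return x * 2; }   /* new, suspect version */

/* Every caller reaches munge through this pointer. */
static int (*munge)(int) = munge_v2;

/* "Replace it with the old version ... and carry on running",
   without relinking the program.                               */
void revert_munge(void) { munge = munge_v1; }
```

In a real system the replacement body would be read in from a file and mapped executable; here both bodies are simply linked in, which is enough to show the calling discipline.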
---- Finally, one very powerful technique is used in some implementations of prototyping languages. Typically, such languages are interpreted, so each statement in 'Object Logo Plus', or whatever, is turned into a data structure that is crawled over by an interpreter. The trick is this: keep track of how often a code fragment (procedure, method, or whatever) is modified, and how often it is executed. When a certain stability threshold is reached, compile the sucker into true machine code, and have all subsequent executions be direct rather than interpreted. Over a prototyping session of an hour or two, the user will observe a speedup of 2x to 20x, as the more stable and heavily used parts of the program get converted gradually into machine code. Hope that helps.
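[The threshold trick Firth describes can be sketched in C with stand-in functions. The "interpreted" and "compiled" bodies here are both ordinary C and the threshold is invented; in a real system the fast path would be freshly generated machine code.]

```c
#include <assert.h>

#define HOT_THRESHOLD 3   /* invented: "compile" after 3 executions */

static int square_interp(int x);
static int square_fast(int x) { return x * x; }  /* stands in for generated code */

/* All calls go through this pointer. */
static int (*square)(int) = square_interp;
static int square_runs = 0;

/* Stands in for the interpreter crawling over a data structure. */
static int square_interp(int x)
{
    if (++square_runs >= HOT_THRESHOLD)
        square = square_fast;   /* stability reached: retarget the pointer */
    return x * x;               /* same answer, slower route              */
}
```

The observable speedup comes purely from later calls bypassing the interpreter; callers never know the switch happened.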
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/11/89)
In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes: > >Could any experts out there educate me WHY and HOW does self-modifying code use? >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? In prehistoric times, self-modifying code was used to make up for machine limitations. This led to snarled messes, and today, the practice mostly survives in three areas:

Copy-protected PC software: the obscurity treated as virtue.

Debuggers: a debugger may write e.g. traps into the code area, and may fill little buffers with code (and then execute the buffer). (Note that typical hardware features, such as trap-after-each-instruction modes, have NOT obsoleted this practice.)

Interpreters: some interpreters will actually compile little code sequences, and then run them on the spot. Some APLs did this. (APL compilers aren't possible because you run into expressions whose meaning isn't known until you compute the value of some previous expression. So, the interpreter would just compile up-to-there, run that, and then go back around.)

-- Don D.C.Lindsay Carnegie Mellon Computer Science
chase@Ozona.orc.olivetti.com (David Chase) (10/11/89)
jpoon@mipos2.intel.com (Jack Poon~) writes: >Could any experts out there educate me WHY and HOW does self-modifying >code use? This was discussed at length a couple of months ago; I believe Robert Henry made some sort of a summary. >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? It's possible to generate a "function" on the fly. Koopman and Lee had a fun paper in this summer's SIGPLAN conference (SIGPLAN Notices July 1989) where they used self-modifying code in an interpreter. >Is there any compiler which will generate code that self-modified? I had one once, but scrapped that scheme for portability reasons. I believe that some LISP compilers make use of self-modifying code in various ways. >A small and useful example of self-modifying will be very helpful. The following runs on a Motorola 68020 (with Sun calling conventions):

----------------

typedef int (*PF)();
PF bar;

PF partially_apply_f_to_a(f,a)
unsigned int f;
unsigned int a;
{
    static unsigned int pa_bits[] = {
        0x20172f00, 0x203a000e, 0x2f400004,
        0x4e714ef9, 0xFFFFFFFF, 0xAAAAAAAA};
    unsigned int * code = (unsigned int *) malloc (sizeof pa_bits);
    unsigned int i;

    for (i = 0; i < 6; i++) code[i] = pa_bits[i];
    code[4] = f;
    code[5] = a;
    return (PF) code;
}

----------------

If you have a 68020, you can test it with the following program:

----------------

foo(a,b) { return a - b; };

main()
{
    PF bar;
    printf("Executing text gives 9 - 2 = %d\n", foo(9,2));
    bar = partially_apply_f_to_a(foo,9);
    printf("Executing partial application gives 9 - 2 = %d\n", (*bar)(2));
    bar = partially_apply_f_to_a(bar,2);
    printf("Executing second partial application gives 9 - 2 = %d\n", (*bar)());
}

----------------

Note that this code is less "self-modifying" and more "run-time-generated", though rapid recycling of a block of memory could cause it to appear to be "self-modifying". Enjoy, David
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/11/89)
In article <4415@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: >into and obeyed as code. A more sophisticated system allows the >dynamic replacement of code bodies by alternative versions; thus, if >during debugging you suspect the new version of "munge()" is bad, you Anybody else remember the TSS operating system, which let a system programmer test modified system routines on his own task on the fly without affecting other tasks? Of course, VM can do this, too, but try to find a Unix system with a *real* dynamic loader like this :-) Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117
usenet@cps3xx.UUCP (Usenet file owner) (10/11/89)
In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes: >Greetings, > >Could any experts out there educate me WHY and HOW does self-modifying code use? >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? Back when I was doing assembly hacking on the RadioShack coco, I needed a super fast block copy. To save a few cycles per loop, I poked the calculated indices into the load/store instructions. I ended up with a routine that could copy blocks at a rate of less than 3 cycles per byte. This is good considering each instruction takes a minimum of 2 cycles, and it takes two instructions to do a load here, store there. >Is there any compiler which will generate code that self-modified? >A small and useful example of self-modifying will be very helpful. Possible, but I doubt it. Joe Porkka porkka@frith.egr.msu.edu
tada@athena.mit.edu (Michael J Zehr) (10/11/89)
I found an example of self-modifying code back in the middle ages (c. 1981) in the operating system on a Z-80 chip. As I recall, the Z-80 had bit set, clear, and test op-codes which needed an assembly-time bit number rather than using a register to determine which bit to set/clear/test. It took less space to determine the right bit and load the opcode (I think a shift would give you the opcode for a bit, or maybe shift and add an offset) into the right place in memory and then execute it. The other option was to have 8 branch statements. But the way to really save space was to call the routine with a parameter which indicated which operation (set, clear, test) and after computing the opcode to indicate the right bit, also select the opcode to indicate the kind of operation and load that into memory and execute it. *very* kludgy, but it took a surprisingly small amount of space with only a small penalty in speed. -michael j zehr
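[The opcode arithmetic described above is simple on the Z-80: the bit operations are CB-prefixed, and (as I recall the encoding) the second byte packs the operation kind, bit number, and register field into one byte. A hedged C sketch of just the computation the routine would do before poking the byte into memory:]

```c
#include <assert.h>

/* Operation kinds as they appear in bits 7-6 of the second byte:
   01 = BIT, 10 = RES, 11 = SET (00 is the shift/rotate group).   */
enum z80_bitop { Z80_BIT = 1, Z80_RES = 2, Z80_SET = 3 };

/* Register fields: B=0 C=1 D=2 E=3 H=4 L=5 (HL)=6 A=7.
   Computes the second byte (oobbbrrr) of the CB-prefixed
   instruction; the Z-80 routine would store this into the
   instruction stream and execute it.                             */
unsigned char z80_bit_opcode(enum z80_bitop op, unsigned bit, unsigned reg)
{
    return (unsigned char)(((unsigned)op << 6) | (bit << 3) | reg);
}
```

So one shift-and-or replaces both the 8-way branch on bit number and the 3-way branch on operation kind.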
scott@bbxsda.UUCP (Scott Amspoker) (10/11/89)
In article <48682@ricerca.UUCP> chase@Ozona.UUCP (David Chase) writes: >jpoon@mipos2.intel.com (Jack Poon~) writes: >>Could any experts out there educate me WHY and HOW does self-modifying >>code use? In some cases self-modifying code was necessary just to implement subroutines. I used to work on an old IBM System/3 (now *there's* a clunker). There was no subroutine branch instruction. Anytime you did a normal branch, the address of the next instruction was placed into the "address recall register (ARR)". The first thing the subroutine had to do was store the ARR into the proper location of some branch instruction that would ultimately act as a "return". Obviously, subroutines were not re-entrant. In fact, we had a standard convention of using the expression '*-*' (read as: location counter minus location counter: absolute 0). So, if you were maintaining some assembly code and encountered an instruction such as:

    bra   *-*

You knew that the operand would be determined at runtime. Scary stuff, no? -- Scott Amspoker Basis International, Albuquerque, NM (505) 345-5232 unmvax.cs.unm.edu!bbx!bbxsda!scott
johnl@esegue.segue.boston.ma.us (John R. Levine) (10/11/89)
In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes: >Could any experts out there educate me WHY and HOW does self-modifying code >use? What the advantage of using self-modifying code that non-self-modifying >code cannot achieve? Depending on your point of view, either every modern computer uses self-modifying code, or almost nobody uses it any more. One of the great conceptual breakthroughs that made the modern computer possible was to store instructions and data in the same memory, so that instructions could create and modify other instructions. (I believe it was due to Von Neumann, but Babbage may have preceded him.) In ancient days, before about 1954, there were no index registers or even indirect addressing, so the only way you got a computer program to do anything interesting like add up all of the elements in an array was to have the program patch itself. For example, if your array started at location 100, the first instruction in the loop would be something like "ADD 100" which added the contents of location 100. Next you wanted to add location 101, so you'd do that by adding 1 to that ADD instruction, changing it to "ADD 101", and so forth. There were, as you might expect, a whole slew of clever tricks involving diddling instructions. Since the advent of index registers, rather than modifying the add instruction in memory, one uses an index register or indirect address word which modifies the effect of the instruction without having to modify the instruction in memory. The bad side of modifiable code is that a buggy program that is trying to change data will change code instead, which can lead to some truly spectacular wipeouts. (Sometimes they completely clear memory, so there is no trace of the program at all!) There are performance problems associated with self-modifying code as well. 
Most computers fetch instructions considerably ahead of where they are executing; even the lowly 8086 can fetch up to 6 bytes ahead of where it is executing. This means that if you store into the next instruction, the CPU might or might not already have fetched that instruction, so it might execute the old instruction or the new one, depending on such things as interrupts, DMA, and even register contents. Needless to say, this kind of bug is very hard to find. Experience suggests that in most cases, the possible damage from bugs smashing code outweighs the possible gain from allowing modifications, particularly since indexing and indirect addressing solve the problems most often addressed by code modification. Most computers now make the code read-only, so while a particular program is running it is forcibly prevented from modifying itself. There are a few places where instruction modification is still used. In some cases where programs dynamically link to libraries, the "call" instructions that refer to library routines are modified by the dynamic linker to point to wherever the routine happens to have ended up. High-performance sort programs that run on mainframes invariably write the inner comparison loop for the sort at the time the sort starts, so they don't have to reinterpret the sort specifications over and over. Incremental compilers such as are commonly found in Lisp systems compile a function to code at the time it is first called, install that code into the running process, and sometimes even patch call instructions to point to it (like the dynamic linker.) Some people might claim that compiling a program to object code and then loading and executing it counts as instruction modification. After all, the compiled code wasn't there when the computer started up, somebody must have modified what was there, so we all modify code all the time. 
I find this definition a little precious, and prefer to consider code to be self-modifying only in the case where the modifier and modifiee are part of the same program. I leave the definition of "same program" up to you. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl Massachusetts has over 100,000 unlicensed drivers. -The Globe
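[Levine's pre-index-register loop is easy to make concrete with a toy stored-program machine, entirely invented here: instructions and data share one memory array, and the summing loop bumps the address field of its own ADD instruction on each pass, just like the "ADD 100" becoming "ADD 101".]

```c
#include <assert.h>

/* Opcodes of the toy machine (made up for illustration). */
enum { HALT = 0, ADD = 1, INC = 2, DJNZ = 3 };

/* A "word" is opcode*1000000 + A-field*1000 + B-field. */
#define OPCODE(w) ((w) / 1000000)
#define AFIELD(w) (((w) / 1000) % 1000)
#define BFIELD(w) ((w) % 1000)

/* Run the program at address 0 until HALT; return the accumulator. */
long run(long mem[])
{
    long acc = 0;
    int pc = 0;
    for (;;) {
        long w = mem[pc];
        switch (OPCODE(w)) {
        case HALT: return acc;
        case ADD:  acc += mem[BFIELD(w)]; pc++; break;  /* acc += mem[B]     */
        case INC:  mem[BFIELD(w)] += 1;   pc++; break;  /* patches a word -- */
                                                        /* possibly code!    */
        case DJNZ: pc = (--mem[AFIELD(w)] != 0) ? (int)BFIELD(w) : pc + 1;
                   break;                               /* loop control      */
        }
    }
}

/* Sum mem[10..12] by patching the ADD's own address field each pass. */
long demo_sum(void)
{
    long mem[16] = {0};
    mem[0] = 1 * 1000000 + 10;        /* ADD 10  (becomes ADD 11, ADD 12)  */
    mem[1] = 2 * 1000000 + 0;         /* INC 0   bump the ADD's address    */
    mem[2] = 3 * 1000000 + 9 * 1000;  /* DJNZ 9,0  loop 3 times            */
    mem[3] = 0;                       /* HALT                              */
    mem[9] = 3;                       /* trip count                        */
    mem[10] = 5; mem[11] = 6; mem[12] = 7;
    return run(mem);                  /* 5 + 6 + 7 = 18                    */
}
```

The INC at address 1 is the whole trick: on a machine with index registers it would be replaced by indexing, and the instruction at address 0 would never change.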
koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/11/89)
In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes: > ... > What the advantage of using self-modifying code that non-self-modifying code > cannot achieve? > ... In combinator graph reduction execution of functional programming languages, self-modifying code gives approximately a factor of two speedup on almost any hardware. This is because graph reduction is by its very nature self-modifying. My paper "A fresh look at combinator graph reduction" in the 1989 SIGPLAN conference on Programming Language Design and Implementation has details. In my case, the self-modification is limited to changing the addresses of subroutine call instructions. If a RISC architecture such as the R2000 had a jump to subroutine that took an address out of the data cache (perhaps by using the succeeding word in memory as an address, referenced through the data cache) this would be good enough for my application. Phil Koopman koopman@greyhound.ece.cmu.edu Arpanet 2525A Wexford Run Rd. Wexford, PA 15090 Senior Scientist at Harris Semiconductor, and researcher at CMU. I don't speak for them, and they don't speak for me.
peter@ficc.uu.net (Peter da Silva) (10/11/89)
One use of self-modifying code that still survives is total balls-to-the-wall block operations, such as BitBlts. There was a discussion of hardware graphics coprocessors where someone (maybe Henry Spencer, as the leading opponent of the beasts) brought this up. I have not seen such code myself, but it involved generating the BitBlt subroutine optimised for the operation desired on the fly and then executing it. -- Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation. Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com. `-_-' 'U` Quote: Structured Programming is a discipline -- not a straitjacket.
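[On a modern machine the generate-and-jump trick looks like the minimal sketch below. It assumes an x86-64 POSIX system that permits writable-and-executable mappings (hardened systems may refuse PROT_WRITE|PROT_EXEC); it emits a two-instruction routine, "mov eax, imm32; ret", into fresh memory and returns it as a callable function.]

```c
#define _DEFAULT_SOURCE
#include <string.h>
#include <sys/mman.h>

typedef int (*intfn)(void);

/* Generate, at run time, a function that returns the constant n.
   x86-64 encoding: b8 <imm32> = mov eax, imm32; c3 = ret.        */
intfn make_const_fn(int n)
{
    unsigned char *code = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return 0;
    code[0] = 0xb8;              /* mov eax, imm32 */
    memcpy(code + 1, &n, 4);     /* the immediate  */
    code[5] = 0xc3;              /* ret            */
    return (intfn)code;
}
```

A real compiled-BitBlt generator works the same way, just emitting a loop specialized to the requested raster operation instead of a single constant; on machines with instruction prefetch or a separate I-cache you would also need the platform's cache-flush sequence between writing and calling.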
bjornl@tds.kth.se (Bj|rn Lisper) (10/11/89)
Self-modifying code can be seen as partial evaluation delayed to runtime. Partial evaluation, in its cleanest form, is to use the knowledge of some function arguments to simplify the function. Consider a function f(x,y) in two arguments x,y. If we know the value of x, say that x=c, then we can substitute c for x in the call f(c,y) and obtain a simplified "residual" function fc(y). Usually this is done at compile-time. A prime example is substitution of symbolic constants. Nothing prevents this from being done at runtime, however (if, for instance, x is not known before then), which then results in self-modifying code. Bjorn Lisper
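[Lisper's runtime specialization can be sketched in C with the code generation replaced by a closure record: the residual function fc is represented as data rather than freshly compiled. The function f and all names here are illustrative.]

```c
/* The base function f(x,y); chosen arbitrarily for illustration. */
static int f(int x, int y) { return x * y + x; }

/* The residual fc: a record standing in for generated code. */
typedef struct {
    int c;                    /* the frozen argument x = c */
    int (*base)(int, int);
} residual;

/* Runtime partial evaluation: freeze x = c. */
residual specialize_x(int (*fn)(int, int), int c)
{
    residual r = { c, fn };
    return r;
}

/* fc(y) = f(c, y). */
int apply_residual(residual r, int y) { return r.base(r.c, y); }
```

A true partial evaluator would go further and simplify the residual body (e.g. fold c*y+c for a known c into specialized code); generating that simplified body at runtime is exactly where the self-modifying-code connection appears.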
pl@etana.tut.fi (Lehtinen Pertti) (10/11/89)
From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman): > > In my case, the self-modification is limited to changing the > addresses of subroutine call instructions. If a RISC architecture > I've lately been wondering whether there is any architecture with the possibility of executing an instruction indirectly. The old Intel 8080 handled interrupts by executing an instruction provided by the peripheral, but does any current architecture support anything like this? I think this would provide a way to create 'self-modifying' code without crashing into cache problems. I mean something like:

    exec r0     ; execute instruction in register r0

or

    exec (r0)   ; execute instruction pointed to by r0

Does anyone know? --
pl@tut.fi                        ! All opinions expressed above are
Pertti Lehtinen                  ! purely offending and in subject
Tampere University of Technology ! to change without any further
Software Systems Laboratory      ! notice
wilson@carcoar.Stanford.EDU (Paul Wilson) (10/11/89)
In article <4415@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: >Finally, one very powerful technique is used in some implementations >of prototyping languages. Typically, such languages are interpreted, >so each statement in 'Object Logo Plus', or whatever, is turned into >a data structure that is crawled over by an interpreter. > >The trick is this: keep track of how often a code fragment (procedure, >method, or whatever) is modified, and how often it is executed. When >a certain stability threshold is reached, compile the sucker into true >machine code, and have all subsequent executions be direct rather than >interpreted. Over a prototyping session of an hour or two, the user >will observe a speedup of 2x to 20x, as the more stable and heavily used >parts of the program get converted gradually into machine code. > >Hope that helps. Does anybody actually do this? It seems the obvious thing to do, but I don't know of anybody who does it. All of the dynamic compilation systems I know of compile things the very first time they're executed, and don't recompile unless something gets redefined. It seems to me that in a very high-performance system you want to do this, recompiling with a higher level of optimization when an execution count hits a threshold. This will heuristically improve your response time/ execution time tradeoff in an interactive programming environment. (And also your space/time tradeoffs in the generated code.) I would be *very* interested in anybody's experiences with such a system, especially any statistics on the frequencies of execution of different procedures/methods. (e.g., does a 90/10 rule apply, or an 80/20, or what? What does the curve look like?) I'm interested in preventing unnecessary compilation and optimization of things that aren't executed very frequently. This is not just to avoid useless compiler work, but to keep the code transparently debuggable whenever possible. 
(Yes, I know you can often debug optimized code by recording the optimizations and backward-mapping them at debug time. But sometimes you can't, and it's always more work for the compiler and debugger writers. I'm looking at two levels of optimized code -- easily source-mappable "transparent" optimizations, and hairy "opaque" optimizations for important well-debugged time-critical code. Any info to evaluate the tradeoffs would be of interest.) -- Paul Paul R. Wilson Software Systems Laboratory lab ph.: (312) 996-9216 U. of Illin. at C. EECS Dept. (M/C 154) wilson@carcoar.stanford.edu Box 4348 Chicago,IL 60680
joel@cfctech.UUCP (Joel Lessenberry) (10/12/89)
I did a lot of work on IBM series 1's with the event driven executive operating system...programmed in Event Driven Language. EDL (sorta a MACRO assembler) had opcode parameter naming ie move #1,1,p2=lable1 this would move to index reg #1 the immediate value 1, with lable1 being the address of the immediate in the opcode. all of the instructions had this 'feature'. Actually, the S/1 was not all that bad. any one else have any experience? joel Joel Lessenberry, Distributed Systems | +1 313 948 3342 joel@cfctech.UUCP | Chrysler Financial Corp. joel%cfctech.uucp@mailgw.cc.umich.edu | MIS, Technical Services {sharkey|mailrus}!cfctech!joel | 2777 Franklin, Sfld, MI
johne@hpvcfs1.HP.COM (John Eaton) (10/12/89)
> What the advantage of using self-modifying code that non-self-modifying code cannot achieve? --------- Job Security... I once traced a problem in some PC software to a bit of SMC that worked on an 8088 but not on an 8086. Turned out that the larger prefetch of an 8086 would grab the old data before the write and use that instead of what the programmer intended. John Eaton !hpvcfs1!johne
hascall@atanasoff.cs.iastate.edu (John Hascall) (10/12/89)
In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes: }From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman): } I've been lately wondering if there is any architecture } with possibility to execute instruction indirectly. } I mean something like: } } exec (r0) ; execute instruction pointed by r0 } Does anyone know? Well, how about the IBM 370 series. (not exactly modern, I know) If I recall correctly:

          EX    mask,location
          <next instruction>
location: <target instruction>

Fetches the target instruction at "location" and ORs "mask" with (some of) the target instruction and then executes the resulting instruction. Control resumes at "next instruction". The EX instruction is used all the time as it is the only decent way to do a number of things. John Hascall ISU Comp Center
cooper@hpsrad.enet.dec.com (g.d.cooper in the shadowlands) (10/12/89)
One of my major reasons for using self-modifying code was due to the user implementable instructions (UUOs) on the PDP-10. If you were doing funky IO you could define your own system call that executed within the user space. Since IO channel numbers are different for each file you would build the instruction sequences somewhere in memory and then execute them. A couple of months ago I was writing a device driver for RT-11 on the PDP-11 and I had to use self-modifying code to either store or discard data depending upon whether or not there was a free buffer. The '11 lends itself to this sort of manipulation by allowing instruction sequences such as:

        tstb    free            ;any free buffers
        bne     1$              ;yes don't worry be happy
        mov     (pc)+,@#routine ;no free buffers then we'll
        nop                     ;have to discard
        br      2$
1$:     mov     (pc)+,@#routine ;move the mov instruction into
        mov     (r0)+,(r1)+     ;the routine
2$:

I also remember doing similar things on the PDP-8 and the RDS-500 due to hardware limitations but the less said about that the better. Hacking assembler for a living, shades ============================================================================ | He paid too high a price for living | Geoffrey D. Cooper | | too long with a single dream..... | cooper@hpsrad.enet.dec.com | |-------------------------------------| business (508) 467-3678 | | decwrl!hpsrad.enet.dec.com!cooper | home (617) 925-1099 | ============================================================================ Note: I'm a consultant. My opinions are *MY* opinions.
mrc@Tomobiki-Cho.CAC.Washington.EDU (Mark Crispin) (10/12/89)
In article <1619@atanasoff.cs.iastate.edu> hascall@atanasoff.UUCP (John Hascall) writes: >In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes: >}From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman): >} I've been lately wondering if there is any architecture >} with possibility to execute instruction indirectly. > Well, how about the IBM 370 series. (not exactly modern, I know) On the DEC PDP-10 (CPU for the DECsystem-10 and DECSYSTEM-20), *all* instructions calculated an "effective" address based on an 18-bit value (the right half of the instruction word), index register (the low order 4 bits of the left half iff non-zero), and indirect bit (the next higher order bit after the index register). This was done before any opcode decoding was done (hence, you would get any page faults caused by the effective address decoding before you'd get any illegal instruction faults). Indirection was recursive, so you could do an amazing set of references. So, you could always execute indirect and/or indexed. In fact, it was fairly common to do XCT FCNTAB(FC) where FCNTAB was a table of instructions to execute and FC was the desired index into that table. Mark Crispin / 6158 Lariat Loop NE / Bainbridge Island, WA 98110-2020 mrc@CAC.Washington.EDU / MRC@WSMR-SIMTEL20.Army.Mil / (206) 842-2385 Atheist & Proud / 450cc Rebel pilot -- a step up from 250cc's!!! tabesaserarenakerebanaranakattarashii...kisha no kisha ga kisha de kisha-shita sumomo mo momo, momo mo momo, momo ni mo iroiro aru uraniwa ni wa niwa, niwa ni wa niwa niwatori ga iru
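[The closest portable C analogue of XCT FCNTAB(FC) is indexing a table of function pointers: the table holds callable things, and the index selects which one to "execute". The three operations below are made up for illustration.]

```c
/* Three made-up "instructions" for the dispatch table. */
static int op_clear(int x) { (void)x; return 0; }
static int op_incr(int x)  { return x + 1; }
static int op_neg(int x)   { return -x; }

/* FCNTAB: the table of executable entries, indexed by FC. */
static int (*FCNTAB[])(int) = { op_clear, op_incr, op_neg };

/* The analogue of "XCT FCNTAB(FC)". */
int xct(int fc, int x) { return FCNTAB[fc](x); }
```

The difference, of course, is that the PDP-10 table held single machine instructions rather than whole procedures, so the dispatch cost was one instruction fetch, not a call and return.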
cac@steven.COM (cac) (10/12/89)
In article <9175@etana.tut.fi>, pl@etana.tut.fi (Lehtinen Pertti) writes: + I've been lately wondering if there is any architecture + with possibility to execute instruction indirectly. + I think this would provide a way to create 'self-modifying' + code without crashing into cache problems. + I mean something like: + exec r0 ; execute instructio in register r0 + or + exec (r0) ; execute instruction pointed by r0 + Does anyone know? The PDP-15 did, as I recall, and I think the PDP-10 did as well, altho I never did assembly on the 10. -- Charles A. Clinton Disclaim! Disclaim! Sierra Geophysics, Inc. ...!uw-beaver!sumax!ole!steven!cac 11255 Kirkland Way Voice: (206) 822-5200 Telex: 5106016171 Kirkland, WA 98033 USA FAX: (206) 827-3893
johnl@esegue.segue.boston.ma.us (John R. Levine) (10/12/89)
In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes: >From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman): > I've been lately wondering if there is any architecture > with possibility to execute instruction indirectly. Lots of architectures do. The IBM 370 has an execute instruction, as does the DEC-20, GE 635 or whatever they call it these days (Bull DPS-8, perhaps) and many other more antique or obscure machines. Each has its own quirks. On the 370, EX takes two operands, the address of the instruction to execute and a register. The low byte of the register is logically or-ed with the instruction before it is executed (in the instruction decoder, not in memory.) In memory-to-memory move or compare instructions, the second byte is the length, so this is how you faked variable length string instructions on the 360. The success of this approach can be appreciated by noting that one of the changes in the 370 was to add variable length string compare and move instructions. It is expressly forbidden to execute another execute instruction. On the DEC-20, you can execute anything you want, including another execute instruction. On the '20 the execute instruction again takes two operands, the location of the instruction to be executed and a register. In early implementations of the PDP-6 and PDP-10 the register was ignored, but later they used the register field to mean things like "interpret the source address in user context" to make it easier to fetch and store user data while executing a system call in kernel mode. The -20 also takes interrupts by executing instructions in fixed addresses. You can use a "read (or write) word and increment pointer," thereby implementing a poor man's DMA. I haven't seen execute instructions on any more modern machines; I guess that the same arguments that give us read-only code say that the hackery to implement execute isn't worth it either. 
I can't see how you could implement it on a pipelined machine without totally draining the pipeline, thus causing a terrible performance hit. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl Massachusetts has over 100,000 unlicensed drivers. -The Globe
bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/12/89)
In article <236@bbxsda.UUCP>, scott@bbxsda.UUCP (Scott Amspoker) writes: > So, if you were maintaining some assembly code and enountered > an instruction such as: > > bra *-* > > You knew that the operand would be determined at runtime. > Scary stuff, no? It's equivalent to the FORTRAN assigned GOTO, or the PDP-11's JSR/JMP reg, or other jumps calculated on the fly, it's just that the address is saved in memory. CDC's peripheral processors (PP's) had 4k bytes. In order to squeeze everything into a PP, they relied heavily on overlays, and putting code in buffer areas to preset constants, etc. The program would execute the code in the buffer, which set any jumps, constants, etc, that needed changing, and then reuse the buffer in the main code area.
art@salt.acc.com (Art Berggreen) (10/12/89)
I seem to recall one of the primary paper tape bootstraps for PDP-11s used a loop which read a byte from the tape into the instruction loop. The punch pattern for the leader left the code unchanged until the beginning of a record at which time the code was modified to jump to a different part of the bootstrap controlled by the byte read off the tape. Details are a bit hazy (twas quite a while ago...).
clj@ksr.com (Chris Jones) (10/12/89)
In article <9175@etana.tut.fi>, pl@etana (Lehtinen Pertti) writes: > > I've been lately wondering if there is any architecture > with possibility to execute instruction indirectly. > > I mean something like: > > exec r0 ; execute instructio in register r0 > > or > > exec (r0) ; execute instruction pointed by r0 We used to do this on Data General (16 bit) Eclipses. Peripherals on the machine were addressed by device codes, and the I/O instructions had the device codes as part of the instruction (rather than being taken from a register). To allow the same code to service more than one device, we'd build an I/O instruction in a register, then execute it out of the register. The GE/Honeywell 635/645/DPS-8 line had both XEC (execute) and XED (execute double) instructions, used to implement one or two instruction subroutines. By the way, the instruction format on those machines was a 36 bit word, with the 18 bits of address in the upper half of the word. I have been told the reason for this is so that self-modifying code wouldn't destroy opcode bits if address manipulation caused a carry out of the address field.
tim@cayman.amd.com (Tim Olson) (10/12/89)
In article <6481@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: | In my case, the self-modification is limited to changing the | addresses of subroutine call instructions. If a RISC architecture | such as the R2000 had a jump to subroutine that took an address | out of the data cache (perhaps by using the suceeding word | in memory as an address, referenced through the data cache) | this would be good enough for my application. But it does! The R{2,3}000, SPARC, Am29000, (and 88K, I believe) all have the ability to perform a register-indirect call, where the target of the call is sourced from a general-purpose register. These instructions normally are used to build "big" addresses, but can be used in your application simply by loading the modified address into the register before performing the call. A similar technique for branches is normally used for performing "switch" statements via a jump table stored in data memory. -- Tim Olson Advanced Micro Devices (tim@amd.com)
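Olson's suggestion, keeping the call target in a register or data word instead of rewriting the call instruction's address field, looks something like this in C (both routines and the pointer name are hypothetical):

```c
#include <assert.h>

static int version_a(int x) { return x + 1; }
static int version_b(int x) { return x * 2; }

/* The "address" of the subroutine lives in data space; updating it
   changes the call's behavior without touching the instruction
   stream, so no instruction cache flush is ever required.          */
static int (*subroutine)(int) = version_a;

static int call_site(int x)
{
    return subroutine(x);   /* compiles to a register-indirect call */
}
```

The same layout, an array of such pointers indexed by a case value, is the jump table that compilers emit for dense switch statements.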
aaronb@lagrange.gatech.edu (Aaron Birenboim) (10/12/89)
In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes: >... Most computers fetch >instructions considerably ahead of where they are executing; even the lowly >8086 can fetch up to 6 bytes ahead of where it is executing. This means that >if you store into the next instruction, the CPU might or might not already >have fetched that instruction, so it might execute the old instruction or the >new one, depending on such things as interrupts, DMA, and even register >contents. Needless to say, this kind of bug is very hard to find. Is this really true? There isn't enough dependency checking in the 8086 instruction pipe to detect this type of operation, clear the pipe, and re-fetch the altered instruction, or some such corrective measure. I'm glad I don't try self-modifying code. Aaron Birenboim | aaronb@eedsp.gatech.edu | Why do we have to wear Georgia Tech Box 30735 | (404) 874-1973 | shoes all the time? Atlanta, GA 30332 +-------------------------+ USENET: ...!{allegra,hplabs,ihnp4,ulysses}!gatech!gt-eedsp!aaronb
yodaiken@freal.cs.umass.edu (victor yodaiken) (10/12/89)
Forget where I saw this nice example of self-modifying code. It was billed as one of the earliest mutual exclusion programs.

L:	move opcode["branch to L"] to L
	critical region
	...
	move opcode["move opcode["branch to L"] to L"] to L

The first program to execute the instruction at L changes the instruction so that all others will just busy loop, and then changes it back when it leaves the critical region. Of course it only works for uni-processors.

victor
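The mechanism can be simulated as data in C: treat the word at label L as a one-slot "program" that a process inspects and then rewrites. This is only a sketch of the idea (the opcode names are invented), and, as the post says, the real trick is safe only on a uniprocessor, where the test and the rewrite cannot interleave:

```c
#include <assert.h>

enum op { OP_ENTER, OP_SPIN };   /* OP_ENTER: the original instruction */

/* the single "instruction" stored at label L */
static enum op slot_L = OP_ENTER;

/* returns 1 if the caller got into the critical region */
static int try_enter(void)
{
    if (slot_L == OP_SPIN)
        return 0;             /* someone rewrote L into a busy loop   */
    slot_L = OP_SPIN;         /* move opcode["branch to L"] to L      */
    return 1;
}

static void leave(void)
{
    slot_L = OP_ENTER;        /* restore the original instruction     */
}
```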
fwb@demon.siemens.com (Frederic W. Brehm) (10/12/89)
Compiler generated self-modifying code: YES. When I was a graduate student (in ancient times: 1974) I had occasion to look at the generated code for both a CDC 6000-series Fortran compiler and an IBM 370-series COBOL compiler. Both compilers used the same technique for inserting argument addresses in subroutines (did you know COBOL has subroutines?!!!). The compilers generated code that copied the argument addresses from the argument list into the instruction address fields. Fred -- Frederic W. Brehm Siemens Corporate Research Princeton, NJ fwb@demon.siemens.com -or- princeton!siemens!demon!fwb
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/12/89)
In article <9175@etana.tut.fi>, pl@etana.tut.fi (Lehtinen Pertti) writes: | I've been lately wondering if there is any architecture | with possibility to execute instruction indirectly. Among others the GE-600 series, currently marketed as the Honeywell DPS. It has an execute and an execute double (execute two instructions at a location). One use of self-modifying code is to get around bad hardware design, such as making i/o port numbers immediate addresses. If this is the case, as it has been on a number of machines, you either need to have one device driver per port or modify the code. There was some such limitation on the DPS, and the way I got around it, rather than write self-modifying code, was to build the instruction I needed on the stack and do an XEC on it. This was useful on the models which support pure code segments. I think the 8080 had the i/o port as part of the address of the instruction, and the Z80 added the ability to have the port address in a register. Someone will undoubtedly remember it better than I do. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "The world is filled with fools. They blindly follow their so-called 'reason' in the face of the church and common sense. Any fool can see that the world is flat!" - anon
fwb@demon.siemens.com (Frederic W. Brehm) (10/12/89)
In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes: > I've been lately wondering if there is any architecture > with possibility to execute instruction indirectly. >... > I mean something like: > > exec r0 ; execute instructio in register r0 >... > Does anyone know? The Data General Eclipse (16-bit) had an instruction just like this. I don't know if the MV series has one like it. Fred -- Frederic W. Brehm Siemens Corporate Research Princeton, NJ fwb@demon.siemens.com -or- princeton!siemens!demon!fwb
EGNILGES@pucc.Princeton.EDU (Ed Nilges) (10/12/89)
I have followed this discussion with interest. In modern very-high-level interpreted languages like REXX, an interesting variant of the SMC issue emerges. This is the use of the INTERPRET verb:

	INTERPRET expression

evaluates "expression" and executes it as a REXX statement. Some of the same arguments pro and con are used with reference to the use of INTERPRET:

pro:
	1. It's "efficient"
	2. It allows me to execute more complex and sophisticated
	   algorithms in which code is data and data is code, e.g.,
	   implementation of expert systems

con:
	1. It's "confusing"
	2. It breaks associated technology. Traditional SMC breaks
	   instruction caches and other such contraptions, whereas
	   use of INTERPRET breaks compilers. This last point is not
	   a necessity, since the compiler could generate code for
	   lazy evaluation of the expression, but the only current
	   compiler for REXX simply prohibits INTERPRET.

Other examples of this SMC variant include DO in Hypertalk.
henry@utzoo.uucp (Henry Spencer) (10/12/89)
In article <1989Oct12.041940.5651@ginger.acc.com> art@salt.acc.com (Art Berggreen) writes: >I seem to recall one of the primary paper tape bootstraps for PDP-11s >used a loop which read a byte from the tape into the instruction loop. Bootstraps written before the days of decent-sized EPROMs often employ all kinds of dirty tricks to minimize size. The classic was the toggle-in (i.e. manually entered, none of this wimpy ROM business :-)) bootstrap for the pdp8 with RK disks: two instructions. The first was "kick disk controller" (the controller handily came up in a fairly reasonable state after bus reset). The second was "branch to self". These went in at a very specific location in low memory. As the second-stage bootstrap came in off disk, it overwrote low memory, and the early part of it was very carefully crafted to pick up control and wait for the disk controller to finish. -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
andrew@frip.WV.TEK.COM (Andrew Klossner) (10/13/89)
[] "I had occasion to look at the generated code for both a CDC 6000-series Fortran compiler and an IBM 370-series COBOL compiler. Both compilers used the same technique for inserting argument addresses in subroutines (did you know COBOL has subroutines?!!!)." Did you know that COBOL has an instruction to modify a GOTO instruction? ALTER LOOP-LABEL TO PROCEED TO ALL-DONE. means change the destination of the GOTO at LOOP-LABEL from whatever it is now to the label ALL-DONE. (I've heard that the ANSI COBOL committee is doing away with this, and I've heard further that members of the committee have been sued, but I don't know details.) Of course, you can implement this without actually modifying any code, by arranging that all branches which are targets of ALTER statements fetch their targets from data space. "The R{2,3}000, SPARC, Am29000, (and 88K, I believe) all have the ability to perform a register-indirect call, where the target of the call is sourced from a general-purpose register. These instructions normally are used to build "big" addresses ..." On the 88k, at least, these instructions are more often used to implement indirect procedure calls, for example to translate the C construct return_value = (*procedure_ptr)(arg1, arg2); The direct procedure call on the 88k can address plus-or-minus 128MB of instruction space, which so far has been big enough for my codes. :-) -=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
cassel@sce.carleton.ca (Ron Casselman) (10/13/89)
I seem to remember a piece of self-modifying code that used to run on CP/M. I remember it because I had to disassemble it as an assignment for a course in systems software. The program would copy itself to a different part of memory and change all address references as it did so. It would then load a second program into the original location. (all programs under CP/M started at a fixed location). The second program would then begin to run. Sorry I cannot remember any more details as to the purpose of this self-modifying code.
pasek@ncrcce.StPaul.NCR.COM (Michael A. Pasek) (10/13/89)
In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
> I've been lately wondering if there is any architecture
> with possibility to execute instruction indirectly.
>[stuff deleted]
> I mean something like:
> exec r0 ; execute instructio in register r0

Something like this is certainly possible. The IBM 360/370 architecture has an "Execute" instruction. The instruction allows you to execute a single instruction "indirectly", with bits 8-15 of the executed instruction modified by bits 24-31 of a register specified on the Execute instruction. For example, to do an immediate compare of a byte value:

	LA	R1,BYTEVAL	GET THE VALUE WE'RE LOOKING FOR
	EX	R1,COMPARE	CHECK TO SEE IF IT'S THERE
	.
	.
COMPARE	CLI	TESTVAL,0	COMPARE

In the above example, the "virgin" instruction in memory has the following format (assuming that "TESTVAL" is at absolute location 00000123 in memory):

	95000123

It is actually executed (assuming BYTEVAL = X'AF') as:

	95AF0123

The Execute instruction can be very handy for variable-length block moves, and is certainly much safer than self-modifying code. Of course, you could still call it self-modifying, it's just modified in the CPU rather than in memory :-).

M. A. Pasek          Switching Software Development      NCR Comten, Inc.
(612) 638-7668       CNG Development                     2700 N. Snelling Ave.
pasek@c10sd3.StPaul.NCR.COM                              Roseville, MN 55113
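The EX semantics described above can be checked with a few lines of C that model the instruction word as a 32-bit value (bit 0 being the most significant, per 370 bit numbering); `ex` here is a hypothetical helper for illustration, not a real API:

```c
#include <assert.h>
#include <stdint.h>

/* Simulate 370 EX: bits 8-15 of the target instruction are OR-ed with
   bits 24-31 (the low byte) of the register.  In a uint32_t holding the
   instruction image, bits 8-15 are the second-highest byte, i.e. the
   byte reachable by shifting left 16.  The modification happens to the
   returned copy only -- "in the CPU rather than in memory".            */
static uint32_t ex(uint32_t reg, uint32_t target_instr)
{
    uint32_t mask = (reg & 0xFFu) << 16;   /* low byte into bits 8-15 */
    return target_instr | mask;
}
```

Feeding it the example above reproduces the 95000123 to 95AF0123 transformation.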
johnl@esegue.segue.boston.ma.us (John R. Levine) (10/13/89)
In article <527@eedsp.gatech.edu> aaronb@lagrange (Aaron Birenboim) writes: >In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> I wrote: >>even the lowly 8086 can fetch up to 6 bytes ahead of where it is executing >>[so if you store into the next instruction it will probably execute the >>old version] >Is this really true? There isn't enough dependancy checking in the 8086 >instruction pipe to detect this type of operation, clear the pipe, and >re-fetch the altered instruction, or some such corrective measure. You bet there isn't. Why waste all that logic to handle the .000001% of programs that would want to store into the next instruction? And you thought the 8086 wasn't a RISC. You can patch code reliably so long as there is a branch executed between the patcher and the patchee, since all '86 processors flush the prefetch queue on a branch. The advisability of patching code remains as dubious as ever, particularly on machines in the '86 series in which the funky instruction encoding makes it tricky to identify the correct location to patch. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl Massachusetts has over 100,000 unlicensed drivers. -The Globe
bean@putter.sgi.com (Bean Anderson) (10/13/89)
In article <16557@siemens.siemens.com>, fwb@demon.siemens.com (Frederic W. Brehm) writes: > In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes: > > I've been lately wondering if there is any architecture > > with possibility to execute instruction indirectly. > >... > > I mean something like: > > > > exec r0 ; execute instructio in register r0 > >... > > Does anyone know? > The HP300 (a discontinued machine) had an execute-top-of-stack instruction that got considerable use -- mostly because certain instructions that had immediate operands did not have corresponding instructions that had stack (or register) operands. I think the HP3000 had a similar instruction. Bean
slackey@bbn.com (Stan Lackey) (10/13/89)
In article <1989Oct12.184824.2465@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes: >In article <527@eedsp.gatech.edu> aaronb@lagrange (Aaron Birenboim) writes: >>In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> I wrote: >>>even the lowly 8086 can fetch up to 6 bytes ahead of where it is executing >>>[so if you store into the next instruction it will probably execute the >>>old version] >>Is this really true? There isn't enough dependancy checking in the 8086 >>instruction pipe to detect this type of operation, clear the pipe, and >>re-fetch the altered instruction, or some such corrective measure. > >You bet there isn't. Why waste all that logic to handle the .000001% of >programs that would want to store into the next instruction? All the PDP-11's that do prefetching or instruction caching DO check for possible writes to the next instruction, and flush/refetch/etc to make it work, simply because it isn't specified NOT to work. Even 11 mode on the VAXes does this. Some shortcuts are taken, like just comparing the 8? low bits of a data write to those of the prefetched instructions; performance will be hurt a little, but erring on the conservative side always works. Why "waste" all that logic? Because it's better for you in the long run, if as much as possible of old software works on new hardware. -Stan
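Lackey's shortcut, comparing only a few low bits of a write address against the prefetched instruction addresses, can be sketched in C as follows (the function name and the 8-bit width are assumptions; even the post hedges with "8?"):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Conservative store-into-prefetch check: compare only the low 8 bits
   of a data-write address against the addresses of the prefetched
   instructions.  False matches merely cost a needless flush/refetch;
   a genuine overlap can never slip through.                          */
static bool write_hits_prefetch(uint16_t write_addr,
                                const uint16_t *prefetched, int n)
{
    for (int i = 0; i < n; i++)
        if ((write_addr & 0xFFu) == (prefetched[i] & 0xFFu))
            return true;     /* flush and refetch to be safe */
    return false;
}
```

The check can afford to be sloppy in one direction only: extra flushes hurt performance a little, but missing a real write to a prefetched instruction would break the architecture's promise.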
colwell@mfci.UUCP (Robert Colwell) (10/13/89)
>You vendors of custom processors has better take note of this statement. It >is really amusing to see the Multiflow and Alliant fellows bragging at how >well they are doing when single chip microprocessors run circles around their >entire multiprocessor complexes. The SUN-4 Sparcstation which did all this I used to design micros, and I used to design workstations, too. What they're capable of now, and where they're going in the performance space, isn't the constant source of surprise you imply. I have yet to brag about anything here. This is comp.arch, not comp.marketing. Marketing and market success are peripheral issues which I have no interest in pursuing in this forum. Or do you think that you have to make a lot of money off something before it can be considered technically important? What interests me is building fast computers. To that end, Multiflow has contributed something original and innovative, and as far as I can tell, multiple-instruction issue is a path that seems to be gaining adherents, especially among the very micros I'm supposed to be terrified of. From a comp.arch point of view I think that's neat. >"damage" is an absolute DOG compared to the most recent Sparc, MIPS, or i860 >chip sets. No doubt a real terror from Motorola is not far away. The lifetime >of custom processors, even processors like the Cray line, are real short. >Imagine the surprise of CRI executives when they realize that they are no >longer losing market share to the Japanese super computer companies, and are >now losing market share to microprocessors. We are watching this happen >internally at LLNL and do not doubt that it is happening elsewhere. >There is no defense against the ATTACK OF THE KILLER MICROS! We've been through this before. When micros want to supplant the high-end supers, they'll have to come up with answers to the age-old (well, ok, maybe 15 years old) problem of how to build a memory that will keep a fast CPU fed. Such a memory costs a Whole Lot Of Money. 
When you've sunk a lot of money into that memory, you can afford more CPU than you can buy off the shelf. There *is* some question about how fast one can turn out specialized processors vs. how fast the micro guys can crank them out. (It's clear who's winning that race for certain data points such as Cray vs. Moto/Intel, though.) But you do have a point (whether it's the one you intended, I don't know). If LLNL personnel are getting their work done on workstations, maybe the market for true supercomputer performance (for which you need the kind of memory and I/O bandwidth I referred to above) isn't as big as people have always thought it was. Or their workloads have migrated away from the canonical vector codes to codes that no longer show off the big supers to their best advantage? If either of these is true, then you're right, this presages a major shift in what hardware people buy. Bob Colwell ..!uunet!mfci!colwell Multiflow Computer or colwell@multiflow.com 31 Business Park Dr. Branford, CT 06405 203-488-6090
mustard@sdrc.UUCP (Sandy Mustard) (10/13/89)
In article <1619@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
> Well, how about the IBM 370 series. (not exactly modern, I know)
> If I recall correctly:
>	EX	mask,location
>	<next instruction>
> location: <target instruction>
> Fetches the target instruction at "location" and ORs "mask" with
> (some of) the target instruction and then executes the resulting
> instruction. Control resumes at "next instruction".
> John Hascall
> ISU Comp Center

To quote the 370/XA Principles of Operations manual, "When the target instruction is a successful branching instruction, the instruction address of the current PSW is replaced by the branch address specified in the target instruction." Thus control may resume at the target address of the branch instruction.

Sandy Mustard
mustard@sdrc.UU.NET
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/13/89)
You might consider instructions of the form "Execute Register" or "Execute Memory", where the primary instruction specified where a (single) second instruction was to be fetched from, a form of self-modifying code. Certainly, these caused the same difficulties for the pipeline. I have seen "Execute Register" used to synthesize dynamic shift instructions, where the shift amount is not known at compile time (dynamic shifts are much less frequent than static shifts in most instruction mixes), and to produce the indirect syscall() system call. "Execute Register" was also used to produce a relatively fast instruction level simulator for a similar, but not identical, instruction set processor. -- Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com 1101 E. University, Urbana, IL 61801, USA. {uunet!,}uiucuxc!udc!aglew My opinions are my own; I indicate my company only so that the reader may account for any possible bias I may have towards our products.
mcdonald@uxe.cso.uiuc.edu (10/13/89)
>>There is no defense against the ATTACK OF THE KILLER MICROS! >We've been through this before. When micros want to supplant the high-end >supers, they'll have to come up with answers to the age-old (well, ok, >maybe 15 years old) problem of how to build a memory that will keep a fast >CPU fed. Such a memory costs a Whole Lot Of Money. When you've sunk a lot >Bob Colwell ..!uunet!mfci!colwell >Multiflow Computer or colwell@multiflow.com Prognostication: When the need arieses (oh My, I'll leave that typo in. Our MIPS is named Aries) the micro people will find a way to get the memory bandwidth they will need. 4 megabit chips, itsy bitsy low capacitance wires, a few ASICS, maybe even a few fiber optics, they will squeeze it into the motherboard somehow. And it won't cost a Whole Lot of Money, because it will be simple (outside the chips, of course!!) and elegant. Doug MCDonald
aubrey@rpp386.cactus.org (Aubrey McIntosh) (10/13/89)
In article <1989Oct12.162236.24239@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >In article <1989Oct12.041940.5651@ginger.acc.com> art@salt.acc.com (Art Berggreen) writes: >>I seem to recall one of the primary paper tape bootstraps for PDP-11s >>used a loop which read a byte from the tape into the instruction loop. > >Bootstraps written before the days of decent-sized EPROMs often employ >all kinds of dirty tricks to minimize size. The classic was the toggle-in >(i.e. manually entered, none of this wimpy ROM business :-)) bootstrap I'm beginning to believe that every time I have an original idea, there must be a hundred people working on exactly the same thing. In my case, there was a pdp8e connected to a mass spec, and a 16 bit machine in another building that had the loadable programs. When a comm packet of type 'execute there' passed its checksum, the 'go loop' instruction was overwritten with the (successfully) loaded code's start up address. Saved lots of walking across campus to re-toggle after a comm error. -- Aubrey McIntosh Freelance using Modula-2 Real time, embedded, instruments. Austin, TX 78723 Enquiries welcome 1-(512)-452-1540 aubrey%rpp386.Cactus.org@cs.utexas.edu
henry@utzoo.uucp (Henry Spencer) (10/14/89)
In article <46852@bbn.COM> slackey@BBN.COM (Stan Lackey) writes: >>>Is this really true? There isn't enough dependancy checking in the 8086 >>>instruction pipe to detect this type of operation, clear the pipe, and >>>re-fetch the altered instruction, or some such corrective measure. >> >>You bet there isn't. Why waste all that logic to handle the .000001% of >>programs that would want to store into the next instruction? > >All the PDP/11's that do prefetching or instruction caching DO check >for possible writes to the next instruction, and flush/refetch/etc to >make it work, simply because it isn't specified NOT to work... I think that can fairly be considered an oversight by the 11's original designers. IBM did not make the same mistake on the 360 series. If you read the 360/370 "Principles of Operation" -- still a model of how to document a processor -- you will find quite a bit of very carefully phrased verbiage describing *exactly* what you can and cannot assume about self-modifying code. Even that is rather more generous than a modern designer would be, though, since the 360 series arose at a time when self-modifying code was still common practice. -- A bit of tolerance is worth a | Henry Spencer at U of Toronto Zoology megabyte of flaming. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/14/89)
Some of this discussion reminds me of discussions about threaded-code systems like Forth. Has anyone ever identified any ISA issues that are specific for code like this? I have always assumed that a fast branching RISC would be ideal, but I don't know if this is a correct assumption. (I am not asking for a repeat of RISC arguments: any instruction has to show at least a 1% speedup on an actual complete program to be considered for special support.) Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117
rcd@ico.ISC.COM (Dick Dunn) (10/14/89)
EGNILGES@pucc.Princeton.EDU (Ed Nilges) writes: > I have followed this discussion with interest. In modern very-high- > level interpreted languages like REXX, an interesting variant of the > SMC issue emerges... >... INTERPRET expression >...evaluates "expression" and executes it as a REXX statement... It's not a very new idea, though. It exists in LISP (apply/eval) and SNOBOL (compile()), both of which go way back. It exists in the UNIX shells (eval and ``). The hardware analog, as others have noted, is the EXECUTE instruction on some older machines. It may be useful to make a distinction between self-modifying and "self- generating", for want of a better word. About objections to self-modification: > 2. It breaks associated technology. Traditional SMC breaks > instruction caches and other such contraptions, whereas > use of INTERPRET breaks compilers. This last point is not > a necessity, since the compiler could generate code for > lazy evaluation of the expression, but the only current > compiler for REXX simply prohibits INTERPRET. In hardware, you just need a way to say "clean these things out of your cache". It shouldn't be hard, but it has to be provided. It helps the tradeoff at which generating code is practical if it isn't a privileged instruction. INTERPRET, or eval or compile or whatever, should not break a compiler. There's no excuse for it--as Nilges says, you just have to handle it correctly. -- Dick Dunn rcd@ico.isc.com uucp: {ncar,nbires}!ico!rcd (303)449-2870 ...No DOS. UNIX.
hedley@imagen.UUCP (Hedley Rainnie) (10/14/89)
Mention has been made of the Dec-10 and 20. These machines also had the feature that the registers were part of the address space, a very rare feature indeed these days. This allows one to code a string search algorithm and have the code reside in addresses 0..17 (octal of course), and then to jump into the registers via an XCT instruction; the pleasure of this is that the code will run at register speeds. Nowadays the cache technology is such that this kind of coding is no longer needed, but a little nostalgia is a good thing. Hedley {decwrl|sun}!imagen!iit!hedley hedley@imagen.com -- {decwrl!sun}!imagen!hedley hedley@imagen.com
cliffhanger@cup.portal.com (Cliff C Heyer) (10/14/89)
>Could any experts out there educate me WHY and HOW does self-modifying code use? The first program I ever wrote was self-modifying. In 1976 with the 8080 for a class assignment I wrote a ping pong program with LEDs. The rest of the class wrote code for each direction, but I wrote one "shell" routine and then inserted instructions into it after calculating the "pong" direction. My program was about 1/5 as long as all the others, which was important when you had only 1 or 2K of memory. (The teacher had never heard of doing such a thing, and I had to convince him I deserved an A+) >What the advantage of using self-modifying code that non-self-modifying code >cannot achieve? On DECsystem-10s and 20s I wrote code to check the serial number so that people who stole the code could not run it (easily) on their processor. But if you put the instructions in the code itself, a smart hacker could remove them from the EXE file or insert a jump statement to skip over them. So I generated the check at run time by moving "numbers" to memory and then executing them. This makes cracking the security much more difficult. I moved numbers to registers and then bit shifted them to make the instruction, then moved the PC to the register. (God forbid trying this on a VAX!) >Is there any compiler which will generate code that self-modified? If you mean C or BASIC, I'm sure there are some "hacker" compilers that allow you to play, but in general you have no way of knowing address locations that your C compiler will create, or even what instructions your C compiler will generate, so how can you write C statements that will modify something that does not yet exist? You have to use machine code, since then you know what is where. You can do it with any assembler as far as I know, but you have to fuss with the OS and/or compile switches to actually compile and run the code and suppress error messages. >A small and useful example of self-modifying will be very helpful. Read above.
baum@Apple.COM (Allen J. Baum) (10/14/89)
[]
>In article <4534@imagen.UUCP> hedley@imagen.UUCP (Hedley Rainnie) writes:
>
>Mention has been made of the Dec-10 and 20. These machines also had the
>feature that the registers were part of the address space, a very rare
>feature indeed these days, this allows one to code a string search algorithm
>and have the code reside in addresses 0..17 (octal of course), and then to
>jump into the registers via an XCT instruction, the pleasure of this is that
>the code will run at register speeds.

Except on the newer DEC-20s, it ran slower out of the register file than out of real memory. oops.
--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
dworkin@Solbourne.COM (Dieter Muller) (10/14/89)
In article <35636@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <4534@imagen.UUCP> hedley@imagen.UUCP (Hedley Rainnie) writes:
>>
>>[running out of the registers]
>
>Except on the newer DEC-20s, it ran slower out of the register file than
>out of real memory. oops.

Well, it depends on how you define `real memory'. Registers are in `fast memory'. Quoting from the Hardware Reference Manual, Volume I -- Central Processor, page 1-24:

    Fast memory times are for referencing as a memory location for an
    operand; a fast memory instruction fetch takes slightly more time
    than a cache access.

And, on the next page, a nice table of access times. The relevant entries are:

    Memory       Read/Write
    MA20 Core    .833/.40
    KL10 Fast    .067/.067
    KL10 Cache   .133/.133

Times are in microseconds for a single word access.

So, if you had cache installed, and your code happened to get placed in it (at the whim of the monitor), then yes, code from memory would run faster than code in registers. However, if you were a financially-strapped site that decided not to invest in such new-fangled things like cache, code would be much faster in the registers.

Dworkin
--
No, I'm this way normally. Why do you ask?
boulder!stan!dworkin  dworkin%stan@boulder.colorado.edu  dworkin@solbourne.com
Flamer's Hotline: (303) 678-4624 (1000 - 1800 Mountain Time)
gary@dgcad.SV.DG.COM (Gary Bridgewater) (10/14/89)
In article <16557@siemens.siemens.com> fwb@demon.UUCP (Frederic W. Brehm) writes:
>In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>> I've been lately wondering if there is any architecture
>> with possibility to execute instruction indirectly.
>>...
>> I mean something like:
>>
>> exec r0 ; execute instructio in register r0
>>...
>> Does anyone know?
>
>The Data General Eclipse (16-bit) had an instruction just like this. I
>don't know if the MV series has one like it.

The MV instruction set includes the Eclipse instruction set which includes the Nova instruction set. Assembly routines written for the first Nova run on the latest MV.

The most useful purpose, on our I/O architecture, for the XCT instruction is to write self-modifying re-entrant I/O handlers:

    load accumulator 0 with an I/O instruction with a 0 device code
    OR in the device code of interest
    XCT accumulator 0

So you have one generic disk handler for each type of disk which can handle any number of controllers. You can also write a fairly tight loop via

    load accumulator 0 with an XCT 0
    XCT 0

which was kind of an interesting thing to watch back when we had lights - the u-code address lights came on fairly solid and the address lights froze. This can be interrupted but the CPU can't be halted since halt is only honored between instructions.
--
Gary Bridgewater, Data General Corp., Sunnyvale Ca.
gary@sv4.ceo.sv.dg.com or {amdahl,aeras,amdcad,mas1,matra3}!dgcad.SV.DG.COM!gary
No good deed goes unpunished.
brooks@vette.llnl.gov (Eugene Brooks) (10/15/89)
>>There is no defense against the ATTACK OF THE KILLER MICROS!
>
>We've been through this before. When micros want to supplant the high-end
>supers, they'll have to come up with answers to the age-old (well, ok,
>maybe 15 years old) problem of how to build a memory that will keep a fast
>CPU fed. Such a memory costs a Whole Lot Of Money. When you've sunk a lot
>
>Bob Colwell  ..!uunet!mfci!colwell
>Multiflow Computer  or colwell@multiflow.com

Interleaving directly on a memory chip is absolutely trivial. The memory chip makers will do it as soon as there is a mass market for it. Of course the supercomputer vendors will be able to make use of these hot memory chips, but their slow CPUs just won't be fast enough to keep up with microprocessors.

brooks@maddog.llnl.gov, brooks@maddog.uucp
bart@videovax.tv.tek.com (Bart Massey) (10/15/89)
I can't resist throwing in my favorite example of self-modifying code. A friend and I were studying how the stack looks after a panic on a VAX 785 running 4.3BSD UNIX, and noticed that all the registers are saved by _panic. OK, we said to ourselves, that makes sense. Let's look at the assembly for panic:

    $ adb /vmunix /dev/kmem
    _panic+2,5?ia
    _panic+2:       subl2   $8,sp
    _panic+5:       cvtwl   $100,-4(fp)
    _panic+b:       tstl    _panicstr
    _panic+11:      beql    _panic+19
    _panic+13:      bisl2   $4,-4(fp)
    _panic+17:

Hmm, no register saves there -- ah, I know! The regmask, at _panic, must be set to all ones using loader magic or a sed script or something.

    _panic?x
    _panic:
    _panic:         0

At this point, we began to _panic :-). To make a long story short, sure 'nuf, after only a half-day of thinking about it, we tried

    _panic/x
    _panic:
    _panic:         fff

and found in sys/vax/locore.s

    /* trap() and syscall() save r0-r11 in the entry mask (per ../h/reg.h) */
    /* panic() is convenient place to save all for debugging */
        bisw2   $0x0fff,_trap
        bisw2   $0x0fff,_syscall
        bisw2   $0x0fff,_panic

Sigh. Of course. Since panic() is written in C, there's no clean way to set its regmask, perhaps not even with an asm(), so...

Bart Massey
..tektronix!videovax.tv.tek.com!bart
..tektronix!reed.bitnet!bart
jba@harald.ruc.dk (Jan B. Andersen) (10/16/89)
>In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>}From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
>
>} I've been lately wondering if there is any architecture
>} with possibility to execute instruction indirectly.
>
>} I mean something like:
>}
>} exec (r0) ; execute instruction pointed by r0
>
>} Does anyone know?

Data General's MV Family has an 'execute accumulator' instruction. Very useful when implementing a debugger.
don@gp.govt.nz (Don Stokes, GPO) (10/16/89)
In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron Casselman) writes:
> I seem to remember a piece of self-modifying code that used to run
> on CP/M. I remember it because I had to disassemble it as an assignment for
> a course in systems software. The program would copy itself to a different
> part of memory and change all address references as it did so. It would
> then load a second program into the original location. (all programs
> under CP/M started at a fixed location). The second program would then
> begin to run. Sorry I cannot remember any more details as to the purpose
> of this self-modifying code.

Sounds a little like DDT (Dynamic Debugging Tool). If you type DDT <program>, DDT starts itself up at the bottom of the TPA (at this stage, DDT is just a normal COM file), slides itself into high memory and loads <program> from disk into the bottom of the TPA so that the program doesn't know that DDT is there. Since DDT has to run in many different memory sizes, it would have to relocate itself dynamically (this is based on observation of my trusty Kaypro II, not an actual study of the code).

Don Stokes ZL2TNM              /  /    vuwcomp!windy!gpwd!don
Systems Programmer            /GP/     Government Printing Office     PSI%0530147000028::DON
__________________/ /__Wellington, New Zealand__________don@gp.govt.nz________
Those who believe their systems are idiot-proof just do not realise how
ingenious idiots can be.
elwin@Athena.MIT.EDU (Lee W Campbell) (10/16/89)
>>There is no defense against the ATTACK OF THE KILLER MICROS! > >We've been through this before. When micros want to supplant the high-end >supers, they'll have to come up with answers to the age-old (well, ok, >maybe 15 years old) problem of how to build a memory that will keep a fast >CPU fed. Such a memory costs a Whole Lot Of Money. When you've sunk a lot > ... Reminds me of the old joke about the Cray 1, with about 4 megawords of fast static ram: if you buy the memory, for $1 per word (not an unreasonable price in those days), Cray would throw in the CPU for free! .===. \|/ - This Space - { Max } AKA Lee - Unintentionally - H-headphones /|\ Campbell - Left Blank -
roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (10/17/89)
In article <480@gp.govt.nz> don@gp.govt.nz (Don Stokes, GPO) writes:
>In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron Casselman) writes:
>> I seem to remember a piece of self-modifying code that used to run
>> on CP/M. I remember it because I had to disassemble it as an assignment for
>
>Sounds a little like DDT (Dynamic Debugging Tool). If you type DDT
><program> DDT starts itself up at the bottom of the TPA (at this stage,
>DDT is just a normal COM file), slides itself into high memory and loads

Didn't DDT use relative addressing? Something vaguely tells me that DDT could (block) copy itself to another address. If it used relative addressing then I don't think it needed to self-modify.

Who would have thought that SMC would've been a me-too thread :-)? Anyways, here's my me-too contribution:

Many moons ago I had a cute little Z80 micro called a Jupiter Ace, which had those elegant little rubber keypads specially designed for the one-finger push-hard typists. For those of you who have never heard of this magnificent machine: it was designed by the same guys who did the ZX-81. They left Sinclair's company, founded their own startup company, and neither they nor the Jupiter Ace were ever heard of again. But I'm digressing...

The Jupiter Ace had a Forth interpreter in ROM (8k - of which 2k was a character set generator - oh for the good old days...). What I wanted to do was use the interpreter to trace through itself at the Z80 instruction level. Now the Z80 didn't have any support for trace instructions on the one hand, and I didn't want to write code to interpret the 600-odd cases that the Z80 had on the other.
I came up with the following little routine:

    save real registers/condition codes
    load registers/condition codes from emulated registers

    nop
    nop
    nop
    nop

    save registers and condition codes to emulated registers
    load real register/condition code values

By loading the instruction I wanted to trace into the area with nops and calling the routine, I could let the Z80 "calculate" the condition codes and the changed registers for me. The next step was to calculate the new pc and load the following instruction to be traced.

Of course, this was pretty primitive. It wouldn't work for branches (but I could determine if a branch would take place and set the emulated pc accordingly and continue from there). Nor would it work for io instructions :-) but they didn't occur often anyway. Primitive or no, it helped me trace through the interpreter. Wonder if the Ace is in the attic somewhere...:-)
--
Artificial Intelligence: When computers start selling stocks because its Friday the 13th....
Roelof Vuurboom  SSP/V3  Philips TDS Apeldoorn, The Netherlands  +31 55 432226
domain: roelof@idca.tds.philips.nl    uucp: ...!mcvax!philapd!roelof
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/17/89)
You *may* be thinking of the CP/M system generation. Since the o/s was written in absolute assembler it was shipped to run in a small (maybe 16k) system. When you created a larger version there was the original o/s and a list of locations which needed to be changed. The builder took a copy of the o/s and added a constant to each of the locations in the list.

Then again you may be thinking of something I've happily forgotten...
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon
baum@Apple.COM (Allen J. Baum) (10/17/89)
Reply-To: baum@apple.UUCP (Allen Baum)

[]
>In article <274@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. Vuurboom) writes:
>
>I came up with the following little routine:
>
>    save real registers/condition codes
>    load registers/condition codes from emulated registers
>
>    nop
>    nop
>    nop
>    nop
>
>    save registers and condition codes to emulated registers
>    load real register/condition code values
>
>By loading the instruction I wanted to trace in the area with nop's
>and calling the routine I could let the Z80 "calculate" the condition codes
>and the changed registers for me. The next step was to calculate the
>new pc and load the following instruction to be traced.
>
>Of course, this was pretty primitive. It wouldn't work for branches
>(but I could determine if a branch would take place and set the
>emulated pc accordingly and continue from there). Nor would it
>work for io instructions :-) but they didn't occur often anyway.

The original Apple II monitor ROM had a single step/trace routine that did exactly that. I think for branches I just replaced the offset, so it always branched to a fixed place, and updated the emulated PC with the offset if it did.
--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
mjones@stdc01.UUCP (Michael Jones) (10/17/89)
In article <527@eedsp.gatech.edu> aaronb@lagrange.UUCP (Aaron Birenboim) writes:
>johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>>... Most computers fetch
>>instructions considerably ahead of where they are executing; even the lowly
>>8086 can fetch up to 6 bytes ahead of where it is executing. This means that
>>if you store into the next instruction, the CPU might or might not already
>>have fetched that instruction, so it might execute the old instruction or ...
>
>Is this really true? There isn't enough dependancy checking in the
>8086 instruction pipe to detect this type of operation, clear
>the pipe, and re-fetch the altered instruction, or some such
>corrective measure. I'm glad I don't try self-modifying code.

It is most absolutely true. Some of the early copy-protect schemes (I think it may have been Lotus 1.0) used this sort of thing to disarm trace or single-step breaking of their code. If you ran normally, the CPU did not see the write into "here + 1", but if you were single-stepping, or had an ICE, or whatever, then it did. This happened several times in the code, with each false branch sending you to a pile of "mystery code".
--
-- Michael T. Jones             Email: ...!mcnc!rti!stdc01!mjones           --
-- The wise man will pursue     Paper: 3101-H Aileen Drive, Raleigh NC 27606 --
-- excellence in all things     Voice: W:(919)361-3800 and H:(919)851-7979   --
rec@sisyphuscpd4 (Robert Cousins) (10/18/89)
In article <480@gp.govt.nz>, don@gp.govt.nz (Don Stokes, GPO) writes:
> In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron Casselman) writes:
> > I seem to remember a piece of self-modifying code that used to run
> > on CP/M. I remember it because I had to disassemble it as an assignment for
> > a course in systems software. The program would copy itself to a different
> > part of memory and change all address references as it did so. It would
> > then load a second program into the original location. (all programs
> > under CP/M started at a fixed location). The second program would then
> > begin to run. Sorry I cannot remember any more details as to the purpose
> > of this self-modifying code.
>
> Sounds a little like DDT (Dynamic Debugging Tool). If you type DDT
> <program> DDT starts itself up at the bottom of the TPA (at this stage,
> DDT is just a normal COM file), slides itself into high memory and loads
> <program> from disk into the bottom of the TPA so that the program
> doesn't know that DDT is there. Since DDT has to run in many different
> memory sizes, it would have to relocate itself dynamically (this is based
> on observation of my trusty Kaypro II, not an actual study of the code).

Actually, there were a whole family of relocating programs under the CP/M family of operating systems. A common trick was to use PRL (Page ReLocatable) images. These were created by assembling (!) the program at two different origins (typically 0 and 100h) and taking the difference between the two images. The difference was used to build a bitmap which told the loading program exactly which bytes to adjust for the final load location.
Code of this form could be executed at any page (page = 256 bytes) boundary. MP/M codified this for running two programs in one 48k bank. One partition began at 0000h (and programs loaded at 100h) while the other began at a higher location. COM and PRL files could run at 100h, but only PRL files could run at the higher addresses. The process of configuring the kernel involved relocating it by this same technique. Some companies used it for working with their BIOS also. The technique was never highly popular, however, since people rarely needed to change the base address of their code but were unable to relink it.

I personally preferred the TurboDOS operating system. It was compatible with CP/M but approximately 6 times faster on file operations, multiuser and multiprocessor. I still have a TurboDOS machine at home.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
bls@cs.purdue.EDU (Brian L. Stuart) (10/18/89)
In article <33557@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Some of this discussion reminds me of discussions about threaded-code systems
>like Forth. Has anyone ever identified any ISA issues that are specific for
>code like this? I have always assumed that a fast branching RISC would be
>ideal, but I don't know if this is a correct assumption.
>
>  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster

This would in fact be quite useful in some Forth implementations, but those implementations really do not utilize self-modifying (or compile-on-the-fly) code. The Forth compiler I wrote (which did compile-on-the-fly and some really strange stack manipulations) would have benefited most from a VERY fast and simple subroutine linkage. (What I had wasn't really bad at all; it was a PDP-11.) The reason for this is that the executable code I generated consisted completely of JSR's, MOV #xxx,(R5)+'s, and RTS's. (The R5 being because that was my data stack pointer.) The primitives were coded in PDP-11 assembler and would have been well served by any standard RISC, though some of the 11's addressing modes were handy.

Brian L. Stuart
Department of Computer Science
Purdue University
newbery@rata.vuw.ac.nz (Michael Newbery) (10/18/89)
In article <1989Oct12.162236.24239@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Bootstraps written before the days of decent-sized EPROMs often employ
>all kinds of dirty tricks to minimize size.

describes the PDP8 bootstrap as...

>.. two instructions. The first was "kick disk
>controller" (the controller handily came up in a fairly reasonable state
>after bus reset). The second was "branch to self".[...]

Eons ago we had an IBM 1130; to boot it, you inserted a special binary punched card in the reader and pressed "Program Load". The 1130 was a 16-bit machine with an instruction format like (I think, it's been a while):

    opcode    Long  Indirect  Index_Reg  rest
    [4 bits]  [1]   [1]       [2]        [8]

But an 80-column card has only 12 rows, so the load read 12 bits into each 16-bit word as [opcode] [4 zeros] [rest]. All very well, except that the XIO (eXecute Input Output) instruction had to be Long, so the bootstrap program spent its first few instructions frantically modifying itself to create the required instructions.

The IBM-supplied card used most of the 80 columns, and a friend of mine decided he could do better. The result was a program that modified itself to create an XIO to read in sector 0 of the disk, which it then executed... at about which point the program ended. Sector 0 was read in, containing, among other things, the interrupt handling routines for the interrupt that resulted when the I/O completed, and some code that was in the right place by the time the bootstrap program reached it... The 1130 was not noted for its speediness. The new bootstrap only needed about 1/3 of a card.
--
Michael Newbery <newbery@rata.vuw.ac.nz> (...!uunet!vuwcomp!newbery if you must)
Abstain from wine, women and song. Mostly song.
jthomas@nmsu.edu (543) (10/19/89)
In article <1148@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
bill> You *may* be thinking of the CP/M system generation. Since the o/s was
bill> written in absolute assembler it was shipped to run in a small (maybe
bill> 16k) system. When you created a larger version there was the original
bill> o/s and a list of locations which needed to be changed. The builder
bill> took a copy of the o/s and added a constant to each of the locations in
bill> the list.
And CP/M got that (too) from RT-11 which did it at initialization and got
it from .... ::-{})
Jim Thomas
toddpw@tybalt.caltech.edu (Todd P. Whitesel) (10/19/89)
This isn't a me too, but it is an early and (once) widespread piece of code. The original Apple // DOS 3.3 (and its predecessors) came on a "DOS System Master" diskette with a bunch of nifty utilities and demos. When you booted a "Master" diskette (you could make one with a utility on the System Master disk) it would load in from $1D00-3FFF (i.e. 16K) and figure out how much memory there was. Then it would copy itself up to the top of RAM, via a relocator that took about 1/2K on the disk (I don't know exactly how long it was). If you used the INIT command to format a disk, it would be a "Slave" diskette because the bootable DOS image was already relocated (INIT saved a copy of the DOS to disk when it was finished).

One pirating technique was to remove the top row of DRAMs and boot a master, forcing it to load in 32K. Then you INITed a disk with that DOS. After replacing the DRAMs you booted your favorite 48K copy-protected game (which almost always used a heavily-modified copy of DOS) and somehow got it to reboot the 32K slave disk. The slave thinks you only have 32K so it doesn't overwrite the copy-protected DOS above it in memory, and you could examine the patched DOS at your leisure or even save its routines to disk. (We had programs which used the copy-protected image in 48K and internal "standard routines" to copy a protected disk's files onto a normal disk; this was called "DeMuffin", a pun on the 13-to-16-sector converter on the System Master called Muffin.)

Ain't nostalgia great?

Todd Whitesel
toddpw @ tybalt.caltech.edu
davecb@yunexus.UUCP (David Collier-Brown) (10/19/89)
The ICL System 10 had an interesting boot (which used no self-modifying code whatsoever (:-)). Zero, as an instruction, was interpreted as

    opcode       operand              modifier
    ======       =======              ========
    disk_load    address_0,sector_0

and was generated when one cleared the screen (!) and pressed the execute-immediate button. Poof! The bootstrap appears.

This was less desirable when someone tried to execute data, of course, but did show a fine bit of subtlety in assigning opcodes.

--dave
--
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave
Willowdale, Ontario,  | Joyce C-B:
CANADA. 416-223-8968  |    He's so smart he's dumb.
herndon@umn-cs.CS.UMN.EDU (Robert Herndon) (10/20/89)
Another very common use for self-modifying code that I have not seen mentioned at all is bootstraps. These are typically a block of code copied into low/high memory at start-up time. On most machines, excepting the many workstations and micros that have the bootstrap in ROM, the bootstrap is usually loaded where the operating system is eventually supposed to go. The bootstrap, then, is usually forced to copy itself elsewhere before reading in the operating system. Since it typically doesn't know how big memory is, it often opts to use heuristics to find out, then copies itself to high/low memory, relocating itself in the process. [No pun intended.] If the machine supports position-independent code, this is an ideal place to use it. If it doesn't, one has to diddle instructions. Either way, the bootstrap has to know enough about itself to move itself somewhere and resume execution there (like some of the 'core-wars' creatures described in Scientific American).

The PDP-11 block 0 bootstrap worked this way with position-independent code, so it didn't have to diddle instructions. It was easy, however, to forget to strip the header (8 words) off. The original UNIX "magic number" is a tribute to this fact -- a 407 instruction [the original magic number] would jump over the header, and the bootstrap was off and running in spite of your omission.

Another option, seen on some machines with instruction caches, is to read the whole operating system into memory at some location not overlapping the bootstrap. At this point, all that has to be done is to move the OS to the right location in memory and jump to its start location. The code to do this is typically only a few instructions, and can fit into a/the instruction cache.

Robert Herndon
herndon@umn-cs.cs.umn.edu
--
Robert Herndon                     Dept. of Computer Science,
...!ihnp4!umn-cs!herndon           Univ. of Minnesota,
herndon@umn-cs.ARPA                136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA    Minneapolis, MN 55455
eob@cbnewsk.ATT.COM (eamonn.j.o'brien) (10/22/89)
In article <23062@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) writes:
> in general you
> have no way of knowing [program] address locations that your C
> compiler will create

... except for pointers to functions. The following is my attempt at self-modifying C code:

--------------------------------------------------
foo() { printf( "foo()\n" ); }
endfoo(){}

bar() { printf( "bar()\n" ); }
endbar(){}

main()
{
    char* foo_ptr = foo;
    char* endfoo_ptr = endfoo;
    char* bar_ptr = bar;
    char* endbar_ptr = endbar;

    printf("before:\n");
    foo();
    bar();

    printf("modifying:\n");
    for( ; bar_ptr < endbar_ptr; ++foo_ptr,++bar_ptr )
        *foo_ptr = *bar_ptr;

    printf("after:\n");
    foo();
    bar();
}
----------------------------------------------------

It calls foo() and bar(), overwrites foo() with bar(), and calls foo() and bar() again. However on my system (AT&T Unix PC -- 68000) it coredumps after printing "modifying:". Perhaps the program memory is read-only.
--
Eamonn O'Brien    ohm!eob    eob@ohm.att.com
~~~~~~~~~~~~~~
I'm not speaking for the company, I'm just losing my mind.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/23/89)
I was feeling smug about not having written any self-modifying code in years when I remembered that I HAD written a program which modifies the o/s in memory. It seems that people were posting patches for every version of DOS known to allow ^W and ^U to work as they do in UNIX. I wrote a small program to search the core image, find the location for the patch and the patch area, and apply the fix. This avoids having a hacked version of the o/s on a disk, which always worries people in these virus-ridden days.

It was called CTLENABL, and works on DOS 2.0-4.0. DOS users run it in their autoexec, usually. Now if DOS only included the source...
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon
honig@ics.uci.edu (David A. Honig) (10/24/89)
I've programmed a high-performance graphics system (an IMI 455) with techniques that exploit stored-program advantages. To program the thing, you write C programs on a Unix host which write programs into "display list memory". These aren't compilers per se; they compute (e.g., from physical models) what kind of display commands to give to the display generator, itself a microcoded CPU.

One way of using this thing for animation is to have the C program change the display generator's program as it is being read and interpreted by the display generator.

Why use this technique? It's actually simplifying, given the existing complex architecture needed to obtain the high performance. And it avoids some of the problems that self-referencing, self-modifying code has, since the C program doesn't modify itself.
--
David A Honig
meissner@dg-rtp.dg.com (Michael Meissner) (10/24/89)
In article <173@harald.UUCP> jba@harald.ruc.dk (Jan B. Andersen) writes:
> Data General's MV Family has an 'execute accumulator' instruction. Very
> useful when implementing a debugger.

The execute accumulator instruction is not as useful for implementing a debugger as you might think. First of all, it burns a register, and since the machine only has four integer registers, this can be deadly, particularly since some of the character move instructions use all four integer accumulators. Second of all, if the instruction takes more than one 16-bit word, the remaining words of the instruction are the 16-bit words following the XCT instruction, which isn't terribly useful.

About the only time I have ever seen the XCT used is for device drivers to build I/O instructions, with the appropriate device code inserted into the instruction. The language runtimes, when they wanted to build instructions on the fly, would build them on the stack (if running under AOS, AOS/VS or RDOS) or in low memory (if running under native UNIX), insert a long jump back to the next location, and jump to the location where the instructions were built.

One place where self-modifying code was used was in the general purpose SYS function under RDOS, AOS, and AOS/VS, which typically took four arguments: an integer giving the system call number, and pointers to three integer-sized items that were copied to/from three accumulators. Except for native UNIX, system calls consisted of a call instruction, followed by the system call number, error branch location, and normal branch location. The general SYS function would have to build such a call on the fly. Another place was generalized functions that executed a user-specified instruction.

In terms of debugger support, there is a BKPT instruction with which the debugger replaces the first 16 bits of the instruction. When a BKPT instruction is encountered, the registers are pushed, and the machine jumps to the user breakpoint handler (most traps on the MV do not go directly into the kernel, but jump to user trap handlers). The handler can then do whatever it wants, and it uses either the WPOPB instruction to return if the BKPT instruction has been removed, or the PBX instruction if the breakpoint is still active (Ac0 contains the first 16 bits of the instruction to execute).
--
Michael Meissner, Data General.         If compiles were much
Uucp: ...!mcnc!rti!xyzzy!meissner       faster, when would we
Internet: meissner@dg-rtp.DG.COM        have time for netnews?
yodaiken@freal.cs.umass.edu (victor yodaiken) (11/06/89)
In article <2630@gandalf.UUCP> carr@gandalf.UUCP (Dave Carr) writes:
>In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes:
...
>> What the advantage of using self-modifying code that non-self-modifying code
>> cannot achieve?
>
[stuff about 80186 i/o deleted]
>Another thing that pops to mind, is when your software may contain lots of
>options, and spends most of its time testing option flags and skipping
>around unused options. Why not have the software create the software
>without the options in! You would need the software in a run-time locatable
>format, but this is certainly feasible.

The Synthesis operating system, by Henry Massalin and Calton Pu of Columbia University, is an amazing example of this method. System calls and system operations which will be computed repeatedly are compiled on the fly. For example, when a process "opens" a file, the system generates code for reading and writing that particular file, eliminating several layers of switches. Context switch code is also compiled for each process. Basically, code generation is used to do constant propagation.

Interesting to see how this work might do on a machine with a deep pipeline. Probably OK, since compilation saves 100's of instruction executions. If there is interest, I'll try to dig up the reference.
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/09/89)
>Basically, it's a performance hack which can make a system much more
>useable on interactive performance. There are systems sensitive to an
>extra IF statement in their inner-loops, particularly when the
>condition check is expensive (a few instructions can be expensive in
>this context) and the inner loop can be executed millions of times per
>request. There's always a way to replace it, but it may not result in
>acceptable performance.

On a system I worked on, adding a conditional branch to the critical path for syscall added 5-10% (correct me if I'm wrong, Wayne) to minimal syscall timings. Maybe not much overall, but annoying. (This was a "semi-static" test, which we got around by making up a separate head and changing the syscall vector.)

We have several times talked about self-modifying code on the functional level. Perhaps it is just a question of notation - perhaps we just need a "statement" data type - or a question of optimization - perhaps a good compiler should generate self-modifying code in some of the hairy situations we have talked about in this thread, given high-level code that is not self-modifying. Consider Barry's example:

[B]
    ...predecessor code...
    L: {IF request_is_flagged THEN do_flag_action}
    ...successor code...

Note that I have put curly braces {} around the statement to indicate the range to which the label L applies. In a high-level notation, the self-modification that Barry described - removing the IF at the code that changes the assertion - may be described as:

[S]
    ...Elsewhere, conditions have changed,
    ...so now the test does not need to be performed,
    ... ie. !request_is_flagged:
        L := skip    /* Dijkstra's high-level nop */
    ...
    ...Elsewhere, conditions have changed,
    ... ie. request_is_flagged:
        L := do_flag_action;
    ...

This involves assignment to a variable of type "statement". In a functional notation you might do:

[F]
    L: pointer to function
    L_skip() {}
    L_do_flag_action() {
        ...
    }
    ...predecessor code:
        (*L)();
    ...successor code...

where the code that changes the request_is_flagged status changes the function pointer: L := &L_skip, L := &L_do_flag_action. In a decent language you could declare the function pointer and the values it assumes at the point at which it is used. (Yeah, great liberties with notation...)

Obviously [S] is equivalent to [F], although more/less convenient. To begin with, [F] can be implemented in the obvious way, through an indirect function call/return:

[C]
    L_skip: return
    L_do_flag_action: ... return
    ...predecessor code:
        call indirect through L
    ...successor code...

The first level of optimization is to notice that L_skip and L_do_flag_action always return to the same place, reducing the function-calling overhead to an indirect branch:

[G]
    L_skip: goto L'
    L_do_flag_action: ... goto L'
    ...predecessor code:
        goto indirect L
    L': /* return place */
    ...successor code...

(By the way, this is an optimization that would be nice in any system that makes calls through arrays of pointers to functions: UNIX syscalls (sysent[]), file system calls (NFS, the System V file system switch), device driver calls ([bc]devsw[]). 99% of these functions are called in one and only one place.)

The next level of optimization is to inline the code with an IF - getting us back to the explicitly hand-coded version:

[IF]
    ...predecessor code:
        IF L == L_skip THEN skip ELSE do_flag_action FI
    ...successor code...

(Of course, a good programmer might have done this manually: instead of indicating request_was_flagged in a complicated data structure, she might have encoded it in a single variable, which she effectively did with the function pointer.
So, the complexity of the test doesn't concern us - just the overhead of the IF or call.)

The next step is to actually do self-modifying code on the "micro" scale:

[SM]
    ...predecessor code
             L_skip:   L_do_flag_action:
    L1: ___  nop       do_1
    L2: ___  nop       do_2
    L3: ___  nop       do_3
    ...successor code

    ...Elsewhere, conditions have changed,
    ...so now the test does not need to be performed,
    ... ie. !request_is_flagged:
        L1 := nop
        L2 := nop
        L3 := nop
        ...flush cache or whatever else is necessary
    ...
    ...Elsewhere, conditions have changed,
    ... ie. request_is_flagged:
        L1 := do_1
        L2 := do_2
        L3 := do_3
        ...flush cache or whatever else is necessary
    ...

Optimizations based on the code body might be desirable:

    do_action_1:   do_action_2:
    step1          step1
    step2.1        step2.2
    step3          step3
    step4.1        step4.2
    step5          step5

Instead of 5 instructions changing, only 2 need be changed. Alternatively, rather than micro self-modifying code, a compiler might take [IF] and attempt to broaden the domain of the code affected. For example, if [IF] were called within the body of another indirectly called function that assumed values A and B, two copies of the bodies of each of A and B might be made...

Okay, enough time wasted. I think I have several points winding through all this:

(1) Self-modifying code at the "function" level is easily expressible (somebody was trying to make a standard library out of this (I was supposed to help, sorry)).

(2) "Micro" self-modifying code is easily expressible (a) with an augmented language or (b) with function pointers.

(3) It would not be unreasonable for an optimizing compiler to make the sort of optimizations described above automatically, achieving the performance advantages of "micro" self-modifying code from a high-level representation.

But is it worthwhile spending the effort on a compiler that can do this sort of thing? Probably not... the situations where "micro" self-modifying code is really necessary are few and far between. Force the implementors to do it by hand...
Not commercially important. Might be fun for some academic research... -- Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com 1101 E. University, Urbana, IL 61801, USA. {uunet!,}uiucuxc!udc!aglew My opinions are my own; I indicate my company only so that the reader may account for any possible bias I may have towards our products.
bjornl@tds.kth.se (Bj|rn Lisper) (11/13/89)
In article <1989Nov6.172043.10373@world.std.com> bzs@world.std.com (Barry Shein) writes:
%A common, general usage of self-modifying code is seen in interpreter
%(not necessarily, but often, programming languages) loops. Suppose you
%have the ability to trace or otherwise flag certain events. This is
%typically done by a check at the top of the loop:
%	IF request_is_flagged THEN do_flag_action
%The check for "request_is_flagged" can be quite expensive (say,
%hashing into a symbol table to see if a bit is set) and interpreters
%might have to go through this loop for every minor operation (eg. add
%two numbers.) If there are no items flagged or only certain subsets
%need to be checked one way to speed up this loop is to simply modify
%the top-level code so the check isn't done or loops back to a
%different check. ...

Some time ago I made a posting about the connection between partial evaluation and self-modifying code. (Partial evaluation is simplifying functions, or code, when the input is partially known. Self-modifying code can be seen as partial evaluation at runtime.) The above was an instance I thought of when writing that posting. The rule

	IF true THEN S  =>  S

is used to simplify conditionals where the condition is known. This can be done either at compile-time, if the condition is known then, or at run-time, when the condition becomes known (and we know it will not change value).

Actually, I think partial evaluation would provide quite a clean theory for an interesting class of self-modifying programs.

Bjorn Lisper
raulmill@usc.edu (Raul Deluth Rockwell) (11/15/89)
The original message of this thread seems to have expired, so I may be way off base, but here is what appears to be a fragment of the question:

In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes:
> Could any experts out there educate me WHY and HOW does self-modifying code use?
> ...
> What the advantage of using self-modifying code that non-self-modifying code
> cannot achieve?

In short, incremental compilers, linking loaders, debuggers, and similar system tools use self-modifying code. It is quite possible to have a situation where rapid allocation and deallocation of memory for code runs into cache consistency problems (stale instructions left in the cache after the memory is reused). Of course, with classic environments (unix & c, for example), this isn't much of a problem, or it can be hacked around by flushing the entire cache. But to exclude fast/tight loader design because somebody decided that that is a 'poor programming practice' seems rather silly.
--