preston@ariel.rice.edu (Preston Briggs) (02/16/91)
>pmontgom@euphemia.math.ucla.edu (Peter Montgomery):
>> The requested operation returns q and/or r, where
>>
>> a*b + c = q*n + r and 0 <= r < n

jlg@lanl.gov (Jim Giles) writes:
>You will, unfortunately, receive little sympathy for this kind of
>request.  The UNIX community, in particular, will accuse you of
>requesting a 'swiss army knife' compiler.

He has my sympathy; it's free.
However, I won't spend much time inventing special syntax
or optimizations for specific problems.  There are, however,
many people working on many ways of expressing and implementing
extensible languages.  "They" don't belong to a special club;
everyone is free to join.  It's very simple and fairly popular:

    while (dissatisfied) {
        Design a language (perhaps an extension of another language)

        Implement your language (perhaps hacking an existing compiler
        or building an interpreter).

        Write many programs in your language to gain experience
        with its limitations.  Try and get others to use it.
    }

Some people repeat the process many times (e.g., Wirth).
Other people have problems they want solved and don't
wish to get pulled into an infinite loop.  They can keep
complaining; but I don't expect to see any results.
Language designers aren't satisfied with existing languages
(or they'd be out of the loop); they're busy - designing,
implementing, testing.  They've got their own agenda.  You
might be able to sneak your pet idea onto someone's agenda,
but it's probably expensive.

If you want it done right, ...
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/16/91)
In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> >pmontgom@euphemia.math.ucla.edu (Peter Montgomery):
> >> The requested operation returns q and/or r, where
> >>
> >> a*b + c = q*n + r and 0 <= r < n
>
> jlg@lanl.gov (Jim Giles) writes:
> >You will, unfortunately, receive little sympathy for this kind of
> >request.  The UNIX community, in particular, will accuse you of
> >requesting a 'swiss army knife' compiler.
>
> He has my sympathy; it's free.
> However, I won't spend much time inventing special syntax
> or optimizations for specific problems.  There are, however,
> many people working on many ways of expressing and implementing
> extensible languages.  "They" don't belong to a special club;
> everyone is free to join.  It's very simple and fairly popular:
>
>     while (dissatisfied) {
>         Design a language (perhaps an extension of another language)
>
>         Implement your language (perhaps hacking an existing compiler
>         or building an interpreter).
>
>         Write many programs in your language to gain experience
>         with its limitations.  Try and get others to use it.
>     }

This assumes that the one who needs the operations, or the
improved performance, has the resources and time to do this.
Montgomery and Silverman are number theorists; I am a professor
of Statistics and Mathematics.  We are expected to do other
things.

Mathematicians are used to the idea of an extensible language for
communicating with reasonably intelligent beings.  A computer is
not an electronic "brain"; it is an extremely fast sub-imbecile.
Someone remarked elsewhere on the complexity of the code in
Fortran to handle x**y; there is necessarily separate code for
each combination of types of x and y, and even code taking into
account particular values of x and y.

Producing translators of this magnitude is a major project, and
there are no adequate macro translators even available.  I believe
that such a macro translator could be produced, and with that it
would be easy to use machine language, rather than the overly
restrictive assembler languages designed essentially so as to be
easy for the machine to read.  What is needed in addition is a
language which allows the introduction of operator syntax
according to the user's desires.  The above specification

> >> a*b + c = q*n + r and 0 <= r < n

is understood by essentially any mathematician, and it should not
be necessary to do more than provide a template to convert this
into code.

Of course, that does nothing to get the operation into hardware.
The RS/6000 has a*b+c in floating point, using double-double
arithmetic internally but not reusably.  One of the reasons behind
putting this into hardware is that it is relatively easy, and it
would be even easier for integer arithmetic.  From an architecture
standpoint, to a mathematician there is not that much difference
between integer and floating point, floating point being the more
complicated.

> Some people repeat the process many times (e.g., Wirth).
> Other people have problems they want solved and don't
> wish to get pulled into an infinite loop.  They can keep
> complaining; but I don't expect to see any results.
> Language designers aren't satisfied with existing languages
> (or they'd be out of the loop); they're busy - designing,
> implementing, testing.  They've got their own agenda.  You
> might be able to sneak your pet idea onto someone's agenda,
> but it's probably expensive.
>
> If you want it done right, ...

Wirth is being employed to design and develop languages and
related things.
In addition, he has available for this development somewhere
between a squad and a platoon of assistants.  This is extremely
important; it is difficult to catch one's own minor errors, even
in something as flexible as English, designed to be read by an
intelligent being.  Communicating with a machine is much harder.
There is also the need for people to do the routine work.  For
Montgomery or Silverman or me to produce the language would be
more than asking an individual automotive designer to do
everything from idea to prototype without any assistance whatever.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
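(As a concrete rendering of the template above: a minimal C
sketch, assuming 32-bit long, a 64-bit intermediate type such as
gcc's "long long", and a quotient that fits in 32 bits.  The name
muldivmod is made up for illustration.)

    /* Return q and r with a*b + c == q*n + r and 0 <= r < n,
       for non-negative operands and n > 0. */
    void muldivmod(unsigned long a, unsigned long b, unsigned long c,
                   unsigned long n, unsigned long *q, unsigned long *r)
    {
        unsigned long long t = (unsigned long long)a * b + c; /* exact a*b + c */

        *q = (unsigned long)(t / n);   /* one wide divide gives both... */
        *r = (unsigned long)(t % n);   /* ...quotient and remainder */
    }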
chip@tct.uucp (Chip Salzenberg) (02/17/91)
[ Followups to comp.misc ]

According to hrubin@pop.stat.purdue.edu (Herman Rubin):
>Montgomery and Silverman are number theorists, I am a professor
>of Statistics and Mathematics.  We are expected to do other
>things.

I'm a programmer, but that doesn't mean I've got all the time I
want to do projects for J. Random Person.  I'm expected to do
other things, too.

>I believe that such a macro translator could be produced ...

A spec, Herman.  Surely you can squeeze a few hours out of your
busy schedule to produce a spec.  Or is your desired language a
mystery even to you?
--
Chip Salzenberg at Teltronics/TCT     <chip@tct.uucp>, <uunet!pdn!tct!chip>
   "I want to mention that my opinions whether real or not are MY opinions."
             -- the inevitable William "Billy" Steinmetz
dmocsny@minerva.che.uc.edu (Daniel Mocsny) (02/18/91)
In article <6049@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>This assumes that the one who needs the operations, or the
>improved performance, has the resources and time to do this.
>Montgomery and Silverman are number theorists, I am a professor
>of Statistics and Mathematics.  We are expected to do other
>things.

This is a problem I have grappled with at virtually every stage of
my own work: I see all these interesting and compelling tools the
computer people create for their own work, or their own idea of a
market.  Needless to say, if one has anything resembling a
specialized need, one finds oneself shunted off the mainstream
into a sort of computer orphanage.

What are scientists and mathematicians "expected to do"?
Historically, that answer has been: "Whatever it takes".
Scientific researchers have a venerable tradition of inventing all
manner of experimental and computational devices to support their
work.  In the "good old days", when a scientist had a challenging
problem to solve, often the response was to design and build
machines to assist.

The decline of this tradition speaks to the success of
general-purpose computers.  Today computers are so cheap and
effective that scientists find they usually can't afford to try to
build anything better suited to their problem.  (This is almost
always true for hardware, and it is often true for software.)
What's more, general-purpose computers are improving so rapidly
that even machines designed specifically to solve specialized
problems are in great danger of rapid obsolescence.  If a research
team decides to design a specialized computational device, they
had better get cracking, because the general-purpose performance
they are trying to beat is a rapidly moving target.  A
factor-of-ten improvement buys you about 3--5 years.  If you sink
1--4 years into building the specialized box, you may have wasted
your time.

If I may muse a bit---I see much hope in the recent trend toward
object-oriented languages.  These are inherently extensible,
making them nice substrates for specialized development.  Each of
the scientific disciplines may well evolve its own standard set of
common objects, methods, operators, etc., much as they have always
evolved their own nomenclature, curricula, and so on.  Assuming,
of course, that busy specialists will find time to detour into all
the intellectual overhead involved in mastering a second
discipline (which is what learning a language well enough to
extend it represents).

Unfortunately, piling all this makeup on top of uncooperative
hardware may give us a performance hit.  However, at some point
the computer industry will find ways to let users cooperate in the
design of their machines.  Optimizing compilers are, in fact, a
first step in this direction.  The optimizing compiler permits the
computer to adapt itself to the user's "paradigm".  I.e., the user
thinks in a language containing many constructs and instructions
that do not directly tell the computer how best to solve the
problem at hand.  The correct response is not to tell the user to
think like the computer, but to teach the computer to think like
the user.

Eventually we must extend the notion of optimization to the
hardware as well as the software.  Computers need some sort of
hardware flexibility permitting them to shift resources in
response to user demands.
I.e., instead of collecting average statistics on instruction-set
usage, and then imposing a one-size-fits-all cast-in-stone
"solution" on everyone, why can't the computer have flexibility in
deciding how it should execute instructions?

Buying a computer should be like hiring an employee: the longer
that employee stays on your payroll, the more productive (s)he
becomes.  Computers, however, turn in pretty much the same
benchmarks the first time you run your problem as they will after
the 1,000,000th run.  By the time the computer has served you for
several years, it has had ample time to compile detailed
statistics on the instruction mix that *you* want.
--
Dan Mocsny
Internet: dmocsny@minerva.che.uc.edu
preston@ariel.rice.edu (Preston Briggs) (02/18/91)
hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>>This assumes that the one who needs the operations, or the
>>improved performance, has the resources and time to do this.
>>Montgomery and Silverman are number theorists, I am a professor
>>of Statistics and Mathematics.  We are expected to do other
>>things.

dmocsny@minerva.che.uc.edu (Daniel Mocsny) writes:
>This is a problem I have grappled with at virtually every stage
>of my own work: I see all these interesting and compelling tools
>the computer people create for their own work, or their own idea
>of a market.  Needless to say, if one has anything resembling a
>specialized need, one finds oneself shunted off the mainstream
>into a sort of computer orphanage.

I was pessimistic in an earlier posting; here's an optimistic
view.  One possibility is to try to lure computer scientists into
your area with promises of interesting new problems to work on.
That's worked well on some prominent people I've met.

One example is Eugene Myers, who's done significant work in
algorithms and optimization.  Recently though, he's been working
on gene sequencing, etc., supporting the Human Genome project.  (I
should point out that "recently" is several years.  These are
long-term commitments.)

A second example is Guy Steele.  He has said that he envisions
himself as a guy who builds tools to support AI research.

---

I hope I haven't misrepresented these people.  I respect them both
for vision, energy, and accomplishments.

Preston Briggs
herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/23/91)
In article <6049@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
[i may have edited away all of preston's contributions to this line]
>>
>> jlg@lanl.gov (Jim Giles) writes:
>> many people working on many ways of expressing and implementing
>> extensible languages.  "They" don't belong to a special club;
>> everyone is free to join.  It's very simple and fairly popular:
>>
>> while (dissatisfied) {
>>     Design a language (perhaps an extension of another language)
>>
>>     Implement your language (perhaps hacking an existing compiler
>>     or building an interpreter).
>>
>>     Write many programs in your language to gain experience
>>     with its limitations.  Try and get others to use it.
>> }
>
> This assumes that the one who needs the operations, or the
> improved performance, has the resources and time to do this.
> Montgomery and Silverman are number theorists, I am a professor
> of Statistics and Mathematics.  We are expected to do other
> things.
[much more edited away, this is only to remind of what went before]

You are faculty at one of the two schools with the oldest
departments of Computer Science in the country.  Go to your
colleagues in CS and tell them you have a design problem that
seems to be about the right size for a PhD.  If you can get them
to cobble it together on gcc (or g++) you can then profit from any
improvements made later in the gnu code generator.  (And also
portability to future computers.)

What has frosted me about the languages is that the arithmetic
operators are all single valued.  I usually want the quotient and
remainder from a division.  I have to tell the compiler to give me
the quotient and then tell it to give me the remainder.  It might
be smart enough to notice that it gets both of them in one machine
operation, but if it is, it is only undoing bad language design.

dan herrick
herrickd@iccgcc.decnet.ab.com
ph@ama-1.ama.caltech.edu (Paul Hardy) (02/23/91)
In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
> What has frosted me about the languages is that the arithmetic
> operators are all single valued.  I usually want the quotient
> and remainder from a division.  I have to tell the compiler to
> give me the quotient and then tell it to give me the remainder.

This is annoying, but it's one thing an optimizer should look for,
since it happens a lot in multi-precision code, so that it won't
cost much extra time.  Alternatively there's Forth, a stack-based
language, which will leave a quotient and remainder on the stack
after a divide.  And there's always assembly language....

--Paul
--
This is my address:         ph@ama.caltech.edu
This is UUCP:               ...!{decwrl,uunet}!
This is my address on UUCP: ...!{decwrl,uunet}!caltech.edu!ama!ph
Any questions?
"Does Emacs have the Buddha nature?" --Paul Hardy    "Yow!" --Zippy
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/24/91)
In article <3381.27c548c3@iccgcc.decnet.ab.com>, herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
> In article <6049@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > In article <1991Feb15.192653.9846@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> [i may have edited away all of preston's contributions to this line]
> >> jlg@lanl.gov (Jim Giles) writes:
> >> many people working on many ways of expressing and implementing
> >> extensible languages.  "They" don't belong to a special club;
> >> everyone is free to join.  It's very simple and fairly popular:
.......................
> > This assumes that the one who needs the operations, or the
> > improved performance, has the resources and time to do this.
> > Montgomery and Silverman are number theorists, I am a professor
> > of Statistics and Mathematics.  We are expected to do other
> > things.
> [much more edited away, this is only to remind of what went before]
> You are faculty at one of the two schools with the oldest departments
> of Computer Science in the country.  Go to your colleagues in CS and
> tell them you have a design problem that seems to be about the right
> size for a PhD.

You are almost right about its size; something of this complexity
needs at least two people working on it to avoid first-class
goofs.  But it is not appropriate for a PhD thesis.  It is a
contribution, but not particularly research, which is the
criterion for a thesis and for tenure.  The current pressure on
students in academia is to avoid whatever is not either required
or directly needed for their thesis research, and to
overconcentrate.

The project can be done by a faculty member whose academic
reputation is essentially assured, with the help of a graduate
student or two.  Someone like Wirth can get the funding and the
students.  Someone like me has a small chance to get the funding,
and real difficulty in getting the students.

> If you can get them to cobble it together on gcc (or g++) you can
> then profit from any improvements made later in the gnu code generator.
> (And also portability to future computers.)
>
> What has frosted me about the languages is that the arithmetic
> operators are all single valued.  I usually want the quotient
> and remainder from a division.  I have to tell the compiler to
> give me the quotient and then tell it to give me the remainder.
> It might be smart enough to notice that it gets both of them in
> one machine operation, but if it is, it is only undoing bad language
> design.

This WAS obvious when languages were first started.  Fortran had a
limited purpose, but not Algol.  It is syntactically more
difficult, but a purpose of a language is to make it easier for
the user to use the hardware, and the hardware could do this on
the great bulk of machines then.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
bs@faron.mitre.org (Robert D. Silverman) (02/25/91)
In article <PH.91Feb22180807@ama-1.ama.caltech.edu> ph@ama-1.ama.caltech.edu (Paul Hardy) writes:
:In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
:
:> What has frosted me about the languages is that the arithmetic
:> operators are all single valued.  I usually want the quotient
:> and remainder from a division.  I have to tell the compiler to
:> give me the quotient and then tell it to give me the remainder.
:
:This is annoying, but it's one thing an optimizer should look for, since it
:happens a lot in multi-precision code, so that it won't cost much extra time.
:Alternatively there's Forth, a stack-based language, which will leave a
:quotient and remainder on the stack after a divide.  And there's always
:assembly language....

Yes.  There always is assembler.  However, the cost of calling an
assembler routine to do a division returning both quotient and
remainder is very expensive.  In-lining mechanisms for assembler
code are woefully inadequate in current languages.  One can, of
course, look at the intermediate assembler code generated by the
compiler of a HLL and then adjust your assembler code so that
register usage is correct, but this is very clumsy and must be
redone every time you make a change to the HLL code.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
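(For reference, gcc's extended asm -- the facility defended in the
follow-up below -- can inline such a divide and let the compiler
manage the registers around it.  A hedged sketch for the 386; the
function name is made up, and this is not from the original post.)

    /* One divl yields both results: the quotient in eax and the
       remainder in edx.  Assumes gcc on a 386. */
    static inline unsigned udivrem(unsigned num, unsigned den, unsigned *rem)
    {
        unsigned q, r;

        /* divl divides edx:eax by its operand; we zero edx first. */
        __asm__("divl %4"
                : "=a" (q), "=d" (r)
                : "0" (num), "1" (0), "rm" (den));
        *rem = r;
        return q;
    }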
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/26/91)
In article <1991Feb25.135057.23667@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes:
>Yes. There always is assembler. However, the cost of calling an assembler
>routine to do a division returning both quotient and remainder is very
>expensive. In-lining mechanisms for assembler code are woefully inadequate
>in current languages. One can, of course, look at the intermediate assembler
>code generated by the compiler of a HLL and then adjust your assembler code
>so that register usage is correct, but this is very clumsy and must be redone
>every time you make a change to the HLL code.

I find it hard to criticize the facilities for inlining assembler
code (not that I would ever do it myself -- well, hardly ever)
provided by modern compilers, e.g. gcc, as outlined in earlier
posts to this group and elsewhere.  At least some modern compilers
would seem to provide the facilities for including such code in
the (sometimes extensive) dataflow analysis they already perform
for the HLL code -- even to the extent of allocating available
registers for your use in the inline assembler code.  It would
therefore seem low-level tweaking as described above is at least
on the way out, if not already `passed over'.

But on the other hand -- why put assembler into a program at all?
Although I understand there _are_ a (very few) application areas
where even a 10% reduction in running time represents big bucks, I
find it hard to accept that the ability to return the quotient and
remainder, e.g., at the same time will make even _that_ much
difference to code that already would almost certainly include
more time-intensive instructions in the mix.

For example, suppose we compare two tight loops, one with separate
divide & modulus/remainder (whichever you need), the other with
some combined code.  Suppose there are n other instructions in
each loop.  Let's discard pipe flushing considerations and further
assume all instructions take a single cycle.

We have, in loop I say, n+3 instructions (let's even allow another
move instruction to get at either the quotient or remainder, as
they may have ended up in inconvenient places) and in loop II n+1
instructions.

Iterate both loops N times.  Loop I takes N(n+3) cycles; loop II
takes N(n+1) cycles -- the ratio is obviously (n+3)/(n+1).  This
will exceed a 10% difference if n < 19.

So, provided there are no more than about 20 instructions in the
loop in this obviously _idealized_ circumstance, the body of the
loop will run 10% or more faster given the fancy combined quot/rem
facility.  Considering our loop is only _part_ of a larger
program, the impact on the total running time of the complete code
will be even less marked.

But perhaps typical loops, esp. those containing divide and
remainder instructions, are small?  Well, after looking through my
source code I don't happen to find _any_ loops with both divide &
remainder in them -- I guess I tend to avoid it (blush).  However,
below is an _example_ that I will not say is _typical_, but is at
least indicative of the kind of code I'm looking for (it's an
inner loop from an FFT routine).  With -O, a Sun3/60 cc produces
the code following.  Strangely, there are about 20 instructions in
the loop.
------
    for(xp=x,yp=y,zp=z; yp<z; ++xp,++yp) {
        if(*xp==M-1) *zp++ = M-*yp;
        else if(*yp==M-1) *zp++ = M-*xp;
        else *zp++ = *xp**yp % M;
    }

L77003:
        cmpl    #1000,a0@
        jne     L77005
        movl    #1001,d0
        subl    a1@,d0
        movl    d0,a5@
        jra     LY00000
L77005:
        cmpl    #1000,a1@
        jne     L77007
        movl    #1001,d0
        subl    a0@,d0
        movl    d0,a5@
        jra     LY00000
L77007:
        movl    a0@,d0
        mulsl   a1@,d0
        divsll  #1001,d1:d0
        movl    d1,a5@
LY00000:
        addqw   #4,a5
        addqw   #4,a0
        addqw   #4,a1
        cmpl    a6@(-12),a1
        jcs     L77003
---

To summarize: I don't think provision of a combined
divide/remainder (or divide/modulus for that matter) instruction
will necessarily speed up the total running time of any real
programs appreciably (i.e. more than, say, 10%).  Perhaps Bob
Silverman could illustrate some circumstance which contradicts
this?

-kym
===
No sig is a good sig (this isn't one).
mash@mips.COM (John Mashey) (02/26/91)
In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
....
>What has frosted me about the languages is that the arithmetic
>operators are all single valued.  I usually want the quotient
>and remainder from a division.  I have to tell the compiler to
>give me the quotient and then tell it to give me the remainder.
>It might be smart enough to notice that it gets both of them in
>one machine operation, but if it is, it is only undoing bad language
>design.

I have no idea if it does it in all cases that would make sense.
However, at least the current MIPS compilers and assembler, if
faced with:

    i = j / k;
    l = j % k;

conspire to execute a single division instruction.  (To see this,
you have to look not at the generated .s file, which has two divs,
but at the dis-assembly of the .o or a.out, which shows only 1...)
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rex@cs.su.oz (Rex Di Bona) (02/26/91)
In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[ Inlining HLL calls to assembly routines ]
> For example, suppose we compare two tight loops, one with separate divide &
> modulus/remainder (whichever you need); the other with some combined code.
> Suppose there are n other instructions in each loop.  Let's discard pipe
> flushing considerations and further assume all instructions take a single
> cycle.

Single cycle?  For a divide instruction?  On what machine?  Let's
take a real example here instead (say the 68020, or the R3000, as
I have the info just behind me here; ugh, there... let's see, the
68020..., divide, here we are):

Divide unsigned ranges from 76 cycles (best case, register to
register) to a massive 78+24 = 102 cycles for the worst case, a
divide with the source program counter memory indirect pre-indexed
(PCMIP).  For signed divisions we add 12 cycles, so the best case
is 88, rising to a worst case of 114 cycles.  Now our move takes
(best case) 0 cycles (it is totally overlapped with other
instructions), up to a worst case of 3+49 = 52 cycles (the 49 is
the PCMIP, to PCMIP addressing mode).

> We have, in loop I say, n+3 instructions (let's even allow another move
> instruction to get at either the quotient or remainder as they may have ended
> up in inconvenient places) and in loop II n+1 instructions.
>
> Iterate both loops N times.  Loop I takes N(n+3) cycles; loop II takes N(n+1)
> cycles -- the ratio is obviously (n+3)/(n+1).  This will exceed a 10%
> difference if n < 19.

Back in 68020 land...  For a move of a register to a register we
can assume that it is overlapped, as we don't need the result in
the next instruction, for a cost of 0 cycles.  (Add more if we
move the intermediate to memory...)

For unsigned division, the two-divide loop is more than 10% slower
than the combined loop, i.e.

    (n*Ave + 2*76)/(n*Ave + 76) > 1.1

where Ave is the average cycles per instruction, when:

    Best case (76 cycles):
        average cycles = 1,     n < 685
        average cycles = 2,     n < 343
        average cycles = 3,     n < 229
        average cycles = 4,     n < 172
        average cycles = 10,    n < 69
        average cycles = 20,    n < 35

    Worst case (102 cycles):
        average cycles = 1,     n < 919
        average cycles = 2,     n < 460
        average cycles = 3,     n < 307
        average cycles = 4,     n < 230
        average cycles = 10,    n < 92
        average cycles = 20,    n < 46

If we look at the example program given in the article, we have
SIGNED division, so...

    Best case (88 cycles):
        average cycles = 1,     n < 793
        average cycles = 2,     n < 397
        average cycles = 3,     n < 265
        average cycles = 4,     n < 199
        average cycles = 10,    n < 80
        average cycles = 20,    n < 40

    Worst case (114 cycles):
        average cycles = 1,     n < 1027
        average cycles = 2,     n < 514
        average cycles = 3,     n < 343
        average cycles = 4,     n < 257
        average cycles = 10,    n < 103
        average cycles = 20,    n < 52

> So, provided there are no more than about 20 instructions in the loop in this
> obviously _idealized_ circumstance, the body of the loop will run 10% or
> more faster given the fancy combined quot/rem facility.  Considering our loop
> is only _part_ of a larger program the impact on the total running time of
> the complete code will be even less marked given the facility.

Yes, but it is usually the case that cycle times for other, more
normally executed instructions are _far_ less than those for
divide instructions.
In the above examples we find that we still require a large number
of other instructions before we arrive at the point where the
overhead is lost in the rest of the loop.  I would assume (rough
guess here) that 10 cycles per 68020 instruction is about right,
taking into account cache misses and such, perhaps less.  But the
lower the average cycle count for those other instructions, the
better it is to use the one instruction to do both the div and
remainder...

Considering the R3000, it becomes easier (slightly); we have the
following...  divide takes, hmm, let me see -- oops, the Kane book
doesn't seem to mention the time taken for a divide -- ok then,
let's assume it is 19 cycles (the time for an FP divide), and also
try it at 12 cycles (a lower bound, the time for a multiply).  We
get (assuming 1.2 cycles per instruction):

    19 cycles for a divide:  n < 143
    12 cycles for a divide:  n < 91

The 1.2 figure comes from a hazy recollection of a talk by John
Mashey, where he was going on about why it wasn't really 1.00
cycles per instruction, and how they (MIPS) wanted the cycles per
instruction to go below 1.00 :-)

> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%).  Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?
>
> -kym
> ===
> No sig is a good sig (this isn't one).

I hope that the above numbers might convince you that having the
one instruction generate both is better in those pesky real world
situations, using real world machines.  I doubt (oops, rash
prediction here) that a division will become as cheap as a move
instruction, unless something changes (radically) the speed of the
processing unit without changing the speed of memory (in which
case we'll be back at super CISC anyway).
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral
herrickd@iccgcc.decnet.ab.com (daniel lance herrick) (02/27/91)
In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[long and eloquent demonstration that one less divide instruction in
the loop will have no significant effect on computation time]
> To summarize: I don't think provision of a combined divide/remainder
> (or divide/modulus for that matter) instruction will necessarily
> speed up the total running time of any real programs appreciably
> (i.e. more than, say, 10%).  Perhaps Bob Silverman could illustrate
> some circumstance which contradicts this?

But I still have to obfuscate the code with

    q = dividend / divisor
    r = dividend % divisor    /* even the operator is bad */

when what I want is

    (q, r) = dividend divided by divisor

Division is defined as producing a unique quotient and remainder
in the first abstract algebra course.  Every hardware divide
instruction I have ever seen produces both quotient and remainder
in one operation (nothing else makes any sense).  Even the 709
prototype for FORTRAN provided both results.

My main objection is to the obfuscation of the expression of an
algorithm in the programming language.  This is a place where the
computer architects have done the right thing and the software
architects have all ignored this right thing and done a wrong
thing.

Of course, we all know that the world is floating point; the
perpetrators of FORTRAN told us that that is what is REAL.

dan herrick
eschew obfuscation
herrickd@iccgcc.decnet.ab.com
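(In C as it stands, the closest one gets to that (q, r) is
returning a two-member struct -- a hedged sketch with made-up
names, not from the post.  A compiler that fuses / and %, as the
MIPS compilers described below do, pays for only one divide.)

    struct qr { long q, r; };

    static struct qr divrem(long dividend, long divisor)
    {
        struct qr x;

        x.q = dividend / divisor;
        x.r = dividend % divisor;   /* a good compiler reuses the divide */
        return x;
    }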
bs@linus.mitre.org (Robert D. Silverman) (02/27/91)
In article <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:To summarize: I don't think provision of a combined divide/remainder
:(or divide/modulus for that matter) instruction will necessarily
:speed up the total running time of any real programs appreciably
:(i.e. more than, say, 10%). Perhaps Bob Silverman could illustrate
:some circumstance which contradicts this?
You miss the point. There ARE many processors that provide double
length multiply and divide with remainder instructions. Most current
HLL's have no way of generating code that will access these instructions.
Furthermore, there ARE codes which can show much more than a 10% gain
by having such instructions. Most are number-theoretic or cryptographic
or group-theoretic in nature.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/27/91)
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) wrote:
>But on the other hand -- why put assembler into a program at all?
The people who really love gcc's assembler inlining features are
OS writers. I suspect that inline assembler spl?() does wonderful
things to some Unix kernel code, as does access to test-and-set
or other atomic instructions in other kernels.
But it's also useful if I just happen to acquire some desperate
need to use the VAX's edit instruction.
--
-Colin
henry@zoo.toronto.edu (Henry Spencer) (02/27/91)
In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
>You miss the point. There ARE many processors that provide double
>length multiply and divide with remainder instructions. Most current
>HLL's have no way of generating code that will access these instructions.

Nonsense.  You mean "most current compilers do not recognize
situations where such instructions could be generated".  In most
any HLL you can write something like

    x = a / b
    y = a % b

which amply suffices as a way of requesting use of such
instructions.  Blame the compiler, not the language, if the
generated code doesn't use the hardware properly.
--
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris | henry@zoo.toronto.edu   utzoo!henry
bs@faron.mitre.org (Robert D. Silverman) (02/27/91)
In article <1991Feb26.190242.18983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
>>You miss the point. There ARE many processors that provide double
>>length multiply and divide with remainder instructions. Most current
>>HLL's have no way of generating code that will access these instructions.
>
>Nonsense.  You mean "most current compilers do not recognize situations
>where such instructions could be generated".  In most any HLL you can
>write something like
>
>    x = a / b
>    y = a % b
>
>which amply suffices as a way of requesting use of such instructions.
>Blame the compiler, not the language, if the generated code doesn't use
>the hardware properly.

Bull.  Both the compiler and the language are to blame.

One should have some way of specifying, for example, that c = a*b
is to be a full 32x32 multiply, returning 64 bits (and generating
the appropriate instruction or calling a routine to do it).  One
might also want c = a*b to return only the low 32 bits, since the
latter operation is a lot faster.  What is needed is a facility in
the LANGUAGE ITSELF for specifying which type of multiply to
perform.  short, int, and long are not enough.  One also needs a
'long long' for intermediate results.

However, if I write the following code:

    long a,b,c,d;

    a = (b*c + 1)/d;

the COMPILER should realize that since the product of two longs
can overflow, it needs to generate code that will perform a 32x32
multiply followed by a 64/32 divide.  Even on machines that have
these instructions in hardware, no compilers currently do this.
Current compilers default to returning ONLY the low 32 bits, and
this is clearly wrong from a mathematical standpoint.  Now if it
should turn out that the true value of a is bigger than 32 bits,
then I have made an error in my code.

Furthermore, there should be some way of specifying explicitly
that I ONLY want the low order bits of a 64 bit product.
Currently, all compilers assume this by default.  One doesn't even
have a way of getting the high 32 bits if wanted.  e.g.

    a = low32(b*c + 1)/d;
or
    a = high32(b*c + 1)/d;

It is still my contention that current languages are woefully
inadequate for implementing integer arithmetic in an EXACT
mathematical fashion.  The operations proposed here are NOT highly
specialized and abstract.  They are basic arithmetic.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
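(With a 64-bit "long long" -- a gcc extension at this writing --
the widened forms can at least be spelled out by hand.  A hedged
sketch, assuming 32-bit long; low32 and high32 here are
illustrative helpers, not an existing facility.)

    long muldiv(long b, long c, long d)
    {
        long long t = (long long)b * c + 1;   /* exact 64-bit b*c + 1 */

        return (long)(t / d);                 /* 64-by-32 divide */
    }

    unsigned long low32(long long t)  { return (unsigned long)t; }
    unsigned long high32(long long t) { return (unsigned long)(t >> 32); }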
torek@elf.ee.lbl.gov (Chris Torek) (02/27/91)
In article <1991Feb26.172239.10089@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>The people who really love gcc's assembler inlining features are OS writers.

Funny you should mention that....

/* load byte from alternate address space */
static __inline int lduba(void *loc, int asi)
{
    int result;

#ifdef PARANOIA
    if ((unsigned)asi > 15)
        panic("lduba asi");
#endif
    __asm __volatile("lduba [%1]%2,%0" : "=r" (result) :
        "r" (loc), "n" (asi));
    return (result);
}

[sparc] or:

#define _spl(s) \
({ \
    register int _spl_r; \
\
    asm __volatile ("clrl %0; movew sr,%0; movew %1,sr" : \
        "&=d" (_spl_r) : "di" (s)); \
    _spl_r; \
})

#define spl1()  _spl(PSL_S|PSL_IPL1)

[680x0]
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA        Domain: torek@ee.lbl.gov
henry@zoo.toronto.edu (Henry Spencer) (02/27/91)
In article <3439.27ca4e40@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>... Every hardware divide instruction
>I have ever seen produces both quotient and remainder in one operation.

Try reading the hardware manuals for the VAX or the NS32k series.
Your experience seems to be limited.
--
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris | henry@zoo.toronto.edu   utzoo!henry
gah@hood.hood.caltech.edu (Glen Herrmannsfeldt) (02/27/91)
In the common hardware implementation of the divide instruction,
the remainder is generated automatically: it is what is left in
the register after the rest has been subtracted off.  Most high
level languages throw it away.  Most machines generate it anyway,
because it is already there.
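(A bit-at-a-time restoring divide in C mirrors that scheme -- a
sketch, not any particular machine's algorithm.  Assumes 32-bit
unsigned, d != 0, and d <= 2^31 so the shift cannot overflow.)

    unsigned udivmod(unsigned n, unsigned d, unsigned *rem)
    {
        unsigned q = 0, r = 0;
        int i;

        for (i = 31; i >= 0; i--) {
            r = (r << 1) | ((n >> i) & 1);  /* bring down next dividend bit */
            if (r >= d) {                   /* trial subtraction succeeds? */
                r -= d;                     /* subtract the divisor off */
                q |= 1u << i;               /* record a 1 quotient bit */
            }
        }
        *rem = r;   /* the remainder is simply what is left */
        return q;
    }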
sef@kithrup.COM (Sean Eric Fagan) (02/27/91)
In article <3439.27ca4e40@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>But I still have to obfuscate the code with
>    q = dividend / divisor
>    r = dividend % divisor    /* even the operator is bad */
>when what I want is
>    (q, r) = dividend divided by divisor

ANSI C has a function div(int, int) which returns a vector.  In
the form of a *struct*, mind you, but gcc has, I believe, some
extensions which make doing such things easier.
--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
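(A minimal usage sketch of that ANSI C routine:)

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        div_t d = div(17, 5);    /* quotient and remainder in one call */

        printf("%d %d\n", d.quot, d.rem);    /* prints "3 2" */
        return 0;
    }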
rex@cs.su.oz (Rex Di Bona) (02/27/91)
In article <1991Feb27.034955.28860@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> Since I am not entirely clear as to what the article boils down to, apart
> from ``kym, you are _WAY_ wrong''

Sorry, but; yes.  The time for a 'divide' in what I would consider
a general purpose machine would outweigh (by a large factor) the
time taken for more normal instructions...

> I ran a program on a Sun3/60 created from the loop given last time with and
> without adding an extra signed divide (that I made certain was not removed by
> the optimizer) and found there was a `slow down' of about 30% over 100 reps
> (arithmetic mean).

So, for each iteration it was 30% for that extra divide, i.e., a
divide was taking 42% of the time for the loop???

> Considering how far a Sun was from the model I indicated in my previous post,
> I'm a bit amazed it didn't diverge more (from Rex Di Bona's post and other
> private email I had been expecting maybe orders of magnitude).
>
> -kym

Oh, I don't know; having a divide be so long is about what I
figured...  Your loops are not as long as you might think, as they
have those if statements.  Here are the instruction counts:

                                    For 68K    For R3000
    the initial if comparison          6           6
    the *zp++ = M-*yp                  5          13
    the else if compare                6           6
    the *zp++ = M-*xp                  6          11
    the fluff for the final else       5          18
    stalls                             -          10 (? I'm not sure)

Anyway, looking at how these branches are used, we find (on a
sample run of 100 program executions) the two (non-division)
branches are executed a total of 0 (yes, zero) times, and the
division is executed each time.  (Note 1, below.)

So we have a loop that is 6+6+5 = 17 instructions long, plus the
division...  If adding a division increases the time by 30%, we
can conclude that a division is (roughly) equal to 13 instructions
(if the 17 instructions are 58% of the loop time, the divide's 42%
corresponds to about 13).  If we assume the divide takes 78
cycles, this works out at an average of 6 cycles per 68K
instruction -- not too unreasonable?

On the R3000 we have (6+6+18)*1.2+10 = 46, for a divide of about
19.3 cycles... again, not too unreasonable?

Note 1: If you look at your program, the conditions to take the
top two decision traces depend _totally_ upon random numbers being
equal...  You do _not_, I repeat _not_, change the seed to this
random number generator between runs, so it _always_ produces the
same sequence.  This is a bug (feature??) of rand()?  So, you
would always be executing _exactly_ the same code; the differences
you measured were in fact the instabilities in the kernel timing
of your program :-(

> P.S. my `back of the envelope' calcs using the above result indicates a real
> 68k signed divide in a similar loop as before will contribute 10% of the
> total at 68 instructions
           ^^^^^^^^^^^^^^^

Instructions are a bad measure for a CISC, as the time each takes
varies so much.  Um, let's see... (assumption: an instruction is 6
cycles) we get that we need 703 additional cycles (about 117
instructions at 6 cycles each, apart from the divide) to make
adding an extra divide only 10% more overhead; 68 instructions
(68 * 6 = 408 cycles) means an extra 16% overhead.

I ran your program on a MIPS RC3230 and got very similar numbers,
which also supports my premise.  The additional divide was taking
20 cycles, which _completely_ overwhelms the rest of the loop, so
adding in that extra divide is taking a _big_ performance hit!

What matters is the _relative_ speed of divide compared to more
normal (move, simple ALU) instructions.
When the divide instruction costs much more than a 'move'
instruction, you have to have _a lot_ of move instructions to
compensate for that extra divide.  Unless you have these, you lose
big.  In your case, you have a relatively small number of 'other'
instructions (17 on the 68020, and 30 on the R3000), which meant
that the divide instruction was almost equal in time to all these
instructions!  (This is why it is easier, when multiplying by a
constant, to do the shifts and adds instead of the multiply
instruction (SPARC is exempt from this example :-)
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral
peter@ficc.ferranti.com (Peter da Silva) (02/27/91)
Just a random observation:

In article <1991Feb26.171338.8362@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
> Furthermore, there ARE codes which can show much more than a 10% gain
> by having [div/rem] instructions. Most are number-theoretic or cryptographic
> or group-theoretic in nature.

I would think the most common one would be formatting integers
into ASCII strings.
--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
 'U`  "Have you hugged your wolf today?"
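(The classic conversion loop bears this out: one / and one % per
digit, exactly the pair a combined divide/remainder instruction,
or a compiler that fuses the two, would serve.  A sketch; the name
and interface are made up.)

    /* Write the decimal digits of n ending just before "end";
       returns a pointer to the first digit.  "end" points one
       past a buffer with room for the digits and a NUL. */
    char *utoa(unsigned n, char *end)
    {
        *--end = '\0';
        do {
            *--end = '0' + n % 10;  /* remainder: next digit */
            n /= 10;                /* quotient: rest of the number */
        } while (n != 0);
        return end;
    }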
hrubin@pop.stat.purdue.edu (Herman Rubin) (02/27/91)
In article <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> In article <2124@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
> >Single Cycle? For a divide instruction? On what machine? If we take a real
> >example here instead (say the 68020, or the R3000 (as I have the info just
> >behind me here, ugh, there.... lets see, the 68020... , divide, here we are:
>
> Oh, I don't know.  I kinda had the impression that there were some
> high performance divide pipelines that did give a result every cycle.
> Milage may vary however.  (Perhaps giving example 68000 code was a bit
> misleading).

Of the machines I know, this is not the case.  The CYBER 205,
which pipelines just about everything, does not pipeline divide,
square root (the same unit), or conversion between decimal and
binary, which I believe also uses that unit.  The CRAY does not
provide a division at all, presumably because it could not be
pipelined.  It does have a pipelined approximation to the
reciprocal, and a pipelined correction factor.

This is very definitely an architecture problem, and not a
language problem.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
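(A C sketch of that CRAY-style scheme: an approximate reciprocal
refined by one Newton correction step, so the divide becomes
pipelinable multiplies and a subtract.  approx_recip() is
hypothetical, standing in for the hardware's
reciprocal-approximation instruction.)

    extern float approx_recip(float);   /* hypothetical seed instruction */

    float div_via_reciprocal(float a, float b)
    {
        float r = approx_recip(b);  /* low-precision estimate of 1/b */

        r = r * (2.0f - b * r);     /* correction: r' = r*(2 - b*r) */
        return a * r;               /* a/b, to roughly doubled precision */
    }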
weaver@jetsun.weitek.COM (Mike Weaver) (02/28/91)
In article <1991Feb27.021435.11296@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <2124@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
>Oh, I don't know. I kinda had the impression that there were some
>high performance divide pipelines that did give a result every cycle.
>Milage may vary however. (Perhaps giving example 68000 code was a bit
>misleading).
>
>-kym

Does anyone know of a pipelined divider that gives a result every
cycle?  As far as I know, they exist only on paper, and for good
reason: every algorithm I have heard of for an 'array' divider
starts with an iterative algorithm, and simply repeats the
hardware for each iteration.  This means that the amount of
hardware is increased by the same factor as the increase in
throughput, and the latency remains the same, which is not very
attractive.  Also, the actual number of transistors is horrendous
-- it would take perhaps ten times as many transistors as an array
multiplier, which is itself a large thing.
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (02/28/91)
In article <2134@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
[I say I'm not way off even considering we're moving away from my original assumptions]
> Sorry, but; yes.  The time for a 'divide' in what I would consider a general
> purpose machine would outweigh (by a large factor) the time taken for more
> normal instructions...

From my original post I kinda eliminated CISC workstations & such
from consideration, but let's run with Rex's ball anyhow.

[I measured that an extra divide in a loop caused a slow down of 30%]
> So, for each iteration it was 30% for that extra divide, i.e., a divide was
> taking 42% of the time for the loop???

No -- 30%.

    1 x divide + other stuff = 1.0
    2 x divide + other stuff = 1.3
    ---
    therefore 1 x divide = 0.3 of loop

> I ran your program on a MIPS RC3230 and got very similar numbers, which
> also supports my premise.  The additional divide was taking 20 cycles, which
> _completely_ overwhelms the rest of the loop, so adding in that extra divide
> is taking a _big_ performance hit!

Using pixie on a Magnum, I find the following with my little loop
with one divide and the whole thing repeated 100 times:

    Number of cycles:            8.2e7
    Number of instructions:      3.8e7   (i.e. 2.1 c/i av)
    Number of divides executed:  2.6%

I also find a divide takes about 4 add times on average.  So we
want to know how big the loop is such that an extra divide
accounts for 10%:

         Td
    ----------- = 0.1   ==>   n = 32
    2 Td + n Ta

where Td = 4 Ta (say).

I still think I'm in the ballpark wrt my previous posts, even
though we have diverged a little wrt the architectural
assumptions.  As Rex rightly points out, the number of
instructions _executed_ in my little loop is actually fairly small
(which only improves my claim that loops of around 20 instructions
would suffice to swamp a `divide effect').  There are also two
other tricks -- the number of divides may be much smaller than the
number of loop iterations (depending on the size of M), and the
`divide' effect will be masked by an equal number of multiplies
(fairly typical of the codes originally in question, I think).

Which all goes to show the importance of measurement.

As to what the _typical_ size of loops that contain, e.g., a
divide & remainder, ARE -- I have no real idea at this time.  The
loop I posted is (actually the smallest) from various code that I
think may be typical of its type.  If there is any interest, maybe
I can do some more pixifying of other programs & come up with some
better indication.

To summarize: we have an indication that EVEN WHERE DIVIDES ARE
EXPENSIVE (e.g. on the 68k's, as Rex intimates) the cost of an
extra divide in (what I guess I'm claiming to be) typical loops is
not very high wrt the total time of the loop (i.e. 30% in my
simple test case of the last post), and rapidly less when
considering the total running time of the program as a whole.

My original point was that talking about the cost of an extra
instruction or two IN ISOLATION is (both common and) misleading.
The total running cost of a program is fairly insensitive to the
cost of an individual instruction in a loop.  I indicated at the
time that my argument was overly simplistic, but I'm surprised at
how little it seems to `diverge' (although you _could_ say 30% is
3 times 10% -- I prefer 30% is only 20% away :-) when altering
some of the underlying assumptions.

-kym
wca@oakhill.sps.mot.com (William Anderson) (02/28/91)
weaver@jetsun.weitek.COM (Mike Weaver) writes:
>Does anyone know of a pipelined divider that gives a result every cycle?

I was under the impression that HP had designed and implemented a
multi-chip FPU with a (partially?) pipelined divider.  It used a
model division (similar to SRT) on every row.  See the proceedings
of the 1986 ISSCC (p. 34) for a description (and die photo) of
this divider, as well as a clever tree multiplier.  Judging by
this article, HP could cycle the divider only half as fast
(420 ns) as the fully pipelined multiplier (210 ns).  My, how the
clock rates have changed in 5 years....

> ... Also, the actual number of transistors is horrendous -- it would
>take perhaps ten times as many transistors as an array multiplier, which
>is a large thing.

The numbers listed for the (NMOS) parts mentioned above were ~153K
transistors for the multiplier and ~160K for the divider -- not an
order of magnitude by any means.

William Anderson
Motorola 88K Design Group
Motorola MMTG
Austin, TX
tk@wheat-chex.ai.mit.edu (Tom Knight) (02/28/91)
>In article <3381.27c548c3@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com (daniel lance herrick) writes:
>What has frosted me about the languages is that the arithmetic
>operators are all single valued.  I usually want the quotient
>and remainder from a division.  I have to tell the compiler to
>give me the quotient and then tell it to give me the remainder.
>It might be smart enough to notice that it gets both of them in
>one machine operation, but if it is, it is only undoing bad language
>design.

Of course Common Lisp has had just such a function since at least
the early 80's.  In fact, you get your choice of how the quotient
is rounded: floor, ceiling, or truncate.  But we all know that
Lisp isn't a real language 8-].
ali@f.gp.cs.cmu.edu (Ali-Reza Adl-Tabatabai) (03/02/91)
In article <1991Feb27.220856.4067@oakhill.sps.mot.com> wca@oakhill.sps.mot.com (William Anderson) writes:
>weaver@jetsun.weitek.COM (Mike Weaver) writes:
>
>>Does anyone know of a pipelined divider that gives a result every cycle?
>
>I was under the impression that HP had designed and implemented a
>multi-chip FPU with a (partially?) pipelined divider.  It used a
>model division (similar to SRT) on every row.  See the proceedings
>of the 1986 ISSCC (p. 34) for a description (and die photo) of
>this divider, as well as a clever tree multiplier.  Judging by
>this article, HP could cycle the divider only half as fast
>(420 ns) as the fully pipelined multiplier (210 ns).  My, how the
>clock rates have changed in 5 years....
>
>> ... Also, the actual number of transistors is horrendous -- it would
>>take perhaps ten times as many transistors as an array multiplier, which
>>is a large thing.
>
>The numbers listed for the (NMOS) parts mentioned above were ~153K
>transistors for the multiplier and ~160K for the divider -- not an
>order of magnitude by any means.

Let us compare a radix-4 division algorithm with residuals
(partial remainders) in carry-save form against a radix-4
multiplication algorithm that keeps its partial products in
carry-save (CS) form.

A division step consists of:
  (1) quotient digit selection
  (2) divisor multiple selection
  (3) carry-save addition to produce the next partial remainder

A multiplication step consists of:
  (1) multiplication of the recoded multiplier digit with the multiplicand
  (2) carry-save addition to produce the next partial product

The only difference in logic between the two is the quotient digit
selection, which can be done based on an estimate of the partial
remainder.  This can be accomplished using a limited-precision
adder (5-6 bits, independent of the divider's full precision) and
a PLA.  Therefore, if we unwind the steps to produce an array
implementation, there should hardly be a 10 times difference in
the number of transistors.  As the precision becomes larger, the
difference diminishes, since the size of the quotient digit
selection logic remains the same.

The speed, however, is a different story, since in the multiplier
the recoding and multiplicand selection can be done in parallel,
followed by a propagation through the adder array.  In the
divider, each step depends on the previous partial remainder, so
the path length is longer.  Therefore, in a pipelined
implementation, for a given pipeline step time, you may do much
more of the multiplication than of the division per stage.  Hence
you may require more latches for the divider.

Note that I did not take into account the last carry-assimilation
step for the multiplier and divider, and I did not discuss
recoding of the redundant quotient digits into conventional form.
After the last multiplication step, the most significant half of
the product is in carry-save form and needs to be passed through a
carry-assimilate adder.  The last partial remainder of the divider
will also be in carry-save form.  This introduces an error in the
last bit of the quotient, so it must also be assimilated for
precise rounding.  The conversion of the redundant quotient to
conventional form may be done on-the-fly as the digits are
produced.  See 'On-the-fly Conversion of Redundant into
Conventional Representations', IEEE Trans. on Computers, July
1987, by Ercegovac and Lang.

Ali-Reza Adl-Tabatabai
(ali@cs.cmu.edu)