[comp.sys.ibm.pc] Ami Bios and the V20

kristyn@aludra.usc.edu (KRISTYN GREENWOOD) (10/30/89)

    Stupid Question time: has anyone gotten a V20 to work with the
                          '88 AMI bios on a vanilla XT?

    Somebody gave me a bios that works with the V20, but it doesn't
    seem to want to recognize the mono card (Hercules graphics card
    - the real thing). System boots fine, just no picture.

    Any shred of wisdom on the subject would be greatly appreciated.

-g.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                                      |
Disclaimer - Dont blame them, I just  |  I hope they dont blow us up
             rent this space.         |  as they try to figure out how
                                      |  to blow us up.
Glenn Schmall - !uunet!ucscb!astroid  |                    -Geordi TNG
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

vicc@unix.cie.rpi.edu (VICC Project (Rose)) (11/10/89)

In article <6129@merlin.usc.edu> kristyn@aludra.usc.edu (Kris' better half) writes:
>
>    Stupid Question time: has anyone gotten a V20 to work with the
>                          '88 AMI bios on a vanilla XT?
>
>    Somebody gave me a bios that works with the V20, but it doesn't
>    seem to want to recognize the mono card (Hercules graphics card
>    - the real thing). System boots fine, just no picture.

Hmm, I've got a Zenith Z161 (portable - luggable). I just popped my V20
in and, zingo, everything goes. My Norton SI went from 1.0 to 1.8!

Now for my question:

Where can I find software that uses the V20 (especially assemblers,
disassemblers and debuggers)? I've verified that many of the additional
instructions are in fact the same as the 80186's, so using .186 for
Turbo Assembler works fine, but I would like to be able to use the
bit field instructions without defining macros.

Btw: for those who don't know, you can call NEC and have them send a
user's manual with a full instruction set description (free, and they
were fast - 2 days! - which says something about the US-SNAIL these days).

--
Frank Filz
Center For Integrated Electronics
Rensselaer Polytechnic Institute
vicc@unix.cie.rpi.edu

rob@prism.TMC.COM (11/11/89)

 > Hmm, I've got a Zenith Z161 (portable - luggable). I just popped my V20
 > in and, zingo, everything goes. My Norton SI went from 1.0 to 1.8!

  It's been said before, but it's worth repeating - The 1.8 number from
SI is unrealistic (in general, any number from SI comparing different
CPUs is unrealistic). Norton's SI tests speed by looping around an IMUL
and IDIV instruction. The V20 executes these disproportionately quickly
compared to the 8086/88. Since IMUL and IDIV instructions are very rare in
'real world' code, SI's figure isn't meaningful. The same problem comes
up when running SI on 286/386/etc. machines.

   In general, a V20 should give you about a 5 - 10% speedup, possibly
as high as 20 - 30% when running floating point code, which tends to be 
heavy in integer multiplies and divides unless you have an 8087.

brianr@phred.UUCP (Brian Reese) (11/17/89)

In article <206900136@prism> rob@prism.TMC.COM writes:
>
>  It's been said before, but it's worth repeating - The 1.8 number from
>SI is unrealistic (in general, any number from SI comparing different
>
>   In general, a V20 should give you about a 5 - 10% speedup, possibly
                      ^^^^^^
>as high as 20 - 30% when running floating point code, which tends to be 

Have you ever actually _tried_ it?  I replaced the CPU in my XT with a V20
and realized a very noticeable increase in speed, running a variety of apps.
If the increase was only 5 - 10%, I really doubt that I would notice it.
I'd say, on the average, I got a 40 - 50% boost.  (Just for GP, it went
from 1.0 to 1.8, just like the original poster.)

I do agree with you that the SI is rather unrealistic, for the reasons you
cited.  I'm just offering my hands-on, personal experience.

Anyone else out there with V20's?

Brian

-- 
Brian Reese                           uw-beaver!pilchuck!seahcx!phred!brianr
Physio Control Corp., Redmond, Wa.                         brianr@phred.UUCP
"Sticks and stones may break my bones, but whips and chains excite me!"
* Do not write on this line.  This line has been left blank intentionally. *

silver@eniac.seas.upenn.edu (Andy Silverman) (11/17/89)

While I find a 40-50% speedup kind of unrealistic, I'd say 30% is
reasonable in specific applications. Take, for example, the FRACTINT program
that does all those neat fractals using integer math.  On the old 8088, the
integer math ops were SLLOOWW, but with a V20 in my system there was a very
noticeable speed increase.

+-----------------------+-----------------------------------------+
| Andy Silverman        | Internet:   silver@eniac.seas.upenn.edu |
| "All stressed out and | Compu$erve: 72261,531                   |
|  nobody to choke."    |                                         |         
+-----------------------+-----------------------------------------+

vicc@unix.cie.rpi.edu (VICC Project (Rose)) (11/17/89)

In article <2851@phred.UUCP> brianr@phred.UUCP (Brian Reese) writes:
>In article <206900136@prism> rob@prism.TMC.COM writes:
>>
>>  It's been said before, but it's worth repeating - The 1.8 number from
>>SI is unrealistic (in general, any number from SI comparing different
>>
>>   In general, a V20 should give you about a 5 - 10% speedup, possibly
>                      ^^^^^^
>>as high as 20 - 30% when running floating point code, which tends to be 
>
>Have you ever actually _tried_ it?  I replaced the CPU in my XT with a V20
>and realized a very noticeable increase in speed, running a variety of apps.
>If the increase was only 5 - 10%, I really doubt that I would notice it.
>I'd say, on the average, I got a 40 - 50% boost.  (Just for GP, it went
>from 1.0 to 1.8, just like the original poster.)

Actually, from examining NEC's timings, it would seem that an increase
of 20-30% could be expected. Multiplies and divides run 4x faster.
All effective addresses take 2 clock cycles instead of 6+ (the V20
has hardware address calculation as opposed to microcode). The V20 also
does multiple shifts at 1 cycle per bit instead of 4 because of
hardware aid to the microcode (i.e. not a barrel shifter, which would
do all shifts in 1 cycle). A number of instructions are also 1 or 2
cycles quicker. Of course these numbers are affected by the
instruction prefetch queue (which I think might be better on the V20
also). REP instructions are also sped up, and interrupts during
prefixed instructions are handled correctly (up to 3 prefixes are
'remembered' as opposed to only 1 on the 8086; the V20 adds a REPC
[carry] and a REPNC prefix, so you could have 4, but most people don't
use the LOCK prefix).

As I said before the V20 has all 80186 instructions. In addition the
V20 has REPC, REPNC, bit field instructions, BCD arithmetic string
functions, and 8080 emulation (either mode shift or SW interrupt).

Since someone asked for the number for NEC:

   1-800-632-3531 (or 3532 in California)

Ask for the V20 User's Manual

One complaint I have about the manual: the registers are renamed, and
the instructions are renamed (probably a copyright problem).

One question I have: there is also a 2nd Co-Processor Escape op-code,
does anyone know what this does? (could it support a 387 or something
weird - neat like that? (I doubt it but one could hope))

One note I picked up from a friend - if you send your system in for
repairs, make sure that they pull the V20 if they replace your system
board. My friend lost his V20 because of that (so now he has a
handful to take care of all future problems - also purchased when it
looked like the supply would dry up real fast).

A note about the speed up - a register to memory operation is
typically 15 cycles or so on the V20 and 13+EA on the 8088, which
translates to a 20% speedup at worst. This is what I base my 20-30%
speedup on: most instructions are not MUL or DIV, but many are MOV
reg,mem or OP reg,mem.
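
As a rough check of that arithmetic, here is a small Python sketch. The
15-clock and 13+EA figures are the ones quoted above; the EA values of
5 to 12 clocks are an assumption (the range usually quoted for 8086/8088
addressing modes), and the model ignores instruction fetch time entirely.

```python
# Nominal clocks for a register-to-memory operation, per the figures above.
V20_CLOCKS = 15      # "15 cycles or so" on the V20
I8088_BASE = 13      # the "13" in 13+EA on the 8088


def speedup(ea_clocks):
    """Ratio of 8088 clocks to V20 clocks for a given EA cost (assumed)."""
    return (I8088_BASE + ea_clocks) / V20_CLOCKS


# EA costs below are illustrative picks from the assumed 5..12 range.
for ea in (5, 6, 9, 12):
    total = I8088_BASE + ea
    print(f"EA={ea:2d}: 8088 takes {total} clocks, speedup {speedup(ea):.2f}x")
```

With the cheapest addressing mode (EA = 5) this comes out to 18/15, i.e.
the 20% figure; slower addressing modes look better on paper, but only if
the bus can keep the processor fed.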

--
Frank Filz
Center For Integrated Electronics
Rensselaer Polytechnic Institute
vicc@unix.cie.rpi.edu

rob@prism.TMC.COM (11/17/89)

>>  It's been said before, but it's worth repeating - The 1.8 number from
>>SI is unrealistic (in general, any number from SI comparing different
>>
>>   In general, a V20 should give you about a 5 - 10% speedup, possibly
                      ^^^^^^
>>as high as 20 - 30% when running floating point code, which tends to be 

>Have you ever actually _tried_ it?  I replaced the CPU in my XT with a V20
>and realized a very noticeable increase in speed, running a variety of apps.
>If the increase was only 5 - 10%, I really doubt that I would notice it.
>I'd say, on the average, I got a 40 - 50% boost.  (Just for GP, it went
>from 1.0 to 1.8, just like the original poster.)

   Actually, I did try it a few years ago. I should have been more
specific about what I meant by 'in general'. Running non-floating
point code (program compiles, spreadsheets, and databases), the speedup
was from 5 to 10%. I wouldn't have noticed it if I hadn't been timing
it. As mentioned, floating point code, which makes heavy use of the
integer multiplies and divides at which the V20 excels, shows a greater
increase (in my experience, around 25%).

   It's sort of surprising how little difference a V20 makes. As another
note mentioned, it also claims to drastically speed up (by a factor of 3 
to 6) effective address calculation, which, unlike integer multiplies and 
divides, is a real factor in most code. Yet a test program I wrote that 
simply looped around a bunch of statements like

		   MOV   AX, [BX+SI+2]

showed a speedup of only about 20%, as I recall. Looping overhead was
clearly a consideration (the V20 doesn't claim to speed up loops
significantly), but I still expected a larger gain.

   Still, whether it's worthwhile to you depends on what you're running,
and what you consider a significant speedup. You could probably also realize 
a larger gain if you optimized code for the V20. Given how inexpensive it
is, getting a V20 or V30 is probably worth it. The point remains, though: 
someone expecting the 80% speedup that SI promises will be disappointed
(i.e. my complaint is with SI, not the V20).

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/18/89)

In article <2851@phred.UUCP> brianr@phred.UUCP (Brian Reese) writes:

| I do agree with you that the SI is rather unrealistic, for the reasons you
| cited.  I'm just offering my hands-on, personal experience.
| 
| Anyone else out there with V20's?

  I found that the one program which justified buying a V20 did run much
faster. That's good, because I don't notice anything else running
better. I did measure a few programs, but the change was 10-15% faster,
not enough to really notice.

  Then I got a 386... that I notice.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

Ralf.Brown@B.GP.CS.CMU.EDU (11/18/89)

In article <206900137@prism>, rob@prism.TMC.COM wrote:
 >   It's sort of surprising how little difference a V20 makes. As another
 >note mentioned, it also claims to drastically speed up (by a factor of 3 
 >to 6) effective address calculation, which, unlike integer multiplies and 
 >divides, is a real factor in most code. Yet a test program I wrote that 
 >simply looped around a bunch of statements like
 >
 >                   MOV   AX, [BX+SI+2]
 >
 >showed a speedup of only about 20%, as I recall. Looping overhead was
 >clearly a consideration (the V20 doesn't claim to speed up loops
 >significantly), but I still expected a larger gain.

A major problem is that the 8088 and V20 are bus-bound.  Any instruction
that executes in less than four clock cycles per byte will drain the
four-byte instruction prefetch queue.  Once the prefetch queue is empty,
instructions run only as fast as they can be fetched from memory (at one
byte every four clock cycles).  Since every branch empties the prefetch
queue (and the instructions at the destination may not let it refill),
the prefetch queue spends a significant percentage of the time empty.

For example, the sequence
	SHL  AX,1
	SHL  AX,1
	SHL  AX,1
	SHL  AX,1
takes eight clocks according to the official Intel instruction timings.
Unfortunately, each of these instructions is two bytes long, so it takes
eight clocks to fetch each instruction.  Thus, the best case is when the
instruction queue is full at the start of this sequence:
	SHL  AX,1    two clocks, PQ now has two bytes and is fetching a third
	SHL  AX,1    two clocks, PQ now empty, third byte arrives at end
	SHL  AX,1    only one byte, so start fetching next
		     four clocks later, we can start, so total is six clocks
	SHL  AX,1    wait two clocks for first byte, four for second,
		     then two clocks to execute = eight clocks
Total: 18 clocks

Worst case is when the prefetch queue is empty, with the next byte two
clocks away.  Then the first three instructions each take eight clocks
to execute, and the last takes ten clocks, for a total of 34 clocks.

You should see a greater improvement when replacing an 8086 with a V30,
since they can fetch two bytes every four clocks and have a six-byte
prefetch queue, greatly reducing the bus-boundedness of the processor
(the above instruction sequence runs in eight to 16 clocks, depending on
how full the prefetch queue is at the beginning).
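
The counting above can be sketched as a toy model in Python. It is only
an approximation under stated assumptions: fetches run back to back at a
fixed cost, an instruction starts once its last byte has arrived, and the
queue size limit is ignored (the queue never refills in this example
anyway). The function name and parameters are made up for illustration.

```python
def run_clocks(instrs, queued_bytes, first_fetch_at,
               bytes_per_fetch=1, fetch_clocks=4):
    """Clocks to finish `instrs`, a list of (length_bytes, exec_clocks).

    queued_bytes   -- bytes already sitting in the prefetch queue at clock 0
    first_fetch_at -- clock at which the first in-flight fetch completes
    """
    total_bytes = sum(length for length, _ in instrs)
    # arrival[i] is the clock at which byte i of the code stream is available.
    arrival = [0] * min(queued_bytes, total_bytes)
    t = first_fetch_at
    while len(arrival) < total_bytes:
        arrival.extend([t] * bytes_per_fetch)
        t += fetch_clocks
    clock = 0
    pos = 0
    for length, exec_clocks in instrs:
        pos += length
        ready = arrival[pos - 1]          # last byte of this instruction
        clock = max(clock, ready) + exec_clocks
    return clock


SHL4 = [(2, 2)] * 4   # four SHL AX,1: two bytes, two clocks each

best_8088 = run_clocks(SHL4, queued_bytes=4, first_fetch_at=4)
best_v30 = run_clocks(SHL4, queued_bytes=6, first_fetch_at=4, bytes_per_fetch=2)
worst_v30 = run_clocks(SHL4, queued_bytes=0, first_fetch_at=2, bytes_per_fetch=2)
print(best_8088, best_v30, worst_v30)
```

This reproduces the 18-clock best case and the 8-to-16-clock range for the
V30. For the empty-queue 8088 case the toy model gives 32 rather than 34,
so it evidently leaves out some detail of the hand count; treat it as a
sketch, not a cycle-accurate simulator.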


--
UUCP: {ucbvax,harvard}!cs.cmu.edu!ralf -=-=-=-=- Voice: (412) 268-3053 (school)
ARPA: ralf@cs.cmu.edu  BIT: ralf%cs.cmu.edu@CMUCCVMA  FIDO: Ralf Brown 1:129/46
FAX: available on request                      Disclaimer? I claimed something?
"How to Prove It" by Dana Angluin
  8.  proof by wishful citation:
      The author cites the negation, converse, or generalization of a theorem
      from the literature to support his claims.

rob@prism.TMC.COM (11/21/89)

 > A major problem is that the 8088 and V20 are bus-bound.  Any instruction
 > that executes in less than four clock cycles per byte will drain the
 > four-byte instruction prefetch queue.  

  This is true for many code sequences, though not all. The problem is that, 
on the 8088 at least, many instructions take longer than 4 clocks/byte. In 
general, register intensive code (like the SHL AX,1 in your example) is bus 
bound, while memory intensive code, and some register arithmetic, is compute 
bound. 'Typical' code falls somewhere in between, though the bus is still a 
bottleneck. The V20, with its faster effective address calculation, is more 
likely to run up against the bus limit when accessing memory, so your point 
about bus bandwidth being a limiting factor is valid.

   One of SI's problems is that it's entirely compute bound on an 8086/88. 
It shows no difference between an 8088 and an 8086 running at the same clock 
speed. This is because SI spends about 2/3 of its time on IMUL or IDIV 
instructions, which are uninfluenced by bus bandwidth. Thus the way to speed 
up a CPU's SI rating is to speed up its multiplies and divides. Of course, 
since most code doesn't contain many IMULs or IDIVs, a CPU that speeds those 
instructions up more than it speeds up the more common ones will do better 
at SI than it will in 'real life'. That's why SI overstates the performance 
of many CPUs so drastically.
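
   To put rough numbers on that, here is a back-of-the-envelope Python
sketch using Amdahl's law. The 2/3 fraction is the SI figure above and the
4x multiply/divide factor is the one quoted from NEC's timings earlier in
the thread; the 5% fraction for 'real life' code is purely an illustrative
assumption, not a measurement.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: only `fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


si_like = overall_speedup(2 / 3, 4)   # SI: about 2/3 of its time in IMUL/IDIV
typical = overall_speedup(0.05, 4)    # hypothetical: 5% of time in IMUL/IDIV

print(f"SI-like loop: {si_like:.2f}x")
print(f"typical code: {typical:.2f}x")
```

   That works out to about 2.0x for the SI-like loop, in the same
neighborhood as the 1.8 people report, and only about 1.04x for the
hypothetical 5% case.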