[comp.arch] *big iron*

cliffhanger@cup.portal.com (Cliff C Heyer) (09/25/89)

>In <22308@cup.portal.com> Cliff C Heyer wrote:
>> the *big iron* guys use MIPS as a trojan horse to hide the *real*
>> performance issue - "real world" I/O bandwidth. Lets blow their
>> cover!)
Eric S. Raymond responded...
>Huh? The `big iron' crowd is happy to talk I/O bandwidth, it's about the
>only place general-purpose mainframes got room to shine any more. It's
>the *PC/workstation* crowd that's prone to yell about MIPS and whisper
>about I/O performance.

Guess you got my point...but still I can express it better:
To clarify, *big iron* guys emphasize I/O on
their *mainframes*, but not on their *PCs*. Instead, they emphasize
MIPS on their PCs. Of course many would say that PC users are not
"interested" in I/O BW, and I agree with this. BUT there is a convenient
dual purpose here for the *big iron* guys.  The following is my
theory:

Let me play *Joe, Mr. Marketing VP*, who will get a BIG promotion if
profits are maximized. Joe wants two things: (1) saturate the market
with *slow* hardware so users will have to *buy more* in the future,
and (2) avoid telling people how slow their PC hardware is compared to
*big iron*, so as not to call attention to it.
Why have people complaining and demanding better performance if
you can avoid it?

This is the only answer I can come up with to explain why IBM 
consistently puts out PCs that are substantially below average in "real"
disk I/O speed: 200KB/sec. Just look at Byte benchmarks. Plus I'm
getting LOTS of mail now further confirming this. Now I'm talking
about hardware bought & configured by IBM. I know you can
plug in aftermarket disks w/custom drivers that may blow away
the original equipment.

Companies are trying to promote the image that they are selling "state
of the art" hardware, but if you look at "mainframe" specs you see
how far from "state of the art" you really are. Let's just be honest here:
why let a company tell you that you're buying the "best" when the
truth is you are buying mid-70s level performance? I know, I know,
it doesn't cost $5,000,000. That is an accomplishment! My beef is how 
we are told all the time that we are buying the "best" when in fact
we are often buying below the current average - as with IBM.

I've had lots of input from usenet about SRAM prices, etc., and I can
appreciate that there is no way to build a 33MHz 80386 PC with no wait
states 100% of the time without the price going to $50,000. 'Nuf said.

BUT my point is that the alleged "leaders" in the industry are not even
keeping up with what the small companies are doing. For example, the
Amiga doing 700-900KB/s "real" disk I/O. I've had this confirmed now
by several people, some who I know personally. And new PS/2s come out
with SCSI doing 200KB/s, at THREE TIMES the price of the Amiga. THEN
we have the $30,000 MIPS M120/5 only doing 600KB/sec, the AViiON
doing 300KB/sec, the Sparcstation doing 200KB/s, etc. etc.

So my belief is that some companies are trying to save I/O BW for their
*big iron* by purposefully handicapping the speed of their PCs. They 
accomplish two things: first, they save BW for the *big iron* by making
big databases run too slowly on PCs, and second, they encourage hardware
upgrades by limiting your I/O so you'll need more hardware sooner.
And even though workstation vendors don't have *big iron*, they DO
have a $200,000 "high end" they NEED to sell, which won't sell if their
entry level models do 1MB/sec disk I/O.

What I would like to see happen is for trade papers to place much more
emphasis on disk I/O so that the *big iron* PC makers will no longer be 
able to play this game with the consumer. Infoworld, PC Week, are you
listening? (Or are you going along with this because of all the advertising
dollars you are being paid?) At least PCs are cheap enough that you can
buy & test them without signing a non-disclosure agreement!

Barry Traylor writes...
>In article <7981@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>	Well, I just tried it on my machine (old, slower disk controller,
>>medium fast SCSI disk (Quantum)).  Read 3Meg file into memory: 609K/s.
>>Copy 3 meg file (on slightly fragged partition) to another file on the
>>same disk partition: ~550K/s.  On a newer controller with a fast SCSI disk
>>(170Meg CDC): ~900K/s and ~800K/s.
>
>Ok, ok, so we've now seen two pretty impressive transfer rates for MICROs.
>I would even go so far as to say that the rates reported beat by a little
>the PDP11/70 I used 10 years ago.  I hope, however, that you don't think
>this comes even close to what is attainable SUSTAINED on a mainframe.  

Hmmm.  I'm talking about sustained rates *per job*, not for the *system*. I know
overall throughput is in excess of 100MB/sec. But who makes a disk drive that
does 100MB/sec transfers? The best now is 3-4MB/sec. So when we get right
down to it, a COBOL program reading a file can expect less than 3-4MB/sec on
a mainframe. (The same reasoning explains how a 100 MIPS 4 processor mainframe
can only support 25 MIPS *per job*)

>much of the CPU was chewed up while these transfers were underway? 

If you are running a single-tasking OS (NOT UNIX), who cares? You have to
wait until the transfer is done anyway, so it might as well be as fast as 
possible. Hopefully SCSI does DMA while the CPU is busy elsewhere (comments
please...)

>I have seen mainframes do 50 times that rate (on a 1 processor system) and only
>utilize 10% of a CPU.  

Yup. That is because of intelligent channel processors that do DMA to multi-ported
memory. The same thing SCSI can do. Except with one user, we only need one channel
(or one for each file). But with UNIX we could use a few more.

>I have seen I/O rates at 4000-5000 i/os per second
>where the CPU is less than 75% utilized.  How many SCSI channels do these
>micros support? 

One I think. Comments others please!!!!!

>On a strict connectivity basis, the mainframe I am associated with can support 
>over 100!

But they can also support 1000s of users at once. Not needed on a PC.

>So go ahead and feed steroids to SCSI.  It will help mainframes as much as
>everyone else.  We would love to sell our mainframe customers hundreds of
>the things to squeeze into their pack farm acherage.

 I guess you aren't in marketing! MF vendors don't want to *encourage* users to
 port 500MB databases to micros, because the profit margin is so low on PCs.
 I believe they enact plans that discourage users from leaving the *big iron*, which
 would include limiting PC I/O to 300KB/sec. Look at IBM's OfficeVision - it's
 mainframe-centered.
 
I'm hoping some engineers might speak up who have actually designed PC disk I/O 
subsystems and could tell us why they didn't try for 900KB/sec like on the
Amiga.


Cliffhanger@cup.portal.com

-----

ccplumb@rose.waterloo.edu (Colin Plumb) (09/25/89)

In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
> I'm hoping some engineers might speak up who have actually designed PC disk
> I/O subsystems and could tell us why they didn't try for 900KB/sec like on
> the Amiga.

In their defense, they're handicapped by the MS-DOS file system, which is
pretty piss-poor.  Randell's figures are using the rewritten file system;
replacing MS-DOS's is trickier.  A 2090A SCSI controller with a CDC Wren III
can do 1.2MB/sec through the device driver and the 2091 is probably faster,
so it's possible I will be able to get 1MB/sec I/O out of my 7.14MHz 68000
one day.
-- 
	-Colin

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/26/89)

In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:

>To clarify, *big iron* guys emphasize I/O on
>their *mainframes*, but not on their *PCs*. Instead, they emphasize

>This is the only answer I can come up with to explain why IBM 
>consistently puts out PCs that are substantially below average in "real"
>disk I/O speed: 200KB/sec. Just look at Byte benchmarks. Plus I'm

>So my belief is that some companies are trying to save I/O BW for their
>*big iron* by purposefully handicapping the speed of their PCs. They 

Many of your points are well taken.  In fact, many big companies don't make it
a secret that they limit their users' options to force certain migration paths.
The industry trade rags are full of speculation about such things, and 
sometimes even print a lot of criticism of the big boys for introducing
new, high performance products too quickly - it is hard on the used equip. mkt.

However, I think you are painting with too broad a brush to include Sun, MIPSCo,
etc. in your list.  Remember that the controllers you have been using for
your comparisons to get ~1 MB/sec. through a filesystem are relatively new.
Most of these controllers have been thoroughly *debugged* and in volume
production (two prerequisites for full service companies to buy) for 6 mos.
to one year.  Sun now sells faster controllers that will do almost 1 MB/sec.
on SMD disks through a Unix filesystem.  I haven't had a chance to measure
any IPI or synchronous SCSI disks.  But it is unfair to use today's controllers
to criticize systems shipped 1-2 years ago.

The other thing that would probably help would be if more people said to
salesrep from company X:  "I am buying the system from company Y.  Even
though the CPU is only 10 MIPS instead of 20, it can stream data from 4
controllers simultaneously at 2.5MB/sec. each, with negligible CPU overhead."

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

jesup@cbmvax.UUCP (Randell Jesup) (09/26/89)

In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>>much of the CPU was chewed up while these transfers were underway? 
>
>If you are running a single-tasking OS (NOT UNIX), who cares? You have to
>wait until the transfer is done anyway, so it might as well be as fast as 
>possible. Hopefully SCSI does DMA while the CPU is busy elsewhere (comments
>please...)

	Well, I don't want to sound commercial here, but the Amiga (referenced
by the above quote) is multitasking.  I don't have any cpu benchmarks run 
during intense disk I/O handy, but I'll post some when I get time to dig them
out.  BTW, most Unix machines are handicapped by the "standard" unix fs/disk
cache.  This cache requires them to do single-block reads, while under AmigaDos
the filesystem can ask for large blocks and have them transferred by DMA directly
from disk to where the application's read goes.  This works quite well 
with SCSI.

	On the same hardware, the Amiga Unix (Amix) gets significantly lower
I/O throughput because of this, and the extra transfer via CPU to the
application's buffer.
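
	As a rough illustration of the difference, here is a hypothetical
sketch (not actual AmigaDos or Unix kernel code; an ordinary file stands in
for the disk, and at user level both paths still go through whatever cache
the OS provides - the point is just the request pattern and the extra copy).
Path (a) mimics a block cache, reading one block at a time into a staging
buffer and then copying it into the destination; path (b) issues one large
request straight into the destination, skipping the copy.

/*
 * Hypothetical sketch: block-at-a-time reads through a staging buffer
 * versus one large read straight into the destination buffer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCKSIZE 512
#define FILESIZE  (1024L * 1024L)       /* read 1MB of the test file */

int main(void)
{
    char *dest = malloc(FILESIZE);
    char staging[BLOCKSIZE];
    FILE *fp = fopen("testfile", "rb");
    long off;
    clock_t t0;

    if (fp == NULL || dest == NULL)
        return 1;

    /* (a) cache-style: one block per request, plus a memory copy */
    t0 = clock();
    for (off = 0; off < FILESIZE; off += BLOCKSIZE) {
        size_t n = fread(staging, 1, BLOCKSIZE, fp);
        if (n == 0)
            break;
        memcpy(dest + off, staging, n);
    }
    printf("block-at-a-time:   %.2f sec\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC);

    /* (b) one large request straight into the destination */
    rewind(fp);
    t0 = clock();
    fread(dest, 1, FILESIZE, fp);
    printf("single large read: %.2f sec\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC);

    fclose(fp);
    free(dest);
    return 0;
}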

>>I have seen I/O rates at 4000-5000 i/os per second
>>where the CPU is less than 75% utilized.  How many SCSI channels do these
>>micros support? 
>
>One I think. Comments others please!!!!!

	You can add up to 5 SCSI controllers to an Amiga (limited by the 5
slots).  The other limit is the bus bandwidth of the current Amiga, at about
3.5 MB/s.  Of course, each SCSI controller can talk to at least 7 drives, if
you don't use multi-LUN drives.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

elgie@canisius.UUCP (Bill Elgie) (09/26/89)

In article <22488@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) writes:
> 
> BUT my point is that the alleged "leaders" in the industry are not even
> keeping up with what the small companies are doing. For example, the
> Amiga doing 700-900KB/s "real" disk I/O.......   THEN
> we have the $30,000 MIPS M120/5 only doing 600KB/sec, .......
> 
  Actually, I believe that Amiga is somewhat bigger than MIPS (tho the latter
  should catch up...).

  The "$30,000 MIPS M/120" is 1) more than a year old and slated for an up-
  grade, and 2) includes quite a bit of memory, a disk, ethernet, serial ports,
  etc, as well as a very well-done UNIX and its associated software, with an
  unlimited user license.  It runs considerably faster than anything I have 
  seen from Amiga.

  We support a good-sized database application on one of these systems, in
  spite of the limited "600KB/sec" transfer rate: that measure is not very
  meaningful, and is inaccurate in any case.

  greg pavlov (under borrowed account), fstrf, amherst, ny

henry@utzoo.uucp (Henry Spencer) (09/26/89)

In article <7997@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>... most Unix machines are handicapped by the "standard" unix fs/disk
>cache.  This cache requires them to do single-block reads, while under AmigaDos
>the filesystem can ask for large blocks and have them transferred by DMA directly
>from disk to where the application's read goes...

It's quite possible to do this under Unix as well, of course, if you've got
kernel people who seriously care about I/O performance.
-- 
"Where is D.D. Harriman now,   |     Henry Spencer at U of Toronto Zoology
when we really *need* him?"    | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

sritacco@hpdml93.HP.COM (Steve Ritacco) (09/27/89)

This seems like an appropriate time to mention the channel controller
in the NeXT machine.  There is an example of a micro with attention
paid to I/O bandwidth.  I don't know what transfer rates it can
sustain, but maybe someone on the net can tell us.

markb@denali.sgi.com (Mark Bradley) (09/27/89)

In article <32512@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
> 
> However, I think you are painting with too broad a brush to include Sun, MIPSCo,
> etc. in your list.  Remember that the controllers you have been using for
> your comparisons to get ~1 MB/sec. through a filesystem are relatively new.
> Most of these controllers have been thoroughly *debugged* and in volume
> production (two prerequisites for full service companies to buy) for 6 mos.
> to one year.  Sun now sells faster controllers that will do almost 1 MB/sec.
> on SMD disks through a Unix filesystem.  I haven't had a chance to measure
> any IPI or synchronous SCSI disks.  But it is unfair to use today's controllers
> to criticize systems shipped 1-2 years ago.

I can't yet publish our IPI numbers due to signed non-disclosure, but suffice
it to say that it would not make sense to go to a completely different controller
and drive technology for anything less than VERY LARGE performance wins or
phenomenal cost savings....

On the other hand, using a controller from the same company, SGI gets over
2 MB/sec. on SMD *through* the filesystem.  See comp.sys.sgi for discussions
on the Extent File System designed by our own Kipp Hickman and Donovan Fong
(who is no longer with SGI).  However it is also true that a decent job on
drivers, caching, selection of the right technology, both in terms of con-
trollers and disk drives, and actually marrying these all together will yield
a more coherent disk subsystem that is capable of providing nearly theoretical
maximum throughput.  This is something many companies seem to miss the boat
on.  Clearly I am somewhat biased, but the numbers don't lie (see below
for our mid-range ESDI numbers, which are handy in my current directory).
> 
> The other thing that would probably help would be if more people said to
> salesrep from company X:  "I am buying the system from company Y.  Even
> though the CPU is only 10 MIPS instead of 20, it can stream data from 4
> controllers simultaneously at 2.5MB/sec. each, with negligible CPU overhead."

This is reasonable.  Even more so if there is negligible CPU overhead.  2.5
on 4 controllers seems low and/or expensive, however.
> 
>   Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
>   NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
>   Moffett Field, CA 94035     
>   Phone:  (415)694-6117       

IP9,nfs,bbs 
Sequential write test
total 33554432   time 19.910   ms/IO 9   Kb/S 1685
Sequential read test
total 33554432   time 22.420   ms/IO 10   Kb/S 1496
Random read test
total 8388608   time 14.990   ms/IO 29   Kb/S 559
Multiple processes writing separate files simultaneously
total 268435456   time 187.660   ms/IO 11   Kb/S 1430
Multiple processes reading the intermixed files
total 268435456   time 216.950   ms/IO 13   Kb/S 1237
Multiple processes reading randomly from the intermixed files
total 67108864   time 128.580   ms/IO 31   Kb/S 521
Write 8 files, one at a time
total 33554432   time 20.140   ms/IO 9   Kb/S 1666
total 33554432   time 20.110   ms/IO 9   Kb/S 1668
total 33554432   time 20.570   ms/IO 10   Kb/S 1631
total 33554432   time 20.230   ms/IO 9   Kb/S 1658
total 33554432   time 20.870   ms/IO 10   Kb/S 1607
total 33554432   time 19.930   ms/IO 9   Kb/S 1683
total 33554432   time 20.290   ms/IO 9   Kb/S 1653
total 33554432   time 20.510   ms/IO 10   Kb/S 1636
Multiple processes reading the sequentially laid-out files
total 268435456   time 202.900   ms/IO 12   Kb/S 1322
Multiple processes reading randomly from the sequentially laid-out files
total 67108864   time 123.270   ms/IO 30   Kb/S 544

Disclaimer:  This is my opinion.  But in this case, it might just be that of
             my employer as well.

						markb

--
Mark Bradley				"Faster, faster, until the thrill of
IO Subsystems				 speed overcomes the fear of death."
Silicon Graphics Computer Systems
Mountain View, CA			     ---Hunter S. Thompson

nelson@udel.EDU (09/27/89)

In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>
>I'm talking about sustained rates *per job*, not for the *system*. I know
>overall throughput is in excess of 100MB/sec. But who makes a disk drive that
>does 100MB/sec transfers? The best now is 3-4MB/sec. So when we get right
>down to it, a COBOL program reading a file can expect less than 3-4MB/sec on
>a mainframe. (The same reasoning explains how a 100 MIPS 4 processor mainframe
>can only support 25 MIPS *per job*)
>
Since we are talking about "*big iron*", let's talk about real big iron.
Cray DD-40 disk drives can support >10MB/sec through the operating
system (at least COS; I assume the case is also true for UNICOS).
And COS also supports disk striping at the user level, so for
sequential reads of a file striped across an entire DS-40 disk
subsystem (20+ GB, 4 drives) a process can achieve sustained rates
of 40MB/sec.  Of course, this is for relatively large (~ 0.5MB)
reads, but these aren't uncommon for the sort of processing Crays
do.

Disk I/O is one of Cray's big selling points vs. the Japanese
super-computer manufacturers--their machines generally have
mainframe (read 4MB/sec) style disk channels.

Mark Nelson                 ...!rutgers!udel!nelson or nelson@udel.edu
This function is occasionally useful as an argument to other functions
that require functions as arguments. -- Guy Steele

pl@etana.tut.fi (Lehtinen Pertti) (09/27/89)

From article <1989Sep26.163307.17238@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <7997@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>... most Unix machines are handicapped by the "standard" unix fs/disk
>>cache.  This cache requires them to do single-block reads, while under AmigaDos
>>the filesystem can ask for large blocks and have them transferred by DMA directly
>>from disk to where the application's read goes...
> 
> It's quite possible to do this under Unix as well, of course, if you've got
> kernel people who seriously care about I/O performance.

	Yes, and if your DMA-controller can manage user buffers spreading
	across several pages all over your memory.

--
pl@tut.fi				! All opinions expressed above are
Pertti Lehtinen				! purely offending and in subject
Tampere University of Technology	! to change without any further
Software Systems Laboratory		! notice

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/28/89)

In article <24950@louie.udel.EDU> nelson@udel.EDU writes:
>Since we are talking about "*big iron*", let's talk about real big iron.
>Cray DD-40 disk drives can support >10MB/sec through the operating
>system (at least COS; I assume the case is also true for UNICOS).

I just tested this on our Cray running Unicos.  The speed was almost
exactly 10MB/sec.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/28/89)

In article <42229@sgi.sgi.com> markb@denali.sgi.com (Mark Bradley) writes:

>I can't yet publish our IPI numbers due to signed non-disclosure, but suffice
>it to say that it would not make sense to go to a completely different controller
>and drive technology for anything less than VERY LARGE performance wins or
>phenomenal cost savings....

You might, however, be able to say what architectural features of your system
and the controller contributed.  For example, is there anything about cache,
memory, etc. that helps a lot?  What controller features are needed?  Which
ones are bad? 

>maximum throughput.  This is something many companies seem to miss the boat
>in doing.  Clearly I am somewhat biased, but the numbers don't lie (see below

I agree.  *Big Iron* machines have been able to provide sustained sequential
reads at 70% of theoretical channel/disk speed on multiple channels,
while providing 70% of CPU time in user CPU state to other CPU bound jobs, 
for at least the past 10 years.  Many of today's workstations have as fast 
CPUs as those machines did then, but, needless to say, the I/O hasn't been 
there.  I am glad to see that this is getting a lot more attention in industry
now.   

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

rod@venera.isi.edu (Rodney Doyle Van Meter III) (09/28/89)

Someone mentioned the Crays as examples of hot I/O boxes.

A friend of mine who knows these things much better than me called the
Cray peripherals "incestuous". Sure they work, but they apparently
rely on a nearly intimate knowledge of the timing quirks of the
processors and other peripherals. Makes them expensive, and makes it
hard to use off-the-shelf parts.

What about Thinking Machines' Data Vault? I've been given to
understand it's actually better than the machines themselves in some
respects.

		--Rod

brooks@vette.llnl.gov (Eugene Brooks) (09/28/89)

In article <9911@venera.isi.edu> rod@venera.isi.edu.UUCP (Rodney Doyle Van Meter III) writes:
>
>What about Thinking Machines' Data Vault? I've been given to
>understand it's actually better than the machines themselves in some
>respects.
Thinking Machines' Data Vault is a fine example of the right way to
build an IO system these days.  Instead of using limited production
high performance drives, you build a highly parallel system using
the same mass production drives you can buy for workstations and throw
in a SECDED controller while you are at it.  The system has 72 drives
implementing a 64 bit wide data path with one bit per drive.  Using current
1.2 Gbyte drives each having a bandwidth of more than a megabyte per second
you could build a self-healing disk system of more than 64 gigabytes with
more than 64 megabytes a second of throughput.  For one of the future
supercomputers built of 1000 microprocessors, each having 8 to 32 Mbytes of
memory, you would need more than one of these disk systems to keep the thing fed.


brooks@maddog.llnl.gov, brooks@maddog.uucp

johng@cavs.syd.dwt.oz (John Gardner) (09/28/89)


In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>Yup. That is because of intelligent channel processors that do DMA to multi-ported
>memory. The same thing SCSI can do. Except with one user, we only need one channel
>(or one for each file). But with UNIX we could use a few more.
>
	One small point to add, the amiga does DMA to dual port ram.  All DMA
goes through a bank of ram called chip ram (because the graphics coprocessors
also use this area) while everything else is run in fast ram.  This is a big 
help as the amiga does have a multitasking operating system (usually single
user though.)
-- 
/*****************************************************************************/
PHONE          : (02) 436 3438
ACSnet         : johng@cavs.dwt.oz
#include <sys/disclaimer.h>

gsh7w@astsun3.acc.Virginia.EDU (Greg Scott Hennessy) (09/28/89)

In article <34298@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
Brooks) writes: 
#Thinking Machines' Data Vault is a fine example of the right way to
#build an IO system these days. 
#The system has 72 drives
#implementing a 64 bit wide data path with one bit per drive. 

What are the extra 8 drives used for? Parity?

-Greg Hennessy, University of Virginia
 USPS Mail:     Astronomy Department, Charlottesville, VA 22903-2475 USA
 Internet:      gsh7w@virginia.edu  
 UUCP:		...!uunet!virginia!gsh7w

vaughan@mcc.com (Paul Vaughan) (09/28/89)

Ok, so we've straightened out the definitions of mini, micro,
mainframe, and super computer for this month.  Now we have to define
*big iron*?   Give me a break!  (Or was choosing a nice nebulous term
that everybody could interpret their own way the whole idea?  :-)

 Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
 Box 200195, Austin, TX 78720  | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan

rlk@think.com (Robert Krawitz) (09/29/89)

In article <34298@lll-winken.LLNL.GOV>, brooks@vette (Eugene Brooks) writes:
]Thinking Machines' Data Vault
					      The system has 72 drives

84, actually.  Specifically, 64 data + 14 ECC, and 6 hot spares (they
can be brought on-line immediately).
-- 
ames >>>>>>>>>  |	Robert Krawitz <rlk@think.com>	245 First St.
bloom-beacon >  |think!rlk				Cambridge, MA  02142
harvard >>>>>>  .	Thinking Machines Corp.		(617)876-1111

bruce@sauron.think.com (Bruce Walker) (09/29/89)

In article <34298@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>Thinking Machines' Data Vault is a fine example of the right way to
>build an IO system these days.  Instead of using limited production
>high performance drives, you build a highly parallel system using
>the same mass production drives you can buy for workstations and throw
>in a SECDED controller while you are at it.  The system has 72 drives
>implementing a 64 bit wide data path with one bit per drive.

Actually, the current DataVaults have 42 drives.  Though the bus to
the DV is 64 bits wide, it is broken down into a 32-bit data path
inside the DV.  There are 32 data drives, 7 ECC drives, and 3 hot
spares, each of which can be switched into any of the other 39
channels.

We also offer double-capacity DVs with 84 drives; no more bandwidth,
just a 2nd tier of drives off of each channel.


--Bruce Walker (Nemnich), Thinking Machines Corporation, Cambridge, MA
  bruce@think.com, think!bruce, bjn@mitvma.bitnet; +1 617 876 1111

brooks@vette.llnl.gov (Eugene Brooks) (09/29/89)

In article <2045@hudson.acc.virginia.edu> gsh7w@astsun3 (Greg Scott Hennessy) writes:
>In article <34298@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
>Brooks) writes: 
>What are the extra 8 drives used for? Parity?
Actually, the data in my article was just a wild guess.  There
is nothing like incorrect data on the USENET to get the boys
at TM to speak up and reveal the facts that don't appear in their
publicly available literature.


brooks@maddog.llnl.gov, brooks@maddog.uucp

mcdonald@aries.uiuc.edu (Doug McDonald) (09/29/89)

>Thinking Machines' Data Vault is a fine example of the right way to
>build an IO system these days.  Instead of using limited production
>high performance drives, you build a highly parallel system using
>the same mass production drives you can buy for workstations and throw
>in a SECDED controller while you are at it.  The system has 72 drives
>implementing a 64 bit wide data path with one bit per drive.  Using current

I remember with great fondness a similar setup on the Illiac IV. It was
so unreliable when that machine first got (sort-of) running that my program,
which didn't use it, got to run for hours while others were waiting 
for the farm to be fixed.

SECDED 	sounds OK for reading - but what about writing? Don't they need
to have an extra disk to take the data that should go to a sick disk
being replaced?

Doug McDonald

pa1159@sdcc13.ucsd.EDU (pa1159) (09/29/89)

In article <24950@louie.udel.EDU> nelson@udel.EDU () writes:
>In article <22488@cup.portal.com> cliffhanger@cup.portal.com (Cliff C Heyer) writes:
>>
>Since we are talking about "*big iron*", let's talk about real big iron.
>Cray DD-40 disk drives can support >10MB/sec through the operating
>system (at least COS; I assume the case is also true for UNICOS).
>And COS also supports disk striping at the user level, so for
>sequential reads of a file striped across an entire DS-40 disk
>subsystem (20+ GB, 4 drives) a process can achieve sustained rates
>of 40MB/sec.  Of course, this is for relatively large (~ 0.5MB)
>reads, but these aren't uncommon for the sort of processing Crays
>do.
>

This brings up a point:  in what processing regimes is total
sustained disk transfer rate the performance-limiting factor?

For a mini/single-user workstation configuration I'd think that the
average access time rather than sustained throughput would be most
important as most I/O transfers would be relatively small.

So, given equal access times, how much of a difference in
interactive workloads does a jump from say 500 KB/s (low end micro
disks) to 3-4 MB/s make in performance?

Of course, for things like massive image processing applications 
sustained throughput is a Good Thing, but for the Rest Of Us, how
much does it really matter?

Matt Kennel
pa1159@sdcc13.ucsd.edu

PS:  The Connection Machine parallel disk subsystem is pretty nifty.
40 simultaneous bitstreams, which when error-corrected &c make a
32-bit word per tick.  You can trash one drive and then reconstruct
its contents from the 39 others.

I don't know the numbers, but I suspect that it's very fast.



Don_A_Corbitt@cup.portal.com (09/29/89)

Warning - posting from newcomer - Disk IO data enclosed

System: 
Northgate 386 16MHz
4MB 32 bit memory on motherboard (paged - 0WS in page, else 1WS)
RLL hard disk - 7.5MBit/sec transfer rate
No RAM or disk cache

Test 1 - How does transfer buffer size affect throughput under
MS-DOS?

Buffer		RLL KB/s	RAMDrive KB/s
512		156		446
1024		192		714
2048		284		1027
4096		352		1316
8192		409		1511
16384		445		1633
32768		471		1700

Test 2 - Using low-level calls, how does throughput differ?
These are still the MS-DOS calls, but using read/write sector,
not read/write file, commands.

Buffer		RLL KB/s	RAMDrive KB/s
512		196		1245
1024		336		2203
2048		381		3206
4096		387		5266
8192		489		6526
16384		567		7367
32767		611		7856

Conclusion - it appears that MSDOS does a MOVSB to copy data from an
internal buffer to the user area.  I did the timing, and that almost
exactly matched the speedup we see going from file IO to raw IO on the
RAM disk.  Note that this disk drive has a maximum burst transfer rate
of 937KB/s, and a maximum sustained rate of around 800KB/s (assuming 0ms 
seek, etc).  So we are able to get >1/2 max performance using the filesystem,
and 3/4 of max using the low-level calls.  

Also, it appears that the memory-memory bandwidth is sufficient for anything
that can get into a 32-bit 16 MHz slot.  Of course, generic peripherals
are looking at an 8MHz 16-bit slot with slow DMA.

Don_A_Corbitt@cup.portal.com - this non-lurking could ruin my reputation

PS - in 1984 I wrote the firmware for a 3.5" floppy drive with performance
in mind - 1:1 interleave, track skewing, etc for the portable Tandy Model 100.
It ran faster than any desktop PC I could find to benchmark it against.  And I 
haven't noticed anyone making the effort to do the same since.  So nobody cares
about Disk IO? 

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/29/89)

In article <1186@sdcc13.ucsd.EDU>, pa1159@sdcc13.ucsd.EDU (pa1159) writes:

|  This brings up a point:  in what processing regimes is total
|  sustained disk transfer rate the performance-limiting factor?

  On a Cray2, swapping! You can have programs using 2GB (yeah, that's
GB) of *real* memory, and when you swap those suckers out... disk
throughput is very important as program size gets larger.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

chen@pooh.cs.unc.edu (Dave) (09/30/89)

In article <2045@hudson.acc.virginia.edu> gsh7w@astsun3 (Greg Scott Hennessy) writes:
>In article <34298@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene
>Brooks) writes: 
>#Thinking Machines' Data Vault is a fine example of the right way to
>#build an IO system these days. 
>#The system has 72 drives
>#implementing a 64 bit wide data path with one bit per drive. 
>
>What are the extra 8 drives used for? Parity?
>


They are there for SEC-DED, i.e. single error correction, double error
detection.  If one of the 64 drives goes bad, the data can be completely
recovered simply by accessing every word in the vault.  When doing a
read the extra 8 bits allow you to tell which bit is wrong.  If two bits are
wrong it can be detected, but not corrected.  The method is described in
many computer architecture books, I think, and is used in most mainframe
memory systems.
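
To make the idea concrete, here is a toy SEC-DED example using an extended
Hamming (8,4) code - 4 data bits, 3 check bits, 1 overall parity bit -
rather than the DataVault's much wider 64+8 arrangement.  A hypothetical
sketch, not Thinking Machines' actual code:

#include <stdio.h>

/* Encode 4 data bits into an 8-bit extended Hamming codeword.
   cw[1..7] are the classic Hamming positions (1, 2, 4 hold check bits),
   cw[0] is an overall parity bit covering all of cw[1..7]. */
static void encode(const int d[4], int cw[8])
{
    int i;

    cw[3] = d[0]; cw[5] = d[1]; cw[6] = d[2]; cw[7] = d[3];
    cw[1] = cw[3] ^ cw[5] ^ cw[7];
    cw[2] = cw[3] ^ cw[6] ^ cw[7];
    cw[4] = cw[5] ^ cw[6] ^ cw[7];
    cw[0] = 0;
    for (i = 1; i <= 7; i++)
        cw[0] ^= cw[i];
}

/* Decode: returns 0 = no error, 1 = single error (corrected in place),
   2 = double error (detected but not correctable). */
static int decode(int cw[8])
{
    int i, syn = 0, overall = 0;

    for (i = 1; i <= 7; i++) {
        if (cw[i])
            syn ^= i;       /* XOR of positions of 1-bits = error position */
        overall ^= cw[i];
    }
    overall ^= cw[0];       /* 1 if the total parity is wrong */

    if (syn == 0 && overall == 0)
        return 0;
    if (overall) {          /* odd number of errors: assume a single one */
        if (syn)
            cw[syn] ^= 1;   /* flip the bad bit back */
        else
            cw[0] ^= 1;     /* the overall parity bit itself was hit */
        return 1;
    }
    return 2;               /* syndrome set but parity even: two errors */
}

int main(void)
{
    int d[4] = { 1, 0, 1, 1 };
    int cw[8];

    encode(d, cw);
    cw[5] ^= 1;                                   /* one "drive" goes bad */
    printf("single error -> %d\n", decode(cw));   /* corrected */
    cw[3] ^= 1; cw[6] ^= 1;                       /* now two go bad */
    printf("double error -> %d\n", decode(cw));   /* detected only */
    return 0;
}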

Dave

_________________________David_T._Chen_(chen@cs.unc.edu)_______________________
It's funny, I hate the itching, but I don't mind the swelling.
		-- David Letterman

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (09/30/89)

  Thanks for posting that data. It looks like a lot of other stuff
previously posted at the DOS call level, so I have more faith in the
BIOS-level numbers. Did you try a CORE test? It shows about 600KB/s for a
cached RLL controller.

  If a disk is rotating at 3600 rpm, and there are 26 sectors of 512
bytes on each track, the burst rate is:
	26*512*3600/60/1024 = 780
Where:
	26	sectors
	512	bytes/sector
	3600	rpm
	60	sec/min
	1024	1K
	780	kilobytes/sec

  I think that shows you are getting close to "all there is."
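
The same arithmetic as a throwaway C function, just restating the formula
above (a hypothetical sketch, nothing new in it):

#include <stdio.h>

/* Burst rate in KB/sec given sectors/track, bytes/sector, and rpm. */
static double burst_kb_s(int sectors, int bytes_per_sector, int rpm)
{
    return (double)sectors * bytes_per_sector * rpm / 60.0 / 1024.0;
}

int main(void)
{
    printf("%.0f KB/sec\n", burst_kb_s(26, 512, 3600));   /* prints 780 */
    return 0;
}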
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

toivo@uniwa.uwa.oz (Toivo Pedaste) (10/02/89)

>Actually, the current DataVaults have 42 drives.  Though the bus to
>the DV is 64 bits wide, it is broken down into a 32-bit data path
>inside the DV.  There are 32 data drives, 7 ECC drives, and 3 hot
>spares, each of which can be switched into any of the other 39
>channels.

What I've wondered about such a configuration is how you bring a disk
back on line after it has failed. Do you rebuild the information on
it by reading the other drives and using the ECC? If so how long does
it take and what effect does it have on the performance of the system?

Just curious.
-- 
	Toivo Pedaste				ACSNET: toivo@uniwa.uwa.oz

daveb@rtech.rtech.com (Dave Brower) (10/03/89)

some people wrote:
>>Cray DD-40 disk drives can support >10MB/sec through the operating
>>system (at least COS; I assume the case is also true for UNICOS).
>
>This brings up a point:  in what processing regimes is total
>sustained disk transfer rate the performance-limiting factor?
>

In many tp/database/business applications, the CPU is fast enough that disk
bandwidth will soon be the limiting factor.  Some
airline reservation systems are said to have huge farms of disk where
only one or two tracks are used on the whole pack to avoid seeks, for
instance.  A 1000 tp/s database benchmark might easily require 10MB/sec
i/o throughput.  

Maybe Cray should change markets...

-dB
-- 
"Did you know that 'gullible' isn't in the dictionary?"
{amdahl, cbosgd, mtxinu, ptsfa, sun}!rtech!daveb daveb@rtech.uucp

philf@xymox.metaphor.com (Phil Fernandez) (10/06/89)

In article <3752@rtech.rtech.com> daveb@rtech.UUCP (Dave Brower) writes:
> ... Some
>airline reservation systems are said to have huge farms of disk where
>only one or two tracks are used on the whole pack to avoid seeks, for
>instance.

No, I don't think so.

I did a consulting job for United Airlines' Apollo system a couple of
years ago, looking for architectures to break the 1000t/s limit.  We
looked at distributing transactions to many processors and disks,
etc., etc., but nothing quite so profligate as using only a couple of
tracks (or cyls) on a 1GB disk pack in order to minimize seeks.

On the *big iron* that UAL and other reservations systems use, the
operating systems (TPFII and MVS/ESA) implement very sophisticated
disk management algorithms, and in particular, implement elevator
seeking. 

With elevator seeking, disk I/O's in the queue are ordered in such a
way as to minimize seek latency between I/O operations.  In an I/O-
intensive tp application with I/O's spread across multiple disk packs,
a good elevator scheduling scheme is all that's needed to get the
appropriate disk I/O bandwidth.
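
The core of the idea is small - here is a minimal sketch of elevator (SCAN)
ordering over a queue of pending cylinder numbers (hypothetical; the real
TPF/MVS schedulers are of course far more elaborate):

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Serve requests at or above the current head position on the way up,
   then the remainder on the sweep back down. */
static void elevator(int head, int *req, int n)
{
    int i, up;

    qsort(req, n, sizeof req[0], cmp);
    for (up = 0; up < n && req[up] < head; up++)
        ;
    for (i = up; i < n; i++)
        printf("seek to cylinder %d\n", req[i]);
    for (i = up - 1; i >= 0; i--)
        printf("seek to cylinder %d\n", req[i]);
}

int main(void)
{
    int pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };

    elevator(53, pending, 8);
    return 0;
}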

Makes for a good story, tho!

phil





+-----------------------------+----------------------------------------------+
| Phil Fernandez              |             philf@metaphor.com               |
|                             |     ...!{apple|decwrl}!metaphor!philf        |
| Metaphor Computer Systems   |"Does the body rule the mind, or does the mind|
| Mountain View, CA           | rule the body?  I dunno..." - Morrissey      |
+-----------------------------+----------------------------------------------+

jesup@cbmvax.UUCP (Randell Jesup) (10/07/89)

>>Yup. That is because of intelligent channel processors that do DMA to
>>multi-ported memory. The same thing SCSI can do. Except with one user, we
>>only need one channel (or one for each file). But with UNIX we could use a
>>few more.
>>
>	One small point to add, the amiga does DMA to dual port ram.  All DMA
>goes through a bank of ram called chip ram (because the graphics coprocessors
>also use this area) while everything else is run in fast ram.  This is a big 
>help as the amiga does have a multitasking operating system (usually single
>user though.)

	A correction: some amiga disk controllers DMA to dual-ported memory.
Some DMA directly to system memory.  A few don't use DMA at all (but are
slightly cheaper).

	DMA to DP memory has some advantages, but speed isn't usually one of
them on the Amiga.  This is because the data has to cross the bus at least
one extra time, and of course the processor load increases.

	DMA straight to system memory (essentially any memory in the system)
is faster if done right.  FIFO's are important here to avoid DMA overruns.
This, combined with the Amiga FastFileSystem allows data to often be DMA'd
directly into the application's destination for the read (or from it for
a write).  This improves performance even more.

	"chip ram" is ram that the graphics/audio/floppy/etc coprocessors
can access directly.  This is currently 1Meg (used to be 512K).  Expansion
devices (like HD controllers) can DMA to any memory they want to (including
"chip ram").

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

cliffhanger@cup.portal.com (Cliff C Heyer) (10/08/89)

Cliff wrote...
>>	Yes, they (IBM and DEC - and all the rest, including DG) will
>>save the bandwidth and fast i/o for the *big iron* machines AND the
>>high-end "workstations".  This is just what the main problem is!  The
Robert Cousins responded....
>If you decide that $8000 is high end, then you are right, but frankly,
>I think your facts need to double checked.  There are currently only 4 
>DG 88k workstations which all have approximately the same I/O bandwidth.
>While I will admit that it is quite fast, the I/O performance of the 
>low end is almost the same as the high end.  The major difference is in
>CPU speed.

The "high end" is the SERVER with the VME bus. Note that the "low cost"
VME bus to be introduced for the workstations is alleged not to do block I/O,
thus limiting its throughput.

>DMA is a must for performance at any level whenever you have
>more than one task running at a time.

Yup, which is why I've been looking for a 386 board maker that has put
an ESDI or SCSI controller "on board" and bypassed the AT-bus with
direct DMA channel(s). (Like the Amiga.) So far, it looks like the Mylex
MX386 is the only one (?).  I'm hoping that if I subscribe to Computer
Architecture News I might learn more!

news@rtech.rtech.com (USENET News System) (10/10/89)

In article <829@metaphor.Metaphor.COM> philf@xymox.metaphor.com (Phil Fernandez) writes:
>In article <3752@rtech.rtech.com> daveb@rtech.UUCP (Dave Brower) writes:
>> ... Some
>>airline reservation systems are said to have huge farms of disk where
>>only one or two tracks are used on the whole pack to avoid seeks, for
>>instance.
>With elevator seeking, disk I/O's in the queue are ordered in such a
>way as to minimize seek latency between I/O operations.  

A number of techniques which we used on a VAX-based TP exec called the
Transaction Management eXecutive-32 (TMX-32) were:

	- per disk seek ordering - as stated above

	- which disk seek ordering - with mirrored disks, choose the
	disk with the heads closest to the part of the disk you're
	gonna read. (sometimes just flip-flopping between the two is
	enough.)

	- coalesced transfers - for instance, if you need to read
	track N, N+3 and N+7 it's sometimes faster to read tracks N to
	N+7 and sort out the transfers in memory.

	- single-read-per-spindle-per-transaction - split up heavily
	accessed files over N spindles, mapping logical record M to
	disk (M mod N), physical record (M div N), such that on the
	average only one disk seek needs to be made per transaction
	(in parallel, of course; see the sketch below). This is
	worthwhile when the transactions are well defined.
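
A sketch of that record-to-spindle mapping (hypothetical numbers, just to
show the arithmetic):

#include <stdio.h>

#define NSPINDLES 4    /* N spindles holding one heavily accessed file */

int main(void)
{
    long m;

    /* logical record M lives on disk (M mod N), at slot (M div N) */
    for (m = 0; m < 12; m++)
        printf("logical record %2ld -> disk %ld, physical record %ld\n",
               m, m % NSPINDLES, m / NSPINDLES);
    return 0;
}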

This task became considerably more difficult when DEC introduced the HSC-50
super-smart, caching disk controller for the VAXcluster and the
RA-style disks:

	1) it was impossible to know the PHYSICAL location of a disk
	block, due to dynamic, transparent bad-block revectoring and 
	lack of on-line information about the disk geometry.  We
	placed the files carefully on the disk so that they started on
	a cylinder boundary, adjacent to other files, and assumed
	that they were "one dimensional."

	2) Some of the optimizations were done in the HSC itself, so
	we didn't do them on HSC disks (seek ordering and command
	ordering).

	3) HSC volume shadowing made the optimizations to our
	home-grown shadowing obsolete.  We kept our shadowing to use
	in non-HSC environments, like uVAXes and locally connected
	disks, and because it was per-file based, not per volume.

Using these techniques, I ran the million-customer TP benchmark @76
TPS on a VAX 8600 (~4 MIPS).  I don't remember the $/TPS (of course),
but it might have been pretty high because there were a LOT of disk
drives. We might have eked out a few more TPS if we had physical
control over the placement of the disk blocks, but probably not more
than a few.  I also felt that I never knew what the disk was 'really
doing' because so much was hidden in the HSC; being the computer
programmer that I am, I wanted to know where each head was at each
millisecond :->.

(The 76TPS bottleneck was the mirrored journal disk, which, although
it was written sequentially, still had to be written to at the close
of each transaction.  The next step would have been to
allow multiple journal files, but since the runner-up was about 30TPS,
we never got around to it :->.)

As an aside, for you HSC fans building this kind of stuff, it is
possible that large write I/Os to an HSC-served disk will be broken up
into multiple physical I/O operations to the disk.  This means that if
you are just checking headers and trailers for transaction checkpoint
consistency, you may have bogus stuff in the middle with perfectly
valid header and trailer information if the HSC crashed during the
I/O.
- bob
+-------------------------+------------------------------+--------------------+
! Bob Pasker		  ! Relational Technology	 !          	      !
! pasker@rtech.com        ! 1080 Marina Villiage Parkway !    INGRES/Net      !
! <use this address> 	  ! Alameda, California 94501	 !		      !
! <replies will fail>	  ! (415) 748-2434               !                    !
+-------------------------+------------------------------+--------------------+