[comp.databases] *big iron*

daveb@rtech.rtech.com (Dave Brower) (10/03/89)

some people wrote:
>>Cray DD-40 disk drives can support >10MB/sec through the operating
>>system (at least COS; I assume the case is also true for UNICOS).
>
>This brings up a point:  in what processing regimes is total
>sustained disk transfer rate the performance-limiting factor?
>

In many TP/database/business applications, the CPU is fast enough that
disk bandwidth will soon be the limiting factor.  Some airline
reservation systems are said to have huge farms of disks where only
one or two tracks are used on each pack to avoid seeks, for instance.
A 1000 tp/s database benchmark might easily require 10MB/sec of I/O
throughput.
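
That figure is easy to sanity-check.  Back of the envelope (the ~10KB
of physical I/O per transaction below is an assumed number, not part
of any benchmark spec):

/* rough sanity check: bytes/sec needed at a given transaction rate.
 * the 10KB-per-transaction figure is an assumption, not a spec. */
#include <stdio.h>

int main(void)
{
    double tps = 1000.0;                   /* target transaction rate      */
    double bytes_per_txn = 10.0 * 1024.0;  /* assumed physical I/O per txn */

    printf("required throughput: %.1f MB/sec\n",
           tps * bytes_per_txn / (1024.0 * 1024.0));
    return 0;
}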

Maybe Cray should change markets...

-dB
-- 
"Did you know that 'gullible' isn't in the dictionary?"
{amdahl, cbosgd, mtxinu, ptsfa, sun}!rtech!daveb daveb@rtech.uucp

philf@xymox.metaphor.com (Phil Fernandez) (10/06/89)

In article <3752@rtech.rtech.com> daveb@rtech.UUCP (Dave Brower) writes:
> ... Some
>airline reservation systems are said to have huge farms of disk where
>only one or two tracks are used on the whole pack to avoid seeks, for
>instance.

No, I don't think so.

I did a consulting job for United Airlines' Apollo system a couple of
years ago, looking for architectures to break the 1000 t/s limit.  We
looked at distributing transactions to many processors and disks,
etc., etc., but nothing quite so profligate as using only a couple of
tracks (or cyls) on a 1GB disk pack in order to minimize seeks.

On the *big iron* that UAL and other reservation systems use, the
operating systems (TPFII and MVS/ESA) implement very sophisticated
disk management algorithms, in particular elevator seeking.

With elevator seeking, disk I/Os in the queue are ordered in such a
way as to minimize seek latency between operations.  In an I/O-
intensive TP application with I/Os spread across multiple disk packs,
a good elevator scheduling scheme is all that's needed to get the
appropriate disk I/O bandwidth.
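
Concretely, the sweep looks something like this toy sketch (plain C,
just the idea in miniature; it has nothing to do with the actual TPF
or MVS internals):

/* Toy elevator (SCAN) scheduler: service queued requests in cylinder
 * order along the current sweep direction, then reverse.  Two ordered
 * passes instead of one random seek per request. */
#include <stdio.h>
#include <stdlib.h>

struct ioreq { int cyl; };            /* target cylinder of a queued I/O */

static int by_cyl(const void *a, const void *b)
{
    return ((const struct ioreq *)a)->cyl - ((const struct ioreq *)b)->cyl;
}

static void elevator(struct ioreq *q, int n, int head)
{
    int i, up = 0;

    qsort(q, n, sizeof *q, by_cyl);
    while (up < n && q[up].cyl < head)
        up++;

    for (i = up; i < n; i++)          /* upward sweep   */
        printf("seek to cyl %d\n", q[i].cyl);
    for (i = up - 1; i >= 0; i--)     /* downward sweep */
        printf("seek to cyl %d\n", q[i].cyl);
}

int main(void)
{
    struct ioreq q[] = { {95}, {10}, {40}, {77}, {12} };

    elevator(q, sizeof q / sizeof q[0], 50);   /* heads at cylinder 50 */
    return 0;
}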

Makes for a good story, tho!

phil





+-----------------------------+----------------------------------------------+
| Phil Fernandez              |             philf@metaphor.com               |
|                             |     ...!{apple|decwrl}!metaphor!philf        |
| Metaphor Computer Systems   |"Does the body rule the mind, or does the mind|
| Mountain View, CA           | rule the body?  I dunno..." - Morrissey      |
+-----------------------------+----------------------------------------------+

news@rtech.rtech.com (USENET News System) (10/10/89)

In article <829@metaphor.Metaphor.COM> philf@xymox.metaphor.com (Phil Fernandez) writes:
>In article <3752@rtech.rtech.com> daveb@rtech.UUCP (Dave Brower) writes:
>> ... Some
>>airline reservation systems are said to have huge farms of disk where
>>only one or two tracks are used on the whole pack to avoid seeks, for
>>instance.
>With elevator seeking, disk I/O's in the queue are ordered in such a
>way to minimize seek latency between I/O operations.  

A number of techniques we used on a VAX-based TP exec called the
Transaction Management eXecutive-32 (TMX-32) were:

	- per disk seek ordering - as stated above

	- which-disk seek ordering - with mirrored disks, choose the
	disk whose heads are closest to the part of the disk you're
	gonna read.  (Sometimes just flip-flopping between the two is
	enough.)

	- coalesced transfers - for instance, if you need to read
	tracks N, N+3, and N+7, it's sometimes faster to read tracks N
	through N+7 and sort out the transfers in memory (a small
	coalescing sketch follows this list).

	- single-read-per-spindle-per-transaction - split up heavily
	accessed files over N spindles, mapping logical record M to
	disk (M mod N), physical record (M / N), so that on average
	only one disk seek needs to be made per transaction (in
	parallel, of course).  This is worthwhile when the
	transactions are well defined.  (A sketch of the
	record-to-spindle mapping also follows this list.)
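
Here is a rough sketch of the coalescing idea; the MAX_GAP break-even
threshold is an arbitrary illustration, not a number from TMX-32:

/* Coalesce nearby reads: given a sorted list of wanted tracks, issue
 * one contiguous range read when the gaps are small enough that
 * transferring the extra tracks beats extra seeks and revolutions. */
#include <stdio.h>

#define MAX_GAP 4   /* assumed break-even gap, in tracks */

static void coalesce(const int *track, int n)
{
    int i, start = 0;

    for (i = 1; i <= n; i++) {
        if (i == n || track[i] - track[i - 1] > MAX_GAP) {
            printf("read tracks %d..%d, keep only the wanted ones\n",
                   track[start], track[i - 1]);
            start = i;
        }
    }
}

int main(void)
{
    int want[] = { 10, 13, 17, 40 };    /* N, N+3, N+7, plus a far one */

    coalesce(want, sizeof want / sizeof want[0]);
    return 0;
}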
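
And a minimal sketch of the record-to-spindle arithmetic (names and
the 4-spindle example are for illustration only):

/* Striping arithmetic: logical record m of a file spread over
 * nspindles disks lands on disk (m mod nspindles) at physical
 * record (m / nspindles). */
#include <stdio.h>

struct location { int spindle; int physrec; };

static struct location map_record(int m, int nspindles)
{
    struct location loc;

    loc.spindle = m % nspindles;    /* which disk holds the record */
    loc.physrec = m / nspindles;    /* where it lives on that disk */
    return loc;
}

int main(void)
{
    int m, n = 4;                   /* e.g. a file striped over 4 spindles */

    for (m = 0; m < 8; m++) {
        struct location loc = map_record(m, n);
        printf("logical record %d -> spindle %d, physical record %d\n",
               m, loc.spindle, loc.physrec);
    }
    return 0;
}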

This task became considerably more difficult when DEC introduced the
HSC-50 super-smart, caching disk controller for the VAXcluster and
the RA-style disks:

	1) It was impossible to know the PHYSICAL location of a disk
	block, due to dynamic, transparent bad-block revectoring and
	the lack of on-line information about the disk geometry.  We
	placed the files carefully on the disk so that they started on
	a cylinder boundary, adjacent to other files, and assumed
	that they were "one dimensional."

	2) Some of the optimizations (seek ordering and command
	ordering) were done in the HSC itself, so we didn't do them
	on HSC disks.

	3) HSC volume shadowing made the optimizations in our
	home-grown shadowing obsolete.  We kept our shadowing for use
	in non-HSC environments, like uVAXes and locally connected
	disks, and because it was per-file based, not per-volume.

Using these techniques, I ran the million-customer TP benchmark at 76
TPS on a VAX 8600 (~4 MIPS).  I don't remember the $/TPS (of course),
but it might have been pretty high because there were a LOT of disk
drives.  We might have eked out a few more TPS if we had physical
control over the placement of the disk blocks, but probably not more
than a few.  I also felt that I never knew what the disk was 'really
doing' because so much was hidden in the HSC; being the computer
programmer that I am, I wanted to know where each head was at each
millisecond :->.

(The 76 TPS bottleneck was the mirrored journal disk which, although
it was written sequentially, still had to be written at the close of
each transaction.  The next step would have been to allow multiple
journal files, but since the runner-up was about 30 TPS, we never got
around to it :->.)

As an aside, for you HSC fans building this kind of stuff, it is
possible that large write I/Os to an HSC-served disk will be broken up
into multiple physical I/O operations to the disk.  This means that if
you are just checking headers and trailers for transaction checkpoint
consistency, you may have bogus stuff in the middle with perfectly
valid header and trailer information if the HSC crashed during the
I/O.
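
Put another way: a matching header and trailer isn't sufficient by
itself once a single logical write can be split.  A checksum over the
whole block would close the hole; the sketch below only illustrates
the failure mode and that fix, and is not anything the HSC or TMX-32
actually did:

/* Why header/trailer checking alone can be fooled by a split write:
 * if the controller breaks one logical write into several physical
 * transfers and dies partway through, the first and last pieces can
 * both carry the new sequence number while a middle piece is stale. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PAYLOAD 4096

struct journal_block {
    unsigned long seq_hdr;          /* sequence number in the header  */
    char          payload[PAYLOAD];
    unsigned long checksum;         /* covers header + payload        */
    unsigned long seq_trl;          /* sequence number in the trailer */
};

/* Trivial whole-block sum; a real system would use a stronger check. */
static unsigned long sum_bytes(const void *p, size_t n)
{
    const unsigned char *b = p;
    unsigned long s = 0;

    while (n--)
        s += *b++;
    return s;
}

/* header == trailer is necessary but NOT sufficient; the checksum over
 * the whole block catches a stale middle segment. */
static int block_ok(const struct journal_block *jb)
{
    if (jb->seq_hdr != jb->seq_trl)
        return 0;
    return jb->checksum ==
           sum_bytes(jb, offsetof(struct journal_block, checksum));
}

int main(void)
{
    static struct journal_block jb;

    jb.seq_hdr = jb.seq_trl = 42;
    memset(jb.payload, 'x', PAYLOAD);
    jb.checksum = sum_bytes(&jb, offsetof(struct journal_block, checksum));

    printf("block_ok = %d\n", block_ok(&jb));
    return 0;
}
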
- bob
+-------------------------+------------------------------+--------------------+
! Bob Pasker		  ! Relational Technology	 !          	      !
! pasker@rtech.com        ! 1080 Marina Village Parkway  !    INGRES/Net      !
! <use this address> 	  ! Alameda, California 94501	 !		      !
! <replies will fail>	  ! (415) 748-2434               !                    !
+-------------------------+------------------------------+--------------------+