[comp.lang.misc] Sorting

dmr@alice.att.com (Dennis Ritchie) (11/17/90)

This group has turned into a sewer, as bad as talk.origins
(even the people I might politically agree with are acting dumb).

But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
each containing 10,000,000 100-byte binary records
without any substructure.  I need to get these records sorted.
Can someone quote me a cost and expected delivery time?

This is just the start of a burgeoning business; next week
I expect a full trainload of 100 such boxcars.
What will this cost and how long will it take your company
to handle it?

Since I understand that this will run into big bucks, I trust
your company can give a brief account of its technical approach
before I write the purchase orders.

	Dennis Ritchie

schow@bcarh185.bnr.ca (Stanley T.H. Chow) (11/20/90)

In article <11628@alice.att.com> dmr@alice.att.com (Dennis Ritchie) writes:
>This group has turned into a sewer, as bad as talk.origins
>(even the people I might politically agree with are acting dumb).
>
>But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
>each containing 10,000,000 100-byte binary records
>without any substructure.  I need to get these records sorted.
>Can someone quote me a cost and expected delivery time?

Ordinarily, I avoid topics like this. Seeing that it is Dennis Ritchie, just
my be I can get rich on this after all, so here is my proposal. :-)

Assumptions: Input - you hand me a boxcar of tapes, (figuratively speaking)
	     Output - I throw you 10,000 tapes back; in sequence, one tape
		      at a time (hope you have a good catching glove).
		      (Each output tape has 10,000,000 records exactly).
             Sort Key - is the entire 100-byte record, duplicates are
			allowed.

Observations:
	     1) Key space is 2 ** 800 (roughly 6 * 10 **240)
	     2) total "file" size = 10,000 * 10,000,000 * 100 bytes
				  = 10 ** 13 bytes.
             3) Companies like SyncSort Inc. (and others) make their money
		soving problems like this. 
                             
I will give you two quotes:

  1) Minimum turn-around time.

     I will need a bunch of tape drives along with a bunch of disk drives
     connected by a network. Stuf an input tape into each tape drive, read
     the records and throw each record into a disk. Unload the input tapes,
     load blank output tapes, and copy 10,000,000 records onto each tapes.

     Elapse time: approximately two times R/W time of a single tape.

     Cost: option a: you provide the hardware
		     I charge 1% of the hardware cost for processing.

           option b: I provide the hardware
		     Sorry, I want to get paid in my life time, so I can't
		     wait for the delivery of all that hardware.


  2) Minimum hardware cost.

     I will buy a really cheap CPU (Commodore 64 comes to mind), hook up a
     single tape drive.

     [I will skip the gory details of how. If you really want to know how,
     look up any sort program on the mainframes - SyncSort is a good example.
     They will tell you how to do it will three drives, cutting down to a 
     single drive is simple, but tedious]

     Elapse time: approximately the expected life time of the universe.

     Cost: option a: Total payment up front - $10K (negotiable)

	   option b: pay as you go: nothing down,
                                    $10 a month for the duration. (negotiatable).


>
>This is just the start of a burgeoning business; next week
>I expect a full trainload of 100 such boxcars.
>What will this cost and how long will it take your company
>to handle it?

By a remarkable coincidence (and extreme lazyness), my quotes for this problem
are exactly the same as the previous smaller problem. This conclusively
demonstrates the advantage of dealing with a service-oriented compnay like
us - "The same price for all your problems". 

Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!uunet!bnrgate!bcarh185!schow
(613) 763-2831               ..!psuvax1!BNR.CA.bitnet!schow
Me? Represent other people? Don't make them laugh so hard.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (11/20/90)

In article <11628@alice.att.com> dmr@alice.att.com (Dennis Ritchie) writes:
  [ real words from a Real Programmer ]
> But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
> each containing 10,000,000 100-byte binary records
> without any substructure.  I need to get these records sorted.
> Can someone quote me a cost and expected delivery time?

I'm afraid I'm not in a position to contract out NYU's computer time to
AT&T, but I'll try to answer your question.

Let me assume a mainframe with 100M memory usable for data, internal
speed about ten times a Sun 4, and a 10-terabyte 5000-disk farm.

To sort as much data as will fit into memory will take 20 seconds with
my current software, for a total CPU time of 23 days. The final merge
shuttles a total of 170 terabytes to and from disk, using approximately
5 days of CPU time and relatively little I/O time.

If we can read (by some miracle) 100K a second on a hundred separate
tape drives, reading all your data takes 11 days. Add another 11 days
for writing back to tape (though we suggest a better storage method).

Total: Only as much time as Noah spent in his ark. We'll give you a
guaranteed time of twelve weeks, expected nine. The disk farm is the
biggest cost: including controllers and an IBM to handle it all, the
disks could cost up to $10M initially. Assuming we do this all the time
and already have disks, you just pay something more than maintenance
costs plus floor space plus personnel plus CPU time for two months.

So (without any experience with how much a company might actually charge
for a project of such a size) I'd estimate nine weeks and $1,000,000.

> This is just the start of a burgeoning business; next week
> I expect a full trainload of 100 such boxcars.
> What will this cost and how long will it take your company
> to handle it?

Approximately 100 times the cost and approximately 100 times the time.
However, a 500000-disk farm is much more than 100 times less realistic
than a 5000-disk farm; and I don't think anyone's going to embark on a
project taking a couple of decades to complete.

---Dan

rh@smds.UUCP (Richard Harter) (11/20/90)

In article <11628@alice.att.com>, dmr@alice.att.com (Dennis Ritchie) writes:
> This group has turned into a sewer, as bad as talk.origins
> (even the people I might politically agree with are acting dumb).

	.... Snicker ....

> But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
> each containing 10,000,000 100-byte binary records
> without any substructure.  I need to get these records sorted.
> Can someone quote me a cost and expected delivery time?

> This is just the start of a burgeoning business; next week
> I expect a full trainload of 100 such boxcars.
> What will this cost and how long will it take your company
> to handle it?

> Since I understand that this will run into big bucks, I trust
> your company can give a brief account of its technical approach
> before I write the purchase orders.

Are you ever in luck!  Trust me, I've got the answer to your problem.
My cousin Sid, who is a part time sales executive for the Digicomputronics
division of Nocturnal Aviation and Screen Door Corporation will be in
touch with you very shortly.

As you may know Digicomputronics holds the patent on the inverted-Q-list
linear boxcar sort algorithm.  Sid will be with you shortly.  All you
will need to do is to sign a non-disclosure agreement and give Sid a 
certified non-refundable check for $10,000 and you will be in business.

[If you don't have the ten G's handy, get him drunk (easy to do) and
offer him a bottle of Chiva Regal.]

Oh, by the way.  You do have a computer with 2*100 bytes of fast mnemory
and ten thousand tape drives, don't you?


-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

tneff@bfmny0.BFM.COM (Tom Neff) (11/21/90)

In article <11628@alice.att.com> dmr@alice.att.com (Dennis Ritchie) writes:
>But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
>each containing 10,000,000 100-byte binary records
>without any substructure.  I need to get these records sorted.
>Can someone quote me a cost and expected delivery time?

Dennis, it'll take some work to gauge the problem, then we'll be on
track.  We're not one of those outfits that tries to railroad you into
one solution, only to switch you later on.  Our engineers will go to
work decoupling the problem, as we train them to do, and I'm not just
blowing smoke when I tell you that the solution we tender will really
make the grade.

By the way, have you got a dozen or so empty rolling stock we can use
for tempfiles??  :-)

-- 
When considering "victim's rights," remember  || vv  Tom Neff
that an innocent defendant is also a victim.  ^^ ||  tneff@bfmny0.BFM.COM

gsh7w@astsun.astro.Virginia.EDU (Greg Hennessy) (11/24/90)

In article <16051@bfmny0.BFM.COM> tneff@bfmny0.BFM.COM (Tom Neff) writes:
#Dennis, it'll take some work to gauge the problem, then we'll be on
#track.  

Are you trying to guage the reaction that you get?


--
-Greg Hennessy, University of Virginia
 USPS Mail:     Astronomy Department, Charlottesville, VA 22903-2475 USA
 Internet:      gsh7w@virginia.edu  
 UUCP:		...!uunet!virginia!gsh7w

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (11/26/90)

In article <11628@alice.att.com>, dmr@alice.att.com (Dennis Ritchie) writes:
> But I do have a serious question.  I've got a boxcar with 10,000 8-mm tapes,
> each containing 10,000,000 100-byte binary records
> without any substructure.  I need to get these records sorted.
> Can someone quote me a cost and expected delivery time?

This is a very interesting question, because the answer isn't entirely
a computer science one.  This is clearly outside the scope of internal
sorting.  I'd go straight to Knuth V3, and of course to the specs of the
8-mm tape drives.  However, the absolute *minimum* is obviously that each
tape must be located, put in a drive, removed from the drive, and restored
to its place.  Say 5--10 minutes of handling time.  We're talking about an
absolute minimum of somewhere around 25 man-weeks of "operator" time just
to read each record once.  The cost is thus going to depend on the cost of
labour.  For the 100-boxcar version, it might even be worth building
special tape-handling machinery.

> Since I understand that this will run into big bucks, I trust
> your company can give a brief account of its technical approach
> before I write the purchase orders.

How will "trade secret" do (:-)?  Non-"technical" factors such as
"we're going to ship it to the Far East" might easily dominate
technical ones for a problem this size.  I trust that any company
approached for real with this problem would start by saying "why
do you need to have these things sorted; can we help you change your
set-up so they're generated in the right order; ...".   The fastest
way to do anything is after all to ensure that it doesn't _need_ doing.
-- 
I am not now and never have been a member of Mensa.		-- Ariadne.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (11/27/90)

In article <4373@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> However, the absolute *minimum* is obviously that each
> tape must be located, put in a drive, removed from the drive, and restored
> to its place.  Say 5--10 minutes of handling time.  We're talking about an
> absolute minimum of somewhere around 25 man-weeks of "operator" time just
> to read each record once.

Yes. In my analysis I assumed a reasonable number of operators to load
the tape farm at the right times. This was a significant fraction of
the $1000000 cost---after all, the real cost for one person is about
$100K a year.

---Dan