Robert.Biddle@comp.vuw.ac.nz (Robert Biddle) (01/28/91)
Is External Sorting done much anymore? External Sorting used to imagine that you had some file (usually thought of as being on several tapes, say) that was too big to sort in primary memory - so you used whatever memory you had in conjunction with whatever tape drives you had. (See Knuth III:5.4 for details.) Is it used anymore at all, either using traditional tape methods, or in some new form? [I wasn't sure where to ask this question, so if anyone has any better suggestions please mail me.] -- -------- Robert Biddle, Computer Science, Victoria University, Wellington Internet: Robert.Biddle@Comp.VUW.Ac.NZ NEW ZEALAND Telecom: Voice +64 4 721-000 ext. 8546; Facsimile +64 4 712-070
drake@drake.almaden.ibm.com (01/28/91)
In article <1991Jan28.031017.19886@comp.vuw.ac.nz> robert@comp.vuw.ac.nz (Robert Biddle) writes: >Is External Sorting done much anymore? Oh, heavens, YES! Electronic Data Processing ... COBOL, accountants, that kind of stuff ... is essentially all about inputting data to SORT, and the data files used are absolutely, positively too big to fit in main memory. It is very rare these days that tapes are used as work files for sort (disk sort-work files are typical), but both tapes and disks are commonly used as input and output files for sort, with disk sort-work files being essential for the processing to operate. No matter how fast, a main-memory sort routine is essentially unusable for commercial data processing applications. Sam Drake / IBM Almaden Research Center Internet: drake@ibm.com BITNET: DRAKE at ALMADEN Usenet: ...!uunet!ibmarc!drake Phone: (408) 927-1861
johnl@iecc.cambridge.ma.us (John R. Levine) (01/29/91)
In article <1991Jan28.031017.19886@comp.vuw.ac.nz> robert@comp.vuw.ac.nz (Robert Biddle) writes: >Is External Sorting done much anymore? It sure is. The traditional rule of thumb is that 1/3 of all commercial computing is sorting, and I see no reason to think that is any less true now than it used to be. The Unix "sort" command which is respectable but not stellar in its performance gets plenty of use on most systems. External sorts seem all to be variations on the same basic plan, in which you put sorted chunks of the input into external storage, then merge those chunks together. These days, the external storage is invariably disks rather than tapes, but that doesn't to affect the algorithms much, since reading a disk sequentially is much faster than reading it randomly. IBM mainframes and their operating systems have lots of features that are specifically for speeding up sorts. For example, under MVS you can start read operations on several files, then process the records in whatever order the physical reads happen to finish. This is useful for merging files together. Also, recent 370 CPUs have microcoded instructions that accelerate the inner comparison loops. Mainframe tape drives have always had the ability to read tapes backwards, since an oscillating sort that merges alternate passes forward and backwards avoids time-consuming rewinds between passes. It's also worth noting that since sorting is a lengthy and repetitive task, there are all sorts of performance hacks commonly done in sort programs that are rare elsewhere. In all serious sort programs, the sort specification (i.e. what kind of keys, and where in the record) is compiled into a machine code comparison routine, one of the earliest and still most important uses of code modified at runtime. In many cases, the sort program is actually a driver with a library of routines tuned for different file sizes and media, and it links the comparison routine with library routines and user provided "exit routines" to produce a custom sort program for each sort job. These days I'm sure there are all sorts of hacks to improve paging performance, since sorting and merging have predictable access patterns that are often not LRU. Sorting is a very competitive business (the main competitors I'm aware of are Syncsort and IBM) and the details of their performance-enhancing tricks are jealously guarded trade secrets. -- John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl " #(ps,#(rs))' " - L. P. Deutsch and C. N. Mooers
roy@phri.nyu.edu (Roy Smith) (01/29/91)
johnl@iecc.cambridge.ma.us (John R. Levine) writes: > under MVS you can start read operations on several files, then process > the records in whatever order the physical reads happen to finish. How is this any different from what you can do in BSD systems by issuing several non-blocking reads and then select(2)ing whichever one finishes first? I think the real question is not so much whether the software supports the operation but whether the I/O system is macho enough to keep up with 8 (or more) 6250 drives, all running at 100+ ips. Keep in mind that what passes for a smart device controller on most Unix boxes today (say, a Ciprico Rimfire disk controller for a Sun) pales in comparison to the sorts of things you find on IBM big iron. > Mainframe tape drives have always had the ability to read tapes > backwards, since an oscillating sort that merges alternate passes forward > and backwards avoids time-consuming rewinds between passes. My trusty 1975 pdp-11 peripherals handbook indicates that the TU-16 (TJU-16?) could do reverse reads. Was there ever a version of Unix which supported that? I could see doing it either by adding a special ioctl, or by adopting the convention that a seek backwards of a single block on a tape device would be cached, and if immediately followed by a read of a single block, the two would be folded into a single read-reverse operation. -- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy "Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
yossie@fnal.fnal.gov (Yossie Silverman) (01/30/91)
In article <467@rufus.UUCP> drake@drake.almaden.ibm.com writes: > It is very rare these days that tapes are used as work files for sort > (disk > sort-work files are typical), but both tapes and disks are commonly used > as > input and output files for sort, with disk sort-work files being > essential > for the processing to operate. > > No matter how fast, a main-memory sort routine is essentially unusable > for commercial data processing applications. > > > Sam Drake / IBM Almaden Research Center > Internet: drake@ibm.com BITNET: DRAKE at ALMADEN > Usenet: ...!uunet!ibmarc!drake Phone: (408) 927-1861 Does IBM still include a "READ BACKWARDS" instruction in their tape drives? This instruction, I am told (and it seems logical) is primarily for sort routines, where reading backwards on the tape could actually save some time. Of course the bytes come in reversed, but that just needs some extra code to deal with. - Yossie --- yossie@fnal.fnal.gov; yossie@fnccf.bitnet What did the Caspian Sea? - Saki
johnl@iecc.cambridge.ma.us (John R. Levine) (01/30/91)
In article <DK7|1@linac.fnal.gov> yossie@fnal.fnal.gov (Yossie Silverman) writes: >Does IBM still include a "READ BACKWARDS" instruction in their tape >drives? This instruction, I am told (and it seems logical) is primarily >for sort routines, where reading backwards on the tape could actually save >some time. Of course the bytes come in reversed, but that just needs some >extra code to deal with. I expect that start-stop drives still support read backwards. It's pretty hard to imagine that you can read streamers backwards, since they are used almost entirely for backup and logging anyway. By the way, the bytes from a read backwards do not come in reversed. You pass the channel the address of the end rather than the beginning of the buffer, and it knows to decrement rather than increment the buffer pointer during the progress of a read backwards. Gross, but quite effective. -- John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl " #(ps,#(rs))' " - L. P. Deutsch and C. N. Mooers
drake@drake.almaden.ibm.com (01/30/91)
In article <DK7|1@linac.fnal.gov> yossie@fnal.fnal.gov (Yossie Silverman) writes: >Does IBM still include a "READ BACKWARDS" instruction in their tape >drives? This instruction, I am told (and it seems logical) is primarily >for sort routines, where reading backwards on the tape could actually save >some time. Of course the bytes come in reversed, but that just needs some >extra code to deal with. Yes, current drives still have a "Read Backward" command. As previously mentioned, this does NOT result in the bytes being placed in memory backwards; the buffer appears "normal" after the I/O operation has completed. The one case where "read backwards" doesn't work is if the new "IDRC" feature is being used. This feature lets the tape drive perform compression in hardware on data as it is written on the tape (transparently to applications). If a particular tape is written using IDRC, then "read backwards" actually has to jockey the tape around and read forwards, not backwards. Read Backwards is also apparently used in commercial transaction processing systems such as IMS where a tape drive is constantly keeping a log of transactions as they are executed. Read Backwards is used in recovery to examine the log without rewinding the tape...potentially a Big Win. Sam Drake / IBM Almaden Research Center Internet: drake@ibm.com BITNET: DRAKE at ALMADEN Usenet: ...!uunet!ibmarc!drake Phone: (408) 927-1861
hoffman@speedy.cs.pitt.edu (Bob Hoffman) (02/01/91)
In article <1991Jan30.015657.12840@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes: >I expect that start-stop drives still support read backwards. It's pretty >hard to imagine that you can read streamers backwards, since they are used >almost entirely for backup and logging anyway. I just installed a Cipher M990 1/2" streamer and it will read and write both directions in start-stop mode or streaming. ---Bob. -- Bob Hoffman, N3CVL {allegra, bellcore, idis, psuvax1}!pitt!hoffman Pitt Computer Science hoffman@cs.pitt.edu FAX: +1 412 624 8854