tim@hoptoad.uucp (Tim Maroney) (04/16/89)
In article <4015@ece-csc.UUCP> jnh@ece-csc.UUCP (Joseph Nathan Hall) writes:
>Your comments about printing (deleted from above) are well taken. But
>so far as reading in "newline" mode goes:
>
> 1) Virtually all programming environments on all major operating
>    systems support this (C, Pascal, FORTRAN, etc., on UNIX, VMS,
>    MS-DOS, etc.) at a reasonably low level in a reasonably
>    versatile manner.

Actually, C only provides this at a high level. UNIX only provides it
with low-level I/O to certain devices such as terminals. (There may be
an fcntl I don't know about to do this on any file, but it's not
commonly used if so.)

The C high-level I/O routines are available in all Mac C
implementations with which I'm familiar. The same can be said for the
Pascal readln and related routines. So C and Pascal do have this
capability on the Mac; I don't see why the OS should be expected to
provide it as well.

> 2) Why would it have to be slow? The most common mid- to high-level
>    UNIX I/O is streamed and it's not a problem. I just want to
>    read mid-sized text files (xx-xxxK); I don't want to sector-
>    copy volumes...

Context switching is a big consideration; efficiency demands that you
call the OS as infrequently as possible. There's also the issue of
fetching entire disk blocks at once, though this is less relevant on a
caching OS like UNIX or the Mac. Expect context switching into the
kernel to become even more expensive on the Mac as the OS becomes more
sophisticated. I'd also point you to Earle's empirical measurements
confirming this effect; I noticed myself that increasing the buffer
size in the TOPS Terminal text editor's open operation greatly
improved speed (even when the buffers were already bigger than disk
blocks).

Another message, from John Gilmore, has informed me that
one-disk-block-at-a-time fetches can actually be more efficient if the
OS implements a block prefetch capability. That way, you are
processing block n at the same time the OS is fetching block n+1.
However, the Mac doesn't do this, and because disk I/O on the Mac is
so processor-intensive, it may never do so.

> 3) Most programming tasks handled by the Toolbox and other system
>    software aren't beyond the skills of professional programmers.
>    That's not the point. Fast character I/O (buffered by the
>    system or language run-time routines) and line-at-a-time I/O are
>    SIMPLE and USEFUL tools. During development, who CARES if I/O
>    performance is -25% of optimum?

I do, for one. A lack of efficiency in a program under development
slows down the edit-compile-test cycle significantly. I spend enough
of my life waiting for compilers; I don't want the testing to be slow
as well. But in any case, the capability you want is available in both
C and Pascal using high-level I/O routines that handle their own
buffering, so I don't see what you're complaining about.

> 4) Anyway, I disagree with the simple assertion that the "key to speed"
>    is reading as much as possible at a time from disk. This is just
>    not true. You can't hope for much improvement in I/O performance
>    once your buffer size exceeds the controller's buffer size.
>    You may even suffer a speed *penalty* if you do random I/O on a
>    fragmented file with a buffer size > sector size. Furthermore,
>    in situations where it is necessary to scan the file being read
>    character-by-character anyway (when reading a text file, or
>    parsing a source that must be kept on disk, or whatever) the
>    overhead of a system-based character I/O routine can be
>    negligible. In systems that provide fast character I/O, it can
>    be SLOWER to do the "raw" block reads yourself if you have to
>    write the supporting code in a HLL.

This is speculation. The experience of those of us who've actually
written text file readers on the Mac is quite different. In general,
the larger the buffer, the less time is spent in Read, and the
difference is large enough that the user will notice for any large
file.
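The arithmetic alone makes the point: each Read carries a roughly
fixed cost, so total overhead scales with the number of calls, which
falls off quickly as the buffer grows. A sketch (the function name is
mine, not a Toolbox routine, and the figures below are illustrative
rather than measurements of any real Mac):

```c
/* Number of Read calls needed to pull a whole file through a buffer
 * of a given size.  Rounds up, since the final partial buffer still
 * costs one call. */
long ReadCalls(long fileSize, long bufSize)
{
    return (fileSize + bufSize - 1) / bufSize;
}
```

For a 200K text file, ReadCalls(200L * 1024L, 512L) is 400 calls at
one disk block per Read, against ReadCalls(200L * 1024L, 32768L) = 7
calls with a 32K buffer; with a fixed per-call cost, that's the entire
difference the user notices.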
Aside from the trap dispatch (context switch) overhead, there's a
per-Read overhead which apparently comes from consultation of the disk
directories to find the proper physical blocks, as well as the
mechanics of request queueing and so forth.

>The lack of explicit Toolbox support for non-block-oriented file I/O is, at
>best, a weird omission. Sure, it's there, but >hiss< >boo< it's buried in
>the low-level routines and it's not well documented. Are we just not
>SUPPOSED to read text files on the Mac the same way we read them on any
>machine? Sheesh.

I have to say, if you're reading a line at a time on any machine, it's
likely you're taking a performance hit. And writing a loop to turn
blocks into lines on your own is so easy that a first-semester
programmer could do it. I don't see why it should have been included
in the Toolbox at all -- it isn't in UNIX, which you cite as a
favorable example -- but since it is, your complaints make even less
sense. If you want to write your own block-structured high-level
FSRead-style call, it's trivial to do so, and you can easily contain
the supposed complexity to this single short routine. (But why are
people phobic about parameter blocks?)

    OSErr FSReadLine(refNum, count, buffPtr)
    short refNum;
    long *count;
    char *buffPtr;
    {
        IOParam io;

        io.ioCompletion = nil;
        io.ioRefNum = refNum;
        io.ioBuffer = buffPtr;
        io.ioReqCount = *count;
        io.ioPosMode = fsFromMark | 0x0d80; /* newline mode, CR terminator */
        io.ioPosOffset = 0;
        PBRead((ParmBlkPtr) &io, false);
        *count = io.ioActCount;
        return io.ioResult;
    }

I haven't tested this, so I can't guarantee it works, but it's
certainly close to what you want. (You may have to explicitly move the
mark forward when you use fsFromMark instead of fsAtMark; I don't
know, and the documentation seems ambiguous on this point.) Is the
lack of a trivial routine like this in Inside Macintosh really a
problem?
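As for the blocks-into-lines loop itself, here's roughly what it looks
like in portable stdio terms -- untested like the routine above, the
names are mine rather than anybody's library, and it assumes Mac-style
CR (0x0D) line endings. One fread per 8K block, then memchr finds the
line breaks in memory:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    FILE *f;
    char  buf[8192];    /* one refill per 8K block, not one read per line */
    char *next;         /* next unconsumed byte in buf */
    long  left;         /* bytes remaining in buf */
} LineReader;

void LROpen(LineReader *lr, FILE *f)
{
    lr->f = f;
    lr->next = lr->buf;
    lr->left = 0;
}

/* Copy the next CR-terminated line into out (CR stripped, NUL added).
 * Returns the line's length, or -1 at end of file.  Lines longer than
 * max - 1 bytes are silently truncated. */
long LRNextLine(LineReader *lr, char *out, long max)
{
    long n = 0;
    long take, room, copy;
    char *cr;

    for (;;) {
        if (lr->left == 0) {            /* refill from the next block */
            lr->left = (long) fread(lr->buf, 1, sizeof lr->buf, lr->f);
            lr->next = lr->buf;
            if (lr->left == 0) {        /* true end of file */
                out[n] = '\0';
                return n > 0 ? n : -1;  /* unterminated last line is OK */
            }
        }
        cr = (char *) memchr(lr->next, '\r', (size_t) lr->left);
        take = cr ? (long) (cr - lr->next) : lr->left;
        room = max - 1 - n;
        copy = take < room ? take : room;
        memcpy(out + n, lr->next, (size_t) copy);
        n += copy;
        lr->next += take + (cr ? 1 : 0);  /* skip the CR itself */
        lr->left -= take + (cr ? 1 : 0);
        if (cr) {
            out[n] = '\0';
            return n;
        }
    }
}
```

The interesting property is that the cost per line is a memchr and a
memcpy; the per-call Read overhead is paid once per block, however
many lines the block holds.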
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
These are not my opinions, nor those of my ex-employers, my old
schools, my relatives, my friends, or really any rational person
whatsoever.
rang@cpsin3.cps.msu.edu (Anton Rang) (04/16/89)
In article <7015@hoptoad.uucp> tim@hoptoad.uucp (Tim Maroney) wrote
lots of stuff in reply to article <4015@ece-csc.UUCP> by
jnh@ece-csc.UUCP (Joseph Nathan Hall). I've deleted the articles to
save space....

1. Why should an OS provide newline support when high-level languages
   also provide it? To make life easier for the developer of a HLL.
   Also, suppose that a program uses both C and Pascal, using both
   fgets() and readln(). If the OS provides the newline support then
   you don't have (much) duplication of code in the support libraries.

2. Using individual read calls is slow; why use them? Well, they're
   probably always slower than doing stuff at a very low level--I can
   write my own disk I/O routines and read stuff faster by totally
   bypassing the file manager. Just as one answer, maybe there's a
   reason I don't want to allocate a big fixed-size buffer for
   reading this file--after all, the smallest size which would make
   sense for a buffer is a disk block. Maybe I'm trying to conserve
   memory in an INIT; maybe I need to read the file without worrying
   about running out of memory in the process.

3. Why do stuff inefficiently during development which we'd make more
   efficient for a production program anyway? Perhaps I'm porting a
   program from another operating system. Maybe the newline character
   is different (gasp!)--I might not want to worry about fixing this
   up yet. As Tim pointed out, there isn't really anything to
   complain about here if you're using C or Pascal anyway.

4. A bit more complex. Joseph Hall claims that reading as much as
   possible on each read call isn't necessarily the key to speed.
   Tim says it's speculation. One point here--if allocating a 32K
   buffer to read a text file quickly means swapping out 32K of code
   from somewhere, Hall's claim might be true. A procedure which
   counts the number of lines in a text file may well find that using
   a huge buffer is overkill.

5. A final note (of my own).
Tim says that "if you're reading a line at a time on any machine, it's
likely you're taking a performance hit." Just to make things a little
more complicated, I'd just like to say that there are systems which
do NOT require any specific character to mark the end of a line--if
you say writeln() it writes out your data, whether it contains ^M or
^J or whatever. On these systems, reading data block-by-block and
trying to figure out the end of a line is either near-impossible or
just plain slow. [Quibble, quibble.]

6. Tim says "And writing a loop to turn blocks into lines on your own
   is so easy that a first-semester programmer could do it."
   Probably true. But writing an *efficient* loop probably means
   using assembly language, at least until some decent optimizing
   compilers are widely available on the Mac.

I apologize (a little) for using net bandwidth on this. It probably
doesn't really belong in this group....

+---------------------------+------------------------+----------------------+
| Anton Rang (grad student) | "VMS Forever!"         | "Do worry...be SAD!" |
| Michigan State University | rang@cpswh.cps.msu.edu |                      |
+---------------------------+------------------------+----------------------+
tim@hoptoad.uucp (Tim Maroney) (04/17/89)
In article <2551@cps3xx.UUCP> rang@cpswh.cps.msu.edu (Anton Rang) writes:
>1. Why should an OS provide newline support when high-level languages
>   also provide it? To make life easier for the developer of a HLL.
>   Also, suppose that a program uses both C and Pascal, using both
>   fgets() and readln(). If the OS provides the newline support then
>   you don't have (much) duplication of code in the support libraries.

Could be true of Pascal, but not of C. C's "stdio" buffered I/O
library does a lot more than just read lines. Most C compilers use
code licensed from AT&T Bell Labs for at least some part of stdio,
and this code assumes an underlying OS file system is being used for
block-structured reads. It would actually be considerably harder (and
less efficient) to use the OS to do line-oriented reads. So the OS
might make it easier for a Pascal implementer to write readln, but it
wouldn't help a C implementer, nor would it reduce functional overlap
in library code between a program incorporating both C and Pascal.

>2. Using individual read calls is slow; why use them? Well, they're
>   probably always slower than doing stuff at a very low level--I can
>   write my own disk I/O routines and read stuff faster by totally
>   bypassing the file manager.

And break over LANs, other external file systems, new system
releases, etc.

>   Just as one answer, maybe there's a
>   reason I don't want to allocate a big fixed-size buffer for
>   reading this file--after all, the smallest size which would make
>   sense for a buffer is a disk block. Maybe I'm trying to conserve
>   memory in an INIT; maybe I need to read the file without worrying
>   about running out of memory in the process.

First, you allocate the buffer before you do any reading at all, so
there's no chance you can run out in the middle of the operation.
Second, you just get the biggest buffer you can, given the current
memory space limitations.
If there's enough for the whole file, go for it; if there's only 512
bytes in the largest buffer you can allocate, use that instead.
(Though if you're that low on storage, you probably won't be able to
read in the file anyway....)

>3. Why do stuff inefficiently during development which we'd make more
>   efficient for a production program anyway? Perhaps I'm porting a
>   program from another operating system.

To the Mac? Maybe as an MPW Tool, but everyone who's tried to do this
kind of porting on a real application has wound up with awfully ugly
results. There's a real philosophical difference between prompt-driven
software (the computer telling the user what to do) and event-driven
software (the user telling the computer what to do). I can see porting
specific libraries without user interfaces to the Mac, e.g., a B-tree
database package for developers, but forget about porting ordinary
programs.

>4. A bit more complex. Joseph Hall claims that reading as much as
>   possible on each read call isn't necessarily the key to speed.
>   Tim says it's speculation. One point here--if allocating a 32K
>   buffer to read a text file quickly means swapping out 32K of code
>   from somewhere, this might be true. A procedure which counts the
>   number of lines in a text file may well find that using a huge
>   buffer is overkill.

I have to admit -- I never swap out code. I use too many function
pointers, and segment unloading seems like an anachronism from the
128K Mac days. Now everybody gets a chance to take shots at me for
not using this great feature of the Mac. One more point -- 32K is
hardly a huge buffer on a megabyte machine.

>5. A final note (of my own). Tim says that "if you're reading a line
>   at a time on any machine, it's likely you're taking a performance
>   hit."
>   Just to make things a little more complicated, I'd just
>   like to say that there are systems which do NOT require any
>   specific character to mark the end of a line--if you say writeln()
>   it writes out your data, whether it contains ^M or ^J or whatever.
>   On these systems, reading data block-by-block and trying to figure
>   out the end of a line is either near-impossible or just plain slow.

Er, good point. You're right. It's been so long since I've done any
VMS programming that I forgot about line-structured files. Of course,
the VMS people at DEC finally got around to implementing byte-stream
files a few years ago, and everyone treated this as a great step
forward....

>6. Tim says "And writing a loop to turn blocks into lines on your own
>   is so easy that a first-semester programmer could do it."
>   Probably true. But writing an *efficient* loop probably means
>   using assembly language, at least until some decent optimizing
>   compilers are widely available on the Mac.

First, MPW C 3.0 is supposedly a pretty smart optimizer. Second, I
don't agree. Any good compiler can generate reasonably good code for
a simple loop of this kind. With an old C compiler, you may have to
use register declarations, but there's no reason a compiler can't
produce code as good as assembler for a "for" loop. (I refuse to use
register declarations in 1989; the techniques of register
optimization have been well understood for more than a dozen years
now, and a compiler that doesn't use them is brain-damaged. I'm only
using LSC now because my client preferred it.)
--
Tim Maroney, Consultant, Eclectic Software, sun!hoptoad!tim
"Next prefers its X and T capitalized. We'd prefer our name in lights
in Vegas." -- Louis Trager, San Francisco Examiner