johnl@ima.UUCP (08/27/83)
The matter at hand is why file access shouldn't look just like variable access, with some reference to Pascal. Pascal's I/O is certainly a pain, largely because it was designed with card readers in mind. But to answer the question, there are languages which do make file access and variable access look the same. Snobol comes to mind: every time you fetch the variable INPUT (or SYSPIT in the old days) it gets a record from the input, whenever you assign something to OUTPUT it goes to the output, and there are functions to associate variables with files. Grody old RSTS Basic-Plus has a virtual array feature that lets you treat a disk file just like an array. Various forms of APL have a "shared variable" feature that associates variables with all sorts of outside things, including files.

Unfortunately, all of these schemes stink. The problem is that file I/O is more complicated than reading and writing variables. The data in files is often much more complex than anything you'd keep in a single structure in memory. Try writing a variant record structure that describes a Unix archive file as seen by the linker, keeping in mind that some archive members are in a.out format and some aren't. You could probably do it, but the structure would be so grotesque as to be useless. Once you've done that, fit files with multiple key structures into your framework. Then throw in line printers, bit-mapped terminals, and HDLC X.25 ports. Remember that you have to deal with time-outs, end-of-file conditions, and I/O errors.

One might say "then don't use such complicated file structures," but the job of any program that does I/O is to deal with some a priori structure. (If a program only writes data that only it can read, it's not likely to be very useful, is it?) Snobol and APL programs cannot generally read files not made of plain old lines of text (or, in the APL case, of funny APL objects).
As far as portability goes, experience shows that programs written in rotten old Cobol* port amazingly well (a lot better than Fortran or Pascal ones) because the I/O model, although complex, is well specified and complete. It deals quite adequately with variable-length records of different sizes, indexed files, and all that stuff. C language I/O also ports reasonably well, although to be fair it usually moves along with a Unix environment, and few C programs do really complicated I/O the way some Cobol programs do.

John Levine, decvax!yale-co!ima!johnl, ucbvax!cbosgd!ima!johnl, {allegra|floyd|amd70}!ima!johnl, Levine@YALE.ARPA

* - I'm not claiming that Cobol is the best thing around for expressing algorithms or whipping up little programs, just that its I/O model is much better thought out than in many more "modern" languages, which follow the Algol tradition and punt. Save your flames.
condict@csd1.UUCP (Michael Condict) (08/28/83)
I wish I could say that I am genuinely surprised at the negative responses to my proposal for unifying files with other variables, but that would be disingenuous. The course of evolution of programming languages and storage technologies has made it almost inevitable that our thinking about the simplest ways to describe the manipulation of data would be distorted. In any case, it is clear that I have to some extent been misunderstood. I'll try to address individually most of the objections that have been raised so far, but first, just to be pedantic, let us prove the following:

THEOREM 1. If the contents of (or operations on) a file or device are not adequately modelled by the variable declarations and operations of C (or Pascal), then they are not adequately modelled by viewing the file or device as an array of bytes, as is required in the Unix I/O model.

PROOF. Both C and Pascal have the array of bytes (chars) in their repertoire of data structures. If you want to get picky and insist that a file cursor position be remembered, the array can be paired with an integer in a record.

COROLLARY 2. It is at least as easy to program any operation on any file or device in a language where files are unified with variables as it is in a language where they are required to conform to the Unix I/O model.

Armed with these results, I am ready to tackle the objections:

Objection 1: Files and devices are much too complicated, and their desired time behaviour too critical and too dependent on the geometry of storage devices, for them to be dealt with as programming variables.

Response: For the first part, refer to Corollary 2. For the second, I claim that questions of track and cylinder locations and rotational speeds are diminishing in importance and certainly are not factors that are considered "more often than not" in the typical application program, even in data-base applications -- rather, such concerns are more often entrusted to the operating system. With future technologies, e.g. bubble memory, optical disks and the disappearance of sequential magnetic tapes, such concerns will fade away. Even if they don't, there are people working hard right now on the problem of expressing such time-critical behaviour in the emerging family of real-time programming languages.

Objection 2: I don't want the C compiler gunked up with knowledge of all sorts of little-used devices and data structures for talking to them. This would ruin portability of the compiler because it would have to be changed on each machine.

Response: I'm not proposing that the C compiler or any other compiler know how to translate the use of variant records into commands for a McDoogle's french-fries computer or whatever. The idea would be for the compiler to know how to translate variable declarations and operations into a fixed I/O model, either the one that is presented to it by its local operating system, or an abstract one that is more general, if it is to be a portable compiler. On a Unix system this would amount to its knowing how to compile such operations into the reading and writing of chunks of storage that are accessed by an integer address (the file cursor). It already knows how to do this; only the names of the instructions that it generates would have to be changed. And this connection to the I/O model of the operating system must be made in any case; the only question is the most elegant way to do it. As far as the programming of exotic devices is concerned, that would be a matter for the programmer and the operating system to negotiate through the restrictions of the I/O model, as it is now. The role of the compiler would just be to give the programmer a high-level, elegant interface to this model.

Objection 3: SNOBOL and APL implement your proposal and the results are not satisfactory.

Response: SNOBOL, at least, is farther from my proposal than Pascal is. In SNOBOL, input must be viewed as a sequence of lines, whereas in Pascal it is a sequence of arbitrary data.
I am unfamiliar with APL shared variables, but my limited understanding is that they are much like what I am proposing. Unfortunately, as you point out, the data structures of APL are a bit restricted, even strange, compared to those of languages like Pascal and C. This is an argument for generalizing the data structures of APL, not an argument against my proposal (and, again, several people have proposed extending APL to allow more general data structures than the n-dimensional array, e.g. Geoff Lowney here at N.Y.U.).

Alongside this objection the author mentioned that he approved of the way I/O is done in Cobol, because it is complete and precisely defined. I mostly agree, but I see no reason that this completeness and precision would be lost if I/O were merged with variables in Cobol (in fact it partly is, to the extent that the same sort of nested record declarations can be made to describe files or program variables).

Michael Condict ...!csd1!condict
Courant Inst., NYU
251 Mercer St.
New York, NY 10012
mark@cbosgd.UUCP (08/30/83)
Michael Condict: Do you propose adding funny character arrays to the C language and calling them "files"? If so, are you willing to assume that all machines that run C will have virtual memory? If so, you can have a nice pmapped implementation of files, given proper operating system support. 4.2BSD does not have such support, although it is on the list of things for 4.1d, which will presumably wind up in 4.3. However, I think this assumption kills portability.

If you don't assume virtual memory, then the compiler will have to make a distinction between character arrays that are files and character arrays that are not files, generating different code for each. (Certainly just dumping things into/out of memory at a given byte offset won't work if the file won't fit in memory; yet for regular character string operations you want tight, efficient code.)

Now, suppose you want to use a function you have written that does character string manipulation (say, sprintf, or strcmp, or something you write yourself). The parameter to this function ought to be allowed to be either a real character array or a file. (After all, isn't the whole point to be able to use string operations on files? And all our string operations are functions in the C library.) Now, what will the generated code for this function look like? How will it tell whether it has a file or a regular old pointer? The absolute best you can do is to include a descriptor, either in the array (where is the beginning?) or in the pointer. Now, all the code in the function has to do a runtime check on the descriptor for every string operation it does. There goes your efficiency of C code.

Or perhaps you have some other implementation in mind? Let's see now: you could keep a window into the file in core, and set up segmentation registers to trap to a user routine when you run off the end or beginning of the window. There goes 8K of PDP-11 address space.
And what happens if you do a very random access and happen to point into a valid spot in your data area?