[net.lang] I/O operations in programming languages

johnl@ima.UUCP (08/27/83)

ima!johnl    Aug 26 11:38:00 1983

The matter at hand is why file access shouldn't look just like variable
access, with some reference to Pascal.

Pascal's I/O is certainly a pain, largely because it was designed with card
readers in mind.  But to answer the question, there are languages which do
make file access and variable access look the same.  Snobol comes to mind,
in which every time you fetch the variable INPUT (or SYSPIT in the old days)
it gets a record from the input, when you assign something to OUTPUT it goes
to the output, and there are functions to associate variables with files.
Grody old RSTS Basic-Plus has a virtual array feature that lets you treat a
disk file just like an array.  Various forms of APL have a "shared variable"
feature that associates variables with all sorts of outside things including
files.  Unfortunately, all of these schemes stink.

The problem is that file I/O is more complicated than reading and writing
variables.  The data in files is often much more complex than anything you'd
keep in a single structure in memory.  Try writing a variant record
structure that describes a Unix archive file as seen by the linker, keeping
in mind that some archive members are in a.out format and some aren't.  You
could probably do it, but the structure would be so grotesque as to be
useless.  Once you've done that, fit files with multiple key structures into
your framework.  Then throw in line printers, bit map terminals, and HDLC
X.25 ports.  Remember that you have to deal with time-outs, end of file
conditions, and I/O errors.

One might say "then don't use such complicated file structures," but the job
of any program that does I/O is to deal with some a priori structure.  (If
the program only writes data that only it can read, it's not likely to be
very useful, is it?) Snobol and APL programs cannot generally read files
not made of plain old lines of text (or in the APL case, of funny APL
objects).

As far as portability goes, experience shows that programs written in rotten
old Cobol* port amazingly well (a lot better than Fortran or Pascal) because
the I/O model, although complex, is well specified and complete.  It deals
quite adequately with variable length records of different sizes, indexed
files, and all that stuff.  C language I/O also ports reasonably well,
although to be fair it usually is moving along with a Unix environment, and
few C programs do really complicated I/O like some Cobol programs do.

John Levine, decvax!yale-co!ima!johnl, ucbvax!cbosgd!ima!johnl,
{allegra|floyd|amd70}!ima!johnl, Levine@YALE.ARPA

* - I'm not claiming that Cobol is the best thing around for expressing
algorithms or whipping up little programs, just that the I/O model is much
better thought out than in many more "modern" languages which follow the
Algol tradition and punt.  Save your flames.

condict@csd1.UUCP (Michael Condict) (08/28/83)

I wish I could say that I am genuinely surprised at the negative responses
to my proposal for unifying files with other variables, but that would
be disingenuous.  The course of evolution of programming languages and
storage technologies has made it almost inevitable that our thinking about
the simplest ways to describe the manipulation of data be distorted.  In any
case, it is clear that I have to some extent been misunderstood.

I'll try to individually address most of the objections that have been raised
so far, but first, just to be pedantic, let us prove the following:

THEOREM 1.  If the contents of (or operations on) a file or device are not
adequately modelled by the variable declarations and operations of C (or
Pascal) then they are not adequately modelled by viewing the file or device
as an array of bytes, as is required in the Unix I/O model.

PROOF.  Both C and Pascal have the array of bytes (chars) in their repertoire
of data structures.  If you want to get picky and insist that a file cursor
position be remembered, the array can be paired with an integer in a record.

COROLLARY 2.  It is at least as easy to program any operation on any file or
device in a language where files are unified with variables as it is in a
language where they are required to conform to the Unix I/O model.

Armed with these results, I am ready to tackle the objections:

Objection 1: Files and devices are much too complicated, and their desired
time behaviour too critical and too dependent on geometry of storage devices
for them to be dealt with as programming variables.

Response: For the first part, refer to Corollary 2.  For the second, I
claim that questions of track and cylinder locations and rotational speeds
are diminishing in importance and certainly are not factors that are considered
"more often than not" in the typical application program, even data-base
applications -- rather such concerns are more often entrusted to the operating
system.  In future technologies, e.g. bubble memory, optical disks and the
disappearance of sequential magnetic tapes, such concerns will fade away.
Even if they don't, there are people working hard right now on the problem of
expressing such time-critical behaviour in the emerging family of real-time
programming languages.

Objection 2: I don't want the C compiler gunked up with knowledge of all sorts
of little-used devices and data structures for talking to them.  This would
ruin portability of the compiler because it would have to be changed on each
machine.

Response: I'm not proposing that the C compiler or any other compiler know how
to translate the use of variant records into commands for a McDoogle's
french-fries computer or whatever.  The idea would be for the compiler to know
how to translate variable declarations and operations into a fixed I/O model,
either the one that is presented to it by its local operating system, or an
abstract one that is more general, if it is to be a portable compiler.  On
a Unix system this would amount to it knowing how to compile such operations
into the reading and writing of chunks of storage that are accessed by an
integer address (the file cursor).  It already knows how to do this; only the
names of the instructions that it generates would have to be changed.  And
this connection to the I/O model of the operating system must be made in any
case; the only question is the most elegant way to do it.  As far as the
programming of exotic devices is concerned, that would be a matter for the
programmer and the operating system to negotiate through the restrictions of
the I/O model, as it is now.  The role of the compiler would just be to give
the programmer a high-level, elegant interface to this model.

Objection 3: SNOBOL and APL implement your proposal and the results are not
satisfactory.

Response: SNOBOL, at least, is farther from my proposal than is Pascal.  In
SNOBOL, input must be viewed as a sequence of lines, whereas in Pascal it is
a sequence of arbitrary data.  I am unfamiliar with APL shared variables, but
my limited knowledge of them is that they are much like what I am proposing.
Unfortunately, as you point out, the data structures of APL are a bit
restricted, even strange, compared to languages like Pascal and C.  This is
an argument for generalizing the data structures of APL, not an argument against
my proposal (and, again, several people have proposed extending APL to allow
more general data structures than the n-dimensional array, e.g. Geoff Lowney
here at N.Y.U.).  Alongside this objection the author mentioned that he
approved of the way I/O is done in Cobol, because it is complete and precisely
defined.  I mostly agree, but see no reason that this completeness and precision
would be lost if I/O were to be merged with variables in Cobol (in fact it
partly is, to the extent that the same sort of nested record declarations
can be made to describe files or program variables).


Michael Condict		...!csd1!condict
Courant Inst., NYU
251 Mercer St.
New York, NY   10012

mark@cbosgd.UUCP (08/30/83)

Michael Condict:

Do you propose adding funny character arrays to the C language and
calling them "files"?  If so, are you willing to assume that all
machines that run C will have virtual memory?  If so, you can have
a nice pmapped implementation of files, given proper operating system
support.  4.2BSD does not have such support, although it is on the list
of things for 4.1d, which will presumably wind up in 4.3.  However, I
think this assumption kills portability.

If you don't assume virtual memory, then the compiler will have to
make a distinction between character arrays that are files and character
arrays that are not files, generating different code for each.  (Certainly
just dumping things into/out of memory at a given byte offset won't work
if the file won't fit in memory; yet for regular character string operations
you want tight efficient code.)  Now, suppose you want to use a function
you have written that does character string manipulation.  (Say, sprintf,
or strcmp, or something you write yourself.)  The parameter to this function
ought to be allowed to be either a real character array or a file.
(After all, isn't the whole point to be able to use string operations on files?
And all our string operations are functions in the C library.)  Now, what
will the generated code for this function look like?  How will it tell whether
it's a file or a regular old pointer?  The absolute best you can do is to
include a descriptor, either in the array (where is the beginning?) or in
the pointer.  Now, all the code in the function has to do a runtime check
on the descriptor for every string operation it does.  There goes your
efficiency of C code.

Or perhaps you have some other implementation in mind?  Let's see now, you
could keep a window into the file in core, and set up segmentation
registers to trap to a user routine when you run off the end or beginning
of the window.  There goes 8K of PDP-11 addressing space.  And what
happens if you do a very random access and happen to point into a valid
spot in your data area?