jcampbell@mrfort.DEC (Jon Campbell) (07/25/85)
From: Jon Campbell
      Digital Equipment Corp.
      Marlboro, MA
      617-467-6876
      DECnode:MRFORT::JCAMPBELL

To: UNIX developers and users

Subject: problems with the UNIX file system

Some of us at Digital think we have found a basic problem with the UNIX
file system for FORTRAN. The problem is that there is no place to put
various kinds of information about the contents of the file. More
specifically:

1. The FORTRAN language requires that one be able to have "random
   access" files, with a fixed "recordsize". The obvious UNIX
   implementation is one which uses a fixed number of bytes (perhaps
   even with a <newline> at the end) for each "record". However, there
   is no way on UNIX that one can open such a file and find out the
   size of each record. Thus it is impossible to write a utility to
   look at, modify, or extract data from such a file without the user
   having previous knowledge about the file.

2. As you probably know, most FORTRAN output data files reserve the
   1st character position of each output line for a "FORTRAN carriage
   control character". When the file is printed (or, in some
   circumstances, typed) these control characters are supposed to be
   translated into corresponding vertical motion characters (such as
   one or more line-feeds, a form-feed, a vertical tab, etc.) and the
   <newline> character at the end of the "record" is removed.

   So FORTRAN output files are "different" than other files, even
   though you cannot tell that by looking at them - they just have
   "funny numbers" in the 1st character position of each line. UNIX
   provides a utility for piping the FORTRAN output through a
   translator module, so that the vertical motion characters appear
   directly in the output file. But often that is not what is
   desirable. Often one wants to leave the file in its original
   ("FORTRAN data file") state, modify it many weeks later, and then
   print it. Again, as in the case above, the user must know that the
   file was produced by a FORTRAN program and pipe it to a filter
   program on the way out to the printer or terminal.

3. The ANSI Magnetic Tape Label Standard defines a set of file
   attributes in the file labels which must be filled in when the tape
   is written. Among them are record size and carriage control
   (referred to in the Standard as "Form Control").

I would like to propose that UNIX users and developers begin thinking
about which "file attributes" (knowledge about the file that would be
useful to know for generalized programs which cannot have previous
knowledge about each file) would be useful to attach to UNIX files.

Keep in mind that these "attributes" would NOT in any way detract from
the simplicity of UNIX - one would not have to use them; they would be
there only for those users who wish to carry information about the
files along with the files. Nor would files with attribute information
be looked at by UNIX in any way other than they are looked at now -
they just have some more information about them that can be discovered
when they are opened. No "file management layer" is implied for UNIX by
the creation of these "attributes". We would not even have to make an
"incompatible change" for the printing of files with the "FORTRAN data
file" attribute: a new command could be introduced to take the place of
LPR for those users who wish the utility to find out whether the
attribute is set and print the file accordingly; many people would
probably continue to use LPR.

Below is a list of those "attributes" which I have found useful in my
work in implementing the FORTRAN runtime library for TOPS-10 and
TOPS-20. Many of them have been included in the ANSI Magnetic Tape
Label Standard:

Carriage control
    FORTRAN - funny numbers in char position 1, translated on printing
    LIST - take just the contents of the "record", add a <newline>.
           This is for files which have no <newline> characters in them
    NONE - print the file as it appears (the default)

Character set (for those folks who want to have both EBCDIC and ASCII
files)

Record format - (refer to the Tape Label Standard)
    Delimited - each record has a 4-character byte count in front of it
    Fixed - all records have the same length, with no terminators
    Undefined - the default - no implied record format

Record size (For "fixed" record format, the size of all records; for
variable-length records, this is usually interpreted as the maximum
record length - zero means "unknown" maximum record length)

File type (for "data management" programs...)
    Sequential (the default)
    Others (user-definable, for various flavors of other types
            of access, such as [ugh] indexed sequential, database, etc.)

Bytesize (for typesetting applications which use 16- or 32-bit
character sets)

I'm sure you'll all think of others that would be useful. Since I have
not looked at the UNIX internal file system much, I do not know how
difficult it would be to find a place to attach this large (and,
potentially, expanding) set of attributes, or what the FOPEN (or other)
interface would look like to set/get the attribute values.

Thanks for your time,
Jon Campbell
--------
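For readers who have not met the carriage control translation described in point 2, here is a minimal sketch of such a filter. It assumes the conventional control characters (blank for single spacing, "0" for double spacing, "1" for a new page, "+" for overprinting) and treats anything else as a blank; the function name is made up for illustration and is not any existing utility's interface.

```python
def translate_carriage_control(lines):
    """Turn FORTRAN carriage-control records into printer-ready text.

    Column 1 of each record is consumed as the control character, and the
    vertical motion it asks for is emitted before the rest of the record.
    The trailing newline of each record is dropped.
    """
    motions = {
        " ": "\n",    # advance one line
        "0": "\n\n",  # advance two lines (double spacing)
        "1": "\f",    # advance to the top of the next page
        "+": "\r",    # no advance: overprint the previous line
    }
    out = []
    for line in lines:
        record = line.rstrip("\n")
        control, text = record[:1], record[1:]
        # Unknown or missing control characters are treated as a blank.
        out.append(motions.get(control, "\n") + text)
    return "".join(out)
```

Nothing in the input marks it as carriage-control data, which is the complaint above: the caller has to know to run the filter.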
faustus@ucbcad.UUCP (Wayne A. Christopher) (07/26/85)
> Some of us at Digital think we have found a basic problem with the UNIX
> file system for FORTRAN. The problem is that there is no place to put
> various kinds of information about the contents of the file. More
> specifically:

< lots of stuff>

> I'm sure you'll all think of others that would be useful. Since I have
> not looked at the UNIX internal file system much, I do not know how
> difficult it would be to find a place to attach this large (and,
> potentially, expanding) set of attributes, or what the FOPEN (or other)
> interface would look like to set/get the attribute values.

I don't see what is wrong with letting fortran and its utility programs
do all of this themselves. The problem is that UNIX is not a "fortran"
operating system, and unlike systems like VMS, it doesn't have a lot of
stuff for the benefit of fortran programmers. There is really no reasonable
way to put this into the filesystem itself without a lot of re-writing,
and I doubt many people think it is worth the trouble. The fact is that
fortran is a dying language, and it would be silly to make unix more
friendly to fortran at the expense of more trouble for people who use
modern languages.

	Wayne
alb@alice.UUCP (Adam L. Buchsbaum) (07/26/85)
One's spine shivers at the thought of putting utility/program/etc support into the kernel itself.
ark@alice.UUCP (Andrew Koenig) (07/26/85)
> Some of us at Digital think we have found a basic problem with the UNIX > file system for FORTRAN. The problem is that there is no place to put > various kinds of information about the contents of the file. The place to put information about the contents of the file is in the file itself. If you are unmoved by that philosophical argument, consider this: If you expand Unix files to include additional information that is not really part of the file, will that information be copied automatically if you use "cp" to copy the file? Any answer causes problems. If the answer is "no," the information isn't really useful. If it is "yes," then you must rewrite "cp." You must also rewrite "cat," because I can copy a file by saying cat a >b . You will find that you must also rewrite dozens of other commands, as well as writing many new ones.
rcj@burl.UUCP (Curtis Jackson) (07/26/85)
In article <3287@decwrl.UUCP> jcampbell@mrfort.DEC (Jon Campbell) writes: > 1. The FORTRAN language requires that one be able to have "random > access" files, with a fixed "recordsize". The obvious UNIX > implementation is one which uses a fixed number of bytes (perhaps > even with a <newline> at the end) for each "record". However, there > is no way on UNIX that one can open such a file and find out the > size of each record. Thus it is impossible to write a utility to > look at, modify, or extract data from such a file without the user > having previous knowledge about the file. > So write a teeny-weeny little function that calls fopen first and then looks at the first word (16 or 32 bits, I don't care) that will contain the record size because the writing program put it there. > 2. As you probably know, most FORTRAN output data files reserve the > 1st character position of each output line for a "FORTRAN carriage > control character". When the file is printed (or, in some > circumstances, typed) these control characters are supposed to be > translated into corresponding vertical motion characters (such as > one or more line-feeds, a form-feed, a vertical tab, etc.) and the > <newline> character at the end of the "record" is removed. > > So FORTRAN output files are "different" than other files, even > though you cannot tell that by looking at them - they just have > "funny numbers" in the 1st character position of each line. UNIX > provides a utility for piping the FORTRAN output through a > translator module, so that the vertical motion characters appear > directly in the output file. But often that is not what is > desirable. Often one wants to leave the file in its original > ("FORTRAN data file") state, modify it many weeks later, and then > print it. Again, as in the case above, the user must know that the > file was produced by a FORTRAN program and pipe it to a filter > program on the way out to the printer or terminal. 
> I don't just go around randomly printing files without knowing what they are, do you? I often use the 'file' command to see what type of file I am dealing with -- nroff input, ascii data, C source, etc. I can see an addition to the 'file' command to recognize the types of FORTRAN output files you are talking about, but nothing more. "...the user must know that the file was produced by a FORTRAN program..." -- the FORTRAN user has to know that a file is random access before trying to open it as such; what is the big deal? > 3. The ANSI Magnetic Tape Label Standard defines a set of file > attributes in the file labels which must be filled in when the tape > is written. Among them are record size and carriage control > (referred to in the Standard as "Form Control"). > And if I get a tape in 'tar' format I don't run cpio on it to extract it; so if I get a tape that I know has byte-for-byte data on it I don't try to read and process record size or carriage control fields. >File type (for "data management" programs...) > Sequential (the default) > Others (user-definable, for various flavors of other types > of access, such as [ugh] indexed sequential, database, etc.) > I've pulled the above example out of the original posting as a good example of what Jon wants to do. If the 'others' are user-definable, then why not let the users define them? Why should they be part of the Unix filesystem? Don't mess with my Unix to support archaic languages/formats!! -- The MAD Programmer -- 919-228-3313 (Cornet 291) alias: Curtis Jackson ...![ ihnp4 ulysses cbosgd mgnetp ]!burl!rcj ...![ ihnp4 cbosgd akgua masscomp ]!clyde!rcj
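The "teeny-weeny little function" suggested in reply to point 1 might look roughly like the sketch below. The layout is an assumption made for illustration: the writing program stores the record size as a single 32-bit big-endian word at the start of the file, and fixed-length records follow immediately. The function names are invented here.

```python
import io
import struct

SIZE_WORD = struct.Struct(">I")  # assumed layout: one 32-bit big-endian word


def write_records(f, record_size, records):
    """Write the record size first, then the fixed-length records."""
    f.write(SIZE_WORD.pack(record_size))
    for rec in records:
        if len(rec) != record_size:
            raise ValueError("every record must be exactly record_size bytes")
        f.write(rec)


def read_record(f, index):
    """Fetch record number `index` (0-based) using the stored record size."""
    f.seek(0)
    (record_size,) = SIZE_WORD.unpack(f.read(SIZE_WORD.size))
    f.seek(SIZE_WORD.size + index * record_size)
    rec = f.read(record_size)
    if len(rec) != record_size:
        raise IndexError("no such record")
    return rec


# Usage: an in-memory file stands in for a disk file.
demo = io.BytesIO()
write_records(demo, 4, [b"AAAA", b"BBBB", b"CCCC"])
print(read_record(demo, 1))
```

The reader needs no prior knowledge of the record size, but it does need to know that the file follows this convention at all.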
john@genrad.UUCP (John P. Nelson) (07/26/85)
>> Some of us at Digital think we have found a basic problem with the UNIX
>> file system for FORTRAN. The problem is that there is no place to put
>> various kinds of information about the contents of the file. More
>> specifically:
>
> There is really no reasonable
>way to put this into the filesystem itself without a lot of re-writing,
>and I doubt many people think it is worth the trouble. The fact is that
>fortran is a dying language, and it would be silly to make unix more
>friendly to fortran at the expense of more trouble for people who use
>modern languages.
>
> Wayne

Well, this attitude is a bit extreme, but I really don't see why any of
this is necessary. Why not have the fortran format file have a header
describing the data contained within, and have the header started by a
four byte magic number? Magic numbers are used now to indicate that a
file is a binary executable, why not have a new magic number that
describes the file as a fortran file?

The argument that most (non-fortran) programs do not need the proposed
extra filesystem information applies to information stored in a header
as well. This would put the extra burden of responsibility on the
fortran library, which would have to recognize ordinary files, and
parse them differently from "funny" files. This same extra step would
have to take place anyway, except that the information would come from
the filesystem, instead of from the file header. What advantage is
there to having this information be "out-of-band" (i.e. not part of the
file itself)?

John P. Nelson (decvax!genrad!john)
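As a rough sketch of the library-side check described above, the routine below looks for a four-byte magic number and falls back to treating the file as ordinary when it is absent. The magic value and the choice of a 32-bit record size as the only header field are hypothetical, chosen only to keep the example small.

```python
import io
import struct

MAGIC = b"\x00\x00\x0f\x77"      # hypothetical magic number, not a real one
HEADER = struct.Struct(">4sI")   # assumed header: magic + 32-bit record size


def record_size_or_none(f):
    """Return the record size for a "funny" file, or None for an ordinary one.

    On return the stream is positioned at the first byte of the data:
    just past the header for a structured file, or back where it started
    for an ordinary file.
    """
    start = f.tell()
    head = f.read(HEADER.size)
    if len(head) == HEADER.size:
        magic, record_size = HEADER.unpack(head)
        if magic == MAGIC:
            return record_size
    f.seek(start)  # not one of ours: rewind and read it as plain bytes
    return None
```

Everything else on the system (cp, cat, tar) keeps treating the header as ordinary file contents, which is the in-band argument made in this thread.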
gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (07/27/85)
One does not fix Fortran- or DEC- created problems with files by trying to force UNIX to adopt the same mistakes. If you have to maintain "attribute" information in a file, how about storing it as the first yay many bytes of the file contents, to avoid breaking commands like "cp" which are done right on UNIX and wrong on DEC systems. You're going to need special-purpose utilities to decode this "attribute" information anyhow, so please limit the damage to just those utilities. Suggested reading: The Bell System Technical Journal, Vol. 57, No. 6, Part 2 (July/August 1978), pp. 1947-1969, "UNIX Time- Sharing System: A Retrospective" by D. M. Ritchie.
avolio@decuac.UUCP (Frederick M. Avolio) (07/27/85)
In article <3287@decwrl.UUCP>, jcampbell@mrfort.DEC (Jon Campbell) writes: > > Some of us at Digital think we have found a basic problem with the UNIX > file system for FORTRAN. The problem is that there is no place to put > various kinds of information about the contents of the file. More > specifically: > Lots of us at Digital think the UNIX file system is just fine the way it is... (FORTRAN??) --- Fred @ DEC -- ULTRIX Applications Center
phil@amdcad.UUCP (Phil Ngai) (07/27/85)
In article <578@decuac.UUCP> avolio@decuac.UUCP (Frederick M. Avolio) writes: >In article <3287@decwrl.UUCP>, jcampbell@mrfort.DEC (Jon Campbell) writes: >> >> Some of us at Digital think we have found a basic problem with the UNIX >> file system for FORTRAN. > >Lots of us at Digital think the UNIX file system is just fine the way it >is... (FORTRAN??) I'm glad to hear that some at Digital are against this horrible idea to "adalize" the Unix filesystem but I would also say I know a number of sites willing to pay good money for a good FORTRAN for Unix. So, Jon, keep working on it, just don't try to impose VMS ideas on Unix. Try to do things within the Unix philosophy. If you are not sure what that is, there seem to be many people, within DEC, even, like Fred, who will probably help you. -- There are two kinds of people, those who lump people in groups and those who don't. Phil Ngai (408) 749-5720 UUCP: {ucbvax,decwrl,ihnp4,allegra}!amdcad!phil ARPA: amdcad!phil@decwrl.ARPA
hedrick@topaz.ARPA (Chuck Hedrick) (07/28/85)
Jon: I am very glad to see that DEC is interested in Fortran on Unix. You would make many people very happy if you bring to Unix a Fortran compiler of the quality of the DEC VMS (or TOPS-20) compiler. However... I think it is a bad idea to add attributes to the Unix file system. You indicate that it would not cause any incompatibility. There is a sense in which this is true. But you would have to change all the utility programs that copy files, to copy the attributes. You would have to change the formats of backup tapes and tapes such as tar, to include the attributes. To the extent that the attributes are used, you would have to modify language runtime systems and utilities to take attributes into account when reading files that have them. One of my staff members has just written a network spooler for VMS. It is amazing how complex it is to read VMS files in their full generality, at least from Modula 2. (Perhaps this is a defect in the runtime system.) This complexity has nothing to do with whether there is an extra layer of RMS between you and the file system. Indeed that layer may make things more liveable. It has to do simply with the complexity of the file system. I am recommending that our Computer Science Dept use Unix, partly because I want an O.S. that is simple. I would like our students to be able to do some system programming. I would not like to face them with the complexities of an RMS file. If you add attributes to your Unix, I would regretfully have to rule it out as a candidate for our department. However the problem that you pose still remains. I think you want to distinguish between 2 kinds of files: those that are intended to be human-readable, and binary files. I believe you should do whatever violence is necessary to keep human-readable files in a single, simple format. This is the clear difference between Unix/Tenex on the one side and IBM/VMS on the other. 
I believe Unix people have chosen which side of the fence they want to be on, and you should respect that decision. Fortunately, I believe you do not have to do much violence to Fortran to make this work. The only structure you really have to worry about in human-readable files is carriage control. I suggest that the runtime system should turn the carriage control into carriage return, line feed, form feed, etc. At first glance, this appears to be a problem. After all, you say, Fortran programs might write a file using carriage control, and expect that when the file is read back in, the carriage control is still there. However as I understand it, Fortran 77 has deemphasized carriage control. I believe it is now used only in "print" files. It seems reasonable to believe that a print file is not normally going to be read back in as data to another Fortran program. Thus I believe you should do the following: - by default, map carriage control into CR, LF, etc. when output is to a "print" file. I suggest a convention that by default units 0 (stderr) and 6 (stdout) are print files. - supply an option to OPEN to override this. - for programs that do not use these mechanisms properly (e.g. old Fortran 66 programs), the only damage is that the ANSI carriage control characters will show up in column 1. There can still be a filter to handle this explicitly for those exceptions. I do not like the TOPS-20 idea of defaulting depending upon the actual output device (/dev/tty and /dev/lpt being print, disk files nonprint). The program will not then know in advance whether the file is a print file. That makes it unnecessarily hard to code. For binary files, I like the idea of a "magic number" that specifies "This is a structured binary file". In case you are not familiar with the concept of magic number, all relocatable and executable binaries have a certain number in their first 32 bits. 
There is no danger of confusing these files with text files, since the magic numbers are small integers. Thus the first 2 or 3 bytes are always 0, which is unlikely in a text file. You then need a way to specify the attributes. Experience with network protocols and other things suggests a text format for this. If you use bits, you will always run out of bits. There are several reasonable formats. My favorite (you are going to laugh, I'm sure) is Lisp format: a parenthesized list with attribute-value pairs, e.g. ((RECORD-SIZE 200) (FORMAT VBA)) This is simple to parse using a higher-level language. Xerox used it for specifying file attributes in PUP FTP, and it is easier to handle than the alternatives I have seen elsewhere. A more "binary" format might be pairs of null-terminated strings, ending with an extra null. But I think the Lisp format is better. You would probably want a convention that the actual data begins on the next 32-bit boundary after the end of the attributes, since that might simplify processing for certain situations. (For paged files, such as B-trees, you would probably want to skip to the next page boundary, but that would be an action implied by certain attributes.) PS: in future messages, could you give a UUCP route? I don't have a routing to mrfort.DEC offhand. Charles Hedrick Rutgers University uucp: ...{harvard, seismo, ut-sally, sri-iu, ihnp4!packard}!topaz!hedrick arpa: HEDRICK@RUTGERS
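The Lisp-style attribute header proposed above could be read and written along the lines of the sketch below. Several details are assumptions made for illustration: the particular magic value, NUL bytes as padding up to the next 32-bit boundary, and attribute names and values that contain no whitespace or parentheses. Values come back as strings.

```python
import io
import re

MAGIC = (0o777).to_bytes(4, "big")  # hypothetical small-integer magic number


def encode_header(attrs):
    """Build magic + "((NAME VALUE) ...)" + NUL padding to a 4-byte boundary."""
    text = "(" + " ".join(f"({k} {v})" for k, v in attrs.items()) + ")"
    raw = MAGIC + text.encode("ascii")
    return raw + b"\0" * (-len(raw) % 4)


def read_header(f):
    """Parse the header and leave `f` positioned at the first data byte."""
    if f.read(4) != MAGIC:
        raise ValueError("not a structured binary file")
    depth, chars = 0, []
    while True:
        c = f.read(1)
        if not c:
            raise ValueError("unterminated attribute list")
        chars.append(c)
        if c == b"(":
            depth += 1
        elif c == b")":
            depth -= 1
            if depth == 0:  # outermost list closed
                break
    consumed = 4 + len(chars)
    f.read(-consumed % 4)  # skip padding up to the 32-bit boundary
    text = b"".join(chars).decode("ascii")
    return dict(re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", text))


# Usage with the example attributes from the post.
header = encode_header({"RECORD-SIZE": 200, "FORMAT": "VBA"})
print(read_header(io.BytesIO(header + b"DATA")))
```

Because the attributes are text, new ones can be added without running out of bits, at the cost of a small parser in every program that wants to read them.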
ignatz@aicchi.UUCP (Ihnat) (07/28/85)
Jon Campbell of Digital Equipment Corp. recently posted a problem
statement/proposal concerning the Unix filesystem, particularly
addressing the problems encountered by such utilities as FORTRAN and
ANSI tape label requirements. His conclusion was that there needs to be
an extension to the Unix filesystem scheme, allowing such information
to be optionally available to needful users.

The problems he quotes are, indeed, real; as real as the need for good
database support in Unix. The issue I wish everyone to consider is that
*any* specialized support of this type must never go in the kernel! I
well remember struggling through the incredible source listings of the
Honeywell Level 6 Gcos operating system. They, too, started out to
support what appeared to be a reasonable subset of accepted typed
files--ISAM, KIDA, etc. In the end, the operating system grew to the
point that a listing set stood 3 feet high; much of that, support for
various file types. Worse, even if an installation didn't ever intend
to use these capabilities, machine resources were dedicated in the
'kernel' to allow the filesystem to recognize and process them,
regardless.

One of the big plusses of Unix was moving items that weren't required
out of the kernel. If it isn't involved with managing shared and/or
critical resources, it doesn't belong in the kernel. Database file
management should be the realm of a separate, although standard,
package, as should recognition and support of specialized file formats
such as those expected by FORTRAN or programs that must read/write ANSI
tapes. Consider also that, whether optionally used or not, such
extensions must be validated by 'fsck' and its ilk, thus complicating
system maintenance and increasing the likelihood of filesystem
corruption.

I most emphatically agree that some standardized means of providing
such information would be desirable; possibly a set of library routines
and maintenance programs. But don't clutter the kernel with this; we
made significant strides in isolating functionality and improving
modularity when the Unix 'toolchest' approach gained favor; let's
PLEASE not backslide!

-- 
	Dave Ihnat
	Analysts International Corporation
	(312) 882-4673
	ihnp4!aicchi!ignatz
tim@cithep.UucP (Tim Smith ) (07/29/85)
This is not really relevant, but I have sometimes thought that instead of
offsets in a file starting at zero, they should start at some negative
number, possibly specified in the inode. When you open the file you start
at zero. The only way to get the data before zero would be an explicit seek.

This "negative region" could be used for things like the a.out header for
executable files, the #!/bin/sh for shell scripts ( note that there is no
need for the prog to recognize # as a comment character ), or information
on record sizes for files that were brought from another system or produced
by a record oriented language ( although it would still be up to user mode
code to actually interpret this; let's leave the kernel out of this. ).
-- 
					Tim Smith
				ihnp4!{wlbr!callan,cithep}!tim
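The "negative region" idea can be imitated in user-mode code to see how it would feel to a program. The sketch below is only an emulation under stated assumptions: the length of the prefix is passed in by the caller (the post suggests the inode would hold it), and the class name is invented. It is not how any Unix kernel behaves.

```python
import io


class OffsetFile:
    """Wrap a seekable file so that offset 0 falls just after a prefix.

    The prefix (for example a header) lives at negative offsets and is
    reached only by an explicit seek to a negative position.
    """

    def __init__(self, raw, prefix_len):
        self.raw = raw
        self.prefix_len = prefix_len
        self.seek(0)  # opening the file starts at logical offset zero

    def seek(self, offset):
        if offset < -self.prefix_len:
            raise ValueError("offset is before the start of the prefix")
        self.raw.seek(offset + self.prefix_len)
        return offset

    def tell(self):
        return self.raw.tell() - self.prefix_len

    def read(self, n=-1):
        return self.raw.read(n)


# Usage: a 4-byte prefix followed by ordinary data.
f = OffsetFile(io.BytesIO(b"HDR!payload"), prefix_len=4)
print(f.read())      # the ordinary data
f.seek(-4)
print(f.read(4))     # the prefix, reached by an explicit negative seek
```

A program that reads sequentially from the start never sees the prefix, which also means a plain byte-for-byte copy made through this view would lose it.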
tim@cithep.UucP (Tim Smith ) (07/29/85)
Re: the posting I just posted on this topic. Of course, there are some problems with this also.... -- Tim Smith ihnp4!{wlbr!callan,cithep}!tim
lee@eel.UUCP (07/30/85)
Some folks have mentioned using the first words of the file to contain a magic number and the record size for FORTRAN record-oriented files. Before anyone goes off and reproduces the old ar and sccs mistakes, the first few bytes should be used for an ASCII string identifying the file as FORTRAN, giving the record size in an ASCII digit string, and ending in a newline. This makes straight printing, copying, and other processing of the file much easier than if it has binary data in it. E.g. <fortran!> 1722 ` in analogy with the portable ar(1) format. The digit string could be made fixed width in order to make it easy for FORTRAN to know how far into the Unix file to lseek to find the first byte of the FORTRAN file.
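A printable header of the kind described above might be handled as in the following sketch. The exact layout is an assumption: the tag "<fortran!>", one space, a zero-padded ten-digit record size, and a newline, which gives a header of fixed length so the offset of any record can be computed directly. The names are illustrative only.

```python
TAG = b"<fortran!> "               # assumed identifying string
WIDTH = 10                         # assumed fixed width of the digit string
HEADER_LEN = len(TAG) + WIDTH + 1  # tag + digits + newline


def make_header(record_size):
    """Return the printable, fixed-width header for a given record size."""
    digits = f"{record_size:0{WIDTH}d}".encode("ascii")
    if record_size < 0 or len(digits) != WIDTH:
        raise ValueError("record size does not fit in the header")
    return TAG + digits + b"\n"


def parse_header(head):
    """Return the record size from a header, or None if it is not one."""
    if len(head) != HEADER_LEN or not head.startswith(TAG) or head[-1:] != b"\n":
        return None
    digits = head[len(TAG):-1]
    return int(digits.decode("ascii")) if digits.isdigit() else None


def record_offset(index, record_size):
    """Byte offset in the Unix file of FORTRAN record `index` (0-based)."""
    return HEADER_LEN + index * record_size


print(make_header(1722))
```

Since the header is plain ASCII ending in a newline, printing or paging through the file shows one extra readable line rather than binary noise.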
broman@noscvax.UUCP (Vincent P. Broman) (07/30/85)
jcampbell wrote: > From: Jon Campbell > Digital Equipment Corp. > Marlboro, MA > 617-467-6876 > DECnode:MRFORT::JCAMPBELL > decwrl!dec-rhea!dec-mrfort!jcampbell > > Some of us at Digital think we have found a basic problem with the UNIX > file system for FORTRAN. The problem is that there is no place to put > various kinds of information about the contents of the file. More > specifically: ... [recordsizes, formcontrol, formats, char set, etc] The obvious solution is the one used by the loader -- put that description in a header in the file. It might be advisable to make the header printable (even legible). Programs in the fortran system need just open the file and read the first n bytes to get the full scoop on how the file is organized. The new "LPR" command needed by fortran users is merely a one-line shell script piping your carriage control filter to lpr. etc, etc. Let's keep all that mumbo-jumbo out of the operating system! UUCP: ucbvax!sdcsvax!noscvax!broman Vincent Broman ARPA: broman@nosc Naval Ocean Systems Center, code 632 Phone: (619) 225-2365 San Diego, CA 92152
friesen@psivax.UUCP (Stanley Friesen) (07/30/85)
In article <4053@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes: >> Some of us at Digital think we have found a basic problem with the UNIX >> file system for FORTRAN. The problem is that there is no place to put >> various kinds of information about the contents of the file. > >The place to put information about the contents of the file >is in the file itself. > Absolutely! In fact the obvious solution is to place a header at the beginning of the file containing such information as record length, and write the Fortran I/O library so that it understands this header. Something similar has *already* been incorporated into UNIX, namely the executable(a.out) file format, which contains a header specifying type(the "magic number") and the size of each portion of the executable image. All that need be done is devise a variant of this system for fixed-length record files! -- Sarima (Stanley Friesen) {trwrb|allegra|cbosgd|hplabs|ihnp4|aero!uscvax!akgua}!sdcrdcf!psivax!friesen or {ttdica|quad1|bellcore|scgvaxd}!psivax!friesen
jack@boring.UUCP (08/05/85)
In article <95@cithep.UucP> tim@cithep.UucP (Tim Smith ) writes:
>This is not really relevant, but I have sometimes thought that instead of
>offsets in a file starting at zero, they should start at some negative
>number, possibly specified in the inode. When you open the file you start
>at zero. The only way to get the data before zero would be an explicit seek.
>
>This "negative region" could be used for things like the a.out header for
>executable files, the #!/bin/sh for shell scripts ( note that there is no
>need for the prog to recognize # as a comment character ), or information
>on record sizes for files that were brought from another system or produced
>by a record oriented language ( although it would still be up to user mode
>code to actually interpret this; let's leave the kernel out of this. ).
>-- 
>					Tim Smith
>				ihnp4!{wlbr!callan,cithep}!tim

This looks better than a special 'file attribute', since you don't
need funny system calls, etc, but it still has the problem that you
have to re-write almost any unix-utility in existence.
If I do 'cp a.out foobar', I would prefer the header to be copied
too........
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.
lasse@daab.UUCP (Lars Hammarstrand) (08/06/85)
In article <3287@decwrl.UUCP> jcampbell@mrfort.DEC (Jon Campbell) writes:
>
>
>
>
> From: Jon Campbell
>       Digital Equipment Corp.
>       Marlboro, MA
>       617-467-6876
>       DECnode:MRFORT::JCAMPBELL
>
>To: UNIX developers and users
>
>Subject: problems with the UNIX file system
>
>Some of us at Digital think we have found a basic problem with the UNIX
>file system for FORTRAN. The problem is that there is no place to put
>various kinds of information about the contents of the file. More
>specifically:
....................
-------------------------------------------------------------------------

Do you really believe all that rubbish yourself? I mean, do you really
think that you can get a better world because you are restricted to file
headers, fixed record sizes, lead-in chars, etc, etc ... that other
people think are best for you?? (It smells of IBM)

What exactly are you looking for???

NO..... LIFE --> UNIX -------> F R E E D O M

PS.. I don't say your thing is bad, but it can be better!

	My name:    Lasse Hammarstrand.
	My company: Datorisering AB, SWEDEN.

	UUCP: {seismo,decvax,philabs}!mcvax,ukc,unido!enea!daab!lasse
	ARPA: decvax!mcvax!enea!daab!lasse@berkley.arpa
cudcv@daisy.warwick.UUCP (Rob McMahon) (08/07/85)
>> basic problem with the UNIX file system for FORTRAN. > >You must also rewrite "cat," because I can copy a file by saying cat a >b . It's worse than that - the shell creates `b', and it's got no idea what sort of file to create, maybe the shell should be changed to include a syntax like `>[ASCII,RECORDSIZE=80,ANSICC]b' !
hoey@nrl-aic.ARPA (Dan Hoey) (08/16/85)
>From: cudcv@daisy.warwick.UUCP (Rob McMahon)
>Date: 7 Aug 85 10:25:12 GMT

>>> basic problem with the UNIX file system for FORTRAN.
>>
>>You must also rewrite "cat," because I can copy a file by saying cat a >b .
>
>It's worse than that - the shell creates `b', and it's got no idea what
>sort of file to create,

It's even worse than that. ``cat'' can take more than one argument,
hence its name. What do *you* want to happen when you concatenate
files with different record sizes, character sets, etc.?

>maybe the shell should be changed to include a syntax
>like `>[ASCII,RECORDSIZE=80,ANSICC]b' !

Ulch! Maybe we should begin each command with "//" and use space for a
comment character.

Dan