emv@msen.com (Ed Vielmetti) (06/26/91)
Let me explain and justify this format for recording archive
information, in the hope that it'll get wide use. I'll be converting
the "MSEN Archive Service file verification" section of comp.archives
postings to this format, so I want to get it right.

I believe I've sketched out something which would let you verify every
file that's available for anonymous FTP or semi-public FTP. Here's a
sample.

19910520220500 f export.lcs.mit.edu /contrib ups-2.31.tar.Z 963435 ftp,30
19910520223700 f export.lcs.mit.edu /contrib ups-2.31.README 9758 ftp,30
19910528125400 f ftp.uu.net /tmp ups-song.ms 1237 ftp,21
19910529094400 f ftp.uu.net /tmp ups-song.au 1456032 ftp,21
19910530012300 d msen.com /debug/ups emv,case
19910624112800 w prep.ai.mit.edu /pub/gnu gawk* bug-gnu-utils
19910411185329 f nis.nsf.net,anonymous,guest . $read.me 1024 postmaster

$date f $site $dir $file $size $owner
$date w $site $dir $wildcard $owner
$date d $site $dir $owner

Here's what the fields are.

19910520220500
    $date: Date in ISO 3307 format (YYYYMMDDhhmmss[.xxxxxx]).
    Sorts easily, easy to parse, microsecond resolution, will be
    used in NNTP2.

(d,f,w)
    $type: directory, file, or wildcard specification.
    Directories have ($site, $dir, @owner).
    Wildcards have ($site, $dir, $wildcard, @owner).
    Files have ($site, $dir, $file, $size, @owner).
    Any other type can be added as long as suitable definitions for
    the following fields can be provided.

export.lcs.mit.edu
    $site: System name. The default assumption is login with user
    "anonymous", any password acceptable; the entry for nis.nsf.net
    shows a situation where something different from the default is
    needed. In that case it's interpreted as ($site, $user, $pass)
    or even ($site, $user, $pass, $acct).
    XXX should ascii,binary be in here somewhere too?
    XXX should some notion of explicit file types be here?

There are three sorts of things to describe: files, directories, and
wildcard specifications.
    Files are operated on with the DIR, GET, and MGET commands.
    Directories are operated on with the CD and DIR commands.
    Wildcards are operated on with the DIR and MGET commands.

gawk*
    $wildcard: wildcard file specification. Specified according to
    local host conventions. The $date variable for this should be
    set to the latest change date for the files that match the
    wildcard file specification.

/contrib
    $dir: Directory. cd to this directory. In the special case where
    no cd command needs to be (or should not be) issued, treat "."
    as a no-op.

ups-song.au
    $file: File name.

1456032
    $size: File size, in bytes, when properly transferred to a
    different machine. Systems which report sizes in blocks need to
    translate.

emv,case
    $owner: comma-separated list of owners. The first one should
    also be a mail address (e.g. emv@archive.msen.com). The rest can
    be group names, mail addresses, or any other key words, tags, or
    identifiers which you might choose to use.
    XXX this is pretty wide open....

--
Edward Vielmetti, MSEN Inc. moderator, comp.archives
emv@msen.com
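The record layout described in this post can be read mechanically. The following is a minimal Python sketch of such a reader, not anything Ed posted: the names `ArchiveRecord` and `parse_record` are hypothetical, fields are assumed to be separated by whitespace with no embedded spaces, and the optional fractional seconds are assumed to use at most six digits.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class ArchiveRecord:
    """One line of the proposed format (hypothetical container type)."""
    date: datetime
    kind: str              # 'f' (file), 'd' (directory) or 'w' (wildcard)
    site: str
    login: List[str]       # optional user, password, account after the site
    directory: str
    name: Optional[str]    # file name or wildcard; None for directories
    size: Optional[int]    # size in bytes; only files carry one
    owners: List[str]


# Number of fields that follow $dir for each record type.
TRAILING_FIELDS = {"f": 3, "w": 2, "d": 1}


def parse_record(line: str) -> ArchiveRecord:
    fields = line.split()
    date_text, kind, site_field, directory = fields[:4]
    rest = fields[4:]
    if kind not in TRAILING_FIELDS:
        raise ValueError(f"unknown record type {kind!r}")
    if len(rest) != TRAILING_FIELDS[kind]:
        raise ValueError(f"wrong number of fields for type {kind!r}")
    # YYYYMMDDhhmmss with an optional .xxxxxx fraction.
    fmt = "%Y%m%d%H%M%S.%f" if "." in date_text else "%Y%m%d%H%M%S"
    date = datetime.strptime(date_text, fmt)
    # "site,user,pass[,acct]" overrides the anonymous-login default.
    site, *login = site_field.split(",")
    name = rest[0] if kind in ("f", "w") else None
    size = int(rest[1]) if kind == "f" else None
    owners = rest[-1].split(",")
    return ArchiveRecord(date, kind, site, login, directory, name, size, owners)
```

For example, `parse_record("19910530012300 d msen.com /debug/ups emv,case")` yields a directory record with no name or size and the owners `emv` and `case`. Splitting on whitespace is the simplest reading of the sample lines; it cannot represent names that contain spaces.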
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/26/91)
In article <EMV.91Jun25175321@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
> 19910520220500 f export.lcs.mit.edu /contrib ups-2.31.tar.Z 963435 ftp,30

Methinks

  19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30

is both more readable and more accurate. Your separation between
``directory'' and ``file'' is a mistake, because some operating systems
can express neither the null directory nor more than one directory in a
single command. It's much more logical to have a composite filename,
where each component before a slash means ``change directory to this.''
foo/bar/blah means ``cd foo, then cd bar, then get blah.'' Of course,
you can pile all the cd's into one if you're talking to a UNIX server,
but automated programs shouldn't depend on knowing the remote system
type.

It's probably necessary to add quoting ("/", perhaps), for cases like
the Kerberos distribution where you simply have to combine two cd's
into one or you hit a wall.

You had better take into account that a lot of people don't have DNS.
Also, some systems don't have meaningful dates or owners but still
support ftp.

> 19910530012300 d msen.com /debug/ups emv,case

  19910530012300 d msen.com:debug/ups/ emv,case

The blank filename means that you're referring to the directory.

> 19910624112800 w prep.ai.mit.edu /pub/gnu gawk* bug-gnu-utils

  19910624112800 w prep.ai.mit.edu:pub/gnu/gawk* bug-gnu-utils

> XXX should ascii,binary be in here somewhere too?

Probably. Maybe introduce an ``E'' type, just like f but for EBCDIC,
meaning that you shouldn't use binary. (I'm half serious.)

---Dan
worley@compass.com (Dale Worley) (06/26/91)
In article <17493.Jun2607.22.3191@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> Methinks
> 19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30
> is both more readable and more accurate. Your separation between
> ``directory'' and ``file'' is a mistake, because some operating systems
> can express neither the null directory nor more than one directory in a
> single command. It's much more logical to have a composite filename,

However, FTP is oriented to "directory and file name" access -- you CD
to a directory, then you GET a file. Thus, it's best to have two
fields -- the first one is "the argument to give to CD" and the second
is "the argument to give to GET".
Logically, it's still losing, but it corresponds more closely with the
model FTP presents.
Dale Worley Compass, Inc. worley@compass.com
--
The best way to demand something unreasonable is to call it a "freedom".
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)
In article <WORLEY.91Jun26104902@sn1987a.compass.com> worley@compass.com (Dale Worley) writes:
> However, FTP is oriented to "directory and file name" access -- you CD
> to a directory, then you GET a file. Thus, it's best to have two
> fields -- the first one is "the argument to give to CD" and the second
> is "the argument to give to GET".

Sorry I didn't make myself clear. The real ftp model is that you do zero
or more cd's, then a get. That's what ftp supports, after all...

Why doesn't the single-directory, single-filename model work? Because
there are some operating systems where you *have* to do cd foo, then cd
bar, then get blah. You *cannot* do a combined cd and then get blah, as
you can under UNIX. Furthermore, there are many operating systems which
have no way to state ``current directory''. You *cannot* do a ``null
cd'' and then get blah, as you can under UNIX.

So it's best to have any number of fields, each meaning ``give this
field to cd'', and then a final field meaning ``get this.'' A name like
pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
a simple name like readme.txt means don't do any cd's, just get
readme.txt.

---Dan
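The ``zero or more cd's, then a get'' reading of a composite name can be sketched in a few lines of Python. This is an illustration only: `ftp_steps` is a hypothetical name, the output is a list of client commands as a person would type them rather than protocol messages, quoting is not handled, and a trailing slash is taken to mean ``list the directory'', following the blank-filename convention from earlier in the thread.

```python
from typing import List


def ftp_steps(name: str) -> List[str]:
    """Expand site:dir/dir/file into one cd per component, then a get."""
    site, colon, path = name.partition(":")
    if not colon:
        raise ValueError(f"expected site:path, got {name!r}")
    # Every component before a slash is a separate cd; the last one is
    # the thing to fetch. No quoting is recognised in this sketch.
    *dirs, target = path.split("/")
    steps = [f"open {site}"]
    steps += [f"cd {d}" for d in dirs]
    # A blank final component refers to the directory itself.
    steps.append(f"get {target}" if target else "dir")
    return steps
```

With this reading, `ftp_steps("msen.com:debug/ups/")` ends in `dir` rather than a `get`, and a bare `site:readme.txt` produces no `cd` at all, which is the case a two-field format needs "." to express.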
emv@msen.com (Ed Vielmetti) (06/27/91)
In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> Sorry I didn't make myself clear. The real ftp model is that you do zero
> or more cd's, then a get. That's what ftp supports, after all...

To give an idea of what the pathological cases are, here are some spots
which I can't currently verify (with my automated verification
tools) because they require site-specific knowledge which my current
scheme is missing.

unt-library-list  vaxb.acs.unt.edu:[.library]
    can't "dir [.library]";
    must "cd [.library] ; dir"

netlib  research.att.com:/netlib/
    different login and password to get to netlib

gorebill  nis.nsf.net:nsfnet:gorebill.1991-txt
    anonymous password must be guest;
    can't "get nsfnet:gorebill.1991-txt",
    must "cd nsfnet; get gorebill.1991-txt"

oaklisp  f.gp.cs.cmu.edu:/usr/bap/oak/ftpable/
    can't "cd /usr; cd /bap";
    must "cd /usr/bap/oak/ftpable".

I agree that I'd like one token to specify the whole file reference if
that's possible, if the syntax is rich enough to handle these cases
but still human readable. I don't want to go to the ISO style
tag=value business if it's avoidable.
--Ed
rodney@dali.ipl.rpi.edu (Rodney Peck II) (06/27/91)
In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
[...]
>So it's best to have any number of fields, each meaning ``give this
>field to cd'', and then a final field meaning ``get this.'' A name like
>pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
>a simple name like readme.txt means don't do any cd's, just get
>readme.txt.
>
>---Dan

Except there are some sites which won't let you cd to the dirs in
between. You have to give the full dir path at once. How does the
format handle that?

You could try to cd to the first one, and then try to cd to the whole
thing (or more likely, the other way around).

--
Rodney
cmf851@anu.oz.au (Albert Langer) (06/27/91)
In article <EMV.91Jun25175321@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
>19910411185329 f nis.nsf.net,anonymous,guest . $read.me 1024 postmaster
>
>$date f $site $dir $file $size $owner
>$date w $site $dir $wildcard $owner
>$date d $site $dir $owner

A couple of suggestions:

1. The rcp format host:directory/file is well established and
marginally shorter. It extends to user@host:directory/file and you
could add a convention like user,password@host:directory/file (as well
as allowing wildcards in file or ending at / for a directory, which
makes the "f,w,d" type field redundant though it might as well stay).

2. I don't know if "owner" is really much use. If it is, then "group"
would probably be equally relevant.

3. (More important) Using space as the delimiter between fields does
not allow for file and directory names that include spaces (common
with Mac software). In modifying the Mark Moraes filters that you sent
me I got around this by using TAB to separate the site:directory/file
field from size and date fields. For comp.archives you would need
something like "" around any field that does contain a space (usually
omitted since the large majority don't contain a space, but at least
allow for it).

This may seem unimportant from current experience with anonymous ftp
archives, but tools developed will also be applied to archives that
are available from non-anonymous ftp and the way things are going all
files on any kind of machine are going to be mounted on the net
somewhere or other.

Also some efforts at classifying files in ftp archives may rely on
linking them to long descriptive directory names (with blanks). e.g.
The one line descriptions available from dls could simply be linked as
alternative names in another directory.

--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au
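The rcp-style convention and the TAB-separated listing suggested in this post could be parsed roughly as follows. This is a hedged Python sketch: `Location`, `parse_location` and `parse_listing_line` are hypothetical names, the column order (location, size, date) is an assumption since the post does not give one, and `mac.example.org` in the usage note is an invented host.

```python
from typing import NamedTuple, Optional, Tuple


class Location(NamedTuple):
    user: str                # defaults to "anonymous" when none is given
    password: Optional[str]  # None means any password is acceptable
    host: str
    path: str


def parse_location(text: str) -> Location:
    """Parse [user[,password]@]host:directory/file."""
    # Everything before the first ':' is the login and host part.
    head, colon, path = text.partition(":")
    if not colon:
        raise ValueError(f"no ':' in {text!r}")
    # Split on the last '@' so a password that is a mail address survives.
    login, _, host = head.rpartition("@")
    user, _, password = login.partition(",")
    return Location(user or "anonymous", password or None, host, path)


def parse_listing_line(line: str) -> Tuple[Location, int, str]:
    """Parse one TAB-separated line: location, size, date (assumed order)."""
    location, size, date = line.rstrip("\n").split("\t")
    return parse_location(location), int(size), date
```

Because only TAB separates the fields, a line such as `"mac.example.org:pub/My Mac Program.sit\t1024\t19910411185329"` keeps the spaces in the file name intact, which is the point of the third suggestion.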
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)
In article <=qflq6j@rpi.edu> rodney@dali.ipl.rpi.edu (Rodney Peck II) writes:
> In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >So it's best to have any number of fields, each meaning ``give this
> >field to cd'', and then a final field meaning ``get this.'' A name like
> >pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
> >a simple name like readme.txt means don't do any cd's, just get
> >readme.txt.
> Except there are some sites which won't let you cd to the dirs inbetween.
> You have to give the full dir path at once. How does the format handle
> that?

As I suggested in the first posting, slashes could be quoted.

---Dan
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)
In article <EMV.91Jun26151023@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
> unt-library-list vaxb.acs.unt.edu:[.library]
> can't "dir [.library]";
> must "cd [.library] ; dir"

  vaxb.acs.unt.edu:[.library]/

Meaning: cd [.library], then dir.

> gorebill nis.nsf.net:nsfnet:gorebill.1991-txt
> anonymous password must be guest;
> can't "get nsfnet:gorebill.1991-txt",
> must "cd nsfnet; get gorebill.1991-txt"

Similarly: nis.nsf.net:nsfnet/gorebill.1991-txt. *Yes*, that's a slash.
The slash is just a separator in the format meaning ``cd to this before
considering the rest of the name.''

> oaklisp f.gp.cs.cmu.edu:/usr/bap/oak/ftpable/
> can't "cd /usr; cd /bap";
> must "cd /usr/bap/oak/ftpable".

  f.gp.cs.cmu.edu:"usr/bap/oak/ftpable"/

The "" turn off any special meaning of the slashes, so this parses as
cd usr/bap/oak/ftpable, then dir.

Here, let me settle the quoting issues: "" and '' are both quoting
characters, and both completely equivalent. Absolutely no
interpretation goes on after a " or ' except to find the terminating
" or '. That covers all cases: you quote " as '"', and ' as "'".

And, for a final proof that this works:

  athena-dist.mit.edu:pub/kerberos5/"dist/xxxxxx"/krb5.src.tar.Z

(Sorry, folks, xxxxxx is a United States national secret. Not allowed
to say it in public.)

Do you see any problems, Ed?

---Dan
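The quoting rules stated in this post are small enough to sketch directly. The Python below is one possible reading, not Dan's code: `split_components` and `parse_name` are hypothetical names, an unterminated quote is treated as an error (the post does not say what should happen), and the site is taken to be everything before the first colon.

```python
from typing import List, Tuple


def split_components(path: str) -> List[str]:
    """Split on unquoted slashes; " and ' quote everything up to their mate."""
    parts = [""]
    i = 0
    while i < len(path):
        ch = path[i]
        if ch in "\"'":
            # No interpretation inside quotes except finding the terminator.
            end = path.find(ch, i + 1)
            if end == -1:
                raise ValueError(f"unterminated {ch} in {path!r}")
            parts[-1] += path[i + 1:end]
            i = end + 1
        elif ch == "/":
            # An unquoted slash ends one component and starts the next.
            parts.append("")
            i += 1
        else:
            parts[-1] += ch
            i += 1
    return parts


def parse_name(name: str) -> Tuple[str, List[str], str]:
    """Return (site, arguments for successive cd's, final name)."""
    site, colon, path = name.partition(":")
    if not colon:
        raise ValueError(f"expected site:path, got {name!r}")
    *dirs, target = split_components(path)
    return site, dirs, target
```

Under this reading, `f.gp.cs.cmu.edu:"usr/bap/oak/ftpable"/` gives a single cd argument `usr/bap/oak/ftpable` and an empty final name, while `nis.nsf.net:nsfnet/gorebill.1991-txt` gives one cd to `nsfnet` followed by the file name.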
chk@alias.com (C. Harald Koch) (06/28/91)
In <17493.Jun2607.22.3191@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> 19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30
>is both more readable and more accurate. Your separation between
>``directory'' and ``file'' is a mistake, because some operating systems
>can express neither the null directory nor more than one directory in a
>single command. It's much more logical to have a composite filename,
>where each component before a slash means ``change directory to this.''

Except you have the problem that many OSes use a different separator
character than /, and allow slashes in file names (most commonly used
for putting dates in filenames...).

Then there are OSes that don't have 'directories', they have 'disk
packs', and then there are the hybrids (e.g. VMS). Every designer seems
to want to create yet another incompatible syntax for filesystems...

The clearest way to handle all the different variations out there is to
keep directory and file information separate.

Remember, all the world's not UNIX.

--
C. Harald Koch VE3TLA Alias Research, Inc., Toronto ON Canada
Internet: chk@alias.com chk@gpu.utcs.toronto.edu chk@chk.mef.org
"I think you curdled my Pepsi!"-Gerry Smit, in response to sickening cuteness
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/29/91)
In article <1991Jun27.180424.3522@alias.com> chk@alias.com (C. Harald Koch) writes:
> Except you have the problem that many OSes use a different separator
> character than /,

That doesn't affect the format. A / in the format doesn't mean you send
/ to the remote ftp server; it means you cd to the thing before the /.
See my response to Ed's examples of what the directory-file format
can't handle.

> and allow slashes in file names (most commonly used for
> putting dates in filenames...).

That does affect the format, but the quoting rules I defined handle it.

  ftp.foo.com:"this is a filename with /s and 's"' and "s'/bar.

> Then there are OSes that don't have 'directories', they have 'disk packs',
> and then there are the hybrids (e.g. VMS). Every designer seems to want to
> create yet another incompatible syntax for filesystems...

If you can't do it with cd this, cd that, cd the other thing, and then
get the file, then you simply cannot get the file via ftp. The format
as I've defined it lets you cd this, cd that, cd the other thing, and
then get the file, and you can put arbitrary characters anywhere. It
works for VMS. It works for TOPS-20. It works for MS-DOS. It even works
for VM/CMS, not that I've seen any IBM anonymous ftp sites.

> The clearest way to handle all the different variations out there is to keep
> directory and file information separate.

Uh, no. That *cannot* handle certain files which rcp format (with the
multiple cd semantics) can, and I fail to see why you consider it
``cleaner'' than what's obviously a more general and more accurate
representation of reality.

> Remember, all the world's not UNIX.

That's what I've been trying to say. Trying to force everything into a
one-cd, one-get model is a UNIX-centric view that we should not impose
on a general archive format.

Ed, what do you think?

---Dan