[comp.archives.admin] archive normal form

emv@msen.com (Ed Vielmetti) (06/26/91)

Let me explain and justify this format for recording archive
information, in the hope that it'll get wide use.  I'll be converting
the "MSEN Archive Service file verification" section of comp.archives
postings to this format so I want to get it right.

I believe I've sketched out something which would let you verify every
file out there that's available for anonymous or semi-public FTP.

Here's a sample.

19910520220500 f export.lcs.mit.edu /contrib ups-2.31.tar.Z 963435 ftp,30
19910520223700 f export.lcs.mit.edu /contrib ups-2.31.README 9758 ftp,30
19910528125400 f ftp.uu.net /tmp ups-song.ms 1237 ftp,21
19910529094400 f ftp.uu.net /tmp ups-song.au 1456032 ftp,21
19910530012300 d msen.com /debug/ups emv,case
19910624112800 w prep.ai.mit.edu /pub/gnu gawk* bug-gnu-utils
19910411185329 f nis.nsf.net,anonymous,guest . $read.me 1024 postmaster

$date f $site $dir $file $size $owner
$date w $site $dir $wildcard $owner
$date d $site $dir $owner

Here's what the fields are.

19910520220500	$date: Date in ISO 3307 format (YYYYMMDDhhmmss[.xxxxxx]) 
		Sorts easily, easy to parse, microsecond resolution,
		will be used in NNTP2.
(d,f,w)		$type: directory, file, or wildcard specification.
		Directories have ($site, $dir, $owner).
		Wildcards   have ($site, $dir, $wildcard, $owner).
		Files       have ($site, $dir, $file, $size, $owner).
		Any other type can be added as long as suitable
		definitions for the following fields can be provided.
export.lcs.mit.edu	$site: System name.
		The default assumption is login with user "anonymous", any
		password acceptable.  The entry for nis.nsf.net shows
		a case where something other than the default is
		needed; there the field is interpreted as
		  ($site, $user, $pass)
		or even
		  ($site, $user, $pass, $acct)
XXX should ascii,binary be in here somewhere too?
XXX should some notion of explicit file types be here?

There are three sorts of things to describe: files, directories, and
wildcard specifications.  Files are operated on with the DIR, GET, and
MGET commands.  Directories are operated on with the CD and DIR
commands.  Wildcards are operated on with the DIR and MGET commands.
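
As a rough illustration of the paragraph above, here is a minimal sketch that maps a record type to the ftp client commands it implies. The function name `client_commands` is hypothetical, and the exact order of commands a verifier would issue is an assumption, not something the format specifies.

```python
# Hypothetical helper: the ftp client commands implied by each record type.
# Assumption: "." in the directory field means "issue no cd at all".
def client_commands(rtype, directory, name=None):
    """Return a list of interactive ftp commands for one record."""
    cmds = []
    if directory != ".":
        cmds.append("cd " + directory)
    if rtype == "d":
        # Directories: CD and DIR.
        cmds.append("dir")
    elif rtype == "f":
        # Files: DIR to check size and date, then GET.
        cmds += ["dir " + name, "get " + name]
    elif rtype == "w":
        # Wildcards: DIR to list the matches, then MGET.
        cmds += ["dir " + name, "mget " + name]
    else:
        raise ValueError("unknown record type: %r" % (rtype,))
    return cmds


if __name__ == "__main__":
    print(client_commands("w", "/pub/gnu", "gawk*"))
```

The sketch leaves out MGET for plain files, since a single GET is enough to fetch one named file.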

gawk*		$wildcard: wildcard file specification.  Specified according
		to local host conventions.  The $date variable for this
		should be set to the latest change date for the files
		that match the wildcard file specification.
/contrib	$dir: Directory.  cd to this directory.  In the special
		case where no cd command needs to be (or should be)
		issued, use "." and treat it as a no-op.
ups-song.au	$file: File name.  
1456032		$size: File size, in bytes, when properly transferred to
		a different machine.  Systems which report sizes in blocks
		need to translate.
emv,case	$owner: Comma-separated list of owners.  The first one should
		also be a mail address (e.g. emv@archive.msen.com).  The
		rest can be group names, mail addresses, or any other
		key words, tags, or identifiers you choose to use.
XXX this is pretty wide open....
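
A minimal sketch of a reader for the records described above, assuming fields are separated by whitespace and contain no embedded spaces. The names `Record`, `parse_date`, and `parse_record` are hypothetical, and the optional $acct part of the site field is ignored here.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; field names follow the $variables in the post.
Record = namedtuple(
    "Record", "date type site user password dir name size owners")


def parse_date(stamp):
    """Parse YYYYMMDDhhmmss with an optional .xxxxxx fraction."""
    fmt = "%Y%m%d%H%M%S.%f" if "." in stamp else "%Y%m%d%H%M%S"
    return datetime.strptime(stamp, fmt)


def parse_record(line):
    """Parse one 'f', 'w', or 'd' line into a Record."""
    fields = line.split()
    date, rtype = parse_date(fields[0]), fields[1]
    # $site may carry a non-default login as site,user,pass[,acct].
    site_parts = fields[2].split(",")
    site = site_parts[0]
    user = site_parts[1] if len(site_parts) > 1 else "anonymous"
    password = site_parts[2] if len(site_parts) > 2 else None
    directory = fields[3]
    name = size = None
    if rtype == "f":
        name, size, owners = fields[4], int(fields[5]), fields[6]
    elif rtype == "w":
        name, owners = fields[4], fields[5]
    elif rtype == "d":
        owners = fields[4]
    else:
        raise ValueError("unknown record type: %r" % (rtype,))
    return Record(date, rtype, site, user, password,
                  directory, name, size, owners.split(","))


if __name__ == "__main__":
    print(parse_record(
        "19910529094400 f ftp.uu.net /tmp ups-song.au 1456032 ftp,21"))
```

Because the date is a fixed-width digit string, records can also be sorted chronologically as plain text without parsing them at all.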

-- 
Edward Vielmetti, MSEN Inc. 	moderator, comp.archives 	emv@msen.com

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/26/91)

In article <EMV.91Jun25175321@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
> 19910520220500 f export.lcs.mit.edu /contrib ups-2.31.tar.Z 963435 ftp,30

Methinks

  19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30

is both more readable and more accurate. Your separation between
``directory'' and ``file'' is a mistake, because some operating systems
can express neither the null directory nor more than one directory in a
single command. It's much more logical to have a composite filename,
where each component before a slash means ``change directory to this.''
foo/bar/blah means ``cd foo, then cd bar, then get blah.'' Of course,
you can pile all the cd's into one if you're talking to a UNIX server,
but automated programs shouldn't depend on knowing the remote system
type. It's probably necessary to add quoting ("/", perhaps), for cases
like the Kerberos distribution where you simply have to combine two cd's
into one or you hit a wall.

You had better take into account that a lot of people don't have DNS.
Also, some systems don't have meaningful dates or owners but still
support ftp.

> 19910530012300 d msen.com /debug/ups emv,case
  19910530012300 d msen.com:debug/ups/ emv,case

The blank filename means that you're referring to the directory.

> 19910624112800 w prep.ai.mit.edu /pub/gnu gawk* bug-gnu-utils
  19910624112800 w prep.ai.mit.edu:pub/gnu/gawk* bug-gnu-utils

> XXX should ascii,binary be in here somewhere too?

Probably. Maybe introduce an ``E'' type, just like f but for EBCDIC,
meaning that you shouldn't use binary. (I'm half serious.)

---Dan

worley@compass.com (Dale Worley) (06/26/91)

In article <17493.Jun2607.22.3191@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
   Methinks

     19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30

   is both more readable and more accurate. Your separation between
   ``directory'' and ``file'' is a mistake, because some operating systems
   can express neither the null directory nor more than one directory in a
   single command. It's much more logical to have a composite filename,

However, FTP is oriented to "directory and file name" access -- you CD
to a directory, then you GET a file.  Thus, it's best to have two
fields -- the first one is "the argument to give to CD" and the second
is "the argument to give to GET".

Logically, it's still losing, but it corresponds more closely with the
model FTP presents.

Dale Worley		Compass, Inc.			worley@compass.com
--
The best way to demand something unreasonable is to call it a "freedom".

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)

In article <WORLEY.91Jun26104902@sn1987a.compass.com> worley@compass.com (Dale Worley) writes:
> However, FTP is oriented to "directory and file name" access -- you CD
> to a directory, then you GET a file.  Thus, it's best to have two
> fields -- the first one is "the argument to give to CD" and the second
> is "the argument to give to GET".

Sorry I didn't make myself clear. The real ftp model is that you do zero
or more cd's, then a get. That's what ftp supports, after all...

Why doesn't the single-directory, single-filename model work? Because
there are some operating systems where you *have* to do cd foo, then cd
bar, then get blah. You *cannot* do a combined cd and then get blah, as
you can under UNIX. Furthermore, there are many operating systems which
have no way to state ``current directory''. You *cannot* do a ``null
cd'' and then get blah, as you can under UNIX.

So it's best to have any number of fields, each meaning ``give this
field to cd'', and then a final field meaning ``get this.'' A name like
pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
a simple name like readme.txt means don't do any cd's, just get readme.txt.
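
A minimal sketch of the multiple-cd reading of a composite name, ignoring quoting entirely. The function name `plan` is hypothetical; treating an empty final component as "list the directory" follows the earlier remark that a blank filename refers to the directory.

```python
def plan(ref):
    """Turn host:a/b/c into (host, ["cd a", "cd b", "get c"])."""
    host, _, path = ref.partition(":")
    # Every component before the last one is handed to cd, one at a time.
    *dirs, name = path.split("/")
    steps = ["cd " + d for d in dirs]
    # A blank final component means the reference is to the directory itself.
    steps.append("get " + name if name else "dir")
    return host, steps


if __name__ == "__main__":
    print(plan("export.lcs.mit.edu:contrib/ups-2.31.tar.Z"))
```

Note that the slash here is only a separator inside the reference; nothing in the sketch sends a "/" to the remote server.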

---Dan

emv@msen.com (Ed Vielmetti) (06/27/91)

In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

   Sorry I didn't make myself clear. The real ftp model is that you do zero
   or more cd's, then a get. That's what ftp supports, after all...

To give an idea of what the pathological cases are, here are some spots
which I can't currently verify (with my automated verification
tools) because they require site-specific knowledge which my current
scheme is missing.

unt-library-list        vaxb.acs.unt.edu:[.library]
	can't "dir [.library]";
	must "cd [.library] ; dir"
netlib		research.att.com:/netlib/
	different login and password to get to netlib
gorebill        nis.nsf.net:nsfnet:gorebill.1991-txt
	anonymous password must be guest;
	can't "get nsfnet:gorebill.1991-txt",
	must "cd nsfnet; get gorebill.1991-txt"
oaklisp 	f.gp.cs.cmu.edu:/usr/bap/oak/ftpable/
	can't "cd /usr; cd /bap";
	must "cd /usr/bap/oak/ftpable".

I agree that I'd like one token to specify the whole file reference if
that's possible, provided the syntax is rich enough to handle these
cases while staying human readable.  I don't want to go to the ISO
style tag=value business if it's avoidable.

--Ed

rodney@dali.ipl.rpi.edu (Rodney Peck II) (06/27/91)

In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
[...]
>So it's best to have any number of fields, each meaning ``give this
>field to cd'', and then a final field meaning ``get this.'' A name like
>pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
>a simple name like readme.txt means don't do any cd's, just get readme.txt.
>
>---Dan

Except there are some sites which won't let you cd to the dirs in between.
You have to give the full dir path at once.  How does the format handle
that?  

You could try to cd to the first one, and then try to cd to the whole thing
(or more likely, the other way around).
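
A minimal sketch of that try-one-then-the-other idea, assuming the whole path is attempted first. The name `change_dir` is hypothetical, `cwd` stands for any callable that raises on failure (for example the `cwd` method of a Python `ftplib.FTP` object), and joining components with "/" is a UNIX-flavoured assumption that will not suit every server.

```python
def change_dir(cwd, components):
    """Try one combined cd; if that fails, cd one component at a time.

    Returns "combined" or "stepwise" to say which approach worked.
    Any exception from the stepwise attempt is left to propagate.
    """
    try:
        cwd("/".join(components))
        return "combined"
    except Exception:
        # Assumption: a failed cd leaves the remote directory unchanged,
        # so it is safe to start again from the same place.
        pass
    for component in components:
        cwd(component)
    return "stepwise"


if __name__ == "__main__":
    # Stand-in for a server that rejects any path containing a slash.
    def picky_cwd(path):
        if "/" in path:
            raise RuntimeError("550 no such directory")

    print(change_dir(picky_cwd, ["pub", "foo"]))
```

The guess costs at most one extra round trip per reference, but it only helps when one of the two strategies works; it does not record which one a site needs.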

-- 
Rodney

cmf851@anu.oz.au (Albert Langer) (06/27/91)

In article <EMV.91Jun25175321@bronte.aa.ox.com> emv@msen.com 
(Ed Vielmetti) writes:

>19910411185329 f nis.nsf.net,anonymous,guest . $read.me 1024 postmaster
>
>$date f $site $dir $file $size $owner
>$date w $site $dir $wildcard $owner
>$date d $site $dir $owner

A couple of suggestions:

1. The rcp format host:directory/file is well established and marginally
shorter. It extends to user@host:directory/file and you could add
a convention like user,password@host:directory/file (as well as allowing
wildcards in file or ending at / for a directory, which makes the
"f,w,d" type field redundant though it might as well stay).

2. I don't know if "owner" is really much use. If it is, then "group"
would probably be equally relevant.

3. (More important) Using space as the delimiter between fields
does not allow for file and directory names that include spaces
(common with Mac software). In modifying the Mark Moraes filters
that you sent me I got around this by using TAB to separate the
site:directory/file field from size and date fields. For comp.archives
you would need something like "" around any field that does contain
a space (usually omitted since the large majority don't contain
a space, but at least allow for it).
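
A minimal sketch of point 3, assuming fields containing spaces are wrapped in double quotes and everything else is separated by whitespace. It leans on Python's `shlex.split`, which also applies shell-style rules for apostrophes and backslashes that the suggestion above does not ask for; the function name `split_fields` and the host `mac.example.com` are hypothetical.

```python
import shlex


def split_fields(line):
    """Split a record into fields, keeping "quoted fields" in one piece."""
    # shlex.split treats "..." as a single field and drops the quotes.
    # Caveat: it also interprets ' and \ the way a POSIX shell would.
    return shlex.split(line)


if __name__ == "__main__":
    line = ('19910520220500 f mac.example.com '
            '"/pub/My Folder" "Read Me" 1024 ftp')
    print(split_fields(line))
```

Records without quotes come out exactly as a plain whitespace split would give them, so existing entries would not need to change.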

This may seem unimportant from current experience with anonymous
ftp archives, but tools developed will also be applied to archives
that are available from non-anonymous ftp and the way things are
going all files on any kind of machine are going to be mounted on
the net somewhere or other.

Also some efforts at classifying files in ftp archives may rely
on linking them to long descriptive directory names (with blanks).
e.g. The one line descriptions available from dls could simply
be linked as alternative names in another directory.

--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)

In article <=qflq6j@rpi.edu> rodney@dali.ipl.rpi.edu (Rodney Peck II) writes:
> In article <20665.Jun2617.26.1691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >So it's best to have any number of fields, each meaning ``give this
> >field to cd'', and then a final field meaning ``get this.'' A name like
> >pub/foo/bar reflects this perfectly: cd pub, cd foo, get bar. Similarly,
> >a simple name like readme.txt means don't do any cd's, just get readme.txt.
> Except there are some sites which won't let you cd to the dirs in between.
> You have to give the full dir path at once.  How does the format handle
> that?  

As I suggested in the first posting, slashes could be quoted.

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/27/91)

In article <EMV.91Jun26151023@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
> unt-library-list        vaxb.acs.unt.edu:[.library]
> 	can't "dir [.library]";
> 	must "cd [.library] ; dir"

vaxb.acs.unt.edu:[.library]/

Meaning: cd [.library], then dir.

> gorebill        nis.nsf.net:nsfnet:gorebill.1991-txt
> 	anonymous password must be guest;
> 	can't "get nsfnet:gorebill.1991-txt",
> 	must "cd nsfnet; get gorebill.1991-txt"

Similarly: nis.nsf.net:nsfnet/gorebill.1991-txt. *Yes*, that's a slash.
The slash is just a separator in the format meaning ``cd to this before
considering the rest of the name.''

> oaklisp 	f.gp.cs.cmu.edu:/usr/bap/oak/ftpable/
> 	can't "cd /usr; cd /bap";
> 	must "cd /usr/bap/oak/ftpable".

f.gp.cs.cmu.edu:"usr/bap/oak/ftpable"/

The "" turn off any special meaning of the slashes, so this parses as
cd usr/bap/oak/ftpable, then dir.

Here, let me settle the quoting issues: "" and '' are both quoting
characters, and both completely equivalent. Absolutely no interpretation
goes on after a " or ' except to find the terminating " or '. That
covers all cases: you quote " as '"', and ' as "'".
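
A minimal sketch of these quoting rules, applied to the part of a reference after the host and colon. The function name `split_components` is hypothetical, and rejecting an unterminated quote is an assumption, since the rules above do not say what should happen in that case.

```python
def split_components(path):
    """Split on unquoted '/'; "..." and '...' quote with no escapes."""
    parts, current, quote = [], [], None
    for ch in path:
        if quote:
            # Inside quotes nothing is special except the terminator.
            if ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in "\"'":
            quote = ch
        elif ch == "/":
            # An unquoted slash ends one component and starts the next.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if quote:
        raise ValueError("unterminated quote in %r" % (path,))
    parts.append("".join(current))
    return parts


if __name__ == "__main__":
    print(split_components('"usr/bap/oak/ftpable"/'))
```

For the oaklisp reference this yields one directory component, `usr/bap/oak/ftpable`, followed by an empty final component, which reads as a single cd and then a dir.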

And, for a final proof that this works:

  athena-dist.mit.edu:pub/kerberos5/"dist/xxxxxx"/krb5.src.tar.Z

(Sorry, folks, xxxxxx is a United States national secret. Not allowed to
say it in public.)

Do you see any problems, Ed?

---Dan

chk@alias.com (C. Harald Koch) (06/28/91)

In <17493.Jun2607.22.3191@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>  19910520220500 f export.lcs.mit.edu:contrib/ups-2.31.tar.Z 963435 ftp,30

>is both more readable and more accurate. Your separation between
>``directory'' and ``file'' is a mistake, because some operating systems
>can express neither the null directory nor more than one directory in a
>single command. It's much more logical to have a composite filename,
>where each component before a slash means ``change directory to this.''


Except you have the problem that many OSes use a different separator
character than /, and allow slashes in file names (most commonly used for
putting dates in filenames...).

Then there are OSes that don't have 'directories', they have 'disk packs',
and then there are the hybrids (e.g. VMS). Every designer seems to want to
create yet another incompatible syntax for filesystems...

The clearest way to handle all the different variations out there is to keep
directory and file information separate.

Remember, all the world's not UNIX.

--
C. Harald Koch  VE3TLA                Alias Research, Inc., Toronto ON Canada
Internet:    chk@alias.com      chk@gpu.utcs.toronto.edu      chk@chk.mef.org
"I think you curdled my Pepsi!"-Gerry Smit, in response to sickening cuteness

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/29/91)

In article <1991Jun27.180424.3522@alias.com> chk@alias.com (C. Harald Koch) writes:
> Except you have the problem that many OSes use a different separator
> character than /,

That doesn't affect the format. A / in the format doesn't mean you send
/ to the remote ftp server; it means you cd to the thing before the /.
See my response to Ed's examples of what the directory-file format can't
handle.

> and allow slashes in file names (most commonly used for
> putting dates in filenames...).

That does affect the format, but the quoting rules I defined handle it.
ftp.foo.com:"this is a filename with /s and 's"' and "s'/bar.

> Then there are OSes that don't have 'directories', they have 'disk packs',
> and then there are the hybrids (e.g. VMS). Every designer seems to want to
> create yet another incompatible syntax for filesystems...

If you can't do it with cd this, cd that, cd the other thing, and then
get the file, then you simply cannot get the file via ftp. The format as
I've defined it lets you cd this, cd that, cd the other thing, and then
get the file, and you can put arbitrary characters anywhere. It works
for VMS. It works for TOPS-20. It works for MS-DOS. It even works for
VM/CMS, not that I've seen any IBM anonymous ftp sites.

> The clearest way to handle all the different variations out there is to keep
> directory and file information separate.

Uh, no. That *cannot* handle certain files which rcp format (with the
multiple cd semantics) can, and I fail to see why you consider it
``cleaner'' than what's obviously a more general and more accurate
representation of reality.

> Remember, all the world's not UNIX.

That's what I've been trying to say. Trying to force everything into
a one-cd, one-get model is a UNIX-centric view that we should not impose
on a general archive format.

Ed, what do you think?

---Dan