[comp.archives.admin] New version of my ftpd is now available

cmf851@anu.oz.au (Albert Langer) (06/16/91)

(Relevance to comp.unix.sysv386 and comp.os.coherent explained at end.
Follow-ups only to comp.archives.admin, please.)

The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
ftpd for BSD systems, ftp can no longer directly run ls.

I have been working on some software (originally by Tony Moraes,
supplied by Ed Vielmetti) which relies on using ls -alR to obtain a
recursive directory listing. (I believe it could be used to greatly reduce
internet traffic, especially at peak times and on international links.
Cheap unix 386 boxes or very cheap Coherent unix lookalikes could 
provide a "transparent" cache/relay service.) Should I assume 
that ftp use of ls -alR will soon be broken on most BSD systems?

I have the impression that Ed's comp.archives listings may also rely
on use of ls for the verification, and perhaps archie does too.  The use
of Tim Cook's (timcc@admin.viccol.edu.anu) "dls" package with ftp to
provide one-line descriptions of files instead of an ordinary ls listing
likewise relies on ftpd calling a local bin/ls within the chroot area
(though I think that part is unchanged).

To avoid possible disruption of important services, I hope those concerned
will check out whether any changes will be needed.

While I am at it, here's an update on the software I have been
working on, in case anybody wants to do something with it, as I won't
be able to do much more. Sysv386 and Coherent users may especially
be interested in the possibilities for conveniently overcoming the
problem of long filenames when transferring files from BSD systems,
and for providing adequate ftpmail gateways to UUCP sites that are
not on the internet.

(Thanks for sending Tony's stuff Ed, sorry about the delay
getting back to you on it.) 

The unreleased collection of shell scripts from Tony Moraes included:

1. ftpfilter. This takes the output of an ftp (or other) ls -AlR command
and converts it into a format like that used by "find", so that grepping
the output can extract a full path as well as the required file names.
(File size and a numeric date are also included on each line, with
conversion of the different formats ls uses for older and recent dates.)
A rough sketch of the idea follows this list.

2. ftpgrep. This greps through compressed output files from ftpfilter
stored in a file named after each host, and results in an output where
each full path is prefixed by a hostname and colon in the style of rcp.

3. grabfiles. This uses an rcp/ftpgrep style input to actually get the
files with ftp.
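
To give a feel for how the pieces fit together, here is a rough sketch
of the ftpfilter idea and of feeding its output to grabfiles.  These are
not the actual scripts: the awk assumes nine-field BSD "ls -l" lines
(file names with spaces would break it), it leaves out the numeric date
conversion the real ftpfilter does, and the host name and search pattern
are only examples.

        # sketch only -- not the real ftpfilter
        ls -lR | awk '
        /:$/      { dir = substr($0, 1, length($0) - 1) "/"; next }  # "subdir:" header
        /^total / { next }                                           # skip "total nnn"
        NF >= 9 && $1 ~ /^[-dl]/ {
            # BSD ls fields: mode links owner group size month day time-or-year name
            printf "%s%s %s %s %s %s\n", dir, $9, $5, $6, $7, $8
        }' > listing

        # ftpgrep-style lookup over a compressed per-host listing; output is
        # rcp-style "host:path" lines, ready to feed to grabfiles
        zcat archie.au.Z | grep gnuplot | awk '{ print "archie.au:" $1 }' > wanted
        grabfiles < wanted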

The main changes I have made are:

1. Horrendous butchery of the code, for which nobody else is responsible.

2. Ability to handle paths with multiple leading, trailing and embedded 
spaces (mainly to get at collections of Mac software mounted for NFS access).

3. Porting to work on Sys V 3.2 (ISC 386ix 2.0.2) as well as on SunOS 4.1
(and to work for ftp in either direction between the two).

4. Item 3 required an optional conversion of BSD directory and file names
longer than 14 characters into Sys V equivalents, to avoid collisions
between truncated names. Rather than use the "shrink" package, which
produces unintelligible names, I decided to preserve the full original
names but convert them into file paths with no segment longer than 14
characters (a sketch of the idea appears after this list). This wastes
some inodes and nearly empty blocks, but the files can always be linked
to another name later anyway. Host names were also converted to a
sub-directory for each domain.

5. Items 3 & 4 were so "hairy" that I had to build in comprehensive
testing which compares the files originally requested with those actually
obtained and keeps detailed logs and mails up to 4 lists to the user:
	a) Files ok and of length specified.
	b) Files obtained for which no length was specified.
	c) Files obtained but length different from request.
	d) Files requested but not obtained.
The conversion "works" pretty reliably now but I find the mail very
reassuring when I just run a long request in background. If the ftp
connection gets lost in the middle of a file I will be told (by the
wrong file size) and there is no need to manually review any transcripts
to see if there were ftp failures. This greatly simplifies life. It
could easily be extended to reprocess list d) when a request has been
diverted to a cache or shadow/mirror site and try again elsewhere
(and also to automatically recover from dropped ftp sessions etc).

6. A script was added to process the output from archie prog commands
into the same rcp/ftpgrep style format, using the cliplines.c
included in the log_archie package recently distributed in alt.sources.
This can easily be developed into a similar facility for comp.archives
messages.

7. A single directory tree of domains and hosts (with each site's own directory
trees underneath) was established for use by all components so that once a 
file has been obtained, its local availability can easily be
confirmed and future requests can be directed there. (A simple cron job
to find and delete files that have not been accessed for a certain period,
sketched after this list, makes this an effective cache.)

8. Where a specific username and password give more access to a site
than "anonymous" does, they are recorded in the database tree and
used automatically. The same mechanism can be used for other
special processing (e.g. variant ls commands).
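
As an illustration of the name conversion in item 4, the transformation
amounts to something like the following (a sketch only, not the script I
actually use; "splitname" is just a made-up name):

        # split any path component longer than 14 characters into nested
        # components of at most 14 characters each, so nothing gets
        # truncated on a Sys V filesystem
        splitname() {
            echo "$1" | awk -F/ '{
                out = ""
                for (i = 1; i <= NF; i++) {
                    seg = $i
                    while (length(seg) > 14) {
                        out = out substr(seg, 1, 14) "/"
                        seg = substr(seg, 15)
                    }
                    out = out seg
                    if (i < NF) out = out "/"
                }
                print out
            }'
        }
        # e.g. splitname pub/a-very-long-gnu-distribution-name.tar.Z
        # gives pub/a-very-long-gn/u-distribution/-name.tar.Z

For the cache expiry mentioned in item 7, a nightly cron entry along
these lines would do (the path and the 30-day cutoff are only examples):

        find /usr/spool/ftpcache -type f -atime +30 -exec rm -f {} \;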

Anyway, it seems to be doing more or less what I want it to in heavy
duty file transfers between two machines. It is horribly slow (in
offline processing), undocumented and clumsily written, but I believe 
a more competent programmer could easily develop it into a "production" 
package that could be released and would also:

a) Batch ftp requests so they are handled during offpeak hours, while
still providing immediate and reliable feedback about availability (see
the sketch after this list).

b) Automatically divert requests for files that have previously been
obtained, and fill them from a cache on the same machine.

c) Extend the above to easy maintenance of shadow or mirror archives.

d) Automatically divert requests to other local cache or shadow/mirror
sites and follow up with further requests if unsuccessful.

e) Do all this on a cheap sysv386 system or very cheap Coherent system
(as well as on any other Sys V or BSD unix). By sticking to simple
shell scripts and doing filename conversion the hard way I think I
have made sure a port to Coherent would not be too difficult. A
Coherent box could provide cache services to UUCP users, even if it
had to obtain its files by UUCP from a cooperating BSD system that
had ftp access to the internet  (but only maintained a small cache
of its own with a short timeout and was not interested in providing
modem access to others). With a port of TCP/IP etc to Coherent, the
Coherent box could do the whole job (with filename conversions so
that the 14 character converted names appeared as full BSD names
from outside). Likewise it should be feasible to port to MSDOS
with some further filename manipulations. This could be handy for
ftp requests to BSD systems from MSDOS users.
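
For item a), at(1) alone would handle the deferral.  As a sketch (the
request file name is made up; grabfiles is the script described
earlier), and since at mails the job's output back to the user, some of
the feedback comes for free:

        # made-up request file; at(1) mails the job's output back to the user
        echo "grabfiles < $HOME/requests/tonight.list" | at 2am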

Unnecessary internet traffic could be greatly reduced simply by providing
a (fairly trivial) utility to process comp.archives messages into the
required form, with automatic diversion to nearby cache or
shadow/mirror sites.

Users would only notice the convenience, without noticing the diversion
(whereas asking users to "check local cache and mirror archives first", 
as is done in Australia, seems somewhat optimistic).

Peak traffic could be reduced further with an option to delay the
request until offpeak times. While that WOULD be noticed by the user,
it could be made much more acceptable by immediate mailed confirmation
of the request and subsequent mail notification of success.

Gateways like ftpmail and netfetch could easily enforce both diversion
to cache and mirror sites and offpeak use for mail requests (while
again providing immediate feedback).

Further substantial reductions in traffic could be achieved by providing
larger capacity cache and mirror sites and locating them more at national and
regional gateways that have expensive links to the rest of the internet.

Ultimately this should be designed into a new internet protocol for a
cached and delayed ftp service, one that uses network store-and-forward
resources for file transfers with substantial storage times as well as
for millisecond-delay packet switching.

In the meantime an application layer kludge seems well worthwhile.
Disk space is now only USD $2 per MB and the Coherent operating system 
is only USD $100. It seems absurd to pay tens of thousands of dollars per
month for higher speed international links instead of providing adequate
caching.

There was a good deal of discussion in news.admin recently about the
problems caused to mail relays by the BITFTP gateway, and about the
problems caused to UUCP sites that are not part of the internet by
the closing down of that gateway. Installation of cheap caches
should substantially relieve the problems of mail relays.
If necessary it should not be too difficult to develop an accounting
system to pay for the disk space (and modem traffic) by charges
to the users.

Well, if anybody wants to take this up I'll be happy to pass on
code that does what I need and could easily be developed into a
"production" system with the extra features mentioned above. I'm
just too rusty at awk and shell programming to finish the job in a reasonable
amount of time and I have to get on with other priorities, but I'm sure 
anybody reasonably competent would have little trouble producing a worthwhile,
releasable package quite quickly.

(And whoever takes it on can also do the worrying about how to get
recursive directory listings if ftpd is changing the access to ls :-)

--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au

emv@msen.com (Ed Vielmetti) (06/17/91)

In article <1991Jun16.092611.15695@newshost.anu.edu.au> cmf851@anu.oz.au (Albert Langer) writes:

   I have been working on some software (originally by Tony Moraes,
   supplied by Ed Vielmetti) which relies on using ls -alR to obtain a

That would be Mark Moraes, please make sure you give him proper credit!

Other than that small detail, your changes sound like they're working
in the right direction.  Just to be sure that there aren't a zillion
versions of Mark's so far unreleased code floating around, it might be
best not to post these scripts all that widely to the net.

   Further substantial reductions in traffic could be achieved by
   providing larger capacity cache and mirror sites and locating them
   more at national and regional gateways that have expensive links to
   the rest of the internet.

Well, that's a nice thought, but I don't believe that you're going to
get any substantial amount of funding for that purpose any time soon,
at least not from public sources.  They seem to be more interested in
subsidizing bandwidth, not building applications which would add value
to the network and help people use it.  On this side of the Pacific
we're starting to hear the T3! T3! T3! chants, as if simply
transporting more bits around would make the network better.  Compare
the cost of T3 lines with the meager resources being thrown at (or not
thrown at) archivist work and other projects designed to add
organization to the net, and it's rather discouraging.

-- 
Edward Vielmetti, moderator, comp.archives, emv@msen.com

"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
			"High-Performance Computing Act of 1991, S. 272"

cmf851@anu.oz.au (Albert Langer) (06/18/91)

In article <EMV.91Jun17014339@bronte.aa.ox.com> emv@msen.com 
(Ed Vielmetti) writes:

>That would be Mark Moraes, please make sure you give him proper credit!

Whoops! Yes, the files say Mark, and your message accompanying them said
Mark, I guess there is no special reason I should continue calling him
Tony :-) 

I will also take your earlier advice to get in touch with him.

>   Further substantial reductions in traffic could be achieved by
>   providing larger capacity cache and mirror sites and locating them
>   more at national and regional gateways that have expensive links to
>   the rest of the internet.
>
>Well, that's a nice thought, but I don't believe that you're going to
>get any substantial amount of funding for that purpose any time soon,
>at least not from public sources.  They seem to be more interested in
>subsidizing bandwidth, not building applications which would add value
>to the network and help people use it.  On this side of the Pacific
>we're starting to hear the T3! T3! T3! chants, as if simply
>transporting more bits around would make the network better.  Compare
>the cost of T3 lines with the meager resources being thrown at (or not
>thrown at) archivist work and other projects designed to add
>organization to the net, and it's rather discouraging.

I got a note from an AARNET coordinator saying they were planning some
high capacity cache sites (and might be interested in the software),
so the situation isn't all that bleak. (Though reducing traffic is
closer to increasing bandwidth than it is to adding value etc.)

Still, I agree with your general theme. The benefits from funds
diverted to archiving would greatly exceed those from further 
increases in bandwidth etc. I suspect it is partly a problem of
the bandwidth returns being easier to quantify, or just that installing
bandwidth is more straightforward (and does not involve so much in
the way of policy complications about whose needs get priority etc).

Another aspect is the usual problem of funding "public goods", with
the not unusual result of "private affluence and public squalor".

The recent news on WAIS and PROSPERO looks very promising as regards
solutions to most of the technical barriers to accessing network
resources. But no amount of automatic searching can fully substitute
for proper cataloging. We still need simple things like keeping
track of version numbers and patches etc to save people a lot of time.

--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au

moraes@cs.toronto.edu (Mark Moraes) (06/18/91)

To stave off the mail messages:  The original version of the scripts
Albert refers to can be found on ftp.cs.toronto.edu as
pub/ftptools.shar.Z.  There's also a modified ftp client there -- it
creates the intervening directories (like mkdir -p) locally if
necessary.  (I used them for a long time with a vanilla ftp client,
but got tired of the nuisance.)  If you make fixes or improvements to
them, I'd appreciate a copy.
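
The local directory creation amounts to something like this in sh (a
sketch of the "mkdir -p" idea for relative paths, not the code in the
modified client; the function name is made up):

        # create each intervening directory of a relative path in turn
        mkpath() {
            dir=
            for seg in `echo "$1" | tr '/' ' '`
            do
                dir="$dir$seg"
                [ -d "$dir" ] || mkdir "$dir"
                dir="$dir/"
            done
        }
        # e.g. mkpath pub/gnu/emacs creates pub, pub/gnu and pub/gnu/emacs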

>The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
>ftpd for BSD systems, ftp can no longer directly run ls.

All anon. ftp sites providing a regularly updated ls-lR.Z in ~ftp
would be a better solution anyway -- some sites may prefer it to
having other people grind their disks with a periodic "ls -lR".  (If a
hundred people started running archie servers, I wouldn't like to
think of the effect on archive site disks.)

And then there are the problems when your "ls -lR" scans someone's /mnt
or /afs directory with many other systems mounted there.

Miscellaneous points from the last time we had this discussion on some
archive related mailing list (archie-people?):

- If the format isn't relatively standard, it isn't too useful.  In
that sense, "ls -lR" output is nice and standard.

- If the location of the file isn't standard, it's annoying.  Some
sites put it in /, some in /pub, some in /dist, etc.  With a standard
location a script could simply test something like
	if ftptest -f ftp.wherever:pub/ls-lR.Z
and know where to look.

- Even "ls -lR" output is somewhat redundant - the owner, even the
modes of the file aren't very interesting (unless the file is
unreadable.)  And SysV ls has to be given -g or it sticks the group
info in there.

- It's best to run the ls -lR as the ftp user, or nobody, so the files
that are listed are ones that the user can list automatically.

- the date in "ls" output is somewhat annoying for an automatic
archive tracking package, since it goes backward every six months as
the hh:mm go to zero.

- Columns are a real pain for an automatic listing script -- I have
encountered an ftp site that had a columnated ls -R output; useful for
humans, not quite so useful for scripts.

One possibility is for (Unix) sites to run
	ls -lR | ftpfilter -l | compress > FILES.Z

The format that my "ftpfilter -l" produces from ls -lR is:

name bytes YYYYMMDDhhmm

Directory names have a trailing /.  bytes is sort of meaningless for a
directory.  And please don't read me back my news.software.b postings
about all-numeric dates :-)

eg.

README 1191 199012110000
ailist/ 3072 199001190000
bin/ 512 198811200000
ca-domain/ 3584 199104180410
comp.archives/ 1024 199106032020
ailist/V. 30267 199103240154
ailist/V7.10.Z 9242 198806020000
ailist/V7.11.Z 12851 198806020000
ailist/V7.12.Z 8566 198806020000

This form is grep'able, which is a big win over ls-lR, and it
compresses well -- 70% is typical, which compensates for the expansion
caused by pathnames.  For example the ls-lR.Z of ftp.cs.toronto.edu's
files is 104K.  The compressed ftpfilter output is 106K.  uunet's
ls-lR.Z is 376K, the compressed ftpfilter output is 356K.  (even better
compression can be achieved by prefix compression before LZW, but it
isn't worth the cost of the extra non-standard program).
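
Something along these lines would produce that format from an "ls -lR"
(a rough sketch, not the real ftpfilter -l: it assumes nine-field BSD
ls output, needs a newer awk for the -v option, and simply guesses the
current year for files that ls shows with hh:mm, which is wrong for
entries from the last few months of the previous year):

        ls -lR | awk -v year="`date +%Y`" '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
            for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
        }
        /:$/      { dir = substr($0, 1, length($0) - 1) "/"; sub(/^\.\//, "", dir); next }
        /^total / { next }
        NF >= 9 && $1 ~ /^[-d]/ {
            name = $9
            if ($1 ~ /^d/) name = name "/"          # trailing / on directories
            if ($8 ~ /:/) { y = year; hh = substr($8, 1, 2); mi = substr($8, 4, 2) }
            else          { y = $8;   hh = "00";     mi = "00" }
            printf "%s%s %s %s%s%02d%s%s\n", dir, name, $5, y, mon[$6], $7, hh, mi
        }' | compress > FILES.Z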

	Mark (don't call me Tony!) Moraes

sysnet@cc.newcastle.edu.au (06/18/91)

In article <1991Jun16.092611.15695@newshost.anu.edu.au>, cmf851@anu.oz.au (Albert Langer) writes:
> The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
> ftpd for BSD systems, ftp can no longer directly run ls.
> 
> I have been working on some software (originally by Tony Moraes,
> supplied by Ed Vielmetti) which relies on using ls -alR to obtain a
> recursive directory listing. (I believe it could be used to greatly reduce
> internet traffic, especially at peak times and on international links.
> Cheap unix 386 boxes or very cheap Coherent unix lookalikes could 
> provide a "transparent" cache/relay service.) Should I assume 
> that ftp use of ls -alR will soon be broken on most BSD systems?

I think assuming ls -alR will work on everything is fraught with danger.

A couple of months ago, I spent some time trying to set up a mirror of a 
number of files in different anonymous ftp areas.  After looking at a couple 
of existing things, I found none to be satisfactory, so I began looking at how 
to do it myself.

One of the things I tried was ls -alR to find out what was there.  It turned 
out that only a little over half of the sites I wanted would return something 
useful.  And when you think about it, anything that does not use ls will 
generally not understand ls -alR.  This now obviously includes a lot of unix 
machines, but also includes all the non-unix machines.  It is not a universal 
solution, which makes it not terribly useful.

A slightly relevant digression in case anyone is interested:
The files I wanted to mirror were just selected files or subdirectories from a
number of machines.  I did not want to mirror entire archives as most of the 
existing things did.  It would be nice if anyone thinking about 
mirroring/caching software could take this into account.
-- 
David Morrison, Manager, Networks and Comms, Uni of Newcastle, Australia
sysnet@cc.newcastle.edu.au or (VAX PSI) psi%0505249626002::sysnet
Phone: +61 49 215397	Fax: +61 49 216910

nelson@sun.soe.clarkson.edu (Russ Nelson) (06/18/91)

In article <91Jun17.215218edt.1568@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:

   - Columns are a real pain for an automatic listing script -- I have
   encountered an ftp site that had a columnated ls -R output; useful for
   humans, not quite so useful for scripts.

That would be Phil Karn's KA9Q.  It also confuses Andy Norman's ange-ftp
package for GNU Emacs.

   One possibility is for (Unix) sites to run
   	ls -lR | ftpfilter -l | compress > FILES.Z

That would be an excellent idea.  Maybe someone should write an RFC on
archive-site management.  That way, a site could say "we're RFCXXXX
compliant", and automated tools could deal with it accordingly.

--
--russ <nelson@clutx.clarkson.edu> I'm proud to be a humble Quaker.
I am leaving the employ of Clarkson as of June 30.  Hopefully this email
address will remain.  If it doesn't, use nelson@gnu.ai.mit.edu.