cmf851@anu.oz.au (Albert Langer) (06/16/91)
(Relevance to comp.unix.sysv386 and comp.os.coherent explained at end.
Follow-ups only to comp.archives.admin please.)

The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
ftpd for BSD systems, ftp can no longer directly run ls.

I have been working on some software (originally by Tony Moraes, supplied
by Ed Vielmetti) which relies on using ls -alR to obtain a recursive
directory listing. (I believe it could be used to greatly reduce internet
traffic, especially at peak times and on international links. Cheap unix
386 boxes or very cheap Coherent unix lookalikes could provide a
"transparent" cache/relay service.) Should I assume that ftp use of
ls -alR will soon be broken on most BSD systems?

I have the impression that Ed's comp.archives listings may also rely on
use of ls for the verification, and perhaps archie does too. Tim Cook's
(timcc@admin.viccol.edu.anu) "dls" package, which works with ftp to
provide one-line descriptions of files instead of an ordinary ls listing,
also relies on ftpd calling a local bin/ls within the chroot (though I
think that is unchanged). To avoid possible disruption of important
services, I hope those concerned will check out whether any changes will
be needed.

While I am at it, here's an update on the software I have been working
on, in case anybody wants to do something with it, as I won't be able to
do much more. Sysv386 and Coherent users may be especially interested in
the possibilities for conveniently overcoming the problem of long
filenames when transferring files from BSD systems, and for providing
adequate ftpmail gateways to UUCP sites that are not on the internet.
(Thanks for sending Tony's stuff Ed, sorry about the delay getting back
to you on it.)

The unreleased collection of shell scripts from Tony Moraes included:

1. ftpfilter.
   This takes the output of an ftp (or other) ls -AlR command and
   converts it into a format like that used by "find", so that grepping
   the output can extract a full path as well as the required file
   names. (File size and a numeric date are also included on each line,
   with conversion of the different formats ls uses for older and
   recent dates.)

2. ftpgrep. This greps through compressed ftpfilter output stored in a
   file named after each host, producing output in which each full path
   is prefixed by a hostname and colon, in the style of rcp.

3. grabfiles. This takes rcp/ftpgrep-style input and actually gets the
   files with ftp.

The main changes I have made are:

1. Horrendous butchery of the code, for which nobody else is
   responsible.

2. Ability to handle paths with multiple leading, trailing and embedded
   spaces (mainly to get at collections of Mac software mounted for NFS
   access).

3. Porting to work on Sys V 3.2 (ISC 386ix 2.0.2) as well as on
   SunOS 4.1 (and to work for ftp in either direction between the two).

4. Item 3 required an optional conversion of BSD directory and file
   names longer than 14 characters into Sys V equivalents, to avoid
   collisions between truncated names. Rather than use the "shrink"
   package, which produces unintelligible names, I decided to preserve
   the full original names, converted into filepaths with each segment
   shorter than 14 characters. This wastes some inodes and near-empty
   blocks, but the files can always be linked to another name later
   anyway. Host names were also converted to a sub-directory for each
   domain.

5. Items 3 and 4 were so "hairy" that I had to build in comprehensive
   testing, which compares the files originally requested with those
   actually obtained, keeps detailed logs, and mails up to 4 lists to
   the user:

   a) Files ok and of the length specified.
   b) Files obtained for which no length was specified.
   c) Files obtained but of a length different from the request.
   d) Files requested but not obtained.
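The segment-splitting of item 4 could be sketched roughly as follows. This is my own illustration, not the code described above; the helper name, the awk implementation and the exact 13-character segment length are assumptions.

```shell
# Sketch only: split an over-long BSD filename into a path whose
# segments each fit within the 14-character Sys V name limit, so the
# full original name is preserved (unlike "shrink"-style mangling).
splitname() {
    awk -v name="$1" 'BEGIN {
        out = ""
        # peel off 13-character segments until the remainder fits
        while (length(name) > 14) {
            out = out substr(name, 1, 13) "/"
            name = substr(name, 14)
        }
        print out name
    }'
}
```

For example, `splitname a-very-long-bsd-filename.tar.Z` prints `a-very-long-b/sd-filename.t/ar.Z`; the leaf can later be linked to a more convenient short name, at the cost of a few extra inodes and near-empty blocks, as noted above.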
The conversion "works" pretty reliably now, but I find the mail very
reassuring when I just run a long request in the background. If the ftp
connection gets lost in the middle of a file, I will be told (by the
wrong file size), and there is no need to manually review any
transcripts to see whether there were ftp failures. This greatly
simplifies life. It could easily be extended to reprocess list d) when
a request has been diverted to a cache or shadow/mirror site and try
again elsewhere (and also to recover automatically from dropped ftp
sessions etc).

6. A script was added to process the output of archie prog commands
   into the same rcp/ftpgrep-style format, using the cliplines.c
   included in the log_archie package recently distributed in
   alt.sources. This can easily be developed into a similar facility
   for comp.archives messages.

7. A single directory tree of domains and hosts (with each site's own
   directory trees underneath) was established for use by all
   components, so that once a file has been obtained its local
   availability can easily be confirmed and future requests could be
   directed there. (A simple cron job to find and delete files that
   have not been accessed for a certain period makes this an effective
   cache.)

8. Where more access to a site is available with a specific username
   and password than with "anonymous", this is recorded in the database
   tree and automatically used. The same mechanism can be used for
   other special processing (e.g. variant ls commands).

Anyway, it seems to be doing more or less what I want in heavy-duty
file transfers between two machines. It is horribly slow (in offline
processing), undocumented and clumsily written, but I believe a more
competent programmer could easily develop it into a "production"
package that could be released and would also:

a) Batch ftp requests to handle them during offpeak hours while still
   providing immediate and reliable feedback about availability.
b) Automatically divert requests that have previously been filled to
   fill them from a cache on the same machine.

c) Extend the above to easy maintenance of shadow or mirror archives.

d) Automatically divert requests to other local cache or shadow/mirror
   sites, and follow up with further requests if unsuccessful.

e) Do all this on a cheap sysv386 system or very cheap Coherent system
   (as well as on any other Sys V or BSD unix).

By sticking to simple shell scripts and doing filename conversion the
hard way, I think I have made sure a port to Coherent would not be too
difficult. A Coherent box could provide cache services to UUCP users,
even if it had to obtain its files by UUCP from a cooperating BSD
system that had ftp access to the internet (but only maintained a small
cache of its own, with a short timeout, and was not interested in
providing modem access to others). With a port of TCP/IP etc. to
Coherent, the Coherent box could do the whole job (with filename
conversions so that the 14-character converted names appeared as full
BSD names from outside). Likewise it should be feasible to port to
MSDOS with some further filename manipulations. This could be handy for
ftp requests to BSD systems from MSDOS users.

Unnecessary internet traffic could be greatly reduced simply by
providing a (fairly trivial) utility to process comp.archives messages
into the required form, but with automatic diversions to nearby cache
or shadow/mirror sites. Users would only notice the convenience,
without noticing the diversion (whereas asking users to "check local
cache and mirror archives first", as is done in Australia, seems
somewhat optimistic). Peak traffic could be reduced further with an
option to delay the request until offpeak times. While that WOULD be
noticed by the user, it could be made much more acceptable by immediate
mailed confirmation of the request and subsequent mail notification of
success.
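Point b), combined with the directory tree of item 7, suggests a check along these lines. This is a hypothetical sketch, not the actual scripts; the cache layout and function name are my assumptions.

```shell
# Sketch only: given an rcp-style "host:path" request, test whether the
# file is already present in a local cache tree laid out (hypothetically)
# as cache/<domain>/<host>/<path>.
incache() {
    host=${1%%:*}         # part before the colon
    path=${1#*:}          # part after the colon
    domain=${host#*.}     # drop the host component to get the domain
    test -f "cache/$domain/$host/$path"
}
```

A request handler could then say `incache ftp.uu.net:pub/ls-lR.Z && echo "serve from cache"` before falling back to a real ftp session, and a cron job expiring files by access time keeps the tree an effective cache, as described in item 7.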
Gateways like ftpmail and netfetch could easily enforce both diversion
to cache and mirror sites and offpeak use for mail requests (while
again providing immediate feedback). Further substantial reductions in
traffic could be achieved by providing larger-capacity cache and mirror
sites and locating them more at national and regional gateways that
have expensive links to the rest of the internet.

Ultimately this should be designed into a new internet protocol for a
cached and delayed ftp service, one that uses network store-and-forward
resources for file transfers with substantial storage times as well as
for millisecond-delay packet switching. In the meantime an
application-layer kludge seems well worthwhile. Disk space is now only
USD $2 per MB, and the Coherent operating system is only USD $100. It
seems absurd to pay tens of thousands of dollars per month for
higher-speed international links instead of providing adequate caching.

There was a good deal of discussion in news.admin recently about the
problems caused to mail relays by the BITFTP gateway, and the problems
caused to UUCP sites that are not part of the internet by the closing
down of that gateway. Installation of cheap caches should substantially
relieve the problems of mail relays. If necessary, it should not be too
difficult to develop an accounting system to pay for the disk space
(and modem traffic) by charging the users.

Well, if anybody wants to take this up, I'll be happy to pass on code
that does what I need and could easily be developed into a "production"
system with the extra features mentioned above. I'm just too rusty at
awk and shell programming to finish the job in a reasonable amount of
time, and I have to get on with other priorities, but I'm sure anybody
reasonably competent would have little trouble producing a worthwhile
releasable package quite quickly.
(And whoever takes it on can also do the worrying about how to get
recursive directory listings if ftpd is changing the access to ls :-)
--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au
emv@msen.com (Ed Vielmetti) (06/17/91)
In article <1991Jun16.092611.15695@newshost.anu.edu.au> cmf851@anu.oz.au (Albert Langer) writes:
>I have been working on some software (originally by Tony Moraes,
>supplied by Ed Vielmetti) which relies on using ls -alR to obtain a
That would be Mark Moraes -- please make sure you give him proper credit!
Other than that small detail, your changes sound like they're working
in the right direction. Just to be sure that there's not a zillion
versions of Mark's so far otherwise unreleased code floating around,
it might be best not to post these scripts all that widely to the net.
> Further substantial reductions in traffic could be achieved by
> providing larger capacity cache and mirror sites and locating them
> more at national and regional gateways that have expensive links to
> the rest of the internet.
Well, that's a nice thought, but I don't believe that you're going to
get any substantial amount of funding for that purpose any time soon,
at least not from public sources. They seem to be more interested in
subsidizing bandwidth, not building applications which would add value
to the network and help people use it. On this side of the Pacific
we're starting to hear the T3! T3! T3! chants, as if simply
transporting more bits around would make the network better. Compare
the cost of T3 lines with the meager resources being thrown at (or not
thrown at) archivist work and other projects designed to add
organization to the net, and it's rather discouraging.
--
Edward Vielmetti, moderator, comp.archives, emv@msen.com
"(6) The Plan shall identify how agencies and departments can
collaborate to ... expand efforts to improve, document, and evaluate
unclassified public-domain software developed by federally-funded
researchers and other software, including federally-funded educational
and training software; "
"High-Performance Computing Act of 1991, S. 272"
cmf851@anu.oz.au (Albert Langer) (06/18/91)
In article <EMV.91Jun17014339@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:

>That would be Mark Moraes, please make sure you give him proper credit!

Whoops! Yes, the files say Mark, and your message accompanying them
said Mark; I guess there is no special reason I should continue calling
him Tony :-) I will also take your earlier advice to get in touch with
him.

> Further substantial reductions in traffic could be achieved by
> providing larger capacity cache and mirror sites and locating them
> more at national and regional gateways that have expensive links to
> the rest of the internet.
>
>Well, that's a nice thought, but I don't believe that you're going to
>get any substantial amount of funding for that purpose any time soon,
>at least not from public sources. They seem to be more interested in
>subsidizing bandwidth, not building applications which would add value
>to the network and help people use it. On this side of the Pacific
>we're starting to hear the T3! T3! T3! chants, as if simply
>transporting more bits around would make the network better. Compare
>the cost of T3 lines with the meager resources being thrown at (or not
>thrown at) archivist work and other projects designed to add
>organization to the net, and it's rather discouraging.

I got a note from an AARNET coordinator saying they were planning some
high-capacity cache sites (and might be interested in the software), so
the situation isn't all that bleak. (Though reducing traffic is closer
to increasing bandwidth than it is to adding value etc.)

Still, I agree with your general theme. The benefits from funds
diverted to archiving would greatly exceed those from further increases
in bandwidth etc. I suspect it is partly a problem of the bandwidth
returns being easier to quantify, or just that installing bandwidth is
more straightforward (and does not involve so much in the way of policy
complications about whose needs get priority etc).
Another aspect is the usual problem of funding "public goods", with the
not unusual result of "private affluence and public squalor".

The recent news on WAIS and PROSPERO looks very promising as regards
solutions to most of the technical barriers to accessing network
resources. But no amount of automatic searching can fully substitute
for proper cataloging. We still need simple things like keeping track
of version numbers and patches etc. to save people a lot of time.
--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au
moraes@cs.toronto.edu (Mark Moraes) (06/18/91)
To stave off the mail messages: the original version of the scripts
Albert refers to can be found on ftp.cs.toronto.edu as
pub/ftptools.shar.Z. There's also a modified ftp client there -- it
creates the intervening directories (like mkdir -p) locally if
necessary. (I used them for a long time with a vanilla ftp client, but
got tired of the nuisance.) If you make fixes or improvements to them,
I'd appreciate a copy.

>The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
>ftpd for BSD systems, ftp can no longer directly run ls.

Having all anonymous ftp sites provide a regularly updated ls-lR.Z in
~ftp would be a better solution anyway -- some sites may prefer it to
having other people grind their disks with a periodic "ls -lR". (If a
hundred people started running archie servers, I wouldn't like to think
of the effect on archive site disks.) And then there are the problems
when your "ls -lR" scans someone's /mnt or /afs directory with many
other systems mounted there.

Miscellaneous points from the last time we had this discussion on some
archive-related mailing list (archie-people?):

- If the format isn't relatively standard, it isn't too useful. In that
  sense, "ls -lR" output is nice and standard.

- If the location of the file isn't standard, it's annoying. Some sites
  put it in /, some in /pub, some in /dist, etc. One would like to be
  able to say something like:
	if ftptest -f ftp.wherever:pub/ls-lR.Z

- Even "ls -lR" output is somewhat redundant -- the owner and even the
  modes of the file aren't very interesting (unless the file is
  unreadable). And SysV ls has to be given -g or it sticks the group
  info in there.

- It's best to run the ls -lR as the ftp user, or nobody, so the files
  that are listed are the ones that the user can list automatically.

- The date in "ls" output is somewhat annoying for an automatic archive
  tracking package, since it goes backward every six months as the
  hh:mm go to zero.
- Columns are a real pain for an automatic listing script -- I have
  encountered an ftp site that had a columnated ls -R output; useful
  for humans, not quite so useful for scripts.

One possibility is for (Unix) sites to run

	ls -lR | ftpfilter -l | compress > FILES.Z

The format that my "ftpfilter -l" produces from ls -lR is:

	name bytes YYYYMMDDhhmm

Directory names have a trailing /. bytes is sort of meaningless for a
directory. And please don't read me back my news.software.b postings
about all-numeric dates :-) e.g.

	README 1191 199012110000
	ailist/ 3072 199001190000
	bin/ 512 198811200000
	ca-domain/ 3584 199104180410
	comp.archives/ 1024 199106032020
	ailist/V. 30267 199103240154
	ailist/V7.10.Z 9242 198806020000
	ailist/V7.11.Z 12851 198806020000
	ailist/V7.12.Z 8566 198806020000

This form is grep'able, which is a big win over ls-lR, and it
compresses well -- 70% is typical, which compensates for the expansion
caused by pathnames. For example, the ls-lR.Z of ftp.cs.toronto.edu's
files is 104K; the compressed ftpfilter output is 106K. uunet's ls-lR.Z
is 376K; the compressed ftpfilter output is 356K. (Even better
compression can be achieved by prefix compression before LZW, but it
isn't worth the cost of the extra non-standard program.)

Mark (don't call me Tony!) Moraes
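As a rough illustration of what producing such a format involves, here is my own sketch, not Mark's actual ftpfilter. The year supplied for recent files is an explicit assumption, since ls omits the year for files modified within roughly the last six months (the very annoyance noted above).

```shell
# Sketch only: flatten "ls -lR" output into "name bytes YYYYMMDDhhmm"
# lines. Assumes BSD-style "dir:" headers between sections; fields are
# counted from the end of each line so an extra group column (SysV
# without -g) does not break the parse.
ls2flat() {
    awk -v yr=1991 '
    BEGIN { n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
            for (i = 1; i <= n; i++) mon[m[i]] = sprintf("%02d", i) }
    NF == 1 && /:$/ { dir = substr($0, 1, length($0) - 1) "/"; next }
    $1 ~ /^[-d]/ && NF >= 8 {
        name  = $NF
        t     = $(NF - 1)                 # either hh:mm or a year
        day   = sprintf("%02d", $(NF - 2))
        month = mon[$(NF - 3)]
        size  = $(NF - 4)
        if (t ~ /:/) { hm = t; sub(/:/, "", hm); year = yr }  # recent: year assumed current
        else         { hm = "0000"; year = t }
        # directories get a trailing /, as in the format above
        print dir name ($1 ~ /^d/ ? "/" : ""), size, year month day hm
    }'
}
```

Piping `ls -lR | ls2flat | compress > FILES.Z` would then approximate the pipeline suggested above; the output stays grep'able one-path-per-line.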
sysnet@cc.newcastle.edu.au (06/18/91)
In article <1991Jun16.092611.15695@newshost.anu.edu.au>, cmf851@anu.oz.au (Albert Langer) writes:

> The CHANGE file in ftpd.5.60.tar.Z mentions that with the new version of
> ftpd for BSD systems, ftp can no longer directly run ls.
>
> I have been working on some software (originally by Tony Moraes,
> supplied by Ed Vielmetti) which relies on using ls -alR to obtain a
> recursive directory listing. (I believe it could be used to greatly reduce
> internet traffic, especially at peak times and on international links.
> Cheap unix 386 boxes or very cheap Coherent unix lookalikes could
> provide a "transparent" cache/relay service.) Should I assume
> that ftp use of ls -alR will soon be broken on most BSD systems?

I think assuming ls -alR will work on everything is fraught with
danger. A couple of months ago, I spent some time trying to set up a
mirror of a number of files in different anonymous ftp areas. After
looking at a couple of existing things, I found none to be
satisfactory, so I began looking at how to do it myself. One of the
things I tried was ls -alR, to find out what was there. It turned out
that only a little over half of the sites I wanted would return
something useful. And when you think about it, anything that does not
use ls will generally not understand ls -alR. This now obviously
includes a lot of unix machines, but it also includes all the non-unix
machines. It is not a universal solution, which makes it not terribly
useful.

A slightly relevant digression, in case anyone is interested: the files
I wanted to mirror were just selected files or subdirectories from a
number of machines. I did not want to mirror entire archives, as most
of the existing things did. It would be nice if anyone thinking about
mirroring/caching software could take this into account.
--
David Morrison, Manager, Networks and Comms, Uni of Newcastle, Australia
sysnet@cc.newcastle.edu.au or (VAX PSI) psi%0505249626002::sysnet
Phone: +61 49 215397   Fax: +61 49 216910
nelson@sun.soe.clarkson.edu (Russ Nelson) (06/18/91)
In article <91Jun17.215218edt.1568@smoke.cs.toronto.edu> moraes@cs.toronto.edu (Mark Moraes) writes:
>- Columns are a real pain for an automatic listing script -- I have
>  encountered an ftp site that had a columnated ls -R output; useful for
>  humans, not quite so useful for scripts.
That would be Phil Karn's KA9Q. It also confuses Andy Norman's ange-ftp
package for GNU Emacs.
>One possibility is for (Unix) sites to run
>	ls -lR | ftpfilter -l | compress > FILES.Z
That would be an excellent idea. Maybe someone should write an RFC on
archive-site management. That way, a site could say "we're RFCXXXX
compliant", and automated tools could deal with it accordingly.
--
--russ <nelson@clutx.clarkson.edu> I'm proud to be a humble Quaker.
I am leaving the employ of Clarkson as of June 30. Hopefully this email
address will remain. If it doesn't, use nelson@gnu.ai.mit.edu.