[comp.archives.admin] building an interstate

emv@msen.com (Ed Vielmetti) (06/17/91)

do you ever get the feeling that all this NREN stuff is going to
build us a network that's extraordinarily fast but impossible to use?
mitch kapor, in his "building the open road: policies for the national
public network" [*], describes the current sorry state of the art at
cataloging and describing the services available on the net today as
"like a giant library with no card catalog".

who is going to provide the moral equivalent of the rand-mcnally road
atlas, the texaco road maps, the aaa trip-tiks?  what we have now is
much more like the old Lincoln Highway, with painted markings on trees
and oral tradition that helps you get through the rough spots on the
road.

efforts by existing commercial internet providers have been mediocre
at best.  none appear to be much interested in mapping out the network
beyond the immediate needs of their customers.  if you consider that
one of the roles of a commercial internet provider is to provide
access to software archives, and then you take a look at the state of
the software archives on uunet.uu.net and uu.psi.com, you see enormous
duplication, strange and hard-to-understand organizations of files, no
aids in finding materials beyond a cryptic "ls-lR" file, and dozens if
not hundreds of files which are stale and out of date compared with
the One True Version maintained by the author of the documents. [&]
Visiting these places is like reading magazines at a dentist's office:
you know that what you're reading was new once, a few weeks or months
ago.

efforts by nsf-funded network information centers have been similarly
muddled and half-useful.  if you read the Merit proposal to NSFnet
closely, you saw plans for GRASP (Grand interface to SPIRES) which was
going to be the ideal delivery mechanism for information about the
NSFnet to users of the net.  Promises, promises.  What you do have from
nis.nsf.net is a stale collection of out-of-date maps [%], a bunch of
aggregate traffic measurement numbers [#], and some newsletters [=].  the work
at nnsc.nsf.net isn't all that much better.  part of the problem is
reliance on volunteered information -- the general approach to network
information gathering appears to be not much more than send out a
survey, wait, and tabulate the responses.  very little of this work is
what you would call "pro-active", which is why chapter 3 (archives)
lists just 26 of the more than 1000 anonymous FTP sites and mail-based
archive servers available on the net. [?] (Think of it as a road atlas
that shows fewer than 1 road in 40 and you'll get the right idea.)

that's not to say that there aren't skilled people out there; it's
just that they're generally not supplied with resources adequate to
the task they're facing.  you aren't seeing organizations like ANS,
which seems to be flush with cash and hiring skilled people left and
right, hiring anyone with the archivist skills of a (say) Keith
Peterson.  you aren't seeing innovative applications like "archie", a
union list catalog of FTP sites around the globe, funded as part and
parcel of NSF infrastructure; it's being done in Canada, with no
guarantee of continued existence if it starts to swamp their already
soggy USA-Canada link or if they need the machine back. [+] you
don't see nic.ddn.mil hosting the arpanet "list of lists" anymore;
they didn't like the contents, so it's gone. [@]  the internet library
guides are run as best they can by individuals, and they're in the
form of long ascii lists of instructions on how to connect rather than
an interactive front-end that would make the connections for you --
not that the technology isn't there, just that no one has a mission
and the resources to provide them. [!]

so what do we end up with?  a very fast net (in spots) with a "savage
user interface" [*].  multi-megabit file transfers, you can get
anything you want in seconds, but no way to find it.  regional
networks spending large amounts of federal dollars on bandwidth but
very little on ways to use it effectively.  a vast, largely uncharted
network, with isolated pockets of understanding here and there, and no
one yet who has appeared with any of the proper incentives and
resources to map it out.

-- 
Edward Vielmetti, MSEN Inc. 	moderator, comp.archives 	emv@msen.com

references for further study:

[*] eff.org:/npn/.  discussion in comp.org.eff.talk.
[@] ftp.nisc.sri.com:/netinfo/interest-groups.
    see also dartcms1.dartmouth.edu:siglists
    and vm1.nodak.edu:listarch:new-list.*
    discussion in bit.listserv.new-list.
[!] vaxb.acs.unt.edu:[.library], also
    nic.cerf.net:/cerfnet/cerfnet_info/internet-catalogs*
    discussion in comp.misc and bit.listserv.pacs-l.
[+] see discussion in comp.archives.admin.  archie information can be
    found in quiche.cs.mcgill.ca:/archie/doc/
[%] in nis.nsf.net:maps.  note that several are as old as 1988.
    no readily apparent newsgroup for discussion.
[#] in nis.nsf.net:stats.
    no readily apparent newsgroup for discussion.
[=] in nis.nsf.net:linklttr.  no convenient way to search through them
    short of downloading the whole set.
[&] for instance, see 
    uunet.uu.net:/sitelists/ (empty)
    uunet.uu.net:/citi-macip/ (CITI has withdrawn this code)
    uu.psi.com:/pub/named.ca (out of date named cache file still shows
			      nic.ddn.mil as root nameserver)
    discussion in comp.archives.admin
[?] nnsc.nsf.net:/resource-guide/chapter.3/.
    note that many entries have not been updated since 1989.
    discussion in comp.archives.admin.

srctran@world.std.com (Gregory Aharonian) (06/18/91)

     I agree with Ed completely. For the past six years, I have been
building a database of information on the location of computer software
available in source code form from around the world. Currently I have
information on over 15,000 programs. What I do is very time-consuming
and intellectually demanding, in that I have to know a little bit about
everything to help separate the good stuff from the bad.
     To date, I have received ZERO attention and funding from the US
government, even though most of the software I track is government funded.
The Don't Transfer Research Projects Agency epitomizes the incompetence
in the government, spending hundreds of millions of dollars on software
development, and ZERO on any effective transfer. NASA likes to think it's
competent with its 1200-program COSMIC collection, even though I have
records on over 4000 programs available at NASA sites. Despite receiving
some attention in the press, and many letters on my part, no government
agencies have shown any interest in doing anything with existing source
code resources (and universities don't do any better).
     The Congressional bills to promote critical technologies and information
highways, and to involve the CIA in technology espionage, are all a
waste of tax dollars.  There is so much technology already available that
can be transferred with low technology solutions.
     My observation is that there is gross misunderstanding of the economics
of information and information transfer, leading to proposals that, if they
could be evaluated, would have negative cost-benefit.
     Unfortunately, I no longer believe (or care) that any solutions
will come out of the government. There has been so little criticism of
government information technology activities inside the DoD, DoE, NASA and
NSF that they would not recognize a good idea if it hit them. The only
way these problems will be solved will be through people willing to
understand the economics of information and software, and offer solutions
through the market.
     (By the way, I forgot to flame my favorite waste project, the Software
Thats Alreadybeen Rejected Somewherelse project, which seeks to improve
software productivity tenfold without spending a cent proving that it
achieved its goals.)
     I'll probably get flamed for this posting (just in case, other words
that come to mind include incompetent, self-serving, tax-dollar waste,
impotent, fraudulent, repetitive, duplicative (I have seen 200 federally
funded FFT routines), and most other pejoratives).
     All I know is that there are over 15,000 programs available publicly
in source code form in this great computer/software country of ours, and
I'm the only one who knows where.

Gregory Aharonian
Source Translation & Optimization

emv@msen.com (Ed Vielmetti) (06/18/91)

In article <9106171612.AA01441@mazatzal.merit.edu> clw@MERIT.EDU writes:

   The Directory Group at MERIT, Chris Weider and Mark Knopper, are starting
   to address some of these issues.  I do think that Directory Services are
   a good medium term answer, and we're starting to put everything which
   fits the X.500 philosophy into X.500....

All due respects, Chris, but X.500 doesn't address many of these
issues at all, and the ones it does sort of fit into can be more
easily addressed with other tools.  

X.500 Directory services assume a neat, structured, hierarchical name
space and a clear line of authority running from the root all the way
to the leaves.  Indeed, most X.500 services in place on the internet
today that work well enough to be useful run off of centrally
organized, centrally verified, and bureaucratically administered
information -- the campus phone book.  For what it is, it's great --
I'm happy that I can finger user@host.edu at any number of sites and
get something back.  But that is of little relevance to the archives
problem.

X.500 services are hard to run -- the technology is big, bulky,
ossified.  So the people who are most interested in running it are the
"computer center" folks.  If you look for the innovative, interesting,
and desirable applications that you'd want to find on the net, you'll
see that many of them are being done out in the field in departmental
computing environments or increasingly in small focused private
commercial or non-commercial efforts.  There's not a terribly good
reason for these two groups to communicate, and so most X.500 projects
have much more structure than substance.

X.500 services are directory oriented.  The data in them is relatively
small, of known value, and highly structured.  Information about
archive sources is just about completely counter to these basic
principles.  The amount of information about any particular service
which you'd like to have on hand can be quite considerable; perhaps at
minimum access instructions, but more likely some text describing the
service, who its intended audience is, sample output, etc.  In
addition it would be valuable to keep information on user reactions to
the system close to the officially provided directory notice; these
reviews (a la the michelin guide) are often more valuable than the
official propaganda put out by the designer.  To search this mass of
information, you'll want something much more expressive than the
relatively pitiful X.500 directory access tools -- full text
searching, at the very minimum, with a way to sensibly deal both with
structured data and with more fuzzy matches on "similar" items.
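
A rough sketch of that kind of record (the field names, sample values,
and the toy full-text search below are invented for illustration, not
any real system):

from dataclasses import dataclass, field

@dataclass
class ArchiveEntry:
    name: str                   # package or service name
    access: str                 # how to get at it (ftp host and path, mail server, ...)
    description: str            # free text describing the service
    audience: str               # who it is intended for
    reviews: list[str] = field(default_factory=list)  # user reactions, michelin-guide style

    def matches(self, query: str) -> bool:
        # naive full-text search across every field, structured or not
        text = " ".join([self.name, self.access, self.description,
                         self.audience, *self.reviews]).lower()
        return all(word in text for word in query.lower().split())

entries = [
    ArchiveEntry("archie", "see quiche.cs.mcgill.ca:/archie/doc/ for access details",
                 "union list catalog of anonymous FTP sites around the globe",
                 "anyone hunting for files on the net",
                 ["indispensable, though the link can be slow at peak hours"]),
]
print([e.name for e in entries if e.matches("ftp catalog")])   # ['archie']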

X.500 is a holy grail; there's a lot of money being thrown at it
these days in the hope of making it useful.  Good luck, I
wish you well.  But please, don't try to cram all the world's data
into it, because it doesn't all fit.  It's a shame that equivalent
amounts of effort aren't being spent on developing other protocols
more suited to the task.  I'm thinking in particular of the Z39.50
implementation in WAIS [*], which holds a lot of potential for
providing a reasonable structure for searching and sifting through
databases which have rich textual information.  Perhaps it's just as
well that federal subsidy hasn't intruded here and clouded people's
judgments on the applicability of a particular technology to a
certain task.

-- 
Edward Vielmetti, MSEN Inc. 	moderator, comp.archives 	emv@msen.com

"often those with the power to appoint will be on one side of a
controversial issue and find it convenient to use their opponent's
momentary stridency as a pretext to squelch them"


[*] think.com:/public/wais/,
    also quake.think.com:/pub/wais/

worley@compass.com (Dale Worley) (06/18/91)

In article <EMV.91Jun18000345@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
   X.500 services are directory oriented.  The data in them is relatively
   small, of known value, and highly structured.  Information about
   archive sources is just about completely counter to these basic
   principles.

   X.500 is a holy grail, there's a lot of money which seems to be being
   thrown at it these days in the hope to make it useful.

What can be done to produce good catalogs?  As Ed notes, archive
information is likely to be bulky, chaotic, and of unknown (probably
small) value.  Given how much money is needed to get a directory
system for information without these problems running, it will
probably take much more to get a good system for archive information
working.

Perhaps the analogy to road maps can be a guide -- Roads have been
around for thousands of years, but road maps have only been available
for fifty(?) years.  What happened?  One thing is that it is now
possible to make a map and then sell thousands (hundreds of
thousands?) of copies, thus making each copy reasonably inexpensive.
Until the development of the automobile this was not possible; there
were too few potential users.  (Nor was it even necessary, since a horse
cart is slow enough that stopping to ask directions in each town isn't
a burden.)

One possibility is to make a service that charges you for use.  A good
archive information system should see enough use that each query can
be quite inexpensive.  And the authorization and billing should be
easy enough to automate!

Dale Worley		Compass, Inc.			worley@compass.com
--
Perhaps this excerpt from the pamphlet, "So You've Decided to
Steal Cable" (as featured in a recent episode of _The_Simpsons_)
will help:
	Myth:  Cable piracy is wrong.
	Fact:  Cable companies are big faceless corporations,
		which makes it okay. 

eachus@largo.mitre.org (Robert I. Eachus) (06/18/91)

     I was at a Hypertext meeting a year or so ago, and after
listening to all the talks, I commented to a friend: "You know, we had
librarians for thousands of years before the invention of movable type
made them necessary.  In Hypertext, everyone is trying to do it the
other way round."

     What I see on the net makes the Hypertext people sound like
forward thinkers.  The net is even more chaotic than the (often
static) environments that have been used in Hypertext prototypes.  In
the Hypertext arena the problem is that they are developing the tools
without considering how the necessary databases will be created.  On
the net, we have much more data than anyone can comprehend, but no
support for even developing the tools.

     What the world and the net need is a new type of organization:
a software library.  Given funding, such an institution could
provide disk space (cheap), net access (not so cheap, but arguably
billable to actual users), software developers to provide the necessary
tools (no big deal), and actual software LIBRARIANS to develop a
cataloging system and actually organize all this stuff.  That will be
by far the biggest expense.  There is as yet no Dewey Decimal system
for software, but we desperately need it.

     Incidentally, all the fancy software in the world with multiple
keys, multiple views, etc. won't address that need.  What makes the
Dewey system (or Library of Congress) useful is that once I have it in
my head, I know where books on, say, Cryptography are to be found, and I
can find related books that I didn't know about.  A keyword probe will
miss closely related--but different--subjects.
--

					Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...

scs@iti.org (Steve Simmons) (06/19/91)

worley@compass.com (Dale Worley) writes:

>What can be done to produce good catalogs?  As Ed notes, archive
>information is likely to be bulky, chaotic, and of unknown (probably
>small) value.  Given how much money is needed to get a directory
>system for information without these problems running, it will
>probably take much more to get a good system for archive information
>working.

Arguing with an analogy is silly, but I'm gonna do it . . .  :-)

In the middle ages, maps were often critical trade secrets.  A chart
of waters was worth significantly more than its weight in gold, as
it revealed both what places existed and how to get there and back
safely.  The Portuguese managed to keep the "safe route" to Japan
secret for an incredibly long time.

Trivially yours,

Steve
-- 
  "If we don't provide support to our users someone is bound to
   confuse us with Microsoft."
	-- Charles "Chip" Yamasaki

ajw@manta.mel.dit.CSIRO.AU (Andrew Waugh) (06/19/91)

In article <EMV.91Jun18000345@bronte.aa.ox.com> emv@msen.com (Ed Vielmetti) writes:
> X.500 Directory services assume a neat, structured, hierarchical name
> space and a clear line of authority running from the root all the way
> to the leaves.

While this is certainly true, it is important to understand why this is
so. X.500 is intended to support a distributed directory service. It
is assumed that there will be thousands, if not millions, of
repositories of data (DSAs). These will co-operate to provide the
illusion of a single large directory.

The problem with this model is how you return a negative answer in a
timely fashion. Say you ask your local DSA for a piece of information.
If the local DSA holds the information you want, it will return it.
But what if it doesn't hold the information? Well, the DSA could
ask another DSA, but what if this second DSA also doesn't hold the
information? How many DSAs do you contact before you return the
answer "No, that piece of information does not exist"? All of them?

X.500 solves this problem by structuring the stored data hierarchically
and using this hierarchy as the basis for distributing the data
amongst DSAs. Using a straightforward navigation algorithm, a query
for information can always progress towards the DSA which should hold
the information. If the information does not exist that DSA can
authoritatively answer "No such information exists." You don't have to
visit all - or even a large proportion - of the DSAs in the world.
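
A minimal sketch of that navigation idea (illustrative code, not real
X.500 or DSA software; the names and data are made up):

class DSA:
    """One repository, authoritative for the subtree delegated to it."""
    def __init__(self):
        self.entries = {}    # leaf data held at this repository
        self.children = {}   # delegated subtrees, keyed by the next name component

    def lookup(self, path):
        # path is a hierarchical name, e.g. ["c=AU", "o=CSIRO"]
        head, *rest = path
        if not rest:
            # this repository is authoritative for this level: a miss
            # here is a definitive "no such information exists"
            return self.entries.get(head)
        child = self.children.get(head)
        if child is None:
            return None      # also an authoritative negative answer
        return child.lookup(rest)

root = DSA()
au = DSA()
root.children["c=AU"] = au
au.entries["o=CSIRO"] = "CSIRO Division of Information Technology"
print(root.lookup(["c=AU", "o=CSIRO"]))    # found, after visiting two DSAs
print(root.lookup(["c=AU", "o=Nowhere"]))  # None, without asking anyone else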

It is important to realise that this is a generic problem with highly
distributed databases. The X.500 designers chose to solve it by
structuring the data. This means that X.500 is suitable for storing
data which can be represented hierarchically and is less suitable
for storing data which cannot. Exactly what data will be suitable for
storing in X.500 is currently an open question - there is simply not
sufficient experience.

The proposed archive database which started this thread will have
exactly the same problem.  Whatever solution is chosen, if different from
the one X.500 uses, will have problems of its own.  There is no such thing
as a perfect networking solution!

>X.500 services are hard to run -- the technology is big, bulky,
>osified.  So the people who are most interested in running it are the
>"computer center" folks.  If you look for the innovative, interesting,
>and desirable applications that you'd want to find on the net, you'll
>see that many of them are being done out in the field in departmental
>computing environments or increasingly in small focused private
>commercial or non-commercial efforts.  There's not a terribly good
>reason for these two groups to communicate, and so most X.500 projects
>have much more structure than substance.
>
>X.500 services are directory oriented.  The data in them is relatively
>small, of known value, and highly structured.  Information about
>archive sources is just about completely counter to these basic
>principles.  The amount of information about any particular service
>which you'd like to have on hand can be quite considerable; perhaps at
>minimum access instructions, but more likely some text describing the
>service, who its intended audience is, sample output, etc.  In
>addition it would be valuable to keep information on user reactions to
>the system close to the official provided directory notice; these
>reviews (a la the michelin guide) are often more valuable than the
>official propaganda put out by the designer.  To search this mass of
>information, you'll want something much more expressive than the
>relatively pitiful X.500 directory access tools -- full text
>searching, at the very minimum, with a way to sensibly deal both with
>structured data and with more fuzzy matches on "similar" items.
>
>X.500 is a holy grail, there's a lot of money which seems to be being
>thrown at it these days in the hope to make it useful.  Good luck, I
>wish you well.  But please, don't try to cram all the world's data
>into it, because it doesn't all fit.  It's a shame that equivalent
>amounts of effort aren't being spent on developing other protocols
>more suited to the task. I'm thinking in particular of the Z39.50
>implementation in WAIS [*] which holds a lot of potential for
>providing a reasonable structure for searching and sifting through
>databases which have rich textual information.  Perhaps it's just as
>well that federal subsidy hasn't intruded here and clouded people's
>judgments on the applicability of a particular technology to a
>certain task.

As for the rest of the posting, all I can say is that it must be great
to know so much about the costs and benefits of using X.500.
From my perspective, it is obvious that X.500 will not solve all
the world's problems (nothing ever does :-) but it is way too early
to be so dogmatic.  When we have had
	1) The necessary experience of implementing X.500, running
	X.500 databases and storing different types of data in such
	a database; and
	2) experience in alternative highly distributed databases.
	(X.500 might prove to be extremely poor for storing certain
	types of data - but the alternatives might be even worse.)
then we can be dogmatic.

andrew waugh

rhys@cs.uq.oz.au (Rhys Weatherley) (06/19/91)

In <EACHUS.91Jun18164709@largo.mitre.org> eachus@largo.mitre.org (Robert I. Eachus) writes:

>     Incidently, all the fancy software in the world with multiple
>keys, multiple views, etc. won't address that need.  What make the
>Dewey system (or Library of Congress) useful is that once I have it in
>my head, I know where books on say Cryptography are to be found, and I
>can find related books that I didn't know about.  A keyword probe will
>miss closely related--but different--subjects.

I agree that something like this is needed, but how is it going to be
organised?  There's a big difference between books and computer programs.
If I go into a library and walk up to the shelf marked "Mathematical Logic"
(marked in Dewey Decimal or whatever), then the books I find there will
be about the various aspects of "Mathematical Logic" and just that.
However, if I walk into a computer store and walk up to the shelf marked
"Spreadsheets" I'll also find programs that double up as wordprocessors,
databases, desktop publishers, comms programs, ... in addition to being
a spreadsheet.

So if the "Compy Decimal" system (or whatever) was used, we'd find such
programs under lots of different numbers and sooner or later some librarian
is going to forget to enter a program under all necessary headings, or
a programmer is not going to tell the librarian all the headings and
we are back to square one.  Similarly, using identifiers for programs like
"spreadsheet,database,wordprocessor,unix,xwindows:123.8" isn't going
to be much better, and we'll get back to the keyword search problem
eventually.

Some central control would be needed (as with any library system), and that
would be a good idea, but with "creeping featurism" being the favourite
pastime of upgrades these days, it's only going to
get worse.  When a book is published, further editions don't stray much
from the original topic - but program users are always screaming for more
features over and above what a program was initially intended for, meaning
extra identifiers for every new version of a program.  Distributed database
technology is not the answer, just the means.  Better information is the
answer.

Maybe it's time we retrained programmers to write programs to perform
a single task, not control the world!  :-)

We'll come up with something eventually, but I don't think it will fit
into the library/archive framework we are used to: there's so much
more information in computing than humans are used to.  It will have
to be something new.  Any ideas?

Cheers,

Rhys.

+=====================+==================================+
||  Rhys Weatherley   |  The University of Queensland,  ||
||  rhys@cs.uq.oz.au  |  Australia.  G'day!!            ||
||       "I'm a FAQ nut - what's your problem?"         ||
+=====================+==================================+

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/19/91)

Instead of complaining about how inappropriate X.500 is for all but the
simplest problems, why don't we identify the problems we're really
trying to solve?

I think that the Internet People Problem---make a *usable* database of
people on the Internet---embodies most, if not all, of the technical and
political difficulties that an archive service has to overcome. You want
to find that vocrpt package? I want to find Don Shearson Jr. You want to
find a SLIP-over-the-phone package? I want to find a SLIP-over-the-phone
expert. You want to know where you got this collection of poems? I want
to know where I got this phone number. You want to see what everyone
thinks of vocrpt now that you've found it? DISCO wants to get references
for Shearson from anyone who's willing to admit he's worked with him.

One advantage of starting from the Internet People Problem is that it
has a lot more prior art than the archive problem, from telephone
directories on up. Once we've solved it we can see how well the same
mechanisms handle data retrieval.

---Dan

eachus@largo.mitre.org (Robert I. Eachus) (06/19/91)

In article <2013@uqcspe.cs.uq.oz.au> rhys@cs.uq.oz.au (Rhys Weatherley) writes:

 > I agree that something like this is needed, but how is it going to be
 > organised?  There's a big difference between books and computer programs...

   We're violently agreeing.  Anyone can do the repository bit; it is
organizing a software collection in a meaningful way that will be the
tough job.  Ed Vielmetti is trying to do one part of the job, but I am
saying that the real need is for the other $150 (or whatever) worth of
work on that Library of Congress card.

 > Maybe it's time we retrained programmers to write programs to perform
 > a single task, not control the world!  :-)

   We used to joke that every program in the MIT AI lab grew until it
could be used to read mail.  Now we know they don't stop there...

 > We'll come up with something eventually, but I don't think it will fit
 > into the library/archive framework we are used to: there's so much
 > more information in computing than humans are used to.  It will have
 > to be something new.  Any ideas?

   Some ideas, but this is in the class of very hard problems.  Even
if you have a database program which is designed only to be a fancy
phone dialer, it may implement an algorithm which is what I am looking
for, for my radar application.  Or I may not want the program, but I am
looking for the Minneapolis telephone directory which is provided with
this program, and I'll also need the program so I can use it...

   It seems to me that we will need an indexing scheme that looks
hierarchical to the user, but which is actually implemented with fuzzy
logic.  When I go looking for a database program it would originally
exclude the phone dialer programs, but when I get to database programs
with data on addresses in Minnesota, the example I used above is now
back in.
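
   A toy sketch of that hybrid (the programs, categories, and weights
below are invented; the point is only that fuzzy scoring lets an item
filed under another branch come back in when its data matches):

programs = [
    {"name": "fastdial",  "category": "comm/dialer",
     "attributes": {"phone numbers", "modem"}},
    {"name": "mplsphone", "category": "comm/dialer",
     "attributes": {"phone numbers", "addresses", "minnesota", "database"}},
    {"name": "dbtool",    "category": "database/general",
     "attributes": {"database", "query", "reports"}},
]

def score(program, wanted_category, wanted_attributes):
    s = 1.0 if program["category"].startswith(wanted_category) else 0.0
    s += 0.5 * len(program["attributes"] & wanted_attributes)   # fuzzy credit
    return s

query = ("database", {"addresses", "minnesota"})
for p in sorted(programs, key=lambda p: score(p, *query), reverse=True):
    print(round(score(p, *query), 1), p["name"])
# the Minnesota phone-list dialer scores as well as the plain database
# tool and well ahead of the ordinary dialer, even though it lives in
# another branch of the hierarchy
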
--

					Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...

lars@spectrum.CMC.COM (Lars Poulsen) (06/20/91)

In article <EMV.91Jun18000345@bronte.aa.ox.com>
   emv@msen.com (Ed Vielmetti) writes:
>   X.500 services are directory oriented.  The data in them is relatively
>   small, of known value, and highly structured.  Information about
>   archive sources is just about completely counter to these basic
>   principles.

In article <WORLEY.91Jun18094957@sn1987a.compass.com>
   worley@compass.com (Dale Worley) writes:
>What can be done to produce good catalogs?  As Ed notes, archive
>information is likely to be bulky, chaotic, and of unknown (probably
>small) value.  Given how much money is needed to get a directory
>system for information without these problems running, it will
>probably take much more to get a good system for archive information
>working.

Actually, we know quite well what it takes to raise the signal-to-noise
ratio. Administration and moderation.

One possible option would be for the Internet Society to sponsor an
archive registration facility. Maybe each of the IETF task forces can
identify valuable programs that  need to be archived, with mirrored
servers on each continent, available for NFS mounting as well as
anonymous FTP. It should be worth $50 for each site to have access to
good, easily accessible archives instead of having to keep disk space
for everything in our own space. (I know; not every "hobby site" can
afford $50, but there are many commercial sites, including my own, that
would be happy to help feed such a beast; I'm sure many academic sites
would be able to help, too.)
-- 
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM

worley@compass.com (Dale Worley) (06/20/91)

In article <1991Jun20.070516.683@spectrum.CMC.COM> lars@spectrum.CMC.COM (Lars Poulsen) writes:
   One possible option would be for the Internet Society to sponsor an
   archive registration facility. Maybe each of the IETF task forces can
   identify valuable programs that  need to be archived, with mirrored
   servers on each continent, available for NFS mounting as well as
   anonymous FTP. It should be worth $50 for each site to have access to
   good easily  accessible archives instead of having to keep disk space
   for everything in our own space.

Let me do some calculations.  (Of course, some of these numbers may be
off -- I'd like to see how other people think it can be organized.)

First off, it's going to take at least 6 people to run the
organization.  For the first few years, it will take at least 3
programmers and 3 administrators.  Remember, there are 15,000 (to
quote somebody) programs out there, and each one needs to be
catalogued, at least minimally.  Also, since it is a for-pay service,
somebody has to handle payment and bookkeeping.  That will cost
something like $600,000 per year.

And then there are advertising costs -- and it's going to be hard to
advertise it over the Internet, because the Internet doesn't like
money-grubbing.

And there's the cost of maintaining the system's computer, with its
connection to the Internet.

And there has to be a way to limit access to the archives to those
people who have paid for the service -- otherwise there's no incentive
for people to subscribe.

OK, so maybe the total budget is $700,000 per year.

Now, how many sites can we get to sign up?  If we're extremely lucky,
and spend a lot on advertising, maybe 1000 will sign up the first
year.  That puts subscriptions at $700/year.  If you start with 100
sites, subscriptions have to be $7000/year.
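
The arithmetic, spelled out with the same assumed numbers (they are
guesses, just like the ones above):

staff  = 600_000      # 3 programmers + 3 administrators, per year
other  = 100_000      # advertising, the machine, the Internet connection
budget = staff + other                       # $700,000 per year

for sites in (1000, 100):
    print(sites, "subscribing sites ->", budget // sites, "dollars/year each")
# 1000 subscribing sites -> 700 dollars/year each
# 100 subscribing sites -> 7000 dollars/year each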

Dale Worley		Compass, Inc.			worley@compass.com
--
I'm a politician -- that means I'm a liar and a cheat.  And when I'm not
kissing babies, I'm stealing their lollypops. -- "The Hunt for Red October"

caa@Unify.Com (Chris A. Anderson) (06/21/91)

In article <2013@uqcspe.cs.uq.oz.au> rhys@cs.uq.oz.au writes:
>In <EACHUS.91Jun18164709@largo.mitre.org> eachus@largo.mitre.org (Robert I. Eachus) writes:
>However, if I walk into a computer store and walk up to the shelf marked
>"Spreadsheets" I'll also find programs that double up as wordprocessors,
>databases, desktop publishers, comms programs, ... in addition to being
>a spreadsheet.

One way around this is to have a "Main Category" and "Sub-Category"
headings for the software.  That way, the primary function of the
software would be listed, and any other features could be placed 
under sub-categories.  And Emacs could still be the kitchen sink. :-)

>So if the "Compy Decimal" system (or whatever) was used, we'd find such
>programs under lots of different numbers and sooner or later some librarian
>is going to forget to enter a program under all necessary headings, or
>a programmer is not going to tell the librarian all the headings and
>we are back to square one.  Similarly, using identifiers for programs like
>"spreadsheet,database,wordprocessor,unix,xwindows:123.8" aren't going
>to be much better, and we'll get back to the keyword search problem
>eventually.

I think that a perfect system is unrealistic.  The idea is to make it
better than it is.  If a program is not entered under all of its
relevant headings, then so be it.  So long as the main purpose of the
program is found, I'd be lots happier.

>Some central control would be needed (as with any library system) and that
>would be a good idea (and I agree with this), but with "creeping featurism"
>being the favourite passtime of upgrades these days, it's only going to
>get worse.  When a book is published, further editions don't stray much
>from the original topic - but program users are always screaming for more
>features over and above what a program was initially intended for, meaning
>extra identifiers for every new version of a program.  Distributed database
>technology is not the answer, just the means.  Better information is the
>answer.

Why not have the authors of the program provide the categories that a
system or program should be entered under?  They know the software best,
and probably wouldn't leave out many of the relevant headings.  The
problem with having a central librarian concept is that you require those
people to be authorities on a vast amount of information.  Not only what
has gone before, but every new technology that comes out.  That's a 
terrific burden.

>Maybe it's time we retrained programmers to write programs to perform
>a single task, not control the world!  :-)

Reality, my friend, reality!  :-)  And let's train managers and
marketroids to not ask for just "one more thing" while we're at it.

Chris
-- 
+------------------------------------------------------------+
|  Chris Anderson, Unify Corp.                caa@unify.com  |
+------------------------------------------------------------+
|     Do not meddle in the affairs of wizards ... for you    |

peter@ficc.ferranti.com (Peter da Silva) (06/22/91)

In article <2013@uqcspe.cs.uq.oz.au> rhys@cs.uq.oz.au writes:
> If I go into a library and walk up to the shelf marked "Mathematical Logic"
> (marked in Dewey Decimal or whatever), then the books I find there will
> be about the various aspects of "Mathematical Logic" and just that.

Most will. Many will have digressions into other aspects of mathematics,
logic, Zen, etcetera...

> However, if I walk into a computer store and walk up to the shelf marked
> "Spreadsheets" I'll also find programs that double up as wordprocessors,
> databases, desktop publishers, comms programs, ... in addition to being
> a spreadsheet.

Yes, but they are basically spreadsheets. All these "integrated programs"
have a central model that describes their behaviour, and a bunch of
extra tools that are stuck on the side.

They're also a dying fad. The only point to things like Lotus is to make
up for limitations in MS-DOS (single tasking, no IPC, etc). Better operating
environments will replace the swiss-army-knife program.

> Maybe it's time we retrained programmers to write programs to perform
> a single task, not control the world!  :-)

Start by boycotting MS-DOS.
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (06/24/91)

I think the Mathematics Subject Classification model would apply quite
well to archived files (and netnews!). A central authority defines a
three-level hierarchy of codes, each covering some subject area; in the
MSC, for instance, 11 is number theory, 11J is approximation, and 11J70
is continued fraction approximation. Every article published is given
(by the author) a primary five-digit code and any number of secondary
five-digit codes. Mathematical Reviews then lists articles by code.
Anyone who doesn't find his subject listed can use a ``None of the
above but in this section'' classification, then ask the AMS to add that
subject in the next MSC revision.
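
A small sketch of that scheme (the code table uses the MSC examples
above; the article records and the listing function are invented for
illustration):

codes = {
    "11":    "number theory",
    "11J":   "approximation",
    "11J70": "continued fraction approximation",
}

articles = [
    {"title": "On continued fractions", "primary": "11J70", "secondary": ["11J"]},
    {"title": "A survey of approximation", "primary": "11J", "secondary": []},
]

def listing(code):
    # everything filed under a code, the way a reviewing journal lists it
    return [a["title"] for a in articles
            if a["primary"] == code or code in a["secondary"]]

print(listing("11J"))   # ['On continued fractions', 'A survey of approximation']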

Of course, the MSC (which is available for anonymous ftp on
e-math.ams.com as mathrev/asciiclass.new) wouldn't apply directly to
software; we'd have to draft a whole new set of categories. But the
model will work.

---Dan

worley@compass.com (Dale Worley) (06/24/91)

There already *is* a computer science classification system (the ACM
Computer Surveys classifications), although it's oriented toward
academic CS research rather than practical software.

Dale Worley		Compass, Inc.			worley@compass.com
--
Vietnam was only a police action so why do we have a War on Drugs?!?
-- P.B. Horton

cmf851@anu.oz.au (Albert Langer) (06/25/91)

In article <11900.Jun2322.59.2491@kramden.acf.nyu.edu> 
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>I think the Mathematics Subject Classification model would apply quite
>well to archived files (and netnews!). 

Sounds like a useful model to start from - especially:

1. Use of more than one level.
2. Codes defined by a central authority.
3. Assignment of primary and any number of secondary codes.

I doubt that there will be much success with self-assignment by
authors of software packages since, unlike mathematicians, they are
not used to relying on literature searches for prior art anyway.

However there is no way to find out how viable that fourth feature of
the maths system (self-assignment by authors) is until we have the codes
assigned by a central authority.  If it also turns out to be viable, fine;
otherwise I propose
the "cooperative cataloging" model used by libraries - i.e. the first
major archive site that stocks the package does the classifying and
others copy - that distributes the work among people who understand
the classification scheme, even though not as widely as by distributing
it to authors as well. (Once it has caught on, and people actually
USE the catalog classifications, one could THEN hope for some
self-cataloging by authors.)

By "major archive site" I really mean "cataloging site" - i.e. one
that is willing to do far more than the typical ftp site in actually 
maintaining organized cataloging information. This need not actually
be a site that has disk space available on the internet, though
considering that disk space is now only $2 per MB I don't see why
not. Another set of possible catalogers are the moderators and indexers of
the *sources* groups. (There was some discussion re a classification
scheme in comp.sources.d recently).

>Of course, the MSC (which is available for anonymous ftp on
>e-math.ams.com as mathrev/asciiclass.new) wouldn't apply directly to
>software; we'd have to draft a whole new set of categories. But the
>model will work.

As well as new categories, I think we would have to add quite a lot
of features to the model, e.g.:

1. Version numbers. For whole and component parts.
2. *sources* message-id/subject headings/archive names
3. file sizes for source and object code software, docs, test and other data,
abstracts (README, HISTORY etc) and various combinations, with "standard" 
filenames.
4. refinement of 3 to include postscript/dvi and "source" forms of 
documentation, compressed and uncompressed versions with various
packaging methods etc.
5. Patches and what they apply to and result in.
6. Languages used (perhaps merely one of many classifications, but
could add file sizes and numbers for each).
7. Pre-requisite software. (Not a classification but a reference to
other cataloged packages with specific version numbers).
8. Pre-requisite hardware.
9. Release status. (alpha, beta, gamma etc)
10. Copyright information. (Whether "freely available" etc)
11. Systems tested on.
12. Systems it is believed to work on.
13. Systems it is believed not to work on.

Only the most important information need be provided initially, but
it should be possible to add other stuff including even review comments
or pointers to discussion in newsgroups.  This could be provided for
at the same time as setting up a system for cooperative cataloging, since
coop cataloging implies being able to take an existing or non-existent
catalog record, add to it, and have that then available for others
to use or add to.  Adding "review comments" would be particularly useful.
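
A rough sketch of what a cooperative catalog record along those lines
might look like (the field names are invented, not MARC or any library
standard, and the package and sample values are illustrative):

record = {
    "name": "examplepkg",
    "classification": {"primary": "news/transport", "secondary": ["unix"]},
    "version": "1.0",
    "prerequisites": ["a C compiler"],
    "release_status": "production",
    "copyright": "freely available",
    "reviews": [],
}

def add_to_record(catalog_record, additions):
    # coop cataloging: later sites fill in missing fields and append
    # review comments without clobbering earlier work
    for key, value in additions.items():
        if key == "reviews":
            catalog_record["reviews"].extend(value)
        elif key not in catalog_record:
            catalog_record[key] = value

add_to_record(record, {"tested_on": ["SunOS 4"],
                       "reviews": ["installs cleanly; documentation is sparse"]})
print(sorted(record.keys()))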

It still strikes me that libraries are the institutions that should be
doing this.  One thing, though: if they aren't prepared to take it on yet,
perhaps they could make the software they use available at no charge?  There
are some very powerful systems in use for cooperative cataloging, and the
MARC records that cover everything from audio tapes to maps are just
as complex as anything we will need for software packages.

How about just submitting a couple of packages as "publisher" to the LC and
asking for the "Cataloging In Publication" data to be returned overnight,
as is done for book manuscripts?  That should produce some discussion :-).

U.S. copyright law clearly defines computer programs as "literary works",
and I can't see anybody claiming that something like "c news"
or X windows is "merely ephemeral", so I guess they would HAVE to
catalog it.

The Library of Congress IS on the internet (loc.gov) - but if they
won't accept submissions by email or ftp, somebody could just start up
a "publisher" to issue a series of tapes and diskettes for physical
delivery to them with each volume a separate monograph (not part of a 
single serial) containing one software package.

I'm quite serious about this: proper cataloging DOES cost about $200 per
item and it IS THEIR JOB.  We should just be helping with specialist advice.

P.S. For anyone wanting to follow up - I just don't have time - a
contact at the LC is:

Sally H. McCallum, Chief
Network Development and MARC Standards Office
Library of Congress
smcc@seq1.loc.gov
(202) 707-6273

--
Opinions disclaimed (Authoritative answer from opinion server)
Header reply address wrong. Use cmf851@csc2.anu.edu.au