[comp.text.sgml] Commercial Archives

emv@msen.com (Ed Vielmetti) (06/24/91)

<excerpt>
   The current comp.archives appears to be driven by "technology
   push": you have the data available, so you're saving it.
   Business doesn't work that way; it works by "pull."  You
   have to find customers who need a specific type of data,
   then you let them pay for the archiving, indexing, and
   knowledgeable data experts.
</excerpt>

There's something that you're missing here, I think.  No doubt there's
some domain-specific knowledge involved in the production of
comp.archives; it's useful to have a feel for which of the 1000+
archive sites in the world have the greatest likelihood of having
current stuff, which authors are most reliable, who is best organized.

But there's more to it than that.  One of the fundamental technologies
involved is taking a piece of text and answering the question "Is this
interesting?", or more likely "Is this likely to be interesting to Ed
Vielmetti, or Chris Torek, or Mark Moraes, or Richard Stallman, or
Mitch Kapor?"  That's not an easy question, but if you can solve it
(for free) for the person involved, then you can instantly market what
you have to everyone else in the world who respects these people's opinions.

<excerpt>
   As an extreme case, you can imagine a host of consultants,
   each with his or her own archive.  Each consultant advertises
   a specialty, collects related data, indexes it according to
   personal needs, seeks out customers, prepares reports, and
   occasionally even publishes a book.
</excerpt>

That's a good model to follow, and I would hope to start following it.
One of the things that's going to be part of the <tm> MSEN Archive
Service </tm> which is not in comp.archives now is a further breakdown
by subject classification; you'll be able to subscribe to
"msen.archives.tex" and get just the latest and greatest on TeX
software announcements and reviews, or "msen.archives.x" to track the
progress of X11 stuff.  You'll particularly want the last one once
X11R5 rolls around.  Each of these collections will have its own
archivist, who is responsible for quality control and additional
research. 

I'm planning to apply the same technology to related fields as well,
subject to the availability of some copyrighted information (and the
time and investment to pull it off).  For instance, an <tm> MSEN Patent
Watch </tm> subscription would get you news of patent filings,
cross-license agreements, technical information (and raw speculation)
on the viability and challengeability of <kw> software patents </kw>,
etc., culled from every available source and tagged (by experts) with
an assessment of quality and value.  I'd bet that this one could even
make a go of it on paper.

<excerpt>
   Instead of following the consultant model, you seem to be
   following the public library model.  Why?  There's no money
   in it.
</excerpt>

One of the problems with the consultant model is that it doesn't scale
too well; you have to do all of the development yourself, and it's
hard to find like-minded people because you're hoarding all of your
efforts.  By pursuing a strategy that includes some component of
public service / pro bono / for the good of the net, and by
aggressively tracking Internet standards (like the multipart,
multimedia "richmail" spec), it's possible to get a substantial amount
of goodwill, and perhaps enough visibility for people to take you
seriously. 

After all, this sort of thing is very old; it's just a high-tech
"clipping service".  It's something that I would do <o>just for
myself</o> except that that hasn't been lucrative enough to buy the
necessary hardware and software I'd need to store all of the
interesting things I find, or to license the necessary rights to the
copyrighted newsfeeds (let alone have anything left over for me).  It
doesn't matter if there's "no money in it", so long as the venture is
self-supporting and sustainable.

<sig>
Edward Vielmetti, vice president for research, MSEN Inc. emv@msen.com
"MSEN Archive Service" and "MSEN Patent Watch" are trademarks of MSEN, Inc.
<snappy-quote>
On the Net, the Net-way is best.
	It's just that we are trying to figure out what the Net-way is.
						e. miya
</snappy-quote>
</sig>

<comment>
Markup information provided for use by news readers which implement
the experimental "Mechanisms for Specifying and Describing Internet
Message Bodies", available for anonymous ftp from 
	<msen-archive-information>
	<site>thumper.bellcore.com</site>
	<directory>/pub/nsb</directory>
	</msen-archive-information>
This text has been marked up in the hopes that someone will be able to
print it out on paper and make it pretty!  A five dollar reward goes
to the first nice paper copy.  Send submissions to
<snail>
	Edward Vielmetti
	MSEN, Inc.
	317 S. Division, Suite 218
	Ann Arbor, MI 48104-2203
	USA
</snail>
<markup>
<kw> key words </kw>
<o> emphasis </o>
<tm> trademark </tm>
<sig> signature </sig>
<snail> paper mail ("snail mail") address </snail>
<snappy-quote> when in doubt, quote an RFC. </snappy-quote>
<msgid> message id </msgid>
<from> from </from>
<excerpt> 
	<msgid> LAWS.91Jun22223423@sunset.ai.sri.com </msgid>
	<from> laws@ai.sri.com (Kenneth I. Laws) </from>
</excerpt>
</markup>
</comment>

emv@msen.com (Ed Vielmetti) (06/25/91)

<par> 
As far as is feasible, the IETF "richmail" project is being
pushed to use as simple a subset of SGML as possible so that people
can type it in by hand and not have it distract too much from the
actual text. 
</par>
<excerpt>
   in article 1991Jun24.193928.21180@newshost.anu.edu.au 
   cmf851@anu.oz.au (Albert Langer) writes:   

   However if a suitable SGML document type HAS been defined for your
   purposes then you ought to publish it and reference it as a public
   text. Then you can use a MUCH less verbose (but equally readable)
   notation - e.g. omitting or shortening most of the end markers and
   making use of various abbreviations and typist techniques.
</excerpt>
<par>
There are good reasons not to use the SGML minimization rules, not the
least of which is to minimize the amount of work that "dumb" user
agents have to do to strip out the formatting information.  To quote
from the internet draft --
<excerpt>
            NOTE ON THE RELATIONSHIP OF RICHTEXT TO SGML:   Richtext  is
            decidedly  not  SGML,  and  should  not be used to transport
            arbitrary SGML  documents.   Those  who  wish  to  use  SGML
            document  types  as  a mail transport format should define a
            new text-plus subtype,  e.g.  "text-plus/sgml-dtd-whatever".
            Richtext  is  designed  to  be  compatible  with  SGML,  and
            specifically so  that  it  will  be  possible  to  define  a
            richtext  DTD  if  that  is  desired. However, this does not
            imply that arbitrary SGML can be called richtext,  nor  that
            richtext  implementors have any need to understand SGML; the
            description  in  this  memo  is  a  complete  definition  of
            richtext.						
</excerpt>
The approach of avoiding the complicated minimization rules
facilitates treatment of the text by more general systems, such as
Open Text System's PAT, which can be taught to recognize very simple
tagging schemes but which don't have facilities for disambiguating
whether a minimized end-tag matches one or more begin-tags.  I also
hope to have a system built in GNU Emacs, and while the richtext
scheme seems easy enough with it, I don't have any intention of hacking
full-blown SGML in Emacs.
</par>
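<par>
To illustrate why unminimized tags are easy on simple tools, here is
a minimal sketch in Python of a balance check that needs nothing more
than a stack.  It assumes every tag is paired and spelled out in full,
which holds for the tags used in this message but may not hold for
every token in the draft.  The names TAG and check_balanced are
illustrative only and do not come from the richtext draft.

```python
import re

# Assumption: a tag is <name> or </name>, with no attributes and no
# minimized forms. This pattern is illustrative, not from the draft.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)>")

def check_balanced(text: str) -> bool:
    """Return True if every begin-tag has a matching, properly nested end-tag."""
    stack = []
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            # Begin-tag: remember it until its end-tag turns up.
            stack.append(name)
        elif not stack or stack.pop() != name:
            # End-tag with no open tag, or one that closes the wrong tag.
            return False
    # Anything left on the stack was opened but never closed.
    return not stack

print(check_balanced("<par>some <o>emphasis</o> here</par>"))
print(check_balanced("<excerpt>quoted text<excerpt>"))
```

Because each end-tag names the tag it closes, there is never a question
of which begin-tag it matches; with minimized end-tags the same check
would need knowledge of the document type.
</par>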
<par>
As an extreme example, all of the markup in this document is one tag
per line, which is extremely easy to wipe out, even with grep -v.
</par>
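<par>
For readers without grep handy, the same one-tag-per-line stripping
can be sketched in a few lines of Python.  This assumes a markup line
holds exactly one tag and nothing else; the pattern TAG_ONLY_LINE and
the function strip_tag_lines are illustrative names, not part of any
richtext tool.

```python
import re

# Assumption: a "markup line" consists of a single tag such as <par> or
# </snappy-quote>, optionally surrounded by whitespace, and nothing else.
TAG_ONLY_LINE = re.compile(r"^\s*</?[A-Za-z][A-Za-z0-9-]*>\s*$")

def strip_tag_lines(text: str) -> str:
    """Drop every line that consists solely of one tag; keep the rest as-is."""
    kept = [line for line in text.splitlines() if not TAG_ONLY_LINE.match(line)]
    return "\n".join(kept)

sample = "<par>\nAs an extreme example, one tag per line.\n</par>\n<sig>\nEdward Vielmetti\n</sig>"
print(strip_tag_lines(sample))
```

Lines that carry tags inline with text are left untouched by this
sketch, so it only gives clean output for messages marked up one tag
per line.
</par>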
<sig>
Edward Vielmetti, vice president for research, MSEN Inc. emv@msen.com
<snappy-quote>
By the way, Ed, I think you may be the first person in the history of
the world to successfully send a multifont email message to someone who
wasn't using the same software with which the message was composed.
Congratulations!	nsb@thumper.bellcore.com
</snappy-quote>
</sig>