[comp.sys.next] NeXT's Digital Library

levy@phoenix.Princeton.EDU (Silvio Levy) (12/28/88)

In article <19728@ames.arc.nasa.gov> mike@ames.arc.nasa.gov.UUCP
 (Mike Smithwick) writes:

>The Digital Librarian is impressive. We searched for the word "celestial"
>throughout the works of Shakespeare. It found all 3 entries in what appeared
>to be less than 5 seconds.

Huh?  I'm mystified.  What do you mean by ``all 3 entries''?  Using
the UNIX utility `grep' I found 17.  In general I'm very disappointed
with the Digital Library, for many reasons detailed below.  Notice that
while some of the reasons appear to be bugs, hence have a chance to
go away, others seem to constitute ``features'', so probably will
stay.  I start with the ``features''.  I'm thankful to Nick Katz
(nmk@fine.princeton.edu) for pointing out some of the facts below and
motivating me to make a more complete study.

UNDESIRABLE FEATURE #1:
You can only search for words, not for strings or phrases.  
This means that to find out where S. wrote ``To be or not to be'',
you'd have to wade through thousands of occurrences of ``to'',
``be'', ``or'' or ``not''.  But read on.

UNDESIRABLE FEATURE #2:
Apparently very common words cannot be used as search keys at all--
you get a ``0 found'' response.  This is the case with the four words
mentioned above.  Together with feature #1, this means that the Digital
Librarian simply won't locate S.'s most famous quotation.

UNDESIRABLE FEATURE #3:
The display of occurrences is done in two windows.  The top window, a
smaller one, consists of one line for each file where the word was
found (each file has a scene of a play, or a sonnet, etc.)  The line
contains the file name (e.g. Coriolanus: 1.4) and the beginning of the
file.  The latter is completely useless information, as it usually
consists of stage directions, etc. I would expect here a context line
instead, including the keyword.  To actually see the quotations you
want, you select a line from the top window; the bottom window shows
the corresponding file, centered around the first occurrence of the
word in the file.  The upshot is that to find a particular quotation,
you have to click on every line of the first window to open the
corresponding file, then click on ``Find'' before leaving that line
(just in case the file contains more than one occurrence).  Compare
this with the system used in printed and on-line concordances, where
you're presented with a list of context lines and can scan it visually
for the quotation you're looking for.
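
The concordance-style display described above can be sketched in a few
lines of Python. This is only an illustration of the idea, not anything
the Librarian does: the function name, the snippet width, and the sample
text (apart from the line quoted in this thread) are assumptions made up
for the example.

```python
import re

def context_lines(text, keyword, width=30):
    """Return one context snippet per whole-word occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # Collapse all whitespace so each snippet fits on a single line.
    flat = " ".join(text.split())
    snippets = []
    for match in pattern.finditer(flat):
        start = max(0, match.start() - width)
        end = min(len(flat), match.end() + width)
        snippets.append(flat[start:end])
    return snippets

# Sample text: the first line is quoted in this thread, the rest is filler.
sample = "And our twelve thousand horse\nare ready. The horse is here."

for snippet in context_lines(sample, "horse", width=20):
    print(snippet)
```

Each occurrence gets its own line of context, so a reader can scan the
list instead of opening every file in turn.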

UNDESIRABLE FEATURE #4:
The source text has very low-level formatting commands embedded
in it.  (Though I guess I should be thankful it's in ASCII files, not
in binary files in some proprietary format...)  For example, the
beginning of <Shakespeare>/Plays/Hamlet/1.1 is something like this:
...
{\pard\f0\fs28{\fs48 Hamlet\
}\
\
{\b\fs36 1.1}
\
{\i	Enter Barnardo [...]
}{\b \fs24 BARNARDO}	Who's there?\
...
For this text to be used elsewhere than in an ``edit'' file, or even
within an ``edit'' file but in a different format, you have
to strip all this garbage.  The markup should instead be done at a
higher level, so global changes are easy to make.  For example, using
a TeX-like notation (that's what I'm most accustomed to;
but SGML or any other markup language would do equally well):
\title Hamlet \endtitle
\scene 1.1 \endscene
\dir Enter Barnardo [...] \enddir
\speak BARNARDO \endspeak
  Who's there?

Now for the bugs:

BUG #1:
Not all occurrences of a word are found -- far from it.  And generally
you have no clue of that.  I've already mentioned the ``celestial''
fiasco (3 found out of 17).  If you try ``horse'' the situation is
similar: 18 found out of 369.  Actually this is what started this
whole thing:  Nick Katz pointed out that if you search for ``horse''
you get (among others) a line saying ``And our twelve thousand horse''
(Ant. and Cl.: 3.7), but if you search for ``twelve'' you don't get
this same line!  The most annoying thing is that the choice of quotations
presented doesn't seem to be based on any clear criterion: the
14 ``celestial''s that didn't make it into the search seemed to have as
much of a right to be there as the three that did!
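
A full-text count of the kind used for this comparison can be sketched
as follows. It is a rough stand-in for a grep over the text files, not
the exact command used above: the function name is hypothetical, the
whole-word and case-insensitive matching are assumptions, and the RTF
markup in the files is not stripped first.

```python
import os
import re

def count_word(root, word):
    """Return (total occurrences, number of files with a match) under root."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    occurrences = 0
    files_with_match = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # latin-1 maps every byte to a character, so decoding never fails.
            with open(path, encoding="latin-1") as handle:
                hits = len(pattern.findall(handle.read()))
            if hits:
                occurrences += hits
                files_with_match += 1
    return occurrences, files_with_match
```

Calling count_word on the directory that holds the plays, with the word
of interest, gives both figures at once. The distinction matters here
because the Librarian reports files, while grep reports matching lines.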

BUG #2:
Treatment of plurals, etc. is inconsistent.  E.g. searching for ``horse''
and ``horses'' brings up two disjoint sets of occurrences, but one of
the occurrences listed under ``horse'' actually says ``horses''
(1 Henry IV: 2.4).  In general there doesn't seem to be any way to search
for words under a prefix (as there seems to be for Webster's, although
that doesn't work all the time either -- but that's the subject of
another message).

Silvio Levy (levy@princeton.edu)

jgreely@diplodocus.cis.ohio-state.edu (J Greely) (12/28/88)

In article <5037@phoenix.Princeton.EDU> levy@Princeton.EDU
 (Silvio Levy) writes:
[in reply to "Mike of the silly return address" stating that he
 found "all 3 entries" of a word in the Librarian]

>Huh?  I'm mystified.  What do you mean by ``all 3 entries''?  Using
>the UNIX utility `grep' I found 17.

Yes, boys and girls, the correct phrase is all *indexed* entries of
a word.  Actually, to be more precise, I should say, "all files for
which a word is indexed", since the indexing is at the file level.
Indexing in general is a very-beta operation, and the current
scheme is listed in the release notes with:

	This set of tools is not supported.  It will change
	between now and the 1.0 release, but it does give a
	flavor of things to come.

Since the indexing library is at the heart of the lookup problems,
simply bear with it until it is replaced by a better scheme.

  Actually, what's there is very nice.  The db library is dbm done
right, and the idea behind pword is excellent (although its current
reliance on modern english is unfortunate; this is one of the major
reasons why the indexing in Shakespeare isn't as good as it could
be).  I have great hopes that db will eventually find its way out
into the world (I'd love to work over everything around here that
relies on dbm, and insert db instead.  This would probably solve
several of our problems with yp).

>You can only search for words, not for strings or phrases.  
>This means that to find out where S. wrote ``To be or not to be'',
>you'd have to wade through thousands of occurrences of ``to'',
>``be'', ``or'' or ``not''.  But read on.

  This is a combination of things.  Do you really want all
occurrences of "to"?  Quick check shows there to be more than 16000
of them, scattered throughout over 6000 files.  Common noise words
are eliminated from the index as a design decision.  As for the
inability to search for a phrase, this is acknowledged as a
limitation in the release notes.

  Also, the above statement is not quite true.  You can search for
	<word> ["and"|"or"|"and not" <word> ...]
which, if the words you want are indexed, will narrow the search
for you.  My stock example is locating the line "Ready, so please
your grace" in The Merchant of Venice.  Not a very important line,
but it stuck in my memory from when we performed the play.  The
only word that is indexed is "grace", which occurs 75 times.
The one (reasonable) search that will uniquely locate it is
"merchant and grace" (Merchant of Venice, Act 4, Scene 1, second
line).
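
The combination of a file-level index, a stop list, and boolean queries
can be sketched like this. It is a minimal illustration under stated
assumptions, not the Librarian's implementation: the stop list, the
function names, the left-to-right evaluation, and the sample files
(including the placeholder entry) are all invented for the example.

```python
import re

# Hypothetical stop list; the real one is not documented in this thread.
STOP_WORDS = {"to", "be", "or", "not", "the", "and", "of", "so", "your"}

def build_index(files):
    """Map each non-stop word to the set of file names that contain it."""
    index = {}
    for name, text in files.items():
        for word in set(re.findall(r"[a-z']+", text.lower())):
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(name)
    return index

def search(index, query):
    """Evaluate '<word> [and|or|and not <word> ...]' from left to right."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    i = 1
    while i < len(tokens):
        op = tokens[i]
        if op == "and" and i + 1 < len(tokens) and tokens[i + 1] == "not":
            op = "and not"
            i += 1
        i += 1
        if op not in ("and", "or", "and not") or i >= len(tokens):
            raise ValueError("malformed query: " + query)
        operand = index.get(tokens[i], set())
        if op == "and":
            result &= operand
        elif op == "or":
            result |= operand
        else:
            result -= operand
        i += 1
    return result

# Sample data: file names follow the style shown in the Librarian;
# the third entry is a placeholder, not a real scene.
FILES = {
    "Merchant of Venice: 4.1": "The Merchant of Venice 4.1 Ready, so please your grace",
    "Hamlet: 3.1": "Hamlet 3.1 To be, or not to be ... nobler",
    "Example: 1.1": "Placeholder text that also mentions grace",
}

index = build_index(FILES)
print(search(index, "grace"))
print(search(index, "merchant and grace"))
print(search(index, "to"))
```

Because the index only records which files contain a word, a query can
narrow the list of files but can never point at a position inside one.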

>Apparently very common words cannot be used as search keys at all--
>you get a ``0 found'' response.  This is the case with the four words
>mentioned above.  Together with feature #1, this means that the Digital
>Librarian simply won't locate S.'s most famous quotation.

Correct.  At present, that quote (as well as several others I've
tried) cannot be found from within the Library as is.  However, if
you know any of the surrounding context, you're better off.  I
happen to remember that the line comes from Hamlet, and that the
quote continues with "that is the question. Whether 'tis
nobler...".  Searching for "nobler" will return 19 files, while
"hamlet and nobler" will return the correct section (Hamlet, Act 3,
Scene 1).  From there, a Find on "nobler" will put you at the
correct location in the file.

  Mind you, you'll never find "Now is the winter of our discontent
made glorious summer by this son of York", unless you know that
it's the first two lines of Richard III.  Incidentally, Library
reports this line as "...son of York", while Quotations claims that
it's "...sun of York".  Typo, anyone?

>UNDESIRABLE FEATURE #3:
[indexing stores the first line(s) of the file, rather than the
context of the match]

Agreed.  The context would be more useful, but I don't think this
will change.  The index is built at the file level, so all it knows
is that the word is important enough to be indexed for that file.
If it returned context, it would be the context of the first entry,
and not necessarily the one you want.

>UNDESIRABLE FEATURE #4:
[embedded rtf, rather than something brighter]

This looks like a feature, since low-level encoding requires less
intelligence than full TeX-like macros.  Not having any
documentation on the Microsoft RTF format, I can't say whether it
is capable of more sophisticated (read that, "higher level")
formatting.

>Now for the bugs:
>
>BUG #1:
[bug, feature, same difference]

>Not all occurrences of a word are found -- far from it.

I recommend to you the manual page for "pword".  This will help
clarify how the indexing is currently done.  The object is to index
all *significant* words, based on the surrounding context.  A
document with frequent mention of horses is more likely to have
"horse" indexed than one where it's only mentioned once.  Note that
the documentation for pword is slightly out of date, and will
hopefully be correct by 0.9 (for the correct options, use "pword
-:").

One other problem is picking Shakespeare for this discussion.  The
frequency tables used for the indexing appear to be the Modern
English version, rather than one more appropriate for the work.  In
particular, the stop list does not include noise words like thee,
thy, thou, etc., instead indexing them quite heavily ("thou" is
indexed 315 times, for example).
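
To make the idea of indexing only "significant" words concrete, here is
a small sketch. It is not pword's algorithm, which is not described in
this thread: the scoring rule, the thresholds, the background
frequencies, and the sample sentence are all assumptions chosen for
illustration.

```python
import re
from collections import Counter

# Hypothetical background frequencies (fraction of all words) for
# modern English. The numbers are invented for this example.
BACKGROUND = {
    "the": 0.06,
    "and": 0.03,
    "to": 0.03,
    "horse": 0.0001,
    "thou": 0.000001,
}
DEFAULT_FREQ = 0.0001  # assumed frequency for words missing from the table

def significant_words(text, ratio=50.0, min_count=2):
    """Pick words that occur far more often than the background predicts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    chosen = set()
    for word, count in counts.items():
        expected = BACKGROUND.get(word, DEFAULT_FREQ)
        if count >= min_count and (count / total) / expected >= ratio:
            chosen.add(word)
    return chosen

# Made-up sample sentence, not a quotation.
sample = "thou hast the horse and thou hast the saddle and the horse is thine"
print(sorted(significant_words(sample)))
```

With a modern-English table, "thou" looks extremely rare and therefore
highly significant, so it gets indexed, while "the" is filtered out.
A table built from text of the period would treat "thou" as noise.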

>BUG #2:
>Treatment of plurals, etc. is inconsistent.

Words are "singularized", but no mention is made of the technique
used.  It is quite likely that the method currently used isn't as
bright as one might hope.
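
Since the actual technique is undocumented, the following is only a
guess at how a simple scheme could behave. It is a naive
suffix-stripping sketch with made-up rules, shown to illustrate how
such a method folds some forms together and mangles others.

```python
def singularize(word):
    """Naive singularization by suffix stripping (illustrative rules only)."""
    w = word.lower()
    # Leave very short words and words like "glass" alone.
    if len(w) <= 3 or w.endswith("ss"):
        return w
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("s"):
        return w[:-1]
    return w

for word in ["horses", "horse", "ladies", "glass", "boxes"]:
    print(word, "->", singularize(word))
```

Rules this simple map "horses" to "horse" but turn "boxes" into "boxe",
which is one way an index can end up with inconsistent entries.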

Now, to toss in a few of my own (my complete list is a bit too
large to post, so I'll limit myself to a few things you didn't
mention about the Library):

1) A lower-case search string will perform a case-insensitive
   search, while an upper-case character will force an exact match.
   Nice in theory, but it doesn't work.  Searching (in Shakespeare)
   for "Merchant and grace" will return all 75 matches for "grace",
   while "merchant and grace" will return the unique match that I'm
   looking for.

2) There is no way to pull up an arbitrary file into the Library,
   except as the result of a search.  For example, if Act 4, Scene
   1 comes up as the result of a search, I cannot simply proceed to
   Scene 2 if I wish to continue reading.  I can open an Edit
   window containing it, but I can't pull it into the Library
   unless I can match it with a search.  This is the most serious
   limitation of the program for me, and the one I most want to see
   changed by 1.0.  I want to be able to browse through the files
   contained in the current database, without leaving the Library.

3) The target field is shared by the Search and Find buttons, but
   not by the Open button, which instead pulls up a Browser window.
   Better yet, Search understands multiword targets, while Find
   will attempt to match the literal string.  So, I cannot click
   Search, and then expect Find to locate the search string within
   the selected file, unless the search string was a single word.
   The inconsistent use of the target field is confusing.

4) Printing is useless.  An RTF document printed from the Library
   will have no margins, and will be silently clipped on the way
   to the printer.  If you want to print, you currently have to call
   up Edit on the current file.
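
The case rule described in item 1 can be stated precisely in a few
lines. This is a sketch of the intended behavior as described there,
not of the Librarian's code; the function name is hypothetical.

```python
def matches(query, word):
    """Lower-case query ignores case; any upper-case letter forces exact match."""
    if query == query.lower():
        return query == word.lower()
    return query == word

print(matches("grace", "Grace"))        # lower-case query, case ignored
print(matches("Merchant", "merchant"))  # upper-case in query, exact match required
print(matches("Merchant", "Merchant"))
```

Under this rule a query word such as "Merchant" should match fewer
entries than "merchant", never more.
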
-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)
"Who is it *this* time?"
	"Concert promoters who have gone broke organizing
	 charity benefit concerts.  We call it Aid Aid."

ali@polya.Stanford.EDU (Ali T. Ozer) (12/29/88)

In article <5037@phoenix.Princeton.EDU> Silvio Levy writes:
>UNDESIRABLE FEATURE #4:
>The source text has very low-level formatting commands embedded
>in it.  (Though I guess I should be thankful it's in ASCII files, not
>in binary files in some proprietary format...)  

The formatting used for the Shakespeare files is the Microsoft Rich Text
Format (usually known as RTF). The NextStep Text class understands
RTF; and any program using the Text class should be able to read in and edit
RTF files without a problem. (Currently the Text class cannot write out RTF;
but it will be able to in 0.9.)

You can use Edit, the cut-and-paste editor in the Apps directory, for
reading in RTF files and stripping the RTF info off. If you double-click
on the file name in the Librarian, Edit will be launched and the 
specified file will be read into a new window, formatted correctly and
with the various fonts as indicated by the RTF instructions. If you wish
to strip the RTF commands off, create a new Edit window, then cut the 
desired text from the first window and paste it into the second. Edit
windows by default are mono-font, so the RTF info is automatically stripped
during the paste. 

You can make an Edit window accept RTF by selecting "Make RTF" from the menu.

Ali Ozer, NeXT Developer Support
aozer@NeXT.com

jgreely@dinosaur.cis.ohio-state.edu (J Greely) (12/29/88)

In article <5820@polya.Stanford.EDU> aozer@NeXT.com writes:
[strip RTF from files in the Library by launching Edit, and
 cut-and-pasting into a new edit window]

Well, this is useful, but for those of us who want to strip
RTF from an arbitrary file without the overhead of starting
Edit, the filter /bootdisk/NeXT/System/Searcher/rtf-ascii is
more fun (not perfect, but more fun).
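
For readers without access to that filter, the general idea of
stripping RTF can be sketched as below. This is not the rtf-ascii
filter and makes no claim about how it works. It is a rough
approximation that assumes only simple control words, braces, and
backslash-newline paragraph breaks, as in the Hamlet excerpt quoted
earlier in the thread. Escaped braces, escaped backslashes, and hex
escapes are not handled.

```python
import re

def strip_rtf(rtf):
    """Very rough RTF-to-text conversion for simple input only."""
    marker = "\x00"  # temporary stand-in for paragraph breaks
    # A backslash at the end of a line marks a paragraph break.
    text = re.sub(r"\\\r?\n", marker, rtf)
    # Remaining raw newlines carry no meaning in RTF.
    text = re.sub(r"[\r\n]", "", text)
    # Control words such as \pard, \b, \fs28, plus one delimiter space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", text)
    # Group delimiters.
    text = re.sub(r"[{}]", "", text)
    return text.replace(marker, "\n")

sample = "{\\b \\fs24 BARNARDO}\tWho's there?\\\n"
print(strip_rtf(sample), end="")
```

A real converter would need a proper tokenizer. The regular
expressions here are just enough to show what the markup adds on top
of the plain text.
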
-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)