[comp.sys.next] NeXT Database Prowess

fqoj@vax5.CIT.CORNELL.EDU (08/15/89)

                                                                                
Folks,                                                                          
                                                                                
There is a ton of interest on my campus surrounding the NeXT machine.           
Surprisingly, or not surprisingly, depending on your viewpoint, much of
the interest comes from NON-technical fields. I have a question that            
I've seen many people dance around on this net but no one clearly               
address. Many people have seen the "Complete Works of Shakespeare" go
through its paces. The next (no pun) step, obviously, for many, is to
put the subject matter of their interest "on-line" in a similar fashion.
I have two specific requests to pass on by way of example: In one case,         
the entire written works of Sigmund Freud have been entered                     
electronically (really! the department even got a grant from NEH to do          
it. They've spent almost $50,000 so far in Kurzweil time!)...they'd             
like to have that database "NeXTized" or whatever the process is                
called. A similar situation is for a unit studying the works of Plato.          
Although the Plato project is not nearly as far along, both projects are        
VERY interested in the technology. So, the queries:
                                                                                
What exactly is the process going on "under" the Shakespeare icon? It
can't be just a glorified fgrep. How does the cube, burdened with the
Unix file-system, get such good recall on that large database?
                                                                                
Is there a way Cornell could send the disk data to NeXT, or even a third        
party, and have them put the data on an OD with the proper                      
cross-indexing? We'd want to do the front-end ourselves in IB                   
(obviously, where's the fun without that chance?  :-)   )                       
                                                                                
Is there a way to license the underlying software that drives such
cross-referenced databases? Is this a NeXT-developed technology or third
party? Obviously the potential is great for any field to have their "hot
topics" ready and on-line in such a fashion. Will it be part of a future
OS release? Maybe something like AppKit, only this would be called
DataBaseKit?
                                                                                
Please, we're sincere here and the money is (sort of) there or can be           
found. If anyone has any info please pass it along or, if you could,            
direct me to someone who is in the know. Thanks in advance.                     
                                                                                
Reply to:                                                                       
                                                                                
Roger Jagoda                                                                    
System Analyst                                                                  
Cornell University                                                              
Internet: FQOJ@CORNELLA.CIT.CORNELL.EDU                                         
Bitnet: FQOJ@CORNELLA.BITNET                                                    
Snail Mail: 220 Cornell Comp. Cent. Cornell University, Ithaca, NY 14853        
AT&T: (607) 255-8960                                                            

dz@tangello.ucsb.edu (Daniel James Zerkle) (08/15/89)

In article <19350@vax5.CIT.CORNELL.EDU> fqoj@vax5.cit.cornell.edu () writes:
> ...they'd             
>like to have that database "NeXTized" or whatever the process is                
>called. A similar situation is for a unit studying the works of Plato.          
>What exactly is the process going on "under" the Shakespear icon, it            
>can't be just a glorified fgrep. How does the cube, burdened with the           
>unix file-system, get such good recall on that large database?                  

It is a fairly straightforward implementation of inverted indices.
That is, keywords are sifted out from the original text, sorted, and
hashed.  When the Digital Librarian looks for a word, it has three
files (set up previously) that are exceedingly fast to search, due
to the way they are arranged (hashed and sorted).  Once a keyword is
found there, it references the individual files and locations of
the original text.  And actually, it is possible to turn off the
indexes and use fgrep, which is necessary to search for certain
sophisticated patterns (parts of words) that the indexes can't
handle.  This is similar to the REFER database system already
implemented on any Berkeley system (and maybe Sys V, I don't know).
It is a bit more sophisticated, as there are systems for indexing
multitudes of different kinds of files, and more information is
available about the objects searched after a key is found and before
it is looked up.
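
To make the idea concrete, here is a minimal sketch of an inverted
index with an fgrep-style fallback. It is not the Digital Librarian's
actual file format or API: the names DOCS, build_index, lookup and
scan are made up for illustration, the "files" are two short strings
held in memory, and a Python dict stands in for the hashed, sorted
index files.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "files"; a real index would read these from disk.
DOCS = {
    "hamlet.txt": "To be, or not to be, that is the question.",
    "macbeth.txt": "Tomorrow, and tomorrow, and tomorrow, creeps in this petty pace.",
}

WORD = re.compile(r"[a-z']+")


def build_index(docs):
    """Map each keyword to a sorted list of (file name, character offset)."""
    index = defaultdict(list)
    for name, text in docs.items():
        for match in WORD.finditer(text.lower()):
            index[match.group()].append((name, match.start()))
    return {word: sorted(postings) for word, postings in index.items()}


def lookup(index, word):
    """Whole-word search: one hash lookup, no scan of the original text."""
    return index.get(word.lower(), [])


def scan(docs, fragment):
    """fgrep-style fallback: a linear scan that also matches parts of words."""
    fragment = fragment.lower()
    hits = []
    for name in sorted(docs):
        text = docs[name].lower()
        pos = text.find(fragment)
        while pos != -1:
            hits.append((name, pos))
            pos = text.find(fragment, pos + 1)
    return hits


INDEX = build_index(DOCS)
print(lookup(INDEX, "tomorrow"))  # whole word: answered from the index
print(lookup(INDEX, "morrow"))    # part of a word: the index has no entry
print(scan(DOCS, "morrow"))       # so fall back to scanning the text
```

The index answers whole-word queries without touching the text, which
is why it stays fast as the collection grows; the scan is slower but
handles the partial-word patterns the index cannot.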

>Is there a way Cornell could send the disk data to NeXT, or even a third        
>party, and have them put the data on an OD with the proper                      
>cross-indexing?

Not necessary.  Just drag the folder (i.e. directory) from the
directory browser to an empty icon well in the Digital Librarian.
You can index the files from a menu selection (I forget which),
but be careful, as DL has a bug that makes it think it has an
indexed directory when it isn't really indexed.

>We'd want to do the front-end ourselves in IB                   
>(obviously, where's the fun without that chance?  :-)   )                       

I am planning on immediately starting a similar project.  Perhaps
we should share our work.  I need capabilities that DL just
doesn't provide (displaying troff text properly).

>Is there a way to licence the underlying software that drives such              
>cross-referenced databases? Is this a NeXT-developed technology or third        
>party? Obviously the potential is great for any field to have their "hot        
>topics" ready and on-line in such a fashion. Will it be part of a future        
>OS release. Maybe something like AppKit only this would be called               
>DataBaseKit?                                                                    

You already have the software.  There are a bunch of poorly documented
function calls (well, not all THAT poorly documented) to handle all
the indexing stuff.  It is not Objective-C, but just the ordinary
stuff.  Search in the digital librarian for "index" and "indexing"
under the release notes and the manual pages, and you'll find all
sorts of stuff.  I recommend you start from a terminal with
"man 1 index", and follow the cross references.

I responded here because I thought some of this stuff is of
general interest, but I would really like to work with you, as
I think we could help each other out a lot.  Please send mail.

| Dan Zerkle home:(805) 968-4683 morning:961-2434 afternoon:687-0110  |
| dz@cornu.ucsb.edu dz%cornu@ucsbuxa.bitnet ...ucbvax!hub!cornu!dz    |
| Snailmail: 6681 Berkshire Terrace #5, Isla Vista, CA  93117         |
| Disclaimer: If it's wrong or stupid, pretend I didn't do it.        |

UH2@PSUVM.BITNET (Lee Sailer) (08/15/89)

One other point about indexed files.  Remember that the index stuff takes
roughly as much disk space as the original material, so that if you have
50 MB of Freud, then you'll need about 50 MB to store the indexes, too.
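
As a rough budgeting aid, the arithmetic can be sketched as below. The
function name and the index_ratio parameter are assumptions for
illustration; the ratio of 1.0 is only the "roughly as much" rule of
thumb above, not a measured figure.

```python
def total_storage_mb(text_mb, index_ratio=1.0):
    """Disk space for the text plus its indexes, in MB.

    index_ratio is an assumed rule of thumb: index size as a
    fraction of the size of the original text.
    """
    return text_mb + text_mb * index_ratio


print(total_storage_mb(50))  # 50 MB of text plus about 50 MB of indexes
```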

Also, perhaps we should call this software something else.  In some quarters,
"database" refers to the management of more structured material comprising
"entities" which have "attributes" (like Employee == name, age, salary, ssn ).
Since NeXT has (will have) such database capabilities, it is confusing
to call the Webster's and Shakespeare capabilities "database", too.

I suggest we call them "Information Retrieval" utilities.
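
A small sketch of the distinction, with made-up data: the Employee
fields follow the example above, while the records, the passages and
the two query functions are hypothetical and stand in for no
particular NeXT facility.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    """A structured "entity" with fixed, typed "attributes"."""
    name: str
    age: int
    salary: int
    ssn: str


# Hypothetical records for illustration only.
EMPLOYEES = [
    Employee("Ada", 36, 52000, "000-00-0001"),
    Employee("Ben", 51, 48000, "000-00-0002"),
]

# Hypothetical free text: no attributes, just words.
PASSAGES = [
    "a passage about dreams and memory",
    "a passage about civilization",
]


def older_than(employees, age):
    """Database-style query: filter on a named attribute."""
    return [e.name for e in employees if e.age > age]


def passages_containing(passages, word):
    """Retrieval-style query: find the texts in which a word occurs."""
    word = word.lower()
    return [i for i, text in enumerate(passages) if word in text.lower().split()]


print(older_than(EMPLOYEES, 40))
print(passages_containing(PASSAGES, "dreams"))
```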

chari@nueces.UUCP (Christopher M. Whatley) (08/16/89)

In article <89227.110740UH2@PSUVM> UH2@PSUVM.BITNET (Lee Sailer) writes:
>Also, perhaps we should call this software something else.  In some quarters,
>"database" refers to the management of more structured material comprising
>"entities" which have "attributes" (like Employee == name, age, salary, ssn ).
>Since NeXT has (will have) such database capabilities, it is confusing
>to call the Webster's and Shakespeare capabilities "database", too.

What is wrong with "free-form database" and "relational database"? That is
what you have with "index" and Sybase SQL.

>I suggest we call them "Information Retrieval" utilities.

Gee, it seems like I just retrieved some information from Fourth Dimension
a while ago and that I just made a mod.recipes database with "index" a few
days ago. Confusing?!?


-- 
Chris Whatley			chari@nueces.cactus.org
P.O. Box 50254			!nueces!chari@cs.utexas.edu
Austin, TX 78763		chari@walt.cc.utexas.edu
512/499-0475

bajan@opus.cs.mcgill.ca (Alan Emtage) (08/22/89)

Just as a comment: Is it just me, or do more people out there think that
there could have been a more logical way for NeXT to have bundled the
various databases it maintains? Granted, a dictionary is a special kind
of database (the keys are generally obvious), but why have the quotations
in a separate application from the Shakespeare (which is effectively the
complete collection of quotes of The Bard)?

Couldn't we benefit from the techniques used by the Webster application?
I sent a message to NeXT asking for any documentation on the internal
structure of the Webster database and got a (very nice) reply saying in 
effect that this was "private" to Webster. My reaction was one of mild
amusement since, as far as I'm concerned it's naive to think that this
information won't soon be available (if it isn't already).
As a side project, I was thinking of doing something similar to Webster,
but using the KJV of the Bible as the text, as an academic exercise. Thus
you could include some of the more well known works of art with a
religious theme as the ``pictures'' (I like art history).

This isn't a flame (well, not really), just a suggestion that the various
databases could have been put together in a more effective way.


-----------------------------------------------------------------------------
Alan Emtage,                    "It's currently a problem of access to
McGill University,CANADA        gigabits through punybaud." -  Licklider

INTERNET: bajan@cs.mcgill.ca    UUCP: ...mit-eddie!musocs!bajan
	  listmaster@cs.mcgill.ca
BITNET:	  bajan@musocs.BITNET
-----------------------------------------------------------------------------

jpd00964@uxa.cso.uiuc.edu (08/23/89)

/* Written 12:30 pm  Aug 21, 1989 by bajan@opus.cs.mcgill.ca in uxa.cso.uiuc.edu:comp.sys.next */
>As a side project, I was thinking of doing something similar to Webster,
>but using the KJV of the Bible as the text, as an academic exercise. Thus
>you could include some of the more well known works of art with a
>religious theme as the ``pictures'' (I like art history).
/* End of text from uxa.cso.uiuc.edu:comp.sys.next */


As a bit of a Bible collector, may I suggest using one of the many besides
the KJV?  I don't mean to offend, start a religious war, or a crusade, but the
mis-translations in the KJV seem to make it one that should be avoided.  Two
better versions that I hope you might consider are either the NIV or, better
yet, the New Oxford.  Both of these used the original Hebrew and retranslated.
The New Oxford even comes with the Apocrypha.

Michael Rutman

epsilon@wet.UUCP (Eric P. Scott) (08/23/89)

In article <1445@opus.cs.mcgill.ca> bajan@opus.UUCP (Alan Emtage) writes:
>Couldn't we benefit from the techniques used by the Webster application?
>I sent a message to NeXT asking for any documentation on the internal
>structure of the Webster database and got a (very nice) reply saying in 
>effect that this was "private" to Webster. My reaction was one of mild
>amusement since, as far as I'm concerned it's naive to think that this
>information won't soon be available (if it isn't already).

The "Webster database" is an image of the typesetter tape.  No
one's pulling a fast one here--NeXT's indexing is flexible enough
to work on a variety of file formats in addition to straight ASCII.
NeXT gave you an honest answer; you just asked the wrong question.

>As a side project, I was thinking of doing something similar to Webster,
>but using the KJV of the Bible as the text, as an academic exercise.

When NeXT did their first major demonstrations on our campus,
they had (some version of) the Bible online, fully indexed and
searchable in the Digital Librarian.  No pictures, though.

					-=EPS=-