[net.general] UNIX system to house 140 mbyte unformatted textual dbase?

grc@wjh12.UUCP (crane) (05/28/84)

We have an unformatted textual database currently comprising 140
mbytes of text, which will grow to about 500 mbytes within the
next two years. Inverted indices have been developed (at 50% overhead
on top of the 140 mbytes of text), but for some applications (such as
searches for fixed phrases or combinations of common words) it is
necessary to perform a linear search on the entire corpus.

a) I am interested in benchmarks to see how fast different machines
can perform linear searches. In particular, I would like to know
how fast the command "egrep xxx /usr/dict/words" (where
/usr/dict/words ~= 200K) runs on a GOULD, PYRAMID, ZILOG or any of the
various 68K-based systems. We have access to a VAX 11/750 and 780, a
PDP 11/44 and a PIXEL 100. Benchmarks from any other systems would be
greatly appreciated. The PIXEL is quite fast in core, but the disks
are ruinously slow: an otherwise idle PIXEL 100 (with 40 mbyte disks)
can only spend 30% of its time on an egrep. The rest of the time
it is evidently twiddling electrons waiting for more disk blocks.
Does anybody out there have a Sun with the Fujitsu Eagle?
	This dbase has a limited clientele, and the machine would not
need to field more than 4 searches or so at a time, but we could
easily use a more powerful system and would just as soon not dedicate
a system to this database.
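
For anyone who wants to report comparable numbers, here is a minimal
sketch of the measurement described above: a linear scan of one file,
with elapsed time and CPU time recorded so that the "fraction of time
actually spent searching" can be computed. The function names
(timed_line_scan, cpu_fraction) are made up for illustration, and a
plain substring test stands in for egrep's pattern matching, so the
absolute figures will not match egrep's.

```python
import os
import time


def timed_line_scan(path, needle):
    """Scan every line of `path` for the byte string `needle`.

    Returns (matching_lines, bytes_read, real_seconds, cpu_seconds).
    A plain substring test is used here; it is only a stand-in for egrep.
    """
    before = os.times()
    start = time.perf_counter()
    matches = 0
    nbytes = 0
    with open(path, "rb") as f:
        for line in f:
            nbytes += len(line)
            if needle in line:
                matches += 1
    real = time.perf_counter() - start
    after = os.times()
    # user + system time charged to this process during the scan
    cpu = (after.user - before.user) + (after.system - before.system)
    return matches, nbytes, real, cpu


def cpu_fraction(real_seconds, cpu_seconds):
    """Fraction of elapsed time spent computing rather than waiting."""
    if real_seconds <= 0:
        return 0.0
    return cpu_seconds / real_seconds
```

A machine that reports 3 CPU seconds over 10 elapsed seconds gives
cpu_fraction(10.0, 3.0) = 0.3, which is the situation described for the
PIXEL: the remaining time goes to waiting on the disk.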

b) Does anyone out there know of any good way to deal with searching
this much data on a UNIX system? Are there experiments in distributed
processing that could provide wide access cheaply? This is a read-only
dbase, so we could avoid the UNIX file system and store the data in
big blocks on a raw file system. Has anyone got some special hardware
hanging off of a UNIX system to perform this kind of task?
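
To make the "big blocks" idea concrete, here is a rough sketch of a
fixed-phrase search that reads its input in large unbuffered chunks
rather than line by line. It is an illustration under assumptions, not
a tested design: the name find_phrase and the 1 mbyte default block
size are invented here, and the path could equally be an ordinary file
or a device node. The one subtle point is that a phrase may straddle
two blocks, so the last len(phrase)-1 bytes of each buffer are carried
over into the next.

```python
def find_phrase(path, phrase, block_size=1 << 20):
    """Return the byte offsets of every occurrence of `phrase` in `path`.

    The input is read in chunks of up to `block_size` bytes. Overlapping
    occurrences are all reported.
    """
    if not phrase:
        raise ValueError("phrase must not be empty")
    offsets = []
    keep = len(phrase) - 1   # bytes carried over between blocks
    tail = b""
    base = 0                 # file offset of the first byte of the buffer
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            buf = tail + block
            pos = buf.find(phrase)
            while pos != -1:
                offsets.append(base + pos)
                pos = buf.find(phrase, pos + 1)
            # The carried tail is shorter than the phrase, so no match
            # can lie wholly inside it and nothing is reported twice.
            tail = buf[max(0, len(buf) - keep):]
            base += len(buf) - len(tail)
    return offsets
```

Larger blocks mean fewer read calls per mbyte scanned; whether that
actually helps depends on the disk and the driver, which is the
question being asked here.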

						Gregory Crane
						Harvard University

trt@rti-sel.UUCP (05/30/84)

I ran this on a "Gould Concept 32/8750" (whew) when it was fairly idle.
The filesystem is 1k/8k, but there are only 102 1K incore buffers.

	Script started on Tue May 29 18:34:20 1984
	rti-sel% time wc /usr/dict/words
	   24474   24474  201039 /usr/dict/words
	        1.2 real         0.6 user         0.2 sys  
	rti-sel% time egrep xxx /usr/dict/words
	        1.1 real         0.5 user         0.3 sys  
	rti-sel% ^D
	script done on Tue May 29 18:34:45 1984

I have about 5 megabytes of old news article titles (plus article ID),
so it is handy to have a fast grep.
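
As a back-of-the-envelope check, the egrep timing above can be turned
into a scan rate and extrapolated to the corpus sizes in the original
question. This is only a sketch under loud assumptions: 1 mbyte is
taken as 10**6 bytes, the rate is assumed to scale linearly, and the
200K file may well have been sitting in the buffer cache after the wc,
so a 140 mbyte scan from disk could be slower. The function names are
made up for illustration.

```python
WORDS_BYTES = 201039   # size of /usr/dict/words reported by wc
EGREP_REAL = 1.1       # elapsed seconds for egrep
EGREP_USER = 0.5       # user CPU seconds for egrep
EGREP_SYS = 0.3        # system CPU seconds for egrep


def scan_rate(nbytes, real_seconds):
    """Bytes scanned per elapsed second."""
    return nbytes / real_seconds


def estimated_scan_seconds(corpus_bytes, rate):
    """Elapsed time to scan a corpus, assuming the rate scales linearly."""
    return corpus_bytes / rate


rate = scan_rate(WORDS_BYTES, EGREP_REAL)
busy = (EGREP_USER + EGREP_SYS) / EGREP_REAL

print("scan rate: %.0f bytes/sec, CPU busy %.0f%%" % (rate, 100 * busy))
for mbytes in (140, 500):
    seconds = estimated_scan_seconds(mbytes * 10**6, rate)
    print("%d mbytes: about %.1f minutes" % (mbytes, seconds / 60))
```

On those assumptions one pass over 140 mbytes comes to roughly 13
minutes, and over 500 mbytes to roughly 46 minutes.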
	Tom Truscott