grc@wjh12.UUCP (crane) (05/28/84)
We have an unformatted textual database currently comprising 140 mbytes of text, which will grow to about 500 mbytes within the next two years. Inverted indices (50% overhead--on top of 140 mbytes of text) have been developed, but for some applications (such as fixed phrases or combinations of common words) it is necessary to perform a linear search on the entire corpus.

a) I am interested in benchmarks to see how fast different machines can perform linear searches. In particular, I would like to know how fast the command "egrep xxx /usr/dict/words" (where /usr/dict/words ~= 200K) runs on a GOULD, PYRAMID, ZILOG or different 68K based systems. We have access to a VAX 11/750 and 780, PDP 11/44 and PIXEL 100. Benchmarks from any other systems would be greatly appreciated.

The PIXEL is quite fast in core, but the disks are ruinously slow: an otherwise idle PIXEL 100 (with 40 mbyte disks) can only spend 30% of its time on an egrep. The rest of the time it is evidently twiddling electrons waiting for more disk blocks. Does anybody out there have a Sun with the Fujitsu Eagle?

This dbase has a limited clientele, and the machine would not need to field more than 4 searches or so at a time, but we could easily use a more powerful system and would just as soon not dedicate a system to this database.

b) Does anyone out there know of any good way to deal with searching this much data on a UNIX system? Experiments in distributed processing that could provide wide access cheaply? This is a read-only dbase, so we could avoid the UNIX file system and store the data in big blocks on a raw file system. Has anyone got some special hardware hanging off of a UNIX system to perform this kind of task?

Gregory Crane
Harvard University
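For anyone who wants to try the same kind of measurement where egrep is not handy, here is a minimal sketch of a linear fixed-phrase search that reads the file in big blocks and counts matching lines, much as "egrep -c" would for a plain string. It is only an illustration under assumptions: the function names `count_matching_lines` and `timed_search`, the 1 Mbyte default block size, and the use of Python are not from the original post, and it does no regular-expression matching at all.

```python
import sys
import time


def count_matching_lines(path, phrase, block_size=1 << 20):
    """Count lines containing `phrase`, reading the file in large blocks.

    Hypothetical helper for illustration: a fixed-string search only.
    """
    needle = phrase.encode()
    matches = 0
    carry = b""  # incomplete line left over from the previous block
    # buffering=0 gives unbuffered reads, so block_size controls the I/O size
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            lines = (carry + block).split(b"\n")
            # The last piece has no newline yet; hold it for the next block.
            carry = lines.pop()
            matches += sum(1 for line in lines if needle in line)
    # A final line without a trailing newline still counts.
    if carry and needle in carry:
        matches += 1
    return matches


def timed_search(path, phrase, block_size=1 << 20):
    """Return (matching line count, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    matches = count_matching_lines(path, phrase, block_size)
    return matches, time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) == 3:
    found, seconds = timed_search(sys.argv[1], sys.argv[2])
    print(f"{found} matching lines in {seconds:.2f} s real")
```

Run as a script with a path and a phrase as its two arguments, it prints the count and the elapsed wall-clock time. Varying `block_size` gives a rough feel for how much of the time goes to waiting on the disk rather than to the search itself.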
trt@rti-sel.UUCP (05/30/84)
I ran this on a "Gould Concept 32/8750" (whew) when it was fairly idle. The filesystem is 1k/8k, but there are only 102 1K incore buffers.

    Script started on Tue May 29 18:34:20 1984
    rti-sel% time wc /usr/dict/words
       24474   24474  201039 /usr/dict/words
    1.2 real 0.6 user 0.2 sys
    rti-sel% time egrep xxx /usr/dict/words
    1.1 real 0.5 user 0.3 sys
    rti-sel% ^D
    script done on Tue May 29 18:34:45 1984

I have about 5 megabytes of old news article titles (plus article ID) so it is handy to have a fast grep.

Tom Truscott
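As a rough worked example, the egrep timing above (201039 bytes in 1.1 seconds real) can be scaled linearly to the corpus sizes mentioned in the original question. This is a back-of-the-envelope sketch only. It assumes "mbyte" means 10^6 bytes, that scan time grows linearly with size, and that a 200K file timed to 0.1 second resolution (possibly already sitting in the buffer cache after the wc run) says anything about a disk-bound scan of hundreds of megabytes. None of those assumptions comes from the posts, and the function name is made up for illustration.

```python
# Figures taken from the transcript above.
SAMPLE_BYTES = 201_039      # size of /usr/dict/words as reported by wc
SAMPLE_REAL_SECONDS = 1.1   # "real" time of the egrep run

# Assumption: 1 mbyte = 10**6 bytes.
MBYTE = 10**6


def estimated_scan_seconds(corpus_bytes,
                           sample_bytes=SAMPLE_BYTES,
                           sample_seconds=SAMPLE_REAL_SECONDS):
    """Linearly scale the sample timing to a larger corpus (hypothetical helper)."""
    return corpus_bytes * sample_seconds / sample_bytes


if __name__ == "__main__":
    for mbytes in (140, 500):
        seconds = estimated_scan_seconds(mbytes * MBYTE)
        print(f"{mbytes} mbytes: about {seconds / 60:.0f} minutes per linear pass")
```

Under those assumptions the estimate comes out on the order of 13 minutes for 140 mbytes and 46 minutes for 500 mbytes per pass. Treat that as a loose ballpark, not a benchmark.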