tony@disk.uucp (tony) (06/14/91)
For the past few weeks I have been creating dictionary lists for a friend's
word game (on his BBS). I've been taking data files for various programs and
converting them to word lists. Each word must be alone on its own line. Often
I have many duplicate words in every file. I would love to find a program
that will quickly delete all the duplicates. Deleting them manually or in
blocks takes forever. I had thought about delimiting each word (") and
importing the text into a dBASE file, but that would be pretty time consuming
too. Does anyone know of a program that could do the job I need done?

Thanks!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tony@disk.uucp  or  uunet!ukma!corpane!disk!Tony.Safina
valley@gsbsun.uchicago.edu (Doug Dougherty) (06/14/91)
tony@disk.uucp (tony) writes:

>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS). I've been taking data files for various programs and
>converting them to word lists. Each word must be alone on its own line. Often
>I have many duplicate words in every file. I would love to find a program
>that will quickly delete all the duplicates. Deleting them manually or in
>blocks takes forever. I had thought about delimiting each word (") and
>importing the text into a dbase file, but that would be pretty time consuming
>too. Does anyone know of a program that could do the job I need done?
>Thanks!

There is a program called DO.COM (or .EXE, I'm not sure) that DOes a lot of
things to files, including, I think, the equivalent of the Unix "uniq"
command. You should get it.
--
(Another fine mess brought to you by valley@gsbsun.uchicago.edu)
ressler@CS.Cornell.EDU (Gene Ressler) (06/14/91)
>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS). I've been taking data files for various programs and
>converting them to word lists. Each word must be alone on its own line. Often
>I have many duplicate words in every file. I would love to find a program
>that will quickly delete all the duplicates. ...

This is a canonical example for pipes in several Unix texts. You say

    sort file | uniq > file_with_no_dups

Uniq is a very simple filter that looks at the current line and prints it
only if it differs from the previous line. Both sort (which is faster/nicer
than DOS sort) and uniq are available in the GNUish MSDOS ports of the GNU
utilities by Thorsten Ohl. See wsmr-simtel20.army.mil, pd1:<msdos.gnuish>.

Another alternative to uniq, if you have awk (also available on SIMTEL), is
the following awk program, run on sorted input:

    { if ($0 != last) print $0; last = $0 }

Of course it's almost as easy in any other language. Awk is a wonderful tool
for this sort of thing. For instance, the program

    { for (i = 1; i <= NF; ++i) print $i }

will print the words one per line. It's not much harder to strip punctuation,
force to lower case, or surround with quotes and comma-separate, for that
matter!

Gene
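[Moderator's note: putting the pieces above together, a complete pipeline
along these lines might look like the sketch below. The file names words.raw
and words.list are hypothetical; tr, sort, and uniq are standard Unix
filters, not anything specific to the GNUish ports.]

```shell
# Break the input into one word per line, fold to lower case,
# then sort and drop the duplicates.
tr -cs 'A-Za-z' '\n' < words.raw |   # non-letters become newlines
    tr 'A-Z' 'a-z' |                 # force to lower case
    sort |                           # group identical words together
    uniq > words.list                # keep one copy of each word
```

Given a words.raw containing "Dog cat, dog! CAT bird", this leaves
words.list holding bird, cat, and dog, one per line.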
dw3w+@andrew.cmu.edu (Database Work) (06/15/91)
tony@disk.uucp was asking how to remove duplicate words from a word list.
In Unix, this is trivial. Just do

    sort < word.list | uniq > new.word.list

I know that the sort filter is included with MS-DOS, so all you really need
is the uniq program, for which I'm certain source must be available
somewhere, if someone hasn't ported it to DOS already. Anyone know of a uniq
for DOS? Failing this, see if you can't get access to a Unix system.

Good luck.

Tod McQuillin
dw3w+devin@andrew.cmu.edu
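[Moderator's note: as a side note to the above, a Unix sort (unlike the DOS
one) can usually do both steps in one pass with the -u flag, which sidesteps
the need for a separate uniq entirely:]

```shell
# sort -u sorts and drops duplicate lines in a single pass;
# for this job it is equivalent to "sort < word.list | uniq".
sort -u < word.list > new.word.list
```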
smfst2@unix.cis.pitt.edu (Seth M Fuller) (06/19/91)
Mix Software of Texas has a set of Unix-like utilities for MS-DOS. I'm pretty
sure an implementation of uniq is one of them. The programs are very good and
only $19.95. Their number is 1-800-333-0330. They even have an implementation
of the Bourne shell.

Seth M. Fuller
w8sdz@rigel.acs.oakland.edu (Keith Petersen) (06/19/91)
WSMR-SIMTEL20.ARMY.MIL [192.88.110.20]

Directory PD1:<MSDOS.TXTUTL>
 Filename      Type  Length  Date    Description
 ==============================================
 UNEEK101.ZIP  B      14900  910614  Eliminate duplicate records in text
                                     files, w/C

This program is very fast. It assumes that the input file has been sorted
(I use Vernon Buerg's SORTF). The input file is renamed to filename.BAK and
the new file has the name of the original. UNEEK is *much* faster than any
version of "uniq" that I've tried.

Keith
--
Keith Petersen
Maintainer of the MSDOS, MISC and CP/M archives at SIMTEL20 [192.88.110.20]
Internet: w8sdz@WSMR-SIMTEL20.Army.Mil or w8sdz@vela.acs.oakland.edu
Uucp: uunet!umich!vela!w8sdz  BITNET: w8sdz@OAKLAND
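[Moderator's note: for anyone without UNEEK, its rename-and-replace behavior
is easy to imitate with standard Unix tools. A sketch, under the same
assumption that the input is already sorted; dedupe is a made-up helper name,
and appending .BAK is an approximation of UNEEK's backup naming:]

```shell
# Keep the original as FILE.BAK and replace FILE with a
# duplicate-free copy.  Input must already be sorted, as
# UNEEK also assumes.
dedupe() {
    file="$1"
    mv "$file" "$file.BAK"        # original becomes the backup
    uniq "$file.BAK" > "$file"    # new file keeps one copy of each line
}
```

Running "dedupe words.txt" then leaves the deduplicated list in words.txt
and the untouched original in words.txt.BAK.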