[comp.binaries.ibm.pc.d] Need a program to delete duplicate lines of text

tony@disk.uucp (tony) (06/14/91)

For the past few weeks I have been creating dictionary lists for a friend's word
game (on his BBS).   I've been taking data files for various programs and 
converting them to word lists.   Each word must be alone on its own line.  Often
I have many duplicate words in every file.  I would love to find a program that
will quickly delete all the duplicates.  Deleting them manually or in blocks 
takes forever.  I had thought about delimiting each word (") and importing
the text into a dBASE file, but that would be pretty time consuming too.  Does
anyone know of a program that could do the job I need done?
Thanks!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tony@disk.uucp
or
uunet!ukma!corpane!disk!Tony.Safina

valley@gsbsun.uchicago.edu (Doug Dougherty) (06/14/91)

tony@disk.uucp (tony) writes:

>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS).   I've been taking data files for various programs and 
>converting them to word lists.   Each word must be alone on its own line.  Often
>I have many duplicate words in every file.  I would love to find a program that
>will quickly delete all the duplicates.  Deleting them manually or in blocks 
>takes forever.  I had thought about delimiting each word (") and importing
>the text into a dbase file, but that would be pretty time consuming too.  Does
>anyone know of a program that could do the job I need done?
>Thanks!

There is a program called DO.COM (or .EXE, not sure) that DOes a lot of
things to files, including, I think, the equivalent of Unix "uniq".
You should get it.
--

	(Another fine mess brought to you by valley@gsbsun.uchicago.edu)

ressler@CS.Cornell.EDU (Gene Ressler) (06/14/91)

>For the past weeks I have been creating dictionary lists for a friend's word
>game (on his BBS).   I've been taking data files for various programs and 
>converting them to word lists.   Each word must be alone on its own line.  Often
>I have many duplicate words in every file.  I would love to find a program that
>will quickly delete all the duplicates.
...
This is a canonical example for pipes in several Unix texts.
You say

sort file | uniq > file_with_no_dups

Uniq is a very simple filter that compares each line with the
previous one and prints it only if the two differ; on sorted input,
that leaves exactly one copy of each line.
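As a quick illustration of the pipeline (the file names here are made up):

```shell
#!/bin/sh
# Build a small word list with duplicates, then remove them
# with the classic sort | uniq pipeline.
printf 'cat\ndog\ncat\nbird\ndog\n' > words.tmp
sort words.tmp | uniq > words.dedup
cat words.dedup        # bird, cat, dog -- each exactly once
rm -f words.tmp words.dedup
```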

Both sort (which is faster/nicer than DOS sort) and uniq are 
available in the gnuish MSDOS ports of gnu utilities by Thorsten
Ohl.  See wsmr-simtel20.army.mil,  pd1:<msdos.gnuish>.

Another alternative to uniq if you have awk (also available on 
simtel) is to use the following awk program:

{ if ($0 != last) print $0; last = $0 }
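To run it against an unsorted file you would sort first, so duplicates
become adjacent before the awk program sees them; a sketch (the sample
input is made up):

```shell
#!/bin/sh
# Sort so duplicates land next to each other, then let the
# one-line awk program pass only lines that differ from the
# previous one.
printf 'dog\ncat\ndog\ncat\n' |
    sort |
    awk '{ if ($0 != last) print $0; last = $0 }'
# prints: cat, dog (one per line)
```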

Of course it's almost as easy in any other language.  Awk is
a wonderful tool for this sort of thing.  For instance, the program

{ for (i = 1; i <= NF; ++i) print $i }

will print the words one per line.  It's not much harder to strip
punctuation, force everything to lower case, or, for that matter,
surround each word with quotes and comma-separate the list!
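A sketch of those extras, assuming a "new" awk (nawk or gawk) for
tolower() and gsub(), which the old awk lacks; the sample input is
made up:

```shell
#!/bin/sh
# Split each line into words, strip non-letters, lower-case,
# then emit the words quoted and comma-separated.
printf 'Hello, world!  Hello.\n' | awk '
{
    for (i = 1; i <= NF; ++i) {
        w = tolower($i)
        gsub(/[^a-z]/, "", w)       # strip punctuation and digits
        if (w != "") {
            printf "%s\"%s\"", sep, w
            sep = ","
        }
    }
}
END { printf "\n" }'
# prints "hello","world","hello"
```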

Gene

dw3w+@andrew.cmu.edu (Database Work) (06/15/91)

tony@disk.uucp was asking how to remove duplicate words from a word list.
In unix, this is trivial. Just do

sort < word.list | uniq > new.word.list

I know that the sort filter is included with MS-DOS, so all you really
need is the uniq program, for which I'm certain source must be
available somewhere, if someone hasn't ported it to DOS already.
Anyone know of a uniq for DOS?
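Failing a ready-made port, uniq's core behavior is small enough to
sketch in a few lines of portable shell (like the real thing, this
only drops *adjacent* duplicates, so sort the input first):

```shell
#!/bin/sh
# Minimal uniq sketch: print each line of stdin only when it
# differs from the line before it.
first=1
prev=
while IFS= read -r line; do
    if [ "$first" = 1 ] || [ "$line" != "$prev" ]; then
        printf '%s\n' "$line"
    fi
    first=0
    prev=$line
done
```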

Failing this, see if you can't get access to a unix system.  Good
luck.

Tod McQuillin
dw3w+devin@andrew.cmu.edu

smfst2@unix.cis.pitt.edu (Seth M Fuller) (06/19/91)

Mix Software of Texas has a set of Unix-like utilities for MSDOS.
I'm pretty sure an implementation of uniq is one of them. The
programs are very good and only $19.95.  Their number is 1-800-333-0330.
They even have an implementation of the Bourne shell.

Seth M. Fuller

w8sdz@rigel.acs.oakland.edu (Keith Petersen) (06/19/91)

WSMR-SIMTEL20.ARMY.MIL [192.88.110.20]

Directory PD1:<MSDOS.TXTUTL>
 Filename   Type Length   Date    Description
==============================================
UNEEK101.ZIP  B   14900  910614  Eliminate duplicate records in text files, w/C

This program is very fast.  It assumes that the input file has been
sorted (I use Vernon Buerg's SORTF).  The input file is renamed to
filename.BAK and the new file has the name of the original.

UNEEK is *much* faster than any version of "uniq" that I've tried.

Keith
--
Keith Petersen
Maintainer of the MSDOS, MISC and CP/M archives at SIMTEL20 [192.88.110.20]
Internet: w8sdz@WSMR-SIMTEL20.Army.Mil    or     w8sdz@vela.acs.oakland.edu
Uucp: uunet!umich!vela!w8sdz                          BITNET: w8sdz@OAKLAND