des0mpw@colman.newcastle.ac.uk (M.P. Ward) (02/08/91)
Some time ago I presented the results of using perl on a wordlist to look for various `interesting' words and was mildly chastised for not giving the code. I think it makes an interesting exercise, especially illustrating the use of associative arrays. So here they are: Some things you can do with grep/egrep since you are just matching each word against a regular expression eg: Words with two equal components: egrep '^(.*)\1$' wordlist One each of all vowels (including y) in order: (I have split this line) grep '^[^aeiouy]*a[^aeiouy]*e[^aeiouy]*\ i[^aeiouy]*o[^aeiouy]*u[^aeiouy]*y[^aeiouy]*$' But most require more powerful tools - and perl has them all: Note the idiom: join('',sort(split(//,$word))) which returns a string containing all the characters in $word sorted in order: ------------------------------------ # find words with letters in alphabetical order # Usage: alpha file # Only interested in fairly long words (at least 6 letters): open(WORDS,"grep '......' $ARGV[0] |"); while ($word = <WORDS>) { chop($word); # remove the newline character # print word if word = sortword: print "$word\n" if ($word eq join('',sort(split(//,$word)))); } ----------------------------------- # Find all anagram words # Usage: anagrams wordlist open(WORDS,"$ARGV[0]"); while ($word = <WORDS>) { chop($word); # record each word under its sorted letters: $sortword = join('',sort(split(//,$word))); $words{$sortword} .= "$word "; } foreach $sortword (keys %words) { # if a set of sorted letters has two or more words stored then they # are anagrams, so print them: $list = $words{$sortword}; # count no. of spaces in list (== no of words): if (($list =~ tr/ / /) >= 2) { chop($list); print "$list\n"; } } ----------------------------------- # single anagram finder # Usage: anag letters wordlist $input = join('',sort(split(//,$ARGV[0]))); $l = length($input); # pick words with all letters in given word: open(WORDS,"grep '^[$input]*\$' $ARGV[1] |"); while (<WORDS>) { chop; next if (length != $l); # must have right length $sortword = join('',sort(split(//))); if ($input eq $sortword) { print "$_\n"; } } ----------------------------------- # find the last few words in an anagram dictionary: # Usage: last-anag file # Only interested in words which don't contain # `early' letters of the alphabet: open(WORDS,"grep -v '[a-s]' $ARGV[0] |"); while ($word = <WORDS>) { chop ($word); # store word under its sorted letters: $words{join('',sort(split(//,$word)))} .= $word . " "; } # print results sorted: foreach $sortword (sort(keys %words)) { print "$sortword: $words{$sortword}\n"; } ----------------------------------- # find words with no letters repeated # Usage: no-repeats file # Only interested in fairly long words (at least 12 letters): open(WORDS,"grep '............' $ARGV[0] |"); while ($word = <WORDS>) { chop($word); $sortword = join('',sort(split(//,$word))); print "$word\n" unless ($sortword =~ /(.)\1/); } ----------------------------------- # ``reversable words'' finder # Usage: revword file # Only interested in fairly long words (at least 6 letters): open(WORDS,"grep '......' $ARGV[0] |"); while (<WORDS>) { chop; # print word if reversed word has been found: print "$_\n" if ($words{reverse} == 1); $words{$_} = 1; # Reverse the order of the last two lines if you want to list # palindrome words as well. } ----------------------------------- I hope you find this of some interest/amusement. Finally I noticed that the anagrams program (which lists all anagrams in the word list) eventually required over 17 Mbytes of memory to analyse a 650k wordlist. This seems to be rather excessive since each word is stored at most twice (as a key and as an array entry). Martin. print "Just another ";open(P,"/usr/dict/words");@_= <P>; grep(((/^Perl.$/)&&(tr/P/p/,chop,chop,print)),@_);print " hacker"; JANET: Martin.Ward@uk.ac.durham Internet (eg US): Martin.Ward@DURHAM.AC.UK or if that fails: Martin.Ward%uk.ac.durham@nfsnet-relay.ac.uk or even: Martin.Ward%DURHAM.AC.UK@CUNYVM.CUNY.EDU BITNET: IN%"Martin.Ward@DURHAM.AC.UK" UUCP:...!mcvax!ukc!durham!Martin.Ward