[comp.lang.perl] Longest word composed of Unix commands

sahayman@iuvax.cs.indiana.edu (Steve Hayman) (12/13/90)

Have you ever wondered what the longest word is that can be spelled
with consecutive Unix commands?  (i.e. "fingertip" = "finger" + "tip")
You have?  Well, stop worrying.  Here's a script that will find them
by seeing which words in /usr/dict/words can be spelled via
combinations of commands in /bin:/usr/bin:/usr/ucb 

ok ok it's a dumb script, but how many of you knew that
"testicular" can be spelled with standard Ultrix commands?
I ran this on an Ultrix system and the longest words
produced were

    watershed
    prescript
    fingertip
    predicate
    extricate
    flintlock
    printmake
    collinear
    manometric
    manuscript
    communique
    testicular
    clearheaded
    fingerprint

don't waste too much time running this ...

#!/usr/bin/perl
# unixword
# find the words in /usr/dict/words that can be constructed
# out of unix commands.  sort by length.
# when I run this on our ultrix machine, the longest words I get are
#	manometric
#	manuscript
#	communique
#	testicular
#	clearheaded
#	fingerprint
#
# OK it's a silly script. don't waste too much time running it.
# (it may take a few minutes)
# steve hayman
# dec 12/1990

@dirs = ( '/bin', '/usr/bin', '/usr/ucb');
$wordlist = '/usr/dict/words';

# step 1: get a list of file names in the various directories

foreach $dir ( @dirs ) {
    opendir(DIR, $dir) || die "Can't opendir $dir: $!";
    push(@files, readdir(DIR));
    closedir(DIR);
}

# step 2: protect metacharacters like '.' or '[' which
# can occur in the file names 

foreach $f ( @files ) {
    # backslash every non-word character, not just '.' and '['
    $f =~ s/\W/\\$&/g;
}

# step 3: construct a suitable regular expression matching
# all these filenames

$re = '^(' . join("|", @files) .  ')+$' ;

# step 4: match the dictionary file against this pattern; store words that
# match the pattern - assoc. array indexed by word, containing word len.

open(DICT, $wordlist) || die "Can't open $wordlist: $!";

while ( <DICT> ) {
    chop;
    $len{$_} = length if /$re/io;
}

# step 5: print word list in order of length

foreach $word ( sort lengthwise keys %len ) {
    print "$word\n";
}


sub lengthwise {
    $len{$a} - $len{$b};
}
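
The regex trick in steps 2-4 can be tried without an Ultrix box. What follows is only a minimal sketch in present-day Perl, not the script as posted: the command list and the word list are made-up stand-ins for the directory listing and /usr/dict/words, and quotemeta() is swapped in for the hand-written escaping of step 2.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the directory listing and the dictionary.
my @commands = qw(finger tip man script c++ ls);
my @words    = qw(fingertip manuscript Fingertip lsx tipc++ banana);

# quotemeta() backslashes every non-word character, so a name such as
# "c++" is matched literally instead of being read as a quantifier.
my $alternatives = join '|', map { quotemeta($_) } @commands;

# One or more commands back to back, covering the whole word.
my $re = qr/^(?:$alternatives)+$/i;

# Keep only the words that are made up entirely of commands.
my @hits = grep { /$re/ } @words;

# Shortest first, as in the script above; ties broken alphabetically.
for my $word (sort { length($a) <=> length($b) or $a cmp $b } @hits) {
    print "$word\n";
}
```

On this toy list it prints tipc++, Fingertip, fingertip and manuscript, one per line; "lsx" and "banana" are rejected because some part of them is not a command.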

tchrist@convex.COM (Tom Christiansen) (12/13/90)

sahayman@iuvax.cs.indiana.edu (Steve Hayman) writes:

:I ran this on an Ultrix system and the longest words
:produced were
:    watershed
:    prescript
:    fingertip
:    predicate
:    extricate
:    flintlock
:    printmake
:    collinear
:    manometric
:    manuscript
:    communique
:    testicular
:    clearheaded
:    fingerprint
:don't waste too much time running this ...

Neat.

But don't you know that telling someone not to do something is the best way
to get it to happen? :-)

This one took ~600 CPU seconds on a 2.5 megabyte dictionary.

--tom

<<<8 CHARS>>>
arcuated
arvicole
asellate
assorted
calendar
calfkill
Colville
diffused
dullhead
dulseman
educated
educatee
errorful
excalate
excudate
expanded
extrared
exuviate
eyestalk
farewell
fattrels
feedhead
fingered
flathead
flatware
flatweed
flockman
Fulfulde
headlock
headwall
headwear
hostname
indented
indentee
killcalf
killweed
makefile
mandatee
Manville
morefold
prateful
predwell
preprint
preshare
presumed
pretreat
producal
revulsed
shadbush
shareman
shearman
shelfful
shellful
shellman
sleepful
spellful
sufflate
suffused
tailhead
tartrate
ultrared
unmassed
unmeated
unmudded
unmulled
untalked
viduated
wellhead

<<<9 CHARS>>>
clearcole
clearweed
commodate
compacted
cucullate
exululate
fatheaded
fingertip
flintlock
manducate
preassume
predefeat
predetail
predetest
prescript
printmake
sheartail
shellhead
shepstare
splittail
strippage
subcellar
sulfatase
tartrated
timeshare
uncompact
unicelled
watershed
windowful
windowman

<<<10 CHARS>>>
astipulate
communique
exsufflate
extipulate
loggerhead
manuscript
prediction
prefearful
preinstall
proflogger
sheepshead
stringsman
tartarated
unexpanded
unicellate
unmodelled

<<<11 CHARS>>>
clearheaded
fingerprint
printscript
splitfinger
subcultrate
uncompacted
unicellular
unmeditated
unmodulated

<<<12 CHARS>>>
killeekillee
loggerheaded
unmedullated
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
"With a kernel dive, all things are possible, but it sure makes it hard
 to look at yourself in the mirror the next morning."  -me

maart@cs.vu.nl (Maarten Litmaath) (12/15/90)

In article <77888@iuvax.cs.indiana.edu>,
	sahayman@iuvax.cs.indiana.edu (Steve Hayman) writes:
)
)Have you ever wondered what the longest word is that can be spelled
)with consecutive Unix commands?  (i.e. "fingertip" = "finger" + "tip")
)You have?  Well, stop worrying.  Here's a script that will find them
)by seeing which words in /usr/dict/words can be spelled via
)combinations of commands in /bin:/usr/bin:/usr/ucb 

Nice indeed, Steve!
I've changed your script in a few ways, though:

	- /etc and /usr/etc are now searched too, which leads to the
	  next change
	- each entry is checked to see if it's really an executable
	- duplicate entries (from different directories) are removed

	and most importantly

	- it's shown HOW each word can be broken up into UNIX commands!

Some words have more than 1 `representation' in the `UNIX vector space'.
Example:
	view
	vi-e-w

I don't have much experience with Perl yet, so my version of the script
may be improved too.
Here's some output:

	Ac-ta-e-on
	Cal-cut-ta
	Sh-ar-on
	Sh-e-ld-on
	Wall-ac-e
	W-ar-sa-w
	ac-comm-od-at-e
	as-sum-e
	as-tr-id-e
	clear-head-ed
	col-line-ar
	e-du-cat-e
	e-man-at-e
	enroll-e-e
	ex-e-cut-e
	id-e-at-e
	man-at-e-e
	on-e-time
	pr-e-sum-e
	pr-e-tty
	refer-e-e
	sed-at-e
	su-cc-e-ed
	test-at-e
	time-sh-ar-e
	tr-e-as-on
	w-ar-head
	w-ar-time
	w-at-e-rsh-ed

Here's the new script:

--------------------cut here--------------------
#!/usr/local/bin/perl
# unixword v2.0
# find the words in /usr/dict/words that can be constructed
# out of unix commands.  sort alphabetically.
# show how each word can be constructed from which commands.
# /etc and /usr/etc are now searched too.
#
# v1.0 by steve hayman, dec 12/1990
# v2.0 by maarten litmaath, dec 15/1990

@dirs = ( '/bin', '/usr/bin', '/usr/ucb', '/etc', '/usr/etc');
$wordlist = '/usr/dict/words';

# step 1: get a list of executables in the various directories
# step 2: leave out all entries containing non-alphabetic characters
# use an associative array to get rid of duplicate entries

foreach $dir ( @dirs ) {
    opendir(DIR, $dir) || die "Can't opendir $dir: $!";
    foreach $f (readdir(DIR)) {
	$ent = $dir . '/' . $f;
	if ($f !~ /\W|_|\d/ && -x $ent && ! -d $ent) {
	    $files{$f} = 0;
	}
    }
    closedir(DIR);
}

@files = keys(%files);

# step 3: construct a suitable regular expression matching
# all these filenames

$re = '^(' . join("|", @files) .  ')+$' ;

# step 4: match the dictionary file against this pattern; store words that
# match the pattern - assoc. array indexed by word, containing word len.

open(DICT, $wordlist) || die "Can't open $wordlist: $!";

while ( <DICT> ) {
    chop;
    $len{$_} = length if /$re/io;
}

# breakup() returns an array of all possible `breakups' of its argument
# example for `abcd':
# a-b-c-d
# a-b-cd
# a-bc-d
# a-bcd
# ab-c-d
# ab-cd
# abc-d
# abcd

sub breakup {
	local($word) = @_;
	local(@L) = 1 .. length($word) - 1;
	local(@ans, @sufs, $pre, $prelen, $suf);

	for $prelen (@L) {
		$pre = substr($word, 0, $prelen);
		@sufs = &breakup(substr($word, $prelen));
		foreach $suf (@sufs) {
			push(@ans, $pre . '-' . $suf);
		}
	}
	push(@ans, $word);
	@ans;
}

$brkupre = '^(-' . join("|-", @files) .  ')+$' ;

# step 5: print word list alphabetically, show how each word can be
# broken up

foreach $word ( sort keys %len ) {
	@tries = &breakup($word);
	foreach $try (@tries) {
		print "$try\n" if "-$try" =~ /$brkupre/io;
	}
}
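
breakup() builds all 2^(n-1) ways of cutting an n-letter word and only then filters them, so a 12-letter word means 2048 candidate strings. One possible alternative, sketched here in present-day Perl, is to recurse only on prefixes that are themselves commands. The names %is_cmd and splits() are made up for this sketch and are not part of v2.0; the tiny command set is an assumption, and the real script would fill the hash from @files.

```perl
use strict;
use warnings;

# Hypothetical command set, keyed in lower case for case-insensitive lookup.
my %is_cmd;
$is_cmd{lc $_} = 1 for qw(vi e w view);

# splits($word) returns every way to write $word as commands joined by '-'.
# Only prefixes that are commands are followed, so dead ends are cut early.
sub splits {
    my ($word) = @_;
    return ('') if $word eq '';    # exactly one way to split nothing
    my @ans;
    for my $len (1 .. length($word)) {
        my $pre = substr($word, 0, $len);
        next unless $is_cmd{lc $pre};
        for my $rest (splits(substr($word, $len))) {
            push @ans, $rest eq '' ? $pre : $pre . '-' . $rest;
        }
    }
    return @ans;
}

print "$_\n" for splits('view');
```

With this command set the call prints "vi-e-w" and "view". A word that cannot be built from commands gives an empty list, so the separate match against $re in step 4 would no longer be needed to weed it out.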
--
In the Bourne shell syntax tabs and spaces are equivalent almost everywhere.
The exception: _indented_ here documents.  :-(
Does anyone remember the famous mistake Makefile-novices often make?