[comp.unix.questions] Permuted Indexing

tim@ora.UUCP (Tim O'Reilly) (09/26/87)

Some time back, I responded to a query about ptx, and
offered to mail instructions to anyone who was interested,
or to post if demand warranted.  After getting 30 or 40
queries, I was trying to decide whether or not that was
enough interest to post, but then I managed to munge the
file where I was saving the requests, so that kind of
decided me.

I apologize if the treatment is a little long-winded for the
net; it is adapted from an article I've written for
Communixations, which was aimed at novices.  Even as it is,
I've cut it down a little.  Anyway, here goes:


Generating Permuted Indexes with ptx

The first time someone new to UNIX takes a look at the front of the
Reference Manual, he or she is likely to be surprised by the most unlikely
looking document:  the ubiquitous permuted index.  The index looks something
like this:

                     maintainer   ar: archive and library ......... ar(1)
                            ar:   archive and library maintainer .. ar(1)
                           time   at: execute commands at a later . at(1)
            processing language   awk: pattern scanning and ....... awk(1)
                    at: execute   commands at a later time ........ at(1)
                            at:   execute commands at a later time  at(1)
pattern scanning and processing   language   awk: ................. awk(1)
      at: execute commands at a   later time ...................... at(1)
                ar: archive and   library maintainer .............. ar(1)
        ar: archive and library   maintainer ...................... ar(1)
                language   awk:   pattern scanning and processing . awk(1)
      awk: pattern scanning and   processing language ............. awk(1)
                   awk: pattern   scanning and processing language  awk(1)
at: execute commands at a later   time ............................ at(1)

(This example actually shows a complete permuted index based on the three
commands ar, at, and awk.  This miniature index is used as an example
throughout this article.)

Like the Reference Manual itself, the permuted index takes a little getting
used to, but is fairly useful once that hurdle has been crossed.  To find
the command you want, you simply scan down the middle of the page, looking
for a keyword of interest on the right side of the blank gutter. Once you
find the keyword you want, you can read (with contortions) the brief
description of the command that makes up the entry.  If things still look
promising, you can look all the way over to the right for the name of the
relevant command page.  The permuted index is no substitute for a real
index, but...like the manual itself, it has survived because it is easy to
maintain--if you know the secret of how it's done.

An understanding of how the permuted index is created is especially useful
if you are a technical writer and need to create one for your manual set,
but it isn't a bad idea to become familiar with the process even if you're
a casual user.  Once you understand how ptx works and why it was done that
way, you'll be much more forgiving of its peculiarities.

Let's start with the manual pages themselves.  Typically, each man page is
kept in its own file.  On systems with online documentation, the files are
kept in the directories /usr/man/manN, where N is a digit from 1 to 8
corresponding to the appropriate section of the manual.  (Note that the later
sections have been rearranged in some implementations of the UNIX
documentation.)  Each file also has the section number appended to its name.
So, for example, the source file for the awk manual page would be kept in
the file /usr/man/man1/awk.1.

Since man pages are added whenever a new command is written, it is desir-
able for the permuted index to be generated automatically from the man
pages themselves.  Each man page begins with lines listing the command name
and a brief description.  For example:

     .TH AWK 1 "UNIX Programmer's Manual"
     .SH NAME
     awk \- pattern scanning and processing language

The line following the .SH NAME macro is the one we want.  It is easy
enough to extract only this line, simply by grepping for the \- which is
the one consistent feature of this line.  A simple script like the follow-
ing will do the trick:

     rm -f /usr/tim/ptx.raw      # start from an empty output file
     for x in `ls -d /usr/man/man[1-8]`
     do
         cd $x
         # NAME lines look like:  awk \- pattern scanning ...
         grep '^[a-z0-9][a-z0-9]*  *\\- ' * >> /usr/tim/ptx.raw
     done

The -d option to ls simply tells it to list the directory name, but not the
contents.  This list is then input to the for loop in the script.  The
pattern given to grep should select all lines beginning with one or more 
lowercase letters or digits, followed by one or more spaces and a hyphen 
escaped with a backslash.  Grep will automatically include the file name 
the pattern was found in.  There is a slight chance that you may
get a few extra lines containing the pattern, which you will have to
edit out by hand, but in general, you'll get only the lines you want.  You
also need to watch out that * doesn't expand into a list of files that is
too long for the shell to handle--but this should only be a problem if your
system includes even more commands than usual.  (There are ways to avoid
both of these problems, but they involve using commands like
awk or sed, and are not discussed here.) 
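
For the curious, here is one way the awk route might look.  This is only
a sketch for a POSIX shell, not the script used for the article:
extract_names is a made-up helper name, and the sample pages are created
in a scratch directory rather than read from /usr/man.  It prints just the
line that follows .SH NAME in each file, so stray matches elsewhere in a
page never show up, and it hands files to awk one at a time, so the shell
never has to expand a long * list.

```shell
#!/bin/sh
# Sketch only: extract_names is a hypothetical helper, not a standard command.
extract_names() {
    # $1 is the top of a manual tree containing man1 ... man8
    find "$1" -type f -name '*.[1-8]' -print | LC_ALL=C sort |
    while read -r f
    do
        # Emit "file:line" for the first line after .SH NAME, then stop.
        # This is the same layout grep gives when it names the file.
        awk -v name="$(basename "$f")" '
            found        { print name ":" $0; exit }
            /^\.SH NAME/ { found = 1 }
        ' "$f"
    done
}

# Try it on a scratch tree holding two made-up pages.
tree=$(mktemp -d)
mkdir "$tree/man1"
printf '%s\n' '.TH AR 1' '.SH NAME' \
    'ar \- archive and library maintainer' > "$tree/man1/ar.1"
printf '%s\n' '.TH AWK 1' '.SH NAME' \
    'awk \- pattern scanning and processing language' \
    '.SH BUGS' 'xyz \- a stray line that grep alone would also match' \
    > "$tree/man1/awk.1"
raw=$(extract_names "$tree")
rm -rf "$tree"
printf '%s\n' "$raw"
```

The stray line under .SH BUGS is there on purpose:  the grep pattern would
pick it up, while this version skips it.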

The file ptx.raw will look like this:

     ar.1:ar \- archive and library maintainer
     at.1:at \- execute commands at a later time
     awk.1:awk \- pattern scanning and processing language

It is at this point that the ptx program itself comes in.  Ptx will take
each line in a file and permute it.  That is, it will take each word,
except words specified in the file /usr/lib/eign or a file you specify with
the -i (ignore) option, and generate a line in which that word has been
rotated to the front of the line.  (For each input line, you therefore get
as many output lines as there are potential keywords in the input line.)
The output is then sorted, and then rotated again so that the keyword
appears in the middle of the line.  The final output file contains troff
macros of the form:

     .xx "tail" "before keyword" "keyword and after" "head"

The "before keyword" and "keyword and after" fields contain as much of the
input line as will fit around the keyword; the remainder, if any, is placed
in either the "head" or "tail" field, with at least one of these two fields
left empty.  If the -r option to ptx is used, the first word on the initial
input line will not be permuted, but will be placed in a fifth field at the
end of the line, where it can be used as a reference identifier.
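
The rotate-and-sort stage is easier to picture with a toy version in hand.
The following is a rough sketch in awk for a POSIX shell, not the real ptx
algorithm:  rotate is a made-up helper, the ignore list is passed as an
argument instead of being read from /usr/lib/eign, and the final step that
swings the keyword back to the middle of the line is left out.

```shell
#!/bin/sh
# Sketch only: rotate is a hypothetical helper that imitates the first
# stage of ptx (one rotation per keyword, then a sort).
rotate() {
    # $1 is a space-separated list of words to ignore
    awk -v ignore="$1" '
    BEGIN {
        n = split(ignore, w, " ")
        for (i = 1; i <= n; i++) skip[w[i]] = 1
    }
    {
        for (i = 1; i <= NF; i++) {
            if (($i) in skip) continue      # not a keyword
            out = $i                        # keyword goes first
            for (j = i + 1; j <= NF; j++) out = out " " $j
            for (j = 1; j < i; j++)       out = out " " $j
            print out
        }
    }' | LC_ALL=C sort
}

rotations=$(printf '%s\n' 'at: execute commands at a later time' |
    rotate 'a and at')
printf '%s\n' "$rotations"
```

Each output line starts with one keyword, followed by the rest of the input
line wrapped around; sorting those lines is what puts the index in keyword
order.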

In the case of the standard UNIX permuted index, this reference identifier is
not a page number, but the command name and manual section number in the
form command(n) (e.g., awk(1)).  Before running ptx, it is desirable to
transform the filename provided by grep into an identifier with this form.
Typically, the hyphen separating the command name itself from the brief
description is also converted to a colon.  This is easy to do with sed.

     sed '
             s/\(.*\)\.\(.*\):/\1(\2) /
             s/ \\- /: /' ptx.raw > ptx.in

Our input file now looks like this:

     ar(1) ar: archive and library maintainer
     at(1) at: execute commands at a later time
     awk(1) awk: pattern scanning and processing language
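
One caution:  the .* patterns are greedy, so a description that itself
contains a colon or a period can confuse the first substitution.  A
slightly more defensive variant might look like the sketch below.  It
assumes a POSIX shell and sed; to_ptx_input is a made-up name, and the
second sample line is invented purely to show the awkward case.

```shell
#!/bin/sh
# Sketch only: to_ptx_input is a hypothetical helper.  The first pattern is
# anchored and stops at the first colon, so punctuation in the description
# is left alone.  It also swallows any spaces after that colon.
to_ptx_input() {
    sed -e 's/^\([^:]*\)\.\([^.:]*\): */\1(\2) /' \
        -e 's/ \\- /: /'
}

converted=$(printf '%s\n' \
    'awk.1:awk \- pattern scanning and processing language' \
    'foo.1:foo \- frobnicate files: see also bar, etc.' | to_ptx_input)
printf '%s\n' "$converted"
```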

We now run ptx with the -r option.  If the index will be formatted with
troff instead of nroff, we might also want to use the -w option to set the
line length to more than the default 72 characters:

     ptx -r -w 80 ptx.in ptx.out

The output from ptx (assuming only the three lines shown in our input exam-
ple above) looks like this:

     .xx "maintainer" "" "ar: archive and library" "" ar(1)
     .xx "" "ar:" "archive and library maintainer" "" ar(1)
     .xx "time" "" "at: execute commands at a later" "" at(1)
     .xx "processing language" "" "awk: pattern scanning and" "" awk(1)
     .xx "" "at: execute" "commands at a later time" "" at(1)
     .xx "" "at:" "execute commands at a later time" "" at(1)
     .xx "" "pattern scanning and processing" "language" "awk:" awk(1)
     .xx "" "at: execute commands at a" "later time" "" at(1)
     .xx "" "ar: archive and" "library maintainer" "" ar(1)
     .xx "" "ar: archive and library" "maintainer" "" ar(1)
     .xx "language" "awk:" "pattern scanning and processing" "" awk(1)
     .xx "" "awk: pattern scanning and" "processing language" "" awk(1)
     .xx "" "awk: pattern" "scanning and processing language" "" awk(1)
     .xx "" "at: execute commands at a later" "time" "" at(1)

The next question is how the devil you get from this output to the print-
able permuted index.  There is no description of this essential detail on
the man page for ptx!

If you're lucky, your system includes a small troff macro package called
mptx that will do the job for you.  (It includes only a definition of the
.xx macro.)  If you're unlucky, you need to write a definition of the .xx
macro yourself.  Check to see if the file /usr/lib/tmac/tmac.ptx exists.
If it does, you're in luck.  If it doesn't, here's a definition you can
use:

     .tr ~               \" Translate ~ to unpaddable space
     .nr )y \n(.lu-.65i  \" Store line length less .65 inches in register )y
     .nr )x \n()yu/2u    \" Store half of register )y into register )x
     .de xx              \" Begin definition of macro xx
     .ds s1              \" Initialize inter-field spacing string s1 to zero
     .if \w'\\$2' .ds s1 ~\|  \" If 2nd arg exists, assign space to s1
     .ds s2              \" Initialize inter-field spacing string s2 to zero
     .if \w'\\$4' .ds s2 ~\|  \" If 4th arg exists, assign space to s2
     .ta \\n()yu-\w'~'u  \" Set tab stop at register )y, less width of a space
     .\" Next line:  move to center of page (reg )x) less the width
     .\" of the first and second args, plus associated spaces
     .\" Then print the args and spaces, followed by the third and
     .\" fourth args.  Follow with a leader out to the tab, and the 5th arg
     \h'\\n()xu-\w'\\$1\\*(s1\\$2~~~'u'\\$1\\*(s1\\$2~~~\\$3\\*(s2\\$4~\a~\\$5
     ..
     .nf \" Use no fill mode so tabs work correctly
     .\" If using troff, add font and size specifications as well.

If the macro definition file exists, you can use nroff or troff with the
-mptx option.  Otherwise, either install your own definition in
/usr/lib/tmac (as tmac.ptx), or keep it in a file of its own (called ptxmac
below) and name that file ahead of the index file on the command line.
Since the ptx macro package doesn't do even rudimentary page formatting, you
probably want to use it in conjunction with another macro package, such as
-ms.  Then format, using either:

     nroff -ms -mptx ptx.out | yourspooler

or

     nroff -ms ptxmac ptx.out | yourspooler

The final output should look like the example shown at the start of this
article (allowing for differences between nroff and troff, of course).
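
If you just want to eyeball the result without firing up a formatter, the
layout that .xx performs can be approximated in plain text.  What follows
is a rough sketch for a POSIX shell and awk, not a replacement for the
macro:  render_xx is a made-up helper, and the column numbers (31 and 68)
are simply assumptions picked to match the sample index at the start of
this article.

```shell
#!/bin/sh
# Sketch only: render_xx is a hypothetical helper that lays out .xx lines
# in fixed-width text.  Assumed geometry: the left part ends at column 31
# and the reference starts after column 68.
render_xx() {
    awk -v mid=31 -v tab=68 '
    # rep(c, n): return the string c repeated n times (empty if n <= 0)
    function rep(c, n,    s) { s = ""; while (n-- > 0) s = s c; return s }
    /^\.xx / {
        split($0, q, "\"")          # even-numbered pieces are the fields
        left  = q[2] (q[4] != "" ? " " : "") q[4]
        right = q[6] (q[8] != "" ? " " : "") q[8]
        ref = q[9]; sub(/^ +/, "", ref)     # fifth field, from ptx -r
        # Right-justify the left part, add the gutter, then the right part.
        line = rep(" ", mid - length(left)) left "   " right " "
        # Fill with a dot leader out to the tab column.
        line = line rep(".", tab - length(line) - 1) " "
        print line ref
    }'
}

rendered=$(printf '%s\n' \
    '.xx "maintainer" "" "ar: archive and library" "" ar(1)' \
    '.xx "" "at:" "execute commands at a later time" "" at(1)' | render_xx)
printf '%s\n' "$rendered"
```

Like the macro, the sketch right-justifies the first two fields against the
gutter, prints the third and fourth after it, and runs a leader out to the
reference.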

If you are really in the business of producing a permuted index, you could
combine all of the above commands into a single script.  (A full script
could use a variety of mechanisms to weed out or avoid extra lines, as well
as doing other error checking.)  Generating a permuted index is then a
matter of typing a single command.  It's too bad no one ever wrote down how
to do it before!


Tim O'Reilly is president of O'Reilly & Associates, Inc., a consulting firm
that writes manuals and training materials designed to help people get more
out of their computers.  You can reach him at O'Reilly & Associates, Inc.,
981 Chestnut Street, Newton, MA 02164; (617) 527-4210; or uunet!ora!tim.
-- 
Tim O'Reilly (617) 527-4210
O'Reilly & Associates, Inc., Publishers of Nutshell Handbooks
981 Chestnut Street, Newton, MA 02164
UUCP:	uunet!ora!tim      ARPA:   tim@ora.uu.net