tim@ora.UUCP (Tim O'Reilly) (09/26/87)
Some time back, I responded to a query about ptx, and offered to mail instructions to anyone who was interested, or to post if demand warranted. After getting 30 or 40 queries, I was trying to decide whether that was enough interest to post, but then I managed to munge the file where I was saving the requests, so that kind of decided me.

I apologize if the treatment is a little long-winded for the net; it is adapted from an article I've written for Communixations, which was aimed at novices. Even as it is, I've cut it down a little. Anyway, here goes:

Generating Permuted Indexes with ptx

The first time someone new to UNIX takes a look at the front of the Reference Manual, he or she is likely to be surprised by the most unlikely-looking document: the ubiquitous permuted index. The index looks something like this:

                     maintainer  ar: archive and library ............ ar(1)
                            ar:  archive and library maintainer ..... ar(1)
                           time  at: execute commands at a later .... at(1)
            processing language  awk: pattern scanning and .......... awk(1)
                    at: execute  commands at a later time ........... at(1)
                            at:  execute commands at a later time ... at(1)
pattern scanning and processing  language awk: ...................... awk(1)
      at: execute commands at a  later time ......................... at(1)
                ar: archive and  library maintainer ................. ar(1)
        ar: archive and library  maintainer ......................... ar(1)
                  language awk:  pattern scanning and processing .... awk(1)
      awk: pattern scanning and  processing language ................ awk(1)
                   awk: pattern  scanning and processing language ... awk(1)
at: execute commands at a later  time ............................... at(1)

(This example actually shows a complete permuted index based on the three commands ar, at, and awk. This miniature index is used as an example throughout this article.)

Like the Reference Manual itself, the permuted index takes a little getting used to, but is fairly useful once that hurdle has been crossed.
To find the command you want, you simply scan down the middle of the page, looking for a keyword of interest on the right side of the blank gutter. Once you find the keyword you want, you can read (with contortions) the brief description of the command that makes up the entry. If things still look promising, you can look all the way over to the right for the name of the relevant command page.

The permuted index is no substitute for a real index, but, like the manual itself, it has survived because it is easy to maintain, if you know the secret of how it's done. An understanding of how the permuted index is created is especially useful if you are a technical writer and need to create one for your manual set, but it isn't a bad idea to become familiar with the process even if you're a casual user. Once you understand how ptx works and why it was done that way, you'll be much more forgiving of its peculiarities.

Let's start with the manual pages themselves. Typically, each man page is kept in its own file. On systems with online documentation, the files are kept in the directories /usr/man/manN, where N is a digit from 1 to 8 corresponding to the appropriate section of the manual. (Note that the later sections have been rearranged in some implementations of the UNIX documentation.) Each file also has the section number appended to its name. So, for example, the source file for the awk manual page would be kept in /usr/man/man1/awk.1.

Since man pages are added whenever a new command is written, it is desirable for the permuted index to be generated automatically from the man pages themselves. Each man page begins with lines listing the command name and a brief description. For example:

    .TH AWK 1 "UNIX Programmer's Manual"
    .SH NAME
    awk \- pattern scanning and processing language

The line following the .SH NAME macro is the one we want.
It is easy enough to extract only this line, simply by grepping for the \- which is the one consistent feature of this line. A simple script like the following will do the trick:

    rm -f /usr/tim/ptx.raw
    for x in `ls -d /usr/man/man[1-8]`
    do
        cd $x
        grep '^[a-z0-9][a-z0-9]*  *\\- ' * >> /usr/tim/ptx.raw
    done

The -d option to ls simply tells it to list the directory name, but not the contents. This list is then input to the for loop in the script. The pattern given to grep should select all lines beginning with one or more lowercase letters or digits, followed by one or more spaces and a hyphen escaped with a backslash. Because grep is given more than one file to search, it automatically prefixes each matching line with the name of the file the pattern was found in.

There is a slight chance that you may get a few extra lines containing the pattern, which you will have to edit out by hand, but in general, you'll get only the lines you want. You also need to watch out that * doesn't expand into a list of files that is too long for the shell to handle, but this should only be a problem if your system includes even more commands than usual. (There are ways to avoid both of these problems, but they involve using commands like awk or sed, and are not discussed here.)

The file ptx.raw will look like this:

    ar.1:ar \- archive and library maintainer
    at.1:at \- execute commands at a later time
    awk.1:awk \- pattern scanning and processing language

It is at this point that the ptx program itself comes in. Ptx will take each line in a file and permute it. That is, it will take each word, except words specified in the file /usr/lib/eign or a file you specify with the -i (ignore) option, and generate a line in which that word has been rotated to the front of the line. (For each input line, you therefore get as many output lines as there are potential keywords in the input line.) The output is then sorted, and then rotated again so that the keyword appears in the middle of the line.
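The first rotate-and-sort phase described above can be sketched with awk and sort. This is only an illustration of the idea, not how ptx itself is implemented: the function name rotate, the five-word ignore list (a tiny stand-in for /usr/lib/eign), and the '|' that marks where the original line began are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch of the rotate-and-sort idea behind ptx (not the real program).
# Assumptions: the ignore list is a small stand-in for /usr/lib/eign,
# and '|' marks the point where the original line began.
rotate() {
    awk '
    BEGIN {
        n = split("a and at the of", w, " ")
        for (i = 1; i <= n; i++) ignore[w[i]] = 1
    }
    {
        for (k = 1; k <= NF; k++) {
            if ($k in ignore) continue      # not a keyword: no output line
            out = ""
            for (j = k; j <= NF; j++)       # keyword and everything after it
                out = out $j " "
            out = out "|"
            for (j = 1; j < k; j++)         # then the words that preceded it
                out = out " " $j
            print out
        }
    }' | LC_ALL=C sort
}

printf '%s\n' 'at: execute commands at a later time' | rotate
```

For this one input line the sketch prints five lines, one per keyword (at:, commands, execute, later, time), because "at" and "a" are on the ignore list while "at:" with its colon is not. The second rotation, which centers the keyword, is left to ptx.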
The final output file contains troff macros of the form:

    .xx "tail" "before keyword" "keyword and after" "head"

The "before keyword" and "keyword and after" fields contain as much of the input line as will fit around the keyword; the remainder, if any, is placed in either the "head" or "tail" field, with at least one of these two fields left empty.

If the -r option to ptx is used, the first word on the initial input line will not be permuted, but will be placed in a fifth field at the end of the line, where it can be used as a reference identifier. In the case of the standard UNIX permuted index, this reference is not a page number, but the command name and manual section number in the form command(n) (e.g., awk(1)).

Before running ptx, it is desirable to transform the filename provided by grep into an identifier with this form. Typically, the hyphen separating the command name itself from the brief description is also converted to a colon. This is easy to do with sed:

    sed '
    s/^\([^:]*\)\.\([^.:]*\):/\1(\2) /
    s/ \\- /: /' ptx.raw > ptx.in

Our input file now looks like this:

    ar(1) ar: archive and library maintainer
    at(1) at: execute commands at a later time
    awk(1) awk: pattern scanning and processing language

We now run ptx with the -r option.
(In addition, if the index will be formatted with troff instead of nroff, we might want to use the -w option to set the line length to more than the default 72 characters.)

    ptx -r -w 80 ptx.in ptx.out

The output from ptx (assuming only the three lines shown in our input example above) looks like this:

    .xx "maintainer" "" "ar: archive and library" "" ar(1)
    .xx "" "ar:" "archive and library maintainer" "" ar(1)
    .xx "time" "" "at: execute commands at a later" "" at(1)
    .xx "processing language" "" "awk: pattern scanning and" "" awk(1)
    .xx "" "at: execute" "commands at a later time" "" at(1)
    .xx "" "at:" "execute commands at a later time" "" at(1)
    .xx "" "pattern scanning and processing" "language" "awk:" awk(1)
    .xx "" "at: execute commands at a" "later time" "" at(1)
    .xx "" "ar: archive and" "library maintainer" "" ar(1)
    .xx "" "ar: archive and library" "maintainer" "" ar(1)
    .xx "language" "awk:" "pattern scanning and processing" "" awk(1)
    .xx "" "awk: pattern scanning and" "processing language" "" awk(1)
    .xx "" "awk: pattern" "scanning and processing language" "" awk(1)
    .xx "" "at: execute commands at a later" "time" "" at(1)

The next question is how the devil you get from this output to the printable permuted index. There is no description of this essential detail on the man page for ptx!

If you're lucky, your system includes a small troff macro package called mptx that will do the job for you. (It includes only a definition of the .xx macro.) If you're unlucky, you need to write a definition of the .xx macro yourself. Check to see if the file /usr/lib/tmac/tmac.ptx exists. If it does, you're in luck.
If it doesn't, here's a definition you can use:

    .tr \" Translate to unpaddable space
    .nr )y \n(.lu-.65i \" Store line length less .65 inches in register )y
    .nr )x \n()yu/2u \" Store half of register )y into register )x
    .de xx \" Begin definition of macro xx
    .ds s1 \" Initialize inter-field spacing string s1 to zero
    .if \w'\\$2' .ds s1 \| \" If 2nd arg exists, assign space to s1
    .ds s2 \" Initialize inter-field spacing string s2 to zero
    .if \w'\\$4' .ds s2 \| \" If 4th arg exists, assign space to s2
    .ta \\n()yu-\w' 'u \" Set tab stop at register )y, less width of a space
    .\" Next line: move to center of page (reg )x) less the width
    .\" of the first and second args, plus associated spaces
    .\" Then print the args and spaces, followed by the third and
    .\" fourth args. Follow with a leader out to the tab, and the 5th arg
    \h'\\n()xu-\w'\\$1\\*(s1\\$2 'u'\\$1\\*(s1\\$2 \\$3\\*(s2\\$4 \a \\$5
    ..
    .nf \" Use no fill mode so tabs work correctly
    .\" If using troff, add font and size specifications as well.

If the macro definition file exists, you can use nroff or troff with the -mptx option. Otherwise, either install your own definition in /usr/lib/tmac (as tmac.ptx) or prepend the macro definition to the index file, for instance by naming a file that contains it (ptxmac in the second command below) ahead of the index file on the command line. Since the ptx macro package doesn't do even rudimentary page formatting, you probably want to use it in conjunction with another macro package, such as -ms. Then format, using either:

    nroff -ms -mptx ptx.out | yourspooler

or

    nroff -ms ptxmac ptx.out | yourspooler

The final output should look like the example shown at the start of this article (allowing for differences between nroff and troff, of course).

If you are really in the business of producing a permuted index, you could combine all of the above commands into a single script. (A full script could use a variety of mechanisms to weed out or avoid extra lines, as well as doing other error checking.) Generating a permuted index is then a matter of typing a single command.

It's too bad no one ever wrote down how to do it before!
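As a rough illustration of such a combined script, the steps above might be wrapped in shell functions along these lines. This is a minimal sketch, not a finished tool: the function names extract_names, make_refs, and build_index are invented for the example, the man directory and output file are passed as arguments rather than hard-coded, and the ptx call assumes the traditional "ptx -r input output" form used in this article (other ptx implementations may need different options).

```shell
#!/bin/sh
# Sketch only: function names and argument conventions are assumptions
# made for this example, not part of any standard tool.

# Print the NAME line of every page under DIR/man1 ... DIR/man8,
# prefixed by its file name (e.g. "awk.1:awk \- pattern scanning ...").
# /dev/null makes grep print the file name even for a one-file directory.
extract_names() {
    for dir in "$1"/man[1-8]; do
        [ -d "$dir" ] || continue
        (cd "$dir" && grep '^[a-z0-9][a-z0-9]*  *\\- ' /dev/null *)
    done
}

# Turn "awk.1:awk \- text" into "awk(1) awk: text".
make_refs() {
    sed -e 's/^\([^:]*\)\.\([^.:]*\):/\1(\2) /' -e 's/ \\- /: /'
}

# build_index MANDIR OUTFILE: run the whole pipeline and leave the
# .xx macro calls in OUTFILE, ready for nroff or troff.
build_index() {
    extract_names "$1" | make_refs > ptx.in &&
    ptx -r ptx.in "$2"
}
```

Under those assumptions, "build_index /usr/man ptx.out" followed by one of the nroff commands above would regenerate the index. Stray lines matched by the grep pattern still have to be weeded out by hand or by an extra filter between extract_names and make_refs.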
Tim O'Reilly is president of O'Reilly & Associates, Inc., a consulting firm that writes manuals and training materials designed to help people get more out of their computers. You can reach him at O'Reilly & Associates, Inc., 981 Chestnut Street, Newton, MA 02164; (617) 527-4210; or uunet!ora!tim.

--
Tim O'Reilly (617) 527-4210
O'Reilly & Associates, Inc., Publishers of Nutshell Handbooks
981 Chestnut Street, Newton, MA 02164
UUCP: uunet!ora!tim ARPA: tim@ora.uu.net