[comp.sources.d] Bug in ispell / ispell enhancements

bobm@rtech.UUCP (01/24/87)

(I sent out this article in net.sources.bugs while I was reading the
discussion there.  I forgot to cross-post to this group, as I had
intended.  Sorry for the duplication.)

First of all, ispell is a slick program.  I've been using a shell
script dialogue tying together grope / spell / grep to do this for
years.  It's much nicer this way (my script built an edit script to
run over the file as it went along - a real kludge).

I had to fix the short -> int tpgrp bug in the signal handling to get it
to run (on a Pyramid).  Then somebody tried the -l option and it
crashed.  Go through the "checkfile()" routine in ispell.c.  There
are a couple of "putc" calls that need "if (!lflag)" checks put on
them to avoid output to an uninitialized file pointer.
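
A minimal sketch of the kind of guard meant here.  The helper name
copy_char() and its arguments are hypothetical and are not taken from
ispell.c; only the "if (!lflag)" idea comes from the fix described above.

```c
#include <stdio.h>

/* Hypothetical helper: copy one character to the corrected-output file,
 * but only when not in list (-l) mode.  In list mode no output file has
 * been opened, so the pointer must never reach putc().
 * Returns 1 if the character was written, 0 otherwise. */
int copy_char(int c, FILE *outfile, int lflag)
{
    if (lflag || outfile == NULL)
        return 0;
    return putc(c, outfile) != EOF;
}
```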

As I said, I really like the thing, but I've been making some mods, and
have a few suggestions.

mods:

Command line options to allow alternate dictionary files, alternate
personal dictionary files, and characters other than alpha to be
counted as "word" characters.  Also, a shell variable you can set to
name your personal dictionary, rather than always using "ispell.words".

The extra characters option has a provision for specifying full 8-bit
characters, intended for international character set use.  Actually,
the reason I added the option was to be able to get &'s counted for things
like "AT&T", and get underscores counted, as these critters have a habit of
turning up in technical documents.  It was easy to simply make the character
check capable of handling 8-bit characters (array lengths can be 256 as
easily as 128).  The way of specifying hyperascii on the command line is
pretty crude, but it works.  A few of the extra characters cause odd
things to happen, through inability to put words containing them in the
dictionary (slashes), collision with formatter syntax (periods, backslashes),
or screwing up control (newlines, escapes, tabs), but the results should
simply be odd but explicable actions as opposed to crashes.  It's unlikely
that you'd want to use them except to try to break the program.
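
As a rough illustration of the 256-entry table idea, here is a sketch.
The names wordchar, init_wordchars() and is_wordchar() are made up for
this example and are not the identifiers used in the actual mods; it
also assumes an ASCII-compatible character set so that the letter
ranges are contiguous.

```c
#include <string.h>

#define NCHARS 256   /* full 8-bit range rather than 7-bit ASCII's 128 */

static unsigned char wordchar[NCHARS];

/* Mark A-Z and a-z as word characters, then add any extra characters
 * (for example "&_") requested by the user. */
void init_wordchars(const char *extras)
{
    int c;

    memset(wordchar, 0, sizeof wordchar);
    for (c = 'A'; c <= 'Z'; c++)
        wordchar[c] = 1;
    for (c = 'a'; c <= 'z'; c++)
        wordchar[c] = 1;
    for (; extras != NULL && *extras != '\0'; extras++)
        wordchar[(unsigned char)*extras] = 1;
}

/* The cast keeps characters with the high bit set from turning into
 * negative array indexes on machines where char is signed. */
int is_wordchar(int c)
{
    return wordchar[(unsigned char)c];
}
```

With init_wordchars("&_"), "AT&T" and names containing underscores
scan as single words, while a slash still ends a word.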

To make use of alternate dictionaries without doing a lot of file shuffling,
buildhash also gets arguments to specify the input/output files.

These are all pretty simple, and I will send out these mods soon.

Something I am thinking about doing is enhancing the roff handling.  What
I have in mind is putting roff macros in the dictionary with a "." prefix,
and using the flags to indicate special actions for that command.  Then,
when a formatter macro was found, the dictionary could be checked.  If not
there, do what's done now (simply ignore the token).  Some things the
flags could cause:

	ignore the whole line, not just the command.

	pick up nroff argument 1 as a file name, query the user, and
	append it to the list of files to be processed if desired.
	Used for the .so command, for instance.  Ability to pick up
	an argument other than 1 may be useful for local macros, but
	gets more complicated.

	start ignoring text until you find a formatter command to turn
	it back on again.  Useful for any roff macros / preprocessor
	commands that bracket stuff which isn't going to be English,
	probably.  Macro definitions, macros intended to bracket code
	fragments, and eqn spring to mind.  There could be a variant of
	this allowing / disallowing nesting.

	ignore stuff until a line ending with a period - for tbl
	formatting commands.
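
The flag actions listed above could be sketched roughly as follows.
This is only an illustration: the enum names, the fixed table, and the
choice of which macro gets which action are all assumptions made for
the example.  In the scheme proposed here the entries would live in the
dictionary with flags, not in a compiled-in table.

```c
#include <string.h>

enum macro_action {
    MA_IGNORE_TOKEN,    /* what is done now: skip just the command */
    MA_IGNORE_LINE,     /* ignore the whole line */
    MA_INCLUDE_FILE,    /* argument 1 is a file name, as with .so */
    MA_SKIP_BEGIN,      /* start ignoring text */
    MA_SKIP_END,        /* turn checking back on */
    MA_SKIP_TO_PERIOD   /* ignore until a line ending with a period */
};

struct macro_entry {
    const char *name;
    enum macro_action action;
};

/* Example assignments only; a real list would come from the dictionary. */
static const struct macro_entry macro_table[] = {
    { ".so", MA_INCLUDE_FILE },
    { ".nr", MA_IGNORE_LINE },
    { ".de", MA_SKIP_BEGIN },
    { "..",  MA_SKIP_END },
    { ".EQ", MA_SKIP_BEGIN },
    { ".EN", MA_SKIP_END },
    { ".TS", MA_SKIP_TO_PERIOD }
};

/* Look up the command at the start of a line.  Unknown commands get
 * the existing behaviour, MA_IGNORE_TOKEN. */
enum macro_action macro_lookup(const char *line)
{
    size_t i, len = 0;

    while (line[len] != '\0' && line[len] != ' ' &&
           line[len] != '\t' && line[len] != '\n')
        len++;
    for (i = 0; i < sizeof macro_table / sizeof macro_table[0]; i++)
        if (strlen(macro_table[i].name) == len &&
            strncmp(line, macro_table[i].name, len) == 0)
            return macro_table[i].action;
    return MA_IGNORE_TOKEN;
}
```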

There may also be a use for coding special treatments for backslash
sequences based on the character following the backslash.  Also, ispell
doesn't pick up use of the ' instead of . to begin a command, or recognize
nroff comments.  These are nits - 99.9% of the time people don't use
such things anyway.

Some thoughts:

I was surprised to note how the personal dictionary was handled.  If
the hash table were restructured to allow insertion of new entries at
runtime (use malloc to allocate new nodes, change word from an index
to a pointer, etc), you could interpret and enter the personal dictionary
into the hash table rather than using an auxiliary structure, except maybe
keeping the results of "i" command entries to insert into the file.  You
would then have one access method instead of two for word lookup, and more
important, you could have the slash codes on the words in your personal
dictionary as well.  Or your personal roff macros if I did what I
suggested above.  I haven't looked at this in detail, so I don't know
if it's really feasible, or how much change is required.  If done, another
bonus is that you could combine dictionaries on the command line rather
than having to duplicate basic stuff across multiple dictionaries to
handle special classes of jargon.
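
To make the restructuring concrete, here is a sketch of a chained hash
table that accepts insertions at runtime.  It is not ispell's actual
table layout: the struct, the hash function, the table size and the
function names are all assumptions chosen for the example.

```c
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 1021            /* arbitrary prime for the example */

struct dent {
    struct dent *next;           /* chain of entries in the same bucket */
    char *word;                  /* a pointer, not an index into a string pool */
    unsigned flags;              /* slash-code flags for this word */
};

static struct dent *hashtbl[HASHSIZE];

static unsigned word_hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h = h * 31u + (unsigned char)*s++;
    return h % HASHSIZE;
}

/* Single lookup path for main and personal dictionary words alike. */
struct dent *dict_lookup(const char *word)
{
    struct dent *d;

    for (d = hashtbl[word_hash(word)]; d != NULL; d = d->next)
        if (strcmp(d->word, word) == 0)
            return d;
    return NULL;
}

/* Add a word at runtime; if it is already present, merge the flags.
 * Returns NULL only if malloc fails. */
struct dent *dict_insert(const char *word, unsigned flags)
{
    struct dent *d = dict_lookup(word);
    unsigned h;

    if (d != NULL) {
        d->flags |= flags;
        return d;
    }
    if ((d = malloc(sizeof *d)) == NULL)
        return NULL;
    if ((d->word = malloc(strlen(word) + 1)) == NULL) {
        free(d);
        return NULL;
    }
    strcpy(d->word, word);
    d->flags = flags;
    h = word_hash(word);
    d->next = hashtbl[h];
    hashtbl[h] = d;
    return d;
}
```

Loading several dictionaries from the command line would then just be
repeated calls to dict_insert(), one per word in each file.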

To really allow this thing to handle foreign languages, you would need
different ending rules.  It might be possible to devise abstractions
to be coded into the dictionary stating what the substitution rules are
for the various flags.  I know this one would be a LOT of work.  It's just
a thought.
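
One possible shape for such an abstraction, as a sketch only: each flag
maps to a "strip this ending, append that one" rule read from the
dictionary.  The struct, the function name and the flag letter in the
usage note are hypothetical; this is not how ispell's built-in ending
rules are coded.

```c
#include <string.h>

struct suffix_rule {
    char flag;             /* flag letter attached to dictionary roots */
    const char *strip;     /* ending removed from the root ("" for none) */
    const char *append;    /* ending added in its place */
};

/* Build the derived word in out.  Returns 0 on success, or -1 if the
 * root does not end in the rule's strip string or out is too small. */
int apply_suffix(const struct suffix_rule *r, const char *root,
                 char *out, size_t outsize)
{
    size_t rlen = strlen(root);
    size_t slen = strlen(r->strip);
    size_t alen = strlen(r->append);

    if (slen > rlen || strcmp(root + rlen - slen, r->strip) != 0)
        return -1;
    if (rlen - slen + alen + 1 > outsize)
        return -1;
    memcpy(out, root, rlen - slen);
    strcpy(out + (rlen - slen), r->append);
    return 0;
}
```

For example, a rule { 'X', "e", "ing" } applied to "create" yields
"creating", and the same rule does not apply to "walk".  A different
language would supply a different rule table without touching the code.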

Anyway, I LIKE it!  Even if I haven't set things up to make it convenient
to use it on my usenet articles yet.

-- 

Bob McQueer
{amdahl, sun, mtxinu, hoptoad, cpsc6a}!rtech!bobm

cudcv@warwick.UUCP (01/28/87)

In article <620@rtech.UUCP> bobm@rtech.UUCP (Bob Mcqueer) writes:
>
>To really allow this thing to handle foreign languages, you would need
>different ending rules.  It might be possible to devise abstractions
>to be coded into the dictionary stating what the substitution rules are
>for the various flags.  I know this one would be a LOT of work.  It's just
>a thought.
>
>Bob McQueer
>{amdahl, sun, mtxinu, hoptoad, cpsc6a}!rtech!bobm

Anybody likely to teach it British English?  I like the program, now if only
it could spell ...
-- 
UUCP:   ...!mcvax!ukc!warwick!cudcv	PHONE:  +44 203 523037
JANET:  cudcv@uk.ac.warwick.daisy       ARPA:   cudcv@daisy.warwick.ac.uk
Rob McMahon, Computing Services, Warwick University, Coventry CV4 7AL, England