[comp.sources.wanted] want to compress ascii text in strings not files

kgdykes@watmath.waterloo.edu (Ken Dykes) (01/09/91)

is there an effective algorithm (what is it? :-) or even lazier a C source
for a function (pascal, 286 assembler can also be survived :-)
which will reduce in-memory requirements for strings.

My application has several-K of strings of ascii subset used for messages
and diagnostics. (they vary from 1-2 chars long to about 90chars in length,
   probably average length of about 20-40chars)
I am hoping to reduce it's already large memory requirements at the expense
of a little CPU.

the bytes may be considered 7bit ascii subset (alphabetics,digits,punctuation,
white-space, ie: no ctrl-chars or \0177rubouts) (the ctrl-chars are represented
by C printf-like sequences such as \n for newline, \f for formfeed, etc)
actually the sequence "\n" occurs in >50% of the strings, so perhaps
a special representation like ctrl-J could be used :-) :-)

the in memory byte-capacity is actually 9bit not 8bit (honeywell bull dps8
architecture, or if you prefer pretend it's DEC Tops-10 or Tops-20, words
of 36 bits (four 9bit bytes))
although i would be interested in a more-protable 8-bit, not 9bit,
algorithm too.

i can think of some crude shift-&-pack algorithms, but am hoping someone
has something nifty & efficient.
If i can get a 33% space reduction without obscene amounts of cpu required
i will be ecstatic.
the simplest shift-2bits-&-pack would give me about 28% with a lot of
fairly expensive shift operations.

actually i suppose an accurate description of a compress/decompress
alogrithm would be prefered come to think of it, it will after all be
a teeny part of what is essentially a commercial package and i wouldnt
want to be accused of stealing code :-)

   -ken, not at all versed in data compression
-- 
   - Ken Dykes, Software Development Group, UofWaterloo, Canada [43.47N 80.52W]
         kgdykes@watmath.waterloo.edu  [129.97.128.1]        watmath!kgdykes
         postmaster@watbun.waterloo.edu       B8 P6/6 s+ f+ m t w e r p

kdq@demott.com (Kevin D. Quitt) (01/10/91)

In article <1991Jan9.053459.13231@watmath.waterloo.edu> kgdykes@watmath.waterloo.edu (Ken Dykes) writes:
>is there an effective algorithm (what is it? :-) or even lazier a C source
>for a function (pascal, 286 assembler can also be survived :-)
>which will reduce in-memory requirements for strings.

    You can try DEC's RAD50, or a home-brewed equivalent.  SPACE is 0,
A-Z, then 0-9, followed by three more characters.  I've used these
latter as an escape to other character sets.  This encoding will allow
putting 3 characters into 2 bytes.  Not the greatest, but it's easy.


-- 
 _
Kevin D. Quitt         demott!kdq   kdq@demott.com
DeMott Electronics Co. 14707 Keswick St.   Van Nuys, CA 91405-1266
VOICE (818) 988-4975   FAX (818) 997-1190  MODEM (818) 997-4496 PEP last

cur022%cluster@ukc.ac.uk (Bob Eager) (01/10/91)

In article <1991Jan9.053459.13231@watmath.waterloo.edu>, kgdykes@watmath.waterloo.edu (Ken Dykes) writes:
> is there an effective algorithm (what is it? :-) or even lazier a C source
> for a function (pascal, 286 assembler can also be survived :-)
> which will reduce in-memory requirements for strings.

I have a small C program that takes a file of strings (and message numbers)
as input, and generates two constant arrays and a tiny function for retrieving
any given message by number.

It stores each English word once, and each message is stored as a set of
pointers to the words.

Letters are stored as 6 bit characters with a shift character to get the
punctuation stuff.

If there is enough interest, I'll post it.
-------------------------+-------------------------------------------------
Bob Eager                | University of Kent at Canterbury
                         | +44 227 764000 ext 7589
-------------------------+-------------------------------------------------