[comp.compression] Astronomical data compression

sns@lo-fan.caltech.edu (Sam Southard Jr.) (03/23/91)

I have a topic/question that should be suitable for this newsgroup.  I am a
software engineer working on the software for the Keck Telescope (the new 10m
in Hawaii).  One of the things I am doing is taking the data from the CCD to
a VMEbus controller crate (Sun 1E running VxWorks) and from there to a
Sun 4/470 over the ethernet.

Each image can be up to 4096x4096 pixels (a 2x2 mosaic of 2048x2048 CCDs), each
pixel being 16 bits.  Obviously, if this kind of data is going over the
Ethernet, we want to compress it as much as possible.

The compression I can do is restricted in a few ways, including:

	1) It can not take much CPU, since some real-time control of the
		instrument is being done by the same VxWorks box.  Much is a
		term that is yet to be defined.
	2) Simple schemes like counting the number of times a pixel occurs
		in a row and compressing that to a single occurrence and a
		count (see the sketch after this list) are out, because there
		is a random element to the data (cosmic rays, etc.), which
		means that it is unlikely that there will be many cases where
		this method would compress the data.
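
To make (2) concrete, here is a minimal sketch of the kind of run-length
scheme I mean (illustration only, not instrument code; the names are made
up):

    /* Minimal sketch of the run-length scheme ruled out in (2); illustration
     * only, not instrument code.  Each run of equal 16-bit pixels becomes a
     * (value, count) pair, so speckled data can actually grow. */
    #include <stddef.h>
    #include <stdint.h>

    /* 'out' must hold up to 2*n words; returns the output length in words. */
    size_t rle_encode(const uint16_t *in, size_t n, uint16_t *out)
    {
        size_t i = 0, o = 0;

        while (i < n) {
            uint16_t value = in[i];
            uint16_t count = 1;

            while (i + count < n && in[i + count] == value && count < 65535u)
                count++;
            out[o++] = value;   /* the pixel value           */
            out[o++] = count;   /* how many times it repeats */
            i += count;
        }
        return o;
    }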

I have thought of a few methods which I can try, but I am interested in looking
at as many ways as possible.  I have lots of actual data to try out various
compression methods, and a reasonable amount of CPU to burn in choosing the
algorithm (or two or three) which will actually be used.

Does anyone have any suggestions?

-- 

Sam Southard, Jr.
{sns@deimos.caltech.edu|{backbone}!cit-vax!deimos!sns}

philip@beeblebrox.dle.dg.com (Philip Gladstone) (03/25/91)

>>>>> On 23 Mar 91 01:35:57 GMT, sns@lo-fan.caltech.edu (Sam Southard Jr.) said:

Sam> I have a topic/question that should be suitable for this newsgroup.  I am a
Sam> software engineer working on the software for the Keck Telescope (the new 10m
Sam> in Hawaii).  One of the things I am doing is taking the data from the CCD to
Sam> a VMEbus controller crate (Sun 1E running VxWorks) and from there to a
Sam> Sun 4/470 over the ethernet.

Sam> I have thought of a few methods which I can try, but I am interested in looking
Sam> at as many ways as possible.  I have lots of actual data to try out various
Sam> compression methods, and a reasonable amount of CPU to burn in choosing the
Sam> algorithm (or two or three) which will actually be used.

Sam> Does anyone have any suggestions?

The first question that comes to mind is whether you want a lossless
scheme or a lossy scheme?

If you want lossless compression, then I must ask you why? You already
admit that there is noise in the input signal and I bet you get noise
introduced from atmospheric turbulence.

If you want lossy compression, then you have to decide how much loss
you are prepared to accept.  In general the lossy algorithms do much
better at compressing pictures than the lossless ones; the improvement
can be a factor of 10.

The whole issue of whether lossy as opposed to lossless compression is
'better' is being discussed ad nauseam in various other groups, so I
don't want to start it here. But the question as to which is
appropriate for particular applications is relevant.

Philip
--
Philip Gladstone         Dev Lab Europe, Data General, Cambridge, UK

   "I really wish I'd listened to what my mother told me when I was young."
   "Why, what did she tell you?"
   "I don't know, I didn't listen."

sns@lo-fan.caltech.edu (Sam Southard Jr.) (03/26/91)

In article <PHILIP.91Mar24144205@beeblebrox.dle.dg.com>, philip@beeblebrox.dle.dg.com (Philip Gladstone) writes:
|> >>>>> On 23 Mar 91 01:35:57 GMT, sns@lo-fan.caltech.edu (Sam Southard Jr.) said:
|> 
|> The first question that comes to mind is whether you want a lossless
|> scheme or a lossy scheme?
|> 
|> If you want lossless compression, then I must ask you why? You already
|> admit that there is noise in the input signal and I bet you get noise
|> introduced from atmospheric turbulence.

This question has been asked by many people who have sent me responses to
my original message (BTW: thanks!).  I apologize for not answering it in
my initial posting - I was so involved with the project that the answer
seemed obvious to me, and I didn't even think to address the issue.

The answer is that I need lossless compression.  This code is going into the
actual instrument, and astronomers would not be happy to be told that they
are getting anything other than the actual data, bit for bit.  The
decision that was made (and I agree with it) was that the astronomer should
decide what to do with the data.

One of the reasons for this is that these instruments are often used to
look at very faint light sources and features which are often quite
close to the noise level.  It would be extremely difficult for
anyone but the individual astronomer to decide whether a particular small
delta somewhere was significant.  Imagine what an astronomer would do to
me if he found out that the spectral line he based his thesis on might
in fact have been merely noise introduced by a compression algorithm!

|> If you want lossy compression, then you have to decide how much loss
|> you are prepared to accept.  In general the lossy algorithms do much
|> better at compressing pictures than the lossless ones; the improvement
|> can be a factor of 10.

Unfortunately, they are completely unacceptable in this case.

xanthian@zorch.SF-Bay.ORG (Kent Paul Dolan) (03/26/91)

sns@deimos.caltech.edu (Sam Southard Jr.) writes:

> I have a topic/question that should be suitable for this newsgroup. I
> am a software engineer working on the software for the Keck Telescope
> (the new 10m in Hawaii). One of the things I am doing is taking the
> data from the CCD to a VMEbus controller crate (Sun 1E running
> VxWorks) and from there to a Sun 4/470 over the ethernet.

> Each image can be up to 4096x4096 pixels (a 2x2 mosaic of 2048x2048
> CCDs), each pixel being 16 bits. Obviously, if this kind of data is
> going over the Ethernet, we want to compress it as much as possible.

> The compression I can do is restricted in a few ways, including:

> 1) It can not take much CPU, since some real-time control of the
> instrument is being done by the same VxWorks box. Much is a term that
> is yet to be defined.

> 2) Simple schemes like counting the number of times a pixel occurs in
> a row and compressing that to a single occurrence and a count are out,
> because there is a random element to the data (cosmic rays, etc.),
> which means that it is unlikely that there will be many cases where
> this method would compress the data.

> I have thought of a few methods which I can try, but I am interested
> in looking at as many ways as possible. I have lots of actual data to
> try out various compression methods, and a reasonable amount of CPU to
> burn in choosing the algorithm (or two or three) which will actually
> be used.

> Does anyone have any suggestions?

Sure, else what's the point of responding.  ;-)

Data which is primarily of one (background) value, but which contains enough
speckly noise to defeat run-length or repeated-string-style (LZW, etc.)
encodings - such as your CCD data, or other scanned real-world images of
sparse data (engineering line drawings, e.g.) - is an ideal candidate for
arithmetic encoding.  Arithmetic encoding gains its efficiency by being
able to effectively encode a symbol set which is severely skewed toward a
few of the possible values (preferably one, in which case it can encode
the prevalent symbol in a small fraction of one bit), but which may, in a
large dataset, contain at least one of _each_ symbol.  It is quite
immune to speckly noise, because it encodes symbol by symbol (here a
symbol is half a pixel, i.e. one byte) rather than run by run, so a speckle
just bumps the count of a nearly empty bin by one for a while, having
little other effect on the compression, rather than spoiling a run every
time one appears.
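
To make the "bumps the count of a nearly empty bin" point concrete, here
is a bare-bones sketch of the kind of adaptive order-0 model such a coder
drives (my own illustration, not the CACM routines; the names are made up):

    /* Bare-bones adaptive order-0 model for a byte-wise arithmetic coder
     * (my own illustration, not the CACM routines).  The coder would narrow
     * its interval to [cumlow/total, (cumlow+freq[sym])/total) for each
     * symbol; the point here is that a stray speckle only bumps one bin. */
    #include <stdint.h>

    #define NSYM 256                       /* one symbol per byte (half pixel) */

    typedef struct {
        uint32_t freq[NSYM];               /* per-symbol counts, start at 1    */
        uint32_t total;                    /* sum of freq[]                    */
    } Model;

    void model_init(Model *m)
    {
        for (int i = 0; i < NSYM; i++)
            m->freq[i] = 1;                /* never give a symbol zero width   */
        m->total = NSYM;
    }

    /* Cumulative count of all symbols below 'sym'. */
    uint32_t model_cumlow(const Model *m, int sym)
    {
        uint32_t c = 0;
        for (int i = 0; i < sym; i++)
            c += m->freq[i];
        return c;
    }

    /* After coding a symbol, bump its count.  A cosmic-ray byte lands in a
     * nearly empty bin and costs many bits once, but leaves the rest of the
     * model - and so the background value's very short codes - untouched. */
    void model_update(Model *m, int sym)
    {
        m->freq[sym]++;
        m->total++;
    }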

The down side is that (at least in the implementation I know) it is quite
CPU intensive.  I understand there are much more effective methods, but I've
been quite impressed with the effectiveness of the simple routine in the
June 1987 CACM on data of this type.  It might suit your needs for speed,
or be effective enough at compression that you would consider dedicating a
box to the compression task just to make use of this slower technique.

I have the code from the CACM article, with the modification of ANSI C
prototypes, which I've compiled under a C compiler with prototypes, and
under BSD Unix, but not both at once. I'd be glad to pass the code as is
to a few requests, but even happier to pass it on to someone who would
make sure it compiles under a modern Unix C compiler with the given
makefile, and then post it here or put it up for FTP access or both.

It has the typical BSD long file names, so it would be a little extra
work to unpack and use on a system with 14 character file name limits; a
bit of work by the recipient on the shar file with an editor, and then
on the #include statements and makefile when the stuff is unpacked, will
take care of that.

I also have a clever modification which replaces the "slide the highest
frequency bin to the front of a linearly searched list for efficient
access" frequency-count table of the CACM implementation with a
heap-based table; it works but is not in quite such clean shape, if
anyone is interested.  The heap is faster for a less skewed set of
symbol frequencies, such as you might get by using it to replace the
Huffman coding part of lharc or by compressing common text, because the
O(log(symbol set size)) access to a heap bin beats the O(symbol set
size) access of the linearly searched table to find a bin.  It loses in
the current case, though, where the frequencies are so skewed that the
expected table search is very close to O(1).
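
In case the table structure isn't clear, the linearly searched version
amounts to something like the following (a sketch of the general idea
only, not the actual CACM code):

    /* Sketch of a frequency-ordered, linearly searched symbol table (the
     * general idea only, not the actual CACM code).  With highly skewed data
     * the wanted symbol is almost always in slot 0 or 1, so the expected
     * search cost is close to O(1); for flatter data a heap wins. */
    #include <stdint.h>

    #define NSYM 256

    typedef struct {
        int      sym[NSYM];                /* symbols, ordered by falling freq */
        uint32_t freq[NSYM];
    } FreqTable;

    void table_init(FreqTable *t)
    {
        for (int i = 0; i < NSYM; i++) {
            t->sym[i]  = i;                /* every symbol present once        */
            t->freq[i] = 1;
        }
    }

    /* Find the slot holding symbol s, bump its count, and bubble the slot
     * toward the front if it has overtaken its neighbours. */
    int table_find_and_update(FreqTable *t, int s)
    {
        int i = 0;

        while (t->sym[i] != s)             /* the linear search                */
            i++;
        t->freq[i]++;
        while (i > 0 && t->freq[i] > t->freq[i - 1]) {
            int      tmp_sym  = t->sym[i];
            uint32_t tmp_freq = t->freq[i];

            t->sym[i]      = t->sym[i - 1];   /* swap with the slot in front   */
            t->freq[i]     = t->freq[i - 1];
            t->sym[i - 1]  = tmp_sym;
            t->freq[i - 1] = tmp_freq;
            i--;
        }
        return i;                          /* slot index the coder would use   */
    }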

Both sets of routines still have the general structure of the CACM code,
which was written for tutorial rather than speed purposes (it does a
subroutine call per input or output bit, e.g.), so there is lots of room
to speed up either routine and make it less grim than my experience so
far suggests.

Again, I understand arithmetic coding compression has moved on long past
this code, but this is still a good place to start, because it has a
great set of documentation in the CACM article to assist understanding.

I haven't touched this stuff since June, from mental health problems, so
I'd be just as happy to pass it on to someone interested in pursuing the
matter further.

Kent, the man from xanth.
<xanthian@Zorch.SF-Bay.ORG> <xanthian@well.sf.ca.us>

henden@hpuxa.acs.ohio-state.edu (Arne A. Henden) (03/27/91)

In article <1991Mar23.013557.28151@nntp-server.caltech.edu>
 sns@deimos.caltech.edu (Sam Southard Jr.) writes:
>I have a topic/question that should be suitable for this newsgroup.
   [ ..much deleted!]
>[compressing astronomical images]
> Obviously, if this kind of data is going over the
>Ethernet, we want to compress it as much as possible.
>
>Does anyone have any suggestions?
>
>Sam Southard, Jr.
>{sns@deimos.caltech.edu|{backbone}!cit-vax!deimos!sns}

  Most astronomical images rattle around some low value with
a few spikes reaching near saturation (stars).  Others have
proposed some interesting techniques, such as using a difference
model.
  One technique that we wanted to try, but have never taken
the time to program, is to use bit plane compression.  For the
example above, divide the image into 16 1-bit images and
then run-length compress each image.  You will find a very
high compression ratio on the high order bits and essentially
zero compression on the low order bits.  I'd bet the average
would be about 40 percent compression, but the task would
be quite CPU intensive.
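
  A minimal sketch of the plane-extraction step (illustration only, with
made-up names; each packed plane would then be run-length coded on its
own) follows:

    /* Illustration only: pack bit plane 'plane' (0 = LSB, 15 = MSB) of an
     * n-pixel 16-bit image into 'out', which must hold (n + 7) / 8 bytes.
     * Each packed plane would then be run-length coded separately; the high
     * planes are nearly all zero, the low planes are essentially noise. */
    #include <stddef.h>
    #include <stdint.h>

    void extract_plane(const uint16_t *image, size_t n, int plane, uint8_t *out)
    {
        for (size_t i = 0; i < (n + 7) / 8; i++)
            out[i] = 0;                          /* clear the packed plane */
        for (size_t i = 0; i < n; i++)
            if ((image[i] >> plane) & 1u)
                out[i / 8] |= (uint8_t)(1u << (i % 8));
    }
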
  We were originally going to use this technique for our
remote observing quick-look images, but we find that run-length
compression with an adjustable threshold works well enough
for now; we intend to buy a pair of boards using the C-Cube
chip in the near future to try a different approach.
Arne Henden  Astronomy Dept.

warnock@stars.gsfc.nasa.gov (Archie Warnock) (03/28/91)

In article <1991Mar27.021241.6339@magnus.acs.ohio-state.edu>,
henden@hpuxa.acs.ohio-state.edu (Arne A. Henden) writes... 
>  One technique that we wanted to try, but have never taken
>the time to program, is to use bit plane compression.  For the

Hi Arne :-)  Long time, eh?

I've looked into a variant on this idea - just by dividing the image 
into the high- and low-order bytes and comparing the compression factor 
this way with that for the entire (virgin) image.  Used a couple of 
standard PC-type compression programs like PKZIP.  It helped, but not as 
much as I'd have hoped.  Typically, the resulting compressed images were 
about 10% - 15% smaller than if I just left the image alone.  You might 
do better by breaking things up into individual bit-planes, but the last 
few planes would be so noisy, you might not.  Bottom line seems to be 
that it's easy (and fairly fast) to get the first 50% or so.  Recoding 
from 16-bit numbers to 8-bit differences gets you that much, and only 
costs a single addition per pixel to restore.  The hard work starts if 
you want more.
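
Something along these lines (a sketch under my own assumptions - in
particular the escape convention for differences too big for a signed
byte is made up, since stars and cosmic-ray hits need one):

    /* Sketch of 16-bit -> 8-bit difference recoding (my own illustration).
     * Differences that fit in a signed byte are stored directly; the value
     * -128 is reserved as an escape meaning "a literal 16-bit pixel
     * follows".  Decoding the common case is one addition per pixel. */
    #include <stddef.h>
    #include <stdint.h>

    size_t diff_encode(const uint16_t *in, size_t n, uint8_t *out)
    {
        size_t o = 0;
        uint16_t prev = 0;

        for (size_t i = 0; i < n; i++) {
            int32_t d = (int32_t)in[i] - (int32_t)prev;

            if (d > -128 && d < 128) {
                out[o++] = (uint8_t)(int8_t)d;      /* small difference   */
            } else {
                out[o++] = (uint8_t)(int8_t)-128;   /* escape             */
                out[o++] = (uint8_t)(in[i] >> 8);   /* literal, high byte */
                out[o++] = (uint8_t)(in[i] & 0xff); /* literal, low byte  */
            }
            prev = in[i];
        }
        return o;
    }

    void diff_decode(const uint8_t *in, size_t nbytes, uint16_t *out)
    {
        size_t o = 0;
        uint16_t prev = 0;

        for (size_t i = 0; i < nbytes; ) {
            int8_t d = (int8_t)in[i++];

            if (d != -128) {
                prev = (uint16_t)(prev + d);        /* one add per pixel  */
            } else {
                prev = (uint16_t)((in[i] << 8) | in[i + 1]);
                i += 2;
            }
            out[o++] = prev;
        }
    }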

----------------------------------------------------------------------------
-- Archie Warnock                     Internet:  warnock@stars.gsfc.nasa.gov
-- ST Systems Corp.                   SPAN:      STARS::WARNOCK
-- NASA/GSFC                          "Unix - JCL for the 90s"

dwells@fits.cx.nrao.edu (Don Wells) (04/03/91)

In article <4638@dftsrv.gsfc.nasa.gov> warnock@stars.gsfc.nasa.gov
(Archie Warnock) writes:

   In article <1991Mar27.021241.6339@magnus.acs.ohio-state.edu>,
   henden@hpuxa.acs.ohio-state.edu (Arne A. Henden) writes... 
   >  One technique that we wanted to try, but have never taken
   >the time to program, is to use bit plane compression.  For the

   I've looked into a variant on this idea - just by dividing the image 
   into the high- and low-order bytes and comparing the compression factor 
   this way with that for the entire (virgin) image.  Used a couple of 
   standard PC-type compression programs like PKZIP.  It helped, but not as 
   much as I'd have hoped.  Typically, the resulting compressed images were 
   about 10% - 15% smaller than if I just left the image alone.  You might 
   do better by breaking things up into individual bit-planes, but the last 
   few planes would be so noisy, you might not.  

Last November a German astronomer asked me to compress several HST
images. The result of my experiments is in the anonFTP server on
fits.cx.nrao.edu [192.33.115.8] in directory /FITS/HST. The first of
the six files which I processed is:

       5158080 Oct 19 14:42 w0bs0102t_cvt.c0h
       4380234 Nov 15 16:29 w0bs0102t_cvt.c0h.Z
       3088384 Nov 16 00:16 w0bs0102t_cvt.c0h.tar

The .Z is just "compress" [LZW], which got 15% in this case. The
".tar" contains:

           931 Nov 15 23:56 1990 README
           674 Nov 16 00:08 1990 Makefile
           923 Nov 15 22:53 1990 bmerge.c
           917 Nov 15 22:14 1990 bsplit.c
        495707 Nov 16 00:11 1990 w0bs0102t_cvt.c0h.0.Z
       2579040 Nov 16 00:11 1990 w0bs0102t_cvt.c0h.1

The original file has been split by bsplit.c into even and odd bytes.
The even bytes compressed by 80%, but the odd (noisy, low-order) bytes
were incompressible.  Program bmerge.c can zipper the 3.1_MB of files
back together to re-create the original 5.2_MB file.  In this case
the technique removed 40% of the original bits (half of 80%).  For an FP
file you could get 25% immediately by splitting into four streams so
that the favorable statistics of the exponent bytes could be
exploited.  In a binary table (visibility data) it would pay to split
the rows into many separate byte streams to exploit the differing
statistics of the various columns and of the bytes inside those
columns.  The multiple-bytestream notion is a special case of the idea
of splitting the stream into bitstreams and compressing them
separately.
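
For readers without FTP access, the split amounts to roughly the
following (a rough sketch of the idea only, not the actual bsplit.c and
bmerge.c in the tar file; in these files the even-offset byte of each
pixel is the well-behaved high-order one):

    /* Rough sketch of the even/odd byte split (not the actual bsplit.c /
     * bmerge.c on the FTP server).  Even-offset bytes go to one stream and
     * odd-offset bytes to another; each stream is then compressed
     * separately, e.g. with "compress". */
    #include <stdio.h>

    void bsplit(FILE *in, FILE *even_out, FILE *odd_out)
    {
        int c;
        long offset = 0;

        while ((c = getc(in)) != EOF) {
            if (offset % 2 == 0)
                putc(c, even_out);       /* high-order, well-behaved bytes */
            else
                putc(c, odd_out);        /* low-order, noisy bytes         */
            offset++;
        }
    }

    void bmerge(FILE *even_in, FILE *odd_in, FILE *out)
    {
        int ce, co;

        while ((ce = getc(even_in)) != EOF) {
            putc(ce, out);               /* zipper the two streams back    */
            if ((co = getc(odd_in)) != EOF)
                putc(co, out);
        }
    }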


   Bottom line seems to be 
   that it's easy (and fairly fast) to get the first 50% or so.  Recoding 
   from 16-bit numbers to 8-bit differences gets you that much, and only 
   costs a single addition per pixel to restore.  The hard work starts if 
   you want more.

I agree with Archie's remarks about the virtues of finite differences.
My purpose in this posting is to point out that he may have been a bit
too pessimistic about the efficacy of simple even-odd bytestream
compression. 
--

Donald C. Wells             Associate Scientist        dwells@nrao.edu
National Radio Astronomy Observatory                   +1-804-296-0277
Edgemont Road                                     Fax= +1-804-296-0278
Charlottesville, Virginia 22903-2475 USA            78:31.1W, 38:02.2N