[comp.mail.mush] compressed folders revisited

hirchert@ux1.cso.uiuc.edu (Kurt Hirchert) (02/09/90)

On the machine where I receive my e-mail, mail folders (primarily from
the last month or two) occupy 80% or more of the disk space I am using.
Thus, the suggestion that mush support compressed mail folders seemed
highly attractive to me.  The subsequent discussion convinced me that
building in automatic use of compress/uncompress would be both
difficult and inefficient.  Unfortunately, an approach like the zfolder
macro, although easier to do, is even less efficient, so I thought I
would suggest another approach:

From the discussion, it would appear that mush performs the following
low level operations on a mail file:
1. reading the entire file sequentially
2. reading selected sequences of bytes from the file randomly -- the
   size and location of these sequences can apparently be identified
   during the sequential pass
3. appending data (of known size?) to the end of the file
4. creating a new file
5. renaming a new file to replace an old file
6. deleting an old file
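The numbered operations above can be sketched as a tiny command dispatcher.  Everything here -- the class, the method names, and the in-memory buffer standing in for a real storage format -- is a hypothetical illustration, not an existing mush interface:

```python
# Hypothetical sketch of the special-process idea: the command names mirror
# the numbered operations above, and a trivial in-memory buffer stands in
# for a real storage format.  None of this is an actual mush API.

class FolderProcessor:
    """Services mush's low-level folder operations for one mail file."""

    def __init__(self):
        self.buf = bytearray()            # 4. "create file"
        self.marks = []                   # 7. noted likely read positions

    def seq_read(self):                   # 1. read the entire file sequentially
        return bytes(self.buf)

    def rand_read(self, offset, length):  # 2. random read of a byte range
        return bytes(self.buf[offset:offset + length])

    def append(self, data):               # 3. append data to the end
        self.buf += data

    def note_position(self, offset):      # 7. mark a point for later random reads
        self.marks.append(offset)

def dispatch(proc, command, *args):
    """Stands in for the stdin/stdout command protocol."""
    return getattr(proc, command)(*args)
```

A real processor would read commands from stdin and write status and data to stdout; the dispatcher here just shows the shape of the command set.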

What I am suggesting is that it might be possible for mush to recognize
that some classes of mail files require special processing and to fork
a new process to do that special processing.  The process could accept
"commands" and data through stdin and produce status information and
data on stdout.  The specific commands needed to allow efficient
implementation may require some further study, but as a first cut, I
would suggest the operations above plus an operation to identify those
points where subsequent random reads are likely to begin.  Assuming
that both the file containing the process and the conditions under
which it is invoked are made user specifiable, anyone could extend
mush to handle new mail storage formats simply by writing such a
process, without Dan and Bart's intervention.  To illustrate this, I
will suggest how three
such special processes might be constructed:

Example 1: Compressed Folders

Although the existing commands compress and uncompress are ill suited
for this, the underlying LZW algorithm is not so hopeless.  Both the
compress and uncompress algorithms build the same internal
dictionary, and this construction is additive; that is, the dictionary
in effect later in the file is normally a superset of the dictionary
in effect earlier in the file.  Thus, only relatively little extra
information should be needed to allow uncompression of data in the
middle of the file using the dictionary corresponding to the end of
the file.  The implementation of the "commands" might then be as
follows:
1. sequential read - Do the standard zcat/uncompress algorithm, except
   that one also builds a list of correspondences between offsets into
   the compressed data and offsets in the resulting uncompressed data.
2. random read - Search the list of correspondences to find the highest
   output offset less than the requested starting point.  Do standard
   zcat/uncompress from that point, discarding output bytes before and
   after the requested range.
3. append - If dictionary is not already built, do uncompress algorithm
   building dictionary without constructing uncompressed text.  Then
   start compress algorithm at the end of the file, augmenting the
   existing dictionary.
4. create file - Initialize dictionary as for standard compress
   algorithm.
5. rename - Do ordinary rename (except for possible "smart" handling
   of .Z extension).
6. delete - Do ordinary delete (except for possible "smart" handling
   of .Z extension).
7. position noting - This could be taken into account in building the
   offset correspondences during sequential read.  I would suggest
   maintaining a circular buffer of temporary correspondences with
   small separation for recently uncompressed data.  When a request is
   received to note a position, the temporary position best covering
   that request can be added to the permanent list and all earlier
   temporary positions discarded.  If the buffer fills up, the oldest
   temporary positions can be discarded.  (The temporary buffer needs
   to be large enough that mush can recognize interesting positions and
   request that they be noted before the appropriate temporary
   correspondence "falls off the end".)
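The circular-buffer scheme in item 7 might look something like the following sketch; the class and method names are invented for illustration, and a real processor would record its (compressed offset, uncompressed offset) pairs as it decodes:

```python
# Sketch of item 7: a small circular buffer of fine-grained temporary
# (compressed offset, uncompressed offset) pairs, from which one pair is
# promoted to the permanent list when mush asks to note a position.
from collections import deque

class CorrespondenceTracker:
    def __init__(self, capacity=64):
        self.temporary = deque(maxlen=capacity)  # oldest pairs fall off the end
        self.permanent = []

    def record(self, comp_off, uncomp_off):
        """Called frequently during the sequential uncompress pass."""
        self.temporary.append((comp_off, uncomp_off))

    def note(self, uncomp_off):
        """Promote the temporary pair best covering uncomp_off: the one with
        the highest uncompressed offset not beyond the requested position."""
        best = None
        for pair in self.temporary:
            if pair[1] <= uncomp_off and (best is None or pair[1] > best[1]):
                best = pair
        if best is not None:
            self.permanent.append(best)
            while self.temporary and self.temporary[0][1] <= best[1]:
                self.temporary.popleft()  # discard earlier temporaries
        return best
```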

There is one other special case:  Sometimes the dictionary in LZW
fills up and the algorithm restarts with a fresh dictionary.  For this
processor, I would note the offset at which this occurs and then fork
a copy of the processor to handle requests for reading or writing
beyond that offset.  Thus, there would be a separate copy of the
process for each distinct dictionary.

Note that this implementation _never_ produces the uncompressed file,
only uncompressed portions of it.  This should help to keep down the
peak disk usage!
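The additive-dictionary property this example depends on can be seen in a toy LZW implementation -- a simplified sketch that omits the variable-width codes and dictionary-reset handling of the real compress(1): the decoder rebuilds exactly the table the encoder built, code for code.

```python
# Toy LZW, illustrating that compressor and decompressor build the same
# dictionary and that entries are only ever added, never changed.

def lzw_compress(data):
    """Encode bytes as a list of integer codes; also return the final table."""
    table = {bytes([i]): i for i in range(256)}
    w, codes = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            codes.append(table[w])
            table[wc] = len(table)      # additive: the table only grows
            w = bytes([byte])
    if w:
        codes.append(table[w])          # flush the pending prefix
    return codes, table

def lzw_decompress(codes):
    """Decode the code list, rebuilding the encoder's table as we go."""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):        # the classic KwKwK special case
            entry = w + w[:1]
        else:
            raise ValueError("bad LZW code")
        out.append(entry)
        table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out), table
```

Because the decoder's table matches the encoder's, the "append" operation can rebuild the dictionary by decoding and discarding the output, then continue compressing; a real implementation must also take care to resume the compressor with the correct pending-prefix state at the boundary.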

Example 2: Bart's Variation on Compressed Folders

The above method for implementing append should be significantly faster
than uncompressing the file, appending, and recompressing it, but if
the file becomes large enough, it still might be too slow for some
people.  Bart suggests the variation of using two files to represent
a folder: a compressed base file and an uncompressed extension.
The implementation of the "commands" might then be as follows:
1. sequential read - First apply same algorithm as for example 1 to
   base file.  Note the number of bytes produced.  Then deliver the
   bytes from the extension file.
2. random read - If the bytes are before the breakpoint, use the
   implementation from example 1.  If the bytes follow, subtract
   breakpoint and get corresponding bytes from the extension.  (Since
   mush will be asking for messages, requests should never span the
   breakpoint, but if they did, you just do these two methods one
   after the other.)
3. append - If the dictionary is already available, do as for example 1.
   Otherwise, append the data to the extension file.  (Optionally, if
   the extension file gets "too large", build the dictionary, append
   all of the extension's current contents as for example 1, and
   truncate the extension file back to zero length.)
4. create file - Initialize base file as for example 1 and create a
   zero-size extension file.
5. rename - Rename both files.
6. delete - Delete both files.
7. position noting - Note positions before the breakpoint as for
   example 1, but don't bother noting positions after the breakpoint.
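A sketch of the breakpoint arithmetic follows, with the base file's uncompressed view held as plain bytes so the routing logic can be shown on its own; the names are hypothetical:

```python
class TwoFileFolder:
    """Sketch of the two-file variation: a (conceptually compressed) base
    plus an uncompressed extension.  The base is held here as plain bytes,
    standing in for the uncompressed view of the base.Z file."""

    def __init__(self, base, extension=b""):
        self.base = base
        self.extension = bytearray(extension)
        self.breakpoint = len(base)      # noted during the sequential pass

    def seq_read(self):
        return self.base + bytes(self.extension)

    def rand_read(self, offset, length):
        """Route a read to the base, the extension, or (rarely) both."""
        end = offset + length
        pieces = []
        if offset < self.breakpoint:
            pieces.append(self.base[offset:min(end, self.breakpoint)])
        if end > self.breakpoint:
            lo = max(offset - self.breakpoint, 0)
            pieces.append(bytes(self.extension[lo:end - self.breakpoint]))
        return b"".join(pieces)

    def append(self, data):
        self.extension += data
```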

Example 3: Directory as Mail File

Someone wanted to modify mush to deal with a scheme where each message
was in a separate file and the directory was the "mail file".

1. sequential read - Read each file in order, building a list of which
   byte offsets are in which file.
2. random read - Use the list to determine which file to read from.
3. append - Write new file in directory.  (If it can be recognized that
   this corresponds to exactly the text in a file in another directory
   (i.e., the copy during the update of a mail file), link the file
   rather than writing a new one.  Additional "commands" may be
   required to give the special process enough hints to recognize
   this case.)
4. create file - Create directory.
5. rename - Rename directory.
6. delete - Recursively delete the directory and its contents.
7. position noting - Ignore.
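A sketch of the offset map such a processor might build; the class name and file layout are invented for illustration (real one-message-per-file schemes such as MH have their own conventions):

```python
# Sketch of example 3: a directory of one-message-per-file treated as a
# single mail file.  The sequential pass builds a map from byte offsets in
# the virtual concatenation back to the individual files.
import bisect
import os

class DirectoryFolder:
    def __init__(self, directory):
        self.files = sorted(
            os.path.join(directory, name) for name in os.listdir(directory))
        self.starts = []                 # virtual offset where each file begins
        offset = 0
        for path in self.files:
            self.starts.append(offset)
            offset += os.path.getsize(path)
        self.size = offset

    def seq_read(self):
        out = []
        for path in self.files:
            with open(path, "rb") as f:
                out.append(f.read())
        return b"".join(out)

    def rand_read(self, offset, length):
        """Use the offset map to find the file (or files) covering the range."""
        out = []
        while length > 0 and offset < self.size:
            i = bisect.bisect_right(self.starts, offset) - 1
            with open(self.files[i], "rb") as f:
                f.seek(offset - self.starts[i])
                chunk = f.read(length)
            out.append(chunk)
            offset += len(chunk)
            length -= len(chunk)
        return b"".join(out)
```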


OK folks - how does this approach sound to you?
-- 
Kurt W. Hirchert     hirchert@ncsa.uiuc.edu
National Center for Supercomputing Applications

schaefer@CSE.OGI.EDU (Barton E. Schaefer) (02/09/90)

In <1990Feb8.184406.2683@ux1.cso.uiuc.edu> Kurt Hirchert wrote:
} Subject: compressed folders revisited (long)
}
} From the discussion, it would appear that mush performs the following
} low level operations on a mail file:
} 1. reading the entire file sequentially

Correct.  It also copies the entire file to a temporary location, which
can be freely modified without altering the original file.  That's why,
for example, if you "merge" folders or run the "edit" command on a
message and then exit without updating, the original file is unchanged.

} 2. reading selected sequences of bytes from the file randomly -- the
}    size and location of these sequences can apparently be identified
}    during the sequential pass

Yes.  Location but not size at the moment, though I suppose it could
be worked out.

} 3. appending data (of known size?) to the end of the file

True, on "merge" (including "undigest -m").  The size is not necessarily
known.  On update, it may also add information in the middle of the file
(the "Status:" header that so annoys Berkeley-mail-haters).

} 4. creating a new file

Only on "save" and friends, unless you count the temp file created when
loading the folder.

} 5. renaming a new file to replace an old file

Mush *never* does this.  It either overwrites and truncates (on systems
that have ftruncate()) or truncates and rewrites (everywhere else) the
original file.

} 6. deleting an old file

This happens only for the temp files or for folders that are found at
update time to be empty (all messages deleted).

} What I am suggesting is that it might be possible for mush to recognize
} that some classes of mail files require special processing and to fork
} a new process to do that special processing.

This might be fine when it is possible to fork a second process.  Mush
does run on non-UNIX systems, though ....

} The process could accept
} "commands" and data through stdin and produce status information and
} data on stdout.

The other problem with this is getting everything connected appropriately
to avoid deadlock.  See the recent discussion on comp.unix.questions.
Not to imply that it's impossible.

When mush is fixed so it doesn't need backward seeks to load a folder, a
trivial way to implement a subset of this idea would be to have mush
check the "folder" it is loading for execute permission.  If execute is
set, mush runs the "folder" as a command, perhaps giving it "-r" as an
argument, and snarfs into its temp file from the output of that command.
Similarly, on any "save" or "update" whose target turns out to be an
executable file, it could run it with "-a" or "-w" as appropriate, and
write to the standard input of the command.  Then it's up to the "folder"
to correctly modify "itself".  But the idea of having the process hang
around to be both read from and written to, as a replacement for the temp
file, doesn't strike me as the best solution.
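That executable-folder idea can be sketched as follows, using the -r/-a/-w flags suggested above purely as a convention; nothing here is an existing mush feature:

```python
# Sketch of the executable-folder scheme: if the "folder" has its execute
# bit set, run it as a command; otherwise treat it as an ordinary file.
# The -r and -w flags are the convention proposed in the post, not a real
# mush interface.
import os
import subprocess

def load_folder(path):
    """Read a folder: run it with -r if executable, else read it as a file."""
    if os.access(path, os.X_OK):
        result = subprocess.run([path, "-r"], capture_output=True, check=True)
        return result.stdout
    with open(path, "rb") as f:
        return f.read()

def update_folder(path, data):
    """Update a folder: feed new contents to the executable's stdin with -w
    (the "folder" then rewrites itself), else overwrite the plain file."""
    if os.access(path, os.X_OK):
        subprocess.run([path, "-w"], input=data, check=True)
    else:
        with open(path, "wb") as f:
            f.write(data)
```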

} Assuming
} that both the file containing the process and the conditions under
} which it is invoked are made user specifiable, anyone could extend
} mush to handle new mail storage formats simply by writing such a
} process, without Dan and Bart's intervention.

"User specifiable" might be a problem, except with a scheme like testing
the executable bit in the file permissions.  It might, however, be
possible to make it compile-time specifiable.  In fact, it might be
possible to link the file-manipulation functions directly into mush, as
is done with some of the MMDF functions now, and not need separate
processes at all.

Another difficulty with one process for both input and output is that we
want future mushes to be able to handle several folders at once, so you
can move messages around from one to the other or create new folders of
messages taken from one or more others.  Furthermore, all of this should
be able to happen "virtually", that is, without actually changing the
underlying "real" folders until updates are done.  If you are forced to
fork one process, keep it running, and connect two file descriptors to
it, for every folder you access, you're going to be in real trouble.

I don't know the LZW encoding algorithm well enough to comment on the
example given.  It sounds reasonable, except for the general arguments
I've been mentioning; how would those change things?

} Example 3 - Directory as Mail File

In connection with having multiple folders open, there's going to be
support for locating different messages in different files, which pretty
much takes care of the directory-as-folder problem -- you just treat it
as a whole bunch of one-message folders, all open at the same time.

} OK folks - how does this approach sound to you?

Sounds promising but in need of more work.  For example, another planned
addition to mush is "index" files that store the seek offsets currently
determined during the read pass; with this info stored ahead of time, the
requirements for the read "process" are simplified.
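A sketch of what such an index file might hold, assuming -- purely for illustration -- a format of a count followed by 32-bit message offsets, with messages delimited by "From " lines as in Berkeley mailboxes:

```python
# Sketch of the planned "index" files: record the seek offset of every
# message so the read pass need not rescan the folder.  The on-disk format
# (a count, then 32-bit big-endian offsets) is an assumption made here for
# illustration, not the format mush actually plans to use.
import struct

def build_index(mailbox_bytes):
    """Find the offset of every line starting with "From "."""
    offsets, pos = [], 0
    for line in mailbox_bytes.splitlines(keepends=True):
        if line.startswith(b"From "):
            offsets.append(pos)
        pos += len(line)
    return offsets

def write_index(offsets):
    return struct.pack(">I", len(offsets)) + b"".join(
        struct.pack(">I", off) for off in offsets)

def read_index(blob):
    (count,) = struct.unpack_from(">I", blob, 0)
    return list(struct.unpack_from(">%dI" % count, blob, 4))
```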


-- 
Bart Schaefer          "February.  The hangnail on the big toe of the year."
                                                                    -- Duffy

schaefer@cse.ogi.edu (used to be cse.ogc.edu)

ronald@robobar.co.uk (Ronald S H Khoo) (02/12/90)

In article <9002090129.AA23568@cse.ogi.edu> schaefer@CSE.OGI.EDU (Barton E. Schaefer) writes:
> For example, another planned
> addition to mush is "index" files that store the seek offsets currently
> determined during the read pass; with this info stored ahead of time, the
> requirements for the read "process" are simplified.

Will they be compatible with msg's ._<mailbox> file ?

-- 
Eunet: Ronald.Khoo@robobar.Co.Uk   Phone: +44 1 991 1142    Fax: +44 1 998 8343
Paper: Robobar Ltd. 22 Wadsworth Road, Perivale, Middx., UB6 7JD ENGLAND.
$Header: /usr/ronald/.signature,v 1.2 90/01/26 15:17:15 ronald Exp $ :-)

schaefer@ogicse.ogc.edu (Barton E. Schaefer) (02/13/90)

In article <1990Feb12.084426.11212@robobar.co.uk> ronald@ibmpcug.CO.UK (Ronald S H Khoo) writes:
} In article <9002090129.AA23568@cse.ogi.edu> schaefer@CSE.OGI.EDU (Barton E. Schaefer) writes:
} > For example, another planned
} > addition to mush is "index" files that store the seek offsets currently
} > determined during the read pass
} 
} Will they be compatible with msg's ._<mailbox> file ?

Probably not, unless someone gives us a description of the format,
and it seems sensible.

In particular, I don't even know what "msg" is, nor do I have the
slightest idea what a ._<mailbox> file looks like.  Dan?
-- 
Bart Schaefer          "February.  The hangnail on the big toe of the year."
                                                                    -- Duffy

schaefer@cse.ogi.edu (used to be cse.ogc.edu)

ronald@robobar.co.uk (02/20/90)

In article <2634@uvaarpa.virginia.edu> mer6g@uvaarpa.Virginia.EDU (Marc Rouleau) writes:
> argv@sun.UUCP (Dan Heller) writes:
> |Ronald S H Khoo writes:
> |> } Will they be compatible with msg's ._<mailbox> file ?
> |
> |Last I heard "msg", I was at SRI --the program has since changed its
> |name to the more commonly known name, Elm.
> 
> I think we're talking about the MMDF UA "msg".

Argh. So much news to catch up on!  Yes, we are talking about the MMDF "msg".
Sorry about the response delay.  No I'm not running MMDF yet.  I may well
be soon (at another site) for all kindsa reasons.  msg compatibility would
be useful, but I can't help (yet) esp cos my MMDF tape is still "on its way"
from the UK distribution centre.

I didn't know there was another "msg" other than the variety one finds in
instant noodles.  Sorry, Dan & Bart.

Have a nice day y'all.
-- 
Eunet: Ronald.Khoo@robobar.Co.Uk   Phone: +44 1 991 1142    Fax: +44 1 998 8343
Paper: Robobar Ltd. 22 Wadsworth Road, Perivale, Middx., UB6 7JD ENGLAND.
$Header: /usr/ronald/.signature,v 1.2 90/01/26 15:17:15 ronald Exp $ :-)

david@ms.uky.edu (David Herron -- a slipped disk) (02/28/90)

The `msg' being referred to is the one in the MMDF distribution.  The ._
file is otherwise known as the "binary box".  It keeps track of the location
of each message within the mailbox, and the status (read/unread/deleted/etc).

The major flaw with it is that the disk addresses aren't in network byte
order.  (Soon to be fixed.)
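The byte-order fix amounts to packing the records in network (big-endian) order.  The record layout below is invented for illustration and is not the actual ._<mailbox> format:

```python
# Hypothetical "binary box" record: message offset, length, and status
# flags.  This is NOT the real MMDF ._<mailbox> layout; it only shows how
# network byte order makes the file portable across machines.
import struct

RECORD = struct.Struct("!IIH")   # "!" = network (big-endian) byte order

def pack_entry(offset, length, status):
    return RECORD.pack(offset, length, status)

def unpack_entry(blob):
    return RECORD.unpack(blob)
```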

BTW, the name change from msg -> elm was at my insistence...  But that was
years and years ago.

-- 
<- David Herron; an MMDF guy                              <david@ms.uky.edu>
<- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<-
<- Now arrived at a nameserver near you:  david@davids.mmdf.com