hirchert@ux1.cso.uiuc.edu (Kurt Hirchert) (02/09/90)
On the machine where I receive my e-mail, mail folders (primarily from the last month or two) occupy 80% or more of the disk space I am using. Thus, the suggestion that mush support compressed mail folders seemed highly attractive to me. The subsequent discussion convinced me that building in automatic use of compress/uncompress would be both difficult and inefficient. Unfortunately, an approach like the zfolder macro, although easier to do, is even less efficient, so I thought I would suggest another approach.

From the discussion, it would appear that mush performs the following low-level operations on a mail file:
1. reading the entire file sequentially
2. reading selected sequences of bytes from the file randomly -- the size and location of these sequences can apparently be identified during the sequential pass
3. appending data (of known size?) to the end of the file
4. creating a new file
5. renaming a new file to replace an old file
6. deleting an old file

What I am suggesting is that it might be possible for mush to recognize that some classes of mail files require special processing and to fork a new process to do that special processing. The process could accept "commands" and data through stdin and produce status information and data on stdout. The specific commands needed to allow efficient implementation may require some further study, but as a first cut, I would suggest the operations above plus an operation to identify those points where subsequent random reads are likely to begin. Assuming that both the file containing the process and the conditions under which it is invoked are made user specifiable, anyone wanting mush to handle a new mail storage format could write such a process, without Dan and Bart's intervention.
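To make the shape of that interface concrete, here is a minimal sketch (in Python, purely for illustration -- mush itself is C) of spawning a processor with its stdin and stdout connected through pipes. The "R <offset> <length>" command syntax is my own invention, and cat stands in for a real folder processor:

```python
import subprocess

def spawn_processor(argv):
    """Start a folder processor with its stdin and stdout connected
    to us through pipes.  A real processor would parse commands such
    as "R <offset> <length>" (hypothetical syntax); here `cat` stands
    in and simply echoes each command back as its "status"."""
    return subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)

proc = spawn_processor(["cat"])
# Send one command on the processor's stdin; read its reply from stdout.
reply, _ = proc.communicate(b"R 0 1024\n")
print(reply)   # → b'R 0 1024\n'
```

With both channels piped like this, the two sides have to be careful about buffering so that neither blocks waiting for the other.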
To illustrate this, I will suggest how three such special processes might be constructed.

Example 1: Compressed Folders

Although the existing commands compress and uncompress are ill suited for this, the underlying LZW algorithm is not so hopeless. Both the compress and uncompress algorithms involve the creation of the same internal dictionary, and this creation process is additive; that is, the dictionary later in the file is normally a superset of the dictionary earlier in the file. Thus, relatively little information should be needed to allow uncompression of data in the middle of the file using the dictionary corresponding to the end of the file. The implementation of the "commands" might then be as follows:
1. sequential read - Do the standard zcat/uncompress algorithm, except also create a list of correspondences between offsets into the compressed data and offsets in the resulting uncompressed data.
2. random read - Search the list of correspondences to find the highest output offset less than the requested starting point. Do the standard zcat/uncompress from that point, discarding output bytes before and after the requested range.
3. append - If the dictionary is not already built, do the uncompress algorithm, building the dictionary without constructing the uncompressed text. Then start the compress algorithm at the end of the file, augmenting the existing dictionary.
4. create file - Initialize the dictionary as for the standard compress algorithm.
5. rename - Do an ordinary rename (except for possible "smart" handling of the .Z extension).
6. delete - Do an ordinary delete (except for possible "smart" handling of the .Z extension).
7. position noting - This could be taken into account in building the offset correspondences during sequential read. I would suggest maintaining a circular buffer of temporary correspondences with small separation for recently uncompressed data.
When a request is received to note a position, the temporary correspondence best covering that request can be added to the permanent list and all earlier temporary correspondences discarded. If the buffer fills up, the oldest temporary correspondences can be discarded. (The temporary buffer needs to be large enough that mush can recognize interesting positions and request that they be noted before the appropriate temporary correspondence "falls off the end".)

There is one other special case: sometimes the dictionary in LZW is filled and the algorithm restarts with a fresh dictionary. For this processor, I would note the offset at which this occurs and then fork a copy of the processor to handle requests for reading or writing beyond that offset. Thus, there would be a separate copy of the process for each distinct dictionary.

Note that this implementation _never_ produces the uncompressed file, only uncompressed portions of it. This should help to keep down the peak disk usage!

Example 2: Bart's Variation on Compressed Folders

The above method for implementing append should be significantly faster than uncompressing the file, appending, and recompressing it, but if the file becomes large enough, it still might be too slow for some people. Bart suggests the variation of using two files to represent a folder: a compressed base file and an uncompressed extension. The implementation of the "commands" might then be as follows:
1. sequential read - First apply the same algorithm as in example 1 to the base file. Note the number of bytes produced (the breakpoint). Then deliver the bytes from the extension file.
2. random read - If the bytes are before the breakpoint, use the implementation from example 1. If the bytes follow it, subtract the breakpoint and get the corresponding bytes from the extension. (Since mush will be asking for messages, requests should never span the breakpoint, but if they did, you would just do these two methods one after the other.)
3. append - If the dictionary is already available, do as in example 1.
Otherwise, append the data to the extension file. (Optionally, if the extension file gets "too large", build the dictionary, append all of the extension's current contents as in example 1, and truncate the extension back to a zero-length file.)
4. create file - Initialize the base file as in example 1 and create a zero-size extension file.
5. rename - Rename both files.
6. delete - Delete both files.
7. position noting - Note positions before the breakpoint as in example 1, but don't bother noting positions after the breakpoint.

Example 3: Directory as Mail File

Someone wanted to modify mush to deal with a scheme where each message is in a separate file and the directory is the "mail file".
1. sequential read - Read each file in order, building a list of which byte offsets are in which file.
2. random read - Use the list to determine which file to read from.
3. append - Write a new file in the directory. (If it can be recognized that this corresponds exactly to the text of a file in another directory (i.e., the copy during the update of a mail file), link the file rather than writing a new one. Additional "commands" may be required to give the special process enough hints to recognize this case.)
4. create file - Create the directory.
5. rename - Rename the directory.
6. delete - Recursively delete the directory and its contents.
7. position noting - Ignore.

OK folks - how does this approach sound to you?
--
Kurt W. Hirchert     hirchert@ncsa.uiuc.edu
National Center for Supercomputing Applications
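Kurt's Example 1 rests on two claims: that the dictionary at the end of an LZW stream is a superset of any earlier one, and that a small correspondence list lets decompression restart in the middle. A toy sketch in Python (illustration only, not mush code -- real compress(1) code widths, dictionary resets, and the .Z header are all ignored) of the sequential read building correspondences and a random read that uses them:

```python
import bisect

def lzw_compress(data):
    """Plain LZW with unbounded codes -- a stand-in for compress(1)."""
    table = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = next_code
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def sequential_read(codes, every=8):
    """Decode the whole stream once, building the (final) dictionary and
    a correspondence list of (code index, output offset) checkpoints."""
    inv = {i: bytes([i]) for i in range(256)}
    next_code, w = 256, b""
    checkpoints, out = [], bytearray()
    for i, c in enumerate(codes):
        if i % every == 0:
            checkpoints.append((i, len(out)))
        entry = inv[c] if c in inv else w + w[:1]   # KwKwK case
        out.extend(entry)
        if w:
            inv[next_code] = w + entry[:1]
            next_code += 1
        w = entry
    return bytes(out), inv, checkpoints

def random_read(codes, inv, checkpoints, start, length):
    """Resume decoding at the nearest checkpoint before `start`; because
    the final dictionary is a superset, this is pure table lookup."""
    offs = [off for _, off in checkpoints]
    code_i, out_off = checkpoints[bisect.bisect_right(offs, start) - 1]
    out = bytearray()
    for c in codes[code_i:]:
        out.extend(inv[c])
        if out_off + len(out) >= start + length:
            break
    return bytes(out)[start - out_off : start - out_off + length]

data = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(data)
text, inv, cps = sequential_read(codes)
print(text == data, random_read(codes, inv, cps, 9, 6) == data[9:15])  # → True True
```

Note that the random read never materializes the whole uncompressed file, which is exactly the peak-disk-usage property Kurt is after.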
schaefer@CSE.OGI.EDU (Barton E. Schaefer) (02/09/90)
In <1990Feb8.184406.2683@ux1.cso.uiuc.edu> Kurt Hirchert wrote:
} Subject: compressed folders revisited (long)
}
} From the discussion, it would appear that mush performs the following
} low level operations on a mail file:
} 1. reading the entire file sequentially

Correct. It also copies the entire file to a temporary location, which can be freely modified without altering the original file. That's why, for example, if you "merge" folders or run the "edit" command on a message and then exit without updating, the original file is unchanged.

} 2. reading selected sequences of bytes from the file randomly -- the
}    size and location of these sequences can apparently be identified
}    during the sequential pass

Yes. Location but not size at the moment, though I suppose it could be worked out.

} 3. appending data (of known size?) to the end of the file

True, on "merge" (including "undigest -m"). The size is not necessarily known. On update, it may also add information in the middle of the file (the "Status:" header that so annoys Berkeley-mail-haters).

} 4. creating a new file

Only on "save" and friends, unless you count the temp file created when loading the folder.

} 5. renaming a new file to replace an old file

Mush *never* does this. It either overwrites and truncates (on systems that have ftruncate()) or truncates and rewrites (everywhere else) the original file.

} 6. deleting an old file

This happens only for the temp files or for folders that are found at update time to be empty (all messages deleted).

} What I am suggesting is that it might be possible for mush to recognize
} that some classes of mail files require special processing and to fork
} a new process to do that special processing.

This might be fine when it is possible to fork a second process. Mush does run on non-UNIX systems, though ....

} The process could accept "commands" and data through stdin and produce
} status information and data on stdout.
The other problem with this is getting everything connected appropriately to avoid deadlock. See the recent discussion on comp.unix.questions. Not to imply that it's impossible.

When mush is fixed so it doesn't need backward seeks to load a folder, a trivial way to implement a subset of this idea would be to have mush check the "folder" it is loading for execute permission. If execute is set, mush runs the "folder" as a command, perhaps giving it "-r" as an argument, and snarfs into its temp file from the output of that command. Similarly, on any "save" or "update" whose target turns out to be an executable file, it could run it with "-a" or "-w" as appropriate, and write to the standard input of the command. Then it's up to the "folder" to correctly modify "itself". But the idea of having the process hang around to be both read from and written to, as a replacement for the temp file, doesn't strike me as the best solution.

} Assuming that both the file containing the process and the conditions
} under which it is invoked are made user specifiable, mush can be
} extended to handle new mail storage formats, without Dan and Bart's
} intervention.

"User specifiable" might be a problem, except with a scheme like testing the executable bit in the file permissions. It might, however, be possible to make it compile-time specifiable. In fact, it might be possible to link the file-manipulation functions directly into mush, as is done with some of the MMDF functions now, and not need separate processes at all.

Another difficulty with one process for both input and output is that we want future mushes to be able to handle several folders at once, so you can move messages around from one to the other or create new folders of messages taken from one or more others. Furthermore, all of this should be able to happen "virtually", that is, without actually changing the underlying "real" folders until updates are done.
If you are forced to fork one process, keep it running, and connect two file descriptors to it for every folder you access, you're going to be in real trouble.

I don't know the LZW encoding algorithm well enough to comment on the example given. It sounds reasonable, except for the general arguments I've been mentioning; how would those change things?

} Example 3: Directory as Mail File

In connection with having multiple folders open, there's going to be support for locating different messages in different files, which pretty much takes care of the directory-as-folder problem -- you just treat it as a whole bunch of one-message folders, all open at the same time.

} OK folks - how does this approach sound to you?

Sounds promising but in need of more work. For example, another planned addition to mush is "index" files that store the seek offsets currently determined during the read pass; with this info stored ahead of time, the requirements for the read "process" are simplified.
--
Bart Schaefer    "February. The hangnail on the big toe of the year."  -- Duffy
schaefer@cse.ogi.edu    (used to be cse.ogc.edu)
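Bart's executable-folder scheme above is easy to sketch. A rough Python illustration (the "-r" flag and the whole convention are his proposal, not anything mush currently implements):

```python
import os
import subprocess
import tempfile

def load_folder(path):
    """Load a "folder" into a temp file the way Bart proposes: if the
    folder file has execute permission, run it as a command with -r
    (a proposed flag, not an existing one) and snarf its output into
    the temp file; otherwise just copy the bytes as mush does today."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    if os.access(path, os.X_OK):
        subprocess.run([path, "-r"], stdout=tmp, check=True)
    else:
        with open(path, "rb") as f:
            tmp.write(f.read())
    tmp.close()
    return tmp.name        # the mail reader would now parse this temp file
```

A "save" or "update" would work the other way around: run the executable target with "-a" or "-w" and write the messages to its standard input, leaving the "folder" to modify "itself".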
ronald@robobar.co.uk (Ronald S H Khoo) (02/12/90)
In article <9002090129.AA23568@cse.ogi.edu> schaefer@CSE.OGI.EDU (Barton E. Schaefer) writes:
> For example, another planned addition to mush is "index" files that
> store the seek offsets currently determined during the read pass; with
> this info stored ahead of time, the requirements for the read "process"
> are simplified.

Will they be compatible with msg's ._<mailbox> file?
--
Eunet: Ronald.Khoo@robobar.Co.Uk  Phone: +44 1 991 1142  Fax: +44 1 998 8343
Paper: Robobar Ltd. 22 Wadsworth Road, Perivale, Middx., UB6 7JD ENGLAND.
$Header: /usr/ronald/.signature,v 1.2 90/01/26 15:17:15 ronald Exp $ :-)
schaefer@ogicse.ogc.edu (Barton E. Schaefer) (02/13/90)
In article <1990Feb12.084426.11212@robobar.co.uk> ronald@ibmpcug.CO.UK (Ronald S H Khoo) writes:
} In article <9002090129.AA23568@cse.ogi.edu> schaefer@CSE.OGI.EDU (Barton E. Schaefer) writes:
} > For example, another planned addition to mush is "index" files that
} > store the seek offsets currently determined during the read pass
}
} Will they be compatible with msg's ._<mailbox> file?

Probably not, unless someone gives us a description of the format and it seems sensible. In particular, I don't even know what "msg" is, nor do I have the slightest idea what a ._<mailbox> file looks like. Dan?
--
Bart Schaefer    "February. The hangnail on the big toe of the year."  -- Duffy
schaefer@cse.ogi.edu    (used to be cse.ogc.edu)
ronald@robobar.co.uk (02/20/90)
In article <2634@uvaarpa.virginia.edu> mer6g@uvaarpa.Virginia.EDU (Marc Rouleau) writes:
> argv@sun.UUCP (Dan Heller) writes:
> |Ronald S H Khoo writes:
> |> } Will they be compatible with msg's ._<mailbox> file ?
> |
> |Last I heard "msg", I was at SRI -- the program has since changed its
> |name to the more commonly known name, Elm.
>
> I think we're talking about the MMDF UA "msg".

Argh. So much news to catch up on! Yes, we are talking about the MMDF "msg". Sorry about the response delay. No, I'm not running MMDF yet, though I may well be soon (at another site) for all kinds of reasons. msg compatibility would be useful, but I can't help (yet), especially since my MMDF tape is still "on its way" from the UK distribution centre. I didn't know there was another "msg" besides the variety one finds in instant noodles. Sorry, Dan & Bart. Have a nice day, y'all.
--
Eunet: Ronald.Khoo@robobar.Co.Uk  Phone: +44 1 991 1142  Fax: +44 1 998 8343
Paper: Robobar Ltd. 22 Wadsworth Road, Perivale, Middx., UB6 7JD ENGLAND.
$Header: /usr/ronald/.signature,v 1.2 90/01/26 15:17:15 ronald Exp $ :-)
david@ms.uky.edu (David Herron -- a slipped disk) (02/28/90)
The `msg' being referred to is the one in the MMDF distribution. The ._ file is otherwise known as the "binary box". It keeps track of the location of each message within the mailbox, and the status (read/unread/deleted/etc.). The major flaw with it is that the disk addresses aren't in network byte order (soon to be fixed).

BTW, the name change from msg -> elm was at my insistence... but that was years and years ago.
--
<- David Herron; an MMDF guy                              <david@ms.uky.edu>
<- ska: David le casse\*'   {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<-
<- Now arrived at a nameserver near you: david@davids.mmdf.com
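The byte-order flaw David mentions is straightforward to avoid when writing such an index: store the offsets in network (big-endian) order. A hypothetical layout in Python, illustrative only and not MMDF's actual ._<mailbox> format:

```python
import struct

def pack_index(offsets):
    """Pack message seek offsets big-endian (network byte order), record
    count first, so the index file reads back identically on any machine."""
    return struct.pack(">I", len(offsets)) + \
           b"".join(struct.pack(">I", off) for off in offsets)

def unpack_index(blob):
    """Inverse of pack_index: read the count, then that many offsets."""
    (n,) = struct.unpack_from(">I", blob, 0)
    return list(struct.unpack_from(">%dI" % n, blob, 4))

print(unpack_index(pack_index([0, 512, 1400])))   # → [0, 512, 1400]
```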