[comp.mail.mush] Huge folders

tneff@bfmny0.UU.NET (Tom Neff) (02/02/90)

I too have noticed that folders tend to get huge.  When they do, Mush's
performance gets excruciatingly slow.  Loading them means scanning
the entire file searching for message starts.  This can take forever!

And yet breaking the folders up into smaller ones is an unsatisfying
solution, because you lose the ability to manipulate the entire collection
of messages with "pick," "sort" etc.  This is what we use Mush for to
begin with - hate to give it up.

So another idea occurs to me - how about INDEXING huge folders?  Storing
a list of start-of-message pointers in a separate index file (in the
same directory as the folder) would let you access a huge folder in
seconds.  The header fields Mush displays for the current screenful of
messages could be grabbed in a few seek-and-reads.  As you need other
headers, you go get them.  (Mush could even keep loading headers in the
background after displaying the current set and entering the shell in
the foreground.)

How it might work:  The user decides that the currently opened folder
("+mysave") is huge and should be indexed.  He issues the Mush command:

	index

This creates "+X.mysave" containing start-of-message file pointers for
the folder.  Mush remembers the folder is indexed and will update the
index file whenever the folder itself is updated.
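
A minimal sketch of what such an "index" command might do, not Mush's
actual code.  It assumes an mbox-style folder in which every message
begins with a line starting "From " (a real scanner would also insist
on a preceding blank line), and the helper names build_index and
read_message are made up for illustration.

```python
import os


def build_index(folder_path, index_path):
    """Write the byte offset of each message start to index_path.

    Assumption: a message starts at any line beginning with b"From ".
    """
    offsets = []
    pos = 0
    with open(folder_path, "rb") as folder:
        for line in folder:
            if line.startswith(b"From "):
                offsets.append(pos)
            pos += len(line)
    with open(index_path, "w") as index:
        for off in offsets:
            index.write(f"{off}\n")
    return offsets


def read_message(folder_path, offsets, n):
    """Seek straight to message n and read up to the next message start."""
    start = offsets[n]
    if n + 1 < len(offsets):
        end = offsets[n + 1]
    else:
        end = os.path.getsize(folder_path)
    with open(folder_path, "rb") as folder:
        folder.seek(start)
        return folder.read(end - start)
```

Building the index still costs one full scan, but later sessions only
pay for the seeks they actually perform.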

In a later Mush session the user selects folder "+mysave" and Mush
notices that "+X.mysave" also exists.  If it is newer than "+mysave"
then the index is loaded and its pointers used to achieve a fast "scan"
of the folder; only the needed messages are actually read from the
folder file.  If the index exists but is older than the folder file,
Mush gets smart.  If the old indices LOOK like they point to messages,
and the folder is just bigger, then Mush fast-scans the "old" portion
and brute-force reads the "new" before display.  This is the normal case
when a mail delivery agent appends new messages to a folder.  But if the
indices look WRONG now (indicating that somebody edited or otherwise
touched the folder with some other program since the last Mush session),
Mush warns the user and prompts "Index obsolete - rebuild? [y]".  (I
haven't thought in any depth about what Mush does if the user answers
"no", but clearly it wouldn't use the index.)

A final optimization for huge folders would be to update the original
file IN PLACE if the user's changes don't require moving any text
around, e.g., deleting new messages while leaving old ones untouched.
I realize not everyone's OS permits this, but it would make a nice
compile time switch.
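
One case where no text has to move is when the deleted messages are
exactly the newest ones at the end of the folder: the file can simply
be truncated.  A sketch under that assumption follows; delete_in_place
is a hypothetical name, and "offsets" is assumed to be a list of
message start positions such as an index would hold.

```python
def delete_in_place(folder_path, offsets, deleted):
    """Truncate the folder if the deleted messages form its tail.

    offsets: byte offset of each message start, in file order.
    deleted: set of message numbers (0-based) to remove.
    Returns True if the folder was updated in place, False if the
    caller must fall back to rewriting the whole file.
    """
    if not deleted:
        return True  # nothing to do
    first = min(deleted)
    if set(deleted) != set(range(first, len(offsets))):
        return False  # a surviving message follows a deleted one
    with open(folder_path, "r+b") as folder:
        folder.truncate(offsets[first])
    return True
```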

schaefer@ogicse.ogc.edu (Barton E. Schaefer) (02/03/90)

First, some introductory remarks on folder compression:

In article <5C98ACE2A8@crdos1> davidsen@crdos1.crd.ge.com writes:
} 
}   If this gets added, and I would love to see it, provision should be
} made to provide a compress and uncompress string, which, if defined,
} would be loaded with the parameters and executed. This would allow not
} only compress, but also things like arc, zoo, zip, lzhuf, etc, archives
} of folders. Many of these run on DOS as well as UNIX.
} 
}                            Make it work right

Of course.  As I pointed out in an earlier article, the main difficulty
with compressed folders is loading from a pipe.  If you're willing to
live with two temp files -- uncompress to the first temp and load it into
the "working" temp from there, then recompress back to the original after
update -- it could be implemented easily.  But in that case you might as
well use the "zfolder" cmd and scripts I'm about to post new versions of,
because that's what they do.
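
For illustration only, the two-temp-file round trip might look like
the sketch below.  It is not the zfolder scripts: Python's gzip module
stands in for compress(1), a plain copy stands in for the mailer's
normal load into its working file, and both function names are
hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def open_compressed_folder(zpath):
    """Uncompress zpath to a first temp, then load it into a working temp."""
    fd1, first = tempfile.mkstemp()
    with gzip.open(zpath, "rb") as src, os.fdopen(fd1, "wb") as dst:
        shutil.copyfileobj(src, dst)
    fd2, working = tempfile.mkstemp()
    # Stand-in for the ordinary folder load, which wants a seekable file.
    with open(first, "rb") as src, os.fdopen(fd2, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(first)
    return working


def update_compressed_folder(working, zpath):
    """Recompress the working temp back over the original after update."""
    with open(working, "rb") as src, gzip.open(zpath, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(working)
```

The first temp exists only because the load step needs a real file
rather than a pipe.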

}   This should be integrated into the save commands, too. No use allowing
} a compressed folder if you can't add to it. And perhaps an option to not
} recompress until exit, so going thru your mail and sending stuff into
} folders would not thrash folders if multiple things were added.

See my earlier comments on the infeasibility of "save" into a compressed
folder.  I see what you're driving at -- uncompress when the first save
to that folder is issued, then remember that you need to compress again
at exit time -- but I really think my scheme of saving to a secondary
folder that is not kept compressed, and then merging when necessary, is
more efficient both in terms of time (assuming that the most recently
saved messages are the ones that are most frequently needed, accesses
are quicker because you need not uncompress) and in terms of disk space.

In article <48677cdf.20b6d@apollo.HP.COM> ced@apollo.HP.COM (Carl Davidson) writes:
} 
} I, too, have some mail folders that are huge (> 2 Mbytes). The ability
} to compress/decompress folders "on-the-fly" would be nice. Even better
} would be to store mail messages in a hypertext database.

Goodness, not asking for much, are we? :-)

} I also realize that this is a pipe dream, so I would gladly settle for
} auto compress/decompress.

If that other Davids*n's user-specifiable packing/unpacking strings get
implemented, you can probably use them to connect folder loading to any
kind of database you like.  It's a little beyond what mush is designed
to be to have that kind of database manager built in.  (Dan is free to
contradict me on this. :-)

In article <15147@bfmny0.UU.NET> tneff@bfmny0.UU.NET (Tom Neff) writes:
} I too have noticed that folders tend to get huge.  When they do, Mush's
} performance gets excruciatingly slow.  Loading them means scanning
} the entire file searching for message starts.  This can take forever!
} 
} So another idea occurs to me - how about INDEXING huge folders?

Various means for implementing this very thing have been under discussion
for some time.  What hasn't been solved is detection of corrupted folders
or index files, when the index appears valid to external checks (like the
modification times) but actually doesn't agree with the folder.  The
algorithms for doing this validation are understood, but implementation
appears to require a complete rewrite of the folder loading code (which
was hard enough to get right in the first place).

In other words, it's on our "some day" list.
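
For concreteness, the cheap check Tom describes (do the old pointers
still land on message starts, and has the folder only grown?) might be
sketched as follows.  This is a heuristic, not the full validation
discussed above: an index can pass it and still disagree with the
folder.  It assumes "From " message separators and assumes the index
also recorded the folder's size when it was written; check_index and
scan_tail are hypothetical names.

```python
import os


def check_index(folder_path, offsets, indexed_size):
    """Return 'current', 'appended' or 'obsolete' for a saved index.

    offsets: message start offsets stored in the index.
    indexed_size: folder size in bytes when the index was written.
    """
    size = os.path.getsize(folder_path)
    if size < indexed_size:
        return "obsolete"  # folder shrank behind our back
    with open(folder_path, "rb") as folder:
        for off in offsets:
            folder.seek(off)
            if folder.read(5) != b"From ":
                return "obsolete"  # pointer no longer hits a message start
        if size == indexed_size:
            return "current"
        folder.seek(indexed_size)
        if folder.read(5) != b"From ":
            return "obsolete"  # growth does not begin with a new message
    return "appended"


def scan_tail(folder_path, start):
    """Brute-force scan only the part of the folder after 'start'."""
    new_offsets = []
    pos = start
    with open(folder_path, "rb") as folder:
        folder.seek(start)
        for line in folder:
            if line.startswith(b"From "):
                new_offsets.append(pos)
            pos += len(line)
    return new_offsets
```

In the 'appended' case the old offsets are kept and scan_tail is run
from indexed_size to pick up the newly delivered messages.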

} A final optimization for huge folders would be to update the original
} file IN PLACE if the user's changes don't require moving any text
} around, e.g., deleting new messages while leaving old ones untouched.

Hmmm ....
-- 
Bart Schaefer                             "Live and don't learn, that's us."
                                                                   -- Hobbes

schaefer@cse.ogi.edu (used to be cse.ogc.edu)

tneff@bfmny0.UU.NET (Tom Neff) (02/03/90)

One other wish list item would make life with huge (and less than
huge) folders easier over slow baud rate connections:

Allow a "+initial_command" switch on the Mush invocation line, and obey
the command before initial display.  This would be generally useful,
like the "+cmd" feature in 'vi' and 'less'.

Specifically what I would tend to do on slow speed dialup lines
is say

	mush -f mylist +last-msg			# or +$

so that I enter the folder at the BACK rather than having to sit
fidgeting through the entire initial display and THEN switch to
the final screenful of messages.

So if the general +initial_command facility is too hard, it would
at least be great to add a -G switch to jump to the end of the
specified folder before display.


OK smoke em if you got em.  :-)

schaefer@ogicse.ogc.edu (Barton E. Schaefer) (02/05/90)

In article <15151@bfmny0.UU.NET> tneff@bfmny0.UU.NET (Tom Neff) writes:
} One other wish list item would make life with huge (and less than
} huge) folders easier over slow baud rate connections:
} 
} Allow a "+initial_command" switch on the Mush invocation line, and obey
} the command before initial display.  This would be generally useful,
} like the "+cmd" feature in 'vi' and 'less'.

Just the other day I was trying to figure out how to implement a -e
option, a la sed, perl, etc., which would be pretty much equivalent.
You also ought to be able to use multiple -I or -F options, which you
can't at the moment.

} Specifically what I would tend to do on slow speed dialup lines
} is say
} 
} 	mush -f mylist +last-msg			# or +$
} 
} so that I enter the folder at the BACK rather than having to sit
} fidgeting through the entire initial display and THEN switch to
} the final screenful of messages.

I take it you have "alias mush 'mush -C'" or the like, so that it
wouldn't work to use "mush -N"?

Note also that the "curses" mode can now be turned on and off in
the .mushrc file, so you can

    if $TERM == slow-dialup-terminal-type	# whatever
	curses off
    endif

or, alternately, get rid of the alias for -C and use

    if $TERM == fast-at-the-office-type
	curses
    endif

If the two types are the same I'm sure you can figure out some way to
differentiate; e.g. put "setenv TTY `tty`" in your .login and then

    if $TTY =~ *ttyd*
	curses off
    endif

I'm not rejecting your suggestion, I'm just offering workarounds for
the present situation.
-- 
Bart Schaefer                             "Live and don't learn, that's us."
                                                                   -- Hobbes

schaefer@cse.ogi.edu (used to be cse.ogc.edu)