[comp.mail.elm] Faster reading of mailboxes with indexing

henseler@uniol.UUCP (Herwig Henseler) (02/28/90)

Hello World,

When Un*x mailers read a mailbox, they have to scan the whole file for
"From_" lines to find the start of each message. So does elm. I can hardly
imagine a less efficient way of doing this! The mailbox format
is old enough to deserve an improvement...

Idea: Why not index these positions in a second file, so that only this
      file (with the seek position of every "From_" line together with the
      "From:" entry, the "Subject:" line and the total number of lines)
      has to be scanned to build the internal tables for elm. This would be
      _much_ faster!

This would not break any standard, because the index file is optional and
need not exist for the incoming mailbox. Maybe some MTAs will adopt this
mechanism! Even if not, it would speed up reading my folders. The only
danger is mail being deleted from an indexed mailbox without the index
being updated, but this can be detected via the last-modified date of the
mailbox file.
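
A minimal sketch of the idea, in Python rather than elm's C. Everything in
it is an assumption made for illustration: the function names, the JSON
index layout, and the simplification that every line starting with "From "
begins a new message. It records the mailbox size and modification time so
that a stale index can be detected.

```python
import json
import os

def build_index(mbox_path):
    """Scan the mailbox once; return one dict per message."""
    entries, current, in_headers, offset = [], None, False, 0
    with open(mbox_path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                # Assumed simplification: any "From " line starts a message.
                current = {"offset": offset, "from": "", "subject": "", "lines": 0}
                entries.append(current)
                in_headers = True
            elif current is not None:
                current["lines"] += 1  # lines after the "From_" line
                if in_headers:
                    lower = line.lower()
                    if line in (b"\n", b"\r\n"):
                        in_headers = False  # blank line ends the headers
                    elif lower.startswith(b"from:"):
                        current["from"] = line[5:].decode("latin-1").strip()
                    elif lower.startswith(b"subject:"):
                        current["subject"] = line[8:].decode("latin-1").strip()
            offset += len(line)
    return entries

def write_index(mbox_path, index_path):
    """Store the index together with the mailbox size and mtime."""
    st = os.stat(mbox_path)
    data = {"size": st.st_size, "mtime_ns": st.st_mtime_ns,
            "entries": build_index(mbox_path)}
    with open(index_path, "w") as out:
        json.dump(data, out)

def load_index(mbox_path, index_path):
    """Return the stored entries, or None if the index is missing or stale."""
    try:
        with open(index_path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        return None
    st = os.stat(mbox_path)
    if data["size"] != st.st_size or data["mtime_ns"] != st.st_mtime_ns:
        return None  # mailbox changed since the index was written
    return data["entries"]
```

A reader would call load_index() first and fall back to build_index()
whenever it returns None, so a missing or outdated index costs no more
than the plain scan.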

Comments?

	bye, Herwig
--
## Herwig Henseler (CS-Student) D-2930 Varel, Tweehoernweg 69 | Brain fault- ##
## EMail: henseler@uniol.UUCP (..!uunet!unido!uniol!henseler) | core dumped  ##

syd@DSI.COM (Syd Weinstein) (03/01/90)

henseler@uniol.UUCP (Herwig Henseler) writes:

>When Un*x mailers read a mailbox, they have to scan the whole file for
>"From_" lines to find the start of each message. So does elm. I can hardly
>imagine a less efficient way of doing this! The mailbox format
>is old enough to deserve an improvement...

>Idea: Why not index these positions in a second file, so that only this
>      file (with the seek position of every "From_" line together with the
>      "From:" entry, the "Subject:" line and the total number of lines)
>      has to be scanned to build the internal tables for elm. This would be
>      _much_ faster!
This was discussed a while back in the development group.  Two proposals
were considered: one was to embed the index in the file itself as a
pseudo-message; the other was to use a separate file, with the variants
of one index file per user or one per mail file.

However, this whole point becomes less important as we move toward the
Content-Length: header, which allows seeking over the body, and we
need to read the headers anyway.
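
A rough sketch of what that looks like, in Python for brevity (not elm's
actual code). It assumes Content-Length: gives the body size in bytes,
counted from just after the blank line that ends the headers; the function
name scan_with_content_length is made up for this example. Messages
without the header fall back to the ordinary line-by-line scan.

```python
import os

def scan_with_content_length(mbox_path):
    """Return [(offset, headers)], seeking past bodies where possible."""
    messages = []
    with open(mbox_path, "rb") as f:
        while True:
            start = f.tell()
            line = f.readline()
            if not line:
                break  # end of mailbox
            if not line.startswith(b"From "):
                continue  # body line or blank separator: keep scanning
            headers = {}
            while True:
                line = f.readline()
                if line in (b"", b"\n", b"\r\n"):
                    break  # blank line (or EOF) ends the headers
                if line[:1] in b" \t":
                    continue  # folded continuation line, ignored here
                name, sep, value = line.partition(b":")
                if sep:
                    key = name.strip().lower().decode("latin-1")
                    headers[key] = value.strip().decode("latin-1")
            length = headers.get("content-length")
            if length is not None and length.isdigit():
                # Jump over the body instead of reading it line by line.
                f.seek(int(length), os.SEEK_CUR)
            messages.append((start, headers))
    return messages
```

A side effect is that a body line beginning with "From " is not mistaken
for a message boundary when the length is present and correct.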
-- 
=====================================================================
Sydney S. Weinstein, CDP, CCP                   Elm Coordinator
Datacomp Systems, Inc.				Voice: (215) 947-9900
syd@DSI.COM or {bpa,vu-vlsi}!dsinc!syd	        FAX:   (215) 938-0235

ror@grassys.bc.ca (Richard O'Rourke) (03/01/90)

In article <1990Feb28.230830.9818@DSI.COM>, syd@DSI.COM (Syd Weinstein) writes:
> henseler@uniol.UUCP (Herwig Henseler) writes:
# 
# >When Un*x mailers read a mailbox, they have to scan the whole file for
# >"From_" lines to find the start of each message. So does elm. I can hardly
# >imagine a less efficient way of doing this! The mailbox format
# >is old enough to deserve an improvement...
# 
# >Idea: Why not index these positions in a second file, so that only this
# >      file (with the seek position of every "From_" line together with the
# >      "From:" entry, the "Subject:" line and the total number of lines)
# >      has to be scanned to build the internal tables for elm. This would be
# >      _much_ faster!
# This was discussed a while back in the development group.  Two proposals
# were considered: one was to embed the index in the file itself as a
# pseudo-message; the other was to use a separate file, with the variants
# of one index file per user or one per mail file.
# 

If you're going to go to this much trouble, I respectfully
recommend reading the applicable X.400 docs on message database
handling.  I'm not suggesting you spend your next year
implementing X.400.  I am suggesting that if you are going to take
the step of using 'mangled' message files or some sort of keyed
or database message system, a perusal of the applicable standards
is in order.  It would be a step in the right direction.


les@chinet.chi.il.us (Leslie Mikesell) (03/02/90)

In article <1990Feb28.230830.9818@DSI.COM> syd@DSI.COM writes:

>However, this whole point becomes less important as we move toward the
>Content-Length: header, which allows seeking over the body, and we
>need to read the headers anyway.

I'd like to see Content-Length: handling for compatibility with the
AT&T PMX mailers (attach/detach of multi-part messages would be nice
too).  Does anything else currently use it?
However, it would still save time to keep an optional copy of the
headers and file offsets of each message in a second file.
Perhaps you could just dump the internal index when saving a mailbox
over a certain size, then next time look for that file and, if it
exists, check the last entry against the mailbox to verify that nothing
has changed up to that point, and merge in any appended messages.
If you want something really different, I'd like to see something
like a zoo archive with the body compressed as an optional storage
format.
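
The checkpoint-and-merge idea above might look roughly like this (Python
for brevity; the names scan_from and refresh_index and the in-memory
format, a list of (offset, "From_" line) pairs, are assumptions for
illustration only). Reading and writing the saved index file and the size
threshold are left out. Note that checking only the last entry, as
proposed, cannot notice an edit that leaves that entry at the same offset.

```python
def scan_from(f, start):
    """Return [(offset, from_line)] for each "From " line at or after start."""
    found = []
    f.seek(start)
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            return found
        if line.startswith(b"From "):
            found.append((pos, line.rstrip(b"\r\n").decode("latin-1")))

def refresh_index(mbox_path, saved):
    """Reuse a saved index if its last entry still matches the mailbox."""
    with open(mbox_path, "rb") as f:
        if not saved:
            return scan_from(f, 0)
        last_offset, last_from = saved[-1]
        f.seek(last_offset)
        if f.readline().rstrip(b"\r\n").decode("latin-1") != last_from:
            # Checkpoint failed: the mailbox was rewritten, so rebuild.
            return scan_from(f, 0)
        # Checkpoint holds: rescan only from the last known message on,
        # which picks up that message again plus anything appended since.
        return list(saved[:-1]) + scan_from(f, last_offset)
```

With a large folder that only ever grows, the second and later opens then
read just the tail of the file.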

Les Mikesell
  les@chinet.chi.il.us