[comp.sources.misc] v06i045: Dissection utility for over-large mbox files

allbery@uunet.UU.NET (Brandon S. Allbery - comp.sources.misc) (03/05/89)

Posting-number: Volume 6, Issue 45
Submitted-by: mirk@warwick.UUCP (Mike Taylor)
Archive-name: dissect2

[Okay, so csplit can't be this tricky.  Still, you could do wonders with it
and a shell script wrapper....  Of course, BSD may not have "csplit".  ++bsa]

Here is a simple and self-explanatory little number for comp.sources.misc.
It should run on any UNIX machine, though I've only tried it on a sun3 with
Berkeley 4.3.  It splits a large mbox into individually named personal
mboxes for each person who has composed one or more of the mbox's
constituent articles.  See the manual page for more details.
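
As a rough illustration of the splitting rule, here is a minimal sketch in C.
It is not the code in the archive below, and the function names
is_message_start and sender_local_name are made up for this note.  It assumes
that a message begins at a line starting with "From " that follows a blank
line, and that the output file is named after whatever sits between "From "
and the first space, '@' or '%'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: nonzero if `line` opens a new mbox message, i.e. it
   begins with "From " and the previous line `prev` was blank. */
int is_message_start(const char *line, const char *prev)
{
    assert(line != NULL && prev != NULL);
    return strncmp(line, "From ", 5) == 0 && prev[0] == '\n';
}

/* Hypothetical helper: copy the sender's local name from a "From " line
   into `out` (at most outlen-1 characters), stopping at a space, '@', '%',
   a newline or the end of the string.  Returns the number of characters
   copied. */
size_t sender_local_name(const char *line, char *out, size_t outlen)
{
    size_t n;

    assert(line != NULL && out != NULL && outlen > 0);
    assert(strncmp(line, "From ", 5) == 0);
    line += 5;                      /* skip the "From " prefix */
    n = strcspn(line, " @%\n");     /* length up to the first separator */
    if (n >= outlen)
        n = outlen - 1;             /* truncate rather than overflow */
    memcpy(out, line, n);
    out[n] = '\0';
    return n;
}
```

Passing "\n" as the previous line for the first line of a file lets a mailbox
that opens directly with "From " count as starting a message.  Given a line
such as "From mirk@uk.ac.warwick.cs Fri Jan 20 03:00:00 1989", the second
helper leaves "mirk" in the buffer, which is the behaviour the BUGS section
of the manual page describes.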

-------------------------- Cut here, cheese-heads! --------------------------
#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	Makefile
#	Manifest
#	README
#	dissect.1
#	dissect.c
# This archive created: Sat Jan 21 18:30:57 1989
# By:	Mike Taylor ()
export PATH; PATH=/bin:$PATH
if test -f 'Makefile'
then
	echo shar: will not over-write existing file "'Makefile'"
else
cat << \SHAR_EOF > 'Makefile'
all:		dissect.c
		cc -O -s dissect.c -o dissect
		rm -f count
		ln dissect count

dissect:	dissect.c
		cc -O -s dissect.c -o dissect

count:		dissect.c
		cc -O -s dissect.c -o count
SHAR_EOF
fi # end of overwriting check
if test -f 'Manifest'
then
	echo shar: will not over-write existing file "'Manifest'"
else
cat << \SHAR_EOF > 'Manifest'
-rw-r--r--  1 mirk     csother       182 Jan 21 18:01 Makefile
-rw-r--r--  1 mirk     csother       315 Jan 21 18:28 Manifest
-rw-r--r--  1 mirk     csother       564 Jan 21 18:24 README
-rw-r--r--  1 mirk     csother      1955 Jan 21 18:21 dissect.1
-rw-r--r--  1 mirk     csother      3194 Jan 21 17:57 dissect.c
SHAR_EOF
fi # end of overwriting check
if test -f 'README'
then
	echo shar: will not over-write existing file "'README'"
else
cat << \SHAR_EOF > 'README'
Evening, all.

This is a program written in a hurry by me one night because I was
sick of wading through 1/4M mailboxes, trying to find some archaic
piece of correspondence.  It breaks up a large mailbox (or several
of them, if you like) into smaller ones, named after the sender of
the pieces of mail they contain.  See the manual entry if this is
unclear.  Mail bugs, flames, pieces of frozen vomit, slices of
intestinal lining etc., to mirk@uk.ac.warwick.cs.  That's about it
really.  Lap it up!

PS.  1st man: "My dog's got no nose"
     2nd man: "Frog off."
SHAR_EOF
fi # end of overwriting check
if test -f 'dissect.1'
then
	echo shar: will not over-write existing file "'dissect.1'"
else
cat << \SHAR_EOF > 'dissect.1'
.\" @(#)dissect.1 1.17 89/01/20 SMI; from HACKERS 1.1
.TH DISSECT 1 "20 January 1989"
.SH NAME
dissect \- Break up an mbox into smaller mboxes
.br
count \- Count number of articles in an mbox
.SH SYNOPSIS
.B dissect
.I filename1
.I [ filename2 ... ]
.br
.B count
.I filename1
.I [ filename2 ... ]
.br
.SH DESCRIPTION
.B dissect
reads through one or more files in mbox format (e.g. the file mbox created
by most "mail" programs, and the newsgroup files created by rn(1)).  It
creates new files, each named after the sender of an item of mail in one
of the specified mboxes, and in that file, deposits copies of all mail
sent by that user, so that together, the new files contain exactly the
same data as the old ones.  If the files that would be created already
exist, then
.B dissect
will append the news items in the specified mboxes onto the end of the
existing files.
.B dissect
will refuse to overwrite any of its arguments.
.sp
.B count
counts how many articles are in each mbox specified on the command-line,
and prints this on standard output.
.SH EXAMPLES
example% ls
.br
mbox
.br
example% dissect mbox
.br
example% ls
.br
VIRUS-L     cee074      erict       jec1        mbox
.br
andy        chip.uucp   hjt         martin      weemba
.br
example% count mbox martin hjt
.br
count:  11 items of mail in input file mbox.
.br
count:   1 items of mail in input file martin.
.br
count:   1 items of mail in input file hjt.
.br
example% dissect hjt
.br
dissect: won't overwrite input file hjt.
.SH "SEE ALSO"
.BR mail (1),
.BR rn (1)
.SH BUGS
.B dissect
creates the new files using only the local name of the user who sent
the mail item being saved - thus a piece of mail sent by a user
.B mirk@uk.ac.warwick.cs
would be saved in a file called simply
.B mirk.
.SH AUTHOR
.B dissect
and
.B count
were written by Michael Taylor (mirk@uk.ac.warwick.cs) in the early hours
of the morning of Friday, 20th January, 1989, on Warwick University's
Sun3 "emerald".
SHAR_EOF
fi # end of overwriting check
if test -f 'dissect.c'
then
	echo shar: will not over-write existing file "'dissect.c'"
else
cat << \SHAR_EOF > 'dissect.c'
/****************************************************************************\
|*                                                                          *|
|*  Dissect.c: a rough-and-ready heap of junk to split a file in mbox       *|
|*             format into a number of mbox-format files, each containing   *|
|*             all the messages from a sender whose mail was in the         *|
|*             original mbox, and named after that sender.                  *|
|*                                                                          *|
|*  Also:      it will count the number of articles in each mbox in its     *|
|*             argument list, when called with argv[0] not equal to         *|
|*             dissect.                                                     *|
|*                                                                          *|
|*  This program written in the early hours of 21st January 1989.           *|
|*  Copyright (C) 1989 by Mike Taylor.  No rights reserved - copy me!       *|
|*                                                                          *|
\****************************************************************************/

#include <stdio.h>
#include <strings.h>

#define LINELEN 1024

extern char *fgets ();
static int onlycount = 0;

/*--------------------------------------------------------------------------*/

int handle (argv, index)
  char **argv;
  int index;
{
  FILE *fp;
  FILE *to = NULL;
  static char name[LINELEN];
  static char line[LINELEN];
  static char last[LINELEN] = "\n";
  char *cp;
  int flag = 0;

  if ((fp = fopen (argv[index], "r")) == NULL) {
    (void) fprintf (stderr, "%s: couldn't open input file %s.\n",
		    argv[0], argv[index]);
    return (1);
  }

  while (fgets (line, LINELEN, fp) != NULL) {
    if ((!strncmp (line, "From ", 5)) && (*last == '\n')) {
      flag++;
      if (!onlycount) {
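	/* Close the previous sender's file, if any; fclose (NULL) is an error. */
	if (to != NULL) {
	  (void) fclose (to);
	  to = NULL;
	}
	(void) strcpy (name, line+5);
	/* Keep only the local name: stop at a separator or the end of the line. */
	for (cp = name; (*cp != '\0') && (*cp != '\n') && (*cp != ' ')
	       && (*cp != '@') && (*cp != '%'); cp++);
	*cp = '\0';
	if (!strcmp (name, argv[index])) {
	  (void) fprintf (stderr, "%s: won't overwrite input file %s.\n",
			  argv[0], argv[index]);
	  (void) strcpy (last, line);
	  continue;
	}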
	if ((to = fopen (name, "a")) == NULL) {
	  (void) fprintf (stderr, "%s: couldn't open output file %s.\n",
			  argv[0], name);
	  (void) fclose (fp);
	  return (1);
	}
      }
    }
    if ((to != NULL) && (!onlycount))
      (void) fputs (line, to);
    (void) strcpy (last, line);
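  }
  /* Close the last output file (if one is open) and the input file. */
  if (to != NULL)
    (void) fclose (to);
  (void) fclose (fp);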
  if (flag == 0)
    (void) fprintf (stderr, "%s: found no mail in input file %s.\n",
		    argv[0], argv[index]);
  else
    if (onlycount)
      (void) printf ("%s: %3d items of mail in input file %s.\n",
		     argv[0], flag, argv[index]);
  return (flag == 0);
}

/*--------------------------------------------------------------------------*/

main (argc, argv)
  int argc;
  char **argv;
{
  int status = 0;
  int i;

  if (argc == 1) {
    (void) fprintf (stderr, "Usage: %s file [ file ... ]\n", argv[0]);
    exit (255);
  }

  if (strcmp (argv[0], "dissect"))
    onlycount = 1;

  for (i = 1; i < argc; i++)
    status += handle (argv, i);

  exit (status);
}

/*--------------------------------------------------------------------------*/
SHAR_EOF
fi # end of overwriting check
#	End of shell archive
exit 0
______________________________________________________________________________
Mike Taylor - {Christ,M{athemat,us}ic}ian ...  Email to: mirk@uk.ac.warwick.cs
*** Unkle Mirk sez: "Em9 A7 Em9 A7 Em9 A7 Em9 A7 Cmaj7 Bm7 Am7 G Gdim7 Am" ***
------------------------------------------------------------------------------