[comp.lang.perl] Email parsing in perl?

jas@ISI.EDU (Jeff Sullivan) (02/16/91)

Does anyone have some code that parses email messages, extracting all
of the useful info in them by field? (e.g., To: From: Reply-TO:,
Subject: cc:, and the rest as body)?

I'm sure someone's done this; don't want to reinvent the wheel.

jas

--
--------------------------------------------------------------------------
Jeffrey A. Sullivan		| Senior Systems Programmer
jas@venera.isi.edu		| Information Sciences Institute
jas@isi.edu                    	| University of Southern California

rrr@u02.svl.cdc.com (Rich Ragan) (02/17/91)

In <16798@venera.isi.edu> jas@ISI.EDU (Jeff Sullivan) writes:


>Does anyone have some code that parses email messages, extracting all
>of the useful info in them by field? (e.g., To: From: Reply-TO:,
>Subject: cc:, and the rest as body)?

Johan Vromans rfc822.pl parses mail headers quite nicely. I can't
recall which perl ftp location I got it from right now but maybe
you can just try them or someone else will remember. If all else
fails, I'll mail you a copy.
--
Richard R. Ragan   rrr@svl.cdc.com    (408) 496-4340 
Control Data Corporation - Silicon Valley Operations
5101 Patrick Henry Drive, Santa Clara, CA 95054-1111

tchrist@convex.COM (Tom Christiansen) (02/17/91)

From the keyboard of jas@ISI.EDU (Jeff Sullivan):
:Does anyone have some code that parses email messages, extracting all
:of the useful info in them by field? (e.g., To: From: Reply-TO:,
:Subject: cc:, and the rest as body)?
:
:I'm sure someone's done this; don't want to reinvent the wheel.

If you want something pretty spiffy, see Chip Salzenburg's deliver
package.  If you just want to roll your own for some other purpose,
Larry and Randal have a nice example on p 183 of their Camel Book;

    $* = 1;
    $header =~ s/\n\s+/ /g;      # Merge continuation lines.
    %head = ('FRONTSTUFF', split(/^([-\w]+):/, $header));

Which puts the so-called UNIX From_ line as $head{'FRONTSTUFF'}
etc.  It does not process multiple headers as you might want it to.
Off the top of my head, you should be able to munge this into use:

    $/ = ''; # paramode
    $* = 1;
    $_ = <>; # read header

    @hdrs = split( /^([-\w]+):\s*/ );
    shift @hdrs;  # don't need leading stuff

    while ( ($name, $text) = splice(@hdrs,0,2) ) {
	$text =~ s/\n/ /g;  # maybe don't want multlines
	$Header{$name} .= ", " if $Headers{$name};
	$Header{$name} .= $text;
    }

    for $header (sort keys %Header) {
	print "<$header>: $Header{$header}\n";
    }


--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
 "All things are possible, but not all expedient."  (in life, UNIX, and perl)

merlyn@iwarp.intel.com (Randal L. Schwartz) (02/17/91)

In article <1991Feb16.185313.1789@convex.com>, tchrist@convex (Tom Christiansen) writes:
| Off the top of my head, you should be able to munge this into use:
| 
|     $/ = ''; # paramode
|     $* = 1;
|     $_ = <>; # read header
| 
|     @hdrs = split( /^([-\w]+):\s*/ );
|     shift @hdrs;  # don't need leading stuff
| 
|     while ( ($name, $text) = splice(@hdrs,0,2) ) {
| 	$text =~ s/\n/ /g;  # maybe don't want multlines
| 	$Header{$name} .= ", " if $Headers{$name};
| 	$Header{$name} .= $text;
|     }
| 
|     for $header (sort keys %Header) {
| 	print "<$header>: $Header{$header}\n";
|     }

Here's what I'm using everytime someone sends me a mail message (this
is an extract... hopefully, nothing above it or below it is necessary
for this example.  sigh.):

##################################################
$message = join("",<STDIN>);
($header,$body) = split(/\n\n/,$message,2); $body; # use $body for perl -w
$header =~ s/\n(\s+)/\377\1/g; # hide continuations
@headerlines = split(/\n/,$header);
for $h (split(/\s+/,<<EOL)) {
Reply-To From Sender Apparently-To To Cc Subject Date Message-ID
EOL
	@copy = @headerlines;
	@h = grep(s/^$h:\s*//i, @copy);
	next unless @h;
	if ($h =~ /^(Apparently-To|To|Cc)$/) {
		%addr = ();
		for (@h) {
			for (split(/,/)) {
				$addr{&cleanaddr($_)}++;
			}
		}
		@h = sort keys addr;
	}
	$field{$h} = join("\n",@h)."\n";
}
for (sort keys %field) {
	print "$_:\t", join("\n\t",split(/\n/,$field{$_})),"\n";
}
print "\n";

($from) = split(/\n/,$field{"Reply-To"}.$field{"From"}); # first from gets it
%tocc = ();
for ("Apparently-To","Cc","To") {
	for (split(/\n/,$field{$_})) {
		$tocc{$_}++;
	}
}
($subject) = split(/\n/,$field{"Subject"});
$subject = "(Life, the Universe, Everything)" unless $subject;

# at this point,
# $message is the entire original message
# $header contains a massaged header, $body contains the body
# $field{X} contains headers for X
# (with some massage on [Apparently-]To and Cc)
# $from is where you should send msgs back
# %tocc contains the "to"-type fields

# ... (stuff left out here)

sub cleanaddr {
	local($_) = @_;
	s/\s*\(.*\)\s*//;
	1 while s/.*<(.*)>.*/\1/;
	s/^\s+//;
	s/\s+$//;
	$_;
}
##################################################

&cleanaddr is simpleminded, but it seems to get the job done.

print "Just another Perl hacker,"
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/