[comp.lang.perl] perl version of from

jv@mh.nl (Johan Vromans) (01/04/90)

Note: I have redirected follow-ups to comp.lang.perl.

In article <JV.89Dec21221143@mhres.mh.nl> jv@mh.nl (Johan Vromans) writes:

| In article <1989Dec20.222732.5633@trigraph.uucp> john@trigraph.uucp (John Chew) writes:
|    Here's a new version of from.sed, my sed script that does the job
|    of from(1) better and faster.  It now truncates long subjects,
|    correctly handles messages without subjects and From lines with %
|    or @foo: routing.
| 
|    Yes, I tried writing this in Perl.  I'm not an expert Perl programmer,
|    but I couldn't get it to run faster than about 70% slower than sed.

To which I replied:

| I've been using a perl version of 'from' for a long time, so I trow it
| in. [...]

| It runs about as fast as the sed version. Typical times for a large
| mailbox (46585 lines) real/user/sys 50/16/8 for sed, 50/22/7 for perl.

Script fragment:

  while ( $line = <> ) {
    chop ($line);
    # scan until "From_" header found
    next unless $line =~ /^From\s+(\S+)\s+.*(\w{3}\s+\d+\s+\d+:\d+)/;

I was pointed out by John J. Chew <poslfit@gpu.UTCS.UToronto.CA> that
tightening the search for "From " would speed up the program by 30%.
He suggested:

  while ( <> ) {
    next unless /^From /;
    chop ($line);
    next unless /^From\s+(\S+)\s+.*(\w{3}\s+\d+\s+\d+:\d+)/;

Well, I tried it, and -NOT to my surprise- I found out that the major
speedup is caused by leaving out the assignment to the variable $line
and postponing the chop. I couldn't imagine (knowing how Larry likes
optimisation) that

  next unless /^From\s+(\S+)\s+.*(\w{3}\s+\d+\s+\d+:\d+)/;

would take more time to fail than

  next unless /^From /;

With the speedups, the perl script beats the sed script on both large
and small mailboxes:

~ > wc -lc INBOX
    163   6927 INBOX
~ > dotime 5 perl src/perl.pl INBOX
      Avg  Pass 1     2     3     4     5
     ----- ------- ----- ----- ----- -----
real   0.2     0.4   0.2   0.2   0.2   0.2
user   0.0     0.0   0.0   0.0   0.0   0.0
sys    0.1     0.1   0.1   0.1   0.1   0.1
~ > dotime 5 sed -f from.sed INBOX
      Avg  Pass 1     2     3     4     5
     ----- ------- ----- ----- ----- -----
real   0.5     0.7   0.4   0.5   0.4   0.4
user   0.1     0.1   0.1   0.1   0.1   0.1
sys    0.2     0.2   0.2   0.3   0.2   0.2
~ > wc -lc maildir/pax
  46585 1240000 maildir/pax
~ > dotime 5 perl src/from.pl maildir/pax
      Avg  Pass 1     2     3     4     5
     ----- ------- ----- ----- ----- -----
real  21.9    21.9  20.3  21.1  25.7  20.7
user  14.0    14.4  14.3  14.1  13.7  13.6
sys    5.9     5.8   4.9   5.7   7.4   5.9
~ > dotime 5 sed -f from.sed maildir/pax
      Avg  Pass 1     2     3     4     5
     ----- ------- ----- ----- ----- -----
real  23.1    23.4  22.7  22.9  23.1  23.5
user  14.8    14.8  14.9  14.8  14.3  15.2
sys    7.4     7.4   7.1   7.3   7.8   7.2

I have posted the "dotime" program to alt.sources, for whoever thinks
she/he can use it.

Have fun!

Johan
--
Johan Vromans				       jv@mh.nl via internet backbones
Multihouse Automatisering bv		       uucp: ..!{uunet,hp4nl}!mh.nl!jv
Doesburgweg 7, 2803 PL Gouda, The Netherlands  phone/fax: +31 1820 62944/62500
------------------------ "Arms are made for hugging" -------------------------

subbarao@phoenix.Princeton.EDU (Kartik Subbarao) (05/20/91)

Anyone know of a cooler way to do this?
---------

#! /usr/princeton/bin/perl

open(mail, $ENV{'MAIL'}) || exit;

while (<mail>) {
    if (/^From /) {      # Note this is different from From:
        chop ($out = $_); print $out;
        while (<mail>) {
            if (/^$/) { 
                print "\n"; 
                last; 
            }
            if (/^Subject:/) {
                chop;
                @subj = split(/[: ]/,$_);
                print ' "'.$subj[2].'"'."\n";
                last;
            }
        }
    }
}
--
internet% ypwhich

subbarao@phoenix.Princeton.EDU -| Internet
kartik@silvertone.Princeton.EDU (NeXT mail)  
SUBBARAO@PUCC.BITNET			          - Bitnet

jv@mh.nl (Johan Vromans) (05/20/91)

> Anyone know of a cooler way to do this?

Depends on what you call 'cool'	:-) .

#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 05/20/1991 10:09 UTC by jv@pronto
# Source directory /u1/users/jv
#
# existing files will NOT be overwritten unless -c is specified
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   2214 -rwxr-xr-x from
#
# ============= from ==============
if test -f 'from' -a X"$1" != X"-c"; then
	echo 'x - skipping from (File already exists)'
else
echo 'x - extracting from (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'from' &&
X#!/usr/bin/perl
X
X# This program requires perl version 3.0, patchlevel 4 or higher
X
X# @($)@ from	1.5 - from.pl
X
X# Show messages from a Unix mailbox. With -n: shown message numbers also.
X#
X# Usage "from [-n] MAILBOX..."
X#
X# Don't forget: perl is a Practical Extract and Report Language!
X#
X# Copyright 1989,1990 Johan Vromans <jv@mh.nl>, no rights reserved.
X# Usage and redistribution is free and encouraged.
X
X# Default output format
Xformat =
X@<<<<<<<<<<< "@<<<<<<<<<<<<" ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<..
X$date,        $from,         $subj
X.
X
X# Output format when sequence numbers are requested
Xformat format_n =
X@>: @<<<<<<<<<<< "@<<<<<<<<<<<<" ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<..
X$seq, $date,      $from,         $subj
X.
X
X# Parse and stash away -n switch, if provided
Xif ($#ARGV >= 0 && $ARGV[0] eq '-n') {
X  shift (@ARGV);
X  $~ = "format_n";
X}
X
X# Use system mailbox if none was specified on the command line
Xif ( $#ARGV < 0 ) {
X  if ( ! ($user = getlogin)) {
X    @a = getpwuid($<);
X    $user = $a[0];
X  }
X  if ( -r "/usr/mail/$user" ) {		# System V
X    @ARGV = ("/usr/mail/$user");
X  }
X  elsif ( -r "/usr/spool/mail" ) {	# BSD
X    @ARGV = ("/usr/spool/mail/$user");
X  }
X  else {
X    printf STDERR "No mail for $user.\n";
X    exit 1;
X  }
X}
X  
X$seq = 0;
X# Read through input file(s)
Xwhile (<>) {
X
X  # Look for a "From_" header (See RFC822 and associated documents).
X  next unless /^From\s+(\S+)\s+.*(\w{3}\s+\d+\s+\d+:\d+)/;
X
X  chop;
X  $from = $1;  
X  $date = $2;
X  if ( $date eq "" || $from eq "" ) {
X    print STDERR "Possible garbage: $_\n";
X    next;
X  }
X
X  $seq++;
X  # Get user name from uucp path
X  $from = $1 if $from =~ /.*!(.+)/;
X
X  # Now, scan for Subject or empty line
X  $subj = "";
X  while ( <> ) {
X    chop ($_);
X
X    if ( /^$/ || /^From / ) {
X      # force fall-though
X      $subj = "<none>" unless $subj;
X    }
X    else {
X      $subj = $1 if /^Subject\s*:\s*(.*)/i;
X      if ( /^From\s*:\s*/ ) {
X        $_ = $';
X        if ( /\((.+)\)/i ) { $from = $1; } 
X        elsif ( /^\s*(.+)\s*<.+>/i ) { $from = $1; } 
X        elsif ( /^<.+>\s*(.+)/i ) { $from = $1; } 
X      }
X    }
X
X    # do we have enough info?
X    if ( $from && $subj ) {
X      write;
X      last;
X    }
X  }
X}
SHAR_EOF
chmod 0755 from ||
echo 'restore of from failed'
Wc_c="`wc -c < 'from'`"
test 2214 -eq "$Wc_c" ||
	echo 'from: original size 2214, current size' "$Wc_c"
fi
exit 0
-- 
Johan Vromans				       jv@mh.nl via internet backbones
Multihouse Automatisering bv		       uucp: ..!{uunet,hp4nl}!mh.nl!jv
Doesburgweg 7, 2803 PL Gouda, The Netherlands  phone/fax: +31 1820 62911/62500
------------------------ "Arms are made for hugging" -------------------------

rearl@watnxt3.ucr.edu (Robert Earl) (05/20/91)

In article <azioL/5ej/g.c@idunno.Princeton.EDU> subbarao@phoenix.Princeton.EDU (Kartik Subbarao) writes:

|   Anyone know of a cooler way to do this?

It's Perl, there's ALWAYS a cooler way.  :-)

|		   @subj = split(/[: ]/,$_);
|		   print ' "'.$subj[2].'"'."\n";

This was sort of confusing, did you really want it to print just the
first word of a subject?  Mine won't do that, but it's easy enough to
change.

#! /usr/bin/perl

open(mail, $ENV{'MAIL'}) || exit;

while (<mail>) {
    # skip this block unless it's "From ", else print it
    /^From / || next; chop; print;
    while (<mail>) {
	chop;

	# for the other way, substitute this line:
	# ((length) ? s/^Subject: *([^: ]+)/print qq| "$1"\n|/e : print "\n")

	((length) ? s/^Subject: */print qq| "$'"\n|/e : print "\n")
	    && last;
    }
}
print "\n";


print "Just another Perl hacker, ";
--
______________________________________________________________________
			\					
 robert earl		/	"Love is a many splintered thing"
 rearl@watnxt3.ucr.edu	\		--Sisters of Mercy
 rearl@gnu.ai.mit.edu	/

tchrist@convex.COM (Tom Christiansen) (05/21/91)

From the keyboard of subbarao@phoenix.Princeton.EDU (Kartik Subbarao):
:Anyone know of a cooler way to do this?
:
:#! /usr/princeton/bin/perl
:
:open(mail, $ENV{'MAIL'}) || exit;
:
:while (<mail>) {
:    if (/^From /) {      # Note this is different from From:
:        chop ($out = $_); print $out;
:        while (<mail>) {
:            if (/^$/) { 
:                print "\n"; 
:                last; 
:            }
:            if (/^Subject:/) {
:                chop;
:                @subj = split(/[: ]/,$_);
:                print ' "'.$subj[2].'"'."\n";
:                last;
:            }
:        }
:    }
:}

Cooler?  Hmm...  Well, here's you I added:

    default open to /usr/spool/mail/$USER unless $MAIL set
    accept command line arguments for from filters 
    make everything line up prettier
    be case insensitive
    strip local domain from hostname
    don't stop on first subject line lest there be more
    removed unneeded split and chops
    reduced the line count by 4 for job security through obscurity :-)

Here's the code:

    #!/usr/bin/perl
    open (MAIL, $ENV{'MAIL'} || '/usr/spool/mail/'.$ENV{'USER'}) || exit;
    $only = join('|', @ARGV);
    $domain = '\.convex\.com';
    while (<MAIL>) {
	if (s/^from\s+//i) {       		# start of message
	    @ARGV && (/$only/io || next);
	    s/$domain//io if $domain;
	    printf "%-18s %-25s", /(\S+)\s+(.*)/;
	    while (<MAIL>) {
		/^$/ && (print("\n"), last);
		s/^subject:\s*(.*)\n$/"$1"/i  && print;
	    }
	}
    }


--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time." 

gamin@ireq-robot.hydro.qc.ca (Martin Boyer) (05/22/91)

jv@mh.nl (Johan Vromans) writes:

   > From: subbarao@phoenix.Princeton.EDU (Kartik Subbarao)
   > Subject: perl version of from(1)
   > Date: 19 May 91 23:54:24 GMT
   > Organization: American Chemical Society

   > Anyone know of a cooler way to do this?

   Depends on what you call 'cool'	:-) .

   [shar of from.pl 1.5 deleted]

Here's my version modified to:

    handle multiple mailboxes (default is $HOME/INBOX, created by vm)
    reset message counter for each mailbox
    test for presence of mailbox instead of spool directory in BSD systems


#!/usr/bin/perl
#----Le laboratoire de robotique de l'Institut de recherche d'Hydro-Quebec-----
# 
# Nom     : from [-n] [MAILBOX] [MAILBOX...]
# Fonction: Imprime un sommaire de la boite postale,
#	    avec -n, imprime les numeros de message aussi.
# Fichiers: from, /usr/spool/mail, ~/INBOX
# Notes   : Requiert perl version 3.0, patchlevel 4 ou mieux.
# 
# Cree    :    1989  ------------ Johan Vromans <jv@mh.nl>
# Modifie : 21 mai 91 ---------1- Martin Boyer <mboyer@ireq-robot.hydro.qc.ca>
#	    Copyright 1989, 1990 Johan Vromans <jv@mh.nl>, no rights reserved.
# 
# Historique: 
# 
# 21 mai 91 ---------1- Martin Boyer <mboyer@ireq-robot.hydro.qc.ca>
# 	Modifie la version 1.5 de Johan Vromans: ajoute INBOX
# 	et remis le compteur de message a zero pour chaque fichier.
#------------------------------------------------------------------------------

$inbox = "$ENV{'HOME'}/INBOX";

# Default output format
format =
@<<<<<<<<<<< "@<<<<<<<<<<<<" ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<..
$date,        $from,         $subj
.

# Output format when sequence numbers are requested
format format_n =
@>: @<<<<<<<<<<< "@<<<<<<<<<<<<" ^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<..
$seq, $date,      $from,         $subj
.

# Parse and stash away -n switch, if provided
if ($#ARGV >= 0 && $ARGV[0] eq '-n') {
  shift (@ARGV);
  $~ = "format_n";
}

# Use system mailbox and user INBOX 
# if none was specified on the command line
if ( $#ARGV < 0 ) {
    if ( ! ($user = getlogin)) {
	@a = getpwuid($<);
	$user = $a[0];
    }
    if ( -r "/usr/mail/$user" ) {	    # System V
	@ARGV = ("/usr/mail/$user");
    }
    elsif ( -r "/usr/spool/mail/$user" ) {  # BSD
	@ARGV = ("/usr/spool/mail/$user");
    }
    if ( $inbox && -r $inbox ) {
	@ARGV = ($inbox, @ARGV);
    }
    if ($#ARGV < 0)  {
	printf STDERR "No mail for $user.\n";
	exit 1;
    }
}
if ( $#ARGV < 0 ) {
  if ( ! ($user = getlogin)) {
    @a = getpwuid($<);
    $user = $a[0];
  }
  if ( -r "/usr/mail/$user" ) {		
    @ARGV = ("/usr/mail/$user");
  }
  elsif ( -r "/usr/spool/mail/$user" ) {	# BSD
    @ARGV = ("/usr/spool/mail/$user");
  }
  else {
    printf STDERR "No mail for $user.\n";
    exit 1;
  }
}
  
# Read through input file(s)
while (<>) {

  if (eof) {$seq = 0;}

  # Look for a "From_" header (See RFC822 and associated documents).
  next unless /^From\s+(\S+)\s+.*(\w{3}\s+\d+\s+\d+:\d+)/;

  chop;
  $from = $1;  
  $date = $2;
  if ( $date eq "" || $from eq "" ) {
    print STDERR "Possible garbage: $_\n";
    next;
  }

  $seq++;
  # Get user name from uucp path
  $from = $1 if $from =~ /.*!(.+)/;

  # Now, scan for Subject or empty line
  $subj = "";
  while ( <> ) {
    chop ($_);

    if ( /^$/ || /^From / ) {
      # force fall-though
      $subj = "<none>" unless $subj;
    }
    else {
      $subj = $1 if /^Subject\s*:\s*(.*)/i;
      if ( /^From\s*:\s*/ ) {
        $_ = $';
        if ( /\((.+)\)/i ) { $from = $1; } 
        elsif ( /^\s*(.+)\s*<.+>/i ) { $from = $1; } 
        elsif ( /^<.+>\s*(.+)/i ) { $from = $1; } 
      }
    }

    # do we have enough info?
    if ( $from && $subj ) {
      write;
      last;
    }
  }
}

--
Martin Boyer                            mboyer@ireq-robot.hydro.qc.ca
Institut de recherche d'Hydro-Quebec    mboyer@ireq-robot.uucp
Varennes, QC, Canada   J3X 1S1
+1 514 652-8412