[comp.unix.questions] UNIX SHELL PROG. & ELM QUESTIONS

moe@starnet.uucp (Moe S.) (05/10/91)

I appreciate any help on these questions:

1. If I have a large number (500+) of messages in a mailbox-format file,
   what is the best way to mail a file to every one of the 500+ senders?
   (using elm or any other way).

2. If I have a file containing some names and email addresses such as
   this: 
	
	      xyz@jkjk.jkyu.reyui (John J. Doe)
	      Mark L. Lost <apple!mark@eee.dfsjk.jkj>
	      Joe!!! jjj@jhdf.434r.er

  (Such a file can be obtained by doing
	grep "^From:" mailbox_file )

  How can I re-organize the file (using awk, sed, etc...) so that
  the email addresses are the first field in every line in the file? 
  Note that all addresses will contain at least one of the two
  characters: "@" or "!".
  Let's say the file is too big to make manual editing practical.


Thanks again.

Moe

rickert@mp.cs.niu.edu (Neil Rickert) (05/12/91)

In article <1991May10.064610.25802@starnet.uucp> moe@starnet.uucp (Moe S.) writes:
>1. If I have a large number (500+) of messages in a mailbox-format file,
>   what is the best way to mail a file to every one of the 500+ senders?
>   (using elm or any other way).

 500 is probably stretching the capabilities of much software.  Most mailers
pass the recipient addresses to the transport (MTA) as command-line
arguments, and 500 may exceed the maximum allowed.

 If you are running sendmail as an MTA, the easiest way may be to
extract the 'From:' lines and make each into a 'Bcc:' line for the
new message, which you then feed into 'sendmail' with the '-t'
option (which means that the recipient addresses are taken from the
'To:', 'Cc:' and 'Bcc:' headers).  Your message can also include a 'To:'
with a group name - 'To: multiple_recipients:;' - to make sure
an 'Apparently-To:' is not generated.
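
A minimal sketch of the approach just described, assuming sendmail is
installed as /usr/lib/sendmail and that the 'From:' headers in the mailbox
hold usable addresses.  The function name build_bcc_message and the file
names are hypothetical, and the sort -u step (dropping duplicate senders) is
an addition not mentioned above.  Note that the pattern will also pick up any
body line that happens to start with 'From:'.

```sh
#!/bin/sh
# build_bcc_message MAILBOX BODY   (hypothetical helper, for illustration only)
# Writes a complete message to stdout:
#   - a group-syntax To: header, so that no Apparently-To: is generated
#   - one Bcc: header per distinct From: line found in MAILBOX
#   - a blank line, then the contents of BODY
build_bcc_message() {
    printf '%s\n' 'To: multiple_recipients:;'
    # Turn each "From: ..." header into "Bcc: ...".  The mbox "From "
    # separator lines (no colon) are not matched by this pattern.
    sed -n 's/^From:/Bcc:/p' "$1" | sort -u
    printf '\n'
    cat "$2"
}

# Typical use (assumed sendmail path; not run here):
#   build_bcc_message mailbox_file body.txt | /usr/lib/sendmail -t
```

It is probably worth looking over the generated headers before piping them
into sendmail, since nothing here validates the addresses.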

>2. If I have a file containing some names and email addresses such as
>   this: 
>	      xyz@jkjk.jkyu.reyui (John J. Doe)
>	      Mark L. Lost <apple!mark@eee.dfsjk.jkj>
>	      Joe!!! jjj@jhdf.434r.er
>  How can I re-organize the file (using awk, sed, etc...) so that
>  the email addresses are the first field in every line in the file? 

 Very difficult.  Probably beyond the abilities of awk, sed, etc.  If
Larry Wall happens to be reading this he may suggest perl.  The trouble is
that the syntax of RFC822 addresses is quite complex, and as X.400 gateways
become more common the extreme cases of RFC822 addresses are increasingly
likely to show up.

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115                                   +1-815-753-6940

russell@ccu1.aukuni.ac.nz (Russell J Fulton;ccc032u) (05/13/91)

rickert@mp.cs.niu.edu (Neil Rickert) writes:

>In article <1991May10.064610.25802@starnet.uucp> moe@starnet.uucp (Moe S.) writes:
>>2. If I have a file containing some names and email addresses such as
>>   this: 
>>	      xyz@jkjk.jkyu.reyui (John J. Doe)
>>	      Mark L. Lost <apple!mark@eee.dfsjk.jkj>
>>	      Joe!!! jjj@jhdf.434r.er
>>  How can I re-organize the file (using awk, sed, etc...) so that
>>  the email addresses are the first field in every line in the file? 

> Very difficult.  Probably beyond the abilities of awk, sed, etc.  If
>Larry Wall happens to be reading this he may suggest perl.  The trouble is
>that the syntax of RFC822 addresses is quite complex, and as X.400 gateways
>become more common the extreme cases of RFC822 addresses are increasingly
>likely to show up.

Yes, I would be inclined to use perl or icon. Icon is probably the more
powerful for this sort of thing. You need something with very flexible 
pattern matching.  Your approach depends on whether it is 'one off' or if it
is going to turn into a routine job. If the former it *may* be quicker to
grit your teeth and do it by hand. If not, then you may find the time taken
to write a perl or icon script worthwhile.

As for choosing between perl and icon, I would say if you are already well
versed in regular expressions and the like go for perl, otherwise icon.

Cheers, Russell
-- 
Russell Fulton, Computer Center, University of Auckland, New Zealand.
<rj_fulton@aukuni.ac.nz>

felps@convex.com (Robert Felps) (05/13/91)

In <1991May10.064610.25802@starnet.uucp> moe@starnet.uucp (Moe S.) writes:

>I appreciate any help on these questions:

>1. If I have a large number (500+) of messages in a mailbox-format file,
>   what is the best way to mail a file to every one of the 500+ senders?
>   (using elm or any other way).

Sounds like the answer to your question #2 will answer this. Strip the
addresses out, then send the message to them. The problem is that 500+
addresses is over the limit for most mailers. I'd try writing a script to
strip the addresses, put them in a file, split the file, then feed the
pieces as arguments to ucb/mail with a -s "subject" argument too.
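
One way the split-and-send idea might look, as a rough sketch.  It assumes
the address file holds one bare address per line (no names or comments) and
that the local mail program accepts -s.  The function name send_in_batches,
the MAILER override and the default batch size of 50 are made up for
illustration.

```sh
#!/bin/sh
# send_in_batches ADDR_FILE MESSAGE_FILE SUBJECT [BATCH_SIZE]
# (hypothetical helper; MAILER defaults to "mail" and is an assumption)
send_in_batches() {
    sib_dir=$(mktemp -d) || return 1
    # split(1) writes chunk.aa, chunk.ab, ... with at most BATCH_SIZE lines each
    split -l "${4:-50}" "$1" "$sib_dir/chunk."
    for sib_chunk in "$sib_dir"/chunk.*
    do
        [ -f "$sib_chunk" ] || continue    # empty address file: nothing to send
        # $(cat ...) is deliberately unquoted so each address becomes one argument
        ${MAILER:-mail} -s "$3" $(cat "$sib_chunk") < "$2"
    done
    rm -rf "$sib_dir"
}

# Typical use (not run here):
#   send_in_batches addresses message.txt "subject" 50
```

Each batch is a separate invocation of the mailer, so every recipient in a
batch sees the other addresses in that batch on the To: line.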

>2. If I have a file containing some names and email addresses such as
>   this: 
>	
>	      xyz@jkjk.jkyu.reyui (John J. Doe)
>	      Mark L. Lost <apple!mark@eee.dfsjk.jkj>
>	      Joe!!! jjj@jhdf.434r.er

>  (Such a file can be obtained by doing
>	grep "^From:" mailbox_file )

>  How can I re-organize the file (using awk, sed, etc...) so that
>  the email addresses are the first field in every line in the file? 
>  Note that all addresses will contain at least one of the two
>  characters: "@" or "!".
>  Let's say the file is too big to make manual editing practical.

Hmmm. I noticed others posted suggestions of PERL or icon. I don't see
this as a language question so much as a question of a standard format and
unexpected non-conformance to that standard. Someone touched on this with
the reference to the RFC. Here's a quick shot that gets a large percentage
of the addresses, but it doesn't handle all of the off-the-wall cases.
For example, I ran it through my $MAIL file and had a blank line in the
output. When I looked at what caused it I had a message with the line,

From: Roger Rabbit is expecting to see you later today.....

from a wonderful secretary that could care less if the mailer uses the
From: header. So those are going to be difficult to massage out or catch.

Here's the code; unfortunately it uses nawk because of the regular-expression Field Separator:

--------------------------------- cut here ----------------------------------
nawk 'BEGIN {
FS="[ \t()<>:]"   # space, tab, left/right paren, left/right angle brackets, colon
}
/^F[rR][oO][mM]:/  {
  for ( i = 1; i <= NF; i++ ) {
    if ( index($i,"@") ) {
      print $i
      break
    }
    else if ( index($i,"!") ) {
      print $i
      break
    }
  }
#  print "i="i "   NF="NF
#  print
  if ( i > NF )
    if ( length($2) )
      print $2
    else
      print $3
}' $MAIL
--------------------------------- cut here ----------------------------------

If you don't have nawk and you don't know awk send me mail and I'll convert
it to awk (old awk).

>Thanks again.

>Moe

Hope it helps,
Robert Felps            I do not speak for  felps@convex.com
Convex Computer Corp    Convex and I seldom Product Specialist
3000 Waterview Parkway  speak for myself.   Tech. Assistant Ctr
Richardson, Tx.  75080  VMS? What's that?   1(800) 952-0379

clewis@ferret.ocunix.on.ca (Chris Lewis) (05/14/91)

In article <1991May12.022641.18961@mp.cs.niu.edu> rickert@mp.cs.niu.edu (Neil Rickert) writes:
>In article <1991May10.064610.25802@starnet.uucp> moe@starnet.uucp (Moe S.) writes:
>>1. If I have a large number (500+) of messages in a mailbox-format file,
>>   what is the best way to mail a file to every one of the 500+ senders?
>>   (using elm or any other way).

> 500 is probably stretching the capabilities of much software.  Most mailers
>pass the recipient addresses to the transport (MTA) as command-line
>arguments, and 500 may exceed the maximum allowed.

> If you are running sendmail as an MTA, the easiest way may be to
>extract the 'From:' lines and make each into a 'Bcc:' line for the
>new message, which you then feed into 'sendmail' with the '-t'
>option (which means that the recipient addresses are taken from the
>'To:', 'Cc:' and 'Bcc:' headers).  Your message can also include a 'To:'
>with a group name - 'To: multiple_recipients:;' - to make sure
>an 'Apparently-To:' is not generated.

You don't have to resort to all this weirdness.  If you're using sendmail,
smail 2.5 or smail 3.1, plus probably many other MTAs, what you really
want to do is prepare a file containing all of the recipients, and use
the "include" mechanism in the "aliases" file for the MTA (in sendmail
and smail 2.5 it's /usr/lib/aliases; I don't know about smail 3.1).

I.e., for the ferret mailing list I have:

    ferret-list-out	:include:/u/clewis/ferrets/mail-list
			:include:/u/clewis/ferrets/anon-list

The files are in the following format:
	e-mail-address (full name)
The full name is optional.

Then, if you send mail to "ferret-list-out", the MUA (mush/elm/mailx/Mail etc.)
doesn't even know about the alias - none of the mail headers have any of the
names in the subscription list.  The MTA, *not* the MUA, does the expansion in memory,
and parcels out groups of the addresses in the command lines to multiple
invocations of uux or tcpip etc.  Smail 2.5 doesn't appear to have any limit
on the number of recipients other than available memory (it mallocs the entries
into a linked list).

In fact, it's often better to invoke the MTA directly rather than using the MUA
to send it, because you have a bit better control of what the headers will
look like.  For example, you want the "From:" line to refer to the list's
logical address, the one people use for sending in individual items.  This
is a copy of the shell script I use to send out mailing list items:

    #	Takes one argument - the item to be sent.
    if [ ! -r "$1" ]
    then
	echo "No such article" >&2
	exit 1
    fi
    #	Check that I've not buggered up the numbering scheme
    if [ -r "articles/$1" -o -r "articles/$1.Z" ]
    then
	echo "Article Clash $1" >&2
	exit 1
    fi
    #	Construct Envelope (the header lines, in a temp file named after our PID)
    echo "Subject: Issue $1" > /tmp/$$
    echo "From: ferret-list@ferret.ocunix.on.ca (Ferret Mailing List)" >> /tmp/$$
    echo "To: ferret-list@ferret.ocunix.on.ca" >> /tmp/$$
    echo "" >> /tmp/$$
    #	Send it: headers, blank line, then the article itself
    cat /tmp/$$ "$1" | smail -R ferret-list-out
    rm -f /tmp/$$
    #	Archive what I just sent
    mv "$1" articles
    compress "articles/$1"

Notice that I construct the Subject:, From: and To: lines myself, tack on
a blank line, and then concatenate the article itself and shove it all
through smail directly.
The destination is the command line argument to smail, not the To: line.
(With sendmail you may have to use an option to inhibit To: line expansion.)

(The -R option to smail 2.5 tells it to reroute all of the addresses instead
of trying to send directly then discovering most of the addresses are not
full paths, and then rerouting.  This is an efficiency concern, plus
the fact that without the -R, smail 2.5 won't multicast unless the addresses
are full bang paths.  Multicast means more than one recipient per uux
invocation.  REAL important with a list of 500 entries!  The ferret list is
about 75, and it multicasts down to 17 individual uux invocations.)

>>2. If I have a file containing some names and email addresses such as
>>   this: 
>>	      xyz@jkjk.jkyu.reyui (John J. Doe)
>>	      Mark L. Lost <apple!mark@eee.dfsjk.jkj>
>>	      Joe!!! jjj@jhdf.434r.er
>>  How can I re-organize the file (using awk, sed, etc...) so that
>>  the email addresses are the first field in every line in the file? 

> Very difficult.  Probably beyond the abilities of awk, sed, etc.  If
>Larry Wall happens to be reading this he may suggest perl.  The trouble is
>that the syntax of RFC822 addresses is quite complex, and as X.400 gateways
>become more common the extreme cases of RFC822 addresses are increasingly
>likely to show up.

Very difficult only if the input is entirely arbitrary and you actually
have to parse the addresses.   However, the first and second addresses are
already in a "standard" form, the first being what you can use directly (at
least in a smail 2.5 alias file).  The second is simple to convert.  Then,
it's a matter of converting all of the other formats into the first one.

The third example is a difficult one to handle simply because
a simple sed script can't tell which of the two tokens is the actual address
because they both have mailing metacharacters.  So, you make a simplifying
assumption: a token with a "@" is a real email address and, failing that,
a token of the form something!something is the real email address
as long as it doesn't contain more than one adjacent "!".

This sed script works for the above sample plus some other forms:

	sed -e '/^\(.*\)<\(.*\)>\(.*\)$/s//\2 \1 \3/' \
	    -e '/^\(.*\)(\(.*\))\(.*\)$/s//\1 \2 \3/' \
	    -e '/^\(.*\) \([^ ][^ ]*@[^ ][^ ]*\)$/s//\2 \1/' \
	    -e '/^\(.*\) \([^! ][^ ]*![^! ][^ ]*\)$/s//\2 \1/' \
	    -e 's/^  *//' \
	    -e 's/  *$//' \
	    -e 's/  */ /g' \
	    -e 's/ / (/' \
	    -e 's/$/)/'

	1) convert <> forms to address first.
	2) remove () from addr (name) forms
	3) Move all tokens with something@something to the beginning
	4) Move all tokens with something!something to the beginning
	   (don't do this for tokens with !!!* )
	5, 6, 7) Strip extraneous blanks
	8, 9) put the () back in.
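
As an illustration, here is the same script wrapped in a shell function and
run on the three sample lines.  The sed commands are copied unchanged from
above; the function name addr_first is hypothetical, and the input is assumed
to have had any leading "From: " and indentation stripped already.

```sh
#!/bin/sh
# addr_first: read name/address lines on stdin, write "address (name)" lines.
# Only the wrapper is new; the sed commands are the ones listed above.
addr_first() {
    sed -e '/^\(.*\)<\(.*\)>\(.*\)$/s//\2 \1 \3/' \
        -e '/^\(.*\)(\(.*\))\(.*\)$/s//\1 \2 \3/' \
        -e '/^\(.*\) \([^ ][^ ]*@[^ ][^ ]*\)$/s//\2 \1/' \
        -e '/^\(.*\) \([^! ][^ ]*![^! ][^ ]*\)$/s//\2 \1/' \
        -e 's/^  *//' \
        -e 's/  *$//' \
        -e 's/  */ /g' \
        -e 's/ / (/' \
        -e 's/$/)/'
}

# The three sample lines from the original question
addr_first <<'EOF'
xyz@jkjk.jkyu.reyui (John J. Doe)
Mark L. Lost <apple!mark@eee.dfsjk.jkj>
Joe!!! jjj@jhdf.434r.er
EOF
# Output:
#   xyz@jkjk.jkyu.reyui (John J. Doe)
#   apple!mark@eee.dfsjk.jkj (Mark L. Lost)
#   jjj@jhdf.434r.er (Joe!!!)
```

One of the cases that will need fixing by hand: a line holding only a bare
address has no blank for step 8 to work on, so step 9 leaves a stray ")" on
the end.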

Yes, it would be a bit easier to program in perl, and easier to get fancier.
But not particularly necessary.  You'll probably end up with a few it
didn't parse correctly, but you can fix them manually.
-- 
Chris Lewis, Phone: (613) 832-0541, Domain: clewis@ferret.ocunix.on.ca
UUCP: ...!cunews!latour!ecicrl!clewis; Ferret Mailing List:
ferret-request@eci386; Psroff (not Adobe Transcript) enquiries:
psroff-request@eci386 or Canada 416-832-0541.  Psroff 3.0 in c.s.u soon!