[comp.unix.questions] Pattern matching with awk

lin@CS.WMICH.EDU (Lite Lin) (03/04/91)

  This is a simple question, but I don't see it in "Frequently Asked
Questions", so...
  I'm trying to identify all the email addresses in email messages, i.e.,
patterns with the format user@node.  Now I can use grep/sed/awk to find
those lines containing user@node, but I can't figure out from the manual
how or whether I can have access to the matching pattern (it can be
anywhere in the line, and it doesn't have to be surrounded by spaces,
i.e., it's not necessarily a separate "field" in awk).  If there is no
way to do that in awk, I guess I'll do it with lex (yytext holds the
matching pattern).
  Any response will be appreciated.
  Thanks,
	Lite

tchrist@convex.COM (Tom Christiansen) (03/04/91)

From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
:  This is a simple question, but I don't see it in "Frequently Asked
:Questions", so...
:  I'm trying to identify all the email addresses in email messages, i.e.,
:patterns with the format user@node.  Now I can use grep/sed/awk to find
:those lines containing user@node, but I can't figure out from the manual
:how or whether I can have access to the matching pattern (it can be
:anywhere in the line, and it doesn't have to be surrounded by spaces,
:i.e., it's not necessarily a separate "field" in awk).  If there is no
:way to do that in awk, I guess I'll do it with lex (yytext holds the
:matching pattern).

Well, I wouldn't try to do it in awk, but that doesn't mean we have to 
jump all the way to a C program!  

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'

That does a fairly good job, but there are a lot of duplicates, 
so let's not print any we've already seen:

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n" unless $seen{$1}++/ge;'

A more sordid approach might be:

    #!/usr/bin/perl
    while (<>) { s/([-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; } 
    print join("\n", sort keys %seen), "\n";

But you've got a basic problem in that you can't distinguish 
message-ids from real addresses.  A message_id@host looks a lot
like a user_id@host (in some cases indistinguishably so).

Here's a half-hearted attempt to weed out a few strays:

    #!/usr/bin/perl
    while (<>) { s/([a-zA-Z][-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; } 
    print join("\n", grep(!/^(AA)?\d/, sort keys %seen)), "\n";

--tom

ps: dunno what all this ``node'' talk is.  My manual talks 
    about nodes in the filesystem section, hosts in the
    networking section.  Or do you mail directly to i-nodes? :-)
--
"UNIX was not designed to stop you from doing stupid things, because
 that would also stop you from doing clever things." -- Doug Gwyn

 Tom Christiansen                tchrist@convex.com      convex!tchrist

nolan@tssi.UUCP (Michael Nolan) (03/05/91)

lin@CS.WMICH.EDU (Lite Lin) writes:


>  This is a simple question, but I don't see it in "Frequently Asked
>Questions", so...
>  I'm trying to identify all the email addresses in email messages, i.e.,
>patterns with the format user@node.  Now I can use grep/sed/awk to find
>those lines containing user@node, but I can't figure out from the manual
>how or whether I can have access to the matching pattern (it can be
>anywhere in the line, and it doesn't have to be surrounded by spaces,
>i.e., it's not necessarily a separate "field" in awk).

If you have nawk or gawk, use the match function, which sets two variables:  

RSTART - the position (counting from 1) where the matched substring
         begins, or 0 if there is no match.
RLENGTH - the length of the matched substring, or -1 if there is no match.

A pattern to match any single mail address might be rather ugly, though.
If you assume all the following:

1.  Upper case and lower case letters are permitted
2.  Dash, underscore, and period are permitted
3.  There is only one @ [I'm not sure this assumption is valid, though!]
4.  There may be several ! or % in the 'user' portion
5.  No commas or spaces 

Then that gives a pattern something like this

[a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+

I've escaped the dash, I suppose it might be necessary to escape other
characters as well.  Have I left anything out that might occur in strange
but otherwise valid mail addresses?
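
A minimal sketch of the match() loop described above, for anyone who wants
to try it.  It assumes a POSIX-style awk (nawk, gawk, or similar).  The
function name extract_addrs and the sample input are made up for
illustration.  The pattern is the one above, except that the dash sits at
the end of each bracket expression, where it is literal without an escape.

```shell
#!/bin/sh
# extract_addrs: hypothetical helper name, not something from the thread.
# Prints every user@host-like token found on standard input, one per line.
extract_addrs() {
    awk '
    {
        line = $0
        # match() sets RSTART (1-based start, 0 if none) and RLENGTH.
        while (match(line, /[a-zA-Z0-9._%!-]+@[a-zA-Z0-9._-]+/)) {
            print substr(line, RSTART, RLENGTH)
            # Drop everything up to the end of this match and look again.
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

# Sample input with two addresses on one line (illustrative only).
printf '%s\n' 'From: tssi!nolan@helios.unl.edu, cc: peter@doe.utoronto.ca' | extract_addrs
```

On the sample line this prints tssi!nolan@helios.unl.edu and then
peter@doe.utoronto.ca.  Because "." is allowed in the host part, an address
at the end of a sentence will pick up the trailing period.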
------------------------------------------------------------------------------
Michael Nolan                              "Software means never having
Tailored Software Services, Inc.            to say you're finished."       
Lincoln, Nebraska (402) 423-1490            --J. D. Hildebrand in UNIX REVIEW
UUCP:      tssi!nolan (or try sparky!dsndata!tssi!nolan)
Internet:  nolan@helios.unl.edu (if you can't get the other address to work)

louk@tslwat.UUCP (Lou Kates) (03/06/91)

In article <1991Mar04.051048.5864@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
>:  I'm trying to identify all the email addresses in email messages, i.e.,
>:patterns with the format user@node.  Now I can use grep/sed/awk to find
>:those lines containing user@node, but I can't figure out from the manual
>:how or whether I can have access to the matching pattern (it can be
>:anywhere in the line, and it doesn't have to be surrounded by spaces,
>:i.e., it's not necessarily a separate "field" in awk).  If there is no
>:way to do that in awk, I guess I'll do it with lex (yytext holds the
>:matching pattern).
>
>Well, I wouldn't try to do it in awk, but that doesn't mean we have to 
>jump all the way to a C program!  
>
>    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'

The following awk program looks for expressions of the form
word@word where word contains only letters, numbers and dots and
the field separator is anything except letters, numbers, dots and
@. You can change the regular expressions in order to vary the
effect:

BEGIN { FS = "[^.a-zA-Z0-9@]+"; 
	word = "[.a-zA-Z0-9]+";  
	addr = "^" word "@" word "$" 
      }
{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }

Lou Kates, Teleride Sage Ltd., louk%tslwat@watmath.waterloo.edu

tchrist@convex.COM (Tom Christiansen) (03/06/91)

From the keyboard of louk@tslwat.UUCP (Lou Kates):
:The following awk program looks for expressions of the form
:word@word where word contains only letters, numbers and dots and
:the field separator is anything except letters, numbers, dots and
:@. You can change the regular expressions in order to vary the
:effect:
:
:BEGIN { FS = "[^.a-zA-Z0-9@]+"; 
:	word = "[.a-zA-Z0-9]+";  
:	addr = "^" word "@" word "$" 
:      }
:{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }

$ awk -f foo.awk < file
awk: syntax error near line 5
awk: illegal statement near line 5

You meant a nawk program, not an awk program.  

You definitely need to have more characters in there -- consider
folks with dashes in their hostnames.  That's why my regexp was
more complicated.
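
As a rough sketch of what adding characters might look like, here is the
quoted program with "-" allowed in both halves.  Treating "-" as the only
missing character is an assumption made for the example, and split_addrs is
a made-up name.  Like the quoted program, it needs an awk that accepts a
regular expression as FS (nawk, gawk, or any POSIX awk).

```shell
#!/bin/sh
# split_addrs: hypothetical name.  Same idea as the quoted nawk program,
# with "-" added to the allowed characters (an assumption; adjust to taste).
split_addrs() {
    awk '
    BEGIN {
        FS   = "[^.a-zA-Z0-9@-]+"   # anything outside the allowed set separates fields
        word = "[.a-zA-Z0-9-]+"     # "-" comes last in the brackets, so it is literal
        addr = "^" word "@" word "$"
    }
    # Print each field that is exactly word@word.
    { for (i = 1; i <= NF; i++) if ($i ~ addr) print $i }'
}

# Sample input (illustrative only).
printf '%s\n' 'cc: tchrist@convex.com, someone@my-host.example.com' | split_addrs
```

Characters such as "%" and "!" still act as separators here, so a
bang-path or percent-hack address is cut down to the part next to the @.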


--tom
--
	I get so tired of utilities with arbitrary, undocumented,
	compiled-in limits.  Don't you?

Tom Christiansen		tchrist@convex.com	convex!tchrist

peter@doe.utoronto.ca (Peter Mielke) (03/07/91)

In <1994@tssi.UUCP>, tssi!nolan writes:
> lin@CS.WMICH.EDU (Lite Lin) writes:
> >  This is a simple question, but I don't see it in "Frequently Asked
> >Questions", so...
> >  I'm trying to identify all the email addresses in email messages, i.e.,
> >patterns with the format user@node.  Now I can use grep/sed/awk to find
> >those lines containing user@node, but I can't figure out from the manual
> >how or whether I can have access to the matching pattern (it can be
> >anywhere in the line, and it doesn't have to be surrounded by spaces,
> >i.e., it's not necessarily a separate "field" in awk).
> 
> [stuff about awk or gawk]
> 
> Then that gives a pattern something like this
> 
> [a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+
> 
> I've escaped the dash, I suppose it might be necessary to escape other
> characters as well.  Have I left anything out that might occur in strange
> but otherwise valid mail addresses?

Or you could use sed to transform the address when it matches, e.g.:

sed -e 's/\([a-zA-Z0-9._%!-]*\)@\([a-zA-Z0-9._-]*\)/machine: \2 userid: \1/'

-- 
Peter Mielke                                    peter@doe.utoronto.ca
Dictionary of Old English Project               utgpu!utzoo!utdoe!peter
University of Toronto

tchrist@convex.COM (Tom Christiansen) (03/08/91)

From the keyboard of peter@doe.utoronto.ca (Peter Mielke):
:sed -e 's/\([a-zA-Z0-9._%!-]*\)@\([a-zA-Z0-9._-]*\)/machine: \2 userid: \1/'

Did you actually try that?  Does it work for multiple addrs on the same
line?  Does it print out a nice report free of the extra text?

I don't think so.
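
For comparison, a sketch of one way to get both properties (every address
on a line, and nothing but the report) is to loop with awk's match()
instead of a single sed substitution.  This assumes a POSIX-style awk; the
name report_addrs and the sample input are invented for the example, and
the character classes are only a guess at what counts as an address.

```shell
#!/bin/sh
# report_addrs: hypothetical helper, not something posted in this thread.
# For every user@host-like token on stdin, print "machine: HOST userid: USER"
# and nothing else, even when a line holds several addresses.
report_addrs() {
    awk '
    {
        line = $0
        while (match(line, /[a-zA-Z0-9._%!-]+@[a-zA-Z0-9._-]+/)) {
            addr = substr(line, RSTART, RLENGTH)
            at = index(addr, "@")    # position of the single @ in this token
            printf "machine: %s userid: %s\n", substr(addr, at + 1), substr(addr, 1, at - 1)
            line = substr(line, RSTART + RLENGTH)    # continue after this match
        }
    }'
}

# Sample input with two addresses and some extra text (illustrative only).
printf '%s\n' 'To: peter@doe.utoronto.ca, tchrist@convex.com (no other text wanted)' | report_addrs
```

On the sample line this prints "machine: doe.utoronto.ca userid: peter"
followed by "machine: convex.com userid: tchrist".  Lines with no address
produce no output.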

--tom