lin@CS.WMICH.EDU (Lite Lin) (03/04/91)
This is a simple question, but I don't see it in "Frequently Asked
Questions", so...
I'm trying to identify all the email addresses in email messages, i.e.,
patterns with the format user@node. Now I can use grep/sed/awk to find
those lines containing user@node, but I can't figure out from the manual
how or whether I can have access to the matching pattern (it can be
anywhere in the line, and it doesn't have to be surrounded by spaces,
i.e., it's not necessarily a separate "field" in awk). If there is no
way to do that in awk, I guess I'll do it with lex (yytext holds the
matching pattern).
Any response will be appreciated.
Thanks,
Lite
tchrist@convex.COM (Tom Christiansen) (03/04/91)
From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
: This is a simple question, but I don't see it in "Frequently Asked
:Questions", so...
: I'm trying to identify all the email addresses in email messages, i.e.,
:patterns with the format user@node. Now I can use grep/sed/awk to find
:those lines containing user@node, but I can't figure out from the manual
:how or whether I can have access to the matching pattern (it can be
:anywhere in the line, and it doesn't have to be surrounded by spaces,
:i.e., it's not necessarily a separate "field" in awk). If there is no
:way to do that in awk, I guess I'll do it with lex (yytext holds the
:matching pattern).
Well, I wouldn't try to do it in awk, but that doesn't mean we have to
jump all the way to a C program!
perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'
That does a fairly good job, but there are a lot of duplicates,
so let's not print any we've already seen:
perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n" unless $seen{$1}++/ge;'
A more sordid approach might be:
#!/usr/bin/perl
while (<>) { s/([-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; }
print join("\n", sort keys %seen), "\n";
But you've got a basic problem in that you can't distinguish
message-ids from real addresses. A message_id@host looks a lot like
(in some cases indistinguishably so) a user_id@host.
Here's a half-hearted attempt to weed out a few strays:
#!/usr/bin/perl
while (<>) { s/([a-zA-Z][-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge; }
print join("\n", grep(!/^(AA)?\d/, sort keys %seen)), "\n";
--tom
ps: dunno what all this ``node'' talk is. My manual talks
about nodes in the filesystem section, hosts in the
networking section. Or do you mail directly to i-nodes? :-)
--
"UNIX was not designed to stop you from doing stupid things, because
that would also stop you from doing clever things." -- Doug Gwyn
Tom Christiansen tchrist@convex.com convex!tchrist
nolan@tssi.UUCP (Michael Nolan) (03/05/91)
lin@CS.WMICH.EDU (Lite Lin) writes:
> This is a simple question, but I don't see it in "Frequently Asked
>Questions", so...
> I'm trying to identify all the email addresses in email messages, i.e.,
>patterns with the format user@node. Now I can use grep/sed/awk to find
>those lines containing user@node, but I can't figure out from the manual
>how or whether I can have access to the matching pattern (it can be
>anywhere in the line, and it doesn't have to be surrounded by spaces,
>i.e., it's not necessarily a separate "field" in awk).
If you have nawk or gawk, use the match function, which sets two variables:
RSTART - the first position in the string matched by the pattern.
RLENGTH - the length of the string matching the pattern.
A pattern to match any single mail address might be rather ugly, though.
If you assume all the following:
1. Upper case and lower case letters are permitted
2. Dash, underscore, and period are permitted
3. There is only one @ [I'm not sure this assumption is valid, though!]
4. There may be several ! or % in the 'user' portion
5. No commas or spaces
Then that gives a pattern something like this:
[a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+
I've escaped the dash; I suppose it might be necessary to escape other
characters as well. Have I left anything out that might occur in strange
but otherwise valid mail addresses?
------------------------------------------------------------------------------
Michael Nolan                      "Software means never having
Tailored Software Services, Inc.    to say you're finished."
Lincoln, Nebraska (402) 423-1490    --J. D. Hildebrand in UNIX REVIEW
UUCP: tssi!nolan (or try sparky!dsndata!tssi!nolan)
Internet: nolan@helios.unl.edu (if you can't get the other address to work)
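A minimal sketch of the match() approach described above, assuming a POSIX-style awk (nawk, gawk, or mawk) rather than the original awk. The function name extract_addrs and the sample line are made up for illustration. The hyphen is placed last in each bracket expression so that it is literal without a backslash. Since match() reports only the leftmost hit, the loop chops off the matched part and searches the remainder, which picks up several addresses on one line.

```shell
# extract_addrs (hypothetical name): print every user@host-looking
# token found on standard input, one per line.
extract_addrs() {
  awk '
  {
    line = $0
    # match() sets RSTART (start position) and RLENGTH (match length).
    while (match(line, /[A-Za-z0-9._%!-]+@[A-Za-z0-9._-]+/)) {
      print substr(line, RSTART, RLENGTH)
      # Continue searching after the text just matched.
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

# Sample input with two addresses on one line.
printf '%s\n' 'To: lin@CS.WMICH.EDU, tssi!nolan@helios.unl.edu (two on one line)' |
  extract_addrs
```

Like any pattern of this kind, it cannot tell a message-id from an address, and a period that ends a sentence right after an address is swallowed into the match.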
louk@tslwat.UUCP (Lou Kates) (03/06/91)
In article <1991Mar04.051048.5864@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
>: I'm trying to identify all the email addresses in email messages, i.e.,
>:patterns with the format user@node. Now I can use grep/sed/awk to find
>:those lines containing user@node, but I can't figure out from the manual
>:how or whether I can have access to the matching pattern (it can be
>:anywhere in the line, and it doesn't have to be surrounded by spaces,
>:i.e., it's not necessarily a separate "field" in awk). If there is no
>:way to do that in awk, I guess I'll do it with lex (yytext holds the
>:matching pattern).
>
>Well, I wouldn't try to do it in awk, but that doesn't mean we have to
>jump all the way to a C program!
>
> perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'
The following awk program looks for expressions of the form
word@word where word contains only letters, numbers and dots and
the field separator is anything except letters, numbers, dots and
@. You can change the regular expressions in order to vary the
effect:

BEGIN { FS = "[^.a-zA-Z0-9@]+";
 word = "[.a-zA-Z0-9]+";
 addr = "^" word "@" word "$"
 }
{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }

Lou Kates, Teleride Sage Ltd., louk%tslwat@watmath.waterloo.edu
tchrist@convex.COM (Tom Christiansen) (03/06/91)
From the keyboard of louk@tslwat.UUCP (Lou Kates):
:The following awk program looks for expressions of the form
:word@word where word contains only letters, numbers and dots and
:the field separator is anything except letters, numbers, dots and
:@. You can change the regular expressions in order to vary the
:effect:
:
:BEGIN { FS = "[^.a-zA-Z0-9@]+";
: word = "[.a-zA-Z0-9]+";
: addr = "^" word "@" word "$"
: }
:{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }
$ awk -f foo.awk < file
awk: syntax error near line 5
awk: illegal statement near line 5
You meant a nawk program, not an awk program.
You definitely need to have more characters in there -- consider
folks with dashes in their hostnames. That's why my regexp was
more complicated.
--tom
--
I get so tired of utilities with arbitrary, undocumented,
compiled-in limits. Don't you?
Tom Christiansen tchrist@convex.com convex!tchrist
peter@doe.utoronto.ca (Peter Mielke) (03/07/91)
In <1994@tssi.UUCP>, tssi!nolan writes:
> lin@CS.WMICH.EDU (Lite Lin) writes:
> > This is a simple question, but I don't see it in "Frequently Asked
> >Questions", so...
> > I'm trying to identify all the email addresses in email messages, i.e.,
> >patterns with the format user@node. Now I can use grep/sed/awk to find
> >those lines containing user@node, but I can't figure out from the manual
> >how or whether I can have access to the matching pattern (it can be
> >anywhere in the line, and it doesn't have to be surrounded by spaces,
> >i.e., it's not necessarily a separate "field" in awk).
>
> [stuff about awk or gawk]
>
> Then that gives a pattern something like this
>
> [a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+
>
> I've escaped the dash, I suppose it might be necessary to escape other
> characters as well. Have I left anything out that might occur in strange
> but otherwise valid mail addresses?
Or you could use sed to transform the address when it matches, e.g.:
sed -e 's/\([a-zA-Z0-9.\-_%!]*\)@\([a-zA-Z0-9.\-_]*\)/machine: \2 userid: \1/'
--
Peter Mielke                        peter@doe.utoronto.ca
Dictionary of Old English Project   utgpu!utzoo!utdoe!peter
University of Toronto
tchrist@convex.COM (Tom Christiansen) (03/08/91)
From the keyboard of peter@doe.utoronto.ca (Peter Mielke):
:sed -e 's/\([a-zA-Z0-9.\-_%!]*\)@\([a-zA-Z0-9.\-_]*\)/machine: \2 userid: \1/'
Did you actually try that? Does it work for multiple addrs on the same
line? Does it print out a nice report free of the extra text?
I don't think so.
--tom
--
I get so tired of utilities with arbitrary, undocumented,
compiled-in limits. Don't you?
Tom Christiansen tchrist@convex.com convex!tchrist
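For the two requirements raised above (several addresses on one line, and a report free of the surrounding text), one possible sketch drops sed in favor of a tr/grep/sort pipeline. This is a swapped-in technique, not anything posted in the thread. The function name list_addrs is hypothetical, and the character classes follow the assumptions listed earlier in the thread (letters, digits, dot, underscore, dash, plus % and ! in the user part).

```shell
# list_addrs (hypothetical name): unique user@host-looking tokens
# from standard input, sorted, one per line.
list_addrs() {
  # 1. Replace every run of characters that cannot appear in an
  #    address with a single newline, so each candidate token
  #    ends up on its own line.
  # 2. Keep only tokens of the form word@word.
  # 3. Sort and drop duplicates.
  tr -cs 'A-Za-z0-9._%!@-' '\n' |
    grep -E '^[A-Za-z0-9._%!-]+@[A-Za-z0-9._-]+$' |
    sort -u
}

# Usage (mailbox is a placeholder file name):
#   list_addrs < mailbox
```

Message-ids still pass through, so the caveat about telling them apart from real addresses applies here as well.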