lin@CS.WMICH.EDU (Lite Lin) (03/04/91)
This is a simple question, but I don't see it in "Frequently Asked
Questions", so...

I'm trying to identify all the email addresses in email messages, i.e.,
patterns with the format user@node. Now I can use grep/sed/awk to find
those lines containing user@node, but I can't figure out from the manual
how or whether I can have access to the matching pattern (it can be
anywhere in the line, and it doesn't have to be surrounded by spaces,
i.e., it's not necessarily a separate "field" in awk). If there is no
way to do that in awk, I guess I'll do it with lex (yytext holds the
matching pattern).

Any response will be appreciated.

Thanks,
Lite
tchrist@convex.COM (Tom Christiansen) (03/04/91)
From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
: This is a simple question, but I don't see it in "Frequently Asked
:Questions", so...
: I'm trying to identify all the email addresses in email messages, i.e.,
:patterns with the format user@node. Now I can use grep/sed/awk to find
:those lines containing user@node, but I can't figure out from the manual
:how or whether I can have access to the matching pattern (it can be
:anywhere in the line, and it doesn't have to be surrounded by spaces,
:i.e., it's not necessarily a separate "field" in awk). If there is no
:way to do that in awk, I guess I'll do it with lex (yytext holds the
:matching pattern).

Well, I wouldn't try to do it in awk, but that doesn't mean we have to
jump all the way to a C program!

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'

That does a fairly good job, but there are a lot of duplicates, so let's
not print any we've already seen:

    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n" unless $seen{$1}++/ge;'

A more sordid approach might be:

    #!/usr/bin/perl
    while (<>) {
        s/([-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge;
    }
    print join("\n", sort keys %seen), "\n";

But you've got a basic problem in that you can't distinguish message-ids
from real addresses. A message_id@host looks a lot like (in some cases
indistinguishably so) a user_id@host. Here's a half-hearted attempt to
weed out a few strays:

    #!/usr/bin/perl
    while (<>) {
        s/([a-zA-Z][-%:.\w]+@[-@%:.\w]+)/$seen{$1}++/ge;
    }
    print join("\n", grep(!/^(AA)?\d/, sort keys %seen)), "\n";

--tom

ps: dunno what all this ``node'' talk is. My manual talks about nodes
in the filesystem section, hosts in the networking section. Or do you
mail directly to i-nodes? :-)
--
"UNIX was not designed to stop you from doing stupid things, because
that would also stop you from doing clever things." -- Doug Gwyn

Tom Christiansen                tchrist@convex.com      convex!tchrist
nolan@tssi.UUCP (Michael Nolan) (03/05/91)
lin@CS.WMICH.EDU (Lite Lin) writes:
> This is a simple question, but I don't see it in "Frequently Asked
>Questions", so...
> I'm trying to identify all the email addresses in email messages, i.e.,
>patterns with the format user@node. Now I can use grep/sed/awk to find
>those lines containing user@node, but I can't figure out from the manual
>how or whether I can have access to the matching pattern (it can be
>anywhere in the line, and it doesn't have to be surrounded by spaces,
>i.e., it's not necessarily a separate "field" in awk).

If you have nawk or gawk, use the match function, which sets two variables:

    RSTART  - the first position in the string matched by the pattern.
    RLENGTH - the length of the string matching the pattern.

A pattern to match any single mail address might be rather ugly, though.
If you assume all the following:

1. Upper case and lower case letters are permitted
2. Dash, underscore, and period are permitted
3. There is only one @ [I'm not sure this assumption is valid, though!]
4. There may be several ! or % in the 'user' portion
5. No commas or spaces

Then that gives a pattern something like this:

    [a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+

I've escaped the dash; I suppose it might be necessary to escape other
characters as well. Have I left anything out that might occur in strange
but otherwise valid mail addresses?

------------------------------------------------------------------------------
Michael Nolan                        "Software means never having
Tailored Software Services, Inc.      to say you're finished."
Lincoln, Nebraska (402) 423-1490      --J. D. Hildebrand in UNIX REVIEW
UUCP:     tssi!nolan (or try sparky!dsndata!tssi!nolan)
Internet: nolan@helios.unl.edu (if you can't get the other address to work)
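For what it's worth, here is a minimal sketch of the match() approach
described above; it is not from the thread. It calls match() repeatedly
on the remainder of each line and prints substr(line, RSTART, RLENGTH),
so several addresses on one line are all reported. The function name
extract_addrs is made up for illustration, the dash is moved to the end
of each bracket expression rather than escaped, and a "new awk" (nawk,
gawk, or any POSIX awk) is assumed.

```shell
#!/bin/sh
# extract_addrs: hypothetical helper name, not from the thread.
# Prints every user@host-like substring of the input, one per line.
extract_addrs() {
    awk '{
        line = $0
        # match() sets RSTART (1-based start) and RLENGTH (match length)
        while (match(line, /[a-zA-Z0-9._%!-]+@[a-zA-Z0-9._-]+/)) {
            print substr(line, RSTART, RLENGTH)
            # keep scanning in the text after the match just printed
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Something like

    printf 'From: lin@CS.WMICH.EDU, cc tssi!nolan@helios.unl.edu\n' | extract_addrs

should print the two addresses on separate lines. Like the pattern
above, it cannot tell a message-id from a real address, and it will
swallow a sentence-ending period that directly follows an address.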
louk@tslwat.UUCP (Lou Kates) (03/06/91)
In article <1991Mar04.051048.5864@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>From the keyboard of lin@CS.WMICH.EDU (Lite Lin):
>: I'm trying to identify all the email addresses in email messages, i.e.,
>:patterns with the format user@node. Now I can use grep/sed/awk to find
>:those lines containing user@node, but I can't figure out from the manual
>:how or whether I can have access to the matching pattern (it can be
>:anywhere in the line, and it doesn't have to be surrounded by spaces,
>:i.e., it's not necessarily a separate "field" in awk). If there is no
>:way to do that in awk, I guess I'll do it with lex (yytext holds the
>:matching pattern).
>
>Well, I wouldn't try to do it in awk, but that doesn't mean we have to
>jump all the way to a C program!
>
>    perl -ne 's/([-.\w]+@[-.\w]+)/print "$1\n"/ge;'

The following awk program looks for expressions of the form
word@word where word contains only letters, numbers and dots and
the field separator is anything except letters, numbers, dots and
@. You can change the regular expressions in order to vary the
effect:

BEGIN { FS = "[^.a-zA-Z0-9@]+";
        word = "[.a-zA-Z0-9]+";
        addr = "^" word "@" word "$"
      }
{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }

Lou Kates, Teleride Sage Ltd., louk%tslwat@watmath.waterloo.edu
tchrist@convex.COM (Tom Christiansen) (03/06/91)
From the keyboard of louk@tslwat.UUCP (Lou Kates):
:The following awk program looks for expressions of the form
:word@word where word contains only letters, numbers and dots and
:the field separator is anything except letters, numbers, dots and
:@. You can change the regular expressions in order to vary the
:effect:
:
:BEGIN { FS = "[^.a-zA-Z0-9@]+";
:        word = "[.a-zA-Z0-9]+";
:        addr = "^" word "@" word "$"
:      }
:{ for(i=1; i<=NF; i++) if ($i ~ addr) print $i }

    $ awk -f foo.awk < file
    awk: syntax error near line 5
    awk: illegal statement near line 5

You meant a nawk program, not an awk program.

You definitely need to have more characters in there -- consider
folks with dashes in their hostnames. That's why my regexp was
more complicated.

--tom
--
I get so tired of utilities with arbitrary, undocumented,
compiled-in limits. Don't you?

Tom Christiansen                tchrist@convex.com      convex!tchrist
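One possible way to act on both points above, offered as a sketch
rather than as anyone's actual fix: keep the quoted field-splitting
idea, widen the character classes to allow dashes (plus underscore,
% and ! on the user side), and borrow the seen-array trick from the
earlier Perl one-liner to drop duplicates. The names list_addrs, user
and host are invented here, and the script assumes a "new awk" (nawk,
gawk, or a POSIX awk), since it relies on a regular-expression FS.

```shell
#!/bin/sh
# list_addrs: hypothetical helper name, not from the thread.
# Splits each line on anything that cannot be part of an address,
# then prints each field that looks like user@host, once only.
list_addrs() {
    awk '
    BEGIN {
        FS   = "[^a-zA-Z0-9._%!@-]+"   # separator: any non-address character
        user = "[a-zA-Z0-9._%!-]+"     # assumed user-part characters
        host = "[a-zA-Z0-9._-]+"       # assumed host-part characters
        addr = "^" user "@" host "$"
    }
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ addr && !seen[$i]++)   # print first occurrence only
                print $i
    }' "$@"
}
```

Usage would be along the lines of `list_addrs mailbox` or
`some_command | list_addrs`. A field containing two @ signs is
rejected outright, which may or may not be what you want.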
peter@doe.utoronto.ca (Peter Mielke) (03/07/91)
In <1994@tssi.UUCP>, tssi!nolan writes:
> lin@CS.WMICH.EDU (Lite Lin) writes:
> > This is a simple question, but I don't see it in "Frequently Asked
> >Questions", so...
> > I'm trying to identify all the email addresses in email messages, i.e.,
> >patterns with the format user@node. Now I can use grep/sed/awk to find
> >those lines containing user@node, but I can't figure out from the manual
> >how or whether I can have access to the matching pattern (it can be
> >anywhere in the line, and it doesn't have to be surrounded by spaces,
> >i.e., it's not necessarily a separate "field" in awk).
>
> [stuff about awk or gawk]
>
> Then that gives a pattern something like this
>
> [a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+
>
> I've escaped the dash, I suppose it might be necessary to escape other
> characters as well. Have I left anything out that might occur in strange
> but otherwise valid mail addresses?

Or you could use sed to transform the address when it matches, e.g.:

sed -e 's/\([a-zA-Z0-9.\-_%!]*\)@\([a-zA-Z0-9.\-_]*\)/machine: \2 userid: \1/'

--
Peter Mielke                            peter@doe.utoronto.ca
Dictionary of Old English Project       utgpu!utzoo!utdoe!peter
University of Toronto
tchrist@convex.COM (Tom Christiansen) (03/08/91)
From the keyboard of peter@doe.utoronto.ca (Peter Mielke):
:sed -e 's/\([a-zA-Z0-9.\-_%!]*\)@\([a-zA-Z0-9.\-_]*\)/machine: \2 userid: \1/'

Did you actually try that? Does it work for multiple addrs on the
same line? Does it print out a nice report free of the extra text?
I don't think so.

--tom
--
I get so tired of utilities with arbitrary, undocumented,
compiled-in limits. Don't you?

Tom Christiansen                tchrist@convex.com      convex!tchrist
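A rough sketch of how the sed idea might be made to pass those three
questions, under assumptions that are mine and not the thread's: tr
first puts every run of address-like characters on its own line, so
multiple addresses per line are handled, and sed -n with the p flag
prints only the lines it managed to rewrite, so the surrounding text
is dropped. The function name report_addrs is invented, the dash sits
at the end of each bracket expression instead of being escaped, and
\{1,\} replaces * so that an empty user or host is not accepted.

```shell
#!/bin/sh
# report_addrs: hypothetical helper name, not from the thread.
# Prints "machine: HOST userid: USER" for each user@host-like token.
report_addrs() {
    cat "$@" |
    # turn every run of non-address characters into a single newline
    tr -cs 'a-zA-Z0-9._%!@-' '\n' |
    # rewrite whole-line user@host tokens; -n plus p drops everything else
    sed -n 's/^\([a-zA-Z0-9._%!-]\{1,\}\)@\([a-zA-Z0-9._-]\{1,\}\)$/machine: \2 userid: \1/p'
}
```

For example,

    printf 'mail lin@CS.WMICH.EDU, peter@doe.utoronto.ca today\n' | report_addrs

should give one "machine: ... userid: ..." line per address. It still
reports message-ids as if they were addresses, and it does not remove
duplicates; piping the result through sort -u would be one way to do
that.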