tkevans@oss670.UUCP (Tim Evans) (05/01/91)
My vendor's proprietary e-mail package, non-standard in almost every way
(including not using sendmail), is generating duplicate headers.  Generally
these are the Subject and/or From headers, and the duplicates are always
consecutive: the mailer emits identical Subject lines one right after the
other.  Since we are connected to the outside world via UUCP, this means we
are doing Bad Things to other people's mail.

Naturally, I'm anxious to fix this, but the vendor, whose contract doesn't
require UUCP support or compliance with RFC-822, isn't.  I'm not able to
fix the mailer myself, but can pass its output through standard
filters--awk, sed, etc.--before it goes out the door.

My first thought was to pass things through 'uniq', but this would also
delete consecutive identical lines in the body (the mailer doesn't
distinguish between header and body).  The probability of consecutive,
identical lines in the body of a mail message seems low, but not low
enough to chance it.

So, can anyone provide a solution that deletes the second (and subsequent)
occurrences of identical lines that are RFC-822-style headers?  I'd prefer
not to use 'perl', as I haven't installed it here yet (Real Soon Now).
--
INTERNET  tkevans%woodb@mimsy.umd.edu
UUCP      ...!{rutgers|ames|uunet}!mimsy!woodb!tkevans
US MAIL   6401 Security Blvd, 2-Q-2 Operations, Baltimore, MD 21235
PHONE     (301) 965-3286
lyndon@cs.athabascau.ca (Lyndon Nerenberg) (05/02/91)
[ Tried mailing this but oss670.uucp was unknown to us ]

In comp.mail.headers you write:

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

You almost answered your own question :-)

Use sed to split the headers and body into separate files.  Run the header
file through sort|uniq, then append the body file.  Note that you will
have to deal with header continuation lines somehow.  A short piece of C
code should handle folding the headers, and unfolding them when you're
done.

Perhaps the easiest way to deal with this would be to write the entire
filter in C.  All you need to do is maintain a linked list of headers you
have seen.  During the scanning phase, if you encounter a header that's
already on the linked list, ignore it (and any possible continuation
lines).  If it's a new header, start up a second linked list of lines
containing the header contents.  If there are continuation lines in the
header, simply append them to the linked list for that header.  This
eliminates the need to fold/spindle/mutilate the header continuation
lines.

Once you've fallen out of the headers, just copy the message body through
and you're done!
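The split-and-rejoin idea above can be sketched with two sed passes (the
sample message and file name here are made up for illustration; note the
caveats that sort|uniq reorders the headers and that folded continuation
lines are not handled):

```shell
# Split-and-rejoin sketch.  Caveats: sort|uniq reorders the headers,
# and folded (continuation) header lines are not handled.
msg=/tmp/msg.$$                         # sample message, name made up
cat > $msg <<'EOF'
Subject: test
Subject: test
From: tim

body line
body line
EOF

sed -n '1,/^$/{/^$/!p;}' $msg | sort | uniq   # headers, deduplicated
echo ''                                       # blank separator line
sed '1,/^$/d' $msg                            # body, passed untouched
rm -f $msg
```

The first sed prints only the lines before the first blank line; the
second deletes everything up to and including it, so duplicate body lines
survive.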
tchrist@convex.COM (Tom Christiansen) (05/02/91)
From the keyboard of lyndon@cs.athabascau.ca (Lyndon Nerenberg):
:[ Tried mailing this but oss670.uucp was unknown to us ]

right, me too.

:In comp.mail.headers you write:
:
:>I'm not able to fix the mailer myself, but can pass its output
:>through standard filters--awk, sed, etc.--before it goes
:>out the door.  My first thought was to pass things through 'uniq',
:>but this would also delete consecutive identical lines in the body (the
:>mailer doesn't distinguish between header and body).  The probability
:>of consecutive, identical lines in the body of mail messages seems
:>low, but not low enough to chance this.
:
:You almost answered your own question :-)
:
:Use sed to split the headers and body into separate files.  Run the header
:file through sort|uniq, then append the body file.  Note that you will
:have to deal with header continuation lines somehow.  A short piece of
:C code should handle folding the headers, and unfolding them when you're
:done.

That's a lot of work!!

:Perhaps the easiest way to deal with this would be to write the entire
:filter in C.  All you need to do is maintain a linked list of headers
:you have seen.  During the scanning phase, if you encounter a header that's
:already on the linked list, ignore it (and any possible continuation
:lines).  If it's a new header, start up a second linked list of lines
:containing the header contents.  If there are continuation lines in the
:header, simply append them to the linked list for that header.  This
:eliminates the need to fold/spindle/mutilate the header continuation
:lines.
:
:Once you've fallen out of the headers, just copy the message body
:through and you're done!

That's a HELLUVA lotta work!  Here's an awk solution:

    #!/bin/awk -f
    /^$/    { body = 1 }
            {
                if (!body) {
                    if (lastline == $0) next
                    lastline = $0
                }
                print
            }

And here's a perl solution:

    perl -ne 'print if (/^$/ .. eof) || $lastline ne $_; $lastline = $_'

If you want solutions for non-consecutive or especially multi-line
headers, ask, but I can lay odds they'll be in perl.  :-)

--tom
rickert@mp.cs.niu.edu (Neil Rickert) (05/02/91)
In article <1991May01.234739.25672@convex.com> tchrist@convex.COM (Tom
Christiansen) writes:

>:>I'm not able to fix the mailer myself, but can pass its output
>:>through standard filters--awk, sed, etc.--before it goes
>:>out the door.  My first thought was to pass things through 'uniq',
>
>That's a lot of work!!

If you look in the C-news package, and particularly in 'tear' and
'anne.jones' (plus an associated awk script), you might find most of the
hard work has been done already.
--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
Northern Illinois Univ.
DeKalb, IL 60115                                +1-815-753-6940
peter@ficc.ferranti.com (Peter da Silva) (05/02/91)
    awk 'BEGIN {h=1; l=""} /^$/ {h=0} {if(h==0 || l!=$0) print; l = $0}'
--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
                 'U`    "Have you hugged your wolf today?"
loren@ural.Eng.Sun.COM (05/04/91)
In comp.mail.headers you write:

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

I expect an AWK script is the best choice if you don't want to use PERL.
I can't remember AWK syntax since I've started using PERL (how quickly I
forget :) but you basically want to delete any adjacent duplicate lines
until the end of the header (first blank line) and then copy the rest of
the file as is.  If you wanted a PERL solution, I would have written it.

Hope this is of some help.

Loren
-----------------------------------------------------------------------------
Mr Loren L. Hart                        The Ada Ace Group, Inc
loren@cup.portal.com                    P.O. Box 36195
                                        San Jose, CA 95158
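The recipe described above (dedupe adjacent lines only until the first
blank line, then pass the rest through) can also be done entirely in sed,
which hasn't come up in this thread yet.  A sketch, using the classic
N/P/D uniq idiom; like the awk filters, it treats continuation lines as
ordinary lines:

```shell
# Dedupe consecutive identical lines, but only before the first blank
# line; from the blank line onward everything passes through untouched.
# The first -e branches past the uniq logic once the body starts; the
# second holds pairs of lines and drops the second of a matched pair.
printf 'Subject: x\nSubject: x\nFrom: a\n\nbody\nbody\n' |
    sed -e '/^$/,$b' -e '$!N;/^\(.*\)\n\1$/!P;D'
```

Run on the sample input above, this keeps one Subject line, the From
line, and both duplicate body lines.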
weimer@garden.ssd.kodak.com (Gary Weimer (253-7796)) (05/07/91)
>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

Since I haven't seen a non-perl solution that works yet, here's mine.
Actually, I have two (don't ask me why).  The second is more robust and
handles all the examples in the test file.

============ Start test file =======================
This is the first line
	First continued line
	Another continued line
	Another continued line with extras
A repeated line
A repeated line
A repeated line with continuation
	A repeated line with continuation
One more line

Body of message
Body of message
More lines

2nd paragraph
Body of message
Body of message
More lines
============ End test file =======================
============ Start 1st solution file =======================
#!/bin/awk -f
# assumes first line is not blank (doesn't modify header if it is)
# assumes continuation lines do not make a "line" unique, i.e.
#	A line followed by
#		a continuation line
# is a "duplicate" of:
#	A line followed by
#		a different continuation line
BEGIN { cont = "\t" }			# tab is continuation character
/^$/ { inbody = 1 }			# first blank line ends the header
inbody { print $0; next }		# copy the body through untouched
substr($0, 1, 1) == cont {		# don't print continuation line if first
	if (!del) { print $0 }		# part of line was a repeat
	next }
prev == $0 {				# this and any continuation is a repeat
	del = 1; next }
{					# print line since it is not a repeat
	del = 0; print $0; prev = $0 }
============ End 1st solution file =======================
============ Start 2nd solution file =======================
#!/bin/awk -f
# skips blank lines at start of file (can be printed)
# compares continuation lines
BEGIN { contflg = "\t" }		# tab is continuation character
{ if (!fndhdr) {			# handle blank lines before header
	if ($0 == "") {
		# print $0		# uncomment to print blank lines before header
		next }
	else { fndhdr = 1 } } }
/^$/ { inbody = 1 }			# first blank line ends the header
inbody { print $0; next }		# copy the body through untouched
substr($0, 1, 1) == contflg {
	if (nm != 0 && nm < np && prev[nm+1] == $0) {	# still seems to be a repeat
		nm++ }
	else {				# line is not a repeat
		if (nm == 0) {		# we already knew it was not a repeat
			np++ }
		else {
			for (i = 1; i <= nm; i++)	# print what we thought was a repeat
				print prev[i]
			np = nm + 1; nm = 0 }
		print $0; prev[np] = $0 }	# keep track of continuation lines
	next }
prev[1] == $0 {				# assume line is a repeat
	nm = 1; next }
{					# print line since it is not a repeat
	nm = 0; print $0; np = 1; prev[np] = $0 }
============ End 2nd solution file =======================