[comp.unix.questions] Eliminating Duplicate Mail Headers

tkevans@oss670.UUCP (Tim Evans) (05/01/91)

My vendor's proprietary e-mail package, non-standard in almost every
way (including not using sendmail), is generating duplicate headers.
Generally these are the Subject and/or From headers, and the
duplicates are always consecutive: two identical Subject lines appear
one immediately after the other.

Since we are connected to the outside world via UUCP, this means
we are doing Bad Things to other people's mail.  Naturally, I'm
anxious to fix this, but the vendor, whose contract doesn't require
UUCP support or compliance with RFC-822, isn't.

I'm not able to fix the mailer myself, but can pass its output
through standard filters--awk, sed, etc.--before it goes
out the door.  My first thought was to pass things through 'uniq',
but this would also delete consecutive identical lines in the body (the
mailer doesn't distinguish between header and body).  The probability
of consecutive, identical lines in the body of mail messages seems
low, but not low enough to chance this.

So, can anyone provide a solution that would delete the second (and
subsequent?) occurrences of identical lines that are RFC-822-style
headers?  I'd prefer not using 'perl' as I haven't installed it here
yet (Real Soon Now).
-- 
INTERNET	tkevans%woodb@mimsy.umd.edu
UUCP 		...!{rutgers|ames|uunet}!mimsy!woodb!tkevans
US MAIL		6401 Security Blvd, 2-Q-2 Operations, Baltimore, MD  21235	
PHONE		(301) 965-3286

lyndon@cs.athabascau.ca (Lyndon Nerenberg) (05/02/91)

[ Tried mailing this but oss670.uucp was unknown to us ]

In comp.mail.headers you write:

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

You almost answered your own question :-)

Use sed to split the headers and body into separate files. Run the header
file through sort | uniq, then append the body file. Note that you will
have to deal with header continuation lines somehow. A short piece of
C code should handle folding the headers, and unfolding them when you're
done.
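Concretely, the split-and-rejoin might look like this (the sample
message and filenames are made up, and continuation lines are
deliberately not handled yet):

```shell
# Build a sample message, then split on the first blank line,
# dedup the headers, and reattach the body.  Filenames are arbitrary.
cat > msg <<'EOF'
Subject: hello
Subject: hello
From: me

body line
body line
EOF
sed '/^$/q' msg | sed '$d' > hdrs    # headers only (drop the blank line)
sed '1,/^$/d' msg > body             # everything after the blank line
( sort hdrs | uniq; echo; cat body ) > msg.fixed
```

Note that sort reorders the headers as a side effect; the mail will
still be legal, but it may look odd to human readers.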

Perhaps the easiest way to deal with this would be to write the entire
filter in C. All you need to do is maintain a linked list of headers
you have seen. During the scanning phase, if you encounter a header that's
already on the linked list, ignore it (and any possible continuation
lines). If it's a new header, start up a second linked list of lines
containing the header contents. If there are continuation lines in the
header, simply append them to the linked list for that header. This
eliminates the need to fold/spindle/mutilate the header continuation
lines.

Once you've fallen out of the headers, just copy the message body
through and you're done!
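In awk the linked list collapses into an associative array; here is a
sketch of that scanning logic (sample message and filenames invented
for illustration, not Lyndon's C):

```shell
# Create a sample message with a non-consecutive duplicate header.
cat > message <<'EOF'
Subject: test
From: a
Subject: test

body
body
EOF
awk '
    body        { print; next }               # past the headers: copy through
    /^$/        { body = 1; print; next }     # first blank line ends headers
    /^[ \t]/    { if (!skip) print; next }    # continuation follows its header
    seen[$0]++  { skip = 1; next }            # seen before (and recorded): drop
                { skip = 0; print }           # first occurrence: print it
' message > message.fixed
```

This drops *any* previously seen header line, not just consecutive
repeats, which is slightly more than the original question asked for.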

tchrist@convex.COM (Tom Christiansen) (05/02/91)

From the keyboard of lyndon@cs.athabascau.ca (Lyndon Nerenberg):
:[ Tried mailing this but oss670.uucp was unknown to us ]

right, me too.

:In comp.mail.headers you write:
:
:>I'm not able to fix the mailer myself, but can pass its output
:>through standard filters--awk, sed, etc.--before it goes
:>out the door.  My first thought was to pass things through 'uniq',
:>but this would also delete consecutive identical lines in the body (the
:>mailer doesn't distinguish between header and body).  The probability
:>of consecutive, identical lines in the body of mail messages seems
:>low, but not low enough to chance this.
:
:You almost answered your own question :-)
:
:Use sed to split the headers and body into separate files. Run the header
:file through sort|uniq, then append the body file. Note that you will 
:have to deal with header continuation lines somehow. A short piece of
:C code should handle folding the headers, and unfolding them when you're
:done.

That's a lot of work!!


:Perhaps the easiest way to deal with this would be to write the entire
:filter in C. All you need to do is maintain a linked list of headers
:you have seen. During the scanning phase, if you encounter a header that's
:already on the linked list, ignore it (and any possible continuation
:lines). If it's a new header, start up a second linked list of lines
:containing the header contents. If there are continuation lines in the
:header, simply append them to the linked list for that header. This
:eliminates the need to fold/spindle/mutilate the header continuation
:lines.

:Once you've fallen out of the headers, just copy the message body
:through and you're done!

That's a HELLUVA lotta work!

Here's an awk solution:

    #!/bin/awk -f
    /^$/ { body = 1 }
    {
        if (!body) {
            if (lastline == $0) next
            lastline = $0
        }
        print
    }

And here's a perl solution:

    perl -ne 'print if (/^$/ .. eof)  || $lastline ne $_; $lastline = $_'


If you want solutions for non-consecutive or especially multi-line
headers, ask, but I can lay odds they'll be in perl. :-)

--tom

rickert@mp.cs.niu.edu (Neil Rickert) (05/02/91)

In article <1991May01.234739.25672@convex.com> tchrist@convex.COM (Tom Christiansen) writes:
>:
>:>I'm not able to fix the mailer myself, but can pass its output
>:>through standard filters--awk, sed, etc.--before it goes
>:>out the door.  My first thought was to pass things through 'uniq',
>
>That's a lot of work!!

If you look in the C News package, particularly at 'tear' and
'anne.jones' (plus an associated awk script), you may find most of the
hard work has already been done.

-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115                                   +1-815-753-6940

peter@ficc.ferranti.com (Peter da Silva) (05/02/91)

awk 'BEGIN {h=1; l=""} /^$/ {h=0} {if(h==0 || l!=$0) print; l = $0}'
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

loren@ural.Eng.Sun.COM (05/04/91)

In comp.mail.headers you write:

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

I expect an AWK script is the best choice if you don't want to use PERL.
I can't remember AWK syntax since I started using PERL (How quickly
I forget :)  But you basically want to delete any adjacent duplicate
lines until the end of the header (the first blank line) and then copy the
rest of the file as is.  If you had wanted a PERL solution, I would have
written it.  Hope this is of some help.

Loren

-----------------------------------------------------------------------------
Mr Loren L. Hart                        The Ada Ace Group, Inc
loren@cup.portal.com                    P.O. Box 36195
                                        San Jose, CA  95158

weimer@garden.ssd.kodak.com (Gary Weimer (253-7796)) (05/07/91)

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

Since I haven't seen a non-perl solution that works yet, here's mine.
Actually I have two (don't ask me why). The second is more robust and
handles all examples in the test file.

============ Start test file =======================

This is the first line
First continued
	line
Another continued
	line
Another continued
	line with extras
A repeated line
A repeated line
A repeated line
	with continuation
A repeated line
	with continuation
One more line

Body of message
Body of message
More lines

2nd paragraph
Body of message
Body of message
More lines
============ End test file =======================

============ Start 1st solution file =======================
#!/bin/awk -f

# assumes first line is not blank (doesn't modify header if it is)
# assumes continuation lines do not make a "line" unique, i.e.
#     A line followed by
#         a continuation line
# is a "duplicate" of:
#     A line followed by
#         a different continuation line

BEGIN{cont = "\t"}	# tab is continuation character

/^$/,/\004/{	# from first blank line to EOF; \004 (a literal
    print $0;	# CTRL-D in the original posting) never occurs,
    next}	# so the range runs to end of file

substr($0,1,1) == cont {	# don't print continuation line if first
    if (!del) {print $0}	# part of line was a repeat
    next}

prev == $0 {	# this and any continuation is repeat
    del = 1;
    next}

{		# print line since not repeat
    del = 0;
    print $0;
    prev = $0}
============ End 1st solution file =======================

============ Start 2nd solution file =======================
#!/bin/awk -f

# skips blank lines at start of file (can be printed)
# compares continuation lines

BEGIN{contflg = "\t"}	# tab is continuation character

{if (!fndhdr){		# handle blank lines before header
    if ($0 == ""){
#        print $0;	# print blank lines before header
        next}
    else{
        fndhdr = 1}}}

/^$/,/\004/{	# from first blank line to EOF (\004, the original
    print $0;	# posting's literal CTRL-D, never occurs)
    next}

substr($0,1,1) == contflg {
    if (nm != 0 && nm < np && prev[nm+1] == $0){ # still seems to be repeat
        nm++}
    else{		# line is not a repeat
        if (nm == 0){	# we already knew was not repeat
            np++}
        else{
            for (i=1; i<=nm; i++) # print what we thought was a repeat
                print prev[i];
            np = nm + 1;
            nm = 0}
        print $0;
        prev[np] = $0}	# keep track of continuation lines
    next}

prev[1] == $0 {	# assume line is repeat
    nm = 1;
    next}

{		# print line since not repeat
    nm = 0;
    print $0;
    np = 1;
    prev[np] = $0}
============ End 2nd solution file =======================