[comp.mail.headers] Eliminating Duplicate Mail Headers

tkevans@oss670.UUCP (Tim Evans) (05/01/91)

My vendor's proprietary e-mail package, non-standard in almost every
way (including not using sendmail), is generating duplicate headers.
Generally, these are the Subject header and/or the From header,
and they are always consecutive.  That is, identical Subject
headers are generated, and they occur consecutively.

Since we are connected to the outside world via UUCP, this means
we are doing Bad Things to other people's mail.  Naturally, I'm
anxious to fix this, but the vendor, whose contract doesn't require
UUCP support or compliance with RFC-822, isn't.

I'm not able to fix the mailer myself, but can pass its output
through standard filters--awk, sed, etc.--before it goes
out the door.  My first thought was to pass things through 'uniq',
but this would also delete consecutive identical lines in the body (the
mailer doesn't distinguish between header and body).  The probability
of consecutive, identical lines in the body of mail messages seems
low, but not low enough to chance this.

So, can anyone provide a solution that would delete the second (and
subsequent?) occurrences of identical lines that are RFC-822-style
headers?  I'd prefer not using 'perl' as I haven't installed it here
yet (Real Soon Now).
-- 
INTERNET	tkevans%woodb@mimsy.umd.edu
UUCP 		...!{rutgers|ames|uunet}!mimsy!woodb!tkevans
US MAIL		6401 Security Blvd, 2-Q-2 Operations, Baltimore, MD  21235	
PHONE		(301) 965-3286

lyndon@cs.athabascau.ca (Lyndon Nerenberg) (05/02/91)

[ Tried mailing this but oss670.uucp was unknown to us ]

In comp.mail.headers you write:

>I'm not able to fix the mailer myself, but can pass its output
>through standard filters--awk, sed, etc.--before it goes
>out the door.  My first thought was to pass things through 'uniq',
>but this would also delete consecutive identical lines in the body (the
>mailer doesn't distinguish between header and body).  The probability
>of consecutive, identical lines in the body of mail messages seems
>low, but not low enough to chance this.

You almost answered your own question :-)

Use sed to split the headers and body into separate files. Run the header
file through sort|uniq, then append the body file. Note that you will 
have to deal with header continuation lines somehow. A short piece of
C code should handle folding the headers, and unfolding them when you're
done.
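
A rough sketch of the split/unfold/uniq idea, written in Python purely for
illustration (the post itself suggests sed plus a little C). It assumes the
header block ends at the first empty line and that continuation lines begin
with a space or tab. It also skips the sort step and only drops consecutive
duplicates, as plain 'uniq' would, so the header order is left alone. The
function name is hypothetical.

```python
def uniq_headers(text):
    """Drop consecutive duplicate headers; leave the body untouched."""
    # Split at the first empty line: everything before it is header.
    head, sep, body = text.partition("\n\n")

    # "Unfold": glue each continuation line onto the header it belongs to,
    # keeping the embedded newline so the original folding survives.
    records = []
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and records:
            records[-1] += "\n" + line
        else:
            records.append(line)

    # uniq-style pass: keep a header only if it differs from the previous one.
    kept = []
    for rec in records:
        if not kept or kept[-1] != rec:
            kept.append(rec)

    # Reattach the separator and the body exactly as they arrived.
    return "\n".join(kept) + sep + body
```

Because the body is never inspected, consecutive identical lines there
pass through unchanged.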

Perhaps the easiest way to deal with this would be to write the entire
filter in C. All you need to do is maintain a linked list of headers
you have seen. During the scanning phase, if you encounter a header that's
already on the linked list, ignore it (and any possible continuation
lines). If it's a new header, start up a second linked list of lines
containing the header contents. If there are continuation lines in the
header, simply append them to the linked list for that header. This
eliminates the need to fold/spindle/mutilate the header continuation
lines.

Once you've fallen out of the headers, just copy the message body
through and you're done!
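
The single-filter approach might look roughly like this. Again it is a
Python sketch rather than the C described above, with a set standing in
for the linked list of headers already seen and a small list holding the
lines of the header currently being read. One assumption that goes beyond
the description: a header counts as "already seen" only when its full text
(name, value and continuation lines) matches an earlier one, so that
several different Received: headers are all kept. The function names are
hypothetical.

```python
def _emit(current, seen, out):
    """Output the collected header unless an identical one was already output."""
    key = "".join(current)
    if current and key not in seen:
        seen.add(key)
        out.extend(current)


def dedupe_headers(lines):
    """Return the message lines with repeated headers removed.

    'lines' is an iterable of lines with their trailing newlines intact.
    """
    seen = set()       # full text of every header output so far
    out = []
    current = []       # physical lines of the header being collected
    in_headers = True
    for line in lines:
        if not in_headers:
            out.append(line)            # body: copy through untouched
        elif line[:1] in (" ", "\t") and current:
            current.append(line)        # continuation of the current header
        else:
            _emit(current, seen, out)   # previous header is now complete
            current = []
            if line.rstrip("\r\n") == "":
                in_headers = False      # empty line: headers are finished
                out.append(line)
            else:
                current = [line]        # start collecting a new header
    _emit(current, seen, out)           # message ended while still in headers
    return out
```

To use it as a filter, something like
sys.stdout.writelines(dedupe_headers(sys.stdin)) after an "import sys"
should be enough. Unlike the uniq-style pass, this also catches duplicates
that are not adjacent.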