[news.software.b] Anyone interested in end-to-end checksums in news?

jef@well.UUCP (Jef Poskanzer) (11/16/89)

Seems like we could add an end-to-end checksum to netnews articles in
an upward compatible fashion.  Add a new header field, "Checksum: ",
based on the entire article except the Path: and Checksum: headers.
Modify the news software to add the checksum to locally-posted
articles, and check it if present on articles from elsewhere.

The only problem would be if the string "\nChecksum: " itself got munged.
But then we are no worse off than we are now.
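A minimal sketch of what the proposal might look like (modern Python; the use of CRC-32 and the exact header handling are illustrative assumptions, not part of the proposal):

```python
import binascii

def article_checksum(article):
    """Checksum an article, skipping the Path: and Checksum: headers."""
    head, _, body = article.partition("\n\n")
    kept = [line for line in head.split("\n")
            if not line.startswith(("Path:", "Checksum:"))]
    data = "\n".join(kept) + "\n\n" + body
    # CRC-32 stands in here for whatever checksum the scheme would pick.
    return "%08x" % (binascii.crc32(data.encode()) & 0xFFFFFFFF)

a1 = "Path: site1!site2\nMessage-ID: <1@well>\n\nHello, world.\n"
a2 = "Path: other!hops!here\nMessage-ID: <1@well>\n\nHello, world.\n"
# Rewriting Path: on each hop does not disturb the end-to-end value:
assert article_checksum(a1) == article_checksum(a2)
```

The posting site computes the field once; every receiving site can recompute and compare without caring how many hops rewrote Path: along the way.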

Participation would be strictly voluntary.  If you are interested in
protecting your articles against getting munged, you have an incentive
to add in the checksum; if you are interested in seeing fewer munged
articles, you have an incentive to check the checksum.

Note that this means no more gratuitous header re-writing.  Bet Henry'll
like it for this reason...
---
Jef

      Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
            "Talkers are no good doers." -- Shakespeare, Henry VI

tneff@bfmny0.UU.NET (Tom Neff) (11/16/89)

One problem with checksumming news is that if done wrong, it could
presume too much about the internal representation of articles.  If I
have an EBCDIC machine that stores all text files as fixed-length,
blank-padded 80-character logical records with no concept of '\n', I
should still be able to run News.  Such machines could ignore the
Checksum field, of course.  But they're as prey to noise and article
corruption as anyone else, and deserve the benefits of the feature.

Checksumming could work if the following rules apply:

 * Start with the first nonblank line of the article body; do not
   include headers.  (Think about the "Path" field.)

 * Only checksum nonblank lines.

 * Only count characters in the graphics-64 set; use their ASCII
   numeric representations.  (Non-ASCII machines can just translate
   while computing.)
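Read literally, those rules might be sketched like this (the exact 64-character graphics set, and folding lowercase to uppercase before testing membership, are assumptions on my part):

```python
GRAPHICS64 = {chr(c) for c in range(0x21, 0x60)}    # '!' .. '_', assumed set

def body_checksum(article):
    """Body-only checksum: nonblank lines, graphics-64 chars, ASCII codes."""
    _, _, body = article.partition("\n\n")          # rule: no headers
    total = 0
    for line in body.split("\n"):
        if not line.strip():
            continue                                # rule: nonblank lines only
        for ch in line.upper():                     # fold case (assumption)
            if ch in GRAPHICS64:                    # rule: graphics-64 only
                total = (total + ord(ch)) & 0xFFFFFFFF
    return total

# Tabs expanded to spaces in transit do not change the result:
assert body_checksum("X: a\n\nfoo\tbar\n") == body_checksum("X: b\n\nfoo    bar\n")
```

Since blanks and line boundaries never enter the sum, a gateway that reflows or re-pads whitespace cannot invalidate the field.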

A new field like Checksum is only worth adding if practically everyone
can get some use out of it.  The above suggestions would allow lots
of different machines both to generate and check the field.

-- 
When I was [in Canada] I found their jokes like their   | Tom Neff
roads -- not very long and not very good, leading to a  | tneff@bfmny0.UU.NET
little tin point of a spire which has been remorselessly
obvious for miles without seeming to get any nearer. -- Samuel Butler.

henry@utzoo.uucp (Henry Spencer) (11/17/89)

In article <14594@well.UUCP> Jef Poskanzer <jef@well.sf.ca.us> writes:
>Seems like we could add an end-to-end checksum to netnews articles in
>an upward compatible fashion.  Add a new header field, "Checksum: ",
>based on the entire article except the Path: and Checksum: headers.
>Modify the news software to add the checksum to locally-posted
>articles, and check it if present on articles from elsewhere.

Such a scheme existed in an early version of C News.  We eventually
abandoned it.  The problem is, what do you do when you receive an
article with a bad checksum?  The underlying difficulty is that news
not infrequently travels via networks that corrupt the data in "benign"
ways, e.g. substituting spaces for tabs.  Throwing away such articles
means you don't see perfectly-readable news.  Keeping them and reporting
on them just increases the noise level in the sysadmin's mailbox, since
all too often the responsible parties won't (or can't) fix their software.
We thought about checksumming only non-blank characters, but that drives
the cost up considerably, and there's still the question of what to do
with a bad article.  It just didn't seem worth it.
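Henry's "benign corruption" problem is easy to demonstrate (CRC-32 here is just a stand-in; the details of the abandoned C News checksum aren't given):

```python
import binascii

crc = lambda s: binascii.crc32(s.encode()) & 0xFFFFFFFF

original  = "Subject: test\n\none\ttwo\n"
delivered = original.expandtabs(8)   # a gateway "benignly" expands the tab

# The article is still perfectly readable, but the end-to-end check fails:
assert original != delivered
assert crc(original) != crc(delivered)
```

The receiving site is then left with exactly the dilemma described above: drop readable news, or file a report that nobody upstream will act on.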

>Note that this means no more gratuitous header re-writing.  Bet Henry'll
>like it for this reason...

Alas, not so.  Since it is *necessary* to rewrite the Path header, the
checksum has to be recomputed every time anyway.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

coolidge@brutus.cs.uiuc.edu (John Coolidge) (11/17/89)

tneff@bfmny0.UU.NET (Tom Neff) writes:
>Checksumming could work if the following rules apply:

> * Start with the first nonblank line of the article body; do not
>   include headers.  (Think about the "Path" field.)

I agree with the original proposal here. Checksum all of the standard
headers except Path: (and Xref:). Message-id's seem to be the most
common thing to get mashed in transmission. They're also the most
_important_ thing that gets mashed most of the time. Checksumming
should start with the headers.

>A new field like Checksum is only worth adding if practically everyone
>can get some use out of it.  The above suggestions would allow lots
>of different machines both to generate and check the field.

That's true. On the other hand, I don't see anything in Checksum:
that would imply that headers not get checksummed. The most important
reason for me to want Checksum: is to protect against bogus
message-ids. Of course, there _is_ the side benefit that it would
stop gratuitous header re-writing (hurrah!).

--John

--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
New NNTP connections always available! Send mail if you're interested.

lmb@vicom.com (Larry Blair) (11/17/89)

In article <14594@well.UUCP> Jef Poskanzer <jef@well.sf.ca.us> writes:
=Seems like we could add an end-to-end checksum to netnews articles in
=an upward compatible fashion.  Add a new header field, "Checksum: ",
=based on the entire article except the Path: and Checksum: headers.
=Modify the news software to add the checksum to locally-posted
=articles, and check it if present on articles from elsewhere.

I brought up this issue a while back and was soundly thrashed.  Actually,
a checksum is not the answer.  What is needed is a CRC.

My main interest, at the time, was to try to stop the bogus Message-ID's
that were causing problems.  I found that there are a large number of
munged articles circulating on Usenet.

Obviously, there is the problem that either the Path: line has to be ignored
or the CRC will have to be recalculated on every hop.
-- 
Larry Blair   ames!vsi1!lmb   lmb@vicom.com

lmb@vicom.com (Larry Blair) (11/17/89)

In article <14922@bfmny0.UU.NET> tneff@bfmny0.UU.NET (Tom Neff) writes:
=Checksumming could work if the following rules apply:
=
= * Start with the first nonblank line of the article body; do not
=   include headers.  (Think about the "Path" field.)

Who cares about the article body?  It is the headers that need to be protected,
particularly the Message-ID.
-- 
Larry Blair   ames!vsi1!lmb   lmb@vicom.com

tneff@bfmny0.UU.NET (Tom Neff) (11/17/89)

Exempting specific header fields (Path, Xref) from checksumming is an
acceptable alternative to checksumming just the body, provided you don't
mind enthroning a specific set of headers in the RFC (this proposal
needs one, by the way).

I should point out that there is a limit to what you can do with a
proven-corrupted message if the corruption includes the header.
You cannot attempt to promulgate error information elsewhere,
because you do not know if the Message-ID is valid.

It seems worthwhile to log corrupted articles carefully including path
information, so that suspect links can be investigated.  The info could
be collected periodically a la Arbitron and collated centrally for
announcement on a new cable TV show, "Usenet's Most Wanted" :-)
-- 
"Of course, this is a, this is a Hunt, you   |  Tom Neff
will -- that will uncover a lot of things.   |  tneff@bfmny0.UU.NET
You open that scab, there's a hell of a lot
of things... This involves these Cubans, Hunt, and a lot of hanky-panky
that we have nothing to do with ourselves." -- RN 6/23/72

jef@well.UUCP (Jef Poskanzer) (11/17/89)

In the referenced message, coolidge@cs.uiuc.edu wrote:
}                                  Message-id's seem to be the most
}common thing to get mashed in transmission.

Uh, doesn't this seem a little unlikely to you?  Doesn't it seem more
likely that bits all over the articles are getting munged with equal
probability, but we tend to notice when it happens to Message-Id's
since that results in duplicate articles?  In fact, precisely this
realization is what led me to propose Checksum: -- if it's true, then
munging is *far* more common than previously thought.
---
Jef

      Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
    "If you give me six lines written by the most honest man, I will find
           something in them to hang him." -- Cardinal de Richelieu

tneff@bfmny0.UU.NET (Tom Neff) (11/17/89)

In article <1989Nov16.191837.13850@vicom.com> lmb@vicom.COM (Larry Blair) writes:
>Who cares about the article body?  It is the headers that need to be protected,
>particularly the Message-ID.

I think this could officially be called the "News for its own sake"
position. :-)
-- 
"The country couldn't run without Prohibition.       ][  Tom Neff
 That is the industrial fact." -- Henry Ford, 1929   ][  tneff@bfmny0.UU.NET

coolidge@brutus.cs.uiuc.edu (John Coolidge) (11/17/89)

jef@well.UUCP (Jef Poskanzer) writes:

>In the referenced message, coolidge@cs.uiuc.edu wrote:
>}                                  Message-id's seem to be the most
>}common thing to get mashed in transmission.

>Uh, doesn't this seem a little unlikely to you?  Doesn't it seem more
>likely that bits all over the articles are getting munged with equal
>probability, but we tend to notice when it happens to Message-Id's
>since that results in duplicate articles?

Hmm. I've seen plenty of articles with munged Message-id's, and
many of those had munged text too. I don't remember ever seeing
an article with munged text in which the Message-id _wasn't_
munged, though. This is probably because, if the text gets mashed
but the Message-id doesn't, then downstream feeds don't take the
mashed copy but only the good one (all of the duplicate, mashed
Message-id postings I've seen seem to have been sent twice by the
originating site...).

This might imply that checksumming Message-id, while not sufficient
to stop all body-munging, will have the effect of stopping lots of
it by halting propagation. In any case, checksumming all possible
header fields and the entire text is the right thing to do anyway.

--John

--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
New NNTP connections always available! Send mail if you're interested.

coolidge@brutus.cs.uiuc.edu (John Coolidge) (11/17/89)

tneff@bfmny0.UU.NET (Tom Neff) writes:

>Exempting specific header fields (Path, Xref) from checksumming is an
>acceptable alternative to checksumming just the body, provided you don't
>mind enthroning a specific set of headers in the RFC (this proposal
>needs one, by the way).

What really needs to be done is to integrate the Checksum: feature
and fixes for all of the RFC bugs that have come to light in this
group over the last few months, and make up an all-new Standards for
Transmission of Usenet News RFC.  Specifically limit the ability of
header-mungers to work their evil voodoo, fix implementation-specific
RFC bogosities, that sort of thing.

>It seems worthwhile to log corrupted articles carefully including path
>information, so that suspect links can be investigated.  The info could
>be collected periodically a la Arbitron and collated centrally for
>announcement on a new cable TV show, "Usenet's Most Wanted" :-)

Yup, this sounds like the right thing to do. Shunt the damaged
article off to the side, report the problem, and make sure it
doesn't go past your site.

--John

--------------------------------------------------------------------------
John L. Coolidge     Internet:coolidge@cs.uiuc.edu   UUCP:uiucdcs!coolidge
Of course I don't speak for the U of I (or anyone else except myself)
Copyright 1989 John L. Coolidge. Copying allowed if (and only if) attributed.
You may redistribute this article if and only if your recipients may as well.
New NNTP connections always available! Send mail if you're interested.

jef@well.UUCP (Jef Poskanzer) (11/17/89)

In the referenced message, henry@utzoo.uucp (Henry Spencer) wrote:
}In article <14594@well.UUCP> Jef Poskanzer <jef@well.sf.ca.us> writes:
}>Note that this means no more gratuitous header re-writing.  Bet Henry'll
}>like it for this reason...
}
}Alas, not so.  Since it is *necessary* to rewrite the Path header, the
}checksum has to be recomputed every time anyway.

Huh?  I didn't say no header re-writing, I said no gratuitous header
re-writing.  Did you read this:

}>based on the entire article except the Path: and Checksum: headers.
			      ^^^^^^     ^^^^^
?  Unless I'm missing something, you compute it once, when an article
is submitted, and each site checks it upon reception.  That's the whole
point of an end-to-end check.  Sure it means that the Path: is not
protected, but I can live with that.

As for the other issues you mentioned, that's why I proposed that this
be voluntary.  Some sites, mine certainly among them, will decide that
reliable transport is more important than getting the minuscule number
of articles that can get here only through EBCDIC links or other
bogosities.
---
Jef

      Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
   "I do not believe that this generation of Americans is willing to resign
    itself to going to bed each night by the light of a Communist moon..."
                            -- Lyndon B. Johnson

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (11/17/89)

Re: checksums or crc for news articles

>Note that this means no more gratuitous header re-writing.  Bet Henry'll
>like it for this reason...

You could use a crc or checksum that isn't sensitive to header ordering.
For example, do a crc on each line in the header and sum those together.
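A sketch of that idea (CRC-32 and 32-bit wraparound addition are illustrative choices): CRC each header line separately and add the results, so reordering the headers leaves the total unchanged.

```python
import binascii

def header_sum(headers):
    """Order-insensitive check value: sum of per-line CRCs, mod 2**32."""
    total = 0
    for line in headers:
        total = (total + binascii.crc32(line.encode())) & 0xFFFFFFFF
    return total

h1 = ["From: jef@well.sf.ca.us", "Subject: checksums", "Message-ID: <1@well>"]
h2 = [h1[2], h1[0], h1[1]]                 # same headers, reordered
assert header_sum(h1) == header_sum(h2)    # reordering doesn't matter
```

The price is some detection power: two damaged lines whose CRC changes happen to cancel under addition would slip through, which a single CRC over the whole header block would catch.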


-- 
Jon Zeeff    		<zeeff@b-tech.ann-arbor.mi.us>
Branch Technology 	<zeeff%b-tech@iti.org>

allbery@NCoast.ORG (Brandon S. Allbery) (11/19/89)

Checksumming can't be relied upon in the case of ASCII <-> EBCDIC translations
anyway -- EBCDIC is not a single character set, but a related family of
character sets, each slightly different from the others.  Worse, there
are characters in EBCDIC which don't map to ASCII (which may affect postings
originating from BITNET sites) and characters in ASCII which map to two or
possibly more characters in EBCDIC (consider "|").

++Brandon
-- 
Brandon S. Allbery    allbery@NCoast.ORG, BALLBERY (MCI Mail), ALLBERY (Delphi)
uunet!hal.cwru.edu!ncoast!allbery ncoast!allbery@hal.cwru.edu bsa@telotech.uucp
*(comp.sources.misc mail to comp-sources-misc[-request]@backbone.site, please)*
*Third party vote-collection service: send mail to allbery@uunet.uu.net (ONLY)*
expnet.all: Experiments in *net management and organization.  Mail me for info.

lemke@radius.UUCP (Steve Lemke) (11/19/89)

In article <14603@well.UUCP> Jef Poskanzer <jef@well.sf.ca.us> writes:
}			      ^^^^^^     ^^^^^
}?  Unless I'm missing something, you compute it once, when an article
}is submitted, and each site checks it upon reception.  That's the whole
}point of an end-to-end check.  Sure it means that the Path: is not
}protected, but I can live with that.

OK, so it gets checked upon reception, but maybe I'm missing something
here - what happens if the check fails?  This doesn't seem like it will
happen in real time (as the transfer is taking place) but rather when
rnews executes, which is usually after it's too late to ask for the article
to be sent again.  Well, do you somehow try to request that article again
later?  Of course, if it was munched then you don't really know how to
ask for it again, do you?  So, do you ignore it?  You only want non-munched
news, regardless of what it was that actually caused the checksum to fail?
Just forget about anything that doesn't pass?

I just recently started running bnews and everything seems to be working
fine, but I'm no expert - I'm just curious as to what happens in the event
that your proposed checksum fails upon receipt.

-- 
=============================================================================
===== Steve Lemke, Engineering Quality Assurance, Radius Inc., San Jose =====
===== Reply to: radius!lemke@apple.com    (Coming soon: radius.com ...) =====
===== AppleLink: Radius.QA;    GEnie: S.Lemke;    Compu$erve: 73627,570 =====

jef@well.UUCP (Jef Poskanzer) (11/19/89)

In the referenced message, radius!lemke@apple.com (Generic Account) wrote:
}                                                  You only want non-munched
}news, regardless of what it was that actually caused the checksum to fail?
}Just forget about anything that doesn't pass?

Yep.  Log it, especially the path, and drop it in the bit bucket.  Or
at least, this is what I intend to do.  Other sites are perfectly free
to do otherwise.  I actually considered posting it in junk, but then
if a non-munged version of the article happened along it would get
rejected.  I could fix this, but I think it would take more hacking
than I'm interested in doing.

Hmm.  I can just see Brad getting back at me for my References: lines by
adding bogus checksums to all his articles.  Well, if he wants all his
articles to get dropped, I guess that's ok...
---
Jef

      Jef Poskanzer  jef@well.sf.ca.us  {ucbvax, apple, hplabs}!well!jef
     "...for DEATH awaits you all, with nasty sharp pointy teeth!" -- Tim

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (11/20/89)

Re: what to do if a checksum on a news article is bad

If you only have one source for news, there is nothing you can do.  If you
have multiple sources, you can reject the article in the hope that someone
will offer you a good one.  Or, what I would prefer would be to keep the
article along with some indication that it is probably bad and should be
replaced with a good copy if it comes along.


-- 
Jon Zeeff    		<zeeff@b-tech.ann-arbor.mi.us>
Branch Technology 	<zeeff@b-tech.mi.org>

It's 1989.  Does your software support the ISO 8859 character sets?

blarson@dianne.usc.edu (bob larson) (11/20/89)

In article <1989Nov18.185648.3525@NCoast.ORG> allbery@ncoast.ORG (Brandon S. Allbery) writes:
>Checksumming can't be relied upon in the case of ASCII <-> EBCDIC translations

Protection against non-invertible translations is one of the reasons
for the "checksum" (should be a CRC) as far as I am concerned.  Remember
all the problems caused when one of the rn (?) patches was mangled for
a significant portion of the net due to a nonstandard news feed (via
BITNET) that expanded tabs to spaces?  Invertible transformations
shouldn't be a problem; it's just a little harder to compute the CRC
on a system that decides not to store articles in ASCII.

Having just the sites that have multiple redundant feeds drop mashed
articles would be a big help in reducing the number of such articles
propagating.  (usc is such a site.)

-- 
Bob Larson	blarson@dianne.usc.edu		usc!dianne!blarson
--**		To join Prime computer mailing list		**---
info-prime-request@ais1.usc.edu		usc!ais1!info-prime-request