[net.news.adm] curious 2.10.3 efficiency situation

reid@glacier.ARPA (Brian Reid) (03/07/86)

I upgraded to 2.10.3 about a month ago, and except for the "expire" bug that
has already been reported here, I am quite happy with it.

However, about a week ago I made a stupid mistake that caused all incoming
news from all of our feeds to vanish into a pit. It was 3 days before I was
able to figure out what I had done (forgotten to turn on the setuid bit on a
piece of home-grown distribution software), so we lost 3 days' worth of news.

Once I fixed the bug, I got one of our feeds (decwrl) to retransmit the 3
days' worth of news. This was 155 batches of compressed news, and thus 155
separate calls to rnews via uuxqt. This was 3 days ago.

What happened then was both horrible and fascinating. Our machine has had
load factors of 10 to 12 around the clock for the last 3 days, and as I
watch the queues and analyze things, it looks as though it is not likely to
quiesce soon. Even as I type this message, it is about midnight, I am the
only user on the machine, and the load factor is 9.95.

The increased rate of news-flow, a pulse of 3 days' worth of news shipped in
one 8-hour burst, gave our system an impulse that is not damping out.
Glacier is now loaded down badly enough that it is unable to process news as
fast as the ordinary news is coming in, and the situation is actually
getting worse rather than better. In 2 days the number of unprocessed uucp
"X." files has grown from 155 to 220.

This is an elementary queueing problem, of course. The rate at which 2.10.3
rnews (in conjunction with compress 4.0, uuxqt, /bin/sh, and all of the
other accessories) seems to be able to process incoming messages on our
VAX-750 is just barely fast enough to keep up with the rate at which
ordinary messages are arriving.  Giving it a sudden burst of 1000 articles
has saturated things, and increased some cost (perhaps the history file
search?) enough that the uucico|uuxqt|sh|compress|rnews pipeline is not able
to process messages as fast as they are arriving.

What I am going to do to fix this problem is to run news at nice -16 all
weekend; that should give it enough of an increased speed that I can drain
the queues.
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA

reid@glacier.ARPA (Brian Reid) (03/08/86)

Well, I've managed to get my netnews under control, after several days of
wrestling with it. I'm worried, though, that this is just the first of many
hiccups like this, and that perhaps a Vax 750 is not a big enough computer
to be a backbone site running this news software.

The problem is that we have 2 primary feeds; each potentially feeds us about
500 articles a day. In the steady state, when we make regular contact with
both feeds, we end up with about 150 from oliveb and about 350 from decwrl.
The compress/inews pipeline is able to handle about 3 articles a minute, so
our normal load is about 3 hours per day (out of 24) devoted to incoming
news.

When we were out of action for a couple of days, the steady-state rhythym
was broken, and both oliveb and decwrl queued all of the messages for us.
They arrived and sat in UUXQT queues, and while they sat there there was no
record that we had the article, and we didn't send it out to the other feed,
and so on: a vicious cycle. The net result was that we were getting a
torrent of 1000 articles per day, with many duplicates, from our two feeds.
Since we had a backlog of 2500 articles waiting to be processed, we never
caught up; glacier was not able to process 1000 articles per day, but only
by catching up, by getting all 3500 articles processed, would the input flow
be reduced back down to 500 per day.

What I finally did to fix it was to hotwire inews/rnews to run at nice -18,
and then to replace /bin/csh with a program that said "sorry, no logins
permitted at this time". This kept users off the machine while letting
uucico through. With no other users on the machine, but with uucp running
full bore, I was able to get my backlog of 3500 messages processed in about
20 hours. Then I reset everything back to normal; we've been running this
way for 3 hours now and the UUXQT queues are normal.

I don't yet have a solution to this problem, but I predict that we are not
the last site that it will happen to. If I had not been able to kick all of
my users off the machine for a day, then I really don't know how we could
have recovered from this, other than by asking our feeds to remove
net.religion and net.politics and that sort of thing, in order to decrease
the arrival traffic.
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA

jon@msunix.UUCP (Otto) (03/08/86)

In article <5044@glacier.ARPA>, reid@glacier.ARPA (Brian Reid) writes:
> The increased rate of news-flow, a pulse of 3 days' worth of news shipped in
> one 8-hour burst, gave our system an impulse that is not damping out.
> Glacier is now loaded down badly enough that it is unable to process news as
> fast as the ordinary news is coming in, and the situation is actually
> getting worse rather than better. In 2 days the number of unprocessed uucp
> "X." files has grown from 155 to 220.
> 

Ouch!  I only keep two weeks worth of news on our machine, which is a
Sun-2/120.  It can do a uuxqt -> sh -c compress -d | rnews on a 50K batch
in about three and a half minutes (unloaded).  Guess I don't have anything
to complain about.  Once I saw someone type "la" (used to be a link to w, like
uptime) three times on the console of a VAX-11/750  that was having major
problems.  It looked something like:

root% la
load average: 27.48, 20.68, 15.34
root% la
load average: 54.23, 41.94, 33.74
root% la
load average: 81.23, 74.23, 64.23

panic: (some suicide note from the OS, in hex)

(crash, die, incinerate, more hex)


All I wanted was a Pepsi!		Jonathan Hue
And she wouldn't get it for me!		LMSC-Mechanisms & Servos
(c) Suicidal Tendencies, 1984	{amdcad!cae780,sun!sunncal}!leadsv!msunix!jon

phil@amdcad.UUCP (Phil Ngai) (03/08/86)

Brian, your analysis sounds quite intriguing at first but I can't seem
to make the numbers really work out. To start with, you say a normal
day's load of news takes 3 hours to process. While you are down or
trying to catch up, your feeds send you everything and since you have
two feeds, I would expect your load of news to double from 500 to 1000
articles per day. This is only 6 hours of news in the worse case, so
you should be able to make considerable progress in the other 18 hours
catching up.

If you were down for a week, you'd have to spend 7 * 6 or 42 hours to
catch up. During that time, you'd get shipped another 12 hours of news
and while processing that 12 hours you get 3 hours, etc. Although the
spike takes a while to dampen out, it should indeed do so.

There is another factor. Looking through my logs, I see that my 780
processes 5 articles/min but rejects 19 articles/min. While you are
catching up you should be rejecting 50% of your articles and if a 750
behaves similarly, a day's worth of news ought to take 3 (normal
processing) + 3/4 (reject four times as fast) hours.

I haven't considered modem time here, it too is subject to a latch-up
effect under the right conditions. We have four dialers, so we
probably don't have to worry.
-- 
 "We must welcome the future, remembering that soon it will become the
  present, and respect the past, knowing that once it was all that was
  humanly possible."

 Phil Ngai +1 408 749 5720
 UUCP: {ucbvax,decwrl,ihnp4,allegra}!amdcad!phil
 ARPA: amdcad!phil@decwrl.dec.com

reid@glacier.ARPA (Brian Reid) (03/09/86)

Grumble. Don't tell me that this didn't happen. It did.

It is absolutely true that the time to process one news article increased
substantially while the torrent was arriving. Or, more precisely, the number
of articles per minute that the machine could process was decreasing. In
particular, the time to process one message became longer than the expected
interval between arrivals of messages, with the result that the queues were
increasing in length. The particular queue that was increasing in length was
the /usr/spool/uucp/X. queue.

Although I spent my energy on trying to fix this problem rather than trying
to diagnose it, I did make a few measurements and form a few theories.

I believe that the extra time to process articles was caused by some
combination of these two factors:
   (1) The huge /usr/spool/uucp/X. directory (200+ entries at the beginning;
	900+ entries before I took drastic action. I believe that this 
	slowed uuxqt down considerably.
   (2) The huge history file. At the moment my /usr/lib/news/history
	file is 1.2 megabytes, and has 17000 messages in it. This is about
	double what it usually runs. I believe that inews is running a bit
	slower when the history file gets this big (it still has to search
	to resolve hash conflicts).

Unfortunately, I forgot to turn on system accounting before all of this, so
I don't have any hard data about how well it was working. 
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA

mc68020@gilbbs.UUCP (Tom Keller) (03/10/86)

In article <5128@glacier.ARPA>, reid@glacier.ARPA (Brian Reid) writes:
> 
> Grumble. Don't tell me that this didn't happen. It did.
> 
> It is absolutely true that the time to process one news article increased
> substantially while the torrent was arriving. Or, more precisely, the number
> of articles per minute that the machine could process was decreasing. In
> particular, the time to process one message became longer than the expected
> interval between arrivals of messages, with the result that the queues were
> increasing in length. The particular queue that was increasing in length was
> the /usr/spool/uucp/X. queue.
> 
> Although I spent my energy on trying to fix this problem rather than trying
> to diagnose it, I did make a few measurements and form a few theories.
> 
> I believe that the extra time to process articles was caused by some
> combination of these two factors:
>    (1) The huge /usr/spool/uucp/X. directory (200+ entries at the beginning;
> 	900+ entries before I took drastic action. I believe that this 
> 	slowed uuxqt down considerably.
>    (2) The huge history file. At the moment my /usr/lib/news/history
> 	file is 1.2 megabytes, and has 17000 messages in it. This is about
> 	double what it usually runs. I believe that inews is running a bit
> 	slower when the history file gets this big (it still has to search
> 	to resolve hash conflicts).
> 
> Unfortunately, I forgot to turn on system accounting before all of this, so
> I don't have any hard data about how well it was working. 
> -- 
> 	Brian Reid	decwrl!glacier!reid
> 	Stanford	reid@SU-Glacier.ARPA

   I also have problems with the UUXQT program bogging down.  Now, I have to
tell you that I am running a machine which is not in the same UNIVERSE with
VAXen.  My machine is a Tandy 6000.  Part of the problem *I* am having is that
when I have 12-20 processes all clamoring for disk I/O, a couple of modems 
clamoring for I/O, plus me bitching at my console about mollasses in 
January, my poor 'lil braindamaged Z-80A IOP goes virtually catatonic (though
not in the canonical sense).  It *DOES* bog down badly (as well you might
imagine)


   I don't receive tons of traffic, and I only receive traffic during the off
hours at night.  Which means that my system is working *VERY* hard to keep
up with what *DOES* come down the pike (I have over 110 newsgroups that me
feed site does *NOT* feed me).

  I understand there is some way to ensure that only one istantiation of
UUXQT is executed at any one time.  Can someone elaborate on this for me?
I am very new to all this (I've never had to consider such things in my
3 years of using XENIX.  I have no gurus to point out the True Path to me).


  Any other suggestions for relieving my system of terminal indigestion?

  (please, do not crack wise about getting a more powerful system.  I would
   dearly love to do so...but first I need a job to pay rent, etc., *THEN*
   maybe I can consider getting a new system...maybe)

-- 

====================================

Disclaimer:  I hereby disclaim any and all responsibility for disclaimers.

tom keller
{ihnp4, dual}!ptsfa!gilbbs!mc68020

(* we may not be big, but we're small! *)

geoff@desint.UUCP (Geoff Kuenning) (03/12/86)

Before I get to my real point, I can't resist defending Phil Ngai:

In article <5128@glacier.ARPA>, reid@glacier.ARPA (Brian Reid) writes:
> 
> Grumble. Don't tell me that this didn't happen. It did.
> 
> It is absolutely true that the time to process one news article increased
> substantially while the torrent was arriving.

Brian, I don't think Phil ever said that your load spike didn't happen.
He simply said that the facts you posted in your original article
didn't support blaming queing for all of your problems.  This second article
(5128@glacier) gave more data, explaining that the arrival rate in your
queues had increased due to the increased service time;  this makes
your explanation much more plausible.  However, I still think that if
you have more than one UUXQT running you have a problem because (a)
you're quite possible thrashing, and (b) with the possible exception of
Honey Danber (Peter? any comment?) no uucp system should allow more
than one UUXQT at a time.  Which brings me to my main point:

In article <63@gilbbs.UUCP> mc68020@gilbbs.UUCP (Tom Keller) writes:

>   I understand there is some way to ensure that only one istantiation of
> UUXQT is executed at any one time.  Can someone elaborate on this for me?
> I am very new to all this (I've never had to consider such things in my
> 3 years of using XENIX.  I have no gurus to point out the True Path to me).

Actually, Tom, the rumor reached you in a bit of a confused state.  All
systems should limit uuxqt's to one-at-a-time.  Unfortunately, there's
a bug in older systems:  the original uucp authors simply never foresaw
the day of hours-long uuxqt's, and they wrote uuxqt to ignore the lock
file if it was more than 50 minutes old.  Later systems either fixed the bug,
or switched to a more reliable locking mechanism that doesn't depend on
timeouts to break hung locks.

Fortunately, the workaround is extremely easy:  you simply have to ensure that
the uuxqt lock file is always less than an hour old.  This can be done
with the following crontab entry:

    15,45 * * * * * if [ -r /usr/spool/uucp/LCK.XQT ] ; then
	touch /usr/spool/uucp/LCK.XQT ; fi

(note that this has to fit all on one line, unlike the way I typed it).

One minor problem:  make sure your /etc/rc file has the line

    rm -f /usr/spool/uucp/LCK.XQT

or, once you have had a uuxqt crash, you will never run uuxqt's again!
-- 

	Geoff Kuenning
	{hplabs,ihnp4}!trwrb!desint!geoff

grr@unirot.UUCP (George Robbins) (03/12/86)

In article <5044@glacier.ARPA> reid@glacier.ARPA (Brian Reid) writes:
>I upgraded to 2.10.3 about a month ago, and except for the "expire" bug that
>has already been reported here, I am quite happy with it.
>
>What happened then was both horrible and fascinating. Our machine has had
>load factors of 10 to 12 around the clock for the last 3 days, and as I
>watch the queues and analyze things, it looks as though it is not likely to
>quiesce soon. Even as I type this message, it is about midnight, I am the
>only user on the machine, and the load factor is 9.95.
>
>The increased rate of news-flow, a pulse of 3 days' worth of news shipped in
>one 8-hour burst, gave our system an impulse that is not damping out.
>Glacier is now loaded down badly enough that it is unable to process news as
>fast as the ordinary news is coming in, and the situation is actually
>getting worse rather than better. In 2 days the number of unprocessed uucp
>"X." files has grown from 155 to 220.
>
>What I am going to do to fix this problem is to run news at nice -16 all
>weekend; that should give it enough of an increased speed that I can drain
>the queues.
>-- 
>	Brian Reid	decwrl!glacier!reid
>	Stanford	reid@SU-Glacier.ARPA

You don't indicate which program is hogging your system the most, but here
are several suggestions.

1)	Compress, as distributed with news, is a real core and cpu hog.  This
	is fine if you have the resources, but if not will cause thrashing and
	general sleepiness.  There are options in the program for smaller
	systems that are easy to set up.

2)	History file processing, unless using the BSD style indexed file
	access, is really ineffieient.  Consider kludging the BSD option
	to use any indexed file stuff you have handy.

3)	All the pipeing that goes on with compressed news can't be helping.
	One of those unix features that's nice to have, but screws you in
	high-volume situations.

4)	If you are running mod.sources vnews, afbuild isn't the fastest thing
	in town.  Check the hashing algorithm - you mean you don't like to
	do divide instructions for every character?




-- 
George Robbins			uucp:	...!ihnp4!tapa!grr
P.O. Box 177				...!caip!unirot!grr
Lincoln U, PA  19352	[Any ideas herein are not responsible for themselves!]

msc@saber.UUCP (Mark Callow) (03/13/86)

>   I understand there is some way to ensure that only one istantiation of
> UUXQT is executed at any one time.  Can someone elaborate on this for me?

There had better be only one instance of uuxqt or you'll have real
problems.  uuzqt uses the lock file /usr/spool/uucp/LCK.XQT to ensure
only one uuxqt is running.

UNNNNFortunately uuxqt ignores the lock file if it is more than a
certain age.  The default age (in 4.2 BSD's uucp at least) is 60
minutes.  This is far too short if you are receiving batched news,
especially if you have unbatchnews niced.
-- 
From the TARDIS of Mark Callow
msc@saber.uucp,  sun!saber!msc@decwrl.dec.com ...{ihnp4,sun}!saber!msc
"Boards are long and hard and made of wood"

wedgingt@udenva.UUCP (Will Edgington/Ejeo) (03/13/86)

In article <175@desint.UUCP> geoff@desint.UUCP (Geoff Kuenning) writes:
>One minor problem:  make sure your /etc/rc file has the line
>
>    rm -f /usr/spool/uucp/LCK.XQT
>
>or, once you have had a uuxqt crash, you will never run uuxqt's again!

While you're at it, you might as well:

rm -f /usr/spool/uucp/LCK.*

There sure wasn't *ANY* uucp activity while the system was down, so you
might as well remove them all while you're there.  Terminal lock files
(LCK..tty??) are especially critical; removing them at reboot often means
uucp knows the ports are available for use that much sooner.  Sites with
heavy uucp use will find things just a little bit easier, especially if
they also go down 6 nights a week for backup (as mine do).
-- 
Will Edgington		 | Phone: (303) 871-2081 (work), 722-5738 (home)
Computing Services Staff | USnail: BA 469, 2020 S. Race, Denver CO 80210
University of Denver	 | Home: 2035 S. Josephine #102, Denver CO 80210
Electronic Address (UUCP only): {hplabs, seismo}!hao!udenva!wedgingt
or {boulder, cires, ucbvax!nbires, cisden}!udenva!wedgingt

dsp@ptsfa.UUCP (David St. Pierre) (03/14/86)

In article <1951@saber.UUCP> msc@saber.UUCP (Mark Callow) writes:
>
>There had better be only one instance of uuxqt or you'll have real
>problems.  uuzqt uses the lock file /usr/spool/uucp/LCK.XQT to ensure
>only one uuxqt is running.
>
>UNNNNFortunately uuxqt ignores the lock file if it is more than a
>certain age.  The default age (in 4.2 BSD's uucp at least) is 60
>minutes.  This is far too short if you are receiving batched news,
>especially if you have unbatchnews niced.

For both of these reasons and others, I have taken an extreme approach.
I just checked my (BNU) logfiles and found 1038 executions today. If I
were to let cunbatch grab hours of that time during prime time, I'd
probably be lynched. I can run multiple UUXQTs but don't think it really
helps this problem much.

Here's my "cunbatch":
=====================

cat > news.$$

I just let files drop into my .XQTDIR/.Xqtdir; UUXQT goes off to find more
interesting things to do. I can keep up with incoming traffic without having
queueing problems in UUCP or ageing problems with LCK files. AND I can try to
do some sensible things with incoming news.

Here's something I call "cunbatch.sh" - I'm System V only so it may not work
on your system. Then again it may:
==================================

trap 'rm -f LCK..news; trap 0; exit' 1 2 3 15
cd /usr/spool/uucp/.Xqtdir
if [ -f LCK..news ]
then
	ps -p `cat LCK..news` > /dev/null
	if [ $? -eq 0 ]
	then
		exit 0
	fi
fi

echo $$ > LCK..news

while :
do
	for FILE in `ls -tr news.* 2>/dev/null`
	do
		/usr/lbin/compress -d < $FILE | /usr/lbin/rnews
		rm $FILE
	done
	FILES=`ls -tr news.* 2>/dev/null`
	if [ -z "$FILES" ]
	then
		rm LCK..news
		exit 0
	fi
done


I haven't been too thrilled with system performance or rnews article insertion
speed when there are multiple batch files being processed concurrently. We
run a lot of background daemons on our system which tend to get higher priority
than rnews during the daytime. So I've built a shell LCK-file around the
unbatching process. A second instance of cunbatch.sh finding the first still
running will just go away. If the LCK file is not present or the process is
no longer active, it will start up. A single instance of cunbatch.sh will
process all files present when it started or new ones which arrived after it
started.

I drive cunbatch.sh out of crontab hourly during the evening - it was just
too SLLOOOWWWW during the day. 'Sides, people are supposed to be working
part of the time around here, right?  I can process a day's inputs in 1-2
hours so I don't slow down the sites I feed too much.
Others who don't have those constraints could probably run cunbatch.sh nohup 
right out of cunbatch. You'd still gain the effects of not tying up UUCP 
and being able to single-thread batch file processing.

One additional benefit I derived from the LCK-file processing is that I also
"taught" expire to honor it:
============================

trap 'rm -f $LOCKFILE; trap 0; exit' 1 2 3 15
LOCKFILE=/usr/spool/uucp/.Xqtdir/LCK..news
if [ -f $LOCKFILE ]
then
	ps -p `cat $LOCKFILE` > /dev/null
	if [ $? -eq 0 ]
	then
		echo expire.sh | at now + 30 minutes >>/dev/null
		exit 0
	fi
fi

echo $$ > $LOCKFILE

PATH=/usr/lib/news:$PATH
...shortened for brevity
	expire -e 10
...
> /usr/lib/news/ohistory
rm -f $LOCKFILE

I don't have to worry about one process trying to add to the history file
while another is trying to destroy it.
-- 
David St. Pierre		{ihnp4,dual,qantel,lll-crg}!ptsfa!dsp

boyce@daemen.uucp (DWB) (03/16/86)

In article <1951@saber.UUCP>, msc@saber.UUCP (Mark Callow) writes:
> >   I understand there is some way to ensure that only one istantiation of
> > UUXQT is executed at any one time.  Can someone elaborate on this for me?
> 
> UNNNNFortunately uuxqt ignores the lock file if it is more than a
> certain age.  The default age (in 4.2 BSD's uucp at least) is 60
> minutes.  This is far too short if you are receiving batched news,
> especially if you have unbatchnews niced.

If you have the source I'd suggest you fix it that way like we did.

In file "uucp.h" BSD of course.
----------------
change X_LOCKTIME from 3600 to maybe 28800 (8 hours) or there abouts.

This would certainly remedy the problem for the next you're
bombarded with news.

-- 
Doug Boyce   Daemen College, Buffalo NY
UUCP : decvax!sunybcs!daemen!boyce  or  ihnp4!kitty!daemen!boyce
ARPA : boyce%buffalo@csnet-relay.arpa

   "The mind is a terrible thing....and most be stopped in our lifetime."