[comp.mail.sendmail] load balancing

emv@a.cc.umich.edu (Ed Vielmetti) (02/10/89)

There's a very slow X.29 connection between the VAX and the IBM 3090,
and a lot of mail traffic goes over that line.  This short-term solution
has been in place for 2 1/2 years; until the mainframe folks get the TCP/IP
boxes running & working we're stuck with it.

Originally umix did everything: name service, uucp gatewaying, netnews,
and mail.  After a disaster where mail from one dean to another dean
got delayed too long & someone missed a meeting, some attention was
paid to the situation & we got some more equipment.  There are now
three Suns in the mail ranch: a (sharkey), b (shadooby), and c (mailrus).
We're colonizing a fourth (d, or eliminator, formerly emptys).  I'd like
to describe how things are spread around.

The big source of congestion on umix was downtime elsewhere on the Internet.
Failed mail to Internet sites sat in the umix queues, competing for space
with the queueing necessary to get mail into um and ub.  To cope with this,
the sendmail.cf on umix was modified to punt mail destined for off-campus
(*.com, *.edu other than umich.edu, *.arpa, *.gov, etc.) to
mailgw.cc.umich.edu, which has an A record for 'a' and MX records for 'a'
and 'c'.  Thus downtime on the Internet wouldn't translate into big queues
& wasted CPU time on umix.

The system was also quite susceptible to extremely heavy loads immediately
after booting.  Imagine the scenario -- go down for an hour, come back up,
and 60 different systems are all trying to connect to you in the first
5 minutes you're back up.  The solution to this was relatively straightforward:
add MX records to the relay systems so that downtime or heavy loading
would cause mail to back up on-campus (on 'c') rather than off-campus,
and so that if it was necessary to meter the flow of inbound mail
you could collect most of it nearby.  This shows up deficiencies in some
mailers -- notably the IBM VM mailers, which don't support MX records --
so there's still a blast of BITNET relay mail every time the system
goes up and down.
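
For anyone unfamiliar with how the MX fallback plays out on the sending
side, a minimal Python sketch follows.  It assumes the usual rule that a
sender tries MX hosts in order of increasing preference value.  The
preference numbers (10 and 20), the function name, and the reachable()
callback are all invented for the example; the real records are not
reproduced here.

```python
# Sketch only: how an MX-aware sender picks where to deliver.
# Preference values and names are hypothetical, not the real zone data.

def choose_mx(mx_records, reachable):
    """Return the lowest-preference reachable host, or None if all are down.

    mx_records: list of (preference, host) tuples
    reachable:  function taking a host name and returning True or False
    """
    for _preference, host in sorted(mx_records):
        if reachable(host):
            return host
    return None  # nothing reachable: the sender queues and retries later


# Hypothetical records for a relay, with 'c' as the on-campus backup.
records = [(20, "c"), (10, "umix")]

normal = choose_mx(records, lambda host: True)            # primary is up
rebooting = choose_mx(records, lambda host: host == "c")  # primary is down
```

In the normal case the mail goes straight to the primary; while the
primary is down or overloaded it collects on 'c' instead of in queues
off-campus.  A mailer that ignores MX records never sees the second
entry, which is why those mailers still retry in a burst.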

All of the name service has been moved from umix to the other machines
in the ranch, almost all of the uucp connections are gone from umix,
and netnews is split between mailrus and sharkey. 

As proof of the usefulness of dividing up 'inbound' and 'outbound' relay
functions on a gateway: the day that we turned this on, the queues on
umix were pushing 1000.  After draining off all of the remote mail, the
queue on umix went down to 50, and the queue on 'a' held 50 messages
that were undeliverable because of problems off-campus.

--
Edward Vielmetti, U of Michigan Computing Center mail group
Home of MTV - Mail Television