[comp.unix.questions] Backups Query

shawn@mit-eddie.MIT.EDU (Shawn F. Mckay) (03/06/87)
Howdy. I have a few ideas I'd like people to ponder, and perhaps lend a
hand with. I'm very open to totally new ideas as well. The basic idea is
to make backups at our site easy, simple, and reliable.

Thanks for any help.

Ideas for system backups;

The problem --

	We have 1 main system (a VAX with RA disks) and two clusters of little
systems. (Different types, but they look like workstations.) Within these
clusters, there are critical machines and non-critical machines.

The critical machines need to be backed up nightly, and the
non-critical machines need to be backed up weekly or on
user request. (Being development machines, it would be nice to have them
backed up when people have done some amount of work, rather than all the time.)

Resources --

We have two operations people, who have to make sure other things keep
working as well. That limits the time spent on backups to something
less than 50%. (It would be nice to chop that number down below 10%.)

We have as many mag tapes as it takes, but it would be nice to cut this
number down as well, so backups take less time.

                 ---- The solutions so far ----

Backup type (a)

Procedure --

	o Do a full dump of the main system once a week, with incrementals
	  done each day.
	  (Difference being mainly physical/method)
	o Do a full backup of all remotes each week (some each day), and
	  incrementals each night.

Pros --
	o Given the systems listed below, I can't really see any.

Cons --
	o Great in number and size. I think they are obvious, the main ones
	  being number of tapes, man-hours, and CPU usage.

Backup type (b)

Procedure --

	o The main system has two sets of file systems (primary/all).
	  The primary set is backed up weekly, and the whole system
	  (i.e. 'all') is backed up monthly. Incrementals are done on
	  all file systems at level 9 daily. If they expand to more than
	  1 tape, a full dump of all file systems becomes needed.

	o The remote systems each have their own cron script to initiate
	  backups to their individual tape drives, and do so on a regular
	  basis as needed by that particular system, reporting errors
	  to a human, but otherwise being quiet. (This is for incrementals;
	  we still need to save full dumps for more than a night.)

	o Critical remotes may optionally send some data to our main system,
	  or perhaps shadow something in compressed format to another remote.
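
A rough sketch of how the main system's schedule in (b) could be decided
each night. The calendar (monthly full on the 1st, weekly full on Sunday)
and the tape capacity are assumptions for illustration, not part of the
scheme above; only the level 9 daily incremental and the "more than 1 tape
forces a full dump" rule come from it.

```python
from datetime import date

# Assumed capacity of one tape in bytes; adjust for the real drive.
TAPE_CAPACITY_BYTES = 40 * 1024 * 1024


def choose_dump(today, incremental_bytes, tape_capacity=TAPE_CAPACITY_BYTES):
    """Return (dump_level, file_system_set) for the main system tonight."""
    # An incremental that no longer fits on one tape forces a full dump
    # of all file systems.
    if incremental_bytes > tape_capacity:
        return (0, "all")
    # Assumed calendar: whole system on the 1st of the month.
    if today.day == 1:
        return (0, "all")
    # Assumed calendar: primary set every Sunday (weekday() == 6).
    if today.weekday() == 6:
        return (0, "primary")
    # Otherwise the daily level 9 incremental on all file systems.
    return (9, "all")
```

A cron job could call choose_dump() with the size reported by a dry run
and hand the resulting level to whatever dump program is in use.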

Pros	--
	o We cut out a great deal of human intervention
	o We gain reliable, tested, backups.
	o Done at night, so minimal cpu loss
	o Done with a tape for remotes, so minimal tape use,
	  except for full dumps, which still have to get their own tape.

Cons	--
	o The tape drive(*s*) must work
	o Each machine MUST have its own drive.
	o The need for high quality tapes comes up fast, since
	  they will be left sitting in the drive all day in most cases.
	o The potential exists to write over a good tape which someone
	  didn't remember to swap out of the drive.
	o The potential exists for someone to forget to put a tape in the
	  drive on a critical system.

	(I'm sure if I want to nitpick, I could add more lines to this).

Backup type (c)

Procedure --

	o Procedure is complex, I'll explain by players -

	Main system will be called master, and remotes slaves. (Original eh?)

	Master will query slaves each night to ask them for an idea
	of how much data has been modified that day, and the total number of
	bytes each needs to have saved.

	When the slave replies with a number, the master then decides if this
	is a full or incremental save time, based on knowing how much data
	is reasonable to save with an incremental. I have always felt that if
	you have to save more than a third of the disk with an incremental, it's
	time for a full dump. (This makes it easier to restore.)
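
The one-third rule of thumb can be written down directly. This is only a
sketch; the function name and arguments are made up for illustration, and
the threshold is the opinion stated above rather than a measured value.

```python
def choose_save_type(modified_bytes, disk_bytes):
    """Return 'full' or 'incremental' for one slave's nightly report."""
    if disk_bytes <= 0:
        raise ValueError("disk_bytes must be positive")
    # "More than a third of the disk" means strictly greater; integer
    # arithmetic avoids rounding trouble right at the boundary.
    if modified_bytes * 3 > disk_bytes:
        return "full"
    return "incremental"
```

For a 300 MB disk, 100 MB of changes still counts as an incremental, and
anything above that tips over to a full dump.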

	The master then has several options, based on how each system was
	backed up last.

		a) Save the incremental data to its drive somewhere,
		   or to a designated host on the cluster to store such
		   information. I'll call this type of host a 'buddy',
		   since it would be saving information for its buddy.
		   Every system on the cluster is a buddy for at most 2
		   systems, but it could be any 2 systems based on how
		   much space that buddy has left.

		b) (B was in a, wasn't it? Oh well).

		c) The master could also decide to save data to its own
		   tape drive, which works well as an option, but would
		   probably be an 'overflow' option more than a regular
		   option.
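
Here is one way the master might pick a buddy under the "at most 2
systems" rule. Preferring the host with the most free space is an
assumption (any host with enough room would satisfy the description), and
the data structures are hypothetical.

```python
MAX_CLIENTS_PER_BUDDY = 2  # "a buddy for at most 2 systems"


def pick_buddy(source, needed_bytes, free_bytes, clients):
    """Pick a host to hold `source`'s incremental, or None if none fits.

    free_bytes: dict of host name -> free disk space in bytes
    clients:    dict of host name -> set of hosts it already stores data for
    """
    candidates = []
    for host, free in free_bytes.items():
        if host == source or free < needed_bytes:
            continue  # a host cannot be its own buddy, and needs the room
        held = clients.get(host, set())
        # A host may keep serving a system it already holds, or take on
        # a new one while it has fewer than two.
        if source in held or len(held) < MAX_CLIENTS_PER_BUDDY:
            candidates.append(host)
    if not candidates:
        return None  # caller falls back to the master's own tape (option c)
    # Assumed tie-breaker: the candidate with the most free space.
    return max(candidates, key=lambda host: free_bytes[host])
```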

	A lot of what will make this system better than most is the master
	slave relationship. For example, if master tells slave 'save level 9
	to /dev/mt0' (for its tape drive), and slave says 'cant-offline',
	then the master can reissue the level 9 save the next way, by
	saying something like 'save level 9 remote host', where host is the
	name of a buddy that master has checked has the space for it,
	and then the slave sends its level 9 dump image.
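
The retry behaviour in that exchange can be sketched as a loop over
fallback targets. The command strings and the 'cant-offline' reply follow
the example above; the 'ok' reply, the send() callback, and fake_slave()
are assumptions standing in for whatever transport and slave program
would really be used.

```python
def run_backup(send, level, targets):
    """Offer each target to the slave in order until one is accepted.

    send(command) must return the slave's reply as a string.
    Returns (accepted_target_or_None, log of (command, reply) pairs).
    """
    log = []
    for target in targets:
        command = "save level %d %s" % (level, target)
        reply = send(command)
        log.append((command, reply))  # strong/clear event logging
        if reply == "ok":             # assumed success reply
            return target, log
    return None, log


def fake_slave(command):
    """Stand-in slave whose tape drive is offline."""
    if "/dev/mt0" in command:
        return "cant-offline"
    return "ok"
```

Calling run_backup(fake_slave, 9, ["to /dev/mt0", "remote buddy1"]) tries
the tape first, gets 'cant-offline', and then settles on the buddy, with
both attempts recorded in the log.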

Pros --
	o As stated above, it's got a stronger will to work.
	o Less human intervention, although to make sure people
	  know where everything is, it must have strong/clear
	  event logging.
	o Fewer tapes, since it only uses tapes as part of a cycle of
	  data preservation, making it very hard to lose a great deal
	  of anything, since if the tape dies, it might be on a buddy,
	  or the master may have a copy.
	o Automation, just ask the master to get you the latest copy of
	  file 'x' from system 'y', and let it deal with where it
	  put the file.
	o Room for expansion, if you add a new remote, or new type of remote,
	  all you have to do is define it to the master, in what should be
	  a simple text database, and write the interface for the remote.
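
To give the "simple text database" some shape, here is a sketch of a
parser for one possible layout. The four-column format (name, type,
priority, tape device, with '-' for no drive) is entirely made up for
illustration; nothing above fixes the real layout.

```python
def parse_hosts(text):
    """Parse a hypothetical host database into a dict keyed by host name."""
    hosts = {}
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError("line %d: expected 4 fields, got %d"
                             % (lineno, len(fields)))
        name, kind, priority, device = fields
        if priority not in ("critical", "noncritical"):
            raise ValueError("line %d: bad priority %r" % (lineno, priority))
        hosts[name] = {
            "type": kind,
            "critical": priority == "critical",
            "device": None if device == "-" else device,
        }
    return hosts


SAMPLE = """\
# name  type   priority     tape
ws1     typeA  critical     /dev/mt0
ws2     typeB  noncritical  -
"""
```

Adding a new remote is then one more line in the file, plus the interface
program for that type of machine.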

Cons --
	o A system this nice has bugs, and takes a while to write, if you
	  want to do a good job.
	o Once up, it would require people to read the manual.
	o If the master is down, all hell breaks loose, right?
	  Wrong. As I didn't remember to mention (and at 300 baud, will
	  mention right here), if the master goes down, 1 node from each
	  cluster should have a copy of the main system's database, and
	  a program to allow it to become an emergency master (but not
	  a long term master, because this would lead to chaos).
	o I'm sure there are more, but that's why I'm asking for comments.

Backup type (d)

------------------------------------------------------------

	** This space left blank for your very welcomed ideas **

------------------------------------------------------------


Final comments;

I would also like to use data compression in some step before data gets
written out to a real tape. It's unclear what the tradeoffs are. I
would expect that losing a bit on a highly compressed tape would be
a problem, while a low compression method would be useless, so comments
here are welcome also.
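
One possible middle ground on the lost-bit worry is to compress the data
in independent chunks, so that damage costs one chunk rather than
everything after it. The sketch below uses Python's zlib purely as a
stand-in compressor, and the 64 KB chunk size is an arbitrary assumption;
smaller chunks limit the loss but usually compress a little worse.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size


def compress_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress each chunk on its own so chunks can be decoded separately."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def recover_chunks(chunks):
    """Decompress what can be read; return (data, indexes of bad chunks)."""
    good, bad = [], []
    for index, blob in enumerate(chunks):
        try:
            good.append(zlib.decompress(blob))
        except zlib.error:
            bad.append(index)  # damaged chunk: skip it and keep going
    return b"".join(good), bad
```

With four chunks on tape and one of them damaged, recover_chunks() still
returns the other three and reports which one was lost.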

				Thanks in advance,
				  -- Shawn

Reply paths;
----------------------------------------
Usenet: mit-eddie!shawn, think!ima!haddock!shawnm
Arpanet: Shawn at Mit-Mc, Shawn at Mit-Ai
Internet: shawn@eddie.mit.edu, shawn@borax.lcs.mit.edu
Chaosnet: Shawn@Mit-eecs, Shawn@Mit-eddie