martin@felix.uucp (Martin McKendry) (04/19/88)
If I had an acceptable approach to the following problem, I could build a lot simpler algorithms at the next layer up. Maybe someone has thought about this before. I have a distributed system, typically 20-40 machines, but it can be as high as 200 and as low as 1 (or zero). Some of these machines have disk, but most do not. I need to keep a single integer for the entire system that increases monotonically. Call this an 'epic counter'. The purpose of the epic counter is to control various kinds of versions -- crash counts & versions of a recovery/watchdog mechanism, in particular. The catch is that I don't want to use any permanent storage to do this. And I don't want to depend on any particular machine, or be 'n-way' fault tolerant, for any value of 'n'. It must always be possible to get a new value of the epic counter issued (e.g., to 'timestamp' this re-boot of a workstation). Even if there is only one station up, this must be possible. I'm open to the semantics in the case of partitions. If we ever get all stations down, it is OK for the first one up to restart the counter (but not if he is merely partitioned temporarily). I'm looking at various forms of election, etc., but I have the added constraint that there cannot be any rigid notion of a quorum -- we have to work out 'who is up' by timeouts. We do know the maximal set (at least right now), but I'd prefer not to use that. I think it unlikely that we are the first to encounter this problem. Any suggestions or pointers? Maybe Knuth Book 15, Chapter 9, Page 13598 has a solution? -- Martin S. McKendry; FileNet Corp; {hplabs,trwrb}!felix!martin Strictly my opinion; all of it