ken@gvax.cs.cornell.edu (Ken Birman) (07/29/90)
> From: carey@cs.wisc.edu (Michael Carey)
> To: ken@gvax.cs.cornell.edu
> Subject: RE: roll your own supercomputer
> Status: R
>
> Ken,
> I saw your note on the net. Just FYI, there is actually a facility of
> that nature - not ISIS-based, but quite effective - available (at no charge
> for universities, I'm pretty sure) from the University of Wisconsin. It's
> called Condor, and currently manages jobs for about 170 workstations (from
> Sun, DEC, IBM, and HP) here. Folks in our dept who do simulation studies
> rely heavily on it as a way to get lots of CPU days in a short time; folks
> who do things like explore large search spaces (e.g., to understand what the
> search space looks like for optimizing very large join queries) often run
> many-hour programs on it. It does periodic checkpointing, and it hops off
> a workstation when the workstation's owner returns. If you're interested
> in it, or know of folks who would be, it's supported here by Mike Litzkow
> (mike@cream.cs.edu); he's one of the dept's research programmers. There
> was a paper about it in the 8th ICDCS Conference (in 1988) called
> "Condor - A Hunter of Idle Workstations" (by Litzkow, Livny, and Mutka).
...

I am aware of Condor, but just in case other readers of this group are interested, I am posting this message.
My feeling is that Condor makes a lot of assumptions about why people are trying to manage the resources in their machine and what it means to schedule a task. Although quite nice for the simulation work being done at Wisconsin, many applications would have problems with I/O performance degradation by factors of 2-3, and the Condor concept of job checkpointing is also very specific to the type of jobs Wisconsin is running on the system. Also, I have the impression that Condor isn't very fault-tolerant, but I could be out of touch with the most recent release of this system.
I would be more interested in seeing a "resource management tool" on which more specific solutions such as Condor could be layered. Anyhow, thanks for the pointer!
bin@primate.wisc.edu (Brain in Neutral) (07/31/90)
From article <43872@cornell.UUCP>, by ken@gvax.cs.cornell.edu (Ken Birman):
>
> My feeling is that Condor makes a lot of assumptions about why people
> are trying to manage the resources in their machine and what it means
> to schedule a task. Although quite nice for the simulation work being
> done at Wisconsin, many applications would have problems with I/O performance
> degradation by factors of 2-3, and the Condor concept of job checkpointing
> is also very specific to the type of jobs Wisconsin is running on the
> system. Also, I have the impression that Condor isn't very fault-tolerant,
> but I could be out of touch with the most recent release of this
> system.

Mostly correct. Recently the CS department moved a lot of its machines from one building to another. Condor jobs that were started before the move resumed and completed after the move. (I don't know if this involved movement of the Condor admin machines or not. If so, this is that much more impressive.)

Paul DuBois
dubois@primate.wisc.edu