[comp.sys.isis] roll your own supercomputer

ken@gvax.cs.cornell.edu (Ken Birman) (07/29/90)

> From: carey@cs.wisc.edu (Michael Carey)
> To: ken@gvax.cs.cornell.edu
> Subject: RE: roll your own supercomputer
> Status: R

> Ken,

> I saw your note on the net.  Just FYI, there is actually a facility of
> that nature - not ISIS-based, but quite effective - available (at no charge
> for universities, I'm pretty sure) from the University of Wisconsin.  It's
> called Condor, and currently manages jobs for about 170 workstations (from
> Sun, DEC, IBM, and HP) here.  Folks in our dept who do simulation studies
> rely heavily on it as a way to get lots of CPU days in a short time;  folks
> who do things like explore large search spaces (e.g., to understand what the
> search space looks like for optimizing very large join queries) often run
> many-hour programs on it.  It does periodic checkpointing, and it hops off
> a workstation when the workstation's owner returns.  If you're interested
> in it, or know of folks who would be, it's supported here by Mike Litzkow
> (mike@cream.cs.edu);  he's one of the dept's research programmers.  There
> was a paper about it in the 8th ICDCS Conference (in 1988) called
> "Condor - A Hunter of Idle Workstations" (by Litzkow, Livny, and Mutka).

... I am aware of Condor, but just in case other readers of this
group are interested, I am posting this message.

My feeling is that Condor makes a lot of assumptions about why people
are trying to manage the resources in their machine and what it means
to schedule a task.  Although quite nice for the simulation work being
done at Wisconsin, many applications would have problems with I/O performance
degradation by factors of 2-3, and the Condor concept of job checkpointing
is also very specific to the type of jobs Wisconsin is running on the
system.  Also, I have the impression that Condor isn't very fault-tolerant,
but I could be out of touch with the most recent release of this
system.

I would be more interested in seeing a "resource management tool" 
on which more specific solutions such as Condor could be layered.

Anyhow, thanks for the pointer!

bin@primate.wisc.edu (Brain in Neutral) (07/31/90)

From article <43872@cornell.UUCP>, by ken@gvax.cs.cornell.edu (Ken Birman):
> 
> My feeling is that Condor makes a lot of assumptions about why people
> are trying to manage the resources in their machine and what it means
> to schedule a task.  Although quite nice for the simulation work being
> done at Wisconsin, many applications would have problems with I/O performance
> degradation by factors of 2-3, and the Condor concept of job checkpointing
> is also very specific to the type of jobs Wisconsin is running on the
> system.  Also, I have the impression that Condor isn't very fault-tolerant,
> but I could be out of touch with the most recent release of this
> system.

Mostly correct.  Recently the CS department moved a lot of its machines
from one building to another.  Condor jobs that were started before the
move resumed and completed after the move.  (I don't know if this involved
movement of the Condor admin machines or not.  If so, this is that much
more impressive.)

Paul DuBois
dubois@primate.wisc.edu