[comp.sys.isis] pmake problem and fix

gustav@arp.anu.edu.au (Zdzislaw Meglicki) (02/21/91)

I have encountered a problem with pmake in ISIS 2.1, which I have
traced (after being coerced into it by Ken) to what looks
like a haphazard use of pointers. In the file .../isisv2.1/demos/pmk/pmkio.c
you'll find the following (line 503):

  if ((int) (stream=fopen(p_fname,"w"))< 0)
  {
    printf("%s: ", p_fname);
    perror("pmkio open error 2");
    return;
  }

and further on (line 661):

  if ((int)(stream=fopen(p_fname,"w"))< 0)
  {
    printf("%s: ", p_fname);
    perror("pmkio open error 3");
    return;
  }

Under 4.3BSD and SunOS, fopen returns a NULL pointer on failure.
The test above therefore never detects a real failure, because a NULL
pointer cast to int is 0, not negative. Conversely, if a perfectly valid
pointer happens to be negative when cast to int, a "failure" is reported
where there is none. This is exactly what happened when I ran pmake on
my systems: the files were created but left empty, because the functions
write_graph and show_run were returning prematurely.

Another problem was that printf writes to stdout, which is buffered,
while perror writes to stderr, which is not. Mixing the two resulted in
the parts of each error message appearing out of order and garbled.

The fix (at line 503; it relies on <errno.h> and <unistd.h>) is:

  if (!(stream=fopen(p_fname,"w")))
  {
    int saved_errno = errno;  /* fprintf() below may overwrite errno */
    fprintf(stderr, "%s: cannot open for writing\n", p_fname); fflush(stderr);
    errno = saved_errno;      /* so perror() reports the fopen() failure */
    perror("pmkio open error 2");
    fprintf(stderr, "uid = %d, euid = %d, gid = %d, egid = %d\n",
                    (int) getuid(), (int) geteuid(), (int) getgid(), (int) getegid());
    fflush(stderr);
    return;
  }
  else
  {
    fprintf(stderr, "%s: successfully opened for writing\n", p_fname);
    fflush(stderr);
  }

Make the same change at line 661. Being a bit of a voyeur, I also added
a message reporting each successful open, which you should probably
comment out or remove once you are sure that things work properly.
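
Since the same block is needed at both line 503 and line 661, it could
be factored into one helper. The sketch below is only an illustration,
not code from pmake: the name open_for_write and its tag argument are
hypothetical. It tests only for a NULL return, saves errno before doing
any other I/O, flushes pending stdout text first, and writes the whole
diagnostic to stderr as a single line so nothing gets interleaved.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open p_fname for writing.  On failure, print one
 * diagnostic line to stderr and return NULL.  fopen() reports failure
 * with a NULL pointer, so that is the only condition tested. */
static FILE *open_for_write(const char *p_fname, const char *tag)
{
    FILE *stream = fopen(p_fname, "w");
    if (stream == NULL) {
        int saved = errno;   /* fflush()/fprintf() below may change errno */
        fflush(stdout);      /* emit any pending stdout text first */
        fprintf(stderr, "%s: %s: %s\n", p_fname, tag, strerror(saved));
        return NULL;
    }
    return stream;
}
```

A call site could then read
`if ((stream = open_for_write(p_fname, "pmkio open error 2")) == NULL) return;`
with a different tag at line 661.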

At this stage pmake began to open graph files and write to them correctly,
but another problem emerged. In the file pmkexec.c a temporary file name
is constructed (line 668):

   strcpy(fname,(char *)tempnam(PMK_SCR,"state"));

Under SunOS (I am not sure about 4.3BSD), "/usr/tmp" is automatically
prepended to "stateAAA????" unless the user has TMPDIR defined in his or her
environment, pointing elsewhere. This made pmkexec write those graph
files to /usr/tmp. But /usr/tmp is local to each machine. When pmkexec
passed the file name to other CPUs, they could not open the files,
because the files were not in their own /usr/tmp. Only after I defined
TMPDIR to point to an area mounted on all the CPUs involved, under the same
path name, did things begin to work. I finally got pmake to execute various
compilation steps in parallel on the CPUs listed in /usr/spool/isis/sites.
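
An alternative to relying on tempnam's choice of directory is to build
the name explicitly in a directory known to be shared. The sketch below
assumes such a directory exists and is mounted under the same path on
every machine (PMK_SCR could serve, if it names a shared mount); the
function make_shared_tempname is hypothetical. It uses POSIX mkstemp,
which also creates the file, so two processes cannot be handed the same
name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build "<shared_dir>/stateXXXXXX", let mkstemp()
 * replace the X's with a unique suffix and create the file.
 * Returns 0 on success, -1 on failure. */
static int make_shared_tempname(const char *shared_dir, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s/stateXXXXXX", shared_dir);
    if (n < 0 || (size_t)n >= outlen)
        return -1;              /* name would not fit in the buffer */
    int fd = mkstemp(out);      /* rewrites out[] in place, creates file */
    if (fd < 0)
        return -1;
    close(fd);                  /* the caller only needs the name */
    return 0;
}
```

The design choice here is that the directory is an explicit argument, so
the result no longer depends on whether TMPDIR happens to be set in the
environment of whoever starts pmkexec.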

Now about pmake itself: it's a great demo, and a useful one at that.
In fact, I intend to use it in my work. I would like to suggest
some improvements, though. First, pmkexec should have
some kind of configuration file. You may have several different types
of CPU on the network, and even identical CPUs may run different versions
of the OS, or of gcc or Fortran. The user should be able to tell pmkexec
which of the CPUs listed in /usr/spool/isis/sites should be used for
compilation. Alternatively, that information could be kept in
/usr/spool/isis/sites itself. Once the eligible CPUs are selected, the
final choice among them should depend less on the number of users on each
CPU than on its load. By the way, the user limit is not 0, as described in
the manual, but 4; see line 1522 of pmkexec.c. The comment beneath that
line indeed states:

                   [...] We should either
                   check for non-idle users, or wait until a proper rexec
                   service which knows about load factors.

In the meantime you can tinker with this part of the code by hand
and make it check for specific hostnames or change the user limit (I
immediately raised it to 10).
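
The selection policy suggested above (restrict to hosts named in a
configuration file, then prefer the least loaded) can be sketched as
follows. This is not pmake code: struct host_info, pick_host and the
max_load cutoff are hypothetical, the configuration file is assumed to
have been read into an array of names already, and how the load figures
are obtained is left out entirely.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-host record; the load value (e.g. a 1-minute load
 * average) is assumed to come from some external service. */
struct host_info {
    const char *name;
    double load;
};

/* Return 1 if name is among the hosts allowed by the configuration. */
static int is_allowed(const char *name, const char *const *allowed, size_t nallowed)
{
    for (size_t i = 0; i < nallowed; i++)
        if (strcmp(name, allowed[i]) == 0)
            return 1;
    return 0;
}

/* Among allowed hosts whose load is below max_load, return the index of
 * the least loaded one in hosts[], or -1 if none qualifies. */
static int pick_host(const struct host_info *hosts, size_t nhosts,
                     const char *const *allowed, size_t nallowed,
                     double max_load)
{
    int best = -1;
    for (size_t i = 0; i < nhosts; i++) {
        if (!is_allowed(hosts[i].name, allowed, nallowed))
            continue;               /* wrong OS, compiler, etc. */
        if (hosts[i].load >= max_load)
            continue;               /* too busy to take more work */
        if (best < 0 || hosts[i].load < hosts[best].load)
            best = (int)i;
    }
    return best;
}
```

Filtering by name first keeps the "which machines are compatible"
decision separate from the "which machine is idle" decision, so either
part can be replaced later, for example by a proper load-aware rexec
service.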

Hope this helps
Gustav

-- 
   Gustav Meglicki, gustav@arp.anu.edu.au,
   Automated Reasoning Project, RSSS, and Plasma Theory Group, RSPhysS,
   The Australian National University, G.P.O. Box 4, Canberra, A.C.T., 2601, 
   Australia, fax: (Australia)-6-249-0747, tel: (Australia)-6-249-0158

ken@gvax.cs.cornell.edu (Ken Birman) (02/22/91)

In article <1991Feb21.101910@arp.anu.edu.au> gustav@arp.anu.edu.au (Zdzislaw Meglicki) writes:
> ...
>I, for that matter, intend to use it in my work. I would like to suggest
>various improvements though. In the first place, pmkexec should have
>some kind of a configuration file. You may have several different types
>of CPUs on the network. Even identical CPUs may have different versions
>of OS, or different versions of gcc or Fortran. The user should be able
>to tell pmkexec which particular CPUs of all mentioned in /usr/spool/isis/sites
>should be used for the compilation. Alternatively that information should
>be accessible through /usr/spool/isis/sites itself. Once the proper CPUs 
>are selected, the final decision as to which should be used should depend 
>not so much on the number of users on the given CPU but on the load. By the 
>way, the number of users is not 0 as described in the manual but 4, see line
>1522 of pmkexec.c. The note in the comment beneath that line indeed states:
>
>                   [...] We should either
>                   check for non-idle users, or wait until a proper rexec
>                   service which knows about load factors.
>
>In the meantime you can tinker with this particular part of the code by hand
>and make it check for specific hostnames or change numbers of users (I put
>it up immediately to 10).

These are good points.  Pmake was developed by a Masters student who
did his work and left Cornell to return to HP two years ago; since then,
nobody has worked on the program at all.

I hope that you, or other ISIS/pmake users, will consider adapting the
program to use the V3.0 network resource manager (i.e. to select machines
to run on) and perhaps to extend its load balancing policies as you
suggest.  We will be happy to help out if you run into more problems.

The network resource manager is a "proper rexec service which knows about 
load factors", although it might need some extensions if pmake needs to get
access to those factors.  This is just the sort of thing we might be
willing to add, as long as the architecture of the resulting system is
clean and simple... and the extended pmake code becomes available to
other ISIS users, of course.

Ken