turner@Apple.COM (Douglass Turner) (03/21/90)
Here is the collection of responses I got from the various people kind enough to reply to my query about doing parallel ray tracing on the IRIS.

/////////////////////////////////////////////////////////////////////////

This is from George Drettakis:

I'm just finishing my MSc here, and I designed and implemented a hierarchical scheme that uses several IRISes connected over an ethernet. The rendering algorithm is a global illumination approach developed locally that goes one step further than radiosity and ray-tracing combined. The approach is based on space subdivision that splits the original 3-D scene into cubes adaptively, depending on the difficulty in each volume. I assign a group of sub-volumes to each IRIS, and then do the processing for the volume on each IRIS in parallel. This actually includes a bit of ray-casting that we do for various purposes in our algorithm.

I use the taskcreate/taskdestroy calls to implement this, and the usnewsema and usnewlock calls for locking and synchronisation. The man pages are pretty straightforward, but if you have trouble I can put together a sample program and send it to you. I also found the m_get_myid function from the "sequent compatibility library" useful, as it gives you a unique 1 to n-1 proc id, for n procs.

For ray-tracing, all you need to do is split the screen, give each pixel to a processor, and you're done. The key is to avoid using any global variables; speaking from painful experience, other people's code is full of it, no pun intended, as they didn't have parallelism in mind.

	George Drettakis			(416) 978 5473
	Dynamic Graphics Project		Internet: dret@dgp.toronto.edu
	Computer Systems Research Institute	Bitnet: dret@dgp.utoronto
	Toronto, Ontario M5S 1A4, CANADA

/////////////////////////////////////////////////////////////////////////

This is from Steve Lamont:

I use brute force, forking off multiple processes.
This is probably not optimum, but is guaranteed to work on a number of platforms, including a Cray Y-MP running UNICOS (no shared memory between processes )-: ), a Stardent Titan, a Convex C-220, and an SGI 4D/280GTX. If you're interested, I'll be glad to send you the code -- it is based on the MTV ray tracer, with extensions.

	Steve Lamont, sciViGuy			(919) 248-1120
	EMail: spl@ncsc.org
	NCSC, Box 12732, Research Triangle Park, NC 27709

//////////////////////////////////////////////////////////////////

This contribution is from Gavin at SGI (not Gavin Miller!). In it he mentions a demo tape that IRIS users can purchase ($100.00) that contains source code for numerous demos, including the "flyray" demo he discusses. I haven't gotten my copy yet but intend to soon. The demos and source code for other interesting-looking stuff are listed in a pamphlet you can get from SGI (it's free) called "IRIS software exchange". I recommend it.

Have you seen the 'flyray' interactive ray-tracer demo running on a GTX? It used to be a two-processor ray-tracer (one processor computing ray/scene intersections, the other feeding the graphics pipe) until I fully parallelized it. It will now use one processor to send scene data (showing the ray-trace going on) to the geometry subsystem, and the other N-1 to compute the ray/scene intersections. (If running over the Ethernet using the Distributed Graphics Library, all processors compute the ray/scene intersections, with an extra process called every once in a while to send the results down the wire.)

I used the sproc() system call to fork off processes, and set up a queue which the main program (responsible for displaying the bitmap produced and showing the rays being traced) reads from and all the child processes write into. This queue is protected using the hardware lock feature of GTX systems (see the usinit, usnewlock, ussetlock, and usunsetlock manual pages).
Rays are traced in blocks of up to 256 at a time (16x16 chunks), which cuts down on the overhead associated with accessing the queue. The program runs amazingly fast on an 8-processor system, but could benefit from tracing even larger chunks.

Source code to the flyray program is available from Silicon Graphics; Monica Schulze is the person to contact (she handles distribution of demo source code). Try calling her at 415-962-3320.

For your application, you probably don't need to do any inter-process synchronization; each process will be busy computing its quarter of the image and storing it in memory. You might want to look at the m_fork() manual page; your application will look something like:

	initialize_stuff();	/* create the voxel space, etc. */
	m_fork(child_process);
	m_kill_procs();
	save_image();		/* or whatever you want to do with it */
	exit();

... and each child_process does:

	int mynum = m_get_myid();
	int np = m_get_numprocs();
	/* ... based on mynum and np, do part of the screen here ... */
	return;

The m_fork commands are nice because you get automatic synchronization (the m_fork call returns after all child processes are finished), and it figures out how many processors the system has for you.

Re: the Distributed Graphics Library: Yes, it is very easy to use; instead of linking with -lgl_s, you link with -ldgl (and possibly some other libraries, depending on your networking situation; usually you need to add -lsun -lbsd also). After doing a little setup (you have to tell inetd to listen for DGL requests on a particular port number and start up the DGL daemon), you put your program on the machine that will execute it, and rlogin to that machine from the SGI machine on which you want the graphics to appear. Then run the program. All the graphics get sent over the network and appear on the machine you are sitting at. However, ethernet speeds are fairly slow, so if you do a lot of drawing you can easily overwhelm the ethernet bandwidth and slow down.
The best way around this is to use the GL display-list commands; the object will be stored on the display machine, and only one call needs to be sent over the network to get it displayed. Of course, as networks get faster this will become unnecessary.

Feel free to redistribute any of this information.

	--gavin
	gavin%krypton@sgi.com

/////////////////////////////////////////////////////////////////

This is source code from Michael John Muuss. I haven't looked at it closely. His email path is mike@brl.mil

/*
 *			M A C H I N E . C
 *
 *  Machine-specific routines for doing raytracing.
 *  Includes parallel and vector support, replacement library routines, etc.
 *
 *  Author -
 *	Michael John Muuss
 *
 *  Source -
 *	SECAD/VLD Computing Consortium, Bldg 394
 *	The U. S. Army Ballistic Research Laboratory
 *	Aberdeen Proving Ground, Maryland  21005
 *
 *  Copyright Notice -
 *	This software is Copyright (C) 1987 by the United States Army.
 *	All rights reserved.
 */
#include <stdio.h>
#include <ctype.h>
#include <math.h>
#include "machine.h"
#include "vmath.h"
#include "raytrace.h"
#include "./debug.h"

#ifdef CRAY
# include <sys/category.h>
# include <sys/resource.h>
# include <sys/types.h>
#endif

#ifdef CRAY2
#undef MAXINT
# include <sys/param.h>
#endif

#ifdef HEP
# include <synch.h>
# undef stderr
# define stderr stdout
#endif /* HEP */

#ifdef alliant
# include <cncall.h>
#endif

#if (defined(sgi) && defined(mips) && !defined(SGI4D_Rel2))
# define SGI_4D	1
# include <sys/types.h>
# include <sys/prctl.h>
# include <ulocks.h>
static char	*lockfile = "/usr/tmp/rtmplockXXXXXX";
static usptr_t	*lockstuff = 0;
#endif /* SGI_4D */

#ifdef ardent
# include <thread.h>
#endif

/*
 *			R T _ P R I _ S E T
 *
 *  Without knowing what the current UNIX "nice" value is,
 *  change to a new absolute "nice" value.
 */
void
rt_pri_set(nval)
int	nval;
{
	int	opri, npri, chg;
	int	bias = -20;

#ifdef BSD
#define	PRIO_PROCESS	0	/* From /usr/include/sys/resource.h */
	opri = getpriority( PRIO_PROCESS, 0 );
	setpriority( PRIO_PROCESS, 0, nval );
	npri = getpriority( PRIO_PROCESS, 0 );
#else
	/* SysV version, with variable bias.  Some standard. */
#ifdef CRAY
	bias = 0;
#endif
	opri = nice(0) - bias;
	chg = nval - opri;
	(void)nice(chg);
	npri = nice(0) - bias;
	if( npri != nval )
		rt_log("SysV error: wanted nice %d! check bias=%d\n", nval, bias );
#endif
	rt_log("Priority changed from %d to %d\n", opri, npri);
}

/*
 *			R T _ C P U G E T
 *
 *  Return the current CPU limit, in seconds.
 *  Zero or negative return indicates that limits are not in effect.
 */
int
rt_cpuget()
{
#ifdef CRAY
	long	old;		/* 64-bit clock counts */
	extern long limit();

	if( (old = limit( C_PROC, 0, L_CPU, -1 )) < 0 )  {
		perror("CPU limit(get)");
	}
	if( old <= 0 )
		return(999999);		/* virtually unlimited */
	return( (old + HZ - 1) / HZ );
#else
	return(-1);
#endif
}

/*
 *			R T _ C P U S E T
 *
 *  Set CPU time limit, in seconds.
 */
void
rt_cpuset(sec)
int	sec;
{
#ifdef CRAY
	long	old;		/* seconds */
	long	new;		/* seconds */
	long	newtick;	/* 64-bit clock counts */
	extern long limit();

	old = rt_cpuget();
	new = old + sec;
	if( new <= 0 || new > 999999 )
		new = 999999;	/* no limit, for practical purposes */
	newtick = new * HZ;
	if( limit( C_PROC, 0, L_CPU, newtick ) < 0 )  {
		perror("rt_cpuset: CPU limit(set)");
	}
	rt_log("Cray CPU limit changed from %d to %d seconds\n",
		old, newtick/HZ );

	/* Eliminate any memory limit */
	if( limit( C_PROC, 0, L_MEM, 0 ) < 0 )  {
		/* Hopefully, not fatal if memory limits are imposed */
		perror("rt_cpuset: MEM limit(set)");
	}
#endif
}

/*
 *			R T _ A V A I L _ C P U S
 *
 *  Return the maximum number of physical CPUs that are considered to be
 *  available to this process now.
 */
int
rt_avail_cpus()
{
	int	ret = 1;

#ifdef SGI_4D
	ret = prctl(PR_MAXPPROCS);
#	define	RT_AVAIL_CPUS
#endif

#ifdef alliant
	long	memsize, ipnum, cenum, detnum, attnum;

	lib_syscfg( &memsize, &ipnum, &cenum, &detnum, &attnum );
	ret = attnum;		/* # of CEs attached to parallel Complex */
#	define	RT_AVAIL_CPUS
#endif

#ifndef RT_AVAIL_CPUS
	ret = DEFAULT_PSW;
#endif
	return( ret );
}

#ifdef CRAY
struct taskcontrol {
	int	tsk_len;
	int	tsk_id;
	int	tsk_value;
} taskcontrol[MAX_PSW];
#endif

#if defined(unix)
/*
 *  Cray is known to wander among various pids, perhaps others.
 */
# define	CHECK_PIDS	1
#endif

/*
 *			R T _ P A R A L L E L
 *
 *  Create 'ncpu' copies of function 'func' all running in parallel,
 *  with private stack areas.  Locking and work dispatching are
 *  handled by 'func' using a "self-dispatching" paradigm.
 *  No parameters are passed to 'func', because not all machines
 *  can arrange for that.
 *
 *  This function should not return control until all invocations
 *  of the subroutine are finished.  The caller should double-check
 *  this, using cooperation with 'func'.
 *
 *  Don't use registers in this function.  At least on the Alliant,
 *  register context is NOT preserved when exiting the parallel mode,
 *  because the serial portion resumes on some arbitrary processor,
 *  not necessarily the one that serial execution started on.
 *  The registers are not shared.
 */
void
rt_parallel( func, ncpu )
void	(*func)();
int	ncpu;
{
#if defined(PARALLEL)
#if defined(alliant) && !__STDC__
	register int	d7;		/* known to be in d7 */
	register int	d6 = ncpu;	/* known to be in d6 */
#endif
	int	x;
	int	new;
#ifdef CHECK_PIDS
	int	pid = getpid();
#endif

	if( rt_g.debug & DEBUG_PARALLEL )
		rt_log("rt_parallel(0x%x, %d)\n", func, ncpu );

#ifdef HEP
	for( x=1; x<ncpu; x++ )  {
		/* This is more expensive when GEMINUS>1 */
		Dcreate( *func );
	}
	(*func)();	/* avoid wasting this task */
#endif /* HEP */

#ifdef CRAY
#if 0
	/* Try to give up processors as soon as they are unneeded */
	new = 0;
	TSKTUNE( "DBRELEAS", &new );
#endif
	/* Create any extra worker tasks */
	for( x=1; x<ncpu; x++ ) {
		taskcontrol[x].tsk_len = 3;
		taskcontrol[x].tsk_value = x;
		TSKSTART( &taskcontrol[x], func );
	}
	(*func)();	/* avoid wasting this task */
	/* There needs to be some way to kill the tfork()'ed processes here */
	/* Wait for them to finish */
	for( x=1; x<ncpu; x++ ) {
		TSKWAIT( &taskcontrol[x] );
	}
#endif

#if alliant
# if defined(__STDC__)	/* fxc defines it == 0 !! */
#  undef __STDC__
#  define __STDC__	2

	/* Calls func in parallel "ncpu" times */
	concurrent_call(CNCALL_COUNT|CNCALL_NO_QUIT, func, ncpu);
# else
	{
		asm("	movl	d6,d0");
		asm("	subql	#1,d0");
		asm("	cstart	d0");
		asm("super_loop:");
		(*func)();	/* d7 has current index, like magic */
		asm("	crepeat	super_loop");
	}
# endif
#endif

#ifdef convex
	/*$dir parallel */
	for( x=0; x<ncpu; x++ )  {
		(*func)();
	}
#endif /* convex */

#ifdef ardent
	/* The stack size parameter is pure guesswork */
	parstack( func, 1024*1024, ncpu );
#endif /* ardent */

#ifdef SGI_4D
	for( x = 1; x < ncpu; x++)  {
		/*
		 *  Ensure that we are attached to the shared arena
		 *  used for the ussetlock() calls in RES_ACQUIRE().
		 *  Locks will always be at the same virtual address.
		 */
		if( usinit(lockfile) == 0 )  {
			fprintf(stderr,
				"rt_parallel: usinit(%s) failed, unable to allocate lock space\n",
				lockfile);
			exit(2);
		}
		/*
		 *  Start a share-group process, sharing ALL resources.
		 *  This direct sys-call can be used because none of the
		 *  task-management services of, eg, taskcreate() are needed.
		 */
		if( (new = sproc( func, PR_SALL, 0 )) < 0 )
			rt_log("machine.c/rt_parallel(): sproc failed\n");
	}
	(*func)();
	{
		int	pstat;
		for ( x = 1; x < ncpu; x++)
			wait(&pstat);
	}
#endif /* sgi */

	if( rt_g.debug & DEBUG_PARALLEL )
		rt_log("rt_parallel() complete, now serial\n");

#ifdef CHECK_PIDS
	/*
	 *  At this point, all multi-tasking activity should have ceased,
	 *  and we should be just a single UNIX process with our original
	 *  PID and open file table (kernel struct u).  If not, then any
	 *  output is going to be written into the wrong file.
	 */
	if( pid != (x=getpid()) )  {
		rt_log("rt_parallel: PID changed from %d to %d, open file table may be botched!\n",
			pid, x );
	}
#endif
#else /* PARALLEL */
	rt_log("rt_parallel( x%x, %d. ):  Not compiled for PARALLEL machine\n",
		func, ncpu );
#endif /* PARALLEL */
}

static int	lock_usage[12];		/* Lock usage counters */
static int	lock_spins[12];		/* Lock spin counters */
static int	lock_waitloops[12];	/* # passes through s/w wait loop */
static int	lock_busy[12];		/* !0 => lock is busy */
static char	*lock_title[12] = {
	"syscall",
	"worker",
	"stats",
	"results",
	"model refinement",
	"???"
};

/*
 *			R T _ P R _ L O C K _ S T A T S
 */
void
rt_pr_lock_stats()
{
	register int	i;
#ifdef SGI_4D
	register int	*p = &rt_g.res_syscall;
	ulock_t		ltp;
	lockmeter_t	meter;
	lockdebug_t	deb;
#endif

	fprintf(stderr, "\nParallel processor interlock summary\n\
 # of Uses   # Spins Waitloops Purpose\n");
	for( i=0; i<3; i++ )  {
		if(lock_usage[i] == 0)  continue;
		fprintf(stderr,"%10d%10d%10d %s%s\n",
			lock_usage[i], lock_spins[i], lock_waitloops[i],
			lock_title[i],
			lock_busy[i] ?
				"**STILL LOCKED**" : "" );
#ifdef SGI_4D
		ltp = (ulock_t) p[i];
		if( usctllock( ltp, CL_METERFETCH, &meter ) == 0 )  {
			fprintf(stderr,
				"\t\tlm_spins=%d, lm_tries=%d, lm_hits=%d\n",
				meter.lm_spins, meter.lm_tries, meter.lm_hits );
		} else
			perror("usctllock CL_METERFETCH");
		if( usctllock( ltp, CL_DEBUGFETCH, &deb ) == 0 )  {
			if( deb.ld_owner_pid != -1 )  {
				fprintf(stderr, "\t\tERROR, ld_owner_pid=%d\n",
					deb.ld_owner_pid );
			}
		} else
			perror("usctllock CL_DEBUGFETCH");
#endif
	}
}

#ifdef CRAY
/*
 *  Interface to Cray-specific Multi-Tasking library
 */
void
RES_INIT(p)
register int	*p;
{
	if( !rt_g.rtg_parallel )  return;
#ifdef PARALLEL
	LOCKASGN(p);
#endif
}
void
RES_ACQUIRE(p)
register int	*p;
{
	register int	i = p - (&rt_g.res_syscall);

	if( !rt_g.rtg_parallel )  return;
	if( i < 0 || i > 12 )  {
		fprintf(stderr, "RES_ACQUIRE(0%o)? %d?\n", p, i);
		abort();
	}
#ifdef PARALLEL
	LOCKON(p);
#endif
	lock_usage[i]++;	/* interlocked */
}
void
RES_RELEASE(p)
register int	*p;
{
	if( !rt_g.rtg_parallel )  return;
#ifdef PARALLEL
	LOCKOFF(p);
#endif
}
#endif /* CRAY */

#ifdef CRAY1
/* Horrid hack to correct multi-tasking lib botch of calling f_exit(), on XMP */
f_exit()
{
	_exit(0);
}
#endif

#ifdef ardent
/*
 *  Interface to Ardent synchronization hardware
 */
void
RES_INIT(p)
register int	*p;
{
	RES_RELEASE(p);
}
void
RES_ACQUIRE(p)
register int	*p;
{
	XXX	/* Some hairy macro goes here to double-loop on a
		 * synchronization register.
		 * Unfortunately, I can't remember how it goes, and
		 * I can't read the Ardent tape in my Sun, so you get to
		 * rummage in thread.h and write it yourself.  -Mike.
		 */
}
void
RES_RELEASE(p)
register int	*p;
{
	*p = 1;
}
#endif /* ardent */

#ifdef SGI_4D
void
RES_INIT(p)
register int	*p;
{
	register int	i = p - (&rt_g.res_syscall);
	ulock_t	ltp;

	if( !rt_g.rtg_parallel )  return;
	if (lockstuff == 0) {
		(void)mktemp(lockfile);
		if( rt_g.debug & DEBUG_PARALLEL )  {
			if( usconfig( CONF_LOCKTYPE, _USDEBUGPLUS ) == -1 )
				perror("usconfig CONF_LOCKTYPE");
		}
		lockstuff = usinit(lockfile);
		if (lockstuff == 0) {
			fprintf(stderr,
				"RES_INIT: usinit(%s) failed, unable to allocate lock space\n",
				lockfile);
			exit(2);
		}
	}
	ltp = usnewlock(lockstuff);
	if (ltp == 0)  {
		fprintf(stderr,
			"RES_INIT: usnewlock() failed, unable to allocate another lock\n");
		exit(2);
	}
	*p = (int) ltp;
	lock_usage[i] = 0;
}
void
RES_ACQUIRE(ptr)
register int	*ptr;
{
	register int	i = ptr - (&rt_g.res_syscall);

	if( !rt_g.rtg_parallel )  return;
	if( i < 0 || i > 12 )  {
		fprintf(stderr, "RES_ACQUIRE(x%x)? %d?\n", ptr, i);
		abort();
	}
	/* Attempt to reduce frequency of library calling sginap() */
	if( lock_busy[i] )  {
		lock_spins[i]++;	/* non-interlocked */
		while( lock_busy[i] )
			lock_waitloops[i]++;
	}
	ussetlock((ulock_t) *(ptr));
	lock_busy[i] = 1;
	lock_usage[i]++;	/* interlocked */
}
void
RES_RELEASE( ptr )
register int	*ptr;
{
	register int	i = ptr - (&rt_g.res_syscall);

	if( !rt_g.rtg_parallel )  return;
	lock_busy[i] = 0;	/* interlocked */
	usunsetlock((ulock_t) *(ptr));
}
#endif /* SGI 4D */

#if defined(sgi) && !defined(mips)
/* Horrible bug in 3.3.1 and 3.4 and 3.5 -- hypot ruins stack!
 */
long float
hypot(a,b)
double a,b;
{
	return(sqrt(a*a+b*b));
}
#endif /* sgi */

#ifdef alliant
void
RES_ACQUIRE(p)
register int	*p;		/* known to be a5 */
{
	register int	i;

	if( !rt_g.rtg_parallel )  return;
	i = p - (&rt_g.res_syscall);
	if( *p )  {
		lock_spins[i]++;	/* non-interlocked */
		while( *p )
			lock_waitloops[i]++;
	}
#if __STDC__
	(void)lock( (char *)p );	/* An alliant libc routine */
#else
	asm("loop:");
	while( *p )  {
		/* Just wasting time anyways, so log it */
		lock_waitloops[i]++;	/* non-interlocked */
	}
	asm("	tas	a5@");
	asm("	bne	loop");
#endif
	lock_usage[i]++;	/* interlocked */
}
#endif /* alliant */

#ifdef convex
RES_ACQUIRE(p)
register int	*p;
{
	if( !rt_g.rtg_parallel )  return;
	asm("getlck:");
	asm("	tas	@0(ap)");	/* try to set the lock */
	asm("	jbra.f	getlck");	/* loop until successful */
}
#endif /* convex */

///////////////////////////////////////////////////////////////////

Here's a plug from Frank Phillips for SGI's course on writing parallel applications.

Just for reference, SGI offers a very good class (in Mt. View) on writing parallel applications, using several different methods. Definitely recommended.

	fap@sgi.com
	frank phillips
	SGI inSane Diego

//////////////////////////////////////////////////////////////////

I hope people find this info useful. I know I will. Thanks to all who responded so quickly. A big unthank you to myself for waiting so long to post.

Cheers,
Doug Turner
turner@apple.com
408-974-0484