[comp.graphics] Parallel Ray Tracing on IRISs

turner@Apple.COM (Douglass Turner) (03/21/90)

Here is the collection of responses that I got from various people
kind enough to respond to my query about doing parallel ray tracing
on the IRIS. 

/////////////////////////////////////////////////////////////////////////
This is from George Drettakis:

I'm just finishing my MSc here, and I designed and implemented a hierarchical 
scheme that uses several IRISes connected over an Ethernet. 
The rendering algorithm is a global illumination approach developed locally 
that goes one step further than radiosity and ray-tracing combined.  
The approach is based on space-subdivision that splits the original 3-d
scene into cubes adaptively depending on the difficulty in each volume.
I assign a group of sub-volumes to each IRIS, and then do the processing
for the volume on each IRIS in parallel. This actually includes a bit 
of ray-casting that we do for various purposes in our algorithm.
I use the taskcreate/taskdestroy calls to implement this, and I use
the usnewsema and usnewlock calls for locking and synchronisation.
The man pages are pretty straightforward, but if you have trouble
I can put together a sample program and send it to you. I found
using the m_get_myid function from the "sequent compatibility library"
was also useful, as it gives you a unique proc id (0 to n-1 for n procs).
For ray-tracing, all you need to do is split the screen and give
each pixel to a processor, and you're done. The key is to avoid using any
global variables; speaking from painful experience, other people's
code is full of them, no pun intended, as they didn't have parallelism in mind.
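
For the screen-splitting part, here is a minimal sketch (mine, not George's
code; XSIZE/YSIZE and trace_pixel() are made-up placeholders) using the
m_get_myid()/m_get_numprocs() calls he mentions, with each processor taking
every n-th scanline so the dispatch itself needs no locking:

	#define XSIZE	512		/* assumed image size */
	#define YSIZE	512

	void
	trace_my_part()			/* run once per processor, e.g. via m_fork() */
	{
		int	mynum = m_get_myid();		/* 0 .. np-1, Sequent compat. library */
		int	np    = m_get_numprocs();
		int	x, y;

		/* interleave scanlines across the processors */
		for( y = mynum; y < YSIZE; y += np )
			for( x = 0; x < XSIZE; x++ )  {
				/* trace_pixel(x, y);  hypothetical; writes only
				 * into this pixel's own slot, no shared globals */
			}
	}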

George Drettakis (416) 978 5473        Dynamic Graphics Project	
Internet: dret@dgp.toronto.edu         Computer Systems Research Institute
Bitnet:	  dret@dgp.utoronto            Toronto, Ontario M5S 1A4, CANADA


/////////////////////////////////////////////////////////////////////////
This is from Steve Lamont:

I use brute force, forking off multiple processes.  This is probably not
optimum, but is guaranteed to work on a number of platforms, including a Cray
Y-MP running UNICOS (no shared memory between processes )-: ), a Stardent
Titan, a Convex C-220, and an SGI 4D/280GTX.  If you're interested, I'll be
glad to send you the code -- it is based on the MTV ray tracer, with
extensions.
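
For reference, a bare-bones sketch of the brute-force approach (my sketch,
not Steve's code; trace_band() and the image size are made up): each child
traces one horizontal band and the parent just waits for them all, which is
why it works even without shared memory between processes.

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define NPROC	4
	#define YSIZE	484		/* assumed image height */

	main()
	{
		int	i;

		for( i = 0; i < NPROC; i++ )  {
			if( fork() == 0 )  {
				/* child: trace rows [i*YSIZE/NPROC, (i+1)*YSIZE/NPROC)
				 * and write them to its own output file, since there
				 * is no shared memory to rely on */
				/* trace_band( i*YSIZE/NPROC, (i+1)*YSIZE/NPROC ); */
				_exit(0);
			}
		}
		for( i = 0; i < NPROC; i++ )
			wait( (int *)0 );
		/* merge the NPROC partial images here */
		return 0;
	}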

Steve Lamont, sciViGuy	(919) 248-1120		EMail:	spl@ncsc.org
NCSC, Box 12732, Research Triangle Park, NC 27709

//////////////////////////////////////////////////////////////////
This contribution is from Gavin at SGI (not Gavin Miller!). In it
he mentions a demo tape that IRIS users can purchase ($100.00) that
contains source code for numerous demos including the "flyray" demo
he discusses. I haven't gotten my copy yet but intend to soon. The 
demos and source code for other interesting-looking stuff are 
listed in a pamphlet you can get from IRIS (it's free) called
"IRIS software exchange". I recommend it.

Have you seen the 'flyray' interactive ray-tracer demo running on a
GTX?  It used to be a two-processor ray-tracer (one processor
computing ray/scene intersections, the other feeding the graphics
pipe) until I fully parallelized it.  It will now use one processor to
send scene data (showing the ray-trace going on) to the geometry
subsystem, and the other N-1 to compute the ray/scene intersections
(if running over the Ethernet using the Distributed Graphics Library,
all processors compute the ray/scene intersections, with an extra
process called every once in a while to send the results down the
wire).

I used the sproc() system call to fork off processes, and set up a
queue which the main program (responsible for displaying the bitmap
produced and showing the rays being traced) reads from and all the
child processes write into.  This queue is protected using the
hardware lock feature of GTX systems (see the usinit, usnewlock,
ussetlock, and usunsetlock manual pages).

Rays are traced in blocks of up to 256 at a time (16x16 chunks), which
cuts down on the overhead associated with accessing the queue.  The
program runs amazingly fast on an 8-processor system, but could benefit
from tracing even larger chunks.
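
Here is a rough sketch of that structure (not the flyray source; the image
size, arena path, and trace_block() are made up, and the display/queue side
is elided so main just joins in the tracing): the main program sproc()s off
workers that each grab the index of a 16x16 block under a lock, trace it,
and go back for more.

	#include <sys/types.h>
	#include <sys/wait.h>
	#include <sys/prctl.h>		/* sproc(), PR_SALL, prctl() */
	#include <ulocks.h>		/* usinit(), usnewlock(), ussetlock() */

	#define XSIZE	512
	#define YSIZE	512
	#define BLK	16				/* 16x16 = 256 rays per block */
	#define NBLK	((XSIZE/BLK)*(YSIZE/BLK))

	static usptr_t	*arena;
	static ulock_t	qlock;
	static int	next_blk = 0;		/* shared: next block to hand out */

	void
	worker(arg)
	void	*arg;
	{
		int	b, bx, by;

		for(;;)  {
			ussetlock(qlock);		/* enter critical section */
			b = next_blk++;
			usunsetlock(qlock);
			if( b >= NBLK )  return;	/* all blocks handed out */
			bx = (b % (XSIZE/BLK)) * BLK;	/* block origin */
			by = (b / (XSIZE/BLK)) * BLK;
			/* trace_block(bx, by, BLK);  then hand the finished
			 * pixels to the display process via the queue */
		}
	}

	main()
	{
		int	i, ncpu = prctl(PR_MAXPPROCS);

		arena = usinit("/usr/tmp/flyray.arena");	/* shared arena file */
		qlock = usnewlock(arena);
		for( i = 1; i < ncpu; i++ )
			sproc( worker, PR_SALL, 0 );	/* share all resources */
		worker( (void *)0 );			/* use this CPU too */
		while( --i > 0 )  wait( (int *)0 );
		return 0;
	}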

Source code to the flyray program is available from Silicon Graphics;
Monica Schulze is the person to contact (she handles distribution of
demo source code).  Try calling her at   415-962-3320.

	For your application, you probably don't need to do any
inter-process synchronization; each process will be busy computing its
quarter of the image and storing it in memory.  You might want to look
at the m_fork() manual page; your application will look something
like:

	initialize_stuff();		/* create the voxel space, etc. */
	m_fork(child_process);
	m_kill_procs();
	save_image();			/* or whatever you want to do with it */
	exit(0);

	... and each child_process does:
	int mynum = m_get_myid();
	int np = m_get_numprocs();
	/* based on mynum and np, do part of the screen here... */
	return;

The m_fork commands are nice because you get automatic synchronization
(the m_fork command returns after all child processes are finished),
and it figures out how many processors the system has for you.

Re: Distributed Graphics library:

	Yes, it is very easy to use; instead of linking with -lgl_s, you
link with -ldgl (and possibly some other libraries, depending on your
networking situation; usually, you need to add -lsun -lbsd also).
After doing a little setup (you have to tell inetd to listen for DGL
requests on a particular port number and start up the DGL daemon),
you put your program on the machine that will execute it, and
rlogin to that machine from the SGI machine on which you want the
graphics to appear.  Then run the program.  All the graphics get sent
over the network and appear on the machine you are sitting at.
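
For example, for a single-file program the link line would look something
like this (the extra libraries vary with your networking setup, as noted
above):

	cc -o rt rt.c -ldgl -lsun -lbsd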

    However, Ethernet speeds are fairly slow, so if you do a lot of
drawing you can easily overwhelm the Ethernet bandwidth and slow down.
The best way around this is to use the GL display-list commands; the
object will be stored on the display machine, and only one call needs
to be sent over the network to get it displayed.  Of course, as
networks get faster this will become unnecessary.
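
As an illustration (a sketch, not flyray code), under IRIS GL you would
build a graphical object once and then redraw it with a single callobj()
per frame, so only that one call has to cross the network:

	#include <gl/gl.h>

	Object	scene;

	build_scene()
	{
		scene = genobj();	/* allocate an object (display list) id */
		makeobj(scene);		/* following GL calls are stored, not drawn */
		/* ... polygon/vertex calls describing the scene go here ... */
		closeobj();
	}

	draw_frame()
	{
		callobj(scene);		/* replay the whole stored object */
	}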

Feel free to redistribute any of this information.

--gavin
gavin%krypton@sgi.com



/////////////////////////////////////////////////////////////////
This is source code from Michael John Muuss. I haven't looked at
it closely. His email address is mike@brl.mil

/*
 *			M A C H I N E . C
 *
 *  Machine-specific routines for doing raytracing.
 *  Includes parallel and vector support, replacement library routines, etc.
 *
 *  Author -
 *	Michael John Muuss
 *
 *  Source -
 *	SECAD/VLD Computing Consortium, Bldg 394
 *	The U. S. Army Ballistic Research Laboratory
 *	Aberdeen Proving Ground, Maryland  21005
 *  
 *  Copyright Notice -
 *	This software is Copyright (C) 1987 by the United States Army.
 *	All rights reserved.
 */
#include <stdio.h>
#include <ctype.h>
#include <math.h>
#include "machine.h"
#include "vmath.h"
#include "raytrace.h"
#include "./debug.h"

#ifdef CRAY
# include <sys/category.h>
# include <sys/resource.h>
# include <sys/types.h>
#endif

#ifdef CRAY2
#undef MAXINT
# include <sys/param.h>
#endif

#ifdef HEP
# include <synch.h>
# undef stderr
# define stderr stdout
#endif /* HEP */

#ifdef alliant
# include <cncall.h>
#endif

#if (defined(sgi) && defined(mips) && !defined(SGI4D_Rel2))
# define SGI_4D	1
# include <sys/types.h>
# include <sys/prctl.h>
# include <ulocks.h>
static char		*lockfile = "/usr/tmp/rtmplockXXXXXX";
static usptr_t		*lockstuff = 0;
#endif /* SGI_4D */

#ifdef ardent
#	include <thread.h>
#endif

/*
 *			R T _ P R I _ S E T
 *
 *  Without knowing what the current UNIX "nice" value is,
 *  change to a new absolute "nice" value.
 */
void
rt_pri_set(nval)
int	nval;
{
	int opri, npri, chg;
	int bias = -20;

#ifdef BSD
#define	PRIO_PROCESS	0	/* From /usr/include/sys/resource.h */
	opri = getpriority( PRIO_PROCESS, 0 );
	setpriority( PRIO_PROCESS, 0, nval );
	npri = getpriority( PRIO_PROCESS, 0 );
#else
	/* SysV version, with variable bias.  Some standard. */
#ifdef CRAY
	bias = 0;
#endif
	opri = nice(0) - bias;
	chg = nval - opri;
	(void)nice(chg);
	npri = nice(0) - bias;
	if( npri != nval )  rt_log("SysV error:  wanted nice %d! check bias=%d\n", nval, bias );
#endif
	rt_log("Priority changed from %d to %d\n", opri, npri);
}

/*
 *			R T _ C P U G E T
 *
 *  Return the current CPU limit, in seconds.
 *  Zero or negative return indicates that limits are not in effect.
 */
int
rt_cpuget()
{
#ifdef CRAY
	long	old;			/* 64-bit clock counts */
	extern long limit();

	if( (old = limit( C_PROC, 0, L_CPU, -1 )) < 0 )  {
		perror("CPU limit(get)");
	}
	if( old <= 0 )
		return(999999);		/* virtually unlimited */
	return( (old + HZ - 1) / HZ );
#else
	return(-1);
#endif
}

/*
 *			R T _ C P U S E T
 *
 *  Set CPU time limit, in seconds.
 */
void
rt_cpuset(sec)
int	sec;
{
#ifdef CRAY
	long	old;		/* seconds */
	long	new;		/* seconds */
	long	newtick;	/* 64-bit clock counts */
	extern long limit();

	old = rt_cpuget();
	new = old + sec;
	if( new <= 0 || new > 999999 )
		new = 999999;	/* no limit, for practical purposes */
	newtick = new * HZ;
	if( limit( C_PROC, 0, L_CPU, newtick ) < 0 )  {
		perror("rt_cpuset: CPU limit(set)");
	}
	rt_log("Cray CPU limit changed from %d to %d seconds\n",
		old, newtick/HZ );

	/* Eliminate any memory limit */
	if( limit( C_PROC, 0, L_MEM, 0 ) < 0 )  {
		/* Hopefully, not fatal if memory limits are imposed */
		perror("rt_cpuset: MEM limit(set)");
	}
#endif
}

/*
 *			R T _ A V A I L _ C P U S
 *
 *  Return the maximum number of physical CPUs that are considered to be
 *  available to this process now.
 */
int
rt_avail_cpus()
{
	int	ret = 1;

#ifdef SGI_4D
	ret = prctl(PR_MAXPPROCS);
#	define	RT_AVAIL_CPUS
#endif

#ifdef alliant
	long	memsize, ipnum, cenum, detnum, attnum;

	lib_syscfg( &memsize, &ipnum, &cenum, &detnum, &attnum );
	ret = attnum;		/* # of CEs attached to parallel Complex */
#	define	RT_AVAIL_CPUS
#endif

#ifndef RT_AVAIL_CPUS
	ret = DEFAULT_PSW;
#endif
	return( ret );
}

#ifdef CRAY
struct taskcontrol {
	int	tsk_len;
	int	tsk_id;
	int	tsk_value;
} taskcontrol[MAX_PSW];
#endif


#if defined(unix)
	/*
	 * Cray is known to wander among various pids, perhaps others.
	 */
#	define	CHECK_PIDS	1
#endif

/*
 *			R T _ P A R A L L E L
 *
 *  Create 'ncpu' copies of function 'func' all running in parallel,
 *  with private stack areas.  Locking and work dispatching are
 *  handled by 'func' using a "self-dispatching" paradigm.
 *  No parameters are passed to 'func', because not all machines
 *  can arrange for that.
 *
 *  This function should not return control until all invocations
 *  of the subroutine are finished.  The caller should double-check
 *  this, using cooperation with 'func'.
 *
 *  Don't use registers in this function.  At least on the Alliant,
 *  register context is NOT preserved when exiting the parallel mode,
 *  because the serial portion resumes on some arbitrary processor,
 *  not necessarily the one that serial execution started on.
 *  The registers are not shared.
 */
void
rt_parallel( func, ncpu )
void	(*func)();
int	ncpu;
{
#if defined(PARALLEL)

#if defined(alliant) && !__STDC__
	register int d7;	/* known to be in d7 */
	register int d6 = ncpu;	/* known to be in d6 */
#endif
	int	x;
	int	new;
#ifdef CHECK_PIDS
	int	pid = getpid();
#endif

	if( rt_g.debug & DEBUG_PARALLEL )
		rt_log("rt_parallel(0x%x, %d)\n", func, ncpu );

#ifdef HEP
	for( x=1; x<ncpu; x++ )  {
		/* This is more expensive when GEMINUS>1 */
		Dcreate( *func );
	}
	(*func)();	/* avoid wasting this task */
#endif /* HEP */

#ifdef CRAY
#if 0
	/* Try to give up processors as soon as they are unneeded */
	new = 0;
	TSKTUNE( "DBRELEAS", &new );
#endif

	/* Create any extra worker tasks */
	for( x=1; x<ncpu; x++ ) {
		taskcontrol[x].tsk_len = 3;
		taskcontrol[x].tsk_value = x;
		TSKSTART( &taskcontrol[x], func );
	}
	(*func)();	/* avoid wasting this task */

	/* There needs to be some way to kill the tfork()'ed processes here */
	/* Wait for them to finish */
	for( x=1; x<ncpu; x++ )  {
		TSKWAIT( &taskcontrol[x] );
	}
#endif

#if alliant
#	if defined(__STDC__)	/* fxc defines it == 0 !! */
#	undef __STDC__
#	define __STDC__	2

	/* Calls func in parallel "ncpu" times */
	concurrent_call(CNCALL_COUNT|CNCALL_NO_QUIT, func, ncpu);

#	else
	{
		asm("	movl		d6,d0");
		asm("	subql		#1,d0");
		asm("	cstart		d0");
		asm("super_loop:");
		(*func)();		/* d7 has current index, like magic */
		asm("	crepeat		super_loop");
	}
#	endif
#endif

#ifdef convex
	/*$dir parallel */
	for( x=0; x<ncpu; x++ )  {
		(*func)();
	}
#endif /* convex */

#ifdef ardent
	/* The stack size parameter is pure guesswork */
	parstack( func, 1024*1024, ncpu );
#endif /* ardent */

#ifdef SGI_4D
	for( x = 1; x < ncpu; x++)  {
		/*
		 *  Ensure that we are attached to the shared arena
		 *  used for the ussetlock() calls in RES_ACQUIRE().
		 *  Locks will always be at the same virtual address.
		 */
		if( usinit(lockfile) == 0 )  {
			fprintf(stderr, "rt_parallel: usinit(%s) failed, unable to allocate lock space\n", lockfile);
			exit(2);
		}

		/*
		 *  Start a share-group process, sharing ALL resources.
		 *  This direct sys-call can be used because none of the
		 *  task-management services of, eg, taskcreate() are needed.
		 */
		if( (new = sproc( func, PR_SALL, 0 )) < 0 )
			rt_log("machine.c/rt_parallel(): sproc failed\n");
	}
	(*func)();
	{
		int pstat;
		
		for ( x = 1; x < ncpu; x++)
			wait(&pstat);
	}
#endif /* sgi */

	if( rt_g.debug & DEBUG_PARALLEL )
		rt_log("rt_parallel() complete, now serial\n");

#ifdef CHECK_PIDS
	/*
	 * At this point, all multi-tasking activity should have ceased,
	 * and we should be just a single UNIX process with our original
	 * PID and open file table (kernel struct u).  If not, then any
	 * output is going to be written into the wrong file.
	 */
	if( pid != (x=getpid()) )  {
		rt_log("rt_parallel:  PID changed from %d to %d, open file table may be botched!\n",
			pid, x );
	}
#endif
#else	/* PARALLEL */
	rt_log("rt_parallel( x%x, %d. ):  Not compiled for PARALLEL machine\n",
		func, ncpu );
#endif	/* PARALLEL */
}

static int	lock_usage[12];		/* Lock usage counters */
static int	lock_spins[12];		/* Lock spin counters */
static int	lock_waitloops[12];	/* # passes through s/w wait loop */
static int	lock_busy[12];		/* !0 => lock is busy */
static char	*lock_title[12] = {
	"syscall",
	"worker",
	"stats",
	"results",
	"model refinement",
	"???"
};

/*
 *			R T _ P R _ L O C K _ S T A T S
 */
void
rt_pr_lock_stats()
{
	register int i;
#ifdef SGI_4D
	register int	*p = &rt_g.res_syscall;
	ulock_t		ltp;
	lockmeter_t	meter;
	lockdebug_t	deb;
#endif

	fprintf(stderr, "\nParallel processor interlock summary\n\ # of Uses   # Spins Waitloops Purpose\n");
	for( i=0; i<3; i++ )  {
		if(lock_usage[i] == 0)  continue;
		fprintf(stderr,"%10d%10d%10d %s%s\n",
			lock_usage[i], lock_spins[i], lock_waitloops[i],
			lock_title[i],
			lock_busy[i] ? "**STILL LOCKED**" : "" );
#ifdef SGI_4D

		ltp = (ulock_t) p[i];
		if( usctllock( ltp, CL_METERFETCH, &meter ) == 0 )  {
			fprintf(stderr,
				"\t\tlm_spins=%d, lm_tries=%d, lm_hits=%d\n",
				meter.lm_spins,
				meter.lm_tries,
				meter.lm_hits );
		} else perror("usctllock CL_METERFETCH");
		if( usctllock( ltp, CL_DEBUGFETCH, &deb ) == 0 )  {
			if( deb.ld_owner_pid != -1 )  {
				fprintf(stderr,
					"\t\tERROR, ld_owner_pid=%d\n",
					deb.ld_owner_pid );
			}
		} else perror("usctllock CL_DEBUGFETCH");
#endif
	}
}

#ifdef CRAY
/*
 *  Interface to Cray-specific Multi-Tasking library
 */
void
RES_INIT(p)
register int *p;
{
	if( !rt_g.rtg_parallel )  return;
#ifdef PARALLEL
	LOCKASGN(p);
#endif
}

void
RES_ACQUIRE(p)
register int *p;
{
	register int i = p - (&rt_g.res_syscall);
	if( !rt_g.rtg_parallel )  return;
	if( i < 0 || i > 12 )  {
		fprintf(stderr, "RES_ACQUIRE(0%o)? %d?\n", p, i);
		abort();
	}
#ifdef PARALLEL
	LOCKON(p);
#endif
	lock_usage[i]++;		/* interlocked */
}

void
RES_RELEASE(p)
register int *p;
{
	if( !rt_g.rtg_parallel )  return;
#ifdef PARALLEL
	LOCKOFF(p);
#endif
}

#endif /* CRAY */

#ifdef CRAY1
/* Horrid hack to correct multi-tasking lib botch of calling f_exit(), on XMP */
f_exit()
{
	_exit(0);
}
#endif

#ifdef ardent
/*
 *  Interface to Ardent synchronization hardware
 */
void
RES_INIT(p)
register int *p;
{
	RES_RELEASE(p);
}

void
RES_ACQUIRE(p)
register int *p;
{
	XXX
	/* Some hairy macro goes here to double-loop on a
	 * synchronization register.
	 * Unfortunately, I can't remember how it goes, and
	 * I can't read the Ardent tape in my Sun, so you get to
	 * rummage in thread.h and write it yourself. -Mike.
	 */
}

void
RES_RELEASE(p)
register int *p;
{
	*p = 1;
}
#endif /* ardent */

#ifdef SGI_4D

void
RES_INIT(p)
register int	*p;
{
	register int i = p - (&rt_g.res_syscall);
	ulock_t	ltp;

	if( !rt_g.rtg_parallel )  return;
	if (lockstuff == 0) {
		(void)mktemp(lockfile);
		if( rt_g.debug & DEBUG_PARALLEL )  {
			if( usconfig( CONF_LOCKTYPE, _USDEBUGPLUS ) == -1 )
				perror("usconfig CONF_LOCKTYPE");
		}
		lockstuff = usinit(lockfile);
		if (lockstuff == 0) {
			fprintf(stderr, "RES_INIT: usinit(%s) failed, unable to allocate lock space\n", lockfile);
			exit(2);
		}
	}
	ltp = usnewlock(lockstuff);
	if (ltp == 0) {
		fprintf(stderr, "RES_INIT: usnewlock() failed, unable to allocate another lock\n");
		exit(2);
	}
	*p = (int) ltp;
	lock_usage[i] = 0;
}

void
RES_ACQUIRE(ptr)
register int	*ptr;
{
	register int i = ptr - (&rt_g.res_syscall);

	if( !rt_g.rtg_parallel )  return;

	if( i < 0 || i > 12 )  {
		fprintf(stderr, "RES_ACQUIRE(x%x)? %d?\n", ptr, i);
		abort();
	}

	/* Attempt to reduce frequency of library calling sginap() */
	if( lock_busy[i] )  {
		lock_spins[i]++;	/* non-interlocked */
		while( lock_busy[i] )  lock_waitloops[i]++;
	}
	ussetlock((ulock_t) *(ptr));
	lock_busy[i] = 1;
	lock_usage[i]++;		/* interlocked */
}

void
RES_RELEASE( ptr )
register int	*ptr;
{
	register int i = ptr - (&rt_g.res_syscall);
	if( !rt_g.rtg_parallel )  return;
	lock_busy[i] = 0;		/* interlocked */
	usunsetlock((ulock_t) *(ptr));
}

#endif /* SGI 4D */

#if defined(sgi) && !defined(mips)
/* Horrible bug in 3.3.1 and 3.4 and 3.5 -- hypot ruins stack! */
long float
hypot(a,b)
double a,b;
{
	return(sqrt(a*a+b*b));
}
#endif /* sgi */

#ifdef alliant
void
RES_ACQUIRE(p)
register int *p;		/* known to be a5 */
{
	register int i;
	if( !rt_g.rtg_parallel )  return;

	i = p - (&rt_g.res_syscall);
	if( *p )  {
		lock_spins[i]++;	/* non-interlocked */
		while( *p )  lock_waitloops[i]++;
	}

#if __STDC__
	(void)lock( (char *)p );	/* An alliant libc routine */
#else
	asm("loop:");
	while( *p )  {
		/* Just wasting time anyways, so log it */
		lock_waitloops[i]++;	/* non-interlocked */
	}
	asm("	tas	a5@");
	asm("	bne	loop");
#endif
	lock_usage[i]++;		/* interlocked */
}
#endif /* alliant */

#ifdef convex
RES_ACQUIRE(p)
register int *p;
{
	if( !rt_g.rtg_parallel )  return;

	asm("getlck:");
	asm("	tas	@0(ap)");	/* try to set the lock */
	asm("	jbra.f	getlck");	/* loop until successful */
}
#endif /* convex */



///////////////////////////////////////////////////////////////////
Here's a plug from Frank Phillips for SGI's course on writing 
parallel applications.

just for reference, SGI offers a very good class (in Mt. View) on writing
parallel applications, using several different methods.
Definitely recommended. 

fap@sgi.com
frank phillips
SGI inSane Diego

//////////////////////////////////////////////////////////////////
I hope people find this info useful. I know I will.
Thanks to all who responded so quickly. A big unthank you to myself
for waiting so long to post.

Cheers,
Doug Turner
turner@apple.com
408-974-0484