[comp.unix.wizards] Bug in 4.3BSD vm_page/fodkluster on VAXen??

ehrlich@psuvax1.psu.edu (Dan Ehrlich) (11/12/87)

We have one fortran program that will consistently cause the kernel to
crash and burn with:

	trap type 9, code = 803fa1fb, pc = 80025b52
	panic: Protection fault

Let me add that we have found no other program written in any language
that will cause this problem.

It would appear that there is a bug in the routine fodkluster in the
file sys/vm_page.c as follows.  After slugging it out with adb, (is
there an easier way? :-( ), it comes down to the process generating a
page fault for virtual address 2.  The translation fault code calls
pagein with arguments (2,0).  Pagein eventually calls fodkluster.

Here is where it gets interesting.  The call to fodkluster looks like:

	fodkluster((proc *)80090ee8,		/* p */
		   (unsigned)0,			/* v0 */
		   (pte *)803fa200,		/* pte0 */
		   (int *)7fffff4c,		/* pkl */
		   (dev_t)26,			/* dev */
		   (daddr_t *)7fffff3c)		/* pbn */

In fodkluster there is some code that looks like:

	fodkluster(p, v0, pte0, pkl, dev, pbn)
	...
	unsigned v, vmin, vmax;
	...
	if (isatsv(p, v0)) {
	    type = CTEXT;
	    vmin = tptov(p, 0);
	    vmax = tptov(p, clrnd(p->p_tsize) - CLSIZE);
	...
	fpte = (struct fpte *)pte0;
	bn = *pbn;
	v = v0;
	for(klsize=1; klsize < KLMAX; klsize++) {
	    v -= CLSIZE;
	    if (v < vmin)
		break;
	    fpte -= CLSIZE;
	    if (fpte->pg_fod == 0)
		break;
	...

The variable "vmin" is assigned the value zero, since on a VAX
"tptov(p, 0)" expands to "((unsigned)0)".  As v0 came into the routine
as zero and v is unsigned, the line "v -= CLSIZE" wraps around and
leaves v at 2**32-CLSIZE.  No unsigned number is less than a vmin of
zero, so the test "v < vmin" fails and we continue on to
"fpte -= CLSIZE".  It turns out that "fpte = (struct fpte *)pte0 =
p->p0br", i.e. this is the very first page table entry for this process
and there are none before this one.  The next "if" then tries to
reference an invalid address through fpte, resulting in the protection
fault shown above while the cpu is already running in kernel mode.
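
The wraparound can be reproduced outside the kernel.  What follows is
only a sketch: it assumes CLSIZE is 2 (the value I believe the VAX
uses), models the 32-bit "unsigned" with uint32_t, and the function
names are made up for illustration, they are not kernel routines.

```c
#include <assert.h>
#include <stdint.h>

#define CLSIZE 2u	/* assumed VAX cluster size, for illustration only */

/* Hypothetical helper: one "v -= CLSIZE" step in 32-bit unsigned math. */
uint32_t step_back(uint32_t v)
{
	return v - CLSIZE;	/* wraps modulo 2**32 when v < CLSIZE */
}

/*
 * Hypothetical helper modelling the original order of operations:
 *	v -= CLSIZE; if (v < vmin) break;
 * Returns 1 if the bound check would stop the loop, 0 if it lets it run on.
 */
int original_check_stops(uint32_t v, uint32_t vmin)
{
	assert(v >= vmin);	/* the caller starts at or above vmin */
	v = step_back(v);
	return v < vmin;
}
```

With v and vmin both zero, step_back(0) gives 0xfffffffe and
original_check_stops(0, 0) returns 0, i.e. the check does not fire and
the loop goes on to move fpte backwards.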

UNIX goes down in flames and you get to spend the next twenty minutes
watching fsck reconstruct your file systems.  Below is a patch that
would appear to fix the problem.

-- Dan Ehrlich

RCS file: RCS/vm_page.c,v
retrieving revision 1.1
diff -c -r1.1 vm_page.c
*** /tmp/,RCSt1002016	Wed Nov 11 17:55:54 1987
--- vm_page.c	Wed Nov 11 13:13:57 1987
***************
*** 1263,1271 ****
  	bn = *pbn;
  	v = v0;
  	for (klsize = 1; klsize < KLMAX; klsize++) {
! 		v -= CLSIZE;
! 		if (v < vmin)
  			break;
  		fpte -= CLSIZE;
  		if (fpte->pg_fod == 0)
  			break;
--- 1263,1271 ----
  	bn = *pbn;
  	v = v0;
  	for (klsize = 1; klsize < KLMAX; klsize++) {
! 		if (v < (vmin+CLSIZE))
  			break;
+ 		v -= CLSIZE;
  		fpte -= CLSIZE;
  		if (fpte->pg_fod == 0)
  			break;
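
To compare the two orderings in the diff above without a kernel, here
is a rough user-level model.  It is a sketch under assumptions: CLSIZE
is taken to be 2, KLMAX is set to 8 purely for illustration, the pg_fod
test is left out (every pte is treated as fill-on-demand), and the
function names are invented.  Each function counts how many times the
loop would have moved fpte back by one cluster.

```c
#include <assert.h>
#include <stdint.h>

#define CLSIZE 2u	/* assumed VAX cluster size */
#define KLMAX  8	/* illustrative only, not taken from the kernel */

/* Original order: decrement first, then compare against vmin. */
int back_steps_original(uint32_t v0, uint32_t vmin)
{
	uint32_t v = v0;
	int klsize, steps = 0;

	for (klsize = 1; klsize < KLMAX; klsize++) {
		v -= CLSIZE;		/* may wrap past zero */
		if (v < vmin)
			break;
		steps++;		/* kernel would do fpte -= CLSIZE here */
	}
	return steps;
}

/* Patched order: check that there is room, then decrement. */
int back_steps_patched(uint32_t v0, uint32_t vmin)
{
	uint32_t v = v0;
	int klsize, steps = 0;

	for (klsize = 1; klsize < KLMAX; klsize++) {
		if (v < (vmin + CLSIZE))
			break;
		v -= CLSIZE;
		assert(v >= vmin);	/* v can no longer drop below vmin */
		steps++;
	}
	return steps;
}
```

In this model back_steps_original(0, 0) walks back KLMAX-1 clusters
before giving up, while back_steps_patched(0, 0) returns 0.  When vmin
is nonzero, e.g. (8, 4), both versions stop after the same number of
steps, which would explain why only a fault at the very bottom of the
text segment trips the bug.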