[comp.sys.next] Bugs and Problems

rogerj@batcomputer.tn.cornell.edu (Roger Jagoda) (03/12/90)

Folks,                                                                          
                                                                                
I have noticed the following bugs. Have any of you also noticed these
(and, come up with fixes/work-arounds!)?                                        
                                                                                
-- Every time I look in /private/tmp I find these files called
"k_load000100*" that are anywhere from 20K to 5MB! What are these? After
a logout, they DO NOT go away! That's a LOT of waste of disk. If they're
swap files, why don't they go to /private/vm/swapfile?
                                                                                
-- What are the ".places" files that are everywhere? Erasing them seems         
to make no difference...does having them there?                                 
                                                                                
-- When users throw files into the "Black Hole" then log out, the files         
are still there! Look in <HOME DIR>/.NeXT/.NeXTtrash/...that directory          
holds all the things "tagged to be trashed" but not actually removed.           
The average user isn't going to remember to click on Files-Destroy after        
moving things to the "Hole". Now, I know this is supposed to emulate the        
Mac's trash-can icon service, but after a log-off, the "Hole" should be         
considered wasted and the files purged. I can't tell you what a waste of        
disk space this is!                                                             
                                                                                
-- Despite the upgrade to OS 1.0a, we have still noticed a TON of               
printing problems. Some of the most common are that the npd daemon              
crashes with errors:

" npd[102]: found mismatch on printer status, 0x00000010 !=0x00000000"

where of course the two values aren't equal. This looks like a simple buffer
mismatch; isn't npd doing parity or error checking? Anyway, a restart of
npd SOMETIMES works. At other times we get the message:
                                                                                
npd: couldn't connect to network daemon, restart npd                            
                                                                                
which is of course nerve-racking, as that's what you're trying to do!

This error prompts a reboot of the machine, which is really poor. There
must be a better way...."I just can't go on like this!" BTW, most of the        
npd crashes are from users on Mathematica. Perhaps the people at Wolfram        
haven't quite got the bugs worked out?                                          
                                                                                
-- Another npd error causes this message:                                       
                                                                                
" no reply to status request from npd -203"                                     
"DPSlibrary Context Error code 1101"                                            
"DPSlibrary Context Error code 1103"                                            
                                                                                
Then at that point a console window comes up with a panic statement.            
Now, sometimes typing "c" for continue returns to the WorkSpace, but            
other times there is a real kernel panic and the machine has to be              
restarted. Most of the time, users are in WriteNow when this                    
happens...and that's supposed to be stable!                                     
                                                                                
-- There is a MAJOR bug in the mach_swapon routine. With the 40MB               
Quantum drive in place, swapping is SUPPOSED to take place there (i.e.
the drive gets auto-mounted as /private/swapdisk and swapfile is part of        
that tree). Part of the rc scripts sets the PRIMARY swap site there and         
the secondary swap site to /private/vm/swapfile. Unfortunately, even if         
you set a hiwat (for high-water mark) in /etc/swaptab, if the Quantum           
fails (25% of ours have...5 out of 25 so far!), swapping will go to the         
secondary swap point...in the case of a netboot client, this is the file        
server. There, the swapfile under /private/vm/swapfile grows out of
control and eventually brings down the server. In our case we have seen         
these grow to almost 200MB...despite the hiwat settings on both the             
client and the server!                                                          
                                                                                
Mach is known to have some of the best memory management internals of any
system...what happened here!?                                                   
                                                                                
-- Terminal/Shell have a bug in that they don't properly update utmp or         
wtmp. If you issue "finger" on a system where a user has just logged            
off, it will show that user still logged in AND owning a                        
pseudo-terminal! This means that script commands and other tty-dependent        
commands fail with "permission denied" errors. A work-around so far has         
been to launch more terminals, eventually getting to the tty previously         
owned. Then Mach recognizes the error and turns the "owned" tty over to         
the current user, who then has the proper rights. But what if the
previous owner opened up 6 ttys (why shouldn't he/she? NeXTStep is a
good windowing environment)? Then the new user has to launch at least as
many just to keep up! We've tried a number of things, but all of them
seem to cause more problems for the "next" user as each time more               
Terminals have to be launched.                                                  
                                                                                
-- NetInfo cloning does NOT appear to work. After setting up the proper         
"serves /.network" property on a to-be-cloned-to netinfo disk-full              
client, you go to the Configuration server and issue the "nidomain -c
network <config server name>/network" command. True, the directory
/etc/netinfo/network.nidb is created on the clonee, but if the Config.
server goes down, this "cloned" server does NOT kick in as a backup for
passwd or machine idents. What good is it to do the clone then?! We have
also tried setting "NETMASTER=-YES-" in the clonee's /etc/hostconfig
file, but this causes worse problems as the portmapper/nmserver daemons         
on the Config server start getting confused and stall the network! See          
more below.                                                                     
                                                                                
So, is there a way to back up the Config. server's netinfo services?
There NEEDS to be, otherwise netinfo suffers just as YP does when the YP        
server goes down...netinfo is just in another package...:-(                     
                                                                                
-- A BIG bug! NMSERVER/PORTMAP/NIBINDD (all started from /etc/rc) seem          
to have problems when an RPC timeout occurs. The errors look like this:         
                                                                                
netinfo sleeping: RPC: Timed out                                                
lookupd[64]: netinfo sleeping: RPC: Timed out
NFS getattr failed for server <our Config. server here>: RPC: (unknown          
error code)      I like that one...:-)                                          
                                                                                
Then the machine crashes and WON'T come back up. According to the startup
screens, nmserver completes, but portmap and nibindd crash out.

Is there a better portmapper out there we could ftp and install? Have
other people found this problem?                                                
                                                                                
Some minor bugs:                                                                
                                                                                
The EMACS tutorial (only 800 lines or so) was left off the distribution.        
It is normally in /usr/lib/emacs/etc/TUTORIAL but is not on the 1.0             
distribution disk.                                                              
                                                                                
WriteNow STILL has no underline and CANNOT do accents properly (puts            
them off to the side, not over the letter where they belong). For a             
machine with WYSIWYG DPS, you'd expect better. FrameMaker does it               
correctly!                                                                      
                                                                                
That's all for now. Thanks for listening.                                       
                                                                                
--Roger Jagoda                                                                  
--Cornell University                                                            
--FQOJ@CORNELLA.CIT.CORNELL.EDU                                                 
                                                                                

gerrit@nova.cc.purdue.edu (Gerrit) (03/12/90)

In article <9887@batcomputer.tn.cornell.edu> rogerj@tcgould.tn.cornell.edu (Roger Jagoda) asks about a lot of bugs/differences:

Re: /private/tmp/k_load* files, 20K to 5MB?

These are part of the loadable device drivers; see kern_loader for some
more info.  There should only be one file per loadable device driver type,
if I understand correctly, and I was under the impression that they are
fixed size - actually about 128K for the standard device type.

>-- What are the ".places" files that are everywhere? Erasing them seems
>to make no difference...does having them there?

They are a "cache" for the browser.  Every time you pass through a
directory the browser attempts to update the file.  You could probably
(although I haven't tried this) run a find every night to dink them if
space is a problem.
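
A minimal sketch of that nightly find, for anyone who wants to try it.
The function name purge_places is made up for illustration and is not a
NeXT-supplied tool; it assumes the cache files are plain files named
exactly ".places" and that losing them only costs the browser a rebuild.

```sh
#!/bin/sh
# Sketch: remove Browser cache files named ".places" under a directory tree.
# purge_places is a hypothetical helper name, not part of any NeXT release.
purge_places() {
    root=${1:?usage: purge_places DIR}
    # -type f skips directories; -print lists each file as it is removed.
    find "$root" -type f -name .places -print -exec rm -f {} \;
}
```

Called from a nightly cron entry with the top of the tree you care about
as its only argument, it prints what it removed so the cron mail doubles
as a log.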

Re: black hole vs. trash-can semantics

This is as much a matter of philosophy as anything.  It would be better if
the files in the black hole would "time out" after a while.  Or, you could
set up a find(1) out of cron(8) to dink files in users' .NeXT/.NeXTtrash
directories at regular intervals to get a similar semantic.  I personally
dislike associating logout with "really remove files."  They are really two
separate concepts that the Mac has overloaded.  Of course, some Mac users
no longer realize that this is an overloaded function and expect the Mac's
functionality.  If it is a problem, I'd suggest the cron/find solution.
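
The cron/find idea might look roughly like the sketch below.  The name
purge_old_trash, the 7-day default, and the choice to also remove
directories left empty are my assumptions, not anything NeXT ships.

```sh
#!/bin/sh
# Sketch: "time out" Black Hole contents by age.
# purge_old_trash and its arguments are hypothetical, for illustration only.
purge_old_trash() {
    trash=${1:?usage: purge_old_trash TRASH_DIR [DAYS]}
    days=${2:-7}
    [ -d "$trash" ] || return 0
    # Remove plain files last modified more than $days days ago.
    find "$trash" -type f -mtime +"$days" -exec rm -f {} \;
    # Then remove any directories that are now empty (whatever their age);
    # -depth visits children first, and rmdir refuses non-empty directories.
    find "$trash" -depth -type d ! -path "$trash" -exec rmdir {} \; 2>/dev/null
    return 0
}
```

A cron job would call it once per user, passing that user's
.NeXT/.NeXTtrash directory and the number of days to keep.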

Re: problems with npd

I have seen two major classes of printing problems with the NeXT.  One is
usually from trying to use a non-NeXT cable to attach the printer to the
cube.  The timing/distance combination is very important for transfers at
that rate of speed.  That will often cause incorrect status matchups and
a sick npd.  Related to that is running the cable around a fridge or other
noisy electrical source.  The other major class of problems is related to
bad PostScript code.  Npd often chokes if you feed it imperfect PostScript.
Mathematica may indeed be a source of this type of problem.  This is
hopefully something that NeXT is working on.

>-- There is a MAJOR bug in the mach_swapon routine. With the 40MB

There is a bug in the implementation of Mach 1.? or NeXT's version of Mach
that is related to swapping on two different devices.  This should be fixed
in 2.0 (both NeXT's 2.0 and Mach 2.0 if I remember all my numbers
correctly).  I'm not sure why your swap drives are failing so regularly; we
only have about 8-10 with swap drives and none have failed as yet that I'm
aware of.  They've only been installed for a little over a month, though.

>Mach is known to have some of the best memory management internals of any
>system...what happened here!?

Mach may have one of the best theoretical memory management designs
and a fairly good implementation, but Mach is also a fairly young
OS and some of these swapping ideas are fairly new (swapping to
multiple FILES as opposed to partitions, specifying high water/low
water marks, file preferences, and NFS all in the same swapping
package) and there are bound to be a few bugs that aren't found
until the code is actually used by people in real life.  This is
one of those.  As a plug for Mach/NeXT's port/implementation - I've
had nearly 0 OS crashes on any of the 25 or so machines on campus
while running 1.0.

>-- Terminal/Shell have a bug in that they don't properly update utmp or
>wtmp.

Yep.  There is at least one user level bugfix for this on the archives
which utilizes the LoginHook/LogoutHook stuff to do this correctly.
Hopefully NeXT will get this working a bit better in 2.0 as well.
Of course, like any bugs, they should also be reported through the proper
channels (bug_next via campus support person for educational folks) to make
sure they are put in the bugs database to be fixed.

>-- NetInfo cloning does NOT appear to work. After setting up the proper

I haven't worked with this under 1.0 but I was able to get it working
under 0.9 without too much difficulty.  It happened to have some
constraints about subnets and prioritizing access to servers that made it
less than ideal for me, so I haven't used it since then.  This should be a
good question for your NeXT Systems Engineer for your area to work out with
you in case it is a problem with configuration rather than a real bug with
clone servers.  Also, if it is a real bug, it is good if you can
demonstrate it to the SE so that he can take it back to the ranch to get it
properly fixed.

>-- A BIG bug! NMSERVER/PORTMAP/NIBINDD (all started from /etc/rc) seem
>to have problems when an RPC timeout occurs. The errors look like this: 

I haven't seen this and don't have any insights.

>The EMACS tutorial (only 800 lines or so) was left off the distribution. 

Let NeXT know and in the meantime copy it from one of your other hosts or
pick it up from prep.ai.mit.edu.

>WriteNow STILL has no underline and CANNOT do accents properly (puts
>them off to the side, not over the letter where they belong). For a
>machine with WYSIWYG DPS, you'd expect better.  FrameMaker does it
>correctly! 

I think this was discussed here before.  I seem to remember someone saying
that the author of WriteNow believed that underlining was a poor substitute
for italics and therefore wasn't needed.  A nice argument, but I don't buy
it.  You'll probably have to file that as a suggestion with NeXT.  The
accenting problem is a problem with the Text class, if I remember
correctly.  I think that a version of the fixed Text class is forthcoming
(hopefully 2.0) and this will be fixed in a more global fashion.
FrameMaker probably isn't using the Text class and therefore implements
their own, more correct solution.

gerrit

jgreely@oz.cis.ohio-state.edu (J Greely) (03/12/90)

In article <9887@batcomputer.tn.cornell.edu> rogerj@batcomputer.tn.cornell.edu
 (Roger Jagoda) writes:
>I have noticed the following bugs. Have any of you also noticed these
>(and, come up with fixes/work-arounds!)?

Not all of these are bugs (although I suspect all of the blank spaces
at the end of your lines are (cut-and-paste problems?)).
                                                                               
>-- Every time I look in /private/tmp I find these files called
>"k_load000100*" that are anywhere from 20K to 5MB! What are these?
>After a logout, they DO NOT go away! That's a LOT of waste of disk. If
>they're swap files, why don't they go to /private/vm/swapfile?

They're not swapfiles.  They're created by /usr/etc/kern_loader for
some reason (presumably related to loadable kernel modules), and don't
seem to cause any problems if removed.  If they're really an
annoyance, delete them from cron as often as necessary.

  If you want user files deleted from /tmp on logout, add a LogoutHook
to root's defaults database that executes something like this:
	find /tmp -user $1 -print | xargs rm -f


>-- What are the ".places" files that are everywhere? Erasing them seems
>to make no difference...does having them there?

They're used by the Browser to hold icon types and positions for the
Browser, and I believe they're also used to speed up display of
directories under all browser views (subsuming the function of the
.list file from 0.x).

>-- When users throw files into the "Black Hole" then log out, the files
>are still there!

I'd call that a feature, and a damn sight safer than the Mac behavior.
I don't use the Black Hole at all myself, but if I were in the habit,
I'd rather have the system take the cautious view.  Personally, I've
never been quite sure exactly under what circumstances a Mac will
empty the trash can.  In keeping with their general philosophy, NeXT
forces the user to make that choice.

>Now, I know this is supposed to emulate the Mac's trash-can icon
>service, but after a log-off, the "Hole" should be considered wasted
>and the files purged. I can't tell you what a waste of disk space this
>is!

LogoutHook can handle this as well, although I'd limit it to files
more than (say...) 3 days old.  Like so:
	find ~$1/.NeXT/.NeXTtrash -mtime +3 -print | xargs rm -f

>-- Despite the upgrade to OS 1.0a, we have still noticed a TON of
>printing problems. Some of the most common are that the npd daemon
>crashes with errors:
>" npd[102]: found mismatch on printer status, 0x00000010 !=0x00000000"

Nifty.  I hadn't noticed them, but I just found 300 such lines dated
yesterday, averaging one per minute.  Note that these are *not* fatal
errors.  All 300 that I found were stamped with the same process id.
Also, there were no complaints about printing, on this or on the one
other day I have these messages from.
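
For anyone who wants to run the same check, here is a rough sketch of how
one might tally these lines per process id.  It assumes the usual syslog
form "npd[PID]: found mismatch on printer status"; the function name
count_npd_mismatches is made up, and which log file your syslogd writes
these to is something you'll have to look up locally.

```sh
#!/bin/sh
# Sketch: count npd "found mismatch" lines per process id in a log file.
# count_npd_mismatches is a hypothetical helper; the log path is up to you.
count_npd_mismatches() {
    log=${1:?usage: count_npd_mismatches LOGFILE}
    # Extract the pid from "npd[PID]:" on matching lines, then count each pid.
    sed -n 's/.*npd\[\([0-9][0-9]*\)\]: found mismatch on printer status.*/\1/p' "$log" |
        sort | uniq -c
}
```

Each output line is a count followed by a pid.  A single pid with a large
count means npd kept running through the errors instead of dying on each
one.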

>This error prompts a reboot of the machine which is really poor. There
>must be a better way...."I just can't go on like this!" BTW, most of the
>npd crashes are from users on Mathematica.

Bingo.  We don't have anyone using Mathematica regularly.  If it's not
communicating properly with npd (or, alternatively, if npd is not
sufficiently robust to handle protocol errors) it could certainly be
causing such crashes, which are not necessarily related to the error
message given above.


>-- Another npd error causes this message:
>" no reply to status request from npd -203"
>"DPSlibrary Context Error code 1101"
>"DPSlibrary Context Error code 1103"

I don't see any of the last two, but the first one is quite common
among the 300 "mismatch" errors from yesterday, but again, doesn't
seem to be fatal.

>Then at that point a console window comes up with a panic statement.

Never seen it.

>-- There is a MAJOR bug in the mach_swapon routine. With the 40MB
>Quantum drive in place, swapping is SUPPOSED to take place there

I can't comment on this, since we don't own any of the swap drives.
All of the machines on campus were purchased with hard disks.

>-- Terminal/Shell have a bug in that they don't properly update utmp or
>wtmp.

Yup.  In general, utmp and wtmp handling seems to be something that's
not treated seriously.  loginwindow grudgingly handles utmp, but
leaves wtmp for a LoginHook (and they don't supply one that emulates
the default behavior of a "normal" unix machine, sigh).

>A work-around so far has been to launch more terminals, eventually
>getting to the tty previously owned. 

Didn't someone put a program in one of the archives that cleans up
after this mess?

>-- NetInfo cloning does NOT appear to work.
and
>-- A BIG bug! NMSERVER/PORTMAP/NIBINDD (all started from /etc/rc) seem

Having only two NeXTs to play with, I've never done anything big with
NetInfo.  Gerrit?

>NFS getattr failed for server <our Config. server here>: RPC: (unknown
>error code)      I like that one...:-)

That one's so common on anything that runs NFS that it's usually
treated as meaningless noise.  Happens every time something crashes
around here, or goes down for backups.

>WriteNow STILL has no underline and CANNOT do accents properly (puts
>them off to the side, not over the letter where they belong). For a
>machine with WYSIWYG DPS, you'd expect better. FrameMaker does it
>correctly!

Documented.  FrameMaker is about the only thing I know of that gets
accents right.  As for underline, I'm afraid I regard that with a
quiet "so?".  Underlining is only a poor man's italic anyway, so why
mess with it on a system that provides better ways of emphasizing
text?  Yes, FrameMaker supports it, but it supports lots of things
that aren't useful often (but are good to have on the rare occasions
when you *do* need them).
--
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)

robertl@bucsf.bu.edu (Robert La Ferla) (03/12/90)

In article <9887@batcomputer.tn.cornell.edu> rogerj@batcomputer.tn.cornell.edu (Roger Jagoda) writes:

	OK, first things first: I like the bug list, BUT it's really a
waste of net bandwidth when you don't insert carriage returns at the end of
your text lines.  I believe the problem is that you are probably sending
this message from the NeXT and not seeing the padded spaces in your messages.
So, on with the reply:

> When users throw files into the "Black Hole" then log out, the files
> are still there! Look in <HOME DIR>/.NeXT/.NeXTtrash/...that directory
> holds all the things "tagged to be trashed" but not actually removed.
> The average user isn't going to remember to click on Files-Destroy after
> moving things to the "Hole". Now, I know this is supposed to emulate the
> Mac's trash-can icon service, but after a log-off, the "Hole" should be
> considered wasted and the files purged. I can't tell you what a waste of
> disk space this is!

	No, I don't buy your argument completely.  I think the "Black Hole"
should keep all tagged files, but only up to some high-water mark.  NeXT, are
you listening?  Specifically, if the total size of the files tagged as deleted
surpasses some user-specified size (threshold), then the Workspace Manager
should delete files, starting with the oldest in the trash, until this sum is
under the threshold.  Also, there should be some default for this threshold
that is determined by the amount of disk space in your system.  The user
should have the option of changing the threshold value (stored as a Workspace
NetInfo name) via Preferences.
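
	Until something like that exists in the Workspace Manager, the policy
can be approximated from outside with a script.  The sketch below is only
that: trim_trash, its arguments, and the kilobyte unit are names I made up,
it measures disk usage with du rather than exact file sizes, and it will
misbehave on file names containing newlines.

```sh
#!/bin/sh
# Sketch: trim a trash directory to a size budget, oldest entries first.
# trim_trash, its arguments, and the KB unit are assumptions for illustration.
trim_trash() {
    trash=${1:?usage: trim_trash TRASH_DIR LIMIT_KB}
    limit_kb=${2:?usage: trim_trash TRASH_DIR LIMIT_KB}
    [ -d "$trash" ] || return 0
    # ls -tr lists top-level entries oldest first; -A includes dot files.
    ls -Atr "$trash" | while IFS= read -r name; do
        # Re-measure after every removal and stop once under the threshold.
        used_kb=$(du -sk "$trash" | awk '{print $1}')
        [ "$used_kb" -le "$limit_kb" ] && break
        rm -rf "$trash/$name"
    done
}
```

	The threshold here is just a command-line argument; reading it from a
per-user default, as suggested above, is left out.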

   __  
  /  \      /         __/_
 /___/ __  /_  __  __  /	INTERNET:	robertl%bucsf@cs.bu.edu
/ \   '_' /_/ |_- / ' /		BITNET:		mete0pc@buacca.bu.edu

lane@sumex-aim.stanford.edu (Christopher Lane) (03/13/90)

> Now, I know this is supposed to emulate the Mac's trash-can icon
> service, but after a log-off, the "Hole" should be considered wasted
> and the files purged.

If you really want this to be the 'defined' behavior on your system, you can
add something like 'rm -rf ~/.NeXT/.NeXTtrash/*' to /etc/logout.std (assumes
of course that most users use 'csh' or some shell that executes 'csh's default
files).

> -- Terminal/Shell have a bug in that they don't properly update utmp or
> wtmp. If you issue "finger" on a system where a user has just logged off,
> it will show that user still logged in AND owning a pseudo-terminal!

The best solution to the wtmp/utmp/lastlog problems I've found so far is:

a) run the MOTD application as a LoginHook along with its matching LogoutHook
b) ignore the advice in MOTD's README file about Terminal's 'setuid' bit
c) run Eric Scott's 'ghostbuster' utility from 'cron' hourly

this not only fixes the problems you mentioned but goes a long way to making
'last' provide information that you can use to evaluate machine utilization.

- Christopher
-------

edwardm@hpcuhc.HP.COM (Edward McClanahan) (03/13/90)

This notestring is very informative, but I thought I'd correct one
misstatement:

Gerrit writes:

> Mach may have one of the best theoretical memory management designs
> and a fairly good implementation, but Mach is also a fairly young
> OS and some of these swapping ideas are fairly new (swapping to
> multiple FILES as opposed to partitions,...         ^^^^^^^^^^^
  ^^^^^^^^^^^^^^

I believe DEC's VMS has been doing this for almost a decade.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Edward McClanahan
  Hewlett Packard Company
  Mail Stop 47UP              -or-     edwardm%hpda@hplabs.hp.com
  19447 Pruneridge Avenue
  Cupertino, CA  95014                 Phone: (408)447-5651

rca@cslab8g.cs.brown.edu (Ronald C.F. Antony) (03/14/90)

In article <1324@shelby.Stanford.EDU>, lane@sumex-aim.stanford.edu
(Christopher Lane) writes:

> If you really want this to be the 'defined' behavior on your system, you can
> add something like 'rm -rf ~/.NeXT/.NeXTtrash/*' to /etc/logout.std (assumes
> of course that most users use 'csh' or some shell that executes 'csh's
> default files).
> 
This might be dangerous: if someone quits a shell or terminal but is still
working, his files will be gone...

Ronald

------------------------------------------------------------------------------
"The reasonable man adapts himself to the world; the unreasonable one persists
in trying to adapt the world to himself. Therefore all progress depends on the
unreasonable man."  Bernhard Shaw | rca@cs.brown.edu or antony@browncog.bitnet
------------------------------------------------------------------------------