Sun-Spots-Request@RICE.EDU (William LeFebvre) (03/15/88)
SUN-SPOTS DIGEST         Monday, 14 March 1988         Volume 6 : Issue 28

Today's Topics:
    Re: Multi-process debugging using Dbxtool (2)
    Re: Ethernet problems
    Re: nfsd problems
    Re: rcp and rlogin hang in 3.4EXPORT, fix
    TCP packet size bug in 3.4 AND 3.5 (2)
    rasterfile(5) to Postscript filter
    Strange Ethernet error messages
    Reasonable Ethernet collision rates?
    Question about Sun 3/160 VME bus speed
    Connecting an HP LaserJet II to a SUN 4/280?
    Anyone using a Sun 4 for central time sharing?
    Any problems in upgrading to 3.5?

Send contributions to:  sun-spots@rice.edu
Send subscription add/delete requests to:  sun-spots-request@rice.edu
Bitnet readers can subscribe directly with the CMS command:
    TELL LISTSERV AT RICE SUBSCRIBE SUNSPOTS My Full Name
Recent backissues are stored on "titan.rice.edu".  For volume X, issue Y,
"get sun-spots/vXnY".  They are also accessible through the archive
server:  mail the word "help" to "archive-server@rice.edu".

----------------------------------------------------------------------

Date:    Thu, 3 Mar 88 10:03:16 +0100
From:    Danny Backx <mcvax!prlb2!kulcs!dannyb@uunet.uu.net>
Subject: Re: Multi-process debugging using Dbxtool (1)

>From: ha@purdue.edu
>...That is, at any time, there could be as many instances of Dbxtool
> as there are active processes belonging to the program.

This is NOT what is meant in the paper.

What you _can_ do is this:
- when some process is running, it is possible to attach a dbx process
  to it, which clearly doesn't have to be the parent...
- when you are debugging a process, you can allow it to continue on its
  own.

The commands for this are (inside dbx, of course) 'debug' and 'detach',
NOT 'attach' as both the manual and the paper say.

What you also can NOT do, and which is VERY annoying, is to single-step
or debug a process while it is fork()-ing.  (More generally: no file can
be debugged if more than one process is running it.)

Now this is something the people at SUN should fix.
Multi-process debugging is almost impossible due to this.  Moreover, it
shouldn't be that hard to fix.  All they have to do is really duplicate
the process in memory when it forks, if it is being debugged.  (And also
when you start debugging it, and multiple processes are running the same
core...)

	Danny Backx

Danny Backx                         | mail: Katholieke Universiteit Leuven
Tel: +32 16 200656 x 3537                   Dept. Computer Science
E-mail: dannyb@kulcs.UUCP           |       Celestijnenlaan 200 A
    ... mcvax!prlb2!kulcs!dannyb    |       B-3030 Leuven
        dannyb@kulcs.BITNET         |       Belgium

------------------------------

Date:    Wed Mar  2 17:14:14 1988
From:    David_T_Lawlor@cup.portal.com
Subject: Re: Multi-process debugging using Dbxtool (2)

To debug processes that are not the children of dbxtool I do the
following:

    Run the process to debug.
    While it's running -- or waiting for input -- get into a dbxtool
      window (or dbx).
    Use the command "debug progname <pid>".  You can get the pid from
      the "sh ps ax" command inside dbx.
    dbx then attaches itself to the process.  You can then hit ^C to
      stop the process.
    Use dbx as usual.
    To finish, use the command "detach" to let go of the process.  If
      you don't, and use the quit command, the process you were
      debugging dies.

davel@cup.portal.com

------------------------------

Date:    Mon, 29 Feb 88 08:09:28 EST
From:    steve@cs.umd.edu (Steven D. Miller)
Subject: Re: Ethernet problems
Reference: v6n20

The behavior you're seeing, and the frequency with which it occurs, make
me think that you're having what is called an Ethernet meltdown.

Meltdown behavior is generally caused by using different broadcast
addresses on your network.  The change from all-zeroes (4.2BSD
convention) to all-ones (4.3BSD and DARPA standard convention) is enough
to cause the problem, as is the change from unsubnetted broadcasts to
subnetted broadcasts.
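
As a rough illustration of the two conventions just described (an
editorial sketch, not part of the original posting), the following
Python computes the all-zeroes and all-ones host-field addresses for a
network.  The function name broadcast_candidates and the use of Python's
standard ipaddress module are choices made for this example; the network
numbers are the Maryland ones that appear elsewhere in this message, and
the prefix lengths assume a class B network with an optional eight extra
bits of subnet mask.

```python
import ipaddress


def broadcast_candidates(network: str) -> dict:
    """Return both host-field conventions for one network.

    Hypothetical helper for illustration only: 'all_zeroes' is the
    4.2BSD-style broadcast address, 'all_ones' the 4.3BSD/standard one.
    """
    net = ipaddress.ip_network(network)
    return {
        "all_zeroes": str(net.network_address),
        "all_ones": str(net.broadcast_address),
    }


# Assumed prefixes: /16 for the unsubnetted class B network,
# /24 for the same network with eight more bits of subnet mask.
unsubnetted = broadcast_candidates("128.8.0.0/16")
subnetted = broadcast_candidates("128.8.128.0/24")

print("unsubnetted:", unsubnetted)
print("subnet 128: ", subnetted)
```

Going by the description in this message, a host configured with any one
of these four values would not recognize packets sent to the other three
as broadcasts.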

Given the 4.2BSD/SunOS TCP/IP implementation (at least as of SunOS 3.2;
later revisions may be fixed, I don't know for sure), any machine that
sees a broadcast packet it doesn't recognize will try to forward that
packet.  An example scenario might be:

    Host sees broadcast to 128.8.128.255
    Host thinks broadcast address is 128.8.0.0, and says, "gee, that's
      not a broadcast packet, so I should forward it!"
    Host does ARP for 128.8.128.255.  Of course, no one responds.  At
      least, no one should respond...
    Host keeps ARPing potentially a number of times until it finally
      gives up.

Multiply this by fifty hosts, all pounding on the Ethernet roughly
simultaneously (since they all see the different broadcast at the same
time), and you get a period during which the Ethernet is unusable.

Hosts simply should not forward packets of any sort, and they certainly
should not *under any circumstances* forward a broadcast packet.  I
don't think that gateways should ever forward broadcasts, either; my
reasoning is identical to that in RFC 1009.  (This document is a gem
when it comes to telling TCP/IP implementors and administrators what to
do and not to do, and what strange things can happen as a result.)

Of course, there is this nice kernel variable "ipforwarding" which can
be used to disable forwarding and which you might think can be used to
stop this antisocial behavior.  Guess again.  In a 4.2BSD system, if you
turn off ipforwarding, all that will happen is that you'll swap ICMP
Network Unreachable messages for ARPs (at a possible packet savings, as
you'll definitely only get one ICMP message per broadcast, while you may
get more than one ARP).  In fact, these ICMP messages will erroneously
have the differing broadcast address as their source address, due to a
quirk in the 4.2BSD implementation.  This makes it rather difficult to
find the perpetrator unless you know which machines have which Ethernet
addresses...

You might be able to verify whether a meltdown is occurring by:

    jello# etherfind -proto icmp -o -arp

If you see big bursts of ARPs or big bursts of ICMP Network Unreachable
messages, you've got a meltdown.  I'd suspect that it's one of the
"oddball" machines on your net that is causing the problem.

The correct broadcast address to use is your network number with
all-ones in the host field, with modification if you're using subnets.
(I.e., we at Maryland would broadcast on 128.8.255.255 if we weren't
using an additional eight bits of subnet mask; we do broadcast on, for
example, 128.8.128.255 on subnet 128.)  You should use the same thing
everywhere, though, even if this means using an incorrect broadcast
address.

I hope this helps.

	-Steve

Spoken: Steve Miller    Domain: steve@mimsy.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808  USPS: UMIACS, Univ. of Maryland, College Park, MD 20742

[[ Well, this was not the problem that Benson was experiencing, but the
problem described here is sufficiently catastrophic (we've had it happen
here at Rice) that I decided to include this message anyway.  --wnl ]]

------------------------------

Date:    Mon, 29 Feb 88 09:02:37 EST
From:    Steve D. Miller <steve@brillig.umd.edu>
Subject: Re: nfsd problems
Reference: v6n20

We've seen similar behavior (strange hangings of nfsd instances in disk
wait) here.  With Chris Torek to do the thinking and me to do the grunt
work of tracing kernel data structures, we finally tracked the problem
down to a very subtly corrupted inode on the disk.  Fsck didn't see
anything wrong with it, but there was something about it that caused
nfsd to loop endlessly whenever it touched that inode.  (I don't
remember exactly what the problem was.  I think it had to do with a
corrupted directory.  Chris, do you remember?)

You might be able to reproduce the problem at will by NFS mounting all
your server's partitions on a client (say at /tmp/foo), and then running
a find(1) to look at each directory.
("find /tmp/foo -name X-X-X" will probably do it.  It was a find run on
an NFS partition that first brought the problem to our attention.)  If
your NFS daemons hang, you've probably got the same problem.

The cure was not for the faint of heart.  We looked at the output of
ps axl to find what address the nfsds were waiting on, and what priority
they were waiting at.  We made a guess as to what it was exactly that
they were waiting on data-structure-wise (sources come in handy here, as
you can see what in the NFS and filesystem code waits on particular
things at a given priority, and guess from that what data structure the
wait channel might represent), then turned that structure back into an
inode number.  We then used ncheck to find the files (actually, I think,
directories) corresponding to those inodes, and carefully did a ls -li
on the appropriate directories (the normal filesystem still worked) to
figure out what inodes in those directories corresponded to which files.
We then ran clri on the appropriate inodes, ran fsck to put the new
orphans into lost+found, and used the ls -li output to put the files in
lost+found back into their proper places.

The problem has not reoccurred in more than a year of operation.  I
think it was surmised at the time that it was a problem with the
statelessness of NFS.  (Races can occur when doing some directory
operations, or so I've heard.)

I realize that my description is pretty vague.  Like I said, it's been a
long time.  I hope this is enough to lead you (or someone else) down the
right path, though...

	-Steve

Spoken: Steve Miller    Domain: steve@mimsy.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808  USPS: UMIACS, Univ. of Maryland, College Park, MD 20742

------------------------------

Date:    29 Feb 88 10:22 -0800
From:    Dan Razzell <razzell%vision.ubc.cdn@ean.ubc.ca>
Subject: Re: rcp and rlogin hang in 3.4EXPORT, fix
Reference: v6n17, v6n25

We experienced this too.
It certainly seems to be fixed in the 3.5 kernel, but I'm not sure
exactly what did it.  There is a note in the Release 3.5 Manual in
section 5.4, "TCP/IP File Transfer Hangs" which may pertain.

You don't have to install the entire 3.5 distribution to get the fix.
It's enough to extract the root, system, and network tar files, and from
them copy the various TCP/IP daemons and build a new kernel.

------------------------------

Date:    Tue, 1 Mar 88 14:31:40 PST
From:    S.C.Blair <ascway!scb@spar-20.spar.slb.com>
Subject: TCP packet size bug in 3.4 AND 3.5 (1)

The offensive bug that caused Suns to talk in 512 byte packets is back
again.  A confirmation call & subsequent e-mail from Ray Jang there
confirmed it's STILL (groan) there.  The fix is as follows:

    adb -w /vmunix
    tcp_mss+0xac?w 400
    tcp_mss+0xac:   0x200 = 0x400
    tcp_mss+0xbc?w 400
    tcp_mss+0xbc:   0x200 = 0x400
    ^D

*for those feeling brave in the kernel(warning) you can use 'adb' in
this manner:
--------------------=======

    adb -w -k vmunix /dev/kmem

This fix will eliminate the need to reboot the machine after doing the
patch.

The mail from Ray also indicated that some of SUN'S OWN machines had the
patch installed, and some didn't.

[[ Serves 'em right!  --wnl ]]

Being an Ex-Sunner, I am surprised that this SIMPLE fix doesn't appear
in the 3.5 kernel.

Steve Blair
Schlumberger Technology Corp-Austin, tx
uucp: {backbone}!sun!decwrl!spar!ascway!blair

------------------------------

Date:    Thu, 3 Mar 88 11:54 EST
From:    SYSRUTH@UTORPHYS.BITNET
Subject: TCP packet size bug in 3.4 AND 3.5 (2)

This bug originally cropped up in 3.4 of SUN UNIX, and a patch for it
was broadcast on this list by SUN.  For those of you patiently awaiting
the arrival of 3.5 so you won't have to worry about it any more:

    *** It is NOT fixed in 3.5 ***

We installed several diskless 3/50's, served off a SUN 4, which requires
3.5 to be running on the 3's.
We did the software installation straight off tape, and lo and behold,
the generic kernel, the kernel we made for the 3/50's (and the 3/110 we
are also serving) all show:

    # adb /pub/vmunix
    tcp_mss+0xac?x
    _tcp_mss+0xac:  200
    tcp_mss+0xbc?x
    _tcp_mss+0xbc:  200

The number should be 0x400, i.e. 1024 bytes, not 512.  The same patch as
for 3.4 will still work (i.e. tcp_mss+0xac?w 400 and same for +0xbc).

We got these tapes in mid-January.  3.5 was released at the beginning of
December.

Does anybody know what module the error is in, and how to patch it so
that new makes of the kernel will already have the patch applied?  I am
gradually learning my way around UNIX, but adb is still mostly out of my
ken.

Thanks.

Ruth Milner                     (preferred) BITNET: sysruth@utorphys
Systems Manager                 InterNet: sysruth@helios.toronto.edu
University of Toronto Physics

------------------------------

Date:    Sat, 27 Feb 88 20:31:10 +0200
From:    leonid@TAURUS.BITNET
Subject: rasterfile(5) to Postscript filter

Here is a program to convert rasterfile(5) images into PostScript for a
LaserWriter.  This is a version of psraster I wrote.  There should be no
trouble compiling and installing it, except for the tabs.  Since this
will be posted through BITNET, tabs will disappear, so please make sure
to reconstruct the Makefile before attempting to compile.

If you write any fixes into this program, or have any suggestions for
improvement, please email them to me so I can incorporate them into the
next release.

Leonid
E-Mail: leonid@taurus.BITNET, @cunyvm.cuny.edu:leonid@Math.Tau.Ac.IL

[[ The shar file has been placed in the archives as
"sun-source/psraster.shar" and is 13495 bytes long.  It can be retrieved
via anonymous FTP from the host "titan.rice.edu" or via the archive
server with the request "send sun-source psraster.shar".  For more
information about the archive server, send a mail message containing the
word "help" to the address "archive-server@rice.edu".
--wnl ]]

------------------------------

Date:    Mon, 29 Feb 88 12:54:35 PST
From:    Jonathan Eisenhamer <jon@mira.astro.ucla.edu>
Subject: Strange Ethernet error messages

Recently, the following appeared on the console of our Sun 3/50:

    le0: Received packet with ENP bit in rmd cleared
    le0: Received packet with STP bit in rmd cleared

There were about a dozen pairs of these messages.  They have not
happened before or since.  Just wondering what they mean.

Thanks,
Jonathan Eisenhamer
UCLA Astronomy
jon@mira.astro.ucla.edu
jon@uclastro.bitnet

------------------------------

Date:    Mon, 29 Feb 88 14:47:23 EST
From:    smb@research.att.com
Subject: Reasonable Ethernet collision rates?

What are reasonable Ethernet collision rates?  What about input and
output errors?

On one net, where we have a single Sun 3/280, and a bunch of 4.3bsd
VAXen, the Sun is running about .02%, and the VAXen are at .3% -- an
order of magnitude difference.  On the other hand, most (but not all) of
the VAXen show no input or output errors (on DEUNAs or DELUAs), while
the Sun shows a modest number of each.  (Output errors are running at
.0025%; input errors are .001%).  Virtually all of this traffic would be
rcp/rlogin stuff (with some admixture of broadcasts from rwhod and
routed).

On a second cable, the situation is rather different.  We have two file
servers and a plethora of clients, wired up using Thinwire cable and
DEC's DECConnect scheme (i.e., one station per thinwire segment, all
interconnected via DEMPRs and DELNIs).  The two file servers, a
LANBridge, and two DELNIs are directly on a piece of thick coax.  On
that net, we're seeing collision rates of .6% or thereabouts.  Output
errors run about 0% on the clients and .05% on the servers; input errors
range from .01% to .06% on the clients and .01% on the server.  Most of
the clients are not truly diskless; they have small local disks for
root, /tmp, swap, and some local bin directories; thus, there's much
less NFS traffic than normal.
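
For readers who want to reproduce this kind of figure (an editorial
sketch, not part of the original posting), percentages like the ones
above can be derived from interface packet counters.  The message does
not say how its rates were computed; dividing collisions by output
packets is an assumption made here, and the counter values below are
hypothetical numbers chosen only to match the .02% and .3% quoted for
the first net.

```python
def rate_percent(events: int, packets: int) -> float:
    """Express an event count as a percentage of a packet count."""
    if packets <= 0:
        raise ValueError("packet count must be positive")
    return 100.0 * events / packets


# Hypothetical counters, not measurements from the original message.
sun = {"opkts": 1_000_000, "collis": 200}
vax = {"opkts": 1_000_000, "collis": 3_000}

sun_rate = rate_percent(sun["collis"], sun["opkts"])
vax_rate = rate_percent(vax["collis"], vax["opkts"])

print(f"Sun collision rate: {sun_rate:.3f}%")
print(f"VAX collision rate: {vax_rate:.3f}%")
print(f"ratio: {vax_rate / sun_rate:.1f}x")
```

With these made-up counters the two rates differ by a factor of about
fifteen, which is the "order of magnitude" gap described above.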

So -- are the rates I'm seeing too high?  What constitutes an 'output
error'?  That, I assume, could differ for different Ethernet
controllers; we seem to have ie and le controllers.

		--Steve Bellovin
		{ihnp4,ucbvax,allegra}!ulysses!smb
		smb@ulysses.att.com

------------------------------

Date:    Mon, 29 Feb 88 10:45:08 EST
From:    Ned Danieley <ndd@sunbar.mc.duke.edu>
Subject: Question about Sun 3/160 VME bus speed

We are using an IKON 10089 DRV11 emulator to do asynchronous 16 bit word
DMA transfers to a Sun 3/160.  The IKON specs suggest that it will run
at up to 2 Meg transfers/second if the bus arbitrator can keep up at
that rate.  We appear to get only about 650 K though.  This is on a very
lightly loaded system: basically nothing is happening except for the
data acquisition.  Should we be seeing a higher rate, or is this a
reasonable speed for the Sun bus?

Ned Danieley (ndd@sunbar.mc.duke.edu)
Basic Arrhythmia Laboratory
Duke University Medical Center
Durham, NC  27710
(919) 684-6807 or 684-6942

------------------------------

Date:    29 February 1988 10:14:59 CST
From:    Steven G. Krantz <C31801SK@WUVMD.BITNET>
Subject: Connecting an HP LaserJet II to a SUN 4/280?

Does anyone know anything about hooking a Hewlett-Packard LaserJet II to
a SUN Model 4/280?  The serial port on the 280 seems to be flawed (is
this right?) and we can't figure out how to configure the cable
(certainly the manual that comes with the LaserJet doesn't give a clue).

Thanks for any help.

Steven G. Krantz
Washington University in St. Louis

------------------------------

Date:    29 Feb 88 16:11:09 GMT
From:    arnold@emory.UUCP (Arnold D. Robbins {EUCC})
Subject: Anyone using a Sun 4 for central time sharing?

I am the Unix Systems Programmer for the Emory University Computing
Center.  We provide central computing resources to the entire Emory
University campus.  Currently, we provide Unix on two (rapidly aging)
Vax 780s, running Mt. Xinu's educational 4.3 + NFS (which I highly
recommend, by the way).
We mostly support Computer Science instruction, but also have a number
of other departments using the Unix machines for research and/or word
processing.  At peak load we tend to see about 28 users logged on at
once, doing student type things, editing, compiling, running, debugging.

We are planning on replacing the two 780's with a single Sun 4.  Is
there anyone out there who has already done something like this?  In
other words, if you have a Sun 4 supporting 48 or more simultaneous
logins on real terminals (not rlogins), I would like to hear from you.
Is it working poorly, well?  How much memory and disk must I have?  Any
and all information about what you've done would be appreciated.  (We
are not planning on using the Sun 4 to serve any workstations, just as a
central timesharing machine.)

Thanks in advance,

Arnold Robbins
ARPA, CSNET:  arnold@emory.ARPA            BITNET: arnold@emory
UUCP: { decvax, gatech, }!emory!arnold     DOMAIN: arnold@emory.edu (soon)

------------------------------

Date:    Tue, 1 Mar 88 10:27:17 PST
From:    mordor!tolerant!vsi1!lmb@ut-sally.UUCP (Larry Blair)
Subject: Any problems in upgrading to 3.5?

We are currently running SunOS 3.4.  As an OEM, distributing 3.4 with
our new shipments is a nightmare (7 tapes).  Sun has been pushing us to
use 3.5.  I have several questions aimed at those sites that have
upgraded:

We have heard thru the grape-vine that there are problems with 3.5.  The
only one that I've seen published in Sun-Spots is the atrun problem.
What has been your experience with bugs fixed vs. bugs created?  In
short, is this upgrade worth doing?

According to Sun, 3.5 is released as a full release, rather than an
upgrade.  What problems did you encounter when trying to upgrade
existing systems?  How did you prevent the loss of locally modified
system files?  Sun always seems to assume that they can clobber any
files that they want to.

* * O   Larry Blair
 * * O  VICOM Systems Inc.     sun!pyramid----\
* * O   2520 Junction Ave.     uunet!ubvax-----!vsi1!lmb
 * * O  San Jose, CA 95134     ucbvax!tolerant/
* * O   +1-408-432-8660

------------------------------

End of SUN-Spots Digest
***********************