chris@umcp-cs.UUCP (08/12/84)
My understanding was that the ``random offline'' problem was due to a timing bug in the UDA50 microcode. If you rave at DEC long enough they will probably swap ROMs (or even boards) for you. Or, you could kludge around with watchdog timers and UBA resets and the UDA init code and try to force it back on every time it goes off line.
-- In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690 UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
dave@RIACS.ARPA (08/13/84)
From: "David L. Gehrt" <dave@RIACS.ARPA>

The distributed Berkeley drivers I have seen are buggy. We noticed poor[er than was reasonable] throughput on our 81's, and another site here at Ames was having a serious random offline problem. We took a dynamic look at some of the data structures in the driver, and discovered that under heavy load, the controller was being flooded with Get Unit Status (M_OP_GTUNT) commands. If you look into the driver you will see a block of code in udstart() which begins with an "if ((i = ubasetup(..." and ends by sending a Get Unit Status command. The effect of this block of code is the flooding to which I referred above.

Removing this behavior clears up a lot of the problems, if not all, but we made enough changes that the context diffs are about the same size as the driver. The driver we are currently running works just fine, and has dump code (which the distributed code lacked). We haven't gotten around to adding support for more than one device type at a time, so our ra60 is not yet installed. The other site here started running the driver and its serious random offline problem went away. A number of sites have picked up the code for our driver and none have reported back any problems as of this writing.

Neither of our sites had any microcode upgrades, but the legend is that early versions of the microcode caused all sorts of problems. We have seen a number of modified drivers, all of which look like they would solve the problem.

We have had plans to add bad block forwarding to our driver for six months, and have received some code which will advance that effort. I'll report any successes in this location. The problem with the effort has been lack of time and lack of a source of reliable information in support of the activity, which brings me to a...

Minor Flame: After all the time in the field with this hardware (we have had our ra81s for almost a year), I am more than a little disappointed at the small amount of reliable information on the uda50/ra?? combination in the hands of the DEC field service folks and the users, and with the large amount of misinformation and legend we all seem to be given. Here are a couple of legends I think are or were widespread and false:

1. "UN*X (TM) scribbles all over the rct (replacement caching tables) used for bad block forwarding." [Not in *any* UN*X driver I have seen.]
2. "The controller forwards bad blocks automatically." [I have seen nothing that indicates that this is true, and lots of bad block reports to indicate that the controller is not forwarding them. In VMS, for example, the host seems to initiate all bad block forwarding.]

Flame off.

Because the devices are new and, except for a couple of little problems, have been reliable and quick, the fact that the BSD distributed drivers I have seen are not correct is very troublesome. Also, it is beginning to look like the users of these devices need to establish their own communications path to disseminate information on the devices and their drivers. DEC has a clear interest in not disclosing too much about the protocol used and other technical details, to keep out the competition, but judging from the number of pieces of mail here which start with some variation on the theme "Help with UDA50/RA81 problems!!!", it is clear that there is a need to improve the information flow.

So here is a start. I have a 4.2 driver which works fine. [There is no way I know of to determine if it is completely correct, or the most efficient implementation.] Also, I know of a site with a working 4.1 driver, and I will try to get a copy of the diffs for redistribution if there is sufficient interest.

I now relinquish the soap box, but I do feel better.

dave
----------
rbbb@RICE.ARPA (08/15/84)
From: David Chase <rbbb@RICE.ARPA>

To remove some of the mystery (not all):

1) UDA "microcode" bugs: Check your UDA boards - if they are M7161 and M7162, then they are OLD; if they are M7485 and M7486, then they are NEW. I don't think there are many old boards out there anymore, since DEC (at least in our part of the world) went around upgrading the disks on some sort of schedule. We may have unusually responsible field service out here, since everyone else tells horror stories. Whatever version of the driver we are running (for 4.2) doesn't knock the disk offline; uda.c claims it is revision 2.1 84/03/05, and has the unfortunate comment "TO DO: write the bad block forwarding code".

2) Information about these devices can be had from DEC; here are the order numbers and the address:

EK-UDA50-UG-002   UDA50 User Guide (mostly hardware info)
AA-L619A-TK       MSCP Basic Disk Functions Manual
AA-L620A-TK       Storage System Diagnostic and Utilities Protocol
AA-L621A-TK       Storage System UNIBUS Port Description

I have the first manual, but not the other three. The last three may be ordered as a kit, QP905-GZ UDA50 Programmer's Documentation Kit. The address is:

Software Distribution Center
Order Administration/Processing
20 Forbes Road (NR4)
Northboro, MA 01532

3) Deuna information (lots of it): EK-DEUNA-UG-001 Deuna User's Guide. Why anyone would use a Deuna when Interlan boards are available is beyond me, since the Deuna draws about twice as much of everything from the Unibus, and prefers the official DEC H4000 transceiver. Xerox makes one about as big as my fist that seems to work with the Deuna and its diagnostics, except that it lacks the H4000's bogus "heartbeat" (the transceiver asserts "collision" in a special window to let the controller know that its collision detector is still working).

For high-density ethernet applications, I recommend DEC's DELNI. It provides 8 connections for a single network tap. It can also operate without any ethernet connection (providing a cheap 8 node pseudo-ethernet) and (if not connected to ethernet) can be tiered to support up to 64 nodes. Cable length restrictions would probably make a 64 node DELNI network a little silly, but it is possible. We have 5 diskless Suns connected to a net through one of these, and have had no trouble from the DELNI. I also recommend this because we have had significant problems (more than once) with bad connections to the ethernet cable itself (sometimes shorting the cable), and people using the network get unhappy.

4) 750 hardware information (this might solve some of the WCS questions, though not how to deal with the DEC-supplied updates): EK-KA750-TD-002 (not necessarily the latest edition). This is NOT for the faint of heart.

Now, does anyone out there know any good rumors about "fast fork" for 4.2+n? This uses copy-on-write shared memory; we once heard that this would require a microcode update and was thus delayed. I didn't understand that rumor, since it seems doable with software. Any comments?

5) There is a TM78 (the TU78/TA78 formatter) microcode upgrade floating around; it doesn't break the 4.2 driver (it changed EOT processing in some way, I think to report EOT before any io errors; this helps VMS backup not embarrass itself by running off the end of the tape). We also received this upgrade on some schedule, I think determined by our drive serial number.

Hope this clears up some of the hardware confusion out there.

drc
andrew@hwcs.UUCP (Andrew Stewart) (08/16/84)
We are running a VAX-11/750 under 4.2BSD with an RA81 driven from a UDA50; like a number of other sites, we have experienced the ra81/uda50 going offline for no apparent reason. Does *anyone* know what causes this? Is there a cure? Is it a problem in the uda50? The ra81? The driver software? Is it (as I suspect) a UBA timing window problem? Any pointers or ideas would be welcomed. I will, as usual, summarise. Andrew Stewart, Dept. of Computer Science, Heriot-Watt University, Edinburgh.
eric@milo.UUCP (08/16/84)
I would like to thank dave for clearing up a mystery that has plagued me for some time. We have three 11/780s, all with RA81s. After 6 months with no problems, one of them decided to go berserk, occasionally going offline, etc. The other two continued to perform flawlessly. Sounds like hardware, right? DEC replaced all the controller boards, all the drives, the memory, and most of the cpu, with no success.

Finally, we installed a different driver, which ostensibly only allowed support for ra60s, with no mention of any change to how the drive was handled. Lo and behold, the problem went away. (I should mention that the first driver was apparently acquired from within DEC; the second, correct one came over the net. Just goes to show who you should trust.) Anyway, I went back and checked, and sure enough, the second driver does not issue the Get_Unit_Status command.

Now, there are still some unanswered questions, such as why that particular machine started having problems, since it is not the most heavily loaded system, and we tried swapping things all over the two unibuses to try to minimize the possibility of unibus contention being the problem. Also, once the problem started appearing, it got to the point where the system would fail with only a few people on, mostly idle. Anyway, thanks again for clearing up why the problem got fixed.

On a side note, I would like to mention that the local DEC people knew next to nothing about the drive and MSCP in general (in fact, I have all of the "official" documentation that DEC gives them - it consists of handwritten explanations of some of the more common error codes), but to be fair, DEC did fly in an expert to meet with us who was knowledgeable about the drive. He also was not able to isolate the problem, but it does seem to be a subtle one.

Anyone know if Ultrix has a correct driver?
-- eric ...!seismo!umcp-cs!aplvax!eric
henry@utzoo.UUCP (Henry Spencer) (08/18/84)
> Now, does anyone out there know any good rumors about "fast fork" for 4.2+n?
> This uses copy-on-write shared memory; we once heard that this would require
> a microcode update and was thus delayed. I didn't understand that rumor,
> since it seems doable with software. Any comments?

What I heard was that the 750 has a microcode bug that prevents copy-on-write from working properly, this being one of the reasons why fast fork has been so long in coming.
-- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
dmmartindale@watcgl.UUCP (Dave Martindale) (08/18/84)
Alex White, who worked on the UDA50 driver here, theorized that the "Get Unit Status" botch was written into the driver at a time when the UDA50 could queue only 15 outstanding requests, which happens to be the same as the number of BDPs (buffered data paths) available on the 780 UBA. If all of the BDPs are in use, and if you don't use them for anything other than the UDA50, then the UDA can't handle another request right now anyway, and sending out the Get Unit Status really just provides you with a way to get an interrupt at the point that the UDA50 finishes one of the transfers, coincidentally freeing up a BDP. This strategy doesn't work if the UDA50 can handle more than 15 requests (new UDA50s do 22) or if you have some other device using one of the BDPs. In this case, you get constant interrupts.
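The theory above can be put as a toy model in C. This is only a sketch of the counting argument, not driver code: the struct, its field names, and the function wasted_gtunt_interrupts() are made up for illustration, and "ticks" is an assumed unit of time equal to one Get Unit Status round trip.

```c
#include <assert.h>

/* Hypothetical model of the situation described in the post. */
struct uba_model {
    int nbdp;         /* BDPs on the adapter (15 on the 780, per the post) */
    int other_bdp;    /* BDPs currently held by devices other than the UDA50 */
    int uda_credits;  /* requests the UDA50 can queue (15 old, 22 new, per the post) */
};

/*
 * Every BDP is busy and the driver falls back on Get Unit Status.
 * Returns how many Get Unit Status round trips (each one an interrupt)
 * are wasted while waiting `ticks` round-trip times for a transfer
 * to finish and free a BDP.
 */
int wasted_gtunt_interrupts(const struct uba_model *m, int ticks)
{
    int uda_transfers;
    int free_slots;

    assert(m->other_bdp >= 0 && m->other_bdp <= m->nbdp);
    assert(ticks >= 0);

    uda_transfers = m->nbdp - m->other_bdp;       /* transfers holding a BDP */
    free_slots = m->uda_credits - uda_transfers;  /* room left in the UDA50 */

    if (free_slots <= 0)
        return 0;   /* command waits behind the transfers: no polling */
    return ticks;   /* command completes at once, retry fails, repeat */
}
```

With nbdp = 15, other_bdp = 0 and uda_credits = 15 the function returns 0, which is the case the original code seems to have been written for. Raising uda_credits to 22, or setting other_bdp to 1, makes it return ticks, which is the "constant interrupts" case.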
chris@umcp-cs.UUCP (08/21/84)
First: Please post diffs to the 4.2 UDA driver.

Second: The reason for the Get Unit Status command in the first place (the one which floods the UBA and UDA with interrupts) is that *something* has to be done at that point, and I suppose the author of the original code felt that M_OP_GTUNT was the least drastic. Here's the scenario:

N requests for Unibus BDPs all granted. (N depends on CPU type.)
UDA50 requests BDP.
Request is denied because the UBA is out of BDPs.

The driver can't wait for one because this is happening at interrupt level. What to do?

Solution 1: return from the interrupt code, without doing anything at all. This would work except for one snag: what if there are no transfers pending on that controller? No more interrupts will occur, and we'll never get another shot at grabbing a BDP and starting the transfer. (Another thing that would need to be done is to allocate the MSCP packet *after* getting the BDP, not before. Unless you know a way to give back an MSCP packet . . . ?)

Solution 2: do a Get Unit Status. That doesn't need a BDP and can use the handy MSCP packet that's already been allocated. Unfortunately, it floods the UBA with interrupts until a BDP is finally released.

Solution 3 (my favourite but requires hacking the UBA code): Return from the interrupt code after exacting a promise from the UBA code to call the interrupt routine again once a BDP is free. Requires some sort of queueing, alas. Another possibility is to just set a flag someplace and have a callout daemon (the udwatch watchdog routine that isn't there for some reason, perhaps) call the interrupt routine. Easier but not as aesthetic.

Solution 4: apply Solution 1 after moving all other devices that use BDPs to another Unibus. That would guarantee that the snag never occurs. However, while technically feasible, it may not be a viable option. (I can just see trying to justify another UBA to the state ``because we can't write the software right''. . . .)
-- In-Real-Life: Chris Torek, Univ of MD Comp Sci (301) 454-7690 UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
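For what it's worth, the queueing that Solution 3 asks for might look roughly like the sketch below. It is a standalone toy in C, not the 4.2BSD UBA code: bdp_pool, bdp_get_or_wait(), bdp_release(), MAXWAIT and count_wakeup() are all hypothetical names, NBDP uses the 780 figure of 15 quoted earlier in the thread, and a real kernel version would also have to block interrupts around the queue manipulation.

```c
#include <assert.h>

#define NBDP    15  /* BDPs on the adapter; 15 is the 780 figure from the thread */
#define MAXWAIT  8  /* hypothetical bound on queued waiters */

typedef void (*bdp_callback)(void *arg);

struct bdp_pool {
    int free;                                   /* BDPs not handed out */
    struct { bdp_callback fn; void *arg; } waitq[MAXWAIT];
    int head, count;                            /* circular wait queue */
};

void bdp_init(struct bdp_pool *p)
{
    p->free = NBDP;
    p->head = 0;
    p->count = 0;
}

/*
 * Try to take a BDP without sleeping. If none is free, remember fn(arg)
 * so it can be called when one is released.
 * Returns 1 if granted, 0 if deferred, -1 if the wait queue is full.
 */
int bdp_get_or_wait(struct bdp_pool *p, bdp_callback fn, void *arg)
{
    int tail;

    if (p->free > 0) {
        p->free--;
        return 1;
    }
    if (p->count == MAXWAIT)
        return -1;
    tail = (p->head + p->count) % MAXWAIT;
    p->waitq[tail].fn = fn;
    p->waitq[tail].arg = arg;
    p->count++;
    return 0;
}

/* Give a BDP back. If someone is waiting, the oldest waiter inherits it. */
void bdp_release(struct bdp_pool *p)
{
    bdp_callback fn;
    void *arg;

    assert(p->free < NBDP);
    if (p->count == 0) {
        p->free++;
        return;
    }
    fn = p->waitq[p->head].fn;
    arg = p->waitq[p->head].arg;
    p->head = (p->head + 1) % MAXWAIT;
    p->count--;
    fn(arg);    /* waiter now owns the BDP and can restart its transfer */
}

/* Stand-in for the driver routine that would restart the transfer. */
void count_wakeup(void *arg)
{
    ++*(int *)arg;
}
```

The point of the design is that a denied request costs nothing until a BDP really is released: the driver returns from interrupt level straight away, and the release path makes exactly one call per freed BDP instead of the controller generating a stream of Get Unit Status interrupts.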