zwicky@erg.sri.com (Elizabeth Zwicky) (02/07/91)
This is my personal list of interesting questions in Large Installation System Administration. The list was originally discussed at the Winter 1991 USENIX in Dallas at the LISA BOF, and is posted at the request of those attending that BOF. This is a reorganized, lengthened, and cleaned up version, which includes the questions added by people there. Asterisks mark leaf nodes, so I can count up how many there are (I have a vague theory that every asterisk is approximately a paper's worth of question). Where I know them, I have listed people working on the problem. A future version will also give references to existing work. I will happily update the list with extra references, questions, names of people and so on if you send them to me.

As to why you might be interested in this; well, it'll give you something to think about during those long sleepless nights. More seriously, it might suggest problems you ought to worry about before they bite you; it's a good place to look for paper topics if you think you'd like to write a paper (which is good for you personally in that it impresses people, and good for the world in general in that it spreads information and minimizes redundant work); it may point you towards information, or even just useful ways of stating questions, to help with problems you already have.

1) Storing data.
*A) Partitioning disks. Little partitions separate out different uses of disks; big partitions avoid some wasted space. How do you decide where to draw the line? How do you balance loads between disks and controllers? Where do you trade off between manageability and efficiency? What are the issues you should consider when partitioning disks? (Some seemingly obscure things, like putting high-traffic partitions closer to the center of the disk, can have noticeable effects on performance.)
B) Migration systems. Some users have a lot of data that they don't use much but insist on having around just in case.
One way to deal with this is to provide ways to silently transfer files away from expensive, size-limited, but quickly accessible magnetic disk onto slower but larger and more extensible media, and transfer them back when they're looked at. Such systems exist, but usually require either major investments of money, or kernel modifications.
*i) What options currently exist, and how do they compare?
*ii) Supposing infinite resources, what should such a system really do? Can it be done without kernel modifications? How do you decide when to move files off line? Are vendors making the right assumptions about file usage patterns? Is there a single set of right assumptions, and what can you do if there isn't? What do you do about really long delays? How do you reclaim space on tertiary storage?
*C) For convenience, the rest of this section is divided between backup systems (designed to restore data lost in the event of a failure of some sort) and archive systems (designed to save data for long periods of time). Current systems do not make this distinction well, so most sites use a backup scheme to provide a historical record of some sort, as well as for immediate recovery, and patch in a second system (or non-system) to deal with files that they know will need to be accessed in ways that the backup system doesn't support. For both backups and archives, assuming that a system has only the one purpose simplifies its design. Is this really a defensible assumption? Even if it is, where do you put the current historical uses of backups? Because we currently use our backup system to support some archive purposes, we keep some tapes for a very long time. We have needed those ancient tapes for purposes we did not foresee, and therefore have pulled off data which we would not have explicitly transferred to archives.
D) Backups
*i) Almost everybody has a locally designed backup system; there are also commercially available systems.
There is no available source of information about what to consider when designing such a system, or what the common pitfalls are. [Elizabeth Zwicky, zwicky@erg.sri.com]
*ii) Any careful attempt to design a backup system reveals that the available programs which transfer data from filesystems into other forms have severe problems. There are now known serious bugs in every program (including dump) used for this purpose. We need a new one. [Steve Romig, romig@cis.ohio-state.edu]
*iii) Techniques for reliably speeding up dump are reasonably well known. Now what do we do about restore?
*iv) How do you back up a terabyte? Suppose you have a migration system, with a terabyte or so of secondary storage, what do you do if the building burns down?
*E) Archive systems. If a user comes to me and says "I have here 100M of data which I may or may not need to look at some time in the next 20 years," what do I do with it?

2) Security
*A) What can you do when users have root? There are many situations in which it is simply impossible to take all root permission away from the users. What are the technical and personal measures you can take to let the users do what they need to without unduly compromising the security of your network?
*B) How can you convince users to co-operate with security precautions? You can force them to choose good passwords; if you try hard enough, you can even pretty much not drive them mad in the process. But you can't forcibly prevent people from writing down their passwords, or giving them out to other people. How do you make security safeguards that are livable and comprehensible, and get people not to turn around and destroy them for their own purposes?
*C) Trust in confederations. In many situations, systems with separate administrators are grouped together in loose confederations, where administrators on different systems are roughly peers, but need to work together.
(For instance, two departments within a company may each have separately administered machines.) In such a situation, you can't force your confederates to be trustworthy, but you may nevertheless have good reason for wanting to share resources in ways that require trust. How do you negotiate that trust while remaining secure?
*D) What tools are there for evaluating security, and how do they compare?
*E) How do you decide how secure you need to be?

3) Adding machines to your network
*A) How do you keep users from adding random machines to the network?
*B) Usually, you need to be willing to deal with all reasonable requests to add things. How can you tell which machines are reasonable to add to the network, and which aren't? Someone shows up in your office with a hyper-intelligent coffee maker that runs Mr. UNIX, and wants you to integrate it. Is it going to make the network explode or not?
*C) How do you figure out what it costs to add a machine? Obviously, if you have 200 Suns, adding another Sun costs something (adds network load, uses server space, etc.), adding a VAX costs something more (now you have to support another architecture), adding a hyper-intelligent coffee maker running Mr. UNIX costs something more (now you have to support another architecture that nobody can help you with), adding a VMS VAX costs yet more (a whole new operating system, another networking protocol...) But just what are the costs? Some increase linearly; it takes roughly twice as long to compile a program for two different architectures as for one. Some of them are much worse than linear; NFS may be a great thing, but it isn't always the same, and you need to test every architecture as client vs. every architecture as server, giving you an order N^2 problem.
*D) Once you know you have to add a machine, what do you need to do to integrate it?
*E) How do you plan your network to make it easier to add things to?
*F) What do you do when you need to increase the number of machines on your network by a factor of 2 or more?

4) Buying software
*A) How do you determine what the administrative cost of a piece of software is? Programs differ in costs to administer, sometimes in obvious ways (for instance, they require complex and horrible printcap files, or a separate printcap for every user), and sometimes in unobvious ways (using mh for mail makes for vast numbers of small files changing every day, strongly biasing the pattern of file system usage). When you are evaluating software, where do you look for these costs?
*B) How do you manage to install software, since vendor installation scripts tend to break things, or to fail?
*i) What would a vendor install script that you didn't hate look like?
*C) How do you select software to purchase? Given that users and system administrators tend to have different agendas in selecting software, what procedures and criteria can you use to make decisions that everybody can live with?

5) Monitoring usage
A) Statistics used for fairness and charging purposes
*i) Disk quotas. Berkeley UNIX systems come with a disk accounting system, but it isn't very effective. Many sites have their own accounting systems. What are the appropriate abstractions in such a system? Some quota systems look at file ownership; some look at position of the file in the file hierarchy. What do you do about tracking usage for multi-user groups or projects? How do you determine where quotas are set? Do you set quotas so that they total to no more than the available disk, and risk having users run out of quota before you run out of disk, or do you set them higher, and have users run out of disk before they run out of quota? Since disk space usage changes over time, how do you charge people for it? Who do you charge for it (users, projects, groups)? What do you do when they run out of it? Who gets to decide how much space people need, and what criteria do they use?
How do you keep people from cheating?
*ii) Printer accounting. There is moderate vendor-supplied support for tracking number of pages printed on some printers. This is not sufficient for most people who want to do printer accounting. There are also other issues; on a PostScript printer, you may spend many hours of printer time to produce a one-page image. Do you start accounting for printer CPU? On a network connected printer, where random machines may send it jobs, how do you even do page accounting?
*iii) Process accounting. Again, some versions of UNIX support some process accounting. Unfortunately, businesses usually want to charge projects, not people, and support for that is non-existent. How do you hack it in?
*iv) OK, so you've figured out how to track usage of printers, usage of disk, usage of CPU. How do you provide a single interface to all this data?
*v) You have all the data you could possibly want. How do you charge people? Do you charge them for connect time, or CPU cycles, or something else completely? If you don't charge them based on usage, do you charge them a flat fee, or a fee based on availability, or what?
*B) Statistics used for capacity estimation. It's easy to tell that you don't have enough network bandwidth, once you run out. How do you tell how much you have left before you run out? How do you know how soon you're going to run out of disk space? How do you know how many usable spare CPU cycles you have?
*C) Statistics used to track usage patterns for design and optimization purposes. When you go out to design a backup system, or speed up your network, or otherwise fiddle with things, you often need to know exactly what it is that people do with your system, and when. The sort of information you need tends to be different from the sort of information you need for charging people; for instance, you may need to know the number of files changed in a day, or the number of NFS reads as opposed to NFS writes that occur.
How do you figure out all of these things? nfsstat will tell you about NFS traffic (once you figure out what statistics you want and where you're going to keep them and how you're going to analyse them), and if you happen to have sources to it, your backup system can be instrumented to give you information about what files are changing when. But you need to know what statistics are important, how to gather them, and how to figure out what they mean.
*D) Statistics for communicating to users. System administrators generally have a vague idea what's going on with their machines. Users rarely have that much of an idea. How do you make available to them information that they can understand about what the computers are doing?
*E) Performance monitoring. How are your machines doing, and is it getting better or worse? Are the users complaining because users are like that, or because something is really wrong? And where do you get cute graphs that management likes that show how you're supplying marvelous facilities to people?

6) Clone wars. 100 identical machines are a lot easier to deal with than 100 different machines, and so are 100 mostly identical machines. But how do you get them that way, and how do you keep them that way, and how do you use their identicality to help you?
*A) Turning chaos into clones; how do you create a cloned site out of individual machines?
*B) Executing across multiple hosts. There are available programs that take a command and execute it on multiple hosts (for instance, gsh) but they tend to be highly site specific. How do you set one up for your site? Or, how about someone writing one that will work for a lot of sites without too much fiddling?
C) A cloned facility needs tools to make machines look alike.
*i) An overview of existing methods and philosophies for distributing changes between machines.
*ii) An improved version of rdist that would be widely available and applicable while implementing useful features like time-outs.
[Slightly worked on by Tom Christiansen, tchrist@convex.com]
D) When clones develop personality; how to deal with machines that are alike in some ways, but different in others. Machines are never quite perfect clones of each other, especially if they sit on people's desks. In many cases, the changes are encapsulated in pieces of files (for instance, printcap files that differ by default printer, or that have one printer local but all the rest remote). These are currently handled on a case by case basis at most places, with a program to take care of printcaps, and one to build rc.local files, and so forth.
*i) Those case by case programs are in themselves of interest.
*ii) A more general solution to the problem is also needed; how do you provide a flexible ability to customize files for multiple hosts? [slightly worked on by Elizabeth Zwicky, zwicky@erg.sri.com, separately by Steve Romig, romig@cis.ohio-state.edu]

7) Users as abstractions
*A) Creating user accounts; how to make an add user program. Unscientific surveys show that almost every site has their own add user program. There are good reasons for this, which are unlikely to change soon; what would be really nice is a comparative study of add user programs, suggesting what one ought to do in order to be secure, safe, effective, and flexible enough so that it won't have to be rewritten too often.
*B) User information beyond the password file. Gecos field or no gecos field, the password file doesn't hold all of the information that you want about users. What other information are people using, and how are they storing it and keeping it in sync?
*C) Removing user accounts. Creating accounts is comparatively easy. Removing accounts requires that you clean up all sorts of loose ends, and doing it from programs exposes you to all sorts of interesting problems (for instance, the operator who told the account program to remove an account which had / for a home directory).
What are the technical and political pitfalls, and what can you do about them? [Steve Simmons, scs@iti.org]
*D) Making your users into clones. Users will persist in having personalities, and in expressing them in their initialization files. What ways exist to force them into some sort of regularity, and what are their pros and cons? [One system is being worked on by J Greely, jgreely@cis.ohio-state.edu]

8) Users as people
*A) Training users. Answering questions is all very well, but getting people to where they don't need to ask them is even better, and leaves you with more free time. How do you do that? [Bryan MacDonald, bigmac@erg.sri.com]
*B) Users as customers vs. users as pond scum. System administrators are famous for a bad attitude about users (calling them lusers, for instance), but the users are also the people who pay the salaries. What attitude should we take towards the users, and how do we manage to have it and spread it? [Kevin Smallwood, kcs@houdini.cc.purdue.edu]
*C) Making users happy without actually fixing anything. You can't always fix everything, especially if you're hiding from user lynch mobs. What are the non-technical tricks that allow you to make the users happier, thereby disassembling the lynch mobs, so that you can peacefully go about your work?
*D) Should stupid users get stupid programs? People frequently complain about the difficulty of common UNIX programs, and want to replace them with easier ones for users who claim to be, or are perceived by administrators as being, incapable of using the normal tools. Is this a good idea? If it is, where do you find such programs? Can you find programs that are easy to use that also lead into normal UNIX tools?

9) Training system administrators
*A) What is the career path for system administrators? Where do they come from and where do they go?
*B) What do you do with new system administrators once you've got them, that gives them information without risking damage to your site?
*C) What resources are out there for people to learn from, particularly from other fields? (For instance, people have suggested a reading list including such things as "In Search of Excellence", and "The Mythical Man-Month")

10) What do system administrators do?
*A) Just what is the point of all this? Are we trying to make existing machines run? Are we trying to provide some level of service? To whom?
*B) How are system administrators like and unlike user support people, system programmers, and so on?
*C) How do you explain to managers what system administration is like; why it can't be managed the same way that research programming can, why it is difficult and takes trained people who get paid real money?

11) Centralization vs. decentralization
*A) How do you figure out where to make the tradeoff between the economies of scale and administrative advantages of centralizing things like disk space and printer service, and the fault tolerance and individual control of distributing them?
*B) What administrative functions must be centrally controlled, and which ones can be safely handed out? How do you make central organization in a group with no center (for instance, trying to share a network between projects, where somebody has to administer network addresses, but nobody has authority over anybody else)?

12) Working together
*A) How do you administer sites that are physically remote?
*B) As a large site, how can you deal with associated tiny sites? If you administer 300 machines in one place, and 3 in another, how do you come up with a system that copes with both?
*C) How can you make a confederation of administrators within an organization, and what can one do for you? [Mark Verber, verber@pacific.mps.ohio-state.edu]

13) Are those apples treated with Alar? Motherhood and apple pie reconsidered.
*A) Are policy free tools possible, or even advisable?
Is it really better to give people the ability to make their own stupid policies easily, or to give them tools that implement intelligent possibilities with a few degrees of freedom?

14) When things break
*A) Hack it, or track it? When you run across a problem that has a fix, do you apply the fix even if you don't yet understand the problem, or do you attempt to track it down even if that means leaving things broken? Obviously, rebooting the machine will fix a lot of problems, but sometimes it will keep you from figuring out the bug, reporting it to the manufacturer, and getting it fixed forever.
*B) 24 hour support; do you provide it and if so how? Are beepers evil? Programmers are famous for working at all hours of the night, which is all very well for them, but if you have to deal with a whole bunch of them, they might want help at any hour of the day or night. Most system administrators want a life of some sort; how do you get one while keeping the users happy?
*C) 20 questions to ask users when they report a problem. So a user calls you up and says "Mail doesn't work." What do you do then?
*D) You have found a problem. You know how to fix it. How do you install the fix in such a way that you don't undo it later? What should you do with the fix besides install it? Tell your vendor? Tell other people? How?

15) Making changes
*A) How do you help users adjust to changes? You can't run V7 on a PDP-11 forever; at some point, the users are going to have to change hardware and software platforms. How do you reduce the trauma?
*B) When is it time to upgrade? Folk wisdom says "never install an even release", and self-preservation suggests that switching your entire site over to a beta release is not a good idea. But there is no release without bugs, and at some point you're going to have to decide to live with it. When? For that matter, when is the pain and expense of upgrading your hardware platform outweighed by the pain and expense of keeping the old one?
*C) Beating swords into plowshares versus buying tractors. Most system administrators are virtuosi at the UNIX philosophy of combining together old tools with baling wire and string, which is cheap in some ways, and gives you that warm glow of accomplishment. On the other hand, there's a lot to be said for throwing out the old and doing something new. When do you decide that you should stop trying to coerce the old operating system (name service, printer system) into working and design a new one from the ground up?
*D) How do you manage to keep local "improvements" and still be able to change with the rest of the world? So you rewrote the printer system (or you wrote an adduser program, or you made talk(1) work on everything). And then you bought 10 new machines, 2 each from 5 different hardware vendors, and all your old vendors released new OS versions. What makes this not a really good time to become a carpenter?
*E) Justifying the expense. It may be obvious to you that life would be much better if you had more than 4 M of real memory in the Sun 3s that everyone wants to run Sun OS 4.1, OpenWindows, and 5 copies of emacs on, but how do you make it obvious to the people who spend the money?

16) Electronic communication
*A) Usenet; how do you control your users and your disk usage without being (too much of) a fascist?
*B) Mail
*i) Compare and contrast the various methods of getting all the mail for a site to deliver to one place. Among these methods: NFS mount /usr/spool; use aliases or .forwards everywhere to deliver mail to one machine, or one machine per user; deliver mail to home directories; automount /usr/spool/username for every user.

17) Testing
*A) You install, reinstall, or upgrade a machine. Without using users as test suites, how do you know it works?
*B) You have NFS, or NIS, or Kerberos, or X Windows implementations from many vendors. How do you figure out where they do and don't work with each other?
18) Documentation [A & B both worked on by Elizabeth Zwicky, zwicky@erg.sri.com, and Mark Verber, verber@pacific.mps.ohio-state.edu jointly]
*A) What documentation should you produce for your site, aside from that shipped by vendors? What available documentation is out there to give to users?
*B) What tools are there to make producing user documentation easier?

19) Little machines become big problems
*A) How do you make the PCs talk to the world at large? You need to connect them to big networks, and provide services to them somehow. But it's a very uneasy alliance between PC programs designed for little networks, and protocols designed for big networks. How do you make the connections work smoothly? (And has anyone ever met a Mac mail program they actually like? The users liking it doesn't count if the administrators hate it...)

20) Parcelling out the CPU cycles
A) How do you let users make use of spare CPU cycles anywhere in the network, without giving them non-spare CPU cycles?
*i) In a workstation environment, users coming in over a network, instead of logging in at consoles, need to be distributed easily between machines. How do you do that? How do you mediate between users coming in over the wire, and users at the console of the machine?
ii) Single jobs may also want to be distributed, either by
*a) using an entire network of machines as an extremely coarse-grained parallel processor or
*b) using a more powerful or less loaded machine elsewhere on the network.
Facilities for doing the latter exist, certainly; I'm not certain there is even any help for doing the former. How do the available facilities compare? How do you assist people in making programs work in these situations?
*B) How do you deal with batch jobs under UNIX?

21) Watching users
*A) Users with questions are often unable to adequately describe what they are doing and what the machine is doing in response. What facilities are available for connecting to what they are doing to watch and assist?
*B) How do you keep an eye on suspicious or malicious users?
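
The interoperability question in 3C (and again in 17B) can be made concrete with a little counting. Testing every architecture as client against every architecture as server means one test per ordered (client, server) pair, so the work grows with the square of the number of architectures. The sketch below is only an illustration: the function name and the architecture names are made up for the example, and a real test matrix would also have to account for OS versions and protocol options.

```python
from itertools import product


def interop_test_pairs(architectures):
    """Return every (client, server) pairing that would need a test.

    Same-architecture pairs are included, since a machine talking to
    its own kind is still a configuration worth checking.
    """
    return list(product(architectures, repeat=2))


if __name__ == "__main__":
    # Hypothetical site: the names are placeholders, not a real inventory.
    archs = ["sun3", "sun4", "vax", "coffee-maker"]
    pairs = interop_test_pairs(archs)
    print(f"{len(archs)} architectures -> {len(pairs)} client/server tests")
    for client, server in pairs:
        print(f"  client={client:<12} server={server}")
```

Adding a fifth architecture to this hypothetical site takes the count from 16 to 25, which is the sort of number worth having in hand when someone asks what one more kind of machine will cost.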
dmckeon@hydra.unm.edu (Denis McKeon) (02/08/91)
Those were a lot of good hard questions about large systems admin.
Here's a small tidbit in reply:

> *B) How can you convince users to co-operate with security
>precautions? You can force them to choose good passwords; if you try
>hard enough, you can even pretty much not drive them mad in the
>process. But you can't forcibly prevent people from writing down their
>passwords, or giving them out to other people. How do you make
>security safeguards that are livable and comprehensible, and get
>people not to turn around and destroy them for their own purposes?

My preferred approach to creating easy-to-remember passwords which are not words in any language is to use the initial letters of easily remembered phrases, for instance:

    password
    string      memorable phrase

    NIlmdts     Now I lay me down to sleep
    Ttsciab     These two strings come into a bar
    Witcohe     When in the course of human events
    Csmnlra     Congress shall make no law respecting an
    eoroptf     establishment of religion, or prohibiting the free
    etoatfo     exercise thereof; or abridging the freedom of
    sootpot     speech, or of the press; or the
    rotppta     right of the people peaceably to assemble,
    atptGfa     and to petition the Government for a
    rog         redress of grievances.

Note that phrases in foreign languages, poetry, even .sig quotes can be used.

Benefits:

Users will have less need to write down more memorable passwords.

Password string is easy to recall once you recall the mnemonic phrase, thus does not need to be written down (up to some fairly small limit of different strings/phrases.)

Someone watching you type the password has a harder time visually collecting the letters and remembering them (harder than a word, or someone's name, anyway - which a good system shouldn't allow).

In an environment that forces periodic new passwords, the user can jump ahead a few words in the source text - better than switching back & forth between two passwords.
(but no good if the cracker knows the previous mnemonic phrase and can attribute it.)
(but if the cracker knows only the password string it can map to many phrases.)

Drawbacks:

While this approach makes passwords more memorable, it doesn't produce most-difficult-to-crack-by-brute-force passwords. It also doesn't address people sharing access to their accounts.

Some people have a tendency to softly vocalize the mnemonic phrase.

Characters in the password string are usually in range a-z, almost all in a-zA-Z, often with initial upper-case letter, thus susceptible to brute force of all combinations of alphas.
(but You can adopt the German Model of capitalizing all Nouns.)

Cracker could use CD-ROM encyclopedia to generate strings for brute force (seems like more work than all alpha combos).

Of course you can enrich the password character mix by doing things like:

    1Bs-Iw!     One Bell system - It works!

but that limits your choice of phrases to those with numeric words - perhaps a combination with the license plate model would be better:

    O!U812.     Oh! You ate one too.
                (or replace with (usually suggestive) phrase of your choice)
                (followup suggestions to rec.humor with Subject: YALPPWS - :-)
    YALPPWS     Yet Another License Plate PassWord String :-)

Now before you-all in netland get out the flame-throwers, yes, I do realize that randomly generated printing ASCII strings are more secure from brute force attacks than alpha strings - but I think that most people will agree that random strings are harder to remember (unless committed to paper.)

This mnemonic phrase approach isn't a panacea - just a spoonful of sugar to help create (optionally enforced) non-word, non-name password strings which are more memorable than random ones.

Well, certainly 'nuff said - Followups re memorable passwords to comp.unix.large - perhaps an odd choice, but the only security group at this site is alt.security, and I can't be sure that an alt. group is distributed as the original posting was.

--
Denis dmckeon@hydra.unm.edu
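
For anyone who wants to play with the initial-letter scheme described above, here is a minimal Python sketch. It is an illustration only: the function names mnemonic_password and keyspace are invented for this example, the splitting rule (first character of each whitespace-separated word) is an assumption that reproduces the simple table entries but not the hand-built ones like 1Bs-Iw!, and the keyspace figures are plain counting, not a statement about how any real cracker works.

```python
import string


def mnemonic_password(phrase: str) -> str:
    """Build a password from the first character of each word.

    Words are split on whitespace only, so punctuation attached to the
    end of a word is ignored and case is kept exactly as typed.
    """
    return "".join(word[0] for word in phrase.split())


def keyspace(alphabet_size: int, length: int) -> int:
    """Count the strings of exactly `length` characters over an alphabet."""
    return alphabet_size ** length


if __name__ == "__main__":
    for phrase in ("Now I lay me down to sleep",
                   "These two strings come into a bar",
                   "When in the course of human events"):
        print(f"{mnemonic_password(phrase):<10} {phrase}")

    # Compare the brute-force search space for 7-character strings.
    lower = len(string.ascii_lowercase)   # a-z
    alpha = len(string.ascii_letters)     # a-zA-Z
    printing = len(string.digits + string.ascii_letters + string.punctuation)
    for label, size in (("a-z", lower), ("a-zA-Z", alpha),
                        ("printing, non-space ASCII", printing)):
        print(f"{label:<26} {size:>3} symbols -> {keyspace(size, 7):,} strings")
```

The second half puts numbers on the brute-force drawback: a password drawn only from letters sits in a much smaller search space than one drawn from all printing characters, which is the tradeoff against memorability discussed in the post.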
bernie@metapro.DIALix.oz.au (Bernd Felsche) (02/14/91)
In <1991Feb07.190832.26048@ariel.unm.edu> dmckeon@hydra.unm.edu (Denis McKeon) writes:

>Of course you can enrich the password character mix by doing things like:
> 1Bs-Iw! One Bell system - It works!

There are mnemonics using digits and alphabetics, like:

    0b1ken0b    For Star Trek fans
    4getful     For those who can't remember

And other possible combinations of digits which sound like syllables 0,1,2,4,8 and for New Zealanders, 6. (try 6ul-dv8 with an NZer)

Use your imagination! Passwords like these are difficult to guess, and are often not obvious, even when written down.

It would be extremely difficult to mechanically crack them as well, though not impossible. Generating them would be almost as difficult.

One of the best approaches with maintaining good passwords for users is to try to crack them yourself (at off-peak times), using personal data for users. If you crack one, I reckon they should pay you $20. If they forget theirs, it should cost them $5 to get a new one assigned.
--
 _--_|\   Bernd Felsche         #include <std/disclaimer.h>
/      \  Metapro Systems, 328 Albany Highway, Victoria Park, Western Australia
\_.--._/  Fax: +61 9 472 3337  Phone: +61 9 362 9355  TZ=WST-8
      v   E-Mail: bernie@metapro.DIALix.oz.au | bernie@DIALix.oz.au
adeboer@gjetor.geac.COM (Anthony DeBoer) (02/16/91)
In article <1991Feb14.051446.3088@metapro.DIALix.oz.au> bernie@metapro.DIALix.oz.au (Bernd Felsche) writes:
>In <1991Feb07.190832.26048@ariel.unm.edu> dmckeon@hydra.unm.edu (Denis McKeon) writes:
>
>>Of course you can enrich the password character mix by doing things like:
>> 1Bs-Iw! One Bell system - It works!
>
>There are mnemonics using digits and alphabetics, like:
> 0b1ken0b   For Star Trek fans
  ^^^^^^^^            ^^^^
Strange, never saw Alec Guinness do SF until Star _Wars_ came out :-)

> 4getful    For those who can't remember
  ^^^^^^^ maybe? ;-)
--
Anthony DeBoer NAUI#Z8800 | adeboer@gjetor.geac.com   | Programmer (n): One who
Geac J&E Systems Ltd.     | uunet!geac!gjetor!adeboer | makes the lies the
Toronto, Ontario, Canada  | #include <disclaimer.h>   | salesman told come true.
bernie@metapro.DIALix.oz.au (Bernd Felsche) (02/16/91)
In <1991Feb14.051446.3088@metapro.DIALix.oz.au> bernie@metapro.DIALix.oz.au (Bernd Felsche) writes:

> 0b1ken0b   For Star Trek fans
                      ^^^^^
That should have been WARS. May the force be with you! And the flood of e-mail cease!

Hands must have been disconnected from brain for a few cycles.

Thanks to all of those who've pointed this out. I don't usually make misteaks :-)
--
 _--_|\   Bernd Felsche         #include <std/disclaimer.h>
/      \  Metapro Systems, 328 Albany Highway, Victoria Park, Western Australia
\_.--._/  Fax: +61 9 472 3337  Phone: +61 9 362 9355  TZ=WST-8
      v   E-Mail: bernie@metapro.DIALix.oz.au | bernie@DIALix.oz.au