hedrick@topaz.RUTGERS.EDU (Charles Hedrick) (01/16/86)
A previous message suggested using "sendmail -bt" to see how sendmail is going to process an address. This is indeed a handy command for testing how an address will be processed. However, the instructions given were not quite right. To see how sendmail is going to deliver mail to a given address, a reasonable thing to type is

        sendmail -bt
        0,4 address

Even this isn't quite right, but with "normal" rule sets it should work. Because there is so much confusion about sendmail rules, the rest of this message contains a brief tutorial.

My own opinion of sendmail is that it is quite a good piece of work. Many people have complained about the difficulty of understanding sendmail rule sets. However, I have also worked with mailers that code address processing directly into the program. I much prefer sendmail. The real problem is not with sendmail, but with the rules. The rules normally shipped from Berkeley have lots of code that does strange Berkeley-specific things, and they are not commented. Also, typical complex rule sets are trying to handle lots of things, forwarding mail among several different mail systems with incompatible addressing conventions. A rule set to handle just old-style (non-domain) UUCP mail would be very simple and easy to understand. But real rule sets are not doing simple things, so they are not simple.

For those not familiar with sendmail, -bt invokes the rule tester. It lets you type a set of rule numbers and an address, and then shows you what the rules will do to that address. In addition, rule test mode automatically applies rule 3 before whatever rule you ask it to apply. As we will see shortly, this is a reasonable thing to do.

Before describing the rule sets, let me define two terms: "header" and "envelope". Header refers to the lines at the beginning of the message, starting with "from:", "to:", "subject:", etc. Sendmail does process these lines. E.g.
with uucp mail it will add its own host name at the beginning of the from line, so that the final recipient stands some chance of replying to the message. However, sendmail normally does not depend upon the from and to lines to perform its actual delivery. It has more direct knowledge, passed on to it from the program that generated the mail, or if it came from another site, the mailer at that site. This information is referred to as the "envelope", since it is like the addresses on the outside of an envelope. For Arpanet mail, the envelope is passed to the next site by the MAIL FROM: and RCPT TO: commands. For UUCP mail, it is passed on as arguments to the remote rmail command.

To see why there have to be separate addresses "on the envelope", consider what happens when you send mail to "john@vax, mary@sun". Two copies of the message will be dispatched, one to vax and the other to sun. The "to: " line in the headers will show both addresses. However, the envelope will show only the address that we want this copy to go to. The copy sent to vax will show "john@vax" and the copy sent to sun will show "mary@sun". If sendmail had to look at the "to: " line, it would never know which of the addresses shown there it was responsible for handling.

Anyway, here is what the rules do:

3: always done first. This turns addresses from their normal textual form into a form that the rest of the rules understand. In most cases, all it does is put < > around the name of the host that is next in line. Thus foo@bar turns into foo<@bar>. However, it also does a few transformations. E.g. it turns foo!bar!user into bar!user<@foo.UUCP>. Since sendmail accepts either ! syntax or @....UUCP syntax, rule 3 standardizes on @ syntax. It also does a few other minor things. But you won't be far off if you just think of it as adding < > around the host name.

4: always done last. This turns addresses from internal form back into external form.
It removes the < > around the host name, and turns foo@bar.UUCP back into bar!foo. Again, there are one or two other minor things, but you won't be too far off if you think of 4 as just removing the < > around the host name.

0: This is the rule that handles the destination address on the envelope. It is in some sense the primary rule. It returns a triple: protocol, host, user. The protocol is usually one of local, TCP, or UUCP. At the moment, it figures this out syntactically. In our rule set, hosts ending in .UUCP are handled by UUCP, the current host is local, and everything else is TCP. As domains are integrated into UUCP, obviously this rule is going to change. This rule does very little other than simply look at the format of the host name, though as usual a few other details are involved (e.g. it removes the local host, so myhost!foo!bar will be sent directly to foo).

1 and 2 are protocol-independent transformations used for sender and recipient lines in the header (i.e. from: and to: lines). In our rule sets, they don't do anything.

Each protocol has its own rules to use for sender and recipient lines in the header. E.g. UUCP rules might add the local host name to the beginning of the from line and remove it from the to line. In our rule set, the complexities in these rules are primarily caused by forwarding between UUCP and TCP. The line that defines the mailer for a protocol lists the rules to use for sender and recipient, in the S= and R= fields.

Finally, here is the exact sequence in which these rules are used. For example, the first line means that the destination specified in the envelope is processed first by rule 3, then rule 0, then rule 4.

        envelope recipient:  3,0,4     [actually rule 4 is applied only to the
                                        user name portion of what rule 0 returns]
        envelope sender:     3,1,4
        header recipient:    3,2,xx,4  [xx is the rule number specified in R=]
        header sender:       3,1,xx,4  [xx is the rule number specified in S=]

I have the impression that the sender from the envelope (the return-path) may actually get processed twice, once by 3,1,4 and the second time by 3,1,xx,4. However, I'm not sure about that.

Now for the format of the rules themselves. I'm just going to show some examples, since sendmail comes with a reference manual, which you can refer to. However, these examples are probably enough to let you understand any set of rules that makes sense in the first place (which the normal rules do not).

This example is from our UUCP definition. It is a simplified version of the set of rules used to process the sender specification. As such, the major thing it has to do is to add our host name to the beginning, so that the guy at the end will know that the mail went through us.

        S13
        R$+<@$-.UUCP>   $2!$1           u@host.UUCP => host!u
        R$=U!$+         $2              strip local name
        R$+             $:$U!$1         stick on our host name

Briefly, the first rule turns the address from the form foo<@bar.UUCP> back into bar!foo. The second rule removes our local host name, if it happens to be there already, so we don't get it twice. The third rule adds our host name to the beginning.

S13 says that this is the beginning of a new rule set, number 13.

        R$+<@$-.UUCP>   $2!$1           u@host.UUCP => host!u

R says that this is a rule. The thing immediately after it, $+<@$-.UUCP>, is a pattern. If this pattern matches the address, then the rule "triggers". If the rule triggers, the address is replaced with the "right hand side", i.e. what is after the tab(s). In this rule, the right hand side is $2!$1. The thing after the next tab(s) is a comment.

This rule is used in processing UUCP addresses. As noted above, by the time we get to it, rule 3 has already been applied.
So if we had a UUCP address of the form host1!host2!user, it would now be in the form host2!user<@host1.UUCP>. This does match the pattern:

        $+        <@$-   .UUCP>
        host2!user<@host1.UUCP>

$+ and $- are "wildcards" that match anything. $- will match exactly one word, while $+ will match any number. (By the way, with the increasing use of domains, this production should probably use $+.UUCP, not $-.UUCP.) Since the pattern matches, we replace this with the "right hand side" of the rule, $2!$1. $ followed by a digit means the Nth thing matched by a wildcard. In this case there were two wildcards, so

        $1 = host2!user
        $2 = host1

The final result is

        host1!host2!user

As you can see, we have simply turned UUCP addresses from the format produced by rule 3 back into normal ! format.

The second rule is

        R$=U!$+         $2              strip local name

This is needed because there are situations in which our host name ends up on the beginning of the recipient address. Since we are about to add our host name, we don't want it to be there twice. So if it was there before, we remove it. $= is used to see if something is a member of a specified "class". U happens to be a list of our UUCP host name and any nicknames. So $=U!$+ matches any address that begins with our host name or nickname, then !, then anything else. Suppose we had topaz!host1!host2!user. The match would be

        $=U  !$+
        topaz!host1!host2!user

The result of the match is that

        $1 = topaz
        $2 = host1!host2!user

Since the right hand side of this rule is simply "$2", the result is

        host1!host2!user

I.e. we have removed the topaz from the beginning. By the way, the class U used by the rule would have been defined earlier in the file by the statement

        CUtopaz ru-topaz

C defines a class. U is the name of the class. The rest of the line is the list of things that will be in the class.

Finally we have the rule

        R$+             $:$U!$1         stick on our host name

The $+ matches anything.
In this case the name is host1!host2!user, so the result of the match is

        $1 = host1!host2!user

The right hand side looks slightly obscure. $: is a tag that says to do this only once. The problem is that this rule always applies, since the pattern matches anything. Normally, rules are applied over and over, as long as they apply. In this case, the result would be an infinite loop. Putting $: at the beginning says to do it only once. $U says to use the value of the macro U. Earlier in the file we defined U as our UUCP host name, with a definition

        DUtopaz

Note that there can be a class and a macro with the same name. $=U tests whether something is in the class U. $U is replaced by the value of the macro U. So the final value of this rule, $:$U!$1, is

        topaz!host1!host2!user

So this rule has managed to add our host name to the beginning, as it was supposed to. Since there are no further rules in the set (the next line is the end of file or the beginning of a new rule set), this value is returned.

There are several more magic things that can appear in a pattern. The most important are:

$* - this is another wild card. It is similar to $+, but $+ matches anything, whereas $* matches both anything and nothing. I.e. $+ matches 1 or more tokens and $* matches 0 or more tokens. So here is a list of the wildcards I have mentioned:

        $*      0 or more
        $+      1 or more
        $-      exactly 1
        $=x     any member of class x

A typical example of $* is a production where we aren't sure whether the user name is before or after the host name:

        R$*<@$+.UUCP>$*         $@$1<@$2.UUCP>$3

This production would test for the host name ending in .UUCP, and return immediately. $@ is a flag you haven't seen yet. It is simply a return statement. It causes the right hand side of this rule to be returned as the final value of this rule set.

The other magic thing I will mention is $>. This is a subroutine call. Here is an example taken from rule set 24, which is used to process recipients in TCP mail.
Its purpose is to handle the situation where we might have an address like topaz!user@red. (Our host name is topaz. Red is a local host that we talk to via TCP.) I.e. someone is asking us to relay mail to red. Rule 3 will have turned this into user@red<@topaz.UUCP>. What we want to do is get rid of the topaz.UUCP and treat red as the host. (Rule set 0 would do this for the recipient on the envelope. This rule is used for the to: field in the header.) Here is the rule.

        R$+<@$=U.UUCP>          $@$>9$1         in case local!a@b

The pattern matches our example, as follows:

        $+      <@$=U  .UUCP>
        user@red<@topaz.UUCP>

Recall that $+ matches anything and $=U tests whether something is our UUCP host name or one of our nicknames. The result of the match is

        $1 = user@red
        $2 = topaz

The right hand side is $@$>9$1. The $@ is the tag saying to stop the rule set here and return this value. $>9 is a subroutine call. It says to take the right hand side, pass it to rule set 9, and then use the value of rule set 9. The actual right hand side is simply $1, which in this case is user@red.

Here is rule set 9:

        S9
        R$*<$*>$*       $1$2$3          defocus
        R$+             $:$>3$1         make canonical
        R$+             $@$>24$1        and do 24 again

The first rule simply removes < >. It is sort of a quick and dirty version of rule 4. In fact we have no < > left, since we have removed the <@topaz.UUCP>. So this rule does not trigger. (Now that I think about it, I suspect it is probably never going to trigger, and so is not needed.)

The next rule is a simple subroutine call. It matches anything ($+ matches any 1 or more tokens). The right hand side is

        $:$>3$1

The $: says to do it only once. Since the rule matches anything, you need this, or you will have an infinite loop. The $>3 says to call rule 3 as a subroutine. The $1 is the actual right hand side. Since the left hand side matched the whole address, what this rule does is simply call rule set 3 on the whole address. Recall that rule set 3 basically locates the host name and puts < > around it.
So in this case the result is user<@red>. As you can see, it was not enough to remove <@topaz.UUCP>. That leaves us with no host name. We have to call rule 3 to find the current host name and put < > around it.

The last rule is really just a goto statement. The pattern is $+, which matches anything, so it always triggers. The right hand side is $@$>24$1. The $@ is the return tag. It says to stop this rule set and return that value. $>24 says to call rule set 24. The actual right hand side is $1, so we call rule set 24 with the whole address. If you recall, this ruleset (9) was called from the middle of 24 when we found user@red<@topaz.UUCP>. So what we have done is to change this into user<@red> and say to start rule set 24 over again.

I hope you have found this exposition useful. As a final convenience, here is a "reference card" for reading rule sets. Note that this contains only operators used by the rules. There are plenty of other facilities used in the configuration section which I am not documenting here. (I'd love to see someone produce a complete reference card.)

wildcards:
        $*      0 or more tokens
        $+      1 or more tokens
        $-      exactly one token
        $=x     member of class x (x must be a letter, lower/upper
                case distinct)
        $~x     not a member of class x

macro values (usable in pattern or on right hand side):
        $x      value of macro x (x must be a letter, lower/upper
                case distinct). At least on the Pyramid, $x is
                replaced by the macro's value when the sendmail.cf
                file is being read in.

on the right hand side:
        $n      string matched by the Nth wildcard
        $>n     call rule set N as a subroutine
        $@      return
        $:      only do this rule once

in rule 0, defining the return value:
        $#      protocol
        $@      host
        $:      user

Rutgers extensions, usable only on right hand side:
        $%n     take the string matched by the Nth wildcard, look it
                up in /etc/hosts, and if found use the primary host
                name
        $&x     use the current value of macro x. x must be a letter.
                Upper and lower case are treated as distinct.
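
For readers who find it easier to follow code than prose, the S13 walk-through above can be mimicked with a small Python sketch. This is only an illustration, not how sendmail is implemented: the tokenizer, the function names (tokenize, match, expand, apply_ruleset) and the case-insensitive comparison are assumptions of mine, and only $*, $+, $-, $=x, $n, $x and $: are handled.

```python
import re

# Assumption: a fixed set of operator characters. Real sendmail takes them
# from its configuration; this sketch hard-codes < > @ ! and .
TOKEN_RE = re.compile(r"\$[*+\-]|\$=[A-Za-z]|[<>@!.]|[^<>@!.$]+")


def tokenize(text):
    """Split a pattern or an address into tokens."""
    return TOKEN_RE.findall(text)


def match(pattern, tokens, classes):
    """Return the list of wildcard captures, or None if nothing matches."""
    if not pattern:
        return [] if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if head in ("$*", "$+"):
        start = 0 if head == "$*" else 1
        # Try the shortest capture first, then backtrack to longer ones.
        for n in range(start, len(tokens) + 1):
            tail = match(rest, tokens[n:], classes)
            if tail is not None:
                return ["".join(tokens[:n])] + tail
        return None
    if not tokens:
        return None
    if head == "$-":
        ok, capture = True, True
    elif head.startswith("$="):
        ok, capture = tokens[0].lower() in classes[head[2]], True
    else:
        ok, capture = head.lower() == tokens[0].lower(), False
    if not ok:
        return None
    tail = match(rest, tokens[1:], classes)
    if tail is None:
        return None
    return ([tokens[0]] if capture else []) + tail


def expand(rhs, caps, macros):
    """Substitute $n (Nth capture) and $x (macro value) in a right hand side."""
    def repl(m):
        if m.group(1):
            return caps[int(m.group(1)) - 1]
        return macros[m.group(2)]
    return re.sub(r"\$(\d)|\$([A-Za-z])", repl, rhs)


def apply_ruleset(rules, address, classes, macros):
    """Apply each rule repeatedly while it matches; $: means only once."""
    for lhs, rhs in rules:
        once = rhs.startswith("$:")
        if once:
            rhs = rhs[2:]
        pattern = tokenize(lhs)
        while True:
            caps = match(pattern, tokenize(address), classes)
            if caps is None:
                break
            address = expand(rhs, caps, macros)
            if once:
                break
    return address


# The three rules of S13, with class U and macro U as in the text.
S13 = [("$+<@$-.UUCP>", "$2!$1"), ("$=U!$+", "$2"), ("$+", "$:$U!$1")]
CLASSES = {"U": {"topaz", "ru-topaz"}}
MACROS = {"U": "topaz"}

print(apply_ruleset(S13, "host2!user<@host1.UUCP>", CLASSES, MACROS))
```

Feeding it host2!user<@host1.UUCP> goes through the same three steps as the text: back to ! format, no local name to strip, then topaz stuck on the front. Dropping the $: from the last rule makes the sketch loop forever, which is exactly the problem $: is there to prevent.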
chris@umcp-cs.UUCP (Chris Torek) (01/20/86)
One rather vague point in all the sendmail documentation I have read is that the `subroutine call' RHS macro $>n works by finishing the expansion of the RHS, then taking the entire result and handing it to ruleset n. It might seem that there is a way to hand it just part of an address, which I happen to feel would be extremely useful, but there is none. Also note that the $>n must be the first part of the string on the RHS after $:, $@, or $# (if present). This is not true of the host name canonicalization feature in 4.3 sendmail, $[; that may appear anywhere.

While I am running off at the keyboard here, here are some of the other things I would also like to see in sendmail. Remember that this is supposed to be an address rewriting programming language. (What else can you call it?)

 - Long variable names. 32 characters is an absolute minimum;
   flexnames preferred. Small name spaces are bad. Remember
   the *big* 8K (core of course) computers you worked on?

 - Real control structures, not just `do this once per
   match' vs. `do this every match'.

 - Table lookups with return values, tables usually
   being files of some kind, hopefully fast (e.g., hashed).

You might think this is overkill; and maybe it is. But then I am the guy who is recommending 64-bit address spaces on new hardware.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:   seismo!umcp-cs!chris
CSNet:  chris@umcp-cs           ARPA:   chris@mimsy.umd.edu
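
The point about $>n can be put as a few lines of Python. This is a sketch of the behaviour as described in this message, not code taken from sendmail: the function name expand_rhs and the stand-in rule set in RULESETS are made up for the illustration, and only the $: and $@ prefixes are handled ($# is left out).

```python
import re


def expand_rhs(rhs, caps, rulesets):
    """Expand one right hand side, honouring a leading $>n call."""
    prefix = ""
    if rhs[:2] in ("$:", "$@"):
        prefix, rhs = rhs[:2], rhs[2:]
    # $>n is only recognised at the front, right after the prefix.
    call = re.match(r"\$>(\d+)", rhs)
    if call:
        rhs = rhs[call.end():]
    # Finish the whole expansion first ...
    text = re.sub(r"\$(\d)", lambda m: caps[int(m.group(1)) - 1], rhs)
    # ... and only then hand the entire result to the called rule set.
    if call:
        text = rulesets[int(call.group(1))](text)
    return prefix, text


# Hypothetical stand-in for rule set 9: it just brackets whatever it is given.
RULESETS = {9: lambda addr: "[" + addr + "]"}

print(expand_rhs("$@$>9$1", ["user@red", "topaz"], RULESETS))
print(expand_rhs("$>9$2!$1", ["a", "b"], RULESETS))
```

In the second call the stand-in rule set receives the complete string b!a. There is no way to write the right hand side so that it sees only $2.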
earle@smeagol.UUCP (Greg Earle) (02/01/86)
> While I am running off at the keyboard here, here are some of the
> other things I would also like to see in sendmail.  Remember that
> this is supposed to be an address rewriting programming language.
> (What else can you call it?)
>
>  - Long variable names.  32 characters is an absolute minimum;
>    flexnames preferred.  Small name spaces are bad.  Remember
>    the *big* 8K (core of course) computers you worked on?
>
>  - Real control structures, not just `do this once per
>    match' vs. `do this every match'.
>
>  - Table lookups with return values, tables usually
>    being files of some kind, hopefully fast (e.g., hashed).

To which I'll add my two cents worth:

 - FIX THE PROBLEM WITH LONG DESTINATION ADDRESSES! *Three* times I
   have seen mail come in from a site 1 away from us, which was
   intended to be forwarded by my machine, where (because of the
   incoming news path) the destination address was like 19 sites
   long. Sendmail (a la 'sort') "silently truncated" the long dest.
   addr to 1!2!...!14!15 so it looked like the destination was user
   (machine 15) at site 14!! In all three cases, the destination
   address got cut off at 105 characters long.

 - Either extend the 'F' class operator, so that it can return a
   2-tuple, with key and associated string, or use a new letter that
   can perform this feature. I am pretty sure I saw this in the
   Maryland sendmail changes; it should be standard. There's no
   reason at this late stage in the game why everybody should have
   to have a 'uumail' program written especially to sit in between
   sendmail and uux just so you can use the pathalias data. Sendmail
   should do this itself, via this changed/new operator.

 - There seems to be an oversight in Ruleset 0, whereby if you don't
   put in your own rule to strip off the name of the relay host when
   sending to the Arpanet from a uucp site, then later on if the
   receiver tries to reply, the address will be such that it thinks
   it has a uucp link to the Arpa site that first got your outgoing
   mail (instead of a TCP link). The reply gets munged ...
   Our rule for this (thanks to David Robinson at Caltech) is:

        # Remove name of relay host if sending to arpanet
        R$R!$*<@$*$-.arpa>$*    $#$M $@$R $:$1@$2$3.arpa$4      rhost!user@any.arpa

---------------------
Greg Earle      sdcrdcf!smeagol!earle (UUCP)
                ia-sun2!smeagol!earle@csvax.caltech.edu (ARPA)