[net.bugs.4bsd] awk.lx.l missing \b \r \f \ddd

liberte@uiucdcs.UUCP (04/28/84)
uiucdcs!liberte    Apr 27 16:14:00 1984

Subject: Awk doesn't allow \b \r \f and \ddd in strings, regular
	 expressions (except \ddd) or character classes.

Index:	/usr/src/bin/awk/awk.lx.l  4.2BSD

Description:
	According to the awk manual, "The version of printf is identical to
	that used with C".  This suggests that all of C's escape sequences
	should work, but only \n \t \\ and \" do.  Thus the only way to get
	a \f is to printf "%c" with the character code for FF, which I just
	now failed to find within 2 minutes.
	
	I subsequently learned that all of awk's escape sequence processing
	is done by the lex scanner as the program text is read (much as in
	C), so you can forget about trying to print "\\" "n".  In the
	interest of compatibility with C, I added the missing code.  I also
	fixed some other inconsistencies:
	  1) \ddd now accepts 1, 2 or 3 octal digits only.
		(stock awk requires exactly 3 digits and accepts the digits 8 and 9)
	  2) \c drops the \ when c is not a recognized escape character.
		(stock awk drops the \ only in regular expressions)
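
The rules the patch aims for can be sketched in plain C, outside lex.
This is only an illustration, not code from awk: decode_escapes is a
made-up helper, and the buffer handling is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of awk: copy src to dst, decoding
 * \n \t \b \r \f, \ddd (1 to 3 octal digits), and dropping the backslash
 * in front of any other character.  Returns the number of bytes written
 * (not counting the terminating NUL), or (size_t)-1 if dst is too small. */
size_t decode_escapes(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (dstsize == 0)
        return (size_t)-1;

    while (*src != '\0') {
        int c = (unsigned char)*src++;

        if (c == '\\' && *src != '\0') {
            c = (unsigned char)*src++;
            switch (c) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'b': c = '\b'; break;
            case 'r': c = '\r'; break;
            case 'f': c = '\f'; break;
            case '0': case '1': case '2': case '3':
            case '4': case '5': case '6': case '7': {
                int digits = 1;

                c -= '0';
                /* Take at most two more octal digits: \d, \dd or \ddd. */
                while (digits < 3 && *src >= '0' && *src <= '7') {
                    c = c * 8 + (*src++ - '0');
                    digits++;
                }
                break;
            }
            default:
                /* Unrecognized escape, including \\ and \":
                 * drop the backslash and keep the character. */
                break;
            }
        }
        if (n + 1 >= dstsize)
            return (size_t)-1;
        dst[n++] = (char)c;   /* values above 0377 are truncated */
    }
    dst[n] = '\0';
    return n;
}
```

With this sketch, the C string "a\\fb" decodes to three bytes with a
form feed in the middle, and "\\189" decodes to the byte 1 followed by
the characters 8 and 9, since 8 and 9 are not octal digits.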

Repeat-By:
	Try this awk program:
	  /\ignore[\d]/	{ print "\ignored \\"}
	  /\\/		{ print "formfeed\fbackspace\bcarriage return\r\""}
	  /\10/		{ print "not back space( \010), but \10"}
	  /[\010\t\7]/	{ print "back space, tab or bell\7" }
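
The last two lines depend on how the digits after the backslash are
converted.  The lex actions hand the matched text to sscanf with "%o";
a small sketch of that conversion follows.  octal_escape_value is a
made-up name, and the sketch assumes the caller passes only the digits
that the pattern matched.

```c
#include <stdio.h>

/* Hypothetical helper: convert the digits that follow a backslash the
 * way the lex actions do, with sscanf's %o.  Returns -1 if no octal
 * digit could be converted. */
int octal_escape_value(const char *digits)
{
    unsigned int value;

    if (sscanf(digits, "%o", &value) != 1)
        return -1;
    return (int)value;
}
```

Here "10" and "010" both give 8 (backspace in ASCII) and "7" gives 7
(bell).  Given "189", which the three-decimal-digit pattern accepts,
%o stops at the 8 and the result is 1, so the trailing digits are
silently lost.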

Fix:
	In the following changes to awk.lx.l, the *** sections are the
	original and the --- sections are the fixed version.  I edited the
	diff output to make it easier to read, so most line numbers are gone.

*** /tmp/,RCSt1002731	Sat Apr 21 20:11:09 1984
--- awk.lx.l	Sat Apr 21 20:09:44 1984
***************
*** 24,29
  A	[a-zA-Z_]
  B	[a-zA-Z0-9_]
  D	[0-9]
  WS	[ \t]
  
  %%

--- 25,31 -----
  A	[a-zA-Z_]
  B	[a-zA-Z0-9_]
  D	[0-9]
+ OD	[0-7]
  WS	[ \t]
  
  %%
***************
*** 122...
  <reg>")"	RETURN(')');
  <reg>"^"	RETURN('^');
  <reg>"$"	RETURN('$');
! <reg>\\{D}{D}{D}	{ sscanf(yytext+1, "%o", &yylval); RETURN(CHAR); }
  <reg>\\.	{	if (yytext[1]=='n') yylval = '\n';
  			else if (yytext[1] == 't') yylval = '\t';
  			else yylval = yytext[1];
  			RETURN(CHAR);
  		}

--- ... -----
  <reg>")"	RETURN(')');
  <reg>"^"	RETURN('^');
  <reg>"$"	RETURN('$');
! <reg>\\{OD}{OD}?{OD}?	{ sscanf(yytext+1, "%o", &yylval); RETURN(CHAR); }
  <reg>\\.	{	if (yytext[1]=='n') yylval = '\n';
  			else if (yytext[1] == 't') yylval = '\t';
+ 			else if (yytext[1] == 'b') yylval = '\b';
+ 			else if (yytext[1] == 'r') yylval = '\r';
+ 			else if (yytext[1] == 'f') yylval = '\f';
  			else yylval = yytext[1];
  			RETURN(CHAR);
  		}
***************
*** 137,...
  		yylval = (hack)setsymtab(cbuf, s, 0.0, CON|STR, symtab); RETURN(STRING); }
  <str>\n		{ yyerror("newline in string"); lineno++; BEGIN A; }
  <str>"\\\""	{ cbuf[clen++]='"'; }
  <str,chc>"\\"n	{ cbuf[clen++]='\n'; }
  <str,chc>"\\"t	{ cbuf[clen++]='\t'; }
  <str,chc>"\\\\"	{ cbuf[clen++]='\\'; }
  <str>.		{ CADD; }

--- ... -----
  		yylval = (hack)setsymtab(cbuf, s, 0.0, CON|STR, symtab); RETURN(STRING); }
  <str>\n		{ yyerror("newline in string"); lineno++; BEGIN A; }
  <str>"\\\""	{ cbuf[clen++]='"'; }
+ <str,chc>"\\"{OD}{OD}?{OD}? { sscanf(yytext+1, "%o", &yylval); 
+ 			cbuf[clen++] = (char)yylval; }
  <str,chc>"\\"n	{ cbuf[clen++]='\n'; }
  <str,chc>"\\"t	{ cbuf[clen++]='\t'; }
+ <str,chc>"\\"b	{ cbuf[clen++]='\b'; }
+ <str,chc>"\\"r	{ cbuf[clen++]='\r'; }
+ <str,chc>"\\"f	{ cbuf[clen++]='\f'; }
  <str,chc>"\\\\"	{ cbuf[clen++]='\\'; }
+ <str,chc>"\\".	{ cbuf[clen++]=yytext[1]; }
  <str>.		{ CADD; }
**********************************************************

I had some trouble using  {OD}{1,3}  instead of  {OD}{OD}?{OD}?  for the
<reg> match.  Any ideas why?
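
For what it is worth, the two spellings should describe the same set of
strings.  One way to check that outside lex is with the POSIX regcomp
and regexec routines, as sketched below.  This assumes a system with
<regex.h>; ere_matches and the two macro names are made up for the
example, and POSIX extended regular expressions are not lex's pattern
language, so this says nothing about what lex itself does with {1,3}.

```c
#include <regex.h>
#include <stddef.h>

/* Backslash followed by one to three octal digits, spelled two ways. */
#define OCTAL_BOUNDED  "^\\\\[0-7]{1,3}$"
#define OCTAL_OPTIONAL "^\\\\[0-7][0-7]?[0-7]?$"

/* Hypothetical helper: 1 if text matches the POSIX extended regular
 * expression in pattern, 0 if it does not, -1 if pattern does not
 * compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Both patterns accept \7, \10 and \010 and reject \8 and \0101, so any
difference seen under lex would have to come from lex's own handling of
the repetition syntax.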


Daniel LaLiberte, U of Illinois, Urbana-Champaign, Computer Science
(uiucdcs!liberte)
{if it's a feature - document it;  if it's a bug - document it or fix it}