[net.unix] regexp

megabyte@chinet.UUCP (Dr. Megabyte) (08/25/86)

I've pored over my manual and looked at regcmp(1), regcmp(3), and
regexp(3), and I'm still not sure how to use these functions.  Could someone
send me some clear info on how to use these functions along with some examples?

For the record: I am running Zeus 3.21, which is a SYS III port, for those of
you fortunate enough never to have heard of it.
-- 
_________________________________________________________________________
UUCP:	(1) seismo!why_not!scsnet!sunder		Mark E. Sunderlin
	(2) ihnp4!chinet!megabyte			aka Dr. Megabyte
CIS:	74026,3235					(202) 634-2529
Quote:	"When The Going Gets Tough, The Tough Go Shopping" (9-4 EDT)
Mail:	IRS  PM:PFR:D:NO  1111 Constitution Ave. NW  Washington,DC 20224  

latham@bsdpkh.UUCP (Ken Latham) (08/27/86)

Dr. Megabyte (megabyte@chinet.UUCP) writes:
>I've pored over my manual and looked at regcmp(1), regcmp(3), and
>regexp(3), and I'm still not sure how to use these functions.  Could someone
>send me some clear info on how to use these functions along with some examples?
>
>For the record: I am running Zeus 3.21, which is a SYS III port, for those of
>you fortunate enough never to have heard of it.


I am not familiar with Zeus and am only quasi-familiar with sys3.  The
following is a sys5 explanation which, if memory serves me, should cover it.

1. regcmp(3) - a function which translates regular expressions
	( a variant of ed(1) style ) to an internal form.  It takes one or
	more strings, which it concatenates, and the argument list must end
	with a null pointer.  The char pointer returned is the address of a
	( non-null-terminated ) string that represents the regular expression.
	This 'compiled' regular expression can be interpreted by regex(3).
		If the returned pointer is NULL, the regular expression was
	rejected.  regcmp gives no indication of where, so you will have to
	'walk' through the expression by hand to find the syntax error.

2. regcmp(1) - a user level command that will compile files of regular
	expressions into either data files containing the compiled expressions
	or into C files declaring data structures containing same.

3. regex(3) - the compiled regular expression interpreter, which scans the
	subject string for text matching the compiled regular expression.
	It returns NULL if nothing matches.  Otherwise it returns a pointer
	to the first character after the matched text, i.e. the character at
	which the match stopped.  For a complete match this is the '\0' that
	terminates the subject string; whether any other stopping character
	is acceptable depends on the program.
		A global variable 'loc1' ( according to the manual ) points
	to the position at which the match started in the subject string.
	This is usually the start of the subject string, but may vary with
	the application.

	The ACTUAL NAME of 'loc1' may differ from the one advertised!
	On sys5 it is '__loc1'.  You can run 'nm' on libPW.a to determine
	the name for your version.

EX.
	char *compex, *badchar, *regcmp(), *regex();
	char *subject = "A_long_identifier_name";
	extern char *__loc1;	/* name may differ on your system, see above */
	.
	.
	/* the argument list must be terminated by a null pointer */
	compex = regcmp( "[a-zA-Z][_a-zA-Z0-9]*", (char *)0 );
	if ( compex == NULL )
		.. some error routine to say that the RE is BAD !
	.
	.
	badchar = regex( compex, subject );
	/* NULL means no match; otherwise badchar points past the match */
	if ( badchar != NULL && *badchar == '\0' && __loc1 == subject )
	{
		...then HOORAH, it was a COMPLETE match!!!
	}
	else
	{
		... BOO HISSS, only a partial or no match was made.
		you may want to accept some partial matches, in which
		case you can look at what stopped the match before the
		string terminator ('\0').  if badchar is not NULL,
		look at *badchar.
	}
	.
	.

	NOTE:
		both "[a-zA-Z][_a-zA-Z0-9]*" and "A_long_identifier_name"
		could just as easily be variables that are pointers to
		strings !!! It is much more useful when used on variables :-).
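
For readers on a system that lacks regcmp(3) and regex(3), the same
"complete match" test can be sketched with the POSIX <regex.h> routines
regcomp() and regexec().  This is a stand-in, not the libPW interface
described above, and the helper name full_match is made up for this
illustration.  In the sketch, rm_so plays roughly the role of __loc1 and
rm_eo the role of the pointer that regex() returns.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library).  Returns 1 if `subject`
 * is matched in full by the POSIX extended RE `pattern`, 0 if there is
 * no match or only a partial one, and -1 if the pattern does not compile.
 */
int full_match(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m;
    int result;

    assert(pattern != NULL && subject != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* bad RE, comparable to regcmp() returning NULL */

    if (regexec(&re, subject, 1, &m, 0) != 0) {
        result = 0;             /* no match anywhere in the subject */
    } else {
        /* complete match: starts at offset 0 and stops at the terminator */
        result = (m.rm_so == 0 && subject[m.rm_eo] == '\0');
    }
    regfree(&re);
    return result;
}
```

Calling full_match("[a-zA-Z][_a-zA-Z0-9]*", "A_long_identifier_name")
should report a complete match, while a subject such as "foo bar" stops
at the blank and is reported as partial.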

	
Some side notes:

	If it is the regular expressions and not the actual calls that
	give you problems then you need to buy a text book on the subject
	and get familiar with them.

	If you are familiar with REs then note that the (...)$n  notation
	used in regex(3) is an extension to normal REs.

	The other arguments ret0, ret1 ..., ret9 in regex(3) are there simply
	to provide pointers to regions where the  (...)$n  extractions should
	be copied.  A subexpression written as (....)$1 extracts the
	substring of the subject string that matches the enclosed portion of
	the regular expression and copies it to the area addressed by ret1.
	Each retN pointer must hold the address of a preallocated area large
	enough to hold the longest possible substring.


	That should just about do it!  Hope that helps.  Sorry if you found
	this long-winded, but I wanted to be complete.


			Ken Latham, AT&T-IS (via AGS Inc.), Orlando , FL

			uucp: ihnp4!codas!bsdpkh!latham

root@ozdaltx.UUCP (root) (08/29/86)

In article <516@chinet.UUCP>, megabyte@chinet.UUCP (Dr. Megabyte) writes:
> I've pored over my manual and looked at regcmp(1), regcmp(3), and
> regexp(3), and I'm still not sure how to use these functions.  Could someone

I'll do the best I can.  Hope this helps. My manuals are a little
different in layout, (no section 1,2,3......)

The command regcmp compiles a regular expression (shell style)
into C source code, with the output going to file.i (or to file.c
if the - option is given).  Each entry in the input file has the form
NAME "expression".  The resulting file.[ic] may be included as part
of a C program (#include "file.[ic]").

regex(abc, line) applies the regular expression named abc to line.

EXAMPLE:
	Variable Name (space) Expression
	teleno	              "\({0,1}([2-9][01][1-9])$0\){0,1} *"
                              "([2-9][0-9]{2})$1[ -]{0,1}"
                              "([0-9]{4})$2"

Basically this says:
In field 0 (area code), accept an optional ( followed by digits in
the specified ranges, followed by an optional ).

In field 1 (exchange), accept a digit from 2 through 9 plus any other
two digits in the range 0-9, followed by an optional space or dash (-).

Finally, field 2 accepts four digits in the range 0-9.

The above would be typed into a file, then regcmp run on the
file. The resultant file should look like:

/* "({0,1}([2-9][01][1-9])$0){0,1} *([2-9][0-9]{2})$1[ -]{0,1}([0-9]{4})$2" */
char teleno[] {
060,027,00,01,074,00,030,04,020,062,071,030,
03,060,061,030,04,020,061,071,014,00,00,057,
00,00,01,025,040,074,01,030,04,020,062,071,
033,04,020,060,071,02,02,014,01,01,033,03,
040,055,00,01,074,02,033,04,020,060,071,04,
04,014,02,02,064,
0};

In the C program that uses the regcmp output the following line
will apply the expression named teleno to line:

	regex(teleno, line, area, exch, rest);
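
For comparison, here is a rough sketch of the same three-field extraction
using the POSIX <regex.h> routines instead of regcmp output.  It is a
stand-in for systems without regcmp, not the interface described above.
The helper names split_teleno and copy_group, the buffer sizes, and the
^ and $ anchors are all assumptions made for this illustration; the
parenthesized groups 1 to 3 correspond to $0, $1 and $2 in the pattern
above.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* POSIX ERE rendering of the teleno pattern.  The ^ and $ anchors are an
 * added assumption so that the whole line must be a phone number. */
static const char TELENO_ERE[] =
    "^\\(?([2-9][01][1-9])\\)? *([2-9][0-9]{2})[ -]?([0-9]{4})$";

/* Copy the text of capture group `g` from `line` into `out`. */
static void copy_group(const char *line, const regmatch_t *g,
                       char *out, size_t outsz)
{
    size_t len = (size_t)(g->rm_eo - g->rm_so);

    assert(outsz > 0);
    if (len >= outsz)
        len = outsz - 1;        /* truncate rather than overflow */
    memcpy(out, line + g->rm_so, len);
    out[len] = '\0';
}

/*
 * Hypothetical helper: split `line` into area code, exchange and the
 * last four digits.  Returns 1 on a match, 0 otherwise.
 */
int split_teleno(const char *line, char area[4], char exch[4], char rest[5])
{
    regex_t re;
    regmatch_t m[4];            /* m[0] = whole match, m[1..3] = groups */
    int ok;

    if (regcomp(&re, TELENO_ERE, REG_EXTENDED) != 0)
        return 0;
    ok = (regexec(&re, line, 4, m, 0) == 0);
    if (ok) {
        copy_group(line, &m[1], area, 4);
        copy_group(line, &m[2], exch, 4);
        copy_group(line, &m[3], rest, 5);
    }
    regfree(&re);
    return ok;
}
```

As with regex(), the caller supplies the buffers: area and exch need room
for three digits plus the terminator, and rest for four digits plus the
terminator.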

The program regcmp is a lot easier to use than the function.
Have fun!

Scotty
...ihnp4!killer!ozdaltx!root

"Oh, my friend, it's not what they take away from you that counts-
 It's what you do with what you have left." - Hubert Humphrey

guy@sun.uucp (Guy Harris) (09/01/86)

> > I've pored over my manual and looked at regcmp(1), regcmp(3), and
> > regexp(3), and I'm still not sure how to use these functions.  ...

Note, BTW, that this form of regular expression parser is NOT in the System
V Interface Definition, at least in Issue 2 (Issue 1 describes it, but that
was an error).  The package described in REGEXP(5) is the one in the SVID,
and is the one you should be using.  It is, for example, the package used by
"ed" and "grep"; the only System V software using REGCMP(3) is REGCMP(1).
Not all SVID-compatible systems will have REGCMP(1), REGCMP(3), or
REGEXP(3); they all will have REGEXP(5).

If you only have System III, it will be found in REGEXP(7) rather than
REGEXP(5).  Other systems may place it elsewhere (we don't supply the old
"regexp" package, so we put it in REGEXP(3), along with all the other library
packages).
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)