[comp.lang.perl] Matching parentheses

harald.alvestrand@elab-runit.sintef.no (11/28/90)

I have thought a long time about this, but I am sure there must be some
magic way to do it:
How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
That is, how do I get a balanced matching of parentheses?
(or quotes, for that matter...)


                   Harald Tveit Alvestrand
Harald.Alvestrand@elab-runit.sintef.no
C=no;PRMD=uninett;O=sintef;OU=elab-runit;S=alvestrand;G=harald
+47 7 59 70 94

merlyn@iwarp.intel.com (Randal Schwartz) (11/29/90)

In article <1990Nov28.150131.28981@ugle.unit.no>, harald.alvestrand@elab-runit writes:
| I have thought a long time about this, but I am sure there must be some
| magic way to do it:
| How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
| That is, how do I get a balanced matching of parentheses?
| (or quotes, for that matter...)

Regular expressions aren't strong enough.  You need to use some other
method (like recursive descent).

Just another Perl hacker (also in the midst of hacking the final Book draft),
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel: putting the 'backward' in 'backward compatible'..."====/

poage@sunny.ucdavis.edu (Tom Poage) (11/29/90)

In article <1990Nov28.150131.28981@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:
>I have thought a long time about this, but I am sure there must be some
>magic way to do it:
>How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
>That is, how do I get a balanced matching of parentheses?
>(or quotes, for that matter...)
>
>
>                   Harald Tveit Alvestrand
>Harald.Alvestrand@elab-runit.sintef.no
>C=no;PRMD=uninett;O=sintef;OU=elab-runit;S=alvestrand;G=harald
>+47 7 59 70 94

How about

	#!/bin/perl
	for ('()','((()))','((())())','((())') {
		if (tr/(/(/ !=  tr/)/)/) {
			warn "unmatched parentheses '$_'\n";
		}
	}

Tom.
-- 
Tom Poage, Clinical Engineering
Universiy of California, Davis, Medical Center, Sacramento, CA
poage@sunny.ucdavis.edu  {...,ucbvax,uunet}!ucdavis!sunny!poage

mwm@raven.relay.pa.dec.com (Mike (My Watch Has Windows) Meyer) (11/29/90)

In article <582@sunny.ucdavis.edu> poage@sunny.ucdavis.edu (Tom Poage) writes:
   In article <1990Nov28.150131.28981@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:
   >I have thought a long time about this, but I am sure there must be some
   >magic way to do it:
   >How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
   >That is, how do I get a balanced matching of parentheses?

Mr. Poage's solution doesn't work - it thinks that )( is a pair of
matching parens. I don't think raw count was what was intended.

   >(or quotes, for that matter...)

This also depends on what you mean by "matching quotes". If all you
want is for every quote character to have a match later, this does it:

	/^[^"]*("[^"]*"[^"]*)*$/

You can incorporate escaped quotes into that pattern (left as an
exercise).  If you want to correctly deal with two kinds of quotes, I
think this does it, but haven't done any testing on it:

	/^[^"']*((("[^"]*")|('[^']*'))[^"']*)*$/

	<mike
--

tchrist@convex.COM (Tom Christiansen) (11/29/90)

In article <1990Nov28.150131.28981@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:
>I have thought a long time about this, but I am sure there must be some
>magic way to do it:
>How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
>That is, how do I get a balanced matching of parentheses?
>(or quotes, for that matter...)

No, there's no "magic" way of doing it.  The problem is that regular
expressions are really nice, but they just can't handle recursive
structures -- at least, not by themselves.  To deal with these in Perl,
you could interatively process /\(([^\)]*)\)/ to deal with things from the
inside-out.  Simple approaches like that one won't work well with escaped
characters; you'll need to change them into something else (like \201)
before the search, or else you'll match where you don't want to.

--tom

cwitty@csli.Stanford.EDU (Carl Witty) (11/29/90)

In article <1990Nov28.150131.28981@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:

   I have thought a long time about this, but I am sure there must be some
   magic way to do it:
   How can I write a pattern that matches (), ((())), ((())()), but NOT ((())?
   That is, how do I get a balanced matching of parentheses?

There is no pattern that will do it, but here's a program that will.

#!/afs/ir/@sys/local/bin/perl
while (<STDIN>) {
  chop;
  while (s/\(\)//) {}
  if ($_ eq "") {
    print "Balanced\n";
  } else {
    print "Unbalanced\n";
  }
}

Carl Witty
cwitty@cs.stanford.edu

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (11/29/90)

In article <16600@csli.Stanford.EDU> cwitty@csli.Stanford.EDU (Carl Witty) writes:
> In article <1990Nov28.150131.28981@ugle.unit.no> harald.alvestrand@elab-runit.sintef.no writes:
    [ how to detect balanced parentheses? ]
  [ 10-line Perl program ]

#!/bin/sh
sed ':1
s-()--g
t1
s-..*-un-
s-$-balanced-'

---Dan

flee@guardian.cs.psu.edu (Felix Lee) (11/29/90)

>That is, how do I get a balanced matching of parentheses?

This is a job for first-class pattern objects.
	\balanced = *( '(' . \balanced . ')' | *[^()] );
	print $string =~ \balanced;

Too bad they don't exist, yet.
--
Felix Lee	flee@cs.psu.edu

des0mpw@colman.newcastle.ac.uk (M.P. Ward) (11/29/90)

The trick is to repetedly remove parantheses (and their contents) which don't
contain parentheses. If the result has any parantheses left then the match
fails. Another approach is to count +1 for (, -1 for ) and 0 for anything
else. The count must never go negative and must finish at zero.

Here's my perl code for checking {} balancing in LaTeX files:
(I have a similar program to check \begin...\end matching)

#! /usr/bin/perl
# check begin/end pairs in latex file

($myname = $0) =~ s|(.*/)*||;   # strip path component from name
$Usage = "Usage: $myname [file]\n";

if ($#ARGV == 0) {
  $file = "< $ARGV[0]";
  if (! -f $ARGV[0]) {
    die "File not found: $ARGV[0]\n";
  }
} elsif ($#ARGV == -1) {
  $file = "";
} else {
  die "$Usage";
}
open (GOODS,"qtoa $file | egrep -n '{|}' |") ||
                die "qtoa/egrep failed: $!";
undef $/;
$* = 1;
$_ = <GOODS>;   # slurp in begin/end pairs
close GOODS;

# remove \{ and \} (which don't have to match):
s/\\{//g;
s/\\}//g;
 
# repeatedly collapse {...} pairs
while (s/{[^{}]*}//g) {
  1;
}

# eliminate junk lines (with no { or }):
while (s/^[^{}]*\n//g) {
  1;
}
 
# print the result:
print;


			Martin.

JANET: Martin.Ward@uk.ac.durham    Internet (eg US): Martin.Ward@DURHAM.AC.UK
or if that fails:  Martin.Ward%uk.ac.durham@nfsnet-relay.ac.uk  
or even: Martin.Ward%DURHAM.AC.UK@CUNYVM.CUNY.EDU
BITNET: IN%"Martin.Ward@DURHAM.AC.UK" UUCP:...!mcvax!ukc!durham!Martin.Ward

tchrist@convex.COM (Tom Christiansen) (11/30/90)

In article <5209:Nov2909:58:4190@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>  [ 7-line sed program deleted ]

While it's always good to choose the right task, and we shouldn't
forget our simpler tools, I don't think the original poster has yet had
his question answered.  

What he wanted was not merely WHETHER an expression was balanced, but
rather how to pull out a balanced expression from a variable in a Perl
program for subsequent use in that program.

Enclosed below you will find some code that seems to do this.  There's a
&pull_quotes() function that you want to feed two things: the string with
quoted strings in it, and the two characters that you want to call quotes.  

My enclosed examples demo using ` and ', as well as ( and ).
Multi-character quotes are left as an exercise for the reader. :-)

The function returns a list of all the quoted strings, as you 
encounter them proceeding left to right.  For example,

    &pull_quotes('value = (double)(U_L(value) & U_L(str_gnum(st[2])));', '()');

returns 

    ( "double",
      "U_L(value) & U_L(str_gnum(st[2]))",
      "value",
      "str_gnum(st[2])",
      "st[2]" );

Several examples of varying complexity are included.  I'll spare the net
the 8k of expected output these produce.  You should be able to massage
this into whatever you want to do with these matched, quoted strings.

I eagerly await Dan's translation of this into sed.  :-)

--tom

#!/usr/local/bin/perl
# 
# pull_quotes -- tom christiansen, tchrist@convex.com
#
################################################################################
sub show_quotes {
    local($string, $qchars) = @_;
    local($i) = $[;
    local($_);
    local(@list) = &pull_quotes($string,$qchars);

    $qchars =~ s/(.)(.)/$1...$2/;

    print "Extracted ", 0+@list, " ", $qchars, " quote", 
	(@list != 1)?'s':'', " from: <<", $_[0]. ">>\n";

    for (@list) {
	print "Quote #$i is <<", $_, ">>\n";
	$i++;
    } 
    print "\n";
} 


################################################################################
sub pull_quotes { # pull_quotes($string, $quotchars) => @quotestrings
    local($_, $qchars) = @_;

    local($qL, $qR); 		# left and right quote chars, like `' or ()
    local($quote_level);	# current quote level
    local($max_quote);		# deepest we've gotten
    local($qstring);		# tmp space for quote
    local(@quotes);		# list of quotes to return
    local($d) = '\$';		# not sure why this must be here
    local($b) = '\\';		# nor this
    local(@done);		# which quotes we've finished so far

    die "need two just quote chars" if length($qchars) != 2;

    $qL = substr($qchars, 0, 1);
    $qR = substr($qchars, 1, 1);

    s/\\(.)/"\201".ord($1)."\202"/eg;   # protect from backslashes

    $max_quote = $quote_level = $[-1;

    while ( /[$qchars]/ ) {
	if ($& eq $qL) {
	    do { ++$quote_level; } while $done[$quote_level];
	    s/$b$qL/\${QL${quote_level}}/;
	    $max_quote = $quote_level if $max_quote < $quote_level;
	} elsif ($& eq $qR) {
	    s/$b$qR/\${QR${quote_level}}/;
	    $done[$quote_level]++;
	    do { --$quote_level; } while $done[$quote_level];
	} else { 
	    die "unexpected quot char: $&";
	}
    } 
    print "pre-re-interpolated string is $_\n" if $debug;

    for ($quote_level = $[; $quote_level <= $max_quote; $quote_level++) {
	($qstring) = /${d}\{QL$quote_level\}([^\000]*)${d}\{QR$quote_level}/;
	$qstring =~ s/\${QL\d+\}/$qL/g;
	$qstring =~ s/\${QR\d+\}/$qR/g;
	$qstring =~ s/\201(\d+)\202/pack('C',$1)/eg;
	$quotes[$quote_level] = $qstring;
    } 
    @quotes;
} 

################################################################################
################################################################################
################################################################################
################################################################################
################################################################################

# MAIN STARTS HERE


&show_quotes("This has `one' quote", "`'");

&show_quotes(q(Like this: `Jim' and `Bill said, ``Boo'' to me' to the end.), "`'");

&show_quotes("Mom said, `Jim asked ``How come?'', so I told her ``because''.'", "`'");

&show_quotes(q(Mom said, `Don\\'t put Jimmy\\'s `widjet' down the drain!'), "`'");

&show_quotes('a = (b + c) / 2', '()');

&show_quotes('a = (b + (c*d)) / (7+(2/3))', '()');

&show_quotes('value = (double)(U_L(value) & U_L(str_gnum(st[2])));', '()');

&show_quotes(<<'EOQ', "()");
        case 'I':
            str_cat(str,"-");
            str_cat(str,s);
            str_cat(str," ");
            if (*++s) {
                (void)apush((((STBP*)(incstab->str_ptr))->stbp_array ? ((STBP*)(incstab->str_ptr))->stbp_array : ((STBP*)(aadd(incstab)->str_ptr))->stbp_array),str_make(s,0));
            }
            else if (argv[1]) {
                (void)apush((((STBP*)(incstab->str_ptr))->stbp_array ? ((STBP*)(incstab->str_ptr))->stbp_array : ((STBP*)(aadd(incstab)->str_ptr))->stbp_array),str_make(argv[1],0));
                str_cat(str,argv[1]);
                argc--,argv++;
                str_cat(str," ");
            }
EOQ

&show_quotes(<<'EOW', '()');
(defun define-abbrevs (&optional arg)
  "Define abbrevs according to current visible buffer contents.
See documentation of edit-abbrevs for info on the format of the
text you must have in the buffer.
With argument, eliminate all abbrev definitions except
the ones defined from the buffer now."
  (interactive "P")
  (if arg (kill-all-abbrevs))
  (save-excursion
   (goto-char (point-min))
   (while (and (not (eobp)) (re-search-forward "^\(" nil t))
     (let* ((buf (current-buffer))
            (table (read buf))
            abbrevs)
       (forward-line 1)
       (while (progn (forward-line 1)
                     (not (eolp)))
         (setq name (read buf) count (read buf) exp (read buf))
         (skip-chars-backward " \t\n\f")
         (setq hook (if (not (eolp)) (read buf)))
         (skip-chars-backward " \t\n\f")
         (setq abbrevs (cons (list name exp hook count) abbrevs)))
       (define-abbrev-table table abbrevs)))))
EOW

exit 0;

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (11/30/90)

It occurs to me that the solutions posted so far will take forever on
long strings. Here's a linear-time solution:

#!/bin/sh
( echo '0'; fold -1 | sed 's,[^()],,
s,(,1+p,
s,),1-p,' ) | dc | ( grep -s -e - && echo 'unbalanced' || echo 'balanced' )

(Whaddya mean, this isn't comp.unix.shell? :-) )

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/01/90)

In article <12435:Nov3011:26:5990@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> #!/bin/sh
> ( echo '0'; fold -1 | sed 's,[^()],,
> s,(,1+p,
> s,),1-p,' ) | dc | ( grep -s -e - && echo 'unbalanced' || echo 'balanced' )

Whoops, thanks to William Setzer for pointing out that this doesn't
notice missing close parens.

#!/bin/sh
( echo '0'; echo '0'; fold -1 | sed 's,[^()],,
s,(,1+p,
s,),1-p,' ; echo '-p'
) | dc | ( grep -s -e - && echo 'unbalanced' || echo 'balanced' )

---Dan

ronald@robobar.co.uk (Ronald S H Khoo) (12/01/90)

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

> #!/bin/sh
> ( echo '0'; echo '0'; fold -1 | sed 's,[^()],,

What's fold ?  Have you got an implementation of it in perl that you're
willing to share ?
-- 
ronald@robobar.co.uk +44 81 991 1142 (O) +44 71 229 7741 (H)

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/02/90)

In article <1990Dec1.105725.28029@robobar.co.uk> ronald@robobar.co.uk (Ronald S H Khoo) writes:
> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> > #!/bin/sh
> > ( echo '0'; echo '0'; fold -1 | sed 's,[^()],,
> What's fold ?

A program that wraps its input to a specific line length. fold -1 just
puts a newline after each character.

This is somewhat off the subject, though; I was just illustrating one
way to code the parenthesis-counting technique.

---Dan