[comp.lang.perl] Quoting and Splitting

adler@betwixt..caltech.edu (B. Thomas Adler) (09/26/90)

Hi all,

I've just run across a problem that I thought should have a natural
solution (i.e., a small one-liner) in perl, but I can't seem to find
it in the manual.

I'm working on a program that parses the contents of a file (say, a
nameserver file), and some of the fields are double-quoted, to preserve
spacing.  My question is, is there a way to have split() split on
white-space, while respecting the restrictions imposed by any double quoting?

ie, I'd like the line
	Field_1  parm_1		"This is example one"

to split into three components, rather than 6.
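
For reference, a minimal sketch of the behaviour being described, written
in present-day Perl syntax (an assumption; the Perl of this thread's era
lacked `my` and `use strict`). The variable names are illustrative only.
A plain whitespace split has no notion of quoting, so the quoted field is
broken apart:

```perl
use strict;
use warnings;

# The sample line from the post: two bare fields, then a quoted one.
my $line = qq{Field_1  parm_1\t\t"This is example one"};

# split(' ', ...) splits on runs of whitespace and ignores the quotes,
# so the quoted field is cut into four pieces.
my @fields = split(' ', $line);

print scalar(@fields), "\n";    # prints 6
print "$_\n" for @fields;
```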

Any ideas?

-Bo

--
B. Thomas Adler               <adler@tybalt.caltech.edu>
<adler@citjulie.bitnet>       <...!ames!elroy!cit-vax!adler>

hakanson@ogicse.ogi.edu (Marion Hakanson) (09/26/90)

In article <adler.654289321@betwixt> adler@betwixt..caltech.edu (B. Thomas Adler) writes:
>. . .
>spacing.  My question is, is there a way to have split() split on
>white-space, while respecting the restrictions imposed by any double quoting?
>
>ie, I'd like the line
>	Field_1  parm_1		"This is example one"
>
>to split into three components, rather than 6.

This has been discussed several times before.  If you allow the quotes
to be escaped (with a backslash, which can also be escaped by a
backslash, etc.), then you aren't going to be able to do this with a
regular expression.  Even if you don't, the r.e. will be ugly.

Since you mentioned nameserver files, you may find the approach I took
to be of use to you.  Use anonymous FTP to retrieve from host
cse.ogi.edu the file pub/dnsparse-2.0.tar.Z.  Briefly, there is a
lexical analyzer (tokenizer) written in C, which is used by Perl code
to fully parse DNS master files.  The lex-er deals with quotes, etc.,
and the Perl code does the rest.

-- 
Marion Hakanson         Domain: hakanson@cse.ogi.edu
                        UUCP  : {hp-pcd,tektronix}!ogicse!hakanson

worley@compass.com (Dale Worley) (09/26/90)

   From: hakanson@ogicse.ogi.edu (Marion Hakanson)

   This has been discussed several times before.  If you allow the quotes
   to be escaped (with a backslash, which can also be escaped by a
   backslash, etc.), then you aren't going to be able to do this with a
   regular expression.  Even if you don't, the r.e. will be ugly.

Say what?  Write:

	escaped-char = \\.
	non-escaped-char = [^\\"]	(I think Perl requires \
					inside of [] to be quoted now.)
	inside-quote-char = escaped-char | non-escaped-char
	quoted-string = " inside-quote-char* "

All together: /"(\\.|[^\\"])*"/

The only thing you can't do with regexps is match recursively embedded
structures (arbitrarily nested parentheses, say).
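
One way the quoted-string regexp above might be put to work on the
original question is sketched here. It is only an illustration: the
function name split_quoted is made up, the rule that a bare field is a
run of non-whitespace, non-quote characters is an assumption, and the
syntax (`my`, `\G`, non-capturing groups) is that of much later Perl
releases than the one under discussion.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace, but keep each
# double-quoted field (backslash escapes allowed) as a single field.
sub split_quoted {
    my ($line) = @_;
    my @fields;
    # Starting where the previous match ended (\G), skip whitespace, then
    # take either a quoted string or a run of non-space, non-quote chars.
    while ($line =~ /\G\s*(?:"((?:\\.|[^\\"])*)"|([^\s"]+))/gc) {
        my ($quoted, $bare) = ($1, $2);
        if (defined $quoted) {
            $quoted =~ s/\\(.)/$1/g;    # turn \" into " and \\ into \
            push @fields, $quoted;
        } else {
            push @fields, $bare;
        }
    }
    return @fields;
}

my @parts = split_quoted(qq{Field_1  parm_1\t\t"This is example one"});
print scalar(@parts), "\n";       # prints 3
print join("|", @parts), "\n";    # prints Field_1|parm_1|This is example one
```

The loop stops at the first position where neither alternative matches,
for instance at an unterminated quote, so a stricter version would want
to check pos($line) against length($line) and report the leftover text.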

Dale Worley		Compass, Inc.			worley@compass.com
--
Generally speaking, the Way of the warrior is resolute acceptance of death.
	--  Miyamoto Musashi, 1645