adler@betwixt..caltech.edu (B. Thomas Adler) (09/26/90)
Hi all,

I've just run across a problem that I thought should have a natural
solution (ie, a small one liner) in perl, but I can't seem to find it in the
manual. I'm working on a program that parses the contents of a file (say, a
nameserver file), and some of the fields are double-quoted, to preserve
spacing. My question is, is there a way to have split() split on
white-space, while respecting the restrictions imposed by any double quoting?

ie, I'd like the line

    Field_1 parm_1 "This is example one"

to split into three components, rather than 6.

Any ideas?

-Bo
--
B. Thomas Adler
<adler@tybalt.caltech.edu> <adler@citjulie.bitnet> <...!ames!elroy!cit-vax!adler>
hakanson@ogicse.ogi.edu (Marion Hakanson) (09/26/90)
In article <adler.654289321@betwixt> adler@betwixt..caltech.edu (B. Thomas Adler) writes:
>. . .
>spacing. My question is, is there a way to have split() split on
>white-space, while respecting the restrictions imposed by any double quoting?
>
>ie, I'd like the line
>    Field_1 parm_1 "This is example one"
>
>to split into three components, rather than 6.

This has been discussed several times before. If you allow the
quotes to be escaped (with a backslash, which can also be escaped
by a backslash, etc.), then you aren't going to be able to do this
with a regular expression. Even if you don't, the r.e. will be ugly.

Since you mentioned nameserver files, you may find the approach I
took to be of use to you. Use anonymous FTP to retrieve from host
cse.ogi.edu the file pub/dnsparse-2.0.tar.Z.

Briefly, there is a lexical analyzer (tokenizer) written in C, which
is used by Perl code to fully parse DNS master files. The lex-er
deals with quotes, etc., and the Perl code does the rest.

--
Marion Hakanson    Domain: hakanson@cse.ogi.edu
                   UUCP  : {hp-pcd,tektronix}!ogicse!hakanson
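
For readers without the tarball, the general idea of a quote-aware tokenizer can be sketched in Perl itself. This is not the dnsparse lexer (which is written in C and is not reproduced in this thread); the subroutine name tokenize, its error handling, and the choice to drop the quote characters are all assumptions made for illustration, and the syntax is modern Perl 5 rather than the Perl of 1990.

```perl
use strict;
use warnings;

# Hypothetical character-by-character tokenizer; NOT the dnsparse lexer.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    my $cur      = '';
    my $have     = 0;    # true once the current token has started
    my $in_quote = 0;    # true while inside double quotes
    my @chars = split //, $line;
    while (@chars) {
        my $c = shift @chars;
        if ($c eq '\\') {
            die "dangling backslash\n" unless @chars;
            $cur .= shift @chars;      # next character is taken literally
            $have = 1;
        } elsif ($c eq '"') {
            $in_quote = !$in_quote;    # quotes delimit but are not kept
            $have = 1;                 # so "" yields an empty token
        } elsif ($c =~ /\s/ && !$in_quote) {
            # Unquoted whitespace ends the current token, if there is one.
            if ($have) { push @tokens, $cur; $cur = ''; $have = 0; }
        } else {
            $cur .= $c;
            $have = 1;
        }
    }
    die "unterminated quote\n" if $in_quote;
    push @tokens, $cur if $have;
    return @tokens;
}

# Prints: Field_1|parm_1|This is example one
print join('|', tokenize('Field_1 parm_1 "This is example one"')), "\n";
```

A hand-written scanner like this is longer than a pattern match, but it can report errors such as an unterminated quote instead of silently stopping.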
worley@compass.com (Dale Worley) (09/26/90)
   From: hakanson@ogicse.ogi.edu (Marion Hakanson)

   This has been discussed several times before. If you allow the
   quotes to be escaped (with a backslash, which can also be escaped
   by a backslash, etc.), then you aren't going to be able to do this
   with a regular expression. Even if you don't, the r.e. will be ugly.

Say what? Write:

    escaped-char = \\.
    non-escaped-char = [^\\"]
        (I think Perl requires \ inside of [] to be quoted now.)
    inside-quote-char = escaped-char | non-escaped-char
    quoted-string = " inside-quote-char* "

All together:

    /"(\\.|[^\\"])*"/

The only thing you can't do with regexps is recursively embedded
structures.

Dale Worley        Compass, Inc.        worley@compass.com
--
Generally speaking, the Way of the warrior is resolute acceptance of death.
-- Miyamoto Musashi, 1645
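
A minimal sketch of how the pattern /"(\\.|[^\\"])*"/ could be applied to the original question, assuming every field is either a double-quoted string (with backslash escapes) or a run of characters containing no whitespace and no quotes. The subroutine name split_quoted is hypothetical, and the code uses modern Perl 5 syntax (my, \G, non-capturing groups) that was not available when this thread was written.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace, keeping double-quoted
# strings (with backslash escapes) together as single fields.
sub split_quoted {
    my ($line) = @_;
    my @fields;
    # Each match is either a quoted string (the pattern from the post)
    # or a run of characters that are neither whitespace nor a quote.
    while ($line =~ /\G\s*("(?:\\.|[^\\"])*"|[^\s"]+)/gc) {
        my $field = $1;
        if ($field =~ /^"(.*)"$/s) {
            $field = $1;                # strip the surrounding quotes
            $field =~ s/\\(.)/$1/gs;    # drop the escaping backslashes
        }
        push @fields, $field;
    }
    return @fields;
}

my @f = split_quoted('Field_1 parm_1 "This is example one"');
print scalar(@f), "\n";    # prints 3
print "$_\n" for @f;
```

The sketch stops quietly at the first thing it cannot match, such as an unterminated quote, so a real parser would want to check that the whole line was consumed.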