harrison@necssd.NEC.COM (Mark Harrison) (04/18/91)
I would like to split a line into fields, where fields are separated by
whitespace.  If a field contains whitespace, it should be enclosed in
double quotes.  If a field needs a double quote, it should backslash
escape it.  EG,

    field1 field2 field3
    field1 "field 2" "field \"3\""

What I have now:

    ($key, $str, $cmd) = split("[ \t]+");

1. What RE should I use for handling double quotes?
2. What RE should I use for handling double quotes with embedded quotes?
3. Is there a better way to do this?

Thanks in advance...
-- 
Mark Harrison             harrison@ssd.dl.nec.com
(214)518-5050             {necntc, cs.utexas.edu}!necssd!harrison
standard disclaimers apply...
lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (04/25/91)
In article <781@necssd.NEC.COM> harrison@csl.dl.nec.com writes:
: I would like to split a line into fields, where fields are separated by
: whitespace.  If a field contains whitespace, it should be enclosed in
: double quotes.  If a field needs a double quote, it should backslash
: escape it.  EG,
:
:     field1 field2 field3
:     field1 "field 2" "field \"3\""
:
: What I have now:
:
:     ($key, $str, $cmd) = split("[ \t]+");
:
: 1. What RE should I use for handling double quotes?
: 2. What RE should I use for handling double quotes with embedded quotes?
: 3. Is there a better way to do this?

In general, splitting on the delimiter isn't sufficient when you have to
build a recognizer for the fields themselves.  There are several ways
to do it that occur to me, offhand:

1.  Preprocess the input line:

    s/\\"/\200/g;
    s/"[^"]*"/($tmp=$&) =~ tr# #\201#, $tmp/eg;
    ($key, $str, $cmd) = split("[ \t]+");
    $key =~ tr#\200\201#" #;
    $str =~ tr#\200\201#" #;
    $cmd =~ tr#\200\201#" #;

    [That's kinda bletchy, and not 8-bit clean.]

2.  Treat the fields as delimiters:

    @tmp = split(/("([^"\\]+|\\[\\"])*"|\S+)/);
    for (@tmp) { s/(\\(.)|")/$2/g; }
    ($key,$str,$cmd) = @tmp[1,4,7];

    [Not obvious unless you know about ()'s and split.]

3.  Parse the line one field at a time, using s///g to loop:

    @tmp = ();
    s/("([^"\\]+|\\[\\"])*"|\S*)\s+/($tmp[++$#tmp] = $1) =~ s#(\\(.)|")#$2#g/eg;
    ($key,$str,$cmd) = @tmp;

    [The nested substitution and push surrogate may give you heartburn.]

However, these assume you'll only use double quotes and that any double
quotes will start a field.  People used to shell quoting may find these
restrictions irksome.  Here's a routine that does something more like
the shell does:

    #!/usr/bin/perl

    # A little test harness...

    while (<>) {
        ($key,$str,$cmd) = &quotedsplit($_);
        print "key = $key\nstr = $str\ncmd = $cmd\n";
    }

    sub quotedsplit {
        local($_) = @_;
        local(@fields,$snippet,$field);

        while ($_ ne '') {
            $field = '';
            for (;;) {
                if (s/^"(([^"\\]+|\\[\\"])*)"//) {
                    ($snippet = $1) =~ s#\\(.)#$1#g;
                }
                elsif (s/^'(([^'\\]+|\\[\\'])*)'//) {
                    ($snippet = $1) =~ s#\\(.)#$1#g;
                }
                elsif (s/^\\(.)//) {
                    $snippet = $1;
                }
                elsif (s/^([^\s\\'"]+)//) {
                    $snippet = $1;
                }
                else {
                    s/^\s+//;
                    last;
                }
                $field .= $snippet;
            }
            push(@fields, $field);
        }
        @fields;
    }

This lets you say things like this:

    now "isn't the" ti'me for'" all"\ good' "'"men\""

There are three fields there:

    1. now
    2. isn't the
    3. time for all good "men"

I suspect I should add this routine to the library.  Maybe it should throw
away leading whitespace--this one doesn't.  Opinions?

No doubt someone will enhance it to do backticks, and variable substitution,
and I/O redirection, and aliases, and...  :-)

Larry
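
[Editor's sketch, not part of the original post.]  The routine above leaves
two questions open: it keeps an empty first field when the line starts with
whitespace, and it has no branch for input it cannot consume, such as an
unterminated quote.  The variant below is one possible way to handle both.
It assumes Perl 5 (`my`, `use strict`), the name `quotedsplit_strict` is
hypothetical, and the inner `+` in the quoted-string patterns is dropped
purely to keep matching simple; the quoting rules are otherwise meant to
match Larry's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of quotedsplit (not Larry's code): same quoting
# rules, but leading whitespace is discarded and unconsumable input is
# reported with die() instead of being retried forever.
sub quotedsplit_strict {
    my ($line) = @_;
    my @fields;

    $line =~ s/^\s+//;                      # throw away leading whitespace
    while ($line ne '') {
        my $field = '';
        for (;;) {
            my $snippet;
            if ($line =~ s/^"((?:[^"\\]|\\[\\"])*)"//) {
                ($snippet = $1) =~ s/\\(.)/$1/g;    # unescape \\ and \"
            }
            elsif ($line =~ s/^'((?:[^'\\]|\\[\\'])*)'//) {
                ($snippet = $1) =~ s/\\(.)/$1/g;    # unescape \\ and \'
            }
            elsif ($line =~ s/^\\(.)//) {
                $snippet = $1;                      # backslash-escaped char
            }
            elsif ($line =~ s/^([^\s\\'"]+)//) {
                $snippet = $1;                      # run of ordinary chars
            }
            elsif ($line =~ s/^\s+// || $line eq '') {
                last;                               # end of this field
            }
            else {
                die "unbalanced quote or stray backslash near: $line\n";
            }
            $field .= $snippet;
        }
        push @fields, $field;
    }
    return @fields;
}

# Larry's example line, with leading whitespace added.
my @f = quotedsplit_strict(
    q{  now "isn't the" ti'me for'" all"\ good' "'"men\""} . "\n");
print join('|', @f), "\n";    # now|isn't the|time for all good "men"
```

Dying on bad input is a design choice, not the only reasonable one; a caller
that prefers to skip bad lines can wrap the call in `eval { ... }`.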