[comp.lang.perl] newbie question on split

harrison@necssd.NEC.COM (Mark Harrison) (04/18/91)

I would like to split a line into fields, where fields are separated by
whitespace.  If a field contains whitespace, it should be enclosed in
double quotes.  If a field needs a double quote, the quote should be
backslash-escaped.  E.g.,

	field1	field2		field3
	field1	"field 2"	"field \"3\""

What I have now:

	($key, $str, $cmd) = split("[ \t]+");

1. What RE should I use for handling double quotes?
2. What RE should I use for handling double quotes with embedded quotes?
3. Is there a better way to do this?

Thanks in advance...
-- 
Mark Harrison             harrison@ssd.dl.nec.com
(214)518-5050             {necntc, cs.utexas.edu}!necssd!harrison
standard disclaimers apply...

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (04/25/91)

In article <781@necssd.NEC.COM> harrison@csl.dl.nec.com writes:
: I would like to split a line into fields, where fields are separated by
: whitespace.  If a field contains whitespace, it should be enclosed in
: double quotes.  If a field needs a double quote, the quote should be
: backslash-escaped.  E.g.,
: 
: 	field1	field2		field3
: 	field1	"field 2"	"field \"3\""
: 
: What I have now:
: 
: 	($key, $str, $cmd) = split("[ \t]+");
: 
: 1. What RE should I use for handling double quotes?
: 2. What RE should I use for handling double quotes with embedded quotes?
: 3. Is there a better way to do this?

In general, splitting on the delimiter isn't sufficient when you have
to build a recognizer for the fields themselves.  There are several ways
to do it that occur to me, offhand:

1.  Preprocess the input line:

    s/\\"/\200/g;
    s/"([^"]*)"/($tmp=$1) =~ tr# #\201#, $tmp/eg;
    ($key, $str, $cmd) = split("[ \t]+");
    $key =~ tr#\200\201#" #;
    $str =~ tr#\200\201#" #;
    $cmd =~ tr#\200\201#" #;

    [That's kinda bletchy, and not 8-bit clean.]

2.  Treat the fields as delimiters:

    @tmp = split(/("([^"\\]+|\\[\\"])*"|\S+)/);
    for (@tmp) { s/(\\(.)|")/$2/g; }
    ($key,$str,$cmd) = @tmp[1,4,7];

    [Not obvious unless you know about ()'s and split.]
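
    What makes this work is that split returns not only the text between
    matches but also whatever each pair of ()'s in the pattern captured.
    A minimal sketch of that behavior follows; the names @parts, $line and
    $i are made up for illustration, and the sample line is the one from
    the question.

```perl
# split returns the text between matches and, for each match, one element
# per parenthesized group in the pattern.
@parts = split(/(,)/, "a,b,c");
print join("|", @parts), "\n";          # prints a|,|b|,|c

# The field pattern has two groups, so every field contributes two elements
# (the whole field, then the last piece matched by the inner group),
# preceded by the text that came before the field.
$line = 'field1' . "\t" . '"field 2"' . "\t" . '"field \"3\""';
@tmp = split(/("([^"\\]+|\\[\\"])*"|\S+)/, $line);
for $i (1, 4, 7) {
    print "$i: $tmp[$i]\n";             # whole fields, quotes still attached
}
```

    Elements 0, 3 and 6 hold the whitespace (or nothing) that preceded each
    field, and 2, 5 and 8 hold the inner group, which is why only every
    third element is kept.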

3.  Parse the line one field at a time, using s///g to loop:

    @tmp = ();
    s/("([^"\\]+|\\[\\"])*"|\S*)\s+/($tmp[++$#tmp] = $1) =~ s#(\\(.)|")#$2#g/eg;
    ($key,$str,$cmd) = @tmp;

    [The nested substitution and push surrogate may give you heartburn.]
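
    If the nesting is hard to read, the same idea can be spelled out as an
    ordinary loop.  This is only a sketch under a few assumptions: it is
    written for a current perl (it uses defined() on $1), the name $field is
    invented, and unlike the one-liner it skips leading whitespace and does
    not require whitespace after the last field.

```perl
# Sample line from the question, with the newline a read from <> would leave.
$_ = 'field1' . "\t" . '"field 2"' . "\t" . '"field \"3\""' . "\n";

@tmp = ();
# Chop one field off the front per iteration: optional leading whitespace,
# then either a double-quoted string or a run of non-whitespace.
while (s/^\s*("([^"\\]+|\\[\\"])*"|\S+)//) {
    $field = $1;
    # Drop bare double quotes; replace backslash-X with X.
    $field =~ s/\\(.)|"/defined($1) ? $1 : ''/ge;
    push(@tmp, $field);
}
($key, $str, $cmd) = @tmp;
print "key = $key\nstr = $str\ncmd = $cmd\n";
```

    The loop stops as soon as nothing but whitespace is left, so it always
    terminates even when a quote is left unbalanced (the stray quote is
    then simply dropped from the field).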

However, these assume you'll only use double quotes and that any double
quotes will start a field.  People used to shell quoting may find these
restrictions irksome.  Here's a routine that does something more like
the shell does:

#!/usr/bin/perl

# A little test harness...

while (<>) {
    ($key,$str,$cmd) = &quotedsplit($_);
    print "key = $key\nstr = $str\ncmd = $cmd\n";
}

sub quotedsplit {
    local($_) = @_;
    local(@fields,$snippet,$field);

    while ($_ ne '') {
	$field = '';
	for (;;) {
	    if (s/^"(([^"\\]+|\\[\\"])*)"//) {
		($snippet = $1) =~ s#\\(.)#$1#g;
	    }
	    elsif (s/^'(([^'\\]+|\\[\\'])*)'//) {
		($snippet = $1) =~ s#\\(.)#$1#g;
	    }
	    elsif (s/^\\(.)//) {
		$snippet = $1;
	    }
	    elsif (s/^([^\s\\'"]+)//) {
		$snippet = $1;
	    }
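	    elsif (s/^(["'\\])//) {
		# Unbalanced quote or trailing backslash: keep it literally,
		# so that the outer loop always makes progress.
		$snippet = $1;
	    }
	    else {
		s/^\s+//;
		last;
	    }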
	    $field .= $snippet;
	}
	push(@fields, $field);
    }
    @fields;
}

This lets you say things like this:

    now "isn't the" ti'me for'" all"\ good' "'"men\""

There are three fields there:
	1. now
	2. isn't the
	3. time for all good "men"

I suspect I should add this routine to the library.  Maybe it should
throw away leading whitespace--this one doesn't.  Opinions?

No doubt someone will enhance it to do backticks, and variable substitution,
and I/O redirection, and aliases, and...  :-)

Larry