[comp.unix.questions] regexp question, joining EOL

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (01/20/90)

In article <1307@island.uu.net> daniel@island.uu.net ((Dan Smith "Remember MLK")) writes:
: 
: 	Someone here needs a generalized way of turning:
: 
: foo
: bar
: 1.23
: 
: to the following:
: 
: foo_bar
: 1.23
: 
: 	also, it would need to work for:
: 
: foo
: bar
: baz
: 1.23
: goo
: tar
: 293
: 
: 	which should give:
: 
: foo_bar_baz
: 1.23
: goo_tar
: 293
: 
: 	So the rule seems to be "if a line ends with a character, and the next
: line begins with one, replace the newline with a '_'".
: 
: 	I've tried (in vi) "g/[a-z]\n[a-z]/s//_/"...but that doesn't
: cut it.  Any ideas?  (I take it that it may be a two-pass sort of solution).

In the first pass, install perl.		:-)

In the second pass, feed your file to a perl script that says

#!/usr/bin/perl
undef $/;				# no line separator, so <> reads the whole file
$_ = <>;				# whomp in entire file
s/([a-z])\n([a-z])/${1}_$2/g;		# do it
s/([a-z])\n([a-z])/${1}_$2/g;		# in case of single char identifiers
print;					# whomp out entire file

Alternately, it's pretty easy to do with sed too.  Something like

	# $!N, not N: some seds quit without printing when N runs out of input
	$!N
	:again
	/[a-z]\n[a-z]/{
	    s/\([a-z]\)\n\([a-z]\)/\1_\2/g
	    $!N
	    b again
	}
	P
	D
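
To try the sed version on the second example from the question, something
like the following sketch should do.  The function name join_sed is made
up for illustration.  It uses $!N in place of a bare N, on the assumption
that your sed may quit without printing the pattern space when N runs out
of input (some seds print it, some don't).

```shell
# Hypothetical wrapper: join_sed [file ...], or feed it stdin.
join_sed() {
    sed '
$!N
:again
/[a-z]\n[a-z]/{
    s/\([a-z]\)\n\([a-z]\)/\1_\2/g
    $!N
    b again
}
P
D' "$@"
}

# The second example from the question.
printf 'foo\nbar\nbaz\n1.23\ngoo\ntar\n293\n' | join_sed
```

The pattern space never holds more than one newline here, so the loop
either joins the two lines it has and pulls in another, or falls through
to P and D, which print the finished line and restart with the leftover.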

In awk, we get something like

{if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/) ORS="_"
else ORS="\n"
if (prev != "") print prev
prev = $0}
END{ORS="\n"
print prev}

(I'm sure that that could be indented more readably, but I'm scared of
the awk parser.)
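
For the brave, here is a sketch of the same awk program with one
statement per line, wrapped in a shell function so it can be tried on the
example input.  The name join_awk is made up for illustration, and the
layout assumes an awk that accepts a newline between a statement and its
else.

```shell
# Hypothetical wrapper: join_awk [file ...], or feed it stdin.
join_awk() {
    awk '
    {
        # Join with "_" when the held line ends in a-z and this one starts with a-z.
        if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/)
            ORS = "_"
        else
            ORS = "\n"
        # Print the held line, followed by whichever separator was chosen.
        if (prev != "")
            print prev
        prev = $0
    }
    END {
        ORS = "\n"
        print prev
    }' "$@"
}

# The second example from the question.
printf 'foo\nbar\nbaz\n1.23\ngoo\ntar\n293\n' | join_awk
```

One thing to watch: the prev != "" test is there to skip the very first
record, but as written it also drops any empty line in the input.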

Running that through the awk-to-perl translator, we get the following fluff:

#!/usr/bin/perl
eval "exec /usr/local/bin/perl -S $0 $*"
    if $running_under_some_shell;
			# this emulates #! processing on NIH machines.
			# (remove #! line above if indigestible)

eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_]+=)(.*)/ && shift;
			# process any FOO=bar switches

$, = ' ';		# set output field separator
$\ = "\n";		# set output record separator

while (<>) {
    chop;	# strip record separator
    if ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) {
	$\ = '_';
    }
    else {
	$\ = "\n";
    }
    if ($prev ne '') {
	print $prev;
    }
    $prev = $_;
}

$\ = "\n";
print $prev;

or, more idiomatically

#!/usr/bin/perl
chomp($prev = <>);
while (<>) {
    chomp;	# strip record separator (leaves a last line without one alone)
    $prev .= ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) ? '_' : "\n";
    print $prev;
    $prev = $_;
}
print $prev,"\n";

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov