[comp.unix.wizards] easy for some

lm@slovax.Eng.Sun.COM (Larry McVoy) (05/09/91)

matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
> problem: to extract text between start and end patterns in a file
> eg:-
> 
> file:
> 
> pattern1---
> 
> stuff
> stuff
> stuff
> 
> pattern2---

/bin/sh, usage shellscript start_pat stop_pat [files...]

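	START=$1; shift
	STOP=$1; shift
	PRINT=
	# "$@" keeps file names with spaces intact; IFS= and -r make read
	# return each line verbatim (leading blanks and backslashes kept).
	cat "$@" | while IFS= read -r x
	do	if [ "$x" = "$STOP" ]
		then	exit 0		# stop line seen: done
		fi
		if [ "$x" = "$START" ]
		then	PRINT=yes	# start line seen: print from the next line on
			continue
		fi
		if [ -n "$PRINT" ]
		then	printf '%s\n' "$x"
		fi
	done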

/bin/perl, same usage (see the notes on the ".." operator, cool thingy).

	$START = shift;
	$STOP = shift;
	while (<>) {
		if (/^$START$/../^$STOP/) {
			next if /^$START$/;	# skip starting pattern
			last if /^$STOP/;	# done if last;
			print;
		}
	}
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

toma@swsrv1.cirr.com (Tom Armistead) (05/09/91)

In article <6686@male.EBay.Sun.COM> matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
>
>I am fairly new to unix, and I have a minor question:-
>problem: to extract text between start and end patterns in a file
>eg:-
>
>file:
>
>pattern1---
>
>stuff
>stuff
>stuff
>
>pattern2---
>
>How do I write a short script (preferably /bin/sh) to extract the information
>between the start and end patterns (pattern1/pattern2) into a file.
>
>I have tried to grok the man page for `sed' but no luck.
>
>Any help would be appreciated.
>
>Tnx
>Matt

You could do this with sed.

$ sed -n '/^pattern1---$/,/^pattern2---$/p' < data_file

One problem with this is that it also prints the start and end pattern lines.
You may be able to tell sed not to do this, but I don't know how.  So I use egrep.

$ sed -n '/^pattern1---$/,/^pattern2---$/p' < data_file | \
        egrep -v '^pattern1---$|^pattern2---$'
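
A sed-only variant may also be possible.  The following is only a sketch,
assuming a POSIX-style sed; the function name `between` is made up for
illustration.  Inside the range it deletes the two marker lines and prints
whatever is left.

```shell
# Hypothetical helper: print the lines strictly between the marker lines.
# The marker regexps are the ones used in the example above.
between() {
    sed -n '/^pattern1---$/,/^pattern2---$/{
        /^pattern1---$/d
        /^pattern2---$/d
        p
    }' "$@"
}

# Made-up sample input; prints the two "stuff" lines only.
printf '%s\n' junk pattern1--- stuff stuff pattern2--- junk | between
```

With no file arguments the function reads standard input, so it can sit at
the end of a pipeline in place of the sed | egrep pair.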

Tom
-- 
Tom Armistead - Software Services - 2918 Dukeswood Dr. - Garland, Tx  75040
===========================================================================
toma@swsrv1.cirr.com                {egsner,letni,ozdaltx,void}!swsrv1!toma

tchrist@convex.COM (Tom Christiansen) (05/09/91)

From the keyboard of lm@slovax.Eng.Sun.COM (Larry McVoy):
:matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
:> problem: to extract text between start and end patterns in a file
:> eg:-
:> 
:> file:
:> 
:> pattern1---
:> 
:> stuff
:> stuff
:> stuff
:> 
:> pattern2---
:
:/bin/sh, usage shellscript start_pat stop_pat [files...]

ug.

A shell solution is obscene. :-) I don't know how to do it in sed.  An awk
solution would have made certain others happy, but wouldn't have been so
nifty.  
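
For the awk camp, a minimal sketch might look like the following.  It
assumes an awk that supports -v (not every old awk does) and compares whole
lines literally rather than as regexps; the function name `between_awk` and
the variable names are made up for illustration.  Note that -v interprets
backslash escapes in the values, so markers containing backslashes would
need extra care.

```shell
# usage: between_awk start_line stop_line [files...]
between_awk() {
    start=$1; stop=$2; shift 2
    awk -v start="$start" -v stop="$stop" '
        printing && $0 == stop { exit }          # end marker: quit, do not print it
        printing               { print }         # inside the region
        $0 == start            { printing = 1 }  # start marker itself is skipped
    ' "$@"
}

# Made-up sample input; prints the two "stuff" lines only.
printf '%s\n' junk pattern1--- stuff stuff pattern2--- junk |
    between_awk pattern1--- pattern2---
```

Because awk exits at the end marker, only the first region is printed and
the rest of the input is not processed.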

> /bin/perl, same usage (see the notes on the ".." operator, cool thingy).

But since we do happen to be on the perl topic...

> 	$START = shift;
> 	$STOP = shift;
> 	while (<>) {
> 		if (/^$START$/../^$STOP/) {
> 			next if /^$START$/;	# skip starting pattern
> 			last if /^$STOP/;	# done if last;
> 			print;
> 		}
> 	}

The following code should be faster because it compiles fewer regexps.
The /o tells perl to compile the pattern only once.
It also uses the fact that .. returns the sequence number, and that
the last in the sequence has an E0 appended to it, for example making
144 be seen as 144E0, which is the same numerically, but you can do
string or pattern operations on it.

    $START = shift;
    $STOP = shift;

    while (<>) {
	if ( $which = /^$START$/o .. /^$STOP$/o ) {
	    next if $which == 1;
	    last if $which =~ /E/;
	    print;
	} 
    } 

or maybe instead of the next/last pair of lines, just

    next if $which =~ /^1$|E/;

if they want all instances in the stream extracted.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."

lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) (05/09/91)

In article <574@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
>> problem: to extract text between start and end patterns in a file
... more problem description
>/bin/sh, usage shellscript start_pat stop_pat [files...]
>
... complex shell and perl programs to do

	sed -n '/pattern1/,/pattern2/p' source_file > new_file

bharat@computing-maths.cardiff.ac.uk (Bharat Mediratta) (05/09/91)

In article <1991May8.233803.4485@swsrv1.cirr.com> toma@swsrv1.cirr.com (Tom Armistead) writes:
>In article <6686@male.EBay.Sun.COM> matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
>>
>>I am fairly new to unix, and I have a minor question:-
>>problem: to extract text between start and end patterns in a file
>>eg:-
>>
>>file:
>>
>>pattern1---
>>
>>stuff
>>stuff
>>stuff
>>
>>pattern2---
>
>You could do this with sed.
>
>$ sed -n '/^pattern1---$/,/^pattern2---$/p' < data_file
>
>One problem with this is that it also prints the start and end pattern lines.
>You may be able to tell sed not to do this, but I don't know how.  So I use egrep.
>
>$ sed -n '/^pattern1---$/,/^pattern2---$/p' < data_file | \
>        egrep -v '^pattern1---$|^pattern2---$'

Well, if the patterns only occur once in the file, here's a simple sed
solution:

	sed -e '1,/^pattern1---$/d' -e '/^pattern2---$/,$d' < data_file

As you can see, it deletes all the stuff up to (and including) the first
pattern, and then all the stuff from the second pattern (inclusive) to
the end of the file.  If you have multiple recurrences of this in the
file, you only get the first one.
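
To try this on a small input, here is a sketch with made-up sample data
(the `sample` function and its contents are purely illustrative).  One
caveat: with a 1,/re/ range, sed only starts looking for the end pattern on
line 2, so the command misbehaves if the start pattern sits on the very
first line of the file.

```shell
# Made-up input with two marked regions.
sample() {
    printf '%s\n' header pattern1--- first pattern2--- middle \
                  pattern1--- second pattern2--- trailer
}

# Delete everything through the first start marker, then everything from
# the next end marker to the end of input; only "first" survives.
sample | sed -e '1,/^pattern1---$/d' -e '/^pattern2---$/,$d'
```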


--
|  Bharat Mediratta  | JANET: bharat@cm.cf.ac.uk                               |
+--------------------+ UUNET: bharat%cm.cf.ac.uk%cunyvm.cuny.edu@uunet.uucp    |
|On a clear disk...  | uk.co: bharat%cm.cf.ac.uk%cunyvm.cuny.edu%uunet.uucp@ukc|
|you can seek forever| UUCP: ...!uunet!cunym.cuny.edu!cm.cf.ac.uk!bharat       |

tchrist@convex.COM (Tom Christiansen) (05/10/91)

From the keyboard of lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR):
:In article <574@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
:>matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
:>> problem: to extract text between start and end patterns in a file
:... more problem description
:>/bin/sh, usage shellscript start_pat stop_pat [files...]
:>
:... complex shell and perl programs to do
:
:	sed -n '/pattern1/,/pattern2/p' source_file > new_file

nope -- you included the endpoints.

i didn't see the original posting, so i don't know whether it's
possible to have multiple sets of /pat1/,/pat2/ areas
in the file.  if so, the 1,/pat1/d /pat2/,$d posting i just
saw won't work.

followups have been redirected.  this isn't particularly wizardly.

--tom
--
Tom Christiansen		tchrist@convex.com	convex!tchrist
		"So much mail, so little time."

lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) (05/10/91)

In article <1991May9.153351.1754@colorado.edu> lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) writes:
: In article <574@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
: >matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
: >> problem: to extract text between start and end patterns in a file
: ... more problem description
: >/bin/sh, usage shellscript start_pat stop_pat [files...]
: >
: ... complex shell and perl programs to do
: 
: 	sed -n '/pattern1/,/pattern2/p' source_file > new_file

No, that's not what those programs were trying to do.  (Admittedly, the
original spec was unclear.)  The other programs were attempting to omit
the endpoints, taking "between" to mean exclusion of said endpoints.
Some of them were also trying to snag only the text between the first
pair of patterns.  Some were allowing for the patterns to be passed in
as arguments.

Here's the perl equivalent of what you said:

    perl -ne 'print if /pattern1/../pattern2/' source_file >new_file

When using Perl to do the other thing, I personally prefer a straightforward
approach:

    #!/usr/bin/perl
    while (<>) {
	last if /pattern1/;
    }
    while (<>) {
	exit if /pattern2/;
	print;
    }

For hardwired patterns this will generally beat sed.  (Especially if
sed is stupid enough to read the rest of the input file.)
Parameterized patterns can get the same performance using eval:

    #!/usr/bin/perl
    $pattern1 = shift;
    $pattern2 = shift;
    eval <<"END";
	while (<>) {
	    last if /$pattern1/;
	}
	while (<>) {
	    exit if /$pattern2/;
	    print;
	}
    END

Larry Wall
lwall@netlabs.com

marc@mercutio.ultra.com (Marc Kwiatkowski {Host Software-AIX}) (05/16/91)

In article <1991May9.185503.325@jpl-devvax.jpl.nasa.gov> lwall@jpl-devvax.jpl.nasa.gov (Larry Wall) writes:
>   In article <1991May9.153351.1754@colorado.edu> lewis@tramp.Colorado.EDU (LEWIS WILLIAM M JR) writes:
>   : In article <574@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>   : >matthew@gizmo.UK.Sun.COM (Matthew Buller - Sun EHQ - MIS) writes:
>   : >> problem: to extract text between start and end patterns in a file
>   : ... more problem description
>   : >/bin/sh, usage shellscript start_pat stop_pat [files...]
>   : >
>   : ... complex shell and perl programs to do
>   : 
>   : 	sed -n '/pattern1/,/pattern2/p' source_file > new_file

>   Here's the perl equivalent of what you said:
>
>       perl -ne 'print if /pattern1/../pattern2/' source_file >new_file
>
>   When using Perl to do the other thing, I personally prefer a straightforward
>   approach:

Ahh.  In grand c.u.w tradition the lesser sin of a non-wizardly question
is met with the greater sin of slightly-correct to downright wrong
answers.  I know this isn't the right newsgroup, but the poster's
question hasn't been answered.  The sed suggestions are all wet.
The perl one will work and in terms of execution and readability
is probably the best, but the original poster stated that he
preferred an answer for /bin/sh.  I am surprised no one suggested
something like the following answer:

	cat foo | sed -n '
	:lbl00
	    /pattern00/ {
	:lbl01
		n
		/pattern01/ {
			b lbl00
		}
		p
		b lbl01
	}'

The above will filter multiple instances of /pattern00/..../pattern01/.
If only one is desired, replace 'b lbl00' with 'q'.
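
For comparison, the two variants can be wrapped in small shell functions.
This is only a sketch: the function names `all_regions` and `first_region`
and the shorter label names are hypothetical, and it assumes a sed that
accepts indented labels inside braces, as the script above already does.

```shell
# Print the lines between every pattern00/pattern01 pair.
all_regions() {
    sed -n '
    :top
    /pattern00/ {
    :more
        n
        /pattern01/ b top
        p
        b more
    }' "$@"
}

# Same loop, but quit at the first end marker.
first_region() {
    sed -n '
    /pattern00/ {
    :more
        n
        /pattern01/ q
        p
        b more
    }' "$@"
}

# Made-up sample input with two regions.
printf '%s\n' a pattern00 x pattern01 b pattern00 y pattern01 c | all_regions
printf '%s\n' a pattern00 x pattern01 b pattern00 y pattern01 c | first_region
```

The first call prints x and y; the second prints only x and stops reading
at the first end marker.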

Note follow-up.  sed, a utility more sinned against than sinning.
--
	------------------------------------------------------------------
	Marc P. Kwiatkowski			Ultra Network Technologies
	Internet: marc@ultra.com		101 Daggett Drive
	uucp: ...!ames!ultra!marc		San Jose, CA 95134 USA
	telephone: 408 922 0100 x249
