young@vlsi.ll.mit.edu (George Young) (10/13/88)
Is there a 'sed' wizard out there? I often want to take a big ascii file
(like a .c file after cc -E) and collapse each group of 'blank' lines
into exactly one blank line. 'Blank' here is any combination of blanks,
tabs and maybe ^L's. It looks from the documentation that sed should do this
quite neatly, using the multiple line pattern space commands with imbedded
newlines, but I sure can't figure out how. I'd prefer the resulting blank
line to be just a newline.

I know there are various other unix tools that could do this, but sed looks
like the most appropriate, and I would like to learn to use its fancier
facilities anyway.

Any ideas? Just email response if this seems too trivial for broad posting.
--
George Young, Rm. B-141            young@ll-vlsi.arpa
MIT Lincoln Laboratory             young@vlsi.ll.mit.edu
244 Wood St.
Lexington, Massachusetts 02173     (617) 981-2756
dhesi@bsu-cs.UUCP (Rahul Dhesi) (10/13/88)
In article <192@vlsi.ll.mit.edu> young@vlsi.ll.mit.edu (George Young) writes:
>I often want to take a big ascii file
>(like a .c file after cc -E) and collapse each group of 'blank' lines
>into exactly one blank line. 'Blank' here is any combination of blanks,
>tabs and maybe ^L's.

Under 4.3BSD, "cat -s" does this, except that I don't know if it also
considers ^L to be a blank line. Also, for screen displays only, "less"
has an option to do the same.

I've never been able to make sed work with patterns that span newlines.
--
Rahul Dhesi UUCP: <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
rar@nascom.UUCP (Alan Ramacher) (10/13/88)
In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> Is there a 'sed' wizard out there? I often want to take a big ascii file
> (like a .c file after cc -E) and collapse each group of 'blank' lines
> into exactly one blank line. 'Blank' here is any combination of blanks,
> tabs and maybe ^L's. It looks from the documentation that sed should do this
> quite neatly, using the multiple line pattern space commands with imbedded
> newlines, but I sure can't figure out how. I'd prefer the resulting blank
> line to be just a newline.

sed is not powerful enuf for the job, but a simple awk script will
work. If you have difficulties writing it, let me know and I will
supply one. Good luck.

Allan Ramacher
maart@cs.vu.nl (Maarten Litmaath) (10/14/88)
In article <192@vlsi.ll.mit.edu> young@vlsi.ll.mit.edu (George Young) writes:
\Is there a 'sed' wizard out there? I often want to take a big ascii file
\(like a .c file after cc -E) and collapse each group of 'blank' lines
\into exactly one blank line. 'Blank' here is any combination of blanks,
\tabs and maybe ^L's.
How about `awk'?
% cat /usr/local/bin/deblank
#! /bin/sh
exec awk '$0 !~ /^[ \t\f]*$/ { print; prev = 0; next }
{ if (prev) next }
{ prev = 1; print "" }
'
%
--
Hippic sport: |Maarten Litmaath @ Free U Amsterdam:
a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart
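
A quick way to try the awk approach above is to wrap it in a shell function
and feed it a few lines from printf. This is only a sketch: the function name
deblank is borrowed from the script path shown above, the variable in_run is
made up for illustration, and it assumes an awk that understands \t and \f
inside a regular expression.

```shell
#!/bin/sh
# deblank: collapse each run of blank lines (only spaces, tabs, formfeeds)
# into a single empty line. Reads the named files, or stdin if none given.
deblank() {
    awk '
        # Non-blank line: print it and note that no blank run is open.
        $0 !~ /^[ \t\f]*$/ { print; in_run = 0; next }
        # First blank line of a run: emit exactly one empty line.
        !in_run { in_run = 1; print "" }
        # Further blank lines of the same run match no rule and are dropped.
    ' "$@"
}

# Sample input: "a", then an empty line, a space+tab line, a formfeed line,
# then "b". The three blank lines come out as one empty line between a and b.
printf 'a\n\n \t\n\f\nb\n' | deblank
```

Passing "$@" instead of $* keeps file names with spaces intact, which the
1988 scripts did not have to worry about much.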
rupley@arizona.edu (John Rupley) (10/14/88)
In article <136@nascom.UUCP>, rar@nascom.UUCP (Alan Ramacher) writes:
> In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> > Is there a 'sed' wizard out there? I often want to take a big ascii file
> > (like a .c file after cc -E) and collapse each group of 'blank' lines
> > into exactly one blank line. 'Blank' here is any combination of blanks,
> > tabs and maybe ^L's.
>
> sed is not powerful enuf for the job, but a simple awk script will
> work.

The following simple sed script should work (after replacing <sp> etc
by the corresponding ascii characters) -- and I suspect it is shorter and
will run faster than an equivalent awk script.

sed "/^[<sp><tab><lf>]*$/{
    N
    /\n.*[^<sp><tab><lf>]/{
        b
        }
    D
    }" filename

John Rupley
 internet: rupley@megaron.arizona.edu
 uucp: ..{cmcl2 | hao!ncar!noao}!arizona!rupley
 Dept. Biochemistry, Univ. Arizona, Tucson AZ 85721
csr@drutx.ATT.COM (Steve Roush) (10/14/88)
# strip 1st blank line down to nothing, then print newline
#   s/.*//p should work, but not on my version
# append next line, then strip thru newline, finally loop to see if non-blank
############# NOTE, replace ^L with the real thing
sed '/^[ ^L]*$/{
s/.*//
p
:bogus
/^[ ^L]*$/{
N
s/.*\n//
bbogus
}
}'

steve roush
AT&T - BTL Denver
303-538-4860
drutx!csr
wu@spot.Colorado.EDU (WU SHI-KUEI) (10/14/88)
In article <7372@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
>In article <136@nascom.UUCP>, rar@nascom.UUCP (Alan Ramacher) writes:
>> In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
>> > Is there a 'sed' wizard out there? I often want to take a big ascii file
>> > (like a .c file after cc -E) and collapse each group of 'blank' lines
>> > into exactly one blank line. 'Blank' here is any combination of blanks,
>> > tabs and maybe ^L's.
>>
>> sed is not powerful enuf for the job, but a simple awk script will
>> work.
>
>The following simple sed script should work (after replacing <sp> etc
>by the corresponding ascii characters) -- and I suspect it is shorter and
>will run faster than an equivalent awk script.
>
>sed "/^[<sp><tab><lf>]*$/{
>    N
>    /\n.*[^<sp><tab><lf>]/{
>        b
>        }
>    D
>    }" filename

Just to confirm the above I timed the following sed and awk scripts on an
otherwise empty 3B2/400 running SV3.1. The sed version is approximately
35% faster when applied to a file which contains 948 and 250 lines before
and after stripping off the extra blank lines.

------------------------------ cut here ------------------------------
sed -n '
/^[ ]*$/ {
    p
: loop
    n
    /^[ ]*$/b loop
}
/[!-~][!-~]*/p' $*
------------------------------ cut here ------------------------------

------------------------------ cut here ------------------------------
awk '
    BEGIN { empty = 1 }
    $0 !~ /^[ \t]+$|^$/ {print; empty = 0}
    $0 ~ /^$|^[ \t]+$/ && empty == 0 {print; empty = 1}
' $*
------------------------------ cut here ------------------------------

Just a guest here. In real life
Carl Brandauer
{ncar!uunet}!nbires!bdaemon!carl
attmail!bdaemon!carl
maart@cs.vu.nl (Maarten Litmaath) (10/15/88)
In article <7372@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
\sed "/^[<sp><tab><lf>]*$/{
\    N
\    /\n.*[^<sp><tab><lf>]/{
\        b
\        }
\    D
\    }" filename

Four things:
1) I guess you meant <ff>, instead of <lf>.
2) The script doesn't convert lines containing [<sp><tab><ff>]* to JUST 1
   newline.
3) The script contains non-printable characters.
4) The script could have been less complex.

For these reasons I suggest the following adjusted script:
------------------------------cut here----------------------------------------
#! /bin/sh

chars=" `echo tf | tr tf '\11\14'`"

exec sed "/^[$chars]*$/{
    N
    /\n[$chars]*$/D
    s/[$chars]*//
}" $*
------------------------------cut here----------------------------------------
The `awk' solution I posted earlier, had to be modified a bit too:
------------------------------cut here----------------------------------------
#! /bin/sh

exec awk '$0 !~ /^[ \t\f]*$/ { print; prev = 0; next }
prev == 0 { prev = 1; print "" }' $*
--
Hippic sport: |Maarten Litmaath @ Free U Amsterdam:
a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart
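
The N/D technique in the adjusted sed script can be sketched as a shell
function and checked against a small sample. Several things here are
assumptions and not part of the posted script: the function name squeeze_nd
is hypothetical, printf is used to build the character set in place of
echo and tr, and a $!N guard is added so that a blank final line is reduced
to an empty line whether or not the local sed prints the pattern space when
N runs out of input.

```shell
#!/bin/sh
# Space, tab and formfeed, built without literal control characters.
chars=$(printf ' \t\f')

# squeeze_nd (hypothetical name): collapse runs of blank lines with N and D.
#   $!N  - on a blank line that is not the last one, append the next line
#   D    - if the appended line is blank too, drop the first line and
#          restart the cycle with what is left
#   s    - otherwise strip the blanks from the first line, leaving it empty
# Every $ meant for sed is written \$ because the script is double-quoted.
squeeze_nd() {
    sed "/^[$chars]*\$/{
        \$!N
        /\n[$chars]*\$/D
        s/[$chars]*//
    }" "$@"
}

# Sample: a, three blank-ish lines, b, and a trailing formfeed-only line.
printf 'a\n \n\t\n\nb\n\f\n' | squeeze_nd
```

The substitution only removes blanks up to the embedded newline, so any
indentation on the non-blank line that follows is left alone.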
maart@cs.vu.nl (Maarten Litmaath) (10/15/88)
In article <4057@boulder.Colorado.EDU> wu@spot.Colorado.EDU (WU SHI-KUEI) writes:
\------------------------------ cut here ------------------------------
\sed -n '
\/^[ ]*$/ {
    ^
\ p
\: loop
\ n
\ /^[ ]*$/b loop
\}
\/[!-~][!-~]*/p' $*
\------------------------------ cut here ------------------------------
At the place the caret is pointing to, a space should be added (there's only
the tab)! Furthermore, formfeeds were considered white space too.
This is an indication that non-printable characters (in essential places)
should be avoided. One way to do it here:
chars=" `echo tf | tr tf '\11\14'`"
The first `p' command should be replaced by `s/.*//p', to meet what the
original poster wanted. For the same reason the `/[!-~][!-~]*/p' command
is doing too much: `p' suffices.
\------------------------------ cut here ------------------------------
\awk '
\ BEGIN { empty = 1 }
\ $0 !~ /^[ \t]+$|^$/ {print; empty = 0}
\ $0 ~ /^$|^[ \t]+$/ && empty == 0 {print; empty = 1}
\' $*
\------------------------------ cut here ------------------------------
The `next' command of `awk' can simplify this script (see my previous
posting). Why not use `/^[ \t\f]*$/' for the regular expression?
Your `sed' solution (+ modifications) is the best I've seen so far.
My `awk' solution is easier to program, but a lot slower indeed.
--
Hippic sport: |Maarten Litmaath @ Free U Amsterdam:
a contradiction in terms.|maart@cs.vu.nl, mcvax!botter!maart
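
With the modifications suggested above applied, the quoted sed script might
look roughly like the sketch below. The function name squeeze_loop is
hypothetical, and printf stands in for the echo and tr pipeline to build
the character set; the structure (print one empty line, then skip the rest
of the run with n) is the one from the quoted script.

```shell
#!/bin/sh
# Space, tab and formfeed, built without literal control characters.
chars=$(printf ' \t\f')

# squeeze_loop (hypothetical name): sed -n suppresses automatic printing.
#   s/.*//p  - turn the first blank line of a run into an empty line, print it
#   n        - fetch the next line; sed stops here if the input is exhausted
#   b loop   - keep fetching while the lines are blank
#   p        - print the non-blank line that ended the run (or any other line)
# Every $ meant for sed is written \$ because the script is double-quoted.
squeeze_loop() {
    sed -n "
/^[$chars]*\$/{
    s/.*//p
    :loop
    n
    /^[$chars]*\$/b loop
}
p" "$@"
}

# Sample: a, three blank-ish lines, b, and a trailing formfeed-only line.
printf 'a\n \n\t\n\nb\n\f\n' | squeeze_loop
```

Because each line is read once and never joined to another, this version
does not depend on how sed treats N at the end of the input.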
dave@lsuc.uucp (David Sherman) (10/17/88)
Some versions of UNIX (including 4.1BSD) include a program called ssp(1)
which does this; it's a bit faster than the parallel awk script.

Also, if you know your text doesn't contain duplicate lines (nroff output
almost always qualifies, for example), good old uniq(1) will do it.

David Sherman
--
{ uunet!attcan att pyramid!utai utzoo } !lsuc!dave
leo@philmds.UUCP (Leo de Wit) (10/17/88)
In article <136@nascom.UUCP> rar@nascom.UUCP (Alan Ramacher) writes:
|In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
|> Is there a 'sed' wizard out there? I often want to take a big ascii file
|> (like a .c file after cc -E) and collapse each group of 'blank' lines
|> into exactly one blank line. 'Blank' here is any combination of blanks,
|> tabs and maybe ^L's. It looks from the documentation that sed should do this
|> quite neatly, using the multiple line pattern space commands with imbedded
|> newlines, but I sure can't figure out how. I'd prefer the resulting blank
|> line to be just a newline.
|
|sed is not powerful enuf for the job, but a simple awk script will
|work. If you have difficulties writing it, let me know and I will
|supply one. Good luck.

I already mailed George a solution, but couldn't leave this one alone...

Sed is most certainly powerful enough - I'll show you in a minute - ; in
fact, I think for such a typical text processing job sed is to be
preferred. And a not too unimportant reason for that is its speed.

And here's my sed-solution; note that tab and formfeed have been coded as
^I and ^L so your pager isn't fooled; you should of course use the real
control codes in practice.
(using /bin/sh as command interpreter: )

sed -n -e '
/^[ ^I^L]*$/{
s/^.*$//p
: again
n
s/^[ ^I^L]*$//
t again
}
p' your_file

Explanation: whenever you read a line containing only blank characters
(i.e. satisfying the first pattern), print just one newline. Discard any
blank lines that follow (the 'again' loop). When you're through with the
'first pattern subroutine 8-)' print the non-blank line that's now in the
pattern space. Simple enough, huh ?

Leo.

P.S. I don't doubt it can be done with awk (it could even be programmed
with the shell). I however doubt it will be nearly as fast as the sed
solution.
allyn@hp-sdd.hp.com (Allyn Fratkin) (10/17/88)
you sed experts are going way overboard here. this is a really trivial
problem for sed. try something simple, like the following:

    sed 's/^[<sp><tab><ff>]*//; /^$/N; /^\n[<sp><tab><ff>]*$/D'

this starts by reducing a "blank" line to an empty one. an empty line is
joined to the next line, and then if this next line is also "blank", the
first line is deleted.

i admit that this does require non-printable chars. too bad sed doesn't
understand octal escape sequences. to get around the non-printable chars,
you could always do something like this:

    chars=`echo '\040\011\014\c'`    # use -n with bsd echo
    sed "s/^[$chars]*//; /^$/N; /^\n[$chars]*$/D"
--
From the virtual mind of Allyn Fratkin          allyn@sdd.hp.com
San Diego Division                   - or -
Hewlett-Packard Company                         uunet!ucsd!hp-sdd!allyn
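
The three-step idea described above (reduce a blank line to an empty one,
join it to the next line, delete the first line if the next is blank too)
can be sketched as below. Note two deliberate departures from the posted
one-liner, both assumptions of this sketch: the first substitution is
anchored with $ so that it only touches lines that are entirely blank and
leaves the indentation of other lines alone, and N is guarded with $! so
the last line is handled the same on every sed. The function name
squeeze_join is hypothetical, and printf replaces echo for building the
character set.

```shell
#!/bin/sh
# Space, tab and formfeed, built without literal control characters.
chars=$(printf ' \t\f')

# squeeze_join (hypothetical name):
#   1. a line made only of blanks becomes an empty line
#   2. an empty line (unless it is the last line) is joined to the next one
#   3. if the joined line is blank as well, D drops the first line and the
#      cycle restarts on the remainder
# Every $ meant for sed is written \$ because the script is double-quoted.
squeeze_join() {
    sed "s/^[$chars]*\$//
/^\$/{
    \$!N
}
/^\n[$chars]*\$/D" "$@"
}

# Sample: a, three blank-ish lines, an indented b, a trailing formfeed line.
printf 'a\n \n\t\n\n  b\n\f\n' | squeeze_join
```

With the unanchored form from the post, the leading spaces of the indented
line would be stripped as well, which may or may not matter for cc -E
output.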
mark@adec23.UUCP (Mark Salyzyn) (10/17/88)
In article <192@vlsi.ll.mit.edu>, young@vlsi.ll.mit.edu (George Young) writes:
> I often want to take a big ascii file
> (like a .c file after cc -E) and collapse each group of 'blank' lines
> into exactly one blank line. 'Blank' here is any combination of blanks,
> tabs and maybe ^L's. It looks from the documentation that sed should do this
> quite neatly, using the multiple line pattern space commands with imbedded
> newlines, but I sure can't figure out how. I'd prefer the resulting blank
> line to be just a newline.

This is untested, but you should get the idea ...

: loop
N
/^[(blank)(tab)(form feed)]*\n[(blank)(tab)(form feed)]*$/ {
    s///
    b loop
}

Replace (blank) with the blank character itself.
Replace (tab) with the tab character itself.
Replace (form feed) with the form feed character itself.
--
Mark Salyzyn @ ADEC Systems Inc.
morrell@hpsal2.HP.COM (Michael Morrell) (10/18/88)
/ hpsal2:comp.unix.questions / dave@lsuc.uucp (David Sherman) / 3:49 pm Oct 16, 1988 /
Some versions of UNIX (including 4.1BSD) include a program called ssp(1)
which does this; it's a bit faster than the parallel awk script.

Also, if you know your text doesn't contain duplicate lines (nroff output
almost always qualifies, for example), good old uniq(1) will do it.
----------
I thought the original problem was to replace multiple "blank" lines with a
single newline (\n), where "blank" included lines with only tabs and
formfeeds on them. ssp and uniq will only help if the lines to be combined
are really blank.
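
The point about uniq can be seen on a small sample: uniq only merges
adjacent lines that are identical, so an empty line, a line holding a
space and a line holding a tab all survive. A sketch of one way around
that, assuming (as in the quoted suggestion) that the text has no adjacent
duplicate non-blank lines, is to reduce blank lines to empty ones first.
The function name normalize_blank is hypothetical.

```shell
#!/bin/sh
# Space, tab and formfeed, built without literal control characters.
chars=$(printf ' \t\f')

# normalize_blank (hypothetical name): turn every line that consists only
# of blanks into a truly empty line; other lines pass through unchanged.
normalize_blank() {
    sed "s/^[$chars]*\$//" "$@"
}

# uniq alone: the empty, space-only and tab-only lines differ, so all five
# input lines are kept.
printf 'a\n\n \n\t\nb\n' | uniq | wc -l

# Normalized first: the three blank lines are now identical and adjacent,
# so uniq merges them and three lines remain.
printf 'a\n\n \n\t\nb\n' | normalize_blank | uniq | wc -l
```

The same caveat as in the quoted post applies: uniq will also merge
adjacent identical non-blank lines, so this is only safe when the text has
none.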