[comp.text.sgml] sgml stripping

goer@quads.uchicago.edu (Richard L. Goerwitz) (12/06/90)

I keep hearing people talking about stripping out SGML markup.
This is a terrible idea, in general.  There are, however, times
when it is useful to be able to strip out everything that is
enclosed within <> delimiters.

Here's a small Icon program that will do this.
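
Once it is built, a session might look something like the sketch
below.  The file names sample.map and sample.sgml and the <item> and
<bold> tags are made up for illustration; the map-file format (code,
start text, optional end text, separated by tabs or colons) and the
backslash escape are the ones described in the comments at the top of
stripsgml.icn.

```shell
# Hypothetical map file: code, start text, optional end text.
# Fields are separated by a tab or a colon.
cat > sample.map << 'EOF'
bold:*:*
item:-
EOF

# Hypothetical input; the backslash keeps "<" from being read as a tag.
cat > sample.sgml << 'EOF'
<item>a \< b is <bold>not</bold> a tag</item>
EOF

# Run only if stripsgml has been built and is on the PATH.
if command -v stripsgml > /dev/null 2>&1; then
	stripsgml < sample.sgml              # plain stripping
	stripsgml sample.map < sample.sgml   # translation via the map file
else
	echo 'stripsgml not found; build it from the archive first'
fi
```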

-Richard (goer@sophist.uchicago.edu)

---- Cut Here and feed the following to sh ----
#!/bin/sh
# This is a shell archive (produced by shar 3.49)
# To extract the files from this archive, save it to a file, remove
# everything above the "!/bin/sh" line above, and type "sh file_name".
#
# made 12/06/1990 07:11 UTC by goer@sophist.uchicago.edu
# Source directory /u/richard/Stripsgml
#
# existing files will NOT be overwritten unless -c is specified
# This format requires very little intelligence at unshar time.
# "if test", "cat", "rm", "echo", "true", and "sed" may be needed.
#
# This shar contains:
# length  mode       name
# ------ ---------- ------------------------------------------
#   3170 -r--r--r-- stripsgml.icn
#   2615 -r--r--r-- stripunb.icn
#   1915 -r--r--r-- readtbl.icn
#   2084 -r--r--r-- slashbal.icn
#    981 -rw-r--r-- README
#    659 -rw-r--r-- Makefile.dist
#
if test -r _shar_seq_.tmp; then
	echo 'Must unpack archives in sequence!'
	echo Please unpack part `cat _shar_seq_.tmp` next
	exit 1
fi
# ============= stripsgml.icn ==============
if test -f 'stripsgml.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping stripsgml.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripsgml.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripsgml.icn' &&
X############################################################################
X#
X#	Name:	 stripsgml.icn
X#
X#	Title:	 Strip (or translate) simple SGML tags from a file
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.7
X#
X############################################################################
X#
X#  This program, stripsgml, may be used either to strip SGML tags
X#  from a file, or to translate them into some other format (or perhaps
X#  some combination of the two).  Note that it only handles very
X#  simple SGML codes, either stripping or translating set strings.
X#  This is a VERY simple program, merely intended to satisfy a need
X#  many have expressed for being able to remove, or perform simple
X#  manipulations on, files containing <>-style tags.
X#
X#  In its basic mode, you would simply have stripsgml read the
X#  standard input (an SGML-marked file).  Stripsgml would then write
X#  an SGML-free text on the standard output.  Used in this way,
X#  stripsgml is just a simple stripping program.
X#
X#  If you want some or all of the SGML codes translated into another set
X#  of codes, simply create a file in which each line has 1) the name of
X#  the SGML code, and then 2) the way you want that code translated on
X#  both initialization and completion.  The completion specification is
X#  optional.  Put succinctly, the format is:
X#
X#      code	initialization	completion
X#
X#  A tab or colon separates the fields.  If you want to use a tab or colon
X#  as part of the text (and not as a separator), place a backslash before
X#  it.  The completion field is optional.  There is not currently any way
X#  of specifying a completion field without an initialization field.
X#
X#  In its translation mode, stripsgml is invoked with one argument (the
X#  name of the file containing the translation information).  As before,
X#  the standard input is expected to contain an SGML encoded file:
X#
X#      stripsgml translation_file < SGML-file
X#
X#  An SGML-free text is written to the standard output.
X#
X#  Note that, if you are translating SGML code into font change or escape
X#  sequences, you may get unexpected results.  This isn't stripsgml's
X#  fault.  It's just a matter of how your terminal or WP operates.  Some
X#  need to be "reminded" at the beginning of each line what mode or font
X#  is being used.  Note also that stripsgml assumes < and > as delimiters.
X#  If you want to put a greater-than or less-than sign into your text,
X#  put a backslash before it.  This will effectively "escape" the
X#  special meaning of those symbols.  There is currently no way to change
X#  the default delimiters.
X#
X############################################################################
X#
X#  Links: slashbal.icn ./stripunb.icn ./readtbl.icn
X#
X############################################################################
X
X
Xprocedure main(a)
X
X    usage := "usage:  stripsgml [map-file]"
X    *a > 1 & stop(usage)
X
X    map_file := open(a[1]) & t := readtbl(map_file)
X
X    every line := !&input do
X	write(stripunb('<','>',line,&null,&null,t))
X
X    # last_k is the stack used in stripunb.icn
X    if *\last_k ~= 0 then
X	stop("Unexpected EOF encountered.  Expecting ", pop(last_k), ".")
X
Xend
SHAR_EOF
true || echo 'restore of stripsgml.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= stripunb.icn ==============
if test -f 'stripunb.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping stripunb.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting stripunb.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'stripunb.icn' &&
X############################################################################
X#
X#	Name:	 stripunb.icn
X#
X#	Title:	 Strip unbalanced material
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.3
X#
X############################################################################
X#  
X#  This routine strips material from a line which is unbalanced with
X#  respect to the characters defined in arguments 1 and 2 (unbalanced
X#  being defined as bal() defines it, except that characters preceded
X#  by a backslash are counted as regular characters, and are not taken
X#  into account by the balancing algorithm).
X#
X#  One little bit of weirdness I added in is a table argument. Put
X#  simply, if you call stripunb() as follows,
X#
X#      stripunb('<','>',s,&null,&null,t)
X#
X#  and if t is a table having the form,
X#
X#      key:  "bold"        value: outstr("\e[2m", "\e[1m")
X#      key:  "underline"   value: outstr("\e[4m", "\e[1m")
X#      etc.
X#
X#  then every instance of "<bold>" in string s will be mapped to
X#  "\e[2m", and every instance of "</bold>" will be mapped to "\e[1m".
X#  Values in table t must be records of type outstr(on, off).  When
X#  "</>" is encountered, stripunb will output the .off value for the
X#  most recent tag that is still open.
X#
X############################################################################
X#
X#  Links: slashbal.icn
X#
X############################################################################
X
Xglobal last_k
Xrecord outstr(on, off)
X
X
Xprocedure stripunb(c1,c2,s,i,j,t)
X
X    # NB:  Stripunb() returns a string - not an integer (like find,
X    # upto).
X
X    local lookinfor, bothcs, s2, k, c, compl
X    #global last_k
X    initial last_k := list()
X
X    /c1 := '<'
X    /c2 := '>'
X    bothcs := c1 ++ c2
X    lookinfor := c1 ++ '\\'
X    c := &cset -- c1 -- c2
X
X    /s := \&subject | stop("stripunb:  No string argument.")
X    /i := \&pos | 1
X    /j := *s + 1
X
X    s2 := ""
X    s ? {
X	tab(i) | fail
X	while s2 ||:= tab(upto(lookinfor)) do {
X	    if ="\\" & any(bothcs) then {
X		&pos+1 > j & (return s2)
X		s2 ||:= move(1)
X		next
X	    }
X	    else {
X		&pos > j & (return s2)
X		any(c1) |
X		    stop("stripunb:  Unbalanced string, pos(",&pos,").\n",s)
X		k := tab(slashbal(c,c1,c2,&null,&null,&null,1)) | tab(0)
X		if \t then {
X		    k ?:= 2(="<", tab(find(">")), =">", pos(0))
X		    if k ?:= (="/", tab(0)) then {
X			compl := pop(last_k) | stop("Unclosed <>, ",&subject) 
X			if k == ""
X			then k := compl
X			else k == compl | stop("Incorrectly paired <>, </>.")
X			s2 ||:= \(\t[k]).off
X		    }
X		    else {
X			s2 ||:= \(\t[k]).on
X			push(last_k, k)
X		    }
X		}
X	    }
X	}
X	s2 ||:= tab(0)
X    }
X
X    return s2
X
Xend
SHAR_EOF
true || echo 'restore of stripunb.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= readtbl.icn ==============
if test -f 'readtbl.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping readtbl.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting readtbl.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'readtbl.icn' &&
X############################################################################
X#
X#	Name:	 readtbl.icn
X#
X#	Title:	 Read user-created stripsgml table
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.1
X#
X############################################################################
X#  
X#  This file is part of the stripsgml package.  It does the job of
X#  reading optional user-created mapping information from a file.  The
X#  purpose of the mapping file is to specify how each code in a given
X#  input text should be translated.  Each line has the form:
X#
X#      SGML-designator	start_code	end_code
X#
X#  where the SGML designator is something like "quote" (without the
X#  quotation marks), and the start and end codes are the way in which you
X#  want the beginning and end of a <quote>...</quote> sequence to be
X#  translated.  Presumably, in this instance, your codes would indicate
X#  some set level of indentation, and perhaps a font change.  If you don't
X#  have an end code for a particular SGML designator, just leave it blank.
X#
X############################################################################
X#
X#  Links: stripsgml.icn
X#
X############################################################################
X
X
Xprocedure readtbl(f)
X
X    local t, line, k, on_sequence, off_sequence
X
X    /f & stop("readtbl:  Arg must be a valid open file.")
X
X    t := table()
X
X    every line := trim(!f,'\t ') do {
X	line ? {
X	    # The key and start code are required; stop if either is missing.
X	    (k := tabslashupto('\t:') &
X	     tab(many('\t:')) &
X	     on_sequence := tabslashupto('\t:') | tab(0)) |
X		stop("readtbl:  Bad map file format.")
X	    # The end code is optional; it defaults to the empty string.
X	    tab(many('\t:'))
X	    off_sequence := tab(0)
X	}
X	insert(t, k, outstr(on_sequence, off_sequence))
X    }
X
X    return t
X
Xend
X
X
X
Xprocedure tabslashupto(c,s)
X
X    POS := &pos
X
X    while tab(upto('\\' ++ c)) do {
X	if ="\\" then {
X	    move(1)
X	    next
X	}
X	else {
X	    if any(c) then {
X		suspend &subject[POS:.&pos]
X	    }
X	}
X    }
X
X    &pos := POS
X    fail
X
Xend
SHAR_EOF
true || echo 'restore of readtbl.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= slashbal.icn ==============
if test -f 'slashbal.icn' -a X"$1" != X"-c"; then
	echo 'x - skipping slashbal.icn (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting slashbal.icn (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'slashbal.icn' &&
X############################################################################
X#
X#	Name:	 slashbal.icn
X#
X#	Title:	 Bal() with backslash escaping
X#
X#	Author:	 Richard L. Goerwitz
X#
X#	Version: 1.4
X#
X############################################################################
X#
X#  I am often frustrated at bal()'s inability to deal elegantly with
X#  the common \backslash escaping convention (a way of telling Unix
X#  Bourne and C shells, for instance, not to interpret a given
X#  character as a "metacharacter").  I recognize that bal()'s generic
X#  behavior is a must, and so I wrote slashbal() to fill the gap.
X#
X#  Slashbal behaves like bal, except that it ignores, for purposes of
X#  balancing, any c2/c3 char which is preceded by a backslash.  Note
X#  that we are talking about internally represented backslashes, and
X#  not necessarily the backslashes used in Icon string literals.  If
X#  you have "\(" in your source code, the string produced will have no
X#  backslash.  To get this effect, you would need to write "\\(".
X#
X#  BUGS:  Note that, like bal() (v8), slashbal() cannot correctly
X#  handle cases where c2 and c3 intersect.
X#
X############################################################################
X#
X#  Links: none
X#
X############################################################################
X
Xprocedure slashbal(c1, c2, c3, s, i, j)
X
X    local twocs, allcs, chr, chr2, count
X
X    /c1 := &cset
X    /c2 := '('
X    /c3 := ')'
X    twocs := c2 ++ c3
X    allcs := c1 ++ c2 ++ c3 ++ '\\'
X
X    /s := \&subject | stop("slashbal:  No string argument.")
X    /i := \&pos | 1
X    /j := *s + 1
X
X    count := 0
X    s ? {
X	tab(i) | fail
X	while tab(upto(allcs)) do {
X	    chr := move(1)
X	    if chr == "\\" & any(twocs) then {
X		chr2 := move(1)
X		&pos > j & fail
X		if any(c1, chr) & count = 0 then
X		    suspend .&pos - 2
X		if any(c1, chr2) & count = 0 then
X		    suspend .&pos - 1
X	    }
X	    else {
X		&pos > j & fail
X		if any(c1, chr) & count = 0 then
X		    suspend .&pos - 1
X		if any(c2, chr) then
X		    count +:= 1
X		else if any(c3, chr) then
X		    count -:= 1
X	    }
X	}
X    }
X
Xend
SHAR_EOF
true || echo 'restore of slashbal.icn failed'
rm -f _shar_wnt_.tmp
fi
# ============= README ==============
if test -f 'README' -a X"$1" != X"-c"; then
	echo 'x - skipping README (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting README (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'README' &&
XRe:  stripsgml.icn & associated files
X
XThis program is documented in the various source files, most notably
Xstripsgml.icn.  Please look them over, even if you are not an Icon
Xprogrammer.
X
XIn order to compile this program, you will need an Icon interpreter
X(or compiler).  If you do not have it, get it.  It is free, and can
Xbe obtained via ftp from cs.arizona.edu.  If you do not have access
Xto the internet, drop a line to icon-project@arizona.edu, and
Xthey will fill you in on what to do.
X
XIf you are working on a Unix system, you can simply mv Makefile.dist
Xto Makefile, and then make.  Users on other systems will need to
Xtype:
X
X     icont -o stripsgml readtbl.icn slashbal.icn stripsgml.icn stripunb.icn
X
XAs I said above, see the file stripsgml.icn for more information on how
Xto use this program.  This program is not fancy, and handles only the
Xsimplest <>-style markup.  It is in no way an attempt to handle the full
Xmetalanguage!
X
X-Richard (goer@sophist.uchicago.edu)
SHAR_EOF
true || echo 'restore of README failed'
rm -f _shar_wnt_.tmp
fi
# ============= Makefile.dist ==============
if test -f 'Makefile.dist' -a X"$1" != X"-c"; then
	echo 'x - skipping Makefile.dist (File already exists)'
	rm -f _shar_wnt_.tmp
else
> _shar_wnt_.tmp
echo 'x - extracting Makefile.dist (Text)'
sed 's/^X//' << 'SHAR_EOF' > 'Makefile.dist' &&
XPROGNAME = stripsgml
X
X# Please edit these to reflect your local file structure & conventions.
XDESTDIR = /usr/local/bin
XOWNER = bin
XGROUP = bin
X
XSRC = $(PROGNAME).icn stripunb.icn readtbl.icn slashbal.icn
X
X$(PROGNAME): $(SRC)
X	icont -o $(PROGNAME) $(SRC)
X
X# Pessimistic assumptions regarding the environment (in particular,
X# I don't assume you have the BSD "install" shell script).
Xinstall: $(PROGNAME)
X	@sh -c "test -d $(DESTDIR) || (mkdir $(DESTDIR) && chmod 755 $(DESTDIR))"
X	cp $(PROGNAME) $(DESTDIR)/
X	chgrp $(GROUP) $(DESTDIR)/$(PROGNAME)
X	chown $(OWNER) $(DESTDIR)/$(PROGNAME)
X	@echo "\nInstallation done.\n"
X
Xclean:
X	-rm -f *~ *.u?
X	-rm -f $(PROGNAME)
SHAR_EOF
true || echo 'restore of Makefile.dist failed'
rm -f _shar_wnt_.tmp
fi
exit 0
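
For anyone curious about how the archive above unpacks itself: each
file travels as a here-document whose lines carry a leading X, and sed
strips that X on extraction.  The prefix is there, as far as I know,
to keep leading whitespace intact and to shield lines that mail
software might otherwise mangle.  A minimal sketch of the same idiom
follows, with demo.txt as a made-up file name; it is an illustration,
not shar's actual output.

```shell
# Sketch of the shar extraction idiom; demo.txt is a hypothetical file name.
# Pass -c as the first argument to overwrite an existing copy.
if test -f 'demo.txt' -a X"${1-}" != X"-c"; then
	echo 'x - skipping demo.txt (File already exists)'
else
	echo 'x - extracting demo.txt (Text)'
	# sed removes exactly one leading X from every line of the here-document.
	sed 's/^X//' << 'SHAR_EOF' > 'demo.txt'
XFrom this line on, the X keeps the text away from column one.
X	This line starts with a tab, which survives the trip.
XX marks the spot: only the first X is removed.
SHAR_EOF
fi
```

Quoting the 'SHAR_EOF' delimiter stops the shell from expanding
anything inside the here-document, so the file contents pass through
untouched apart from the X.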