[net.unix] Unix text files

kay@warwick.UUCP (Kay Dekker) (10/26/85)

Extensive quoting ensues, as I've moved the discussion to net.unix from
net.bugs, and people may have missed this...

Sometime back, gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) wrote:

>> >Many UNIX text-file utilities will discard a (necessarily final)
>> >text line that does not end in a newline.  Quite simply, such a
>> >file is not a proper UNIX text file.

and I responded with:

>> Who says?  Where's the definition of a 'proper' UNIX text file?

to which he replied:

>The problem is, there are several interpretations of such a file,
>depending on the utility involved.  Perhaps there should be a
>well-defined standard interpretation, but there isn't currently.
>
>"A file of text consists simply of a string of characters, with
>lines demarcated by the newline character."  -- from "The UNIX
>Time-Sharing System" by Ritchie & Thompson
>
>"text file, ASCII file -- a file, the bytes of which are understood
>to be in ASCII code"  -- from "Glossary" in "UNIX Time-Sharing
>System Programmer's Manual", 8th Ed.
>
>"A text stream is an ordered sequence of bytes composed into lines,
>each line consisting of zero or more characters plus a terminating
>new-line character.  ...  The sequentially last character read in
>from a text stream will, however, always be sequentially the last
>character that was earlier written out to the text stream, if that
>character was a new-line."  -- from ANSI X3J11/85-045
>
>My personal choice would be similar to Ritchie & Thompson, where
>newlines delimit (NOT "terminate") text lines, so that the last
>character in a text file would not need to be a newline.  However,
>this raises the question of what utilities should do with the
>null line at the end of every text file that DOES end with a
>newline; this will still be utility-dependent (and should be
>documented whenever it is handled differently from other text
>lines in the file).
>
>X3J11/85-045 botched it anyhow, since they intended that ALL UNIX
>files qualify as "text streams" under stdio (vs. "binary streams",
>which have to be handled differently on some non-UNIX OSes).
>
>So, how do we establish a standard interpretation for non-newline-
>terminated UNIX text files?

Doug,
	I may be being optimistic (and thus *wrong*) but I don't see where
the problem with your suggestion [newlines delimiting text lines] lies:
the rule would be, simply,

"Text consists of an ordered sequence of characters, with lines delimited
by newline characters.  Text is normally terminated by a newline.  This
newline should be considered to be followed by a (nonexistent) null line.
The null line should not be considered to be part of the text.
	"If the last character of the text is not a newline, then consider
the text to be terminated by a newline - null line pair; however, this
newline - null line pair should not be considered to have been part of
the file."

I *think* that's right...
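
In code, the rule might come out something like this.  It is only a
rough sketch, not a polished routine, and the name get_text_line is
my own invention:

#include <stdio.h>

/*
 * Sketch of the rule above: read one text line into buf, with the
 * newline stripped.  A final line that lacks a newline is handed back
 * as if the newline were present; the imaginary null line after the
 * last newline is never returned to the caller.  Over-long lines are
 * silently truncated in this sketch.  Returns the number of characters
 * kept, or EOF when the text is exhausted.
 */
int get_text_line(char *buf, int size, FILE *fp)
{
        int c, n = 0;

        while ((c = getc(fp)) != EOF && c != '\n')
                if (n < size - 1)
                        buf[n++] = c;
        buf[n] = '\0';
        if (c == EOF && n == 0)
                return EOF;     /* nothing pending: the text is done */
        return n;               /* a real line, newline-terminated or not */
}

int main(void)
{
        char line[512];

        while (get_text_line(line, sizeof line, stdin) != EOF)
                puts(line);     /* re-supply the newline on output */
        return 0;
}

Run as a filter, that reproduces a properly terminated file exactly and
quietly supplies the missing final newline otherwise.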
							Kay.
-- 
"The only good thing that I can find to say about the idea of colonies
in space is that America could, at last, have a world to herself."
						-- Elisabeth Zyne
			... mcvax!ukc!warwick!flame!kay

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (10/27/85)

> "Text consists of an ordered sequence of characters, with lines delimited
> by newline characters.  Text is normally terminated by a newline.  This
> newline should be considered to be followed by a (nonexistent) null line.
> The null line should not be considered to be part of the text.
> 	"If the last character of the text is not a newline, then consider
> the text to be terminated by a newline - null line pair; however, this
> newline - null line pair should not be considered to have been part of
> the file."
> 
> I *think* that's right...
> 							Kay.

Perhaps that is the best interpretation, but it sure is hard
to put all that into a formal grammar, whereas the original
concept was very simple:

file		::=	binary_file	|	text_file

binary_file	::=	{ byte }*

byte		::=	<primitive unit of data,
				at least 8 bits>

text_file	::=	{ text_line }*

text_line	::=	{ text_char }* newline

text_char	::=	<7-bit ASCII character
				excluding NUL and newline>

newline		::=	<ASCII LF character>
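
For what it's worth, a recognizer for that grammar is only a few lines
of C.  This is just a sketch (istext is a made-up name):

#include <stdio.h>

/*
 * Sketch of a recognizer for the text_file grammar above: every byte
 * must be a 7-bit ASCII character other than NUL, lines are terminated
 * (not merely separated) by LF, and an empty file is a text_file of
 * zero lines.  Returns 1 if the stream conforms, 0 if it does not.
 */
int istext(FILE *fp)
{
        int c, last = '\n';     /* so that an empty file qualifies */

        while ((c = getc(fp)) != EOF) {
                if (c == '\0' || c > 0177)
                        return 0;       /* neither text_char nor newline */
                last = c;
        }
        return last == '\n';    /* the final line must end in a newline */
}

int main(void)
{
        return istext(stdin) ? 0 : 1;   /* exit 0 iff stdin is a text_file */
}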

danny@nvzg2.UUCP (Danny Zerkel) (10/29/85)

> > "Text consists of an ordered sequence of characters, with lines delimited
> > by newline characters.  Text is normally terminated by a newline.  This
> > newline should be considered to be followed by a (nonexistent) null line.
> > The null line should not be considered to be part of the text.
> > 	"If the last character of the text is not a newline, then consider
> > the text to be terminated by a newline - null line pair; however, this
> > newline - null line pair should not be considered to have been part of
> > the file."
> > 
> > I *think* that's right...
> > 							Kay.
> 
> Perhaps that is the best interpretation, but it sure is hard
> to put all that into a formal grammar, whereas the original
> concept was very simple:
> 
> file		::=	binary_file	|	text_file
> 
> binary_file	::=	{ byte }*
> 
> byte		::=	<primitive unit of data,
> 				at least 8 bits>
> 
> text_file	::=	{ text_line }*
> 
> text_line	::=	{ text_char }* newline
> 
> text_char	::=	<7-bit ASCII character
> 				excluding NUL and newline>
> 
> newline		::=	<ASCII LF character>

Hmmm... sounds like the old variable-length data representation problem.
It seems to me there are two fundamental representations of variable-length
data: counting and sentinels.  Anything not fitting these molds is
unstructured, or at best partially structured.
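
In C terms the two representations look roughly like this (a sketch;
the names are invented and 512 is just a convenient bound):

#include <stdio.h>
#include <string.h>

/* Counting: the length travels with the data; no terminator needed. */
struct counted {
        int     len;            /* number of bytes actually in text[] */
        char    text[512];
};

int main(void)
{
        struct counted c;
        char s[512];

        /* Put the same three characters into both representations. */
        c.len = 3;
        memcpy(c.text, "foo", 3);
        strcpy(s, "foo");       /* sentinel: NUL marks the end */

        /*
         * Counting gives the length in one reference; the sentinel
         * form must be scanned to find its marker, just as a text
         * line must be scanned for its newline.
         */
        printf("%d %d\n", c.len, (int)strlen(s));
        return 0;
}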

It looks like a text file is being represented as a variable number of
variable-length strings, except that the number of lines is unknown (though
indirectly derivable), and the sentinel marking the end of a line is
optional on the last line.

  "Look, ma, unstructured data!"
  "Avert your eyes son, or it will blind and confuse you."

Does anyone out there want to show those of us with weak knees how one
would use this kind of data structure [used loosely] in a program?  (In
other words, as if the data were within the program, not without, and
without additional support information like keeping track of the number
and lengths of lines.)

I think it would be a good example to the young of inherent complexity.
And I thought we were trying to make life simple!  The main problem here
is that we are trying to impose structure on unstructured data, which
is probably not the best approach.

NOTE:
Sentinels are a wonderful way of implementing lists, but a terrible way
of implementing strings.  Hint, hint.

All of this is not to say there is no use for unstructured data.  "tr"
does a great job on unstructured data, mainly because it treats it as such.
Using "cat" to look at files, however, is probably the worst offender: it
does not care what the data is, but attempts to make it appear on the
user's screen.

------------------------------------------------------------------------
From the fingertips spasmodically responding to the brain of the Master
of the Universe between the ears of---Danny J. Zerkel

bill@ur-cvsvax.UUCP (Bill Vaughn) (10/31/85)

> > "Text consists of an ordered sequence of characters, with lines delimited
> > by newline characters.  Text is normally terminated by a newline.  This
> > newline should be considered to be followed by a (nonexistent) null line.
> > The null line should not be considered to be part of the text.
> > 	"If the last character of the text is not a newline, then consider
> > the text to be terminated by a newline - null line pair; however, this
> > newline - null line pair should not be considered to have been part of
> > the file."
> > 
> > I *think* that's right...
> > 							Kay.
> 
> Perhaps that is the best interpretation, but it sure is hard
> to put all that into a formal grammar, whereas the original
> concept was very simple:
> 
> file		::=	binary_file	|	text_file
> binary_file	::=	{ byte }*
> byte		::=	<primitive unit of data, at least 8 bits>
> text_file	::=	{ text_line }*
> text_line	::=	{ text_char }* newline
> text_char	::=	<7-bit ASCII character excluding NUL and newline>
> newline	::=	<ASCII LF character>

Won't this change do it:

text_file	::=	{ text_line }*  { text_char }*

I'm assuming that { something }*  means zero or more occurrences
of 'something'.  I don't mean to imply that the change is desirable or
trivial, but it doesn't seem to be 'hard'.
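
If so, a checker for the amended grammar only has to verify the
characters; it no longer cares whether the last byte is a newline.
Roughly, as a sketch (the name is invented):

#include <stdio.h>

/*
 * Sketch: under the amended production any stream of 7-bit ASCII
 * characters other than NUL qualifies as a text_file, whether or not
 * its last byte is a newline.
 */
int istext_relaxed(FILE *fp)
{
        int c;

        while ((c = getc(fp)) != EOF)
                if (c == '\0' || c > 0177)
                        return 0;
        return 1;
}

int main(void)
{
        return istext_relaxed(stdin) ? 0 : 1;
}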

Bill Vaughn
Univ. of Rochester

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (11/05/85)

> Does anyone out there want to show those of us with weak knees how one
> would use this kind of data structure [used loosely] in a program?  (In
> other words, as if the data were within the program, not without, and
> without additional support information like keeping track of the number
> and lengths of lines.)

Most data processing algorithms are (or should be) driven by the
structure of the data that they process; this is normally taught
these days in the "data structures" CS course.  It should be obvious
from the grammar how to structure code that e.g. gets a line
of text, processes it, and writes out the resulting line.  (There
is no need to bring in line numbering or "length of line".)  If
there is no (or only a fuzzy) definition of "line of text",
then it is not obvious how to get/put one, and some random
choice is made by the programmer.  (Which is what started this
discussion.)
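
Concretely, the shape is just the standard filter loop.  Here is a
sketch using fgets from stdio; real "processing" would replace the
straight copy:

#include <stdio.h>

#define MAXLINE 512     /* a full text line plus newline plus NUL */

/*
 * Sketch of the structure the grammar dictates: get a line, process
 * it, put the result.  With the processing reduced to a copy, this
 * degenerates to cat(1) for text files.
 */
int main(void)
{
        char line[MAXLINE];

        while (fgets(line, sizeof line, stdin) != NULL)
                fputs(line, stdout);    /* per-line processing goes here */
        return 0;
}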

For simplicity, I left out of the grammar one important constraint,
which is a limit of no more than 510 characters in a line of text
(exclusive of newline).  I had already stretched the notation a bit
and didn't want to invent yet another notation like { char }*510 .
This limit is actually important in allowing efficient get-line
implementations.
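
That is, a get-line routine can use one fixed buffer and never needs to
grow it.  A sketch (the name getlin and the error convention are my
own):

#include <stdio.h>

#define MAXCHARS 510    /* characters per line, exclusive of newline */

/*
 * Sketch of a fixed-buffer get-line: since no valid line exceeds
 * MAXCHARS characters plus the newline, a buffer of MAXCHARS + 2 bytes
 * (text, newline, NUL) always suffices and nothing is ever grown or
 * reallocated.  Returns the count of characters placed in buf
 * (including the newline), EOF at end of file, or -2 for an over-long
 * line, which is simply rejected here.
 */
int getlin(char buf[], FILE *fp)
{
        int c, n = 0;

        while ((c = getc(fp)) != EOF) {
                buf[n++] = c;
                if (c == '\n')
                        break;
                if (n > MAXCHARS)
                        return -2;      /* too long to be a text_line */
        }
        buf[n] = '\0';
        return (n == 0 && c == EOF) ? EOF : n;
}

int main(void)
{
        char line[MAXCHARS + 2];
        int n;

        while ((n = getlin(line, stdin)) != EOF) {
                if (n == -2) {
                        fprintf(stderr, "line too long\n");
                        return 1;
                }
                fputs(line, stdout);
        }
        return 0;
}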

> I think it would be a good example to the young of inherent complexity.

There is nothing complex about that grammar.  It is a remarkably
simple one, which was the point.  Note that it was decomposed
into meaningful subunits -- this is important!  Just having a
formal grammar (syntax) is not sufficient for good semantic
processing.  (People often forget this.)

> And I thought we were trying to make life simple!  The main problem here
> is that we are trying to impose structure on unstructured data, which
> is probably not the best approach.

Text files certainly are structured, although it's a rather
flexible structure.  One might argue that dividing text into
lines is artificial, but the concept of a "line of text" is
useful in many text-processing programs (e.g., "grep").
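
For instance, with "line of text" as the unit, the skeleton of a
grep-like filter is the same loop with a test in the middle.  A sketch
only, with strstr standing in for real pattern matching:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        char line[512];

        if (argc != 2) {
                fprintf(stderr, "usage: %s string\n", argv[0]);
                return 2;
        }
        /* Copy to stdout only those lines containing the string. */
        while (fgets(line, sizeof line, stdin) != NULL)
                if (strstr(line, argv[1]) != NULL)
                        fputs(line, stdout);
        return 0;
}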

> Sentinels are a wonderful way of implementing lists, but a terrible way
> of implementing strings.  Hint, hint.

Oh, foo.  Both the count+data and NUL-terminated representations
for character strings have good and bad points.  I've used both
and prefer C's approach for most routine programming.

If the point of the correspondent was that FILES-11 variable-
length record format is easier to work with, he deserves a
large horse laugh.  See "Software Tools" for examples of the
use of UNIX-like text file formats in programs.