[comp.lang.misc] Text or data files?

sommar@enea.se (Erland Sommarskog) (09/26/88)

(This stems from the discussions of VMS and Unix file systems. I've
added comp.lang.misc and directed follow-ups there. It doesn't
belong in comp.os.vms anymore, and I don't read comp.unix.wizards.)

Miles O'Neal (meo@stiatl.UUCP) writes:
>If you even have your data files as text files, debugging
>becomes much easier. For instance, would you rather debug
>98764389437034gh307ytfhr398f39
>or
>12/22/88 01:30 10790 100 100 382 -1
>?
>These are not real data, but examples of what data files I've dealt
>with looked like. The processing to do all this is cheap nowadays,
>so why not use text files if there is no OVERWHELMING reason not to?

There are several advantages of using fixed record files for data,
if the data I have fits that format. Let's say we have: (This is
Pascal.)

    Data_record = RECORD
                     Date : PACKED ARRAY(.1..8.) OF char;
                     Time : PACKED ARRAY(.1..8.) OF char;
                     Incident       : Incident_type;   (* Enumerated *)
                     No_of_warnings : integer;
                     Alarmed        : boolean;
                     Username       : PACKED ARRAY(.1..12.) OF char;
                  END;

The simplest way to read and write this is through a FILE OF Data_record,
if no other program is to read it. If we store the data in a text file, we
have to parse every line we get. (And what trouble if Username contains
an LF character.) And since the text file is vulnerable to changes from 
e.g. a text editor, we cannot be sure that the file follows the supposed
format throughout.
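
(For C readers, a rough sketch of the same idea; the enum, field sizes,
and function names are assumptions standing in for the Pascal
declarations above, not code from any actual program:)

    #include <stdio.h>

    typedef enum { INC_LOGIN, INC_LOGOUT, INC_BREAKIN } Incident_type;

    typedef struct {
        char          date[8];      /* fixed width, no terminator, */
        char          time[8];      /* as in the PACKED ARRAYs     */
        Incident_type incident;
        int           no_of_warnings;
        int           alarmed;      /* boolean */
        char          username[12];
    } Data_record;

    /* The whole record moves to and from disk in one call and
       nothing is parsed, but the disk image depends on the
       compiler's struct layout. */
    int put_record(FILE *f, const Data_record *r)
    {
        return fwrite(r, sizeof *r, 1, f) == 1;
    }

    int get_record(FILE *f, Data_record *r)
    {
        return fread(r, sizeof *r, 1, f) == 1;
    }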

As for debugging, this only applies if you have to look at the file as such.

>Another thing this buys you is that, in my experience, it's easier
>to change file formats if you use text files. It requires a little
>planning, but in general is a lot less work than doing the same
>thing with any other type of data.

Uh? If you have a text file and change the format you have to rewrite
the parsing and writing-to-file parts. With a fixed format you
change the declaration, and that's all. (Well, you may have to write
a simple program to convert old files, but you have to do that for
text files too.)
-- 
Erland Sommarskog            ! "Hon ligger med min bäste vän,
ENEA Data, Stockholm         !  jag vågar inte sova längre", Orup
sommar@enea.UUCP             ! ("She's making love with my best friend,
                             !   I dare not sleep anymore")

meo@stiatl.UUCP (Miles O'Neal) (09/27/88)

In article <3958@enea.se>, sommar@enea.se (Erland Sommarskog) writes:
> There are several advantages of using fixed record files for data,
> if the data I have fits that format.

Either you missed the point, or I didn't make it well. The issue I was
addressing was not whether the record format was variable, but
whether the format was human readable. And if you are developing
anything remotely complex, being able to read and "text process"
the data files is very handy for debugging. I believe that was the
context of the original discussion. I agree that, usually, fixed
format is better, unless space is a problem.
=====================================================================
Any opinions, facts, fantasies, theories, fallacies and/or lies which
may or may not have been expressed in this posting or elsewhere,  may
or may not belong  to myself,  my employer,  the building management,
Atlanta, Fulton County, public and/or private utilities serving such,
the State of GA, the USA,  or God himself. No other warranties apply,
implied or stated. For further information, call 1(800)HOO-HAHA.
---------------------------------------------------------------------
Miles O'Neal                                 decvax!gatech!stiatl!meo

mmengel@cuuxb.ATT.COM (Marc W. Mengel) (09/27/88)

In article <3958@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>
>Miles O'Neal (meo@stiatl.UUCP) writes:
>>If you even have your data files as text files, debugging
>>becomes much easier. For instance, would you rather debug
>>98764389437034gh307ytfhr398f39
>>or
>>12/22/88 01:30 10790 100 100 382 -1
>>?
>
>There are several advantages of using fixed record files for data,
>if the data I have fits that format. Let's say we have: (This is
>Pascal.)
>    Data_record = RECORD
>                     Date : PACKED ARRAY(.1..8.) OF char;
>                     Time : PACKED ARRAY(.1..8.) OF char;
>                     Incident       : Incident_type;   (* Enumerated *)
>                     No_of_warnings : integer;
>                     Alarmed        : boolean;
>                     Username       : PACKED ARRAY(.1..12.) OF char;
>                  END;
>
>The simplest way to read and write this is through a FILE OF Data_record,
>if no other program is to read it.

Two major problems with this idea.  The first is that most of the time
other programs will need to read the data sooner or later.  Second, when
files are written in a binary format like this, the same program cannot
read the data when run on a different machine with a different byte
ordering, so after you have built up a list of 2000 incidents, and have
to move to a new machine, you lose big time.  You have a data file
with packed records in it, and you (the programmer) have *no idea* how
the data is actually formatted.
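
(A minimal C sketch of the hazard: the same integer has a different
byte image on different machines, so a record fwrite()n on one of them
reads back scrambled on the other:)

    #include <stdio.h>

    int main(void)
    {
        unsigned int x = 1;
        const unsigned char *b = (const unsigned char *)&x;
        unsigned int i;

        /* prints "01 00 00 00" on a little-endian machine and
           "00 00 00 01" on a big-endian one */
        for (i = 0; i < sizeof x; i++)
            printf("%02x ", b[i]);
        putchar('\n');
        return 0;
    }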

> If we store the data in a text file, we
>have to parse every line we get. (And what trouble if Username contains
>an LF character.) And since the text file is vulnerable to changes from 
>e.g. a text editor, we cannot be sure that the file follows the supposed
>format throughout.

It's true, you have to parse some of the data file (the numbers), but
even Pascal gives you a means of writing and reading integers of a
fixed width.  Since Pascal has a problem with newlines, you can either
write them with an escape sequence like '\n', or just check Pascal's
eoln() function before reading a character.

>As for debugging, this only applies if you have to look at the file as such.

Clearly, one would never need to look at the data file a program uses
to determine if a value is getting trashed before being written to 
the file, or after being read back in. (HEAVY sarcasm here)

>>Another thing this buys you is that, in my experience, it's easier
>>to change file formats if you use text files. It requires a little
>>planning, but in general is a lot less work than doing the same
>>thing with any other type of data.
>
>Uh? If you have a text file and change the format you have to rewrite
>the parsing and writing-to-file parts. With a fixed format you
>change the declaration, and that's all. (Well, you may have to write
>a simple program to convert old files, but you have to do that for
>text files too.)

What's so tough about fixed format text parsing?  Every language known
to mankind can read and write integers, etc. from a file.  The format
is easily extensible, you *can* add records with a text editor, you 
can debug your code much more easily, you can write programs in other
languages or on other machines that can read your data files, you can
use Unix utilities like grep and sed and awk to make useful reports...
The list goes on and on.
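
(As a sketch of how little code that parsing is, here is the record
from the earlier example as one text line per record; the exact field
widths and names are assumptions:)

    #include <stdio.h>

    typedef struct {
        char date[9], time[9];          /* 8 chars plus NUL */
        int  incident, no_of_warnings, alarmed;
        char username[13];
    } Data_record;

    /* one blank-separated text line per record; assumes the
       string fields themselves contain no blanks */
    int put_record(FILE *f, const Data_record *r)
    {
        return fprintf(f, "%s %s %d %d %d %s\n",
                       r->date, r->time, r->incident,
                       r->no_of_warnings, r->alarmed,
                       r->username) > 0;
    }

    int get_record(FILE *f, Data_record *r)
    {
        return fscanf(f, "%8s %8s %d %d %d %12s",
                      r->date, r->time, &r->incident,
                      &r->no_of_warnings, &r->alarmed,
                      r->username) == 6;
    }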

Binary data saves you a whole 20 minutes to an hour when writing 
the program, makes your data files unportable between machines,
makes you have to use the same language you started with on the
same machine as long as you want to use the same data files...
This list also goes on and on.

If you're writing real, production programs that are useful to people
and generate data files, the people who use them may want to use them on
a different machine 5 years from now; they will read in the source and
data file from a tape onto their new machine, and their data will
be garbage, because the 1's will be 256's, or 2048's, etc. due to
byte order on the new machine.

If they move to this new machine because the old one is dead and they can't
get parts any more, and then find they can't read their old accounting records,
they will find worse names to call you than you can imagine.



-- 
 Marc Mengel			       
 mmengel@cuuxb.att.com
 attmail!mmengel	
 {lll-crg|mtune|att}!cuuxb!mmengel

sommar@enea.se (Erland Sommarskog) (10/01/88)

I had the example:
>>    Data_record = RECORD
>>                     Date : PACKED ARRAY(.1..8.) OF char;
>>                     Time : PACKED ARRAY(.1..8.) OF char;
>>                     Incident       : Incident_type;   (* Enumerated *)
>>                     No_of_warnings : integer;
>>                     Alarmed        : boolean;
>>                     Username       : PACKED ARRAY(.1..12.) OF char;
>>                  END;
>>
>>The simplest way to read and write this is through a FILE OF Data_record,
>>if no other program is to read it.

Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>Two major problems with this idea.  The first is that most of the time
>other programs will need to read the data sooner or later.  

If we have data that are to be read by more than one program, both
programs can import the declaration of the data record from a common
source, and thus they do not need to be rewritten if the format is
changed. With a text file we can achieve the same effect with common
procedures for reading and writing, to be imported together with the data
definition, giving a higher degree of dependency. Also, now we have the
problem that for one change we have to edit in three places: the read
and write routines and the data definition, introducing a source of
error.
  If you have many programs that are to read the same data, you are 
likely to get a database system, and I don't think they store data 
in a text-file format...
  The only case when I can see that this argument is valid is when 
"the other program" is standard a text-oriented utility.   

>Second, when
>files are written in a binary format like this, the same program cannot
>read the data when run on a different machine with a different byte
>ordering, so after you have built up a list of 2000 incidents, and have
>to move to a new machine, you lose big time.  

A valid point. However, text files are not necessarily compatible either. 
Imagine that the data record above has a message field, 80 characters 
long. Assume that the program started its life on VMS and that one of 
the messages contains a CR-LF. Now we move to a Unix system... And I 
have seen Pascal systems that gladly read 123 from the line "123ABD", 
and those that choke, saying "invalid integer". Both these problems
can be avoided with careful programming, it should be added.

>You have a data file with packed records in it, and you (the programmer) 
>have *no idea* how the data is actually formatted.

Isn't this the point? I always thought that as high a level of abstraction
as possible was a good thing. You don't need to know the actual disk
format until you really have a need to move the file.

>It's true, you have to parse some of the data file (the numbers), but
>even Pascal gives you a means of writing and reading integers of a
>fixed width.  

The problem is that you often have little use for these standard 
routines, unless you can accept that the program crashes because there 
was a letter where you expected a number. Storing data in text files 
gives you a bigger problem with data integrity than with binary
files.

>What's so tough about fixed format text parsing?  
>...
>you *can* add records with a text editor, 

A plus, but applying the text editor is clearly a violence on data 
integrity.

>you can debug your code much more easily, 

Since I have less code, binary files win here, as long as I have
a good debugger around.

>you can write programs in other languages 

If you work on VMS and have CDD (Common Data Dictionary) around, this
is possible with binary files too. (With CDD you can write data
definitions in a specific data definition language. Several of DEC's
compilers supply a DICTIONARY directive to import these definitions.)

>Binary data saves you a whole 20 minutes to an hour when writing 
>the program, 

It saves you more time than that. You don't have to think so much about
integrity checks, you have fewer problems changing the format during
development, and maintenance benefits from the reduced code volume.

I'm not saying that you should never use text files for storing data.
In many cases this may be very desirable. I once wrote a simple
text formatter with an interactive syllabication facility. I stored
the syllabications in a text file, since I realized that the user
wanted to be able to remove an erroneous syllabication, and I didn't
want to write a tool for maintaining the file.
  What to use is a decision the programmer has to make based on the
requirements on portability (+ for text), performance (+ for binary),
data integrity (+ for binary) and so on. Generally it seems to me that
for a cheap system with low requirements on integrity and maintainability
a text file is the natural choice. But as the complexity and the amount
of data grow you are likely to choose binary files, and eventually
you pass the line where you need a database management system.
-- 
Erland Sommarskog            
ENEA Data, Stockholm         
sommar@enea.UUCP             

allbery@ncoast.UUCP (Brandon S. Allbery) (10/03/88)

As quoted from <3958@enea.se> by sommar@enea.se (Erland Sommarskog):
+---------------
| Uh? If you have a text file and change the format you have to rewrite
| the parsing and writing-to-file parts. With a fixed format you
| change the declaration, and that's all. (Well, you may have to write
| a simple program to convert old files, but you have to do that for
| text files too.)
+---------------

The last time I added a field to the UNaXcess userfile I added a ":%s" to
the fprintf() write and a "getfld(buf,var)" to the read.  Trivial.  And
because of the way getfld() is written and the fact that new fields always
go at the end of the line (record), the userfile is self-upgrading:  if the
field is missing, a default value is used and the field is added on output.
Try THAT with a binary file!
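
(The getfld() above is UNaXcess code not shown here; a hypothetical
reconstruction of the idea, with the name and a ':' separator assumed
from the description, might look like:)

    #include <stdio.h>

    /* Copy the next ':'-separated field of the record into var.
       If the line has run out of fields (an old-format record),
       var keeps the default the caller preloaded, and the next
       fprintf() of the record writes that default out, upgrading
       the file in place. */
    static void getfld(char **recp, char *var, size_t varsize)
    {
        char  *p = *recp;
        size_t i = 0;

        if (p == NULL || *p == '\0')
            return;                 /* missing field: keep default */
        while (*p != '\0' && *p != ':') {
            if (i + 1 < varsize)
                var[i++] = *p;
            p++;
        }
        var[i] = '\0';
        *recp = (*p == ':') ? p + 1 : p;
    }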

As for newlines in "records", etc.:  there is a backslash-escape convention
that deals with this quite well.  But I rarely, if ever, find a need to
store newlines in data records.
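
(One common form of that convention, sketched in C; the reader
reverses the mapping:)

    #include <stdio.h>

    /* Write a string as a single-line text field: embedded
       newlines and backslashes are escaped so one record
       always stays one line. */
    void put_escaped(FILE *f, const char *s)
    {
        for (; *s != '\0'; s++) {
            if (*s == '\n')
                fputs("\\n", f);
            else if (*s == '\\')
                fputs("\\\\", f);
            else
                fputc(*s, f);
        }
    }
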
-- 
Brandon S. Allbery, uunet!marque!ncoast!allbery			DELPHI: ALLBERY
	  For comp.sources.misc send mail to <backbone>!sources-misc
comp.sources.misc is moving off ncoast -- please do NOT send submissions direct
ncoast's days are numbered -- please send mail to ncoast!system if you can help

nevin1@ihlpb.ATT.COM (Liber) (10/05/88)

In article <3967@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>I had the example:
>>>    Data_record = RECORD
>>>                     Date : PACKED ARRAY(.1..8.) OF char;
>>>                     Time : PACKED ARRAY(.1..8.) OF char;
>>>                     Incident       : Incident_type;   (* Enumerated *)
>>>                     No_of_warnings : integer;
>>>                     Alarmed        : boolean;
>>>                     Username       : PACKED ARRAY(.1..12.) OF char;
>>>                  END;
>>>
>>>The simplest way to read and write this is through a FILE OF Data_record,
>>>if no other program is to read it.

>Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>>Two major problems with this idea.  The first is that most of the time
>>other programs will need to read the data sooner or later.  

>If we have data that are to be read by more than one program, both
>programs can import the declaration of the data record from a common
>source, and thus they do not need to be rewritten if the format is 
>changed.

This assumes that all the programs are not only run on the same type of
machine and operating system, but that they are written in the same
language using the same compiler (things like packed arrays are not only
*machine* dependent and *operating system* dependent, they are *language*
dependent, *compiler* dependent, and in some cases are even *optimization*
dependent).  This is unnecessarily restrictive, and typically not practical
in commercial environments.

Regardless of whether I use text files or binary files, I would
rather write my own read/write routines (even if they only call the
standard ones) than be dependent on my compiler.

>Also, now we have the
>problem that for one change we have to edit in three places: the read
>and write routines and the data definition, introducing a source of
>error.

But you have gained an interface layer (is it time to throw the
'object-oriented' buzzword around yet? :-))!  Except for the read/write
routines, the rest of the program is independent of the way the data is
stored on disk.  This is a big advantage!  (Note:  this
advantage comes from the argument, not from the type of data file
used.)

Suppose you decide to delete one of the fields stored on disk (because
it can be calculated, for instance), but you want the field available
for the rest of the program.  If you didn't bother to put the interface
layer in, this is a maintenance nightmare.
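
(A sketch of that situation, with illustrative names; only the read
routine knows the field is gone from the disk format:)

    #include <stdio.h>

    typedef struct {
        int width, height;
        int area;           /* computed, no longer stored on disk */
    } Shape;

    /* The rest of the program still sees a filled-in area;
       deleting the stored field changed only this routine. */
    int get_shape(FILE *f, Shape *s)
    {
        if (fscanf(f, "%d %d", &s->width, &s->height) != 2)
            return 0;
        s->area = s->width * s->height;
        return 1;
    }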

>  If you have many programs that are to read the same data, you are 
>likely to get a database system, and I don't think they store data 
>in a text-file format...

You wouldn't necessarily want a prepackaged DBMS.  There is usually a
lot of overhead associated with DBMS systems, and you have to decide
whether it is worth it.

>  The only case when I can see that this argument is valid is when 
>"the other program" is standard a text-oriented utility.   

Well, if you're on a Un*x (sorry about the '*' in place of the 'i',
but Legal is talking about trademark protection again) system, this may
be very desirable.  You can use all your familiar tools (like grep,
sed, etc.) to do many of your manipulations.

>>Second, when
>>files are written in a binary format like this, the same program cannot
>>read the data when run on a different machine with a different byte
>>ordering, so after you have built up a list of 2000 incidents, and have
>>to move to a new machine, you lose big time.  

>A valid point. However, text files are not necessarily compatible either. 
>Imagine that the data record above has a message field, 80 characters 
>long. Assume that the program started its life on VMS and that one of 
>the messages contains a CR-LF. Now we move to a Unix system... And I 
>have seen Pascal systems that gladly read 123 from the line "123ABD", 
           ^^^^^^ need I say more? :-)
>and those that choke, saying "invalid integer".

Yes, but this isn't a deficiency of the file format; it is a deficiency
of the implementation of the programming language (I knew this
discussion was somehow relevant to this group :-)).  So far, your only
valid argument for using binary files instead of text files is that it
is cumbersome to do text manipulation with languages such as Pascal,
Modula-2, etc.

>>You have a data file with packed records in it, and you (the programmer) 
>>have *no idea* how the data is actually formatted.

>Isn't this the point? I always thought that as high a level of abstraction
>as possible was a good thing. You don't need to know the actual disk
>format until you really have a need to move the file.

But some of us don't plan on using the same machine forever (or even
for one year).  I would hate to have to write conversion programs every
time I needed to port something.  The problem with abstraction is that
if the model wasn't designed just right, you typically have to find a way
around it.  I would much rather be able to design the model for
abstraction from the ground up than be forced into using what
someone else thought would be good enough.  Standard Pascal does
not give me these primitives when it comes to files; other languages
do.  All you have done here is point out another problem with the
language, not the data format.

>>It's true, you have to parse some of the data file (the numbers), but
>>even Pascal gives you a means of writing and reading integers of a
>>fixed width.  

>The problem is that you often have little use for these standard 
>routines, unless you can accept that the program crashes because there 
>was a letter where you expected a number.

Again, a deficiency of the programming language, not of the data format.
In C, people use the standard routines with no problems; they don't
ungracefully crash when an error occurs like Wirth-type languages do.

>Storing data in text files 
>gives you a bigger problem with data integrity than with binary
>files.

Actually, the opposite is true.  Since the effective data is more
compressed in binary formats (if this wasn't true, there would be nothing
that would distinguish text formats from binary formats), it is more likely
that a data error will go by unnoticed.

>>you *can* add records with a text editor, 

>A plus, but applying the text editor is clearly a violence on data 
>integrity.

What makes a text editor a 'violence on data integrity'
any more than someone hacking together a program to modify the data?
The latter is probably worse, since it is much harder to check the
integrity within a program than by just looking at it through an
editor.

Besides, whenever I need a binary format for data, I use a hex-oriented
file editor.  It is an essential debugging tool (especially if the data
gets corrupted).  The existence of this editor has no bearing on my
data integrity.  I take other precautions regardless of the data
format (eg, setgid to a special group for Un*x-based systems where the data
is to be shared).

>>you can debug your code much more easily, 

>Since I have less code, binary files win here, as long as I have
>a good debugger around.

It had better not let you modify your data file!  As you said, that
would be a 'violence on data integrity'.

Also, it is much easier to find an error in a text file than it is in a
binary data file.  (Why do you use a good debugger in the first place?
Usually so that you can see a symbolic representation of your program.
In other words, you need to look at a text format.)  The
error you find in the data can be traced back to the program.  It is a
useful debugging technique.  Your code is only shorter by half a dozen
procedures, and it is much more interdependent than mine.  I
would think that your method would lead to more errors than mine.

>You don't have to think so much about 
>integrity checks,

Strike one!  The integrity checks are more complicated.

>you have fewer problems changing the format during development,

Strike two!  If one of the data elements changes storage formats, the
rest of the program has to be checked for dependencies.

>maintenance benefits from the reduced code volume.

Strike three!  Although there is less NCSL (non-commented source
lines), the code is more interdependent and hence more complex,
resulting in higher maintenance costs.

>  What to use is a decision the programmer has to make based on the
>requirements on portability (+ for text), performance (+ for binary),

I agree (finally :-)) with these two.


Most of the points that you brought up came about because it is much
harder to do rigorous text manipulation in the language you were using.
If you are stuck using a restrictive language, then this is a valid
point.  (BTW, I'm not trying to start a C vs Pascal debate.  Different
languages have different strong points and different weaknesses.  Due to
other factors, we can't always use the language best suited for the
task.)  But don't use this to say that text is worse than binary when
the real problem is with the language, not the format.
-- 
 _ __		NEVIN J. LIBER  ..!att!ihlpb!nevin1  (312) 979-4751  IH 4F-410
' )  )  "I catch him with a left hook. He eels over. It was a fluke, but there
 /  / _ , __o  ____  he was, lying on the deck, flat as a mackerel - kelpless!"
/  (_</_\/ <__/ / <_	As far as I know, these are NOT the opinions of AT&T.

sommar@enea.se (Erland Sommarskog) (10/09/88)

(>> is me.)

Nevin J Liber (nevin1@ihlpb.UUCP) writes:
>Regardless of whether I use text files or binary files, I would
>rather write my own read/write routines (even if they only call the
>standard ones) than be dependent on my compiler.

Of course, no matter what type of files we use, we should encapsulate
the disk I/O routines for our data structure. What the rest of the 
program should see is just Get(put)_one_record(Data) where Data is 
of some type. And these routines are easier to maintain if they 
simply write Data (or Data.all) in its binary format to the disk.

>This assumes that all the programs are not only run on the same type of
>machine and operating system, but that they are written in the same
>language using the same compiler (things like packed arrays are not only
>*machine* dependent and *operating system* dependent, they are *language*
>dependent, *compiler* dependent, and in some cases are even *optimization*
>dependent).  This is unnecessarily restrictive, and typically not practical
>in commercial environments.

This is true if you can't call your common interface routines from
another language. And if you can't, well, you have a maintenance
problem no matter the file format. As I also mentioned in my 
previous article, a tool like VAX CDD is a help in a multi-language 
environment.

>>  If you have many programs that are to read the same data, you are 
>>likely to get a database system, and I don't think they store data 
>>in a text-file format...
>
>You wouldn't necessarily want a prepackaged DBMS.  There is usually a
>lot of overhead associated with DBMS systems, and you have to decide
>whether it is worth it.

And there is a lot of development overhead associated with not using
a DBMS. Did I hear NIH?

>>  The only case when I can see that this argument is valid is when 
>>"the other program" is standard a text-oriented utility.   
>
>Well, if you're on a Un*x (...) system, this may
>be very desirable.  You can use all your familiar tools (like grep,
>sed, etc.) to do many of your manipulations.

Agreed. Just because I said it was the only case doesn't mean that it's
unimportant. (Side note: On a system like VMS you still have some
use for SEARCH, DIFFERENCES etc. for binary files, since they recognize
the file format.)

>>The problem is that you often have little use for these standard 
>>routines, unless you can accept that the program crashes because there 
>>was a letter where you expected a number.
>
>Again, a deficiency of the programming language, not of the data format.
>In C, people use the standard routines with no problems; they don't
>ungracefully crash when an error occurs like Wirth-type languages do.

Possibly C handles this case better than other languages do. All languages
that I've seen protest in some way when they get a non-digit when
trying to read an integer. Not all of them crash though. Simula and 
standard Pascal do. Ada and Fortran have exception mechanisms to help
you. (But I wonder what C does? If I guess, it sets some error variable
that you can forget to check, returns zero for the integer, and
doesn't move the current position in the file, so when you're reading the 
following string field you're starting in the wrong place. This would 
be just as bad as simply crashing.)

>>Storing data in text files 
>>gives you a bigger problem with data integrity, than with binary
>>files.
>
>Actually, the opposite is true.  Since the effective data is more
>compressed in binary formats (if this wasn't true, there would be nothing
>that would distinguish text formats from binary formats), it is more likely
>that a data error will go by unnoticed.

Whether a binary file is more compressed than the corresponding
text file depends on the data. With many numbers it's true,
but with many string fields, you can save disk space with a text
file, since you don't have to store trailing blanks.
  The size has little to do with the integrity. The assumption is
that no sane person would start to edit a binary file "by hand",
but you can't overlook this case for a text file. If we can assume
that the file is only accessed through the common I/O routines
mentioned earlier, we are assured that format integrity is maintained.

-- 
Erland Sommarskog            
ENEA Data, Stockholm         
sommar@enea.UUCP             

nevin1@ihlpb.ATT.COM (Liber) (10/12/88)

In article <3980@enea.se> sommar@enea.se (Erland Sommarskog) writes:

ES> Of course, no matter what type of files we use, we should encapsulate
ES> the disk I/O routines for our data structure. What the rest of the 
ES> program should see is just Get(put)_one_record(Data) where Data is 
ES> of some type.

Agreed.

ES> And these routines are easier to maintain if they 
ES> simply write Data (or Data.all) in its binary format to the disk.

Only if I want to write all the data that is contained in my record in
the format that is in my record.  This, in my experience, is not
usually the case.

NL> This assumes that all the programs are not only run on the same type of
NL> machine and operating system, but that they are written in the same
NL> language using the same compiler (things like packed arrays are not only
NL> *machine* dependent and *operating system* dependent, they are *language*
NL> dependent, *compiler* dependent, and in some cases are even *optimization*
NL> dependent).  This is unnecessarily restrictive, and typically not practical
NL> in commercial environments.

ES> This is true if you can't call your common interface routines from
ES> another language. And if you can't, well, you have a maintenance
ES> problem no matter the file format.

What happens when you no longer have the original
source code?  With a text file, it is fairly easy to figure out the
data format (eg:  the uuencoding scheme is extremely easy to figure
out, and I did so when I couldn't find a uudecode program around for my
PC).  With a non-compressed binary format, it is a little tougher (you
have to know how integers are represented on the target machine, use
some good test data, etc.), especially if you think it is 'not sane' to
hand-edit a binary file.  If your data is compressed at all (like
Pascal's packed arrays), you had better know the compression scheme or
figuring out the format will be very difficult.

ES> As I also mentioned in my 
ES> previous article, a tool like VAX CDD is a help in a multi-language 
ES> environment.

Since I haven't seen the CDD, I cannot comment on it.

NL> You wouldn't necessarily want a prepackaged DBMS.  There is usually a
NL> lot of overhead associated with DBMS systems, and you have to decide
NL> whether it is worth it.

ES> And there is a lot of development overhead associated with not using
ES> a DBMS. Did I hear NIH?

No, you didn't hear NIH.  What I meant by deciding whether or not it is
worth having a DBMS is whether or not the overhead of a DBMS outweighs
the overhead of development without it.  Sometimes adding a DBMS *adds*
overhead to development (you have to learn the interface to your
language, you have to learn the DBMS, etc.).  This topic, however, is
not appropriate for comp.lang.misc.  If you wish to discuss it, move it
to comp.databases.  Nuff said.

ES> (Side note: On a system like VMS you still have some
ES> use for SEARCH, DIFFERENCES etc for binary files, since they recognize
ES> the file format.)

These tend to be very limited.

NL> Again, a deficiency of the programming language, not of the data format.
NL> In C, people use the standard routines with no problems; they don't
NL> ungracefully crash when an error occurs like Wirth-type languages do.

ES> Possibly C handles this case better than other languages do. All languages
ES> that I've seen protest in some way when they get a non-digit when
ES> trying to read an integer. Not all of them crash though. Simula and 
ES> standard Pascal do. Ada and Fortran have exception mechanisms to help
ES> you.

ES> (But I wonder what C does? If I guess, [...]
ES> [very bad guess deleted]

The C function strtol (string to long) takes a string and converts it
into a long int.  It ignores leading whitespace and it scans until it
finds a character which is inconsistent with the base.  It returns
the converted number and a pointer to the character which terminated
the scan.  This is much more graceful than the standard Pascal
solution.
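
For example, fed the "123ABD" line from earlier:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *line = "123ABD";
        char *end;
        long  n = strtol(line, &end, 10);

        if (end == line)
            printf("no number at all\n");
        else    /* prints: read 123, stopped at "ABD" */
            printf("read %ld, stopped at \"%s\"\n", n, end);
        return 0;
    }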

ES> Whether a binary file is more compressed than the corresponding
ES> text file depends on the data. With many numbers it's true,
ES> but with many string fields, you can save disk space with a text
ES> file, since you don't have to store trailing blanks.

Not a valid point.  Since we were talking about *fixed-format*, the
trailing blanks have to be included whether or not we are using text
files or binary files.  Fixed-format binary files are more compressed
than corresponding fixed-format text files (with the possible exception
of all-text files).

ES>   The size has little to do with the integrity. The assumption is
ES> that no sane person would start to edit a binary file "by hand",
ES> but you can't overlook this case for a text file.

I guess I'm insane :-), but let me give you a few examples where I had
to edit a binary file by hand.

Ever had a head crash on a disk drive?  On a PC, guess where the head
usually resides.  On the file that was last accessed.  Guess what that
file usually is.  It's usually the file containing the current
directory.  On numerous occasions, I had to reconstruct a sector of a
directory, and the only way I know to do this is to go in and edit the
sector with a binary editor.

Another example:  every once in a while someone deletes a file that
they wanted to save.  On PCs, the file isn't actually removed; the
directory entry is flagged and the space that the file occupied is put
back on the free list.  By going in and hand editing the directory
file, it is a very simple process to undelete the file.

Also, being able to hand edit a binary file is a useful debugging
tool.

ES> If we can assume
ES> that the file is only accessed through the common I/O routines
ES> mentioned earlier, we are assured that format integrity is maintained.

Not true, either.

Still another example:  A word processor that I was using would mark
the file that I was editing in such a way that no one else using the
word processor would be able to edit this file.  Guess what happened
when the system came down.  The file was permanently marked open, and
the word processor did not have an option for unmarking a file (until
the next release, anyway).  Without hand editing it, I would have to
scrap the file.  With the binary editor, however, I was able to change
the 1-bit marking flag, and I lost no work.  The format was corrupted
by having my process interrupted (although this can happen with text
files, too, the recovery is much easier with text files).
-- 
 _ __		NEVIN J. LIBER  ..!att!ihlpb!nevin1  (312) 979-4751  IH 4F-410
' )  )  "I catch him with a left hook. He eels over. It was a fluke, but there
 /  / _ , __o  ____  he was, lying on the deck, flat as a mackerel - kelpless!"
/  (_</_\/ <__/ / <_	As far as I know, these are NOT the opinions of AT&T.

wsmith@m.cs.uiuc.edu (10/12/88)

>>Uh? If you have a text file and change the format you have to rewrite
>>the parsing and writing-to-file parts. With a fixed format you
>>change the declaration, and that's all. (Well, you may have to write 
>>a simple program to convert old files, but you have to do that for
>>text files too.)
>
>What's so tough about fixed format text parsing?  Every language known
>to mankind can read and write integers, etc. from a file.  The format
>is easily extensible, you *can* add records with a text editor, you 
>can debug your code much more easily, you can write programs in other
>languages or on other machines that can read your data files, you can
>use Unix utilities like grep and sed and awk to make useful reports...
>The list goes on and on.

I would like to address an orthogonal issue, one that can be implemented
equally well with text files or binary files: how do you portably save
a data structure with pointers?  If the data structure has a canonical
order, it may be traversed via some implicit spanning tree, and all of
the pointers can be stored implicitly.

If the data structure is more complex, I use some overhead in each record
to hold an integer mark that I use while traversing the data structure.
First, I set each mark to a negative number (using the same marking
algorithm as I use to traverse the data structure).  Once all the 
marks are negative, I traverse the data structure again.  This time, I output
all of the data plus an integer for each pointer.  If the mark for the 
pointer is positive, I output the mark.  If the mark is negative, I reset
the mark to the next available positive mark, output that number, and add
the record to a queue of unprocessed records.  At the end the queue will
be empty.  This algorithm is described in "Marking Algorithms",
Lars-Erik Thorelli, BIT 12 (1972), pp. 555-68.

To read the data structure back, I create an array with one null pointer
for each record in the data structure.  (In the first pass that sets the
marks to a negative number, I also keep a count of the number of records.
The first integer output is the number of records, so that I may allocate
the array when I start reading in the data.)  When a pointer is read in,
this array is consulted.  If the pointer is non-null, the record has been
read in and that pointer may be used as a reference to the actual record.
If the pointer is null, the record has not been read in yet.  Save the index
in a queue along with a reference to the pointer that needs to be fixed when
the record is read in.  After a record is read in, the next element of the
queue is read.  If the pointer has already been filled in, just fix up the
reference with the value from the array; otherwise, read it in as a new
record.  Continue until the queue is empty.
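
(A simplified C sketch of the scheme: each record here has two pointer
fields, marks are assumed negative between dumps, and the reader
allocates the whole array up front instead of keeping the fixup queue,
which the record count at the head of the file makes possible.  All
names are illustrative:)

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct node {
        int          value;
        struct node *a, *b;     /* may be NULL, shared, or cyclic */
        long         mark;      /* negative except while dumping  */
    } Node;

    /* Pass 1: assign positive marks breadth-first.  The queue must
       be large enough to hold every reachable record; returns the
       record count. */
    static long enumerate(Node *root, Node **queue)
    {
        long head, tail = 0;

        if (root == NULL)
            return 0;
        root->mark = tail;
        queue[tail++] = root;
        for (head = 0; head < tail; head++) {
            Node *out[2];
            int   i;

            out[0] = queue[head]->a;
            out[1] = queue[head]->b;
            for (i = 0; i < 2; i++)
                if (out[i] != NULL && out[i]->mark < 0) {
                    out[i]->mark = tail;
                    queue[tail++] = out[i];
                }
        }
        return tail;
    }

    static long id_of(const Node *n)
    {
        return n != NULL ? n->mark : -1;    /* -1 encodes NULL */
    }

    /* Pass 2: the record count first, then one line per record:
       the value plus one integer per pointer. */
    static void dump(FILE *f, Node *root, Node **queue)
    {
        long n = enumerate(root, queue), i;

        fprintf(f, "%ld\n", n);
        for (i = 0; i < n; i++) {
            fprintf(f, "%d %ld %ld\n", queue[i]->value,
                    id_of(queue[i]->a), id_of(queue[i]->b));
            queue[i]->mark = -1;        /* restore the invariant */
        }
    }

    /* Reading: allocate all n records first, so the stored integers
       are just indexes into the array.  Record 0 is the root. */
    static Node *load(FILE *f)
    {
        long  n, i, ia, ib;
        Node *v;

        if (fscanf(f, "%ld", &n) != 1 || n <= 0)
            return NULL;
        if ((v = malloc((size_t)n * sizeof *v)) == NULL)
            return NULL;
        for (i = 0; i < n; i++) {
            if (fscanf(f, "%d %ld %ld", &v[i].value, &ia, &ib) != 3) {
                free(v);
                return NULL;
            }
            v[i].a = ia >= 0 ? &v[ia] : NULL;
            v[i].b = ib >= 0 ? &v[ib] : NULL;
            v[i].mark = -1;
        }
        return v;
    }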

>
>Binary data saves you a whole 20 minutes to an hour when writing 
>the program, makes your data files unportable between machines,
>makes you have to use the same language you started with on the
>same machine as long as you want to use the same data files...
>This list also goes on and on.
>

If you do more work on your binary data format, you can make it portable
across architectures.  The hard porting problems are ASCII vs. EBCDIC
and computers with non-8-bit bytes; these will cause some portability
headaches, but they are not impossible.  The trick, I think, is to define a standard
byte stream format for each data type (including floating point, if you
use it).  Then, when you want to write a given data type into the file, you
call a subroutine for that specific data type.  The subroutine could
write in binary, ASCII, or some home-brew BCD format, because that is
merely an implementation detail.  Pointers will map onto the integer type 
as described above.  
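
(The integer subroutine of such a scheme might look like this: one
fixed byte order on disk, most significant byte first, whatever the
host machine uses internally.  Names are illustrative:)

    #include <stdio.h>

    int put_u32(FILE *f, unsigned long x)
    {
        return fputc((int)((x >> 24) & 0xff), f) != EOF
            && fputc((int)((x >> 16) & 0xff), f) != EOF
            && fputc((int)((x >>  8) & 0xff), f) != EOF
            && fputc((int)( x        & 0xff), f) != EOF;
    }

    int get_u32(FILE *f, unsigned long *x)
    {
        int i, c;

        *x = 0;
        for (i = 0; i < 4; i++) {
            if ((c = getc(f)) == EOF)
                return 0;           /* truncated file */
            *x = (*x << 8) | (unsigned long)c;
        }
        return 1;
    }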

In addition, by representing the high-level file data format at the beginning 
of a file, as part of the file format definition, you can write a 
data editor that provides an interface to the data in the program.  
Once a data structure is complicated enough, for example, one equivalent to 
an arbitrary multigraph, editing a text file will be a daunting task to 
begin with.  If your application is complex enough to need a bigger and 
better file format, it will be worthwhile to put the effort in to allow 
for some form of data translation from version 1.1 to 1.2 of your software.
With the data format inside the file, you are safe from the program and
data becoming inconsistent because the program will tell you when that 
happens and may even automatically compensate.
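
(For instance, a version stamp at the front of the file, sketched with
an illustrative tag; the reader can refuse or translate an old format
instead of silently reading garbage:)

    #include <stdio.h>
    #include <string.h>

    static const char *TAG = "incidentfile";

    int put_header(FILE *f, int version)
    {
        return fprintf(f, "%s %d\n", TAG, version) > 0;
    }

    /* returns the file's format version, or -1 if it isn't ours */
    int get_header(FILE *f)
    {
        char name[32];
        int  version;

        if (fscanf(f, "%31s %d", name, &version) != 2
            || strcmp(name, TAG) != 0)
            return -1;
        return version;
    }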

Bill Smith
wsmith@cs.uiuc.edu
uiucdcs!wsmith