sommar@enea.se (Erland Sommarskog) (09/26/88)
(This stems from the discussions of VMS and Unix file systems. I've
added comp.lang.misc and directed follow-ups there. It doesn't belong
in comp.os.vms anymore and I don't read comp.unix.wizards.)

Miles O'Neal (meo@stiatl.UUCP) writes:
>If you even have your data files as text files, debugging
>becomes much easier. For instance, would you rather debug
>98764389437034gh307ytfhr398f39
>or
>12/22/88 01:30 10790 100 100 382 -1
>?
>These are not real data, but examples of what data files I've dealt
>with looked like. The processing to do all this is cheap nowadays,
>so why not use text files if there is no OVERWHELMING reason not to?

There are several advantages to using fixed record files for data,
if the data I have fits that format. Let's say we have: (This is
Pascal.)

   Data_record = RECORD
      Date           : PACKED ARRAY(.1..8.) OF char;
      Time           : PACKED ARRAY(.1..8.) OF char;
      Incident       : Incident_type;   (* Enumerated *)
      No_of_warnings : integer;
      Alarmed        : boolean;
      Username       : PACKED ARRAY(.1..12.) OF char;
   END;

The simplest way to read and write this is through a FILE OF
Data_record, if no other program is to read it. If we store the data
in a text file, we have to parse every line we get. (And what trouble
if Username contains an LF character.) And since a text file is
vulnerable to changes from e.g. a text editor, we cannot be sure that
the file follows the supposed format throughout. As for debugging,
this only applies if you have to look at the file as such.

>Another thing this buys you is that, in my experience, it's easier
>to change file formats if you use text files. It requires a little
>planning, but in general is a lot less work than doing the same
>thing with any other type of data.

Uh? If you have a text file and change the format, you have to rewrite
the parsing and writing-to-file parts. With a fixed format you change
the declaration, and that's all.
(Well, you may have to write a simple program to convert old files,
but you have to do that for text files too.)
-- 
Erland Sommarskog       ! "Hon ligger med min bäste vän,
ENEA Data, Stockholm    !  jag vågar inte sova längre", Orup
sommar@enea.UUCP        ! ("She's making love with my best friend,
                        !  I dare not sleep anymore")
meo@stiatl.UUCP (Miles O'Neal) (09/27/88)
In article <3958@enea.se>, sommar@enea.se (Erland Sommarskog) writes:
> There are several advantages to using fixed record files for data,
> if the data I have fits that format.

Either you missed the point, or I didn't make it well. The issue I was
addressing was not whether the record format was variable, but whether
the format was human-readable. And if you are developing anything
remotely complex, being able to read and "text process" the data files
is very handy for debugging. I believe that was the context of the
original discussion.

I agree that, usually, fixed format is better, unless space is a
problem.
=====================================================================
Any opinions, facts, fantasies, theories, fallacies and/or lies which
may or may not have been expressed in this posting or elsewhere, may
or may not belong to myself, my employer, the building management,
Atlanta, Fulton County, public and/or private utilities serving such,
the State of GA, the USA, or God himself. No other warranties apply,
implied or stated. For further information, call 1(800)HOO-HAHA.
---------------------------------------------------------------------
Miles O'Neal   decvax!gatech!stiatl!meo
mmengel@cuuxb.ATT.COM (Marc W. Mengel) (09/27/88)
In article <3958@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>
>Miles O'Neal (meo@stiatl.UUCP) writes:
>>If you even have your data files as text files, debugging
>>becomes much easier. For instance, would you rather debug
>>98764389437034gh307ytfhr398f39
>>or
>>12/22/88 01:30 10790 100 100 382 -1
>>?
>
>There are several advantages to using fixed record files for data,
>if the data I have fits that format. Let's say we have: (This is
>Pascal.)
>   Data_record = RECORD
>      Date           : PACKED ARRAY(.1..8.) OF char;
>      Time           : PACKED ARRAY(.1..8.) OF char;
>      Incident       : Incident_type;   (* Enumerated *)
>      No_of_warnings : integer;
>      Alarmed        : boolean;
>      Username       : PACKED ARRAY(.1..12.) OF char;
>   END;
>
>The simplest way to read and write this is through a FILE OF
>Data_record, if no other program is to read it.

Two major problems with this idea. The first is that most of the time
other programs will need to read the data sooner or later. Second,
when files are written in a binary format like this, the same program
cannot read the data when run on a different machine with a different
byte ordering, so after you have built up a list of 2000 incidents and
have to move to a new machine, you lose big time. You have a data file
with packed records in it, and you (the programmer) have *no idea* how
the data is actually formatted.

>If we store the data in a text file, we have to parse every line we
>get. (And what trouble if Username contains an LF character.) And
>since a text file is vulnerable to changes from e.g. a text editor,
>we cannot be sure that the file follows the supposed format
>throughout.

It's true, you have to parse some of the data file (the numbers), but
even Pascal gives you a means of writing and reading integers of a
fixed width. Since Pascal has a problem with newlines, you can either
write them with an escape sequence like '\n', or just check Pascal's
eoln() function before reading a character.
>As for debugging, this only applies if you have to look at the file
>as such.

Clearly, one would never need to look at the data file a program uses
to determine if a value is getting trashed before being written to the
file, or after being read back in. (HEAVY sarcasm here)

>>Another thing this buys you is that, in my experience, it's easier
>>to change file formats if you use text files. It requires a little
>>planning, but in general is a lot less work than doing the same
>>thing with any other type of data.
>
>Uh? If you have a text file and change the format, you have to
>rewrite the parsing and writing-to-file parts. With a fixed format
>you change the declaration, and that's all. (Well, you may have to
>write a simple program to convert old files, but you have to do that
>for text files too.)

What's so tough about fixed-format text parsing? Every language known
to mankind can read and write integers, etc. from a file. The format
is easily extensible, you *can* add records with a text editor, you
can debug your code much more easily, you can write programs in other
languages or on other machines that can read your data files, you can
use Unix utilities like grep and sed and awk to make useful reports...
The list goes on and on.

Binary data saves you a whole 20 minutes to an hour when writing the
program, makes your data files unportable between machines, makes you
have to use the same language you started with on the same machine as
long as you want to use the same data files... This list also goes on
and on.

If you're writing real, production programs that are useful to people
and generate data files, the people who use them may want to run them
on a different machine 5 years from now; they will read in the source
and data file from a tape onto their new machine, and their data will
be garbage, because the 1's will be 256's, or 2048's, etc., due to
byte order on the new machine.
If they move to this new machine because the old one is dead and they
can't get parts any more, and now they can't read their old accounting
records, they will find worse names to call you than you can imagine.

>-- 
>Erland Sommarskog       ! "Hon ligger med min bäste vän,
>ENEA Data, Stockholm    !  jag vågar inte sova längre", Orup
>sommar@enea.UUCP        ! ("She's making love with my best friend,
>                        !  I dare not sleep anymore")
-- 
Marc Mengel
mmengel@cuuxb.att.com
attmail!mmengel
{lll-crg|mtune|att}!cuuxb!mmengel
sommar@enea.se (Erland Sommarskog) (10/01/88)
I had the example:
>> Data_record = RECORD
>>    Date           : PACKED ARRAY(.1..8.) OF char;
>>    Time           : PACKED ARRAY(.1..8.) OF char;
>>    Incident       : Incident_type;   (* Enumerated *)
>>    No_of_warnings : integer;
>>    Alarmed        : boolean;
>>    Username       : PACKED ARRAY(.1..12.) OF char;
>> END;
>>
>>The simplest way to read and write this is through a FILE OF
>>Data_record, if no other program is to read it.

Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>Two major problems with this idea. The first is that most of the time
>other programs will need to read the data sooner or later.

If we have data that are to be read by more than one program, two
programs can import the declaration of the data record from a common
source, and thus they do not need to be rewritten if the format is
changed. With a text file we can achieve the same effect with common
procedures for reading and writing, to be imported together with the
data definition, but this gives a higher degree of dependency. Also,
now we have the problem that for one change we have to edit in three
places: the read and write routines and the data definition,
introducing a source of error.

If you have many programs that are to read the same data, you are
likely to get a database system, and I don't think they store data in
a text-file format... The only case where I can see that this argument
is valid is when "the other program" is a standard text-oriented
utility.

>Second, when files are written in a binary format like this, the same
>program cannot read the data when run on a different machine with a
>different byte ordering, so after you have built up a list of 2000
>incidents and have to move to a new machine, you lose big time.

A valid point. However, text files are not necessarily compatible
either. Imagine that the data record above has a message field, 80
characters long. Assume that the program started its life on VMS and
that one of the messages contains a CR-LF. Now we move to a Unix
system...
And I have seen Pascal systems that gladly read 123 from the line
"123ABD", and those that choke, saying "invalid integer". Both these
problems can be avoided with careful programming, it should be added.

>You have a data file with packed records in it, and you (the
>programmer) have *no idea* how the data is actually formatted.

Isn't this a point in favour? I always thought that as high a level of
abstraction as possible was a good thing. You don't need to know the
actual disk format until you really have a need to move the file.

>It's true, you have to parse some of the data file (the numbers), but
>even Pascal gives you a means of writing and reading integers of a
>fixed width.

The problem is that you often have little use for these standard
routines, unless you can accept that the program crashes because there
was a letter where you expected a number. Storing data in text files
gives you a bigger problem with data integrity than with binary files.

>What's so tough about fixed-format text parsing?
>...
>you *can* add records with a text editor,

A plus, but applying the text editor is clearly a violation of data
integrity.

>you can debug your code much more easily,

Since I have less code, binary files win here, as long as I have a
good debugger around.

>you can write programs in other languages

If you work on VMS and have CDD (Common Data Dictionary) around, this
is possible with binary files too. (With CDD you can write data
definitions in a specific data definition language. Several of DEC's
compilers supply a DICTIONARY directive to import these definitions.)

>Binary data saves you a whole 20 minutes to an hour when writing
>the program,

It saves you more time than that. You don't have to think so much
about integrity checks, you have less trouble changing the format
during development, and maintenance benefits from the reduced code
volume.

I'm not saying that you should never use text files for storing data.
In many cases this may be very desirable.
I once wrote a simple text formatter with an interactive syllabication
facility. I stored the syllabications in a text file, since I realized
that the user wanted to be able to remove an erroneous syllabication,
and I didn't want to write a tool for maintaining the file.

What to use is a decision the programmer has to make based on the
requirements on portability (+ for text), performance (+ for binary),
data integrity (+ for binary) and so on. Generally it seems to me that
for a cheap system with low requirements on integrity and
maintainability, a text file is the natural choice. But as the
complexity and the amount of data grow, you are likely to choose
binary files, and eventually you pass the line where you need a
database management system.
-- 
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
allbery@ncoast.UUCP (Brandon S. Allbery) (10/03/88)
As quoted from <3958@enea.se> by sommar@enea.se (Erland Sommarskog):
+---------------
| Uh? If you have a text file and change the format, you have to
| rewrite the parsing and writing-to-file parts. With a fixed format
| you change the declaration, and that's all. (Well, you may have to
| write a simple program to convert old files, but you have to do
| that for text files too.)
+---------------

The last time I added a field to the UNaXcess userfile, I added a
":%s" to the fprintf() write and a "getfld(buf,var)" to the read.
Trivial. And because of the way getfld() is written and the fact that
new fields always go at the end of the line (record), the userfile is
self-upgrading: if the field is missing, a default value is used and
the field is added on output. Try THAT with a binary file!

As for newlines in "records", etc.: there is a backslash-escape
convention that deals with this quite well. But I rarely, if ever,
find a need to store newlines in data records.
-- 
Brandon S. Allbery, uunet!marque!ncoast!allbery          DELPHI: ALLBERY
For comp.sources.misc send mail to <backbone>!sources-misc
comp.sources.misc is moving off ncoast -- please do NOT send submissions direct
ncoast's days are numbered -- please send mail to ncoast!system if you can help
nevin1@ihlpb.ATT.COM (Liber) (10/05/88)
In article <3967@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>I had the example:
>>> Data_record = RECORD
>>>    Date           : PACKED ARRAY(.1..8.) OF char;
>>>    Time           : PACKED ARRAY(.1..8.) OF char;
>>>    Incident       : Incident_type;   (* Enumerated *)
>>>    No_of_warnings : integer;
>>>    Alarmed        : boolean;
>>>    Username       : PACKED ARRAY(.1..12.) OF char;
>>> END;
>>>
>>>The simplest way to read and write this is through a FILE OF
>>>Data_record, if no other program is to read it.

>Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>>Two major problems with this idea. The first is that most of the
>>time other programs will need to read the data sooner or later.

>If we have data that are to be read by more than one program, two
>programs can import the declaration of the data record from a common
>source, and thus they do not need to be rewritten if the format is
>changed.

This assumes that all the programs are not only run on the same type
of machine and operating system, but that they are written in the same
language using the same compiler (things like packed arrays are not
only *machine* dependent and *operating system* dependent, they are
*language* dependent, *compiler* dependent, and in some cases even
*optimization* dependent). This is unnecessarily restrictive, and
typically not practical in commercial environments.

Regardless of whether I use text files or binary files, I would rather
write my own read/write routines (even if they only call the standard
ones) than be dependent on my compiler.

>Also, now we have the problem that for one change we have to edit in
>three places: the read and write routines and the data definition,
>introducing a source of error.

But you have gained an interface layer (is it time to throw the
'object-oriented' buzzword around yet? :-))! Except for the read/write
routines, the rest of the program is independent of the way the data
is stored on disk. This is a big advantage!
(Note: this advantage comes from the argument, not from the type of
data file used.) Suppose you decide to delete one of the fields stored
on disk (because it can be calculated, for instance), but you want the
field available for the rest of the program. If you didn't bother to
put the interface layer in, this is a maintenance nightmare.

>If you have many programs that are to read the same data, you are
>likely to get a database system, and I don't think they store data
>in a text-file format...

You wouldn't necessarily want a prepackaged DBMS. There is usually a
lot of overhead associated with DBMS systems, and you have to decide
whether it is worth it.

>The only case where I can see that this argument is valid is when
>"the other program" is a standard text-oriented utility.

Well, if you're on a Un*x (sorry about the '*' in place of the 'i',
but Legal is talking about trademark protection again) system, this
may be very desirable. You can use all your familiar tools (like grep,
sed, etc.) to do many of your manipulations.

>>Second, when files are written in a binary format like this, the
>>same program cannot read the data when run on a different machine
>>with a different byte ordering, so after you have built up a list of
>>2000 incidents and have to move to a new machine, you lose big time.

>A valid point. However, text files are not necessarily compatible
>either. Imagine that the data record above has a message field, 80
>characters long. Assume that the program started its life on VMS and
>that one of the messages contains a CR-LF. Now we move to a Unix
>system... And I have seen Pascal systems that gladly read 123 from
>the line "123ABD",
           ^^^^^^ need I say more? :-)
>and those that choke, saying "invalid integer".

Yes, but this isn't a deficiency of the file format; it is a
deficiency of the implementation of the programming language (I knew
this discussion was somehow relevant to this group :-)).
So far, your only valid argument for using binary files instead of
text files is that it is cumbersome to do text manipulation in
languages such as Pascal, Modula-2, etc.

>>You have a data file with packed records in it, and you (the
>>programmer) have *no idea* how the data is actually formatted.

>Isn't this a point in favour? I always thought that as high a level
>of abstraction as possible was a good thing. You don't need to know
>the actual disk format until you really have a need to move the file.

But some of us don't plan on using the same machine forever (or even
for one year). I would hate to have to write conversion programs every
time I needed to port something. The problem with abstraction is that
if the model wasn't designed just right, you typically have to find a
way around it. I would much rather be able to design the model for
abstraction from the ground up than be forced into using what someone
else thought would be good enough. Standard Pascal does not give me
these primitives when it comes to files; other languages do. All you
have done here is point out another problem with the language, not the
data format.

>>It's true, you have to parse some of the data file (the numbers),
>>but even Pascal gives you a means of writing and reading integers
>>of a fixed width.

>The problem is that you often have little use for these standard
>routines, unless you can accept that the program crashes because
>there was a letter where you expected a number.

Again, a deficiency of the programming language, not of the data
format. In C, people use the standard routines with no problems; they
don't ungracefully crash when an error occurs like Wirth-type
languages do.

>Storing data in text files gives you a bigger problem with data
>integrity than with binary files.

Actually, the opposite is true.
Since the effective data is more compressed in binary formats (if this
weren't true, there would be nothing to distinguish text formats from
binary formats), it is more likely that a data error will go by
unnoticed.

>>you *can* add records with a text editor,

>A plus, but applying the text editor is clearly a violation of data
>integrity.

What makes a text editor a 'violation of data integrity' any more than
someone hacking together a program to modify the data? The latter is
probably worse, since it is much harder to check the integrity within
a program than by just looking at it through an editor.

Besides, whenever I need a binary format for data, I use a
hex-oriented file editor. It is an essential debugging tool
(especially if the data gets corrupted). The existence of this editor
has no bearing on my data integrity. I take other precautions
regardless of the data format (eg, setgid to a special group for
Un*x-based systems where the data is to be shared).

>>you can debug your code much more easily,

>Since I have less code, binary files win here, as long as I have a
>good debugger around.

It had better not let you modify your data file! As you said, that
would be a 'violation of data integrity'. Also, it is much easier to
find an error in a text file than it is in a binary file (why do you
use a good debugger in the first place? So that you can see a symbolic
representation of your program is usually one of the reasons. In other
words, you need to look at a text format). The error you find in the
data can be traced back to the program. It is a useful debugging
technique.

You only have less code by half a dozen procedures, and your code is
much more interdependent than mine. I would think that your method
would lead to more errors than mine.

>You don't have to think so much about integrity checks,

Strike one! The integrity checks are more complicated.

>you have less trouble changing the format during development,

Strike two!
If one of the data elements changes storage formats, the rest of the
program has to be checked for dependencies.

>maintenance benefits from the reduced code volume.

Strike three! Although there is less NCSL (non-commented source
lines), the code is more interdependent and hence more complex,
resulting in higher maintenance costs.

>What to use is a decision the programmer has to make based on the
>requirements on portability (+ for text), performance (+ for binary),

I agree (finally :-)) with these two.

Most of the points that you brought up came about because it is much
harder to do rigorous text manipulation in the language you were
using. If you are stuck with a restrictive language, then this is a
valid point. (BTW, I'm not trying to start a C vs Pascal debate.
Different languages have different strong points and different
weaknesses. Due to other factors, we can't always use the language
best suited for the task.) But don't use this to say that text is
worse than binary when the real problem is with the language, not the
format.
-- 
 _ __            NEVIN J. LIBER  ..!att!ihlpb!nevin1  (312) 979-4751  IH 4F-410
' )  )  "I catch him with a left hook. He eels over. It was a fluke, but there
 /  / _ , __o  ____  he was, lying on the deck, flat as a mackerel - kelpless!"
/  (_</_\/ <__/ / <_ As far as I know, these are NOT the opinions of AT&T.
sommar@enea.se (Erland Sommarskog) (10/09/88)
(>> is me.) Nevin J Liber (nevin1@ihlpb.UUCP) writes:
>Regardless of whether I use text files or binary files, I would
>rather write my own read/write routines (even if they only call the
>standard ones) than be dependent on my compiler.

Of course, no matter what type of files we use, we should encapsulate
the disk I/O routines for our data structure. What the rest of the
program should see is just Get(put)_one_record(Data) where Data is of
some type. And these routines are easier to maintain if they simply
write Data (or Data.all) in its binary format to the disk.

>This assumes that all the programs are not only run on the same type
>of machine and operating system, but that they are written in the
>same language using the same compiler (things like packed arrays are
>not only *machine* dependent and *operating system* dependent, they
>are *language* dependent, *compiler* dependent, and in some cases
>even *optimization* dependent). This is unnecessarily restrictive,
>and typically not practical in commercial environments.

This is true if you can't call your common interface routines from
another language. And if you can't, well, you have a maintenance
problem no matter the file format. As I also mentioned in my previous
article, a tool like VAX CDD is a help in a multi-language
environment.

>>If you have many programs that are to read the same data, you are
>>likely to get a database system, and I don't think they store data
>>in a text-file format...
>
>You wouldn't necessarily want a prepackaged DBMS. There is usually a
>lot of overhead associated with DBMS systems, and you have to decide
>whether it is worth it.

And there is a lot of development overhead associated with not using a
DBMS. Did I hear NIH?

>>The only case where I can see that this argument is valid is when
>>"the other program" is a standard text-oriented utility.
>
>Well, if you're on a Un*x (...) system, this may be very desirable.
>You can use all your familiar tools (like grep, sed, etc.)
>to do many of your manipulations.

Agreed. Just because I said it was the only case doesn't mean that
it's unimportant. (Side note: on a system like VMS you still have some
use for SEARCH, DIFFERENCES etc. for binary files, since they
recognize the file format.)

>>The problem is that you often have little use for these standard
>>routines, unless you can accept that the program crashes because
>>there was a letter where you expected a number.
>
>Again, a deficiency of the programming language, not of the data
>format. In C, people use the standard routines with no problems; they
>don't ungracefully crash when an error occurs like Wirth-type
>languages do.

Possibly C handles this case better than other languages do. All
languages that I've seen protest in some way when they get a non-digit
when trying to read an integer. Not all of them crash, though. Simula
and standard Pascal do. Ada and Fortran have exception mechanisms to
help you. (But I wonder what C does? If I guess, it sets some error
variable that you can forget to check, returns zero for the integer,
and doesn't move the current position in the file, so when you're
reading the following string field you're starting in the wrong place.
This would be just as bad as simply crashing.)

>>Storing data in text files gives you a bigger problem with data
>>integrity than with binary files.
>
>Actually, the opposite is true. Since the effective data is more
>compressed in binary formats (if this weren't true, there would be
>nothing to distinguish text formats from binary formats), it is more
>likely that a data error will go by unnoticed.

Whether a binary file is more compressed than the corresponding text
file depends on the data. With many numbers it's true, but with many
string fields you can save disk space with a text file, since you
don't have to store trailing blanks. The size has little to do with
the integrity.
The assumption is that no sane person would start to edit a binary
file "by hand", but you can't overlook this case for a text file. If
we can assume that the file is only accessed through the common I/O
routines mentioned earlier, we are assured that format integrity is
maintained.
-- 
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
nevin1@ihlpb.ATT.COM (Liber) (10/12/88)
In article <3980@enea.se> sommar@enea.se (Erland Sommarskog) writes:
ES> Of course, no matter what type of files we use, we should encapsulate
ES> the disk I/O routines for our data structure. What the rest of the
ES> program should see is just Get(put)_one_record(Data) where Data is
ES> of some type.
Agreed.
ES> And these routines are easier to maintain if they
ES> simply write Data (or Data.all) in its binary format to the disk.
Only if I want to write all the data that is contained in my record in
the format that is in my record. This, in my experience, is not
usually the case.
NL> This assumes that all the programs are not only run on the same type of
NL> machine and operating system, but that they are written in the same
NL> language using the same compiler (things like packed arrays are not only
NL> *machine* dependent and *operating system* dependent, they are *language*
NL> dependent, *compiler* dependent, and in some cases are even *optimization*
NL> dependent). This is unnecessarily restrictive, and typically not practical
NL> in commercial environments.
ES> This is true if you can't call your common interface routines from
ES> another language. And if you can't, well, you have a maintenance
ES> problem no matter the file format.
What happens in the situation when you no longer have the original
source code? With a text file, it is fairly easy to figure out the
data format (eg: the uuencoding scheme is extremely easy to figure
out, and I did so when I couldn't find a uudecode program around for my
PC). With a non-compressed binary format, it is a little tougher (you
have to know how integers are represented on the target machine, use
some good test data, etc.), especially if you think it is 'not sane' to
hand-edit a binary file. If your data is compressed at all (like
Pascal's packed arrays), you had better know the compression scheme or
figuring out the format will be very difficult.
ES> As I also mentioned in my
ES> previous article, a tool like VAX CDD is a help in a multi-language
ES> environment.
Since I haven't seen the CDD, I cannot comment on it.
NL> You wouldn't necessarily want a prepackaged DBMS. There is usually a
NL> lot of overhead associated with DBMS systems, and you have to decide
NL> whether it is worth it.
ES> And there is a lot of development overhead associated with not using
ES> a DBMS. Did I hear NIH?
No, you didn't hear NIH. What I meant by deciding whether or not it is
worth having a DBMS is whether or not the overhead of a DBMS outweighs
the overhead of development without it. Sometimes adding a DBMS *adds*
overhead to development (you have to learn the interface to your
language, you have to learn the DBMS, etc.). This topic, however, is
not appropriate for comp.lang.misc. If you wish to discuss it, move it
to comp.databases. Nuff said.
ES> (Side note: On a system like VMS you still have some
ES> use for SEARCH, DIFFERENCES etc for binary files, since they recognize
ES> the file format.)
These tend to be very limited.
NL> Again, a deficiency of the programming language, not of the data format.
NL> In C, people use the standard routines with no problems; they don't
NL> ungracefully crash when an error occurs like Wirth-type languages do.
ES> Possibly C handles this case better than other languages do. All languages
ES> that I've seen protest in some way when they get a non-digit when
ES> trying to read an integer. Not all of them crash though. Simula and
ES> standard Pascal do. Ada and Fortran have exception mechanisms to help
ES> you.
ES> (But I wonder what C does? If I guess, [...]
ES> [very bad guess deleted]
The C function strtol (string to long) takes a string and converts it
into a long int. It ignores leading whitespace and scans until it
finds a character which is inconsistent with the base. It returns
the converted number and stores, via its second argument, a pointer
to the character which terminated the scan. This is much more
graceful than the standard Pascal solution.
ES> Whether a binary file is more compressed than the corresponding
ES> text file, depends on the data. With many numbers it's true,
ES> but with many string fields, you can save disk space with a text
ES> file, since you don't have to store trailing blanks.
Not a valid point. Since we were talking about *fixed-format*, the
trailing blanks have to be included whether we are using text
files or binary files. Fixed-format binary files are more compressed
than corresponding fixed-format text files (with the possible exception
of all-text files).
ES> The size has little to do with the integrity. The assumption is
ES> that no sane person would start to edit a binary file "by hand",
ES> but you can't overlook this case for a text file.
I guess I'm insane :-), but let me give you a few examples where I had
to edit a binary file by hand.
Ever had a head crash on a disk drive? On a PC, guess where the head
usually resides. On the file that was last accessed. Guess what that
file usually is. It's usually the file containing the current
directory. On numerous occasions, I had to reconstruct a sector of a
directory, and the only way I know to do this is to go in and edit the
sector with a binary editor.
Another example: every once in a while someone deletes a file that
they wanted to save. On PCs, the file isn't actually removed; the
directory entry is flagged and the space that the file occupied is put
back on the free list. By going in and hand editing the directory
file, it is a very simple process to undelete the file.
Also, being able to hand edit a binary file is a useful debugging
tool.
ES> If we can assume
ES> that the file is only accessed through the common I/O routines
ES> mentioned earlier, we are assured that format integrity is maintained.
Not true, either.
Still another example: A word processor that I was using would mark
the file that I was editing in such a way that no one else using the
word processor would be able to edit this file. Guess what happened
when the system came down. The file was permanently marked open, and
the word processor did not have an option for unmarking a file (until
the next release, anyway). Without hand editing it, I would have to
scrap the file. With the binary editor, however, I was able to change
the 1-bit marking flag, and I lost no work. The format was corrupted
by having my process interrupted (this can happen with text files
too, but with text files the recovery is much easier).
--
_ __ NEVIN J. LIBER ..!att!ihlpb!nevin1 (312) 979-4751 IH 4F-410
' ) ) "I catch him with a left hook. He eels over. It was a fluke, but there
/ / _ , __o ____ he was, lying on the deck, flat as a mackerel - kelpless!"
/ (_</_\/ <__/ / <_ As far as I know, these are NOT the opinions of AT&T.
wsmith@m.cs.uiuc.edu (10/12/88)
>>Uh? If you have a text file and change the format you have to rewrite
>>the parsing and the writing-to-file parts. With a fixed format you
>>change the declaration, and that's all. (Well, you may have to write
>>a simple program to convert old files, but you have to do that for
>>text files too.)
>
>What's so tough about fixed-format text parsing? Every language known
>to mankind can read and write integers, etc. from a file; the format
>is easily extensible, you *can* add records with a text editor, you
>can debug your code much more easily, you can write programs in other
>languages or on other machines that can read your data files, you can
>use Unix utilities like grep and sed and awk to make useful reports...
>The list goes on and on.

I would like to address an orthogonal issue, one that can be
implemented equally well with either text files or binary files:

"How do you portably save a data structure with pointers?"

If the data structure has a canonical order, it may be traversed via
some implicit spanning tree, and all of the pointers can be stored
implicitly. If the data structure is more complex, I use some overhead
in each record to hold an integer mark that I use while traversing the
data structure.

First, I set each mark to a negative number (using the same marking
algorithm that I use to traverse the data structure). Once all the
marks are negative, I traverse the data structure again, this time
outputting all of the data plus one integer for each pointer. If the
mark of the pointed-to record is positive, I output the mark. If the
mark is negative, I reset it to the next available positive mark,
output that number, and add the record to a queue of unprocessed
records. At the end the queue will be empty. This algorithm is
described in "Marking Algorithms", Lars-Erik Thorelli, BIT 12 (1972),
pp. 555-68.

To read the data structure back, I create an array with one null
pointer for each record in the data structure.
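The write pass might look like this in C. The record layout, the
fixed-size queue, and the space-separated output are my own invention
for illustration (and the record count that belongs at the front of
the file is omitted here for brevity); Thorelli's paper gives the
general algorithm:

```c
#include <stdio.h>

struct node {                   /* hypothetical record, two pointers */
    int data;
    struct node *a, *b;
    int mark;                   /* per-record serialization overhead */
};

static struct node *queue[100]; /* records awaiting output */
static int qhead, qtail, next_mark;

/* Return the integer that stands for pointer p in the file,
   assigning a fresh mark (and queueing p) on first encounter. */
static int mark_of(struct node *p)
{
    if (p == NULL)
        return 0;               /* 0 encodes the nil pointer */
    if (p->mark < 0) {          /* first visit */
        p->mark = next_mark++;
        queue[qtail++] = p;
    }
    return p->mark;
}

/* Assumes a prior traversal has set every reachable mark to -1. */
void save(struct node *root, FILE *f)
{
    qhead = qtail = 0;
    next_mark = 1;
    mark_of(root);
    while (qhead < qtail) {
        struct node *n = queue[qhead++];
        fprintf(f, "%d %d %d\n", n->data, mark_of(n->a), mark_of(n->b));
    }
}
```

Note that cycles cost nothing extra: a record that points back at an
already-marked record simply emits the existing mark.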
(In the first pass, the one that sets the marks to a negative number,
I also keep a count of the number of records. The first integer output
is the number of records, so that I can allocate the array when I
start reading the data back in.)

When a pointer is read in, this array is consulted. If the array entry
is non-null, the record has already been read in, and that entry can
be used as a reference to the actual record. If the entry is null, the
record has not been read in yet; I save the index in a queue, along
with a reference to the pointer that needs to be fixed up when the
record is read in. After a record is read in, the next element of the
queue is examined: if its array entry has been filled in by now, just
fix up the reference with the value from the array; otherwise, read
the record in as a new one. Continue until the queue is empty.

>
>Binary data saves you a whole 20 minutes to an hour when writing
>the program, makes your data files unportable between machines,
>makes you have to use the same language you started with on the
>same machine as long as you want to use the same data files...
>This list also goes on and on.
>

If you do more work on your binary data format, you can make it
portable across architectures. The hard porting problems, ASCII
vs. EBCDIC and machines with non-8-bit bytes, will cause some
headaches, but they are not impossible.

The trick, I think, is to define a standard byte-stream format for
each data type (including floating point, if you use it). Then, when
you want to write a given data type into the file, you call a
subroutine for that specific data type. Whether the subroutine writes
binary, ASCII, or some home-brew BCD format is merely an
implementation detail. Pointers map onto the integer type as described
above.

In addition, by representing the high-level file data format at the
beginning of the file, as part of the file format definition, you can
write a data editor that provides an interface to the data in the
program.
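For illustration, here is a C sketch of the read-back pass, under some
invented assumptions: each record was written as its datum plus one
integer per pointer (0 for nil, mark k for the k-th record written),
preceded by a record count. Because records arrive in mark order and
can all be allocated up front, the queue of pending fixups collapses
into direct array indexing; the queue described above handles the
general case where that shortcut is not available:

```c
#include <stdio.h>
#include <stdlib.h>

struct node {                   /* must match the writer's layout */
    int data;
    struct node *a, *b;
    int mark;
};

/* Read back a structure of n records.  Pre-allocating all n records
   means a pointer mark k can be turned into a real pointer (&rec[k-1])
   the moment it is read, with no deferred fixups. */
struct node *load(FILE *f)
{
    int n;
    if (fscanf(f, "%d", &n) != 1 || n <= 0)
        return NULL;
    struct node *rec = calloc((size_t)n, sizeof *rec);
    if (rec == NULL)
        return NULL;
    for (int i = 0; i < n; i++) {
        int ma, mb;
        if (fscanf(f, "%d %d %d", &rec[i].data, &ma, &mb) != 3)
            return NULL;
        rec[i].a = ma ? &rec[ma - 1] : NULL;  /* mark k -> record k-1 */
        rec[i].b = mb ? &rec[mb - 1] : NULL;
    }
    return &rec[0];             /* the root was given mark 1 */
}
```

Forward references cost nothing here: a mark pointing at a record not
yet filled in still yields a valid address, since the storage already
exists.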
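Such a per-type subroutine might look like this for a 32-bit integer,
written most significant byte first so the file is identical whatever
byte order the host machine uses (the function names are invented):

```c
#include <stdio.h>

/* Write v as exactly four bytes, most significant first. */
void put_i32(FILE *f, long v)
{
    unsigned long u = (unsigned long)v & 0xffffffffUL;
    putc((int)((u >> 24) & 0xff), f);
    putc((int)((u >> 16) & 0xff), f);
    putc((int)((u >>  8) & 0xff), f);
    putc((int)( u        & 0xff), f);
}

/* Read the same four-byte format back into a host integer. */
long get_i32(FILE *f)
{
    unsigned long u = 0;
    for (int i = 0; i < 4; i++)
        u = (u << 8) | (unsigned long)(getc(f) & 0xff);
    return (long)u;
}
```

A matching pair of subroutines per data type is all the "standard
byte-stream format" amounts to; floating point needs more care (pick a
mantissa/exponent layout and stick to it), but the principle is the
same.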
Once a data structure is complicated enough (for example, one
equivalent to an arbitrary multigraph), editing a text file will be a
daunting task to begin with. If your application is complex enough to
need a bigger and better file format, it will be worthwhile to put the
effort in to allow for some form of data translation from version 1.1
to version 1.2 of your software. With the data format inside the file,
you are safe from the program and data becoming inconsistent, because
the program will tell you when that happens and may even automatically
compensate.

>--
> Marc Mengel
> mmengel@cuuxb.att.com
> attmail!mmengel
> {lll-crg|mtune|att}!cuuxb!mmengel

Bill Smith
wsmith@cs.uiuc.edu
uiucdcs!wsmith