scs@adam.pika.mit.edu (Steve Summit) (04/29/89)
From time to time we debate the relative merits of "binary" and
"ascii" data file formats. Today I discovered Yet Another Good
Reason not to use binary files, a fact which I was in general
already eminently certain of. I'll pass the reason on, in case it
will keep anyone else from making the same mistake.

I am working with some code which, for historical reasons, writes C
structures out to files and reads them back in, in the usual
(unrecommended) byte-for-byte binary way, using fwrite and fread.
It happens that the structures contain pointers, the writing of
which to files is an even worse idea, because the pointers are
virtually certain to be meaningless when read back in. The
struct-writing code is therefore careful to chase down the pointers
and additionally write the substructures and strings pointed to;
the struct-reading code ignores the garbage pointers read in and
replaces them with new pointers to the memory containing the
pointed-to information just read from the file.

This is all well and good, and has been working just fine, but it
happens that it is on the benighted PC, and today I thought I'd
switch from medium model to large model. Surprise, surprise, those
16-bit garbage pointers I wrote out yesterday are now trying to be
read in as 32-bit garbage pointers, offsetting the rest of the
data, and several thousand data files are currently unreadable...

(You don't need to chide me for using binary files in the first
place, a fact for which I am amply chiding myself already, or
suggest ways out of the dilemma, any number of variously
unpalatable ones of which I am already considering.)

                                            Steve Summit
                                            scs@adam.pika.mit.edu
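[A minimal sketch of the pattern described above -- the struct and
field names are hypothetical, a reconstruction for illustration
rather than Summit's actual code. The pointer stored in the file is
garbage, and sizeof(struct rec) silently changes when pointers grow
from 16 to 32 bits, which is exactly what broke the old files:]

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct rec {            /* hypothetical record */
        int value;
        char *name;         /* meaningless once written to disk */
    };

    /* Write the struct byte-for-byte, then the string it points to. */
    void put_rec(struct rec *r, FILE *fp)
    {
        fwrite((char *)r, sizeof(struct rec), 1, fp);
        fwrite(r->name, strlen(r->name) + 1, 1, fp);
    }

    /* Read it back; ignore the garbage pointer and rebuild it. */
    int get_rec(struct rec *r, FILE *fp)
    {
        char buf[256];
        int i = 0, c;

        /* sizeof(struct rec) is where the trouble hides: it changed
           when the pointer went from 16 to 32 bits, so files written
           under the old memory model no longer line up here. */
        if (fread((char *)r, sizeof(struct rec), 1, fp) != 1)
            return 0;
        while ((c = getc(fp)) != EOF && c != '\0' && i < 255)
            buf[i++] = c;
        buf[i] = '\0';
        r->name = malloc(i + 1);
        if (r->name != NULL)
            strcpy(r->name, buf);
        return r->name != NULL;
    }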
nather@ut-emx.UUCP (Ed Nather) (04/29/89)
I have used both binary and ascii data files in different versions of the
same basic data acquisition program -- binary back when digital data
cassettes were new and floppy disks held a massive 160KB, and ascii
when things loosened up a bit. Believe me, ascii is better:

1. I can use Unix (or Unix-like) text tools to scan data files and
their corresponding headers; I used to have to write my own tools.

2. Normal, everyday human beings can read the files without having to
use a special translation program. Printers can print them for scrutiny.

3. I can often whump up a pipeline to do some special processing job,
based on Unix(-like) tools for most of it, and often only a single
tool to do the special stuff as the data flow through.

4. Other computers understand ascii and can read the files without
having to write special conversion routines.

5. I have written conversion routines that turn old, compact binary
data files & headers into ascii, matching the current program's
output. Going the other way would be unthinkable.

Bad things:

1. Files take a bit longer to read in, since conversion from ascii is now
necessary, but it's a small percentage of the total read time.

2. Files are larger.

3. There are no other bad things.

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
cik@l.cc.purdue.edu (Herman Rubin) (04/30/89)
In article <12546@ut-emx.UUCP>, nather@ut-emx.UUCP (Ed Nather) writes:
< same basic data acquisition program -- binary back when digital data
< cassettes were new and floppy disks held a massive 160KB, and ascii
< when things loosened up a bit. Believe me, ascii is better:

	......................

> Bad things:
> 
> 1. Files take a bit longer to read in, since conversion from ascii is now
> necessary, but it's a small percentage of the total read time.
> 
> 2. Files are larger.
> 
> 3. There are no other bad things.

There is another bad thing. We may not have a good ASCII representation for
the data. One example is a multi-font system. Another example is floating
point data; there is no standard floating point binary, and conversion to and
from decimal is a source of roundoff errors, which may even be serious.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
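[The decimal round-trip problem is easy to demonstrate; a minimal
sketch, assuming IEEE doubles and a conforming printf/atof:]

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        double x = 1.0 / 3.0, y;
        char buf[64];

        sprintf(buf, "%f", x);      /* only 6 digits by default */
        y = atof(buf);
        printf("%s read back with error %g\n", buf, y - x);

        sprintf(buf, "%.17g", x);   /* enough digits: exact round */
        y = atof(buf);              /* trip on IEEE machines */
        printf("%s read back with error %g\n", buf, y - x);
        return 0;
    }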
zougas@me.utoronto.ca ("Athanasios(Tom) Zougas") (04/30/89)
In article <12546@ut-emx.UUCP> nather@ut-emx.UUCP (Ed Nather) writes:
>.... Believe me, ascii is better:
> ...
>Bad things [WRT ascii data file]:
>
>1. Files take a bit longer to read in
               ^^^^^

Have you ever tried reading/writing 1,000,000 real numbers using
ascii instead of binary? I changed an engineering analysis program
from being intolerably SSSLLLOOOWWW to exceedingly fast by using
binary data files instead of ascii. My boss was happy :-)

Tom.

p.s. This was on a microcomputer, name withheld to protect my
innocence.
-- 
This is my signature:
tom zougas
bill@twwells.uucp (T. William Wells) (05/01/89)
In article <89Apr30.140219edt.18480@me.utoronto.ca> zougas@hammer.me.UUCP (Athanasios(Tom) Zougas) writes:
: >1. Files take a bit longer to read in
:                ^^^^^
: Have you ever tried reading/writing 1,000,000 real numbers using
: ascii instead of binary? I changed an engineering analysis program
: from being intolerably SSSLLLOOOWWW to exceedingly fast by using
: binary data files instead of ascii. My boss was happy :-)
Bet you were using scanf! It isn't so bad if you do it yourself, but
scanf and friends are SSSLLLOOOWWW.
---
Bill { uunet | novavax } !twwells!bill
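[The kind of hand-rolled reader meant here is short; a sketch for
whitespace-separated nonnegative integers, with no overflow
checking -- an illustration of the idea, not Wells's code:]

    #include <stdio.h>
    #include <ctype.h>

    /* Read the next unsigned integer from fp; returns 0 at EOF.
       Avoids the per-call format-string interpretation that makes
       scanf slow. */
    int read_uint(FILE *fp, unsigned long *out)
    {
        int c;
        unsigned long n = 0;

        do {
            c = getc(fp);
        } while (c != EOF && !isdigit(c));
        if (c == EOF)
            return 0;
        do {
            n = n * 10 + (c - '0');
            c = getc(fp);
        } while (c != EOF && isdigit(c));
        *out = n;
        return 1;
    }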
poser@csli.Stanford.EDU (Bill Poser) (05/01/89)
I agree that in many cases it is desirable to use ASCII data files,
but in some situations binary is better. One such situation is when
you need to know how many items are in the file before you read it
(say to allocate storage). If the data is binary you just stat the
file and divide by the item size. But if you use ASCII data there
won't be a fixed item size, unless you go to the trouble of arranging
fixed field widths (which also may waste a lot of space), or arrange
to write a header containing the number of items, which is
impractical in some situations.
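[Poser's stat-and-divide idiom as a sketch; the item type is
hypothetical and error handling is minimal. As the follow-up below
points out, the size reported by stat is trustworthy only on Unix:]

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    struct item { double x, y; };   /* hypothetical fixed-size record */

    struct item *load_items(char *path, long *count)
    {
        struct stat st;
        struct item *p;
        FILE *fp;

        if (stat(path, &st) != 0)
            return NULL;
        *count = st.st_size / sizeof(struct item);
        p = (struct item *)malloc(*count * sizeof(struct item));
        fp = fopen(path, "rb");
        if (p == NULL || fp == NULL ||
            fread((char *)p, sizeof(struct item), *count, fp)
                != (size_t)*count) {
            if (fp) fclose(fp);
            free(p);
            return NULL;
        }
        fclose(fp);
        return p;
    }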
mcdonald@uxe.cso.uiuc.edu (05/01/89)
>.... Believe me, ascii is better:
> ...
>Bad things [WRT ascii data file]:
>
>1. Files take a bit longer to read in

More than a bit longer.

I see no reason not to settle on a standard binary format and use
that for data interchange: 8 bit bytes, two's complement. On
more-than-eight-bit-byte machines, pad with zeroes. Your choice of
endianness.

For example, TeX produces a well-defined .dvi file as output, and
Metafont produces standard .gf files, all of which are binary. I
have no difficulty copying these between DEC-20's, IBM PCs and VAXes
(which are the same for this), and big-endian RISC machines.
DEC-20's have 36 bit words. Screwing around with endianness,
repacking bytes, etc., is a lot faster than ASCIIfication,
especially in C.

Doug McDonald
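[McDonald's scheme comes down to packing bytes explicitly, so the
host's own byte order never touches the file; a minimal sketch for
32-bit two's-complement values, little-endian by arbitrary choice,
with EOF handling omitted:]

    #include <stdio.h>

    /* Write v as 4 bytes, low byte first, regardless of host order. */
    void put32(unsigned long v, FILE *fp)
    {
        putc((int)(v & 0xff), fp);
        putc((int)((v >> 8) & 0xff), fp);
        putc((int)((v >> 16) & 0xff), fp);
        putc((int)((v >> 24) & 0xff), fp);
    }

    unsigned long get32(FILE *fp)
    {
        unsigned long v;

        v  = (unsigned long)getc(fp);
        v |= (unsigned long)getc(fp) << 8;
        v |= (unsigned long)getc(fp) << 16;
        v |= (unsigned long)getc(fp) << 24;
        return v;
    }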
henry@utzoo.uucp (Henry Spencer) (05/02/89)
In article <1271@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There is another bad thing. We may not have a good ASCII representation for
>the data... example is floating
>point data; there is no standard floating point binary, and conversion to and
>from decimal is a source of roundoff errors, which may even be serious.

Binary and decimal do not cover the entire space of possibilities.
The problem is not in conversion to a text form, it's in base
conversion. If you are willing to assume that floating point is
binary, it's conceivable to convert floating point to octal or hex
rather than decimal. Also, if you have well-behaved (IEEE) floating
point conversions, and allow enough digits, you may not have
roundoff errors.
-- 
Mars in 1980s: USSR, 2 tries,   | Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.       | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
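[Spencer's hex suggestion can be carried out exactly with frexp,
since extracting hex digits from a binary fraction involves only
multiplications by 16 -- powers of two, so no rounding. A write-only
sketch, assuming IEEE doubles; the inverse, accumulating digit/16
terms, is equally exact:]

    #include <stdio.h>
    #include <math.h>

    /* Print d as a hex fraction times a power of two, e.g. the
       value 2.0 comes out as "+.80000000000000p2". */
    void put_hex_double(double d, FILE *fp)
    {
        int e, i, digit;
        double m = frexp(d < 0 ? -d : d, &e);   /* m in [0.5, 1) */

        putc(d < 0 ? '-' : '+', fp);
        putc('.', fp);
        for (i = 0; i < 14; i++) {      /* 14 hex digits > 53 bits */
            m *= 16.0;
            digit = (int)m;
            m -= digit;
            putc("0123456789abcdef"[digit], fp);
        }
        fprintf(fp, "p%d\n", e);
    }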
scs@adam.pika.mit.edu (Steve Summit) (05/02/89)
In article <1271@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>There is another bad thing. We may not have a good ASCII representation for
>the data. One example is a multi-font system. Another example is floating
>point data; there is no standard floating point binary, and conversion to and
>from decimal is a source of roundoff errors, which may even be serious.

Or it may be innocuous. The data I typically manipulate is derived
from a medium-quality A/D converter with 4 or 5 digits of precision;
I'd be misleading myself if I were to take steps to ensure that my
data files could handle more.

I'm not sure what the comment about non-ASCII/multi-font systems
means. Is the implication that binary formats (bit-for-bit copies of
the machine's internal floating-point format) are somehow less
susceptible to such portability concerns?

                                            Steve Summit
                                            scs@adam.pika.mit.edu
scs@adam.pika.mit.edu (Steve Summit) (05/02/89)
In article <8758@csli.Stanford.EDU> poser@csli.stanford.edu (Bill Poser) writes:
>I agree that in many cases it is desirable to use ASCII data files,
>but in some situations binary is better. One such situation is when
>you need to know how many items are in the file before you read it
>(say to allocate storage). If the data is binary you just
>stat the file and divide by the item size.

Actually, this illustrates another thing it's worth shying away
from if you can. The assumption that you can determine, without
actually reading them, exactly how many characters a file
contains, can get you into trouble, although of course it's a
perfectly valid assumption on Unix systems. Not so on VMS and
MS-DOS and doubtless other lesser systems -- stat() or the
equivalent may only give you an approximation.

A prime example is Unix tar format: a tar file consists of a file
header, followed by a file, followed by a file header, etc. The
file header contains the (following) file's size; the size must be
exact because the program reading the tar file must use it to
determine where the file ends and the next header begins. It's
trivial to write the header correctly on Unix: just stat the file.
If you're trying to create tar files on other systems (a reasonable
thing to do, since tar is an interchange format) you typically have
to read each file twice: once to count the characters in it, and a
second time to copy it to the tar output file.

The moral is that if you're writing a program that might be ported
to a non-Unix system, don't depend on the ability to find a file's
size, "in advance," without explicitly reading it.

Getting back to data files, it's not necessary to know how big they
are while reading them. Just use code like the following:

    int nels = 0;
    int nallocated = 0;
    struct whatever *p = NULL;

    while(there's another item) {
        if(nels >= nallocated) {
            nallocated += 10;
            if(p == NULL)
                p = (struct whatever *)malloc(
                    nallocated * sizeof(struct whatever));
            else
                p = (struct whatever *)realloc((char *)p,
                    nallocated * sizeof(struct whatever));
            if(p == NULL)
                complain;
        }

        read item into p[nels];
        nels++;
    }

If realloc can handle a NULL first argument, you can dispense with
the initial test and call to malloc, and always call realloc (which
is why I'm always ranting in favor of this realloc functionality,
which ANSI C incidentally requires).

The on-the-fly reallocation may look inefficient, but "it doesn't
matter much in practice." (At least for me. When I'm really
unconcerned with efficiency, I even skip the nallocated += 10
chunking jazz and call realloc for each item read, and that has
never caused problems either. Your mileage may vary.)

                                            Steve Summit
                                            scs@adam.pika.mit.edu
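[Summit's skeleton, filled in for a concrete case -- whitespace-
separated doubles via fscanf, purely for illustration. The
temporary q guards against leaking the old block when realloc
fails, a wrinkle the skeleton glosses over:]

    #include <stdio.h>
    #include <stdlib.h>

    double *read_doubles(FILE *fp, int *nelsp)
    {
        int nels = 0, nallocated = 0;
        double *p = NULL, *q, x;

        while (fscanf(fp, "%lf", &x) == 1) {
            if (nels >= nallocated) {
                nallocated += 10;
                q = (double *)(p == NULL ?
                    malloc(nallocated * sizeof(double)) :
                    realloc((char *)p,
                        nallocated * sizeof(double)));
                if (q == NULL) {
                    free(p);
                    return NULL;
                }
                p = q;
            }
            p[nels++] = x;
        }
        *nelsp = nels;
        return p;
    }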
sjs@spectral.ctt.bellcore.com (Stan Switzer) (05/02/89)
Another alternative definitely worth looking into is Sun's XDR. XDR
is available via anonymous FTP from titan.rice.edu (the sun-spots
archives) in "sun-source/rpcsrc.*.*". If you have NFS you already
have XDR.

XDR is an efficient and rather clever scheme for producing flat,
binary, machine-independent representations of arbitrary C data
structures. It easily allows for pointer chasing, memory allocation,
and unions. Check it out.

Stan Switzer  sjs@ctt.bellcore.com
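[A taste of the XDR style, a sketch from memory -- check your
<rpc/xdr.h> for the exact declarations. The same filter routines
both encode and decode, depending on how the stream was created;
that symmetry is much of the cleverness:]

    #include <stdio.h>
    #include <rpc/rpc.h>

    /* Encode an int and a double onto fp in XDR's machine-
       independent representation. Creating the stream with
       XDR_DECODE instead turns the identical xdr_int/xdr_double
       calls into readers. */
    int save(FILE *fp, int n, double d)
    {
        XDR xdrs;
        int ok;

        xdrstdio_create(&xdrs, fp, XDR_ENCODE);
        ok = xdr_int(&xdrs, &n) && xdr_double(&xdrs, &d);
        xdr_destroy(&xdrs);
        return ok;
    }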
john@frog.UUCP (John Woods) (05/03/89)
In article <893@twwells.uucp>, bill@twwells.uucp (T. William Wells) writes:
> In article <89Apr30.140219edt.18480@me.utoronto.ca> zougas@hammer.me.UUCP (Athanasios(Tom) Zougas) writes:
< : >1. Files take a bit longer to read in
> :                ^^^^^
< : Have you ever tried reading/writing 1,000,000 real numbers using
> : ascii instead of binary? I changed an engineering analysis program
< : from being intolerably SSSLLLOOOWWW to exceedingly fast by using
> : binary data files instead of ascii. My boss was happy :-)
< Bet you were using scanf! It isn't so bad if you do it yourself, but
> scanf and friends are SSSLLLOOOWWW.

Furthermore, you could always use an ASCII representation like

    +.8000000000000+2

to represent 1/2 * 2^2 (or the value 2). If you are going between
machines with "similar" floating point representations (binary
fraction, binary exponent), then the details of the conversion are
fairly simple (especially if you were thoughtful enough to use a
header which specified the format of the original internal
representation). If your target is something really peculiar (a
decimal computer, say) then it's no worse than converting
"2.000000000E+000", anyway. A computer with a base-16 floating
point format could handle the binary format with shifts.
-- 
John Woods, Charles River Data Systems, Framingham MA, (508) 626-1101
...!decvax!frog!john, john@frog.UUCP, ...!mit-eddie!jfw, jfw@eddie.mit.edu
dwilbert@bbn.com (Deborah Wilbert) (05/03/89)
I worked as part of a team building a (very) large, multiprocess,
distributed application and was very good about isolating the user
interface, and even within the user interface, isolating strings
used from the rest of the presentation code... except for the error
log(*). Strings which eventually wended their way into the error
log originated in any part of the system.

Well, along comes someone who wants to translate the system into
Japanese. "A piece of cake!" was my first reaction (we had all the
display tools) and indeed, translating the user interface WAS a
piece of cake... except for the error log. In fact, they wanted to
be able to run multiple user interfaces to a running system, some
in Japanese, others in English, simultaneously. No problem. They
also wanted to view the log in either English or Japanese, upon
demand. Big problem. Virtually every component had to be changed to
report errors in a binary format rather than a string format, and
the error log itself had to be maintained in binary format to
achieve this functionality. FYI, the binary format involved an
error code and various parameters associated with the errors.

Of course, the reimplementation in most of the modules was
avoidable... we should always have been reporting internally in
binary format (mea culpa), even if we maintained the original log
file in English. However, (1) a binary output file format would
have enforced internal binary interfaces, thus preventing the
sloppy coding, and (2) the binary format of the log file was
eventually necessary for multilingual access.

If you work on any serious product, you should consider keeping
log-style files in a binary format. When I started work on my
project, I had no idea it would eventually be translated into
another language.

-Deborah

(*) well, it wasn't really an error log... it was more integral and
important than an error log, but let's call it an error log.
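[A minimal sketch of the scheme Wilbert describes -- all codes,
messages, and names here are hypothetical. The log stores only a
code and its parameters; words are chosen per language only when
somebody looks at it:]

    #include <stdio.h>

    #define ERR_DISK_FULL   0
    #define ERR_BAD_BLOCK   1

    struct logrec {         /* what actually goes into the log file */
        int code;
        long param;
    };

    static char *english[] = {
        "disk %ld is full",
        "bad block at %ld",
    };
    /* a Japanese table would be indexed by the same codes */

    void show(struct logrec *r, char **table)
    {
        printf(table[r->code], r->param);
        putchar('\n');
    }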
tneff@bfmny0.UUCP (Tom Neff) (05/03/89)
In article <11021@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:
>Actually, this illustrates another thing it's worth shying away
>from if you can. The assumption that you can determine, without
>actually reading them, exactly how many characters a file
>contains, can get you into trouble, although of course it's a
>perfectly valid assumption on Unix systems. Not so on VMS and
>MS-DOS and doubtless other lesser systems -- stat() or the
 ^^^^^^
>equivalent may only give you an approximation.

MS-DOS has exact filesizes in bytes, and a standard OS call to
retrieve a file's size in bytes. The poster may have been thinking
of CP/M, but CP/M is not MS-DOS.
-- 
Tom Neff                        UUCP: ...!uunet!bfmny0!tneff
"Truisms aren't everything."    Internet: tneff@bfmny0.UU.NET
bet@dukeac.UUCP (Bennett Todd) (05/04/89)
In article <11021@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit)
writes that assuming you can stat a file for its size breaks down on non-UNIX
systems, and recommends reading into a dynamically grown buffer, which he
grows linearly.
I have often done similar things. The getline() routine in my libbent does a
conceptually similar job for reading arbitrarily long text lines (where you
don't know in advance how many bytes to allocate to last you 'till the next
newline on input). Also, in an image I/O and manipulation library I wrote, I
wanted to be able to read an image from a pipe. I disbelieve in header
parsing, and deduce image dimensions from the file length, so I had to do
roughly the same thing.
However, I am not sure I like the linear reallocation strategy. I would tend
to assume, in general, that realloc would usually be implemented as a series
of malloc/memcpy/free, and thus I try to avoid working it too hard. I found a
binary growth algorithm easy to code, however; basically it looks just like
Steve's linear algorithm except instead of
nallocated += 10;
I use
nallocated *= 2;
Also, I start with somewhat larger allocations; for getline() I started with
128, and with the image reading facility I start with 65536. Finally, where
you are reading hoping for EOF, by all means issue one big read for each
realloc, rather than reading along by one or two at a time.
I actually haven't done any performance measurements to determine whether I am
buying any speed with this strategy; however, it isn't much harder to code,
and I am sure on some (if not most) machines the vendor doesn't take enough
care to optimize performance of realloc().
-Bennett
bet@orion.mc.duke.edu
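[A sketch of the doubling strategy applied to line reading -- not
Todd's actual libbent getline, just the shape of it. Doubling keeps
the total bytes copied by realloc proportional to the line length,
where the += 10 style is quadratic:]

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read one arbitrarily long line (newline included) from fp
       into a malloc'd buffer, doubling the allocation as needed. */
    char *slurp_line(FILE *fp)
    {
        size_t cap = 128, len = 0;
        char *buf = malloc(cap), *nbuf;

        if (buf == NULL)
            return NULL;
        while (fgets(buf + len, cap - len, fp) != NULL) {
            len += strlen(buf + len);
            if (len > 0 && buf[len - 1] == '\n')
                return buf;                 /* got the whole line */
            nbuf = realloc(buf, cap *= 2);  /* double, don't creep */
            if (nbuf == NULL) {
                free(buf);
                return NULL;
            }
            buf = nbuf;
        }
        if (len > 0)
            return buf;     /* last line had no newline */
        free(buf);
        return NULL;        /* EOF */
    }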
poser@csli.Stanford.EDU (Bill Poser) (05/04/89)
It is true that some operating systems don't allow you to determine
the true file size, but essentially any program that does file I/O,
graphics, or other dealings with the outside world has to make some
assumptions about the OS. After all, some OSs don't provide for
shared memory or execution of child processes or our favorite
graphics system or window system. And some really grody operating
systems have zillions of file types, which generally do not have
exact counterparts in other OSs, so just using text files (I don't
write "ASCII" since some of the nasty OSs in question use EBCDIC)
doesn't make the program portable.

For many of us (admittedly not all) the class of UN*X systems is a
large enough target for portability. And if, as I do, you are
dealing with files containing millions of numbers, the efficiency
of using binary data can be a big win.
mcdonald@uxe.cso.uiuc.edu (05/04/89)
>MS-DOS has exact filesizes in bytes, and a standard OS call to retrieve
>a file's size in bytes. The poster may have been thinking of CP/M,
>but CP/M is not MS-DOS.

Yes, but that size is the same as the number of characters returned
by fgetc only if the file is binary. Not generally so if it is text.

Doug McDonald
geoff@cs.warwick.ac.uk (Geoff Rimmer) (05/05/89)
Let me throw in my 2c worth. My "rules" for choosing whether I use a
binary file, or an ASCII file for storing data are as follows:

(1) If I am storing a file of structs, I always use binary. This
is because it is faster, and because it makes the code easier to
write and understand. For example, to delete a struct whose "ref"
is set to 999, and write the new file of structs out elsewhere:

    struct blurfl buf;

    while (fread((char *)&buf, sizeof(struct blurfl), 1, fpr)) {
        if (buf.ref != 999)
            fwrite((char *)&buf, sizeof(struct blurfl), 1, fpw);
    }

(2) I use ASCII if I don't know how many fields will be read in at
a time, i.e. one of the fields might determine how many more fields
are to be read in.

With (2), I often use strtok(), which I've found very useful:

    while (!feof(fpr)) {
        char *ptr, str[BUFSIZ];

        if (!fgets(str, BUFSIZ - 1, fpr))
            break;
        if (!(ptr = strtok(str, " \t\n")))
            continue;
        do {
            printf("%s\n", ptr);
        } while ((ptr = strtok((char *)0, " \t\n")));
    }

(In reality, I wouldn't use fgets, since it requires a maximum
length to read, which could result in lost data if there is a
particularly long line.)

I don't believe you can say "always use ascii files" - it just ain't
good enough for some applications.

Geoff
nather@ut-emx.UUCP (Ed Nather) (05/05/89)
In article <225800167@uxe.cso.uiuc.edu>, mcdonald@uxe.cso.uiuc.edu writes:
> >MS-DOS has exact filesizes in bytes, and a standard OS call to retrieve
> >a file's size in bytes. The poster may have been thinking of CP/M,
> >but CP/M is not MS-DOS.
> 
> Yes, but that size is the same as the number of characters returned
> by fgetc only if the file is binary. Not generally so if it is text.

Sorry, that's not correct. The exact file size as it is present as
an image on the disk is indeed returned. If, however, you read a
file into memory in text mode, its size will CHANGE due to removal
of '\r' codes by the input routine. You can avoid this behavior by
referring to the text file as binary in the 'fopen' operation.
There are also C library calls to set output to binary, if you
wish, so the '\r' codes are not inserted on output.

This whole mess came about because the author(s) of MS-DOS refused
to accept the Unix convention of a single '\n' newline character
instead of '\r''\n'. CPM still lives; its genes are hiding inside
MS-DOS.
-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
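[The difference is easy to see; a sketch, assuming an MS-DOS C
library whose fopen distinguishes "r" from "rb". Counting the same
file both ways makes the stripped '\r' bytes visible:]

    #include <stdio.h>

    /* Count the characters fgetc delivers under the given mode. */
    long count_chars(char *path, char *mode)
    {
        FILE *fp = fopen(path, mode);
        long n = 0;

        if (fp == NULL)
            return -1;
        while (fgetc(fp) != EOF)
            n++;
        fclose(fp);
        return n;
    }

    /* count_chars(f, "rb") matches the size in the directory; with
       "r" it comes up one short per line, and a stray Ctrl-Z in the
       file truncates the count as well. */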
les@chinet.chi.il.us (Leslie Mikesell) (05/07/89)
In article <1821@ubu.warwick.UUCP> geoff@cs.warwick.ac.uk (Geoff Rimmer) writes:
>My "rules" for choosing whether I use a binary file, or an ASCII file
>for storing data are as follows:
>(1) If I am storing a file of structs, I always use binary. This
>is because it is faster, and because it makes the code easier to write
>and understand.
 [...]
>I don't believe you can say "always use ascii files" - it just ain't
>good enough for some applications.

I'm surprised that no one has mentioned this yet, but writing binary
data or structs to disk files results in a file which may require
conversion for use by another machine. This becomes a serious
problem in a networked environment where it may not be apparent
which machine created the file or when multiple machines need
simultaneous access.

It is not that unlikely that your next machine upgrade will consist
of adding more hosts on a net with access to the current files. Do
you want to bet that they will have the same byte order, word size,
and struct padding?

Les Mikesell
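[Mikesell's three hazards are all visible from a few lines of C; a
sketch whose output differs from machine to machine, which is
precisely the problem:]

    #include <stdio.h>

    struct rec {
        char tag;       /* padding after this varies by machine */
        long value;     /* and so do its size and byte order */
    };

    int main(void)
    {
        struct rec r;

        r.tag = 'x';
        r.value = 1;
        printf("struct occupies %d bytes on disk\n",
            (int)sizeof(struct rec));
        printf("first byte of value is %d\n",
            (int)*(char *)&r.value);
        /* little-endian prints 1, big-endian prints 0; an fwrite
           of r is readable only where all three answers agree */
        return 0;
    }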
bright@Data-IO.COM (Walter Bright) (05/09/89)
In article <12815@ut-emx.UUCP> nather@ut-emx.UUCP (Ed Nather) writes:
<This whole mess came about because the author(s) of MS-DOS refused to accept
<the Unix convention of a single '\n' newline character instead of '\r''\n'.
<CPM still lives; its genes are hiding inside MS-DOS.
And CPM is based on DEC's RT-11 for PDP-11 computers. At the time, DEC
operating systems were very popular, and they all used the \r and \n
convention. Unix was not nearly so prevalent then. So you cannot fault
CPM for not following the unix conventions. MS-DOS got a head start by
making it easy to port CPM programs over to it. At the time it had
no C compiler or hard disk, so there was no urge to port unix code to it.
All that existed was CPM assembly programs.
All in all, most of the decisions were rational and made sense at the time.
It's rather easy to criticize from 10 years later.
MS-DOS moved into the unix camp and away from DEC with version 2.0. The
big screwup here was using a \ as a separator, rather than /. I suspect
that the reason was that Microsoft used the / as a switch character in
all their application programs.
peter@ficc.uu.net (Peter da Silva) (05/09/89)
In article <1970@dataio.Data-IO.COM>, bright@Data-IO.COM (Walter Bright) writes:
> And CPM is based on DEC's RT-11 for PDP-11 computers. At the time, DEC
> operating systems were very popular, and they all used the \r and \n
> convention.

That's funny... most DEC systems I know (including RSX, which is
what CP/M seems most closely modelled on) store files as a series
of variable length records containing (usually) a 2 or 4 byte
header holding the length and maybe the line number, and then the
data on the line.

CP/M was actually based directly on an obscure Intel DOS called
Isis, with some terminology (PIP, etc.) borrowed from DEC.

Isn't this getting a bit far from 'C'?
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.