[comp.lang.c] Binary data file compatibility across machines

stiber@cs.ucla.edu (Michael D Stiber) (11/24/90)

On different machines, the implementation of C data types is different.
I forget what fixed types' lengths are, but I know that at least some of
them may vary.  I also know that doubles can have different encoding
schemes (ie, IEEE vs. DEC).  Then, there's little endian machines versus
big endian ones.

So, my question is this:  Say you want to share data files among
different machines.  You also want to be able to use the same code on
each machine.  Therefore, you want to have either a uniform file format,
or you want the code to be able to figure out what the file format is,
and convert it to the native data type representation.  Now, one alternative
would be ASCII files --- this is guaranteed to work (assuming that you
can get C on an IBM 3090 to write ASCII).  However, in my application,
ASCII would produce files that are way too huge --- I must use a binary
format.  So, is there an already-existing, standard solution to this
problem of binary data file transfer?
--
			    Michael Stiber
			  stiber@cs.ucla.edu
		   ...{ucbvax,ihpn4}!ucla-cs!stiber
		     UCLA Computer Science Dept.

rmartin@clear.com (Bob Martin) (11/26/90)

In article <STIBER.90Nov23134600@maui.cs.ucla.edu> stiber@cs.ucla.edu (Michael D Stiber) writes:
>
>On different machines, the implementation of C data types is different.
>So, is there an already-existing, standard solution to this
>problem of binary data file transfer?
>--

CCITT has created a standard called X.409.  It nicely encodes data
in nearly any structure at all into a moderately concise binary 
format.  It provides for arrays, structures, strings, integers, and
user formats.  

I have used this standard quite successfully to store data in
files, or transmit data in "mail" messages between machines of
different architectures.

I believe that the OSI standard for ASN.1 (X.208, X.209???) is
quite similar to X.409.  This standard is part of the presentation
layer in the OSI 7 layer model....

Another option is a unix convention called XDR.  I know very
little about it, other than it is used by RPC.  You might check
your unix manuals...

-- 
+-Robert C. Martin-----+:RRR:::CCC:M:::::M:| Nobody is responsible for |
| rmartin@clear.com    |:R::R:C::::M:M:M:M:| my words but me.  I want  |
| uunet!clrcom!rmartin |:RRR::C::::M::M::M:| all the credit, and all   |
+----------------------+:R::R::CCC:M:::::M:| the blame.  So there.     |

prk@planet.bt.co.uk (KnightRider) (11/26/90)

stiber@cs.ucla.edu (Michael D Stiber) writes:


>On different machines, the implementation of C data types is different.
>I forget what fixed types' lengths are, but I know that at least some of
>them may vary.  I also know that doubles can have different encoding
>schemes (ie, IEEE vs. DEC).  Then, there's little endian machines versus
>big endian ones.

>So, my question is this:  Say you want to share data files among
>different machines.  You also want to be able to use the same code on
>each machine.  Therefore, you want to have either a uniform file format,
>or you want the code to be able to figure out what the file format is,
>and convert it to the native data type representation.  Now, one alternative
>would be ASCII files --- this is guaranteed to work (assuming that you
>can get C on an IBM 3090 to write ASCII).  However, in my application,
>ASCII would produce files that are way too huge --- I must use a binary
>format.  So, is there an already-existing, standard solution to this
>problem of binary data file transfer?
>--
>			    Michael Stiber
>			  stiber@cs.ucla.edu
>		   ...{ucbvax,ihpn4}!ucla-cs!stiber
>		     UCLA Computer Science Dept.

The short, easy answer is yes: use the presentation layer services of
your communications system, if you have one.  For OSI systems this will
be ASN.1 (Abstract Syntax Notation 1).  If you use Sun systems, you may
be able to use XDR (External Data Representation).

However, these services are usually difficult to implement yourself, so
you may want to go to an external vendor for support.

Peter Knight
BT Research 

#include <std.disclaimer>

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/26/90)

In article <STIBER.90Nov23134600@maui.cs.ucla.edu>, stiber@cs.ucla.edu
(Michael D Stiber) writes:
|> 
|> On different machines, the implementation of C data types is different.
|> I forget what fixed types' lengths are, but I know that at least some of
|> them may vary.  I also know that doubles can have different encoding
|> schemes (ie, IEEE vs. DEC).  Then, there's little endian machines versus
|> big endian ones.
|> 
|> So, my question is this:  Say you want to share data files among
|> different machines.  You also want to be able to use the same code on
|> each machine.  Therefore, you want to have either a uniform file format,
|> or you want the code to be able to figure out what the file format is,
|> and convert it to the native data type representation.  Now, one alternative
|> would be ASCII files --- this is guaranteed to work (assuming that you
|> can get C on an IBM 3090 to write ASCII).  However, in my application,
|> ASCII would produce files that are way too huge --- I must use a binary
|> format.  So, is there an already-existing, standard solution to this
|> problem of binary data file transfer?
|> -

Most UNIX systems have a library of routines based on the XDR protocol.
These routines will accomplish what you need to do.  Be aware, however,
that the tradeoff here is execution speed:  encoding and decoding files
to and from IEEE format can be expensive in CPU usage.

Try 'man xdr'; failing that, read your network documentation.

--- Dale Cook   cdm@inel.gov
========== long legal disclaimer follows, press n to skip ===========
^L
Neither the United States Government or the Idaho National Engineering
Laboratory or any of their employees, makes any warranty, whatsoever,
implied, or assumes any legal liability or responsibility regarding any
information, disclosed, or represents that its use would not infringe
privately owned rights.  No specific reference constitutes or implies
endorsement, recommendation, or favoring by the United States
Government or the Idaho National Engineering Laboratory.  The views and
opinions expressed herein do not necessarily reflect those of the
United States Government or the Idaho National Engineering Laboratory,
and shall not be used for advertising or product endorsement purposes.

hp@vmars.tuwien.ac.at (Peter Holzer) (11/27/90)

stiber@cs.ucla.edu (Michael D Stiber) writes:

>On different machines, the implementation of C data types is different.
>I forget what fixed types' lengths are, but I know that at least some of
>them may vary.  I also know that doubles can have different encoding
>schemes (ie, IEEE vs. DEC).  Then, there's little endian machines versus
>big endian ones.

>So, my question is this:  Say you want to share data files among
>different machines.  You also want to be able to use the same code on
>each machine.  Therefore, you want to have either a uniform file format,
>or you want the code to be able to figure out what the file format is,
>and convert it to the native data type representation.  Now, one alternative
>would be ASCII files --- this is guaranteed to work (assuming that you
>can get C on an IBM 3090 to write ASCII).  However, in my application,
>ASCII would produce files that are way too huge --- I must use a binary
>format.  So, is there an already-existing, standard solution to this
>problem of binary data file transfer?
>--
>			    Michael Stiber
>			  stiber@cs.ucla.edu
>		   ...{ucbvax,ihpn4}!ucla-cs!stiber
>		     UCLA Computer Science Dept.

I do not know of any standard solution (ANSI or ISO or
something) but here is my personal ``standard'':
(Well, most of the time I just use ASCII files. They are not that
much bigger, and I can examine (and change!!) them with standard
tools)

For integer data I choose the format that is used on the machines 
I am working on most of the time. Each binary data file then
gets a header describing the data format. Something like

<magic number>		2 Bytes: 'P' 'B' (portable binary)
<integer type> 		1 Byte: 0 = 2compl., 1 = 1compl.,
				2 = sign/mag,
<endianness>		1 Byte: 0 = little, 1 = big.
<float-format>		??

I didn't need float format until now but I would adopt at least
two different formats: IEEE and a generic format where a float
is broken into a mantissa (long int) and exponent (short int).
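
A minimal sketch of the generic mantissa/exponent format mentioned above,
assuming the mantissa is kept as a 31-bit signed fraction. The names
portable_float, pf_encode and pf_decode and the scaling by 2^31 are
assumptions for illustration, not part of the original proposal. Values
with more than 31 significant bits lose their low bits.

```c
#include <math.h>

/* Hypothetical generic float format: value = mantissa * 2^(exponent - 31). */
struct portable_float {
    long  mantissa;  /* signed, magnitude below 2^31 */
    short exponent;  /* binary exponent as returned by frexp() */
};

/* Split a double into an integer mantissa and a binary exponent. */
struct portable_float pf_encode(double x)
{
    struct portable_float pf;
    int e;
    double frac = frexp(x, &e);           /* x = frac * 2^e, 0.5 <= |frac| < 1 */

    pf.mantissa = (long)ldexp(frac, 31);  /* keep 31 bits of the fraction */
    pf.exponent = (short)e;
    return pf;
}

/* Rebuild the double from the two integers. */
double pf_decode(struct portable_float pf)
{
    return ldexp((double)pf.mantissa, pf.exponent - 31);
}
```

The two integers can then be written with the same byte-by-byte integer
routines as any other long and short, so no machine ever has to interpret
another machine's native floating-point layout.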

Shorts are assumed to be 2 bytes, longs 4 bytes (The minimums
required by ANSI).

A program which reads these files would first check whether the data
format is the same as the one it uses internally. If it is, it can use
fread/fwrite for the rest of the file; otherwise it has to call
special routines to deal with the various types.
Most of the time the file will be read on the same machine it
was written on, so files can usually be read fast.
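
The header check described above could be sketched roughly as follows.
This is an illustration under assumptions: the function and constant
names are hypothetical, and the float-format byte is left out because
the post does not define it. The magic number is written as the numeric
ASCII codes 0x50 0x42 so that the bytes in the file do not depend on the
character set of the writing machine.

```c
#include <stdio.h>

/* Hypothetical names; the values follow the header table in the post. */
enum { PB_TWOS_COMPL = 0, PB_ONES_COMPL = 1, PB_SIGN_MAG = 2 };
enum { PB_LITTLE = 0, PB_BIG = 1 };

/* Detect the byte order of the running machine. */
int pb_native_endian(void)
{
    unsigned int probe = 1;
    return *(unsigned char *)&probe == 1 ? PB_LITTLE : PB_BIG;
}

/* Write the 4-byte header; returns 0 on success, -1 on error. */
int pb_write_header(FILE *fp, int int_type, int endian)
{
    unsigned char h[4];

    h[0] = 0x50;                     /* 'P' in ASCII */
    h[1] = 0x42;                     /* 'B' in ASCII */
    h[2] = (unsigned char)int_type;
    h[3] = (unsigned char)endian;
    return fwrite(h, 1, 4, fp) == 4 ? 0 : -1;
}

/* Read the header.  Returns 1 if the file matches the given native
   format (plain fread will do), 0 if conversion routines are needed,
   and -1 if the magic number is missing. */
int pb_check_header(FILE *fp, int native_type, int native_endian)
{
    unsigned char h[4];

    if (fread(h, 1, 4, fp) != 4 || h[0] != 0x50 || h[1] != 0x42)
        return -1;
    return (h[2] == native_type && h[3] == native_endian) ? 1 : 0;
}
```

A reader would call pb_check_header() once after opening the file and
pick either the fast fread path or the byte-by-byte routines based on
the result.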

A portable routine to read a big-endian sign/magnitude long would then
be:

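#include <stdio.h>

/* Read a 4-byte big-endian sign/magnitude long; no EOF check, for brevity. */
long read_long_bs (FILE * fp)
{
	unsigned long ul;
	long	l;

	/* Assemble the value one 8-bit byte at a time, high byte first. */
	ul = (unsigned long) getc (fp) & 0xff;
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);
	ul = (ul << 8) | ((unsigned long) getc (fp) & 0xff);

	/* Bit 31 is the sign; negate as a long, not as an unsigned long. */
	l = (ul & 0x80000000UL) ? - (long) (ul & 0x7fffffffUL)
				: (long) ul;
	return l;
}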

Oh yes I am assuming that a character is 8 bits and the machine
is using the ASCII character set. If that is not the case the 
program must not use more than the lowest eight bits of any
character and strings must be converted to ASCII first.

--
|    _  | Peter J. Holzer                       | Think of it   |
| |_|_) | Technical University Vienna           | as evolution  |
| | |   | Dept. for Real-Time Systems           | in action!    |
| __/   | hp@vmars.tuwien.ac.at                 |     Tony Rand |

cdm@gem-hy.Berkeley.EDU (Dale Cook) (11/27/90)

In article <prk.659625326@pegasus>, prk@planet.bt.co.uk (KnightRider) writes:
|> stiber@cs.ucla.edu (Michael D Stiber) writes:
|> 
|> The short, easy, answer is yes - Use the facilities provided by the 
|> presentation layer services provided by your communications system, 
|> if you have one.  For OSI systems this will be ASN.1 -Abstract Syntax 
|> Notation 1.  If you use Sun systems, you may be able to use XDR 
|> -External Data Representation.
|> 
|> However, these are difficult services to encode, usually, so you may
|> want to go to an external vendor to support these.  
|> 

XDR libraries exist on a host of UNIX environments (VAX, Cray, and MASSCOMP
to name a few), and they're not that hard to use.  I believe Sun developed
the standard.

---Dale Cook    cdm@inel.gov

bright@nazgul.UUCP (Walter Bright) (11/29/90)

In article <1990Nov26.154631.7493@inel.gov> cdm@gem-hy.Berkeley.EDU (Dale Cook) writes:
/Neither the United States Government or the Idaho National Engineering
/Laboratory or any of their employees, makes any warranty, whatsoever,
/implied, or assumes any legal liability or responsibility regarding any
/information, disclosed, or represents that its use would not infringe
/privately owned rights.  No specific reference constitutes or implies
/endorsement, recommendation, or favoring by the United States
/Government or the Idaho National Engineering Laboratory.  The views and
/opinions expressed herein do not necessarily reflect those of the
/United States Government or the Idaho National Engineering Laboratory,
/and shall not be used for advertising or product endorsement purposes.

Hey, lighten up! This is Usenet! :-)

djones@megatest.UUCP (Dave Jones) (11/29/90)

From article <prk.659625326@pegasus>, by prk@planet.bt.co.uk (KnightRider):
> stiber@cs.ucla.edu (Michael D Stiber) writes:
>>However, in my application,
>>ASCII would produce files that are way too huge --- I must use a binary
>>format.

I suppose you have considered using "compress" and "uncompress" in
conjunction with ASCII files, but I thought I would just mention it.

Also, you don't have to use the obvious ASCII encoding (hex) either. That
converts one binary nibble into one ASCII char, doubling the data size.
It's not at all hard to get a ratio of 4/3, as uuencode does.
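
To illustrate where the 4/3 ratio comes from, here is a small sketch that
packs three bytes into four printable characters of six bits each. It
adds 32 to each 6-bit group, which is the basic idea behind uuencode, but
it is not a full uuencode implementation (no line lengths, no special
handling of zero). The function names are made up for this example, and
the character values assume ASCII.

```c
/* Pack 3 input bytes into 4 printable characters, 6 bits per character. */
void encode3(const unsigned char in[3], char out[4])
{
    out[0] = (char)(32 + (in[0] >> 2));
    out[1] = (char)(32 + (((in[0] & 0x03) << 4) | (in[1] >> 4)));
    out[2] = (char)(32 + (((in[1] & 0x0F) << 2) | (in[2] >> 6)));
    out[3] = (char)(32 + (in[2] & 0x3F));
}

/* Reverse the packing: 4 printable characters back into 3 bytes. */
void decode4(const char in[4], unsigned char out[3])
{
    unsigned a = (unsigned)(in[0] - 32) & 0x3F;
    unsigned b = (unsigned)(in[1] - 32) & 0x3F;
    unsigned c = (unsigned)(in[2] - 32) & 0x3F;
    unsigned d = (unsigned)(in[3] - 32) & 0x3F;

    out[0] = (unsigned char)((a << 2) | (b >> 4));
    out[1] = (unsigned char)(((b & 0x0F) << 4) | (c >> 2));
    out[2] = (unsigned char)(((c & 0x03) << 6) | d);
}
```

For example, the three bytes 0x43 0x61 0x74 ("Cat" in ASCII) encode to
the four characters "0V%T".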

userAKDU@mts.ucs.UAlberta.CA (Al Dunbar) (11/29/90)

In article <2172@tuvie>, hp@vmars.tuwien.ac.at (Peter Holzer) writes:
>stiber@cs.ucla.edu (Michael D Stiber) writes:
>
>
>>On different machines, the implementation of C data types is different.
<<<deletions>>>
>
>For integer data I choose the format that is used on the machines
>I am working on most of the time. Each binary data file then
>gets a header describing the data format. Something like
>
><magic number>          2 Bytes: 'P' 'B' (portable binary)
><integer type>          1 Byte: 0 = 2compl., 1 = 1compl.,
>                                2 = sign/mag,
><endianness>            1 Byte: 0 = little, 1 = big.
><float-format>          ??
>
Pardon my curiosity, but, if you write such a file on a
particular type of machine, then read it back on another,
won't your code have to do some decoding of this header
information? Say, for example, you write from an ASCII
machine and read from an EBCDIC one. The "PB" will map to
some other combination of characters. Will your program
determine from whatever they happen to be that the source
machine is ASCII? I always forget whether "endianness"
refers to the ordering of bytes in words or bits in bytes -
if the latter, your program will also have to do some
juggling to properly decode the third and fourth bytes.
What about machine architectures you don't know about yet?
What about 12 and 60 bit machines (PDP8, Cyber)?
 
If transportability is important, use ASCII (pardon, character),
and let some o/s utility do the conversion. If efficiency is
paramount, use binary and include a disclaimer about moving
the file to another machine (you can't, after all, move the
executable that way, can you?). If both are crucial, provide
a separate conversion program.
 
-------------------+-------------------------------------------
Al Dunbar          |
Edmonton, Alberta  |  "this mind left intentionally blank"
CANADA             |          - Manuel Writer
-------------------+-------------------------------------------

hp@vmars.tuwien.ac.at (Peter Holzer) (11/30/90)

userAKDU@mts.ucs.UAlberta.CA (Al Dunbar) writes:

>In article <2172@tuvie>, hp@vmars.tuwien.ac.at (Peter Holzer) writes:
>>stiber@cs.ucla.edu (Michael D Stiber) writes:
>>
>>
>>>On different machines, the implementation of C data types is different.
><<<deletions>>>
>>
>>For integer data I choose the format that is used on the machines
>>I am working on most of the time. Each binary data file then
>>gets a header describing the data format. Something like
>>
>><magic number>          2 Bytes: 'P' 'B' (portable binary)
>><integer type>          1 Byte: 0 = 2compl., 1 = 1compl.,
>>                                2 = sign/mag,
>><endianness>            1 Byte: 0 = little, 1 = big.
>><float-format>          ??
>>
>Pardon my curiosity, but, if you write such a file on a
>particular type of machine, then read it back on another,
>won't your code have to do some decoding of this header
>information? Say, for example, you write from an ASCII
>machine and read from an EBCDIC one. The "PB" will map to
>some other combination of characters. Will your program
>determine from whatever they happen to be that the source
>machine is ASCII? I always forget whether "endianness"
>refers to the ordering of bytes in words or bits in bytes -
>if the latter, your program will also have to do some
>juggling to properly decode the third and fourth bytes.
>What about machine architectures you don't know about yet?
>What about 12 and 60 bit machines (PDP8, Cyber)?

You left out the last four lines of my posting:

=Oh yes I am assuming that a character is 8 bits and the machine
=is using the ASCII character set. If that is not the case the 
=program must not use more than the lowest eight bits of any
=character and strings must be converted to ASCII first.

The idea of my data representation is to provide easy access to data on
the machines I usually work on. They use the ASCII character set, have
8-bit characters, 16-bit shorts and 32-bit longs (the minimum sizes
guaranteed by the ANSI C standard). So the magic number is always 0x50 0x42.

EBCDIC machines must convert character data (both on read and write).
If they have 8-bit chars, 16-bit shorts and 32-bit longs, they may read and
write integers in their native format.

Machines which have characters with more than 8 bits, shorts with more
than 16 bits, or longs with more than 32 bits must convert integer data
both on read and write. They have to split shorts into two 8-bit packets
and longs into four 8-bit packets and store these packets as consecutive
characters. Thus a machine with 9-bit characters and 36-bit shorts could
only use the lowest 16 bits of its shorts and would need two
9-bit characters to store each one in a file.
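
A sketch of the splitting described above, assuming big-endian order in
the file (the byte order is a choice made for this example, as are the
function names). Each putc() call stores one value in the range 0..255,
so the routines do not depend on how wide the machine's own char, short
or long happen to be.

```c
#include <stdio.h>

/* Write the low 16 bits of s as two 8-bit packets, high packet first.
   Returns 0 on success, -1 on a write error. */
int write_short_be(FILE *fp, unsigned short s)
{
    if (putc((s >> 8) & 0xFF, fp) == EOF)
        return -1;
    if (putc(s & 0xFF, fp) == EOF)
        return -1;
    return 0;
}

/* Write the low 32 bits of l as four 8-bit packets, high packet first.
   Returns 0 on success, -1 on a write error. */
int write_long_be(FILE *fp, unsigned long l)
{
    int shift;

    for (shift = 24; shift >= 0; shift -= 8)
        if (putc((int)((l >> shift) & 0xFF), fp) == EOF)
            return -1;
    return 0;
}
```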
> 
>If transportability is important, use ASCII (pardon, character),
>and let some o/s utility do the conversion. If efficiency is
>paramount, use binary and include a disclaimer about moving
>the file to another machine (you can't, after all, move the
>executable that way, can you?). If both are crucial, provide
>a separate conversion program.

Conversion programs are fine if the data is not moved around much. If
you have a file on a file system that is mounted by different machines,
they are not as good.  (We don't have that situation: our VAXes,
DECstations and PCs all have the same data representation; only our
real-time system, based on the 68000, does it the other way round.)
--
|    _  | Peter J. Holzer                       | Think of it   |
| |_|_) | Technical University Vienna           | as evolution  |
| | |   | Dept. for Real-Time Systems           | in action!    |
| __/   | hp@vmars.tuwien.ac.at                 |     Tony Rand |

userAKDU@mts.ucs.UAlberta.CA (Al Dunbar) (12/02/90)

In article <2188@tuvie.UUCP>, hp@vmars.tuwien.ac.at (Peter Holzer) writes:
>userAKDU@mts.ucs.UAlberta.CA (Al Dunbar) writes:
>
>>In article <2172@tuvie>, hp@vmars.tuwien.ac.at (Peter Holzer) writes:
>>>stiber@cs.ucla.edu (Michael D Stiber) writes:
>>>
>>>
>>>>On different machines, the implementation of C data types is different.
>><<<deletions>>>
>>>
<<<yet more deletions>>>
>
>EBCDIC machines must convert character data (Both on read and write).
>If they have 8bit-char, 16bit-short, 32bit-long they may read and write
>integers in their native format.

If you are going to force EBCDIC machines to convert through ASCII
on input and output, why not have all versions of your program
produce identical output? Then you needn't have the short header,
or the "overhead" of decoding it.

I guess I misunderstood your original posting to be addressing
a general case of binary file portability. One problem with
your approach (though it appears to work well in your environment)
seems to be that the code must be tailored for each machine
it runs on in order to compensate for architectural differences.
Thus, you may achieve data portability, but have lost program
portability.

-------------------+-------------------------------------------
Al Dunbar          |
Edmonton, Alberta  |  "this mind left intentionally blank"
CANADA             |          - Manuel Writer
-------------------+-------------------------------------------

hp@vmars.tuwien.ac.at (Peter Holzer) (12/03/90)

Well, I hate to follow up my own articles, but maybe I should add that,
in general, if you can use any standards that are widely available, you
should use them.

Rolling your own data conversion routines is only an advantage if

1. There is no common standard between the machines you are using.

2. The problem is so small that inventing a format and writing the
   conversion routines is MUCH less work than familiarizing yourself
   with existing routines.

--
|    _  | Peter J. Holzer                       | Think of it   |
| |_|_) | Technical University Vienna           | as evolution  |
| | |   | Dept. for Real-Time Systems           | in action!    |
| __/   | hp@vmars.tuwien.ac.at                 |     Tony Rand |

rabbieh@ajpo.sei.cmu.edu (Harold Rabbie) (12/13/90)

Jeez, it's even worse than that.  You can get binary file incompatibility
on the SAME machine.  Try writing a file using Ultrix-C on a VAX,
then read it back using VMS-C on another VAX.  Ultrix word-aligns
structs, VMS doesn't.  All on the same VAX.
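
One way around the padding problem is to never write a struct in one
piece. The sketch below uses a made-up struct record and a hypothetical
write_record() that stores each member separately, so whatever padding a
compiler puts between the members never reaches the file. It assumes the
value fits in 32 bits and picks big-endian order for the file.

```c
#include <stdio.h>
#include <stddef.h>

/* Example record; compilers may insert padding between tag and value. */
struct record {
    char tag;
    long value;
};

/* Bytes of padding this compiler places before 'value'. */
size_t record_padding(void)
{
    return offsetof(struct record, value) - sizeof(char);
}

/* Write the members one by one: 1 byte for tag, then the low 32 bits
   of value, most significant byte first.  Always 5 bytes in the file.
   Returns 0 on success, -1 on a write error. */
int write_record(FILE *fp, const struct record *r)
{
    unsigned long v = (unsigned long)r->value;
    int shift;

    if (putc((unsigned char)r->tag, fp) == EOF)
        return -1;
    for (shift = 24; shift >= 0; shift -= 8)
        if (putc((int)((v >> shift) & 0xFF), fp) == EOF)
            return -1;
    return 0;
}
```

By contrast, fwrite(&r, sizeof r, 1, fp) would write sizeof(struct record)
bytes, padding included, and that size can differ between compilers even
on the same hardware.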

bilbo@bisco.kodak.COM (Charles Tryon) (12/18/90)

I have (as I have noted here previously) been using XDR to solve this problem.
I am wondering, however, how many systems/machines subscribe to this standard?
We have various flavors of Suns here (3.4, 4.0.1 Sun3, Sparc) as well as an
IBM RS/6000, all of which know about XDR.  What other machines out there have
XDR libraries?

--
Chuck Tryon
    <bilbo@bisco.kodak.com>
    USmail: 46 Post Ave.;Roch. NY 14619                       B. Baggins
    <<...include standard disclamer...>>                      At Your Service

  "Swagger knows no upper bound, but the laws of physics remain unimpressed."
                                                            (D. Mocsny)

hp@vmars.tuwien.ac.at (Peter Holzer) (12/20/90)

bilbo@bisco.kodak.COM (Charles Tryon) writes:

>I have (as I have noted here previously) been using XDR to solve this problem.
>I am wondering, however, how many systems/machines subscribe to this standard?
>We have various flavors of Sun's here (3.4, 4.0.1 Sun3, Sparc) as well as an
>IBM RS/6000 which all know about XDR.  What other machines out there have
>XDR libraries?

On our DECstations running Ultrix (3.1 and 4.0), using sockets causes
lots of functions called xdr_* to be linked in. So obviously an XDR
library does exist. Unfortunately, no manual pages exist for these
functions.
--
|    _  | Peter J. Holzer                       | Think of it   |
| |_|_) | Technical University Vienna           | as evolution  |
| | |   | Dept. for Real-Time Systems           | in action!    |
| __/   | hp@vmars.tuwien.ac.at                 |     Tony Rand |

cdm@gem-hy.Berkeley.EDU (Dale Cook) (12/21/90)

In article <2216@tuvie.UUCP>, hp@vmars.tuwien.ac.at (Peter Holzer) writes:
|> bilbo@bisco.kodak.COM (Charles Tryon) writes:
|> 
|> >I have (as I have noted here previously) been using XDR to solve this problem.
|> >I am wondering, however, how many systems/machines subscribe to this standard?
|> >We have various flavors of Sun's here (3.4, 4.0.1 Sun3, Sparc) as well as an
|> >IBM RS/6000 which all know about XDR.  What other machines out there have
|> >XDR libraries?
|> 
|> On our DECstations running Ultrix (3.1 and 4.0) using sockets causes
|> lots of functions called xdr_* to be linked in. So obviously an XDR
|> library does exist. Unfortunately no manual pages do exist for these
|> functions.
|> 

We use it on our Masscomp and Cray/Unicos systems as well.  It is _very_
well documented by the Masscomp folks (Concurrent Computing).  I believe
that most, if not all, RPC implementations incorporate the XDR protocol.

--- Dale Cook     cdm@inel.gov
