[comp.std.internat] data announcement

erik@sra.co.jp (Erik M. van der Poel) (06/17/91)
	A Proposed Solution for the Data Announcement Problem


1. Introduction

This document proposes a solution for the problem of (the lack of)
"data announcement". That is, if we have a series of binary digits
(bits), how do computers and how do humans determine what it is?

The Macintosh "knows" quite a lot about its files. It is possible to
start up an application simply by double-clicking on an icon. The
contents of the file are stored in the data fork, and the "metadata"
(data that describes the data in the file) is stored in the resource
fork. The metadata identifies the application and the icon associated
with the file.

Although the Macintosh stores both the data and the metadata for each
file, many other systems store only the data on such media as
floppies, cartridges, and so on. In these cases, it is impossible to
start up the right program automatically. The human user is required
to instruct the computer, often through the keyboard.

Worse, the user may not know which program to start up since the label
may have been lost, or the user may not know which character encoding
was used for textual data, simply because the originator did not
include it in the label.

Even if the data is tagged with an indication of the associated
application, the computer may not be able to start it up, since that
program may not be accessible on the current system. In this
situation, it is highly desirable for the user to have a simple way to
find out what kind of data it is, and which application is needed.

The solution proposed here is simple (and therefore easy to implement)
and highly extensible. The data announcers are intended to be readable
by both computers and humans.

2. Syntax

Although the syntax of the data announcement scheme is not very
important (as long as we agree on it), a syntax is proposed here to
give an idea of the readability and extensibility that is intended.

The data announcement is done by prepending ASCII headers to the data. 
The headers are in the format defined by the Internet electronic mail
standard RFC-822. The headers are a series of name-value pairs on
consecutive lines, ending with an empty line. The following is an
example of a couple of headers, followed by the file itself.

	Content-Type: text
	Char-Encoding: ascii

	This file contains one line of text.

It is possible to add new header types freely, since the system only
has to pick up the information that it knows about and needs, and then
look for the empty line. If a new header type has been added and the
user wants the system to interpret it correctly, the user will have to
upgrade the system. This is normal evolution.
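
The reading rule just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not a
definitive reader: the function names and the X-Future header are
invented for the example, line ends are taken to be LF or CRLF, and
RFC-822 continuation lines are ignored.

```python
import io

def read_announcement(stream):
    """Split a byte stream into (headers, body).

    Headers are "Name: value" lines; the first empty line ends them.
    """
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\r\n")
        if not line:                      # the empty line ends the headers
            break
        name, sep, value = line.partition(b":")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        # RFC-822 field names are case-insensitive, so fold them to lower case.
        key = name.strip().decode("ascii").lower()
        headers[key] = value.strip().decode("ascii")
    return headers, stream.read()

# Header types this (hypothetical) system knows about.
KNOWN = {"content-type", "char-encoding"}

def known_headers(headers):
    """Keep only the headers the system understands; skip the rest."""
    return {k: v for k, v in headers.items() if k in KNOWN}

data = (b"Content-Type: text\n"
        b"Char-Encoding: ascii\n"
        b"X-Future: whatever\n"           # invented header, silently skipped
        b"\n"
        b"This file contains one line of text.\n")
headers, body = read_announcement(io.BytesIO(data))
```

An old reader that has never heard of X-Future still finds the empty
line and the data; only a reader that wants to act on the new header
needs to be upgraded.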

Obviously, it is necessary to define a number of header types before
this scheme is used, so that the system may behave intelligently. 
Presumably, this would include the application name header and a
number of content types such as text, which can be displayed by many
different programs.

3. Why ASCII Headers?

One of the reasons for choosing ASCII email headers for the data
announcement scheme is that they are compatible with a widely used
standard, namely RFC-822. In fact, the IETF (Internet Engineering Task
Force) is currently discussing extensions for RFC-822 that would allow
automatic start-up of applications, extraction of files, etc.

Another reason for using ASCII is that it is a very small and simple
set of characters that can be displayed or printed on a very large
number of current systems.

This may seem rather biased towards English-speaking populations.
However, it should not be a problem, since, although English is not a
universal language, it is rapidly becoming a universal *second*
language. Also, it is possible to add further sets of headers in other
languages by including something like "Another-Header:". This would
indicate the presence of a set of headers after the current one, and
could be used to include any number of header sets.
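
A reader for chained header sets might look like the following Python
sketch. Everything beyond the name "Another-Header" is an assumption
made for the example: the proposal does not say what value the header
carries (a language tag is used here), nor how later sets are encoded
(ASCII is used here), and the Dutch header names are invented.

```python
import io

def read_header_set(stream):
    """Read one set of "Name: value" lines, up to the next empty line."""
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.rstrip(b"\r\n").decode("ascii")
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers

def read_all_header_sets(stream):
    """Keep reading sets for as long as the current one announces another."""
    sets = []
    while True:
        current = read_header_set(stream)
        sets.append(current)
        if "another-header" not in current:
            return sets, stream.read()

data = (b"Content-Type: text\n"
        b"Another-Header: nl\n"       # assumed value: language of the next set
        b"\n"
        b"Inhoud-Type: tekst\n"       # invented Dutch equivalent
        b"\n"
        b"Hello\n")
sets, body = read_all_header_sets(io.BytesIO(data))
```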

4. POSIX Specific Details

As POSIX is becoming increasingly important in the world of
information technology standards, a brief discussion of the
implications for POSIX is included here. However, the principles
should also be applicable to other operating systems.

UNIX files have traditionally contained only the data of the file.
Some metadata is kept in a structure called the inode. This includes
such fields as the size of the file, the time of the last
modification, etc. However, this metadata does not include the name of
the application associated with the file, and the inode is not freely
extensible.

This problem can be solved by adding a new system call. The current
system calls for accessing the data and the metadata are open() and
stat(), respectively. It is proposed here to add a new system call
called, for example, mopen(), which would open a freely extensible
metadata part of the file. For compatibility with the header scheme,
this metadata should also be in the header format.

Old programs would be able to continue to use open() and stat(), while
new applications could use mopen() to read or write metadata
associated with the file. For example, a GUI (graphical user
interface) might find the name of the associated application here, for
automatic start-up when the user double-clicks on the file's icon.
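
To give a feel for how mopen() might sit next to open(), here is a
user-space stand-in written in Python. It is not the proposed system
call: a real mopen() would live in the kernel, whereas this sketch
keeps the metadata in a sidecar file. The ".meta" suffix, the
"Application" header name and its value are all invented for the
illustration.

```python
import os
import tempfile

META_SUFFIX = ".meta"   # hypothetical sidecar naming, not part of the proposal

def mopen(path, mode="r"):
    """Stand-in for the proposed mopen(): open the metadata part of a file."""
    return open(path + META_SUFFIX, mode, encoding="ascii")

def read_metadata(path):
    """Parse the metadata part, which uses the same header format."""
    headers = {}
    with mopen(path) as meta:
        for line in meta:
            line = line.rstrip("\n")
            if not line:              # the empty line ends the headers
                break
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "report")
    # An old program keeps using open() and sees only the data.
    with open(path, "w", encoding="ascii") as f:
        f.write("This file contains one line of text.\n")
    # A new program writes and reads the metadata through mopen().
    with mopen(path, "w") as meta:
        meta.write("Content-Type: text\nApplication: editor\n\n")
    metadata = read_metadata(path)
    with open(path, encoding="ascii") as f:
        data = f.read()
```

The point of the sketch is only that the data seen through open() is
unchanged, so old programs are not disturbed by the presence of
metadata.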

This also means that the POSIX data interchange format needs to be
updated to be able to include the metadata together with the data.

5. Coded Character Sets

As mentioned above, one of the reasons for using a data announcement
scheme is to solve the coded character set problem. Over the years,
standards bodies have defined many de jure codesets, and vendors have
defined many de facto codesets. These codesets are often used in text
files without any indication whatsoever of the codeset being used.

When users of a system store textual files in different languages, the
problem arises that users cannot "blindly" share these files without
knowing the codeset and informing the computer of this. This can also
be a problem when remote filesystems are mounted across the network.

In addition, some codesets can be used for several different human
languages. It is sometimes necessary, for example when spell-checking,
to know which language the text is in. This information could be
stored in headers, or it could be stored in-line or in the codes, if
the text is multilingual.
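
As an illustration of how an announced codeset removes the guesswork,
the following Python sketch decodes a text file according to its
Char-Encoding header. It assumes LF line ends, and it assumes that
header values can be mapped onto the codec names known to Python's
codecs module; a real scheme would use registered names instead. Only
the value "ascii" appears in this proposal, so "iso-8859-1" is an
invented example value.

```python
import codecs

def decode_announced_text(data):
    """Decode the body of announced text using its Char-Encoding header."""
    head, sep, body = data.partition(b"\n\n")
    if not sep:
        raise ValueError("no empty line ends the headers")
    headers = {}
    for line in head.decode("ascii").split("\n"):
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    if headers.get("content-type") != "text":
        raise ValueError("data is not announced as text")
    encoding = headers.get("char-encoding")
    if encoding is None:
        raise ValueError("text without a Char-Encoding header")
    try:
        codec = codecs.lookup(encoding)    # refuse codesets we do not know
    except LookupError:
        raise ValueError("unknown codeset: " + encoding) from None
    return body.decode(codec.name)

plain = (b"Content-Type: text\nChar-Encoding: ascii\n\n"
         b"This file contains one line of text.\n")
latin = b"Content-Type: text\nChar-Encoding: iso-8859-1\n\ncaf\xe9\n"
plain_text = decode_announced_text(plain)
latin_text = decode_announced_text(latin)
```

Without the header, the byte 0xE9 in the second example could belong
to any of a number of codesets, and the reader would have to guess.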

6. Transparent File Access

One of the more popular network services these days is so-called
transparent file access. In order to allow a data announcement scheme
to be used with remote filesystems, the specifications need to be
updated to include transmission of the metadata associated with a
file.

If POSIX is extended to specify an interface for metadata in the ASCII
header format, the mapping to an email-like message for communication
in the transparent file access protocol will be straightforward. On
other systems such as the Macintosh, the metadata may not be in
human-readable ASCII, so remote file implementations would have to
derive ASCII headers from the resource fork.

Adding metadata to remote files would also solve the problem of
inadvertent invocation of binary executables that only work with other
CPU types, since the user agent could check the type before attempting
to start up the application. This may become less of a problem when
ABIs are widely used.

7. Media

Although it is emphasized that the data announcement scheme is
intended to be usable with any series of bits, it will not always be
appropriate for the ASCII header to be placed at the very beginning of
a medium. For example, 1/4 inch magnetic tape cartridges all have one
particular outer format in common, and the system software that reads
these tapes parses the outer layer of bits to retrieve the inner data
itself. In these cases, it would be natural to prepend the ASCII
header to the inner data rather than to the medium as a whole.

Although it is theoretically possible to enhance audio CD players to
understand the headers being proposed here, the CD player
manufacturers may decide not to add data announcers to audio CDs. They
may feel that it is not worth the trouble. Similarly, if it is felt
that floppy disks will soon be phased out of production and use, it
may be decided not to enhance floppies either.

While the data announcement is intended to allow the systems to
process the data automatically, thereby relieving the human user of
the burden of learning new keyboard commands, the user will be
expected to learn at least a few basics, such as the fact that
inserting a floppy into the CD-ROM drive is unhealthy.

8. Related Work

The data announcement scheme outlined here is rather similar to SGML's
DTD method, which also involves using human-readable text to inform
the application about the contents of the document.

There also exists a data interchange standard called ISO 1001 that
uses (a subset of) ASCII for the headers, although this standard is
really meant for interchange of files only, as opposed to "live" data
that can be used to decide which application to invoke.

The data announcers will need to use object identifiers, which should
be registered with the appropriate authority, so that users can agree
on the meaning of the headers. CCITT X.208 (also known as ISO 8824)
describes the registration process for object IDs. These IDs are used
in X.400(1988) (also known as ISO 10021), but can be used elsewhere as
well.

It is known that a lot of research has been done in this area. The
problem seems to be that none of the schemes is widely implemented.

9. Migration

In order to allow systems of the future to deal intelligently with old
data that does not contain the data announcers, a mechanism is needed
to distinguish old and new data. It is proposed to start every header
with a long series of bits that is unlikely to appear in any old data.
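
The check for old versus new data could be as simple as the following
Python sketch. The signature shown is invented purely for the example;
the proposal does not fix its value or its length, and a real one
would have to be chosen and registered with care. LF line ends are
assumed.

```python
# Hypothetical signature; the proposal leaves the actual bits open.
MAGIC = b"%%DATA-ANNOUNCEMENT-1991%%\n"

def split_announced(data):
    """Return (header_block, payload); header_block is None for old data."""
    if not data.startswith(MAGIC):
        return None, data                 # old data: pass through untouched
    rest = data[len(MAGIC):]
    head, sep, payload = rest.partition(b"\n\n")
    if not sep:
        # Signature present but no empty line: safer to treat as old data.
        return None, data
    return head, payload

new_data = MAGIC + b"Content-Type: text\n\nhello\n"
old_data = b"hello\n"
new_result = split_announced(new_data)
old_result = split_announced(old_data)
```

The longer the signature, the smaller the chance that old data is
mistaken for new data, at the cost of a few more bytes per file.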

It should be fairly obvious that headers should not be prepended to
data before the systems that interpret the data are updated. Although
it is possible for one particular user to accelerate the migration
process by updating the systems and data quickly, this user may then
have problems exchanging data with other users.

If the number of users of the new data announcement scheme is small,
problems will frequently be encountered in data interchange, probably
to such an extent that a return to the old system will be desirable.
On the other hand, if the number of users of the new system is large,
the old system users will be inconvenienced, and will want to upgrade. 
So it is quite clear that a certain critical mass of new systems will
have to be reached in order to succeed.

10. A Long Term Plan

  I.   Promotion and Acceptance of the Concept and Plan

    The first stage will involve letting people know about the proposal,
    and refining it to try to gain acceptance of the ideas and the plan.

  II.  Creating and Distributing Readers

    The most important step is to update system software in such a way
    that files are no longer considered to be just data. The stream must
    be divided into two parts, the metadata and the data. This
    modification will be easy to make if the headers are kept extremely
    simple, as outlined above. For example, floppy disk drivers will need
    to be updated to parse the headers. Also, new transparent file access
    protocols will need to be implemented. However, these modifications
    should only allow the headers to be read, not written, so that no data
    will be produced that might cause problems for users who have not yet
    upgraded to the new system.

  III. Registering Header Names and Values

    In parallel with Stage II, current data and metadata need to be
    researched extensively to collect and discuss header name-value pairs.
    As agreements are reached, these names and object IDs should be
    registered internationally.

  IV.  Creating and Distributing Writers

    After some time, many users will have new system software that can
    read the headers. At this point, which should be well beyond the
    (imagined) critical mass, systems vendors can start to distribute
    software that writes the headers.

  V.   Evolution of User Agents and Applications

    When the underlying system has settled down, users can start to
    install agents such as the GUIs mentioned above. Also, applications
    can be updated to take advantage of the metadata by including
    application-specific headers, and so on.

11. Acknowledgements

The author wishes to thank John C. Klensin of the Massachusetts
Institute of Technology for invaluable guidance throughout the idea
formation period, and for providing information about previous
research in related areas.

END

---------------

Please post or email comments on this proposal. In particular, I would
like to know about other work in this area.
-- 
Erik M. van der Poel                                      erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan     TEL +81-3-3234-2692