erik@sra.co.jp (Erik M. van der Poel) (06/17/91)
A Proposed Solution for the Data Announcement Problem

1. Introduction

This document proposes a solution for the problem of (the lack of) "data announcement". That is, if we have a series of binary digits (bits), how do computers and how do humans determine what it is?

The Macintosh "knows" quite a lot about its files. It is possible to start up an application simply by double clicking on an icon. The file itself is stored in the data part, and the "metadata" (data that describes the data in the file) is stored in the resources part. The metadata contains the application name and the icon associated with the file.

Although the Macintosh stores both the data and the metadata for each file, many other systems store only the data on such media as floppies, cartridges, and so on. In these cases, it is impossible to start up the right program automatically. The human user is required to instruct the computer, often through the keyboard. Worse, the user may not know which program to start up since the label may have been lost, or the user may not know which character encoding was used for textual data, simply because the originator did not include it in the label.

Even if the data is tagged with an indication of the associated application, the computer may not be able to start it up, since that program may not be accessible on the current system. In this situation, it is highly desirable for the user to have a simple way to find out what kind of data it is, and which application is needed.

The solution proposed here is simple (and therefore easy to implement) and highly extensible. The data announcers are intended to be readable by both computers and humans.

2. Syntax

Although the syntax of the data announcement scheme is not very important (as long as we agree on it), a syntax is proposed here to give an idea of the readability and extensibility that is intended.

The data announcement is done by prepending ASCII headers to the data.
The headers are in the format defined by the Internet electronic mail standard RFC-822. The headers are a series of name-value pairs on consecutive lines, ending with an empty line. The following is an example of a couple of headers, followed by the file itself.

    Content-Type: text
    Char-Encoding: ascii

    This file contains one line of text.

It is possible to add new header types freely, since the system only has to pick up the information that it knows about and needs, and then look for the empty line. If a new header type has been added and the user wants the system to interpret it correctly, the user will have to upgrade the system. This is normal evolution.

Obviously, it is necessary to define a number of header types before this scheme is used, so that the system may behave intelligently. Presumably, this would include the application name header and a number of content types such as text, which can be displayed by many different programs.

3. Why ASCII Headers?

One of the reasons for choosing ASCII email headers for the data announcement scheme is that they are compatible with a widely used standard, namely RFC-822. In fact, the IETF (Internet Engineering Task Force) is currently discussing extensions for RFC-822 that would allow automatic start-up of applications, extraction of files, etc.

Another reason for using ASCII is that it is a very small and simple set of characters that can be displayed or printed on a very large number of current systems. This may seem rather biased towards English-speaking populations. However, it should not be a problem, since, although English is not a universal language, it is rapidly becoming a universal *second* language.

Also, it is possible to add further sets of headers in other languages by including something like "Another-Header:". This would indicate the presence of a set of headers after the current one, and could be used to include any number of header sets.

4. POSIX Specific Details

As POSIX is becoming increasingly important in the world of information technology standards, a brief discussion of the implications for POSIX is included here. However, the principles should also be applicable to other operating systems.

UNIX files have traditionally only contained the data of the file. Some metadata was kept in a structure called the inode. This included such fields as the size of the file, the time of the last modification, etc. However, the metadata does not include the name of the application associated with the file, and the inode is not freely extensible.

This problem can be solved by adding a new system call. The current system calls for accessing the data and the metadata are open() and stat(), respectively. It is proposed here to add a new system call called, for example, mopen(), which would open a freely extensible metadata part of the file. For compatibility with the header scheme, this metadata should also be in the header format.

Old programs would be able to continue to use open() and stat(), while new applications could use mopen() to read or write metadata associated with the file. For example, a GUI (graphical user interface) might find the name of the associated application here, for automatic start-up when the user double clicks on the file's icon.

This also means that the POSIX data interchange format needs to be updated to be able to include the metadata together with the data.

5. Coded Character Sets

As mentioned above, one of the reasons for using a data announcement scheme is to solve the coded character set problem. Over the years, standards bodies have defined many de jure codesets, and vendors have defined many de facto codesets. These codesets are often used in text files without any indication whatsoever of the codeset being used.
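
To illustrate how an announced codeset could remove this guesswork, the following is a minimal sketch in Python; it is not part of the proposal itself. It reads data in the header format shown in Section 2 and decodes the text using the Char-Encoding header. The function names split_announced and read_text are hypothetical, and the sketch assumes bare LF line endings, no folded (continued) header lines, and Char-Encoding values that happen to match Python codec names; the proposal leaves the registered values to be defined.

```python
def split_announced(data: bytes):
    """Split announced data into (headers, body).

    Assumption: headers are ASCII "Name: value" lines separated by LF,
    and the first empty line ends them. Names are compared
    case-insensitively, as RFC-822 field names are.
    """
    head, sep, body = data.partition(b"\n\n")
    if not sep:
        raise ValueError("no empty line terminating the headers")
    headers = {}
    for line in head.decode("ascii").split("\n"):
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip().lower()] = value.strip()
    return headers, body


def read_text(data: bytes) -> str:
    """Decode the body using the codeset named in Char-Encoding."""
    headers, body = split_announced(data)
    if headers.get("content-type") != "text":
        raise ValueError("data is not announced as text")
    encoding = headers.get("char-encoding")
    if encoding is None:
        raise ValueError("text data without a Char-Encoding header")
    # Assumption: the header value is usable as a Python codec name.
    # An unknown name raises LookupError here.
    return body.decode(encoding)


# The example from Section 2, written out as bytes.
sample = (
    b"Content-Type: text\n"
    b"Char-Encoding: ascii\n"
    b"\n"
    b"This file contains one line of text.\n"
)
print(read_text(sample))
```

In this sketch, text that carries no Char-Encoding header is rejected rather than guessed at; a real system would presumably fall back to asking the user, as happens today.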

When users of a system store textual files in different languages, the problem arises that users cannot "blindly" share these files without knowing the codeset and informing the computer of this. This can also be a problem when remote filesystems are mounted across the network.

In addition, some codesets can be used for several different human languages. It is sometimes necessary, for example when spell-checking, to know which language the text is in. This information could be stored in headers, or it could be stored in-line or in the codes, if the text is multilingual.

6. Transparent File Access

One of the more popular network services these days is so-called transparent file access. In order to allow a data announcement scheme to be used with remote filesystems, the specifications need to be updated to include transmission of the metadata associated with a file.

If POSIX is extended to specify an interface for metadata in the ASCII header format, the mapping to an email-like message for communication in the transparent file access protocol will be straightforward. On other systems such as the Macintosh, the metadata may not be in human-readable ASCII, so remote file implementations would have to derive ASCII headers from the resource fork.

Adding metadata to remote files would also solve the problem of inadvertent invocation of binary executables that only work with other CPU types, since the user agent could check the type before attempting to start up the application. This may become less of a problem when ABIs are widely used.

7. Media

Although it is emphasized that the data announcement scheme is intended to be usable with any series of bits, it will not always be appropriate for the ASCII header to be placed at the very beginning of a medium. For example, 1/4 inch magnetic tape cartridges all have one particular outer format in common, and the system software that reads these tapes parses the outer layer of bits to retrieve the inner data itself.
In these cases, it would be natural to prepend the ASCII header to the data itself.

Although it is theoretically possible to enhance audio CD players to understand the headers being proposed here, the CD player manufacturers may decide not to add data announcers to audio CDs. They may feel that it is not worth the trouble. Similarly, if it is felt that floppy disks will soon be phased out of production and use, it may be decided not to enhance floppies either.

While the data announcement is intended to allow the systems to process the data automatically, thereby relieving the human user of the burden of learning new keyboard commands, the user will be expected to learn at least a few basics, such as the fact that inserting a floppy into the CD-ROM drive is unhealthy.

8. Related Work

The data announcement scheme outlined here is rather similar to SGML's DTD method, which also involves using human-readable text to inform the application about the contents of the document.

There also exists a data interchange standard called ISO 1001, which uses (a subset of) ASCII for the headers, although this standard is really meant for interchange of files only, as opposed to "live" data that can be used to decide which application to invoke.

The data announcers will need to use object identifiers, which should be registered with the appropriate authority, so that users can agree on the meaning of the headers. CCITT X.208 (also known as ISO 8824) describes the registration process for object IDs. These IDs are used in X.400(1988) (also known as ISO 10021), but can be used elsewhere as well.

It is known that a lot of research has been done in this area. The problem seems to be that none of the schemes is widely implemented.

9. Migration

In order to allow systems of the future to deal intelligently with old data that does not contain the data announcers, a mechanism is needed to distinguish old and new data.
It is proposed to start every header with a long series of bits that is unlikely to appear in any old data.

It should be fairly obvious that headers should not be prepended to data before the systems that interpret the data are updated. Although it is possible for one particular user to accelerate the migration process by updating the systems and data quickly, this user may then have problems exchanging data with other users.

If the number of users of the new data announcement scheme is small, problems will frequently be encountered in data interchange, probably to such an extent that a return to the old system will be desirable. On the other hand, if the number of users of the new system is large, the old system users will be inconvenienced, and will want to upgrade. So it is quite clear that a certain critical mass of new systems will have to be reached in order to succeed.

10. A Long Term Plan

I. Promotion and Acceptance of the Concept and Plan

The first stage will involve letting people know about the proposal, and refining it to try to gain acceptance of the ideas and the plan.

II. Creating and Distributing Readers

The most important step is to update system software in such a way that files are no longer considered to be just data. The stream must be divided into two parts, the metadata and the data. This modification will be easy to make if the headers are kept extremely simple, as outlined above.

For example, floppy disk drivers will need to be updated to parse the headers. Also, new transparent file access protocols will need to be implemented. However, these modifications should only allow the headers to be read, not written, so that no data will be produced that might cause problems for users that have not yet upgraded to the new system.

III. Registering Header Names and Values

In parallel with Stage II, current data and metadata need to be researched extensively to collect and discuss header name-value pairs.
As agreements are reached, these names and object IDs are registered internationally.

IV. Creating and Distributing Writers

After some time, many users will have new system software that can read the headers. At this point, which should be well beyond the (imagined) critical mass, systems vendors can start to distribute software that writes the headers.

V. Evolution of User Agents and Applications

When the underlying system has settled down, users can start to install agents such as the GUIs mentioned above. Also, applications can be updated to take advantage of the metadata by including application-specific headers, and so on.

11. Acknowledgements

The author wishes to thank John C. Klensin of the Massachusetts Institute of Technology for invaluable guidance throughout the idea formation period, and for providing information about previous research in related areas.

END
---------------

Please post or email comments on this proposal. In particular, I would like to know about other work in this area.

--
Erik M. van der Poel                                erik@sra.co.jp
Software Research Associates, Inc., Tokyo, Japan    TEL +81-3-3234-2692