[bionet.molbio.bio-matrix] Some thoughts on database formats

sys_ms@bmc1.uu.se (03/10/88)


I am trying to implement a relational database for nucleic acid and
protein sequences. While doing this I have come to the conclusion that
the current form of the EMBL and Genbank databases is not an optimal
input format for programs that load my database.

As an example, consider reading the keyword lines from the EMBL tape.
You read the documentation and it says: keywords are separated by
semicolons and the last keyword is followed by a period. Simple.

You run the program you just implemented, which checks for semicolons
and periods, only to find out that the keyword "4.5S RNA" is there.
You end up with the keyword "4" in the database, and you know you
should have done it another way.
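
A minimal sketch in Python of the pitfall described above. The sample
line is made up for illustration (it is not copied from the tape), the
function names are hypothetical, and the sketch assumes the whole
keyword list sits on a single "KW" line.

```python
import re

def naive_keywords(kw_line):
    """Mistaken reading: every semicolon or period is a delimiter."""
    body = kw_line[2:].strip()          # drop the two-letter line code
    parts = re.split(r"[;.]", body)
    return [p.strip() for p in parts if p.strip()]

def safer_keywords(kw_line):
    """Only the final period ends the list; semicolons separate keywords."""
    body = kw_line[2:].strip()
    if body.endswith("."):
        body = body[:-1]                # remove the terminating period only
    return [p.strip() for p in body.split(";") if p.strip()]

# Illustrative line, assumed layout: line code, spaces, keyword list.
line = "KW   4.5S RNA; ribosomal RNA."

broken = naive_keywords(line)   # the keyword "4.5S RNA" is cut at its period
fixed = safer_keywords(line)    # the keyword "4.5S RNA" survives intact
```

Even the second version depends on knowing that a period inside a
keyword is legal, which is exactly the kind of rule a loader should not
have to guess.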

I have heard that both Genbank and EMBL are trying to put their
databases into relational database systems. I would therefore suggest
that they consider distributing their data in a simple tabular
form that corresponds to the relational structures they implement.

It would be much simpler and less error-prone to load other
databases from such files.
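
To illustrate what loading from such a file could look like, here is a
rough sketch in Python using the standard csv and sqlite3 modules. The
tab-separated layout (one entry/keyword pair per row), the table name
entry_keyword and the entry identifiers are all assumptions made for
this example, not a format that either database distributes.

```python
import csv
import io
import sqlite3

# Hypothetical tabular dump: entry identifier, tab, keyword.
TABULAR = (
    "ENTRY1\t4.5S RNA\n"
    "ENTRY1\tribosomal RNA\n"
    "ENTRY2\ttransfer RNA\n"
)

def load_keywords(conn, text):
    """Load (entry_id, keyword) rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry_keyword "
        "(entry_id TEXT, keyword TEXT)"
    )
    # The tab is the only delimiter; periods and semicolons are plain data.
    rows = csv.reader(io.StringIO(text), delimiter="\t",
                      quoting=csv.QUOTE_NONE)
    conn.executemany("INSERT INTO entry_keyword VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load_keywords(conn, TABULAR)
found = [row[0] for row in conn.execute(
    "SELECT keyword FROM entry_keyword WHERE entry_id = ? ORDER BY rowid",
    ("ENTRY1",),
)]
```

With one field per column there is no punctuation to interpret, so a
keyword such as "4.5S RNA" goes into the table unchanged.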


        Mats Sundvall,
        Biomedical Center,
        University of Uppsala,
        Sweden

        mats@bmc1.uu.se
        sysdan@semax51.bitnet