[bionet.genome.arabidopsis] Further db conversations

BERLYN@yalemed.bitnet (04/20/91)

Andrew Millar and I have had a couple of follow-up exchanges that we
thought we'd put on the wire, in case others are interested in commenting or
suggesting.  I've put some additional commentary at the end of the msgs.

Date: Thu, 18 Apr 91 23:32:35 -0400
From: chualab@rockvax.ROCKEFELLER.EDU
Subject: Databases.
To: berlyn%yalemed.BITNET@cunyvm.cuny.edu
Message-id: <9104190332.AA16052@rockvax.ROCKEFELLER.EDU>
From: Andrew Millar <chualab@rockvax>

Dear Mary,
 Thanks for your reply, this was exactly the type of info' I was hoping for.
I agree that the Arabidopsis community should adopt a common database
format but we could use something now, rather than whenever the stock
center decides on its format.  The others who replied are using everything
from Excel to a Mac:  the more data they accumulate, the harder it will be to
standardise!

Some criteria we think may be important in selecting a database (but we are
obviously not experts):
a.  Compatibility (better still identity) with the database eventually
    chosen for the NSF Stock Center.
b.  Flexibility in field designation, to permit additional fields for new
    traits after the database has been set up.
c.  Ability to run fast on a PC, preferably under Windows 3.x. [With the
    fast PC's coming out now, we hope to avoid using a mainframe, but as
    data volumes grow, this may become unrealistic].
d.  Large capacity (say, 5000 entries for a lab database).
e.  User-friendly (graphic?) data entry screen that can be printed directly
    to give hard copy containing all the data on a given entry, in a
    presentable form (e.g., for sending out with a strain, viz. the DuPont
    screening sheet).
f.  dBase language, as this seems to be the market standard for PC's.
g.  Relational capability; for example, a database of clones should be
    related to a database of transgenics, or a database of mutants to a
    database of crosses.
h.  Flexibility to accept graphic fields, for scanned photographs of mutants
    (not too critical!)

I would be very interested to look at your field/form structure. Our address
is Box 301, 1230 York Avenue, New York, NY 10021-6399;  FAX (212) 570
8327.  Rob Last <JGL@cornella.cit.cornell.edu> also expressed an interest in
database planning; perhaps you could drop him a note?  If you have any
general information on Sybase, AptForms and SQL (??), we might fly them by
our mainframe people.
    If you hear anything from the NSF Arabidopsis database project, I
would be interested to hear what they are planning!
Thanks again,
Andrew.

From:   BIOMED::BERLYN       18-APR-1991 23:50:57.38
To: IN%"chualab@rockvax.ROCKEFELLER.EDU"
CC: BERLYN
Subj:   RE: Databases.

I want to continue this discussion at some earlier hour of the day.  Should
it be at the BBd. or personal level?  I recognize and sympathize with the
need to have something now, but you really need to have a careful look at
not just the volume of data, but your structure and querying needs in
order to choose a product that can meet those needs.  I've seen a strains
database in Hypercard, actually 2 or 3 of them, that the owners were happy
as clams with, but to query for combinations of even 2 or more mutations or
properties severely taxed the system.  I've used Notebook, which someone
was touting on the wire, and it was just fine for a medium sized biblio db,
but couldn't have even come close for a fraction of my current db.  You
certainly can't wait for the NSF Arabido Center; I don't think it's even
reached the approval stage yet (in terms of who and where, let alone db
design), so at best the community could probably LEAD the Center, and
maybe the discussion you've initiated can lead to that.  (I apologize for the
importance-of-planning statement above, it's no doubt preaching to the
choir, since you raised the question.)  If you approve and think this is useful
as part of the BBd discussion, I'll forward it and its sequel there tomorrow.
Mary

From:   IN%"chualab@rockvax.ROCKEFELLER.EDU" 19-APR-1991 00:03:26.79
To: BERLYN%YALEMED.BITNET@YALEVM.YCC.Yale.Edu
Subj:   RE: Databases.

Subject: RE: Databases.
In-reply-to: Your message of Thu, 18 Apr 91 23:50:00 -0500.
 <60A7FF17400014A7@YALEMED.BITNET>
To: BERLYN%YALEMED.BITNET@YALEVM.YCC.Yale.Edu
Message-id: <9104190403.AA16750@rockvax.ROCKEFELLER.EDU>

I didn't realise the NSF project was at such an early stage. Yes, by all
means forward the mail to the Network.  We have no experience of any
database other than for Biblio (we use Ref.Manager; works very well), so all
comments will be news to us.  I'm sure the wire doesn't watch the clock, but
I may not reply quite as promptly to messages too early in the day....Andrew
P.S. What do the fly people do? They must have a wee database on some Cray
somewhere.
****************************************************
Additional Comments:
I hope I haven't given misinformation about the stage of the NSF project;
maybe someone from NSF will comment.
    Concerning the list from the first message:  when you think about
these and other things you'll want, it means that at least ultimately you're
talking about a very sophisticated database.  Do you want to include access
to supporting data as well as just 'strain list, mutation and gene lists'?
Maps?  How much summarization and how much access to actual supporting
data?  I think the best thing you can do now is do those things that you have
to do to keep yourself from drowning (being careful that what you enter is
atomized and separated in a consistent fashion, not confounding concepts,
and minimizing text, so that it will be exportable) and begin to PLAN, PLAN,
PLAN for a comprehensive one that will receive these data types as a
subset of its generic community mission.  When that one happens, import
into a new structure compatible with it (capital It, the sanctioned
comprehensive db).  I think people have had good export results with
spreadsheets, or consistently delimited flat files, or small database products.
However, if you construct a table in which, for example, alleles and
phenotypic comments and synonyms are entered into one column and null
entries are not delimited, you've created a major parsing problem at export
time.  It would not have to be an interminable wait to get some core
elements of 'the comprehensive db' designed and functioning, especially if
you define 'core' as those for which models exist and function within
current technology for other organisms.  Plants and coli and people, mice,
worms, etc. all have genes and mutations and chromosomes and RFLPs and
maps of various sorts, etc. and the conceptual models will have a lot of
similarity and represent opportunities for adaptation.  Electronic notebooks
for the masses of data generated in mapping and sequencing projects also
have many models out there: lots of big labs have developed their own
customized versions.
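
To make the point about atomized entries and delimited nulls concrete, here
is a minimal sketch in Python.  The field names and the two records are
invented purely for illustration and do not come from any real strain list.
It writes one concept per column to a tab-delimited flat file, keeps a
delimiter for every empty value, and reads the file back without any
guessing.

```python
import csv
import io

# Hypothetical records: every name and value below is invented for illustration.
FIELDS = ["strain", "gene", "allele", "synonym", "phenotype"]
RECORDS = [
    {"strain": "S001", "gene": "abc1", "allele": "abc1-1",
     "synonym": "", "phenotype": "late flowering"},
    {"strain": "S002", "gene": "xyz2", "allele": "xyz2-3",
     "synonym": "old2", "phenotype": ""},
]


def export_atomized(records):
    """Write one concept per column; empty values still get their delimiter."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def import_atomized(text):
    """Read the flat file back; every row carries the same fields."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


exported = export_atomized(RECORDS)
round_trip = import_atomized(exported)
```

Because a missing synonym or phenotype is an empty field rather than an
omitted one, every line has the same number of delimiters, and the import
side never has to work out which concept a value belongs to.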
    Of the things on your list, a, b and g should be taken care of by
considerations of exportability and a good design of the comprehensive db;
for c and f, I don't think uniformity aimed at using PC or Windows or dbase
language is relevant at this point, because I don't think you're likely to end
up designing the product at that level, and the in-house interim products
probably won't be coordinated between labs.  (Workstations running Unix are
the popular alternative to pc v. mainframe, btw, and not all that expensive.)
It's the design and vocabulary and conceptual model and setting of standards
that need attention now.  (In my humble opinion.)
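
As one illustration of what "a good design" could mean for the relational
item (g), and for the kind of query that taxed the Hypercard strain lists,
here is a rough sketch using Python's built-in sqlite3 module.  The table
names, column names and data are all hypothetical, and SQLite is only a
stand-in for whatever product is eventually chosen.  Strains and mutations
live in separate tables joined by a link table, so asking for strains that
carry a combination of mutations is a single query.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strain   (strain_id   TEXT PRIMARY KEY, background TEXT);
CREATE TABLE mutation (mutation_id TEXT PRIMARY KEY, gene TEXT);
CREATE TABLE strain_mutation (
    strain_id   TEXT REFERENCES strain(strain_id),
    mutation_id TEXT REFERENCES mutation(mutation_id),
    PRIMARY KEY (strain_id, mutation_id)
);
""")
conn.executemany("INSERT INTO strain VALUES (?, ?)",
                 [("S001", "bg-A"), ("S002", "bg-A"), ("S003", "bg-B")])
conn.executemany("INSERT INTO mutation VALUES (?, ?)",
                 [("m1", "geneA"), ("m2", "geneB"), ("m3", "geneC")])
conn.executemany("INSERT INTO strain_mutation VALUES (?, ?)",
                 [("S001", "m1"), ("S001", "m2"), ("S002", "m1"),
                  ("S003", "m2"), ("S003", "m3")])


def strains_with_all(conn, mutation_ids):
    """Return strains carrying every mutation in mutation_ids (non-empty)."""
    wanted = sorted(set(mutation_ids))
    placeholders = ",".join("?" for _ in wanted)
    sql = f"""
        SELECT strain_id
        FROM strain_mutation
        WHERE mutation_id IN ({placeholders})
        GROUP BY strain_id
        HAVING COUNT(DISTINCT mutation_id) = ?
        ORDER BY strain_id
    """
    rows = conn.execute(sql, [*wanted, len(wanted)])
    return [row[0] for row in rows]


both = strains_with_all(conn, ["m1", "m2"])  # strains carrying m1 AND m2
```

With a layout like this, recording a new mutation for a strain means adding
a row, not a column, which also bears on the flexibility asked for in
item b.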
    Another humble opinion:  One thing that I thought should be
aggressively addressed, with aggressive follow-through, at the E. Lansing
and Bloomington meetings, but was actually treated pretty
lackadaisically, is nomenclature standards and enforcement.  There's some
activity in that area now, I guess an IPMB plant genetic nomenclature
committee;  I leave it to those involved to describe it.  It's obviously
critical for db compatibility and connectivity.
    How to get the Planning and Designing moving?  I'd hate to be the one
to push for workshops, not always finding them to be a productive
experience, but I'm not sure how many alternatives there are, assuming
you don't find waiting for a contract or grant to be given out and the ensuing
mechanism to take its course a suitable alternative.  A workshop can be
useful to get ideas and a sense of community consensus and maybe a
statement of goals, but the most critical need is a (presumably subsequent)
task force of 2 or 3 people with butt-in-chair mentality wrt a mission to
act on that consensus.  (Maybe an *ELECTRONIC* workshop, to avoid all this
travelling and pie-eating at the weary taxpayer's expense?)
    You ask about the fly db.  I have never been a participant in or observer
of the fly database development, but have inquired of someone who was one
of their workshoppers.  They have a number of independent efforts (including
I guess YACs and cytogenetic map and stock center and a new effort to put
together a relational db at NLM, and others with Ashburner and Merriam's
names coming up) and apparently are now working on integration.