[sci.bio] Abbreviations for ambiguous bases

roy@phri.UUCP (Roy Smith) (03/18/88)

	David Kristofferson was kind enough to send me Rich Roberts'
official enzyme list.  Looking the list over, I'm confused about what the
non-standard bases mean.  For example, I see:

AccI                           GT^JKAC
AeuI (EcoRII)                  CC^LGG

	I've never seen the J or L before.  I would guess that J is [AC]
but I've always used M for that, which I thought was the IUPAC standard.
Here's an extract from an include file I always use:

# define BASE_A         1       /* Adenine */
# define BASE_C         2       /* Cytosine */
# define BASE_G         3       /* Guanine */
# define BASE_T         4       /* Thymine */
# define BASE_U         5       /* Uracil */
# define BASE_R         6       /* A or G (puRine) */
# define BASE_Y         7       /* C or T (pYrimidine) */
# define BASE_M         8       /* A or C */
# define BASE_W         9       /* A or T */
# define BASE_S         10      /* C or G */
# define BASE_K         11      /* G or T */
# define BASE_B         12      /* C, G, or T (not A) */
# define BASE_D         13      /* A, G, or T (not C) */
# define BASE_H         14      /* A, C, or T (not G) */
# define BASE_V         15      /* A, C, or G (not T) */
# define BASE_N         16      /* A, C, G, or T (anything) */
# define BASE_BLK       17      /* Blank, place holder for insertions */
# define BASE_ERR       18      /* Error, (illegal character on input) */

	Did the standard change, or was I misled, or is Rich Roberts using
his own notation, or what?  Come to think of it, if I had my way, I think I
might vote for dropping the special multi-base abbreviations altogether
and forcing people who cared about such things to learn about regular
expressions; gt[ac][gt]ac makes a lot more sense to me than either gtmkac
or gtjkac.  The notational convenience of one-base, one-position often
doesn't seem worth the effort of having to remember all those non-mnemonic
abbreviations (not to mention the fact that everybody seems to have their
own idea of what those abbreviations should be).
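
	A minimal sketch of the regular-expression idea, assuming the
IUPAC meanings listed in the include file above.  The function names
iupac_members() and iupac_to_regex() are hypothetical, made up for this
illustration, and the cut marker (^) used in the enzyme list is not
handled; any character outside the IUPAC set is simply rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Bases that one IUPAC code stands for, or NULL if the code is unknown. */
static const char *iupac_members(char code)
{
    switch (toupper((unsigned char)code)) {
    case 'A': return "a";
    case 'C': return "c";
    case 'G': return "g";
    case 'T': return "t";
    case 'U': return "u";
    case 'R': return "ag";    /* puRine */
    case 'Y': return "ct";    /* pYrimidine */
    case 'M': return "ac";
    case 'W': return "at";
    case 'S': return "cg";
    case 'K': return "gt";
    case 'B': return "cgt";   /* not A */
    case 'D': return "agt";   /* not C */
    case 'H': return "act";   /* not G */
    case 'V': return "acg";   /* not T */
    case 'N': return "acgt";  /* anything */
    default:  return NULL;
    }
}

/* Expand a site written in IUPAC codes into bracket-expression form.
   Writes a NUL-terminated string into out (capacity outsz bytes).
   Returns 0 on success, -1 on an unknown code or if out is too small. */
static int iupac_to_regex(const char *site, char *out, size_t outsz)
{
    size_t used = 0;

    assert(site != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    for (; *site != '\0'; site++) {
        const char *members = iupac_members(*site);
        size_t len, need;

        if (members == NULL)
            return -1;
        len = strlen(members);
        /* Ambiguous positions need two extra bytes for the brackets. */
        need = (len > 1) ? len + 2 : len;
        if (used + need + 1 > outsz)
            return -1;

        if (len > 1)
            out[used++] = '[';
        memcpy(out + used, members, len);
        used += len;
        if (len > 1)
            out[used++] = ']';
    }
    out[used] = '\0';
    return 0;
}
```

	Called as iupac_to_regex("GTMKAC", buf, sizeof buf), this fills buf
with gt[ac][gt]ac.  The MOLGEN-style spelling GTJKAC returns -1, because J
is not an IUPAC code.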
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016

dd@beta.UUCP (Dan Davison) (03/18/88)

In article <3193@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> 
> 	David Kristofferson was kind enough to send me Rich Roberts'
> official enzyme list.  Looking the list over, I'm confused about what the
> non-standard bases mean.  For example, I see:
> 
> AccI                           GT^JKAC
> AeuI (EcoRII)                  CC^LGG
> 
> 	I've never seen the J or L before.  I would guess that J is [AC]
> 
> 	Did the standard change, or was I misled, or is Rich Roberts using
> his own notation, or what?  Come to think of it, if I had my way, I think I
> Roy Smith, {allegra,cmcl2,philabs}!phri!roy

It's the ambiguous base code developed by the MOLGEN project at SU
SUMEX-AIM.STANFORD.EDU back in the dawn of time, 1979-1980.  It bears no
resemblance to the Staden or IUPAC codes.

dan davison theoretical biology los alamos national laboratory t-10 ms k710
los alamos, nm 87544  dd@lanl.gov, dd@lanl.UUCP, ...cmcl2!lanl!dd

-- 
dan davison/theoretical biology/t-10 ms k710/los alamos national laboratory
los alamos, nm 875545/dd@lanl.gov (arpa)/dd@lanl.uucp(new)/..cmcl2!lanl!dd
"I refuse to be intimidated by reality any more"  "What is reality anyway?
Nuthin' but a collective hunch!" --Jane Wagner,via Lily Tomlin 

roy@phri.UUCP (Roy Smith) (03/19/88)

	In response to a query of mine about Rich Roberts's ambiguous base
notation, dd@beta.UUCP (Dan Davison) writes:

> It's the ambiguous base code developed by the MOLGEN project at SU
> SUMEX-AIM.STANFORD.EDU back in the dawn of time, 1979-1980.  It bears no
> resemblance to the Staden or IUPAC codes.

	I did a bit more research on this topic and came up with the
following paper:

%A Athel Cornish-Bowden
%T Nomenclature for incompletely specified bases in nucleic acid sequences:
recommendations 1984
%J Nucleic Acids Research
%D 1985
%V 13
%P 3021-3030

	This paper includes a longish list of references to other attempts
at standardizing the code, and provides some arguments as to why the scheme
he presents (the IUPAC scheme) is more mnemonic than any other.  For
example, W={A,T} and S={C,G} because A-T pairs are Weak and C-G pairs are
Strong; M={A,C} and K={G,T} because A and C have aMino groups in chemically
similar positions while G and T have Keto groups in those positions.
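
	The pairing of the two-base codes is easy to see if each code is
treated as a set with one bit per base.  The sketch below is only an
illustration under that assumption; the bit values and the names
iupac_mask(), iupac_code() and iupac_excluded() are hypothetical and come
neither from the paper nor from the include file quoted earlier in the
thread.  Treating U like T is also an assumption made here for simplicity.

```c
#include <assert.h>

/* One bit per base, so every IUPAC code is a 4-bit set. */
enum { BIT_A = 1, BIT_C = 2, BIT_G = 4, BIT_T = 8, BIT_ANY = 15 };

/* Set of bases denoted by an upper-case IUPAC code; 0 if unknown. */
static unsigned iupac_mask(char code)
{
    switch (code) {
    case 'A': return BIT_A;
    case 'C': return BIT_C;
    case 'G': return BIT_G;
    case 'T':
    case 'U': return BIT_T;                  /* assumption: U handled as T */
    case 'R': return BIT_A | BIT_G;
    case 'Y': return BIT_C | BIT_T;
    case 'M': return BIT_A | BIT_C;          /* aMino */
    case 'K': return BIT_G | BIT_T;          /* Keto */
    case 'W': return BIT_A | BIT_T;          /* Weak */
    case 'S': return BIT_C | BIT_G;          /* Strong */
    case 'B': return BIT_C | BIT_G | BIT_T;
    case 'D': return BIT_A | BIT_G | BIT_T;
    case 'H': return BIT_A | BIT_C | BIT_T;
    case 'V': return BIT_A | BIT_C | BIT_G;
    case 'N': return BIT_ANY;
    default:  return 0;
    }
}

/* Inverse mapping: the code for a set of bases ('?' for the empty set). */
static char iupac_code(unsigned mask)
{
    /* Indexed by the bit set: 3 = A|C = M, 5 = A|G = R, 9 = A|T = W, ... */
    static const char table[] = "?ACMGRSVTWYHKDBN";

    assert(mask <= BIT_ANY);
    return table[mask];
}

/* Code for the bases a given code leaves out.  This is the set
   complement, not the base-pairing complement of the opposite strand. */
static char iupac_excluded(char code)
{
    return iupac_code(BIT_ANY & ~iupac_mask(code));
}
```

	With this encoding iupac_excluded('W') is 'S', iupac_excluded('M')
is 'K' and iupac_excluded('R') is 'Y', and each "not X" code maps back to
its single base, e.g. iupac_excluded('B') is 'A'.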

	I'm fully aware how hard it is to change over from one standard to
another, especially after using the old one for so many years.  On the
other hand, I think it's pretty much agreed that IUPAC is the final
authority when it comes to chemical nomenclature; to insist on using some
other naming system just doesn't make sense.
-- 
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016