dictyBase SOPs: Similarity-Based Curation

Similarity-Based Curation

Last updated March 30, 2006
Contents
  • General Info: Priorities; General Protocol
  • Annotation: Gene Name; Gene Product; Description; Curator Note; Gene Ontology Annotation
  • Notes: MetaCyc Curation; Other Notes on Similarity-Based Curation


Priorities
Genes and sequences are annotated with the following priority:
  1. Lists of gene families/pathways sent by users
  2. Genes with similarity to known proteins (see Inparanoid)
  3. Genes containing known functional domains
  4. Genes with no similarity to known proteins but with ESTs
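
The priority order above can be read as a simple ranking rule. The following Python sketch is only an illustration: the `Candidate` class and its boolean fields are hypothetical and do not correspond to any dictyBase schema.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record describing what is known about an uncurated gene."""
    gene_id: str
    user_requested: bool = False  # on a gene family/pathway list sent by users
    has_similarity: bool = False  # similar to a known protein
    has_domain: bool = False      # contains a known functional domain
    has_est: bool = False         # no similarity, but supported by ESTs


def curation_priority(c):
    """Return 1-4 following the list above; 5 means none of the criteria apply."""
    if c.user_requested:
        return 1
    if c.has_similarity:
        return 2
    if c.has_domain:
        return 3
    if c.has_est:
        return 4
    return 5


def sort_by_priority(candidates):
    """Order candidates so that the highest-priority genes come first."""
    return sorted(candidates, key=curation_priority)
```

A gene that satisfies several criteria takes the highest applicable priority, which is an assumption of this sketch rather than a stated rule.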

General Protocol
In general, curators perform the following steps to determine the identity of gene products:
  1. BlastP vs. SwissProt and/or nr
  2. GOst Search with protein sequence
  3. InterProScan and Pfam search for protein families and domains
  4. FingerPRINTScan for motif fingerprints in families
  5. PSORT and TargetP for cellular localization
  6. TMHMM for transmembrane domains
  7. BLAST the ISS protein vs. dictyBase; make sure the gene being annotated is the top hit and/or take into account (and annotate, if appropriate) other hits
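
Step 7 is a reciprocal check, and it can be sketched in a few lines of Python. This is a minimal illustration that assumes the BLAST results have already been parsed into `(subject_id, bit_score)` tuples; the function names and the tuple layout are hypothetical, not part of any dictyBase tool.

```python
def is_top_hit(hits, expected_id):
    """Return True if expected_id has the highest bit score among the hits.

    hits: list of (subject_id, bit_score) tuples (assumed, pre-parsed input).
    """
    if not hits:
        return False
    best_id, _ = max(hits, key=lambda h: h[1])
    return best_id == expected_id


def other_hits(hits, expected_id):
    """Return the remaining hits, best first, for the curator to review."""
    rest = [h for h in hits if h[0] != expected_id]
    return sorted(rest, key=lambda h: h[1], reverse=True)
```

Whether the other hits are significant enough to annotate remains a curator decision; the sketch only orders them for review.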

Gene Name
Guidelines for naming unpublished genes:
  • If no name is assigned, use the dictyBaseID of the primary feature after a Curated Model is made. Keep the Sequencing Center name if you do not create a Curated Model.
  • Check nomenclature for other species (should we try to generate a list of PMIDs of nomenclature papers?). If there appears to be a single precedent for naming a particular family, go ahead and give the gene that name (hopefully Demerec, although standard nomenclature for all organisms is favored over Demerec; see also Procedure for naming genes). If the origin of the gene name is known, include this information in the Name Description field.
  • If it is clear that the gene is the SINGLE ortholog (used loosely, not in a strict evolutionary sense) of a family member, then give the gene that letter/number. If it is unclear which letter/number family member it corresponds to, give the gene a letter, starting with A. (Creation of novel gene names should be discussed with the other curators and/or researchers.)
  • When a gene name is not completely clear, refrain from naming or consult with fellow curators.

Gene Product
  • Keeping the rules about Gene Names in mind, use the expanded gene name if there is evidence pointing to a likely homolog/ortholog/paralog, provided that the gene name relates to the gene's function and not to a mutant phenotype.
  • Use lowercase letters for gene products, except in cases where it is standard to use upper case, such as acronyms.
  • Use the "Search Gene Product" tool to reduce redundancy of the gene products.
  • If the protein is identified entirely by ISS (e.g., no compelling evidence for an ortholog, not the best reciprocal hit, etc.), stick to "putative" ("putative" is preferred for the gene product; "similar to" is more of a Description field phrase).
  • When in doubt, always use the broadest classification possible (e.g., instead of "putative CAM kinase" use "putative serine/threonine protein kinase").
  • Include both general and specific information if available (e.g., kinases must all have "protein kinase" somewhere in a gene product and information regarding membership in some subfamily, if applicable; multiple gene products are encouraged if they keep it searchable).
  • When creating new gene products, keep the user in mind: what are they likely to search for?
  • GENE PRODUCT X (example: ribonucleotide reductase, small subunit)
    • Highest hit Dicty vs. UniProt/nr/GOst
    • Highest hit other spp. protein vs. dictyBase (reciprocal hit), all other hits insignificant
    • High level of identity in pairwise BLAST (over length of protein). Start with >35% identity over >80% length of the protein. If there are examples of genes we would like to annotate that fall below that, we should discuss them. Also look for conserved patterns of protein domains.
    • Genes we expect to be essential and that are present as a single copy can be annotated "Gene X" even if the 35% identity over 80% length rule is not met; for example, if there is only one RNA polymerase.
  • PUTATIVE GENE PRODUCT X (example: putative AGC protein kinase)
    • High level of confidence that this protein is a member of a particular group/family/subfamily but lower level of overall identity and/or best reciprocal hit test is inconclusive.
  • X DOMAIN-CONTAINING PROTEIN (example: BZIP domain-containing protein)
    • Similarity in conserved functional domains only (no similarity over length of protein).
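
The three naming tiers above amount to a small decision rule. The Python sketch below is one possible reading, not an official algorithm: the function name and its parameters are hypothetical, and it assumes that the essential single-copy exception waives only the 35%/80% threshold, not the reciprocal-hit requirement.

```python
def gene_product_label(name, pct_identity, pct_length, reciprocal_best_hit,
                       family_confident=False, domain=None,
                       essential_single_copy=False):
    """Suggest a gene product string for a similarity-based annotation.

    name: expanded name of the matched protein or family.
    pct_identity, pct_length: percent identity and percent of the protein
        length covered in the pairwise BLAST alignment.
    reciprocal_best_hit: True if each protein is the other's top hit.
    family_confident: curator judgment that family membership is clear.
    domain: name of a conserved domain when only the domain is similar.
    essential_single_copy: curator judgment (assumed to waive the threshold).
    """
    meets_threshold = pct_identity > 35 and pct_length > 80
    if reciprocal_best_hit and (meets_threshold or essential_single_copy):
        return name                                 # GENE PRODUCT X
    if family_confident:
        return "putative " + name                   # PUTATIVE GENE PRODUCT X
    if domain is not None:
        return domain + " domain-containing protein"
    return None                                     # leave for discussion
```

Borderline cases that fall below the thresholds are meant to be discussed among curators, so a `None` result here signals "discuss" rather than "do not annotate".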

Description
Descriptions can be derived from any of these sources, plus general information about the gene product’s process/function/component:
  • EC reaction descriptions
  • UniProt descriptions/functions
  • General functions from data from other organisms
  • Example: similar to human LYST and mouse Beige proteins (lvs genes)

Curator Note
Write curator notes when appropriate:
  • Similar to [species/gene name] (SwissProt/GenBank ID#) % identical / % similar / % length (Dicty protein). (SwissProt/GenBank ID# = XXX AA; DDB# = YYY AA).
  • Pay attention to the lengths of both the Dicty protein and the comparison protein. Subtract runs of 'N' or 'Q' in the Dicty protein from its length. Add other important information/observations as needed.
  • If the Curated Model has feature or subfeature coordinates that differ from the Sequencing Center Gene Prediction and the reason for the coordinate changes is not obvious, make a public note explaining the difference.
  • If you do not make a Curated Model, make a public note:
    The available data are inconclusive to determine the correct gene model. The gene model presented here was obtained from the Dictyostelium Genome Consortium.
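
The similarity note and the length correction can be sketched as follows. This is an illustrative Python example under stated assumptions: the function names are hypothetical, and the minimum run length of 10 residues is an arbitrary placeholder, since the guideline above does not say how long a string of 'N' or 'Q' must be before it is subtracted.

```python
import re


def effective_length(protein_seq, min_run=10):
    """Length of the protein after removing long runs of N or Q.

    min_run is an assumed cutoff; shorter runs are left in place.
    """
    pattern = re.compile(r"N{%d,}|Q{%d,}" % (min_run, min_run))
    return len(pattern.sub("", protein_seq))


def curator_note(species_gene, accession, pct_identical, pct_similar,
                 pct_length, hit_length, ddb_id, ddb_length):
    """Fill in the curator note template shown above."""
    return (
        f"Similar to {species_gene} ({accession}) "
        f"{pct_identical}% identical / {pct_similar}% similar / "
        f"{pct_length}% length (Dicty protein). "
        f"({accession} = {hit_length} AA; {ddb_id} = {ddb_length} AA)."
    )
```

For example, `curator_note("human protein X", "P00000", 40, 58, 85, 500, "DDB0000000", 520)` fills the template with placeholder values; the accession and DDB number there are dummies, not real records.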

Gene Ontology Annotation
  • Using top hits from GOst search and BLAST vs. UniProt, nr, InterPro, and Pfam, look at GO annotations of these sequences; ISS with that database record and use reference "dictyBase 'Inferred from Sequence or structural Similarity' Unpublished" (reference_no=10155).
  • If no non-IEA/ISS/NAS annotations exist for these top hits, you may use the sequence record in the with column, but in this case try to find a reference that provides evidence for the process/function/component (need to import PMID first, then make this annotation).
  • If you have a good annotation for a function and you can logically and confidently infer that the gene product participates in a process or localizes to a cellular component based on other annotations, use the IC evidence code (for example, a protein annotated with function "DNA binding" can be annotated with IC component "nucleus").
  • Alternatively, if you have good hits with InterProScan or ProSite, you may ISS with those records that have GO annotations. (See also InterPro2go and EC2go mappings.)
  • ISS may be done with molecular_function; however, biological_process and cellular_component terms must be used carefully. Very general process terms may be used, and component terms should be discussed.
  • As with curation of previously identified genes, annotate all second-generation genes to all three ontologies (function, process, component). If there is nothing to ISS in one or more ontologies, and no IEAs exist, annotate with "unknown."
    • biological_process unknown ; GO:0000004
    • molecular_function unknown ; GO:0005554
    • cellular_component unknown ; GO:0008372
  • Listing of dictyBase internal GO references:
    • ND: dictyBase 'No biological Data' Unpublished (reference_no=9851)
    • ISS: dictyBase 'Inferred from Sequence or structural Similarity' Unpublished (reference_no=10155)
    • IC: dictyBase 'Inferred by Curator' Unpublished (reference_no=11067)
    • NAS: (unpublished information from authors) dictyBase (2005) 'Personal communication to dictyBase' Unpublished (reference_no=11050; note this reference changes each calendar year)
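
The rule that every gene is annotated to all three ontologies, with "unknown" filling any gap, can be sketched in Python. The term names, GO IDs, and ND reference number are copied from the lists above; the dictionary layout, the function name, and the pairing of "unknown" terms with the ND evidence code and reference are assumptions of this sketch.

```python
# "unknown" terms as listed above, keyed by ontology
UNKNOWN_TERMS = {
    "biological_process": ("biological_process unknown", "GO:0000004"),
    "molecular_function": ("molecular_function unknown", "GO:0005554"),
    "cellular_component": ("cellular_component unknown", "GO:0008372"),
}
ND_REFERENCE_NO = 9851  # dictyBase 'No biological Data' reference


def fill_unknowns(annotations):
    """Return the annotations plus an "unknown" entry per uncovered ontology.

    annotations: list of dicts with at least an "aspect" key (hypothetical
    layout). Any existing annotation, including an IEA, counts as coverage.
    """
    covered = {a["aspect"] for a in annotations}
    added = []
    for aspect, (term, go_id) in UNKNOWN_TERMS.items():
        if aspect not in covered:
            added.append({
                "aspect": aspect,
                "term": term,
                "go_id": go_id,
                "evidence": "ND",  # assumed evidence code for "unknown"
                "reference_no": ND_REFERENCE_NO,
            })
    return annotations + added
```

The input list is not modified; a new list is returned so that the curator can review the proposed "unknown" entries before saving them.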

MetaCyc Curation
  • Curation of metabolic genes is a type of similarity-based curation; most of the same guidelines apply, but with a few exceptions.
  • The metabolic proteins of Dictyostelium are being documented with MetaCyc and metabolic pathways will eventually be displayed at dictyBase (dictyCyc) with the use of Pathway Tools software.
  • Previously unidentified metabolic genes can be found in dictyBase in a variety of ways. One option is the search tool, because the automated gene products and/or GO terms may contain protein names or EC numbers. The best approach to finding potential orthologs, however, is via BLAST. To find protein sequences from other species to BLAST against the database, use the following resources: KEGG, BRENDA, UniProt, or Enzyme Nomenclature.
  • When writing gene products for genes to be integrated into the dictyCyc database, DO NOT use "putative" or they will not be automatically imported by Pathway Tools.
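
A quick check for the "putative" restriction could look like the Python sketch below. The function name is hypothetical, and the case-insensitive whole-word match is an assumption; the guideline above does not describe how Pathway Tools itself detects the word.

```python
import re


def blocks_dictycyc_import(gene_product):
    """Return True if the gene product contains the word "putative"."""
    return re.search(r"\bputative\b", gene_product, flags=re.IGNORECASE) is not None
```

Running this over the gene products of metabolic genes before an import would flag entries such as "putative serine/threonine protein kinase" for rewording.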

Other Notes on Similarity-Based Curation
  • ISS only with non-IEAs and non-ISS.
  • Don’t ISS with ISS; this dilutes the meaning of annotations (not to mention the level of confidence for ISS annotations from other databases).
  • Avoid confusing annotations such as Cellular Component IMP.
  • You can "mix and match" -- all GO annotations don’t need to be from the same reference (e.g., the process can come from one and function from another).
  • Not all GO annotations from the "with" record need to be ISSed to the Dicty gene; some annotations contain extraneous information that is not relevant.
  • See also the GO list e-mail exchanges (April 5, 2004 and November 29, 2002).
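
The first two notes above describe a filter on the annotations of the "with" record. A minimal Python sketch, assuming each annotation is a dict with an "evidence" key (a hypothetical layout), might look like this:

```python
# Evidence codes that must not serve as the basis for a new ISS annotation
EXCLUDED_EVIDENCE = {"IEA", "ISS"}


def usable_for_iss(annotations):
    """Keep only annotations whose evidence code is neither IEA nor ISS."""
    return [a for a in annotations if a["evidence"] not in EXCLUDED_EVIDENCE]
```

The filter only narrows the candidates. As noted above, not every remaining annotation needs to be transferred to the Dicty gene, so the curator still chooses which ones are relevant.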

Supported by NIH (NIGMS and NHGRI)