
FAQ

1. What is the purpose of COMBREX?

2. What do the symbols next to the clusters and genes (gold circle, green circle, blue square, and black diamond) mean?

3. How are your green gene lists generated?

4. What is the Gold Standard data set?

5. How do I find a gene?

6. What search terms can I use to search for a gene?

7. What can I do with Advanced Search?

8. How can I search by organism name?

9. Why are Escherichia coli str. K-12 substr. MG1655 and Helicobacter pylori 26695 COMBREX focus organisms?

10. If I want to experimentally validate an annotation, how should I go about finding a cluster and picking a gene?

11. How does COMBREX recommend genes for experimental validation?

12. What purpose does the Histogram for the average distances from each protein to every other protein in a cluster serve?

13. How is the Histogram for the average distances from each protein to every other protein in a cluster produced?

14. Gene X is experimentally validated. How can I get it added to the green or gold gene list?

15. How are cluster search results ordered?

16. What are curated clusters?

17. What clustering model does COMBREX use to create its protein clusters?

18. What are Related Protein Clusters?

19. Why does COMBREX provide information on Related Clusters?

20. How do I submit a prediction?

21. Where are all the predictions?

22. What is the Functional Linkage Table found in some gene pages?

23. What is OperonDB?

24. What is Domain Fusion?

25. What is Phylogenetic Profile?

26. What is Gene Neighborhood?

27. Does COMBREX make its own predictions?

28. How do I submit a bid?

29. Why can I not submit a bid for certain genes?

30. How can I submit a bid for a gold or green gene?

31. What is the purpose of the High Priority List?

32. What does it mean to have “NCBI Protein Clusters” as a source of prediction?

33. What does it mean to have “NCBI annotation” as a source of prediction?

34. What information does the “Phenotype Data” section within the gene detail page contain?



1. What is the purpose of COMBREX?

COMBREX is a collaborative project to bring the computational and experimental communities of biologists together in an effort to better understand gene function.

2. What do the symbols next to the clusters and genes (gold circle, green circle, blue square, and black diamond) mean?

COMBREX uses colored symbols as visual cues to reflect the experimental validation status of genes and clusters. The colored symbol seen next to a gene name indicates the experimental validation status of the gene. The colored symbol seen next to a cluster name indicates the experimental validation status of its constituent proteins.

For Genes:

Gold Circle – indicates the protein has been experimentally validated, and the DNA sequence coding for the exact protein has been determined. The publication(s) reporting the sequence and the biochemistry are documented.

Green Circle – indicates that the gene is believed to have been experimentally validated, but manual curation is incomplete, or information required for gold status is lacking.

Blue Square – indicates the gene has a specific prediction of molecular function but has not been experimentally validated, or the gene's experimental validation status has yet to be established in COMBREX.

Black Diamond – indicates the gene does not have a specific prediction of molecular function but it may have predictions of "general" or "non-specific" functionality.

For Clusters:

Green Circle – indicates the cluster contains one or more experimentally validated (green and/or gold) genes.

Blue Square – indicates the cluster does not contain any experimentally validated genes and that it contains genes with specific predictions of molecular function.

Black Diamond – indicates the cluster does not contain any experimentally validated genes. Additionally, no constituent gene has a specific prediction of molecular function.
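The cluster rules above can be sketched as a small function. This is only an illustration: the status strings and the function name `cluster_status` are hypothetical and are not part of any COMBREX interface.

```python
from typing import Iterable

# Gene statuses as described in this FAQ; the string labels are assumptions.
VALIDATED = {"gold", "green"}

def cluster_status(gene_statuses: Iterable[str]) -> str:
    """Derive the cluster symbol from the statuses of its constituent genes."""
    statuses = set(gene_statuses)
    if statuses & VALIDATED:
        return "green"  # at least one experimentally validated gene
    if "blue" in statuses:
        return "blue"   # no validated genes, but a specific prediction exists
    return "black"      # no validated genes and no specific predictions

print(cluster_status(["blue", "black", "gold"]))
print(cluster_status(["black", "blue"]))
print(cluster_status(["black"]))
```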

3. How are your experimentally validated genes (green status) generated?

Green genes are believed to have experimentally validated functions as indicated by other highly curated databases (NCBI, UniProt, EcoCyc and others). In many cases, they are candidate "gold standard" genes that are awaiting manual curation to confirm their gold status. Alternatively, a gene that has greater than 98% full-length sequence similarity to a gold gene is also considered to be a green gene.

4. What is the Gold Standard data set?

The Gold Standard data set consists of genes that have been experimentally validated and for which the DNA sequences coding for those exact proteins have been determined. All gold entries have been manually curated, and references for the experiments and the gene sequencing are available. The Gold Standard data set is an ongoing project that is still in its early stages. If you would like more information or want to help with the curation of this data set, please contact Dr. Richard J. Roberts (roberts@neb.com). NCBI and UniProt will soon have a downloadable version of the Gold Standard data set on their sites, but until then, it can be obtained on the COMBREX site. For more information on Gold Standard genes, please refer to the Gold Standard Genes document.

5. How do I find a gene?

To find a gene of interest, enter information about the gene into the COMBREX search engine and click “Search.” The more specific your search term, the more easily you will be able to find your gene of interest. For instance, using an NCBI Gene ID to specify your gene will find it faster than using a gene name or gene symbol.

The search will return a list of NCBI Protein Clusters (groups of highly sequence-similar proteins thought to perform the same function), each of which contains genes matching your search criteria. The list of clusters can be sorted by various criteria including phylogenetic distribution and cluster size (in terms of numbers of proteins or numbers of organisms). For each cluster, we highlight genes from either of our two "focus organisms", E. coli K12 MG1655 and Helicobacter pylori 26695, when present.

If you used a unique identifier such as a RefSeq protein identifier or UniProt accession number in your search, the search should yield a single cluster, and the matching gene will be highlighted beneath it, in addition to any genes from the two focus genomes above.

6. What search terms can I use to search for a gene?

Any of the following terms can be used:
Entrez GeneID -- e.g.: "1021855"
UniProt accession number -- e.g.: "Q8G6A5"
RefSeq protein accession number -- e.g.: "NP_695922.1"
Please Note: YP and NP must be capitalized and you must include an underscore “_”, but need not include the version number (".1")
Gene Name -- e.g.: "helY"
Protein Cluster
-- e.g.: "CLSK967808" (Please Note: CLSK denotes non-curated clusters)
-- e.g.: "PRK10917" (Please Note: PRK denotes curated clusters)
Keywords
-- e.g.: "helicase"
-- e.g.: "RNA helicase"
-- e.g.: "Superfamily II RNA helicase"
Please Note: Using key words or generic gene names may not uniquely identify your gene. To easily search for a unique, specific gene, please use specific identifiers such as the Entrez GeneID, UniProt accession number, or RefSeq accession number if any of these are known. You can also use our Advanced Search feature to limit your search results by specifying gene status, protein cluster status, or species name.
For more information, please see our Help Center.
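As a rough illustration of how the identifier formats listed above differ, the sketch below guesses the type of a search term from its shape. The function name `classify_term` and the returned labels are hypothetical, the UniProt pattern is the accession format published by UniProt, and none of this describes how the COMBREX search engine itself parses input.

```python
import re

# Patterns for the identifier formats listed above. These are assumptions
# made for illustration, not the rules used by the COMBREX search engine.
REFSEQ = re.compile(r"(NP|YP)_\d+(\.\d+)?")   # version suffix is optional
CLUSTER = re.compile(r"(CLSK|PRK)\d+")
GENE_ID = re.compile(r"\d+")
UNIPROT = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

def classify_term(term: str) -> str:
    """Guess which kind of search term was entered."""
    term = term.strip()
    if REFSEQ.fullmatch(term):
        return "RefSeq protein accession"
    if CLUSTER.fullmatch(term):
        curated = term.startswith("PRK")
        return "curated protein cluster" if curated else "non-curated protein cluster"
    if GENE_ID.fullmatch(term):
        return "Entrez GeneID"
    if UNIPROT.fullmatch(term):
        return "UniProt accession"
    return "gene name or keyword"

for example in ["1021855", "Q8G6A5", "NP_695922", "PRK10917", "RNA helicase"]:
    print(example, "->", classify_term(example))
```

Gene names and keywords cannot be told apart by shape alone, which is why the sketch groups them; a lowercase "np_695922" also falls into that group because the prefix must be capitalized.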

7. What can I do with Advanced Search?

Advanced Search allows you to search directly by gene status, protein cluster status, and/or species name. You can also use these same criteria to limit the results of another search so that they include only entries with a specific gene status, protein cluster status, and/or species name.

8. How can I search by organism name?

To search the COMBREX database for all genes found within a specific organism, click on “Advanced Search,” type the organism’s name in the “Species” box, and click “Search.” The search will return a list of clusters, each of which contains at least one gene from your organism of interest.

To search the COMBREX database for a specific gene found within a specific organism, enter the gene information into the COMBREX home page search box and click on “Advanced Search.” Then type the organism’s name in the “Species” box and click “Search.” The search will return a list of clusters, each of which contains the gene of interest in the organism of interest.

At the present time, specifying an organism name in the COMBREX home page search box will yield incomplete results. This will be adjusted in a future release.

9. Why are Escherichia coli str. K-12 substr. MG1655 and Helicobacter pylori 26695 COMBREX focus organisms?

E. coli str. K-12 substr. MG1655 was chosen as a COMBREX focus organism because of its frequent use as a model organism in molecular biology and biochemistry. H. pylori 26695 was chosen because of its importance to human health and disease. Although these two bacteria are considered COMBREX “focus organisms” and therefore receive some preferential treatment, COMBREX accepts predictions for, and funds experimental validation of, genes from any bacterial or archaeal organism.

10. If I want to experimentally validate an annotation, how should I go about finding a cluster and picking a gene?

Ideally, choosing a protein as a target for experimental validation is based upon prior functional knowledge. We encourage researchers submitting bids to select protein functions with which they have some previous experience. (The essence of the COMBREX project is to match specific predictions with expert biochemists who are knowledgeable about the appropriate assays to use and who have suitable reagents already on hand.) When selecting a particular protein for experimental validation, we encourage choosing one that will provide the most information for the entire protein cluster.

Another way to choose a protein for experimental validation is to pick the one whose validation will provide the most information for the entire cluster. If a green or gold gene is present in the cluster, select the protein that is furthest away from the green or gold member. If there are no experimentally validated members in the cluster, select a protein that is close to the centroid of the cluster. The histogram for the average distances from each protein to every other protein in a cluster can be used to help identify good targets for experimental validation. Proteins with larger average distances to every other protein are good targets for clusters containing green or gold genes, and proteins with smaller average distances to every other protein are good targets for clusters containing no experimentally validated genes.

Follow the steps below to find a gene for experimental validation:

1. Find a cluster.

The importance of a particular cluster is ranked according to the following scale (high priority clusters listed first, and low priority clusters listed last):
2. After choosing a cluster, select a particular gene within the cluster.

The importance of a gene within a cluster is ranked according to the following scales (high priority genes listed first, and low priority genes listed last):

For genes in clusters containing no green or gold genes:
For genes in clusters containing green and/or gold genes:

For more information on how to select a gene for experimental validation and how to submit a bid for funding to experimentally validate this gene, please refer to the How To Submit A Bid document.

11. How does COMBREX recommend genes for experimental validation?

COMBREX currently uses two criteria to recommend genes in a given Protein Cluster for experimental validation. First, we recommend validating genes in COMBREX ‘focus organisms’, which currently include E. coli K-12 MG1655 and H. pylori 26695; this ensures that we continue to develop a more complete picture of the coding potential, and thus a greater understanding of the biology, of these two important model organisms.
Second, for ‘blue’ clusters, in other words those with no experimentally validated members at present, we recommend validating the gene with the shortest average distance to all other proteins in the cluster as measured using sequence similarity; this gene can be thought of as lying nearest the centroid of the cluster. The functions of uncharacterized genes are often predicted based on sequence similarity to experimentally validated homologs, and an implicit assumption in this process is that the confidence in such predictions increases as sequence similarity increases. Thus, validating the function of a gene near the centroid of a cluster results in the greatest overall confidence when that function is predicted to apply to all other members of the cluster.
In some cases, there may be a compelling reason why a recommended gene might not be a good validation candidate (for example, the organism from which it comes is highly pathogenic, difficult to obtain, or otherwise difficult to work with). In such cases, we suggest using the ‘average distance to other proteins’ metric we provide to choose another gene from the cluster with a relatively small average distance. We encourage experimental biologists to use their best judgment.
For more information on the Histogram for the average distances from each protein to every other protein within a cluster, please see questions 12 and 13 below.

12. What purpose does the Histogram for the average distances from each protein to every other protein in a cluster serve?

The Histogram for the average distances from each protein to every other protein in a cluster serves to help you identify a good target for experimental validation within a certain cluster. Good targets are considered to be proteins whose experimental validation would provide the most predictive value for the entire protein cluster.

If a green or gold gene is present in a cluster, then a good target for experimental validation is a protein that is far away from it in sequence space. You can get this information by contacting COMBREX.

If a green or gold gene is NOT present in a cluster, then a good target for experimental validation is a protein that is close to the centroid of a cluster. Such a protein will have low average distances to every other protein in the cluster.

Before selecting a protein to experimentally validate, you may want to examine the multiple sequence alignment and full phylogenetic tree of the cluster, if it is small enough to allow such analysis. This information is available on NCBI and can be reached by clicking on the protein cluster link at the top of the protein cluster page.

For additional information and examples, see (avg_dist_to_other_genes.doc)
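The selection rule described in this answer can be sketched as follows. The function `pick_target`, the protein names, and the distance values are hypothetical. For clusters that contain a validated gene, the sketch uses the largest average distance as a stand-in, as suggested in question 10; the exact distance to the green or gold member is not available from the histogram.

```python
from typing import Dict, Set

def pick_target(avg_dist: Dict[str, float], validated: Set[str]) -> str:
    """Pick a validation target from per-protein average distances.

    avg_dist maps each protein to its average distance to all other
    cluster members; validated holds the green/gold proteins, if any.
    """
    has_validated = any(p in validated for p in avg_dist)
    # Never recommend a protein that is already validated.
    candidates = {p: d for p, d in avg_dist.items() if p not in validated}
    if has_validated:
        # Cluster has a green/gold gene: prefer a distant protein.
        return max(candidates, key=candidates.get)
    # No validated gene: prefer the protein nearest the centroid.
    return min(candidates, key=candidates.get)

avg = {"protA": 0.2, "protB": 0.5, "protC": 0.9}  # made-up values
print(pick_target(avg, set()))       # cluster with no validated genes
print(pick_target(avg, {"protA"}))   # cluster where protA is green or gold
```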

13. How is the Histogram for the average distances from each protein to every other protein in a cluster produced?

First, a multiple sequence alignment is built for the members of each protein cluster. The multiple sequence alignments for the curated protein clusters were provided by NCBI and were produced with the tool MUSCLE [1]. These alignments are then converted into a distance matrix using the protdist program within the PHYLIP package [2], with the Jones-Taylor-Thornton model [3] for amino acid substitution. Finally, we use this distance matrix to calculate, for each protein, the average distance to all other members of the cluster.

1: Edgar RC.
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
Nucleic Acids Res, 2004 Mar 19. 32(5):1792-7. Print 2004. PubMed PMID: 15034147; PubMed Central PMCID: PMC390337.

2: Felsenstein, J.
PHYLIP - Phylogeny Inference Package (Version 3.2).
Cladistics, 1989. 5: 164-6.

3: Jones DT, Taylor WR, Thornton JM.
The rapid generation of mutation data matrices from protein sequences.
Comput Appl Biosci, 1992 Jun. 8(3):275-82. PubMed PMID: 1633570.
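The last step, turning a distance matrix into per-protein average distances, can be sketched as below. The matrix values and protein names are made up for illustration; in the pipeline described above they would come from protdist, and the function name `average_distances` is hypothetical.

```python
from typing import Dict, List

def average_distances(names: List[str], matrix: List[List[float]]) -> Dict[str, float]:
    """Average distance from each protein to every other cluster member.

    matrix[i][j] is the pairwise distance between names[i] and names[j].
    Requires at least two proteins; the diagonal is ignored.
    """
    n = len(names)
    return {
        names[i]: sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    }

names = ["P1", "P2", "P3"]
matrix = [
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.6],
    [0.4, 0.6, 0.0],
]
avg = average_distances(names, matrix)
centroid = min(avg, key=avg.get)  # protein nearest the cluster centroid
print(avg, centroid)
```

The histogram shown on the cluster page is a binned view of the values in `avg`.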

14. Gene X is experimentally validated. How can I get it added to the green or gold gene list?

We realize that there may be cases where the annotation we have listed for a protein is incomplete or inaccurate. For example, we may list a protein's function as something generic, such as “protein-binding” when experimental evidence suggests a specific enzymatic function. In other cases, we may list a protein's current annotation as unverified, meaning that we do not have a reference, when an appropriate reference exists. If you notice any annotation on our site which could be improved, please submit an annotation update via the gene page.

We encourage you to use the annotation submission tool to:
  1. Update an existing annotation by making it more specific or correcting it
  2. Nominate a gene as a candidate for the set of gold standard annotated proteins (link to help page on what is the gold standard)
  3. Make a comment on the existing annotation.
For more information on how to submit an annotation and how to nominate a gene as a candidate for the set of gold standard annotated proteins please refer to the How To Submit An Annotation guide.

15. How are cluster search results ordered?

When a list of search results (gene clusters) is returned, the clusters are ordered by default based on the following criteria. (The search results can also be manually reordered based on number of organisms, number of proteins, phylogenetic spread score, and cluster name by using the available sort functionality.)
  1. Validation status:
    Clusters with a blue experimental validation status are listed before clusters with a green experimental validation status which are listed before clusters with a black experimental validation status. The color assigned to a cluster represents the experimental validation status of its constituent genes. For more information on the symbol and color assignments to genes and clusters please refer to the symbol and color explanation for clusters and genes.
  2. Number of focus organisms in a cluster:
    Clusters containing genes from our two focus organisms, E. coli K12 MG1655 and H. pylori 26695, are listed before clusters that do not.
  3. Phylogenetic spread score:
    The phylogenetic spread score currently employed is an integer that corresponds to the depth of the most recent common ancestor for the species within the cluster in the phylogenetic tree provided by NCBI. For example, a phylogenetic spread score of 0 means that the cluster species are conserved at the root level, and a score of 1 means they are conserved at the kingdom level. In general, clusters with lower phylogenetic spread scores, corresponding to wider spread, are listed before clusters with higher phylogenetic spread scores.
  4. Number of organisms in a cluster:
    Clusters with genes from larger numbers of organisms are listed before clusters with genes from smaller numbers of organisms.
  5. Number of genes in a cluster:
    Clusters with larger numbers of genes are listed before clusters with smaller numbers of genes.
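
The default ordering above amounts to a multi-key sort, which can be sketched as follows. The `Cluster` class and its field names are hypothetical, and treating criterion 2 as a count of focus organisms (more first) is an interpretation of the text rather than a documented rule.

```python
from dataclasses import dataclass
from typing import List

# Criterion 1: blue clusters first, then green, then black.
STATUS_RANK = {"blue": 0, "green": 1, "black": 2}

@dataclass
class Cluster:
    name: str
    status: str           # "blue", "green", or "black"
    focus_organisms: int  # focus organisms present in the cluster (0 to 2)
    spread_score: int     # phylogenetic spread score; lower means wider spread
    organisms: int        # number of organisms in the cluster
    genes: int            # number of genes in the cluster

def default_order(clusters: List[Cluster]) -> List[Cluster]:
    """Sort clusters by the five default criteria, in priority order."""
    return sorted(
        clusters,
        key=lambda c: (
            STATUS_RANK[c.status],  # 1. validation status
            -c.focus_organisms,     # 2. more focus organisms first
            c.spread_score,         # 3. lower spread score first
            -c.organisms,           # 4. more organisms first
            -c.genes,               # 5. more genes first
        ),
    )

clusters = [
    Cluster("C1", "green", 2, 0, 500, 600),
    Cluster("C2", "blue", 0, 1, 40, 45),
    Cluster("C3", "blue", 1, 3, 10, 12),
    Cluster("C4", "blue", 1, 1, 10, 12),
    Cluster("C5", "black", 2, 0, 900, 950),
]
print([c.name for c in default_order(clusters)])
```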

16. What are curated clusters?

COMBREX organizes proteins within the context of NCBI protein clusters. Therefore, once you have submitted your search terms, COMBREX will return one or more appropriate NCBI Protein Clusters. As of February 2010, the NCBI Protein Clusters database contains 409016 prokaryotic clusters, of which only 7297 have been curated. Curated NCBI Protein Clusters contain added information which includes functional annotation for proteins, Enzyme Commission numbers which detail enzymatic function, and publications describing protein function and composition. Selecting the “Curated Clusters Only” box during a COMBREX search will limit your search results to curated NCBI Protein Clusters. For more information on the NCBI Protein Clusters database and the cluster curation process, please see here: NCBI publications.

17. What clustering model does COMBREX use to create its protein clusters?

We use the NCBI Protein Clusters as our clustering model. NCBI clusters proteins mainly based on sequence similarity. Protein sequences are compared by performing an all-against-all BLAST search (BLAST, the Basic Local Alignment Search Tool, finds regions of similarity between two protein sequences; for more information on BLAST please refer to PMID: 18440982) with an E-value cutoff of 1E-05 (the E-value, or Expect value, is a parameter describing the number of BLAST alignment matches, known as “hits”, that one can attribute to chance; for more information on the E-value, please see here: link to reference). Each BLAST alignment match between two sequences is assigned a BLAST score, which is then modified to take into account protein length and the alignment length between the two protein sequences. Proteins within a cluster are one another’s best BLAST alignment matches as determined by the modified score. For a protein within a cluster, all other proteins within that cluster will have a higher modified score to that protein than would any protein not within the cluster. For more information on the NCBI Protein Clusters Database and their methodology of producing protein clusters, please refer to PMID: 18940865.

18. What are Related Protein Clusters?

Proteins within a Protein Cluster share significant sequence similarity with one another. However, proteins in two different Protein Clusters can also share significant sequence similarity, and the concept of "related" clusters captures this type of relationship. Relationships between Protein Clusters are determined by NCBI using the alignment tools BLASTP (please refer to PMID: 18440982) and RPS-BLAST (info link to reference describing RPS-BLAST). For two clusters, A and B, to be related, every protein within cluster A must be related to every protein within cluster B. Two proteins are defined as related if they share a BLASTP alignment with an E-value less than 1e-03 that covers greater than 80% of the length of the shorter sequence and, where they have RPS-BLAST matches to the Conserved Domain Database (info PMID: 18984618), they also share a similar domain structure. For more information on the NCBI Protein Clusters Database and their methodology of determining related clusters, please refer to PMID: 18940865.
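
The all-pairs rule above can be sketched as follows. This is not NCBI's implementation: the data structures and function names are hypothetical, identical ordered domain lists are used as a stand-in for "similar domain structure" (which the text does not define precisely), and the domain check is assumed to apply only when both proteins have RPS-BLAST matches.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Optional, Tuple

E_VALUE_MAX = 1e-03   # BLASTP E-value must be below this
MIN_COVERAGE = 0.80   # fraction of the shorter sequence that must be covered

Hit = Tuple[float, float]  # (E-value, coverage of the shorter sequence)

def proteins_related(hit: Optional[Hit],
                     domains_a: Optional[List[str]],
                     domains_b: Optional[List[str]]) -> bool:
    """Apply the pairwise criteria to one protein pair."""
    if hit is None:
        return False  # no BLASTP alignment between the two proteins
    evalue, coverage = hit
    if not (evalue < E_VALUE_MAX and coverage > MIN_COVERAGE):
        return False
    if domains_a and domains_b:
        # Assumption: identical domain lists stand in for "similar structure".
        return domains_a == domains_b
    return True

def clusters_related(cluster_a: List[str], cluster_b: List[str],
                     hits: Dict[FrozenSet[str], Hit],
                     domains: Dict[str, List[str]]) -> bool:
    """Clusters are related only if every cross-cluster pair is related."""
    return all(
        proteins_related(hits.get(frozenset((a, b))), domains.get(a), domains.get(b))
        for a, b in product(cluster_a, cluster_b)
    )

# Made-up example data.
hits = {
    frozenset(("a1", "b1")): (1e-20, 0.95),
    frozenset(("a2", "b1")): (1e-10, 0.85),
}
domains = {"a1": ["D1", "D2"], "b1": ["D1", "D2"]}
print(clusters_related(["a1", "a2"], ["b1"], hits, domains))
print(clusters_related(["a1", "a3"], ["b1"], hits, domains))
```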

19. Why does COMBREX provide information on Related Clusters?

NCBI Protein Clusters related to a cluster of interest are displayed on the cluster detail page in order to help the user identify other possible clusters of interest. Because proteins within related clusters share significant sequence similarity with one another, any information gained about one protein will most likely shed light on proteins within its own cluster as well as on proteins within related clusters. COMBREX users interested in biochemically characterizing multiple proteins may find it worthwhile to choose target proteins within related clusters along with multiple proteins from the same cluster.

20. How do I submit a prediction?

At this initial stage of the COMBREX project we are focusing on predictions of biochemical function, which is typically described by EC numbers or GO Molecular Function (MF) terms. We will accept predictions in the form of a traditional TEXT description, but this format is not encouraged and should be used only when an appropriate structured vocabulary term is either not available or not specific enough. For step-by-step instructions on how to submit predictions of function for a single gene or multiple genes, please refer to the How To Submit Predictions guide.

For examples of the formats required for prediction submissions, click here: link to template section in how to submit predictions document. Once your predictions have been submitted, they will be available on the gene detail pages of the genes for which you made predictions. To submit predictions that you also want to validate experimentally, submit a bid for your gene of interest and explain your prediction in your bid submission. Based on your preference, these predictions will or will not be made public. For step-by-step instructions on how to submit a bid for funding to experimentally validate a gene's function, please refer to the How To Submit A Bid guide.

21. Where are all the predictions?

Predictions of function for a given gene are available on the gene’s specific gene detail page and on the cluster page to which it belongs.

Predictions of function come from a variety of sources. Genes grouped within a cluster are predicted to have similar functions based on sequence similarity, and so all genes have their cluster definition as a functional prediction. Conserved domains within a gene that are associated with a specific function serve as another source of functional predictions. The NCBI gene definition associated with uncharacterized genes may also serve as a prediction of function. Over time, additional predictions will be added by COMBREX members and the broader community. We expect these predictions of gene function provided by expert teams of computational biologists will be of high quality.

Experimentalists will have the option to experimentally validate any of these predictions.

22. What is the Functional Linkage Table found in some gene pages?

The Functional Linkage Table is a graphical representation of how a gene of interest is functionally linked to other genes as predicted by various methods. Two genes are functionally linked if evidence suggests that they perform the same biological or biochemical function. In these cases one can consider “transferring” the functional annotation from one gene to its functionally linked neighbours. The level of confidence in such a functional annotation “transfer” is indicated by the shading of the square boxes within the table - the darker the shade, the higher the confidence.

Based on these functional linkages, we can define a functional linkage network: a network of genes linked to other genes by experimental evidence or by computationally predicted functional linkages.

Functional linkage networks were originally defined in [1]. A probabilistic version, probabilistic functional linkage graphs (PFLGs), was originally defined in [2]. The first database of functional linkages was described in [3].

[1] Eisenberg D, Marcotte EM, Xenarios I, Yeates TO.
Protein function in the post-genomic era.
Nature, 2000 Jun 15. 405(6788):823-6. Review. PMID: 10866208

[2] Letovsky S and Kasif S.
Predicting protein function from protein/protein interaction data: a probabilistic approach.
Bioinformatics, 2003. 19 Suppl 1:i197-204. PMID: 12855458

[3] Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C.
Predictome: a database of putative functional links between proteins.
Nucleic Acids Res, 2002 Jan 1. 30(1):306-9. PMID: 11752322

23. What is OperonDB?

OperonDB is a method of predicting gene function that involves detecting and analyzing pairs of genes located adjacent to one another on the same DNA strand and conserved in two or more bacterial genomes. For each conserved gene pair, OperonDB estimates the probability that the genes belong to the same operon by taking into account alternative possibilities that explain why the genes are adjacent in several genomes. To determine the structure of an operon, the gene order and orientation must be conserved in two or more species. Since genes within an operon often have related functions, knowing the operon's structure provides information about the function of genes within it.

Pertea M, Ayanbule K, Smedinghoff M, Salzberg SL.
OperonDB: a comprehensive database of predicted operons in microbial genomes.
Nucleic Acids Res, 2009 Jan. 37(Database issue):D479-82. Epub 2008 Oct 23.
Please refer to: Operon Database

24. What is Domain Fusion?

Domain fusion allows for a prediction of functional relationship between two distinct genes in an organism if those two genes are fused as a continuous sequence in another organism. The fused gene in one organism suggests a relationship between the component genes in another organism - a relationship which is not necessarily due to sequence similarity. Fusion links frequently relate genes of the same functional category. Therefore, the function of an uncharacterized gene within a fusion link can be inferred from the known function of the gene to which it is fused.

Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D.
Detecting protein function and protein-protein interactions from genome sequences.
Science, 1999. 285(5428):751-3.

Yanai I, Derti A, and DeLisi C.
Genes linked by fusion events are generally of the same functional category: a systematic analysis of 30 microbial genomes.
Proc Natl Acad Sci U S A, 2001. 98(14):7940-5.

25. What is Phylogenetic Profile?

Phylogenetic profiling infers the function of an uncharacterized gene from a gene of known function that shows the same pattern of presence and absence across a set of phylogenetically distributed genomes. The profile of a gene consists of the pattern of occurrence of its orthologs across a set of genomes. Orthologs here are used as defined in the COG database. Two genes are assumed to be functionally related if the correlation between their profiles is greater than would be expected by chance.

Wu J, Kasif S, and DeLisi C.
Identification of functional links between genes using phylogenetic profiles.
Bioinformatics, 2003. 19(12):1524-30.

Wu J, Hu Z, and DeLisi C.
Gene annotation and network inference by phylogenetic profiling.
BMC Bioinformatics, 2006. 7:80.
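
As a minimal sketch of comparing two phylogenetic profiles, the example below encodes each profile as a list of 1 (ortholog present) and 0 (absent) across the same genomes and computes the Pearson correlation. The profiles are made up, and the choice of Pearson correlation is an assumption for illustration; the text above does not say which correlation measure or significance test is used.

```python
from math import sqrt
from typing import List

def profile_correlation(x: List[int], y: List[int]) -> float:
    """Pearson correlation between two presence/absence profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and cover the same genomes")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a gene present (or absent) everywhere carries no signal
    return cov / sqrt(var_x * var_y)

# Hypothetical profiles over six genomes.
gene_known = [1, 0, 1, 1, 0, 0]
gene_query = [1, 0, 1, 1, 0, 0]   # identical pattern
gene_other = [0, 1, 0, 0, 1, 1]   # complementary pattern

print(profile_correlation(gene_known, gene_query))
print(profile_correlation(gene_known, gene_other))
```

Deciding whether a correlation is "greater than would be expected by chance" would additionally require a null model, which is outside the scope of this sketch.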

26. What is Gene Neighborhood?

A gene neighborhood consists of genes located near one another on a DNA strand whose proximity is conserved in several genomes. When a gene neighborhood is known, related function between these genes can be inferred from their conservation of proximity across many genomes. The probability that neighboring genes encode proteins within the same biological pathway depends on the number of genomes in which the proximity of the genes is conserved. The conserved order of genes implies selective bias, which suggests related function. This method produces links between ortholog families that are validated by observed proximity in genomes from multiple phylogenetic groups.

Yanai I, Mellor JC, and DeLisi C.
Identifying functional links between genes using conserved chromosomal proximity.
Trends Genet, 2002. 18(4):176-9.

27. Does COMBREX make its own predictions?

Yes. COMBREX predictions are produced by conservatively propagating molecular function from experimentally validated proteins to proteins without annotation and proteins lacking specific functional annotation. The molecular functions of some experimentally validated proteins can be propagated to others based on sequence similarity; the greater the similarity, the higher the confidence in the propagated function. Functional propagation involves transferring a gold or green gene’s function to a gene with unknown function that shares significant identity with the experimentally validated gene. The functional propagation is mainly used to improve the predictions associated with black genes, consequently turning them into blue genes. Work on propagating function from an experimentally validated protein to another experimentally validated one is under progress and may be included in future versions of COMBREX. Propagation of function follows the criteria that require both proteins to share all the same domains regardless of the order of the domains and that both proteins share sufficiently enough sequence similarity to result in a BLAST E-value below 1e-05. These predictions are displayed under the “Predicted Function” section on the Gene Page of the gene to which the function is propagated. Proteins that receive annotation of function from an experimentally validated protein are listed as protein ID’s under the “Status” section on the Gene Page of the experimentally validated gene, the gene from which the function is propagated. For more information on the conservative functional propagation predictions produced by COMBREX, please refer to the Functional Propagation guide.

Additionally, many teams associated with the COMBREX project, such as the Salzberg team, Vitkup team, DeLisi team, and Segre team, provide expert predictions of gene function. The collaborative nature of the COMBREX project entails the involvement of many teams beyond the immediate COMBREX community, including the Horn team, the Greiner team, the Palsson team, the Sjolander team, the Haft team, and others. The teams contributing to the COMBREX project use a variety of methods to computationally produce predictions of gene function. These predictions, along with the conservative functional propagation predictions produced by COMBREX, can be found on the gene detail page.

28. How do I submit a bid?

To submit a bid, go to the “Submit a Bid” page by either clicking on the auctioneer’s gavel graphic found on the cluster page to the left of each gene symbol under the “Status” column or by clicking on the “Submit a Bid” button found on the gene page under the “Status” heading in the “Bid Status” row.

On the “Submit a Bid” page you can download the bid submission form here: Bid Submission Form.

The bid submission form requires you to provide the following information:

Once you have completed the bid submission form, attach your completed bid form to the “Submit Bid” page, and remember to specify the amount of funds you are requesting.

After you submit your bid, please attach a biosketch or CV for the laboratory’s PI. The biosketch or CV can be in any standard NIH, NSF, or other granting agency's format. Other support information is not needed.

If your bid is judged to be competitive by the COMBREX executive committee and its external reviewers, you will be asked to submit a full proposal, which will include a detailed budget and the normal institutional assurances and paperwork required by NIH to establish a subcontract with Boston University. These details are not needed at the time of the initial bid.

You will also find a location to upload a 1-2 page Word or PDF document with details of the proposed experiments.

For detailed step-by-step instructions on how to submit a bid, please refer to the How To Submit A Bid instructions guide.

29. Why can I not submit a bid for certain genes?

Genes currently being investigated by an experimental group are not available for an additional bid until their six-month period of investigation is complete. The “bid in progress” logo indicates which genes are currently under a bid and are therefore not available for further bid submission. It is a goal of COMBREX to foster cooperation and collaboration rather than competition, so groups awarded a bid have a six-month exclusive period, funded by COMBREX, in which to perform their investigations. If you are interested in the experimental validation of a gene that is already being validated by another group, please contact the administrators.

30. How can I submit a bid for a gold or green gene?

Generally, submitting bids for green or gold genes is discouraged, because the functions of these genes have already been validated experimentally. Further validation of these genes will most likely not provide significant additional insight into gene function. However, if there is a new prediction or other information indicating that the current annotation of certain gold or green genes may be incorrect or incomplete, bid submission for those green or gold genes will be considered.

COMBREX administrators should be notified of such possible mis-annotations of green or gold genes, and the evidence of incomplete annotation or mis-annotation will be reviewed. If the evidence is compelling, the bid submission and review process will proceed as usual.

One criterion for labeling a gene "Green" is that it shares 98% or more full-length sequence identity with a confirmed, experimentally validated Gold gene. A relevant scenario that may occur is the prediction of altered substrate specificity due to amino acid changes near an active site. Bid submissions for these genes are likely to be considered, because the changes can be plausibly linked to altered function, and experimental validation of this could be an important step towards understanding a larger family of enzymes.
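
To make the 98% criterion concrete, the sketch below computes percent identity over a full-length pairwise alignment and lists the positions that differ, which is where a change near an active site would show up. It is an illustration only. The function names are hypothetical, the alignment is assumed to have been produced by a separate tool, and using the alignment length as the denominator is an assumption, since the FAQ does not say how COMBREX calculates identity.

```python
from typing import List

GREEN_IDENTITY_THRESHOLD = 98.0  # percent, as stated in the FAQ answer


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("expected two non-empty aligned sequences of equal length")
    # A gap character ("-") never counts as a match.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_green_identity_criterion(aligned_a: str, aligned_b: str) -> bool:
    """True if identity is at or above the 98% threshold."""
    return percent_identity(aligned_a, aligned_b) >= GREEN_IDENTITY_THRESHOLD


def differing_positions(aligned_a: str, aligned_b: str) -> List[int]:
    """1-based alignment columns where the two sequences differ."""
    return [i for i, (x, y) in enumerate(zip(aligned_a, aligned_b), start=1) if x != y]


# Toy sequences: a 100-residue placeholder and a variant with two substitutions.
gold_seq = "A" * 100
variant_seq = "A" * 49 + "G" + "A" * 49 + "G"
```

In this toy case the variant still meets the identity criterion, and `differing_positions` reports the two substituted columns that a curator could compare against known active-site residues.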

31. What is the purpose of the High Priority List?

The High Priority list consists of uncharacterized genes with specific, biochemically testable predictions of molecular function that have been nominated by registered COMBREX users as having high priority for experimental validation. This High Priority status implies that their biochemical characterization would be of great benefit to the scientific community. The list is not ranked; it can be sorted by species name, gene name, or functional assignment. Registered COMBREX users who wish to apply for funding to experimentally validate a High Priority gene may do so either through the appropriate gene page or directly from the High Priority list (instructions can be found in the How To Submit A Bid guide). For step-by-step instructions on how to nominate a gene for the High Priority List, please see the “How to nominate a gene for High Priority List” document in the help center. If you have any questions regarding the High Priority List, please contact us at: help-desk@combrex.bu.edu.

32. What does it mean to have “NCBI Protein Clusters” as a source of prediction?

The NCBI Protein Clusters database groups proteins into mutually sequence-similar groups called clusters. Because the proteins in a given cluster are likely to perform the same or similar function, membership in a cluster with a defined function can serve as a functional prediction for a given protein. For more information on the NCBI Protein Clusters database and its methodology for producing protein clusters, please refer to: PMID: 18940865.

33. What does it mean to have “NCBI annotation” as a source of prediction?

For a protein that has no function prediction produced by COMBREX or other computational teams, and for which NCBI Protein Clusters provides no functional clues, we use its NCBI (RefSeq protein) annotation as the default prediction of function.
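
The fallback described above can be sketched as follows. This is a hypothetical illustration of the stated rule, not COMBREX code. The function name, argument names, and source labels are assumptions, and no ranking among the other sources is implied.

```python
from typing import List, Optional, Tuple


def prediction_sources(
    computational_predictions: List[str],
    cluster_function: Optional[str],
    ncbi_annotation: Optional[str],
) -> List[Tuple[str, str]]:
    """Return (source, predicted function) pairs for one protein.

    The NCBI (RefSeq protein) annotation is used only when neither
    computational predictions nor NCBI Protein Clusters offer anything.
    """
    sources = [("Computational prediction", p) for p in computational_predictions]
    if cluster_function:
        sources.append(("NCBI Protein Clusters", cluster_function))
    if not sources and ncbi_annotation:
        # Default prediction of function, per the rule stated above.
        sources.append(("NCBI annotation", ncbi_annotation))
    return sources
```

For example, a protein with no computational predictions and no cluster-derived function would be assigned its RefSeq annotation as its only prediction.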

34. What information does the “Phenotype Data” section within the gene detail page contain?

The “Phenotype Data” section within the COMBREX gene detail page contains information about documented phenotypes associated with the gene of interest. Specifically, this section lists the phenotype name, a brief description of the phenotype, the expression class (e.g. wild type or knockout), and a link to the reference that documents the association of the gene with the listed phenotype. Currently, the phenotype data consists of antibiotic resistance, antibiotic hypersensitivity, and gene essentiality. Brief descriptions of these phenotypes and links to the sources from which these data were obtained are provided below.

Antibiotic Resistance genes

These genes can confer resistance to one or multiple antibiotics through several mechanisms. The antibiotic resistance data was obtained from the Antibiotic Resistance Genes Database (ARDB).

Antibiotic Hypersensitivity genes

Loss of these genes confers increased sensitivity to one or more antibiotics. This data was kindly provided by Dr. Jeffrey H. Miller (UCLA). The Miller lab has screened the KEIO collection of approximately 4000 single gene knockout mutants of E. coli K12 for increased sensitivity to 22 different antibiotics. More information on the antibiotic hypersensitivity genes can be found in the original publication: Liu A et al (2010) Antimicrob Agents Chemother 54(4), 1393-1403. PMID 20065048.

Essential genes

These genes are identified as being essential for growth or viability in one or more of the following organisms: Escherichia coli str. K-12, Helicobacter pylori 26695, Acinetobacter baylyi ADP1, Bacillus subtilis 168, Haemophilus influenzae Rd, and Pseudomonas aeruginosa. Information about these candidate essential genes has been gathered from multiple sources, references for which can be found at the bottom of the “list of Phenotypes” page.

To view a complete list of all the phenotypes available in COMBREX, click on “Advanced Search” next to the search bar and then click on “View list of phenotypes” within the advanced search options.