This page will help you understand how to use our Gene Search to better find the products related to your genes of interest.
Introduction
Finding a cDNA clone to match your gene or sequence of interest
can at first seem like a complex task. In practice it is quite easy once
you understand that each clone may have numerous identifiers and know which
identifiers are easiest to use. The same is true if you are starting from a gene
name or a piece of sequence. At Open Biosystems it is our goal to make this
process as straightforward as possible. We provide a clone query (Gene Search) that accepts the
most commonly used identifier, the GenBank Accession number, and additional clone
identifiers when possible. In addition, our unique RefClone Mapping helps you
rapidly identify physical cDNA clones containing your sequence of
interest. Below you will find a description of the Open Biosystems Gene Search, a
brief tutorial on identifier definitions, and additional hints on finding a clone.
If you have difficulty finding a clone you require, have a
special request, or want to provide feedback on how we can better assist you, please
contact us at
info@openbiosystems.com or call our customer service at 1-888-412-2225.
How to search
Searching is easy. You may search for up to fifty identifiers of the same type at a time and page through the results. Enter your identifier(s) in the text box at the far right and click the submit button.
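If you have more than fifty identifiers, one way to prepare them is to split the list into batches before pasting each batch into the search box. The following is a minimal sketch only: the function name `batch_identifiers` is hypothetical and is not part of the Gene Search itself; it simply applies the fifty-identifier limit stated above.

```python
from typing import Iterator, List

MAX_IDENTIFIERS_PER_SEARCH = 50  # limit stated in the text above


def batch_identifiers(
    identifiers: List[str], size: int = MAX_IDENTIFIERS_PER_SEARCH
) -> Iterator[List[str]]:
    """Yield batches of at most `size` identifiers, skipping blanks and duplicates."""
    seen = set()
    cleaned = []
    for raw in identifiers:
        ident = raw.strip()
        if ident and ident not in seen:
            seen.add(ident)
            cleaned.append(ident)
    # Slice the cleaned list into consecutive chunks of `size`.
    for start in range(0, len(cleaned), size):
        yield cleaned[start:start + size]
```

Each batch should still contain identifiers of a single type (for example, all GenBank Accession numbers), so it may be simplest to sort a mixed list by type first.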
Searching by GenBank Accession number or Clone ID: Accepted Identifiers
The Open Biosystems clone query accepts both
GenBank Accession numbers and clone identifiers that are unique to specific
collections. To find a clone of interest, enter either a GenBank Accession or Clone
ID into the search query box. The Open Biosystems clone query will then attempt
to find an exact match to the GenBank Accession or Clone ID that you entered.
However, if an exact match is not found, the clone query system will use a
unique mapping mechanism, RefClone Mapping, to return a listing of homologous
clones in our collection. Examples of the accepted identifiers are given below:
| Identifier | Format | Example |
| GenBank Accession | one or two letters followed by a series of digits | BQ228424 |
| IMAGE ID | five to seven digit number | 6059449 |
| University of Iowa ID | a series of letters, digits and dashes beginning with UI | UI-R-E0-by-f-10-0-UI |
A GenBank Accession number is assigned every time a sequence is deposited
with NCBI. It consists of one or two letters followed by a series of digits (e.g.
BQ228424). A single cDNA clone can have several GenBank Accession numbers, because
several sequences may have been deposited with NCBI for the same cDNA
clone. For example, one laboratory may have sequenced the 5' end of a clone and
deposited the sequence while another lab may have sequenced the 3' end.
Additionally, if a cDNA clone is completely sequenced and found to contain a
full-length gene sequence, it will be given yet another GenBank Accession
number.
*Note: Many GenBank Accession records of sequence data do not reference a cDNA clone directly. These may be mRNA sequences (NM_), protein sequences (NP_), genomic sequences (NC_, NT_) or non-coding transcripts (XR_). A cDNA clone or BAC clone containing the sequence of interest may be available, but identifying it will require additional searching. See 'Identifier Definitions' or contact technical service for assistance.
The IMAGE ID is a unique identifier
assigned by the I.M.A.G.E. Consortium to each clone derived from its
project. Unlike the GenBank Accession number, a single clone will have only one
IMAGE ID, and all sequences derived from the same clone will reference that IMAGE
ID. For example, the IMAGE ID 236338 is a cDNA clone containing sequence for the
Tumor protein p53 (Li-Fraumeni syndrome). This cDNA clone was sequenced
from the 5' end, and that sequence was deposited with NCBI and given the GenBank
Accession number H61357. It was also sequenced from the 3' end and given the
GenBank Accession number H62385. In both cases the IMAGE ID 236338 is referenced.
When entering IMAGE clone IDs into the clone query, enter the numerical portion only
(i.e. remove the "IMAGE:" prefix).
The University of Iowa ID is a unique identifier assigned to cDNA
clones derived from projects originating at the University of Iowa. These include a rat
cDNA clone collection and other unique human cDNA collections (e.g. the Cystic Fibrosis
cDNA Collection). University of Iowa IDs include information relating to
tissue source, library type, etc. The University of Iowa ID is assigned in much the same
way as the IMAGE ID described above. An example of a UI ID: UI-R-FS1-cqh-p-20-0-UI
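The three identifier formats described above can be told apart by their shape. The sketch below is illustrative only: the regular expressions are an approximation built from the format descriptions and examples on this page, not the actual validation rules used by the Gene Search, and the function names are hypothetical.

```python
import re

# Approximate patterns based on the formats described above (assumptions,
# not the Gene Search's real rules).
GENBANK_RE = re.compile(r"^[A-Z]{1,2}\d+$")   # e.g. BQ228424, H61357
IMAGE_RE = re.compile(r"^\d{5,7}$")           # e.g. 6059449, 236338
UI_RE = re.compile(r"^UI(-[A-Za-z0-9]+)+$")   # e.g. UI-R-E0-by-f-10-0-UI


def normalize(identifier: str) -> str:
    """Trim whitespace and drop a leading 'IMAGE:' prefix, as advised above."""
    ident = identifier.strip()
    if ident.upper().startswith("IMAGE:"):
        ident = ident[len("IMAGE:"):].strip()
    return ident


def classify(identifier: str) -> str:
    """Return the identifier type, or 'unknown' if no pattern matches."""
    ident = normalize(identifier)
    if GENBANK_RE.match(ident.upper()):
        return "GenBank Accession"
    if IMAGE_RE.match(ident):
        return "IMAGE ID"
    if UI_RE.match(ident):
        return "University of Iowa ID"
    return "unknown"
```

With these patterns, identifiers such as NM_123456 are reported as "unknown" because of the underscore; they belong to the prefixed accession types covered under 'Identifier Definitions'.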
Gene Search and RefClone Mapping
To find a clone of interest, enter either a GenBank Accession or Clone ID
into the search query box. The Open Biosystems Gene Search will then
look for an exact match to the GenBank Accession or Clone ID that you
entered. If an exact match is not found, the system will
use a unique mapping mechanism, RefClone Mapping, to find homologous
clones in our collection. The query processes your request in a
stepwise fashion:
- The identifier type is determined to be either an accession number or a clone ID.
- An exact match is sought in the Open Biosystems' clone database.
Exact matches are returned with the ordering information and technical
data.
- Identifiers without exact matches then undergo RefClone
mapping to the UniGene database. The UniGene database clusters multiple
clone sequences by MegaBLAST alignments (see UniGene Clustering Process
below for details on the UniGene database). The cluster containing the
requested identifier is first identified.
- All of the clones are then extracted from the
representative UniGene cluster and compared against the Open
Biosystems' clone database.
- Clones found to be contained in both the UniGene
cluster and the Open Biosystems' clone database are then displayed
under the message:
- 'You searched for '123456', we found # in the same cluster (Hs.###).'
- Clones are returned with the ordering information and technical data.
- The clones returned were found to be related by sequence homology using
the methods described in UniGene Clustering Process (see below). These
clones should be confirmed by individual alignment to the requested
sequence to ensure accuracy. BLAST2 (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html) at NCBI may be used to rapidly align the requested sequence and the RefClone sequence.
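The stepwise lookup above can be pictured with a small sketch. It is not the actual Gene Search implementation: the dictionaries `CLONE_DB` and `CLUSTERS`, the clone names, and the cluster ID `Hs.example1` are all made-up stand-ins for the clone database and the UniGene clusters. Only the order of the steps (exact match first, then clones from the same cluster) follows the description.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy data: identifier -> clone held in the catalog.
CLONE_DB: Dict[str, str] = {
    "H61357": "clone-A",
    "236338": "clone-A",
    "BQ228424": "clone-B",
}
# Hypothetical toy data: cluster ID -> identifiers grouped in that cluster.
CLUSTERS: Dict[str, Set[str]] = {
    "Hs.example1": {"H61357", "H62385", "236338"},
}


def find_cluster(identifier: str) -> Optional[str]:
    """Return the ID of the cluster containing the identifier, if any."""
    for cluster_id, members in CLUSTERS.items():
        if identifier in members:
            return cluster_id
    return None


def gene_search(identifier: str) -> Tuple[str, List[str]]:
    """Exact match first; otherwise return catalog clones from the same cluster."""
    if identifier in CLONE_DB:
        return "exact", [CLONE_DB[identifier]]
    cluster_id = find_cluster(identifier)
    if cluster_id is None:
        return "none", []
    # Keep only cluster members that are also present in the clone database.
    clones = sorted({CLONE_DB[m] for m in CLUSTERS[cluster_id] if m in CLONE_DB})
    return "cluster " + cluster_id, clones
```

In this toy data, H62385 has no exact match, so the search falls back to the cluster it shares with H61357 and returns the clone found there. As noted above, a clone found this way should still be confirmed by aligning it to the requested sequence.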
Open Biosystems is not responsible for clones that may not align in whole or in
part, as clone sequencing, genome finishing and clustering are currently dynamic
processes. This service is provided to help you rapidly
identify homologous cDNA clones.
Other Common Identifiers
The GenBank Accession number
and clone identifiers previously described relate directly to sequences of a cDNA
clone. However, you may find reference to or be interested in identifiers beginning
with the prefixes below. The prefix indicates a specific type of sequence, and even
though it is not a direct sequence from a cDNA clone, we may be able to locate a
clone for you containing the sequence of interest. Brief definitions of these
identifiers are given below, along with how a representative clone containing the
sequence of interest may be obtained if one is available.
Identifier Definitions
(taken from
http://www.ncbi.nih.gov/RefSeq/)
NCBI Accession numbers
that begin with the prefix NG_ (genomic), NM_ (mRNA) and NP_ (protein) are generated
and maintained by the Curated RefSeq project. Sequence records are reviewed and
additional feature annotation may be added. In addition, in some instances the
sequence has been modified relative to the original GenBank sequence from which it
was derived. Representative clones are available for a majority of these accession
numbers. These are readily identified by reviewing the 'mRNA' section of the
UniGene entry containing the accession number. Full-length MGC/IMAGE clones that
are representative of the RefSeq entry will be listed in the 'mRNA' section. For
EST representatives, review the 'EST' section of the same UniGene entry.
| Accession Format | Molecule Type | Genome |
| NC_123456 | Complete Genome | Archaea, Bacterial, Organelle, Virus |
| NC_123456 | Complete Chromosome | Eukaryote |
| NG_123456 | Genomic Region | Homo sapiens |
| NM_123456 | mRNA | Homo sapiens, Mus musculus, Rattus norvegicus |
| NP_123456 | Protein | All of the above |
| NT_123456 | Genomic Contig | Homo sapiens, Mus musculus |
NCBI Accession numbers that begin with the prefix XM_ (mRNA),
XR_ (non-coding transcript), and XP_ (protein) are model reference
sequences produced by NCBI's Genome Annotation project (i.e. in silico predictions).
These records represent the transcripts and proteins that are annotated on the NCBI
Contigs, which may have been generated from incomplete data. Because the XM_, XR_,
and XP_ accessions reflect the current state of NCBI's assembly of the genomic
sequence, they may be different from GenBank submissions for mRNAs and/or the
curated RefSeq records. These differences may reflect real sequence variation
(polymorphism), errors in GenBank accessions used as sources for unreviewed
(provisional) RefSeq records, or errors or gaps in the available genomic sequence.
These sequences should be used with caution, after comparing them to other available
sequence information (check the evidence viewer, BLink, LocusLink, or sequence
neighbors).
| Accession Format | Molecule Type | Genome |
| XM_123456 | mRNA | Homo sapiens model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig. |
| XR_123456 | RNA | Homo sapiens model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig. |
| XP_123456 | Protein | Homo sapiens model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig. |
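The prefix tables above lend themselves to a simple lookup. The sketch below is an illustration only, using the prefixes and molecule types listed on this page; the names `PrefixInfo` and `describe_accession` are hypothetical, and the `is_model` flag simply marks the XM_, XR_ and XP_ accessions described above as in silico predictions.

```python
from typing import NamedTuple, Optional


class PrefixInfo(NamedTuple):
    molecule: str    # molecule type as listed in the tables above
    is_model: bool   # True for Genome Annotation model sequences (XM_, XR_, XP_)


# Prefix meanings taken from the two tables above.
REFSEQ_PREFIXES = {
    "NC_": PrefixInfo("complete genome or chromosome", False),
    "NG_": PrefixInfo("genomic region", False),
    "NM_": PrefixInfo("mRNA", False),
    "NP_": PrefixInfo("protein", False),
    "NT_": PrefixInfo("genomic contig", False),
    "XM_": PrefixInfo("mRNA", True),
    "XR_": PrefixInfo("non-coding transcript", True),
    "XP_": PrefixInfo("protein", True),
}


def describe_accession(accession: str) -> Optional[PrefixInfo]:
    """Return prefix information, or None if the prefix is not in the tables."""
    return REFSEQ_PREFIXES.get(accession.strip().upper()[:3])
```

A result of None means the accession does not carry one of these prefixes; it may instead be a GenBank Accession or clone ID that can be entered directly into the Gene Search.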
Finding a clone starting with a gene name
For human or mouse full-length clones the
best resource is the MGC website at
http://mgc.nci.nih.gov, which has a keyword or gene symbol search.
Another resource to find a clone by gene name is the UniGene
database,
http://www.ncbi.nlm.nih.gov/UniGene/. UniGene groups (clusters) sequences that
appear to derive from the same gene, based on sequence similarity. Typing the
name of your gene of interest into the UniGene search engine and choosing your
desired organism from the dropdown menu will return the cluster ID and cluster
information. Included in the cluster information is a list of associated EST clones.
Clones are displayed by sequence end read length from longest to shortest if
noted. You can use the Open Biosystems Clone Query to
search our current stocks using either the Source ID or GenBank accession.
Finding a clone starting with a nucleotide sequence
For human or mouse full-length clones the best resource is the MGC website at
http://mgc.nci.nih.gov, which has BLAST
capability.
The standard nucleotide-nucleotide
BLAST engine at the NCBI website,
http://www.ncbi.nlm.nih.gov/BLAST/, is a useful tool for locating homologous
clones. Simply paste the sequence of interest into the "search" box, choose the
database you would like to search (e.g. est_others), and click "BLAST". These
instructions are the simplest way to perform a BLAST query; many ways to refine
and restrict the query are described at the NCBI BLAST website.
The results returned from the
BLAST query will be arranged in descending order from the most highly similar
sequence alignment to the least similar. By scrolling down to the "Alignments"
section or by clicking on an individual record, clone matches can be located by
looking in the description of each record.
UniGene Clustering Process (taken from 'The UniGene Build Procedure')
- Clustering
is the process of finding subsets of sequences that belong together
within a larger set. This is done by converting discrete similarity
scores to boolean links between sequences. That is, two sequences are
considered linked if their similarity exceeds a threshold. UniGene
clustering proceeds in several stages, with each stage adding less
reliable data to the results of the preceding stage. This staged
clustering affords greater control than a more egalitarian treatment of
all links between sequences.
- Screening for contaminants, repeats, and low-complexity
sequence is performed. Low-complexity screening is performed using
NCBI's Dust. Mitochondrial and ribosomal sequences are screened for,
as are vector contaminants and repetitive elements. After screening, a
sequence must contain at least 100 informative base pairs (bp) to be a
candidate for entry into UniGene.
- Gene links are established. The set of mRNA sequences
is compared with itself. Sequence pairs that are sufficiently similar
are linked together to form initial clusters.
- Links between ESTs and mRNA are added to these
clusters. The set of ESTs is compared with sequences from the set of
initial clusters using megablast, and sufficiently similar sequence
pairs are added to the clusters. Links that would join the initial
mRNA-based clusters are discarded. EST to EST links are also generated
and used to extend the initial clusters and to generate clusters
composed solely of ESTs.
- Clone-based edges are added; these allow
non-overlapping 5' and 3' ESTs to be assigned to the same cluster.
Clone IDs which link at least two 5' ends from one cluster with at
least two 3' ends from another cluster are found, and the two clusters are
merged. Due to imperfect clone labeling, a single clone-ID based edge
is insufficient to merge two clusters.
- Any resulting cluster that does not contain a sequence
with a polyadenylation signal or tail is discarded. Clusters that meet
these criteria are called anchored clusters, since their 3' end is
presumed to be known.
- ESTs that do not belong to an anchored cluster are
rechecked at a lower level of stringency than in the preceding passes.
An EST which passes this less stringent test is then added to the
cluster which contains the sequence which is the best match to the EST;
it is a guest member.
- Clusters of size 1 (that is, clusters which seem to
identify infrequently expressed genes) are compared against the rest of
the sequences in UniGene at a lower level of stringency, and merged
with the cluster containing the most similar sequence.
- The resulting clusters are compared with the preceding
week's build and renumbered in an attempt to maintain continuity. Since
the sequences that make up a cluster may change from week to week, and
since the cluster identifier may disappear (typically when two clusters
merge), using the cluster identifier as a reference is ill-advised.
Using the GenBank accession numbers of the sequences that make up the
cluster is a safe alternative.
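The core idea in the first step, turning similarity scores into boolean links and grouping linked sequences, can be shown with a toy example. This is a minimal sketch of threshold-based single-linkage clustering using a union-find structure. It is not the UniGene build code, it leaves out the screening, staging and clone-edge rules described above, and the sequence names, scores and threshold are invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_by_threshold(
    sequences: Iterable[str],
    scores: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Group sequences so that any pair scoring above the threshold is linked."""
    parent = {seq: seq for seq in sequences}

    def find(seq: str) -> str:
        # Follow parent pointers to the cluster representative (path halving).
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for (a, b), score in scores.items():
        if score > threshold:          # similarity score -> boolean link
            parent[find(a)] = find(b)  # merge the two clusters

    groups: Dict[str, List[str]] = {}
    for seq in parent:
        groups.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in groups.values())
```

For example, with scores {("s1", "s2"): 0.97, ("s2", "s3"): 0.95, ("s3", "s4"): 0.40} and a threshold of 0.9, s1, s2 and s3 end up in one cluster even though s1 and s3 were never compared directly, while s4 remains a cluster of size 1.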
Call For Assistance
If you have any questions that are not covered here, please
call us at 888-412-2225 or email us at
info@openbiosystems.com and our customer service representatives will be happy
to help you.