Accession number: A unique
number or code given to mark the entry of a sequence (protein or nucleic
acid) or pattern (regular expression, fingerprint, profile) to a primary
or secondary database. Accession numbers should remain static between
database updates, and hence in theory provide a mechanism for reliably
identifying a particular entry in subsequent database releases.
Algorithm: The logical sequence of steps by which a task can be performed.
Alternatively spliced form: See
Amino acid: The fundamental building block of proteins. There are 20 naturally occurring
amino acids in animals and around 100 more found only in plants.
Amphipathic helix: A helix
that displays a characteristic charge separation in terms of the distribution
of its polar and non-polar residues on opposite faces. Their ‘sidedness’
allows such helices to sit comfortably at polar/apolar interfaces,
such as at the surfaces of globular proteins (where their hydrophilic
sides point towards the solvent, and their hydrophobic sides point
towards the protein core), or within membranes (where their hydrophobic
sides point towards the lipid environment, and their hydrophilic sides
point towards the protein interior).
Analogues: Non-homologous proteins that have similar folding architectures, or similar
functional sites, which are believed to have arisen through convergent evolution.
Applets: Small software applications loaded from a server via HTML pages.
Assembly: The process of aligning overlapping sequence fragments into a contig
or series of contigs.
Base pair (bp): Any possible
pairing between bases in opposing strands of DNA or RNA. Adenine pairs
with thymine in DNA, or with uracil in RNA; and guanine pairs with cytosine.
Bioinformatics: The application of computational techniques to the management and analysis of biological
Block: An ungapped, aligned motif consisting of sequence segments that are clustered
to reduce multiple contributions from groups of highly similar or
Browser: A computer program (commonly known as a Web client) that permits information
retrieval from the Internet and the WWW.
cDNA library: A gene library composed of cDNA inserts synthesized from mRNA using reverse transcriptase.
Central dogma: A fundamental
principle of molecular biology, first expounded by Francis Crick in
1958, essentially stating that the transfer of information from nucleic
acid to nucleic acid, or from nucleic acid to protein, is possible,
while transfer from protein to nucleic acid or from protein to protein
is impossible. A shorthand expression of the dogma gives the unidirectional
Chaperone: A protein that assists the correct non-covalent assembly of folding proteins
in vivo; chaperones do not themselves form part of the structures
they help to assemble.
Chromosomes: The paired, self-replicating genetic structures of cells that contain
the cellular DNA; the nucleotide sequence of the DNA encodes the linear
array of genes.
Client: Any program that interacts with a server (Lynx, Mosaic and Netscape are
examples of client software).
Clone: A copied fragment of DNA, maintained in circular form, identical to the
template from which it is derived.
Cloning: The process of generating identical copies of a DNA fragment (that may
encode a complete gene) from a single template DNA.
Cloning vector: A DNA
molecule originating from a virus, a plasmid, or the cell of a higher
organism into which another DNA fragment can be integrated without
compromising the vector’s capacity for self-replication.
A region of DNA or RNA whose sequence determines the sequence of amino
acids in a protein.
Command line: The basic
level at which a computer prompts the user for input.
Communication protocol: An agreed
set of rules for structuring communication between programs (allowing,
for example, data exchange between nodes on the Internet).
Complementary DNA (cDNA): DNA
that is synthesized from a messenger RNA template using the enzyme reverse transcriptase.
Composite database: A database
that amalgamates a number of primary sources, using a set of defined
criteria that determine the priority of inclusion of the different
sources and the level of redundancy retained (e.g., NRDB is a non-identical
composite protein sequence database and OWL is a non-redundant composite).
Conceptual translation: The computational process of interpreting the sequence of nucleotides
in mRNA via the genetic code to a sequence of amino acids,
which may or may not code for protein.
Consensus sequence: A pseudo-sequence
that summarises the residue information contained in a multiple alignment.
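One simple way to derive such a pseudo-sequence is to take the most frequent residue in each column of the alignment. The sketch below assumes an ungapped alignment and a plain majority rule; the function name `consensus` and the example sequences are hypothetical, and real tools often use more elaborate rules (thresholds, ambiguity symbols).

```python
from collections import Counter


def consensus(alignment):
    """Return the most common symbol in each column of an aligned block."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        # most_common(1) returns the single most frequent symbol; ties are
        # broken by first occurrence in the column.
        symbol, _count = Counter(column).most_common(1)[0]
        result.append(symbol)
    return "".join(result)


# Hypothetical aligned sequences, used only for illustration.
alignment = ["GARFIELD", "GARFIENT", "GALFIELD"]
print(consensus(alignment))
```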
Conserved sequence: A sequence
of bases in a DNA molecule (or an amino acid sequence in a protein)
that has remained essentially unchanged during evolution.
Contig: Sequences of clones, representing overlapping regions of a gene, presented
as an assembly or multiple alignment.
A method used to add a dansyl group to free amino groups in protein end-group
analysis. The dansyl amino acids, isolated after hydrolysis of the
protein, are fluorescent and may be detected in nanomolar quantities
Diagnostic performance (diagnostic power):
A measure of the ability of a discriminator to identify true matches,
either in an individual query sequence or in a database.
Discriminator: An abstraction of a conserved motif, or set of motifs (e.g., a regular
expression pattern, a profile or a fingerprint), used to search either
an individual query sequence or a full database for the occurrence
of that same, or similar, motif(s).
DNA (deoxyribonucleic acid): The
molecule that encodes genetic information. DNA is a double-stranded
molecule held together by weak bonds between basepairs of nucleotides.
The four nucleotides in DNA contain the bases adenine (A), guanine
(G), cytosine (C) and thymine (T). In nature, basepairs form only between
A and T and between G and C; thus the base sequence of each single
strand can be deduced from that of its partner.
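Because pairing is restricted to A with T and G with C, the partner strand follows mechanically from either strand. A minimal sketch, assuming an uppercase or lowercase string over A, C, G and T; the name `partner_strand` is hypothetical, and the result is written 5'->3', so the input is also reversed.

```python
# Base-pairing rules stated in the entry: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_strand(strand):
    """Deduce the partner strand (read 5'->3') from one DNA strand."""
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


print(partner_strand("ATGC"))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.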
DNA sequence: The linear sequence of base pairs, whether in a fragment of DNA, a gene,
a chromosome or an entire genome.
Domain: A compact, local, semi-independent folding unit, presumed to have arisen
via gene fusion and gene duplication events. Domains need not be formed
from contiguous regions of an amino acid sequence: they may be discrete
entities, joined only by a flexible linking region of the chain; they
may have extensive interfaces, sharing many close contacts; and they
may exchange chains with domain neighbours. The combination of domains
within a protein determines its overall structure and function.
Down: State of a computer when it is non-operational and hence unavailable for
Dumb: A dumb terminal is a desktop display device that is not capable of local
processing, this being entirely carried out by the central computer.
Such terminals (e.g., VT52, VT100, etc.) do not support windowing applications.
E.C. system: The systematic classification and naming of enzymes by the Enzyme Commission,
whereby enzymes are denoted by the letters E.C., followed by a set
of four numbers separated by dots. The first number indicates one
of six main functional divisions (oxidoreductases, transferases, hydrolases,
lyases, isomerases and ligases); the following numbers denote different
subclasses, as defined by donor group, acceptor, substrate, isomer,
etc., the final digit being a serial number for the particular enzyme
(e.g., E.C.1.1.1.1 for alcohol dehydrogenase, E.C.3.5.3.15 for protein-arginine deiminase).
Edman degradation: A method used in sequencing polypeptides, whereby amino acid residues
are removed sequentially from the N-terminus by reaction with phenyl-isothiocyanate,
to form phenylthiocarbamyl-peptide (PTC-peptide). This is cleaved in
anhydrous acid, releasing a thiazolinone intermediate and the remainder
of the peptide.
Enzyme: A protein that acts as a catalyst, speeding the rate at which a biochemical
reaction proceeds but not altering the direction or nature of the reaction.
See E.C. system.
Eukaryote: Cell or organism with membrane-bound, structurally discrete nucleus and
other well-developed subcellular compartments. Eukaryotes include
all organisms except viruses, bacteria and blue-green algae.
Exon: The protein-coding DNA sequences of a gene.
Expressed Sequence Tag (EST):
A partial sequence of a clone, randomly selected from a cDNA library
and used to identify genes expressed in a particular tissue. ESTs
are used extensively in projects to map the human genome.
Expression profile: The characteristic
range of genes expressed at different stages of a cell’s development
False-negative: A true match that fails to be recognized by a discriminator.
False-positive: A false match
incorrectly recognized by a discriminator.
File Transfer Protocol (FTP): A
method of transferring files to remote computers.
Fingerprint: A group of ungapped motifs excised from a sequence alignment and used
to build a characteristic signature of family membership by means
of iterative searching of a primary (or composite) database.
Firewall: A mechanism for protecting a proprietary computer network (or intranet),
allowing internal users to access the Internet, while preventing external
Internet users from penetrating the intranet.
Flat-file: A human-readable data-file in a convenient form for interchange of database
information. Flat-files may be created as output from relational databases,
in a format suitable for loading into other databases.
Folding problem: The problem
of determining how a protein folds into its final 3D form given only
the information encoded in its primary structure.
Frameshift: An alteration in the reading sense of DNA resulting from an inserted or
deleted base, such that the reading frame for all subsequent codons
is shifted with respect to the number of changes made (e.g., if a
sequence should read UCU-CAA-AGG-UUA, and a single U is added to the
beginning, the new sequence would read UUC-UCA-AAG-GUU, etc.). Frameshifts
shifts may arise through random mutations, or via errors in reading
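The worked example in the frameshift entry can be reproduced by splitting a sequence into consecutive triplets before and after the insertion. This is only an illustrative sketch; the helper name `codons` is hypothetical, and trailing bases that do not fill a complete codon are dropped.

```python
def codons(seq):
    """Split a sequence into consecutive, complete three-base codons."""
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "UCUCAAAGGUUA"
shifted = "U" + original  # a single U inserted at the beginning

print("-".join(codons(original)))
print("-".join(codons(shifted)))
```

Every codon downstream of the insertion changes, which is why a single inserted or deleted base can alter the whole translation product.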
Gene: The fundamental physical and functional unit of heredity. A gene is an
ordered sequence of nucleotides located in a particular position of
a particular chromosome that encodes a specific functional product
(i.e., a protein or RNA molecule).
Gene duplication: A genetic alteration in which a segment of DNA is repeated. Duplications
may appear anywhere, but where the duplicated segment is adjacent
to the original one, this is termed a tandem duplication.
Gene expression: The process by which a gene’s coded information is converted into the
structures present and operating in the cell. Expressed genes include
those that are transcribed into mRNA and then translated into protein
and those that are transcribed into RNA but not translated into protein
(e.g., transfer and ribosomal RNAs).
Gene family: A group of closely related genes that encode similar protein products.
Gene product: The protein
resulting from the expression of a gene. In some cases, the gene product
may be an RNA molecule that is never translated.
Genetic code: The rules
that relate the four DNA or RNA bases to the 20 amino acids. There
are 64 possible three-base (triplet) sequences, which are known as
codons. A single triplet uniquely defines one amino acid, but an amino
acid may be coded by as many as six codons. The code is thus said
to be degenerate.
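The counts quoted in this entry can be checked directly by building the codon table. The sketch below assumes the standard genetic code, written in DNA letters with codons enumerated in T, C, A, G order and `*` marking a stop codon; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code, one letter per codon, with the
# codons enumerated in T, C, A, G order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# How many codons specify each amino acid (the degeneracy of the code).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))          # number of possible triplets
print(len(degeneracy))           # number of amino acids encoded
print(max(degeneracy.values()))  # largest number of codons for one amino acid
```

Each key maps to exactly one amino acid, while several keys can share the same value; that asymmetry is what "degenerate" means here.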
Genome: All the genetic material in the chromosomes of a particular organism;
its size is generally given as its total number of basepairs.
Genome projects: Projects (often via international collaboration) to map and sequence the entire
genomes of particular organisms. The first complete eukaryotic genome
to have been sequenced is that of the yeast S. cerevisiae;
the human genome is expected to be finished by roughly 2003-2005;
and mouse by around 2008. The majority of genomes completed to date
are those of prokaryotes.
Helical wheel: A circular
graph depicting five turns of helix, around which the residues of
a protein sequence are plotted. Helical potential is recognized by
the clustering of hydrophilic and hydrophobic residues in distinct
polar and non-polar arcs.
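The plotting positions on such a wheel can be sketched by assigning each successive residue a fixed rotation about the helix axis. The code assumes an ideal α-helix with 3.6 residues per turn (100° per residue), a figure that is not stated in the entry; the function name `wheel_angles` is hypothetical.

```python
# Assumption: ideal alpha-helix geometry, 3.6 residues per turn,
# i.e. each residue is rotated 100 degrees from the previous one.
ANGLE_PER_RESIDUE = 100.0


def wheel_angles(sequence):
    """Return (residue, angle in degrees) pairs for a helical wheel plot."""
    return [
        (residue, (index * ANGLE_PER_RESIDUE) % 360.0)
        for index, residue in enumerate(sequence)
    ]


# Hypothetical 18-residue segment: 18 residues x 100 degrees = five full turns.
for residue, angle in wheel_angles("LKELLKKLLEELLKKLLE"):
    print(residue, angle)
```

Under this assumption the nineteenth residue would land back on the first position, which is why five turns (18 residues) is a natural size for the plot.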
Hidden Markov Model (HMM): A
probabilistic model consisting of a number of interconnecting states.
Like profiles, HMMs encode full domain alignments. They are essentially
linear chains of match, delete or insert states: a match state denotes
a conserved column in an alignment; an insert state allows insertions
relative to match states; and delete states allow match positions
to be skipped.
Home page: The HTML document that acts as the first contact point between a browser
and a server.
Homology: Being related by the evolutionary process of divergence from a common ancestor.
Homology is not a synonym for similarity.
Hybridization: The process of
joining two complementary strands of DNA or one each of DNA and RNA
to form a double-stranded molecule.
Hydrophobic: Having the property of hydrophobicity, a low affinity for water.
Hydropathy profile: A graph
in which hydropathy values are calculated within a sliding window
and plotted for each residue in a protein sequence. Such graphs show
characteristic peaks and troughs, corresponding to the most hydrophobic
and hydrophilic regions of the sequence respectively.
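The sliding-window calculation can be sketched as a moving average over per-residue hydropathy values. The entry does not name a scale; the Kyte-Doolittle values and the default window of 7 used below are assumptions chosen for illustration, and `hydropathy_profile` is a hypothetical name.

```python
# Assumption: Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Average hydropathy in a window centred on each residue.

    Residues too close to either end to have a full window are skipped.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a centre residue")
    sequence = sequence.upper()
    half = window // 2
    profile = []
    for centre in range(half, len(sequence) - half):
        segment = sequence[centre - half:centre + half + 1]
        profile.append(sum(KYTE_DOOLITTLE[aa] for aa in segment) / window)
    return profile


# Hypothetical sequence: a hydrophobic stretch followed by a charged one.
print(hydropathy_profile("LLLLLLLDDDDDDD"))
```

Plotting the returned values against residue position gives the peaks (hydrophobic) and troughs (hydrophilic) described above.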
Hyperlink: An active HTTP cross-reference that links one Web document to another document
on the Internet.
Hypermedia: Formatted Web documents containing a variety of information types, including
text, image, movie and audio.
Hypertext: Text that contains embedded links (hyperlinks) to other documents.
Hypertext Markup Language (HTML): The
syntax governing the way documents are created so that they can be
interpreted and rendered by Web browsers.
Hypertext Transfer Protocol (HTTP): The
communication protocol used by Web servers.
Indel: An INsertion/DELetion in a DNA or protein sequence.
Internet: The international network of computer networks that connect government,
academic and business institutions.
Internet Inter-ORB Protocol (IIOP): The
communication protocol used by object-request brokers to communicate
over the Internet.
Intranet: A computer network isolated from the Internet by means of a firewall but
that offers similar facilities to the local community (e.g., Web servers,
Intron: The sequence of DNA bases that interrupts the protein-coding sequence of
a gene; these sequences are transcribed into RNA but are edited out
of the message before it is translated into protein.
IP address: Internet Protocol address, a unique identifying number assigned to each
computer on the Internet to allow communication between them.
Java: An object-oriented, network programming language that permits creation
of either stand-alone programs, or applets that are launched via links
on Web pages. In theory, Java programs run on any machine that supports
the Java run-time environment (including PCs and UNIX workstations).
Kilobase (Kb): Unit of
length for DNA fragments equal to 1000 nucleotides.
Library: An unordered collection of clones (i.e., cloned DNA from a particular
organism), generated from genomic DNA or cDNA.
Locus (pl. loci): The
position on a chromosome of a gene or other chromosome marker; also,
the DNA at that position. The use of locus is sometimes restricted
to mean regions of DNA that are expressed.
Megabase (Mb): Unit of
length for DNA fragments equal to 1 million nucleotides.
Midnight Zone: Region of
sequence identity where sequence comparisons fail completely to detect
Model system: A biological system used to represent other, often more complex, systems
in which similar phenomena either do, or are thought to, occur (e.g.,
D. melanogaster, M. musculus, S. cerevisiae, C. elegans, E. coli).
Module: An autonomous folding unit, believed to have arisen largely as a result
of genetic shuffling mechanisms. Modules are contiguous in sequence
and are often used as building blocks to confer a variety of complex
functions on the parent protein. They may be thought of as a subset
of protein domains. Examples of modules include Kringle domains (named
after the shape of a Danish pastry), which are autonomous structural
units found throughout the blood clotting and fibrinolytic proteins;
the ubiquitous DNA-binding zinc fingers, which are small self-folding
units in which zinc is a crucial structural component; and the WW
module (characterized by two conserved tryptophan residues, hence
its name), which is found in a number of disparate proteins, including
dystrophin, the product encoded by the gene responsible for Duchenne muscular dystrophy.
Mosaic: A mosaic protein is a modular protein that, rather than including multiple
tandem repeats of the same module, is composed of a number of different
modules, each conferring different aspects of the parent protein’s
overall functionality (e.g., the calcium-independent latrotoxin receptor,
a mosaic of EGF-like and laminin G-like modules).
Motif: A consecutive string of amino acids in a protein sequence whose general
character is repeated, or conserved, in all sequences in a multiple
alignment at a particular position. Motifs are of interest because
they may correspond to structural or functional elements within the
sequences they characterize.
Mutation: Any change in DNA sequence.
Normalized library: A library generated such that all the genes in the library are represented
at the same frequency.
Nucleotide: A molecule consisting of a nitrogenous base (A, G, T or C in DNA; A, G,
U or C in RNA), a phosphate moiety and a sugar group (deoxyribose
in DNA and ribose in RNA). Thousands of nucleotides are linked to
form a DNA or RNA molecule.
Object-oriented database: A database
in which data are stored as abstract objects, with abstract relationships
between them. The data representations are potentially very varied,
including, for example, character strings, digitised images, tables,
etc. An object may subsume many other objects, and the database allows
retrieval of the objects as a whole. The flexibility of data representation,
and the ability to group objects together, renders object-oriented
databases potentially very powerful systems.
Open reading frame (ORF):
A series of DNA codons, including a 5’ initiation codon and a termination
codon, that encodes a putative or known gene.
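A naive scan for such a series of codons walks each forward reading frame, opens a candidate at an initiation codon and closes it at the next in-frame termination codon. The sketch below assumes ATG as the only initiation codon and TAA, TAG and TGA as termination codons, and looks at the given strand only; `find_orfs` is a hypothetical name, not part of any library.

```python
START = "ATG"                 # assumption: ATG is the only initiation codon
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard termination codons


def find_orfs(dna):
    """Return (start, end, sequence) for each ORF on the forward strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

A candidate that reaches the end of the sequence without a termination codon is discarded, since it does not satisfy the definition above.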
Orthologues: Homologous proteins that perform the same function in different species.
Packet: A self-contained message, or component of a message, comprising address,
control and data signals, which may be transferred as a single entity
within a communications network.
Paralogues: Homologous proteins that perform different but related functions within one organism.
Scores, or weights, used by programs in the computation of sequence alignments;
such scores are normally supplied as parameters to the programs and
thus may be modified by the user.
Phantom indels: Insertions or deletions that arise when physical irregularities in
a sequencing gel cause the reading software either to call a base
too soon, or to miss a base altogether.
Phylogenetic analysis: Analysis of the evolutionary relationships between a species and its predecessors
(e.g., using phylogenetic trees).
Polymerase chain reaction (PCR):
A method for amplifying a DNA base sequence using a heat-stable polymerase
and two primers, one complementary to the (+)-strand at one end of
the sequence to be amplified and the other complementary to the (-)-strand
at the other end. The faithfulness of reproduction of the sequence
is related to the fidelity of the polymerase. Errors may be introduced
into the sequence using this method of amplification.
Post-translational modification: An enzyme-catalysed alteration to a protein made after its translation
from mRNA (e.g., glycosylation, phosphorylation, myristoylation, methylation).
Primary database: A database
that stores biomolecular sequences (protein or nucleic acid) and associated
annotation information (organism, species, function, mutations linked
to particular diseases, functional/structural patterns, bibliographic,
Primary structure: The linear sequence of amino acids in a protein molecule.
Primer: A short polynucleotide chain to which new deoxyribonucleotides can be added
by DNA polymerase.
Probe: A DNA or protein sequence used as a query in a database search.
Profile: A position-specific scoring table that encapsulates the sequence information
within complete alignments. Profiles define which residues are allowed
at given positions; which positions are conserved and which degenerate;
and which positions, or regions, can tolerate insertions. In addition
to data implicit in the alignment, the scoring system may include
evolutionary weights and results from structural studies. Variable
penalties are specified to weight against insertions and deletions
occurring in secondary structure elements.
Prokaryote: An organism lacking a membrane-bound, structurally discrete nucleus and
other subcellular compartments. Bacteria are prokaryotes.
Promoter: A site on DNA to which RNA polymerase will bind and initiate transcription.
Protein: A molecule composed of one or more chains of amino acids in a specific
order; the order is determined by the base sequence of nucleotides
in the gene coding for the protein. Proteins are required for the
structure, function and regulation of cells, tissues and organs, each
protein having a specific role (e.g., hormones, enzymes and antibodies).
Quaternary structure: The arrangement of separate protein chains in a protein molecule with
more than one subunit.
arrangement of separate molecules, such as in protein-protein or protein-nucleic
R-factor: In X-ray crystallography, this parameter is used to express the extent
of agreement between theoretical calculations and the measured data;
the lower the R-factor, the better the fit (R means either Residual
Regular expression: A single consensus expression derived from a conserved region of a sequence
alignment, and used as a characteristic signature of family membership.
Synonymous terms: rule, pattern.
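A pattern of this kind maps naturally onto the regular-expression syntax of most programming languages. The example below uses Python's standard `re` module; the pattern itself is purely illustrative (a spaced cysteine/histidine signature written for this sketch, not taken from any database), as are the test sequences.

```python
import re

# Illustrative pattern only: two cysteines 2-4 residues apart, a hydrophobic
# position, then two histidines 3-5 residues apart.
PATTERN = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def find_signature(sequence):
    """Return (start, matched text) for the first match, or None."""
    match = PATTERN.search(sequence.upper())
    if match is None:
        return None
    return match.start(), match.group()


# Hypothetical sequences: the first contains the signature, the second does not.
print(find_signature("MKCAACAAALAAAAAAAAHAAAHGG"))
print(find_signature("MKAAAAAAAAAAGG"))
```

Because a single expression either matches or does not, patterns like this are strict: one unexpected substitution at a fixed position is enough to miss a true family member.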
Regulatory regions or sequences:
A DNA base sequence that controls gene expression.
Relational database: A database
that uses a relational data model, in which data are stored in two-dimensional
tables. The tables embody different aspects or properties of the data,
but contain overlapping information.
Resolution: The extent to which closely juxtaposed objects can be distinguished as
separate entities. The degree of resolution is dependent on the resolving
power of the system; the fineness of detail with which
objects may be visualized is determined by the wavelength of electromagnetic
radiation used. X-rays, for example, have wavelengths in the range
10^-8 m to 10^-11 m and hence can be used to resolve
structures at the atomic level. Structures are thus said to be determined,
for example, to 3 Å resolution, 5 Å resolution, etc.
RNA (ribonucleic acid): A molecule chemically similar to DNA that plays a central role in
protein synthesis. The structure of RNA is similar to that of DNA
but it is inherently less stable. There are several classes of RNA
molecule, including messenger RNA (mRNA), transfer RNA (tRNA), ribosomal
RNA (rRNA), and other small RNAs, each serving a different purpose.
Secondary database: A database
that contains information derived from primary sequence data, typically
in the form of regular expressions (patterns), fingerprints, blocks,
profiles or Hidden Markov Models. These abstractions represent distillations
of the most conserved features of multiple alignments, such that they
are able to provide potent discriminators of family membership for
newly determined sequences.
Secondary structure: Regions of local regularity within a protein fold (e.g., α-helices,
Sequence alignment: A linear
comparison of amino (or nucleic) acid sequences in which insertions
are made in order to bring equivalent positions in adjacent sequences
into the correct register. Alignments are the basis of sequence analysis
methods, and are used to pinpoint the occurrence of conserved motifs.
Sequence Tagged Site (STS): Short
(200 to 500 basepairs) DNA sequence that has a single occurrence in
the human genome and whose location and base sequence are known. Detectable
by polymerase chain reaction (PCR), STSs are useful for localizing
and orienting the mapping and sequence data reported from many different
laboratories and serve as landmarks on the developing physical map
of the human genome. Expressed sequence tags (ESTs) are STSs derived
Sequencing: Determination of the order of nucleotides (base sequences) in a DNA or
RNA molecule, or the order of amino acids in a protein.
Server: A computer or software system that communicates information via the Internet
to a client.
of DNA fragments randomly generated from a genome.
Silent mutation: A nucleotide
substitution that does not result in an amino acid substitution in
the translation product, because of the redundancy of the genetic code.
Six-frame translation: Translation of a stretch of DNA taking into account three forward translations
and three reverse translations, arising from the three possible reading
frames of an uncharacterized stretch of DNA.
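The three forward and three reverse translations can be sketched by translating the sequence and its reverse complement from offsets 0, 1 and 2. The code assumes the standard genetic code (codons enumerated in T, C, A, G order, `*` for stop) and marks any codon containing an unexpected character as `X`; the function names and the frame labels `+1` to `-3` are illustrative choices.

```python
from itertools import product

# Assumption: the standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons give X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of all six reading frames, keyed +1..+3, -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```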
Sparse matrix: A matrix
in which most of the elements or cells have zero scores.
Splice variants: Proteins of different length that arise through translation of mRNAs that have
not included all available exons in the template DNA.
Subject: A DNA or protein sequence matched by a query sequence in a database search.
Subunit: A distinct polypeptide chain within a protein that may be separated from
other chains (whether identical or different) without breaking covalent bonds.
Super-secondary structure: The arrangement of α-helices and/or β-strands in a protein sequence into
discrete folded structures (e.g., β-barrels, β-α-β units, Greek keys,
Telnet protocol: A method
of communication between remote computers that allows users to log
on and use the distant machines as if physically present at the remote
Tertiary database: A database
derived from information housed in secondary (pattern) databases (e.g.,
the BLOCKS and eMOTIF databases, which draw on data stored within
PROSITE and PRINTS). The value of such resources is in providing a
different scoring perspective on the same underlying data, allowing
the possibility to diagnose relationships that might be missed using
the original implementation.
Tertiary structure: The overall fold of a protein sequence, formed by the packing of its secondary
and/or super-secondary structure elements.
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the
first step in gene expression.
Translation: The process in which the genetic code carried by mRNA directs the synthesis
of proteins from amino acids.
Transmembrane domain: A region
of a protein sequence that traverses a membrane; for α-helical structures,
this requires a span of 20-25 residues.
Transmission Control Protocol/Internet Protocol (TCP/IP): The
rules that govern data transmission between two computers over the
True-negative: A false match that correctly fails to be recognized by a discriminator.
True-positive: A true match correctly
recognized by a discriminator.
Twilight Zone: A zone of
sequence similarity (~0-20% identity) within which alignments appear
plausible to the eye but are not statistically significant (i.e.,
could have arisen by chance).
Uniform Resource Locator (URL): The
address of a source of information. The URL comprises four parts: the
protocol, the host name, the directory path and the file name (e.g.
Up: The status of a computer system when it is operational.
Upstream: Further back in the sequence of a DNA molecule, with respect to the direction
in which the sequence is being read.
Weight matrix: See Profile.
Amino acid residues isolated from neighbouring residues by spurious gaps,
usually the result of over-zealous gap insertion by automatic alignment
World Wide Web: The
information system or network on the Internet that uses HTTP as the
primary communications medium.