Glossary in Bioinformatics (English Version)

Accession number: A unique number or code given to mark the entry of a sequence (protein or nucleic acid) or pattern (regular expression, fingerprint, profile) to a primary or secondary database. Accession numbers should remain static between database updates, and hence in theory provide a mechanism for reliably identifying a particular entry in subsequent database releases.

Algorithm: The logical sequence of steps by which a task can be performed.

Alternatively spliced form: See Splice variant.

Amino acid: The fundamental building block of proteins. There are 20 naturally occurring amino acids in animals and around 100 more found only in plants.

Amphipathic helix: A helix that displays a characteristic charge separation in terms of the distribution of its polar and non-polar residues on opposite faces. The ‘sidedness’ of such helices allows them to sit comfortably at polar/apolar interfaces, such as at the surfaces of globular proteins (where their hydrophilic sides point towards the solvent, and their hydrophobic sides point towards the protein core), or within membranes (where their hydrophobic sides point towards the lipid environment, and their hydrophilic sides point towards the protein interior).

Analogues: Non-homologous proteins that have similar folding architectures, or similar functional sites, which are believed to have arisen through convergent evolution.

Applet: Small software applications loaded from a server via HTML pages.

Assembly: The process of aligning overlapping sequence fragments into a contig, or series of contigs.

Basepair (bp): Any possible pairing between bases in opposing strands of DNA or RNA. Adenine pairs with thymine in DNA, or with uracil in RNA; and guanine pairs with cytosine.
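
The pairing rules above can be sketched in a few lines of Python. This is a minimal illustration for DNA only; the names DNA_PAIRS, complement and reverse_complement are hypothetical and not taken from any particular library.

```python
# Watson-Crick pairing rules for DNA, as stated in the entry above.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(DNA_PAIRS[base] for base in seq.upper())

def reverse_complement(seq):
    """Return the partner strand, read in its own 5' to 3' direction."""
    return complement(seq)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```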

Bioinformatics: The application of computational techniques to the management and analysis of biological information.

Block: An ungapped, aligned motif consisting of sequence segments that are clustered to reduce multiple contributions from groups of highly similar or identical sequences.

Browser: A computer program (commonly known as a Web client) that permits information retrieval from the Internet and the WWW.

cDNA library: A gene library composed of cDNA inserts synthesized from mRNA using reverse transcriptase.

Central dogma: A fundamental principle of molecular biology, first expounded by Francis Crick in 1958, essentially stating that the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, is possible, while transfer from protein to nucleic acid or from protein to protein is impossible. A shorthand expression of the dogma gives the unidirectional relation: DNA>RNA>protein.

Chaperone: A protein that assists the correct non-covalent assembly of folding proteins in vivo; chaperones do not themselves form part of the structures they help to assemble.

Chromosomes: The paired, self-replicating genetic structures of cells that contain the cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.

Client: Any program that interacts with a server (Lynx, Mosaic and Netscape are examples of client software).

Clone: A copied fragment of DNA, maintained in circular form, identical to the template from which it is derived.

Cloning: The process of generating identical copies of a DNA fragment (that may encode a complete gene) from a single template DNA.

Cloning vector: A DNA molecule originating from a virus, a plasmid, or the cell of a higher organism into which another DNA fragment can be integrated without compromising the vector’s capacity for self-replication.

Coding sequence (CDS): A region of DNA or RNA whose sequence determines the sequence of amino acids in a protein.

Command line: The basic level at which a computer prompts the user for input.

Communication protocol: An agreed set of rules for structuring communication between programs (allowing, for example, data exchange between nodes on the Internet).

Complementary DNA (cDNA): DNA that is synthesized from a messenger RNA template using the enzyme reverse transcriptase.

Composite database: A database that amalgamates a number of primary sources, using a set of defined criteria that determine the priority of inclusion of the different sources and the level of redundancy retained (e.g., NRDB is a non-identical composite protein sequence database and OWL is a non-redundant composite).

Conceptual translation: The computational process of interpreting the sequence of nucleotides in mRNA via the genetic code to a sequence of amino acids, which may or may not code for protein.
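
A minimal sketch of conceptual translation in Python, assuming the standard genetic code and a reading frame that starts at the first base. The names CODON_TABLE and translate are illustrative; stopping at the first stop codon is a simplifying assumption.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code: amino acids for all 64 codons in UCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCCUUUUAA"))  # MAF
```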

Consensus sequence: A pseudo-sequence that summarises the residue information contained in a multiple alignment.
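
One simple way to derive such a pseudo-sequence is to take the most common residue in each alignment column. The sketch below assumes this majority rule (other summarising rules are possible) and equal-length aligned rows; the function name consensus is hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Return the most common residue in each column of an alignment.

    Ties are resolved in favour of the residue seen first in the column.
    """
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))

alignment = ["ACGTA",
             "ACGGA",
             "TCGTA"]
print(consensus(alignment))  # ACGTA
```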

Conserved sequence: A sequence of bases in a DNA molecule (or an amino acid sequence in a protein) that has remained essentially unchanged during evolution.

Contig: Sequences of clones, representing overlapping regions of a gene, presented as an assembly or multiple alignment.

Dansylation: A method used to add a dansyl group to free amino groups in protein end-group analysis. The dansyl amino acids, isolated after hydrolysis of the protein, are fluorescent and may be detected in nanomolar quantities.

Diagnostic performance (diagnostic power): A measure of the ability of a discriminator to identify true matches, either in an individual query sequence or in a database.

Discriminator: A mathematical abstraction of a conserved motif, or set of motifs (e.g., a regular expression pattern, a profile or a fingerprint), used to search either an individual query sequence or a full database for the occurrence of that same, or similar, motif(s).

DNA (deoxyribonucleic acid): The molecule that encodes genetic information. DNA is a double-stranded molecule held together by weak bonds between basepairs of nucleotides. The four nucleotides in DNA contain the bases adenine (A), guanine (G), cytosine (C) and thymine (T). In nature, basepairs form only between A and T and between G and C; thus the base sequence of each single strand can be deduced from that of its partner.

DNA sequence: The linear sequence of base pairs, whether in a fragment of DNA, a gene, a chromosome or an entire genome.

Domain: A compact, local, semi-independent folding unit, presumed to have arisen via gene fusion and gene duplication events. Domains need not be formed from contiguous regions of an amino acid sequence: they may be discrete entities, joined only by a flexible linking region of the chain; they may have extensive interfaces, sharing many close contacts; and they may exchange chains with domain neighbours. The combination of domains within a protein determines its overall structure and function.

Down: State of a computer when it is non-operational and hence unavailable for normal use.

Dumb: A dumb terminal is a desktop display device that is not capable of local processing, this being entirely carried out by the central computer. Such terminals (e.g., VT52, VT100, etc.) do not support windowing applications.

E.C. system: The systematic classification and naming of enzymes by the Enzyme Commission, whereby enzymes are denoted by the letters E.C., followed by a set of four numbers separated by dots. The first number indicates one of six main functional divisions (oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases); the following numbers denote different subclasses, as defined by donor group, acceptor, substrate, isomer, etc., the final digit being a serial number for the particular enzyme (e.g., E.C.1.1.1.1 for alcohol dehydrogenase, E.C.3.5.3.15 for protein-arginine deiminase, etc.).

Edman degradation: A method used in sequencing polypeptides, whereby amino acid residues are removed sequentially from the N-terminus by reaction with phenyl-isothiocyanate, to form a phenylthiocarbamyl-peptide (PTC-peptide). This is cleaved in anhydrous acid, releasing a thiazolinone intermediate and the remainder of the peptide.

Enzyme: A protein that acts as a catalyst, speeding the rate at which a biochemical reaction proceeds but not altering the direction or nature of the reaction.

Enzyme Classification System: See E.C. system.

Eukaryote: Cell or organism with membrane-bound, structurally discrete nucleus and other well-developed subcellular compartments. Eukaryotes include all organisms except viruses, bacteria and blue-green algae.

Exon: The protein-coding DNA sequences of a gene.

Expressed Sequence Tag (EST): A partial sequence of a clone, randomly selected from a cDNA library and used to identify genes expressed in a particular tissue. ESTs are used extensively in projects to map the human genome.

Expression profile: The characteristic range of genes expressed at different stages of a cell’s development and functioning.

False-negative: A true match that fails to be recognized by a discriminator.

False-positive: A false match incorrectly recognized by a discriminator.

File Transfer Protocol (FTP): A method of transferring files to remote computers.

Fingerprint: A group of ungapped motifs excised from a sequence alignment and used to build a characteristic signature of family membership by means of iterative searching of a primary (or composite) database.

Firewall: A mechanism for protecting a proprietary computer network (or intranet), allowing internal users to access the Internet, while preventing external Internet users from penetrating the intranet.

Flat-file: A human-readable data-file in a convenient form for interchange of database information. Flat-files may be created as output from relational databases, in a format suitable for loading into other databases.

Folding problem: The problem of determining how a protein folds into its final 3D form given only the information encoded in its primary structure.

Frameshift: An alteration in the reading sense of DNA resulting from an inserted or deleted base, such that the reading frame for all subsequent codons is shifted with respect to the number of changes made (e.g., if a sequence should read UCU-CAA-AGG-UUA, and a single U is added to the beginning, the new sequence would read UUC-UCA-AAG-GUU, etc.). Frameshifts may arise through random mutations, or via errors in reading sequencing output.
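
The example in this entry can be reproduced with a short Python sketch that splits a sequence into triplets before and after a single-base insertion. The helper name codons is hypothetical.

```python
def codons(seq):
    """Split a sequence into complete triplets, discarding any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

original = "UCUCAAAGGUUA"
shifted = "U" + original  # a single U inserted at the beginning

print("-".join(codons(original)))  # UCU-CAA-AGG-UUA
print("-".join(codons(shifted)))   # UUC-UCA-AAG-GUU
```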

Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position of a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule).

Gene duplication: A genetic alteration in which a segment of DNA is repeated. Duplications may appear anywhere, but where the duplicated segment is adjacent to the original one, this is termed a tandem duplication.

Gene expression: The process by which a gene’s coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

Gene families: Groups of closely related genes that encode similar protein products.

Gene product: The protein resulting from the expression of a gene. In some cases, the gene product may be an RNA molecule that is never translated.

Genetic code: The rules that relate the four DNA or RNA bases to the 20 amino acids. There are 64 possible three-base (triplet) sequences, which are known as codons. A single triplet uniquely defines one amino acid, but an amino acid may be coded by as many as six codons. The code is thus said to be degenerate.
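
The figures quoted in this entry can be checked with a short Python sketch. It assumes the standard genetic code, written with DNA bases; the names CODON_TABLE and codons_for are illustrative.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code: amino acids for all 64 codons in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: which codons specify each amino acid?
codons_for = defaultdict(list)
for codon, amino_acid in CODON_TABLE.items():
    codons_for[amino_acid].append(codon)

print(len(CODON_TABLE))                          # 64 codons
print(len([aa for aa in codons_for if aa != "*"]))  # 20 amino acids
print(sorted(codons_for["L"]))                   # the six codons for leucine
print(codons_for["W"])                           # ['TGG'], tryptophan has one codon
```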

Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of basepairs.

Genome projects: Initiatives (often via international collaboration) to map and sequence the entire genomes of particular organisms. The first complete eukaryotic genome to have been sequenced is that of the yeast S. cerevisiae; the human genome is expected to be finished by roughly 2003-2005; and mouse by around 2008. The majority of genomes completed to date are those of prokaryotes.

Helical wheel: A circular graph depicting five turns of helix, around which the residues of a protein sequence are plotted. Helical potential is recognized by the clustering of hydrophilic and hydrophobic residues in distinct polar and non-polar arcs.
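
The plotting positions on such a wheel can be sketched as follows, assuming an ideal α-helix with 3.6 residues per turn, so that consecutive residues are offset by 100 degrees and 18 residues span five turns. The function name and the example peptide are hypothetical.

```python
def helical_wheel_angles(seq, degrees_per_residue=100):
    """Return (residue, angle) pairs giving each residue's position on the wheel."""
    return [(residue, (index * degrees_per_residue) % 360)
            for index, residue in enumerate(seq)]

# Hypothetical 18-residue peptide, used only to show the angular positions.
for residue, angle in helical_wheel_angles("LKELLKKLLEELLKKLLE"):
    print(residue, angle)
```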

Hidden Markov Model (HMM): A probabilistic model consisting of a number of interconnecting states. Like profiles, HMMs encode full domain alignments. They are essentially linear chains of match, delete or insert states: a match state denotes a conserved column in an alignment; an insert state allows insertions relative to match states; and delete states allow match positions to be skipped.

Home page: The HTML document that acts as the first contact point between a browser and a server.

Homology: Being related by the evolutionary process of divergence from a common ancestor. Homology is not a synonym for similarity.

Hybridization: The process of joining two complementary strands of DNA or one each of DNA and RNA to form a double-stranded molecule.

Hydropathy: Having the property of hydrophobicity, a low affinity for water.

Hydropathy profile: A graph in which hydropathy values are calculated within a sliding window and plotted for each residue in a protein sequence. Such graphs show characteristic peaks and troughs, corresponding to the most hydrophobic and hydrophilic regions of the sequence respectively.
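
A minimal sketch of the sliding-window calculation in Python. The Kyte-Doolittle scale and the window sizes are assumed choices (the entry does not name a scale), and this version reports one mean value per complete window rather than padding the sequence ends.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=7):
    """Return the mean hydropathy of each complete window along the sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical sequence: peaks mark hydrophobic stretches, troughs hydrophilic ones.
profile = hydropathy_profile("MKLLIVAGGSDDEK", window=5)
print([round(value, 2) for value in profile])
```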

Hydrophobicity: See Hydropathy.

Hyperlink: An active HTTP cross-reference that links one Web document to another document on the Internet.

Hypermedia: Formatted Web documents containing a variety of information types, including text, image, movie and audio.

Hypertext: Text that contains embedded links (hyperlinks) to other documents.

HyperText Markup Language (HTML): The syntax governing the way documents are created so that they can be interpreted and rendered by Web browsers.

HyperText Transfer Protocol (HTTP): The communication protocol used by Web servers.

INDEL: An INsertion/DELetion in a DNA or protein sequence.

Internet: The international network of computer networks that connect government, academic and business institutions.

Internet Inter-ORB Protocol (IIOP): The communication protocol used by object-request brokers to communicate over the Internet.

Intranet: Computer network isolated from the Internet by means of a firewall but that offers similar facilities to the local community (e.g., Web servers, mail, etc.).

Introns: The sequence of DNA bases that interrupts the protein-coding sequence of a gene; these sequences are transcribed into RNA but are edited out of the message before it is translated into protein.

IP address: Internet Protocol address; a unique identifying number assigned to each computer on the Internet to allow communication between them.

Java: An object-oriented, network programming language that permits creation of either stand-alone programs, or applets that are launched via links on Web pages. In theory, Java programs run on any machine that supports the Java run-time environment (including PCs and UNIX workstations).

Kilobase (Kb): Unit of length for DNA fragments equal to 1000 nucleotides.

Library: An unordered collection of clones (i.e., cloned DNA from a particular organism), generated from genomic DNA or cDNA.

Locus (pl. loci): The position on a chromosome of a gene or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean regions of DNA that are expressed.

Megabase (Mb): Unit of length for DNA fragments equal to 1 million nucleotides.

Midnight Zone: Region of sequence identity where sequence comparisons fail completely to detect structural similarity.

Model system: A biological system used to represent other, often more complex, systems, in which similar phenomena either do, or are thought to, occur (e.g., D. melanogaster, M. musculus, S. cerevisiae, C. elegans, E. coli).

Module: An autonomous folding unit, believed to have arisen largely as a result of genetic shuffling mechanisms. Modules are contiguous in sequence and are often used as building blocks to confer a variety of complex functions on the parent protein. They may be thought of as a subset of protein domains. Examples of modules include Kringle domains (named after the shape of a Danish pastry), which are autonomous structural units found throughout the blood clotting and fibrinolytic proteins; the ubiquitous DNA-binding zinc fingers, which are small self-folding units in which zinc is a crucial structural component; and the WW module (characterized by two conserved tryptophan residues, hence its name), which is found in a number of disparate proteins, including dystrophin, the product encoded by the gene responsible for Duchenne muscular dystrophy.

Mosaic: A mosaic protein is a modular protein that, rather than including multiple tandem repeats of the same module, is composed of a number of different modules, each conferring different aspects of the parent protein’s overall functionality (e.g., the calcium-independent latrotoxin receptor, a mosaic of EGF-like and laminin G-like modules).

Motif: A consecutive string of amino acids in a protein sequence whose general character is repeated, or conserved, in all sequences in a multiple alignment at a particular position. Motifs are of interest because they may correspond to structural or functional elements within the sequences they characterize.

Multiple alignment: See Sequence alignment.

Mutation: Any change in DNA sequence.

Normalised library: cDNA library generated such that all the genes in the library are represented at the same frequency.

Nucleotide: A molecule consisting of a nitrogenous base (A,G, T, or C in DNA; A, G, U or C in RNA), a phosphate moiety and a sugar group (deoxyribose in DNA and ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule.

Object-oriented database: A database in which data are stored as abstract objects, with abstract relationships between them. The data representations are potentially very varied, including, for example, character strings, digitised images, tables, etc. An object may subsume many other objects, and the database allows retrieval of the objects as a whole. The flexibility of data representation, and the ability to group objects together, renders object-oriented databases potentially very powerful systems.

Open reading frame (ORF): A series of DNA codons, including a 5’ initiation codon and a termination codon, that encodes a putative or known gene.
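
A minimal sketch of an ORF scan in Python, assuming ATG as the initiation codon, the three standard termination codons, and forward reading frames only. The function name find_orfs and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) slices of ATG...stop stretches in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiation codon in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next initiation codon
    return orfs

sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 14 ATGAAATTTTAG
```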

Orthologues: Homologous proteins that perform the same function in different species.

Packet: A self-contained message, or component of a message, comprising address, control and data signals, which may be transferred as a single entity within a communications network.

Paralogues: Homologous proteins that perform different but related functions within one organism.

Pattern database: See Secondary database.

Penalties: Scores, or weights, used by programs in the computation of sequence alignments; such scores are normally supplied as parameters to the programs and thus may be modified by the user.
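
The effect of such user-supplied parameters can be illustrated by scoring a fixed alignment under two different gap penalties. The scoring scheme below (match, mismatch and a single linear gap penalty) and its default values are assumptions made for illustration; real programs typically use substitution matrices and separate gap-opening and gap-extension penalties.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length, column by column."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap       # penalise a gap in either row
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

print(score_alignment("ACG-T", "ACGAT"))          # 2
print(score_alignment("ACG-T", "ACGAT", gap=-5))  # -1
```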

Phantom INDELs: Spurious insertions or deletions that arise when physical irregularities in a sequencing gel cause the reading software either to call a base too soon, or to miss a base altogether.

Phylogenetic analysis: Study of the evolutionary relationships between a species and its predecessors (e.g., using phylogenetic trees).

Polymerase chain reaction (PCR): A method for amplifying a DNA base sequence using a heat-stable polymerase and two primers, one complementary to the (+)-strand at one end of the sequence to be amplified and the other complementary to the (-)-strand at the other end. The faithfulness of reproduction of the sequence is related to the fidelity of the polymerase. Errors may be introduced into the sequence using this method of amplification.

Post-translational modification: An enzyme-catalysed alteration to a protein made after its translation from mRNA (e.g., glycosylation, phosphorylation, myristoylation, methylation).

Primary database: A database that stores biomolecular sequences (protein or nucleic acid) and associated annotation information (organism, species, function, mutations linked to particular diseases, functional/structural patterns, bibliographic references, etc.).

Primary structure: The linear sequence of amino acids in a protein molecule.

Primer: A short polynucleotide chain to which new deoxyribonucleotides can be added by DNA polymerase.

Probe: A DNA or protein sequence used as a query in a database search.

Profile: A position-specific scoring table that encapsulates the sequence information within complete alignments. Profiles define which residues are allowed at given positions; which positions are conserved and which degenerate; and which positions, or regions, can tolerate insertions. In addition to data implicit in the alignment, the scoring system may include evolutionary weights and results from structural studies. Variable penalties are specified to weight against insertions and deletions occurring in secondary structure elements.

Prokaryote: An organism lacking a membrane-bound, structurally discrete nucleus and other subcellular compartments. Bacteria are prokaryotes.

Promoter: A site on DNA to which RNA polymerase will bind and initiate transcription.

Protein: A molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function and regulation of cells, tissues and organs, each protein having a specific role (e.g., hormones, enzymes and antibodies).

Quaternary structure: The arrangement of separate protein chains in a protein molecule with more than one subunit.

Quinternary structure: The arrangement of separate molecules, such as in protein-protein or protein-nucleic acid interactions.

R-factor: In X-ray crystallography, this parameter is used to express the extent of agreement between theoretical calculations and the measured data; the lower the R-factor, the better the fit (R means either Residual or Reliability).

Regular expression: A single consensus expression derived from a conserved region of a sequence alignment, and used as a characteristic signature of family membership. Synonymous terms: rule, pattern.
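
As an illustration, a pattern written in PROSITE-style notation can be expressed as a Python regular expression and scanned against a sequence. The N-glycosylation pattern N-{P}-[ST]-{P} is used here as an example, and the test sequence is hypothetical.

```python
import re

# N-{P}-[ST]-{P} as a Python regular expression: asparagine, any residue
# except proline, serine or threonine, then any residue except proline.
PATTERN = re.compile(r"N[^P][ST][^P]")

def find_matches(seq):
    """Return (start, matched_text) for each non-overlapping match in seq."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(seq)]

print(find_matches("MKNGSAPLNPTQ"))  # [(2, 'NGSA')]
```

The second asparagine in the example is followed by proline and is therefore not reported.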

Regulatory regions or sequences: A DNA base sequence that controls gene expression.

Relational database: A database that uses a relational data model, in which data are stored in two-dimensional tables. The tables embody different aspects or properties of the data, but contain overlapping information.

Resolution: The extent to which closely juxtaposed objects can be distinguished as separate entities. The degree of resolution is dependent on the resolving power of the system; the fineness of detail with which objects may be visualized is determined by the wavelength of electromagnetic radiation used. X-rays, for example, have wavelengths in the range 10^-8 m to 10^-11 m and hence can be used to resolve structures at the atomic level. Structures are thus said to be determined, for example, to 3 Å resolution, 5 Å resolution, etc.

RNA (ribonucleic acid): A molecule chemically similar to DNA that plays a central role in protein synthesis. The structure of RNA is similar to that of DNA but it is inherently less stable. There are several classes of RNA molecule, including messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and other small RNAs, each serving a different purpose.

Secondary database: A database that contains information derived from primary sequence data, typically in the form of regular expressions (patterns), fingerprints, blocks, profiles or Hidden Markov Models. These abstractions represent distillations of the most conserved features of multiple alignments, such that they are able to provide potent discriminators of family membership for newly determined sequence.

Secondary structure: Regions of local regularity within a protein fold (e.g., α-helices, β-turns, β-strands).

Sequence alignment: A linear comparison of amino (or nucleic) acid sequences in which insertions are made in order to bring equivalent positions in adjacent sequences into the correct register. Alignments are the basis of sequence analysis methods, and are used to pinpoint the occurrence of conserved motifs.

Sequence Tagged Site (STS): Short (200 to 500 basepairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction (PCR), STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks on the developing physical map of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNA.

Sequencing: Determination of the order of nucleotides (base sequences) in a DNA or RNA molecule, or the order of amino acids in a protein.

Server: A computer or software system that communicates information via the Internet to a client.

Shotgun method: Cloning of DNA fragments randomly generated from a genome.

Silent mutation: A nucleotide substitution that does not result in an amino acid substitution in the translation product, because of the redundancy of the genetic code.

Six-frame translation: Translation of a stretch of DNA taking into account three forward translations and three reverse translations, arising from the three possible reading frames of an uncharacterized stretch of DNA.
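
A minimal sketch of a six-frame translation in Python, assuming the standard genetic code and unambiguous upper-case DNA. The frame labels (+1 to +3, -1 to -3) and function names are illustrative conventions, not a fixed standard.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: amino acids for all 64 codons in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons from the first base, keeping stops as '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Translate the three forward and three reverse-complement reading frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```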

Sparse matrix: A matrix in which most of the elements or cells have zero scores.
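
Because most cells are zero, such a matrix can be stored compactly by recording only the non-zero cells. The sketch below does this for a simple identity comparison of two sequences; the dictionary-of-coordinates layout and the function name are assumptions chosen for illustration.

```python
def sparse_identity_matrix(seq_a, seq_b):
    """Store only the non-zero cells: positions where the two residues are identical."""
    return {(i, j): 1
            for i, a in enumerate(seq_a)
            for j, b in enumerate(seq_b)
            if a == b}

cells = sparse_identity_matrix("GATTACA", "TACA")
print(len(cells), "non-zero cells out of", len("GATTACA") * len("TACA"))
print(cells.get((0, 0), 0))  # cells that were never stored read as zero
```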

Splice variants: Proteins of different length that arise through translation of mRNAs that have not included all available exons in the template DNA.

Subject: A DNA or protein sequence matched by a query sequence in a database search.

Subunit: A distinct polypeptide chain within a protein that may be separated from other chains (whether identical or different) without breaking covalent bonds.

Super-secondary structure: The arrangement of α-helices and/or β-strands in a protein sequence into discrete folded structures (e.g., β-barrels, β-α-β units, Greek keys, etc.).

Telnet protocol: A method of communication between remote computers that allows users to log on and use the distant machines as if physically present at the remote location.

Tertiary database: A database derived from information housed in secondary (pattern) databases (e.g., the BLOCKS and eMOTIF databases, which draw on data stored within PROSITE and PRINTS). The value of such resources is in providing a different scoring perspective on the same underlying data, allowing the possibility to diagnose relationships that might be missed using the original implementation.

Tertiary structure: The overall fold of a protein sequence, formed by the packing of its secondary and/or super-secondary structure elements.

Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression.

Translation: The process in which the genetic code carried by mRNA directs the synthesis of proteins from amino acids.

Transmembrane domain: A region of a protein sequence that traverses a membrane; for α-helical structures, this requires a span of 20-25 residues.

Transmission Control Protocol/Internet Protocol (TCP/IP): The rules that govern data transmission between two computers over the Internet.

True-negative: A false match that correctly fails to be recognized by a discriminator.

True-positive: A true match correctly recognized by a discriminator.
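
The four outcomes defined in this glossary (true-positive, false-positive, true-negative and false-negative) can be tallied with a short Python sketch. The function name and the example data are hypothetical; "predicted" marks sequences the discriminator matched and "actual" marks genuine family members.

```python
def classify(predicted, actual):
    """Count the four possible outcomes of a discriminator over paired boolean lists."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for hit, member in zip(predicted, actual):
        if member:
            counts["TP" if hit else "FN"] += 1  # true match: recognized or missed
        else:
            counts["FP" if hit else "TN"] += 1  # false match: recognized or rejected
    return counts

actual = [True, True, True, False, False]
predicted = [True, True, False, True, False]
print(classify(predicted, actual))  # {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```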

Twilight Zone: A zone of sequence similarity (~0-20% identity) within which alignments appear plausible to the eye but are not statistically significant (i.e., could have arisen by chance).

Uniform Resource Locator (URL): The address of a source of information. The URL comprises four parts: the protocol, the host name, the directory path and the file name (e.g., http://wwww.biochem.ucl.ac.uk/bsm/dbbrowser/prefacefrm.html).
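
The four parts can be separated with Python's standard library, as in the minimal sketch below. The function name url_parts is hypothetical and the example address is a placeholder, not a real resource.

```python
import posixpath
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into protocol, host name, directory path and file name."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "directory": directory,
        "file": filename,
    }

# Placeholder address used purely for illustration.
print(url_parts("http://example.org/docs/guide/index.html"))
```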

Up: The status of a computer system when it is operational.

Upstream: Further back in the sequence of a DNA molecule, with respect to the direction in which the sequence is being read.

Weight matrix: See Profile.

Widow: Amino acid residues isolated from neighbouring residues by spurious gaps, usually the result of over-zealous gap insertion by automatic alignment programs.

World Wide Web: The information system or network on the Internet that uses HTTP as the primary communications medium.