Zoology in the Classroom: A - Z Bioinformatics

Accession number - An identifier supplied by the
curators of the major biological databases upon submission of a novel entry that
uniquely identifies that sequence (or other) entry.

Active site -
The amino acid residues at the catalytic site of
an enzyme. These residues provide the binding and activation energy needed to
place the substrate into its transition state and bridge the energy barrier of
the reaction undergoing catalysis.

Agents -
Independent, autonomous, software modules that
can search the Internet for data or content pertinent to a particular
application, such as a gene, protein, or biological system.

Algorithm -
A series of steps defining a procedure or formula
for solving a problem, that can be coded into a programming language and
executed. Bioinformatics algorithms typically are used to process, store,
analyze, visualize and make predictions from biological data.

Alignment -
The result of a comparison of two or more gene or
protein sequences in order to determine their degree of base or amino acid
similarity. Sequence alignments are used to determine the similarity, homology,
function or other degree of relatedness between two or more genes or gene
products.

Allele -
A given form of a gene that occupies a specific
position or locus on a chromosome. Variant forms of genes occurring at the same
locus are said to be alleles of one another.

Alternative splicing - The joining of a gene's exons in different
combinations during mRNA splicing, so that a single gene can give rise to
several mRNAs and hence several protein variants. It is common in higher
organisms.

Alu family - A common set
of dispersed DNA sequences found throughout the human genome; each is about 300
bases long and they are repeated at least 500,000 times. Alu sequences are
thought to derive from the 7SL RNA gene and to have spread through primate
genomes by retrotransposition over tens of millions of years.

Amino acid -
One of the 20 chemical building blocks that are
joined by amide (peptide) linkages to form a polypeptide chain of a
protein.

Analogy -
Reasoning by which the function of a novel gene
or protein sequence may be deduced from comparisons with other gene or protein
sequences of known function. Identifying analogous or homologous genes via
similarity searching and alignment is one of the chief uses of Bioinformatics.
(See also alignment, similarity search.)

Annotation -
A combination of comments, notations, references,
and citations, either in free format or utilizing a controlled vocabulary, that
together describe all the experimental and inferred information about a gene or
protein. Annotations can also be applied to the description of other biological
systems. Batch, automated annotation of bulk biological sequence is one of the
key uses of Bioinformatics tools.

Anticodon -
The triplet of contiguous bases on tRNA that
binds to the codon sequence of nucleotides on mRNA. Example: the mRNA codon
GGG, which codes for glycine, pairs with the tRNA anticodon CCC.

Antigen -
Any foreign molecule that stimulates an immune
response in a vertebrate organism. Many antigens are proteins such as the
surface proteins of foreign organisms.

Antisense -
DNA or RNA composed of the complementary sequence
to the target DNA/RNA. Also used to describe a therapeutic strategy that uses
antisense DNA or RNA sequences to bind specific gene DNA sequences or mRNA
implicated in disease and inhibit their expression by physically blocking
them.

Assay -
A method for measuring a biological activity.
This may be enzyme activity, binding affinity, or protein turnover. Most assays
utilize a measurable parameter such as color, fluorescence or radioactivity to
correlate with the biological activity.

Assembly -
Compilation of overlapping sequences from one or
more related genes that have been clustered together based on their degree of
sequence identity or similarity. Sequence assembly may be used to piece together
"shotgun" sequencing fragments (see shotgun sequencing) based upon overlapping
restriction enzyme digests, or may be used to identify and index novel genes
from "single-pass" cDNA sequencing efforts.

Autoradiography - A method used to locate
radioisotope-labeled materials which have been separated in gels or are present
in blots. The location of the radiolabeled material is determined by overlaying
the test material with a photographic film that is sensitive to the
radioisotope.

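Backtracking algorithm - A search procedure that builds a candidate solution
step by step, abandons (backtracks from) any partial path that cannot lead to a
solution, and tries the next alternative until a solution is found or all paths
are exhausted.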
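
A minimal sketch of backtracking, assuming a toy problem that is not part of
the original entry: choose fragment lengths (positive integers) that add up to
a target length. The function name find_subset is hypothetical.

```python
def find_subset(values, target):
    """Return a list of values summing to target, or None if none exists.

    Assumes all values are positive integers.
    """
    chosen = []

    def explore(start, remaining):
        if remaining == 0:
            return True                   # the current path is a solution
        for i in range(start, len(values)):
            if values[i] > remaining:
                continue                  # prune: this choice overshoots
            chosen.append(values[i])      # tentatively extend the path
            if explore(i + 1, remaining - values[i]):
                return True
            chosen.pop()                  # backtrack: undo the last choice
        return False

    return list(chosen) if explore(0, target) else None


result = find_subset([300, 500, 200, 700], 900)
print(result)
```

The pop() call is the backtracking step: the search returns to the previous
decision point and tries the next alternative.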

Bacterial artificial chromosome
(BAC) - Cloning vector
that can incorporate large fragments of DNA.

Beta sheet -
A three dimensional arrangement taken up by
polypeptide chains that consists of alternating strands linked by hydrogen
bonds. The alternating strands together form a sheet that is frequently twisted.
One of the secondary structural elements characteristic of proteins.

BioCyc - The BioCyc collection of Pathway/Genome Databases (DBs) provides electronic reference sources on the pathways and genomes of different organisms.

Bioinformatics - The field of endeavor that
relates to the collection, organization and analysis of large amounts of
biological data using networks of computers and databases (usually with
reference to the genome project and DNA sequence information).

Bivalent -
Having two binding sites; having 2 free electrons
available for binding.

BRENDA - The Comprehensive Enzyme Information System.

Carboxyl group - The -COOH functional group,
acidic in nature, found in all amino acids.

cDNA (complementary DNA) - A DNA strand copied from mRNA
using reverse transcriptase. A cDNA library represents all of the expressed DNA
in a cell.

cDNA library
- A set of DNA fragments prepared from the total
mRNA obtained from a selected cell, tissue or organism.

Chimeric clone - A cloning artifact created by a
foreign gene being inserted into a vector in an incorrect orientation resulting
in the expression of a protein consisting of a fusion of two different gene
products.

Chromat -
Data file output from most popular DNA
sequencers. Chromat files consist of the fluorescent traces generated by the
sequencer for each of the four chemical bases, A, C, G, and T, together with the
sequence and measures of the error in the traces at each sequence
position.

Chromatin -
The chromosome as it appears in its condensed
state, composed of DNA and associated proteins (mainly histones).

Chromosome -
The structure in the cell nucleus that contains
all of the cellular DNA together with a number of proteins that compact and
package the DNA.

Clone -
A population of genetically identical cells or
DNA molecules.

Cluster -
The grouping of similar objects in a
multidimensional space. Clustering is used for constructing new features which
are abstractions of the existing features of those objects. The quality of the
clustering depends crucially on the distance metric in the space. In
bioinformatics, clustering is performed on sequences, high-throughput expression
and other experimental data. Clusters of partial or complete gene sequences can
be used to identify the complete (contiguous) sequence and to better identify
its function. Clustering expression data enables the researcher to discern
patterns of co-regulation in groups of genes.

Coding regions (CDS) - The portion of a genomic
sequence bounded by start and stop codons that identifies the sequence of the
protein being coded for by a particular gene.

Codon -
A sequence of three adjacent nucleotides that
designates a specific amino acid or a start/stop site for
translation.

COG - Clusters of Orthologous Groups of proteins were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Combinatorial chemistry -
The use of chemical methods to generate
all possible combinations of chemicals starting with a subset of compounds. The
building blocks may be peptides, nucleic acids or small molecules. The libraries
of compounds formed by this methodology are used to probe for new pharmaceutical
reagents (see high-throughput screening).

Complementarity determining region
(CDR) - The
hypervariable regions of an antibody molecule, consisting of three loops from
the heavy chain and three from the light chain, that together form the
antigen-binding site.

Complexity (of gene
sequence) - The term
"low complexity sequence" may be thought of as synonymous with regions of
locally biased amino acid composition. In these regions, the sequence
composition deviates from the random model that underlies the calculation of the
statistical significance (P-value) of an alignment. Such alignments among low
complexity sequences are statistically but not biologically significant, i.e.,
one cannot infer homology (common ancestry) or functional similarity.

Configuration
- In software, the complete ordering and
description of all parts of a software or database system. Configuration
management is the use of software to identify, inventory and maintain the
component modules that together comprise one or more systems or
products.

Conformation
- The precise three-dimensional arrangement of
atoms and bonds in a molecule describing its geometry and hence its molecular
function.

Consensus sequence - A single sequence delineated
from an alignment of multiple constituent sequences that represents a "best fit"
for all those sequences. A "voting" or other selection procedure is used to
determine which residue (nucleotide or amino acid) is placed at a given position
in the event that not all of the constituent sequences have the identical
residue at that position.
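
The "voting" procedure can be sketched as a simple majority vote per alignment
column. This is an illustrative sketch only: the function name and the
alphabetical tie-break are assumptions, and real tools may weight sequences or
emit ambiguity codes instead.

```python
from collections import Counter


def consensus(aligned):
    """Majority-vote consensus of equal-length, already aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        best = max(counts.values())
        # Ties are broken alphabetically; this is an arbitrary choice.
        result.append(min(r for r, c in counts.items() if c == best))
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGT"]))
```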

Constitutive synthesis (expression)
- Synthesis of mRNA
and protein at an unchanging or constant rate regardless of a cell's
requirements (see housekeeping genes).

Contig -
A length of contiguous sequence assembled from
partial, overlapping sequences, generated from a "shotgun" sequencing project.
Contigs are typically created computationally, by comparing the overlapping ends
of several sequencing reads generated by restriction enzyme digestion of a
segment of genomic DNA. The creation of contigs in the presence of sequencing
errors, ambiguities and the presence of repeats is one of the most
computationally challenging aspects of the role of Bioinformatics in genome
analysis.

Convergence -
The end-point of any algorithm that uses
iteration or recursion to guide a series of data processing steps. An algorithm
is usually said to have reached convergence when the difference between the
computed and observed steps falls below a pre-defined threshold.
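
As a hedged illustration, the sketch below iterates Newton's method for a
square root and stops when two successive estimates differ by less than a
pre-defined threshold. The example problem, the function name and the default
tolerance are assumptions chosen for illustration.

```python
def newton_sqrt(n, tolerance=1e-12, max_iterations=100):
    """Approximate the square root of n, iterating until convergence."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0.0
    estimate = float(n)
    for _ in range(max_iterations):
        updated = 0.5 * (estimate + n / estimate)
        # Convergence test: the change between steps is below the threshold.
        if abs(updated - estimate) < tolerance:
            return updated
        estimate = updated
    raise RuntimeError("did not converge within max_iterations")


print(newton_sqrt(2.0))
```

The max_iterations guard matters in practice, because an iterative algorithm
is not guaranteed to reach its threshold.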

Cosmids -
DNA vectors that allow the insertion of long
fragments of DNA (up to 50 kbases).

Crystal structure - Term used to describe the high
resolution molecular structure derived by X-ray crystallographic analysis of
protein or other biomolecular crystals.

C value -
The characteristic amount of DNA in the
haploid genome of a species.

C value paradox - The apparent lack of correlation between an organism's C value and
its level of morphological complexity.

Data Mining - The ability to query very
large databases in order to satisfy a hypothesis ("top-down" data mining); or to
interrogate a database in order to generate new hypotheses based on rigorous
statistical correlations ("bottom-up" data mining).

Data Processing - Data processing is defined as
the systematic performance of operations upon data such as handling, merging,
sorting, and computing. The semantic content of the original data should not be
changed, but the semantic content of the processed data may be
changed.

Data Warehouses - Vast
arrays of heterogeneous (biological) data, stored within a single logical data
repository, that are accessible to different querying and manipulation
methods.

Database - Any file system by which data
gets stored following a logical process.

DDBJ - DNA
Data Bank of Japan

Deconvolution - Mathematical procedure to
separate out the overlapping effects of molecules such as mixtures of compounds
in a high-throughput screen, or mixtures of cDNAs in a high density
array.

Deletion - A chromosomal alteration in
which a portion of the chromosome or the underlying DNA is
lost.

Deletion mapping -
Process in which different deletions in a region of DNA are
created and used to map the functionally critical areas of that DNA. For example, the
minimal region of DNA required for a test promoter can be ascertained by
systematic deletions in the region of interest.

Dendrogram
- A
graphical procedure for representing the output of a hierarchical clustering
method. A dendrogram is strictly defined as a binary tree with a distinguished
root, that has all the data items at its leaves. Conventionally, all the leaves
are shown at the same level of the drawing. The ordering of the leaves is
arbitrary, as is their horizontal position. The heights of the internal nodes
may be arbitrary, or may be related to the metric information used to form the
clustering.

Dicentric - A structurally abnormal
chromosome with two centromeres.

Dideoxynucleotide - A modified nucleotide that lacks
the 3' hydroxyl group and so terminates strand synthesis when incorporated into
a polynucleotide.

Dimer -
A composite molecule formed by the binding of two
molecules.

Directed
Sequencing - Successively sequencing DNA from adjacent stretches of
chromosome.

Directed Shotgun
Approach - A genome sequencing strategy that combines random shotgun
sequencing with a genome map, the latter used to aid assembly of the master
sequence.

Directional
Selection - A selective process that changes the frequency of an allele in a
specific direction, either toward fixation or toward elimination.

Distance
Matrix - A
table showing the evolutionary distances between all pairs of nucleotide
sequences in a dataset.
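
A minimal sketch of such a table, assuming the simplest possible measure: the
uncorrected p-distance (fraction of differing positions) between sequences
that are already aligned and of equal length. Real analyses usually apply a
substitution model to correct for multiple changes at one site; the function
names here are hypothetical.

```python
def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Square table of pairwise p-distances, in the order given."""
    return [[p_distance(a, b) for b in seqs] for a in seqs]


for row in distance_matrix(["ACGT", "ACGA", "TCGA"]):
    print(row)
```

The matrix is symmetric with zeroes on the diagonal, so only one triangle
needs to be stored in practice.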

Distance Method
- A
rigorous mathematical approach to alignment of nucleotide
sequences.

Disulphide Bond
- Covalent link formed between the sulphur atoms
of two different cysteine residues in a protein. Important in maintaining the
folded structure of a protein, and also for linking different proteins in a
complex.

DNA (DeoxyriboNucleic Acid) - The chemical that forms the
basis of the genetic material in virtually all organisms. DNA is composed of the
four nitrogenous bases Adenine, Cytosine, Guanine, and Thymine, which are
covalently bonded to a backbone of deoxyribose-phosphate to form a DNA strand.
Two complementary strands (where all Gs pair with Cs and As with Ts) form a
double helical structure which is held together by hydrogen bonding between the
cognate bases.
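
The pairing rule (G with C, A with T) can be sketched in a few lines. Because
the two strands run in opposite directions, the partner strand is
conventionally written as the reverse complement. The function name is an
assumption for illustration, and only the four unambiguous bases are handled.

```python
# Translation table implementing the base-pairing rule A-T, C-G.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```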

DNA Fingerprinting - A technique for identifying
human individuals based on a restriction enzyme digest of tandemly repeated DNA
sequences that are scattered throughout the human genome, but are unique to each
individual.

DNA Microarrays
- The deposition of oligonucleotides or cDNAs
onto an inert substrate such as glass or silicon. Thousands of molecules may be
organized spatially into a high-density matrix. These DNA chips may be probed to
allow expression monitoring of many thousands of genes simultaneously. Uses
include study of polymorphisms in genes, de novo sequencing or molecular
diagnosis of disease.

DNA Polymerase -
An enzyme that catalyzes the synthesis of DNA
from a DNA template given the deoxyribonucleotide precursors.

DNA Probes -
Short single stranded DNA molecules of specific
base sequence, labeled either radioactively or immunologically, that are used to
detect and identify the complementary base sequence in a gene or genome by
hybridizing specifically to that gene or sequence.

DNA Sequencing -
The technique in which the specific sequence of
bases forming a particular DNA region is deciphered.

DNase (Deoxyribonuclease) - One of a series of enzymes that
can digest DNA.

Domain (Protein)
- A region of special biological interest within
a single protein sequence. However, a domain may also be defined as a region
within the three-dimensional structure of a protein that may encompass regions
of several distinct protein sequences that accomplishes a specific function. A
domain class is a group of domains that share a common set of well-defined
properties or characteristics.

Drug -
An agent that affects a biological process.
Specifically, a molecule whose molecular structure can be correlated with its
pharmacological activity.

Drug Discovery Cycle - The cycle of events required to
develop a new drug. Typically this involves research, preclinical testing and
clinical development, and can take from 5 to 12 years.

Dynamic Programming - A type of algorithm widely used for constructing sequence
alignments and for evaluating all possible candidate gene structures.
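
As one illustration, the sketch below computes a global alignment score in the
style of the Needleman-Wunsch algorithm. The scoring values (match +1,
mismatch -1, linear gap -2) and the function name are assumptions chosen for
the example, not values given in this glossary.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell reuses the solutions of three smaller subproblems, which is what
makes the procedure dynamic programming rather than brute-force enumeration.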

EcoCyc - EcoCyc is
a scientific database for the bacterium Escherichia coli K-12
MG1655.

European Bioinformatics
Institute
(EBI) - An outstation of the European Molecular Biology Laboratory located
in Hinxton, England, near Cambridge University.

Electronic Northerns - The use of an electronic
database of cDNA sequences (or probes derived from them) in order to measure the
relative levels of mRNAs expressed in different cells or tissues. An example of
the use of an electronic Northern might be to identify the differences in the
genes expressed in prostate cancer and those in benign prostate hyperplasia, by
subtracting the database of one from the other and seeing which cDNAs
remain.

Electrophoresis
- The use of an external electric field to
separate large biomolecules on the basis of their charge by running them
through acrylamide or agarose gels.

Electrostatic Interactions - Ionic bonds that form between charged chemical groups.

Electrostatic Surface Potential -
The electrostatic charges on the
surface of a protein, often indicative of the protein's functional regions.

EMBL - European Molecular Biology
Laboratory. The European Molecular Biology
Laboratory in Heidelberg, Germany, maintains the EMBL database, one of the major
public DNA sequence databases.

EMBnet - European Molecular Biology
Network. EMBnet was established in 1988,
and provides services including local molecular databases and software for
molecular biologists in Europe.

EMBOSS - European Molecular Biology Open
Software Suite. A suite of freely available
programs and libraries for molecular biology and bioinformatics.

EMP - The Enzymes and Metabolic Pathways database (EMP) is an electronic source of biochemical data covering enzymology and metabolism.

Enhancers -
DNA sequences that can greatly increase the
transcription rates of genes even though they may be far upstream or downstream
from the promoter they stimulate.

Entrez - Entrez is the integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.

Enzyme -
A class of proteins that are capable of
catalyzing chemical reactions (the making or breaking of chemical bonds). They
do so by orienting their substrates into a suitable geometry in a particular
location (the active site) where electrophilic or nucleophilic amino acid
residues can participate in the reaction. Enzymes are protein catalysts that
speed up chemical reactions that would otherwise be prohibitively slow under
physiological conditions.

Epigenomics -
The study of complex expression networks or
linkages both spatially (within the body) and temporally (at different times in
development).

Equilibrium constant - Value that describes the
equilibrium state of the reversible reaction between two molecular
species.

Exon -
The region of DNA within a gene that codes for a
polypeptide chain or domain. Typically a mature protein is composed of several
domains coded by different exons within a single gene.

Expasy - The Expert Protein Analysis System is a proteomics server of the Swiss Institute of Bioinformatics (SIB) dedicated to the analysis of protein sequences and structures as well as 2-D PAGE.

Expressed Sequence Tags (ESTs) - A small sequence from an
expressed gene that can be amplified by PCR. ESTs act as physical markers for
cloning and full length sequencing of the cDNAs of expressed genes. Typically
identified by purifying mRNAs, converting to cDNAs, and then sequencing a
portion of the cDNAs.

Expression (gene or protein) - A measure of the presence,
amount, and time-course of one or more gene products in a particular cell or
tissue. Expression studies are typically performed at the RNA (mRNA) or protein
level in order to determine the number, type, and level of genes that may be
up-regulated or down-regulated during a cellular process, in response to an
external stimulus, or in sickness or disease. Gene chips and proteomics now
allow the study of expression profiles of sets of genes or even entire
genomes.

Expression profile - The level and duration of
expression of one or more genes, selected from a particular cell or tissue type,
generally obtained by a variety of high-throughput methods, such as sample
sequencing, serial analysis, or microarray-based detection.

FASTA - A sequence in
FASTA format begins with a single-line description, followed by lines of
sequence data. The description line is distinguished from the sequence data by a
greater-than (">") symbol in the first column. It is recommended that all
lines of text be shorter than 80 characters in length.
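
A minimal sketch of reading this format, assuming the text is already in
memory as a string. The function name parse_fasta is hypothetical; established
libraries provide more robust parsers.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the ">" marker
        elif header is not None:
            chunks.append(line)           # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1 example\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))
```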

Fingerprint - A fingerprint is a set of
motifs used to predict the occurrence of similar motifs, in either an individual
sequence or in a database. Fingerprints are refined by iterative scanning of a
composite protein sequence database. A composite or multiple-motif fingerprint
contains a number of aligned motifs taken from different parts of a multiple
alignment. True family members are then easy to identify by virtue of
possessing all elements of the fingerprint, while subfamily members may be
identified by possessing only part of it.

Flow Cytometry - A method for the separation of chromosomes by detecting the
light-absorbing or fluorescing properties of cells or subcellular fractions
(i.e., chromosomes) passing in a narrow stream through a laser beam.

Flow Karyotyping - Use of flow cytometry to analyze and separate chromosomes on the
basis of their DNA content.

FLpter
Value - The
unit used in FISH to describe the position of a hybridization signal relative to
the end of the short arm of the chromosome.

Fluid-Mosaic - Widely accepted model of the plasma membrane in which proteins
(the mosaic) are embedded in lipids (the fluid).

Fluorescence In situ
Hybridization (FISH) - A
mapping technique that uses fluorescent tags to identify the locations of
specific markers on chromosomes.

FlyBase - A database of the Drosophila genome.

Frameshift -
A deletion or insertion (including duplication) of a number of bases that is
not a multiple of three, which causes the reading frame of a structural gene to
shift from the normal series of triplets. A simple base substitution does not
shift the frame.

Functional genomics - The use of genomic information
to delineate protein structure, function, pathways and networks. Function may be
determined by "knocking out" or "knocking in" expressed genes in model organisms
such as worm, fruitfly, yeast or mouse.

Fusion protein -
The protein resulting from the genetic joining
and expression of 2 different genes (see chimeric clone).

Fuzzy Logic - A superset of Boolean logic dealing with the concept of partial
truth, in which numbers from 0 to 1 are used as truth values between "completely
false" (0) and "completely true" (1).
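
One common choice of fuzzy operators (min for AND, max for OR, and 1 - x for
NOT) can be sketched as follows. Other operator families exist, so treat this
as an illustrative assumption rather than the only definition.

```python
def fuzzy_and(a, b):
    """Degree to which both statements are true."""
    return min(a, b)


def fuzzy_or(a, b):
    """Degree to which at least one statement is true."""
    return max(a, b)


def fuzzy_not(a):
    """Degree to which the statement is false."""
    return 1.0 - a


print(fuzzy_and(0.75, 0.5), fuzzy_or(0.75, 0.5), fuzzy_not(0.25))
```

With truth values restricted to 0 and 1, these operators reduce to ordinary
Boolean logic, which is the sense in which fuzzy logic is a superset of it.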

Gaps
(affine gaps) - A gap is defined as any
maximal, consecutive run of spaces in a single string of a given alignment. Gaps
help create alignments that better conform to underlying biological models and
more closely fit patterns that one expects to find in meaningful alignment. The
idea is to take into account the number of continuous gaps and not only the number
of spaces when calculating an alignment. Affine gaps contain a component for gap
insertion and a component for gap extension, where the extension penalty is
usually much lower than the insertion penalty. This mimics biological reality as
multiple gaps would imply multiple mutations, but a single mutation can lead to
a long gap quite easily.
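
The effect of an affine penalty can be sketched with a small function. The
convention used here (the opening penalty covers the first position and each
further position adds the extension penalty) and the default values 10 and 1
are assumptions; alignment programs differ in both.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of the given length under an affine model."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_penalty(6)
six_short_gaps = 6 * affine_gap_penalty(1)
print(one_long_gap, six_short_gaps)
```

Under these assumed values a single gap of six positions is penalized far less
than six separate one-position gaps, which is the behaviour the entry
describes.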

Gap Extension Penalty - The penalty in an alignment score of extending a gap another
character.

Gap Opening Penalty - The penalty in an alignment score for opening a gap, charged once per gap regardless of its length.

Gap penalties -
The penalty applied to a similarity score for the
introduction of an insertion or deletion gap, the extension of a gap, or both.
Gap penalties are usually subtracted from a cumulative score being determined
for the comparison of two or more sequences via an optimization algorithm that
attempts to maximize that score.

Gap Score - The score assigned to a gap.

Gapped Alignment - An alignment in which gaps are permitted.

Gel electrophoresis - A technique by which molecules
are separated by size or charge by passing them through a gel under the
influence of an external electric field.

Gene Index -
A listing of the number, type, label and sequence
of all the genes identified within the genome of a given organism. Gene indices
are usually created by assembling overlapping EST sequences into clusters, and
then determining if each cluster corresponds to a unique gene. Methods by which
a cluster can be identified as representing a unique gene include identification
of long open reading frames (ORFs), comparison to genomic sequence, and
detection of SNPs or other features in the cluster that are known to exist in
the gene.

GenBank -
Data bank of genetic sequences operated by a
division of the National Institutes of Health.

Gene -
Classically, a unit of inheritance. In practice,
a gene is a segment of DNA on a chromosome that encodes a protein and all the
regulatory sequences (promoter) required to control expression of that
protein.

Gene chips -
The covalent attachment of oligonucleotides or
cDNA directly onto a small glass or silicon chip in organized arrays. Over
50,000 different DNA fragments can be presented on a single chip providing a
high throughput parallel method of probing gene expression, genotype or gene
function.

Gene
families - Subsets
of genes containing homologous sequences which usually correlate with a common
function.

Gene library -
A collection of cloned DNA fragments created by
restriction endonuclease digestion that represent part or all of an organism's
genome.

Gene product -
The product, either RNA or protein, that results
from expression of a gene. The amount of gene product reflects the activity of
the gene.

Gene therapy -
The use of genetic material for therapeutic
purposes. The therapeutic gene is typically delivered using recombinant virus or
liposome based delivery systems.

Genetic code -
The mapping of all possible codons into the 20
amino acids including the start and stop codons.
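
A compact sketch of this mapping for the standard genetic code, written for
DNA codons (T rather than U). The one-letter amino acid string lists the 64
codons with the bases in the order T, C, A, G; "*" marks a stop codon. The
function name translate is an assumption, and organisms or organelles with
variant codes are not covered.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Map each of the 64 codons (TTT, TTC, TTA, ...) to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGGGTAA"))
```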

Genetic marker -
Any gene that can be readily recognized by its
phenotypic effect, and which can be used as a marker for a cell, chromosome, or
individual carrying that gene. Also, any detectable polymorphism used to
identify a specific gene.

Genome -
The complete genetic content of an
organism.

Genomic DNA (sequence) - DNA sequence typically obtained
from mammalian or other higher-order species, which includes both intron and
exon sequence (coding sequence), as well as non-coding regulatory sequences such
as promoter, and enhancer sequences.

Genomics - The analysis of the entire genome of a chosen
organism.

Genotype -
Strictly, all of the genes possessed by an individual. In practice, the
particular alleles present in a specific genetic locus.

Hairpin - A double-helical region in a
single DNA or RNA strand formed by the hydrogen-bonding between adjacent inverse
complementary sequences to form a hairpin shaped structure.

Helix-Turn-Helix Motif (HTH) -
A common structural motif for binding of a
protein to DNA that consists of two alpha-helices connected by a short
nonhelical segment called a "turn".

Heterodimer -
Protein composed of 2 different chains or
subunits.

Heteroduplex -
Hybrid structure formed by the annealing of two
DNA strands (or an RNA and DNA) that have sufficient complementarity in their
sequence to allow hydrogen bonding.

Hidden Markov model (HMM) - A joint statistical model for an
ordered sequence of variables. The result of stochastically perturbing the
variables in a Markov chain (the original variables are thus "hidden"), where
the Markov chain has discrete variables which select the "state" of the HMM at
each step. The perturbed values can be continuous and are the "outputs" of the
HMM. A Hidden Markov Model is equivalently a coupled mixture model where the
joint distribution over states is a Markov chain. Hidden Markov models are
valuable in bioinformatics because they allow a search or alignment algorithm to
be trained using unaligned or unweighted input sequences; and because they allow
position-dependent scoring parameters such as gap penalties, thus more
accurately modeling the consequences of evolutionary events on sequence
families.
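
To make the idea concrete, the sketch below scores a short DNA sequence under
a toy two-state HMM using the forward algorithm. The model is entirely
hypothetical: the state names and every probability are invented for
illustration, and real profile HMMs for sequence families are much larger.

```python
def forward_probability(observations, states, start, trans, emit):
    """Probability of a non-empty observed sequence under the HMM."""
    # alpha[s] = P(symbols so far, hidden state is s at current position)
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical model: AT-rich versus GC-rich regions of a sequence.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
         "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
EMIT = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

print(forward_probability("GC", STATES, START, TRANS, EMIT))
```

The hidden state is never observed directly; the forward algorithm sums over
all possible state paths instead of enumerating them.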

High-throughput screening - The method by which very large
numbers of compounds are screened against a putative drug target in either
cell-free or whole-cell assays. Typically, these screenings are carried out in
96 well plates using automated, robotic station based technologies or in
higher-density array ("chip") formats.

HLA complex -
Another name for the MHC in humans; refers to the
"Human Leukocyte Antigen" complex located on chromosome 6.

Homeobox -
A highly conserved region in a homeotic gene
composed of 180 bases (60 amino acids) that specifies a protein domain (the
homeodomain) that serves as a master genetic regulatory element in cell
differentiation during development in species as diverse as worms, fruitflies,
and humans.

Homeodomain -
A 60 amino-acid protein domain coded for by the
homeobox region of a homeotic gene.

Homeotic gene -
A gene that controls the activity of other genes
involved in the development of a body plan. Homeotic genes have been found in
organisms ranging from plants to humans.

Homology -
(1) Two or more biological species, systems or
molecules that share a common evolutionary ancestor. (2) Two or more gene or
protein sequences that share a significant degree of similarity, typically
measured by the amount of identity (in the case of DNA), or conservative
replacements (in the case of protein), that they register along their lengths.
Sequence "homology" searches are typically performed with a query DNA or protein
sequence to identify known genes or gene products that share significant
similarity and hence might inform on the ancestry, heritage and possible
function of the query gene.

Housekeeping genes - Genes that are always expressed
(i.e., they are said to be constitutively expressed) due to their constant
requirement by the cell.

Human Anti-Murine Antibody Response
(HAMA) - An immune
response generated in humans to antibodies raised in murine (e.g. mouse or rat)
cells.

Hybridization -
The interaction of complementary nucleic acid
strands. This can occur between two DNA strands or between DNA and RNA strands,
and is the basis of many techniques such as Southern and northern
blots.

Hydrogen bond -
A weak chemical interaction between an
electronegative atom (e.g. nitrogen or oxygen) and a hydrogen atom that is
covalently attached to another atom. Hydrogen bonds hold the two strands of the
DNA double helix together and are also the primary interaction between water
molecules.

Hydrophilicity -
(lit. water-loving) The degree to which a
molecule is soluble in water. Hydrophilicity depends to a large degree on the
charge and polarizability of the molecule and its ability to form transient
hydrogen-bonds with (polar) water molecules.

Hydrophobicity -
(lit. water-hating) The degree to which a
molecule is insoluble in water, and hence is soluble in lipids. If a molecule
lacking polar groups is placed in water, it will be entropically driven to
find a hydrophobic environment (such as the interior of a protein or a
membrane).

Identity Matrix - A scoring matrix in which only identical characters receive a
positive score; the matrix has ones along the main diagonal and zeroes
elsewhere.
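
As a minimal sketch (the function names below are illustrative and do not come
from any particular package), an identity matrix for the DNA alphabet, and its
use in scoring an ungapped comparison, might be written in Python as:

```python
def identity_matrix(alphabet):
    """Return a dict-of-dicts scoring matrix: 1 on the diagonal, 0 elsewhere."""
    return {a: {b: 1 if a == b else 0 for b in alphabet} for a in alphabet}


def ungapped_score(seq1, seq2, matrix):
    """Sum the matrix scores over two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))


dna = identity_matrix("ACGT")
# With an identity matrix the score is simply the number of identical positions.
score = ungapped_score("GATTACA", "GACTATA", dna)
```

Because every mismatch scores zero, this matrix cannot distinguish a
conservative substitution from a drastic one, which is why it is mostly used
for nucleotide rather than protein comparisons.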

IMGT - The International ImMunoGeneTics Information system is a high-quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR), major histocompatibility complex (MHC), immunoglobulin superfamily (IgSF), major histocompatibility complex superfamily (MhcSF) and related proteins of the immune system (RPI) of human and other vertebrate species, created in 1989 by Marie-Paule Lefranc.

Implicit
Parallelism - The
idea that genetic algorithms have an extra built-in form of parallelism that is
expressed when a GA searches through a search space.

in silico
(biology) - The use of
computers to simulate, process, or analyse a biological
experiment.

in situ hybridization - A variation of the DNA/RNA
hybridization procedure in which the denatured DNA is in place in the cell and
is then challenged with RNA or DNA extracted from another source. (See also
fluorescence in situ hybridization).

Integration -
The physical insertion of DNA into the host cell
genome. Retroviruses use this process, in which a specific enzyme catalyses
the insertion; it can also occur at random sites with other DNA (e.g.
transposons).

Intracellular signalling - The communication of a molecular
message from the surface of the cell to the nucleus via the participation of a
series of molecules, including receptors, enzymes, proteins, and
small molecules. The end result of the signalling process is the up- or
down-regulation of a particular series of genes that may be involved in cell
growth, division or differentiation.

Introns -
Nucleotide sequences found in the structural
genes of eukaryotes that are non-coding and interrupt the sequences containing
information that codes for polypeptide chains. Intron sequences are spliced out
of their RNA transcripts before maturation and protein synthesis.

Isoschizomers -
Two different restriction enzymes which recognize
and cut DNA at the same recognition site, e.g. Sma I and Xma I both recognize
the sequence CCCGGG (although these two cleave at different positions within
it).

Isozymes -
Two or more enzymes capable of catalyzing the
same reaction but varying in their specificity due to differences in their
structures and hence their efficiencies under different environmental
conditions.

Iteration -
A series of steps in an algorithm whereby the
processing of data is performed repetitively until the result exceeds a
particular threshold. Iteration is often used in multiple sequence alignments
whereby each set of pairwise alignments is compared with every other, starting
with the most similar pairs and progressing to the least similar, until there
are no longer any sequence-pairs remaining to be aligned.

Jarvis-Patrick Cluster
Algorithm - A
non-hierarchical clustering method for conformational analysis of molecules that
uses a "nearest neighbor" approach to look for clusters of conformations that
are the shortest distance away from each other.

Junk DNA -
Term used to describe the excess DNA that is
present in the genome beyond that required to encode proteins. A misleading term
since these regions are likely to be involved in gene regulation, and other as
yet unidentified functions.

K - A statistical parameter used in calculating BLAST scores that can
be thought of as a natural scale for search space size. The value K is used to
convert a raw score (S) to a bit score (S').
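
The conversion can be sketched as follows, using the standard Karlin-Altschul
form S' = (lambda * S - ln K) / ln 2. The scaling parameter lambda is not
defined in this entry and is introduced here as an assumption, as are the
function names and the numeric values, which are purely illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_len, db_len):
    """Expected number of chance hits for a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameter values only (assumed, not taken from this glossary).
s_prime = bit_score(raw_score=100, lam=0.267, k=0.041)
```

Because the bit score already absorbs K and lambda, scores obtained with
different scoring systems can be compared on the same scale.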

Karyotype -
The constitution (typically number and size) of
chromosomes in a cell or individual.

KEGG -
KEGG PATHWAY is a
collection of manually drawn pathway maps representing our knowledge on the
molecular interaction and reaction networks for Metabolism, Genetic Information
Processing, Environmental Information Processing, Cellular Processes, Human
Diseases and Drug Development.

Knockout mice (gene targeting) - Mice which have been engineered
to lack a chosen gene. The gene is inactivated in so-called embryonic stem cells
using the technique of homologous recombination. These cells are then introduced
into an early-stage embryo (blastocyst) and this is then transplanted into a
recipient mouse. The subsequent progeny lack the targeted gene in some cells.
This technique is used to determine the function of the chosen
gene.

Kozak consensus - The nucleotide sequence
surrounding the initiation codon of a eukaryotic mRNA.

Ktup - The word size used to make the
hash table in the FASTA program.

"Lab on
a chip" - Term describing microdevices
that allow rapid, microanalytical analysis of DNA or protein in a single, fully
integrated system. Typically, these devices are miniature surfaces, made of
silicon, glass or plastic, which carry the necessary microdevices (pumps,
valves, microfluidic controllers, and detectors) that allow sample separation
and analysis. These devices are used in drug discovery, genetic testing and
separation science.

Lead
compound - A candidate compound
identified as the best "hit" (tight binder) after screening of a combinatorial
(or other) compound library, that is then taken into further rounds of screening
to determine its suitability as a drug.

Lead
optimization - The process of converting a
putative lead compound ("hit") into a therapeutic drug with maximal activity and
minimal side effects, typically using a combination of computer-based drug
design, medicinal chemistry and pharmacology.

Leaky
Mutation - A mutation that results in
partial loss of a characteristic.

Length Abridgement - The gradual shortening of
pseudogenes during evolution that is caused by an excess of deletions over
insertions.

Leucine zipper - A DNA-binding protein motif in
which 4-5 leucines are found at intervals of 7 amino acids. This motif is
present typically in transcription factors and other proteins that bind
DNA.

Lexicon -
In Bioinformatics, a lexicon refers to a
pre-defined list of terms that together completely define the contents of a
particular database.
(strict.) The component of a grammar that is, in its bare
form, a list of words or lexical entries.

Library -
A large collection of compounds, peptides, cDNAs
or genes which may be screened in order to isolate cognate
molecules.

Ligand -
Any small molecule that binds to a protein or
receptor; the cognate partner of many cellular proteins, enzymes, and
receptors.

LINE - Long Interspersed Nuclear
Element

Linkage -
The association of genes (or genetic loci) on the
same chromosome. Genes that are linked together tend to be transmitted
together.

Linkage map -
A genetic map of a chromosome or genome
delineated by mapping the positions of genes to their chromosomes by their
linkage to readily identifiable genetic loci.

Locus -
The specific position occupied by a gene on a
chromosome. At a given locus, any one of the variant forms of a gene may be
present. The variants are said to be alleles of that
gene.

Locus Control Region
(LCR) -
A DNA sequence that maintains a functional domain
in an open, active configuration.

Lod Score - A statistical measure of linkage
determined by pedigree analysis.

Log-Odds - A scoring system in which
the values are the logarithm of the relative probability (odds) of a comparison
being due to homology or being due to chance.
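
A minimal sketch of a single log-odds value, assuming the "chance" model is
the product of the two background frequencies; the function name and the
example frequencies are illustrative only.

```python
import math


def log_odds(observed_pair_freq, freq_a, freq_b, base=2.0):
    """Log of (observed pair frequency) / (frequency expected by chance)."""
    expected = freq_a * freq_b  # chance model: the two residues are independent
    return math.log(observed_pair_freq / expected, base)


# A pair seen four times more often than chance predicts scores about +2 bits;
# a pair seen less often than chance receives a negative score.
favoured = log_odds(0.04, 0.1, 0.1)
disfavoured = log_odds(0.005, 0.1, 0.1)
```

Taking logarithms means that the scores of successive aligned positions can be
added, which is equivalent to multiplying the underlying odds ratios.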

MADS box - A DNA-binding domain found in
several transcription factors involved in plant development.

MAGE - MicroArray and Gene
Expression

Map unit -
A measure of genetic distance between two linked
genes that corresponds to a recombination frequency of 1%.

Markov chain -
Any multivariate probability density whose
independence diagram is a chain. The variables are ordered, and each variable
"depends" only on its neighbors in the sense of being conditionally independent
of the others. Markov chains are an integral component of hidden Markov
models.
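
As an illustration, the sketch below scores a nucleotide sequence under a
first-order Markov chain, in which each base depends only on the base before
it. The uniform probabilities are toy values assumed for the example, and the
function name is hypothetical.

```python
import math


def sequence_log_prob(seq, initial, transition):
    """Log-probability of seq under a first-order Markov chain."""
    if not seq:
        return 0.0
    logp = math.log(initial[seq[0]])
    # Each later symbol is conditioned only on its immediate predecessor.
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transition[prev][cur])
    return logp


# Toy uniform model (assumed values for illustration only).
bases = "ACGT"
initial = {b: 0.25 for b in bases}
transition = {a: {b: 0.25 for b in bases} for a in bases}
logp = sequence_log_prob("ACGT", initial, transition)
```

Working in log space avoids numerical underflow when long sequences are
scored.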

Matrix-Associated
Region (MAR) -
An AT-rich segment of a eukaryotic genome that
acts as an attachment point to the nuclear matrix.

Melting (of DNA)
- The denaturation of double-stranded DNA into
two single strands by the application of heat. (Denaturation breaks the hydrogen
bonds holding the double-stranded DNA together).

MetaCyc - MetaCyc describes enzymes and metabolic pathways for more than 300 organisms. MetaCyc does not seek to model the complete metabolic network of any one organism, but to provide a comprehensive collection of experimentally elucidated metabolic pathways.

Methylation -
The addition of -CH3 (methyl) groups to a target
site. Typically such addition occurs on the cytosine bases of
DNA.

MIAME - Minimum Information About a
Microarray Experiment

MIAME/Tox - A proposal by the EBI, NIEHS-NCT
and ILSI-HESI for defining the Minimum Information About a Microarray Experiment
requirements for Toxicogenomics.

Microarray -
A 2D array, typically on a glass, filter, or
silicon wafer, upon which genes or gene fragments are deposited or synthesized
in a predetermined spatial order allowing them to be made available as probes in
a high-throughput, parallel manner.

Microfluidics -
The miniaturization of chemical reactions or
pharmacological assays into microscopic tubes or vessels in order to greatly
increase their throughput, by placing many of them side-by-side in an
array.

Mimetics -
Compounds that mimic the function of other
molecules via their high degree of structural (conformational) similarity, and
hence physico-chemical properties.

Missense mutation - A point mutation in which one
codon (triplet of bases) is changed into another designating a different amino
acid.

Modeling -
In bioinformatics, modeling usually refers to
molecular modeling, a process whereby the three-dimensional architecture of
biological molecules is interpreted (or predicted), visually represented, and
manipulated in order to determine their molecular properties. More generally, a
model is a series of mathematical equations or procedures which simulate a
real-life process, given a set of assumptions, boundary parameters, and initial
conditions.

Modus Ponens - A rule of logical inference
that if proposition A is true, and A implies B, then B is
true.

Molecular Modeling - The prediction of a molecule's
three-dimensional shape, which can be estimated from its sequence by reference
to previously determined molecular shapes.

Molecular Modeling
Database
(MMDB) - A
database of macromolecular 3D structures, as well as tools for their
visualization and comparative analysis; contains experimentally determined
biopolymer structures obtained from the Protein Data Bank (PDB).

Molecular
Phylogenetics - A set of techniques that enable the evolutionary relationships
between DNA sequences to be inferred by making comparisons between those
sequences.

Monomer -
A single unit of any biological molecule or
macromolecule, such as an amino acid, nucleotide, polypeptide domain, or
protein.

Monophyletic - Sharing a common
ancestor.

Monovalent -
Having one binding site; strictly, an atom with
only one free electron available for binding in its highest energy
shell.

Monte Carlo
Simulation - A method of calculating significance where the analysis in
question is repeated using randomized or permuted sequences, in order to
determine the expected scores for unrelated (random) sequences.
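
A minimal sketch of this idea is shown below. It assumes a simple
identity-count score and shuffles the target sequence; a real analysis would
substitute its own alignment score. All names are illustrative.

```python
import random


def identity_score(a, b):
    """Toy score: number of identical positions in an ungapped comparison."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(query, target, score_fn, trials=1000, seed=0):
    """Estimate how often a shuffled target scores at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = score_fn(query, target)
    letters = list(target)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # preserves composition, destroys order
        if score_fn(query, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


observed, p = shuffle_p_value("ACGTACGTAC", "ACGTACGTAC", identity_score, trials=200)
```

Shuffling keeps the residue composition of the target unchanged, so the
resulting score distribution reflects what unrelated sequences of the same
composition would be expected to produce.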

Morbid Map - A diagram showing the
chromosomal location of genes associated with disease.

Motif -
A conserved element of a protein sequence
alignment that usually correlates with a particular function. Motifs are
generated from a local multiple protein sequence alignment corresponding to a
region whose function or structure is known. It is sufficient that it is
conserved, and is hence likely to be predictive of any subsequent occurrence of
such a structural/functional region in any other novel protein
sequence.

Multifurcation - A graphical representation of an
unknown branching order in a phylogenetic tree involving three or more
taxa.

Multigene family
- A set of genes derived by duplication of an
ancestral gene, followed by independent mutational events resulting in a series
of independent genes either clustered together on a chromosome or dispersed
throughout the genome.

Multiple (Sequence) Alignment - A Multiple Alignment of k
sequences is a rectangular array, consisting of characters taken from the
alphabet A, that satisfies the following conditions: There are exactly k
rows; ignoring the gap character, row number i is exactly the sequence
s_i; and each column contains at least one character different
from "-". In practice multiple sequence alignments include a cost/weight
function, that defines the penalty for the insertion of gaps (the "-" character)
and weights identities and conservative substitutions accordingly. Multiple
alignment algorithms attempt to create the optimal alignment defined as the one
with the lowest cost/weight score.
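
The three conditions in the formal definition can be checked directly. The
following sketch assumes the alignment is supplied as a list of row strings
with "-" as the gap character; the function name is illustrative.

```python
def is_valid_alignment(rows, sequences, gap="-"):
    """Check the three defining conditions of a multiple alignment."""
    # Condition 1: exactly k rows, forming a rectangular array.
    if len(rows) != len(sequences):
        return False
    if len({len(row) for row in rows}) > 1:
        return False
    # Condition 2: ignoring gaps, row i is exactly sequence s_i.
    for row, seq in zip(rows, sequences):
        if row.replace(gap, "") != seq:
            return False
    # Condition 3: no column consists only of gap characters.
    return all(any(ch != gap for ch in column) for column in zip(*rows))


ok = is_valid_alignment(["AC-GT", "A-CGT"], ["ACGT", "ACGT"])
```

This only tests that an array is a legal alignment; choosing the best one
still requires the cost/weight function described above.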

Multiplex sequencing - Approach to high-throughput
sequencing that uses several pooled DNA samples run through gels simultaneously
and then separated and analyzed.

Mutagen -
Any agent that can cause an increase in the rate
of mutations in an organism.

Mutation -
An inheritable alteration to the genome that includes genetic (point or
single base) changes, or larger scale alterations such as chromosomal deletions
or rearrangements.

Mutation Data
Matrix
(MDM) - The
most common scoring system for proteins; log-odds form of the PAM-250 mutation
probability matrix.

Naked
DNA - Pure, isolated DNA devoid of
any proteins that may bind to it.

National Center for Biotechnology
Information
(NCBI) - A unit of the National Library of Medicine (NLM), National
Institutes of Health (NIH).

National Institutes of
Health
(NIH) - An agency of the U.S. Department of Health and Human Services,
Public Health Service; supports research and training to improve health.

National Library of
Medicine
(NLM) - Part of
the National Institutes of Health (NIH). Includes NCBI.

NCEs (New Chemical Entities) - Compounds identified as
potential drugs that are sent from research and development into clinical trials
to determine their suitability.

Neighbor-Joining
Method - A method
for construction of phylogenetic trees.

Nested PCR -
The second round amplification of an already
PCR-amplified sequence using a new pair of primers which are internal to the
original primers. Typically done when a single PCR reaction generates
insufficient amounts of product.

Neural net - A neural net is an interconnected assembly of simple
processing elements, units or nodes, whose functionality is loosely based on the
animal brain. The processing ability of the network is stored in the inter-unit
connection strengths, or weights, obtained by a process of adaptation to, or
learning from, a set of training patterns. Neural nets are used in
bioinformatics to map data and make predictions, such as taking a multiple
alignment of a protein family as a training set in order to identify novel
members of the family from their sequence data alone.

Nonsense mutation - A point mutation in which a
codon specific for an amino-acid is converted into a nonsense
codon.
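
The distinction between nonsense and missense mutations can be illustrated by
translating the codon before and after the change. The sketch below assumes
the standard genetic code and DNA (not RNA) codons; the function and label
names are illustrative.

```python
# Standard genetic code, with codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_point_mutation(codon_before, codon_after):
    """Label the effect of a codon change on the encoded amino acid."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    if before == after:
        return "silent"
    if after == "*":
        return "nonsense"   # an amino-acid codon became a stop codon
    if before == "*":
        return "stop-loss"  # a stop codon became an amino-acid codon
    return "missense"       # one amino acid replaced by another


effect = classify_point_mutation("TGG", "TGA")
```

The function does not check that the two codons differ at only one position;
that is left to the caller.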

Northern blotting - A technique to identify RNA
molecules by hybridization that is analogous to Southern blotting.

Nuclear Magnetic
Resonance
(NMR) -
A technique for determining the
three-dimensional structure of large molecules.

Nuclease -
Any enzyme that can cleave the phosphodiester
bonds of nucleic acid backbones.

OFAGE - Orthogonal Field Alternation Gel
Electrophoresis

Oligonucleotide
- A short molecule consisting of several linked
nucleotides (typically between 10 and 60) covalently attached by phosphodiester
bonds.

OMIM - Online Mendelian Inheritance in
Man

Open reading frame (ORF) - Any stretch of DNA that
potentially encodes a protein. Open reading frames start with a start codon, and
end with a termination codon. No termination codons may be present internally.
The identification of an ORF is the first indication that a segment of DNA may
be part of a functional gene.
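
A minimal ORF scan following this definition might look like the sketch
below. It assumes ATG as the only start codon, the three standard stop codons,
and the forward strand only; a fuller search would also scan the reverse
complement. The function name is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for ORFs in the three forward frames.

    min_codons is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATAGCC")
```

Raising min_codons filters out the many short ORFs that occur by chance in
any long stretch of DNA.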

Operator -
A segment of DNA that interacts with the products
of regulatory genes and facilitates the transcription of one or more structural
genes.

Operon -
A unit of transcription consisting of one or more
structural genes, an operator, and a promoter.

Orthologs -
Orthologs are genes in different species that
evolved from a common ancestral gene by speciation. Normally, orthologs retain
the same function in the course of evolution. Identification of orthologs is
critical for reliable prediction of gene function in newly sequenced
genomes.

Orthologous - Homologous sequences in different species that arose from a common
ancestral gene during speciation.

Overlapping
clones - Collection of
cloned sequences made by generating randomly overlapping DNA fragments with
infrequently cutting restriction enzymes.

Palindrome - A region of
DNA with a symmetrical arrangement of bases occurring about a single point such
that the base sequences on either side of that point are identical (if the
strands are both read in the same direction), e.g. 5' GAATTC 3' whose
complementary sequence is 3' CTTAAG 5'.
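
In computational terms, a DNA palindrome is a sequence that equals its own
reverse complement. A short sketch (function names are illustrative, and only
the four unambiguous bases are handled):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def is_palindrome(dna):
    """True if both strands read the same in the 5' to 3' direction."""
    return dna.upper() == reverse_complement(dna)


ecori_site = is_palindrome("GAATTC")
```

Many restriction enzyme recognition sites, including the GAATTC example above,
are palindromes of this kind.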

Parallel
Evolution - The development of similar characteristics in organisms that are
not closely related (not part of a monophyletic group) due to adaptation to
similar environments and/or strategies of life.

Pattern -
Molecular biological patterns usually occur at
the level of the characters making up the gene or protein sequence. A pattern
language must be defined in order to apply different criteria to different
positions of a sequence. In order to have position-specific comparison done by a
computer, a pattern-matching algorithm must allow alternative residues at a
given position, repetitions of a residue, exclusion of alternative residues,
weighting, and ideally, combinatorial representation.

Pathways -
Bioinformatics strives to define representations of key biological
datatypes, algorithms and inference procedures, including sequences, structures,
biological pathways and reactions. Representing and computing with biological
pathways requires ontologies for representing pathway knowledge; user interfaces
to pathway databases; physico-chemical properties of enzymes and their substrates
in pathways; and pathway analysis of whole genomes, including identifying common
patterns across species and species differences.

Paralogs -
Paralogs are genes related by duplication within
a genome. Orthologs retain the same function in the course of evolution, whereas
paralogs evolve new functions, even if these are related to the original
one.

Parameters -
Parameters are user-selectable values, typically
experimentally determined, that govern the boundaries of an algorithm or
program. For instance, selection of the appropriate input parameters governs the
success of a search algorithm. Some of the most common search parameters in
bioinformatics tools include the stringency of an alignment search tool, and the
weights (penalties) provided for mismatches and gaps.

Pfam -
Pfam is a large
collection of multiple sequence alignments and hidden Markov models covering
many common protein domains and families.

Peptide -
A short stretch of amino acids each covalently
coupled by a peptide (amide) bond.

Peptide bond (amide bond) - A covalent bond formed between
two amino acids when the amino group of one is linked to the carboxy group of
another (resulting in the elimination of one water molecule).

Percent Accepted
Mutation
(PAM) -
A unit introduced by Dayhoff et al. to quantify the
amount of evolutionary change in a protein sequence. One PAM unit is the amount
of evolution which will change, on average, 1% of amino acids in a protein
sequence.

Phage display -
A technique in which phage are engineered to fuse
a foreign peptide or protein with their capsid (surface) proteins and hence
display it on their surfaces. The immobilized phage may then be used as a
screen to see what ligands bind to the expressed fusion protein exhibited
(displayed) on the phage surface.

Pharmacogenomics
- The use of (DNA-based) genotyping in order to
target pharmaceutical agents to specific patient populations. Genetic
differences are known to affect responses to many types of drug therapy, and
pharmacogenomics analysis serves to customize the use of pharmaceuticals for
specific subgroups of patients. The rationale for this approach is that observed
gene expression differences may correlate with, and explain, the differences in
side effects and efficacy to drugs in humans.

Pharmacophore -
The three-dimensional spatial arrangement of
atoms, substituents, functional groups, or chemical features that together are
sufficient to describe the pharmacologically active components of a drug
molecule or molecule series.

Phenogram - A graphic representation that portrays or attempts to portray the
taxonomic relationships among a number of individuals, species, or higher taxa
on the basis of overall similarities between them.

Phenotype -
Any observable feature of an organism that is the
result of one or more genes.

Phylum -
One of about 30 major groups (phyla) into which
the animal kingdom is divided. The members of each phylum share
the same basic structure and organization. For instance, fish, birds, and human
beings belong to one phylum - the Chordata - because all have a dorsal nerve
cord (the spinal cord in vertebrates).

Physical map -
A physical map consists of a linearly ordered set
of DNA fragments encompassing the genome or region of interest. Physical maps
are of two types, macro-restriction maps and ordered clone maps. The former
consists of an ordered set of large DNA fragments generated by using restriction
enzymes whose recognition sequences are infrequently represented in the genome.
An ordered clone map consists of an overlapping collection of cloned DNA
fragments. The DNA may be cloned into any one of the available vector
systems--YACs, cosmids, phage, or even plasmids. Major advantages of ordered
clone maps are that they are of high resolution and directly provide the clones
for further study.

Pi-Pi
Interactions - The hydrophobic interactions that occur between adjacent base
pairs in a double-stranded DNA molecule.

PIR - Protein Information
Resource. An integrated public
bioinformatics resource that supports genomic and proteomic research and
scientific studies.

Pleiotropy -
The multiple effects on an organism's phenotype
due to a single gene or allele, e.g. the cytokines, which can bind to multiple
cellular receptors and affect growth and multiple immune pathways.

Point mutation -
A mutation in which a single nucleotide in a DNA
sequence is substituted by another nucleotide.

Polygenic Inheritance - Inheritance involving alleles at
many genetic loci.

Polymerase chain reaction (PCR)
- Technique used to
amplify or generate large amounts of replica DNA of a segment of any DNA whose
"flanking" sequences are known. Oligonucleotide primers which bind these
flanking sequences are used by an enzyme (Taq polymerase) to copy the sequence
in between the primers. Cycles of heat to break apart the DNA strands, cooling
to allow the primers to bind, and heating again to allow the enzyme to copy the
intervening sequence lead to a doubling of DNA at each cycle. The reactions are
typically carried out on a regulated heating block and consist of 30-35 cycles
of repeated amplification of all the DNA present. Single molecules of "target"
DNA can be amplified to microgram amounts of DNA. The target DNA can be of any
origin.
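
The doubling described above can be put into numbers. The sketch below assumes
an idealized reaction with a constant per-cycle efficiency, which real
reactions only approximate (amplification plateaus in later cycles); the
function name is illustrative.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies after a number of cycles; efficiency=1.0 means perfect doubling."""
    return start_copies * (1.0 + efficiency) ** cycles


# Perfect doubling of a single template for 30 cycles gives 2**30 copies,
# i.e. roughly a billion-fold amplification.
ideal = copies_after(30)
realistic = copies_after(30, efficiency=0.9)  # assumed 90% efficiency per cycle
```

Even a modest drop in per-cycle efficiency reduces the final yield
substantially, because the loss compounds over every cycle.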

Polymorphism -
The existence of a gene in a population in at
least two different forms at a frequency far higher than that attributable to
recurrent mutation alone. Variations in a population may be measured by
determining the rate of mutation in polymorphic genes.

Polypeptide -
A single chain of covalently attached amino acids
joined by peptide bonds. Polypeptide chains usually fold into a compact, stable
form (a domain) that is part (or all) of the final protein.

Positional cloning - Method used to define the
location of a gene on a chromosome and use this information to identify and
clone the gene. The location of the gene is determined by linkage analysis of
DNA from a large family containing afflicted and normal members to identify
linkages between the transmission of the disease gene and observable genetic
markers. This information is then used to screen (by chromosomal jumping and
walking) the location for putative genes. The disease gene must be compared
between the afflicted and normal family members and be shown to be different in
the two groups. The full sequencing of the gene will then provide information
regarding the characteristics and function of the gene product, and a potential
explanation for the cause of the disease.

Position-Specific Scoring
Matrix
(PSSM)
- The PSSM gives the log-odds score for finding a particular
matching amino acid in a target sequence.
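
A PSSM can be sketched as one column of log-odds scores per alignment
position. The example below assumes an ungapped alignment, a uniform
background, and a simple pseudocount; it uses a DNA alphabet for brevity, but
the alphabet argument can be set to the 20 amino acids. All names are
illustrative.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build per-position log-odds scores from equal-length aligned sequences."""
    n = len(aligned)
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for column in zip(*aligned):
        scores = {}
        for letter in alphabet:
            # The pseudocount keeps unseen letters from scoring minus infinity.
            freq = (column.count(letter) + pseudocount) / (
                n + pseudocount * len(alphabet)
            )
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum the position-specific scores for a window of the target sequence."""
    return sum(col[ch] for col, ch in zip(pssm, window))


pssm = build_pssm(["ACG", "ACG", "ACT"])
```

Sliding score_window along a target sequence and keeping the high-scoring
windows is the usual way such a matrix is used to find new occurrences.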

Post-Transcriptional
Modification - Alterations made to pre-mRNA before it leaves the nucleus and
becomes mature mRNA.

Post-Translational
Modification - Alterations made to a protein after its synthesis at the ribosome.
These modifications, such as the addition of carbohydrate or fatty acid chains,
may be critical to the function of the protein.

PRINTS -
PRINTS is a compendium
of protein fingerprints. A fingerprint is a group of conserved motifs used to
characterise a protein family; its diagnostic power is refined by iterative
scanning of a SWISS-PROT/TrEMBL composite.

Probe -
Any biochemical that is labelled or tagged in
some way so that it can be used to identify or isolate a gene, RNA, or
protein.

PRODOM -
ProDom is a
comprehensive set of protein domain families automatically generated from the
SWISS-PROT and TrEMBL sequence databases.

Profile -
Sequence profiles are usually derived from
multiple alignments of sequences with a known relationship, and consist of
tables of position-specific scores and gap-penalties. Each position in the
profile contains scores for all of the possible amino acids, as well as one
penalty score for opening and one for continuing a gap at the specified
position. Attempts have been made to further improve the sensitivity of the
profile by refining the procedures to construct a profile starting from a given
multiple alignment. Other representations for sequence domains or motifs do not
necessarily require the presence of a correct and complete multiple alignment,
such as hidden Markov models.

Promoter (site)
- A promoter site is defined by its recognition
by eukaryotic RNA polymerase II; its activity in a higher eukaryote; by
experimental evidence, or homology and sufficient similarity to an
experimentally defined promoter; and by observed biological function.

Prosite - A database of
"patterns" (regular expressions) specific for various protein motifs.
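
Because these patterns are regular expressions in a dedicated notation, they
can be translated mechanically. The sketch below handles the common elements
(x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m)
for repetition, and < or > anchors at the ends of the pattern); it is a
simplified, illustrative converter, not the official PROSITE tooling.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # anchored to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # anchored to the C-terminus
            suffix, element = "$", element[:-1]
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # any residue
        parts.append(prefix + element + suffix)
    return "".join(parts)


# N-glycosylation site pattern: Asn, anything but Pro, Ser or Thr, anything but Pro.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "AANGSAA")
```

Patterns that place an anchor inside square brackets are not covered by this
simplified translation.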

Protein families
- Sets of proteins that share a common
evolutionary origin, reflected by their relatedness in function and usually
by similarities in sequence, or in primary, secondary or tertiary
structure; subsets of proteins with related structure and function.

Proteome -
The entire protein complement of a given
organism.

Proteomics -
The study of the proteome. Typically, the
cataloging of all the expressed proteins in a particular cell or tissue type,
obtained by identifying the proteins from cell extracts using a combination of
2D gel electrophoresis and mass spectrometry. More broadly, the large-scale
analysis of protein composition and function.

Pseudoknot - A 3D structure where the RNA folds back on a hairpin to form a
structure where three strands are held together by hydrogen bonds.

Query
(sequence) - A DNA, RNA or protein sequence
used to search a sequence database in order to identify close or remote family
members (homologs) of known function, or sequences with similar active sites or
regions (analogs), from which the function of the query may be
deduced.

Quantitative Structure Activity
Relationship (QSAR) - Relates
numerical properties of the molecular structure to the activity via a
mathematical model.

Quantitative Trait
Locus
(QTL)
- A gene locus that contributes to a complex trait.

Quantum Model Of
Speciation - A model of evolution that holds that speciation sometimes occurs
rapidly as well as over long periods, as the classical theory
proposed.

Ramachandran
map - A contour plot of the different phi and psi angles that are found
within a protein.

RASMOL - Program package by R.
Sayle to display protein structures.

Rat Genome
Database
(RGD) -
A collaborative effort among leading research
institutions involved in rat genetic and genomic research to collect,
consolidate, and integrate data generated from ongoing rat genetic and genomic
research efforts and make these data widely available to the scientific
community.

Raw Score (S)
-
The score of an alignment, S, calculated as
the sum of substitution and gap scores.

RCSB - Research Collaboratory for
Structural Bioinformatics

Rational drug design (Structure based drug
design) - The
development of drugs based on the 3-dimensional molecular structure of a
particular target.

Reading frame
- A sequence of codons beginning with an
initiation codon and ending with a termination codon, typically of at least 150
bases (50 amino acids) coding for a polypeptide or protein chain.

REBASE - REBASE is a database of type 2 restriction enzymes. Restriction enzymes are used to cut DNA sequences at a particular site. A site is a particular sequence of bases.

Recessive - Any trait that is expressed phenotypically only when present on both alleles of a gene.

Recurrent Neural
Network - A network similar to a feedforward neural network except that
there may be connections from an output or hidden layer to the inputs. Recurrent
neural networks are capable of universal computation.

Recursion -
An algorithmic procedure whereby an algorithm
calls on itself to perform a calculation until a terminating condition is met
(for example, the result exceeds a threshold), at which point the algorithm
exits. Recursion is a powerful procedure with which to process data and can be
computationally efficient.

Regulatory gene
- A DNA sequence that functions to control the
expression of other genes by producing a protein that modulates the synthesis of
their products (typically by binding to the gene promoter).

Relational Database Management Systems
(RDBMS) - A software
system that includes a database architecture, query language, and data loading
and updating tools and other ancillary software that together allow the creation
of a relational database application.

Relative Rate
Test - A calibration-free test for checking the constancy of the rate of
nucleotide substitutions in different lineages during their evolution, thus
determining whether or not the molecular clock operates at the same rate among
different lineages.

Repeats (repeat sequences) - Repeat sequences and approximate
repeats occur throughout the DNA of higher organisms (mammals). For example, the
Alu sequences, of length about 300 characters, appear hundreds of
thousands of times in human DNA with about 87% homology to a consensus
Alu string. Some short substrings such as TATA-boxes, poly-A and (TG)*
also appear more often than by chance. Repeat sequences may also occur within
genes, as mutations or alterations to those genes. Repetitive sequences,
especially mobile elements, have many applications in genetic research. DNA
transposons and retroposons are routinely used for insertional mutagenesis, gene
mapping, gene tagging, and gene transfer in several model systems.

Repetitive DNA
Fingerprinting - A clone fingerprinting technique that involves determining the
positions of genome-wide repeats in cloned DNA fragments.

Repetitive elements - Repetitive elements provide
important clues about chromosome dynamics, evolutionary forces, and mechanisms
for exchange of genetic information between organisms. The most ubiquitous class
of repetitive elements in the DNA sequence in primate genomes is the Alu
family of interspersed repeats, which have arisen in the last 65 million years of
evolution. Alu repeats belong to a class of sequences defined as short
interspersed elements (SINEs). Approximately 500,000 Alu SINEs exist
within the human genome, representing about 5% of the genome by
mass.

Replication -
The synthesis of an informationally identical
macromolecule (e.g. DNA) from a template molecule.

Repository -
A collection of resources that can be
accessed to retrieve information.

Repressor -
The protein product of a regulatory gene that
combines with a specific operator (regulatory DNA sequence) and hence blocks the
transcription of genes in an operon.

Restriction fragment length polymorphisms
(RFLPs) - Variation
within the DNA sequences of organisms of a given species that can be identified
by fragmenting the sequences using restriction enzymes, since the variation lies
within the restriction site. RFLPs can be used to measure the diversity of a
gene in a population.
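
A minimal sketch of the idea, assuming a linear, single-stranded toy sequence and a hypothetical helper named digest: a single base change inside a restriction site removes one cut, so the two alleles give different fragment lengths. The site GAATTC with a cut after the first base is the EcoRI pattern; the sequences are invented.

```python
def digest(seq, site, cut_offset):
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the position of the cut within the site (1 means G^AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


allele_a = "AAAAGAATTCAAAAAAGAATTCAAAA"   # two intact sites
allele_b = "AAAAGAATTCAAAAAAGAGTTCAAAA"   # second site lost by an A-to-G change

print(digest(allele_a, "GAATTC", 1))  # three fragments
print(digest(allele_b, "GAATTC", 1))  # two fragments
```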

Restriction map
- A physical map or depiction of a gene (or
genome) derived by ordering overlapping restriction fragments produced by
digestion of the DNA with a number of restriction enzymes.

Retron - A genomic sequence
encoding reverse transcriptase but lacking the ability to
transpose.

Reverse Genetics
- The use of protein information to elucidate the
genetic sequence encoding that protein. Used to describe the process of gene
isolation starting with a panel of afflicted patients.

Reverse transcriptase - A DNA polymerase that can
synthesise a complementary DNA (cDNA) strand using RNA as a template - a
so-called RNA-dependent DNA polymerase.
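
The template relationship can be sketched as a string operation. This is an illustration only, with a hypothetical function name: the first-strand cDNA is the reverse complement of the mRNA, written in DNA letters and read 5' to 3'.

```python
# RNA base -> complementary DNA base
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))
```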

Reverse transcriptase-PCR
(RT-PCR) - Procedure in which PCR amplification is carried out on DNA that is
first generated by the conversion of mRNA to cDNA using reverse
transcriptase.

Ribbon-Helix-Ribbon
Motif - A type of DNA-binding
domain.

Rooted Tree -
A phylogenetic tree that
specifies ancestral and descendant species, thus indicating the direction of the
evolutionary path.

Score
Matrix - In a dynamic programming alignment, the score matrix indicates the
quality of the alignment ending at each possible pair of residues.
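
A minimal sketch of filling such a matrix for a global alignment, assuming a simple scheme (match +1, mismatch -1, gap -2) chosen only for illustration; real programs use substitution matrices and more elaborate gap penalties.

```python
def score_matrix(s, t, match=1, mismatch=-1, gap=-2):
    """Fill a global-alignment score matrix; H[i][j] scores s[:i] against t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: s aligned against gaps only
        H[i][0] = i * gap
    for j in range(1, cols):          # first row: t aligned against gaps only
        H[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    return H


for row in score_matrix("GAT", "GAT"):
    print(row)
```

The bottom-right cell holds the score of the best alignment of the two complete sequences.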

Scoring
Function -
A Scoring Function maps an abstract concept to a
numeric value. A variety of scoring functions are used to evaluate ("score")
single edit operations (or simply the combination of characters involved),
columns of a multiple alignment, whole pairwise alignments, and whole multiple
alignments. Generally, the score of an alignment of two sequences s and t is the
sum of the score of all its edit operations that lead from s to t. For multiple
alignment, there are various competing ways to calculate a score, e.g.
sum-of-pairs score, or scoring along a tree. Usually, the terms Cost Function
and Weight Function can be treated as synonyms, although they carry the idea
that "small values are desirable". Both similarity measure and distance measure
are specific types of scoring function.
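
The two ideas above, a pairwise score as a sum over edit operations and a sum-of-pairs score for a multiple alignment, can be sketched as follows. The scoring values and function names are assumptions made for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned column of two sequences (one edit operation)."""
    if a == "-" and b == "-":
        return 0                      # gap against gap carries no cost here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def alignment_score(s, t):
    """Score of a pairwise alignment: the sum over its columns."""
    return sum(pair_score(a, b) for a, b in zip(s, t))


def sum_of_pairs(rows):
    """Sum-of-pairs score: add up the scores of every pair of aligned rows."""
    return sum(alignment_score(x, y) for x, y in combinations(rows, 2))


print(alignment_score("GAT-ACA", "GATTACA"))
print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```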

Secondary structure (protein) - The organization of the peptide
backbone of a protein that occurs as a result of hydrogen bonds, e.g. alpha helix,
beta-pleated sheet.

SEG - A program for filtering low
complexity regions in amino acid sequences.

Selenocysteine (U) - A nonstandard amino acid.

Selectivity -
Selectivity of bioinformatics similarity search
algorithms is defined as the significance threshold for reporting database
sequence matches. As an example, for BLAST searches, the parameter E is
interpreted as the upper bound on the expected frequency of chance occurrence of
a match within the context of the entire database search. E may be thought of
as the number of matches one expects to observe by chance alone during the
database search.
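
As a hedged sketch, the expected number of chance matches is commonly modelled as E = K * m * n * exp(-lambda * S), where S is the alignment score, m the query length and n the database length. The default values of k and lam below are illustrative placeholders, not the parameters of any particular BLAST run, and the function names are hypothetical.

```python
import math


def expected_chance_hits(score, query_len, db_len, k=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


def is_reported(score, query_len, db_len, e_threshold=10.0):
    """A match is reported only if its E value does not exceed the threshold."""
    return expected_chance_hits(score, query_len, db_len) <= e_threshold


print(is_reported(100, 100, 1_000_000))  # high score, tiny E
print(is_reported(20, 100, 1_000_000))   # low score, large E
```

Lowering e_threshold makes the search more selective; doubling the database size doubles E for the same score.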

Sensitivity
- Sensitivity of bioinformatics similarity search
algorithms centers around two areas: First, how well can the method detect
biologically meaningful relationships between two related sequences in the
presence of mutations and sequencing errors; second, how does the heuristic
nature of the algorithm affect the probability that a matching sequence will not
be detected. At the user's discretion, the speed of most similarity search
programs can be sacrificed in exchange for greater sensitivity - with an
emphasis on detecting lower scoring matches.

Sequence Tagged Site (STS) - A unique sequence from a
known chromosomal location that can be amplified by PCR. STSs act as physical
markers for genomic mapping and cloning.

Sequin - Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions that contain a single short mRNA sequence, and complex submissions containing long sequences, multiple annotations, segmented sets of DNA, or phylogenetic and population studies.

Sexual PCR (Molecular
Diversity) - Sexual
PCR is a form of PCR in which similar, but not identical, DNA sequences are
reassembled to obtain novel juxtapositions, simulating the result of genetic
recombination. The result is the creation of an array of related genes which may
possess improved characteristics. By repeated rounds of recombination, selection
and PCR-based amplification, vastly improved gene products, such as enzymes with
greater activity, may be generated and selected.

Shotgun cloning - The cloning of an entire gene
segment or genome by generating a random set of fragments using restriction
endonucleases to create a gene library that can be subsequently mapped and
sequenced to reconstruct the entire genome.

Short Interspersed Nuclear
Element
(SINE) -
A type of
genome-wide repeat, less than 500 bp in length.

Shotgun Sequencing - An approach used to
decode an organism's genome by shredding it into smaller fragments of DNA which
can be sequenced individually.

Signal sequence (leader
sequence) - A short sequence added to the
amino-terminal end of a polypeptide chain that forms an amphipathic helix
allowing the nascent polypeptide to migrate through membranes such as the
endoplasmic reticulum or the cell membrane. It is cleaved from the polypeptide
after the protein has crossed the membrane.

Similarity (homology) search - Given a newly sequenced gene,
there are two main approaches to the prediction of structure and function from
the amino acid sequence. Homology methods are the most powerful and are based on
the detection of significant extended sequence similarity to a protein of known
structure, or of a sequence pattern characteristic of a protein family.
Statistical methods are less successful but more general and are based on the
derivation of structural preference values for single residues, pairs of
residues, short oligopeptides or short sequence patterns. The transfer of
structure/function information to a potentially homologous protein is
straightforward when the sequence similarity is high and extended in length, but
the assessment of the structural significance of sequence similarity can be
difficult when sequence similarity is weak or restricted to a short
region.

Simple Sequence Length
Polymorphism (SSLP) - An array of simple tandem repeat sequences that
displays length variation.

Single nucleotide polymorphisms
(SNPs) - Variations of single base pairs
scattered throughout the human genome that serve as measures of the genetic
diversity in humans. About 1 million SNPs are estimated to be present in the
human genome, and SNPs are useful markers for gene mapping studies.

Single-pass sequencing - Rapid sequencing of large segments of the genome of an organism
by isolating as many expressed (cDNA) sequences as possible and performing
single sequencer runs on their 5' or 3' ends. Single-pass sequencing typically
results in individual, error-prone sequencing reads of 400-700 bases, depending
on the type of sequencer used. However, if many of these are generated from
numerous clones from different tissues, they may be overlapped and assembled to
remove the errors and generate a contiguous sequence for the entire expressed
gene.

Site -
Sites in sequences can be located either in DNA (e.g. binding sites, cleavage
sites) or in proteins. In order to identify a site in DNA, ambiguity symbols are
used to allow several different symbols at one position. Proteins, however, need
a different mechanism (see Pattern). Restriction enzyme cleavage sites, for
instance, have the following properties: limited length (typically, less than
20 base pairs); definition of the cleavage site and its appearance (3', 5'
overhang or blunt); definition of the binding site.
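
One way to handle ambiguity symbols in software is to expand each IUPAC nucleotide code into a character class and search with a regular expression. The sketch below uses the standard IUPAC codes; the function names and the test sequence are invented for illustration, and only one strand is searched.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def site_regex(pattern):
    """Translate a site written with ambiguity symbols into a regex string."""
    return "".join(IUPAC[symbol] for symbol in pattern.upper())


def find_sites(seq, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + site_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(seq.upper())]


print(find_sites("AAGTCGACTTGTTAACGG", "GTYRAC"))
```

The lookahead group lets overlapping sites be reported, which a plain search would miss.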

Southern blotting - A procedure for the identification of DNA by transferring a
fragment isolated on an agarose gel to a nitrocellulose filter where it can be
hybridized with a complementary "probe" sequence.

Solvent accessibility - The surface area (typically measured in square angstroms) of a
biological molecule, usually a protein, that is exposed to solvent in its
native, folded form. Determining the solvent accessibility of a protein helps
define which amino acids in its molecular sequence are on the exterior of the
molecule, and thus available to participate in interactions with other
molecules.

SRS - Sequence Retrieval
System

Structural gene - Gene which encodes a structural protein (cf. Regulatory
gene).

Structural
Genomics
-
The effort to determine the 3D structures of
large numbers of proteins using both experimental techniques and computer
simulation.

Structure-Based
Drug Design - Pharmaceutical research driven by the three-dimensional structure
of a target site.

STS - Sequence Tagged
Site

STS
Mapping - A physical mapping procedure that locates the positions of
sequence tagged sites (STSs) in a genome.

Structure prediction - Algorithms that predict the secondary, tertiary and sometimes
even quaternary structure of proteins from their sequences. Determining
protein structure from sequence has been dubbed "the second half of the Genetic
Code" since it is the folded tertiary structure of a protein that governs how it
functions as a gene product. As yet most structure prediction methods are only
partially successful, and typically work best for certain well-defined classes
of proteins.

Substitution matrix - A matrix of scores for replacing one residue with
another, derived from a model of protein evolution at the sequence level. Widely
used examples are the Dayhoff or MDM (Mutation Data Matrix) matrices, better
known as PAM (Point Accepted Mutation, often glossed as Percent Accepted
Mutation), and the BLOSUM matrices. BLOSUM matrices are derived from conserved,
ungapped blocks of aligned sequences. PAM matrices are derived from global
alignments of closely related sequences; matrices for greater evolutionary
distances are extrapolated from those for lesser ones.
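
The extrapolation step can be pictured as repeated multiplication of a one-step mutation probability matrix by itself. The sketch below uses an invented two-state matrix rather than real 20 x 20 amino acid data, and it stops at probabilities; published matrices go on to convert these into log-odds scores.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(step, n):
    """Raise a one-step mutation probability matrix to the n-th power."""
    size = len(step)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, step)
    return result


# Toy matrix: row i gives the probabilities that state i stays or changes in one step.
toy_step = [[0.99, 0.01],
            [0.02, 0.98]]

far = extrapolate(toy_step, 100)
print(far)
```

Each row still sums to 1, but the chance of a residue remaining unchanged falls as the evolutionary distance grows.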

Subtraction library - A cDNA library that only contains cDNAs uniquely expressed in a
given cell or tissue. For example, T cells and B cells will express many common RNAs, as
well as a very small percentage which will be unique for T cells and B cells
respectively. To make a T cell subtraction library, the cDNA from a T cell
library is hybridized with a vast excess of B cell RNA. The commonly expressed
genes will result in RNA-cDNA hybrids which can be removed (or subtracted) to
leave only T cell specific cDNAs.

Superfamily -
A collection of genes, all products of
gene duplication, that have diverged from each other to a considerable extent
(in protein-coding genes, usually a similarity of less than 50% at the amino
acid level).

SWISS INSTITUTE OF
BIOINFORMATICS (SIB, ISB) - The Swiss Institute of
Bioinformatics operates three Web servers: the ExPASy proteomics server, the
Swiss node of EMBnet, and "Protéine à la Une".

SWISS-PROT - A non-redundant protein
sequence database

TANDEM REPEAT
SEQUENCES - Multiple copies of the same base
sequence on a chromosome; used as a marker in physical
mapping.

Tentative Consensus (TC) - The identification of a sequence from an EST cluster that
represents part or all of a complete gene. TCs are usually determined by
clustering ESTs allowing for sequencing errors, artefacts such as chimeric
clones, and naturally occurring biological phenomena such as alternative
splicing. Creation of a cluster allows one to generate a consensus sequence and
then identify a long open reading frame which would suggest the possibility of
that consensus representing a bona fide gene.
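
A minimal sketch of the consensus step, assuming the ESTs in a cluster have already been aligned and padded with "-" where a read does not cover a column. The function name is hypothetical, and a simple majority vote stands in for the error models that real assemblers use.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over pre-aligned reads of equal length."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        # Ties are broken by whichever base was seen first in the column.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


reads = [
    "ACGTAC--",
    "ACGAACGT",   # carries one sequencing error at the fourth base
    "--GTACGT",
]
print(consensus(reads))
```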

Tentative Human Consensus sequences
(THCs) - A consensus sequence generated from
human EST fragments. THCs may be validated by comparison against databases of
known human gene sequences, human genomic sequences, or by identification of the
ORFs or other sequence features contained within the consensus as belonging to a
known human gene product.

Tertiary structure - Folding of a protein chain via
interactions of its side chains, including formation of disulphide bonds
between cysteine residues.

TIGR - The Institute for Genomic Research

Traceback - The second stage of a
dynamic programming alignment, in which the alignment is found by following the
pointers back from the highest scoring position in the score matrix.
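
A sketch of both stages for a local alignment, under an assumed scoring scheme (match +2, mismatch -1, gap -2) and with a hypothetical function name. Pointers are stored while the matrix is filled, then followed back from the highest scoring cell until a cell with no pointer is reached.

```python
def local_align(s, t, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of s, aligned part of t)."""
    rows, cols = len(s) + 1, len(t) + 1
    H = [[0] * cols for _ in range(rows)]
    ptr = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Stage 1: fill the score matrix and remember where each score came from.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            choices = [
                (0, None),                        # start a new local alignment
                (H[i - 1][j - 1] + sub, "diag"),
                (H[i - 1][j] + gap, "up"),
                (H[i][j - 1] + gap, "left"),
            ]
            H[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Stage 2: traceback from the highest scoring cell.
    i, j = best_pos
    top, bottom = [], []
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("TTACGTAA", "GGACGTCC"))
```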

TrEMBL - Translated EMBL, a
computer-annotated protein database built from translations of the coding
sequences in the EMBL DNA data library.

Transmembrane region - The region of a transmembrane protein that actually spans the
membrane. Transmembrane regions are usually hydrophobic in order to be
thermodynamically compatible with the lipid bilayer portion of the membrane.
They may consist of either alpha-helical or beta-strand secondary structure
elements, but in either case the external residues (the ones facing the
membrane) are invariably hydrophobic while the internal residues may be
hydrophilic (as in the case of a pore or channel) or polar. One common
transmembrane structural domain is the seven-helix bundle seen in numerous
channel proteins.
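
Because transmembrane regions are hydrophobic, a common first look for them is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of about 1.6 are values often quoted for helices and are used here as assumptions, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy values (higher means more hydrophobic).
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_profile(seq, window=19):
    """Average hydropathy of every window of the given length along seq."""
    return [
        sum(KD[residue] for residue in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Invented sequence: charged flanks around a 19-residue hydrophobic stretch.
toy = "DEKR" * 3 + "LIVALIVALIVALIVALIV" + "DEKR" * 3
profile = hydropathy_profile(toy)
peak = max(profile)
print(profile.index(peak), round(peak, 2))
```

Windows whose average exceeds the cutoff are candidates only; beta-strand transmembrane regions are not found this way.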

Unidentified reading frame
(URF) - An open reading frame encoding a protein
of undefined function.

UniProt - A project to create a central database of protein sequence and
function by joining the forces of the Swiss-Prot, TrEMBL and PIR protein
database activities.

Unrooted
Tree - A phylogenetic tree that
illustrates relationships between the organisms or DNA sequences being studied,
without providing information about the past evolutionary events that have
occurred.

Variable numbers of tandem repeats
(VNTRs) - DNA sequence
blocks of 2-60 base pairs which are repeated from two to more than 20 times in
different individuals. This polymorphism makes VNTRs very useful DNA markers
used in genomic mapping, linkage analysis and also DNA
fingerprinting.
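
As an illustration, the repeat number at a VNTR locus can be counted with a regular expression that matches consecutive copies of the repeat unit. The function name, the unit GATA and the two alleles are invented for the example.

```python
import re


def longest_tandem_run(seq, unit):
    """Return the largest number of consecutive copies of unit found in seq."""
    runs = re.finditer("(?:" + re.escape(unit) + ")+", seq)
    return max((len(m.group()) // len(unit) for m in runs), default=0)


individual_1 = "TTGATAGATAGATACC"   # three copies
individual_2 = "TTGATAGATACC"       # two copies
print(longest_tandem_run(individual_1, "GATA"),
      longest_tandem_run(individual_2, "GATA"))
```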

Variation (genetic) - Variation in genetic sequences and the detection of DNA sequence
variants genome-wide allow studies relating the distribution of sequence
variation to a population history. This in turn allows one to determine the
density of SNPs or other markers needed for gene mapping studies. Quantitation
of these variations together with analytical tools for studying sequence
variation also relate genetic variations to phenotype.

Virtual libraries - The creation and storage of vast collections of molecular
structures in an electronic database. These databases may be queried for subsets
that exhibit specific physicochemical features, or may be "virtually screened"
for their ability to bind a drug target. This process may be performed prior to
the synthesis and testing of the molecules themselves.

Visualization - Visualization is the process of representing abstract
scientific data as images that can aid in understanding the meaning of the
data.

Weight matrix - The frequency of each element at each position of a set of
known binding sites can be compared with its background frequency to give a
ratio for each element in a pattern of interest. These individual ratios are
then combined into a scoring profile known as a weight matrix. The profile can
be used to score candidate sequences, and so to test how well the algorithm
discriminates true occurrences of the pattern from non-pattern
sequences.
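
A minimal sketch of building and using such a profile for DNA, assuming a uniform background frequency of 0.25 and a pseudocount of 1 so that unseen bases do not give infinite scores. The function names and example sites are invented.

```python
import math


def build_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds weights for a candidate sequence."""
    return sum(column[base] for column, base in zip(pwm, seq))


sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(sites)
print(score(pwm, "TATAAT"), score(pwm, "GCGCGC"))
```

Sequences resembling the known sites score above zero, while unrelated sequences score below it.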

Western blot - Technique in which specific antibodies are used to identify
their antigens from a mixture of proteins. Typically, these protein mixtures
are first separated by electrophoresis and then transferred onto nylon sheets by
electrotransfer. Radiolabeled or enzyme-linked antibodies are incubated with the
sheets and unbound antibodies washed away allowing the position of the bound
antibody to be revealed by autoradiography or color which is formed upon
addition of a substrate.

X chromosome
- In mammals, the sex chromosome that is found in
two copies in the homogametic sex (female in humans) and one copy in the
heterogametic sex (male in humans).

X-Ray
Crystallography - A technique for determining the three-dimensional structure of a
large molecule, such as a protein.

Yeast 2-hybrid system - A
yeast-based method used to simultaneously identify, and clone the gene for,
proteins interacting with a known protein. The basis of this method is a
"transcriptional reporter assay" (see definition) in which reporter gene
expression is dependent on two domains. The first domain is linked to the known
protein. The second domain is genetically linked to a library. If the library is
screened against the known protein the two domains will interact only if a
protein from the library binds the known protein, resulting in transcription
activation of the reporter gene, and a blue color. The "blue yeast clone" will
contain the gene encoding the newly identified protein.

Z

Z-DNA - A
conformation of DNA existing as a left-handed double helix (the phosphate-sugar
backbone forms a left-handed zig-zag course), which may play a role in gene
regulation.

Zinc fingers - A protein motif formed by the
interaction of repeated cysteine and histidine residues with a zinc ion. The
spacing of the repeats results in finger-like arrangements of the protein loops
formed by this interaction, and these loops interact with DNA. These motifs are typically
found in transcription factors.

Zoology in the Classroom

Thursday, January 22, 2015

A - Z Bioinformatics - Glossary
