Biological Databases – Types of Data and Databases
Nucleotide Sequence Databases (EMBL, GenBank, DDBJ)

1. Introduction

Biological databases are systematic, computerized collections of biological information that allow efficient storage, retrieval, updating, and analysis of large volumes of biological data. With the advent of genome sequencing, molecular biology, and bioinformatics, biological databases have become essential tools in biological research.
These databases support studies in genomics, proteomics, evolutionary biology, taxonomy, medicine, agriculture, and biotechnology.

2. Types of Data Stored in Biological Databases
Biological databases store diverse types of biological information, including:

1. Sequence Data
DNA sequences
RNA sequences
Protein sequences

2. Structural Data

Three-dimensional structures of proteins
Nucleic acid structures

3. Functional Data

Gene functions
Enzyme activity
Regulatory elements

4. Genomic Annotation Data

Gene location
Exons, introns
Promoters and regulatory regions
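As a rough illustration of how annotation data is used, the sketch below joins the exons of a gene and drops the intron between them. The gene sequence, the coordinates, and the function name `spliced_transcript` are hypothetical and chosen only for this example. Coordinates are assumed to be 1-based and inclusive, as in sequence database feature tables.

```python
def spliced_transcript(genomic_seq, exons):
    """Join exon segments of a genomic sequence, dropping the introns.

    `exons` is a list of (start, end) pairs in 1-based, inclusive
    coordinates (an assumption matching feature-table conventions).
    """
    pieces = []
    for start, end in sorted(exons):
        # Convert 1-based inclusive coordinates to a Python slice.
        pieces.append(genomic_seq[start - 1:end])
    return "".join(pieces)


# Hypothetical toy gene: exons in upper case, intron in lower case.
gene = "ATGGCC" + "gtaagtttcag" + "TTTTAA"
exon_coords = [(1, 6), (18, 23)]

mrna_like = spliced_transcript(gene, exon_coords)
print(mrna_like)  # ATGGCCTTTTAA
```

Real annotations also record strand and may span several records, so this sketch covers only the simplest forward-strand case.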

5. Expression Data

Transcriptome data
Gene expression profiles

3. Classification of Biological Databases
Based on content and level of data processing, biological databases are classified into:

A. Primary Databases

Contain raw experimental data
Direct submissions from researchers
Minimal annotation
Examples:
GenBank, EMBL, DDBJ, Protein Data Bank (PDB)

B. Secondary Databases

Data derived from primary databases
Highly curated and analyzed
Provide functional and structural annotations
Examples:
UniProt, PROSITE, Pfam, SCOP


C. Composite (Integrated) Databases


Combine information from multiple databases
Reduce redundancy
Provide non-overlapping datasets
Examples:
RefSeq, UniGene, Ensembl
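The classification above can be written down as a small lookup table. This is only a sketch of the three classes using the examples listed in these notes; the names `DatabaseClass`, `DATABASE_CLASSES`, and `databases_in` are made up for the illustration.

```python
from enum import Enum


class DatabaseClass(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    COMPOSITE = "composite"


# Examples exactly as listed in the classification above.
DATABASE_CLASSES = {
    "GenBank": DatabaseClass.PRIMARY,
    "EMBL": DatabaseClass.PRIMARY,
    "DDBJ": DatabaseClass.PRIMARY,
    "PDB": DatabaseClass.PRIMARY,
    "UniProt": DatabaseClass.SECONDARY,
    "PROSITE": DatabaseClass.SECONDARY,
    "Pfam": DatabaseClass.SECONDARY,
    "SCOP": DatabaseClass.SECONDARY,
    "RefSeq": DatabaseClass.COMPOSITE,
    "UniGene": DatabaseClass.COMPOSITE,
    "Ensembl": DatabaseClass.COMPOSITE,
}


def databases_in(db_class):
    """Return the names of all databases in the given class, sorted."""
    return sorted(name for name, c in DATABASE_CLASSES.items() if c is db_class)


print(databases_in(DatabaseClass.PRIMARY))
```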

4. Nucleotide Sequence Databases

Nucleotide sequence databases store DNA and RNA sequences obtained through sequencing experiments. They are essential for gene discovery, genome analysis, comparative genomics, and evolutionary studies.

The three major global nucleotide sequence databases are:
GenBank (USA)
EMBL-ENA (Europe)
DDBJ (Japan)

These databases function under the International Nucleotide Sequence Database Collaboration (INSDC).


5. International Nucleotide Sequence Database Collaboration (INSDC)


INSDC is a global consortium that ensures:
Free and open access to nucleotide sequence data
Daily exchange of data among databases
Uniform data formats and annotation standards


Members of INSDC:
GenBank – NCBI (USA)
EMBL-ENA – EMBL-EBI (Europe)
DDBJ – National Institute of Genetics (Japan)
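The effect of the daily exchange can be pictured with a toy model: each archive receives its own submissions, and after the exchange every archive holds every record. This is a simplified sketch, not how the INSDC pipeline is actually implemented. The accession labels and sequences are hypothetical, and the function `daily_exchange` is invented for the example. It assumes accession numbers never collide between archives.

```python
def daily_exchange(archives):
    """Give every archive a copy of every record held by any archive.

    `archives` maps an archive name to a dict of accession -> sequence.
    Assumes accessions are unique across archives.
    """
    merged = {}
    for records in archives.values():
        merged.update(records)
    # Each archive gets its own independent copy of the merged set.
    return {name: dict(merged) for name in archives}


# Hypothetical submissions received by each archive on one day.
archives = {
    "GenBank": {"ACC_A": "ATGC"},
    "ENA": {"ACC_B": "GGCC"},
    "DDBJ": {"ACC_C": "TTAA"},
}

synced = daily_exchange(archives)
print(sorted(synced["DDBJ"]))  # ['ACC_A', 'ACC_B', 'ACC_C']
```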

6. GenBank
Overview
GenBank is a comprehensive nucleotide sequence database maintained by the National Center for Biotechnology Information (NCBI), USA. It is one of the largest and most widely used biological databases.
Types of Data Stored

Genomic DNA
cDNA and mRNA sequences
ESTs (Expressed Sequence Tags)
Whole genome sequences
Organelle genomes
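GenBank distributes each entry as a text record (a "flat file") with keyword lines such as LOCUS, DEFINITION, ACCESSION, and ORIGIN, ending in `//`. The sketch below reads the accession, definition, and sequence from a heavily simplified record. The record `TOY_RECORD` and its accession are hypothetical, and the parser `parse_flatfile` is a minimal illustration that ignores most real fields (features, references, multi-line definitions). For real work a maintained parser such as Biopython's `SeqIO` would be the usual choice.

```python
# Hypothetical, heavily simplified GenBank-style record (not a real entry).
TOY_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration.
ACCESSION   TOY0001
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flatfile(text):
    """Extract accession, definition, and sequence from one simple record."""
    record = {"accession": None, "definition": None, "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_sequence:
            # Sequence lines: a position number, then blocks of bases.
            record["sequence"] += "".join(line.split()[1:]).upper()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return record


record = parse_flatfile(TOY_RECORD)
print(record["accession"], len(record["sequence"]))  # TOY0001 24
```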


7. EMBL (European Molecular Biology Laboratory Database)

Overview
The EMBL nucleotide database is maintained by the European Bioinformatics Institute (EMBL-EBI) and is now part of the European Nucleotide Archive (ENA).



8. DDBJ (DNA Data Bank of Japan)

Overview
DDBJ is maintained by the National Institute of Genetics (NIG), Japan. Most of its submissions come from researchers in Japan and other Asian countries, but it accepts data from any country and is freely accessible worldwide.
Data Stored
DNA and RNA sequences
Whole genome sequences
Environmental and metagenomic data
Special Features
Uses data formats similar to GenBank and EMBL
Exchanges data daily with other INSDC members
Provides online submission tools
9. Comparison of GenBank, EMBL, and DDBJ
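GenBank – maintained by NCBI (USA)
EMBL-ENA – maintained by EMBL-EBI (Europe)
DDBJ – maintained by the National Institute of Genetics (Japan)

All three hold the same sequence data, kept in step by daily exchange; they differ in their access portals and in the institutions that manage them.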


10. Importance of Nucleotide Sequence Databases
Preserve genetic information
Support genome sequencing projects
Enable gene identification and annotation
Facilitate evolutionary and phylogenetic studies
Assist in medical, agricultural, and environmental research
11. Applications
Comparative genomics
Molecular taxonomy
Gene cloning and primer design
Mutation analysis
Crop improvement and breeding programmes
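Two of these applications, primer design and mutation analysis, start from simple operations on sequences retrieved from the databases. The helpers below are a minimal sketch: the function names and the example sequences are hypothetical. The melting temperature uses the Wallace rule, 2(A+T) + 4(G+C), which is only a rough estimate for short primers. The mutation check assumes the two sequences are already aligned and of equal length.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer):
    """Rough melting temperature in deg C: 2(A+T) + 4(G+C)."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def point_differences(reference, sample):
    """List (position, ref_base, sample_base), 1-based, for aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


primer = "ATGGCCATTG"  # hypothetical primer sequence
print(reverse_complement(primer))             # CAATGGCCAT
print(gc_fraction(primer))                    # 0.5
print(wallace_tm(primer))                     # 30
print(point_differences("ATCGAT", "ATCGGT"))  # [(5, 'A', 'G')]
```

Practical primer design also checks for hairpins, dimers, and off-target binding, which these helpers do not attempt.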
12. Conclusion
Biological databases play a central role in modern biological research. Among them, nucleotide sequence databases such as GenBank, EMBL, and DDBJ are primary repositories that store DNA and RNA sequences. Through the INSDC collaboration, these databases ensure global data sharing, accuracy, and accessibility, making them indispensable resources for genomics, bioinformatics, and biotechnology.



