
Information retrieval from databases - search concepts, Tools for searching, homology searching, finding Domain and Functional site homologies

Information Retrieval from Databases

1. Introduction

Information retrieval in bioinformatics refers to the process of extracting relevant biological data (DNA, RNA, protein sequences, structures, or functional information) from databases.
Aim: Identify sequences, functions, or structural features for analysis, comparison, and annotation.
Databases can be primary (raw sequence data) or secondary/derived (annotated, processed data).
2. Search Concepts in Biological Databases

2.1 Types of Searches

Exact Match Search

Returns results only if the query exactly matches database entries.
Useful for known accession numbers or IDs.

Pattern/Keyword Search

Searches based on specific motifs, keywords, or annotations.
Example: “kinase domain,” “signal peptide.”

Similarity/Homology Search
Detects sequences similar to the query based on sequence alignment.
Uses scoring matrices to assess similarity (e.g., BLOSUM, PAM).
Useful for identifying homologous genes or proteins.


Complex Query Search

Combines Boolean operators (AND, OR, NOT) to refine results.
Example: “kinase AND human NOT viral.”
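
A Boolean query like the one above can be sketched as a filter over annotated records. The records, IDs, and field names below are invented purely for illustration; real resources such as NCBI Entrez use their own query syntax and indexed fields.

```python
# Toy records; IDs and annotations are hypothetical, not real database entries.
RECORDS = [
    {"id": "REC1", "description": "serine/threonine kinase", "organism": "human"},
    {"id": "REC2", "description": "tyrosine kinase", "organism": "viral"},
    {"id": "REC3", "description": "zinc finger protein", "organism": "human"},
]

def boolean_search(records, must_have=(), must_not_have=()):
    """Return IDs of records that contain every term in must_have (AND)
    and none of the terms in must_not_have (NOT)."""
    hits = []
    for rec in records:
        text = f"{rec['description']} {rec['organism']}".lower()
        has_all = all(term.lower() in text for term in must_have)
        has_excluded = any(term.lower() in text for term in must_not_have)
        if has_all and not has_excluded:
            hits.append(rec["id"])
    return hits

# Equivalent of "kinase AND human NOT viral"
print(boolean_search(RECORDS, must_have=["kinase", "human"], must_not_have=["viral"]))
```

An OR condition could be added the same way with `any(...)` over a third list of terms.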


2.2 Search Parameters

Query sequence or keyword
Database selection (nucleotide, protein, structural, functional)
Algorithm choice (BLAST, FASTA, PSI-BLAST)
Threshold or cut-off (E-value, score, % identity)
Filters (organism, date, length, sequence type)
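
To make the E-value cut-off concrete, the sketch below uses the standard relation between a bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n is the total database length. It assumes raw lengths; BLAST itself applies effective-length corrections, so real reported values will differ somewhat.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplification: uses raw query and database lengths rather than the
    effective lengths that BLAST computes internally.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

# The same bit score becomes less significant as the database grows.
for db_len in (10**6, 10**9):
    print(db_len, expected_hits(bit_score=40.0, query_len=250, db_len=db_len))
```

This is why an E-value threshold, unlike a raw score, has to be interpreted relative to the size of the database searched.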


3. Tools for Searching Biological Databases


3.1 Nucleotide Sequence Databases


GenBank (NCBI)
EMBL-Bank, now the European Nucleotide Archive (ENA)
DDBJ (DNA Data Bank of Japan)
Search Tools:
BLASTN – nucleotide vs nucleotide
FASTA – heuristic similarity search (works for nucleotide and protein sequences)


3.2 Protein Sequence Databases

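Swiss-Prot (UniProtKB/Swiss-Prot) – manually curated, reviewed protein sequences
TrEMBL (UniProtKB/TrEMBL) – unreviewed, computationally annotated protein sequences
PIR (Protein Information Resource) – protein sequence database, now a member of the UniProt consortium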
Search Tools:
BLASTP – protein vs protein
PSI-BLAST – iterative search for distant homologs
HMMER – profile-based search using hidden Markov models

3.3 Structural Databases

Protein Data Bank (PDB) – 3D protein structures
SCOP / CATH – structural classification of proteins
Search Tools:
BLAST against PDB sequences – sequence search restricted to proteins with known 3D structures
DALI – structural alignment


3.4 Specialized Databases
Pfam – protein families
PROSITE – protein motifs
InterPro – integrated database of protein domains


4. Homology Searching

Homology searching identifies evolutionarily related sequences based on similarity.


4.1 Concept

Homologous sequences: share a common ancestor.
Types:

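Orthologs – homologs that diverged through a speciation event (typically the same gene in different species)
Paralogs – homologs that arose through gene duplication (often found within the same genome)
Homology is inferred from significant sequence similarity and often, though not always, implies similar structure or function.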
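
One common heuristic for proposing orthologs, not described in these notes but widely used, is the reciprocal best hit: gene a in species A and gene b in species B are paired if each is the other's top-scoring hit. The sketch below assumes the best hits have already been computed; the gene names are hypothetical.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    a_to_b: dict mapping each gene in species A to its best hit in species B.
    b_to_a: dict mapping each gene in species B to its best hit in species A.
    """
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)

# Hypothetical best-hit tables from two searches (A vs B, and B vs A).
best_a_to_b = {"A1": "B1", "A2": "B1", "A3": "B3"}
best_b_to_a = {"B1": "A1", "B3": "A2"}
print(reciprocal_best_hits(best_a_to_b, best_b_to_a))
```

Pairs that fail the reciprocal test (such as A2 and A3 here) may be paralogs or more distant relatives and would need phylogenetic analysis to resolve.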


4.2 Methods

1. Pairwise Sequence Alignment

Tools: BLAST, FASTA
Reports an alignment score, percent identity, and an E-value (the number of hits with that score expected by chance)
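
BLAST and FASTA are fast heuristics for local alignment. The exact method they approximate is Smith-Waterman dynamic programming, sketched below. The match, mismatch, and gap values are illustrative assumptions; real protein searches score residue pairs with a substitution matrix such as BLOSUM62 and usually use affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("HEAGAWGHEE", "PAWHEAE"))
```

Only two rows of the matrix are kept because the sketch returns the score alone; recovering the alignment itself would require storing the full matrix and tracing back from the best cell.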


2. Multiple Sequence Alignment (MSA)
Tools: Clustal Omega, MUSCLE
Identifies conserved residues and motifs
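
Once an alignment exists, conserved positions can be found by counting residues column by column. The sketch below assumes the sequences are already aligned (for example by Clustal Omega or MUSCLE) and uses a toy three-sequence alignment invented for illustration.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Return 0-based indices of columns whose most common residue
    reaches the given frequency threshold. Gap-dominated columns are skipped."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        residue, n = counts.most_common(1)[0]
        if residue != "-" and n / len(alignment) >= threshold:
            conserved.append(col)
    return conserved

TOY_MSA = ["MKVLA", "MKILA", "MRVLA"]   # hypothetical aligned sequences
print(conserved_columns(TOY_MSA))        # fully conserved columns only
```

Lowering `threshold` relaxes the definition from strictly invariant columns to columns that are merely dominated by one residue.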


3. Profile-based Searching

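Uses a profile built from an alignment of related sequences: a Position-Specific Scoring Matrix (PSSM) in PSI-BLAST, a profile hidden Markov model in HMMER
Tools: PSI-BLAST, HMMER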
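
The idea behind a PSSM can be illustrated with a minimal log-odds profile built from a gap-free alignment. This is a sketch under simplifying assumptions (uniform background frequencies, a flat pseudocount, no sequence weighting), not PSI-BLAST's actual procedure, and it does not model insertions or deletions the way HMMER's profile HMMs do.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Build a list of per-column dicts mapping residue -> log2-odds score."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_window(pssm, window):
    """Sum the position-specific scores for a window as long as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, window))

profile = build_pssm(["MKVLA", "MKILA", "MRVLA"])   # hypothetical alignment
print(score_window(profile, "MKVLA"), score_window(profile, "GGGGG"))
```

Because each column has its own scores, a substitution tolerated at a variable position costs less than the same substitution at a strictly conserved one, which is what lets profile searches reach more distant homologs than a single pairwise matrix.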

4. Structural Homology
Comparing 3D structures for similarity
Tool: DALI (structural alignment); the CATH and SCOP classifications group proteins by fold and evolutionary relationship


5. Finding Domain and Functional Site Homologies


5.1 Protein Domains

Definition: A conserved region of a protein that typically folds independently and carries a specific function.
Examples: kinase domain, zinc finger, SH2 domain.
Domains often determine protein function.


5.2 Domain Databases and Tools

Pfam – HMM-based domain identification
SMART – domains in signaling and extracellular proteins
InterPro – integrates multiple domain databases
PROSITE – motifs and functional sites

5.3 Functional Site Prediction

Active sites, binding sites, or motifs are predicted based on:
Conserved residues across homologs
3D structure information
Known motifs (PROSITE patterns)
Tools:
ScanProsite – motif scanning
MotifScan – identifies functional motifs
CDD (Conserved Domain Database) – identifies domains and key residues
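
Motif scanning with PROSITE-style patterns can be sketched by translating the pattern into a regular expression. The translator below handles only the common syntax elements (x, [..], {..}, repeat counts, and terminus anchors written as separate characters) and is an illustration, not a replacement for ScanProsite. The test sequence is invented.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence termini
        parts.append(element)
    return "".join(parts)

def scan(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every match, including overlaps."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# N-glycosylation site pattern, scanned against a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("MNGSAANPTA", "N-{P}-[ST]-{P}"))
```

Short patterns like this match frequently by chance, so a hit is a candidate site to be checked against domain context or structure rather than evidence of function on its own.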


5.4 Steps to Identify Domain/Functional Homology
Input protein sequence
Perform sequence similarity search (BLASTP/PSI-BLAST)
Check conserved domains (Pfam, SMART, InterPro)
Predict functional motifs (PROSITE, ScanProsite)
Validate with structure-based tools if available

6. Summary / Workflow for Information Retrieval
1. Define the query (sequence, accession, or keyword)
2. Select the appropriate database (nucleotide, protein, structural)
3. Choose the search algorithm (BLAST, FASTA, HMMER)
4. Adjust parameters (E-value, filters)
5. Analyze results:
          Sequence similarity
          Homology inference
          Domain identification
          Functional site prediction
6. Validate and annotate sequences
7. Optional: Structural or evolutionary analysis
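
The "adjust parameters" and "analyze results" steps can be sketched as a filter over BLAST tabular output (the 12-column format produced by `-outfmt 6`). The sample rows are invented for illustration, and the default thresholds are arbitrary assumptions rather than recommended values.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(tabular_text, max_evalue=1e-5, min_identity=30.0):
    """Return (subject, % identity, E-value) for hits passing both cut-offs,
    sorted from most to least significant."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue, pident = float(hit["evalue"]), float(hit["pident"])
        if evalue <= max_evalue and pident >= min_identity:
            hits.append((hit["sseqid"], pident, evalue))
    return sorted(hits, key=lambda h: h[2])

# Hypothetical results for one query; not real database hits.
SAMPLE = "\n".join([
    "\t".join(["query1", "subjA", "85.0", "200", "30", "0", "1", "200", "5", "204", "2e-50", "180"]),
    "\t".join(["query1", "subjB", "28.5", "150", "100", "7", "10", "159", "20", "170", "0.5", "30.1"]),
    "\t".join(["query1", "subjC", "45.2", "180", "95", "3", "5", "184", "1", "178", "1e-20", "95.3"]),
])
print(filter_hits(SAMPLE))
```

The hits that survive filtering are the ones worth carrying into domain identification and functional site prediction.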


7. Key Points

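Homology searches are generally more reliable than keyword searches for function prediction, because they rest on sequence evidence rather than on how an entry happens to be annotated.
Profile-based methods (PSI-BLAST, HMMER) detect distant homologs that simple pairwise searches can miss.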
Domain and motif identification is essential for functional annotation.
Integrating sequence, domain, and structure information gives robust predictions.
