Information retrieval from databases - search concepts, Tools for searching, homology searching, finding Domain and Functional site homologies

Information Retrieval from Databases

1. Introduction

Information retrieval in bioinformatics refers to the process of extracting relevant biological data (DNA, RNA, protein sequences, structures, or functional information) from databases.
Aim: Identify sequences, functions, or structural features for analysis, comparison, and annotation.
Databases can be primary (raw sequence data) or secondary/derived (annotated, processed data).
2. Search Concepts in Biological Databases

2.1 Types of Searches

Exact Match Search

Returns results only if the query exactly matches database entries.
Useful for known accession numbers or IDs.

Pattern/Keyword Search

Searches based on specific motifs, keywords, or annotations.
Example: “kinase domain,” “signal peptide.”

Similarity/Homology Search
Detects sequences similar to the query based on sequence alignment.
Uses substitution (scoring) matrices such as BLOSUM or PAM to score aligned residues in protein comparisons.
Useful for identifying homologous genes or proteins.


Complex Query Search

Combines Boolean operators (AND, OR, NOT) to refine results.
Example: “kinase AND human NOT viral.”
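
A minimal sketch of how the Boolean query above could be evaluated against text annotations. The records and the `matches` helper are invented for illustration; real database interfaces have their own query syntax and field tags.

```python
# Toy annotation records; IDs and descriptions are invented for illustration.
records = [
    {"id": "REC1", "description": "serine/threonine kinase, human"},
    {"id": "REC2", "description": "tyrosine kinase, viral oncogene"},
    {"id": "REC3", "description": "human hemoglobin subunit beta"},
    {"id": "REC4", "description": "human viral receptor kinase"},
]

def matches(text, required=(), excluded=()):
    """True if text contains every required term and no excluded term."""
    lowered = text.lower()
    has_all = all(term.lower() in lowered for term in required)
    has_excluded = any(term.lower() in lowered for term in excluded)
    return has_all and not has_excluded

# "kinase AND human NOT viral"
hits = [r["id"] for r in records
        if matches(r["description"],
                   required=("kinase", "human"),
                   excluded=("viral",))]
print(hits)
```

Only the record that mentions both required terms and lacks the excluded term is kept. The sketch uses plain substring matching, so it would also match terms embedded in longer words.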


2.2 Search Parameters

Query sequence or keyword
Database selection (nucleotide, protein, structural, functional)
Algorithm choice (BLAST, FASTA, PSI-BLAST)
Threshold or cut-off (E-value, score, % identity)
Filters (organism, date, length, sequence type)
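
To make the E-value cut-off concrete, the sketch below uses the standard relation between a BLAST bit score S' and the expected number of chance hits, E = m × n × 2^(−S'), where m is the query length and n is the database length. The lengths, hit names, and scores are assumptions chosen for illustration, and real BLAST runs use edge-corrected (effective) lengths, so reported E-values will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

query_len = 250        # residues in the query (assumed)
db_len = 1_000_000     # total residues in the database (assumed)
cutoff = 1e-3          # E-value threshold chosen for this example

# (hit name, bit score) pairs, invented for illustration
hits = [("hitA", 60.0), ("hitB", 35.0), ("hitC", 20.0)]

significant = [name for name, bits in hits
               if expected_hits(bits, query_len, db_len) <= cutoff]
print(significant)
```

Because E scales with database size, the same bit score becomes less significant as the database grows, which is one reason to restrict the search to the relevant database or organism.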


3. Tools for Searching Biological Databases


3.1 Nucleotide Sequence Databases


GenBank (NCBI)
ENA (European Nucleotide Archive, EMBL-EBI; formerly the EMBL nucleotide sequence database)
DDBJ (DNA Data Bank of Japan)
Search Tools:
BLASTN – nucleotide vs nucleotide
FASTA – sequence similarity search (accepts nucleotide or protein queries)


3.2 Protein Sequence Databases

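UniProtKB/Swiss-Prot – manually curated (reviewed) protein sequences
UniProtKB/TrEMBL – computationally annotated, unreviewed protein sequences
PIR (Protein Information Resource) – curated protein resource, now part of the UniProt consortium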
Search Tools:
BLASTP – protein vs protein
PSI-BLAST – iterative search for distant homologs
HMMER – profile-based search using hidden Markov models

3.3 Structural Databases

Protein Data Bank (PDB) – 3D protein structures
SCOP / CATH – structural classification of proteins
Search Tools:
BLAST against PDB sequences – sequence search restricted to proteins of known structure
DALI – structural alignment


3.4 Specialized Databases
Pfam – protein families
PROSITE – protein motifs
InterPro – integrated database of protein domains


4. Homology Searching

Homology searching identifies evolutionarily related sequences based on similarity.


4.1 Concept

Homologous sequences: share a common ancestor.
Types:

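Orthologs – homologs that diverged through a speciation event (typically found in different species)
Paralogs – homologs that arose through gene duplication (often found within the same genome)
Homology often implies similar structure and may indicate similar function, but function should be confirmed with additional evidence.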


4.2 Methods

1. Pairwise Sequence Alignment

Tools: BLAST, FASTA
Reports alignment score, % identity, and E-value (the number of hits with that score expected by chance)


2. Multiple Sequence Alignment (MSA)
Tools: Clustal Omega, MUSCLE
Identifies conserved residues and motifs


3. Profile-based Searching

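Uses profiles built from aligned sequences: Position-Specific Scoring Matrices (PSSM) or profile hidden Markov models (HMMs)
Tools: PSI-BLAST (PSSM-based, iterative), HMMER (profile HMMs)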

4. Structural Homology
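Compares 3D structures, which can reveal relationships that sequence similarity alone misses
Tool: DALI (structural alignment); classification databases: CATH, SCOP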
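
The profile idea behind PSSM-based searching can be sketched in a few lines. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes an ungapped toy alignment (invented sequences), a uniform background frequency, and a flat pseudocount, whereas real tools apply sequence weighting and data-derived pseudocounts.

```python
import math
from collections import Counter

# Toy ungapped alignment of a 4-residue motif (invented sequences).
alignment = ["GKST", "GKSS", "GKTT", "AKST"]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)   # uniform background (an assumption)
PSEUDOCOUNT = 1.0                  # flat pseudocount (an assumption)

def build_pssm(seqs):
    """Return one dict per column mapping residue -> log2 odds score."""
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + PSEUDOCOUNT) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return pssm

def score(pssm, window):
    """Sum the column scores for a window as wide as the profile."""
    return sum(col[aa] for col, aa in zip(pssm, window))

def best_window(pssm, seq):
    """Return (score, 0-based start) of the best-scoring window in seq."""
    width = len(pssm)
    return max((score(pssm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

pssm = build_pssm(alignment)
print(best_window(pssm, "MLGKSTAA"))
```

Fully conserved columns (here the K in the second position) receive the highest scores, so the profile rewards matches at conserved positions more than a general substitution matrix would. This is what lets profile methods pick up distant homologs.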


5. Finding Domain and Functional Site Homologies


5.1 Protein Domains

Definition: A conserved region of a protein that can fold and function largely independently of the rest of the chain.
Examples: kinase domain, zinc finger, SH2 domain.
Domains often determine protein function.


5.2 Domain Databases and Tools

Pfam – HMM-based domain identification
SMART – domains in signaling and extracellular proteins
InterPro – integrates multiple domain databases
PROSITE – motifs and functional sites

5.3 Functional Site Prediction

Active sites, binding sites, or motifs are predicted based on:
Conserved residues across homologs
3D structure information
Known motifs (PROSITE patterns)
Tools:
ScanProsite – motif scanning
MotifScan – identifies functional motifs
CDD (Conserved Domain Database) – identifies domains and key residues
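
Motif scanning with PROSITE-style patterns can be illustrated by translating a pattern into a regular expression. The sketch below covers only the basic syntax (`-` separators, `x` for any residue, `[...]` allowed residues, `{...}` forbidden residues, `(n)` or `(n,m)` repeats, `<` and `>` terminal anchors); it is not ScanProsite's implementation. The test sequence is invented, and the pattern `N-{P}-[ST]-{P}` is the widely cited N-glycosylation site pattern.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")   # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")                      # any residue
    regex = regex.replace("<", "^").replace(">", "$")    # terminal anchors
    return regex

def scan(pattern, sequence):
    """Return (1-based start, matched text) for all matches, overlaps included."""
    # The lookahead lets overlapping motif occurrences be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

sequence = "MKNGSAANPTLNVTE"   # invented test sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("N-{P}-[ST]-{P}", sequence))
```

Short patterns like this one match by chance very often, so a hit is only a candidate site. Conservation across homologs or structural evidence is needed to support it.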


5.4 Steps to Identify Domain/Functional Homology
Input protein sequence
Perform sequence similarity search (BLASTP/PSI-BLAST)
Check conserved domains (Pfam, SMART, InterPro)
Predict functional motifs (PROSITE, ScanProsite)
Validate with structure-based tools if available

6. Summary / Workflow for Information Retrieval
1. Define the query (sequence, accession, or keyword)
2. Select the appropriate database (nucleotide, protein, structural)
3. Choose the search algorithm (BLAST, FASTA, HMMER)
4. Adjust parameters (E-value, filters)
5. Analyze results:
          Sequence similarity
          Homology inference
          Domain identification
          Functional site prediction
6. Validate and annotate sequences
7. Optional: Structural or evolutionary analysis
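
For the result-analysis part of the workflow, a common starting point is to filter search hits by E-value and % identity. The sketch below assumes the 12-column tabular layout produced by BLAST+ with `-outfmt 6` (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The hit lines and the thresholds are invented for illustration.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Invented example lines in the 12-column tabular layout.
example = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-110\t390\n"
    "query1\tsubjB\t40.00\t180\t100\t4\t10\t185\t22\t200\t2e-30\t120\n"
    "query1\tsubjC\t28.33\t60\t41\t2\t50\t108\t3\t61\t4.5\t25.0\n"
)

def filter_hits(handle, max_evalue=1e-5, min_identity=30.0):
    """Keep hits passing both thresholds, sorted by increasing E-value."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        identity = float(hit["pident"])
        if evalue <= max_evalue and identity >= min_identity:
            hits.append((hit["sseqid"], identity, evalue))
    return sorted(hits, key=lambda h: h[2])

kept = filter_hits(io.StringIO(example))
for subject, identity, evalue in kept:
    print(subject, identity, evalue)
```

The thresholds here are placeholders. Suitable cut-offs depend on database size, sequence length, and how distant the homologs of interest are.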


7. Key Points

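Homology searches are generally more reliable than keyword searches for function prediction, because they rely on the sequence itself rather than on how an entry happens to be annotated.
Profile-based methods (PSI-BLAST, which iterates, and HMMER, which uses profile HMMs) can detect distant homologs that pairwise searches miss.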
Domain and motif identification is essential for functional annotation.
Integrating sequence, domain, and structure information gives robust predictions.
