Information retrieval from databases - search concepts, Tools for searching, homology searching, finding Domain and Functional site homologies

Information Retrieval from Databases

1. Introduction

Information retrieval in bioinformatics refers to the process of extracting relevant biological data (DNA, RNA, protein sequences, structures, or functional information) from databases.
Aim: Identify sequences, functions, or structural features for analysis, comparison, and annotation.
Databases can be primary (raw sequence data) or secondary/derived (annotated, processed data).

2. Search Concepts in Biological Databases

2.1 Types of Searches

Exact Match Search

Returns results only if the query exactly matches database entries.
Useful for known accession numbers or IDs.
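
As a minimal sketch of an exact-match lookup, the Python snippet below indexes a few FASTA records by identifier and returns a sequence only when the query ID matches exactly. The records and IDs (seq1, seq2) are invented for illustration and do not come from any real database.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record ID -> sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def exact_match(records, query_id):
    """Return the sequence for query_id, or None if there is no exact match."""
    return records.get(query_id)


# Hypothetical records, for illustration only.
EXAMPLE_FASTA = """>seq1 hypothetical kinase
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical transporter
MSDNELLKQW
"""

records = parse_fasta(EXAMPLE_FASTA)
print(exact_match(records, "seq1"))  # MKTAYIAKQRQISFVKSHFS
print(exact_match(records, "SEQ1"))  # None: the comparison is case-sensitive
```

The lookup fails on any difference in the identifier, including case, which is why exact-match search suits accession numbers but not free-text queries.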

Pattern/Keyword Search

Searches based on specific motifs, keywords, or annotations.
Example: “kinase domain,” “signal peptide.”

Similarity/Homology Search
Detects sequences similar to the query based on sequence alignment.
Uses substitution (scoring) matrices to assess similarity (e.g., BLOSUM and PAM for proteins).
Useful for identifying homologous genes or proteins.
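
To illustrate how an alignment score measures similarity, here is a small Smith-Waterman local alignment scorer in Python. It is a sketch under simplifying assumptions: a flat match/mismatch score stands in for a real substitution matrix such as BLOSUM62, the gap penalty is linear, and the parameter values are arbitrary.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))         # 8
print(local_alignment_score("TTACGTTT", "GGACGGG"))  # 6 (shared "ACG")
print(local_alignment_score("AAAA", "TTTT"))         # 0
```

Replacing the match/mismatch rule with a lookup in a substitution matrix gives the protein version of the same calculation.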


Complex Query Search

Combines Boolean operators (AND, OR, NOT) to refine results.
Example: “kinase AND human NOT viral.”
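
The Boolean example above can be sketched in Python as a filter over annotation text. The function name, its arguments, and the three entries are hypothetical; real database portals have their own query syntax and field tags.

```python
def matches(annotation, all_of=(), any_of=(), none_of=()):
    """Evaluate a Boolean keyword query against one annotation string."""
    words = set(annotation.lower().split())
    # AND: every required term must be present.
    if any(term.lower() not in words for term in all_of):
        return False
    # OR: at least one optional term must be present (if any were given).
    if any_of and not any(term.lower() in words for term in any_of):
        return False
    # NOT: no excluded term may be present.
    if any(term.lower() in words for term in none_of):
        return False
    return True


# Hypothetical annotations, for illustration only.
entries = {
    "E1": "human protein kinase",
    "E2": "viral kinase human host",
    "E3": "mouse kinase",
}

# Equivalent of: kinase AND human NOT viral
hits = [
    entry_id
    for entry_id, text in entries.items()
    if matches(text, all_of=("kinase", "human"), none_of=("viral",))
]
print(hits)  # ['E1']
```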


2.2 Search Parameters

Query sequence or keyword
Database selection (nucleotide, protein, structural, functional)
Algorithm choice (BLAST, FASTA, PSI-BLAST)
Threshold or cut-off (E-value, score, % identity)
Filters (organism, date, length, sequence type)
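
The way thresholds and filters narrow a result list can be sketched as follows. The Hit fields, the hit values, and the default cut-offs (E-value 1e-5, 30% identity) are assumptions chosen for illustration, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    evalue: float
    percent_identity: float
    organism: str


def filter_hits(hits, max_evalue=1e-5, min_identity=30.0, organism=None):
    """Keep hits that pass every cut-off, most significant (lowest E-value) first."""
    kept = [
        h for h in hits
        if h.evalue <= max_evalue
        and h.percent_identity >= min_identity
        and (organism is None or h.organism == organism)
    ]
    return sorted(kept, key=lambda h: h.evalue)


# Hypothetical search results, for illustration only.
hits = [
    Hit("subj_A", 1e-50, 92.0, "Homo sapiens"),
    Hit("subj_B", 3e-8, 41.5, "Mus musculus"),
    Hit("subj_C", 0.2, 28.0, "Homo sapiens"),
    Hit("subj_D", 1e-12, 55.0, "Homo sapiens"),
]

for h in filter_hits(hits, organism="Homo sapiens"):
    print(h.subject_id, h.evalue, h.percent_identity)
# subj_A 1e-50 92.0
# subj_D 1e-12 55.0
```

A lower E-value means fewer matches of that quality are expected by chance, so tightening max_evalue trades sensitivity for specificity.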


3. Tools for Searching Biological Databases


3.1 Nucleotide Sequence Databases


GenBank (NCBI)
EMBL-Bank / ENA (European Nucleotide Archive, EMBL-EBI)
DDBJ (DNA Data Bank of Japan)
Search Tools:
BLASTN – nucleotide vs nucleotide
FASTA – nucleotide similarity search


3.2 Protein Sequence Databases

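UniProtKB/Swiss-Prot – manually reviewed (curated) protein sequences
UniProtKB/TrEMBL – computationally annotated, unreviewed protein sequences
PIR (Protein Information Resource) – protein sequence resource and member of the UniProt consortium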
Search Tools:
BLASTP – protein vs protein
PSI-BLAST – iterative search for distant homologs
HMMER – profile-based search using hidden Markov models

3.3 Structural Databases

Protein Data Bank (PDB) – 3D protein structures
SCOP / CATH – structural classification of proteins
Search Tools:
BLAST against PDB sequences – sequence search restricted to proteins with known 3D structures
DALI – structural alignment


3.4 Specialized Databases
Pfam – protein families
PROSITE – protein motifs
InterPro – integrated database of protein domains


4. Homology Searching

Homology searching identifies evolutionarily related sequences based on similarity.


4.1 Concept

Homologous sequences: share a common ancestor.
Types:

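Orthologs – homologs that diverged through a speciation event; typically found in different species
Paralogs – homologs that diverged through a gene duplication event; often found within the same genome
Homology suggests, but does not guarantee, similar structure or function.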


4.2 Methods

1. Pairwise Sequence Alignment

Tools: BLAST, FASTA
Reports the alignment score, similarity (% identity), and statistical significance (E-value)
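
Percent identity can be computed directly from an aligned pair, as in the sketch below. Note the assumption about the denominator: this version counts only columns where neither sequence has a gap, while some tools divide by the full alignment length or by the shorter sequence, so values are comparable only under the same convention.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


# Hypothetical alignment: 5 ungapped columns, 4 of them identical.
print(percent_identity("ACGT-A", "ACCTTA"))  # 80.0
```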


2. Multiple Sequence Alignment (MSA)
Tools: Clustal Omega, MUSCLE
Identifies conserved residues and motifs
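
Once an alignment exists, conserved positions can be found by checking each column. The following sketch assumes the alignment has already been produced by a tool such as Clustal Omega or MUSCLE; the three sequences and the threshold parameter are made up for illustration.

```python
from collections import Counter


def conserved_columns(alignment, threshold=1.0, gap="-"):
    """Return 0-based column indices whose most common residue reaches the threshold."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residue, count = Counter(column).most_common(1)[0]
        # Columns dominated by gaps are never reported as conserved.
        if residue != gap and count / len(column) >= threshold:
            conserved.append(index)
    return conserved


# Hypothetical alignment, for illustration only.
msa = [
    "MKTAY-GH",
    "MKSAYLGH",
    "MRTAY-GH",
]
print(conserved_columns(msa))                 # [0, 3, 4, 6, 7]
print(conserved_columns(msa, threshold=0.6))  # [0, 1, 2, 3, 4, 6, 7]
```

Runs of adjacent conserved columns are candidate motifs, although real analyses usually also score conservative substitutions rather than only identical residues.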


3. Profile-based Searching

Uses a position-specific scoring matrix (PSSM) or a profile hidden Markov model (HMM) built from aligned homologs
Tools: PSI-BLAST (PSSM), HMMER (profile HMM)
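
A PSSM can be sketched in a few lines: count residues per column of an ungapped alignment, add pseudocounts, and convert to log-odds scores. This is a simplified illustration, not how PSI-BLAST builds its matrices; it assumes a uniform background frequency, a fixed pseudocount of 1, and no sequence weighting, and the aligned motif instances are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log2-odds scores per column from equal-length, ungapped sequences."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def best_window(pssm, sequence):
    """Slide the PSSM along the sequence; return (0-based start, score) of the best window."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[k][sequence[start + k]] for k in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Hypothetical aligned motif instances, for illustration only.
pssm = build_pssm(["GKS", "GKT", "GRS", "AKS"])
start, score = best_window(pssm, "MLAGKSTV")
print(start, round(score, 2))  # best window starts at index 3 ("GKS")
```

Because each column has its own scores, a residue that is tolerated at one position can be penalized at another, which is what lets profile methods find homologs that a single substitution matrix misses.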

4. Structural Homology
Comparing 3D structures for similarity
Tool: DALI; structural classification databases: CATH, SCOP


5. Finding Domain and Functional Site Homologies


5.1 Protein Domains

Definition: A conserved region of a protein that has a specific function or structure and often folds independently of the rest of the chain.
Examples: kinase domain, zinc finger, SH2 domain.
Domains often determine protein function.


5.2 Domain Databases and Tools

Pfam – HMM-based domain identification
SMART – domains in signaling and extracellular proteins
InterPro – integrates multiple domain databases
PROSITE – motifs and functional sites

5.3 Functional Site Prediction

Active sites, binding sites, or motifs are predicted based on:
Conserved residues across homologs
3D structure information
Known motifs (PROSITE patterns)
Tools:
ScanProsite – motif scanning
MotifScan – identifies functional motifs
CDD (Conserved Domain Database) – identifies domains and key residues
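
Motif scanning with a PROSITE-style pattern can be sketched by translating the pattern into a regular expression. The converter below handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), so treat it as an approximation rather than a full implementation of the PROSITE pattern language. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern; the test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # '<' anchors the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):        # repeat count: x(3) or x(2,4)
            element, _, counts = element[:-1].partition("(")
            repeat = "{" + counts + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # {P}: any residue except P
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [ST] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for every hit, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))       # N[^P][ST][^P]
print(scan("MNGTANSSPLNPSA", "N-{P}-[ST]-{P}"))  # [(2, 'NGTA')]
```

Short patterns like this one match by chance quite often, so a hit is a candidate site that still needs support from conservation or structural evidence.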


5.4 Steps to Identify Domain/Functional Homology
Input protein sequence
Perform sequence similarity search (BLASTP/PSI-BLAST)
Check conserved domains (Pfam, SMART, InterPro)
Predict functional motifs (PROSITE, ScanProsite)
Validate with structure-based tools if available

6. Summary / Workflow for Information Retrieval
1. Define the query (sequence, accession, or keyword)
2. Select the appropriate database (nucleotide, protein, structural)
3. Choose the search algorithm (BLAST, FASTA, HMMER)
4. Adjust parameters (E-value, filters)
5. Analyze results:
          Sequence similarity
          Homology inference
          Domain identification
          Functional site prediction
6. Validate and annotate sequences
7. Optional: Structural or evolutionary analysis


7. Key Points

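Homology searches are generally more reliable than keyword searches for function prediction, because they rest on sequence evidence rather than on how an entry happens to be annotated.
Profile-based methods (PSI-BLAST, HMMER) can detect distant homologs that a single pairwise search may miss.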
Domain and motif identification is essential for functional annotation.
Integrating sequence, domain, and structure information gives robust predictions.
