Information retrieval from databases - search concepts, Tools for searching, homology searching, finding Domain and Functional site homologies

Information Retrieval from Databases

1. Introduction

Information retrieval in bioinformatics refers to the process of extracting relevant biological data (DNA, RNA, protein sequences, structures, or functional information) from databases.

Aim: Identify sequences, functions, or structural features for analysis, comparison, and annotation.

Databases can be primary (raw sequence data) or secondary/derived (annotated, processed data).

2. Search Concepts in Biological Databases

2.1 Types of Searches

Exact Match Search

Returns results only if the query exactly matches database entries.

Useful for known accession numbers or IDs.

Pattern/Keyword Search

Searches based on specific motifs, keywords, or annotations.

Example: “kinase domain,” “signal peptide.”

Similarity/Homology Search

Detects sequences similar to the query based on sequence alignment.

Uses scoring matrices to assess similarity (e.g., BLOSUM, PAM).

Useful for identifying homologous genes or proteins.

Complex Query Search

Combines Boolean operators (AND, OR, NOT) to refine results.

Example: “kinase AND human NOT viral.”
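The Boolean example above can be sketched in a few lines of Python. This is only an illustration of how AND/OR/NOT constraints narrow a result set: the `Record` class, the accession strings, and the descriptions are made up for the example, and real database interfaces use their own query syntax and indexed fields.

```python
from dataclasses import dataclass


@dataclass
class Record:
    accession: str    # hypothetical identifier, not a real accession
    description: str  # free-text annotation line


def matches(record, all_of=(), any_of=(), none_of=()):
    """Return True if the record satisfies the Boolean constraints.

    all_of  -> terms joined by AND
    any_of  -> terms joined by OR (ignored when empty)
    none_of -> terms excluded by NOT
    """
    words = set(record.description.lower().split())
    if any(term.lower() not in words for term in all_of):
        return False
    if any_of and not any(term.lower() in words for term in any_of):
        return False
    if any(term.lower() in words for term in none_of):
        return False
    return True


records = [
    Record("X0001", "human serine/threonine kinase"),
    Record("X0002", "viral kinase expressed in human cells"),
    Record("X0003", "human phosphatase"),
]

# Equivalent of the query: kinase AND human NOT viral
hits = [r.accession for r in records
        if matches(r, all_of=("kinase", "human"), none_of=("viral",))]
print(hits)
```

Only the first toy record passes: the second is excluded by NOT viral, and the third fails the AND kinase condition.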

2.2 Search Parameters

Query sequence or keyword

Database selection (nucleotide, protein, structural, functional)

Algorithm choice (BLAST, FASTA, PSI-BLAST)

Threshold or cut-off (E-value, score, % identity)

Filters (organism, date, length, sequence type)
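To make the threshold parameter concrete, the sketch below converts a bit score into an E-value with the standard relation E = m × n × 2^(−S'), where m is the query length, n is the total database length, and S' is the bit score. It then keeps only hits that pass an E-value and a % identity cut-off. The hit names, scores, lengths, and the cut-off values are invented for illustration, and real BLAST uses effective (edge-corrected) lengths, so its reported E-values will differ somewhat.

```python
from typing import NamedTuple


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


class Hit(NamedTuple):
    subject: str         # hypothetical subject name
    bit_score: float
    pct_identity: float


def filter_hits(hits, query_len, db_len, max_evalue=1e-5, min_identity=30.0):
    """Keep hits that pass both cut-offs; the defaults are illustrative only."""
    kept = []
    for h in hits:
        e = expected_hits(h.bit_score, query_len, db_len)
        if e <= max_evalue and h.pct_identity >= min_identity:
            kept.append((h.subject, e))
    return kept


hits = [
    Hit("subjA", 60.0, 45.0),  # high score, moderate identity
    Hit("subjB", 25.0, 80.0),  # high identity but low score (short match)
    Hit("subjC", 70.0, 22.0),  # significant score but low identity
]
kept = filter_hits(hits, query_len=250, db_len=10_000_000)
for subject, evalue in kept:
    print(f"{subject}\tE = {evalue:.2e}")
```

Note how the second toy hit is rejected despite 80% identity: a short, high-identity match can still be expected by chance in a large database, which is why the E-value is usually the primary cut-off.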

3. Tools for Searching Biological Databases

3.1 Nucleotide Sequence Databases

GenBank (NCBI)

ENA (European Nucleotide Archive, maintained by EMBL-EBI; formerly the EMBL Nucleotide Sequence Database)

DDBJ (DNA Data Bank of Japan)

Search Tools:

BLASTN – nucleotide vs nucleotide

FASTA – nucleotide similarity search

3.2 Protein Sequence Databases

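UniProtKB/Swiss-Prot – manually curated (reviewed) protein sequences

UniProtKB/TrEMBL – computationally annotated, unreviewed protein sequences

PIR (Protein Information Resource) – protein sequence resource, now a member of the UniProt consortium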

Search Tools:

BLASTP – protein vs protein

PSI-BLAST – iterative search for distant homologs

HMMER – profile-based search using hidden Markov models

3.3 Structural Databases

Protein Data Bank (PDB) – 3D protein structures

SCOP / CATH – structural classification of proteins

Search Tools:

BLAST against the PDB sequence set – sequence search restricted to proteins with known 3D structures

DALI – structural alignment

3.4 Specialized Databases

Pfam – protein families

PROSITE – protein motifs

InterPro – integrated database of protein domains

4. Homology Searching

Homology searching identifies evolutionarily related sequences based on similarity.

4.1 Concept

Homologous sequences: share a common ancestor.

Types:

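Orthologs – homologs that diverged through a speciation event (typically the corresponding gene in different species)

Paralogs – homologs that arose through gene duplication (often found within the same genome)

Homology often implies similar structure and, less reliably, similar function.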

4.2 Methods

1. Pairwise Sequence Alignment

Tools: BLAST, FASTA

Reports similarity (% identity, alignment score) and statistical significance (E-value)

2. Multiple Sequence Alignment (MSA)

Tools: Clustal Omega, MUSCLE

Identifies conserved residues and motifs

3. Profile-based Searching

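Uses Position-Specific Scoring Matrices (PSSMs) or profile hidden Markov models (HMMs) built from aligned homologs

Tools: PSI-BLAST (PSSM-based), HMMER (profile HMM-based)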

4. Structural Homology

Comparing 3D structures for similarity

Tools: DALI (structural alignment); CATH and SCOP (structural classification databases)
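To illustrate profile-based searching (method 3 above), here is a minimal sketch of a log-odds PSSM built from a small ungapped alignment and slid along a target sequence. It makes several simplifying assumptions that real tools do not: uniform background frequencies, a flat pseudocount, no sequence weighting, and no gaps. The toy alignment and target sequence are invented for the example. PSI-BLAST and HMMER use considerably more elaborate models.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds PSSM from ungapped, equal-length aligned sequences.

    Assumes uniform background frequencies (a simplification).
    Returns one dict per column mapping residue -> log2 odds score.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[res] for column, res in zip(pssm, segment))


def best_window(pssm, sequence):
    """Return (score, start index) of the best-scoring window.

    Assumes len(sequence) >= len(pssm).
    """
    width = len(pssm)
    scored = [(score(pssm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)


# Toy alignment of four related 4-residue segments (illustrative only).
toy_alignment = ["GKSG", "GKTG", "GKSG", "GRSG"]
pssm = build_pssm(toy_alignment)
best_score, start = best_window(pssm, "MAAGKSGLL")
print(f"best window starts at index {start} with score {best_score:.2f}")
```

Because each column has its own scores, a conserved position (the first G here) contributes more than a variable one, which is what lets profiles pick up distant homologs that a single-sequence query would miss.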

5. Finding Domain and Functional Site Homologies

5.1 Protein Domains

Definition: A conserved region of a protein that folds largely independently and has a specific structure or function.

Examples: kinase domain, zinc finger, SH2 domain.

Domains often determine protein function.

5.2 Domain Databases and Tools

Pfam – HMM-based domain identification

SMART – domains in signaling and extracellular proteins

InterPro – integrates multiple domain databases

PROSITE – motifs and functional sites

5.3 Functional Site Prediction

Active sites, binding sites, or motifs are predicted based on:

Conserved residues across homologs

3D structure information

Known motifs (PROSITE patterns)

Tools:

ScanProsite – motif scanning

MotifScan – identifies functional motifs

CDD (Conserved Domain Database) – identifies domains and key residues
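Motif scanning with known patterns can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below handles the common syntax elements (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` repeats, and `<` / `>` terminal anchors); rarer constructs are not handled. The function names and the test sequence are made up for illustration, and this is not how ScanProsite itself is implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if m:
            element, repeat = m.group(1), "{" + m.group(2) + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element               # single residue or [..] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str):
    """Return (1-based position, matched text), including overlapping hits."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation site pattern: N, not P, S or T, not P.
n_glyc = "N-{P}-[ST]-{P}."
matches = scan(n_glyc, "MKNGSAANPSLNATQ")  # invented test sequence
print(prosite_to_regex(n_glyc))
print(matches)
```

Short patterns like this one match by chance very often, so a hit is a candidate site rather than evidence of function; conservation across homologs and structural context are needed to support it.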

5.4 Steps to Identify Domain/Functional Homology

Input protein sequence

Perform sequence similarity search (BLASTP/PSI-BLAST)

Check conserved domains (Pfam, SMART, InterPro)

Predict functional motifs (PROSITE, ScanProsite)

Validate with structure-based tools if available

6. Summary / Workflow for Information Retrieval

1. Define the query (sequence, accession, or keyword)

2. Select the appropriate database (nucleotide, protein, structural)

3. Choose the search algorithm (BLAST, FASTA, HMMER)

4. Adjust parameters (E-value, filters)

5. Analyze results:

Sequence similarity

Homology inference

Domain identification

Functional site prediction

6. Validate and annotate sequences

7. Optional: Structural or evolutionary analysis

7. Key Points

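Homology searches are generally more reliable than keyword searches for function prediction, because they rest on sequence evidence rather than on the wording of existing annotations.

Profile-based methods (PSI-BLAST, which is iterative, and HMMER) detect distant homologs that pairwise searches can miss.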

Domain and motif identification is essential for functional annotation.

Integrating sequence, domain, and structure information gives robust predictions.
