processIMG.sh
Specialized pipeline for processing the IMG (Integrated Microbial Genomes) database into taxonomically organized sketches. It combines the IMG genome collection into a single file with taxonomic renaming, creates a species-level blacklist, and generates optimized sketch files for rapid genome comparison and taxonomic identification.
Purpose
This pipeline transforms the IMG genome database from JGI into an efficient sketch-based reference system. It processes thousands of microbial genomes, organizes them by taxonomy, creates blacklists to filter uninformative kmers, and generates parallel-loadable sketch files optimized for high-speed genome comparison and taxonomic assignment.
Prerequisites
- System Access: Requires connection to NERSC systems (Genepool, Perlmutter) or similar environments
- IMG Database Access: Access to IMG genome files and metadata
- Metadata Files:
  - IMG_taxonID_ncbiID_fna.txt: Complete IMG collection metadata
  - IMG_taxonID_ncbiID_fna_HQ.txt: High-quality genomes only
- High Memory System: 31GB+ RAM recommended
- Storage: Substantial disk space for intermediate and output files
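The tool and metadata prerequisites above can be checked before a long run. The following is a minimal sketch, assuming the three BBTools scripts are expected on PATH and that the metadata files live in a directory named by a hypothetical IMG_META_DIR variable. The helper names missing_tools and missing_files are illustrative and not part of BBTools.

```shell
# Hypothetical preflight helpers; not part of BBTools.

# Print each named command that is not found on PATH.
missing_tools() {
  local t
  for t in "$@"; do
    command -v "$t" >/dev/null 2>&1 || printf '%s\n' "$t"
  done
}

# Print each named file that is not readable under directory $1.
missing_files() {
  local dir="$1" f
  shift
  for f in "$@"; do
    [ -r "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# IMG_META_DIR is an assumed variable; the real metadata location is site-specific.
missing_tools renameimg.sh sketchblacklist.sh sketch.sh
missing_files "${IMG_META_DIR:-.}" IMG_taxonID_ncbiID_fna.txt IMG_taxonID_ncbiID_fna_HQ.txt
```

Empty output means nothing was found missing. Any printed name is a tool or file to resolve before starting Stage 1.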
Pipeline Stages
Stage 1: Genome Collection and Renaming
# Combine IMG genomes with taxonomic prefixing
time renameimg.sh in=auto imghq out=renamed.fa.gz fastawrap=255 zl=6
Process Details:
- in=auto: Automatically reads IMG metadata from the standardized file location
- imghq: Uses the high-quality genome collection (IMG_taxonID_ncbiID_fna_HQ.txt)
- fastawrap=255: Sets FASTA line width to 255 characters
- Renames contigs by prefixing with IMG ID for unique identification
- Consolidates thousands of genome files into a single manageable file
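To illustrate why prefixing matters when many genomes are merged, here is a toy sketch of the idea, not the renameimg.sh implementation. The helper combine_genomes, the two-column "id, path" input, and the "<id>_" header format are all assumptions made for the example. The actual header format written by renameimg.sh may differ.

```shell
# Hypothetical helper, not part of BBTools: read "id<TAB>path" pairs on
# stdin and write one combined FASTA to stdout, prefixing each header
# with its genome's id. The "<id>_" prefix format is an assumption.
combine_genomes() {
  local id path
  while IFS="$(printf '\t')" read -r id path; do
    awk -v id="$id" '/^>/ { print ">" id "_" substr($0, 2); next } { print }' "$path"
  done
}

# Toy demo: two genomes that both contain a contig named "c1".
demo=$(mktemp -d)
printf '>c1\nACGT\n' > "$demo/a.fna"
printf '>c1\nGGCC\n' > "$demo/b.fna"
printf '111\t%s\n222\t%s\n' "$demo/a.fna" "$demo/b.fna" | combine_genomes
```

Without the prefix, both contigs would appear as "c1" in the combined file. With it, each header stays unique and traceable to its source genome.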
Stage 2: Species-Level Blacklist Generation
# Create blacklist of uninformative kmers
time sketchblacklist.sh -Xmx31g in=renamed.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_img_species_300.sketch mincount=300 k=31,24 imghq
Blacklist Purpose:
- mincount=300: Blacklists kmers occurring in 300+ different species
- taxlevel=species: Operates at species taxonomic level
- k=31,24: Dual kmer sizes for comprehensive coverage
- tree=auto: Automatically loads taxonomic tree information
- Improves query speed by removing ubiquitous, uninformative kmers
- Reduces false positives in taxonomic assignment
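The mincount rule can be pictured with a toy example. This is only a sketch of the counting idea on plain-text "kmer species" pairs. sketchblacklist.sh itself works on real sequence data with the taxonomic tree, and the helper name blacklist_kmers is hypothetical.

```shell
# Toy illustration of the mincount rule; not how sketchblacklist.sh is implemented.
# Input: lines of "kmer species", one observation per line.
# Output: kmers seen in at least $1 distinct species, sorted.
blacklist_kmers() {
  # sort -u collapses repeated (kmer, species) pairs so each species counts once.
  sort -u | awk -v min="$1" '
    { n[$1]++ }
    END { for (k in n) if (n[k] >= min) print k }
  ' | sort
}

printf '%s\n' \
  'ACGT sp1' 'ACGT sp2' 'ACGT sp3' \
  'TTGA sp1' 'TTGA sp1' \
  'GGCC sp2' 'GGCC sp3' | blacklist_kmers 3
```

With a threshold of 3, only ACGT is reported: it occurs in three species, while TTGA occurs twice in the same species and GGCC in only two.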
Stage 3: Sketch Generation with Parallel Loading
# Generate sketches distributed across 31 files
time sketch.sh -Xmx31g in=renamed.fa.gz out=img#.sketch files=31 mode=img tree=auto img=auto gi=null ow blacklist=blacklist_img_species_300.sketch k=31,24 imghq
Sketch Configuration:
- files=31: Creates img0.sketch through img30.sketch for parallel loading
- mode=img: IMG-specific sketch mode with appropriate parameters
- img=auto: Automatic IMG metadata integration
- gi=null: Disables GI number processing (not needed for IMG)
- blacklist=: Applies species-level blacklist for quality improvement
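Since files=31 with out=img#.sketch yields img0.sketch through img30.sketch, a quick check after Stage 3 is to confirm that every expected file exists and is non-empty. The helpers below are a hypothetical sketch of such a check and are not part of BBTools.

```shell
# Hypothetical post-run check; not part of BBTools.

# Print the names obtained by replacing "#" in a pattern with 0..n-1.
expand_pattern() {
  local prefix="${1%%#*}" suffix="${1#*#}" n="$2" i=0
  while [ "$i" -lt "$n" ]; do
    printf '%s%s%s\n' "$prefix" "$i" "$suffix"
    i=$((i + 1))
  done
}

# Print every expected file that is missing or empty in the current directory.
missing_sketches() {
  local f
  expand_pattern "$1" "$2" | while read -r f; do
    [ -s "$f" ] || printf '%s\n' "$f"
  done
}

missing_sketches 'img#.sketch' 31
```

Run from the output directory, this prints nothing when all 31 files are present.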
Output Files
Primary Output
- img0.sketch through img30.sketch: 31 parallel-loadable sketch files containing IMG genomes
- renamed.fa.gz: Consolidated and renamed IMG genome collection
- blacklist_img_species_300.sketch: Species-level kmer blacklist for quality filtering
File Organization Benefits
- Parallel Loading: 31 files enable simultaneous loading by CompareSketch
- Balanced Distribution: Even distribution of genomes across sketch files
- Memory Efficiency: Reduces peak memory usage during queries
- I/O Optimization: Parallel file access improves query performance
Usage Examples
Running the Pipeline
# Execute complete IMG processing pipeline
time ./processIMG.sh
# Monitor progress (commands are prefixed with 'time')
# Stage 1: Genome renaming and consolidation
# Stage 2: Blacklist generation
# Stage 3: Sketch file creation
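When adapting the pipeline, one option is to wrap each stage so that its elapsed time and exit status are logged and a failed stage stops the run. The following is a minimal sketch under that assumption. run_stage is a hypothetical helper, and the `true` commands are stand-ins for the real stage commands.

```shell
# Hypothetical wrapper; not part of processIMG.sh.
# Run one named stage, log its wall-clock time and exit status to stderr,
# and return the command's exit status to the caller.
run_stage() {
  local name="$1" start status
  shift
  start=$(date +%s)
  echo "[$name] starting: $*" >&2
  "$@"
  status=$?
  echo "[$name] exit=$status elapsed=$(( $(date +%s) - start ))s" >&2
  return "$status"
}

# Stand-in commands; a real run would pass the renameimg.sh,
# sketchblacklist.sh, and sketch.sh command lines instead of "true".
run_stage stage1 true &&
run_stage stage2 true &&
run_stage stage3 true
```

Chaining with && means Stage 2 never starts on an incomplete renamed.fa.gz, and Stage 3 never starts without a finished blacklist.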
Querying Generated Sketches
# Query with specific sketch files
comparesketch.sh in=contigs.fa k=31,24 tree=auto img*.sketch blacklist=blacklist_img_species_300.sketch printimg
# Using default IMG sketches (after NERSC path configuration)
comparesketch.sh in=contigs.fa img tree=auto printimg
NERSC System Integration
# Set default IMG path on NERSC systems
ln -s /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/img/current
# Default usage after path configuration
comparesketch.sh in=sequences.fa img tree=auto printimg
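If a `current` link already exists from an earlier build, a plain `ln -s` will either fail or, when `current` points to a directory, create the new link inside that directory. A common workaround is `ln -sfn`, which replaces the link itself. The sketch below demonstrates this in a scratch directory. All paths are throwaway stand-ins, not the real NERSC locations.

```shell
# Demonstrate repointing a "current" symlink in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/img_old" "$demo/img_new"

ln -s "$demo/img_old" "$demo/current"      # initial link from an earlier build
ln -sfn "$demo/img_new" "$demo/current"    # replace the link itself, not its target's contents

readlink "$demo/current"                   # prints the img_new path
```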
Performance Characteristics
- Memory Requirements: 31GB RAM for optimal performance
- Processing Time: Several hours for complete IMG collection
- Disk I/O: Heavy read/write operations during all stages
- CPU Utilization: CPU-intensive during blacklist generation and sketching
- Query Performance: Optimized sketch structure enables rapid genome comparison
IMG Database Details
- Genome Collection: Thousands of microbial genomes from JGI IMG system
- Quality Tiers: Both complete collection and high-quality subset available
- Taxonomic Coverage: Comprehensive representation across microbial tree of life
- Metadata Integration: IMG IDs, taxonomic IDs, and NCBI IDs linked
- Update Frequency: Periodic updates as IMG database expands
System Requirements
NERSC-Specific Considerations
- File Paths: Hardcoded paths to NERSC IMG collection locations
- Scratch Space: Requires high-performance scratch storage
- Module Loading: May require specific software modules on NERSC
- Batch Submission: Long-running pipeline suitable for batch queue systems
Adaptation for Other Systems
- Modify hardcoded paths to IMG data locations
- Adjust memory allocations based on available RAM
- Update file output paths for local filesystem organization
- Consider local IMG database mirror requirements
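For the memory adjustment, one approach is to derive the -Xmx value from the host's installed RAM instead of hardcoding 31g. The sketch below assumes a Linux host with /proc/meminfo and keeps 85% of total memory for the JVM. Both the 85% figure and the helper name xmx_flag are assumptions, and whether the stages complete with less than 31 GB is not established here.

```shell
# Hypothetical helper; not part of BBTools.
# Turn a memory size in kB into a Java-style -Xmx flag, keeping the given
# percentage (default 85, an assumed headroom) and rounding down to whole GB.
xmx_flag() {
  local kb="$1" pct="${2:-85}"
  echo "-Xmx$(( kb * pct / 100 / 1024 / 1024 ))g"
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "Suggested flag: $(xmx_flag "${total_kb:-0}")"
fi
```

For example, a 32 GiB host (33554432 kB) gives -Xmx27g, which could replace -Xmx31g in the Stage 2 and Stage 3 commands.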
Algorithm Details
- Sketch Technology: Uses MinHash-based sketching for rapid genome comparison
- Dual K-mer Strategy: k=31 favors specificity and k=24 favors sensitivity; using both balances the two
- Taxonomic Integration: Leverages NCBI taxonomy for accurate classification
- Blacklist Optimization: Kmers found in at least mincount taxa (here, 300 species) are treated as uninformative and removed
- Parallel Architecture: Multi-file design enables concurrent processing
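The MinHash idea behind sketching can be shown with a toy example. This is a deliberately simplified sketch, not the BBTools algorithm: it hashes k-mers with cksum, ignores reverse complements, and keeps the smallest hash values. The helper names kmers, toy_sketch, and shared_hashes are hypothetical.

```shell
# Toy bottom-s MinHash; a simplification, not the BBTools implementation.

# List all k-mers of a sequence, one per line.
kmers() {
  awk -v s="$1" -v k="$2" 'BEGIN { for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k) }'
}

# Hash each distinct k-mer with cksum and keep the s smallest distinct hashes.
toy_sketch() {
  local seq="$1" k="$2" size="$3" km
  kmers "$seq" "$k" | sort -u | while read -r km; do
    printf '%s' "$km" | cksum | cut -d ' ' -f 1
  done | sort -n -u | head -n "$size"
}

# Count hash values present in both sketches (each sketch holds distinct values).
shared_hashes() {
  printf '%s\n%s\n' "$1" "$2" | sort | uniq -d | grep -c .
}

a=$(toy_sketch ACGTACGTTGCA 4 5)
b=$(toy_sketch ACGTACGTTGCC 4 5)
echo "shared hashes: $(shared_hashes "$a" "$b") of 5"
```

Two sequences that share most of their k-mers tend to share most of their smallest hashes, so comparing two short sketches approximates comparing the full k-mer sets.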
Related Tools
- renameimg.sh: IMG genome renaming and consolidation
- sketchblacklist.sh: Uninformative kmer identification and blacklisting
- sketch.sh: Genome sketching and reference database creation
- comparesketch.sh: Query sketches against IMG reference database
- sendsketch.sh: Alternative sketch-based identification service