processIMG.sh

Script: processIMG.sh
Author: Brian Bushnell
Last Updated: February 21, 2018
Environment: NERSC/Genepool Systems

Specialized pipeline for processing the IMG (Integrated Microbial Genomes) database into taxonomically organized sketches. It combines the IMG genome collection into a single file with taxonomic renaming, creates a species-level blacklist, and generates optimized sketch files for rapid genome comparison and taxonomic identification.

Purpose

This pipeline transforms the IMG genome database from JGI into an efficient sketch-based reference system. It processes thousands of microbial genomes, organizes them by taxonomy, creates blacklists to filter uninformative kmers, and generates parallel-loadable sketch files optimized for high-speed genome comparison and taxonomic assignment.

Prerequisites

Pipeline Stages

Stage 1: Genome Collection and Renaming

# Combine IMG genomes with taxonomic prefixing
time renameimg.sh in=auto imghq out=renamed.fa.gz fastawrap=255 zl=6
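
A quick sanity check after this stage is to confirm that the combined file is readable and contains sequence records. The sketch below is an illustration only: `count_fasta_records` is a hypothetical helper, not part of BBTools, and it counts FASTA header lines (sequence records), not genomes.

```shell
# Hypothetical helper (not part of BBTools): count FASTA records in a
# gzipped file by counting header lines, which start with '>'.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

# Example, assuming Stage 1 has already written renamed.fa.gz:
# count_fasta_records renamed.fa.gz
```

A count of zero, or a gzip error, suggests the renaming step did not complete.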

Process Details:

Stage 2: Species-Level Blacklist Generation

# Create blacklist of uninformative kmers
time sketchblacklist.sh -Xmx31g in=renamed.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_img_species_300.sketch mincount=300 k=31,24 imghq

Blacklist Purpose:

Stage 3: Sketch Generation with Parallel Loading

# Generate sketches distributed across 31 files
time sketch.sh -Xmx31g in=renamed.fa.gz out=img#.sketch files=31 mode=img tree=auto img=auto gi=null ow blacklist=blacklist_img_species_300.sketch k=31,24 imghq
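
Because the output is spread over 31 files, it can be worth confirming that every shard was written before using the set. The sketch below assumes the `#` in `out=img#.sketch` is replaced by a zero-based index (img0.sketch through img30.sketch); verify that numbering against your own output. `missing_sketches` is a hypothetical helper, not a BBTools command.

```shell
# Hypothetical helper (not part of BBTools): print the name of every
# expected shard that is missing or empty.
# Usage: missing_sketches <pattern-containing-#> <file-count>
# Assumption: shards are numbered from 0 to <file-count> - 1.
missing_sketches() (
    pattern=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Substitute the shard index for the first '#' in the pattern.
        f=$(printf '%s' "$pattern" | sed "s/#/$i/")
        # -s is true only for files that exist and are non-empty.
        [ -s "$f" ] || printf '%s\n' "$f"
        i=$((i + 1))
    done
)

# Example, assuming Stage 3 was run with out=img#.sketch files=31:
# missing_sketches 'img#.sketch' 31
```

No output means all expected shards are present and non-empty.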

Sketch Configuration:

Output Files

Primary Output

File Organization Benefits

Usage Examples

Running the Pipeline

# Execute complete IMG processing pipeline
time ./processIMG.sh

# Monitor progress: each stage is prefixed with 'time', so its runtime is printed when it completes
# Stage 1: Genome renaming and consolidation
# Stage 2: Blacklist generation
# Stage 3: Sketch file creation
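
Since each stage consumes the previous stage's output, a wrapper that stops at the first failure can save a long, wasted run. The following is a minimal sketch, not the contents of processIMG.sh: `run_stage` is a hypothetical helper, and the stage commands in the usage comments are copied from the stages above.

```shell
# Hypothetical helper (not part of BBTools): run one stage, report its
# wall-clock time on stderr, and pass the command's exit status back.
# Usage: run_stage <label> <command> [args...]
run_stage() (
    label=$1
    shift
    start=$(date +%s)
    echo "[$label] starting" >&2
    if "$@"; then
        echo "[$label] finished in $(( $(date +%s) - start ))s" >&2
    else
        status=$?
        echo "[$label] failed with exit status $status" >&2
        exit "$status"
    fi
)

# Example usage, stopping at the first failing stage:
# run_stage rename renameimg.sh in=auto imghq out=renamed.fa.gz fastawrap=255 zl=6 || exit 1
# run_stage blacklist sketchblacklist.sh -Xmx31g in=renamed.fa.gz prepasses=1 tree=auto taxa taxlevel=species ow out=blacklist_img_species_300.sketch mincount=300 k=31,24 imghq || exit 1
# run_stage sketch sketch.sh -Xmx31g in=renamed.fa.gz out=img#.sketch files=31 mode=img tree=auto img=auto gi=null ow blacklist=blacklist_img_species_300.sketch k=31,24 imghq || exit 1
```

The function body runs in a subshell, so its variables do not leak into the calling script.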

Querying Generated Sketches

# Query with specific sketch files
comparesketch.sh in=contigs.fa k=31,24 tree=auto img*.sketch blacklist=blacklist_img_species_300.sketch printimg

# Using default IMG sketches (after NERSC path configuration)
comparesketch.sh in=contigs.fa img tree=auto printimg

NERSC System Integration

# Set default IMG path on NERSC systems
ln -s /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/img/current

# Default usage after path configuration
comparesketch.sh in=sequences.fa img tree=auto printimg
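
Note that a plain `ln -s` does not replace an existing `current` link: it either fails, or, if the link points to a directory, creates the new link inside that directory. The sketch below shows one way to repoint the link on later rebuilds. `update_current_link` is a hypothetical helper, and `-n` is a GNU/BSD extension to `ln`, not a POSIX option.

```shell
# Hypothetical helper (not part of BBTools): point <link> at <target>,
# replacing any symlink that already exists at <link>.
# Usage: update_current_link <target-dir> <link-path>
update_current_link() {
    if [ ! -d "$1" ]; then
        echo "target directory not found: $1" >&2
        return 1
    fi
    # -f replaces an existing link; -n treats an existing link to a
    # directory as the link itself rather than following it.
    ln -sfn "$1" "$2"
}

# Example, using the placeholder path from the command above:
# update_current_link /path/to/new/sketches /global/projectb/sandbox/gaag/bbtools/img/current
```

Checking the target first avoids leaving `current` pointing at a directory that does not exist.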

Performance Characteristics

IMG Database Details

System Requirements

NERSC-Specific Considerations

Adaptation for Other Systems

Algorithm Details

Related Tools