FetchNt Outer
Wrapper script that runs NCBI NT database fetching and processing under nohup for long-running operations. It provides a convenient way to execute the fetchNt.sh pipeline in the background with proper logging and time tracking.
Overview
The fetchNtOuter.sh script is a simple but essential wrapper that executes the fetchNt.sh pipeline using nohup. This allows the NT database processing to run in the background, making it resistant to network disconnections and providing timing information for the entire process.
What This Script Does
The fetchNtOuter.sh script performs a single operation:
nohup time sh fetchNt.sh
Components Breakdown
- nohup - Ensures the process continues running even if the terminal session is disconnected
- time - Measures and reports the total execution time for the pipeline
- sh fetchNt.sh - Executes the main NT database fetching and processing pipeline
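If you prefer an explicit log file and want the shell prompt back immediately, an equivalent invocation looks like this (the log and PID file names are only examples):
# Run the wrapper in the background with a named log file (file names are examples)
nohup time sh fetchNt.sh > fetchNt.log 2>&1 &
# Record the process ID so the job can be checked later
echo $! > fetchNt.pid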
The fetchNt.sh Pipeline
This wrapper executes the comprehensive fetchNt.sh pipeline, which performs:
1. NT Database Download
Downloads the complete NCBI NT database from the FTP server:
wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
2. Taxonomic Processing
Processes sequences through gi2taxid.sh to add proper taxonomic information (see the sketch after this list):
- Renames sequences with taxonomic IDs
- Handles bad headers (up to 5000)
- Uses parallel compression (32 threads)
- Applies bgzip compression with level 8
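In fetchNt.sh the download and renaming stages run as one streamed command. A minimal sketch, assuming flag spellings from the gi2taxid.sh documentation; the memory limit, thread count, compression level, and bad-header settings come from the list above:
# Stream the NT download directly into gi2taxid.sh (flags are assumptions; check fetchNt.sh for the exact call)
wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 zl=8 bgzip maxbadheaders=5000 badheaders=badHeaders.txt ow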
3. Taxonomic Sorting
Sorts sequences by taxonomy using sortbyname.sh (sketched after this list):
- Memory-efficient processing with 96GB allocation
- Filters sequences shorter than 60bp
- Wraps FASTA lines at 1023 characters
- Optimizes for subsequent sketching operations
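Put together, the sorting stage might look like the sketch below; the flag spellings are assumptions, while the memory, wrap, and length values come from the list above:
# Taxonomically sort the renamed database (a sketch; verify flags against fetchNt.sh)
sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz taxa tree=auto fastawrap=1023 minlen=60 ow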
4. Blacklist Creation
Creates a blacklist of over-represented kmers (sketched after this list):
- Targets kmers occurring in 100+ different genera
- Uses both k=32 and k=24 for comprehensive coverage
- Minimum count threshold of 120 for inclusion
- Essential for reducing false positive matches
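A sketch of the blacklist stage, with flag spellings assumed from sketchblacklist.sh's documented options; the memory limit, mincount, kmer sizes, and genus level come from this document:
# Build the genus-level blacklist of over-represented kmers (a sketch; flags may differ in fetchNt.sh)
sketchblacklist.sh -Xmx31g in=sorted.fa.gz out=blacklist_nt_genus_100.sketch tree=auto taxlevel=genus mincount=120 k=32,24 ow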
5. Sketch Generation
Generates 31 taxonomic sketch files for efficient querying (sketched after this list):
- One sketch per species for optimal organization
- Multiple files enable faster loading on multicore systems
- Auto-sizing based on sequence content
- Applies blacklist to reduce noise
- Includes depth information for coverage analysis
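A sketch of the final sketching stage; the mode, autosize, and blacklist flag spellings are assumptions based on bbsketch.sh's documented options, while the file count, kmer sizes, and minsize value come from this document:
# Generate 31 per-species sketch files with blacklisting and depth (a sketch; confirm flags in fetchNt.sh)
bbsketch.sh -Xmx31g in=sorted.fa.gz out=taxa#.sketch files=31 mode=taxa tree=auto autosize minsize=300 blacklist=blacklist_nt_genus_100.sketch k=32,24 depth ow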
Basic Usage
# Run the complete NT processing pipeline
bash fetchNtOuter.sh
# Or execute directly from the fetch directory
cd pipelines/fetch/
./fetchNtOuter.sh
Prerequisites
System Requirements
- BBTools suite installed and in PATH
- At least 128GB RAM recommended for NT processing
- Several TB of available disk space
- Stable internet connection for database download
- pigz for parallel compression (optional but recommended)
Environment Setup
- Taxonomy data - Must have current BBTools taxonomy files
- TAXPATH - Automatically detected or set manually in fetchNt.sh
- Network access - FTP access to ftp.ncbi.nih.gov
Output Files
The pipeline generates several key files for taxonomic analysis:
Primary Output
- taxa#.sketch - 31 sketch files numbered 0-30, each containing species-level sketches
- blacklist_nt_genus_100.sketch - Blacklist of over-represented kmers
- sorted.fa.gz - Taxonomically sorted NT database
- renamed.fa.gz - NT database with taxonomic IDs added
Intermediate Files
- badHeaders.txt - Log of problematic sequence headers
- nohup.out - Complete log output from the pipeline execution
Using the Generated Sketches
Once the pipeline completes, the generated sketches can be used for taxonomic identification:
Basic Taxonomic Identification
# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch blacklist=blacklist_nt_genus_100.sketch
# Use default NT path (on NERSC systems)
comparesketch.sh in=contigs.fa nt tree=auto
Parameters for NT Comparison
- k=32,24 - Uses both kmer sizes for comprehensive matching
- tree=auto - Automatically loads taxonomic tree
- taxa#.sketch - References all 31 sketch files
- blacklist=... - Applies blacklist to reduce false positives
Performance Characteristics
Execution Time
- Total runtime - Can exceed 70 hours for complete NT processing
- Download phase - Several hours depending on bandwidth
- Processing phase - Most time-consuming due to large dataset size
- Sorting phase - Memory-intensive but parallelized
Resource Requirements
- Memory - Up to 128GB during sorting and sketching phases
- CPU - Benefits from multiple cores (32+ threads recommended)
- Disk space - Several TB for intermediate and final files
- Network - Stable connection essential for initial download
Optimization Features
- Parallel compression - Uses pigz with 32 threads for faster I/O
- Streaming processing - wget pipes directly to gi2taxid.sh
- Memory management - Auto-sizing prevents out-of-memory errors
- Multiple sketch files - Enables faster loading on multicore systems
Background Processing Benefits
Why Use nohup
The nohup wrapper provides several critical advantages for long-running NT processing:
- Session independence - Process continues if SSH session disconnects
- Terminal independence - Can close terminal without stopping the job
- Automatic logging - Output captured to nohup.out file
- Process persistence - Survives network interruptions
Monitoring Long-Running Jobs
# Check if the process is still running
ps aux | grep fetchNt
# Monitor progress via log file
tail -f nohup.out
# Check current stage of processing
ls -la *.fa.gz *.sketch
SLURM Integration
The fetchNt.sh script includes SLURM directives for HPC environments; a rough reconstruction follows the configuration list below:
SLURM Configuration
- Job name - sketch_refseq
- Queue - genepool (NERSC-specific)
- Account - gtrqc
- Nodes - Single node (exclusive)
- Architecture - Haswell processors
- Time limit - 71 hours
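Reconstructed from the list above, the SLURM header of fetchNt.sh would look roughly like this (exact option spellings in the shipped script may differ):
#SBATCH -J sketch_refseq
#SBATCH -q genepool
#SBATCH -A gtrqc
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -C haswell
#SBATCH -t 71:00:00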
Adapting for Other Systems
To use outside of NERSC:
- Modify or remove SLURM directives
- Set TAXPATH to your taxonomy directory location
- Ensure module loading commands match your system
- Adjust memory allocations based on available resources
Error Handling and Recovery
Built-in Safety Features
- set -e - Script stops on first error
- Error tolerance - Allows up to 5000 bad headers
- Overwrite behavior - Uses the ow flag so reruns replace stale output files instead of failing
- Memory management - Auto-sizing prevents memory exhaustion
Common Issues and Solutions
- Network timeouts - Restart the script; because wget streams directly into the pipeline, the download generally starts over rather than resuming
- Out of memory - Reduce memory allocations in fetchNt.sh
- Disk space - Ensure sufficient space before starting
- Taxonomy errors - Update taxonomy data first as noted in comments
Integration with BBTools Ecosystem
Related Tools
The generated sketches integrate with several BBTools utilities:
- comparesketch.sh - Primary tool for comparing sequences to NT
- sendsketch.sh - Online version using JGI servers
- bbsketch.sh - Direct sketch generation tool
- gi2taxid.sh - Taxonomic ID assignment
- sortbyname.sh - Taxonomic sorting utility
Workflow Integration
# Complete NT setup and usage workflow
# 1. Run the outer wrapper (this script)
bash fetchNtOuter.sh
# 2. Wait for completion (many hours)
# 3. Use for taxonomic identification
comparesketch.sh in=assembly.fa nt tree=auto
# 4. Get detailed per-contig results
comparesketch.sh in=assembly.fa nt tree=auto format=2 records=5
Performance Optimization
Hardware Recommendations
- Memory - 128GB+ for processing full NT database
- CPU cores - 32+ cores for optimal parallel compression
- Storage - Fast SSD storage recommended for intermediate files
- Network - High-bandwidth connection for initial download
Scaling Considerations
- Memory scaling - Adjust -Xmx values in fetchNt.sh based on available RAM
- Thread scaling - Modify pigz=32 parameter to match core count
- I/O optimization - Use local fast storage for processing
- Batch processing - Consider processing subsets if full NT is too large
Output and Results
Success Indicators
When the pipeline completes successfully, you should have:
- 31 sketch files - taxa0.sketch through taxa30.sketch
- Blacklist file - blacklist_nt_genus_100.sketch
- Processed database - sorted.fa.gz with taxonomic organization
- Timing information - Complete execution time in nohup.out
File Sizes
Expected approximate file sizes for reference:
- NT download - ~100GB compressed
- Processed database - ~150GB after processing
- Sketch files - ~1GB total for all 31 files
- Blacklist - ~100MB
Monitoring and Logging
Default Output
Since this is a nohup wrapper, all output is automatically captured:
# Check progress
tail -f nohup.out
# Monitor system resources
top -u $USER
# Check current stage
ls -la *.fa.gz *.sketch 2>/dev/null || echo "Still processing..."
Progress Indicators
You can monitor progress by watching for these files to appear:
- renamed.fa.gz - Download and initial processing complete
- sorted.fa.gz - Taxonomic sorting complete
- blacklist_nt_genus_100.sketch - Blacklist generation complete
- taxa0.sketch...taxa30.sketch - Final sketch generation in progress/complete
Best Practices
Before Running
- Update taxonomy - Ensure BBTools taxonomy data is current
- Check disk space - Verify several TB available
- Test network - Confirm FTP access to NCBI servers
- System resources - Ensure exclusive access to processing node
During Execution
- Don't interrupt - Let the complete pipeline finish
- Monitor resources - Watch for memory or disk space issues
- Keep logs - Preserve nohup.out for troubleshooting
- Plan timing - Start during low-usage periods
After Completion
- Validate output - Check that all expected files were generated
- Test sketches - Verify sketches work with comparesketch.sh
- Archive files - Move to permanent storage if needed
- Document location - Note path for future reference
Troubleshooting
Common Problems
- Script exits early
- Check nohup.out for error messages. Most commonly caused by insufficient disk space or memory.
- Download fails
- Network connectivity issues or NCBI server problems. Try restarting; since the download is piped rather than saved to a file, expect it to begin again from the start.
- Out of memory errors
- Reduce memory allocations (-Xmx values) in fetchNt.sh or run on a larger memory system.
- Missing taxonomy data
- Ensure BBTools taxonomy files are present and up to date. Set TAXPATH correctly.
Recovery Procedures
If the pipeline fails partway through:
- Check which stage failed by examining existing output files
- Modify fetchNt.sh to skip completed stages
- Restart from the failed stage
- Consider running stages individually for better control (see the example below)
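For example, if sorted.fa.gz exists but no sketch files do, the download and sorting stages finished and only the blacklist and sketching stages need to be rerun. One possible recovery sequence (file names are examples; the edit to fetchNt.sh is done by hand):
# Confirm which stage outputs already exist
ls -la renamed.fa.gz sorted.fa.gz *.sketch 2>/dev/null
# Comment out the completed wget/gi2taxid and sortbyname lines in fetchNt.sh, then rerun it
nohup time sh fetchNt.sh > resume.log 2>&1 &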
Algorithm Details
NT Database Processing Strategy
The pipeline uses a multi-stage approach optimized for the massive NT database:
- Streaming download - Pipes wget output directly to processing to minimize disk usage
- Taxonomic enrichment - Adds taxonomic information during the download phase
- Taxonomic sorting - Groups sequences by taxonomy for memory-efficient sketching
- Noise reduction - Creates genus-level blacklist to filter common kmers
- Distributed sketching - Splits sketches across multiple files for parallel loading
Memory Management
The pipeline carefully manages memory usage across stages:
- gi2taxid.sh - Limited to 1GB, since header renaming needs little memory during the streaming download
- sortbyname.sh - Uses 96GB for efficient large-scale sorting
- sketchblacklist.sh - Uses 31GB for blacklist generation
- bbsketch.sh - Uses 31GB with auto-sizing for optimal sketch creation
Kmer Strategy
The pipeline employs dual kmer sizes (k=32 and k=24) for comprehensive coverage:
- k=32 - High specificity for precise taxonomic assignment
- k=24 - Increased sensitivity for distantly related sequences
- Blacklisting - Removes kmers present in 100+ genera to reduce noise
- Genus-level filtering - Focuses on meaningful taxonomic distinctions
Advanced Configuration
Customizing for Local Systems
To adapt this pipeline for non-NERSC systems:
- Remove or modify SLURM directives in fetchNt.sh
- Set TAXPATH to your local taxonomy directory
- Adjust memory allocations based on available RAM
- Modify thread counts (pigz parameter) to match your system
- Consider breaking the pipeline into smaller chunks for limited-resource systems
Parameter Tuning
Key parameters that may need adjustment (an illustrative combination follows this list):
- mincount=120
- Threshold for blacklist inclusion. Lower values create more aggressive filtering.
- minsize=300
- Minimum sketch size. Larger values provide better resolution but increase file sizes.
- files=31
- Number of sketch files to create. More files enable better parallelization but complicate management.
- minlen=60
- Minimum sequence length to include. Adjust based on your analysis needs.
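As an illustration only, a smaller system might pair lower memory limits with fewer sketch files and adjusted thresholds; the values below are examples rather than recommendations, and the flag spellings follow the sketches earlier in this document:
# Illustrative overrides for a smaller system (example values, not recommendations)
sketchblacklist.sh -Xmx16g in=sorted.fa.gz out=blacklist_nt_genus_100.sketch tree=auto taxlevel=genus mincount=80 k=32,24 ow
bbsketch.sh -Xmx16g in=sorted.fa.gz out=taxa#.sketch files=16 mode=taxa tree=auto autosize minsize=200 blacklist=blacklist_nt_genus_100.sketch k=32,24 depth ow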