FetchNt Outer
Wrapper script that runs NCBI NT database fetching and processing under nohup for long-running operations. It provides a convenient way to execute the fetchNt.sh pipeline in the background with proper logging and time tracking.
Overview
The fetchNtOuter.sh script is a simple but essential wrapper that executes the fetchNt.sh pipeline using nohup. This allows the NT database processing to run in the background, making it resistant to network disconnections and providing timing information for the entire process.
What This Script Does
The fetchNtOuter.sh script performs a single operation:
nohup time sh fetchNt.sh
Components Breakdown
- nohup - Ensures the process continues running even if the terminal session is disconnected
- time - Measures and reports the total execution time for the pipeline
- sh fetchNt.sh - Executes the main NT database fetching and processing pipeline
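If you prefer an explicit log file and want the shell prompt back immediately, an equivalent invocation looks like this (the log and PID file names are only examples):
# Run the wrapper in the background with a named log file (file names are examples)
nohup time sh fetchNt.sh > fetchNt.log 2>&1 &
# Record the process ID so the job can be checked later
echo $! > fetchNt.pid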
The fetchNt.sh Pipeline
This wrapper executes the comprehensive fetchNt.sh pipeline, which performs:
1. NT Database Download
Downloads the complete NCBI NT database from the FTP server:
wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
2. Taxonomic Processing
Processes sequences through gi2taxid.sh to add proper taxonomic information (see the sketch after this list):
- Renames sequences with taxonomic IDs
- Handles bad headers (up to 5000)
- Uses parallel compression (32 threads)
- Applies bgzip compression with level 8
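In fetchNt.sh the download and renaming stages run as one streamed command. A minimal sketch, assuming flag spellings from the gi2taxid.sh documentation; the memory limit, thread count, compression level, and bad-header settings come from the list above:
# Stream the NT download directly into gi2taxid.sh (flags are assumptions; check fetchNt.sh for the exact call)
wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | gi2taxid.sh -Xmx1g in=stdin.fa.gz out=renamed.fa.gz pigz=32 zl=8 bgzip maxbadheaders=5000 badheaders=badHeaders.txt ow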
3. Taxonomic Sorting
Sorts sequences by taxonomy using sortbyname.sh (sketched after this list):
- Memory-efficient processing with 96GB allocation
- Filters sequences shorter than 60bp
- Wraps FASTA lines at 1023 characters
- Optimizes for subsequent sketching operations
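Put together, the sorting stage might look like the sketch below; the flag spellings are assumptions, while the memory, wrap, and length values come from the list above:
# Taxonomically sort the renamed database (a sketch; verify flags against fetchNt.sh)
sortbyname.sh -Xmx96g in=renamed.fa.gz out=sorted.fa.gz taxa tree=auto fastawrap=1023 minlen=60 ow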
4. Blacklist Creation
Creates a blacklist of over-represented kmers (sketched after this list):
- Targets kmers occurring in 100+ different genera
- Uses both k=32 and k=24 for comprehensive coverage
- Minimum count threshold of 120 for inclusion
- Essential for reducing false positive matches
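A sketch of the blacklist stage, with flag spellings assumed from sketchblacklist.sh's documented options; the memory limit, mincount, kmer sizes, and genus level come from this document:
# Build the genus-level blacklist of over-represented kmers (a sketch; flags may differ in fetchNt.sh)
sketchblacklist.sh -Xmx31g in=sorted.fa.gz out=blacklist_nt_genus_100.sketch tree=auto taxlevel=genus mincount=120 k=32,24 ow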
5. Sketch Generation
Generates 31 taxonomic sketch files for efficient querying (sketched after this list):
- One sketch per species for optimal organization
- Multiple files enable faster loading on multicore systems
- Auto-sizing based on sequence content
- Applies blacklist to reduce noise
- Includes depth information for coverage analysis
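A sketch of the final sketching stage; the mode, autosize, and blacklist flag spellings are assumptions based on bbsketch.sh's documented options, while the file count, kmer sizes, and minsize value come from this document:
# Generate 31 per-species sketch files with blacklisting and depth (a sketch; confirm flags in fetchNt.sh)
bbsketch.sh -Xmx31g in=sorted.fa.gz out=taxa#.sketch files=31 mode=taxa tree=auto autosize minsize=300 blacklist=blacklist_nt_genus_100.sketch k=32,24 depth ow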
Basic Usage
# Run the complete NT processing pipeline
bash fetchNtOuter.sh
# Or execute directly from the fetch directory
cd pipelines/fetch/
./fetchNtOuter.sh
Prerequisites
System Requirements
- BBTools suite installed and in PATH
- At least 128GB RAM recommended for NT processing
- Several TB of available disk space
- Stable internet connection for database download
- pigz for parallel compression (optional but recommended)
Environment Setup
- Taxonomy data - Must have current BBTools taxonomy files
- TAXPATH - Automatically detected or set manually in fetchNt.sh
- Network access - FTP access to ftp.ncbi.nih.gov
Output Files
The pipeline generates several key files for taxonomic analysis:
Primary Output
- taxa#.sketch - 31 sketch files numbered 0-30, each containing species-level sketches
- blacklist_nt_genus_100.sketch - Blacklist of over-represented kmers
- sorted.fa.gz - Taxonomically sorted NT database
- renamed.fa.gz - NT database with taxonomic IDs added
Intermediate Files
- badHeaders.txt - Log of problematic sequence headers
- nohup.out - Complete log output from the pipeline execution
Using the Generated Sketches
Once the pipeline completes, the generated sketches can be used for taxonomic identification:
Basic Taxonomic Identification
# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch blacklist=blacklist_nt_genus_100.sketch
# Use default NT path (on NERSC systems)
comparesketch.sh in=contigs.fa nt tree=auto
Parameters for NT Comparison
- k=32,24 - Uses both kmer sizes for comprehensive matching
- tree=auto - Automatically loads taxonomic tree
- taxa#.sketch - References all 31 sketch files
- blacklist=... - Applies blacklist to reduce false positives
Performance Characteristics
Execution Time
- Total runtime - Can exceed 70 hours for complete NT processing
- Download phase - Several hours depending on bandwidth
- Processing phase - Most time-consuming due to large dataset size
- Sorting phase - Memory-intensive but parallelized
Resource Requirements
- Memory - Up to 128GB during sorting and sketching phases
- CPU - Benefits from multiple cores (32+ threads recommended)
- Disk space - Several TB for intermediate and final files
- Network - Stable connection essential for initial download
Optimization Features
- Parallel compression - Uses pigz with 32 threads for faster I/O
- Streaming processing - wget pipes directly to gi2taxid.sh
- Memory management - Auto-sizing prevents out-of-memory errors
- Multiple sketch files - Enables faster loading on multicore systems
Background Processing Benefits
Why Use nohup
The nohup wrapper provides several critical advantages for long-running NT processing:
- Session independence - Process continues if SSH session disconnects
- Terminal independence - Can close terminal without stopping the job
- Automatic logging - Output captured to nohup.out file
- Process persistence - Survives network interruptions
Monitoring Long-Running Jobs
# Check if the process is still running
ps aux | grep fetchNt
# Monitor progress via log file
tail -f nohup.out
# Check current stage of processing
ls -la *.fa.gz *.sketch
SLURM Integration
The fetchNt.sh script includes SLURM directives for HPC environments; a rough reconstruction follows the configuration list below:
SLURM Configuration
- Job name - sketch_refseq
- Queue - genepool (NERSC-specific)
- Account - gtrqc
- Nodes - Single node (exclusive)
- Architecture - Haswell processors
- Time limit - 71 hours
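Reconstructed from the list above, the SLURM header of fetchNt.sh would look roughly like this (exact option spellings in the shipped script may differ):
#SBATCH -J sketch_refseq
#SBATCH -q genepool
#SBATCH -A gtrqc
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -C haswell
#SBATCH -t 71:00:00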
Adapting for Other Systems
To use outside of NERSC:
- Modify or remove SLURM directives
- Set TAXPATH to your taxonomy directory location
- Ensure module loading commands match your system
- Adjust memory allocations based on available resources
Error Handling and Recovery
Built-in Safety Features
- set -e - Script stops on first error
- Error tolerance - Allows up to 5000 bad headers
- Overwrite behavior - Uses the ow flag so reruns replace stale output files instead of failing
- Memory management - Auto-sizing prevents memory exhaustion
Common Issues and Solutions
- Network timeouts - Restart the script; because wget streams directly into the pipeline, the download generally starts over rather than resuming
- Out of memory - Reduce memory allocations in fetchNt.sh
- Disk space - Ensure sufficient space before starting
- Taxonomy errors - Update taxonomy data first as noted in comments
Integration with BBTools Ecosystem
Related Tools
The generated sketches integrate with several BBTools utilities:
- comparesketch.sh - Primary tool for comparing sequences to NT
- sendsketch.sh - Online version using JGI servers
- bbsketch.sh - Direct sketch generation tool
- gi2taxid.sh - Taxonomic ID assignment
- sortbyname.sh - Taxonomic sorting utility
Workflow Integration
# Complete NT setup and usage workflow
# 1. Run the outer wrapper (this script)
bash fetchNtOuter.sh
# 2. Wait for completion (many hours)
# 3. Use for taxonomic identification
comparesketch.sh in=assembly.fa nt tree=auto
# 4. Get detailed per-contig results
comparesketch.sh in=assembly.fa nt tree=auto format=2 records=5
Performance Optimization
Hardware Recommendations
- Memory - 128GB+ for processing full NT database
- CPU cores - 32+ cores for optimal parallel compression
- Storage - Fast SSD storage recommended for intermediate files
- Network - High-bandwidth connection for initial download
Scaling Considerations
- Memory scaling - Adjust -Xmx values in fetchNt.sh based on available RAM
- Thread scaling - Modify pigz=32 parameter to match core count
- I/O optimization - Use local fast storage for processing
- Batch processing - Consider processing subsets if full NT is too large
Output and Results
Success Indicators
When the pipeline completes successfully, you should have:
- 31 sketch files - taxa0.sketch through taxa30.sketch
- Blacklist file - blacklist_nt_genus_100.sketch
- Processed database - sorted.fa.gz with taxonomic organization
- Timing information - Complete execution time in nohup.out
File Sizes
Expected approximate file sizes for reference:
- NT download - ~100GB compressed
- Processed database - ~150GB after processing
- Sketch files - ~1GB total for all 31 files
- Blacklist - ~100MB
Monitoring and Logging
Default Output
Since this is a nohup wrapper, all output is automatically captured:
# Check progress
tail -f nohup.out
# Monitor system resources
top -u $USER
# Check current stage
ls -la *.fa.gz *.sketch 2>/dev/null || echo "Still processing..."
Progress Indicators
You can monitor progress by watching for these files to appear:
- renamed.fa.gz - Download and initial processing complete
- sorted.fa.gz - Taxonomic sorting complete
- blacklist_nt_genus_100.sketch - Blacklist generation complete
- taxa0.sketch...taxa30.sketch - Final sketch generation in progress/complete
Best Practices
Before Running
- Update taxonomy - Ensure BBTools taxonomy data is current
- Check disk space - Verify several TB available
- Test network - Confirm FTP access to NCBI servers
- System resources - Ensure exclusive access to processing node
During Execution
- Don't interrupt - Let the complete pipeline finish
- Monitor resources - Watch for memory or disk space issues
- Keep logs - Preserve nohup.out for troubleshooting
- Plan timing - Start during low-usage periods
After Completion
- Validate output - Check that all expected files were generated
- Test sketches - Verify sketches work with comparesketch.sh
- Archive files - Move to permanent storage if needed
- Document location - Note path for future reference
Troubleshooting
Common Problems
- Script exits early
- Check nohup.out for error messages. Most commonly caused by insufficient disk space or memory.
- Download fails
- Network connectivity issues or NCBI server problems. Try restarting; since the download is piped rather than saved to a file, expect it to begin again from the start.
- Out of memory errors
- Reduce memory allocations (-Xmx values) in fetchNt.sh or run on a larger memory system.
- Missing taxonomy data
- Ensure BBTools taxonomy files are present and up to date. Set TAXPATH correctly.
Recovery Procedures
If the pipeline fails partway through:
- Check which stage failed by examining existing output files
- Modify fetchNt.sh to skip completed stages
- Restart from the failed stage
- Consider running stages individually for better control (see the example below)
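For example, if sorted.fa.gz exists but no sketch files do, the download and sorting stages finished and only the blacklist and sketching stages need to be rerun. One possible recovery sequence (file names are examples; the edit to fetchNt.sh is done by hand):
# Confirm which stage outputs already exist
ls -la renamed.fa.gz sorted.fa.gz *.sketch 2>/dev/null
# Comment out the completed wget/gi2taxid and sortbyname lines in fetchNt.sh, then rerun it
nohup time sh fetchNt.sh > resume.log 2>&1 &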
Algorithm Details
NT Database Processing Strategy
The pipeline uses a multi-stage approach optimized for the massive NT database:
- Streaming download - Pipes wget output directly to processing to minimize disk usage
- Taxonomic enrichment - Adds taxonomic information during the download phase
- Taxonomic sorting - Groups sequences by taxonomy for memory-efficient sketching
- Noise reduction - Creates genus-level blacklist to filter common kmers
- Distributed sketching - Splits sketches across multiple files for parallel loading
Memory Management
The pipeline carefully manages memory usage across stages:
- gi2taxid.sh - Limited to 1GB, since header renaming needs little memory during the streaming download
- sortbyname.sh - Uses 96GB for efficient large-scale sorting
- sketchblacklist.sh - Uses 31GB for blacklist generation
- bbsketch.sh - Uses 31GB with auto-sizing for optimal sketch creation
Kmer Strategy
The pipeline employs dual kmer sizes (k=32 and k=24) for comprehensive coverage:
- k=32 - High specificity for precise taxonomic assignment
- k=24 - Increased sensitivity for distantly related sequences
- Blacklisting - Removes kmers present in 100+ genera to reduce noise
- Genus-level filtering - Focuses on meaningful taxonomic distinctions
Advanced Configuration
Customizing for Local Systems
To adapt this pipeline for non-NERSC systems:
- Remove or modify SLURM directives in fetchNt.sh
- Set TAXPATH to your local taxonomy directory
- Adjust memory allocations based on available RAM
- Modify thread counts (pigz parameter) to match your system
- Consider breaking the pipeline into smaller chunks for limited-resource systems
Parameter Tuning
Key parameters that may need adjustment (an illustrative combination follows this list):
- mincount=120
- Threshold for blacklist inclusion. Lower values create more aggressive filtering.
- minsize=300
- Minimum sketch size. Larger values provide better resolution but increase file sizes.
- files=31
- Number of sketch files to create. More files enable better parallelization but complicate management.
- minlen=60
- Minimum sequence length to include. Adjust based on your analysis needs.
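As an illustration only, a smaller system might pair lower memory limits with fewer sketch files and adjusted thresholds; the values below are examples rather than recommendations, and the flag spellings follow the sketches earlier in this document:
# Illustrative overrides for a smaller system (example values, not recommendations)
sketchblacklist.sh -Xmx16g in=sorted.fa.gz out=blacklist_nt_genus_100.sketch tree=auto taxlevel=genus mincount=80 k=32,24 ow
bbsketch.sh -Xmx16g in=sorted.fa.gz out=taxa#.sketch files=16 mode=taxa tree=auto autosize minsize=200 blacklist=blacklist_nt_genus_100.sketch k=32,24 depth ow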