FetchNt Outer

Script: fetchNtOuter.sh
Source Directory: pipelines/fetch/
Author: Brian Bushnell

Wrapper script that runs the NCBI NT database fetch-and-process pipeline (fetchNt.sh) under nohup. It keeps the long-running job alive if the terminal session disconnects, captures its output in a log file, and reports the total run time.

Overview

The fetchNtOuter.sh script is a thin wrapper that runs the fetchNt.sh pipeline under nohup and time. nohup makes the job ignore the hangup signal, so NT database processing continues if the terminal or SSH session disconnects, and time reports how long the entire run took. nohup by itself does not detach the job from the shell; to get the prompt back while the pipeline runs, start the wrapper with a trailing &.

Note: This wrapper is designed for long-running operations that may take many hours to complete. The NT database is very large and processing it requires significant time and computational resources.

What This Script Does

The fetchNtOuter.sh script performs a single operation:

nohup time sh fetchNt.sh
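
As a hedged illustration of the same pattern, the helper below runs a command under nohup, sends it to the background with &, and writes its output to a named log instead of the default nohup.out. The function name run_detached and the log file name are hypothetical and are not part of BBTools.

```shell
# Hypothetical helper (not part of BBTools): run a command under nohup,
# in the background, with stdout and stderr captured in a chosen log file.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    # Print the PID of the background job so the caller can track it.
    echo $!
}

# Example (assumes an external time binary such as /usr/bin/time,
# because nohup runs programs rather than shell keywords):
# pid=$(run_detached fetchNt.log time sh fetchNt.sh)
```

The printed PID can later be passed to ps or kill to inspect or stop the job.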

Components Breakdown

The fetchNt.sh Pipeline

This wrapper executes the comprehensive fetchNt.sh pipeline, which performs:

1. NT Database Download

Downloads the complete NCBI NT database from the FTP server:

wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz

2. Taxonomic Processing

Processes sequences through gi2taxid.sh to add proper taxonomic information:

3. Taxonomic Sorting

Sorts sequences by taxonomy using sortbyname.sh:

4. Blacklist Creation

Creates a blacklist of over-represented kmers:

5. Sketch Generation

Generates 31 taxonomic sketch files for efficient querying:

Basic Usage

# Run the complete NT processing pipeline
bash fetchNtOuter.sh

# Or execute directly from the fetch directory
cd pipelines/fetch/
./fetchNtOuter.sh

Prerequisites

System Requirements

Environment Setup

Output Files

The pipeline generates several key files for taxonomic analysis:

Primary Output

Intermediate Files

Using the Generated Sketches

Once the pipeline completes, the generated sketches can be used for taxonomic identification:

Basic Taxonomic Identification

# Compare contigs against NT sketches
comparesketch.sh in=contigs.fa k=32,24 tree=auto taxa#.sketch blacklist=blacklist_nt_genus_100.sketch

# Use default NT path (on NERSC systems)
comparesketch.sh in=contigs.fa nt tree=auto

Parameters for NT Comparison

Performance Characteristics

Execution Time

Resource Requirements

Optimization Features

Background Processing Benefits

Why Use nohup

The nohup wrapper provides several critical advantages for long-running NT processing:

Monitoring Long-Running Jobs

# Check if the process is still running
# (the [f] bracket keeps grep from matching its own command line)
ps aux | grep '[f]etchNt'

# Monitor progress via log file
tail -f nohup.out

# Check current stage of processing
ls -la *.fa.gz *.sketch

SLURM Integration

The fetchNt.sh script includes SLURM directives for HPC environments:

SLURM Configuration

Adapting for Other Systems

To use outside of NERSC:

Error Handling and Recovery

Built-in Safety Features

Common Issues and Solutions

Integration with BBTools Ecosystem

Related Tools

The generated sketches integrate with several BBTools utilities:

Workflow Integration

# Complete NT setup and usage workflow
# 1. Run the outer wrapper (this script)
bash fetchNtOuter.sh

# 2. Wait for completion (many hours)
# 3. Use for taxonomic identification
comparesketch.sh in=assembly.fa nt tree=auto

# 4. Get detailed per-contig results
comparesketch.sh in=assembly.fa nt tree=auto format=2 records=5

Performance Optimization

Hardware Recommendations

Scaling Considerations

Output and Results

Success Indicators

When the pipeline completes successfully, you should have:

File Sizes

Expected approximate file sizes for reference:

Monitoring and Logging

Default Output

Because the wrapper runs under nohup, output is appended to nohup.out in the working directory whenever standard output is a terminal:

# Check progress
tail -f nohup.out

# Monitor system resources
top -u $USER

# Check current stage
ls -la *.fa.gz *.sketch 2>/dev/null || echo "Still processing..."

Progress Indicators

You can monitor progress by watching for these files to appear:

  1. renamed.fa.gz - Download and initial processing complete
  2. sorted.fa.gz - Taxonomic sorting complete
  3. blacklist_nt_genus_100.sketch - Blacklist generation complete
  4. taxa0.sketch through taxa30.sketch - Final sketch generation in progress/complete
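
The file checks above can be wrapped in a small function. This is only a sketch: the function name nt_stage is hypothetical, the file names are taken from the list above, and the presence of a file shows only that a stage has at least started writing it, not that the file is complete.

```shell
# Hypothetical helper: report the latest pipeline output found in a directory.
nt_stage() {
    dir=${1:-.}
    # Count sketch files; the glob stays literal when nothing matches,
    # so each candidate is tested for existence.
    n=0
    for f in "$dir"/taxa*.sketch; do
        [ -e "$f" ] && n=$((n + 1))
    done
    if [ "$n" -gt 0 ]; then
        echo "latest output: $n of 31 sketch files"
    elif [ -e "$dir/blacklist_nt_genus_100.sketch" ]; then
        echo "latest output: blacklist_nt_genus_100.sketch"
    elif [ -e "$dir/sorted.fa.gz" ]; then
        echo "latest output: sorted.fa.gz"
    elif [ -e "$dir/renamed.fa.gz" ]; then
        echo "latest output: renamed.fa.gz"
    else
        echo "not started"
    fi
}

# Example: nt_stage /path/to/working/directory
```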

Best Practices

Before Running

During Execution

After Completion

Troubleshooting

Common Problems

Script exits early
Check nohup.out for error messages. Most commonly caused by insufficient disk space or memory.
Download fails
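Network connectivity issues or NCBI server problems. Restart the pipeline; because wget streams to standard output (-O -) without the -c option, the download starts over from the beginning rather than resuming.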
Out of memory errors
Reduce memory allocations (-Xmx values) in fetchNt.sh or run on a larger memory system.
Missing taxonomy data
Ensure BBTools taxonomy files are present and up to date. Set TAXPATH correctly.
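
Since insufficient disk space is named above as a common cause of early exits, a pre-flight check along these lines may help. It is a minimal sketch: the function name has_free_kb is hypothetical, and the threshold in the example is a placeholder, not a documented requirement of the pipeline.

```shell
# Hypothetical helper: succeed if the filesystem holding a directory
# has at least the requested number of kilobytes available.
has_free_kb() {
    dir=$1
    need_kb=$2
    # df -P gives the portable POSIX layout; column 4 of line 2 is available KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}

# Example with a placeholder threshold of 1 GiB (1048576 KB):
# has_free_kb . 1048576 || echo "Less than 1 GiB free in this directory"
```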

Recovery Procedures

If the pipeline fails partway through:

  1. Check which stage failed by examining existing output files
  2. Modify fetchNt.sh to skip completed stages
  3. Restart from the failed stage
  4. Consider running stages individually for better control
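
One way to approach steps 2 and 3 is to record each finished stage with a marker file and skip any stage whose marker already exists. The helper below is a sketch under that assumption; run_stage and the .stage_NAME.done marker names are introduced here for illustration only.

```shell
# Hypothetical helper: run a named stage once, recording success in a marker file.
run_stage() {
    name=$1
    shift
    marker=".stage_${name}.done"
    if [ -e "$marker" ]; then
        echo "skipping $name (already complete)"
        return 0
    fi
    echo "running $name"
    # The marker is created only if the stage command exits successfully,
    # so a failed stage is retried on the next run.
    "$@" && touch "$marker"
}

# Example with a placeholder command standing in for a real stage:
# run_stage demo sh -c 'echo "stage body goes here"'
```

Using a marker rather than the output file itself avoids treating a partially written file from a failed run as a completed stage.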

Algorithm Details

NT Database Processing Strategy

The pipeline uses a multi-stage approach optimized for the massive NT database:

  1. Streaming download - Pipes wget output directly to processing to minimize disk usage
  2. Taxonomic enrichment - Adds taxonomic information during the download phase
  3. Taxonomic sorting - Groups sequences by taxonomy for memory-efficient sketching
  4. Noise reduction - Creates genus-level blacklist to filter common kmers
  5. Distributed sketching - Splits sketches across multiple files for parallel loading
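
The streaming pattern in item 1 can be illustrated with a much simpler consumer than the real pipeline. The sketch below counts FASTA records in a gzip-compressed stream without writing anything to disk; the function name count_fasta_records is hypothetical and it stands in for the actual processing step, which it does not reproduce.

```shell
# Hypothetical helper: count FASTA records in a gzip-compressed stream
# read from stdin, without storing the compressed or decompressed data.
count_fasta_records() {
    gunzip -c | grep -c '^>'
}

# Example, using the download command shown earlier in this document:
# wget -q -O - ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz | count_fasta_records
```

Because the data only ever passes through a pipe, an interrupted transfer cannot be resumed and has to be restarted.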

Memory Management

The pipeline carefully manages memory usage across stages:

Kmer Strategy

The pipeline employs dual kmer sizes (k=32 and k=24) for comprehensive coverage:

Advanced Configuration

Customizing for Local Systems

To adapt this pipeline for non-NERSC systems:

  1. Remove or modify SLURM directives in fetchNt.sh
  2. Set TAXPATH to your local taxonomy directory
  3. Adjust memory allocations based on available RAM
  4. Modify thread counts (pigz parameter) to match your system
  5. Consider breaking the pipeline into smaller chunks for limited-resource systems
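
Before adjusting memory allocations and thread counts (steps 3 and 4), it can help to see what the local machine offers. The sketch below reports the online CPU count and, on Linux, total RAM; the function name suggest_resources is hypothetical, and how these numbers map to -Xmx and pigz values is left to the reader.

```shell
# Hypothetical helper: print the online CPU count and total RAM in GB.
suggest_resources() {
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ -r /proc/meminfo ]; then
        # MemTotal is reported in kB on Linux.
        mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
        mem_gb=$((mem_kb / 1024 / 1024))
    else
        # /proc/meminfo is Linux-specific; other systems need a different query.
        mem_gb=unknown
    fi
    echo "threads=$cores ram_gb=$mem_gb"
}

suggest_resources
```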

Parameter Tuning

Key parameters that may need adjustment:

mincount=120
Threshold for blacklist inclusion. Lower values create more aggressive filtering.
minsize=300
Minimum sketch size. Larger values provide better resolution but increase file sizes.
files=31
Number of sketch files to create. More files enable better parallelization but complicate management.
minlen=60
Minimum sequence length to include. Adjust based on your analysis needs.