Pileup2
Multi-threaded version of Pileup designed to process multiple SAM/BAM files concurrently for coverage analysis. For single input files, use regular pileup.sh instead. Supports all the same parameters as pileup.sh with additional multi-threading controls.
Basic Usage
pileup2.sh in=<file,file,file> out=<file>
pileup2.sh *.sam out=<file>
Processes multiple SAM/BAM files simultaneously using multiple threads to calculate coverage statistics.
Parameters
Parameters are organized by function. Pileup2 supports all parameters from regular pileup.sh plus multi-threading specific controls.
Input/Output Parameters
- in=<file,file>
- The input SAM/BAM files, comma-separated. Omit the 'in=' if using wildcards (e.g., *.sam).
- out=<file>
- Primary output file for coverage statistics. Can use '#' symbol for strand-specific output when stranded=true.
- ref=<file>
- Reference fasta file for calculating GC content and adding scaffold information.
- basecov=<file>
- Write per-base coverage to file. Use '#' in filename for strand-specific output.
- bincov=<file>
- Write binned coverage to file. Use '#' in filename for strand-specific output.
- normcov=<file>
- Write length-normalized coverage to file. Use '#' in filename for strand-specific output.
- normcovOverall=<file>
- Write overall normalized coverage combining all scaffolds. Use '#' in filename for strand-specific output.
- hist=<file>
- Write coverage histogram to file. Use '#' in filename for strand-specific output.
- rpkm=<file>
- Write RPKM/FPKM statistics to file.
- outorf=<file>
- Write ORF coverage statistics to file.
- orffasta=<file>
- Input ORF fasta file for calculating ORF-specific coverage.
Multi-threading Parameters
- streams=-1
- Maximum number of concurrent file-processing streams. -1 (the default) uses half the number of logical processors. Each stream uses multiple threads internally.
- atomic=false
- Use atomic arrays instead of locks for thread synchronization. Can improve performance but uses more memory.
- prealloc=false
- Preallocate coverage arrays instead of creating them as needed. Reduces memory allocation overhead for large references.
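The interaction between the streams setting, processor count, and input file count can be sketched as follows. This is an illustrative approximation, not the BBTools source; the mid() helper mimics the median-of-three clamp that Tools.mid() is described as performing in the Algorithm Details section:

```java
// Sketch (hypothetical, not the actual Pileup2 code) of stream-count
// selection: clamp the requested value between 1 and the input file
// count, defaulting to half the logical processors when streams=-1.
public class StreamCount {
    // Median of three values, analogous to Tools.mid().
    static int mid(int a, int b, int c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    static int chooseStreams(int requested, int logicalProcessors, int fileCount) {
        int def = Math.max(1, logicalProcessors / 2); // streams=-1 default
        int target = (requested < 1 ? def : requested);
        // Never run more streams than there are input files.
        return mid(1, target, fileCount);
    }
}
```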
Coverage Calculation Parameters
- mindepth=1
- Minimum depth for a base to be considered covered.
- minscaf=0
- Ignore scaffolds shorter than this length.
- binsize=1000
- Bin size in base pairs for binned coverage output.
- 32bit=false
- Use 32-bit arrays instead of 16-bit arrays for coverage counting. Allows coverage depths above 65,535, the 16-bit limit.
- histmax=-1
- Maximum coverage depth for histogram. Default uses maximum possible value for array type.
- reads=-1
- Process at most this many reads. -1 means process all reads.
- scafs=4096
- Initial estimate of scaffold count for memory allocation optimization.
Filtering Parameters
- minq=0
- Minimum mapping quality score for reads to be included in coverage calculation.
- secondary=true
- Include secondary alignments in coverage calculation.
- softclip=false
- Include soft-clipped bases in coverage calculation.
- dupecov=true
- Include reads flagged as duplicates in coverage calculation.
- delcov=true
- Include deletions in coverage calculation.
Quality Trimming Parameters
- qtrim=null
- Quality trimming mode: 'r' (right), 'l' (left), 'rl' (both), or numeric value for trimq threshold.
- trimq=-1
- Quality threshold for trimming. Bases with quality below this value are trimmed.
- border=0
- Trim this many bases from read ends near scaffold boundaries.
Coverage Mode Parameters
- physical=false
- Calculate physical coverage including unsequenced middle portion of paired reads.
- tlen=true
- Use TLEN field for physical coverage calculation when available.
- stranded=false
- Calculate coverage separately for each strand. Requires '#' symbol in output filenames.
- startcov=false
- Only track coverage at read start positions.
- stopcov=false
- Only track coverage at read stop positions.
- delta=false
- Only output positions where coverage changes (delta-only mode).
Data Structure Parameters
- bitset=auto
- Use bitsets for coverage tracking instead of arrays. Automatically determined based on output needs.
- array=auto
- Use arrays for coverage tracking. Automatically determined based on output needs.
- median=false
- Calculate median coverage (forces array mode).
Output Format Parameters
- nonzero=false
- Only output scaffolds/positions with non-zero coverage.
- twocolumn=false
- Output in simple two-column format: ID and average coverage.
- keyvalue=false
- Output in machine-readable key=value format.
- concise=false
- Use concise output format, skipping zero-coverage positions.
- header=true
- Include header lines in output files.
- headerpound=true
- Prefix header lines with '#' symbol.
- keepshortbins=true
- Include the last bin of each scaffold even if shorter than binsize.
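The binning arithmetic behind binsize and keepshortbins can be illustrated with a small sketch (a hypothetical helper, not part of Pileup2). With binsize=1000, a 2500 bp scaffold yields bins of 1000, 1000, and 500 bp; keepshortbins=true keeps that short final bin:

```java
// Sketch of how binned-coverage bin counts follow from binsize and
// keepshortbins (illustrative only, not the Pileup2 implementation).
public class Binning {
    static int binCount(int scaffoldLength, int binsize, boolean keepShort) {
        int full = scaffoldLength / binsize;       // complete bins
        int rem = scaffoldLength % binsize;        // leftover bases
        return full + ((rem > 0 && keepShort) ? 1 : 0);
    }
}
```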
Statistics Parameters
- stdev=true
- Calculate standard deviation of coverage.
- countgc=true
- Count GC content of reads covering each scaffold.
- covwindow=false
- Enable low-coverage window detection; a numeric value sets the window size in base pairs.
- covwindowavg=5.0
- Average coverage threshold for low-coverage window detection.
- k=0
- K-mer length for k-mer coverage statistics.
Normalization Parameters
- normc=false
- Normalize coverage by expressing as fraction of maximum coverage per scaffold.
- normb=100
- Number of bins per scaffold for length normalization. Use -1 to disable.
General Parameters
- addfromref=true
- Add scaffold information from reference file in addition to SAM header.
- append=false
- Append to existing output files instead of overwriting.
- overwrite=true
- Allow overwriting of existing output files.
- verbose=false
- Print verbose progress information.
- verbosetime=false
- Print detailed timing information for each processing step.
Java Parameters
- -Xmx
- Set Java heap size. Example: -Xmx20g for 20GB, -Xmx200m for 200MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92 or later.
- -da
- Disable Java assertions for better performance.
Examples
Basic Multi-file Coverage
pileup2.sh in=file1.bam,file2.bam,file3.bam out=coverage_stats.txt
Process three BAM files simultaneously and output basic coverage statistics.
Wildcard Input with Reference
pileup2.sh *.sam ref=reference.fasta out=stats.txt basecov=per_base.txt
Process all SAM files in directory with reference for GC calculation and per-base coverage output.
High-throughput Processing with Optimization
pileup2.sh in=sample1.bam,sample2.bam,sample3.bam \
out=stats.txt streams=8 atomic=true prealloc=true \
32bit=true -Xmx64g
Optimize for high-throughput processing with 8 streams, atomic arrays, preallocation, and 64GB memory.
Strand-specific Coverage Analysis
pileup2.sh in=rnaseq_*.bam stranded=true \
out=stats_#.txt basecov=coverage_#.txt hist=histogram_#.txt
Calculate strand-specific coverage with separate outputs for each strand ('#' in each filename is replaced with 1 and 2).
Physical Coverage with Quality Filtering
pileup2.sh in=paired_*.bam physical=true minq=20 \
qtrim=rl trimq=10 out=physical_coverage.txt
Calculate physical coverage including insert regions with quality filtering and trimming.
Comprehensive Output Suite
pileup2.sh in=samples_*.bam ref=genome.fasta \
out=stats.txt basecov=per_base.txt bincov=binned.txt \
hist=histogram.txt rpkm=expression.txt \
binsize=500 stdev=true countgc=true
Generate complete coverage analysis including statistics, per-base coverage, binned coverage, histogram, and expression levels.
Algorithm Details
Multi-threaded Architecture
Pileup2 uses CoveragePileupMT.LoadThread worker threads fed from an ArrayBlockingQueue to process multiple input files concurrently:
- Stream-based Processing: Creates multiple LoadThread instances, each processing different input files simultaneously
- Producer-Consumer Pattern: Uses ArrayBlockingQueue for thread-safe file distribution among worker threads
- Thread Pool Management: Tools.mid() determines thread count based on streams parameter, available processors, and input file count
- Memory-conscious Design: Accumulates results from worker threads using synchronized accumulation to minimize memory overhead
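The producer-consumer pattern above can be sketched in plain Java. This is a simplified illustration, not the CoveragePileupMT source: input file names go into a bounded queue, and each worker thread drains the queue until it receives a poison-pill sentinel:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of queue-based file distribution among worker threads
// (illustrative only; Pileup2's LoadThread does the real work).
public class FileDistribution {
    // Unique sentinel object; reference equality marks end-of-input.
    private static final String POISON = new String("poison");

    public static int processAll(List<String> files, int workers) {
        ArrayBlockingQueue<String> queue =
            new ArrayBlockingQueue<>(files.size() + workers);
        AtomicInteger processed = new AtomicInteger();
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    // Take files until the poison pill arrives.
                    for (String f = queue.take(); f != POISON; f = queue.take()) {
                        processed.incrementAndGet(); // stand-in for loading one SAM/BAM file
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        try {
            for (String f : files) queue.put(f);
            for (int i = 0; i < workers; i++) queue.put(POISON); // one pill per worker
            for (Thread t : pool) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```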
Coverage Data Structures
Supports multiple data structures for different use cases:
- CoverageArray: CoverageArray.makeArray() creates numeric arrays storing exact coverage depths, supporting 16-bit or 32-bit precision via caType parameter
- BitSet: Java BitSet instances for memory-efficient boolean coverage tracking when only presence/absence is needed
- Atomic Arrays: Lock-free data structures using CoverageArray.getType(atomic, bits32) for high-concurrency scenarios
- Dual Strategy: Automatically selects between USE_COVERAGE_ARRAYS and USE_BITSETS based on output requirements (histogram, basecov, bincov)
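The two representations can be contrasted with a minimal sketch (hypothetical helpers, not the CoverageArray/BitSet wiring in BBTools): a numeric array stores exact depths, while a BitSet records only covered/uncovered status for far less memory:

```java
import java.util.BitSet;

// Sketch of the two coverage stores described above (illustrative only).
public class CoverageStores {
    // Exact depth per position; char is an unsigned 16-bit counter,
    // mirroring the 16-bit array mode with its 65,535 depth cap.
    static char[] depths16(int scaffoldLength) {
        return new char[scaffoldLength];
    }

    static void add16(char[] cov, int start, int stop) {
        for (int i = start; i <= stop; i++) {
            if (cov[i] < Character.MAX_VALUE) cov[i]++; // saturate, don't overflow
        }
    }

    // Presence/absence only: one bit per position.
    static void addBits(BitSet cov, int start, int stop) {
        cov.set(start, stop + 1);
    }
}
```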
Coverage Calculation Modes
Implements multiple coverage calculation strategies via processSamLine() method:
- Standard Coverage: Counts all aligned bases including insertions and deletions using addCoverage() method
- Physical Coverage: Tracks entire insert size for paired reads using pairTable HashMap, including unsequenced middle portion
- Strand-specific: Maintains separate coverage tracks using scaf.obj0 (plus strand) and scaf.obj1 (minus strand)
- Position-specific: START_ONLY and STOP_ONLY flags for tracking coverage only at read start or stop positions
- Deletion-aware: INCLUDE_DELETIONS and INCLUDE_SOFT_CLIP flags for configurable handling of deletions and soft-clipped regions
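The difference between standard and physical coverage can be shown with a small sketch, assuming 0-based inclusive coordinates (hypothetical helpers, not processSamLine() itself). For a proper pair, physical coverage spans the full insert from the leftmost mate through TLEN bases, including the unsequenced middle:

```java
// Sketch contrasting standard and physical coverage for a paired read
// (illustrative only; coordinates are 0-based here).
public class PhysicalCoverage {
    // Standard coverage: only the aligned bases of each mate.
    static void addRead(int[] cov, int start, int stop) {
        for (int i = start; i <= stop; i++) cov[i]++;
    }

    // Physical coverage: leftmost mate start through TLEN bases,
    // covering read1 + unsequenced insert middle + read2.
    static void addFragment(int[] cov, int leftStart, int tlen) {
        for (int i = leftStart; i < leftStart + tlen; i++) cov[i]++;
    }
}
```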
Quality Control Integration
Built-in quality control using specific classes and methods:
- Quality Trimming: TrimRead.testOptimal() and TrimRead.trimReadWithMatch() for dynamic quality-based trimming
- Mapping Quality: Configurable minMapq thresholds with sl.mapq>=minMapq filtering
- Duplicate Handling: sl.duplicate() checks with INCLUDE_DUPLICATES flag for optional PCR duplicate inclusion/exclusion
- Border Trimming: Automatic trimming near scaffold boundaries using border parameter and skipTrimRange calculations
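The per-read filter chain implied by minq, secondary, and dupecov can be sketched as a single predicate (hypothetical method and parameter names, not the actual filtering code):

```java
// Sketch of the read-inclusion decision described above: a read
// contributes coverage only if it passes every configured check.
public class ReadFilter {
    static boolean includeRead(int mapq, boolean secondary, boolean duplicate,
                               int minMapq, boolean useSecondary,
                               boolean includeDuplicates) {
        if (mapq < minMapq) return false;              // minq filter
        if (secondary && !useSecondary) return false;  // secondary=false drops these
        if (duplicate && !includeDuplicates) return false; // dupecov=false drops these
        return true;
    }
}
```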
Memory Management
Memory optimization using specific allocation strategies:
- Preallocation: Optional preallocation via prealloc parameter creates CoverageArray and BitSet instances during scaffold header processing
- Dynamic Allocation: On-demand creation using makeCA() method minimizes memory usage for sparse coverage
- Lock Optimization: Choice between synchronized blocks and atomic operations via atomic parameter based on contention patterns
- Garbage Collection Friendly: clear() method nullifies references and Tools.max() minimizes object churn during processing
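On-demand allocation can be sketched with a map of lazily created per-scaffold arrays (an illustrative stand-in for makeCA(), not the actual method): a scaffold's array is allocated only when it first receives coverage, which saves memory on sparse datasets:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy per-scaffold allocation (illustrative only).
public class LazyAlloc {
    final Map<String, int[]> coverage = new HashMap<>();

    // Allocate a scaffold's coverage array on first use, then reuse it.
    int[] getOrMake(String scaffold, int length) {
        return coverage.computeIfAbsent(scaffold, k -> new int[length]);
    }
}
```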
Output Generation
File I/O system using specific writer classes:
- Multi-threaded Statistics: StandardDeviator.calculateStuff() with Shared.threads() for parallel calculation of coverage statistics
- Streaming Writers: ByteStreamWriter and TextStreamWriter classes with start() and poisonAndWait() lifecycle methods
- Format Flexibility: writeCoveragePerBase(), writeCoveragePerBaseBinned(), and writeRPKM() methods support multiple output formats
- Strand Separation: replaceFirst("#", "1") and replaceFirst("#", "2") for automatic filename manipulation in strand-specific outputs
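The '#' filename substitution is plain String.replaceFirst, as described above; a minimal sketch (hypothetical helper):

```java
// Sketch of stranded-output filename generation: '#' in the pattern
// becomes 1 for the plus strand and 2 for the minus strand.
public class StrandNames {
    static String[] strandFiles(String pattern) {
        return new String[]{
            pattern.replaceFirst("#", "1"),
            pattern.replaceFirst("#", "2")
        };
    }
}
```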
Performance Characteristics
Implementation specifics for genomics workflows:
- Scalability: ThreadWaiter.startThreads() enables near-linear scaling with available CPU cores and input file count
- Memory Efficiency: bits32 parameter balances memory usage (16-bit vs 32-bit arrays) with maximum coverage depth (Character.MAX_VALUE vs 1000000)
- I/O Optimization: SamLineStreamer with streamerThreads parameter supports compressed formats and streaming for large datasets
- Cache Friendly: CoverageArray data structures designed for sequential access patterns during coverage accumulation
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org