Pileup2
Multi-threaded version of Pileup designed to process multiple SAM/BAM files concurrently for coverage analysis. For single input files, use regular pileup.sh instead. Supports all the same parameters as pileup.sh with additional multi-threading controls.
Basic Usage
pileup2.sh in=<file,file,file> out=<file>
pileup2.sh *.sam out=<file>
Processes multiple SAM/BAM files simultaneously using multiple threads to calculate coverage statistics.
Parameters
Parameters are organized by function. Pileup2 supports all parameters from regular pileup.sh plus multi-threading specific controls.
Input/Output Parameters
- in=<file,file>
- The input SAM/BAM files, comma-separated. Omit the 'in=' if using wildcards (e.g., *.sam).
- out=<file>
- Primary output file for coverage statistics. Can use '#' symbol for strand-specific output when stranded=true.
- ref=<file>
- Reference fasta file for calculating GC content and adding scaffold information.
- basecov=<file>
- Write per-base coverage to file. Use '#' in filename for strand-specific output.
- bincov=<file>
- Write binned coverage to file. Use '#' in filename for strand-specific output.
- normcov=<file>
- Write length-normalized coverage to file. Use '#' in filename for strand-specific output.
- normcovOverall=<file>
- Write overall normalized coverage combining all scaffolds. Use '#' in filename for strand-specific output.
- hist=<file>
- Write coverage histogram to file. Use '#' in filename for strand-specific output.
- rpkm=<file>
- Write RPKM/FPKM statistics to file.
- outorf=<file>
- Write ORF coverage statistics to file.
- orffasta=<file>
- Input ORF fasta file for calculating ORF-specific coverage.
Multi-threading Parameters
- streams=-1
- Maximum number of concurrent file-processing streams. -1 (the default) uses half the number of logical processors. Each stream uses multiple threads internally.
- atomic=false
- Use atomic arrays instead of locks for thread synchronization. Can improve performance but uses more memory.
- prealloc=false
- Preallocate coverage arrays instead of creating them as needed. Reduces memory allocation overhead for large references.
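The interaction between the streams setting, processor count, and input file count can be sketched as follows. This is an illustrative approximation, not the BBTools source; the mid() helper mimics the median-of-three clamp that Tools.mid() is described as performing in the Algorithm Details section:

```java
// Sketch (hypothetical, not the actual Pileup2 code) of stream-count
// selection: clamp the requested value between 1 and the input file
// count, defaulting to half the logical processors when streams=-1.
public class StreamCount {
    // Median of three values, analogous to Tools.mid().
    static int mid(int a, int b, int c) {
        return Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
    }

    static int chooseStreams(int requested, int logicalProcessors, int fileCount) {
        int def = Math.max(1, logicalProcessors / 2); // streams=-1 default
        int target = (requested < 1 ? def : requested);
        // Never run more streams than there are input files.
        return mid(1, target, fileCount);
    }
}
```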
Coverage Calculation Parameters
- mindepth=1
- Minimum depth for a base to be considered covered.
- minscaf=0
- Ignore scaffolds shorter than this length.
- binsize=1000
- Bin size in base pairs for binned coverage output.
- 32bit=false
- Use 32-bit arrays instead of 16-bit arrays for coverage counting. Allows coverage depths above 65,535, the 16-bit limit.
- histmax=-1
- Maximum coverage depth for histogram. Default uses maximum possible value for array type.
- reads=-1
- Process at most this many reads. -1 means process all reads.
- scafs=4096
- Initial estimate of scaffold count for memory allocation optimization.
Filtering Parameters
- minq=0
- Minimum mapping quality score for reads to be included in coverage calculation.
- secondary=true
- Include secondary alignments in coverage calculation.
- softclip=false
- Include soft-clipped bases in coverage calculation.
- dupecov=true
- Include reads flagged as duplicates in coverage calculation.
- delcov=true
- Include deletions in coverage calculation.
Quality Trimming Parameters
- qtrim=null
- Quality trimming mode: 'r' (right), 'l' (left), 'rl' (both), or numeric value for trimq threshold.
- trimq=-1
- Quality threshold for trimming. Bases with quality below this value are trimmed.
- border=0
- Trim this many bases from read ends near scaffold boundaries.
Coverage Mode Parameters
- physical=false
- Calculate physical coverage including unsequenced middle portion of paired reads.
- tlen=true
- Use TLEN field for physical coverage calculation when available.
- stranded=false
- Calculate coverage separately for each strand. Requires '#' symbol in output filenames.
- startcov=false
- Only track coverage at read start positions.
- stopcov=false
- Only track coverage at read stop positions.
- delta=false
- Only output positions where coverage changes (delta-only mode).
Data Structure Parameters
- bitset=auto
- Use bitsets for coverage tracking instead of arrays. Automatically determined based on output needs.
- array=auto
- Use arrays for coverage tracking. Automatically determined based on output needs.
- median=false
- Calculate median coverage (forces array mode).
Output Format Parameters
- nonzero=false
- Only output scaffolds/positions with non-zero coverage.
- twocolumn=false
- Output in simple two-column format: ID and average coverage.
- keyvalue=false
- Output in machine-readable key=value format.
- concise=false
- Use concise output format, skipping zero-coverage positions.
- header=true
- Include header lines in output files.
- headerpound=true
- Prefix header lines with '#' symbol.
- keepshortbins=true
- Include the last bin of each scaffold even if shorter than binsize.
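The binning arithmetic behind binsize and keepshortbins can be illustrated with a small sketch (a hypothetical helper, not part of Pileup2). With binsize=1000, a 2500 bp scaffold yields bins of 1000, 1000, and 500 bp; keepshortbins=true keeps that short final bin:

```java
// Sketch of how binned-coverage bin counts follow from binsize and
// keepshortbins (illustrative only, not the Pileup2 implementation).
public class Binning {
    static int binCount(int scaffoldLength, int binsize, boolean keepShort) {
        int full = scaffoldLength / binsize;       // complete bins
        int rem = scaffoldLength % binsize;        // leftover bases
        return full + ((rem > 0 && keepShort) ? 1 : 0);
    }
}
```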
Statistics Parameters
- stdev=true
- Calculate standard deviation of coverage.
- countgc=true
- Count GC content of reads covering each scaffold.
- covwindow=false
- Enable low-coverage window detection; a numeric value sets the window size in base pairs.
- covwindowavg=5.0
- Average coverage threshold for low-coverage window detection.
- k=0
- K-mer length for k-mer coverage statistics.
Normalization Parameters
- normc=false
- Normalize coverage by expressing as fraction of maximum coverage per scaffold.
- normb=100
- Number of bins per scaffold for length normalization. Use -1 to disable.
General Parameters
- addfromref=true
- Add scaffold information from reference file in addition to SAM header.
- append=false
- Append to existing output files instead of overwriting.
- overwrite=true
- Allow overwriting of existing output files.
- verbose=false
- Print verbose progress information.
- verbosetime=false
- Print detailed timing information for each processing step.
Java Parameters
- -Xmx
- Set Java heap size. Example: -Xmx20g for 20GB, -Xmx200m for 200MB. Maximum is typically 85% of physical memory.
- -eoom
- Exit on out-of-memory exception. Requires Java 8u92 or later.
- -da
- Disable Java assertions for better performance.
Examples
Basic Multi-file Coverage
pileup2.sh in=file1.bam,file2.bam,file3.bam out=coverage_stats.txt
Process three BAM files simultaneously and output basic coverage statistics.
Wildcard Input with Reference
pileup2.sh *.sam ref=reference.fasta out=stats.txt basecov=per_base.txt
Process all SAM files in directory with reference for GC calculation and per-base coverage output.
High-throughput Processing with Optimization
pileup2.sh in=sample1.bam,sample2.bam,sample3.bam \
out=stats.txt streams=8 atomic=true prealloc=true \
32bit=true -Xmx64g
Optimize for high-throughput processing with 8 streams, atomic arrays, preallocation, and 64GB memory.
Strand-specific Coverage Analysis
pileup2.sh in=rnaseq_*.bam stranded=true \
out=stats_#.txt basecov=coverage_#.txt hist=histogram_#.txt
Calculate strand-specific coverage with separate outputs for each strand ('#' in each filename is replaced with 1 and 2).
Physical Coverage with Quality Filtering
pileup2.sh in=paired_*.bam physical=true minq=20 \
qtrim=rl trimq=10 out=physical_coverage.txt
Calculate physical coverage including insert regions with quality filtering and trimming.
Comprehensive Output Suite
pileup2.sh in=samples_*.bam ref=genome.fasta \
out=stats.txt basecov=per_base.txt bincov=binned.txt \
hist=histogram.txt rpkm=expression.txt \
binsize=500 stdev=true countgc=true
Generate complete coverage analysis including statistics, per-base coverage, binned coverage, histogram, and expression levels.
Algorithm Details
Multi-threaded Architecture
Pileup2 uses CoveragePileupMT.LoadThread worker threads fed from an ArrayBlockingQueue to process multiple input files concurrently:
- Stream-based Processing: Creates multiple LoadThread instances, each processing different input files simultaneously
- Producer-Consumer Pattern: Uses ArrayBlockingQueue for thread-safe file distribution among worker threads
- Thread Pool Management: Tools.mid() determines thread count based on streams parameter, available processors, and input file count
- Memory-conscious Design: Accumulates results from worker threads using synchronized accumulation to minimize memory overhead
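The producer-consumer pattern above can be sketched in plain Java. This is a simplified illustration, not the CoveragePileupMT source: input file names go into a bounded queue, and each worker thread drains the queue until it receives a poison-pill sentinel:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of queue-based file distribution among worker threads
// (illustrative only; Pileup2's LoadThread does the real work).
public class FileDistribution {
    // Unique sentinel object; reference equality marks end-of-input.
    private static final String POISON = new String("poison");

    public static int processAll(List<String> files, int workers) {
        ArrayBlockingQueue<String> queue =
            new ArrayBlockingQueue<>(files.size() + workers);
        AtomicInteger processed = new AtomicInteger();
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    // Take files until the poison pill arrives.
                    for (String f = queue.take(); f != POISON; f = queue.take()) {
                        processed.incrementAndGet(); // stand-in for loading one SAM/BAM file
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        try {
            for (String f : files) queue.put(f);
            for (int i = 0; i < workers; i++) queue.put(POISON); // one pill per worker
            for (Thread t : pool) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```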
Coverage Data Structures
Supports multiple data structures for different use cases:
- CoverageArray: CoverageArray.makeArray() creates numeric arrays storing exact coverage depths, supporting 16-bit or 32-bit precision via caType parameter
- BitSet: Java BitSet instances for memory-efficient boolean coverage tracking when only presence/absence is needed
- Atomic Arrays: Lock-free data structures using CoverageArray.getType(atomic, bits32) for high-concurrency scenarios
- Dual Strategy: Automatically selects between USE_COVERAGE_ARRAYS and USE_BITSETS based on output requirements (histogram, basecov, bincov)
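The two representations can be contrasted with a minimal sketch (hypothetical helpers, not the CoverageArray/BitSet wiring in BBTools): a numeric array stores exact depths, while a BitSet records only covered/uncovered status for far less memory:

```java
import java.util.BitSet;

// Sketch of the two coverage stores described above (illustrative only).
public class CoverageStores {
    // Exact depth per position; char is an unsigned 16-bit counter,
    // mirroring the 16-bit array mode with its 65,535 depth cap.
    static char[] depths16(int scaffoldLength) {
        return new char[scaffoldLength];
    }

    static void add16(char[] cov, int start, int stop) {
        for (int i = start; i <= stop; i++) {
            if (cov[i] < Character.MAX_VALUE) cov[i]++; // saturate, don't overflow
        }
    }

    // Presence/absence only: one bit per position.
    static void addBits(BitSet cov, int start, int stop) {
        cov.set(start, stop + 1);
    }
}
```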
Coverage Calculation Modes
Implements multiple coverage calculation strategies via processSamLine() method:
- Standard Coverage: Counts all aligned bases including insertions and deletions using addCoverage() method
- Physical Coverage: Tracks entire insert size for paired reads using pairTable HashMap, including unsequenced middle portion
- Strand-specific: Maintains separate coverage tracks using scaf.obj0 (plus strand) and scaf.obj1 (minus strand)
- Position-specific: START_ONLY and STOP_ONLY flags for tracking coverage only at read start or stop positions
- Deletion-aware: INCLUDE_DELETIONS and INCLUDE_SOFT_CLIP flags for configurable handling of deletions and soft-clipped regions
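The difference between standard and physical coverage can be shown with a small sketch, assuming 0-based inclusive coordinates (hypothetical helpers, not processSamLine() itself). For a proper pair, physical coverage spans the full insert from the leftmost mate through TLEN bases, including the unsequenced middle:

```java
// Sketch contrasting standard and physical coverage for a paired read
// (illustrative only; coordinates are 0-based here).
public class PhysicalCoverage {
    // Standard coverage: only the aligned bases of each mate.
    static void addRead(int[] cov, int start, int stop) {
        for (int i = start; i <= stop; i++) cov[i]++;
    }

    // Physical coverage: leftmost mate start through TLEN bases,
    // covering read1 + unsequenced insert middle + read2.
    static void addFragment(int[] cov, int leftStart, int tlen) {
        for (int i = leftStart; i < leftStart + tlen; i++) cov[i]++;
    }
}
```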
Quality Control Integration
Built-in quality control using specific classes and methods:
- Quality Trimming: TrimRead.testOptimal() and TrimRead.trimReadWithMatch() for dynamic quality-based trimming
- Mapping Quality: Configurable minMapq thresholds with sl.mapq>=minMapq filtering
- Duplicate Handling: sl.duplicate() checks with INCLUDE_DUPLICATES flag for optional PCR duplicate inclusion/exclusion
- Border Trimming: Automatic trimming near scaffold boundaries using border parameter and skipTrimRange calculations
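The per-read filter chain implied by minq, secondary, and dupecov can be sketched as a single predicate (hypothetical method and parameter names, not the actual filtering code):

```java
// Sketch of the read-inclusion decision described above: a read
// contributes coverage only if it passes every configured check.
public class ReadFilter {
    static boolean includeRead(int mapq, boolean secondary, boolean duplicate,
                               int minMapq, boolean useSecondary,
                               boolean includeDuplicates) {
        if (mapq < minMapq) return false;              // minq filter
        if (secondary && !useSecondary) return false;  // secondary=false drops these
        if (duplicate && !includeDuplicates) return false; // dupecov=false drops these
        return true;
    }
}
```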
Memory Management
Memory optimization using specific allocation strategies:
- Preallocation: Optional preallocation via prealloc parameter creates CoverageArray and BitSet instances during scaffold header processing
- Dynamic Allocation: On-demand creation using makeCA() method minimizes memory usage for sparse coverage
- Lock Optimization: Choice between synchronized blocks and atomic operations via atomic parameter based on contention patterns
- Garbage Collection Friendly: clear() method nullifies references and Tools.max() minimizes object churn during processing
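On-demand allocation can be sketched with a map of lazily created per-scaffold arrays (an illustrative stand-in for makeCA(), not the actual method): a scaffold's array is allocated only when it first receives coverage, which saves memory on sparse datasets:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lazy per-scaffold allocation (illustrative only).
public class LazyAlloc {
    final Map<String, int[]> coverage = new HashMap<>();

    // Allocate a scaffold's coverage array on first use, then reuse it.
    int[] getOrMake(String scaffold, int length) {
        return coverage.computeIfAbsent(scaffold, k -> new int[length]);
    }
}
```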
Output Generation
File I/O system using specific writer classes:
- Multi-threaded Statistics: StandardDeviator.calculateStuff() with Shared.threads() for parallel calculation of coverage statistics
- Streaming Writers: ByteStreamWriter and TextStreamWriter classes with start() and poisonAndWait() lifecycle methods
- Format Flexibility: writeCoveragePerBase(), writeCoveragePerBaseBinned(), and writeRPKM() methods support multiple output formats
- Strand Separation: replaceFirst("#", "1") and replaceFirst("#", "2") for automatic filename manipulation in strand-specific outputs
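The '#' filename substitution is plain String.replaceFirst, as described above; a minimal sketch (hypothetical helper):

```java
// Sketch of stranded-output filename generation: '#' in the pattern
// becomes 1 for the plus strand and 2 for the minus strand.
public class StrandNames {
    static String[] strandFiles(String pattern) {
        return new String[]{
            pattern.replaceFirst("#", "1"),
            pattern.replaceFirst("#", "2")
        };
    }
}
```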
Performance Characteristics
Implementation specifics for genomics workflows:
- Scalability: ThreadWaiter.startThreads() enables near-linear scaling with available CPU cores and input file count
- Memory Efficiency: bits32 parameter balances memory usage (16-bit vs 32-bit arrays) with maximum coverage depth (Character.MAX_VALUE vs 1000000)
- I/O Optimization: SamLineStreamer with streamerThreads parameter supports compressed formats and streaming for large datasets
- Cache Friendly: CoverageArray data structures designed for sequential access patterns during coverage accumulation
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org