Pileup2

Script: pileup2.sh Package: jgi Class: CoveragePileupMT.java

Multi-threaded version of Pileup designed to process multiple SAM/BAM files concurrently for coverage analysis. For single input files, use regular pileup.sh instead. Supports all the same parameters as pileup.sh with additional multi-threading controls.

Basic Usage

pileup2.sh in=<file,file,file> out=<file>
pileup2.sh *.sam out=<file>

Processes multiple SAM/BAM files simultaneously using multiple threads to calculate coverage statistics.

Parameters

Parameters are organized by function. Pileup2 supports all parameters from regular pileup.sh plus multi-threading specific controls.

Input/Output Parameters

in=<file,file>
The input SAM/BAM files, comma-separated. Omit the 'in=' if using wildcards (e.g., *.sam).
out=<file>
Primary output file for coverage statistics. Can use '#' symbol for strand-specific output when stranded=true.
ref=<file>
Reference fasta file for calculating GC content and adding scaffold information.
basecov=<file>
Write per-base coverage to file. Use '#' in filename for strand-specific output.
bincov=<file>
Write binned coverage to file. Use '#' in filename for strand-specific output.
normcov=<file>
Write length-normalized coverage to file. Use '#' in filename for strand-specific output.
normcovOverall=<file>
Write overall normalized coverage combining all scaffolds. Use '#' in filename for strand-specific output.
hist=<file>
Write coverage histogram to file. Use '#' in filename for strand-specific output.
rpkm=<file>
Write RPKM/FPKM statistics to file.
outorf=<file>
Write ORF coverage statistics to file.
orffasta=<file>
Input ORF fasta file for calculating ORF-specific coverage.

Multi-threading Parameters

streams=-1
Maximum number of concurrent file processing streams. Default is half the number of logical processors. Each stream uses multiple threads internally.
atomic=false
Use atomic arrays instead of locks for thread synchronization. Can improve performance but uses more memory.
prealloc=false
Preallocate coverage arrays instead of creating them as needed. Reduces memory allocation overhead for large references.

Coverage Calculation Parameters

mindepth=1
Minimum depth for a base to be considered covered.
minscaf=0
Ignore scaffolds shorter than this length.
binsize=1000
Bin size in base pairs for binned coverage output.
32bit=false
Use 32-bit arrays instead of 16-bit for coverage counting. Allows higher coverage depths.
histmax=-1
Maximum coverage depth for histogram. Default uses maximum possible value for array type.
reads=-1
Process at most this many reads. -1 means process all reads.
scafs=4096
Initial estimate of scaffold count for memory allocation optimization.

Filtering Parameters

minq=0
Minimum mapping quality score for reads to be included in coverage calculation.
secondary=true
Include secondary alignments in coverage calculation.
softclip=false
Include soft-clipped bases in coverage calculation.
dupecov=true
Include reads flagged as duplicates in coverage calculation.
delcov=true
Include deletions in coverage calculation.

Quality Trimming Parameters

qtrim=null
Quality trimming mode: 'r' (right), 'l' (left), 'rl' (both), or numeric value for trimq threshold.
trimq=-1
Quality threshold for trimming. Bases with quality below this value are trimmed.
border=0
Trim this many bases from read ends near scaffold boundaries.

Coverage Mode Parameters

physical=false
Calculate physical coverage including unsequenced middle portion of paired reads.
tlen=true
Use TLEN field for physical coverage calculation when available.
stranded=false
Calculate coverage separately for each strand. Requires '#' symbol in output filenames.
startcov=false
Only track coverage at read start positions.
stopcov=false
Only track coverage at read stop positions.
delta=false
Only output positions where coverage changes (delta-only mode).

Data Structure Parameters

bitset=auto
Use bitsets for coverage tracking instead of arrays. Automatically determined based on output needs.
array=auto
Use arrays for coverage tracking. Automatically determined based on output needs.
median=false
Calculate median coverage (forces array mode).

Output Format Parameters

nonzero=false
Only output scaffolds/positions with non-zero coverage.
twocolumn=false
Output in simple two-column format: ID and average coverage.
keyvalue=false
Output in machine-readable key=value format.
concise=false
Use concise output format, skipping zero-coverage positions.
header=true
Include header lines in output files.
headerpound=true
Prefix header lines with '#' symbol.
keepshortbins=true
Include the last bin of each scaffold even if shorter than binsize.

Statistics Parameters

stdev=true
Calculate standard deviation of coverage.
countgc=true
Count GC content of reads covering each scaffold.
covwindow=false
Enable low-coverage window detection. Can specify window size as numeric value.
covwindowavg=5.0
Average coverage threshold for low-coverage window detection.
k=0
K-mer length for k-mer coverage statistics.

Normalization Parameters

normc=false
Normalize coverage by expressing as fraction of maximum coverage per scaffold.
normb=100
Number of bins per scaffold for length normalization. Use -1 to disable.

General Parameters

addfromref=true
Add scaffold information from reference file in addition to SAM header.
append=false
Append to existing output files instead of overwriting.
overwrite=true
Allow overwriting of existing output files.
verbose=false
Print verbose progress information.
verbosetime=false
Print detailed timing information for each processing step.

Java Parameters

-Xmx
Set Java heap size. Example: -Xmx20g for 20GB, -Xmx200m for 200MB. Maximum is typically 85% of physical memory.
-eoom
Exit on out-of-memory exception. Requires Java 8u92 or later.
-da
Disable Java assertions for better performance.

Examples

Basic Multi-file Coverage

pileup2.sh in=file1.bam,file2.bam,file3.bam out=coverage_stats.txt

Process three BAM files simultaneously and output basic coverage statistics.

Wildcard Input with Reference

pileup2.sh *.sam ref=reference.fasta out=stats.txt basecov=per_base.txt

Process all SAM files in directory with reference for GC calculation and per-base coverage output.

High-throughput Processing with Optimization

pileup2.sh in=sample1.bam,sample2.bam,sample3.bam \
    out=stats.txt streams=8 atomic=true prealloc=true \
    32bit=true -Xmx64g

Optimize for high-throughput processing with 8 streams, atomic arrays, preallocation, and 64GB memory.

Strand-specific Coverage Analysis

pileup2.sh in=rnaseq_*.bam stranded=true \
    out=stats_#.txt basecov=coverage_#.txt hist=histogram_#.txt

Calculate strand-specific coverage with separate outputs for each strand (# replaced with 1/2).

Physical Coverage with Quality Filtering

pileup2.sh in=paired_*.bam physical=true minq=20 \
    qtrim=rl trimq=10 out=physical_coverage.txt

Calculate physical coverage including insert regions with quality filtering and trimming.

Comprehensive Output Suite

pileup2.sh in=samples_*.bam ref=genome.fasta \
    out=stats.txt basecov=per_base.txt bincov=binned.txt \
    hist=histogram.txt rpkm=expression.txt \
    binsize=500 stdev=true countgc=true

Generate complete coverage analysis including statistics, per-base coverage, binned coverage, histogram, and expression levels.

Algorithm Details

Multi-threaded Architecture

Pileup2 implements CoveragePileupMT.LoadThread workers with ArrayBlockingQueue for concurrent processing of multiple input files:

Coverage Data Structures

Supports multiple data structures for different use cases:

Coverage Calculation Modes

Implements multiple coverage calculation strategies via processSamLine() method:

Quality Control Integration

Built-in quality control using specific classes and methods:

Memory Management

Memory optimization using specific allocation strategies:

Output Generation

File I/O system using specific writer classes:

Performance Characteristics

Implementation specifics for genomics workflows:

Support

For questions and support: