Pileup

Script: pileup.sh Package: jgi Class: CoveragePileup.java

Calculates per-scaffold or per-base coverage information from an unsorted sam or bam file. Supports SAM/BAM format for reads and FASTA for reference. Sorting is not needed, so output may be streamed directly from a mapping program. Requires a minimum of 1 bit per reference base plus 100 bytes per scaffold (even if no reference is specified). If per-base coverage is needed (including for stdev and median), at least 4 bytes per base is needed.

Basic Usage

pileup.sh in=<input> out=<output>

The input parameter is required and specifies the SAM/BAM file. Other parameters are optional and provide various coverage analysis options.

Parameters

Parameters are organized by their function in the coverage analysis process. The tool implements automatic data structure selection through CoveragePileup.java, choosing between CoverageArray2/CoverageArray3 classes for detailed statistics or Java BitSet objects for memory-constrained tracking.

Input Parameters

in=<file>: The input sam file; this is the only required parameter.
ref=<file>: Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file>: An optional fasta file with ORF header information in PRODIGAL's output format. Must also specify 'outorf'.
unpigz=t: Decompress with pigz for faster decompression.
addfromref=t: Allow ref scaffolds not present in sam header to be added from the reference.
addfromreads=f: Allow ref scaffolds not present in sam header to be added from the reads. Note that in this case the ref scaffold lengths will be inaccurate.

Output Parameters

out=<file>: (covstats) Per-scaffold coverage info.
rpkm=<file>: Per-scaffold RPKM/FPKM counts.
twocolumn=f: Change to true to print only ID and Avg_fold instead of all 11 columns.
countgc=t: Enable/disable counting of read GC content.
outorf=<file>: Per-orf coverage info to this file (only if 'fastaorf' is specified).
outsam=<file>: Print the input sam stream to this file (or stdout). Useful for piping data.
hist=<file>: Histogram of # occurrences of each depth level.
basecov=<file>: Coverage per base location.
rangecov=<file>: Concise ranges where coverage depth is at least mincov.
mincov=1: When calculating percent covered, ignore bases under this depth. Also used as threshold for rangecov.
bincov=<file>: Binned coverage per location (one line per X bases).
binsize=1000: Binsize for binned coverage output.
keepshortbins=t: (ksb) Keep residual bins shorter than binsize.
normcov=<file>: Normalized coverage by normalized location (X lines per scaffold).
normcovo=<file>: Overall normalized coverage by normalized location (X lines for the entire assembly).
normb=-1: If positive, use a fixed number of bins per scaffold; affects 'normcov' and 'normcovo'.
normc=f: Normalize coverage to fraction of max per scaffold; affects 'normcov' and 'normcovo'.
delta=f: Only print base coverage lines when the coverage differs from the previous base.
nzo=f: Only print scaffolds with nonzero coverage.
concise=f: Write 'basecov' in a more concise format.
header=t: (hdr) Include headers in output files.
headerpound=t: (#) Prepend header lines with '#' symbol.
stdev=t: Calculate coverage standard deviation.
covminscaf=0: (minscaf) Don't print coverage for scaffolds shorter than this.
covwindow=0: Calculate how many bases are in windows of this size with low average coverage. Produces an extra stats column.
covwindowavg=5: Average coverage below this will be classified as low.
k=0: If positive, calculate kmer coverage statistics for this kmer length.
keyvalue=f: Output statistics to screen as key=value pairs.

Processing Parameters

strandedcov=f: Track coverage for plus and minus strand independently.
startcov=f: Only track start positions of reads.
stopcov=f: Only track stop positions of reads.
secondary=t: Use secondary alignments, if present.
softclip=f: Include soft-clipped bases in coverage.
minmapq=0: (minq) Ignore alignments with mapq below this.
physical=f: (physcov) Calculate physical coverage for paired reads. This includes the unsequenced bases.
tlen=t: Track physical coverage from the tlen field rather than recalculating it.
arrays=auto: Set to t/f to manually force the use of coverage arrays. Arrays and bitsets are mutually exclusive.
bitsets=auto: Set to t/f to manually force the use of coverage bitsets.
32bit=f: Set to true if you need per-base coverage over 64k; does not affect per-scaffold coverage precision. This option will double RAM usage (when calculating per-base coverage).
delcoverage=t: (delcov) Count bases covered by deletions or introns as covered. True is faster than false.
dupecoverage=t: (dupes) Include reads flagged as duplicates in coverage.
samstreamer=t: (ss) Load reads multithreaded to increase speed.

Trimming Parameters

NOTE: These are applied before adding coverage, to allow mimicking tools like CallVariants, which uses 'qtrim=r trimq=10 border=5'

qtrim=f: Quality-trim. May be set to: f (false): Don't quality-trim. r (right): Trim right (3') end only. l (left): Trim left (5') end only. rl (both): Trim both ends.
trimq=-1: If positive, quality-trim to this threshold.
border=0: Ignore this many bases on the left and right end.

Output Columns (tab-delimited)

Standard output contains the following columns:

ID: Scaffold ID
Length: Scaffold length
Ref_GC: GC ratio of reference
Avg_fold: Average fold coverage of this scaffold
Covered_percent: Percent of scaffold with any coverage (only if arrays or bitsets are used)
Covered_bases: Number of bases with any coverage (only if arrays or bitsets are used)
Plus_reads: Number of reads mapped to plus strand
Minus_reads: Number of reads mapped to minus strand
Read_GC: Average GC ratio of reads mapped to this scaffold
Median_fold: Median fold coverage of this scaffold (only if arrays are used)
Std_Dev: Standard deviation of coverage (only if arrays are used)

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Coverage Statistics

pileup.sh in=aligned.sam out=coverage_stats.txt

Generates basic per-scaffold coverage statistics from a SAM file.

Per-Base Coverage with Reference

pileup.sh in=aligned.bam ref=reference.fasta basecov=per_base_coverage.txt out=scaffold_stats.txt

Calculates both per-base and per-scaffold coverage statistics using a reference genome for GC content calculation.

Coverage Histogram and Binned Output

pileup.sh in=mapped.sam hist=depth_histogram.txt bincov=binned_coverage.txt binsize=500 out=stats.txt

Generates a coverage depth histogram and binned coverage data with 500bp bins.

Physical Coverage for Paired Reads

pileup.sh in=paired.bam physical=t out=physical_coverage.txt

Calculates physical coverage including unsequenced bases between paired reads.

Strand-Specific Coverage Analysis

pileup.sh in=stranded.sam strandedcov=t basecov=coverage_#.txt out=stats_#.txt

Performs strand-specific coverage analysis. The '#' symbol is replaced with '1' for plus strand and '2' for minus strand in output filenames.

Quality Trimming with Coverage

pileup.sh in=reads.sam qtrim=r trimq=10 border=5 out=trimmed_coverage.txt

Applies quality trimming (right end, Q10 threshold) and border trimming (5bp) before calculating coverage, mimicking CallVariants behavior.

Algorithm Details

Adaptive Data Structure Selection

Pileup implements automatic data structure selection in CoveragePileup.createDataStructures() based on output requirements:

BitSet Mode (java.util.BitSet): When only basic coverage statistics are needed, uses Java BitSets requiring just 1 bit per reference base plus 100 bytes per scaffold
CoverageArray2 (16-bit): When detailed statistics (median, standard deviation, per-base output) are required, uses CoverageArray2 structures with char arrays for coverage values up to 65,535
CoverageArray3 (32-bit): For high-coverage data requiring per-base coverage over 64k, uses CoverageArray3 with int arrays
Automatic Selection Logic: The tool evaluates required outputs in vectorMode selection to choose the most memory-optimal data structure

Streaming SAM/BAM Processing via SamLineStreamer

The tool processes SAM/BAM files through two processing pathways:

SamLineStreamer (Multi-threaded): Uses stream.SamLineStreamer class with configurable thread count for parallel SAM line processing when enabled with samstreamer=t
ByteFile Sequential Processing: Falls back to fileIO.ByteFile for streaming input when SamLineStreamer is disabled or for stdin
Header Processing: Dedicated processHeader() method extracts @SQ lines to build Scaffold objects with HashMap<String, Scaffold> lookup table
Memory Management: Automatically manages memory allocation based on scaffold count through initialScaffolds parameter

Physical Coverage Calculation with Mate Pairing

For paired-end reads, physical coverage includes the unsequenced region between reads using HashMap-based mate tracking:

Mate Hash Table: Uses HashMap<String, SamLine> pairTable to match read pairs during streaming
Template Length Calculation: Uses TLEN field when USE_TLEN=true, otherwise calculates from read positions via Tools.max(stop1, stop2)-Tools.min(start1, start2)+1
Proper Pair Validation: Only considers reads with sl.properPair() flag for physical coverage calculation
Coverage Addition: Calls addCoverage() with combined start/stop positions covering the entire fragment

Quality and Border Trimming via TrimRead Integration

Applies trimming before coverage calculation using shared.TrimRead methods:

Quality Trimming: Uses TrimRead.testOptimal() with Phred error probability conversion via QualityTools.phredToProbError()
Border Trimming: Ignores specified number of bases from read ends with adaptive reduction near scaffold boundaries
Match String Processing: Handles both short and long match strings via Read.toLongMatchString() for deletion-aware coverage
Trimming Integration: TrimRead.trimReadWithMatch() modifies SamLine objects before coverage calculation

Coverage Increment Methods

Array-based: CoverageArray.increment() and CoverageArray.incrementRange() methods for detailed tracking
BitSet-based: BitSet.set() and BitSet.set(start, stop+1) for memory-constrained coverage presence tracking
Deletion Handling: addCoverageIgnoringDeletions() parses match strings to exclude 'D' (deletion) positions from coverage
Strand Separation: Uses scaf.obj0 for forward strand and scaf.obj1 for reverse strand when STRANDED=true

Performance Characteristics

Memory Usage: Minimum 1 bit/base + 100 bytes/scaffold; 4+ bytes/base for detailed statistics
Scalability: HashMap-based scaffold lookup provides O(1) access time for coverage addition
Threading: Optional multithreaded SAM parsing via SamLineStreamer for improved I/O performance
Stream Processing: ByteStreamWriter enables concurrent output writing for piped workflows

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org