Pileup
Calculates per-scaffold or per-base coverage information from an unsorted sam or bam file. Supports SAM/BAM format for reads and FASTA for reference. Sorting is not needed, so output may be streamed directly from a mapping program. Requires a minimum of 1 bit per reference base plus 100 bytes per scaffold (even if no reference is specified). If per-base coverage is needed (including for stdev and median), at least 4 bytes per base is needed.
Basic Usage
pileup.sh in=<input> out=<output>
The input parameter is required and specifies the SAM/BAM file. Other parameters are optional and provide various coverage analysis options.
Parameters
Parameters are organized by their function in the coverage analysis process. The tool implements automatic data structure selection through CoveragePileup.java, choosing between CoverageArray2/CoverageArray3 classes for detailed statistics or Java BitSet objects for memory-constrained tracking.
Input Parameters
- in=<file>
- The input sam file; this is the only required parameter.
- ref=<file>
- Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
- fastaorf=<file>
- An optional fasta file with ORF header information in PRODIGAL's output format. Must also specify 'outorf'.
- unpigz=t
- Decompress with pigz for faster decompression.
- addfromref=t
- Allow ref scaffolds not present in sam header to be added from the reference.
- addfromreads=f
- Allow ref scaffolds not present in sam header to be added from the reads. Note that in this case the ref scaffold lengths will be inaccurate.
Output Parameters
- out=<file>
- (covstats) Per-scaffold coverage info.
- rpkm=<file>
- Per-scaffold RPKM/FPKM counts.
- twocolumn=f
- Change to true to print only ID and Avg_fold instead of all 11 columns.
- countgc=t
- Enable/disable counting of read GC content.
- outorf=<file>
- Per-orf coverage info to this file (only if 'fastaorf' is specified).
- outsam=<file>
- Print the input sam stream to this file (or stdout). Useful for piping data.
- hist=<file>
- Histogram of # occurrences of each depth level.
- basecov=<file>
- Coverage per base location.
- rangecov=<file>
- Concise ranges where coverage depth is at least mincov.
- mincov=1
- When calculating percent covered, ignore bases under this depth. Also used as threshold for rangecov.
- bincov=<file>
- Binned coverage per location (one line per X bases).
- binsize=1000
- Binsize for binned coverage output.
- keepshortbins=t
- (ksb) Keep residual bins shorter than binsize.
- normcov=<file>
- Normalized coverage by normalized location (X lines per scaffold).
- normcovo=<file>
- Overall normalized coverage by normalized location (X lines for the entire assembly).
- normb=-1
- If positive, use a fixed number of bins per scaffold; affects 'normcov' and 'normcovo'.
- normc=f
- Normalize coverage to fraction of max per scaffold; affects 'normcov' and 'normcovo'.
- delta=f
- Only print base coverage lines when the coverage differs from the previous base.
- nzo=f
- Only print scaffolds with nonzero coverage.
- concise=f
- Write 'basecov' in a more concise format.
- header=t
- (hdr) Include headers in output files.
- headerpound=t
- (#) Prepend header lines with '#' symbol.
- stdev=t
- Calculate coverage standard deviation.
- covminscaf=0
- (minscaf) Don't print coverage for scaffolds shorter than this.
- covwindow=0
- Calculate how many bases are in windows of this size with low average coverage. Produces an extra stats column.
- covwindowavg=5
- Average coverage below this will be classified as low.
- k=0
- If positive, calculate kmer coverage statistics for this kmer length.
- keyvalue=f
- Output statistics to screen as key=value pairs.
Processing Parameters
- strandedcov=f
- Track coverage for plus and minus strand independently.
- startcov=f
- Only track start positions of reads.
- stopcov=f
- Only track stop positions of reads.
- secondary=t
- Use secondary alignments, if present.
- softclip=f
- Include soft-clipped bases in coverage.
- minmapq=0
- (minq) Ignore alignments with mapq below this.
- physical=f
- (physcov) Calculate physical coverage for paired reads. This includes the unsequenced bases.
- tlen=t
- Track physical coverage from the tlen field rather than recalculating it.
- arrays=auto
- Set to t/f to manually force the use of coverage arrays. Arrays and bitsets are mutually exclusive.
- bitsets=auto
- Set to t/f to manually force the use of coverage bitsets.
- 32bit=f
- Set to true if you need per-base coverage over 64k; does not affect per-scaffold coverage precision. This option will double RAM usage (when calculating per-base coverage).
- delcoverage=t
- (delcov) Count bases covered by deletions or introns as covered. True is faster than false.
- dupecoverage=t
- (dupes) Include reads flagged as duplicates in coverage.
- samstreamer=t
- (ss) Load reads multithreaded to increase speed.
Trimming Parameters
NOTE: These are applied before adding coverage, to allow mimicking tools like CallVariants, which uses 'qtrim=r trimq=10 border=5'
- qtrim=f
- Quality-trim. May be set to: f (false): Don't quality-trim. r (right): Trim right (3') end only. l (left): Trim left (5') end only. rl (both): Trim both ends.
- trimq=-1
- If positive, quality-trim to this threshold.
- border=0
- Ignore this many bases on the left and right end.
Output Columns (tab-delimited)
Standard output contains the following columns:
- ID
- Scaffold ID
- Length
- Scaffold length
- Ref_GC
- GC ratio of reference
- Avg_fold
- Average fold coverage of this scaffold
- Covered_percent
- Percent of scaffold with any coverage (only if arrays or bitsets are used)
- Covered_bases
- Number of bases with any coverage (only if arrays or bitsets are used)
- Plus_reads
- Number of reads mapped to plus strand
- Minus_reads
- Number of reads mapped to minus strand
- Read_GC
- Average GC ratio of reads mapped to this scaffold
- Median_fold
- Median fold coverage of this scaffold (only if arrays are used)
- Std_Dev
- Standard deviation of coverage (only if arrays are used)
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Coverage Statistics
pileup.sh in=aligned.sam out=coverage_stats.txt
Generates basic per-scaffold coverage statistics from a SAM file.
Per-Base Coverage with Reference
pileup.sh in=aligned.bam ref=reference.fasta basecov=per_base_coverage.txt out=scaffold_stats.txt
Calculates both per-base and per-scaffold coverage statistics using a reference genome for GC content calculation.
Coverage Histogram and Binned Output
pileup.sh in=mapped.sam hist=depth_histogram.txt bincov=binned_coverage.txt binsize=500 out=stats.txt
Generates a coverage depth histogram and binned coverage data with 500bp bins.
Physical Coverage for Paired Reads
pileup.sh in=paired.bam physical=t out=physical_coverage.txt
Calculates physical coverage including unsequenced bases between paired reads.
Strand-Specific Coverage Analysis
pileup.sh in=stranded.sam strandedcov=t basecov=coverage_#.txt out=stats_#.txt
Performs strand-specific coverage analysis. The '#' symbol is replaced with '1' for plus strand and '2' for minus strand in output filenames.
Quality Trimming with Coverage
pileup.sh in=reads.sam qtrim=r trimq=10 border=5 out=trimmed_coverage.txt
Applies quality trimming (right end, Q10 threshold) and border trimming (5bp) before calculating coverage, mimicking CallVariants behavior.
Algorithm Details
Adaptive Data Structure Selection
Pileup implements automatic data structure selection in CoveragePileup.createDataStructures() based on output requirements:
- BitSet Mode (java.util.BitSet): When only basic coverage statistics are needed, uses Java BitSets requiring just 1 bit per reference base plus 100 bytes per scaffold
- CoverageArray2 (16-bit): When detailed statistics (median, standard deviation, per-base output) are required, uses CoverageArray2 structures with char arrays for coverage values up to 65,535
- CoverageArray3 (32-bit): For high-coverage data requiring per-base coverage over 64k, uses CoverageArray3 with int arrays
- Automatic Selection Logic: The tool evaluates required outputs in vectorMode selection to choose the most memory-optimal data structure
Streaming SAM/BAM Processing via SamLineStreamer
The tool processes SAM/BAM files through two processing pathways:
- SamLineStreamer (Multi-threaded): Uses stream.SamLineStreamer class with configurable thread count for parallel SAM line processing when enabled with samstreamer=t
- ByteFile Sequential Processing: Falls back to fileIO.ByteFile for streaming input when SamLineStreamer is disabled or for stdin
- Header Processing: Dedicated processHeader() method extracts @SQ lines to build Scaffold objects with HashMap<String, Scaffold> lookup table
- Memory Management: Automatically manages memory allocation based on scaffold count through initialScaffolds parameter
Physical Coverage Calculation with Mate Pairing
For paired-end reads, physical coverage includes the unsequenced region between reads using HashMap-based mate tracking:
- Mate Hash Table: Uses HashMap<String, SamLine> pairTable to match read pairs during streaming
- Template Length Calculation: Uses TLEN field when USE_TLEN=true, otherwise calculates from read positions via Tools.max(stop1, stop2)-Tools.min(start1, start2)+1
- Proper Pair Validation: Only considers reads with sl.properPair() flag for physical coverage calculation
- Coverage Addition: Calls addCoverage() with combined start/stop positions covering the entire fragment
Quality and Border Trimming via TrimRead Integration
Applies trimming before coverage calculation using shared.TrimRead methods:
- Quality Trimming: Uses TrimRead.testOptimal() with Phred error probability conversion via QualityTools.phredToProbError()
- Border Trimming: Ignores specified number of bases from read ends with adaptive reduction near scaffold boundaries
- Match String Processing: Handles both short and long match strings via Read.toLongMatchString() for deletion-aware coverage
- Trimming Integration: TrimRead.trimReadWithMatch() modifies SamLine objects before coverage calculation
Coverage Increment Methods
- Array-based: CoverageArray.increment() and CoverageArray.incrementRange() methods for detailed tracking
- BitSet-based: BitSet.set() and BitSet.set(start, stop+1) for memory-constrained coverage presence tracking
- Deletion Handling: addCoverageIgnoringDeletions() parses match strings to exclude 'D' (deletion) positions from coverage
- Strand Separation: Uses scaf.obj0 for forward strand and scaf.obj1 for reverse strand when STRANDED=true
Performance Characteristics
- Memory Usage: Minimum 1 bit/base + 100 bytes/scaffold; 4+ bytes/base for detailed statistics
- Scalability: HashMap-based scaffold lookup provides O(1) access time for coverage addition
- Threading: Optional multithreaded SAM parsing via SamLineStreamer for improved I/O performance
- Stream Processing: ByteStreamWriter enables concurrent output writing for piped workflows
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org