FilterByCoverage
Filters an assembly by contig coverage. The coverage stats file can be generated by BBMap or Pileup.
Basic Usage
filterbycoverage.sh in=<assembly> cov=<coverage stats> out=<filtered assembly> mincov=5
FilterByCoverage removes contigs from assemblies based on coverage statistics, helping to eliminate contamination, low-coverage artifacts, and poorly assembled regions. The tool requires a coverage statistics file generated by pileup.sh or bbmap.sh.
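For example, assuming reads have already been mapped to the assembly (the file names below are placeholders), a suitable coverage statistics file can be generated with pileup.sh from an existing SAM/BAM file, or written directly during mapping with bbmap.sh:
pileup.sh in=mapped.bam out=covstats.txt
bbmap.sh ref=assembly.fasta in=reads.fq covstats=covstats.txt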
Parameters
Parameters control input/output files, filtering criteria, and processing options. Multiple filtering criteria are applied simultaneously using AND logic: a contig must pass all specified thresholds to be retained.
Input/Output Parameters
- in=<file>
- File containing input assembly. Accepts fasta format (gzipped or uncompressed).
- cov=<file>
- File containing coverage stats generated by pileup.sh or bbmap.sh. This is the primary coverage file used for filtering decisions.
- cov0=<file>
- Optional file containing coverage stats before normalization. When provided, enables ratio-based filtering to identify contigs where coverage dropped significantly.
- out=<file>
- Destination of clean output assembly containing contigs that passed all filters.
- outd=<file>
- (outdirty) Destination of dirty output containing only removed contigs. Useful for analyzing what was filtered out.
Filtering Parameters
- minc=5
- (mincov) Discard contigs with average coverage lower than this. Default: 5x. Contigs with insufficient coverage depth are likely artifacts or contaminants.
- minp=40
- (minpercent) Discard contigs with a lower percentage of covered bases than this. Default: 40%. Contigs with large uncovered regions may be misassembled or chimeric.
- minr=0
- (minreads) Discard contigs with fewer mapped reads than this. Default: 0 (disabled). Useful for removing contigs supported by very few reads.
- minl=1
- (minlength) Discard contigs shorter than this (after trimming). Default: 1. Length filtering is applied after end-trimming.
- trim=0
- (trimends) Trim the first and last X bases of each sequence. Default: 0 (disabled). Removes potentially problematic contig ends before applying length filters.
- ratio=0
- If cov0 is set, contigs will not be removed unless the coverage ratio (of cov to cov0) is at least this. Default: 0 (disabled). Identifies contigs where coverage dropped during normalization, indicating potential contamination.
File Handling Parameters
- ow=t
- (overwrite) Overwrites files that already exist. Default: true.
- app=f
- (append) Append to files that already exist. Default: false.
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Default: 4. Higher values produce smaller files but require more CPU time.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Advanced Parameters
- verbose
- Print verbose output for debugging and monitoring progress.
- basesundermin
- Advanced parameter for filtering contigs based on the number of bases in low-coverage windows. Used internally for specialized filtering scenarios.
- log=<file>
- (results) Output detailed results to a log file containing per-contig filtering decisions.
- logappend=f
- (appendlog) Append to log file instead of overwriting. Default: false.
- logheader=t
- Include header line in log output. Default: true.
Examples
Basic Assembly Filtering
filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=filtered.fasta mincov=5
Removes contigs with average coverage below 5x, keeping only well-covered sequences.
Multi-Criteria Filtering
filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=clean.fasta outd=removed.fasta mincov=3 minp=50 minl=500
Applies multiple filters: minimum 3x coverage, at least 50% of bases covered, and a minimum length of 500 bp. Removed contigs are saved separately.
Normalization-Based Filtering
filterbycoverage.sh in=assembly.fasta cov=coverage_after.txt cov0=coverage_before.txt out=filtered.fasta ratio=0.5
Uses before/after normalization coverage to identify contaminants. Retains contigs where coverage didn't drop more than 50% during normalization.
Conservative Filtering with Trimming
filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=trimmed.fasta mincov=2 trim=100 minl=1000
Trims 100bp from each end of contigs, then filters for minimum 2x coverage and 1000bp length. Helps remove problematic contig ends.
Detailed Logging
filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=filtered.fasta log=filtering_results.txt mincov=5 minp=40
Generates detailed log showing filtering decisions for each contig, useful for understanding what was removed and why.
Algorithm Details
Filtering Strategy
FilterByCoverage implements a multi-criteria filtering approach that evaluates each contig against all specified thresholds. The filtering logic uses AND operations: contigs must satisfy ALL criteria to be retained (see the sketch after this list):
- Coverage depth filtering: Removes contigs with average coverage below minc threshold
- Coverage breadth filtering: Removes contigs where less than minp% of bases have coverage
- Read support filtering: Removes contigs supported by fewer than minr mapped reads
- Length filtering: Applied after optional end-trimming, removes short contigs
- Ratio-based filtering: When cov0 is provided, identifies contigs where coverage dropped significantly
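The combined check can be summarized in a short sketch. This is illustrative Java only; the class, method, and parameter names are assumptions and do not reproduce the actual FilterByCoverage.java code.

public class FilterDecisionSketch {

    /** A contig is kept only if it passes every enabled threshold (AND logic). */
    static boolean keepContig(double avgFold, double coveredPercent,
                              long mappedReads, int lengthAfterTrim,
                              double minCov, double minPercent,
                              long minReads, int minLength) {
        return avgFold >= minCov              // minc: coverage depth
            && coveredPercent >= minPercent   // minp: coverage breadth
            && mappedReads >= minReads        // minr: read support
            && lengthAfterTrim >= minLength;  // minl: length after trimming
    }

    public static void main(String[] args) {
        // Synthetic example: 2.1x average coverage fails the default minc=5.
        System.out.println(keepContig(2.1, 85.0, 40, 800, 5.0, 40.0, 0, 1)); // prints false
    }
}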
Coverage Statistics Integration
The tool integrates with BBTools coverage statistics files generated by pileup.sh or bbmap.sh, parsing per-contig coverage data that includes the following (a parsing sketch follows the list):
- Average fold coverage per contig
- Number of mapped reads (plus and minus strand)
- Percent of bases with coverage
- Coverage distribution statistics
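A minimal parsing sketch is shown below. The column names are assumptions based on typical pileup.sh covstats headers, and the class is illustrative rather than the real CovStatsLine implementation in BBTools.

import java.util.HashMap;
import java.util.Map;

public class CovStatsLineSketch {
    final String id;
    final double avgFold, coveredPercent;
    final long plusReads, minusReads;

    CovStatsLineSketch(String id, double avgFold, double coveredPercent,
                       long plusReads, long minusReads) {
        this.id = id;
        this.avgFold = avgFold;
        this.coveredPercent = coveredPercent;
        this.plusReads = plusReads;
        this.minusReads = minusReads;
    }

    /** Parses one tab-delimited covstats data line, locating columns by header name. */
    static CovStatsLineSketch parse(String headerLine, String dataLine) {
        Map<String, Integer> col = new HashMap<>();
        String[] h = headerLine.replaceFirst("^#", "").split("\t");
        for (int i = 0; i < h.length; i++) col.put(h[i], i);
        String[] f = dataLine.split("\t");
        return new CovStatsLineSketch(
            f[col.get("ID")],
            Double.parseDouble(f[col.get("Avg_fold")]),
            Double.parseDouble(f[col.get("Covered_percent")]),
            Long.parseLong(f[col.get("Plus_reads")]),
            Long.parseLong(f[col.get("Minus_reads")]));
    }

    public static void main(String[] args) {
        // Synthetic header and data line for illustration only.
        String header = "#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\tCovered_bases\tPlus_reads\tMinus_reads";
        String line = "contig_1\t12.7\t45210\t0.41\t98.2\t44396\t2710\t2688";
        CovStatsLineSketch c = parse(header, line);
        System.out.println(c.id + " avgFold=" + c.avgFold + " reads=" + (c.plusReads + c.minusReads));
    }
}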
Dual Coverage Analysis
When both cov and cov0 files are provided, the tool performs a comparative analysis (sketched after this list):
- Calculates coverage ratio (cov0/cov) for each contig using Tools.max(0.01, csl1.avgFold) to prevent division by zero
- Identifies contigs where coverage dropped during normalization
- Applies ratio threshold to prevent removal of contigs with stable coverage
- Useful for identifying contamination that becomes apparent after normalization
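The ratio calculation described above can be sketched as follows. The direction of the threshold comparison is an assumption based on the ratio parameter description; consult FilterByCoverage.java for the authoritative logic.

public class CoverageRatioSketch {

    /** Ratio of pre-normalization (cov0) to post-normalization (cov) coverage,
     *  with the denominator clamped to avoid division by zero. */
    static double coverageRatio(double avgFoldBefore, double avgFoldAfter) {
        return avgFoldBefore / Math.max(0.01, avgFoldAfter);
    }

    /** Assumption: the ratio criterion is met only when the ratio reaches the
     *  configured threshold; ratio=0 disables the check. */
    static boolean ratioCriterionMet(double avgFoldBefore, double avgFoldAfter, double minRatio) {
        if (minRatio <= 0) return false;
        return coverageRatio(avgFoldBefore, avgFoldAfter) >= minRatio;
    }

    public static void main(String[] args) {
        // Synthetic values: coverage fell from 30x before normalization to 0.5x after.
        System.out.println(coverageRatio(30.0, 0.5));          // prints 60.0
        System.out.println(ratioCriterionMet(30.0, 0.5, 2.0)); // prints true
    }
}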
Memory Implementation
The implementation uses two HashMap<String, CovStatsLine> data structures (cslMap0 and cslMap1) for coverage statistics lookups, providing O(1) access time for each contig ID. The CovStatsLine class stores coverage metrics parsed from pileup.sh or bbmap.sh output files. Memory usage scales linearly with the number of contigs (1024 initial capacity with automatic expansion). The tool processes sequences using ConcurrentReadInputStream streaming to minimize peak memory requirements and enable processing of assemblies larger than available RAM.
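A compact sketch of that lookup structure, assuming contig ID and Avg_fold occupy the first two columns of the covstats file (as in pileup.sh output), might look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CovMapSketch {
    public static void main(String[] args) throws IOException {
        // Coverage stats keyed by contig ID; memory grows with the number of contigs only.
        Map<String, Double> avgFoldById = new HashMap<>(1024);
        try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) continue; // skip header/blank lines
                String[] f = line.split("\t");
                avgFoldById.put(f[0], Double.parseDouble(f[1])); // assumed columns: ID, Avg_fold
            }
        }
        System.out.println("Loaded coverage for " + avgFoldById.size() + " contigs");
        // During filtering, each contig's stats are then an O(1) lookup:
        // Double avgFold = avgFoldById.get(contigId);
    }
}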
Stream Processing Architecture
FilterByCoverage generates multiple output streams simultaneously using ConcurrentReadOutputStream instances with 4-buffer capacity (a simplified sketch follows the list):
- Clean output: Contigs passing all filters via rosClean stream
- Dirty output: Contigs failing any filter via rosDirty stream (optional)
- Log output: Per-contig filtering decisions via TextStreamWriter with configurable tab-delimited format including assembly name, contig ID, contamination flag, length, coverage metrics, and ratio calculations
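The clean/dirty/log fan-out can be sketched as below, using plain Java writers in place of BBTools' concurrent stream classes; the log columns shown are illustrative only.

import java.io.IOException;
import java.io.PrintWriter;

public class OutputRoutingSketch {

    /** Writes a contig to the clean or dirty output and records the decision in the log. */
    static void route(String contigId, String sequence, boolean contaminant,
                      double avgFold, double ratio,
                      PrintWriter clean, PrintWriter dirty, PrintWriter log) {
        PrintWriter dest = contaminant ? dirty : clean;
        if (dest != null) {
            dest.println(">" + contigId);
            dest.println(sequence);
        }
        if (log != null) {
            // Illustrative tab-delimited log line: ID, contaminant flag, length, coverage, ratio.
            log.println(contigId + "\t" + contaminant + "\t" + sequence.length()
                + "\t" + avgFold + "\t" + ratio);
        }
    }

    public static void main(String[] args) throws IOException {
        try (PrintWriter clean = new PrintWriter("clean.fa");
             PrintWriter dirty = new PrintWriter("dirty.fa");
             PrintWriter log = new PrintWriter("results.log")) {
            // Synthetic contig flagged as a contaminant for demonstration.
            route("contig_1", "ACGTACGT", true, 0.8, 12.5, clean, dirty, log);
        }
    }
}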
Quality Control Implementation
The tool includes several quality control mechanisms implemented in the Java source:
- Validates coverage file format and headers using CovStatsLine.initializeHeader() method
- Ensures coverage statistics exist for all processed contigs via HashMap lookups
- Reports detailed statistics, including processing rates (reads and bases per second) computed from nanosecond timers
- Provides error handling for file operations using ReadWrite.closeStreams() with errorState tracking
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org