FilterByCoverage

Script: filterbycoverage.sh; Package: jgi; Class: FilterByCoverage.java

Filters an assembly by contig coverage. The coverage stats file can be generated by BBMap or Pileup.

Basic Usage

filterbycoverage.sh in=<assembly> cov=<coverage stats> out=<filtered assembly> mincov=5

FilterByCoverage removes contigs from assemblies based on coverage statistics, helping to eliminate contamination, low-coverage artifacts, and poorly assembled regions. The tool requires a coverage statistics file generated by pileup.sh or bbmap.sh.

Parameters

Parameters control input/output files, filtering criteria, and processing options. Multiple filtering criteria are applied together using AND logic: a contig must pass every specified threshold to be retained.

Input/Output Parameters

in=<file>
File containing input assembly. Accepts fasta format (gzipped or uncompressed).
cov=<file>
File containing coverage stats generated by pileup.sh or bbmap.sh. This is the primary coverage file used for filtering decisions.
cov0=<file>
Optional file containing coverage stats before normalization. When provided, enables ratio-based filtering to identify contigs where coverage dropped significantly.
out=<file>
Destination of clean output assembly containing contigs that passed all filters.
outd=<file>
(outdirty) Destination of dirty output containing only removed contigs. Useful for analyzing what was filtered out.

Filtering Parameters

minc=5
(mincov) Discard contigs with lower average coverage. Default: 5x. Contigs with insufficient coverage depth are likely artifacts or contaminants.
minp=40
(minpercent) Discard contigs with a lower percent covered bases. Default: 40%. Contigs with large uncovered regions may be misassembled or chimeric.
minr=0
(minreads) Discard contigs with fewer mapped reads. Default: 0 (disabled). Useful for removing contigs supported by very few reads.
minl=1
(minlength) Discard contigs shorter than this (after trimming). Default: 1. Length filtering is applied after end-trimming.
trim=0
(trimends) Trim the first and last X bases of each sequence. Default: 0 (disabled). Removes potentially problematic contig ends before applying length filters.
ratio=0
If cov0 is set, contigs will not be removed unless the coverage ratio (of cov to cov0) is at least this. Default: 0 (disabled). The ratio therefore acts as an additional condition for removal: a contig that fails the other thresholds is still kept when its ratio is below this value.

File Handling Parameters

ow=t
(overwrite) Overwrites files that already exist. Default: true.
app=f
(append) Append to files that already exist. Default: false.
zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Default: 4. Higher values produce smaller files but require more CPU time.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Advanced Parameters

verbose
Print verbose output for debugging and monitoring progress.
basesundermin
Advanced parameter for filtering contigs based on the number of bases in low-coverage windows. Used internally for specialized filtering scenarios.
log=<file>
(results) Output detailed results to a log file containing per-contig filtering decisions.
logappend=f
(appendlog) Append to log file instead of overwriting. Default: false.
logheader=t
Include header line in log output. Default: true.

Examples

Basic Assembly Filtering

filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=filtered.fasta mincov=5

Removes contigs with average coverage below 5x, keeping only well-covered sequences.

Multi-Criteria Filtering

filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=clean.fasta outd=removed.fasta mincov=3 minp=50 minl=500

Applies multiple filters: minimum 3x coverage, at least 50% bases covered, and minimum length of 500bp. Removed contigs are saved separately.

Normalization-Based Filtering

filterbycoverage.sh in=assembly.fasta cov=coverage_after.txt cov0=coverage_before.txt out=filtered.fasta ratio=0.5

Uses coverage from before and after normalization. Following the ratio parameter description, a contig that fails the coverage thresholds is removed only if its coverage ratio (cov to cov0) is at least 0.5; contigs with a lower ratio are kept.

Conservative Filtering with Trimming

filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=trimmed.fasta mincov=2 trim=100 minl=1000

Trims 100bp from each end of contigs, then filters for minimum 2x coverage and 1000bp length. Helps remove problematic contig ends.
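
The trim-then-filter order described above can be sketched in Python. This is an illustration only, not the tool's Java code: the function names trim_ends and keep_after_trim are hypothetical, and returning an empty sequence when a contig is no longer than twice the trim length is an assumption about edge-case handling.

```python
def trim_ends(seq, trim):
    """Drop the first and last `trim` bases of a sequence."""
    if trim <= 0:
        return seq
    # Assumption: a contig no longer than 2*trim is reduced to nothing.
    return seq[trim:len(seq) - trim] if len(seq) > 2 * trim else ""


def keep_after_trim(seq, trim=100, minlength=1000):
    """Trim first, then apply the length threshold to the trimmed sequence."""
    trimmed = trim_ends(seq, trim)
    return trimmed if len(trimmed) >= minlength else None


long_contig = "A" * 1200   # 1000 bases remain after trimming
short_contig = "A" * 1150  # 950 bases remain, below minlength
print(keep_after_trim(long_contig) is not None)
print(keep_after_trim(short_contig) is not None)
```

Because the length check runs on the trimmed sequence, a contig needs at least minl + 2*trim bases before trimming to survive with these settings.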

Detailed Logging

filterbycoverage.sh in=assembly.fasta cov=coverage.txt out=filtered.fasta log=filtering_results.txt mincov=5 minp=40

Generates detailed log showing filtering decisions for each contig, useful for understanding what was removed and why.

Algorithm Details

Filtering Strategy

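FilterByCoverage evaluates each contig against every specified threshold. The criteria are combined with AND logic: a contig is retained only if it meets the minimum average coverage (mincov), the minimum percent of covered bases (minpercent), the minimum number of mapped reads (minreads), and the minimum length after trimming (minlength).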
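
The AND logic can be written as a single predicate. The following is a minimal Python sketch, not the tool's Java implementation; ContigStats and passes_filters are hypothetical names, and the default values simply mirror the documented defaults (mincov=5, minpercent=40, minreads=0, minlength=1).

```python
from dataclasses import dataclass


@dataclass
class ContigStats:
    """Hypothetical per-contig record; field names are illustrative."""
    avg_fold: float         # average coverage depth
    covered_percent: float  # percent of bases with any coverage
    reads: int              # number of mapped reads
    length: int             # contig length after any end-trimming


def passes_filters(s, mincov=5.0, minpercent=40.0, minreads=0, minlength=1):
    """Return True only if the contig meets every threshold (AND logic)."""
    return (
        s.avg_fold >= mincov
        and s.covered_percent >= minpercent
        and s.reads >= minreads
        and s.length >= minlength
    )


good = ContigStats(avg_fold=12.0, covered_percent=98.0, reads=300, length=2500)
patchy = ContigStats(avg_fold=12.0, covered_percent=25.0, reads=300, length=2500)
print(passes_filters(good), passes_filters(patchy))
```

The second contig has ample average coverage but fails on percent covered, so it is rejected even though it passes the other three checks.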

Coverage Statistics Integration

The tool reads BBTools coverage statistics files generated by pileup.sh or bbmap.sh. The values the filters depend on are each contig's average coverage, percent of covered bases, and number of mapped reads.
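
As an illustration of reading such a file, here is a small Python sketch that loads a tab-delimited coverage table into a dictionary keyed by contig ID. The column names (ID, Avg_fold, Length, Covered_percent, Plus_reads, Minus_reads) are an assumption based on the usual pileup.sh header and the sample is abbreviated; check the header of your own file. The function read_covstats is a hypothetical helper, not part of BBTools.

```python
import csv
import io


def read_covstats(handle):
    """Map contig ID -> dict of numeric columns, keyed by header name."""
    rows = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Assumption: the first line is a header that starts with '#'.
    header = [h.lstrip("#") for h in next(rows)]
    stats = {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        rec = dict(zip(header, row))
        stats[rec["ID"]] = {k: float(v) for k, v in rec.items() if k != "ID"}
    return stats


sample = (
    "#ID\tAvg_fold\tLength\tCovered_percent\tPlus_reads\tMinus_reads\n"
    "contig_1\t12.5\t2500\t98.2\t160\t152\n"
    "contig_2\t1.3\t800\t35.0\t4\t3\n"
)
stats = read_covstats(io.StringIO(sample))
print(sorted(stats))
print(stats["contig_2"]["Avg_fold"])
```

Looking columns up by header name rather than by position keeps the sketch tolerant of extra columns in the real file.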

Dual Coverage Analysis

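When both cov and cov0 files are provided, the tool compares each contig's coverage in the two files. The ratio of cov to cov0 is checked against the ratio parameter, and a contig is not removed unless that ratio is at least the configured value.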
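
One way to read the ratio rule is sketched below in Python. It follows the parameter description literally (removal requires the cov-to-cov0 ratio to be at least the ratio setting) and is not taken from the Java source. The names coverage_ratio and should_remove are hypothetical, and the small floor that guards against division by zero is an assumption.

```python
def coverage_ratio(cov, cov0, floor=0.01):
    """Ratio of cov to cov0; the floor is an assumed guard against zero."""
    return cov / max(cov0, floor)


def should_remove(fails_thresholds, cov, cov0=None, ratio=0.0):
    """Decide removal for one contig under the documented ratio rule."""
    if not fails_thresholds:
        return False  # contig passed every threshold, so it is kept
    if cov0 is None or ratio <= 0:
        return True   # no cov0 file or ratio disabled: thresholds decide
    # With cov0 set, removal additionally requires cov/cov0 >= ratio.
    return coverage_ratio(cov, cov0) >= ratio


print(should_remove(True, cov=4.0, cov0=5.0, ratio=0.5))
print(should_remove(True, cov=2.0, cov0=10.0, ratio=0.5))
```

In the first call the ratio is 0.8, so the failing contig is removed; in the second it is 0.2, so the contig is kept despite failing the thresholds.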

Memory Implementation

The implementation uses two HashMap<String, CovStatsLine> data structures (cslMap0 and cslMap1) for coverage statistics lookups, providing average O(1) access time for each contig ID. The CovStatsLine class stores coverage metrics parsed from pileup.sh or bbmap.sh output files. Memory usage for the maps scales linearly with the number of contigs (1024 initial capacity with automatic expansion). Sequences are read through ConcurrentReadInputStream streaming, so the assembly itself does not have to be held in memory at once; this minimizes peak memory requirements and enables processing of assemblies larger than available RAM.
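
The combination of an in-memory lookup table and streamed sequences can be illustrated with a short Python sketch. It is an analogy for the design, not a port of the Java classes: a plain dict stands in for the HashMap, and read_fasta is a hypothetical generator that yields one record at a time.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Only the per-contig statistics live in memory; sequences stream past.
cov_by_id = {"contig_1": 12.5, "contig_2": 1.3}
fasta = io.StringIO(">contig_1\nACGT\nACGT\n>contig_2\nTTTT\n>contig_3\nGG\n")
for name, seq in read_fasta(fasta):
    cov = cov_by_id.get(name)  # None when the contig has no stats entry
    print(name, len(seq), cov)
```

Only one sequence is held at a time, while the dictionary grows with the number of contigs, which matches the linear memory scaling described above.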

Stream Processing Architecture

FilterByCoverage can write multiple output streams simultaneously using ConcurrentReadOutputStream instances with 4-buffer capacity: the clean assembly (out) and, when requested, the removed contigs (outd).
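
The routing of each contig to a clean or dirty destination can be sketched as follows. This single-threaded Python example only illustrates the idea and does not model the buffering of the Java streams; route and its keep callback are hypothetical names.

```python
import io


def route(records, keep, clean, dirty=None):
    """Write kept contigs to `clean` and removed ones to `dirty` if given."""
    kept = removed = 0
    for name, seq in records:
        if keep(name, seq):
            clean.write(f">{name}\n{seq}\n")
            kept += 1
        else:
            if dirty is not None:  # the dirty output is optional
                dirty.write(f">{name}\n{seq}\n")
            removed += 1
    return kept, removed


clean_out, dirty_out = io.StringIO(), io.StringIO()
contigs = [("contig_1", "ACGTACGT"), ("contig_2", "TT")]
counts = route(contigs, lambda name, seq: len(seq) >= 4, clean_out, dirty_out)
print(counts)
print(dirty_out.getvalue(), end="")
```

Each contig is written exactly once, so the two outputs together contain every input contig when both destinations are set.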

Quality Control Implementation

The tool includes several quality control mechanisms implemented in the Java source:

Support

For questions and support: