PostFilter
Maps reads to an assembly, then filters the assembly by contig coverage. Intended to reduce the misassembly rate of SPAdes by removing suspicious contigs.
Basic Usage
postfilter.sh in=<reads> ref=<contigs> out=<filtered contigs>
Postfilter performs a two-step process: first it maps reads to the assembly using BBMap to generate coverage statistics, then it filters contigs based on coverage metrics using FilterByCoverage.
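As a rough illustration of the two steps, the sketch below prints a pair of commands that approximate one postfilter.sh run with default settings. The file names are placeholders. The arguments are copied from the parameter lists on this page, and it is an assumption that the standalone bbmap.sh and filterbycoverage.sh scripts accept exactly this set; Postfilter itself calls the Java classes directly rather than these scripts.

```shell
# Sketch only: prints two commands that roughly correspond to one
# postfilter.sh run. Nothing is executed, so BBTools is not required.
# File names are hypothetical placeholders.
reads=reads.fq
assembly=contigs.fa
covstats=covstats.txt

# Step 1: map reads to the assembly and write per-contig coverage statistics.
# (Assumed argument set, taken from the mapping parameters on this page.)
map_cmd="bbmap.sh in=$reads ref=$assembly nodisk ambig=all covstats=$covstats minhits=2 maxindel=0 tipsearch=0 bw=20 rescue=f"

# Step 2: filter the assembly using those statistics and the default thresholds.
filter_cmd="filterbycoverage.sh in=$assembly cov=$covstats out=filtered.fa minc=2 minp=95 minr=6 minl=400 trim=0"

printf '%s\n' "$map_cmd" "$filter_cmd"
```

Running the two steps separately like this can be useful when the coverage file should be reused with several different thresholds without mapping the reads again.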
Parameters
Parameters are grouped by their role in the postfiltering process: standard input and output options, filtering thresholds, mapping options passed to BBMap, and Java options.
Standard Parameters
- in=<file>
- File containing input reads. Can be fasta or fastq format, optionally gzipped.
- in2=<file>
- Optional file containing read mates for paired-end sequencing data.
- ref=<file>
- File containing input assembly to be filtered. Should be in fasta format.
- cov=covstats.txt
- File to write coverage stats generated by pileup during the mapping step.
- out=filtered.fa
- Destination of clean output assembly containing only contigs that pass filtering criteria.
- outdirty=<file>
- (outd) Destination of removed contigs; optional. Allows inspection of filtered contigs.
- ow=f
- (overwrite) Overwrites files that already exist. Default: false.
- app=f
- (append) Append to files that already exist. Default: false.
- zl=4
- (ziplevel) Set compression level for output files, 1 (low) to 9 (max). Default: 4.
- int=f
- (interleaved) Determines whether input reads are considered interleaved. Default: false.
Filtering Parameters
- minc=2
- (mincov) Discard contigs whose average coverage is below this value. Such contigs are considered unreliable.
- minp=95
- (minpercent) Discard contigs in which fewer than this percentage of bases are covered by reads.
- minr=6
- (minreads) Discard contigs with fewer than this many mapped reads. Contigs supported by so few reads are considered unreliable.
- minl=400
- (minlength) Discard contigs shorter than this length in bases. Very short contigs are often artifacts or low-quality sequences.
- trim=0
- (trimends) Trim the first and last X bases of each sequence before filtering analysis. Default: 0 (no trimming).
Mapping Parameters (unlisted params will use BBMap defaults)
- minhits=2
- Minimum number of kmer hits required for a mapping to be considered valid. Lower values allow more sensitive mapping but may increase false positives.
- maxindel=0
- Maximum indel length allowed in alignments. Set to 0 to disallow indels, which is appropriate for coverage calculation where exact matches are preferred.
- tipsearch=0
- Length of sequence tips to search for better alignments. Set to 0 to disable tip searching for faster processing.
- bw=20
- Bandwidth for banded alignment. Controls the width of the alignment band around the expected position.
- rescue=f
- Attempt to rescue an unmapped read by searching near the location of its mapped mate. Disabled by default for faster processing, since coverage statistics are the primary goal.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Assembly Filtering
postfilter.sh in=reads.fq ref=spades_contigs.fa out=filtered_contigs.fa
Filters a SPAdes assembly using the default thresholds, removing any contig with average coverage below 2x, fewer than 95% of bases covered, fewer than 6 mapped reads, or length under 400 bp.
Paired-End Reads with Custom Thresholds
postfilter.sh in=reads_R1.fq in2=reads_R2.fq ref=assembly.fa out=clean.fa minc=5 minp=90 minl=1000
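Uses paired-end reads with custom thresholds: the minimum coverage (5x) and minimum contig length (1000 bp) are stricter than the defaults, while the minimum percentage of covered bases (90%) is slightly more lenient than the default of 95%.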
Preserve Rejected Contigs
postfilter.sh in=reads.fq ref=contigs.fa out=good.fa outdirty=rejected.fa cov=coverage.txt
Saves rejected contigs to a separate file and writes detailed coverage statistics for analysis of filtering decisions.
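To see why particular contigs were removed, the rejected contig names can be looked up in the coverage file. The sketch below does this with awk on miniature stand-in files. It assumes the coverage file is tab-separated with the contig name in the first column and a header line starting with "#", and that names match the FASTA headers exactly; all file contents are invented for illustration.

```shell
# Hypothetical miniature versions of rejected.fa and coverage.txt.
workdir=$(mktemp -d)
printf '>contig_2\nACGT\n>contig_4\nTTGA\n' > "$workdir/rejected.fa"
printf '#ID\tAvg_fold\tLength\n'   >  "$workdir/coverage.txt"
printf 'contig_1\t35.2\t12000\n'   >> "$workdir/coverage.txt"
printf 'contig_2\t1.3\t2500\n'     >> "$workdir/coverage.txt"
printf 'contig_4\t4.1\t900\n'      >> "$workdir/coverage.txt"

# Print the header plus the coverage rows of every rejected contig.
report=$(awk -F'\t' '
  NR == FNR { if (/^>/) rejected[substr($0, 2)] = 1; next }  # first file: collect FASTA header names
  /^#/ || ($1 in rejected)                                   # second file: header and matching rows
' "$workdir/rejected.fa" "$workdir/coverage.txt")

printf '%s\n' "$report"
rm -rf "$workdir"
```

With real output, the same awk program can be pointed at the files given to outdirty= and cov=.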
Permissive Filtering for Metagenomics
postfilter.sh in=meta_reads.fq ref=meta_assembly.fa out=filtered.fa minc=1 minp=80 minr=3 minl=200
Uses more permissive parameters suitable for metagenomic assemblies where lower coverage contigs may still be biologically relevant.
Algorithm Details
Postfilter runs two phases in sequence. It first calls BBMap.main(), passing the mapping arguments collected in the mapArgs ArrayList, and then calls FilterByCoverage.main() with the filtering arguments.
Phase 1: Read Mapping and Coverage Calculation
The tool first uses BBMap to align input reads to the assembly with specific parameters optimized for coverage calculation:
- Ambiguous mapping enabled: Uses "ambig=all" so that a read with several equally good mapping sites is retained at all of them and counts toward the coverage of each
- In-memory index: Uses "nodisk" mode, which builds the reference index in memory instead of writing it to disk
- Conservative alignment: Default settings use minhits=2, maxindel=0 to ensure reliable alignments
- Bandwidth optimization: Uses bw=20 for efficient alignment within expected regions
Phase 2: Coverage-Based Filtering
After generating coverage statistics, the tool applies FilterByCoverage with the specified thresholds:
- Average coverage filtering: Removes contigs below minimum coverage threshold (minc)
- Breadth of coverage filtering: Removes contigs where fewer than minp% of bases have coverage
- Read count filtering: Removes contigs supported by fewer than minr mapped reads
- Length filtering: Removes contigs shorter than minl base pairs
- End trimming: Optionally trims sequence ends before analysis to remove low-quality regions
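The threshold checks listed above can be sketched with awk over a coverage table. This is an illustration of the documented rules, not the FilterByCoverage implementation. The column order (ID, Avg_fold, Length, Ref_GC, Covered_percent, Covered_bases, Plus_reads, Minus_reads) is an assumption about the coverage file layout, counting mapped reads as Plus_reads + Minus_reads is also an assumption, and the rows are invented. End trimming is not modeled.

```shell
# Hypothetical tab-separated coverage table with an assumed column layout.
covstats=$(mktemp)
printf '#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\tCovered_bases\tPlus_reads\tMinus_reads\n' > "$covstats"
printf 'contig_1\t35.2\t12000\t0.51\t100.0\t12000\t1400\t1410\n' >> "$covstats"
printf 'contig_2\t1.3\t2500\t0.48\t71.0\t1775\t11\t10\n'         >> "$covstats"
printf 'contig_3\t8.0\t300\t0.50\t100.0\t300\t8\t8\n'            >> "$covstats"
printf 'contig_4\t4.1\t900\t0.47\t96.5\t868\t3\t2\n'             >> "$covstats"

# Keep a contig only if it meets all four default thresholds.
kept=$(awk -F'\t' -v minc=2 -v minp=95 -v minr=6 -v minl=400 '
  /^#/ { next }   # skip the header line
  $2 >= minc && $5 >= minp && ($7 + $8) >= minr && $3 >= minl { print $1 }
' "$covstats")

printf '%s\n' "$kept"
rm -f "$covstats"
```

In this made-up table only contig_1 survives: contig_2 fails the coverage and breadth checks, contig_3 is shorter than minl, and contig_4 has too few mapped reads.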
Design Rationale
This approach targets common SPAdes assembly artifacts while aiming to retain genuine sequence:
- Chimeric contigs: Often have uneven coverage patterns that fail the breadth test
- Contamination sequences: Usually have very low coverage relative to the main genome
- Assembly errors: May create short, poorly-supported contigs that fail multiple filters
- Repetitive elements: Genuine repeats typically pass all filters due to consistent high coverage
Performance Characteristics
Memory usage is primarily determined by the BBMap alignment phase, typically requiring ~1GB per million reference bases for the index. The filtering phase is lightweight and processes coverage statistics efficiently. Processing time scales linearly with read count and assembly size.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org