PostFilter

Script: postfilter.sh Package: assemble Class: Postfilter.java

Maps reads, then filters an assembly by contig coverage. Intended to reduce misassembly rate of SPAdes by removing suspicious contigs.

Basic Usage

postfilter.sh in=<reads> ref=<contigs> out=<filtered contigs>

Postfilter performs a two-step process: first it maps reads to the assembly using BBMap to generate coverage statistics, then it filters contigs based on coverage metrics using FilterByCoverage.

Parameters

Parameters are grouped by function: file handling, filtering thresholds, BBMap mapping settings, and Java options.

Standard Parameters

in=<file>
File containing input reads. Can be fasta or fastq format, optionally gzipped.
in2=<file>
Optional file containing read mates for paired-end sequencing data.
ref=<file>
File containing input assembly to be filtered. Should be in fasta format.
cov=covstats.txt
File to write coverage stats generated by pileup during the mapping step.
out=filtered.fa
Destination of clean output assembly containing only contigs that pass filtering criteria.
outdirty=<file>
(outd) Destination of removed contigs; optional. Allows inspection of filtered contigs.
ow=f
(overwrite) Overwrites files that already exist. Default: false.
app=f
(append) Append to files that already exist. Default: false.
zl=4
(ziplevel) Set compression level for output files, 1 (low) to 9 (max). Default: 4.
int=f
(interleaved) Determines whether input reads are considered interleaved. Default: false.

Filtering Parameters

minc=2
(mincov) Discard contigs with lower average coverage. Contigs with coverage below this threshold are considered unreliable.
minp=95
(minpercent) Discard contigs with a lower percentage of covered bases. Contigs in which fewer than this percentage of bases have coverage are filtered out.
minr=6
(minreads) Discard contigs with fewer mapped reads. Contigs supported by fewer reads are considered unreliable.
minl=400
(minlength) Discard shorter contigs. Very short contigs are often artifacts or low-quality sequences.
trim=0
(trimends) Trim the first and last X bases of each sequence before filtering analysis. Default: 0 (no trimming).
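
To preview how the thresholds above interact, the following sketch applies them to an existing coverage stats file. It is an illustration only, not Postfilter's implementation. The function name passing_contigs is hypothetical, and the column positions (1=ID, 2=Avg_fold, 3=Length, 5=Covered_percent, 7=Plus_reads, 8=Minus_reads) assume the tab-delimited covstats layout written by pileup; check the header of your own file before relying on them. The trim parameter is not modeled.

```shell
# passing_contigs COVSTATS [minc] [minp] [minr] [minl]
# Prints the IDs of contigs that meet every threshold.
# Assumed columns: 1=ID 2=Avg_fold 3=Length 5=Covered_percent
#                  7=Plus_reads 8=Minus_reads (tab-delimited).
passing_contigs() {
  awk -F'\t' \
      -v minc="${2:-2}" -v minp="${3:-95}" \
      -v minr="${4:-6}" -v minl="${5:-400}" '
    /^#/ { next }                      # skip the header line
    $2 >= minc && $5 >= minp && ($7 + $8) >= minr && $3 >= minl {
      print $1                         # contig passes all four tests
    }
  ' "$1"
}
```

For example, passing_contigs covstats.txt 2 75 lists the contigs that would survive if only minp were relaxed to 75, which can help when choosing thresholds before rerunning the filter.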

Mapping Parameters (unlisted params will use BBMap defaults)

minhits=2
Minimum number of kmer hits required for a mapping to be considered valid. Lower values allow more sensitive mapping but may increase false positives.
maxindel=0
Maximum indel length allowed in alignments. Set to 0 to disallow indels, which is appropriate for coverage calculation where exact matches are preferred.
tipsearch=0
Length of sequence tips to search for better alignments. Set to 0 to disable tip searching for faster processing.
bw=20
Bandwidth for banded alignment. Controls the width of the alignment band around the expected position.
rescue=f
Attempt to rescue unmapped reads by searching for their mates. Disabled by default for faster processing since coverage statistics are the primary goal.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Basic Assembly Filtering

postfilter.sh in=reads.fq ref=spades_contigs.fa out=filtered_contigs.fa

Filters SPAdes assembly contigs using default parameters, removing contigs with coverage < 2x, < 95% covered bases, < 6 supporting reads, or < 400bp length.

Paired-End Reads with Custom Thresholds

postfilter.sh in=reads_R1.fq in2=reads_R2.fq ref=assembly.fa out=clean.fa minc=5 minp=90 minl=1000

Uses paired-end reads with custom thresholds: a stricter minimum coverage (5x) and minimum contig length (1000bp), but a looser covered-bases requirement (90% instead of the default 95%).

Preserve Rejected Contigs

postfilter.sh in=reads.fq ref=contigs.fa out=good.fa outdirty=rejected.fa cov=coverage.txt

Saves rejected contigs to a separate file and writes detailed coverage statistics for analysis of filtering decisions.
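
When inspecting the two outputs, a quick tally of contigs and bases in each file shows how much of the assembly was removed. The helper below is a minimal sketch; the name fasta_summary is hypothetical, and it assumes uncompressed fasta input with Unix line endings.

```shell
# fasta_summary FILE
# Prints "<contig count> <total bases>" for an uncompressed fasta file.
fasta_summary() {
  awk '
    /^>/ { n++; next }                 # header line: one more contig
    { bases += length($0) }            # sequence line: add its length
    END { printf "%d %d\n", n, bases }
  ' "$1"
}

# Hypothetical usage after the command above has finished:
#   fasta_summary good.fa
#   fasta_summary rejected.fa
```

Comparing the two summaries against coverage.txt makes it easier to judge whether the thresholds removed mostly short, low-coverage contigs or something larger than expected.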

Permissive Filtering for Metagenomics

postfilter.sh in=meta_reads.fq ref=meta_assembly.fa out=filtered.fa minc=1 minp=80 minr=3 minl=200

Uses more permissive parameters suitable for metagenomic assemblies where lower coverage contigs may still be biologically relevant.

Algorithm Details

Postfilter runs a two-phase pipeline by calling BBMap.main() and then FilterByCoverage.main() in sequence. The mapping arguments are collected in an ArrayList named mapArgs before being passed to BBMap.

Phase 1: Read Mapping and Coverage Calculation

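The tool first uses BBMap to align the input reads to the assembly with parameters chosen for coverage calculation: minhits=2, maxindel=0, tipsearch=0, bw=20, and rescue=f (see Mapping Parameters above). The resulting per-contig coverage statistics are written to the file given by cov.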

Phase 2: Coverage-Based Filtering

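After generating coverage statistics, the tool runs FilterByCoverage with the specified thresholds (defaults: minc=2, minp=95, minr=6, minl=400, trim=0). A contig that fails any one threshold is removed; passing contigs are written to out, and removed contigs go to outdirty if it is set.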
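
The two phases described above can be approximated by running the underlying tools separately. The sketch below only prints the two commands rather than executing them. The function name print_postfilter_steps is hypothetical, and the flag names passed to filterbycoverage.sh are assumed to mirror the filtering parameters listed in this document; confirm them against each script's help text before use. Postfilter may pass additional arguments internally that are not shown here.

```shell
# print_postfilter_steps READS CONTIGS OUT [COVSTATS]
# Prints (does not run) an approximate two-command equivalent of
# postfilter.sh using the default mapping and filtering values.
print_postfilter_steps() {
  reads="$1"; contigs="$2"; out="$3"; cov="${4:-covstats.txt}"

  # Phase 1: map reads and write per-contig coverage statistics.
  echo "bbmap.sh in=$reads ref=$contigs covstats=$cov minhits=2 maxindel=0 tipsearch=0 bw=20 rescue=f"

  # Phase 2: filter the assembly using those statistics.
  # Flag names here are an assumption based on the parameter list above.
  echo "filterbycoverage.sh in=$contigs cov=$cov out=$out minc=2 minp=95 minr=6 minl=400 trim=0"
}
```

Printing the commands first makes it possible to review or adjust individual flags, such as reusing an existing coverage file and rerunning only the second step with different thresholds.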

Design Rationale

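This approach is intended to reduce the misassembly rate of SPAdes assemblies. Contigs that are poorly supported by the reads, whether through low average coverage, incompletely covered bases, few mapped reads, or very short length, are treated as suspicious and removed.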

Performance Characteristics

Memory usage is dominated by the BBMap alignment phase, which must index the assembly; BBMap's own documentation cites roughly 6 bytes per reference base for the index. The filtering phase is lightweight by comparison. Processing time grows roughly linearly with the number of reads and also increases with assembly size.

Support

For questions and support: