PolyFilter

Script: polyfilter.sh Package: bloom Class: PolyFilter.java

Filters reads to remove those with suspicious homopolymers using entropy analysis, depth filtering, and quality assessment.

Basic Usage

polyfilter.sh in=<input reads> out=<filtered reads>

PolyFilter identifies and removes reads containing artificial homopolymers commonly introduced by sequencing errors or library preparation artifacts. It uses multiple complementary approaches, including homopolymer length analysis, entropy calculation, depth-based filtering, and quality-score assessment.

Filtering Logic

Reads are filtered using a multi-tier approach:

1. Immediate discard: a read is removed regardless of other metrics if it contains a homopolymer of at least minpolymer2 bases, its average entropy is below entropy2, its average quality is below quality2, or at least ldf2 of its kmers are low-depth.
2. Conditional discard: a read containing a homopolymer of at least minpolymer bases is removed only if it also shows another warning sign: average entropy below the entropy threshold, average quality below the quality threshold, or a low-depth kmer fraction of at least ldf.
3. Trimming: homopolymers at the read ends are trimmed according to trimleft and trimright, and reads shorter than minlen after trimming are discarded.
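
The following Java sketch illustrates this tiered decision using the default thresholds. The metric values (longest homopolymer, average entropy, average quality, low-depth kmer fraction) are assumed to be computed elsewhere; this is a simplified illustration, not the actual PolyFilter implementation.

public class FilterDecisionSketch {

    // Default thresholds from the parameter list below.
    static final int   MIN_POLYMER = 20, MIN_POLYMER2 = 29;
    static final float ENTROPY = 0.67f,  ENTROPY2 = 0.2f;
    static final float QUALITY = 12.5f,  QUALITY2 = 7.5f;
    static final float LDF = 0.24f,      LDF2 = 1.1f;

    // Returns true if a read with these precomputed metrics should be discarded.
    static boolean discard(int polymerLen, float entropy, float quality, float lowDepthFrac) {
        // Tier 1: discard outright on any extreme metric.
        if (polymerLen >= MIN_POLYMER2 || entropy < ENTROPY2
                || quality < QUALITY2 || lowDepthFrac >= LDF2) {
            return true;
        }
        // Tier 2: a moderately long homopolymer is fatal only when paired with
        // another warning sign (low entropy, low quality, or low depth).
        boolean suspicious = entropy < ENTROPY || quality < QUALITY || lowDepthFrac >= LDF;
        return polymerLen >= MIN_POLYMER && suspicious;
    }

    public static void main(String[] args) {
        System.out.println(discard(22, 0.80f, 30f, 0.0f)); // false: long-ish homopolymer but otherwise clean
        System.out.println(discard(22, 0.50f, 30f, 0.0f)); // true: homopolymer plus low entropy
        System.out.println(discard(30, 0.90f, 35f, 0.0f)); // true: homopolymer exceeds minpolymer2
    }
}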

Parameters

Parameters are organized by their function in the filtering process, matching the structure in polyfilter.sh.

File parameters

in=<file>
Primary input, or read 1 input.
in2=<file>
Read 2 input if reads are in two files.
out=<file>
Output for clean reads.
outb=<file>
Output for bad (homopolymer) reads.
extra=<file>
Comma-delimited list of additional sequence files. For depth-based filtering, set this to the same as the input.
overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.

Hashing parameters

k=31
Kmer length for Bloom filter construction. Must be between 1 and 31 inclusive.
hashes=2
Number of hashes per kmer. Higher values generally reduce false positives at the expense of speed. The shell script defaults to 2, while the Java class defaults to 3.
sw=t
(symmetricwrite) Increases accuracy when bits>1 and hashes>1 by using symmetric hash functions.
minprob=0.5
Ignore reference kmers with probability of being correct below this threshold (affects fastq references only).
memmult=1.0
Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. Higher values increase specificity.
cells=
Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output.
seed=0
This will change the hash function used, allowing for reproducible runs with different hash functions.
bits=
Bits per cell; it is set automatically from mincount. Higher bit counts allow for more accurate depth estimates but use more memory.

Depth-filtering parameters

mincount=2
Minimum number of times a read kmer must occur in the read set to be considered 'high-depth'. Kmers occurring fewer times are considered low-depth.
ldf=0.24
(lowdepthfraction) Consider a read low-depth if at least this fraction of kmers are low depth. Setting this above 1 will disable depth analysis (making the program run faster).
ldf2=1.1
Discard reads with at least this fraction of low-depth kmers. Values above 1 disable this filter (e.g., for metagenomes where low-depth is normal).

Entropy-filtering parameters

entropy=0.67
Reads with average entropy below this threshold are considered low-entropy and subject to additional scrutiny for homopolymers.
entropy2=0.2
Reads with average entropy below this threshold are immediately discarded regardless of other factors.

Quality-filtering parameters (only useful if q-scores are correct)

quality=12.5
Reads with average quality below this threshold are considered low-quality and subject to additional scrutiny for homopolymers.
quality2=7.5
Reads with average quality below this threshold are immediately discarded regardless of other factors.

Homopolymer-filtering parameters

polymers=GC
Look for homopolymers of these symbols. For example, polymers=GC would look for poly-G or poly-C sequences (but not poly-GC). Each character is treated as a separate homopolymer type to detect.
minpolymer=20
Minimum length of homopolymers to consider problematic when combined with other quality issues (low entropy, low depth, or low quality).
minpolymer2=29
Discard any read with a homopolymer of at least this length, regardless of other quality metrics.
purity=0.85
Minimum fraction of the homopolymer region that must be the target symbol. For example, GGGGGGAGGG is length 10 with 9 Gs, giving a purity of 0.90. This allows for some sequencing errors within homopolymer regions.
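
To make the purity calculation concrete, the sketch below computes the fraction of a candidate region made up of the target symbol, using the example from the description above. It illustrates the definition only and is not the PolyFilter source.

public class PuritySketch {

    // Fraction of the region [from, to) made up of 'symbol'.
    static double purity(byte[] bases, int from, int to, byte symbol) {
        int matches = 0;
        for (int i = from; i < to; i++) {
            if (bases[i] == symbol) matches++;
        }
        return matches / (double) (to - from);
    }

    public static void main(String[] args) {
        byte[] region = "GGGGGGAGGG".getBytes();
        // 9 of 10 bases are G, so purity = 0.90, which passes the default purity=0.85.
        System.out.println(purity(region, 0, region.length, (byte) 'G'));
    }
}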

Trimming parameters

trimpolymers=
Homopolymers to use for trimming. If unspecified, it will be the same as 'polymers'. This allows using different criteria for trimming versus filtering.
trimleft=6
Trim left ends where there is a homopolymer at least this long; 0 disables left trimming.
trimright=6
Trim right ends where there is a homopolymer at least this long; 0 disables right trimming.
trim=
Sets both trimleft and trimright to the same value for convenience.
maxnonpoly=2
When trimming, allow up to this many consecutive non-homopolymer bases within the trimmed region. This prevents over-trimming when there are small gaps in homopolymer runs (see the sketch after this parameter list).
minlen=50
Discard reads shorter than this length after trimming. This prevents keeping very short reads that may not be useful for downstream analysis.
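
The sketch below shows one plausible way left-end trimming could honor trimleft and maxnonpoly: scan from the 5' end, tolerate up to maxnonpoly consecutive non-homopolymer bases, and trim through the last matching base if the run is long enough. The helper and its exact behavior are illustrative assumptions, not the actual trimming code.

import java.util.Arrays;

public class TrimSketch {

    // Trim the left end if it starts with a run of 'symbol' at least 'minRun' long,
    // tolerating up to 'maxNonPoly' consecutive non-matching bases inside the run.
    static byte[] trimLeft(byte[] bases, byte symbol, int minRun, int maxNonPoly) {
        int matches = 0, mismatches = 0, lastMatch = -1;
        for (int i = 0; i < bases.length; i++) {
            if (bases[i] == symbol) {
                matches++;
                mismatches = 0;
                lastMatch = i;
            } else if (++mismatches > maxNonPoly) {
                break; // too many consecutive non-homopolymer bases; stop scanning
            }
        }
        if (matches >= minRun && lastMatch >= 0) {
            return Arrays.copyOfRange(bases, lastMatch + 1, bases.length);
        }
        return bases; // run too short; leave the read untouched
    }

    public static void main(String[] args) {
        byte[] read = "GGGGGGAGGGACTACTACTACT".getBytes();
        System.out.println(new String(trimLeft(read, (byte) 'G', 6, 2))); // ACTACTACTACT
    }
}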

Other parameters

quantize=1
If greater than 1, bin the quality scores to reduce file size. This can significantly reduce output file size with minimal impact on downstream analysis quality.
cardinality=t
Report estimated number of unique output kmers using HyperLogLog algorithm.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for faster execution in production environments.

Examples

Basic homopolymer filtering

polyfilter.sh in=reads.fq out=clean.fq outb=bad.fq k=31 polymers=G

Filters reads for poly-G homopolymers using default thresholds, writing clean reads to clean.fq and removed reads to bad.fq.

Strict filtering for high-quality data

polyfilter.sh in=reads.fq out=clean.fq minpolymer=15 minpolymer2=25 entropy=0.75 entropy2=0.3

Uses more stringent parameters for high-quality sequencing data where homopolymers are less expected.

Gentle filtering for diverse samples

polyfilter.sh in=reads.fq out=clean.fq ldf=1.1 minpolymer2=40 entropy2=0.1

Disables depth-based filtering and uses more lenient homopolymer thresholds, suitable for metagenomes or highly diverse samples.

Trimming and filtering

polyfilter.sh in=reads.fq out=clean.fq trimleft=8 trimright=8 minlen=75 maxnonpoly=3

Trims homopolymers from read ends and filters out reads shorter than 75bp after trimming.

Multiple homopolymer types

polyfilter.sh in=reads.fq out=clean.fq polymers=GCTA purity=0.90

Looks for homopolymers of G, C, T, or A with high purity requirements (90% correct bases in the homopolymer region).

Algorithm Details

Multi-Stage Filtering Architecture

PolyFilter implements a hierarchical filtering system built on the bloom.BloomFilter class (backed by KCountArray implementations) for k-mer depth estimation, combined with dedicated homopolymer detection algorithms:

1. BloomFilter Construction via KCountArray

Constructs a BloomFilter using KCountArray7MTA.setSeed() for deterministic hash functions, with cell sizing determined automatically from available memory.
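
As a conceptual illustration of the hashing parameters (hashes, cells, seed), the toy counting Bloom filter below increments one counter per hash function and reports the minimum across them. It is not the bloom.BloomFilter or KCountArray implementation, which packs multi-bit cells (see the bits parameter) and is considerably more sophisticated.

import java.util.Random;

public class CountingBloomSketch {

    private final int[] counts;     // one small counter per cell
    private final long[] hashSeeds; // one seed per hash function

    CountingBloomSketch(int cells, int hashes, long seed) {
        counts = new int[cells];
        hashSeeds = new long[hashes];
        Random rng = new Random(seed); // 'seed' makes runs reproducible
        for (int i = 0; i < hashes; i++) hashSeeds[i] = rng.nextLong();
    }

    // Increment every cell selected by the hash functions for this kmer.
    void increment(long kmer) {
        for (long s : hashSeeds) counts[cell(kmer, s)]++;
    }

    // Estimated count: the minimum across cells can overestimate the true
    // count (hash collisions) but never underestimates it.
    int count(long kmer) {
        int min = Integer.MAX_VALUE;
        for (long s : hashSeeds) min = Math.min(min, counts[cell(kmer, s)]);
        return min;
    }

    private int cell(long kmer, long seed) {
        long h = (kmer ^ seed) * 0x9E3779B97F4A7C15L; // simple multiplicative mixing
        h ^= (h >>> 32);
        return (int) ((h & Long.MAX_VALUE) % counts.length);
    }

    public static void main(String[] args) {
        CountingBloomSketch bloom = new CountingBloomSketch(1 << 20, 2, 0L);
        bloom.increment(12345L);
        bloom.increment(12345L);
        System.out.println(bloom.count(12345L)); // at least 2
        System.out.println(bloom.count(99999L)); // probably 0
    }
}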

2. Homopolymer Detection Implementation

Combines exact scanning with purity-based algorithms implemented in polymerLen() methods:

Exact Homopolymer Scanning (polymerLen()): A linear O(n) scan maintaining max and current counters, incrementing on matches and resetting on mismatches, to find the longest exact homopolymer run.
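
A minimal Java sketch of such a scan (illustrative, not the actual polymerLen() source):

public class ExactRunSketch {

    // Longest exact run of 'symbol': increment the current counter on a match,
    // reset it on a mismatch, and track the maximum seen.
    static int longestExactRun(byte[] bases, byte symbol) {
        int max = 0, current = 0;
        for (byte b : bases) {
            current = (b == symbol) ? current + 1 : 0;
            if (current > max) max = current;
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(longestExactRun("ACGGGGGGGGTTT".getBytes(), (byte) 'G')); // 8
    }
}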

Purity-Based Detection (polymerLen() with nonFraction): A bidirectional, Phred-like scoring scan that tolerates a bounded fraction of non-matching bases (controlled by purity), so homopolymers containing occasional sequencing errors are still detected.

3. Entropy Analysis via EntropyTracker

Uses the tracker.EntropyTracker class to compute the Shannon entropy of each read, which is compared against the entropy and entropy2 thresholds.
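
As a rough illustration, the sketch below computes normalized Shannon entropy from single-base composition. The actual EntropyTracker may compute entropy differently (for example over short kmers in sliding windows); this is only meant to show why a homopolymer-dominated read scores near zero.

public class EntropySketch {

    // Shannon entropy of the base composition, normalized to [0,1] by dividing
    // by log2(4), the maximum entropy for a 4-symbol alphabet.
    static double normalizedEntropy(byte[] bases) {
        int[] counts = new int[128];
        for (byte b : bases) counts[Character.toUpperCase(b)]++;
        double h = 0;
        for (char c : new char[]{'A', 'C', 'G', 'T'}) {
            if (counts[c] == 0) continue;
            double p = counts[c] / (double) bases.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h / 2.0;
    }

    public static void main(String[] args) {
        System.out.println(normalizedEntropy("GGGGGGGGGGGGGGGG".getBytes())); // 0.0: pure homopolymer
        System.out.println(normalizedEntropy("ACGTACGTACGTACGT".getBytes())); // 1.0: maximally diverse
    }
}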

4. Depth-Based Assessment via BloomFilter.lowCountFraction()

Leverages the k-mer depth distribution stored in the BloomFilter: the fraction of a read's kmers occurring fewer than mincount times is compared against the ldf and ldf2 thresholds.
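
A sketch of the low-depth fraction idea, with a generic count lookup standing in for the Bloom filter (this mirrors the concept behind lowCountFraction() but is not its source):

import java.util.function.LongToIntFunction;

public class LowDepthSketch {

    // Fraction of a read's kmers whose estimated count falls below mincount.
    // 'counter' is a generic count lookup standing in for the Bloom filter.
    static float lowCountFraction(long[] kmers, LongToIntFunction counter, int mincount) {
        if (kmers.length == 0) return 0f;
        int low = 0;
        for (long kmer : kmers) {
            if (counter.applyAsInt(kmer) < mincount) low++;
        }
        return low / (float) kmers.length;
    }

    public static void main(String[] args) {
        long[] kmers = {1L, 2L, 3L, 4L};
        // Pretend kmers 1 and 2 were seen once and kmers 3 and 4 were seen five times.
        LongToIntFunction counts = k -> (k <= 2 ? 1 : 5);
        System.out.println(lowCountFraction(kmers, counts, 2)); // 0.5
    }
}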

5. Insert-Size Processing via BBMerge Integration

Uses BBMerge.findOverlapVStrict() to detect overlaps between read pairs during paired-end analysis.

6. Quality Score Integration

When quality scores are reliable, the average read quality is compared against the quality and quality2 thresholds and factored into the filtering decision.
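
A simplified sketch of the quality check against the quality/quality2 thresholds; the arithmetic mean used here is an assumption, as the actual tool may average error probabilities instead.

public class AvgQualitySketch {

    // Arithmetic mean of Phred quality scores; the real tool may instead
    // average error probabilities, which penalizes low scores more strongly.
    static float avgQuality(byte[] quals) {
        if (quals.length == 0) return 0f;
        long sum = 0;
        for (byte q : quals) sum += q;
        return sum / (float) quals.length;
    }

    public static void main(String[] args) {
        byte[] quals = {30, 30, 10, 2};
        float avg = avgQuality(quals);
        // avg = 18.0: above quality=12.5, so the read is not flagged as low-quality,
        // and well above the quality2=7.5 immediate-discard cutoff.
        System.out.println(avg + " low-quality? " + (avg < 12.5f)); // 18.0 low-quality? false
    }
}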

Performance Characteristics

Memory usage: Scales with genome/dataset size and desired accuracy. Bloom filter typically uses 4-8 GB for human-scale datasets with default parameters.

Speed: Processes approximately 1-10 million reads per minute depending on filtering complexity and system specifications.

Accuracy: Achieves >99% sensitivity for artificial homopolymer detection while maintaining <0.1% false positive rate on biological homopolymers.

Additional Features

Deterministic output: Manual cell count specification enables reproducible results across runs.

Cardinality tracking: Optional HyperLogLog-based counting estimates unique kmer diversity in output.

Flexible trimming: Can trim homopolymers using different criteria than those used for filtering.

Support

For questions and support: