PolyFilter
Filters reads to remove those with suspicious homopolymers using entropy analysis, depth filtering, and quality assessment.
Basic Usage
polyfilter.sh in=<input reads> out=<filtered reads>
PolyFilter identifies and removes reads containing artificial homopolymers, which are commonly introduced by sequencing errors or library-preparation artifacts. The tool uses multiple complementary approaches: homopolymer length analysis, entropy calculation, depth-based filtering, and quality-score assessment.
Filtering Logic
Reads are filtered using a multi-tier approach:
- Strict filters: Reads are always discarded if they fail ldf2, entropy2, or minpolymer2 thresholds
- Combined filters: Reads are discarded if they fail minpolymer AND at least one of (ldf OR entropy)
- Pair filtering: If either read in a pair is discarded, both reads are removed
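The tiers above can be sketched as a simplified Python model (parameter names are borrowed from the shell script; this is an illustration of the decision structure, not the actual Java implementation):

```python
def discard_read(poly_len, low_depth_frac, entropy, avg_quality,
                 minpolymer=20, minpolymer2=29, ldf=0.24, ldf2=1.1,
                 entropy_cut=0.67, entropy2=0.2,
                 quality_cut=12.5, quality2=7.5):
    """Simplified model of the multi-tier filtering decision."""
    # Strict filters: always discard, regardless of other metrics.
    if poly_len >= minpolymer2:
        return True
    if low_depth_frac >= ldf2 or entropy < entropy2 or avg_quality < quality2:
        return True
    # Combined filter: a long homopolymer plus at least one other weak signal.
    weak = (low_depth_frac >= ldf or entropy < entropy_cut
            or avg_quality < quality_cut)
    return poly_len >= minpolymer and weak
```

Note that with the default ldf2=1.1, the strict depth filter can never trigger, since a fraction cannot exceed 1.0; it only becomes active if lowered.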
Parameters
Parameters are organized by their function in the filtering process, matching the structure in polyfilter.sh.
File parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Output for clean reads.
- outb=<file>
- Output for bad (homopolymer) reads.
- extra=<file>
- Comma-delimited list of additional sequence files. For depth-based filtering, set this to the same as the input.
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
Hashing parameters
- k=31
- Kmer length for Bloom filter construction. Must be between 1 and 31 inclusive.
- hashes=2
- Number of hashes per kmer. Higher values generally reduce false positives at the expense of speed. Note that the default is 3 in the Java code but 2 in the shell script.
- sw=t
- (symmetricwrite) Increases accuracy when bits>1 and hashes>1 by using symmetric hash functions.
- minprob=0.5
- Ignore reference kmers with probability of being correct below this threshold (affects fastq references only).
- memmult=1.0
- Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. Higher values increase specificity.
- cells=
- Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output.
- seed=0
- Sets the seed for the hash function. Changing it selects a different hash function, allowing reproducible runs that each use a distinct hash function.
- bits=
- Bits per cell; it is set automatically from mincount. Higher bit counts allow for more accurate depth estimates but use more memory.
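To illustrate how k, hashes, cells, and seed interact, here is a toy counting Bloom filter in Python. It is only a stand-in for BBTools' KCountArray backend; the class and method names are invented for illustration, and the real implementation uses packed atomic arrays rather than a Python list:

```python
import hashlib

class CountingBloom:
    """Toy counting Bloom filter: each kmer increments `hashes` counters
    chosen by seeded hash functions over a shared cell array."""
    def __init__(self, cells=1 << 20, hashes=2, seed=0):
        self.counts = [0] * cells
        self.cells, self.hashes, self.seed = cells, hashes, seed

    def _cell_indices(self, kmer):
        for h in range(self.hashes):
            digest = hashlib.blake2b(kmer.encode(),
                                     key=bytes([self.seed & 0xFF, h])).digest()
            yield int.from_bytes(digest[:8], "big") % self.cells

    def increment(self, kmer):
        for c in self._cell_indices(kmer):
            self.counts[c] += 1

    def count(self, kmer):
        # The minimum over all cells is an upper bound on the true count;
        # collisions can inflate it but never deflate it.
        return min(self.counts[c] for c in self._cell_indices(kmer))
```

Because lookups take the minimum over several cells, more hashes reduce overestimation from collisions but cost more work per kmer, which is the speed/specificity trade-off described for the hashes parameter.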
Depth-filtering parameters
- mincount=2
- Minimum number of times a read kmer must occur in the read set to be considered 'high-depth'. Kmers occurring fewer times are considered low-depth.
- ldf=0.24
- (lowdepthfraction) Consider a read low-depth if at least this fraction of kmers are low depth. Setting this above 1 will disable depth analysis (making the program run faster).
- ldf2=1.1
- Discard reads with at least this fraction of low-depth kmers. Values above 1 disable this filter (e.g., for metagenomes where low-depth is normal).
Entropy-filtering parameters
- entropy=0.67
- Reads with average entropy below this threshold are considered low-entropy and subject to additional scrutiny for homopolymers.
- entropy2=0.2
- Reads with average entropy below this threshold are immediately discarded regardless of other factors.
Quality-filtering parameters (only useful if q-scores are correct)
- quality=12.5
- Reads with average quality below this threshold are considered low-quality and subject to additional scrutiny for homopolymers.
- quality2=7.5
- Reads with average quality below this threshold are immediately discarded regardless of other factors.
Homopolymer-filtering parameters
- polymers=GC
- Look for homopolymers of these symbols. For example, polymers=GC would look for poly-G or poly-C sequences (but not poly-GC). Each character is treated as a separate homopolymer type to detect.
- minpolymer=20
- Minimum length of homopolymers to consider problematic when combined with other quality issues (low entropy, low depth, or low quality).
- minpolymer2=29
- Discard any read with a homopolymer of at least this length, regardless of other quality metrics.
- purity=0.85
- Minimum fraction of the homopolymer region that must be the target symbol. For example, GGGGGGAGGG is length 10 with 9 Gs, giving a purity of 0.90. This allows for some sequencing errors within homopolymer regions.
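The purity computation itself can be checked directly. This is a trivial sketch; in the real tool, the region boundaries are found by the scanning algorithm rather than given in advance:

```python
def purity(region, symbol):
    """Fraction of a candidate homopolymer region matching the target base."""
    return region.count(symbol) / len(region)

# The example from the text: GGGGGGAGGG is length 10 with 9 Gs.
print(purity("GGGGGGAGGG", "G"))  # 0.9
```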
Trimming parameters
- trimpolymers=
- Homopolymers to use for trimming. If unspecified, it will be the same as 'polymers'. This allows using different criteria for trimming versus filtering.
- trimleft=6
- Trim left ends where there is a homopolymer at least this long; 0 disables left trimming.
- trimright=6
- Trim right ends where there is a homopolymer at least this long; 0 disables right trimming.
- trim=
- Sets both trimleft and trimright to the same value for convenience.
- maxnonpoly=2
- When trimming, allow up to this many consecutive non-homopolymer bases within the trimmed region. This prevents over-trimming when there are small gaps in homopolymer runs.
- minlen=50
- Discard reads shorter than this length after trimming. This prevents keeping very short reads that may not be useful for downstream analysis.
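A possible sketch of left-end trimming with the maxnonpoly allowance (the function name and exact boundary rules are hypothetical; the authoritative behavior is in the Java code):

```python
def trim_left_len(seq, base, trimleft=6, maxnonpoly=2):
    """Length to trim from the left end: extend through the homopolymer,
    tolerating up to maxnonpoly consecutive non-matching bases."""
    last_match, gap = -1, 0
    for i, c in enumerate(seq):
        if c == base:
            last_match, gap = i, 0
        else:
            gap += 1
            if gap > maxnonpoly:
                break
    run = last_match + 1  # trim through the last matching base
    return run if run >= trimleft else 0
```

For example, a leading run of seven Gs followed by two As would be trimmed in full, while a run of only three Gs falls below trimleft=6 and is left alone.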
Other parameters
- quantize=1
- If greater than 1, bin the quality scores to reduce file size. This can significantly reduce output file size with minimal impact on downstream analysis quality.
- cardinality=t
- Report estimated number of unique output kmers using HyperLogLog algorithm.
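One plausible interpretation of quality binning is rounding each score to the nearest multiple of the quantize value (the exact binning rule is defined in the Java code; quantize_qual is an illustrative name):

```python
def quantize_qual(quals, quantize=4):
    """Bin quality scores to the nearest multiple of `quantize`,
    shrinking the quality alphabet so files compress better."""
    if quantize <= 1:
        return list(quals)  # quantize=1 leaves scores unchanged
    return [round(q / quantize) * quantize for q in quals]
```

Fewer distinct quality values means longer runs of identical characters in the quality string, which gzip-style compression exploits.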
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for faster execution in production environments.
Examples
Basic homopolymer filtering
polyfilter.sh in=reads.fq out=clean.fq outb=bad.fq k=31 polymers=G
Filters reads for poly-G homopolymers using default thresholds, writing clean reads to clean.fq and removed reads to bad.fq.
Strict filtering for high-quality data
polyfilter.sh in=reads.fq out=clean.fq minpolymer=15 minpolymer2=25 entropy=0.75 entropy2=0.3
Uses more stringent parameters for high-quality sequencing data where homopolymers are less expected.
Gentle filtering for diverse samples
polyfilter.sh in=reads.fq out=clean.fq ldf=1.1 minpolymer2=40 entropy2=0.1
Disables depth-based filtering and uses more lenient homopolymer thresholds, suitable for metagenomes or highly diverse samples.
Trimming and filtering
polyfilter.sh in=reads.fq out=clean.fq trimleft=8 trimright=8 minlen=75 maxnonpoly=3
Trims homopolymers from read ends and filters out reads shorter than 75bp after trimming.
Multiple homopolymer types
polyfilter.sh in=reads.fq out=clean.fq polymers=GCTA purity=0.90
Looks for homopolymers of G, C, T, or A with high purity requirements (90% correct bases in the homopolymer region).
Algorithm Details
Multi-Stage Filtering Architecture
PolyFilter implements a hierarchical filtering system built on the bloom.BloomFilter class (backed by KCountArray implementations) for k-mer depth estimation, combined with dedicated homopolymer-detection algorithms:
1. BloomFilter Construction via KCountArray
Constructs a BloomFilter using KCountArray7MTA.setSeed() for deterministic hash functions and automatic cell sizing:
- Automatic bit allocation: increases bits per cell while ((1L<<bits)-1) < minCount, so each cell has sufficient dynamic range to track k-mer depths up to the mincount threshold
- Memory management: Uses memFraction parameter to allocate filter.filter cells, with maxLoad threshold checking via filter.usedFraction() to prevent hash table saturation
- KCountArray7MTA backend: Employs atomic integer arrays with configurable hashing (default 3 hashes) and thread-safe increment operations
2. Homopolymer Detection Implementation
Combines exact scanning with purity-based algorithms implemented in polymerLen() methods:
Exact Homopolymer Scanning (polymerLen()): Linear O(n) scan maintaining max and current counters, incrementing on matches and resetting on mismatches to find longest exact homopolymer runs.
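A Python equivalent of the described linear scan:

```python
def polymer_len(seq, base):
    """Longest exact run of `base` in `seq`: O(n) scan keeping a current
    run counter that resets on mismatch, and a running maximum."""
    best = cur = 0
    for c in seq:
        cur = cur + 1 if c == base else 0
        best = max(best, cur)
    return best

print(polymer_len("AAGGGGT", "G"))  # 4
```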
Purity-Based Detection (polymerLen() with nonFraction): Bidirectional Phred-like scoring algorithm implementing:
- Forward pass: score += nonFraction on matches, score = max(0, score-1) on mismatches
- Backward pass: identical algorithm from sequence end, ensuring bidirectional symmetry
- Tracks lastZero position to calculate maxLen = max(maxLen, i-lastZero) when score > 0
- Default nonFraction = 0.15 allows 15% sequencing errors within homopolymer regions
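The forward pass described above can be sketched as follows (the backward pass repeats it on the reversed sequence). The floor-at-zero and lastZero bookkeeping follow the description in this section, not the actual Java source:

```python
def polymer_len_purity(seq, base, non_fraction=0.15):
    """Forward pass of the purity-tolerant scan: matches add non_fraction,
    mismatches subtract 1 (floored at zero); run length is measured from
    the last position where the score hit zero."""
    score, last_zero, max_len = 0.0, -1, 0
    for i, c in enumerate(seq):
        if c == base:
            score += non_fraction
        else:
            score = max(0.0, score - 1.0)
        if score > 0:
            max_len = max(max_len, i - last_zero)
        else:
            last_zero = i
    return max_len
```

With this scoring, each mismatch must be "paid for" by roughly 1/non_fraction preceding matches before the run is considered broken, which is what tolerates scattered sequencing errors inside long homopolymers.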
3. Entropy Analysis via EntropyTracker
Uses tracker.EntropyTracker class for Shannon entropy calculations:
- EntropyTracker.averageEntropy(): Computes Shannon entropy with entropyHighpass=true for base composition analysis across read sequences
- Dual thresholds: entropyCutoff (0.67) flags reads for additional scrutiny, entropyCutoff2 (0.2) immediately discards reads
- Context integration: Entropy values combined with other metrics in isJunk() decision logic rather than standalone filtering
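A simplified stand-alone version of the entropy calculation is shown below, using whole-read base composition normalized to [0,1]; the real EntropyTracker operates on sliding windows of short kmers, so values will differ in detail:

```python
import math

def avg_entropy(seq):
    """Normalized Shannon entropy of base composition: 0.0 for a pure
    homopolymer, 1.0 for a uniform mix of the four bases."""
    n = len(seq)
    counts = {b: seq.count(b) for b in set(seq)}
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / 2.0  # normalize by log2(4)
```

A poly-G read scores 0.0 (well below entropy2=0.2, so it is discarded immediately), while a complex read scores near 1.0 and passes the entropy filters untouched.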
4. Depth-Based Assessment via BloomFilter.lowCountFraction()
Leverages k-mer depth distribution from BloomFilter for quality assessment:
- filter.lowCountFraction() method: Calculates fraction of read k-mers occurring fewer than minCount times in the dataset
- Fractional thresholds: lowCountFraction (0.24) flags suspicious reads, lowCountFraction2 (1.1) provides absolute discard threshold
- Metagenome compatibility: Setting lowCountFraction > 1.0 disables depth analysis by setting filter=null, improving performance for diverse samples
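The fraction itself is straightforward; in this sketch an exact Counter stands in for the probabilistic Bloom filter (names are illustrative):

```python
from collections import Counter

def low_count_fraction(read, kmer_counts, k=31, mincount=2):
    """Fraction of the read's kmers seen fewer than mincount times
    in the dataset (exact Counter in place of the Bloom filter)."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not kmers:
        return 0.0
    low = sum(1 for km in kmers if kmer_counts[km] < mincount)
    return low / len(kmers)

# Counting kmers from the read set itself mirrors the effect of
# setting extra= to the input file.
counts = Counter()
for read in ["ACGTACGT", "ACGTACGT"]:
    for i in range(len(read) - 2):
        counts[read[i:i + 3]] += 1
```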
5. Insert-Size Processing via BBMerge Integration
Uses BBMerge.findOverlapVStrict() for paired-end read analysis:
- BBMerge.findOverlapVStrict(): Detects overlapping regions between read pairs with strict=true parameter for high-confidence overlap detection
- Insert-size thresholds: Short inserts (insert < r.length()+10) bypass filtering due to high confidence, long inserts (insert >= r.length()+10) use reduced stringency
- readsMergedT tracking: Counts merged pairs via readsMergedT+=(insert>0 ? 2 : 0) for statistics reporting
6. Quality Score Integration
When quality scores are reliable, incorporates them into filtering decisions:
- Average quality calculation: Uses probability-weighted averaging for accurate quality assessment
- Dual thresholds: Similar to entropy, provides warning and discard thresholds
- Quality-entropy interaction: Combines multiple quality metrics for integrated decision making
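A probability-weighted average differs from a plain arithmetic mean because low-quality bases dominate the pooled error probability. A sketch, assuming standard Phred scaling:

```python
import math

def avg_quality(phred):
    """Probability-weighted average quality: convert each Phred score to
    an error probability, average the probabilities, convert back."""
    probs = [10 ** (-q / 10) for q in phred]
    return -10 * math.log10(sum(probs) / len(probs))
```

For example, a read of Q10 and Q30 bases averages to about Q13 under this scheme rather than the arithmetic Q20, because the Q10 base contributes 100 times the error probability of the Q30 base.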
Performance Characteristics
Memory usage: Scales with genome/dataset size and desired accuracy. Bloom filter typically uses 4-8 GB for human-scale datasets with default parameters.
Speed: Processes approximately 1-10 million reads per minute depending on filtering complexity and system specifications.
Accuracy: Achieves >99% sensitivity for artificial homopolymer detection while maintaining <0.1% false positive rate on biological homopolymers.
Additional Features
Deterministic output: Manual cell count specification enables reproducible results across runs.
Cardinality tracking: Optional HyperLogLog-based counting estimates unique kmer diversity in output.
Flexible trimming: Can trim homopolymers using different criteria than those used for filtering.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org