FilterByTile
Filters reads by positional quality across the flowcell using micro-tile analysis. Removes reads from low-quality flowcell areas while avoiding sequence-specific bias, typically discarding ~2% of reads while reducing error rates by ~10%.
Overview
FilterByTile addresses a fundamental problem in Illumina data quality control: positional quality variation across the flowcell. Some areas of flowcells have poor optical focus, weaker circulation, or air bubbles, resulting in very low-quality reads that nonetheless pass Illumina's filter criteria.
Traditional quality filtering creates sequence-specific bias because base quality depends on sequence context. Aggressive filtering can bias against extreme-GC portions of genomes or specific motifs, leading to poor assemblies, incorrect ploidy calls, and bad expression quantification.
FilterByTile's solution: Remove only reads from the worst flowcell locations (typically ~2% of reads) while reducing overall error rates by substantially more (~10%). This preserves data volume while eliminating the worst quality without incurring sequence bias.
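The arithmetic behind those two figures can be checked with a short sketch. It assumes all reads have the same length, so the overall error rate is a read-weighted average; the function name and inputs are illustrative and not part of FilterByTile.

```python
def bad_region_error_ratio(removed_fraction: float, error_reduction: float) -> float:
    """Return how many times the average error rate the removed reads must carry.

    Error rates are expressed relative to the original average (= 1.0), and
    equal-length reads are assumed, so:
        1 = removed_fraction * ratio + (1 - removed_fraction) * (1 - error_reduction)
    """
    kept_fraction = 1.0 - removed_fraction
    kept_error = 1.0 - error_reduction  # error rate of the kept reads, relative to the original
    return (1.0 - kept_fraction * kept_error) / removed_fraction


# Removing 2% of reads for a 10% drop in error rate implies that the removed
# reads carried several times the average error rate.
ratio = bad_region_error_ratio(removed_fraction=0.02, error_reduction=0.10)
print(f"{ratio:.1f}x average error rate")
```

Under these assumptions the discarded reads must be roughly six times as error-prone as the average read, which is why a small loss in volume can buy a much larger gain in quality.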
Origin Story
FilterByTile was originally developed after spikes were observed in k-mer uniqueness plots (generated by bbcountunique.sh), which should have shown monotonically declining exponential decay curves. These spikes corresponded to low-quality flowcell locations and often showed regular patterns indicating structured problems such as flowcell edges, tile edges, or streaks.
Basic Usage
filterbytile.sh in=<input> out=<output>
FilterByTile analyzes Illumina sequencing data quality across flowcell positions and filters out low-quality regions based on positional metrics rather than sequence content. The tool divides the flowcell into micro-tiles and evaluates each as a statistical unit using multiple quality metrics.
When to Use FilterByTile
Recommended Use Cases
- Any Illumina HiSeq, MiSeq, or NextSeq data - Benefits most datasets
- Libraries with obvious spatial problems - Bubbles, low-flow streaks, edge effects
- High-coverage applications - Where some data loss is acceptable for quality gains
- Large datasets - Requires sufficient data volume for statistics
When NOT to Use
- Complex metagenomes - Where more coverage is strictly beneficial
- Very low coverage expected - Any read loss could be detrimental
- Small datasets - Ineffective on, for example, a demultiplexed library of only 4000 reads
- Renamed reads - Requires proper Illumina headers with flowcell coordinates
Memory Considerations
FilterByTile can be memory-intensive when calculating k-mer uniqueness for large genomes. If you encounter memory issues, use usekmers=f to disable k-mer analysis and rely only on quality scores.
Recommended Workflows
Lane-Level Processing (Best Results)
For optimal performance, process entire lanes rather than individual libraries. This provides better statistical power and enables reuse of quality statistics:
Step 1: Analyze Full Lane
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
Step 2: Filter Individual Libraries
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell
filterbytile.sh in=sample2.fastq.gz out=filtered_sample2.fq.gz indump=dump.flowcell
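When a lane holds many libraries, the two steps can be generated programmatically. The following is a minimal sketch that only assembles command lines; the helper name is hypothetical, and it uses only the in=, out=, dump=, and indump= flags shown above, with output names following the filtered_<sample>.fq.gz pattern from the example.

```python
from pathlib import Path


def lane_workflow_commands(lane_fastq, library_fastqs, dump_file="dump.flowcell"):
    """Build filterbytile.sh command lines for the two-step lane workflow."""
    # Step 1: analyze the whole lane and write per-coordinate statistics.
    commands = [["filterbytile.sh", f"in={lane_fastq}", f"dump={dump_file}"]]
    # Step 2: filter each library against the saved statistics.
    for fastq in library_fastqs:
        sample = Path(fastq).name.split(".")[0]
        commands.append([
            "filterbytile.sh",
            f"in={fastq}",
            f"out=filtered_{sample}.fq.gz",
            f"indump={dump_file}",
        ])
    return commands


for cmd in lane_workflow_commands("all.fq.gz", ["sample1.fastq.gz", "sample2.fastq.gz"]):
    print(" ".join(cmd))  # pass each list to subprocess.run(cmd, check=True) to execute
```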
Direct Library Processing
For standalone processing when lane-level analysis isn't feasible:
filterbytile.sh in=reads.fq.gz out=filtered.fq.gz
Parameters
Parameters control micro-tile definition, quality metrics, and filtering thresholds. Each quality metric (uniqueness, quality, error-free probability, poly-G rate) uses three threshold types that must ALL be exceeded for tile rejection.
Input Parameters
- in=<file>
- Primary input file. Should ideally be adapter-trimmed and quality-score recalibrated data.
- in2=<file>
- Second input file for paired reads in twin files.
- indump=<file>
- Use previously calculated quality statistics instead of analyzing input reads. Essential for processing multiple libraries from the same lane efficiently.
- barcodes=<file>
- Optional list of expected barcodes, one per line.
- reads=-1
- Process this number of reads, then quit (-1 means all).
- interleaved=auto
- Override autodetection of paired interleaved format.
Output Parameters
- out=<file>
- Output file for filtered reads. Contains reads from micro-tiles that passed quality thresholds.
- dump=<file>
- Write quality statistics by coordinates. Can be reused with indump for filtering individual libraries from the same run.
- counts=<file>
- Write barcode counts to file.
Tile Parameters
- xsize=500
- Initial width of micro-tiles. For NovaSeqX use 520. The algorithm may iteratively expand micro-tiles to meet the target read count.
- ysize=500
- Initial height of micro-tiles. For NovaSeqX use 590.
- size=
- Set both xsize and ysize to the same value.
- target=1600
- Iteratively expand micro-tiles until they contain an average of at least this many reads. Ensures sufficient statistical power.
- alignedreads=250
- Average aligned reads per tile for error rate calibration with PhiX spike-in.
Filtering Parameters
Each metric (u=uniqueness, q=quality, e=error-free probability, pg=poly-G) has three parameters: deviations (d), fraction (f), and absolute (a). A micro-tile is discarded only if ALL THREE conditions are met for at least one metric.
- udeviations=1.5
- (ud) Standard deviations below average uniqueness required for discarding.
- qdeviations=2.4
- (qd) Standard deviations below average quality required for discarding.
- edeviations=3.0
- (ed) Standard deviations below average error-free probability required for discarding.
- pgdeviations=1.4
- (pgd) Standard deviations above average poly-G rate required for discarding.
- ufraction=0.01
- (uf) Minimum fractional deviation from average uniqueness required.
- qfraction=0.08
- (qf) Minimum fractional deviation from average quality required.
- efraction=0.2
- (ef) Minimum fractional deviation from average error-free probability required.
- pgfraction=0.2
- (pgf) Minimum fractional deviation from average poly-G rate required.
- uabsolute=1
- (ua) Minimum absolute deviation from average uniqueness required.
- qabsolute=2.0
- (qa) Minimum absolute deviation from average quality required.
- eabsolute=6
- (ea) Minimum absolute deviation from average error-free probability required.
- pgabsolute=0.2
- (pga) Minimum absolute deviation from average poly-G rate required.
- ier=0.012
- (inferrederrorrate) Maximum predicted base error rate. Superior to uniqueness deviations when ~1% PhiX is spiked in.
- mdf=0.4
- (maxdiscardfraction) Safety limit: don't discard more than this fraction of tiles regardless of data quality.
Alignment Parameters
Optional internal alignment to PhiX enables translation of k-mer depth to high-resolution error rates.
- samin=<file>
- Optional aligned SAM input file for error rate analysis. Skips internal alignment if provided.
- samout=<file>
- Output file for aligned reads (SAM or FASTQ format). Written only during internal alignment.
- align=true
- Perform internal alignment to reference if no SAM file provided.
- alignref=phix
- Reference for alignment. PhiX is optimal as it's nonrepetitive down to k=13.
- alignk1=17
- K-mer length for seeding primary alignments to reference.
- alignk2=13
- K-mer length for seeding mate rescue alignments.
- minid1=0.62
- Minimum identity for accepting individual alignments.
- minid2=0.54
- Minimum identity for mate rescue alignments.
- alignmm1=1
- Middle mask length for alignk1.
- alignmm2=1
- Middle mask length for alignk2.
Other Parameters
- usekmers=t
- Load k-mers to calculate uniqueness and depth. Set to false to reduce memory usage if encountering out-of-memory errors.
- lowqualityonly=t
- (lqo) Only discard low-quality reads within bad micro-tiles rather than entire tiles. More conservative approach.
- recalibrate=f
- Recalibrate reads during processing. Requires calibration matrices from CalcTrueQuality.
- dmult=-.1
- Controls stringency when lqo=t. At 0, removes only below-average reads. Lower values increase stringency.
- idmaskwrite=15
- Bitmask controlling fraction of k-mers loaded (15 means 1/16th). 0 uses all k-mers. Trade memory for accuracy.
- idmaskread=7
- Controls fraction of k-mers sampled during depth calculation.
- k=31
- K-mer length for Bloom filter (uniqueness calculation).
- hashes=3
- Number of hash functions for Bloom filter. More hashes reduce false positives.
- cbits=2
- Bloom filter bits per cell. Higher values allow more accurate depth estimation.
- merge=f
- Merge reads for insert and error rate statistics. Adds ~50% processing time but improves dump file quality.
Java Parameters
- -Xmx
- Set Java memory usage. For processing 5 billion reads, use -Xmx200g. FilterByTile benefits from large memory allocations.
- -eoom
- Exit cleanly on out-of-memory exceptions (requires Java 8u92+).
- -da
- Disable assertions for minor performance improvement.
Examples
Basic Single File Processing
filterbytile.sh in=reads.fq.gz out=filtered.fq.gz
Standard processing of single-end or paired interleaved files.
Paired Reads in Twin Files
filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq
Process paired-end data stored in separate files.
Lane-Level Analysis with Dump File
# Step 1: Analyze full lane to create quality statistics
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell
# Step 2: Apply to individual libraries (much faster)
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell
Recommended workflow for processing multiple libraries from the same flowcell. Provides better statistical power and enables efficient reuse of quality calculations.
Conservative Filtering for Precious Samples
filterbytile.sh in=x.fq out=y.fq lowqualityonly=t dmult=0
Use when data retention is critical. lowqualityonly=t (the default) removes individual low-quality reads within bad micro-tiles rather than discarding entire tiles, and dmult=0 limits removal to below-average reads.
Aggressive Filtering for Severe Quality Issues
filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5
Tighten thresholds when you know there are serious positional quality problems (e.g., obvious bubbles or streaking).
Memory-Optimized Processing
filterbytile.sh in=x.fq out=y.fq usekmers=f
Disable k-mer uniqueness calculations to reduce memory usage. Relies only on quality scores for filtering decisions.
NovaSeqX-Specific Configuration
filterbytile.sh in=novaseqx.fq out=filtered.fq xsize=520 ysize=590 target=2000
Use instrument-specific tile dimensions and increased target reads for optimal results on NovaSeqX data.
High-Memory Full Lane Processing
filterbytile.sh -Xmx200g in=lane.fq out=filtered.fq align=true ier=0.01 dump=stats.txt
Process complete lanes with maximum memory allocation and PhiX alignment for error rate calibration.
Algorithm Details
Micro-Tile Organization
FilterByTile divides flowcells into rectangular micro-tiles using a hierarchical data structure: FlowCell contains Lane objects, which contain dynamic Tile grids, each holding MicroTile arrays. The IlluminaHeaderParser2 class extracts lane, tile, x, and y coordinates from read headers to assign reads to micro-tiles.
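As an illustration of the coordinate extraction, the sketch below parses a Casava 1.8+ style header (instrument:run:flowcell:lane:tile:x:y) and bins the read into a micro-tile. It is not the IlluminaHeaderParser2 code; the function names, the example header, and the integer-division binning rule are assumptions made for this example.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_header(header: str) -> ReadLocation:
    """Extract lane, tile, x, y from a Casava 1.8+ style header.

    Expected layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    name = header.lstrip("@").split()[0]  # drop '@' and the comment after the space
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"no flowcell coordinates in header: {header!r}")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return ReadLocation(lane, tile, x, y)


def micro_tile_index(loc: ReadLocation, xsize: int = 500, ysize: int = 500):
    """Map a read to a micro-tile cell by integer division (an assumed binning rule)."""
    return loc.lane, loc.tile, loc.x // xsize, loc.y // ysize


# Hypothetical header, used only to exercise the parser.
loc = parse_illumina_header("@A00178:38:H5NYYDSXX:2:1101:3007:1000 1:N:0:CAACCTAG")
print(loc, micro_tile_index(loc))
```

A renamed read such as "@read_1" raises ValueError here, which mirrors why FilterByTile cannot process reads whose headers have lost their flowcell coordinates.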
Quality Metrics Implementation
Each micro-tile accumulates statistics for four quality metrics:
- K-mer Uniqueness: A BloomFilter with a configurable number of hash functions and bits per cell estimates k-mer depth; deterministic bit masking subsamples k-mers for memory efficiency
- Average Quality: Read.avgQualityByProbabilityDouble() aggregates Phred quality scores
- Error-Free Probability: Read.probabilityErrorFree() calculates zero-error probability from quality distributions
- Poly-G Detection: CycleTracker identifies poly-G sequences indicating sequencing artifacts
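The two quality-score metrics can be sketched from the standard Phred relation p = 10^(-Q/10). The method names above suggest probability-based calculations, but the BBTools implementations are not shown here, so the functions below are an assumed reading rather than the actual Read methods.

```python
import math


def error_probabilities(quals):
    """Convert Phred scores to per-base error probabilities: p = 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in quals]


def avg_quality_by_probability(quals):
    """Average the error probabilities, then convert the mean back to Phred."""
    probs = error_probabilities(quals)
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


def probability_error_free(quals):
    """Probability that every base in the read is correct: product of (1 - p)."""
    return math.prod(1 - p for p in error_probabilities(quals))


quals = [30, 30, 10]  # hypothetical read: one poor base among good ones
print(round(avg_quality_by_probability(quals), 2))
print(round(probability_error_free(quals), 4))
```

Averaging in probability space lets a single poor base pull the result well below the arithmetic mean of the scores, which makes the metric more sensitive to localized damage.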
Triple-Threshold Logic
The TileDump.markTiles() method implements strict filtering logic requiring ALL THREE conditions simultaneously:
- Statistical significance: delta > (deviations × standard deviation)
- Proportional threshold: delta > (average × fraction)
- Absolute threshold: delta > absolute minimum
This conservative approach prevents over-filtering while ensuring only genuinely problematic micro-tiles are removed.
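The three conditions can be sketched as follows. This illustrates the rule as described above and is not the code of TileDump.markTiles(); the function name, the higher_is_worse flag, and the example lane statistics are hypothetical.

```python
def exceeds_all_thresholds(value, mean, stdev, deviations, fraction, absolute,
                           higher_is_worse=False):
    """Return True only if the tile's metric trips all three thresholds.

    delta is how far the tile lies on the bad side of the average: below it for
    uniqueness, quality, and error-free probability; above it for poly-G rate.
    """
    delta = (value - mean) if higher_is_worse else (mean - value)
    return (
        delta > deviations * stdev      # statistically significant
        and delta > mean * fraction     # proportionally large
        and delta > absolute            # large in absolute terms
    )


# Quality defaults from the parameter list: qd=2.4, qf=0.08, qa=2.0.
# Hypothetical lane: average tile quality 30.0 with standard deviation 1.0.
q = dict(mean=30.0, stdev=1.0, deviations=2.4, fraction=0.08, absolute=2.0)
print(exceeds_all_thresholds(27.0, **q))  # delta 3.0 clears all three thresholds
print(exceeds_all_thresholds(27.8, **q))  # delta 2.2 clears only the absolute threshold
```

A micro-tile would then be discarded if this check returns True for at least one of the four metrics.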
Adaptive Tile Expansion
FlowCell.widenToTargetReads() expands micro-tile dimensions when the initial tiles contain too few reads for reliable statistics. Starting from the xsize and ysize parameters, tiles grow iteratively until they contain an average of at least target reads (default 1600), ensuring adequate statistical power for quality assessment.
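A toy version of this loop is shown below. The growth rule (doubling the smaller dimension) and the averaging over occupied micro-tiles are assumptions made for the sketch; the source does not state the schedule FlowCell.widenToTargetReads() actually follows.

```python
from collections import Counter


def average_reads_per_micro_tile(coords, xsize, ysize):
    """Bin (x, y) read coordinates and average the counts over occupied micro-tiles."""
    counts = Counter((x // xsize, y // ysize) for x, y in coords)
    return sum(counts.values()) / len(counts)


def widen_to_target(coords, xsize=500, ysize=500, target=1600, growth=2):
    """Grow the smaller dimension until the average read count reaches target.

    Multiplying the smaller side by `growth` is an assumed rule, not the
    schedule FilterByTile uses.
    """
    max_x = max(x for x, _ in coords)
    max_y = max(y for _, y in coords)
    while average_reads_per_micro_tile(coords, xsize, ysize) < target:
        if xsize > max_x and ysize > max_y:
            break  # a single micro-tile already covers every read
        if xsize <= ysize:
            xsize *= growth
        else:
            ysize *= growth
    return xsize, ysize


# Hypothetical tile: one read every 50 units on a 4000 x 4000 area (6400 reads).
coords = [(x, y) for x in range(0, 4000, 50) for y in range(0, 4000, 50)]
print(widen_to_target(coords))
```

The early exit shows why very small datasets are a poor fit: once one micro-tile spans all reads, no further widening can reach the target.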
PhiX Error Rate Calibration
When PhiX spike-in is present, SideChannel3 performs internal alignment using BBMap algorithms. The processSamLine() method analyzes CIGAR strings to calculate error rates, enabling translation between k-mer depth statistics and absolute base error rates via linear regression formulas.
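To illustrate how a CIGAR string yields an error rate, the sketch below counts mismatches and indel bases in an extended CIGAR (with = and X operators, as defined in the SAM specification). It is not processSamLine(); the function name is hypothetical, and treating each inserted or deleted base as one error is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_error_rate(cigar: str) -> float:
    """Estimate a per-base error rate from an extended CIGAR string.

    Counts X (mismatch), I (insertion), and D (deletion) bases as errors and
    = (match) bases as correct. Plain M does not distinguish matches from
    mismatches, so it is rejected here.
    """
    errors = matches = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op in "XID":
            errors += n
        elif op == "M":
            raise ValueError("CIGAR uses M; matches and mismatches cannot be separated")
        # S, H, N, and P are ignored in this sketch
    total = errors + matches
    return errors / total if total else 0.0


print(cigar_error_rate("75=1X74="))  # one mismatch among 150 aligned bases
```

Averaging such per-read rates within each micro-tile gives the measured error rates against which k-mer depth statistics can be regressed.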
Memory-Efficient K-mer Sampling
The Bloom filter implementation uses bit masking for memory optimization:
- idmaskwrite: Controls which k-mers are stored during loading (e.g., 15 = store 1/16th)
- idmaskread: Controls which k-mers are sampled during depth calculation
- Deterministic sampling: Uses read numeric IDs for consistent subsampling
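The effect of the masks can be sketched with a bitwise AND on a numeric ID. The predicate below (keep when the masked bits are all zero) is an assumption chosen to reproduce the stated 1/16 fraction for a mask of 15; the function name is hypothetical.

```python
def passes_mask(read_id: int, mask: int) -> bool:
    """Keep a read when the masked low bits of its numeric ID are all zero.

    With mask=15 (binary 1111) one ID in 16 passes; mask=0 keeps every ID.
    """
    return (read_id & mask) == 0


ids = range(1600)
stored = [i for i in ids if passes_mask(i, 15)]   # idmaskwrite=15
sampled = [i for i in ids if passes_mask(i, 7)]   # idmaskread=7
print(len(stored) / len(ids), len(sampled) / len(ids))
```

Because the decision depends only on the ID, repeated runs select the same subset, which is what makes the subsampling deterministic.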
Performance and Memory
Memory Requirements
Memory usage depends on data volume and k-mer analysis:
- Full lane processing: 200GB RAM recommended for 5 billion reads
- K-mer disabled: Much lower memory usage with usekmers=f
- Bloom filter sizing: Controlled by cbits, hashes, and sampling masks
Processing Recommendations
- Process entire lanes when possible for best statistical power
- Use dump/indump workflow for multiple libraries from same flowcell
- Consider memory optimization with usekmers=f for memory-limited systems
- Adjust thread counts based on available CPU cores and memory
Data Volume Requirements
FilterByTile requires substantial data volumes for effective statistical analysis. Avoid running it on small demultiplexed datasets (e.g., 4000 reads). Instead, analyze the full lane and apply the results to individual libraries using the dump/indump approach.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Comprehensive guide: bbtools/docs/guides/FilterByTileGuide.txt