FilterByTile

Script: filterbytile.sh
Package: hiseq
Class: AnalyzeFlowCell.java

Filters reads based on positional quality over a flowcell using micro-tile analysis. Removes low-quality flowcell areas while avoiding sequence-specific bias, typically removing ~2% of reads while reducing error rates by ~10%.

Overview

FilterByTile addresses a fundamental problem in Illumina data quality control: positional quality variation across the flowcell. Some areas of flowcells have poor optical focus, weaker circulation, or air bubbles, resulting in very low-quality reads that nonetheless pass Illumina's filter criteria.

Traditional quality filtering creates sequence-specific bias because read quality is itself sequence-dependent. Aggressive filtering can bias against extreme-GC portions of genomes or specific motifs, leading to poor assemblies, incorrect ploidy calls, and inaccurate expression quantification.

FilterByTile's solution: Remove only reads from the worst flowcell locations (typically ~2% of reads) while reducing overall error rates by substantially more (~10%). This preserves data volume while eliminating the worst quality without incurring sequence bias.

Origin Story

FilterByTile was originally developed after spikes were observed in k-mer uniqueness plots (generated by bbcalcunique.sh), which should have shown monotonically declining exponential decay curves. These spikes corresponded to low-quality flowcell locations and often followed regular patterns, indicating structured problems such as flowcell edges, tile edges, or streaks.

Basic Usage

filterbytile.sh in=<input> out=<output>

FilterByTile analyzes Illumina sequencing data quality across flowcell positions and filters out low-quality regions based on positional metrics rather than sequence content. The tool divides the flowcell into micro-tiles and evaluates each as a statistical unit using multiple quality metrics.

When to Use FilterByTile

Recommended Use Cases

When NOT to Use

Memory Considerations

FilterByTile can be memory-intensive when calculating k-mer uniqueness for large genomes. If you encounter memory issues, use usekmers=f to disable k-mer analysis and rely only on quality scores.

Recommended Workflows

Lane-Level Processing (Best Results)

For optimal performance, process entire lanes rather than individual libraries. This provides better statistical power and enables reuse of quality statistics:

Step 1: Analyze Full Lane

cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell

Step 2: Filter Individual Libraries

filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell
filterbytile.sh in=sample2.fastq.gz out=filtered_sample2.fq.gz indump=dump.flowcell
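
With many libraries per lane, Step 2 can be scripted. The following is a minimal sketch that only builds the argument lists; the function name and the "filtered_" output naming are illustrative assumptions, and only the in=, out=, and indump= flags shown above are used.

```python
from pathlib import Path


def build_filter_commands(sample_files, dump_file="dump.flowcell"):
    """Return one filterbytile.sh argument list per library.

    Every command reuses the same lane-level dump via indump=.
    The 'filtered_' prefix is a hypothetical naming convention.
    """
    commands = []
    for sample in sample_files:
        out_name = "filtered_" + Path(sample).name
        commands.append([
            "filterbytile.sh",
            f"in={sample}",
            f"out={out_name}",
            f"indump={dump_file}",
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the dump file from Step 1 exists.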

Direct Library Processing

For standalone processing when lane-level analysis isn't feasible:

filterbytile.sh in=reads.fq.gz out=filtered.fq.gz

Parameters

Parameters control micro-tile definition, quality metrics, and filtering thresholds. Each quality metric (uniqueness, quality, error-free probability, poly-G rate) uses three threshold types that must ALL be exceeded for tile rejection.

Input Parameters

in=<file>
Primary input file. Should ideally be adapter-trimmed and quality-score recalibrated data.
in2=<file>
Second input file for paired reads in twin files.
indump=<file>
Use previously calculated quality statistics instead of analyzing input reads. Essential for processing multiple libraries from the same lane efficiently.
barcodes=<file>
Optional list of expected barcodes, one per line.
reads=-1
Process this number of reads, then quit (-1 means all).
interleaved=auto
Override autodetection of paired interleaved format.

Output Parameters

out=<file>
Output file for filtered reads. Contains reads from micro-tiles that passed quality thresholds.
dump=<file>
Write quality statistics by coordinates. Can be reused with indump for filtering individual libraries from the same run.
counts=<file>
Write barcode counts to file.

Tile Parameters

xsize=500
Initial width of micro-tiles. For NovaSeqX use 520. Algorithm may iteratively expand to meet target read counts.
ysize=500
Initial height of micro-tiles. For NovaSeqX use 590.
size=
Set both xsize and ysize to the same value.
target=1600
Iteratively expand micro-tiles until they contain an average of at least this many reads. Ensures sufficient statistical power.
alignedreads=250
Average aligned reads per tile for error rate calibration with PhiX spike-in.

Filtering Parameters

Each metric (u=uniqueness, q=quality, e=error-free probability, pg=poly-G) has three parameters: deviations (d), fraction (f), and absolute (a). A micro-tile is discarded only if ALL THREE conditions are met for at least one metric.

udeviations=1.5
(ud) Standard deviations below average uniqueness required for discarding.
qdeviations=2.4
(qd) Standard deviations below average quality required for discarding.
edeviations=3.0
(ed) Standard deviations below average error-free probability required for discarding.
pgdeviations=1.4
(pgd) Standard deviations above average poly-G rate required for discarding.
ufraction=0.01
(uf) Minimum fractional deviation from average uniqueness required.
qfraction=0.08
(qf) Minimum fractional deviation from average quality required.
efraction=0.2
(ef) Minimum fractional deviation from average error-free probability required.
pgfraction=0.2
(pgf) Minimum fractional deviation from average poly-G rate required.
uabsolute=1
(ua) Minimum absolute deviation from average uniqueness required.
qabsolute=2.0
(qa) Minimum absolute deviation from average quality required.
eabsolute=6
(ea) Minimum absolute deviation from average error-free probability required.
pgabsolute=0.2
(pga) Minimum absolute deviation from average poly-G rate required.
ier=0.012
(inferrederrorrate) Maximum predicted base error rate. Superior to uniqueness deviations when ~1% PhiX is spiked in.
mdf=0.4
(maxdiscardfraction) Safety limit: don't discard more than this fraction of tiles regardless of data quality.

Alignment Parameters

Optional internal alignment to PhiX enables translation of k-mer depth to high-resolution error rates.

samin=<file>
Optional aligned SAM input file for error rate analysis. Skips internal alignment if provided.
samout=<file>
Output file for aligned reads (SAM or FASTQ format). Written only during internal alignment.
align=true
Perform internal alignment to reference if no SAM file provided.
alignref=phix
Reference for alignment. PhiX is optimal as it's nonrepetitive down to k=13.
alignk1=17
K-mer length for seeding primary alignments to reference.
alignk2=13
K-mer length for seeding mate rescue alignments.
minid1=0.62
Minimum identity for accepting individual alignments.
minid2=0.54
Minimum identity for mate rescue alignments.
alignmm1=1
Middle mask length for alignk1.
alignmm2=1
Middle mask length for alignk2.

Other Parameters

usekmers=t
Load k-mers to calculate uniqueness and depth. Set to false to reduce memory usage if encountering out-of-memory errors.
lowqualityonly=t
(lqo) Only discard low-quality reads within bad micro-tiles rather than entire tiles. More conservative approach.
recalibrate=f
Recalibrate reads during processing. Requires calibration matrices from CalcTrueQuality.
dmult=-.1
Controls stringency when lqo=t. At 0, removes only below-average reads. Lower values increase stringency.
idmaskwrite=15
Bitmask controlling the fraction of k-mers loaded (15 means 1/16th; 0 uses all k-mers). Larger masks use less memory at the cost of accuracy.
idmaskread=7
Controls fraction of k-mers sampled during depth calculation.
k=31
K-mer length for Bloom filter (uniqueness calculation).
hashes=3
Number of hash functions for Bloom filter. More hashes reduce false positives.
cbits=2
Bloom filter bits per cell. Higher values allow more accurate depth estimation.
merge=f
Merge reads for insert and error rate statistics. Adds ~50% processing time but improves dump file quality.

Java Parameters

-Xmx
Set Java memory usage. For processing 5 billion reads, use -Xmx200g. FilterByTile benefits from large memory allocations.
-eoom
Exit cleanly on out-of-memory exceptions (requires Java 8u92+).
-da
Disable assertions for minor performance improvement.

Examples

Basic Single File Processing

filterbytile.sh in=reads.fq.gz out=filtered.fq.gz

Standard processing of single-end or paired interleaved files.

Paired Files in Twin Format

filterbytile.sh in1=r1.fq in2=r2.fq out1=filtered1.fq out2=filtered2.fq

Process paired-end data stored in separate files.

Lane-Level Analysis with Dump File

# Step 1: Analyze full lane to create quality statistics
cat *.fastq.gz > all.fq.gz
filterbytile.sh in=all.fq.gz dump=dump.flowcell

# Step 2: Apply to individual libraries (much faster)
filterbytile.sh in=sample1.fastq.gz out=filtered_sample1.fq.gz indump=dump.flowcell

Recommended workflow for processing multiple libraries from the same flowcell. Provides better statistical power and enables efficient reuse of quality calculations.

Conservative Filtering for Precious Samples

filterbytile.sh in=x.fq out=y.fq lowqualityonly=t dmult=0

Use when data retention is critical. Only removes individual low-quality reads within bad micro-tiles rather than discarding entire tiles.

Aggressive Filtering for Severe Quality Issues

filterbytile.sh in=x.fq out=y.fq ud=0.75 qd=1 ed=1 ua=.5 qa=.5 ea=.5

Lower the discard thresholds when you know there are serious positional quality problems (e.g., obvious bubbles or streaking), so that more micro-tiles qualify for removal.

Memory-Optimized Processing

filterbytile.sh in=x.fq out=y.fq usekmers=f

Disable k-mer uniqueness calculations to reduce memory usage. Relies only on quality scores for filtering decisions.

NovaSeqX-Specific Configuration

filterbytile.sh in=novaseqx.fq out=filtered.fq xsize=520 ysize=590 target=2000

Use instrument-specific tile dimensions and increased target reads for optimal results on NovaSeqX data.

High-Memory Full Lane Processing

filterbytile.sh -Xmx200g in=lane.fq out=filtered.fq align=true ier=0.01 dump=stats.txt

Process a complete lane with a 200 GB Java heap (-Xmx200g) and PhiX alignment for error rate calibration, capping the inferred error rate at 0.01 and writing quality statistics to stats.txt.

Algorithm Details

Micro-Tile Organization

FilterByTile divides flowcells into rectangular micro-tiles using a hierarchical data structure: FlowCell contains Lane objects, which contain dynamic Tile grids, each holding MicroTile arrays. The IlluminaHeaderParser2 class extracts lane, tile, x, and y coordinates from read headers to assign reads to micro-tiles.
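
To illustrate the coordinate assignment, here is a minimal sketch that assumes the standard Illumina read name layout (instrument:run:flowcell:lane:tile:x:y, followed by a space-separated comment). The function names are hypothetical and this is not the IlluminaHeaderParser2 implementation; it only shows how lane, tile, x, and y could map to a micro-tile.

```python
def parse_illumina_header(header):
    """Extract (lane, tile, x, y) from a standard Illumina read name."""
    name = header.lstrip("@").split()[0]  # drop '@' and the comment field
    fields = name.split(":")
    # Fields 0-2 are instrument, run number, and flowcell ID.
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return lane, tile, x, y


def microtile_index(x, y, xsize=500, ysize=500):
    """Map pixel coordinates to a (column, row) micro-tile within a tile."""
    return x // xsize, y // ysize
```

For the header "@A00123:45:HXXXXXXX:2:1101:1234:5678 1:N:0:ACGT", the parser returns (2, 1101, 1234, 5678), and with the default 500x500 micro-tiles the read falls in column 2, row 11.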

Quality Metrics Implementation

Each micro-tile accumulates statistics for four quality metrics: k-mer uniqueness, average quality, error-free probability, and poly-G rate.

Triple-Threshold Logic

The TileDump.markTiles() method implements strict filtering logic requiring ALL THREE conditions simultaneously:

  1. Statistical significance: delta > (deviations × standard deviation)
  2. Proportional threshold: delta > (average × fraction)
  3. Absolute threshold: delta > absolute minimum

This conservative approach prevents over-filtering while ensuring only genuinely problematic micro-tiles are removed.
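
The three conditions above can be sketched as follows. This is an illustrative reading of the rule, not the TileDump.markTiles() source; the function names, the dictionary layout, and the "polyg" key are assumptions made for the example.

```python
def fails_metric(value, mean, stdev, deviations, fraction, absolute,
                 higher_is_worse=False):
    """Return True when one metric trips all three thresholds."""
    # delta measures how far the tile sits on the "bad" side of the average.
    delta = (value - mean) if higher_is_worse else (mean - value)
    return (delta > deviations * stdev    # statistical significance
            and delta > mean * fraction   # proportional threshold
            and delta > absolute)         # absolute threshold


def should_discard(tile, stats, thresholds):
    """Discard a micro-tile if at least one metric fails all three tests.

    tile:       metric name -> value for this micro-tile
    stats:      metric name -> (mean, stdev) across micro-tiles
    thresholds: metric name -> (deviations, fraction, absolute)
    """
    for metric, (deviations, fraction, absolute) in thresholds.items():
        mean, stdev = stats[metric]
        # Poly-G rate is the one metric where higher values are worse.
        worse_high = (metric == "polyg")
        if fails_metric(tile[metric], mean, stdev,
                        deviations, fraction, absolute, worse_high):
            return True
    return False
```

With the default quality thresholds (qd=2.4, qf=0.08, qa=2.0), a lane average of 35 and a standard deviation of 1.0, a micro-tile averaging 31 is discarded (delta 4 exceeds 2.4, 2.8, and 2.0), while one averaging 32.5 is kept because its delta of 2.5 does not exceed the proportional threshold of 2.8.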

Adaptive Tile Expansion

FlowCell.widenToTargetReads() dynamically expands micro-tile dimensions when the initial tiles contain too few reads for statistical reliability. Starting from the xsize and ysize parameters, tiles grow iteratively until they contain an average of at least target reads (default 1600), ensuring adequate statistical power for quality assessment.
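
The expansion loop might look like the sketch below. The growth rule (doubling the smaller dimension), the function name, and the tile_width/tile_height arguments are assumptions for illustration; the document does not specify how FlowCell.widenToTargetReads() chooses the next size.

```python
import math


def widen_to_target(total_reads, tile_width, tile_height, num_tiles=1,
                    xsize=500, ysize=500, target=1600):
    """Grow micro-tile dimensions until the average read count reaches target."""
    while True:
        cols = math.ceil(tile_width / xsize)
        rows = math.ceil(tile_height / ysize)
        microtiles = cols * rows * num_tiles
        average = total_reads / microtiles
        # Stop when the average is high enough, or when each tile is already
        # a single micro-tile and cannot be merged any further.
        if average >= target or (cols == 1 and rows == 1):
            return xsize, ysize
        # Assumed growth rule: double whichever dimension is currently smaller.
        if xsize <= ysize:
            xsize *= 2
        else:
            ysize *= 2
```

For example, 400,000 reads on a single 10000x10000 tile average only 1000 reads per 500x500 micro-tile, so the sketch widens to 1000x500, where the average reaches 2000.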

PhiX Error Rate Calibration

When PhiX spike-in is present, SideChannel3 performs internal alignment using BBMap algorithms. The processSamLine() method analyzes CIGAR strings to calculate error rates, enabling translation between k-mer depth statistics and absolute base error rates via linear regression formulas.
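
As a rough illustration of CIGAR-based error counting, the sketch below assumes extended CIGAR strings in which '=' marks matches and 'X' marks mismatches (SAM 1.4 style). It is not the processSamLine() implementation: the function name is hypothetical, and counting each inserted or deleted base as one error is an assumption.

```python
import re

# One CIGAR element: a length followed by an operator.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_error_rate(cigar):
    """Estimate the per-base error rate from an extended CIGAR string."""
    matches = errors = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op in "XID":  # mismatches, insertions, deletions
            errors += n
        # 'M' is ambiguous (match or mismatch); it and clipping are ignored.
    total = matches + errors
    return errors / total if total else 0.0
```

Per-tile rates computed this way from PhiX reads could then serve as the dependent variable when fitting the depth-to-error-rate relationship described above.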

Memory-Efficient K-mer Sampling

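The Bloom filter implementation uses bit masking to reduce memory: idmaskwrite controls the fraction of k-mers loaded (the default of 15 loads 1/16th; 0 loads all k-mers), and idmaskread controls the fraction sampled during depth calculation.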
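
A minimal sketch of mask-based subsampling, assuming a k-mer is kept when the masked bits of its hash are all zero. Which value the tool hashes and which residue it keeps are assumptions here; only the relationship between the mask and the sampled fraction comes from the parameter descriptions.

```python
def keep_kmer(kmer_hash, idmask):
    """Keep a k-mer only when the masked low bits of its hash are zero.

    A mask of 15 (binary 1111) keeps about 1 in 16 hashes;
    a mask of 0 keeps every k-mer.
    """
    return (kmer_hash & idmask) == 0


def sampled_fraction(idmask):
    """Expected kept fraction for a mask of the form 2**n - 1."""
    return 1 / (idmask + 1)
```

Under this scheme, sampled_fraction(15) is 1/16 and sampled_fraction(7) is 1/8, matching the idmaskwrite=15 description; the 1/8 figure for idmaskread=7 is an inference from the same rule rather than a documented value.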

Performance and Memory

Memory Requirements

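Memory usage depends on data volume and on whether k-mer analysis is enabled. Processing 5 billion reads calls for about -Xmx200g, and setting usekmers=f reduces memory by skipping k-mer loading.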

Processing Recommendations

Data Volume Requirements

FilterByTile requires substantial data volumes for effective statistical analysis. Avoid using it on small demultiplexed datasets (e.g., 4000 reads); instead, analyze the full lane and apply the results to individual libraries using the dump/indump approach.

Support

For questions and support: