FastqScan

Script: fastqscan.sh Package: stream Class: FastqScan.java

Lightweight sequence file scanner for counting reads and bases with minimal CPU overhead. Efficiently processes FASTQ, FASTA, and SAM formats with optimized handling of bgzipped files.

Compare to active ingredients in needletail!

Basic Usage

fastqscan.sh <input file>

Input may be FASTQ, FASTA, or SAM format, compressed or uncompressed. FastqScan automatically detects the format and reports total reads and bases.

Key Features

Minimal overhead: Single-threaded byte-level scanning with no object creation
Multi-format: Supports FASTQ, FASTA, and SAM file formats
Compression-aware: Efficient multithreaded decompression for bgzipped files
Robust: Handles files with or without terminal newlines
Fast: Optimized for counting operations without the overhead of full parsing

Performance Benchmarks

Unless a section states otherwise, benchmarks were run on an 80-million-read FASTQ file (11.97 Gbp) on a production filesystem. All times are in seconds.

Raw FASTQ (80M reads)

Tool          Time (s)   CPU time (s)
FastqScan     5.60       2.45
needletail    6.90       3.49

Result: FastqScan is 19% faster with 30% less CPU usage
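The percentages in the result line can be reproduced from the raw numbers as relative reductions in time. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.Locale;

// Hypothetical helper for checking the benchmark arithmetic; not part of BBTools.
class BenchmarkMathSketch {

    /** Percent reduction when going from a baseline time to an improved time. */
    static double percentReduction(double baseline, double improved) {
        return 100.0 * (baseline - improved) / baseline;
    }

    public static void main(String[] args) {
        // Raw FASTQ numbers from the benchmark: needletail is the baseline.
        double wall = percentReduction(6.90, 5.60); // about 18.8, reported as 19%
        double cpu = percentReduction(3.49, 2.45);  // about 29.8, reported as 30%
        System.out.println(String.format(Locale.ROOT, "wall-clock: %.1f%% less time", wall));
        System.out.println(String.format(Locale.ROOT, "CPU:        %.1f%% less time", cpu));
    }
}
```

Note that "19% faster" here means 19% less elapsed time, not a 19% higher throughput.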

Bgzipped FASTQ (80M reads)

Tool          Time (s)   CPU time (s)
FastqScan     6.37       53.57
needletail    43.83      43.29

⚡ 7× Faster on Compressed Files!

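FastqScan's multithreaded decompression cuts wall-clock time from 43.83 s to 6.37 s on compressed data, where it matters most. The tradeoff is higher total CPU time (53.57 s vs. 43.29 s), because the work is spread across several threads.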

FASTA Format (129K sequences, 4.1 Gbp)

Tool          Time (s)   CPU time (s)
FastqScan     1.05       0.76
needletail    1.05       0.65

Result: Comparable performance on FASTA format

SAM Format (80M alignments)

Tool          Time (s)   CPU time (s)
FastqScan     13.38      9.79
needletail    CRASH      CRASH

✓ SAM Format Support

FastqScan correctly handles SAM files while needletail crashes with ParseError. Production-ready robustness matters.
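To make the SAM case concrete, here is a minimal sketch of counting records and bases in SAM text. It is not the FastqScan source. It assumes that header lines are recognized by a leading '@', that every remaining non-empty line counts as one record (including secondary and supplementary alignments), and that bases are the length of the tenth tab-delimited field (SEQ), with "*" counted as zero. The class and method names are hypothetical.

```java
// Hypothetical illustration of SAM counting; not the actual FastqScan code.
class SamCountSketch {

    /** Returns {records, bases} for SAM text held in memory. */
    static long[] count(byte[] sam) {
        long records = 0, bases = 0;
        int pos = 0;
        while (pos < sam.length) {
            int end = pos;
            while (end < sam.length && sam[end] != '\n') { end++; } // find end of line
            int lineEnd = end;
            if (lineEnd > pos && sam[lineEnd - 1] == '\r') { lineEnd--; } // tolerate CRLF
            if (lineEnd > pos && sam[pos] != '@') { // skip empty lines and header lines
                records++;
                bases += seqLength(sam, pos, lineEnd);
            }
            pos = end + 1; // also works when the last line has no terminal newline
        }
        return new long[] {records, bases};
    }

    /** Length of the 10th tab-delimited field (SEQ), or 0 if it is missing or "*". */
    static int seqLength(byte[] b, int start, int end) {
        int field = 0, fieldStart = start;
        for (int i = start; i <= end; i++) {
            if (i == end || b[i] == '\t') {
                if (field == 9) {
                    int len = i - fieldStart;
                    return (len == 1 && b[fieldStart] == '*') ? 0 : len;
                }
                field++;
                fieldStart = i + 1;
            }
        }
        return 0; // fewer than 10 fields
    }
}
```

Only tab and newline positions are inspected, so no strings or per-record objects are created.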

Tools compared: FastqScan (BBTools) and needletail (Rust).

Real-World Applications

Integration with reformat.sh

FastqScan is used internally by reformat.sh in "sample reads target" (srt) mode to efficiently pre-count reads before subsampling:

reformat.sh in=reads.fq out=subset.fq srt=1000000

This operation requires two passes: first counting total reads to determine the sampling fraction, then performing the actual subsampling. By using FastqScan for the counting pass instead of the full streaming infrastructure, this operation is 20% faster with 45% less CPU usage.
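The two-pass idea can be sketched as follows. This is an illustration under assumptions, not the reformat.sh implementation: it assumes the second pass keeps each read independently with probability target/total, which yields approximately (not exactly) the target number of reads. All names are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of count-then-subsample; not the actual reformat.sh logic.
class SampleTargetSketch {

    /** Fraction of reads to keep so that roughly `target` of `total` reads survive. */
    static double samplingFraction(long target, long total) {
        if (total <= 0) { return 0.0; }
        return Math.min(1.0, (double) target / total);
    }

    /** Second pass: decide per read whether to keep it. Returns the number kept. */
    static long sample(long total, double fraction, long seed) {
        Random rng = new Random(seed);
        long kept = 0;
        for (long i = 0; i < total; i++) {
            if (rng.nextDouble() < fraction) {
                kept++; // a real tool would write the read to the output here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long total = 80_000_000L; // result of the counting pass
        double fraction = samplingFraction(1_000_000L, total);
        System.out.println("sampling fraction = " + fraction);
    }
}
```

The counting pass only has to produce `total`, which is why a lightweight scanner is sufficient for it.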

Benchmarks

FastqScan is optimized for benchmarks. When you want to do real work, use Reformat or BBDuk. But for impressive benchmark numbers, use FastqScan! It is the Volkswagen of the BBTools family - except it actually REDUCES emissions!

Batch File Validation

Process multiple files efficiently to collect basic statistics without the overhead of full parsing:

for f in *.fq.gz; do fastqscan.sh "$f"; done

Technical Details

Implementation

FastqScan uses efficient byte-level scanning to minimize overhead:

Single-threaded processing: No thread management overhead for the counting logic
Buffered I/O: Large buffer (256KB) minimizes system calls
Minimal parsing: Only extracts newline positions and calculates lengths, no object creation
Format-specific optimization: Separate code paths for FASTQ (4 lines/record), FASTA (variable), and SAM (tab-delimited)
Smart decompression: Multithreaded bgzip decompression when beneficial
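The FASTQ path described above can be illustrated with a minimal byte-level counter. This is a sketch under assumptions, not the FastqScan source: it assumes strict four-line records, treats the second line of each record as the sequence, ignores carriage returns, and accepts a final line without a terminal newline. The class and method names are hypothetical; only the 256 KB buffer size comes from the text.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of byte-level FASTQ counting; not the actual FastqScan code.
class FastqCountSketch {
    private long lines;    // completed lines seen so far
    private long bases;    // summed length of sequence lines
    private long lineLen;  // bytes in the current line, excluding CR
    private boolean open;  // true if the current line has bytes but no newline yet

    /** Scans one buffer; state carries over, so lines may span buffer boundaries. */
    void feed(byte[] buf, int len) {
        for (int i = 0; i < len; i++) {
            byte b = buf[i];
            if (b == '\n') {
                endLine();
            } else {
                open = true;
                if (b != '\r') { lineLen++; }
            }
        }
    }

    private void endLine() {
        if ((lines & 3) == 1) { bases += lineLen; } // second line of each 4-line record
        lines++;
        lineLen = 0;
        open = false;
    }

    /** Call once after the last buffer. Returns {records, bases}. */
    long[] finish() {
        if (open) { endLine(); } // final line had no terminal newline
        return new long[] {lines / 4, bases};
    }

    /** Counts an entire stream using a single reusable 256 KB buffer. */
    static long[] count(InputStream in) throws IOException {
        FastqCountSketch scanner = new FastqCountSketch();
        byte[] buf = new byte[256 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            scanner.feed(buf, n);
        }
        return scanner.finish();
    }
}
```

A typical call would be `FastqCountSketch.count(new FileInputStream("reads.fq"))`. The inner loop touches only primitive fields, so nothing is allocated per read.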

Output Format

FastqScan reports three values:

Time:     5.634 seconds.
Records:  80000000
Bases:    11967956600

For paired interleaved files, call countReadsAndBases() with halveInterleaved=true to get the molecule count (pairs) instead of the read count.

When to Use FastqScan

Use FastqScan when:

You only need read and base counts
Processing large compressed files where speed matters
Running quick validation checks on datasets
Pre-counting for subsampling operations
Batch processing many files for statistics

Use full BBTools streaming when:

You need quality scores, headers, or sequence content
Performing filtering, trimming, or transformation
Generating histograms or detailed statistics
You need format conversion or validation beyond counting