FastqScan

Script: fastqscan.sh | Package: stream | Class: FastqScan.java

Lightweight sequence file scanner for counting reads and bases with minimal CPU overhead. Efficiently processes FASTQ, FASTA, and SAM formats with optimized handling of bgzipped files.

Compare to active ingredients in needletail!

Basic Usage

fastqscan.sh <input file>

Input may be FASTQ, FASTA, or SAM format, compressed or uncompressed. FastqScan automatically detects the format and reports total reads and bases.

Key Features

  • Minimal overhead: Single-threaded byte-level scanning with no object creation
  • Multi-format: Supports FASTQ, FASTA, and SAM file formats
  • Compression-aware: Efficient multithreaded decompression for bgzipped files
  • Robust: Handles files with or without terminal newlines
  • Fast: Optimized for counting operations without the overhead of full parsing

Performance Benchmarks

Unless noted otherwise, benchmarks were run on an 80-million-read FASTQ file (11.97 Gbp) on a production filesystem. Times are in seconds.

Raw FASTQ (80M reads)

Tool          Time (s)   CPU Time (s)
FastqScan     5.60       2.45
needletail    6.90       3.49

Result: FastqScan is 19% faster with 30% less CPU usage
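
The percentages in the result line can be reproduced from the raw FASTQ figures reported above. The sketch below assumes "faster" is read as a relative reduction in time compared with needletail; the class and method names are illustrative only.

```java
/** Illustrative arithmetic only; names are hypothetical. */
final class BenchmarkMathSketch {

    /** Percent by which `ours` is lower than `baseline`. */
    static double percentLess(double ours, double baseline) {
        return 100.0 * (baseline - ours) / baseline;
    }

    public static void main(String[] args) {
        // Raw FASTQ figures: FastqScan vs. needletail.
        System.out.printf("wall-clock: %.1f%% less%n", percentLess(5.60, 6.90));
        System.out.printf("CPU time:   %.1f%% less%n", percentLess(2.45, 3.49));
    }
}
```

Rounded to whole percentages, these give the 19% and 30% quoted in the result.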

Bgzipped FASTQ (80M reads)

Tool          Time (s)   CPU Time (s)
FastqScan     6.37       53.57
needletail    43.83      43.29

⚡ 7× Faster on Compressed Files!

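FastqScan's multithreaded decompression cuts wall-clock time from 43.83s to 6.37s (about 6.9×) on compressed data, where it matters most. The trade-off is higher total CPU time (53.57s versus 43.29s), because the decompression work is spread across multiple threads.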

FASTA Format (129K sequences, 4.1 Gbp)

Tool          Time (s)   CPU Time (s)
FastqScan     1.05       0.76
needletail    1.05       0.65

Result: Comparable performance on FASTA format

SAM Format (80M alignments)

Tool          Time (s)   CPU Time (s)
FastqScan     13.38      9.79
needletail    CRASH      CRASH

✓ SAM Format Support

FastqScan correctly handles SAM files while needletail crashes with ParseError. Production-ready robustness matters.

Tools compared: FastqScan (BBTools) and needletail (Rust).

Real-World Applications

Integration with reformat.sh

FastqScan is used internally by reformat.sh in "sample reads target" (srt) mode to efficiently pre-count reads before subsampling:

reformat.sh in=reads.fq out=subset.fq srt=1000000

This operation requires two passes: first counting total reads to determine the sampling fraction, then performing the actual subsampling. By using FastqScan for the counting pass instead of the full streaming infrastructure, this operation is 20% faster with 45% less CPU usage.
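
The role of the counting pass can be illustrated with the arithmetic for the sampling fraction. This is a minimal sketch under the assumption that the fraction is simply target divided by total, capped at 1; how reformat.sh actually selects reads is not described here, and the class and method names are hypothetical.

```java
/** Hypothetical sketch of the two-pass arithmetic; not reformat.sh code. */
final class SampleFractionSketch {

    /** Fraction of reads to keep so that about `target` of `total` survive. */
    static double fraction(long total, long target) {
        if (total <= 0) {
            return 0.0;                      // nothing to sample from
        }
        return Math.min(1.0, (double) target / total);
    }

    public static void main(String[] args) {
        // Pass 1 (counting) found 80M reads; the user asked for 1M.
        System.out.println(fraction(80_000_000L, 1_000_000L)); // prints 0.0125
    }
}
```

The fraction cannot be known until the total is counted, which is why a cheap first pass pays off.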

Benchmarks

FastqScan is optimized for benchmarks. When you want to do real work, use Reformat or BBDuk. But for impressive benchmark numbers, use FastqScan! It is the Volkswagen of the BBTools family - except it actually REDUCES emissions!

Batch File Validation

Process multiple files efficiently to collect basic statistics without the overhead of full parsing:

for f in *.fq.gz; do fastqscan.sh "$f"; done

Technical Details

Implementation

FastqScan uses single-threaded byte-level scanning with no per-record object creation to minimize overhead.
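
To make the idea of byte-level scanning concrete, here is a minimal sketch of counting records and bases in uncompressed FASTQ without building any per-read objects. It is not the actual FastqScan source: the class name is hypothetical, and it assumes strict four-line records (no wrapped sequence lines). It tolerates a missing terminal newline and Windows line endings.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; assumes 4-line FASTQ records. */
final class FastqCountSketch {

    /** Returns {records, bases} for the FASTQ bytes read from the stream. */
    static long[] count(InputStream in) {
        byte[] buf = new byte[1 << 16];  // one reusable buffer, no per-read objects
        long lines = 0, bases = 0;
        int lineBytes = 0;               // non-newline bytes seen on the current line
        try {
            for (int n; (n = in.read(buf)) > 0; ) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        lines++;
                        lineBytes = 0;
                    } else if (b != '\r') {
                        lineBytes++;
                        if ((lines & 3) == 1) {
                            bases++;     // second line of each record is the sequence
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (lineBytes > 0) {
            lines++;                     // final line had no terminal newline
        }
        return new long[] {lines / 4, bases};
    }

    public static void main(String[] args) {
        String fq = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII"; // no trailing newline
        long[] c = count(new ByteArrayInputStream(fq.getBytes(StandardCharsets.US_ASCII)));
        System.out.println("Records: " + c[0] + "  Bases: " + c[1]); // Records: 2  Bases: 7
    }
}
```

The inner loop touches each byte once and keeps only a few counters, which is the property that keeps CPU overhead low for pure counting.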

Output Format

FastqScan reports three values:

Time:     5.634 seconds.
Records:  80000000
Bases:    11967956600
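
When collecting statistics across many files, the report above is easy to parse. The following sketch assumes the output consists of "Key: value" lines exactly as shown; the class and method names are hypothetical and not part of BBTools.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for reading a FastqScan-style report. */
final class ScanReportSketch {

    /** Splits each "Key: value" line into a map entry; values stay as strings. */
    static Map<String, String> parse(String report) {
        Map<String, String> fields = new HashMap<>();
        for (String line : report.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String report = "Time:     5.634 seconds.\n"
                      + "Records:  80000000\n"
                      + "Bases:    11967956600\n";
        Map<String, String> f = parse(report);
        long records = Long.parseLong(f.get("Records"));
        long bases = Long.parseLong(f.get("Bases"));
        // Mean read length follows directly from the two counts.
        System.out.printf("mean read length: %.1f%n", (double) bases / records);
    }
}
```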

For paired interleaved files, use countReadsAndBases() with halveInterleaved=true to get molecule count.

When to Use FastqScan

Use FastqScan when:

  • You only need read and base counts
  • You are processing large compressed files where speed matters
  • You are running quick validation checks on datasets
  • You are pre-counting reads for subsampling operations
  • You are batch processing many files for statistics

Use full BBTools streaming when:

  • You need quality scores, headers, or sequence content
  • You are filtering, trimming, or transforming reads
  • You are generating histograms or detailed statistics
  • You need format conversion or validation beyond counting