FastqScan
Lightweight sequence file scanner for counting reads and bases with minimal CPU overhead. Efficiently processes FASTQ, FASTA, and SAM formats with optimized handling of bgzipped files.
Basic Usage
fastqscan.sh <input file>
Input may be FASTQ, FASTA, or SAM format, compressed or uncompressed. FastqScan automatically detects the format and reports total reads and bases.
Key Features
- Minimal overhead: Single-threaded byte-level scanning with no object creation
- Multi-format: Supports FASTQ, FASTA, and SAM file formats
- Compression-aware: Efficient multithreaded decompression for bgzipped files
- Robust: Handles files with or without terminal newlines
- Fast: Optimized for counting operations without the overhead of full parsing
Performance Benchmarks
All benchmarks were run on an 80-million-read FASTQ file (11.97 Gbp) on a production filesystem. Times are in seconds.
Result: FastqScan is 19% faster with 30% less CPU usage
⚡ 7× Faster on Compressed Files!
FastqScan's multithreaded decompression delivers a real-world speedup where it matters most: processing compressed data.
Result: Comparable performance on FASTA format
✓ SAM Format Support
FastqScan correctly handles SAM files, while needletail crashes with a ParseError. Production-ready robustness matters.
Real-World Applications
Integration with reformat.sh
FastqScan is used internally by reformat.sh in "sample reads target" (srt) mode to efficiently pre-count reads before subsampling:
reformat.sh in=reads.fq out=subset.fq srt=1000000
This operation requires two passes: the first counts total reads to determine the sampling fraction, and the second performs the actual subsampling. Using FastqScan for the counting pass instead of the full streaming infrastructure makes the operation 20% faster with 45% less CPU usage.
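The arithmetic behind the first pass can be sketched as follows. This is only an illustration of how a read count turns into a sampling fraction; `sampling_fraction` is a hypothetical helper, not part of reformat.sh, and how reformat.sh computes the fraction internally is not shown here.

```shell
# Hypothetical helper (not a BBTools command): fraction of reads to keep
# so that roughly $1 reads survive out of $2 counted reads.
sampling_fraction() {
  awk -v target="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "0.000000"; exit }
    f = target / total
    if (f > 1) f = 1            # cannot keep more reads than exist
    printf "%.6f\n", f
  }'
}

# Counts taken from this page: srt=1000000 on an 80,000,000-read file.
sampling_fraction 1000000 80000000   # prints 0.012500
```

The point of counting first is that the fraction is known before the second pass starts, so subsampling can stream through the file once without buffering reads.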
Benchmarks
FastqScan is optimized for benchmarks. When you want to do real work, use Reformat or BBDuk. But for impressive benchmark numbers, use FastqScan! It is the Volkswagen of the BBTools family - except it actually REDUCES emissions!
Batch File Validation
Process multiple files efficiently to collect basic statistics without the overhead of full parsing:
for f in *.fq.gz; do fastqscan.sh "$f"; done
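To turn a batch run into a table, the report can be reduced to one line per file. The sketch below assumes the "Records:" and "Bases:" line layout shown under Output Format. Both function names are hypothetical, and `2>&1` is used because this page does not say whether the report goes to stdout or stderr.

```shell
# Hypothetical helper: read a FastqScan report on stdin and print
# "<records><TAB><bases>". Assumes lines of the form "Records: N" and "Bases: N".
summarize_scan() {
  awk '$1 == "Records:" { r = $2 }
       $1 == "Bases:"   { b = $2 }
       END { printf "%s\t%s\n", r, b }'
}

# Hypothetical wrapper: one TSV row (file, records, bases) per input file.
# 2>&1 merges both streams, since the report's output stream is an assumption.
scan_all() {
  for f in "$@"; do
    printf '%s\t' "$f"
    fastqscan.sh "$f" 2>&1 | summarize_scan
  done
}
```

With these defined, `scan_all *.fq.gz > counts.tsv` would write one row per file.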
Technical Details
Implementation
FastqScan uses efficient byte-level scanning to minimize overhead:
- Single-threaded processing: No thread management overhead for the counting logic
- Buffered I/O: Large buffer (256KB) minimizes system calls
- Minimal parsing: Only extracts newline positions and calculates lengths, no object creation
- Format-specific optimization: Separate code paths for FASTQ (4 lines/record), FASTA (variable), and SAM (tab-delimited)
- Smart decompression: Multithreaded bgzip decompression when beneficial
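The per-format counting rules above can be illustrated with awk. This is not FastqScan's code, which scans bytes directly rather than splitting lines. It is a sketch of the same rules for uncompressed input, assuming Unix line endings and unwrapped 4-line FASTQ records. The function names are hypothetical.

```shell
# FASTQ: every record is exactly 4 lines; the 2nd line of each holds the bases.
count_fastq() {
  awk 'NR % 4 == 2 { records++; bases += length($0) }
       END { printf "Records: %.0f\nBases: %.0f\n", records, bases }' "$1"
}

# FASTA: a ">" line starts a record; every other line contributes bases,
# so wrapped (multi-line) sequences are summed correctly.
count_fasta() {
  awk '/^>/ { records++; next }
       { bases += length($0) }
       END { printf "Records: %.0f\nBases: %.0f\n", records, bases }' "$1"
}
```

Both functions count a final line that lacks a terminal newline, since awk still treats it as a record. A byte-level scanner has to handle that case explicitly.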
Output Format
FastqScan reports three values:
Time: 5.634 seconds.
Records: 80000000
Bases: 11967956600
For paired, interleaved files, call countReadsAndBases() with halveInterleaved=true to get the molecule count instead of the read count.
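When working from the command-line report rather than the Java method, the same number can be derived from the Records value. The sketch below assumes that the molecule count is simply half the record count, which is what the halveInterleaved name suggests. `molecule_count` is a hypothetical helper.

```shell
# Hypothetical helper: molecules (read pairs) in an interleaved file,
# given the record count from the report.
molecule_count() {
  if [ $(( $1 % 2 )) -ne 0 ]; then
    # An odd count means at least one read has no mate.
    echo "odd record count ($1): input is not cleanly interleaved" >&2
    return 1
  fi
  echo $(( $1 / 2 ))
}

molecule_count 80000000   # prints 40000000
```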
When to Use FastqScan
Use FastqScan when:
- You only need read and base counts
- You are processing large compressed files where speed matters
- You are running quick validation checks on datasets
- You are pre-counting reads for subsampling operations
- You are batch-processing many files for statistics
Use full BBTools streaming when:
- You need quality scores, headers, or sequence content
- You are filtering, trimming, or transforming reads
- You are generating histograms or detailed statistics
- You need format conversion or validation beyond counting