FileScan
Fast, lightweight scanner that counts lines and bytes in text files. It counts newline characters without parsing the file format, so overhead is minimal. Supports raw, gzip, bgzip, and bzip2 input, with multithreaded decompression for bgzip.
Basic Usage
filescan.sh <file> [threads]
Examples
filescan.sh contigs.fasta
filescan.sh reads.fq.gz
filescan.sh reads.fq 2
Output
FileScan outputs three values to stdout:
- Time: Total processing time
- Lines: Total number of newline characters
- Bytes: Total number of bytes processed
Performance Characteristics
- Minimal overhead: Counts newlines without parsing file format or creating objects
- Multithreaded bgzip: Parallel decompression provides significant speedup for .bgz files
- SIMD acceleration: Autodetected and enabled by default for faster processing
- Fixed memory: Uses at most 256MB of memory (-Xmx256m) regardless of file size
- Thread scaling: Performance peaks at 2 read threads; bgzip decompression uses up to 18 threads
Common Use Cases
Quick File Validation
Rapidly check file integrity and size before processing:
filescan.sh large_dataset.fq.gz
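Confirms that the file can be read (and decompressed) from start to finish, and reports its line and byte counts. It does not check that the contents are valid for the file's format.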
FASTQ Record Counting
Count sequences in FASTQ files (4 lines per record):
filescan.sh reads.fq.gz | grep "Lines:" | awk '{print $2/4}'
Dividing the line count by 4 gives the number of sequences, provided every record occupies exactly 4 lines (no wrapped sequence or quality lines).
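The same arithmetic can be sketched in Java for pipelines that capture FileScan's output as text. This is only an illustration: the class name FastqRecordCount and both method names are hypothetical, and the shape of the output row (the label "Lines:" followed by whitespace and an integer) is an assumption inferred from the grep/awk command above, not a documented format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BBTools.
final class FastqRecordCount {
    // Assumed row shape: "Lines:" label, whitespace, then an integer.
    private static final Pattern LINES = Pattern.compile("(?m)^\\s*Lines:\\s+(\\d+)");

    /** Extracts the line count from captured FileScan output text. */
    static long parseLines(String output) {
        Matcher m = LINES.matcher(output);
        if (!m.find()) {
            throw new IllegalArgumentException("No 'Lines:' row found in output");
        }
        return Long.parseLong(m.group(1));
    }

    /** Converts a line count to a FASTQ record count, rejecting partial records. */
    static long fastqRecords(long lines) {
        if (lines % 4 != 0) {
            throw new IllegalArgumentException(
                "Line count " + lines + " is not a multiple of 4");
        }
        return lines / 4;
    }
}
```

Unlike the awk one-liner, this version fails loudly when the line count is not a multiple of 4, which usually points to a truncated file, a missing final newline, or wrapped records.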
Benchmark Compression Performance
Compare processing speed across compression formats:
filescan.sh dataset.fq # Raw (baseline)
filescan.sh dataset.fq.gz # Gzip (single-threaded)
filescan.sh dataset.fq.bgz # Bgzip (multithreaded)
Bgzipped files process significantly faster due to parallel decompression.
Stdin Processing
FileScan works with piped input from other tools:
cat file1.fq file2.fq | filescan.sh stdin.fq
Use stdin as the filename, with an extension that matches the format of the piped data (for example, stdin.fq.gz for gzipped FASTQ).
Parameters
FileScan uses a minimal parameter set focused on performance tuning. The first argument is always the input file. Additional parameters control threading and SIMD behavior.
Input parameters
- <file>
- Input filename (required, first argument). Supports raw text, .gz (gzip), .bgz (bgzip), or .bz2 (bzip2) compression. Use "stdin" with appropriate extension (e.g., "stdin.fq.gz") for piped input.
Threading parameters
- [threads]
- Number of read threads (optional). Given either as the second positional argument or as threads=N or t=N. Default is 1. Performance peaks at 2 threads for most files. For bgzipped files, decompression automatically uses up to 18 threads regardless of this setting.
Performance parameters
- simd=t
- Enable SIMD (Single Instruction Multiple Data) acceleration. Autodetected and enabled by default. Set simd=f to disable. SIMD provides faster newline detection on supported processors.
Compression Support
Compression Format Details
FileScan automatically detects compression from file extensions and selects appropriate decompression:
- Raw (uncompressed): Direct file reading with minimal overhead
- Gzip (.gz): Single-threaded decompression using standard gzip algorithm
- Bgzip (.bgz, .bz): Multithreaded decompression using the BGZF format; significantly faster than gzip for large files
- Bzip2 (.bz2): Single-threaded decompression using bzip2 algorithm
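The extension-to-format mapping listed above can be sketched as follows. This is an illustrative guess at the logic, not FileScan's actual code: the class name CompressionByExtension, the Format enum, and the detect method are hypothetical, and case-sensitive matching is an assumption.

```java
// Hypothetical sketch of extension-based format selection; not BBTools code.
final class CompressionByExtension {
    enum Format { RAW, GZIP, BGZIP, BZIP2 }

    /** Picks a format from the filename suffix, using the extensions listed above. */
    static Format detect(String filename) {
        if (filename.endsWith(".bz2")) {
            return Format.BZIP2;
        }
        if (filename.endsWith(".bgz") || filename.endsWith(".bz")) {
            return Format.BGZIP;
        }
        if (filename.endsWith(".gz")) {
            return Format.GZIP;
        }
        // Anything else is treated as uncompressed text.
        return Format.RAW;
    }
}
```

Because only the suffix is inspected, the same function covers piped input: a name such as stdin.fq.gz maps to GZIP even though no file of that name exists.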
Bgzip Performance Advantage
Bgzipped files use BGZF (Blocked GNU Zip Format), which divides the file into independent compressed blocks. Because each block can be decompressed without reference to any other, several threads can decompress different blocks at the same time, giving a substantial speedup over standard gzip. For large files, bgzip is the fastest of the supported compression formats to process with FileScan.
Thread allocation: Bgzip decompression automatically uses up to 18 threads (configurable via BgzfSettings.READ_THREADS in Java code). This is independent of the read threads parameter.
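To make the block structure concrete, the sketch below reads the size of a BGZF block from its header. Each BGZF block is a gzip member whose extra field carries a "BC" subfield holding the total block size minus one, which is what lets a reader find block boundaries without decompressing anything. The class BgzfBlockHeader and its methods are hypothetical names for illustration; this is not how FileScan itself is implemented.

```java
// Hypothetical sketch of BGZF block-boundary discovery; not BBTools code.
final class BgzfBlockHeader {
    /**
     * Returns the total compressed size of the BGZF block starting at offset,
     * or -1 if the bytes there do not form a BGZF block header.
     */
    static int blockSize(byte[] data, int offset) {
        if (data.length - offset < 12) {
            return -1;
        }
        // gzip magic (0x1f 0x8b), deflate method (8), FEXTRA flag bit (0x04).
        if ((data[offset] & 0xff) != 0x1f || (data[offset + 1] & 0xff) != 0x8b
                || data[offset + 2] != 8 || (data[offset + 3] & 0x04) == 0) {
            return -1;
        }
        int xlen = u16(data, offset + 10);
        int pos = offset + 12;
        int end = pos + xlen;
        if (end > data.length) {
            return -1;
        }
        // Walk the extra subfields looking for 'B','C' with a 2-byte payload.
        while (pos + 4 <= end) {
            int si1 = data[pos] & 0xff;
            int si2 = data[pos + 1] & 0xff;
            int slen = u16(data, pos + 2);
            if (si1 == 'B' && si2 == 'C' && slen == 2 && pos + 6 <= end) {
                return u16(data, pos + 4) + 1; // BSIZE stores (block size - 1)
            }
            pos += 4 + slen;
        }
        return -1;
    }

    /** Counts blocks by hopping from header to header. */
    static int countBlocks(byte[] data) {
        int offset = 0;
        int blocks = 0;
        while (offset < data.length) {
            int size = blockSize(data, offset);
            if (size <= 0) {
                throw new IllegalArgumentException("Not a BGZF block at offset " + offset);
            }
            offset += size;
            blocks++;
        }
        return blocks;
    }

    // Reads an unsigned 16-bit little-endian value.
    private static int u16(byte[] d, int i) {
        return (d[i] & 0xff) | ((d[i + 1] & 0xff) << 8);
    }
}
```

Once block offsets are known this way, each block can be handed to a separate worker for inflation, which is the property that plain gzip streams lack.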
Technical Details
Implementation
FileScan uses a minimalist architecture optimized for speed:
- Newline parsing only: Does not parse file format structure; simply counts '\n' characters
- Zero object creation: Operates directly on byte buffers without creating sequence objects
- Buffer-based reading: Fixed buffer size with synchronized filling from input stream
- Thread coordination: Multiple threads read from shared input stream with synchronized access
- SIMD optimization: Uses Vector API for accelerated newline detection when available
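The buffer-based, synchronized-read design described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the FileScan source: the class NewlineCountSketch is hypothetical, the buffer size is a caller-supplied parameter, and a plain scalar loop stands in for the SIMD path.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of shared-stream newline counting; not BBTools code.
final class NewlineCountSketch {
    /** Returns {totalLines, totalBytes} for the stream. */
    static long[] countLinesAndBytes(InputStream in, int threads, int bufferSize) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<long[]>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                parts.add(pool.submit(() -> {
                    byte[] buf = new byte[bufferSize]; // one fixed buffer per worker
                    long lines = 0;
                    long bytes = 0;
                    while (true) {
                        int n;
                        synchronized (in) { // only the buffer fill is serialized
                            n = in.read(buf);
                        }
                        if (n < 0) {
                            break; // end of stream
                        }
                        for (int j = 0; j < n; j++) {
                            if (buf[j] == '\n') {
                                lines++;
                            }
                        }
                        bytes += n;
                    }
                    return new long[] {lines, bytes};
                }));
            }
            long totalLines = 0;
            long totalBytes = 0;
            for (Future<long[]> part : parts) {
                long[] r = part.get();
                totalLines += r[0];
                totalBytes += r[1];
            }
            return new long[] {totalLines, totalBytes};
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Counting failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Newline counts are additive across chunks, so workers never need to know where a line begins or ends; that is why no per-line objects are required. The synchronized read is also a plausible reason why extra read threads stop helping quickly, since the fill step remains serial.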
Memory Usage
FileScan uses fixed memory allocation regardless of file size:
- Maximum: 256MB (-Xmx256m)
- Minimum: 128MB (-Xms128m)
- Mode: Fixed memory (does not use autodetection)
This makes FileScan safe to run on systems with limited memory or alongside other memory-intensive BBTools processes.
Limitations
- Does not validate file format correctness - only counts lines and bytes
- Cannot distinguish between different sequence formats
- Thread scaling peaks at 2 read threads (additional threads provide minimal benefit)
- For format-specific validation, use dedicated tools like reformat.sh or readlength.sh
Workflow Integration
Pipeline Preprocessing
Use FileScan before expensive operations to verify file integrity:
# Validate input before long-running assembly
filescan.sh reads.fq.gz || exit 1
assembler.sh in=reads.fq.gz out=contigs.fa
Log File Analysis
Quickly check log file sizes and line counts:
filescan.sh pipeline.log
filescan.sh error.log.gz
Integration with BBTools Workflows
FileScan is used internally by other BBTools utilities for rapid file size assessment. It can be called programmatically from Java code using the static method:
long[] result = FileScanMT.countLinesAndBytes(filename, readThreads, zipThreads);
The method returns a two-element array: [totalLines, totalBytes].
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org