FileScan

Script: filescan.sh Package: stream Class: FileScanMT.java

A fast, lightweight scanner that counts lines and bytes in text files. It counts newlines without parsing the file format, and supports raw, gzip, bgzip, and bz2 input, with multithreaded decompression for bgzip.

Basic Usage

filescan.sh <file> [threads]

Examples

filescan.sh contigs.fasta
filescan.sh reads.fq.gz
filescan.sh reads.fq 2

Output

FileScan outputs three values to stdout:

Time: Total processing time
Lines: Total number of newline characters
Bytes: Total number of bytes processed

Performance Characteristics

Minimal overhead: Counts newlines without parsing file format or creating objects
Multithreaded bgzip: Parallel decompression provides significant speedup for .bgz files
SIMD acceleration: Autodetected and enabled by default for faster processing
Fixed memory: Uses at most 256MB of memory regardless of file size
Thread scaling: Performance peaks at 2 read threads; bgzip decompression uses up to 18 threads

Common Use Cases

Quick File Validation

Rapidly check file integrity and size before processing:

filescan.sh large_dataset.fq.gz

Confirms that the file can be read and decompressed from start to end, and reports its line and byte counts. This does not check that the contents are valid FASTQ.

FASTQ Record Counting

Count sequences in FASTQ files (4 lines per record):

filescan.sh reads.fq.gz | grep "Lines:" | awk '{print $2/4}'

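Dividing the line count by 4 gives the number of sequences. This assumes standard 4-line FASTQ records; files with wrapped (multi-line) sequences will give a wrong count.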
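The same calculation can be sketched in Java for pipelines that capture FileScan's output as text. This is a minimal sketch, not part of BBTools: the class and method names are hypothetical, and the "Lines: <count>" layout is assumed from the awk example above. It also rejects line counts that are not a multiple of 4, which usually means a truncated or wrapped file.

```java
import java.util.OptionalLong;

class FastqRecordCount {

    /**
     * Returns the number following the first "Lines:" label in the given text.
     * Assumption: the output contains a line of the form "Lines: <count>",
     * as implied by the awk example; this is not a documented format.
     */
    public static OptionalLong parseLineCount(String fileScanOutput) {
        for (String line : fileScanOutput.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Lines:")) {
                // Take the first whitespace-separated token after the label, like awk's $2.
                String rest = trimmed.substring("Lines:".length()).trim();
                return OptionalLong.of(Long.parseLong(rest.split("\\s+")[0]));
            }
        }
        return OptionalLong.empty();
    }

    /** Converts a line count to a 4-line FASTQ record count. */
    public static long recordsFromLines(long lines) {
        if (lines % 4 != 0) {
            throw new IllegalArgumentException(
                "Line count " + lines + " is not a multiple of 4");
        }
        return lines / 4;
    }

    public static void main(String[] args) {
        // Hypothetical captured output; the values are made up for illustration.
        String output = "Time: 0.123 seconds\nLines: 400\nBytes: 25000\n";
        long lines = parseLineCount(output).orElseThrow();
        System.out.println("Records: " + recordsFromLines(lines));
    }
}
```

Failing loudly on a remainder is a deliberate choice here: silently rounding would hide a file that was cut off mid-record.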

Benchmark Compression Performance

Compare processing speed across compression formats:

filescan.sh dataset.fq        # Raw (baseline)
filescan.sh dataset.fq.gz     # Gzip (single-threaded)
filescan.sh dataset.fq.bgz    # Bgzip (multithreaded)

Bgzipped files typically process faster than gzipped files because their blocks are decompressed in parallel.

Stdin Processing

FileScan works with piped input from other tools:

cat file1.fq file2.fq | filescan.sh stdin.fq

Use stdin as the filename, with an extension that matches the format of the piped data (for example, stdin.fq.gz for gzipped FASTQ).

Parameters

FileScan uses a minimal parameter set focused on performance tuning. The first argument is always the input file. Additional parameters control threading and SIMD behavior.

Input parameters

<file>: Input filename (required, first argument). Supports raw text, .gz (gzip), .bgz (bgzip), or .bz2 (bzip2) compression. Use "stdin" with appropriate extension (e.g., "stdin.fq.gz") for piped input.

Threading parameters

[threads]: Number of read threads (optional). Default is 1. Specify it as the second positional argument, or as threads=N or t=N. Performance peaks at 2 threads for most files. For bgzipped files, decompression automatically uses up to 18 threads regardless of this setting.

Performance parameters

simd=t: Enable SIMD (Single Instruction Multiple Data) acceleration. Autodetected and enabled by default. Set to simd=f to disable. SIMD provides faster newline detection on supported processors.

Compression Support

Compression Format Details

FileScan automatically detects compression from file extensions and selects appropriate decompression:

Raw (uncompressed): Direct file reading with minimal overhead
Gzip (.gz): Single-threaded decompression using standard gzip algorithm
Bgzip (.bgz, .bz): Multithreaded decompression using BGZF format - significantly faster for large files
Bzip2 (.bz2): Single-threaded decompression using bzip2 algorithm

Bgzip Performance Advantage

Bgzipped files use BGZF (Blocked GNU Zip Format), which divides the file into independent compressed blocks. Because each block can be decompressed on its own, multiple threads can work on different blocks at once, which standard gzip does not allow. For large compressed files, bgzip is therefore the fastest format to process with FileScan.

Thread allocation: Bgzip decompression automatically uses up to 18 threads (configurable via BgzfSettings.READ_THREADS in Java code). This is independent of the read threads parameter.
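The block structure that makes parallel decompression possible is visible in the file itself. The sketch below is based on the published BGZF layout in the SAM specification, not on FileScan's source, and the class and method names are hypothetical. It checks whether a buffer starts with a BGZF block header and, if so, returns the block's total size, which is what lets a reader jump to the next block without decompressing the current one.

```java
class BgzfHeader {

    /** Convenience helper: builds a byte array from int literals. */
    public static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    /**
     * Returns the total size in bytes of the BGZF block starting at index 0,
     * or -1 if the bytes do not begin with a BGZF block header.
     */
    public static int blockSize(byte[] b) {
        if (b.length < 18) return -1;
        boolean gzipMagic = (b[0] & 0xFF) == 0x1F && (b[1] & 0xFF) == 0x8B;
        boolean deflate = b[2] == 8;           // CM = 8 (DEFLATE)
        boolean hasExtra = (b[3] & 0x04) != 0; // FLG.FEXTRA bit set
        if (!gzipMagic || !deflate || !hasExtra) return -1;

        // XLEN: little-endian length of the extra field, which starts at byte 12.
        int xlen = (b[10] & 0xFF) | ((b[11] & 0xFF) << 8);
        int pos = 12;
        int end = Math.min(b.length, 12 + xlen);

        // Walk the extra subfields looking for the 'B','C' subfield of length 2.
        while (pos + 4 <= end) {
            int si1 = b[pos] & 0xFF;
            int si2 = b[pos + 1] & 0xFF;
            int slen = (b[pos + 2] & 0xFF) | ((b[pos + 3] & 0xFF) << 8);
            if (si1 == 'B' && si2 == 'C' && slen == 2 && pos + 6 <= end) {
                // BSIZE stores the total block size minus 1.
                int bsize = (b[pos + 4] & 0xFF) | ((b[pos + 5] & 0xFF) << 8);
                return bsize + 1;
            }
            pos += 4 + slen;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The 28-byte empty block that terminates a BGZF file.
        byte[] eof = bytes(0x1f, 0x8b, 0x08, 0x04, 0, 0, 0, 0, 0, 0xff, 0x06, 0,
                           0x42, 0x43, 0x02, 0, 0x1b, 0, 0x03, 0,
                           0, 0, 0, 0, 0, 0, 0, 0);
        System.out.println("Block size: " + blockSize(eof));
    }
}
```

A plain gzip file normally has no such subfield, so the same check returns -1, which is one reason standard gzip streams cannot be split across threads this way.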

Technical Details

Implementation

FileScan uses a minimalist architecture optimized for speed:

Newline parsing only: Does not parse file format structure - simply counts '\n' characters
Zero object creation: Operates directly on byte buffers without creating sequence objects
Buffer-based reading: Fixed buffer size with synchronized filling from input stream
Thread coordination: Multiple threads read from shared input stream with synchronized access
SIMD optimization: Uses Vector API for accelerated newline detection when available
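The core idea in the list above, counting '\n' bytes in a reusable buffer without building any per-line objects, can be sketched as follows. This is a single-threaded, scalar illustration under stated assumptions, not FileScan's actual code: the class and method names are hypothetical, and it omits the thread coordination and Vector API path described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class NewlineCounter {

    /**
     * Reads the stream through one fixed-size buffer and returns {lines, bytes}.
     * I/O failures are rethrown as UncheckedIOException to keep callers simple.
     */
    public static long[] count(InputStream in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize]; // allocated once and reused
        long lines = 0;
        long bytes = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes += n;
                // Scan only the n bytes filled by this read.
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {lines, bytes};
    }

    public static void main(String[] args) {
        byte[] data = "@r1\nACGT\n+\nIIII\n".getBytes(StandardCharsets.US_ASCII);
        // A deliberately tiny buffer to show that lines may span buffer refills.
        long[] result = count(new ByteArrayInputStream(data), 4);
        System.out.println("Lines: " + result[0] + ", Bytes: " + result[1]);
    }
}
```

Because only newline bytes are counted, a final line with no trailing newline is not included in the line total, which matches the "number of newline characters" definition of Lines.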

Memory Usage

FileScan uses fixed memory allocation regardless of file size:

Maximum: 256MB (-Xmx256m)
Minimum: 128MB (-Xms128m)
Mode: Fixed memory (does not use autodetection)

This makes FileScan safe to run on systems with limited memory or alongside other memory-intensive BBTools processes.

Limitations

Does not validate file format correctness - only counts lines and bytes
Cannot distinguish between different sequence formats
Thread scaling peaks at 2 read threads (additional threads provide minimal benefit)
For format-specific validation, use dedicated tools like reformat.sh or readlength.sh

Workflow Integration

Pipeline Preprocessing

Use FileScan before expensive operations to verify file integrity:

# Validate input before long-running assembly
filescan.sh reads.fq.gz || exit 1
assembler.sh in=reads.fq.gz out=contigs.fa

Log File Analysis

Quickly check log file sizes and line counts:

filescan.sh pipeline.log
filescan.sh error.log.gz

Integration with BBTools Workflows

FileScan is used internally by other BBTools utilities for rapid file size assessment. It can be called programmatically from Java code using the static method:

long[] result = FileScanMT.countLinesAndBytes(filename, readThreads, zipThreads);

Returns an array: [totalLines, totalBytes]

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org