FileScan

Script: filescan.sh
Package: stream
Class: FileScanMT.java

Fast, lightweight scanner that counts lines and bytes in text files. It detects newlines without parsing the file format, so overhead is minimal. Supports raw, gzip, bgzip, and bzip2 input, with multithreaded decompression for bgzip files.

Basic Usage

filescan.sh <file> [threads]

Examples

filescan.sh contigs.fasta
filescan.sh reads.fq.gz
filescan.sh reads.fq 2

Output

FileScan outputs three values to stdout:

Performance Characteristics

  • Minimal overhead: Counts newlines without parsing file format or creating objects
  • Multithreaded bgzip: Parallel decompression provides significant speedup for .bgz files
  • SIMD acceleration: Autodetected and enabled by default for faster processing
  • Fixed memory: Uses at most 256 MB of memory regardless of file size
  • Thread scaling: Performance peaks at 2 read threads; bgzip decompression uses up to 18 threads
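The "minimal overhead" and "fixed memory" points above can be illustrated with a small sketch. This is not FileScan's actual code: the class name NewlineCountSketch, the method signature, and the buffer size are assumptions made for illustration, and the loop is a plain scalar scan rather than a SIMD one. It shows the general idea of counting newline bytes through one reused buffer, so memory use does not grow with input size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of BBTools.
final class NewlineCountSketch {

    /** Returns {lines, bytes}. Memory use is bounded by bufferSize (must be > 0). */
    static long[] countLinesAndBytes(InputStream in, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize]; // allocated once, reused for every read
        long lines = 0;
        long bytes = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            bytes += n;
            for (int i = 0; i < n; i++) {
                if (buffer[i] == '\n') {
                    lines++; // only newline bytes are counted; nothing is parsed
                }
            }
        }
        return new long[] {lines, bytes};
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "@r1\nACGT\n+\nIIII\n".getBytes(StandardCharsets.US_ASCII);
        long[] result = countLinesAndBytes(new ByteArrayInputStream(data), 8);
        System.out.println(result[0] + " lines, " + result[1] + " bytes");
    }
}
```

Note that this sketch counts newline characters only, so a final line without a trailing newline is not counted. Whether FileScan treats that case the same way is not stated here.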

Common Use Cases

Quick File Validation

Rapidly check file integrity and size before processing:

filescan.sh large_dataset.fq.gz

Verifies the file can be read and reports line/byte counts in seconds.

FASTQ Record Counting

Count sequences in FASTQ files (4 lines per record):

filescan.sh reads.fq.gz | grep "Lines:" | awk '{print $2/4}'

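Divide the line count by 4 to get the number of sequences. This assumes standard four-line records; FASTQ files with wrapped sequence lines will give a wrong result.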
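For callers that already hold the line count in Java, the same arithmetic can be written with a guard against partial records. This is a minimal sketch: the class and method names are hypothetical, and it assumes four-line FASTQ records with a newline after the last line.

```java
// Hypothetical helper; not part of BBTools.
final class FastqRecordCount {

    /** Converts a line count to a FASTQ record count, rejecting partial records. */
    static long recordsFromLines(long lines) {
        if (lines < 0 || lines % 4 != 0) {
            // A remainder suggests a truncated file or wrapped sequence lines.
            throw new IllegalArgumentException(
                    "Line count " + lines + " is not a multiple of 4");
        }
        return lines / 4;
    }
}
```

Failing on a remainder, rather than rounding, turns a silently wrong count into a visible error when a file is truncated.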

Benchmark Compression Performance

Compare processing speed across compression formats:

filescan.sh dataset.fq        # Raw (baseline)
filescan.sh dataset.fq.gz     # Gzip (single-threaded)
filescan.sh dataset.fq.bgz    # Bgzip (multithreaded)

Bgzipped files typically process faster than standard gzip files because their blocks are decompressed in parallel.

Stdin Processing

FileScan works with piped input from other tools:

cat file1.fq file2.fq | filescan.sh stdin.fq

Use stdin as the filename, with an extension that matches the format of the piped data (for example, stdin.fq for uncompressed FASTQ).

Parameters

FileScan uses a minimal parameter set focused on performance tuning. The first argument is always the input file. Additional parameters control threading and SIMD behavior.

Input parameters

<file>
Input filename (required, first argument). Supports raw text, .gz (gzip), .bgz (bgzip), or .bz2 (bzip2) compression. For piped input, use "stdin" with an appropriate extension (e.g., "stdin.fq.gz").

Threading parameters

[threads]
Number of read threads (optional). Given as the second positional argument, or as threads=N or t=N. Default is 1; performance peaks at 2 threads for most files. For bgzipped files, decompression automatically uses up to 18 threads regardless of this setting.

Performance parameters

simd=t
Enable SIMD (Single Instruction Multiple Data) acceleration. Autodetected and enabled by default. Set to simd=f to disable. SIMD provides faster newline detection on supported processors.

Compression Support

Compression Format Details

FileScan automatically detects compression from file extensions and selects appropriate decompression:
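As a rough illustration of extension-based selection, the sketch below maps the four extensions named in this document to a format. The enum, class, and method names are hypothetical, case-insensitive matching is an assumption, and FileScan's real detection logic may differ (for example, it could also inspect file contents).

```java
import java.util.Locale;

// Hypothetical illustration; not FileScan's actual detection code.
final class CompressionByExtension {

    enum Format { RAW, GZIP, BGZIP, BZIP2 }

    /** Picks a format from the filename suffix; anything unrecognized is treated as raw. */
    static Format detect(String filename) {
        String name = filename.toLowerCase(Locale.ROOT);
        if (name.endsWith(".bgz")) {
            return Format.BGZIP;
        }
        if (name.endsWith(".gz")) {
            return Format.GZIP;
        }
        if (name.endsWith(".bz2")) {
            return Format.BZIP2;
        }
        return Format.RAW;
    }
}
```

Under this scheme a name such as stdin.fq.gz resolves to gzip even though no file exists on disk, which is why piped input needs an extension.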

Bgzip Performance Advantage

Bgzipped files use BGZF (Blocked GNU Zip Format), which divides the file into independently compressed blocks. Because no block depends on another, blocks can be decompressed in parallel across multiple threads, which is substantially faster than standard gzip. For large files, bgzip input gives the fastest processing with FileScan.

Thread allocation: Bgzip decompression automatically uses up to 18 threads (configurable via BgzfSettings.READ_THREADS in Java code). This is independent of the read threads parameter.
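The benefit of independent blocks can be sketched with the standard java.util.zip classes. This is not BGZF and not FileScan's decompressor: each "block" here is simply a standalone gzip member held in memory, and the names BlockParallelSketch, compress, countNewlines, and countParallel are hypothetical. Real BGZF additionally records each block's compressed size in the gzip header so a reader can locate block boundaries without decompressing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of block-parallel decompression; not BBTools code.
final class BlockParallelSketch {

    /** Compresses one chunk as a standalone gzip member (a stand-in for a BGZF block). */
    static byte[] compress(byte[] chunk) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(chunk);
        }
        return out.toByteArray();
    }

    /** Decompresses one block and counts the newline bytes it contains. */
    static long countNewlines(byte[] block) throws IOException {
        long lines = 0;
        byte[] buffer = new byte[4096];
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(block))) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        }
        return lines;
    }

    /** Submits one task per block and sums the per-block newline counts. */
    static long countParallel(List<byte[]> blocks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (byte[] block : blocks) {
                Callable<Long> task = () -> countNewlines(block);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // waits for that block's result
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

A line count is a sum, so per-block results can be combined in any order. A standard gzip stream cannot be split this way because decoding each part depends on the data before it.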

Technical Details

Implementation

FileScan uses a minimalist architecture optimized for speed:

Memory Usage

FileScan uses fixed memory allocation regardless of file size:

This makes FileScan safe to run on systems with limited memory or alongside other memory-intensive BBTools processes.

Limitations

Workflow Integration

Pipeline Preprocessing

Use FileScan before expensive operations to verify file integrity:

# Validate input before long-running assembly
filescan.sh reads.fq.gz || exit 1
assembler.sh in=reads.fq.gz out=contigs.fa

Log File Analysis

Quickly check log file sizes and line counts:

filescan.sh pipeline.log
filescan.sh error.log.gz

Integration with BBTools Workflows

FileScan is used internally by other BBTools utilities for rapid file size assessment. It can be called programmatically from Java code using the static method:

long[] result = FileScanMT.countLinesAndBytes(filename, readThreads, zipThreads);

The method returns a two-element array: [totalLines, totalBytes]

Support

For questions and support: