TestFormat

Basic Usage

testformat.sh <file1> [file2] [file3] ...

Analyzes one or more files to determine their format characteristics. Files can be specified as command-line arguments or using the 'in' parameter.

Parameters

TestFormat accepts both positional arguments and named parameters that control format detection and analysis.

Input Parameters

in=file: Specify input file(s) for format analysis. Multiple files can be specified using in1=file1, in2=file2, etc. Files can also be given as positional arguments.

Analysis Parameters

verbose=false: Enable verbose output for detailed format detection information. Shows internal processing steps and diagnostic information.
full=false: Run full TestFormat analysis. When enabled, delegates to jgi.TestFormat for comprehensive format testing instead of the lightweight format detection.

Output Format

TestFormat prints one tab-delimited line per file, with the following fields:

Quality Encoding: "sanger" (ASCII-33), "illumina" (ASCII-64), or numeric offset
File Format: Format type (fastq, fasta, sam, bam, vcf, etc.)
Compression: Compression method (raw, gz, bz2, zip, xz, etc.)
Interleaving: "interleaved" or "single-ended" for sequence files
Read Length: Read length in base pairs, estimated from the first records of the file (for sequence files)
Extension Notes: Indicates if file extension differs from detected content
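
The field list above can be turned into a small parser for downstream scripts. The following is a minimal Python sketch, not part of BBTools: the names FormatReport and parse_report are hypothetical, and it assumes the first three fields are always present in the order shown, with the remaining fields optional.

```python
from typing import NamedTuple, Optional


class FormatReport(NamedTuple):
    """One parsed output line (hypothetical container, not a BBTools type)."""
    quality: str
    file_format: str
    compression: str
    interleaving: Optional[str]
    read_length: Optional[int]
    notes: list


def parse_report(line: str) -> FormatReport:
    """Split one tab-delimited output line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least quality, format and compression")
    quality, file_format, compression = fields[:3]
    interleaving = None
    read_length = None
    notes = []
    for field in fields[3:]:
        if field in ("interleaved", "single-ended"):
            interleaving = field
        elif field.endswith("bp") and field[:-2].isdigit():
            # "150bp" -> 150
            read_length = int(field[:-2])
        else:
            # Anything else is kept verbatim, e.g. an extension note.
            notes.append(field)
    return FormatReport(quality, file_format, compression,
                        interleaving, read_length, notes)
```

For example, parse_report("sanger\tfastq\traw\tsingle-ended\t150bp") yields read_length 150 and interleaving "single-ended".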

Examples

Basic Format Detection

# Test a single FASTQ file
testformat.sh reads.fq
# Output: sanger	fastq	raw	single-ended	150bp

# Test multiple files
testformat.sh reads.fq assembly.fa aligned.sam
# Outputs format info for each file on separate lines

Compressed File Detection

# Test compressed files
testformat.sh reads.fq.gz assembly.fa.bz2
# Output shows compression type: gz, bz2, etc.

Using Named Parameters

# Verbose analysis of multiple files
testformat.sh verbose=true in1=reads.fq in2=assembly.fa

# Full comprehensive analysis
testformat.sh full=true reads.fastq

Interleaved File Detection

# Detect interleaved paired-end files
testformat.sh paired_reads.fq
# Output: sanger	fastq	raw	interleaved	150bp

Supported File Formats

TestFormat can detect and analyze the following file formats:

Sequence Formats

FASTA: .fa, .fasta, .fas, .fna, .ffn, .frn, .seq, .fsa, .faa (amino acids), .prot
FASTQ: .fq, .fastq (with quality score detection)
FASTR: .fastr, .fr (BBTools flat format)
ONELINE: .oneline, .flat (single-line format)
BREAD: .bread (BBTools binary format)
CSFASTA: .csfasta (colorspace)
SCARF: .scarf (Solexa/Illumina format)

Alignment Formats

SAM: .sam (Sequence Alignment/Map)
BAM: .bam (Binary Alignment/Map)

Variant and Annotation Formats

VCF: .vcf (Variant Call Format)
VAR: .var (BBTools variant format)
GFF: .gff, .gff3 (General Feature Format)
BED: .bed (Browser Extensible Data)

Specialized Formats

SKETCH: .sketch (BBSketch format)
PGM: .pgm, .pkm (Prokaryotic Gene Model)
PHYLIP: .phylip (phylogenetic analysis)
EMBL: .embl (European Molecular Biology Laboratory)
GENBANK: .gbk, .gbff (GenBank formats)
BBNET: .bbnet (BBTools network format)
BBVEC: .bbvec, .vec (BBTools vector format)
CLADE: .clade, .spectra (taxonomic classification)

Compression Formats

GZIP: .gz (most common for bioinformatics)
BZIP2: .bz2 (better compression, slower)
XZ: .xz (LZMA compression)
ZIP: .zip (standard archive format)
7-Zip: .7z (high compression)
ZSTD: .zst (fast modern compression)
FQZ: .fqz (FASTQ-specific compression)
DSRC: .dsrc (DNA sequence compression)
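
Several of these compression formats can be recognized from their leading bytes as well as from the extension. The sketch below is illustrative Python, not the BBTools implementation: the function names are hypothetical, the byte signatures are the publicly documented ones for each container, and FQZ and DSRC are handled by extension only because no signature is assumed for them here.

```python
# Documented leading-byte signatures for common compression containers.
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "gz"),
    (b"BZh", "bz2"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"\x37\x7a\xbc\xaf\x27\x1c", "7z"),
    (b"\x28\xb5\x2f\xfd", "zst"),
]

# Extensions from the list above.
COMPRESSION_EXTENSIONS = {"gz", "bz2", "xz", "zip", "7z", "zst", "fqz", "dsrc"}


def compression_from_extension(path: str) -> str:
    """Return the compression implied by the final extension, or "raw"."""
    name = path.rsplit("/", 1)[-1].lower()
    if "." not in name:
        return "raw"
    ext = name.rsplit(".", 1)[-1]
    return ext if ext in COMPRESSION_EXTENSIONS else "raw"


def compression_from_header(header: bytes) -> str:
    """Return the compression implied by the first bytes, or "raw"."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "raw"
```

A disagreement between the two results, such as a file named reads.fq.gz whose first bytes are plain text, is the kind of mismatch the Extension Notes field is meant to flag.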

Algorithm Details

TestFormat employs a multi-stage detection strategy that combines file extension analysis with content inspection:

Detection Strategy

Extension Analysis: Uses testFormat() method to map file extensions via ReadWrite.rawExtension() and ReadWrite.compressionType()
Magic Number Detection: ReadWrite.getInputStream() checks file headers for compression signatures using BufferedReader
Content Analysis: getFirstOctet() method reads up to 8 lines using BufferedReader for format detection
Quality Encoding Detection: stream.FASTQ.testQuality() analyzes quality character distribution and ASCII ranges
Interleaving Detection: stream.FASTQ.testInterleaved() examines read name patterns and pairing consistency
Read Length Calculation: testInterleavedAndQuality() measures sequence length at lines 1 and 5 of the octet (0-based, the sequence lines of the first two FASTQ records) and combines them using Tools.max()
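
The content-inspection steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BBTools code: the function names are hypothetical, and the pairing rule (matching names ending in /1 and /2, or Illumina-style comments starting with "1:" and "2:") is a common convention assumed here rather than the exact test performed by stream.FASTQ.testInterleaved().

```python
def first_octet(lines, limit=8):
    """Collect at most `limit` lines from any iterable of text lines."""
    octet = []
    for line in lines:
        octet.append(line.rstrip("\r\n"))
        if len(octet) >= limit:
            break
    return octet


def estimate_read_length(octet):
    """Longest sequence among the first two records (0-based lines 1 and 5)."""
    lengths = [len(octet[i]) for i in (1, 5) if i < len(octet)]
    return max(lengths) if lengths else 0


def looks_interleaved(octet):
    """Guess whether records 1 and 2 are mates, based on their headers."""
    if len(octet) < 5:
        return False
    first, second = octet[0], octet[4]
    if not (first.startswith("@") and second.startswith("@")):
        return False
    # Older convention: name/1 and name/2.
    if first.endswith("/1") and second.endswith("/2"):
        return first[:-2] == second[:-2]
    # Illumina 1.8+ convention: "name 1:N:0:..." and "name 2:N:0:...".
    name1, _, comment1 = first.partition(" ")
    name2, _, comment2 = second.partition(" ")
    return (name1 == name2
            and comment1.startswith("1:")
            and comment2.startswith("2:"))
```

Because only the first eight lines are examined, the result is an estimate: a file whose later reads are longer, or whose pairing breaks further down, will not be noticed.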

Format Recognition Patterns

FASTA: Lines starting with '>' character
FASTQ: Four-line pattern with '@' headers and '+' quality separators
SAM: Tab-delimited with '@' headers or alignment records
VCF: Headers starting with "##fileformat=VCF"
GFF: Headers starting with "##gff-version"
SKETCH: Headers starting with "#SZ:" or "#SIZE:"
FASTR: Headers starting with "#FASTR" or "#FR"
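
The patterns above translate naturally into a first-lines classifier. The Python sketch below is a simplified illustration, not the BBTools logic: classify_lines is a hypothetical name, the order of the checks is an assumption, and the SAM rules (a two-letter header tag followed by a tab, or at least 11 tab-separated fields) follow the SAM specification rather than anything stated in this document.

```python
SAM_HEADER_TAGS = {"@HD", "@SQ", "@RG", "@PG", "@CO"}


def classify_lines(lines):
    """Guess a format name from the first few lines of a text file."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith("##fileformat=VCF"):
        return "vcf"
    if first.startswith("##gff-version"):
        return "gff"
    if first.startswith(("#SZ:", "#SIZE:")):
        return "sketch"
    if first.startswith(("#FASTR", "#FR")):
        return "fastr"
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        # FASTQ: header, sequence, then a '+' separator on the third line.
        if len(lines) >= 3 and lines[2].startswith("+"):
            return "fastq"
        # SAM header: a known two-letter tag followed by a tab.
        if first[:3] in SAM_HEADER_TAGS and first[3:4] == "\t":
            return "sam"
        return "unknown"
    # Headerless SAM: alignment records have 11 mandatory fields.
    if first.count("\t") >= 10:
        return "sam"
    return "unknown"
```

The '@' case shows why content checks need more than one line: both FASTQ and SAM can start with '@', so the third line is used to tell them apart.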

Quality Score Detection

For FASTQ files, TestFormat analyzes the quality score distribution to distinguish between encoding schemes:

Sanger (ASCII-33): Quality scores 0-93, ASCII range 33-126
Illumina 1.3+ (ASCII-64): Quality scores 0-62, ASCII range 64-126
Solexa (ASCII-64): Historical format, quality scores -5 to 62, ASCII range 59-126

Detection uses stream.FASTQ.testQuality(), which counts character frequencies in the ASCII ranges 33-126 and 64-126 to determine the encoding offset.
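
The idea behind offset detection can be shown with a small Python sketch. It is not the rule used by stream.FASTQ.testQuality(): guess_quality_offset is a hypothetical name, and the thresholds are assumptions. The only certain rule is that any character below ASCII 59 rules out the ASCII-64 encodings; treating characters above 'J' (ASCII 74) as evidence for ASCII-64, and defaulting to 33 otherwise, are heuristics.

```python
def guess_quality_offset(quality_lines):
    """Return 33 or 64 as the likely quality offset for FASTQ quality strings."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters to inspect")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("characters outside the printable quality range")
    if lowest < 59:
        # Below ';' is impossible for Solexa/Illumina ASCII-64 data.
        return 33
    if highest > 74:
        # Heuristic: Sanger-encoded short reads rarely exceed 'J' (Q41).
        return 64
    # Ambiguous range: fall back to the modern default.
    return 33
```

With only eight lines of input the answer can be ambiguous, for example when every quality character falls between ';' and 'J'. In that case the sketch falls back to 33, and a real tool may choose differently.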

Barcode Detection

TestFormat uses the Read.headerToBarcode() and barcodeDelimiter() methods to parse barcodes from FASTQ headers:

Single Barcodes: Read.headerToBarcode() extracts barcode from FASTQ header line
Dual Barcodes: barcodeDelimiter() scans for non-letter characters separating barcode segments
Barcode Length: Calculates bcLen1 and bcLen2 using string.length() and indexOf()
Delimiter Recognition: countLetters() method identifies single non-letter separator characters
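
The barcode steps above can be illustrated with a short Python sketch. It is not the BBTools code: extract_barcode and barcode_layout are hypothetical names, and the sketch assumes an Illumina-style header in which the barcode is the text after the last colon.

```python
def extract_barcode(header):
    """Return the text after the last ':' of a FASTQ header, or None."""
    idx = header.rfind(":")
    if idx < 0:
        return None
    barcode = header[idx + 1:].strip()
    # Require at least one letter so sample numbers are not mistaken for barcodes.
    if not any(ch.isalpha() for ch in barcode):
        return None
    return barcode


def barcode_layout(barcode):
    """Return (length1, length2, delimiter) for a single or dual barcode."""
    non_letters = [i for i, ch in enumerate(barcode) if not ch.isalpha()]
    if not non_letters:
        # Single barcode: no delimiter, no second segment.
        return len(barcode), 0, None
    if len(non_letters) == 1:
        i = non_letters[0]
        return i, len(barcode) - i - 1, barcode[i]
    raise ValueError("more than one non-letter character in barcode")
```

For a header ending in 1:N:0:ACGTAC+TTGACA, extract_barcode returns "ACGTAC+TTGACA" and barcode_layout reports two segments of length 6 separated by "+".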

Performance Characteristics

Memory Usage: Runs with a default heap size of -Xmx120m and allocates only an ArrayList holding up to 8 lines
Speed: getFirstOctet() reads at most 8 lines via BufferedReader, so I/O is minimal
Validation: Three detection layers: extension via testFormat(), magic numbers via ReadWrite.getInputStream(), and content via getFirstOctet()
Error Handling: IOExceptions are caught and reported with printStackTrace(); processing continues after file read errors
File Processing: Files are processed sequentially in command-line order, with no parallel processing
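
The error-handling and sequential-processing behaviour described above follows a familiar pattern, sketched here in Python. The names analyze_all and analyze are hypothetical; the sketch only shows the control flow (report the error, keep going), not any BBTools API.

```python
import traceback


def analyze_all(paths, analyze):
    """Apply `analyze` to each path in order, skipping files that fail to read."""
    results = {}
    for path in paths:
        try:
            results[path] = analyze(path)
        except OSError:
            # Report the failure and continue with the next file.
            traceback.print_exc()
    return results
```

One unreadable file therefore costs a stack trace on standard error but does not stop the remaining files from being analyzed.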

Stream Support

TestFormat uses FileFormat constants and the isStdin() method to detect streams:

Standard Input: isStdin() checks for "stdin", "standardin" strings, sets type=STDIO
Standard Output: Results printed via System.out.print() and System.out.println()
File Streams: ReadWrite.getInputStream() with File.exists() checks for direct file access
/dev/null: String comparison "/dev/null".equalsIgnoreCase() sets type=DEVNULL
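
The name-based checks above amount to a small classification step, sketched here in Python. classify_target is a hypothetical name; the returned strings mirror the constants mentioned above, and accepting names such as stdin.fq (a stream name with a format extension) is an assumption of this sketch.

```python
def classify_target(name):
    """Classify an input name as "DEVNULL", "STDIO", or "FILE"."""
    lowered = name.lower()
    if lowered == "/dev/null":
        return "DEVNULL"
    # Compare only the part before the first dot, so "stdin.fq" counts as stdin.
    base = lowered.split(".", 1)[0]
    if base in ("stdin", "standardin"):
        return "STDIO"
    return "FILE"
```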

Use Cases

Pipeline Validation: Verify input file formats before processing
Quality Assessment: Determine quality encoding for downstream tools
Batch Processing: Automatically detect formats for mixed file collections
Data Migration: Confirm that transferred files still have the expected format and compression
Tool Selection: Choose appropriate tools based on detected formats
Compression Analysis: Report how each file is currently compressed before choosing a compression strategy
Interleaving Detection: Determine if paired-end data needs de-interleaving
Read Length Estimation: Estimate read length from the first records without processing the full file
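
As an illustration of the batch-processing use case, the Python sketch below groups files by detected format. It assumes the output lines have already been captured per file and that the format is the second tab-delimited field, as in the examples above. group_by_format is a hypothetical helper, and the three-field FASTA line in the usage note is an assumed shape, not verified output.

```python
from collections import defaultdict


def group_by_format(report_lines):
    """Map each detected format to the list of files reported with it.

    `report_lines` maps a file path to its tab-delimited output line.
    """
    groups = defaultdict(list)
    for path, line in report_lines.items():
        fields = line.rstrip("\n").split("\t")
        # Field 2 is the file format; fall back to "unknown" on short lines.
        file_format = fields[1] if len(fields) > 1 else "unknown"
        groups[file_format].append(path)
    return dict(groups)
```

Given lines for a.fq and b.fq reported as fastq and c.fa reported as fasta, the result is {"fastq": ["a.fq", "b.fq"], "fasta": ["c.fa"]}, which a pipeline can use to route each group to the appropriate tool.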

Related Tools

testformat2.sh: Extended version with additional format testing capabilities
stats.sh: Comprehensive sequence statistics including format validation
reformat.sh: Format conversion tool that uses similar detection logic

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org