TestFormat
Tests file extensions and contents to determine format, quality, compression, interleaving, and read length. More than one file may be specified. Note that ASCII-33 (sanger) and ASCII-64 (old Illumina/Solexa) cannot always be differentiated.
Basic Usage
testformat.sh <file1> [file2] [file3] ...
Analyzes one or more files to determine their format characteristics. Files can be specified as command-line arguments or using the 'in' parameter.
Parameters
TESTFORMAT accepts both positional arguments and named parameters for format detection and analysis control.
Input Parameters
- in=file: Specify input file(s) for format analysis. Multiple files can be specified using in1=file1, in2=file2, etc. Can also be specified as positional arguments.
Analysis Parameters
- verbose=false: Enable verbose output for detailed format detection information. Shows internal processing steps and diagnostic information.
- full=false: Run full TestFormat analysis. When enabled, delegates to jgi.TestFormat for comprehensive format testing instead of the lightweight format detection.
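As a rough illustration of how the positional and named forms fit together, the following Python sketch assembles a command line from a list of files. The helper name build_testformat_command and its keyword arguments are hypothetical and not part of BBTools; only the verbose, full, and in1/in2 parameters come from the descriptions above.

```python
from typing import List, Sequence


def build_testformat_command(
    files: Sequence[str],
    verbose: bool = False,
    full: bool = False,
    named: bool = False,
) -> List[str]:
    """Return an argv list for testformat.sh (hypothetical helper)."""
    if not files:
        raise ValueError("at least one input file is required")
    cmd = ["testformat.sh"]
    # Only emit flags that differ from the documented defaults (both false).
    if verbose:
        cmd.append("verbose=true")
    if full:
        cmd.append("full=true")
    if named:
        # Named form: in1=file1, in2=file2, ...
        cmd.extend(f"in{i}={path}" for i, path in enumerate(files, start=1))
    else:
        # Positional form: file names are passed through unchanged.
        cmd.extend(files)
    return cmd


print(" ".join(build_testformat_command(["reads.fq", "assembly.fa"], verbose=True, named=True)))
# prints "testformat.sh verbose=true in1=reads.fq in2=assembly.fa"
```

The returned list can be passed to subprocess.run() on a system where testformat.sh is on the PATH.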
Output Format
TESTFORMAT outputs format information in tab-delimited format with the following fields:
- Quality Encoding: "sanger" (ASCII-33), "illumina" (ASCII-64), or numeric offset
- File Format: Format type (fastq, fasta, sam, bam, vcf, etc.)
- Compression: Compression method (raw, gz, bz2, zip, xz, etc.)
- Interleaving: "interleaved" or "single-ended" for sequence files
- Read Length: Read length in base pairs, estimated from the first reads in the file rather than averaged over all reads (for sequence files)
- Extension Notes: Indicates if file extension differs from detected content
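For scripted use, one output line can be split back into these fields. The Python sketch below assumes a sequence-file line carries the first five fields in the order listed, separated by tabs, with the read length written like "150bp". The FormatReport class and parse_report_line function are hypothetical names, and the real output layout may differ, especially for non-sequence files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FormatReport:
    quality: str
    file_format: str
    compression: str
    interleaving: str
    read_length: Optional[int]
    notes: List[str]


def parse_report_line(line: str) -> FormatReport:
    """Parse one tab-delimited report line (assumed layout, see lead-in)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields, got {len(fields)}")
    # Accept both "150bp" and a bare "150"; anything else becomes None.
    length_text = fields[4][:-2] if fields[4].endswith("bp") else fields[4]
    read_length = int(length_text) if length_text.isdigit() else None
    # Any trailing fields are kept verbatim as extension notes.
    return FormatReport(fields[0], fields[1], fields[2], fields[3], read_length, fields[5:])


report = parse_report_line("sanger\tfastq\traw\tsingle-ended\t150bp\n")
print(report.file_format, report.read_length)  # prints "fastq 150"
```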
Examples
Basic Format Detection
# Test a single FASTQ file
testformat.sh reads.fq
# Output: sanger fastq raw single-ended 150bp
# Test multiple files
testformat.sh reads.fq assembly.fa aligned.sam
# Outputs format info for each file on separate lines
Compressed File Detection
# Test compressed files
testformat.sh reads.fq.gz assembly.fa.bz2
# Output shows compression type: gz, bz2, etc.
Using Named Parameters
# Verbose analysis of multiple files
testformat.sh verbose=true in1=reads.fq in2=assembly.fa
# Full comprehensive analysis
testformat.sh full=true reads.fastq
Interleaved File Detection
# Detect interleaved paired-end files
testformat.sh paired_reads.fq
# Output: sanger fastq raw interleaved 150bp
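One plausible way to recognize interleaving from read names is to compare the headers of the first two records. The Python sketch below is an assumption-laden illustration, not the logic of stream.FASTQ.testInterleaved(): it only understands the "/1" and "/2" suffixes and the " 1:" and " 2:" comment style, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def split_pair_tag(header: str) -> Tuple[str, str]:
    """Split a FASTQ header into (base name, pair number or '')."""
    name = header[1:] if header.startswith("@") else header
    # Older style: "name/1" and "name/2".
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], name[-1]
    # Newer Illumina style: "name 1:N:0:ACGT" and "name 2:N:0:ACGT".
    base, sep, comment = name.partition(" ")
    if sep and comment[:2] in ("1:", "2:"):
        return base, comment[0]
    return name, ""


def looks_interleaved(first_lines: Sequence[str]) -> bool:
    """Guess interleaving from the headers at zero-based lines 0 and 4."""
    if len(first_lines) < 5:
        return False
    base1, tag1 = split_pair_tag(first_lines[0])
    base2, tag2 = split_pair_tag(first_lines[4])
    return base1 == base2 and tag1 == "1" and tag2 == "2"


paired = ["@r1/1", "ACGT", "+", "IIII", "@r1/2", "TTGA", "+", "IIII"]
print(looks_interleaved(paired))  # prints "True"
```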
Supported File Formats
TESTFORMAT can detect and analyze the following file formats:
Sequence Formats
- FASTA: .fa, .fasta, .fas, .fna, .ffn, .frn, .seq, .fsa, .faa (amino acids), .prot
- FASTQ: .fq, .fastq (with quality score detection)
- FASTR: .fastr, .fr (BBTools flat format)
- ONELINE: .oneline, .flat (single-line format)
- BREAD: .bread (BBTools binary format)
- CSFASTA: .csfasta (colorspace)
- SCARF: .scarf (Solexa/Illumina format)
Alignment Formats
- SAM: .sam (Sequence Alignment/Map)
- BAM: .bam (Binary Alignment/Map)
Variant and Annotation Formats
- VCF: .vcf (Variant Call Format)
- VAR: .var (BBTools variant format)
- GFF: .gff, .gff3 (General Feature Format)
- BED: .bed (Browser Extensible Data)
Specialized Formats
- SKETCH: .sketch (BBSketch format)
- PGM: .pgm, .pkm (Prokaryotic Gene Model)
- PHYLIP: .phylip (phylogenetic analysis)
- EMBL: .embl (European Molecular Biology Laboratory)
- GENBANK: .gbk, .gbff (GenBank formats)
- BBNET: .bbnet (BBTools network format)
- BBVEC: .bbvec, .vec (BBTools vector format)
- CLADE: .clade, .spectra (taxonomic classification)
Compression Formats
- GZIP: .gz (most common for bioinformatics)
- BZIP2: .bz2 (better compression, slower)
- XZ: .xz (LZMA compression)
- ZIP: .zip (standard archive format)
- 7-Zip: .7z (high compression)
- ZSTD: .zst (fast modern compression)
- FQZ: .fqz (FASTQ-specific compression)
- DSRC: .dsrc (DNA sequence compression)
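Because a file name can carry both a format extension and a compression extension (reads.fq.gz), extension analysis has to peel them apart. The Python sketch below shows one way to do that under simple assumptions. The lookup tables cover only a subset of the extensions listed above, and split_extensions is a hypothetical helper, not ReadWrite.rawExtension() or ReadWrite.compressionType().

```python
from typing import Tuple

COMPRESSION_EXTENSIONS = {"gz", "bz2", "xz", "zip", "7z", "zst", "fqz", "dsrc"}

# Deliberately partial: a few of the extensions from the lists above.
FORMAT_BY_EXTENSION = {
    "fa": "fasta", "fasta": "fasta", "fas": "fasta", "fna": "fasta",
    "fq": "fastq", "fastq": "fastq",
    "sam": "sam", "bam": "bam", "vcf": "vcf",
    "gff": "gff", "gff3": "gff", "bed": "bed",
}


def split_extensions(path: str) -> Tuple[str, str]:
    """Return (format, compression) guessed from the file name alone."""
    name = path.rsplit("/", 1)[-1].lower()
    parts = name.split(".")
    compression = "raw"
    # Strip a trailing compression extension first, if there is one.
    if len(parts) > 1 and parts[-1] in COMPRESSION_EXTENSIONS:
        compression = parts.pop()
    # Whatever extension remains is treated as the format extension.
    raw_extension = parts[-1] if len(parts) > 1 else ""
    return FORMAT_BY_EXTENSION.get(raw_extension, "unknown"), compression


print(split_extensions("reads.fq.gz"))  # prints "('fastq', 'gz')"
```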
Algorithm Details
TESTFORMAT employs a multi-stage detection strategy combining file extension analysis with content inspection:
Detection Strategy
- Extension Analysis: Uses testFormat() method to map file extensions via ReadWrite.rawExtension() and ReadWrite.compressionType()
- Magic Number Detection: ReadWrite.getInputStream() checks file headers for compression signatures using BufferedReader
- Content Analysis: getFirstOctet() reads up to 8 lines using BufferedReader for format detection
- Quality Encoding Detection: stream.FASTQ.testQuality() analyzes quality character distribution and ASCII ranges
- Interleaving Detection: stream.FASTQ.testInterleaved() examines read name patterns and pairing consistency
- Read Length Calculation: testInterleavedAndQuality() takes the sequence lengths at lines 1 and 5 (zero-based, i.e. the sequence lines of the first two FASTQ records) and combines them with Tools.max()
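The magic-number step can be illustrated independently of BBTools. The Python sketch below compares the leading bytes of a file against the published signatures of several compression formats. The table and the sniff_compression name are illustrative assumptions and say nothing about how ReadWrite.getInputStream() is actually written.

```python
import gzip

# (signature, label) pairs taken from the respective format specifications.
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "gz"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"\x28\xb5\x2f\xfd", "zst"),
]


def sniff_compression(header: bytes) -> str:
    """Return a compression label for the first bytes of a file, or 'raw'."""
    for magic, label in MAGIC_NUMBERS:
        if header.startswith(magic):
            return label
    return "raw"


print(sniff_compression(gzip.compress(b">seq\nACGT\n")))  # prints "gz"
print(sniff_compression(b">seq\nACGT\n"))  # prints "raw"
```

A check like this is what lets an "Extension Notes" style warning be raised when, for example, a file named reads.fq actually starts with gzip bytes.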
Format Recognition Patterns
- FASTA: Lines starting with '>' character
- FASTQ: Four-line pattern with '@' headers and '+' quality separators
- SAM: Tab-delimited with '@' headers or alignment records
- VCF: Headers starting with "##fileformat=VCF"
- GFF: Headers starting with "##gff-version"
- SKETCH: Headers starting with "#SZ:" or "#SIZE:"
- FASTR: Headers starting with "#FASTR" or "#FR"
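The patterns above translate fairly directly into a prefix check on the first lines of a file. The following Python sketch is a simplified illustration with a hypothetical guess_format function. It ignores headerless SAM files and any other formats, and the order of the checks is an assumption rather than the tool's documented behavior.

```python
from typing import Sequence


def guess_format(lines: Sequence[str]) -> str:
    """Guess a format label from the first few lines of text."""
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith("##fileformat=VCF"):
        return "vcf"
    if first.startswith("##gff-version"):
        return "gff"
    if first.startswith(("#SZ:", "#SIZE:")):
        return "sketch"
    if first.startswith(("#FASTR", "#FR")):
        return "fastr"
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        # '@' is ambiguous: FASTQ record headers and SAM header lines share it.
        if len(lines) > 2 and lines[2].startswith("+"):
            return "fastq"
        if "\t" in first:
            return "sam"
    return "unknown"


print(guess_format(["@r1", "ACGT", "+", "IIII"]))  # prints "fastq"
print(guess_format(["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:1000"]))  # prints "sam"
```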
Quality Score Detection
For FASTQ files, TESTFORMAT analyzes the quality score distribution to distinguish between encoding schemes:
- Sanger (ASCII-33): Quality scores 0-93, ASCII range 33-126
- Illumina 1.3+ (ASCII-64): Quality scores 0-62, ASCII range 64-126
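- Solexa (ASCII-64): Historical format, quality scores -5 to 62, ASCII range 59-126
Detection uses stream.FASTQ.testQuality(), which counts character frequencies in the ASCII ranges 33-126 and 64-126 to determine the encoding offset. Because characters 64-126 are valid under both encodings, a file whose quality values all fall in that range cannot be assigned an offset with certainty.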
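A much simpler stand-in for this kind of detection looks only at the lowest quality character seen. The Python sketch below is not the frequency-counting approach of stream.FASTQ.testQuality(); it is a minimal heuristic with a hypothetical function name, and it returns a flag showing whether the answer is certain or only a guess.

```python
from typing import Iterable, Tuple


def guess_quality_offset(quality_lines: Iterable[str]) -> Tuple[int, bool]:
    """Return (offset, certain) for a collection of FASTQ quality strings."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("characters outside the printable range 33-126")
    # Codes below 59 (';') are impossible under ASCII-64, even for
    # Solexa scores of -5, so the data must be ASCII-33.
    if lowest < 59:
        return 33, True
    # Everything seen is valid under both offsets, so 64 is only a guess.
    return 64, False


print(guess_quality_offset(["IIII#5"]))  # prints "(33, True)"
print(guess_quality_offset(["hhhhggff"]))  # prints "(64, False)"
```

The second call shows the ambiguity: an ASCII-33 file containing only scores of 31 or higher would produce exactly the same characters.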
Barcode Detection
TESTFORMAT uses Read.headerToBarcode() and barcodeDelimiter() methods to parse FASTQ headers:
- Single Barcodes: Read.headerToBarcode() extracts barcode from FASTQ header line
- Dual Barcodes: barcodeDelimiter() scans for non-letter characters separating barcode segments
- Barcode Length: Calculates bcLen1 and bcLen2 using string.length() and indexOf()
- Delimiter Recognition: countLetters() method identifies single non-letter separator characters
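The barcode steps above can be approximated in a few lines. This Python sketch assumes an Illumina-style header in which the barcode is the text after the last colon. The three functions are hypothetical stand-ins that mirror the roles of Read.headerToBarcode(), barcodeDelimiter(), and the bcLen1/bcLen2 calculation, not their actual implementations.

```python
from typing import Optional, Tuple


def header_to_barcode(header: str) -> Optional[str]:
    """Return the text after the last ':' in the header, or None."""
    _, sep, barcode = header.rstrip().rpartition(":")
    return barcode if sep and barcode else None


def barcode_delimiter(barcode: str) -> Optional[str]:
    """Return the single interior non-letter character, or None."""
    positions = [i for i, ch in enumerate(barcode) if not ch.isalpha()]
    if len(positions) == 1 and 0 < positions[0] < len(barcode) - 1:
        return barcode[positions[0]]
    return None


def barcode_lengths(barcode: str) -> Tuple[int, int]:
    """Return (length of first barcode, length of second barcode or 0)."""
    delimiter = barcode_delimiter(barcode)
    if delimiter is None:
        return len(barcode), 0
    cut = barcode.index(delimiter)
    return cut, len(barcode) - cut - 1


header = "@M01:7:000-ABC:1:1101:15589:1332 1:N:0:ACGTACGT+TGCATGCA"
barcode = header_to_barcode(header)
print(barcode, barcode_delimiter(barcode), barcode_lengths(barcode))
# prints "ACGTACGT+TGCATGCA + (8, 8)"
```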
Performance Characteristics
- Memory Usage: Runs with a default heap size of -Xmx120m; the only notable allocation is an ArrayList holding up to 8 lines
- Speed: getFirstOctet() reads at most 8 lines via BufferedReader, so I/O stays minimal regardless of file size
- Validation: Three detection layers: extension via testFormat(), magic numbers via ReadWrite.getInputStream(), and content via getFirstOctet()
- Error Handling: IOExceptions are caught and reported with printStackTrace(), and processing continues after file read errors
- File Processing: Files are processed sequentially in command-line order, with no parallel processing
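The bounded read is what keeps the cost independent of file size. The Python sketch below imitates the idea with itertools.islice and then takes the longer of the two sequence lines, following the description of lines 1 and 5 given earlier. The function names are hypothetical and the code is an illustration, not a port of getFirstOctet().

```python
import io
from itertools import islice
from typing import List, TextIO


def first_octet(stream: TextIO, limit: int = 8) -> List[str]:
    """Read at most `limit` lines and leave the rest of the stream unread."""
    return [line.rstrip("\r\n") for line in islice(stream, limit)]


def estimate_read_length(lines: List[str]) -> int:
    """Longer of the lines at zero-based indices 1 and 5, if present."""
    lengths = [len(lines[i]) for i in (1, 5) if i < len(lines)]
    return max(lengths, default=0)


# Three FASTQ records (12 lines); only the first 8 lines are consumed.
fastq = "@r1\nACGTAC\n+\nIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nAC\n+\nII\n"
lines = first_octet(io.StringIO(fastq))
print(len(lines), estimate_read_length(lines))  # prints "8 8"
```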
Stream Support
TESTFORMAT uses FileFormat constants and isStdin() method for stream detection:
- Standard Input: isStdin() recognizes the names "stdin" and "standardin" and sets type=STDIO
- Standard Output: Results are printed via System.out.print() and System.out.println()
- File Streams: ReadWrite.getInputStream() opens files directly, with File.exists() checks
- /dev/null: A case-insensitive comparison against "/dev/null" sets type=DEVNULL
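The name-based classification can be sketched as follows. This Python example rests on assumptions: it treats a name such as stdin.fq (a stream name with a format extension) as standard input, and the "FILE" label for everything else is a placeholder. The classify_source function is hypothetical and does not reproduce isStdin() or the FileFormat constants.

```python
def classify_source(name: str) -> str:
    """Classify an input name as DEVNULL, STDIO, or FILE (assumed labels)."""
    lowered = name.lower()
    # Case-insensitive match, mirroring the equalsIgnoreCase() comparison.
    if lowered == "/dev/null":
        return "DEVNULL"
    # Assumption: an extension after the stream name is allowed, e.g. stdin.fq.
    stem = lowered.split(".", 1)[0]
    if stem in ("stdin", "standardin"):
        return "STDIO"
    return "FILE"


for source in ("stdin", "stdin.fq", "/dev/null", "reads.fq"):
    print(source, classify_source(source))
```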
Use Cases
- Pipeline Validation: Verify input file formats before processing
- Quality Assessment: Determine quality encoding for downstream tools
- Batch Processing: Automatically detect formats for mixed file collections
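- Data Migration: Confirm that files still report the expected format and compression after a transfer (only the first lines are inspected, so this is not a full integrity check)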
- Tool Selection: Choose appropriate tools based on detected formats
- Compression Analysis: Identify which compression method each file currently uses
- Interleaving Detection: Determine if paired-end data needs de-interleaving
- Read Length Estimation: Estimate read length without processing the whole file
Related Tools
- testformat2.sh: Extended version with additional format testing capabilities
- stats.sh: Comprehensive sequence statistics including format validation
- reformat.sh: Format conversion tool that uses similar detection logic
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org