TestFormat

Script: testformat.sh
Package: fileIO
Class: FileFormat.java

Tests file extensions and contents to determine format, quality, compression, interleaving, and read length. More than one file may be specified. Note that ASCII-33 (sanger) and ASCII-64 (old Illumina/Solexa) cannot always be differentiated.

Basic Usage

testformat.sh <file1> [file2] [file3] ...

Analyzes one or more files to determine their format characteristics. Files can be specified as command-line arguments or using the 'in' parameter.

Parameters

TESTFORMAT accepts both positional arguments and named parameters for format detection and analysis control.

Input Parameters

in=file
Specify input file(s) for format analysis. Multiple files can be specified using in1=file1, in2=file2, etc. Can also be specified as positional arguments.

Analysis Parameters

verbose=false
Enable verbose output for detailed format detection information. Shows internal processing steps and diagnostic information.
full=false
Run full TestFormat analysis. When enabled, delegates to jgi.TestFormat for comprehensive format testing instead of the lightweight format detection.

Output Format

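TESTFORMAT prints one tab-delimited line per file. Judging by the example output in this document, the fields are, in order: quality encoding, file format, compression, interleaving, and read length.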
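
As a rough illustration, a result line can be split into named fields with a few lines of Python. This sketch assumes the five-column layout seen in this document's example lines (quality encoding, format, compression, interleaving, read length). FormatReport and parse_report are hypothetical names, not part of BBTools, and other file types may print a different set of columns.

```python
from typing import NamedTuple


class FormatReport(NamedTuple):
    """One parsed result line (field names are illustrative)."""
    quality: str       # e.g. "sanger"
    file_format: str   # e.g. "fastq"
    compression: str   # e.g. "raw" or "gz"
    interleaving: str  # e.g. "single-ended" or "interleaved"
    read_length: int   # parsed from text such as "150bp"


def parse_report(line: str) -> FormatReport:
    """Split one tab-delimited line into five named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    length_text = fields[4]
    if length_text.endswith("bp"):
        length_text = length_text[:-2]  # drop the "bp" unit suffix
    return FormatReport(fields[0], fields[1], fields[2], fields[3],
                        int(length_text))


if __name__ == "__main__":
    report = parse_report("sanger\tfastq\traw\tsingle-ended\t150bp")
    print(report.file_format, report.read_length)
```

Raising on an unexpected column count is a deliberate choice here: it surfaces layout differences instead of silently mislabeling fields.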

Examples

Basic Format Detection

# Test a single FASTQ file
testformat.sh reads.fq
# Output: sanger	fastq	raw	single-ended	150bp

# Test multiple files
testformat.sh reads.fq assembly.fa aligned.sam
# Outputs format info for each file on separate lines

Compressed File Detection

# Test compressed files
testformat.sh reads.fq.gz assembly.fa.bz2
# Output shows compression type: gz, bz2, etc.

Using Named Parameters

# Verbose analysis of multiple files
testformat.sh verbose=true in1=reads.fq in2=assembly.fa

# Full comprehensive analysis
testformat.sh full=true reads.fastq

Interleaved File Detection

# Detect interleaved paired-end files
testformat.sh paired_reads.fq
# Output: sanger	fastq	raw	interleaved	150bp

Supported File Formats

TESTFORMAT can detect and analyze the following file formats:

Sequence Formats

  • FASTA: .fa, .fasta, .fas, .fna, .ffn, .frn, .seq, .fsa, .faa (amino acids), .prot
  • FASTQ: .fq, .fastq (with quality score detection)
  • FASTR: .fastr, .fr (BBTools flat format)
  • ONELINE: .oneline, .flat (single-line format)
  • BREAD: .bread (BBTools binary format)
  • CSFASTA: .csfasta (colorspace)
  • SCARF: .scarf (Solexa/Illumina format)

Alignment Formats

  • SAM: .sam (Sequence Alignment/Map)
  • BAM: .bam (Binary Alignment/Map)

Variant and Annotation Formats

  • VCF: .vcf (Variant Call Format)
  • VAR: .var (BBTools variant format)
  • GFF: .gff, .gff3 (General Feature Format)
  • BED: .bed (Browser Extensible Data)

Specialized Formats

  • SKETCH: .sketch (BBSketch format)
  • PGM: .pgm, .pkm (Prokaryotic Gene Model)
  • PHYLIP: .phylip (phylogenetic analysis)
  • EMBL: .embl (European Molecular Biology Laboratory)
  • GENBANK: .gbk, .gbff (GenBank formats)
  • BBNET: .bbnet (BBTools network format)
  • BBVEC: .bbvec, .vec (BBTools vector format)
  • CLADE: .clade, .spectra (taxonomic classification)

Compression Formats

  • GZIP: .gz (most common for bioinformatics)
  • BZIP2: .bz2 (better compression, slower)
  • XZ: .xz (LZMA compression)
  • ZIP: .zip (standard archive format)
  • 7-Zip: .7z (high compression)
  • ZSTD: .zst (fast modern compression)
  • FQZ: .fqz (FASTQ-specific compression)
  • DSRC: .dsrc (DNA sequence compression)
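
The extension tables above suggest a two-step lookup: peel off a compression suffix first, then map what remains to a format. The Python sketch below illustrates that idea only; it is not the ReadWrite.rawExtension() or ReadWrite.compressionType() code. The tables cover a small subset of the listed extensions, classify_by_extension is a hypothetical name, and the "raw" label for uncompressed files is borrowed from the example output.

```python
import os

# Assumption: a subset of the extensions listed in this document.
COMPRESSION_EXTS = {"gz", "bz2", "xz", "zip", "7z", "zst", "fqz", "dsrc"}
FORMAT_BY_EXT = {
    "fa": "fasta", "fasta": "fasta", "fna": "fasta",
    "fq": "fastq", "fastq": "fastq",
    "sam": "sam", "bam": "bam", "vcf": "vcf",
    "gff": "gff", "gff3": "gff", "bed": "bed",
}


def classify_by_extension(path):
    """Return (format, compression) guessed from the file name alone."""
    parts = os.path.basename(path).lower().split(".")
    compression = "raw"
    # Step 1: remove a trailing compression suffix, if there is one.
    if len(parts) > 1 and parts[-1] in COMPRESSION_EXTS:
        compression = parts.pop()
    # Step 2: map the remaining (raw) extension to a format name.
    ext = parts[-1] if len(parts) > 1 else ""
    return FORMAT_BY_EXT.get(ext, "unknown"), compression


if __name__ == "__main__":
    for name in ("reads.fq.gz", "assembly.fa.bz2", "aligned.sam"):
        print(name, classify_by_extension(name))
```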

Algorithm Details

TESTFORMAT employs a multi-stage detection strategy combining file extension analysis with content inspection:

Detection Strategy

  • Extension Analysis: Uses testFormat() method to map file extensions via ReadWrite.rawExtension() and ReadWrite.compressionType()
  • Magic Number Detection: ReadWrite.getInputStream() checks file headers for compression signatures using BufferedReader
  • Content Analysis: getFirstOctet() method reads exactly 8 lines using BufferedReader for format detection
  • Quality Encoding Detection: stream.FASTQ.testQuality() analyzes quality character distribution and ASCII ranges
  • Interleaving Detection: stream.FASTQ.testInterleaved() examines read name patterns and pairing consistency
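  • Read Length Calculation: testInterleavedAndQuality() measures sequence length from lines 1 and 5 of the sampled lines (apparently 0-based, which would make them the sequence lines of the first two FASTQ records) and keeps the larger value via Tools.max()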
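
The content-inspection steps in this list can be sketched in Python as follows. This is an illustration under stated assumptions, not the FileFormat.java logic: lines 1 and 5 are read as 0-based indices, and the interleaving check is a deliberately simple stand-in that only recognizes names ending in /1 and /2. All function names are hypothetical.

```python
import io


def first_octet(handle, limit=8):
    """Return up to `limit` lines from a text handle, newlines stripped."""
    lines = []
    for line in handle:
        lines.append(line.rstrip("\r\n"))
        if len(lines) >= limit:
            break
    return lines


def read_length(octet):
    """Longer of the first two sequence lines (0-based lines 1 and 5)."""
    lengths = [len(octet[i]) for i in (1, 5) if i < len(octet)]
    return max(lengths, default=0)


def looks_interleaved(octet):
    """Assumed heuristic: records 1 and 2 share a name ending /1 and /2."""
    if len(octet) < 5:
        return False
    name1 = octet[0].partition(" ")[0]
    name2 = octet[4].partition(" ")[0]
    return (name1.endswith("/1") and name2.endswith("/2")
            and name1[:-2] == name2[:-2])


if __name__ == "__main__":
    sample = io.StringIO(
        "@r1/1\nACGTACGT\n+\nIIIIIIII\n"
        "@r1/2\nTTGCA\n+\nIIIII\n"
        "@r2/1\nAC\n+\nII\n"
    )
    octet = first_octet(sample)
    print(len(octet), read_length(octet), looks_interleaved(octet))
```

Sampling a fixed number of lines keeps the cost independent of file size, at the price of judging the whole file by its first two records.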

Format Recognition Patterns

  • FASTA: Lines starting with '>' character
  • FASTQ: Four-line pattern with '@' headers and '+' quality separators
  • SAM: Tab-delimited with '@' headers or alignment records
  • VCF: Headers starting with "##fileformat=VCF"
  • GFF: Headers starting with "##gff-version"
  • SKETCH: Headers starting with "#SZ:" or "#SIZE:"
  • FASTR: Headers starting with "#FASTR" or "#FR"
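
The patterns above translate naturally into a chain of prefix checks. The Python sketch below is one possible ordering, not the tool's actual decision logic. The rule used to separate FASTQ from SAM (a '+' on the third line versus a standard SAM header tag followed by a tab) and the 11-column test for alignment records are assumptions based on the public FASTQ and SAM conventions.

```python
# Standard SAM header record types; each is followed by a tab in a SAM file.
SAM_HEADER_TAGS = ("@HD", "@SQ", "@RG", "@PG", "@CO")


def sniff_format(lines):
    """Guess a format name from the first few lines of a text file."""
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith("##fileformat=VCF"):
        return "vcf"
    if first.startswith("##gff-version"):
        return "gff"
    if first.startswith(("#SZ:", "#SIZE:")):
        return "sketch"
    if first.startswith(("#FASTR", "#FR")):
        return "fastr"
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        # '@' opens both FASTQ records and SAM header lines, so look further.
        if len(lines) >= 3 and lines[2].startswith("+"):
            return "fastq"
        if first[:3] in SAM_HEADER_TAGS and first[3:4] == "\t":
            return "sam"
    # A headerless SAM file starts with an alignment record,
    # which has at least 11 tab-separated columns.
    if first.count("\t") >= 10:
        return "sam"
    return "unknown"


if __name__ == "__main__":
    print(sniff_format(["@r1", "ACGT", "+", "IIII"]))
    print(sniff_format(["@HD\tVN:1.6", "@SQ\tSN:chr1\tLN:1000"]))
```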

Quality Score Detection

For FASTQ files, TESTFORMAT analyzes the quality score distribution to distinguish between encoding schemes:

  • Sanger (ASCII-33): Quality scores 0-93, ASCII range 33-126
  • Illumina 1.3+ (ASCII-64): Quality scores 0-62, ASCII range 64-126
  • Solexa (ASCII-64): Historical format, quality scores -5 to 62, ASCII range 59-126

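Detection uses stream.FASTQ.testQuality(), which counts character frequencies in ASCII ranges 33-126 vs 64-126 to determine the encoding offset. Because the two ranges overlap from 64 to 126, a file whose quality characters all fall in that shared range cannot be assigned with certainty, which is why ASCII-33 and ASCII-64 cannot always be differentiated.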
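
To make the ambiguity concrete, here is a minimal Python sketch of offset guessing. It deliberately uses a simpler rule than the frequency counting described above: it looks only at the lowest character seen. The function name and the (offset, ambiguous) return shape are hypothetical.

```python
def guess_quality_offset(quality_lines):
    """Return (offset, ambiguous) based on the lowest quality character."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None, True
    lowest = min(codes)
    if lowest < 59:
        # ASCII 59 is the lowest Solexa character (-5 + 64), so anything
        # below it cannot come from an ASCII-64 encoding.
        return 33, False
    # Every character lies in a range that both encodings can produce,
    # so ASCII-64 is only a guess.
    return 64, True


if __name__ == "__main__":
    print(guess_quality_offset(["IIII##"]))    # contains '#' (ASCII 35)
    print(guess_quality_offset(["hhhhgggf"]))  # all characters >= 64
```

A high-quality Sanger file can land in the ambiguous branch, so a caller that needs certainty should treat the second element as a warning flag rather than ignore it.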

Barcode Detection

TESTFORMAT uses Read.headerToBarcode() and barcodeDelimiter() methods to parse FASTQ headers:

  • Single Barcodes: Read.headerToBarcode() extracts barcode from FASTQ header line
  • Dual Barcodes: barcodeDelimiter() scans for non-letter characters separating barcode segments
  • Barcode Length: Calculates bcLen1 and bcLen2 using string.length() and indexOf()
  • Delimiter Recognition: countLetters() method identifies single non-letter separator characters
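
The barcode steps above can be sketched as follows. This Python example assumes an Illumina-style header in which the barcode follows the last ':' of the comment field; that layout, and the function names extract_barcode, find_delimiter, and barcode_lengths, are assumptions for illustration and do not reproduce Read.headerToBarcode() or barcodeDelimiter().

```python
def extract_barcode(header):
    """Return the text after the last ':' of the header comment, or None."""
    comment = header.partition(" ")[2]  # text after the first space
    if ":" not in comment:
        return None
    barcode = comment.rsplit(":", 1)[1]
    return barcode or None


def find_delimiter(barcode):
    """Return the separator if exactly one non-letter is present, else None."""
    non_letters = [ch for ch in barcode if not ch.isalpha()]
    return non_letters[0] if len(non_letters) == 1 else None


def barcode_lengths(barcode):
    """Return (length1, length2); length2 is 0 for a single barcode."""
    delimiter = find_delimiter(barcode)
    if delimiter is None:
        return len(barcode), 0
    split_at = barcode.index(delimiter)
    return split_at, len(barcode) - split_at - 1


if __name__ == "__main__":
    header = "@M1:45:FC:1:1101:15589:1332 1:N:0:ACGTACGT+TTGCAAGC"
    barcode = extract_barcode(header)
    print(barcode, find_delimiter(barcode), barcode_lengths(barcode))
```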

Performance Characteristics

  • Memory Usage: Defaults to a 120 MB heap (-Xmx120m); the only notable allocation is an ArrayList holding the 8 sampled lines
  • Speed: getFirstOctet() reads at most 8 lines via BufferedReader, so I/O is minimal
  • Validation: Three-layer detection: extension via testFormat(), magic numbers via ReadWrite.getInputStream(), content via getFirstOctet()
  • Error Handling: IOException is caught in try-catch blocks and reported with printStackTrace(); processing continues after file read errors
  • File Processing: Sequential file iteration through command line arguments, no parallel processing
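
The magic-number layer mentioned in the list above can be illustrated with a short Python sketch. The byte signatures are the published ones for each container format; how ReadWrite.getInputStream() actually performs its check is not shown in this document, so sniff_compression is a hypothetical stand-in, and the "raw" label is borrowed from the example output.

```python
import bz2
import gzip
import lzma

# Leading bytes ("magic numbers") of common compressed containers.
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "gz"),
    (b"BZh", "bz2"),
    (b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (b"PK\x03\x04", "zip"),
    (b"\x37\x7a\xbc\xaf\x27\x1c", "7z"),
    (b"\x28\xb5\x2f\xfd", "zst"),
]


def sniff_compression(data):
    """Return a compression label based on the first bytes of `data`."""
    for magic, label in MAGIC_NUMBERS:
        if data.startswith(magic):
            return label
    return "raw"


if __name__ == "__main__":
    payload = b">chr1\nACGT\n"
    print(sniff_compression(gzip.compress(payload)))
    print(sniff_compression(bz2.compress(payload)))
    print(sniff_compression(lzma.compress(payload)))
    print(sniff_compression(payload))
```

Checking content as well as the extension catches misnamed files, for example a gzip stream saved with a plain .fq suffix.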

Stream Support

TESTFORMAT uses FileFormat constants and the isStdin() method for stream detection:

  • Standard Input: isStdin() checks for the strings "stdin" and "standardin" and sets type=STDIO
  • Standard Output: Results printed via System.out.print() and System.out.println()
  • File Streams: ReadWrite.getInputStream() with File.exists() checks for direct file access
  • /dev/null: A case-insensitive comparison, "/dev/null".equalsIgnoreCase(), sets type=DEVNULL
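
A minimal Python sketch of this kind of source classification is shown below. The STDIO and DEVNULL labels come from the list above; the FILE and MISSING labels, the classify_source name, and the handling of names such as stdin.fq (a stdin keyword followed by an extension) are assumptions added for illustration.

```python
import os

STDIN_NAMES = ("stdin", "standardin")


def classify_source(name):
    """Label an input name as DEVNULL, STDIO, FILE, or MISSING."""
    lowered = name.lower()
    if lowered == "/dev/null":
        return "DEVNULL"
    # Assumption: "stdin.fq" style names also count as standard input,
    # with the extension hinting at the format of the piped data.
    if lowered in STDIN_NAMES or lowered.startswith(
            tuple(prefix + "." for prefix in STDIN_NAMES)):
        return "STDIO"
    return "FILE" if os.path.exists(name) else "MISSING"


if __name__ == "__main__":
    for candidate in ("/dev/null", "stdin", "stdin.fq", "no_such_file.fq"):
        print(candidate, classify_source(candidate))
```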

Use Cases

Related Tools

Support

For questions and support: