TestFormat2

Script: testformat2.sh Package: jgi Class: TestFormat.java

Reads the entire file to report extended information about its format and contents. The analyzer examines a sequence file to determine format characteristics, quality metrics, and composition statistics, and produces detailed reports for quality assessment and pipeline planning.

Basic Usage

testformat2.sh <file>

The tool accepts a single file as input and performs analysis to determine file format, compression, quality metrics, base composition, and various other characteristics important for bioinformatics pipeline planning.

Parameters

TestFormat2 organizes parameters into functional groups for analysis control and output generation. The tool provides customization options for different analysis depths and output formats.

Analysis Control Parameters

full=t
Process the full file for complete analysis. When true, performs complete file scanning to gather all statistics. Set to false for faster format-only detection.
speed=f
Print processing time and throughput statistics. When enabled, reports time elapsed and processing rates for performance monitoring.
fast=t
Use faster processing mode with reduced precision for some statistics. Trades some accuracy for significantly improved speed on large files.
slow=f
Opposite of the fast parameter. When true, enables the most comprehensive analysis with maximum precision.

Sequence-Specific Analysis

zmw=t
Parse PacBio ZMW (Zero Mode Waveguide) IDs from sequence headers. Enables ZMW-specific statistics and pass count analysis for PacBio data.
barcodelist=
Optional list of expected barcodes. May be a filename with one line per barcode, or a comma-delimited literal string. Used for barcode validation and statistics.
merge=t
Calculate mergeability statistics via the BBMerge algorithm. Attempts to merge paired reads to determine the insert size distribution and adapter contamination.
sketch=t
Alias: card. Calculate cardinality via BBSketch. If enabled, also sends the sketch to the RefSeq server for taxonomic identification. Provides genome size estimates and species identification.
trim=t
Calculate trimmability statistics from quality scores. Simulates quality-based trimming at various thresholds to predict trim rates.
sketchsize=40000
Size of sketch to generate for cardinality estimation. Larger sketches provide more accurate cardinality estimates but require more memory.
bhistlen=10k
Maximum read length for base composition histogram calculation. Reads longer than this value are excluded from bhist.txt. Set to 0 to include all reads.
maxbhistlen=10000
Alias for bhistlen parameter. Controls the maximum read length included in base composition analysis.
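
The barcodelist parameter accepts either a filename or a comma-delimited literal. The following Python sketch shows one way such a value could be resolved. The helper name parse_barcode_list is hypothetical, and this is not the tool's own parsing code.

```python
import os


def parse_barcode_list(value):
    """Return expected barcodes from a filename or a comma-delimited literal.

    Hypothetical helper, not the tool's own parser: an existing file is read
    as one barcode per line; anything else is split on commas.
    """
    if os.path.isfile(value):
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item.strip() for item in value.split(",") if item.strip()]


print(parse_barcode_list("ACGTAC,TTGCAA"))  # literal form
```

Checking for an existing file first means a literal is only used when no file of that name exists, which is an assumption about precedence rather than documented behavior.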

Output Control Parameters

printjunk=f
Print headers of junk reads to stdout. Junk reads are those with invalid bases, format errors, or other quality issues.
printbarcodes=f
Print barcode sequences and their occurrence counts to stdout. Useful for validating expected barcode lists and identifying contamination.
printqhist=f
Print quality score histogram to stdout. Shows distribution of quality scores across all bases in the file.
printihist=f
Print insert size histogram to stdout. Derived from paired-read merging analysis when merge=t is enabled.
edist=f
Calculate and report edit distances for barcode analysis. Computes minimum edit distances between observed and expected barcodes.
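
The edist option reports the minimum edit distance between observed and expected barcodes. The text does not say which distance metric is used, so the sketch below assumes Levenshtein distance (substitutions, insertions, and deletions each cost 1). Function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def min_edit_distance(observed, expected):
    """Smallest distance from one observed barcode to any expected barcode."""
    return min(edit_distance(observed, e) for e in expected)


expected = ["ACGTAC", "TTGCAA"]
for observed in ["ACGTAC", "ACGTAA"]:
    print(observed, min_edit_distance(observed, expected))
```

A distance of 0 means the observed barcode is in the expected list; a small nonzero distance usually points to a sequencing error rather than an unexpected sample.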

File Output Parameters (these can be eliminated by setting to null)

junk=junk.txt
Print headers of junk reads to this file. Contains sequence identifiers for reads that failed quality checks or format validation.
barcodes=barcodes.txt
Print barcode analysis results to this file. Includes barcode sequences, counts, and validation results against expected barcode lists.
hist=t
Master switch for histogram file generation. Setting to false will clear all default histogram file outputs (qhist, ihist, khist, bhist, lhist, gchist, zmwhist).
qhist=qhist.txt
Print quality histogram to this file. Contains quality score distributions with frequency counts for each quality value observed.
ihist=ihist.txt
Print insert size histogram to this file. Generated from paired-read overlap analysis and contains insert size frequency distribution.
khist=khist.txt
Print k-mer frequency histogram to this file. Derived from sketch analysis and shows k-mer depth distribution for genome size estimation.
bhist=bhist.txt
Print base composition histogram to this file. Contains per-read base composition statistics including GC content distribution.
lhist=lhist.txt
Print length histogram to this file. Shows read length distribution across the entire dataset.
gchist=gchist.txt
Print GC content histogram to this file. Contains detailed GC content distribution statistics for quality assessment.
zmwhist=zmwhist.txt
Print ZMW pass count histogram to this file. For PacBio data, shows distribution of how many passes each ZMW had.
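
Parameters are passed as key=value pairs, booleans are written as t or f, and a file output is disabled by setting it to null. The Python sketch below assembles such an argument list. The helper name build_testformat2_args is hypothetical and nothing is executed.

```python
def build_testformat2_args(path, **options):
    """Build an argument list for testformat2.sh (hypothetical helper).

    True/False become t/f, and None becomes null, which disables a file
    output as described above. The command is only assembled, never run.
    """
    def render(value):
        if value is True:
            return "t"
        if value is False:
            return "f"
        if value is None:
            return "null"
        return str(value)

    args = ["testformat2.sh", path]
    args += [f"{key}={render(value)}" for key, value in options.items()]
    return args


print(" ".join(build_testformat2_args(
    "reads.fastq", sketch=False, junk=None, qhist="quality_dist.txt")))
```

The resulting list could be handed to subprocess.run on a system where the script is on the PATH.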

Output Terminology Reference

Format
File format detected (e.g., FASTA, FASTQ, SAM, BAM). Determined by header analysis and content structure.
Compression
Compression format identified (e.g., gzip, bzip2, uncompressed). Important for pipeline processing requirements.
Interleaved
True if paired reads are stored in a single file with alternating mate pairs. False indicates separate files or unpaired data.
MaxLen
Maximum observed read length in the dataset. Critical for memory allocation and processing parameters.
MinLen
Minimum observed read length in the dataset. Useful for quality filtering and adapter detection.
StdevLen
Standard deviation of read lengths. Indicates length variability within the dataset.
ModeLen
Most common read length in the dataset. Typically represents the primary sequencing technology read length.
QualOffset
Quality score encoding offset (typically 33 or 64). Essential for proper quality score interpretation.
NegativeQuals
Number of bases with negative quality scores after offset correction. Indicates potential encoding issues.
Content
Sequence content type classification: Nucleotides or AminoAcids based on character analysis.
Type
Nucleic acid type classification: DNA, RNA, or Mixed based on presence of T vs U bases.
Reads
Total number of sequence reads processed in the analysis.
-JunkReads
Reads with invalid bases, format errors, or other quality issues that failed validation.
-ChastityFail
Reads failing Illumina's chastity filter as indicated in sequence headers.
-BadPairNames
Paired reads whose sequence identifiers don't match expected pairing patterns.
Bases
Total number of nucleotide bases processed across all reads.
-Lowercase
Count of lowercase nucleotide bases. Often indicates low-quality or soft-masked regions.
-Uppercase
Count of uppercase nucleotide bases. Typically indicates high-confidence sequence.
-Non-Letter
Count of non-alphabetic characters in sequence data. May indicate formatting issues.
-FullyDefined
Count of standard nucleotide bases (A, C, G, T, or U). Represents well-defined sequence content.
-No-call
Count of ambiguous 'N' bases where sequencing could not determine the nucleotide.
-Degenerate
Count of valid IUPAC ambiguity codes (other than N) representing multiple possible bases.
-Gap
Count of gap characters (-) in the sequence data.
-Invalid
Count of characters that are not valid sequence symbols according to standard conventions.
GC
GC content calculated as (C+G)/(C+G+A+T+U). Important for sequencing bias assessment and genome characterization.
Cardinality
Approximate number of unique 31-mers in the file, estimated via MinHash sketching. Provides genome size estimates.
Organism
Taxonomic name of the top-matching organism from BBSketch RefSeq server comparison.
TaxID
NCBI Taxonomy ID corresponding to the organism identification from sketch analysis.
Barcodes
Number of unique barcode sequences observed in Illumina-style sequence headers.
ZMWs
Number of unique Zero Mode Waveguide identifiers observed in PacBio sequence data.
Mergable
Fraction of read pairs that successfully overlap and can be merged into single consensus sequences.
-InsertMean
Average insert size calculated from successfully merged read pairs.
-InsertMode
Most common insert size from the merge analysis distribution.
-AdapterReads
Fraction of reads containing adapter sequences, detected through overlap analysis.
-AdapterBases
Fraction of total bases that represent adapter sequence contamination.
QErrorRate
Average estimated error rate derived from quality score analysis across all bases.
-QAvgLog
Logarithmic average of quality scores, providing error-weighted quality assessment.
-QAvgLinear
Linear average of quality scores across all bases in the dataset.
-TrimmedAtQ5
Fraction of bases that would be removed by quality trimming at Q5 threshold.
-TrimmedAtQ10
Fraction of bases that would be removed by quality trimming at Q10 threshold.
-TrimmedAtQ15
Fraction of bases that would be removed by quality trimming at Q15 threshold.
-TrimmedAtQ20
Fraction of bases that would be removed by quality trimming at Q20 threshold.
Qhist
Quality score histogram showing frequency distribution of all quality values observed.
Ihist
Insert size histogram derived from paired-read merging analysis.
BarcodeList
Complete list of observed barcode sequences with occurrence counts.
JunkList
List of sequence headers from reads that failed quality or format validation.
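
The base categories and the GC formula above can be illustrated with a short Python sketch. How the tool itself assigns overlapping categories is not stated here, so the classification rules below are assumptions based only on the definitions in this list.

```python
from collections import Counter

DEFINED = set("ACGTU")
DEGENERATE = set("RYSWKMBDHV")  # IUPAC ambiguity codes other than N


def classify_bases(seq):
    """Count symbol categories in a nucleotide string (illustrative rules)."""
    counts = Counter()
    for ch in seq:
        # Case category, independent of the symbol category below.
        if ch.islower():
            counts["Lowercase"] += 1
        elif ch.isupper():
            counts["Uppercase"] += 1
        else:
            counts["Non-Letter"] += 1
        upper = ch.upper()
        if upper in DEFINED:
            counts["FullyDefined"] += 1
        elif upper == "N":
            counts["No-call"] += 1
        elif upper in DEGENERATE:
            counts["Degenerate"] += 1
        elif ch == "-":
            counts["Gap"] += 1
        else:
            counts["Invalid"] += 1
    return counts


def gc_content(seq):
    """GC = (C+G)/(C+G+A+T+U); other symbols are ignored."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    total = gc + s.count("A") + s.count("T") + s.count("U")
    return gc / total if total else 0.0


print(dict(classify_bases("ACGTNNRY-*acgt")))
print(gc_content("ACGTNN"))
```

Because N and degenerate codes are excluded from the denominator, a read that is half N does not have its GC fraction diluted.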

Examples

Basic File Analysis

# Analyze a FASTQ file with default settings
testformat2.sh reads.fastq

# Quick format detection without full analysis
testformat2.sh reads.fastq full=f

# Comprehensive analysis with timing information
testformat2.sh reads.fastq speed=t

These examples demonstrate basic usage patterns for file format analysis and timing assessment.

Advanced Analysis with Custom Output

# Full analysis with custom histogram files
testformat2.sh reads.fastq qhist=quality_dist.txt ihist=insert_sizes.txt

# Barcode analysis with expected barcode validation
testformat2.sh barcoded_reads.fastq barcodelist=expected_barcodes.txt printbarcodes=t

# PacBio data analysis with ZMW tracking
testformat2.sh pacbio_reads.fastq zmw=t zmwhist=zmw_passes.txt

These examples show specialized analysis for different data types and custom output configurations.

Quality Assessment and Trimming Simulation

# Quality assessment with trim simulation
testformat2.sh reads.fastq trim=t printqhist=t

# Disable sketching and merging for faster processing of large files
testformat2.sh large_file.fastq sketch=f merge=f

# Generate all histogram files for comprehensive QC
testformat2.sh reads.fastq hist=t bhistlen=0

These examples focus on quality control and comprehensive statistical analysis.

Algorithm Details

Multi-threaded Analysis Architecture

TestFormat2 is multithreaded and scales its thread count to the available system resources. On systems with 48 or more threads, it caps usage at 40 threads or half the available threads to limit resource contention.

Format Detection Strategy

The tool detects format in two phases: an initial prediction from the filename, followed by content-based validation. It first examines the file extension and compression indicators, then checks the prediction against the file headers and content structure. Quality offset detection tests both Phred+33 and Phred+64 encodings so that quality scores are interpreted correctly.
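
The filename-then-content strategy and the quality offset check can be sketched in Python as follows. This is a simplified illustration, not the tool's code: the extension table is an assumption, only FASTA and FASTQ are handled (SAM and BAM are ignored), and the offset rule is the common heuristic that characters below ASCII 59 can only occur in a +33 encoding.

```python
import os

# Assumed extension table for illustration; the real tool recognizes more formats.
FORMAT_BY_EXTENSION = {".fq": "FASTQ", ".fastq": "FASTQ",
                       ".fa": "FASTA", ".fasta": "FASTA"}


def guess_from_name(path):
    """Phase 1: predict (format, compression) from the filename alone."""
    base, ext = os.path.splitext(path.lower())
    compression = "uncompressed"
    if ext in (".gz", ".bz2"):
        compression = "gzip" if ext == ".gz" else "bzip2"
        base, ext = os.path.splitext(base)
    return FORMAT_BY_EXTENSION.get(ext, "unknown"), compression


def guess_from_content(first_line):
    """Phase 2: check the leading character of the first record."""
    if first_line.startswith(">"):
        return "FASTA"
    if first_line.startswith("@"):
        return "FASTQ"
    return "unknown"


def detect_format(path, first_line):
    """Prefer what the content shows; fall back to the filename prediction."""
    predicted, compression = guess_from_name(path)
    observed = guess_from_content(first_line)
    return (observed if observed != "unknown" else predicted), compression


def guess_quality_offset(quality_strings):
    """Return 33 if any character is below ASCII 59, else 64 (None if empty)."""
    lowest = min((ord(c) for q in quality_strings for c in q), default=None)
    if lowest is None:
        return None
    return 33 if lowest < 59 else 64


print(detect_format("reads.fastq.gz", "@read1"))
print(detect_format("mislabeled.fa", "@read1"))
print(guess_quality_offset(["II#5"]))
```

The offset heuristic is ambiguous when every quality character is 59 or higher, since uniformly high-quality +33 data looks the same as +64 data. Scanning the whole file rather than the first few reads reduces that risk.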

Statistical Analysis Methods

Base composition analysis uses specialized lookup tables (toNum, toLUS, toAmino arrays) for high-speed character classification. Quality score statistics employ error probability calculations using QualityTools.PROB_ERROR arrays for accurate error rate estimation. Length statistics utilize SuperLongList data structures optimized for large datasets with efficient median and mode calculations.
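
To make the relation between quality scores and error rates concrete, the Python sketch below applies the standard Phred definition, p = 10^(-Q/10). Treating QAvgLog as the Phred score of the mean error probability is an assumption about how that field is defined; the tool's lookup-table implementation is not reproduced here.

```python
import math


def phred_to_error(q):
    """Standard Phred definition: error probability for quality score q."""
    return 10 ** (-q / 10)


def quality_summary(quals):
    """Summarize a list of integer quality scores (illustrative definitions)."""
    probs = [phred_to_error(q) for q in quals]
    mean_error = sum(probs) / len(probs)
    return {
        "QErrorRate": mean_error,                   # mean error probability
        "QAvgLinear": sum(quals) / len(quals),      # arithmetic mean of scores
        "QAvgLog": -10 * math.log10(mean_error),    # assumed: Phred of mean error
    }


# Decode a Phred+33 quality string: '+' is Q10 and '?' is Q30.
quals = [ord(c) - 33 for c in "+?"]
print(quals, quality_summary(quals))
```

The linear average of Q10 and Q30 is 20, but the error-weighted average is close to 13, because a single low-quality base dominates the mean error probability.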

Sketching and Cardinality Estimation

When sketching is enabled, the tool implements MinHash sketching through SketchMakerMini with configurable sketch sizes (default 40,000). The sketching algorithm automatically detects PacBio vs Illumina data characteristics and adjusts parameters accordingly. Cardinality estimation uses both simple and 3+ k-mer frequency thresholds for robust genome size estimation, with optional RefSeq server queries for taxonomic identification.
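
The idea behind sketch-based cardinality estimation can be shown with a generic bottom-k MinHash sketch in Python. This is not SketchMakerMini or BBSketch: the hash function, the small sketch size, and the lack of canonical (strand-independent) k-mers are all simplifying assumptions.

```python
import hashlib
import heapq
import random

MAX_HASH = 2 ** 64


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (stand-in for the tool's hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seqs, k=31, sketch_size=256):
    """Keep the sketch_size smallest distinct k-mer hashes, in sorted order."""
    heap = []        # max-heap of kept hashes, stored negated
    members = set()  # same hashes, for duplicate checks
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            h = kmer_hash(seq[i:i + k])
            if h in members:
                continue
            if len(heap) < sketch_size:
                heapq.heappush(heap, -h)
                members.add(h)
            elif h < -heap[0]:
                evicted = -heapq.heappushpop(heap, -h)  # drop current largest
                members.discard(evicted)
                members.add(h)
    return sorted(members)


def estimate_cardinality(sketch, sketch_size):
    """Estimate distinct k-mers from the largest hash kept in a full sketch."""
    if len(sketch) < sketch_size:
        return len(sketch)  # sketch never filled, so the count is exact
    return int((sketch_size - 1) * MAX_HASH / sketch[-1])


rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
true_count = len({genome[i:i + 31] for i in range(len(genome) - 31 + 1)})
sketch = bottom_k_sketch([genome], k=31, sketch_size=256)
estimate = estimate_cardinality(sketch, 256)
print(true_count, estimate)
```

The relative error of this estimator shrinks roughly with the square root of the sketch size, which is why a larger sketchsize gives more accurate cardinality at the cost of memory.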

Merging and Insert Size Analysis

Paired-read merging employs the BBMerge.findOverlapLoose algorithm to detect read pair overlaps without requiring perfect matches. The algorithm identifies adapter contamination by comparing successful merge lengths with read lengths, tracking both adapter-containing reads and total adapter bases. Insert size histograms are generated from successful merges with automatic binning optimization.
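
The link between overlap, insert size, and adapter content can be illustrated with a brute-force overlap search in Python. This is a simplified stand-in, not BBMerge.findOverlapLoose: the thresholds min_overlap and max_mismatch_rate and the scoring rule are assumptions chosen for clarity.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_insert_size(read1, read2, min_overlap=12, max_mismatch_rate=0.1):
    """Return the best-supported insert size, or None if no overlap qualifies."""
    mate = revcomp(read2)
    best = None  # (mismatch rate, -overlap, insert); smaller tuples are better
    for insert in range(min_overlap, len(read1) + len(mate) - min_overlap + 1):
        shift = insert - len(mate)  # fragment coordinate of mate[0]
        start, stop = max(0, shift), min(len(read1), insert)
        overlap = stop - start
        if overlap < min_overlap:
            continue
        mismatches = sum(read1[p] != mate[p - shift] for p in range(start, stop))
        rate = mismatches / overlap
        if rate <= max_mismatch_rate:
            candidate = (rate, -overlap, insert)
            if best is None or candidate < best:
                best = candidate
    return None if best is None else best[2]


def adapter_bases(read, insert):
    """Bases past the end of the insert are adapter sequence."""
    return max(0, len(read) - insert)


rng = random.Random(1)


def rand_seq(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


# 150 bp fragment, 100 bp reads: the reads overlap by 50 bases.
frag = rand_seq(150)
long_r1, long_r2 = frag[:100], revcomp(frag[-100:])

# 60 bp fragment, 100 bp reads: each read runs 40 bases into adapter.
short = rand_seq(60)
short_r1 = short + rand_seq(40)
short_r2 = revcomp(short) + rand_seq(40)

print(find_insert_size(long_r1, long_r2))
insert = find_insert_size(short_r1, short_r2)
print(insert, adapter_bases(short_r1, insert))
```

When the inferred insert is shorter than the read, the remainder of the read is counted as adapter, which is how merge analysis can report both adapter-containing reads and adapter bases.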

Quality-Based Trimming Simulation

Trim analysis uses TrimRead.testOptimal with configurable quality thresholds (Q5, Q10, Q15, Q20) to simulate various trimming strategies. The algorithm employs error probability matrices to determine optimal trim points that balance data retention with quality requirements, providing predictive data for downstream processing decisions.
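
One way to picture error-probability-based trimming is as a maximum-scoring-segment problem: each base scores the threshold error probability minus its own error probability, and the read is trimmed to the contiguous run with the highest total. The Python sketch below implements that idea. It is assumed to be similar in spirit to TrimRead.testOptimal but is not a port of it, and the function names are illustrative.

```python
def error_prob(q):
    """Standard Phred definition: error probability for quality score q."""
    return 10 ** (-q / 10)


def best_segment(quals, trimq):
    """Return (start, end) of the highest-scoring run of bases, end exclusive."""
    limit = error_prob(trimq)
    best_sum, best = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, q in enumerate(quals):
        cur_sum += limit - error_prob(q)  # positive when base beats threshold
        if cur_sum <= 0:
            cur_sum, cur_start = 0.0, i + 1   # restart after this base
        elif cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def trimmed_fraction(reads_quals, trimq):
    """Fraction of all bases that fall outside the kept segments."""
    total = sum(len(q) for q in reads_quals)
    kept = 0
    for quals in reads_quals:
        start, end = best_segment(quals, trimq)
        kept += end - start
    return (total - kept) / total if total else 0.0


read = [2, 8, 12, 18, 30, 30, 30, 18, 12, 8, 2]
for trimq in (5, 10, 15, 20):
    print(trimq, best_segment(read, trimq), trimmed_fraction([read], trimq))
```

Raising the threshold from Q5 to Q20 shrinks the kept segment of this read from 9 bases to 3, which is the kind of trend the TrimmedAtQ fields summarize across a whole file.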

Memory Management and Scalability

The tool implements adaptive memory management with automatic heap size calculation based on available system memory (calcXmx function). Large file processing uses streaming analysis to minimize memory footprint while maintaining statistical accuracy. Histogram data structures are pre-allocated with appropriate sizes (256 quality bins, 1000 insert size bins) to prevent memory fragmentation during analysis.
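
A pre-allocated histogram lets values be counted in a single streaming pass without storing the values themselves. The Python sketch below shows the pattern with the bin counts mentioned above. Folding out-of-range values into the first or last bin is an assumption made for the sketch, and the class name is hypothetical.

```python
class FixedHistogram:
    """Integer histogram with a fixed number of pre-allocated bins."""

    def __init__(self, bins):
        self.counts = [0] * bins  # allocated once, never resized

    def add(self, value):
        # Assumed policy: clamp out-of-range values into the end bins.
        index = min(max(value, 0), len(self.counts) - 1)
        self.counts[index] += 1

    def mode(self):
        """Bin with the highest count (lowest bin wins ties)."""
        return max(range(len(self.counts)), key=self.counts.__getitem__)

    def mean(self):
        total = sum(self.counts)
        if not total:
            return 0.0
        return sum(i * c for i, c in enumerate(self.counts)) / total


insert_hist = FixedHistogram(1000)  # insert size bins
qual_hist = FixedHistogram(256)     # quality bins

for insert_size in [150, 150, 300, 5000]:  # streamed one value at a time
    insert_hist.add(insert_size)
for q in [30, 30, 12]:
    qual_hist.add(q)

print(insert_hist.mode(), insert_hist.counts[999], qual_hist.mean())
```

Memory use depends only on the number of bins, not on the number of reads, so the same structure works for small test files and very large datasets.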

Error Handling and Data Validation

Error detection covers several data quality issues, including invalid base characters, malformed quality strings, mismatched pair names, and chastity filter failures. The tool keeps a separate counter for each error type, which supports detailed quality assessment and troubleshooting of problematic datasets.
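
Keeping one counter per problem type can be sketched in Python as follows. The validation rules are assumptions for illustration: a read is treated as junk if its sequence and quality lengths differ or it contains a non-IUPAC character, pair names are compared after stripping /1 and /2, and chastity failure is read from the filtered flag (Y) in Casava 1.8 style headers.

```python
from collections import Counter

VALID_BASES = set("ACGTUNRYSWKMBDHV-acgtunryswkmbdhv")


def pair_key(header):
    """Read name without its comment and without a trailing /1 or /2."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def failed_chastity(header):
    """True if a 'read:filtered:control:index' comment has filtered == 'Y'."""
    parts = header.split()
    if len(parts) < 2:
        return False
    fields = parts[1].split(":")
    return len(fields) >= 2 and fields[1] == "Y"


def tally_problems(pairs):
    """Count each problem type over (header, sequence, quality) read pairs."""
    counts = Counter()
    for read1, read2 in pairs:
        if pair_key(read1[0]) != pair_key(read2[0]):
            counts["BadPairNames"] += 1
        for header, seq, qual in (read1, read2):
            counts["Reads"] += 1
            if failed_chastity(header):
                counts["ChastityFail"] += 1
            if len(seq) != len(qual) or not set(seq) <= VALID_BASES:
                counts["JunkReads"] += 1
    return counts


pairs = [
    (("r1/1", "ACGT", "IIII"), ("r1/2", "ACGT", "IIII")),
    (("r2 1:Y:0:ACGT", "ACGT", "III"), ("r3 2:N:0:ACGT", "AC*T", "IIII")),
]
print(dict(tally_problems(pairs)))
```

Separate counters make it possible to tell a file with a few truncated records from one with systematic pairing problems, which call for different fixes.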

Support

For questions and support: