TestFormat2

Basic Usage

testformat2.sh <file>

The tool takes a single file as input and reports its format, compression, quality metrics, base composition, and other characteristics useful for planning a bioinformatics pipeline.

Parameters

Parameters are grouped by function: analysis control, sequence-specific analysis, and output. Each parameter is shown with its default value.

Analysis Control Parameters

full=t: Process the full file for complete analysis. When true, performs complete file scanning to gather all statistics. Set to false for faster format-only detection.
speed=f: Print processing time and throughput statistics. When enabled, reports time elapsed and processing rates for performance monitoring.
fast=t: Use faster processing mode with reduced precision for some statistics. Trades some accuracy for significantly improved speed on large files.
slow=f: Opposite of fast. When true, runs the most comprehensive analysis at maximum precision.

Sequence-Specific Analysis

zmw=t: Parse PacBio ZMW (Zero Mode Waveguide) IDs from sequence headers. Enables ZMW-specific statistics and pass count analysis for PacBio data.
barcodelist=: Optional list of expected barcodes. May be a filename with one line per barcode, or a comma-delimited literal string. Used for barcode validation and statistics.
merge=t: Calculate mergeability statistics via the BBMerge algorithm. Attempts to merge paired reads to determine insert size distributions and adapter contamination.
sketch=t: (alias: card) Calculate cardinality via BBSketch. If enabled, also sends the sketch to the RefSeq server for taxonomic identification. Provides genome size estimates and species identification.
trim=t: Calculate trimmability statistics from quality scores. Simulates quality-based trimming at various thresholds to predict trim rates.
sketchsize=40000: Size of sketch to generate for cardinality estimation. Larger sketches provide more accurate cardinality estimates but require more memory.
bhistlen=10k: Maximum read length for base composition histogram calculation. Reads longer than this value are excluded from bhist.txt. Set to 0 to include all reads.
maxbhistlen=10000: Alias for bhistlen parameter. Controls the maximum read length included in base composition analysis.

Output Control Parameters

printjunk=f: Print headers of junk reads to stdout. Junk reads are those with invalid bases, format errors, or other quality issues.
printbarcodes=f: Print barcode sequences and their occurrence counts to stdout. Useful for validating expected barcode lists and identifying contamination.
printqhist=f: Print quality score histogram to stdout. Shows distribution of quality scores across all bases in the file.
printihist=f: Print insert size histogram to stdout. Derived from paired-read merging analysis when merge=t is enabled.
edist=f: Calculate and report edit distances for barcode analysis. Computes minimum edit distances between observed and expected barcodes.
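
To illustrate what a minimum edit distance between observed and expected barcodes means, here is a minimal Python sketch using the standard Levenshtein distance. The function names are hypothetical, and the tool's own distance metric and implementation may differ.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def min_edit_distance(observed: str, expected: Sequence[str]) -> int:
    """Smallest distance from one observed barcode to any expected barcode."""
    return min(edit_distance(observed, e) for e in expected)
```

An observed barcode at distance 0 matches an expected barcode exactly; a small nonzero distance suggests a sequencing error rather than an unexpected sample.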

File Output Parameters (set any of these to null to suppress that file)

junk=junk.txt: Print headers of junk reads to this file. Contains sequence identifiers for reads that failed quality checks or format validation.
barcodes=barcodes.txt: Print barcode analysis results to this file. Includes barcode sequences, counts, and validation results against expected barcode lists.
hist=t: Master switch for histogram file generation. Setting to false will clear all default histogram file outputs (qhist, ihist, khist, bhist, lhist, gchist, zmwhist).
qhist=qhist.txt: Print quality histogram to this file. Contains quality score distributions with frequency counts for each quality value observed.
ihist=ihist.txt: Print insert size histogram to this file. Generated from paired-read overlap analysis and contains insert size frequency distribution.
khist=khist.txt: Print k-mer frequency histogram to this file. Derived from sketch analysis and shows k-mer depth distribution for genome size estimation.
bhist=bhist.txt: Print base composition histogram to this file. Contains per-read base composition statistics including GC content distribution.
lhist=lhist.txt: Print length histogram to this file. Shows read length distribution across the entire dataset.
gchist=gchist.txt: Print GC content histogram to this file. Contains detailed GC content distribution statistics for quality assessment.
zmwhist=zmwhist.txt: Print ZMW pass count histogram to this file. For PacBio data, shows distribution of how many passes each ZMW had.

Output Terminology Reference

Format: File format detected (e.g., FASTA, FASTQ, SAM, BAM). Determined by header analysis and content structure.
Compression: Compression format identified (e.g., gzip, bzip2, uncompressed). Important for pipeline processing requirements.
Interleaved: True if paired reads are stored in a single file with the two mates of each pair alternating. False indicates separate files or unpaired data.
MaxLen: Maximum observed read length in the dataset. Critical for memory allocation and processing parameters.
MinLen: Minimum observed read length in the dataset. Useful for quality filtering and adapter detection.
StdevLen: Standard deviation of read lengths. Indicates length variability within the dataset.
ModeLen: Most common read length in the dataset. Typically represents the primary sequencing technology read length.
QualOffset: Quality score encoding offset (typically 33 or 64). Essential for proper quality score interpretation.
NegativeQuals: Number of bases with negative quality scores after offset correction. Indicates potential encoding issues.
Content: Sequence content type classification: Nucleotides or AminoAcids based on character analysis.
Type: Nucleic acid type classification: DNA, RNA, or Mixed based on presence of T vs U bases.
Reads: Total number of sequence reads processed in the analysis.
-JunkReads: Reads with invalid bases, format errors, or other quality issues that failed validation.
-ChastityFail: Reads failing Illumina's chastity filter as indicated in sequence headers.
-BadPairNames: Paired reads whose sequence identifiers don't match expected pairing patterns.
Bases: Total number of nucleotide bases processed across all reads.
-Lowercase: Count of lowercase nucleotide bases. Often indicates low-quality or soft-masked regions.
-Uppercase: Count of uppercase nucleotide bases. Typically indicates high-confidence sequence.
-Non-Letter: Count of non-alphabetic characters in sequence data. May indicate formatting issues.
-FullyDefined: Count of standard nucleotide bases (A, C, G, T, or U). Represents well-defined sequence content.
-No-call: Count of ambiguous 'N' bases where sequencing could not determine the nucleotide.
-Degenerate: Count of valid IUPAC ambiguity codes (other than N) representing multiple possible bases.
-Gap: Count of gap characters (-) in the sequence data.
-Invalid: Count of characters that are not valid sequence symbols according to standard conventions.
GC: GC content calculated as (C+G)/(C+G+A+T+U). Important for sequencing bias assessment and genome characterization.
Cardinality: Approximate number of unique 31-mers in the file, estimated via MinHash sketching. Provides genome size estimates.
Organism: Taxonomic name of the top-matching organism from BBSketch RefSeq server comparison.
TaxID: NCBI Taxonomy ID corresponding to the organism identification from sketch analysis.
Barcodes: Number of unique barcode sequences observed in Illumina-style sequence headers.
ZMWs: Number of unique Zero Mode Waveguide identifiers observed in PacBio sequence data.
Mergable: Fraction of read pairs that successfully overlap and can be merged into single consensus sequences.
-InsertMean: Average insert size calculated from successfully merged read pairs.
-InsertMode: Most common insert size from the merge analysis distribution.
-AdapterReads: Fraction of reads containing adapter sequences, detected through overlap analysis.
-AdapterBases: Fraction of total bases that represent adapter sequence contamination.
QErrorRate: Average estimated error rate derived from quality score analysis across all bases.
-QAvgLog: Logarithmic average of quality scores, providing error-weighted quality assessment.
-QAvgLinear: Linear average of quality scores across all bases in the dataset.
-TrimmedAtQ5: Fraction of bases that would be removed by quality trimming at Q5 threshold.
-TrimmedAtQ10: Fraction of bases that would be removed by quality trimming at Q10 threshold.
-TrimmedAtQ15: Fraction of bases that would be removed by quality trimming at Q15 threshold.
-TrimmedAtQ20: Fraction of bases that would be removed by quality trimming at Q20 threshold.
Qhist: Quality score histogram showing frequency distribution of all quality values observed.
Ihist: Insert size histogram derived from paired-read merging analysis.
BarcodeList: Complete list of observed barcode sequences with occurrence counts.
JunkList: List of sequence headers from reads that failed quality or format validation.
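
As an illustration of the base-level fields above, the following Python sketch computes a few of them for a single sequence, including GC as (C+G)/(C+G+A+T+U). It assumes case-insensitive counting, and the function name and returned dictionary are hypothetical; this is not the tool's implementation.

```python
from collections import Counter


def base_stats(seq: str) -> dict:
    """Compute a few of the base-level fields for one sequence."""
    counts = Counter(seq.upper())  # assumption: case does not affect classification
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"] + counts["U"]
    defined = gc + at
    return {
        "FullyDefined": defined,
        "No-call": counts["N"],
        "Gap": counts["-"],
        "Lowercase": sum(1 for c in seq if c.islower()),
        # GC = (C+G)/(C+G+A+T+U); N and other symbols are left out of the denominator
        "GC": gc / defined if defined else 0.0,
    }
```

Because the denominator contains only fully defined bases, a read with many N bases does not pull the GC fraction down.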

Examples

Basic File Analysis

# Analyze a FASTQ file with default settings
testformat2.sh reads.fastq

# Quick format detection without full analysis
testformat2.sh reads.fastq full=f

# Default analysis with timing and throughput information
testformat2.sh reads.fastq speed=t

These examples demonstrate basic usage patterns for file format analysis and timing assessment.

Advanced Analysis with Custom Output

# Full analysis with custom histogram files
testformat2.sh reads.fastq qhist=quality_dist.txt ihist=insert_sizes.txt

# Barcode analysis with expected barcode validation
testformat2.sh barcoded_reads.fastq barcodelist=expected_barcodes.txt printbarcodes=t

# PacBio data analysis with ZMW tracking
testformat2.sh pacbio_reads.fastq zmw=t zmwhist=zmw_passes.txt

These examples show specialized analysis for different data types and custom output file names.

Quality Assessment and Trimming Simulation

# Quality assessment with trim simulation
testformat2.sh reads.fastq trim=t printqhist=t

# Disable sketching and merging for faster processing of large files
testformat2.sh large_file.fastq sketch=f merge=f

# Generate all histogram files, with no read length limit for bhist.txt
testformat2.sh reads.fastq hist=t bhistlen=0

These examples focus on quality control and comprehensive statistical analysis.

Algorithm Details

Multi-threaded Analysis Architecture

TestFormat2 is multi-threaded and scales its thread usage to the available system resources. On systems with 48 or more threads, it limits itself to 40 threads or half the available threads to avoid resource contention.

Format Detection Strategy

The tool uses a dual-phase format detection approach: filename-based initial detection followed by content-based validation. This strategy first examines file extensions and compression indicators, then validates the prediction by analyzing file headers and content structure. Quality offset detection specifically tests both Phred+33 and Phred+64 encodings to ensure accurate quality score interpretation.
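
One common way to choose between Phred+33 and Phred+64 is to look at the lowest quality character in the file. The Python sketch below shows that heuristic and how negative qualities arise when the wrong offset is applied. It is an assumption-laden illustration with hypothetical function names, not the tool's actual detection logic, and it can misclassify Phred+33 data in which every score is 26 or higher.

```python
def guess_qual_offset(quality_lines):
    """Guess the quality offset from the lowest ASCII code observed."""
    codes = [ord(c) for line in quality_lines for c in line]
    if not codes:
        raise ValueError("no quality characters to inspect")
    # No +64 encoding uses characters below ';' (ASCII 59, Solexa score -5),
    # so anything lower must be Phred+33.
    return 33 if min(codes) < 59 else 64


def count_negative_quals(quality_lines, offset):
    """Count bases whose score is negative after subtracting the offset."""
    return sum(1 for line in quality_lines for c in line if ord(c) - offset < 0)
```

A nonzero count from count_negative_quals with the chosen offset corresponds to the situation the NegativeQuals field flags as a potential encoding issue.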

Statistical Analysis Methods

Base composition analysis uses specialized lookup tables (toNum, toLUS, toAmino arrays) for high-speed character classification. Quality score statistics employ error probability calculations using QualityTools.PROB_ERROR arrays for accurate error rate estimation. Length statistics utilize SuperLongList data structures optimized for large datasets with efficient median and mode calculations.
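
The relationship between quality scores and error probability can be sketched in Python as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10). The interpretation of QAvgLog as the Phred score of the mean error probability is an assumption made for this example, and the function names are hypothetical.

```python
import math


def phred_to_prob(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def quality_summary(quals):
    """Summarize a list of Phred scores (offset already removed)."""
    if not quals:
        raise ValueError("no quality scores given")
    mean_error = sum(phred_to_prob(q) for q in quals) / len(quals)
    return {
        "QErrorRate": mean_error,
        # Assumed definition: Phred score of the average error probability
        "QAvgLog": -10 * math.log10(mean_error),
        # Plain arithmetic mean of the scores
        "QAvgLinear": sum(quals) / len(quals),
    }
```

For scores of 10 and 30, the linear average is 20, but the error-weighted average is close to 13 because the low-quality base dominates the expected error.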

Sketching and Cardinality Estimation

When sketching is enabled, the tool implements MinHash sketching through SketchMakerMini with configurable sketch sizes (default 40,000). The sketching algorithm automatically detects PacBio vs Illumina data characteristics and adjusts parameters accordingly. Cardinality estimation uses both simple and 3+ k-mer frequency thresholds for robust genome size estimation, with optional RefSeq server queries for taxonomic identification.
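
The idea behind sketch-based cardinality estimation can be shown with a generic bottom-k MinHash sketch in Python. This is a simplified illustration, not SketchMakerMini: the hash function, the function names, and the estimator are assumptions, and it ignores reverse complements and k-mer frequency thresholds.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # hashes are 64-bit integers


def kmer_hashes(seq, k=31):
    """Yield a 64-bit hash for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big")


def bottom_k_sketch(seqs, k=31, sketch_size=40000):
    """Keep the sketch_size smallest distinct k-mer hashes."""
    heap = []        # max-heap (negated values) of the hashes currently kept
    members = set()  # same hashes, for duplicate checks
    for seq in seqs:
        for h in kmer_hashes(seq, k):
            if h in members:
                continue
            if len(heap) < sketch_size:
                heapq.heappush(heap, -h)
                members.add(h)
            elif h < -heap[0]:
                evicted = -heapq.heappushpop(heap, -h)  # drop the largest kept hash
                members.discard(evicted)
                members.add(h)
    return sorted(members)


def estimate_cardinality(seqs, k=31, sketch_size=40000):
    """Estimate the number of distinct k-mers from the sketch alone."""
    sketch = bottom_k_sketch(seqs, k, sketch_size)
    if len(sketch) < sketch_size:
        return len(sketch)  # sketch never filled, so the count is exact
    # If the s-th smallest of n uniform hashes is x, then n is about (s-1)/x
    return round((sketch_size - 1) * HASH_SPACE / sketch[-1])
```

Larger sketches lower the relative error of the estimate (roughly 1/sqrt(sketch size)) at the cost of more memory, which is the trade-off the sketchsize parameter controls.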

Merging and Insert Size Analysis

Paired-read merging employs the BBMerge.findOverlapLoose algorithm to detect read pair overlaps without requiring perfect matches. The algorithm identifies adapter contamination by comparing successful merge lengths with read lengths, tracking both adapter-containing reads and total adapter bases. Insert size histograms are generated from successful merges with automatic binning optimization.
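
The link between overlap, insert size, and adapter bases can be illustrated with a brute-force Python sketch. It is not BBMerge.findOverlapLoose: the function names, the minimum overlap, and the mismatch tolerance are all assumptions chosen for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_size(r1, r2, min_overlap=12, max_mismatch_rate=0.1):
    """Return the insert size implied by the best overlap, or None."""
    r2rc = reverse_complement(r2)
    len1, len2 = len(r1), len(r2rc)
    best = None  # (mismatch rate, -overlap, insert size)
    for insert in range(min_overlap, len1 + len2 - min_overlap + 1):
        start = max(0, insert - len2)  # first insert position both reads cover
        end = min(insert, len1)        # one past the last such position
        overlap = end - start
        if overlap < min_overlap:
            continue
        shift = insert - len2          # r2rc index = insert position - shift
        mismatches = sum(1 for p in range(start, end) if r1[p] != r2rc[p - shift])
        rate = mismatches / overlap
        if rate <= max_mismatch_rate:
            candidate = (rate, -overlap, insert)
            if best is None or candidate < best:
                best = candidate
    return None if best is None else best[2]


def adapter_bases(read_length, insert_size):
    """Bases past the end of the insert must come from adapter sequence."""
    return max(0, read_length - insert_size)
```

When the inferred insert size is shorter than the read length, the remaining bases of each read extend past the fragment and are counted as adapter, which is how overlap analysis yields adapter statistics without an adapter reference.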

Quality-Based Trimming Simulation

Trim analysis uses TrimRead.testOptimal with configurable quality thresholds (Q5, Q10, Q15, Q20) to simulate various trimming strategies. The algorithm employs error probability matrices to determine optimal trim points that balance data retention with quality requirements, providing predictive data for downstream processing decisions.
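
One way to frame error-probability-based trimming is as a maximum-sum interval problem: each base scores the threshold error probability minus its own error probability, and the retained region is the contiguous interval with the highest total. The Python sketch below follows that formulation. It is an assumption about the general approach, with hypothetical function names, and is not the code of TrimRead.testOptimal.

```python
def optimal_trim(quals, trim_q):
    """Return the half-open (start, end) interval of bases to keep."""
    cutoff = 10 ** (-trim_q / 10)  # error probability at the trim threshold
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - 10 ** (-q / 10)  # positive for bases better than the threshold
        if run_sum <= 0:
            run_sum, run_start = 0.0, i + 1  # restart after a bad stretch
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


def trimmed_fraction(reads_quals, trim_q):
    """Fraction of all bases that trimming at trim_q would remove."""
    total = sum(len(q) for q in reads_quals)
    kept = sum(end - start
               for start, end in (optimal_trim(q, trim_q) for q in reads_quals))
    return (total - kept) / total if total else 0.0
```

Running trimmed_fraction at thresholds 5, 10, 15, and 20 gives values analogous to the TrimmedAtQ5 through TrimmedAtQ20 fields.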

Memory Management and Scalability

The tool implements adaptive memory management with automatic heap size calculation based on available system memory (calcXmx function). Large file processing uses streaming analysis to minimize memory footprint while maintaining statistical accuracy. Histogram data structures are pre-allocated with appropriate sizes (256 quality bins, 1000 insert size bins) to prevent memory fragmentation during analysis.

Error Handling and Data Validation

Robust error detection identifies various data quality issues including invalid base characters, malformed quality strings, mismatched pair names, and chastity filter failures. The tool maintains separate counters for different error types, enabling detailed quality assessment and troubleshooting of problematic datasets.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org