CladeLoader

Script: cladeloader.sh
Package: bin
Class: CladeLoader.java

Loads FASTA files of TID-labeled contigs and produces Clade records containing kmer frequencies, entropy measurements, and taxonomic information for phylogenetic classification workflows.

Basic Usage

cladeloader.sh in=contigs.fa out=clades.clade

CladeLoader processes FASTA files containing sequences with taxonomic ID (TID) labels in the headers. It generates Clade records that include kmer frequency profiles, entropy measurements, and ribosomal RNA sequences for phylogenetic analysis.

Parameters

Parameters are organized by their function in the clade loading process. All parameters from the shell script usage function are documented here.

Input/Output Parameters

in=<file,file>: Input FASTA files with a taxonomic ID (TID) in each header. Multiple files can be given as a comma-separated list. Headers must contain taxonomic information parseable by TaxTree.parseHeaderStatic().
out=<file>: Output file for clade records. Will contain taxonomic profiles with kmer frequencies, entropy data, and sequence statistics.

Processing Parameters

maxk=5: Limit maximum kmer length for frequency analysis (range 3-5). Controls the depth of kmer profiling used for taxonomic characterization. Higher values provide more specificity but require more memory.
a48: Output counts in ASCII-48 encoding instead of decimal format. This encoding method is more compact for storage and transmission of frequency data.

Ribosomal RNA Parameters

16s=<file,file>: Optional taxonomically-labeled file(s) of 16S ribosomal RNA sequences. These sequences will be integrated into matching clades based on taxonomic ID for enhanced phylogenetic profiling.
18s=<file,file>: Optional taxonomically-labeled file(s) of 18S ribosomal RNA sequences. Used for eukaryotic phylogenetic classification alongside the primary sequence data.
replaceribo: Set to true if existing ribosomal RNA sequences should be replaced by new ones from the 16s/18s files. Default behavior preserves existing sequences unless this flag is set.

Taxonomic Tree Parameters

usetree=f: Load a taxonomic tree to generate lineage strings for enhanced taxonomic context. When enabled, provides full taxonomic lineages for each clade.
tree: Path to taxonomic tree file. Can be set to "auto" for automatic detection, "true"/"t" to enable with default path, "false"/"f" to disable, or a specific file path.

Alignment Parameters

aligner=quantum: Alignment algorithm for sequence analysis. Options include ssa2, glocal, drifting, banded, crosscut, and quantum. Each offers different speed/accuracy tradeoffs for taxonomic classification.

Processing Control Parameters

mergedupes: Merge duplicate taxonomic IDs encountered in the input. When enabled, statistics from multiple sequences with the same TID are combined into one clade instead of the later record replacing the earlier one.
verbose: Print detailed processing messages during execution. Useful for monitoring progress and debugging processing issues.
ordered: Maintain input order in the output. When disabled, clades may be written in the order they finish processing, which can be faster.
dummy: Use dummy clade for temporary storage during processing. This optimization reduces memory allocation overhead for large datasets.
callssu: Enable Small Subunit (SSU) ribosomal RNA calling during processing. Automatically identifies and processes ribosomal sequences found in the input.

Java Parameters

-Xmx: Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigs of RAM, -Xmx200m specifies 200 megs. Maximum is typically 85% of physical memory. Default is 4GB for this tool.
-eoom: Exit if an out-of-memory exception occurs. Requires Java 8u92+. Prevents hanging processes when insufficient memory is available.
-da: Disable assertions for improved performance in production environments.

Examples

Basic Clade Loading

cladeloader.sh in=annotated_contigs.fa out=taxonomy_profiles.clade

Process FASTA contigs with taxonomic IDs in headers to generate clade records with kmer profiles.

Enhanced Taxonomic Analysis

cladeloader.sh in=contigs.fa out=clades.clade usetree=t 16s=ribosomal_16s.fa maxk=4

Load contigs with taxonomic tree integration and 16S ribosomal sequences, using 4-mer frequency analysis.

Multiple Input Files with Compact Encoding

cladeloader.sh in=sample1.fa,sample2.fa,sample3.fa out=combined_clades.clade a48 verbose

Process multiple input files simultaneously, output in ASCII-48 encoding with verbose progress reporting.

Complete Ribosomal Integration

cladeloader.sh in=metagenome.fa out=taxonomy.clade 16s=16s_refs.fa 18s=18s_refs.fa replaceribo

Integrate both 16S and 18S ribosomal references, replacing any existing ribosomal sequences with the provided references.

Algorithm Details

Clade Record Generation

CladeLoader is multi-threaded and stores its results in a ConcurrentHashMap<Integer, Clade> keyed by taxonomic ID. Each sequence passes through the following stages:

Taxonomic ID Resolution: Each input sequence header is parsed to extract taxonomic identifiers using TaxTree.parseHeaderStatic(). This supports standard NCBI taxonomic formatting and custom labeling schemes.
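The header-parsing step can be illustrated with a minimal sketch. It assumes headers of the form tid|12345|description; that layout, the class name TidHeader, and the method parseTid are assumptions made for this example and do not reproduce TaxTree.parseHeaderStatic(), whose accepted formats are not listed here.

```java
// Hypothetical helper, not part of CladeLoader: pulls a numeric TID out of a FASTA header.
final class TidHeader {
    private static final String PREFIX = "tid|"; // Assumed label format

    /** Returns the TID from a header such as "tid|562|Escherichia coli", or -1 if none is found. */
    static int parseTid(String header) {
        if (header == null) { return -1; }
        int start = header.indexOf(PREFIX);
        if (start < 0) { return -1; }
        start += PREFIX.length();
        int end = start;
        // Consume the run of ASCII digits that follows the prefix.
        while (end < header.length() && header.charAt(end) >= '0' && header.charAt(end) <= '9') {
            end++;
        }
        if (end == start) { return -1; } // Prefix present but no digits after it
        try {
            return Integer.parseInt(header.substring(start, end));
        } catch (NumberFormatException e) {
            return -1; // Digit run too long to fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTid(">tid|562|Escherichia coli contig_1"));
        System.out.println(parseTid(">contig_without_label"));
    }
}
```

Returning -1 for unlabeled headers lets a caller skip or report such sequences instead of failing on the first one.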

Kmer Frequency Analysis: Sequences undergo kmer decomposition up to the specified maxk length (3-5 mers). The system uses AdjustEntropy with predetermined parameters (k=4, window=150) for consistent entropy calculations across all sequences.
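To make the kmer decomposition concrete, here is a minimal sketch of counting all kmers of one length with a rolling 2-bit encoding. The class KmerCounts and its methods are hypothetical; the sketch ignores strand (no reverse-complement merging) and is not the counting code used by CladeLoader.

```java
// Hypothetical illustration of kmer counting for small k (for example 3 to 5).
final class KmerCounts {
    /** Maps A, C, G, T (either case) to 0..3 and any other symbol to -1. */
    static int baseToNumber(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: return -1;
        }
    }

    /** Counts every kmer of length k in seq; the array index is the 2-bit encoding of the kmer. */
    static long[] count(String seq, int k) {
        long[] counts = new long[1 << (2 * k)]; // 4^k possible kmers
        int mask = (1 << (2 * k)) - 1;
        int kmer = 0;
        int len = 0; // Number of consecutive valid bases ending at position i
        for (int i = 0; i < seq.length(); i++) {
            int x = baseToNumber(seq.charAt(i));
            if (x < 0) { kmer = 0; len = 0; continue; } // Restart after N or another symbol
            kmer = ((kmer << 2) | x) & mask;
            len++;
            if (len >= k) { counts[kmer]++; }
        }
        return counts;
    }

    /** One count array per k from 1 to maxK; element 0 is left null. */
    static long[][] countAll(String seq, int maxK) {
        long[][] all = new long[maxK + 1][];
        for (int k = 1; k <= maxK; k++) { all[k] = count(seq, k); }
        return all;
    }
}
```

Because the array for length k has 4^k entries, a profile up to maxk=5 needs at most 1,024 counters for the longest kmers, which is why small k values keep per-clade memory modest.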

Entropy Tracking: Each processing thread maintains an EntropyTracker instance that calculates sequence complexity metrics. This information is crucial for distinguishing between repetitive and unique genomic regions in taxonomic classification.
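As a rough illustration of a sequence complexity metric, the sketch below computes the Shannon entropy of the kmer distribution inside one window. The values k=4 and window=150 mirror the parameters quoted above; the class WindowEntropy and the choice of plain Shannon entropy are assumptions, since the exact calculation performed by EntropyTracker is not described here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the EntropyTracker implementation.
final class WindowEntropy {
    /** Shannon entropy, in bits, of the kmers inside seq[start, start + window). */
    static double entropy(String seq, int start, int window, int k) {
        int end = Math.min(seq.length(), start + window);
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = start; i + k <= end; i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
            total++;
        }
        if (total == 0) { return 0.0; } // Window shorter than k
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2.0));
        }
        return h;
    }

    public static void main(String[] args) {
        // A homopolymer run scores lower than a mixed sequence.
        System.out.println(entropy("AAAAAAAAAAAAAAAA", 0, 150, 4));
        System.out.println(entropy("ACGTTGCAAGCTTACG", 0, 150, 4));
    }
}
```

Low values indicate repetitive windows (few distinct kmers), while values near the maximum indicate that most kmers in the window are different.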

Concurrent Processing: The tool uses a ConcurrentHashMap<Integer, Clade> architecture where each taxonomic ID maps to a Clade object. Thread safety is maintained through careful synchronization during clade creation and updating.
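One common way to get this behavior from a ConcurrentHashMap is sketched below: computeIfAbsent creates a record at most once per key, and a synchronized method guards updates to that record. MiniClade, CladeMapSketch, and runDemo are hypothetical stand-ins written for this example; the synchronization actually used by CladeLoader may differ.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of a TID-keyed concurrent map with per-record locking.
final class CladeMapSketch {
    /** Stand-in for a Clade: a TID plus two running totals. */
    static final class MiniClade {
        final int tid;
        long bases;
        long contigs;
        MiniClade(int tid) { this.tid = tid; }
        /** Locks only this record, so threads working on other TIDs are not blocked. */
        synchronized void add(long contigLength) { bases += contigLength; contigs++; }
    }

    final ConcurrentHashMap<Integer, MiniClade> map = new ConcurrentHashMap<>();

    void addContig(int tid, long length) {
        // Creates the record atomically if this is the first contig seen for the TID.
        MiniClade c = map.computeIfAbsent(tid, t -> new MiniClade(t));
        c.add(length);
    }

    /** Starts several threads that each add contigs of length 5 to TIDs 0..9, then waits for them. */
    static CladeMapSketch runDemo(int threads, int contigsPerThread) {
        CladeMapSketch sketch = new CladeMapSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < contigsPerThread; j++) { sketch.addContig(j % 10, 5); }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return sketch;
    }
}
```

Locking per record rather than per map keeps contention low when threads mostly touch different taxonomic IDs.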

Memory Optimization Strategies

Dummy Clade Pattern: When the dummy flag is set (useDummy=true), each thread maintains a temporary "dummy" clade for accumulating statistics before committing to the shared map. This reduces lock contention and memory allocation overhead.
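The general idea of accumulating locally and committing later can be sketched as follows. The class DummyAccumulator is hypothetical, and a single per-TID base total stands in for full clade statistics; how CladeLoader's dummy clade is actually filled and committed is not specified here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration: one instance per worker thread, reused for every TID it sees.
final class DummyAccumulator {
    private final ConcurrentHashMap<Integer, AtomicLong> shared;
    private int tid = -1;   // TID currently being accumulated, or -1 when empty
    private long bases = 0; // Local total, touched by the owning thread only

    DummyAccumulator(ConcurrentHashMap<Integer, AtomicLong> shared) { this.shared = shared; }

    /** Adds a contig locally; the shared map is touched only when the TID changes. */
    void add(int newTid, long length) {
        if (newTid != tid) { flush(); tid = newTid; }
        bases += length;
    }

    /** Commits the local total to the shared map and resets this accumulator for reuse. */
    void flush() {
        if (tid >= 0 && bases > 0) {
            shared.computeIfAbsent(tid, t -> new AtomicLong()).addAndGet(bases);
        }
        tid = -1;
        bases = 0;
    }
}
```

A worker would call add for each contig and flush once at the end of its input, so a run of contigs sharing a TID costs one shared-map update instead of one per contig.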

Dual Data Structure Strategy: The system automatically selects between count arrays for short sequences (<32KB) and list structures for longer sequences (≥32KB), optimizing memory usage based on sequence length distribution.

Streaming Processing: Input files are processed in streaming fashion using ConcurrentReadInputStream, allowing processing of datasets larger than available memory.
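Streaming here means that only one record needs to be held in memory at a time. The sketch below shows that pattern with the standard BufferedReader; it is a simplified stand-in, and FastaStream and forEachRecord are hypothetical names, not the ConcurrentReadInputStream API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.BiConsumer;

// Hypothetical illustration of record-at-a-time FASTA reading.
final class FastaStream {
    /** Hands each (header, sequence) pair to the consumer and returns the number of records read. */
    static int forEachRecord(Reader in, BiConsumer<String, String> consumer) {
        int records = 0;
        try (BufferedReader br = new BufferedReader(in)) {
            String header = null;
            StringBuilder seq = new StringBuilder();
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                if (line.startsWith(">")) {
                    // A new header closes the previous record, which can then be released.
                    if (header != null) { consumer.accept(header, seq.toString()); records++; }
                    header = line.substring(1);
                    seq.setLength(0);
                } else if (header != null) {
                    seq.append(line.trim()); // Sequence may be wrapped over several lines
                }
            }
            if (header != null) { consumer.accept(header, seq.toString()); records++; }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Passing a FileReader instead of an in-memory reader processes a file of any size with memory bounded by its longest single sequence.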

Ribosomal RNA Integration

When 16S or 18S files are provided, the system performs targeted integration of ribosomal sequences into matching clades. The addRibo() method cross-references taxonomic IDs between the primary sequences and ribosomal references, ensuring phylogenetically relevant associations.
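The matching logic, including the effect of replaceribo, can be sketched as below. MiniClade, its ssu field, and attachRibo are hypothetical names chosen for this example; this is a minimal illustration of TID-based matching under the documented replace behavior, not the addRibo() implementation.

```java
import java.util.Map;

// Hypothetical illustration of attaching ribosomal references to clades by TID.
final class RiboIntegration {
    /** Stand-in for a Clade holding at most one ribosomal sequence. */
    static final class MiniClade {
        final int tid;
        String ssu; // null when no ribosomal sequence is attached
        MiniClade(int tid) { this.tid = tid; }
    }

    /**
     * Attaches each reference to the clade with the same TID.
     * An existing sequence is kept unless replace is true.
     * Returns the number of clades that received a sequence.
     */
    static int attachRibo(Map<Integer, MiniClade> clades, Map<Integer, String> refs, boolean replace) {
        int updated = 0;
        for (Map.Entry<Integer, String> e : refs.entrySet()) {
            MiniClade c = clades.get(e.getKey());
            if (c == null) { continue; } // No clade was loaded for this TID
            if (c.ssu == null || replace) {
                c.ssu = e.getValue();
                updated++;
            }
        }
        return updated;
    }
}
```

References whose TID has no matching clade are simply skipped, so a large reference file can be supplied without filtering it first.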

Output Format and Encoding

Clade records are output in a structured format that includes kmer frequencies, entropy measurements, sequence statistics, and integrated ribosomal data. The ASCII-48 encoding option provides compact storage for frequency data while maintaining full precision.

Performance Characteristics

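Memory Usage: Default allocation of 4GB with automatic scaling based on input size. Memory usage grows linearly with the number of unique taxonomic IDs. The size of each clade's frequency profile grows with maxk, since there are 4^k possible kmers of length k (64 for k=3, 256 for k=4, 1,024 for k=5).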

Threading: Uses all available CPU cores by default. File I/O runs in separate threads so that worker threads are not left waiting on reads.

Scalability: Designed to handle large metagenomic datasets with millions of sequences. Because input is streamed and records are keyed by taxonomic ID, memory depends mainly on the number of distinct TIDs rather than on total input size.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org