CladeLoader
Loads FASTA files of TID-labeled contigs and produces Clade records containing kmer frequencies, entropy measurements, and taxonomic information for phylogenetic classification workflows.
Basic Usage
cladeloader.sh in=contigs.fa out=clades.clade
CladeLoader processes FASTA files containing sequences with taxonomic ID (TID) labels in the headers. It generates Clade records that include kmer frequency profiles, entropy measurements, and ribosomal RNA sequences for phylogenetic analysis.
Parameters
Parameters are organized by their function in the clade loading process. All parameters from the shell script usage function are documented here.
Input/Output Parameters
- in=<file,file>
- Input FASTA files with a taxonomic ID (TID) in each header. Separate multiple files with commas. Headers must contain taxonomic information parseable by TaxTree.parseHeaderStatic().
- out=<file>
- Output file for clade records. Will contain taxonomic profiles with kmer frequencies, entropy data, and sequence statistics.
Processing Parameters
- maxk=5
- Limit maximum kmer length for frequency analysis (range 3-5). Controls the depth of kmer profiling used for taxonomic characterization. Higher values provide more specificity but require more memory.
- a48
- Output counts in ASCII-48 encoding instead of decimal format. This encoding method is more compact for storage and transmission of frequency data.
Ribosomal RNA Parameters
- 16s=<file,file>
- Optional taxonomically-labeled file(s) of 16S ribosomal RNA sequences. These sequences will be integrated into matching clades based on taxonomic ID for enhanced phylogenetic profiling.
- 18s=<file,file>
- Optional taxonomically-labeled file(s) of 18S ribosomal RNA sequences. Used for eukaryotic phylogenetic classification alongside the primary sequence data.
- replaceribo
- Set to true if existing ribosomal RNA sequences should be replaced by new ones from the 16s/18s files. Default behavior preserves existing sequences unless this flag is set.
Taxonomic Tree Parameters
- usetree=f
- Load a taxonomic tree to generate lineage strings for enhanced taxonomic context. When enabled, provides full taxonomic lineages for each clade.
- tree
- Path to taxonomic tree file. Can be set to "auto" for automatic detection, "true"/"t" to enable with default path, "false"/"f" to disable, or a specific file path.
Alignment Parameters
- aligner=quantum
- Alignment algorithm for sequence analysis. Options include ssa2, glocal, drifting, banded, crosscut, and quantum. Each offers different speed/accuracy tradeoffs for taxonomic classification.
Processing Control Parameters
- mergedupes
- Merge duplicate taxonomic IDs encountered in the input. When enabled, statistics from multiple sequences with the same TID are combined into one record rather than the later sequence replacing the earlier one.
- verbose
- Print detailed processing messages during execution. Useful for monitoring progress and debugging processing issues.
- ordered
- Maintain input order in the output. When disabled, clades may be written in the order they finish processing, which can improve performance.
- dummy
- Use dummy clade for temporary storage during processing. This optimization reduces memory allocation overhead for large datasets.
- callssu
- Enable Small Subunit (SSU) ribosomal RNA calling during processing. Automatically identifies and processes ribosomal sequences found in the input.
Java Parameters
- -Xmx
- Set Java's memory usage, overriding autodetection. -Xmx20g specifies 20 gigs of RAM, -Xmx200m specifies 200 megs. Maximum is typically 85% of physical memory. Default is 4GB for this tool.
- -eoom
- Exit if an out-of-memory exception occurs. Requires Java 8u92+. Prevents hanging processes when insufficient memory is available.
- -da
- Disable assertions for improved performance in production environments.
Examples
Basic Clade Loading
cladeloader.sh in=annotated_contigs.fa out=taxonomy_profiles.clade
Process FASTA contigs with taxonomic IDs in headers to generate clade records with kmer profiles.
Enhanced Taxonomic Analysis
cladeloader.sh in=contigs.fa out=clades.clade usetree=t 16s=ribosomal_16s.fa maxk=4
Load contigs with taxonomic tree integration and 16S ribosomal sequences, using 4-mer frequency analysis.
Multiple Input Files with ASCII-48 Output
cladeloader.sh in=sample1.fa,sample2.fa,sample3.fa out=combined_clades.clade a48 verbose
Process multiple input files in a single run, writing counts in ASCII-48 encoding with verbose progress reporting.
Complete Ribosomal Integration
cladeloader.sh in=metagenome.fa out=taxonomy.clade 16s=16s_refs.fa 18s=18s_refs.fa replaceribo
Integrate both 16S and 18S ribosomal references, replacing any existing ribosomal sequences with the provided references.
Algorithm Details
Clade Record Generation
CladeLoader processes sequences in multiple threads and stores the results in a ConcurrentHashMap<Integer, Clade>. Each sequence passes through the following stages:
Taxonomic ID Resolution: Each input sequence header is parsed to extract taxonomic identifiers using TaxTree.parseHeaderStatic(). This supports standard NCBI taxonomic formatting and custom labeling schemes.
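To illustrate what taxonomic ID resolution involves, the sketch below pulls a numeric TID out of a FASTA header. It assumes the ID appears in the form tid|<number>|, which is only one possible labeling scheme. The class and method names are hypothetical, and this is not the logic of TaxTree.parseHeaderStatic(), which may accept other formats.

```java
/** Hypothetical helper for illustration; not part of CladeLoader or TaxTree. */
final class TidHeaderSketch {

    private static final String PREFIX = "tid|"; // ASSUMED header convention

    /** Returns the TID that follows "tid|", or -1 if the header has none. */
    static int parseTid(String header) {
        int start = header.indexOf(PREFIX);
        if (start < 0) { return -1; }
        start += PREFIX.length();
        int end = start;
        while (end < header.length()
                && header.charAt(end) >= '0' && header.charAt(end) <= '9') {
            end++;
        }
        if (end == start) { return -1; } // "tid|" not followed by digits
        try {
            return Integer.parseInt(header.substring(start, end));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTid(">tid|562|contig_1"));
        System.out.println(parseTid(">contig_without_label"));
    }
}
```

Returning -1 for unlabeled headers lets a caller decide whether to skip the sequence or report it.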
Kmer Frequency Analysis: Sequences are decomposed into kmers of length up to maxk (3 to 5). Entropy is computed by AdjustEntropy with fixed parameters (k=4, window=150) so that values are comparable across all sequences.
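A minimal sketch of kmer frequency profiling follows. It counts forward-strand kmers with a rolling 2-bit encoding and builds one count array per k from 3 to maxk. Whether CladeLoader counts canonical (reverse-complement-merged) kmers or lays out its arrays this way is not stated here, so treat the class, the method names, and the layout as assumptions.

```java
/** Hypothetical sketch of kmer frequency profiling; not CladeLoader's code. */
final class KmerProfileSketch {

    /** Maps A, C, G, T (either case) to 0..3 and anything else to -1. */
    static int baseToNumber(char c) {
        switch (Character.toUpperCase(c)) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: return -1;
        }
    }

    /** Counts every kmer in seq; the array has 4^k slots indexed by the 2-bit encoding. */
    static int[] countKmers(String seq, int k) {
        int[] counts = new int[1 << (2 * k)];
        int mask = counts.length - 1;
        int kmer = 0;
        int len = 0; // number of valid bases in the current run
        for (int i = 0; i < seq.length(); i++) {
            int x = baseToNumber(seq.charAt(i));
            if (x < 0) { len = 0; kmer = 0; continue; } // ambiguous base resets the run
            kmer = ((kmer << 2) | x) & mask;
            if (++len >= k) { counts[kmer]++; }
        }
        return counts;
    }

    /** Builds one count array per k from 3 up to maxk (inclusive). */
    static int[][] profile(String seq, int maxk) {
        int[][] result = new int[maxk - 2][];
        for (int k = 3; k <= maxk; k++) { result[k - 3] = countKmers(seq, k); }
        return result;
    }
}
```

Calling KmerProfileSketch.profile(sequence, 5) yields arrays of 64, 256, and 1024 counters, which is why a higher maxk costs more memory.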
Entropy Tracking: Each processing thread maintains an EntropyTracker instance that calculates sequence complexity metrics. This information is crucial for distinguishing between repetitive and unique genomic regions in taxonomic classification.
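As a rough illustration of a sequence complexity metric, the sketch below computes the Shannon entropy of the kmer distribution inside one window, using the k=4 and window=150 values mentioned above as example arguments. This is a generic formulation under stated assumptions; EntropyTracker may normalize or update its values differently, and the names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of windowed kmer entropy; not the EntropyTracker implementation. */
final class WindowEntropySketch {

    /** Shannon entropy in bits of the kmer distribution in seq[start, start + window). */
    static double entropy(String seq, int start, int window, int k) {
        int end = Math.min(seq.length(), start + window);
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = start; i + k <= end; i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
            total++;
        }
        double h = 0;
        for (int c : counts.values()) {
            double p = c / (double) total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h; // 0 when the window holds at most one distinct kmer
    }

    public static void main(String[] args) {
        System.out.println(entropy("AAAAAAAAAAAAAAAA", 0, 150, 4)); // homopolymer
        System.out.println(entropy("ACGTTGCAAGCTGATC", 0, 150, 4)); // mixed sequence
    }
}
```

A repetitive window scores near zero while a window with many distinct kmers scores higher, which is the distinction the entropy stage is meant to capture.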
Concurrent Processing: Each taxonomic ID maps to a single Clade object in the shared ConcurrentHashMap<Integer, Clade>. Clade creation and updates are synchronized so that threads handling sequences with the same TID do not lose each other's changes.
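The TID-to-clade mapping described above can be sketched as follows. CladeStub is a hypothetical stand-in with two counters; the real Clade class holds far more, and CladeLoader's actual locking strategy is not documented here, so this shows only one reasonable way to combine a ConcurrentHashMap with per-record synchronization.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a clade record; the real Clade class has many more fields. */
final class CladeStub {
    final int tid;
    long bases;
    long contigs;

    CladeStub(int tid) { this.tid = tid; }

    /** Adds one contig's statistics; synchronized so concurrent updates are not lost. */
    synchronized void add(long contigLength) {
        bases += contigLength;
        contigs++;
    }
}

/** Hypothetical wrapper around the shared map. */
final class CladeMapSketch {
    final ConcurrentHashMap<Integer, CladeStub> map = new ConcurrentHashMap<>();

    /** Creates the clade for tid on first sight (atomically), then updates it. */
    void addContig(int tid, long contigLength) {
        map.computeIfAbsent(tid, CladeStub::new).add(contigLength);
    }
}
```

computeIfAbsent guarantees that only one record is created per TID even when several threads see that TID at the same moment, and the synchronized update protects the counters afterwards.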
Memory Optimization Strategies
Dummy Clade Pattern: When the dummy option is enabled (useDummy=true), each thread keeps a temporary "dummy" clade in which it accumulates statistics before committing them to the shared map. This reduces lock contention and memory allocation overhead.
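The idea of accumulating locally and committing once can be sketched as below. Each worker thread would own one accumulator, buffer statistics for the TID it is currently handling, and touch the shared map only when the TID changes or the work ends. The class, its fields, and the flush-on-TID-change rule are assumptions for illustration, not a description of how the dummy clade is actually managed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of per-thread accumulation; names are illustrative only. */
final class LocalAccumulatorSketch {
    private final ConcurrentHashMap<Integer, AtomicLong> shared;
    private int currentTid = -1;  // TID the local buffer currently belongs to
    private long localBases = 0;  // statistics gathered without touching the shared map

    LocalAccumulatorSketch(ConcurrentHashMap<Integer, AtomicLong> shared) {
        this.shared = shared;
    }

    /** Buffers a contig locally; commits the buffer first if the TID changed. */
    void add(int tid, long contigLength) {
        if (tid != currentTid) {
            flush();
            currentTid = tid;
        }
        localBases += contigLength;
    }

    /** Commits buffered statistics to the shared map in one update and clears the buffer. */
    void flush() {
        if (currentTid >= 0 && localBases > 0) {
            shared.computeIfAbsent(currentTid, t -> new AtomicLong()).addAndGet(localBases);
        }
        localBases = 0;
    }
}
```

With this arrangement a run of contigs sharing one TID costs a single shared-map update instead of one per contig. A worker must call flush() when it finishes so the last buffer is not lost.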
Dual Data Structure Strategy: The system automatically selects between count arrays for short sequences (<32KB) and list structures for longer sequences (≥32KB), optimizing memory usage based on sequence length distribution.
Streaming Processing: Input files are processed in streaming fashion using ConcurrentReadInputStream, allowing processing of datasets larger than available memory.
Ribosomal RNA Integration
When 16S or 18S files are provided, the system performs targeted integration of ribosomal sequences into matching clades. The addRibo() method cross-references taxonomic IDs between the primary sequences and ribosomal references, ensuring phylogenetically relevant associations.
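The cross-referencing step can be pictured with the sketch below, which attaches reference sequences to clades that share their TID and honors a replace flag in the spirit of replaceribo. RiboClade, attach16S, and the map-based inputs are hypothetical; the actual addRibo() method and the Clade fields it writes are not shown in this document.

```java
import java.util.Map;

/** Hypothetical sketch of TID-matched rRNA attachment; not the addRibo() implementation. */
final class RiboAttachSketch {

    /** Minimal stand-in for a clade record that can hold one 16S sequence. */
    static final class RiboClade {
        final int tid;
        String r16S; // null until a sequence is attached
        RiboClade(int tid) { this.tid = tid; }
    }

    /**
     * Attaches each reference sequence to the clade with the same TID.
     * Existing sequences are kept unless replace is true.
     * Returns the number of sequences attached.
     */
    static int attach16S(Map<Integer, RiboClade> clades,
                         Map<Integer, String> refsByTid,
                         boolean replace) {
        int attached = 0;
        for (Map.Entry<Integer, String> e : refsByTid.entrySet()) {
            RiboClade c = clades.get(e.getKey());
            if (c == null) { continue; }                  // no clade for this TID
            if (c.r16S != null && !replace) { continue; } // preserve the existing sequence
            c.r16S = e.getValue();
            attached++;
        }
        return attached;
    }
}
```

References whose TID has no matching clade are skipped, which mirrors the statement that only matching clades receive ribosomal sequences.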
Output Format and Encoding
Clade records are output in a structured format that includes kmer frequencies, entropy measurements, sequence statistics, and integrated ribosomal data. The ASCII-48 encoding option provides compact storage for frequency data while maintaining full precision.
Performance Characteristics
Memory Usage: Default allocation of 4GB with automatic scaling based on input size. Memory usage grows linearly with the number of unique taxonomic IDs and also increases with the selected kmer length, since the number of possible kmers quadruples with each additional base.
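A back-of-envelope calculation shows how these two factors combine. The sketch assumes one 4-byte counter per possible kmer and a separate array for each k from 3 to maxk; both are assumptions made for the arithmetic, and the real per-clade footprint (which also includes entropy values, statistics, and any ribosomal sequences) will differ.

```java
/** Back-of-envelope estimate of kmer counter storage; assumptions are labeled below. */
final class CounterFootprintSketch {

    /** Number of distinct kmers over A, C, G, T: 4^k. */
    static long kmerSpace(int k) {
        return 1L << (2 * k);
    }

    /** Counters for one clade, ASSUMING every k from 3 to maxk gets its own array. */
    static long countersPerClade(int maxk) {
        long total = 0;
        for (int k = 3; k <= maxk; k++) { total += kmerSpace(k); }
        return total;
    }

    /** Estimated bytes for all clades, ASSUMING one 4-byte counter per kmer. */
    static long estimateBytes(long uniqueTids, int maxk) {
        return uniqueTids * countersPerClade(maxk) * 4L;
    }

    public static void main(String[] args) {
        System.out.println(countersPerClade(5));        // 64 + 256 + 1024 = 1344
        System.out.println(estimateBytes(100_000, 5));  // 537600000 bytes under these assumptions
    }
}
```

Under these assumptions, 100,000 distinct TIDs at maxk=5 need roughly 0.5 GB for counters alone, while maxk=4 cuts that to about a quarter.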
Threading: Automatically uses all available CPU cores. File I/O is performed in separate threads so that reading and writing do not stall sequence processing.
Scalability: Designed to handle large metagenomic datasets with millions of sequences. Because input is streamed and results are keyed by taxonomic ID, memory depends mainly on the number of distinct TIDs rather than on total input size.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org