AnalyzeAccession
Analyzes accession patterns to determine optimal compression strategies for accession-to-TaxID mapping files.
Basic Usage
analyzeaccession.sh *accession2taxid.gz out=<output file>
Analyzes one or more accession-to-TaxID files to identify patterns that can be used for efficient compression and storage optimization.
Parameters
Parameters control how the analysis is performed and how multiple files are processed.
Processing Parameters
- perfile=t
- Process multiple files concurrently, each with multiple threads. When true (the default), several files are processed at the same time with up to 16 threads per file. When false, files are processed one at a time with up to 8 threads per file.
- in=file
- Input file(s) containing accession-to-TaxID mappings. Multiple files can be specified with comma separation or by listing multiple files as arguments. Supports compressed files (.gz).
- out=file
- Output file for pattern analysis results. Contains pattern frequency, combination counts, and compression potential (in bits) for each identified pattern.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. For example, -Xmx20g specifies 20 GB of RAM and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Analyze Single File
analyzeaccession.sh nucl_gb.accession2taxid.gz out=patterns.txt
Analyzes patterns in the GenBank nucleotide accession file and outputs compression analysis to patterns.txt.
Analyze Multiple Files
analyzeaccession.sh *.accession2taxid.gz out=all_patterns.txt
Analyzes patterns across all accession2taxid files in the current directory.
Sequential Processing
analyzeaccession.sh perfile=f prot.accession2taxid.gz out=prot_patterns.txt
Disables concurrent file processing, so the protein accession file is processed on its own with up to 8 threads.
Algorithm Details
Pattern Recognition System
AnalyzeAccession reduces each accession to a pattern string by passing every character through a 128-byte lookup table. The table is built by the makeRemap() method and maps each ASCII character to one of four pattern categories:
- "D": Digits (0-9)
- "L": Letters (A-Z, case-insensitive)
- "-": Separators (underscore, hyphen)
- "?": Other characters
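The mapping can be illustrated with a minimal Python sketch. It is not the tool's Java source: the names make_remap and to_pattern are hypothetical, and treating non-ASCII characters as "?" is an assumption. It only shows how a 128-entry table turns an accession into a pattern string.

```python
def make_remap() -> bytes:
    """Build a 128-entry table mapping ASCII codes to pattern symbols."""
    table = bytearray(b"?" * 128)          # default: other characters
    for c in range(ord("0"), ord("9") + 1):
        table[c] = ord("D")                # digits
    for c in range(ord("A"), ord("Z") + 1):
        table[c] = ord("L")                # uppercase letters
        table[c + 32] = ord("L")           # matching lowercase letters
    for c in b"_-":
        table[c] = ord("-")                # separators
    return bytes(table)


REMAP = make_remap()


def to_pattern(accession: str) -> str:
    """Translate an accession into its pattern string, one symbol per character."""
    # Assumption: characters outside 7-bit ASCII fall into the "?" category.
    return "".join(
        chr(REMAP[ord(ch)]) if ord(ch) < 128 else "?" for ch in accession
    )


print(to_pattern("AB123456"))   # LLDDDDDD
print(to_pattern("NC_000001"))  # LL-DDDDDD
```

A table lookup keeps the per-character cost constant and avoids branching on character class for every byte of a large mapping file.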
Compression Analysis
The tool calculates compression potential by analyzing pattern combinations:
- Each digit position contributes 10 possible values
- Each letter position contributes 26 possible values
- Total combinations = 10^(digit_positions) × 26^(letter_positions)
- Bits required = log₂(combinations)
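As a worked example of these formulas, the following Python sketch computes the combination count and bit requirement for a pattern string. The function name combos_and_bits is hypothetical, and whether the tool itself reports fractional or rounded-up bits is not stated here, so the sketch returns the exact logarithm and leaves rounding to the caller.

```python
import math


def combos_and_bits(pattern: str) -> tuple[int, float]:
    """Return (combinations, bits) for a pattern made of D, L, - and ? symbols."""
    digits = pattern.count("D")    # each digit position: 10 values
    letters = pattern.count("L")   # each letter position: 26 values
    combos = 10 ** digits * 26 ** letters
    return combos, math.log2(combos)


combos, bits = combos_and_bits("LLDDDDDD")
print(combos)            # 676000000
print(math.ceil(bits))   # 30 whole bits are enough for this pattern
```

For "LLDDDDDD" this gives 26^2 × 10^6 = 676,000,000 combinations, or roughly 29.3 bits, so such accessions fit comfortably in a 32-bit field.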
Multithreaded Processing
The algorithm uses two processing strategies:
- Per-file mode (default): Up to 16 threads per file, processes multiple files simultaneously
- Sequential mode: Up to 8 threads per file, processes files one at a time
Pattern Digitization
For efficient storage, patterns can be digitized into 64-bit integers:
- Pattern code stored in lower bits
- Numeric value stored in upper bits
- Enables fast lookup and comparison operations
- Supports up to 2^(63-codeBits) combinations per pattern
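The layout described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual encoding: CODE_BITS = 8 is a hypothetical width for the pattern code, and the mixed-radix scheme in encode_value (base 10 for digits, base 26 for letters, separators ignored) is one plausible way to turn an accession into the numeric value.

```python
CODE_BITS = 8  # hypothetical width of the pattern code field


def encode_value(accession: str, pattern: str) -> int:
    """Mixed-radix encoding of an accession's digits and letters (assumed scheme)."""
    if len(accession) != len(pattern):
        raise ValueError("accession and pattern lengths differ")
    value = 0
    for ch, sym in zip(accession.upper(), pattern):
        if sym == "D":
            value = value * 10 + (ord(ch) - ord("0"))
        elif sym == "L":
            value = value * 26 + (ord(ch) - ord("A"))
        # Separators carry no information and are skipped.
    return value


def pack(value: int, code: int, code_bits: int = CODE_BITS) -> int:
    """Store the pattern code in the low bits and the value in the upper bits."""
    if not 0 <= code < (1 << code_bits):
        raise ValueError("pattern code does not fit in code_bits")
    if not 0 <= value < (1 << (63 - code_bits)):
        raise ValueError("value exceeds 2^(63-codeBits)")
    return (value << code_bits) | code


def unpack(packed: int, code_bits: int = CODE_BITS) -> tuple[int, int]:
    """Recover (value, code) from a packed integer."""
    return packed >> code_bits, packed & ((1 << code_bits) - 1)


value = encode_value("AB123456", "LLDDDDDD")
packed = pack(value, code=3)
print(unpack(packed))  # (1123456, 3)
```

Keeping the packed result below 2^63 means it stays non-negative in a signed 64-bit integer, which is consistent with the 2^(63-codeBits) limit given in the list.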
Output Format
The analysis output contains tab-separated columns:
- Pattern: The identified pattern (e.g., "LLDDDDDD")
- Count: Number of accessions matching this pattern
- Combos: Total possible combinations for this pattern
- Bits: Number of bits required to represent all combinations
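For downstream use, a result file with these four columns could be read with a short Python sketch like the one below. Several details are assumptions rather than documented behavior: that header or comment lines start with "#", that Count and Combos are written as plain integers, and that Bits parses as a number. The names PatternRow and parse_rows are hypothetical, and the sample lines are made up purely for illustration.

```python
from typing import Iterable, NamedTuple


class PatternRow(NamedTuple):
    pattern: str
    count: int
    combos: int
    bits: float


def parse_rows(lines: Iterable[str]) -> list[PatternRow]:
    """Parse tab-separated Pattern/Count/Combos/Bits lines into rows."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and lines starting with '#' are not data.
        if not line or line.startswith("#"):
            continue
        pattern, count, combos, bits = line.split("\t")[:4]
        rows.append(PatternRow(pattern, int(count), int(combos), float(bits)))
    return rows


# Made-up sample lines, not real tool output.
sample = [
    "#Pattern\tCount\tCombos\tBits\n",
    "LLDDDDDD\t1200\t676000000\t29.33\n",
]
for row in parse_rows(sample):
    print(row.pattern, row.count, row.bits)
```

With a real file, the same function can be applied to an open file handle, for example parse_rows(open("patterns.txt")).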
Memory Efficiency
The tool is designed for efficient memory usage when processing large accession databases:
- Stream processing avoids loading entire files into memory
- Thread-local hash maps reduce synchronization overhead
- Pattern strings are deduplicated across threads
- Default memory allocation is 400 MB, which can be raised with -Xmx if a dataset needs more
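The thread-local counting idea can be sketched in Python as below. This shows only the structure (each worker fills its own map, and the maps are merged once at the end); it is not the tool's Java implementation and makes no claim about its performance. The names symbol, count_batch and count_patterns, and the batch_size parameter, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def symbol(ch: str) -> str:
    """Classify one character as D, L, - or ?."""
    if ch.isascii() and ch.isdigit():
        return "D"
    if ch.isascii() and ch.isalpha():
        return "L"
    if ch in "_-":
        return "-"
    return "?"


def count_batch(batch: list[str]) -> Counter:
    """Count patterns in one batch using a map private to this task."""
    local = Counter()
    for accession in batch:
        local["".join(symbol(c) for c in accession)] += 1
    return local


def count_patterns(accessions: list[str], threads: int = 8,
                   batch_size: int = 10000) -> Counter:
    """Count patterns in parallel batches, then merge the per-batch maps."""
    batches = [accessions[i:i + batch_size]
               for i in range(0, len(accessions), batch_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # No shared map is touched while counting; merging happens here only.
        for local in pool.map(count_batch, batches):
            total.update(local)
    return total


print(count_patterns(["AB123456", "CD000001", "NC_000001"], threads=2, batch_size=2))
```

Because each worker writes only to its own map, no locking is needed during counting, and the number of merge operations is bounded by the number of batches rather than the number of accessions.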
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org