AnalyzeAccession

Basic Usage

analyzeaccession.sh *accession2taxid.gz out=<output file>

Analyzes one or more accession-to-TaxID files to identify patterns that can be used for efficient compression and storage optimization.

Parameters

Parameters control how the analysis is performed and how multiple files are processed.

Processing Parameters

perfile=t: Controls how work is divided across files. When true (the default), multiple files are processed simultaneously, with up to 16 threads per file. When false, files are processed one at a time, with up to 8 threads per file.
in=file: Input file(s) containing accession-to-TaxID mappings. Multiple files can be given as a comma-separated list or as separate arguments. Gzip-compressed files (.gz) are supported.
out=file: Output file for pattern analysis results. Contains pattern frequency, combination counts, and compression potential (in bits) for each identified pattern.

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Analyze Single File

analyzeaccession.sh nucl_gb.accession2taxid.gz out=patterns.txt

Analyzes patterns in the GenBank nucleotide accession file and outputs compression analysis to patterns.txt.

Analyze Multiple Files

analyzeaccession.sh *.accession2taxid.gz out=all_patterns.txt

Analyzes patterns across all accession2taxid files in the current directory.

Sequential Processing

analyzeaccession.sh perfile=f prot.accession2taxid.gz out=prot_patterns.txt

Disables per-file parallelism (perfile=f), so files are processed one at a time with up to 8 threads per file. Here the only input is the protein accession file.

Algorithm Details

Pattern Recognition System

AnalyzeAccession remaps each character of an accession to a pattern category using a 128-byte lookup table, one entry per ASCII code, together with the makeRemap() method. The categories are:

D - Digits (0-9)
L - Letters (A-Z, case-insensitive)
- - Separators (underscore, hyphen)
? - Other characters
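
The remapping above can be sketched in a few lines. This is an illustrative Python sketch, not the tool's Java source: the names make_remap, REMAP, and to_pattern are hypothetical, and the sketch assumes that characters outside 7-bit ASCII are treated as "other".

```python
def make_remap():
    """Build a 128-entry table mapping ASCII codes to pattern symbols."""
    table = ["?"] * 128                      # default: any other character
    for code in range(ord("0"), ord("9") + 1):
        table[code] = "D"                    # digits
    for code in range(ord("A"), ord("Z") + 1):
        table[code] = "L"                    # uppercase letters
        table[code + 32] = "L"               # lowercase letters map the same way
    table[ord("_")] = "-"                    # separators
    table[ord("-")] = "-"
    return table


REMAP = make_remap()


def to_pattern(accession):
    """Replace every character of an accession with its pattern symbol."""
    # Assumption: anything beyond the 128-entry table counts as '?'.
    return "".join(REMAP[ord(ch)] if ord(ch) < 128 else "?" for ch in accession)


print(to_pattern("AB123456"))    # LLDDDDDD
print(to_pattern("NC_000001"))   # LL-DDDDDD
```

A table lookup keeps the per-character cost to one array access, which matters when the input holds hundreds of millions of accessions.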

Compression Analysis

The tool calculates compression potential by analyzing pattern combinations:

Each digit position contributes 10 possible values
Each letter position contributes 26 possible values
Total combinations = 10^(digit_positions) × 26^(letter_positions)
Bits required = log₂(combinations)
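
As a worked example, the pattern LLDDDDDD has 2 letter positions and 6 digit positions, so it covers 26^2 × 10^6 = 676,000,000 combinations, or about 29.3 bits. The same arithmetic as a minimal Python sketch (the function name combos_and_bits is hypothetical, and the sketch follows the formula above in ignoring separators and other characters):

```python
import math


def combos_and_bits(pattern):
    """Return (combinations, bits) for a pattern such as 'LLDDDDDD'."""
    digits = pattern.count("D")
    letters = pattern.count("L")
    combos = 10 ** digits * 26 ** letters    # exact integer arithmetic
    return combos, math.log2(combos)


combos, bits = combos_and_bits("LLDDDDDD")
print(combos)            # 676000000
print(round(bits, 2))    # 29.33
```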

Multithreaded Processing

The tool supports two processing strategies, selected with the perfile parameter:

Per-file mode (default): Up to 16 threads per file, processes multiple files simultaneously
Sequential mode: Up to 8 threads per file, processes files one at a time

Pattern Digitization

For efficient storage, patterns can be digitized into 64-bit integers:

Pattern code stored in lower bits
Numeric value stored in upper bits
Enables fast lookup and comparison operations
Supports up to 2^(63-codeBits) combinations per pattern
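
One way to picture this layout is the sketch below. It is an assumption-laden illustration rather than the tool's actual encoding: the function names, the mixed-radix scheme (base 10 for digits, base 26 for letters), and the choice of code_bits are all hypothetical. Only the layout (code in the low bits, value in the high bits, 63 usable bits) comes from the description above.

```python
def encode_value(accession, pattern):
    """Fold the variable characters into one mixed-radix integer."""
    value = 0
    for ch, symbol in zip(accession.upper(), pattern):
        if symbol == "D":
            value = value * 10 + (ord(ch) - ord("0"))
        elif symbol == "L":
            value = value * 26 + (ord(ch) - ord("A"))
        elif symbol == "?":
            # This sketch cannot represent arbitrary characters.
            raise ValueError("pattern contains '?' and cannot be digitized here")
        # '-' positions are fixed by the pattern, so they add nothing.
    return value


def digitize(accession, pattern, code, code_bits):
    """Pack (value, code) into a single non-negative 63-bit integer."""
    if code >= 1 << code_bits:
        raise ValueError("pattern code does not fit in code_bits")
    value = encode_value(accession, pattern)
    if value >= 1 << (63 - code_bits):
        raise ValueError("value does not fit beside the pattern code")
    return (value << code_bits) | code


def undigitize(packed, code_bits):
    """Split a packed integer back into (code, value)."""
    return packed & ((1 << code_bits) - 1), packed >> code_bits


packed = digitize("AB123456", "LLDDDDDD", code=5, code_bits=8)
print(undigitize(packed, 8))   # (5, 1123456)
```

Because the whole key is a single integer, two accessions can be compared or hashed without touching their strings.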

Output Format

The analysis output contains tab-separated columns:

Pattern: The identified pattern (e.g., "LLDDDDDD")
Count: Number of accessions matching this pattern
Combos: Total possible combinations for this pattern
Bits: Number of bits required to represent all combinations
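
A file in this layout can be read back with a few lines of Python. The sketch below assumes the four columns appear in the order listed and that any header or comment line can be recognized by a non-numeric Count field. The function name read_patterns and the sample row are made up for illustration and are not real tool output.

```python
import csv
import io


def read_patterns(handle):
    """Parse tab-separated Pattern/Count/Combos/Bits rows into dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, headers, and anything without a numeric count.
        if len(fields) < 4 or not fields[1].isdigit():
            continue
        rows.append({
            "pattern": fields[0],
            "count": int(fields[1]),
            "combos": float(fields[2]),   # float tolerates very large values
            "bits": float(fields[3]),
        })
    return rows


# Hypothetical sample, not real output from the tool.
sample = "#Pattern\tCount\tCombos\tBits\nLLDDDDDD\t12\t676000000\t29.33\n"
rows = read_patterns(io.StringIO(sample))
print(rows[0]["pattern"], rows[0]["count"])   # LLDDDDDD 12
```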

Memory Efficiency

The tool is designed for efficient memory usage when processing large accession databases:

Stream processing avoids loading entire files into memory
Thread-local hash maps reduce synchronization overhead
Pattern strings are deduplicated across threads
Default memory allocation is 400 MB; use -Xmx to raise it if a dataset needs more
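
The streaming and thread-local counting ideas can be illustrated as follows. This is a structural sketch in Python, not the tool's Java implementation: all names are hypothetical, the accession is assumed to be the first tab-separated field of each line, and Python threads are used only to show the shape of the design, since they will not speed up this CPU-bound loop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def to_pattern(accession):
    """Map each character to D, L, '-', or '?'."""
    out = []
    for ch in accession:
        if "0" <= ch <= "9":
            out.append("D")
        elif "A" <= ch <= "Z" or "a" <= ch <= "z":
            out.append("L")
        elif ch in "_-":
            out.append("-")
        else:
            out.append("?")
    return "".join(out)


def count_chunk(lines):
    """Count patterns in one chunk using a counter private to this worker."""
    local = Counter()                      # no locking needed: not shared
    for line in lines:
        accession = line.split("\t", 1)[0].strip()
        if accession:
            local[to_pattern(accession)] += 1
    return local


def chunks(iterable, size):
    """Yield lists of up to `size` items without reading everything first."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_patterns(lines, workers=4, chunk_size=10000):
    """Stream lines, count per worker, then merge the small result maps."""
    total = Counter()
    chunk_iter = chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            batch = list(islice(chunk_iter, workers))  # bounded memory use
            if not batch:
                break
            for local in pool.map(count_chunk, batch):
                total.update(local)        # merge happens on one thread only
    return total
```

To run this over a real file, the line iterator could come from gzip.open(path, "rt"). Whether a header line should be skipped first depends on the input and is left to the caller.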

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org