AnalyzeAccession

Script: analyzeaccession.sh
Package: tax
Class: AnalyzeAccession.java

Analyzes accession patterns to determine optimal compression strategies for accession-to-TaxID mapping files.

Basic Usage

analyzeaccession.sh *accession2taxid.gz out=<output file>

Analyzes one or more accession-to-TaxID files to identify patterns that can be used for efficient compression and storage optimization.

Parameters

Parameters control how the analysis is performed and how multiple files are processed.

Processing Parameters

perfile=t
Controls how multiple files are scheduled. When true, multiple files are processed simultaneously, with multiple threads per file (up to 16 threads). When false, files are processed sequentially, with up to 8 threads per file.
in=file
Input file(s) containing accession-to-TaxID mappings. Multiple files can be specified with comma separation or by listing multiple files as arguments. Supports compressed files (.gz).
out=file
Output file for pattern analysis results. Contains pattern frequency, combination counts, and compression potential (in bits) for each identified pattern.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Analyze Single File

analyzeaccession.sh nucl_gb.accession2taxid.gz out=patterns.txt

Analyzes patterns in the GenBank nucleotide accession file and outputs compression analysis to patterns.txt.

Analyze Multiple Files

analyzeaccession.sh *.accession2taxid.gz out=all_patterns.txt

Analyzes patterns across all accession2taxid files in the current directory.

Sequential Processing

analyzeaccession.sh perfile=f prot.accession2taxid.gz out=prot_patterns.txt

Processes the protein accession file with perfile=f, so files are handled one at a time, each with multiple threads, rather than several files in parallel.

Algorithm Details

Pattern Recognition System

AnalyzeAccession remaps the characters of biological accession identifiers through a 128-byte lookup table. The table is built by the makeRemap() method and assigns each character of an accession string to a pattern category.
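A minimal sketch of such a lookup table is shown below. The category symbols (L for letters, D for digits, - for hyphen and underscore, ? for anything else) and the class name RemapSketch are assumptions made for illustration; they are not confirmed details of makeRemap().

```java
import java.util.Arrays;

// Illustrative sketch only: the category symbols are assumptions,
// not confirmed details of AnalyzeAccession.makeRemap().
class RemapSketch {

    // 128-entry table indexed by ASCII code; each entry is a category symbol.
    static final byte[] REMAP = makeRemap();

    static byte[] makeRemap() {
        byte[] table = new byte[128];
        Arrays.fill(table, (byte) '?');                       // default: unclassified symbol
        for (int c = 'A'; c <= 'Z'; c++) { table[c] = 'L'; }  // upper-case letters
        for (int c = 'a'; c <= 'z'; c++) { table[c] = 'L'; }  // lower-case letters
        for (int c = '0'; c <= '9'; c++) { table[c] = 'D'; }  // digits
        table['_'] = '-';                                     // separators share one symbol
        table['-'] = '-';
        return table;
    }

    // Replaces every character of an accession with its category symbol.
    static String remap(String accession) {
        StringBuilder sb = new StringBuilder(accession.length());
        for (int i = 0; i < accession.length(); i++) {
            char c = accession.charAt(i);
            sb.append(c < 128 ? (char) REMAP[c] : '?');       // non-ASCII falls back to '?'
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(remap("NC_000913")); // prints LL-DDDDDD
    }
}
```

Because the table is indexed directly by character code, classifying a character is a single array read with no branching, which matters when an input file holds hundreds of millions of accessions.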

Compression Analysis

The tool calculates the compression potential of each pattern from the number of combinations the pattern can represent, and reports it in bits.
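The arithmetic can be sketched as follows. This sketch assumes a pattern alphabet in which D marks a digit position (10 possible values) and L marks a letter position (26 possible values), with every other symbol treated as fixed. The class and method names are hypothetical.

```java
// Illustrative sketch only: assumes 'D' = digit slot, 'L' = letter slot.
class ComboSketch {

    // Number of distinct accessions a pattern could represent.
    // A double is used so that long patterns do not overflow.
    static double combos(String pattern) {
        double combos = 1;
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p == 'D') {
                combos *= 10;   // one of ten digits
            } else if (p == 'L') {
                combos *= 26;   // one of 26 letters, case ignored
            }
            // any other symbol is fixed by the pattern and adds no information
        }
        return combos;
    }

    // Bits needed to number every combination: log2(combos).
    static double bits(String pattern) {
        return Math.log(combos(pattern)) / Math.log(2);
    }
}
```

Under these assumptions, a pattern of two letters, a separator, and six digits has 26 x 26 x 10^6 = 676,000,000 combinations, or about 29.3 bits, so any accession matching it would fit in a 32-bit integer.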

Multithreaded Processing

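The algorithm uses two processing strategies, selected with the perfile parameter: processing multiple files simultaneously with multiple threads per file (perfile=t), or processing files sequentially with multiple threads per file (perfile=f).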
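As a rough illustration of processing several inputs at once, the sketch below gives each input its own task and its own private count map, then merges the maps at the end. It is not the tool's actual threading code: the class name, the in-memory lists standing in for files, and the simplified toPattern() helper are all assumptions, and the use of multiple threads within a single file is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: one task per input, merged after completion.
class ParallelCountSketch {

    // Hypothetical stand-in for pattern extraction: letters -> L, digits -> D.
    static String toPattern(String accession) {
        StringBuilder sb = new StringBuilder();
        for (char c : accession.toCharArray()) {
            sb.append(Character.isDigit(c) ? 'D' : Character.isLetter(c) ? 'L' : '?');
        }
        return sb.toString();
    }

    // Counts patterns for one input using a map private to the calling task,
    // so no locking is needed while counting.
    static Map<String, Long> countOne(List<String> accessions) {
        Map<String, Long> counts = new HashMap<>();
        for (String a : accessions) {
            counts.merge(toPattern(a), 1L, Long::sum);
        }
        return counts;
    }

    // Runs one task per input on a fixed-size pool and merges the results.
    static Map<String, Long> countAll(List<List<String>> inputs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Map<String, Long>>> futures = new ArrayList<>();
            for (List<String> input : inputs) {
                Callable<Map<String, Long>> task = () -> countOne(input);
                futures.add(pool.submit(task));
            }
            Map<String, Long> total = new HashMap<>();
            for (Future<Map<String, Long>> f : futures) {
                f.get().forEach((k, v) -> total.merge(k, v, Long::sum));
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Keeping a separate map per task and merging once at the end avoids contention on a shared map, at the cost of holding one small map per task in memory.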

Pattern Digitization

For efficient storage, patterns can be digitized into 64-bit integers.
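One way to digitize an accession is mixed-radix encoding, sketched below: each digit contributes a base-10 place and each letter a base-26 place, while separators are skipped because the pattern already implies them. This is an assumed scheme for illustration, with hypothetical names; the tool's actual encoding may differ.

```java
// Illustrative sketch only: mixed-radix packing of an accession into a long.
class DigitizeSketch {

    // Packs the variable characters of an accession into one 64-bit value.
    // The caller is assumed to have checked that the pattern needs fewer
    // than 63 bits, so the arithmetic cannot overflow.
    static long digitize(String accession) {
        long number = 0;
        for (int i = 0; i < accession.length(); i++) {
            char c = Character.toUpperCase(accession.charAt(i));
            if (c >= '0' && c <= '9') {
                number = number * 10 + (c - '0');   // digit: base-10 place
            } else if (c >= 'A' && c <= 'Z') {
                number = number * 26 + (c - 'A');   // letter: base-26 place
            }
            // separators and other symbols carry no information and are skipped
        }
        return number;
    }

    public static void main(String[] args) {
        System.out.println(digitize("NC_000913")); // prints 340000913
    }
}
```

A number produced this way is only meaningful together with its pattern, so a complete scheme would also need to record which pattern each number belongs to, for example by folding a small pattern index into the same integer.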

Output Format

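The analysis output contains tab-separated columns giving, for each identified pattern, its frequency, its combination count, and its compression potential in bits.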
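For downstream processing, a row of this output could be parsed along the following lines. The column order (pattern, count, combinations, bits), the treatment of lines starting with # as headers, and the sample row in main() are assumptions made for illustration; check them against a real output file before relying on them.

```java
// Illustrative sketch only: assumes columns pattern, count, combos, bits.
class PatternRow {
    final String pattern;
    final long count;
    final double combos;
    final double bits;

    PatternRow(String pattern, long count, double combos, double bits) {
        this.pattern = pattern;
        this.count = count;
        this.combos = combos;
        this.bits = bits;
    }

    // Parses one tab-separated line; returns null for blank lines and for
    // lines starting with '#', which are assumed to be headers.
    static PatternRow parse(String line) {
        if (line == null || line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] f = line.split("\t");
        if (f.length < 4) {
            throw new IllegalArgumentException("Expected 4 columns: " + line);
        }
        return new PatternRow(f[0], Long.parseLong(f[1]),
                Double.parseDouble(f[2]), Double.parseDouble(f[3]));
    }

    public static void main(String[] args) {
        // Hypothetical row, not taken from real output.
        PatternRow row = parse("LL-DDDDDD\t42\t676000000\t29.33");
        System.out.println(row.pattern + " seen " + row.count + " times");
    }
}
```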

Memory Efficiency

The tool is designed for efficient memory usage when processing large accession databases.

Support

For questions and support: