IDMatrix

Basic Usage

idmatrix.sh in=<file> out=<file>

Creates an identity matrix by performing all-to-all alignment of sequences in the input file. Each cell in the matrix represents the sequence identity between two input sequences.

Parameters

Parameters control input/output files, alignment behavior, and output formatting. The tool uses BandedAligner.alignQuadrupleProgressive() with configurable bandwidth and edit distance constraints.

Parameters

in=<file>: File containing reads. in=stdin.fa will pipe from stdin. Input sequences should not be paired-end reads as the tool is designed for single sequences only.
out=<file>: Matrix output. out=stdout will pipe to stdout. The output is a tab-delimited matrix with sequence IDs as row/column headers and identity values as cells.
threads=auto: (t) Set number of threads to use; default is number of logical processors. Multi-threading is used for processing alignments in parallel.
percent=f: Output identity as percent rather than a fraction. When true, values are multiplied by 100 and displayed as percentages (0-100). When false, values are displayed as fractions (0.0-1.0) with 4 decimal places.
edits=: Allow at most this much edit distance. Default is the length of the longest input sequence. Lower values are faster but may miss alignments with high divergence. This parameter controls the maximum number of edit operations (insertions, deletions, substitutions) allowed in the alignment.
width=: Alignment bandwidth, lower is faster. Default: 2*edits+1. This controls the width of the banded alignment algorithm. Narrower bands are faster but may miss optimal alignments if sequences have large indels.
usejni=f: (jni) Do alignments in C code using JNI (Java Native Interface). Requires compiling the C code; details are in /jni/README.txt. When enabled, uses native C implementations for BandedAligner operations instead of Java implementations.

Java Parameters

-Xmx: This will set Java's memory usage, overriding automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales quadratically with the number of input sequences as the identity matrix must be stored in memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended when processing large numbers of sequences to avoid system instability.
-da: Disable assertions. May provide minor performance improvements in production use.

Examples

Basic Identity Matrix

idmatrix.sh in=sequences.fasta out=identity_matrix.txt

Generate an identity matrix from sequences in FASTA format, outputting fractional identities (0.0-1.0).

Percentage Output

idmatrix.sh in=sequences.fasta out=identity_matrix.txt percent=t

Generate an identity matrix with percentage values (0-100%) instead of fractions.

Limited Edit Distance

idmatrix.sh in=sequences.fasta out=identity_matrix.txt edits=10 width=21

Limit alignments to maximum 10 edit operations with a bandwidth of 21 for faster processing of highly similar sequences.

JNI-Accelerated Processing

idmatrix.sh in=sequences.fasta out=identity_matrix.txt usejni=t threads=16

Use native C code for BandedAligner operations with 16 threads (requires JNI compilation).

Algorithm Details

IDMATRIX implements all-to-all sequence alignment using BandedAligner.alignQuadrupleProgressive() with specific threading and memory management strategies:

Alignment Implementation

The tool uses BandedAligner.alignQuadrupleProgressive() which implements a two-stage progressive banded alignment strategy. The algorithm begins with a narrow band (width 10) and progressively widens to the specified maximum width if no alignment is found. This approach balances computational speed for closely related sequences with alignment completeness for divergent sequences.

Data Structure Strategy

Memory usage scales quadratically with sequence count (O(n²)) as the full identity matrix must be maintained in memory. Each sequence object stores a float array of size n, where n is the total number of sequences. The implementation uses the Read.obj field to store per-sequence float arrays containing pairwise identity values.

1,000 sequences: ~4 MB for matrix storage
10,000 sequences: ~400 MB for matrix storage
100,000 sequences: ~40 GB for matrix storage

Threading Architecture

The tool implements a producer-consumer threading model using ConcurrentCollectionReadInputStream for sequence distribution. Each ProcessThread maintains its own BandedAligner instance (created via BandedAligner.makeBandedAligner()) to eliminate synchronization overhead. The matrix is populated in a thread-safe manner by exploiting the symmetric property: identity[i][j] = identity[j][i].

Identity Calculation

Sequence identity is calculated as:

identity = 1 - (edit_distance / max(length1, length2))

This normalization approach uses the longer of the two sequences as the denominator, which provides more conservative identity estimates for sequences of different lengths.

Performance Characteristics

Time complexity is O(n² × s × b) where n is sequence count, s is average sequence length, and b is bandwidth. The banded algorithm reduces the s factor significantly compared to full dynamic programming. Performance metrics are reported including alignment rate and similarity statistics (min, max, average).

Output Format

The output matrix is tab-delimited with sequence IDs as the first column and row headers. The matrix is symmetric with self-identities on the diagonal (always 1.0 or 100%). Values are formatted to 2 decimal places for percentages or 4 decimal places for fractions.

Important Notes

⚠️ Known Issues

This tool contains a warning that it may produce incorrect results in some circumstances and is not advisable for use until fixed. Users should verify results carefully and consider alternative tools for critical applications.

Memory Considerations

Memory usage scales quadratically with sequence count. For large datasets (>10,000 sequences), monitor memory usage carefully and ensure adequate RAM is available. Use the -Xmx parameter to set appropriate memory limits.

Input Requirements

The tool expects single sequences and will fail with paired-end reads. Input files should be in FASTA or FASTQ format (gzipped files are supported). Sequence IDs should be unique to avoid confusion in the output matrix.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org