IDMatrix
Generates an identity matrix via all-to-all alignment. Warning: This program may produce incorrect results in some circumstances and is not advisable to use until fixed.
Basic Usage
idmatrix.sh in=<file> out=<file>
Creates an identity matrix by performing all-to-all alignment of sequences in the input file. Each cell in the matrix represents the sequence identity between two input sequences.
Parameters
Parameters control input/output files, alignment behavior, and output formatting. The tool uses BandedAligner.alignQuadrupleProgressive() with configurable bandwidth and edit distance constraints.
Parameters
- in=<file>
- File containing reads. in=stdin.fa will pipe from stdin. Input sequences should not be paired-end reads as the tool is designed for single sequences only.
- out=<file>
- Matrix output. out=stdout will pipe to stdout. The output is a tab-delimited matrix with sequence IDs as row/column headers and identity values as cells.
- threads=auto
- (t) Set number of threads to use; default is number of logical processors. Multi-threading is used for processing alignments in parallel.
- percent=f
- Output identity as percent rather than a fraction. When true, values are multiplied by 100 and displayed as percentages (0-100). When false, values are displayed as fractions (0.0-1.0) with 4 decimal places.
- edits=
- Allow at most this much edit distance. Default is the length of the longest input sequence. Lower values are faster but may miss alignments with high divergence. This parameter controls the maximum number of edit operations (insertions, deletions, substitutions) allowed in the alignment.
- width=
- Alignment bandwidth, lower is faster. Default: 2*edits+1. This controls the width of the banded alignment algorithm. Narrower bands are faster but may miss optimal alignments if sequences have large indels.
- usejni=f
- (jni) Do alignments in C code using JNI (Java Native Interface). Requires compiling the C code; details are in /jni/README.txt. When enabled, uses native C implementations for BandedAligner operations instead of Java implementations.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales quadratically with the number of input sequences as the identity matrix must be stored in memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Recommended when processing large numbers of sequences to avoid system instability.
- -da
- Disable assertions. May provide minor performance improvements in production use.
Examples
Basic Identity Matrix
idmatrix.sh in=sequences.fasta out=identity_matrix.txt
Generate an identity matrix from sequences in FASTA format, outputting fractional identities (0.0-1.0).
Percentage Output
idmatrix.sh in=sequences.fasta out=identity_matrix.txt percent=t
Generate an identity matrix with percentage values (0-100%) instead of fractions.
Limited Edit Distance
idmatrix.sh in=sequences.fasta out=identity_matrix.txt edits=10 width=21
Limit alignments to maximum 10 edit operations with a bandwidth of 21 for faster processing of highly similar sequences.
JNI-Accelerated Processing
idmatrix.sh in=sequences.fasta out=identity_matrix.txt usejni=t threads=16
Use native C code for BandedAligner operations with 16 threads (requires JNI compilation).
Algorithm Details
IDMATRIX implements all-to-all sequence alignment using BandedAligner.alignQuadrupleProgressive() with specific threading and memory management strategies:
Alignment Implementation
The tool uses BandedAligner.alignQuadrupleProgressive()
which implements a two-stage progressive banded alignment strategy. The algorithm begins with a narrow band (width 10) and progressively widens to the specified maximum width if no alignment is found. This approach balances computational speed for closely related sequences with alignment completeness for divergent sequences.
Data Structure Strategy
Memory usage scales quadratically with sequence count (O(n²)) as the full identity matrix must be maintained in memory. Each sequence object stores a float array of size n, where n is the total number of sequences. The implementation uses the Read.obj field to store per-sequence float arrays containing pairwise identity values.
- 1,000 sequences: ~4 MB for matrix storage
- 10,000 sequences: ~400 MB for matrix storage
- 100,000 sequences: ~40 GB for matrix storage
Threading Architecture
The tool implements a producer-consumer threading model using ConcurrentCollectionReadInputStream for sequence distribution. Each ProcessThread maintains its own BandedAligner instance (created via BandedAligner.makeBandedAligner()) to eliminate synchronization overhead. The matrix is populated in a thread-safe manner by exploiting the symmetric property: identity[i][j] = identity[j][i].
Identity Calculation
Sequence identity is calculated as:
identity = 1 - (edit_distance / max(length1, length2))
This normalization approach uses the longer of the two sequences as the denominator, which provides more conservative identity estimates for sequences of different lengths.
Performance Characteristics
Time complexity is O(n² × s × b) where n is sequence count, s is average sequence length, and b is bandwidth. The banded algorithm reduces the s factor significantly compared to full dynamic programming. Performance metrics are reported including alignment rate and similarity statistics (min, max, average).
Output Format
The output matrix is tab-delimited with sequence IDs as the first column and row headers. The matrix is symmetric with self-identities on the diagonal (always 1.0 or 100%). Values are formatted to 2 decimal places for percentages or 4 decimal places for fractions.
Important Notes
⚠️ Known Issues
This tool contains a warning that it may produce incorrect results in some circumstances and is not advisable for use until fixed. Users should verify results carefully and consider alternative tools for critical applications.
Memory Considerations
Memory usage scales quadratically with sequence count. For large datasets (>10,000 sequences), monitor memory usage carefully and ensure adequate RAM is available. Use the -Xmx parameter to set appropriate memory limits.
Input Requirements
The tool expects single sequences and will fail with paired-end reads. Input files should be in FASTA or FASTQ format (gzipped files are supported). Sequence IDs should be unique to avoid confusion in the output matrix.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org