AllToAll
Aligns every input sequence against every other input sequence to produce an identity matrix.
Basic Usage
alltoall.sh in=<input file> out=<output file>
Input may be fasta or fastq, compressed or uncompressed. The tool performs pairwise alignment between all input sequences and outputs a symmetric identity matrix showing sequence similarity scores.
Parameters
Parameters control input/output settings, threading, and memory management for the all-to-all alignment process.
Standard parameters
- in=<file>
- Input sequences. Accepts FASTA or FASTQ format, compressed or uncompressed.
- out=<file>
- Output data. Tab-delimited identity matrix with sequence names as headers and percentage identity values (0-100).
- t=
- Set the number of threads; the default is the number of logical processors. Work is distributed across ProcessThread workers through a shared AtomicInteger counter.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression uses less CPU time.
- reads=-1
- If positive, quit after this many sequences. Useful for testing with a subset of a large dataset.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic All-to-All Alignment
alltoall.sh in=sequences.fasta out=identity_matrix.txt
Performs pairwise alignment between all sequences in sequences.fasta and outputs an identity matrix to identity_matrix.txt.
Multi-threaded Processing
alltoall.sh in=large_dataset.fq out=results.txt t=8
Uses 8 threads to process a large FASTQ dataset with AtomicInteger work distribution across ProcessThread workers.
Subset Analysis
alltoall.sh in=sequences.fa out=subset_matrix.txt reads=100
Processes only the first 100 sequences from the input file, useful for testing or smaller analyses.
Algorithm Details
AllToAll implements an all-versus-all sequence alignment algorithm using SketchObject.align() with AtomicInteger work distribution and lower-triangle computation:
Alignment Strategy
- SketchObject Integration: Uses the SketchObject.align() method for approximate alignment of every pair of sequences
- Symmetric Matrix: Generates a symmetric identity matrix where entry (i,j) represents the sequence identity between sequence i and sequence j
- Self-Alignment: Diagonal entries are automatically set to 1.0 (100% identity, written as 100.00 in the output), representing perfect self-alignment
- Half-Matrix Computation: Only computes the lower triangle of the matrix, then mirrors values to create the symmetric upper triangle
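The half-matrix strategy above can be illustrated with a minimal Python sketch. It is not the tool's code: `toy_identity` is a hypothetical placeholder that only counts matching positions and stands in for SketchObject.align(), and `all_to_all` is an illustrative name.

```python
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical placeholder score: fraction of matching positions.

    This is NOT SketchObject.align(); it only gives the loop something to call.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def all_to_all(
    seqs: List[str],
    identity: Callable[[str, str], float] = toy_identity,
) -> List[List[float]]:
    """Fill a symmetric identity matrix by computing only the lower triangle."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # self-alignment on the diagonal
        for j in range(i):  # lower triangle only: j < i
            score = identity(seqs[i], seqs[j])
            matrix[i][j] = score
            matrix[j][i] = score  # mirror into the upper triangle
    return matrix


example_matrix = all_to_all(["ACGT", "ACGA", "TTTT"])
```

Because each pair is scored once and copied to both positions, the result is symmetric by construction and needs roughly half as many alignments as filling the full matrix.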
Multi-Threading Implementation
- Thread Pool: Creates one ProcessThread worker per available thread
- Work Distribution: Uses a shared AtomicInteger counter for lock-free work distribution among threads
- Query Processing: Each thread processes query sequences independently, performing alignments against all reference sequences with lower indices
- Synchronized Output: Results are synchronized when writing to the shared results matrix
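The work-distribution pattern described above can be sketched in Python as follows. This is an analogue under stated assumptions, not the tool's implementation: Python has no AtomicInteger, so a lock-guarded counter (`SharedCounter`, a hypothetical name) stands in for it, and `toy_identity` is a placeholder for the real alignment call.

```python
import threading
from typing import List


class SharedCounter:
    """Lock-guarded stand-in for Java's AtomicInteger.getAndIncrement()."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def get_and_increment(self) -> int:
        with self._lock:
            current = self._value
            self._value += 1
            return current


def toy_identity(a: str, b: str) -> float:
    """Hypothetical placeholder score: fraction of matching positions."""
    if not a and not b:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def run_workers(seqs: List[str], num_threads: int) -> List[List[float]]:
    n = len(seqs)
    results = [[0.0] * n for _ in range(n)]
    next_query = SharedCounter()
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            q = next_query.get_and_increment()  # claim the next query index
            if q >= n:
                return  # no work left
            # Align the query against all sequences with lower indices.
            row = [toy_identity(seqs[q], seqs[r]) for r in range(q)]
            with results_lock:  # synchronized write to the shared matrix
                results[q][q] = 1.0
                for r, score in enumerate(row):
                    results[q][r] = score
                    results[r][q] = score

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


threaded_matrix = run_workers(["ACGT", "ACGA", "TTTT", "ACGT"], num_threads=3)
```

Claiming one query index at a time means no thread needs to know the work split in advance; faster threads simply claim more queries. Note that queries with higher indices carry more alignments, so per-query cost is uneven.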
Memory and Performance
- Memory Usage: Holds all input sequences in memory in an ArrayList<Read>, giving direct indexed access during alignment
- Space Complexity: O(n²) for the identity matrix where n is the number of sequences
- Time Complexity: O(n²) pairwise comparisons (n(n-1)/2 alignments, since only the lower triangle is computed), parallelized across available threads
- Default Memory: Uses a 4 GB heap by default (-Xmx4g), automatically adjusted based on available system RAM
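To get a feel for how quickly the quadratic terms grow, the following small Python sketch estimates the alignment count and matrix size for a given number of sequences. The 4-bytes-per-entry figure is an assumption made for illustration, not a documented property of the tool, and the estimate ignores the memory needed for the sequences themselves.

```python
def alignment_count(n: int) -> int:
    """Number of pairwise alignments when only the lower triangle is computed."""
    return n * (n - 1) // 2


def matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Rough size of an n x n identity matrix.

    bytes_per_entry=4 is an assumed value for illustration only.
    """
    return n * n * bytes_per_entry


for n in (100, 1_000, 10_000):
    print(n, alignment_count(n), matrix_bytes(n))
```

Under these assumptions, 10,000 sequences imply about 50 million alignments and a matrix of roughly 400 MB, which is worth checking against the -Xmx setting before a large run.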
Output Format
- Tab-Delimited Matrix: First row contains sequence names as column headers
- Identity Scores: Values represent percentage identity (0-100) with 2 decimal places
- Row Labels: Each data row starts with the sequence name
- Symmetric Structure: Matrix is symmetric around the diagonal, with diagonal values always 100.00
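A downstream script might load the matrix along these lines. This is a sketch under assumptions: the sample text is invented for illustration, and whether the header row begins with an empty cell is not specified here, so the parser takes labels from the first column of each data row and only skips the header. `read_identity_matrix` is a hypothetical helper name.

```python
import io
from typing import Dict, TextIO


def read_identity_matrix(handle: TextIO) -> Dict[str, Dict[str, float]]:
    """Parse a tab-delimited identity matrix into nested dicts keyed by name."""
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    rows = [line.split("\t") for line in lines[1:]]  # skip the header row
    names = [row[0] for row in rows]  # row labels define the sequence order
    matrix: Dict[str, Dict[str, float]] = {}
    for row in rows:
        values = [float(v) for v in row[1:]]
        if len(values) != len(names):
            raise ValueError(f"row {row[0]!r} has {len(values)} values, expected {len(names)}")
        matrix[row[0]] = dict(zip(names, values))
    return matrix


# Invented sample data; the leading tab in the header is an assumption.
sample = "\tseqA\tseqB\nseqA\t100.00\t87.50\nseqB\t87.50\t100.00\n"
parsed = read_identity_matrix(io.StringIO(sample))
```

Keying by sequence name rather than by index keeps lookups readable, for example `parsed["seqA"]["seqB"]`, at the cost of assuming that names are unique.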
Statistical Reporting
The tool reports processing statistics including total sequences processed, bases analyzed, number of alignments performed, and processing time with throughput metrics.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org