AlignRandom

Script: alignrandom.sh
Package: aligner
Class: AlignRandom.java

Statistical analysis tool that calculates Average Nucleotide Identity (ANI) distributions for random DNA sequences using GlocalPlusAligner5 traceback-free alignment. It generates pairs of random sequences at specified lengths from a uniform base distribution, aligns them with dynamic programming, and accumulates the identities into histogram bins. The resulting distributions are length-dependent: they center around 55% identity, and their variance decreases as sequence length grows. They serve as null models for statistical significance testing of biological sequence alignments.

Basic Usage

alignrandom.sh <start> <mult> <steps> <iters> <buckets> <maxloops> <output>

All parameters are positional and optional. They must be specified in order, but you can omit trailing parameters to use their defaults.

Parameters

The positional arguments control random sequence generation, alignment sampling, and output. Any argument that is omitted takes the default listed below.

Positional Parameters and Defaults (optional; given in order as bare values, without the name)

start=10: Starting sequence length for analysis. This is the initial length of random DNA sequences that will be generated and aligned. The tool will test multiple length intervals starting from this value.
mult=10: Length multiplier between intervals (each step: length*=mult). Each successive test interval will use sequences that are this many times longer than the previous interval. For example, with start=10 and mult=10, the sequence lengths tested would be 10, 100, 1000, etc.
steps=4: Number of length intervals to test. This determines how many different sequence lengths will be analyzed. With start=10, mult=10, and steps=4, the tool will test lengths 10, 100, 1000, and 10000.
iters=200: Number of random sequence pairs to align per interval. For each sequence length being tested, this many pairs of random sequences will be generated and aligned to build the identity distribution histogram. Higher values give more accurate statistics but take longer to run.
buckets=100: Number of histogram bins for the identity distribution. Identity scores (0.0 to 1.0) are rounded to the nearest multiple of 1/buckets, so the output has buckets+1 columns (0.0 through 1.0 inclusive). More buckets provide finer resolution of the identity distribution.
maxloops=max: Work budget that prevents excessive runtime. Because alignment cost grows with length*length, the number of pairs aligned at each sequence length is capped at maxloops/(length*length), or at iters if that is smaller. Default is Long.MAX_VALUE (effectively unlimited).
output=stdout: Output file for ANI histogram results. Specifies where to write the tab-delimited histogram data. Use "stdout" or "stdout.txt" to write to standard output, or provide a filename to write to a file.
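
The way start, mult, and steps combine can be checked with a small sketch. This is an illustration written for this guide, not code from AlignRandom; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the AlignRandom source. */
final class LengthSchedule {

    /** Returns the sequence lengths tested: start, start*mult, ... for the given number of steps. */
    static List<Long> lengths(long start, long mult, int steps) {
        List<Long> out = new ArrayList<>();
        long len = start;
        for (int i = 0; i < steps; i++) {
            out.add(len);
            len *= mult; // each step: length*=mult
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults: start=10, mult=10, steps=4
        System.out.println(lengths(10, 10, 4)); // prints [10, 100, 1000, 10000]
    }
}
```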

Examples

Basic Analysis

alignrandom.sh 20 5 6 500

Tests sequence lengths 20, 100, 500, 2500, 12500, 62500 with 500 random sequence pairs aligned at each length. Uses default 100 histogram buckets and writes to standard output.

High-Resolution Analysis

alignrandom.sh 10 2 8 1000 200

Tests 8 length intervals starting at 10bp with 2x multiplier (10, 20, 40, 80, 160, 320, 640, 1280), using 1000 iterations per length and 200 histogram bins for high resolution.

Limited Runtime with File Output

alignrandom.sh 50 10 5 500 100 1000000 results.tsv

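Tests lengths 50, 500, 5000, 50000, 500000 with maxloops set to 1 million and saves results to results.tsv. By the maxloops/(length*length) rule, the cap is 400 pairs at length 50 and 4 pairs at length 500, and it rounds down to zero from length 5000 upward, so a larger maxloops value is needed to sample the longer lengths.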

Output Format

The tool produces a tab-delimited histogram showing the distribution of alignment identities for each sequence length tested. The output format is:

ANI	0.0000	0.0100	0.0200	...	1.0000
50	0.00000	0.00000	0.00200	...	0.00000
500	0.00000	0.00000	0.00000	...	0.00600
5000	0.00000	0.00000	0.00000	...	0.01000

Each row represents a sequence length, and each column represents an identity bin. Values are the fraction of sequence pairs that fell into each identity bin. The identity bins are evenly spaced from 0.0 to 1.0 based on the number of buckets specified.
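
A downstream script can summarize a row of this output, for example by computing its mean identity. The sketch below assumes only the layout shown above (a header line starting with ANI, then one line per length, all tab-delimited); the class and method names are hypothetical.

```java
/** Illustrative sketch only; assumes the tab-delimited layout shown in this guide. */
final class HistogramReader {

    /**
     * Computes the mean identity of one histogram row.
     * headerLine: "ANI" followed by the identity value of each bin.
     * rowLine: the sequence length followed by the fraction of pairs in each bin.
     */
    static double meanIdentity(String headerLine, String rowLine) {
        String[] header = headerLine.split("\t");
        String[] row = rowLine.split("\t");
        int n = Math.min(header.length, row.length);
        double weighted = 0, total = 0;
        for (int i = 1; i < n; i++) { // column 0 holds the labels
            double identity = Double.parseDouble(header[i]);
            double fraction = Double.parseDouble(row[i]);
            weighted += identity * fraction;
            total += fraction;
        }
        return total > 0 ? weighted / total : Double.NaN;
    }

    public static void main(String[] args) {
        String header = "ANI\t0.00\t0.50\t1.00"; // made-up three-bin example
        String row = "50\t0.25\t0.50\t0.25";
        System.out.println(meanIdentity(header, row)); // prints 0.5
    }
}
```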

Algorithm Details

AlignRandom (class aligner.AlignRandom) evaluates the significance of DNA sequence alignments by building baseline identity distributions. It combines uniform random sequence generation with traceback-free alignment to establish null models for sequence identity testing. The components are described below.

Random Sequence Generation

Random DNA sequences are generated using the randomSequence(int len, Random randy) method which creates byte arrays through uniform base selection. The implementation uses randy.nextInt(4) to generate integer values 0-3, then maps these to nucleotide bytes via AminoAcid.numberToBase[x] lookup table. Each position has exactly 25% probability of being A, C, G, or T, ensuring unbiased composition without GC content constraints.
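
A minimal sketch of this generation step follows. The local NUMBER_TO_BASE table is a stand-in for AminoAcid.numberToBase, and its A, C, G, T ordering is an assumption made for illustration.

```java
import java.util.Random;

/** Illustrative sketch only; not the AlignRandom source. */
final class RandomSequenceSketch {

    // Stand-in for the AminoAcid.numberToBase lookup table; the ordering here is assumed.
    private static final byte[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'};

    /** Builds a random DNA sequence in which each base is drawn uniformly from A, C, G, T. */
    static byte[] randomSequence(int len, Random randy) {
        byte[] seq = new byte[len];
        for (int i = 0; i < len; i++) {
            int x = randy.nextInt(4); // uniform integer in 0..3
            seq[i] = NUMBER_TO_BASE[x];
        }
        return seq;
    }

    public static void main(String[] args) {
        // A fixed seed makes the example repeatable; the tool itself is described as using new Random().
        System.out.println(new String(randomSequence(20, new Random(42))));
    }
}
```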

Alignment Algorithm

Sequences are aligned with the GlocalPlusAligner5.alignStatic(a, b, null) method from the traceback-free alignment research implementation. GlocalPlusAligner5 implements glocal (global-local) alignment with 64-bit bit-packed scores that encode position, deletion count, and alignment score in a single value. It recovers exact operation counts without storing a traceback matrix by solving the constraints M+S+I=qLen, M+S+D=refAlnLength, and Score=M-S-I-D, where M, S, I, and D are the numbers of matches, substitutions, insertions, and deletions. The method returns a float identity: the fraction of aligned positions that match.
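
The three constraints can be solved by hand once the deletion count is known, which the text says is carried in the packed score. The sketch below works only from those equations; it is not the GlocalPlusAligner5 code, the names are hypothetical, and the identity denominator (M+S+I+D) is an assumption.

```java
/** Illustrative sketch only; solves the three constraints stated in this guide. */
final class OperationCounts {

    /**
     * Recovers {M, S, I, D} from the query length, aligned reference length,
     * score, and deletion count, using:
     *   M+S+I = qLen,  M+S+D = refAlnLength,  Score = M-S-I-D
     */
    static long[] solve(long qLen, long refAlnLength, long score, long deletions) {
        long d = deletions;
        long matchPlusSub = refAlnLength - d;      // from M+S+D = refAlnLength
        long i = qLen - matchPlusSub;              // from M+S+I = qLen
        long matchMinusSub = score + i + d;        // from Score = M-S-I-D
        long m = (matchPlusSub + matchMinusSub) / 2;
        long s = matchPlusSub - m;
        return new long[] {m, s, i, d};
    }

    /** Identity as matches over all alignment columns (assumed definition). */
    static double identity(long[] msid) {
        long columns = msid[0] + msid[1] + msid[2] + msid[3];
        return columns > 0 ? (double) msid[0] / columns : 0.0;
    }

    public static void main(String[] args) {
        // Made-up alignment: 7 matches, 2 substitutions, 1 insertion, 2 deletions
        // gives qLen=10, refAlnLength=11, score=2.
        long[] c = solve(10, 11, 2, 2);
        System.out.println(c[0] + " " + c[1] + " " + c[2] + " " + c[3]); // prints 7 2 1 2
    }
}
```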

Statistical Sampling Strategy

For each sequence length interval, the tool generates pairs of random sequences with randomSequence(len, randy) and aligns each pair to obtain an identity score. Math.round(id*buckets) converts the continuous score (0.0-1.0) into a discrete bin index, and hist[Math.round(id*buckets)]++ increments that bin. The resulting integer arrays, once normalized, represent the probability distribution of random sequence identity.
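
The binning step can be sketched as follows. The hist array and the Math.round(id*buckets) expression come from the description above; the class, the helper names, and the normalization helper are hypothetical.

```java
/** Illustrative sketch only; not the AlignRandom source. */
final class IdentityHistogram {

    /** Maps an identity in [0,1] to a bin index in 0..buckets. */
    static int bucketOf(float id, int buckets) {
        return Math.round(id * buckets); // Math.round(float) returns int
    }

    /** Counts identities per bin; the array needs buckets+1 slots because id=1.0 maps to index buckets. */
    static int[] accumulate(float[] ids, int buckets) {
        int[] hist = new int[buckets + 1];
        for (float id : ids) {
            hist[bucketOf(id, buckets)]++;
        }
        return hist;
    }

    /** Converts counts to fractions of the total, as in the output rows. */
    static double[] normalize(int[] hist) {
        long total = 0;
        for (int count : hist) total += count;
        double[] fractions = new double[hist.length];
        for (int i = 0; i < hist.length; i++) {
            fractions[i] = total > 0 ? (double) hist[i] / total : 0.0;
        }
        return fractions;
    }

    public static void main(String[] args) {
        int[] hist = accumulate(new float[] {0.5f, 0.5f, 1.0f}, 10);
        System.out.println(hist[5] + " " + hist[10]); // prints 2 1
    }
}
```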

Multithreaded Processing

The runMT() method distributes alignment work across threads using ExecutorService executor = Executors.newFixedThreadPool(Shared.threads()). Each iteration submits a Runnable task that creates its own Random generator (Random randy = new Random()), generates a sequence pair, aligns it, and records the result with atomicHist.incrementAndGet(bucket) on an AtomicIntegerArray atomicHist for lock-free histogram accumulation. The submitted tasks are tracked in a List<Future<?>> futures, and the method waits for completion by calling f.get() on each.
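
The pattern described above can be sketched with standard java.util.concurrent classes. This is not the tool's runMT(): the thread count is passed in rather than read from Shared.threads(), and ungappedIdentity is a hypothetical stand-in for GlocalPlusAligner5.alignStatic that simply compares positions without gaps, so its identities will be lower than those of a gapped aligner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Illustrative sketch only; not the AlignRandom source. */
final class RunMtSketch {

    private static final byte[] BASES = {'A', 'C', 'G', 'T'};

    static byte[] randomSequence(int len, Random randy) {
        byte[] seq = new byte[len];
        for (int i = 0; i < len; i++) seq[i] = BASES[randy.nextInt(4)];
        return seq;
    }

    /** Hypothetical stand-in aligner: fraction of equal positions, no gaps allowed. */
    static float ungappedIdentity(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length), matches = 0;
        for (int i = 0; i < n; i++) if (a[i] == b[i]) matches++;
        return n > 0 ? matches / (float) n : 0f;
    }

    static AtomicIntegerArray runMT(int len, int iters, int buckets, int threads) {
        AtomicIntegerArray atomicHist = new AtomicIntegerArray(buckets + 1);
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < iters; i++) {
                futures.add(executor.submit(() -> {
                    Random randy = new Random(); // one generator per task
                    byte[] a = randomSequence(len, randy);
                    byte[] b = randomSequence(len, randy);
                    float id = ungappedIdentity(a, b);
                    atomicHist.incrementAndGet(Math.round(id * buckets)); // lock-free update
                }));
            }
            for (Future<?> f : futures) f.get(); // wait for every task to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return atomicHist;
    }

    public static void main(String[] args) {
        AtomicIntegerArray hist = runMT(100, 200, 100, 4);
        int total = 0;
        for (int i = 0; i < hist.length(); i++) total += hist.get(i);
        System.out.println(total); // prints 200, one count per task
    }
}
```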

Runtime Limitation

The tool guards against quadratic runtime growth, since alignment cost scales as O(n²) with sequence length. It computes long iters2=(maxLoops/len)/len and then uses int iters3=(int)Tools.min(iters, iters2) as the effective iteration count, which reduces the number of alignments for long sequences. maxLoops defaults to Long.MAX_VALUE and can be set with the 6th positional parameter.
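
The cap can be reproduced in a few lines. The sketch mirrors the two expressions quoted above, with Math.min standing in for Tools.min; the class and method names are hypothetical.

```java
/** Illustrative sketch only; mirrors the expressions quoted in this guide. */
final class IterationCap {

    /** Returns the number of pairs aligned at a given length under the maxLoops budget. */
    static int effectiveIters(int iters, long maxLoops, long len) {
        long iters2 = (maxLoops / len) / len;   // integer division, rounds down
        return (int) Math.min(iters, iters2);   // Math.min used in place of Tools.min
    }

    public static void main(String[] args) {
        long maxLoops = 1_000_000L;
        for (long len = 50; len <= 5000; len *= 10) {
            System.out.println(len + "\t" + effectiveIters(500, maxLoops, len));
        }
        // prints 400 for length 50, 4 for length 500, and 0 for length 5000
    }
}
```

Dividing by len twice gives the same result as dividing by len*len for positive values, and it never forms the product, which may help avoid overflow at very large lengths.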

Scientific Application

The generated identity distributions serve as null models for statistical significance testing of real sequence alignments. Random DNA sequences typically converge to identity distributions centered around 55% for shorter sequences, with variance decreasing as sequence length increases, as expected from the law of large numbers. This baseline makes it possible to determine whether an observed identity between biological sequences exceeds random expectation by a statistically significant margin.
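
One simple way to use a histogram row as a null model is to sum the fraction of random pairs at or above an observed identity, which gives an empirical tail estimate. This is a sketch of that idea, not a feature of the tool; the names are hypothetical and the example row is made up.

```java
/** Illustrative sketch only; an empirical tail estimate from one histogram row. */
final class NullModelTail {

    /**
     * fractions[i] is the fraction of random pairs in bin i, where bin i
     * corresponds to identity i/buckets and buckets = fractions.length-1.
     * Returns the fraction of random pairs whose bin is at or above the observed one.
     */
    static double tailFraction(double[] fractions, double observedIdentity) {
        int buckets = fractions.length - 1;
        int start = (int) Math.round(observedIdentity * buckets);
        start = Math.max(0, Math.min(buckets, start)); // clamp to a valid bin
        double tail = 0;
        for (int i = start; i < fractions.length; i++) {
            tail += fractions[i];
        }
        return tail;
    }

    public static void main(String[] args) {
        double[] row = {0.0, 0.1, 0.6, 0.3, 0.0}; // made-up row with buckets=4
        System.out.println(tailFraction(row, 0.75)); // prints 0.3
    }
}
```

A small tail fraction suggests the observed identity is unlikely under the random model at that length; the resolution of the estimate is limited by iters and buckets.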

Performance Characteristics

Memory Usage: The histogram uses fixed-size int[buckets+1] arrays, and AtomicIntegerArray(buckets + 1) provides thread-safe accumulation in multithreaded mode. Each pair being aligned adds two byte[len] sequence arrays, and GlocalPlusAligner5 uses O(n) space through dual rolling arrays rather than O(mn) full matrix storage. The overall footprint therefore scales roughly as O(buckets + threads × len).

Runtime Scaling: Runtime complexity is O(iters × len²) due to GlocalPlusAligner5's dynamic programming alignment scaling quadratically with sequence length. The maxLoops/(len*len) safety calculation prevents excessive runtime by reducing iterations for long sequences. Multithreaded mode distributes iters iterations across Shared.threads() cores with work partitioning through ExecutorService.submit().

Accuracy: Sampling error shrinks in proportion to 1/√(iters), as expected from the central limit theorem. The tool uses Java's standard Random class, a linear congruential generator, to draw bases uniformly. Histogram resolution is set by the buckets parameter, which creates identity intervals of width 1.0/buckets.
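
The 1/√(iters) behavior can be made concrete with the usual binomial standard error of a bin fraction, sqrt(p(1-p)/iters). The sketch below is a general statistical illustration rather than something the tool reports; the names are hypothetical.

```java
/** Illustrative sketch only; binomial standard error of a histogram bin fraction. */
final class SamplingError {

    /** Standard error of an observed fraction p estimated from iters independent pairs. */
    static double standardError(double p, int iters) {
        return Math.sqrt(p * (1.0 - p) / iters);
    }

    public static void main(String[] args) {
        // Quadrupling iters halves the error of a bin that holds half of the pairs.
        System.out.println(standardError(0.5, 100));
        System.out.println(standardError(0.5, 400));
    }
}
```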

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org