AdjustHomopolymers
Shrinks or expands homopolymers in DNA sequences. Each homopolymer run (consecutive identical bases) is lengthened or shortened in proportion to the rate parameter.
Basic Usage
adjusthomopolymers.sh in=<input file> out=<output file> rate=<float>
Input may be fasta or fastq, compressed or uncompressed.
Parameters
Parameters are grouped as in the shell script, and every parameter from the shell script is documented below.
Standard parameters
- in=<file>
- Primary input, or read 1 input.
- in2=<file>
- Read 2 input if reads are in two files.
- out=<file>
- Primary output, or read 1 output.
- out2=<file>
- Read 2 output if reads are in two files.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Processing parameters
- rate=0.0
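- Controls homopolymer adjustment. Positive values expand homopolymers (rate=0.1 expands by 10%), negative values shrink them (rate=-0.1 shrinks by 10%). The adjustment is truncated to a whole number of bases, so runs shorter than 1/|rate| are left unchanged. Default is 0.0 (no change).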
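To see what a given rate does to runs of different lengths, the arithmetic can be sketched in a few lines of Python. This is only an illustration based on the (int)(rate * streak_length) formula given under Algorithm Details, not the tool's own code: the function name adjusted_length is hypothetical, the clamp at zero is an assumption, and Python's double-precision arithmetic may differ from the tool's at exact boundaries.

```python
def adjusted_length(streak: int, rate: float) -> int:
    """Length of a run of `streak` identical bases after adjustment.

    Assumption: the change is int(rate * streak), truncated toward zero,
    and the result is clamped so it never goes below zero.
    """
    return max(0, streak + int(rate * streak))


if __name__ == "__main__":
    # Print the effect of two example rates on a few run lengths.
    for rate in (0.2, -0.15):
        for streak in (3, 5, 10):
            new_len = adjusted_length(streak, rate)
            print(f"rate={rate:+.2f}  run={streak:2d}  ->  {new_len}")
```

Short runs come out unchanged because the fractional adjustment truncates to zero; only runs of at least 1/|rate| bases gain or lose anything.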
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Expanding homopolymers by 20%
adjusthomopolymers.sh in=reads.fq out=expanded.fq rate=0.2
Expands all homopolymer runs by 20%. A run of 5 A's would become 6 A's (5 + 5*0.2 = 6).
Shrinking homopolymers by 15%
adjusthomopolymers.sh in=reads.fq out=shrunk.fq rate=-0.15
Shrinks all homopolymer runs by 15%. A run of 10 T's would become 9 T's: the adjustment is 10*(-0.15) = -1.5, which truncates to -1, so 10 - 1 = 9.
Processing paired reads
adjusthomopolymers.sh in1=reads1.fq in2=reads2.fq out1=adj1.fq out2=adj2.fq rate=0.1
Processes paired-end reads, expanding homopolymers by 10% in both files.
Algorithm Details
Homopolymer Detection and Adjustment:
The algorithm processes each read base-by-base using a streak counter approach:
- Streak Tracking: Maintains a counter for consecutive identical bases
- Base Transition: When the base changes, calculates the adjustment as (int)(rate * streak_length); the integer cast truncates toward zero
- Selective Processing: Only processes fully defined bases (A, C, G, T), ignoring ambiguous bases like N
- Bidirectional Adjustment: Positive rates add bases, negative rates remove bases from homopolymer runs
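The steps above can be sketched in Python as follows. This is an illustrative reimplementation of the description, not the tool's Java source. The name adjust_homopolymers is hypothetical, and the sketch assumes uppercase bases, that runs of non-ACGT symbols such as N are copied through unchanged, and that a run's length is clamped at zero.

```python
def adjust_homopolymers(seq: str, rate: float) -> str:
    """Lengthen or shorten every A/C/G/T run by int(rate * run_length)."""
    out = []
    i = 0
    n = len(seq)
    while i < n:
        base = seq[i]
        # Streak tracking: count consecutive copies of the current base.
        streak = 1
        while i + streak < n and seq[i + streak] == base:
            streak += 1
        if base in "ACGT":
            # Base transition reached: apply the truncated adjustment.
            new_len = max(0, streak + int(rate * streak))
        else:
            # Assumption: ambiguous symbols are emitted as-is.
            new_len = streak
        out.append(base * new_len)
        i += streak
    return "".join(out)


if __name__ == "__main__":
    print(adjust_homopolymers("AAAAAC", 0.2))
    print(adjust_homopolymers("GTTTTTTTTTTG", -0.15))
```

The two calls mirror the expansion and shrinking examples in the Examples section: a run of 5 A's grows to 6, and a run of 10 T's shrinks to 9, while the single-base runs around them are untouched.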
Implementation Details:
- Memory Efficiency: Uses ByteBuilder for dynamic sequence construction without repeated array allocation
- Quality Preservation: Maintains quality scores for fastq files, extending or truncating quality arrays as needed
- Truncation Behavior: Fractional adjustments are truncated toward zero (e.g., an adjustment of -1.5 bases removes 1 base)
- Single-threaded Processing: Processes reads sequentially for consistent results
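How quality scores follow the bases can be illustrated with a small Python sketch. The documentation above says only that quality arrays are extended or truncated as needed; it does not say which quality values are given to added bases. The sketch below assumes that added bases repeat the last quality score of their run, and that assumption, like the function name adjust_with_quality, is hypothetical rather than taken from the tool.

```python
from itertools import groupby


def adjust_with_quality(seq: str, qual: str, rate: float) -> tuple[str, str]:
    """Adjust homopolymer runs and keep the quality string the same length."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have equal length")
    out_seq, out_qual = [], []
    pos = 0
    for base, group in groupby(seq):
        streak = len(list(group))
        run_qual = qual[pos:pos + streak]
        pos += streak
        # Only fully defined bases are adjusted; others pass through.
        delta = int(rate * streak) if base in "ACGT" else 0
        new_len = max(0, streak + delta)
        out_seq.append(base * new_len)
        if new_len <= streak:
            # Shrinking: drop qualities from the end of the run.
            out_qual.append(run_qual[:new_len])
        else:
            # Expanding (assumption): repeat the run's last quality score.
            out_qual.append(run_qual + run_qual[-1] * (new_len - streak))
    return "".join(out_seq), "".join(out_qual)


if __name__ == "__main__":
    print(adjust_with_quality("AAAAAC", "IIIIH#", 0.2))
```

Whatever fill rule is used, the invariant worth checking is that the output sequence and quality strings always have equal length, since fastq requires it.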
Use Cases:
- PacBio/Nanopore Error Correction: Shrink homopolymers to correct systematic over-calling
- Synthetic Data Generation: Expand homopolymers to simulate sequencing errors
- Assembly Preprocessing: Normalize homopolymer lengths before assembly
- Comparative Analysis: Adjust sequences to match expected homopolymer distributions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org