AdjustHomopolymers

Script: adjusthomopolymers.sh Package: jgi Class: AdjustHomopolymers.java

Shrinks or expands homopolymers (runs of consecutive identical bases) in DNA sequences. The rate parameter sets both the direction and the size of the change: positive values lengthen runs, negative values shorten them.

Basic Usage

adjusthomopolymers.sh in=<input file> out=<output file> rate=<float>

Input may be fasta or fastq, compressed or uncompressed.

Parameters

Parameters are grouped as in the shell script. Every parameter the script lists is documented below.

Standard parameters

in=<file>: Primary input, or read 1 input.
in2=<file>: Read 2 input if reads are in two files.
out=<file>: Primary output, or read 1 output.
out2=<file>: Read 2 output if reads are in two files.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Processing parameters

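rate=0.0: Controls homopolymer adjustment. Positive values expand homopolymers (rate=0.1 expands by 10%), negative values shrink them (rate=-0.1 shrinks by 10%). The change for each run is truncated toward zero to a whole number of bases, so a run shorter than 1/|rate| bases is left unchanged. The default of 0.0 makes no change.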
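
The arithmetic behind rate can be checked with a short Python sketch. It assumes the per-run change is (int)(rate * run length), as described under Algorithm Details, and it is not taken from the Java source. The function name adjusted_length and the clamp at zero are illustrative assumptions.

```python
def adjusted_length(run_length, rate):
    """Expected length of one homopolymer run after adjustment.

    Assumption: the change is int(rate * run_length), truncated toward
    zero like a Java (int) cast. The clamp at 0 is also an assumption.
    """
    delta = int(rate * run_length)  # int() truncates toward zero
    return max(0, run_length + delta)


if __name__ == "__main__":
    # Print original length, length at rate=0.2, length at rate=-0.15.
    for run in (1, 4, 5, 10):
        print(run, adjusted_length(run, 0.2), adjusted_length(run, -0.15))
```

Under these assumptions, short runs are untouched at small rates: at rate=0.2 a run needs at least 5 bases before it gains one.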

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Expanding homopolymers by 20%

adjusthomopolymers.sh in=reads.fq out=expanded.fq rate=0.2

Expands all homopolymer runs by 20%. A run of 5 A's would become 6 A's (5 + 5*0.2 = 6).

Shrinking homopolymers by 15%

adjusthomopolymers.sh in=reads.fq out=shrunk.fq rate=-0.15

Shrinks all homopolymer runs by 15%. A run of 10 T's would become 9 T's: the adjustment is (int)(10 * -0.15) = (int)(-1.5) = -1, so 10 - 1 = 9.

Processing paired reads

adjusthomopolymers.sh in=reads1.fq in2=reads2.fq out=adj1.fq out2=adj2.fq rate=0.1

Processes paired-end reads, expanding homopolymers by 10% in both files.

Algorithm Details

Homopolymer Detection and Adjustment:

The algorithm processes each read base-by-base using a streak counter approach:

Streak Tracking: Maintains a counter for consecutive identical bases
Base Transition: When the base changes, calculates adjustment as (int)(rate * streak_length)
Selective Processing: Only processes fully defined bases (A, C, G, T), ignoring ambiguous bases like N
Bidirectional Adjustment: Positive rates add bases, negative rates remove bases from homopolymer runs
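
The steps above can be sketched in Python as follows. This is a minimal illustration of the described behavior, not a port of AdjustHomopolymers.java. The function name adjust_homopolymers is hypothetical, and two points are assumptions: runs of non-ACGT symbols such as N are copied through unchanged, and input is uppercase.

```python
DEFINED_BASES = set("ACGT")


def adjust_homopolymers(seq, rate):
    """Return seq with each A/C/G/T run resized by int(rate * run_length).

    Assumptions: ambiguous symbols (e.g. N) are copied unchanged, input
    is uppercase, and a run can never shrink below zero bases.
    """
    out = []
    i, n = 0, len(seq)
    while i < n:
        base = seq[i]
        # Streak tracking: advance j to the end of the current run.
        j = i + 1
        while j < n and seq[j] == base:
            j += 1
        streak = j - i
        # Base transition: compute the truncated adjustment for this run.
        if base in DEFINED_BASES:
            streak = max(0, streak + int(rate * streak))
        out.append(base * streak)
        i = j
    return "".join(out)


if __name__ == "__main__":
    print(adjust_homopolymers("AAAAACGT", 0.2))
    print(adjust_homopolymers("TTTTTTTTTTG", -0.15))
```

With rate=0.2 the run of five A's gains one base, while the single C, G, and T are unchanged because int(0.2 * 1) is 0.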

Implementation Details:

Memory Efficiency: Uses ByteBuilder for dynamic sequence construction without repeated array allocation
Quality Preservation: Maintains quality scores for fastq files, extending or truncating quality arrays as needed
Truncation Behavior: Adjustments are truncated toward zero to whole bases (e.g., an adjustment of -1.5 removes 1 base)
Single-threaded Processing: Processes reads sequentially for consistent results
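
The quality handling can be illustrated with a hedged Python sketch that keeps bases and quality scores in step. How the tool chooses quality values for inserted bases is not stated here, so this sketch assumes that added bases reuse the quality of the last base in the run and that removed bases are dropped from the end of the run. The function name adjust_with_quality is hypothetical.

```python
def adjust_with_quality(bases, quals, rate):
    """Resize A/C/G/T runs and keep the quality list the same length.

    Assumptions (not confirmed by the tool's source): added bases copy
    the last quality in the run; removed bases come off the run's end.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    out_bases, out_quals = [], []
    i, n = 0, len(bases)
    while i < n:
        j = i + 1
        while j < n and bases[j] == bases[i]:
            j += 1
        streak = j - i
        delta = int(rate * streak) if bases[i] in "ACGT" else 0
        new_len = max(0, streak + delta)
        run_quals = list(quals[i:j])
        if new_len <= streak:
            run_quals = run_quals[:new_len]  # truncate the quality run
        else:
            run_quals += [run_quals[-1]] * (new_len - streak)  # extend it
        out_bases.append(bases[i] * new_len)
        out_quals.extend(run_quals)
        i = j
    return "".join(out_bases), out_quals


if __name__ == "__main__":
    print(adjust_with_quality("AAAAAC", [30, 31, 32, 33, 34, 20], 0.2))
```

Whatever policy is used for the inserted values, the invariant that matters for valid fastq output is that the base string and the quality list end up the same length.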

Use Cases:

PacBio/Nanopore Error Correction: Shrink homopolymers to correct systematic over-calling
Synthetic Data Generation: Expand homopolymers to simulate sequencing errors
Assembly Preprocessing: Normalize homopolymer lengths before assembly
Comparative Analysis: Adjust sequences to match expected homopolymer distributions
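
For the use cases above, it can help to compare run-length distributions before and after adjustment when choosing a rate. The following Python sketch is a separate helper, not part of the tool. The function name run_length_histogram is hypothetical, and sequences are assumed to be uppercase strings already loaded in memory.

```python
from collections import Counter
from itertools import groupby


def run_length_histogram(seqs):
    """Count A/C/G/T runs by length across an iterable of sequences."""
    hist = Counter()
    for seq in seqs:
        for base, group in groupby(seq):
            if base in "ACGT":  # skip ambiguous symbols such as N
                hist[sum(1 for _ in group)] += 1
    return hist


if __name__ == "__main__":
    # Maps run length to the number of runs with that length.
    print(sorted(run_length_histogram(["AAACCG", "TTTTN"]).items()))
```

Running it on the input and output of adjusthomopolymers.sh would show how a given rate shifts the counts for each run length.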

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org