MakePolymers

Basic Usage

makepolymers.sh out=<output file> k=<repeat length> minlen=<sequence length>

Generates synthetic polymer sequences. For every possible k-mer of the specified length, the tool writes one sequence consisting of that k-mer repeated until the minimum length requirement is met, so the output covers the complete k-mer space.

Parameters

Parameters are organized by their function in the polymer generation process.

I/O Parameters

out=<file>: Output genome file. The generated polymer sequences will be written in FASTA format to this file.
overwrite=f: (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true

Processing Parameters

k=1: Length of repeating polymeric units. This determines the size of k-mers that will be generated. All possible k-mers of this length will be created. To generate a sweep of multiple values of k, specify both mink and maxk. Default: 1
mink=1: Minimum k-mer length when generating a range of k values. Used in conjunction with maxk to create polymers for multiple k-mer sizes. Default: same as k
maxk=1: Maximum k-mer length when generating a range of k values. Used in conjunction with mink to create polymers for multiple k-mer sizes. Default: same as k
minlen=31: Ensure sequences are at least this long. Specifically, minlen=X makes each sequence long enough to contain every distinct X-mer that its repeating unit can produce (one for each phase of the repeat). The algorithm calculates the minimum number of repeats needed to achieve this. Default: 31
verbose=f: Enable verbose output for debugging and detailed progress information. Default: false
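The minlen guarantee can be checked empirically. The sketch below is not part of BBTools; the class WindowCoverageSketch and its method distinctWindows are hypothetical names used only for illustration. It repeats a unit and counts the distinct substrings (windows) of length X, showing that a repeat with period k needs at least X+k-1 bases before every phase of the repeat appears as an X-mer.

```java
// Illustrative sketch only; names here are hypothetical, not BBTools APIs.
class WindowCoverageSketch {

    // Repeat `unit` `repeats` times and count distinct windows of length x.
    static int distinctWindows(String unit, int repeats, int x) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < repeats; r++) {
            sb.append(unit);
        }
        java.util.Set<String> seen = new java.util.HashSet<>();
        for (int i = 0; i + x <= sb.length(); i++) {
            seen.add(sb.substring(i, i + x));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // 6 bases (ACGACG) hold only two of the three possible 5-mers.
        System.out.println(distinctWindows("ACG", 2, 5)); // prints 2
        // 9 bases (>= 5 + 3 - 1) hold all three: ACGAC, CGACG, GACGA.
        System.out.println(distinctWindows("ACG", 3, 5)); // prints 3
    }
}
```

With unit length 3 and X=5, six bases are not enough; one more repeat brings the length past X+k-1=7 and all three phases appear.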

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: auto-detected based on available memory
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions for slightly improved performance in production environments.

Examples

Generate Monomers

makepolymers.sh out=monomers.fa k=1 minlen=100

Creates sequences containing repeated single nucleotides (A, T, G, C), each repeated enough times to create sequences of at least 100 bp. This generates 4 sequences: one for each nucleotide.

Generate Dimers

makepolymers.sh out=dimers.fa k=2 minlen=200

Creates sequences containing all possible 2-mers (AA, AT, AG, AC, TA, TT, etc.), each repeated to create sequences of at least 200 bp. This generates 16 sequences (4² = 16 possible 2-mers).

Generate Range of K-mers

makepolymers.sh out=polymers.fa mink=1 maxk=3 minlen=150

Creates polymer sequences for k-mer lengths 1, 2, and 3. This will generate 4 + 16 + 64 = 84 sequences total, covering all possible 1-mers, 2-mers, and 3-mers.
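The sequence count for a sweep is the sum of 4^k over the requested range. A minimal sketch of that arithmetic follows, assuming mink <= maxk and k small enough that the total fits in a long; the class and method names are illustrative and do not come from BBTools.

```java
// Illustrative sketch only; not BBTools code.
class SequenceCountSketch {

    // Total number of sequences for k = mink..maxk: sum of 4^k.
    static long totalSequences(int mink, int maxk) {
        long total = 0;
        for (int k = mink; k <= maxk; k++) {
            total += 1L << (2 * k); // 4^k == 2^(2k)
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalSequences(1, 3)); // prints 84
    }
}
```

Because the count grows by a factor of four with each additional base, it is worth estimating the output size before choosing a large maxk.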

Use with Mutation Tool

makepolymers.sh out=polymers.fa k=4 minlen=1000
mutate.sh in=polymers.fa out=mutated_polymers.fa rate=0.01

First generates 4-mer polymer sequences, then introduces random mutations at a 1% rate to create low-complexity sequences with some variation for testing purposes.

Algorithm Details

MakePolymers generates synthetic polymer sequences through a systematic enumeration approach implemented in the writeSequence() method:

K-mer Enumeration Implementation

The algorithm enumerates every k-mer of the specified length as a 2-bit-packed integer:

Bit-shift Generation: For k-mer length k, computes the upper bound final long max=(1<<(2*k))-1, which equals 4^k-1, so the values 0 through max cover all 4^k possible k-mers
Two-bit Encoding: Uses AminoAcid.numberToBase[x] array lookup where each nucleotide is represented by 2 bits extracted via (kmer>>(2*i))&3
Sequential Iteration: Loops through all numbers from 0 to max, converting each to its corresponding k-mer sequence using the toBytes() static method
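The enumeration steps above can be sketched in a few lines of Java. This is not the BBTools source: the local NUMBER_TO_BASE array is a stand-in for AminoAcid.numberToBase, the mapping 0=A, 1=C, 2=G, 3=T is an assumption, and decoding the most significant base first is also an assumption, since the loop direction is not stated here.

```java
// Illustrative sketch only; not the BBTools source.
class KmerEnumerationSketch {

    // Assumed 2-bit code: 0=A, 1=C, 2=G, 3=T (stand-in for AminoAcid.numberToBase).
    static final char[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'};

    // Decode a 2-bit-packed k-mer, most significant base first (assumed order).
    static String decode(long kmer, int k) {
        StringBuilder sb = new StringBuilder(k);
        for (int i = k - 1; i >= 0; i--) {
            int x = (int) ((kmer >> (2 * i)) & 3);
            sb.append(NUMBER_TO_BASE[x]);
        }
        return sb.toString();
    }

    // Enumerate all 4^k k-mers in numeric order (intended for small k only).
    static String[] enumerate(int k) {
        final long max = (1L << (2 * k)) - 1; // 4^k - 1
        String[] out = new String[(int) (max + 1)];
        for (long kmer = 0; kmer <= max; kmer++) {
            out[(int) kmer] = decode(kmer, k);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
        System.out.println(String.join(" ", enumerate(2)));
    }
}
```

The sketch uses 1L rather than 1 in the shift so that the arithmetic is done in 64 bits; that choice belongs to the sketch, not to the quoted formula.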

Length Calculation Implementation

The tool calculates minimum repeats using specific arithmetic operations in the writeSequence() method:

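Target Length Calculation: Computes minLen2=((minLen+k-1)/k)*k, which (with integer division) rounds minLen up to the nearest multiple of k
Repeat Count Logic: If minLen2-minLen>=k-1 then minCount=minLen2/k; otherwise minCount=minLen2/k+1. In both cases the final length minCount*k is at least minLen+k-1, the length a repeat of period k needs in order to contain a window of length minLen starting at each of its k phases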
Coverage Implementation: Each k-mer is repeated exactly minCount times in a for-loop to guarantee complete coverage
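The arithmetic above is easy to verify by hand. The sketch below reproduces the two formulas as described and prints the resulting repeat counts for a few inputs; the class name and the chosen test values are illustrative, and it assumes positive integer minLen and k.

```java
// Illustrative sketch of the repeat-count arithmetic described above.
class RepeatCountSketch {

    // Number of times a k-mer is repeated for a given minLen (integer division throughout).
    static int minCount(int minLen, int k) {
        int minLen2 = ((minLen + k - 1) / k) * k; // minLen rounded up to a multiple of k
        return (minLen2 - minLen >= k - 1) ? minLen2 / k : minLen2 / k + 1;
    }

    public static void main(String[] args) {
        int[][] cases = {{100, 1}, {200, 2}, {150, 3}, {31, 4}, {29, 4}};
        for (int[] c : cases) {
            int count = minCount(c[0], c[1]);
            System.out.println("minlen=" + c[0] + " k=" + c[1]
                    + " -> repeats=" + count + ", length=" + (count * c[1]));
        }
        // minlen=100 k=1 -> repeats=100, length=100
        // minlen=200 k=2 -> repeats=101, length=202
        // minlen=150 k=3 -> repeats=51, length=153
        // minlen=31 k=4 -> repeats=9, length=36
        // minlen=29 k=4 -> repeats=8, length=32
    }
}
```

Under these formulas the output can be slightly longer than minlen: for k=2 and minlen=200, for example, each sequence is 202 bases.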

Output Format and Naming Implementation

Each generated sequence follows a systematic naming convention implemented through ByteBuilder operations:

Header Construction: Uses bb.append('>').append(k).append('_').append(kmer).append('\n') to create headers of the form >k_kmer
Sequence Generation: Calls toBytes(kmer, k, bb) method that extracts nucleotides using bit shifting and AminoAcid lookup
FASTA Formatting: Uses ByteBuilder.nl() to add newlines and ByteStreamWriter for output formatting
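To make the record layout concrete, here is a sketch that assembles one FASTA record with a plain StringBuilder in place of ByteBuilder. Several details are assumptions rather than facts from the source: that the k-mer's numeric value is written to the header in decimal, that the whole sequence sits on a single line, and that the 2-bit code is 0=A, 1=C, 2=G, 3=T.

```java
// Illustrative sketch only; StringBuilder stands in for BBTools' ByteBuilder.
class PolymerRecordSketch {

    // Assumed 2-bit code (stand-in for AminoAcid.numberToBase).
    static final char[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'};

    // Build one record: header ">k_kmer", then the k-mer repeated minCount times.
    static String record(long kmer, int k, int minCount) {
        StringBuilder bb = new StringBuilder();
        bb.append('>').append(k).append('_').append(kmer).append('\n');
        for (int r = 0; r < minCount; r++) {
            for (int i = k - 1; i >= 0; i--) {
                bb.append(NUMBER_TO_BASE[(int) ((kmer >> (2 * i)) & 3)]);
            }
        }
        bb.append('\n');
        return bb.toString();
    }

    public static void main(String[] args) {
        // prints two lines: ">2_1" and "ACACAC"
        System.out.print(record(1L, 2, 3));
    }
}
```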

Memory and I/O Implementation Details

The implementation uses specific memory and I/O optimizations:

Buffer Management: Uses 16,384-byte ByteBuilder buffers with conditional flushing when bb.length>=16384
Streaming Output: Employs ByteStreamWriter.start() and ByteStreamWriter.print() for concurrent I/O operations
Memory Allocation: Sets default heap space to 4GB via -Xmx4g and -Xms4g JVM parameters in shell script
Statistics Tracking: Maintains readsProcessed and basesProcessed counters incremented during sequence generation
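The threshold-based flushing described above follows a common pattern: accumulate records in a buffer and hand the buffer to the writer once it reaches a size limit. The sketch below illustrates that pattern only. It counts characters rather than bytes, uses a StringBuilder as a stand-in for the real output stream, and all names in it are hypothetical.

```java
// Illustrative sketch of threshold-based buffer flushing; not BBTools code.
class BufferedFlushSketch {

    static final int FLUSH_THRESHOLD = 16384;

    int flushes = 0;                                 // number of flushes performed
    final StringBuilder sink = new StringBuilder();  // stand-in for the output writer
    private final StringBuilder bb = new StringBuilder(FLUSH_THRESHOLD);

    // Append a record and flush once the buffer reaches the threshold.
    void add(String record) {
        bb.append(record);
        if (bb.length() >= FLUSH_THRESHOLD) {
            flush();
        }
    }

    // Hand the buffered text to the sink and reset the buffer.
    void flush() {
        if (bb.length() == 0) {
            return;
        }
        sink.append(bb);
        bb.setLength(0);
        flushes++;
    }

    public static void main(String[] args) {
        StringBuilder rec = new StringBuilder();
        for (int i = 0; i < 99; i++) {
            rec.append('A');
        }
        rec.append('\n'); // one 100-character record
        BufferedFlushSketch w = new BufferedFlushSketch();
        for (int i = 0; i < 1000; i++) {
            w.add(rec.toString());
        }
        w.flush(); // write whatever remains in the buffer
        // prints: 7 flushes, 100000 characters
        System.out.println(w.flushes + " flushes, " + w.sink.length() + " characters");
    }
}
```

Batching in this way keeps the number of write calls small even when the output contains many short records.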

Use Cases and Applications

MakePolymers is useful in several bioinformatics workflows:

Tool Testing: Creates controlled synthetic datasets for testing sequence analysis tools
Algorithm Validation: Provides known-content sequences for validating k-mer counting and analysis algorithms
Low-Complexity Generation: When combined with mutation tools, creates realistic low-complexity sequences
Reference Creation: Generates complete k-mer space references for comparative analysis

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org