MakePolymers

Script: makepolymers.sh Package: jgi Class: MakePolymers.java

Creates polymer sequences. Can be used in conjunction with mutate.sh to generate low-complexity sequence.

Basic Usage

makepolymers.sh out=<output file> k=<repeat length> minlen=<sequence length>

Generates synthetic polymer sequences by enumerating every possible k-mer of the specified length and repeating each one until the sequence reaches the minimum length set by minlen. Each k-mer yields one output sequence.
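As a rough illustration of the idea described above (not the tool's actual code), the sketch below enumerates every k-mer over A, C, G, T and repeats each one until the sequence is at least minlen bases long. The function name make_polymers, the base order, and the plain ceiling-division repeat count are assumptions made for this example.

```python
from itertools import product

BASES = "ACGT"  # assumed symbol order; the tool's own order may differ


def make_polymers(k, minlen):
    """Return one repeated sequence per possible k-mer (hypothetical helper)."""
    # Ceiling division: the smallest repeat count whose length reaches minlen.
    repeats = max(-(-minlen // k), 1)
    return ["".join(unit) * repeats for unit in product(BASES, repeat=k)]


if __name__ == "__main__":
    # Show the first few dimer polymers with a short minimum length.
    for seq in make_polymers(k=2, minlen=6)[:3]:
        print(seq)
```

With k=1 and minlen=100 this returns 4 sequences of 100 bases each, which matches the monomer case described in the examples.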

Parameters

Parameters are organized by their function in the polymer generation process.

I/O Parameters

out=<file>
Output genome file. The generated polymer sequences will be written in FASTA format to this file.
overwrite=f
(ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true

Processing Parameters

k=1
Length of the repeating polymeric unit. All 4^k possible k-mers of this length are generated, one sequence per k-mer. To generate a sweep of multiple values of k, specify both mink and maxk. Default: 1
mink=1
Minimum k-mer length when generating a range of k values. Used in conjunction with maxk to create polymers for multiple k-mer sizes. Default: same as k
maxk=1
Maximum k-mer length when generating a range of k values. Used in conjunction with mink to create polymers for multiple k-mer sizes. Default: same as k
minlen=31
Ensures sequences are at least this long. Specifically, minlen=X makes each sequence long enough that all possible k-mers of length X are present. The tool calculates the minimum number of repeats of each unit needed to reach this length. Default: 31
verbose=f
Enable verbose output for debugging and detailed progress information. Default: false
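One reading of the minlen description above is that a sequence built from a k-base unit should contain every distinct window of length X that an endless repeat of that unit would produce. Under that reading, a length of X + k - 1 bases is sufficient, because the repeat has at most k distinct starting phases. The sketch below computes a repeat count under this assumption and checks it by brute force. The function names are hypothetical, and the tool's own arithmetic may differ.

```python
def repeats_needed(k, x):
    """Smallest repeat count giving length >= x + k - 1 (assumed target)."""
    target = x + k - 1
    return -(-target // k)  # ceiling division


def windows(seq, x):
    """All distinct substrings of length x found in seq."""
    return {seq[i:i + x] for i in range(len(seq) - x + 1)}


def covers_all_phases(unit, x):
    """Check that the repeated unit contains every window of the endless repeat."""
    k = len(unit)
    seq = unit * repeats_needed(k, x)
    # Reference: a repeat long enough to start a window at every phase 0..k-1.
    endless = unit * (x // k + 2)
    expected = {endless[p:p + x] for p in range(k)}
    return expected <= windows(seq, x)


if __name__ == "__main__":
    print(repeats_needed(3, 31), covers_all_phases("ACG", 31))
```

For a 3-base unit and X=31, the assumed target is 33 bases, so 11 repeats are enough.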

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: auto-detected based on available memory
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions for slightly improved performance in production environments.

Examples

Generate Monomers

makepolymers.sh out=monomers.fa k=1 minlen=100

Creates homopolymer sequences, one for each nucleotide (A, T, G, C), each repeated enough times to reach at least 100 bp. This generates 4 sequences in total.

Generate Dimers

makepolymers.sh out=dimers.fa k=2 minlen=200

Creates one sequence for each possible 2-mer (AA, AT, AG, AC, TA, TT, etc.), with the 2-mer repeated until the sequence is at least 200 bp long. This generates 16 sequences (4² = 16 possible 2-mers).

Generate Range of K-mers

makepolymers.sh out=polymers.fa mink=1 maxk=3 minlen=150

Creates polymer sequences for k-mer lengths 1, 2, and 3. This will generate 4 + 16 + 64 = 84 sequences total, covering all possible 1-mers, 2-mers, and 3-mers.
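The total follows from summing 4^k over the requested range, assuming one sequence per k-mer as the examples here describe. A small check, using a hypothetical helper name:

```python
def sequence_count(mink, maxk):
    """Number of k-mers over a 4-letter alphabet for every k in [mink, maxk]."""
    return sum(4 ** k for k in range(mink, maxk + 1))


if __name__ == "__main__":
    print(sequence_count(1, 3))  # 4 + 16 + 64 = 84
```

Because the count grows as 4^k, widening the range by one roughly quadruples the output size.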

Use with Mutation Tool

makepolymers.sh out=polymers.fa k=4 minlen=1000
mutate.sh in=polymers.fa out=mutated_polymers.fa rate=0.01

First generates 4-mer polymer sequences, then introduces random mutations at a 1% rate to create low-complexity sequences with some variation for testing purposes.

Algorithm Details

MakePolymers generates synthetic polymer sequences through a systematic enumeration approach implemented in the writeSequence() method:

K-mer Enumeration Implementation

The algorithm generates all possible k-mers of a specified length using specific Java implementation details:
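One common way to enumerate k-mers without recursion is to count through the integers 0 to 4^k - 1 and decode each value as k two-bit symbols. The sketch below assumes that scheme and an A=0, C=1, G=2, T=3 encoding. Neither is confirmed for MakePolymers.java, and the function names are hypothetical.

```python
SYMBOLS = "ACGT"  # assumed 2-bit code: A=0, C=1, G=2, T=3


def decode_kmer(value, k):
    """Decode an integer in [0, 4**k) into a k-base string, high bits first."""
    bases = []
    for shift in range(2 * (k - 1), -1, -2):
        bases.append(SYMBOLS[(value >> shift) & 3])
    return "".join(bases)


def enumerate_kmers(k):
    """Yield every k-mer by counting through all 2k-bit integers."""
    for value in range(1 << (2 * k)):
        yield decode_kmer(value, k)


if __name__ == "__main__":
    print(list(enumerate_kmers(1)))
```

Counting through integers keeps the enumeration to a single loop and visits each k-mer exactly once, since every value in the range decodes to a distinct string.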

Length Calculation Implementation

The tool calculates minimum repeats using specific arithmetic operations in the writeSequence() method:

Output Format and Naming Implementation

Each generated sequence follows a systematic naming convention implemented through ByteBuilder operations:

Memory and I/O Implementation Details

The implementation uses specific memory and I/O optimizations:

Use Cases and Applications

MakePolymers serves several important purposes in bioinformatics workflows:

Support

For questions and support: