MakePolymers
Creates polymer sequences. Can be used in conjunction with mutate.sh to generate low-complexity sequence.
Basic Usage
makepolymers.sh out=<output file> k=<repeat length> minlen=<sequence length>
Generates synthetic polymer sequences by creating all possible k-mers of a specified length and repeating them to ensure complete coverage of the k-mer space.
Parameters
Parameters are organized by their function in the polymer generation process.
I/O Parameters
- out=<file>
- Output genome file. The generated polymer sequences will be written in FASTA format to this file.
- overwrite=f
- (ow) Set to false to force the program to abort rather than overwrite an existing file. Default: true
Processing Parameters
- k=1
- Length of repeating polymeric units. This determines the size of k-mers that will be generated. All possible k-mers of this length will be created. To generate a sweep of multiple values of k, specify both mink and maxk. Default: 1
- mink=1
- Minimum k-mer length when generating a range of k values. Used in conjunction with maxk to create polymers for multiple k-mer sizes. Default: same as k
- maxk=1
- Maximum k-mer length when generating a range of k values. Used in conjunction with mink to create polymers for multiple k-mer sizes. Default: same as k
- minlen=31
- Ensure sequences are at least this long. Specifically, minlen=X ensures each sequence is long enough to contain every X-mer that its repeating unit can produce (one per rotation of the unit), so the output may be slightly longer than X. The algorithm calculates the minimum number of repeats needed to achieve this. Default: 31
- verbose=f
- Enable verbose output for debugging and detailed progress information. Default: false
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default: auto-detected based on available memory
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for slightly improved performance in production environments.
Examples
Generate Monomers
makepolymers.sh out=monomers.fa k=1 minlen=100
Creates sequences containing repeated single nucleotides (A, T, G, C), each repeated enough times to create sequences of at least 100 bp. This generates 4 sequences: one for each nucleotide.
Generate Dimers
makepolymers.sh out=dimers.fa k=2 minlen=200
Creates sequences containing all possible 2-mers (AA, AT, AG, AC, TA, TT, etc.), each repeated to create sequences of at least 200 bp. This generates 16 sequences (4² = 16 possible 2-mers).
Generate Range of K-mers
makepolymers.sh out=polymers.fa mink=1 maxk=3 minlen=150
Creates polymer sequences for k-mer lengths 1, 2, and 3. This will generate 4 + 16 + 64 = 84 sequences total, covering all possible 1-mers, 2-mers, and 3-mers.
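The sequence count for a sweep follows directly from 4^k sequences per value of k. The sketch below only reproduces that arithmetic; the class and method names are hypothetical and are not part of BBTools.

```java
class SequenceCountSketch {

    // Total number of sequences for a sweep: the sum of 4^k for k in [mink, maxk].
    static long totalSequences(int mink, int maxk) {
        long total = 0;
        for (int k = mink; k <= maxk; k++) {
            total += 1L << (2 * k); // 4^k == 2^(2k)
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalSequences(1, 3)); // prints 84
    }
}
```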
Use with Mutation Tool
makepolymers.sh out=polymers.fa k=4 minlen=1000
mutate.sh in=polymers.fa out=mutated_polymers.fa rate=0.01
First generates 4-mer polymer sequences, then introduces random mutations at a 1% rate to create low-complexity sequences with some variation for testing purposes.
Algorithm Details
MakePolymers generates synthetic polymer sequences through a systematic enumeration approach implemented in the writeSequence() method:
K-mer Enumeration Implementation
The algorithm enumerates every k-mer of the specified length by counting through 2-bit-encoded integers:
- Bit-shift Generation: For k-mer length k, generates all 4^k possible combinations using the upper bound final long max=(1<<(2*k))-1
- Two-bit Encoding: Uses AminoAcid.numberToBase[x] array lookup where each nucleotide is represented by 2 bits extracted via (kmer>>(2*i))&3
- Sequential Iteration: Loops through all numbers from 0 to max, converting each to its corresponding k-mer sequence using the toBytes() static method
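The enumeration steps above can be sketched as follows. This is a minimal illustration, not the BBTools source. The local NUMBER_TO_BASE table stands in for AminoAcid.numberToBase and assumes the code order A=0, C=1, G=2, T=3. The most-significant-base-first loop order is also an assumption. The sketch writes 1L in the shift so the bound is computed in 64-bit arithmetic, which differs slightly from the expression quoted above.

```java
import java.util.ArrayList;
import java.util.List;

class KmerEnumerationSketch {

    // Assumed 2-bit code order; a stand-in for AminoAcid.numberToBase.
    static final char[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'};

    // Decode a 2-bit-packed k-mer, most significant base first (assumed ordering).
    static String decode(long kmer, int k) {
        StringBuilder sb = new StringBuilder(k);
        for (int i = k - 1; i >= 0; i--) {
            int x = (int) ((kmer >> (2 * i)) & 3); // extract the 2 bits for base i
            sb.append(NUMBER_TO_BASE[x]);
        }
        return sb.toString();
    }

    // Enumerate every k-mer by counting from 0 to 4^k - 1.
    static List<String> allKmers(int k) {
        final long max = (1L << (2 * k)) - 1;
        List<String> kmers = new ArrayList<>();
        for (long kmer = 0; kmer <= max; kmer++) {
            kmers.add(decode(kmer, k));
        }
        return kmers;
    }

    public static void main(String[] args) {
        System.out.println(allKmers(1)); // prints [A, C, G, T]
        System.out.println(allKmers(2).size()); // prints 16
    }
}
```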
Length Calculation Implementation
The writeSequence() method calculates the minimum number of repeats with integer arithmetic:
- Target Length Calculation: Computes minLen2=((minLen+k-1)/k)*k to ensure the final length is a multiple of k
- Repeat Count Logic: Uses conditional logic: if minLen2-minLen>=k-1 then minCount=minLen2/k, otherwise minCount=minLen2/k+1
- Coverage Implementation: Each k-mer is repeated exactly minCount times in a for-loop to guarantee complete coverage
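The repeat-count arithmetic above can be worked through for the minlen and k values used in the Examples section. The sketch below copies the two quoted expressions into hypothetical helper methods. It is an illustration of the arithmetic, not the tool's code. With these rules the resulting length is always at least minLen+k-1.

```java
class RepeatCountSketch {

    // Number of times a k-length unit is repeated, following the rules quoted above.
    static int minCount(int minLen, int k) {
        final int minLen2 = ((minLen + k - 1) / k) * k; // minLen rounded up to a multiple of k
        if (minLen2 - minLen >= k - 1) {
            return minLen2 / k;
        }
        return minLen2 / k + 1;
    }

    // Final sequence length in bases.
    static int sequenceLength(int minLen, int k) {
        return minCount(minLen, k) * k;
    }

    public static void main(String[] args) {
        int[][] cases = {{100, 1}, {200, 2}, {31, 3}, {1000, 4}}; // {minlen, k}
        for (int[] c : cases) {
            System.out.println("minlen=" + c[0] + " k=" + c[1]
                    + " repeats=" + minCount(c[0], c[1])
                    + " length=" + sequenceLength(c[0], c[1]));
        }
        // minlen=100 k=1 repeats=100 length=100
        // minlen=200 k=2 repeats=101 length=202
        // minlen=31 k=3 repeats=11 length=33
        // minlen=1000 k=4 repeats=251 length=1004
    }
}
```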
Output Format and Naming Implementation
Each generated sequence follows a systematic naming convention implemented through ByteBuilder operations:
- Header Construction: Uses bb.append('>').append(k).append('_').append(kmer).append('\n') to create headers
- Sequence Generation: Calls toBytes(kmer, k, bb) method that extracts nucleotides using bit shifting and AminoAcid lookup
- FASTA Formatting: Uses ByteBuilder.nl() to add newlines and ByteStreamWriter for output formatting
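To make the record layout concrete, the sketch below assembles one FASTA record with a plain StringBuilder in place of ByteBuilder. It assumes that append(kmer) writes the decimal value of the encoded k-mer, so headers look like >2_6. It also assumes the base order A=0, C=1, G=2, T=3. All names are hypothetical.

```java
class FastaRecordSketch {

    // Assumed 2-bit code order; a stand-in for AminoAcid.numberToBase.
    static final char[] NUMBER_TO_BASE = {'A', 'C', 'G', 'T'};

    // Append one copy of the k-mer's bases, most significant base first (assumed).
    static void appendUnit(long kmer, int k, StringBuilder sb) {
        for (int i = k - 1; i >= 0; i--) {
            sb.append(NUMBER_TO_BASE[(int) ((kmer >> (2 * i)) & 3)]);
        }
    }

    // Build a header line followed by the unit repeated minCount times.
    static String record(long kmer, int k, int minCount) {
        StringBuilder sb = new StringBuilder();
        sb.append('>').append(k).append('_').append(kmer).append('\n');
        for (int i = 0; i < minCount; i++) {
            appendUnit(kmer, k, sb);
        }
        sb.append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(record(6, 2, 3));
        // >2_6
        // CGCGCG
    }
}
```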
Memory and I/O Implementation Details
The implementation uses specific memory and I/O optimizations:
- Buffer Management: Uses 16,384-byte ByteBuilder buffers with conditional flushing when bb.length>=16384
- Streaming Output: Employs ByteStreamWriter.start() and ByteStreamWriter.print() for concurrent I/O operations
- Memory Allocation: Sets default heap space to 4GB via -Xmx4g and -Xms4g JVM parameters in shell script
- Statistics Tracking: Maintains readsProcessed and basesProcessed counters incremented during sequence generation
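The buffer-and-flush pattern described above can be sketched as follows. A StringBuilder stands in for ByteBuilder, and a Consumer<String> is a hypothetical stand-in for the ByteStreamWriter.print() call. Only the 16384 threshold comes from the text above. The final flush of leftover data is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class BufferedFlushSketch {

    static final int FLUSH_THRESHOLD = 16384;

    // Accumulate records in a buffer and hand it to the sink once it reaches
    // the threshold. Returns the number of flushes performed.
    static int writeAll(Iterable<String> records, Consumer<String> sink) {
        StringBuilder bb = new StringBuilder();
        int flushes = 0;
        for (String record : records) {
            bb.append(record);
            if (bb.length() >= FLUSH_THRESHOLD) {
                sink.accept(bb.toString());
                bb.setLength(0); // reuse the buffer
                flushes++;
            }
        }
        if (bb.length() > 0) { // write whatever is left at the end
            sink.accept(bb.toString());
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        List<String> records = Arrays.asList(">1_0\nAAAA\n", ">1_1\nCCCC\n");
        int flushes = writeAll(records, out::append);
        System.out.println(flushes + " flush(es), " + out.length() + " chars written");
    }
}
```

Batching writes this way keeps the number of calls into the output stream small even when millions of short records are generated.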
Use Cases and Applications
MakePolymers serves several important purposes in bioinformatics workflows:
- Tool Testing: Creates controlled synthetic datasets for testing sequence analysis tools
- Algorithm Validation: Provides known-content sequences for validating k-mer counting and analysis algorithms
- Low-Complexity Generation: When combined with mutation tools, creates realistic low-complexity sequences
- Reference Creation: Generates complete k-mer space references for comparative analysis
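As an illustration of the validation use case, the sketch below counts the distinct windows of a given length in a polymer. A k-mer counter under test should report the same number. Here a polymer means a unit repeated end to end, and the helper names are hypothetical. A unit whose rotations are all different, such as ACG, yields one distinct window per rotation once the sequence is long enough. A homopolymer unit such as AAA yields exactly one.

```java
import java.util.HashSet;
import java.util.Set;

class KmerContentCheckSketch {

    // Build a polymer by repeating a unit a fixed number of times.
    static String repeat(String unit, int count) {
        StringBuilder sb = new StringBuilder(unit.length() * count);
        for (int i = 0; i < count; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    // Count the distinct substrings of the given window length.
    static int distinctWindows(String seq, int window) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + window <= seq.length(); i++) {
            seen.add(seq.substring(i, i + window));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // 11 repeats of a 3-mer gives 33 bases, enough for all 3 rotations of a 31-mer.
        System.out.println(distinctWindows(repeat("ACG", 11), 31)); // prints 3
        System.out.println(distinctWindows(repeat("AAA", 11), 31)); // prints 1
    }
}
```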
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org