MakeChimeras

Basic Usage

makechimeras.sh in=<input> out=<output> chimeras=<integer>

Creates artificial chimeric sequences by randomly fusing together pieces from input reads. The tool is designed primarily for generating synthetic chimeric PacBio reads for testing and validation.

Parameters

Parameters are organized by their function in the chimera creation process.

Input Parameters

in=<file>: The input file containing nonchimeric reads. Can be fasta or fastq format, compressed or uncompressed.
unpigz=t: Decompress gzipped input with pigz (parallel gzip), when available, for faster decompression.

Output Parameters

out=<file>: FASTA output destination. The output file will contain the generated chimeric sequences.
chimeras=-1: Number of chimeras to create (required). The default of -1 is only a placeholder; set this to a positive integer giving the number of synthetic chimeric sequences to generate.
forcelength=0: If set to a positive number X, one parent piece will be length X, and the other will be length-X. This fixes how the length of each chimera is split between its two pieces. When set to 0, piece lengths are chosen randomly.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Basic Chimera Generation

makechimeras.sh in=input_reads.fasta out=chimeric_reads.fasta chimeras=1000

Creates 1000 synthetic chimeric sequences from the input reads by randomly selecting and fusing pieces from different source sequences.

Forced Length Distribution

makechimeras.sh in=pacbio_reads.fasta out=test_chimeras.fasta chimeras=500 forcelength=5000

Generates 500 chimeric sequences where each chimera is constructed with one piece of exactly 5000bp and another piece that makes up the remaining length.

Processing Compressed Input

makechimeras.sh in=reads.fasta.gz out=chimeras.fasta chimeras=2000 unpigz=t

Processes compressed input using parallel gzip decompression for better performance on large datasets.

Algorithm Details

Chimera Construction Process

MakeChimeras obtains a Random instance from Shared.threadLocalRandom() and uses Random.nextInt() to select parent reads and fragment coordinates when creating synthetic chimeric sequences:

Random Read Selection Strategy

ArrayList Source Storage: All input reads are stored in an ArrayList<Read> for constant-time random access via source.get(randy.nextInt(mod))
Independent Selection: Parent reads are chosen independently, allowing the same read to potentially contribute to multiple chimeras or even both parts of a single chimera
Memory Efficiency: All source reads are loaded into memory once and reused for random access during chimera generation
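
The selection strategy above can be illustrated with a minimal Python sketch. It is not the tool's Java source: pick_parent_indices and the sample reads are hypothetical names introduced only for illustration, and rng.randrange(mod) stands in for randy.nextInt(mod).

```python
import random

def pick_parent_indices(num_reads, rng):
    """Pick two parent indices independently; both may point at the same read."""
    mod = num_reads
    i = rng.randrange(mod)  # stands in for randy.nextInt(mod)
    j = rng.randrange(mod)  # second draw does not depend on the first
    return i, j

# Hypothetical in-memory read list, loaded once and reused for every draw.
reads = ["ACGTACGT", "GGGGCCCC", "TTTTAAAA"]
rng = random.Random(1)
pairs = [pick_parent_indices(len(reads), rng) for _ in range(1000)]

# Count draws in which one read supplies both halves of a chimera.
same_parent = sum(1 for i, j in pairs if i == j)
```

Because the two draws are independent, a read is paired with itself in roughly 1 out of every len(reads) draws in this sketch.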

Fragment Extraction Methods

The tool uses different strategies for extracting fragments from parent reads:

Random Length Mode (default): Fragment length is chosen using randy.nextInt(a.length())+1, which yields a length uniformly distributed from 1 to the full read length, inclusive
Force Length Mode: When the forcelength parameter is set, getPiece(Read, Random, int) requests a fragment of the given length, capped at the parent read length by Tools.min(len, a.length()), so a parent shorter than the requested length yields a shorter fragment
Position Selection: Fragment start positions are chosen using weighted random selection with three strategies:
- 50% probability: Random internal position using randy.nextInt(range+1)
- 25% probability: Start from beginning (position 0)
- 25% probability: End-anchored (position = read_length - fragment_length)
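
As a rough illustration of the length and position rules above, here is a Python sketch. It is an assumption-laden stand-in rather than the tool's code: get_piece and forced_len are hypothetical names, a single random roll is used to reproduce the 50/25/25 split, and returning None for a non-positive length is an assumption of this sketch.

```python
import random

def get_piece(seq, rng, forced_len=None):
    """Cut one fragment from seq; returns (start, fragment) or None."""
    if forced_len is None:
        # Random length mode: uniform over 1..len(seq), inclusive.
        length = rng.randrange(len(seq)) + 1
    else:
        # Forced length mode: the request is capped at the read length.
        length = min(forced_len, len(seq))
        if length < 1:
            return None  # assumption: nothing usable can be cut
    span = len(seq) - length  # largest legal start position
    roll = rng.random()
    if roll < 0.5:
        start = rng.randrange(span + 1)  # 50%: random internal position
    elif roll < 0.75:
        start = 0                        # 25%: anchored at the beginning
    else:
        start = span                     # 25%: anchored at the end
    return start, seq[start:start + length]
```

For example, get_piece("ACGT", random.Random(0), 10) returns the whole read, because the requested 10 bases are capped at the read length of 4.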

Sequence Assembly and Post-Processing

Direct Concatenation: Fragments are joined using array-copying loops in the makeChimera() method
Quality Score Preservation: When quality arrays are present, they are concatenated in a separate loop from the one that copies bases
Random Orientation: Tools.nextBoolean(randy) decides, with 50% probability, whether the chimera is reverse-complemented via r.reverseComplement()
ID Generation: Chimeric sequence IDs combine the parent IDs with " ~ " separator for traceability
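
The assembly steps above can be sketched in Python as follows. This is a simplified illustration, not the makeChimera() source: join_pieces, reverse_complement, and the (id, bases, quals) tuple layout are hypothetical, and keeping qualities only when both pieces have them is an assumption of this sketch.

```python
import random

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(bases, quals):
    """Reverse-complement the bases and reverse the qualities to match."""
    rc = bases.translate(COMPLEMENT)[::-1]
    return rc, (quals[::-1] if quals is not None else None)

def join_pieces(a, b, rng):
    """Join two (id, bases, quals) pieces into one chimera tuple."""
    a_id, a_bases, a_quals = a
    b_id, b_bases, b_quals = b
    bases = a_bases + b_bases  # direct concatenation of the two fragments
    # Assumption: qualities survive only if both pieces carry them.
    if a_quals is not None and b_quals is not None:
        quals = a_quals + b_quals
    else:
        quals = None
    if rng.random() < 0.5:  # 50% chance of flipping the whole chimera
        bases, quals = reverse_complement(bases, quals)
    return a_id + " ~ " + b_id, bases, quals
```

Calling join_pieces(("r1", "AAC", [30, 31, 32]), ("r2", "GT", [20, 21]), random.Random(3)) yields the ID "r1 ~ r2" and either "AACGT" or its reverse complement "ACGTT", depending on the random draw.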

Performance Characteristics

Memory Usage: O(n) where n is the total size of input sequences, as all reads are held in memory
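Time Complexity: O(m) chimera constructions, where m is the number of chimeras to generate, with constant-time random access to source reads; each construction also costs time proportional to the length of the fragments it copies
Scalability: Limited by available RAM, since every read is stored in the ArrayList; once loading finishes, the cost per chimera does not depend on the number of input reads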
Thread Safety: Uses Shared.threadLocalRandom() to prevent race conditions during concurrent Random operations

Quality Control Features

Length Validation: Tools.min(len, a.length()) ensures requested fragment lengths don't exceed parent read lengths
Null Handling: makeChimera() returns null for invalid sequences, triggering retry logic with i-- in the generation loop
Progress Tracking: Tools.timeReadsBasesProcessed() provides timing statistics for both input processing and chimera generation phases
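
The null-handling retry described above can be sketched in Python. This is only an illustration of the pattern, not the tool's loop: generate and toy_make_chimera are hypothetical names, and a while loop that skips failed attempts plays the role of the i-- adjustment in a counted for loop.

```python
import random

def toy_make_chimera(a, b, rng):
    """Hypothetical stand-in: fails (returns None) if either parent is empty."""
    if not a or not b:
        return None
    return a + b

def generate(source, count, rng, make_chimera):
    """Produce exactly `count` chimeras, retrying whenever one attempt fails."""
    out = []
    attempts = 0
    while len(out) < count:
        attempts += 1
        a = source[rng.randrange(len(source))]
        b = source[rng.randrange(len(source))]
        chimera = make_chimera(a, b, rng)
        if chimera is None:
            continue  # failed attempt does not count toward the total
        out.append(chimera)
    return out, attempts

chimeras, attempts = generate(["ACGT", "", "GG"], 50, random.Random(7), toy_make_chimera)
```

A loop of this shape always delivers the requested number of chimeras, but in this sketch it would never terminate if no pair of parents could succeed, so the input should contain at least some usable reads.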

Technical Notes

Input Requirements

Input must be single-ended reads. This is checked by assert(r1.mate==null), which only takes effect while Java assertions are enabled (the -da flag disables them)
Supports both FASTA and FASTQ formats via ConcurrentReadInputStream
Compressed files (gzip) are automatically detected and handled through FileFormat.testInput()
Quality scores are preserved if present using KillSwitch.copyOfRange() on quality arrays

Memory Considerations

All input reads are loaded into an ArrayList<Read> so that they can be selected by index via Random.nextInt() during generation
Memory usage includes Read objects with bases, quality arrays, and ID strings
For large datasets, ensure sufficient RAM or process smaller input files in batches
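
To get a feel for how much memory a dataset needs, the payload can be estimated with a short Python sketch. It is a rough lower bound under stated assumptions, not a measurement of the tool: estimate_payload_bytes is a hypothetical helper, it assumes one byte per base, one byte per quality score, and one byte per ID character, and it ignores all JVM object and array overhead.

```python
def estimate_payload_bytes(reads):
    """Sum raw payload bytes for (id, bases, quals_or_None) records.

    Assumption: 1 byte per base, per quality score, and per ID character;
    JVM object headers and ArrayList overhead are not counted.
    """
    total = 0
    for read_id, bases, quals in reads:
        total += len(bases)
        if quals is not None:
            total += len(quals)
        total += len(read_id)
    return total

# Hypothetical records: one FASTA-style read and one FASTQ-style read.
sample = [("r1", "ACGT", None), ("r2", "AC", "II")]
payload = estimate_payload_bytes(sample)
```

The real footprint will be larger than this figure, so it is best treated as a floor when choosing a value for -Xmx.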

Output Characteristics

Output format is always FASTA regardless of input format
Chimeric sequence lengths vary based on parent read sizes and random selection
Each output sequence can be traced back to its parent reads via the sequence ID
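
Tracing a chimera back to its parents amounts to splitting the ID on the " ~ " separator. The Python sketch below shows one way to do that; parent_ids is a hypothetical helper, not part of the tool, and it assumes that the first parent ID does not itself contain " ~ ".

```python
def parent_ids(chimera_id, sep=" ~ "):
    """Split a chimera ID into its two parent IDs at the first separator."""
    left, found, right = chimera_id.partition(sep)
    if not found:
        raise ValueError("no parent separator in ID: " + repr(chimera_id))
    return left, right

first, second = parent_ids("readA ~ readB")
```

If parent read names may contain " ~ " themselves, the split becomes ambiguous and the reads should be renamed before generating chimeras.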

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org