MakeChimeras
Makes chimeric sequences from nonchimeric sequences. Designed for PacBio reads.
Basic Usage
makechimeras.sh in=<input> out=<output> chimeras=<integer>
Creates artificial chimeric sequences by fusing randomly chosen pieces of input reads. The tool is intended for generating synthetic chimeric PacBio reads for testing and validation.
Parameters
Parameters are organized by their function in the chimera creation process.
Input Parameters
- in=<file>
- The input file containing nonchimeric reads. Can be fasta or fastq format, compressed or uncompressed.
- unpigz=t
- Decompress gzipped input with pigz (a parallel implementation of gzip) when it is available, for faster decompression.
Output Parameters
- out=<file>
- Fasta output destination. The output file will contain the generated chimeric sequences.
- chimeras=-1
- Number of chimeras to create (required). Must be set to a positive integer.
- forcelength=0
- If a positive number X, one parent will be length X, and the other will be length-X. This forces specific length distributions in the chimeric products. When set to 0, random length selection is used.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Basic Chimera Generation
makechimeras.sh in=input_reads.fasta out=chimeric_reads.fasta chimeras=1000
Creates 1000 synthetic chimeric sequences from the input reads by randomly selecting and fusing pieces from different source sequences.
Forced Length Distribution
makechimeras.sh in=pacbio_reads.fasta out=test_chimeras.fasta chimeras=500 forcelength=5000
Generates 500 chimeric sequences in which one piece is 5000 bp long (or the whole parent read, if the parent is shorter than 5000 bp) and the other piece makes up the remaining length.
Processing Compressed Input
makechimeras.sh in=reads.fasta.gz out=chimeras.fasta chimeras=2000 unpigz=t
Processes compressed input using parallel gzip decompression for better performance on large datasets.
Algorithm Details
Chimera Construction Process
MakeChimeras builds each synthetic chimera from random draws made with Random.nextInt() on the generator returned by Shared.threadLocalRandom():
Random Read Selection Strategy
- ArrayList Source Storage: All input reads are stored in an ArrayList<Read> for constant-time random access via source.get(randy.nextInt(mod))
- Independent Selection: Parent reads are chosen independently, allowing the same read to potentially contribute to multiple chimeras or even both parts of a single chimera
- Memory Efficiency: All source reads are loaded into memory once and reused for random access during chimera generation
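The selection strategy above can be sketched in a few lines of Java. This is a minimal illustration, not the MakeChimeras source: the class name ParentPickSketch and the method pick are hypothetical, plain strings stand in for Read objects, and java.util.Random stands in for the generator returned by Shared.threadLocalRandom().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names are hypothetical, not BBTools API.
final class ParentPickSketch {

    // Pick one element uniformly at random; ArrayList.get runs in constant time.
    static <T> T pick(List<T> source, Random randy) {
        int mod = source.size(); // must be > 0, otherwise nextInt throws
        return source.get(randy.nextInt(mod));
    }

    public static void main(String[] args) {
        // Strings stand in for reads loaded into memory once.
        List<String> source = new ArrayList<>(Arrays.asList("readA", "readB", "readC"));
        Random randy = new Random();

        // Two independent draws: the same read may come back twice.
        String a = pick(source, randy);
        String b = pick(source, randy);
        System.out.println(a + " ~ " + b);
    }
}
```

Because the two draws are independent, a small input set will often yield the same parent twice; with many input reads this becomes rare.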
Fragment Extraction Methods
The tool uses different strategies for extracting fragments from parent reads:
- Random Length Mode (default): Fragment length is chosen using randy.nextInt(a.length())+1, providing uniform distribution from 1 to full read length
- Force Length Mode: When the forcelength parameter is set, getPiece(Read, Random, int) takes a fragment of the requested length, capped at the parent read length via Tools.min(len, a.length())
- Position Selection: The fragment start position is chosen by one of three strategies, picked at random with fixed probabilities:
  - 50% probability: Random internal position using randy.nextInt(range+1)
  - 25% probability: Start from beginning (position 0)
  - 25% probability: End-anchored (position = read_length - fragment_length)
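The length and position rules above can be sketched as follows. This is a hedged illustration rather than the actual getPiece() code: the class FragmentSketch and its method names are hypothetical, Math.min stands in for Tools.min, and drawing the 50/25/25 split with a single nextInt(4) is an assumption about how the probabilities might be realized.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical, not BBTools API.
final class FragmentSketch {

    // Random length mode: uniform over 1..readLength inclusive (readLength > 0).
    static int randomLength(int readLength, Random randy) {
        return randy.nextInt(readLength) + 1;
    }

    // Force length mode: the requested length, capped at the read length.
    static int forcedLength(int requested, int readLength) {
        return Math.min(requested, readLength);
    }

    // Start position: 50% random internal, 25% at the start, 25% end-anchored.
    static int startPosition(int readLength, int fragLength, Random randy) {
        int range = readLength - fragLength; // largest valid start position
        int roll = randy.nextInt(4);         // assumption: 0..3, each 25%
        if (roll < 2) {
            return randy.nextInt(range + 1); // anywhere in 0..range
        }
        if (roll == 2) {
            return 0;                        // anchored at the beginning
        }
        return range;                        // anchored at the end
    }

    public static void main(String[] args) {
        Random randy = new Random();
        int readLength = 1000;
        int len = randomLength(readLength, randy);
        int start = startPosition(readLength, len, randy);
        System.out.println("fragment [" + start + ", " + (start + len) + ") of " + readLength);
    }
}
```

Note that the "random internal" branch can also land on position 0 or on the end-anchored position, so the read ends are chosen slightly more often than 25% each.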
Sequence Assembly and Post-Processing
- Direct Concatenation: Fragments are joined with array-copying loops in the makeChimera() method
- Quality Score Preservation: When quality arrays are present, they are concatenated in a loop separate from the one that copies bases
- Random Orientation: Tools.nextBoolean(randy) determines 50% probability of reverse complementation via r.reverseComplement()
- ID Generation: Chimeric sequence IDs combine the parent IDs with " ~ " separator for traceability
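A minimal sketch of the assembly steps listed above, under stated assumptions: JoinSketch and its methods are hypothetical names, raw byte arrays stand in for Read objects, and joining qualities only when both parents carry them is one plausible reading of "when present". The text does not say whether the parents or the finished chimera are flipped, so the example flips one fragment purely for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical, not BBTools API.
final class JoinSketch {

    // Concatenate two arrays with explicit copy loops.
    static byte[] join(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        for (int i = 0; i < a.length; i++) { out[i] = a[i]; }
        for (int i = 0; i < b.length; i++) { out[a.length + i] = b[i]; }
        return out;
    }

    // Assumption: qualities are joined only when both parents have them.
    static byte[] joinQuality(byte[] qa, byte[] qb) {
        return (qa == null || qb == null) ? null : join(qa, qb);
    }

    // Complement one uppercase base; anything else is returned unchanged.
    static byte complement(byte base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default:  return base;
        }
    }

    // Reverse the sequence and complement each base.
    static byte[] reverseComplement(byte[] bases) {
        byte[] out = new byte[bases.length];
        for (int i = 0; i < bases.length; i++) {
            out[bases.length - 1 - i] = complement(bases[i]);
        }
        return out;
    }

    // Chimera ID: both parent IDs joined by " ~ ".
    static String chimeraId(String idA, String idB) {
        return idA + " ~ " + idB;
    }

    public static void main(String[] args) {
        byte[] a = "ACGT".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "GGA".getBytes(StandardCharsets.US_ASCII);
        byte[] joined = join(a, reverseComplement(b));
        System.out.println(">" + chimeraId("readA", "readB"));
        System.out.println(new String(joined, StandardCharsets.US_ASCII));
    }
}
```

When a read with quality scores is reverse-complemented, its quality array has to be reversed in the same step so that each score stays attached to its base; the sketch omits this for brevity.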
Performance Characteristics
- Memory Usage: O(n), where n is the total size of the input sequences, because all reads are held in memory
- Time Complexity: O(m) read selections for m chimeras, since each random access to a source read takes constant time; copying fragment bases adds work proportional to the total output length
- Scalability: Limited by the RAM available for the ArrayList of reads; once loading finishes, the per-chimera cost does not depend on the number of input reads
- Thread Safety: Uses Shared.threadLocalRandom() to prevent race conditions during concurrent Random operations
Quality Control Features
- Length Validation: Tools.min(len, a.length()) ensures requested fragment lengths don't exceed parent read lengths
- Null Handling: makeChimera() returns null for invalid sequences, triggering retry logic with i-- in the generation loop
- Progress Tracking: Tools.timeReadsBasesProcessed() provides timing statistics for both input processing and chimera generation phases
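The retry behavior described under Null Handling can be sketched as a loop that decrements its counter whenever an attempt fails. This is an illustration under assumptions, not the tool's generation loop: RetrySketch and generate are hypothetical names, and a Supplier stands in for makeChimera().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch only; names are hypothetical, not BBTools API.
final class RetrySketch {

    // Collect exactly `wanted` non-null results, retrying failed attempts.
    static <T> List<T> generate(int wanted, Supplier<T> maker) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < wanted; i++) {
            T chimera = maker.get();
            if (chimera == null) {
                i--;        // undo the increment so this slot is tried again
                continue;
            }
            out.add(chimera);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in maker that fails on every other call.
        List<String> got = generate(3, () -> (calls[0]++ % 2 == 0) ? null : "chimera");
        System.out.println(got.size() + " chimeras after " + calls[0] + " attempts");
    }
}
```

A loop of this shape never terminates if every attempt fails, so it relies on the input containing at least some reads that can produce a valid chimera.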
Technical Notes
Input Requirements
- Input must be single-ended reads (assert(r1.mate==null) enforces this constraint)
- Supports both FASTA and FASTQ formats via ConcurrentReadInputStream
- Compressed files (gzip) are automatically detected and handled through FileFormat.testInput()
- Quality scores, if present, are carried through by copying the quality arrays with KillSwitch.copyOfRange()
Memory Considerations
- All input reads are loaded into an ArrayList<Read> so that they can be selected by index with Random.nextInt() during generation
- Memory usage includes Read objects with bases, quality arrays, and ID strings
- For large datasets, ensure sufficient RAM or process smaller input files in batches
Output Characteristics
- Output format is always FASTA regardless of input format
- Chimeric sequence lengths vary based on parent read sizes and random selection
- Each output sequence can be traced back to its parent reads via the sequence ID
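As a hedged example of tracing a chimera back to its parents, the ID can be split on the " ~ " separator. ParentIdSketch and parents are hypothetical names, and the sketch assumes that parent IDs do not themselves contain " ~ "; if they do, the split point is ambiguous.

```java
// Illustrative sketch only; names are hypothetical, not BBTools API.
final class ParentIdSketch {

    private static final String SEPARATOR = " ~ ";

    // Split a chimera ID at the first separator; IDs without it come back alone.
    static String[] parents(String chimeraId) {
        int idx = chimeraId.indexOf(SEPARATOR);
        if (idx < 0) {
            return new String[] { chimeraId };
        }
        return new String[] {
            chimeraId.substring(0, idx),
            chimeraId.substring(idx + SEPARATOR.length())
        };
    }

    public static void main(String[] args) {
        for (String parent : parents("readA ~ readB")) {
            System.out.println(parent);
        }
    }
}
```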
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org