CutPrimers
Cuts out sequences between primers identified in sam files. Intended for use with sam files generated by msa.sh; one sam file for the forward primer, and one for the reverse primer.
Basic Usage
cutprimers.sh in=<file> out=<file> sam1=<file> sam2=<file>
CutPrimers extracts sequences located between two primer binding sites. It requires SAM files containing the mapped positions of both forward and reverse primers, typically generated using msa.sh (Multiple Sequence Alignment). The tool identifies primer locations in the input sequences and extracts the region between them, with options to include or exclude the primer sequences themselves.
Parameters
Parameters control input/output files, primer handling behavior, and sequence extraction options. The tool requires both primer SAM files to function correctly.
Input/Output Parameters
- in=<file>
- File containing reads to process. Supports FASTA and FASTQ formats. Use in=stdin.fa to pipe from standard input.
- out=<file>
- Output file for extracted sequences between primers. Use out=stdout to pipe to standard output. Output format matches input format.
- sam1=<file>
- SAM file containing mapped locations of the first primer sequence (typically forward primer). Must contain alignment positions for primer sequences against the input reads.
- sam2=<file>
- SAM file containing mapped locations of the second primer sequence (typically reverse primer). Must contain alignment positions for primer sequences against the input reads.
Extraction Behavior
- fake=t
- Generate fake output reads when primers are not found. When set to true (default), outputs a single 'N' base for reads where primers cannot be located or where primer regions overlap. When false, no output is generated for such reads.
- include=f
- Include the flanking primer sequences in the extracted output. When false (default), only the sequence between primers is extracted. When true, the extracted sequence includes both primer regions plus the sequence between them.
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding automatic memory detection. Use format like -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default allocation is 1GB for this tool.
- -eoom
- Exit on out-of-memory exception. Causes the process to terminate immediately if Java runs out of memory. Requires Java 8u92 or later.
- -da
- Disable Java assertions. May provide slight performance improvement in production use.
Examples
Basic Primer Cutting
cutprimers.sh in=sequences.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam
Extracts sequences between forward and reverse primers, excluding the primer sequences themselves from the output.
Including Primer Sequences
cutprimers.sh in=sequences.fq out=extracted_with_primers.fq sam1=forward_primer.sam sam2=reverse_primer.sam include=t
Extracts sequences including both primer sequences in the output, useful when the full amplicon sequence is needed.
Skipping Failed Extractions
cutprimers.sh in=sequences.fq out=extracted_only.fq sam1=forward_primer.sam sam2=reverse_primer.sam fake=f
Only outputs successfully extracted sequences, omitting reads where primers cannot be found or regions overlap.
Processing with Memory Optimization
cutprimers.sh -Xmx8g in=large_dataset.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam
Processes large datasets with increased memory allocation for better performance.
Algorithm Details
Primer Location Processing
CutPrimers uses LinkedHashMap structures and coordinate arithmetic to process SAM alignment data and identify primer binding locations:
- SAM File Parsing: The toSamLines() method loads primer SAM files into LinkedHashMap<String, SamLine> structures, organizing alignments by read name using SamLine.rname() for lookup during sequence processing
- Coordinate Calculation: For each read, extracts start and stop positions using SamLine.start() and SamLine.stop() methods with boundary enforcement, accounting for alignment boundaries and sequence length constraints
- Overlap Detection: Uses Tools.overlap() method to check for overlapping primer regions through coordinate comparison - when primers overlap, the read is typically discarded or a fake 'N' sequence is generated
- Boundary Management: Uses Tools.mid() function to ensure all coordinates remain within valid sequence boundaries, preventing array access errors
Sequence Extraction Strategy
The extraction process implements different strategies based on primer arrangement and user preferences:
- Non-overlapping Primers: When Tools.overlap() returns false, determines extraction boundaries based on relative primer positions and INCLUDE_PRIMERS flag setting
- Include Mode: When INCLUDE_PRIMERS=true, extracts from the start of the earlier primer to the end of the later primer using coordinate comparison (a1<a2), preserving primer sequences
- Exclude Mode: When INCLUDE_PRIMERS=false (default), extracts only the sequence between primer endpoints using KillSwitch.copyOfRange(), excluding primer sequences themselves
- Quality Preservation: Maintains quality scores corresponding to extracted base positions using KillSwitch.copyOfRange() on both bases and quality arrays when quality data is available
Performance Characteristics
- Memory Usage: Maintains primer alignments in memory using LinkedHashMap<String, SamLine> structures, with default 1GB allocation (-Xmx1g) suitable for most datasets
- Processing Speed: Linear time complexity relative to input read count, with constant-time primer lookup per read
- I/O Efficiency: Uses concurrent streaming for input/output operations, processing reads in batches to minimize file system overhead
- Error Handling: Gracefully handles missing primer alignments, coordinate boundary violations, and overlapping primer regions
Coordinate Logic
The tool implements precise coordinate arithmetic for different primer configurations:
- Forward-Reverse Order: When primer1 appears before primer2, extraction boundaries are calculated as either (primer1_end + 1, primer2_start) for exclude mode or (primer1_start, primer2_end + 1) for include mode
- Reverse-Forward Order: When primer2 appears before primer1, boundaries are swapped accordingly to maintain proper extraction logic
- Boundary Validation: All coordinates are validated against sequence length and adjusted using Tools.mid() to prevent out-of-bounds access
Use Cases
Amplicon Sequence Extraction
Primary application for extracting amplified regions from PCR products or targeted sequencing data where primer sequences need to be removed from analysis.
Targeted Region Analysis
Useful for isolating specific genomic regions bounded by known primer sequences, particularly in metagenomics or environmental sequencing projects.
Quality Control
Can identify reads with missing or improperly aligned primers, helping assess PCR amplification success and primer binding efficiency.
Pipeline Integration
Designed to work with msa.sh for primer alignment generation, forming part of larger sequence processing workflows for targeted sequencing analysis.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org