CutPrimers

Basic Usage

cutprimers.sh in=<file> out=<file> sam1=<file> sam2=<file>

CutPrimers extracts sequences located between two primer binding sites. It requires SAM files containing the mapped positions of both forward and reverse primers, typically generated using msa.sh (Multiple Sequence Alignment). The tool identifies primer locations in the input sequences and extracts the region between them, with options to include or exclude the primer sequences themselves.

Parameters

Parameters control input/output files, primer handling behavior, and sequence extraction options. The tool requires both primer SAM files to function correctly.

Input/Output Parameters

in=<file>: File containing reads to process. Supports FASTA and FASTQ formats. Use in=stdin.fa to pipe from standard input.
out=<file>: Output file for extracted sequences between primers. Use out=stdout to pipe to standard output. Output format matches input format.
sam1=<file>: SAM file containing mapped locations of the first primer sequence (typically forward primer). Must contain alignment positions for primer sequences against the input reads.
sam2=<file>: SAM file containing mapped locations of the second primer sequence (typically reverse primer). Must contain alignment positions for primer sequences against the input reads.

Extraction Behavior

fake=t: Generate fake output reads when primers are not found. When set to true (default), outputs a single 'N' base for reads where primers cannot be located or where primer regions overlap. When false, no output is generated for such reads.
include=f: Include the flanking primer sequences in the extracted output. When false (default), only the sequence between primers is extracted. When true, the extracted sequence includes both primer regions plus the sequence between them.

Java Parameters

-Xmx: Sets Java's memory usage, overriding automatic memory detection. Use format like -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default allocation is 1GB for this tool.
-eoom: Exit on out-of-memory exception. Causes the process to terminate immediately if Java runs out of memory. Requires Java 8u92 or later.
-da: Disable Java assertions. May provide slight performance improvement in production use.

Examples

Basic Primer Cutting

cutprimers.sh in=sequences.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam

Extracts sequences between forward and reverse primers, excluding the primer sequences themselves from the output.

Including Primer Sequences

cutprimers.sh in=sequences.fq out=extracted_with_primers.fq sam1=forward_primer.sam sam2=reverse_primer.sam include=t

Extracts sequences including both primer sequences in the output, useful when the full amplicon sequence is needed.

Skipping Failed Extractions

cutprimers.sh in=sequences.fq out=extracted_only.fq sam1=forward_primer.sam sam2=reverse_primer.sam fake=f

Only outputs successfully extracted sequences, omitting reads where primers cannot be found or regions overlap.

Processing with Memory Optimization

cutprimers.sh -Xmx8g in=large_dataset.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam

Processes large datasets with increased memory allocation for better performance.

Algorithm Details

Primer Location Processing

CutPrimers uses LinkedHashMap structures and coordinate arithmetic to process SAM alignment data and identify primer binding locations:

SAM File Parsing: The toSamLines() method loads primer SAM files into LinkedHashMap<String, SamLine> structures, organizing alignments by read name using SamLine.rname() for lookup during sequence processing
Coordinate Calculation: For each read, extracts start and stop positions using SamLine.start() and SamLine.stop() methods with boundary enforcement, accounting for alignment boundaries and sequence length constraints
Overlap Detection: Uses Tools.overlap() method to check for overlapping primer regions through coordinate comparison - when primers overlap, the read is typically discarded or a fake 'N' sequence is generated
Boundary Management: Uses Tools.mid() function to ensure all coordinates remain within valid sequence boundaries, preventing array access errors

Sequence Extraction Strategy

The extraction process implements different strategies based on primer arrangement and user preferences:

Non-overlapping Primers: When Tools.overlap() returns false, determines extraction boundaries based on relative primer positions and INCLUDE_PRIMERS flag setting
Include Mode: When INCLUDE_PRIMERS=true, extracts from the start of the earlier primer to the end of the later primer using coordinate comparison (a1<a2), preserving primer sequences
Exclude Mode: When INCLUDE_PRIMERS=false (default), extracts only the sequence between primer endpoints using KillSwitch.copyOfRange(), excluding primer sequences themselves
Quality Preservation: Maintains quality scores corresponding to extracted base positions using KillSwitch.copyOfRange() on both bases and quality arrays when quality data is available

Performance Characteristics

Memory Usage: Maintains primer alignments in memory using LinkedHashMap<String, SamLine> structures, with default 1GB allocation (-Xmx1g) suitable for most datasets
Processing Speed: Linear time complexity relative to input read count, with constant-time primer lookup per read
I/O Efficiency: Uses concurrent streaming for input/output operations, processing reads in batches to minimize file system overhead
Error Handling: Gracefully handles missing primer alignments, coordinate boundary violations, and overlapping primer regions

Coordinate Logic

The tool implements precise coordinate arithmetic for different primer configurations:

Forward-Reverse Order: When primer1 appears before primer2, extraction boundaries are calculated as either (primer1_end + 1, primer2_start) for exclude mode or (primer1_start, primer2_end + 1) for include mode
Reverse-Forward Order: When primer2 appears before primer1, boundaries are swapped accordingly to maintain proper extraction logic
Boundary Validation: All coordinates are validated against sequence length and adjusted using Tools.mid() to prevent out-of-bounds access

Use Cases

Amplicon Sequence Extraction

Primary application for extracting amplified regions from PCR products or targeted sequencing data where primer sequences need to be removed from analysis.

Targeted Region Analysis

Useful for isolating specific genomic regions bounded by known primer sequences, particularly in metagenomics or environmental sequencing projects.

Quality Control

Can identify reads with missing or improperly aligned primers, helping assess PCR amplification success and primer binding efficiency.

Pipeline Integration

Designed to work with msa.sh for primer alignment generation, forming part of larger sequence processing workflows for targeted sequencing analysis.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org