CutPrimers

Script: cutprimers.sh
Package: jgi
Class: CutPrimers.java

Cuts out sequences between primers identified in SAM files. Intended for use with SAM files generated by msa.sh: one SAM file for the forward primer and one for the reverse primer.

Basic Usage

cutprimers.sh in=<file> out=<file> sam1=<file> sam2=<file>

CutPrimers extracts the sequence located between two primer binding sites. It requires SAM files containing the mapped positions of both the forward and reverse primers, typically generated with msa.sh. The tool reads the primer locations from these files and extracts the region between them from each input sequence, with options to include or exclude the primer sequences themselves.

Parameters

Parameters control input/output files, primer handling behavior, and sequence extraction options. The tool requires both primer SAM files to function correctly.

Input/Output Parameters

in=<file>
File containing reads to process. Supports FASTA and FASTQ formats. Use in=stdin.fa to pipe from standard input.
out=<file>
Output file for extracted sequences between primers. Use out=stdout to pipe to standard output. Output format matches input format.
sam1=<file>
SAM file containing mapped locations of the first primer sequence (typically forward primer). Must contain alignment positions for primer sequences against the input reads.
sam2=<file>
SAM file containing mapped locations of the second primer sequence (typically reverse primer). Must contain alignment positions for primer sequences against the input reads.

Extraction Behavior

fake=t
Generate fake output reads when primers are not found. When set to true (default), outputs a single 'N' base for reads where primers cannot be located or where primer regions overlap. When false, no output is generated for such reads.
include=f
Include the flanking primer sequences in the extracted output. When false (default), only the sequence between primers is extracted. When true, the extracted sequence includes both primer regions plus the sequence between them.
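
The interaction of fake and include can be sketched in a few lines of Python. This is a minimal illustration of the behavior described above, not the code in CutPrimers.java: the function name cut_between, the (start, stop) tuples, and the 0-based half-open coordinates are all assumptions made for the example, and treating directly adjacent primers as non-overlapping is likewise a guess.

```python
def cut_between(seq, p1, p2, include=False, fake=True):
    """Return the region bounded by two primer hits, or a fallback.

    p1 and p2 are (start, stop) pairs, 0-based and half-open, or None
    when that primer was not located. This layout is an assumption of
    the sketch, not the tool's internal representation.
    """
    fallback = "N" if fake else None   # fake=t emits a single 'N' base
    if p1 is None or p2 is None:
        return fallback                # a primer was not found
    left, right = sorted([p1, p2])     # order the hits by start position
    if left[1] > right[0]:
        return fallback                # primer regions overlap
    if include:
        return seq[left[0]:right[1]]   # primers plus the region between them
    return seq[left[1]:right[0]]       # only the region between the primers


if __name__ == "__main__":
    read = "AAACCCGGGTTT"
    print(cut_between(read, (0, 3), (9, 12)))
    print(cut_between(read, (0, 3), (9, 12), include=True))
    print(cut_between(read, None, (9, 12)))
```

A return value of None stands for "no output written", which corresponds to fake=f for reads that cannot be cut.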

Java Parameters

-Xmx
Sets Java's memory usage, overriding automatic memory detection. Use format like -Xmx20g for 20 gigabytes or -Xmx200m for 200 megabytes. Maximum is typically 85% of physical memory. Default allocation is 1GB for this tool.
-eoom
Exit on out-of-memory exception. Causes the process to terminate immediately if Java runs out of memory. Requires Java 8u92 or later.
-da
Disable Java assertions. May provide slight performance improvement in production use.

Examples

Basic Primer Cutting

cutprimers.sh in=sequences.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam

Extracts sequences between forward and reverse primers, excluding the primer sequences themselves from the output.

Including Primer Sequences

cutprimers.sh in=sequences.fq out=extracted_with_primers.fq sam1=forward_primer.sam sam2=reverse_primer.sam include=t

Extracts the region between the primers together with both primer sequences, which is useful when the full amplicon sequence is needed.

Skipping Failed Extractions

cutprimers.sh in=sequences.fq out=extracted_only.fq sam1=forward_primer.sam sam2=reverse_primer.sam fake=f

Outputs only successfully extracted sequences, omitting reads where primers cannot be found or the primer regions overlap.

Processing with Memory Optimization

cutprimers.sh -Xmx8g in=large_dataset.fq out=extracted.fq sam1=forward_primer.sam sam2=reverse_primer.sam

Raises the Java heap limit to 8 GB with -Xmx8g, overriding automatic memory detection, for processing large datasets.

Algorithm Details

Primer Location Processing

CutPrimers uses LinkedHashMap structures and coordinate arithmetic to process SAM alignment data and identify primer binding locations.
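
As a rough illustration of this kind of lookup table, the Python sketch below reads SAM records into a dictionary keyed by read name. It is not the tool's implementation. The function name primer_hits is hypothetical, and the sketch assumes that the read name is in the RNAME column (column 3) with the primer as the query. The interpretation of POS (1-based leftmost position) and of the CIGAR operations that consume reference bases follows the SAM specification. A Python dict keeps insertion order, which makes it a loose stand-in for a LinkedHashMap.

```python
import re

# CIGAR operations that advance along the reference (SAM specification).
REF_CONSUMING = "MDN=X"


def primer_hits(sam_lines):
    """Map read name -> (start, stop) of a primer hit, 0-based half-open.

    Assumes the read name is stored in RNAME (column 3); that is an
    assumption of this sketch. If a read appears more than once, the
    last record wins.
    """
    hits = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue                      # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue                      # not a complete SAM record
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 4 or cigar == "*":
            continue                      # unmapped, no position to record
        ref_len = sum(
            int(count)
            for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
            if op in REF_CONSUMING
        )
        start = pos - 1                   # SAM POS is 1-based
        hits[rname] = (start, start + ref_len)
    return hits
```

Calling primer_hits once per SAM file gives two tables that can be queried by read name while streaming through the input reads.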

Sequence Extraction Strategy

The extraction process implements different strategies based on primer arrangement and user preferences.

Performance Characteristics

Coordinate Logic

The tool implements precise coordinate arithmetic for different primer configurations.

Use Cases

Amplicon Sequence Extraction

The primary application is extracting amplified regions from PCR products or targeted sequencing data where primer sequences must be removed before analysis.

Targeted Region Analysis

Useful for isolating specific genomic regions bounded by known primer sequences, particularly in metagenomics or environmental sequencing projects.

Quality Control

Can identify reads with missing or improperly aligned primers, helping assess PCR amplification success and primer binding efficiency.
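
One way to turn this into a number is to tally, per read, how many of the two primers were located. The Python sketch below is a hypothetical helper, not a feature of cutprimers.sh. It assumes the primer hits have already been loaded into two dictionaries keyed by read name, and the names primer_qc, hits1, and hits2 are invented for the example.

```python
from collections import Counter


def primer_qc(read_names, hits1, hits2):
    """Count reads with both, one, or none of the two primers located.

    hits1 and hits2 are any containers supporting `name in hits`,
    for example dicts keyed by read name (an assumption of this sketch).
    """
    tally = Counter()
    for name in read_names:
        found = (name in hits1) + (name in hits2)   # 0, 1, or 2 primers
        tally[("none", "one", "both")[found]] += 1
    return tally


if __name__ == "__main__":
    names = ["r1", "r2", "r3"]
    fwd = {"r1": (0, 20), "r2": (0, 20)}
    rev = {"r1": (480, 500)}
    print(primer_qc(names, fwd, rev))
```

A high share of "one" or "none" reads points to poor primer binding or failed amplification rather than a problem with the extraction step itself.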

Pipeline Integration

Designed to consume the primer alignments produced by msa.sh, so it fits into larger sequence processing workflows for targeted sequencing analysis.

Support

For questions and support: