SplitNextera

Script: splitnextera.sh | Package: jgi | Class: SplitNexteraLMP.java

Splits Nextera LMP (long-mate-pair) libraries into four subsets based on linker orientation: LMP, fragment, unknown, and singleton. The tool is designed strictly for Nextera LMP reads, and splitting must be performed before any further analysis.

Important Usage Notes

Critical Requirements

  • LMP Libraries Only: This tool is designed strictly for Nextera LMP (long-mate-pair) reads, not for normal libraries using a Nextera kit
  • Required Processing Step: Nextera LMP libraries must be split prior to further processing - they are not usable raw
  • Adapter Trimming First: Perform adapter trimming on Nextera LMP libraries before splitting

Recommended Workflow

Brian Bushnell recommends this two-step workflow for processing Nextera LMP libraries:

Step 1: Adapter Trimming

bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

Remove standard Illumina adapters from the raw Nextera LMP library.

Step 2: Library Splitting

splitnextera.sh in=trimmed.fq out=lmp.fq outf=fragments.fq outu=unknown.fq outs=singletons.fq mask

Split the adapter-trimmed library into four output categories based on linker orientation.

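This produces four output files:

  • lmp.fq: pairs with LMP orientation
  • fragments.fq: pairs with fragment orientation
  • unknown.fq: pairs with unknown orientation
  • singletons.fq: singleton reads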

Basic Usage

splitnextera.sh in=<file> out=<file> outf=<file> outu=<file> outs=<file>

For pairs in two files, use in1, in2, out1, out2, etc.

Processing Approaches

SplitNextera offers two approaches for junction detection, with different performance characteristics:

Approach 1: Built-in Junction Detection

splitnextera.sh in=trimmed.fq out=lmp.fq outf=fragments.fq outu=unknown.fq outs=singletons.fq mask=t

Uses the built-in masking capability to automatically detect Nextera junction sequences.

Approach 2: Pre-masking with BBDuk (Faster)

# Step 1: Pre-mask junctions with BBDuk (left side of the pipe)
# Step 2: Split using the pre-masked junctions (right side of the pipe)
bbduk.sh in=trimmed.fq out=stdout.fq ktmask=J k=19 hdist=1 mink=11 hdist2=0 literal=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG | \
splitnextera.sh in=stdin.fq out=lmp.fq outf=fragments.fq outu=unknown.fq outs=singletons.fq

This approach is somewhat faster because it separates junction detection from library splitting. The results are identical to those of the built-in approach.
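
Before splitting a pre-masked file, it can help to confirm that masking actually marked some reads. The following is a minimal sketch, not part of BBTools: the function name count_junction_reads is hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per record and the default junction symbol J.

```shell
# Count FASTQ records whose sequence line contains the junction symbol.
# Usage: count_junction_reads <file> [symbol]   (symbol defaults to J)
# Assumes 4-line FASTQ records, so line 2 of each record is the sequence.
count_junction_reads() {
    awk -v s="${2:-J}" 'NR % 4 == 2 && index($0, s) > 0 { n++ } END { print n + 0 }' "$1"
}
```

For example, count_junction_reads junction_masked.fq prints the number of reads that carry at least one masked junction base. A count of zero suggests the masking step did not run or used a different symbol.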

Parameters

Parameters are grouped by function, following the organization used in the shell script.

I/O Parameters

in=<file>
Input reads. Set to 'stdin.fq' to read from stdin.
out=<file>
Output for pairs with LMP orientation.
outf=<file>
Output for pairs with fragment orientation.
outu=<file>
Pairs with unknown orientation.
outs=<file>
Singleton output.
ow=f
(overwrite) Overwrites files that already exist. Default: f
app=f
(append) Append to files that already exist. Default: f
zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Default: 4
int=f
(interleaved) Determines whether INPUT file is considered interleaved. Default: f
qin=auto
ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto. Default: auto
qout=auto
ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input). Default: auto

Processing Parameters

mask=f
Set to true if you did not already convert junctions to some symbol, and it will be done automatically. When enabled, constructs kmer hash tables to detect the Nextera junction sequence CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG and replaces detected sequences with junction symbols. Default: f
junction=J
Look for this symbol to designate the junction bases. The tool searches for this character to identify junction locations when processing pre-masked sequences. Default: J
innerlmp=f
Generate long mate pairs from the inner pair also, when the junction is found in both reads. This creates additional LMP pairs from the inner segments when both reads contain the junction. Default: f
rename=t
Rename read 2 of output when using single-ended input. Automatically updates read names to maintain proper pairing information. Default: t
minlength=40
(ml) Do not output reads shorter than this. Filters out reads that are too short to be useful after adapter trimming. Default: 40
merge=f
Attempt to merge overlapping reads before looking for junctions. Uses exact overlap detection to identify reads that can be merged into single sequences before junction processing. Default: f
testmerge=0.0
If nonzero, only merge reads if at least this fraction of input reads are mergeable. Tests merge rate on a subset (up to 1 million read pairs) and enables merging only if the rate exceeds 10%. Default: 0.0

Sampling Parameters

reads=-1
Set to a positive number to only process this many INPUT reads (or pairs), then quit. Useful for testing or processing subsets. Default: -1 (all reads)
samplerate=1
Randomly output only this fraction of reads; 1 means sampling is disabled. Range: 0.0-1.0. Default: 1
sampleseed=-1
Set to a positive number to use that prng seed for sampling (allowing deterministic sampling). Default: -1 (random seed)
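
The sampling flags combine naturally into a reproducible trial run on a small fraction of the data. The sketch below only assembles and prints such a command line so it can be reviewed before running; the helper name trial_split_cmd and the trial_* output file names are hypothetical, while the flags themselves are the ones documented above.

```shell
# Print a splitnextera.sh command for a deterministic trial run.
# Usage: trial_split_cmd <input.fq> <samplerate> <sampleseed>
# Nothing is executed; the command is only written to stdout.
trial_split_cmd() {
    printf 'splitnextera.sh in=%s out=trial_lmp.fq outf=trial_frag.fq outu=trial_unk.fq outs=trial_single.fq mask=t samplerate=%s sampleseed=%s\n' "$1" "$2" "$3"
}

# Keep about 5% of reads, with a fixed seed so repeated runs pick the same reads.
trial_split_cmd trimmed.fq 0.05 42
```

Using a positive sampleseed makes the subset repeatable, which is useful when comparing the effect of other settings such as minlength or innerlmp on the same reads.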

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.
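
As an illustration of the 85% figure, the sketch below converts a physical-memory amount in megabytes into an -Xmx flag. The function name xmx_flag is hypothetical, and the amount of physical memory is passed in by the user rather than detected, since detection differs between operating systems.

```shell
# Build an -Xmx flag equal to 85% of the given physical memory (in MB).
# Integer arithmetic is used, so the result is rounded down.
xmx_flag() {
    echo "-Xmx$(( $1 * 85 / 100 ))m"
}

xmx_flag 16384   # prints -Xmx13926m
```

The printed flag can then be placed on the command line ahead of the other arguments, for example splitnextera.sh -Xmx13926m in=trimmed.fq followed by the usual outputs.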

Examples

Complete Nextera LMP Processing Workflow

# Step 1: Remove standard adapters
bbduk.sh in=nextera_lmp_raw.fq out=adapter_trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

# Step 2: Split library by junction orientation
splitnextera.sh in=adapter_trimmed.fq out=lmp.fq outf=fragments.fq outu=unknown.fq outs=singletons.fq mask=t

Complete workflow for processing raw Nextera LMP libraries following Brian's recommended approach.

High-Performance Processing with Pre-masking

# Method 1: Pre-mask junctions with BBDuk, writing an intermediate file
bbduk.sh in=adapter_trimmed.fq out=junction_masked.fq ktmask=J k=19 hdist=1 mink=11 hdist2=0 literal=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG
splitnextera.sh in=junction_masked.fq out=lmp.fq outf=frag.fq outu=unknown.fq outs=singleton.fq

# Method 2: Pipeline approach (equivalent, slightly faster)
bbduk.sh in=adapter_trimmed.fq out=stdout.fq ktmask=J k=19 hdist=1 mink=11 hdist2=0 literal=CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG | \
splitnextera.sh in=stdin.fq out=lmp.fq outf=frag.fq outu=unknown.fq outs=singleton.fq

For maximal speed, pre-mask the junction sequences with BBDuk before splitting. Both methods yield identical output.

Paired-End Processing with Inner LMP Generation

# Process paired-end Nextera LMP data with comprehensive output
splitnextera.sh in1=trimmed_R1.fq in2=trimmed_R2.fq \
                out1=lmp_R1.fq out2=lmp_R2.fq \
                outf1=frag_R1.fq outf2=frag_R2.fq \
                outu1=unk_R1.fq outu2=unk_R2.fq \
                outs=singleton.fq \
                mask=t innerlmp=t merge=t testmerge=0.1

Processes paired-end data with automatic junction detection, inner LMP generation when both reads contain the junction, and conditional merging based on overlap rate.

Quality-Filtered Processing

# Apply stricter length filtering and generate processing statistics
splitnextera.sh in=library.fq out=lmp.fq outf=frag.fq outu=unk.fq outs=single.fq \
                mask=t minlength=60 stats=split_stats.txt

Uses stricter minimum length filtering (60bp instead of default 40bp) and outputs detailed processing statistics to a file.

Algorithm Details

Nextera LMP Library Architecture

Nextera LMP libraries are created through a specialized process that circularizes DNA fragments, incorporates junction sequences, and then linearizes them to create long-distance mate pairs. The junction sequence CTGTCTCTTATACACATCTAGATGTGTATAAGAGACAG serves as the marker that distinguishes the different read pair orientations and makes it possible to split the library correctly.

Junction Detection Strategy

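SplitNextera supports two junction detection modes with different performance characteristics:

  • Built-in masking (mask=t): the tool constructs kmer hash tables for the junction sequence and replaces detected junctions with the junction symbol before splitting.
  • Pre-masked input (mask=f): junctions are masked beforehand with BBDuk (ktmask=J), and the tool only searches for the junction symbol.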

Read Pair Classification Logic

Each read pair is classified according to the junction positions, which are stored in the Read.start and Read.stop coordinates.

Overlap Merging Integration

The optional merging functionality provides enhanced junction detection through sequence overlap analysis. When merge=t, BBMerge.findOverlapStrict() identifies exact overlaps between paired reads before junction processing. The testmerge parameter enables adaptive behavior: if not reading from stdin, BBMerge.mergeableFraction() samples up to 1 million read pairs to measure overlap rate, enabling merging only when the rate exceeds the hardcoded 10% threshold. Successfully merged reads use Read.joinRead() to create single sequences, with Read.reverseComplement() operations maintaining proper orientation. Junction detection then proceeds on the merged sequence, potentially improving detection accuracy for short insert libraries.

Memory Architecture and Performance

SplitNextera uses concurrent I/O architecture with ConcurrentReadInputStream and ConcurrentReadOutputStream systems employing buffer size 4 for parallel processing. Read processing occurs through ListNum batch structures, enabling simultaneous read/write operations across four output streams (LMP, fragment, unknown, singleton). The base memory requirement is approximately 200MB (default -Xmx200m), scaling linearly when mask=t to accommodate kmer hash tables at roughly 20 bytes per unique kmer. For optimal performance, Brian recommends pre-masking with BBDuk using identical parameters (k=19, mink=11, hdist=1, ktmask=J) rather than built-in masking, as this eliminates redundant TableLoaderLockFree overhead and table construction during runtime.

Statistical Reporting and Quality Control

Comprehensive processing statistics track junction detection efficiency and read fate distribution. The system maintains counters for junctionsSought (total read pairs processed), junctionsDetected (pairs with identified junctions), and category-specific read/base counts (readsLmp, readsFrag, readsUnk, readsSingle). Statistical output uses TextStreamWriter to generate detailed reports including junction detection rate (junctionsDetected/junctionsSought × 100%), read distribution percentages calculated with pairedInput multipliers (100.0 for paired, 50.0 for single-ended), and base recovery metrics. When merging is enabled, additional statistics track mergedReadCount and mergedBaseCount with percentage calculations. All metrics are formatted using Tools.format() for consistent decimal precision and written to configurable output streams (default: stderr).
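
The junction detection rate described above is a simple ratio and can be recomputed from the two counters. This is a minimal sketch that assumes the junctionsDetected and junctionsSought values have been read off the statistics output by hand; the function name junction_rate is hypothetical and not part of BBTools.

```shell
# Junction detection rate as a percentage.
# Usage: junction_rate <junctionsDetected> <junctionsSought>
# Prints NA when no pairs were processed, to avoid dividing by zero.
junction_rate() {
    awk -v d="$1" -v s="$2" 'BEGIN { if (s > 0) printf "%.2f\n", 100 * d / s; else print "NA" }'
}

junction_rate 7200 10000   # prints 72.00
```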

Output Categories

Long Mate Pairs (LMP)

These represent the primary product of successful Nextera LMP library preparation. LMP pairs have the expected long-distance orientation where the outer segments of junction-containing reads are paired together, creating mate pairs with insert distances much larger than the original DNA fragments. These pairs are essential for applications requiring long-range connectivity information such as scaffolding, structural variant detection, and complex genome assembly.

Fragment Pairs

Standard paired-end fragments that result from incomplete adapter ligation, failed circularization, or other library preparation artifacts. These pairs have normal fragment insert sizes (typically 300-800bp) rather than the long mate pair distances (2-40kb+). While not the intended product of LMP preparation, fragment pairs retain standard paired-end utility for mapping and variant calling applications.

Unknown Pairs

Read pairs where junction sequences were not detected despite the presence of potential LMP structure. This category captures several scenarios: LMP pairs where junction detection failed due to sequencing errors or base quality issues, fragments that lack incorporated junction sequences, or cases where the junction sequence has been modified during library preparation. Unknown pairs typically represent 15-30% of LMP libraries and require careful evaluation to determine their ultimate classification.

Singletons

Individual reads that lost their mate during the splitting process. Common causes include asymmetric junction detection (junction found in only one read of a pair), size filtering that removes segments below the minimum length threshold, or quality issues that prevent proper junction identification. Singletons can sometimes be rescued through secondary analysis or merged with other singleton populations for single-end applications.
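
To see how a library was distributed across the four categories, the output files can be tallied directly. The sketch below is an independent check, not a replacement for the tool's own statistics. The function names fastq_reads and category_summary are hypothetical, the file names follow the workflow example, and the files are assumed to be uncompressed FASTQ with four lines per record. Counts are in reads, so an interleaved pair counts as two.

```shell
# Count 4-line FASTQ records in a file; a missing file counts as 0.
fastq_reads() {
    if [ -f "$1" ]; then awk 'END { print int(NR / 4) }' "$1"; else echo 0; fi
}

# Print read count and percentage for each output category.
# Usage: category_summary <lmp> <fragments> <unknown> <singletons>
category_summary() {
    awk -v l="$(fastq_reads "$1")" -v f="$(fastq_reads "$2")" \
        -v u="$(fastq_reads "$3")" -v s="$(fastq_reads "$4")" 'BEGIN {
        t = l + f + u + s
        if (t == 0) { print "no reads"; exit }
        printf "lmp\t%d\t%.1f%%\n", l, 100 * l / t
        printf "fragment\t%d\t%.1f%%\n", f, 100 * f / t
        printf "unknown\t%d\t%.1f%%\n", u, 100 * u / t
        printf "singleton\t%d\t%.1f%%\n", s, 100 * s / t
    }'
}

category_summary lmp.fq fragments.fq unknown.fq singletons.fq
```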

Troubleshooting

Low Junction Detection Rate

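If junction detection rates are below 60%, consider the following checks, based on the requirements described above:

  • Confirm that junctions were actually masked, either with mask=t or by pre-masking with BBDuk using the junction literal.
  • Confirm that the junction= symbol matches the symbol used during pre-masking (default: J).
  • Confirm that the library is a Nextera LMP library; the tool is not intended for normal libraries made with a Nextera kit.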

High Unknown Percentage

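Unknown pairs exceeding 40% may indicate junction detection failures caused by sequencing errors or low base quality, fragments that lack incorporated junction sequences, or junction sequences modified during library preparation (see Unknown Pairs above).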

Performance Optimization

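For maximum throughput with large datasets, pre-mask the junctions with BBDuk and pipe its output directly into splitnextera.sh, as shown under High-Performance Processing with Pre-masking.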
