SplitSam6Way

Basic Usage

splitsam6way.sh <input> <r1plus> <r1minus> <r1unmapped> <r2plus> <r2minus> <r2unmapped> [maxreads]

This tool processes paired-end SAM files and separates reads into six categories based on their mapping status and strand orientation.

Arguments are positional and must be supplied in the following order:

input: Input SAM file to be split. Must be a valid SAM format file.
r1plus: Output file for R1 reads mapped to the plus strand. Use 'null' to skip this output.
r1minus: Output file for R1 reads mapped to the minus strand. Use 'null' to skip this output.
r1unmapped: Output file for R1 reads that are unmapped. Use 'null' to skip this output.
r2plus: Output file for R2 reads mapped to the plus strand. Use 'null' to skip this output.
r2minus: Output file for R2 reads mapped to the minus strand. Use 'null' to skip this output.
r2unmapped: Output file for R2 reads that are unmapped. Use 'null' to skip this output.

maxreads: Optional. Maximum number of reads to process. Default: unlimited (Long.MAX_VALUE). Accepts K/M/G suffixes for thousands/millions/billions.
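The K/M/G suffix convention for maxreads can be illustrated with a minimal Python sketch. This is not the tool's own parser; the function name parse_kmg is hypothetical, and the sketch assumes whole-number values only (fractional inputs such as 1.5m are not covered here).

```python
def parse_kmg(text: str) -> int:
    """Parse a read count such as '500k', '2M', or '1000000' into an integer.

    Illustrative only: assumes an integer optionally followed by one
    case-insensitive suffix (K = thousands, M = millions, G = billions).
    """
    multipliers = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    text = text.strip()
    suffix = text[-1:].lower()
    if suffix in multipliers:
        # Strip the suffix and scale the remaining integer part.
        return int(text[:-1]) * multipliers[suffix]
    return int(text)
```

Under this convention, a maxreads value of 1m is equivalent to 1000000.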

splitsam6way.sh input.sam r1_plus.sam r1_minus.sam r1_unmapped.sam r2_plus.sam r2_minus.sam r2_unmapped.sam

Splits the input SAM file into six separate output files based on read number (R1 or R2), strand, and mapping status.

splitsam6way.sh input.sam r1_plus.sam null null r2_plus.sam null null

Outputs only the R1 and R2 reads mapped to the plus strand, skipping minus-strand and unmapped reads.

splitsam6way.sh input.sam r1_plus.sam r1_minus.sam r1_unmapped.sam r2_plus.sam r2_minus.sam r2_unmapped.sam 1000000

Processes only the first 1 million reads from the input file.

splitsam6way.sh input.sam null null r1_unmapped.sam null null r2_unmapped.sam

Extracts only the unmapped reads for both R1 and R2, which is useful for recovering unaligned sequences.

SplitSam6Way reads the SAM file as a stream, classifies each paired-end read, and writes each category through its own output thread:

Stream Processing: ByteFile.nextLine() iterates through input with line-by-line parsing, maintaining constant memory footprint
Header Propagation: Header lines (first byte is '@') are written to every active ByteStreamWriter via println()
Memory Allocation: A fixed 128 MB heap (-Xmx128m -Xms128m) keeps memory use independent of input file size
Concurrent I/O: Up to six independent ByteStreamWriter threads, each opened with FileFormat.SAM, write the outputs in parallel
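The streaming and header-broadcast behavior described above can be sketched in Python. This is a simplified, single-threaded illustration, not the tool's Java code; stream_sam and pick_writer are hypothetical names, and a writer entry of None stands in for an output that was skipped.

```python
import io
from typing import Callable, Iterable, List, Optional, TextIO


def stream_sam(lines: Iterable[str],
               writers: List[Optional[TextIO]],
               pick_writer: Callable[[str], int]) -> int:
    """Process SAM lines one at a time; return the number of records seen.

    Header lines (starting with '@') are copied to every active writer.
    Each alignment record goes to the single writer chosen by pick_writer.
    """
    records = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            # Broadcast the header so every output is a valid SAM file.
            for writer in writers:
                if writer is not None:
                    writer.write(line)
        else:
            records += 1
            writer = writers[pick_writer(line)]
            if writer is not None:  # None means this category is skipped
                writer.write(line)
    return records


# Usage: two active outputs and one skipped output, held in memory.
outs = [io.StringIO(), None, io.StringIO()]
sample = ["@HD\tVN:1.6\n", "r1\t0\tchr1\n", "r2\t16\tchr1\n"]
count = stream_sam(sample, outs,
                   lambda l: 0 if l.split("\t")[1] == "0" else 2)
```

Only one line is held at a time, so memory use does not grow with the size of the input.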

Each read undergoes systematic classification using SamLine parsing methods:

SamLine Construction: new SamLine(line) parses tab-delimited SAM fields into structured object
Pair Classification: sl.pairnum()==0 identifies R1 reads; any non-zero value is treated as R2
Mapping Detection: sl.mapped() evaluates FLAG field bits to determine alignment status
Strand Analysis: sl.strand() compares against Shared.PLUS constant for forward/reverse strand identification
Conditional Routing: Nested if-else structure routes reads to appropriate ByteStreamWriter based on classification
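The classification steps above map onto standard SAM FLAG bits (0x4 unmapped, 0x10 reverse strand, 0x80 second in pair). The following Python sketch shows one way to express the routing decision; it is not the SamLine implementation, the names classify and classify_line are hypothetical, and the order of the checks is an assumption.

```python
UNMAPPED = 0x4         # SAM FLAG bit: segment unmapped
REVERSE = 0x10         # SAM FLAG bit: sequence is reverse complemented
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: last segment in the template (R2)


def classify(flag: int) -> str:
    """Return one of the six category names for a SAM FLAG value."""
    # Reads without the second-in-pair bit are treated as R1.
    read = "r2" if flag & SECOND_IN_PAIR else "r1"
    if flag & UNMAPPED:
        return read + "unmapped"
    return read + ("minus" if flag & REVERSE else "plus")


def classify_line(line: str) -> str:
    """Classify a tab-delimited SAM record using its FLAG column (field 2)."""
    return classify(int(line.split("\t")[1]))
```

For example, a FLAG of 99 (paired, properly aligned, mate reversed, first in pair) falls into r1plus, while 147 falls into r2minus.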

Null File Handling: "null".equalsIgnoreCase() comparison skips ByteStreamWriter initialization for unwanted outputs
Thread Lifecycle: start() initializes writer threads, poisonAndWait() ensures graceful termination with complete buffer flush
Compression Integration: ReadWrite.USE_PIGZ=true enables gzip handling through the external pigz program when it is available
Format Enforcement: FileFormat.SAM parameter ensures proper SAM header and record formatting in output streams
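The 'null' convention can be sketched as a small Python helper. The name active_outputs is hypothetical and this is only an illustration of the case-insensitive comparison, not the tool's code: any argument equal to "null" in any letter case yields no writer.

```python
from typing import List, Optional


def active_outputs(names: List[str]) -> List[Optional[str]]:
    """Map each output argument to a path, or to None when it is 'null'.

    The comparison ignores case, so 'null', 'NULL', and 'Null' all skip
    the corresponding output.
    """
    return [None if name.lower() == "null" else name for name in names]


# Usage: keep only the plus-strand outputs, as in the second example command.
args = ["r1_plus.sam", "null", "NULL", "r2_plus.sam", "Null", "null"]
paths = active_outputs(args)
```

A caller would then open a writer only for the entries that are not None.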

Memory Efficiency: Constant 128MB allocation with line-level processing avoids loading entire file into memory
Parsing Optimization: Single SamLine object instantiation per read minimizes object allocation overhead
I/O Parallelism: Up to six concurrent ByteStreamWriter threads let output writing overlap with parsing on multi-core systems
Sequential Access: ByteFile streaming maintains O(n) time complexity with single-pass file reading

The tool tracks processing metrics using dedicated counters:

Throughput Calculation: Tools.timeReadsBasesProcessed() computes reads/second and bases/second from Timer measurements
Category Counting: Six long counters (r1preads, r1mreads, r1ureads, r2preads, r2mreads, r2ureads) track distribution
Base Accumulation: bases+=sl.seq.length aggregates total sequence length processed
Read Limiting: Parse.parseKMG() handles K/M/G suffix parsing for maxReads parameter enforcement
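The throughput figures follow from simple arithmetic on the read count, base count, and elapsed time. A minimal Python sketch, with the hypothetical name throughput and no attempt to reproduce the tool's output format:

```python
from typing import Tuple


def throughput(reads: int, bases: int, seconds: float) -> Tuple[float, float]:
    """Return (reads per second, bases per second) for a finished run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return reads / seconds, bases / seconds


# Usage: 1000 reads totalling 150000 bases processed in 2 seconds.
reads_per_sec, bases_per_sec = throughput(1000, 150000, 2.0)
```

Here the result is 500 reads per second and 75000 bases per second.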

Separate reads by strand orientation for strand-specific RNA-seq analysis or antisense transcript detection.

Isolate unmapped reads for further analysis, adapter contamination checking, or alternative alignment strategies.

Apply different processing pipelines to reads based on their mapping characteristics and pair orientation.

Evaluate strand bias in sequencing libraries by comparing plus and minus strand read distributions.

For questions and support: