SplitSam6Way
Splits SAM reads into six output files according to mapping status and strand orientation for paired-end reads.
Basic Usage
splitsam6way.sh <input> <r1plus> <r1minus> <r1unmapped> <r2plus> <r2minus> <r2unmapped> [maxreads]
The tool reads a paired-end SAM file and writes each read to one of six categories: R1 or R2, each divided into plus strand, minus strand, and unmapped.
Parameters
This tool uses positional arguments in a specific order:
Required Arguments
- input
- Input SAM file to be split. Must be a valid SAM format file.
- r1plus
- Output file for R1 reads mapped to the plus strand. Use 'null' to skip this output.
- r1minus
- Output file for R1 reads mapped to the minus strand. Use 'null' to skip this output.
- r1unmapped
- Output file for R1 reads that are unmapped. Use 'null' to skip this output.
- r2plus
- Output file for R2 reads mapped to the plus strand. Use 'null' to skip this output.
- r2minus
- Output file for R2 reads mapped to the minus strand. Use 'null' to skip this output.
- r2unmapped
- Output file for R2 reads that are unmapped. Use 'null' to skip this output.
Optional Arguments
- maxreads
- Maximum number of reads to process. Default: unlimited (Long.MAX_VALUE). Accepts K/M/G suffixes for thousands/millions/billions.
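Because every output slot is positional, a wrapper script has to pass all six output arguments in order and use 'null' for the ones it does not want. The following Python sketch shows one way to build such a command line. The helper name build_command and its keyword names are hypothetical; only the argument order and the 'null' placeholder are taken from the usage line above.

```python
# Hypothetical helper: builds the positional argument list for splitsam6way.sh.
# Only the argument order and the 'null' placeholder come from the usage above.
OUTPUT_ORDER = ("r1plus", "r1minus", "r1unmapped", "r2plus", "r2minus", "r2unmapped")


def build_command(input_sam, maxreads=None, **outputs):
    """Return an argv list; any output slot not supplied is passed as 'null'."""
    unknown = set(outputs) - set(OUTPUT_ORDER)
    if unknown:
        raise ValueError(f"unknown output slot(s): {sorted(unknown)}")
    argv = ["splitsam6way.sh", input_sam]
    argv += [outputs.get(slot, "null") for slot in OUTPUT_ORDER]
    if maxreads is not None:
        argv.append(str(maxreads))  # optional trailing argument
    return argv


cmd = build_command(
    "input.sam",
    r1unmapped="r1_unmapped.sam",
    r2unmapped="r2_unmapped.sam",
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) if splitsam6way.sh is on the PATH.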
Examples
Basic Usage - Split All Categories
splitsam6way.sh input.sam r1_plus.sam r1_minus.sam r1_unmapped.sam r2_plus.sam r2_minus.sam r2_unmapped.sam
Splits the input SAM file into six output files by read number (R1 or R2), mapping status, and strand.
Skip Unwanted Categories
splitsam6way.sh input.sam r1_plus.sam null null r2_plus.sam null null
Outputs only the R1 and R2 reads mapped to the plus strand, skipping minus-strand and unmapped reads.
Process Limited Number of Reads
splitsam6way.sh input.sam r1_plus.sam r1_minus.sam r1_unmapped.sam r2_plus.sam r2_minus.sam r2_unmapped.sam 1000000
Processes only the first 1 million reads from the input file.
Extract Unmapped Reads Only
splitsam6way.sh input.sam null null r1_unmapped.sam null null r2_unmapped.sam
Extracts only the unmapped reads for R1 and R2, which is useful for recovering unaligned sequences.
Algorithm Details
SplitSam6Way parses the SAM file as a stream, one line at a time, and writes each category of paired-end read through its own concurrent output writer:
Processing Architecture
- Stream Processing: ByteFile.nextLine() iterates through input with line-by-line parsing, maintaining constant memory footprint
- Header Propagation: Header lines (first byte is '@') are broadcast to all active ByteStreamWriter instances via the println() method
- Memory Allocation: Fixed 128MB heap allocation (-Xmx128m -Xms128m) prevents memory scaling issues with large files
- Concurrent I/O: Up to six independent ByteStreamWriter threads, each opened with FileFormat.SAM, write the outputs in parallel
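The single-pass loop with header broadcast described above can be sketched in a few lines of Python. This is an illustration of the idea, not the tool's Java code: split_stream and the categorize hook are hypothetical names, the sinks are plain text streams rather than writer threads, and the demo records are truncated to four fields.

```python
import io


def split_stream(lines, sinks, categorize):
    """Route each SAM line to one sink in a single pass.

    lines      -- iterable of text lines (a file object works)
    sinks      -- dict mapping category name to a writable text stream
    categorize -- function(line) -> category name (hypothetical hook)
    """
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            # Header line: copy it to every active output.
            for sink in sinks.values():
                sink.write(line)
            continue
        sink = sinks.get(categorize(line))
        if sink is not None:  # the category may be disabled
            sink.write(line)


# Toy demonstration with in-memory streams and a two-way categorizer.
demo_input = io.StringIO(
    "@HD\tVN:1.6\n"
    "read1\t0\tchr1\t100\n"
    "read2\t16\tchr1\t200\n"
)
demo_sinks = {"plus": io.StringIO(), "minus": io.StringIO()}
split_stream(
    demo_input,
    demo_sinks,
    lambda line: "minus" if int(line.split("\t")[1]) & 0x10 else "plus",
)
print(demo_sinks["minus"].getvalue())
```

Only one line is held at a time, so memory use does not grow with the size of the input.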
Read Classification Implementation
Each read undergoes systematic classification using SamLine parsing methods:
- SamLine Construction: new SamLine(line) parses tab-delimited SAM fields into structured object
- Pair Classification: sl.pairnum()==0 distinguishes R1 reads from R2 reads (non-zero values)
- Mapping Detection: sl.mapped() evaluates FLAG field bits to determine alignment status
- Strand Analysis: sl.strand() compares against Shared.PLUS constant for forward/reverse strand identification
- Conditional Routing: Nested if-else structure routes reads to appropriate ByteStreamWriter based on classification
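The classification steps above come down to three bits of the SAM FLAG field. The Python sketch below uses the bit values defined by the SAM specification (0x4 unmapped, 0x10 reverse strand, 0x80 second in pair). The function name classify, the category labels, and the order of the checks are assumptions made for illustration; the tool itself relies on SamLine methods.

```python
# SAM FLAG bits as defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECOND_IN_PAIR = 0x80


def classify(sam_line):
    """Return r1plus, r1minus, r1unmapped, r2plus, r2minus, or r2unmapped."""
    flag = int(sam_line.split("\t", 2)[1])
    # Assumption: anything not flagged as the second mate is treated as R1.
    read = "r2" if flag & FLAG_SECOND_IN_PAIR else "r1"
    if flag & FLAG_UNMAPPED:
        return read + "unmapped"
    if flag & FLAG_REVERSE:
        return read + "minus"
    return read + "plus"


# FLAG 99 = paired, proper pair, mate reverse, first in pair.
print(classify("readA\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*"))
# FLAG 147 = paired, proper pair, reverse strand, second in pair.
print(classify("readA\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*"))
```

The first call prints r1plus and the second prints r2minus.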
File Management System
- Null File Handling: "null".equalsIgnoreCase() comparison skips ByteStreamWriter initialization for unwanted outputs
- Thread Lifecycle: start() initializes writer threads, poisonAndWait() ensures graceful termination with complete buffer flush
- Compression Integration: ReadWrite.USE_PIGZ=true enables automatic gzip handling via pigz, a parallel gzip implementation
- Format Enforcement: FileFormat.SAM parameter ensures proper SAM header and record formatting in output streams
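The 'null' convention and the guarantee that every output is flushed before exit can be illustrated with ordinary Python file handles. This is a sketch under stated assumptions: open_outputs is a hypothetical helper, and contextlib.ExitStack stands in for the start()/poisonAndWait() lifecycle of the writer threads by closing every opened file when the block ends.

```python
import contextlib
import os
import tempfile


def open_outputs(stack, paths):
    """Open one text file per category, skipping paths equal to 'null' (any case)."""
    sinks = {}
    for category, path in paths.items():
        if path.lower() == "null":
            continue  # unwanted output: no file is created
        sinks[category] = stack.enter_context(open(path, "w"))
    return sinks


with tempfile.TemporaryDirectory() as tmp:
    requested = {
        "r1plus": os.path.join(tmp, "r1_plus.sam"),
        "r1minus": "null",
        "r1unmapped": "NULL",
    }
    # ExitStack closes (and therefore flushes) every opened file on exit.
    with contextlib.ExitStack() as stack:
        sinks = open_outputs(stack, requested)
        opened = sorted(sinks)
        sinks["r1plus"].write("@HD\tVN:1.6\n")
    created = sorted(os.listdir(tmp))

print(opened, created)
```

Only the r1plus file is created; the two 'null' slots never touch the disk.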
Performance Implementation
- Memory Efficiency: Constant 128MB allocation with line-level processing avoids loading entire file into memory
- Parsing Optimization: Single SamLine object instantiation per read minimizes object allocation overhead
- I/O Parallelism: Up to six concurrent ByteStreamWriter threads write outputs in parallel, which can improve throughput on multi-core systems
- Sequential Access: ByteFile streaming maintains O(n) time complexity with single-pass file reading
Statistics Collection
The tool tracks processing metrics using dedicated counters:
- Throughput Calculation: Tools.timeReadsBasesProcessed() computes reads/second and bases/second from Timer measurements
- Category Counting: Six long counters (r1preads, r1mreads, r1ureads, r2preads, r2mreads, r2ureads) track distribution
- Base Accumulation: bases+=sl.seq.length aggregates total sequence length processed
- Read Limiting: Parse.parseKMG() handles K/M/G suffix parsing for maxReads parameter enforcement
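The K/M/G suffix handling and the read limit can be pictured with a short Python sketch. It is not a port of Parse.parseKMG(): the names parse_kmg and take_reads are hypothetical, the multipliers follow the thousands/millions/billions meaning given in the parameter description, and case-insensitive matching with integer-only prefixes is an assumption.

```python
# Decimal multipliers, matching "thousands/millions/billions" in the parameter text.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_kmg(text):
    """Parse '500k', '2m', '1g', or a plain integer string into an int."""
    text = text.strip().lower()  # assumption: suffixes are case-insensitive
    if text and text[-1] in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)


def take_reads(records, max_reads):
    """Yield at most max_reads records, mirroring a read limit."""
    for count, record in enumerate(records):
        if count >= max_reads:
            break
        yield record


print(parse_kmg("1m"))
print(list(take_reads(["a", "b", "c", "d"], parse_kmg("3"))))
```

With these definitions, "1m" parses to 1000000 and a limit of 3 keeps only the first three records.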
Use Cases
Strand-Specific Analysis
Separate reads by strand orientation for strand-specific RNA-seq analysis or antisense transcript detection.
Quality Control
Isolate unmapped reads for further analysis, adapter contamination checking, or alternative alignment strategies.
Differential Processing
Apply different processing pipelines to reads based on their mapping characteristics and pair orientation.
Library Preparation Assessment
Evaluate strand bias in sequencing libraries by comparing plus and minus strand read distributions.
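One simple way to compare the plus and minus outputs is the fraction of mapped reads that fall on the plus strand. The Python sketch below is an illustration only; count_records and plus_fraction are hypothetical helper names, and the counts in the example are made up for demonstration.

```python
def count_records(lines):
    """Count SAM alignment records, ignoring header lines and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("@"))


def plus_fraction(plus_reads, minus_reads):
    """Fraction of mapped reads on the plus strand; None if nothing is mapped."""
    total = plus_reads + minus_reads
    if total == 0:
        return None
    return plus_reads / total


# Made-up example: in-memory stand-ins for r1_plus.sam and r1_minus.sam.
plus_lines = ["@HD\tVN:1.6\n", "a\t99\n", "b\t99\n", "c\t99\n"]
minus_lines = ["@HD\tVN:1.6\n", "d\t83\n"]
print(plus_fraction(count_records(plus_lines), count_records(minus_lines)))
```

A value far from 0.5 means that reads of that mate fall predominantly on one strand.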
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org