SplitSam
Splits a SAM file into three files: plus-mapped reads, minus-mapped reads, and unmapped reads. If 'header' is given as the 5th argument, header lines are included in the outputs.
Basic Usage
splitsam <input> <plus output> <minus output> <unmapped output> [header]
Input may be stdin or a SAM file, raw or gzipped. Outputs must be SAM files and may be gzipped.
Parameters
SplitSam uses positional arguments rather than named parameters:
Positional Arguments
- input: Input SAM file to split. Can be stdin, a regular SAM file, or a gzipped SAM file.
- plus output: Output file for reads mapped to the plus (forward) strand. Must be a SAM file and may be gzipped.
- minus output: Output file for reads mapped to the minus (reverse) strand. Must be a SAM file and may be gzipped.
- unmapped output: Output file for unmapped reads. Must be a SAM file and may be gzipped.
- header (optional): If the 5th argument is "header", SAM header lines (starting with @) are included in all three output files. Default: header lines are excluded from the output files.
Examples
Basic splitting without headers
splitsam aligned.sam plus_reads.sam minus_reads.sam unmapped_reads.sam
Splits aligned.sam into three files based on mapping strand, excluding SAM header lines from outputs.
Splitting with headers included
splitsam aligned.sam plus_reads.sam minus_reads.sam unmapped_reads.sam header
Same as above but includes SAM header lines (@SQ, @HD, etc.) in all three output files.
Working with gzipped files
splitsam aligned.sam.gz plus_reads.sam.gz minus_reads.sam.gz unmapped_reads.sam.gz header
Processes a gzipped input file and creates gzipped output files with headers included.
Using stdin input
samtools view input.bam | splitsam stdin plus.sam minus.sam unmapped.sam
Reads SAM data from stdin (converted from BAM using samtools) and splits into three files.
Algorithm Details
Strand Determination Implementation
SplitSam uses the SamLine.parseFlagOnly() method to extract SAM flag bits from line byte arrays, then applies SamLine.strand() and SamLine.mapped() static methods for classification:
- Plus strand (forward): SamLine.strand(flag) returns 0 when bit 0x10 is unset
- Minus strand (reverse): SamLine.strand(flag) returns 1 when bit 0x10 is set
- Unmapped: SamLine.mapped(flag) returns false when bit 0x4 is set
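The classification rules above can be sketched in a few lines of Python. This is an illustration of the flag logic only, not the SamLine source: the function names mirror the methods named above, but their bodies are assumptions. So is the choice to test the unmapped bit before the strand bit, which is inferred from the counting rules described under Statistical Output.

```python
FLAG_UNMAPPED = 0x4   # SAM flag bit: read is unmapped
FLAG_REVERSE = 0x10   # SAM flag bit: SEQ is reverse complemented

def parse_flag_only(line: bytes) -> int:
    """Return the FLAG value (second tab-separated field) of an alignment line."""
    fields = line.split(b"\t", 2)  # stop splitting after the flag column
    return int(fields[1])

def mapped(flag: int) -> bool:
    """True when the unmapped bit (0x4) is not set."""
    return (flag & FLAG_UNMAPPED) == 0

def strand(flag: int) -> int:
    """0 for plus (bit 0x10 unset), 1 for minus (bit 0x10 set)."""
    return 1 if flag & FLAG_REVERSE else 0

def classify(flag: int) -> str:
    """Assumed ordering: unmapped is checked first, then strand."""
    if not mapped(flag):
        return "unmapped"
    return "minus" if strand(flag) == 1 else "plus"
```

For example, flag 99 (paired, properly paired, mate reverse, first in pair) has bit 0x10 unset and classifies as plus, while flag 147 has it set and classifies as minus.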
Processing Architecture
The tool reads its input as a stream of lines via ByteFile.nextLine() and writes through three concurrent ByteStreamWriter threads:
- Line parsing is single-threaded, iterating with ByteFile.nextLine()
- Three ByteStreamWriter instances (fStream, rStream, uStream) handle output, with automatic compression detection
- Writer thread lifecycle is managed with ByteStreamWriter.start() and ByteStreamWriter.poisonAndWait()
- Constant memory footprint: processes one line at a time without buffering entire file
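The single-reader, three-writer layout described above resembles a producer-consumer pipeline shut down with a "poison" sentinel. The Python sketch below shows that general pattern under stated assumptions: LineWriter is a hypothetical stand-in for a writer thread, the queue size is arbitrary, and header lines are simply skipped. It does not reproduce ByteStreamWriter's actual internals.

```python
import queue
import threading

POISON = None  # sentinel that tells a writer thread to stop

class LineWriter(threading.Thread):
    """Hypothetical writer thread: drains a bounded queue into a binary stream."""

    def __init__(self, stream):
        super().__init__(daemon=True)
        self.stream = stream
        self.lines = queue.Queue(maxsize=1024)  # bounded, so memory stays flat

    def run(self):
        while True:
            line = self.lines.get()
            if line is POISON:
                break
            self.stream.write(line)

    def println(self, line: bytes):
        self.lines.put(line)

    def poison_and_wait(self):
        self.lines.put(POISON)  # queued after all pending lines
        self.join()

def split_stream(lines, plus_out, minus_out, unmapped_out):
    """Read lines one at a time and hand each to one of three writers."""
    f, r, u = (LineWriter(s) for s in (plus_out, minus_out, unmapped_out))
    for w in (f, r, u):
        w.start()
    for line in lines:
        if line.startswith(b"@"):
            continue  # headers are skipped in this sketch
        flag = int(line.split(b"\t", 2)[1])
        if flag & 0x4:
            u.println(line)
        elif flag & 0x10:
            r.println(line)
        else:
            f.println(line)
    for w in (f, r, u):
        w.poison_and_wait()
```

Because the reader holds only the current line and each queue is bounded, memory use in this sketch does not depend on the size of the input.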
Header Processing Logic
When the includeHeader boolean is true (the 5th argument equals "header"):
- Lines beginning with the '@' character are identified as header lines
- Header lines are written via ByteStreamWriter.println() to all three output streams
- Each output file therefore contains the complete SAM header, for downstream compatibility
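The routing rule for header lines can be sketched as a small Python function that returns which outputs a line should go to. The function name and the string labels are hypothetical and chosen for illustration; only the '@' test and the broadcast-to-all-outputs behavior come from the description above.

```python
ALL_OUTPUTS = ("plus", "minus", "unmapped")

def destinations(line: bytes, include_header: bool) -> tuple:
    """Return the outputs that should receive this line."""
    if line.startswith(b"@"):
        # Header line: copy to every output, or drop it entirely.
        return ALL_OUTPUTS if include_header else ()
    flag = int(line.split(b"\t", 2)[1])
    if flag & 0x4:
        return ("unmapped",)
    return ("minus",) if flag & 0x10 else ("plus",)
```

Testing the first byte is sufficient to detect headers because the SAM specification does not allow '@' in read names (QNAME).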
Performance Implementation
- Memory allocation: Fixed 128MB heap (-Xmx128m) as configured in shell script
- I/O architecture: ByteStreamWriter with built-in gzip compression support via file extension detection
- Threading model: Three concurrent writer threads with producer-consumer pattern
- Scalability: memory use does not grow with input file size (O(1) in the number of reads), because lines are streamed one at a time
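Compression detection by file extension can be illustrated with Python's standard gzip module. The helper name open_sam is hypothetical, and checking only for a ".gz" suffix is an assumption made for this sketch; the actual tool may recognize other extensions.

```python
import gzip

def open_sam(path: str, mode: str = "rb"):
    """Open a SAM file in binary mode, gzipped if the name ends in .gz."""
    if path.lower().endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)
```

With this helper, open_sam("plus_reads.sam.gz", "wb") would produce a gzip-compressed file and open_sam("plus_reads.sam", "wb") a plain one, so the caller never needs to branch on format.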
Statistical Output Implementation
Elapsed time is measured with Timer.stop(), and statistics are reported with System.err.println():
- Total reads: the sum of the plus, minus, and other (unmapped) counters
- Plus strand reads: incremented when SamLine.strand(flag) == 0 and SamLine.mapped(flag) == true
- Minus strand reads: incremented when SamLine.strand(flag) == 1 and SamLine.mapped(flag) == true
- Unmapped reads: incremented when SamLine.mapped(flag) == false
- Processing time: Timer class elapsed time measurement
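The counters above can be reproduced with a short Python sketch. The dictionary keys and the report wording are assumptions made for illustration and do not reflect SplitSam's actual stderr format; time.perf_counter() stands in for the Timer class.

```python
import sys
import time

def split_stats(lines):
    """Count plus, minus, and unmapped reads and time the pass."""
    start = time.perf_counter()
    plus = minus = other = 0
    for line in lines:
        if line.startswith(b"@"):
            continue  # header lines are not reads
        flag = int(line.split(b"\t", 2)[1])
        if flag & 0x4:
            other += 1   # unmapped
        elif flag & 0x10:
            minus += 1   # mapped, reverse strand
        else:
            plus += 1    # mapped, forward strand
    elapsed = time.perf_counter() - start
    return {"total": plus + minus + other, "plus": plus,
            "minus": minus, "unmapped": other, "seconds": elapsed}

def report(stats, out=sys.stderr):
    """Print the counters in a hypothetical layout."""
    for key in ("total", "plus", "minus", "unmapped"):
        print(f"{key} reads: {stats[key]}", file=out)
    print(f"time: {stats['seconds']:.3f} seconds", file=out)
```

Writing the report to stderr keeps it separate from any data a pipeline sends through stdout.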
Technical Notes
SAM Flag Interpretation
The tool relies on standard SAM flag bits for classification:
- 0x4 (unmapped): Read is unmapped
- 0x10 (reverse complement): Read sequence is reverse complemented
File Format Requirements
- Input must be valid SAM format (not BAM)
- Output files are always SAM format
- Gzipped files are automatically detected by extension
- All output files are created even if they would be empty
Error Handling Implementation
- Tools.testForDuplicateFiles() validates no duplicate output paths exist
- Tools.testOutputFiles() ensures output files are writable before processing begins
- Malformed SAM lines are handled by try-catch blocks in SamLine.parseFlagOnly()
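The kinds of checks listed above can be sketched in Python as follows. Both helpers are hypothetical illustrations: they are not the Tools or SamLine implementations, and returning None for a malformed line is an assumption about one reasonable way to handle it.

```python
import os

def duplicate_outputs(paths):
    """Return the paths that resolve to an output already seen earlier in the list."""
    seen = set()
    dups = []
    for p in paths:
        key = os.path.abspath(p)  # normalizes ./a.sam and a.sam to the same path
        if key in seen:
            dups.append(p)
        seen.add(key)
    return dups

def flag_or_none(line: bytes):
    """Parse the FLAG field, returning None instead of raising on a bad line."""
    try:
        return int(line.split(b"\t", 2)[1])
    except (IndexError, ValueError):
        return None
```

Running the duplicate check before any output file is opened avoids two writers truncating each other's file.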
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org