SplitSam4Way
Splits SAM reads into 4 output files based on mapping status: plus-strand mapped, minus-strand mapped, chimeric/discordant pairs, and unmapped reads. Useful for analyzing mapping quality and identifying different types of read pairs in SAM alignment files.
Basic Usage
splitsam4way.sh <input> <outplus> <outminus> <outchimeric> <outunmapped>
Takes exactly five positional arguments: the input SAM file followed by the four output files, in the order shown. Use 'null' for any output file you don't want to generate.
Parameters
This tool uses only positional arguments and has no optional parameters. All five arguments are required, but any output file can be set to 'null' to skip generating it.
Positional Arguments
- input
- Input SAM file containing aligned reads. Headers are preserved and written to all non-null output files.
- outplus
- Output file for reads mapping to the plus strand. Determined by examining the first fragment's strand orientation. Use 'null' to skip.
- outminus
- Output file for reads mapping to the minus strand. Determined by examining the first fragment's strand orientation. Use 'null' to skip.
- outchimeric
- Output file for chimeric or discordant read pairs. Includes pairs whose fragments map to different chromosomes or to the same strand (same orientation). Use 'null' to skip.
- outunmapped
- Output file for unmapped reads or pairs. Includes reads where either fragment is unmapped, the read has no mate, or the alignment is not a primary alignment. Use 'null' to skip.
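Because the argument order is fixed, it is easy to put a file name in the wrong slot when calling the tool from a script. The following is a minimal sketch of a wrapper that fills unused slots with 'null'. The helper name build_command and its keyword names are hypothetical and not part of BBTools; only the argument order and the 'null' convention come from the description above.

```python
import shlex

# Hypothetical helper, not part of BBTools: assembles the five positional
# arguments in the order the tool expects.
OUTPUT_SLOTS = ("plus", "minus", "chimeric", "unmapped")

def build_command(input_sam, **outputs):
    """Return the argv list for splitsam4way.sh; omitted outputs become 'null'."""
    unknown = set(outputs) - set(OUTPUT_SLOTS)
    if unknown:
        raise ValueError(f"unknown output slot(s): {sorted(unknown)}")
    # Positional order matters: plus, minus, chimeric, unmapped.
    slots = [outputs.get(slot, "null") for slot in OUTPUT_SLOTS]
    return ["splitsam4way.sh", input_sam] + slots

cmd = build_command("input.sam", chimeric="chimeric.sam")
print(shlex.join(cmd))  # splitsam4way.sh input.sam null null chimeric.sam null
```

The resulting list can be passed directly to subprocess.run without shell quoting concerns.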
Examples
Basic Read Splitting
splitsam4way.sh input.sam plus.sam minus.sam chimeric.sam unmapped.sam
Splits input.sam into four categories: plus-strand mapped pairs, minus-strand mapped pairs, chimeric/discordant pairs, and unmapped reads.
Skip Unwanted Categories
splitsam4way.sh input.sam plus.sam minus.sam null unmapped.sam
Splits reads into plus, minus, and unmapped categories while skipping chimeric reads (set to 'null').
Extract Only Chimeric Reads
splitsam4way.sh input.sam null null chimeric.sam null
Extracts only chimeric/discordant read pairs, useful for structural variant detection or quality assessment.
Separate Mapped from Unmapped
splitsam4way.sh input.sam mapped_plus.sam mapped_minus.sam chimeric.sam unmapped.sam
Performs the complete four-way separation: unmapped reads go to unmapped.sam, and the other three files together hold every mapped pair for downstream analysis.
Algorithm Details
Classification Logic
SplitSam4Way uses a hierarchical classification system to categorize read pairs based on their SAM flags and mapping information:
1. Header Preservation
All SAM header lines (starting with '@') are copied to every non-null output file, ensuring downstream tools have complete format information.
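The header handling can be illustrated with a short Python sketch. It is not the tool's Java code. It assumes that header lines appear at the top of the input, as the SAM specification requires, and it represents a 'null' output as None. The function name copy_headers is hypothetical.

```python
import io

def copy_headers(lines, outputs):
    """Write each leading '@' line to every non-null output.

    `lines` is an iterator over SAM lines; `outputs` is a list of writable
    objects, with None standing in for an output set to 'null'.
    Returns the first non-header line (or None if the input ends first).
    """
    targets = [out for out in outputs if out is not None]
    for line in lines:
        if not line.startswith("@"):
            return line  # first alignment record; header section is over
        for out in targets:
            out.write(line)
    return None

# Two header lines followed by a shortened stand-in for an alignment record.
sam = iter(["@HD\tVN:1.6\n", "@SQ\tSN:chr1\tLN:1000\n", "r1\t99\tchr1\t100\n"])
plus, chimeric = io.StringIO(), io.StringIO()
first_read = copy_headers(sam, [plus, None, chimeric, None])
```

After the call, both in-memory outputs hold the same two header lines and first_read holds the first alignment line, ready for classification.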
2. Unmapped Classification
A read is classified as unmapped if any of these conditions is true:
- Either fragment is not mapped (!sl.mapped() || !sl.nextMapped())
- Read has no mate pair (!sl.hasMate())
- Read is not a primary alignment (!sl.primary())
Because this check runs first, unmapped status takes priority over every other classification.
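The conditions above can be approximated from the SAM FLAG field alone. The sketch below is a Python illustration, not the tool's Java code. It assumes that the four accessors correspond to the standard SAM flag bits 0x4, 0x8, 0x1, and 0x100. Whether supplementary alignments (0x800) are also treated as non-primary is not stated in this guide, so the sketch ignores that bit.

```python
# Standard SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1          # read has a mate
UNMAPPED = 0x4        # this fragment is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped
SECONDARY = 0x100     # not the primary alignment (assumed meaning of primary())

def is_unmapped_category(flag):
    """Return True if a read with this FLAG belongs in the unmapped output."""
    mapped = not (flag & UNMAPPED)
    next_mapped = not (flag & MATE_UNMAPPED)
    has_mate = bool(flag & PAIRED)
    primary = not (flag & SECONDARY)
    return (not mapped) or (not next_mapped) or (not has_mate) or (not primary)

print(is_unmapped_category(99))   # False: paired, both mapped, primary
print(is_unmapped_category(0))    # True: mapped but has no mate
print(is_unmapped_category(73))   # True: mate is unmapped
```

Note that a mapped single-end read (FLAG 0) falls into the unmapped output under these rules, because it has no mate.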
3. Chimeric/Discordant Classification
For mapped pairs, a read is classified as chimeric if either of these conditions is true:
- Pair fragments map to different chromosomes (!sl.pairedOnSameChrom())
- Both fragments map to the same strand (sl.strand() == sl.nextStrand())
This category captures pairs that may indicate structural variants such as translocations, as well as mapping artifacts.
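As a rough illustration, the two chimeric conditions can be expressed using the FLAG, RNAME, and RNEXT fields of a SAM record. This Python sketch is not the tool's Java code. It assumes that pairedOnSameChrom() is true when RNEXT is '=' or equals RNAME, and that strand is read from flag bits 0x10 and 0x20.

```python
REVERSE = 0x10        # this fragment maps to the minus strand
MATE_REVERSE = 0x20   # the mate maps to the minus strand

def is_chimeric(flag, rname, rnext):
    """Return True if an already-mapped pair belongs in the chimeric output."""
    # Assumption: '=' in RNEXT means "same reference as RNAME" (SAM convention).
    same_chrom = rnext == "=" or rnext == rname
    same_strand = bool(flag & REVERSE) == bool(flag & MATE_REVERSE)
    return (not same_chrom) or same_strand

print(is_chimeric(99, "chr1", "="))     # False: same chromosome, opposite strands
print(is_chimeric(97, "chr1", "chr2"))  # True: mates on different chromosomes
print(is_chimeric(65, "chr1", "="))     # True: both mates on the plus strand
```

This check is only meaningful for reads that have already passed the unmapped test in step 2.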
4. Strand-Based Classification
For the remaining pairs (both fragments mapped to the same chromosome on opposite strands), classification is based on the first fragment's strand orientation:
- Plus strand: (sl.firstFragment() ? sl.strand() : sl.nextStrand()) == PLUS
- Minus strand: (sl.firstFragment() ? sl.strand() : sl.nextStrand()) == MINUS
This ensures consistent strand assignment regardless of which fragment appears first in the SAM file.
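The strand rule can be sketched from the FLAG field as follows. This is a Python illustration, not the tool's Java code. It assumes that firstFragment() corresponds to flag bit 0x40, strand() to bit 0x10, and nextStrand() to bit 0x20. The function name pair_strand is hypothetical.

```python
FIRST_IN_PAIR = 0x40  # this record is the first fragment of the pair
REVERSE = 0x10        # this fragment maps to the minus strand
MATE_REVERSE = 0x20   # the mate maps to the minus strand

def pair_strand(flag):
    """Return '+' or '-' for the strand of the pair's first fragment."""
    if flag & FIRST_IN_PAIR:
        minus = flag & REVERSE        # this record is the first fragment
    else:
        minus = flag & MATE_REVERSE   # the mate is the first fragment
    return "-" if minus else "+"

# Both records of a pair resolve to the same strand:
print(pair_strand(99), pair_strand(147))   # + +
print(pair_strand(83), pair_strand(163))   # - -
```

FLAGs 99 and 147 describe the two mates of one pair, as do 83 and 163, so each pair lands in a single output file.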
Performance Characteristics
- Memory Usage: Very low memory footprint (128MB default) - processes reads one at a time
- I/O Efficiency: Single-pass algorithm with streaming I/O
- Scalability: Linear time complexity, O(n), where n is the number of reads
- Thread Safety: Uses thread-safe TextStreamWriter for concurrent output
Output Statistics
The tool provides detailed runtime statistics upon completion:
- Total processing time and throughput (reads/second, bases/second)
- Plus strand read count
- Minus strand read count
- Chimeric/discordant read count
- Unmapped read count
Use Cases
- Quality Assessment: Identify proportion of properly paired vs. discordant reads
- Structural Variant Detection: Extract chimeric pairs for SV calling pipelines
- Strand-Specific Analysis: Separate plus/minus strand reads for RNA-seq or directional libraries
- Data Cleanup: Remove unmapped reads to reduce file size for downstream analysis
- Mapping Validation: Assess mapping quality by examining different read categories
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org