StreamSam
Converts SAM/BAM to FASTQ rapidly using multiple threads. BAM files require samtools or sambamba in the PATH.
Basic Usage
streamsam.sh in=<file> out=<file>
StreamSam converts SAM/BAM alignment files to FASTQ format using multiple threads via the SamReadStreamer class. When ordered=false, processing uses SamStreamer.DEFAULT_THREADS threads. Filtering is provided by the SamFilter class, which selects reads by mapping quality, genomic coordinates, alignment flags, and contig names.
Parameters
Parameters are organized by their function in the SAM/BAM to FASTQ conversion process.
Input/Output Parameters
- in=<file>
- Input SAM or BAM file. Can be stdin.
- out=<file>
- Output FASTQ file. Can be stdout.
- ref=<file>
- Optional reference file. When provided, loads reference using ScafMap.loadReference() and sets RNAME_AS_BYTES=false for string-based contig name processing.
Filtering Parameters
- minpos=
- Minimum genomic position for coordinate-based filtering. Alignments that do not overlap the range from minpos to maxpos are ignored.
- maxpos=
- Maximum genomic position for coordinate-based filtering. Alignments that do not overlap the range from minpos to maxpos are ignored.
- minmapq=
- Ignore alignments with mapping quality (MAPQ) below this threshold. Higher values select more confidently mapped reads.
- maxmapq=
- Ignore alignments with mapping quality (MAPQ) above this threshold. Useful for selecting poorly mapped reads.
- contigs=
- Comma-delimited list of contig names to include. Names must not contain spaces; use underscores in place of spaces. Contig name matching is handled by the SamFilter class.
- mapped=t
- Include mapped reads. Set to false to exclude reads with valid alignments.
- unmapped=t
- Include unmapped reads. Set to false to exclude unaligned reads.
- secondary=f
- Include secondary alignments. Secondary alignments represent alternative mapping locations for multi-mapping reads.
- supplimentary=t
- Include supplementary alignments. Supplementary alignments represent chimeric alignments or split reads.
- lengthzero=f
- Include alignments without bases. Controls whether zero-length alignments are retained in the output.
- invert=f
- Invert sam filters. When true, selects reads that would normally be excluded by the filtering criteria.
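The mapped, unmapped, secondary, supplimentary, duplicate, and qfail switches all refer to bits in the SAM FLAG column. The sketch below only illustrates which bit each category corresponds to, using the bit values from the SAM specification. The function name sam_flag_categories is hypothetical and is not part of StreamSam, and this is not how SamFilter itself is written.

```shell
#!/bin/sh
# Illustrative only: list the flag-based categories that a SAM FLAG value
# falls into. Bit values are taken from the SAM specification.
# sam_flag_categories is a hypothetical helper, not a StreamSam function.
sam_flag_categories() {
    flag=$1
    # 4 (0x4): read is unmapped; otherwise treat it as mapped
    if [ $(( flag & 4 )) -ne 0 ]; then out="unmapped"; else out="mapped"; fi
    # 256 (0x100): secondary alignment
    if [ $(( flag & 256 )) -ne 0 ]; then out="$out secondary"; fi
    # 2048 (0x800): supplementary alignment
    if [ $(( flag & 2048 )) -ne 0 ]; then out="$out supplementary"; fi
    # 1024 (0x400): PCR or optical duplicate
    if [ $(( flag & 1024 )) -ne 0 ]; then out="$out duplicate"; fi
    # 512 (0x200): failed quality checks
    if [ $(( flag & 512 )) -ne 0 ]; then out="$out qfail"; fi
    echo "$out"
}
```

For example, sam_flag_categories 256 prints "mapped secondary", which is the kind of record that secondary=f excludes.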
Processing Parameters
- ordered=t
- Keep reads in input order. Setting this to false is faster, because multiple threads are used more efficiently, but the output order may differ from the input order.
- verbose=f
- Print verbose progress information during processing.
- reads=
- Process at most this many reads. Also accepts 'maxreads'. Useful for testing or processing subsets.
- forceparse=f
- Force full parsing of SAM optional fields even when not needed for output format. Increases accuracy but reduces speed.
SAM Version Parameters
- samversion=1.4
- SAM format version to use for output. Also accepts 'samv' or 'sam'. Affects CIGAR string formatting.
Advanced Filtering Parameters
- minid=0.0
- Minimum alignment identity (0.0-1.0). Values >1 are interpreted as percentages and divided by 100.
- maxid=1.0
- Maximum alignment identity (0.0-1.0). Values >1 are interpreted as percentages and divided by 100.
- duplicate=t
- Include duplicate reads (reads marked as PCR or optical duplicates).
- qfail=f
- Include reads that failed quality checks (as marked in SAM flags).
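The rule that minid and maxid values above 1 are read as percentages can be sketched as follows. This is a minimal illustration of the stated rule, not StreamSam's own parsing code; normalize_identity is a hypothetical name, and awk is used only because plain sh has no floating-point arithmetic.

```shell
#!/bin/sh
# Illustrative only: map an identity argument onto the 0.0-1.0 scale.
# Values greater than 1 are treated as percentages and divided by 100.
# normalize_identity is a hypothetical helper, not a StreamSam function.
normalize_identity() {
    awk -v v="$1" 'BEGIN { if (v > 1) v = v / 100; print v + 0 }'
}
```

Under this rule, minid=95 and minid=0.95 describe the same threshold.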
Examples
Basic SAM to FASTQ Conversion
streamsam.sh in=alignments.sam out=reads.fastq
Converts a SAM file to FASTQ format, retaining all reads (mapped and unmapped).
BAM to FASTQ with Quality Filter
streamsam.sh in=alignments.bam out=high_quality.fastq minmapq=20
Converts BAM to FASTQ, keeping only reads with mapping quality ≥20. Requires samtools or sambamba in PATH for BAM input.
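For intuition, the effect of minmapq on uncompressed SAM text can be approximated with awk, since MAPQ is the fifth tab-separated column of each record. This is a rough sketch only: filter_min_mapq is a hypothetical name, header lines are dropped, and any special handling that StreamSam applies (for example to unmapped reads) is not reproduced here.

```shell
#!/bin/sh
# Illustrative only: keep SAM records whose MAPQ (column 5) is at least
# the given threshold. Header lines (starting with @) are skipped.
# filter_min_mapq is a hypothetical helper, not part of StreamSam.
filter_min_mapq() {
    awk -F '\t' -v min="$1" '
        /^@/ { next }
        $5 + 0 >= min + 0 { print }'
}
```

A call such as filter_min_mapq 20 < alignments.sam would print only the records with MAPQ of 20 or more, still in SAM format.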
Extract Unmapped Reads Only
streamsam.sh in=alignments.bam out=unmapped.fastq mapped=f
Extracts only unmapped reads from a BAM file, useful for recovering unaligned sequences for further analysis.
Coordinate-Based Filtering
streamsam.sh in=alignments.sam out=region_reads.fastq contigs=chr1,chr2 minpos=1000000 maxpos=2000000
Extracts reads that map to contigs chr1 and chr2 and overlap the coordinate range 1,000,000 to 2,000,000.
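Because minpos and maxpos select alignments that overlap the range, a read only needs to touch the range to be kept; it does not have to lie entirely inside it. The standard inclusive interval-overlap test is sketched below. Whether the tool's own overlap routine uses exactly this comparison is an assumption, and ranges_overlap is a hypothetical name.

```shell
#!/bin/sh
# Illustrative only: succeed (exit status 0) when two inclusive ranges
# share at least one position.
# Arguments: start1 stop1 start2 stop2
# ranges_overlap is a hypothetical helper, not part of StreamSam.
ranges_overlap() {
    [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Example: an alignment covering 999950-1000049 against 1000000-2000000
if ranges_overlap 999950 1000049 1000000 2000000; then
    echo "alignment overlaps the requested range"
fi
```

In this example the alignment starts before minpos but still overlaps the range, so under this test it would be retained.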
High-Throughput Processing
streamsam.sh in=large_alignment.bam out=reads.fastq ordered=f
Disables output ordering (ordered=f) for faster multi-threaded conversion. Reads may appear in the output in a different order than in the input.
Primary Alignments Only
streamsam.sh in=alignments.bam out=primary.fastq secondary=f supplimentary=f
Extracts only primary alignments, excluding secondary and supplementary alignments for cleaner downstream analysis.
Algorithm Details
StreamSam uses a multi-threaded streaming architecture for SAM/BAM to FASTQ conversion:
Threading Strategy
- SamReadStreamer: Dedicated thread for reading and parsing SAM/BAM input
- ConcurrentReadOutputStream: Buffered output writing with configurable buffer size (default 4)
- Processing Pipeline: Reads are processed in batches using ListNum collections
- Default Threads: Uses SamStreamer.DEFAULT_THREADS when ordered=false
Memory Management
- Shared Header: When outputting SAM/BAM format, uses shared header to reduce memory overhead
- Buffer Optimization: Caps buffers at 4 using Shared.capBuffers() to prevent excessive memory usage
- Streaming Processing: Processes reads in chunks rather than loading entire files into memory
SAM Parsing Optimization
- Selective Parsing: When output format is FASTQ, disables parsing of unnecessary SAM fields (PARSE_2, PARSE_5, PARSE_6, PARSE_7, PARSE_8, PARSE_OPTIONAL) unless forceparse=true
- CIGAR Processing: Only parses CIGAR strings when needed for filtering (identity calculations) or format conversion
- Attached SamLine: Uses USE_ATTACHED_SAMLINE=true for memory-efficient read processing
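Selective parsing works because a FASTQ record needs very little of a SAM line: the read name, the bases, and the qualities. If the PARSE_n names index SAM columns from zero (an assumption; the text above does not say so), the skipped fields would be RNAME, CIGAR, RNEXT, PNEXT, and TLEN plus the optional tags. The awk sketch below shows the idea on plain SAM text. It is deliberately minimal and is not equivalent to StreamSam: it does not reverse-complement reads aligned to the minus strand, does not mark read pairs, and applies no filters. The name sam_to_fastq_minimal is hypothetical.

```shell
#!/bin/sh
# Illustrative only: build FASTQ records from SAM columns 1 (QNAME),
# 10 (SEQ), and 11 (QUAL), ignoring every other column.
# Does not restore the original orientation of minus-strand reads.
# sam_to_fastq_minimal is a hypothetical helper, not part of StreamSam.
sam_to_fastq_minimal() {
    awk -F '\t' '
        /^@/ { next }    # skip header lines
        { printf "@%s\n%s\n+\n%s\n", $1, $10, $11 }'
}
```

It reads SAM on standard input and writes FASTQ to standard output, for example sam_to_fastq_minimal < alignments.sam > reads.fastq.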
Filtering Implementation
- SamFilter Integration: Uses var2.SamFilter class for all filtering operations
- Coordinate Overlap: Uses Tools.overlap() for genomic coordinate range testing
- Identity Calculation: Calculates alignment identity from CIGAR strings using SamLine.calcIdentity()
- Contig Name Handling: Automatically handles contig name variants (space/underscore conversion)
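To make the identity calculation concrete, the sketch below computes one common definition of identity from a CIGAR string that uses the = (match) and X (mismatch) operators: matches divided by matches plus mismatches, insertions, and deletions. This definition is an assumption chosen for illustration and may differ from what SamLine.calcIdentity() actually computes. The name cigar_identity is hypothetical. The M operator does not distinguish matches from mismatches, so this sketch ignores it.

```shell
#!/bin/sh
# Illustrative only: identity = matches / (matches + mismatches + ins + del),
# read from a CIGAR string with = and X operators.
# Soft clips (S), hard clips (H), skips (N), padding (P), and the
# ambiguous M operator are not counted.
# cigar_identity is a hypothetical helper, not part of StreamSam.
cigar_identity() {
    awk -v cigar="$1" 'BEGIN {
        match_n = 0; other = 0
        # Consume one <length><operator> pair at a time from the front
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            if (op == "=") match_n += len
            else if (op == "X" || op == "I" || op == "D") other += len
            cigar = substr(cigar, RLENGTH + 1)
        }
        if (match_n + other == 0) print 0
        else printf "%.4f\n", match_n / (match_n + other)
    }'
}
```

For example, cigar_identity 90=5X5I prints 0.9000, a value that minid=0.95 would reject under this definition.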
Performance Characteristics
- Throughput: Reports processing speed in reads/sec and Mbp/sec
- Scalability: Linear scaling with thread count when ordered=false
- Memory Usage: Constant memory usage regardless of input file size due to streaming architecture
- I/O Handling: Uses pigz compression and parallel I/O configured by ReadWrite.setZipThreads()
Reference Integration
When a reference file is provided via ref=<file>:
- ScafMap Loading: Loads reference using ScafMap.loadReference() for contig name resolution
- Coordinate Processing: Enables coordinate-based filtering with reference scaffolds
- RNAME Handling: Sets RNAME_AS_BYTES=false for string-based contig name processing
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org