SamStreamer
SamStreamerWrapper streams SAM/BAM files through a multithreaded pipeline with optional filtering and format conversion. A dedicated input thread reads and decompresses file blocks while worker threads parse alignments in parallel; a priority-queue-based job ordering system keeps output sequential even when batches finish out of order. It handles SAM to FASTQ/FASTA conversion, SAM/BAM to SAM/BAM conversion with optional quality filtering, and CIGAR normalization for different SAM specification versions. Native BAM support includes multithreaded BGZF decompression via BgzfInputStreamMT and BAI index generation.
Basic Usage
samstreamer.sh in=<file> out=<file>
samstreamer.sh <input> <output>
SamStreamer accepts SAM, BAM, FASTA, or FASTQ input and can write any of these formats. The format is detected automatically from the file extension. When the output file has a .bai extension, SamStreamer generates a BAM index file instead of converting formats.
Examples
samstreamer.sh reads.sam.gz mapped.bam unmapped=f
samstreamer.sh sorted.bam sorted.bai
samstreamer.sh sorted.bam reads.fq.gz
samstreamer.sh in=aligned.bam out=filtered.bam minmapq=30 minid=0.95
Parameters
Parameters are organized by their function in the streaming and conversion process. Filtering parameters apply only to SAM/BAM input.
Input/Output Parameters
- in=<file>
- Input file in SAM, BAM, FASTA, or FASTQ format. Compression detected automatically from extension (.gz, .bz2). Can be stdin.
- out=<file>
- Output file. Format determined by extension (.sam, .bam, .fa, .fq). Supports gzip and bzip2 compression. When extension is .bai, generates a BAM index file from sorted BAM input. Can be stdout.
- ref=<file>
- Optional reference file. Loads reference using ScafMap for coordinate translation operations.
Filtering Parameters
Note: Filtering options apply only to SAM/BAM input
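- minpos=
- Lower bound of the genomic position range. Alignments that do not overlap the range from minpos to maxpos are ignored.
- maxpos=
- Upper bound of the genomic position range. Alignments that do not overlap the range from minpos to maxpos are ignored.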
- minmapq=
- Ignore alignments with mapping quality (MAPQ) below this threshold.
- maxmapq=
- Ignore alignments with mapping quality (MAPQ) above this threshold.
- minid=0.0
- Ignore alignments with identity below this value, expressed as a fraction (0.0-1.0).
- maxid=1.0
- Ignore alignments with identity above this value, expressed as a fraction (0.0-1.0).
- contigs=
- Comma-delimited list of contig names to include (whitelist). Contig names should use underscores instead of spaces.
- mapped=t
- Include mapped reads. Set to false to exclude reads with valid alignments.
- unmapped=t
- Include unmapped reads. Set to false to exclude unaligned reads.
- mappedonly=
- If true, include only mapped reads.
- unmappedonly=
- If true, include only unmapped reads.
- secondary=t
- Include secondary alignments (alternative mapping locations for multi-mapping reads).
- supplementary=t
- Include supplementary alignments (chimeric alignments or split reads).
- lengthzero=t
- Include alignments without bases (zero-length sequences).
- duplicate=t
- Include reads marked as PCR or optical duplicates.
- qfail=t
- Include reads marked as failing quality control.
- invert=f
- Invert all filtering criteria. When true, selects reads that would normally be excluded.
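As a rough illustration of how the flag and MAPQ filters above could combine, the sketch below checks one SAM line against a few of them. The function name passes_filter and its keyword arguments are hypothetical and only mirror the parameter names in this list; the flag bit values come from the SAM specification. This is not SamStreamer's implementation, and it assumes that invert simply negates the combined decision.

```python
# FLAG bits as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
QFAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def passes_filter(sam_line, minmapq=None, maxmapq=None, mapped=True,
                  unmapped=True, secondary=True, supplementary=True,
                  duplicate=True, qfail=True, invert=False):
    """Return True if the alignment on this SAM line should be kept.

    Hypothetical helper: keyword names mirror the documented parameters,
    but the logic is an assumption, not the tool's actual code.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: FLAG
    mapq = int(fields[4])  # column 5: MAPQ
    is_unmapped = bool(flag & UNMAPPED)

    keep = True
    if is_unmapped and not unmapped:
        keep = False
    if not is_unmapped and not mapped:
        keep = False
    # Each flag-based switch excludes reads carrying that bit when set to False.
    for bit, allowed in ((SECONDARY, secondary), (SUPPLEMENTARY, supplementary),
                         (DUPLICATE, duplicate), (QFAIL, qfail)):
        if flag & bit and not allowed:
            keep = False
    if minmapq is not None and mapq < minmapq:
        keep = False
    if maxmapq is not None and mapq > maxmapq:
        keep = False
    # Assumption: invert flips the final keep/discard decision.
    return keep != invert
```

For example, passes_filter(line, minmapq=30, secondary=False) would keep only primary or supplementary records with MAPQ of at least 30 under these assumptions.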
Processing Parameters
- ordered=t
- Maintain input order in output. The priority-queue-based JobQueue enforces sequential output ordering even when worker threads complete out of order.
Technical Details
Multithreaded Pipeline Architecture
The streaming pipeline uses a producer-consumer architecture with three processing stages:
- Input thread (Thread 0): Reads the input file and batches records into lists. For SAM files, it reads text lines; for BAM files, it reads binary blocks and manages multithreaded BGZF decompression.
- Worker threads: Parse batched records in parallel. SAM lines are split on tabs; BAM binary records are converted to their SAM text representation.
- Format conversion: When outputting to FASTQ/FASTA, parsed alignments are converted to Read objects with quality score validation
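The producer-consumer layout described above can be sketched in a few lines. This is a minimal illustration using Python's standard queue and threading modules, not the tool's Java implementation; run_pipeline, its parameters, and the batch size are all hypothetical. One reader thread batches lines and several workers split them on tabs.

```python
import queue
import threading

SENTINEL = None  # tells a worker that no more batches are coming


def run_pipeline(lines, batch_size=2, workers=3, capacity=4):
    """Batch lines on one input thread and parse batches on worker threads."""
    jobs = queue.Queue(maxsize=capacity)  # bounded: put() blocks when full
    results = {}                          # batch id -> parsed records
    lock = threading.Lock()

    def reader():
        batch, batch_id = [], 0
        for line in lines:
            batch.append(line)
            if len(batch) == batch_size:
                jobs.put((batch_id, batch))
                batch, batch_id = [], batch_id + 1
        if batch:
            jobs.put((batch_id, batch))
        for _ in range(workers):
            jobs.put(SENTINEL)  # one stop signal per worker

    def worker():
        while True:
            job = jobs.get()
            if job is SENTINEL:
                return
            batch_id, batch = job
            parsed = [line.split("\t") for line in batch]
            with lock:
                results[batch_id] = parsed

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reassemble by batch id so the caller sees input order.
    return [record for i in sorted(results) for record in results[i]]
```

The sketch collects everything before returning, which a real streaming tool would avoid; it is meant only to show the thread roles and the bounded hand-off between them.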
Ordered Output System
Output ordering is maintained through a priority-queue-based system that allows worker threads to complete out of order while ensuring sequential output:
- JobQueue with ordering: Uses a min-heap to track completed work by ID number, releasing results in sequential order
- Blocking retrieval: Output waits for the next sequential batch even if later batches complete first
- Backpressure management: Bounded queue capacity prevents memory overflow by blocking producers when queues fill
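The min-heap idea behind ordered release can be illustrated with Python's heapq module. The class below is a hypothetical, single-threaded sketch; OrderedBuffer and its methods are invented names, and the blocking retrieval and bounded capacity that the real JobQueue provides are left out.

```python
import heapq


class OrderedBuffer:
    """Release results in job-ID order regardless of arrival order."""

    def __init__(self):
        self._heap = []    # min-heap of (job_id, result)
        self._next_id = 0  # lowest ID that has not been released yet

    def add(self, job_id, result):
        """Store a finished job; IDs are assumed to be unique."""
        heapq.heappush(self._heap, (job_id, result))

    def drain(self):
        """Yield every stored result that is next in sequence."""
        while self._heap and self._heap[0][0] == self._next_id:
            _, result = heapq.heappop(self._heap)
            self._next_id += 1
            yield result


buf = OrderedBuffer()
buf.add(2, "batch-2")          # finished early, must wait
early = list(buf.drain())      # nothing can be released yet
buf.add(0, "batch-0")
buf.add(1, "batch-1")
released = list(buf.drain())   # batches 0, 1, 2 in order
```

Because the smallest pending ID is always at the top of the heap, deciding whether output can advance is a single comparison against the next expected ID.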
Multithreaded BGZF Decompression
For BAM files, multithreaded BGZF decompression runs as three concurrent stages:
- Block reading: Producer thread reads BGZF block headers and raw compressed data from the BAM file
- Parallel decompression: Worker threads decompress blocks independently using standard gzip inflation, validating checksums
- Ordered delivery: Decompressed blocks are delivered to the BAM parser in original file order via the JobQueue system
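The three stages above can be approximated with the Python standard library. The sketch below follows the BGZF block layout from the SAM/BAM specification (a gzip member whose extra field carries a BC subfield with the block size) and is not BgzfInputStreamMT itself. All function names are hypothetical, and make_block exists only to build demo input.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def block_size(data, pos):
    """Return the total size of the BGZF block that starts at pos."""
    if data[pos:pos + 4] != b"\x1f\x8b\x08\x04":
        raise ValueError("not a BGZF block")
    xlen = struct.unpack_from("<H", data, pos + 10)[0]
    off, end = pos + 12, pos + 12 + xlen
    while off < end:  # walk the gzip extra subfields looking for 'BC'
        si1, si2, slen = struct.unpack_from("<BBH", data, off)
        if si1 == 66 and si2 == 67 and slen == 2:
            return struct.unpack_from("<H", data, off + 4)[0] + 1
        off += 4 + slen
    raise ValueError("BC subfield missing")


def inflate_block(block):
    """Inflate one block and validate its CRC32 and length footer."""
    xlen = struct.unpack_from("<H", block, 10)[0]
    crc, isize = struct.unpack("<II", block[-8:])
    payload = zlib.decompress(block[12 + xlen:-8], wbits=-15)  # raw DEFLATE
    if zlib.crc32(payload) != crc or len(payload) != isize:
        raise ValueError("corrupt BGZF block")
    return payload


def decompress_bgzf(data, threads=4):
    """Split data into blocks, inflate in parallel, join in file order."""
    blocks, pos = [], 0
    while pos < len(data):
        size = block_size(data, pos)
        blocks.append(data[pos:pos + size])
        pos += size
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in submission order, preserving file order.
        return b"".join(pool.map(inflate_block, blocks))


def make_block(payload):
    """Build a single BGZF block (demo helper, not part of any reader)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block size minus one
    header = (b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
              + struct.pack("<H", 6) + b"BC" + struct.pack("<HH", 2, bsize))
    footer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + footer
```

Here the executor's ordered map stands in for the JobQueue: blocks may finish inflating in any order, but the results are consumed in the order the blocks were read.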
Filtering and Optimization
Filtering, normalization, and parsing shortcuts are applied only when the requested operation needs them:
- Conditional filtering: When filters are specified, each alignment is checked against mapping quality, identity, flag, and coordinate criteria
- CIGAR normalization: Converts between SAM specification versions when the samversion parameter is set
- Parse optimization: SAM field parsing is skipped when not needed for the output format, reducing processing time
- Direct attachment: For SAM-to-SAM conversion, original text is reused without re-serialization
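One form of CIGAR normalization can be sketched as follows, assuming the difference of interest is the = (match) and X (mismatch) operators introduced in SAM 1.4, which older 1.3-style CIGAR strings express as M. The function name cigar_to_v13 is hypothetical, and the actual conversion rules used when samversion is set may differ.

```python
import re

# One CIGAR element: a length followed by an operation character.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_v13(cigar):
    """Rewrite '=' and 'X' operations as 'M', merging adjacent equal ops."""
    if cigar == "*":  # CIGAR unavailable
        return cigar
    merged = []  # list of [length, op]
    for length, op in _CIGAR_OP.findall(cigar):
        if op in "=X":
            op = "M"
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)  # extend the previous run
        else:
            merged.append([int(length), op])
    return "".join("%d%s" % (n, op) for n, op in merged)
```

Going the other way, from M to = and X, requires comparing read bases against the reference, so it cannot be done from the CIGAR string alone.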
BAI Index Generation
When the output file has a .bai extension, SamStreamer generates a BAM index file compatible with samtools. The input must be coordinate-sorted BAM. Index generation tracks alignment bins and virtual file offsets as defined by the BAI index format.
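The two quantities tracked during indexing can be computed as below. The bin calculation is a Python transcription of the reg2bin function given in the SAM/BAM specification, and the virtual offset packing follows the BGZF definition. The function virtual_offset and its argument names are illustrative, and nothing here is taken from SamStreamer's source.

```python
def reg2bin(beg, end):
    """Bin number for a 0-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return 4681 + (beg >> 14)  # 16 kbp bins
    if beg >> 17 == end >> 17:
        return 585 + (beg >> 17)   # 128 kbp bins
    if beg >> 20 == end >> 20:
        return 73 + (beg >> 20)    # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return 9 + (beg >> 23)     # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return 1 + (beg >> 26)     # 64 Mbp bins
    return 0                       # whole-reference bin


def virtual_offset(block_start, within_block):
    """Pack a BGZF block's file offset and an offset inside its inflated data."""
    return (block_start << 16) | within_block
```

An alignment is placed in the smallest bin that fully contains it, and its virtual offset lets a reader seek to the right compressed block and then skip to the record inside the decompressed data.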
Notes
- SamStreamer replaces the older streamsam.sh and adds native BAM support
- No external dependencies required for BAM operations (samtools/sambamba not needed)
- Filtering parameters only apply to SAM/BAM input; FASTA/FASTQ pass through without filtering
- BAI generation requires coordinate-sorted BAM input
- For simple format conversion without filtering, consider stream.sh