SamStreamer

Script: samstreamer.sh
Package: stream
Class: SamStreamerWrapper.java

SamStreamerWrapper streams SAM/BAM files through a multithreaded pipeline with optional filtering and format conversion. A dedicated input thread reads and decompresses file blocks while worker threads parse alignments in parallel; a priority-queue-based job ordering system keeps output sequential even when workers finish out of order. The tool handles SAM to FASTQ/FASTA conversion, SAM/BAM to SAM/BAM conversion with optional quality filtering, and CIGAR normalization for different SAM specification versions. Native BAM support includes multithreaded BGZF decompression via BgzfInputStreamMT and BAI index generation.

Basic Usage

samstreamer.sh in=<file> out=<file>
samstreamer.sh <input> <output>

SamStreamer accepts SAM, BAM, FASTA, or FASTQ input and can write any of these formats. The format is detected automatically from the file extension. When the output file has a .bai extension, SamStreamer generates a BAM index file instead of converting formats.

Examples

samstreamer.sh reads.sam.gz mapped.bam unmapped=f
samstreamer.sh sorted.bam sorted.bai
samstreamer.sh sorted.bam reads.fq.gz
samstreamer.sh in=aligned.bam out=filtered.bam minmapq=30 minid=0.95

Parameters

Parameters are organized by their function in the streaming and conversion process. Filtering parameters apply only to SAM/BAM input.

Input/Output Parameters

in=<file>
Input file in SAM, BAM, FASTA, or FASTQ format. Compression detected automatically from extension (.gz, .bz2). Can be stdin.
out=<file>
Output file. Format determined by extension (.sam, .bam, .fa, .fq). Supports gzip and bzip2 compression. When extension is .bai, generates a BAM index file from sorted BAM input. Can be stdout.
ref=<file>
Optional reference file. Loads reference using ScafMap for coordinate translation operations.

Filtering Parameters

Note: Filtering options apply only to SAM/BAM input

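minpos=
Lower bound of the genomic range of interest. Alignments that do not overlap the range from minpos to maxpos are ignored.
maxpos=
Upper bound of the genomic range of interest. Alignments that do not overlap the range from minpos to maxpos are ignored.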
minmapq=
Ignore alignments with mapping quality (MAPQ) below this threshold.
maxmapq=
Ignore alignments with mapping quality (MAPQ) above this threshold.
minid=0.0
Ignore alignments with percent identity below this value (0.0-1.0).
maxid=1.0
Ignore alignments with percent identity above this value (0.0-1.0).
contigs=
Comma-delimited list of contig names to include (whitelist). Contig names should use underscores instead of spaces.
mapped=t
Include mapped reads. Set to false to exclude reads with valid alignments.
unmapped=t
Include unmapped reads. Set to false to exclude unaligned reads.
mappedonly=
If true, include only mapped reads.
unmappedonly=
If true, include only unmapped reads.
secondary=t
Include secondary alignments (alternative mapping locations for multi-mapping reads).
supplementary=t
Include supplementary alignments (chimeric alignments or split reads).
lengthzero=t
Include alignments without bases (zero-length sequences).
duplicate=t
Include reads marked as PCR or optical duplicates.
qfail=t
Include reads marked as failing quality control.
invert=f
Invert all filtering criteria. When true, selects reads that would normally be excluded.
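
To illustrate how the flag-based options could combine, the sketch below tests the FLAG bits defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x200 QC fail, 0x400 duplicate, 0x800 supplementary) together with a MAPQ range. The class AlignmentFilterSketch and its field names are hypothetical and are not taken from the SamStreamer source. Applying the MAPQ range only to mapped reads, and implementing invert as a negation of the final decision, are assumptions.

```java
// Hypothetical sketch of flag and MAPQ filtering; not the SamStreamer code.
final class AlignmentFilterSketch {
    // FLAG bits as defined in the SAM specification.
    static final int UNMAPPED = 0x4, SECONDARY = 0x100, QFAIL = 0x200,
                     DUPLICATE = 0x400, SUPPLEMENTARY = 0x800;

    // Defaults mirror the documented defaults (everything included).
    boolean mapped = true, unmapped = true, secondary = true,
            supplementary = true, duplicate = true, qfail = true,
            invert = false;
    int minMapq = 0, maxMapq = Integer.MAX_VALUE;

    /** Returns true if a record with this FLAG and MAPQ should be kept. */
    boolean keep(int flag, int mapq) {
        boolean isUnmapped = (flag & UNMAPPED) != 0;
        boolean pass = (isUnmapped ? unmapped : mapped)
                && (secondary || (flag & SECONDARY) == 0)
                && (supplementary || (flag & SUPPLEMENTARY) == 0)
                && (duplicate || (flag & DUPLICATE) == 0)
                && (qfail || (flag & QFAIL) == 0)
                // Assumption: the MAPQ range is only checked for mapped reads.
                && (isUnmapped || (mapq >= minMapq && mapq <= maxMapq));
        // Assumption: invert flips the final keep/discard decision.
        return pass != invert;
    }
}
```

Under these assumptions, setting minMapq to 30 would discard a mapped read with MAPQ 10 while leaving unmapped reads to the unmapped option.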

Processing Parameters

ordered=t
Maintain input order in output. The priority-queue-based JobQueue enforces sequential output ordering even when worker threads complete out of order.

Technical Details

Multithreaded Pipeline Architecture

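The streaming pipeline uses a producer-consumer architecture with three processing stages. As described in the overview, a dedicated input thread reads and decompresses file blocks, and worker threads parse alignments in parallel before results are written in order.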
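
A producer-consumer arrangement of this kind can be sketched with standard java.util.concurrent classes. The example below is a generic illustration, not the SamStreamerWrapper implementation: the class PipelineSketch, the method countFields, the queue capacity, and the end-of-input sentinel are all assumptions. One reader thread feeds lines into a bounded queue and a pool of workers splits them into tab-delimited fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a reader thread feeding parallel workers.
final class PipelineSketch {
    // Sentinel compared by identity to signal end of input.
    private static final String POISON = new String("<end>");

    /** Counts tab-delimited fields across all lines using the given number of workers. */
    static int countFields(List<String> lines, int workers) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Worker stage: take lines until the sentinel arrives.
        Callable<Integer> worker = () -> {
            int fields = 0;
            for (String line = queue.take(); line != POISON; line = queue.take()) {
                fields += line.split("\t").length;
            }
            return fields;
        };
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            results.add(pool.submit(worker));
        }

        // Reader stage: a dedicated thread produces input, then one sentinel per worker.
        Thread reader = new Thread(() -> {
            try {
                for (String line : lines) queue.put(line);
                for (int i = 0; i < workers; i++) queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        reader.join();
        pool.shutdown();
        return total;
    }
}
```

The bounded queue applies back-pressure, so the reader cannot run arbitrarily far ahead of the workers.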

Ordered Output System

Output ordering is maintained through a priority-queue-based system that allows worker threads to complete out of order while ensuring sequential output.
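
The idea of reordering with a priority queue can be shown in a few lines. This is a minimal sketch under stated assumptions, not the actual JobQueue class: the names OrderedOutputSketch, complete, and emitted are hypothetical, and jobs are assumed to carry consecutive ids starting at 0. Finished jobs are held in a min-heap keyed by id and released only when the smallest held id is the next one expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical reorder buffer; not the SamStreamer JobQueue.
final class OrderedOutputSketch {
    private static final class Job implements Comparable<Job> {
        final long id;
        final String result;
        Job(long id, String result) { this.id = id; this.result = result; }
        @Override public int compareTo(Job o) { return Long.compare(id, o.id); }
    }

    private final PriorityQueue<Job> pending = new PriorityQueue<>();
    private final List<String> out = new ArrayList<>();
    private long nextId = 0; // id of the next job allowed to be written

    /** Called by a worker when job `id` finishes, in any order. */
    synchronized void complete(long id, String result) {
        pending.add(new Job(id, result));
        // Release every job that is now contiguous with what was already written.
        while (!pending.isEmpty() && pending.peek().id == nextId) {
            out.add(pending.poll().result);
            nextId++;
        }
    }

    /** Snapshot of the results written so far, in job order. */
    synchronized List<String> emitted() {
        return new ArrayList<>(out);
    }
}
```

If job 2 finishes first, it waits in the heap until jobs 0 and 1 have been released, so the output sequence matches the input sequence.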

Multithreaded BGZF Decompression

For BAM files, multithreaded BGZF decompression runs in three parallel stages.
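
Parallel decompression is possible because a BGZF file is a series of independently compressed gzip blocks, each recording its own total size in a "BC" extra subfield of the gzip header. The sketch below shows how block boundaries could be located from that header, following the BGZF layout in the SAM/BAM specification. It is not the BgzfInputStreamMT code; the class BgzfHeaderSketch and the method blockSize are hypothetical names.

```java
// Hypothetical sketch: read the total block size from a BGZF block header.
final class BgzfHeaderSketch {
    /**
     * Returns the total size in bytes of the BGZF block starting at `off`,
     * or -1 if the bytes there do not form a valid BGZF header.
     */
    static int blockSize(byte[] buf, int off) {
        if (buf.length - off < 12) return -1;
        // gzip magic (31, 139), deflate method (8), FEXTRA flag bit (4) set.
        if ((buf[off] & 0xFF) != 31 || (buf[off + 1] & 0xFF) != 139
                || buf[off + 2] != 8 || (buf[off + 3] & 4) == 0) {
            return -1;
        }
        int xlen = u16(buf, off + 10);   // length of the extra field
        int p = off + 12;
        int end = p + xlen;
        if (end > buf.length) return -1;
        // Scan extra subfields for 'B','C' with a 2-byte payload (BSIZE).
        while (p + 4 <= end) {
            int slen = u16(buf, p + 2);
            if (buf[p] == 'B' && buf[p + 1] == 'C' && slen == 2 && p + 6 <= end) {
                return u16(buf, p + 4) + 1; // BSIZE stores total block size minus 1
            }
            p += 4 + slen;
        }
        return -1;
    }

    /** Reads an unsigned little-endian 16-bit value. */
    static int u16(byte[] b, int i) {
        return (b[i] & 0xFF) | ((b[i + 1] & 0xFF) << 8);
    }
}
```

Once block sizes are known, compressed blocks can be handed to separate threads and inflated independently, then reassembled in file order.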

Filtering and Optimization

Performance optimizations are applied based on the requested operation.

BAI Index Generation

When the output file has a .bai extension, SamStreamer generates a BAM index file compatible with samtools. The input must be coordinate-sorted BAM. Index generation tracks alignment bins and virtual file offsets as specified in the BAM index format specification.
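
The two quantities mentioned above, bins and virtual file offsets, are both defined in the SAM/BAM specification. The sketch below transcribes the specification's reg2bin calculation for the standard BAI binning scheme and the virtual offset packing (compressed block offset in the upper 48 bits, offset within the uncompressed block in the lower 16 bits). The class name BaiSketch is hypothetical, and this is not a copy of SamStreamer's indexing code.

```java
// Hypothetical sketch of BAI bin and virtual-offset arithmetic.
final class BaiSketch {
    /**
     * Bin number for an alignment covering [beg, end), zero-based and
     * half-open, using the binning scheme from the SAM/BAM specification.
     */
    static int reg2bin(int beg, int end) {
        --end; // make end inclusive
        if (beg >> 14 == end >> 14) return ((1 << 15) - 1) / 7 + (beg >> 14);
        if (beg >> 17 == end >> 17) return ((1 << 12) - 1) / 7 + (beg >> 17);
        if (beg >> 20 == end >> 20) return ((1 << 9) - 1) / 7 + (beg >> 20);
        if (beg >> 23 == end >> 23) return ((1 << 6) - 1) / 7 + (beg >> 23);
        if (beg >> 26 == end >> 26) return ((1 << 3) - 1) / 7 + (beg >> 26);
        return 0; // spans more than one top-level interval
    }

    /**
     * Packs a BGZF virtual file offset: the byte offset of the compressed
     * block in the file, and the offset inside the uncompressed block.
     */
    static long virtualOffset(long blockFileOffset, int withinBlockOffset) {
        return (blockFileOffset << 16) | (withinBlockOffset & 0xFFFFL);
    }
}
```

The smallest bins cover 16 kbp windows, so a short alignment at the start of a contig falls into the first bin of the finest level.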

Notes