Stream

Script: stream.sh
Package: stream
Class: StreamerWrapper.java

Converts between SAM, BAM, FASTA, and FASTQ formats, with optional subsampling and support for paired files. Stream uses the same multithreaded streaming architecture as SamStreamer but omits its filtering capabilities, which gives a simpler interface for basic format conversion. Reads can be randomly subsampled with the samplerate parameter or capped at a fixed count with the reads parameter. BAM is read natively, with multithreaded BGZF decompression. Both interleaved and twin-file paired-end formats are supported, with automatic file numbering through the # symbol.

Basic Usage

stream.sh in=<file> out=<file>
stream.sh <input> <output>

Stream accepts SAM, BAM, FASTA, or FASTQ input and can write any of these formats. The format is detected from the file extension, and compression (.gz, .bz2) is detected and handled automatically.

Examples

stream.sh mapped.bam mapped.sam.gz
stream.sh in=reads.fq out=subset.fq samplerate=0.1
stream.sh in1=reads_1.fq in2=reads_2.fq out=merged.fq
stream.sh in=merged.fq out=reads_#.fq
stream.sh in=large.fq out=subset.fq reads=1000000

Parameters

Parameters are organized by their function in the streaming and conversion process.

File Parameters

in=<file>: Primary input file. Format detected automatically from extension (.sam, .bam, .fa, .fq). Supports gzip (.gz) and bzip2 (.bz2) compression. Can be stdin.
in2=<file>: Secondary input file holding read 2 of twin-file paired-end data. Not needed when the input is interleaved.
out=<file>: Primary output file. Format determined by extension. Supports compression. Can be stdout. Optional if only counting reads.
out2=<file>: Secondary output file for read 2 of paired-end data. Alternatively, put the # symbol in out= for auto-numbering (e.g., out=reads_#.fq generates reads_1.fq and reads_2.fq).

Processing Parameters

samplerate=1.0: Fraction of reads to retain in output (0.0 to 1.0). Reads are selected randomly based on sampleseed. Use 0.1 for 10% random sampling.
sampleseed=17: Random seed for subsampling. Use -1 for a random seed based on system time. The same seed produces the same subsample on every run.
reads=-1: Stop after processing this many reads. Use -1 to process all reads. Useful for extracting fixed-size subsets from large files.
ordered=t: Maintain input order in output. Set ordered=f for slightly faster processing when order does not matter.

Threading Parameters

threadsin=0: Number of reader threads. 0 = auto-detect based on CPU cores and compression type. More threads help with compressed input.
threadsout=0: Number of writer threads. 0 = auto-detect. More threads help with compressed output.

Performance Parameters

simd: Enable SIMD (vectorized) processing for accelerated operations. Requires Java 17+ and 256-bit vector instruction sets (AVX2 or equivalent).

Technical Details

Format Detection

Input and output formats are automatically detected from file extensions. Supported formats include:

SAM: .sam, .sam.gz, .sam.bz2 (text-based alignment format)
BAM: .bam (binary BGZF-compressed alignment format)
FASTA: .fa, .fasta, .fna, .fa.gz, .fasta.gz (sequence format)
FASTQ: .fq, .fastq, .fq.gz, .fastq.gz (sequence with quality scores)
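
Extension-based detection of this kind can be sketched in a few lines of Python. This is only an illustration built from the extension list above, not the tool's actual Java code; the function name detect_format and its return shape are hypothetical.

```python
from pathlib import PurePath

# Extension tables copied from the list above. The names below are
# hypothetical and are not part of stream.sh.
COMPRESSION = {".gz": "gzip", ".bz2": "bzip2"}
FORMATS = {
    ".sam": "sam",
    ".bam": "bam",
    ".fa": "fasta", ".fasta": "fasta", ".fna": "fasta",
    ".fq": "fastq", ".fastq": "fastq",
}


def detect_format(filename):
    """Return (format, compression); either may be None if unrecognized."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    compression = None
    # Peel off a trailing compression suffix first, e.g. ".gz" in "x.sam.gz".
    if suffixes and suffixes[-1] in COMPRESSION:
        compression = COMPRESSION[suffixes.pop()]
    # Whatever suffix remains last decides the record format.
    fmt = FORMATS.get(suffixes[-1]) if suffixes else None
    return fmt, compression
```

Under these assumptions, detect_format("mapped.sam.gz") yields ("sam", "gzip"), while a name with no recognized extension yields (None, None).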

Paired File Handling

Stream supports both interleaved and twin-file paired-end formats:

Interleaved: Both reads of each pair in a single file, alternating (read1, read2, read1, read2...)
Twin files: Separate files for read 1 and read 2 using in1/in2 or out1/out2 parameters
Auto-numbering: Use # in filename to generate _1 and _2 files automatically (e.g., reads_#.fq → reads_1.fq and reads_2.fq)
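
The relationship between the three layouts can be illustrated with a small Python sketch. It is a conceptual model only: the helper names are hypothetical, reads are treated as opaque objects, and substituting only the first # is an assumption rather than documented behavior.

```python
def expand_pair_names(pattern):
    """Expand 'reads_#.fq' into ('reads_1.fq', 'reads_2.fq').

    Returns None when the pattern contains no '#'.
    """
    if "#" not in pattern:
        return None
    # Assumption: only the first '#' is substituted.
    return pattern.replace("#", "1", 1), pattern.replace("#", "2", 1)


def interleave(reads1, reads2):
    """Merge twin-file reads into interleaved order: r1, r2, r1, r2, ..."""
    # zip() stops at the shorter input; real paired files should match in length.
    for r1, r2 in zip(reads1, reads2):
        yield r1
        yield r2


def deinterleave(reads):
    """Split interleaved reads back into (read 1 list, read 2 list)."""
    reads = list(reads)
    return reads[0::2], reads[1::2]
```

In this model, merging twin files corresponds to interleave, and writing to a # pattern corresponds to deinterleave followed by one output file per name from expand_pair_names.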

Subsampling

Random subsampling provides flexible read selection:

Random sampling: Uses a seeded pseudorandom number generator to retain each read with probability samplerate, giving a uniform sample of the input
Reproducible subsets: Same sampleseed value produces identical read selections across multiple runs
Count limiting: Processing terminates after reaching the specified read count, avoiding full file processing
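
The interaction of samplerate, sampleseed, and reads can be sketched as follows. This is a minimal Python model, not the tool's implementation: the function name is hypothetical, Python's random module stands in for whatever generator the tool uses, and it assumes the reads limit counts input reads rather than retained reads.

```python
import random


def subsample(reads, samplerate=1.0, sampleseed=17, max_reads=-1):
    """Yield a random subset of reads, stopping after max_reads inputs.

    Assumption: max_reads limits reads consumed from the input, not reads
    kept in the output. A negative value means no limit.
    """
    # A negative seed falls back to an unpredictable, non-reproducible seed.
    rng = random.Random(None if sampleseed < 0 else sampleseed)
    for index, read in enumerate(reads):
        if 0 <= max_reads <= index:
            break  # stop early without touching the rest of the input
        # Keep each read independently with probability samplerate.
        if samplerate >= 1.0 or rng.random() < samplerate:
            yield read
```

Because the generator is seeded, two calls with the same sampleseed and the same input select the same reads, and because the loop breaks at the limit, the remainder of a large input is never consumed.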

Streaming Architecture

Stream uses the same multithreaded pipeline architecture as SamStreamer but without the filtering layer:

Producer thread: Reads and decompresses input files, batching records for worker threads
Worker threads: Parse records in parallel, converting between formats as needed
Ordered output: Priority-queue-based system maintains sequential output ordering
BAM decompression: Multithreaded BGZF decompression processes compressed blocks in parallel
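
The ordered-output idea can be sketched in Python as a priority queue keyed on batch number: workers finish in any order, and a batch is released only once every earlier batch has been written. This is an illustration of the general technique, not the tool's Java code; OrderedWriter and convert_batches are hypothetical names.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor


class OrderedWriter:
    """Accepts (batch_id, result) in any order and emits results sequentially."""

    def __init__(self, sink):
        self._sink = sink      # callable that receives each result in order
        self._heap = []        # min-heap of (batch_id, result)
        self._next_id = 0      # id of the next batch allowed out
        self._lock = threading.Lock()

    def submit(self, batch_id, result):
        with self._lock:
            heapq.heappush(self._heap, (batch_id, result))
            # Drain every batch that is now contiguous with what was written.
            while self._heap and self._heap[0][0] == self._next_id:
                _, ready = heapq.heappop(self._heap)
                self._sink(ready)
                self._next_id += 1


def convert_batches(batches, convert, workers=4):
    """Run convert() on each batch in parallel; return results in input order."""
    output = []
    writer = OrderedWriter(output.append)

    def work(batch_id, batch):
        writer.submit(batch_id, convert(batch))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, i, b) for i, b in enumerate(batches)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return output
```

With ordering disabled, a design like this could pass results straight to the sink and skip the heap, which is one plausible reason unordered output is slightly faster.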

Notes

Native BAM support requires no external dependencies (samtools/sambamba not needed)
Automatic format detection from file extensions - no manual specification needed
For SAM/BAM filtering by alignment properties (mapping quality, identity, flags), use samstreamer.sh instead
The simd flag requires a CPU with 256-bit vector instructions such as AVX2 (most Intel/AMD processors since 2013)
The # auto-numbering feature works in both input and output filename parameters