Stream

Script: stream.sh
Package: stream
Class: StreamerWrapper.java

Converts between SAM, BAM, FASTA, and FASTQ formats with optional subsampling and paired-file support. Uses the same multithreaded streaming architecture as SamStreamer but without filtering, which gives a simpler interface for basic format conversion. Supports random subsampling via the samplerate parameter and read-count limiting via the reads parameter. Native BAM support includes multithreaded BGZF decompression. Handles both interleaved and twin-file paired-end formats, with automatic file numbering via the # symbol.

Basic Usage

stream.sh in=<file> out=<file>
stream.sh <input> <output>

Stream accepts SAM, BAM, FASTA, or FASTQ input and outputs to any of these formats. Format is detected automatically from file extensions. Compression formats (.gz, .bz2) are detected and handled automatically.

Examples

stream.sh mapped.bam mapped.sam.gz
stream.sh in=reads.fq out=subset.fq samplerate=0.1
stream.sh in1=reads_1.fq in2=reads_2.fq out=merged.fq
stream.sh in=merged.fq out=reads_#.fq
stream.sh in=large.fq out=subset.fq reads=1000000

Parameters

Parameters are organized by their function in the streaming and conversion process.

File Parameters

in=<file>
Primary input file. Format detected automatically from extension (.sam, .bam, .fa, .fq). Supports gzip (.gz) and bzip2 (.bz2) compression. Can be stdin.
in2=<file>
Secondary input file for paired-end reads. Use together with in= (written as in1= in the examples) when the two reads of each pair are stored in separate files.
out=<file>
Primary output file. Format determined by extension. Supports compression. Can be stdout. Optional if only counting reads.
out2=<file>
Secondary output file for paired-end reads. Alternatively, put the # symbol in out= for auto-numbering (e.g., out=reads_#.fq generates reads_1.fq and reads_2.fq).
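
The # naming convention can be illustrated with a small shell sketch. The helper name expand_pair is hypothetical and only mimics the file naming described above; it is not Stream's own code and does not call stream.sh.

```shell
# Hypothetical helper: expand a '#' placeholder into the two twin-file names.
# Illustrates the naming convention only; this is not how Stream is implemented.
expand_pair() {
  pattern=$1
  case $pattern in
    *'#'*)
      printf '%s\n' "$pattern" | sed 's/#/1/'   # name of the first file of the pair
      printf '%s\n' "$pattern" | sed 's/#/2/'   # name of the second file of the pair
      ;;
    *)
      printf '%s\n' "$pattern"                  # no placeholder: a single file
      ;;
  esac
}

expand_pair 'reads_#.fq'
```

Quoting the pattern keeps the shell from treating # as the start of a comment in contexts where that could happen.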

Processing Parameters

samplerate=1.0
Fraction of reads to retain in output (0.0 to 1.0). Reads are selected randomly based on sampleseed. Use 0.1 for 10% random sampling.
sampleseed=17
Random seed for subsampling. Use -1 for a random seed based on system time. A fixed seed makes subsampling reproducible.
reads=-1
Stop after processing this many reads. Use -1 to process all reads. Useful for extracting fixed-size subsets from large files.
ordered=t
Maintain input order in output. Setting to false enables slightly faster processing when order is not required.

Threading Parameters

threadsin=0
Number of reader threads. 0 = auto-detect based on CPU cores and compression type. More threads help with compressed input.
threadsout=0
Number of writer threads. 0 = auto-detect. More threads help with compressed output.

Performance Parameters

simd
Enable SIMD (vectorized) processing for accelerated operations. Requires Java 17+ and 256-bit vector instruction sets (AVX2 or equivalent).

Technical Details

Format Detection

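Input and output formats are detected automatically from file extensions. Supported formats are SAM (.sam), BAM (.bam), FASTA (.fa), and FASTQ (.fq), each optionally compressed with gzip (.gz) or bzip2 (.bz2).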
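
As a rough illustration of extension-based detection, the sketch below strips a compression suffix and then matches the remaining extension. The function name detect_format is hypothetical, and the sketch covers only the extensions named in this document; Stream's actual detection logic may recognize more.

```shell
# Hypothetical helper: guess the sequence format from a filename.
# Only the extensions listed in this document are handled (an assumption).
detect_format() {
  name=$1
  # Strip one compression suffix first, so reads.fq.gz is treated like reads.fq.
  case $name in
    *.gz)  name=${name%.gz} ;;
    *.bz2) name=${name%.bz2} ;;
  esac
  case $name in
    *.sam) echo sam ;;
    *.bam) echo bam ;;
    *.fa)  echo fasta ;;
    *.fq)  echo fastq ;;
    *)     echo unknown ;;
  esac
}

detect_format mapped.sam.gz
detect_format reads.fq
```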

Paired File Handling

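Stream supports both interleaved and twin-file paired-end formats. Twin files are given with in= and in2= (or out= and out2=), and a # in a file name expands to 1 and 2 for the two files of a pair.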
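
To make the difference between the two layouts concrete, the following sketch interleaves two twin files by hand. It assumes strict 4-line FASTQ records and equal read counts in both files; the function name interleave_fastq is hypothetical, and this is a plain awk illustration rather than what Stream does internally.

```shell
# Hypothetical helper: merge twin FASTQ files into one interleaved stream.
# $1 = read 1 file, $2 = read 2 file; assumes 4 lines per record.
interleave_fastq() {
  awk -v mate="$2" '
    { print }                       # copy the current line of the read 1 record
    NR % 4 == 0 {                   # a complete read 1 record has been printed
      for (i = 0; i < 4; i++)       # append the matching 4-line read 2 record
        if ((getline line < mate) > 0) print line
    }
  ' "$1"
}

# Usage (file names are placeholders):
# interleave_fastq reads_1.fq reads_2.fq > merged.fq
```

The result alternates read 1 and read 2 records, which is the layout that an interleaved file such as merged.fq in the examples is expected to have.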

Subsampling

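Random subsampling is controlled by samplerate (the fraction of reads to keep) and sampleseed (a fixed seed gives reproducible results). The reads parameter independently caps the number of reads processed.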
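
The idea behind seeded subsampling and read limiting can be sketched with standard tools. The function names subsample_fastq and limit_reads are hypothetical, the input is assumed to be uncompressed 4-line FASTQ, and awk's random number generator is not Stream's, so the reads selected here will differ from what stream.sh selects with the same seed.

```shell
# Hypothetical helper: keep each read with probability "rate", using a fixed seed.
# $1 = FASTQ file, $2 = fraction to keep (0.0 to 1.0), $3 = integer seed
subsample_fastq() {
  awk -v rate="$2" -v seed="$3" '
    BEGIN { srand(seed) }                    # fixed seed gives a repeatable sequence
    NR % 4 == 1 { keep = (rand() < rate) }   # decide once per 4-line record
    keep { print }                           # emit all lines of a kept record
  ' "$1"
}

# Hypothetical helper: stop after a fixed number of reads.
# $1 = FASTQ file, $2 = maximum number of reads (positive integer)
limit_reads() {
  head -n "$(( $2 * 4 ))" "$1"
}

# Usage (file names are placeholders):
# subsample_fastq reads.fq 0.1 17 > subset.fq
# limit_reads large.fq 1000000 > subset.fq
```

Making the keep-or-drop decision once per record, rather than once per line, is what keeps each read's header, bases, and qualities together.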

Streaming Architecture

Stream uses the same multithreaded pipeline architecture as SamStreamer but without the filtering layer. The number of reader and writer threads is set with threadsin and threadsout.

Notes