Stream
Converts between SAM, BAM, FASTA, and FASTQ formats with optional subsampling and paired-file support. Uses the same multithreaded streaming architecture as SamStreamer but without filtering capabilities, providing a simpler interface for basic format conversion tasks. Supports random subsampling via the samplerate parameter and read-count limiting via the reads parameter. Native BAM support includes multithreaded BGZF decompression. Handles both interleaved and twin-file paired-end formats, with automatic file numbering using the # symbol.
Basic Usage
stream.sh in=<file> out=<file>
stream.sh <input> <output>
Stream accepts SAM, BAM, FASTA, or FASTQ input and outputs to any of these formats. Format is detected automatically from file extensions. Compression formats (.gz, .bz2) are detected and handled automatically.
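As a rough illustration of what a FASTQ-to-FASTA conversion involves, the sketch below drops the quality lines and rewrites the header prefix. It is not Stream's code: the function name fastq_to_fasta is hypothetical, and it assumes unwrapped four-line FASTQ records.

```python
def fastq_to_fasta(lines):
    """Convert 4-line FASTQ records to 2-line FASTA records (qualities are dropped).

    Assumption: each record is exactly four lines (header, sequence, '+', quality).
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield ">" + header[1:]  # '@name' becomes '>name'
        yield seq


if __name__ == "__main__":
    record = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
    print("\n".join(fastq_to_fasta(record)))
```

Going the other way (FASTA to FASTQ) would need placeholder quality scores, since FASTA carries none.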
Examples
stream.sh mapped.bam mapped.sam.gz
stream.sh in=reads.fq out=subset.fq samplerate=0.1
stream.sh in1=reads_1.fq in2=reads_2.fq out=merged.fq
stream.sh in=merged.fq out=reads_#.fq
stream.sh in=large.fq out=subset.fq reads=1000000
Parameters
Parameters are organized by their function in the streaming and conversion process.
File Parameters
- in=<file>
- Primary input file. Format detected automatically from extension (.sam, .bam, .fa, .fq). Supports gzip (.gz) and bzip2 (.bz2) compression. Can be stdin.
- in2=<file>
- Secondary input file for paired-end reads. Holds read 2 when the pairs are stored in twin files rather than interleaved.
- out=<file>
- Primary output file. Format determined by extension. Supports compression. Can be stdout. Optional if only counting reads.
- out2=<file>
- Secondary output file for paired-end reads. Alternatively, put a # symbol in out= for auto-numbering (e.g., out=reads_#.fq generates reads_1.fq and reads_2.fq).
Processing Parameters
- samplerate=1.0
- Fraction of reads to retain in output (0.0 to 1.0). Reads are selected randomly based on sampleseed. Use 0.1 for 10% random sampling.
- sampleseed=17
- Random seed for subsampling. Use -1 for random seed based on system time. Same seed produces reproducible subsampling.
- reads=-1
- Stop after processing this many reads. Use -1 to process all reads. Useful for extracting fixed-size subsets from large files.
- ordered=t
- Maintain input order in output. Set ordered=f for slightly faster processing when output order does not matter.
Threading Parameters
- threadsin=0
- Number of reader threads. 0 = auto-detect based on CPU cores and compression type. More threads help with compressed input.
- threadsout=0
- Number of writer threads. 0 = auto-detect. More threads help with compressed output.
Performance Parameters
- simd
- Enable SIMD (vectorized) processing for accelerated operations. Requires Java 17+ and 256-bit vector instruction sets (AVX2 or equivalent).
Technical Details
Format Detection
Input and output formats are automatically detected from file extensions. Supported formats include:
- SAM: .sam, .sam.gz, .sam.bz2 (text-based alignment format)
- BAM: .bam (binary BGZF-compressed alignment format)
- FASTA: .fa, .fasta, .fna, .fa.gz, .fasta.gz (sequence format)
- FASTQ: .fq, .fastq, .fq.gz, .fastq.gz (sequence with quality scores)
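Extension-based detection of this kind can be sketched as follows. The extension tables come from the list above; the function detect_format and its return shape are hypothetical and are not Stream's actual implementation.

```python
from pathlib import PurePath

# Extensions taken from the list above.
FORMATS = {
    ".sam": "sam",
    ".bam": "bam",
    ".fa": "fasta", ".fasta": "fasta", ".fna": "fasta",
    ".fq": "fastq", ".fastq": "fastq",
}
COMPRESSION = {".gz": "gzip", ".bz2": "bzip2"}


def detect_format(filename):
    """Return (format, compression) guessed from the file name.

    compression is None when there is no .gz/.bz2 suffix. BAM is reported
    without a compression suffix because BGZF is part of the format itself.
    """
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    compression = None
    if suffixes and suffixes[-1] in COMPRESSION:
        compression = COMPRESSION[suffixes.pop()]
    if not suffixes or suffixes[-1] not in FORMATS:
        raise ValueError("unrecognized extension: " + filename)
    return FORMATS[suffixes[-1]], compression


if __name__ == "__main__":
    for name in ("mapped.bam", "mapped.sam.gz", "reads_1.fq"):
        print(name, detect_format(name))
```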
Paired File Handling
Stream supports both interleaved and twin-file paired-end formats:
- Interleaved: Both reads of each pair in a single file, alternating (read1, read2, read1, read2...)
- Twin files: Separate files for read 1 and read 2 using in1/in2 or out1/out2 parameters
- Auto-numbering: Use # in filename to generate _1 and _2 files automatically (e.g., reads_#.fq → reads_1.fq and reads_2.fq)
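The # substitution and the relationship between twin files and interleaved order can be illustrated with a small sketch. The helper names expand_pair and interleave are hypothetical, and the sketch assumes only the first # in a name is substituted.

```python
def expand_pair(name):
    """Expand a '#' placeholder into read 1 and read 2 file names.

    Returns a 1-tuple when the name has no '#', meaning a single file.
    """
    if "#" not in name:
        return (name,)
    return (name.replace("#", "1", 1), name.replace("#", "2", 1))


def interleave(reads1, reads2):
    """Merge twin-file reads into interleaved order: r1, r2, r1, r2, ..."""
    for r1, r2 in zip(reads1, reads2):
        yield r1
        yield r2


if __name__ == "__main__":
    print(expand_pair("reads_#.fq"))
    print(list(interleave(["a/1", "b/1"], ["a/2", "b/2"])))
```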
Subsampling
Random subsampling provides flexible read selection:
- Random sampling: Each read is kept independently with probability samplerate, using a pseudorandom number generator seeded by sampleseed
- Reproducible subsets: Same sampleseed value produces identical read selections across multiple runs
- Count limiting: Processing terminates after reaching the specified read count, avoiding full file processing
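The interaction of samplerate, sampleseed, and reads can be sketched as below. This is an illustration under assumptions, not Stream's code: Python's random.Random stands in for whatever generator the tool uses, the function name subsample is hypothetical, and the read limit is applied to input reads before sampling, which is one reading of "stop after processing this many reads".

```python
import itertools
import random
import time


def subsample(records, samplerate=1.0, sampleseed=17, reads=-1):
    """Yield a random subset of records.

    samplerate: probability of keeping each record (0.0 to 1.0).
    sampleseed: fixed seed for reproducibility; negative means time-based.
    reads:      stop after this many input records; -1 means no limit.
    """
    seed = time.time_ns() if sampleseed < 0 else sampleseed
    rng = random.Random(seed)
    # islice stops early without reading the rest of the input.
    limited = itertools.islice(records, reads if reads >= 0 else None)
    for rec in limited:
        if samplerate >= 1.0 or rng.random() < samplerate:
            yield rec


if __name__ == "__main__":
    first = list(subsample(range(1000), samplerate=0.1, sampleseed=17))
    second = list(subsample(range(1000), samplerate=0.1, sampleseed=17))
    print(len(first), first == second)
```

Because every record is an independent draw, the output size is only approximately samplerate times the input size; a fixed seed makes the selection repeatable for the same input.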
Streaming Architecture
Stream uses the same multithreaded pipeline architecture as SamStreamer but without the filtering layer:
- Producer thread: Reads and decompresses input files, batching records for worker threads
- Worker threads: Parse records in parallel, converting between formats as needed
- Ordered output: Priority-queue-based system maintains sequential output ordering
- BAM decompression: Multithreaded BGZF decompression processes compressed blocks in parallel
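The ordered-output idea can be sketched with a min-heap keyed on batch number: workers may finish in any order, but a batch is emitted only once every earlier batch has been emitted. This is a simplified illustration, not the SamStreamer pipeline; the function name convert_ordered is hypothetical, and for brevity it submits all batches up front rather than applying backpressure.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_ordered(batches, convert, threads=4):
    """Apply convert() to each batch in worker threads, yielding results in input order."""
    heap = []     # min-heap of (batch_id, result) for batches that finished early
    next_id = 0   # id of the next batch that may be emitted
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(convert, b): i for i, b in enumerate(batches)}
        for fut in as_completed(futures):
            heapq.heappush(heap, (futures[fut], fut.result()))
            # Drain every batch that is now contiguous with what was emitted.
            while heap and heap[0][0] == next_id:
                yield heapq.heappop(heap)[1]
                next_id += 1


if __name__ == "__main__":
    batches = [["acgt", "ttga"], ["ggcc"], ["atat", "cgcg"]]
    upper = lambda batch: [s.upper() for s in batch]
    print(list(convert_ordered(batches, upper)))
```

With ordering disabled, the heap would be unnecessary and results could be written as soon as they complete, which is why unordered output can be slightly faster.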
Notes
- Native BAM support requires no external dependencies (samtools/sambamba not needed)
- Formats are detected automatically from file extensions; no manual specification is needed
- For SAM/BAM filtering by alignment properties (mapping quality, identity, flags), use samstreamer.sh instead
- SIMD flag requires modern CPU with 256-bit vector instructions (most Intel/AMD processors since 2013)
- The # auto-numbering feature works in both input and output filename parameters