GetReads

Basic Usage

getreads.sh in=<file> id=<number,number,number...> out=<file>

Extracts specific reads by their numeric position in the input file. Read numbering starts from 0.

Parameters

Parameters are organized by function. All standard BBTools parsing parameters are supported.

Core Parameters

in=<file>: Specify the input file, or stdin. Required parameter.
out=<file>: Specify the output file, or stdout. If not specified, defaults to stdout.
id=<number,number,number...>: Comma-delimited list of individual read IDs or hyphenated ranges, in any order. For example: id=5,93,17-31,8,0,12-13.

Paired-End Parameters

in2=<file>: Second input file for paired-end reads.
out2=<file>: Second output file for paired-end reads.

Quality File Parameters

qfin1=<file>: Quality input file for first mate.
qfin2=<file>: Quality input file for second mate.
qfout1=<file>: Quality output file for first mate.
qfout2=<file>: Quality output file for second mate.

Processing Parameters

passes=1: Number of passes through the input data.
reads=-1: Maximum number of reads to process. -1 means process all reads.
verbose=f: Enable verbose output for debugging.
build=1: Set genome build number for internal tracking.

Sampling Parameters

samplerate=1.0: Sample rate for processing reads (0.0-1.0). 1.0 processes all reads.
sampleseed=1: Random seed for sampling reproducibility.

File Handling Parameters

overwrite=f: Overwrite existing output files if they exist.
append=f: Append to existing output files instead of overwriting.
testsize=f: Print file processing speed statistics.

Input Format Parameters

minscaf=0: Minimum scaffold/contig length for FASTA input processing.

Examples

Extract Single Reads

getreads.sh in=input.fastq out=selected.fastq id=0,5,10

Extracts the 1st, 6th, and 11th reads from the input file.

Extract Read Ranges

getreads.sh in=reads.fq out=subset.fq id=100-200,500,1000-1010

Extracts the 101st through 201st reads, the 501st read, and the 1001st through 1011th reads from the input (IDs are 0-based, so ID 100 is the 101st read).

Paired-End Read Extraction

getreads.sh in1=reads_R1.fq in2=reads_R2.fq out1=selected_R1.fq out2=selected_R2.fq id=50-100

Extracts the 51st through 101st read pairs (pair IDs 50-100) from paired-end input files.

Complex Range Selection

getreads.sh in=large_dataset.fastq out=training_set.fastq id=0-999,5000-5999,10000-10999

Creates a training dataset by extracting the first 1000 reads, the 5001st through 6000th reads, and the 10001st through 11000th reads (3000 reads in total).

Algorithm Details

GetReads uses a HashSet-based lookup strategy for efficient read selection:

Core Algorithm

ID Parsing: Parses the comma-separated ID list, expanding ranges (e.g., "10-15" becomes 10,11,12,13,14,15) into individual IDs stored in a HashSet for O(1) lookup performance.
Sequential Processing: Reads through the input file sequentially, checking each read's numeric ID against the HashSet. When a match is found, the read (and its mate if paired) is written to output.
Early Termination: Processing stops when all requested reads have been found and the HashSet becomes empty, improving performance for sparse selections from large files.
Memory Efficiency: The same HashSet holds only the IDs still to be found. Each ID is removed once its read has been written, so the set shrinks as processing proceeds.
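
The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the tool's Java source: the function name select_reads and the use of plain strings as stand-ins for reads are assumptions made for the example.

```python
from typing import Iterable, Iterator, Set, Tuple


def select_reads(records: Iterable[str], wanted: Set[int]) -> Iterator[Tuple[int, str]]:
    """Yield (position, record) for each 0-based position listed in `wanted`.

    Hypothetical helper for illustration; not part of BBTools.
    """
    remaining = set(wanted)  # copy, so the caller's set is left untouched
    if not remaining:
        return
    for position, record in enumerate(records):
        if position in remaining:
            remaining.remove(position)  # found: no longer needs tracking
            yield position, record
            if not remaining:
                break  # early termination: every requested ID was found


# Sparse selection from a larger input: iteration stops right after ID 10.
source = iter(f"read{i}" for i in range(1000))
picked = list(select_reads(source, {0, 5, 10}))
next_unread = next(source)  # first record the selection loop never touched

# An ID beyond the end of the input is simply never matched.
missing = list(select_reads(["a", "b", "c"], {1, 99}))
```

In this sketch, a requested ID past the end of the input keeps the set from ever emptying, so the whole input is scanned before the loop ends.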

Paired-End Handling

For paired-end data, the tool treats each pair as a single unit with one ID. When a read pair is selected, both mates are written to their respective output files simultaneously.
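
A minimal sketch of this pairing behavior, assuming two mate streams of equal length that are read in lockstep. The name select_pairs and the string stand-ins for reads are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, Set, Tuple


def select_pairs(mates1: Iterable[str], mates2: Iterable[str],
                 wanted: Set[int]) -> Iterator[Tuple[str, str]]:
    """Yield (mate1, mate2) for each 0-based pair ID in `wanted`.

    Hypothetical helper for illustration; not part of BBTools.
    """
    remaining = set(wanted)
    # zip() walks both inputs together, so one counter identifies the pair.
    for pair_id, (mate1, mate2) in enumerate(zip(mates1, mates2)):
        if not remaining:
            break  # all requested pairs have been written
        if pair_id in remaining:
            remaining.remove(pair_id)
            yield mate1, mate2


r1 = [f"pair{i}/1" for i in range(6)]
r2 = [f"pair{i}/2" for i in range(6)]

out1, out2 = [], []
for mate1, mate2 in select_pairs(r1, r2, {1, 4}):
    out1.append(mate1)  # would go to the first output file
    out2.append(mate2)  # would go to the second output file
```

Because a single ID selects both mates, the two outputs always stay in sync: entry N of the first output is the mate of entry N of the second.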

Multi-Pass Support

The tool supports multiple passes through the input data, useful for complex workflows or when combined with sampling. Each pass reinitializes the input stream and resets the ID lookup table.

Performance Characteristics

Time Complexity: O(n) where n is the number of reads processed until all target IDs are found
Memory Usage: O(k) where k is the number of unique IDs requested, plus stream buffers
I/O Efficiency: With the default passes=1, the input is read once, and early termination avoids reading past the last requested read

Format Support

Supports standard sequence formats including FASTQ and FASTA, along with their compressed variants. The input format is detected automatically.

Technical Notes

Read ID Assignment

Read IDs are assigned sequentially starting from 0. For paired-end data, each pair gets one ID (not separate IDs for each mate). The first pair has ID 0, second pair has ID 1, etc.

Range Syntax

Range notation uses inclusive bounds. "10-15" includes reads with IDs 10, 11, 12, 13, 14, and 15.
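
The inclusive-bounds rule can be illustrated with a short Python sketch of how an id= value might be expanded. The function name parse_ids is hypothetical, and rejecting a range whose start exceeds its end is an assumption of this sketch rather than documented tool behavior.

```python
from typing import Set


def parse_ids(spec: str) -> Set[int]:
    """Expand a spec such as "5,93,17-31" into a set of 0-based read IDs.

    Hypothetical helper for illustration; not part of BBTools.
    """
    ids: Set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start, end = int(start_text), int(end_text)  # ValueError if malformed
            if start > end:
                # Assumption: treat a reversed range as an error.
                raise ValueError(f"range start exceeds end: {token!r}")
            ids.update(range(start, end + 1))  # +1 makes the upper bound inclusive
        else:
            ids.add(int(token))  # ValueError if not a number
    return ids


small = parse_ids("10-15")
example = parse_ids("5,93,17-31,8,0,12-13")
```

Here "10-15" expands to six IDs, and the id= example from the Parameters section expands to 21 IDs: four single IDs, fifteen from 17-31, and two from 12-13.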

Memory Considerations

Default memory allocation is 200MB, which is sufficient for most ID lists. For extremely large ID sets, raise the Java heap limit with the standard -Xmx flag (for example, -Xmx1g).

Error Handling

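The tool validates all ID ranges and reports errors for malformed ID specifications. Requested IDs beyond the end of the file are silently skipped, with no error reported. In that case the set of remaining IDs never empties, so early termination does not apply and the whole input is read.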

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org