GetReads

Script: getreads.sh
Package: jgi
Class: GetReads.java

Selects reads with designated numeric IDs from sequencing files. The first read (or pair) has ID 0, the second read (or pair) has ID 1, and so on.

Basic Usage

getreads.sh in=<file> id=<number,number,number...> out=<file>

Extracts specific reads by their numeric position in the input file. Read numbering starts from 0.

Parameters

Parameters are organized by function. All standard BBTools parsing parameters are supported.

Core Parameters

in=<file>
Specify the input file, or stdin. Required parameter.
out=<file>
Specify the output file, or stdout. If not specified, defaults to stdout.
id=
Comma-delimited list of read IDs, in any order. Each entry is either a single ID or a hyphenated range. For example: id=5,93,17-31,8,0,12-13.

Paired-End Parameters

in2=<file>
Second input file for paired-end reads.
out2=<file>
Second output file for paired-end reads.

Quality File Parameters

qfin1=<file>
Quality input file for first mate.
qfin2=<file>
Quality input file for second mate.
qfout1=<file>
Quality output file for first mate.
qfout2=<file>
Quality output file for second mate.

Processing Parameters

passes=1
Number of passes through the input data.
reads=-1
Maximum number of reads to process. -1 means process all reads.
verbose=f
Enable verbose output for debugging.
build=1
Set genome build number for internal tracking.

Sampling Parameters

samplerate=1.0
Sample rate for processing reads (0.0-1.0). 1.0 processes all reads.
sampleseed=1
Random seed for sampling reproducibility.
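
A minimal sketch of seeded sampling in Python, to illustrate how samplerate and sampleseed relate. The function name sample_reads is hypothetical, and Python's random generator is not the one used by the Java tool, so the subset chosen here will not match the tool's output. Only the reproducibility property is being illustrated.

```python
import random


def sample_reads(reads, samplerate=1.0, sampleseed=1):
    """Keep each read with probability `samplerate` (hypothetical helper)."""
    # A generator seeded with a fixed value makes the same keep/skip
    # decisions every time it sees the same input order.
    rng = random.Random(sampleseed)
    for read in reads:
        # rng.random() is in [0.0, 1.0), so samplerate=1.0 keeps every read
        # and samplerate=0.0 keeps none.
        if rng.random() < samplerate:
            yield read
```

Running the function twice with the same seed and the same input yields the same subset; changing the seed generally yields a different one.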

File Handling Parameters

overwrite=f
Overwrite existing output files if they exist.
append=f
Append to existing output files instead of overwriting.
testsize=f
Print file processing speed statistics.

Input Format Parameters

minscaf=0
Minimum scaffold/contig length for FASTA input processing.

Examples

Extract Single Reads

getreads.sh in=input.fastq out=selected.fastq id=0,5,10

Extracts the 1st, 6th, and 11th reads from the input file.

Extract Read Ranges

getreads.sh in=reads.fq out=subset.fq id=100-200,500,1000-1010

Extracts the 101st through 201st reads, the 501st read, and the 1001st through 1011th reads from the input (IDs are 0-based).

Paired-End Read Extraction

getreads.sh in1=reads_R1.fq in2=reads_R2.fq out1=selected_R1.fq out2=selected_R2.fq id=50-100

Extracts the 51st through 101st read pairs from paired-end input files.

Complex Range Selection

getreads.sh in=large_dataset.fastq out=training_set.fastq id=0-999,5000-5999,10000-10999

Creates a training dataset by extracting the first 1000 reads, the 5001st through 6000th reads, and the 10001st through 11000th reads.
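
Because IDs are 0-based, it can help to convert an id= list into 1-based positions and counts before running a large extraction. The Python helper below is a sketch for that arithmetic only; describe_ranges is a hypothetical name and is not part of BBTools. It assumes well-formed input and does not merge overlapping ranges.

```python
def describe_ranges(id_spec):
    """Map a 0-based id= list to (first_position, last_position, count) tuples.

    Positions are 1-based, so ID 0 is reported as position 1.
    """
    summary = []
    for token in id_spec.split(","):
        lo, _, hi = token.partition("-")
        first_id = int(lo)
        # A bare number such as "500" is treated as a one-read range.
        last_id = int(hi) if hi else first_id
        # Ranges are inclusive, hence the +1 in the count.
        summary.append((first_id + 1, last_id + 1, last_id - first_id + 1))
    return summary
```

For the list 0-999,5000-5999,10000-10999 this reports three blocks of 1000 reads each, 3000 reads in total. For 100-200,500,1000-1010 it reports 101, 1, and 11 reads, 113 in total.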

Algorithm Details

GetReads uses a HashSet-based lookup of the requested IDs for efficient read selection.
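
The idea of a hash-set lookup can be sketched in a few lines of Python. This is an illustration under assumptions, not the GetReads source: select_by_id is a hypothetical name, records stand in for parsed reads, and the sketch simply counts records as they stream past and tests each count against the set.

```python
def select_by_id(records, wanted_ids):
    """Yield the records whose 0-based position is in wanted_ids."""
    # A set gives average constant-time membership tests, so the cost per
    # read does not grow with the number of requested IDs.
    wanted = set(wanted_ids)
    for read_id, record in enumerate(records):
        if read_id in wanted:
            yield record
```

In this sketch the output follows input order regardless of the order of the requested IDs, and an ID past the end of the input matches nothing.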

Core Algorithm

Paired-End Handling

For paired-end data, the tool treats each pair as a single unit with one ID. When a read pair is selected, both mates are written to their respective output files simultaneously.
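
The one-ID-per-pair rule can be sketched as follows in Python. The function select_pairs is hypothetical and assumes both inputs list mates in the same order. It is meant only to show that a single ID selects both mates.

```python
def select_pairs(reads1, reads2, wanted_ids):
    """Return (mates_1, mates_2) for the pairs whose 0-based pair ID is wanted."""
    wanted = set(wanted_ids)
    out1, out2 = [], []
    # zip() walks both inputs in lockstep, so each step is one pair
    # and enumerate() assigns that pair a single ID.
    for pair_id, (mate1, mate2) in enumerate(zip(reads1, reads2)):
        if pair_id in wanted:
            out1.append(mate1)  # would go to the first output file
            out2.append(mate2)  # would go to the second output file
    return out1, out2
```

Selecting IDs 0 and 2 from three pairs returns the first and third read from each input, kept in matching order.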

Multi-Pass Support

The tool supports multiple passes through the input data, useful for complex workflows or when combined with sampling. Each pass reinitializes the input stream and resets the ID lookup table.

Performance Characteristics

Format Support

Supports standard sequence formats, including FASTQ and FASTA, as well as their compressed variants. The input format is detected automatically.

Technical Notes

Read ID Assignment

Read IDs are assigned sequentially starting from 0. For paired-end data, each pair gets one ID (not separate IDs for each mate). The first pair has ID 0, second pair has ID 1, etc.

Range Syntax

Range notation uses inclusive bounds. "10-15" includes reads with IDs 10, 11, 12, 13, 14, and 15.
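
The inclusive-range rule can be made concrete with a small Python parser. This is a sketch of the syntax described above, not the tool's own parser: parse_id_spec is a hypothetical name, and rejecting reversed ranges such as 7-3 is an assumption of the sketch.

```python
def parse_id_spec(id_spec):
    """Expand a comma-delimited list of IDs and inclusive ranges into a set."""
    ids = set()
    for token in id_spec.split(","):
        token = token.strip()
        lo, sep, hi = token.partition("-")
        if sep:
            start, stop = int(lo), int(hi)  # int() raises ValueError on junk
            if stop < start:
                raise ValueError(f"reversed range: {token!r}")
            # Both bounds are included, hence stop + 1.
            ids.update(range(start, stop + 1))
        else:
            ids.add(int(token))
    return ids
```

With this sketch, "10-15" expands to the six IDs 10 through 15, and the order of entries in the list has no effect on the result.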

Memory Considerations

Default memory allocation is 200 MB, which is sufficient for most ID lists. For extremely large ID sets, increase the Java heap size with the standard heap parameters (for example, -Xmx1g).

Error Handling

The tool validates all ID ranges and reports errors for malformed ID specifications. Missing reads (IDs beyond the file length) are silently skipped.

Support

For questions and support: