GetReads
Selects reads with designated numeric IDs from sequencing files. The first read (or pair) has ID 0, the second read (or pair) has ID 1, and so on.
Basic Usage
getreads.sh in=<file> id=<number,number,number...> out=<file>
Extracts the reads at the given zero-based positions in the input file. IDs are passed as a comma-delimited list; if out= is omitted, output goes to stdout.
Parameters
Parameters are organized by function. All standard BBTools parsing parameters are supported.
Core Parameters
- in=<file>
- Specify the input file, or stdin. Required parameter.
- out=<file>
- Specify the output file, or stdout. If not specified, defaults to stdout.
- id=
- Comma-delimited list of read IDs or inclusive ranges (hyphen notation), in any order. For example: id=5,93,17-31,8,0,12-13.
Paired-End Parameters
- in2=<file>
- Second input file for paired-end reads.
- out2=<file>
- Second output file for paired-end reads.
Quality File Parameters
- qfin1=<file>
- Quality input file for first mate.
- qfin2=<file>
- Quality input file for second mate.
- qfout1=<file>
- Quality output file for first mate.
- qfout2=<file>
- Quality output file for second mate.
Processing Parameters
- passes=1
- Number of passes through the input data.
- reads=-1
- Maximum number of reads to process. -1 means process all reads.
- verbose=f
- Enable verbose output for debugging.
- build=1
- Set genome build number for internal tracking.
Sampling Parameters
- samplerate=1.0
- Sample rate for processing reads (0.0-1.0). 1.0 processes all reads.
- sampleseed=1
- Random seed for sampling reproducibility.
File Handling Parameters
- overwrite=f
- Overwrite existing output files if they exist.
- append=f
- Append to existing output files instead of overwriting.
- testsize=f
- Print file processing speed statistics.
Input Format Parameters
- minscaf=0
- Minimum scaffold/contig length for FASTA input processing.
Examples
Extract Single Reads
getreads.sh in=input.fastq out=selected.fastq id=0,5,10
Extracts the 1st, 6th, and 11th reads from the input file.
Extract Read Ranges
getreads.sh in=reads.fq out=subset.fq id=100-200,500,1000-1010
Extracts the reads with IDs 100-200, 500, and 1000-1010, that is, the 101st through 201st reads, the 501st read, and the 1001st through 1011th reads in the input.
Paired-End Read Extraction
getreads.sh in1=reads_R1.fq in2=reads_R2.fq out1=selected_R1.fq out2=selected_R2.fq id=50-100
Extracts the read pairs with IDs 50-100 (the 51st through 101st pairs, 51 pairs in total) from paired-end input files.
Complex Range Selection
getreads.sh in=large_dataset.fastq out=training_set.fastq id=0-999,5000-5999,10000-10999
Creates a training dataset from three blocks of 1000 reads each: IDs 0-999 (the first 1000 reads), IDs 5000-5999, and IDs 10000-10999.
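Long id= lists with several ranges are easy to mistype, especially around the inclusive upper bound. The following is a minimal Python sketch, assuming each block is described as a (first_id, count) pair; build_id_arg is a hypothetical helper for illustration and is not part of BBTools.

```python
from typing import Iterable, List, Tuple

def build_id_arg(blocks: Iterable[Tuple[int, int]]) -> str:
    """Turn (first_id, count) blocks into an id= value with inclusive ranges."""
    parts: List[str] = []
    for first, count in blocks:
        if first < 0 or count < 1:
            raise ValueError(f"invalid block: ({first}, {count})")
        last = first + count - 1  # ranges are inclusive, so subtract 1
        parts.append(str(first) if count == 1 else f"{first}-{last}")
    return "id=" + ",".join(parts)

# Three blocks of 1000 reads, matching the command shown above.
id_arg = build_id_arg([(0, 1000), (5000, 1000), (10000, 1000)])
print(id_arg)
```

The resulting string can be pasted into the getreads.sh command line in place of a hand-written id= argument.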
Algorithm Details
GetReads uses a HashSet-based lookup strategy for efficient read selection:
Core Algorithm
- ID Parsing: Parses the comma-separated ID list, expanding ranges (e.g., "10-15" becomes 10,11,12,13,14,15) into individual IDs stored in a HashSet for O(1) lookup performance.
- Sequential Processing: Reads through the input file sequentially, checking each read's numeric ID against the HashSet. When a match is found, the read (and its mate if paired) is written to output.
- Early Termination: Processing stops when all requested reads have been found and the HashSet becomes empty, improving performance for sparse selections from large files.
- Memory Efficiency: Uses a single HashSet to track remaining IDs to find, removing found IDs to reduce memory usage and enable early termination.
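The steps above can be sketched in a few lines of Python. This is an illustration of the described strategy (set lookup, removal on match, early termination), not the tool's actual Java source; select_by_position and the counting generator are hypothetical names introduced here.

```python
from typing import Iterable, Iterator, List, Set, Tuple, TypeVar

T = TypeVar("T")

def select_by_position(records: Iterable[T], wanted: Set[int]) -> Iterator[Tuple[int, T]]:
    """Yield (id, record) for every record whose 0-based position is in wanted."""
    remaining = set(wanted)  # copy, so the caller's set is left untouched
    if not remaining:
        return
    for read_id, record in enumerate(records):
        if read_id in remaining:
            remaining.remove(read_id)  # found IDs are dropped from the set
            yield read_id, record
            if not remaining:  # early termination: nothing left to find
                return

def counting(n: int, log: List[int]) -> Iterator[str]:
    """Stand-in for an input stream; records which reads were actually consumed."""
    for i in range(n):
        log.append(i)
        yield f"read{i}"

consumed: List[int] = []
picked = list(select_by_position(counting(1_000_000, consumed), {5, 0, 2}))
print(picked)
print(len(consumed))
```

In this sketch only six reads are consumed from the million-read stream, because scanning stops once the highest requested ID has been emitted.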
Paired-End Handling
For paired-end data, the tool treats each pair as a single unit with one ID. When a read pair is selected, both mates are written to their respective output files simultaneously.
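A small Python sketch of the pair-as-unit idea, assuming the two mate files are walked in lockstep and that in-memory lists stand in for the real output files. select_pairs is a hypothetical name, and this is not the tool's own code.

```python
from typing import Iterable, List, Set, Tuple

def select_pairs(mates1: Iterable[str], mates2: Iterable[str],
                 wanted: Set[int]) -> Tuple[List[str], List[str]]:
    """Select whole pairs by pair ID; both mates share a single ID."""
    remaining = set(wanted)
    out1: List[str] = []
    out2: List[str] = []
    # zip walks both inputs in lockstep; enumerate assigns one ID per pair
    for pair_id, (r1, r2) in enumerate(zip(mates1, mates2)):
        if not remaining:
            break
        if pair_id in remaining:
            remaining.remove(pair_id)
            out1.append(r1)  # mate 1 goes to the first output
            out2.append(r2)  # mate 2 goes to the second output
    return out1, out2

r1 = ["a/1", "b/1", "c/1", "d/1"]
r2 = ["a/2", "b/2", "c/2", "d/2"]
sel1, sel2 = select_pairs(r1, r2, {1, 3})
print(sel1, sel2)
```

Because a pair is selected or skipped as a whole, the two outputs always stay the same length and in the same order.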
Multi-Pass Support
The tool supports multiple passes through the input data, useful for complex workflows or when combined with sampling. Each pass reinitializes the input stream and resets the ID lookup table.
Performance Characteristics
- Time Complexity: O(n) where n is the number of reads processed until all target IDs are found. If every requested ID exists, scanning stops after the read with the highest requested ID.
- Memory Usage: O(k) where k is the number of unique IDs requested, plus stream buffers
- I/O Efficiency: With the default passes=1, the input is read once, and early termination avoids reading past the last requested read.
Format Support
Supports the standard sequence formats, including FASTQ and FASTA, as well as their compressed variants. The input format is detected automatically.
Technical Notes
Read ID Assignment
Read IDs are assigned sequentially starting from 0. For paired-end data, each pair gets one ID (not separate IDs for each mate). The first pair has ID 0, second pair has ID 1, etc.
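To sanity-check a selection by eye, it can help to translate a read ID into line numbers. The sketch below assumes unwrapped, four-line FASTQ records (header, sequence, plus line, qualities); it does not apply to FASTA or to FASTQ files with wrapped sequence lines. fastq_line_span is a hypothetical helper.

```python
from typing import Tuple

LINES_PER_RECORD = 4  # assumption: unwrapped, four-line FASTQ records

def fastq_line_span(read_id: int) -> Tuple[int, int]:
    """Return the 1-based (first, last) line numbers of a read in a four-line FASTQ file."""
    if read_id < 0:
        raise ValueError("read IDs start at 0")
    first = read_id * LINES_PER_RECORD + 1
    return first, first + LINES_PER_RECORD - 1

print(fastq_line_span(0))
print(fastq_line_span(5))
```

For paired-end data in two separate files, the same span applies to each file, since both mates of a pair share one ID.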
Range Syntax
Range notation uses inclusive bounds. "10-15" includes reads with IDs 10, 11, 12, 13, 14, and 15.
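A minimal Python sketch of how such a specification could be expanded into a set of IDs with inclusive bounds. parse_ids is a hypothetical name, and the error behavior shown (raising ValueError on malformed or reversed ranges) is an assumption of this sketch rather than documented tool behavior.

```python
from typing import Set

def parse_ids(spec: str) -> Set[int]:
    """Expand a spec such as '5,93,17-31' into a set of IDs; ranges are inclusive."""
    ids: Set[int] = set()
    for token in spec.split(","):
        token = token.strip()
        if "-" in token:
            low_text, high_text = token.split("-", 1)
            low, high = int(low_text), int(high_text)  # int() rejects malformed text
            if low > high:
                raise ValueError(f"reversed range: {token}")
            ids.update(range(low, high + 1))  # +1 makes the upper bound inclusive
        else:
            ids.add(int(token))
    return ids

print(sorted(parse_ids("10-15")))
print(len(parse_ids("5,93,17-31,8,0,12-13")))
```

Using a set means that duplicate or overlapping entries collapse to a single ID, and the order in which IDs are listed does not matter.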
Memory Considerations
Default memory allocation is 200 MB, suitable for most ID lists. For extremely large ID sets, increase memory with the standard Java heap flag (for example, -Xmx1g).
Error Handling
The tool validates all ID ranges and reports errors for malformed ID specifications. Missing reads (IDs beyond the file length) are silently skipped.
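Because IDs beyond the end of the file are skipped without a message, a quick pre-check can show which requested IDs will produce no output. The following Python sketch assumes four-line FASTQ input and uses an in-memory file for the demonstration; count_fastq_records and ids_beyond_end are hypothetical helpers, not BBTools functions.

```python
import io
from typing import Iterable, List, TextIO

def count_fastq_records(handle: TextIO) -> int:
    """Count records, assuming unwrapped four-line FASTQ."""
    lines = sum(1 for _ in handle)
    return lines // 4

def ids_beyond_end(requested: Iterable[int], total_reads: int) -> List[int]:
    """Return the requested IDs that do not exist in a file of total_reads reads."""
    return sorted({i for i in requested if i >= total_reads})

# Three-read example file held in memory.
fastq = io.StringIO(
    "@r0\nACGT\n+\nIIII\n"
    "@r1\nTTTT\n+\nIIII\n"
    "@r2\nGGGG\n+\nIIII\n"
)
total = count_fastq_records(fastq)
missing = ids_beyond_end([0, 2, 3, 7], total)
print(total, missing)
```

Valid IDs run from 0 to total - 1, so an ID equal to the read count is already out of range.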
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org