LoadReads
Tests the memory usage of a sequence file by loading all of its reads into memory and measuring how much RAM they actually consume.
Basic Usage
loadreads.sh in=<file>
Loads all reads from a sequence file into memory and reports detailed memory usage statistics including initial/final memory consumption, memory ratios, and per-read memory overhead.
Parameters
Input Parameters
- in=file
- Input file. Can be any sequence format supported by BBTools (FASTA, FASTQ, gzipped files). Required parameter.
- lowcomplexity=f
- Assume input library is low-complexity. This affects memory estimation calculations by assuming sequences have fewer unique k-mers and may compress better in memory. Default: false.
- verbose=f
- Print detailed processing information including read batches fetched and returned. Default: false.
- earlyexit=f
- Exit early during memory estimation calculations. Default: false.
- gc=f
- Perform garbage collection at the end and report memory usage after GC. Default: false.
- overhead=0
- Override the overhead parameter in memory estimation calculations. Default: 0 (auto-detect).
Java Parameters
- -Xmx
- Sets Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for potentially faster execution.
Examples
Basic Memory Testing
loadreads.sh in=reads.fastq
Tests memory usage of a FASTQ file by loading all reads and reporting memory consumption statistics.
Low-Complexity Library Testing
loadreads.sh in=amplicon_reads.fq lowcomplexity=t
Tests memory usage assuming the input is a low-complexity library (such as amplicon data) which may have different memory characteristics.
With Custom Memory Settings
loadreads.sh -Xmx8g in=large_dataset.fq.gz
Tests memory usage with 8GB maximum heap size for a large gzipped dataset.
Algorithm Details
Memory Testing Strategy
LoadReads stores all reads in an ArrayList<ArrayList<Read>> and performs I/O through a ConcurrentReadInputStream; a minimal sketch of the loading loop follows the list below.
Data Loading Process
- Sequential Read Loading: Uses ConcurrentReadInputStream.getReadInputStream() with buffered ListNum<Read> processing
- Memory Storage: Stores all reads in ArrayList<ArrayList<Read>> structure to simulate real-world usage patterns
- Continuous Monitoring: Calls calcMem() after each batch using Shared.memUsed() for precise memory tracking
- Paired-End Support: Automatically detects and properly handles paired-end sequencing data
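A minimal sketch of this loading loop, following the ConcurrentReadInputStream idiom used throughout BBTools; the exact getReadInputStream() and returnList() overloads vary between BBTools versions and are assumptions here, not the actual LoadReads source:

    import java.util.ArrayList;
    import fileIO.FileFormat;
    import fileIO.ReadWrite;
    import stream.ConcurrentReadInputStream;
    import stream.Read;
    import structures.ListNum;

    public class LoadSketch {
        public static void main(String[] args) {
            FileFormat ff = FileFormat.testInput(args[0], FileFormat.FASTQ, null, true, true);
            ConcurrentReadInputStream cris = ConcurrentReadInputStream.getReadInputStream(-1L, true, ff, null);
            cris.start();
            ArrayList<ArrayList<Read>> stored = new ArrayList<ArrayList<Read>>();
            ListNum<Read> ln = cris.nextList();                  // fetch a buffered batch
            while (ln != null && ln.list != null && !ln.list.isEmpty()) {
                stored.add(ln.list);                             // retain the batch to simulate real usage
                cris.returnList(ln.id, false);                   // release the slot so reading continues
                ln = cris.nextList();
            }
            if (ln != null) { cris.returnList(ln.id, true); }    // acknowledge the final empty list
            ReadWrite.closeStream(cris);
            System.out.println("Batches stored: " + stored.size());
        }
    }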
Memory Measurement Methodology
- Multi-Point Sampling: Records memory usage before, during, and after loading process
- Min/Max Tracking: Updates minMem and maxMem fields using Tools.min() and Tools.max() (see the sketch after this list)
- Byte-Level Accounting: Uses r1.countFastqBytes() for disk format and r1.countBytes() for memory representation
- Per-Read Metrics: Calculates average memory consumption per read, including bases, qualities, and headers
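Shared.memUsed() amounts to total heap minus free heap; a self-contained sketch of this sampling scheme (field and method names mirror the description above, not the actual source):

    public class MemTracker {
        private long minMem = Long.MAX_VALUE;
        private long maxMem = 0;

        static long memUsed() {                      // equivalent to Shared.memUsed()
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        void calcMem() {                             // called after each batch is stored
            long used = memUsed();
            minMem = Math.min(minMem, used);         // Tools.min()/Tools.max() in the real code
            maxMem = Math.max(maxMem, used);
        }

        long usedMem() { return maxMem - minMem; }   // growth attributable to the loaded reads
    }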
Estimation Algorithms
- Dual Estimation: Compares the Tools.estimateFileMemory() prediction against the actual measurement usedMem = maxMem - minMem (illustrated after this list)
- Ratio Calculations: Computes memory-to-disk ratios for capacity planning
- Overhead Analysis: Separates actual sequence data from Java object overhead
- Compression Awareness: Accounts for differences between compressed file sizes and memory usage
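The arithmetic behind these quantities reduces to a few lines; the helper below is hypothetical and only illustrates the calculation:

    // Hypothetical helper showing the ratio and overhead arithmetic.
    static void report(long minMem, long maxMem, long diskBytes, long dataBytes, long reads) {
        long usedMem = maxMem - minMem;                                  // actual measured growth
        double memRatio = usedMem / (double) diskBytes;                  // memory-to-disk ratio
        double overheadPerRead = (usedMem - dataBytes) / (double) reads; // Java overhead beyond sequence data
        System.out.printf("used=%dMB ratio=%.2f overhead/read=%.1fB%n",
                usedMem >> 20, memRatio, overheadPerRead);
    }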
Output Metrics
LoadReads outputs specific memory metrics calculated from tracked variables (a worked example follows the list):
- Absolute Memory: Initial, final, minimum, and maximum memory usage in megabytes
- Memory Estimates: Two different estimation methods for comparison and validation
- Ratios: Memory-to-disk ratios for understanding storage efficiency
- Per-Read Averages: Memory usage per read, bases per read, quality scores per read
- Overhead Calculation: Java object overhead separate from sequence data
- Processing Speed: Throughput metrics for both disk and memory processing
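For example (hypothetical numbers): a 1.4 GB FASTQ holding 4 million 150 bp reads that grows the heap by 2.8 GB while loading has a memory-to-disk ratio of 2.0 and averages 700 bytes of memory per read; after subtracting 150 bytes of bases and 150 bytes of quality scores, the remaining ~400 bytes per read are header text plus Java object overhead.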
Low-Complexity Optimization
When lowcomplexity=t is set, the tool adjusts its estimation algorithms to account for:
- Reduced K-mer Diversity: Fewer unique subsequences leading to potential memory savings
- Compression Benefits: Better compression ratios for repetitive sequences
- Specialized Data Structures: Optimizations for handling redundant sequence content
Performance Characteristics
- Memory Overhead: Approximately 2-4x disk size depending on sequence characteristics
- I/O Implementation: Uses ByteFile.FORCE_MODE_BF2 with multi-threaded reading when Shared.threads()>2 (see the snippet after this list)
- Scalability: Handles files of any size, limited only by available system memory
- Format Support: Works with FASTA, FASTQ, and their compressed variants (gzip, bzip2)
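The mode selection follows the guard idiom found in many BBTools entry points; its exact form in LoadReads is an assumption here:

    // fileIO.ByteFile and shared.Shared; prefer the multi-threaded reader
    // when spare threads are available (guard idiom common across BBTools).
    if (!ByteFile.FORCE_MODE_BF1 && !ByteFile.FORCE_MODE_BF2 && Shared.threads() > 2) {
        ByteFile.FORCE_MODE_BF2 = true;
    }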
Technical Notes
Memory Management
- The calcMem() method samples Shared.memUsed() and updates the minMem and maxMem tracking variables
- Automatically detects available system memory and sets reasonable defaults
- Optional System.gc() call when gc=t is set, followed by an additional Shared.memUsed() measurement (sketched after this list)
- Handles out-of-memory conditions gracefully with appropriate error reporting
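A sketch of the gc=t path; whether the real tool pauses between the collection request and the follow-up sample is an assumption:

    // Request a collection, then re-sample the heap (gc=t path).
    System.gc();
    try { Thread.sleep(200); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    Runtime rt = Runtime.getRuntime();
    long afterGC = rt.totalMemory() - rt.freeMemory();   // equivalent to Shared.memUsed()
    System.out.println("Memory after GC: " + (afterGC >> 20) + " MB");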
Input File Support
- Supports standard sequence formats: FASTA, FASTQ
- Automatic compression detection and decompression
- Paired-end files can be specified with a # wildcard; the two filenames are derived as in1.replace("#", "1") and in1.replace("#", "2") (see the snippet after this list)
- Quality score file support for separate quality data
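The wildcard expansion is plain string substitution; the filename below is hypothetical:

    // Deriving paired filenames from a '#' wildcard.
    String in1 = "sample#.fq.gz";
    String read1 = in1.replace("#", "1");   // sample1.fq.gz
    String read2 = in1.replace("#", "2");   // sample2.fq.gz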
Output Interpretation
- Memory Ratio: Values >1 indicate memory usage exceeds disk size
- Disk Ratio: Accounts for format overhead (FASTQ headers, quality scores)
- Overhead: Java object overhead beyond sequence data itself
- Estimates vs Actual: Compare estimation accuracy with real measurements
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org