LoadReads
Tests the memory usage of a sequence file by loading all of its reads into memory and measuring how much RAM they actually consume.
Basic Usage
loadreads.sh in=<file>
Loads all reads from a sequence file into memory and reports detailed memory usage statistics including initial/final memory consumption, memory ratios, and per-read memory overhead.
Parameters
Input Parameters
- in=file
- Input file. Can be any sequence format supported by BBTools (FASTA, FASTQ, gzipped files). Required parameter.
- lowcomplexity=f
- Assume input library is low-complexity. This affects memory estimation calculations by assuming sequences have fewer unique k-mers and may compress better in memory. Default: false.
- verbose=f
- Print detailed processing information including read batches fetched and returned. Default: false.
- earlyexit=f
- Exit early during memory estimation calculations. Default: false.
- gc=f
- Perform garbage collection at the end and report memory usage after GC. Default: false.
- overhead=0
- Override the overhead parameter in memory estimation calculations. Default: 0 (auto-detect).
Java Parameters
- -Xmx
- Sets Java's maximum memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions for potentially faster execution.
Examples
Basic Memory Testing
loadreads.sh in=reads.fastq
Tests memory usage of a FASTQ file by loading all reads and reporting memory consumption statistics.
Low-Complexity Library Testing
loadreads.sh in=amplicon_reads.fq lowcomplexity=t
Tests memory usage assuming the input is a low-complexity library (such as amplicon data) which may have different memory characteristics.
With Custom Memory Settings
loadreads.sh -Xmx8g in=large_dataset.fq.gz
Tests memory usage with 8GB maximum heap size for a large gzipped dataset.
Algorithm Details
Memory Testing Strategy
LoadReads stores all reads in an ArrayList<ArrayList<Read>> and performs I/O through a ConcurrentReadInputStream; a minimal sketch of the loading loop follows the list below.
Data Loading Process
- Sequential Read Loading: Uses ConcurrentReadInputStream.getReadInputStream() with buffered ListNum<Read> processing
- Memory Storage: Stores all reads in ArrayList<ArrayList<Read>> structure to simulate real-world usage patterns
- Continuous Monitoring: Calls calcMem() after each batch using Shared.memUsed() for precise memory tracking
- Paired-End Support: Automatically detects and properly handles paired-end sequencing data
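A minimal sketch of this loading loop, following the ConcurrentReadInputStream idiom used throughout BBTools; the exact getReadInputStream() and returnList() overloads vary between BBTools versions and are assumptions here, not the actual LoadReads source:

    import java.util.ArrayList;
    import fileIO.FileFormat;
    import fileIO.ReadWrite;
    import stream.ConcurrentReadInputStream;
    import stream.Read;
    import structures.ListNum;

    public class LoadSketch {
        public static void main(String[] args) {
            FileFormat ff = FileFormat.testInput(args[0], FileFormat.FASTQ, null, true, true);
            ConcurrentReadInputStream cris = ConcurrentReadInputStream.getReadInputStream(-1L, true, ff, null);
            cris.start();
            ArrayList<ArrayList<Read>> stored = new ArrayList<ArrayList<Read>>();
            ListNum<Read> ln = cris.nextList();                  // fetch a buffered batch
            while (ln != null && ln.list != null && !ln.list.isEmpty()) {
                stored.add(ln.list);                             // retain the batch to simulate real usage
                cris.returnList(ln.id, false);                   // release the slot so reading continues
                ln = cris.nextList();
            }
            if (ln != null) { cris.returnList(ln.id, true); }    // acknowledge the final empty list
            ReadWrite.closeStream(cris);
            System.out.println("Batches stored: " + stored.size());
        }
    }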
Memory Measurement Methodology
- Multi-Point Sampling: Records memory usage before, during, and after loading process
- Min/Max Tracking: Updates minMem and maxMem fields using Tools.min() and Tools.max() (see the sketch after this list)
- Byte-Level Accounting: Uses r1.countFastqBytes() for disk format and r1.countBytes() for memory representation
- Per-Read Metrics: Calculates average memory consumption per read, including bases, qualities, and headers
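Shared.memUsed() amounts to total heap minus free heap; a self-contained sketch of this sampling scheme (field and method names mirror the description above, not the actual source):

    public class MemTracker {
        private long minMem = Long.MAX_VALUE;
        private long maxMem = 0;

        static long memUsed() {                      // equivalent to Shared.memUsed()
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        void calcMem() {                             // called after each batch is stored
            long used = memUsed();
            minMem = Math.min(minMem, used);         // Tools.min()/Tools.max() in the real code
            maxMem = Math.max(maxMem, used);
        }

        long usedMem() { return maxMem - minMem; }   // growth attributable to the loaded reads
    }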
Estimation Algorithms
- Dual Estimation: Compares the Tools.estimateFileMemory() prediction against the actual measurement usedMem = maxMem - minMem (illustrated after this list)
- Ratio Calculations: Computes memory-to-disk ratios for capacity planning
- Overhead Analysis: Separates actual sequence data from Java object overhead
- Compression Awareness: Accounts for differences between compressed file sizes and memory usage
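The arithmetic behind these quantities reduces to a few lines; the helper below is hypothetical and only illustrates the calculation:

    // Hypothetical helper showing the ratio and overhead arithmetic.
    static void report(long minMem, long maxMem, long diskBytes, long dataBytes, long reads) {
        long usedMem = maxMem - minMem;                                  // actual measured growth
        double memRatio = usedMem / (double) diskBytes;                  // memory-to-disk ratio
        double overheadPerRead = (usedMem - dataBytes) / (double) reads; // Java overhead beyond sequence data
        System.out.printf("used=%dMB ratio=%.2f overhead/read=%.1fB%n",
                usedMem >> 20, memRatio, overheadPerRead);
    }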
Output Metrics
LoadReads outputs specific memory metrics calculated from tracked variables (a worked example follows the list):
- Absolute Memory: Initial, final, minimum, and maximum memory usage in megabytes
- Memory Estimates: Two different estimation methods for comparison and validation
- Ratios: Memory-to-disk ratios for understanding storage efficiency
- Per-Read Averages: Memory usage per read, bases per read, quality scores per read
- Overhead Calculation: Java object overhead separate from sequence data
- Processing Speed: Throughput metrics for both disk and memory processing
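For example (hypothetical numbers): a 1.4 GB FASTQ holding 4 million 150 bp reads that grows the heap by 2.8 GB while loading has a memory-to-disk ratio of 2.0 and averages 700 bytes of memory per read; after subtracting 150 bytes of bases and 150 bytes of quality scores, the remaining ~400 bytes per read are header text plus Java object overhead.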
Low-Complexity Optimization
When lowcomplexity=t is set, the tool adjusts its estimation algorithms to account for:
- Reduced K-mer Diversity: Fewer unique subsequences leading to potential memory savings
- Compression Benefits: Better compression ratios for repetitive sequences
- Specialized Data Structures: Optimizations for handling redundant sequence content
Performance Characteristics
- Memory Overhead: Approximately 2-4x disk size depending on sequence characteristics
- I/O Implementation: Uses ByteFile.FORCE_MODE_BF2 with multi-threaded reading when Shared.threads()>2 (see the snippet after this list)
- Scalability: Handles files of any size, limited only by available system memory
- Format Support: Works with FASTA, FASTQ, and their compressed variants (gzip, bzip2)
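The mode selection follows the guard idiom found in many BBTools entry points; its exact form in LoadReads is an assumption here:

    // fileIO.ByteFile and shared.Shared; prefer the multi-threaded reader
    // when spare threads are available (guard idiom common across BBTools).
    if (!ByteFile.FORCE_MODE_BF1 && !ByteFile.FORCE_MODE_BF2 && Shared.threads() > 2) {
        ByteFile.FORCE_MODE_BF2 = true;
    }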
Technical Notes
Memory Management
- The calcMem() method samples Shared.memUsed() and updates the minMem and maxMem tracking variables
- Automatically detects available system memory and sets reasonable defaults
- Optional System.gc() call when gc=t is set, followed by an additional Shared.memUsed() measurement (sketched after this list)
- Handles out-of-memory conditions gracefully with appropriate error reporting
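A sketch of the gc=t path; whether the real tool pauses between the collection request and the follow-up sample is an assumption:

    // Request a collection, then re-sample the heap (gc=t path).
    System.gc();
    try { Thread.sleep(200); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    Runtime rt = Runtime.getRuntime();
    long afterGC = rt.totalMemory() - rt.freeMemory();   // equivalent to Shared.memUsed()
    System.out.println("Memory after GC: " + (afterGC >> 20) + " MB");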
Input File Support
- Supports standard sequence formats: FASTA, FASTQ
- Automatic compression detection and decompression
- Paired-end files can be specified with a # wildcard; the two filenames are derived as in1.replace("#", "1") and in1.replace("#", "2") (see the snippet after this list)
- Quality score file support for separate quality data
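The wildcard expansion is plain string substitution; the filename below is hypothetical:

    // Deriving paired filenames from a '#' wildcard.
    String in1 = "sample#.fq.gz";
    String read1 = in1.replace("#", "1");   // sample1.fq.gz
    String read2 = in1.replace("#", "2");   // sample2.fq.gz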
Output Interpretation
- Memory Ratio: Values >1 indicate memory usage exceeds disk size
- Disk Ratio: Accounts for format overhead (FASTQ headers, quality scores)
- Overhead: Java object overhead beyond sequence data itself
- Estimates vs Actual: Compare estimation accuracy with real measurements
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org