Repair
Re-pairs reads that became disordered or had some mates eliminated. Designed to fix files of paired reads that became corrupted by non-pair-aware software processing.
When to Use This Tool
Best Practice: When possible, go back to the original raw reads and process them correctly with pair-aware tools like BBDuk instead of using Repair.
Repair is designed as a recovery tool for specific situations:
- Primary use case: Files corrupted by old, non-pair-aware software like Fastx Toolkit that broke pairing order
- Recovery scenario: When you don't have access to the original reads and must fix corrupted pairing
- Broken interleaving: Files where reads were removed from properly interleaved data
- Arbitrary disorder: Files where paired reads became completely shuffled
Important: This tool requires read names in one of two forms: Illumina-style names, where both mates share an identical prefix followed by /1 and /2, or by whitespace and 1: and 2:; or completely identical names for both reads in a pair (as in SAM format; see the ain parameter).
Memory Requirements
Repair has two shell scripts with different default memory allocations:
- repair.sh
- Requests all available memory by default. Use for arbitrarily disordered files that may require storing all reads in memory.
- bbsplitpairs.sh
- Requests a small amount of memory by default. Use for fixing broken interleaving where only minimal memory is needed.
Memory usage by mode:
- Repair mode (repair=t): High memory usage - potentially all reads stored in hash map
- Fix interleaving mode (fint=t): Low memory usage - sequential processing with single read buffer
Usage
repair.sh in=<input file> out=<pair output> outs=<singleton output>
Input may be fasta, fastq, or sam, compressed or uncompressed.
Parameters
Repair provides parameters for input/output specification, processing modes, and file handling options.
Input/Output Parameters
- in=<file>
- The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
- in2=<file>
- Use this if the second reads of the pairs are in a separate file.
- out=<file>
- The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
- out2=<file>
- Use this to write the second reads of the pairs to a separate file.
- outs=<file>
- (outsingle) Write singleton reads here.
Processing Parameters
- repair=t
- (rp) Fixes arbitrarily corrupted paired reads by using read names. Uses much more memory than 'fint' mode.
- fint=f
- (fixinterleaving) Fixes corrupted interleaved files using read names. Use only on files with broken interleaving, that is, files that were correctly interleaved before some reads were removed.
- ain=f
- (allowidenticalnames) When detecting pair names, allows both reads of a pair to have identical names instead of requiring /1 and /2 or 1: and 2: markers.
File Handling Parameters
- overwrite=t
- (ow) Set to false to force the program to abort rather than overwrite an existing file.
- showspeed=t
- (ss) Set to 'f' to suppress display of processing speed.
- ziplevel=2
- (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions.
Examples
Repairing arbitrarily disordered files
repair.sh in=broken.fq out=fixed.fq outs=singletons.fq repair
Use when reads have been completely shuffled by non-pair-aware processing. High memory usage.
Repairing disordered dual files
repair.sh in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair
Process separate R1 and R2 files that lost proper pairing relationships.
Fixing broken interleaving (low memory)
bbsplitpairs.sh in=broken.fq out=fixed.fq outs=singletons.fq fint
Use bbsplitpairs.sh with fint mode for interleaved files where some reads were removed. Uses minimal memory.
Allowing identical read names
repair.sh in=reads.fq out=paired.fq outs=singles.fq repair ain=t
For reads with identical names (no /1, /2 or 1:, 2: suffixes), such as SAM format reads.
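When these commands are run from a pipeline script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of BBTools: build_repair_command and its argument names are hypothetical, and it uses only the scripts and flags shown in the examples above (in, in2, out, out2, outs, repair, fint, ain, -Xmx).

```python
def build_repair_command(in1, out1, outs, in2=None, out2=None,
                         mode="repair", ain=False, xmx=None):
    """Assemble an argument list for repair.sh or bbsplitpairs.sh.

    Hypothetical helper. mode="repair" selects repair.sh and mode="fint"
    selects bbsplitpairs.sh, following the pairing used in this document.
    """
    if mode not in ("repair", "fint"):
        raise ValueError("mode must be 'repair' or 'fint'")
    script = "repair.sh" if mode == "repair" else "bbsplitpairs.sh"
    cmd = [script]
    if xmx:
        cmd.append(f"-Xmx{xmx}")      # e.g. "20g" overrides memory autodetection
    cmd.append(f"in={in1}")
    if in2:
        cmd.append(f"in2={in2}")      # second reads in a separate file
    cmd.append(f"out={out1}")
    if out2:
        cmd.append(f"out2={out2}")
    cmd.append(f"outs={outs}")        # singleton output
    cmd.append(mode)                  # bare flag, as in the examples above
    if ain:
        cmd.append("ain=t")           # allow identical read names
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where the BBTools scripts are on the PATH.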
Algorithm Details
Repair Mode Strategy
For arbitrarily disordered files, Repair uses a LinkedHashMap-based approach:
- Maintains a LinkedHashMap<String, Read> to track unpaired reads by name prefix
- Parses read names using String.split("\\s+") and indexOf('/') to extract prefixes
- For each read, checks if its mate already exists in the map
- If mate found: creates paired read and removes entry from map
- If no mate found: stores read in map for future matching
- After processing: remaining unpaired reads become singletons
Memory usage scales with the number of unpaired reads, potentially requiring all reads in memory for highly disordered files.
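The matching loop described above can be sketched in a few lines of Python. This is an illustration of the technique under simplifying assumptions, not the Java implementation: reads are plain (name, sequence) tuples, each name occurs at most once, and split_name and repair are hypothetical names. A Python dict stands in for the LinkedHashMap, since both preserve insertion order.

```python
def split_name(name):
    """Return (prefix, pair_number); pair_number is 0 if no marker is found."""
    tokens = name.split() or [""]
    first = tokens[0]
    if first.endswith(("/1", "/2")):
        return first[:-2], int(first[-1])
    if len(tokens) > 1 and tokens[1][:2] in ("1:", "2:"):
        return first, int(tokens[1][0])
    return first, 0


def repair(reads):
    """Re-pair (name, sequence) tuples given in arbitrary order.

    Returns (pairs, singletons). Assumes each name occurs at most once.
    """
    pending = {}  # prefix -> (pair_number, read); dicts keep insertion order
    pairs = []
    for read in reads:
        prefix, number = split_name(read[0])
        waiting = pending.get(prefix)
        if waiting is not None and waiting[0] != number:
            # Mate found: emit the pair as (read 1, read 2) and free the slot.
            del pending[prefix]
            mate = waiting[1]
            pairs.append((mate, read) if waiting[0] < number else (read, mate))
        else:
            # No mate yet: hold this read until its mate shows up.
            pending[prefix] = (number, read)
    # Whatever is still waiting at the end has no mate in the input.
    singletons = [read for _, read in pending.values()]
    return pairs, singletons
```

In this sketch the pending dict is what drives memory use: it holds every read whose mate has not yet been seen, which in the worst case is half of a fully shuffled file before the first match occurs.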
Fix Interleaving Strategy
For broken interleaved files, uses a sequential approach:
- Maintains single "previous read" buffer instead of hash map
- Tests consecutive reads using FASTQ.testPairNames() for valid pairing
- Processes reads in streaming fashion with minimal memory footprint
- Suitable when most reads are still in correct order but some are missing
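The sequential approach can be sketched as a generator that holds only one read at a time. This is an illustrative Python sketch, not the Java code: names_pair is a hypothetical stand-in for FASTQ.testPairNames(), and it assumes a valid pair is a read marked 1 immediately followed by a read marked 2 with the same prefix.

```python
def split_name(name):
    """Return (prefix, pair_number); pair_number is 0 if no marker is found."""
    tokens = name.split() or [""]
    first = tokens[0]
    if first.endswith(("/1", "/2")):
        return first[:-2], int(first[-1])
    if len(tokens) > 1 and tokens[1][:2] in ("1:", "2:"):
        return first, int(tokens[1][0])
    return first, 0


def names_pair(name1, name2):
    """Assumed pairing rule: same prefix, first read is 1, second read is 2."""
    prefix1, number1 = split_name(name1)
    prefix2, number2 = split_name(name2)
    return prefix1 == prefix2 and number1 == 1 and number2 == 2


def fix_interleaving(reads):
    """Yield (read1, read2) for pairs and (read, None) for singletons."""
    previous = None  # the single "previous read" buffer
    for read in reads:
        if previous is None:
            previous = read
        elif names_pair(previous[0], read[0]):
            yield previous, read
            previous = None
        else:
            # The buffered read lost its mate; the current read takes its place.
            yield previous, None
            previous = read
    if previous is not None:
        yield previous, None
```

Because the generator never looks further back than one read, it cannot recover pairs whose mates are separated by other reads; that case needs repair mode.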
Read Name Parsing Formats
Repair recognizes multiple read name conventions:
- Illumina slash format: prefix/1 and prefix/2 (e.g., "read123/1")
- Illumina colon format: prefix 1: and prefix 2: (e.g., "read123 1:")
- SAM format: Uses pair number from SAM line when available
- Identical names: When ain=t, allows completely identical read names
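To make the conventions concrete, here is a small Python sketch of how such names could be recognized. The function names are hypothetical and the rules are simplified. For SAM input, the sketch assumes the pair number is read from the standard FLAG bits (0x40 for the first segment, 0x80 for the last); the source only states that the pair number comes from the SAM line.

```python
def parse_read_name(name):
    """Return (prefix, pair_number); pair_number is None if no marker is found."""
    tokens = name.split() or [""]
    first = tokens[0]
    if first.endswith(("/1", "/2")):                 # slash format: read123/1
        return first[:-2], int(first[-1])
    if len(tokens) > 1 and tokens[1][:2] in ("1:", "2:"):
        return first, int(tokens[1][0])              # colon format: read123 1:...
    return first, None


def is_pair(name1, name2, allow_identical_names=False):
    """True if name1 and name2 look like read 1 and read 2 of the same pair."""
    prefix1, number1 = parse_read_name(name1)
    prefix2, number2 = parse_read_name(name2)
    if number1 is not None and number2 is not None:
        return prefix1 == prefix2 and number1 == 1 and number2 == 2
    # Without markers, accept only identical names, and only on request (ain=t).
    return allow_identical_names and name1 == name2


def pair_number_from_sam_flag(flag):
    """Map SAM FLAG bits to a pair number: 0x40 -> 1, 0x80 -> 2, else None."""
    if flag & 0x40:
        return 1
    if flag & 0x80:
        return 2
    return None
```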
Memory Management
The tool employs different memory strategies:
- Default allocation: repair.sh uses calcXmx() for automatic memory sizing (4GB base with 84% RAM limit)
- ByteFile optimization: Automatically selects BF2 mode when threads > 2 for improved I/O
- Streaming buffers: Uses Shared.bufferLen() for output buffering
- Hash map sizing: LinkedHashMap grows dynamically to accommodate unpaired reads
Quality Trimming Integration
When quality parameters are specified:
- Applies TrimRead.trimFast() with specified trimq threshold
- Supports qtrimLeft and qtrimRight directional trimming
- Re-evaluates read lengths after trimming for minimum length filtering
- Tracks trimming statistics (basesTrimmed, readsTrimmed)
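The general idea of end trimming followed by a length check can be sketched as follows. This is a simplified illustration and is not a port of TrimRead.trimFast(), whose exact rules are not described here: it assumes trimming removes bases from an end until the first base with quality at or above trimq, and the names quality_trim, passes_min_length, and min_length are hypothetical.

```python
def quality_trim(seq, quals, trimq, trim_left=True, trim_right=True):
    """Trim bases with quality below trimq from the selected ends.

    Returns (trimmed_seq, trimmed_quals, bases_trimmed).
    """
    start, end = 0, len(seq)
    if trim_right:
        while end > start and quals[end - 1] < trimq:
            end -= 1
    if trim_left:
        while start < end and quals[start] < trimq:
            start += 1
    return seq[start:end], quals[start:end], len(seq) - (end - start)


def passes_min_length(seq, min_length):
    """Length is checked after trimming, so a heavily trimmed read can fail."""
    return len(seq) >= min_length
```

Summing bases_trimmed over all reads, and counting the reads where it is nonzero, gives statistics comparable to basesTrimmed and readsTrimmed.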
Technical Implementation
File Format Support
Automatic format detection and processing:
- FASTQ: Standard and compressed (.gz), interleaved or separate files
- FASTA: Standard and compressed formats
- SAM: With automatic pair number extraction from SAM lines
- Compression: Uses ReadWrite.USE_PIGZ for parallel compression/decompression
Concurrent Processing
The implementation uses concurrent I/O streams:
- ConcurrentReadInputStream with DualCris support for dual-file input
- ConcurrentReadOutputStream for paired and singleton output streams
- ReadWrite.setZipThreads() for parallel compression operations
- Configurable buffer sizes for memory-efficient processing
Error Handling and Validation
- Mutual exclusion: Assert statements prevent simultaneous repair and fint modes
- File validation: Tools.canWrite() prevents accidental overwrites
- Read validation: Null checks and assertions for reads without names
- Statistical tracking: Comprehensive counters for success/failure reporting
Performance Characteristics
- Fix interleaving: Linear time complexity, minimal memory usage
- Repair mode: Time proportional to total reads, memory scales with disorder level
- I/O optimization: Concurrent streams with automatic compression threading
- Progress reporting: Real-time statistics using Timer and formatted output
Performance and Usage Guidelines
Choosing the Right Mode
Use fint mode when:
- Working with interleaved files where some reads were removed
- Most read order is preserved but some pairs are broken
- Memory is limited
- You want the low default memory allocation that bbsplitpairs.sh provides
Use repair mode when:
- Reads have been arbitrarily shuffled or disordered
- Processing separate R1/R2 files that lost synchronization
- Sufficient memory is available
- You want the high default memory allocation that repair.sh provides
Memory Planning
Estimate memory requirements based on input characteristics:
- Highly disordered data: May require memory for all reads
- Partially disordered: Memory scales with number of unpaired reads
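- Rule of thumb: Start with the repair.sh default allocation, which already requests all available memory; if the run still fails with an out-of-memory error, raise -Xmx only if the machine has more physical RAM to give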
- Large datasets: Consider splitting input files if memory constraints exist
See Also
- BBDuk: Recommended for proper paired-read processing from raw data
- bbsplitpairs.sh: Alternative script with low default memory allocation
- RepairGuide.txt: Comprehensive guide in bbtools/docs/guides/
- Reformat: For format conversion and basic read manipulation
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org
- Detailed guide: bbtools/docs/guides/RepairGuide.txt