Repair

When to Use This Tool

Best Practice: When possible, go back to the original raw reads and process them correctly with pair-aware tools like BBDuk instead of using Repair.

Repair is designed as a recovery tool for specific situations:

Primary use case: Files corrupted by old, non-pair-aware software like Fastx Toolkit that broke pairing order
Recovery scenario: When you don't have access to the original reads and must fix corrupted pairing
Broken interleaving: Files where reads were removed from properly interleaved data
Arbitrary disorder: Files where paired reads became completely shuffled

Important: This tool requires read names in one of two forms: Illumina-style names (an identical prefix followed by 1: and 2:, or by /1 and /2), or completely identical names for both reads in a pair (as in SAM format).

Memory Requirements

Repair has two shell scripts with different default memory allocations:

repair.sh: Requests all available memory by default. Use for arbitrarily disordered files that may require storing all reads in memory.
bbsplitpairs.sh: Requests a small amount of memory by default. Use for fixing broken interleaving where only minimal memory is needed.

Memory usage by mode:

Repair mode (repair=t): High memory usage - potentially all reads stored in hash map
Fix interleaving mode (fint=t): Low memory usage - sequential processing with single read buffer

Usage

repair.sh in=<input file> out=<pair output> outs=<singleton output>

Input may be fasta, fastq, or sam, compressed or uncompressed.

Parameters

Repair provides parameters for input/output specification, processing modes, and file handling options.

Input/Output Parameters

in=<file>: The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
in2=<file>: Use this if the second reads of pairs are in a different file.
out=<file>: The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
out2=<file>: Use this to write the second reads of pairs to a different file.
outs=<file>: (outsingle) Write singleton reads here.

Processing Parameters

repair=t: (rp) Fixes arbitrarily corrupted paired reads by using read names. Uses much more memory than 'fint' mode.
fint=f: (fixinterleaving) Fixes corrupted interleaved files using read names. Only use on files with broken interleaving, that is, correctly interleaved files from which some reads were removed.
ain=f: (allowidenticalnames) When detecting pair names, allows identical names instead of requiring /1 and /2 or 1: and 2:.

File Handling Parameters

overwrite=t: (ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t: (ss) Set to 'f' to suppress display of processing speed.
ziplevel=2: (zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Examples

Repairing arbitrarily disordered files

repair.sh in=broken.fq out=fixed.fq outs=singletons.fq repair

Use when reads have been completely shuffled by non-pair-aware processing. High memory usage.

Repairing disordered dual files

repair.sh in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair

Process separate R1 and R2 files that lost proper pairing relationships.

Fixing broken interleaving (low memory)

bbsplitpairs.sh in=broken.fq out=fixed.fq outs=singletons.fq fint

Use bbsplitpairs.sh with fint mode for interleaved files where some reads were removed. Uses minimal memory.

Allowing identical read names

repair.sh in=reads.fq out=paired.fq outs=singles.fq repair ain=t

For reads with identical names (no /1, /2 or 1:, 2: suffixes), such as SAM format reads.

Algorithm Details

Repair Mode Strategy

For arbitrarily disordered files, Repair uses a LinkedHashMap-based approach:

Maintains a LinkedHashMap<String, Read> to track unpaired reads by name prefix
Parses read names using String.split("\\s+") and indexOf('/') to extract prefixes
For each read, checks if its mate already exists in the map
If mate found: creates paired read and removes entry from map
If no mate found: stores read in map for future matching
After processing: remaining unpaired reads become singletons

Memory usage scales with the number of unpaired reads, potentially requiring all reads in memory for highly disordered files.
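
The steps above can be sketched in a few lines of Python. This is an illustration of the map-based idea only, not the tool's Java code: the names name_key and repair_pairs are hypothetical, reads are modeled as (name, sequence) tuples, and a plain dict (which preserves insertion order) stands in for the LinkedHashMap.

```python
def name_key(name):
    """Return (prefix, pair_number) for 'x/1', 'x/2', 'x 1:...' or 'x 2:...'."""
    head, _, rest = name.partition(" ")
    if head[-2:] in ("/1", "/2"):
        return head[:-2], int(head[-1])
    if rest[:2] in ("1:", "2:"):
        return head, int(rest[0])
    return name, None  # no recognizable pair suffix


def repair_pairs(reads, allow_identical=False):
    """Re-pair (name, sequence) tuples that arrive in arbitrary order."""
    pending = {}  # prefix -> (pair_number, read), kept in insertion order
    pairs, singletons = [], []
    for read in reads:
        prefix, number = name_key(read[0])
        if number is None and not allow_identical:
            singletons.append(read)  # cannot be matched by name
            continue
        waiting = pending.get(prefix)
        if waiting is None:
            pending[prefix] = (number, read)  # store until the mate shows up
        elif {waiting[0], number} == {1, 2}:
            del pending[prefix]
            mate = waiting[1]
            pairs.append((read, mate) if number == 1 else (mate, read))
        elif waiting[0] is None and number is None:
            del pending[prefix]
            pairs.append((waiting[1], read))  # identical names: first seen is read 1
        else:
            singletons.append(read)  # duplicate of a read that is already waiting
    # Whatever is still waiting at the end never found a mate.
    singletons.extend(read for _, read in pending.values())
    return pairs, singletons


shuffled = [("a/1", "AAAA"), ("b/2", "CCCC"), ("c/1", "GGGG"),
            ("b/1", "TTTT"), ("a/2", "ACGT")]
pairs, singletons = repair_pairs(shuffled)
```

In this sketch the dict holds one entry per read whose mate has not yet been seen, which is why a fully shuffled file can end up with most of its reads in memory at once.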

Fix Interleaving Strategy

For broken interleaved files, Repair uses a sequential approach:

Maintains a single "previous read" buffer instead of a hash map
Tests consecutive reads using FASTQ.testPairNames() for valid pairing
Processes reads in streaming fashion with minimal memory footprint
Suitable when most reads are still in correct order but some are missing
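
A minimal Python sketch of the sequential approach follows. It is an assumption-laden illustration rather than the tool's implementation: names_pair is a hypothetical stand-in for the pair-name test, fix_interleaving is a hypothetical name, and reads are modeled as (name, sequence) tuples.

```python
def names_pair(first, second):
    """True if the two names look like read 1 followed by read 2 of one pair."""
    def key(name):
        head, _, rest = name.partition(" ")
        if head[-2:] in ("/1", "/2"):
            return head[:-2], int(head[-1])
        if rest[:2] in ("1:", "2:"):
            return head, int(rest[0])
        return name, None

    (prefix1, number1), (prefix2, number2) = key(first), key(second)
    return prefix1 == prefix2 and number1 == 1 and number2 == 2


def fix_interleaving(reads):
    """Stream through reads, holding only the previous unmatched read."""
    pairs, singletons = [], []
    previous = None
    for read in reads:
        if previous is None:
            previous = read
        elif names_pair(previous[0], read[0]):
            pairs.append((previous, read))
            previous = None
        else:
            singletons.append(previous)  # its mate was removed upstream
            previous = read
    if previous is not None:
        singletons.append(previous)
    return pairs, singletons


broken = [("a/1", "AAAA"), ("a/2", "CCCC"), ("b/2", "GGGG"),
          ("c/1", "TTTT"), ("c/2", "ACGT"), ("d/1", "TGCA")]
pairs, singletons = fix_interleaving(broken)
```

Because only one read is buffered at a time, mates that are not adjacent in the input are never matched here, which is why this strategy suits files that were once correctly interleaved.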

Read Name Parsing Formats

Repair recognizes multiple read name conventions:

Illumina slash format: prefix/1 and prefix/2 (e.g., "read123/1")
Illumina colon format: prefix 1: and prefix 2: (e.g., "read123 1:")
SAM format: Uses pair number from SAM line when available
Identical names: When ain=t, allows completely identical read names
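
To make the slash, colon, and identical-name conventions listed above concrete, here is a small Python sketch of a name parser and pair test. The regular expressions and the function names split_pair_name and names_pair are assumptions made for illustration; the tool's own parsing may handle more cases, and SAM pair numbers taken from the SAM line are not modeled here.

```python
import re

SLASH_NAME = re.compile(r"^(\S+)/([12])$")    # e.g. "read123/1"
COLON_NAME = re.compile(r"^(\S+)\s+([12]):")  # e.g. "read123 1:N:0:ACGT"


def split_pair_name(name):
    """Return (prefix, pair_number); pair_number is 1, 2, or None."""
    tokens = name.split(None, 1)
    if tokens:
        match = SLASH_NAME.match(tokens[0])
        if match:
            return match.group(1), int(match.group(2))
        match = COLON_NAME.match(name.strip())
        if match:
            return match.group(1), int(match.group(2))
    return name, None


def names_pair(first, second, allow_identical=False):
    """True if first and second look like read 1 and read 2 of the same pair."""
    prefix1, number1 = split_pair_name(first)
    prefix2, number2 = split_pair_name(second)
    if number1 == 1 and number2 == 2:
        return prefix1 == prefix2
    if allow_identical and number1 is None and number2 is None:
        return first == second  # mirrors the idea behind ain=t
    return False
```

For example, names_pair("read123/1", "read123/2") is true in this sketch, while two reads both named "read123" only pair when allow_identical is set.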

Memory Management

The tool employs different memory strategies:

Default allocation: repair.sh uses calcXmx() for automatic memory sizing (4GB base with 84% RAM limit)
ByteFile optimization: Automatically selects BF2 mode when threads > 2 for improved I/O
Streaming buffers: Uses Shared.bufferLen() for output buffering
Hash map sizing: LinkedHashMap grows dynamically to accommodate unpaired reads

Quality Trimming Integration

When quality parameters are specified:

Applies TrimRead.trimFast() with specified trimq threshold
Supports qtrimLeft and qtrimRight directional trimming
Re-evaluates read lengths after trimming for minimum length filtering
Tracks trimming statistics (basesTrimmed, readsTrimmed)
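
The trim-then-filter flow described above can be illustrated with a simple threshold trimmer in Python. This is a deliberately simplified stand-in and should not be read as the algorithm inside TrimRead.trimFast(): it just removes bases from either end while their quality is below trimq. The function names and the (sequence, qualities) read model are hypothetical.

```python
def trim_read(seq, quals, trimq, trim_left=False, trim_right=True):
    """Drop end bases whose quality is below trimq; return (seq, quals)."""
    start, end = 0, len(seq)
    if trim_right:
        while end > start and quals[end - 1] < trimq:
            end -= 1
    if trim_left:
        while start < end and quals[start] < trimq:
            start += 1
    return seq[start:end], quals[start:end]


def trim_and_filter(reads, trimq, min_length, trim_left=False, trim_right=True):
    """Trim each read, then keep only reads that are still long enough."""
    stats = {"bases_trimmed": 0, "reads_trimmed": 0}
    kept = []
    for seq, quals in reads:
        new_seq, new_quals = trim_read(seq, quals, trimq, trim_left, trim_right)
        removed = len(seq) - len(new_seq)
        if removed:
            stats["bases_trimmed"] += removed
            stats["reads_trimmed"] += 1
        # Length is re-checked after trimming, not before.
        if len(new_seq) >= min_length:
            kept.append((new_seq, new_quals))
    return kept, stats


reads = [("ACGTAC", [30, 30, 30, 30, 5, 2]), ("ACGT", [2, 2, 30, 2])]
kept, stats = trim_and_filter(reads, trimq=10, min_length=3)
```

Checking the length after trimming matters for pairing: a read that becomes too short is dropped, and its mate then has to be treated as a singleton.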

Technical Implementation

File Format Support

Automatic format detection and processing:

FASTQ: Standard and compressed (.gz), interleaved or separate files
FASTA: Standard and compressed formats
SAM: With automatic pair number extraction from SAM lines
Compression: Uses ReadWrite.USE_PIGZ for parallel compression/decompression

Concurrent Processing

The implementation uses concurrent I/O streams:

ConcurrentReadInputStream with DualCris support for dual-file input
ConcurrentReadOutputStream for paired and singleton output streams
ReadWrite.setZipThreads() for parallel compression operations
Configurable buffer sizes for memory-efficient processing

Error Handling and Validation

Mutual exclusion: Assert statements prevent simultaneous repair and fint modes
File validation: Tools.canWrite() prevents accidental overwrites
Read validation: Null checks and assertions for reads without names
Statistical tracking: Comprehensive counters for success/failure reporting

Performance Characteristics

Fix interleaving: Linear time complexity, minimal memory usage
Repair mode: Time proportional to total reads, memory scales with disorder level
I/O optimization: Concurrent streams with automatic compression threading
Progress reporting: Real-time statistics using Timer and formatted output

Performance and Usage Guidelines

Choosing the Right Mode

Use fint mode when:

Working with interleaved files where some reads were removed
Most read order is preserved but some pairs are broken
Memory is limited
You want the low default memory allocation of bbsplitpairs.sh

Use repair mode when:

Reads have been arbitrarily shuffled or disordered
Processing separate R1/R2 files that lost synchronization
Sufficient memory is available
You want the high default memory allocation of repair.sh

Memory Planning

Estimate memory requirements based on input characteristics:

Highly disordered data: May require memory for all reads
Partially disordered: Memory scales with number of unpaired reads
Rule of thumb: Start with the repair.sh default allocation; if it is insufficient, set a higher limit explicitly with -Xmx
Large datasets: Consider splitting input files if memory constraints exist
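
One way to see why memory depends on the degree of disorder is to count how many reads would be waiting for a mate at the worst moment. The Python sketch below does that for a list of read names; peak_pending is a hypothetical helper, it assumes slash-style names (prefix/1, prefix/2), and it counts reads rather than bytes, so it indicates scaling only.

```python
def peak_pending(names):
    """Return the largest number of reads waiting for a mate at any one time."""
    pending = set()
    peak = 0
    for name in names:
        prefix = name.rsplit("/", 1)[0]  # assumes "prefix/1" and "prefix/2"
        if prefix in pending:
            pending.remove(prefix)       # mate arrived, entry is released
        else:
            pending.add(prefix)          # first of the pair, must be stored
            peak = max(peak, len(pending))
    return peak


interleaved = ["a/1", "a/2", "b/1", "b/2", "c/1", "c/2"]
all_r1_then_r2 = ["a/1", "b/1", "c/1", "a/2", "b/2", "c/2"]
```

With these inputs, the interleaved order never holds more than one read, while the order with all first reads ahead of all second reads holds half of the reads at its peak.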

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Detailed guide: bbtools/docs/guides/RepairGuide.txt