Repair

Script: repair.sh
Package: jgi
Class: SplitPairsAndSingles.java

Re-pairs reads that became disordered or had some mates eliminated. Designed to fix files of paired reads that became corrupted by non-pair-aware software processing.

When to Use This Tool

Best Practice: When possible, go back to the original raw reads and process them correctly with pair-aware tools like BBDuk instead of using Repair.

Repair is designed as a recovery tool for specific situations:

Important: This tool requires read names in one of two forms: Illumina-style names, where both mates share an identical prefix and differ only in a 1: and 2: or a /1 and /2 suffix, or completely identical names for both reads in a pair, as in SAM format (enable with ain=t).

Memory Requirements

Repair has two shell scripts with different default memory allocations:

repair.sh
Requests all available memory by default. Use for arbitrarily disordered files that may require storing all reads in memory.
bbsplitpairs.sh
Requests a small amount of memory by default. Use for fixing broken interleaving where only minimal memory is needed.

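Memory usage by mode:

repair
May need to hold every read in memory when the input is arbitrarily disordered.
fint
Needs only minimal memory, because it fixes interleaving without storing the whole file.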

Usage

repair.sh in=<input file> out=<pair output> outs=<singleton output>

Input may be fasta, fastq, or sam, compressed or uncompressed.

Parameters

Repair provides parameters for input/output specification, processing modes, and file handling options.

Input/Output Parameters

in=<file>
The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
in2=<file>
Use this if the second reads of the pairs are in a different file.
out=<file>
The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
out2=<file>
Use this to write the second reads of the pairs to a different file.
outs=<file>
(outsingle) Write singleton reads here.

Processing Parameters

repair=t
(rp) Fixes arbitrarily corrupted paired reads by using read names. Uses much more memory than 'fint' mode.
fint=f
(fixinterleaving) Fixes corrupted interleaved files using read names. Use only on files with broken interleaving, that is, correctly interleaved files from which some reads were removed.
ain=f
(allowidenticalnames) When detecting pair names, allows identical names for both mates instead of requiring /1 and /2 or 1: and 2: suffixes.

File Handling Parameters

overwrite=t
(ow) Set to false to force the program to abort rather than overwrite an existing file.
showspeed=t
(ss) Set to 'f' to suppress display of processing speed.
ziplevel=2
(zl) Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da
Disable assertions.

Examples

Repairing arbitrarily disordered files

repair.sh in=broken.fq out=fixed.fq outs=singletons.fq repair

Use when reads have been completely shuffled by non-pair-aware processing. High memory usage.

Repairing disordered dual files

repair.sh in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair

Process separate R1 and R2 files that lost proper pairing relationships.

Fixing broken interleaving (low memory)

bbsplitpairs.sh in=broken.fq out=fixed.fq outs=singletons.fq fint

Use bbsplitpairs.sh with fint mode for interleaved files where some reads were removed. Uses minimal memory.

Allowing identical read names

repair.sh in=reads.fq out=paired.fq outs=singles.fq repair ain=t

For reads with identical names (no /1, /2 or 1:, 2: suffixes), such as SAM format reads.

Algorithm Details

Repair Mode Strategy

For arbitrarily disordered files, Repair uses a LinkedHashMap-based approach:

Memory usage scales with the number of unpaired reads, potentially requiring all reads in memory for highly disordered files.
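As a rough illustration of the hash-map idea, the Python sketch below holds each read until its mate shows up, then emits the pair. It is not the code in SplitPairsAndSingles.java: `split_name` and `repair_reads` are hypothetical names, reads are simplified to (name, sequence) tuples, and only the /1 and /2 suffixes are handled.

```python
def split_name(name):
    """Hypothetical helper: return (pair key, mate number) for a read name."""
    if name[-2:] in ("/1", "/2"):
        return name[:-2], int(name[-1])
    return name, 0  # mate number unknown


def repair_reads(reads):
    """Re-pair (name, sequence) tuples that arrive in arbitrary order.

    Returns (pairs, singletons). Assumes each pair key occurs at most twice.
    """
    pending = {}  # insertion-ordered dict, comparable to Java's LinkedHashMap
    pairs = []
    for name, seq in reads:
        key, mate = split_name(name)
        other = pending.pop(key, None)
        if other is None:
            # Mate not seen yet: hold this read in memory.
            pending[key] = (mate, name, seq)
            continue
        # Emit read 1 before read 2 regardless of arrival order.
        first, second = sorted([other, (mate, name, seq)], key=lambda r: r[0])
        pairs.append((first[1:], second[1:]))
    # Anything still pending never found a mate.
    singletons = [r[1:] for r in pending.values()]
    return pairs, singletons


shuffled = [("a/2", "TT"), ("b/1", "AA"), ("c/1", "GG"), ("a/1", "CC"), ("b/2", "AC")]
pairs, singletons = repair_reads(shuffled)
```

In this sketch the `pending` dict only shrinks when a mate arrives, so a file whose mates are far apart keeps most of its reads in memory at once.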

Fix Interleaving Strategy

For broken interleaved files, Repair uses a sequential approach:
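One plausible reading of the sequential approach, sketched in Python under stated assumptions: keep a single read in hand, and pair it with the next read only if the names match as mates 1 and 2. The function names are hypothetical, reads are (name, sequence) tuples, and only /1 and /2 suffixes are recognized, so this is an illustration and not the tool's actual logic.

```python
def split_name(name):
    """Hypothetical helper: return (pair key, mate number) for a read name."""
    if name[-2:] in ("/1", "/2"):
        return name[:-2], int(name[-1])
    return name, 0  # mate number unknown


def fix_interleaving(reads):
    """Split an interleaved stream into pairs and singletons in one pass."""
    pairs, singletons = [], []
    held = None  # at most one read is kept in memory at a time
    for read in reads:
        if held is None:
            held = read
            continue
        key1, mate1 = split_name(held[0])
        key2, mate2 = split_name(read[0])
        if key1 == key2 and (mate1, mate2) == (1, 2):
            pairs.append((held, read))
            held = None
        else:
            # The held read lost its mate; the new read becomes the candidate.
            singletons.append(held)
            held = read
    if held is not None:
        singletons.append(held)
    return pairs, singletons


broken = [("a/1", "A"), ("a/2", "C"), ("b/1", "G"),
          ("c/1", "T"), ("c/2", "A"), ("d/2", "C")]
pairs, singletons = fix_interleaving(broken)
```

Because only one read is held at any moment, memory use stays constant regardless of file size, which matches the low-memory behavior described for fint.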

Read Name Parsing Formats

Repair recognizes multiple read name conventions:
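The conventions named earlier in this document are a /1 and /2 suffix, a 1: and 2: suffix, and (with ain=t) identical names for both mates. The Python sketch below shows one way such names could be reduced to a shared pair key. `parse_pair_name` is a hypothetical function, and the exact matching rules here are assumptions made for illustration, not the parser used by Repair.

```python
def parse_pair_name(name, allow_identical=False):
    """Return (pair_key, mate) for a read name.

    mate is 1 or 2 when a suffix identifies it, else None.
    pair_key is None when the name cannot be used for pairing.
    """
    # Older Illumina style: "prefix/1" and "prefix/2".
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], int(name[-1])
    # Newer Illumina style: "prefix 1:N:0:ACGT" and "prefix 2:N:0:ACGT".
    head, sep, comment = name.partition(" ")
    if sep and comment[:2] in ("1:", "2:"):
        return head, int(comment[0])
    # Identical names (for example from SAM): usable only when allowed.
    if allow_identical:
        return name, None
    return None, None


examples = [
    parse_pair_name("read7/1"),
    parse_pair_name("read7 2:N:0:ACGT"),
    parse_pair_name("read7"),
    parse_pair_name("read7", allow_identical=True),
]
```

Two reads belong to the same pair when their pair keys are equal; the `allow_identical` argument plays the role that ain=t plays on the command line.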

Memory Management

The tool employs different memory strategies:

Quality Trimming Integration

When quality parameters are specified:

Technical Implementation

File Format Support

Automatic format detection and processing:

Concurrent Processing

The implementation uses concurrent I/O streams:

Error Handling and Validation

Performance Characteristics

Performance and Usage Guidelines

Choosing the Right Mode

Use fint mode when:

  • Working with interleaved files where some reads were removed
  • Most read order is preserved but some pairs are broken
  • Memory is limited
  • Using bbsplitpairs.sh for automatic low-memory allocation

Use repair mode when:

  • Reads have been arbitrarily shuffled or disordered
  • Processing separate R1/R2 files that lost synchronization
  • Sufficient memory is available
  • Using repair.sh for automatic high-memory allocation
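The choice between the two modes for a single interleaved file can be sketched as a check on a sample of read names: if every read 1 whose mate is present is immediately followed by that mate, fixing the interleaving is enough; otherwise the reads are out of order and repair mode is needed. `suggest_mode` is a hypothetical helper that handles only /1 and /2 suffixes and loads all sampled names into memory, so treat it as an illustration of the criteria above.

```python
def split_name(name):
    """Hypothetical helper: return (pair key, mate number) for a read name."""
    if name[-2:] in ("/1", "/2"):
        return name[:-2], int(name[-1])
    return name, 0  # mate number unknown


def suggest_mode(names):
    """Return 'fint' if surviving mates are still adjacent, else 'repair'."""
    position = {}
    for index, name in enumerate(names):
        position[split_name(name)] = index
    for (key, mate), index in position.items():
        if mate != 1 or (key, 2) not in position:
            continue  # no mate present, so this read is a singleton either way
        if position[(key, 2)] != index + 1:
            return "repair"  # mate exists but is not the next read
    return "fint"


ordered = ["a/1", "a/2", "b/1", "c/1", "c/2"]
shuffled = ["a/2", "b/1", "a/1", "b/2"]
modes = (suggest_mode(ordered), suggest_mode(shuffled))
```

Separate R1 and R2 files that have lost synchronization fall under repair mode in any case, so this check applies only to interleaved input.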

Memory Planning

Estimate memory requirements based on input characteristics:
