Reformat

Script: reformat.sh
Package: jgi
Class: ReformatReads.java

Generic streaming read-processing tool designed for format conversion, subsampling, and various filtering operations with low memory and computational demands. Supports fastq, fasta, fasta+qual, scarf, oneline, sam, bam, gzip, bz2 formats.

Overview

Reformat is designed for streaming read-processing tasks that have low memory or computational demands, such as format conversion, subsampling, and various filtering operations. While some functionality (like quality-trimming, length-filtering, histogram generation) is shared with BBDuk, much of it (like converting degenerate bases to N) is unique to Reformat.

When to Use Reformat vs BBDuk

Choose Reformat when: You need low resource consumption, are piping data to/from high-resource programs, or need format conversion capabilities
Choose BBDuk when: You need maximum performance for quality trimming, length filtering, or kmer-based operations and have sufficient memory

Resource Requirements

Memory: Only a trivial amount for short reads, regardless of how many there are. For very long sequences (e.g., a human genome), use readbufferlength=1 readbuffers=1 to reduce buffering
Threading: Uses single worker thread but multiple I/O and compression threads. Even with t=1, typically uses over 2 CPU cores due to separate I/O threading
Output Streams: Two standard streams: "out" for normal reads, "outs" for singleton reads that pass filters but whose mates fail

Basic Usage

reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

Note: in2 and out2 are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.

File Format Support

Reformat supports automatic format detection and conversion between:

Sequence formats: fastq, fasta, fasta+qual, scarf, oneline, sam, bam
Compression: gzip, bz2 (automatic detection and compression)
Quality encodings: Sanger (ASCII-33), Illumina (ASCII-64), automatic detection

Common Use Cases

Format Conversion

reformat.sh in=reads.fastq out=reads.fasta

Convert between any supported formats. Compression is automatic: reformat.sh in=reads.fa.gz out=reads.sam

Batch Processing with Wildcards

reformat.sh in=read#.fq out=%.fa

Converts read1.fq and read2.fq to read1.fa and read2.fa automatically

Quality Encoding Conversion

reformat.sh in=reads.fq out=reads.fq qin=33 qout=64

Convert ASCII-33 qualities (Sanger) to ASCII-64 (obsolete Illumina)
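
To make the qin/qout arithmetic concrete, here is a minimal Python sketch of an offset conversion. It is an illustration only, not Reformat's code; the function name convert_quality_offset is hypothetical. Each quality character is decoded to a Phred score by subtracting the input offset, then re-encoded by adding the output offset.

```python
def convert_quality_offset(qual: str, qin: int = 33, qout: int = 64) -> str:
    """Re-encode a FASTQ quality string from one ASCII offset to another."""
    # Decode each character to its Phred score using the input offset.
    scores = [ord(ch) - qin for ch in qual]
    if any(q < 0 for q in scores):
        raise ValueError("quality character lies below the input offset")
    # Re-encode the same scores with the output offset.
    return "".join(chr(q + qout) for q in scores)


# 'I' is Q40, '5' is Q20 and '#' is Q2 under ASCII-33.
print(convert_quality_offset("II5#"))  # prints hhTB
```

The scores themselves do not change, only the characters that represent them, so converting back with qin=64 qout=33 recovers the original string.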

FASTA+QUAL Conversion

# FASTQ to FASTA+QUAL
reformat.sh in=reads.fq out=reads.fa qfout=reads.qual

# FASTA+QUAL to FASTQ
reformat.sh in=reads.fa qfin=reads.qual out=reads.fq

Interleaving/Deinterleaving

# Deinterleave
reformat.sh in=reads.fq out1=read1.fq out2=read2.fq

# Interleave
reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

# Concise form
reformat.sh in=read#.fq out=reads.fq

Read Name Modifications

# Add /1 and /2 to paired read names
reformat.sh in=reads.fq out=renamed.fq addslash int

# Replace whitespace with underscores
reformat.sh in=reads.fq out=renamed.fq underscore

# Trim names after first whitespace
reformat.sh in=reads.fq out=renamed.fq trd

Sequence Transformations

# Reverse complement all reads
reformat.sh in=reads.fq out=out.fq rcomp

# Reverse complement only read 2
reformat.sh in=reads.fq out=out.fq rcompmate

# Convert to uppercase
reformat.sh in=reads.fq out=out.fq tuc

# Convert degenerate bases to N
reformat.sh in=reads.fq out=out.fq iupacton

# Custom base remapping
reformat.sh in=reads.fq out=out.fq remap=aZGP

Data Validation

# Verify paired read names
reformat.sh in=reads.fq vint                    # Interleaved reads
reformat.sh in=read#.fq vpair                   # Separate files

# Discard broken reads, i.e. reads with different numbers of bases and qualities (use with caution)
reformat.sh in=reads.fq out=fixed.fq tossbrokenreads

Quality Score Management

# Cap quality scores to specific range
reformat.sh in=reads.fq out=out.fq mincalledquality=2 maxcalledquality=41

# Ensure unique sequence names
reformat.sh in=reads.fq out=out.fq uniquenames

Parameters

Parameters are grouped by function below. Each entry gives the parameter name, its default value, an alternate name in parentheses where one exists, and a description.

Parameters and their defaults

ow=f: (overwrite) Overwrites files that already exist.
app=f: (append) Append to files that already exist.
zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max).
int=f: (interleaved) Determines whether INPUT file is considered interleaved.
fastawrap=70: Length of lines in fasta output.
fastareadlen=0: Set to a non-zero number to break fasta files into reads of at most this length.
fastaminlen=1: Ignore fasta reads shorter than this.
qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
qfake=30: Quality value used for fasta to fastq reformatting.
qfin=<.qual file>: Read qualities from this qual file, for the reads coming from 'in=<fasta file>'
qfin2=<.qual file>: Read qualities from this qual file, for the reads coming from 'in2=<fasta file>'
qfout=<.qual file>: Write qualities to this qual file, for the reads going to 'out=<fasta file>'
qfout2=<.qual file>: Write qualities to this qual file, for the reads going to 'out2=<fasta file>'
outsingle=<file>: (outs) If a read is longer than minlength and its mate is shorter, the longer one goes here.
deleteinput=f: Delete input upon successful completion.
ref=<file>: Optional reference fasta for sam processing.

Processing Parameters

verifypaired=f: (vpair) When true, checks reads to see if the names look paired. Prints an error message if not.
verifyinterleaved=f: (vint) sets 'vpair' to true and 'interleaved' to true.
allowidenticalnames=f: (ain) When verifying pair names, allows identical names, instead of requiring /1 and /2 or 1: and 2:
tossbrokenreads=f: (tbr) Discard reads that have different numbers of bases and qualities. By default this will be detected and cause a crash.
ignorebadquality=f: (ibq) Fix out-of-range quality values instead of crashing with a warning.
addslash=f: Append ' /1' and ' /2' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
spaceslash=t: Put a space before the slash in addslash mode.
addcolon=f: Append ' 1:' and ' 2:' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
underscore=f: Change whitespace in read names to underscores.
rcomp=f: (rc) Reverse-complement reads.
rcompmate=f: (rcm) Reverse-complement read 2 only.
comp=f: (complement) Complement reads without reversing them.
changequality=t: (cq) N bases always get a quality of 0 and ACGT bases get a min quality of 2.
quantize=f: Quantize qualities to a subset of values like NextSeq. Can also be used with comma-delimited list, like quantize=0,8,13,22,27,32,37
tuc=f: (touppercase) Change lowercase letters in reads to uppercase.
uniquenames=f: Make duplicate names unique by appending _<number>.
remap=: A set of pairs: remap=CTGN will transform C>T and G>N. Use remap1 and remap2 to specify read 1 or 2.
iupacToN=f: (itn) Convert non-ACGTN symbols to N.
monitor=f: Kill this process if it stalls. monitor=600,0.01 would kill it after 600 seconds under 1% usage.
crashjunk=t: Crash when encountering reads with invalid bases.
tossjunk=f: Discard reads with invalid characters as bases.
fixjunk=f: Convert invalid bases to N (or X for amino acids).
dotdashxton=f: Specifically convert . - and X to N (or X for amino acids).
recalibrate=f: (recal) Recalibrate quality scores. Must first generate matrices with CalcTrueQuality.
maxcalledquality=41: Quality scores capped at this upper bound.
mincalledquality=2: Quality scores of ACGT bases will be capped at lower bound.
trimreaddescription=f: (trd) Trim the names of reads after the first whitespace.
trimrname=f: For sam/bam files, trim rname/rnext fields after the first space.
fixheaders=f: Replace characters in headers such as space, *, and | to make them valid file names.
warnifnosequence=t: For fasta, issue a warning if a sequenceless header is encountered.
warnfirsttimeonly=t: Issue a warning for only the first sequenceless header.
utot=f: Convert U to T (for RNA -> DNA translation).
padleft=0: Pad the left end of sequences with this many symbols.
padright=0: Pad the right end of sequences with this many symbols.
pad=0: Set padleft and padright to the same value.
padsymbol=N: Symbol to use for padding.

Histogram output parameters

bhist=<file>: Base composition histogram by position.
qhist=<file>: Quality histogram by position.
qchist=<file>: Count of bases with each quality value.
aqhist=<file>: Histogram of average read quality.
bqhist=<file>: Quality histogram designed for box plots.
lhist=<file>: Read length histogram.
gchist=<file>: Read GC content histogram.
gcbins=100: Number of gchist bins. Set to 'auto' to use read length.
gcplot=f: Add a graphical representation to the gchist.
maxhistlen=6000: Set an upper bound for histogram lengths; higher uses more memory. The default is 6000 for some histograms and 80000 for others.

Histogram parameters for sam files only (requires sam format 1.4 or higher)

ehist=<file>: Errors-per-read histogram.
qahist=<file>: Quality accuracy histogram of error rates versus quality score.
indelhist=<file>: Indel length histogram.
mhist=<file>: Histogram of match, sub, del, and ins rates by read location.
ihist=<file>: Insert size histograms. Requires paired reads in a sam file.
idhist=<file>: Histogram of read count versus percent identity.
idbins=100: Number of idhist bins. Set to 'auto' to use read length.

Sampling parameters

reads=-1: Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1: Skip (discard) this many INPUT reads before processing the rest.
samplerate=1: Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1: Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0: (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0: (sbt) Exact number of OUTPUT bases desired. Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f: Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.
prioritizelength=f: If true, calculate a length threshold to reach the target, and retain all reads of at least that length (must set srt or sbt).

Trimming and filtering parameters

qtrim=f: Trim read ends to remove bases with quality below trimq. Values: t (trim both ends), f (neither end), r (right end only), l (left end only), w (sliding window).
trimq=6: Regions with average quality BELOW this will be trimmed. Can be a floating-point number like 7.3.
minlength=0: (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
minlengthfraction=0: (mlf) Reads shorter than this fraction of original length after trimming will be discarded.
maxlength=0: If nonzero, reads longer than this after trimming will be discarded.
breaklength=0: If nonzero, reads longer than this will be broken into multiple reads of this length. Does not work for paired reads.
requirebothbad=t: (rbb) Only discard pairs if both reads are shorter than minlength.
invertfilters=f: (invert) Output failing reads instead of passing reads.
minavgquality=0: (maq) Reads with average quality (after trimming) below this will be discarded.
maqb=0: If positive, calculate maq from this many initial bases.
chastityfilter=f: (cf) Reads with names containing ' 1:Y:' or ' 2:Y:' will be discarded.
barcodefilter=f: Remove reads with unexpected barcodes if barcodes is set, or barcodes containing 'N' otherwise. A barcode must be the last part of the read header.
barcodes=: Comma-delimited list of barcodes or files of barcodes.
maxns=-1: If 0 or greater, reads with more Ns than this (after trimming) will be discarded.
minconsecutivebases=0: (mcb) Discard reads without at least this many consecutive called bases.
forcetrimleft=0: (ftl) If nonzero, trim left bases of the read to this position (exclusive, 0-based).
forcetrimright=-1: (ftr) If nonnegative, trim right bases of the read after this position (exclusive, 0-based).
forcetrimright2=0: (ftr2) If positive, trim this many bases on the right end.
forcetrimmod=0: (ftm) If positive, trim length to be equal to zero modulo this number.
mingc=0: Discard reads with GC content below this.
maxgc=1: Discard reads with GC content above this.
gcpairs=t: Use average GC of paired reads. Also affects gchist.
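
The force-trim flags are easier to read with a worked example. The Python sketch below reflects one reading of the descriptions above and is not Reformat's implementation; the function names are hypothetical. It assumes ftl and ftr together keep the 0-based positions ftl through ftr inclusive, and it treats each flag separately because the order in which Reformat applies them is not stated here.

```python
def force_trim_positions(seq: str, ftl: int = 0, ftr: int = -1) -> str:
    """Keep 0-based positions ftl..ftr inclusive (assumed reading of ftl/ftr)."""
    start = max(0, ftl)
    # A negative ftr means "no right trim" in this sketch.
    end = len(seq) if ftr < 0 else min(len(seq), ftr + 1)
    return seq[start:end]


def force_trim_right2(seq: str, ftr2: int) -> str:
    """Remove ftr2 bases from the right end when ftr2 is positive."""
    return seq[: max(0, len(seq) - ftr2)] if ftr2 > 0 else seq


def force_trim_mod(seq: str, ftm: int) -> str:
    """Shorten the read so that its length is zero modulo ftm."""
    return seq[: len(seq) - len(seq) % ftm] if ftm > 0 else seq


print(force_trim_positions("ACGTACGTAC", ftl=2, ftr=7))  # prints GTACGT
print(len(force_trim_mod("A" * 151, 5)))                 # prints 150
```

The second call shows a typical use of ftm: a 151-base read is cut to 150 bases, dropping the extra final base.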

Tag-filtering parameters

tag=: Look for this tag in the header to filter by the next value. To filter reads with a header like 'foo,depth=5.5,bar' where you only want depths of at least 3, the necessary flags would be 'tag=depth= minvalue=3 delimiter=,'
delimiter=: Character after the end of the value, such as delimiter=X. Control and whitespace symbols may be spelled out, like delimiter=tab or delimiter=pipe. The tag may contain the delimiter. If the value is the last term in the header, the delimiter doesn't matter but is still required.
minvalue=: If set, only accept a numeric value of at least this.
maxvalue=: If set, only accept a numeric value of at most this.
value=: If set, only accept a string value of exactly this.
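
The tag-filter logic can be sketched in a few lines of Python. This is an illustration built from the descriptions above, not Reformat's code; the function names are hypothetical, and rejecting headers that lack the tag is an assumption.

```python
def tag_value(header, tag, delimiter):
    """Return the text between the tag and the next delimiter, or None if the tag is absent."""
    start = header.find(tag)
    if start < 0:
        return None
    start += len(tag)
    # Search for the delimiter only after the tag, so the tag may contain it.
    end = header.find(delimiter, start)
    return header[start:] if end < 0 else header[start:end]


def passes_tag_filter(header, tag, delimiter, minvalue=None, maxvalue=None, value=None):
    raw = tag_value(header, tag, delimiter)
    if raw is None:
        return False  # assumption: a header without the tag fails the filter
    if value is not None and raw != value:
        return False
    if minvalue is not None or maxvalue is not None:
        number = float(raw)  # raises ValueError if the value is not numeric
        if minvalue is not None and number < minvalue:
            return False
        if maxvalue is not None and number > maxvalue:
            return False
    return True


print(passes_tag_filter("foo,depth=5.5,bar", "depth=", ",", minvalue=3))  # prints True
print(passes_tag_filter("foo,depth=2,bar", "depth=", ",", minvalue=3))    # prints False
```

When the value is the last term in the header, no delimiter follows it and the sketch takes the rest of the header as the value.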

Illumina-specific parameters

top=true: Include reads from the top of the flowcell.
bottom=true: Include reads from the bottom of the flowcell.

Sam and bam processing parameters

mappedonly=f: Toss unmapped reads.
unmappedonly=f: Toss mapped reads.
pairedonly=f: Toss reads that are not mapped as proper pairs.
unpairedonly=f: Toss reads that are mapped as proper pairs.
primaryonly=f: Toss secondary alignments. Set this to true for sam to fastq conversion.
minmapq=-1: If non-negative, toss reads with mapq under this.
maxmapq=-1: If non-negative, toss reads with mapq over this.
requiredbits=0: (rbits) Toss sam lines with any of these flag bits unset. Similar to samtools -f.
filterbits=0: (fbits) Toss sam lines with any of these flag bits set. Similar to samtools -F.
stoptag=f: Set to true to write a tag indicating read stop location, prefixed by YS:i:
sam=: Set to 'sam=1.3' to convert '=' and 'X' cigar symbols (from sam 1.4+ format) to 'M'. Set to 'sam=1.4' to convert 'M' to '=' and 'X' (sam=1.4 requires MD tags to be present, or ref to be specified).
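
The requiredbits and filterbits tests are plain bit masks on the SAM flag field. The Python sketch below illustrates the two conditions as described above; it is not Reformat's code and the function name is hypothetical. The flag values in the example (0x4 for unmapped, 0x2 for proper pair) come from the SAM specification.

```python
def passes_flag_filter(flag: int, requiredbits: int = 0, filterbits: int = 0) -> bool:
    """Keep a SAM line only if every required bit is set and no filtered bit is set."""
    has_all_required = (flag & requiredbits) == requiredbits
    has_no_filtered = (flag & filterbits) == 0
    return has_all_required and has_no_filtered


# Flag 99 = paired (1) + proper pair (2) + mate reverse (32) + first in pair (64).
print(passes_flag_filter(99, requiredbits=2))  # prints True
# Flag 77 = paired (1) + unmapped (4) + mate unmapped (8) + first in pair (64).
print(passes_flag_filter(77, filterbits=4))    # prints False
```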

Sam and bam alignment filtering parameters

These require = and X symbols in cigar strings, or MD tags, or a reference fasta. -1 means disabled; to filter reads with any of a symbol type, set to 0.

subfilter=-1: Discard reads with more than this many substitutions.
minsubs=-1: Discard reads with fewer than this many substitutions.
insfilter=-1: Discard reads with more than this many insertions.
delfilter=-1: Discard reads with more than this many deletions.
indelfilter=-1: Discard reads with more than this many indels.
editfilter=-1: Discard reads with more than this many edits.
inslenfilter=-1: Discard reads with an insertion longer than this.
dellenfilter=-1: Discard reads with a deletion longer than this.
minidfilter=-1.0: Discard reads with identity below this (0-1).
maxidfilter=1.0: Discard reads with identity above this (0-1).
clipfilter=-1: Discard reads with more than this many soft-clipped bases.

Kmer counting and cardinality estimation parameters

k=0: If positive, count the total number of kmers.
cardinality=f: (loglog) Count unique kmers using the LogLog algorithm.
loglogbuckets=1999: Use this many buckets for cardinality estimation.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions.

Shortcuts

The # symbol is replaced by 1 and 2. The % symbol in out is replaced by the input name minus its extensions.

reformat.sh in=read#.fq out=%.fa

...is equivalent to:

reformat.sh in1=read1.fq in2=read2.fq out1=read1.fa out2=read2.fa
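
The expansion can be pictured with a short Python sketch. It is an illustration of the rule stated above, not Reformat's parser; the function name expand_shortcuts is hypothetical, and stripping the directory and everything from the first dot onward is an assumption about what "minus extensions" means.

```python
import os


def expand_shortcuts(in_pattern: str, out_pattern: str):
    """Expand '#' to 1 and 2, and '%' to the input name minus extensions."""
    pairs = []
    for mate in ("1", "2"):
        in_name = in_pattern.replace("#", mate)
        # Assumption: drop the directory, then everything from the first dot onward.
        core = os.path.basename(in_name).split(".", 1)[0]
        pairs.append((in_name, out_pattern.replace("%", core)))
    return pairs


print(expand_shortcuts("read#.fq", "%.fa"))
# prints [('read1.fq', 'read1.fa'), ('read2.fq', 'read2.fa')]
```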

Algorithm Details

Stream Processing Architecture

Reformat uses a concurrent streaming architecture optimized for low memory usage and high throughput. The ConcurrentReadInputStream and ConcurrentReadOutputStream classes process reads in configurable batches (buffer size 4) to balance memory efficiency with performance. This streaming design allows processing of files of any size without loading entire datasets into memory.

Memory Management

The tool uses specific memory allocation strategies for optimal performance:

Default Allocation: 300MB heap (z="-Xmx300m") suitable for most operations
Buffer Control: Shared.capBuffers(4) limits concurrent buffer allocation to prevent memory spikes
Histogram Limits: maxhistlen parameter (default 6000/80000) prevents excessive memory usage for quality/length histograms
Long Sequence Handling: For very long sequences, use readbufferlength=1 readbuffers=1 to minimize buffering

Quality Score Processing

Multiple specialized quality processing methods are available:

Encoding Detection: Parser.processQuality() examines ASCII ranges to distinguish between Sanger (33), Illumina (64), and other schemes
Quality Trimming: TrimRead.trimFast() implements sliding window and end-trimming using configurable quality thresholds
Quality Recalibration: CalcTrueQuality.recalibrate() applies empirical error rate matrices when recalibrate=t
Quality Quantization: Quantizer.quantize() reduces quality complexity using NextSeq-compatible or custom binning schemes
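
As an illustration of quantization, the Python sketch below maps each score to the closest value in a list of allowed levels, using the example list from the quantize parameter. The nearest-level rule and the tie-breaking are assumptions made for this sketch; Quantizer.quantize() may bin scores differently.

```python
def quantize(quals, levels=(0, 8, 13, 22, 27, 32, 37)):
    """Map each quality score to the nearest allowed level (ties go to the lower level)."""
    ordered = sorted(levels)
    # min() returns the first level with the smallest distance, i.e. the lower one on a tie.
    return [min(ordered, key=lambda level: abs(level - q)) for q in quals]


print(quantize([2, 10, 11, 30, 40]))  # prints [0, 8, 13, 32, 37]
```

Fewer distinct quality values make the quality lines more repetitive, which generally helps compressed files shrink.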

Sampling Implementation

Reformat provides multiple sampling strategies:

Rate-based: Uses Random.nextDouble() with optional deterministic seeding for reproducible subsampling
Exact Targeting: countReads() method performs initial pass to calculate precise probability ratios for exact read/base targets
Length-priority: makeLengthHist() builds SuperLongList histograms to calculate length thresholds when prioritizelength=t
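
The exact-target strategy can be illustrated with classic two-pass selection sampling. The Python sketch below is a minimal example of that general technique and is not taken from the Reformat source; the function name and the exact update rule are assumptions. The first pass counts the reads; the second keeps each read with probability (reads still needed) / (reads still remaining), which yields exactly the target count.

```python
import random


def sample_exact(reads, target, seed=None):
    """Return exactly min(target, len(reads)) reads, in input order."""
    rng = random.Random(seed)            # a fixed seed gives a reproducible sample
    remaining_reads = sum(1 for _ in reads)          # pass 1: count the input
    remaining_target = min(target, remaining_reads)
    kept = []
    for read in reads:                               # pass 2: select
        # Keep with probability remaining_target / remaining_reads.
        if remaining_target > 0 and rng.random() * remaining_reads < remaining_target:
            kept.append(read)
            remaining_target -= 1
        remaining_reads -= 1
    return kept


subset = sample_exact(list(range(1000)), target=100, seed=1)
print(len(subset))  # prints 100
```

Because the input is read twice, it must be re-readable, which is consistent with the note that srt/sbt should not be used with stdin.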

Format Detection and Conversion

FileFormat.testInput() provides automatic format detection based on file extensions and content headers. The system supports seamless conversion between sequence formats (FASTQ, FASTA, SAM, BAM) and compression formats (gzip, bz2) without intermediate files. Quality file handling enables FASTA+QUAL conversions using separate quality streams.

Threading Model

While using a single worker thread for read processing, Reformat employs multiple I/O and compression threads:

I/O Threading: Separate threads handle input/output operations to prevent blocking
Compression Threading: ReadWrite.setZipThreads() enables parallel compression/decompression
PIGZ Integration: When available, uses pigz for multi-threaded gzip operations

Related Tools

While Reformat provides extensive functionality, some specialized operations are handled by dedicated tools:

rename.sh: Advanced read name manipulation and renaming operations
repair.sh/bbsplitpairs.sh: Reordering paired reads that have lost synchronization
readlength.sh: Advanced length histogram analysis and control options
filterbyname.sh: Name-based filtering with pattern matching capabilities
fuse.sh/split.sh: Sequence shredding and concatenation operations
phylip2fasta: Phylip format conversion and processing
translate6frames: Nucleotide to amino acid translation in all six reading frames

Performance Characteristics

Memory Usage: Default 300MB heap, configurable via -Xmx with automatic detection up to 85% of physical memory
Resource Efficiency: Minimal memory footprint for short reads regardless of dataset size
Threading: Single worker thread with multiple I/O and compression threads for optimal resource utilization
Scalability: Streaming architecture processes files of any size; memory use depends on read length and buffer settings rather than on file size
Compression Support: Native multi-threaded compression/decompression with pigz integration

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org
Guide: bbtools/docs/guides/ReformatGuide.txt for comprehensive usage examples