Rename

Script: rename.sh Package: jgi Class: RenameReads.java

Renames reads to <prefix>_<number> where you specify the prefix and the numbers are ordered. There are other renaming modes too. If reads are paired, pairs should be processed together; if reads are interleaved, the interleaved flag should be set. This ensures that if a read number (such as 1: or 2:) is added, it will be added correctly.

Basic Usage

rename.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> prefix=<prefix>

in2 and out2 are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.
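
To illustrate what interleaved output means, the following Python sketch merges the records of two mate files so that each read 1 is immediately followed by its read 2. It is an illustration only and not how rename.sh is implemented; the function name interleave and the tuple record layout are assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record as (header, bases, qualities). This layout is an
# assumption made for illustration only.
Record = Tuple[str, str, str]


def interleave(reads1: Iterable[Record], reads2: Iterable[Record]) -> Iterator[Record]:
    """Yield mate 1 then mate 2 for each pair, as in an interleaved file."""
    # zip() stops at the shorter input; a real tool would report a mismatch.
    for mate1, mate2 in zip(reads1, reads2):
        yield mate1
        yield mate2


r1 = [("a 1:", "ACGT", "IIII"), ("b 1:", "TTGA", "IIII")]
r2 = [("a 2:", "CGTA", "IIII"), ("b 2:", "GGCA", "IIII")]
merged = [record[0] for record in interleave(r1, r2)]
print(merged)  # → ['a 1:', 'a 2:', 'b 1:', 'b 2:']
```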

Parameters

Parameters are organized by function. The tool supports sequential numbering, custom prefixes, coordinate-based renaming, and header trimming operations.

prefix=: The prefix string. In the default mode, reads are renamed to "prefix_1", "prefix_2", etc.; with addprefix=t, the prefix is prepended to the existing read name instead.
suffix=: If a suffix is supplied, it will be appended to the existing read name, after a tab character. Useful for adding metadata.
ow=f: (overwrite) Overwrites files that already exist. Default: false
zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max). Default: 4
int=f: (interleaved) Determines whether INPUT file is considered interleaved. Default: false
fastawrap=70: Length of lines in fasta output. Default: 70 characters per line
minscaf=1: Ignore fasta reads shorter than this length. Default: 1
qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto. Default: auto
qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input). Default: auto
ignorebadquality=f: (ibq) Fix out-of-range quality values instead of crashing with a warning. Default: false
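
As a small illustration of what the qin and qout offsets mean, the Python sketch below re-encodes a quality string from offset 64 to offset 33. The helper convert_quality is hypothetical and written only for this example; it is not part of rename.sh.

```python
def convert_quality(qual: str, qin: int = 64, qout: int = 33) -> str:
    """Re-encode a quality string from one ASCII offset to another."""
    # Each character encodes (Phred score + offset) as an ASCII code.
    return "".join(chr(ord(c) - qin + qout) for c in qual)


# 'h' is ASCII 104, i.e. Phred 40 at offset 64; at offset 33 that is 'I' (73).
print(convert_quality("hhhh"))  # → IIII
```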

Renaming Mode Parameters (if not default)

renamebyinsert=f: Rename the read to indicate its correct insert size. Uses prefix="insert=" and adds calculated insert size to read names. Default: false
renamebymapping=f: Rename the read to indicate its correct mapping coordinates. Requires fastq output format. Default: false
renamebytrim=f: Rename the read to indicate its correct post-trimming length. Creates names like "ID_readlength_insertlength". Default: false
renamebycoords=f: Rename Illumina headers to leave coordinates but remove redundant info. Extracts coordinate information from Illumina headers. Default: false
addprefix=f: Rename the read by prepending the prefix to the existing name, keeping the original name intact. Default: false
prefixonly=f: Only use the prefix; don't add _<number> sequential numbering. All reads will have identical names. Default: false
addunderscore=t: Add an underscore after the prefix (if there is a prefix). Only applies when not using prefixonly mode. Default: true
addpairnum=t: Add a pairnum (e.g. ' 1:', ' 2:') to paired reads in some modes. Helps distinguish read pairs. Default: true
fixsra=f: Fixes headers of SRA reads renamed from Illumina. Specifically, it converts something like this: "SRR17611.11 HWI-ST79:17:D091UACXX:4:1101:210:824 length=75" into this: "HWI-ST79:17:D091UACXX:4:1101:210:824 1:". Default: false

Trimming Parameters

trimleft=0: Trim this many characters from the header start. Applied before other renaming operations. Default: 0
trimright=0: Trim this many characters from the header end. Applied before other renaming operations. Default: 0
trimbeforesymbol=0: Trim this many characters before the last instance of a specified symbol. Used with symbol parameter. Default: 0
symbol=: Trim before this symbol. This can be a literal like ':' or a word like tab or lessthan for reserved symbols. Works with trimbeforesymbol parameter.

Other Parameters

reads=-1: Set to a positive number to only process this many INPUT reads (or pairs), then quit. Default: -1 (process all reads)
quantize=: Set this to reduce compressed file size by binning quality scores. E.g., quantize=2 will eliminate odd qscores, keeping only even values.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. May provide minor performance improvement in production use.

Examples

Basic Sequential Renaming

rename.sh in=reads.fq out=renamed.fq prefix=sample

Renames reads to sample_1, sample_2, sample_3, etc.

Paired-End Renaming

rename.sh in1=reads_1.fq in2=reads_2.fq out1=renamed_1.fq out2=renamed_2.fq prefix=experiment

Renames paired reads to experiment_1 1:, experiment_1 2:, experiment_2 1:, experiment_2 2:, etc.

Add Prefix to Existing Names

rename.sh in=reads.fq out=prefixed.fq prefix=lib1_ addprefix=t

Prepends "lib1_" to existing read names, preserving original identifiers.

Trim Headers

rename.sh in=reads.fq out=trimmed.fq trimleft=5 trimright=10

Removes 5 characters from the start and 10 characters from the end of each header.
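
The effect of trimleft and trimright on a single header can be sketched in Python as follows. This is a minimal illustration under assumptions: the function trim_header and the sample header are made up for this example, and the behavior when the two trims overlap is a guess, since this document does not specify it.

```python
def trim_header(header: str, trimleft: int = 0, trimright: int = 0) -> str:
    """Drop trimleft leading and trimright trailing characters."""
    end = len(header) - trimright
    # If the trims overlap, return an empty string (assumption; how
    # rename.sh handles this case is not stated in this document).
    return header[trimleft:max(end, trimleft)]


print(trim_header("read_HWI-ST79:4:1101:210:824 length=75", 5, 10))
# → HWI-ST79:4:1101:210:824
```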

Fix SRA Headers

rename.sh in=sra_reads.fq out=fixed.fq fixsra=t

Converts SRA-style headers back to original Illumina format with proper pair numbering.

Rename by Insert Size

rename.sh in1=reads_1.fq in2=reads_2.fq out=inserts.fq renamebyinsert=t

Renames reads to indicate their calculated insert sizes (requires paired reads).

Algorithm Details

Renaming Strategy

The rename tool implements several distinct renaming strategies that can be selected via parameters:

Sequential Numbering (Default)

The default mode assigns sequential numbers to reads in the format prefix_number. For paired reads, both mates receive the same number with different pair identifiers (1: and 2:). The numbering counter increments only after processing both mates of a pair.
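
The numbering scheme described above can be sketched in Python. This is an illustration of the naming pattern only, not the RenameReads code; the function sequential_names and its arguments are hypothetical, and the starting number of 1 follows the examples in this document.

```python
from typing import Iterator


def sequential_names(count: int, prefix: str, paired: bool = False) -> Iterator[str]:
    """Yield names of the form prefix_number, with ' 1:' / ' 2:' for pairs."""
    for number in range(1, count + 1):
        base = f"{prefix}_{number}"
        if paired:
            # Both mates share one number; the counter advances once per pair.
            yield f"{base} 1:"
            yield f"{base} 2:"
        else:
            yield base


print(list(sequential_names(3, "sample")))
# → ['sample_1', 'sample_2', 'sample_3']
print(list(sequential_names(2, "experiment", paired=True)))
# → ['experiment_1 1:', 'experiment_1 2:', 'experiment_2 1:', 'experiment_2 2:']
```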

Insert Size-Based Renaming

When renamebyinsert=true, the tool calculates the insert size of each read pair and writes it into the read names. The insert size comes from Read.insertSizeMapped(), which determines the distance between the paired reads.

Coordinate-Based Renaming

The renamebycoords mode uses IlluminaHeaderParser2 to extract coordinate information from Illumina headers. It rebuilds a minimal header that keeps the coordinate fields and drops redundant information.

Trim-Based Renaming

The renamebytrim mode builds names from a numeric ID, the read length, and the calculated insert size, using the format: numericID_readlength_insertsize. Both lengths can then be read directly from the name without recomputing them.

Header Processing

The tool provides several header-processing operations:

Left/Right Trimming: Removes specified numbers of characters from header start or end
Symbol-Based Trimming: Uses trimBeforeSymbol() method to remove characters before the last occurrence of a specified symbol
SRA Header Fixing: Parses SRA-format headers using regex patterns to extract original Illumina identifiers
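
As an illustration of the SRA header fix listed above, the Python sketch below converts the example header from the fixsra parameter description. The regular expression is an assumption about the header shape (accession, original Illumina name, then length=N) and is not the pattern used by RenameReads; the function fix_sra_header is hypothetical.

```python
import re

# Assumed shape: "<accession>.<n> <original header> length=<n>".
SRA_HEADER = re.compile(r"^\S+\s+(\S+)\s+length=\d+$")


def fix_sra_header(header: str, pairnum: int = 1) -> str:
    """Recover the original Illumina name and append a pair number."""
    match = SRA_HEADER.match(header)
    if match is None:
        # Leave unrecognized headers unchanged (assumption for this sketch).
        return header
    return f"{match.group(1)} {pairnum}:"


print(fix_sra_header("SRR17611.11 HWI-ST79:17:D091UACXX:4:1101:210:824 length=75"))
# → HWI-ST79:17:D091UACXX:4:1101:210:824 1:
```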

Memory Efficiency

The rename operation processes reads in streaming fashion using ConcurrentReadInputStream and ConcurrentReadOutputStream. Because reads are renamed and written as they arrive rather than held in memory, memory usage stays roughly constant regardless of input file size, which makes the tool suitable for large sequencing datasets.
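
The streaming idea can be sketched with Python generators: each record is read, renamed, and written before the next one is loaded, so only one record is held at a time. This single-threaded sketch is an illustration only and does not reproduce the concurrent streams used by the tool; read_fastq and rename_stream are hypothetical names, and the input is assumed to be well-formed four-line FASTQ.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Lazily yield (name, bases, quals) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quals = handle.readline().rstrip("\n")
        yield header[1:], bases, quals


def rename_stream(src: TextIO, dst: TextIO, prefix: str) -> int:
    """Rename records one at a time; return the number of reads written."""
    count = 0
    for count, (_, bases, quals) in enumerate(read_fastq(src), start=1):
        dst.write(f"@{prefix}_{count}\n{bases}\n+\n{quals}\n")
    return count


src = io.StringIO("@x\nACGT\n+\nIIII\n@y\nTTGA\n+\nIIII\n")
dst = io.StringIO()
written = rename_stream(src, dst, "sample")
print(written)
print(dst.getvalue(), end="")
```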

Quality Score Handling

When the quantize parameter is set, the tool applies Quantizer.quantize() to bin quality scores. Binning can significantly reduce compressed file size because fewer distinct quality values remain, at the cost of fine-grained quality differences.
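
A rough Python sketch of quality binning, matching the quantize=2 example of keeping only even scores, is shown below. The rounding direction (down to the nearest multiple) and the function quantize_quality are assumptions for illustration; Quantizer.quantize() may choose bins differently.

```python
def quantize_quality(qual: str, step: int = 2, offset: int = 33) -> str:
    """Round each Phred score down to a multiple of step (assumed rule)."""
    out = []
    for char in qual:
        score = ord(char) - offset
        out.append(chr(score - (score % step) + offset))
    return "".join(out)


# Phred 40, 41, 42, 43 at offset 33 become 40, 40, 42, 42.
print(quantize_quality("IJKL"))  # → IIKK
```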

Paired-Read Awareness

The tool keeps paired reads together throughout all renaming operations. Both mates of a pair receive coordinated names, and when addpairnum is enabled the pair-specific suffixes (" 1:" and " 2:") are added in the modes that use them. The pairnum() method identifies which read of a pair is being processed.

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org