ReplaceHeaders
Replaces read names with names from another file. The other file can contain either sequences or bare names, one per line, without the leading > or @ symbol. A file of bare names should have the .header extension.
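If the replacement names currently live in a FASTA file, a plain-text header file can be derived with standard Unix tools. The following is a minimal sketch; the helper name fasta_to_header and the file names are made up for illustration and are not part of BBTools.

```shell
# Hypothetical helper, not part of BBTools: prints one name per line,
# dropping the leading '>' from each FASTA header line.
fasta_to_header() {
    sed -n 's/^>//p' "$1"
}

# Example: build names.header from a small FASTA file.
workdir=$(mktemp -d)
printf '>seq1 sample A\nACGT\n>seq2 sample B\nGGCC\n' > "$workdir/source.fa"
fasta_to_header "$workdir/source.fa" > "$workdir/names.header"
```

The resulting names.header can then be passed as hin=names.header.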
Basic Usage
replaceheaders.sh in=<file> hin=<headers file> out=<out file>
This tool replaces the headers (names) of sequences in one file with headers from another file. The input and header files must have the same number of sequences in the same order.
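Because the two files must contain the same number of entries, it can save time to compare counts before running the tool. The sketch below is one way to do that with standard Unix tools. Both function names are hypothetical, and the FASTQ count assumes the common layout of exactly four lines per record.

```shell
# Hypothetical pre-flight helpers, not part of BBTools.
# Counts FASTQ records, assuming exactly 4 lines per record.
count_fastq_records() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Counts names in a plain-text header file (non-empty lines only).
count_header_names() {
    grep -c . "$1" || true
}

# Example with two reads and two replacement names.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > "$workdir/reads.fq"
printf 'newname1\nnewname2\n' > "$workdir/names.header"
if [ "$(count_fastq_records "$workdir/reads.fq")" -eq "$(count_header_names "$workdir/names.header")" ]; then
    echo "counts match"
else
    echo "count mismatch" >&2
fi
```

This check only compares totals; it cannot confirm that the entries are in the same order.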
Parameters
Parameters are organized by function. The tool supports standard BBTools I/O parameters along with specific header replacement options.
Standard parameters
- in=
- Input sequences file. Use in2 for a second paired file. This is the file whose headers will be replaced.
- hin=
- Header input sequences file. Use hin2 for a second paired file. This file contains the replacement headers. Can be sequences or plain text with one name per line (use .header extension for plain text).
- out=
- Output sequences file. Use out2 for a second paired file. Will contain the original sequences with the new headers.
- ow=f
- (overwrite) Overwrites files that already exist. Set to true to allow overwriting existing output files.
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Controls gzip compression level for output files.
- int=f
- (interleaved) Determines whether INPUT file is considered interleaved. Set to true if input contains paired reads in a single file.
- fastawrap=70
- Length of lines in fasta output. Controls line wrapping for FASTA format output sequences.
- qin=auto
- ASCII offset for input quality scores. May be 33 (Sanger), 64 (Illumina), or auto to detect automatically.
- qout=auto
- ASCII offset for output quality scores. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
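To make the two offsets concrete: the same quality value is stored as a different character under each offset, so converting Phred+64 to Phred+33 (qin=64 qout=33) moves every quality character down by 31 code points. The sketch below shows that shift with tr. It illustrates the encoding only and makes no claim about how the tool performs the conversion; the function name is hypothetical.

```shell
# Illustrative only: maps each Phred+64 quality character ('@' through '~')
# to the character 31 code points lower ('!' through '_').
phred64_to_phred33() {
    printf '%s\n' "$1" | LC_ALL=C tr '@-~' '!-_'
}

phred64_to_phred33 'hhhh'   # prints: IIII (Q40 under either offset)
```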
Renaming mode parameters (if not default)
- addprefix=f
- Rename the read by prepending the new name to the existing name. When true, combines new and old headers (new_header old_header). When false (default), completely replaces the old header.
Sampling parameters
- reads=-1
- Set to a positive number to only process this many INPUT reads (or pairs), then quit. Use -1 (default) to process all reads.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
- -da
- Disable assertions. May provide slight performance improvement in production use.
Examples
Basic Header Replacement
replaceheaders.sh in=reads.fq hin=newnames.fq out=renamed_reads.fq
Replaces all headers in reads.fq with headers from newnames.fq, writing output to renamed_reads.fq.
Using Plain Text Header File
replaceheaders.sh in=sequences.fa hin=names.header out=renamed.fa
Uses a plain text file (names.header) containing one name per line to replace sequence headers. The .header extension tells the tool to treat it as plain text rather than sequence format.
Paired-end Files
replaceheaders.sh in=reads_1.fq in2=reads_2.fq hin=names_1.fq hin2=names_2.fq out=renamed_1.fq out2=renamed_2.fq
Processes paired-end files, replacing headers in both R1 and R2 files with corresponding headers from the header files.
Prefix Mode
replaceheaders.sh in=reads.fq hin=prefixes.header out=prefixed_reads.fq addprefix=t
Prepends new names to existing headers instead of replacing them completely. Results in headers like "newname originalname".
Limited Processing
replaceheaders.sh in=large_dataset.fq hin=new_names.fq out=sample.fq reads=1000
Only processes the first 1000 reads from the input file, useful for testing or sampling large datasets.
Algorithm Details
ReplaceHeaders reads the sequence file and the header file simultaneously through two synchronized ConcurrentReadInputStream instances:
Processing Strategy
- ConcurrentReadInputStream Architecture: Creates two independent ConcurrentReadInputStream instances (cris for sequences, hcris for headers) and consumes them in parallel, one ListNum batch from each at a time
- Pairedness Validation: Compares cris.paired() and hcris.paired() in the constructor and terminates with KillSwitch.kill() if the read and header files differ in pairedness
- ListNum Synchronization: Expects each header batch to be the same size as the matching sequence batch and terminates with KillSwitch.kill() when hreads.size() != reads.size()
- Buffer Management: Calls Shared.setBufferLen(1) and Shared.capBuffers(4) to configure buffering for concurrent stream processing
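The lockstep idea in the list above can be imitated in plain shell by reading both files through separate file descriptors and stopping as soon as one runs out before the other. This is a rough analogue for illustration, not BBTools code. The function name is hypothetical, and it assumes an unwrapped FASTA file (one header line followed by one sequence line) and a plain-text names file.

```shell
# Rough shell analogue of lockstep dual-stream processing; not BBTools code.
# $1 = unwrapped FASTA file, $2 = names file with one name per line.
rename_lockstep() {
    {
        while IFS= read -r old_header <&3; do
            IFS= read -r sequence <&3 || return 1      # truncated record
            if ! IFS= read -r new_name <&4; then
                echo "error: fewer names than sequences" >&2
                return 1
            fi
            printf '>%s\n%s\n' "$new_name" "$sequence"
        done
        if IFS= read -r extra_name <&4; then
            echo "error: more names than sequences" >&2
            return 1
        fi
    } 3< "$1" 4< "$2"
}

# Example: two sequences, two names.
workdir=$(mktemp -d)
printf '>old1\nACGT\n>old2\nGGCC\n' > "$workdir/seqs.fa"
printf 'new1\nnew2\n' > "$workdir/names.header"
printf 'new1\n' > "$workdir/short.header"
rename_lockstep "$workdir/seqs.fa" "$workdir/names.header" > "$workdir/renamed.fa"
```

Calling rename_lockstep with short.header instead returns a non-zero status, which mirrors the tool's behavior of terminating on a count mismatch rather than writing partially renamed output silently.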
Header Replacement Logic
- processReadPair() Method: A single method holds the core replacement logic for both direct replacement and prefix mode
- Direct Replacement (prefix=false, the default): Assigns r1.id=h1.id and r2.id=h2.id, so the old header is discarded
- Prefix Mode (prefix=true, set with addprefix=t): Concatenates r1.id=h1.id+" "+r1.id, placing a single space between the new and original headers
- Paired-end Consistency: Processes r1/r2 together with h1/h2, so mates stay paired
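The two modes reduce to a small piece of string logic, sketched below as a shell function. The function name and argument order are hypothetical; only the resulting header text follows the description above.

```shell
# Hypothetical helper mirroring the two renaming modes.
# $1 = prefix flag (t or f), $2 = new name, $3 = existing name.
replace_header() {
    if [ "$1" = "t" ]; then
        printf '%s %s\n' "$2" "$3"   # prefix mode: new name, space, old name
    else
        printf '%s\n' "$2"           # direct replacement: old name is dropped
    fi
}

replace_header f newname originalname   # prints: newname
replace_header t newname originalname   # prints: newname originalname
```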
File Format Support
- FileFormat Detection: Uses FileFormat.testInput() with FASTQ and HEADER format constants for automatic format recognition
- Header File Processing: The FileFormat.HEADER type treats files with the .header extension as plain text, parsed as one name per line
- Stream Integration: FastaReadInputStream with ConcurrentReadInputStream wrapper provides unified sequence parsing across FASTA/FASTQ formats
- Quality Score Processing: Parser.processQuality() handles qin/qout ASCII offset conversion for FASTQ quality preservation
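As a simplified picture of extension-based detection, the sketch below maps a file name to a format label. It is an assumption-laden illustration: the function name, the set of extensions, and the handling of .gz are chosen for the example and do not reproduce the FileFormat logic, which may also consider file contents.

```shell
# Hypothetical, simplified extension check; not the BBTools FileFormat logic.
guess_format() {
    name=${1%.gz}                        # ignore a trailing .gz
    case "$name" in
        *.header)            echo header ;;
        *.fq|*.fastq)        echo fastq ;;
        *.fa|*.fasta|*.fna)  echo fasta ;;
        *)                   echo unknown ;;
    esac
}

guess_format names.header   # prints: header
guess_format reads.fq.gz    # prints: fastq
```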
Performance Characteristics
- O(n) Processing: Single-pass through both files with processInner() method iterating once through all reads
- Streaming Memory Model: Uses ListNum batches with cris.nextList() and hcris.nextList() for constant memory footprint regardless of file size
- Concurrent I/O: ConcurrentReadOutputStream.getStream() with configurable buffer=4 enables overlapped read/write operations
- Compression Handling: ReadWrite.USE_PIGZ=true enables parallel gzip with ReadWrite.setZipThreads() for transparent compression support
Error Handling
- KillSwitch Integration: Uses KillSwitch.kill() for immediate termination on count mismatches or pairedness validation failures
- Tools.testInputFiles(): Pre-validates all input files exist and are readable before stream creation
- Tools.testOutputFiles(): Verifies output file write permissions with overwrite/append flag handling
- Tools.testForDuplicateFiles(): Prevents file conflicts by checking for duplicate file specifications across inputs/outputs
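The same three checks can be approximated in a wrapper script before the tool is launched. The sketch below is loosely modeled on the list above. The function name, argument order, and messages are hypothetical and do not reflect actual BBTools output.

```shell
# Hypothetical pre-run checks; not BBTools code.
# Usage: preflight <overwrite: t|f> <output> <input>...
preflight() {
    overwrite=$1; output=$2; shift 2
    # Reject any file that is named more than once across output and inputs.
    dupes=$(printf '%s\n' "$output" "$@" | sort | uniq -d)
    [ -z "$dupes" ] || { echo "file specified more than once: $dupes" >&2; return 1; }
    # Every input must exist and be readable.
    for input in "$@"; do
        [ -r "$input" ] || { echo "cannot read input: $input" >&2; return 1; }
    done
    # Refuse to clobber an existing output unless overwrite is on.
    if [ -e "$output" ] && [ "$overwrite" != "t" ]; then
        echo "output exists and overwrite is off: $output" >&2
        return 1
    fi
}

# Example: two readable inputs and a new output file.
workdir=$(mktemp -d)
touch "$workdir/reads.fq" "$workdir/names.fq"
preflight f "$workdir/out.fq" "$workdir/reads.fq" "$workdir/names.fq" && echo "ok to run"
```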
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org