Cat
Concatenates and recompresses files. This tool reads multiple input files sequentially and outputs everything to a single output file, allowing for recompression while avoiding the use of stdio.
Basic Usage
cat.sh *.fna out=catted.fa.gz
The cat tool accepts multiple input files (either specified with the in= parameter or as bare filenames) and concatenates them into a single output file. It can handle compressed files and recompress the output as needed.
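The behavior described above can be approximated with the Java standard library alone. The following is a minimal sketch, not the tool's actual code: the class name ConcatSketch and its methods are hypothetical, and it assumes compression is recognized only by a .gz suffix, which is narrower than what the real tool supports.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in for the concatenate-and-recompress idea.
class ConcatSketch {

    // Open a file for reading, decompressing if the name ends in .gz (assumption).
    static InputStream open(Path p) throws IOException {
        InputStream raw = Files.newInputStream(p);
        return p.toString().endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    // Open a file for writing, compressing if the name ends in .gz (assumption).
    static OutputStream create(Path p) throws IOException {
        OutputStream raw = Files.newOutputStream(p);
        return p.toString().endsWith(".gz") ? new GZIPOutputStream(raw) : raw;
    }

    // Copy every input, in order, into one output.
    // Returns the number of uncompressed bytes written.
    static long concat(List<Path> inputs, Path output) {
        long total = 0;
        byte[] buf = new byte[64 * 1024];
        try (OutputStream out = create(output)) {
            for (Path in : inputs) {
                try (InputStream is = open(in)) {
                    for (int n; (n = is.read(buf)) != -1; ) {
                        out.write(buf, 0, n);
                        total += n;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Self-check: merge one plain and one gzipped input, then read the result back.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("catsketch");
            Path a = dir.resolve("a.fa");
            Path b = dir.resolve("b.fa.gz");
            Path out = dir.resolve("out.fa.gz");
            Files.write(a, ">a\nACGT\n".getBytes(StandardCharsets.UTF_8));
            try (OutputStream os = create(b)) {
                os.write(">b\nTTGA\n".getBytes(StandardCharsets.UTF_8));
            }
            concat(List.of(a, b), out);
            try (InputStream is = open(out)) {
                return new String(is.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling ConcatSketch.demo() writes a plain file and a gzipped file, merges them into a gzipped output, and returns the decompressed result, so mixed inputs end up as one uniformly compressed stream.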
Parameters
Parameters are organized by their function in the concatenation process. The tool currently has a minimal set of parameters focused on file input/output and compression control.
Standard parameters
- in=<file>
- Comma-delimited input files. Multiple files can be specified by separating with commas. Filenames with no 'in=' prefix will also be treated as input files.
- out=<file>
- Output destination. Defaults to stdout if not specified. The output format will be automatically determined from the file extension.
- ziplevel=2
- (zl) Set compression level from 1 (lowest/fastest) through 9 (maximum/slowest). Lower compression levels process faster but produce larger files. Default is 2 for a good balance of speed and compression.
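To make the parameter rules concrete, here is a rough sketch of how arguments of this shape could be parsed. It is an illustration only: CatArgsSketch is a hypothetical class, the error handling is an assumption, and the real tool may accept additional parameters that this sketch would reject.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the three standard parameters described above.
class CatArgsSketch {
    final List<String> in = new ArrayList<>();
    String out = null;   // null stands for stdout
    int ziplevel = 2;    // documented default

    static CatArgsSketch parse(String... args) {
        CatArgsSketch a = new CatArgsSketch();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq < 0) {
                a.in.add(arg);                           // bare filename is an input
                continue;
            }
            String key = arg.substring(0, eq).toLowerCase();
            String val = arg.substring(eq + 1);
            if (key.equals("in")) {
                for (String f : val.split(",")) {        // comma-delimited inputs
                    if (!f.isEmpty()) a.in.add(f);
                }
            } else if (key.equals("out")) {
                a.out = val;
            } else if (key.equals("ziplevel") || key.equals("zl")) {
                int z = Integer.parseInt(val);
                if (z < 1 || z > 9) {
                    throw new IllegalArgumentException("ziplevel must be 1-9: " + z);
                }
                a.ziplevel = z;
            } else {
                // Assumption: unknown keys are rejected rather than ignored.
                throw new IllegalArgumentException("Unknown parameter: " + arg);
            }
        }
        return a;
    }
}
```

For example, CatArgsSketch.parse("x.fa", "in=y.fa,z.fa", "out=m.fa.gz", "zl=6") yields three inputs in command-line order, the output m.fa.gz, and compression level 6.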
Java Parameters
- -Xmx
- Sets Java's memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory. Default for this tool is 200m.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
- -da
- Disable Java assertions. May provide a small performance improvement in production use.
Examples
Basic Concatenation
cat.sh file1.fasta file2.fasta file3.fasta out=combined.fasta
Concatenates three FASTA files into a single output file.
Concatenation with Compression
cat.sh *.fastq out=all_reads.fastq.gz ziplevel=6
Concatenates all FASTQ files in the current directory into a compressed output file using compression level 6.
Using Comma-Delimited Input
cat.sh in=sample1.fa,sample2.fa,sample3.fa out=merged.fa.gz
Specifies input files using the in= parameter with comma separation.
Output to stdout
cat.sh file1.fq file2.fq | other_tool.sh
Concatenates files and pipes output to another tool. When no output file is specified, data goes to stdout.
Algorithm Details
The concatenation tool uses a straightforward sequential processing approach:
Processing Strategy
- Sequential File Processing: Input files are processed one at a time, in the order specified, by the processInner() method
- Batch Line Reading: Each file is read using ByteFile.nextList() which returns ListNum<byte[]> batches for memory-efficient processing
- ByteStreamWriter Output: Uses ByteStreamWriter.makeBSW() factory method with configurable buffering for output writing
- Format Preservation: FileFormat.testInput() detects input format, FileFormat.testOutput() handles output format conversion
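The sequential, batch-oriented strategy listed above can be sketched with standard-library classes. This does not use ByteFile, ListNum, or ByteStreamWriter; BatchReaderSketch and its nextBatch() method are hypothetical stand-ins that only mimic the idea of pulling a bounded list of lines at a time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical batch reader; a stand-in for the idea behind ByteFile.nextList().
class BatchReaderSketch {
    private final BufferedReader reader;
    private final int batchSize;

    BatchReaderSketch(Reader r, int batchSize) {
        this.reader = new BufferedReader(r);
        this.batchSize = batchSize;
    }

    // Returns up to batchSize lines, or null once the input is exhausted.
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>(batchSize);
        try {
            String line;
            while (batch.size() < batchSize && (line = reader.readLine()) != null) {
                batch.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return batch.isEmpty() ? null : batch;
    }

    // Process sources one after another, appending every line to out.
    // Returns the number of batches handled across all sources.
    static int catAll(List<String> sources, int batchSize, StringBuilder out) {
        int batches = 0;
        for (String src : sources) {
            BatchReaderSketch br = new BatchReaderSketch(new StringReader(src), batchSize);
            for (List<String> b = br.nextBatch(); b != null; b = br.nextBatch()) {
                for (String line : b) {
                    out.append(line).append('\n');
                }
                batches++;
            }
        }
        return batches;
    }
}
```

In-memory strings stand in for files here so the sketch stays self-contained; only one batch per source is held at a time, which is the property the list above describes.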
Memory Usage
The tool is designed for minimal memory footprint:
- Default memory allocation is 200MB (-Xmx200m and -Xms200m, as configured in the shell script)
- ByteFile.nextList() reads files in ListNum batches, avoiding full file loading into memory
- Memory usage scales with batch processing size rather than input file sizes
- Variables linesProcessed and bytesProcessed track progress without storing file contents
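As an illustration of tracking progress without retaining data, the sketch below streams input through a single reusable buffer and updates two counters. The field names mirror linesProcessed and bytesProcessed from the text, but the class ProgressCounterSketch and its logic are assumptions, not the tool's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical counter: memory use is fixed by the buffer size, not the input size.
class ProgressCounterSketch {
    long linesProcessed = 0;
    long bytesProcessed = 0;

    // Stream through the input with one reusable buffer; only the counters grow.
    // Lines are counted by newline bytes, so an unterminated final line is not counted.
    void consume(InputStream in) {
        byte[] buf = new byte[8192];
        try {
            for (int n; (n = in.read(buf)) != -1; ) {
                bytesProcessed += n;
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') linesProcessed++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: count several in-memory inputs in sequence.
    static ProgressCounterSketch count(byte[]... inputs) {
        ProgressCounterSketch c = new ProgressCounterSketch();
        for (byte[] b : inputs) {
            c.consume(new ByteArrayInputStream(b));
        }
        return c;
    }
}
```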
Compression Handling
Automatic format detection and conversion:
- FileFormat.testInput() automatically detects compressed input files (.gz, .bz2, etc.)
- Can decompress input and recompress output with different settings via FileFormat methods
- Compression level affects output size vs. processing time trade-off
- Reads and writes compressed data through ByteFile and ByteStreamWriter rather than stdio pipes, for better performance
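Decompressing and recompressing at a different level can be shown with java.util.zip. This is a sketch under stated assumptions: it covers gzip only, RecompressSketch is a hypothetical class, and the level is set through the protected Deflater field of GZIPOutputStream because that class has no level argument. How the tool itself applies ziplevel is not shown in this document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper showing gzip recompression at a chosen level (1-9).
class RecompressSketch {

    // GZIPOutputStream exposes its Deflater as the protected field "def",
    // so a subclass can change the level before any data is written.
    static final class LevelGzip extends GZIPOutputStream {
        LevelGzip(OutputStream out, int level) throws IOException {
            super(out);
            def.setLevel(level);
        }
    }

    // Compress raw bytes at the given level.
    static byte[] gzip(byte[] data, int level) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (LevelGzip gz = new LevelGzip(bos, level)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Decompress gzip bytes back to the original data.
    static byte[] gunzip(byte[] gz) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress, then compress again with a different level.
    static byte[] recompress(byte[] gz, int level) {
        return gzip(gunzip(gz), level);
    }
}
```

Comparing the lengths of gzip(data, 1) and gzip(data, 9) on your own data is a quick way to see the size versus time trade-off; the exact ratio depends on the input.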
Performance Characteristics
- Speed: Limited primarily by disk I/O rather than CPU, using ByteFile buffering
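- Scalability: Handles arbitrary numbers of input files, which are stored in the ArrayList<String> named in
- Reliability: assert(!out1.equalsIgnoreCase(s)) checks that the output filename does not match any input filename. Because this is a Java assertion, the check is skipped when assertions are disabled with -da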
- Progress Tracking: Variables linesProcessed and bytesProcessed track progress via Tools.timeLinesBytesProcessed()
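The output-versus-input check mentioned above can be sketched as follows. OutputCheckSketch is a hypothetical class; unlike the assertion quoted in the list, this version throws an exception explicitly, a design choice made here so that the check still runs when assertions are disabled.

```java
import java.util.List;

// Hypothetical guard against writing the output over one of the inputs.
class OutputCheckSketch {

    // Returns true if the output name equals any input name, ignoring case.
    static boolean collides(List<String> in, String out) {
        if (out == null) return false;   // null stands for stdout, which cannot collide
        for (String s : in) {
            if (out.equalsIgnoreCase(s)) return true;
        }
        return false;
    }

    // Fails fast before any data is read or written.
    static void validate(List<String> in, String out) {
        if (collides(in, out)) {
            throw new IllegalArgumentException("Output file is also an input file: " + out);
        }
    }
}
```

Note that a name comparison like this does not catch two different paths that point to the same file, such as a relative and an absolute path.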
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org