Cat

Script: cat.sh
Package: fileIO
Class: Concatenate.java

Concatenates and recompresses files. This tool reads multiple input files sequentially and writes their contents to a single output file, which allows recompression without the use of stdio.

Basic Usage

cat.sh *.fna out=catted.fa.gz

The cat tool accepts multiple input files (either specified with the in= parameter or as bare filenames) and concatenates them into a single output file. It can handle compressed files and recompress the output as needed.

Parameters

Parameters are organized by their function in the concatenation process. The tool currently has a minimal set of parameters focused on file input/output and compression control.

Standard parameters

in=<file>: Comma-delimited input files. Multiple files can be specified by separating with commas. Filenames with no 'in=' prefix will also be treated as input files.
out=<file>: Output destination. Defaults to stdout if not specified. The output format will be automatically determined from the file extension.
ziplevel=2: (zl) Compression level, from 1 (fastest, largest output) through 9 (slowest, smallest output). The default of 2 gives a good balance of speed and compression.
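
The speed versus size trade-off behind ziplevel can be explored outside the tool. The following Python sketch uses the standard gzip module, not the tool's own compression code, and assumes the levels behave like gzip's levels 1 through 9. The sample data and the helper compressed_size() are hypothetical.

```python
import gzip

# Hypothetical sample data: repetitive FASTA-like text, which compresses well.
data = b">seq\n" + b"ACGTTGCA" * 50_000 + b"\n"


def compressed_size(payload: bytes, level: int) -> int:
    """Return the size in bytes of payload after gzip compression at level 1-9."""
    return len(gzip.compress(payload, compresslevel=level))


# Compare a few levels; higher levels spend more CPU time searching for matches.
sizes = {level: compressed_size(data, level) for level in (1, 2, 6, 9)}
for level, size in sizes.items():
    print(f"ziplevel={level}: {size} bytes")
```

Actual sizes and timings depend on the data, so it is worth measuring on a representative file before choosing a level.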

Java Parameters

-Xmx: Sets Java's memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory. Default for this tool is 200m.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later.
-da: Disable Java assertions. May provide a small performance improvement in production use.

Examples

Basic Concatenation

cat.sh file1.fasta file2.fasta file3.fasta out=combined.fasta

Concatenates three FASTA files into a single output file.

Concatenation with Compression

cat.sh *.fastq out=all_reads.fastq.gz ziplevel=6

Concatenates all FASTQ files in the current directory into a compressed output file using compression level 6.

Using Comma-Delimited Input

cat.sh in=sample1.fa,sample2.fa,sample3.fa out=merged.fa.gz

Specifies input files using the in= parameter with comma separation.
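
The two ways of naming inputs (in= with commas, or bare filenames) can be sketched as a small argument parser. This is an illustrative Python sketch, not the parser in Concatenate.java. The function parse_args() is hypothetical, and treating every unrecognized argument as an input filename is a simplifying assumption.

```python
def parse_args(args):
    """Split cat.sh-style arguments into (inputs, output, ziplevel)."""
    inputs = []
    output = None   # None stands for stdout
    ziplevel = 2    # documented default
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep and key == "in":
            # Comma-delimited list; skip empty entries such as a trailing comma.
            inputs.extend(name for name in value.split(",") if name)
        elif sep and key == "out":
            output = value
        elif sep and key in ("ziplevel", "zl"):
            ziplevel = int(value)
        else:
            # Simplification: anything else is taken as a bare input filename.
            inputs.append(arg)
    return inputs, output, ziplevel


print(parse_args(["in=sample1.fa,sample2.fa", "sample3.fa", "out=merged.fa.gz"]))
```

Both styles can be mixed in this sketch, and inputs keep the order in which they appear on the command line.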

Output to stdout

cat.sh file1.fq file2.fq | other_tool.sh

Concatenates files and pipes output to another tool. When no output file is specified, data goes to stdout.

Algorithm Details

The concatenation tool uses a straightforward sequential processing approach:

Processing Strategy

Sequential File Processing: Input files are processed one at a time, in the order specified, by the processInner() method.
Batch Line Reading: Each file is read with ByteFile.nextList(), which returns batches of lines as ListNum<byte[]> for memory-efficient processing.
ByteStreamWriter Output: Output is written through a ByteStreamWriter created by the ByteStreamWriter.makeBSW() factory method, with configurable buffering.
Format Preservation: FileFormat.testInput() detects the input format, and FileFormat.testOutput() handles output format conversion.
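
The strategy above can be approximated in a few lines. This Python sketch is an analogy, not a port of the Java code: read_batches() stands in for ByteFile.nextList(), the output stream stands in for ByteStreamWriter, and the batch size of 200 lines is an arbitrary illustrative value rather than the tool's actual setting.

```python
import io
from itertools import islice


def read_batches(stream, batch_size=200):
    """Yield lists of up to batch_size lines (bytes) from a binary stream."""
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


def concatenate(inputs, output, batch_size=200):
    """Copy each input stream to output, in order, one batch at a time.

    Returns (lines_processed, bytes_processed); only the current batch
    is held in memory, never a whole file.
    """
    lines_processed = 0
    bytes_processed = 0
    for stream in inputs:
        for batch in read_batches(stream, batch_size):
            for line in batch:
                output.write(line)
                lines_processed += 1
                bytes_processed += len(line)
    return lines_processed, bytes_processed


# In-memory streams stand in for real files.
first = io.BytesIO(b">a\nACGT\n")
second = io.BytesIO(b">b\nTTGA\n")
out = io.BytesIO()
counts = concatenate([first, second], out, batch_size=1)
print(counts, out.getvalue())
```

Because only one batch is alive at a time, memory use in this sketch depends on batch_size and line length, not on the total size of the inputs.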

Memory Usage

The tool is designed for minimal memory footprint:

The default memory allocation is 200 MB (-Xmx200m and -Xms200m, as configured in the shell script).
ByteFile.nextList() reads files in ListNum batches, so no file is loaded into memory in full.
Memory usage scales with the batch size rather than with input file sizes.
The variables linesProcessed and bytesProcessed track progress without storing file contents.

Compression Handling

Automatic format detection and conversion:

FileFormat.testInput() automatically detects compressed input files (.gz, .bz2, etc.).
Input can be decompressed and the output recompressed with different settings via FileFormat methods.
The compression level trades output size against processing time.
Compressed data is handled through ByteFile and ByteStreamWriter rather than stdio pipes, for better performance.
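
As a rough illustration of decompressing inputs and recompressing the output in one pass, the Python sketch below picks a codec from the file extension. The extension rule and the helpers open_by_extension() and recompress() are assumptions made for this example; they do not reproduce how FileFormat makes its decisions.

```python
import bz2
import gzip
import os
import tempfile


def open_by_extension(path, mode, level=2):
    """Open path in binary mode, choosing a codec from its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, compresslevel=level)
    if path.endswith(".bz2"):
        return bz2.open(path, mode, compresslevel=level)
    return open(path, mode)


def recompress(in_paths, out_path, level=2):
    """Concatenate inputs (any supported codec) into one output file."""
    with open_by_extension(out_path, "wb", level) as out:
        for path in in_paths:
            with open_by_extension(path, "rb") as src:
                for line in src:
                    out.write(line)


# Demo with temporary files: one plain input, one bzip2 input, gzip output.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "a.fa")
    zipped = os.path.join(tmp, "b.fa.bz2")
    merged = os.path.join(tmp, "merged.fa.gz")
    with open(plain, "wb") as handle:
        handle.write(b">a\nACGT\n")
    with bz2.open(zipped, "wb") as handle:
        handle.write(b">b\nTTGA\n")
    recompress([plain, zipped], merged, level=6)
    with gzip.open(merged, "rb") as handle:
        result = handle.read()

print(result)
```

Everything stays inside one process here, with no external compressor and no shell pipe between steps.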

Performance Characteristics

Speed: Limited primarily by disk I/O rather than CPU, helped by ByteFile buffering.
Scalability: Handles an arbitrary number of input files, which are stored in the ArrayList<String> named in.
Reliability: assert(!out1.equalsIgnoreCase(s)) checks that the output filename does not match any input filename. Because this is a Java assertion, the check does not run when assertions are disabled (for example with -da).
Progress Tracking: The variables linesProcessed and bytesProcessed are reported via Tools.timeLinesBytesProcessed().
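
The output filename check can be mirrored in a wrapper script so that it applies regardless of the Java assertion settings. This is a hypothetical Python sketch. check_output_not_input() is not part of the tool, and comparing lowercased names only approximates Java's equalsIgnoreCase().

```python
def check_output_not_input(inputs, output):
    """Raise ValueError if output names one of the inputs, ignoring case."""
    if output is None:
        return  # writing to stdout cannot overwrite an input file
    for name in inputs:
        if name.lower() == output.lower():
            raise ValueError(f"Output file {output!r} is also an input file")


check_output_not_input(["file1.fa", "file2.fa"], "combined.fa")  # passes silently

try:
    check_output_not_input(["file1.fa", "Combined.FA"], "combined.fa")
except ValueError as error:
    print(error)
```

Like the original assertion, this compares the names as given and does not resolve paths, so two different spellings of the same file would not be caught.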

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org