MergeSam
Concatenates SAM files, keeping the header from the first file and the alignment records from all files. Header lines found in later files are filtered out rather than merged.
Basic Usage
mergesam.sh <files> out=<file>
Input files can be specified as positional arguments or using the in= parameter. If no output file is specified, results are written to stdout.
Parameters
MergeSam accepts standard BBTools parameters for input/output handling and processing control.
Core Parameters
- in=<file>
- Input SAM file(s). Multiple files can be given as a comma-separated list or as positional arguments. A positional argument that names an existing file is treated as input.
- out=stdout.sam
- Output file for merged SAM data. Default is stdout. Use 'null' to disable output.
- invalid=<file>
- Optional output file for invalid lines (headers found after the first file). Lines that don't pass validation are written here instead of the main output.
- lines=<long>
- Maximum number of lines to process. Set to -1 or omit the parameter to process all lines (the default).
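The input rules above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the BBTools argument parser: the function name `parse_args`, the returned dictionary, and the choice to raise on unrecognized arguments are all hypothetical. Only the behaviors described above are modeled (key=value pairs, a comma-separated in= list, and existing positional paths treated as input).

```python
import os


def parse_args(args, exists=os.path.exists):
    """Split mergesam-style arguments into inputs and options (hypothetical helper)."""
    parsed = {"in": [], "out": "stdout.sam", "invalid": None, "lines": -1}
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep and key == "in":
            # in= accepts a comma-separated list of files
            parsed["in"].extend(v for v in value.split(",") if v)
        elif sep and key in ("out", "invalid"):
            parsed[key] = value
        elif sep and key == "lines":
            parsed["lines"] = int(value)
        elif not sep and exists(arg):
            # a bare argument that names an existing file is treated as input
            parsed["in"].append(arg)
        else:
            # assumption: anything else is rejected; the real tool may differ
            raise ValueError(f"Unknown parameter: {arg}")
    return parsed
```

The `exists` argument is injectable only so the sketch can be exercised without touching the file system.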
Processing Parameters
- verbose=f
- Enable verbose output showing detailed processing information. Affects multiple internal components including file readers and writers.
- overwrite=t
- Allow overwriting of existing output files. Default is true.
- append=f
- Append to existing output files instead of overwriting. Default is false.
Java Parameters
- -da
- Disable Java assertions for improved performance in production environments.
Examples
Basic SAM File Merging
mergesam.sh file1.sam file2.sam file3.sam out=merged.sam
Merges three SAM files into a single output file, keeping only the header from file1.sam.
Using Input Parameter
mergesam.sh in=file1.sam,file2.sam,file3.sam out=merged.sam
Alternative syntax using the in= parameter to specify multiple input files.
Writing to Standard Output
mergesam.sh *.sam > merged.sam
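Merges all SAM files in the current directory and redirects stdout to a file. Because the shell expands *.sam before the command runs, a merged.sam left over from an earlier run would also be matched as an input; write the output to a different directory or use a different extension to avoid this.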
Handling Invalid Headers
mergesam.sh file1.sam file2.sam out=merged.sam invalid=rejected_headers.sam
Merges files while saving any invalid header lines (headers from files after the first) to a separate file.
Limited Processing
mergesam.sh in=large_file.sam out=sample.sam lines=1000
Processes only the first 1000 lines of the input file, useful for testing or sampling.
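A line limit of this kind can be sketched in Python with `itertools.islice`. The helper name `limit_lines` is hypothetical, and the sketch assumes the limit counts every line read, header lines included; the text above does not say how the real tool counts.

```python
from itertools import islice


def limit_lines(lines, max_lines=-1):
    """Return an iterator over at most max_lines items; negative means unlimited."""
    if max_lines < 0:
        return iter(lines)
    return islice(lines, max_lines)
```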
Algorithm Details
MergeSam concatenates SAM files in a single streaming pass and resolves header conflicts by keeping only the first header.
Header Processing Strategy
The tool uses a header mode flag that starts as true for the first file. When processing:
- First File: All header lines (starting with @) are considered valid and written to output
- Subsequent Files: Header lines are marked invalid and optionally written to the invalid output stream
- Mode Transition: Header mode switches to false when the first non-header line is encountered
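The header mode rule above can be sketched as a small Python generator. This is a minimal sketch of the rule as described, not the tool's Java code; the name `classify_lines` and the input shape (an iterable of per-file line iterables) are assumptions made for illustration.

```python
def classify_lines(files):
    """Yield (line, is_valid) pairs for the lines of several SAM files, in order.

    `files` is an iterable of iterables of lines (hypothetical input shape).
    """
    header_mode = True  # starts true for the first file
    for lines in files:
        for line in lines:
            if line.startswith("@"):
                # header lines are valid only while header mode is still on
                yield line, header_mode
            else:
                # the first alignment record switches header mode off for good
                header_mode = False
                yield line, True
```

Under the rule as written, a first file with no alignment records would leave header mode on for the next file; whether MergeSam behaves that way is not confirmed here.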
Line-by-Line Processing
The algorithm processes each input file sequentially using ByteFile readers:
- Stream Processing: Files are read line-by-line to minimize memory usage
- Validation Logic: Lines starting with @ are valid only during header mode
- Output Routing: Valid lines go to main output, invalid lines to separate stream if specified
- Statistics Tracking: Counts processed lines, valid lines, and bytes for reporting
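The routing and counting steps listed above can be sketched in Python as follows. The function `merge_streams` and the counter names are hypothetical, and plain Python file objects stand in for the ByteFile readers and ByteStreamWriter; the sketch only illustrates one-line-at-a-time routing to a main output and an optional invalid stream.

```python
import io


def merge_streams(inputs, out, invalid=None):
    """Copy valid lines to `out` and invalid header lines to `invalid`, if given.

    Returns a dict of counters (names are hypothetical).
    """
    stats = {"lines": 0, "valid_lines": 0, "bytes": 0}
    header_mode = True
    for stream in inputs:
        for line in stream:  # one line at a time, never the whole file
            is_header = line.startswith("@")
            valid = header_mode if is_header else True
            if not is_header:
                header_mode = False
            stats["lines"] += 1
            stats["bytes"] += len(line.encode())
            if valid:
                stats["valid_lines"] += 1
                out.write(line)
            elif invalid is not None:
                invalid.write(line)
    return stats


# Small in-memory demonstration with two tiny "files".
out, rejected = io.StringIO(), io.StringIO()
stats = merge_streams(
    [io.StringIO("@HD\tVN:1.6\nr1\n"), io.StringIO("@HD\tVN:1.6\nr2\n")],
    out,
    rejected,
)
```

In the demonstration, the second file's @HD line is the only line routed to `rejected`; if `invalid` is left as None, that line is simply dropped.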
Memory Efficiency
The implementation is designed for large-scale SAM file processing:
- Streaming Architecture: Uses ByteStreamWriter for efficient output buffering
- Minimal Memory Footprint: Processes one line at a time rather than loading entire files
- Configurable Limits: Optional line limit prevents runaway processing
- Automatic Compression: Supports pigz compression for input/output files
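Transparent handling of compressed files can be sketched in Python with the standard `gzip` module. This assumes compression is selected from a `.gz` file extension, which the text above does not state explicitly. The helper `open_text` is hypothetical and uses Python's gzip implementation rather than pigz.

```python
import gzip


def open_text(path, mode="r"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    if path.endswith(".gz"):
        # "rt" / "wt" give text-mode access to the compressed stream
        return gzip.open(path, mode + "t")
    return open(path, mode)
```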
Error Handling
Errors are checked for before and during processing:
- File Validation: Checks output file writability before processing
- Stream Management: Properly closes all readers and writers
- Error State Tracking: Monitors for errors throughout processing
- Safe Shutdown: Uses poison-and-wait pattern for clean thread termination
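The up-front output check can be sketched in Python. This is a hedged illustration of the idea, not the BBTools implementation: `check_output` is a hypothetical helper, and treating `null`, `stdout`, and names beginning with `stdout.` as having nothing on disk to check is an assumption based on the out= description earlier in this document.

```python
import os


def check_output(path, overwrite=True, append=False):
    """Raise before any processing starts if `path` cannot be used as output."""
    if path in (None, "null", "stdout") or path.startswith("stdout."):
        return  # assumption: these are not regular files, so nothing to check
    if os.path.exists(path):
        if not (overwrite or append):
            raise FileExistsError(f"{path} exists and overwrite is disabled")
        if not os.access(path, os.W_OK):
            raise PermissionError(f"{path} is not writable")
    else:
        parent = os.path.dirname(os.path.abspath(path))
        if not os.access(parent, os.W_OK):
            raise PermissionError(f"cannot create files in {parent}")
```

Failing before the first line is read avoids spending time on a large merge whose result could not be written.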
Performance Characteristics
- Time Complexity: O(n), where n is the total number of lines (more precisely, total bytes) across all input files
- Space Complexity: O(1) with respect to file size; memory use is bounded by the I/O buffers and the longest line
- Scalability: Can handle arbitrarily large files limited only by disk space
- Threading: Uses configurable thread count for compression operations
Technical Notes
SAM Format Considerations
When merging SAM files, be aware of potential compatibility issues:
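- Reference Sequences: All files should use the same reference genome, because @SQ lines from later files are discarded
- Read Groups: @RG lines from later files are discarded, so records from those files may carry read group IDs that the merged header does not declare
- Program Headers: Only the first file's @PG lines are preserved, so program information recorded in later files is lost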
- Coordinate Systems: Ensure consistent coordinate systems across input files
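The considerations above can be checked before merging with a short Python sketch that compares two SAM headers. The function names are hypothetical. The sketch relies only on the @SQ tags SN and LN and the @RG tag ID, which the SAM specification requires on those lines. It ignores the order of @SQ lines, which may also matter to downstream tools.

```python
def _fields(line):
    """Parse the TAG:VALUE fields of one header line into a dict."""
    return dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])


def reference_table(header_lines):
    """Map sequence name -> length from the @SQ lines of a SAM header."""
    table = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            fields = _fields(line)
            table[fields["SN"]] = int(fields["LN"])
    return table


def read_groups(header_lines):
    """Return the set of read group IDs declared by @RG lines."""
    return {_fields(l)["ID"] for l in header_lines if l.startswith("@RG")}


def header_conflicts(first, other):
    """Describe what is mismatched or lost if only the `first` header is kept."""
    problems = []
    if reference_table(first) != reference_table(other):
        problems.append("reference sequences differ")
    missing = read_groups(other) - read_groups(first)
    if missing:
        problems.append(
            "read groups missing from first header: " + ", ".join(sorted(missing))
        )
    return problems
```

An empty result means the second header adds nothing that the first header lacks, at least for reference sequences and read groups.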
Best Practices
- File Ordering: Place the file with the most comprehensive header first
- Validation: Use the invalid= parameter to check for header conflicts
- Testing: Use lines= parameter to test merge logic on file subsets
- Disk Space: Ensure adequate space for output files before processing large datasets
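The disk space check can be sketched in Python with `shutil.disk_usage`. The helper `enough_space` is hypothetical. It assumes the merged output is no larger than the combined inputs, which holds for uncompressed concatenation (dropped headers only make it smaller) but not necessarily when inputs are compressed and the output is not.

```python
import os
import shutil


def enough_space(inputs, out_dir, free_bytes=None):
    """Return True if free space in out_dir covers the combined input size."""
    needed = sum(os.path.getsize(p) for p in inputs)
    if free_bytes is None:
        # query the file system that holds the output directory
        free_bytes = shutil.disk_usage(out_dir).free
    return free_bytes >= needed
```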
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org