Unzip
Compresses or decompresses files based on their extensions. This tool exists only because the syntax and default behavior of many compression utilities are unintuitive; it is just a wrapper and relies on existing command-line executables (pigz, lbzip2, etc.). Does not delete the input file. Does not untar files.
Basic Usage
unzip.sh in=<file> out=<file>
The unzip tool automatically detects compression based on file extensions and applies the appropriate compression or decompression operation. It preserves the input file and creates a new output file.
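The extension-based choice described above can be sketched as follows. This is a hypothetical illustration, not the BBTools code: the class ExtensionSketch and its methods formatOf() and plan() are invented for this example, and only the .gz and .bz2 extensions mentioned in this document are handled.

```java
import java.util.Locale;

// Hypothetical illustration; not the BBTools implementation.
class ExtensionSketch {

    /** Returns the compression format implied by a file name's extension. */
    static String formatOf(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gz")) { return "gzip"; }
        if (lower.endsWith(".bz2")) { return "bzip2"; }
        return "none"; // anything else is treated as uncompressed here
    }

    /** Describes the conversion implied by an input/output pair. */
    static String plan(String in, String out) {
        return formatOf(in) + " -> " + formatOf(out);
    }

    public static void main(String[] args) {
        System.out.println(plan("data.fq.gz", "data.fq")); // decompression
        System.out.println(plan("data.fq", "data.fq.gz")); // compression
    }
}
```

In this sketch, an input ending in .gz paired with an uncompressed output name implies decompression, and the reverse pairing implies compression.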
Parameters
Parameters control input/output files, compression settings, and processing options.
File Parameters
- in=<file>
- Input file to compress or decompress. Required parameter.
- out=<file>
- Output file for processed data. If not specified, output goes to stdout.
- invalid=<file>
- Output file for invalid or problematic data during processing.
Compression Parameters
- zl=<number>
- Set the compression level; accepts values 0-9 or 11. Higher values provide better compression but take longer to process. Default varies by compression algorithm.
Processing Parameters
- lines=<number>
- Maximum number of lines to process. Use -1 for unlimited processing. Default: unlimited (Long.MAX_VALUE).
- verbose=<boolean>
- Enable verbose output for debugging and monitoring progress. Default: false.
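As a rough illustration of the lines parameter, the sketch below treats a negative value as "no limit" by mapping it to Long.MAX_VALUE, matching the documented default. The class LineLimitSketch and its methods are hypothetical and operate on an in-memory string; how the tool itself counts lines is not shown in this document.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not the BBTools implementation.
class LineLimitSketch {

    /** Maps a negative request (such as lines=-1) to "no limit". */
    static long effectiveLimit(long requested) {
        return requested < 0 ? Long.MAX_VALUE : requested;
    }

    /** Returns at most `requested` leading lines, each terminated by '\n'. */
    static String head(String text, long requested) {
        long limit = effectiveLimit(requested);
        StringBuilder kept = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            long count = 0;
            // Stop at the limit or at end of input, whichever comes first.
            while (count < limit && (line = reader.readLine()) != null) {
                kept.append(line).append('\n');
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept.toString();
    }
}
```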
Examples
Basic Decompression
unzip.sh in=data.fq.gz out=data.fq
Decompresses a gzipped FASTQ file to plain text format.
Compression with Level Control
unzip.sh in=data.fq out=data.fq.gz zl=9
Compresses a FASTQ file using compression level 9, the highest standard gzip level.
Verbose Processing
unzip.sh in=large_file.bz2 out=large_file verbose=t
Decompresses a bzip2 file with verbose output showing progress information.
Limited Line Processing
unzip.sh in=data.gz out=sample.txt lines=1000
Decompresses a gzipped file and processes only its first 1000 lines.
Algorithm Details
The unzip tool is a wrapper around standard compression utilities that addresses common usability issues with command-line compression tools. The implementation uses simple I/O streaming rather than implementing custom compression algorithms.
Processing Strategy
- Extension-based Detection: Uses Tools.fixExtension() to determine compression format from file extensions (.gz, .bz2, etc.)
- Buffered I/O: Uses a 256 KB buffer (65536×4 bytes), allocated in the processInner() method, for file streaming
- External Tool Integration: Calls ReadWrite.getInputStream() to leverage existing compression utilities (pigz, lbzip2)
- Parallel Processing: Enables ReadWrite.USE_PIGZ and ReadWrite.USE_UNPIGZ flags for multi-threaded gzip operations
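The fixed-buffer streaming described in the list above can be sketched as a plain read/write loop. This is a minimal sketch under stated assumptions, not the tool's source: the class StreamCopySketch and its methods are hypothetical, and java.util.zip stands in for the streams that ReadWrite.getInputStream() would supply (which, per this document, may delegate to pigz or lbzip2).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not the BBTools implementation.
class StreamCopySketch {

    /** Same size as the buffer described above: 65536 x 4 bytes = 256 KB. */
    static final int BUFFER_BYTES = 65536 * 4;

    /** Copies everything from in to out through one reusable buffer. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_BYTES];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) { // -1 signals end of stream
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    /** Compresses bytes in memory with the JDK's gzip stream. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        }
        return sink.toByteArray();
    }

    /** Decompresses gzip bytes by streaming them through copy(). */
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            copy(in, sink);
        }
        return sink.toByteArray();
    }

    /** Compresses and then decompresses text; the result should equal the input. */
    static String roundTrip(String text) {
        try {
            byte[] raw = text.getBytes(StandardCharsets.UTF_8);
            return new String(gunzip(gzip(raw)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the buffer is allocated once and reused for every read, the memory needed by the loop does not grow with the size of the input.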
Memory Management
The tool uses minimal memory (default -Xmx80m set in shell script) and processes files via InputStream.read() calls to a fixed buffer. The single buffer allocation ensures constant memory usage regardless of file size.
Error Handling
The tool validates input and output file accessibility using Tools.testInputFiles() and Tools.testOutputFiles() before processing. IOExceptions thrown during stream reads are caught in processInner(), which sets the errorState flag when an error occurs.
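The flag-based pattern can be sketched as below. The names ErrorStateSketch, drain(), and failingStream() are hypothetical and chosen for this example; only the idea of catching an IOException during reads and recording it in an errorState flag comes from the description above.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration; not the BBTools implementation.
class ErrorStateSketch {

    /** Set to true if any read fails; checked by the caller afterwards. */
    boolean errorState = false;

    /** Reads the stream to its end, recording failures instead of rethrowing. */
    long drain(InputStream in) {
        byte[] buffer = new byte[65536 * 4];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            errorState = true; // remember the failure and stop reading
        }
        return total;
    }

    /** A stream whose reads always fail, used to exercise the error path. */
    static InputStream failingStream() {
        return new InputStream() {
            @Override
            public int read() throws IOException {
                throw new IOException("simulated read failure");
            }
        };
    }
}
```

A caller in this sketch would inspect errorState after drain() returns and report failure, for example through a nonzero exit code.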
Compatibility
Built on BBTools infrastructure using FileFormat.testInput() and FileFormat.testOutput() for file handling, ByteStreamWriter for output, and PreParser for argument processing. Uses Tools.testForDuplicateFiles() to prevent file conflicts.
Technical Notes
Supported Formats
- Gzip: .gz files using pigz when available for parallel processing
- Bzip2: .bz2 files using lbzip2 when available
- Other formats: Based on available system utilities
Performance Characteristics
- Throughput: Limited by InputStream.read() buffer size and underlying compression utility performance
- Memory Usage: Java heap capped at 80 MB (shell script -Xmx80m setting); the single 256 KB buffer is allocated within that heap
- Threading: Uses ReadWrite.setZipThreads(Shared.threads()) to configure parallel compression threads
Integration with BBTools
Uses standard BBTools classes: FileFormat for file type detection, Tools utilities for validation, ReadWrite for compression handling, and Shared for thread configuration. Follows BBTools patterns for argument parsing via PreParser and error handling via errorState flags.
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org