ShrinkAccession
Shrinks accession2taxid tables by removing unneeded columns. This step is optional, but it makes accession2taxid files smaller and faster to load.
Basic Usage
shrinkaccession.sh in=<file> out=<outfile>
This tool processes accession2taxid files and removes unnecessary columns to reduce file size and improve loading performance. It's particularly useful for large taxonomic databases where storage space and loading time are concerns.
Parameters
Parameters control file handling, compression, and output format options for shrinking accession2taxid files.
File I/O Parameters
- in=<file>
- Input accession2taxid file. This is the primary input file containing accession numbers and their corresponding taxonomic IDs. The file can be compressed.
- out=<outfile>
- Output file for the processed accession2taxid data. It contains only the essential columns: accession, taxid, and optionally the GI number.
- ow=f
- (overwrite) Set to true to overwrite output files that already exist. The default is false.
- app=f
- (append) Set to true to append results to files that already exist instead of overwriting them. The default is false.
Compression Parameters
- zl=4
- (ziplevel) Set compression level, 1 (low) to 9 (max). Higher levels provide better compression at the cost of processing time.
- pigz=t
- Use pigz for compression, if available. Pigz is a parallel implementation of gzip; the tool calls ReadWrite.setZipThreads() so that compression can use multiple cores.
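To get a feel for the zl trade-off, the following minimal sketch compresses synthetic accession2taxid-style rows at three levels with Python's standard gzip module. It does not reproduce the tool's own code path (BBTools' compression framework or pigz); the rows are made up for illustration, and ratios on real files will differ.

```python
import gzip

def gzip_size(data: bytes, level: int) -> int:
    """Return the size in bytes of data after gzip compression at the given level."""
    return len(gzip.compress(data, compresslevel=level))

# Synthetic rows in the 4-column layout (invented values, for illustration only).
rows = [f"A{i:05d}\tA{i:05d}.1\t{32630 + i % 7}\t{i}\n" for i in range(20000)]
sample = "".join(rows).encode("ascii")

# Compare a low, a medium, and the maximum level.
sizes = {level: gzip_size(sample, level) for level in (1, 4, 9)}
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(sample):.1%} of original)")
```

Higher levels usually give smaller output but take longer, which is the trade-off the zl parameter exposes.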
Content Parameters
- gi=t
- Retain gi numbers. When set to true, the tool will preserve GI (GenInfo Identifier) numbers in the output. When false, GI numbers are discarded to further reduce file size.
Java Parameters
- -Xmx
- This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default for shrinkaccession is 80MB which should be sufficient for most files.
- -eoom
- This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines that need to handle memory issues gracefully.
- -da
- Disable assertions. Can provide a small performance boost in production environments by disabling Java assertion checking.
Examples
Basic File Shrinking
shrinkaccession.sh in=nucl_gb.accession2taxid out=nucl_gb_shrunk.accession2taxid
Shrinks a nucleotide GenBank accession2taxid file, removing unnecessary columns while retaining GI numbers.
Shrinking Without GI Numbers
shrinkaccession.sh in=prot.accession2taxid out=prot_minimal.accession2taxid gi=f
Creates a minimal accession2taxid file with only accession and taxid columns, discarding GI numbers for maximum space savings.
High Compression Processing
shrinkaccession.sh in=dead_nucl.accession2taxid.gz out=dead_nucl_shrunk.accession2taxid.gz zl=9 pigz=t
Processes a compressed input file with maximum compression settings, using parallel compression if available.
Pipeline Integration
shrinkaccession.sh in=wgs.accession2taxid out=wgs_processed.accession2taxid ow=t gi=f zl=6
Pipeline-friendly processing with overwrite permission, no GI numbers, and balanced compression for automated workflows.
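When the tool is driven from a script, the command line can be assembled programmatically. The sketch below is one possible way to do that in Python; build_shrink_command is a hypothetical helper, not part of BBTools, and it emits only the flags documented above.

```python
def build_shrink_command(in_path, out_path, gi=True, overwrite=False, ziplevel=None):
    """Return an argv list for shrinkaccession.sh using only documented flags."""
    def flag(value):
        # BBTools-style boolean: t for true, f for false.
        return "t" if value else "f"

    command = [
        "shrinkaccession.sh",
        f"in={in_path}",
        f"out={out_path}",
        f"ow={flag(overwrite)}",
        f"gi={flag(gi)}",
    ]
    if ziplevel is not None:
        if not 1 <= ziplevel <= 9:
            raise ValueError("ziplevel must be between 1 and 9")
        command.append(f"zl={ziplevel}")
    return command

command = build_shrink_command(
    "wgs.accession2taxid",
    "wgs_processed.accession2taxid",
    gi=False,
    overwrite=True,
    ziplevel=6,
)
print(" ".join(command))
# To execute it, pass the list to subprocess.run(command, check=True),
# assuming shrinkaccession.sh is on PATH.
```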
Algorithm Details
ShrinkAccession processes accession2taxid files by removing redundant columns and optionally discarding GI numbers while preserving essential taxonomic mapping information.
File Format Processing
The tool handles two main input formats:
- 4-column format: accession, accession.version, taxid, gi - reduces to accession, taxid, gi (optional)
- 2-column format: accession.version, taxid - converts to accession, taxid
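The two mappings above can be sketched as a small Python function. This is an illustration rather than the tool's implementation: shrink_line is a hypothetical name, the fields are assumed to be tab-separated (as in NCBI's files), and the 2-column case is assumed to derive the accession by cutting the versioned accession at its first dot.

```python
from typing import Optional

def shrink_line(line: str, keep_gi: bool = True) -> Optional[str]:
    """Reduce one accession2taxid row; return None for unrecognized layouts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) == 4:
        # accession, accession.version, taxid, gi -> drop the versioned column
        accession, _versioned, taxid, gi = fields
        out = [accession, taxid, gi] if keep_gi else [accession, taxid]
    elif len(fields) == 2:
        # accession.version, taxid -> strip the version suffix (assumption)
        versioned, taxid = fields
        out = [versioned.partition(".")[0], taxid]
    else:
        return None
    return "\t".join(out)

four_col = shrink_line("A00002\tA00002.1\t32630\t2")
four_col_no_gi = shrink_line("A00002\tA00002.1\t32630\t2", keep_gi=False)
two_col = shrink_line("A00002.1\t32630")
```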
Processing Strategy
The algorithm uses a streaming approach with these specific optimizations:
- Streaming Processing: Uses ByteFile.nextLine() to process files line by line without loading entire files into memory, enabling handling of very large accession2taxid files (multi-GB sizes)
- Column Parsing: Iterates directly over the byte array to extract only the required columns (accession and taxid, plus gi when gi=t), skipping the accession.version column in the 4-column format
- Invalid Line Handling: Uses AccessionToTaxid.parseLineToTaxid() to validate taxonomic IDs; lines with a taxid below 1 are discarded, and the number of discarded entries is reported
- Buffered Output: Accumulates processed lines in a ByteBuilder and writes them out once it exceeds roughly 8 KB (the bb.length()>8000 threshold), batching I/O operations
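The streaming, validation, and buffering steps above can be sketched together in Python. This is a simplified stand-in for the Java implementation, not a port of it: shrink_stream is a hypothetical name, only the 4-column layout is handled, rows are assumed to be tab-separated, and passing the header row through unvalidated is an assumption.

```python
import io

FLUSH_THRESHOLD = 8000  # bytes; mirrors the bb.length()>8000 check described above

def shrink_stream(src, dst, keep_gi=True):
    """Stream 4-column rows from src to dst; return (kept, discarded) counts."""
    pending = bytearray()
    kept = discarded = 0
    for raw in src:  # iterating a binary stream yields one line at a time
        fields = raw.rstrip(b"\r\n").split(b"\t")
        if len(fields) != 4:
            discarded += 1
            continue
        accession, _versioned, taxid, gi = fields
        is_header = accession == b"accession"
        if not is_header:
            # Discard rows whose taxid is not a positive integer.
            if not taxid.isdigit() or int(taxid) < 1:
                discarded += 1
                continue
            kept += 1
        out = [accession, taxid, gi] if keep_gi else [accession, taxid]
        pending += b"\t".join(out) + b"\n"
        if len(pending) > FLUSH_THRESHOLD:
            dst.write(bytes(pending))  # one write per batch, not per line
            pending.clear()
    if pending:
        dst.write(bytes(pending))  # flush whatever is left at end of input
    return kept, discarded

src = io.BytesIO(
    b"accession\taccession.version\ttaxid\tgi\n"
    b"A00002\tA00002.1\t32630\t2\n"
    b"X00001\tX00001.1\t0\t9\n"  # taxid < 1, so this row is discarded
)
dst = io.BytesIO()
counts = shrink_stream(src, dst)
```

Because only the current line and the pending batch are held at any time, memory use does not grow with the size of the input.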
GI Number Retention
When gi=true (default), the tool preserves GenInfo Identifier numbers from the input. GI numbers are validated as numeric values and "na" entries are handled appropriately. This is useful for maintaining compatibility with older NCBI tools that still reference GI numbers.
Compression Integration
The tool integrates with BBTools' compression framework:
- Automatic Detection: Uses FileFormat.testInput() to detect compressed input files and handles decompression via ByteFile
- Pigz Support: Sets ReadWrite.USE_PIGZ=true and uses ReadWrite.setZipThreads() to enable parallel gzip compression when available
- Compression Levels: Supports compression levels 1-9. The documented default is zl=4; when Data.PIGZ() is true, ReadWrite.ZIPLEVEL is raised to max(6, current)
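As a rough analogue of transparent compressed I/O, the Python sketch below picks a gzip or plain file handle. It is not how FileFormat.testInput() works internally; open_maybe_gzip is a hypothetical helper, and deciding by the .gz extension is an assumption made for this example.

```python
import gzip
import os
import tempfile

def open_maybe_gzip(path, mode="rb", level=4):
    """Open path in binary mode, through gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        # compresslevel only matters when writing; it is ignored for reads.
        return gzip.open(path, mode, compresslevel=level)
    return open(path, mode)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.accession2taxid.gz")
    with open_maybe_gzip(path, "wb") as handle:
        handle.write(b"A00002\t32630\n")
    with open(path, "rb") as handle:
        magic = handle.read(2)  # gzip files start with bytes 0x1f 0x8b
    with open_maybe_gzip(path, "rb") as handle:
        round_trip = handle.read()
```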
Memory Efficiency
The tool uses a default JVM allocation of 80MB (z="-Xmx80m" in shell script), making it suitable for processing large files on memory-constrained systems. The streaming approach using ByteFile.nextLine() and ByteBuilder buffering ensures memory usage remains constant regardless of input file size.
Performance Characteristics
- Processing Speed: Uses Tools.timeLinesBytesProcessed() to report processing statistics; performance depends on I/O throughput and file compression
- Size Reduction: Removes accession.version column and optionally GI numbers; actual reduction depends on original format and gi parameter setting
- Scalability: Streaming design with constant memory usage (ByteFile.nextLine() approach) enables processing of arbitrarily large input files
Input/Output Formats
Input Format
Accepts standard NCBI accession2taxid files in either format:
# 4-column format
accession accession.version taxid gi
A00002 A00002.1 32630 2
A00003 A00003.1 32630 3
# 2-column format
accession.version taxid
A00002.1 32630
A00003.1 32630
Output Format
Produces streamlined format with essential columns only:
# With GI numbers (gi=t)
accession taxid gi
A00002 32630 2
A00003 32630 3
# Without GI numbers (gi=f)
accession taxid
A00002 32630
A00003 32630
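A downstream script could read the shrunk output into a lookup table along these lines. This is a minimal sketch: load_shrunk is a hypothetical helper, fields are split on whitespace, and any line whose second field is not a number (such as a header) is skipped.

```python
def load_shrunk(lines):
    """Map accession -> taxid from shrunk rows; a third (gi) column is ignored."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdecimal():
            continue  # header, blank, or malformed line
        mapping[fields[0]] = int(fields[1])
    return mapping

shrunk = ["accession\ttaxid\tgi", "A00002\t32630\t2", "A00003\t32630\t3"]
table = load_shrunk(shrunk)
```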
Notes and Limitations
- Optional Processing: This tool is not required for using accession2taxid files with other BBTools, but provides benefits in terms of storage space and loading speed
- Validation: Lines with invalid taxonomy IDs (taxid < 1) are automatically discarded and counted
- Header Handling: Automatically detects and preserves appropriate headers in the output
- File Size: Output files are smaller than input files due to removal of accession.version column and optional GI number removal, with actual reduction depending on original format
- Compatibility: Output files maintain full compatibility with BBTools taxonomy functions
Support
For questions and support:
- Email: bbushnell@lbl.gov
- Documentation: bbmap.org