ShrinkAccession

Basic Usage

shrinkaccession.sh in=<file> out=<outfile>

This tool processes accession2taxid files and removes unnecessary columns to reduce file size and improve loading performance. It's particularly useful for large taxonomic databases where storage space and loading time are concerns.

Parameters

Parameters control file handling, compression, and output format options for shrinking accession2taxid files.

File I/O Parameters

in=<file>: Input accession2taxid file. This is the primary input file containing accession numbers and their corresponding taxonomic IDs. The file can be compressed.
out=<outfile>: Output file for the processed accession2taxid data. Will contain only the essential columns (accession and taxid, optionally gi numbers).
ow=f: (overwrite) Whether to overwrite files that already exist. Set to true to allow overwriting of existing output files.
app=f: (append) Whether to append to files that already exist. Set to true to append results to existing files instead of overwriting them.

Compression Parameters

zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max). Higher levels provide better compression at the cost of processing time.
pigz=t: Use pigz for compression, if available. Pigz is a parallel implementation of gzip; the tool sets its thread count through ReadWrite.setZipThreads() so that compression can use multiple cores.

Content Parameters

gi=t: Retain gi numbers. When set to true, the tool will preserve GI (GenInfo Identifier) numbers in the output. When false, GI numbers are discarded to further reduce file size.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Default for shrinkaccession is 80MB which should be sufficient for most files.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines that need to handle memory issues gracefully.
-da: Disable assertions. Can provide a small performance boost in production environments by disabling Java assertion checking.

Examples

Basic File Shrinking

shrinkaccession.sh in=nucl_gb.accession2taxid out=nucl_gb_shrunk.accession2taxid

Shrinks a nucleotide GenBank accession2taxid file, removing unnecessary columns while retaining GI numbers.

Shrinking Without GI Numbers

shrinkaccession.sh in=prot.accession2taxid out=prot_minimal.accession2taxid gi=f

Creates a minimal accession2taxid file with only accession and taxid columns, discarding GI numbers for maximum space savings.

High Compression Processing

shrinkaccession.sh in=dead_nucl.accession2taxid.gz out=dead_nucl_shrunk.accession2taxid.gz zl=9 pigz=t

Processes a compressed input file with maximum compression settings, using parallel compression if available.

Pipeline Integration

shrinkaccession.sh in=wgs.accession2taxid out=wgs_processed.accession2taxid ow=t gi=f zl=6

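Pipeline-friendly processing for automated workflows: existing output is overwritten and GI numbers are discarded. Note that zl=6 should only matter when the output is actually compressed; the output name here has no compression extension, so add .gz to the output name if compressed output is wanted.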
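When these commands are launched from a script, the argument list can be assembled programmatically. The following Python sketch is an illustration only: build_command and run_shrink are hypothetical helpers, not part of BBTools, and they use only the flags documented above (in, out, gi, ow, zl). It assumes shrinkaccession.sh is on the PATH.

```python
import subprocess


def _flag(value):
    """Render a Python bool in the t/f form used by the documented flags."""
    return "t" if value else "f"


def build_command(in_path, out_path, gi=True, overwrite=False, ziplevel=None):
    """Build the argument list for shrinkaccession.sh (hypothetical helper)."""
    cmd = [
        "shrinkaccession.sh",
        f"in={in_path}",
        f"out={out_path}",
        f"gi={_flag(gi)}",
        f"ow={_flag(overwrite)}",
    ]
    if ziplevel is not None:
        # The documentation gives 1 (low) to 9 (max) as the valid range.
        if not 1 <= ziplevel <= 9:
            raise ValueError("ziplevel must be between 1 and 9")
        cmd.append(f"zl={ziplevel}")
    return cmd


def run_shrink(in_path, out_path, **options):
    """Run the tool; check=True raises CalledProcessError on a nonzero exit."""
    return subprocess.run(build_command(in_path, out_path, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.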

Algorithm Details

ShrinkAccession processes accession2taxid files by removing redundant columns and optionally discarding GI numbers while preserving essential taxonomic mapping information.

File Format Processing

The tool handles two main input formats:

4-column format: accession, accession.version, taxid, gi - reduces to accession, taxid, gi (optional)
2-column format: accession.version, taxid - converts to accession, taxid
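The column reduction described above can be sketched in Python. This is not the tool's Java implementation: shrink_fields is a hypothetical name, and the sketch assumes that the 2-column case derives the accession by dropping everything after the last "." in accession.version. It returns the retained fields as a list rather than committing to an output layout.

```python
def shrink_fields(line, keep_gi=True):
    """Return the retained fields of one tab-delimited accession2taxid row."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) >= 4:
        # 4-column row: accession, accession.version, taxid, gi.
        accession, _version, taxid, gi = fields[:4]
        return [accession, taxid, gi] if keep_gi else [accession, taxid]
    if len(fields) == 2:
        # 2-column row: accession.version, taxid.
        # Assumption: the version suffix is whatever follows the last dot.
        accession = fields[0].rsplit(".", 1)[0]
        return [accession, fields[1]]
    raise ValueError(f"unexpected column count: {len(fields)}")
```

For example, shrink_fields("A00002\tA00002.1\t32630\t2") keeps the accession, taxid, and gi, while passing keep_gi=False drops the last field, mirroring gi=f.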

Processing Strategy

The algorithm uses a streaming approach with these specific optimizations:

Streaming Processing: Uses ByteFile.nextLine() to process files line by line without loading entire files into memory, enabling handling of very large accession2taxid files (multi-GB sizes)
Column Parsing: Uses direct byte array iteration to extract only required columns (accession and taxid), skipping accession.version column in 4-column format
Invalid Line Handling: Uses AccessionToTaxid.parseLineToTaxid() to validate taxonomic IDs (tid<1 discarded), reporting the number of discarded entries
Buffered Output: Uses a ByteBuilder as an output buffer that is flushed once it exceeds roughly 8 KB (the bb.length()>8000 threshold), batching I/O operations when writing processed data
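The combination of line-by-line streaming, taxid validation, and threshold-based flushing can be illustrated with a short Python sketch. It is a stand-in for the Java classes named above, not a port of them: shrink_stream and parse_taxid are hypothetical names, the sketch writes a simplified accession/taxid layout, and it skips the header line instead of rewriting it.

```python
import io


def parse_taxid(fields):
    """Return the taxid as an int, or -1 if it is missing or non-numeric."""
    # 4-column rows keep taxid in column 3; shorter rows keep it in column 2.
    if len(fields) >= 4:
        raw = fields[2]
    elif len(fields) >= 2:
        raw = fields[1]
    else:
        raw = b""
    return int(raw) if raw.isdigit() else -1


def shrink_stream(src, dst, flush_threshold=8000):
    """Copy accession and taxid from src to dst; return (kept, discarded)."""
    buf = bytearray()
    kept = discarded = 0
    for line in src:  # binary file objects yield one line at a time
        fields = line.rstrip(b"\r\n").split(b"\t")
        if fields == [b""] or fields[0].startswith(b"accession"):
            continue  # blank line or header (skipped in this sketch)
        taxid = parse_taxid(fields)
        if taxid < 1:
            discarded += 1  # invalid taxid: drop the row and count it
            continue
        accession = fields[0].split(b".")[0]
        buf += accession + b"\t" + str(taxid).encode() + b"\n"
        kept += 1
        if len(buf) > flush_threshold:
            dst.write(buf)  # batch many rows into a single write
            buf.clear()
    if buf:
        dst.write(buf)  # flush whatever remains at end of input
    return kept, discarded
```

Because only the current line and the small buffer are held at any time, memory use in this sketch does not grow with the input size.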

GI Number Retention

When gi=true (default), the tool preserves GenInfo Identifier numbers from the input. GI numbers are validated as numeric values and "na" entries are handled appropriately. This is useful for maintaining compatibility with older NCBI tools that still reference GI numbers.

Compression Integration

The tool integrates with BBTools' compression framework:

Automatic Detection: Uses FileFormat.testInput() to detect compressed input files and handles decompression via ByteFile
Pigz Support: Sets ReadWrite.USE_PIGZ=true and uses ReadWrite.setZipThreads() to enable parallel gzip compression when available
Compression Levels: Supports compression levels 1-9 with a default of 4 (zl=4); when Data.PIGZ() is true, ReadWrite.ZIPLEVEL is set to max(6, current), so the effective level is at least 6
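For readers who want to handle the same files outside BBTools, the idea of transparent compression handling can be approximated in Python. This sketch is an assumption-laden stand-in: it detects gzip input by its two magic bytes and chooses gzip output by the .gz extension, uses Python's single-threaded gzip module rather than pigz, and the names open_input and open_output are hypothetical. It does not describe how FileFormat.testInput() decides.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_input(path):
    """Open path for binary reading, decompressing if it looks like gzip."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")


def open_output(path, level=4):
    """Open path for binary writing, compressing when the name ends in .gz."""
    if path.endswith(".gz"):
        # compresslevel accepts 0-9; 4 mirrors the documented zl default.
        return gzip.open(path, "wb", compresslevel=level)
    return open(path, "wb")
```

Checking the magic bytes rather than the file name lets the reader cope with compressed files that were renamed without their extension.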

Memory Efficiency

The tool uses a default JVM allocation of 80MB (z="-Xmx80m" in shell script), making it suitable for processing large files on memory-constrained systems. The streaming approach using ByteFile.nextLine() and ByteBuilder buffering ensures memory usage remains constant regardless of input file size.

Performance Characteristics

Processing Speed: Uses Tools.timeLinesBytesProcessed() to report processing statistics; performance depends on I/O throughput and file compression
Size Reduction: Removes accession.version column and optionally GI numbers; actual reduction depends on original format and gi parameter setting
Scalability: Streaming design with constant memory usage (ByteFile.nextLine() approach) enables processing of arbitrarily large input files

Input/Output Formats

Input Format

Accepts standard NCBI accession2taxid files in either format:

# 4-column format
accession	accession.version	taxid	gi
A00002	A00002.1	32630	2
A00003	A00003.1	32630	3

# 2-column format
accession.version	taxid
A00002.1	32630
A00003.1	32630

Output Format

Produces a streamlined format with essential columns only:

# With GI numbers (gi=t)
accession		taxid	gi
A00002		32630	2
A00003		32630	3

# Without GI numbers (gi=f)
accession		taxid
A00002		32630
A00003		32630
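A shrunken file in the layout shown above can be read back into an accession-to-taxid mapping with a few lines of Python. This is a sketch for downstream scripts, not how BBTools loads the file: load_shrunk is a hypothetical name, and it assumes that empty fields between consecutive tabs can be ignored and that the header line starts with "accession".

```python
def load_shrunk(lines):
    """Build {accession: taxid} from the lines of a shrunken file."""
    mapping = {}
    for line in lines:
        # Drop empty fields so single and doubled tabs are treated alike.
        fields = [f for f in line.rstrip("\r\n").split("\t") if f]
        if len(fields) < 2 or fields[0] == "accession":
            continue  # blank, malformed, or header line
        mapping[fields[0]] = int(fields[1])
    return mapping
```

Any GI column is simply ignored here, so the same function works for files written with gi=t or gi=f.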

Notes and Limitations

Optional Processing: This tool is not required for using accession2taxid files with other BBTools, but provides benefits in terms of storage space and loading speed
Validation: Lines with invalid taxonomy IDs (taxid < 1) are automatically discarded and counted
Header Handling: Automatically detects and preserves appropriate headers in the output
File Size: Output files are smaller than input files because the accession.version column is removed and, with gi=f, the GI numbers as well; the actual reduction depends on the original format
Compatibility: Output files maintain full compatibility with BBTools taxonomy functions

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org