ShrinkAccession

Script: shrinkaccession.sh
Package: tax
Class: ShrinkAccession.java

Shrinks accession2taxid tables by removing unneeded columns. This step is optional, but it makes accession2taxid files smaller and faster to load.

Basic Usage

shrinkaccession.sh in=<file> out=<outfile>

This tool processes accession2taxid files and removes unnecessary columns to reduce file size and improve loading performance. It's particularly useful for large taxonomic databases where storage space and loading time are concerns.

Parameters

Parameters control file handling, compression, and output format options for shrinking accession2taxid files.

File I/O Parameters

in=<file>
Input accession2taxid file. This is the primary input file containing accession numbers and their corresponding taxonomic IDs. The file can be compressed.
out=<outfile>
Output file for the processed accession2taxid data. Will contain only the essential columns (accession and taxid, optionally gi numbers).
ow=f
(overwrite) Overwrites files that already exist. Set to true to allow overwriting of existing output files.
app=f
(append) Append to files that already exist. Set to true to append results to existing files instead of overwriting.

Compression Parameters

zl=4
(ziplevel) Set compression level, 1 (low) to 9 (max). Higher levels provide better compression at the cost of processing time.
pigz=t
Use pigz for compression, if available. Pigz is a parallel implementation of gzip; the number of compression threads is set through ReadWrite.setZipThreads(), so compression can use multiple cores.
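To get a feel for the size side of the zl tradeoff, the sketch below prints the compressed size of a file at a few levels using plain gzip. The helper name compare_levels is hypothetical, and the sketch assumes that zl follows the usual gzip level scale; it does not call any BBTools code.

```shell
# Hypothetical helper: print the gzip-compressed size of a file at several levels.
# Assumes zl corresponds to the usual gzip level scale (1 = fastest, 9 = smallest).
compare_levels() {
    for level in 1 4 9; do
        # $(( ... )) strips any padding that wc adds around the byte count.
        size=$(( $(gzip -c -"$level" "$1" | wc -c) ))
        printf 'level %s: %s bytes\n' "$level" "$size"
    done
}

# Example (file name is illustrative):
# compare_levels nucl_gb.accession2taxid
```

Running it on a representative slice of the real input is usually enough to decide whether a higher level is worth the extra time.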

Content Parameters

gi=t
Retain gi numbers. When set to true, the tool will preserve GI (GenInfo Identifier) numbers in the output. When false, GI numbers are discarded to further reduce file size.

Java Parameters

-Xmx
Sets Java's memory usage, overriding autodetection. -Xmx20g specifies 20 GB of RAM, and -Xmx200m specifies 200 MB. The maximum is typically 85% of physical memory. The default for shrinkaccession is 80 MB, which should be sufficient for most files.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for automated pipelines that need to handle memory issues gracefully.
-da
Disable assertions. Can provide a small performance boost in production environments by disabling Java assertion checking.
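In an automated pipeline, a flag such as -eoom is only useful if the calling script checks the exit status. The wrapper below is a minimal sketch of that check; run_step is a hypothetical helper, and it assumes the process exits with a non-zero status when it aborts.

```shell
# Hypothetical wrapper: run one pipeline step and report a non-zero exit status.
# Assumes a failed or aborted step exits non-zero.
run_step() {
    if "$@"; then
        return 0
    else
        status=$?
        echo "step failed with exit status $status: $*" >&2
        return "$status"
    fi
}

# Example (file names are illustrative):
# run_step shrinkaccession.sh in=prot.accession2taxid out=prot_minimal.accession2taxid -eoom
```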

Examples

Basic File Shrinking

shrinkaccession.sh in=nucl_gb.accession2taxid out=nucl_gb_shrunk.accession2taxid

Shrinks a nucleotide GenBank accession2taxid file, removing unnecessary columns while retaining GI numbers.

Shrinking Without GI Numbers

shrinkaccession.sh in=prot.accession2taxid out=prot_minimal.accession2taxid gi=f

Creates a minimal accession2taxid file with only accession and taxid columns, discarding GI numbers for maximum space savings.

High Compression Processing

shrinkaccession.sh in=dead_nucl.accession2taxid.gz out=dead_nucl_shrunk.accession2taxid.gz zl=9 pigz=t

Processes a compressed input file with maximum compression settings, using parallel compression if available.

Pipeline Integration

shrinkaccession.sh in=wgs.accession2taxid out=wgs_processed.accession2taxid.gz ow=t gi=f zl=6

Pipeline-friendly processing with overwrite permission, no GI numbers, and balanced compression for automated workflows. The output name ends in .gz so that the zl setting takes effect.

Algorithm Details

ShrinkAccession processes accession2taxid files by removing redundant columns and optionally discarding GI numbers while preserving essential taxonomic mapping information.
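As a rough illustration of this transformation (not the tool's Java implementation), the awk sketch below blanks the accession.version column of a 4-column table and optionally drops the gi column. The function name shrink_sketch and its t/f argument are hypothetical; the empty second column mirrors the sample output shown under Output Format.

```shell
# Hypothetical sketch: mimic the column removal on a 4-column table.
# Usage: shrink_sketch <keep_gi: t|f> < input > output
shrink_sketch() {
    awk -F '\t' -v keep_gi="$1" '
        BEGIN { OFS = "\t" }
        {
            # Keep column 1 (accession), blank column 2, keep column 3 (taxid).
            if (keep_gi == "t") print $1, "", $3, $4
            else                print $1, "", $3
        }'
}

# Small demonstration on an inline two-line table, dropping the gi column.
printf 'accession\taccession.version\ttaxid\tgi\nA00002\tA00002.1\t32630\t2\n' |
    shrink_sketch f
```

The sketch treats the header like any other line, which is a simplification chosen to match the sample output.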

File Format Processing

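The tool handles two main input formats: the 4-column layout (accession, accession.version, taxid, gi) and the 2-column layout (accession.version, taxid), both shown under Input Format.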

Processing Strategy

The algorithm uses a streaming approach: the input is read line by line, so the whole file never needs to be held in memory.

GI Number Retention

When gi=true (default), the tool preserves GenInfo Identifier numbers from the input. GI numbers are validated as numeric values and "na" entries are handled appropriately. This is useful for maintaining compatibility with older NCBI tools that still reference GI numbers.
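Before choosing between gi=t and gi=f, it can help to see what the gi column of an input actually contains. The sketch below only classifies values as numeric, "na", or other; it does not reproduce how the tool treats each case, which is not spelled out here. The helper name gi_summary is hypothetical, and the input is assumed to be a 4-column table with a header line.

```shell
# Hypothetical helper: summarize the gi column (column 4) of a 4-column table.
# Reads the table on stdin and prints one summary line.
gi_summary() {
    awk -F '\t' '
        NR == 1 { next }                      # skip the header line
        $4 ~ /^[0-9]+$/ { numeric++; next }   # purely numeric GI
        $4 == "na"      { na++; next }        # placeholder for a missing GI
        { other++ }                           # anything else
        END { printf "numeric=%d na=%d other=%d\n", numeric, na, other }'
}

# Example (file name is illustrative):
# gi_summary < nucl_gb.accession2taxid
```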

Compression Integration

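The tool integrates with BBTools' compression framework: the zl parameter sets the compression level, and pigz enables parallel compression when it is available.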

Memory Efficiency

The shell script sets a default JVM allocation of 80 MB (z="-Xmx80m"), making the tool suitable for processing large files on memory-constrained systems. Because input is streamed with ByteFile.nextLine() and output is buffered in a ByteBuilder, memory usage stays roughly constant regardless of input file size.
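The same streaming idea can be sketched with standard Unix tools: decompress, filter one line at a time, and recompress, without ever holding the whole table in memory. This is an analogy under stated assumptions, not the tool's code path; stream_drop_version and its arguments are hypothetical.

```shell
# Hypothetical sketch: stream a gzipped table through a per-line filter.
# Each stage handles one line at a time, so memory use does not grow with file size.
# $1 = gzipped input, $2 = gzipped output, $3 = gzip level (1-9, default 4)
stream_drop_version() {
    gzip -dc "$1" |
        awk -F '\t' 'BEGIN { OFS = "\t" } { $2 = ""; print }' |
        gzip -"${3:-4}" > "$2"
}

# Example (file names are illustrative):
# stream_drop_version dead_nucl.accession2taxid.gz dead_nucl_sketch.accession2taxid.gz 9
```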

Performance Characteristics

Input/Output Formats

Input Format

Accepts standard NCBI accession2taxid files in either format:

# 4-column format
accession	accession.version	taxid	gi
A00002	A00002.1	32630	2
A00003	A00003.1	32630	3

# 2-column format  
accession.version	taxid
A00002.1	32630
A00003.1	32630
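Since the two layouts differ in their header line, a script can tell them apart by reading only the first line. The sketch below shows one way to do that; detect_a2t_format is a hypothetical helper, and it assumes the header is present and spelled as in the samples above.

```shell
# Hypothetical helper: report the column layout (2, 4, or unknown) from the header line.
detect_a2t_format() {
    # Read only the first line of stdin.
    IFS= read -r header || true
    case "$header" in
        # Check the longer prefix first, since it also starts with "accession".
        "accession.version"*) echo 2 ;;
        "accession"*)         echo 4 ;;
        *)                    echo unknown ;;
    esac
}

# Example (file name is illustrative):
# detect_a2t_format < prot.accession2taxid
```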

Output Format

Produces streamlined format with essential columns only:

# With GI numbers (gi=t)
accession		taxid	gi
A00002		32630	2
A00003		32630	3

# Without GI numbers (gi=f)
accession		taxid
A00002		32630
A00003		32630
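Going by the samples above, the second column is left empty rather than removed, so a gi=t output line has four tab-separated fields and a gi=f line has three. The sketch below is one way to sanity-check a file against that expectation; check_field_count is a hypothetical helper, and the field counts are inferred from the samples rather than from a format specification.

```shell
# Hypothetical check: succeed only if every line has the expected number of
# tab-separated fields (empty fields count).
# Usage: check_field_count <expected_fields> < file
check_field_count() {
    awk -F '\t' -v want="$1" '
        NF != want { bad++ }
        END { exit (bad > 0) }'
}

# Example (file name is illustrative):
# check_field_count 3 < prot_minimal.accession2taxid && echo "layout looks consistent"
```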

Notes and Limitations
