Shred

Script: shred.sh Package: synth Class: Shred.java

Shreds sequences into shorter, possibly overlapping sequences with support for uniform, even-length, and random length distributions.

Basic Usage

shred.sh in=<file> out=<file> length=<int>

Shreds input sequences into shorter subsequences of specified length, with optional overlap and various length distribution modes.

Parameters

Parameters control input/output files, shred length specifications, distribution modes, and naming conventions.

File Parameters

in=<file>
Input sequences. Accepts FASTA or FASTQ format, compressed or uncompressed. Required parameter.
out=<file>
Destination of output shreds. Will be written in the same format as input (FASTA or FASTQ). Required parameter.

Processing Parameters

length=500
Target length of each shred in uniform length mode. Default is 500 bases.
minlen=-1
Shortest allowed shred. The last shred of each input sequence may be shorter than desired length if this is not set. When set to -1 (default), no minimum length is enforced. If set to a positive value, shreds shorter than this will be discarded.
maxlen=-1
Longest shred length. If minlen and maxlen are both set, shreds will use a random flat length distribution. When set to -1 (default), no maximum length is enforced beyond the sequence length.
median=-1
Median length for random length distribution. Setting median and variance will override minlen and maxlen settings. Used in conjunction with variance to define a random distribution around the median value.
variance=-1
Variance for random length distribution. Used with the median parameter. Despite the name, this value acts as a half-width rather than a statistical variance: lengths range from median - variance to median + variance.
linear
Use linear distribution for random lengths. When maxlen is greater than minlen, the distribution can be linear, exp, or log (pick one as a flag). Linear distribution gives equal probability to all lengths in the range.
exp
Use exponential distribution for random lengths. Alternative to linear and log distributions. Exponential distribution favors shorter lengths with decreasing probability for longer lengths.
log
Use log-uniform distribution for random lengths. Alternative to linear and exp distributions. Lengths are drawn so that their logarithm is uniformly distributed between log(minlen) and log(maxlen), which makes shorter lengths more common than under linear.
overlap=0
Amount of overlap between successive shreds in bases. Default is 0 (no overlap). Positive values create overlapping shreds where increment = length - overlap. Used in the increment calculation for processUnevenly() and processEvenly() methods.
reads=-1
Maximum number of input sequences to process. If set to a non-negative value, processing will stop after this many input sequences. Default of -1 processes all input sequences.
equal=f
Shred each sequence into equal-sized subsequences no longer than 'length', rather than fixed-length shreds followed by a possibly shorter remainder.
qfake=30
Quality score to assign to all bases when output format is FASTQ. Default is 30 (corresponding to 99.9% accuracy). Only used when input is FASTA but output needs to be FASTQ format.
filetid=f
Parse taxonomic ID from filename and incorporate into shred names. When enabled, extracts TID from the input filename (e.g., "sample_tid_12345.fasta") and includes it in output sequence names.
headertid=f
Parse taxonomic ID from sequence headers and incorporate into shred names. When enabled, extracts TID from individual sequence headers and includes it in the corresponding shred names.
prefix=null
Prefix to use for shred names instead of original sequence names. When specified, shred names will start with this prefix followed by coordinate information.

Examples

Basic Uniform Shredding

shred.sh in=genome.fasta out=shreds.fasta length=1000

Shreds input sequences into 1000-base fragments with no overlap.

Overlapping Shreds

shred.sh in=contigs.fasta out=overlapping_shreds.fasta length=500 overlap=100

Creates 500-base shreds with 100 bases of overlap between consecutive shreds (400-base increment).

Random Length Distribution

shred.sh in=sequences.fasta out=random_shreds.fasta minlen=300 maxlen=800 linear

Generates shreds with random lengths between 300-800 bases using linear distribution.

Even-Length Shredding

shred.sh in=assembly.fasta out=even_shreds.fasta length=1000 equal=t

Divides each sequence into equal-sized chunks of at most 1000 bases, so no short remainder shred is left at the end of a sequence.

Quality Score Assignment

shred.sh in=sequences.fasta out=shreds.fastq length=500 qfake=25

Converts FASTA input to FASTQ output with quality scores of 25 for all bases in the shreds.

Median-Variance Distribution

shred.sh in=genome.fasta out=variable_shreds.fasta median=600 variance=200 exp

Creates shreds with lengths drawn from an exponential distribution over the range 400-800 bases (median 600, variance 200).

Algorithm Details

Shredding Modes

SHRED implements three primary shredding strategies:

1. Uniform Length Shredding (Default)

Processes sequences using a fixed increment calculated as increment = length - overlap. Each shred has exactly the specified length (except potentially the last shred of each sequence). Uses the processUnevenly() method with simple arithmetic progression through sequence coordinates.
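The coordinate arithmetic described above can be sketched in a few lines of Python. This is an illustration of the stated rule (increment = length - overlap), not the code of processUnevenly(); the function name shred_uniform, the half-open intervals, and the choice to stop once a shred reaches the end of the sequence are assumptions.

```python
def shred_uniform(seq_len, length=500, overlap=0, minlen=-1):
    """Return half-open (start, stop) intervals for fixed-length shreds."""
    increment = length - overlap
    if increment <= 0:
        raise ValueError("overlap must be smaller than length")
    shreds = []
    start = 0
    while start < seq_len:
        # The last shred is truncated at the end of the sequence.
        stop = min(start + length, seq_len)
        # Shreds shorter than minlen are discarded (minlen=-1 keeps all).
        if stop - start >= minlen:
            shreds.append((start, stop))
        # Assumption: stop once a shred has reached the sequence end.
        if stop == seq_len:
            break
        start += increment
    return shreds


print(shred_uniform(1200, length=500, overlap=100))
# [(0, 500), (400, 900), (800, 1200)]
```

With length=500 and overlap=100 the starts advance by 400 bases, and the final shred is shorter than 500 only because the sequence ends.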

2. Even-Length Distribution

When equal=true, sequences are divided into approximately equal-sized chunks. The processEvenly() method calculates the number of chunks as chunks = ceil((bases.length - overlap) * incMult) and a fractional step as inc2 = bases.length / chunks. Each chunk then spans from a = floor(inc2 * chunk) to b = overlap + floor(inc2 * (chunk + 1)). incMult is not defined in this section; it is presumably the reciprocal of the increment (1 / (length - overlap)).
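The formulas above can be checked with a short Python sketch. It is a reading of the documented arithmetic, not the processEvenly() source: the function name shred_evenly is hypothetical, incMult is assumed to equal 1 / (length - overlap) and is written here as a division, and the clamp of the final chunk to the sequence end is an added guard against floating-point rounding.

```python
import math


def shred_evenly(seq_len, length=500, overlap=0):
    """Return half-open (a, b) intervals of roughly equal size."""
    increment = length - overlap
    if increment <= 0:
        raise ValueError("overlap must be smaller than length")
    # Assumption: incMult == 1 / increment, so multiplying by incMult
    # is the same as dividing by the increment.
    chunks = max(1, math.ceil((seq_len - overlap) / increment))
    inc2 = seq_len / chunks  # fractional step between chunk starts
    intervals = []
    for chunk in range(chunks):
        a = math.floor(inc2 * chunk)
        b = min(overlap + math.floor(inc2 * (chunk + 1)), seq_len)
        if chunk == chunks - 1:
            b = seq_len  # guard: the last chunk always reaches the end
        intervals.append((a, b))
    return intervals


print(shred_evenly(2500, length=1000))
# [(0, 833), (833, 1666), (1666, 2500)]
```

A 2500-base sequence with length=1000 yields three chunks of 833, 833, and 834 bases, where uniform mode would give 1000, 1000, and 500.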

3. Random Length Distribution

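When minlen and maxlen are both specified (or derived from median and variance), shreds are generated with random lengths. Three distribution modes are available, selected by flag:

linear: every length between minlen and maxlen is equally likely.
exp: exponential distribution that favors shorter lengths.
log: log-uniform distribution between minlen and maxlen.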
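As a rough illustration of the linear, exp, and log flags described under Processing Parameters, the sketch below draws one shred length per call. The function names, the clamping to [minlen, maxlen], and in particular the rate chosen for the exponential mode are assumptions made for the example; the tool's actual sampling code may differ.

```python
import math
import random


def length_range(median, variance):
    """Convert median/variance into (minlen, maxlen) as median +/- variance."""
    return median - variance, median + variance


def random_length(minlen, maxlen, mode="linear", rng=random):
    """Draw one shred length in [minlen, maxlen] for the given mode."""
    if minlen <= 0 or maxlen < minlen:
        raise ValueError("need 0 < minlen <= maxlen")
    if minlen == maxlen:
        return minlen
    if mode == "linear":
        # Flat distribution: every integer length is equally likely.
        return rng.randint(minlen, maxlen)
    if mode == "log":
        # Log-uniform: sample uniformly in log space, then exponentiate.
        x = math.exp(rng.uniform(math.log(minlen), math.log(maxlen)))
        return min(maxlen, max(minlen, int(round(x))))
    if mode == "exp":
        # Assumption: exponential offset above minlen with mean span/4,
        # clamped at maxlen. The real rate parameter is not documented here.
        span = maxlen - minlen
        offset = int(rng.expovariate(4.0 / span))
        return min(maxlen, minlen + offset)
    raise ValueError("mode must be linear, exp, or log")


lo, hi = length_range(600, 200)  # (400, 800)
rng = random.Random(42)
print([random_length(lo, hi, "exp", rng) for _ in range(5)])
```

Passing a seeded random.Random instance keeps the draws reproducible, which is convenient when comparing the three modes.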

Coordinate Naming

Shred names follow the format original_name_start-end where start and end represent 0-based coordinates in the original sequence. When taxonomic IDs are parsed (via filetid or headertid), the format becomes original_name_start-end_tid_XXXXX.
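The naming pattern can be written out as a small Python helper. The helper names shred_name and parse_tid are hypothetical, the tid_ pattern is inferred from the example filename "sample_tid_12345.fasta", and the way prefix replaces the original name is an assumption based on the parameter description.

```python
import re


def parse_tid(text):
    """Extract a taxonomic ID from text such as 'sample_tid_12345.fasta'."""
    match = re.search(r"tid_(\d+)", text)
    return int(match.group(1)) if match else None


def shred_name(name, start, end, tid=None, prefix=None):
    """Build a shred name in the form name_start-end[_tid_XXXXX]."""
    # Assumption: a prefix, when given, replaces the original name.
    base = prefix if prefix is not None else name
    result = f"{base}_{start}-{end}"
    if tid is not None:
        result += f"_tid_{tid}"
    return result


tid = parse_tid("sample_tid_12345.fasta")
print(shred_name("contig1", 0, 499, tid=tid))
# contig1_0-499_tid_12345
```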

Quality Score Handling

For FASTQ input, original quality scores are preserved and trimmed to match shred coordinates. For FASTA input with FASTQ output, uniform quality scores specified by qfake are assigned to all bases.
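The two cases can be summarized in a short Python sketch. The function names are hypothetical and qualities are modeled as a list of integer Phred scores; the conversion accuracy = 1 - 10^(-Q/10) is the standard Phred definition, which is why the default qfake=30 corresponds to 99.9% accuracy.

```python
def shred_quality(quals, start, stop, qfake=30):
    """Return quality scores for the shred covering [start, stop)."""
    if quals is not None:
        # FASTQ input: keep the original scores for the shred's span.
        return quals[start:stop]
    # FASTA input with FASTQ output: assign qfake to every base.
    return [qfake] * (stop - start)


def phred_to_accuracy(q):
    """Convert a Phred score to the probability that a base call is correct."""
    return 1 - 10 ** (-q / 10)


print(shred_quality([10, 20, 30, 40], 1, 3))  # [20, 30]
print(shred_quality(None, 0, 3, qfake=25))    # [25, 25, 25]
```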

Memory Optimization

SHRED uses streaming processing with buffer limits set by Shared.capBuffers(4) and Shared.capBufferLen(100). The ConcurrentReadInputStream processes reads in batches using ListNum<Read> structures, maintaining O(1) memory usage independent of input file size.
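The idea of batch-wise streaming can be illustrated with a Python generator. This is an analogy, not a port of ConcurrentReadInputStream or ListNum<Read>: the function name batches is hypothetical, and the batch size of 100 only mirrors the value passed to Shared.capBufferLen(100) on the assumption that it limits the number of reads per list.

```python
from itertools import islice


def batches(records, batch_size=100):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        # Only one batch is held in memory at a time.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in batches(range(250), batch_size=100):
    print(len(batch))
# 100
# 100
# 50
```

Because each batch is released before the next one is read, peak memory depends on the batch size and the length of the sequences in it, not on the total size of the input file.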

Performance Characteristics

Support

For questions and support: