CrossContaminate

Basic Usage

crosscontaminate.sh in=<file,file,...> out=<file,file,...>

CrossContaminate takes multiple clean input files and produces the same number of contaminated output files. Each source file randomly contaminates a subset of other files based on configurable probability distributions.

Parameters

Parameters are grouped by their role in the cross-contamination process. Parameter names and grouping follow the crosscontaminate.sh shell script written by Brian Bushnell.

Input Parameters

in=<file,file,...>: Clean input reads. Specify multiple files separated by commas. Each input file represents a separate source that can contaminate other files.
innamefile=<file>: A file containing the names of input files, one name per line. Alternative to specifying multiple files with the in= parameter.
interleaved=auto: (int) t/f overrides interleaved autodetection. Set to true for interleaved paired-end files, false for single-end or separate paired files.
qin=auto: Input quality offset: 33 (Sanger), 64, or auto. Auto-detection works for most files. Sanger format (33) is standard for modern sequencers.
reads=-1: If positive, quit after processing X reads or pairs. Useful for testing with a subset of data. Default -1 processes all reads.

Processing Parameters

minsinks=1: Min contamination destinations from one source. Each source file will contaminate at least this many other files. Must be ≤ maxsinks.
maxsinks=8: Max contamination destinations from one source. Each source file will contaminate at most this many other files. Cannot exceed the number of other files (the input count minus one).
minprob=0.000005: Min allowed contamination rate (geometric distribution). The lowest rate that any single contamination event can have. Together with maxprob, it bounds the range from which rates are sampled uniformly in log space.
maxprob=0.025: Max allowed contamination rate. The highest rate that any single contamination event can have. Rates are sampled geometrically (uniformly in log space) between minprob and maxprob.

Output Parameters

out=<file,file,...>: Contaminated output reads. Must specify the same number of output files as input files. Each output file receives reads from its corresponding input file plus contaminating reads from other sources.
outnamefile=<file>: A file containing the names of output files, one name per line. Alternative to specifying multiple files with the out= parameter.
overwrite=t: (ow) Grant permission to overwrite files. Set to false to prevent accidental overwriting of existing output files.
ziplevel=2: (zl) Compression level; 1 (min) through 9 (max). Higher values give better compression but slower processing. 2 provides a good balance of speed and compression.
threads=auto: (t) Set number of threads to use; default is number of logical processors. More threads mainly help CPU-bound steps such as compression and decompression.
qout=auto: Output quality offset: 33 (Sanger), 64, or auto. Usually matches input format. Auto-detection preserves the input quality encoding.
shuffle=f: Shuffle contents of output files. When true, randomizes the order of reads within each output file after contamination is complete.
shufflethreads=3: Use this many threads for shuffling (uses more memory). Only relevant when shuffle=t. More threads can speed up shuffling at the cost of memory usage.

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da: Disable assertions. Can provide a small performance boost in production use.

Examples

Basic Cross-Contamination

crosscontaminate.sh in=clean1.fq,clean2.fq,clean3.fq out=contaminated1.fq,contaminated2.fq,contaminated3.fq

Create cross-contaminated versions of three clean files. Each output file will contain reads from its corresponding input plus contaminating reads from the other sources.

Controlled Contamination Parameters

crosscontaminate.sh in=sample1.fq,sample2.fq,sample3.fq,sample4.fq out=cont1.fq,cont2.fq,cont3.fq,cont4.fq minsinks=2 maxsinks=3 minprob=0.0001 maxprob=0.01

Each source will contaminate 2-3 other files with contamination rates between 0.01% and 1%. This creates realistic contamination levels for testing decontamination tools.
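
As a rough guide to what these settings imply: if rates are drawn uniformly in log space between minprob and maxprob, as the Algorithm Details section describes, the median rate is the geometric mean of the two bounds. The snippet below is an illustrative Python calculation, not part of BBTools, and the function name is made up for this example.

```python
import math

def median_log_uniform(minprob: float, maxprob: float) -> float:
    """Median of a log-uniform distribution on [minprob, maxprob].

    Assumes rates are sampled uniformly in log space, so the median sits at
    the midpoint of the log range, i.e. the geometric mean of the bounds.
    """
    if not 0 < minprob <= maxprob:
        raise ValueError("need 0 < minprob <= maxprob")
    return math.exp((math.log(minprob) + math.log(maxprob)) / 2)

# Settings from the example command above.
print(median_log_uniform(0.0001, 0.01))
```

For minprob=0.0001 and maxprob=0.01 the result is about 0.001, so under this assumption roughly half of the contamination events fall below 0.1% and half above.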

Using File Lists

crosscontaminate.sh innamefile=input_files.txt outnamefile=output_files.txt shuffle=t

Use file lists for large numbers of samples, with output shuffling enabled to randomize read order within files.
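
The two list files are plain text with one file name per line. One way to generate them is sketched below in Python; this is a helper for illustration only, and the function name and the cont_ output prefix are assumptions rather than anything CrossContaminate requires.

```python
from pathlib import Path

def write_name_files(inputs, input_list="input_files.txt",
                     output_list="output_files.txt", prefix="cont_"):
    """Write one file name per line for innamefile= and outnamefile=.

    Output names are the input names with a prefix added (a hypothetical
    naming scheme), so both lists have the same length and order.
    """
    inputs = [Path(p) for p in inputs]
    outputs = [p.with_name(prefix + p.name) for p in inputs]
    Path(input_list).write_text("".join(f"{p}\n" for p in inputs))
    Path(output_list).write_text("".join(f"{p}\n" for p in outputs))
    return outputs
```

For example, write_name_files(sorted(Path("samples").glob("*.fq"))) would list every .fq file in a samples directory and pair each with a cont_-prefixed output name in the same directory.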

Testing with Limited Reads

crosscontaminate.sh in=test1.fq,test2.fq out=cont_test1.fq,cont_test2.fq reads=10000

Process only the first 10,000 reads from each file for quick testing of parameters.

Algorithm Details

Cross-Contamination Model

CrossContaminate draws contamination rates uniformly in log space and assigns reads through a cumulative probability lookup, with the aim of generating realistic cross-contamination patterns in sequencing data.

Sink Assignment Strategy

For each source file, the algorithm randomly determines the number of contamination targets (sinks) within the range [minsinks, maxsinks]. It then randomly selects that many files from the remaining files to serve as contamination destinations. This ensures each source can contaminate a different set and number of targets.
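
The selection step can be sketched as follows. This is a Python illustration of the description above, not the tool's Java code; the function name is hypothetical, and clamping the range to the number of other files is an assumption about how the limit is applied.

```python
import random

def choose_sinks(num_files, source, minsinks=1, maxsinks=8, rng=None):
    """Pick the indices of the files that `source` will contaminate."""
    if rng is None:
        rng = random.Random()
    others = [i for i in range(num_files) if i != source]
    # Assumption: never ask for more sinks than there are other files.
    hi = min(maxsinks, len(others))
    lo = min(minsinks, hi)
    count = rng.randint(lo, hi)        # inclusive on both ends
    return rng.sample(others, count)   # distinct files, never the source

# Four files, each source contaminating 2 or 3 of the other three.
rng = random.Random(0)
for src in range(4):
    print(src, choose_sinks(4, src, minsinks=2, maxsinks=3, rng=rng))
```

Because the count and the selection are drawn independently for every source, two sources will generally end up with different sink sets.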

Probability Distribution

The assignSinks() method converts minprob and maxprob to log values with Math.log(), samples uniformly in log space with Random.nextDouble(), and converts back with Math.pow(Math.E, ...). Sampling this way (a log-uniform distribution, referred to here as geometric) favors low contamination rates while still allowing occasional high ones.

The probability assignment implementation:

Convert probability bounds: minProbPow = Math.log(minprob), maxProbPow = Math.log(maxprob), probRange = maxProbPow - minProbPow
Sample uniformly in log space: c = Math.pow(Math.E, minProbPow + randy.nextDouble() * probRange)
Decrement remaining probability mass by the sampled rate: remaining -= c
Convert to a cumulative distribution and reverse the list, so a read's destination is found in one pass over the thresholds
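
The steps above can be sketched in Python as follows. This is a reading of the description rather than a port of assignSinks(): the function name, the "source" label, and the choice of lower-bound thresholds (the source owns the interval starting at 0, each sink the next slice up) are all assumptions made for illustration.

```python
import math
import random

def assign_thresholds(sinks, minprob=0.000005, maxprob=0.025, rng=None):
    """Return [(name, threshold), ...] in lookup order, source last."""
    if rng is None:
        rng = random.Random()
    min_log = math.log(minprob)
    max_log = math.log(maxprob)
    log_range = max_log - min_log

    remaining = 1.0          # share of reads the source keeps
    rates = []
    for name in sinks:
        # Uniform in log space, then back to a probability.
        c = math.exp(min_log + rng.random() * log_range)
        remaining -= c
        rates.append((name, c))

    # Assumed layout: source owns [0, remaining), each sink owns the
    # next slice of width c above it.
    entries = [("source", 0.0)]
    lower = remaining
    for name, c in rates:
        entries.append((name, lower))
        lower += c
    entries.reverse()        # highest threshold first, source last
    return entries

for name, threshold in assign_thresholds(["B", "C"], rng=random.Random(7)):
    print(name, threshold)
```

With the default bounds and at most 8 sinks, the sampled rates sum to at most 0.2, so the source always keeps the large majority of its own reads.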

Read Distribution

The addRead() method draws a value p from Random.nextDouble() and compares it against the cumulative probability distribution. Each Vessel has a prob field holding its cumulative threshold, and the read goes to the first Vessel for which p >= v.prob. Collections.reverse() puts the source file, which keeps the largest share of reads, last in the iteration order.
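
A minimal Python sketch of this lookup is shown below. It is not the addRead() source; the function name and the threshold values are hypothetical, and it assumes thresholds are lower bounds sorted in descending order with the source at 0.0.

```python
import random
from collections import Counter

def pick_destination(entries, p):
    """Return the first destination whose threshold is <= p.

    `entries` must be sorted by descending threshold and end with a
    threshold of 0.0, so every p in [0, 1) matches something.
    """
    for name, threshold in entries:
        if p >= threshold:
            return name
    raise ValueError("no entry matched; is the last threshold 0.0?")

# Hypothetical thresholds: the source keeps 98.9% of its reads,
# sinkA receives 1% and sinkB receives 0.1%.
entries = [("sinkB", 0.999), ("sinkA", 0.989), ("source", 0.0)]

rng = random.Random(42)
counts = Counter(pick_destination(entries, rng.random())
                 for _ in range(100_000))
print(counts)
```

Checking the rare sinks first and falling through to the source means every draw is matched, and the source needs no upper bound of its own.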

Memory Efficiency

The implementation uses ByteStreamWriter with FileFormat.testOutput() for stream-based I/O handling. ByteFile.FORCE_MODE_BF2 is automatically enabled when Shared.threads() > 2 for improved throughput. ConcurrentReadInputStream processes reads in ListNum batches to minimize memory allocation overhead.

Optional Shuffling

When shuffle=t, the shuffle() method launches ShuffleThread instances from the sort.Shuffle class for each output file. Shuffle.setMaxThreads(shufflethreads) configures the thread pool, and Shuffle.waitForFinish() synchronizes completion. This randomizes read order within files after contamination processing is complete.
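
To make the effect concrete, the sketch below shuffles the records of a single uncompressed FASTQ file in Python. It does not reproduce sort.Shuffle or its threading; it assumes plain four-line records and loads the whole file into memory, and the function name is made up for this example.

```python
import random

def shuffle_fastq(path_in, path_out, seed=None):
    """Shuffle 4-line FASTQ records; loads the whole file into memory."""
    with open(path_in) as fh:
        lines = fh.read().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    # Group header, bases, plus line, and qualities into one record.
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]
    random.Random(seed).shuffle(records)
    with open(path_out, "w") as fh:
        for rec in records:
            fh.write("\n".join(rec) + "\n")
    return len(records)
```

Holding every record in memory at once is what makes shuffling more memory-hungry than the streaming contamination step, and running several files in parallel multiplies that cost.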

Performance Characteristics

Memory Usage: Stream-based processing with ConcurrentReadInputStream and ByteStreamWriter avoids loading entire files into memory
I/O Strategy: ByteFile.FORCE_MODE_BF2 selected automatically when threads > 2 for improved throughput on multi-core systems
Thread Management: Configurable shufflethreads parameter controls Shuffle thread pool size, separate from main processing threads
File Format Support: FileFormat.testInput() handles FASTQ detection and compression (gzip/pigz), preserves quality scores through ByteStreamWriter

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org