BBFakeReads

Script: bbfakereads.sh Package: synth Class: FakeReads.java

Generates fake read pairs from the ends of contigs or single reads. Intended specifically for simulating a fake long-mate-pair (LMP) library from long reads or an assembly; for synthetic read generation from a reference, see randomreads.sh or randomreadsmg.sh.

Basic Usage

bbfakereads.sh in=<file> out=<outfile> out2=<outfile2>

Out2 is optional; if there is only one output file, it will be written interleaved.

Parameters

BBFakeReads parameters are organized into three main categories: standard file handling parameters, read generation parameters specific to the faking process, and Java virtual machine configuration options.

Standard parameters

ow=f
(overwrite) Overwrites files that already exist. When set to true, existing output files will be replaced without warning.
zl=4
(ziplevel) Set compression level for gzipped output files, from 1 (fastest, lowest compression) to 9 (slowest, highest compression). Level 4 provides a good balance of speed and compression.
fastawrap=100
Length of lines in fasta output. Controls how many bases are written per line in FASTA format files. Set to 0 for no line wrapping.
tuc=f
(touppercase) Change lowercase letters in reads to uppercase. Useful for standardizing sequence case in output.
qin=auto
ASCII offset for input quality scores. May be 33 (Sanger/Illumina 1.8+), 64 (Illumina 1.3-1.7), or auto for automatic detection.
qout=auto
ASCII offset for output quality scores. May be 33 (Sanger), 64 (Illumina), or auto (same as input). Controls the encoding of quality scores in output files.
qfin=<.qual file>
Read qualities from this separate qual file, for the reads coming from a FASTA input file. Used when quality scores are stored separately from sequence data.
qfout=<.qual file>
Write qualities to this separate qual file, for the reads going to the first output file. Creates a separate quality file for FASTA output.
qfout2=<.qual file>
Write qualities to this separate qual file, for the reads going to the second output file (out2). Used when writing paired files in FASTA format.
verifyinterleaved=f
(vint) When true, checks an input file to see if the read names look properly paired. Prints an error message if interleaved format is expected but not detected.
tossbrokenreads=f
(tbr) Discard reads that have different numbers of bases and quality scores. By default, such reads are detected and cause the program to crash, so corrupt input does not pass silently into the output.
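
To illustrate what qin and qout control, the following Python sketch decodes a quality string under one ASCII offset and re-encodes it under another. The function names are hypothetical and are not part of BBTools; the sketch only demonstrates the offset-33 and offset-64 conventions named above.

```python
def decode_quals(qual_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]


def encode_quals(scores, offset=33):
    """Convert a list of Phred scores back to an ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)


# Decode with offset 33 (like qin=33), then re-encode with offset 64 (like qout=64).
scores = decode_quals("II5!", offset=33)
recoded = encode_quals(scores, offset=64)
```

Reading a file with the wrong offset shifts every score by 31, which is why auto-detection or an explicit qin value matters for older Illumina data.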

Faking parameters

length=250
Generate reads of this exact length in bases. This determines the size of each fake read created from the input sequences.
minlength=1
Don't generate reads shorter than this length. Input sequences shorter than minlength + overlap will be skipped entirely.
overlap=0
Setting overlap to a positive value enables split mode: reads become variable-length and overlap by 'overlap' bases in the middle of the input sequence. In split mode, read length is calculated as (sequence_length + overlap + 1) / 2, capped at the sequence length, rather than taken from the length parameter.
identifier=null
(id) Output read names are prefixed with this string. If specified, read names will be formatted as "identifier_numericID /1" and "identifier_numericID /2" for paired reads.
addspace=t
Set to false to omit the space character before /1 and /2 suffixes in paired read names. When true, names are formatted as "name /1"; when false, as "name/1".

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. Examples: -Xmx20g specifies 20 gigabytes of RAM, -Xmx200m specifies 200 megabytes. The maximum is typically 85% of physical memory.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92 or later. Useful for preventing partial output from memory-exhausted runs.
-da
Disable assertions. Can provide a minor performance improvement by skipping internal consistency checks during execution.

Examples

Basic Mate-Pair Simulation

bbfakereads.sh in=assembly.fasta out=fake_mp_R1.fq out2=fake_mp_R2.fq length=100

Creates 100bp fake mate-pair reads from both ends of each contig in the assembly, simulating a mate-pair library.

Variable Length with Overlap

bbfakereads.sh in=contigs.fasta out=variable_reads.fq overlap=50 minlength=75

Generates variable-length reads that overlap by 50bp in the middle of each input sequence. Read length will be (sequence_length + 50 + 1) / 2. With minlength=75, input sequences shorter than 125bp (minlength + overlap) are skipped.
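
As a worked check of this arithmetic, the Python sketch below uses a hypothetical 1000bp contig and assumes the division is integer (floor) division. The variable names are illustrative only.

```python
seq_len = 1000   # hypothetical contig length
overlap = 50
minlength = 75

# Inputs shorter than minlength + overlap are skipped.
is_skipped = seq_len < minlength + overlap

# Split-mode read length, capped at the sequence length.
read_len = min(seq_len, (seq_len + overlap + 1) // 2)

# 0-based, half-open coordinates of each read on the contig.
read1_span = (0, read_len)
read2_span = (seq_len - read_len, seq_len)

# Number of contig bases covered by both reads.
shared = read1_span[1] - read2_span[0]

print(read_len, read1_span, read2_span, shared)
```

Under these assumptions both reads are 525bp long and share the 50 bases between positions 475 and 525.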

Custom Read Naming

bbfakereads.sh in=scaffolds.fa out=named_reads.fq identifier=fake_lib addspace=f

Creates reads with custom prefixed names like "fake_lib_1/1" and "fake_lib_1/2", without spaces before the pair indicators.
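
The naming rule can be sketched in Python as follows. The function pair_names is hypothetical, and the behavior when no identifier is given (using the bare numeric ID) is an assumption rather than something this page states.

```python
def pair_names(numeric_id, identifier=None, addspace=True):
    """Return (name1, name2) for one fake read pair."""
    # Assumption: with no identifier, the bare numeric ID is used as the name.
    base = f"{identifier}_{numeric_id}" if identifier else str(numeric_id)
    # addspace controls whether a space precedes the /1 and /2 suffixes.
    sep = " " if addspace else ""
    return f"{base}{sep}/1", f"{base}{sep}/2"


names_no_space = pair_names(1, identifier="fake_lib", addspace=False)
names_default = pair_names(1, identifier="fake_lib")
```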

Interleaved Output

bbfakereads.sh in=input.fasta out=interleaved_fake.fq length=150

Produces 150bp fake read pairs in a single interleaved output file, where each pair of reads appears consecutively.
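
For readers unfamiliar with the interleaved layout, this Python sketch shows the record order: each Read 1 is immediately followed by its Read 2. The function name and the toy records are hypothetical and are not output from bbfakereads.sh.

```python
def interleave_fastq(pairs):
    """Yield FASTQ lines with each Read 1 immediately followed by its Read 2."""
    for (name1, bases1, quals1), (name2, bases2, quals2) in pairs:
        yield from (f"@{name1}", bases1, "+", quals1)
        yield from (f"@{name2}", bases2, "+", quals2)


# Two toy pairs, each as ((name, bases, quals), (name, bases, quals)).
pairs = [
    (("0 /1", "ACGT", "IIII"), ("0 /2", "TTGA", "IIII")),
    (("1 /1", "GGCA", "IIII"), ("1 /2", "CATG", "IIII")),
]
lines = list(interleave_fastq(pairs))
```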

Algorithm Details

Read Generation Strategy

BBFakeReads implements a dual-end extraction algorithm that simulates mate-pair or paired-end libraries from long input sequences:

  1. Input Processing: Reads input sequences from FASTA or FASTQ files, processing them individually without pairing requirements.
  2. Length Validation: Filters input sequences based on minimum length requirements. Sequences shorter than minReadLength, or shorter than (minReadLength + overlap) when overlap is specified, are discarded.
  3. Read Extraction: For each valid input sequence:
    • Extract the first 'length' bases from the 5' end (bases 0 to length-1)
    • Extract the last 'length' bases from the 3' end (bases sequence_length-length to sequence_length-1)
    • In split mode (when overlap > 0), length is calculated as min(sequence_length, (sequence_length + overlap + 1) / 2)
  4. Sequence Processing:
    • The first extracted read (5' end) becomes Read 1 and is used as-is
    • The second extracted read (3' end) becomes Read 2 and is reverse-complemented using AminoAcid.reverseComplementBasesInPlace()
    • Quality scores are extracted correspondingly and reversed for Read 2 using Tools.reverseInPlace()
  5. Pair Construction: Creates paired Read objects with:
    • Proper mate references linking the two reads
    • Sequential numeric IDs for tracking
    • Standard /1 and /2 suffixes with optional spacing
    • Appropriate pair flags (Read.PAIRNUMMASK for Read 2)
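
Steps 2 through 4 can be sketched in Python as shown below. This is an illustrative reimplementation, not the FakeReads.java code: the function names are hypothetical, integer division is assumed, and capping the fixed-length mode at the sequence length is an assumption (the text above states the cap only for split mode).

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(bases):
    """Complement each base, then reverse the string."""
    return bases.translate(_COMPLEMENT)[::-1]


def fake_pair(bases, quals=None, length=250, minlength=1, overlap=0):
    """Return ((bases1, quals1), (bases2, quals2)), or None if the input is skipped."""
    seq_len = len(bases)
    # Step 2: discard inputs shorter than minlength + overlap.
    if seq_len < minlength + overlap:
        return None
    # Step 3: pick the read length; split mode applies when overlap > 0.
    if overlap > 0:
        read_len = min(seq_len, (seq_len + overlap + 1) // 2)
    else:
        # Assumption: fixed-length mode is also capped at the sequence length.
        read_len = min(seq_len, length)
    start2 = seq_len - read_len
    # Step 4: Read 1 is used as-is; Read 2 is reverse-complemented.
    bases1 = bases[:read_len]
    bases2 = reverse_complement(bases[start2:])
    # Qualities follow the bases; Read 2 qualities are reversed, not complemented.
    quals1 = quals[:read_len] if quals is not None else None
    quals2 = quals[start2:][::-1] if quals is not None else None
    return (bases1, quals1), (bases2, quals2)


fixed = fake_pair("AACCGGTTAC", "0123456789", length=4)
split = fake_pair("AACCGGTTAC", overlap=2)
```

Keeping the skip rule, the length choice, and the end extraction in one function mirrors the order of the steps listed above and makes each one easy to compare against the text.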

Memory Management

The algorithm uses KillSwitch.copyOfRange() for array copying, so that a failed allocation during sequence extraction stops the program cleanly instead of leaving a partially processed run. Default memory allocation is 600MB, which is sufficient for most applications but can be increased with -Xmx for large-scale processing.

File Format Support

Supports all standard sequence formats, such as FASTA and FASTQ, through the FileFormat system.

Performance Characteristics

Processing time scales linearly with input sequence count and length. Memory usage remains constant regardless of input size due to streaming processing. The algorithm processes approximately 1M bp/second on typical hardware.
