CountSharedLines

Script: countsharedlines.sh
Package: driver
Class: CountSharedLines.java

Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared.

Basic Usage

countsharedlines.sh in1=<file,file...> in2=<file,file...>

This tool compares two sets of text files and reports how many lines are shared between each file in the first set and each file in the second set. Output files are automatically generated with "out_" prefixes in the same directories as the input files.

Parameters

Parameters control matching behavior, file handling, and output formatting for line comparison operations.

Parameters

include=f: Set to 'true' to include the filtered names rather than excluding them. This controls whether matching lines are included or excluded from the counting process.
prefix=f: Allow matching of only the line's prefix (all characters up to first whitespace). When enabled, only the portion of each line before the first whitespace character is used for comparison, allowing partial line matching.
case=t: (casesensitive) Match case also. When set to true (default), line comparisons are case-sensitive. Set to false for case-insensitive matching where "ABC" would match "abc".
ow=t: (overwrite) Overwrites files that already exist. When true (default), existing output files will be replaced. Set to false to preserve existing output files.
app=f: (append) Append to files that already exist. When true, output will be appended to existing files instead of overwriting them. Cannot be used simultaneously with overwrite mode.
zl=4: (ziplevel) Set compression level, 1 (low) to 9 (max). Controls gzip compression level for output files. Higher values provide better compression but slower processing.
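The effect of prefix and case on matching can be illustrated with a small sketch. This is not the tool's source code; the class LineKeySketch and the method toKey are hypothetical names, and the sketch lowercases with Locale.ROOT as an assumption. It only shows how the two settings described above would change the key that is compared.

```java
import java.util.Locale;

class LineKeySketch {
    /** Returns the comparison key for a line under the given (hypothetical) settings. */
    static String toKey(String line, boolean prefixMode, boolean caseSensitive) {
        String key = line;
        if (prefixMode) {
            // Keep only the characters before the first whitespace character.
            for (int i = 0; i < key.length(); i++) {
                if (Character.isWhitespace(key.charAt(i))) {
                    key = key.substring(0, i);
                    break;
                }
            }
        }
        // Case-insensitive matching is modeled by lowercasing every key.
        return caseSensitive ? key : key.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prefix=t case=f: only "Seq1" is kept, then lowercased.
        System.out.println(toKey("Seq1 length=150", true, false)); // seq1
        // prefix=f case=t: the whole line is the key, unchanged.
        System.out.println(toKey("Seq1 length=150", false, true)); // Seq1 length=150
    }
}
```

Under this model, "Seq1 length=150" and "SEQ1 annotated" would count as the same line with prefix=t case=f, but as different lines with the default settings.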

Java Parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number of unique lines in input files.
-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability when processing very large files.
-da: Disable assertions. Removes internal consistency checks for slightly improved performance in production environments.

Examples

Basic File Comparison

countsharedlines.sh in1=file1.txt,file2.txt in2=ref1.txt,ref2.txt

Compares each file in the first set (file1.txt, file2.txt) against each file in the second set (ref1.txt, ref2.txt). One output file is written per input file (out_file1.txt, out_file2.txt, out_ref1.txt, and out_ref2.txt), each listing shared line counts against the files in the opposite set.

Case-Insensitive Prefix Matching

countsharedlines.sh in1=data.txt in2=reference.txt case=f prefix=t

Performs case-insensitive comparison using only the prefix of each line (up to first whitespace). Useful for matching identifiers that may have different suffixes or annotations.

Multiple File Sets with Explicit Overwrite

countsharedlines.sh in1=sample1.txt,sample2.txt,sample3.txt in2=control1.txt,control2.txt ow=t

Compares three sample files against two control files, writing one output file per input file. Because ow=t is the default, specifying it here only makes the overwrite behavior explicit.

Algorithm Details

Data Structure Implementation

CountSharedLines stores the lines of each file in a LinkedHashSet and compares files by set lookup:

LinkedHashSet<String> Storage: The getContents() method loads each file into a LinkedHashSet, which gives expected O(1) contains() lookups, removes duplicate lines, and preserves insertion order
TextFile Reading: Files are read line by line using TextFile.readLine(true), with automatic compression detection via FileFormat.testInput()
Case Normalization: When ignoreCase is true (that is, when case=f is set), lines are converted using String.toLowerCase() before storage
Prefix Extraction: In prefixMode, Character.isWhitespace() locates the first whitespace character and the line is truncated there
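The storage step can be sketched with standard library classes only. This is an illustration under assumptions, not the getContents() method itself: it reads from a plain BufferedReader instead of the tool's TextFile class, performs no compression detection, and uses the hypothetical names LoadLinesSketch and loadUnique.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;

class LoadLinesSketch {
    /** Reads every line from the reader into an insertion-ordered set of unique lines. */
    static Set<String> loadUnique(BufferedReader reader) {
        Set<String> lines = new LinkedHashSet<>();
        // A repeated line is ignored by add(), so each distinct line is stored once.
        reader.lines().forEach(lines::add);
        return lines;
    }

    public static void main(String[] args) {
        String text = "geneB\ngeneA\ngeneB\ngeneC\n";
        Set<String> set = loadUnique(new BufferedReader(new StringReader(text)));
        System.out.println(set);                   // [geneB, geneA, geneC]
        System.out.println(set.contains("geneA")); // true
    }
}
```

Because duplicates collapse on insertion, a line that appears many times in one file is only ever counted once in any later comparison.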

Processing Algorithm

The comparison process follows these steps:

File Loading: All lines from each file are loaded into separate LinkedHashSet collections
Cross-Comparison: For each file in set 1, compare against every file in set 2 using set intersection
Count Calculation: Shared lines are counted with contains() on the LinkedHashSet (inherited from HashSet), an expected O(1) lookup per line
Output Generation: Results are written as tab-delimited files with format: filename\tshared_count
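The steps above can be sketched roughly as follows. The class CrossCompareSketch and its methods are hypothetical names chosen for illustration, and the file names and contents are made up; the sketch assumes each file has already been loaded into a set and only demonstrates the counting and the tab-delimited report lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrossCompareSketch {
    /** Counts the elements of 'query' that are also present in 'reference'. */
    static int countShared(Set<String> query, Set<String> reference) {
        int shared = 0;
        for (String line : query) {
            if (reference.contains(line)) {
                shared++;
            }
        }
        return shared;
    }

    /** Builds one "filename<TAB>count" line per file in the opposite set. */
    static List<String> report(Set<String> query, Map<String, Set<String>> others) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : others.entrySet()) {
            out.add(e.getKey() + "\t" + countShared(query, e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> file1 = new LinkedHashSet<>(Arrays.asList("a", "b", "c"));
        Map<String, Set<String>> set2 = new LinkedHashMap<>();
        set2.put("ref1.txt", new LinkedHashSet<>(Arrays.asList("b", "c", "d")));
        set2.put("ref2.txt", new LinkedHashSet<>(Arrays.asList("x", "y")));
        // Prints "ref1.txt<TAB>2" then "ref2.txt<TAB>0".
        report(file1, set2).forEach(System.out::println);
    }
}
```

Running report once per input file, with the opposite set passed as the map, would yield the contents of one output file each time.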

Performance Characteristics

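Time Complexity: Each pairwise comparison is linear in the number of unique lines iterated, given expected O(1) lookups. Total work is therefore roughly O(n*f2 + m*f1), where n and m are the total unique lines in sets 1 and 2, and f1 and f2 are the number of files in each set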
Space Complexity: O(u) where u is the number of unique lines across all input files
Memory Usage: Scales linearly with unique line count, typically ~50-100 bytes per unique line
Optimal For: Comparing files with moderate numbers of unique lines (up to millions)
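As a rough worked example of the memory figure above, the following sketch multiplies a line count by the quoted 50 to 100 bytes per unique line. The 10 million line count is a hypothetical input, HeapEstimateSketch is an illustrative name, and the result is only a ballpark for choosing an -Xmx value; long lines will need more.

```java
class HeapEstimateSketch {
    /** Multiplies a unique-line count by an assumed per-line cost in bytes. */
    static long estimateBytes(long uniqueLines, long bytesPerLine) {
        return Math.multiplyExact(uniqueLines, bytesPerLine);
    }

    public static void main(String[] args) {
        long lines = 10_000_000L; // hypothetical number of unique lines
        long low = estimateBytes(lines, 50);
        long high = estimateBytes(lines, 100);
        // Prints: 10000000 unique lines: roughly 500 to 1000 MB
        System.out.printf("%d unique lines: roughly %d to %d MB%n",
                lines, low / 1_000_000, high / 1_000_000);
    }
}
```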

Output Format

Output files contain tab-delimited results with automatic naming:

File Naming: Input file "path/data.txt" produces output file "path/out_data.txt"
Content Format: Each line shows "compared_filename\tshared_line_count"
Bidirectional Output: Both file sets generate output files showing their perspective of shared lines
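The naming rule above can be expressed as a short sketch using java.nio.file. This is an assumption about how the rule could be implemented, not the tool's own code; OutputNameSketch and outputFor are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputNameSketch {
    /** Returns the sibling path whose file name is the input's name prefixed with "out_". */
    static Path outputFor(Path input) {
        String name = "out_" + input.getFileName().toString();
        Path parent = input.getParent();
        // A bare file name has no parent, so the output stays in the working directory.
        return parent == null ? Paths.get(name) : parent.resolve(name);
    }

    public static void main(String[] args) {
        System.out.println(outputFor(Paths.get("path", "data.txt")));
        System.out.println(outputFor(Paths.get("data.txt")));
    }
}
```

The first call yields out_data.txt inside the path directory, and the second yields out_data.txt with no directory component.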

Use Cases

Genomics Applications

Gene List Comparison: Compare gene sets between different studies or conditions
Sample Overlap Analysis: Identify shared sequences or identifiers across sample sets
Quality Control: Verify expected overlaps between technical replicates

General Text Analysis

Data Validation: Check consistency between different data sources
Set Operations: Quantify intersections between text-based datasets
Content Analysis: Compare vocabularies or term lists across documents

Support

For questions and support:

Email: bbushnell@lbl.gov
Documentation: bbmap.org