CountSharedLines

Script: countsharedlines.sh
Package: driver
Class: CountSharedLines.java

Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared.

Basic Usage

countsharedlines.sh in1=<file,file...> in2=<file,file...>

This tool compares two sets of text files and reports how many lines are shared between each file in the first set and each file in the second set. Output files are automatically generated with "out_" prefixes in the same directories as the input files.

Parameters

Parameters control matching behavior, file handling, and output formatting for line comparison operations.

Parameters

include=f
Set to 'true' to include the filtered names rather than excluding them. This controls whether matching lines are included or excluded from the counting process.
prefix=f
Match only each line's prefix (all characters up to the first whitespace). When enabled, two lines that share a prefix but differ after the first whitespace character are treated as the same line.
case=t
(casesensitive) Match case as well. When true (default), line comparisons are case-sensitive. Set to false for case-insensitive matching, where "ABC" matches "abc".
ow=t
(overwrite) Overwrites files that already exist. When true (default), existing output files will be replaced. Set to false to preserve existing output files.
app=f
(append) Append to files that already exist. When true, output will be appended to existing files instead of overwriting them. Cannot be used simultaneously with overwrite mode.
zl=4
(ziplevel) Set the gzip compression level for output files, from 1 (low) to 9 (max). Higher values compress better but run slower.

Java Parameters

-Xmx
This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. Memory usage scales with the number of unique lines in input files.
-eoom
This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. Useful for preventing system instability when processing very large files.
-da
Disable assertions. Removes internal consistency checks for slightly improved performance in production environments.

Examples

Basic File Comparison

countsharedlines.sh in1=file1.txt,file2.txt in2=ref1.txt,ref2.txt

Compares each file in the first set (file1.txt, file2.txt) against each file in the second set (ref1.txt, ref2.txt). Creates output files out_file1.txt and out_file2.txt showing shared line counts.

Case-Insensitive Prefix Matching

countsharedlines.sh in1=data.txt in2=reference.txt case=f prefix=t

Performs case-insensitive comparison using only the prefix of each line (up to first whitespace). Useful for matching identifiers that may have different suffixes or annotations.
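
As a rough illustration of what case=f prefix=t implies for each line, the sketch below reduces a line to a comparison key. The class and method names (LineKeySketch, toKey) are hypothetical and not taken from CountSharedLines.java, and modeling case-insensitivity by lowercasing is an assumption.

```java
import java.util.Locale;

final class LineKeySketch {
    // Hypothetical helper: reduces a line to the key that would be compared.
    static String toKey(String line, boolean prefixOnly, boolean caseSensitive) {
        String key = line;
        if (prefixOnly) {
            // Keep only the characters before the first whitespace character.
            for (int i = 0; i < key.length(); i++) {
                if (Character.isWhitespace(key.charAt(i))) {
                    key = key.substring(0, i);
                    break;
                }
            }
        }
        // Assumption: case-insensitive matching is modeled by lowercasing the key.
        return caseSensitive ? key : key.toLowerCase(Locale.ROOT);
    }
}
```

Under these assumptions, "Seq1 length=100" and "SEQ1" followed by a tab and an annotation both reduce to the key "seq1", so they would count as one shared line.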

Multiple File Sets with Explicit Overwrite

countsharedlines.sh in1=sample1.txt,sample2.txt,sample3.txt in2=control1.txt,control2.txt ow=t

Compares three sample files against two control files, creating one output file per input file. The ow=t parameter, which is also the default, states explicitly that existing output files are overwritten.

Algorithm Details

Data Structure Implementation

CountSharedLines implements line comparison using LinkedHashSet collections, one per input file.

Processing Algorithm

The comparison process follows these steps:

  1. File Loading: All lines from each file are loaded into separate LinkedHashSet collections
  2. Cross-Comparison: For each file in set 1, compare against every file in set 2 using set intersection
  3. Count Calculation: Shared lines are counted using HashSet.contains(), an average O(1) lookup per line
  4. Output Generation: Results are written as tab-delimited files with format: filename\tshared_count
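
The four steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the tool's actual source: the class and method names (SharedLinesSketch, toSet, countShared, report) are hypothetical, and in-memory collections stand in for file reading.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SharedLinesSketch {
    // Step 1: collapse a file's lines into a set of unique lines, keeping first-seen order.
    static Set<String> toSet(Collection<String> lines) {
        return new LinkedHashSet<>(lines);
    }

    // Steps 2-3: count the lines of 'query' that also occur in 'ref'.
    // Each contains() call is a hash lookup, so the loop is linear in the size of 'query'.
    static int countShared(Set<String> query, Set<String> ref) {
        int shared = 0;
        for (String line : query) {
            if (ref.contains(line)) {
                shared++;
            }
        }
        return shared;
    }

    // Step 4: one "filename<TAB>shared_count" row per file in the other set.
    static List<String> report(Set<String> query, List<String> refNames, List<Set<String>> refSets) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < refNames.size(); i++) {
            rows.add(refNames.get(i) + "\t" + countShared(query, refSets.get(i)));
        }
        return rows;
    }
}
```

Because each file is held as a set, a line repeated within one file contributes at most once to the count in this sketch.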

Performance Characteristics

Output Format

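Output files contain tab-delimited results, one line per compared file in the form filename\tshared_count. Each output file is named automatically by adding the "out_" prefix to the input file name and is written to the same directory as that input file.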
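
The naming rule can be illustrated with a small sketch: an "out_" prefix on the file name, in the same directory as the input. The class and method names (OutputNameSketch, outputPathFor) are hypothetical, and how the real tool handles compressed extensions or unusual paths is not covered here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class OutputNameSketch {
    // Hypothetical helper: derives the output path for one input file.
    static Path outputPathFor(String input) {
        Path in = Paths.get(input);
        // Prefix only the file name, not the directory portion of the path.
        String outName = "out_" + in.getFileName().toString();
        Path parent = in.getParent();
        // A bare file name has no parent; the output then sits next to it in the working directory.
        return parent == null ? Paths.get(outName) : parent.resolve(outName);
    }
}
```

Under this assumption, an input of data/file1.txt maps to data/out_file1.txt, and a bare file1.txt maps to out_file1.txt.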

Use Cases

Genomics Applications

General Text Analysis

Support

For questions and support: