PatternJoin

This project is a tool for words edit similarity joins (a.k.a. all-pairs similarity search) under small ($< 3$) edit distance constraints. The software is specifically optimized for edit similarity joins of amino-acid sequences from TCR/BCR profiles, where the number of words is relatively large ($10^5-10^6$) and the average length of words is relatively small ($< 20$).

What is Edit Similarity Join?

Edit distance between two words is the minimal number of elementary operations needed to transform one word to another. In this project, we consider Levenshtein distance (allows substitutions and insertions/deletions of single letters) and Hamming distance (allows substitutions of any letters and insertions/deletions of letters in the end).

Edit Similarity Join is a specific problem when we have a set of words $X=\{w_1, w_2, \ldots, w_n\}$ and we want to find all pairs of 'similar words' $Y_c=\{(w_i, w_j): distance(w_i, w_j) < c\}$, where $c \in \mathbb{N}_0$ is called a cutoff parameter.

Motivation

Let's briefly introduce the problem why this project was originally developed.

Intro to TCR(BCR) profiles

T-cells and B-cells are types of white blood immune cells which play a major role in human immunity. On the surface of each T/B-cell we can find receptors (TCR/BCR) - special molecules that are responsible for recognizing antigens. The collection of all TCRs/BCRs is called TCR (BCR) profile. One way to represent an organism's TCR/BCR profile is to collect the genetic sequences of most variable region (CDR3) of each TCR/BCR. Recent research (for instance [1], [2]) shows that it is useful to (1) represent these profiles as graphs with nodes being TCR/BCR and edge weights edit distances between nodes and (2) run some network analysis on these graphs (take a look at the NAIR project if you are interested in network analysis in immunology). In order to build these graphs we need efficient edit similarity join software and this is the main purpose of this project.

Paper

In progress..🧙

Building

Build with CMake

From the command line (Windows/macOS/Linux):

> git clone https://github.com/MatveevDaniil/PatternJoin.git
> cd PatternJoin
> mkdir build
> cd build
> cmake ..
> make

Usage

Assuming you are in the build directory

macOS/Linux

> pattern_join --file_name <..> --cutoff <..> --metric_type <..> --method <..> --include_duplicates <..>

Windows

> pattern_join.exe --file_name <..> --cutoff <..> --metric_type <..> --method <..> --include_duplicates <..>

Arguments

<file_name>: The path to the input file.
<cutoff>: The edit distance cutoff (0, 1 or 2). If cutoff = 0, then the value of metric_type, method and include_duplicates does not matter.
<metric_type>: The edit distance metric (L for Levenshtein, H for Hamming).
<method>: The core method of edit similarity join (pattern, semi_pattern, or partition_pattern). As default, we recommend using partition_pattern as the most memory-efficient while still fast method. For more details take a look to the paper.
<include_duplicates>: Consider duplicates in input (true or false). If false the program will ignore duplicate strings in the input and output unique pairs of strings. If true program will treat duplicate strings in the input as a pair (index, string) and output pairs of indeces.

Input file format

List of words separated by \n: <word_1>\n<word_2>\n....

Output file format

If include_duplicates = true: Space-separated pairs of words separated by \n: <word_i> <word_j>\n.... If include_duplicates = false: Space-separated pairs of indeces separated by \n: <idx_i> <idx_j>\n....

Benchmarks

Competitors

The performance of PatternJoin project is compared to several state-of-the-art projects for exact edit similarity join. Exact means that we find all pairs with a given edit distance threshold. Several projects such as VChunk, EmbedJoin, MinJoin and others were excluded due to low recall on tested datasets while not being faster. All compared projects were obtained from original authors (see table below). All tests were done on the same laptop with 16gb RAM and 2.2 GHz CPU.

Algorithm	Paper	Source code
QChunk*	link	link
PassJoin	link	link
TrieJoin	link	link
AdaptJoin	link	link

*Although QChunk algorithm is designed for exact similarity join we noticed that it produce some wrong pairs and ommit some correct pairs of strings. However recall and precision were $>0.99$ in our tests so we decided to keep it.

TCR Profile

We obtained dataset from Cytomegalovirus research and took the largest TCR CDR3 Amino Acid profile (TODO: submit data) from it. There are $425080$ sequences in the profile with the average sequence length is $14.6$. Below are speed comparisons for edit distance threshold 1 and 2.

BCR Profile

(TODO: introduce data). The number of sequences is $464104$ with the average sequence length $18.7$. Below are speed comparisons for edit distance threshold 1 and 2.

Third-party software

Currently, the project is using third-party software for hashmap, hashsets, and inlined vectors. Hashmaps and hashset are taken from the ankerl project and inlined vector from gch project. You can easily replace any of these containers with any other implementations of hashmap/hashset/vector by changing the corresponding line in src/hash_containers.h:

// src/hash_containers.hpp
using str2int = your_map<std::string, int>;
using ints = your_vector<int>;
using str2ints = your_map<std::string, ints>;
using int_pair_set = your_set<std::pair<int, int>>;
using str_pair_set = your_set<std::pair<std::string, std::string>>

Roadmap

Finish documentation.
Create R/Python packages.
PAPER
Cover edit distances $\geq 3$.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
benchmarks		benchmarks
src		src
test		test
thirdparty		thirdparty
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PatternJoin

What is Edit Similarity Join?

Motivation

Intro to TCR(BCR) profiles

Paper

Building

Build with CMake

Usage

macOS/Linux

Windows

Arguments

Input file format

Output file format

Benchmarks

Competitors

TCR Profile

BCR Profile

Third-party software

Roadmap

About

Releases

Packages

Languages

License

MatveevDaniil/PatternJoin

Folders and files

Latest commit

History

Repository files navigation

PatternJoin

What is Edit Similarity Join?

Motivation

Intro to TCR(BCR) profiles

Paper

Building

Build with CMake

Usage

macOS/Linux

Windows

Arguments

Input file format

Output file format

Benchmarks

Competitors

TCR Profile

BCR Profile

Third-party software

Roadmap

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages