pqai-reranker
This service provides high-latency, high-accuracy models for determining text similarity.
Here is how they work: you give them two text snippets (typically a query and a document) and they return a numerical value (a score; see the note below*) that signifies how similar the two are in meaning.
If you have read the documentation of the pqai-index and pqai-encoder components, you will have seen that similarity can also be determined through vector representations of text: you convert the two texts into vectors (using a deep neural network, for example), and the cosine similarity of the two resulting vectors can be taken as a proxy for their semantic similarity.
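For illustration, here is a minimal sketch of that vector-based approach; the vectors below are made up, whereas in PQAI they would come from the pqai-encoder service:

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; higher means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" of a query and two documents
query_vec = np.array([0.9, 0.1, 0.3, 0.0])
doc_a_vec = np.array([0.8, 0.2, 0.4, 0.1])  # close in meaning to the query
doc_b_vec = np.array([0.0, 0.9, 0.1, 0.7])  # unrelated to the query

print(cosine_similarity(query_vec, doc_a_vec))  # relatively high (~0.98)
print(cosine_similarity(query_vec, doc_b_vec))  # relatively low (~0.11)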
Then where do rerankers fit in?
Unlike representation-based models, which create an independent representation of a text (such as a vector) that can be pre-computed and stored on disk, reranker models do not create such a representation.
Instead, they ingest the two texts at run-time and then return a similarity score. The practical effect of this is on latency, particularly when you have a lot of documents (say, a million): since reranker models must ingest each document at run-time (that is, after the user has submitted the query), they take much longer.
The upside of reranker models is that they determine similarity with greater accuracy.
In a real-world scenario, and in PQAI, the search is therefore done in two stages: first with a fast, vector-based approach, and then, in a second stage, a reranker model re-orders the first-stage results in descending order of similarity.
*NOTE: Whether semantic similarity (i.e., similarity of meaning) increases or decreases with the numerical score depends on the type of model. Some models produce distance-based scores; for them, a score of 2.2 means lower similarity than a score of 1.3. The opposite holds for models that produce similarity-based scores. This is, of course, a surface-level difference, since the scores can be flipped by taking their additive inverse, but it is an implementation-level detail worth keeping in mind.
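To make the two-stage flow concrete, here is a minimal, self-contained sketch; the first-stage search and the reranker below are toy stand-ins (in PQAI those roles are played by the pqai-index service and the models in this repository):

corpus = [
    "A coffee making machine has been disclosed.",
    "A method of brewing espresso under pressure.",
    "A bicycle frame made of carbon fibre.",
]

def first_stage_search(query, top_n):
    """Stand-in for the fast vector-based retrieval stage."""
    return corpus[:top_n]

def rerank_score(query, document):
    """Stand-in for a slow-but-accurate reranker; here, plain word overlap."""
    return len(set(query.lower().split()) & set(document.lower().split()))

query = "coffee maker"
candidates = first_stage_search(query, top_n=2)     # stage 1: recall candidate documents
ranked = sorted(candidates,                         # stage 2: re-order them by reranker score
                key=lambda doc: rerank_score(query, doc),
                reverse=True)                       # descending similarity
# For a distance-based reranker (see note above), ascending order would be used instead.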
The repository is organized as follows:

root/
  |-- core/
  |     |-- custom_reranker.py      # an interaction-based custom reranker model
  |     |-- reranker.py             # defines a number of reranker models
  |     |-- matchpyramid.py         # MatchPyramid model for text similarity
  |-- assets/                       # files (e.g. ML models) used by core modules
  |-- tests/
  |     |-- test_custom_reranker.py
  |     |-- test_matchpyramid.py
  |     |-- test_reranker.py
  |     |-- test_server.py          # Tests for the REST API
  |-- main.py                       # Defines the REST API
  |
  |-- requirements.txt              # List of Python dependencies
  |
  |-- Dockerfile                    # Docker files
  |-- docker-compose.yml
  |
  |-- env                           # .env file template
  |-- deploy.sh                     # Script for setting up the service locally
The reranker module defines the following classes:
- Ranker
- ConceptMatchRanker

Ranker is an abstract class that defines the interface for any reranker. Essentially, it specifies that any reranker should expose two methods: score and rank.
The score method should return a numerical score when given two texts as arguments.

The rank method, on the other hand, must accept a textual query and a list of documents. It returns a list of ranks, one for each position in the input document list, with rank 0 denoting the most similar document. For example, if it returns [2, 0, 1], the document at index 0 of the input list has rank 2 (least similar) and the document at index 1 has rank 0 (most similar).
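As a rough illustration of this interface (not one of the models shipped in this repository), a toy class that mimics it and uses plain word overlap as its score might look as follows:

class ToyRanker:
    """Toy stand-in that mimics the Ranker interface."""

    def score(self, query, document):
        # similarity = number of words the two texts have in common
        return len(set(query.lower().split()) & set(document.lower().split()))

    def rank(self, query, documents):
        scores = [self.score(query, doc) for doc in documents]
        order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
        ranks = [0] * len(documents)
        for rank, doc_index in enumerate(order):
            ranks[doc_index] = rank             # rank 0 = most similar
        return ranks

ranker = ToyRanker()
documents = [
    "A bicycle frame made of carbon fibre",
    "A coffee making machine",
    "A coffee maker with a built-in grinder",
]
print(ranker.rank("coffee maker", documents))   # [2, 1, 0]: the last document is most similar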
The ConceptMatchRanker class defines a similarity model that uses Word Mover's Distance to quantify the similarity between two text snippets. It compares and pools the similarities between the word embeddings of the two inputs to arrive at a numerical score.
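Assuming it takes no constructor arguments and follows the Ranker interface described above (both assumptions, not verified here), typical usage would look like:

from core.reranker import ConceptMatchRanker

ranker = ConceptMatchRanker()
score = ranker.score("coffee brewing apparatus", "a machine for making espresso")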
The matchpyramid module is an implementation of the Pang et al. (2016) model, popularly known as MatchPyramid, fine-tuned here on a patent-ranking task.

The module exposes a method called calculate_similarity, which accepts two text snippets and returns a numerical similarity score.
import numpy as np
from core.matchpyramid import calculate_similarity

text1 = "This invention relates to coffee makers."
text2 = "A coffee making machine has been disclosed."

score = calculate_similarity(text1, text2)
assert isinstance(score, np.float32)
The custom_reranker module implements a custom interaction-based semantic similarity model. It defines the following classes:
- CustomRanker
- GloveWordEmbeddings
- VectorSequence
- Interaction
- InteractionMatrix

Except for CustomRanker, these classes are not used outside this module and should be considered internal implementation details of CustomRanker.
CustomRanker implements the score method specified by the abstract Ranker class in the reranker module. It accepts two input snippets (e.g., a query and a document) and returns their similarity score as a float value. This value increases with increasing similarity but has no upper bound.
Typical usage:

from core.custom_reranker import CustomRanker

reranker = CustomRanker()
query = "This is a red apple, which is a fruit"
document = "This is a green apple"
score = reranker.score(query, document)
The assets required to run this service are stored in the /assets directory.

When you clone the GitHub repository, the /assets directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-reranker.zip

After downloading, extract the zip file into the /assets directory.
(Alternatively, you can use the deploy.sh script to do this step automatically - see the next section.)
The assets contain the following files/directories:

- MatchPyramid_200_tokens/: files for the MatchPyramid model
- dfs.json: document frequencies for terms
- glove-dictionary.json: term:index mapping for a vocabulary
- glove-dictionary.variations.json: a mapping of lemmatized versions of words to their syntactic variations
- glove-vocab.json: a term vocabulary
- glove-vocab.lemmas.json: term:lemma mapping for a vocabulary
- glove-We.npy: GloVe word embeddings
- glove-Ww.npy: SIF (smooth-inverse-frequency) weights for terms
- stopwords.txt: list of patent-specific stopwords
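To give a sense of how the GloVe-related files fit together, here is an illustrative sketch; these assets are consumed internally by the core modules, and the exact array shapes and key formats are assumptions here:

import json
import numpy as np

# Map a term to its row in the embedding matrix and to its SIF weight
with open("assets/glove-dictionary.json") as f:
    term_to_index = json.load(f)              # term -> index (assumed format)

embeddings = np.load("assets/glove-We.npy")   # GloVe embeddings, one row per term (assumed)
sif_weights = np.load("assets/glove-Ww.npy")  # one SIF weight per term (assumed)

idx = term_to_index["coffee"]                 # assumes "coffee" is in the vocabulary
vector = embeddings[idx]                      # the GloVe embedding of "coffee"
weight = sif_weights[idx]                     # its smooth-inverse-frequency weight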
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository.

   git clone https://github.com/pqaidevteam/pqai-reranker.git

2. Using the env template in the repository, create a .env file and set the environment variables.

   cd pqai-reranker
   cp env .env
   nano .env

3. Run the deploy.sh script.

   chmod +x deploy.sh
   bash ./deploy.sh
This will create a Docker image and run it as a Docker container on the port number you specified in the .env file.
Alternatively, after following steps (1) and (2) above, you can run the service in a terminal with the command python main.py.
This service is dependent on the following other services:
- pqai-encoder
The following services depend on this service:
- pqai-gateway
- Pang, Liang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text Matching as Image Recognition. arXiv, February 19, 2016. https://doi.org/10.48550/arXiv.1602.06359.