pqai-reranker
This service provides high-latency, high-accuracy models for determining text similarity.
Here is how they work: you give them two text snippets (typically a query and a document) and they return a numerical value (a score; see the note below*) that signifies how similar the two are in meaning.
If you have read the documentation of the pqai-index and pqai-encoder components, you will have seen that similarity can also be determined through vector representations of text: you convert the two texts into vectors (using a deep neural network, for example), and the cosine similarity of the two resulting vectors can be taken as a proxy for their semantic similarity.
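For illustration, here is a minimal sketch of that vector-based approach; the vectors below are made up, whereas in PQAI they would come from the pqai-encoder service:

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; higher means more similar."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" of a query and two documents
query_vec = np.array([0.9, 0.1, 0.3, 0.0])
doc_a_vec = np.array([0.8, 0.2, 0.4, 0.1])  # close in meaning to the query
doc_b_vec = np.array([0.0, 0.9, 0.1, 0.7])  # unrelated to the query

print(cosine_similarity(query_vec, doc_a_vec))  # relatively high (~0.98)
print(cosine_similarity(query_vec, doc_b_vec))  # relatively low (~0.11)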
Then where do rerankers fit in?
Unlike representation-based models, which create an independent representation of a text (such as a vector) that can be pre-computed and stored on disk, reranker models do not create such a representation.
Instead, they ingest the two texts at run-time and then return a similarity score. The practical effect of this is on latency, particularly when you have a lot of documents (say, a million): since reranker models must ingest each document at run-time (that is, after the user has submitted the query), they take much longer.
The upside of reranker models is that they determine similarity with greater accuracy.
In a real-world scenario, and in PQAI, the search is therefore done in two stages: first with a fast, vector-based approach, and then, in a second stage, a reranker model re-orders the first-stage results in descending order of similarity.
*NOTE: Whether semantic similarity (i.e., similarity of meaning) increases or decreases with the numerical score depends on the type of model. Some models produce distance-based scores; for them, a score of 2.2 means lower similarity than a score of 1.3. The opposite holds for models that produce similarity-based scores. This is, of course, a surface-level difference, since the scores can be flipped by taking their additive inverse, but it is an implementation-level detail worth keeping in mind.
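To make the two-stage flow concrete, here is a minimal, self-contained sketch; the first-stage search and the reranker below are toy stand-ins (in PQAI those roles are played by the pqai-index service and the models in this repository):

corpus = [
    "A coffee making machine has been disclosed.",
    "A method of brewing espresso under pressure.",
    "A bicycle frame made of carbon fibre.",
]

def first_stage_search(query, top_n):
    """Stand-in for the fast vector-based retrieval stage."""
    return corpus[:top_n]

def rerank_score(query, document):
    """Stand-in for a slow-but-accurate reranker; here, plain word overlap."""
    return len(set(query.lower().split()) & set(document.lower().split()))

query = "coffee maker"
candidates = first_stage_search(query, top_n=2)     # stage 1: recall candidate documents
ranked = sorted(candidates,                         # stage 2: re-order them by reranker score
                key=lambda doc: rerank_score(query, doc),
                reverse=True)                       # descending similarity
# For a distance-based reranker (see note above), ascending order would be used instead.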
The repository is organized as follows:

root/
  |-- core/
  |     |-- custom_reranker.py      # an interaction-based custom reranker model
  |     |-- reranker.py             # defines a number of reranker models
  |     |-- matchpyramid.py         # MatchPyramid model for text similarity
  |-- assets/                       # files (e.g. ML models) used by core modules
  |-- tests/
  |     |-- test_custom_reranker.py
  |     |-- test_matchpyramid.py
  |     |-- test_reranker.py
  |     |-- test_server.py          # Tests for the REST API
  |-- main.py                       # Defines the REST API
  |
  |-- requirements.txt              # List of Python dependencies
  |
  |-- Dockerfile                    # Docker files
  |-- docker-compose.yml
  |
  |-- env                           # .env file template
  |-- deploy.sh                     # Script for setting up the service locally
The reranker module defines the following classes:
- Ranker
- ConceptMatchRanker

Ranker is an abstract class that defines the interface for any reranker. Essentially, it specifies that any reranker should expose two methods: score and rank.
The score method should return a numerical score when given two texts as arguments.

The rank method, on the other hand, must accept a textual query and a list of documents. It returns a list of ranks, one for each position in the input document list, with rank 0 denoting the most similar document. For example, if it returns [2, 0, 1], the document at index 0 of the input list has rank 2 (least similar) and the document at index 1 has rank 0 (most similar).
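As a rough illustration of this interface (not one of the models shipped in this repository), a toy class that mimics it and uses plain word overlap as its score might look as follows:

class ToyRanker:
    """Toy stand-in that mimics the Ranker interface."""

    def score(self, query, document):
        # similarity = number of words the two texts have in common
        return len(set(query.lower().split()) & set(document.lower().split()))

    def rank(self, query, documents):
        scores = [self.score(query, doc) for doc in documents]
        order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
        ranks = [0] * len(documents)
        for rank, doc_index in enumerate(order):
            ranks[doc_index] = rank             # rank 0 = most similar
        return ranks

ranker = ToyRanker()
documents = [
    "A bicycle frame made of carbon fibre",
    "A coffee making machine",
    "A coffee maker with a built-in grinder",
]
print(ranker.rank("coffee maker", documents))   # [2, 1, 0]: the last document is most similar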
The ConceptMatchRanker class defines a similarity model that uses Word Mover's Distance to quantify the similarity between two text snippets. It compares and pools the similarities between the word embeddings of the two inputs to arrive at a numerical score.
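Assuming it takes no constructor arguments and follows the Ranker interface described above (both assumptions, not verified here), typical usage would look like:

from core.reranker import ConceptMatchRanker

ranker = ConceptMatchRanker()
score = ranker.score("coffee brewing apparatus", "a machine for making espresso")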
The matchpyramid module is an implementation of the Pang et al. (2016) model, popularly known as MatchPyramid, fine-tuned here on a patent-ranking task.

The module exposes a method called calculate_similarity, which accepts two text snippets and returns a numerical similarity score.
import numpy as np
from core.matchpyramid import calculate_similarity

text1 = "This invention relates to coffee makers."
text2 = "A coffee making machine has been disclosed."

score = calculate_similarity(text1, text2)
assert isinstance(score, np.float32)
The custom_reranker module implements a custom interaction-based semantic similarity model. It defines the following classes:
- CustomRanker
- GloveWordEmbeddings
- VectorSequence
- Interaction
- InteractionMatrix

Except for CustomRanker, these classes are not used outside this module and should be considered internal implementation details of CustomRanker.
CustomRanker implements the score method specified by the abstract Ranker class in the reranker module. It accepts two input snippets (e.g., a query and a document) and returns their similarity score as a float value. This value increases with increasing similarity but has no upper bound.
Typical usage:

from core.custom_reranker import CustomRanker

reranker = CustomRanker()
query = "This is a red apple, which is a fruit"
document = "This is a green apple"
score = reranker.score(query, document)
The assets required to run this service are stored in the /assets directory.

When you clone the GitHub repository, the /assets directory will contain nothing but a README file. You will need to download the actual asset files as a zip archive from the following link:

https://s3.amazonaws.com/pqai.s3/public/assets-pqai-reranker.zip

After downloading, extract the zip file into the /assets directory.
(Alternatively, you can use the deploy.sh script to do this step automatically - see the next section.)
The assets contain the following files/directories:

- MatchPyramid_200_tokens/: files for the MatchPyramid model
- dfs.json: document frequencies for terms
- glove-dictionary.json: term:index mapping for a vocabulary
- glove-dictionary.variations.json: a mapping of lemmatized versions of words to their syntactic variations
- glove-vocab.json: a term vocabulary
- glove-vocab.lemmas.json: term:lemma mapping for a vocabulary
- glove-We.npy: GloVe word embeddings
- glove-Ww.npy: SIF (smooth-inverse-frequency) weights for terms
- stopwords.txt: list of patent-specific stopwords
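To give a sense of how the GloVe-related files fit together, here is an illustrative sketch; these assets are consumed internally by the core modules, and the exact array shapes and key formats are assumptions here:

import json
import numpy as np

# Map a term to its row in the embedding matrix and to its SIF weight
with open("assets/glove-dictionary.json") as f:
    term_to_index = json.load(f)              # term -> index (assumed format)

embeddings = np.load("assets/glove-We.npy")   # GloVe embeddings, one row per term (assumed)
sif_weights = np.load("assets/glove-Ww.npy")  # one SIF weight per term (assumed)

idx = term_to_index["coffee"]                 # assumes "coffee" is in the vocabulary
vector = embeddings[idx]                      # the GloVe embedding of "coffee"
weight = sif_weights[idx]                     # its smooth-inverse-frequency weight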
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository.

   git clone https://github.com/pqaidevteam/pqai-reranker.git

2. Using the env template in the repository, create a .env file and set the environment variables.

   cd pqai-reranker
   cp env .env
   nano .env

3. Run the deploy.sh script.

   chmod +x deploy.sh
   bash ./deploy.sh
This will create a Docker image and run it as a Docker container on the port number you specified in the .env file.
Alternatively, after following steps (1) and (2) above, you can run the service in a terminal with the command python main.py.
This service is dependent on the following other services:
- pqai-encoder
The following services depend on this service:
- pqai-gateway
- Pang, Liang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text Matching as Image Recognition. arXiv, February 19, 2016. https://doi.org/10.48550/arXiv.1602.06359.