This repo enables you to find look-alike (closest) sentences by leveraging the sent2vec concept.
Most pre-trained language models capture a lot of "common language understanding" from the corpora they are trained on. Sub-word embeddings align themselves to learn semantics based on the context given in the sentence. An extension of this is to generate sentence vectors, a.k.a. 'sent2vec', from the transformer's token embeddings (BERT in this scenario) as a numeric representation of this semantic relationship. Given two such vectors, we can compute their cosine similarity to measure how semantically similar they are.
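The similarity step above can be sketched in a few lines. This is a toy example with hypothetical 4-dim vectors (the repo's actual sent2vec vectors are 128-dim), just to show the cosine-similarity computation itself:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "sentence vectors"; real sent2vec vectors here are 128-dim.
v1 = np.array([0.2, 0.9, 0.1, 0.4])
v2 = np.array([0.25, 0.8, 0.05, 0.5])  # stand-in for a semantically close sentence
v3 = np.array([-0.7, 0.1, 0.9, -0.2])  # stand-in for an unrelated sentence

print(cosine_similarity(v1, v2))  # close to 1.0
print(cosine_similarity(v1, v3))  # much lower
```

A higher cosine value means the two sentence vectors point in more similar directions, i.e. the sentences are more semantically alike.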
- Create a virtual environment
- Install requirements
- Open the config.yaml file and modify the parameters as per your setup
python semantic_lookalike_transformer.py --config_file_path=config.yaml
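For orientation, a config.yaml might look something like the sketch below. The parameter names here are purely illustrative assumptions; check the actual config.yaml in the repo for the real keys and values:

```yaml
# Illustrative only -- these key names are hypothetical;
# refer to the repo's config.yaml for the real parameters.
data_file_path: atis_intents.csv   # dataset downloaded from Kaggle
target_sentence_index: 2841        # sentence to find look-alikes for
embedding_dim: 128                 # sent2vec vector dimension
top_k: 5                           # number of similar sentences to return
```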
- A richer representation of a sentence, based on context and word relationships, which cannot be achieved with tf-idf (tf-idf does not consider word order, for example)
- Tf-idf may fail to produce a meaningful vector when a sentence contains too many out-of-vocabulary words, whereas transformers handle this aspect far better than most text-vectorization methods.
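The word-order limitation is easy to demonstrate. The sketch below uses plain term-frequency counts (the bag-of-words core that tf-idf weights) to show that two sentences with the same words but opposite meanings get identical vectors:

```python
from collections import Counter

def tf_vector(sentence, vocab):
    """Term-frequency vector over a fixed vocabulary (bag of words)."""
    counts = Counter(sentence.split())
    return [counts[w] for w in vocab]

s1 = "flight from boston to denver"
s2 = "flight from denver to boston"  # same words, opposite meaning

vocab = sorted(set(s1.split()) | set(s2.split()))
print(tf_vector(s1, vocab) == tf_vector(s2, vocab))  # True: order is lost
```

A sent2vec representation built from contextual transformer embeddings would assign these two sentences different vectors, because each token's embedding depends on its position and neighbors.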
- Sentences in the search space need not share the same words to be identified as look-alikes of the targeted sentence.
- To find unique semantic patterns that exist in the corpus
- To label un-tagged data based on a targeted sentence
- To find similar documents or articles
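All of these use cases reduce to the same ranking step: given a matrix of sentence vectors, sort the search space by cosine similarity to the target. A minimal sketch with toy 3-dim vectors (standing in for the 128-dim sent2vec embeddings):

```python
import numpy as np

def top_k_lookalikes(target_vec, sentence_matrix, k=5):
    """Return indices of the k rows most cosine-similar to target_vec."""
    norms = np.linalg.norm(sentence_matrix, axis=1) * np.linalg.norm(target_vec)
    sims = sentence_matrix @ target_vec / norms
    return np.argsort(-sims)[:k]  # descending similarity

# Toy 3-dim vectors standing in for 128-dim sent2vec embeddings.
matrix = np.array([
    [1.0, 0.1, 0.0],   # 0: very close to the target
    [0.0, 1.0, 0.0],   # 1: orthogonal
    [0.9, 0.2, 0.1],   # 2: close
    [-1.0, 0.0, 0.0],  # 3: opposite direction
])
target = np.array([1.0, 0.0, 0.0])
print(top_k_lookalikes(target, matrix, k=2))  # rows 0 and 2 rank highest
```

In the repo's setting, the returned indices map back to sentences in the corpus, which is how the "Top 5 semantic similar sentences" output below is produced.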
Sentence-to-vector dimension : 128 (try different BERT configurations to increase this dimension for even better results on longer sentences)
Targeted sentence index : 2841 (A random index)
Targeted sentence : ['i want to fly from boston to denver with a stop in philadelphia']
Top 5 semantic similar sentences are :
----------------------------------------------------------------------------------------------
0 : i want to fly from boston to denver with a stop in philadelphia
1 : i want to fly from philadelphia to dallas and make a stopover in atlanta
2 : i want a flight originating in denver going to pittsburgh and atlanta in either order
3 : i want a flight on twa from boston to denver
4 : i need to go to san diego from toronto but i want to stopover in denver
----------------------------------------------------------------------------------------------
Note: The output will vary with the targeted sentence (the one for which you need to find look-alike sentences)
- Increase the dimension of the transformer embedding layer (currently 128) to learn richer language representations and hence closer/better look-alike sentences. Try 256, 512 or 768 embedding dims to improve overall performance.
- flairNLP provides good ways to create sentence/document-level embeddings, with the flexibility to pick GloVe, gensim, and flair embeddings (or stack them) and then learn sentence/document-level embeddings using pooling/RNN methods. There are four main document embeddings in Flair (https://github.com/flairNLP/flair):
- DocumentPoolEmbeddings that simply do an average over all word embeddings in the sentence
- DocumentRNNEmbeddings that train an RNN over all word embeddings in a sentence
- TransformerDocumentEmbeddings that use pre-trained transformers and are recommended for most text classification tasks
- SentenceTransformerDocumentEmbeddings that use pre-trained transformers and are recommended if you need a good vector representation of a sentence
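To make the pooling idea concrete without pulling in Flair itself, here is what DocumentPoolEmbeddings-style mean pooling amounts to, sketched with hypothetical word vectors:

```python
import numpy as np

# Hypothetical 4-dim word embeddings for one sentence. Flair's
# DocumentPoolEmbeddings averages the word embeddings by default,
# producing a single fixed-size sentence vector.
word_embeddings = np.array([
    [0.1, 0.5, -0.2, 0.3],   # "flight"
    [0.4, 0.1, 0.0, -0.1],   # "to"
    [0.2, 0.3, 0.6, 0.2],    # "denver"
])

sentence_embedding = word_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (4,) -- same size regardless of sentence length
```

RNN- and transformer-based document embeddings replace this simple average with learned, order-aware aggregation, which is why Flair recommends them for most classification and similarity tasks.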
Kindly download the data from the link below -
https://www.kaggle.com/hassanamin/atis-airlinetravelinformationsystem