Setup check. Script to get keywords for comparing against SimpleMaths, TextRank and Philology results #220
Comments
Thanks for sharing your code!
From a quick glance, it seems to be set up correctly, but that depends on your definition of "correctly". Are you running into any errors, or do you want to optimize performance, diversity, etc.? What exactly do you want checked? What is your use case and what goal do you want to achieve? It helps if you start by describing your use case, the problem you are facing, and the kinds of solutions or feedback you are looking for. Reading your questions, it isn't clear to me what the main question is. To illustrate, the following comment is quite broad and does not tell me what kind of feedback you are looking for:
In other words, can you specify your question a bit?
Thank you for your reply, really appreciate it!
Most importantly, I was hoping the author could check whether the keyword extraction is called correctly and whether my setup is correct, if that makes sense :) My goal is to create a script in which a user can run three different keyword extraction methods and see the rankings each one produces. The script will include Sketch Engine Simple Maths, TextRank, and KeyBERT with three provided models, all meant for Estonian texts. The goal of my master's thesis is to compare LLMs against existing Simple Maths, TextRank, and philologically identified keywords, to measure how accurate LLMs are when used through the minimal and brilliant KeyBERT solution. I therefore need the maximum accuracy that KeyBERT can achieve in finding keywords and keyphrases.
Additionally, there is the question regarding POS and ner_tags. Keywords in lemma form (with each lemma replaced by the original word):
Keyphrases in lemma form (lemmas likewise replaced by the original words):
The code has been updated a bit to iterate over all three models and create for each model a subfolder with ngrams ranging 1 to 3.
The only recurring errors I get:
From a pure coding perspective, yes, you are calling the functions correctly. Do note, though, that sentence-transformers models generally work a bit better (and in my experience faster) than Flair, so perhaps use those instead. Also, make sure to check the MTEB leaderboard for a nice overview of models.
Any processing during extraction is done through the CountVectorizer, so that is where you would change how the processing behaves.
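As a rough illustration of the suggestion above, here is a minimal sketch of running KeyBERT with a sentence-transformers model over n-gram ranges 1 to 3. The model name `paraphrase-multilingual-MiniLM-L12-v2` and the helper names `build_runs`, `load_model`, and `extract` are assumptions made for this example and do not come from the thread; pick the actual model from the MTEB leaderboard for your language.

```python
from itertools import product

# Assumed model name, for illustration only; the thread does not prescribe it.
MODELS = ["paraphrase-multilingual-MiniLM-L12-v2"]
NGRAM_RANGES = [(1, 1), (1, 2), (1, 3)]


def build_runs(models=MODELS, ngram_ranges=NGRAM_RANGES):
    """Return one (model_name, ngram_range) pair per planned extraction run."""
    return list(product(models, ngram_ranges))


def load_model(model_name):
    """Create a KeyBERT instance; a string name is loaded via sentence-transformers."""
    # Imported lazily so build_runs() works even without keybert installed.
    from keybert import KeyBERT

    return KeyBERT(model=model_name)


def extract(kw_model, doc, ngram_range, top_n=10):
    """Return a list of (keyword, score) tuples for a single document."""
    return kw_model.extract_keywords(
        doc,
        keyphrase_ngram_range=ngram_range,
        stop_words=None,  # the default is "english", which does not suit Estonian
        top_n=top_n,
    )
```

A typical loop would call `load_model` once per model name and then `extract` once per n-gram range, so that each model is only loaded a single time.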
Yes, see my answer above.
Not within KeyBERT itself other than using the CountVectorizer to lemmatize any input words that you receive.
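To make the CountVectorizer route concrete, here is a minimal sketch of a lemmatizing tokenizer plugged into scikit-learn's `CountVectorizer`. The tiny `TOY_LEMMAS` lookup table and the helper names are hypothetical and stand in for a real Estonian lemmatizer, which this sketch does not include.

```python
import re

# Hypothetical toy lookup; a real setup would call an Estonian lemmatizer instead.
TOY_LEMMAS = {
    "raamatud": "raamat",
    "raamatut": "raamat",
    "koolis": "kool",
    "koolid": "kool",
}


def lemma_tokenizer(text):
    """Lowercase, split on word characters, and map each token to its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def make_vectorizer(ngram_range=(1, 3)):
    """Build a CountVectorizer that generates candidates from lemmatized tokens."""
    # Imported lazily so the tokenizer can be tried without scikit-learn installed.
    from sklearn.feature_extraction.text import CountVectorizer

    return CountVectorizer(
        tokenizer=lemma_tokenizer,
        token_pattern=None,  # unused when a custom tokenizer is supplied
        ngram_range=ngram_range,
    )
```

The resulting object can be passed as `vectorizer=make_vectorizer()` to `extract_keywords`. In that case the n-gram range and stop words are taken from the vectorizer rather than from the `keyphrase_ngram_range` and `stop_words` arguments, and the returned keywords are lemmas, not the original surface forms.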
Do you mean a citation? If so, then you can follow along with the README.
You can use the [KeyphraseVectorizers](https://maartengr.github.io/KeyBERT/guides/countvectorizer.html#keyphrasevectorizers) although I think it is not maintained anymore.
@MaartenGr thank you for your reply and suggestions! I have now reinstalled my environments in Conda and added Torch, and everything runs fast. For Estonian I chose e5, as it was the first suggested multilingual model that fits Estonian. The results so far are promising. I will certainly post an update on the comparison. I do have a couple more questions. 1. Is the MMR value of 0.7 suggested in the example optimal? Thank you in advance! Regarding MMR, out of curiosity for my master's thesis: More recent code:
That depends on your use case and your definition of "optimal". For some use cases, a lower value is enough to remove some redundancy, but for others you might want to increase the value if you have many synonyms or are generally interested in more diverse representations. Always make sure to first define what you think is "optimal", "good", "performant", etc.
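To show how the diversity value trades relevance against redundancy, here is a small pure-Python sketch of Maximal Marginal Relevance. It follows the general MMR idea and is not KeyBERT's own implementation; the function names and the toy 2-D vectors are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, candidates, top_n=5, diversity=0.7):
    """Select up to top_n candidate names, balancing relevance and redundancy.

    candidates maps a keyword to its embedding vector.
    """
    if not candidates or top_n < 1:
        return []
    relevance = {w: cosine(vec, doc_vec) for w, vec in candidates.items()}
    # Start with the candidate most similar to the document.
    selected = [max(relevance, key=relevance.get)]
    remaining = [w for w in candidates if w != selected[0]]
    while remaining and len(selected) < top_n:
        def score(w):
            # Redundancy is the similarity to the closest already-selected keyword.
            redundancy = max(cosine(candidates[w], candidates[s]) for s in selected)
            return (1 - diversity) * relevance[w] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: "b" is a near-duplicate of "a", while "c" is less relevant but different.
toy_doc = [1.0, 0.0]
toy_candidates = {"a": [1.0, 0.1], "b": [1.0, 0.12], "c": [0.6, 0.8]}
```

With `diversity=0.0` the selection is driven purely by relevance, so the near-duplicate is kept; with a higher value such as `0.7` the near-duplicate is penalized in favor of the more distinct candidate. In KeyBERT the corresponding switches are `use_mmr=True` and `diversity=...` on `extract_keywords`.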
The official documentation contains the most recent information.
You can cite KeyBERT as mentioned in the README.
As a small tip. If you have a large dataset, then it might be worthwhile to set |
I am currently using GPT and some manuals I have read. Did I set up the code and the transformer model correctly? Or are there any suggestions I could use? I will also try with n-grams up to 3.
Perhaps you have some preprocessing suggestions, or advice on how to use POS and NER tags in the process to get results with KeyBERT.
I am currently comparing KeyBERT against Sketch Engine (Simple Maths), TextRank, and philologically identified keywords. For my master's thesis I thought KeyBERT would be the best choice. I would like to check the efficiency with Estonian language models (tartuNLP/EstBERT), but also try mBART (facebook/mbart-large-50) and mT5 (google/mt5-base), which could add some value to the research. I also tried reaching out via LinkedIn.
All the best.