Ngram 1/try llm code speedup #90

Merged: 8 commits merged on Apr 4, 2024
Conversation

AaronWChen (Owner)

No description provided.

Several approaches did not work (which is OK), but to do further testing I needed to install more-itertools in the venv, and can now rerun some code.
The forced restart cleared the Jupyter kernel, so might as well add the FlagEmbedding library to create embeddings for retrieval.

Read a few discussions on how slow Stanza's pipeline is. One takeaway: I think there is a strong benefit to using lemmas as tokens, but lemmatization requires many of the processors in the pipeline. Unfortunately, this looks like the slow part of processing.

Saw this discussion on [using multiprocessing combined with Stanza](stanfordnlp/stanza#552), and it seems that multiprocessing with a GPU doesn't yield much of a performance increase but does make the GPU work drastically harder. However, the discussion reminded me that I had turned off the GPU in Stanza while trying to use BERTopic with Stanza. Will re-enable it and re-time.
Moving lemmatization out of the sklearn pipeline to speed up count vectorization/TF-IDF
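The shape of that refactor can be sketched with a stdlib-only stand-in for the lemmatizer (the lookup table and sample documents below are hypothetical; in the project the lemmas would come from the Stanza pipeline):

```python
from collections import Counter

# Hypothetical toy lemma table standing in for a real NLP pipeline;
# the point is the restructuring, not the lemmatizer itself.
LEMMAS = {"running": "run", "ran": "run", "recipes": "recipe"}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

docs = [["running", "recipes"], ["ran", "recipes", "fast"]]

# Before: lemmatizing inside the vectorizer's analyzer re-runs the (slow)
# NLP step on every fit/transform. After: lemmatize once, up front, and
# feed the cached lemmas to the counting step.
lemmatized_docs = [lemmatize(d) for d in docs]

# Counting (the cheap part) now operates on plain token lists.
counts = [Counter(d) for d in lemmatized_docs]
print(counts[0])  # Counter({'run': 1, 'recipe': 1})
```

With sklearn, the same idea is to pass the pre-lemmatized tokens straight to the vectorizer (e.g. via an identity `analyzer`) so the expensive step runs exactly once.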

Will have to add a similar refactor to the query code.
Running into an error:

```
TypeError: CustomSKLearnAnalyzer.ngrams_maker() got multiple values for argument 'min_ngram_length'
```
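This `TypeError` generally means the same argument arrives both positionally and by keyword. A minimal repro with a hypothetical `ngrams_maker` signature (only the names mirror the traceback; the real method's body and parameters may differ):

```python
class CustomSKLearnAnalyzer:
    # Hypothetical reconstruction for illustration only.
    def ngrams_maker(self, tokens, min_ngram_length=1, max_ngram_length=3):
        return [" ".join(tokens[i:i + n])
                for n in range(min_ngram_length, max_ngram_length + 1)
                for i in range(len(tokens) - n + 1)]

analyzer = CustomSKLearnAnalyzer()

# Reproduces the error: 2 binds to min_ngram_length positionally,
# and the keyword then supplies it a second time.
try:
    analyzer.ngrams_maker(["a", "b"], 2, min_ngram_length=2)
except TypeError as e:
    print(e)
```

Another common trigger is passing the unbound method (or a `partial` with an extra leading argument) as a callback, e.g. as an sklearn `analyzer=`, so every positional slot shifts by one and collides with a keyword.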
Experiment with BGE-M3 and smaller datasets while the laptop builds the model(s).
dagshub bot commented Apr 4, 2024

@AaronWChen AaronWChen merged commit 9c0983d into dev Apr 4, 2024
0 of 2 checks passed
@AaronWChen AaronWChen deleted the NGRAM-1/try-llm-code-speedup branch April 4, 2024 00:29