
msc-thesis-llm-clustering

Aim of this study

The primary objective of this study is to evaluate how effectively AI-driven techniques, specifically large language models (LLMs) built on the transformer neural network architecture and trained on large amounts of text data, can spot and label potential leads on social media platforms, with a specific emphasis on Twitter.


Instructions

To download tweets and to run the models, create a virtual environment and install the requirements with:

pip install -r requirements.txt
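A minimal setup sketch, assuming Python 3 with the standard `venv` module (the environment name `.venv` is an illustrative choice, not mandated by the repo):

```shell
# Create and activate a virtual environment, then install the project deps.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```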

Initial Configurations

Create all the environment variables needed and set them. Look for variable names in the config.py file.

  • RAW_DATA_DIR - where raw datasets will be stored
  • DATA_DIR - where cleaned, ready-to-use datasets will be stored
  • {MODEL}_PATH - path to local models
  • OPENAI_API_KEY - OpenAI API secret key
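As a sketch of how config.py might resolve these variables (the helper name and the fallback defaults below are illustrative assumptions, not the project's actual values):

```python
import os
from typing import Optional

def get_env(name: str, default: Optional[str] = None) -> str:
    """Read an environment variable, failing fast when a required one is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Illustrative defaults; the real config.py may differ.
RAW_DATA_DIR = get_env("RAW_DATA_DIR", "./data/raw")        # raw datasets
DATA_DIR = get_env("DATA_DIR", "./data/clean")              # cleaned datasets
OPENAI_API_KEY = get_env("OPENAI_API_KEY", "sk-placeholder")  # OpenAI secret key
```

Failing fast on missing variables makes misconfiguration visible at import time rather than mid-run.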

Download Tweets

Use the dataset.ipynb notebook. Set the Twitter API keys (be sure to check out the Twitter API and developer docs) by setting:

  • TWTR_BEARER_TOKEN
  • TWTR_API
  • TWTR_API_SECRET
  • TWTR_ACCESS_TOKEN
  • TWTR_ACCESS_TOKEN_SECRET

Use the pre-configured contexts (apparel, cars, and beauty) or change them as needed. Run the notebook, being careful about Twitter API rate limits. The datasets will be stored in RAW_DATA_DIR as .parquet files.

Data cleaning and initial preparation can be done using this notebook too. The output will be stored in the DATA_DIR directory.
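As a sketch of the kind of cleaning step the notebook performs (the exact rules in dataset.ipynb may differ; these regexes are an illustrative assumption):

```python
import re

def clean_tweet(text: str) -> str:
    """Illustrative tweet cleaning: strip URLs, user mentions, and extra whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"@\w+", "", text)          # remove user mentions
    text = re.sub(r"\s+", " ", text)          # collapse whitespace
    return text.strip()
```

For example, `clean_tweet("Check this @user https://t.co/abc out")` returns `"Check this out"`.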

Build DB

Use bots/tweetdb.py, a script that downloads and stores the last n tweets from the previous 7 days. Change the DOMAINS list to look for the desired tweet contexts and annotations.

Navigate to the bots directory with cd bots, then run python3 tweetdb.py. Adjust the time buffer in time.sleep() to minimize waiting without hitting timeouts.
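One way to tune that buffer is a retry loop whose sleep grows after each rate-limit failure. This is a sketch under the assumption that failures surface as exceptions; the function names and the use of RuntimeError are illustrative, not taken from tweetdb.py:

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_sleep=5.0):
    """Call fetch(), sleeping progressively longer after each rate-limit failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RuntimeError:                # stand-in for a rate-limit error
            wait = base_sleep * (2 ** attempt)
            time.sleep(wait)                # grow the buffer on each retry
    raise RuntimeError("Exceeded retry budget")
```

Doubling the sleep on each retry trades a little extra waiting for far fewer timeouts than a fixed buffer.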

Running Local - Models Setup

  1. Clone the llama.cpp GitHub repo into the project folder; it is used only to convert *.pth files to *.bin files and to quantize them to 4-bit. To convert and quantize the models, follow the instructions in the llama.cpp README (everything was working as of 04/04/2023).
  2. GPT4All: download the model from the gpt4all GitHub repo referenced in the Resources section. Convert the model to ggml format and quantize it, then save the model path in the environment variable GPT4ALL_PATH.
  3. Llama Models: download the models, convert and quantize them following the instructions in the llama.cpp repo, then save the model paths in the environment variables LLAMA_7B_PATH and LLAMA_13B_PATH.
  4. Alpaca Native: download the model from the alpaca-native Hugging Face repo (community section), migrate it following the instructions in the llama.cpp repo using migrate-ggml-2023-03-30-pr613.py, then save the model path in the environment variable ALPACA_7B_NATIVE_PATH.
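The exact commands depend on the llama.cpp revision; as of the 2023-era README the convert-and-quantize flow looked roughly like this (model paths are illustrative, and the toolchain has changed substantially since, so defer to the current README):

```shell
# Inside the cloned llama.cpp directory.
make                                          # build the conversion/quantize tools
python3 convert-pth-to-ggml.py models/7B/ 1   # *.pth -> ggml f16 *.bin
./quantize models/7B/ggml-model-f16.bin \
           models/7B/ggml-model-q4_0.bin 2    # quantize to 4-bit (q4_0)
export LLAMA_7B_PATH="$PWD/models/7B/ggml-model-q4_0.bin"
```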

Additional Info

Runtimes

Machine: MacBook Pro, M1 Pro (8 cores), 16 GB RAM.

100 generations on a ~250-token prompt, with a 5s fast cooldown and a 2m slow cooldown:

  • BLOOM (api): 9m 20s
  • Alpaca 3B (api): 1h 11m 58s
  • Alpaca 770M (api): 28m 53s
  • GPT4All (local, 6 threads): 37m 26s
  • Llama 7B (local, 6 threads): 1h 3m 39s
  • Llama 13B (local, 6 threads): 1h 39m 54s

Resources
