The primary objective of this study is to evaluate how effectively AI-driven techniques, specifically large language models (LLMs) built on the transformer architecture and trained on large amounts of text data, can spot and label potential leads on social media platforms, with a particular focus on Twitter.
To download tweets and to run the models, create a virtual environment and install the requirements with `pip install -r requirements.txt`.
Create and set all the required environment variables. Look for the variable names in the `config.py` file:
- `RAW_DATA_DIR` - where raw datasets will be stored
- `DATA_DIR` - where clean, ready-to-use datasets will be stored
- `{MODEL}_PATH` - path to local models
- `OPENAI_API_KEY` - OpenAI API secret key
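For reference, a minimal sketch of how `config.py` might read these variables (the actual structure of the file may differ):

```python
# Illustrative sketch only; see the real config.py for the exact variables.
import os

RAW_DATA_DIR = os.environ["RAW_DATA_DIR"]        # raw datasets
DATA_DIR = os.environ["DATA_DIR"]                # cleaned, ready-to-use datasets
LLAMA_7B_PATH = os.environ.get("LLAMA_7B_PATH")  # one of the {MODEL}_PATH variables
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]    # OpenAI API secret key
```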
Use the `dataset.ipynb` notebook. Set the Twitter API keys (be sure to check out the Twitter API and developer docs linked in the Resources section) by setting:
- `TWTR_BEARER_TOKEN`
- `TWTR_API`
- `TWTR_API_SECRET`
- `TWTR_ACCESS_TOKEN`
- `TWTR_ACCESS_TOKEN_SECRET`
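For reference, these credentials can be used to authenticate against the Twitter API v2, for example with Tweepy (whether the notebook actually uses Tweepy is an assumption):

```python
# Sketch: authenticating with Tweepy using the variables above.
import os
import tweepy

client = tweepy.Client(
    bearer_token=os.environ["TWTR_BEARER_TOKEN"],
    consumer_key=os.environ["TWTR_API"],
    consumer_secret=os.environ["TWTR_API_SECRET"],
    access_token=os.environ["TWTR_ACCESS_TOKEN"],
    access_token_secret=os.environ["TWTR_ACCESS_TOKEN_SECRET"],
)
```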
Use pre-configured contexts (apparel, cars and beauty) or change them as needed.
Run the notebook, being careful about Twitter API rate limits. The datasets will be stored in `RAW_DATA_DIR` as `.parquet` files.
Data cleaning and initial preparation can be done using this notebook too. The output will be stored in the `DATA_DIR` directory.
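As an illustration, the cleaning step might look like the sketch below; the file name, column name, and cleaning operations are assumptions, so check the notebook for the actual pipeline:

```python
import os
import pandas as pd

# "apparel.parquet" and the "text" column are hypothetical names.
raw = pd.read_parquet(os.path.join(os.environ["RAW_DATA_DIR"], "apparel.parquet"))

# Minimal illustrative cleaning: drop empty and duplicate tweets.
clean = raw.dropna(subset=["text"]).drop_duplicates(subset="text")

clean.to_parquet(os.path.join(os.environ["DATA_DIR"], "apparel.parquet"))
```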
Use `bots/tweetdb.py`, a script to download and store the last n tweets from the previous 7 days. Change the `DOMAINS` list to look for the desired tweet contexts and annotations.
Navigate to the bots directory with `cd bots`, then run `python3 tweetdb.py`. Adjust the time buffer in `time.sleep()` for optimal time saving with no timeouts; the sketch below shows the idea.
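The pacing logic amounts to something like the following; the `DOMAINS` values, query format, and storage step are simplified assumptions about what `tweetdb.py` does:

```python
import os
import time
import tweepy

client = tweepy.Client(bearer_token=os.environ["TWTR_BEARER_TOKEN"])

# Hypothetical query terms; the real DOMAINS list may differ.
DOMAINS = ["apparel", "cars", "beauty"]

for domain in DOMAINS:
    # search_recent_tweets covers the previous 7 days (Twitter API v2).
    response = client.search_recent_tweets(query=domain, max_results=100)
    for tweet in response.data or []:
        print(tweet.id, tweet.text)  # stand-in for the real storage step
    time.sleep(5)  # the time buffer: tune to avoid rate limits without idling
```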
- Clone the llama.cpp GitHub repo into the project folder; it will only be used to convert `*.pth` files to `*.bin` files and to quantize them to 4-bit. To convert and quantize the models, follow the instructions in the llama.cpp README (everything working fine as of 04/04/2023). A sketch of loading a converted model is shown after this list.
- GPT4All: download the model from the gpt4all GitHub repo referenced in the Resources section. Convert the model to ggml format and quantize it, then save the model path in the `GPT4ALL_PATH` environment variable.
- Llama models: download the models, convert and quantize them following the instructions in the llama.cpp repo, then save the model paths in the `LLAMA_7B_PATH` and `LLAMA_13B_PATH` environment variables.
- Alpaca Native: download the model from the alpaca-native Hugging Face repo (community section), migrate it following the instructions in the llama.cpp repo using `migrate-ggml-2023-03-30-pr613.py`, then save the model path in the `ALPACA_7B_NATIVE_PATH` environment variable.
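Once a path is set, the quantized model can be loaded locally, for example through LangChain's `LlamaCpp` wrapper (requires llama-cpp-python; whether the project uses this wrapper or another loader is an assumption):

```python
# Sketch: running a local quantized Llama model via LangChain.
import os
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path=os.environ["LLAMA_7B_PATH"], n_threads=6)
print(llm("Does this tweet express purchase intent? Tweet: ..."))
```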
Machine: MacBook Pro, M1 Pro, 8 cores, 16 GB RAM.
100 Generations on ~250-token prompt, 5s fast cooldown, 2m slow cooldown:
- BLOOM (api): 9m 20s
- Alpaca 3B (api): 1h 11m 58s
- Alpaca 770M (api): 28m 53s
- GPT4All (local, 6 threads): 37m 26s
- Llama 7B (local, 6 threads): 1h 3m 39s
- Llama 13B (local, 6 threads): 1h 39m 54s
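For context, the figures above assume a loop along these lines; `generate` is a hypothetical callable wrapping one of the models, and applying the slow cooldown on rate-limit errors is an assumption:

```python
import time

def benchmark(generate, prompt, n=100, fast_cooldown=5, slow_cooldown=120):
    # Time n generations with a short pause between calls and a long
    # pause after an error (e.g. an API rate limit); sketch only.
    start = time.monotonic()
    for _ in range(n):
        try:
            generate(prompt)
            time.sleep(fast_cooldown)
        except Exception:
            time.sleep(slow_cooldown)
    return time.monotonic() - start
```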
- Twitter API docs: https://developer.twitter.com/en/docs/twitter-api
- Twitter context annotations: https://github.com/twitterdev/twitter-context-annotations/tree/main/files
- LangChain docs: https://python.langchain.com/en/latest/index.html
- bigscience/bloom: https://huggingface.co/bigscience/bloom
- declare-lab/flan-alpaca-xl: https://huggingface.co/declare-lab/flan-alpaca-xl
- GPT4All repo: https://github.com/nomic-ai/gpt4all
- llama.cpp: https://github.com/ggerganov/llama.cpp
- UMAP + HDBSCAN paper: https://ieeexplore.ieee.org/document/9640285