This repo is a case study of text classification with LLMs. It compares the performance of two approaches:
- Full finetuning of relatively small BERT-like models with a few hundred million parameters.
- Zero-shot text classification via prompt-based in-context learning with very large generative LLMs with tens to hundreds of billions of parameters.
The code is in classification.ipynb, and the solution and results are explained in report.pdf.
From the TweetEval benchmark, I chose the emotion classification task, which includes 3k+ training examples (with a validation set about 10% of that size) and 4 emotion labels: anger, joy, optimism, sadness.
For full finetuning, I used the BERT and BERTweet models from Hugging Face. For zero-shot text classification, I used the Meta Llama 3 70B Instruct model.
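A minimal sketch of the finetuning setup, assuming the Hugging Face `transformers` Trainer API; the hyperparameters below are illustrative and may differ from those used in classification.ipynb:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("tweet_eval", "emotion")

# Swap in "bert-base-uncased" for the BERT run
model_name = "vinai/bertweet-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

def tokenize(batch):
    # Pad/truncate tweets to a fixed length for batching
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="bertweet-emotion",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
```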
I was able to reproduce the main results comparing both methods reported in the literature [1, 2, 3]. The main takeaways:
- With 1000+ labeled examples, we should rely on full finetuning of relatively small BERT-like models, as it yields robust, task-specific SOTA classifiers.
- With less data, the results of full finetuning become less reliable, and we should rely on zero-shot in-context learning instead. My experiments show that zero-shot in-context learning works very well and achieves SOTA results when the prompt design and the choice of LLM are done properly (a prompt sketch follows below).
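For illustration, here is a hedged sketch of what such a zero-shot classification prompt could look like; the exact prompt used in report.pdf may differ:

```python
# Illustrative zero-shot prompt for an instruction-tuned chat model.
# The label set matches the TweetEval emotion task.
PROMPT_TEMPLATE = """You are an emotion classifier for tweets.
Classify the tweet into exactly one of: anger, joy, optimism, sadness.
Answer with the label only.

Tweet: {tweet}
Label:"""

def build_prompt(tweet: str) -> str:
    # Fill the template with the tweet to classify
    return PROMPT_TEMPLATE.format(tweet=tweet)

print(build_prompt("Why does everything have to go wrong today..."))
```

Constraining the model to answer with the label only makes the output easy to parse and map back to the dataset's label ids.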