AI Tinkerers Kuala Lumpur October 2024 Hackathon (LLM-as-a-Judge): Fine-Tuning Malaysian DeBERTaV2 & Mistral 7B for Logical Consistency Classification and Reasoning

Overview

This repo details code written as part of the 1st place solution for the AI Tinkerer's Hackathon in Kuala Lumpur for an LLM-as-a-Judge use case.

It involves fine-tuning Malaysian DeBERTaV2 and Mistral 7B models for yes/no classification and reasoning, focused on a natural language inference (NLI) task.

In our case, NLI is the task of determining whether a "hypothesis" is true (entailment) or false (contradiction) given a statement-question or paragraph-statement pair. By leveraging translated datasets and Chain-of-Thought (CoT) reasoning techniques, this project demonstrates the potential for fine-tuned smaller models to act as scalable judges, in line with the JudgeLM paper.
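To make the task concrete, a Malay NLI pair might look like the following. This example is purely illustrative (the field names and text are ours, not taken from the actual datasets linked below):

```python
# Hypothetical NLI example (field names and text are illustrative only).
example = {
    "premise": "Kuala Lumpur ialah ibu negara Malaysia.",    # "Kuala Lumpur is the capital of Malaysia."
    "hypothesis": "Ibu negara Malaysia terletak di Sabah.",  # "Malaysia's capital is located in Sabah."
    "label": "contradiction",                                # the hypothesis is false given the premise
}
```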

Crucially, we leverage open-source models and datasets in the Malay language to enable adaptation to the local Malaysian context.

Methodology

A comprehensive presentation we prepared for the Hackathon can be found here.

  1. Dataset Translation: Translated the English datasets into Malay using OpenAI's 4o-mini, enabling focused fine-tuning for Malay language understanding.
  2. Chain-of-Thought Reasoning: Augmented the datasets with CoT rationales, also generated with 4o-mini, to enhance logical reasoning capabilities (see the prompt sketch after this list).
  3. Fine-Tuning: Fine-tuned the models on the curated datasets using QLoRA and Hugging Face's SFTTrainer, on a Google Colab A100 GPU (40 GB VRAM); a training sketch follows this list.
  4. Benchmarking: Benchmarking and training runs were executed and monitored using Weave (Weights & Biases).
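As a rough sketch of steps 1-2, the translation and CoT-augmentation calls might look like the following, assuming the openai Python client. The prompts and the "gpt-4o-mini" model id here are illustrative, not our exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_to_malay(text: str) -> str:
    """Translate an English dataset row into Malay (illustrative prompt)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Translate the user's text into Malay."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

def generate_cot(premise: str, hypothesis: str) -> str:
    """Generate a chain-of-thought rationale for an NLI pair (illustrative prompt)."""
    prompt = (
        f"Premis: {premise}\nHipotesis: {hypothesis}\n"
        "Terangkan langkah demi langkah sama ada hipotesis itu benar (entailment) "
        "atau salah (contradiction), kemudian berikan jawapan akhir ya/tidak."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```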
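And a minimal sketch of the QLoRA setup in step 3, using transformers, peft, bitsandbytes, and trl. The model id, dataset id, LoRA ranks, and hyperparameters below are placeholders rather than our exact configuration:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

base_model = "mesolitica/malaysian-mistral-7b-32k-instructions-v2"  # placeholder model id

# 4-bit quantization so a 7B model fits comfortably on a single A100 40 GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapters on the attention projections (placeholder rank/targets).
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

# Placeholder dataset id: the translated + CoT-augmented training set.
train_dataset = load_dataset("username/translated-nli-cot", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        bf16=True,
    ),
)
trainer.train()
```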

Models

Original models:

Fine-tuned models:

Datasets

Original datasets:

Translated datasets:

Translated & Reasoning column generated datasets:

Results

Our approach yielded significant improvements in logical reasoning tasks for Malay and English, validated by accuracy and F1-score metrics. These results secured 1st place at the AI Tinkerer's Hackathon in Kuala Lumpur.*

| Model                | Accuracy (%) | F1-Score (%) |
| -------------------- | ------------ | ------------ |
| OpenAI 4o-mini       | 78           | 80           |
| Malaysian DeBERTaV2  | 51           | 48           |
| Malaysian Mistral V2 | 65           | 74           |
| Malaysian Mistral V2 | 61           | 69           |

*Due to time/compute constraints, we didn't evaluate on the entire test set. You can check how we sampled the test set here.
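For illustration only (the actual sampling code is linked above), a fixed-seed subsample with the datasets library might look like this; the dataset id and sample size are placeholders:

```python
from datasets import load_dataset

# Placeholder dataset id and sample size: draw a reproducible random subset
# of the test split instead of evaluating on the whole thing.
test_set = load_dataset("username/nli-dataset", split="test")
subsample = test_set.shuffle(seed=42).select(range(500))
```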

Acknowledgments

Special thanks to:

  • Mesolitica for their open-source models we used for fine-tuning.
  • AI Tinkerer's Kuala Lumpur for organizing the hackathon.
  • Joseph from DocuAsk for providing OpenAI credits that enabled us to access 4o-mini.
  • Team members and collaborators for their contributions.

Improvements

  • Due to time/compute constraints, we didn't evaluate on the entire test set. A more accurate result can be obtained by evaluating on the full dataset(s).
  • Set the bf16 parameter to True to optimize compute efficiency without significantly sacrificing model accuracy.
  • Increase gradient_accumulation_steps to cope with small-GPU constraints, or increase the batch_size if we have access to a larger GPU; the reasoning is mainly to avoid out-of-memory (OOM) errors. A sketch of these settings follows this list.
  • Given more compute resources, we could also increase our patience variable and train for more than 10 epochs.
  • Limit the reasoning portion of the training dataset to Malay only. Since the model has been instruction fine-tuned to reply mainly in Malay, it would be confusing to have it reason back in English.
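
A sketch of what those training-argument tweaks might look like with Hugging Face's TrainingArguments; the exact values here are placeholders, not recommendations:

```python
from transformers import TrainingArguments

# Placeholder values: larger gradient accumulation keeps the effective batch
# size up while the per-device batch stays small enough to avoid OOM errors.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,    # small to fit GPU memory
    gradient_accumulation_steps=16,   # effective batch size = 2 * 16 = 32
    bf16=True,                        # mixed precision on A100-class GPUs
    num_train_epochs=10,
    load_best_model_at_end=True,      # pairs with an early-stopping patience
)
```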
