AI Tinkerers Kuala Lumpur October 2024 Hackathon (LLM-as-a-Judge): Fine-Tuning Malaysian DeBERTaV2 & Mistral 7B for Logical Consistency Classification and Reasoning
This repo contains the code behind the 1st place solution to the AI Tinkerers Kuala Lumpur hackathon's LLM-as-a-Judge use case.
It involves fine-tuning Malaysian DeBERTaV2 and Mistral 7B models for yes/no classification and reasoning, framed as a natural language inference (NLI) task.
In our case, NLI is the task of determining whether a "hypothesis" is true (entailment) or false (contradiction) given a statement-question or paragraph-statement pair. By leveraging translated datasets and Chain-of-Thought reasoning techniques, this project demonstrates the potential for fine-tuned smaller models to act as scalable judges, in line with the JudgeLM paper.
Crucially, we leverage open source models and datasets in the Malay language to enable adaptation to the local Malaysian context.
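To make the task concrete, here is a minimal sketch of the kind of statement pair the judge classifies; the exact prompt and label format used in training is an assumption here, not the literal dataset schema.

```python
# Hypothetical example pair (illustrative only; not the exact dataset schema).
premise = "Kuala Lumpur ialah ibu negara Malaysia."    # "Kuala Lumpur is the capital of Malaysia."
hypothesis = "Ibu negara Malaysia ialah Johor Bahru."  # "The capital of Malaysia is Johor Bahru."
label = "Tidak"  # "No" -> contradiction; an entailed hypothesis would be labelled "Ya" ("Yes")
```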
A comprehensive presentation we prepared for the Hackathon can be found here.
- Dataset Translation: Translated English datasets into Malay using OpenAI's 4o-mini to enable focused fine-tuning for Malay language understanding (a short translation/CoT sketch follows this list).
- Chain-of-Thought Reasoning: Augmented datasets with CoT reasoning using OpenAI's 4o-mini to enhance logical reasoning capabilities.
- Fine-Tuning: Utilized Google Colab's A100 GPU (40 GB VRAM) to fine-tune models on the curated datasets using QLoRA and Hugging Face's SFTTrainer (a QLoRA fine-tuning sketch also follows the list).
- Benchmarking: Benchmarking and training runs were monitored using Weave (Weights & Biases).
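As a rough illustration of the dataset-preparation bullets above, the sketch below translates a BoolQ-style row into Malay and generates a Malay chain-of-thought column with gpt-4o-mini. It assumes the openai>=1.0 Python client and an `OPENAI_API_KEY` in the environment; the prompts, column names, and example row are illustrative assumptions, not the exact pipeline used to build the published datasets.

```python
# Sketch only: translate a BoolQ-style row to Malay and add a chain-of-thought column.
# Prompts, column names, and the example row are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

row = {
    "passage": "The aurora is caused by charged particles from the sun hitting the atmosphere.",
    "question": "is the aurora caused by the sun",
    "answer": True,
}

# 1) Translate the passage and question into Malay.
row["passage_ms"] = ask(f"Terjemahkan teks berikut ke Bahasa Melayu:\n\n{row['passage']}")
row["question_ms"] = ask(f"Terjemahkan soalan berikut ke Bahasa Melayu:\n\n{row['question']}")

# 2) Generate a Malay chain-of-thought that justifies the yes/no answer.
row["reasoning_ms"] = ask(
    "Terangkan langkah demi langkah, dalam Bahasa Melayu, mengapa jawapannya "
    f"'{'Ya' if row['answer'] else 'Tidak'}'.\n\n"
    f"Petikan: {row['passage_ms']}\nSoalan: {row['question_ms']}"
)
```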
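And a condensed sketch of the fine-tuning setup: 4-bit QLoRA on the Malaysian Mistral checkpoint with TRL's SFTTrainer, logged to Weights & Biases. The hyperparameters, the `text` column name, and the exact SFTTrainer signature (which varies across trl versions) are assumptions, not a verbatim copy of our training script.

```python
# Sketch of QLoRA fine-tuning with Hugging Face + TRL; values are illustrative.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

base = "mesolitica/malaysian-mistral-7b-32k-instructions-v4"

bnb = BitsAndBytesConfig(                      # 4-bit quantisation (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

peft_config = LoraConfig(                      # low-rank adapters on attention projections
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

train_ds = load_dataset("wanadzhar913/boolq-malay-with-chain-of-thought", split="train")

args = TrainingArguments(
    output_dir="malaysian-mistral-llmasajudge",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    report_to="wandb",                         # training runs monitored in Weights & Biases
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    train_dataset=train_ds,
    peft_config=peft_config,
    dataset_text_field="text",                 # assumed prompt/response column
    max_seq_length=2048,
)
trainer.train()
```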
Original models:
- https://huggingface.co/mesolitica/malaysian-mistral-7b-32k-instructions-v4
- https://huggingface.co/mesolitica/malaysian-debertav2-base
Fine-tuned models:
- NLI only: https://huggingface.co/wanadzhar913/malaysian-debertav2-finetune-on-boolq
- NLI only: https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v2
- NLI & Reasoning: https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3
Original datasets:
- BoolQ (English)
- FIB (English)
Translated datasets:
- https://huggingface.co/datasets/wanadzhar913/fib-malay
- https://huggingface.co/datasets/wanadzhar913/boolq-malay
Translated & Reasoning column generated datasets:
- https://huggingface.co/datasets/wanadzhar913/fib-malay-with-chain-of-thought
- https://huggingface.co/datasets/wanadzhar913/boolq-malay-with-chain-of-thought
Our approach yielded significant improvements in logical reasoning for both Malay and English, validated by accuracy and F1-score (a short scoring sketch follows the table). These results secured 1st place at the AI Tinkerers Hackathon in Kuala Lumpur.*
| Model | Accuracy (%) | F1-Score (%) |
|---|---|---|
| OpenAI 4o-mini | 78 | 80 |
| Malaysian DeBERTaV2 | 51 | 48 |
| Malaysian Mistral V2 | 65 | 74 |
| Malaysian Mistral V3 | 61 | 69 |
*Due to time/compute constraints, we didn't evaluate on the entire test set. You can check how we sampled the testing set here.
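For reference, the headline metrics above can be computed from model predictions with scikit-learn; the label strings and sample predictions below are placeholders, not our actual evaluation data.

```python
# Sketch: score yes/no ("Ya"/"Tidak") predictions with accuracy and F1.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["Ya", "Tidak", "Ya", "Ya"]   # gold labels from the sampled test set
y_pred = ["Ya", "Ya", "Ya", "Tidak"]   # labels parsed from model responses

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred, pos_label="Ya"))
```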
Special thanks to:
- Mesolitica for their open-source models we used for fine-tuning.
- AI Tinkerers Kuala Lumpur for organizing the hackathon.
- Joseph from DocuAsk for providing OpenAI credits enabling us to access 4o-mini.
- Team members and collaborators for their contributions.
Future improvements:
- Due to time/compute constraints, we didn't evaluate on the entire test set. A more accurate result can be obtained by evaluating on the entire dataset(s).
- Set the `bf16` parameter to `True` to optimize compute efficiency without significantly sacrificing model accuracy.
- Increase `gradient_accumulation_steps` to deal with the small GPU constraints, or increase the `batch_size` if we have access to a larger GPU. The reasoning is mainly to avoid Out of Memory (OOM) errors.
- Given more compute resources, we can also increase our `patience` variable and train for more than 10 epochs (see the sketch after this list).
- Limit the reasoning portion (in the training dataset) to only be in Malay. Since the model has been instruction fine-tuned to mainly reply in Malay, it'd be confusing to have it reason back in English.
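The sketch below shows where these knobs live in a Hugging Face `TrainingArguments`/`EarlyStoppingCallback` setup; the specific values are illustrative assumptions, not tested settings.

```python
# Sketch: the knobs mentioned above in a Hugging Face Trainer setup (values illustrative).
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="malaysian-mistral-llmasajudge-next",
    bf16=True,                          # bfloat16 compute on A100-class GPUs
    per_device_train_batch_size=2,      # raise on a larger GPU...
    gradient_accumulation_steps=16,     # ...or raise this to avoid OOM on a small one
    num_train_epochs=20,                # train past 10 epochs, given more compute
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="eval_loss",
)

# "Patience": stop if eval loss hasn't improved for 5 consecutive evaluations.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
# Pass `args` and `callbacks=[early_stopping]` to the Trainer/SFTTrainer constructor.
```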