This repo contains my 1st-place solution code for the Eedi - Mining Misconceptions in Mathematics Kaggle competition. The full solution is described here. Please refer to the following sections for details on dependencies, training, and synthetic data generation. If you run into any issues with the setup/code or have any questions/suggestions, please feel free to contact me at [email protected]. Thanks!
I rented compute from the vast.ai cloud platform. The models were trained on an instance with the following specifications:
- 2x H100 SXM 80GB / 2x H100 NVL 94GB
- GPU Memory Bandwidth: 2045 GB/s
- Xeon® Gold 6448Y CPU (32 vCPUs)
- RAM: 256 GB
- Disk space: 512 GB
To train the models, launch a VM using the pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel Docker image. On vast.ai, you can use this template.
Next, please clone the repo and install the dependencies.
git clone https://github.com/rbiswasfc/eedi-mining-misconceptions.git
cd eedi-mining-misconceptions
pip install -r requirements.txt
pip install "flash_attn==2.6.3" --no-build-isolation
Please export your Kaggle username and token to the environment variables KAGGLE_USERNAME and KAGGLE_KEY. They will be needed to download the competition datasets. The API keys can be obtained from the Kaggle Settings page.
export KAGGLE_USERNAME=******
export KAGGLE_KEY=******
Next, download the required datasets by running:
python download_datasets.py
The script will download and cache several datasets using the kagglehub library:
- eedi-mining-misconceptions-in-mathematics: Competition dataset
- conjuring92/eedi-five-folds: 5-fold cross-validation splits. Please refer to this notebook for more details on validation.
- conjuring92/eedi-silver-v3: Synthetic dataset containing 1.8k competition MCQs + 10.6k synthetic MCQs
- conjuring92/eedi-embed-pretrain-mix-final: Synthetic dataset for pre-training of retrieval models
- conjuring92/eedi-embed-mix-silver-v3: Retriever fine-tuning dataset
- conjuring92/eedi-ranker-silver-v3-teacher-blended-cot: Pointwise re-ranker training dataset with CoT and teacher scores for distillation
- conjuring92/eedi-tutor-mix-v8: Listwise re-ranker training dataset
- conjuring92/eedi-misconception-clusters: Clusters of misconceptions for synthetic data generation
- conjuring92/eedi-cot-gen-base: Base dataset for generating reasoning samples from Claude 3.5 Sonnet
- conjuring92/eedi-cot-sonnet-6k: 6k reasoning samples from Claude 3.5 Sonnet
- conjuring92/eedi-cot-train-silver-v3: Synthetic dataset (with minor processing to pair each question with each incorrect answer) used for training the reasoning models
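For reference, download_datasets.py relies on the kagglehub library under the hood. The snippet below is a minimal sketch of the underlying calls (assuming a recent kagglehub version; the exact dataset list and any caching logic live in the script itself):

```python
import kagglehub

# Competition data (requires accepting the competition rules on Kaggle first).
comp_path = kagglehub.competition_download("eedi-mining-misconceptions-in-mathematics")

# Supplementary datasets, e.g. the 5-fold cross-validation splits.
folds_path = kagglehub.dataset_download("conjuring92/eedi-five-folds")

print(comp_path, folds_path)
```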
It is recommended to download the required backbones from the HF Hub before training the models. Downloads are much faster with hf_transfer enabled.
pip install huggingface_hub[hf_transfer]
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen2.5-Math-7B
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen2.5-14B
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen2.5-32B
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen2.5-72B
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download intfloat/e5-mistral-7b-instruct
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download BAAI/bge-en-icl
These backbones will be used for fine-tuning retrievers, re-rankers and reasoning models.
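If you prefer to script the downloads, the huggingface_hub Python API offers an equivalent. The sketch below mirrors the CLI commands above (enabling hf_transfer is optional but speeds up large downloads):

```python
import os

# Enable accelerated transfers before importing huggingface_hub (optional).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

backbones = [
    "Qwen/Qwen2.5-Math-7B",
    "Qwen/Qwen2.5-14B",
    "Qwen/Qwen2.5-32B",
    "Qwen/Qwen2.5-72B",
    "intfloat/e5-mistral-7b-instruct",
    "BAAI/bge-en-icl",
]

for repo_id in backbones:
    # Downloads into the shared HF cache (~/.cache/huggingface by default).
    snapshot_download(repo_id=repo_id)
```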
Models were trained using the HF accelerate library with DDP. Specifically, the following accelerate config was used:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
This configuration can be generated by running accelerate config from the terminal.
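Alternatively, save the YAML above to a file and pass it explicitly at launch time with the --config_file flag of accelerate launch.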
The solution pipeline involved 4 types of models:
- Retrievers: Used to retrieve top 32-64 misconceptions for a given Question and Incorrect Answer pair.
- Pointwise re-rankers (14B and 32B): Used to re-rank retrieved misconceptions. Model sees one misconception at a time in its context.
- Listwise re-rankers (72B): Used to re-rank retrieved misconceptions. Model sees the top-n misconceptions together in its context.
- Reasoners: Used to generate the reasoning behind selecting an incorrect answer. These reasoning traces are used in the re-rankers to help with ranking.
If you want to track training runs using wandb, please log in to your wandb account by running wandb login from the terminal.
The retriever models were trained using the train_llm_embedding.py script. Please run the following commands to fine-tune intfloat/e5-mistral-7b-instruct and BAAI/bge-en-icl for the misconception retrieval task.
accelerate launch ./code/train_llm_embedding.py --config-name conf_intfloat use_wandb=true full_fit=true
accelerate launch ./code/train_llm_embedding.py --config-name conf_bge use_wandb=true full_fit=true
The full_fit flag will train the models on all available data. If you want to validate model performance on fold=0, please set full_fit=false.
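For example, to validate the intfloat retriever on fold 0:
accelerate launch ./code/train_llm_embedding.py --config-name conf_intfloat use_wandb=true full_fit=false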
The Qwen/Qwen2.5-14B retriever was trained in two stages. It was first pre-trained on a synthetic dataset with a large number of MCQs and misconceptions:
accelerate launch ./code/train_llm_embedding.py --config-name conf_qwen14b_pretrain use_wandb=true
Next, the LoRA adapters were merged with Qwen/Qwen2.5-14B to create the base model for further fine-tuning:
python code/merge_adapter.py \
--backbone_path Qwen/Qwen2.5-14B \
--adapter_path ../models/eedi_embed_qwen14b_pretrain_lora \
--save_dir ../models/eedi_embed_qwen14b_pretrain
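For reference, the merge step conceptually does the following with the peft library (a simplified sketch; merge_adapter.py handles the actual model class, paths, and tokenizer saving):

```python
import torch
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

backbone_path = "Qwen/Qwen2.5-14B"
adapter_path = "../models/eedi_embed_qwen14b_pretrain_lora"
save_dir = "../models/eedi_embed_qwen14b_pretrain"

# Load the base backbone, attach the trained LoRA adapter, then fold the
# adapter weights into the base weights so no adapter is needed at load time.
base = AutoModel.from_pretrained(backbone_path, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained(save_dir)
AutoTokenizer.from_pretrained(backbone_path).save_pretrained(save_dir)
```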
Finally, the merged model was fine-tuned similarly to the other retriever models:
accelerate launch ./code/train_llm_embedding.py --config-name conf_qwen14b_finetune use_wandb=true full_fit=true
The reasoning models were trained using the train_llm_reasoner.py script. Please run the following commands to train the models.
accelerate launch ./code/train_llm_reasoner.py --config-name conf_reasoner_7b use_wandb=true full_fit=true
accelerate launch ./code/train_llm_reasoner.py --config-name conf_reasoner_14b use_wandb=true full_fit=true
accelerate launch ./code/train_llm_reasoner.py --config-name conf_reasoner_32b use_wandb=true full_fit=true
The pointwise re-rankers were trained using the train_ranker_pointwise.py script. Please run the following commands to train the models.
accelerate launch ./code/train_ranker_pointwise.py --config-name conf_pointwise_14b use_wandb=true full_fit=true
accelerate launch ./code/train_ranker_pointwise.py --config-name conf_pointwise_32b use_wandb=true full_fit=true
Training the 32B re-ranker requires 2x H100 NVL GPUs.
Pointwise re-rankers processed one misconception at a time in the context window.
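A hypothetical sketch of what such an input might look like (the field names and wording here are illustrative assumptions, not the repo's actual prompt template):

```python
# Hypothetical pointwise re-ranker input (illustrative only).
pointwise_input = """Question: Simplify 3/6 + 1/2
Correct Answer: 1
Incorrect Answer: 4/8

Candidate Misconception: When adding fractions, adds the numerators and adds the denominators

Does this misconception explain the incorrect answer? Answer Yes or No."""
```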
The listwise re-rankers were trained using the train_ranker_listwise.py script. Please run the following command to train the model.
accelerate launch ./code/train_ranker_listwise.py --config-name conf_listwise_72b use_wandb=true full_fit=true
Listwise re-rankers processed the top 5 misconceptions together in the context window.
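A hypothetical sketch of a listwise input, with the retrieved candidates presented together (again, purely illustrative; the repo's actual template may differ):

```python
# Hypothetical listwise re-ranker input (illustrative only).
listwise_input = """Question: Simplify 3/6 + 1/2
Correct Answer: 1
Incorrect Answer: 4/8

Candidate Misconceptions:
A. When adding fractions, adds the numerators and adds the denominators
B. Believes fractions cannot be simplified before adding
C. Converts fractions to decimals incorrectly before adding
D. Believes the denominator stays the same when adding fractions
E. Confuses the addition sign with multiplication

Rank the candidates from most to least likely to explain the incorrect answer."""
```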
Synthetic data played a crucial role in improving both raw performance and generalization to unseen misconceptions. It was generated with the help of Claude 3.5 Sonnet and GPT-4o. The synthetic examples can be accessed here. Optionally, you can generate your own examples using the scripts in the synthetic folder.
First, please make sure to export required API keys:
export OPENAI_API_KEY=***
export ANTHROPIC_API_KEY=***
Next, you can generate and curate synthetic data using the following scripts:
python synthetic/generate_claude.py --config-path conf/synthetic/conf_gen_claude.yaml
python synthetic/judge_oai.py --config-path conf/synthetic/conf_eval_oai.yaml
For Chain of Thought (CoT) generation, please run the following command:
python synthetic/cot_claude.py --config-path conf/synthetic/conf_cot_claude.yaml
Notes:
- The clustering notebook demonstrates how similar misconceptions were clustered.
- The CoT generation input dataset was prepared using the CoT generation preparation notebook.
Trained models were quantized using autoawq for inference. You can use the following commands to quantize different models:
python awq_quantization.py --model_path ../models/qwen_pointwise_merged --quant_path ../models/pointwise_awq --calib_data rbiswasfc/eedi-awq-calibration --max_calib_seq_len 1024
python awq_quantization.py --model_path ../models/qwen_listwise_merged --quant_path ../models/listwise_awq --calib_data rbiswasfc/eedi-awq-calibration-tutor --max_calib_seq_len 1600
python awq_quantization.py --model_path ../models/qwen_reasoner_merged --quant_path ../models/reasoner_awq --calib_data rbiswasfc/eedi-awq-calibration-cot --max_calib_seq_len 1024
You will need to update model_path to point to the trained model (after merging its LoRA adapters); quant_path is the path where the quantized model will be saved.
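For reference, the quantization script follows the standard AutoAWQ workflow. The sketch below is a simplified version (the quant_config values are assumptions; awq_quantization.py may differ in details):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "../models/qwen_pointwise_merged"  # merged (LoRA folded in) model
quant_path = "../models/pointwise_awq"          # where the AWQ weights are written

# Typical 4-bit AWQ settings (assumed, not necessarily the repo's exact config).
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate on the competition-specific calibration set and quantize.
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="rbiswasfc/eedi-awq-calibration",
    max_calib_seq_len=1024,
)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```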
My best selected inference notebook can be found here.