This is a script to simplify running many evals simultaneously in our slurm environment. It accepts Hugging Face model IDs (e.g. `org/modelname`) or directory paths to models in the Hugging Face format.

Common run configs can be found in command-line scripts like `cpt.sh`, which you can run like this:

```sh
# run common tests for Finnish CPT evals
sh cpt.sh /path/to/somemodel
```
You can also invoke the script directly to run individual evals as needed:

```sh
python main.py \
    --partition standard-g \
    --time 04:00:00 \
    --model path/to/model_step1234 \
    eval_name1 eval_name2
```
The script will try to avoid running evals for which you already have results, or for which there already appear to be jobs in the slurm queue. It determines the latter by reviewing the logs in `command_history.jsonl`.
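If you want to perform a similar check by hand, a rough sketch follows. The model path is a placeholder, and the exact `squeue` options may vary by site; this is not the script's actual implementation:

```sh
# Hedged sketch: print job IDs recorded for a model that are still in the
# slurm queue. The model path below is illustrative only.
squeue -u "$USER" --noheader -o '%i' > /tmp/queued_ids
jq -r 'select(.model == "/path/to/somemodel") | .job_id' command_history.jsonl \
  | grep -Fxf /tmp/queued_ids
```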
Slurm job output is stored in the `logs` subdirectory. Results are written by default into the `output` subdirectory, in a JSON format which is not particularly convenient. There is a `summary.sh` script which will extract the correct scores for each eval that is available:

```sh
sh summary.sh output/v2/meta-llama/Llama-3.1-8B
```
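If you need a number before `summary.sh` covers an eval, you can dig it out of a results file with `jq` directly. This sketch assumes the files follow an lm-evaluation-harness-style layout (`{"results": {"<eval>": {"<metric>": <value>, ...}}}`); the path is an example only:

```sh
# Hedged sketch: print task, metric, and value triples from one results
# file, assuming lm-evaluation-harness style JSON. Path is illustrative.
jq -r '.results | to_entries[] | .key as $task
       | .value | to_entries[] | "\($task)\t\(.key)\t\(.value)"' \
  output/v2/meta-llama/Llama-3.1-8B/hellaswag.json
```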
The `watch.py` script is a convenience script to help keep track of the jobs you have running. It has two operational modes.

In the default mode it prints the jobs that are currently queued or running and, if available, the last line of each job's error log, which often contains the most recent tqdm progress bar for running jobs. If you specify the `--once` flag it will do this and exit; if you do not, it will keep checking job status periodically and provide updates as jobs complete.
```
$ python watch.py --once
9678732 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 hellaswag_mt_fi is queued.
9678731 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 hellaswag is queued.
9678730 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 gsm8k_mt_fi is queued.
9678729 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 gsm8k is queued.
9678728 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 mmlu_mt_fi is queued.
9670938 meta-llama/Llama-3.1-70B hellaswag_mt_fi is running.
  Running loglikelihood requests: 43%|████▎     | 17337/40168 [13:41:50<16:37:51, 2.62s/it]
9678481 /scratch/project_462000353/converted-checkpoints/llama31-8b-tp2-pp1-megatron-format-lr5e-5_iter_0011920_bfloat16 gsm8k is running.
  Running generate_until requests: 4%|▍         | 59/1319 [07:29<1:57:40, 5.60s/it]
9678727 /scratch/project_462000353/converted-checkpoints/llama31_8B_culturax50B_2e-5/iter_0011920 mmlu is running.
  Running loglikelihood requests: 83%|████████▎ | 46725/56168 [34:57<03:33, 44.33it/s]
```
A more recently added operational mode may be even more useful. Specifying the `--hist` flag will show a report of the jobs that have completed in the last 3 days (controllable with the `--days` flag), sorted by model name and status. It also does some coalescing: if an eval is ultimately successful, it won't bother reporting on failed runs, etc. This is helpful for identifying evals which have failed and need to be investigated or rerun.
```
$ python watch.py --hist --days 1
Model: meta-llama/Llama-3.1-70B
Results dir: /pfs/lustrep2/scratch/project_462000353/jburdge/git/evals/output/v2/meta-llama/Llama-3.1-70B
  Completed:
    gsm8k_mt_fi
    gsm8k_mt_fi
  Running/Queued:
    hellaswag_mt_fi 9670938
  Failed:
    hellaswag_mt_fi /pfs/lustrep2/scratch/project_462000353/jburdge/git/evals/logs/9639428.err
```
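You can approximate part of this report outside the script as well. This sketch only collapses the history to the latest entry per model/eval pair; it is not `watch.py`'s actual coalescing logic:

```sh
# Hedged sketch: reduce command_history.jsonl to the most recent entry per
# (model, eval) pair. Lexicographic max on "YYYY-MM-DD HH:MM:SS" timestamps
# is chronological, so max_by(.timestamp) picks the latest run.
jq -s 'group_by([.model, .eval])
       | map(max_by(.timestamp) | {model, eval, job_id, err_log})' \
  command_history.jsonl
```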
All evals are logged in `command_history.jsonl`, which is used by various scripts to monitor job status and report history. An entry looks like this:
```json
{
  "timestamp": "2023-11-09 08:21:12",
  "script_name": "/tmp/tmpvv66ri7g",
  "job_id": "4868114",
  "eval": "hellaswag",
  "model": "/scratch/project_462000319/general-tools/checkpoints/33B_torch_step70128_bfloat16",
  "tokenizer": "/scratch/project_462000319/tokenizers/tokenizer_v6_fixed_fin",
  "err_log": "/pfs/lustrep4/scratch/project_462000319/evals/logs/4868114.err",
  "out_log": "/pfs/lustrep4/scratch/project_462000319/evals/logs/4868114.out",
  "output_file": "/pfs/lustrep4/scratch/project_462000319/evals/output/poro-34b/step70128/hellaswag.json"
}
```
You can also use this information directly. If you've just queued up a bunch of evals against a model and realize you made a mistake and need to cancel them all, you could do something like this to save a lot of typing:

```sh
grep /path/to/model command_history.jsonl | jq -r .job_id | xargs scancel
```
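The same approach works for digging into failures. For example, this sketch (the eval name is just a placeholder) opens the error log of the most recent recorded run of a given eval:

```sh
# Hedged sketch: view the error log for the latest recorded run of an eval.
# "gsm8k" is an example; substitute the eval you are investigating.
grep '"eval": "gsm8k"' command_history.jsonl | tail -n 1 | jq -r .err_log | xargs less
```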