Adding script for processing many intermediate checkpoints at once for offline evals #731

Open
wants to merge 93 commits into main from jena/consistent-ranking

Commits (93)
ef7a31c  batch convert checkpoint (jenahwang, Sep 6, 2024)
62d9a1d  batch convert checkpoint (jenahwang, Sep 6, 2024)
dba011b  batch convert checkpoint (jenahwang, Sep 6, 2024)
84a4f1b  batch convert checkpoint (jenahwang, Sep 6, 2024)
9a7b03b  batch convert checkpoint (jenahwang, Sep 6, 2024)
d4687e9  batch convert checkpoint (jenahwang, Sep 6, 2024)
24ec144  batch convert checkpoint (jenahwang, Sep 6, 2024)
6187889  batch convert checkpoint (jenahwang, Sep 6, 2024)
fbfda0e  batch convert checkpoint (jenahwang, Sep 6, 2024)
a862a0b  tinkering (jenahwang, Sep 6, 2024)
8d79a01  testing (jenahwang, Sep 6, 2024)
b4ed78d  testing (jenahwang, Sep 6, 2024)
2f2a764  testing (jenahwang, Sep 6, 2024)
b9601c4  testing (jenahwang, Sep 6, 2024)
8cc86ee  testing (jenahwang, Sep 6, 2024)
50e7090  testing (jenahwang, Sep 6, 2024)
8aa450f  testing (jenahwang, Sep 6, 2024)
f357320  testing (jenahwang, Sep 6, 2024)
02899a3  testing (jenahwang, Sep 6, 2024)
9ac3739  testing (jenahwang, Sep 6, 2024)
ef0b403  convert checkpoint batch (jenahwang, Sep 9, 2024)
c0ff186  convert checkpoint batch (jenahwang, Sep 9, 2024)
15092ae  convert checkpoint batch (jenahwang, Sep 9, 2024)
c489f53  convert checkpoint batch (jenahwang, Sep 9, 2024)
2ed6a50  convert checkpoint batch (jenahwang, Sep 9, 2024)
dd3dc18  convert checkpoint batch (jenahwang, Sep 9, 2024)
9bc11d7  convert checkpoint batch (jenahwang, Sep 9, 2024)
e08895f  convert checkpoint batch (jenahwang, Sep 9, 2024)
0b82c2b  convert checkpoint batch (jenahwang, Sep 9, 2024)
482a487  convert checkpoint batch (jenahwang, Sep 9, 2024)
5c015ca  convert checkpoint batch (jenahwang, Sep 9, 2024)
268d74d  convert checkpoint batch (jenahwang, Sep 10, 2024)
1326da7  convert checkpoint batch (jenahwang, Sep 10, 2024)
ccbeef2  convert checkpoint batch (jenahwang, Sep 10, 2024)
d83f2ed  convert checkpoint batch (jenahwang, Sep 10, 2024)
a42de22  convert checkpoint batch (jenahwang, Sep 10, 2024)
e07796b  error catch (jenahwang, Sep 11, 2024)
c6e773d  checking for existing conversions (jenahwang, Sep 11, 2024)
798ded3  minor change (jenahwang, Sep 11, 2024)
45a9374  minor change (jenahwang, Sep 11, 2024)
26a1e26  adding a cleanup flag for removing local directory at the end of the … (jenahwang, Sep 11, 2024)
30a9cb9  fix (jenahwang, Sep 11, 2024)
083ff3e  troubleshooting (jenahwang, Sep 11, 2024)
9c6dc75  troubleshooting (jenahwang, Sep 11, 2024)
54a0d62  troubleshooting (jenahwang, Sep 11, 2024)
25d5a7d  troubleshooting (jenahwang, Sep 11, 2024)
dd9ae8a  minor fixes (jenahwang, Sep 12, 2024)
402f7b7  minor fixes (jenahwang, Sep 12, 2024)
f5806da  minor fixes (jenahwang, Sep 12, 2024)
7718e9f  minor fixes (jenahwang, Sep 12, 2024)
d8436da  minor fixes (jenahwang, Sep 12, 2024)
49dff54  Merge remote-tracking branch 'origin' into jena/consistent-ranking (jenahwang, Sep 12, 2024)
c7717a1  fix (jenahwang, Sep 12, 2024)
b8445ce  updates (jenahwang, Sep 13, 2024)
855d666  handle directories that have unsharded counterparts (jenahwang, Sep 13, 2024)
ed5abb7  fixing error catching (jenahwang, Sep 13, 2024)
cbcbc86  output log edits (jenahwang, Sep 18, 2024)
d49db7b  output log edits (jenahwang, Sep 18, 2024)
cd6a75a  output log edits (jenahwang, Sep 18, 2024)
f8e9c96  output log edits (jenahwang, Sep 18, 2024)
de82922  code cleanup (jenahwang, Oct 4, 2024)
d3e16d7  code cleanup (jenahwang, Oct 4, 2024)
73fc47b  Merge remote-tracking branch 'origin' into jena/consistent-ranking (jenahwang, Oct 4, 2024)
ef8ffbd  testing (jenahwang, Oct 4, 2024)
fac649d  testing (jenahwang, Oct 4, 2024)
7597534  testing (jenahwang, Oct 4, 2024)
94fa6da  downloading fix (jenahwang, Oct 8, 2024)
5e46840  downloading fix (jenahwang, Oct 8, 2024)
3c808a8  Merge remote-tracking branch 'origin' into jena/consistent-ranking (jenahwang, Oct 11, 2024)
da90ae2  . (jenahwang, Oct 11, 2024)
447de12  . (jenahwang, Oct 11, 2024)
3920f2e  addressing errors (jenahwang, Oct 11, 2024)
acaccdd  error fixes for pr (jenahwang, Oct 11, 2024)
a4a40e1  fixing errors for pr (jenahwang, Oct 11, 2024)
3d2bd32  fixing errors for pr (jenahwang, Oct 11, 2024)
d529f5a  fixing errors for pr (jenahwang, Oct 11, 2024)
904ae26  Merge branch 'main' into jena/consistent-ranking (dirkgr, Oct 25, 2024)
4a32be0  removing temp outputs (jenahwang, Oct 26, 2024)
2dc26a9  fixes and cleanups (jenahwang, Oct 26, 2024)
378aafe  adding beaker-gantry to dependencies (jenahwang, Dec 13, 2024)
69d12f3  adding beaker-gantry to dependencies (jenahwang, Dec 13, 2024)
579d612  adding beaker-gantry to dependencies (jenahwang, Dec 13, 2024)
3b9563d  python version apparently has to be 3.10 above for olmo/util.py to run (jenahwang, Dec 13, 2024)
4a882a3  err... no (jenahwang, Dec 13, 2024)
6acbfcc  err... no (jenahwang, Dec 13, 2024)
6e67a9c  tinkering (jenahwang, Dec 13, 2024)
21193cb  undoing changes (jenahwang, Dec 16, 2024)
1b4da65  fix (jenahwang, Dec 16, 2024)
7b6e37f  error code updated (jenahwang, Dec 16, 2024)
264ce05  minor change to the error log (jenahwang, Dec 16, 2024)
2154ea8  fixed error (jenahwang, Dec 17, 2024)
d4e1f42  edited conversion error output (jenahwang, Dec 17, 2024)
f39a522  edited conversion error output (jenahwang, Dec 17, 2024)
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
# beaker yaml
guided-trout-2f805b9.yaml
Review comment (Member): What is this?
Reply: removed


# build artifacts

.eggs/
9 changes: 9 additions & 0 deletions hf_olmo/convert_olmo_to_hf.py
@@ -284,6 +284,12 @@ def main():
help="Keep olmo-specific artifacts in the checkpoint.",
)

parser.add_argument(
    "--cleanup-local-dir",
    action="store_true",
    help="Remove local download of the directory."
)

args = parser.parse_args()

args.destination_dir = args.destination_dir or args.checkpoint_dir
@@ -308,6 +314,9 @@ def main():
upload_local_checkpoint(local_checkpoint_dir, args.destination_dir)

print(f"Converted checkpoint saved to {args.destination_dir}")
if args.cleanup_local_dir:
    print(f"Removing temporary local dir: {local_checkpoint_dir}")
    shutil.rmtree(local_checkpoint_dir)

Review comment (Member): Is there ever a reason not to do this?
Reply (jenahwang, Oct 26, 2024): I removed the if statement & the flag.


if __name__ == "__main__":
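For context on how the thread above was resolved: the `--cleanup-local-dir` flag was dropped and the local staging directory is now removed unconditionally after upload. A minimal sketch of that resolved behavior, with hypothetical function and stub names (only `upload_local_checkpoint`, `local_checkpoint_dir`, and the printed messages come from the diff):

```python
import shutil


def upload_local_checkpoint(local_dir: str, destination_dir: str) -> None:
    """Stand-in for the existing helper in hf_olmo/convert_olmo_to_hf.py."""
    print(f"(would upload {local_dir} to {destination_dir})")


def finish_conversion(local_checkpoint_dir: str, destination_dir: str) -> None:
    # Upload the converted (HF-format) checkpoint to its final destination.
    upload_local_checkpoint(local_checkpoint_dir, destination_dir)
    print(f"Converted checkpoint saved to {destination_dir}")

    # Per the review, cleanup is unconditional: the local copy is only a
    # staging area, so it is removed once the upload has finished.
    print(f"Removing temporary local dir: {local_checkpoint_dir}")
    shutil.rmtree(local_checkpoint_dir, ignore_errors=True)
```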
2 changes: 2 additions & 0 deletions install_torch.sh
@@ -0,0 +1,2 @@
#!/bin/bash
pip install torch
10 changes: 10 additions & 0 deletions log.txt
@@ -0,0 +1,10 @@

Review comment (Member): Don't commit temp output.
Reply: sorry. removed

[gantry ASCII-art banner]

Experiment submitted, see progress at https://beaker.org/ex/01J7446KB7EXZ35D8NST0JTNTY
7 changes: 7 additions & 0 deletions requirements.txt
@@ -0,0 +1,7 @@
torch
Review comment (Member): We don't use requirements.txt in OLMo. We use pyproject.toml.
Reply: Removed --- it was created to troubleshoot.
datasets
rich
botocore
cached-path
transformers
beaker-gantry
95 changes: 95 additions & 0 deletions scripts/convert_checkpoints.sh
@@ -0,0 +1,95 @@
#!/usr/bin/env bash

# Converts s3 checkpoints to HF format and saves them to WEKA.
# To be run from the root of the OLMo repository.
# Requires gantry and AWS access to WEKA.
#
# Usage: scripts/convert_checkpoints.sh <s3 checkpoint to process> [-s] [-c]
#    -s  if a converted checkpoint is already found on s3, save it to weka
#    -c  sanity check; don't actually do the conversion, just go through the motions and print what would happen
#
# calls: convert_checkpoints_batch.py
# usage: convert_checkpoints_batch.py [-h]
# (--checkpoint-path CHECKPOINT_PATH | --checkpoint-path-file CHECKPOINT_PATH_FILE)
# [--weka-load-dir WEKA_LOAD_DIR]
# [--weka-prefix WEKA_PREFIX]
# [--sanity-check] [--save-to-weka]
#
# Example use:
# Run:
# sh scripts/convert_checkpoints.sh s3://ai2-llm/checkpoints/cheap_decisions/dolma-v1-6-and-sources-baseline-3x-code-1B-N-1T-D-mitchish1-001/step9*
# This will convert all models in the directory and save them to:
# weka://oe-eval-default/ai2-llm/checkpoints/cheap_decisions/dolma-v1-6-and-sources-baseline-3x-code-1B-N-1T-D-mitchish1-001-hf/step9*
#
# It will first, though, check that the weka directory doesn't exist AND that s3 doesn't have a corresponding directory (so as not to redo conversions that were already made)
#
# ASSUMPTIONS
# - INPUT must be on s3. Multiple wildcards allowed
# - OUTPUT to weka is saved to the path as found on s3 with "-hf" suffix appended to the path
# - Assumes tokenizer allenai/gpt-neox-olmo-dolma-v1_5.json
#
# OUTPUT logs
# - saves log.jsonl. For every checkpoint found given input:
# -    "unprocessed_path" := checkpoint path to convert
# -    "converted_path" := converted checkpoint path
# -    "conversion" := [new | existing (already in weka) | existing-downloaded (from s3)]
# -    "date_time" := datestamp
# -    "error" := error message if the conversion failed for any reason
# - saves model_checkpoints.jsonl: an input file formatted for oe-eval-internal experiments
# - example log files for the following run:
# > sh scripts/convert_checkpoints.sh s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step91*6-unsharded
# log.jsonl:
# {"unprocessed_path": "s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded", "converted_path": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded-hf", "conversion": "existing", "date_time": "Oct-04-2024_2012", "error": ""}
# {"unprocessed_path": "s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9166-unsharded", "converted_path": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9166-unsharded-hf", "conversion": "existing", "date_time": "Oct-04-2024_2012", "error": ""}
# {"unprocessed_path": "s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9186-unsharded", "converted_path": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9186-unsharded-hf", "conversion": "existing", "date_time": "Oct-04-2024_2012", "error": ""}
# model_checkpoints.jsonl:
# {"model_name": "baseline-300M-1xC", "checkpoints_location": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC", "revisions": ["step9176-unsharded-hf", "step9166-unsharded-hf", "step9186-unsharded-hf"]}
#
# DEFAULTS for this run script:
# - Budget for oe-eval (see below)
# - Weka load path weka://oe-eval-default/ (see below)
# - Gantry experiments saved to beaker://ai2/cheap-decisions
# - Weka prefix is used for model_checkpoints.jsonl
#
# TODOs
# - Make tokenizer updatable

CHECKPOINT_PATH=$1
SAVE_TO_WEKA=""
SANITY_CHECK=""
shift

usage() {
  echo "Usage: $0 <s3 checkpoint to process> [-s] [-c]"
  echo "  -s --save-to-weka"
  echo "  -c --sanity-check"
  exit 1;
}

while getopts "sc" opt; do
  case $opt in
    s) SAVE_TO_WEKA="--save-to-weka" ;;
    c) SANITY_CHECK="--sanity-check" ;; # mostly useful for local test runs; prevents any actual copying or conversion
    *) usage ;;
  esac
done

#echo "python scripts/convert_checkpoints_batch.py --checkpoint-path $CHECKPOINT_PATH --weka-load-dir '/data/input' --weka-prefix 'weka://oe-eval-default' $SAVE_TO_WEKA $SANITY_CHECK"

gantry run \
--description "Converting $CHECKPOINT_PATH" \
--allow-dirty \
--workspace ai2/cheap-decisions \
--priority normal \
--gpus 0 \
--preemptible \
--cluster ai2/jupiter-cirrascale-2 \
--budget ai2/oe-eval \
--env-secret AWS_ACCESS_KEY_ID=JENA_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=JENA_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--weka=oe-eval-default:/data/input \
--yes \
-- /bin/bash -c "python scripts/convert_checkpoints_batch.py --checkpoint-path $CHECKPOINT_PATH --weka-load-dir '/data/input' --weka-prefix 'weka://oe-eval-default' $SAVE_TO_WEKA $SANITY_CHECK"
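The header comments above say the script first checks that a converted "-hf" copy does not already exist on the weka mount or on s3 before converting. That logic lives in scripts/convert_checkpoints_batch.py, which is not shown in this excerpt; the following is only a rough sketch of such a check, with the helper name, parameters, and paths assumed:

```python
import os
import boto3


def conversion_already_exists(bucket: str, checkpoint_prefix: str, weka_mount: str) -> bool:
    """Hypothetical helper: skip a checkpoint when its "-hf" output already
    exists on the weka mount or under the same prefix on s3."""
    # Inside the gantry job, weka is mounted at /data/input
    # (--weka=oe-eval-default:/data/input), so the converted directory
    # can be tested with a plain filesystem check.
    if os.path.isdir(os.path.join(weka_mount, checkpoint_prefix + "-hf")):
        return True

    # Otherwise, look for any object under the converted prefix on s3.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=checkpoint_prefix + "-hf/", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0
```

For a checkpoint like s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded, this sketch would be called with bucket "ai2-llm" and the key prefix "checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded".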

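The OUTPUT logs section of the script header describes two JSONL files. A small sketch of how one record of each might be assembled (hypothetical construction; the field names and example values are copied from the header above):

```python
import json
from datetime import datetime

# One log.jsonl record per checkpoint found (fields as described in the header).
log_record = {
    "unprocessed_path": "s3://ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded",
    "converted_path": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC/step9176-unsharded-hf",
    "conversion": "existing",  # new | existing | existing-downloaded
    "date_time": datetime.now().strftime("%b-%d-%Y_%H%M"),
    "error": "",
}

# One model_checkpoints.jsonl record per model, grouping its converted revisions
# in the format expected by oe-eval-internal experiments.
model_record = {
    "model_name": "baseline-300M-1xC",
    "checkpoints_location": "weka://oe-eval-default/ianm/ai2-llm/checkpoints/OLMo-ladder/baseline-300M-1xC",
    "revisions": ["step9176-unsharded-hf", "step9166-unsharded-hf", "step9186-unsharded-hf"],
}

with open("log.jsonl", "a") as f:
    f.write(json.dumps(log_record) + "\n")
with open("model_checkpoints.jsonl", "a") as f:
    f.write(json.dumps(model_record) + "\n")
```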