Peteish13 #739
Changes from 250 commits
@@ -0,0 +1,79 @@
FROM --platform=linux/amd64 nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

ARG DEBIAN_FRONTEND="noninteractive"
ENV TZ="America/Los_Angeles"

# Install base tools.
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    jq \
    language-pack-en \
    make \
    sudo \
    unzip \
    vim \
    wget \
    parallel \
    iputils-ping \
    tmux

ARG BEAKER_VERSION
RUN curl --silent \
    --connect-timeout 5 \
    --max-time 10 \
    --retry 5 \
    --retry-delay 0 \
    --retry-max-time 40 \
    --output beaker.tar.gz \
    "https://beaker.org/api/v3/release/cli?os=linux&arch=amd64&version=${BEAKER_VERSION}" \
    && tar -zxf beaker.tar.gz -C /usr/local/bin/ ./beaker \
    && rm beaker.tar.gz

# This ensures the dynamic linker (or NVIDIA's container runtime, I'm not sure)
# puts the right NVIDIA things in the right place.
ENV NVIDIA_DRIVER_CAPABILITIES=graphics,utility,compute

# Install conda. We give anyone in the users group the ability to run
# conda commands and install packages in the base (default) environment.
# Things installed into the default environment won't persist, but we prefer
# convenience in this case and try to make sure the user is aware of this
# with a message that's printed when the session starts.
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.1.0-1-Linux-x86_64.sh \
    && echo "32d73e1bc33fda089d7cd9ef4c1be542616bd8e437d1f77afeeaf7afdb019787  Miniconda3-py310_23.1.0-1-Linux-x86_64.sh" \
        | sha256sum --check \
    && bash Miniconda3-py310_23.1.0-1-Linux-x86_64.sh -b -p /opt/miniconda3 \
    && rm Miniconda3-py310_23.1.0-1-Linux-x86_64.sh

ENV PATH=/opt/miniconda3/bin:/opt/miniconda3/condabin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

RUN conda install -y pytorch::pytorch==2.5.1 packaging "numpy<2"

# Ensure users can modify their container environment.
RUN echo '%users ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers

# Install MLNX OFED user-space drivers.
# See https://docs.nvidia.com/networking/pages/releaseview.action?pageId=15049785#Howto:DeployRDMAacceleratedDockercontaineroverInfiniBandfabric.-Dockerfile
ENV MOFED_VER 5.8-1.1.2.1
ENV OS_VER ubuntu20.04
ENV PLATFORM x86_64
RUN wget --quiet https://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz && \
    MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}/mlnxofedinstall --basic --user-space-only --without-fw-update -q && \
    rm -rf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM} && \
    rm MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz

RUN apt-get install ninja-build -y

ENV HF_HUB_ENABLE_HF_TRANSFER=1
RUN pip install --no-cache-dir --upgrade pip "setuptools<70.0.0" wheel
# TODO: unpin setuptools when this issue in flash-attention is resolved.
RUN pip install --no-cache-dir flash-attn==2.6.3 --no-build-isolation
RUN python -c "import torch; print(torch.__version__)"

RUN pip install --no-cache-dir ai2-olmo-core==0.1.0 omegaconf rich boto3 google-cloud-storage tokenizers "cached_path>=1.6.2" transformers importlib_resources py-spy wandb beaker-gantry click torchmetrics safetensors datasets scikit-learn "msgspec>=0.14.0" "smashed[remote]>=0.21.1"

RUN apt-get clean
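
For orientation, this is roughly how such an image gets built; the image tag and the BEAKER_VERSION value below are illustrative assumptions, not values from this PR:

# Hypothetical build command for the Dockerfile above (tag and version assumed):
docker build \
    --build-arg BEAKER_VERSION=v1.5.0 \
    -t olmo-peteish-augusta:latest \
    .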
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

gantry run \
  --workspace ai2/13B \
  --task-name peteish1 \
  --description "Peteish1" \
  --priority urgent \
  --preemptible \
  --beaker-image michalg/cuda11.8-ubuntu20.04-arb \
  --cluster ai2/augusta-google-1 \
  --gpus 8 \
  --replicas "${NUM_NODES}" \
  --leader-selection \
  --host-networking \
  --budget ai2/oe-training \
  --no-nfs \
  --propagate-failure \
  --propagate-preemption \
  --synchronized-start-timeout 15m \
  --no-python \
  --env LOG_FILTER_TYPE=local_rank0_only \
  --env OMP_NUM_THREADS=8 \
  --env OLMO_TASK=model \
  --env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
  --env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
  --env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
  --shared-memory 10GiB \
  --yes \
  --timeout=-1 \
  --allow-dirty \
  --retries 10 \
  -- /bin/bash -c "scripts/augusta/peteish1.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK"
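
A hypothetical invocation of this launcher; its own filename is not shown in this diff view, so the path and node count here are assumptions:

# Launch Peteish1 on 32 nodes of 8 GPUs each (path and node count assumed):
bash scripts/augusta/peteish1-launch.sh 32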
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

gantry run \
  --workspace ai2/13B \
  --task-name peteish1-muplr \
  --description "Peteish1 muP LR" \
  --priority high \
  --preemptible \
  --beaker-image michalg/cuda11.8-ubuntu20.04-arb \
  --cluster ai2/augusta-google-1 \
  --gpus 8 \
  --replicas "${NUM_NODES}" \
  --leader-selection \
  --host-networking \
  --budget ai2/oe-training \
  --no-nfs \
  --propagate-failure \
  --propagate-preemption \
  --synchronized-start-timeout 15m \
  --no-python \
  --env LOG_FILTER_TYPE=local_rank0_only \
  --env OMP_NUM_THREADS=8 \
  --env OLMO_TASK=model \
  --env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
  --env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
  --env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
  --shared-memory 10GiB \
  --yes \
  --timeout=-1 \
  --allow-dirty \
  --retries 10 \
  -- /bin/bash -c "scripts/augusta/peteish1-muplr.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK"
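
This muP-LR launcher is otherwise identical to the base Peteish1 launcher; the differences, all taken from the two scripts above, summarize as:

# Differences from the base launcher (everything else matches line for line):
#   --task-name:    peteish1     ->  peteish1-muplr
#   --description:  "Peteish1"   ->  "Peteish1 muP LR"
#   --priority:     urgent       ->  high
#   entrypoint:     peteish1.sh  ->  peteish1-muplr.sh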
@@ -0,0 +1,87 @@
#!/usr/bin/env bash

set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Augusta-specific environment
export LD_LIBRARY_PATH="/var/lib/tcpxo/lib64:${LD_LIBRARY_PATH}"
export NCCL_CROSS_NIC=0
export NCCL_ALGO=Ring,Tree
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=4
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_P2P_PCI_CHUNKSIZE=524288
export NCCL_P2P_NVL_CHUNKSIZE=1048576
export NCCL_FASTRAK_NUM_FLOWS=2
export NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL=0
export NCCL_BUFFSIZE=8388608
export NCCL_FASTRAK_USE_SNAP=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING=0
export NCCL_TUNER_PLUGIN=libnccl-tuner.so
export NCCL_TUNER_CONFIG_PATH=/var/lib/tcpxo/lib64/a3plus_tuner_config.textproto
export NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/var/lib/tcpxo/lib64/a3plus_guest_config.textproto
export NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS=600000
export NCCL_NVLS_ENABLE=0
export NCCL_DEBUG=WARN
export NCCL_FASTRAK_CTRL_DEV=enp0s12
export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
export NCCL_SOCKET_IFNAME=enp0s12
export NCCL_USE_SNAP=1
export NCCL_FASTRAK_USE_LLCM=1
export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices

# Install flash-attn
#conda install -y pytorch-cuda==12.4 packaging ninja cccl cuda-nvcc libcusolver-dev cuda-profiler-api libcusparse-dev libcublas-dev -c pytorch -c nvidia
#pip install flash-attn==2.5.9.post1 --no-build-isolation
pip install '.[train]'
pip freeze

# Force processes to synchronize at init_process_group
export TORCH_DIST_INIT_BARRIER=1
# Better error handling from Python
export PYTHONFAULTHANDLER=1

NAME=${GANTRY_TASK_NAME// /_}
RUN_NAME=$NAME-$(date -u +"%Y%m%d_%H%M%S")
SAVE_FOLDER=/data/$RUN_NAME
mkdir -p $SAVE_FOLDER

torchrun \
  --nnodes "${BEAKER_REPLICA_COUNT}:${BEAKER_REPLICA_COUNT}" \
  --nproc-per-node 8 \
  --rdzv_id 12348 \
  --rdzv_backend static \
  --rdzv_endpoint "${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" \
  --node_rank "${BEAKER_REPLICA_RANK}" \
  --rdzv_conf 'read_timeout=420' \
  scripts/train.py \
    configs/peteish1-google.yaml \
    --run_name=$RUN_NAME \
    --wandb.group=$NAME \
    --optimizer.learning_rate=7.81e-3 \
    --save_interval_ephemeral=10000 \
    --eval_interval=10000 \
    --fsdp.sharding_strategy=HYBRID_SHARD \
    --fsdp.hybrid_sharding_num_model_replicas="${BEAKER_REPLICA_COUNT}" \
    --fsdp.wrapping_strategy=by_block_and_size \
    --save_folder=$SAVE_FOLDER \
    --remote_save_folder="gs://ai2-llm/checkpoints/OLMo-medium/$NAME/" \
    --try_load_latest_save \
    --save_overwrite \
    --sharded_checkpointer=olmo_core \
    --device_train_microbatch_size=4 \
    --device_eval_batch_size=8 \
    --compile.fullgraph=false \
    --fused_loss=false \
    --model.flash_attention=false \
    --data.num_workers=32 \
    --optimizer.metrics_log_interval=10 \
    --data.prefetch_factor=8
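
This script takes the leader hostname and the replica's rank as positional arguments, and reads BEAKER_REPLICA_COUNT and GANTRY_TASK_NAME from the environment; on Beaker, Gantry provides all of these. A minimal single-node smoke-test sketch, with every value assumed rather than taken from the PR:

# Emulate the Gantry-provided environment for a one-node dry run (values assumed):
export GANTRY_TASK_NAME="peteish1-smoke"
export BEAKER_REPLICA_COUNT=1
bash scripts/augusta/peteish1.sh "$(hostname)" 0  # leader is this host, rank 0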
@@ -0,0 +1,41 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=$1
shift

NAME=$1
shift

SEED=$1
shift

gantry run \
  --workspace ai2/13B \
  --task-name $NAME \
  --description "Peteish1 annealing : $NAME with seed $SEED" \
  --priority urgent \
  --preemptible \
  --beaker-image dirkg/OLMo \
  --cluster ai2/augusta-google-1 \
  --gpus 8 \
  --replicas "${NUM_NODES}" \
  --leader-selection \
  --host-networking \
  --budget ai2/oe-training \
  --no-nfs \
  --propagate-failure \
  --propagate-preemption \
  --synchronized-start-timeout 15m \
  --no-python \
  --env LOG_FILTER_TYPE=local_rank0_only \
  --env OMP_NUM_THREADS=8 \
  --env OLMO_TASK=model \
  --env-secret WANDB_API_KEY=DIRKG_WANDB_API_KEY \
  --env-secret AWS_ACCESS_KEY_ID=DIRKG_AWS_ACCESS_KEY_ID \
  --env-secret AWS_SECRET_ACCESS_KEY=DIRKG_AWS_SECRET_ACCESS_KEY \
  --shared-memory 10GiB \
  --yes \
  --timeout=-1 \
  -- /bin/bash -c "scripts/augusta/peteish1-seed-anneal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME \$BEAKER_REPLICA_RANK $SEED"
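
A hypothetical invocation of the annealing launcher; the script path, node count, run name, and seed below are all assumptions. Note that $NAME is passed unquoted to --task-name, so run names containing spaces would break the gantry call:

# 8 nodes, run name "peteish1-anneal-s42", seed 42 (all values assumed):
bash scripts/augusta/peteish1-seed-anneal-launch.sh 8 peteish1-anneal-s42 42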
What do you mean that the result is terrible?
So this prompted me to look into this a bit more, and I think I've found a better solution: just mark the model input sizes as dynamic. I tested this out in OLMo-core and it appears to work well.
allenai/OLMo-core#105
I think it compiles a bunch of versions for different batch sizes, because that's how we call it during eval, and then they stick around. In all of my early runs I had high tps until the first eval, and then low tps afterwards. This is what fixed it.
I tried `dynamic` and it was bad. I don't remember the way in which it was bad, but it didn't work. That's why I added that version in the first place.
Ok, oh well. I tested with nightly so maybe it's just better now with recent compiler advances.