New improved modelling for LLM Deepspeed. #230
base: main
Conversation
I hit one issue while testing this. If the checkpoint files did not exist, I would see writes after doing the checkpoint and a comm.barrier(). If the checkpoint files DID exist and I was overwriting them, I didn't see this behavior. I was able to "fix" this by adding an fsync in the pytorch_checkpointing.py file, but I'm not sure whether that's the best way to fix it or whether it's a system issue; it does ensure that the checkpoint write is a blocking operation.

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            f.flush()                # flush Python's userspace buffer first
            os.fsync(f.fileno())     # then force the data to stable storage
|
Which file system are you on? I tested this on Lustre and it was working fine. Maybe the file system synchronization is different on your file system.
|
I am hesitant to do a sync as it will significantly slow down the system. Can you describe the filesystem on which you're writing the checkpoints? We probably need a flag in dlio_benchmark to enable fsync for some filesystems. |
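A minimal sketch of what such an opt-in flag could look like; the flag name checkpoint_fsync and the self.args plumbing are hypothetical, not an existing dlio_benchmark option:

    @dlp.log
    def save_state(self, suffix, state):
        name = self.get_name(suffix)
        with open(name, "wb") as f:
            torch.save(state, f)
            if self.args.checkpoint_fsync:   # hypothetical flag, default False
                f.flush()                    # flush the userspace buffer
                os.fsync(f.fileno())         # then sync to stable storage

Keeping the fast path (no fsync) as the default would avoid slowing down filesystems like Lustre where the issue does not appear.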
I'm using XFS with a single local NVMe drive. I'm OK tracking this change in my local branch for now until I can better confirm if it's a real issue or an artifact of some system configuration issue. |
The logic is now as follows.
Assume we have 40 layers with a tensor parallelism of 4 and a pipeline parallelism of 8.
Then the checkpoint would have 44 layers (40 + 4 tensor/pipeline layers) spread across every 32 ranks.
Since tensor parallelism is 4, each pipeline rank spans four ranks: ranks 0-3 are pipeline rank 0, ranks 4-7 are pipeline rank 1, and so on.
I then expect the layer distribution among the pipeline ranks to be the following, listed as (pipeline_rank, start_layer_index, end_layer_index) with both start and end inclusive (see the sketch after the list):
(0, 0, 5)
(1, 6, 11)
(2, 12, 17)
(3, 18, 23)
(4, 24, 28)
(5, 29, 33)
(6, 34, 38)
(7, 39, 43)
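A minimal sketch of that distribution, assuming the usual "as even as possible, remainder to the lowest pipeline ranks" split; this only illustrates the numbers above, it is not the PR's actual implementation, and the function name is made up:

    def layer_range(pipeline_rank, total_layers=44, pipeline_parallelism=8):
        # Split total_layers as evenly as possible; the first `rem` pipeline
        # ranks get one extra layer each.
        base, rem = divmod(total_layers, pipeline_parallelism)
        count = base + (1 if pipeline_rank < rem else 0)
        start = pipeline_rank * base + min(pipeline_rank, rem)
        end = start + count - 1          # inclusive
        return (pipeline_rank, start, end)

    # Reproduces the table above:
    # [(0, 0, 5), (1, 6, 11), (2, 12, 17), (3, 18, 23),
    #  (4, 24, 28), (5, 29, 33), (6, 34, 38), (7, 39, 43)]
    print([layer_range(r) for r in range(8)])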
Also, a tensor parallelism of 4 means each layer's tensors are divided by four across the tensor-parallel ranks.
So if a layer had (1 MB, 1 GB) tensors, they would be stored as (256 KB, 256 MB) tensors on each rank.
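A quick sketch of that sharding arithmetic (the tensor names are made up for illustration):

    TENSOR_PARALLELISM = 4
    # 1 MB and 1 GB tensors for one layer (sizes in bytes)
    layer_tensor_bytes = {"tensor_a": 1 * 2**20, "tensor_b": 1 * 2**30}
    # Each rank holds 1/TENSOR_PARALLELISM of every tensor in the layer
    per_rank_bytes = {name: size // TENSOR_PARALLELISM
                      for name, size in layer_tensor_bytes.items()}
    # per_rank_bytes -> 262144 bytes (256 KB) and 268435456 bytes (256 MB)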