huge temp files while uploading data using MDS writer #734

Open
MaxxP0 opened this issue Jul 24, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments


MaxxP0 commented Jul 24, 2024

Environment

  • OS: Windows 11

To reproduce

Steps to reproduce the behavior:

  1. Upload and convert a local WebDataset with MDSWriter using the code below: the WebDataset is 800 GB, yet the temp files stored in AppData/Local/Temp grow to 1.8 TB, eventually crashing the upload.

Code

file = r"file:d:/Datasets/shards50m/{00000..04999}.tar"

dataset = wds.WebDataset(file).decode("pil").to_tuple("jpg", "txt")

data_dir = "s3://50m/mds/"

columns = {
    'image': 'pil',
    'caption': 'str'
}


with MDSWriter(out=data_dir, columns=columns,progress_bar=True) as out:
    try:
        for sample in tqdm(dataset):
            try:
                if len(sample) != 2:
                    print("Skipping sample, missing 'txt' or 'jpg'.")
                    continue

                img, caption = sample
                
                sample = {
                    'image': img,
                    'caption': caption,
                }
                out.write(sample)
            except Exception as e:
                print(f"Error processing sample: {e}")

    except Exception as e:
        print(f"Error processing sample: {e}")
MaxxP0 added the bug label Jul 24, 2024
XiaohanZhangCMU (Contributor) commented

Thank you @MaxxP0 for reporting the issue! By "800gb", do you mean the .tar dataset or the uncompressed raw WebDataset?

It may be an OS thing where temp files are handled differently. But first, can we check whether keep_local (defined here) is actually False? When keep_local=False, the local file is removed after upload_file completes (the logic is here).

To debug this, can you add some printouts so we can confirm that keep_local=False and that the local removal is actually happening?
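
For the printouts, something like the following might help. This is a minimal sketch, not part of the streaming library; the helper name and the sampling interval are my own, and keep_local=False is passed explicitly so there is no doubt about the default:

import os
import tempfile

def temp_dir_size(path=None):
    """Total bytes of files under `path` (defaults to the OS temp dir,
    i.e. AppData/Local/Temp on Windows)."""
    path = path or tempfile.gettempdir()
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # files can vanish mid-walk as uploads complete
    return total

# In the write loop, print every N samples:
#   with MDSWriter(out=data_dir, columns=columns, keep_local=False) as out:
#       for i, sample in enumerate(tqdm(dataset)):
#           if i % 10_000 == 0:
#               print(f"temp dir: {temp_dir_size() / 1e9:.2f} GB")
#           ...

If the temp directory keeps growing even with keep_local=False, that would point at the removal logic (or the OS temp handling) rather than the writer's configuration.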

MaxxP0 (Author) commented Jul 24, 2024

Thank you for the quick response. Yes, the WebDataset has 4,999 tar files with a total size of 800 GB. I do have keep_local=False, and I was able to reduce the problem by saving the raw bytes directly instead of decoding them to PIL Images and storing those with the MDSWriter. The temp files keep growing until they reach about 300 GB, after which they slowly shrink as shards are uploaded. So thankfully this no longer fills my disk completely and the code doesn't crash. But it would probably be useful to have something in place that caps temp buildup to a certain extent, if that isn't already happening. Thanks a lot for your help.
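
For reference, a minimal sketch of the bytes-based workaround described above. It mirrors the original snippet; the exact column encoding ('bytes' for the image) is an assumption about the change made:

import webdataset as wds
from streaming import MDSWriter
from tqdm import tqdm

file = r"file:d:/Datasets/shards50m/{00000..04999}.tar"
# No .decode("pil"): keep each sample as raw JPEG/text bytes.
dataset = wds.WebDataset(file).to_tuple("jpg", "txt")

columns = {
    'image': 'bytes',   # store raw JPEG bytes; decode to PIL at read time
    'caption': 'str',
}

with MDSWriter(out="s3://50m/mds/", columns=columns,
               keep_local=False, progress_bar=True) as out:
    for img_bytes, caption in tqdm(dataset):
        # Without .decode(), the "txt" field also arrives as bytes.
        if isinstance(caption, bytes):
            caption = caption.decode('utf-8')
        out.write({'image': img_bytes, 'caption': caption})

Skipping the PIL round-trip keeps the per-sample serialized size close to the on-disk JPEG size, which plausibly explains the smaller temp footprint.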
